Program Subjects:
As our program subjects, we selected four failures from Defects4J (a large and popular collection of reproducible real-world failures in Java programs) and two failures from LibRench (a set of real-world client-library project pairs, each with at least one library-upgrade failure).
The table below shows the six selected subjects, together with a short description of each failure. To ensure that participants could fully comprehend the code and answer numerous questions within a reasonable time frame of about 30-60 minutes, we shortened and simplified the code snippets, preserving the essence of the changes and the failure while removing implementation details.
Dataset | Sub. ID | Project (Failure ID) | Short Description |
---|---|---|---|
Defects4J | S1 | JFreeChart (8) | Modifying timeZone variable reference |
Defects4J | S2 | Commons Lang (18) | Modifying year format |
Defects4J | S3 | Commons Math (37) | Deleting conditional calculations |
Defects4J | S4 | Joda-Time (8) | Modifying arithmetic expression |
LibRench | S5 | JacksonDatabind/OpenAPI | Removing reference shortcut |
LibRench | S6 | Alibaba-Druid/Dble | Modifying SQL parsing |
Study Questionnaires:
The questionnaires were created using Qualtrics, a popular platform for creating and distributing surveys and for collecting and analyzing responses. Below, we provide two versions of the study questionnaire, “trace first” and “views first,” for each of the six subjects. The structure of the questionnaires is identical across subjects, except for the code views and the corresponding textual explanations:
- Subject 1:
- Subject 2:
- Subject 3:
- Subject 4:
- Subject 5:
- Subject 6:
Statistical Tests:
To further support our claim that most developers consider contextual statements necessary, we conducted statistical analyses on the statement rankings provided by 55 participants across six debugging subjects, each evaluated using three code views: DualSlice, InPreSS, and Context (i.e., a DualSlice or InPreSS view augmented with contextual information). Specifically, we applied the following statistical tests:
- Friedman Test: To assess whether there are statistically significant differences in participants’ rankings across the three tools.
- Wilcoxon Signed-Rank Test: As a post-hoc analysis following the Friedman test, to determine which tool pairs differ significantly from each other.
- Kruskal-Wallis Test: To examine whether participant preferences for each tool differ significantly across the six debugging subjects. This test serves as a non-parametric extension of the Mann-Whitney U test for more than two groups.
These tests help quantify the significance of tool differences and the consistency of preferences across study subjects.
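For illustration, the sketch below shows how the Friedman and post-hoc Wilcoxon signed-rank tests could be run with SciPy on per-participant rankings. It is a minimal sketch, not our analysis script: the arrays are random placeholders (each row a permutation of {1, 2, 3}), not the study data, and all variable names are our own.

```python
# Hedged sketch: Friedman omnibus test plus post-hoc Wilcoxon
# signed-rank tests, as could be applied to per-participant rankings.
# The data below are random placeholders, not the study responses.
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

rng = np.random.default_rng(0)
n_participants = 55

# Each participant ranks the three code views from 1 (best) to 3 (worst),
# so each row is one participant's permutation of {1, 2, 3}.
ranks = np.array([rng.permutation([1, 2, 3]) for _ in range(n_participants)])
dualslice, inpress, context = ranks[:, 0], ranks[:, 1], ranks[:, 2]

# Omnibus test: do the three related samples of rankings differ?
stat, p = friedmanchisquare(dualslice, inpress, context)
print(f"Friedman: chi^2_r = {stat:.4f}, p = {p:.5f}")

# Post-hoc pairwise comparisons (paired, non-parametric).
for name, a, b in [("Dual Slicing vs. InPreSS", dualslice, inpress),
                   ("Dual Slicing vs. Context", dualslice, context),
                   ("InPreSS vs. Context", inpress, context)]:
    w, p = wilcoxon(a, b)
    print(f"Wilcoxon {name}: W = {w}, p = {p:.5f}")
```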
Test for Differences Between Tools:
Test | Comparison | Result Summary |
---|---|---|
Friedman | Dual Slicing vs. InPreSS vs. Context | χ²r = 58.0364 (N = 55), p < .00001 → Significant at p < .05 |
Wilcoxon | Dual Slicing vs. InPreSS | z = -0.6074, p = .54186 → Not significant. W = 697.5, Mean Diff = -0.64 |
Wilcoxon | Dual Slicing vs. Context | z = -5.7184, p < .00001 → Significant. W = 87.5, Mean Diff = 1.36 |
Wilcoxon | InPreSS vs. Context | z = -5.932, p < .00001 → Significant. W = 62, Mean Diff = 1.47 |
Comparing Rankings Across Subjects:
Test | Comparison | Result Summary |
---|---|---|
Kruskal-Wallis | Dual Slicing across subjects | p = .7918 → Not significant at p < .05 |
Kruskal-Wallis | InPreSS across subjects | p = .94343 → Not significant at p < .05 |
Kruskal-Wallis | Context across subjects | p = .69585 → Not significant at p < .05 |
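Analogously, the sketch below shows how the Kruskal-Wallis test could be applied to one view's rankings grouped by debugging subject. Again, this is only a sketch under assumed data: the six groups of roughly nine rankings each are hypothetical placeholders, not the study responses.

```python
# Hedged sketch: Kruskal-Wallis H test checking whether rankings of a
# single code view differ across the six subjects (independent groups).
# The groups below are random placeholders, not the study responses.
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(1)

# One group of rankings (1 = best, 3 = worst) per debugging subject;
# 55 participants spread over six subjects gives roughly nine per group.
context_by_subject = [rng.integers(1, 4, size=9) for _ in range(6)]

h, p = kruskal(*context_by_subject)
print(f"Kruskal-Wallis (Context across subjects): H = {h:.4f}, p = {p:.5f}")
```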