Program Subjects:
As our program subjects, we selected four failures from Defects4J (a large and popular collection of reproducible real-world failures in Java programs) and two failures from LibRench (a set of real-world client-library project pairs, each with at least one library-upgrade failure).
The table below shows the six selected subjects, together with a short description of each failure. To ensure that participants could fully comprehend the code and answer numerous questions within a reasonable time frame of about 30-60 minutes, we shortened and simplified the code snippets, preserving the essence of the changes and the failure while removing implementation details.
Dataset | Sub. ID | Project (Failure ID) | Short Description |
---|---|---|---|
Defects4J | S1 | JFreeChart (8) | Modifying timeZone variable reference |
Defects4J | S2 | Commons Lang (18) | Modifying year format |
Defects4J | S3 | Commons Math (37) | Deleting conditional calculations |
Defects4J | S4 | Joda-Time (8) | Modifying arithmetic expression |
LibRench | S5 | JacksonDatabind/OpenAPI | Removing reference shortcut |
LibRench | S6 | Alibaba-Druid/Dble | Modifying SQL parsing |
Study Questionnaires:
The questionnaires were created using Qualtrics, a popular platform for creating and distributing surveys and for collecting and analyzing responses. Below, we provide two versions of the study questionnaire, “trace first” and “views first,” for each of the six subjects. The structure of the questionnaires is identical across subjects, except for the code views and the corresponding textual explanations:
- Subject 1:
- Subject 2:
- Subject 3:
- Subject 4:
- Subject 5:
- Subject 6:
Statistical Tests:
To further support our claim that most developers consider contextual statements necessary, we conducted statistical analyses on the statement rankings provided by 55 participants across six debugging subjects, each evaluated using three code views: DualSlice, InPreSS, and Context (i.e., a DualSlice or InPreSS view augmented with contextual information). Specifically, we applied the following statistical tests:
- Friedman Test: To assess whether there are statistically significant differences in participants’ rankings across the three tools.
- Wilcoxon Signed-Rank Test: As a post-hoc analysis following the Friedman test, to determine which tool pairs differ significantly from each other.
- Kruskal-Wallis Test: To examine whether participant preferences for each tool differ significantly across the six debugging subjects. This test serves as a non-parametric extension of the Mann-Whitney U test for more than two groups.
These tests help quantify the significance of tool differences and the consistency of preferences across study subjects.
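For illustration, the sketch below shows how the Friedman and post-hoc Wilcoxon signed-rank tests could be run with SciPy on per-participant rankings. It is a minimal sketch, not our analysis script: the arrays are random placeholders (each row a permutation of {1, 2, 3}), not the study data, and all variable names are our own.

```python
# Hedged sketch: Friedman omnibus test plus post-hoc Wilcoxon
# signed-rank tests, as could be applied to per-participant rankings.
# The data below are random placeholders, not the study responses.
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

rng = np.random.default_rng(0)
n_participants = 55

# Each participant ranks the three code views from 1 (best) to 3 (worst),
# so each row is one participant's permutation of {1, 2, 3}.
ranks = np.array([rng.permutation([1, 2, 3]) for _ in range(n_participants)])
dualslice, inpress, context = ranks[:, 0], ranks[:, 1], ranks[:, 2]

# Omnibus test: do the three related samples of rankings differ?
stat, p = friedmanchisquare(dualslice, inpress, context)
print(f"Friedman: chi^2_r = {stat:.4f}, p = {p:.5f}")

# Post-hoc pairwise comparisons (paired, non-parametric).
for name, a, b in [("Dual Slicing vs. InPreSS", dualslice, inpress),
                   ("Dual Slicing vs. Context", dualslice, context),
                   ("InPreSS vs. Context", inpress, context)]:
    w, p = wilcoxon(a, b)
    print(f"Wilcoxon {name}: W = {w}, p = {p:.5f}")
```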
Test for Differences Between Tools:
Test | Comparison | Result Summary |
---|---|---|
Friedman | Dual Slicing vs. InPreSS vs. Context | χ²r = 58.0364 (N = 55), p < .00001 → Significant at p < .05 |
Wilcoxon | Dual Slicing vs. InPreSS | z = -0.6074, p = .54186 → Not significant. W = 697.5, Mean Diff = -0.64 |
Wilcoxon | Dual Slicing vs. Context | z = -5.7184, p < .00001 → Significant. W = 87.5, Mean Diff = 1.36 |
Wilcoxon | InPreSS vs. Context | z = -5.932, p < .00001 → Significant. W = 62, Mean Diff = 1.47 |
Comparing Rankings Across Subjects:
Test | Comparison | Result Summary |
---|---|---|
Kruskal-Wallis | Dual Slicing across subjects | p = .7918 → Not significant at p < .05 |
Kruskal-Wallis | InPreSS across subjects | p = .94343 → Not significant at p < .05 |
Kruskal-Wallis | Context across subjects | p = .69585 → Not significant at p < .05 |
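Analogously, the sketch below shows how the Kruskal-Wallis test could be applied to one view's rankings grouped by debugging subject. Again, this is only a sketch under assumed data: the six groups of roughly nine rankings each are hypothetical placeholders, not the study responses.

```python
# Hedged sketch: Kruskal-Wallis H test checking whether rankings of a
# single code view differ across the six subjects (independent groups).
# The groups below are random placeholders, not the study responses.
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(1)

# One group of rankings (1 = best, 3 = worst) per debugging subject;
# 55 participants spread over six subjects gives roughly nine per group.
context_by_subject = [rng.integers(1, 4, size=9) for _ in range(6)]

h, p = kruskal(*context_by_subject)
print(f"Kruskal-Wallis (Context across subjects): H = {h:.4f}, p = {p:.5f}")
```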