
Using Qwen3.5-27B without any fine-tuning, we achieved 74.8% (374/500) on SWE-bench Verified — a benchmark that measures how well a model can fix real OSS issues from GitHub — by generating 8 candidate patches and selecting the best one. This is the highest score*1 among local LLMs with fewer than 229B parameters.
Overview
This work is by Kosaku Kimura, Satoshi Munakata*2, Satoshi Nakashima, Yu Ishikawa, Kosuke Maeda, Nao Soma, Kenichi Kobayashi, Keisuke Miyazaki, Keizo Kato, Shigeki Fukuta, Tatsuo Kumano, Nobutaka Imamura, Mehdi Bahrami, Kevin Takeshi Musgrave, Wei-Peng Chen, Shahbaz Abdul Khader, Kwun Ho Ngan, Joseph Townsend, Fayas Asharindavida, Matthieu Parizy, Akira Sakai, Yuma Ichikawa, Yang Zhao, Michiaki Takizawa, Taku Fukui, Hiroki Ohtsuji, and Hiro Kobashi — all at Fujitsu Research.
We set out to improve SWE-bench Verified scores by modifying mini-swe-agent*3. SWE-bench Verified is a well-established benchmark that tests how well a model can resolve real GitHub issues. In this post we describe the configuration that produced 374/500 = 74.8% using Qwen3.5-27B as a base model (no fine-tuning) with TTS@8*4. This post also serves as a technical supplement to the run artifacts, logs, and trajectories we are submitting to SWE-bench/experiments.
To our knowledge, this is the best result*5 among local LLMs with fewer than 229B parameters. The main point we want to make is that in benchmarks of this kind, harness engineering — the agent architecture, evaluation pipeline, and tooling — matters just as much as model training. Below we walk through the results and then cover the design decisions in mini-swe-agent and the evaluation pipeline that actually moved the needle.
Plotting publicly reported model sizes against resolution rates, our configuration sits on the Pareto frontier in the upper-left — a relatively small 27B model producing 74.8%, with no smaller model in our survey exceeding that score.
Table 1: Competitive comparison (public figures, selected entries)

| Configuration | LLM | Resolution (%) | Total params (B) | Notes |
|---|---|---|---|---|
| Kozuchi mini-swe-agent | Qwen3.5-27B | 74.8 | 27 | Official sb-cli / OSS LLM |
| Unknown | Qwen3.5-27B | 72.4 | 27 | OSS LLM |
| Unknown | Code World Model | 65.8 | 32.6 | OSS LLM |
| mini-SWE-agent | MiniMax M2.5 | 75.8 | 229 | Upper bound for comparison |
| Unknown | Qwen3.5-397B-A17B | 76.4 | 397 | OSS LLM |
| OpenHands | Qwen3-Coder-480B-A35B-Instruct | 69.6 | 480 | OSS LLM |
| Lingxi v1.5 | Kimi K2 Instruct | 71.2 | 1024 | OSS LLM |
| Unknown | Kimi-K2.5-1T-A32B | 76.8 | 1100 | OSS LLM |
The figure and table use our officially measured 74.8% from the cloud environment. Measurement methods across submissions are not uniform, so treat these comparisons as a rough sense of where different models and configurations sit, not a strict apples-to-apples comparison.
SWE-bench Verified is a 500-problem subset of SWE-bench, curated by human reviewers to filter for clear problem statements and valid evaluation tests. Each problem gives the agent an issue description and a codebase; the agent produces a patch. Official evaluation uses FAIL_TO_PASS tests to check whether the issue is resolved and PASS_TO_PASS tests to check for regressions. A problem counts as resolved only when both pass.
Note: The `FAIL_TO_PASS` / `PASS_TO_PASS` terminology is used two ways in this post. In the official SWE-bench evaluation, these refer to tests bundled with the dataset. Elsewhere in this post — unless noted otherwise — they refer to verification tests the agent generates for itself during each run.
1. Results
Using 8 candidate runs with Qwen3.5-27B followed by a selector, we achieved the following on SWE-bench Verified TTS@8:
- TTS@8: `374/500 = 74.8%` (official SWE-bench cloud / `sb-cli`)
- Positioning: state of the art among local LLMs under 229B parameters
- Selection rule: `simple weighted pass-rate` (`F2P=0.3`, `P2P=0.7`; ties broken by `shortest_patch_raw`)
All public figures in this post use sb-cli measurements to keep comparisons consistent. Other public submissions may use different evaluation stacks, so take cross-system comparisons with appropriate caution. That said, 74.8% is the highest we have seen for a local LLM under 229B parameters.
Rather than generating a single patch and submitting it, our setup separates generation from selection:
- The agent produces multiple candidate patches.
- A selector picks the best one for final submission.
Decoupling these two stages lets us maintain diversity across candidates while improving submission accuracy. The mini-swe-agent is the front-end candidate generator, not the thing that directly produces the final submission — the published TTS@8 number reflects the full system, including the selector and official evaluation.
1.1 The TTS@8 number we report
The main result we report externally is TTS@8: run 8 candidate-generation runs, select 1.
- Candidate generation: 8 runs using the `orchestra` configuration with `Qwen3.5-27B`
- Candidate comparison: cross-agent test application using each run's `FAIL_TO_PASS` / `PASS_TO_PASS` tests
- Final selection rule: `simple weighted pass-rate` (`F2P=0.3`, `P2P=0.7`; ties broken by `shortest_patch_raw`)
- Published figure: `374/500 = 74.8%` (official SWE-bench cloud / `sb-cli`)
More complex selection rules did not reliably outperform simple ones. We settled on a weighted aggregate of pass rates with a raw patch length tiebreaker. All external comparisons in this post use the sb-cli figure.
1.2 A concrete example: a Django fix spanning form validation and HTML rendering
Seeing what actually happened on individual problems makes the overall approach easier to understand. In one resolved trajectory, Django's MultiValueField failed to handle the case where the parent field has required=False but one or more child fields have required=True. The bug had two symptoms: empty input was accepted when it shouldn't have been, and the required HTML attribute was missing from the rendered form.
At first glance it looks like a simple validation bug, but it actually straddles server-side validation and client-side attribute generation. Bugs that span multiple responsibilities like this rarely yield to a single-file patch; the key is isolating the root cause and understanding the full blast radius.
Problem summary:
```
MultiValueField ignores a required value of a sub field
Form is valid: True
Expected is_valid=False but got True.
Number of 'required' attributes in HTML: 0
Expected 1 required attribute but got 0.
```
A one-location fix isn't enough. Closing the validation hole is necessary but not sufficient — the form rendering side also needs to correctly convey which sub-fields are required. The final patch touches three places: validation logic, attribute assignment, and widget rendering.
Final patch (excerpt):
```diff
diff --git a/django/forms/fields.py b/django/forms/fields.py
@@
-        return self.compress([])
+        if not self.require_all_fields and isinstance(self.widget, MultiWidget):
+            for field in self.fields:
+                if field.required:
+                    raise ValidationError(self.error_messages['incomplete'], code='incomplete')
+        return self.compress([])
diff --git a/django/forms/boundfield.py b/django/forms/boundfield.py
@@
-        if widget.use_required_attribute(self.initial) and self.field.required and self.form.use_required_attribute:
-            attrs['required'] = True
+        if widget.use_required_attribute(self.initial) and self.form.use_required_attribute:
+            if hasattr(self.field, 'require_all_fields') and not self.field.require_all_fields:
+                ...
+            elif self.field.required:
+                attrs['required'] = True
diff --git a/django/forms/widgets.py b/django/forms/widgets.py
@@
+        if hasattr(self, '_field_requirements') and i < len(self._field_requirements):
+            widget_attrs['required'] = bool(self._field_requirements[i])
```
Post-patch test results (from the trajectory log — these are agent-generated verification tests, not the official SWE-bench tests):
```
FAIL_TO_PASS: 9/9 passed
PASS_TO_PASS: 145/145 passed
```
What matters here is that the final patch re-establishes consistency across validation, attribute assignment, and widget rendering — it isn't a local workaround. The agent-generated FAIL_TO_PASS and PASS_TO_PASS tests both pass, confirming that the bug is closed without breaking existing behavior. Detailed tool-level exploration and the fix workflow are covered in Sections 3 and 4.4.
1.3 TTS@8 result details
The main public figure is sb-cli's 374/500 = 74.8%, but the adopted trajectories in the submission bundle let us analyze where turns were actually spent. Below we visualize total turn counts per instance as a stacked breakdown by phase.
The population for the following figures is 495 instances, not all 500 — the remaining 5 had no usable cross-agent test table or candidate and are excluded from this analysis. Adopted trajectories were drawn from all 8 runs; the per-run adoption counts ranged from 43 to 69, so no single run dominated the final submission.
Figure 1a: Box plot of total turn counts for TTS@8 adopted trajectories

Figure 1b: Total turn count distribution for TTS@8 adopted trajectories (stacked by phase)

The resolved / unresolved split in these figures uses the bundled report.json from each adopted run's source, since the submission bundle itself does not include per-instance sb-cli verdicts. Two error cases are counted as unresolved. These figures are diagnostic — the official number remains sb-cli's 374/500 = 74.8%.
From Figure 1a: across all 495 adopted trajectories, the median total turn count is 266, p90 is 449, and the maximum is 925. Resolved instances have a median of 259 and p90 of 428; unresolved ones have a heavier tail at median 280 and p90 of 522. The IQR is 231–330 overall, 228–324 for resolved, 243–336 for unresolved — the unresolved box sits visibly higher.
Figure 1b shows where that difference comes from.
Table 2: Turn count distribution by phase (median [Q1, Q3] / p90, in agent output turns)
| Phase | All | Resolved | Unresolved |
|---|---|---|---|
| ISSUE_REPRODUCT | 35 [29, 46] / 57 | 35 [29, 44] / 54 | 39 [31, 49] / 63 |
| TEST_SYNTHSIZE | 33 [29, 40] / 47 | 33 [28, 40] / 47 | 34 [30, 40] / 49 |
| CODE_LOCALIZE | 26 [23, 30] / 38 | 26 [23, 29] / 36 | 28 [23, 34] / 44 |
| TEST_LOCALIZE | 49 [42, 59] / 69 | 49 [42, 59] / 69 | 51 [43, 60] / 68 |
| CODE_FIX | 41 [30, 75] / 138 | 40 [30, 69] / 134 | 46 [34, 80] / 168 |
| VERIFY_PATCH | 25 [23, 52] / 114 | 25 [23, 54] / 107 | 25 [23, 37] / 125 |
| ISSUE_CLOSE | 23 [21, 24] / 26 | 22 [21, 24] / 26 | 23 [21, 24] / 26 |
| FINAL_REPORT | 10 [10, 12] / 13 | 10 [10, 11] / 13 | 10 [10, 12] / 13 |
Every instance in all 495 trajectories received at least one agent response in every phase, so rather than looking at reach rates, focus on how long the agent stayed in each phase. The biggest gap is in CODE_FIX: unresolved instances have a median of 46 turns and p90 of 168, noticeably heavier than the resolved 40 / 134. VERIFY_PATCH also stretches at p90 (107 → 125), suggesting that for unresolved cases, it's the late-stage repair-and-re-verify loop — not the initial investigation — where turns pile up.
2. Agent Design
2.1 Key design decisions
The main design choices in this work were:
- Phase decomposition with explicit transitions
- Filesystem-based state sharing across phases and workflows
- Context compression via handover
- The Orchestra runtime (`conductor` + `tool-specialist`)
- Specialized tools (`line_trace` / `caller_trace` / `coedit_localize` / `line_edit`)
- Phase-gated skill injection
- Cross-agent testing for candidate selection
- Operational stability (sharding + retry)
The performance gains came from all of these working together, not any single change. The pieces are: a structured exploration design, a shared filesystem that holds state outside the conversation history, a handover mechanism that compresses context within a phase, tools that quickly pinpoint failures, a prompting approach that surfaces only what's needed right now, cross-agent testing for comparing candidates, and operational infrastructure to reliably run 500-problem evaluations. Each is described below.
In TTS@8, the post-generation selection stage that picks one patch from 8 runs was also important; that is covered in Section 4.1.
2.2 Design 1: Phase decomposition + workflow decomposition + explicit transition control
We split the task into phases from ISSUE_REPRODUCT through FINAL_REPORT, each with a fixed responsibility. Within each phase, we further divided work into workflows W0..Wn, and required the agent to declare its current workflow at the start of every turn with WORKFLOW: Wn. Phase termination is declared via WORKFLOW: COMPLETE or WORKFLOW: GIVEUP, after which the runtime runs its own verification before the transition takes effect.
The motivation is straightforward: in a single long conversation, agents lose track of what they're supposed to be doing. Mixing reproduction, root-cause analysis, fixing, verification, and reporting in one undifferentiated stream leads to skipped steps and repeated exploration. The two-level structure — phases for major responsibilities, workflows for fine-grained steps — fixes this. Phases clarify what a stage is supposed to produce; workflows let the runtime track which steps have and haven't been taken.
Figure 2a: Phase transition diagram (vertical)
```mermaid
flowchart TD
    ISSUE_REPRODUCT["ISSUE_REPRODUCT<br/>Reproduce the bug"] -->|complete| TEST_SYNTHSIZE["TEST_SYNTHSIZE<br/>Write FAIL_TO_PASS tests"]
    ISSUE_REPRODUCT -->|giveup| CODE_LOCALIZE["CODE_LOCALIZE<br/>Localize the root cause"]
    TEST_SYNTHSIZE -->|complete| CODE_LOCALIZE
    TEST_SYNTHSIZE -->|giveup| CODE_LOCALIZE
    CODE_LOCALIZE -->|complete| TEST_LOCALIZE["TEST_LOCALIZE<br/>Select PASS_TO_PASS tests"]
    CODE_LOCALIZE -->|giveup| ISSUE_REPRODUCT
    TEST_LOCALIZE -->|complete| CODE_FIX["CODE_FIX<br/>Fix the code"]
    TEST_LOCALIZE -->|giveup| CODE_FIX
    CODE_FIX -->|complete| VERIFY_PATCH["VERIFY_PATCH<br/>Verify the patch"]
    CODE_FIX -->|giveup| CODE_LOCALIZE
    VERIFY_PATCH -->|complete| ISSUE_CLOSE["ISSUE_CLOSE<br/>Final diff review and pre-submission check"]
    VERIFY_PATCH -->|giveup| CODE_FIX
    ISSUE_CLOSE -->|complete| FINAL_REPORT["FINAL_REPORT<br/>Write audit report"]
    ISSUE_CLOSE -->|giveup| CODE_FIX
    FINAL_REPORT -->|giveup| FINAL_REPORT
```
Phase responsibilities:
| Phase | One-line description | Main output |
|---|---|---|
| ISSUE_REPRODUCT | Reproduce the bug and put the symptom into words | Reproduction scripts and observed results |
| TEST_SYNTHSIZE | Design and stabilize FAIL_TO_PASS tests | Agent-generated verification tests that fail before the fix and pass after |
| CODE_LOCALIZE | Pinpoint the root cause and candidate fix locations | Root-cause hypothesis and annotated fix candidates |
| TEST_LOCALIZE | Select PASS_TO_PASS tests for regression detection | Agent-generated regression tests covering behavior that must be preserved |
| CODE_FIX | Implement the code change that addresses the root cause | Diff addressing the root cause |
| VERIFY_PATCH | Check the patch and test results for correctness | Verification log confirming consistency between the fix and test outcomes |
| ISSUE_CLOSE | Clean up the diff and do a pre-submission review | Pre-submission checklist and confirmation of no unwanted changes |
| FINAL_REPORT | Write up root cause, fix, and verification for audit | Consolidated audit report |
Splitting into phases is necessary but not sufficient — the workflows within each phase matter too. The key is that workflows are not mere progress labels: in the configuration files, each workflow step specifies what to read, what to write, and where to go next. The runtime validates each turn's WORKFLOW: Wn declaration and rejects forward jumps (e.g., jumping from W3 straight to W5). Backward movement — say, W5 → W2 to retry after a failure — is explicitly allowed, so the agent can revise and retry without skipping steps.
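To make the gating concrete, here is a minimal sketch of the kind of check the runtime applies to each turn's declaration. The function and regex are illustrative, not the actual runtime API; the real configuration also encodes per-step allowed transitions.

```python
import re

WORKFLOW_RE = re.compile(r"^WORKFLOW:\s*(?:W(\d+)|(COMPLETE|GIVEUP))", re.MULTILINE)

def validate_workflow_declaration(response: str, current_step: int) -> int:
    """Accept same-step, next-step, or backward moves; reject forward jumps."""
    m = WORKFLOW_RE.search(response)
    if m is None:
        raise ValueError("turn rejected: no WORKFLOW declaration")
    if m.group(2):  # COMPLETE / GIVEUP are handled by the transition logic
        return current_step
    declared = int(m.group(1))
    if declared > current_step + 1:
        # e.g. jumping from W3 straight to W5 skips a step
        raise ValueError(f"turn rejected: forward jump W{current_step} -> W{declared}")
    return declared  # includes legitimate retries such as W5 -> W2
```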
How the specialized tools fit into the workflow sequence is described at the start of Section 3.1. For now, here is an excerpt from CODE_FIX that illustrates how a workflow is a real executable procedure, not just a label:
```
W2. Review the handed-over PASS_TO_PASS tests and extract invariants
W3. Modify the code based on the handed-over candidate root causes and fix locations
W4. Check whether the FAIL_TO_PASS tests pass, and if they fail, return to W2
W5. Check whether the PASS_TO_PASS tests pass, and if they fail, return to W2
W6. Based on the code changes, test untested boundary values and edge cases, and if any fail, return to W2
```
The fix loop itself — "establish invariants at W2, fix at W3, and come back to W2 if W4–W6 fail" — is encoded in the configuration. W3 → W4 → W2 is a legitimate retry; W3 → W7 is rejected.
VERIFY_PATCH takes this a step further. The workflow definition itself includes a branch that triggers WORKFLOW: GIVEUP if FAIL_TO_PASS or PASS_TO_PASS fail. On top of that, the runtime independently verifies that verify_*.log files exist and that the commands that produced them exited with code 0. The workflow definition is the first-order constraint; the runtime's post-hoc check is the second.
Phase exits are not taken on the agent's word alone. The only declarations the agent can make are WORKFLOW: COMPLETE or WORKFLOW: GIVEUP, and the runtime maps these to on_complete / on_giveup transitions, making it impossible to skip phases arbitrarily. After WORKFLOW: COMPLETE, the runtime checks required_assets; missing deliverables send the agent back to the same phase. In VERIFY_PATCH, the runtime additionally checks for the presence of /_share/verify_fail_to_pass.log and /_share/verify_pass_to_pass.log and confirms that the commands that wrote them exited cleanly. The gate is "do the required files and verification results exist?" — not "did the agent say it checked them?"
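As an illustration, the VERIFY_PATCH hard gate can be approximated as follows. This is a sketch under the assumption that the gate re-runs the recorded test scripts to confirm clean exits; the paths follow the post, but the function name is made up.

```python
import subprocess
from pathlib import Path

REQUIRED_LOGS = [Path("/_share/verify_fail_to_pass.log"),
                 Path("/_share/verify_pass_to_pass.log")]
VERIFY_COMMANDS = ["bash /_share/test_FAIL_TO_PASS_all.sh",
                   "bash /_share/test_PASS_TO_PASS_all.sh"]

def hard_gate_passed() -> bool:
    """Gate on artifacts and exit codes, not on the agent's claims."""
    if not all(p.is_file() and p.stat().st_size > 0 for p in REQUIRED_LOGS):
        return False  # missing deliverables send the agent back to the phase
    for cmd in VERIFY_COMMANDS:
        if subprocess.run(cmd, shell=True).returncode != 0:
            return False  # a non-zero exit fails the gate
    return True
```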
Figure 2b shows the post-hoc verification and handover flow after a phase exit.
Figure 2b: Post-hoc verification and handover after phase exit
```mermaid
flowchart TD
    classDef phase fill:#eef6ff,stroke:#2563eb,stroke-width:1.2px,color:#111827;
    classDef check fill:#f8fafc,stroke:#64748b,stroke-width:1px,color:#111827;
    classDef decision fill:#fff7ed,stroke:#ea580c,stroke-width:1.2px,color:#111827;
    A["Agent declares<br/>WORKFLOW: COMPLETE / GIVEUP"]:::phase
    B["Runtime interprets phase exit declaration"]:::check
    C{"COMPLETE?"}:::decision
    D["Check required_assets"]:::check
    E["Check hard gate conditions<br/>shared asset freeze / verify logs / exit codes"]:::check
    F["Same-phase handover<br/>Redo the current phase"]:::phase
    G["Forced GIVEUP handover<br/>Return to on_giveup destination"]:::phase
    H["Normal handover<br/>Proceed to on_complete / on_giveup destination"]:::phase
    I["Execute handover<br/>LLM writes memo → filesystem<br/>Freeze shared test assets if needed<br/>Rebuild prompt for next phase"]:::check
    A --> B --> C
    C -->|yes| D
    C -->|no| E
    D -->|missing assets| F
    D -->|OK| E
    E -->|hard gate not met| F
    E -->|failure count exceeds threshold| G
    E -->|OK| H
    F --> I
    G --> I
    H --> I
```
When the agent declares a phase exit, the transition isn't confirmed yet. The runtime runs its checks; only when required_assets and hard gate conditions are satisfied does execution move forward. If they aren't, the agent is handed back to the same phase to try again. If failures exceed the threshold, a forced GIVEUP sends it to the preceding phase. The handover itself is also more than a phase name swap — the LLM writes a handover memo to the filesystem, and the prompt and skills for the next phase are rebuilt from scratch.
Retries are bounded. Each phase has a turn_handover_threshold ranging from 32 to 192 turns; exceeding it forces a same-phase handover to compress context and break local loops, even if the token budget hasn't been exceeded yet. VERIFY_PATCH also has hard_gate_giveup_threshold=3: three consecutive hard gate failures force a WORKFLOW: GIVEUP back to CODE_FIX. A global setting also triggers a GIVEUP if the same command is executed five times in a row. The design consistently favors "fail explicitly and back up" over "skip and proceed."
Separating CODE_LOCALIZE and TEST_LOCALIZE was one of the more important decisions in this setup. Finding the right place to change and identifying what existing behavior must not break are related but distinct problems. Making them explicit phases meant that regression safety wasn't an afterthought — it was a first-class deliverable. This isn't a benchmark-specific concern either. Real codebases carry enormous amounts of implicit current behavior; a small fix can break something far away. Knowing where to cut is important, but knowing what you can't afford to break is equally so. TEST_LOCALIZE is how the agent surfaces a piece of that implicit specification from existing tests, so that a narrow bug fix can land safely.
2.3 Design 2: Filesystem-based state sharing across phases and workflows
The runtime creates a shared working area in the container at /_share/. The key idea is putting state outside the conversation history. On long tasks, if state lives only in the conversation, it tends to fall out of context at turn limits or during handovers. Instead, we store progress notes, hypotheses, reproduction scripts, tests, and trace logs as files, and have subsequent workflows and phases read from those files. The information lives in shared files, not inside the model's reasoning.
| Shared asset | Scope | Purpose |
|---|---|---|
| `/_share/{PHASE}.md` | Across workflows within a phase | Working notes accumulating strategy, observations, candidates, and progress |
| `/_share/Kanban.md` | Across all phases | Summary of findings worth surfacing to every subsequent phase |
| `/_share/handover_{from}_to_{to}.md` | Across phases / same-phase re-handover | Compressed memo so the next agent can resume without confusion |
| `/_share/repro_*.py`, `test_FAIL_TO_PASS_*.py`, `test_FAIL_TO_PASS_all.sh`, `test_PASS_TO_PASS_all.sh` | Across workflows / phases | Verification scripts that subsequent agents can run as-is |
| `/_share/line_trace_*.log`, `caller_trace_*.log`, `coedit_*.log`, `verify_*.log` | Across workflows / phases | Investigation and verification logs that subsequent agents can read as-is |
Figure 3: Phases and shared assets
```mermaid
flowchart TD
    classDef phase fill:#eef6ff,stroke:#2563eb,stroke-width:1.2px,color:#111827;
    classDef asset fill:#f8fafc,stroke:#64748b,stroke-width:1px,color:#111827;
    classDef selector fill:#fff7ed,stroke:#ea580c,stroke-width:1.2px,color:#111827;
    P1["Exploration phases<br/>ISSUE_REPRODUCT / TEST_SYNTHSIZE / CODE_LOCALIZE / TEST_LOCALIZE"]:::phase
    P2["Fix phase<br/>CODE_FIX"]:::phase
    P3["Evaluation and submission phases<br/>VERIFY_PATCH / ISSUE_CLOSE / FINAL_REPORT"]:::phase
    M["/_share/*.md<br/>Phase notes / Kanban / handover memos"]:::asset
    T["/_share/test_FAIL_TO_PASS*.py<br/>/_share/test_FAIL_TO_PASS_all.sh<br/>/_share/test_PASS_TO_PASS_all.sh"]:::asset
    L["/_share/trace / coedit / verification logs"]:::asset
    X["Cross-agent testing<br/>Test assets from each run applied to other runs' patches"]:::selector
    P1 --> P2 --> P3
    P1 -->|writes| M
    P1 -->|writes| T
    P1 -->|writes| L
    P2 -->|reads / updates| M
    P2 -->|runs read-only| T
    P3 -->|reads / appends| M
    P3 -->|reuses| T
    P3 -->|writes| L
    T -->|evidence for comparison| X
```
In Figure 3, the individual phases are grouped into three clusters: exploration, fix, and evaluation/submission.
Taking ISSUE_REPRODUCT as an example: W1 writes the strategy to the phase memo; W6–W7 produce reproduction scripts; W9 appends key findings to the Kanban. The next phase starts by reading the handover memo and the full set of shared assets on disk — it picks up from files, not from the previous agent's internal state. The same logic applies to the trace and coedit logs written by CODE_LOCALIZE and TEST_LOCALIZE.
The FAIL_TO_PASS and PASS_TO_PASS scripts written by TEST_SYNTHSIZE and TEST_LOCALIZE serve double duty: they're the handover medium for the fix phase, and they're the executable comparison evidence the cross-agent selector aggregates later. The shared filesystem is both an inter-phase notepad and the storage layer for the TTS selector.
The shared area also helps break local loops within a phase. When a phase exceeds its turn_handover_threshold, the runtime forces a same-phase handover: it writes a compressed memo and instructs the agent to resume from the existing filesystem assets. Context is trimmed; the in-progress scripts, tests, and analysis notes are preserved. When exactly to compress, and how much tool output to keep in history, is governed by the token budget logic described in the next section.
One more important point: the shared area isn't a free-form scratch space — it's a shared contract. Before CODE_FIX begins, /_share/test_PASS_TO_PASS* is frozen read-only, so the agent fixing the code can't weaken the regression tests. Git operations are scoped to /testbed/, so notes and logs in /_share/ accumulate freely without polluting the final patch. The collection pipeline also harvests shared assets per-instance from an explicit allowlist, which made it easy to audit handover memos and traces after the fact.
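For illustration, the freeze can be as simple as dropping write permission on the matching assets. A minimal sketch; the runtime's actual mechanism may differ:

```python
import glob
import os
import stat

def freeze_pass_to_pass_assets(share_dir: str = "/_share") -> None:
    """Make the PASS_TO_PASS assets read-only before CODE_FIX begins."""
    for path in glob.glob(os.path.join(share_dir, "test_PASS_TO_PASS*")):
        mode = os.stat(path).st_mode
        # drop all write bits (owner, group, other)
        os.chmod(path, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))
```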
2.4 Design 3: Context compression via handover
One reason for keeping intermediate outputs on the filesystem is that it lets work resume without carrying the full conversation history forward. The runtime uses handover not just as a baton pass between phases, but as a context compression mechanism. When crossing a phase boundary — or even mid-phase — the runtime triggers handover(current_phase), which has the LLM write a summary memo to disk before rebuilding the next turn's message list from scratch.
The important detail is that the decision to compress is not based on a rough sense that "this is getting long." The runtime tokenizes the current messages using the Hugging Face tokenizer and chat template for the active MODEL_ID, then computes a safe prompt budget from max_prompt_tokens=150000, max_new_tokens=16384, MAX_MODEL_LEN, and context_margin. If the tokenizer can't be initialized, processing stops — the system refuses to proceed without reliable context management. When the model returns usage.prompt_tokens / usage.completion_tokens, those actuals are also logged for diagnostics.
The decision logic looks roughly like this:
```python
prompt_budget = min(max_prompt_tokens - context_margin,
                    MAX_MODEL_LEN - max_new_tokens - context_margin)

prompt_est = estimate_tokens(messages)
if prompt_est > prompt_budget:
    handover(current_phase)
    prompt_est = estimate_tokens(messages_after_handover)
    if prompt_est > prompt_budget:
        raise LimitsExceeded()
```
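Here, `estimate_tokens` can be realized with the Hugging Face tokenizer and chat template for the active model. A sketch under that assumption, with an illustrative model ID:

```python
from transformers import AutoTokenizer

MODEL_ID = "Qwen/Qwen3.5-27B"  # illustrative; substitute the deployment's actual MODEL_ID

# Fail fast: the runtime refuses to proceed without reliable context management.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

def estimate_tokens(messages: list[dict]) -> int:
    """Count prompt tokens by applying the model's own chat template."""
    token_ids = tokenizer.apply_chat_template(messages, tokenize=True,
                                              add_generation_prompt=True)
    return len(token_ids)
```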
In a same-phase handover, instead of carrying the long conversation forward, the LLM writes a compressed memo to disk and the next turn is rebuilt from that memo plus the existing shared assets. Reproduction scripts, generated tests, trace logs, and the Kanban are all preserved; only the conversation history is shortened. The turn_handover_threshold works alongside this: phase-specific turn limits of 32 to 192 trigger a forced same-phase handover before the token budget runs out, cutting local loops early.
Figure 4: Context compression decision flow
```mermaid
flowchart TD
    classDef step fill:#eef6ff,stroke:#2563eb,stroke-width:1.2px,color:#111827;
    classDef decision fill:#fff7ed,stroke:#ea580c,stroke-width:1.2px,color:#111827;
    classDef warn fill:#f8fafc,stroke:#64748b,stroke-width:1px,color:#111827;
    A["Turn begins"]:::step --> B["Estimate prompt length"]:::step
    B --> C{"Within budget?"}:::decision
    C -->|no| D["Compress via same-phase handover"]:::step
    D --> E{"Still over budget after compression?"}:::decision
    E -->|yes| F["Raise LimitsExceeded"]:::warn
    C -->|yes| G["Get LLM response and execute<br/>Generate observation"]:::step
    E -->|no| G
    G --> H{"Within budget including observation?"}:::decision
    H -->|yes| I["Append to history"]:::step
    H -->|no| J["Progressively shorten output"]:::step
    J --> K{"Fits now?"}:::decision
    K -->|yes| I
    K -->|no| L["Replace with minimal stub"]:::warn
    L --> I
    I --> M{"Turn threshold exceeded?"}:::decision
    M -->|yes| D
    M -->|no| N["Proceed to next turn"]:::step
```
Tool outputs are also token-budget-managed. After a command runs, the observation string is built from the action_observation_template. Before appending it to history, the runtime re-estimates the token count for messages + observation. If it would exceed the budget, max_output_length is halved and the observation is re-rendered, up to 6 attempts; if it still doesn't fit, the result is replaced with a minimal stub containing only the returncode and an empty output. This prevents a single large trace or test dump from overwhelming the context window.
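A minimal sketch of that truncation loop, reusing the `estimate_tokens` sketch above. The observation format here is a stand-in for the actual `action_observation_template`:

```python
def fit_observation(messages: list[dict], result: dict, prompt_budget: int,
                    max_output_length: int = 8192) -> dict:
    """Render a tool observation so that history + observation stays in budget."""
    for _ in range(6):  # up to 6 halvings, as in the runtime
        text = (f"returncode: {result['returncode']}\n"
                f"output:\n{result['output'][:max_output_length]}")
        obs = {"role": "user", "content": text}
        if estimate_tokens(messages + [obs]) <= prompt_budget:
            return obs
        max_output_length //= 2  # progressively shorten the tool output
    # still over budget: minimal stub with only the return code and empty output
    return {"role": "user", "content": f"returncode: {result['returncode']}\noutput:"}
```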
The result is that compression isn't "drop the oldest messages" — it's "convert to a summary that the next phase or the same agent can resume from, while keeping the shared assets intact." Phase design, filesystem, handover, and output truncation are not separate mechanisms — they're one integrated context management system.
2.5 Design 4: The Orchestra runtime (conductor + tool-specialist)
In the 8-run configuration, each turn is processed by two agents working in sequence: a conductor and a tool-specialist, as shown in Figure 5. The conductor handles hypothesis exploration and deciding what to do next; the tool-specialist's sole job is to produce a clean, syntactically valid command.
This split can work with different models or different vLLM servers for each role. We tried several combinations, but the best configuration ended up using the same Qwen3.5-27B on the same vLLM server for both, differing only in sampling temperature.
Figure 5: Orchestra runtime — basic interaction
```mermaid
flowchart TD
    classDef agent fill:#e0f2fe,stroke:#0284c7,stroke-width:1.4px,color:#0f172a;
    classDef data fill:#fff7ed,stroke:#ea580c,stroke-width:1.4px,color:#0f172a;
    classDef runtime fill:#f8fafc,stroke:#64748b,stroke-width:1.2px,color:#0f172a;
    A["Input<br/>Current phase + conversation state"]:::data --> B("conductor agent<br/>Hypothesis exploration and next-action decision"):::agent
    B --> C["Full draft<br/>WORKFLOW / THOUGHT / draft command"]:::data
    C --> D{{"Runtime junction<br/>Extract action block"}}:::runtime
    D --> E("tool-specialist agent<br/>Command cleanup only"):::agent
    E --> F["Cleaned action block"]:::data
    F --> G["Final response<br/>WORKFLOW/THOUGHT from conductor<br/>Command from tool-specialist"]:::data
```
Role breakdown for the best-performing configuration:
| Role | Endpoint | Sampling | Responsibility |
|---|---|---|---|
| conductor | Shared Qwen3.5-27B on shared vLLM server | `temperature=0.6`, `top_p=0.95`, `top_k=20` | Generate full response including WORKFLOW / THOUGHT |
| tool-specialist | Shared Qwen3.5-27B on shared vLLM server | `temperature=0.0`, `top_p=0.95`, `top_k=20` | Clean up the action block; stabilize wrappers, quotes, and pipes |
The temperature difference matters because the two failure modes are opposites. Lower temperatures on the conductor side narrow exploration, making it more likely to loop on the same action pattern. Higher temperatures on the tool-specialist side destabilize tool call syntax and shell control flow. Giving the exploration role higher variance and locking the formatting role to greedy decoding cleanly separates these concerns.
The mechanics are simple. The conductor produces a full response including WORKFLOW, THOUGHT, and a draft command block. The runtime extracts just the command block, adds the current phase and detected tool name, and hands it to the tool-specialist. The tool-specialist doesn't redo the reasoning — it just reformats that command block. The final assistant response keeps the conductor's WORKFLOW and THOUGHT, with only the FINAL_ANSWER command replaced by the tool-specialist's version.
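A minimal sketch of one such turn, assuming an OpenAI-compatible vLLM endpoint. The prompts, names, and extraction regex are illustrative simplifications, not the actual runtime:

```python
import re
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "Qwen3.5-27B"  # both roles share one served model

def run_turn(conductor_messages: list[dict]) -> str:
    # 1) conductor: exploratory sampling yields WORKFLOW / THOUGHT + a draft command
    draft = client.chat.completions.create(
        model=MODEL, messages=conductor_messages,
        temperature=0.6, top_p=0.95, extra_body={"top_k": 20},
    ).choices[0].message.content
    # 2) runtime junction: extract just the action block
    m = re.search(r"<tool: bash>(.*?)</tool>", draft, re.DOTALL)
    if m is None:
        return draft  # no command to clean up
    # 3) tool-specialist: greedy decoding, command cleanup only (no re-reasoning)
    cleaned = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content":
                   "Rewrite this bash command so it is syntactically valid and "
                   "does what it intends. Output only the command.\n" + m.group(1)}],
        temperature=0.0, top_p=0.95, extra_body={"top_k": 20},
    ).choices[0].message.content
    # 4) final response: conductor's WORKFLOW / THOUGHT, tool-specialist's command
    return draft.replace(m.group(1), "\n" + cleaned.strip() + "\n")
```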
Here's a concrete example of where this mattered. In r01's matplotlib__matplotlib-20826, during VERIFY_PATCH, the conductor produced this draft:
```
WORKFLOW: W7
THOUGHT: W7 is to update the team's shared Kanban. I've updated the Kanban with my verification findings. Now I need to verify that all required phase assets exist and both test commands exit with code 0 before completing the phase.
FINAL_ANSWER:
<tool: bash>
ls -la /_share/VERIFY_PATCH.md /_share/verify_fail_to_pass.log /_share/verify_pass_to_pass.log && echo "All required assets exist" && bash /_share/test_FAIL_TO_PASS_all.sh > /dev/null 2>&1; echo "FAIL_TO_PASS exit code: $?" && bash /_share/test_PASS_TO_PASS_all.sh > /dev/null 2>&1; echo "PASS_TO_PASS exit code: $?"
</tool>
```
The problem here: using ; instead of && means the PASS_TO_PASS check runs even if FAIL_TO_PASS fails. The runtime extracted the action block and handed it to the tool-specialist, which produced:
```
<tool: bash>
ls -la /_share/VERIFY_PATCH.md /_share/verify_fail_to_pass.log /_share/verify_pass_to_pass.log && echo "All required assets exist" && bash /_share/test_FAIL_TO_PASS_all.sh > /dev/null 2>&1 && echo "FAIL_TO_PASS exit code: 0" && bash /_share/test_PASS_TO_PASS_all.sh > /dev/null 2>&1 && echo "PASS_TO_PASS exit code: 0"
</tool>
```
Two changes:
```diff
- ... bash /_share/test_FAIL_TO_PASS_all.sh > /dev/null 2>&1; echo "FAIL_TO_PASS exit code: $?" && bash /_share/test_PASS_TO_PASS_all.sh > /dev/null 2>&1; echo "PASS_TO_PASS exit code: $?"
+ ... bash /_share/test_FAIL_TO_PASS_all.sh > /dev/null 2>&1 && echo "FAIL_TO_PASS exit code: 0" && bash /_share/test_PASS_TO_PASS_all.sh > /dev/null 2>&1 && echo "PASS_TO_PASS exit code: 0"
```
The ; became &&, so the chain stops if FAIL_TO_PASS fails. And "$?" was replaced with the literal "0" — the echo only fires on success, making the intent explicit. The conductor decided what to check; the tool-specialist expressed that intent in shell control flow that actually does what was intended.
There's also a class of corrections involving proper tool usage. In r01's astropy__astropy-13453 during CODE_FIX, the conductor's draft tried to use line_edit to insert self.data.cols = cols but wrote a "replace one line with a three-line block" call:
```
<tool: bash>
python /_tools/line_edit.py --file /testbed/astropy/io/ascii/html.py --line 356 --expected "self.data._set_fill_values(cols)" --text "self.data._set_fill_values(cols)\n self.data.cols = cols\n self.data._set_col_formats()"
</tool>
```
The tool-specialist corrected this to a proper "append one line after the matched line" call:
```
<tool: bash>
python /_tools/line_edit.py --file /testbed/astropy/io/ascii/html.py --line 356 --expected "self.data._set_fill_values(cols)" --text "self.data.cols = cols" --append
</tool>
```
```diff
- python /_tools/line_edit.py ... --text "self.data._set_fill_values(cols)\n self.data.cols = cols\n self.data._set_col_formats()"
+ python /_tools/line_edit.py ... --text "self.data.cols = cols" --append
```
The code intent didn't change — only the way the tool was called. Since line_edit operates relative to a matched line using --expected, appending a single new line after a confirmed match is safer than replacing a multi-line block. The conductor decided what to add; the tool-specialist figured out the right way to express that edit with this particular tool.
The Orchestra runtime isn't about the structure of a config file — it's about splitting "what to do next" from "how to express that as a valid command." Using the same Qwen3.5-27B at different temperatures for the two roles turned out to be the most effective way to get both trajectory diversity and stable tool invocations.
3. Tools and Skills
3.1 Design 5: Specialized tools
We equipped the agent with four specialized tools: line_trace, caller_trace, coedit_localize, and line_edit. Together they make it practical to answer "what broke?", "how far does it reach?", "what else should I look at?", and "how do I fix it?" in tight loops.
A general-purpose agent can search with grep and run pytest repeatedly, but it isn't naturally good at precisely tracing execution paths, establishing blast radius, or systematically expanding to adjacent fix candidates. Rather than relying on broad exploration alone, we built these specialized tools into mini-swe-agent for investigation and editing support. What matters isn't just having the tools — it's that each tool has a defined place in the workflow where it belongs.
The role of specialized tools within the workflow is clearest in the TEST_LOCALIZE definition. The following is an illustrative excerpt (the actual workflow is longer):
```yaml
- name: TEST_LOCALIZE
  definition: |-
    **Workflow (You MUST follow it!!):**
    * W2. List without omissions the test cases that currently pass before any code changes
      - Input: Test assets under `/testbed/`
      - Output: `/_share/TEST_LOCALIZE.md`
    * W3. Investigate with `line_trace` which passing tests may be affected by the code change
      - Input: `/_share/CODE_LOCALIZE.md`, `/_share/TEST_LOCALIZE.md`
      - Output: `/_share/line_trace_of_test_PASS_TO_PASS.log`
    * W4. Investigate with `caller_trace` which passing tests may be affected by the code change
      - Input: `/_share/CODE_LOCALIZE.md`, `/_share/TEST_LOCALIZE.md`
      - Output: `/_share/caller_trace_of_test_PASS_TO_PASS.log`
    * W5. Run `coedit_localize` on the strongest fix candidates and preserve the raw output
      - Input: `/_share/CODE_LOCALIZE.md`, `/_share/TEST_LOCALIZE.md`
      - Output: `/_share/coedit_TEST_LOCALIZE.log`
    * W6. Identify via code review the tests that should be affected by the code change
      - Input: logs from line_trace / caller_trace / coedit_localize
      - Output: `/_share/TEST_LOCALIZE.md`
    * W7. Develop a script that can run all selected PASS_TO_PASS tests at once
      - Input: `/_share/TEST_LOCALIZE.md`
      - Output: `/_share/test_PASS_TO_PASS_all.sh`
```
TEST_LOCALIZE isn't just "find some regression tests." It enumerates currently passing tests, applies three specialized tools from different angles, and synthesizes them into a single PASS_TO_PASS execution script. The tools aren't loosely attached — they're woven into the workflow with defined responsibilities.
3.1.1 line_trace: Visualizing executed lines and variable changes
Example usage (from the Django trajectory):
```
line_trace django/forms/fields.py <failing test>
```
Trajectory output:
```
[trace] /testbed/django/forms/fields.py:1024 | if not value or isinstance(value, (list, tuple)):
[trace] /testbed/django/forms/fields.py:1025 | if not value or not [v for v in value if v not in self.empty_values]:
[trace] /testbed/django/forms/fields.py:1026 | if self.required:
[trace] /testbed/django/forms/fields.py:1029 | return self.compress([])
```
This trace confirmed, line by line, that empty input was falling straight through to compress([]) without checking whether any sub-fields were required. That kind of subtle branch behavior is easy to miss when reading code statically. Seeing the actual execution path was what made the root cause clear.
In bug fixes, tracing how a value came to be matters more than knowing where it finally breaks. line_trace makes that visible at line granularity, which significantly reduced the amount of speculative file exploration.
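The tool itself is internal, but the underlying idea is standard dynamic tracing. A toy illustration with `sys.settrace`, assuming a known target file; this is not the actual `line_trace` implementation:

```python
import linecache
import sys

def make_tracer(target_path: str):
    """Print every executed line of target_path, in the style of the log above."""
    def tracer(frame, event, arg):
        if event == "line" and frame.f_code.co_filename == target_path:
            src = linecache.getline(target_path, frame.f_lineno).strip()
            print(f"[trace] {target_path}:{frame.f_lineno} | {src}")
        return tracer  # keep tracing nested calls
    return tracer

def line_trace(target_path: str, fn, *args, **kwargs):
    sys.settrace(make_tracer(target_path))
    try:
        return fn(*args, **kwargs)
    finally:
        sys.settrace(None)
```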
3.1.2 caller_trace: Mapping call chains and establishing blast radius
Example usage (from the Django trajectory):
```
caller_trace django/forms/boundfield.py:BoundField.build_widget_attrs <failing test>
```
Trajectory output:
```
=== caller_trace output ===
Target function: /testbed/django/forms/boundfield.py:BoundField.build_widget_attrs
Target hits: 1
Unique transitive callers (<full_path>:<qualname>): 5
- /testbed/django/forms/boundfield.py:BoundField.as_widget [direct]
- /testbed/django/forms/forms.py:BaseForm._html_output
- /testbed/django/forms/forms.py:BaseForm.as_table
```
This made it clear that the bug wasn't just a missing check in clean() — the rendering path was involved too. Without the required attribute fix in HTML output, the server-side fix would have been incomplete.
Understanding what's affected is as important as knowing what to fix. A common failure mode on SWE-bench Verified is passing the target test while breaking something else; dynamically tracing the call chain is a direct countermeasure.
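Again, the real tool is internal, but the core idea can be illustrated with a profile hook that walks frame back-pointers whenever the target function is entered. This sketch needs Python 3.11+ for `co_qualname` and is not the actual `caller_trace` implementation:

```python
import sys

def trace_callers(target_qualname: str, fn, *args, **kwargs) -> set[str]:
    """Collect the callers active whenever target_qualname is entered."""
    callers: set[str] = set()

    def profiler(frame, event, arg):
        # 'call' fires on every Python function entry
        if event == "call" and frame.f_code.co_qualname == target_qualname:
            f = frame.f_back
            while f is not None:  # walk the stack: direct and transitive callers
                callers.add(f"{f.f_code.co_filename}:{f.f_code.co_qualname}")
                f = f.f_back

    sys.setprofile(profiler)
    try:
        fn(*args, **kwargs)
    finally:
        sys.setprofile(None)
    return callers
```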
3.1.3 coedit_localize: Expanding candidate locations using co-edit history
Example usage (from the scikit-learn trajectory):
```
python /_tools/coedit_localize.py /testbed/sklearn/cluster/optics_.py 2>&1 | tee /_share/coedit_CODE_LOCALIZE.log
```
This was from scikit-learn__scikit-learn-14496. In that case there was no co-edit data for the seed file, so the tool returned warning: no pair data for seed path: sklearn/cluster/optics_.py. That matters: a null result is evidence against expanding to neighboring files. It let the agent stay focused on the strong leads from line_trace and caller_trace rather than widening the search unnecessarily.
Example of the tool in action (from verified tool output):
```
pkg/helpers.py
  coedit_insight: The helper participates in the same execution path as core.
  coedit_decision_reason: Both files implement the same feature area.
pkg/shared.py
  coedit_insight: Shared helpers are often updated together with core logic.
  coedit_decision_reason: Shared abstractions couple the affected behavior.
tests/test_core.py
  coedit_insight: Tests for the core module tend to move with core behavior changes.
  coedit_decision_reason: The test validates the public contract of pkg/core.py.
docs/guide.md
  coedit_insight: Documentation sometimes follows feature changes.
  coedit_decision_reason: Docs are not executable code paths.
```
coedit_localize uses co-edit data specific to each SWE-bench instance to return, in priority order, the files most likely to need changes alongside a given seed file. In CODE_LOCALIZE it expands the set of fix candidates; in TEST_LOCALIZE it expands the set of PASS_TO_PASS candidates.
The co-edit data is constructed from commits older than each instance's base_commit — only history available at that point in time, avoiding any future-fix leakage. File pairs that appeared together in commits more often than chance would predict are kept; the rest are filtered. For each surviving pair, Devstral-Small-2-24B-Instruct-2512 was used to pre-generate explanations of why those files tend to move together. The coedit_insight and coedit_decision_reason fields in the output come from those pre-generated explanations — so you get not just a ranked list of filenames but a readable reason for each, like "both files implement the same feature area" or "the test validates the public contract of this module."
On instances where co-edit data exists, the tool returns a prioritized view of implementation files, shared helpers, likely-affected tests, and lower-priority docs — all with reasons. The workflow: find a strong lead with line_trace or caller_trace, then use coedit_localize to expand systematically to adjacent candidates. This is especially valuable on SWE-bench Verified, where fixing the right file but breaking a neighboring invariant is a common failure mode.
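For intuition, the pair-counting step of such a pipeline could look like the following. This sketch counts files changed together in commits reachable from `base_commit`; the significance filter and the pre-generated explanations are deliberately omitted:

```python
import subprocess
from collections import Counter
from itertools import combinations

def mine_coedit_pairs(repo: str, base_commit: str, max_commits: int = 5000) -> Counter:
    """Count file pairs changed together in history up to base_commit."""
    log = subprocess.run(
        ["git", "-C", repo, "log", "--name-only", "--pretty=format:__COMMIT__",
         f"-{max_commits}", base_commit],
        capture_output=True, text=True, check=True,
    ).stdout
    pairs: Counter = Counter()
    for commit in log.split("__COMMIT__"):
        files = sorted({line for line in commit.splitlines() if line.strip()})
        for pair in combinations(files[:50], 2):  # cap pathological mega-commits
            pairs[pair] += 1
    return pairs
```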
3.1.4 line_edit: Making single-line edits reliable
Example usage (from the Django trajectory):
```
python /_tools/line_edit.py --file /testbed/django/forms/fields.py --line 1029 --expected " return self.compress([])" --text " ... raise ValidationError(...) ...\n return self.compress([])"
```
Trajectory output:
```
Updated /testbed/django/forms/fields.py:1029
Old: ' return self.compress([])'
New: ' ... raise ValidationError(...) ...\n return self.compress([])'
```
From this starting point, the same trajectory extended fixes to boundfield.py and widgets.py, with the final agent-generated test results at FAIL_TO_PASS 9/9 and PASS_TO_PASS 145/145. Incremental, safe edits let the agent accumulate changes across multiple files without losing consistency.
The reason for a dedicated editing tool isn't the size of the change — it's the cost of getting it wrong. An edit that lands on the wrong line means redoing root-cause analysis from scratch. Unglamorous as it is, this tool turned out to matter quite a bit over long experimental runs.
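The safety property described here, editing only when the target line still matches the expected text, is easy to illustrate. A minimal sketch in the spirit of `line_edit`; the flag names mirror the post, but the real tool's behavior may differ in details:

```python
from pathlib import Path

def line_edit(file: str, line: int, expected: str, text: str,
              append: bool = False) -> None:
    """Edit one line, but only if it still matches the expected content."""
    path = Path(file)
    lines = path.read_text().splitlines(keepends=True)
    if expected.strip() not in lines[line - 1]:
        # wrong line: refuse to edit blindly rather than corrupt the file
        raise ValueError(f"{file}:{line} does not match expected text")
    new = text if text.endswith("\n") else text + "\n"
    if append:
        lines.insert(line, new)   # add one line after the matched line
    else:
        lines[line - 1] = new     # replace the matched line in place
    path.write_text("".join(lines))
```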
3.2 Design 6: Phase-gated skill injection (phase × tool)
Each skill has phases and tools conditions. At runtime, the agent only receives the skills relevant to its current phase and available toolset:
```python
if current_phase in skill.phases and set(required_tools) <= set(agent.tools):
    inject(skill.content)
```
The usual problem with adding useful instructions is that the prompt gets longer and the important parts get buried. Rather than loading everything every time, we inject only what's relevant for the current phase and tool context. This is more than a length reduction — giving the agent a smaller set of options for what to do next is itself a performance improvement. On long tasks especially, what's surfaced right now matters as much as what's known in total.
| Skill | Phase condition | Tool condition |
|---|---|---|
| FAIL_TO_PASS triage (partial pass) | TEST_SYNTHSIZE, CODE_LOCALIZE, TEST_LOCALIZE, CODE_FIX, VERIFY_PATCH, ISSUE_CLOSE | line_trace, caller_trace |
| List lines of code actually executed | CODE_LOCALIZE, TEST_LOCALIZE | line_trace |
| List functions that progressively call the given code (dynamic) | CODE_LOCALIZE, TEST_LOCALIZE | caller_trace |
| Expand fix candidates with co-edit history | CODE_LOCALIZE, TEST_LOCALIZE | coedit_localize |
| Edit a specific line reliably (no git apply) | CODE_FIX | line_edit |
4. Evaluation and Operations
4.1 Design 7: Cross-agent testing for candidate selection
Separating candidate generation from final selection absorbs the variance of individual runs. Having distinct stages for producing diverse candidates, building comparison evidence via cross-agent testing, and reducing submission risk was a meaningful contributor to the TTS@8 score.
More candidates alone doesn't solve the problem — a larger pool just mixes more good and bad options together. To handle this, each run brings its own FAIL_TO_PASS / PASS_TO_PASS tests to the table, and those tests are applied to the patches from all other runs, producing a per-instance comparison matrix. These tests are the ones each run's agent generated for itself — not the SWE-bench dataset's official tests. Rather than a simple vote, this means each candidate can be evaluated against executable evidence from other runs' perspectives, which turns candidate diversity directly into selection accuracy.
Internally we call this matrix the xcheck, but in this post we describe it as cross-agent testing via instance_test_tables. In the final TTS@8, the selection rule applied to this matrix uses a weighted pass-rate of F2P=0.3, P2P=0.7, with ties broken by shortest_patch_raw. The weights were estimated on SWE-bench's dev split; across our 8 runs, the same score held across F2P=0.20 to 0.33. Tie-breaking by raw patch length follows the principle that, when candidates are otherwise equivalent in their explanatory power, the shorter description is preferred. The publicly reported figure is the sb-cli measurement of 374/500 = 74.8%.
In practice, weighted aggregation of cross-agent test results consistently outperformed LLM-based patch scoring and ranking, which was a practically important observation.
As a concrete example: for sympy__sympy-21612, 5 of the 8 runs produced valid patches. Sorted by size_desc, the candidates were r08, r04, r02, r07, and r05. Applying 5 FAIL_TO_PASS and 5 PASS_TO_PASS tests from each run to all candidates and computing (0.3 * F2P_pass + 0.7 * P2P_pass) / 5:
| Candidate | FAIL_TO_PASS | PASS_TO_PASS | Weighted score | Patch length | Decision |
|---|---|---|---|---|---|
| r08 | 3/5 | 5/5 | 0.88 | 736 | Rejected |
| r04 | 3/5 | 5/5 | 0.88 | 683 | Rejected |
| r02 | 3/5 | 5/5 | 0.88 | 664 | Rejected |
| r07 | 4/5 | 5/5 | 0.94 | 577 | Tied for first |
| r05 | 4/5 | 5/5 | 0.94 | 553 | Selected |
The cross-agent tests alone split the field into a 0.88 group and a 0.94 group, eliminating three candidates. r07 and r05 are tied on the comparison evidence, so shortest_patch_raw selects r05 at 553 characters. Without a tiebreaker, r07 would have been selected instead — a clean illustration of "use cross-agent tests to narrow the field, then apply a simple secondary criterion to get to one."
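For reference, the full selection rule fits in a few lines. A sketch assuming a simple per-candidate record with illustrative field names:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    run_id: str
    f2p_passed: int  # cross-agent FAIL_TO_PASS tests passed
    p2p_passed: int  # cross-agent PASS_TO_PASS tests passed
    n_tests: int     # tests applied per suite (5 in the sympy example)
    patch_raw: str   # raw patch text, used only for the tiebreaker

def select(candidates: list[Candidate],
           w_f2p: float = 0.3, w_p2p: float = 0.7) -> Candidate:
    def score(c: Candidate) -> float:
        # simple weighted pass-rate, e.g. (0.3*4 + 0.7*5) / 5 = 0.94 for r05
        return (w_f2p * c.f2p_passed + w_p2p * c.p2p_passed) / c.n_tests
    best = max(score(c) for c in candidates)
    tied = [c for c in candidates if abs(score(c) - best) < 1e-9]
    return min(tied, key=lambda c: len(c.patch_raw))  # shortest_patch_raw tiebreaker
```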
4.2 Design 8: Operational stability (sharding + retry)
To reliably complete 500-problem evaluations, we combined sharded execution with missing-result retry. This let us handle partial failures through local reruns and maintain reproducible aggregation.
At the scale of 500 problems, infrastructure variance can affect results as much as individual cases. Sharding and retry aren't interesting technically, but they were a prerequisite for producing numbers worth comparing.
The entry point is swebench_sota.py, which wraps each instance with hooks for cleanup, supplementary collection, and /_share asset export. Being able to recover per-instance outputs — FAIL_TO_PASS / PASS_TO_PASS tests, trace logs, handover memos, Kanban entries — made it much easier to inspect trajectory behavior and selector decisions after the fact.
4.3 Evaluation setup
- TTS@8 track (the main result reported here)
  - Dataset: SWE-bench Verified (`split=test`)
  - Batch run entry: `minisweagent.run.extra.swebench_sota`
  - Candidate generation: `Qwen3.5-27B` in Orchestra configuration, 8 runs
  - Selection foundation: cross-agent `instance_test_tables`
  - Selection rule: `simple weighted pass-rate` with `F2P=0.3`, `P2P=0.7` estimated on dev split; ties broken by `shortest_patch_raw`
  - Published figure: `374/500 = 74.8%` (official SWE-bench cloud / `sb-cli`)
Local eval.json and cross-agent re-aggregation are useful for fast inner-loop iteration, but the TTS@8 figure published here uses only the sb-cli measurement to stay consistent with the submission's evaluation stack.
For submission, we package predictions along with metadata.yaml, README.md, trajs/, logs/, and other artifacts in the format required by SWE-bench/experiments. This post is a technical companion to trajs/ and logs/ in particular — filling in the design intent behind what shows up in those logs.
4.4 A concrete agent I/O walkthrough
Trajectory: one Django instance
Here is a single trajectory from start to finish. The agent reads the problem, calls the appropriate tools, makes minimal edits, and verifies the result using FAIL_TO_PASS / PASS_TO_PASS tests it generated itself. As noted earlier, "FAIL_TO_PASS / PASS_TO_PASS" here refers to the agent-generated verification tests in the trajectory, not the official SWE-bench tests.
Example 1 — Problem statement (excerpt)
```
MultiValueField ignores a required value of a sub field.
Form is valid: True
Expected is_valid=False but got True.
Number of 'required' attributes in HTML: 0
Expected 1 required attribute but got 0.
```
Example 2 — Agent's intermediate reasoning (THOUGHT, excerpt)
```
THOUGHT: This isn't a single missing validation check. Fixing MultiValueField.clean() alone won't close it — the HTML attribute assignment path needs attention too. First, line_trace to see which branch empty input actually takes; then caller_trace to follow the required attribute rendering path. After that, stack the local fixes with line_edit.
```
Following this THOUGHT, the agent generates a tool command:
```
<tool: bash>
cd /testbed && TRACE_FILE_PATH="/testbed/django/forms/fields.py" python /_tools/line_trace.py /_share/test_FAIL_TO_PASS_1.py
</tool>
```
Example 3 — Tool responses
line_trace:
```
[trace] /testbed/django/forms/fields.py:1024 | if not value or isinstance(value, (list, tuple)):
[trace] /testbed/django/forms/fields.py:1025 | if not value or not [v for v in value if v not in self.empty_values]:
[trace] /testbed/django/forms/fields.py:1026 | if self.required:
[trace] /testbed/django/forms/fields.py:1029 | return self.compress([])
```
caller_trace:
```
=== caller_trace output ===
Target function: /testbed/django/forms/boundfield.py:BoundField.build_widget_attrs
Target hits: 1
Unique transitive callers (<full_path>:<qualname>): 5
- /testbed/django/forms/boundfield.py:BoundField.as_widget [direct]
- /testbed/django/forms/forms.py:BaseForm._html_output
- /testbed/django/forms/forms.py:BaseForm.as_table
```
line_edit:
```
Updated /testbed/django/forms/fields.py:1029
Old: ' return self.compress([])'
New: ' ... raise ValidationError(...) ...\n return self.compress([])'
```
Example 4 — Before and after verification (agent-generated FAIL_TO_PASS tests)
Before:
```
=== Test 1: Both sub fields are empty ===
Form is valid: True
FAIL: Expected is_valid=False but got True.
=== Test 4: HTML required attributes ===
Number of 'required' attributes in HTML: 0
FAIL: Expected 1 required attribute but got 0.
```
After:
```
FAIL_TO_PASS: 9/9 passed
```
Example 5 — Regression confirmation (agent-generated PASS_TO_PASS tests)
```
PASS_TO_PASS: 145/145 passed
```
4.5 Comparison with related work
CORTEXA, AGENTLESS, and R2E-Gym all move away from single-shot patching toward combinations of localization, multiple candidates, and test-based selection. Our approach shares that motivation but has a different center of gravity: rather than relying on fine-tuned retrievers or learned verifiers, we focused on using harness-side state management, tooling, and selection rules to improve OSS LLM performance without any fine-tuning.
| Approach | Core emphasis | Corresponding element in our work | Our trade-off |
|---|---|---|---|
| Nemotron-CORTEXA | Fine-tuned code embeddings for file-level localization; repo-graph-based symbol-level narrowing; diverse context and prompt formats for patch generation; test + LLM judge for final selection | line_trace / caller_trace / coedit_localize, phase decomposition, cross-agent testing | Localization runs on execution-time traces, call chains, and co-edit history, with no fine-tuned retriever or graph infrastructure required. Final selection uses executable cross-agent test results rather than an LLM judge, making the selection rationale easier to audit. |
| AGENTLESS | Three-stage pipeline (localization → fix → verification) that avoids complex autonomous loops while targeting high cost efficiency. Each issue gets one reproduction test and one set of regression tests, applied uniformly to all patches for final selection. | Phase + workflow decomposition, shared filesystem, specialized tools, cross-agent testing | AGENTLESS's simplicity is a high bar, but our approach can explicitly manage long explorations through phases, workflows, and handovers. Intermediate artifacts (scripts, traces, tests, memos) are preserved on disk, making multi-stage investigation resilient. Per-run tests are also cross-applied across runs, not just shared across patches from a single test set. |
| R2E-Gym verifier | Analysis of execution-based and execution-free verifiers; hybrid verifier to improve BEST@K | Cross-agent testing, filesystem-stored test assets, simple weighted pass-rate + shortest_patch_raw | No learned verifier or trajectory-dependent judge. Comparison is based on running each run's FAIL_TO_PASS / PASS_TO_PASS tests against other runs' patches. The scoring basis is concrete test pass/fail counts, which makes the selection decisions easy to follow and reduces dependency on additional training or inference infrastructure. |
To summarize: CORTEXA centers on learned localization and diverse patch generation; AGENTLESS on a simple, strong three-stage pipeline; R2E-Gym on verifiers and test-time scaling. Our approach centers on harness design — phase/workflow management, shared working state, specialized tools, and cross-agent testing. That emphasis is the most direct expression of our goal: improving OSS LLM performance without fine-tuning.
The clearest difference from AGENTLESS is in how test assets are handled. AGENTLESS selects one reproduction test and one set of regression tests per issue and applies them to all patches. We have each run independently generate its FAIL_TO_PASS / PASS_TO_PASS tests and then cross-apply them to every other run's patches. This lets you evaluate a patch not just against "does it pass the shared test set?" but "how well does it hold up against the perspective each other run developed independently?"
4.6 Limitations
The published TTS@8 = 74.8% reflects the full system — 8 candidate runs plus the selector. It should not be read as the strength of a single trajectory or a pass@1 number. If the candidates lack diversity, the selector's ceiling drops; if the comparison evidence is weak, final submission accuracy suffers regardless of how many candidates there are.
SWE-bench Verified is a strong, practical benchmark, but it doesn't fully represent enterprise software development. Real deployments involve multi-repository dependencies, access controls, long CI cycles, review workflows, rollback procedures, and audit trail requirements. A configuration that works well on the benchmark won't necessarily be optimal in production.
The trajectory analysis also plays a different role from the published figures. The sb-cli measurement is the official result; per-phase turn distributions and dwell times are supplementary analysis using internal artifacts. They're useful for understanding what the system is doing, but they shouldn't be treated as performance metrics in their own right.
4.7 Open problems
There are three technically interesting directions we see from here.
First, co-optimizing harness improvements with model training. What we found is that harness-side design can move the needle substantially even with an unmodified base model. The next step is a feedback loop: use phase failures, handover events, tool-specialist corrections, cross-agent test outcomes, and selector decisions as training signal, and adjust workflows, prompts, and verification gates to fit the specific model's tendencies. There's substantial room to iterate on model and harness jointly.
Second, further automating harness improvement. Parts of the analysis and configuration update process are already semi-automated, but there's more to do — workflow definitions, specialist prompts, retry thresholds, temperatures, and selector coefficient tuning all have room for more automation. If we can close the loop between failure pattern identification and configuration change, and automate the ablation, regression detection, and re-evaluation cycle, the harness improvement process itself gets faster.
Third, jointly optimizing comparison evidence and selection rules. The selector in this work is simple and auditable, but its performance is directly tied to the quality of the FAIL_TO_PASS / PASS_TO_PASS tests each run generates. How to quantify coverage, redundancy, and discriminative power of those tests — and how to connect those properties to the weighting and tiebreaker in the selector — is an open and tractable problem. There's meaningful room to improve by making the comparison evidence richer.
5. Closing Remarks
The primary point we aim to demonstrate is that scaling model size or increasing fine-tuning effort is not the only path to improved performance. The structure of execution phases, the preservation of intermediate artifacts, and the way multiple candidate solutions are evaluated all play a critical role. In practice, harness design influences both final scores and output quality as directly as the underlying model.
This is not a forward-looking research roadmap. At Fujitsu, we are already deploying development agents of this kind internally. This post focuses on the design decisions and operational choices that have proven effective in real-world use, within the limits of what we can share publicly.
Harness engineering is both an active research area and a source of immediate practical impact. In production settings, organizations do not simply need a model that produces a single patch; they require systems that preserve intermediate work, make failures inspectable, and integrate seamlessly with existing development workflows. Capabilities such as phase and workflow management, shared file systems, specialized tooling, and cross-agent testing naturally align with these needs.
Working with open-weight or locally hosted LLMs further expands flexibility across cost, data residency, latency, and model interchangeability. When the harness can both incorporate model improvements and independently enhance performance and operability, its value extends well beyond benchmark results. We hope this post provides useful insights for practitioners designing not only models, but the full execution environments in which they operate.
References
- Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, Karthik R. Narasimhan, SWE-bench: Can Language Models Resolve Real-world Github Issues?, The Twelfth International Conference on Learning Representations, 2024.
- OpenAI, Introducing SWE-bench Verified, OpenAI, 2024.
- SWE-bench, Overview - sb-cli, SWE-bench Documentation, accessed 2026-04-06.
- Qwen, Qwen3.5-27B, Hugging Face, accessed 2026-04-06.
- Mistral AI, Devstral Small 2 24B Instruct 2512, Hugging Face, accessed 2026-04-06.
- Atefeh Sohrabizadeh et al., Nemotron-CORTEXA: Enhancing LLM Agents for Software Engineering Tasks via Improved Localization and Solution Diversity, ICML 2025.
- Chunqiu Steven Xia et al., Agentless: Demystifying LLM-based Software Engineering Agents, arXiv:2407.01489, 2024.
- Naman Jain et al., R2E-Gym: Procedural Environments and Hybrid Verifiers for Scaling Open-Weights SWE Agents, DL4C @ NeurIPS 2025.