Sapho Chapterhouse Institute Kept Artifacts

Retained research artifacts with source and archive links.

57 kept · 57 decisioned · updated 2026-05-08T22:28:44Z

Published 2026-05-08 · Sapho Daily

Agent pull requests are everywhere. Here's how to review them.

Agent pull requests have already become a large operational surface in software development, but scale and reviewer comfort do not make them safe by default. The article argues that agent-written changes can raise maintenance burden, hide correctness failures behind green CI, an…

Published 2026-04-29 · Sapho Daily

Securing the git push pipeline: Responding to a critical remote code execution vulnerability

GitHub reports a critical git push pipeline vulnerability whose severity came from a very small attacker action surface and a very large downstream consequence: one crafted push, carrying a malicious push option, could turn unsanitized user input into trusted internal metadata,…

Published 2026-04-15 · Sapho Daily

How exposed is your code? Find out in minutes—for free - The GitHub Blog

GitHub is positioning Code Security Risk Assessment as a fast, free entry-point security scan for organizations: a one-click, CodeQL-based pass over up to 20 of the most active repositories that produces an exposure dashboard rather than a full organizational audit. The practica…

Published 2026-04-13 · Sapho Daily

Selective Memory for Artificial Intelligence: Write-Time Gating with Hierarchical Archiving

The paper argues that memory quality can be improved upstream, at write time, rather than mainly at retrieval time. By scoring incoming knowledge for salience and admitting only a small subset into the active store while archiving lower-salience and superseded material with line…

Published 2026-04-12 · Sapho Daily

Labor market impacts of AI: A new measure and early evidence

This paper argues that labor-market analysis improves when AI exposure is measured through realized, work-related model use rather than theoretical capability alone. Its central move is to build an "observed exposure" metric that counts tasks as exposed when they are both LLM-fe…

Published 2026-04-12 · Sapho Daily

GitHub Copilot CLI combines model families for a second opinion

GitHub is testing an explicit cross-family review layer inside Copilot CLI: a primary coding agent can hand its plan or work to a second model from a different model family for independent critique, and GitHub reports that this setup materially narrows the performance gap on a h…

Published 2026-04-12 · Sapho Daily

GitHub availability report: March 2026 - The GitHub Blog

GitHub reported four separate availability incidents in March 2026, with one broad platform degradation on March 3 and later incidents that were narrower but still severe. The report supports a specific operating lesson: platform instability did not come from one repeating globa…

Published 2026-04-11 · Sapho Daily

Autonomous Evaluation and Refinement of Digital Agents

This paper argues that digital agents can be evaluated automatically with useful, though clearly imperfect, fidelity to oracle metrics or human judgment, and that those evaluator signals can do more than score behavior: they can materially improve agent performance when used as…

Published 2026-04-11 · Sapho Daily

Kimi K2.5 Tech Blog: Visual Agentic Intelligence

Kimi presents K2.5 as a native multimodal model trained on approximately 15 trillion mixed visual and text tokens and positioned not just as a single model but as an orchestrator for parallel agentic work. The strongest supported public takeaway is not general intelligence or br…

Published 2026-04-11 · Sapho Daily

How Do Agentic AI Systems Address Performance Optimizations? A BERTopic-Based Analysis of Pull Requests

Performance work in agentic software development is real, broad, and operationally costly, but it is not easy to identify with naive filters and it does not behave like a narrow low-level tuning niche. In this corpus, reliable detection required LLM-based classification plus man…

Published 2026-04-11 · Sapho Daily

From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review

This review argues that autonomous AI agents are no longer a narrow extension of language models but a broad systems domain: the field now spans roughly 60 benchmarks, a growing set of tool-using agent frameworks, and emerging coordination protocols, yet the benchmark record sti…

Published 2026-04-11 · Sapho Daily

Autonomous Evaluation and Refinement of Digital Agents

Automatic evaluators can do more than score digital agents after the fact: in the paper’s web and device-control settings, they are accurate enough to function as a usable reward signal for refinement and data filtering, which in turn improves agent performance. The result is op…

Published 2026-04-11 · Sapho Daily

ReliabilityBench: Evaluating LLM Agent Reliability Under Production-Like Stress Conditions

ReliabilityBench argues that agent quality cannot be read off a single clean run. It measures reliability as a stress surface across repeated-execution consistency, prompt perturbation robustness, and infrastructure-failure tolerance, then shows that even a strong reported confi…

Published 2026-04-11 · Sapho Daily

Agent frameworks do not yield a stable winner across code-centric software engineering tasks

This evaluation shows that agent-framework performance in code-centric software engineering is task-dependent rather than unified under a single best system. Across software development, vulnerability detection, and program repair, the leading result shifts by task: OpenHands le…

Published 2026-04-11 · Sapho Daily

Enhancing LLM Code Generation: A Systematic Evaluation of Multi-Agent Collaboration and Runtime Debugging for Improved Accuracy, Reliability, and Latency

Adding runtime debugging to a simple multi-agent code-generation workflow improves results relative to the same workflow without debugging, but the gains are modest, not clearly superior to debugging alone on HumanEval, and do not support the broader story that more agentic comp…

Published 2026-04-09 · Sapho Daily

ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development

ABC-Bench argues that backend agent evaluation should be done on executable end-to-end work rather than isolated code snippets. Its contribution is a workflow benchmark that forces models to explore repositories, implement changes, configure environments, deploy services, and pa…

Published 2026-04-09 · Sapho Daily

FeatureBench: Benchmarking Agentic Coding for Complex Feature Development

FeatureBench argues that current coding agents look much weaker once evaluation moves from relatively narrow bug-fix settings to executable feature-development tasks that span larger code surfaces, more files, more functions, and more tests. The benchmark is built to measure tha…

Published 2026-04-09 · Sapho Daily

COMPASS: Enhancing Agent Long-Horizon Reasoning with Evolving Context

COMPASS argues that long-horizon agent failure is not mainly a problem of raw model intelligence but of context control: as tasks stretch across many interdependent steps, the active working state expands until relevant information no longer fits cleanly inside the model window.…

Published 2026-04-09 · Sapho Daily

SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios

SWE-EVO is a deliberately harder coding-agent benchmark built from real release-to-release software evolution work in seven mature Python repositories, and its results show that current agent stacks degrade sharply when tasks expand from localized bug repair into wider, regressi…

Published 2026-04-09 · Sapho Daily

How AI Coding Agents Modify Code: A Large-Scale Study of GitHub Pull Requests

AI-agent pull requests are structurally different from human pull requests in ways that are large enough to matter operationally, and the clearest separator is commit count rather than raw changed-line volume. The study also finds a modest edge for agent pull requests on descrip…

Published 2026-04-09 · Sapho Daily

Long-horizon agents: OpenCode + GPT-5.2 Codex Experiment - DEV Community

A reported long-horizon coding run used an orchestrator-plus-sub-agent pattern to complete a substantial rewrite while keeping the main working thread relatively small, suggesting that delegated agent structures can stretch practical working scope beyond a single session’s conte…

Published 2026-04-09 · Sapho Daily

NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents

NL2Repo-Bench tests whether coding agents can build an installable Python library from nothing but a requirements document, with no scaffold, source, or tests shown during development, and the reported results say they still fail often. The benchmark therefore shifts the questio…

Published 2026-04-09 · Sapho Daily

Diagnosing Retrieval vs. Utilization Bottlenecks in LLM Agent Memory

In this evaluation of memory-equipped LLM agents, retrieval quality matters far more than memory write strategy. The strongest reported gains come from getting the right memory back at inference time, especially through hybrid reranking, while downstream use of retrieved memory…

Published 2026-04-09 · Sapho Daily

AgentOrchestra: A Hierarchical Multi-Agent Framework for General-Purpose Task Solving

AgentOrchestra argues that general-purpose task solving improves when a planning layer decomposes work, tracks execution state, and coordinates specialized agents instead of forcing one model to handle planning, browsing, research, and analysis in a single loop.

Published 2026-04-09 · Sapho Daily

Incident postmortem in the age of AI agents

Firetiger’s March 1 ingest outage was not a generic cloud failure but a layered control failure: a CI race condition canceled a build, a later deploy treated missing artifacts as complete and pointed Lambda and ECS at a non-existent container image, Terraform then rejected part…

Published 2026-04-09 · Sapho Daily

On the Use of Agentic Coding: An Empirical Study of Pull Requests on GitHub

Explicitly labeled Claude Code pull requests were merged often in this sample of open-source repositories, but still less often than human pull requests. Once an agentic pull request cleared the acceptance threshold, its downstream revision pattern looked broadly similar to huma…

Published 2026-04-05 · Sapho Daily

Evaluating Very Long-Term Conversational Memory of LLM Agents

LoCoMo is a deliberately constructed benchmark for testing long-horizon conversational memory across months of interaction, and the reported results show that current language models remain far from human performance on this problem. The benchmark is useful because it forces ret…

Published 2026-04-05 · Sapho Daily

Code Change Characteristics and Description Alignment: A Comparative Study of Agentic versus Human Pull Requests

Across 33,596 agentic pull requests and 6,618 human pull requests, agentic software change work in this study looks structurally narrower and locally better described at the commit level, but it merges less often and the code it introduces is revised or removed sooner. The resul…

Published 2026-04-05 · Sapho Daily

Security in the Age of AI Teammates: An Empirical Study of Agentic Pull Requests on GitHub

Security-relevant work is already a visible slice of agent-authored software change on GitHub, but the dominant pattern is not just direct bug repair. In this dataset, agents are frequently contributing broader security hardening work such as refactoring, testing, documentation,…

Published 2026-04-05 · Sapho Daily

Let’s Make Every Pull Request Meaningful: An Empirical Analysis of Developer and Agentic Pull Requests

Within this dataset, agentic pull requests were more predictable at merge time than human pull requests, and the strongest merge signal in both cases was not code shape but submitter position in the workflow. The paper’s central value is not a claim that agents are intrinsically…

Published 2026-04-05 · Sapho Daily

SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?

SWE-Bench Pro argues that current agent evaluation has been too concentrated on lighter bug-fix settings and introduces a larger benchmark aimed at materially longer-horizon software engineering work: 1,865 problems from 41 actively maintained repositories, with tasks that often…

Published 2026-04-04 · Sapho Daily

What are popular AI coding benchmarks actually measuring? - nilenso blog

SWE-bench Verified is best read as a narrow test of whether an agent can produce a patch for a real GitHub issue that makes the issue’s unit tests pass. The source argues that this is useful but easy to overread: the benchmark is entirely Python, heavily concentrated in Django,…

Published 2026-04-04 · Sapho Daily

#1 open-source agent on SWE-Bench Verified by combining Claude 3.7 and O1 | Augment Code

Augment’s report argues that a high SWE-bench Verified score can be pushed upward by system composition rather than by a single model alone: its submitted setup reportedly reached 65.4% by using Claude Sonnet 3.7 as the main agent and OpenAI o1 as an ensembler, while also inferr…

Published 2026-04-04 · Sapho Daily

Testing AI coding agents (2025): Cursor vs. Claude, OpenAI, and Gemini | Render Blog

This benchmark points to a narrow but useful conclusion: Cursor looked strongest in the greenfield app-generation test and also performed well on one production backend refactor, but the benchmark as a whole does not justify a clean overall ranking because part of the production…

Published 2026-04-04 · Sapho Daily

Top Coding Agents (2025) | Benched.ai

This source’s strongest usable contribution is not a clean leaderboard but a bounded reporting comparison: among the agents covered in the excerpt, Claude Code is the only one presented with multiple named benchmark figures across different evaluations and model variants, while…

Published 2026-04-02 · Sapho Daily

SWE Context Bench: A Benchmark for Context Learning in Coding

SWE-ContextBench reframes coding-agent evaluation around context reuse rather than isolated task solving by pairing 300 SWE-Bench Lite tasks with 99 manually verified related tasks from real issue and pull-request relationships. In the reported benchmark results, compact high-qu…

Published 2026-04-02 · Sapho Daily

Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models

Agentic Context Engineering (ACE) argues that adaptation can be moved into context itself rather than into model weights or heavy supervised pipelines. In the paper’s reported evaluations, that approach produces meaningful benchmark gains across agent and finance tasks, includin…

Published 2026-04-02 · Sapho Daily

SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

Software-engineering agents improve materially when the model is not left to operate through a raw shell alone but is given a language-model-tailored interface for search, file viewing, and guarded editing. In this paper, that interface raises benchmark solve rates versus prior…

Published 2026-04-02 · Sapho Daily

Confucius Code Agent: An Open-sourced AI Software Engineer at Industrial Scale

Confucius Code Agent presents an open-source software-engineering agent stack aimed at large-repository, industrial-style work by combining a unified orchestrator with hierarchical context management, persistent note reuse, and modular tool extensions, and the paper reports that…

Published 2026-04-02 · Sapho Daily

How to Build AI Agents by Augmenting LLMs with Codified Human Expert Domain Knowledge? A Software Engineering Framework

The paper argues that domain agents improve materially when an LLM is not left to improvise alone, but is coupled to explicit expert knowledge encoded as software, retrieval-backed code generation, and expert-guided design rules. In the reported evaluation, that combined archite…

Published 2026-04-02 · Sapho Daily

Decoding the Configuration of AI Coding Agents: Insights from Claude Code Projects

Public Claude Code project configurations are not lightweight prompt scraps; in the sampled repositories they function mainly as structured project orientation layers. The dominant pattern is to tell the coding agent how the system is organized, what it depends on, and what the…

Published 2026-04-02 · Sapho Daily

Agentic Much? Adoption of Coding Agents on GitHub

Coding-agent use on GitHub is already material rather than marginal, with the study estimating adoption at 15.85% to 22.60% across 129,134 projects. The central finding is not just that agent use is visible at scale, but that visible file traces alone miss a meaningful share of…

Published 2026-04-02 · Sapho Daily

ChatDev: Communicative Agents for Software Development

ChatDev argues that software generation can be organized as a staged multi-agent process rather than a single-pass coding prompt: specialized agents negotiate design, implementation, review, and testing through structured dialogue, and that communication scaffold produces better…

Published 2026-04-02 · Sapho Daily

An Empirical Study of Developer-Provided Context for AI Coding Assistants in Open-Source Projects

Developer-supplied context for AI coding assistants in open-source repositories is not a narrow set of prompts but a structured operating layer: across 401 cleaned repositories, maintainers most often supplied guidelines, project information, and conventions, usually in combinat…

Published 2026-04-02 · Sapho Daily

Professional Software Developers Don’t Vibe, They Control: AI Agent Use for Coding in 2025

Among experienced developers, AI coding agents are treated as supervised productivity tools rather than autonomous replacements. The paper’s central finding is not that professionals hand work over to agents, but that they preserve control through planning, close supervision, ou…

Published 2026-04-02 · Sapho Daily

SWE-bench: Can Language Models Resolve Real-World GitHub Issues? (Published as a conference paper at ICLR 2024)

SWE-bench turns real GitHub maintenance episodes into a tightly filtered executable benchmark, and the result is a hard test of practical code-repair ability: from roughly 90,000 pull requests across 12 Python repositories, only 2,294 tasks survive the admission pipeline, and un…

Published 2026-04-02 · Sapho Daily

MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework

MetaGPT argues that code-generation performance improves when multi-agent software work is forced through explicit role specialization, structured intermediate artifacts, and executable feedback loops rather than left as an unstructured conversation. The reported result is stron…

Published 2026-04-02 · Sapho Daily

Context Engineering for Multi-Agent LLM Code Assistants Using Elicit, NotebookLM, ChatGPT, and Claude Code

A context-engineered multi-agent code assistant can outperform a single-agent baseline on complex repository tasks by translating intent, retrieving outside and in-repo context, synthesizing that material into usable working knowledge, and routing implementation through speciali…

Published 2026-04-02 · Sapho Daily

Context Engineering for AI Agents in Open-Source Software

Open-source repositories had low visible uptake of the agent-configuration formats this study examined, and the instruction files that do exist do not yet look like a settled engineering standard. Adoption was sparse in the sampled repositories, structure in AGENTS.md files was…

Published 2026-04-02 · Sapho Daily

Why Do Multi-Agent LLM Systems Fail?

Evaluated multi-agent LLM systems show only marginal gains over single-agent baselines, and the weakness is not explained by one dominant defect. The paper’s central contribution is a failure landscape: 14 distinct failure modes across 3 categories, with evidence that coordinati…

Published 2026-04-02 · Sapho Daily

Codified Context: Infrastructure for AI Agents in a Complex Codebase

The paper argues that AI-agent work in a large live codebase can be stabilized by splitting project context into tiers with different loading rules: a permanently loaded constitution for operating rules and routing, task-invoked specialist agents for focused domain behavior, and…

Published 2026-04-02 · Sapho Daily

When AGENTS.md Backfires: What a New Study Says About Context Files and Coding Agents

The reported ETH Zurich results cut against the default assumption that adding a context file helps coding agents. Across four agents and three context-file conditions, LLM-generated files more often reduced task success than improved it, while both generated and developer-writt…

Published 2026-04-02 · Sapho Daily

arXiv 2601.20404

In a paired study covering 10 repositories and 124 pull requests, the presence of repository-level AGENTS.md guidance was associated with faster agent completion and lower output-token use, but the paper supports an efficiency reading more strongly than a correctness reading bec…

Published 2026-04-02 · Sapho Daily

Agent READMEs: An Empirical Study of Context Files for Agentic Coding

Agent context files are not marginal setup artifacts. Across a large cross-repository sample, they appear to function as active operational guidance for coding agents: they differ meaningfully by ecosystem, they are revised as living documents rather than written once, and their…

Published 2026-04-02 · Sapho Daily

Testing and Enhancing Multi-Agent Systems for Robust Code Generation

Current code-generation multi-agent systems are materially less robust than headline solve rates suggest: when prompts are rewritten in meaning-preserving ways, systems that had already solved a problem newly fail on 7.9% to 83.3% of those same problems. The paper’s main practic…

Published 2026-04-02 · Sapho Daily

Towards a Science of Scaling Agent Systems

Scaling agent systems is not a general recipe for better performance. In the studied setting, multi-agent gains are conditional, benchmark-specific, and often erased by coordination cost once the base model is already fairly competent or the task is tool-heavy. The paper’s main…

Published 2026-04-02 · Sapho Daily

Repository-level context files mostly add cost and activity overhead rather than improving coding-agent success

This study finds that repository-level context files, as currently used, usually do not help coding agents solve software tasks better. Across the evaluated agents and models, they generally lowered success rates relative to giving no repository context while raising inference c…