Category Brief: Computer Science / Artificial Intelligence / Autonomous Research Harnesses
Generated: 2026-05-07
Scope
This brief covers cs-ai/research-harnesses, filtered by arXiv categories cs.SE, cs.AI, and cs.LG and by the keywords: autonomous research, research harness, AI scientist, cross-model, adversarial collaboration, claim audit, research wiki, paper writing pipeline, agent laboratory, automated scientific discovery.
Current queue snapshot
- Papers in top queue: 5
- Primary-category distribution: cs.AI=4, cs.SE=1
Candidate clusters
cs.AI(4): 2504.08066v1, 2603.28589v1, 2507.23276v2, 2511.04583v4
cs.SE(1): 2605.03042v1
Strong candidate papers
2605.03042v1 — ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration
- Why queued: primary:cs.SE; title:autonomous research; abstract:research harness; abstract:cross-model; abstract:adversarial collaboration; abstract:claim audit; abstract:research wiki; fresh:30d
- URL: https://arxiv.org/abs/2605.03042v1
2504.08066v1 — The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search
- Why queued: primary:cs.AI; title:AI scientist; title:automated scientific discovery; fresh:2y
- URL: https://arxiv.org/abs/2504.08066v1
2603.28589v1 — Towards a Medical AI Scientist
- Why queued: primary:cs.AI; abstract:autonomous research; title:AI scientist; fresh:180d
- URL: https://arxiv.org/abs/2603.28589v1
2507.23276v2 — How Far Are AI Scientists from Changing the World?
- Why queued: primary:cs.AI; title:AI scientist; abstract:automated scientific discovery; fresh:1y
- URL: https://arxiv.org/abs/2507.23276v2
2511.04583v4 — Jr. AI Scientist and Its Risk Report: Autonomous Scientific Exploration from a Baseline Paper
- Why queued: primary:cs.AI; title:AI scientist; fresh:1y
- URL: https://arxiv.org/abs/2511.04583v4
Evidence-backed claims
| Claim | Supporting paper_id(s) | Evidence row notes | Status |
|---|---|---|---|
| Autonomous research harnesses should be evaluated as stateful systems, not just as model prompts, because storage, retrieval, artifact routing, and review context shape long-horizon research behavior. | 2605.03042v1 | ARIS frames the harness as the system logic around LLM weights and implements execution, orchestration, and assurance layers. | supported |
| The main safety failure for long-running research agents is "plausible unsupported success": claims can look coherent while outrunning raw evidence or inheriting the executor's framing. | 2605.03042v1 | The paper makes this failure mode central in the abstract, introduction, assurance section, and conclusion. | supported |
| Cross-family executor/reviewer separation is a design default in ARIS, but the paper does not provide controlled evidence that it is superior to same-family review. | 2605.03042v1 | The paper explicitly labels deployment evidence observational and proposes compute-matched controlled benchmarks as future work. | supported |
| Evidence-to-claim ledgers are a practical mechanism for making autonomous research outputs auditable across experiment, manuscript, and citation stages. | 2605.03042v1 | ARIS defines experiment-audit, result-to-claim, paper-claim-audit, citation audit, and manuscript assurance gates. | supported |
| End-to-end AI Scientist systems have crossed a visible capability threshold, including fully autonomous workshop-paper generation, but their evaluation evidence is heterogeneous and needs claim-level auditing. | 2504.08066v1 | AI Scientist-v2 reports three autonomous ICLR workshop submissions and one paper exceeding the average human acceptance threshold. | supported |
| Domain-specific autonomous research systems are emerging because generic AI Scientist workflows do not automatically satisfy clinical evidence, modality, and ethics requirements. | 2603.28589v1 | Medical AI Scientist reports clinician-engineer co-reasoning, 171 cases, 19 clinical tasks, and 6 modalities. | supported |
| Survey-level work frames AI Scientist progress as promising but bottlenecked by missing components needed for ground-breaking discovery. | 2507.23276v2 | The survey positions current achievements, missing capabilities, and ultimate goals for scientific AI. | supported |
| Baseline-paper-grounded autonomous systems may be a more realistic near-term workflow than unconstrained full automation, but they surface risks that must be tracked explicitly. | 2511.04583v4 | Jr. AI Scientist builds from real top-venue baseline papers and reports both reviewer-score improvements and risk findings. | supported |
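The claim rows above can be mirrored in a small machine-checkable structure so that status changes are enforced rather than eyeballed. A minimal sketch, assuming a hypothetical `ClaimRow` type whose field names simply follow the table columns (they are not drawn from ARIS or any other cited system):

```python
from dataclasses import dataclass

# Hypothetical mirror of the claims table above; field names are
# illustrative, not taken from any cited paper or tool.
@dataclass
class ClaimRow:
    claim: str
    supporting_paper_ids: list   # e.g. ["2605.03042v1"]
    evidence_note: str
    status: str = "unreviewed"   # unreviewed | supported | contested | retracted

def supported_claims(rows):
    """Return only claims whose status has been upgraded to 'supported'."""
    return [r for r in rows if r.status == "supported"]

rows = [
    ClaimRow(
        claim="Cross-family review is a design default, not a proven win.",
        supporting_paper_ids=["2605.03042v1"],
        evidence_note="Deployment evidence labeled observational.",
        status="supported",
    ),
    ClaimRow(
        claim="Local reviewer models give enough independence.",
        supporting_paper_ids=[],
        evidence_note="Open question; no evidence row yet.",
    ),
]
print(len(supported_claims(rows)))  # 1
```

Keeping unsupported claims in the ledger with an explicit `unreviewed` status, rather than omitting them, is what makes the later audit stages meaningful.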
System comparison scaffold
| System / paper | Workflow scope | Evidence / assurance mechanism | Reported evaluation | Local-ledger implication |
|---|---|---|---|---|
| ARIS (2605.03042v1) | Research harness with execution, orchestration, and assurance layers | Cross-family review, research wiki, experiment-audit, result-to-claim, paper-claim-audit, citation audit | Observational overnight run; 65+ skills; 3 tested executor platforms; no controlled causal evaluation | Anchor architecture for evidence-ledger-first paper research |
| AI Scientist-v2 (2504.08066v1) | End-to-end autonomous ML paper generation | Agentic tree search, experiment manager, VLM feedback loop; abstract does not foreground claim ledgers | Three autonomous ICLR workshop submissions; one exceeded average human acceptance threshold | Capability frontier; needs external claim audit before trusting generated papers |
| Medical AI Scientist (2603.28589v1) | Clinical autonomous research with domain conventions | Clinician-engineer co-reasoning, medical evidence grounding, ethical policies | 171 cases, 19 tasks, 6 modalities; human/LLM/Agentic Reviewer evaluations | Shows need for domain-specific evidence rows and policy fields |
| How Far Are AI Scientists (2507.23276v2) | Survey of AI Scientist achievements and bottlenecks | Survey synthesis, not a harness audit | Prospect-driven review; no primary metric in abstract | Provides taxonomy/open-problem framing |
| Jr. AI Scientist (2511.04583v4) | Baseline-paper-driven autonomous exploration | Iterative experiments from real papers; risk report | DeepReviewer, author-led, and Agents4Science evaluations | Useful near-term model: constrained automation plus explicit risk ledger |
| NORA (2605.02092v1) | Spatial data-science autonomous research agent | Needs backfill from paper queue/full text | Needs backfill | Candidate for next evidence row |
| Agent Laboratory | Human-in-the-loop AI research assistant workflow | Needs manual source backfill | Needs backfill | Compare human checkpoints against ARIS reviewer-independence |
| data-to-paper | Data-to-manuscript workflow with traceability emphasis | Needs manual source backfill | Needs backfill | Compare traceability mechanisms with evidence matrices |
Deep analysis: 2605.03042v1 ARIS
**Positioning.** ARIS is best read as a *research-harness architecture paper*, not as a model-performance paper. Its unit of contribution is the workflow substrate around LLM agents: Markdown skills, artifact contracts, reviewer routing, persistent memory, and claim audits.
**Core design thesis.** The paper adopts a deliberately conservative assumption: any long-term task performed by a single agent is unreliable. From that assumption it derives three required capabilities: persistent research state, modular execution, and independent assurance. This maps cleanly to ARIS's research wiki, single-file skill/workflow design, and cross-model assurance stack.
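The three required capabilities can be made concrete in a toy harness loop. This is an illustrative sketch only: the names `ResearchState`, `run_skill`, and `independent_review` are invented here and do not come from the ARIS paper.

```python
from dataclasses import dataclass, field

@dataclass
class ResearchState:
    """Persistent research state (a stand-in for a 'research wiki')."""
    notes: dict = field(default_factory=dict)

def run_skill(state: ResearchState, skill: str, payload: str) -> str:
    """Modular execution: each skill reads and writes shared state,
    so no single long-running agent has to stay reliable end to end."""
    result = f"{skill}:{payload}"
    state.notes[skill] = result
    return result

def independent_review(state: ResearchState, claim: str) -> bool:
    """Independent assurance: a separate check that only trusts
    recorded artifacts, never the executor's own framing."""
    return any(claim in artifact for artifact in state.notes.values())

state = ResearchState()
run_skill(state, "experiment", "accuracy=0.91")
print(independent_review(state, "accuracy=0.91"))  # True
print(independent_review(state, "accuracy=0.99"))  # False
```

The point of the toy is the separation of concerns: the reviewer function never sees the executor's reasoning, only the persisted artifacts, which is the property the paper's conservative assumption is meant to guarantee.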
**Most useful mechanism for our system.** The strongest transferable piece is the evidence-to-claim cascade: first audit experiment/evaluation integrity, then map results to explicit claim verdicts, then have a fresh reviewer check manuscript claims against the claim ledger and raw evidence. Our evidence_matrix.csv and discipline_claim_index.csv should be treated as this category's lightweight version of the same idea.
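The final manuscript-audit stage of the cascade can be sketched as a CSV check. The filenames evidence_matrix.csv and discipline_claim_index.csv come from our own tooling, but the column layout below (`claim_id`, `paper_id`, `verdict`) is an assumption, not a documented schema:

```python
import csv
import io

# Assumed column layout for an evidence_matrix.csv-style ledger;
# the real files may differ -- treat this as a shape sketch only.
EVIDENCE_CSV = """claim_id,paper_id,verdict
C1,2605.03042v1,supported
C2,2504.08066v1,supported
"""

def audit_manuscript_claims(manuscript_claim_ids, evidence_csv):
    """Stage 3 of the cascade: every manuscript claim must map to a
    ledger row whose verdict is 'supported'; anything else is flagged."""
    ledger = {
        row["claim_id"]: row["verdict"]
        for row in csv.DictReader(io.StringIO(evidence_csv))
    }
    return [
        cid for cid in manuscript_claim_ids
        if ledger.get(cid) != "supported"
    ]

print(audit_manuscript_claims(["C1", "C2", "C9"], EVIDENCE_CSV))  # ['C9']
```

A claim absent from the ledger fails the audit exactly like a contested one, which is the behavior needed to catch "plausible unsupported success".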
**What not to overclaim.** The paper's empirical support is observational. It reports one overnight run where reviewer score improved from 5.0 to 7.5/10 over about eight hours with four review-revise rounds and more than 20 GPU experiments, but it explicitly says this is not causal evidence that cross-family review is superior.
**Integration consequence.** This category should become the anchor for papers on AI Scientist-style systems, agent laboratories, research-harness engineering, and autonomous paper-writing systems. ARIS should be cited for the assurance/architecture pattern; AI Scientist-v2 and domain-specific systems should be used to compare search strategy, benchmark protocol, and domain outcomes.
Open questions
- What controlled benchmark can isolate the value of cross-family review from model quality, researcher taste, and task difficulty?
- How should claim ledgers represent qualitative literature-review claims, not only numerical experiment claims?
- Can local reviewer models provide enough independence for confidential codebases without sending repository context to external APIs?
- Which older autonomous-research systems are missed by arXiv-first discovery and need manual backfill?
Next actions
- Fill evidence rows for AI Scientist-v2, NORA, Agent Laboratory, data-to-paper, and AutoResearchClaw-style papers.
- Add a category-level comparison table: workflow scope, persistent memory, cross-family review, audit stack, artifact contracts, and controlled evaluation.
- Run `compose-discipline --discipline cs-ai` after at least three category briefs have evidence-backed claims.
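The "at least three briefs" precondition can be enforced with a small guard before invoking the command. The counting logic and the brief structure below are hypothetical and live in this sketch, not in the compose-discipline CLI itself:

```python
# Hypothetical guard before running `compose-discipline --discipline cs-ai`;
# the brief dicts and threshold are illustrative, not a real schema.
def ready_to_compose(briefs, min_briefs=3):
    """A discipline composes only once enough category briefs carry
    at least one evidence-backed ('supported') claim."""
    backed = [
        b for b in briefs
        if any(status == "supported" for status in b["claim_statuses"])
    ]
    return len(backed) >= min_briefs

briefs = [
    {"name": "research-harnesses", "claim_statuses": ["supported"] * 8},
    {"name": "agent-benchmarks", "claim_statuses": ["supported"]},
    {"name": "paper-pipelines", "claim_statuses": []},
]
print(ready_to_compose(briefs))  # False: only two briefs are evidence-backed
```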