Evidence-ledger draft

Source brief

Category source brief used to validate supported claims before public release.

Category Brief: Computer Science / Artificial Intelligence / Autonomous Research Harnesses

Generated: 2026-05-07

Scope

This brief covers the cs-ai/research-harnesses category, drawing on the arXiv categories cs.SE, cs.AI, and cs.LG and the keywords: autonomous research, research harness, AI scientist, cross-model, adversarial collaboration, claim audit, research wiki, paper writing pipeline, agent laboratory, automated scientific discovery.
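
As a worked illustration of this scope, the sketch below expresses it as a single query against the public arXiv export API. The query grammar (`cat:`/`all:` field prefixes, `AND`/`OR`) is the arXiv API's own; the surrounding harvesting pipeline is an assumption for illustration, not the tooling actually used to populate this brief.

```python
# Sketch: express the scope above as one arXiv export-API query.
# Only the query grammar is the real API; the pipeline is hypothetical.
import urllib.parse
import urllib.request

CATEGORIES = ["cs.SE", "cs.AI", "cs.LG"]
KEYWORDS = [
    "autonomous research", "research harness", "AI scientist",
    "cross-model", "adversarial collaboration", "claim audit",
    "research wiki", "paper writing pipeline", "agent laboratory",
    "automated scientific discovery",
]

def scope_query() -> str:
    cats = " OR ".join(f"cat:{c}" for c in CATEGORIES)
    kws = " OR ".join(f'all:"{k}"' for k in KEYWORDS)
    return f"({cats}) AND ({kws})"

url = ("http://export.arxiv.org/api/query?"
       + urllib.parse.urlencode({"search_query": scope_query(),
                                 "max_results": "50"}))
with urllib.request.urlopen(url) as resp:
    atom_feed = resp.read()  # Atom XML; parse with xml.etree or feedparser
```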

Current queue snapshot

Candidate clusters

Strong candidate papers

  1. 2605.03042v1 — ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration
  2. 2504.08066v1 — The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search
  3. 2603.28589v1 — Towards a Medical AI Scientist
  4. 2507.23276v2 — How Far Are AI Scientists from Changing the World?
  5. 2511.04583v4 — Jr. AI Scientist and Its Risk Report: Autonomous Scientific Exploration from a Baseline Paper

Evidence-backed claims

| Claim | Supporting paper_id(s) | Evidence row notes | Status |
| --- | --- | --- | --- |
| Autonomous research harnesses should be evaluated as stateful systems, not just as model prompts, because storage, retrieval, artifact routing, and review context shape long-horizon research behavior. | 2605.03042v1 | ARIS frames the harness as the system logic around LLM weights and implements execution, orchestration, and assurance layers. | supported |
| The main safety failure for long-running research agents is "plausible unsupported success": claims can look coherent while outrunning raw evidence or inheriting the executor's framing. | 2605.03042v1 | The paper makes this failure mode central in the abstract, introduction, assurance section, and conclusion. | supported |
| Cross-family executor/reviewer separation is a design default in ARIS, but the paper does not provide controlled evidence that it is superior to same-family review. | 2605.03042v1 | The paper explicitly labels deployment evidence observational and proposes compute-matched controlled benchmarks as future work. | supported |
| Evidence-to-claim ledgers are a practical mechanism for making autonomous research outputs auditable across experiment, manuscript, and citation stages. | 2605.03042v1 | ARIS defines experiment-audit, result-to-claim, paper-claim-audit, citation audit, and manuscript assurance gates. | supported |
| End-to-end AI Scientist systems have crossed a visible capability threshold, including fully autonomous workshop-paper generation, but their evaluation evidence is heterogeneous and needs claim-level auditing. | 2504.08066v1 | AI Scientist-v2 reports three autonomous ICLR workshop submissions and one paper exceeding the average human acceptance threshold. | supported |
| Domain-specific autonomous research systems are emerging because generic AI Scientist workflows do not automatically satisfy clinical evidence, modality, and ethics requirements. | 2603.28589v1 | Medical AI Scientist reports clinician-engineer co-reasoning, 171 cases, 19 clinical tasks, and 6 modalities. | supported |
| Survey-level work frames AI Scientist progress as promising but bottlenecked by missing components needed for ground-breaking discovery. | 2507.23276v2 | The survey positions current achievements, missing capabilities, and ultimate goals for scientific AI. | supported |
| Baseline-paper-grounded autonomous systems may be a more realistic near-term workflow than unconstrained full automation, but they surface risks that must be tracked explicitly. | 2511.04583v4 | Jr. AI Scientist builds from real top-venue baseline papers and reports both reviewer-score improvements and risk findings. | supported |
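
To make the Status column checkable rather than merely declarative, here is a minimal sketch of one ledger row as a typed record with a validation pass. The four fields mirror the table columns; the class name, the status vocabulary beyond `supported`, and the CSV loader are assumptions for illustration, not the actual `evidence_matrix.csv` schema.

```python
# Illustrative schema for one evidence-ledger row, mirroring the table
# above. Field names and status vocabulary are assumptions, not the
# real evidence_matrix.csv schema.
import csv
from dataclasses import dataclass

VALID_STATUSES = {"supported", "needs-backfill", "unsupported"}

@dataclass
class EvidenceRow:
    claim: str
    supporting_paper_ids: str  # e.g. semicolon-separated arXiv ids
    evidence_row_notes: str
    status: str

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the row passes."""
        problems = []
        if self.status not in VALID_STATUSES:
            problems.append(f"unknown status: {self.status!r}")
        if self.status == "supported" and not self.supporting_paper_ids.strip():
            problems.append("supported claim with no supporting paper ids")
        if not self.evidence_row_notes.strip():
            problems.append("empty evidence notes")
        return problems

def load_rows(path: str) -> list[EvidenceRow]:
    # Assumes the CSV headers match the field names above exactly.
    with open(path, newline="", encoding="utf-8") as f:
        return [EvidenceRow(**row) for row in csv.DictReader(f)]
```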

System comparison scaffold

| System / paper | Workflow scope | Evidence / assurance mechanism | Reported evaluation | Local-ledger implication |
| --- | --- | --- | --- | --- |
| ARIS (2605.03042v1) | Research harness with execution, orchestration, and assurance layers | Cross-family review, research wiki, experiment-audit, result-to-claim, paper-claim-audit, citation audit | Observational overnight run; 65+ skills; 3 tested executor platforms; no controlled causal evaluation | Anchor architecture for evidence-ledger-first paper research |
| AI Scientist-v2 (2504.08066v1) | End-to-end autonomous ML paper generation | Agentic tree search, experiment manager, VLM feedback loop; abstract does not foreground claim ledgers | Three autonomous ICLR workshop submissions; one exceeded the average human acceptance threshold | Capability frontier; needs external claim audit before trusting generated papers |
| Medical AI Scientist (2603.28589v1) | Clinical autonomous research with domain conventions | Clinician-engineer co-reasoning, medical evidence grounding, ethical policies | 171 cases, 19 tasks, 6 modalities; human/LLM/Agentic Reviewer evaluations | Shows need for domain-specific evidence rows and policy fields |
| How Far Are AI Scientists (2507.23276v2) | Survey of AI Scientist achievements and bottlenecks | Survey synthesis, not a harness audit | Prospect-driven review; no primary metric in abstract | Provides taxonomy/open-problem framing |
| Jr. AI Scientist (2511.04583v4) | Baseline-paper-driven autonomous exploration | Iterative experiments from real papers; risk report | DeepReviewer, author-led, and Agents4Science evaluations | Useful near-term model: constrained automation plus explicit risk ledger |
| NORA (2605.02092v1) | Spatial data-science autonomous research agent | Needs backfill from paper queue/full text | Needs backfill | Candidate for next evidence row |
| Agent Laboratory | Human-in-the-loop AI research assistant workflow | Needs manual source backfill | Needs backfill | Compare human checkpoints against ARIS reviewer-independence |
| data-to-paper | Data-to-manuscript workflow with traceability emphasis | Needs manual source backfill | Needs backfill | Compare traceability mechanisms with evidence matrices |

Deep analysis: 2605.03042v1 ARIS

**Positioning.** ARIS is best read as a *research-harness architecture paper*, not as a model-performance paper. Its unit of contribution is the workflow substrate around LLM agents: Markdown skills, artifact contracts, reviewer routing, persistent memory, and claim audits.

**Core design thesis.** The paper adopts a deliberately conservative assumption: any long-term task performed by a single agent is unreliable. From that assumption it derives three required capabilities: persistent research state, modular execution, and independent assurance. This maps cleanly to ARIS's research wiki, single-file skill/workflow design, and cross-model assurance stack.
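
Read as interfaces, the three capabilities separate cleanly. The sketch below is a hypothetical rendering of that separation for our own notes, not ARIS's actual code; every name in it is illustrative.

```python
# Hypothetical reading of the three required capabilities as interfaces.
# All names here are illustrative, not ARIS internals.
from typing import Protocol

class ResearchState(Protocol):  # persistent research state (e.g. a wiki)
    def read(self, key: str) -> str: ...
    def write(self, key: str, content: str) -> None: ...

class Skill(Protocol):  # modular execution (single-file skills/workflows)
    name: str
    def run(self, state: ResearchState) -> str: ...

class Reviewer(Protocol):  # independent assurance (cross-model review)
    def review(self, artifact: str, state: ResearchState) -> tuple[bool, str]: ...

def step(skill: Skill, state: ResearchState, reviewer: Reviewer) -> bool:
    """One execute-then-review step: nothing enters state unreviewed."""
    artifact = skill.run(state)
    ok, notes = reviewer.review(artifact, state)
    state.write(f"review/{skill.name}", notes)
    if ok:
        state.write(f"artifact/{skill.name}", artifact)
    return ok
```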

**Most useful mechanism for our system.** The strongest transferable piece is the evidence-to-claim cascade: first audit experiment/evaluation integrity, then map results to explicit claim verdicts, then have a fresh reviewer check manuscript claims against the claim ledger and raw evidence. Our `evidence_matrix.csv` and `discipline_claim_index.csv` should be treated as this category's lightweight version of the same idea.
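
A minimal sketch of the first stage of that cascade as an offline audit over our two CSVs. The column names (`claim_id`, `status`) are assumptions for illustration and may differ from the real headers; the later stages (result-to-claim verdicts, manuscript checks) would extend the same loop.

```python
# Sketch: stage one of the evidence-to-claim cascade as an offline
# audit over the category CSVs. Column names are assumed; adjust to
# the real headers of evidence_matrix.csv / discipline_claim_index.csv.
import csv

def read_csv(path: str) -> list[dict]:
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

def audit(evidence_path: str = "evidence_matrix.csv",
          claims_path: str = "discipline_claim_index.csv") -> list[str]:
    evidence_ids = {row["claim_id"] for row in read_csv(evidence_path)}
    problems = []
    for row in read_csv(claims_path):
        # A claim marked supported must point at an evidence row;
        # verdict and manuscript-level checks would be added here.
        if row["status"] == "supported" and row["claim_id"] not in evidence_ids:
            problems.append(f"claim {row['claim_id']} has no evidence row")
    return problems

if __name__ == "__main__":
    for p in audit():
        print("AUDIT:", p)
```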

**What not to overclaim.** The paper's empirical support is observational. It reports one overnight run where reviewer score improved from 5.0 to 7.5/10 over about eight hours with four review-revise rounds and more than 20 GPU experiments, but it explicitly says this is not causal evidence that cross-family review is superior.

**Integration consequence.** This category should become the anchor for papers on AI Scientist-style systems, agent laboratories, research-harness engineering, and autonomous paper-writing systems. ARIS should be cited for the assurance/architecture pattern; AI Scientist-v2 and domain-specific systems should be used to compare search strategy, benchmark protocol, and domain outcomes.

Open questions

Next actions

  1. Fill evidence rows for AI Scientist-v2, NORA, Agent Laboratory, data-to-paper, and AutoResearchClaw-style papers.
  2. Add a category-level comparison table: workflow scope, persistent memory, cross-family review, audit stack, artifact contracts, and controlled evaluation.
  3. Run `compose-discipline --discipline cs-ai` once at least three category briefs have evidence-backed claims (a minimal precondition check is sketched below).
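
The precondition in item 3 can be checked mechanically. The sketch below assumes a hypothetical directory layout (`briefs/cs-ai/<category>/evidence_matrix.csv`) and the same assumed `status` column as above; only the threshold of three briefs comes from this draft.

```python
# Hypothetical precondition check before running
# `compose-discipline --discipline cs-ai`: count category briefs with
# at least one supported claim. Directory layout and column names are
# assumptions for this sketch.
import csv
from pathlib import Path

def briefs_with_supported_claims(discipline_dir: str = "briefs/cs-ai") -> int:
    ready = 0
    for matrix in Path(discipline_dir).glob("*/evidence_matrix.csv"):
        with matrix.open(newline="", encoding="utf-8") as f:
            if any(row.get("status") == "supported"
                   for row in csv.DictReader(f)):
                ready += 1
    return ready

if __name__ == "__main__":
    n = briefs_with_supported_claims()
    state = "ready" if n >= 3 else "not ready"
    print(f"{n} category briefs have evidence-backed claims "
          f"({state} for compose-discipline)")
```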