Evidence-ledger draft

Source brief

Category source brief used to validate supported claims before public release.

Category Brief: Computer Science / Artificial Intelligence / Autonomous Research Harnesses

Generated: 2026-05-07

Scope

This brief covers the cs-ai/research-harnesses category, drawing on the arXiv categories cs.SE, cs.AI, and cs.LG and the keywords: autonomous research, research harness, AI scientist, cross-model, adversarial collaboration, claim audit, research wiki, paper writing pipeline, agent laboratory, automated scientific discovery.
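
As a worked illustration of this scope, the sketch below expresses it as a single query against the public arXiv export API. The query grammar (`cat:`/`all:` field prefixes, `AND`/`OR`) is the arXiv API's own; the surrounding harvesting pipeline is an assumption for illustration, not the tooling actually used to populate this brief.

```python
# Sketch: express the scope above as one arXiv export-API query.
# Only the query grammar is the real API; the pipeline is hypothetical.
import urllib.parse
import urllib.request

CATEGORIES = ["cs.SE", "cs.AI", "cs.LG"]
KEYWORDS = [
    "autonomous research", "research harness", "AI scientist",
    "cross-model", "adversarial collaboration", "claim audit",
    "research wiki", "paper writing pipeline", "agent laboratory",
    "automated scientific discovery",
]

def scope_query() -> str:
    cats = " OR ".join(f"cat:{c}" for c in CATEGORIES)
    kws = " OR ".join(f'all:"{k}"' for k in KEYWORDS)
    return f"({cats}) AND ({kws})"

url = ("http://export.arxiv.org/api/query?"
       + urllib.parse.urlencode({"search_query": scope_query(),
                                 "max_results": "50"}))
with urllib.request.urlopen(url) as resp:
    atom_feed = resp.read()  # Atom XML; parse with xml.etree or feedparser
```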

Current queue snapshot

Candidate clusters

Strong candidate papers

  1. 2605.03042v1 — ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration
  2. 2504.08066v1 — The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search
  3. 2603.28589v1 — Towards a Medical AI Scientist
  4. 2507.23276v2 — How Far Are AI Scientists from Changing the World?
  5. 2511.04583v4 — Jr. AI Scientist and Its Risk Report: Autonomous Scientific Exploration from a Baseline Paper

Evidence-backed claims

| Claim | Supporting paper_id(s) | Evidence row notes | Status |
| --- | --- | --- | --- |
| Autonomous research harnesses should be evaluated as stateful systems, not just as model prompts, because storage, retrieval, artifact routing, and review context shape long-horizon research behavior. | 2605.03042v1 | ARIS frames the harness as the system logic around LLM weights and implements execution, orchestration, and assurance layers. | supported |
| The main safety failure for long-running research agents is "plausible unsupported success": claims can look coherent while outrunning raw evidence or inheriting the executor's framing. | 2605.03042v1 | The paper makes this failure mode central in the abstract, introduction, assurance section, and conclusion. | supported |
| Cross-family executor/reviewer separation is a design default in ARIS, but the paper does not provide controlled evidence that it is superior to same-family review. | 2605.03042v1 | The paper explicitly labels deployment evidence observational and proposes compute-matched controlled benchmarks as future work. | supported |
| Evidence-to-claim ledgers are a practical mechanism for making autonomous research outputs auditable across experiment, manuscript, and citation stages. | 2605.03042v1 | ARIS defines experiment-audit, result-to-claim, paper-claim-audit, citation audit, and manuscript assurance gates. | supported |
| End-to-end AI Scientist systems have crossed a visible capability threshold, including fully autonomous workshop-paper generation, but their evaluation evidence is heterogeneous and needs claim-level auditing. | 2504.08066v1 | AI Scientist-v2 reports three autonomous ICLR workshop submissions and one paper exceeding the average human acceptance threshold. | supported |
| Domain-specific autonomous research systems are emerging because generic AI Scientist workflows do not automatically satisfy clinical evidence, modality, and ethics requirements. | 2603.28589v1 | Medical AI Scientist reports clinician-engineer co-reasoning, 171 cases, 19 clinical tasks, and 6 modalities. | supported |
| Survey-level work frames AI Scientist progress as promising but bottlenecked by missing components needed for ground-breaking discovery. | 2507.23276v2 | The survey positions current achievements, missing capabilities, and ultimate goals for scientific AI. | supported |
| Baseline-paper-grounded autonomous systems may be a more realistic near-term workflow than unconstrained full automation, but they surface risks that must be tracked explicitly. | 2511.04583v4 | Jr. AI Scientist builds from real top-venue baseline papers and reports both reviewer-score improvements and risk findings. | supported |
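
To make the Status column checkable rather than merely declarative, here is a minimal sketch of one ledger row as a typed record with a validation pass. The four fields mirror the table columns; the class name, the status vocabulary beyond `supported`, and the CSV loader are assumptions for illustration, not the actual `evidence_matrix.csv` schema.

```python
# Illustrative schema for one evidence-ledger row, mirroring the table
# above. Field names and status vocabulary are assumptions, not the
# real evidence_matrix.csv schema.
import csv
from dataclasses import dataclass

VALID_STATUSES = {"supported", "needs-backfill", "unsupported"}

@dataclass
class EvidenceRow:
    claim: str
    supporting_paper_ids: str  # e.g. semicolon-separated arXiv ids
    evidence_row_notes: str
    status: str

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the row passes."""
        problems = []
        if self.status not in VALID_STATUSES:
            problems.append(f"unknown status: {self.status!r}")
        if self.status == "supported" and not self.supporting_paper_ids.strip():
            problems.append("supported claim with no supporting paper ids")
        if not self.evidence_row_notes.strip():
            problems.append("empty evidence notes")
        return problems

def load_rows(path: str) -> list[EvidenceRow]:
    # Assumes the CSV headers match the field names above exactly.
    with open(path, newline="", encoding="utf-8") as f:
        return [EvidenceRow(**row) for row in csv.DictReader(f)]
```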

System comparison scaffold

| System / paper | Workflow scope | Evidence / assurance mechanism | Reported evaluation | Local-ledger implication |
| --- | --- | --- | --- | --- |
| ARIS (2605.03042v1) | Research harness with execution, orchestration, and assurance layers | Cross-family review, research wiki, experiment-audit, result-to-claim, paper-claim-audit, citation audit | Observational overnight run; 65+ skills; 3 tested executor platforms; no controlled causal evaluation | Anchor architecture for evidence-ledger-first paper research |
| AI Scientist-v2 (2504.08066v1) | End-to-end autonomous ML paper generation | Agentic tree search, experiment manager, VLM feedback loop; abstract does not foreground claim ledgers | Three autonomous ICLR workshop submissions; one exceeded the average human acceptance threshold | Capability frontier; needs external claim audit before trusting generated papers |
| Medical AI Scientist (2603.28589v1) | Clinical autonomous research with domain conventions | Clinician-engineer co-reasoning, medical evidence grounding, ethical policies | 171 cases, 19 tasks, 6 modalities; human/LLM/Agentic Reviewer evaluations | Shows need for domain-specific evidence rows and policy fields |
| How Far Are AI Scientists (2507.23276v2) | Survey of AI Scientist achievements and bottlenecks | Survey synthesis, not a harness audit | Prospect-driven review; no primary metric in abstract | Provides taxonomy/open-problem framing |
| Jr. AI Scientist (2511.04583v4) | Baseline-paper-driven autonomous exploration | Iterative experiments from real papers; risk report | DeepReviewer, author-led, and Agents4Science evaluations | Useful near-term model: constrained automation plus explicit risk ledger |
| NORA (2605.02092v1) | Spatial data-science autonomous research agent | Needs backfill from paper queue/full text | Needs backfill | Candidate for next evidence row |
| Agent Laboratory | Human-in-the-loop AI research assistant workflow | Needs manual source backfill | Needs backfill | Compare human checkpoints against ARIS reviewer-independence |
| data-to-paper | Data-to-manuscript workflow with traceability emphasis | Needs manual source backfill | Needs backfill | Compare traceability mechanisms with evidence matrices |

Deep analysis: 2605.03042v1 ARIS

**Positioning.** ARIS is best read as a *research-harness architecture paper*, not as a model-performance paper. Its unit of contribution is the workflow substrate around LLM agents: Markdown skills, artifact contracts, reviewer routing, persistent memory, and claim audits.

**Core design thesis.** The paper adopts a deliberately conservative assumption: any long-term task performed by a single agent is unreliable. From that assumption it derives three required capabilities: persistent research state, modular execution, and independent assurance. This maps cleanly to ARIS's research wiki, single-file skill/workflow design, and cross-model assurance stack.
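
Read as interfaces, the three capabilities separate cleanly. The sketch below is a hypothetical rendering of that separation for our own notes, not ARIS's actual code; every name in it is illustrative.

```python
# Hypothetical reading of the three required capabilities as interfaces.
# All names here are illustrative, not ARIS internals.
from typing import Protocol

class ResearchState(Protocol):  # persistent research state (e.g. a wiki)
    def read(self, key: str) -> str: ...
    def write(self, key: str, content: str) -> None: ...

class Skill(Protocol):  # modular execution (single-file skills/workflows)
    name: str
    def run(self, state: ResearchState) -> str: ...

class Reviewer(Protocol):  # independent assurance (cross-model review)
    def review(self, artifact: str, state: ResearchState) -> tuple[bool, str]: ...

def step(skill: Skill, state: ResearchState, reviewer: Reviewer) -> bool:
    """One execute-then-review step: nothing enters state unreviewed."""
    artifact = skill.run(state)
    ok, notes = reviewer.review(artifact, state)
    state.write(f"review/{skill.name}", notes)
    if ok:
        state.write(f"artifact/{skill.name}", artifact)
    return ok
```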

**Most useful mechanism for our system.** The strongest transferable piece is the evidence-to-claim cascade: first audit experiment/evaluation integrity, then map results to explicit claim verdicts, then have a fresh reviewer check manuscript claims against the claim ledger and raw evidence. Our `evidence_matrix.csv` and `discipline_claim_index.csv` should be treated as this category's lightweight version of the same idea.
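
A minimal sketch of the first stage of that cascade as an offline audit over our two CSVs. The column names (`claim_id`, `status`) are assumptions for illustration and may differ from the real headers; the later stages (result-to-claim verdicts, manuscript checks) would extend the same loop.

```python
# Sketch: stage one of the evidence-to-claim cascade as an offline
# audit over the category CSVs. Column names are assumed; adjust to
# the real headers of evidence_matrix.csv / discipline_claim_index.csv.
import csv

def read_csv(path: str) -> list[dict]:
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

def audit(evidence_path: str = "evidence_matrix.csv",
          claims_path: str = "discipline_claim_index.csv") -> list[str]:
    evidence_ids = {row["claim_id"] for row in read_csv(evidence_path)}
    problems = []
    for row in read_csv(claims_path):
        # A claim marked supported must point at an evidence row;
        # verdict and manuscript-level checks would be added here.
        if row["status"] == "supported" and row["claim_id"] not in evidence_ids:
            problems.append(f"claim {row['claim_id']} has no evidence row")
    return problems

if __name__ == "__main__":
    for p in audit():
        print("AUDIT:", p)
```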

**What not to overclaim.** The paper's empirical support is observational. It reports one overnight run where reviewer score improved from 5.0 to 7.5/10 over about eight hours with four review-revise rounds and more than 20 GPU experiments, but it explicitly says this is not causal evidence that cross-family review is superior.

**Integration consequence.** This category should become the anchor for papers on AI Scientist-style systems, agent laboratories, research-harness engineering, and autonomous paper-writing systems. ARIS should be cited for the assurance/architecture pattern; AI Scientist-v2 and domain-specific systems should be used to compare search strategy, benchmark protocol, and domain outcomes.

Open questions

Next actions

  1. Fill evidence rows for AI Scientist-v2, NORA, Agent Laboratory, data-to-paper, and AutoResearchClaw-style papers.
  2. Add a category-level comparison table: workflow scope, persistent memory, cross-family review, audit stack, artifact contracts, and controlled evaluation.
  3. Run `compose-discipline --discipline cs-ai` once at least three category briefs have evidence-backed claims (a minimal precondition check is sketched below).
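
The precondition in item 3 can be checked mechanically. The sketch below assumes a hypothetical directory layout (`briefs/cs-ai/<category>/evidence_matrix.csv`) and the same assumed `status` column as above; only the threshold of three briefs comes from this draft.

```python
# Hypothetical precondition check before running
# `compose-discipline --discipline cs-ai`: count category briefs with
# at least one supported claim. Directory layout and column names are
# assumptions for this sketch.
import csv
from pathlib import Path

def briefs_with_supported_claims(discipline_dir: str = "briefs/cs-ai") -> int:
    ready = 0
    for matrix in Path(discipline_dir).glob("*/evidence_matrix.csv"):
        with matrix.open(newline="", encoding="utf-8") as f:
            if any(row.get("status") == "supported"
                   for row in csv.DictReader(f)):
                ready += 1
    return ready

if __name__ == "__main__":
    n = briefs_with_supported_claims()
    state = "ready" if n >= 3 else "not ready"
    print(f"{n} category briefs have evidence-backed claims "
          f"({state} for compose-discipline)")
```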