Evidence-ledger draft

Evidence Ledgers for Autonomous Research Harnesses

A public release of an evidence-ledger draft for autonomous research harnesses, including paper text, claim ledger, and auditable source artifacts.

Paper draft

Readable HTML version of the generated Markdown draft.

Read paper

Claim ledger

5 claims traced to source paper IDs and evidence rows.

Inspect claims

Audit status

Pass: 8 supported claims, 5 filled evidence rows.

Download evidence CSV

Release notes

Evidence-backed papers

| Paper | Claim | Notes |
| --- | --- | --- |
| 2605.03042v1 | Long-horizon single-agent research is unreliable by default; the central failure mode is plausible unsupported success, mitigated through persistent state, modular execution, and independent assurance via cross-family executor/reviewer separation. | Full text read from arXiv PDF. Integration implication: our harness should keep evidence_matrix/claim ledger as first-class artifacts, add research-harnesses taxonomy, and keep reviewer-independent audits as hard invariants. |
| 2504.08066v1 | AI Scientist-v2 reports an end-to-end agentic system that formulates hypotheses, designs and executes experiments, analyzes and visualizes data, and writes manuscripts; compared with v1 it removes reliance on human-authored code templates and uses progressive agentic tree search. | Abstract-derived evidence row; needs full-text audit before final submission. |
| 2603.28589v1 | Clinical autonomous research needs domain-specific evidence grounding; Medical AI Scientist transforms surveyed literature into actionable evidence and uses clinician-engineer co-reasoning to improve traceability of generated ideas. | Abstract-derived evidence row; full text required before citing numerical comparisons. |
| 2507.23276v2 | Current AI Scientist systems have visible achievements, but the field still needs clarity on bottlenecks and critical components before scientific agents can produce ground-breaking discoveries that solve grand challenges. | Abstract-derived evidence row; needs full text for taxonomy details. |
| 2511.04583v4 | Jr. AI Scientist follows a novice-student-like workflow from a baseline paper: analyze limitations, formulate hypotheses, iterate experiments until improvements, and write a result paper; the work also reports risks and limitations of current AI Scientist systems. | Abstract-derived evidence row; useful for risk and workflow comparison. |
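
The rows above suggest a simple mechanical check: each evidence row should carry a paper ID and a claim, and rows whose notes begin with "Abstract-derived" are still waiting on a full-text read. The following is a minimal sketch of such a check, not the harness's actual audit. The column names (`paper`, `claim`, `notes`), the `EvidenceRow` type, the `audit` function, and the pass rule (every row filled) are all assumptions made for illustration; the real evidence CSV may use a different schema and stricter criteria.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class EvidenceRow:
    """One ledger row. Field names are hypothetical, mirroring the table headers."""
    paper_id: str
    claim: str
    notes: str

    @property
    def filled(self) -> bool:
        # Assumed rule: a row counts as filled when it names a paper and a claim.
        return bool(self.paper_id.strip() and self.claim.strip())

    @property
    def needs_full_text(self) -> bool:
        # Assumed convention: notes starting with "Abstract-derived" mark
        # rows that have not yet been checked against the full paper.
        return self.notes.strip().lower().startswith("abstract-derived")


def load_rows(csv_text: str) -> list[EvidenceRow]:
    """Parse CSV text with assumed columns: paper, claim, notes."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [EvidenceRow(r["paper"], r["claim"], r["notes"]) for r in reader]


def audit(rows: list[EvidenceRow]) -> dict:
    """Summarize the ledger: fail if any row is unfilled, list pending rows."""
    unfilled = [r.paper_id for r in rows if not r.filled]
    pending = [r.paper_id for r in rows if r.filled and r.needs_full_text]
    return {
        "status": "pass" if not unfilled else "fail",
        "filled": len(rows) - len(unfilled),
        "pending_full_text": pending,
    }


# Two rows abbreviated from the table above, used only as sample input.
SAMPLE_CSV = """\
paper,claim,notes
2605.03042v1,Long-horizon single-agent research is unreliable by default,Full text read from arXiv PDF.
2504.08066v1,AI Scientist-v2 uses progressive agentic tree search,Abstract-derived evidence row; needs full-text audit before final submission.
"""

if __name__ == "__main__":
    print(audit(load_rows(SAMPLE_CSV)))
```

Keeping the pending list separate from the pass/fail status lets a release pass while still showing which claims rest on abstracts alone, which matches the caveats recorded in the Notes column.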