{"authors": ["Qiujie Xie", "Yixuan Weng", "Minjun Zhu", "Fuchen Shen", "Shulin Huang", "Zhen Lin", "Jiahui Zhou", "Zilan Mao", "Zijie Yang", "Linyi Yang", "Jian Wu", "Yue Zhang"], "categories": ["cs.AI"], "fetched_at": "2026-05-08T00:46:03.084820+00:00", "paper_id": "2507.23276v2", "pdf_url": "https://arxiv.org/pdf/2507.23276v2", "primary_category": "cs.AI", "published": "2025-07-31T06:32:06Z", "query_category": "research-harnesses", "query_discipline": "cs-ai", "source": "arxiv", "summary": "The emergence of large language models (LLMs) is propelling automated scientific discovery to the next level, with LLM-based Artificial Intelligence (AI) Scientist systems now taking the lead in scientific research. Several influential works have already appeared in the field of AI Scientist systems, with AI-generated research papers having been accepted at the ICLR 2025 workshop, suggesting that a human-level AI Scientist, capable of uncovering phenomena previously unknown to humans, may soon become a reality. In this survey, we focus on the central question: How far are AI scientists from changing the world and reshaping the scientific research paradigm? To answer this question, we provide a prospect-driven review that comprehensively analyzes the current achievements of AI Scientist systems, identifying key bottlenecks and the critical components required for the emergence of a scientific agent capable of producing ground-breaking discoveries that solve grand challenges. We hope this survey will contribute to a clearer understanding of the limitations of current AI Scientist systems, showing where we are, what is missing, and what the ultimate goals for scientific AI should be.", "title": "How Far Are AI Scientists from Changing the World?", "updated": "2025-08-01T12:49:36Z", "url": "https://arxiv.org/abs/2507.23276v2"}
{"authors": ["Hongtao Wu", "Boyun Zheng", "Dingjie Song", "Yu Jiang", "Jianfeng Gao", "Lei Xing", "Lichao Sun", "Yixuan Yuan"], "categories": ["cs.AI", "cs.LG"], "fetched_at": "2026-05-08T00:46:03.085173+00:00", "paper_id": "2603.28589v1", "pdf_url": "https://arxiv.org/pdf/2603.28589v1", "primary_category": "cs.AI", "published": "2026-03-30T15:37:25Z", "query_category": "research-harnesses", "query_discipline": "cs-ai", "source": "arxiv", "summary": "Autonomous systems that generate scientific hypotheses, conduct experiments, and draft manuscripts have recently emerged as a promising paradigm for accelerating discovery. However, existing AI Scientists remain largely domain-agnostic, limiting their applicability to clinical medicine, where research must be grounded in medical evidence with specialized data modalities. In this work, we introduce Medical AI Scientist, the first autonomous research framework tailored to clinical research. It enables clinically grounded ideation by transforming extensively surveyed literature into actionable evidence through a clinician-engineer co-reasoning mechanism, which improves the traceability of generated research ideas. It further facilitates evidence-grounded manuscript drafting guided by structured medical compositional conventions and ethical policies. The framework operates under 3 research modes, namely paper-based reproduction, literature-inspired innovation, and task-driven exploration, each corresponding to a distinct level of automated scientific inquiry with progressively increasing autonomy. Comprehensive evaluations by both large language models and human experts demonstrate that the ideas generated by the Medical AI Scientist are of substantially higher quality than those produced by commercial LLMs across 171 cases, 19 clinical tasks, and 6 data modalities. Meanwhile, our system achieves strong alignment between the proposed method and its implementation, while also demonstrating significantly higher success rates in executable experiments. Double-blind evaluations by human experts and the Stanford Agentic Reviewer suggest that the generated manuscripts approach MICCAI-level quality, while consistently surpassing those from ISBI and BIBM. The proposed Medical AI Scientist highlights the potential of leveraging AI for autonomous scientific discovery in healthcare.", "title": "Towards a Medical AI Scientist", "updated": "2026-03-30T15:37:25Z", "url": "https://arxiv.org/abs/2603.28589v1"}
{"authors": ["Ruofeng Yang", "Yongcan Li", "Shuai Li"], "categories": ["cs.SE", "cs.AI"], "fetched_at": "2026-05-08T00:46:03.085434+00:00", "paper_id": "2605.03042v1", "pdf_url": "https://arxiv.org/pdf/2605.03042v1", "primary_category": "cs.SE", "published": "2026-05-04T18:10:15Z", "query_category": "research-harnesses", "query_discipline": "cs-ai", "source": "arxiv", "summary": "This report describes ARIS (Auto-Research-in-sleep), an open-source research harness for autonomous research, including its architecture, assurance mechanisms, and early deployment experience. The performance of agent systems built on LLMs depends on both the model weights and the harness around them, which governs what information to store, retrieve, and present to the model. For long-horizon research workflows, the central failure mode is not a visible breakdown but a plausible unsupported success: a long-running agent can produce claims whose evidential support is incomplete, misreported, or silently inherited from the executor's framing. Therefore, we present ARIS as a research harness that coordinates machine-learning research workflows through cross-model adversarial collaboration as a default configuration: an executor model drives forward progress while a reviewer from a different model family is recommended to critique intermediate artifacts and request revisions. ARIS has three architectural layers. The execution layer provides more than 65 reusable Markdown-defined skills, model integrations via MCP, a persistent research wiki for iterative reuse of prior findings, and deterministic figure generation. The orchestration layer coordinates five end-to-end workflows with adjustable effort settings and configurable routing to reviewer models. The assurance layer includes a three-stage process for checking whether experimental claims are supported by evidence: integrity verification, result-to-claim mapping, and claim auditing that cross-checks manuscript statements against the claim ledger and raw evidence, as well as a five-pass scientific-editing pipeline, mathematical-proof checks, and visual inspection of the rendered PDF. A prototype self-improvement loop records research traces and proposes harness improvements that are adopted only after reviewer approval.", "title": "ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration", "updated": "2026-05-04T18:10:15Z", "url": "https://arxiv.org/abs/2605.03042v1"}
{"authors": ["Minjun Zhu", "Qiujie Xie", "Yixuan Weng", "Jian Wu", "Zhen Lin", "Linyi Yang", "Yue Zhang"], "categories": ["cs.AI", "cs.CL", "cs.LG"], "fetched_at": "2026-05-08T00:46:03.085682+00:00", "paper_id": "2506.01372v2", "pdf_url": "https://arxiv.org/pdf/2506.01372v2", "primary_category": "cs.AI", "published": "2025-06-02T06:59:10Z", "query_category": "research-harnesses", "query_discipline": "cs-ai", "source": "arxiv", "summary": "The emergence of the Artificial Intelligence (AI) Scientist represents a paradigm shift in scientific discovery, with large language models (LLMs) taking the lead as the primary executor in the entire scientific workflow from idea generation to experiment implementation. Recent AI Scientist studies demonstrate sufficient capabilities for independent scientific discovery, with the generated research reports gaining acceptance at the ICLR 2025 workshop and ACL 2025, arguing that a human-level AI Scientist, capable of uncovering phenomena previously unknown to humans, may be imminent. Despite this substantial progress, the AI Scientist has yet to produce a groundbreaking achievement in the domain of computer science on par with automated scientific tools. Based on extensive quantitative evidence from existing benchmarks in complex engineering tasks and a systematic evaluation assessing 28 research papers generated by five advanced AI Scientist systems, we argue that the fundamental bottleneck for AI Scientists lies in their capability to execute the requisite verification procedures. Current AI Scientist systems lack the execution capabilities needed to carry out rigorous experiments and produce high-quality scientific papers. To better illustrate the root cause of this implementation gap, we provide an in-depth discussion on the fundamental limitations of the AI Scientist. This position paper calls on the community to bridge the implementation gap.", "title": "AI Scientists Fail Without Strong Implementation Capability", "updated": "2025-06-09T09:01:24Z", "url": "https://arxiv.org/abs/2506.01372v2"}
{"authors": ["Atsuyuki Miyai", "Mashiro Toyooka", "Takashi Otonari", "Zaiying Zhao", "Kiyoharu Aizawa"], "categories": ["cs.AI", "cs.CL", "cs.CV", "cs.LG"], "fetched_at": "2026-05-08T00:46:03.085936+00:00", "paper_id": "2511.04583v4", "pdf_url": "https://arxiv.org/pdf/2511.04583v4", "primary_category": "cs.AI", "published": "2025-11-06T17:37:49Z", "query_category": "research-harnesses", "query_discipline": "cs-ai", "source": "arxiv", "summary": "Understanding the current capabilities and risks of AI Scientist systems (autoresearch) is essential for ensuring trustworthy and sustainable AI-driven scientific progress while preserving the integrity of the academic ecosystem. To this end, we develop Jr. AI Scientist, a state-of-the-art autonomous AI scientist system that mimics the core research workflow of a novice student researcher: Given the baseline paper from the human mentor, it analyzes its limitations, formulates novel hypotheses for improvement, iteratively experiments until improvements are achieved, and writes a paper with the results. Unlike previous approaches that assume full automation or operate on small-scale code, Jr. AI Scientist follows a well-defined research workflow and leverages modern coding agents to handle complex, multi-file implementations, leading to scientifically valuable contributions. Through our experiments, the Jr. AI Scientist successfully generated new research papers that build upon real NeurIPS, IJCV, and ICLR works by proposing and implementing novel methods. For evaluation, we conducted automated assessments using AI Reviewers, author-led evaluations, and submissions to Agents4Science, a venue dedicated to AI-driven contributions. The findings demonstrate that Jr. AI Scientist generates papers receiving higher review scores by DeepReviewer than existing fully automated systems. Nevertheless, we identify important limitations from the author evaluation and the Agents4Science reviews, indicating the potential risks of directly applying current AI Scientist systems and key challenges for future research. Finally, we comprehensively report various risks identified during development. We believe this study clarifies the current role and limitations of AI Scientist systems, offering insights into the areas that still require human expertise and the risks that may emerge as these systems evolve.", "title": "Jr. AI Scientist and Its Risk Report: Autonomous Scientific Exploration from a Baseline Paper", "updated": "2026-03-11T19:48:08Z", "url": "https://arxiv.org/abs/2511.04583v4"}
{"authors": ["Xiaoxin Yin"], "categories": ["cs.AI"], "fetched_at": "2026-05-08T00:46:03.086055+00:00", "paper_id": "2405.13352v1", "pdf_url": "https://arxiv.org/pdf/2405.13352v1", "primary_category": "cs.AI", "published": "2024-05-22T05:14:27Z", "query_category": "research-harnesses", "query_discipline": "cs-ai", "source": "arxiv", "summary": "While LLMs have shown impressive capabilities in solving math or coding problems, the ability to make scientific discoveries remains a distinct challenge. This paper proposes a \"Turing test for an AI scientist\" to assess whether an AI agent can conduct scientific research independently, without relying on human-generated knowledge. Drawing inspiration from the historical development of science, we propose seven benchmark tests that evaluate an AI agent's ability to make groundbreaking discoveries in various scientific domains. These tests include inferring the heliocentric model from celestial observations, discovering the laws of motion in a simulated environment, deriving the differential equation governing vibrating strings, inferring Maxwell's equations from electrodynamics simulations, inventing numerical methods for initial value problems, discovering Huffman coding for data compression, and developing efficient sorting algorithms. To ensure the validity of these tests, the AI agent is provided with interactive libraries or datasets specific to each problem, without access to human knowledge that could potentially contain information about the target discoveries. The ultimate goal is to create an AI scientist capable of making novel and impactful scientific discoveries, surpassing the best human experts in their respective fields. These \"Turing tests\" serve as intermediate milestones, assessing the AI agent's ability to make discoveries that were groundbreaking in their time. If an AI agent can pass the majority of these seven tests, it would indicate significant progress towards building an AI scientist, paving the way for future advancements in autonomous scientific discovery. This paper aims to establish a benchmark for the capabilities of AI in scientific research and to stimulate further research in this exciting field.", "title": "\"Turing Tests\" For An AI Scientist", "updated": "2024-05-22T05:14:27Z", "url": "https://arxiv.org/abs/2405.13352v1"}
{"authors": ["Yutaro Yamada", "Robert Tjarko Lange", "Cong Lu", "Shengran Hu", "Chris Lu", "Jakob Foerster", "Jeff Clune", "David Ha"], "categories": ["cs.AI", "cs.CL", "cs.LG"], "fetched_at": "2026-05-08T00:46:03.086196+00:00", "paper_id": "2504.08066v1", "pdf_url": "https://arxiv.org/pdf/2504.08066v1", "primary_category": "cs.AI", "published": "2025-04-10T18:44:41Z", "query_category": "research-harnesses", "query_discipline": "cs-ai", "source": "arxiv", "summary": "AI is increasingly playing a pivotal role in transforming how scientific discoveries are made. We introduce The AI Scientist-v2, an end-to-end agentic system capable of producing the first entirely AI-generated peer-review-accepted workshop paper. This system iteratively formulates scientific hypotheses, designs and executes experiments, analyzes and visualizes data, and autonomously authors scientific manuscripts. Compared to its predecessor (v1, Lu et al., 2024 arXiv:2408.06292), The AI Scientist-v2 eliminates the reliance on human-authored code templates, generalizes effectively across diverse machine learning domains, and leverages a novel progressive agentic tree-search methodology managed by a dedicated experiment manager agent. Additionally, we enhance the AI reviewer component by integrating a Vision-Language Model (VLM) feedback loop for iterative refinement of the content and aesthetics of the figures. We evaluated The AI Scientist-v2 by submitting three fully autonomous manuscripts to a peer-reviewed ICLR workshop. Notably, one manuscript achieved scores high enough to exceed the average human acceptance threshold, marking the first instance of a fully AI-generated paper successfully navigating peer review. This accomplishment highlights the growing capability of AI in conducting all aspects of scientific research. We anticipate that further advancements in autonomous scientific discovery technologies will profoundly impact human knowledge generation, enabling unprecedented scalability in research productivity and significantly accelerating scientific breakthroughs, greatly benefiting society at large. We have open-sourced the code at https://github.com/SakanaAI/AI-Scientist-v2 to foster the future development of this transformative technology. We also discuss the role of AI in science, including AI safety.", "title": "The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search", "updated": "2025-04-10T18:44:41Z", "url": "https://arxiv.org/abs/2504.08066v1"}
{"authors": ["Ziming Luo", "Atoosa Kasirzadeh", "Nihar B. Shah"], "categories": ["cs.AI", "cs.DL"], "fetched_at": "2026-05-08T00:46:03.086301+00:00", "paper_id": "2509.08713v2", "pdf_url": "https://arxiv.org/pdf/2509.08713v2", "primary_category": "cs.AI", "published": "2025-09-10T16:04:24Z", "query_category": "research-harnesses", "query_discipline": "cs-ai", "source": "arxiv", "summary": "AI scientist systems, capable of autonomously executing the full research workflow from hypothesis generation and experimentation to paper writing, hold significant potential for accelerating scientific discovery. However, the internal workflows of these systems have not been closely examined. This lack of scrutiny poses a risk of introducing flaws that could undermine the integrity, reliability, and trustworthiness of their research outputs. In this paper, we identify four potential failure modes in contemporary AI scientist systems: inappropriate benchmark selection, data leakage, metric misuse, and post-hoc selection bias. To examine these risks, we design controlled experiments that isolate each failure mode while addressing challenges unique to evaluating AI scientist systems. Our assessment of two prominent open-source AI scientist systems reveals the presence of several failures, across a spectrum of severity, which can be easily overlooked in practice. Finally, we demonstrate that access to trace logs and code from the full automated workflow enables far more effective detection of such failures than examining the final paper alone. We thus recommend that journals and conferences evaluating AI-generated research mandate submission of these artifacts alongside the paper to ensure transparency, accountability, and reproducibility.", "title": "The More You Automate, the Less You See: Hidden Pitfalls of AI Scientist Systems", "updated": "2025-12-20T05:26:07Z", "url": "https://arxiv.org/abs/2509.08713v2"}
