Autonomous Research Infrastructure
Describe your goal in plain text. ARI runs the experiments, writes the paper, and verifies the results — entirely on its own. Currently supports digital computation experiments (HPC, ML, systems).
Demo
Watch a live walkthrough of the ARI web dashboard, then download the sample paper that ARI generated end-to-end — a 10-page Stratum-Roofline CSR-SpMM study on Fujitsu A64FX/SVE-512.
Philosophy
Today, running experiments, reviewing prior work, writing papers, and verifying findings each require separate expertise and enormous time. ARI removes those barriers.
Whether you are a student with a laptop and a free local AI, or a researcher with access to a supercomputer and the latest cloud models — ARI works the same way. You describe the goal. ARI does the rest.
Computation is only the beginning. ARI is architecturally designed to grow beyond software — into robotics, sensors, and laboratory equipment. This is not yet implemented, but the plugin architecture exists precisely for this purpose.
Inspired By · Prior Work
ARI was built on lessons from these pioneering works.
Sakana AI's fully autonomous scientific research system — from idea generation to peer-review-ready papers. A foundational reference for ARI's end-to-end pipeline design.
arxiv.org/abs/2504.08066 →
An LLM-driven framework for automated HPC performance optimization. Demonstrated autonomous compiler flag and thread-count tuning on real supercomputer workloads, a direct predecessor to ARI's search engine.
researchgate.net/publication/403797672 →
A multi-agent scientific deliberation system. Multiple AI personas with different research backgrounds debate a hypothesis, producing richer and more diverse research ideas than single-agent generation. Integrated into ARI's idea generation stage.
arxiv.org/abs/2410.09403 →
OpenAI's benchmark for evaluating whether AI agents can reproduce frontier ML papers from scratch. Each paper is decomposed into a fine-grained TaskNode rubric scored by an LLM judge ("SimpleJudge").
v0.7.0 vendors PaperBench under ari-skill-paper-re/vendor/paperbench as the deterministic core of ARI's reproducibility check (ORS Phase 2).
v0.7.2 adds paper-audit mode: the same rubric machinery inverted to audit whether a paper itself describes enough to be reproducible, with venue-conditioned templates (sc / neurips / nature), multimodal figure inspection, and a paper_audit prompt patch that breaks the structural ceiling on Result Analysis leaves.
v0.8.0 ships the 3-stage bridge contract (rollout_submission / reproduce_submission / judge_submission) honouring vendor protocol with container_image end-to-end plumbing, fail-loud sandbox/GPU preconditions, salvage retries, executed-submission tarballs, code_only Stage 1↔3 consistency, agent.env injection, host-FS source-removal guard, and env-truth guardrails (probe-before-scaffold, language-choice counter-prime, host-truthful ADDITIONAL NOTES with auto-detected binaries / GPU / network / Phase-2 isolation).
Hierarchical memory + tool-mediated paging that lets an LLM agent operate as if it had unbounded context. The MemGPT paper became Letta, which v0.6.0 adopted as ARI's memory backend, replacing the v0.5.x JSONL store with ancestor-scoped archival memory and a portable per-checkpoint snapshot.
arxiv.org/abs/2310.08560 →
How It Works
Write a short description of what you want to optimize. ARI takes it from there — automatically, without any human intervention.
Write your research goal in plain Markdown. Even 3 lines is enough.
ARI searches arXiv and Semantic Scholar. VirSci multi-agent deliberation then has multiple AI personas with different research backgrounds debate the results to produce diverse, well-grounded ideas.
Best-First Tree Search (BFTS) with five node types — DRAFT, IMPROVE, DEBUG, ABLATION, VALIDATION — explores the hypothesis space in parallel. The LLM selects which frontier node to expand next based on peer-review scores, not heuristics. Failed nodes generate debug children, not retries. All generated code is captured per node.
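As a rough illustration of the search loop described above, here is a minimal Python sketch. It is not ARI's implementation: the `Node` fields, the `run` / `expand` / `debug` callbacks, and the toy functions are all hypothetical, the loop is sequential rather than parallel, and a priority queue keyed on the parent's score stands in for the LLM's choice of which frontier node to expand.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Optional


class NodeType(Enum):
    DRAFT = "draft"
    IMPROVE = "improve"
    DEBUG = "debug"
    ABLATION = "ablation"
    VALIDATION = "validation"


@dataclass(eq=False)  # identity equality avoids recursing through parent links
class Node:
    kind: NodeType
    code: str                      # generated code is kept on every node
    score: float = 0.0             # stand-in for a peer-review score
    failed: bool = False
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list)


def best_first_search(
    root: Node,
    run: Callable[[Node], None],           # executes a node, sets score/failed
    expand: Callable[[Node], List[Node]],  # proposes children of a success
    debug: Callable[[Node], Node],         # proposes a fix for a failure
    budget: int,
) -> List[Node]:
    """Run up to `budget` nodes, always taking the best-ranked frontier node."""
    tiebreak = itertools.count()           # unique, so nodes are never compared
    frontier = [(-root.score, next(tiebreak), root)]
    visited: List[Node] = []
    while frontier and len(visited) < budget:
        _, _, node = heapq.heappop(frontier)
        run(node)
        visited.append(node)
        # A failed node gets a DEBUG child instead of being retried.
        children = [debug(node)] if node.failed else expand(node)
        for child in children:
            child.parent = node
            node.children.append(child)
            # Unevaluated children are ranked by their parent's score.
            heapq.heappush(frontier, (-node.score, next(tiebreak), child))
    return visited


# Toy callbacks (hypothetical) so the sketch can be run end to end.
def toy_run(node: Node) -> None:
    node.failed = "bug" in node.code
    node.score = 0.0 if node.failed else float(len(node.code))


def toy_expand(node: Node) -> List[Node]:
    if node.kind is NodeType.DRAFT:
        return [Node(NodeType.ABLATION, code="bug")]
    return []


def toy_debug(node: Node) -> Node:
    return Node(NodeType.DEBUG, code=node.code.replace("bug", "fix"))
```

Calling `best_first_search(Node(NodeType.DRAFT, "x"), toy_run, toy_expand, toy_debug, budget=5)` visits the DRAFT node, then a failing ABLATION node, then the DEBUG child spawned from that failure.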
A complete academic paper with figures and citations is generated automatically.
ARI traverses the full experiment tree — root, improvements, ablations, and validation runs — and uses an LLM to extract methodology, setup, and key findings directly from raw artifacts. Nothing is hardcoded.
A separate AI agent independently reconstructs and re-runs the experiment from the paper text alone — no original code shared.
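The Prior Work section notes that this check is built on PaperBench, where a paper is decomposed into a rubric tree whose leaves are graded by an LLM judge. A minimal sketch of how such a tree can be rolled up into one score is shown below, assuming pass/fail leaves and weighted averaging at each parent. The `RubricNode` class, its field names, and the example requirements are hypothetical and are not taken from ARI or the vendored PaperBench code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RubricNode:
    requirement: str
    weight: float = 1.0
    passed: Optional[bool] = None          # judge verdict, set on leaves only
    children: List["RubricNode"] = field(default_factory=list)


def rubric_score(node: RubricNode) -> float:
    """Leaf: 1.0 if passed, else 0.0. Parent: weighted mean of its children."""
    if not node.children:
        return 1.0 if node.passed else 0.0
    total_weight = sum(child.weight for child in node.children)
    if total_weight == 0:
        return 0.0
    weighted = sum(child.weight * rubric_score(child) for child in node.children)
    return weighted / total_weight


# Hypothetical rubric: one passing leaf, plus a heavier branch that is half met.
example = RubricNode(
    "Paper is reproducible",
    children=[
        RubricNode("Code executes", weight=1.0, passed=True),
        RubricNode(
            "Results match",
            weight=3.0,
            children=[
                RubricNode("Main table reproduced", passed=True),
                RubricNode("Ablation reproduced", passed=False),
            ],
        ),
    ],
)
```

For this example, `rubric_score(example)` evaluates to 0.625: the "Results match" branch scores 0.5 and carries three times the weight of the passing leaf.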
The web dashboard is a 10-page React/TypeScript single-page app built with Vite, with a separate module for each page. From your browser it provides: an Overleaf-like LaTeX editor; a React Flow visual workflow editor with BFTS / Paper / Reproduce phase toggles per skill; a D3 experiment tree; a VLM figure review loop; rubric-driven paper review with ensemble and Area Chair meta-review plus a few-shot example manager; container runtime management; Letta-backed memory admin; recursive sub-experiments; and a 4-step experiment wizard. All output is isolated per project.
▶ See the live walkthrough in the Demo section ↑
Universal
ARI is not built for one environment or one type of user. It scales across five dimensions.
Environment profiles (laptop / hpc / cloud) auto-detect your scheduler (SLURM, PBS, LSF, SGE, Kubernetes) and configure parallelism, memory, and container runtime automatically.
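How the auto-detection works is not described here. One plausible approach, sketched below, is to look for each scheduler's command-line tools on `PATH`. The probe order, the choice of probe commands, and the mapping from scheduler to profile are assumptions, not ARI's actual logic.

```python
import shutil
from typing import Callable, Optional

# (scheduler, command whose presence suggests it). PBS and SGE both ship
# `qsub`, so a tool specific to each is probed instead.
PROBES = [
    ("slurm", "sbatch"),
    ("pbs", "pbsnodes"),
    ("lsf", "bsub"),
    ("sge", "qconf"),
    ("kubernetes", "kubectl"),
]


def detect_scheduler(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the first scheduler whose probe command is found, else None."""
    for scheduler, command in PROBES:
        if which(command) is not None:
            return scheduler
    return None


def pick_profile(scheduler: Optional[str]) -> str:
    """Map a detected scheduler to one of the three profile names (assumed)."""
    if scheduler is None:
        return "laptop"
    return "cloud" if scheduler == "kubernetes" else "hpc"
```

Passing `which` as a parameter keeps the probe testable without a real cluster: a fake lookup that only knows `bsub` makes `detect_scheduler` return `"lsf"`.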
Free local models (Ollama) or commercial APIs (Claude, GPT, Gemini) via litellm. Per-experiment model selection in the dashboard wizard — Ollama models support free-form name entry.
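As a small illustration of how one setting could switch between local and commercial models, the sketch below maps a backend name to the `provider/model` strings that litellm accepts. The `ARI_BACKEND` variable appears in the Quick Start; the backend names other than `claude`, the `ollama` default, and the function itself are assumptions for illustration.

```python
import os
from typing import Mapping

# litellm routes requests by a "provider/" prefix on the model name.
# The keys on the left are hypothetical backend names.
PROVIDER_PREFIX = {
    "ollama": "ollama",
    "claude": "anthropic",
    "gpt": "openai",
    "gemini": "gemini",
}


def resolve_model(name: str, env: Mapping[str, str] = os.environ) -> str:
    """Build a litellm-style model string from ARI_BACKEND and a model name."""
    backend = env.get("ARI_BACKEND", "ollama").lower()
    if backend not in PROVIDER_PREFIX:
        raise ValueError(f"unknown backend: {backend!r}")
    # The name is passed through untouched, which allows free-form entry.
    return f"{PROVIDER_PREFIX[backend]}/{name}"
```

With `ARI_BACKEND` unset, `resolve_model("qwen3:8b")` returns `"ollama/qwen3:8b"`, which can be passed as the `model` argument of `litellm.completion`.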
Write 3 lines as a beginner, or 200 lines with precise technical controls as an expert.
Add new capabilities as plug-in modules. No changes to the core system needed.
Papers can target arXiv, NeurIPS, ICPP, SC, ISC, ACM, or any custom venue template.
Vision
The current version automates digital computation experiments. Physical world integration is on the roadmap — not yet implemented.
Compiler tuning, algorithm benchmarking, ML hyperparameters, systems performance.
Robot arm trajectories, motion planning parameters, control system tuning via ROS2. Planned, not yet implemented.
Liquid handlers, plate readers, reaction conditions — planned, not yet implemented.
If it has a goal and a parameter space, ARI can explore it. The same system, infinite domains.
Quick Start
# 1. Clone the repository
git clone https://github.com/kotama7/ARI
cd ARI

# 2. Install everything
bash setup.sh

# 3. Choose your AI model
# Free option (runs offline):
ollama pull qwen3:8b
# Or use Claude / OpenAI:
export ARI_BACKEND=claude
export ANTHROPIC_API_KEY=sk-...

# 4. Run your first experiment
ari run experiment.md

# 5. Launch the web dashboard
ari viz ./checkpoints/my_run/   # → http://localhost:8765

# Other useful commands
ari projects         # list all runs
ari resume ./ckpt/   # resume interrupted run
ari settings         # view/modify config
The simplest experiment file looks like this:
# experiment.md: free-form text is fine
# Headings are optional; ARI reads any format.
Maximize performance of matrix multiplication on this machine.

# Or use structured Markdown (optional):
## Research Goal
Maximize the target metric for my experiment on this machine.

## Evaluation Metric
Primary score

## Constraints
- Describe your environment here
ARI will survey related papers, run experiments, generate figures, write a complete paper, and verify reproducibility independently — automatically. A unique LLM-generated title is assigned to each experiment project.
Community
ARI is open source and welcomes contributions of all kinds — new skills, pipeline ideas, bug reports, and more.
Have a domain ARI doesn't cover yet? Wrap it as an MCP skill and submit a PR. Any benchmark, any tool.
Have an idea for a new research pipeline or workflow step? Open a discussion and shape ARI's direction.
Found a bug or a rough edge? Issues and PRs are always welcome. No contribution is too small.