ARI Documentation
Technical reference for developers and researchers.
Overview
ARI automates the full research cycle: hypothesis generation → experiment execution → paper writing → reproducibility verification. The system is built on three layers:
idea-skill, transform-skill, plot-skill, paper-skill, paper-re-skill, vlm-skill, web-skill (partial), memory-skill (Letta embedding retrieval, △).
Installation
- Clone the repository
git clone https://github.com/kotama7/ARI && cd ARI
- Run setup
bash setup.sh
- Install LaTeX
# With conda (no sudo needed)
conda install -c conda-forge texlive-core
# Or system package
sudo apt install texlive-full # Debian/Ubuntu
sudo dnf install texlive # RHEL/CentOS
- Set LLM backend: see LLM Configuration
First Run
# Minimal experiment file
cat > experiment.md << 'EOF'
## Research Goal
Maximize the target metric for my experiment on this machine.
<!-- metric_keyword: score -->
EOF
# Run
ari run experiment.md
# With custom config
ari run experiment.md --config ari-core/config/workflow.yaml
# On a SLURM cluster
sbatch your_pipeline_job.sh
Output files appear in the checkpoint directory:
| File | Description |
|---|---|
| nodes_tree.json | BFTS search tree (all explored configurations) |
| science_data.json | Science-facing data (no internal terms). v0.7.0 adds a typed split on each configurations[*]: parameters (input knobs, never a headline result) vs. measurements (measured outputs) vs. predictions (model ceilings) vs. scores (derived ratios), sourced from the coding-skill emit_results contract or the LLM-evaluator's typed split. summary_stats reduces direction-aware over a known primary_metric instead of max() over every key. |
| related_refs.json | arXiv references |
| figures_manifest.json | Generated figure paths and captions |
| full_paper.tex / .pdf | Generated paper |
| review_report.json | Rubric-driven paper review (AI Scientist v1/v2-compatible): scores, strengths, weaknesses, decision. When N>1, includes ensemble_reviews[] (the N individual reviews) and meta_review (Area Chair aggregation) inline. |
| vlm_review.json | Per-figure VLM findings (score, issues, suggestions), piped into paper review as reviewer notes |
| ors_rubric.json / ors_phase1.json / ors_grade.json | PaperBench-format reproducibility (v0.7.0): auto-rubric, Phase 1 sandbox run, Phase 2 SimpleJudge grading. Replaces the v0.6.0 reproducibility_report.json. |
| ear/ + ear_published/ + manifest.lock + publish_record.json | Experiment Artifact Repository, curated bundle, deterministic bundle_sha256, publish backend record (v0.7.0). |
Rubric-driven paper review (v0.6.0+)
The paper review phase is venue-agnostic and driven by YAML rubrics in ari-core/config/reviewer_rubrics/. Sixteen rubrics are bundled, covering ML conferences (neurips, which is the default and AI Scientist v2-compatible, plus iclr, icml, cvpr, acl), systems/HPC (sc, osdi, usenix_security), theory/graphics (stoc, siggraph), HCI/robotics (chi, icra), and journals/generic (nature, journal_generic, workshop, generic_conference). A built-in legacy fallback (v0.5 schema) is also available. To add a venue, drop a new YAML file into that directory; no code changes are needed.
Each rubric declares score_dimensions (soundness / presentation / contribution / …), text_sections (strengths / weaknesses / questions / limitations), decision rules (threshold or categorical), and execution parameters (reflection rounds, few-shot example count, ensemble size, temperature). Defaults follow the Nature Ablation (Appendix A.4 of arXiv:2408.06292): num_reflections=5, num_fs_examples=1, num_reviews_ensemble=1, temperature=0.75.
Configure per run via CLI flags (--rubric, --fewshot-mode, --num-reviews-ensemble, --num-reflections) or environment variables (ARI_RUBRIC, ARI_FEWSHOT_MODE, ARI_NUM_REVIEWS_ENSEMBLE, ARI_NUM_REFLECTIONS). The New Experiment Wizard exposes the same settings under “Paper Review”.
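As an illustration of how these settings might be combined, the sketch below resolves each value from a CLI flag, then an environment variable, then the default. The precedence order and the name resolve_review_settings are assumptions made for this example; only the environment variable names and default values come from the text above.

```python
import os
from typing import Mapping, Optional

# Defaults quoted in the text above.
DEFAULTS = {"num_reflections": 5, "num_reviews_ensemble": 1}

# Environment variable for each setting, as named in the text.
ENV_VARS = {
    "num_reflections": "ARI_NUM_REFLECTIONS",
    "num_reviews_ensemble": "ARI_NUM_REVIEWS_ENSEMBLE",
}


def resolve_review_settings(
    cli: Mapping[str, Optional[int]],
    env: Optional[Mapping[str, str]] = None,
) -> dict:
    """Assumed precedence: CLI flag, then environment variable, then default."""
    env = os.environ if env is None else env
    settings = {}
    for key, default in DEFAULTS.items():
        if cli.get(key) is not None:
            settings[key] = cli[key]                  # explicit flag wins
        elif ENV_VARS[key] in env:
            settings[key] = int(env[ENV_VARS[key]])   # then the environment
        else:
            settings[key] = default                   # otherwise the default
    return settings
```

Passing the environment as an argument keeps the function easy to test without touching the real process environment.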
A Few-shot Examples sub-panel inside the Wizard lists the examples currently available for the selected rubric and provides three one-click actions: Auto-sync pulls the corpus declared in scripts/fewshot/manifest.yaml (including the three AI Scientist v2 samples from GitHub); Upload accepts a JSON review form + optional .txt excerpt + optional PDF; Delete removes an example. Each action hits /api/fewshot/<rubric>/{sync,upload,delete}, which refuses any rubric not present in reviewer_rubrics/ and strips path-traversal sequences from inputs.
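The two guards mentioned above (rejecting unknown rubrics and neutralising path-traversal input) could look roughly like this. It is a minimal sketch, not ARI's request handler: validate_rubric, safe_filename, and the in-memory KNOWN_RUBRICS set are hypothetical names, and the set stands in for the YAML files found in reviewer_rubrics/.

```python
import re
from pathlib import PurePosixPath

# Hypothetical stand-in for the rubric ids discovered in reviewer_rubrics/.
KNOWN_RUBRICS = {"neurips", "iclr", "sc", "nature"}


def validate_rubric(rubric: str) -> str:
    """Reject any rubric id that is not backed by a bundled YAML file."""
    if rubric not in KNOWN_RUBRICS:
        raise ValueError(f"unknown rubric: {rubric!r}")
    return rubric


def safe_filename(raw: str) -> str:
    """Reduce an uploaded name to a bare file name with a conservative charset."""
    # Normalise Windows separators, then keep only the last path component,
    # which drops any "../" prefix or absolute directory.
    name = PurePosixPath(raw.replace("\\", "/")).name
    # Replace anything outside a small allowlist of characters.
    name = re.sub(r"[^A-Za-z0-9._-]", "_", name)
    # Refuse names that are empty or consist only of dots.
    if not name.strip("."):
        raise ValueError(f"unsafe file name: {raw!r}")
    return name
```

Checking the rubric against a known set before touching the filesystem means a crafted rubric id never becomes part of a path.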
Architecture
BFTS — Best-First Tree Search
ARI uses Best-First Tree Search (BFTS) to explore the hypothesis space. The LLM selects the most promising node to expand next, guided by real measurement data. Search limits are set in ari-core/config/default.yaml:
bfts:
  max_total_nodes: 50 # maximum nodes to explore
  max_depth: 5 # tree depth limit
  max_parallel_nodes: 4 # concurrent experiments
  score_threshold: 0.3 # minimum score to expand
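To make the limits concrete, here is a minimal best-first loop. It is a sketch under stated assumptions, not ARI's scheduler: a numeric score in a priority queue stands in for the LLM's node selection, expand is a hypothetical callback that returns scored children, and nodes are processed one at a time, so max_parallel_nodes is not modelled.

```python
import heapq
import itertools
from typing import Callable, List, Tuple

Node = Tuple[str, float, int]  # (name, score, depth)


def best_first_search(
    root: Node,
    expand: Callable[[Node], List[Node]],
    max_total_nodes: int = 50,
    max_depth: int = 5,
    score_threshold: float = 0.3,
) -> List[Node]:
    """Visit nodes in descending score order and return every node visited."""
    counter = itertools.count()  # tie-breaker so the heap never compares nodes
    frontier = [(-root[1], next(counter), root)]  # negate: heapq is a min-heap
    explored: List[Node] = []
    while frontier and len(explored) < max_total_nodes:
        _, _, node = heapq.heappop(frontier)
        explored.append(node)
        _, score, depth = node
        # Weak or too-deep nodes stay in the tree but are not expanded.
        if score < score_threshold or depth >= max_depth:
            continue
        for child in expand(node):
            heapq.heappush(frontier, (-child[1], next(counter), child))
    return explored
```

The search stops when either the frontier is empty or max_total_nodes nodes have been visited, whichever comes first.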
ReAct Loop
Each node runs: Reason → Act (tool call) → Observe → Reason... until a JSON result is produced. The agent automatically polls async HPC jobs without consuming step budget.
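The loop can be pictured as follows. This is an illustrative sketch only: reason and act are hypothetical callables standing in for the LLM and the MCP tool dispatcher, and the "polling" flag is an assumed way of marking a job-status check that should not count against the step budget.

```python
from typing import Callable, Dict, Optional


def react_loop(
    reason: Callable[[list], Dict],
    act: Callable[[Dict], Dict],
    max_steps: int = 10,
) -> Optional[Dict]:
    """Run Reason -> Act -> Observe until the agent returns a final JSON result."""
    history: list = []
    steps_used = 0
    while steps_used < max_steps:
        decision = reason(history)               # Reason: choose the next action
        if decision.get("final") is not None:
            return decision["final"]             # a JSON result ends the loop
        observation = act(decision)              # Act: run the tool call
        history.append((decision, observation))  # Observe: record the outcome
        # Assumption: polling an async job is free; other calls cost one step.
        if not observation.get("polling", False):
            steps_used += 1
    return None  # budget exhausted without a result
```

A production loop would also need a wall-clock timeout, since free polling steps could otherwise repeat indefinitely.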
Post-BFTS Pipeline
After BFTS completes, workflow.yaml drives a sequential pipeline. Stages are idempotent — re-runs skip already-completed stages.
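Idempotent stages can be sketched as follows, assuming completion is detected by the presence of the stage's output file (in the spirit of the skip_if_exists key used in workflow.yaml). run_pipeline and the Stage tuple are hypothetical names for this example.

```python
import tempfile
from pathlib import Path
from typing import Callable, List, Tuple

Stage = Tuple[str, Path, Callable[[Path], None]]  # (name, output file, action)


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order, skipping any whose output file already exists."""
    executed = []
    for name, output, action in stages:
        if output.exists():
            continue  # completed in an earlier run
        action(output)
        executed.append(name)
    return executed


# Usage: the second call finds the output on disk and runs nothing.
with tempfile.TemporaryDirectory() as ckpt:
    out = Path(ckpt) / "science_data.json"
    stages = [("transform_data", out, lambda p: p.write_text("{}"))]
    first = run_pipeline(stages)
    second = run_pipeline(stages)
```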
Experiment Files
Experiment files are Markdown. No code changes needed — all domain knowledge lives here.
Minimal (3 lines)
## Research Goal
Maximize the target metric for my experiment on this machine.
<!-- metric_keyword: score -->
Full Reference
# Experiment Title
## Research Goal
Describe the optimization objective in plain language.
The LLM reads this to generate hypotheses.
## Required Workflow
1. Call `survey` to find related literature
2. Call `slurm_submit` with a SLURM script
3. Call `job_status` to wait for completion
4. Call `run_bash` to read the output file
5. Return JSON with measured values
## Hardware Limits
- Max CPUs: 64
- Compiler: gcc only
## SLURM Script Template
```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --cpus-per-task=32
#SBATCH --time=00:30:00
python run_experiment.py
```
## Rules
- HARD LIMIT: never exceed 64 threads
- Always use absolute paths in slurm_submit
<!-- metric_keyword: score -->
<!-- min_expected_metric: 100 -->
| Section | Required | Purpose |
|---|---|---|
| ## Research Goal | ✔ | Drives LLM hypothesis generation |
| ## Required Workflow | | Sets tool execution sequence |
| ## Hardware Limits | | Hard constraints injected at every step |
| ## SLURM Script Template | | Starting point for LLM modifications |
| ## Rules | | Agent constraints and invariants |
| <!-- metric_keyword --> | ✔ | Metric name for extraction |
| <!-- min_expected_metric --> | | Minimum acceptable value |
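For readers who want to check an experiment file before a run, the sketch below extracts the "## " section titles and the "<!-- key: value -->" comments and reports which required items are missing. It is a standalone illustration with hypothetical function names, not the parser ARI uses.

```python
import re

_META = re.compile(r"<!--\s*(\w+)\s*:\s*(.*?)\s*-->")
_HEADING = re.compile(r"^##\s+(.+?)\s*$", re.MULTILINE)


def parse_experiment(text: str) -> dict:
    """Return the '##' section titles and '<!-- key: value -->' metadata."""
    return {
        "sections": _HEADING.findall(text),
        "meta": dict(_META.findall(text)),
    }


def missing_required(parsed: dict) -> list:
    """List the required items (Research Goal, metric_keyword) that are absent."""
    missing = []
    if "Research Goal" not in parsed["sections"]:
        missing.append("## Research Goal")
    if "metric_keyword" not in parsed["meta"]:
        missing.append("<!-- metric_keyword -->")
    return missing
```

Metadata values are returned as strings; a caller would convert min_expected_metric to a number itself.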
workflow.yaml
The single configuration file for the post-BFTS pipeline. Adding or reordering stages requires only YAML changes — no code changes.
version: '1'
slurm_partition: your_partition
author_name: "Your Name or Organization"
skills:
  - name: paper-skill
    path: '{{ari_root}}/ari-skill-paper'
    description: LaTeX paper writing
pipeline:
  - stage: transform_data # NEW: strips BFTS internals
    skill: transform-skill
    tool: nodes_to_science_data
    inputs:
      nodes_json_path: '{{ckpt}}/nodes_tree.json'
    outputs:
      file: '{{ckpt}}/science_data.json'
    skip_if_exists: '{{ckpt}}/science_data.json'
  - stage: generate_figures
    skill: plot-skill
    tool: generate_figures_llm
    depends_on: [transform_data]
    inputs:
      science_data_path: '{{stages.transform_data.outputs.file}}'
      output_dir: '{{ckpt}}'
    outputs:
      file: '{{ckpt}}/figures_manifest.json'
  - stage: write_paper
    skill: paper-skill
    tool: write_paper_iterative
    depends_on: [generate_figures, search_related_work]
    inputs:
      refs_json: '{{stages.search_related_work.outputs.file}}'
      figures_manifest_json: '{{ckpt}}/figures_manifest.json'
      venue: arxiv
    outputs:
      file: '{{ckpt}}/full_paper.tex'
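The depends_on lists constrain the order in which stages may run. The sketch below derives one valid order with Python's standard graphlib module. Whether ARI itself sorts stages this way is not stated here, so treat this purely as an illustration; search_related_work is referenced by write_paper in the excerpt, but its definition is not shown, so it appears with no dependencies.

```python
from graphlib import TopologicalSorter

# Stage names and depends_on lists copied from the workflow.yaml excerpt.
# Each key maps a stage to the stages that must finish before it starts.
depends_on = {
    "transform_data": [],
    "generate_figures": ["transform_data"],
    "search_related_work": [],  # assumed to have no dependencies
    "write_paper": ["generate_figures", "search_related_work"],
}

# static_order() yields every stage after all of its dependencies.
order = list(TopologicalSorter(depends_on).static_order())
```

TopologicalSorter raises graphlib.CycleError if the stages depend on each other in a loop, which makes it a convenient validity check for a hand-edited pipeline.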
Template Variables
| Variable | Value |
|---|---|
| {{ckpt}} | Checkpoint directory (absolute path) |
| {{ari_root}} | ARI project root |
| {{paper_context}} | Science-facing experiment summary |
| {{stages.NAME.outputs.file}} | Primary output of stage NAME |
| {{author_name}} | Value of author_name in workflow.yaml |
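Template substitution of this kind can be sketched in a few lines. The render function below is a hypothetical illustration, not ARI's template engine: it replaces each {{dotted.name}} by walking a nested dictionary, and it fails loudly on an unknown variable.

```python
import re

_PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def render(template: str, context: dict) -> str:
    """Replace each {{dotted.name}} with the value found by walking context."""
    def lookup(match: re.Match) -> str:
        value = context
        for part in match.group(1).split("."):
            value = value[part]  # raises KeyError on an unknown variable
        return str(value)
    return _PLACEHOLDER.sub(lookup, template)


# Example context; the paths are made up for illustration.
context = {
    "ckpt": "/tmp/run1",
    "stages": {"transform_data": {"outputs": {"file": "/tmp/run1/science_data.json"}}},
}
nodes_path = render("{{ckpt}}/nodes_tree.json", context)
```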
Skills Overview
| Skill | Tools | Type |
|---|---|---|
| ari-skill-hpc | slurm_submit, job_status, job_cancel, run_bash, singularity_* | Deterministic |
| ari-skill-evaluator | make_metric_spec | Deterministic △ |
| ari-skill-idea | survey, generate_ideas | LLM ✷ |
| ari-skill-web | web_search, fetch_url, search_arxiv, search_papers, collect_references_iterative, list_uploaded_files, read_uploaded_file | Partial LLM △ |
| ari-skill-memory | add_memory, search_memory, get_node_memory, clear_node_memory, get_experiment_context | Letta-backed △ |
| ari-skill-transform | nodes_to_science_data, generate_ear | LLM ✷ |
| ari-skill-plot | generate_figures, generate_figures_llm | LLM ✷ |
| ari-skill-paper | write_paper_iterative, review_compiled_paper, generate_section, ... | LLM ✷ |
| ari-skill-paper-re | extract_repro_config, build_repro_report, extract_metric_from_output | LLM ✷ |
| ari-skill-coding | write_code, run_code, read_file, run_bash | Deterministic |
| ari-skill-benchmark | analyze_results, plot, statistical_test | Deterministic |
| ari-skill-vlm | review_figure, review_table | LLM ✷ |
| ari-skill-orchestrator | run_experiment, get_status, list_runs, list_children, get_paper | Deterministic |
✷ LLM-using tools are explicitly annotated. △ = LLM in some tools only. 13 skills total (12 default, 1 additional).
Adding a Skill
- Create the skill directory
ari-skill-yourskill/
├── src/server.py
├── tests/test_server.py
└── pyproject.toml
- Implement the server
from mcp.server.fastmcp import FastMCP
mcp = FastMCP("yourskill")
@mcp.tool()
def your_tool(param: str) -> dict:
    """Clear description for the LLM."""
    result = pure_computation(param) # no LLM calls here
    return {"result": result}
if __name__ == "__main__":
    mcp.run()
- Register in workflow.yaml
skills:
  - name: yourskill
    path: '{{ari_root}}/ari-skill-yourskill'
- Add a pipeline stage
pipeline:
  - stage: your_stage
    skill: yourskill
    tool: your_tool
    inputs:
      param: '{{paper_context}}'
    outputs:
      file: '{{ckpt}}/your_output.json'
LLM Configuration
# OpenAI
export ARI_LLM_MODEL=openai/gpt-4o
export OPENAI_API_KEY=sk-...
# Anthropic
export ARI_LLM_MODEL=anthropic/claude-sonnet-4-5
export ANTHROPIC_API_KEY=sk-ant-...
# Local Ollama (free, no API key)
export ARI_LLM_MODEL=qwen3:32b
export LLM_API_BASE=http://127.0.0.1:11434
# Any OpenAI-compatible API (vLLM, LM Studio, etc.)
export ARI_LLM_MODEL=your-model-name
export LLM_API_BASE=http://your-server:8000/v1
Include the provider prefix in hosted model names: openai/gpt-5.2, not just gpt-5.2.
HPC / Execution Backend
ARI uses a pluggable executor model. Set ARI_EXECUTOR to match your environment — no code changes needed.
# Environment variables
export ARI_EXECUTOR=slurm # local | slurm | pbs | lsf
export ARI_SLURM_PARTITION=your_partition # SLURM only
# workflow.yaml (reproducibility stage — react_driver form)
- stage: reproducibility_check
  skill: paper-re-skill
  pre_tool: extract_repro_config # one-shot LLM, extracts claimed value
  post_tool: build_repro_report # one-shot LLM, builds verdict
  react:
    agent_phase: reproduce # only MCP skills opted into this phase are visible
    max_steps: 40
    final_tool: report_metric
    sandbox: '{{checkpoint_dir}}/repro_sandbox'
  inputs:
    paper_path: '{{checkpoint_dir}}/full_paper.tex'
    tolerance_pct: 5.0
For experiments (BFTS), configure in default.yaml:
hpc:
  mode: slurm # or "local" for laptop
  scheduler: slurm
  max_nodes: 4
  max_walltime: "04:00:00"
To run without a cluster, set ARI_EXECUTOR=local. ARI will execute experiments as local subprocesses.
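A pluggable executor usually starts with a small validation step like the one sketched below. The function name select_executor and the fallback to local when the variable is unset are assumptions for this example; only the variable name and the four accepted values come from the text above.

```python
import os
from typing import Mapping, Optional

# Values listed in the environment-variable example above.
SUPPORTED = ("local", "slurm", "pbs", "lsf")


def select_executor(env: Optional[Mapping[str, str]] = None) -> str:
    """Read ARI_EXECUTOR and reject values outside the supported set."""
    env = os.environ if env is None else env
    # Assumption: fall back to local execution when the variable is unset.
    name = env.get("ARI_EXECUTOR", "local").strip().lower()
    if name not in SUPPORTED:
        raise ValueError(
            f"ARI_EXECUTOR={name!r} is not one of {', '.join(SUPPORTED)}"
        )
    return name
```

Failing early on a misspelled value is cheaper than discovering it after a search has already started.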
Experiment Monitor (ari viz)
ARI ships a real-time experiment tree visualiser. It shows every BFTS node, its status, metrics, and the full tool-call trace — all in a browser.
Starting the monitor
ari viz --checkpoint <ckpt_dir> --port 9878
Open http://localhost:9878 in any browser. The dashboard polls /state and reconnects over WebSocket automatically.
Node detail panel
Click any node circle to open the four-tab detail panel:
- Overview: ID, status, type, execution time, parent, metrics, evaluation summary.
- Trace: ordered list of every MCP tool call the agent made (name · step · result snippet). Fetched live from /memory/<node_id>.
- Code: generated source files stored in node artifacts.
- Output: SLURM stdout / benchmark results stored in node artifacts.
Status indicators
- Green circle: success
- Red circle: failed
- Blue circle: running
- Grey circle: pending
Architecture
The viz server (ari/viz/server.py) is a pure-stdlib asyncio HTTP + WebSocket handler — no external dependencies beyond the websockets package already installed by ARI. The dashboard is a React/TypeScript SPA built with Vite (ari/viz/frontend/), with modular components for each page (Home, Experiments, Monitor, Tree, Results, Wizard, Idea, Workflow, Settings). The production build is output to ari/viz/static/dist/ and served by the Python server.
Agent Memory
As of v0.6.0, ARI keeps per-experiment memory in Letta (ex-MemGPT). Each checkpoint gets a dedicated Letta agent ari_agent_<hash> with two archival collections and a seeded core-memory block. A portable snapshot ({checkpoint}/memory_backup.jsonl.gz) is written automatically at pipeline-stage boundaries and on shutdown so the checkpoint directory stays self-contained.
- ari_node_<hash>: ancestor-scoped node entries (add_memory / search_memory)
- ari_react_<hash>: flat ReAct-trace collection written by LettaMemoryClient
- Core memory: stable experiment facts (goal, primary metric, hardware), read via get_experiment_context()
- {checkpoint}/memory_access.jsonl: append-only write/read telemetry for the Tree dashboard
Write-side tools enforce Copy-on-Write (a child cannot mutate an ancestor's entries) and search_memory strictly filters by the ancestor_ids you pass, so sibling branches never cross-contaminate. Letta self-edit is disabled by default via ARI_MEMORY_LETTA_DISABLE_SELF_EDIT=true.
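The isolation rule can be illustrated with a small in-memory model. This is a sketch, not the Letta-backed implementation: MemoryStore is a hypothetical class, and plain substring matching stands in for retrieval so the example stays self-contained.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MemoryStore:
    """In-memory stand-in for the node collection (illustration only)."""
    entries: List[Dict] = field(default_factory=list)

    def add_memory(self, node_id: str, text: str) -> None:
        # Writes only append under the caller's own node id, so an
        # ancestor's entries are never modified (copy-on-write behaviour).
        self.entries.append({"node_id": node_id, "text": text})

    def search_memory(self, query: str, ancestor_ids: List[str]) -> List[str]:
        # Only entries written by the listed ancestors are visible.
        allowed = set(ancestor_ids)
        return [
            e["text"]
            for e in self.entries
            if e["node_id"] in allowed and query.lower() in e["text"].lower()
        ]


store = MemoryStore()
store.add_memory("root", "baseline score 10")
store.add_memory("A", "tiling helps score")
store.add_memory("B", "unrolling hurts score")  # sibling branch of A
visible_to_child_of_a = store.search_memory("score", ["root", "A"])
```

A child of A sees entries from root and A but nothing from the sibling branch B, because B is not in the ancestor list it passes.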
The Trace tab in the Experiment Monitor reads these entries live through the Letta-backed library. Manage the backend with ari memory (migrate / backup / restore / start-local / …).
v0.7.0 note: on Letta 0.16.x the SDK call passages.list(search=q) is a SQL substring filter (LOWER(text) LIKE LOWER(%q%)), not semantic search, so long natural-language queries silently returned 0 ancestor entries against real data. search_memory now uses passages.search (GET /archival-memory/search, embed_query=True) with top_k = max(letta_overfetch, limit*40) and post-filters by ancestor_ids + ari_checkpoint locally. The embedding cost paid on every add_memory insert is now actually consumed by retrieval; children see ancestor entries ranked by relevance to their eval_summary query.
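The over-fetch and local post-filter can be written down directly. In the sketch below the formula for top_k is taken from the note above, while the function names, the node_id field name, and the shape of a hit are assumptions; hits are taken to arrive already ranked by relevance.

```python
from typing import Dict, List


def overfetch_size(limit: int, letta_overfetch: int) -> int:
    """top_k requested from the backend: max(letta_overfetch, limit * 40)."""
    return max(letta_overfetch, limit * 40)


def post_filter(
    hits: List[Dict],
    ancestor_ids: List[str],
    checkpoint: str,
    limit: int,
) -> List[Dict]:
    """Keep ranked hits that belong to an ancestor in the same checkpoint."""
    allowed = set(ancestor_ids)
    kept = [
        h for h in hits
        if h["node_id"] in allowed and h["ari_checkpoint"] == checkpoint
    ]
    return kept[:limit]  # ranking order from the backend is preserved
```

Over-fetching matters because the backend ranks across all entries; if only limit results were requested, most of them could belong to sibling branches and be discarded by the filter.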
Publication Lifecycle (v0.7.0)
v0.7.0 turns the Experiment Artifact Repository (EAR) into a curated, digest-anchored publication chain. The author writes a small ear/publish.yaml allowlist; ari-core enforces a built-in deny list (.env*, secrets/**, *.pem, …) and computes a deterministic bundle_sha256 over a canonical {path,sha256,size} manifest. The digest is baked into the paper's \codedigest{...} macro by the finalize_paper stage, so any reader can verify the bundle from the paper alone — even if the registry hosting it disappears.
- ari ear curate <ckpt>: deterministic, no LLM. Produces ear_published/ + manifest.lock.
- ari ear publish <ckpt> --backend ari-registry|local-tarball|gh|zenodo: always starts at visibility=staged (FR-P5).
- ari ear promote <ckpt> --target public|unlisted: visibility moves up only.
- ari clone <ref> --expect-sha256 <hex>: reader-side fetch + verify. Resolvers: file://, https://, ari://<id>, gh:<u>/<r>, doi:<doi>. No code execution.
- ari registry serve: optional self-hosted FastAPI registry with sqlite-backed bearer tokens, content-addressed artefact storage, and three deploy modes (local uvicorn, docker-compose, Apptainer). See docs/registry.md.
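A deterministic digest over a {path, sha256, size} manifest can be sketched as follows. The canonical form used here (entries sorted by path, compact JSON with sorted keys) is an assumption chosen for illustration, as are the function names; ARI's actual canonicalisation may differ, and the deny list shows only the patterns quoted above.

```python
import fnmatch
import hashlib
import json
from pathlib import PurePosixPath
from typing import Dict, List

# Patterns quoted in the text; the built-in list is longer.
DENY = [".env*", "secrets/**", "*.pem"]


def is_denied(path: str) -> bool:
    """Match each pattern against the full path and the bare file name."""
    name = PurePosixPath(path).name
    return any(
        fnmatch.fnmatch(path, pat) or fnmatch.fnmatch(name, pat) for pat in DENY
    )


def build_manifest(files: Dict[str, bytes]) -> List[dict]:
    """One {path, sha256, size} entry per allowed file, sorted by path."""
    entries = [
        {"path": p, "sha256": hashlib.sha256(data).hexdigest(), "size": len(data)}
        for p, data in files.items()
        if not is_denied(p)
    ]
    return sorted(entries, key=lambda e: e["path"])


def bundle_sha256(manifest: List[dict]) -> str:
    """Hash a canonical JSON encoding so equal manifests give equal digests."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

A reader can recompute the digest from the downloaded files and compare it with the value printed in the paper; sorting the entries is what makes the result independent of filesystem listing order.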
Reproducibility — PaperBench-format (v0.7.0)
The legacy LLM-driven verdict path (extract_repro_config → react_driver → build_repro_report) is replaced by a deterministic two-phase grading flow compatible with PaperBench (arXiv:2504.01848). Three pipeline stages run in sequence:
- ors_generate_rubric (new ari-skill-replicate): auto-generates a frozen PaperBench TaskNode rubric from the final paper. Default two-stage generation: a skeleton pass defines the root + per-contribution direct children with leaf budgets, then parallel subtree passes recursively populate each child to depth 4–6. Yields ~4× more leaves and 1–2 levels more depth than the legacy single-call path, at ~5× more API tokens; toggle off via the GUI Wizard "Two-stage generation" checkbox or ARI_RUBRIC_GEN_TWO_STAGE=0 for cheap runs. audit_rubric flags vague/unverifiable leaves and recommends regeneration when >20% are flagged.
- ors_run_reproduce (paper-re, Phase 1): executes reproduce.sh in a sandbox: docker → apptainer → singularity → local (overridable via ARI_PHASE1_SANDBOX). Captures reproduce.log and reports any rubric-declared expected_artifacts that did not appear.
- ors_grade (paper-re, Phase 2): runs PaperBench SimpleJudge over the rubric leaves n_runs times (default 3), averages the weighted leaf scores, and runs a one-off negative control (empty repo + trivial reproduce.sh). Both controls must score <5% (passed=true) for the rubric to be considered honest.
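The arithmetic behind the thresholds above is simple enough to show directly. The sketch below is a simplification with hypothetical function names: it treats the rubric as a flat list of weighted leaves, whereas a PaperBench rubric is a tree whose scores are aggregated level by level.

```python
from statistics import mean
from typing import List, Tuple

Leaf = Tuple[float, float]  # (weight, score in [0, 1])


def weighted_score(leaves: List[Leaf]) -> float:
    """Weighted average of leaf scores for a single judge run."""
    total_weight = sum(w for w, _ in leaves)
    return sum(w * s for w, s in leaves) / total_weight


def grade(runs: List[List[Leaf]]) -> float:
    """Average the weighted score over all judge runs (n_runs, default 3)."""
    return mean(weighted_score(run) for run in runs)


def control_passed(control_score: float, ceiling: float = 0.05) -> bool:
    """A negative control must score strictly below 5%."""
    return control_score < ceiling


def should_regenerate(flagged: int, total: int, max_fraction: float = 0.20) -> bool:
    """Recommend regeneration when more than 20% of leaves are flagged."""
    return flagged / total > max_fraction
```

Averaging over several runs reduces the effect of judge variance, and the negative control guards against a rubric that awards points to an empty submission.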
Paper-Audit Mode — PaperBench-format (v0.7.2)
v0.7.2 adds the audit-side companion to the agent-side reproducibility check. Where the agent-side asks “can an AI agent reproduce this paper?”, the audit-side inverts the framing to “does this paper describe enough to be reproducible?” — using the same vendored PaperBench rubric formalism with zero vendor changes. Driven by the SC reproducibility framework (Artifact Description / Artifact Evaluation Appendices) and generalised to NeurIPS Reproducibility Checklist and Nature Reporting Summary venues.
Three orthogonal mechanisms lift the audit score above a low single-digit paper-only baseline (0.033 measured on LLAMP / sc24-00070). An earlier “Step 4 reproduction-package generator” was removed as off-protocol: it asked an LLM to write a fake reproduce.log transcribing the paper's own table/figure numbers as if they were observed values, bypassing PaperBench's Step 2 (executed submission). The previously-claimed audit score of 0.857 obtained with that shortcut is retracted.
- Venue-conditioned rubric templates: ari-core/config/paperbench_rubrics/<id>.yaml declares mode: paper_audit with a fixed set of top_level_axes. The bundled sc.yaml emits six SC-specific axes (environment / data / execution / figures / scaling / conclusion) per HPC PaperBench audit research plan §5 Step 3. Generator code in ari-skill-replicate stays domain-agnostic; venue knowledge lives in YAML. See rubric schema.
- Multimodal expander: pymupdf4llm converts the paper PDF to markdown with image references; an expander in _litellm_completer.async_completion rewrites these into OpenAI multimodal content blocks so vision-capable judges see figures alongside text. Vendor SimpleJudge unchanged.
- paper_audit prompt patch in ari-skill-paper-re/src/_paperbench_bridge.py: monkey-patches PaperBench's TASK_CATEGORY_QUESTIONS (which asks “did the submission do X?”) at call scope to ask “does the paper describe X with concrete specificity?”. Restored on exit so agent-benchmark calls are unaffected. Breaks the structural ceiling that capped paper_audit mean score around 0.3.
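A call-scoped patch that is always restored on exit is commonly written as a context manager. The sketch below shows the pattern under stated assumptions: the paperbench object is a hypothetical stand-in built with SimpleNamespace, the category key is invented for illustration, and only the attribute name TASK_CATEGORY_QUESTIONS comes from the text above.

```python
from contextlib import contextmanager
from types import SimpleNamespace

# Hypothetical stand-in for the vendored module; the key is illustrative.
paperbench = SimpleNamespace(
    TASK_CATEGORY_QUESTIONS={"example_category": "Did the submission do X?"}
)


@contextmanager
def patched_questions(module, replacement: dict):
    """Swap in audit-style questions for the duration of one call."""
    original = module.TASK_CATEGORY_QUESTIONS
    module.TASK_CATEGORY_QUESTIONS = replacement
    try:
        yield
    finally:
        # Runs on normal exit and on exceptions alike.
        module.TASK_CATEGORY_QUESTIONS = original


audit = {"example_category": "Does the paper describe X with concrete specificity?"}
with patched_questions(paperbench, audit):
    inside = paperbench.TASK_CATEGORY_QUESTIONS["example_category"]
after = paperbench.TASK_CATEGORY_QUESTIONS["example_category"]
```

Restoring in a finally block is what keeps other callers unaffected even when the judged call fails partway through.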
CLI dogfood: python scripts/sc_paper_dogfood.py --pdf paper.pdf --rubric-template sc --judge-dryrun. The --paper-extras AD.pdf AE.pdf flag concatenates Artifact Description and Artifact Evaluation Appendices so the audit can test HPC PaperBench audit research plan hypothesis 2 (paper-only vs +AD+AE).