ARI Documentation

Technical reference for developers and researchers.

Overview

ARI automates the full research cycle: hypothesis generation → experiment execution → paper writing → reproducibility verification. The system is built on three layers:

┌────────────────────────────────────────────┐
│            experiment.md / CLI             │
└──────────────────┬─────────────────────────┘
                   │
┌──────────────────▼─────────────────────────┐
│                  ari-core                  │
│    BFTS Engine → ReAct Loop → Pipeline     │
└──────────────────┬─────────────────────────┘
                   │ MCP protocol
     ┌─────────────┼─────────────────┐
     │             │                 │
   Skills          Skills          Skills
(deterministic) (deterministic) (LLM-annotated)
  hpc,            web,            idea,
  memory,         paper,          plot,
  evaluator       transform       paper-re

Key principle: MCP skills are deterministic where possible, and tools that use an LLM are explicitly annotated. Default skills that use an LLM: idea-skill, transform-skill, plot-skill, paper-skill, paper-re-skill, vlm-skill, web-skill (partial), memory-skill (Letta embedding retrieval, △).

Installation

Clone the repository

git clone https://github.com/kotama7/ARI && cd ARI

Run setup
```
bash setup.sh
```

Install LaTeX

# With conda (no sudo needed)
conda install -c conda-forge texlive-core

# Or system package
sudo apt install texlive-full        # Debian/Ubuntu
sudo dnf install texlive             # RHEL/CentOS

Set LLM backend — see LLM Configuration

First Run

# Minimal experiment file
cat > experiment.md << 'EOF'
## Research Goal
Maximize the target metric for my experiment on this machine.
<!-- metric_keyword: score -->
EOF

# Run
ari run experiment.md

# With custom config
ari run experiment.md --config ari-core/config/workflow.yaml

# On a SLURM cluster
sbatch your_pipeline_job.sh

Want to see what ARI produces? Download the sample paper (PDF), a 10-page Stratum-Roofline CSR-SpMM study on Fujitsu A64FX/SVE-512 generated by an actual ARI run, including figures, citations, and reproducibility verification.

Output files appear in the checkpoint directory:

File	Description
`nodes_tree.json`	BFTS search tree (all explored configurations)
`science_data.json`	Science-facing data (no internal terms). v0.7.0 adds a typed split on each `configurations[*]`: `parameters` (input knobs — never a headline result) vs. `measurements` (measured outputs) vs. `predictions` (model ceilings) vs. `scores` (derived ratios), sourced from the coding-skill `emit_results` contract or the LLM-evaluator's typed split. `summary_stats` reduces direction-aware over a known `primary_metric` instead of `max()` over every key.
`related_refs.json`	arXiv references
`figures_manifest.json`	Generated figure paths and captions
`full_paper.tex / .pdf`	Generated paper
`review_report.json`	Rubric-driven paper review (AI Scientist v1/v2-compatible): scores, strengths, weaknesses, decision. When N>1, includes `ensemble_reviews[]` (the N individual reviews) and `meta_review` (Area Chair aggregation) inline.
`vlm_review.json`	Per-figure VLM findings (score, issues, suggestions) — piped into paper review as reviewer notes
`ors_rubric.json / ors_phase1.json / ors_grade.json`	PaperBench-format reproducibility (v0.7.0): auto-rubric, Phase 1 sandbox run, Phase 2 SimpleJudge grading. Replaces the v0.6.0 `reproducibility_report.json`.
`ear/` + `ear_published/` + `manifest.lock` + `publish_record.json`	Experiment Artifact Repository, curated bundle, deterministic `bundle_sha256`, publish backend record (v0.7.0).
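
The difference between taking `max()` over every key and reducing over a known `primary_metric` can be illustrated with a small sketch. The `configurations`, `parameters` and `measurements` names come from the table above; the `direction` argument and the `summarize` function are hypothetical, and this is not ARI's actual implementation.

```python
from typing import Any


def summarize(configurations: list[dict[str, Any]], primary_metric: str,
              direction: str = "maximize") -> dict[str, Any]:
    """Pick the best configuration by one known metric, honouring its direction."""
    # Only measured outputs are candidates; input parameters are never a headline result.
    measured = [c for c in configurations
                if primary_metric in c.get("measurements", {})]
    if not measured:
        return {"primary_metric": primary_metric, "best": None}
    pick = max if direction == "maximize" else min
    best = pick(measured, key=lambda c: c["measurements"][primary_metric])
    return {
        "primary_metric": primary_metric,
        "best": best["measurements"][primary_metric],
        "best_parameters": best["parameters"],
    }


configs = [
    {"parameters": {"threads": 64}, "measurements": {"runtime_s": 2.0}},
    {"parameters": {"threads": 16}, "measurements": {"runtime_s": 3.5}},
]
summary = summarize(configs, "runtime_s", direction="minimize")
print(summary)
```

A naive `max()` over all keys would surface `threads: 64` or the slower runtime as the headline; keeping parameters and measurements apart avoids both mistakes.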

Rubric-driven paper review (v0.6.0+)

The paper review phase is venue-agnostic and driven by YAML rubrics in ari-core/config/reviewer_rubrics/. Sixteen rubrics are bundled, covering ML conferences (neurips, which is the default and is AI Scientist v2-compatible, plus iclr, icml, cvpr, acl), systems/HPC (sc, osdi, usenix_security), theory/graphics (stoc, siggraph), HCI/robotics (chi, icra), and journals/generic (nature, journal_generic, workshop, generic_conference). A built-in legacy fallback (v0.5 schema) is also available. To add a venue, drop in a new YAML file; no code changes are required.

Each rubric declares score_dimensions (soundness / presentation / contribution / …), text_sections (strengths / weaknesses / questions / limitations), decision rules (threshold or categorical), and execution parameters (reflection rounds, few-shot example count, ensemble size, temperature). Defaults follow the Nature Ablation (Appendix A.4 of arXiv:2408.06292): num_reflections=5, num_fs_examples=1, num_reviews_ensemble=1, temperature=0.75.

Configure per run via CLI flags (--rubric, --fewshot-mode, --num-reviews-ensemble, --num-reflections) or environment variables (ARI_RUBRIC, ARI_FEWSHOT_MODE, ARI_NUM_REVIEWS_ENSEMBLE, ARI_NUM_REFLECTIONS). The New Experiment Wizard exposes the same settings under “Paper Review”.
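
As a rough illustration of how the flags and environment variables could interact, the sketch below resolves each setting in the order CLI flag, then environment variable, then default. That precedence order is an assumption, not something stated here; the flag names, variable names and default values are the ones quoted above. `--fewshot-mode` is left out because its default is not given.

```python
import argparse

# Defaults quoted from the rubric description above.
DEFAULTS = {"rubric": "neurips", "num_reviews_ensemble": 1, "num_reflections": 5}


def resolve_review_settings(argv: list[str], env: dict[str, str]) -> dict:
    parser = argparse.ArgumentParser()
    parser.add_argument("--rubric")
    parser.add_argument("--num-reviews-ensemble", type=int)
    parser.add_argument("--num-reflections", type=int)
    args = parser.parse_args(argv)

    settings = {}
    for key, default in DEFAULTS.items():
        cli_value = getattr(args, key)
        env_value = env.get(f"ARI_{key.upper()}")
        if cli_value is not None:          # assumed: an explicit flag wins
            settings[key] = cli_value
        elif env_value is not None:        # then the environment variable
            settings[key] = type(default)(env_value)
        else:                              # finally the rubric default
            settings[key] = default
    return settings


settings = resolve_review_settings(
    ["--rubric", "sc"],
    {"ARI_RUBRIC": "iclr", "ARI_NUM_REFLECTIONS": "2"},
)
print(settings)
```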

A Few-shot Examples sub-panel inside the Wizard lists the examples currently available for the selected rubric and provides three one-click actions: Auto-sync pulls the corpus declared in scripts/fewshot/manifest.yaml (including the three AI Scientist v2 samples from GitHub); Upload accepts a JSON review form + optional .txt excerpt + optional PDF; Delete removes an example. Each action hits /api/fewshot/<rubric>/{sync,upload,delete}, which refuses any rubric not present in reviewer_rubrics/ and strips path-traversal sequences from inputs.

Architecture

📄 Canonical source: this section is a summary — see concepts/architecture.md for the authoritative, always-current detail.

BFTS — Best-First Tree Search

ARI uses Best-First Tree Search to explore the hypothesis space. The LLM selects the most promising node to expand next, guided by real measurement data. The search is configured in ari-core/config/default.yaml:

bfts:
  max_total_nodes: 50      # maximum nodes to explore
  max_depth: 5             # tree depth limit
  max_parallel_nodes: 4    # concurrent experiments
  score_threshold: 0.3     # minimum score to expand
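
To show how these limits shape a search, here is a minimal best-first sketch. It substitutes a numeric `score` function and a priority queue for ARI's LLM-driven node selection, and it ignores `max_parallel_nodes`; the `expand` and `score` callables are hypothetical stand-ins.

```python
import heapq
from typing import Callable


def best_first_search(root: str,
                      expand: Callable[[str], list[str]],
                      score: Callable[[str], float],
                      max_total_nodes: int = 50,
                      max_depth: int = 5,
                      score_threshold: float = 0.3) -> list[str]:
    """Return nodes in the order they were explored."""
    # heapq is a min-heap, so scores are negated to pop the best node first.
    frontier = [(-score(root), 0, root)]
    explored: list[str] = []
    while frontier and len(explored) < max_total_nodes:
        neg_score, depth, node = heapq.heappop(frontier)
        explored.append(node)
        # Weak nodes and nodes at the depth limit are recorded but not expanded.
        if -neg_score < score_threshold or depth >= max_depth:
            continue
        for child in expand(node):
            heapq.heappush(frontier, (-score(child), depth + 1, child))
    return explored


scores = {"root": 0.5, "root/a": 0.9, "root/b": 0.2,
          "root/a/a": 0.4, "root/a/b": 0.7}
children = {"root": ["root/a", "root/b"], "root/a": ["root/a/a", "root/a/b"]}
order = best_first_search("root", lambda n: children.get(n, []),
                          scores.__getitem__, max_total_nodes=4)
print(order)
```

With a budget of four nodes the low-scoring `root/b` branch is never explored, because better candidates always sit ahead of it in the queue.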

ReAct Loop

Each node runs: Reason → Act (tool call) → Observe → Reason... until a JSON result is produced. The agent automatically polls async HPC jobs without consuming step budget.
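
The loop can be sketched as follows. The LLM is replaced by a scripted `reason` function and the HPC skill by two stub tools, so everything except the tool names `slurm_submit` and `job_status` is hypothetical. The point of the sketch is the step accounting: polling a job that is still running does not count against `max_steps`.

```python
import json
from typing import Callable


def react_loop(reason: Callable[[list[dict]], dict],
               tools: dict[str, Callable[..., dict]],
               max_steps: int = 10) -> dict:
    history: list[dict] = []
    steps_used = 0
    while steps_used < max_steps:
        decision = reason(history)                        # Reason
        if "final" in decision:                           # a JSON result ends the loop
            return json.loads(decision["final"])
        name = decision["tool"]
        observation = tools[name](**decision.get("args", {}))    # Act
        history.append({"tool": name, "observation": observation})  # Observe
        # Polling a still-running job is free; every other call costs one step.
        if not (name == "job_status" and observation.get("state") == "RUNNING"):
            steps_used += 1
    raise RuntimeError("step budget exhausted before a result was produced")


# Scripted stand-ins for the LLM and the HPC skill.
job_states = iter(["RUNNING", "RUNNING", "COMPLETED"])
tools = {
    "slurm_submit": lambda script: {"job_id": 1},
    "job_status": lambda job_id: {"state": next(job_states)},
}


def scripted_reason(history: list[dict]) -> dict:
    if not history:
        return {"tool": "slurm_submit", "args": {"script": "/abs/path/job.sh"}}
    if history[-1]["observation"].get("state") != "COMPLETED":
        return {"tool": "job_status", "args": {"job_id": 1}}
    return {"final": '{"score": 123.4}'}


result = react_loop(scripted_reason, tools, max_steps=3)
print(result)
```

Four tool calls are made here, yet a budget of three steps is enough because the two `RUNNING` polls are not charged.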

Post-BFTS Pipeline

After BFTS completes, workflow.yaml drives a sequential pipeline. Stages are idempotent — re-runs skip already-completed stages.
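
A minimal sketch of idempotent stage execution, assuming that a stage counts as complete once its output file exists (modelled on the `skip_if_exists` key in workflow.yaml). The `run_pipeline` helper and the stage actions are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Callable

Stage = tuple[str, Path, Callable[[Path], None]]


def run_pipeline(stages: list[Stage]) -> list[str]:
    """Run stages in order; return the names of the stages that actually ran."""
    ran = []
    for name, output, action in stages:
        if output.exists():        # completed on an earlier run, so skip it
            continue
        action(output)
        ran.append(name)
    return ran


with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp)
    stages = [
        ("transform_data", ckpt / "science_data.json",
         lambda p: p.write_text("{}")),
        ("generate_figures", ckpt / "figures_manifest.json",
         lambda p: p.write_text("[]")),
    ]
    first = run_pipeline(stages)
    second = run_pipeline(stages)   # nothing left to do

print(first, second)
```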

Experiment Files

Experiment files are Markdown. No code changes needed — all domain knowledge lives here.

Minimal (3 lines)

## Research Goal
Maximize the target metric for my experiment on this machine.
<!-- metric_keyword: score -->

Full Reference

# Experiment Title

## Research Goal
Describe the optimization objective in plain language.
The LLM reads this to generate hypotheses.

## Required Workflow
1. Call `survey` to find related literature
2. Call `slurm_submit` with a SLURM script
3. Call `job_status` to wait for completion
4. Call `run_bash` to read the output file
5. Return JSON with measured values

## Hardware Limits
- Max CPUs: 64
- Compiler: gcc only

## SLURM Script Template
```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --cpus-per-task=32
#SBATCH --time=00:30:00
python run_experiment.py

```

## Rules
- HARD LIMIT: never exceed 64 threads
- Always use absolute paths in slurm_submit

<!-- metric_keyword: score -->
<!-- min_expected_metric: 100 -->

Section	Required	Purpose
`## Research Goal`	✔	Drives LLM hypothesis generation
`## Required Workflow`		Sets tool execution sequence
`## Hardware Limits`		Hard constraints injected at every step
`## SLURM Script Template`		Starting point for LLM modifications
`## Rules`		Agent constraints and invariants
`<!-- metric_keyword -->`	✔	Metric name for extraction
`<!-- min_expected_metric -->`		Minimum acceptable value
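
The two required items in the table can be checked mechanically. The sketch below is an assumption about how such a check might look, not ARI's parser: it extracts `<!-- key: value -->` directives with a regular expression and reports anything marked required that is missing.

```python
import re

REQUIRED_SECTIONS = ["## Research Goal"]
DIRECTIVE = re.compile(r"<!--\s*(\w+)\s*:\s*(.*?)\s*-->")


def parse_experiment(text: str) -> dict[str, str]:
    """Return the HTML-comment directives, or raise if a required item is missing."""
    directives = dict(DIRECTIVE.findall(text))
    headings = {line.strip() for line in text.splitlines()}
    missing = [s for s in REQUIRED_SECTIONS if s not in headings]
    if "metric_keyword" not in directives:
        missing.append("<!-- metric_keyword -->")
    if missing:
        raise ValueError(f"experiment file is missing: {missing}")
    return directives


minimal = """## Research Goal
Maximize the target metric for my experiment on this machine.
<!-- metric_keyword: score -->
"""
parsed = parse_experiment(minimal)
print(parsed)
```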

workflow.yaml

The single configuration file for the post-BFTS pipeline. Adding or reordering stages requires only YAML changes — no code changes.

version: '1'
slurm_partition: your_partition
author_name: "Your Name or Organization"

skills:
  - name: paper-skill
    path: '{{ari_root}}/ari-skill-paper'
    description: LaTeX paper writing

pipeline:
  - stage: transform_data           # NEW: strips BFTS internals
    skill: transform-skill
    tool: nodes_to_science_data
    inputs:
      nodes_json_path: '{{ckpt}}/nodes_tree.json'
    outputs:
      file: '{{ckpt}}/science_data.json'
    skip_if_exists: '{{ckpt}}/science_data.json'

  - stage: generate_figures
    skill: plot-skill
    tool: generate_figures_llm
    depends_on: [transform_data]
    inputs:
      science_data_path: '{{stages.transform_data.outputs.file}}'
      output_dir: '{{ckpt}}'
    outputs:
      file: '{{ckpt}}/figures_manifest.json'

  - stage: write_paper
    skill: paper-skill
    tool: write_paper_iterative
    depends_on: [generate_figures, search_related_work]   # search_related_work stage is not shown in this excerpt
    inputs:
      refs_json: '{{stages.search_related_work.outputs.file}}'
      figures_manifest_json: '{{ckpt}}/figures_manifest.json'
      venue: arxiv
    outputs:
      file: '{{ckpt}}/full_paper.tex'
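
The `depends_on` lists imply an execution order. One way to derive it, shown here purely as an illustration, is a topological sort with Python's standard `graphlib`; whether ARI resolves stage order this way is not stated. `search_related_work` is referenced by the excerpt but not defined in it, so the sketch assumes it has no dependencies.

```python
from graphlib import TopologicalSorter

# Stage name -> stages it depends on, copied from the excerpt above.
depends_on = {
    "transform_data": [],
    "generate_figures": ["transform_data"],
    "search_related_work": [],          # assumed: no dependencies
    "write_paper": ["generate_figures", "search_related_work"],
}

# static_order() yields every stage after all of its dependencies
# and raises graphlib.CycleError if the graph contains a cycle.
order = list(TopologicalSorter(depends_on).static_order())
print(order)
```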

Template Variables

Variable	Value
`{{ckpt}}`	Checkpoint directory (absolute path)
`{{ari_root}}`	ARI project root
`{{paper_context}}`	Science-facing experiment summary
`{{stages.NAME.outputs.file}}`	Primary output of stage NAME
`{{author_name}}`	Top-level field from workflow.yaml
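
For intuition, placeholder substitution of this kind can be written in a few lines. The `render` function below is a hypothetical sketch, not ARI's template engine; it resolves dotted names such as `stages.NAME.outputs.file` by walking nested dictionaries.

```python
import re

PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def render(template: str, context: dict) -> str:
    def lookup(match: re.Match) -> str:
        value = context
        for part in match.group(1).split("."):
            value = value[part]      # a KeyError here means an unknown variable
        return str(value)
    return PLACEHOLDER.sub(lookup, template)


context = {
    "ckpt": "/work/ckpt",
    "stages": {"transform_data": {"outputs": {"file": "/work/ckpt/science_data.json"}}},
}
manifest_path = render("{{ckpt}}/figures_manifest.json", context)
science_path = render("{{stages.transform_data.outputs.file}}", context)
print(manifest_path, science_path)
```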

Skills Overview

Skill	Tools	Type
`ari-skill-hpc`	`slurm_submit`, `job_status`, `job_cancel`, `run_bash`, `singularity_*`	Deterministic
`ari-skill-evaluator`	`make_metric_spec`	Deterministic △
`ari-skill-idea`	`survey`, `generate_ideas`	LLM ✷
`ari-skill-web`	`web_search`, `fetch_url`, `search_arxiv`, `search_papers`, `collect_references_iterative`, `list_uploaded_files`, `read_uploaded_file`	Partial LLM △
`ari-skill-memory`	`add_memory`, `search_memory`, `get_node_memory`, `clear_node_memory`, `get_experiment_context`	Letta-backed △
`ari-skill-transform`	`nodes_to_science_data`, `generate_ear`	LLM ✷
`ari-skill-plot`	`generate_figures`, `generate_figures_llm`	LLM ✷
`ari-skill-paper`	`write_paper_iterative`, `review_compiled_paper`, `generate_section`, ...	LLM ✷
`ari-skill-paper-re`	`extract_repro_config`, `build_repro_report`, `extract_metric_from_output`	LLM ✷
`ari-skill-coding`	`write_code`, `run_code`, `read_file`, `run_bash`	Deterministic
`ari-skill-benchmark`	`analyze_results`, `plot`, `statistical_test`	Deterministic
`ari-skill-vlm`	`review_figure`, `review_table`	LLM ✷
`ari-skill-orchestrator`	`run_experiment`, `get_status`, `list_runs`, `list_children`, `get_paper`	Deterministic

✷ = skill whose tools use an LLM (explicitly annotated). △ = LLM in some tools only. 13 skills total (12 default, 1 additional).

Adding a Skill

Create the skill directory

ari-skill-yourskill/
├── src/server.py
├── tests/test_server.py
└── pyproject.toml

Implement the server

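from mcp.server.fastmcp import FastMCP

mcp = FastMCP("yourskill")


def pure_computation(param: str) -> str:
    """Placeholder for your deterministic logic (hypothetical helper)."""
    return param.strip().lower()


@mcp.tool()
def your_tool(param: str) -> dict:
    """Clear description for the LLM."""
    result = pure_computation(param)   # no LLM calls here
    return {"result": result}


if __name__ == "__main__":
    mcp.run()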

Register in workflow.yaml

skills:
  - name: yourskill
    path: '{{ari_root}}/ari-skill-yourskill'

Add a pipeline stage

pipeline:
  - stage: your_stage
    skill: yourskill
    tool: your_tool
    inputs:
      param: '{{paper_context}}'
    outputs:
      file: '{{ckpt}}/your_output.json'

LLM Configuration

# OpenAI
export ARI_LLM_MODEL=openai/gpt-4o
export OPENAI_API_KEY=sk-...

# Anthropic
export ARI_LLM_MODEL=anthropic/claude-sonnet-4-5
export ANTHROPIC_API_KEY=sk-ant-...

# Local Ollama (free, no API key)
export ARI_LLM_MODEL=qwen3:32b
export LLM_API_BASE=http://127.0.0.1:11434

# Any OpenAI-compatible API (vLLM, LM Studio, etc.)
export ARI_LLM_MODEL=your-model-name
export LLM_API_BASE=http://your-server:8000/v1

Note: New models not in litellm's known list require an explicit provider prefix: openai/gpt-5.2, not just gpt-5.2.

HPC / Execution Backend

ARI uses a pluggable executor model. Set ARI_EXECUTOR to match your environment — no code changes needed.

# Environment variables
export ARI_EXECUTOR=slurm    # local | slurm | pbs | lsf
export ARI_SLURM_PARTITION=your_partition  # SLURM only

# workflow.yaml (reproducibility stage, legacy react_driver form; v0.7.0 replaces it with the PaperBench-format stages)
- stage: reproducibility_check
  skill: paper-re-skill
  pre_tool: extract_repro_config      # one-shot LLM, extracts claimed value
  post_tool: build_repro_report       # one-shot LLM, builds verdict
  react:
    agent_phase: reproduce            # only MCP skills opted into this phase are visible
    max_steps: 40
    final_tool: report_metric
    sandbox: '{{checkpoint_dir}}/repro_sandbox'
  inputs:
    paper_path: '{{checkpoint_dir}}/full_paper.tex'
    tolerance_pct: 5.0

For BFTS experiments, configure the HPC backend in default.yaml:

hpc:
  mode: slurm          # or "local" for laptop
  scheduler: slurm
  max_nodes: 4
  max_walltime: "04:00:00"

To run without a cluster, set ARI_EXECUTOR=local. ARI will execute experiments as local subprocesses.
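
A pluggable executor of this kind is often a small registry keyed by name. The sketch below is one way to picture it, under the assumption that `local` is the fallback when `ARI_EXECUTOR` is unset; the function names and the registry are hypothetical, and the SLURM branch is deliberately a stub.

```python
import os
import subprocess
import sys


def run_local(command: list[str]) -> str:
    """Run the experiment as a local subprocess and return its stdout."""
    done = subprocess.run(command, capture_output=True, text=True, check=True)
    return done.stdout


def submit_slurm(command: list[str]) -> str:
    raise NotImplementedError("would hand the command to sbatch on a cluster")


EXECUTORS = {"local": run_local, "slurm": submit_slurm}


def get_executor(env=os.environ):
    name = env.get("ARI_EXECUTOR", "local")   # default value is an assumption
    try:
        return EXECUTORS[name]
    except KeyError:
        raise ValueError(f"unknown ARI_EXECUTOR: {name!r}") from None


executor = get_executor({"ARI_EXECUTOR": "local"})
output = executor([sys.executable, "-c", "print('score=1.5')"])
print(output)
```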

Experiment Monitor (`ari viz`)

ARI ships a real-time experiment tree visualiser. It shows every BFTS node, its status, metrics, and the full tool-call trace — all in a browser.


Starting the monitor

ari viz --checkpoint <ckpt_dir> --port 9878

Open http://localhost:9878 in any browser. The dashboard polls /state and reconnects over WebSocket automatically.

Node detail panel

Click any node circle to open the four-tab detail panel:

Overview — ID, status, type, execution time, parent, metrics, evaluation summary.
Trace — Ordered list of every MCP tool call the agent made (name · step · result snippet). Fetched live from /memory/<node_id>.
Code — Generated source files stored in node artifacts.
Output — SLURM stdout / benchmark results stored in node artifacts.

Status indicators

Green circle — success
Red circle — failed
Blue circle — running
Grey circle — pending

The footer shows total node count, per-status counts, and the best metric value seen so far.

Architecture

The viz server (ari/viz/server.py) is an asyncio HTTP + WebSocket handler built on the Python standard library; its only external dependency is the websockets package, which ARI already installs. The dashboard is a React/TypeScript SPA built with Vite (ari/viz/frontend/), with modular components for each page (Home, Experiments, Monitor, Tree, Results, Wizard, Idea, Workflow, Settings). The production build is output to ari/viz/static/dist/ and served by the Python server.

Agent Memory

📄 Canonical source: this section is a summary — see concepts/memory.md for the authoritative, always-current detail.

As of v0.6.0, ARI keeps per-experiment memory in Letta (formerly MemGPT). Each checkpoint gets a dedicated Letta agent ari_agent_<hash> with two archival collections and a seeded core-memory block. A portable snapshot ({checkpoint}/memory_backup.jsonl.gz) is written automatically at pipeline-stage boundaries and on shutdown so the checkpoint directory stays self-contained.

ari_node_<hash> — ancestor-scoped node entries (add_memory / search_memory)
ari_react_<hash> — flat ReAct-trace collection written by LettaMemoryClient
Core memory: stable experiment facts (goal, primary metric, hardware) read via get_experiment_context()
{checkpoint}/memory_access.jsonl — append-only write/read telemetry for the Tree dashboard

Write-side tools enforce Copy-on-Write (a child cannot mutate an ancestor's entries) and search_memory strictly filters by the ancestor_ids you pass, so sibling branches never cross-contaminate. Letta self-edit is disabled by default via ARI_MEMORY_LETTA_DISABLE_SELF_EDIT=true.
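
The isolation rule can be pictured with a toy in-memory store. This is a sketch of the idea only: the `NodeMemory` class is hypothetical, and plain substring matching stands in for Letta's embedding-based retrieval.

```python
from dataclasses import dataclass, field


@dataclass
class NodeMemory:
    entries: list[dict] = field(default_factory=list)

    def add_memory(self, node_id: str, text: str) -> None:
        # Writes only ever append under the writer's own id, so a child
        # never modifies an entry that belongs to an ancestor.
        self.entries.append({"node_id": node_id, "text": text})

    def search_memory(self, query: str, ancestor_ids: list[str]) -> list[str]:
        allowed = set(ancestor_ids)
        return [e["text"] for e in self.entries
                if e["node_id"] in allowed and query.lower() in e["text"].lower()]


memory = NodeMemory()
memory.add_memory("root", "baseline score 100 with -O2")
memory.add_memory("child_a", "score 140 after loop tiling")
memory.add_memory("child_b", "score 90 with -O3, regression")

# A descendant of child_a searches with its own ancestor chain only.
hits = memory.search_memory("score", ancestor_ids=["root", "child_a"])
print(hits)
```

The entry written by the sibling `child_b` matches the query but is filtered out, which is the cross-contamination guarantee described above.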

The Trace tab in the Experiment Monitor reads these entries live through the Letta-backed library. Manage the backend with ari memory (migrate / backup / restore / start-local / …).

v0.7.0 note: on Letta 0.16.x the SDK call passages.list(search=q) is a SQL substring filter (LOWER(text) LIKE LOWER(%q%)), not a semantic search, so long natural-language queries silently returned 0 ancestor entries against real data. search_memory now uses passages.search (GET /archival-memory/search, embed_query=True) with top_k = max(letta_overfetch, limit*40) and post-filters by ancestor_ids + ari_checkpoint locally. The embedding cost paid on every add_memory insert is therefore actually consumed by retrieval: children see ancestor entries ranked by relevance to their eval_summary query.
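
The reason for over-fetching is that the ancestor filter runs after ranking: if only `limit` passages were requested, they could all belong to other branches or checkpoints. A sketch of the local post-filter follows, with the Letta call replaced by an already-ranked list; the `node_id` field name and the default of 200 for `letta_overfetch` are assumptions.

```python
def search_with_overfetch(ranked_hits: list[dict], ancestor_ids: list[str],
                          checkpoint: str, limit: int,
                          letta_overfetch: int = 200) -> list[dict]:
    """ranked_hits: passages already ordered by relevance, best first."""
    top_k = max(letta_overfetch, limit * 40)
    allowed = set(ancestor_ids)
    kept = [h for h in ranked_hits[:top_k]
            if h["node_id"] in allowed and h["ari_checkpoint"] == checkpoint]
    return kept[:limit]


ranked = [
    {"node_id": "sibling", "ari_checkpoint": "ckpt1", "text": "s"},
    {"node_id": "root", "ari_checkpoint": "ckpt1", "text": "r"},
    {"node_id": "root", "ari_checkpoint": "other", "text": "x"},
    {"node_id": "parent", "ari_checkpoint": "ckpt1", "text": "p"},
]
top = search_with_overfetch(ranked, ["root", "parent"], "ckpt1", limit=1)
print([h["text"] for h in top])
```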

Publication Lifecycle (v0.7.0)

v0.7.0 turns the Experiment Artifact Repository (EAR) into a curated, digest-anchored publication chain. The author writes a small ear/publish.yaml allowlist; ari-core enforces a built-in deny list (.env*, secrets/**, *.pem, …) and computes a deterministic bundle_sha256 over a canonical {path,sha256,size} manifest. The digest is baked into the paper's \codedigest{...} macro by the finalize_paper stage, so any reader can verify the bundle from the paper alone — even if the registry hosting it disappears.

ari ear curate <ckpt> — deterministic, no LLM. Produces ear_published/ + manifest.lock.
ari ear publish <ckpt> --backend ari-registry|local-tarball|gh|zenodo — always starts at visibility=staged (FR-P5).
ari ear promote <ckpt> --target public|unlisted — visibility moves up only.
ari clone <ref> --expect-sha256 <hex> — reader-side fetch + verify. Resolvers: file://, https://, ari://<id>, gh:<u>/<r>, doi:<doi>. No code execution.
ari registry serve — optional self-hosted FastAPI registry with sqlite-backed bearer tokens, content-addressed artefact storage, and three deploy modes (local uvicorn, docker-compose, Apptainer). See docs/registry.md.
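
To make the digest chain concrete, the sketch below computes a deterministic digest over a `{path, sha256, size}` manifest and checks it against an expected value, as a reader would with `--expect-sha256`. The exact canonical form ARI uses is not specified here, so sorted paths plus compact, key-sorted JSON is an assumption; the deny patterns are an illustrative subset, and the `ear/publish.yaml` allowlist is not modelled.

```python
import fnmatch
import hashlib
import json

DENY = [".env*", "secrets/*", "*.pem"]   # illustrative subset of the deny list


def bundle_sha256(files: dict[str, bytes]) -> str:
    """files maps a relative path to its content."""
    manifest = []
    for path in sorted(files):                           # stable order
        if any(fnmatch.fnmatchcase(path, pat) for pat in DENY):
            continue                                     # never published
        data = files[path]
        manifest.append({"path": path,
                         "sha256": hashlib.sha256(data).hexdigest(),
                         "size": len(data)})
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify(files: dict[str, bytes], expect_sha256: str) -> bool:
    return bundle_sha256(files) == expect_sha256


author = {"run.sh": b"echo hi\n", "data/results.csv": b"1,2\n", ".env": b"TOKEN=x"}
reader = {"data/results.csv": b"1,2\n", "run.sh": b"echo hi\n"}
digest = bundle_sha256(author)
print(digest, verify(reader, digest))
```

Because the manifest is sorted and the denied `.env` file is excluded, the author and the reader arrive at the same digest regardless of file order, while any change to a published file changes it.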

Reproducibility — PaperBench-format (v0.7.0)

📄 Canonical source: this section is a summary — see guides/paperbench/paperbench_quickstart.md for the authoritative, always-current detail.

The legacy LLM-driven verdict path (extract_repro_config → react_driver → build_repro_report) is replaced by a deterministic two-phase grading flow compatible with PaperBench (arXiv:2504.01848). Three pipeline stages run in sequence:

ors_generate_rubric (new ari-skill-replicate) — auto-generates a frozen PaperBench TaskNode rubric from the final paper. Default two-stage generation: a skeleton pass defines the root + per-contribution direct children with leaf budgets, then parallel subtree passes recursively populate each child to depth 4–6. Yields ~4× more leaves and 1–2 levels more depth than the legacy single-call path, at ~5× more API tokens; toggle off via the GUI Wizard "Two-stage generation" checkbox or ARI_RUBRIC_GEN_TWO_STAGE=0 for cheap runs. audit_rubric flags vague/unverifiable leaves and recommends regeneration when >20% are flagged.
ors_run_reproduce (paper-re — Phase 1) — executes reproduce.sh in a sandbox: docker → apptainer → singularity → local (overridable via ARI_PHASE1_SANDBOX). Captures reproduce.log and reports any rubric-declared expected_artifacts that did not appear.
ors_grade (paper-re — Phase 2) — runs PaperBench SimpleJudge over the rubric leaves n_runs times (default 3), averages the weighted leaf scores, and runs a one-off negative control (empty repo + trivial reproduce.sh). Both controls must score <5% (passed=true) for the rubric to be considered honest.
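
The Phase 2 arithmetic can be sketched as below. The judge is replaced by precomputed leaf scores, and the recursive weighted average is an assumption about how leaf scores roll up the rubric tree; the node layout (`id`, `weight`, `children`) is hypothetical.

```python
from statistics import mean


def tree_score(node: dict, leaf_scores: dict[str, float]) -> float:
    """Weighted average of children, recursing until graded leaves are reached."""
    children = node.get("children", [])
    if not children:
        return leaf_scores[node["id"]]
    total_weight = sum(c["weight"] for c in children)
    return sum(c["weight"] * tree_score(c, leaf_scores)
               for c in children) / total_weight


def grade(rubric: dict, runs: list[dict[str, float]],
          control: dict[str, float]) -> dict:
    score = mean(tree_score(rubric, r) for r in runs)    # average over n_runs
    control_score = tree_score(rubric, control)
    # The negative control must stay under 5% for the rubric to be trusted.
    return {"score": score, "control_score": control_score,
            "passed": control_score < 0.05}


rubric = {"id": "root", "children": [
    {"id": "env", "weight": 1, "children": []},
    {"id": "results", "weight": 3, "children": []},
]}
runs = [
    {"env": 1.0, "results": 1.0},
    {"env": 1.0, "results": 0.0},
    {"env": 1.0, "results": 1.0},
]
report = grade(rubric, runs, control={"env": 0.0, "results": 0.0})
print(report)
```

In this toy case the three runs score 1.0, 0.25 and 1.0, so the averaged score is 0.75, and the empty-repo control scores 0.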

Paper-Audit Mode — PaperBench-format (v0.7.2)

📄 Canonical source: this section is a summary — see reference/rubric_schema.md for the authoritative, always-current detail.

v0.7.2 adds the audit-side companion to the agent-side reproducibility check. Where the agent-side asks “can an AI agent reproduce this paper?”, the audit-side inverts the framing to “does this paper describe enough to be reproducible?” — using the same vendored PaperBench rubric formalism with zero vendor changes. Driven by the SC reproducibility framework (Artifact Description / Artifact Evaluation Appendices) and generalised to NeurIPS Reproducibility Checklist and Nature Reporting Summary venues.

Three orthogonal mechanisms lift the audit score above a low single-digit paper-only baseline (0.033 measured on LLAMP / sc24-00070). An earlier “Step 4 reproduction-package generator” was removed as off-protocol: it asked an LLM to write a fake reproduce.log transcribing the paper's own table/figure numbers as if they were observed values, bypassing PaperBench's Step 2 (executed submission). The previously-claimed audit score of 0.857 obtained with that shortcut is retracted.

Venue-conditioned rubric templates — ari-core/config/paperbench_rubrics/<id>.yaml declares mode: paper_audit with a fixed set of top_level_axes. The bundled sc.yaml emits six SC-specific axes (environment / data / execution / figures / scaling / conclusion) per HPC PaperBench audit research plan §5 Step 3. Generator code in ari-skill-replicate stays domain-agnostic; venue knowledge lives in YAML. See rubric schema.
Multimodal expander — pymupdf4llm converts the paper PDF to markdown with ![](images/img-N.png) references; an expander in _litellm_completer.async_completion rewrites these into OpenAI multimodal content blocks so vision-capable judges see figures alongside text. Vendor SimpleJudge unchanged.
paper_audit prompt patch in ari-skill-paper-re/src/_paperbench_bridge.py — monkey-patches PaperBench's TASK_CATEGORY_QUESTIONS (which asks “did the submission do X?”) at call scope to ask “does the paper describe X with concrete specificity?”. Restored on exit so agent-benchmark calls are unaffected. Breaks the structural ceiling that capped paper_audit mean score around 0.3.
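
The call-scoped patch in the third mechanism follows a common patch-and-restore pattern, sketched here with a context manager. The dictionary below is a local stand-in for the vendored `TASK_CATEGORY_QUESTIONS`; its key and question wording are illustrative, and `paper_audit_questions` is a hypothetical name.

```python
from contextlib import contextmanager

# Stand-in for the vendored module-level mapping.
TASK_CATEGORY_QUESTIONS = {"Code Execution": "Did the submission do X?"}


@contextmanager
def paper_audit_questions(overrides: dict[str, str]):
    original = dict(TASK_CATEGORY_QUESTIONS)
    TASK_CATEGORY_QUESTIONS.update(overrides)
    try:
        yield TASK_CATEGORY_QUESTIONS
    finally:
        # Restore in place so every holder of the dict sees the original again,
        # even if the judge call inside the block raised.
        TASK_CATEGORY_QUESTIONS.clear()
        TASK_CATEGORY_QUESTIONS.update(original)


audit = {"Code Execution": "Does the paper describe X with concrete specificity?"}
with paper_audit_questions(audit):
    during = TASK_CATEGORY_QUESTIONS["Code Execution"]
after = TASK_CATEGORY_QUESTIONS["Code Execution"]
print(during, "|", after)
```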

CLI dogfood: python scripts/sc_paper_dogfood.py --pdf paper.pdf --rubric-template sc --judge-dryrun. The --paper-extras AD.pdf AE.pdf flag concatenates Artifact Description and Artifact Evaluation Appendices so the audit can test HPC PaperBench audit research plan hypothesis 2 (paper-only vs +AD+AE).
