ARI Documentation

Technical reference for developers and researchers.

Overview

ARI automates the full research cycle: hypothesis generation → experiment execution → paper writing → reproducibility verification. The system is built on three layers:

┌────────────────────────────────────────────┐
│            experiment.md / CLI             │
└──────────────────┬─────────────────────────┘
                   │
┌──────────────────▼─────────────────────────┐
│                  ari-core                  │
│    BFTS Engine → ReAct Loop → Pipeline     │
└──────────────────┬─────────────────────────┘
                   │ MCP protocol
     ┌─────────────┼─────────────────┐
     │             │                 │
   Skills        Skills            Skills
(deterministic) (deterministic) (LLM-annotated)
  hpc, web,     idea, memory,    paper, plot,
  evaluator       transform        paper-re

Key principle: MCP skills are deterministic where possible. LLM-using tools are explicitly annotated. Default skills using LLM: idea-skill, transform-skill, plot-skill, paper-skill, paper-re-skill, vlm-skill, web-skill (partial), memory-skill (Letta embedding retrieval, △).

Installation

  1. Clone the repository
    git clone https://github.com/kotama7/ARI && cd ARI
  2. Run setup
    bash setup.sh
  3. Install LaTeX
    # With conda (no sudo needed)
    conda install -c conda-forge texlive-core
    
    # Or system package
    sudo apt install texlive-full        # Debian/Ubuntu
    sudo dnf install texlive             # RHEL/CentOS
  4. Set the LLM backend: see LLM Configuration

First Run

# Minimal experiment file
cat > experiment.md << 'EOF'
## Research Goal
Maximize the target metric for my experiment on this machine.
<!-- metric_keyword: score -->
EOF

# Run
ari run experiment.md

# With custom config
ari run experiment.md --config ari-core/config/workflow.yaml

# On a SLURM cluster
sbatch your_pipeline_job.sh

Output files appear in the checkpoint directory:

| File | Description |
| --- | --- |
| nodes_tree.json | BFTS search tree (all explored configurations) |
| science_data.json | Science-facing data (no internal terms). v0.7.0 adds a typed split on each configurations[*]: parameters (input knobs, never a headline result), measurements (measured outputs), predictions (model ceilings), and scores (derived ratios), sourced from the coding-skill emit_results contract or the LLM-evaluator's typed split. summary_stats reduces in a direction-aware way over a known primary_metric instead of taking max() over every key. |
| related_refs.json | arXiv references |
| figures_manifest.json | Generated figure paths and captions |
| full_paper.tex / .pdf | Generated paper |
| review_report.json | Rubric-driven paper review (AI Scientist v1/v2-compatible): scores, strengths, weaknesses, decision. When N>1, includes ensemble_reviews[] (the N individual reviews) and meta_review (Area Chair aggregation) inline. |
| vlm_review.json | Per-figure VLM findings (score, issues, suggestions), piped into paper review as reviewer notes |
| ors_rubric.json / ors_phase1.json / ors_grade.json | PaperBench-format reproducibility (v0.7.0): auto-rubric, Phase 1 sandbox run, Phase 2 SimpleJudge grading. Replaces the v0.6.0 reproducibility_report.json. |
| ear/ + ear_published/ + manifest.lock + publish_record.json | Experiment Artifact Repository, curated bundle, deterministic bundle_sha256, publish backend record (v0.7.0). |

Want to see what ARI produces? Download the sample paper (PDF): a 10-page Stratum-Roofline CSR-SpMM study on Fujitsu A64FX/SVE-512 generated by an actual ARI run, including figures, citations, and reproducibility verification.

Rubric-driven paper review (v0.6.0+)

The paper review phase is venue-agnostic and driven by YAML rubrics in ari-core/config/reviewer_rubrics/. The 16 bundled rubrics cover ML conferences (neurips, the default and AI Scientist v2-compatible; iclr, icml, cvpr, acl), systems/HPC (sc, osdi, usenix_security), theory/graphics (stoc, siggraph), HCI/robotics (chi, icra), and journals/generic (nature, journal_generic, workshop, generic_conference), plus a built-in legacy fallback (v0.5 schema). Drop in a new YAML file to add any venue; no code changes are needed.

Each rubric declares score_dimensions (soundness / presentation / contribution / …), text_sections (strengths / weaknesses / questions / limitations), decision rules (threshold or categorical), and execution parameters (reflection rounds, few-shot example count, ensemble size, temperature). Defaults follow the Nature Ablation (Appendix A.4 of arXiv:2408.06292): num_reflections=5, num_fs_examples=1, num_reviews_ensemble=1, temperature=0.75.

Configure per run via CLI flags (--rubric, --fewshot-mode, --num-reviews-ensemble, --num-reflections) or environment variables (ARI_RUBRIC, ARI_FEWSHOT_MODE, ARI_NUM_REVIEWS_ENSEMBLE, ARI_NUM_REFLECTIONS). The New Experiment Wizard exposes the same settings under “Paper Review”.

A Few-shot Examples sub-panel inside the Wizard lists the examples currently available for the selected rubric and provides three one-click actions: Auto-sync pulls the corpus declared in scripts/fewshot/manifest.yaml (including the three AI Scientist v2 samples from GitHub); Upload accepts a JSON review form + optional .txt excerpt + optional PDF; Delete removes an example. Each action hits /api/fewshot/<rubric>/{sync,upload,delete}, which refuses any rubric not present in reviewer_rubrics/ and strips path-traversal sequences from inputs.
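
The two server-side guards described above (rubric allowlisting and path-traversal stripping) can be illustrated with a minimal sketch. The function names below are hypothetical and are not taken from the ARI source; the sketch assumes each rubric is a `<id>.yaml` file in the rubric directory.

```python
from pathlib import Path


def list_rubrics(rubric_dir: Path) -> set[str]:
    """Rubric ids are assumed to be the stems of the YAML files in rubric_dir."""
    return {p.stem for p in rubric_dir.glob("*.yaml")}


def check_rubric(rubric: str, rubric_dir: Path) -> str:
    """Refuse any rubric id that has no YAML file in rubric_dir."""
    if rubric not in list_rubrics(rubric_dir):
        raise ValueError(f"unknown rubric: {rubric!r}")
    return rubric


def safe_filename(name: str) -> str:
    """Keep only the final path component so '../x' cannot escape the target dir."""
    cleaned = Path(name.replace("\\", "/")).name
    if cleaned in ("", ".", ".."):
        raise ValueError(f"unsafe filename: {name!r}")
    return cleaned
```

Checking against the set of known ids, rather than trying to sanitise the rubric string, means a traversal attempt such as `../neurips` is rejected outright because it is simply not a known id.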

Architecture

📄 Canonical source: this section is a summary — see concepts/architecture.md for the authoritative, always-current detail.

BFTS — Best-First Tree Search

ARI uses Best-First Tree Search to explore the hypothesis space. The LLM selects the most promising node to expand next, guided by real measurement data. Controlled via ari-core/config/default.yaml:

bfts:
  max_total_nodes: 50      # maximum nodes to explore
  max_depth: 5             # tree depth limit
  max_parallel_nodes: 4    # concurrent experiments
  score_threshold: 0.3     # minimum score to expand
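
To make the role of these limits concrete, here is a minimal best-first search sketch. It is not ARI's engine: ARI lets the LLM choose the next node, whereas this sketch substitutes a plain numeric score and a priority queue, and it ignores max_parallel_nodes. The `expand` and `score` callbacks are hypothetical.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Node:
    neg_score: float                       # heapq is a min-heap, so store -score
    name: str = field(compare=False)
    depth: int = field(compare=False, default=0)


def best_first_search(root, expand, score,
                      max_total_nodes=50, max_depth=5, score_threshold=0.3):
    """Explore the highest-scoring frontier node first, within the configured limits."""
    frontier = [Node(-score(root), root, 0)]
    explored = []
    while frontier and len(explored) < max_total_nodes:
        node = heapq.heappop(frontier)
        node_score = -node.neg_score
        explored.append((node.name, node_score))
        # Nodes at the depth limit or below the threshold are recorded but not expanded.
        if node.depth >= max_depth or node_score < score_threshold:
            continue
        for child in expand(node.name):
            heapq.heappush(frontier, Node(-score(child), child, node.depth + 1))
    return explored


# Toy tree: "b" scores below the threshold, so its child "b1" is never explored.
children = {"root": ["a", "b"], "a": ["a1"], "b": ["b1"]}
scores = {"root": 0.5, "a": 0.9, "b": 0.2, "a1": 0.4, "b1": 0.95}
order = best_first_search("root", lambda n: children.get(n, []), scores.__getitem__)
print([name for name, _ in order])  # ['root', 'a', 'a1', 'b']
```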

ReAct Loop

Each node runs: Reason → Act (tool call) → Observe → Reason... until a JSON result is produced. The agent automatically polls async HPC jobs without consuming step budget.
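
The loop can be sketched as follows. This is an illustration under stated assumptions, not ARI's implementation: the `reason` callback, the decision dictionaries, and the `"PENDING"` state name are all hypothetical, and a production loop would also need a wall-clock timeout so that free polling cannot run forever.

```python
def react_loop(reason, tools, max_steps=10):
    """reason(history) returns {"tool": name, "args": {...}} or {"final": {...}}."""
    history = []
    steps = 0
    while steps < max_steps:
        decision = reason(history)
        if "final" in decision:
            return decision["final"]          # the JSON result ends the loop
        observation = tools[decision["tool"]](**decision.get("args", {}))
        history.append((decision, observation))
        # Assumption: polling a job that is still pending does not use step budget.
        is_free_poll = (decision["tool"] == "job_status"
                        and observation.get("state") == "PENDING")
        if not is_free_poll:
            steps += 1
    raise RuntimeError("step budget exhausted without a JSON result")


# Scripted stand-ins for the LLM and the HPC tools.
statuses = iter(["PENDING", "PENDING", "DONE"])
tools = {
    "slurm_submit": lambda script: {"job_id": 1},
    "job_status": lambda job_id: {"state": next(statuses)},
}


def scripted_reason(history):
    if not history:
        return {"tool": "slurm_submit", "args": {"script": "/abs/job.sh"}}
    if history[-1][1].get("state") == "DONE":
        return {"final": {"score": 1.0}}
    return {"tool": "job_status", "args": {"job_id": 1}}


result = react_loop(scripted_reason, tools, max_steps=3)
```

With a budget of three steps, the run above succeeds only because the two pending polls are free; counting them would exhaust the budget before the result is returned.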

Post-BFTS Pipeline

After BFTS completes, workflow.yaml drives a sequential pipeline. Stages are idempotent — re-runs skip already-completed stages.
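
A minimal sketch of that idempotency check, assuming it is driven by the skip_if_exists field shown in workflow.yaml. The `run_pipeline` and `run_stage` names are hypothetical.

```python
from pathlib import Path


def run_pipeline(stages, run_stage):
    """Run stages in order, skipping any whose skip_if_exists path is already present."""
    executed = []
    for stage in stages:
        marker = stage.get("skip_if_exists")
        if marker and Path(marker).exists():
            continue                      # output already on disk: stage is complete
        run_stage(stage)
        executed.append(stage["stage"])
    return executed
```

Because completion is inferred from the output file alone, deleting that file is enough to force a stage to run again on the next invocation.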

Experiment Files

Experiment files are Markdown. No code changes needed — all domain knowledge lives here.

Minimal (3 lines)

## Research Goal
Maximize the target metric for my experiment on this machine.
<!-- metric_keyword: score -->

Full Reference

# Experiment Title

## Research Goal
Describe the optimization objective in plain language.
The LLM reads this to generate hypotheses.

## Required Workflow
1. Call `survey` to find related literature
2. Call `slurm_submit` with a SLURM script
3. Call `job_status` to wait for completion
4. Call `run_bash` to read the output file
5. Return JSON with measured values

## Hardware Limits
- Max CPUs: 64
- Compiler: gcc only

## SLURM Script Template
```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --cpus-per-task=32
#SBATCH --time=00:30:00
python run_experiment.py

```

## Rules
- HARD LIMIT: never exceed 64 threads
- Always use absolute paths in slurm_submit

<!-- metric_keyword: score -->
<!-- min_expected_metric: 100 -->

| Section | Required | Purpose |
| --- | --- | --- |
| ## Research Goal |  | Drives LLM hypothesis generation |
| ## Required Workflow |  | Sets tool execution sequence |
| ## Hardware Limits |  | Hard constraints injected at every step |
| ## SLURM Script Template |  | Starting point for LLM modifications |
| ## Rules |  | Agent constraints and invariants |
| <!-- metric_keyword --> |  | Metric name for extraction |
| <!-- min_expected_metric --> |  | Minimum acceptable value |

workflow.yaml

The single configuration file for the post-BFTS pipeline. Adding or reordering stages requires only YAML changes — no code changes.

version: '1'
slurm_partition: your_partition
author_name: "Your Name or Organization"

skills:
  - name: paper-skill
    path: '{{ari_root}}/ari-skill-paper'
    description: LaTeX paper writing

pipeline:
  - stage: transform_data           # NEW: strips BFTS internals
    skill: transform-skill
    tool: nodes_to_science_data
    inputs:
      nodes_json_path: '{{ckpt}}/nodes_tree.json'
    outputs:
      file: '{{ckpt}}/science_data.json'
    skip_if_exists: '{{ckpt}}/science_data.json'

  - stage: generate_figures
    skill: plot-skill
    tool: generate_figures_llm
    depends_on: [transform_data]
    inputs:
      science_data_path: '{{stages.transform_data.outputs.file}}'
      output_dir: '{{ckpt}}'
    outputs:
      file: '{{ckpt}}/figures_manifest.json'

  - stage: write_paper
    skill: paper-skill
    tool: write_paper_iterative
    depends_on: [generate_figures, search_related_work]
    inputs:
      refs_json: '{{stages.search_related_work.outputs.file}}'
      figures_manifest_json: '{{ckpt}}/figures_manifest.json'
      venue: arxiv
    outputs:
      file: '{{ckpt}}/full_paper.tex'

Template Variables

| Variable | Value |
| --- | --- |
| {{ckpt}} | Checkpoint directory (absolute path) |
| {{ari_root}} | ARI project root |
| {{paper_context}} | Science-facing experiment summary |
| {{stages.NAME.outputs.file}} | Primary output of stage NAME |
| {{author_name}} | Top-level field from workflow.yaml |
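
How such variables might be resolved can be sketched in a few lines. This is an assumption about the mechanism, not ARI's actual renderer: it treats each `{{a.b.c}}` as a dotted lookup into a nested dictionary, and the `render` function is hypothetical.

```python
import re

_VAR = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def render(template: str, context: dict) -> str:
    """Replace each {{dotted.name}} with the matching value from context."""
    def lookup(match):
        value = context
        for part in match.group(1).split("."):
            value = value[part]           # KeyError signals an undefined variable
        return str(value)
    return _VAR.sub(lookup, template)


context = {
    "ckpt": "/tmp/ckpt",
    "stages": {"transform_data": {"outputs": {"file": "/tmp/ckpt/science_data.json"}}},
}
print(render("{{ckpt}}/nodes_tree.json", context))  # /tmp/ckpt/nodes_tree.json
```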

Skills Overview

| Skill | Tools | Type |
| --- | --- | --- |
| ari-skill-hpc | slurm_submit, job_status, job_cancel, run_bash, singularity_* | Deterministic |
| ari-skill-evaluator | make_metric_spec | Deterministic △ |
| ari-skill-idea | survey, generate_ideas | LLM ✷ |
| ari-skill-web | web_search, fetch_url, search_arxiv, search_papers, collect_references_iterative, list_uploaded_files, read_uploaded_file | Partial LLM △ |
| ari-skill-memory | add_memory, search_memory, get_node_memory, clear_node_memory, get_experiment_context | Letta-backed △ |
| ari-skill-transform | nodes_to_science_data, generate_ear | LLM ✷ |
| ari-skill-plot | generate_figures, generate_figures_llm | LLM ✷ |
| ari-skill-paper | write_paper_iterative, review_compiled_paper, generate_section, ... | LLM ✷ |
| ari-skill-paper-re | extract_repro_config, build_repro_report, extract_metric_from_output | LLM ✷ |
| ari-skill-coding | write_code, run_code, read_file, run_bash | Deterministic |
| ari-skill-benchmark | analyze_results, plot, statistical_test | Deterministic |
| ari-skill-vlm | review_figure, review_table | LLM ✷ |
| ari-skill-orchestrator | run_experiment, get_status, list_runs, list_children, get_paper | Deterministic |

✷ LLM-using tools are explicitly annotated. △ = LLM in some tools only. 13 skills total (12 default, 1 additional).

Adding a Skill

  1. Create the skill directory
    ari-skill-yourskill/
    ├── src/server.py
    ├── tests/test_server.py
    └── pyproject.toml
  2. Implement the server
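    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("yourskill")

    def pure_computation(param: str) -> str:
        """Placeholder for your deterministic logic; replace with real work."""
        return param.strip().lower()

    @mcp.tool()
    def your_tool(param: str) -> dict:
        """Clear description for the LLM."""
        result = pure_computation(param)   # no LLM calls here
        return {"result": result}

    if __name__ == "__main__":
        mcp.run()   # serve the registered tools over MCP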
  3. Register in workflow.yaml
    skills:
      - name: yourskill
        path: '{{ari_root}}/ari-skill-yourskill'
  4. Add a pipeline stage
    pipeline:
      - stage: your_stage
        skill: yourskill
        tool: your_tool
        inputs:
          param: '{{paper_context}}'
        outputs:
          file: '{{ckpt}}/your_output.json'

LLM Configuration

# OpenAI
export ARI_LLM_MODEL=openai/gpt-4o
export OPENAI_API_KEY=sk-...

# Anthropic
export ARI_LLM_MODEL=anthropic/claude-sonnet-4-5
export ANTHROPIC_API_KEY=sk-ant-...

# Local Ollama (free, no API key)
export ARI_LLM_MODEL=qwen3:32b
export LLM_API_BASE=http://127.0.0.1:11434

# Any OpenAI-compatible API (vLLM, LM Studio, etc.)
export ARI_LLM_MODEL=your-model-name
export LLM_API_BASE=http://your-server:8000/v1

Note: Models not yet in litellm's known-model list require an explicit provider prefix, for example openai/gpt-5.2 rather than gpt-5.2.

HPC / Execution Backend

ARI uses a pluggable executor model. Set ARI_EXECUTOR to match your environment — no code changes needed.

# Environment variables
export ARI_EXECUTOR=slurm    # local | slurm | pbs | lsf
export ARI_SLURM_PARTITION=your_partition  # SLURM only

# workflow.yaml (reproducibility stage — react_driver form)
- stage: reproducibility_check
  skill: paper-re-skill
  pre_tool: extract_repro_config      # one-shot LLM, extracts claimed value
  post_tool: build_repro_report       # one-shot LLM, builds verdict
  react:
    agent_phase: reproduce            # only MCP skills opted into this phase are visible
    max_steps: 40
    final_tool: report_metric
    sandbox: '{{checkpoint_dir}}/repro_sandbox'
  inputs:
    paper_path: '{{checkpoint_dir}}/full_paper.tex'
    tolerance_pct: 5.0

For experiments (BFTS), configure in default.yaml:

hpc:
  mode: slurm          # or "local" for laptop
  scheduler: slurm
  max_nodes: 4
  max_walltime: "04:00:00"

To run without a cluster, set ARI_EXECUTOR=local. ARI will execute experiments as local subprocesses.
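
One way to picture the pluggable executor model is a small dispatch on ARI_EXECUTOR. The sketch below is illustrative only: the `submit_argv` function is hypothetical, the fallback to local when the variable is unset is an assumption, and lsf is left out because bsub conventionally reads the job script from standard input rather than from an argument.

```python
import os
from typing import Optional


def submit_argv(script: str, env: Optional[dict] = None) -> list:
    """Build the command line used to launch a job script for the chosen executor."""
    env = dict(os.environ) if env is None else env
    executor = env.get("ARI_EXECUTOR", "local")   # default value is an assumption
    if executor == "local":
        return ["bash", script]                    # plain local subprocess
    if executor == "slurm":
        argv = ["sbatch"]
        partition = env.get("ARI_SLURM_PARTITION")
        if partition:
            argv.append(f"--partition={partition}")
        return argv + [script]
    if executor == "pbs":
        return ["qsub", script]
    raise ValueError(f"executor not covered by this sketch: {executor!r}")
```

Keeping the choice in one function means the calling code never branches on the scheduler, which is the property that lets the backend change without code edits elsewhere.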

Experiment Monitor (ari viz)

ARI ships a real-time experiment tree visualiser. It shows every BFTS node, its status, metrics, and the full tool-call trace — all in a browser.

Starting the monitor

ari viz --checkpoint <ckpt_dir> --port 9878

Open http://localhost:9878 in any browser. The dashboard polls /state and reconnects over WebSocket automatically.

Node detail panel

Click any node circle to open the four-tab detail panel.

Status indicators

The footer shows total node count, per-status counts, and the best metric value seen so far.

Architecture

The viz server (ari/viz/server.py) is a pure-stdlib asyncio HTTP + WebSocket handler — no external dependencies beyond the websockets package already installed by ARI. The dashboard is a React/TypeScript SPA built with Vite (ari/viz/frontend/), with modular components for each page (Home, Experiments, Monitor, Tree, Results, Wizard, Idea, Workflow, Settings). The production build is output to ari/viz/static/dist/ and served by the Python server.

Agent Memory

📄 Canonical source: this section is a summary — see concepts/memory.md for the authoritative, always-current detail.

As of v0.6.0, ARI keeps per-experiment memory in Letta (ex-MemGPT). Each checkpoint gets a dedicated Letta agent ari_agent_<hash> with two archival collections and a seeded core-memory block. A portable snapshot ({checkpoint}/memory_backup.jsonl.gz) is written automatically at pipeline-stage boundaries and on shutdown so the checkpoint directory stays self-contained.

Write-side tools enforce Copy-on-Write (a child cannot mutate an ancestor's entries) and search_memory strictly filters by the ancestor_ids you pass, so sibling branches never cross-contaminate. Letta self-edit is disabled by default via ARI_MEMORY_LETTA_DISABLE_SELF_EDIT=true.
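
The two guarantees can be illustrated with an in-memory stand-in. This is not the Letta-backed implementation: the `MemoryStore` class and its `update_memory` method are hypothetical, and plain substring matching stands in for retrieval.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryStore:
    entries: list = field(default_factory=list)   # each entry: {"node_id", "text"}

    def add_memory(self, node_id: str, text: str) -> int:
        """Append a new entry owned by node_id and return its index."""
        self.entries.append({"node_id": node_id, "text": text})
        return len(self.entries) - 1

    def update_memory(self, caller_id: str, index: int, text: str) -> None:
        """Copy-on-Write guard: only the owning node may change an entry."""
        if self.entries[index]["node_id"] != caller_id:
            raise PermissionError("a child cannot mutate an ancestor's entry")
        self.entries[index]["text"] = text

    def search_memory(self, query: str, ancestor_ids: list) -> list:
        """Return matching texts, restricted strictly to the given ancestor ids."""
        allowed = set(ancestor_ids)
        return [e["text"] for e in self.entries
                if e["node_id"] in allowed and query.lower() in e["text"].lower()]


store = MemoryStore()
root_entry = store.add_memory("root", "Baseline uses -O3")
store.add_memory("child_a", "O3 plus SVE intrinsics was faster")
# child_b lists only its own ancestors, so child_a's note stays invisible to it.
print(store.search_memory("o3", ancestor_ids=["root"]))  # ['Baseline uses -O3']
```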

The Trace tab in the Experiment Monitor reads these entries live through the Letta-backed library. Manage the backend with ari memory (migrate / backup / restore / start-local / …).

v0.7.0 note — on Letta 0.16.x the SDK call passages.list(search=q) is a SQL substring filter (LOWER(text) LIKE LOWER(%q%)), not semantic search — long natural-language queries silently returned 0 ancestor entries against real data. search_memory uses passages.search (GET /archival-memory/search, embed_query=True) with top_k = max(letta_overfetch, limit*40) and post-filters by ancestor_ids + ari_checkpoint locally. The embedding cost paid on every add_memory insert is now actually consumed by retrieval; children see ancestor entries ranked by relevance to their eval_summary query.
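
The over-fetch arithmetic and the local post-filter can be sketched as below. The hit dictionaries and their field names are assumptions made for illustration; only the `max(letta_overfetch, limit*40)` formula and the two filter keys come from the note above.

```python
def overfetch_top_k(limit: int, letta_overfetch: int) -> int:
    """Ask the backend for many more hits than needed, since most get filtered out."""
    return max(letta_overfetch, limit * 40)


def post_filter(hits, ancestor_ids, checkpoint, limit):
    """Keep relevance order, drop hits from other branches or other checkpoints."""
    allowed = set(ancestor_ids)
    kept = [h for h in hits
            if h["node_id"] in allowed and h["ari_checkpoint"] == checkpoint]
    return kept[:limit]
```

For example, with limit=5 and letta_overfetch=100 the backend is asked for 200 hits, and the local filter then trims that ranked list down to at most 5 ancestor entries.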

Publication Lifecycle (v0.7.0)

v0.7.0 turns the Experiment Artifact Repository (EAR) into a curated, digest-anchored publication chain. The author writes a small ear/publish.yaml allowlist; ari-core enforces a built-in deny list (.env*, secrets/**, *.pem, …) and computes a deterministic bundle_sha256 over a canonical {path,sha256,size} manifest. The digest is baked into the paper's \codedigest{...} macro by the finalize_paper stage, so any reader can verify the bundle from the paper alone — even if the registry hosting it disappears.
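
A minimal sketch of a deterministic bundle digest follows. The canonical encoding here (a path-sorted JSON list with sorted keys and compact separators) is an assumption, as are the function names; the deny list contains only the three patterns quoted above, while the real list is longer.

```python
import fnmatch
import hashlib
import json
from pathlib import Path

DENY = [".env*", "secrets/**", "*.pem"]   # only the patterns quoted in the text


def is_denied(rel_path: str) -> bool:
    """Match deny patterns against both the full relative path and the file name."""
    name = rel_path.rsplit("/", 1)[-1]
    return any(fnmatch.fnmatchcase(rel_path, pat) or fnmatch.fnmatchcase(name, pat)
               for pat in DENY)


def bundle_sha256(root: Path, allowlist: list) -> str:
    """Hash a canonical {path, sha256, size} manifest of the allowed files."""
    manifest = []
    for rel in sorted(set(allowlist)):            # sorting removes order dependence
        if is_denied(rel):
            continue
        data = (root / rel).read_bytes()
        manifest.append({"path": rel,
                         "sha256": hashlib.sha256(data).hexdigest(),
                         "size": len(data)})
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Because the digest covers the manifest rather than an archive file, it does not depend on tar or zip metadata such as timestamps, so anyone holding the same files can recompute it.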

Reproducibility — PaperBench-format (v0.7.0)

📄 Canonical source: this section is a summary — see guides/paperbench/paperbench_quickstart.md for the authoritative, always-current detail.

The legacy LLM-driven verdict path (extract_repro_config → react_driver → build_repro_report) is replaced by a deterministic two-phase grading flow compatible with PaperBench (arXiv:2504.01848). Three pipeline stages run in sequence:

  1. ors_generate_rubric (new ari-skill-replicate): auto-generates a frozen PaperBench TaskNode rubric from the final paper. Default two-stage generation: a skeleton pass defines the root + per-contribution direct children with leaf budgets, then parallel subtree passes recursively populate each child to depth 4–6. Yields ~4× more leaves and 1–2 levels more depth than the legacy single-call path, at ~5× more API tokens; toggle off via the GUI Wizard "Two-stage generation" checkbox or ARI_RUBRIC_GEN_TWO_STAGE=0 for cheap runs. audit_rubric flags vague/unverifiable leaves and recommends regeneration when >20% are flagged.
  2. ors_run_reproduce (paper-re, Phase 1): executes reproduce.sh in a sandbox, trying docker → apptainer → singularity → local (overridable via ARI_PHASE1_SANDBOX). Captures reproduce.log and reports any rubric-declared expected_artifacts that did not appear.
  3. ors_grade (paper-re, Phase 2): runs PaperBench SimpleJudge over the rubric leaves n_runs times (default 3), averages the weighted leaf scores, and runs a one-off negative control (empty repo + trivial reproduce.sh). Both controls must score <5% (passed=true) for the rubric to be considered honest.
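
The Phase 2 arithmetic can be sketched as follows. This is a simplification: PaperBench aggregates scores through a weighted rubric tree, whereas the sketch flattens each run to a list of (weight, score) leaves, and the function names are hypothetical.

```python
def weighted_score(leaves):
    """Weighted mean of leaf scores; leaves is a list of (weight, score) pairs."""
    total_weight = sum(weight for weight, _ in leaves)
    return sum(weight * score for weight, score in leaves) / total_weight


def grade(runs, controls, control_threshold=0.05):
    """Average the per-run scores and check that every negative control stays low."""
    mean_score = sum(weighted_score(run) for run in runs) / len(runs)
    controls_passed = all(weighted_score(c) < control_threshold for c in controls)
    return {"score": mean_score, "passed": controls_passed}


# Three judge runs over the same two leaves (weights 1 and 3).
runs = [[(1, 1.0), (3, 0.0)], [(1, 1.0), (3, 1.0)], [(1, 0.0), (3, 1.0)]]
controls = [[(1, 0.0), (3, 0.0)]]
report = grade(runs, controls)
```

The negative control matters because a rubric that awards points to an empty submission would inflate every real score; the threshold check catches that before the grade is trusted.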

Paper-Audit Mode — PaperBench-format (v0.7.2)

📄 Canonical source: this section is a summary — see reference/rubric_schema.md for the authoritative, always-current detail.

v0.7.2 adds the audit-side companion to the agent-side reproducibility check. Where the agent-side asks “can an AI agent reproduce this paper?”, the audit-side inverts the framing to “does this paper describe enough to be reproducible?”, using the same vendored PaperBench rubric formalism with zero vendor changes. The design is driven by the SC reproducibility framework (Artifact Description / Artifact Evaluation Appendices) and generalised to the NeurIPS Reproducibility Checklist and Nature Reporting Summary venues.

Three orthogonal mechanisms lift the audit score above a low single-digit paper-only baseline (0.033 measured on LLAMP / sc24-00070). An earlier “Step 4 reproduction-package generator” was removed as off-protocol: it asked an LLM to write a fake reproduce.log transcribing the paper's own table/figure numbers as if they were observed values, bypassing PaperBench's Step 2 (executed submission). The previously-claimed audit score of 0.857 obtained with that shortcut is retracted.

  1. Venue-conditioned rubric templates: ari-core/config/paperbench_rubrics/<id>.yaml declares mode: paper_audit with a fixed set of top_level_axes. The bundled sc.yaml emits six SC-specific axes (environment / data / execution / figures / scaling / conclusion) per HPC PaperBench audit research plan §5 Step 3. Generator code in ari-skill-replicate stays domain-agnostic; venue knowledge lives in YAML. See rubric schema.
  2. Multimodal expander: pymupdf4llm converts the paper PDF to markdown with ![](images/img-N.png) references; an expander in _litellm_completer.async_completion rewrites these into OpenAI multimodal content blocks so vision-capable judges see figures alongside text. Vendor SimpleJudge unchanged.
  3. paper_audit prompt patch in ari-skill-paper-re/src/_paperbench_bridge.py: monkey-patches PaperBench's TASK_CATEGORY_QUESTIONS (which asks “did the submission do X?”) at call scope to ask “does the paper describe X with concrete specificity?”. Restored on exit so agent-benchmark calls are unaffected. Breaks the structural ceiling that capped paper_audit mean score around 0.3.
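
The call-scoped patch in the third mechanism follows a common pattern: swap an attribute inside a context manager and restore it in a finally block. The sketch below shows that pattern only; the `patched_attr` helper is hypothetical, and a SimpleNamespace with an invented category key stands in for the vendored PaperBench module.

```python
import types
from contextlib import contextmanager


@contextmanager
def patched_attr(obj, name, value):
    """Temporarily replace obj.<name>, restoring the original even on error."""
    original = getattr(obj, name)
    setattr(obj, name, value)
    try:
        yield
    finally:
        setattr(obj, name, original)


# Stand-in for the vendored module; the key and wording are illustrative only.
judge_module = types.SimpleNamespace(
    TASK_CATEGORY_QUESTIONS={"example_category": "did the submission do X?"}
)
audit_questions = {
    "example_category": "does the paper describe X with concrete specificity?"
}

with patched_attr(judge_module, "TASK_CATEGORY_QUESTIONS", audit_questions):
    inside = judge_module.TASK_CATEGORY_QUESTIONS["example_category"]
after = judge_module.TASK_CATEGORY_QUESTIONS["example_category"]
```

Restoring in `finally` is what keeps agent-benchmark calls unaffected: even if the judge raises mid-audit, the original questions are back in place before any other caller runs.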

CLI dogfood: python scripts/sc_paper_dogfood.py --pdf paper.pdf --rubric-template sc --judge-dryrun. The --paper-extras AD.pdf AE.pdf flag concatenates Artifact Description and Artifact Evaluation Appendices so the audit can test HPC PaperBench audit research plan hypothesis 2 (paper-only vs +AD+AE).