Internals, Agent Orchestration
The deepest dive: what runTurn actually does. The full reference is src/orchestrator/index.ts; this page walks every step, every routing pattern, every validation gate, and the Fact Sheet contract that pins synthesis to deterministic numbers. Today this code runs in-process inside the CLI; the same module is what a future TurnWorkflow on Cloudflare will wrap step-by-step.
The pipeline implements the paper The Anatomy of a Personal Health Agent (Heydari et al., 2025) §F.2 plus Amy's two extensions: a Hypothesis Investigator for vague queries, and a CoDaS-style validation phase (Kim et al., 2026) that gates every quantitative claim before it can be synthesized into the answer.
Quick navigation
- The 9-step pipeline
- Sequence diagram
- Step-by-step reference
- Routing, the 6 patterns
- The Investigator path
- The validation phase
- The Fact Sheet contract
- Synthesis constraints
- Memory extraction
- Cost breakdown
- Where to next
The 9-step pipeline
runTurn(userMessage, ctx) runs these in order:
| # | Step | Implementation | Model | Typical wall time | Failure mode | Retry behavior |
|---|---|---|---|---|---|---|
| 0 | Vagueness classify | classifyVagueness | fastModel (sonnet) | ~1-3s | Returns "low" on any parse error | None, defaults to low |
| 1a | Investigator (high vagueness only) | runInvestigator | model (opus) | ~30-60s | Empty hypotheses list | None, emits empty briefing |
| 1b | Routing (low/medium vagueness) | classifyJson w/ TASK_ASSIGNMENT prompt | fastModel | ~2-5s | LLM emits garbage agent name | sanitiseRouting canonicalises or returns "" (→ fallback reply) |
| 2 | Question rephrase per agent | classifyJson w/ QUESTION_REPHRASE | fastModel | ~2-4s | JSON parse error | try/catch → falls back to the original userMessage |
| 3 | Supporting agents (sequential) | runDsAgent, runDeAgent | per-agent (see below) | ~30-120s each | Sandbox/tool failure surfaced in their answer | Internal retries (DS: up to 3 sandbox attempts) |
| 4 | Main agent | runDsAgent / runDeAgent / runHcAgent | per-agent | ~30-120s | Same as above; empty mainText if dispatch fell through | None at this layer |
| 5 | Reflection (only if supporting list was non-empty) | classifyJson w/ REFLECTION prompt | fastModel | ~3-5s, + reflection sub-calls if YES | Returns nothing useful → caught and skipped | Single shot. If decision="YES", runs the named follow-up agents (DS w/ maxRetries=0, DE full) |
| 6 | Validation | validateFindings over (ds.findings ∪ investigator.findings ∪ reflectionFindings) | gates: python sandbox; Critic + Assessment: validatorModel (opus) | ~5-30s per finding | Gate runner crash → verdict="rejected", hard_rejection set | None, every finding either passes through or is flagged |
| 7 | Synthesis (streams) | ask w/ FINAL_SYNTHESIS prompt | model (opus) | ~10-30s | LLM call failure bubbles up | None |
| 7b | Fact-check pass | factCheckReply (regex-based) | none | ~10ms | Always returns, issues array may be empty | N/A |
| 8 | Memory extraction | extractMemories | fastModel | ~3-5s | JSON parse → empty list | try/catch → returns [] |
The pipeline emits OrchestratorEvents throughout
(src/orchestrator/events.ts) so the CLI
(and any future SSE stream) can render every transition live.
Sequence diagram
Step-by-step reference
Step 0, Vagueness classifier (Amy extension)
src/orchestrator/index.ts:124–142.
Input: the raw user message. Output: "low" | "medium" | "high".
A one-shot classify (no tools, no JSON), prompt embedded inline in
classifyVagueness (lines 896-935). Biased toward low. Returns "high"
only when the query has no anchor at all ("Is anything interesting in my
data?"); coaching queries with a stated action ("I want to set a SMART goal")
are explicitly classified "low".
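The low-biased fallback described above can be sketched as follows. This is a minimal illustration, not the code in classifyVagueness: the helper name parseVagueness and the letters-only normalisation are assumptions.

```typescript
type Vagueness = "low" | "medium" | "high";

// Hypothetical helper (not from the source): map the classifier's raw text
// reply to a label. Anything unrecognised falls back to "low".
function parseVagueness(raw: string): Vagueness {
  // Keep letters only so replies such as ' "High".\n' still match.
  const label = raw.toLowerCase().replace(/[^a-z]/g, "");
  if (label === "low" || label === "medium" || label === "high") {
    return label;
  }
  return "low"; // parse error or unexpected text: take the cheaper routed path
}
```

Defaulting to "low" means a malformed classifier reply can never trigger the expensive Investigator path by accident.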
Step 1a, Investigator (vagueness = high only)
If vague === "high", the orchestrator jumps directly into the
Investigator path and returns without ever hitting
the router. The Investigator is never assigned as a regular routing
agent (see the explicit note in ORCHESTRATOR_SYSTEM).
Step 1b, Routing
src/orchestrator/index.ts:215–236.
Prompt: ORCHESTRATOR_SYSTEM + TASK_ASSIGNMENT
(src/orchestrator/prompts.ts lines
11-86). Input includes the full conversation history rendered as
role: content\n plus the current [TOPIC]. Output is a JSON object:
{
"main_agent": "Data Science Agent" | "Domain Expert Agent" | "Health Coach Agent" | "",
"supporting_agents": "Data Science Agent; Domain Expert Agent" | "",
"collaboration_workflow": "..."
}
The router classifies into one of the 6 collaboration
patterns plus corner cases (general health info,
device help, etc.) that resolve to main_agent="" → fallback reply.
sanitiseRouting (index.ts:870–894):
the LLM occasionally emits aliases or misspellings. The sanitiser
canonicalises:
- "DS Agent" / "ds" / "data scientist" → "Data Science Agent"
- "domain expert" / "de" → "Domain Expert Agent"
- "coach" / "hc" → "Health Coach Agent"
- Anything else → null (dropped).
If main_agent becomes "", the orchestrator runs a fallback ask() (index.ts:289–322) rather than dispatching to a phantom agent.
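A minimal sketch of that canonicalisation, assuming case-insensitive exact matching against the aliases listed above. The names AGENT_ALIASES, canonicaliseAgent and sanitiseRoutingSketch are illustrative and are not taken from index.ts; the real sanitiser may match more loosely.

```typescript
type AgentName = "Data Science Agent" | "Domain Expert Agent" | "Health Coach Agent";

// Alias table built from the list above (assumed to be case-insensitive).
const AGENT_ALIASES = new Map<string, AgentName>([
  ["data science agent", "Data Science Agent"],
  ["ds agent", "Data Science Agent"],
  ["ds", "Data Science Agent"],
  ["data scientist", "Data Science Agent"],
  ["domain expert agent", "Domain Expert Agent"],
  ["domain expert", "Domain Expert Agent"],
  ["de", "Domain Expert Agent"],
  ["health coach agent", "Health Coach Agent"],
  ["coach", "Health Coach Agent"],
  ["hc", "Health Coach Agent"],
]);

// Returns the canonical name, or null when the LLM emitted an unknown agent.
function canonicaliseAgent(raw: string): AgentName | null {
  return AGENT_ALIASES.get(raw.trim().toLowerCase()) ?? null;
}

interface SanitisedRouting {
  main_agent: AgentName | "";
  supporting_agents: AgentName[];
}

function sanitiseRoutingSketch(raw: { main_agent: string; supporting_agents: string }): SanitisedRouting {
  const supporting: AgentName[] = [];
  // The router emits supporting agents as a ";"-separated string.
  for (const part of raw.supporting_agents.split(";")) {
    const agent = canonicaliseAgent(part);
    if (agent !== null) supporting.push(agent); // unknown names are dropped
  }
  // An unknown main agent collapses to "", which triggers the fallback reply.
  return { main_agent: canonicaliseAgent(raw.main_agent) ?? "", supporting_agents: supporting };
}
```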
Step 2, Question rephrase
src/orchestrator/index.ts:238–286.
Prompt: QUESTION_REPHRASE (prompts.ts:88–109).
The LLM is told to decompose the user's question into per-agent
sub-questions. The prompt has a load-bearing constraint for the DS
Agent: "frame its question as ONE narrow, concrete computation, the
single core metric/relationship needed. Do NOT enumerate multi-part specs."
Empirically, multi-part DS asks produced 100+ line brittle pandas that
failed to run.
Output:
{
"main_agent_question": "...",
"supporting_agent_questions": { "Data Science Agent": "...", "...": "..." }
}
On parse failure, the orchestrator falls back to using the original userMessage for every
agent (lines 264-271).
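The fallback can be pictured like this. It is a sketch under assumptions: parseRephrase is a hypothetical helper, and the per-agent fallback (using userMessage for any single agent whose question is missing) goes slightly beyond what the source states.

```typescript
interface RephraseResult {
  main_agent_question: string;
  supporting_agent_questions: Record<string, string>;
}

// Hypothetical helper: parse the rephrase JSON and fall back to the original
// user message whenever the payload is malformed or incomplete.
function parseRephrase(raw: string, userMessage: string, supporting: string[]): RephraseResult {
  const defaults: Record<string, string> = {};
  for (const agent of supporting) defaults[agent] = userMessage;
  const fallback: RephraseResult = {
    main_agent_question: userMessage,
    supporting_agent_questions: defaults,
  };
  try {
    const parsed: unknown = JSON.parse(raw);
    if (typeof parsed !== "object" || parsed === null) return fallback;
    const obj = parsed as Record<string, unknown>;
    const main = obj["main_agent_question"];
    if (typeof main !== "string" || main.trim() === "") return fallback;
    const questions: Record<string, string> = { ...defaults };
    const sup = obj["supporting_agent_questions"];
    if (typeof sup === "object" && sup !== null) {
      for (const agent of supporting) {
        const q = (sup as Record<string, unknown>)[agent];
        if (typeof q === "string" && q.trim() !== "") questions[agent] = q;
      }
    }
    return { main_agent_question: main, supporting_agent_questions: questions };
  } catch {
    return fallback; // JSON.parse threw: every agent gets the raw user message
  }
}
```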
Step 3, Supporting agents (sequential)
src/orchestrator/index.ts:324–369.
Each supporting agent in supportingList is invoked in order (NOT in
parallel; sequential execution is intentional so the main agent can see all
supporting outputs in one block). Only Data Science and Domain Expert can
be supporting; Health Coach is always main.
Each call returns its own trace object (DsTrace, DeTrace) and an
answer string that gets concatenated into supportingInsights for the
main agent and synthesis.
Data Science Agent (src/agents/data-science/)
A 3-stage internal loop:
- Plan (PLAN_PROMPT, dsModel): produces an == Approach == text describing what to compute.
- Code-gen (CODE_PROMPT, dsModel): produces the body of a Python function analysis(summary_df, activities_df, profile_df, population_df).
- Sandbox (runDsCode → sandbox.ts):
  - Pre-flight ast.parse via python3 -c (~50ms) before burning sandbox time.
  - Auto-fixes the #1 LLM bug: block opener (if/for/...) followed by un-indented body (autoFixPythonIndent).
  - Wraps the body in PY_WRAPPER that loads SQLite + JSON, attaches deterministic composite features (cardio_fitness_index, hrv_rhr_ratio, rolling _sd_30d / _cv_30d / _mean_30d), and emits a JSON-safe result.
  - Up to maxRetries=2 debug iterations (DEBUG_PROMPT with the previous stderr, with extra "indentation help" inlined when an IndentationError is detected).
- Summarize + extract findings: extractFindings returns structured Finding[] for the validation pipeline. Short-circuits on looksDescriptive(query) (e.g. "what is my average X?") to skip extraction entirely.
Reflection-mode DS uses maxRetries: 0 (single shot): if it can't answer
in one attempt, the reflection ask was too ambitious. See the comment at
index.ts:502–508.
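The relationship between maxRetries and the number of sandbox attempts can be sketched as a small loop. This is a simplified, synchronous illustration: runWithDebugRetries and the injected runSandbox / debugCode callbacks are hypothetical stand-ins for runDsCode and the DEBUG_PROMPT call, which are asynchronous in practice.

```typescript
interface SandboxResult {
  ok: boolean;
  stdout: string;
  stderr: string;
}

// One initial run plus up to `maxRetries` repair attempts, each fed the
// stderr of the previous failure.
function runWithDebugRetries(
  initialCode: string,
  runSandbox: (code: string) => SandboxResult,
  debugCode: (code: string, stderr: string) => string,
  maxRetries: number = 2,
): { result: SandboxResult; attempts: number } {
  let code = initialCode;
  let result = runSandbox(code);
  let attempts = 1;
  while (!result.ok && attempts <= maxRetries) {
    code = debugCode(code, result.stderr); // ask the model for a repaired body
    result = runSandbox(code);
    attempts += 1;
  }
  return { result, attempts };
}
```

With the default maxRetries=2 a persistently failing analysis costs three sandbox runs; with maxRetries=0 (reflection mode) it costs exactly one.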
Domain Expert Agent (src/agents/domain-expert/)
A ReAct loop with maxTurns: 10. Tools:
- WebSearch, WebFetch (built-in)
- mcp__amy-de-tools__ncbi_search (PubMed)
- mcp__amy-de-tools__range_compare (clinical reference ranges)
- mcp__amy-de-tools__datacommons (Google Data Commons)
System prompt includes the user's Profile, latest Biomarkers, and the
full literature priors blurb from
data/reference/biomarker_priors.json.
The agent is instructed never to invent a URL.
Health Coach Agent (src/agents/health-coach/)
A 3-module modular flow (paper §6.2, splitting prevents the failure mode of giving premature recommendations from a single fat prompt):
1. HC_RECOMMEND_GATE (classify): emits [VERDICT]: YESREC | NOREC.
2. HC_SYSTEM main coaching response; the system prompt is parameterised by verdict (NOREC → keep gathering context; YESREC → deliver SMART recommendation NOW, don't re-ask).
3. HC_FINISH_GATE (classify): FINISH | CONTINUE. On FINISH, a closing summary is generated.
Two hard rules in the HC path:
- HC never sees rejected findings (index.ts:421–424). Filtering keeps the coach from grounding recommendations on data the Critic disowned.
- HC is required to reference at least one personal anchor from the deterministic computeAnchors(store) blurb, which forces specific-to-user advice instead of textbook generics.
Step 4, Main agent
src/orchestrator/index.ts:371–453.
Same agent code as Step 3; the main is just whichever
routing.main_agent resolved to. Difference:
- DE as main receives supportingInsights so it doesn't redo computation.
- HC as main triggers a pre-validation hop: if DS already ran as supporting and produced findings, validate them BEFORE the HC speaks (index.ts:408–420). This is the only path where validation runs before the main agent; it is necessary because HC is the only agent that turns numbers into actions.
Step 5, Reflection
src/orchestrator/index.ts:455–560.
Only runs if supportingList.length > 0. Prompt: REFLECTION
(prompts.ts:111–152). Output:
{ "decision": "YES" | "NO", "reflection_questions": { "<agent>": "<q>" } }
NO is the common case. YES triggers up to one follow-up per agent; the
prompt explicitly caps at "Maximum 1 question per agent (one DS, one DE)
and prefer just 1 total." Reflection DS findings flow into
reflectionFindings[] and are merged into validation alongside the main
DS findings (index.ts:563–593).
Step 6, Validation
See The validation phase below.
Step 7, Synthesis (streams)
src/orchestrator/index.ts:596–637.
Prompt: FINAL_SYNTHESIS (prompts.ts:154–213).
The system prompt is famously short:
You are Amy, a unified personal health agent. Speak as a single coherent voice. Do not mention specific sub-agents. Honour the FACT SHEET and validated findings, never invent numbers or contradict the validation verdicts.
The user message is a structured block containing main agent draft, supporting insights, reflection insights, validated findings blurb, the Fact Sheet, and (when DS failed) an explicit hard-warning block telling synthesis NOT to invent numbers.
Synthesis uses onSdkEvent to stream text_delta events through to the
orchestrator's emitter, which the CLI renders character-by-character. The
final text is what the user sees.
Step 7b, Fact-check (regex)
src/orchestrator/index.ts:759–838,
the factCheckReply function.
After synthesis, every numeric token in the reply is checked against:
- The Fact Sheet values (with 2% relative tolerance + 0.05 absolute floor).
- Pairwise ratios of Fact Sheet values (synthesis often derives e.g. effect / noise_sd; the math is correct but the ratio isn't literally in the sheet).
- Numbers from the user's original message (echo: "your 7.3% drop").
- Numbers from the Domain Expert's prose (literature reference values like PSQI MCID = 4.4).
Anything else gets flagged as {value, severity: "warn"}. Issues are emitted
via the fact_check event so the CLI can render them with a yellow tint.
This is intentionally regex-based (deterministic, no LLM cost); see CoDaS §2.6, numeric verification.
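To make the four allowed sources concrete, here is a minimal sketch of how the set of permitted numbers could be assembled. The function names, the number regex and the key "ds-001.noise_sd" in the usage note are assumptions for illustration; factCheckReply itself may tokenise differently.

```typescript
// Pull numeric literals such as "7.3", "-0.42" or "124" out of free text.
function extractNumbers(text: string): number[] {
  const matches = text.match(/-?\d+(?:\.\d+)?/g) ?? [];
  return matches.map(Number);
}

// Hypothetical sketch: every value a reply may cite without being flagged.
function buildAllowedNumbers(
  factSheet: Record<string, number>,
  userMessage: string,
  expertProse: string,
): number[] {
  const sheet = Object.keys(factSheet).map((key) => factSheet[key]);
  const ratios: number[] = [];
  for (let i = 0; i < sheet.length; i++) {
    for (let j = 0; j < sheet.length; j++) {
      // Ordered pairs, so both a/b and b/a are allowed; skip division by zero.
      if (i !== j && sheet[j] !== 0) ratios.push(sheet[i] / sheet[j]);
    }
  }
  return [...sheet, ...ratios, ...extractNumbers(userMessage), ...extractNumbers(expertProse)];
}
```

For example, a sheet holding { "ds-001.effect": 7.38, "ds-001.noise_sd": 18.74 } also admits 7.38 / 18.74, so a reply that says "about 0.4 noise SDs" is not flagged once the tolerance window is applied.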
Step 8, Memory extraction
See Memory extraction below.
Routing, the 6 patterns
Source: TASK_ASSIGNMENT prompt
(prompts.ts:27–86). The router must
match one of:
| # | When | Main agent | Supporting | Why |
|---|---|---|---|---|
| 1 | "Understand health topic / facts / news" | Domain Expert | none | Pure knowledge lookup. |
| 2 | "Understand my time-series data, single source" | Data Science | none | DS computation suffices; no external interpretation. |
| 3 | "Understand my data AND need external knowledge" | Domain Expert | Data Science | DS computes, DE interprets. |
| 4 | "Wellness advice / goal-setting (no data)" | Health Coach | none | Pure coaching. |
| 5 | "Wellness advice based on my data" | Health Coach | Data Science | DS computes, HC guides. |
| 6 | "Wellness advice + data + medical context" | Health Coach | Data Science and Domain Expert | DS computes, DE adds clinical context, HC guides. |
Plus a forcing rule ([STEP 3] in the prompt): "If the user asks about
something potentially related to their personal data, even just a bit, and
the main agent is not the data science agent, add the data science agent
as a supporting agent." This is what catches "is my LDL of 124 something
to worry about" (pattern 1 → pattern 3, because the user named a specific
number that ought to be cross-checked against their actual data).
Corner-case bucket → main_agent="" → fallback reply. Garbage agent name
→ same.
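The six patterns and the forcing rule can be written down as data plus one function. Note the assumption: in Amy the rule is applied by the LLM via the prompt, not in code, so PATTERNS and applyForcingRule below are purely illustrative.

```typescript
type Agent = "Data Science" | "Domain Expert" | "Health Coach";

interface Pattern {
  main: Agent;
  supporting: Agent[];
}

// The table above, keyed by pattern number.
const PATTERNS: Record<number, Pattern> = {
  1: { main: "Domain Expert", supporting: [] },
  2: { main: "Data Science", supporting: [] },
  3: { main: "Domain Expert", supporting: ["Data Science"] },
  4: { main: "Health Coach", supporting: [] },
  5: { main: "Health Coach", supporting: ["Data Science"] },
  6: { main: "Health Coach", supporting: ["Data Science", "Domain Expert"] },
};

// Forcing rule: if the query touches personal data and DS is not already
// involved, add DS as a supporting agent.
function applyForcingRule(pattern: Pattern, touchesPersonalData: boolean): Pattern {
  const hasDs =
    pattern.main === "Data Science" || pattern.supporting.indexOf("Data Science") !== -1;
  if (!touchesPersonalData || hasDs) return pattern;
  return { main: pattern.main, supporting: ["Data Science", ...pattern.supporting] };
}
```

Applying the rule to pattern 1 yields exactly pattern 3, which is the LDL example above; applying it to pattern 4 yields pattern 5.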
The Investigator path
src/agents/investigator/index.ts.
Triggered when vagueness === "high". Bypasses routing entirely.
1. computeDigest(store)
└── Per-metric (steps, sleep_minutes, deep_sleep_minutes, rem_sleep_minutes,
resting_heart_rate, heart_rate_variability, stress_management_score,
active_zone_minutes, sleep_score):
avg(last 30d) vs avg(prior 30d), overall avg ± SD, delta in SD units
└── Day-of-week sleep breakdown
└── Missingness (HRV / steps presence ratio)
└── Top 10 workout types
└── Output: ~500–1500 tokens
2. classifyJson(INVESTIGATOR_SYSTEM) →
Hypothesis[] = [{ id, title, rationale, test_plan, data_required,
missing_data_flag, expected_impact, confidence,
next_action, kind, feature_to_test, target_to_test }]
Already-tested hypotheses (from memory.testedHypotheses()) are passed in
to prevent re-proposal.
3. Top-K (default K=3, `AUTO_TEST_TOP_K`) hypotheses → Findings via
`hypothesisToFinding(h, store)`:
- kind=association → computeSpearman(store, feature, target)
- kind=trend → computeRecentVsPrior(store, feature)
- kind=scalar → computeMean(store, feature)
All computed in TypeScript against SQLite directly — no LLM call, no
Python sandbox. Numbers are real before the finding enters validation.
4. Validate the top-K Findings (same pipeline as Step 6 main flow).
5. Generate a user-facing briefing (INVESTIGATOR_BRIEFING, `model=opus`).
6. augmentBriefingWithVerdicts: appends "_Auto-tested findings:_" with ✓ /
~ markers and the effect sizes. If ALL surviving = 0, appends a "none
survived validation" note so the user knows the noticed patterns might
      be noise.
The Investigator path is synthesis-free: the briefing IS the final
response. Numeric verification still runs against the briefing text.
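The per-metric digest line and the trend computation both rest on a "recent 30 days vs prior 30 days" comparison. A minimal sketch follows, assuming one value per day ordered oldest to newest and a sample standard deviation; the function name recentVsPrior and the exact SD convention are assumptions, and the real computeRecentVsPrior reads from SQLite.

```typescript
interface RecentVsPrior {
  recentMean: number;
  priorMean: number;
  overallMean: number;
  overallSd: number;
  deltaSd: number; // (recent - prior) expressed in overall SD units
}

function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Sample standard deviation (n - 1 denominator).
function sampleSd(xs: number[]): number {
  const m = mean(xs);
  const ss = xs.reduce((acc, x) => acc + (x - m) * (x - m), 0);
  return Math.sqrt(ss / (xs.length - 1));
}

// `daily` holds one value per day, oldest first. Returns null when there is
// not enough history for two full windows.
function recentVsPrior(daily: number[], windowDays: number = 30): RecentVsPrior | null {
  if (daily.length < 2 * windowDays) return null;
  const recent = daily.slice(-windowDays);
  const prior = daily.slice(-2 * windowDays, -windowDays);
  const recentMean = mean(recent);
  const priorMean = mean(prior);
  const overallSd = sampleSd(daily);
  return {
    recentMean,
    priorMean,
    overallMean: mean(daily),
    overallSd,
    deltaSd: overallSd > 0 ? (recentMean - priorMean) / overallSd : 0,
  };
}
```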
The validation phase
src/agents/validator/index.ts.
For every Finding (from DS, Investigator, or reflection DS):
1. Deterministic gates (Python sandbox)
└── runGates({ finding }) — see below
└── If gate runner crashes → verdict=rejected, hard_rejection=error
└── If hard gate fails → verdict=rejected
└── Otherwise → preliminary verdict from gate ratio
2. Critic (LLM, validatorModel=opus by default)
└── runCritic({ finding, gates, priors, memory })
└── Output: { decision: accept|downgrade|reject, concerns, rationale }
└── decision=reject → final verdict=rejected (short-circuit)
└── decision=downgrade && verdict=validated → conditional
3. Assessment (LLM, validatorModel)
└── Only runs on non-rejected findings
   └── Output: { mechanism, novelty, strategy, citations }
The output ValidatedFinding carries the original Finding + verdict +
gates + critic + assessment.
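The way the Critic's decision combines with the gate verdict can be sketched as a small pure function. Two behaviours are assumptions rather than statements from the source: a gate-level rejection is never upgraded by the Critic, and a downgrade applied to an already conditional finding leaves it conditional.

```typescript
type Verdict = "validated" | "conditional" | "rejected";
type CriticDecision = "accept" | "downgrade" | "reject";

// Combine the preliminary gate verdict with the Critic's decision.
function finalVerdict(gateVerdict: Verdict, critic: CriticDecision): Verdict {
  if (gateVerdict === "rejected" || critic === "reject") return "rejected";
  if (critic === "downgrade" && gateVerdict === "validated") return "conditional";
  return gateVerdict; // accept, or downgrade on an already conditional finding
}

// Assessment only runs for findings that survived both stages.
function shouldRunAssessment(verdict: Verdict): boolean {
  return verdict !== "rejected";
}
```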
The 7 deterministic gates
Source: src/agents/validator/gates.ts.
Implemented as a single Python script run via spawn("python3", ...)
against the same SQLite that the DS Agent reads. Per-finding the script
loads summary, computes each gate, and emits a JSON result wrapped in
__GATES_BEGIN__ / __GATES_END__ markers.
| # | Gate | Applies to | Logic | Pass condition | Hard? |
|---|---|---|---|---|---|
| 1 | sample_size | all | Count non-null observations of feature (and target if present). | n ≥ 20 for associations, n ≥ 10 otherwise. | Yes, n < min is an automatic reject. |
| 2 | effect_vs_noise | trend, scalar only | \|effect\| / metric_sd | ratio ≥ 0.5 (effect is at least half a metric SD), or sd=0/inf. | No, soft. |
| 3 | construct_validity | association only | Spearman ρ between feature and target on all valid pairs. | \|ρ\| ≤ 0.85 | Yes, \|ρ\| > 0.85 means feature is almost certainly the target re-expressed (tautology). |
| 4 | bootstrap | all (assoc / trend / scalar) | 1000-resample with seeded RNG (42). For associations: bootstrap Spearman ρ; passes if 95% CI [q.025, q.975] does not cross zero. For trend/scalar: bootstrap the mean. | CI not crossing zero (assoc). For trend/scalar this always passes (informational only). | No, soft. |
| 5 | subgroup_consistency | association only | Split window in time-ordered halves; compute Spearman ρ in each. | Same sign in both halves (ρ₁ · ρ₂ > 0). | No, soft. |
| 6 | method_triangulation | association only | Spearman ρ vs Kendall τ-b on all valid pairs. | Same sign. | No, soft. |
| 7 | discriminative_power | association only | \|effect\| against personal-data noise floor. | \|ρ\| ≥ 0.10 | No, soft. (Failure alone won't reject, but bootstrap + discriminative_power both fail → reject; see logic below.) |
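The bootstrap gate is the least obvious row in the table, so here is a hedged TypeScript sketch of a seeded percentile bootstrap. The real gate is a Python script; the small mulberry32-style generator, the nearest-rank percentile choice and the name bootstrapCi are all assumptions, so the numbers will not match the Python implementation even with the same seed.

```typescript
// Small deterministic PRNG (mulberry32-style) returning values in [0, 1).
function seededRandom(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface BootstrapCi {
  low: number;
  high: number;
  crossesZero: boolean;
}

// Resample row indices with replacement, recompute `statistic` on each
// resample, and read the 2.5% / 97.5% percentiles of the results.
function bootstrapCi(
  n: number,
  statistic: (indices: number[]) => number,
  resamples: number = 1000,
  seed: number = 42,
): BootstrapCi {
  const rand = seededRandom(seed);
  const stats: number[] = [];
  for (let b = 0; b < resamples; b++) {
    const indices: number[] = [];
    for (let i = 0; i < n; i++) indices.push(Math.floor(rand() * n));
    stats.push(statistic(indices));
  }
  stats.sort((x, y) => x - y);
  const low = stats[Math.floor(0.025 * (resamples - 1))];
  const high = stats[Math.ceil(0.975 * (resamples - 1))];
  return { low, high, crossesZero: low <= 0 && high >= 0 };
}
```

Passing a statistic that computes Spearman ρ over the resampled pairs gives the association variant; passing a mean gives the informational trend/scalar variant. Resampling indices rather than values keeps feature and target paired.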
Verdict aggregation logic
# gates.ts → PY → def run() (lines 304–350)
if hard_rejection:                      # gate 1 or 3 hard-failed
    verdict = "rejected"
elif bootstrap_fail and discrim_fail:   # core "is-there-signal" gates BOTH failed
    verdict = "rejected"
elif applicable == 0:                   # no gate applied (degenerate)
    verdict = "conditional"
else:
    ratio = passes / applicable_count
    if ratio >= 0.85: verdict = "validated"
    elif ratio >= 0.5: verdict = "conditional"
    else: verdict = "rejected"
Gates marked with detail.applicable = False (e.g., effect_vs_noise on
an association finding) are excluded from applicable_count so they
don't drag the ratio down.
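For readers more at home in TypeScript, the same aggregation can be restated as a pure function. This is an illustrative port, not code from the repo: the GateResult shape and the function name aggregateVerdict are assumptions.

```typescript
type Verdict = "validated" | "conditional" | "rejected";

interface GateResult {
  name: string;
  passed: boolean;
  applicable: boolean; // false when the gate does not apply to this finding kind
  hard: boolean; // true for sample_size and construct_validity
}

function aggregateVerdict(gates: GateResult[]): Verdict {
  const applicable = gates.filter((g) => g.applicable);
  const failed = (name: string): boolean =>
    applicable.some((g) => g.name === name && !g.passed);

  if (applicable.some((g) => g.hard && !g.passed)) return "rejected";
  if (failed("bootstrap") && failed("discriminative_power")) return "rejected";
  if (applicable.length === 0) return "conditional";

  const ratio = applicable.filter((g) => g.passed).length / applicable.length;
  if (ratio >= 0.85) return "validated";
  if (ratio >= 0.5) return "conditional";
  return "rejected";
}
```

For an association finding six gates apply, so a single soft failure (5/6 ≈ 0.83) already lands below the 0.85 bar and yields conditional.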
The Critic (with literature priors)
src/agents/validator/critic.ts.
The Critic gets:
- The Finding (claim, numbers, feature/target, mechanism).
- Per-gate results (✓/✗ + reason).
- Relevant literature priors filtered from data/reference/biomarker_priors.json (only those whose feature or target substring-matches this finding).
- The user's memory (filtered to barrier | preference | decision | value | insight, max 12 entries).
Output schema:
{
"decision": "accept" | "downgrade" | "reject",
"concerns": [{
"category": "confounder" | "reverse_causation" | "selection_bias" |
"literature_contradiction" | "tautology" | "small_n" | "noise",
"detail": "<one sentence>",
"severity": "low" | "medium" | "high"
}],
"rationale": "<two-sentence summary>"
}
Hard rules embedded in the system prompt:
- ANY severity=high concern → must be reject or downgrade.
- 2+ severity=medium → should be downgrade.
- 0-1 low and a plausible mechanism → accept.
On malformed output or call failure, the Critic defaults to downgrade (conservative); it never silently accepts.
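The conservative default can be sketched as a parser that only trusts well-formed output. The helper name parseCritic is hypothetical, and enforcing the "high severity is never accept" rule in code is an assumption: the source states that rule lives in the system prompt.

```typescript
type Severity = "low" | "medium" | "high";
type Decision = "accept" | "downgrade" | "reject";

interface Concern {
  category: string;
  detail: string;
  severity: Severity;
}

interface CriticOutput {
  decision: Decision;
  concerns: Concern[];
  rationale: string;
}

const CONSERVATIVE_DEFAULT: CriticOutput = {
  decision: "downgrade",
  concerns: [],
  rationale: "Critic output was missing or malformed.",
};

function parseCritic(raw: string): CriticOutput {
  try {
    const obj = JSON.parse(raw) as Partial<CriticOutput> | null;
    if (!obj) return CONSERVATIVE_DEFAULT;
    if (obj.decision !== "accept" && obj.decision !== "downgrade" && obj.decision !== "reject") {
      return CONSERVATIVE_DEFAULT;
    }
    const concerns = Array.isArray(obj.concerns) ? obj.concerns : [];
    let decision: Decision = obj.decision;
    // Belt and braces (assumption): a high-severity concern never coexists with "accept".
    if (decision === "accept" && concerns.some((c) => c?.severity === "high")) {
      decision = "downgrade";
    }
    const rationale = typeof obj.rationale === "string" ? obj.rationale : "";
    return { decision, concerns, rationale };
  } catch {
    return CONSERVATIVE_DEFAULT; // invalid JSON is treated like a failed call
  }
}
```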
Assessment (mechanism / novelty / strategy)
src/agents/validator/assessment.ts.
Only runs for non-rejected findings. Single LLM call. Output:
{
"mechanism": "<one sentence, grounded in priors or honest 'no established mechanism known'>",
"novelty": "established" | "supported" | "emerging" | "user_specific",
"strategy": "<one concrete next step achievable this week>" | null,
"citations": ["..."]
}
The "strategy quality bar" in the prompt is explicit: must be specific to
the finding, not generic ("add a 25-min walk on the 3 lowest-step
weekdays", not "exercise more"). If no clear lever exists, strategy is
null.
The Fact Sheet contract
The Fact Sheet is the immutable, deterministic dictionary of every number
synthesis is allowed to cite. Built by buildFactSheet(validated) in
src/agents/validator/types.ts:137–159:
export type FactSheet = Record<string, number>;
// Keys are `<finding_id>.<numbers_key>`:
// "ds-001.effect" = -0.374
// "ds-001.n" = 87
// "ds-001.ci_low" = -0.42
//   "ds-001.ci_high" = -0.31
What gets in
- Every numbers key of every non-rejected ValidatedFinding (verdict ∈ {validated, conditional}).
- Tuples (CIs) are split into <key>_low / <key>_high.
- Only finite, real-valued numbers (silently drops NaN, Inf, arrays that aren't 2-tuples).
- Duplicate finding IDs are suffixed -2, -3, etc., so no number is silently dropped.
What's blocked
- Anything from a verdict=rejected Finding.
- Any number generated mid-synthesis by the LLM (the fact-check pass flags these as warn).
- Numbers from the DS Agent's free-text answer that were not captured into findings (extraction is mandatory for synthesis-eligible numbers).
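The inclusion and exclusion rules above can be sketched as one function. This is an approximation of buildFactSheet under assumptions: the ValidatedFindingLike shape is simplified, and whether duplicate-ID counting happens before or after rejected findings are filtered is a guess (here it happens after).

```typescript
type Verdict = "validated" | "conditional" | "rejected";
type FactSheet = Record<string, number>;

interface ValidatedFindingLike {
  id: string;
  verdict: Verdict;
  numbers: Record<string, number | number[]>;
}

function isFiniteNumber(x: unknown): x is number {
  return typeof x === "number" && isFinite(x);
}

function buildFactSheetSketch(findings: ValidatedFindingLike[]): FactSheet {
  const sheet: FactSheet = {};
  const seen = new Map<string, number>();
  for (const f of findings) {
    if (f.verdict === "rejected") continue; // rejected findings never reach synthesis
    const count = (seen.get(f.id) ?? 0) + 1;
    seen.set(f.id, count);
    const id = count === 1 ? f.id : `${f.id}-${count}`; // ds-001, ds-001-2, ...
    for (const key of Object.keys(f.numbers)) {
      const value = f.numbers[key];
      if (Array.isArray(value)) {
        // Only 2-tuples of finite numbers (confidence intervals) are kept.
        if (value.length === 2 && isFiniteNumber(value[0]) && isFiniteNumber(value[1])) {
          sheet[`${id}.${key}_low`] = value[0];
          sheet[`${id}.${key}_high`] = value[1];
        }
      } else if (isFiniteNumber(value)) {
        sheet[`${id}.${key}`] = value;
      }
    }
  }
  return sheet;
}
```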
Tolerance window (in factCheckReply)
const tolerance = Math.max(0.02 * Math.abs(sv), 0.05);
const matches = Math.abs(v - sv) <= tolerance || Math.abs(v - Math.abs(sv)) <= tolerance;
A 2% relative tolerance with a 0.05 absolute floor. Tight enough to catch
a 78.3 vs 75.5 fabrication; loose enough to allow rounding (372 vs 371.83) and ratio derivations (e.g. ~0.4 from 7.38 / 18.74).
Numeric tokens that DON'T trip the fact-check:
- Bare integers under 100 with no decimal/percent (LLM often uses them as list indices or sentence numbers).
- Year-looking integers in [1900, 2100] (publication years in citations).
- Numbers in stripped patterns (URLs, markdown links, arXiv:1234.56789, N=1,234, ISO dates).
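The tolerance window and the first two exemptions can be checked against the worked numbers in this section with a short sketch. The function names are assumptions, and the stripped-pattern handling (URLs, arXiv IDs, ISO dates) is deliberately left out.

```typescript
// v is a number found in the reply, sv a permitted value (e.g. from the Fact Sheet).
function withinTolerance(v: number, sv: number): boolean {
  const tolerance = Math.max(0.02 * Math.abs(sv), 0.05);
  // The second clause lets the prose drop the sign ("a 0.37 decrease" for -0.374).
  return Math.abs(v - sv) <= tolerance || Math.abs(v - Math.abs(sv)) <= tolerance;
}

// Tokens that are never checked: bare integers under 100 and year-like integers.
function isExemptToken(token: string): boolean {
  if (!/^\d+$/.test(token)) return false; // decimals, percents and signed values are always checked
  const n = Number(token);
  if (n < 100) return true; // list indices, sentence numbers
  return n >= 1900 && n <= 2100; // publication years
}
```

With these definitions, 372 matches 371.83 and 0.4 matches 7.38 / 18.74, while 78.3 does not match 75.5 because the allowed window around 75.5 is only ±1.51.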
Synthesis constraints
FINAL_SYNTHESIS prompt lines
154-213. The hard rules:
- Numbers MUST come from the Fact Sheet. No invention, no inconsistent rounding, no interpolation. If a value isn't there, synthesis must say "we'd need to look more closely" instead of making one up.
- Honour the verdicts.
- VALIDATED → state with confidence.
- CONDITIONAL → hedge ("preliminary signal", "worth tracking").
- REJECTED → do NOT mention as findings (the Critic flagged them as confounded / tautological / under-powered).
- Lead with the strongest validated finding. No throat-clearing.
- Weave in Mechanism + Strategy from the Assessment when available.
- One voice. No references to "the data scientist" or "the domain expert", that's an Amy-internal abstraction.
- Coach mode ends with the coach's forward question. Other modes end cleanly.
- Focused. Answer the asked thing first.
If DS was invoked but failed (dsStatus = "failed"), the prompt prepends
an explicit hard-warning block:
⚠ DATA SCIENCE STATUS: FAILED. The data analysis sandbox did NOT complete successfully this turn. There is no validated quantitative result. You MUST be honest, NOT invent numbers, NOT cite a "result", and offer a narrower follow-up.
This is what prevents the "silent degradation" failure mode where synthesis confidently states a number even though the DS sandbox crashed.
Memory extraction
src/orchestrator/index.ts:937–966 →
extractMemories + validatedToMemories.
After every turn, two memory writes happen in series:
extractMemories(userMessage, assistantMessage)
Prompt: MEMORY_UPDATE (prompts.ts:215–240).
Pulls long-term-worth-keeping facts out of the conversation:
[
{
"agent": "user" | "orchestrator",
"kind": "goal" | "barrier" | "preference" | "insight" |
"hypothesis" | "decision" | "value",
"text": "...",
"confidence": 0.0–1.0
}
]
The prompt has explicit "only extract things relevant 2 weeks from now"
guidance. On JSON parse failure → [].
validatedToMemories(validated)
src/orchestrator/index.ts:684–702.
For every ValidatedFinding, emits a tested_hypothesis MemoryEntry:
{
ts, agent: "validator",
kind: "tested_hypothesis",
text: validated.claim,
confidence: { validated: 0.9, conditional: 0.6, rejected: 0.4 }[verdict],
meta: { finding_id, feature, target, verdict, effect }
}
Memory.appendMany dedupes by meta.finding_id so the same hypothesis
re-tested across turns doesn't bloat the JSONL.
The Investigator reads back memory.testedHypotheses() on its next run
and includes them in its prompt to prevent re-proposing the same
hypotheses (investigator/index.ts:77–86).
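A compact sketch of the write path described above: map each validated finding to a tested_hypothesis entry, then append with de-duplication on meta.finding_id. The simplified ValidatedLike shape and the keep-the-first-entry policy are assumptions; the source does not say whether a re-test overwrites or is skipped.

```typescript
type Verdict = "validated" | "conditional" | "rejected";

interface ValidatedLike {
  id: string;
  claim: string;
  feature: string;
  target: string | null;
  verdict: Verdict;
  effect: number;
}

interface MemoryEntry {
  ts: string;
  agent: string;
  kind: string;
  text: string;
  confidence: number;
  meta: { finding_id: string; feature: string; target: string | null; verdict: Verdict; effect: number };
}

const CONFIDENCE_BY_VERDICT: Record<Verdict, number> = {
  validated: 0.9,
  conditional: 0.6,
  rejected: 0.4,
};

function toTestedHypothesis(v: ValidatedLike, ts: string): MemoryEntry {
  return {
    ts,
    agent: "validator",
    kind: "tested_hypothesis",
    text: v.claim,
    confidence: CONFIDENCE_BY_VERDICT[v.verdict],
    meta: { finding_id: v.id, feature: v.feature, target: v.target, verdict: v.verdict, effect: v.effect },
  };
}

// Append only entries whose finding_id has not been stored yet (first write wins).
function appendDeduped(existing: MemoryEntry[], incoming: MemoryEntry[]): MemoryEntry[] {
  const seen = new Set<string>();
  for (const e of existing) seen.add(e.meta.finding_id);
  const out = existing.slice();
  for (const e of incoming) {
    if (seen.has(e.meta.finding_id)) continue;
    seen.add(e.meta.finding_id);
    out.push(e);
  }
  return out;
}
```

Rejected findings are stored too (at confidence 0.4), which is what lets the Investigator avoid re-proposing a hypothesis that already failed validation.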
Cost breakdown
Costs come from result.total_cost_usd as reported by the Claude Agent SDK
(src/llm.ts:150–151). They're accumulated in
trace.total_cost_usd and emitted via cost and cost_warning events.
A warn event fires once per turn when cumulative cost crosses
config.costWarnUsd (default $3, override via AMY_COST_WARN_USD).
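The once-per-turn warning can be sketched as a small accumulator. The class name CostTracker, the event payload fields and the use of >= for "crosses" are assumptions; reading AMY_COST_WARN_USD from the environment is omitted to keep the sketch self-contained.

```typescript
type CostEvent =
  | { type: "cost"; totalUsd: number }
  | { type: "cost_warning"; totalUsd: number; thresholdUsd: number };

class CostTracker {
  private totalUsd = 0;
  private warned = false;
  private readonly warnUsd: number;

  constructor(warnUsd: number = 3) {
    this.warnUsd = warnUsd;
  }

  // Record one LLM call's cost and return the events that should be emitted.
  add(costUsd: number): CostEvent[] {
    this.totalUsd += costUsd;
    const events: CostEvent[] = [{ type: "cost", totalUsd: this.totalUsd }];
    if (!this.warned && this.totalUsd >= this.warnUsd) {
      this.warned = true; // fire the warning at most once per turn
      events.push({ type: "cost_warning", totalUsd: this.totalUsd, thresholdUsd: this.warnUsd });
    }
    return events;
  }
}
```

Creating one tracker per turn gives the once-per-turn behaviour for free, since the warned flag resets with each new instance.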
Typical per-step costs (from real transcripts in the README / production traces; vary with prompt size and model):
| Step | Model | Typical cost |
|---|---|---|
| Vagueness classify | fastModel (sonnet) | $0.001-0.003 |
| Routing | fastModel | $0.005-0.015 |
| Question rephrase | fastModel | $0.005-0.015 |
| Investigator (digest + hypotheses + briefing) | model (opus) | $0.05-0.20 |
| DS plan + code-gen | dsModel (sonnet-4-6) | $0.03-0.10 |
| DS debug iteration (each) | dsModel | $0.05-0.15 |
| DS summary + extractFindings | fastModel + validatorModel | $0.01-0.05 |
| DE ReAct (per call) | model | $0.10-0.40 (depends on tool turns) |
| HC recommend gate + main + finish gate | fastModel + model + fastModel | $0.05-0.15 |
| Validator gates | none (Python only) | $0 |
| Critic | validatorModel | $0.02-0.10 per finding |
| Assessment | validatorModel | $0.01-0.05 per finding |
| Reflection | fastModel | $0.003-0.008 |
| Reflection DS follow-up | dsModel | $0.05-0.15 |
| Synthesis | model | $0.05-0.20 |
| Memory extraction | fastModel | $0.005-0.015 |
Total turn cost typically lands $0.10-0.30 for descriptive queries, $0.30-1.00 for analytical queries, and $1.00-3.00 for vague exploratory queries that fire the Investigator + validate multiple top hypotheses.
The two model knobs that move costs the most:
- AMY_DS_MODEL: defaults to sonnet-4-6 (faster + cheaper + empirically better than opus at pandas codegen, per the comment in config.ts:53–67).
- AMY_VALIDATOR_MODEL: defaults to opus because it's the trust-load-bearing step. Setting it to sonnet cuts validation cost ~70% but raises the risk of accepting a confounded finding.
Model resolution:
| Alias | Resolves to (direct Anthropic) | OpenRouter env var |
|---|---|---|
| opus | claude-opus-4-7 (whatever Claude Code default is) | ANTHROPIC_DEFAULT_OPUS_MODEL |
| sonnet | claude-sonnet-4-6 | ANTHROPIC_DEFAULT_SONNET_MODEL |
| haiku | claude-haiku-4-5 | ANTHROPIC_DEFAULT_HAIKU_MODEL |
| claude-sonnet-4-6 | exact pin | exact pin |
| claude-opus-4-7 | exact pin | exact pin |
dsModel is pinned to claude-sonnet-4-6 (not the alias) because the comment
explicitly notes Sonnet 4.6 outperforms Opus 4.7 on pandas codegen.
Where to next
- The runtime that hosts the API (Worker, Queues, Cron) is in runtime.md. The orchestrator currently runs in the CLI; the architecture target moves it into a TurnWorkflow so each step above survives restarts and is observable in the Cloudflare dashboard.
- The ingest pipeline that fills the SQLite the DS Agent reads is in data-pipeline.md.
- The schema of every column the agents touch is in storage.md.
- The biomarker priors the Critic uses live in data/reference/biomarker_priors.json.
- For the user-facing event taxonomy and the "calm" CLI rendering, see src/orchestrator/events.ts.