HOME › AI AGENTS › TEALCBUILD › PROMPT DESIGN

Designing Tealc's system prompt

Tealc's system prompt was originally a 715-line ALLCAPS wall that grew section-by-section as we added features. By April 2026 it had hit a wall: rules conflicted, attention was diffuse, and every change risked colliding with something written six months earlier. This page is the design diary of the rewrite — what we kept, what we cut, and what we changed based on Anthropic's 2026 prompting guidance and a few research patterns from the agent-evaluation literature. The full prompt source lives in agent/graph.py; this page is the “why”.

Static prefix~12,700 tokens, cached via cache_control: ephemeral Dynamic suffix~820 tokens, loaded fresh per chat-start (priorities, drive layout, personality, preferences) On-demand6 SKILL.md files, 1.5–3K tokens each, loaded only when triggered Structure14 XML-tagged semantic blocks (persona, stance, review_default, uncertainty, skills, heath_profile, lab, behavior, tool_routing, state, safety_rules, workflows, scheduled_jobs, integrations) ModelsClaude Sonnet 4.6 by default, Opus 4.7 on demand; adaptive thinking with effort=high for the chat agent Net change vs. prior version~3,900 tokens removed (-24%); 5 “non-negotiable” rules collapsed to 2 actual safety rules + 6 workflows; tool catalog moved into API tool schemas

Every section below is collapsed by default. Click a row to open it.

Premise Why a system prompt is not documentation Every byte either steers behavior or burns attention.

The temptation with a long-running agent is to keep adding context: explain how every tool works, list every edge case, document every decision the agent has ever made. The system prompt becomes a runbook. Anthropic's guidance is the opposite: treat the prompt like onboarding a new colleague, not a reference manual. The model has finite attention; every paragraph competes with every other paragraph for that attention. Long, repetitive prompts dilute the rules that actually matter.

The rewrite started from one question: which lines, if removed, would change Tealc's behavior in a way we'd notice? Anything that didn't pass that test got cut. The 104-tool catalog inside the prompt — one line per tool with a one-sentence description — got cut entirely, because LangGraph already passes @tool docstrings to the API as native tool schemas. The prompt was duplicating information the model already had.

Structure XML tags, not section dividers The model parses structure. Visual ASCII art is just text.

The previous prompt used heavy ═══════ divider lines to mark sections. To a human reader, those scream “new section”. To the model, they're just unicode box-drawing characters. Anthropic's guidance is explicit: “XML tags help Claude parse complex prompts unambiguously, especially when your prompt mixes instructions, context, examples, and variable inputs.”

The rewrite wraps the major thematic blocks in semantic XML tags: <persona>, <stance>, <review_default>, <uncertainty>, <heath_profile>, <lab>, <behavior>, <tool_routing>, <scheduled_jobs>, <state>, <safety_rules>, <workflows>, <integrations>. Inside the safety/workflow blocks, individual rules use nested tags — <rule id="data_resource_preflight">, <workflow id="hypothesis_proposal">, <reason>, <example> — mirroring the patterns Anthropic uses in its own published Claude system prompts.

This is not cosmetic. The model uses XML tags as attention anchors when it needs to find a specific rule mid-reasoning. With dividers, it has to scan prose. With tags, it can route directly.

Persona Character via behavioral constraints, not adjectives “Brilliant, organized, direct” is decoration. Anti-sycophancy is a behavior.

Anthropic's published system prompts for claude.ai don't describe Claude with adjectives. They describe what Claude does and doesn't do. Lines like “does not need to apologize when the person is unnecessarily rude”, “willing to push back on users and be honest, but does so constructively”, “never thanks the person merely for reaching out” — these are concrete behavioral constraints. Reading those lines, you can predict what Claude will and won't do in a specific situation. Reading “you are warm and direct”, you can't.

Tealc's <persona> block keeps a short identity statement (lab member, knows your work, runs on Sonnet 4.6 / Opus 4.7), then defers all the actual character work to three behavioral blocks immediately after:

  • <stance> — the default reading is skeptical. First job on any draft, claim, or piece of code is to find the weakest link, not to validate. Skip validation-forward openers (“great question”, “I'd be happy to”) and the words “genuinely”, “honestly”, “straightforward” — the latter is lifted directly from Anthropic's own anti-sycophancy phrasing.
  • <review_default> — when reviewing writing or code, report every issue including low-severity and uncertain ones, tagged with severity + confidence. Anthropic's own guidance on code-review harnesses warns that “be conservative” and “don't nitpick” cause Claude 4.7 to silently drop real findings: precision rises, recall collapses. Coverage at the finding stage with a downstream filter is the better pattern.
  • <uncertainty> — calibration as a habit. Every hypothesis or draft gets a confidence label, the strongest counter-argument the model can generate, and one experiment that would change its mind. “I think” and “my guess is” instead of bare assertion. Never invent paths, IDs, citations, or facts — investigate first or say “I don't know”.

Combined token cost: about 165 tokens. Influences every assistant turn for the life of the prompt. Highest-leverage 165 tokens in the file.

Stance Why the default reading is skeptical Adversarial-critic patterns improve factual accuracy ~23%.

The default frame for most LLM assistants is helpfulness, which in practice slides into validation: hedge the disagreement, soften the critique, agree with the user's framing. For a research lab, that's exactly wrong. The most useful behavior from a colleague is the one that catches the bad idea before it becomes a published preprint.

The agent-evaluation literature has a name for this pattern: the adversarial critic or skeptical internal reviewer. Multi-agent debate research reports ~23% improvements in factual accuracy when one agent's outputs are routed through a second agent prompted specifically to find errors, unsupported assumptions, and guideline violations. Commercial systems for underwriting decisions with adversarial self-critique use the same shape: a generator produces a draft, a critic is told to verify or reject every claim, and the final output reflects what survived.

Tealc bakes that shape into the default. The <stance> block tells the model that its first job, on any input, is to find the weakest link — not because the user is wrong, but because that's the higher form of help. The system already runs an explicit critic pass on hypothesis proposals and grant drafts (Tier 0 smoke filter → Haiku type classifier → Sonnet/Opus type-aware critic, see the hypothesis pipeline); the stance block extends the same posture to in-chat work where there's no formal pipeline.

The cost of skepticism in this domain is small. The cost of a confidently wrong hypothesis attached to the wrong citation is substantial. The April 2026 Fragile-Y preregistration incident — where Tealc proposed a hypothesis already tested in a paper it cited as supporting — is the failure mode that motivated the explicit wiki-consultation rule and, more broadly, the skeptical default.

Language Five “non-negotiables” collapsed to two When everything is non-negotiable, nothing is.

The previous prompt had five rules tagged (non-negotiable) and dozens of MUST / NEVER / DO NOT imperatives. Anthropic's own guidance on Claude 4.6+ models is explicit about why this backfires:

“Claude Opus 4.5 and Claude Opus 4.6 are also more responsive to the system prompt than previous models. If your prompts were designed to reduce undertriggering on tools or skills, these models may now overtrigger. The fix is to dial back any aggressive language. Where you might have said 'CRITICAL: You MUST use this tool when…', you can use more normal prompting like 'Use this tool when…'”

The rewrite reserves the <safety_rules> block for two genuinely load-bearing rules that are also backed by code-level guards: the data-resource preflight (must call require_data_resource before emitting analysis code that reads a lab database) and the destructive-action confirm (preview-then-confirm on the three irreversible Google Doc / Sheets / Calendar tools). The other three former “non-negotiables” — voice-matching for extended prose, wiki consultation before hypothesis proposals, the email-trash two-step — moved into <workflows> with calmer language, where they belong as quality and process rules rather than safety gates.

The two safety rules each ship with a <reason> block citing the actual incident the rule prevents. Anthropic's guidance again: “providing context or motivation behind your instructions… helps Claude generalize from the explanation.” Bare imperatives are weaker than imperatives with their causal story attached.

Examples Five worked examples beat fifty bullet rules “One of the most reliable ways to steer Claude's output.”

The previous prompt had zero <example> blocks across 715 lines. Anthropic's guidance on few-shot examples is unambiguous: “Examples are one of the most reliable ways to steer Claude's output format, tone, and structure. A few well-crafted examples (known as few-shot or multishot prompting) can dramatically improve accuracy and consistency.” The recommendation is 3–5 examples wrapped in <example> tags.

The rewrite adds 5 worked examples inside the safety / workflow blocks — one each for the data-resource preflight (showing both the success path and the error path), the destructive-action confirm flow, the hypothesis-proposal wiki-consultation flow, the extended-prose voice-exemplar flow, and the email-trash two-step. Each example is short (4–8 lines, formatted as a mini-dialogue) and shows the desired call sequence with the expected tool returns inline. They cost about 400 tokens combined and replace what would otherwise need 1500+ tokens of prose to convey.

Caching Static prefix, dynamic suffix Two content blocks, one cache breakpoint, ~90% off on cached tokens.

Anthropic's prompt-caching feature lets you mark a prefix of the system prompt with cache_control: {type: "ephemeral"}. On subsequent API calls with the same prefix, the cached portion bills at roughly 10% of the normal input-token rate. For a 12k-token static prompt called dozens of times a day, that's real money.

The catch: the cache is invalidated by any change in the prefix. If you put both the static prompt and dynamic per-chat-start data (today's priorities, this week's preferences, the live Drive layout) into one cached block, every chat-start invalidates the cache and you get zero benefit.

The fix is to split the system message into two content blocks. Block one is the ~12,290-token static prefix — persona, behavioral rules, lab profile, workflows, integrations — with cache_control: ephemeral attached. Block two is the ~820-token dynamic suffix — loaded fresh from the SQLite goals table, the personality sliders file, and the weekly preference consolidation — without cache_control. The static prefix stays cached across chat-starts even when the dynamic suffix changes. graph.py's build_graph() function constructs the message that way.

The dynamic suffix matters for a separate reason: it removes a stale-data risk. The previous prompt had URGENT PRIORITIES RIGHT NOW (April 2026) hardcoded with hand-typed deadlines. Six weeks later, those deadlines were stale and Tealc was still anchored to them. Replacing that block with a _load_urgent_priorities() function that queries the goals table at chat-start means priorities auto-refresh whenever the table is updated, with no prompt edit required.

Tools Tool descriptions belong in tools, not the prompt ~3,900 tokens reclaimed by deleting the catalog.

The biggest single-section win came from deleting the in-prompt tool catalog — 104 lines of - tool_name: one-line description that had accumulated over months. The reason it was there at all: with earlier models, listing tools in the system prompt improved tool-selection accuracy. With Claude 4.6 and 4.7, that's no longer needed.

LangGraph already passes each @tool-decorated function's docstring to the API as a native tool schema. The model sees the tool name, parameters, and description through the proper structured channel. Duplicating that in the system prompt as plain prose adds tokens without adding information.

What does still belong in the prompt is the small set of routing rules that aren't obvious from a docstring: when to call describe_capabilities vs. answering from memory; when to call find_resource before falling back to search_drive; the cross-references to the <safety_rules> and <workflows> blocks for tools that have preconditions. That's now a ~15-line <tool_routing> block instead of a 105-line catalog.

Anthropic's Building effective agents piece is direct about this: “Tool definitions and specifications should be given just as much prompt engineering attention as your overall prompts.” Engineering effort that was going into the in-prompt catalog now goes into the tool docstrings themselves.

Always-on What always-on agents need that one-shot prompts don't Calibration, transparency, and stopping conditions over cleverness.

A chatbot prompt and an always-on agent prompt have different priorities. Anthropic's guidance for agentic systems emphasizes three principles: simplicity, transparency, and clear interfaces. Tealc's prompt tries to honor each:

  • Simplicity. Reserve agentic structure for problems that need it. Don't add a rule unless its absence has caused a real failure. The five-to-two collapse on safety rules came from this principle.
  • Transparency. The <persona> block tells Tealc to acknowledge scope on complex requests — one sentence before the first tool call, naming what's about to happen. The <uncertainty> block tells it to surface confidence and counter-arguments rather than presenting conclusions as settled. Both make the agent's reasoning visible to the human reading along.
  • Clear interfaces. Tools have detailed docstrings with parameter types and constraints. Tools with preconditions (must call X before Y) have the precondition in the <safety_rules> or <workflows> block, so it's enforceable both by the model's reading and (for the safety rules) by code-level guards in the tool implementation itself.

Stopping conditions matter too. The hard guards are in code, not just in prose: destructive operations require a second call with confirmed=True after a preview; the recursion_limit on the LangGraph loop is set explicitly; the email-trash function re-checks the protected-sender blocklist regardless of what the caller said. The prompt-side rules are the first line of defense; the code-side guards are the catch.

What we cut The deliberately-not-included list Removing things is most of the work.
  • The 104-tool catalog. Tool schemas come from @tool docstrings. Listing tools in the prompt was duplicate.
  • Hardcoded urgent priorities. Deadlines go stale. Now loaded dynamically from the goals table at chat-start.
  • The “V2 / V3 / V4 / V5 / V6 / V7 / V8” section labels. These were historical strata from when the prompt grew feature-by-feature. The model doesn't care which version added which rule; the labels were noise. Sections now have thematic names without version prefixes.
  • Aggressive markers. (non-negotiable) tags went from five to two. MUST / NEVER / DO NOT imperatives became regular prose almost everywhere. Anthropic's own guidance: dial back, the newer models follow softer language more reliably.
  • Repeated rules. Voice-matching was mentioned in three places, drafting policy in two, service-decline in two. Each rule now lives in exactly one place; cross-references go to the canonical home.
  • Validation-forward language. The <stance> block bans phrases like “great question” and the words “genuinely”, “honestly”, “straightforward”. Lifted directly from Anthropic's own published Claude system prompts.
Outcomes What changed in practice Honest accounting, not a victory lap.

We're early enough in the rewrite that the right framing is “tracking it” rather than “shipped it”. Things we expect to see, and will be measuring against the v2 critic scores and the Reviewer Circle correlations:

  • Less validation-forward chatter. Fewer responses opening with “Great question!” or hedging the disagreement. We expect this immediately on Sonnet, where the baseline tone is warmer; Opus 4.7 was already there per Anthropic's tone notes.
  • More explicit confidence on hypotheses and drafts. The <uncertainty> block asks for confidence, counter-argument, and a falsification path on every claim. The output ledger captures these per-artifact, so we can audit how often Tealc actually does it after the rewrite vs. before.
  • Cheaper chat-starts. Static prefix is cached at ~90% off. Dynamic suffix is small. After the first chat-start of the day primes the cache, subsequent starts should be substantially cheaper. We track this in get_cost_summary.
  • Fewer stale priorities. The hardcoded April 2026 priorities are gone. The dynamic loader pulls live priorities from the goals table, so the agent's frame matches the agent's data without manual prompt edits.
  • Easier evolution. Adding a workflow now means adding one <workflow id="…"> block, ideally with a <reason> and an <example>. No version-section archaeology required.

What we're not claiming: that the prompt is now “optimal”. It's a single design pass against a published guidance document and a small set of research patterns. The honest version is that we'll see how it behaves over the next several months and iterate from there. The instrumentation is in place: weekly self-review, output-ledger critic scores, Reviewer Circle blinded scores, cost telemetry.

Iteration What landed after first publication Skills system, SCIENTIST_MODE preamble, Memory tool scaffolding, project-session helpers, adaptive thinking.

The first version of this page covered the prompt rewrite itself. A few weeks later, several Anthropic-side primitives had matured enough to wire into the harness. None of these change the prompt's design philosophy — they extend it.

Six on-demand Skills. Big chunks of domain-specific knowledge moved out of the static prompt and into SKILL.md files under agent/skills/: karyotype-databases, r-comparative-phylogenetics, wiki-authoring, grant-section-drafter, hypothesis-pipeline-rubric, voice-matching. Each is 1.5–3K tokens of body content; only the ~100-token metadata sits in the system prompt at all times. The chat agent reads the relevant SKILL.md once per session, only when triggered. Anthropic's Skills architecture post describes the progressive-disclosure pattern. Net result: the system prompt stays compact while the agent has access to ~10K tokens of focused playbooks on demand.

SCIENTIST_MODE preamble for artifact-grade jobs. The chat agent now defaults to skeptical reviewing — but the scheduled jobs that produce the artifacts the chat agent later reviews didn't share that stance. A shared ~200-token preamble in agent/jobs/__init__.py now prepends to the system prompts of the eight artifact-grade jobs (grant drafter, hypothesis generator, literature synthesis, comparative-analysis interpreter, recognition-impact, recognition-pipeline-health, cross-project synthesis, weekly review). It enforces calibration, anti-hype word list, distinguish-hypothesis-from-finding, distinguish-correlation-from-causation, and a no-fabrication rule. Per-job prompts add domain-specific rubric on top — the literature synthesis job, for example, now requires explicit claims_vs_findings, correlation_vs_causation, and limitations fields in its JSON output. The asymmetry is intentional: the chat is the critic, the jobs are the producers, and the preamble enforces a calibration floor on the producers without making them as adversarial as the chat.

Anthropic Memory tool scaffolding. A subclass of BetaAbstractMemoryTool at agent/memory_backend.py, file-backed at ~/Library/Application Support/tealc/memories/ (outside Google Drive sync to avoid the corruption modes that hit the SQLite live store). All six commands implemented (view / create / str_replace / insert / delete / rename) with path-traversal protection, atomic writes, and size caps. Native registration in build_graph deferred — the memory_20250818 beta-tool format needs a langchain-anthropic shim that doesn't exist yet.

Multi-session continuity helpers. Anthropic's multi-session harness pattern applied to research projects. Each project gets a progress.md log + optional feature_list.json, stored at ~/Library/Application Support/tealc/memories/projects/<project_id>/. Two new tools, start_project_session and end_project_session, bookend extended work on a specific project — the agent reads the progress log at the start and appends a timestamped entry at the end. For something like a multi-week grant or paper revision, this is the missing layer between the per-chat thread and the long-horizon SQLite project tables.

Adaptive thinking with effort tuning. The chat agent now requests thinking={"type": "adaptive"} with output_config={"effort": "high"} via ChatAnthropic's model_kwargs. Per-job effort tiers live in agent/model_router.py's EFFORT_TIERS: xhigh for cross-project synthesis and formal hypothesis pass, high for the chat agent and grant drafter, medium for everyday synthesis jobs, low for email-triage classification and paper-of-the-day summaries. The router function returns a (model, effort) NamedTuple so the per-job effort threading is mechanical — that wiring is in progress at time of writing.

Optional Langfuse observability. A new agent/observability.py module exposes a @traced decorator that captures Anthropic API calls (input, output, usage, cost, latency) into Langfuse as OpenTelemetry spans. Pure no-op when langfuse isn't installed or env vars aren't set — nothing breaks. The intended use cases: prompt-regression detection across model upgrades, per-tool cost dashboards that don't depend on the agent summarizing them, and LLM-as-judge regression suites for the prompts that produce research artifacts. pip install langfuse + three env vars to enable.

The pattern across all five additions is the same: rather than trying to make the system prompt do more, off-load domain knowledge and bookkeeping to better-shaped primitives so the prompt itself stays focused on stance, behavior, and the small set of rules that actually steer output. Anthropic's effective context engineering post calls this just-in-time context, and it's the design lesson that ties everything together.

References Sources cited above, in one place Anthropic guidance + agent-evaluation research.

Anthropic guidance (primary references):

Agent-evaluation and adversarial-critic research:

Tealc-specific:

Question copied. Paste it into the NotebookLM tab.