
An AI agent, running a real research project.

We handed a Claude-powered agent a complete research problem in claw evolution and asked it to carry the whole arc: form hypotheses, pick comparative methods, run the analyses, draft the manuscript. We expect it to fail, and the failure is the point. Each breakdown teaches us how to design the next version. The goal is not a robot PI. It's an agent that works at our side as scientific collaborator, programmer, and an army of students and postdocs, doing meaningful work when the task is correctly scoped, tested, and examined.

Agent version: v0.7
Complete projects underway: 1
Model backbone: Claude
Openness: every prompt, log, and failure published
Task: claw evolution across Coleoptera, hypothesis through manuscript
Status: currently working through candidate hypotheses, cross-checking against the literature
Last human review: two days ago, structured critique of comparative-method choice
Failures logged: on track to publish; see the log once the project wraps

Why we're doing this

There is a lot of hype right now about AI agents doing science. Most of it is either demos on synthetic problems or benchmark tests on narrow tasks. Very little of it tells you whether an agent can carry a real evolutionary-biology project end to end. We wanted to know.

We also wanted a training corpus for the agent we actually want to build, which is not an autonomous scientist but a scientific collaborator. To design that, we need to understand where today's models break down on the specific shape of work we do. The only way to learn that is to push one until it breaks.

What the agent is doing

Problem: claw evolution in Coleoptera. A real open question in the lab, not a synthetic benchmark.

Beetle tarsal claws vary wildly in shape, and the functional correlates are only partially understood. The agent was handed the same problem a beginning graduate student would get: morphological variation, a growing literature, and a phylogeny. It has access to comparative-methods libraries, the lab's karyotype and trait databases, and the relevant primary literature via retrieval. It does not have access to co-authors.

Autonomy: the agent chooses the hypotheses, the methods, and the writeup. Humans review structure; they don't supply it.

We give the agent a research brief and a budget of attempts. It picks which specific hypotheses to prioritize, which comparative methods to apply, how to evaluate model fit, and how to structure the writeup. We review its plan and its outputs, but we are explicitly avoiding the trap of "pretending" autonomy by prompting our way to a pre-written paper.

Tooling: real tools, real data, real logs. Retrieval over the primary literature, R packages, our own databases, and a structured scratchpad.

The agent runs real code, produces real figures, and logs every prompt, response, and artifact. No memory wipes between stages. When it gets a result that doesn't make biological sense, we want the record to show whether it noticed.
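The append-only log described above could be as simple as one JSON line per agent interaction. A minimal sketch, assuming a JSONL file and field names of our own invention (the lab's actual schema is not published):

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
import json

@dataclass
class LogEntry:
    """One agent interaction: the prompt, the response, and any artifacts produced."""
    stage: str       # hypothetical stage label, e.g. "hypothesis-selection", "analysis", "writeup"
    prompt: str
    response: str
    artifacts: list = field(default_factory=list)  # paths to generated figures, scripts, data
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

def append_entry(log_path: str, entry: LogEntry) -> None:
    """Append one entry as a JSON line; the log is only ever appended to, never rewritten."""
    with open(log_path, "a") as f:
        f.write(json.dumps(asdict(entry)) + "\n")
```

An append-only line-per-record format makes the "no memory wipes" guarantee auditable: any reviewer can replay the project in order and check whether the agent flagged an implausible result at the moment it appeared.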

What we expect to fail

If a Claude-powered agent could run a complete evolutionary-biology project unassisted today, we would already know. Instead, we're predicting specific failure modes:

Failure mode 1: hypothesis selection under uncertainty. Picking something tractable instead of something right.

Agents are trained to complete tasks. A research project is not a task with a clear completion signal, especially at the hypothesis-selection stage. We expect the agent to gravitate toward hypotheses that are easy to test with available data rather than hypotheses most likely to be true.

Failure mode 2: biological plausibility checks. Running a model to completion even when the output is nonsense.

The agent can compute a statistic. We want to know if it notices when the statistic describes something biologically impossible, and whether it corrects course or publishes the number anyway.

Failure mode 3: novel-result discipline. Distinguishing what it found from what was already known.

Agents are trained to be confident. Evolutionary biology asks researchers to be precise about which part of a result is actually new. We expect this is where the draft manuscript will most obviously read as AI-written, and where corrective intervention will pay back the most.

Failure mode 4: scope discipline. Knowing when a project is done, versus sprawling into adjacent questions forever.

A graduate student learns this painfully over several years. We don't think an agent learns it at all without explicit structure. Watching where and how the scope escapes is probably the single most informative signal we'll get.

We are not trying to build a robot principal investigator. We are trying to learn, in public, what shape of agent is actually useful to a working scientist.

What "working" would look like

We are not measuring success by whether the agent publishes a paper. A bad paper published unassisted is worse than a good paper drafted with structured collaboration. The real measures:

If the project ends in a draft manuscript that we can turn into a joint submission with substantial human authorship, that's a win. If it ends in a documented record of what broke and why, that's also a win. The only losing outcome is a polished output that no one can verify.

How to follow along

Every prompt, every response, every generated artifact will be published when the project concludes. Expect a companion manuscript describing the methodology and the agent's behavior, intended as a case study for anyone else thinking about putting an agent on a real problem.
