Engineering Highlights¶
Every design decision in AIBA answers one question: how do we make autonomous web agents safe, fast, and affordable for a single user on a single machine?
Model Choice: Gemini¶
AIBA runs exclusively on Google Gemini models.
| Factor | Why Gemini wins |
|---|---|
| 1M token context window | Web pages are long. Snapshots are longer. A 1M window means the agent can hold dozens of pages, screenshots, and tool outputs in context without truncation. |
| Native vision | Gemini reads images natively — no separate vision model, no format conversion. browser_take_screenshot → read_image is a single hop. |
| Cost | Gemini Flash models are among the cheapest per-token LLMs with vision. Sub-agents make many calls — cost matters. |
| Built-in search grounding | WEB_SEARCH_ENGINE=native injects Google Search results directly into the model's context. No separate search API, no tool round-trips. |
Pydantic AI v2: Deterministic by Design¶
AIBA is built on pydantic-ai >= 2.0.0 — a framework designed for production agents, not prototypes.
agent = Agent(
    model=model,
    system_prompt=SYSTEM_PROMPT,
    capabilities=[...],
    retries=AgentRetries(tools=1, output=1),
)
| Property | What it means for AIBA |
|---|---|
| Structured output | Every tool return is typed. No parsing ambiguity. |
| Deterministic retries | AgentRetries(tools=1, output=1) — one retry on tool failure, one on output validation. Predictable, not infinite. |
| Capability injection | Guardrails, tool schemas, and system prompt reinjection are capabilities — composable, testable, togglable. |
| UsageLimits | Request caps, tool call caps, and token budgets enforced at framework level. Not advisory — hard stops. |
| No hidden state | Agent configuration is explicit in code. Model, tools, capabilities — all visible in Agent(...). |
The framework itself enforces budget discipline. The agent can't overspend even if it wants to.
Session History: Aggressive Truncation¶
Every REPL turn trims conversation history before the next call. This isn't a nice-to-have: untrimmed history multiplies cost and latency on every subsequent turn.
Two-stage pipeline¶
Stage 1 — Strip tool noise (_filter_tool_parts):
Tool-call payloads and tool-return outputs (YAML, JSON, HTML from Playwright snapshots) are stripped. Their content is already synthesized in the agent's text response. Keeping them would re-consume tokens for no informational gain.
Stage 2 — Trim to 20 messages (trim_history):
The filtered history is capped at MAX_HISTORY_MESSAGES = 20, aligned to a user-message boundary so Gemini's strict turn ordering isn't broken. The system prompt is always preserved as message 0.
Without this, a single browser-heavy investigation could carry 200+ messages of snapshot YAML into every subsequent turn, multiplying token costs with every interaction.
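The two stages above can be sketched roughly as follows. This is a minimal illustration, not AIBA's actual code: it models messages as plain dicts with hypothetical "role" and "parts" keys rather than the framework's message types, and it assumes the system prompt is always the first message.

```python
from typing import Any

MAX_HISTORY_MESSAGES = 20  # value taken from the text above

# Hypothetical message shape: {"role": str, "parts": [{"kind": str, "content": str}]}
Message = dict[str, Any]


def filter_tool_parts(history: list[Message]) -> list[Message]:
    """Stage 1: drop tool-call and tool-return parts, then drop emptied messages."""
    filtered = []
    for msg in history:
        parts = [p for p in msg["parts"] if p["kind"] not in ("tool-call", "tool-return")]
        if parts:
            filtered.append({**msg, "parts": parts})
    return filtered


def trim_history(history: list[Message], limit: int = MAX_HISTORY_MESSAGES) -> list[Message]:
    """Stage 2: keep the system prompt plus the newest messages, starting on a user turn."""
    if len(history) <= limit:
        return history
    system, rest = history[:1], history[1:]  # assumes message 0 is the system prompt
    tail = rest[len(rest) - (limit - 1):]
    # Advance to the first user message so the kept window begins on a user turn.
    for i, msg in enumerate(tail):
        if msg["role"] == "user":
            return system + tail[i:]
    return system
```

Because the window is aligned forward to a user message, the result can be slightly shorter than the cap, which is the price of never starting the kept history mid-exchange.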
Web Artifacts: Files First, Read Later¶
Web pages and snapshots are never passed inline. They're saved to disk first, then read with line caps.
The problem¶
A browser_navigate to a job listing page can return 15,000 lines of accessibility YAML. Sending that directly to the model would consume 50K+ tokens in one shot — and most of it is nav bars, footers, and boilerplate.
The solution¶
Every Playwright MCP tool that produces content saves to .playwright-mcp/ and returns only a file path, not the content. The agent uses read_and_filter_file to pull what it needs:
def read_and_filter_file(
    file_path: str,
    start_line: int | None = None,
    end_line: int | None = 300,  # ← default cap
    search_string: str | None = None,
    search_regex: str | None = None,
) -> str:
    ...  # body omitted in this excerpt
| Design choice | Effect |
|---|---|
| Default 300-line cap | Even without filters, the agent can't accidentally read 15K lines |
| Regex + substring filters | Agent extracts exactly what it needs — emails, names, prices |
| Line-numbered output | Agent can re-read specific ranges with start_line/end_line |
This is the single biggest token saver in AIBA. Saving a 15,000-line snapshot to disk consumes no tokens; only the filtered subset the agent reads back does.
Guardrails: Shields On by Default¶
Four capability wrappers from pydantic-ai-shields protect against cost overruns, dangerous tool use, prompt injection, and secret leakage. All active by default.
| Shield | Purpose |
|---|---|
| CostTracking | Hard USD budget cap: kills the run, not your wallet |
| ToolGuard | Human-in-the-loop approval for tools like send_email |
| InputGuard | Blocks prompt injection and homoglyph attacks |
| SecretRedaction | Redacts API keys/tokens from sub-agent output |
See Guardrails for full details.
Folder Sandboxing: No Escape from Allowed Paths¶
Agents can only access files within designated folders. There is no general filesystem access.
| Folder | Purpose | Tools that access it |
|---|---|---|
| data/ | CSV files | read_csv, append_csv |
| .playwright-mcp/ | Browser snapshots, screenshots | read_and_filter_file, read_image |
| static/ | Email attachments | send_email |
| sessions/ | Saved conversations | /save, /load |
Every tool validates that the requested file is within its allowed directory. Path traversal attacks (../../etc/passwd) are rejected:
if "/" in filename or "\\" in filename or filename.startswith(".."):
    return f"ERROR: '{filename}' is not a valid filename."
The agent can't create files outside these folders either — append_csv only writes to pre-existing CSVs with matching headers. It never creates new files.
Flat Files: No Database, No Migrations, No Overhead¶
AIBA stores everything as flat files on disk. No SQLite, no Postgres, no ORM.
| Data | Format | Location |
|---|---|---|
| Conversation history | JSON | sessions/*.json |
| Beat state | JSON | data/beat_state.json |
| Beat run logs | JSON + Markdown | logs/beats/<name>/ |
| Task tracking | CSV | data/*.csv |
| Browser cookies | JSON | .playwright-mcp/cookies.json |
Why this works¶
AIBA is single-user software. There's no concurrent access, no replication, no sharding. A JSON file is:
- Human-readable: open sessions/*.json and see every message
- Zero-dependency: Python's json module, nothing to install
- Instant to debug: cat data/beat_state.json beats SELECT * FROM ...
Atomic writes use the .tmp → rename pattern to prevent corruption.
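The .tmp → rename pattern can be sketched with the standard library alone. The function name write_json_atomic is illustrative, and the fsync call is an extra durability step that the text does not confirm AIBA performs.

```python
import json
import os
from pathlib import Path


def write_json_atomic(path: str, data: object) -> None:
    """Write data as JSON to path via a sibling .tmp file and an atomic rename."""
    target = Path(path)
    tmp = target.with_name(target.name + ".tmp")
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump(data, fh, indent=2)
        fh.flush()
        os.fsync(fh.fileno())  # push bytes to disk before the rename
    # os.replace is atomic when tmp and target are on the same filesystem,
    # so readers see either the old file or the new one, never a partial write.
    os.replace(tmp, target)
```

If the process dies before os.replace, the original file is untouched and only a stray .tmp file remains.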
Playwright MCP: One Browser, Many Agents¶
A single Chromium instance serves all sub-agents. There's no browser-per-worker model.
┌─────────────────────────────┐
│ Playwright MCP (npx) │
│ ┌───────────────────────┐ │
│ │ Chromium (--isolated)│ │
│ │ Shared across all │ │
│ │ sub-agent calls │ │
│ └───────────────────────┘ │
└─────────────────────────────┘
│
┌────┴────┬────────┬────────┐
Worker 1 Worker 2 Worker 3 ...
The --isolated flag gives each sub-agent its own browser context (separate cookies, localStorage, session) while sharing the single Chromium process. This means:
- Memory: 1 browser process, not N × 500MB
- Startup: The transport is deferred (defer_loading=True), so Playwright doesn't start until the first browser tool is actually called, which saves resources when the agent only does web search. Once started, it keeps running until all tasks are done.
- Cookie isolation: --storage-state keeps each worker's context isolated at runtime
AIBA-beats: Zero Resources When Idle¶
AIBA-beats has no built-in scheduler. No daemon, no polling loop, no background process.
The user configures their OS scheduler (cron, launchd, Task Scheduler) to run python main.py beat run --all at their desired interval. Between runs, AIBA-beats consumes exactly zero CPU, zero memory, and zero network.
| Approach | CPU idle | Memory idle | Complexity |
|---|---|---|---|
| Built-in scheduler | > 0 (polling loop) | > 0 (process alive) | High (watchdog, crash recovery) |
| OS cron | 0 | 0 | Low (one crontab line) |
State persists in data/beat_state.json — each run reads it, executes due beats, and writes it back. If a run crashes, the next cron tick picks up where it left off. No lost state, no stale locks.
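The read, run-due-beats, write-back cycle might look like the sketch below. The state schema (a per-beat interval_minutes and an ISO-8601 last_run) and both function names are assumptions for illustration; the real layout of data/beat_state.json is not shown in this document.

```python
from datetime import datetime, timedelta, timezone


def due_beats(state: dict, now: datetime) -> list[str]:
    """Return the names of beats whose interval has elapsed since their last run."""
    due = []
    for name, beat in state.items():
        last_run = beat.get("last_run")
        if last_run is None:  # never run before: always due
            due.append(name)
            continue
        elapsed = now - datetime.fromisoformat(last_run)
        if elapsed >= timedelta(minutes=beat["interval_minutes"]):
            due.append(name)
    return due


def mark_ran(state: dict, name: str, now: datetime) -> None:
    """Record a successful run; call this only after the beat finishes."""
    state[name]["last_run"] = now.isoformat()


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    state = {"inbox": {"interval_minutes": 60, "last_run": None}}  # hypothetical beat
    for beat_name in due_beats(state, now):
        mark_ran(state, beat_name, now)
    print(state)
```

Updating last_run only after a beat succeeds is what lets a crashed run recover: the beat is still due on the next scheduler tick.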
TUI: Built for DX¶
The terminal UI is deliberately minimal: no ncurses, no Textual framework, and every choice is made for developer experience.
| Choice | Rationale |
|---|---|
| Raw ANSI codes | Zero dependencies for color. Works in any terminal, over SSH, in tmux. |
| Rich only for markdown | Agent output (tables, code blocks, headings) uses Rich. Setup screens don't — keep them fast. |
| Line-by-line input | input() with teal ▸ prompt. No TUI framework to fight. |
| 4–5 step wizard | Mode → Template → Effort → Sub-agents → Notes. Linear, predictable. |
| Session resume | First screen asks if you want to pick up where you left off. Cached in sessions/. |
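The raw-ANSI prompt from the table can be approximated in a few lines. The escape code below is standard ANSI cyan, used here as a stand-in; the exact teal shade AIBA uses and the helper names are assumptions.

```python
TEAL = "\033[36m"  # standard ANSI cyan, standing in for AIBA's teal (assumption)
RESET = "\033[0m"


def prompt_text(symbol: str = "▸") -> str:
    """Build the colored prompt string with no third-party dependency."""
    return f"{TEAL}{symbol}{RESET} "


def ask(question: str) -> str:
    """Print a question, then read one line of input after the colored prompt."""
    print(question)
    return input(prompt_text()).strip()
```

Keeping the prompt as a plain string means it works anywhere input() does, including over SSH and inside tmux.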
The TUI is designed for the terminal power user — someone who lives in zsh, tmux, and vim. It's fast, keyboard-driven, and doesn't try to be a web app. See TUI for the full walkthrough.