Cost Control, Harnesses, and Specialized Agent Memory

Production teams are solving the economic and architectural challenges of agent deployments at scale. This week's posts show how organizations are building harnesses for security scanning, controlling runaway spend with gateways, and choosing between general-purpose chat and specialized agents with memory.

Building a 100x Cheaper Trace Judge with Fireworks

LangChain fine-tuned a Qwen-3.5-35B model to detect "Perceived Error" in agent traces: moments when users think the AI got something wrong and feel they have to intervene. The fine-tuned model matched the accuracy of GPT-5.5 and Claude Opus (90-96%) at 10-100x lower cost and, critically, transferred to completely different domains without retraining. A specialized evaluator like this is what makes continuous agent monitoring economically viable at production scale.
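To make the accuracy comparison concrete, here is a minimal sketch of how agreement between a judge model and human labels might be scored. The `LabeledTrace` shape and the `judgeAccuracy` helper are hypothetical names chosen for illustration; the post does not show LangChain's actual evaluation code.

```typescript
// Hypothetical record shape: one human label and one judge label per trace.
type LabeledTrace = {
  traceId: string;
  humanLabel: boolean; // true = a reviewer marked a perceived error
  judgeLabel: boolean; // true = the judge model flagged a perceived error
};

// Fraction of traces where the judge agrees with the human label.
function judgeAccuracy(traces: LabeledTrace[]): number {
  if (traces.length === 0) {
    throw new Error("need at least one labeled trace");
  }
  let agreements = 0;
  for (const trace of traces) {
    if (trace.humanLabel === trace.judgeLabel) {
      agreements += 1;
    }
  }
  return agreements / traces.length;
}
```

Running the same function over the same labeled set for both the fine-tuned judge and a frontier model would give comparable accuracy figures, which is the kind of comparison the 90-96% range summarizes.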

Why Fleet Has Both General Purpose Chat and Specialized Agents

Fleet treats General Purpose Chat and Specialized Agents as distinct architectural primitives: chat for one-off tasks with thread-level context, agents for recurring work with persistent memory and scoped tools. The key insight is that Specialized Agents maintain job-specific memory across runs, learning preferences and patterns over time, while chat stays ephemeral. This gives teams a clear upgrade path from ad hoc assistance to delegated, repeatable workflows.
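The difference between the two primitives can be sketched as two small data structures. Everything here (`ChatThread`, `SpecializedAgent`, `remember`, `startRun`) is a hypothetical illustration of the idea, not Fleet's API, which the post does not show.

```typescript
// Hypothetical types for illustration; Fleet's actual API is not shown in the source.
type Message = { role: "user" | "assistant"; text: string };

// Chat: context lives only inside one thread and disappears with it.
class ChatThread {
  private messages: Message[] = [];

  add(message: Message): void {
    this.messages.push(message);
  }

  context(): Message[] {
    return this.messages.slice();
  }
}

// Specialized agent: a fixed job, a scoped tool list, and memory that outlives any single run.
class SpecializedAgent {
  job: string;
  tools: string[];
  private memory: { [key: string]: string } = {};
  private runCount = 0;

  constructor(job: string, tools: string[]) {
    this.job = job;
    this.tools = tools;
  }

  remember(key: string, value: string): void {
    this.memory[key] = value;
  }

  // Each run starts with a snapshot of everything learned in earlier runs.
  startRun(): { run: number; tools: string[]; memory: { [key: string]: string } } {
    this.runCount += 1;
    return { run: this.runCount, tools: this.tools.slice(), memory: { ...this.memory } };
  }
}
```

A new `ChatThread` always starts empty, while a preference stored through `remember` during one agent run is visible to every later `startRun`.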

Introducing eve

Vercel released eve, an open-source agent framework with a file-based architecture where agents are directories containing instructions, tools, skills, and subagents. The framework handles durable execution, sandboxed compute, human-in-the-loop approvals, and tracing out of the box:

agent-name/
  instructions.md
  tools/
  skills/
  subagents/

This pattern eliminates repetitive plumbing and lets developers define what agents do rather than rebuilding infrastructure for each new agent.

Build your own vulnerability harness

Cloudflare evolved a 450-line security audit skill into a fleet-wide scanning pipeline covering 128 repos. The system uses a two-stage workflow: a Vulnerability Discovery Harness with specialized agents for reconnaissance, hunting, validation, and deduplication, feeding into a separate Vulnerability Validation System that uses different models to double-check findings. Key architectural decisions include treating LLMs as stateless compute with externalized state, keeping context below 25% of window capacity, and enabling dynamic threat modeling where agents generate repo-specific attack taxonomies.
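One way to picture the context ceiling together with externalized state is a packing step that admits material into a model call only while the total stays within 25% of the window, and leaves the rest in external storage for a later call. This is a sketch under assumptions: the `Chunk` shape, the `packContext` helper, and precomputed token counts are hypothetical, and the post does not describe how Cloudflare actually selects context.

```typescript
// Hypothetical chunk shape; token counts are assumed to be computed elsewhere.
type Chunk = { id: string; tokens: number };
type PackedContext = { selected: Chunk[]; deferred: Chunk[]; usedTokens: number };

// The 25% ceiling from the text, treated here as an inclusive upper bound.
const MAX_CONTEXT_FRACTION = 0.25;

// Greedily admit chunks in order while the running total stays within the budget.
// Chunks that do not fit are returned as deferred so the caller can keep them
// in external state and feed them to a separate, stateless call.
function packContext(chunks: Chunk[], windowTokens: number): PackedContext {
  const budget = Math.floor(windowTokens * MAX_CONTEXT_FRACTION);
  const packed: PackedContext = { selected: [], deferred: [], usedTokens: 0 };
  for (const chunk of chunks) {
    if (packed.usedTokens + chunk.tokens <= budget) {
      packed.selected.push(chunk);
      packed.usedTokens += chunk.tokens;
    } else {
      packed.deferred.push(chunk);
    }
  }
  return packed;
}
```

Because the function returns the deferred chunks rather than dropping them, nothing depends on the model remembering earlier calls, which matches the stateless-compute framing.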

How We Made Coding Agent Spend Predictable

LangChain built an LLM Gateway to control unpredictable agent spending after individual developers started generating thousands of dollars in weekly costs. The Gateway provides real-time spend visibility and configurable budgets at organization, workspace, user, and API key levels, integrating with LangSmith for tracing:

// Budget caps configured at each level, from organization down to API key
const budget = {
  organization: 10000,
  workspace: 2000,
  user: 500,
  apiKey: 100
};

Internal deployment revealed that complex model pricing and incomplete routing support in clients make static caps insufficient on their own: teams need workflows built around hard limits, not just alerts.
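A hard limit across the four levels could be checked along these lines. This is a minimal sketch, not the Gateway's implementation: `checkSpend`, `Amounts`, and `SpendDecision` are hypothetical names, and the assumption that a request is rejected when any level would exceed its cap is ours.

```typescript
// Hypothetical types; the Gateway's real configuration schema is not shown in the source.
type Scope = "organization" | "workspace" | "user" | "apiKey";
type Amounts = { organization: number; workspace: number; user: number; apiKey: number };
type SpendDecision = { allowed: boolean; blockedBy: Scope | null };

// Checked from the broadest scope to the narrowest.
const SCOPES: Scope[] = ["organization", "workspace", "user", "apiKey"];

// Reject the request if it would push spend past the cap at any level,
// and report which level blocked it so a follow-up workflow can act on it.
function checkSpend(limits: Amounts, spent: Amounts, requestCost: number): SpendDecision {
  for (const scope of SCOPES) {
    if (spent[scope] + requestCost > limits[scope]) {
      return { allowed: false, blockedBy: scope };
    }
  }
  return { allowed: true, blockedBy: null };
}
```

Returning the blocking scope rather than a bare boolean is what lets a workflow sit around the limit, for example routing a blocked workspace to an approval step instead of only raising an alert.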

These posts share a common thread: production agent systems require purpose-built infrastructure for evaluation, cost control, and task specialization. The teams shipping agents at scale are treating observability, budgets, and memory as first-class architectural concerns, not afterthoughts. 🛠️