Verifiers, Harnesses, and AI-Native Products at User Scale

Production agents are shipping with verifiers, custom harnesses, and infrastructure that scales to millions of users. This week's posts show how teams are optimizing agent evaluation costs, rebuilding for retention, and going AI-native across entire product suites.

Introducing the next generation of Amazon OpenSearch Serverless for building your agentic AI applications

AWS shipped a faster, cheaper OpenSearch Serverless that provisions vector stores in seconds instead of minutes and scales 20x faster than the previous generation. The 60% cost reduction and native integrations with Vercel and Kiro matter for teams running agent memory and retrieval at scale. If you're paying for standing capacity on provisioned clusters just to handle agent context lookups, this is worth benchmarking.

Designing Efficient Verifiers for Legal Agents

LangChain and Harvey cut verifier costs by 60-1000x by switching from frontier models to DeepSeek v4 Flash with batched evaluations across 2,348 rubric criteria. The tradeoff is real: batched verification is faster and cheaper but less accurate than per-criterion scoring, and they had to tune prompts to drop the false-pass rate from 10.7% to 9.5%. This matters for RL post-training loops where verifier costs can exceed agent inference costs.

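// Batched verification: one verifier call scores every output against every criterion.
// AgentOutput, Rubric, and the `deepseek` client are placeholders, not a specific SDK.
type AgentOutput = { id: string; text: string };
type Rubric = { id: string; criterion: string };
declare const deepseek: { complete(prompt: string): Promise<string> };
const batchVerify = async (outputs: AgentOutput[], rubrics: Rubric[]): Promise<string> => {
  const prompt = `Evaluate all ${outputs.length} outputs against these ${rubrics.length} criteria...`;
  return deepseek.complete(prompt); // cheaper than one frontier-model call per criterion
};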
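
To make the accuracy tradeoff concrete, here is a minimal sketch of how you might measure a false-pass rate and split criteria into batches. The `Verdict` shape, the `chunk` helper, and the definition used (share of truly failed criteria that the verifier passed) are assumptions for illustration; the post may define the metric differently.

```typescript
// Hypothetical record: one rubric criterion judged by a verifier,
// paired with a trusted reference label (e.g. human review).
interface Verdict {
  criterionId: string;
  verifierPassed: boolean;
  referencePassed: boolean;
}

// False-pass rate (assumed definition): of the criteria that truly failed,
// the fraction the verifier marked as passed.
function falsePassRate(verdicts: Verdict[]): number {
  const trulyFailed = verdicts.filter((v) => !v.referencePassed);
  if (trulyFailed.length === 0) return 0;
  const falsePasses = trulyFailed.filter((v) => v.verifierPassed).length;
  return falsePasses / trulyFailed.length;
}

// Split criteria into fixed-size batches so one verifier call covers several.
function chunk<T>(items: T[], size: number): T[][] {
  if (size < 1) throw new Error("size must be >= 1");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}
```

Comparing `falsePassRate` across batch sizes on a labeled sample is one way to decide how much accuracy you are willing to trade for cost.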

How Harmonic Rebuilt Scout on Deep Agents and 4x'd Retention with LangSmith

Harmonic scrapped their multi-subgraph agent architecture for a single frontier model with tool access and saw a 4x improvement in retention and a 10x increase in session duration. The key shift was moving from rigid orchestration to letting the model reason over 40M companies and 200M people with firm-specific context. LangSmith Deployment handled infrastructure and automated failure detection so a lean team could iterate on product instead of ops.

How to Build a Custom Agent Harness

LangChain's create_agent (createAgent in the TypeScript package) gives you composable middleware hooks (before/after model calls, tool calls, startup/teardown) for building production agent harnesses. The argument is that memory management, retry logic, policy enforcement, and cost controls can't live in prompts alone. This is the scaffolding layer between your model and the real world.

import { createAgent } from "langchain";

// llm, searchTool, dbTool, and the three middleware objects are placeholders defined elsewhere.
const agent = createAgent({
  model: llm,
  tools: [searchTool, dbTool],
  middleware: [
    costTracker,      // log token usage per step
    retryHandler,     // exponential backoff on tool failures
    policyEnforcer,   // block certain tool combinations
  ],
});
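
If the middleware idea is new to you, here is a library-free sketch of the pattern: each middleware wraps the next handler, so policy checks and accounting live in code rather than in the prompt. This is not LangChain's middleware interface; `ToolCall`, `compose`, `blockCombination`, and `countCalls` are hypothetical names, and the handler is synchronous only to keep the example short.

```typescript
interface ToolCall {
  tool: string;
  args: Record<string, unknown>;
}
type ToolHandler = (call: ToolCall) => string;
type Middleware = (next: ToolHandler) => ToolHandler;

// Wrap the base handler so the first middleware in the list runs outermost.
function compose(middleware: Middleware[], base: ToolHandler): ToolHandler {
  return middleware.reduceRight<ToolHandler>((next, mw) => mw(next), base);
}

// Policy: refuse `blocked` once `first` has already run through this middleware.
function blockCombination(first: string, blocked: string): Middleware {
  const seen = new Set<string>();
  return (next) => (call) => {
    if (call.tool === blocked && seen.has(first)) {
      throw new Error(`policy: ${blocked} is not allowed after ${first}`);
    }
    const result = next(call);
    seen.add(call.tool); // record only calls that completed
    return result;
  };
}

// Accounting: count every attempted tool call, including ones blocked downstream.
function countCalls(counter: { calls: number }): Middleware {
  return (next) => (call) => {
    counter.calls += 1;
    return next(call);
  };
}
```

Ordering matters: because `countCalls` is listed first in the usage below, it sees calls that the policy layer later rejects.

```typescript
const usage = { calls: 0 };
const runTool = compose(
  [countCalls(usage), blockCombination("readPayroll", "sendEmail")],
  (call) => `ran ${call.tool}`,
);
runTool({ tool: "readPayroll", args: {} });
```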

How Rippling Went AI-Native Across Every Product in 6 Months with Deep Agents and LangSmith

Rippling built a supervisor agent coordinating 5-7 specialized subagents (read, RAG, action) to query thousands of database tables across HR, IT, payroll, and finance, now serving 1M+ users. They solved context window limits with dynamic skill injection and used LangSmith tracing to build a semi-automated self-healing eval loop. Six months from concept to production across every product is the velocity benchmark here.
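
The post does not spell out how Rippling's skill injection works, so treat the following as one plausible shape rather than their implementation: score each skill against the query, then add the best matches to the prompt until a token budget is used up. The `Skill` type, the keyword matching, and the four-characters-per-token estimate are all assumptions.

```typescript
interface Skill {
  name: string;
  keywords: string[];
  instructions: string; // text injected into the prompt when the skill is selected
}

// Rough heuristic (about 4 characters per token), not a real tokenizer.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

// Pick skills whose keywords appear in the query, best match first,
// skipping any skill that would push the prompt past the token budget.
function selectSkills(query: string, skills: Skill[], tokenBudget: number): Skill[] {
  const words = new Set(query.toLowerCase().split(/\W+/).filter(Boolean));
  const ranked = skills
    .map((skill) => ({
      skill,
      score: skill.keywords.filter((k) => words.has(k.toLowerCase())).length,
    }))
    .filter((entry) => entry.score > 0)
    .sort((a, b) => b.score - a.score);

  const chosen: Skill[] = [];
  let used = 0;
  for (const { skill } of ranked) {
    const cost = estimateTokens(skill.instructions);
    if (used + cost > tokenBudget) continue;
    chosen.push(skill);
    used += cost;
  }
  return chosen;
}
```

A production system would more likely rank skills with embeddings or a router model, but the budget check is the part that keeps the context window under control.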

The pattern is clear: teams are moving past prototypes to production systems that require specialized infrastructure for verification, harness customization, and cross-product orchestration at user scale. 🚀