How We Built Our First AI Agent: A Production Guide to Claude API Agents

Every weekday morning, an autonomous AI agent searches the web, picks the 2-3 most consequential AI stories of the previous 24 hours, writes a full article for each — news body, original analysis, sharp prediction — and saves them to our database. By the time we wake up, that day's news is already published. The agent costs us about $0.40 per story, runs in under two minutes per cycle, and has been doing this since the start of 2026.

This guide is a complete walkthrough of how it works. Not a toy example, not a framework demo — the actual production code, the choices that survived contact with reality, and the five lessons that took us months to internalize. If you're building anything more ambitious than a chatbot with the Claude API, this is the playbook we wish someone had handed us on day one.

Why "Agent" Is a Bad Word Until You Define It

The term "AI agent" has been stretched to mean almost anything: a chat interface, a workflow that calls a single API, a multi-step reasoning chain, a fully autonomous system that books your travel. That ambiguity is a problem when you're trying to build one, because the architectural choices for each are radically different.

We use a narrow, useful definition: an agent is a language model in a loop, equipped with tools, where the model decides when to stop. Three properties matter — the model makes the control-flow decisions (not your code), tool calls feed back into its context (not just return values), and the loop terminates on a signal the model chooses to emit (end_turn), not on a hard-coded number of iterations.

That definition rules out most things people call "agents". A workflow with a fixed sequence of LLM calls isn't an agent — it's an LLM pipeline. A function that calls Claude once with a tool and stops isn't an agent — it's a tool-augmented completion. The minute you wrap the call in a while loop and let the model decide whether to call again, you've crossed into agent territory and inherited an entirely new class of problems.

The Loop, in Code

Here is the actual control flow of the BitsMinds news agent, slightly simplified. This is the whole engine — everything else is configuration:

const messages = [{ role: "user", content: userMessage }];
let pauseCount = 0;

while (true) {
  const response = await client.messages.create({
    model: "claude-opus-4-7",
    max_tokens: 8000,
    system: SYSTEM_PROMPT,
    tools: [{ type: "web_search_20260209", name: "web_search" }, ...customTools],
    messages,
  });

  messages.push({ role: "assistant", content: response.content });

  if (response.stop_reason === "end_turn") break;

  if (response.stop_reason === "pause_turn") {
    if (++pauseCount > 5) break;       // safety cap
    continue;                          // resume server-side tool work
  }

  if (response.stop_reason === "tool_use") {
    const toolResults = [];
    for (const block of response.content) {
      if (block.type !== "tool_use") continue;
      const result = await executeTool(block.name, block.input);
      toolResults.push({ type: "tool_result", tool_use_id: block.id, content: result });
    }
    messages.push({ role: "user", content: toolResults });
    continue;
  }

  break;                               // unknown stop_reason — bail
}

That's it. Fifteen meaningful lines. Every other piece of an agent — the prompt design, the tool schemas, the validation, the retries, the cost optimization — is wrapped around that core, but the loop itself is short on purpose.

A few things in this snippet are worth their own bullet:

Always append the full assistant response. Not just the text — the entire response.content array, including tool-use blocks and any extended thinking blocks. If you summarize, paraphrase, or strip blocks before re-sending, you'll get model behavior that gets weirder turn over turn.
pause_turn is a real thing. When the model uses a server-side tool like web_search heavily, it can hit an internal iteration limit and pause. The fix is to append the paused assistant response to messages, exactly as for any other turn, and call the API again; the server resumes where it left off. No extra logic is needed, but the case must be handled or your loop will silently exit.
Tool results go in a user message. Counterintuitive — your code is sending the result, not the user — but this is how the API is shaped. The model treats tool results as the next "thing said to it".
Bail on unknown stop_reason. New reasons get added occasionally. A default break is the safest behavior.
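
The branching described in these notes can be pulled out into one small, testable function. What follows is a sketch rather than the BitsMinds code: decideNext, LoopAction, and MAX_PAUSES are hypothetical names, and the cap of 5 is simply copied from the loop above.

```typescript
// Hypothetical helper: maps a stop_reason to what the loop should do next.
type LoopAction = "finish" | "resume" | "run_tools" | "bail";

const MAX_PAUSES = 5; // same safety cap as the loop above

function decideNext(stopReason: string | null, pausesSoFar: number): LoopAction {
  switch (stopReason) {
    case "end_turn":
      return "finish"; // the model chose to stop
    case "pause_turn":
      // Resume server-side tool work, but only until the cap is reached.
      return pausesSoFar < MAX_PAUSES ? "resume" : "bail";
    case "tool_use":
      return "run_tools"; // execute client-side tools and send results back
    default:
      // max_tokens, stop_sequence, or any reason added later.
      return "bail";
  }
}
```

Keeping the decision separate from the API call makes it easy to log every "bail" with the stop reason that caused it, so a truncated run is visible instead of silent.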

The System Prompt Is an Editorial Brief, Not a Config File

The single highest-leverage piece of an agent is the system prompt. Not the model choice, not the temperature, not the tool list — the prompt. We rewrote ours four times in the first month, and each rewrite changed the agent's behavior more than any code change we shipped in the same period.

The mistake we kept making was treating the prompt as a list of rules. "Write at least 4 paragraphs. Use HTML tags. Avoid speculation." That style produces an agent that hits the rules and ignores the spirit. What worked was writing the prompt as an editorial brief — a document that explains the publication, not the format.

Our current prompt opens with "You are an AI news journalist AND analyst for BitsMinds (PixelMind), an AI intelligence website. You do NOT just summarize the news — you add a sharp, opinionated analytical layer that readers can't get from RSS aggregators." That single sentence does more work than the next thirty rules combined, because it gives the model an identity. Every subsequent rule is read in service of that identity, not as a checklist.

Three concrete techniques have outsized impact:

Use "MUST" sparingly. Models pattern-match on emphasis. If everything is "MUST", nothing is. Reserve it for the two or three constraints you genuinely cannot tolerate violations of (in our case: HTML formatting, slug uniqueness, and the structure of the analysis callout).
Give examples of bad output, not just good. We list "weak takes" alongside "strong takes" in our prompt. The model is much better at distinguishing patterns when it can see the failure mode it's being asked to avoid.
Bake the brand voice in. Our prompt explicitly says "A sharp opinion that ages badly is better than a hedged one that says nothing. Avoid 'time will tell' and 'may eventually' phrasing." The model will not invent your voice — you have to encode it.
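
One way to keep those three techniques honest is to assemble the prompt from named parts, so it is obvious how many "MUST" rules exist and whether bad examples are present. This is only a sketch: BriefParts and buildSystemPrompt are hypothetical names, the hard rules paraphrase the three constraints mentioned above, and the strong take is a placeholder rather than real prompt text.

```typescript
// Hypothetical shape for the parts of an editorial-brief prompt.
interface BriefParts {
  identity: string;      // who the model is writing as
  voice: string;         // brand voice, stated explicitly
  hardRules: string[];   // the few constraints that earn a "MUST"
  weakTakes: string[];   // examples of output to avoid
  strongTakes: string[]; // examples of output to aim for
}

function buildSystemPrompt(p: BriefParts): string {
  const bullets = (items: string[]): string =>
    items.map((item) => `- ${item}`).join("\n");
  return [
    p.identity,
    p.voice,
    "Hard constraints:\n" + bullets(p.hardRules.map((rule) => `MUST: ${rule}`)),
    "Weak takes (avoid):\n" + bullets(p.weakTakes),
    "Strong takes (aim for):\n" + bullets(p.strongTakes),
  ].join("\n\n");
}

const examplePrompt = buildSystemPrompt({
  identity:
    "You are an AI news journalist AND analyst for BitsMinds (PixelMind), an AI intelligence website.",
  voice: "A sharp opinion that ages badly is better than a hedged one that says nothing.",
  hardRules: [
    "format the article body as HTML",
    "use a unique slug",
    "follow the analysis callout structure",
  ],
  weakTakes: ["This will impact the AI landscape."],
  strongTakes: ["<placeholder: a specific, falsifiable call with a timeframe>"],
});
```

Because hardRules is an explicit array, a review can flag the moment it grows past a handful of entries.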

Tool Design Is Where Most Agents Break

Tools are the agent's hands. If they're shaped wrong, no amount of prompt engineering will save you. The single tool that drives most of our system is create_article, and it took three iterations before we settled on its current shape.

Version 1 had one parameter: content (a giant string of HTML). The agent did what we asked, but the output was inconsistent — sometimes 6 paragraphs, sometimes 2, sometimes missing the source link, sometimes inventing a slug with spaces. The signal we sent ("write whatever") was the signal we got back.

Version 2 split content into structured fields: title, slug, excerpt, content, category, source, sourceUrl, image. Every field had a concrete description with constraints — "Unique kebab-case URL identifier, max 60 chars". The output quality jumped roughly 4x overnight. Same model, same prompt, different schema.

Version 3 added the editorial callout fields: analysis as an array of paragraphs and prediction as a single sentence. Critically, we made them optional. An earlier draft had made them required, and the model dutifully filled them in even when the story didn't justify a sharp call. We were getting "this will impact the AI landscape" filler in the prediction slot. Making the field optional, plus a one-line nudge in the description ("OMIT this field entirely if you cannot make a sharp, specific call"), restored quality.

The pattern that crystallized: tool inputs should be the smallest set of fields that fully determine the output, with each field's description doing real work. "Title" isn't a useful field description. "Article headline, max 80 chars, clear and informative, no clickbait punctuation" is. Treat every description as a micro-prompt — the model reads it.
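
A Version 3 style definition might look roughly like the sketch below. The field names come from the text above, and the title, slug, and prediction descriptions reuse the wording quoted there. Everything else is an assumption for illustration: the remaining descriptions, the tool-level description, and which fields are listed as required.

```typescript
// Sketch of a create_article tool definition. Descriptions marked
// "(assumed)" are placeholders, not the production wording.
const createArticleTool = {
  name: "create_article",
  description: "Save one finished article. Call once per story.", // (assumed)
  input_schema: {
    type: "object",
    // Assumption: the callout fields and image are the optional ones.
    required: ["title", "slug", "excerpt", "content", "category", "source", "sourceUrl"],
    properties: {
      title: {
        type: "string",
        description: "Article headline, max 80 chars, clear and informative, no clickbait punctuation",
      },
      slug: { type: "string", description: "Unique kebab-case URL identifier, max 60 chars" },
      excerpt: { type: "string", description: "One or two sentence summary for listings (assumed)" },
      content: { type: "string", description: "Article body as HTML paragraphs (assumed)" },
      category: { type: "string", description: "Site category for the story (assumed)" },
      source: { type: "string", description: "Name of the primary source publication (assumed)" },
      sourceUrl: { type: "string", description: "Full URL of the primary source (assumed)" },
      image: { type: "string", description: "URL of a lead image, if one is available (assumed)" },
      analysis: {
        type: "array",
        items: { type: "string" },
        description: "2-3 paragraphs of original analysis. OMIT this field entirely if you have none (assumed)",
      },
      prediction: {
        type: "string",
        description: "One-sentence prediction. OMIT this field entirely if you cannot make a sharp, specific call",
      },
    },
  },
};
```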

Trust Nothing the Agent Sends Back

The first time we shipped the agent to production, it generated a beautiful article — except the slug had a space in it. The URL /news/openai%20o3%20release rendered, but it broke our sitemap, our share links, and our analytics. We had asked the model to produce a kebab-case slug. It produced one most of the time. "Most of the time" is the worst possible reliability bar.

The fix wasn't prompt engineering. The fix was validation inside the tool handler. Every field gets a check of the same shape; here is the one for analysis:

if (analysis !== undefined) {
  if (!Array.isArray(analysis) || analysis.length < 2) {
    return `Error: when \`analysis\` is provided it must be an array of at least 2 paragraph strings. Either provide 2-3 paragraphs or omit the field entirely.`;
  }
}

Notice what this return does. It does not throw. It does not write a partial record. It returns a string, which becomes the content of the next tool_result. The model sees the error, understands what went wrong, and re-calls the tool with a corrected payload. The loop self-heals.
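
Applied to the slug problem that started this section, the same idea could be sketched as follows. validateSlug and SLUG_PATTERN are hypothetical names, and the 60-character limit is taken from the field description quoted earlier; the production check may differ.

```typescript
// Lowercase letters and digits, separated by single hyphens.
const SLUG_PATTERN = /^[a-z0-9]+(?:-[a-z0-9]+)*$/;

// Returns null when the slug is valid, otherwise an error string that is
// meant to be sent back to the model as the tool_result content.
function validateSlug(slug: unknown): string | null {
  if (typeof slug !== "string" || slug.length === 0) {
    return "Error: `slug` is required and must be a non-empty string.";
  }
  if (slug.length > 60) {
    return `Error: \`slug\` must be at most 60 characters, got ${slug.length}.`;
  }
  if (!SLUG_PATTERN.test(slug)) {
    return (
      `Error: \`slug\` must be kebab-case (lowercase letters, digits, ` +
      `single hyphens, no spaces), got "${slug}".`
    );
  }
  return null;
}
```

A handler would call this before any write and return the string unchanged when it is not null, so the model sees exactly which rule it broke.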

This pattern — validation as a returned error string the model can read — is the most important production technique we've found. It turns the agent into a self-correcting system. Three rules go with it:

Error messages must be actionable. "Invalid input" is useless. "analysis must be an array of at least 2 strings, got: 1" tells the model exactly what to fix.
Never expose internal exceptions verbatim. Catch them, summarize, return a clean string. A stack trace in the tool result eats half your context window for no benefit.
Add a soft cap on retry loops. If the model fails validation three times in a row, abort the run. Otherwise an agent stuck in a self-correction spiral will burn through your token budget at full speed.
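
The second and third rules are small enough to sketch directly. Both helpers below are hypothetical (summarizeError, createErrorStreak), and the sketch assumes every failure string starts with "Error:", as the validation message above does.

```typescript
const MAX_CONSECUTIVE_ERRORS = 3; // from the rule above: three failures in a row

// Turn any thrown value into a short, clean string for the tool_result.
// Only the first line of the message is kept, never the stack trace.
function summarizeError(err: unknown): string {
  const message = err instanceof Error ? err.message : String(err);
  const newline = message.indexOf("\n");
  const firstLine = (newline === -1 ? message : message.slice(0, newline)).slice(0, 200);
  return `Error: the tool failed internally (${firstLine}).`;
}

// Tracks consecutive error results; any success resets the streak.
function createErrorStreak(limit: number = MAX_CONSECUTIVE_ERRORS) {
  let streak = 0;
  return {
    // Call once per tool result. Returns true when the run should abort.
    record(toolResult: string): boolean {
      streak = toolResult.indexOf("Error:") === 0 ? streak + 1 : 0;
      return streak >= limit;
    },
  };
}
```

In the loop, the tool call would sit inside a try/catch that falls back to summarizeError, and the run would break as soon as record returns true.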

Cost Economics: Prompt Caching Changes Everything

Without prompt caching, our agent would cost us roughly 4-5x what it actually does. Every turn re-sends the full system prompt, the tool definitions, the conversation history. By turn 8 of a multi-tool run, you're paying input-token rates for content that hasn't changed since turn 1. The bill adds up fast.

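Prompt caching solves this. You mark the end of a static prefix (tools, system prompt, or earlier messages) with a cache_control breakpoint, and on subsequent calls within a 5-minute window the cached portion is billed at roughly 10% of normal input cost. The call that writes the cache costs somewhat more than normal input (a 25% premium at the default TTL), and every hit refreshes the window. The savings compound across turns and across runs that happen close in time.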

Three things to know about caching in practice:

The cache key is exact. If your system prompt interpolates today's date — like ours does — the cache invalidates daily, which is fine. But if you interpolate something noisier (the current timestamp, a UUID, a randomly-ordered list), you'll never hit the cache. Audit your prompts for hidden variability.
Cache the tools array, not just the system prompt. Tool definitions are often the longest static prefix in your call. Caching them is where the biggest savings live.
Five-minute TTL is a real constraint. If your agent runs every fifteen minutes, the cache will expire between runs. Either accept the miss or batch runs closer together. We added a "warm-up" call when our cron starts, which keeps the cache fresh for the actual work that follows.
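
In request terms, the first two points come down to where the cache_control markers go and what gets interpolated into the prompt. The sketch below uses the API's cache_control: { type: "ephemeral" } marker; the helper names (withCaching, todayStamp, ToolDef) are hypothetical, and it assumes the last entry in the tools array is one of your own tool definitions.

```typescript
interface ToolDef {
  name: string;
  [key: string]: unknown;
}

// Day-granularity date for the prompt: changes once a day, not per call,
// so it does not defeat the cache within a run. Uses UTC.
function todayStamp(now: Date): string {
  return now.toISOString().slice(0, 10);
}

// Marks the end of the tools array and the system prompt as cacheable.
// A breakpoint covers everything before it, so marking only the last
// tool is enough to cache all tool definitions.
function withCaching(systemPrompt: string, tools: ToolDef[]) {
  const cachedTools = tools.map((tool, index) =>
    index === tools.length - 1
      ? { ...tool, cache_control: { type: "ephemeral" } }
      : tool
  );
  const system = [
    { type: "text", text: systemPrompt, cache_control: { type: "ephemeral" } },
  ];
  return { system, tools: cachedTools };
}
```

The returned system and tools values would be spread into the messages.create call in place of the plain string and array.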

For an agent like ours, whose system prompt and tools array stay fixed for the whole run, prompt caching turns a $1.60 run into a $0.40 run. At one run a day that's roughly $584 a year versus $146 (or about $416 versus $104 on a weekday-only schedule of 260 runs): meaningful at our scale, transformative if you're running thousands of agents.
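
The arithmetic behind that ratio is easy to check for the static prefix alone. The sketch below is a simplified model, not a billing calculator: it assumes one cache write on the first turn at 1.25x normal input cost and a cache read on every later turn at 0.1x, it ignores output tokens and the growing conversation history, and the function names are hypothetical. Verify the multipliers against current pricing before relying on them.

```typescript
const CACHE_WRITE_MULTIPLIER = 1.25; // assumed premium for writing the cache
const CACHE_READ_MULTIPLIER = 0.1;   // assumed discount for reading it

// Cost of re-sending a static prefix over a run, in units of
// "one uncached send of the prefix".
function prefixCostUnits(turns: number, cached: boolean): number {
  if (turns <= 0) return 0;
  if (!cached) return turns;
  return CACHE_WRITE_MULTIPLIER + CACHE_READ_MULTIPLIER * (turns - 1);
}

// Straight multiplication, kept as a function so the schedule is explicit.
function annualCost(costPerRun: number, runsPerYear: number): number {
  return costPerRun * runsPerYear;
}
```

For an 8-turn run this gives 8 units uncached against 1.95 cached, a ratio of about 4.1 on the prefix, which is in line with the 4x figure above. The overall saving depends on how much of each request is static prefix.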

Five Things We Learned Moving to Production

None of the items below are obvious from the documentation. We learned each one the expensive way.

1. The agent will surprise you, and not in the direction you expect

Our first production run produced three perfect articles and one with a fabricated quote from a CEO. The model had read a tweet, treated it as a Bloomberg interview, and written a paragraph attributing a statement that was never made. Web search results are not ground truth. They're context, and the model can still hallucinate over them.

We added a constraint to the prompt: "Every direct quote MUST include a working source URL that contains the exact quoted text. If you cannot verify, paraphrase instead." Fabricated quotes dropped to zero. The lesson generalizes: assume the model will find the one shape of failure your prompt doesn't forbid.

2. Server-side tools have their own rate limits

The web_search tool can pause turns when it hits an iteration limit. The first time this happened to us, our agent silently produced articles with no sources because the search budget was exhausted three queries in. We now log every pause_turn and alert if pauseCount exceeds 2 — anything higher usually means the model is searching unproductively and should be cut off.

3. Observability is the whole game

For the first month we logged just tool_name and title. When articles came out weird, we had no idea why. Now we log the full response.content array for every turn, the token usage, the cache hit rate, and the full input that produced each tool call. Storage is cheap. Debugging blind is not.
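
A per-turn log record along those lines might be built like this. The usage field names follow the API response (input_tokens, output_tokens, cache_read_input_tokens, cache_creation_input_tokens). The hit-rate formula, the record shape, and the function names are assumptions for illustration.

```typescript
interface Usage {
  input_tokens: number;
  output_tokens: number;
  cache_read_input_tokens?: number | null;
  cache_creation_input_tokens?: number | null;
}

interface TurnResponse {
  stop_reason: string | null;
  content: unknown[];
  usage: Usage;
}

// Share of input tokens served from cache on this turn.
// input_tokens counts only the uncached part, so all three are summed.
function cacheHitRate(usage: Usage): number {
  const read = usage.cache_read_input_tokens ?? 0;
  const written = usage.cache_creation_input_tokens ?? 0;
  const totalInput = usage.input_tokens + read + written;
  return totalInput === 0 ? 0 : read / totalInput;
}

// One JSON line per turn: full content (which includes tool inputs),
// token usage, and the derived cache hit rate.
function buildTurnLog(turn: number, response: TurnResponse): string {
  return JSON.stringify({
    turn,
    stopReason: response.stop_reason,
    usage: response.usage,
    cacheHitRate: cacheHitRate(response.usage),
    content: response.content,
  });
}
```

Writing one line per turn keeps the log greppable by turn number and stop reason when an article comes out wrong.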

4. Idempotency at the tool layer, not the agent layer

The agent will occasionally try to publish the same article twice, usually after a pause_turn where it loses track of what's already been saved. The clean fix isn't in the agent; it's an articleExists(slug) check inside create_article that returns "Skipped: already exists" when it hits a duplicate. Cheap, robust, and the model handles the response sensibly.
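
In outline, the check looks like the sketch below. An in-memory Set stands in for the database here, which is an assumption made to keep the example self-contained. A real handler would be async and should also rely on a unique constraint on the slug column, since a check followed by a write can race.

```typescript
// Stand-in for the articles table (assumption: the real store is a database).
const savedSlugs = new Set<string>();

function articleExists(slug: string): boolean {
  return savedSlugs.has(slug);
}

// Tool handler core: a duplicate is reported as a normal result string,
// not an error, so the model simply moves on.
function createArticle(input: { slug: string; title: string }): string {
  if (articleExists(input.slug)) {
    return "Skipped: already exists";
  }
  savedSlugs.add(input.slug);
  return `Saved: ${input.slug}`;
}
```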

5. Model upgrades are not free

When Claude Opus 4.7 replaced 4.6, we upgraded immediately and the agent's behavior shifted in three ways we hadn't anticipated. Headlines got more academic. Predictions got more hedged. Word counts crept up. None of these were strictly worse, but they didn't match our voice. We had to revisit the prompt to re-anchor on examples. Treat a model upgrade like a UI redesign: smoke-test before you ship.
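
Two of those three shifts (hedging and word count) are measurable, so a crude smoke test can compare a batch of articles from the old model against a batch from the new one. This is a sketch under assumptions: the hedge phrases are the two named in the prompt excerpt earlier, the function names are hypothetical, and what counts as acceptable drift is left to you.

```typescript
// Phrases the prompt tells the model to avoid.
const HEDGE_PHRASES = ["time will tell", "may eventually"];

interface StyleMetrics {
  words: number;  // whitespace-separated word count
  hedges: number; // occurrences of any hedge phrase, case-insensitive
}

function styleMetrics(text: string): StyleMetrics {
  const lower = text.toLowerCase();
  const words = text.trim().split(/\s+/).filter((w) => w.length > 0).length;
  const hedges = HEDGE_PHRASES.reduce(
    (count, phrase) => count + (lower.split(phrase).length - 1),
    0
  );
  return { words, hedges };
}

// Per-article averages for a batch, for comparing old and new model output.
function averageMetrics(texts: string[]): StyleMetrics {
  if (texts.length === 0) return { words: 0, hedges: 0 };
  const totals = texts.map(styleMetrics).reduce(
    (acc, m) => ({ words: acc.words + m.words, hedges: acc.hedges + m.hedges }),
    { words: 0, hedges: 0 }
  );
  return { words: totals.words / texts.length, hedges: totals.hedges / texts.length };
}
```

Running averageMetrics over both batches and comparing the two results gives an early signal before anything is published. Tone shifts such as more academic headlines still need a human read.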

The Minimal End-to-End Agent (≈55 lines)

If you want to start from zero, this is the smallest agent that does something real: it searches the web, writes an article, and saves it. Drop it in an ES module file such as agent.mts (the loop uses top-level await), set ANTHROPIC_API_KEY in your environment, and run it with tsx.

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });

const SYSTEM = `You are a news journalist. Search for the latest AI news from
the past 24 hours, pick one significant story, and save it with the
save_article tool. Be concise — 3 paragraphs, neutral voice.`;

const tools = [
  { type: "web_search_20260209", name: "web_search" },
  {
    name: "save_article",
    description: "Save the finished article. Call once after writing.",
    input_schema: {
      type: "object",
      required: ["title", "body"],
      properties: {
        title: { type: "string", description: "Headline, max 80 chars" },
        body:  { type: "string", description: "3 paragraphs of plain text" },
      },
    },
  },
];

async function executeTool(name, input) {
  if (name === "save_article") {
    console.log("---\n" + input.title + "\n---\n" + input.body);
    return "Saved.";
  }
  return "Unknown tool";
}

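const messages = [{ role: "user", content: "Find one important AI story from today and save it." }];
let pauses = 0; // counts pause_turn resumes so the loop cannot spin forever

while (true) {
  const r = await client.messages.create({
    model: "claude-opus-4-7", max_tokens: 4000,
    system: SYSTEM, tools, messages,
  });
  // Keep the full content array (text, tool_use, thinking) in the history.
  messages.push({ role: "assistant", content: r.content });

  if (r.stop_reason === "end_turn") break;
  if (r.stop_reason === "pause_turn") {
    if (++pauses > 5) break; // same safety cap as the production loop
    continue;
  }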
  if (r.stop_reason === "tool_use") {
    const results = [];
    for (const b of r.content) {
      if (b.type !== "tool_use") continue;
      const out = await executeTool(b.name, b.input);
      results.push({ type: "tool_result", tool_use_id: b.id, content: out });
    }
    messages.push({ role: "user", content: results });
    continue;
  }
  break;
}

That's a working agent. Add prompt caching, validation, a real database write, and proper logging, and you have the BitsMinds news agent. Add a system prompt that's an editorial brief instead of a config file, and you have one that produces work worth reading.