This is not a guide for people who want to learn what AI is.
This is for people building systems with it — engineers who are past "it kind of works" and want to understand why it breaks in production, what the real tradeoffs are, and how to think about the full stack.
These are the fundamentals that separate shipping AI products from shipping prompts.
1. Harness engineering, not just prompt engineering
Prompt engineering is what you write inside the model call. Harness engineering is everything around it.
The harness is the orchestration layer that decides what to send to the model, when, in what order, and what to do with what comes back. It handles retries, fallbacks, output parsing, tool dispatch, memory injection, and routing. A good prompt in a bad harness produces an unreliable product. A mediocre prompt in a well-engineered harness often outperforms a great prompt in a bad one.
Production AI failures are almost always harness failures: the retry didn't have exponential backoff, the fallback model received malformed input, the tool result was injected at the wrong position in context. The model was fine. The plumbing wasn't.
The shift to think about: a prompt is a parameter. The harness is the product.
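To make the retry and fallback plumbing concrete, here is a minimal sketch of exponential backoff with jitter plus an ordered fallback chain. `ModelCallError`, `call_with_backoff`, and `call_with_fallback` are hypothetical names for illustration, and the model clients are assumed to be plain callables that take a prompt and return text.

```python
import random
import time
from typing import Callable, Sequence


class ModelCallError(Exception):
    """Hypothetical error a model client raises on a transient failure."""


def call_with_backoff(
    call: Callable[[str], str],
    prompt: str,
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Retry `call`, doubling the delay after each failure and adding jitter."""
    for attempt in range(max_attempts):
        try:
            return call(prompt)
        except ModelCallError:
            if attempt == max_attempts - 1:
                raise  # retries exhausted, let the caller decide what to do
            # Delay grows as base, 2*base, 4*base, ... plus random jitter.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
    raise ValueError("max_attempts must be at least 1")


def call_with_fallback(models: Sequence[Callable[[str], str]], prompt: str, **kwargs) -> str:
    """Try each model in order, moving on only after its retries are exhausted."""
    last_error = None
    for model in models:
        try:
            return call_with_backoff(model, prompt, **kwargs)
        except ModelCallError as exc:
            last_error = exc
    raise RuntimeError("all models failed") from last_error
```

The `sleep` parameter is injectable so the behavior can be tested without waiting. A real harness would also distinguish retryable errors (timeouts, rate limits) from non-retryable ones (invalid requests).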
2. Context engineering
Context engineering is the discipline of deciding what goes into the model's context window, where, and in what form. It is the single biggest lever an AI engineer has after model selection.
The naive version: stuff everything in. The production version: surgically include only what the model needs to complete the task at hand. This matters because:
- Context is a finite resource. Every token you include costs money and increases latency.
- Position matters. Many models attend more reliably to the beginning and end of the context window than to the middle. Critical instructions buried in the middle are more likely to be ignored.
- Noise degrades performance. Adding irrelevant documents to a RAG context doesn't help neutrally — it actively makes answers worse.
Good context engineering means choosing what to include, what to compress, what to exclude entirely, and where to put each piece. It also means writing instructions at the right level of abstraction — specific enough to constrain behavior, general enough not to break under variation.
The most common engineering mistake: treating the context window like a dump truck instead of a precision instrument.
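One way to picture the "precision instrument" approach is a context builder that works within a token budget. This is a sketch under assumptions: `Snippet`, `rough_token_count`, and `build_context` are hypothetical names, the four-characters-per-token estimate is a crude stand-in for a real tokenizer, and relevance scores are assumed to come from your retrieval layer.

```python
from dataclasses import dataclass


@dataclass
class Snippet:
    text: str
    relevance: float  # higher means more relevant; the scoring method is up to you


def rough_token_count(text: str) -> int:
    # Crude stand-in for a real tokenizer: roughly 4 characters per token.
    return max(1, len(text) // 4)


def build_context(instructions: str, snippets: list[Snippet], question: str, budget: int) -> str:
    """Put instructions first and the question last; fill the middle by relevance."""
    used = rough_token_count(instructions) + rough_token_count(question)
    kept = []
    for snippet in sorted(snippets, key=lambda s: s.relevance, reverse=True):
        cost = rough_token_count(snippet.text)
        if used + cost > budget:
            continue  # exclude what does not fit instead of truncating mid-document
        kept.append(snippet.text)
        used += cost
    return "\n\n".join([instructions, *kept, question])
```

The design choice here is to exclude low-relevance material entirely rather than squeeze everything in, and to keep the instructions and the question at the edges of the window where they are least likely to be overlooked.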
3. Caching — prompt vs. semantic
Two kinds of caching matter for production AI systems. They're not interchangeable.
Prompt caching is exact-match prefix caching at the KV level. If the first 2,000 tokens of your prompt are identical across requests, the model doesn't recompute them. Major providers (Anthropic, OpenAI, Google) support this natively. The implication: structure your prompts so stable content — system instructions, reference documents, persona definitions — comes first, and variable content (user input, fresh retrieval) comes last. The cost reduction on high-traffic systems can be substantial (50–80% on input tokens).
Semantic caching is query-level caching based on embedding similarity. When a new request is "close enough" to a previous one, return the cached result rather than calling the model at all. Useful for FAQ-style products, search assistants, or any system where many users ask the same question in slightly different words.
The tradeoff between them: prompt caching is transparent and safe (exact match); semantic caching is approximate and requires a freshness strategy (stale cached answers are worse than no cache). Don't use semantic caching for anything that involves user-specific, time-sensitive, or high-stakes information.
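A semantic cache with a similarity threshold and a TTL-based freshness strategy can be sketched as follows. The `embed` function is an assumption (any function mapping text to a vector), and the `SemanticCache` class, the 0.92 threshold, and the one-hour TTL are illustrative values rather than recommendations.

```python
import math
import time
from typing import Callable, Optional, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.92,
                 ttl_seconds: float = 3600.0):
        self.embed = embed              # assumed embedding function, supplied by you
        self.threshold = threshold
        self.ttl_seconds = ttl_seconds
        self.entries = []               # list of (embedding, answer, stored_at)

    def get(self, query: str, now: Optional[float] = None) -> Optional[str]:
        now = time.time() if now is None else now
        # Freshness strategy: drop expired entries before searching.
        self.entries = [e for e in self.entries if now - e[2] <= self.ttl_seconds]
        q = self.embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best is not None and cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None                     # cache miss: call the model

    def put(self, query: str, answer: str, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        self.entries.append((self.embed(query), answer, now))
```

A linear scan is fine for a sketch; at real volume you would back this with a vector index. In a multi-tenant product, each tenant would also need its own cache instance or a tenant-scoped lookup.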
4. KV cache management at scale
When a model processes input, it computes key-value pairs for each attention head at each layer. These KV tensors are what enable the model to "attend" across tokens. On a single request, they're generated and discarded. In serving infrastructure, they can be cached and reused.
At scale, KV cache management becomes a resource allocation problem:
- Memory pressure: KV caches are large. A long context with a large model can consume gigabytes of GPU VRAM per request. When the cache fills, you evict. What you evict matters.
- Eviction policy: LRU (least recently used) is common, but prefix-aware policies that preserve shared system prompt prefixes outperform generic LRU on batched serving.
- Reuse: If 1,000 concurrent requests share the same system prompt, caching that prefix once and reusing it across all requests is a major throughput win.
For engineers: this is why max_tokens isn't just about cost. Long outputs require long KV caches. Requesting more output than you need can starve other concurrent requests of GPU memory.
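The memory pressure is easy to estimate with arithmetic. The sketch below assumes standard multi-head attention with 16-bit cache values; the shape (32 layers, 32 KV heads, head dimension 128) is an illustrative 7B-class configuration, not a claim about any specific model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Size of the KV cache for one batch of sequences."""
    # Factor of 2: one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value * batch_size


# Illustrative shape: 32 layers, 32 KV heads, head_dim 128, FP16 values.
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
per_request = kv_cache_bytes(32, 32, 128, seq_len=4096)

print(per_token / 2**20, "MiB per token")        # 0.5 MiB per token
print(per_request / 2**30, "GiB per request")    # 2.0 GiB per request
```

Because `seq_len` counts prompt plus generated tokens, every extra output token a request is allowed to produce is cache memory the scheduler may have to account for. Models that use grouped-query attention have fewer KV heads and a proportionally smaller cache.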
5. Inference mechanics — prefill vs. decode
LLM inference has two fundamentally different phases, and they optimize differently:
Prefill is the processing of the input prompt. All tokens in the prompt are processed in parallel. It's compute-bound: faster GPUs, bigger batches, and more efficient attention reduce it. The output is the KV cache.
Decode is the generation of output tokens, one at a time. Each step reads the model weights and the KV cache and generates the next token. It's memory-bandwidth-bound: the bottleneck is how fast the GPU can move weights and cached keys and values out of memory, not how fast it can compute. Throwing more compute at decode doesn't help much.
Why this matters in practice:
- Time to first token (TTFT) is dominated by prefill. Long prompts have slow TTFT.
- Tokens per second (throughput) is dominated by decode. Speculative decoding and quantization help here.
- Latency budgets are different for each. A chat application cares about TTFT. A background summarization job cares about total time.
Optimizing the wrong phase for your use case is a common mistake.
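A back-of-the-envelope latency model shows why the two phases need separate budgets. This is a deliberate simplification: it ignores queueing and network time, and the throughput figures (4,000 prefill tokens/s, 50 decode tokens/s) are made-up illustrative numbers.

```python
def estimate_latency(prompt_tokens, output_tokens, prefill_tokens_per_s, decode_tokens_per_s):
    """Rough split of request latency into prefill (TTFT) and decode time."""
    ttft = prompt_tokens / prefill_tokens_per_s
    decode_time = output_tokens / decode_tokens_per_s
    return {"ttft_s": ttft, "decode_s": decode_time, "total_s": ttft + decode_time}


# Chat with a long prompt and a short answer: TTFT is a third of the wait.
chat = estimate_latency(8000, 200, prefill_tokens_per_s=4000, decode_tokens_per_s=50)
print(chat)   # {'ttft_s': 2.0, 'decode_s': 4.0, 'total_s': 6.0}

# Background summarization with a long output: decode dominates.
batch = estimate_latency(2000, 1000, prefill_tokens_per_s=4000, decode_tokens_per_s=50)
print(batch)  # {'ttft_s': 0.5, 'decode_s': 20.0, 'total_s': 20.5}
```

For the chat case, trimming the prompt is the lever that improves perceived responsiveness. For the batch case, it barely matters, and output length or decode speed is what to work on.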
6. Throughput — batching, attention, and serving
Continuous batching (also called in-flight batching) is the technique that made high-throughput LLM serving practical. Without it, a server waiting for Request A to finish before starting Request B wastes most of its compute. With continuous batching, scheduling happens at every decode step: as soon as one request finishes generating, a waiting request takes its slot instead of waiting for the whole batch to drain. Modern serving frameworks (vLLM, TGI, TensorRT-LLM) all use this.
Paged attention solves the memory fragmentation problem for KV caches. Traditional serving allocated a contiguous memory block for each request's KV cache upfront (based on max sequence length). Most of that memory was wasted. Paged attention stores KV caches in non-contiguous pages, allocated on demand — similar to how an operating system manages virtual memory. The practical result: higher GPU utilization, more concurrent requests, better throughput per dollar.
For engineers sizing infrastructure: throughput is measured in tokens per second per dollar, not just model speed. A slower model on well-optimized serving infrastructure often beats a faster model on naive deployment.
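The memory saving from paging can be illustrated by counting reserved token slots under each scheme. This is a toy accounting model, not a serving implementation; the block size of 16 tokens and the request lengths are assumptions chosen for illustration.

```python
import math


def contiguous_slots(actual_lengths, max_seq_len):
    # Up-front reservation: every request gets max_seq_len token slots.
    return len(actual_lengths) * max_seq_len


def paged_slots(actual_lengths, block_size=16):
    # On-demand pages: each request holds only the blocks it has touched.
    return sum(math.ceil(n / block_size) * block_size for n in actual_lengths)


lengths = [120, 900, 35, 2048]          # tokens actually used by four requests
used = sum(lengths)                     # 3103
reserved = contiguous_slots(lengths, max_seq_len=4096)
paged = paged_slots(lengths)

print(f"contiguous: {reserved} slots, {used / reserved:.0%} utilized")
print(f"paged:      {paged} slots, {used / paged:.0%} utilized")
```

With paging, waste is bounded by at most one partially filled block per request, which is where the extra room for concurrent requests comes from.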
7. Model compression — quantization, speculative decoding, distillation
Three techniques for running models cheaper and faster without retraining from scratch:
Quantization reduces the precision of model weights: FP32 (32-bit floating point) → FP16 or BF16 (both 16-bit) → INT8 → INT4. Each halving of bit width cuts weight memory roughly in half. Common formats:
- INT8: Near-lossless for most tasks. Safe for production.
- FP8: A newer format with native hardware support on NVIDIA H100-class GPUs. Good quality/speed balance.
- INT4 with AWQ or GPTQ: Aggressive compression. Quality loss is task-dependent. Works well for simple tasks, degrades on reasoning, multi-step math, and code. Always eval before deploying.
The trap: quantization hurts quality unevenly. The model might score fine on aggregate benchmarks and fail silently on your specific task. Always eval on your actual distribution.
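To see where the precision loss comes from, here is a minimal sketch of symmetric (absmax) INT8 quantization of a weight vector in plain Python. It is the simplest per-tensor scheme, shown only to illustrate the rounding error; AWQ and GPTQ are considerably more sophisticated and are not what this code implements.

```python
def quantize_int8(weights):
    """Symmetric absmax quantization: map floats to integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # 1.0 avoids dividing by zero
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    return [q * scale for q in quantized]


weights = [0.02, -1.27, 0.5, 0.0, 0.0031]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Each restored weight is off by at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

A single large outlier stretches `scale` and costs every other weight resolution, which is one reason quality loss is uneven across tasks and why per-channel or group-wise schemes exist.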
Speculative decoding uses a small draft model to generate several candidate tokens, then verifies them in parallel with the large target model. If the draft was right (often), you get multiple tokens for the compute cost of one large-model forward pass. Effective when draft hit rate is high (it varies with task) and the large model is the bottleneck.
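The propose-then-verify loop can be sketched with toy next-token functions. This shows the greedy variant only (accept draft tokens while they match the target's greedy choice); sampling-based implementations use a probabilistic acceptance rule instead. `speculative_step` and the `NextToken` type are hypothetical names, and the per-position loop stands in for what a real system does in one batched forward pass.

```python
from typing import Callable, List

NextToken = Callable[[List[str]], str]  # greedy next-token function over a token list


def speculative_step(prefix: List[str], draft: NextToken, target: NextToken, k: int = 4) -> List[str]:
    """Return the tokens produced by one draft-and-verify round."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed: List[str] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target checks every proposed position. A real system scores all
    #    k positions in a single batched forward pass; this loop simulates it.
    accepted: List[str] = []
    for token in proposed:
        expected = target(prefix + accepted)
        if token != expected:
            accepted.append(expected)  # first mismatch: take the target's token, stop
            return accepted
        accepted.append(token)

    # 3. Everything matched: the same target pass also yields one bonus token.
    accepted.append(target(prefix + accepted))
    return accepted
```

The output is always identical to what the target model would have generated greedily on its own. The draft only changes how many tokens each target pass yields, which is why the acceptance rate determines the speedup.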
Distillation transfers knowledge from a large model to a smaller one during training. The small model is trained to match the large model's output distribution, not just ground-truth labels. Results in a model that punches above its parameter count on the tasks the large model was good at. Requires training infrastructure; not an inference-time optimization.
8. Structured outputs and function calling
Getting reliable structured output from LLMs is harder than it looks.
Structured outputs (JSON, XML, specific schemas) are now natively supported by most major APIs via constrained decoding — the sampler is constrained so the model can only emit tokens that keep the output valid. This eliminates most malformed JSON, but doesn't eliminate semantic errors (valid JSON with wrong values).
Function calling (tool use) is a model capability, not just an API feature. The model decides whether to call a function and what arguments to pass. Production problems:
- Argument hallucination: The model invents plausible-looking but invalid argument values.
- Wrong function selection: The model calls a function when it shouldn't, or picks the wrong one.
- Non-idempotent calls: If a tool has side effects and you retry on timeout, you may duplicate the action.
Engineering practices that help:
- Tool contracts: Write tool descriptions that explain what the function does, what valid inputs look like, and what errors to expect. The model's function-calling quality is highly sensitive to description quality.
- Argument validation: Validate and sanitize tool arguments before executing, regardless of whether they came from a model.
- Idempotency keys: For side-effectful tools (send email, charge card, update record), make them idempotent so retries are safe.
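Argument validation and idempotency keys can live together in the tool execution layer. The sketch below assumes a hypothetical `send_email` tool and derives the key from the agent run ID plus the arguments; `ToolExecutor` and the in-memory result store are illustrative only.

```python
import hashlib
import json


class ToolExecutor:
    def __init__(self, send_email):
        self._send_email = send_email   # hypothetical side-effectful function
        self._results = {}              # idempotency key -> stored result

    @staticmethod
    def _validate(args):
        # Validate regardless of whether the arguments came from a model.
        if not isinstance(args.get("to"), str) or "@" not in args["to"]:
            raise ValueError("'to' must be an email address")
        if not isinstance(args.get("body"), str) or not args["body"].strip():
            raise ValueError("'body' must be a non-empty string")

    def send_email(self, run_id, args):
        self._validate(args)
        # Same run and same arguments give the same key, so a retry replays
        # the stored result instead of sending a second email.
        payload = json.dumps({"run": run_id, "args": args}, sort_keys=True)
        key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        if key not in self._results:
            self._results[key] = self._send_email(args["to"], args["body"])
        return self._results[key]
```

An in-memory dictionary only protects against retries within one process. In production the key would go into a durable store, and ideally be passed to the downstream service so that a timeout after the side effect already happened is also safe to retry.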
Repair loops: When a model returns invalid output, a repair loop re-prompts with the error and asks the model to fix it. Works for simple schema violations. Has limited effectiveness for semantic errors and can loop indefinitely if the model is confused.
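A repair loop needs a hard cap to avoid the infinite-loop case. This sketch assumes the model is a plain callable from prompt to text; the `REQUIRED` schema, `validate`, and `generate_with_repair` are hypothetical names for illustration.

```python
import json
from typing import Callable

REQUIRED = {"title": str, "priority": int}  # hypothetical schema for illustration


def validate(raw: str) -> dict:
    data = json.loads(raw)  # raises json.JSONDecodeError, a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("top-level value must be an object")
    for key, expected_type in REQUIRED.items():
        if not isinstance(data.get(key), expected_type):
            raise ValueError(f"field {key!r} must be {expected_type.__name__}")
    return data


def generate_with_repair(model: Callable[[str], str], prompt: str, max_repairs: int = 2) -> dict:
    message = prompt
    for _ in range(max_repairs + 1):
        raw = model(message)
        try:
            return validate(raw)
        except ValueError as err:
            # Feed the concrete error back to the model. The loop bound is
            # what prevents a confused model from looping forever.
            message = (
                f"{prompt}\n\nYour previous output was invalid: {err}\n"
                f"Previous output: {raw}\nReturn corrected JSON only."
            )
    raise RuntimeError("no valid output within the repair budget")
```

This catches structural problems only. A ticket with `"priority": 2` that should have been `5` passes validation, which is the semantic-error limit described above.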
9. Agent systems — guardrails, budgets, routing
Agents — LLMs that can call tools, make decisions, and run in loops — introduce failure modes that don't exist in single-shot inference.
Loop budgets: An agent without a maximum step count will run until it times out or exhausts your budget. Set hard limits: max tool calls, max LLM calls, max wall-clock time. Build explicit termination conditions into the agent logic, not just resource limits.
Tool budgets: Some tools are expensive (database writes, API calls with rate limits, payments). Track tool call counts per agent run. Enforce per-tool limits. The model can and will call a tool 50 times if you let it.
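Loop and tool budgets are straightforward to enforce outside the model. Here is a minimal sketch; `RunBudget` and `BudgetExceeded` are hypothetical names, and the agent loop is assumed to call `charge_step` before each LLM call and `charge_tool` before each tool execution.

```python
import time


class BudgetExceeded(Exception):
    pass


class RunBudget:
    def __init__(self, max_steps=20, max_seconds=60.0, per_tool_limits=None,
                 clock=time.monotonic):
        self.max_steps = max_steps
        self.max_seconds = max_seconds
        self.per_tool_limits = per_tool_limits or {}
        self.clock = clock
        self.started = clock()
        self.steps = 0
        self.tool_calls = {}

    def charge_step(self):
        """Call once per agent iteration, before the LLM call."""
        self.steps += 1
        if self.steps > self.max_steps:
            raise BudgetExceeded(f"step limit {self.max_steps} reached")
        if self.clock() - self.started > self.max_seconds:
            raise BudgetExceeded(f"time limit {self.max_seconds}s reached")

    def charge_tool(self, name):
        """Call before executing a tool; refuses once the per-tool limit is hit."""
        count = self.tool_calls.get(name, 0) + 1
        limit = self.per_tool_limits.get(name)
        if limit is not None and count > limit:
            raise BudgetExceeded(f"tool {name!r} limit {limit} reached")
        self.tool_calls[name] = count
```

The budget raises instead of asking the model to stop, so termination does not depend on model behavior. The caller decides whether to return a partial result or escalate.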
Agent guardrails are constraints on what an agent can do. They operate at two levels:
- Input guardrails: Screen what the agent receives. Reject requests outside scope before the LLM even sees them.
- Output guardrails: Screen what the agent produces or does. When confidence is low, don't execute a tool call that takes an irreversible action without a human confirmation step.
Model routing is the practice of directing requests to different models based on task complexity, cost constraints, or latency requirements. A common pattern: fast, cheap model for classification/routing, powerful model for complex reasoning, specialized model for code. The router itself can be a small model or a rules-based classifier.
Graceful degradation: Define what the product does when the model is down, slow, or over capacity. A product that returns "AI unavailable, please try again" is better than one that hangs for 30 seconds and returns a 500. Design the degraded mode explicitly.
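Routing and an explicit degraded mode can be combined in one small dispatch function. This is a sketch with a rules-based router; the tier names, the keyword rules, and the `UNAVAILABLE` message are assumptions, and each model handler is assumed to enforce its own timeout and raise on failure.

```python
UNAVAILABLE = "AI is unavailable right now. Please try again."


def route(request: str) -> str:
    """Rules-based router: pick a model tier from cheap surface features."""
    text = request.lower()
    if "traceback" in text or "def " in text:
        return "code"
    if len(text.split()) > 200 or "step by step" in text:
        return "reasoning"
    return "fast"


def answer(request: str, models: dict, fallback_tier: str = "fast") -> dict:
    tier = route(request)
    # Try the routed tier first, then the fallback tier (without repeating it).
    for name in dict.fromkeys((tier, fallback_tier)):
        handler = models.get(name)
        if handler is None:
            continue
        try:
            return {"tier": name, "degraded": name != tier, "text": handler(request)}
        except Exception:
            continue  # broad catch is deliberate here: any failure means degrade
    # Explicit degraded mode: a fast, honest message instead of a hang or a 500.
    return {"tier": None, "degraded": True, "text": UNAVAILABLE}
```

Returning the `degraded` flag lets the product surface a reduced-quality notice and lets monitoring count how often the fallback path is taken.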
10. RAG architecture and retrieval evals
RAG (retrieval-augmented generation) is the pattern of injecting retrieved documents into context at inference time so the model can answer from current or private data rather than training data alone.
Core architecture choices:
- Chunking strategy: How you split documents matters enormously. Chunks that are too small lose context; too large dilute relevance. Recursive character splitting, semantic chunking, and document-structure-aware chunking each have different tradeoffs.
- Retrieval method: Dense retrieval (embedding-based semantic similarity) vs. sparse retrieval (BM25/keyword). Hybrid search, which combines both, often outperforms either alone on real-world corpora.
- Reranking: A cross-encoder reranker applied to the top-K retrieved candidates before injecting into context dramatically improves precision. Worth the latency for anything accuracy-sensitive.
- Freshness: Embeddings are computed at index time. Stale embeddings are a silent failure mode. Define an indexing schedule and a maximum document age.
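One common way to combine dense and sparse results is reciprocal rank fusion (RRF). The text above does not name a fusion method, so treat this as one reasonable option and not the only one. The sketch takes ranked lists of document IDs and uses the conventional constant `k=60`.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs; each hit scores 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense = ["d1", "d2", "d3"]    # from embedding similarity
sparse = ["d3", "d1", "d4"]   # from BM25 / keyword search

print(reciprocal_rank_fusion([dense, sparse]))  # ['d1', 'd3', 'd2', 'd4']
```

RRF uses only ranks, so it sidesteps the problem that dense similarity scores and BM25 scores live on different scales. The fused top-K is what you would then pass to a reranker.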
Retrieval evals:
- Recall: Are the relevant documents actually being retrieved?
- Precision: Of what's retrieved, how much is relevant?
- Grounding: Is the model's answer actually supported by the retrieved documents?
- Attribution: Can you trace each claim in the output to a specific retrieved passage?
- Citation quality: Are cited passages accurate, complete, and not taken out of context?
Retrieval evals and generation evals are different. A system with excellent retrieval and a poor generation prompt can still hallucinate with the right documents in context. A system with great generation and poor retrieval will answer fluently from the wrong evidence, or fabricate when the evidence is missing. Eval both layers independently.
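The two retrieval metrics at the top of the list are simple to compute once you have labeled relevant documents per query. A minimal sketch, assuming document IDs are unique within each retrieved list:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top k."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved, relevant, k):
    """Fraction of the top k that is relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return len(set(top) & set(relevant)) / len(top)


retrieved = ["d1", "d7", "d3", "d9", "d2"]   # ranked output of the retriever
relevant = {"d1", "d2", "d4"}                # labeled ground truth for this query

print(recall_at_k(retrieved, relevant, 5))     # 2 of 3 relevant docs found
print(precision_at_k(retrieved, relevant, 5))  # 2 of 5 retrieved docs relevant: 0.4
```

Running these on the retriever output alone, before any generation, is what lets you tell a retrieval miss from a generation failure.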
11. Evals — golden sets, regression, LLM-as-judge
Evals are how you know whether the system is working. Most teams skip them until something breaks in production.
Golden sets are curated examples with known correct outputs. They're expensive to build and the most reliable eval signal you have. Cover the core use cases, edge cases, and known failure modes. Never train on them.
Regression tests are a subset of golden sets that specifically test behaviors that have broken before. Every bug that reaches production gets a regression test before the fix is merged.
Adversarial tests probe the system's limits: injection attempts, off-topic requests, ambiguous inputs, maximum-length inputs, multilingual inputs, outputs that might cause downstream failures. These are the tests most systems don't have until they need them.
LLM-as-judge uses a model to evaluate another model's output. It scales well and correlates reasonably with human judgment on many tasks. But it has known failure modes: it's biased toward outputs that look like its own style (self-preference), it tends to prefer longer and more confident-sounding answers (verbosity bias), and it can be manipulated by adversarial content in the output it is judging. Use it with calibrated human oversight, not as a replacement for it.
Human evals are the ground truth but don't scale. Use them to calibrate your automated evals, not as the primary signal for monitoring.
A mature eval pipeline has: per-PR regression tests, daily/weekly automated eval runs against the golden set, LLM-as-judge for high-volume tasks, and periodic human eval to recalibrate the automated signals.
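The per-PR piece of that pipeline can be as small as a function that runs the golden set and fails on any regression case. In this sketch, `GoldenCase` and `run_golden_set` are hypothetical names, the system under test is assumed to be a callable from prompt to output, and each case carries its own check function.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GoldenCase:
    case_id: str
    prompt: str
    check: Callable[[str], bool]   # returns True when the output is acceptable
    is_regression: bool = False    # True for cases added after a production bug


def run_golden_set(system: Callable[[str], str], cases, min_pass_rate: float = 0.9) -> dict:
    if not cases:
        raise ValueError("golden set is empty")
    failures = [c for c in cases if not c.check(system(c.prompt))]
    pass_rate = 1 - len(failures) / len(cases)
    regressions = [c.case_id for c in failures if c.is_regression]
    return {
        # Any failed regression case blocks the change, whatever the pass rate.
        "ok": pass_rate >= min_pass_rate and not regressions,
        "pass_rate": pass_rate,
        "failed": [c.case_id for c in failures],
        "regressions": regressions,
    }
```

Treating regression cases as hard failures while allowing a threshold on the rest reflects the distinction above: a previously fixed bug coming back is never acceptable noise.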
12. Observability and cost attribution
LLM systems in production are opaque by default. You need to build observability in, not bolt it on.
What to trace:
- Input and output tokens (for cost)
- Latency — TTFT and total, separately
- Model and version — behavior drifts when providers update models
- Tool calls — name, arguments, result, duration
- Retrieval — queries, document IDs retrieved, reranking scores
- Errors — type, count, where in the chain they occurred
- Eval scores — if you run automated evals, log them per request
Spans let you trace multi-step flows. A single user request might span: routing → retrieval → reranking → prompt construction → LLM call → output parsing → tool dispatch → second LLM call → response formatting. Each step should be a labeled span with its own timing and metadata.
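To show the shape of span data, here is a minimal hand-rolled tracer built on a context manager. It is a sketch for illustration only: `Trace` is a hypothetical class, and a production system would more likely use an existing tracing library and export spans to a backend.

```python
import time
from contextlib import contextmanager


class Trace:
    def __init__(self, clock=time.perf_counter):
        self.clock = clock
        self.spans = []     # finished spans, in completion order
        self._stack = []    # names of currently open spans

    @contextmanager
    def span(self, name, **metadata):
        record = {
            "name": name,
            "parent": self._stack[-1] if self._stack else None,
            "metadata": metadata,
            "error": None,
        }
        self._stack.append(name)
        start = self.clock()
        try:
            yield record    # callers can attach more metadata while the span is open
        except Exception as exc:
            record["error"] = type(exc).__name__
            raise
        finally:
            record["duration_s"] = self.clock() - start
            self._stack.pop()
            self.spans.append(record)


trace = Trace()
with trace.span("request", tenant="acme"):
    with trace.span("retrieval", query="refund policy") as s:
        s["metadata"]["doc_ids"] = ["d1", "d3"]
    with trace.span("llm_call", model="example-model"):
        pass
```

Each record carries its own timing, parent, metadata, and error type, which is enough to answer "which step was slow" and "where in the chain did it fail" for a single request.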
Cost attribution: Track cost not just per model call, but per feature, workflow, tenant, and user journey. "We spent $4,000 on inference last month" is useless. "The contract-analysis feature for enterprise customers costs $0.18 per run, and the onboarding flow is generating 60% of our total spend because of a retrieval bug that's returning 50 documents instead of 5" is actionable.
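Attribution is mostly a matter of tagging each model call with the dimensions you care about and aggregating afterwards. In this sketch the per-token prices are invented placeholders, and the call records are assumed to come from your tracing layer.

```python
from collections import defaultdict

# Hypothetical prices in dollars per 1,000 tokens: (input, output).
# Substitute your provider's real rates.
PRICES = {"small": (0.0005, 0.0015), "large": (0.01, 0.03)}


def call_cost(model, input_tokens, output_tokens):
    in_price, out_price = PRICES[model]
    return input_tokens / 1000 * in_price + output_tokens / 1000 * out_price


def attribute(calls, dimension):
    """Total cost grouped by one tag (e.g. 'feature' or 'tenant'), largest first."""
    totals = defaultdict(float)
    for c in calls:
        totals[c[dimension]] += call_cost(c["model"], c["input_tokens"], c["output_tokens"])
    return dict(sorted(totals.items(), key=lambda item: -item[1]))


calls = [
    {"feature": "contract-analysis", "tenant": "acme", "model": "large",
     "input_tokens": 10_000, "output_tokens": 1_000},
    {"feature": "onboarding", "tenant": "acme", "model": "small",
     "input_tokens": 2_000, "output_tokens": 1_000},
    {"feature": "onboarding", "tenant": "globex", "model": "large",
     "input_tokens": 50_000, "output_tokens": 500},
]

print(attribute(calls, "feature"))
print(attribute(calls, "tenant"))
```

The third call, with 50,000 input tokens, is the kind of outlier that a per-feature breakdown makes visible and a single monthly total hides.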
Drift detection: Model behavior drifts silently. Provider updates change models without notice. User behavior changes. Monitor key metrics over time, not just point-in-time. An eval score that degrades 3% per month will be a crisis in six months.
13. Safety engineering and multi-tenant isolation
Safety in AI systems isn't a feature. It's a layer of the architecture.
Prompt injection is the class of attacks where malicious input causes the model to ignore its instructions or take unintended actions. It's easiest in systems that inject untrusted content into context: RAG pipelines (the retrieved document tells the model to ignore previous instructions), summarization tools (the document being summarized contains adversarial instructions), or multi-agent systems (a tool result contains an injection).
Defenses: input screening, sandboxed tool execution, limiting what tools can do in response to untrusted input, and treating any model output that triggers an irreversible action as untrusted until verified.
Data leakage prevention: In multi-tenant systems, one user's data must not appear in another user's response. This is both a retrieval problem (correct namespace filtering at query time) and a caching problem (cached responses from one tenant must not be served to another).
Permission boundaries: The model should not be able to trigger actions that the user hasn't authorized. If a user asks the agent to "send the report to the team," the agent should only be able to send to users in that user's team, not arbitrary email addresses. Enforce access control at the tool execution layer, not just in the prompt.
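Enforcing the boundary at the tool layer means the check runs no matter what the prompt or the model says. A minimal sketch, assuming a hypothetical team directory and a `deliver` function that does the actual sending:

```python
class PermissionDenied(Exception):
    pass


def send_report(acting_user, recipients, directory, deliver):
    """Send only to addresses the acting user's team directory allows."""
    allowed = directory.get(acting_user, set())
    blocked = [r for r in recipients if r not in allowed]
    if blocked:
        # Refuse the whole call; do not partially send on a policy violation.
        raise PermissionDenied(f"{acting_user} may not send to {blocked}")
    for recipient in recipients:
        deliver(recipient)
    return len(recipients)


# Hypothetical directory: user -> addresses on that user's team.
TEAM_DIRECTORY = {"alice": {"bob@corp.example", "carol@corp.example"}}
```

The recipient list is model-generated input, so it is checked against data the model cannot influence. An injected instruction to mail an outside address fails here even if it got past every prompt-level defense.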
Multi-tenant cache safety: Shared KV caches are a correctness risk. If a request from tenant A accidentally populates a cache entry that's returned to tenant B, you have a data isolation failure. Tenant IDs must be part of cache keys, not just prefixes.
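For application-level response caches, one way to make the tenant ID a real component of the key, and not just a string prefix, is to hash length-delimited fields. `cache_key` is a hypothetical helper; the point of the sketch is the length prefixing, which prevents two different (tenant, model, prompt) triples from ever producing the same byte stream.

```python
import hashlib


def cache_key(tenant_id: str, model: str, prompt: str) -> str:
    """Derive a cache key in which every field is unambiguously delimited."""
    digest = hashlib.sha256()
    for part in (tenant_id, model, prompt):
        data = part.encode("utf-8")
        # Length-prefix each field so that ("ab", "c") and ("a", "bc")
        # can never collapse into the same input.
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()


same_prompt = "Summarize my latest invoice"
key_a = cache_key("tenant-a", "example-model", same_prompt)
key_b = cache_key("tenant-b", "example-model", same_prompt)
```

Naive concatenation such as `tenant_id + prompt` is the failure this avoids: a tenant named `acme` with a prompt starting with `-eu` would collide with a tenant named `acme-eu`.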
Cross-user context contamination: In long-running agent sessions, context from a previous user's session must not carry over to the next. Session boundaries must be hard, not soft. Don't assume the model will forget — ensure it structurally cannot see prior context.
14. Choosing between fine-tuning, RAG, ICL, and distillation
These four techniques are often treated as interchangeable. They're not. Each answers a different question.
In-context learning (ICL) is including examples or instructions in the context window. No training required. It's the first thing to try, and it works better than most people expect. Limits: context size, cost per call, the model's ability to generalize from few examples.
RAG gives the model access to external, up-to-date, or private knowledge at inference time. The right answer when: the data changes, the data is too large for context, or you need precise citation and attribution. The wrong answer when: the task requires structured reasoning over the retrieved content and the model handles that reasoning poorly.
Fine-tuning adapts the model's weights for your task. Useful for: consistent style/format, domain-specific vocabulary, tasks where you have thousands of high-quality labeled examples, reducing latency by shrinking the prompt. Not useful for: injecting factual knowledge (models hallucinate fine-tuned facts as readily as pretrained ones), anything the base model genuinely can't do (fine-tuning won't fix it).
Distillation produces a smaller, faster model that mimics the behavior of a larger one. The right choice when: you have a working large-model system and want to reduce cost/latency, in exchange for training infrastructure and a one-time dataset generation cost. Not a first step: distill from a working system, not an aspirational one.
Common mistakes:
- Fine-tuning to fix a prompt engineering problem (fix the prompt)
- Using RAG to teach the model new reasoning skills (train or use ICL)
- Distilling a system that hasn't been eval'd (you're distilling the bugs too)
15. Production failure modes
These are the failures that actually happen, regularly, in production AI systems.
Hallucinated tool calls: The model calls a function that doesn't exist or calls the right function with arguments that look plausible but are semantically wrong. Common when the tool schema is ambiguous or the model is uncertain.
Malformed JSON: Output can be truncated at the token limit, and even when it parses, JSON mode without a strict schema can return wrong key names, wrong nesting, or missing required fields. Always validate against the schema, not just parse.
Stale retrieval: The RAG pipeline returns documents that were accurate six months ago and are wrong today. A user trusts the answer because it sounds confident. Freshness monitoring and document expiry are not optional.
Runaway agents: An agent enters a loop, repeatedly calling tools, retrying failed steps, or making incremental progress that never terminates. You need hard step limits and circuit breakers, not just trust that the model will stop.
Silent eval regressions: The model provider pushed a silent model update. Your output quality degraded 8%. You don't know because you don't run evals continuously. By the time you notice, it's been two weeks.
Context contamination: A previous turn's content bleeds into the current response. Conversation history wasn't properly reset. A multi-tenant request got the wrong context. The model sounds like it's answering a different user's question.
Sycophantic drift: The model agrees with incorrect user assertions, reverses correct answers when pushed, and stops being useful. This is a model behavior, not a bug you can fix with a patch — but you can detect it in evals and mitigate it with system prompt design.
Over-confident refusals: The model refuses a legitimate request because it pattern-matches to something it shouldn't do. Hard to catch in evals if your golden set doesn't include borderline-legitimate cases.
The way to handle production failures isn't to prevent all of them — it's to detect them fast, degrade gracefully, and have a playbook for each class of failure before you hit it.