Deep Tech & Research

Foundational papers, live benchmarks, architecture internals, agent reasoning patterns, and safety research — for engineers and researchers who want to go beyond the surface.

Seminal Papers — Must Read

The papers that shaped how modern LLMs and agents work. Not optional reading — these are the foundations everything else builds on.

Foundational Architecture
2017

Attention Is All You Need

Vaswani et al. · Google Brain · NeurIPS 2017

The paper that started everything. Introduces the Transformer architecture — multi-head self-attention, positional encodings, and the encoder-decoder stack. Every LLM in existence today descends from this architecture. 100,000+ citations.

Essential Seminal Transformers Attention
arXiv:1706.03762 →
2020

Language Models are Few-Shot Learners (GPT-3)

Brown et al. · OpenAI · NeurIPS 2020

Introduces GPT-3 (175B parameters) and demonstrates that scale alone enables emergent few-shot and zero-shot learning. The first paper to show that you don't need task-specific fine-tuning if you scale far enough. Changed the entire research paradigm.

Essential Seminal Scaling Few-Shot GPT
arXiv:2005.14165 →
2022

Training Language Models to Follow Instructions with Human Feedback (InstructGPT / RLHF)

Ouyang et al. · OpenAI

Introduces RLHF (Reinforcement Learning from Human Feedback) — the technique that turns a base language model into an assistant. A 1.3B InstructGPT model outperforms raw GPT-3 (175B) in human preference evaluations. This paper is why ChatGPT is useful.

Essential Seminal RLHF Alignment Fine-Tuning
arXiv:2203.02155 →
2020 / 2022

Scaling Laws for Neural Language Models

Kaplan et al. · OpenAI · 2020 / Hoffmann et al. (Chinchilla) · DeepMind · 2022

Two related papers that explain exactly how model performance scales with parameters, data, and compute. The Chinchilla paper corrects original Kaplan scaling laws and shows most large models were undertrained. Every major lab uses these laws to decide how to spend their training budget.

Essential Scaling Compute
Kaplan: arXiv:2001.08361 →   Chinchilla: arXiv:2203.15556 →
Reasoning & Prompting
2022

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Wei et al. · Google Brain · NeurIPS 2022

Shows that prompting with worked examples that spell out intermediate reasoning steps dramatically improves LLM performance on math, logic, and commonsense tasks (the zero-shot "Let's think step by step" trick came shortly after, from Kojima et al.). The paper that legitimized prompt engineering as a research discipline and the basis for all CoT and reasoning-model work today.

Essential Chain of Thought Reasoning Prompting
arXiv:2201.11903 →
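As a minimal sketch of the few-shot CoT recipe the paper describes: prepend a worked exemplar whose answer shows its reasoning, so the model imitates the step-by-step style. The `build_cot_prompt` helper and the exemplar wording here are illustrative, not from the paper.

```python
# Sketch of few-shot chain-of-thought prompting (in the spirit of Wei et al. 2022).
# The exemplar reasons out loud before answering; a model completing this prompt
# tends to do the same for the new question.

COT_EXEMPLAR = """Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each.
How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. 5 + 6 = 11.
The answer is 11."""

def build_cot_prompt(question: str) -> str:
    """Prepend a worked exemplar so the model imitates step-by-step reasoning."""
    return f"{COT_EXEMPLAR}\n\nQ: {question}\nA:"

prompt = build_cot_prompt(
    "A cafe had 23 apples; it used 20 and bought 6 more. How many now?"
)
```

The only change from a plain prompt is the exemplar; that alone is what moved the benchmark numbers in the paper.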
2023

Tree of Thoughts: Deliberate Problem Solving with Large Language Models

Yao et al. · Princeton + Google DeepMind

Extends Chain-of-Thought to a tree search: the LLM generates multiple candidate reasoning paths, evaluates them, and expands the most promising ones. Significantly improves performance on tasks where linear reasoning fails (Game of 24, creative writing, mini crosswords). A precursor to the deliberate search in o1/o3-style thinking models.

Tree of Thoughts Search Reasoning
arXiv:2305.10601 →
Agents & Tool Use
2022

ReAct: Synergizing Reasoning and Acting in Language Models

Yao et al. · Princeton + Google Research

Introduces the ReAct pattern: Thought → Action → Observation loop. The agent reasons about what to do, takes an action (e.g., search Wikipedia), observes the result, then reasons again. The foundational paper for all tool-using agent systems. Still the default pattern in LangChain, LlamaIndex, and most agent frameworks.

Essential ReAct Tool Use Agents
arXiv:2210.03629 →
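The Thought → Action → Observation loop can be sketched in a few lines. Here `llm` is a hypothetical stand-in that returns structured steps, and `tools` maps action names to callables; real frameworks parse these out of free text.

```python
# Minimal ReAct loop sketch: the agent alternates reasoning and tool calls
# until it emits a "finish" action. The llm interface is an assumption.

def run_react(llm, tools: dict, task: str, max_steps: int = 5):
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        step = llm(transcript)                  # {"thought": ..., "action": ..., "input": ...}
        transcript += f"Thought: {step['thought']}\n"
        if step["action"] == "finish":          # agent decides it is done
            return step["input"], transcript
        observation = tools[step["action"]](step["input"])   # execute the tool
        transcript += f"Action: {step['action']}[{step['input']}]\nObservation: {observation}\n"
    return None, transcript

# Toy run with a scripted "LLM" and a lookup tool:
script = iter([
    {"thought": "I should look up the capital.", "action": "lookup", "input": "France"},
    {"thought": "I have the answer.", "action": "finish", "input": "Paris"},
])
answer, trace = run_react(
    lambda t: next(script), {"lookup": {"France": "Paris"}.get}, "Capital of France?"
)
```

The key design point is that the observation is fed back into the transcript, so the next "thought" is conditioned on real tool output rather than the model's guess.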
2023

Reflexion: Language Agents with Verbal Reinforcement Learning

Shinn et al. · Northeastern + MIT

Agents that learn from their own mistakes without gradient updates — the model critiques its past failures and stores the critiques as natural-language memory for the next attempt. Reflexion with GPT-4 reaches 91% pass@1 on HumanEval vs. 80% for GPT-4 alone. Highly practical for production agents.

Reflexion Self-Critique Memory Agents
arXiv:2303.11366 →
2023

Toolformer: Language Models Can Teach Themselves to Use Tools

Schick et al. · Meta AI

Shows that LLMs can self-supervise their own tool use — the model learns when and how to call APIs (calculator, search, calendar) without human-labeled data. The precursor to function calling in GPT-4 and Claude tool use.

Tool Use Self-Supervised API Calls
arXiv:2302.04761 →
2023

AgentBench: Evaluating LLMs as Agents

Liu et al. · Tsinghua University

The first systematic benchmark for evaluating LLMs as real-world agents across 8 environments: web shopping, household tasks, digital card games, lateral thinking puzzles, databases, operating systems, web browsing, and knowledge graphs. GPT-4 outperforms open-source models by a huge margin on all tasks.

Benchmark Agents Evaluation
arXiv:2308.03688 →
2024

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Jimenez et al. · Princeton

Evaluates LLMs on 2,294 real GitHub issues requiring code changes to pass existing tests. In 2023, best models solved ~4% of issues. By 2025, top systems solve over 70%. Arguably the most important benchmark for measuring real-world software engineering capability in AI agents.

Essential Coding Benchmark Active Leaderboard
arXiv:2310.06770 →   Live Leaderboard →
RAG & Memory
2020

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (RAG)

Lewis et al. · Meta AI / Facebook · NeurIPS 2020

The original RAG paper. Combines a parametric memory (the LLM) with a non-parametric memory (retrieved documents). The LLM generates conditioned on retrieved context — grounding outputs in real documents. Every RAG system you've seen is based on this architecture.

Essential Seminal RAG Retrieval Memory
arXiv:2005.11401 →
2023

Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection

Asai et al. · University of Washington · ICLR 2024

Standard RAG always retrieves, even when it's unnecessary. Self-RAG trains the model to decide when to retrieve and then evaluate whether the retrieved content is actually relevant, grounded, and useful using reflection tokens (IsRet, IsRel, IsSup, IsUse). Significantly outperforms RAG on factual tasks.

Self-RAG Reflection Grounding
arXiv:2310.11511 →
Architecture Internals

How LLMs actually work under the hood — the components, tradeoffs, and optimizations that matter for engineers.

Tokenization & Vocabulary

BPE · SentencePiece · tiktoken

LLMs operate on tokens (subword units), not characters or words. GPT-4 uses ~100K token vocabulary via byte-pair encoding. "Tokenization matters" — the same word can be 1–4 tokens depending on context. Context windows (4K–2M tokens) define how much the model can "see" at once.

  • GPT-4: ~100K vocab (cl100k)
  • Llama 3: 128K vocab
  • Claude: ~100K vocab
  • 1 token ≈ 0.75 words on average
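A toy illustration of the byte-pair-encoding idea behind these vocabularies: repeatedly merge the most frequent adjacent symbol pair into a new vocabulary entry. Real tokenizers (tiktoken, SentencePiece) operate on byte sequences over huge corpora; this sketch shows a single merge step on characters.

```python
# One BPE training step: count adjacent symbol pairs, merge the most frequent
# pair everywhere it occurs. Production tokenizers iterate this ~100K times.
from collections import Counter

def most_frequent_pair(tokens):
    return Counter(zip(tokens, tokens[1:])).most_common(1)[0][0]

def merge(tokens, pair):
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])  # merged symbol joins the vocab
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = list("low lower lowest")
pair = most_frequent_pair(tokens)   # the most common adjacent pair in the corpus
tokens = merge(tokens, pair)
```

After enough merges, frequent words become single tokens while rare words stay split into subwords, which is why the same word can cost a different number of tokens in different contexts.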

Multi-Head Self-Attention

Q · K · V · softmax(QKᵀ/√d_k)

The core computation of every transformer. Each token attends to every other token in the sequence — "queries" look for relevant "keys" and retrieve weighted "values." Multiple attention heads let the model attend to different aspects simultaneously (syntax, semantics, coreference).

  • O(n²) memory & compute with sequence length
  • Flash Attention reduces memory to O(n) via tiling
  • Grouped Query Attention (GQA) cuts KV heads for efficiency
  • Sliding Window Attention enables longer contexts
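The core computation above fits in a few lines of NumPy. This is a single head with no masking, batching, or learned projections, purely the softmax(QKᵀ/√d_k)V step:

```python
# Single-head scaled dot-product attention in NumPy: every row of the weight
# matrix says how much each token attends to every other token.
import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n, n): all-pairs similarity
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: rows sum to 1
    return weights @ V, weights                      # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8)); K = rng.normal(size=(4, 8)); V = rng.normal(size=(4, 8))
out, w = attention(Q, K, V)
```

The (n, n) `scores` matrix is exactly where the O(n²) memory and compute cost comes from; Flash Attention computes the same result without ever materializing it.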

KV Cache — Inference Speedup

K, V cached · Q recomputed per token

During autoregressive generation, without caching each new token would require re-computing the Keys and Values for the entire prefix, so a full generation does O(n²) redundant work. The KV Cache computes K and V once per token and stores them: each decode step then does one new projection plus attention over the cache, so per-token cost grows linearly with context instead of re-processing the whole prefix. This is why long sequences dominate inference memory and cost.

  • Memory: layers × 2 × seq_len × kv_heads × head_dim × bytes per element
  • Efficient KV-cache management supports orders of magnitude more concurrent users
  • PagedAttention (vLLM) virtualizes the KV cache like OS paging
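The cache mechanics can be sketched directly: append each new token's key/value once, then attend over everything stored so far. Learned projections and multi-head structure are omitted here.

```python
# KV-cache sketch: during decoding, each step stores its K and V once and
# attends over the cache, instead of re-projecting the whole prefix.
import numpy as np

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        self.keys.append(k)            # K, V computed once per token, then stored
        self.values.append(v)
        K = np.stack(self.keys)        # (seq_len_so_far, d)
        V = np.stack(self.values)
        scores = K @ q / np.sqrt(len(q))
        w = np.exp(scores - scores.max()); w /= w.sum()
        return w @ V                   # attention output for the new token only

cache = KVCache()
rng = np.random.default_rng(1)
for _ in range(10):                    # each decode step is O(seq_len), not O(seq_len**2)
    q = rng.normal(size=8)
    out = cache.step(q, rng.normal(size=8), rng.normal(size=8))
```

The memory formula above falls out of this structure: the lists grow by one (kv_heads × head_dim) entry per token, per layer, for both K and V.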

Positional Encodings

RoPE · ALiBi · Absolute · Relative

Transformers have no inherent notion of word order — positional encodings inject position information. Modern models use Rotary Position Embeddings (RoPE) which encode position as a rotation in complex space and generalize better to sequences longer than training. ALiBi adds linear distance penalties to attention scores instead of embedding positions.

  • RoPE used by: Llama, Mistral, Qwen, Gemma
  • ALiBi used by: BLOOM, MPT
  • YaRN extends RoPE to longer contexts without full retraining
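A sketch of the RoPE rotation itself, using the half-split (GPT-NeoX-style) layout; implementations differ in how they pair dimensions, but the idea is the same: each dimension pair is rotated by an angle proportional to the token's position.

```python
# RoPE sketch: rotate each (x1, x2) dimension pair by pos * freq, so relative
# offsets between tokens become relative rotations the model can compare.
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply a rotary position embedding to one token vector at position pos."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)   # one rotation frequency per pair
    angles = pos * freqs
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * np.cos(angles) - x2 * np.sin(angles),
                           x1 * np.sin(angles) + x2 * np.cos(angles)])

x = np.ones(8)
r0, r5 = rope(x, 0), rope(x, 5)   # position 0 is a no-op; rotations preserve norm
```

Because rotations preserve vector norms, RoPE changes how queries and keys align with each other without distorting their magnitudes, which is part of why it extrapolates better than absolute position embeddings.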

Feed-Forward Networks (FFN) & MoE

SwiGLU · SoLU · GeLU · Mixture of Experts

After attention, each token passes through a position-wise feed-forward network (2 linear layers + activation). FFNs store "factual knowledge" in their weights. Modern models use SwiGLU activation. Mixture of Experts (MoE) replaces the single FFN with N experts — each token routes to its top-K experts, multiplying capacity without multiplying compute. Mixtral 8x7B is openly MoE; GPT-4 is widely believed to be.

  • Mixtral 8x7B: 46.7B total params, 12.9B active per token
  • MoE enables 10x parameter scale at ~2x compute cost

Training: Pre-Training → SFT → RLHF → DPO

Next-token prediction → alignment → preference

Modern LLMs are trained in stages: (1) Pre-training on trillions of tokens — learns language. (2) Supervised Fine-Tuning (SFT) on instruction/response pairs — learns to be helpful. (3) RLHF — human raters rank outputs, trains reward model, PPO optimizes against it. (4) DPO (Direct Preference Optimization) — simpler alternative to RLHF, directly learns from preference pairs without a separate reward model.

  • RLVR (Reinforcement Learning with Verifiable Rewards) — used in DeepSeek R1, o1/o3
  • Constitutional AI (Anthropic) — model critiques itself using principles
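The DPO step in stage (4) reduces to a simple per-pair loss. This sketch takes summed log-probs of a chosen and a rejected response under the policy and a frozen reference model; the numeric inputs are illustrative.

```python
# DPO loss sketch for one preference pair: push the policy to prefer the
# chosen response over the rejected one by more than the reference model does.
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1 / (1 + math.exp(-margin)))   # -log(sigmoid(margin))

# Policy prefers the chosen response (relative to the reference): small loss.
low = dpo_loss(-10.0, -14.0, -12.0, -12.0)
# Policy prefers the rejected response: large loss.
high = dpo_loss(-14.0, -10.0, -12.0, -12.0)
```

Because the loss only needs log-probs from two models, DPO skips the separate reward model and the PPO loop entirely, which is why it's the "simpler alternative" described above.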

Context Window & Long-Context Techniques

4K → 8K → 128K → 1M tokens

Context windows have expanded dramatically: GPT-3 (2K) → GPT-4 Turbo (128K) → Claude 3.5 (200K) → Gemini 1.5 Pro (1M). Larger windows don't mean equal attention to all tokens — the "lost in the middle" phenomenon shows models attend better to tokens at the start/end. Techniques like RAG, chunk summarization, and hierarchical memory address this.

  • Gemini 1.5 Pro: up to 2M-token context
  • RULER benchmark tests real long-context retrieval vs. just fitting

Quantization & Efficient Inference

FP16 → INT8 → INT4 → GGUF

Running a 70B model in FP32 requires ~280GB VRAM (4 bytes per parameter). Quantization cuts memory by lowering precision: FP16 halves it (~140GB), INT8 halves it again (~70GB), INT4 again (~35GB). GPTQ, AWQ, and GGUF (llama.cpp) are common formats. LoRA and QLoRA enable fine-tuning on consumer hardware by training only small low-rank adapter matrices while the base weights stay frozen.

  • GGUF: run Llama 70B on a Mac with 64GB RAM
  • QLoRA: fine-tune 65B model on a single 48GB GPU
  • Speculative decoding: draft model generates tokens, large model verifies
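The simplest scheme, absmax INT8 quantization, can be shown end to end: scale weights into [-127, 127], store them as int8 (4x smaller than FP32), and dequantize at compute time. GPTQ/AWQ are considerably more sophisticated; this is the baseline idea.

```python
# Absmax INT8 quantization sketch: one shared scale per tensor, round-to-int8.
import numpy as np

def quantize_int8(w):
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)   # 1 byte/weight instead of 4
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.default_rng(3).normal(size=1000).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
err = np.abs(w - w_hat).max()   # rounding error is bounded by scale / 2
```

Per-channel or group-wise scales (as in GGUF's block formats) shrink `scale` and therefore the rounding error, which is how INT4 stays usable despite having only 16 levels.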

Deep read: Sebastian Raschka's Coding the KV Cache from Scratch and Lilian Weng's The Transformer Family v2 are the two best technical deep-dives available for free.

Agent Reasoning Patterns

The design patterns that separate well-behaved production agents from brittle demos. Each pattern trades off token cost, reliability, and task complexity.

Pattern Loop Best For Min Model Token Cost
Prompt Chaining A → B → C (linear) Sequential tasks with clear stages. Document processing pipelines. Any Low
ReAct Think → Act → Observe (loop) Multi-step tool use. Research agents, data lookup, web search. 4B+ Medium
Chain of Thought (CoT) Think step-by-step → Answer Math, logic, structured reasoning. Improves accuracy on MMLU, GSM8K. 7B+ Medium
Tree of Thoughts (ToT) Generate branches → Score → Expand best Creative, ambiguous, or open-ended problems. Planning with uncertainty. 14B+ High
Reflexion Act → Critique → Store → Retry Code generation, multi-step reasoning where feedback is available. 7B+ Medium
Parallelization Spawn N agents in parallel → Aggregate Tasks decomposable into independent subtasks. Research synthesis. Any High (parallel)
Orchestrator + Sub-agents Planner → delegate → collect results Complex workflows. One agent plans, specialist agents execute. 13B+ orchestrator High
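The first pattern in the table, prompt chaining, can be sketched as a simple fold over prompt templates. `call_llm` is a hypothetical stand-in; the scripted lambda below only demonstrates how each stage's output threads into the next prompt.

```python
# Prompt-chaining sketch (A → B → C): each stage's output becomes the {input}
# of the next stage's template.

def chain(call_llm, stages, initial_input):
    """Run prompt templates in sequence, threading the output through."""
    result = initial_input
    for template in stages:
        result = call_llm(template.format(input=result))
    return result

stages = [
    "Extract the key claims from: {input}",
    "Rank these claims by importance: {input}",
    "Summarize the top claims in one sentence: {input}",
]
# Scripted stand-in that tags each pass with the prompt's first 7 characters,
# purely to make the data flow visible:
out = chain(lambda p: f"<{p[:7]}>", stages, "raw document text")
```

Because each stage sees only the previous stage's output, chains are cheap and debuggable, but any stage can drop information the next one needed, which is the pattern's main failure mode.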

RAG Architecture Variants

Naive → Advanced → Modular → Agentic

RAG has matured well beyond naive chunking. Key patterns:

  • Naive RAG: Chunk → Embed → Retrieve → Generate
  • HyDE: Generate hypothetical answer → use it as query
  • Multi-query RAG: Rewrite query N ways, merge results
  • Recursive RAG: Agent decides when retrieval is needed
  • Graph RAG (Microsoft): Build knowledge graph, query it
  • Self-RAG: Reflection tokens decide if retrieval helped
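The Naive RAG loop in the first bullet can be sketched end to end. To stay self-contained, this uses a toy bag-of-words "embedding" and an in-memory list in place of a real embedding model and vector DB; the pipeline shape is what matters.

```python
# Naive RAG sketch: Chunk → Embed → Retrieve → Generate, with toy embeddings.
import numpy as np

def embed(text, vocab):
    v = np.array([text.lower().count(w) for w in vocab], dtype=float)
    n = np.linalg.norm(v)
    return v / n if n else v

docs = ["The KV cache stores keys and values.",
        "RoPE encodes positions as rotations.",
        "MoE routes tokens to expert networks."]
vocab = sorted({w for d in docs for w in d.lower().split()})

index = [(d, embed(d, vocab)) for d in docs]          # stand-in for a vector DB

def retrieve(query, k=1):
    qv = embed(query, vocab)
    return sorted(index, key=lambda p: -float(p[1] @ qv))[:k]

top = retrieve("how does the kv cache work?")[0][0]
prompt = f"Context:\n{top}\n\nAnswer using only the context above."
```

Every variant in the list above changes one box in this pipeline: HyDE and multi-query rewrite the query side, recursive and Self-RAG make the retrieve step conditional, Graph RAG swaps the index for a knowledge graph.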

Memory Architecture for Agents

In-context · External · Procedural · Semantic

Agents need different types of memory:

  • In-context (working memory): Current conversation window
  • External (episodic): Vector DB (Pinecone, Chroma, Weaviate)
  • Semantic memory: Facts stored as embeddings, retrieved by similarity
  • Procedural memory: Learned skills, tool schemas, few-shot examples
  • MemGPT: OS-inspired memory management with paging

Multi-Agent Communication Protocols

MCP · A2A · Shared state · Message passing

How agents communicate and coordinate:

  • MCP (Model Context Protocol): Anthropic's open standard for agent↔tool communication — like USB for AI tools
  • Google A2A: Agent-to-Agent protocol for cross-framework communication
  • Shared state (blackboard): Agents read/write to shared memory object
  • Publish/subscribe: Agents emit events, others react asynchronously
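The publish/subscribe bullet can be sketched with a minimal in-process event bus; real systems put a message broker behind the same interface and deliver asynchronously. The class and topic names here are illustrative.

```python
# Minimal publish/subscribe sketch for agent coordination: agents register
# handlers for topics, and publishing an event fans out to every subscriber.
from collections import defaultdict

class EventBus:
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, payload):
        for handler in self.subscribers[topic]:   # delivered async in practice
            handler(payload)

bus = EventBus()
log = []
bus.subscribe("task.done", lambda p: log.append(f"summarizer saw: {p}"))
bus.subscribe("task.done", lambda p: log.append(f"auditor saw: {p}"))
bus.publish("task.done", "report-42")
```

The publisher never names its consumers, which is the property that lets new agents join a running multi-agent system without changing the existing ones.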
Benchmarks Explained

What the model cards and leaderboards actually measure — and what they miss. Understand these before evaluating any LLM for production use.

Benchmark Category What It Measures Key Insight Link
MMLU
Massive Multitask Language Understanding
Knowledge 57-subject MCQ covering STEM, humanities, social sciences, law, medicine. Tests breadth of world knowledge. Near-saturated at top. GPT-4 scores ~87%. Use MMLU-Pro for harder discrimination.
MMLU-Pro
Reasoning Harder version of MMLU with 10-choice questions requiring deeper reasoning. Resists guessing strategies. Better discriminator for frontier models than original MMLU. Top models ~72–79%.
HumanEval
Coding 164 Python programming problems. Tests whether generated code passes unit tests. Pass@1 and Pass@10 metrics. GPT-4-class models now score ~85%+ Pass@1 — near saturation. Use BigCodeBench or SWE-bench for harder coding evals.
SWE-bench Verified
Coding Agent Real GitHub issue resolution — agents must write code patches that pass existing test suites. Tests end-to-end software engineering. Most important real-world coding benchmark. Jumped from 4% (2023) to 70%+ (2025). Claude 3.5 Sonnet achieved 49% solo.
GPQA Diamond
Graduate-Level Google-Proof Q&A
Reasoning The 198-question "Diamond" subset of GPQA's 448 expert-level MCQs in biology, chemistry, and physics, written by PhDs. Designed to be Google-proof — you can't look up the answer. Human experts score ~65%. GPT-4o hits ~53%, o3 ~87%. Strong signal for scientific reasoning depth.
MATH / AIME
Reasoning MATH: 12,500 competition math problems. AIME: AMC/AIME competition problems. Tests mathematical reasoning and proof construction. MATH: GPT-4 ~52% (2023) → o3 ~97% (2025). AIME measures frontier reasoning models — DeepSeek R1 scores 79.8%.
Chatbot Arena (LMSYS)
Human Pref Blind A/B human preference votes across 200+ models. 2.5M+ votes. Elo-style ranking. Most trusted real-world quality signal. Only benchmark using actual human preference in real conversations. Not gameable via benchmark overfitting. Gold standard for "which model feels better."
AgentBench
Agent 8 real-world agentic environments: web shopping, OS tasks, database manipulation, web browsing, digital card games. Open-source models score dramatically worse than GPT-4 on multi-step reasoning with tool use. Biggest gap between frontier and open-source.
HELM (Holistic Eval)
Multi-dimensional Stanford's multi-scenario evaluation across accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency. 42 scenarios, 7 metrics each. The most comprehensive academic benchmark. Useful for comparing models on safety/fairness dimensions that pure accuracy benchmarks miss.
MTEB
Massive Text Embedding Benchmark
Embeddings 56 datasets across 8 embedding tasks: retrieval, clustering, classification, reranking, STS, summarization, bitext mining, pair classification. The definitive benchmark for choosing an embedding model for RAG. text-embedding-3-large and voyage-3 dominate. Essential for RAG system design.
Humanity's Last Exam (HLE)
Reasoning 3,000 extremely hard questions from 100+ academic disciplines, crowd-sourced from experts worldwide. Designed to be near-impossible for current models. Released Jan 2025. Best models score ~8–18%. Designed to stay hard for years. The new frontier benchmark after MMLU saturation.
BigCodeBench
Coding 1,140 function-level coding tasks using real library APIs from 139 libraries. Much harder than HumanEval — tests real-world programming knowledge. GPT-4o: ~60.4%. Llama 3.1 405B: ~52.8%. Good proxy for "can this model write production code." Harder than HumanEval, less brittle than SWE-bench.

Benchmark contamination is real. Many models are trained on data that includes benchmark test sets, inflating scores. Always cross-reference multiple benchmarks, prioritize human evaluation (Chatbot Arena), and test on your own task distribution before trusting published numbers.

Live Leaderboards & Trackers

Bookmark these. They update in real-time as new models are released and evaluated.

LM Arena (formerly Chatbot Arena)

200+ models ranked by blind human preference votes. 2.5M+ votes. The hardest benchmark to game — humans pick winners without knowing which model responded. Updated daily.

200+
Models Ranked
2.5M+
Human Votes
Open Leaderboard →

BenchLM — 53 Benchmarks

Aggregates 152 models across 53 benchmarks: reasoning, coding, agents, knowledge, multimodal. Single composite score plus per-benchmark breakdowns. Updated continuously.

152
Models
53
Benchmarks
Open Leaderboard →

SWE-bench Verified

Live ranking of AI coding agents on real GitHub issue resolution. Best real-world proxy for software engineering capability. Watch scores climb month-over-month.

70%+
Best Score (2025)
4%
Best Score (2023)
Open Leaderboard →

HELM — Stanford

Holistic multi-dimensional evaluation. Measures accuracy, calibration, robustness, fairness, bias, and toxicity across 42 scenarios. Best for comparing safety dimensions.

42
Scenarios
7
Metric Dims
Open Leaderboard →

MTEB Embeddings

The definitive ranking for embedding models — pick the right one for your RAG pipeline. Covers retrieval, clustering, classification, and 5 other tasks.

56
Datasets
8
Task Types
Open Leaderboard →

BigCode / HumanEval+

Open LLM coding leaderboard. Tracks models on HumanEval, MBPP, BigCodeBench. Useful for comparing open-source vs. proprietary coding performance.

1,140
Test Tasks
139
Libraries
Open Leaderboard →

Open LLM Leaderboard (HF)

Hugging Face's leaderboard tracking open-source model performance. Essential for comparing Llama, Mistral, Qwen, Gemma, DeepSeek and other open-weights models.

2000+
Open Models
Open Leaderboard →

Scale AI SEAL

Expert-validated evaluation across safety, instruction following, and reasoning. Uses human expert graders, not automated metrics. Hard to game.

Expert
Graded
Open Leaderboard →
Expert Blogs & Writing Worth Following

The people whose writing consistently advances understanding — not just summarizing news.

L
Lilian Weng
VP of Safety @ OpenAI

Lil'Log — lilianweng.github.io

The most comprehensive technical AI blog in existence. Deep-dives on transformers, agents, RLHF, diffusion models, hallucination. Every post is a mini-textbook. Active since 2017.

Must reads: Transformer Family v2, LLM-Powered Autonomous Agents, Hallucination in LLMs, Why We Think (2025)

Visit Lil'Log →
A
Andrej Karpathy
ex-OpenAI, ex-Tesla AI · Eureka Labs founder

karpathy.bearblog.dev

Original thinking from one of the best practitioners in AI. Coined "Software 2.0," predicted many trends years early. Known for brutally honest takes. 2025 LLM Year in Review is essential.

Must reads: Software 2.0, State of GPT, LLM Year in Review 2025, The Unreasonable Effectiveness of RNNs

Visit Blog →
S
Sebastian Ruder
Research Scientist @ Google DeepMind

ruder.io

NLP research-focused writing. Covers multilingual LLMs, evaluation, transfer learning, and conference deep-dives from ACL/EMNLP/NAACL/ICML. NLP Newsletter covers the research landscape monthly.

Must reads: The State of NLP, RAG vs Fine-Tuning, NLP Newsletter archive

Visit Blog →
S
Sebastian Raschka
Staff Research Engineer @ Lightning AI

sebastianraschka.com/blog

Best practical ML engineering blog. Known for clear explanations of fine-tuning, LoRA, and LLM internals with code. "Build a Large Language Model From Scratch" (book) originated here.

Must reads: Coding the KV Cache, LoRA vs Full Fine-Tuning, Understanding Flash Attention, LLM evals primer

Visit Blog →
J
Jay Alammar
ML Research Engineer

jalammar.github.io

Best visual explainers for how LLMs and transformers work. The illustrated transformer, BERT, and GPT posts are the most-linked ML explainers on the internet. Phenomenal for building intuition.

Must reads: The Illustrated Transformer, The Illustrated BERT, The Illustrated GPT-2, Visualizing Neural Nets

Visit Blog →
S
Simon Willison
Co-creator of Django · author of the LLM CLI tool

simonwillison.net

Daily practitioner notes on LLMs, agents, and AI tools. Extremely prolific — covers every major model release, paper, and tool with hands-on testing. Best for staying current on what actually ships.

Must reads: Everything he's written about LLM tool use, his notes on Claude and GPT-4 capabilities

Visit Blog →
H
Hugging Face Blog
Research team blog

huggingface.co/blog

Research and engineering posts from the HF team. Covers new model architectures, PEFT/LoRA, RLHF implementations, dataset insights, and model releases. The most technically rigorous company blog in the ecosystem.

Must reads: RLHF explainer, LoRA guide, Intro to Agents with smolagents, Benchmark analyses

Visit Blog →
A
Anthropic Research
Safety & engineering research

anthropic.com/research

Anthropic publishes more safety and interpretability research than any other lab. Constitutional AI, model cards, sleeper agents, and mechanistic interpretability papers all originated here.

Must reads: Constitutional AI, Measuring Faithfulness of CoT, Many-shot Jailbreaking, Building Effective Agents

Visit Research →
Safety, Alignment & Interpretability

The research ensuring powerful agents don't behave in unintended ways — increasingly important as agents take real-world actions.

2022

Constitutional AI: Harmlessness from AI Feedback

Bai et al. · Anthropic

Instead of relying solely on human feedback for safety, the model critiques and revises its own outputs according to a written "constitution" of principles. Scales safety alignment without requiring human raters for every response. This is how Claude was trained.

Essential Alignment RLHF
arXiv:2212.08073 →
2024

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Hubinger et al. · Anthropic

Demonstrates that it's possible to train models with hidden backdoors that activate under specific trigger conditions — and that standard safety fine-tuning doesn't reliably remove them. A landmark paper on fundamental alignment challenges in AI systems.

Safety Deception Backdoors
arXiv:2401.05566 →
2023

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Bricken et al. · Anthropic

Mechanistic interpretability milestone — uses sparse autoencoders (dictionary learning) to decompose superposed neuron activations into interpretable features that fire for specific concepts (e.g., "DNS," "base64," "racist content"). Individual neurons are polysemantic; the learned features largely are not. The beginning of truly understanding what's inside LLMs.

Interpretability Mechanistic SAE
Read Paper →
2025

Anthropic–OpenAI Joint Alignment Evaluation

Anthropic + OpenAI (joint, 2025)

Landmark cross-lab safety evaluation — each lab tested the other's models for alignment failures in simulated high-stakes agentic settings. Findings: most models struggled with sycophancy, GPT-4o showed concerning misuse-support behavior in simulations, and GPT-5 showed substantial improvements.

New · 2025 Alignment Eval Agents
Read Findings →
2024

Deliberative Alignment: Reasoning for Safe AI (o1 System Card)

OpenAI

Describes how o1 uses chain-of-thought reasoning to think through safety policy before responding — "deliberative alignment." Achieved state-of-the-art on benchmarks for illicit-advice avoidance and jailbreak resistance. Among the first results showing that reasoning improves safety, not just capability.

Alignment Reasoning Safety
Read System Card →

Key Safety Concerns for Agentic Systems

As agents take real-world actions (sending emails, executing code, browsing the web), new failure modes emerge beyond chatbot safety:

  • Prompt injection: Malicious content in tool output hijacks agent behavior
  • Goal misspecification: Agent pursues proxy metric, not true objective (Goodhart's Law)
  • Irreversible actions: Deleting files, sending emails — no undo
  • Scope creep: Agent acquires resources/permissions beyond task requirements
  • Sycophancy: Agent tells user what they want to hear, not what's correct
  • Evaluation gaming: Agent optimizes for benchmark metrics, not real-world quality
LLMOps & Production Engineering

The gap between a working demo and a reliable production system. Infrastructure, evaluation, fine-tuning, and observability for real deployments.

Concept

Evals First: Why Evaluation Drives Everything

Before you build, define how you'll measure success. LLM evals include: exact match, ROUGE/BLEU, LLM-as-judge (GPT-4 scoring outputs), human preference, and task-specific metrics. Without evals, you can't know if a prompt change helped or hurt. Build your eval harness before your first production prompt.

Read Hamel's Evals Guide →
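A minimal harness in the spirit of the advice above: a fixed test set, a deterministic metric, and one comparable number per run. `model` is a hypothetical stand-in for a real LLM call; the dataset and normalization are illustrative.

```python
# Minimal eval-harness sketch: score a model on a fixed test set with exact
# match, so every prompt change produces a comparable accuracy number.

def exact_match(prediction: str, reference: str) -> bool:
    return prediction.strip().lower() == reference.strip().lower()

def run_eval(model, dataset):
    scores = [exact_match(model(ex["input"]), ex["expected"]) for ex in dataset]
    return sum(scores) / len(scores)

dataset = [
    {"input": "2 + 2 = ?", "expected": "4"},
    {"input": "Capital of France?", "expected": "Paris"},
]
# Scripted stand-in model; a real harness would call an LLM API here:
answers = {"2 + 2 = ?": "4", "Capital of France?": "paris "}
accuracy = run_eval(answers.get, dataset)
```

Swapping `exact_match` for ROUGE, an LLM-as-judge call, or a task-specific checker changes the metric without changing the harness, which is why the harness is worth building first.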
AWS · Production

Amazon: Advanced Fine-Tuning for Multi-Agent Orchestration at Scale

Amazon's production lessons: SFT + PPO + DPO + GRPO stacked together. Results: 33% reduction in medication errors, 80% less human review effort in engineering, accuracy from 77% to 96%. Shows what's possible when fine-tuning is done right at scale.

Read on AWS →
Commercial

LangSmith — LLM Observability & Tracing

LangChain's production platform. Traces every LLM call, tool invocation, and agent step. Debug prompts, track latency/cost, run regression tests, compare prompt versions. The essential observability tool for any LangChain-based system.

Visit LangSmith →
Commercial

Weights & Biases (W&B) — Experiment Tracking

Track fine-tuning runs, compare hyperparameters, log eval metrics, visualize attention patterns. The standard MLOps tool for any serious LLM fine-tuning project. Integrates with HuggingFace Trainer, PyTorch, JAX, and most frameworks.

Visit W&B →
Open Source · vLLM

vLLM — High-Throughput LLM Serving

PagedAttention + continuous batching enables 24x higher throughput than naive serving. The standard open-source inference engine for self-hosted models. Powers production deployments of Llama, Mistral, and other open-source models at scale. Used by Anyscale, Databricks, and others.

View on GitHub →
RAG · Production

Building Production-Ready Multi-Agent RAG Systems

Enterprise implementation guide. Covers three-agent architecture (Coordinator → Retrieval → Reasoning), failure modes, latency tradeoffs, and when to use RAG vs. fine-tuning. Addresses the orchestration layer that's typically left out of tutorials.

Read Guide →
RAG vs Fine-Tuning Decision Framework
Dimension RAG Fine-Tuning
Knowledge freshness Real-time, always current Baked in at training time
Knowledge size Unlimited (external DB) Limited by context/capacity
Response style / tone Harder to control Strong style consistency
Task specialization General behavior Deep task expertise
Citability / traceability Source attribution built-in Hard to attribute
Cost to update Update index only Retrain or re-fine-tune
Recommended when Docs change often, need citations, broad knowledge base Consistent format, specialized domain, style/tone control