Foundational papers, live benchmarks, architecture internals, agent reasoning patterns, and safety research — for engineers and researchers who want to go beyond the surface.
The papers that shaped how modern LLMs and agents work. Not optional reading — these are the foundations everything else builds on.
The paper that started everything. Introduces the Transformer architecture — multi-head self-attention, positional encodings, and the encoder-decoder stack. Every LLM in existence today descends from this architecture. 100,000+ citations.
arXiv:1706.03762 →
Introduces GPT-3 (175B parameters) and demonstrates that scale alone enables emergent few-shot and zero-shot learning. The first paper to show that you don't need task-specific fine-tuning if you scale far enough. Changed the entire research paradigm.
arXiv:2005.14165 →
Introduces RLHF (Reinforcement Learning from Human Feedback) — the technique that turns a base language model into an assistant. A 1.3B InstructGPT model outperforms raw GPT-3 (175B) in human preference evaluations. This paper is why ChatGPT is useful.
arXiv:2203.02155 →
Two related papers that explain exactly how model performance scales with parameters, data, and compute. The Chinchilla paper corrects the original Kaplan scaling laws and shows most large models were undertrained. Every major lab uses these laws to decide how to spend their training budget.
Kaplan: arXiv:2001.08361 → Chinchilla: arXiv:2203.15556 →
Shows that simply adding "Let's think step by step" dramatically improves LLM reasoning on math, logic, and commonsense tasks. The paper that legitimized prompt engineering as a research discipline and is the basis for all CoT and reasoning-model work today.
arXiv:2201.11903 →
Extends Chain-of-Thought to a tree search: the LLM generates multiple candidate reasoning paths, evaluates them, and expands the most promising ones. Significantly improves performance on tasks where linear reasoning fails (Game of 24, creative writing, mini crosswords). The basis for o1/o3-style thinking models.
arXiv:2305.10601 →
Introduces the ReAct pattern: a Thought → Action → Observation loop. The agent reasons about what to do, takes an action (e.g., search Wikipedia), observes the result, then reasons again. The foundational paper for all tool-using agent systems. Still the default pattern in LangChain, LlamaIndex, and most agent frameworks.
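The loop itself is small. A minimal sketch, assuming a hypothetical `lookup_capital` tool standing in for a real search call (a real agent would have an LLM produce each Thought and Action):

```python
# Minimal ReAct sketch. The tool name and its lookup table are toy
# assumptions; in practice an LLM generates the Thought and chooses the
# Action, and the tool hits a real API.

def lookup_capital(country):
    # Stand-in for a real search tool.
    return {"France": "Paris"}.get(country, "unknown")

TOOLS = {"lookup_capital": lookup_capital}

def react_step(thought, action, arg):
    """One Thought -> Action -> Observation cycle."""
    observation = TOOLS[action](arg)
    return {"thought": thought, "action": action, "observation": observation}

step = react_step("I need the capital of France.", "lookup_capital", "France")
# The agent would feed step["observation"] back into its next Thought.
```

The key design point is that the observation is appended to the context before the next reasoning step, so the model's plan can change based on what the tool actually returned.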
arXiv:2210.03629 →
Agents that learn from their own mistakes without gradient updates — the model critiques its own past failures and stores them as natural language memory for the next attempt. Achieves 91% on HumanEval vs. 80% for GPT-4 alone. Highly practical for production agents.
arXiv:2303.11366 →
Shows that LLMs can self-supervise their own tool use — the model learns when and how to call APIs (calculator, search, calendar) without human-labeled data. The precursor to function calling in GPT-4 and Claude tool use.
arXiv:2302.04761 →
The first systematic benchmark for evaluating LLMs as real-world agents across 8 environments: web shopping, household tasks, digital card games, lateral thinking puzzles, databases, operating systems, web browsing, and knowledge graphs. GPT-4 outperforms open-source models by a huge margin on all tasks.
arXiv:2308.03688 →
Evaluates LLMs on 2,294 real GitHub issues requiring code changes to pass existing tests. In 2023, the best models solved ~4% of issues. By 2025, top systems solve over 70%. Arguably the most important benchmark for measuring real-world software engineering capability in AI agents.
arXiv:2310.06770 → Live Leaderboard →
The original RAG paper. Combines a parametric memory (the LLM) with a non-parametric memory (retrieved documents). The LLM generates conditioned on retrieved context — grounding outputs in real documents. Every RAG system you've seen is based on this architecture.
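The retrieve-then-generate flow can be sketched in a few lines. This is a toy, assuming a bag-of-words "embedding" and a hand-built prompt; real systems use a trained encoder and a vector database:

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding"; real systems use a trained encoder.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = ["the KV cache stores keys and values",
        "RoPE encodes positions as rotations"]

def retrieve(query, docs, k=1):
    """Rank documents by similarity to the query; keep the top k."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

context = retrieve("how does the KV cache work", docs)[0]
prompt = f"Answer using this context:\n{context}\n\nQ: how does the KV cache work"
```

The retrieved text is spliced into the prompt, so the model's answer is conditioned on (and attributable to) a concrete source document.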
arXiv:2005.11401 →
Standard RAG always retrieves, even when it's unnecessary. Self-RAG trains the model to decide when to retrieve and then evaluate whether the retrieved content is actually relevant, grounded, and useful using reflection tokens (Retrieve, IsRel, IsSup, IsUse). Significantly outperforms standard RAG on factual tasks.
arXiv:2310.11511 →
How LLMs actually work under the hood — the components, tradeoffs, and optimizations that matter for engineers.
LLMs operate on tokens (subword units), not characters or words. GPT-4 uses ~100K token vocabulary via byte-pair encoding. "Tokenization matters" — the same word can be 1–4 tokens depending on context. Context windows (4K–2M tokens) define how much the model can "see" at once.
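A minimal sketch of the merge step behind byte-pair encoding: start from characters and repeatedly fuse the most frequent adjacent pair (real tokenizers like GPT-4's cl100k learn ~100K merges from a large corpus):

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Find the adjacent symbol pair that occurs most often."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge(tokens, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("low lower lowest")
for _ in range(3):  # three merge rounds, as BPE training would do
    tokens = merge(tokens, most_frequent_pair(tokens))
# After a few rounds, the frequent substring "low" is a single token.
```

This is why the same word can cost a different number of tokens: whether a string collapses to one token depends on which merges the tokenizer learned, which in turn depends on corpus frequency.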
The core computation of every transformer. Each token attends to every other token in the sequence — "queries" look for relevant "keys" and retrieve weighted "values." Multiple attention heads let the model attend to different aspects simultaneously (syntax, semantics, coreference).
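In code, a single attention head is just this (a pure-Python sketch over lists; real implementations are batched matrix multiplies):

```python
import math

def softmax(xs):
    m = max(xs)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for one head (lists of vectors)."""
    d = len(K[0])
    out = []
    for q in Q:
        # Query scores every key, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # Output is the weighted mix of values
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# One query attending over two key/value pairs; it matches the first key
# more strongly, so the output leans toward the first value.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out = attention(Q, K, V)
```

Multi-head attention runs several copies of this with different learned projections of Q, K, and V, then concatenates the outputs.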
During autoregressive generation, a naive implementation re-computes Keys and Values for the entire prefix at every new token, so per-token cost grows quadratically with sequence length. The KV Cache computes K and V once per token and stores them; each new token then only computes its own K and V and attends over the cached entries, so per-token cost grows only linearly with context length. This is why inference cost is dominated by sequence length.
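The bookkeeping difference can be shown with a toy work counter (the `key_value` projection here is a hypothetical stand-in for the model's learned K/V projections):

```python
# Sketch of KV caching during decoding. key_value() stands in for the
# model's K/V projection of one token (toy; a real model applies learned
# weight matrices).

def key_value(token):
    return (hash(token) % 7, hash(token) % 5)  # toy (K, V) for one token

def decode_with_cache(tokens):
    cache = []   # grows by one (K, V) entry per generated token
    work = 0     # K/V projections actually computed
    for t in tokens:
        cache.append(key_value(t))  # compute K,V once for the new token
        work += 1                   # attention then reads the whole cache
    return cache, work

def decode_without_cache(tokens):
    work = 0
    for i in range(1, len(tokens) + 1):
        work += i  # re-project K,V for the entire prefix every step
    return work

cache, cached_work = decode_with_cache(list("hello"))   # 5 projections
naive_work = decode_without_cache(list("hello"))        # 1+2+3+4+5 = 15
```

For 5 tokens the gap is 15 vs. 5 projections; at 10,000 tokens the naive scheme does roughly 50 million while the cache does 10,000, which is why every serving stack implements it.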
Transformers have no inherent notion of word order — positional encodings inject position information. Modern models use Rotary Position Embeddings (RoPE) which encode position as a rotation in complex space and generalize better to sequences longer than training. ALiBi adds linear distance penalties to attention scores instead of embedding positions.
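RoPE's key property is that a query rotated to position m and a key rotated to position n produce a dot product that depends only on the gap m − n. A sketch for a single 2-D feature pair (assuming frequency 1; real RoPE uses a spectrum of frequencies across pairs):

```python
import math

def rotate(pair, pos, freq=1.0):
    """RoPE sketch: rotate one 2-D feature pair by angle pos * freq."""
    x, y = pair
    a = pos * freq
    return (x * math.cos(a) - y * math.sin(a),
            x * math.sin(a) + y * math.cos(a))

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

q, k = (1.0, 0.0), (0.5, 0.5)
# The attention score depends only on the relative distance between positions:
s1 = dot(rotate(q, 5), rotate(k, 3))    # positions 5 and 3 (gap 2)
s2 = dot(rotate(q, 12), rotate(k, 10))  # positions 12 and 10 (same gap)
```

Because only relative distance matters, the same learned attention pattern applies at positions the model never saw during training, which is what gives RoPE its length generalization.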
After attention, each token passes through a position-wise feed-forward network (2 linear layers + activation). FFNs store "factual knowledge" in their weights. Modern models use SwiGLU activation. Mixture of Experts (MoE) replaces a single FFN with N experts — each token routes to the top-K experts, multiplying capacity without multiplying compute. Mixtral 8x7B is an MoE model, and GPT-4 is widely believed to be one.
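The routing step is simple to sketch. The gate scores below are toy numbers; a real router is a learned linear layer whose softmax output is computed per token:

```python
# Sketch of top-K expert routing in an MoE layer.

def top_k_route(gate_scores, k=2):
    """Pick the k experts with the highest gate scores for one token."""
    ranked = sorted(range(len(gate_scores)), key=lambda i: gate_scores[i],
                    reverse=True)
    chosen = ranked[:k]
    total = sum(gate_scores[i] for i in chosen)
    # Renormalize so the chosen experts' weights sum to 1
    return {i: gate_scores[i] / total for i in chosen}

# 8 experts, but each token only runs through 2 of them (as in Mixtral 8x7B).
weights = top_k_route([0.1, 0.05, 0.3, 0.02, 0.2, 0.08, 0.15, 0.1], k=2)
```

The token's FFN output is then the weighted sum of just those K experts' outputs, so an 8-expert model pays roughly the compute of 2 experts per token.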
Modern LLMs are trained in stages: (1) Pre-training on trillions of tokens — learns language. (2) Supervised Fine-Tuning (SFT) on instruction/response pairs — learns to be helpful. (3) RLHF — human raters rank outputs, trains reward model, PPO optimizes against it. (4) DPO (Direct Preference Optimization) — simpler alternative to RLHF, directly learns from preference pairs without a separate reward model.
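The DPO objective from stage (4) is compact enough to write out for a single preference pair. A sketch with toy, hypothetical log-probabilities (real training averages this over a dataset and backpropagates through the policy):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin).
    The margin compares how much more the policy prefers the chosen
    answer (w) over the rejected one (l), relative to the frozen
    reference model."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Toy numbers: the policy favors the chosen answer more than the reference
# does, so the loss falls below -log(0.5) ~= 0.693.
loss = dpo_loss(logp_w=-5.0, logp_l=-9.0, ref_logp_w=-6.0, ref_logp_l=-8.0)
```

Minimizing this pushes the policy's preference margin up directly from ranked pairs, which is why DPO needs no separate reward model or PPO loop.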
Context windows have expanded dramatically: GPT-3 (2K) → GPT-4 Turbo (128K) → Claude 3.5 (200K) → Gemini 1.5 Pro (1M). Larger windows don't mean equal attention to all tokens — the "lost in the middle" phenomenon shows models attend better to tokens at the start and end. Techniques like RAG, chunk summarization, and hierarchical memory address this.
Running a 70B model in FP32 requires ~280GB of VRAM. Quantization reduces precision to cut memory: relative to an FP16 baseline (~140GB), INT8 halves the footprint (~70GB) and INT4 quarters it (~35GB). GPTQ, AWQ, and GGUF (llama.cpp) are common formats. LoRA and QLoRA enable fine-tuning on consumer hardware by only training small low-rank adapter matrices, leaving base weights frozen.
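The arithmetic behind those numbers is just parameters × bits-per-weight, which is worth keeping as a helper when sizing hardware (this estimates weight memory only, ignoring activations and the KV cache):

```python
def model_memory_gb(n_params_b, bits):
    """Approximate weight memory in GB for a model with n_params_b billion
    parameters stored at the given bit width. Ignores activations/KV cache."""
    return n_params_b * 1e9 * bits / 8 / 1e9

fp32 = model_memory_gb(70, 32)  # ~280 GB
fp16 = model_memory_gb(70, 16)  # ~140 GB
int8 = model_memory_gb(70, 8)   # ~70 GB
int4 = model_memory_gb(70, 4)   # ~35 GB
```

In practice budget some headroom on top of this: the KV cache alone can add tens of gigabytes at long context lengths.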
Deep read: Sebastian Raschka's Coding the KV Cache from Scratch and Lilian Weng's The Transformer Family v2 are the two best technical deep-dives available for free.
The design patterns that separate well-behaved production agents from brittle demos. Each pattern trades off token cost, reliability, and task complexity.
RAG has matured well beyond naive chunking. Key patterns:
Agents need different types of memory:
How agents communicate and coordinate:
What the model cards and leaderboards actually measure — and what they miss. Understand these before evaluating any LLM for production use.
Benchmark contamination is real. Many models are trained on data that includes benchmark test sets, inflating scores. Always cross-reference multiple benchmarks, prioritize human evaluation (Chatbot Arena), and test on your own task distribution before trusting published numbers.
Bookmark these. They update in real-time as new models are released and evaluated.
200+ models ranked by blind human preference votes. 2.5M+ votes. The benchmark that's hardest to game — humans pick winners without knowing which model produced each answer. Updated daily.
Aggregates 152 models across 53 benchmarks: reasoning, coding, agents, knowledge, multimodal. Single composite score plus per-benchmark breakdowns. Updated continuously.
Live ranking of AI coding agents on real GitHub issue resolution. Best real-world proxy for software engineering capability. Watch scores climb month-over-month.
Holistic multi-dimensional evaluation. Measures accuracy, calibration, robustness, fairness, bias, and toxicity across 42 scenarios. Best for comparing safety dimensions.
The definitive ranking for embedding models — pick the right one for your RAG pipeline. Covers retrieval, clustering, classification, and 5 other tasks.
Open LLM coding leaderboard. Tracks models on HumanEval, MBPP, BigCodeBench. Useful for comparing open-source vs. proprietary coding performance.
Hugging Face's leaderboard tracking open-source model performance. Essential for comparing Llama, Mistral, Qwen, Gemma, DeepSeek and other open-weights models.
Expert-validated evaluation across safety, instruction following, and reasoning. Uses human expert graders, not automated metrics. Hard to game.
The people whose writing consistently advances understanding — not just summarizing news.
The most comprehensive technical AI blog in existence. Deep-dives on transformers, agents, RLHF, diffusion models, hallucination. Every post is a mini-textbook. Active since 2017.
Must reads: Transformer Family v2, LLM-Powered Autonomous Agents, Hallucination in LLMs, Why We Think (2025)
Visit Lil'Log →
Original thinking from one of the best practitioners in AI. Coined "Software 2.0," predicted many trends years early. Known for brutally honest takes. 2025 LLM Year in Review is essential.
Must reads: Software 2.0, State of GPT, LLM Year in Review 2025, The Unreasonable Effectiveness of RNNs
Visit Blog →
NLP research-focused writing. Covers multilingual LLMs, evaluation, transfer learning, and conference deep-dives from ACL/EMNLP/NAACL/ICML. The NLP Newsletter covers the research landscape monthly.
Must reads: The State of NLP, RAG vs Fine-Tuning, NLP Newsletter archive
Visit Blog →
Best practical ML engineering blog. Known for clear explanations of fine-tuning, LoRA, and LLM internals with code. "Build a Large Language Model From Scratch" (book) originated here.
Must reads: Coding the KV Cache, LoRA vs Full Fine-Tuning, Understanding Flash Attention, LLM evals primer
Visit Blog →
Best visual explainers for how LLMs and transformers work. The illustrated transformer, BERT, and GPT posts are the most-linked ML explainers on the internet. Phenomenal for building intuition.
Must reads: The Illustrated Transformer, The Illustrated BERT, The Illustrated GPT-2, Visualizing Neural Nets
Visit Blog →
Daily practitioner notes on LLMs, agents, and AI tools. Extremely prolific — covers every major model release, paper, and tool with hands-on testing. Best for staying current on what actually ships.
Must reads: Everything he's written about LLM tool use, his notes on Claude and GPT-4 capabilities
Visit Blog →
Research and engineering posts from the HF team. Covers new model architectures, PEFT/LoRA, RLHF implementations, dataset insights, and model releases. The most technically rigorous company blog in the ecosystem.
Must reads: RLHF explainer, LoRA guide, Intro to Agents with smolagents, Benchmark analyses
Visit Blog →
Anthropic publishes more safety and interpretability research than any other lab. Constitutional AI, model cards, sleeper agents, and mechanistic interpretability papers all originated here.
Must reads: Constitutional AI, Measuring Faithfulness of CoT, Many-shot Jailbreaking, Building Effective Agents
Visit Research →
The research ensuring powerful agents don't behave in unintended ways — increasingly important as agents take real-world actions.
Instead of relying solely on human feedback for safety, the model critiques and revises its own outputs according to a written "constitution" of principles. Scales safety alignment without requiring human raters for every response. This is how Claude was trained.
arXiv:2212.08073 →
Demonstrates that it's possible to train models with hidden backdoors that activate under specific trigger conditions — and that standard safety fine-tuning doesn't reliably remove them. A landmark paper on fundamental alignment challenges in AI systems.
arXiv:2401.05566 →
Mechanistic interpretability milestone — identifies individual neurons in a transformer that activate for specific, interpretable concepts (e.g., "DNS," "base64," "racist content"). Sparse autoencoders are used to decompose superposition. The beginning of truly understanding what's inside LLMs.
Read Paper →
Landmark cross-lab safety evaluation — each lab tested the other's models for alignment failures in simulated high-stakes agentic settings. Findings: most models struggle with sycophancy, GPT-4o showed concerning misuse-support behaviors in simulations, and GPT-5 showed substantial improvements.
Read Findings →
Describes how o1 uses chain-of-thought reasoning to think through safety policy before responding — "deliberative alignment." Achieved state-of-the-art on benchmarks for illicit-advice avoidance and jailbreak resistance. The first paper to show reasoning improves safety, not just capability.
Read System Card →
As agents take real-world actions (sending emails, executing code, browsing the web), new failure modes emerge beyond chatbot safety:
The gap between a working demo and a reliable production system. Infrastructure, evaluation, fine-tuning, and observability for real deployments.
Before you build, define how you'll measure success. LLM evals include: exact match, ROUGE/BLEU, LLM-as-judge (GPT-4 scoring outputs), human preference, and task-specific metrics. Without evals, you can't know if a prompt change helped or hurt. Build your eval harness before your first production prompt.
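A minimal eval-harness sketch for the simplest of those metrics, exact match. The `predict` stub and its lookup table are hypothetical; in practice it wraps your actual LLM call:

```python
# Exact-match eval harness sketch. predict() is a toy stand-in for an LLM
# call; the eval set is fabricated for illustration.

def predict(question):
    return {"2+2": "4", "capital of France": "Paris"}.get(question, "")

EVAL_SET = [("2+2", "4"), ("capital of France", "Paris"), ("3*3", "9")]

def exact_match_score(predict_fn, eval_set):
    """Fraction of eval examples where the prediction matches gold exactly."""
    hits = sum(predict_fn(q).strip() == gold for q, gold in eval_set)
    return hits / len(eval_set)

score = exact_match_score(predict, EVAL_SET)
# Run this before and after every prompt change and compare scores.
```

The same harness shape extends to ROUGE or LLM-as-judge scoring: only the comparison inside `exact_match_score` changes, so build the loop once and swap metrics as needed.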
Read Hamel's Evals Guide →
Amazon's production lessons: SFT + PPO + DPO + GRPO stacked together. Results: 33% reduction in medication errors, 80% less human review effort in engineering, accuracy from 77% to 96%. Shows what's possible when fine-tuning is done right at scale.
Read on AWS →
LangChain's production platform. Traces every LLM call, tool invocation, and agent step. Debug prompts, track latency/cost, run regression tests, compare prompt versions. The essential observability tool for any LangChain-based system.
Visit LangSmith →
Track fine-tuning runs, compare hyperparameters, log eval metrics, visualize attention patterns. The standard MLOps tool for any serious LLM fine-tuning project. Integrates with HuggingFace Trainer, PyTorch, JAX, and most frameworks.
Visit W&B →
PagedAttention + continuous batching enables up to 24x higher throughput than naive serving. The standard open-source inference engine for self-hosted models. Powers production deployments of Llama, Mistral, and other open-source models at scale. Used by Anyscale, Databricks, and others.
View on GitHub →
Enterprise implementation guide. Covers a three-agent architecture (Coordinator → Retrieval → Reasoning), failure modes, latency tradeoffs, and when to use RAG vs. fine-tuning. Addresses the orchestration layer that's typically left out of tutorials.
Read Guide →
| Dimension | RAG | Fine-Tuning |
|---|---|---|
| Knowledge freshness | Real-time, always current | Baked in at training time |
| Knowledge size | Unlimited (external DB) | Limited by context/capacity |
| Response style / tone | Harder to control | Strong style consistency |
| Task specialization | General behavior | Deep task expertise |
| Citability / traceability | Source attribution built-in | Hard to attribute |
| Cost to update | Update index only | Retrain or re-fine-tune |
| Recommended when | Docs change often, need citations, broad knowledge base | Consistent format, specialized domain, style/tone control |