Complete roadmap to master LLMs with Colab notebooks. Covers fundamentals, fine-tuning, RLHF, RAG, agents, quantization, and deployment.
𝕏 Follow me on X • 🤗 Hugging Face • 💻 Blog • 📙 LLM Engineer's Handbook
The LLM course is divided into three parts:
> [!NOTE] > Based on this course, I co-wrote the LLM Engineer's Handbook, a hands-on book that covers an end-to-end LLM application from design to deployment. The LLM course will always stay free, but you can support my work by purchasing this book.
For a more comprehensive version of this course, check out the DeepWiki.
Toggle section (optional)
Toggle section (optional)

📚 Resources:
---
📚 Resources:
---
📚 Resources:
---
📚 Resources:

📚 References: * Visual intro to Transformers by 3Blue1Brown: Visual introduction to Transformers for complete beginners. * LLM Visualization by Brendan Bycroft: Interactive 3D visualization of LLM internals. * nanoGPT by Andrej Karpathy: A 2h-long YouTube video to reimplement GPT from scratch (for programmers). He also made a video about tokenization. * Attention? Attention! by Lilian Weng: Historical overview to introduce the need for attention mechanisms. * Decoding Strategies in LLMs by Maxime Labonne: Provide code and a visual introduction to the different decoding strategies to generate text.
---
* Data preparation: Pre-training requires massive datasets (e.g., Llama 3.1 was trained on 15 trillion tokens) that need careful curation, cleaning, deduplication, and tokenization. Modern pre-training pipelines implement sophisticated filtering to remove low-quality or problematic content. * Distributed training: Combine different parallelization strategies: data parallel (batch distribution), pipeline parallel (layer distribution), and tensor parallel (operation splitting). These strategies require optimized network communication and memory management across GPU clusters. * Training optimization: Use adaptive learning rates with warm-up, gradient clipping, and normalization to prevent explosions, mixed-precision training for memory efficiency, and modern optimizers (AdamW, Lion) with tuned hyperparameters. * Monitoring: Track key metrics (loss, gradients, GPU stats) using dashboards, implement targeted logging for distributed training issues, and set up performance profiling to identify bottlenecks in computation and communication across devices.
📚 References: * FineWeb by Penedo et al.: Article to recreate a large-scale dataset for LLM pretraining (15T), including FineWeb-Edu, a high-quality subset. * RedPajama v2 by Weber et al.: Another article and paper about a large-scale pre-training dataset with a lot of interesting quality filters. * nanotron by Hugging Face: Minimalistic LLM training codebase used to make SmolLM2. * Parallel training by Chenyan Xiong: Overview of optimization and parallelism techniques. * Distributed training by Duan et al.: A survey about efficient training of LLM on distributed architectures. * OLMo 2 by AI2: Open-source language model with model, data, training, and evaluation code. * LLM360 by LLM360: A framework for open-source LLMs with training and data preparation code, data, metrics, and models.
---
* Storage & chat templates: Because of the conversational structure, post-training datasets are stored in a specific format like ShareGPT or OpenAI/HF. Then, these formats are mapped to a chat template like ChatML or Alpaca to produce the final samples that the model is trained on. * Synthetic data generation: Create instruction-response pairs based on seed data using frontier models like GPT-4o. This approach allows for flexible and scalable dataset creation with high-quality answers. Key considerations include designing diverse seed tasks and effective system prompts. * Data enhancement: Enhance existing samples using techniques like verified outputs (using unit tests or solvers), multiple answers with rejection sampling, Auto-Evol, Chain-of-Thought, Branch-Solve-Merge, personas, etc. * Quality filtering: Traditional techniques involve rule-based filtering, removing duplicates or near-duplicates (with MinHash or embeddings), and n-gram decontamination. Reward models and judge LLMs complement this step with fine-grained and customizable quality control.
📚 References: * Synthetic Data Generator by Argilla: Beginner-friendly way of building datasets using natural language in a Hugging Face space. * LLM Datasets by Maxime Labonne: Curated list of datasets and tools for post-training. * NeMo-Curator by Nvidia: Dataset preparation and curation framework for pre- and post-training data. * Distilabel by Argilla: Framework to generate synthetic data. It also includes interesting reproductions of papers like UltraFeedback. * Semhash by MinishLab: Minimalistic library for near-deduplication and decontamination with a distilled embedding model. * Chat Template by Hugging Face: Hugging Face's documentation about chat templates.
---
📚 References: * Fine-tune Llama 3.1 Ultra-Efficiently with Unsloth by Maxime Labonne: Hands-on tutorial on how to fine-tune a Llama 3.1 model using Unsloth. * Axolotl - Documentation by Wing Lian: Lots of interesting information related to distributed training and dataset formats. * Mastering LLMs by Hamel Husain: Collection of educational resources about fine-tuning (but also RAG, evaluation, applications, and prompt engineering). * LoRA insights by Sebastian Raschka: Practical insights about LoRA and how to select the best parameters.
---
📚 References: * Illustrating RLHF by Hugging Face: Introduction to RLHF with reward model training and fine-tuning with reinforcement learning. * LLM Training: RLHF and Its Alternatives by Sebastian Raschka: Overview of the RLHF process and alternatives like RLAIF. * Preference Tuning LLMs by Hugging Face: Comparison of the DPO, IPO, and KTO algorithms to perform preference alignment. * Fine-tune with DPO by Maxime Labonne: Tutorial to fine-tune a Mistral-7b model with DPO and reproduce NeuralHermes-2.5. * Fine-tune with GRPO by Maxime Labonne: Practical exercise to fine-tune a small model with GRPO. * DPO Wandb logs by Alexander Vishnevskiy: It shows you the main DPO metrics to track and the trends you should expect.
---
📚 References: * LLM evaluation guidebook by Hugging Face: Comprehensive guide about evaluation with practical insights. * Open LLM Leaderboard by Hugging Face: Main leaderboard to compare LLMs in an open and reproducible way (automated benchmarks). * Language Model Evaluation Harness by EleutherAI: A popular framework for evaluating LLMs using automated benchmarks. * Lighteval by Hugging Face: Alternative evaluation framework that also includes model-based evaluations. * Chatbot Arena by LMSYS: Elo rating of general-purpose LLMs, based on comparisons made by humans (human evaluation).
---
* Base techniques: Learn the different levels of precision (FP32, FP16, INT8, etc.) and how to perform naïve quantization with absmax and zero-point techniques. * GGUF & llama.cpp: Originally designed to run on CPUs, llama.cpp and the GGUF format have become the most popular tools to run LLMs on consumer-grade hardware. It supports storing special tokens, vocabulary, and metadata in a single file. * GPTQ & AWQ: Techniques like GPTQ/EXL2 and AWQ introduce layer-by-layer calibration that retains performance at extremely low bitwidths. They reduce catastrophic outliers using dynamic scaling, selectively skipping or re-centering the heaviest parameters. * SmoothQuant & ZeroQuant: New quantization-friendly transformations (SmoothQuant) and compiler-based optimizations (ZeroQuant) help mitigate outliers before quantization. They also reduce hardware overhead by fusing certain ops and optimizing dataflow.
📚 References: * Introduction to quantization by Maxime Labonne: Overview of quantization, absmax and zero-point quantization, and LLM.int8() with code. * Quantize Llama models with llama.cpp by Maxime Labonne: Tutorial on how to quantize a Llama 2 model using llama.cpp and the GGUF format. * 4-bit LLM Quantization with GPTQ by Maxime Labonne: Tutorial on how to quantize an LLM using the GPTQ algorithm with AutoGPTQ. * Understanding Activation-Aware Weight Quantization by FriendliAI: Overview of the AWQ technique and its benefits. * SmoothQuant on Llama 2 7B by MIT HAN Lab: Tutorial on how to use SmoothQuant with a Llama 2 model in 8-bit precision. * DeepSpeed Model Compression by DeepSpeed: Tutorial on how to use ZeroQuant and extreme compression (XTC) with DeepSpeed Compression.
---
* Model merging: Merging trained models has become a popular way of creating performant models without any fine-tuning. The popular mergekit library implements the most popular merging methods, like SLERP, DARE, and TIES. * Multimodal models: These models (like CLIP, Stable Diffusion, or LLaVA) process multiple types of inputs (text, images, audio, etc.) with a unified embedding space, which unlocks powerful applications like text-to-image. * Interpretability: Mechanistic interpretability techniques like Sparse Autoencoders (SAEs) have made remarkable progress to provide insights about the inner workings of LLMs. This has also been applied with techniques such as abliteration, which allow you to modify the behavior of models without training. * Test-time compute: Reasoning models trained with RL techniques can be further improved by scaling the compute budget during test time. It can involve multiple calls, MCTS, or specialized models like a Process Reward Model (PRM). Iterative steps with precise scoring significantly improve performance for complex reasoning tasks.
📚 References: * Merge LLMs with mergekit by Maxime Labonne: Tutorial about model merging using mergekit. * Smol Vision by Merve Noyan: Collection of notebooks and scripts dedicated to small multimodal models. * Large Multimodal Models by Chip Huyen: Overview of multimodal systems and the recent history of this field. * Unsensor any LLM with abliteration by Maxime Labonne: Direct application of interpretability techniques to modify the style of a model. * Intuitive Explanation of SAEs by Adam Karvonen: Article about how SAEs work and why they make sense for interpretability. * Scaling test-time compute by Beeching et al.: Tutorial and experiments to outperform Llama 3.1 70B on MATH-500 with a 3B model.

* LLM APIs: APIs are a convenient way to deploy LLMs. This space is divided between private LLMs (OpenAI, Google, Anthropic, etc.) and open-source LLMs (OpenRouter, Hugging Face, Together AI, etc.). * Open-source LLMs: The Hugging Face Hub is a great place to find LLMs. You can directly run some of them in Hugging Face Spaces, or download and run them locally in apps like LM Studio or through the CLI with llama.cpp or ollama. * Prompt engineering: Common techniques include zero-shot prompting, few-shot prompting, chain of thought, and ReAct. They work better with bigger models, but can be adapted to smaller ones. * Structuring outputs: Many tasks require a structured output, like a strict template or a JSON format. Libraries like Outlines can be used to guide the generation and respect a given structure. Some APIs also support structured output generation natively using JSON schemas.
📚 References: * Run an LLM locally with LM Studio by Nisha Arya: Short guide on how to use LM Studio. * Prompt engineering guide by DAIR.AI: Exhaustive list of prompt techniques with examples * Outlines - Quickstart: List of guided generation techniques enabled by Outlines. * LMQL - Overview: Introduction to the LMQL language.
---
* Ingesting documents: Document loaders are convenient wrappers that can handle many formats: PDF, JSON, HTML, Markdown, etc. They can also directly retrieve data from some databases and APIs (GitHub, Reddit, Google Drive, etc.). Splitting documents: Text splitters break down documents into smaller, semantically meaningful chunks. Instead of splitting text after n* characters, it's often better to split by header or recursively, with some additional metadata. * Embedding models: Embedding models convert text into vector representations. Picking task-specific models significantly improves performance for semantic search and RAG. * Vector databases: Vector databases (like Chroma, Pinecone, Milvus, FAISS, Annoy, etc.) are designed to store embedding vectors. They enable efficient retrieval of data that is 'most similar' to a query based on vector similarity.
📚 References: * LangChain - Text splitters: List of different text splitters implemented in LangChain. * Sentence Transformers library: Popular library for embedding models. * MTEB Leaderboard: Leaderboard for embedding models. * The Top 7 Vector Databases by Moez Ali: A comparison of the best and most popular vector databases.
---
* Orchestrators: Orchestrators like LangChain and LlamaIndex are popular frameworks to connect your LLMs with tools and databases. The Model Context Protocol (MCP) introduces a new standard to pass data and context to models across providers. * Retrievers: Query rewriters and generative retrievers like CoRAG and HyDE enhance search by transforming user queries. Multi-vector and hybrid retrieval methods combine embeddings with keyword signals to improve recall and precision. * Memory: To remember previous instructions and answers, LLMs and chatbots like ChatGPT add this history to their context window. This buffer can be improved with summarization (e.g., using a smaller LLM), a vector store + RAG, etc. * Evaluation: We need to evaluate both the document retrieval (context precision and recall) and the generation stages (faithfulness and answer relevancy). It can be simplified with tools Ragas and DeepEval (assessing quality).
📚 References: * Llamaindex - High-level concepts: Main concepts to know when building RAG pipelines. * Model Context Protocol: Introduction to MCP with motivate, architecture, and quick starts. * Pinecone - Retrieval Augmentation: Overview of the retrieval augmentation process. * LangChain - Q&A with RAG: Step-by-step tutorial to build a typical RAG pipeline. * LangChain - Memory types: List of different types of memories with relevant usage. * RAG pipeline - Metrics: Overview of the main metrics used to evaluate RAG pipelines.
---
* Query construction: Structured data stored in traditional databases requires a specific query language like SQL, Cypher, metadata, etc. We can directly translate the user instruction into a query to access the data with query construction. * Tools: Agents augment LLMs by automatically selecting the most relevant tools to provide an answer. These tools can be as simple as using Google or Wikipedia, or more complex, like a Python interpreter or Jira. * Post-processing: Final step that processes the inputs that are fed to the LLM. It enhances the relevance and diversity of documents retrieved with re-ranking, RAG-fusion, and classification. * Program LLMs: Frameworks like DSPy allow you to optimize prompts and weights based on automated evaluations in a programmatic way.
📚 References: * LangChain - Query Construction: Blog post about different types of query construction. * LangChain - SQL: Tutorial on how to interact with SQL databases with LLMs, involving Text-to-SQL and an optional SQL agent. * Pinecone - LLM agents: Introduction to agents and tools with different types. * LLM Powered Autonomous Agents by Lilian Weng: A more theoretical article about LLM agents. * LangChain - OpenAI's RAG: Overview of the RAG strategies employed by OpenAI, including post-processing. * DSPy in 8 Steps: General-purpose guide to DSPy introducing modules, signatures, and optimizers.
---
* Agent fundamentals: Agents operate using thoughts (internal reasoning to decide what to do next), action (executing tasks, often by interacting with external tools), and observation (analyzing feedback or results to refine the next step). * Agent protocols: Model Context Protocol (MCP) is the industry standard for connecting agents to external tools and data sources with MCP servers and clients. More recently, Agent2Agent (A2A) tries to standardize a common language for agent interoperability. * Vendor frameworks: Each major cloud model provider has its own agentic framework with OpenAI SDK, Google ADK, and Claude Agent SDK if you're particularly tied to one vendor. * Other frameworks: Agent development can be streamlined using different frameworks like LangGraph (design and visualization of workflows) LlamaIndex (data-augmented agents with RAG), or custom solutions. More experimental frameworks include collaboration between different agents, such as CrewAI (role-based team workflows) and AutoGen (conversation-driven multi-agent systems).
📚 References: * Agents Course: Popular course about AI agents made by Hugging Face. * LangGraph: Overview of how to build AI agents with LangGraph. * LlamaIndex Agents: Uses cases and resources to build agents with LlamaIndex.
---
* Flash Attention: Optimization of the attention mechanism to transform its complexity from quadratic to linear, speeding up both training and inference. * Key-value cache: Understand the key-value cache and the improvements introduced in Multi-Query Attention (MQA) and Grouped-Query Attention (GQA). * Speculative decoding: Use a small model to produce drafts that are then reviewed by a larger model to speed up text generation. EAGLE-3 is a particularly popular solution.
📚 References: * GPU Inference by Hugging Face: Explain how to optimize inference on GPUs. * LLM Inference by Databricks: Best practices for how to optimize LLM inference in production. * Optimizing LLMs for Speed and Memory by Hugging Face: Explain three main techniques to optimize speed and memory, namely quantization, Flash Attention, and architectural innovations. * Assisted Generation by Hugging Face: HF's version of speculative decoding. It's an interesting blog post about how it works with code to implement it. * EAGLE-3 paper: Introduces EAGLE-3 and reports speedups up to 6.5×. * Speculators: Library made by vLLM for building, evaluating, and storing speculative decoding algorithms (e.g., EAGLE-3) for LLM inference.
---
* Local deployment: Privacy is an important advantage that open-source LLMs have over private ones. Local LLM servers (LM Studio, Ollama, oobabooga, kobold.cpp, etc.) capitalize on this advantage to power local apps. * Demo deployment: Frameworks like Gradio and Streamlit are helpful to prototype applications and share demos. You can also easily host them online, for example, using Hugging Face Spaces. * Server deployment: Deploying LLMs at scale requires cloud (see also SkyPilot) or on-prem infrastructure and often leverages optimized text generation frameworks like TGI, vLLM, etc. * Edge deployment: In constrained environments, high-performance frameworks like MLC LLM and mnn-llm can deploy LLM in web browsers, Android, and iOS.
📚 References: * Streamlit - Build a basic LLM app: Tutorial to make a basic ChatGPT-like app using Streamlit. * HF LLM Inference Container: Deploy LLMs on Amazon SageMaker using Hugging Face's inference container. * Philschmid blog by Philipp Schmid: Collection of high-quality articles about LLM deployment using Amazon SageMaker. * Optimizing latence by Hamel Husain: Comparison of TGI, vLLM, CTranslate2, and mlc in terms of throughput and latency.
---
* Prompt hacking: Different techniques related to prompt engineering, including prompt injection (additional instruction to hijack the model's answer), data/prompt leaking (retrieve its original data/prompt), and jailbreaking (craft prompts to bypass safety features). * Backdoors: Attack vectors can target the training data itself, by poisoning the training data (e.g., with false information) or creating backdoors (secret triggers to change the model's behavior during inference). * Defensive measures: The best way to protect your LLM applications is to test them against these vulnerabilities (e.g., using red teaming and checks like garak) and observe them in production (with a framework like langfuse).
📚 References: * OWASP LLM Top 10 by HEGO Wiki: List of the 10 most critical vulnerabilities seen in LLM applications. * Prompt Injection Primer by Joseph Thacker: Short guide dedicated to prompt injection for engineers. * LLM Security by @llm_sec: Extensive list of resources related to LLM security. * Red teaming LLMs by Microsoft: Guide on how to perform red teaming with LLMs.
---
Special thanks to:
* Thomas Thelen for motivating me to create a roadmap * André Frade for his input and review of the first draft * Dino Dunn for providing resources about LLM security * Magdalena Kuhn for improving the "human evaluation" part * Odoverdose for suggesting 3Blue1Brown's video about Transformers * Everyone who contributed to the educational references in this course :)
Disclaimer: I am not affiliated with any sources listed here.
---
Source Repository: mlabonne/llm-course
License: Apache 2.0 - Feel free to use, modify, and share