AI and large language models represent a convergence of decades of research in machine learning, natural language processing, computer systems, and human-computer interaction. This chapter highlights foundational papers, influential systems, and active research areas for readers who want to explore the topics we’ve covered in greater depth.
Foundations of Language Models
The transformer architecture that started it all (its central formula is reproduced after this list):
- Vaswani et al. (2017): Attention Is All You Need
- Devlin et al. (2019): BERT: Pre-training of Deep Bidirectional Transformers
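For quick reference, the core of Vaswani et al. (2017) is scaled dot-product attention, where $Q$, $K$, and $V$ are the query, key, and value matrices and $d_k$ is the key dimension:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$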
Scaling laws and emergent capabilities (the headline power law follows the list):
- Kaplan et al. (2020): Scaling Laws for Neural Language Models
- Wei et al. (2022): Emergent Abilities of Large Language Models
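As a reminder of the headline result, Kaplan et al. (2020) fit test loss as a power law in the non-embedding parameter count $N$ (with analogous laws for dataset size and compute); the constants below are the paper's reported fits and should be read as rough orders of magnitude:

$$L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad \alpha_N \approx 0.076, \quad N_c \approx 8.8 \times 10^{13}$$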
Prompting and In-Context Learning
Understanding how LLMs learn from prompts (a minimal prompting sketch follows the list):
- Brown et al. (2020): Language Models are Few-Shot Learners
- Wei et al. (2022): Chain-of-Thought Prompting Elicits Reasoning
- Kojima et al. (2022): Large Language Models are Zero-Shot Reasoners
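A minimal illustration of the zero-shot chain-of-thought trick from Kojima et al. (2022): append the trigger phrase "Let's think step by step" to the prompt. The `complete` function is a hypothetical stand-in for whatever completion API you use.

```python
def complete(prompt: str) -> str:
    """Hypothetical stand-in for a call to an LLM completion endpoint."""
    raise NotImplementedError

question = ("A juggler can juggle 16 balls. Half of the balls are golf balls, "
            "and half of the golf balls are blue. How many blue golf balls are there?")

# Standard zero-shot prompt: ask for the answer directly.
direct = f"Q: {question}\nA:"

# Zero-shot CoT (Kojima et al., 2022): the trigger phrase elicits
# step-by-step reasoning before the final answer.
cot = f"Q: {question}\nA: Let's think step by step."

# answer = complete(cot)
```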
Prompt engineering techniques:
- Reynolds and McDonell (2021): Prompt Programming for Large Language Models
- Zhou et al. (2023): Large Language Models Are Human-Level Prompt Engineers
Tokenization and Representation
Subword tokenization methods (a toy byte-pair-encoding loop follows the list):
- Sennrich et al. (2016): Neural Machine Translation of Rare Words with Subword Units
- Kudo and Richardson (2018): SentencePiece: A Simple and Language Independent Approach
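To make the Sennrich et al. (2016) idea concrete, here is a toy version of the BPE training loop: repeatedly merge the most frequent adjacent symbol pair. This follows the reference implementation in the paper's appendix and is illustrative, not a production tokenizer.

```python
import re
from collections import Counter

def pair_counts(vocab):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every standalone occurrence of `pair` into a single symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Words as space-separated symbols; '</w>' marks the end of a word.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

for _ in range(10):                     # learn 10 merges
    counts = pair_counts(vocab)
    if not counts:
        break
    best = max(counts, key=counts.get)  # most frequent pair wins
    vocab = merge_pair(best, vocab)
    print(best)
```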
Understanding embeddings (a similarity sketch follows the list):
- Mikolov et al. (2013): Efficient Estimation of Word Representations in Vector Space
- Pennington et al. (2014): GloVe: Global Vectors for Word Representation
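The basic operation behind comparing word2vec or GloVe vectors is cosine similarity. A tiny sketch with made-up 3-dimensional vectors; real embeddings are typically 50 to 300 dimensions and loaded from pretrained files.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity: the cosine of the angle between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors for illustration only; in practice, load pretrained embeddings.
king  = np.array([0.8, 0.3, 0.1])
queen = np.array([0.7, 0.4, 0.2])
apple = np.array([0.1, 0.9, 0.6])

print(cosine(king, queen))  # related words -> higher similarity (~0.98 here)
print(cosine(king, apple))  # unrelated words -> lower similarity (~0.44 here)
```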
Alignment and Safety
Reinforcement learning from human feedback (the reward-model objective is reproduced after the list):
- Christiano et al. (2017): Deep Reinforcement Learning from Human Preferences
- Ouyang et al. (2022): Training Language Models to Follow Instructions with Human Feedback
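The reward-model objective in Ouyang et al. (2022) is worth having at hand: given a prompt $x$ with a preferred response $y_w$ and a dispreferred one $y_l$, the reward model $r_\theta$ is trained to rank them via

$$\mathcal{L}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim D}\left[\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\right]$$

after which the language model is fine-tuned with RL (PPO, in the paper) against $r_\theta$.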
Understanding model behavior and safety:
- Ganguli et al. (2023): The Capacity for Moral Self-Correction in Large Language Models
Anthropic alignment research:
- Bai et al. (2022): Constitutional AI: Harmlessness from AI Feedback — Introduces Constitutional AI, a training framework where an AI uses a set of guiding principles (a “constitution”) to self-evaluate and improve harmlessness and helpfulness without requiring extensive human labeling.
- Greenblatt et al. (2024): Alignment Faking in Large Language Models — Empirically investigates “alignment faking,” where models trained for harmless behavior can strategically produce harmful content during training to preserve internal preferences.
- Sheshadri et al. (2025): Why Do Some Language Models Fake Alignment While Others Don’t? — Analyzes differences in alignment-faking behaviors across models and explores factors influencing compliance gaps.
- MacDiarmid et al. (2025): Natural Emergent Misalignment from Reward Hacking in Production RL — Shows that reward hacking in RL environments can lead to emergent misaligned behaviors including alignment faking and sabotage, and studies mitigations like inoculation prompting.
- Lynch et al. (2025): Agentic Misalignment: How LLMs Could Be Insider Threats — Stress-tests 16 leading agentic models in simulated corporate environments and finds that, when faced with threats to their continued operation or conflicts between their goals and company direction, models sometimes engage in harmful insider-threat-like behaviors (e.g., blackmail, leaking sensitive information), highlighting risks from agentic misalignment and the need for more robust oversight.
Privacy and Security
Privacy-preserving machine learning (a DP-SGD step is sketched after the list):
- Abadi et al. (2016): Deep Learning with Differential Privacy
- McMahan et al. (2017): Communication-Efficient Learning of Deep Networks from Decentralized Data
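The core mechanism in Abadi et al. (2016) is DP-SGD: clip each per-example gradient to an L2 bound, then add Gaussian noise calibrated to that bound. A minimal numpy sketch of one step; parameter names like `noise_mult` are ours, not the paper's.

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, lr=0.1, clip_norm=1.0, noise_mult=1.1):
    """One DP-SGD step over a batch of per-example gradients (shape: batch x dim)."""
    clipped = [g / max(1.0, np.linalg.norm(g) / clip_norm)  # scale so ||g|| <= clip_norm
               for g in per_example_grads]
    total = np.sum(clipped, axis=0)
    # Gaussian noise scaled to the clipping bound, which caps each example's influence.
    noise = np.random.normal(0.0, noise_mult * clip_norm, size=total.shape)
    return params - lr * (total + noise) / len(per_example_grads)
```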
Adversarial robustness and attacks:
- Carlini et al. (2021): Extracting Training Data from Large Language Models
- Wallace et al. (2019): Universal Adversarial Triggers for Attacking and Analyzing NLP
- Souly et al. (2025): Poisoning Attacks on LLMs Require a Near-Constant Number of Poison Samples
Multimodal Models
Vision-language models:
- Radford et al. (2021): Learning Transferable Visual Models From Natural Language Supervision
- Ramesh et al. (2022): Hierarchical Text-Conditional Image Generation with CLIP Latents
Unified architectures:
- Alayrac et al. (2022): Flamingo: a Visual Language Model for Few-Shot Learning
Model Efficiency and Compression
Making models faster and smaller (a common distillation objective follows the list):
- Hinton et al. (2015): Distilling the Knowledge in a Neural Network
- Frantar and Alistarh (2023): SparseGPT: Massive Language Models Can Be Accurately Pruned
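As a reference point for Hinton et al. (2015), one common formulation of the distillation objective blends hard-label cross-entropy with a softened teacher-matching term at temperature $T$ (the $T^2$ factor keeps gradient magnitudes comparable across temperatures):

$$\mathcal{L} = (1 - \lambda)\, \mathrm{CE}\big(y,\, \sigma(z_s)\big) + \lambda\, T^2\, \mathrm{CE}\big(\sigma(z_t / T),\, \sigma(z_s / T)\big)$$

where $z_s$ and $z_t$ are the student and teacher logits and $\sigma$ is the softmax.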
Low-rank adaptation (sketched after the list):
- Hu et al. (2022): LoRA: Low-Rank Adaptation of Large Language Models
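The LoRA idea in Hu et al. (2022) fits in a few lines: freeze the pretrained weight $W$ and learn a low-rank update $BA$ scaled by $\alpha / r$, with $B$ initialized to zero so training starts from the original model. A plain-numpy sketch:

```python
import numpy as np

d_out, d_in, r, alpha = 64, 64, 8, 16
rng = np.random.default_rng(0)

W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable, small random init
B = np.zeros((d_out, r))                # trainable, zero init: no change at start

def lora_forward(x: np.ndarray) -> np.ndarray:
    """Frozen base projection plus the scaled low-rank update."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
print(np.allclose(lora_forward(x), W @ x))  # True until B is trained away from zero
```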
Retrieval Augmentation and Tool Use
Enhancing models with external knowledge (a retrieve-then-generate skeleton follows the list):
- Lewis et al. (2020): Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
- Borgeaud et al. (2022): Improving Language Models by Retrieving from Trillions of Tokens
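The retrieve-then-generate loop common to Lewis et al. (2020) and its descendants reduces to a short skeleton. Here `embed` and `generate` are hypothetical stand-ins for your embedding model and LLM, and real systems use an approximate-nearest-neighbor index rather than the brute-force scoring shown.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical stand-in for an embedding model."""
    raise NotImplementedError

def generate(prompt: str) -> str:
    """Hypothetical stand-in for an LLM completion call."""
    raise NotImplementedError

def rag_answer(query: str, docs: list[str], k: int = 3) -> str:
    q = embed(query)
    # Brute-force similarity scoring; swap in a vector index at scale.
    ranked = sorted(docs, key=lambda d: -float(embed(d) @ q))
    context = "\n\n".join(ranked[:k])
    return generate(f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")
```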
Tool use and agentic behavior:
- Schick et al. (2023): Toolformer: Language Models Can Teach Themselves to Use Tools
- Nakano et al. (2021): WebGPT: Browser-assisted Question-Answering with Human Feedback
Interpretability and Mechanistic Understanding
Understanding how transformers work:
- Elhage et al. (2021): A Mathematical Framework for Transformer Circuits
- Olsson et al. (2022): In-Context Learning and Induction Heads
Probing and analysis:
- Tenney et al. (2019): BERT Rediscovers the Classical NLP Pipeline
Coding and Program Synthesis
Models for code generation:
- Chen et al. (2021): Evaluating Large Language Models Trained on Code
- Austin et al. (2021): Program Synthesis with Large Language Models
Formal verification and correctness:
- Polu and Sutskever (2020): Generative Language Modeling for Automated Theorem Proving
Practical Resources and Libraries
Surveys and Perspectives
Comprehensive overviews:
- Zhao et al. (2023): A Survey of Large Language Models
- Bommasani et al. (2021): On the Opportunities and Risks of Foundation Models
Future directions and open problems:
- Bubeck et al. (2023): Sparks of Artificial General Intelligence: Early Experiments with GPT-4
References
In the web/HTML version of the book, the bibliography appears directly below this section. In the print versions, which are based on LaTeX, it appears (more traditionally) as the penultimate un-numbered standalone chapter, preceding the Proof Index.
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
- Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv Preprint arXiv:1810.04805.
- Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020). Scaling laws for neural language models. arXiv Preprint arXiv:2001.08361.
- Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., & others. (2022). Emergent abilities of large language models. Transactions on Machine Learning Research.
- Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., & others. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877–1901.
- Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., & others. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35, 24824–24837.
- Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., & Iwasawa, Y. (2022). Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, 35, 22199–22213.
- Reynolds, L., & McDonell, K. (2021). Prompt programming for large language models: Beyond the few-shot paradigm. Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems, 1–7.
- Zhou, Y., Muresanu, A. I., Han, Z., Paster, K., Pitis, S., Chan, H., & Ba, J. (2023). Large language models are human-level prompt engineers. arXiv Preprint arXiv:2211.01910.
- Sennrich, R., Haddow, B., & Birch, A. (2016). Neural machine translation of rare words with subword units. arXiv Preprint arXiv:1508.07909.
- Kudo, T., & Richardson, J. (2018). SentencePiece: A simple and language independent approach to subword tokenization. arXiv Preprint arXiv:1808.06226.
- Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv Preprint arXiv:1301.3781.
- Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543.
- Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., & Amodei, D. (2017). Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30.
- Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., & others. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, 27730–27744.