Nikhil Prakash

22nd floor, 177 Huntington Ave

Boston, MA 02115

I’m a third-year Ph.D. student at Northeastern University, advised by Prof. David Bau. I completed my Bachelor of Engineering at RV College of Engineering, Bangalore, India in fall 2020, with a focus on electrical engineering and computer science.

This summer, I’ll be interning at Apple, working on mechanistic interpretability of LLMs. During my Ph.D., I’ve interned at the Practical AI Alignment and Interpretability Research Group with Dr. Atticus Geiger and at SERI-MATS (first phase) with Neel Nanda. Prior to that, I was a visiting scholar at the Max Planck Institute for Security and Privacy and had stints at the Korea Advanced Institute of Science & Technology and the Indian Institute of Technology Ropar.

Broadly, my interests lie in understanding the internal mechanisms of deep neural networks to enhance human-AI collaboration and prevent misalignment. Currently, I’m investigating cognitive abilities, such as reasoning and theory of mind, in large language models.

I have received invaluable support from many people throughout my career, and as a result, I’m always happy to assist others and share insights from my experiences. Please feel free to reach out.

news

Apr, 2025 Oral presentation of our recent work on the belief tracking mechanism in LMs at New England NLP 2025.
Mar, 2025 Reviewing for ICML 2025, COLM 2025, TMLR.
Mar, 2025 Achieved Ph.D. candidacy!
Mar, 2025 Accepted research internship offer from Apple.
Jan, 2025 Our paper NNsight and NDIF: Democratizing Access to Foundation Model Internals got accepted to ICLR 2025! :tada:
Nov, 2024 Received a complimentary NeurIPS 2024 registration for my service as a reviewer.
Aug, 2024 Reviewing for ICLR 2025.
Jul, 2024 Our paper NNsight and NDIF: Democratizing Access to Foundation Model Internals is on ArXiv!
Jul, 2024 Interning at the Practical AI Alignment and Interpretability Research Group with Dr. Atticus Geiger.
Jun, 2024 Reviewing for NeurIPS 2024 (main conference and workshop proposals).
May, 2024 Invited talk at the Practical AI Alignment and Interpretability Research Group.
May, 2024 Invited talk at Computational Linguistics and Complex Social Networks at the Indian Institute of Technology Gandhinagar.
May, 2024 Attending ICLR 2024 in Vienna in person :beach_umbrella:!
Apr, 2024 Invited talk at New England NLP 2024.
Apr, 2024 Co-organizing Mechanistic Interpretability Social at ICLR 2024 with Gabriele Sarti.

selected publications

  1. ICLR
    Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking
    Prakash, Nikhil, Shaham, Tamar Rott, Haklay, Tal, Belinkov, Yonatan, and Bau, David
    In International Conference on Learning Representations (ICLR) 2024
  2. ICML
    Discovering Variable Binding Circuitry with Desiderata
    Davies, Xander, Nadeau, Max, Prakash, Nikhil, Shaham, Tamar Rott, and Bau, David
    In Challenges in Deployable Generative AI Workshop, International Conference on Machine Learning (ICML) 2023