Schedule (Friday, October 10th @ Room 518C)

09:00 am: Opening remarks

09:10 am: Invited talks

Sarah Wiegreffe, Assistant Professor, Department of Computer Science, University of Maryland

Interpretability as the Inverse Machine Learning Pipeline

John Hewitt, Assistant Professor of Computer Science, Columbia University

Interplay research is alignment research with a big bet

10:20 am: Workshop paper talks, with coffee break at 11:05 am

Localizing Persona Representations in LLMs (Celia Cintas, Miriam Rateike, Erik Miehling, Elizabeth M. Daly, Skyler Speakman)

How Post-Training Reshapes LLMs: A Mechanistic View on Knowledge, Truthfulness, Refusal, and Confidence (Hongzhe Du, Weikai Li, Min Cai, Karim Saraipour, Zimin Zhang, Yizhou Sun, Himabindu Lakkaraju, Shichang Zhang)

Measuring Chain of Thought Faithfulness by Unlearning Reasoning Steps (Martin Tutek, Fateme Hashemi Chaleshtori, Ana Marasovic, Yonatan Belinkov)

12:00 pm: Organized lunch

01:00 pm: Invited talks

Kyle Mahowald, Assistant Professor of Linguistics, University of Texas at Austin

The INTERPLAY Between Verbal Representations and Verbal Behavior

Aaron Mueller, Assistant Professor of Computer Science and Data Science, Boston University

Building a More Predictive Science of Language Model Behaviors with Interpretability

02:10 pm: Poster session

03:20 pm: Round table discussion and coffee break

04:50 pm: Closing remarks

05:00 pm: Workshop social (TBD)

First Workshop on the Interplay of Model Behavior and Model Internals