Matt Fin

Hello! My name is Matthew Finlayson. I am a PhD student at USC, advised by Swabha Swayamdipta and Xiang Ren. Previously, I was a predoctoral researcher at AI2, and before that I studied computer science and linguistics at Harvard. My current research focuses on improving language modeling, sampling, and interpretability methods by building and exploiting our theoretical understanding of neural language models. My research is supported by an NSF graduate research fellowship.

You can reach me at mattbnfin at gmail dot com.

News

Apr. 2025	Awarded an NSF graduate research fellowship.
Jan. 2025	Visiting the Simons Institute at UC Berkeley until May.
Dec. 2024	Gave a tutorial on decoding methods at NeurIPS.
Oct. 2024	Decoding survey paper accepted to TMLR.
Jul. 2024	Paper accepted to COLM.
Jun. 2024	Interned at Meta GenAI.
Apr. 2024	Spoke at FAIR and USC ISI on finding ChatGPT's embed size.
Jan. 2024	Spoke at CMU LTI on decoding and the softmax bottleneck.
Jan. 2024	Paper accepted to ICLR.
Oct. 2023	Paper accepted to EMNLP.
Aug. 2023	Joined USC as a PhD student in NLP.
Mar. 2023	Selected for NSF GRFP honorable mention.
Feb. 2023	Spoke at IST/Unbabel on math reasoning evaluation.
Jan. 2023	Decomposed Prompting accepted to ICLR.
Nov. 2022	Spoke at FLaNN on formal languages and in-context learning.
Oct. 2022	Two papers accepted to EMNLP.
Aug. 2021	Joined AI2 as a pre-doctoral researcher.

Posts

Software

I made a modified version of the font scientifica with small caps instead of bold. I call it Religica.

I am a contributor to Open Logprobs, a library for obtaining logprobs from API-protected language models.

SS.py is my personal command line tool for searching and citing academic papers via Semantic Scholar.

Selected preprints & publications

These are papers that best represent my interests. For a full list, see my Google Scholar.

Better Language Model Inversion by Compactly Representing Next-Token Distributions

Murtaza Nazir, Matthew Finlayson, John X. Morris, Xiang Ren, Swabha Swayamdipta
- Paper
- Blog post
- Model
- Code
Abstract

Language model inversion seeks to recover hidden prompts using only language model outputs. This capability has implications for security and accountability in language model deployments, such as leaking private information from an API-protected language model's system message. We propose a new method, prompt inversion from logprob sequences (PILS), that recovers hidden prompts by gleaning clues from the model's next-token probabilities over the course of multiple generation steps. Our method is enabled by a key insight: the vector-valued outputs of a language model occupy a low-dimensional subspace. This enables us to losslessly compress the full next-token probability distribution over multiple generation steps using a linear map, allowing more output information to be used for inversion. Our approach yields massive gains over previous state-of-the-art methods for recovering hidden prompts, achieving 2–3.5 times higher exact recovery rates across test sets, in one case increasing the recovery rate from 17% to 60%. Our method also exhibits surprisingly good generalization behavior; for instance, an inverter trained on 16 generation steps gets 5–27 points higher prompt recovery when we increase the number of steps to 32 at test time. Furthermore, we demonstrate strong performance of our method on the more challenging task of recovering hidden system messages. We also analyze the role of verbatim repetition in prompt recovery and propose a new method for cross-family model transfer for logit-based inverters. Our findings show that next-token probabilities are a considerably more vulnerable attack surface for inversion attacks than previously known.
Teaching Models to Understand (but not Generate) High-risk Data

Ryan Wang, Matthew Finlayson, Luca Soldaini, Swabha Swayamdipta, Robin Jia
- Paper
Abstract

Language model developers typically filter out high-risk content—such as toxic or copyrighted text—from their pre-training data to prevent models from generating similar outputs. However, removing such data altogether limits models' ability to recognize and appropriately respond to harmful or sensitive content. In this paper, we introduce Selective Loss to Understand but Not Generate (SLUNG), a pre-training paradigm through which models learn to understand high-risk data without learning to generate it. Instead of uniformly applying the next-token prediction loss, SLUNG selectively avoids incentivizing the generation of high-risk tokens while ensuring they remain within the model's context window. As the model learns to predict low-risk tokens that follow high-risk ones, it is forced to understand the high-risk content. Through our experiments, we show that SLUNG consistently improves models' understanding of high-risk data (e.g., ability to recognize toxic content) without increasing its generation (e.g., toxicity of model responses). Overall, our SLUNG paradigm enables models to benefit from high-risk text that would otherwise be filtered out.
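As a rough illustration of the selective-loss idea in the abstract above, here is a minimal sketch that drops the loss term for high-risk target tokens while leaving them in the sequence. It is not the paper's implementation: the function name `selective_nll`, the per-token risk labels, and the choice of plain masking are all assumptions made for this example.

```python
import math

def selective_nll(token_probs, high_risk):
    """Average negative log-likelihood over low-risk target tokens only.

    token_probs[i]: probability the model assigned to the i-th target token.
    high_risk[i]:   True if the i-th target token is labeled high-risk.

    High-risk tokens contribute no loss term here (an assumed masking rule),
    but in a real model they would still be part of the context that later
    low-risk predictions are conditioned on.
    """
    kept = [-math.log(p) for p, risky in zip(token_probs, high_risk) if not risky]
    if not kept:
        return 0.0
    return sum(kept) / len(kept)

# Hypothetical four-token sequence whose middle two tokens are high-risk.
probs = [0.5, 0.01, 0.02, 0.25]
risk = [False, True, True, False]
loss = selective_nll(probs, risk)
```

Only the first and last tokens contribute to `loss`, so a model trained this way would receive no gradient signal rewarding it for producing the two high-risk tokens.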
From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models

Sean Welleck, Amanda Bertsch, Matthew Finlayson, Hailey Schoelkopf, Alex Xie, Graham Neubig, Ilia Kulikov, and Zaid Harchaoui
- Paper
Abstract

One of the most striking findings in modern research on large language models (LLMs) is that scaling up compute during training leads to better results. However, less attention has been given to the benefits of scaling compute during inference. This survey focuses on these inference-time approaches. We explore three areas under a unified mathematical formalism: token-level generation algorithms, meta-generation algorithms, and efficient generation. Token-level generation algorithms, often called decoding algorithms, operate by sampling a single token at a time or constructing a token-level search space and then selecting an output. These methods typically assume access to a language model's logits, next-token distributions, or probability scores. Meta-generation algorithms work on partial or full sequences, incorporating domain knowledge, enabling backtracking, and integrating external information. Efficient generation methods aim to reduce token costs and improve the speed of generation. Our survey unifies perspectives from three research communities: traditional natural language processing, modern LLMs, and machine learning systems.
Logits of API-Protected LLMs Leak Proprietary Information

Matthew Finlayson, Xiang Ren, and Swabha Swayamdipta
- Paper
- Slides
- Video
Abstract

The commercialization of large language models (LLMs) has led to the common practice of high-level API-only access to proprietary models. In this work, we show that even with a conservative assumption about the model architecture, it is possible to learn a surprisingly large amount of non-public information about an API-protected LLM from a relatively small number of API queries (e.g., costing under $1,000 for OpenAI’s gpt-3.5-turbo). Our findings are centered on one key observation: most modern LLMs suffer from a softmax bottleneck, which restricts the model outputs to a linear subspace of the full output space. We show that this lends itself to a model image or a model signature which unlocks several capabilities with affordable cost: efficiently discovering the LLM’s hidden size, obtaining full-vocabulary outputs, detecting and disambiguating different model updates, identifying the source LLM given a single full LLM output, and even estimating the output layer parameters. Our empirical investigations show the effectiveness of our methods, which allow us to estimate the embedding size of OpenAI’s gpt-3.5-turbo to be about 4,096. Lastly, we discuss ways that LLM providers can guard against these attacks, as well as how these capabilities can be viewed as a feature (rather than a bug) by allowing for greater transparency and accountability.
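The key observation in the abstract above, that outputs are confined to a low-dimensional linear subspace, can be illustrated with a toy model. The sketch below is not the paper's procedure: it assumes exact, noise-free raw logits and a made-up output matrix `W` with vocabulary size 6 and hidden size 2, whereas a real API returns noisy probabilities and would call for a numerical rank estimate.

```python
from fractions import Fraction

def matrix_rank(rows):
    """Rank of a matrix via exact Gaussian elimination over rationals."""
    m = [[Fraction(x) for x in row] for row in rows]
    rank = 0
    n_cols = len(m[0]) if m else 0
    for col in range(n_cols):
        # Find a row at or below the current rank with a nonzero entry here.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

# Hypothetical output layer: 6 vocabulary items, hidden size 2.
W = [[1, 0], [0, 1], [1, 1], [2, -1], [3, 2], [-1, 4]]
# Hypothetical final hidden states for five different queries.
hidden_states = [[1, 2], [3, -1], [0, 5], [2, 2], [-4, 1]]

def logits(h):
    """Logits are a linear function of the hidden state: W @ h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]

outputs = [logits(h) for h in hidden_states]
estimated_hidden_size = matrix_rank(outputs)
```

Although each output has six entries, the collected outputs span only a two-dimensional subspace, so the rank of the stacked outputs reveals the toy model's hidden size.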
Closing the Curious Case of Neural Text Degeneration

Matthew Finlayson, John Hewitt, Alexander Koller, Swabha Swayamdipta, and Ashish Sabharwal
- Paper
- Slides
- Code
Abstract

Despite their ubiquity in language generation, it remains unknown why truncation sampling heuristics like nucleus sampling are so effective. We provide a theoretical explanation for the effectiveness of truncation sampling by proving that truncation methods that discard tokens below some probability threshold (the most common type of truncation) can guarantee that all sampled tokens have nonzero true probability. However, thresholds are a coarse heuristic, and necessarily discard some tokens with nonzero true probability as well. In pursuit of a more precise sampling strategy, we show that we can leverage a known source of model errors, the softmax bottleneck, to prove that certain tokens have nonzero true probability, without relying on a threshold. Based on our findings, we develop an experimental truncation strategy and present pilot studies demonstrating the promise of this type of algorithm. Our evaluations show that our method outperforms its threshold-based counterparts under automatic and human evaluation metrics for low-entropy (i.e., close to greedy) open-ended text generation. Our theoretical findings and pilot experiments both provide insight into why truncation sampling works and make progress toward more expressive sampling algorithms that better surface the generative capabilities of large language models.
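For readers unfamiliar with threshold-based truncation, the baseline family discussed in the abstract above, here is a minimal sketch. It shows only the generic threshold heuristic, not the paper's softmax-bottleneck-based method; the function name, the toy distribution, and the fallback to the most probable token are assumptions for illustration.

```python
def truncate_by_threshold(probs, threshold):
    """Drop tokens with probability below `threshold`, then renormalize.

    probs: dict mapping token -> model probability (assumed to sum to 1).
    If every token falls below the threshold, keep only the most probable
    token so that sampling is still possible (an assumed fallback).
    """
    kept = {tok: p for tok, p in probs.items() if p >= threshold}
    if not kept:
        best = max(probs, key=probs.get)
        kept = {best: probs[best]}
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}

# Hypothetical next-token distribution with one implausible low-probability token.
dist = {"the": 0.5, "a": 0.3, "cat": 0.15, "zxq": 0.05}
truncated = truncate_by_threshold(dist, 0.1)
```

The low-probability token is removed and the remaining mass is rescaled to sum to one; a sampler would then draw from `truncated` instead of `dist`.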
Attentiveness to Answer Choices Doesn't Always Entail High QA Accuracy

Sarah Wiegreffe, Matthew Finlayson, Oyvind Tafjord, Peter Clark, and Ashish Sabharwal
- Paper
- Code
Abstract

When pretrained language models (LMs) are applied to discriminative tasks such as multiple-choice questions, they place probability mass on vocabulary tokens that aren’t among the given answer choices. Spreading probability mass across multiple surface forms with identical meaning (such as “bath” and “bathtub”) is thought to cause an underestimation of a model’s true performance, referred to as the “surface form competition” (SFC) hypothesis. This has motivated the introduction of various probability normalization methods. However, many core questions remain unanswered. How do we measure SFC? Are there direct ways of reducing it, and does doing so improve task performance? We propose a mathematical formalism for SFC which allows us to quantify and bound its impact for the first time. We identify a simple method for reducing it – namely, increasing probability mass on the given answer choices by a) including them in the prompt and b) using in-context learning with even just one example. We show this method eliminates the impact of SFC in the majority of instances. Our experiments on three diverse datasets and six LMs reveal several additional surprising findings. For example, both normalization and prompting methods for reducing SFC can be ineffective or even detrimental to task performance for some LMs. We conclude with practical insights for effectively prompting LMs for multiple-choice tasks.
What Makes Instruction Learning Hard? An Investigation and a New Challenge in a Synthetic Environment

Matthew Finlayson, Kyle Richardson, Ashish Sabharwal, and Peter Clark
- Paper
- Slides
- Video
- Code
Abstract

The instruction learning paradigm—where a model learns to perform new tasks from task descriptions alone—has become popular in research on general-purpose models. The capabilities of large transformer models as instruction learners, however, remain poorly understood. We use a controlled synthetic environment to characterize such capabilities. Specifically, we use the task of deciding whether a given string matches a regular expression (viewed as an instruction) to identify properties of tasks, instructions, and instances that make instruction learning challenging. For instance, we find that our model, a fine-tuned T5-based text2text transformer, struggles with large regular languages, suggesting that less precise instructions are challenging for models. Instruction executions that require tracking longer contexts of prior steps are also difficult. We use our findings to systematically construct a challenging instruction learning dataset, which we call Hard RegSet. Fine-tuning on Hard RegSet, our large transformer learns to correctly interpret (with at least 90% accuracy) only 65.6% of test instructions, and 11%-24% of the instructions in out-of-distribution generalization settings. We thus propose Hard RegSet as a challenging instruction learning dataset, and a controlled environment for studying instruction learning.
Causal Analysis of Syntactic Agreement Mechanisms in Neural Language Models

{Matthew Finlayson, Aaron Mueller,} Sebastian Gehrmann, Stuart Shieber, Tal Linzen, and Yonatan Belinkov
- Paper
- Code
Abstract

Targeted syntactic evaluations have demonstrated the ability of language models to perform subject-verb agreement given difficult contexts. To elucidate the mechanisms by which the models accomplish this behavior, this study applies causal mediation analysis to pre-trained neural language models. We investigate the magnitude of models’ preferences for grammatical inflections, as well as whether neurons process subject-verb agreement similarly across sentences with different syntactic structures. We uncover similarities and differences across architectures and model sizes—notably, that larger models do not necessarily learn stronger preferences. We also observe two distinct mechanisms for producing subject-verb agreement depending on the syntactic structure of the input sentence. Finally, we find that language models rely on similar sets of neurons when given sentences with similar syntactic structure.
