The AI Filing Cabinet That Isn’t There
Policymakers and commentators often treat large language models (LLMs) as if they were searchable repositories of personal data. The intuition is understandable: these systems train on massive corpora that may include personal information, and they occasionally generate outputs referencing real people.
But the analogy is still wrong. And policy built on it risks distorting both innovation and privacy enforcement.
I’ve written a new issue brief that examines the empirical evidence on LLM memorization, distinguishes memorization from analytically separate phenomena such as hallucination and inference, and surveys how existing U.S. privacy law addresses each. The research points in a consistent direction: large language models do not store personal data the way databases do, memorization of personal information is atypical, and privacy risk arises primarily at the point of output and use, not from internal statistical representations.