It’s often said that large language models (LLMs) along the lines of OpenAI’s ChatGPT are a black box, and there’s certainly some truth to that. Even for data scientists, it’s difficult to know why a model responds the way it does, such as inventing facts out of whole cloth.
In an effort to peel back the layers of LLMs, OpenAI is developing a tool to automatically identify which parts of an LLM are responsible for which of its behaviors. The engineers behind it emphasize that it is in an early stage, but the code to run it is available as open source on GitHub as of this morning.
“We’re trying to [develop ways to] anticipate what the problems with an AI system will be,” William Saunders, manager of the interpretability team at OpenAI, told TechCrunch in a phone interview. “We really want to be sure that we can trust what the model is doing and the response it is providing.”
To do this, OpenAI’s tool (ironically) uses a language model to figure out the capabilities of the components of other, architecturally simpler LLMs – notably OpenAI’s own GPT-2.
How? First, a brief explanation of LLMs for background. Like the brain, they are made up of “neurons” that observe certain patterns in the text to influence what the overall model “says” next. For example, given a prompt about superheroes (e.g., “Which superheroes have the most useful superpowers?”), a “Marvel superhero neuron” might increase the likelihood that the model will name specific superheroes from Marvel movies.
The OpenAI tool uses this setup to break down models into their individual parts. First, the tool runs text sequences through the model being evaluated, looking for instances where a particular neuron is frequently “activated”. Next, it “shows” these highly active neurons to GPT-4, OpenAI’s latest text-generating AI model, and lets GPT-4 generate an explanation. To determine how accurate the explanation is, the tool feeds GPT-4 text sequences and lets it predict, or simulate, how the neuron would behave. It then compares the behavior of the simulated neuron to the behavior of the actual neuron.
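The explain-simulate-compare loop can be sketched in a few lines of Python. This is a toy illustration, not OpenAI’s actual code: the neuron, the explainer and the simulator below are hypothetical stand-ins, whereas the real tool records activations from GPT-2 and calls GPT-4 for the explanation and simulation steps.

```python
# Toy sketch of the explain-simulate-score loop for a single neuron.
# All three functions below are hypothetical placeholders.

def neuron_activation(token: str) -> float:
    """Stand-in for a real GPT-2 neuron: fires on Marvel superhero names."""
    marvel = {"spider-man", "iron", "thor", "hulk"}
    return 1.0 if token.lower() in marvel else 0.0

def explain(top_tokens: list[str]) -> str:
    """Stand-in for GPT-4: summarize what the highly activating tokens share."""
    return "fires on Marvel superhero names"

def simulate(explanation: str, token: str) -> float:
    """Stand-in for GPT-4 predicting the neuron's behavior from the
    explanation alone, without seeing the real neuron."""
    marvel = {"spider-man", "iron", "thor", "hulk"}
    return 1.0 if token.lower() in marvel else 0.0

def score(tokens: list[str]) -> float:
    """Fraction of tokens where the simulated and real activations agree."""
    real = [neuron_activation(t) for t in tokens]
    top = [t for t, a in zip(tokens, real) if a > 0.5]
    explanation = explain(top)
    sim = [simulate(explanation, t) for t in tokens]
    agree = sum(1 for r, s in zip(real, sim) if abs(r - s) < 0.5)
    return agree / len(tokens)

tokens = ["Which", "superheroes", "Thor", "Hulk", "have", "powers"]
print(score(tokens))  # prints 1.0: the explanation matches perfectly here
```

In the real tool this agreement score is what separates the roughly 1,000 neurons it was confident about from the rest: most explanations only weakly predict the actual neuron’s behavior.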
“Using this methodology, we can basically find, for each individual neuron, a sort of tentative natural-language explanation for what it’s doing, and also have an assessment of how well that explanation matches actual behavior,” said Jeff Wu, who leads the scalable alignment team at OpenAI. “We use GPT-4 as part of the process to provide explanations for what a neuron is looking for and then assess how well those explanations match the reality of what it’s doing.”
The researchers were able to generate explanations for all 307,200 neurons in GPT-2, which they compiled into a dataset that was published along with the tool code.
Tools like this could one day be used to improve an LLM’s performance, the researchers say — for example, to reduce bias or toxicity. But they recognize that there is still a long way to go before it is really useful. The tool was confident in its explanations for about 1,000 of those neurons, a small fraction of the total.
A cynical person might also argue that the tool is essentially an advertisement for GPT-4, since it requires GPT-4 to function. Other LLM interpretability tools are less dependent on commercial APIs, like DeepMind’s Tracr, a compiler that translates programs into neural network models.
Wu said that’s not the case – the fact that the tool uses GPT-4 is merely “incidental” – and that, if anything, it shows GPT-4’s weaknesses in this area. He also said that it was not developed for commercial applications and could theoretically be adapted to use LLMs besides GPT-4.
“Most of the explanations perform pretty poorly or don’t explain that much of the actual neuron’s behavior,” Wu said. “A lot of the neurons, for example, are active in a way that makes it very difficult to tell what’s going on – like they’re activating on five or six different things, but there’s no discernible pattern. Sometimes there is a recognizable pattern, but GPT-4 cannot find it.”
That’s to say nothing of newer, larger and more complex models, or models that can scour the internet for information. On that second point, though, Wu believes that web browsing wouldn’t significantly change the underlying mechanics of the tool. It could simply be tweaked, he says, to figure out why neurons decide to make certain search engine queries or access certain websites.
“We hope this opens up a promising avenue to address interpretability in an automated way for others to build on and contribute to,” Wu said. “The hope is that we have really good explanations for not just what neurons respond to, but the overall behavior of these models – what types of circuits they compute and how certain neurons affect other neurons.”