Walden Perry
Thoughts on Walden


Why do LLMs freak out over the seahorse emoji? →

An interesting article on why LLMs think there's a seahorse emoji.

But unlike with 🐟, the seahorse emoji doesn't exist. The model tries to construct a "seahorse + emoji" vector just as it would for a real emoji, and on layer 72 we even get a very similar construction as with the fish emoji - " se", "horse", and the emoji prefix byte:

But alas, there's no continuation to ĠðŁ corresponding to a seahorse, so the lm_head similarity score calculation maxes out with horse- or sea-animal-related emoji bytes instead, and an unintended emoji is sampled.
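To make that concrete, here's a rough sketch of what that lm_head step amounts to (names and numbers here are mine, not the article's): the final hidden state is scored against every token's unembedding vector, and decoding picks from the highest-scoring tokens. If no seahorse continuation exists in the vocabulary, whichever emoji-byte token scores closest wins instead.

```python
import numpy as np

# Illustrative sketch only -- a toy vocabulary and random weights, not the
# article's code. The lm_head computes one similarity score per vocab token
# by comparing the final hidden state with each unembedding vector.

vocab = ["<horse emoji bytes>", "<tropical fish bytes>", "Ġhorse", "Ġsea"]
hidden_dim = 16
rng = np.random.default_rng(0)

W_unembed = rng.normal(size=(len(vocab), hidden_dim))  # hypothetical lm_head weights
h = rng.normal(size=hidden_dim)                        # final hidden state ("seahorse + emoji")

logits = W_unembed @ h          # one similarity score per vocab entry
best = int(np.argmax(logits))   # greedy pick; real models sample from softmax(logits)
print(vocab[best])              # with no seahorse token, a nearby emoji wins
```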

I hadn't realized before that the tokens an LLM generates are fed back into it as input. I guess until now I had naively assumed everything was conditioned on the starting prompt alone. So when the model fails to produce the seahorse emoji, it can see that the token it just generated wasn't a seahorse, and that immediately affects what it outputs next.
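That feedback loop is just the standard autoregressive decoding loop, something like this (a minimal sketch, assuming a hypothetical `model(tokens)` that returns next-token logits):

```python
# Each sampled token is appended to the context, so on the next step the
# model conditions on its own previous output -- including a wrong emoji
# it just emitted.

def generate(model, prompt_tokens, max_new_tokens=20):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = model(tokens)          # model sees prompt + everything generated so far
        next_token = int(logits.argmax())  # greedy decoding for simplicity
        tokens.append(next_token)       # fed back in on the next iteration
    return tokens
```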

I tried it myself and ChatGPT completely spiraled out of control.