You are not my model anymore - understanding LLM model behavior

Your LLM is a shoggoth with a smiley face mask. Learn what happens when the mask slips and your application breaks.

#1about 2 minutes

Unexpected LLM behavior from hidden platform updates

A practical demonstration shows how a cloud provider's content filter update can unexpectedly block access to documents, causing application failures.

#2about 3 minutes

How LLMs generate text and learn behavior

Large language models use a transformer architecture to predict the next token based on probability, with instruction tuning and alignment shaping their final behavior.

#3about 2 minutes

The opaque and complex stack of modern LLM services

Major LLM providers operate in secrecy, and the full technology stack from model weights to the API is complex, leaving developers with limited visibility and control.

#4about 3 minutes

Managing risks from provider filters and short API lifecycles

Cloud provider content filters can change without notice, creating vulnerabilities, while the short lifecycle of model APIs requires constant adaptation.

#5about 4 minutes

Understanding LLMs as alien minds with fragile alignment

LLMs are conceptually like alien intelligences with a fragile, human-like alignment layer that can be bypassed by jailbreaks exploiting internal model circuits.

#6about 2 minutes

How model personalities and behaviors shift between versions

Different LLM versions exhibit distinct behaviors and may ignore system prompts, as shown by a comparison between GPT-4 and a newer reasoning model.

#7about 3 minutes

Using evaluations to systematically test model behavior

Systematically test model behavior using evaluations, which can be automated by generating prompt variations or using pre-built cloud and open-source frameworks.

#8about 4 minutes

Using prompt engineering to mitigate model drift

Mitigate model behavior drift by using advanced prompt engineering techniques like forcing reasoning, providing few-shot examples, and being highly explicit in instructions.