Jailbreaking

concept, 2 connections

Techniques for getting an LLM to do or say things its developers did not intend. Anthropic coined the term 'many-shot jailbreaking' for one such technique: the prompt lists many fabricated prior exchanges in which the model appears to have complied with malicious requests, then ends with a real malicious question. Primed by the fake examples, the model tends to comply with the final request as well.

category
practice
about
Jailbreaking concept
Covers jailbreaking techniques including many-shot jailbreaking.
about
Jailbreaking concept
Paper coining and characterizing many-shot jailbreaking.

Provenance

Read by
4 extractions