Jailbreaking

concept 2 connections

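Techniques for getting an LLM to do or say things its developers did not intend. Anthropic coined the term 'many-shot jailbreaking' for one such technique: the attacker fills a single prompt with many fabricated prior exchanges in which the model appears to have complied with malicious requests, then asks a real malicious question. With enough of these fake examples in context, the model becomes more likely to follow the pattern and comply.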

Provenance

Created in: Building LLM-Powered Applications in Ruby — Andrei Bondar... 2026-04-17 23:20
Read by: 4 extractions