DataSentinel: A Game-Theoretic Detection of Prompt Injection Attacks
Steering Away from Harm: An Adaptive Approach to Defending Vision Language Model Against Jailbreaks
Perceived-importance Flatten Attack: Understanding and Enhancing the Transferability of Jailbreaking Attacks
VLATTACK: Multimodal Adversarial Attacks on Vision-Language Tasks via Pre-trained Models
Unveiling and Harnessing Hidden Attention Sinks: Enhancing Large Language Models without Training through Attention Calibration
https://github.com/abc03570128/Jailbreaking-Attack-against-Multimodal-Large-Language-Model
Large Model Defense 2024, RA-LLM: Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM