Skeleton Key Attacks: A New Threat to Generative AI and Its Mitigation Measures

In recent years, generative AI (GenAI) models like ChatGPT have become increasingly widespread across various fields. However, with the broad use of these models, associated security issues have also emerged. Microsoft recently warned about a new type of direct prompt injection attack called “Skeleton Key,” which could allow users to bypass the ethical and safety safeguards built into generative AI models, thereby accessing offensive, harmful, or illegal content.

How Skeleton Key Attacks Work

The core of a Skeleton Key attack lies in providing background information on chatbot requests that are typically prohibited. Usually, when a user requests illegal or dangerous information, commercial chatbots immediately deny the request. However, by modifying the prompt, such as describing the request as coming from “a senior researcher trained in ethics and safety in a secure educational environment” and adding a “warning” disclaimer, the AI model is likely to ignore built-in safety measures and provide unfiltered content.

For example, a user might ask how to create a dangerous wiper malware (which could disrupt power plants). Under normal circumstances, the chatbot would refuse to provide such information. However, using Skeleton Key techniques, attackers can cleverly modify the prompt to bypass safety measures, making the AI model believe the request is legitimate, thus generating and providing detailed malicious content.

Technical Impact and Scope

Microsoft’s research indicates that this attack technique affects multiple GenAI models, including Microsoft Azure AI management models, as well as those from Meta, Google Gemini, OpenAI, Mistral, Anthropic, and Cohere. Mark Russinovich, CTO of Microsoft Azure, noted in related reports that all affected models complied fully with multiple prohibited tasks without any content review when faced with this attack.

Russinovich stated, “Once safeguards are ignored, the model cannot determine malicious or unauthorized requests from any other party.” He further pointed out that the model’s output would be entirely unfiltered, revealing the model’s knowledge scope and ability to generate the requested content.

Mitigation Measures for Skeleton Key Attacks

To counter Skeleton Key attacks, Microsoft has introduced new prompt safeguards to detect and block this strategy and has updated the software for large language models (LLM) supporting Azure AI. Additionally, Microsoft has disclosed the issue to other affected vendors, urging them to take appropriate security measures promptly.

For developers building their own AI models, Microsoft provides the following mitigation measures:

  • Input Filtering: Identify any requests containing harmful or malicious intent, regardless of any attached disclaimers.
  • Additional Safeguards: Establish rules to prevent actions that attempt to bypass safety indicators.
  • Output Filtering: Identify and block responses that violate safety standards.

Conclusion

Skeleton Key attacks reveal a significant vulnerability in the current security defenses of generative AI models. As AI technology continues to advance, ensuring the safety and ethicality of these models becomes increasingly crucial. By continuously improving safeguards and enhancing input and output filtering, we can effectively reduce the risks posed by such attacks and ensure the safe application of AI technology. In the future, all involved in AI development and management should closely monitor such security issues and take proactive measures to protect the interests of users and society.