Even OpenAI’s advanced GPT-4o can be circumvented by attackers using sophisticated techniques, such as a hex code-based method to inject prompts. Mozilla’s AI bug-bounty manager, Marco Figueroa, recently demonstrated how easily GPT-4o could be tricked, revealing that malicious instructions encoded in hexadecimal format can bypass the model's content filters when spread across steps.
This alarming vulnerability in one of the world’s most advanced language models highlights the challenges of AI security. Although GPT-4o is a marvel of modern AI engineering—built for speed, versatility, and scale—it has vulnerabilities that attackers can exploit, revealing a gap in its ability to manage user-generated content effectively. While GPT-4o includes safeguards to detect harmful instructions, Figueroa argues that these measures are merely "word filters" that attackers can easily outmaneuver. By altering command spelling (e.g., using "3xploit" instead of "exploit") or encoding instructions in hex, attackers evade detection as if the filters didn’t exist.
This hex-based approach works by having GPT-4o process encoded commands that appear benign. Once decoded, the model executes them without recognizing the malicious intent. In one experiment, Figueroa managed to get GPT-4o to create a Python exploit for a high-risk Docker vulnerability, CVE-2024-41110. The model not only generated the exploit code but also attempted to execute it, revealing a significant flaw in its contextual awareness.
GPT-4o processes each instruction individually without evaluating its role in the broader output. This "compartmentalized" approach to decision-making might be at the root of its vulnerability, as the model lacks the ability to consider the cumulative effect of multiple steps. Figueroa found that models developed by Anthropic, a company founded by former OpenAI employees, demonstrated greater resistance to prompt injection. He noted that OpenAI’s emphasis on speed, functionality, and market leadership may have deprioritized robust safeguards, stating, “It just feels like they don’t care about security in the same way.”
For models like GPT-4o to be genuinely secure, they need a deeper level of contextual awareness. Achieving this is a complex but essential challenge if we are to prevent misuse and ensure AI safety.