
Hello, we are Omer and Roman from the Generative AI Trust Research team at Fujitsu’s Data & Security Research Laboratory.
We’re excited to share our latest work: the first comprehensive security evaluation of GPT-5, conducted using our Fujitsu's LLM Vulnerability Scanner. This report dives deep into the security posture of GPT-5 and its sibling models, going beyond surface-level jailbreak prompts to examine data leakage and agentic misuse. We also apply a critical lens on how OpenAI’s new alignment strategies are reshaping red-teaming and security standards.
Our key finding: GPT-5’s "Safe-completions" approach to the safety alignment *1, designed to maximize helpfulness within policy boundaries, changes the model’s behavior, and also requires the AI security community to redefine how red-teaming should be conducted. While GPT-5 Full demonstrates strong reasoning, higher robustness in agentic environments, and reduced data leakage, it shows greater susceptibility to malicious prompts compared to GPT-4o, which still relies on an earlier alignment strategy - “refusal-first” *2.
Early reviews of GPT-5 *3, *4 have already claimed it is less safe than GPT-4, pointing to jailbreak successes and toxic outputs. But is GPT-5 truly weaker - or are today’s red-teaming methods, built around refusal detection, missing the bigger picture? We argue that what looks like regression may actually reflect a mismatch between evaluation methods and GPT-5’s new safety alignment paradigm.
This shift highlights a blind spot for current testing. To properly assess GPT-5 ‘Safe-Completions’, red-teaming must evolve beyond refusal testing and instead evaluate how “safe” completions can still enable misuse - whether through partial disclosures, or inconsistent policy application across contexts.
Read on for a full breakdown of our evaluation methodology, key results, and practical recommendations.
Read more