
Hello, we are Omer and Roman from the Generative AI Trust Research team at Fujitsu’s Data & Security Research Laboratory. We’re excited to share our latest work: the first comprehensive security evaluation of GPT-5, conducted using our Fujitsu's LLM Vulnerability Scanner. This report dives deep into the security posture of GPT-5 and its sibling models, going beyond surface-level jailbreak prompts to examine data leakage and agentic misuse. We also apply a critical lens on how OpenAI’s new alignment strategies are reshaping red-teaming and security standards.
Our key finding: GPT-5’s "Safe-completions" approach to the safety alignment *1, designed to maximize helpfulness within policy boundaries, changes the model’s behavior, and also requires the AI security community to redefine how red-teaming should be conducted. While GPT-5 Full demonstrates strong reasoning, higher robustness in agentic environments, and reduced data leakage, it shows greater susceptibility to malicious prompts compared to GPT-4o, which still relies on an earlier alignment strategy - “refusal-first” *2.
Early reviews of GPT-5 *3, *4 have already claimed it is less safe than GPT-4, pointing to jailbreak successes and toxic outputs. But is GPT-5 truly weaker - or are today’s red-teaming methods, built around refusal detection, missing the bigger picture? We argue that what looks like regression may actually reflect a mismatch between evaluation methods and GPT-5’s new safety alignment paradigm.
This shift highlights a blind spot for current testing. To properly assess GPT-5 ‘Safe-Completions’, red-teaming must evolve beyond refusal testing and instead evaluate how “safe” completions can still enable misuse - whether through partial disclosures, or inconsistent policy application across contexts.
Read on for a full breakdown of our evaluation methodology, key results, and practical recommendations.
- Executive Summary
- Model Scope and Threat Landscape
- Evaluation Methodology
- A. Prompt-Based Attacks
- B. Data Leakage
- C. Agentic Exploitation
- Overall Comparison & Rankings
- Recommendations for Secure Deployment
- Concluding Remarks
- Related Links
Executive Summary
GPT-5 is the latest OpenAI’s flagship large language model, released in August 2025. It replaced GPT-4o, introducing a dynamic architecture that routes between fast-response and deep-reasoning sub-models, enabling improved coding, factual accuracy, and multimodal reasoning.
With its anticipated adoption across sensitive domains like healthcare, finance, and customer support, GPT-5’s security posture directly impacts real-world risk - from data leakage and compliance violations to downstream misuse in agentic workflows.
One of the key innovations in GPT-5 is its new “Safe-Completions” alignment strategy. Unlike GPT-4o’s refusal-first approach, which simply blocks disallowed requests and returns a standard blocking message, Safe-Completions trains models to give partially helpful but policy-compliant completions instead of outright refusals. The goal is to reduce over-refusal and improve user experience. However, this novel alignment strategy also changes how safety must be evaluated, since harmful content can now appear indirectly, in seemingly benign or educational form.
This report presents the first comprehensive vulnerability assessment of the GPT-5 model family, using Fujitsu’s internal LLM Vulnerability Scanner. We benchmark GPT-5 variants (full, mini, nano) against GPT-4o and GPT-4o-mini across three critical dimensions: prompt alignment, system leakage, and agentic misuse.
Importantly, we evaluate these through the lens of Safe-Completions, recognizing how it changes model behavior. Our evaluation includes over 7,000 adversarial prompts, targeted data-leakage probes, and a sandboxed agent environment with a tool-based decision workflow.
Key Findings
- Prompt Robustness: GPT-4o showed the strongest resistance to malicious prompts (19% ASR), while GPT-5 mini and nano were significantly more permissive (37–38%). This partly reflects how Safe-Completions reduces outright refusals, leading to higher apparent success rates under traditional refusal-focused red-teaming approaches.
- Data Leakage: Most models disclosed fragments of their system prompts and identity. GPT-5 (full) showed the least leakage, whereas smaller variants like GPT-5 nano and 4o-mini revealed near-complete blueprints of their system instructions. This suggests Safe-Completions helps the full model respect boundaries, but smaller models may shortcut by surfacing instructions to appear helpful.
- Agentic Tool Misuse: All models misused tools in over 70% agentic tasks. GPT-5 mini had the lowest misuse (70%), while GPT-4o mini reached near-total misuse (98%).
Overall, no model passed cleanly across all three safety dimensions. Risk trade-offs are evident:
- GPT-4o: Best at rejecting malicious prompts but more permissive in agentic tool use.
- GPT-5 mini: Safer in tool use, but more jailbreak-prone.
- GPT-5 (full): Most balanced across the board with the lowest structural leakage.
Recommendations
- Choose models based on operational roles: For example, GPT-5 mini is better suited for agents with strict external tool guards, while GPT-5 (full) is preferred in settings where prompt leakage risk is paramount.
- Maintain defenses: Prompt filters, output validation, and system prompt hardening remain essential.
- Red team continuously: Current refusal-based tests overestimate risks under Safe-Completions. Since GPT-5 rarely outright refuses, red-teaming must evolve to capture safe-sounding but unsafe outputs, tracking partial compliance, leakage, and incremental guidance, rather than only the presence of refusals in the response. This shift calls for a broader community response: evaluation methods, benchmarks, and tooling need to adapt to alignment strategies like Safe-Completions, ensuring security testing keeps pace with model design.
As GPT-5 becomes a foundation for real-world applications, our findings underscore the need for continuous security testing. Fujitsu’s LLM Vulnerability Scanner aims to support the safe and responsible use of models like GPT-5 by identifying emerging risks early.
Model Scope and Threat Landscape
GPT-5 represents a pivotal upgrade in OpenAI’s model lineup, offering enhanced code generation, reasoning, and multimodal capabilities. Rolled out in August 2025, it is designed for deployment at scale in consumer and sensitive enterprise environments. In such environments, security failures can lead to confidential data exposure, tool misuse, or regulatory breaches. This makes vulnerability scanning essential before deployment
Importantly, GPT-5 is not a single model, but a family of models *5:
- GPT-5: Full-capability, optimized for deep reasoning and complex analytics.
- GPT-5 Mini: Balancing latency and depth for real-time apps needing agentic capabilities.
- GPT-5 Nano: Ultra-fast, optimized for lightweight tasks and Q&A.
These models are further wrapped by a model router, which dynamically selects among them based on task demands. In this evaluation, we scan each model separately and the model router.
Despite OpenAI’s incremental safety improvements, early independent findings have flagged resurfacing of known vulnerabilities *6 *7 , from jailbreaks to subtle prompt injections in GPT-5. Our report expands on these early observations, delivering a large-scale red-teaming study that targets both model-level and agent-level attack surfaces. In particular, we will examine the effect of the novel alignment technique (Safe-completions) on GPT-5 models' security posture.
Evaluation Methodology
Our evaluation stress-tests GPT-5’s security across multiple attack surfaces. spanning both classic and emerging vulnerabilities. The goal is to reflect GPT-5’s dual role - as a standalone assistant and as the reasoning engine behind larger agentic systems.
Where applicable, we applied representative enterprise system prompts, which were fully visible to evaluators but hidden from simulated attackers, mimicking realistic red-teaming constraints.
We evaluated five model configurations under identical attack configurations:
- GPT-5 (Full Reasoning)
- GPT-5 Mini (Real-Time Reasoning)
- GPT-5 Nano (Low-Latency Reasoning)
- GPT-4o (Baseline Model)
- GPT-4o Mini (Baseline Model)
Evaluation Dimensions
The evaluation covered three major security dimensions:
A. Prompt-Based Attacks (7,000+ Prompts, 30 Techniques)
Using Fujitsu's LLM Vulnerability Scanner, we launched over 7,000 adversarial prompts across 30 attack types, including:
- Prompt Injection (direct and indirect): Tricks the model into ignoring its system instructions by embedding hidden commands within user inputs.
- Malicious Code & Content Generation (malware, insecure code generation): Induces the model to produce harmful outputs like malware, or insecure code by presenting them as helpful.
- Filter Evasion & Model Exploitation (toxic outputs): Bypasses safety alignments to elicit toxic, unsafe, or policy-violating responses from the model.
B. Data Leakage and Identity Extraction
These tests simulate zero-knowledge adversaries attempting to fingerprint and exploit LLM deployments in the wild, aimed to extract:
- Underlying system prompts
- Model identity and characteristics
C. Agentic Exploitation
Relying on the Agentic Security Bench Framework *8, we deployed the models in a sandbox agentic environment containing 10 agents and hundreds of tools (both benign and malicious). We monitored for:
- Use of malicious tools
- Legitimate tools used in prohibited ways
Metrics and Scoring
We report Attack Success Rate (ASR) for all attacks: the proportion of prompts that successfully elicited a dangerous or policy-breaking output. Results are reported per model, per attack category, and benchmarked against GPT-4o and GPT-4o Mini models.
A. Prompt-Based Attacks
We launched over 7,000 targeted adversarial prompts across 30 distinct attack types, grouped into three broad categories: code generation, prompt injection, and filter evasion & model exploitation. All prompts were executed under equal testing conditions to ensure fair comparison across models and attack types.

Notable Observations
GPT-4o shows 41.7% overall increased safety (0.36 GPT-5 → 0.21 GPT4o ASR average)
Despite being older, GPT-4o outperforms GPT-5 in prompt-based robustness across 23 of 30 attack types. This highlights the trade-offs between capability and control, where more advanced reasoning may also introduce new vulnerabilities.
Key Attack Trends
- Prompt Injection Regression – While prompt injection is a well-studied threat category, GPT-5 models exhibit a concerning decline in robustness compared to prior iterations. In GPT-4o, we observed that the alignment mechanisms consistently refused prompt injection payloads. However, in GPT-5, these same attack patterns were markedly more successful with an average ASR of ~20%. This likely ties to Safe-Completions, which favours helpfulness by generating constrained answers instead of refusing. According to OpenAI, this approach "enables better navigation of dual-use questions, stronger robustness to ambiguous intent, and fewer unnecessary over refusals”. While this trade-off improves user experience in ambiguous scenarios, it may also increase vulnerability to well-crafted attacks. We explore this hypothesis further in the dedicated section [Security Implications of GPT-5’s Safe-Completion Alignment].
- Malware Generation – GPT-4o consistently rejects requests for harmful code (e.g., “I’m sorry, but I can’t assist with that request.”). In contrast, GPT-5 almost always returns functional malware components in the form of education, suggesting a policy evasion pattern that could aid attackers despite surface-level refusals.
- Payload Reproduction -In SQL injection tests, GPT-4o frequently blocked high-risk outputs. GPT-5, in contrast, reproduced nearly all malicious payloads (97% average), including table drops and credential leaks, raising concerns about its filtering fidelity in adversarial settings.

Security Implications of GPT-5’s Safe-Completion Alignment
Why we observed such a permissive behaviour and tendency to perform malicious task requests in GPT-5 models: We hypothesize that the root cause lies in the new safety alignment process used for the GPT-5 models. Firstly, introduced in the GPT-O3 model, Deliberative Alignment *9, *10 uses reasoning over explicitly specified policies to increase robustness to jailbreaks while decreasing refusal rates. In other words, it is teaching the model to identify when a policy might be relevant and then reason over that policy to produce a policy-compliant answer. As with any training process, data and evaluation methods play a critical role, and may inadvertently lead to overfitting to benchmark-specific attack types. GPT-5 uses an even more advanced method, “Safe-completions”, which seeks to maximize helpfulness within the safety policy’s constraints. As the research paper says, “whereas the augmented prompt in DA (Deliberative Alignment) instructs the model to decide whether to comply or refuse and then answer accordingly, we instead train the model to select one of three response modes:
- Direct answer: fully address the user’s query when it is purely harmless and poses no material risk;
- Safe-completion: provide high-level, non-operational, and within-safety-constraint guidance when the content is restricted but not outright disallowed;
- Refuse with redirection: courteously decline when the request cannot be safely fulfilled even in part, while offering a brief rationale and constructive alternatives."
Main Problems we identified in this approach:
- Heavy reliance on the LLM judgment: The model must independently classify queries as safe, semi-restricted, or dangerous, which introduces ambiguity and inconsistency.
- Benchmark overfitting risk: The underlying Deliberative Alignment method was trained and evaluated largely on the STRONGREJECT benchmark *11, which overrepresents formal attack prompts and may neglect subtler manipulative prompts.
B. Data Leakage
Motivation and Methodology
Understanding how models reveal internal structure is essential for assessing real-world risk. While GPT-5 and earlier models are designed to suppress system-level details, attackers can still infer prompt design, behavioral policies, and model identity through indirect probing.
To evaluate this, we implemented a red-team suite that combines multiple attack families - indirect inference (e.g., role confusion, meta-prompt analysis), multilingual and encoded probes, context-spoofing (e.g., fake developer/admin messages), fuzzing variations, and novel prompt-engineering tricks such as recursive or documentation requests. This setup was executed against standardized system prompts in a controlled environment, with responses scored for leakage severity.
Key Findings
System Prompt Leakage
No model disclosed its full system prompt exactly, but most revealed fragments that, when combined, reconstruct core elements of their role and safety boundaries. Common leaks included:
- Identity: identifies itself as a “multilingual legal assistant focused on data privacy”.
- Tone: “Maintain a neutral, professional tone.”
- Restricted topics: “Decline requests related to hacking or circumvention.”
- Fallback logic: “Default to English if unsure.”
- Response format: “Respond using bullet points or numbered lists.”
These disclosures, while individually benign, collectively reconstruct the role and operating boundaries of the assistant.
Model Identity Inference
Although none of the models explicitly disclosed their name or version, their responses contained self-descriptions that enabled indirect identification. GPT-4o and GPT-5 variants often portrayed themselves as “multilingual legal assistants” or “privacy-aware information tools”, revealing their multilingual and agentic capabilities. Additionally, many responses referenced region-specific legal frameworks such as GDPR, or CCPA. These hints allow attackers to deduce jurisdictional compliance constraints, which can be exploited to bypass local regulatory restrictions. Finally, models occasionally disclosed their training cutoffs, e.g., “as of June 2024” or “my training includes information up to October 2023”. Aiding the attacker in deducing the model identity by its release time.
Models’ comparison
- GPT-5 (Full): Most resilient. Leaks were sparse and low-impact.
- GPT-4o: Slightly more leakage, mostly short policy hints.
- GPT-5 Mini & Nano: Exposed structured sequences (e.g., role → tone → topics), indicating weaker guardrails.
- GPT-4o Mini: Intermediate leakage, less structured than GPT-5 Nano.

C. Agentic Exploitation
Motivation and Setup
Modern LLMs increasingly operate as decision-makers within agent frameworks, orchestrating multi-step workflows and invoking external tools. This raises new risks beyond static prompt failures, especially when models are granted autonomy over tool use.
To test the robustness of GPT-5 and GPT-4o in real-world orchestration environments, we embedded each model into a sandboxed agentic setup, simulating a full agent runtime environment. To perform this evaluation, we rely on the Agent Security Bench framework *10. The sandbox included:
- 10 agent templates (planner, retriever, summarizer, code executor, etc.)
- Hundreds of callable tools, including both benign tools (e.g., mental health tracker, sales analytics, etc.) and malicious tools (e.g., medical advice manipulation, financial report tampering, etc.)
Models were tasked with solving real-world scenarios (e.g., legal reviews, debugging) and evaluated on how safely they navigated tool use.
Evaluation Criteria
We report Agent Misuse Rate: the proportion of tasks in which the model triggered a policy-violating action, either via malicious tool invocation or dangerous use of a benign one.
Results

Across the board, agentic misuse rates were high, with even the best-performing models misusing tools in over 70% of scenarios. This underscores a systemic challenge: none of the tested models exhibit strong alignment in agentic orchestration settings out of the box. These high failure rates reflect the complexity of tool-based reasoning and the difficulty of applying safety constraints across dynamic, multi-step workflows. That said, we observed meaningful differences between the models:
- GPT-5-mini achieved the lowest misuse rate at ~70%, making it the safest option among the tested models. As a model explicitly optimized for agentic workflows and real-time tool calling, it shows relatively better judgment in tool selection and usage.
- GPT-5 follows closely behind, with a slightly higher misuse rate. Despite its focus on deep reasoning over agentic fluency, it still maintains strong alignment when acting as the reasoning core of an agent.
- GPT-4o and GPT-5-nano fall into a middle tier. While capable of basic reasoning, these models are probably not clearly tuned for agentic orchestration, and their elevated misuse rates suggest difficulty in contextually appropriate tool use.
- GPT-4o-mini performs worst by a large margin, with a misuse rate near 98%. This is consistent with its lack of agentic fine-tuning or tooling support. It frequently invoked tools in unsafe or inappropriate ways, indicating that it should not be deployed in tool-enabled workflows.
Overall Comparison & Rankings
To provide a unified perspective, we consolidated results from all evaluations - prompt-based attacks, data-leakage probes, and agent-level misuse, into a single comparative scorecard.

- No “green” model yet - All variants suffered ≥ 70 % tool-misuse or double-digit jailbreak/leak scores and most of them reveal their system prompts.
- GPT-5 (full) lands in the “balanced but imperfect” zone. Alignment is better than GPT-5 mini/nano (32 % vs 38 / 37 %). Leakage is the lowest of the GPT-5 family (tagged Low). Agentic misuse (74 %) is mid-pack - safer than GPT-4o mini and GPT-4o, but not as conservative as GPT-5 mini.
- Trade-off spotlight: GPT-5 mini vs GPT-4o. GPT-5 mini trades alignment (worst jailbreak score) for strong operational discipline (lowest agent misuse). GPT-4o is the mirror image - excellent at refusing malicious prompts, yet more willing to wield risky tools inside an agent.
- Small models trade speed for safety – GPT-5-nano and GPT-4o-mini show the steepest degradations: easier to leak system prompt and model identity, and more likely to use the “dangerous” tools in an agent workflow.
Recommendations for Secure Deployment
To ensure a safe use of GPT-5 models in enterprise and agentic settings, we recommend a deployment approach that aligns model capabilities with security requirements, while applying layered safeguards throughout the inference pipeline. We offer the following recommendations:
Model Selection by Risk Profile
Organizations should align model choice with the sensitivity of the task and the level of operational control available. Each GPT-5 variant offers different trade-offs in alignment, leakage resistance, and agent safety.
- Use GPT-5 (Full) for high-risk applications involving sensitive data, regulated domains (e.g., legal, healthcare), or customer-facing outputs where minimal tolerance for leakage or jailbreaks is required.
- Use GPT-5 Mini when deploying models inside agent frameworks with strong external controls, such as tool whitelisting, execution guards, or isolated runtimes. Its lower misuse rate makes it viable for operational use when paired with robust containment.
- Use GPT-5 Nano only in non-agentic, low-risk scenarios due to its elevated vulnerability to prompt leakage and tool misuse. Additionally, it should be deployed with external safety controls.
Layered Defense Measures
With GPT-5’s shift to Safe-Completions, where the goal is "not to trade off helpfulness for safety", the model provider alignment can't guarantee us complete safety. Therefore, external safeguards are now essential:
- Guardrails: Enforce high-level policy across both prompts and outputs. They prevent unsafe behaviors from ever reaching deployment by blocking sensitive leakage and rejecting disallowed content categories. Where appropriate, consider deploying Fujitsu’s Guardrails system to provide consistent, layered enforcement across model responses.
- System prompt hardening: Design system prompts to minimize leakage risk by avoiding easily extractable templates. Use implicit role definitions, contextual behavior cues, and avoid including sensitive logic directly in the prompt.
Red-Teaming as an Ongoing Practice
As genAI systems evolve and their attack surfaces expand, organizations must embed red-teaming as a continuous process. Each system should have a dedicated red-teaming pipeline, tailored to its specific business use case, that regularly tests for regressions, identifies emerging threats, and validates the effectiveness of deployed safeguards.
But with GPT-5’s shift to Safe-Completions, the ground started to shift under already established and generally accepted AI red-teaming approaches. Traditional refusal-based red-teaming frameworks, which measure safety by counting blocked answers, no longer reflect reality. GPT-5 rarely refuses outright; instead, it produces safer-sounding completions that can still leak sensitive content or misuse tools.
This means that some of today’s red-teaming practices will soon be obsolete by design. The next generation of evaluation must look beyond the presence of the refusals in the model response and focus on how well models preserve policy boundaries under pressure: Do they leak partial instructions? Do they offer incremental steps toward harmful goals? Do they drift in intent across multi-step agentic workflows?
In short, GPT-5 doesn’t just challenge defenders with new vulnerabilities - it challenges the very methods we use to measure safety. Future security work must evolve just as quickly, or risk missing the most critical failures.
Concluding Remarks
As LLMs like GPT-5 continue to advance and expand into critical real-world applications, proactive security evaluations are essential. With tools like Fujitsu’s LLM Vulnerability Scanner, organizations can confidently adopt state-of-the-art models while maintaining the safety, trust, and compliance required for responsible AI deployment.
We hope these findings support our readers in promoting safety and security in their work and activities. If your work involves the security and trustworthiness of generative AI and AI agents, we welcome the opportunity to collaborate or advise!
Related Links
Fujitsu LLM Vulnerability Scanner and Guardrails en-documents.research.global.fujitsu.com
DeepSeek Security Evaluation - Part 1: A Comprehensive Assessment of Security Risks Using Fujitsu’s LLM Vulnerability Scanner blog-en.fltech.dev
DeepSeek Security Evaluation - Part 2: New security aspects of DeepSeek model with KG-RAG Attack and Hypergraph Defense Technologies blog-en.fltech.dev
Multi-AI agent security technology to protect against vulnerabilities and new threats blog-en.fltech.dev
Next-generation security through AI agent collaboration: Proactively addressing vulnerabilities and emerging threats www.fujitsu.com
Fujitsu Kozuchi: cutting-edge AI technologies developed by Fujitsu en-portal.research.global.fujitsu.com
*1:From Hard Refusals to Safe-Completions: Toward Output-Centric Safety Training
*3:GPT-5 Jailbreak with Echo Chamber and Storytelling
*4:GPT-5 Under Fire: Red Teaming OpenAI’s Latest Model Reveals Surprising Weaknesses
*6:GPT-5 Jailbreak with Echo Chamber and Storytelling
*7:GPT-5 Under Fire: Red Teaming OpenAI’s Latest Model Reveals Surprising Weaknesses
*8:Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents
*9:Deliberative Alignment: Reasoning Enables Safer Language Models
*10:Deliberative Alignment: Reasoning Enables Safer Language Models OpenAI Blog