
DeepSeek Security Evaluation - Part 2: New security aspects of DeepSeek model with KG-RAG Attack and Hypergraph Defense Technologies

Hello, I am Manohar Kaul from the AI Security Lab of FRIPL in India, part of the Data & Security Research Laboratory. Today, we are publishing Part 2 of the DeepSeek Security Evaluation, delving into new security aspects of DeepSeek using additional internal research techniques developed at our laboratory.

We have also published Part 1 of the DeepSeek Security Evaluation, providing a comprehensive assessment of security risks using Fujitsu’s LLM Vulnerability Scanner. Please take a look there as well.


Introduction

DeepSeek-R1, a prominent large language model (LLM), has garnered significant attention in the field of artificial intelligence. However, as with all advanced AI systems, it is crucial to thoroughly examine its vulnerabilities and enhance its robustness against potential security threats. Our work aims to conduct an advanced analysis of DeepSeek-R1's security vulnerabilities and propose effective defence mechanisms to fortify its resilience in adversarial settings, using additional internal research techniques developed at our laboratory.

Our research employs two methodologies to identify and address potential weaknesses in DeepSeek-R1:

  1. KG-RAG Attack Technology - To evaluate DeepSeek-R1's performance in knowledge-intensive tasks, we employ our Spectral Weakening technique. Knowledge Graph Retrieval-Augmented Generation (KG-RAG) systems have gained significant popularity because they incorporate structured knowledge effectively and reduce hallucinations in language models. While DeepSeek-R1 has shown promising performance, making it an attractive candidate for such systems, the robustness of its KG-RAG pipeline under adversarial attacks on the knowledge graph structure remains unexplored. Our approach assesses how DeepSeek-R1's performance in KG-RAG pipelines is affected when the underlying knowledge graph is subjected to adversarial manipulation, providing insights into potential vulnerabilities across different LLM models in this context.
  2. Hypergraph Defense Technology - To thwart jailbreak attacks on LLMs and flag prompts that may elicit a jailbreak, we propose a robust defense mechanism tailored for DeepSeek-R1: a hypergraph-based [6] black-box defense system that efficiently classifies and mitigates such attacks.

By leveraging these advanced attack strategies and defense mechanisms, we aim to significantly enhance DeepSeek-R1's security posture and overall robustness. This comprehensive assessment not only contributes to the improvement of DeepSeek-R1 but also advances the broader field of LLM security, paving the way for more resilient AI systems in the face of evolving threats.

KG-RAG Robustness in DeepSeek-R1 vs. other LLM models

A Knowledge Graph RAG (KG-RAG) system combines a Knowledge Graph (KG) with Retrieval-Augmented Generation (RAG) to improve AI-generated responses. It does this by structuring knowledge in a graph format and enhancing response generation by retrieving relevant information.

How It Works

  1. Knowledge Graph (KG) - A KG represents information as a network of entities (nodes) and their relationships (edges). It provides structured, accurate, and contextualized knowledge.
  2. Retrieval Mechanism - When a user asks a question, the system searches the KG to find relevant facts.
  3. Augmented Generation - A language model (like GPT) uses the retrieved KG data to generate an accurate and context-aware response.

Consider how a KG-RAG system works when a user poses the following question: "How does renewable energy help combat climate change?"

The KG-RAG retriever fetches the relevant information needed to answer this question as plain text: "Renewable energy, such as solar and wind power, helps reduce carbon dioxide (CO2) emissions. Since CO2 is a major greenhouse gas that contributes to global warming, shifting to renewable energy sources decreases the overall greenhouse gas effect, helping mitigate climate change."

To surface the underlying relationships in the text chunk, it is converted into a KG:

Nodes:
- "Greenhouse Gases"
- "Carbon Dioxide (CO2)"
- "Global Warming"
- "Renewable Energy"

Edges (Relationships):
- "Greenhouse Gases" → "cause" → "Global Warming"
- "Carbon Dioxide (CO2)" → "is a type of" → "Greenhouse Gases"
- "Renewable Energy" → "reduces" → "Carbon Dioxide (CO2)"

The original query, augmented with the information from the KG, is then fed to the generator LLM to yield a response that is free of hallucinations and aligned with the original question: "Renewable energy helps combat climate change by reducing carbon dioxide (CO2) emissions. CO2 is a greenhouse gas that contributes to global warming, and by transitioning to renewable sources like solar and wind power, we can lower emissions and slow the effects of climate change."
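To make this flow concrete, the sketch below shows how the triples above could be assembled into a KG and an augmented prompt. This is an illustrative example using networkx, not our actual pipeline; `llm_generate` is a hypothetical wrapper around the generator LLM call.

```python
import networkx as nx

# Triples extracted from the retrieved text chunk
triples = [
    ("Greenhouse Gases", "cause", "Global Warming"),
    ("Carbon Dioxide (CO2)", "is a type of", "Greenhouse Gases"),
    ("Renewable Energy", "reduces", "Carbon Dioxide (CO2)"),
]

# Build the knowledge graph
kg = nx.DiGraph()
for head, relation, tail in triples:
    kg.add_edge(head, tail, relation=relation)

# Serialize the KG into a context string and augment the query
def kg_to_context(graph: nx.DiGraph) -> str:
    return "\n".join(f"{h} {d['relation']} {t}" for h, t, d in graph.edges(data=True))

question = "How does renewable energy help combat climate change?"
augmented_prompt = (
    "Answer the question using only the facts below.\n"
    f"Facts:\n{kg_to_context(kg)}\n\n"
    f"Question: {question}"
)

# response = llm_generate(augmented_prompt)  # hypothetical generator LLM call
```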

In our research, we have assessed the robustness of KG-RAG using our innovative spectral weakening technique. This approach examines how the KG-RAG pipeline performs when the KG structure is compromised—either by an adversary adding irrelevant or incorrect information or by the removal of critical information needed to answer a question.

Overview of Our Proposed Framework: Spectral Weakening

This method evaluates the robustness of KG-RAG systems by specifically targeting the KG-RAG retriever in a gray-box setting as follows:

  1. Question Processing - The system takes an input question and retrieves relevant passages from a document corpus.
  2. Knowledge Graph (KG) Construction - It structures the retrieved information into a KG by identifying entities and their relationships.
  3. Response Generation - The KG is used to create a query-augmented context that guides the language model (LLM) to generate an accurate response.
  4. Adversarial Perturbation - An adversary identifies critical edges in the KG and modifies them, introducing incorrect associations (e.g., linking unrelated concepts like "Hypertension" and "Coding"); a simplified sketch of this step is shown after this list.
  5. Impact of Perturbations - The perturbed KG misguides the LLM, leading to factually incorrect responses.
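The sketch below conveys only the general idea behind step 4: scoring edges by how much their removal shifts the KG's Laplacian spectrum, then rewiring the most critical edge to an unrelated concept. It is not the actual Spectral Weakening algorithm developed at our laboratory, which differs in its details.

```python
import networkx as nx
import numpy as np

def laplacian_spectrum(graph: nx.Graph) -> np.ndarray:
    # Sorted eigenvalues of the graph Laplacian
    return np.sort(np.linalg.eigvalsh(nx.laplacian_matrix(graph).toarray()))

def most_critical_edge(graph: nx.Graph):
    # Score each edge by how strongly its removal shifts the Laplacian spectrum
    base = laplacian_spectrum(graph)
    scores = {}
    for edge in graph.edges():
        pruned = graph.copy()
        pruned.remove_edge(*edge)
        scores[edge] = float(np.linalg.norm(base - laplacian_spectrum(pruned)))
    return max(scores, key=scores.get)

# Undirected view of the example KG from the previous section
kg = nx.Graph()
kg.add_edges_from([
    ("Greenhouse Gases", "Global Warming"),
    ("Carbon Dioxide (CO2)", "Greenhouse Gases"),
    ("Renewable Energy", "Carbon Dioxide (CO2)"),
])

u, v = most_critical_edge(kg)
kg.remove_edge(u, v)
kg.add_edge(u, "Coding")  # inject an incorrect association into the KG
```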

In this study, we evaluated the performance of KG-RAG with three models: Mistral Small, GPT-4, and DeepSeek-R1. We used the following metrics to evaluate performance:

  1. Cosine Similarity (CS) - Ranges from 0 to 100 and measures how closely the KG-RAG response aligns with the actual answer in the embedding space. Higher scores indicate better performance.
  2. Hallucination (Hal) - Ranges from 0 to 100 and assesses the presence of hallucinations in the KG-RAG response. A score of 100 means the answer is fully hallucinated; lower scores are better.
  3. Human Evaluation (HE) - Ranges from 0 to 100 and reflects ratings given by human annotators comparing the KG-RAG response to the golden answer. Higher scores are preferable.
  4. Relevance (Rel) - Ranges from 0 to 100 and measures how well the KG-RAG response aligns with the posed question. Higher scores indicate better relevance.

We compare the performance of the KG-RAG system using Mistral Small, GPT-4, and DeepSeek-R1 as the generator model. We apply the same perturbations to the KG structure for all three models and compute the metrics on each model's generated responses.
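As an illustration, the sketch below shows how a cosine-similarity-style score and its degradation under perturbation might be computed. The embedding model, scaling, and example strings are assumptions for illustration only; our evaluation setup may differ.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def cosine_similarity_score(response: str, golden_answer: str) -> float:
    # Cosine similarity in embedding space, clamped and scaled to a 0-100 score
    emb = encoder.encode([response, golden_answer], convert_to_tensor=True)
    return max(0.0, float(util.cos_sim(emb[0], emb[1]).item())) * 100.0

golden = "Renewable energy reduces CO2 emissions, helping mitigate climate change."
clean_response = "Renewable energy lowers CO2 emissions and slows climate change."
perturbed_response = "Renewable energy mainly helps people improve their coding skills."

cs_clean = cosine_similarity_score(clean_response, golden)
cs_perturbed = cosine_similarity_score(perturbed_response, golden)
cs_degradation = cs_clean - cs_perturbed  # the kind of drop reported in Figure 1
```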

Experimental Outcomes

We compare the three LLMs, Mistral Small, GPT-4, and DeepSeek-R1, and report the degradation in performance when the KG structure is adversarially perturbed (Figure 1).

Figure 1: Comparison of LLM Models on Degradation in Metrics

In our research, we find that DeepSeek-R1 is affected the most by perturbations in the KG structure, followed by Mistral Small, with GPT-4 being the least affected. As Figure 1 shows, DeepSeek exhibits the highest degradation across all metrics, particularly in Relevance (Rel) and Human Evaluation (HE), where the performance drop exceeds 25 points, indicating a severe decline in response quality and alignment with the question. Cosine Similarity (CS) and Hallucination (Hal) also show substantial degradation for DeepSeek, with CS dropping around 20 points, suggesting a significant shift in vector-space representation, and Hal indicating increased hallucinations. This suggests that DeepSeek is the most sensitive of the three models, struggling in particular to maintain relevance and avoid hallucinations.

What is the underlying reason?

When comparing the performance of the KG-RAG pipeline with a standard RAG pipeline and a baseline with no adversarial perturbations, we find that the presence of a knowledge graph (KG) provides the most significant performance boost for the DeepSeek model.

This suggests that DeepSeek relies heavily on the KG structure to generate accurate responses. As shown in Figure 1, even small adversarial perturbations lead to a drastic decline in Relevance (Rel) and Human Evaluation (HE) scores, indicating severe misalignment with the question. Additionally, Hallucinations (Hal) increase significantly, further emphasizing DeepSeek's vulnerability to disruptions in the KG. In contrast, GPT-4, which is less dependent on the KG, experiences minimal degradation, making it more robust to such perturbations.

These findings highlight that while DeepSeek can deliver superior performance by leveraging the KG, it lacks robustness against adversarial perturbations, which frequently occur in real-world scenarios, making it prone to misalignment, hallucinations, and factual errors.

Hypergraph-based defense in DeepSeek-R1 vs. other models

Recent jailbreak attacks on Large Language Models (LLMs) have evolved from purely algorithmic approaches to more sophisticated, persuasion-based social engineering techniques. These attacks are highly creative, adaptive, and achieve significantly higher success rates.

How It Works

To address this growing threat, we use our novel black-box safety mechanism, Hypergraph Defense, which leverages the geometric structure of input prompt embeddings using hypergraphs. Our method represents each LLM prompt as a metric hypergraph, allowing us to distinguish between harmful and benign prompts by analyzing their structural differences.

By analyzing these higher-order geometric patterns, our safety filter effectively identifies and mitigates malicious prompts (Figure 2). Theoretical analysis guarantees the generalization ability of our classifier, ensuring robustness against previously unseen attacks.
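The sketch below conveys the general intuition only: embed a prompt, group each embedding with its nearest neighbors into hyperedges, and classify simple geometric statistics of those hyperedges. The actual defense in [6] is substantially more sophisticated; the feature choices, neighborhood construction, and classifier here are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.linear_model import LogisticRegression

def hypergraph_features(token_embeddings: np.ndarray, k: int = 3) -> np.ndarray:
    # One hyperedge per token: the token together with its k nearest neighbors
    k = min(k, len(token_embeddings) - 1)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(token_embeddings)
    distances, _ = nn.kneighbors(token_embeddings)
    diameters = distances[:, -1]  # distance to the farthest member of each hyperedge
    # Crude geometric summary of the prompt's embedding structure
    return np.array([diameters.mean(), diameters.std(), diameters.max()])

# X: features of labelled benign/harmful prompts, y: labels (0 = benign, 1 = harmful)
# classifier = LogisticRegression().fit(X, y)
# flag = classifier.predict(hypergraph_features(new_prompt_embeddings).reshape(1, -1))
```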

Figure 2: Higher-order geometric patterns generated by our hypergraph method, which we use to flag adversarial input prompts that can elicit a jailbreak attack

Figure 3: Comparison of the drop in ASR with and without the Hypergraph Defense. GCG [1], PAIR [2], AutoDAN [3], and DeepInception [4] (DeepInc) are examples of algorithmic jailbreak attacks; Persuasion [5] (Per) is an example of a socially engineered attack

We select five state-of-the-art jailbreak attacks—GCG [1], PAIR [2], AutoDAN [3], DeepInception [4], and Persuasion Attacks [5]—and generate corresponding jailbreak prompts using three LLMs: DeepSeek-R1, LLaMA-3.1-8B-Instruct, and GPT-4.

To evaluate the effectiveness of these attacks, we first test them against the LLMs without any defense mechanism, referred to as the No-Defense scenario, and measure the Attack Success Rate (ASR)—the percentage of times the jailbreak attacks succeed. Next, we apply our hypergraph-based defense to detect and flag the jailbreak prompts, creating the Hypergraph Defense scenario. Finally, we calculate and report the reduction in ASR due to the hypergraph defense, as shown in Figure 3.
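The ASR computation itself is straightforward; the toy sketch below shows how the ASR and its drop under the defense are obtained (the outcome lists are made-up placeholders, not our experimental data):

```python
def attack_success_rate(outcomes: list[bool]) -> float:
    # outcomes[i] is True if the i-th jailbreak prompt succeeded
    return 100.0 * sum(outcomes) / len(outcomes)

# Placeholder outcomes for illustration only
asr_no_defense = attack_success_rate([True, True, False, True, False])
asr_with_defense = attack_success_rate([False, False, False, True, False])
asr_drop = asr_no_defense - asr_with_defense  # the quantity reported in Figure 3
```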

We find that the hypergraph defense mechanism significantly reduces the ASR across different LLM models and jailbreak attacks. Notably, jailbreak attacks generated using DeepSeek-R1 experience the largest drop in ASR. This suggests that our hypergraph-based defense generalizes well to attacks generated by newer models.

What is the underlying reason?

This happens because the higher-order geometric patterns in jailbreak prompts generated by DeepSeek-R1 are simpler and more distinctly clustered compared to those from other models. As a result, the hypergraph defense mechanism can more easily recognize and flag these prompts, leading to a greater reduction in ASR.

Implications

These findings highlight both the strengths and critical weaknesses of DeepSeek-R1's architecture, emphasizing the urgent need for robust defense mechanisms:

  1. While DeepSeek-R1 demonstrates superior performance by leveraging knowledge graphs (KGs), it remains highly vulnerable to adversarial perturbations in real-world scenarios.
  2. The model's susceptibility to KG structure manipulations exposes it to misalignment, hallucinations, and factual inaccuracies when the knowledge graph is compromised.
  3. To ensure the secure and scalable deployment of KG-RAG, the Spectral Weakening Defense mechanism is essential. This mechanism plays a crucial role in detecting and flagging adversarially perturbed KGs, safeguarding DeepSeek and other LLMs from malicious interference.
  4. Furthermore, DeepSeek-R1 lacks strong safety alignment and is extremely vulnerable to jailbreak attacks, which significantly undermines its reliability. Implementing the Hypergraph Defense mechanism is imperative to fortify the model's security and prevent exploitation.

We will apply these technologies and insights to Fujitsu's Generative AI Security Enhancement Technology, which includes both the LLM Vulnerability Scanner and LLM Guardrails. Stay tuned for more updates!

References

[1] Zou, Andy, et al. "Universal and transferable adversarial attacks on aligned language models." arXiv preprint arXiv:2307.15043 (2023).

[2] Chao, Patrick, et al. "Jailbreaking black box large language models in twenty queries." arXiv preprint arXiv:2310.08419 (2023).

[3] Zhu, Sicheng, et al. "AutoDAN: Interpretable gradient-based adversarial attacks on large language models." arXiv preprint arXiv:2310.15140 (2023).

[4] Li, Xuan, et al. "DeepInception: Hypnotize large language model to be jailbreaker." arXiv preprint arXiv:2311.03191 (2023).

[5] Zeng, Yi, et al. "How Johnny can persuade LLMs to jailbreak them: Rethinking persuasion to challenge AI safety by humanizing LLMs." Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024.

[6] Kaul, Manohar, et al. “Beyond Mere Token Analysis: A Hypergraph Metric Space Framework for Defending Against Socially Engineered LLM Attacks.” The Thirteenth International Conference on Learning Representations, 2025, OpenReview.

Glossary: Jailbreak Attack Landscape

Reinforcement Learning from Human Feedback (RLHF) has emerged as a crucial technique in aligning large language models (LLMs) with human values and ethics. This approach aims to create AI systems that are more closely attuned to human preferences, moral standards, and societal norms. By incorporating direct human feedback during the training process, RLHF enables LLMs to learn and refine their behavior based on human judgments, leading to outputs that are more aligned with ethical expectations.

However, the very mechanisms designed to align AI with human values have given rise to a new class of vulnerabilities known as "jailbreak attacks." These attacks are specifically crafted to circumvent the safety measures and ethical guidelines built into AI systems, potentially leading to the generation of responses that may bypass intended safeguards.

Jailbreak attacks exploit the complex nature of language and the nuances of human communication to manipulate LLMs into producing responses that they were explicitly trained to avoid. These attacks can be broadly classified into three main categories:

Algorithmic Attacks

In algorithmic attacks, attackers exploit the model's internal response patterns to identify prompts that lead to unintended outputs. These attacks work by systematically testing different phrasing or inputs to find ways to bypass safety mechanisms.

Examples

  • A user repeatedly modifies a query about automating a financial transaction until the model unintentionally provides steps that bypass security checks.
  • A user appends an adversarial string of characters to the end of a request to discover patterns that trick the AI into revealing restricted information.

Socially Engineered Attacks

Socially engineered attacks manipulate the AI by mimicking human conversation tactics, persuading it to overlook its restrictions. These attacks rely on psychology and creative phrasing rather than direct technical weaknesses.

Examples

  • A user frames a request as an academic discussion, asking the AI to hypothetically describe the workings of an unauthorized data access technique.
  • A user pretends to be a concerned parent and asks the AI for "educational" details about circumventing online restrictions for "awareness purposes."

Word Obfuscation Attacks

Word obfuscation attacks involve disguising restricted content by altering words, using ciphers, word substitutions, deliberate misspellings, or breaking the text into pieces that evade detection.

Examples

  • Instead of directly asking about bypassing authentication, a user replaces key terms with synonyms or unrelated words, such as "unlocking a vault" instead of "hacking a password."
  • A user uses a sequence of seemingly innocent words with specific instructions to reconstruct a restricted query by replacing certain words with predefined alternatives.