fltech - Technology Blog of Fujitsu Research

A technology blog where Fujitsu researchers talk about a variety of topics

Fujitsu’s Corporate Benchmarking Proposal: To Unlock the True Value of AI Agent Models #3 From Reading to Reasoning: Introducing the Fujitsu Assessing Compliance in Enterprise Dataset for Enterprise Legal Compliance Agents

This article is part of a TechBlog series entitled 'Fujitsu's Corporate Benchmarking Proposal: To Unlock the True Value of AI Agent Models.' The series consists of three articles, published on the following schedule:

  • Part 1: When AI 'Sees' What Isn't There: Introducing a Benchmark for Diagnosing Hallucinations in Multimodal Large Language Models (MLLMs) (Published)
  • Part 2: AAAI 2026 AABA4ET Participation Report and Introduction to the Fujitsu RAG Hard Benchmark (Published)
  • Part 3: From Reading to Reasoning: Introducing the Fujitsu Assessing Compliance in Enterprise Dataset for Enterprise Legal Compliance Agents (This article)

Hello. We are Pranav Bhagat, Dishank Aggarwal, and Ayush Singh, a team from Fujitsu Research of India.

As part of our ongoing TechBlog series, "Fujitsu's Corporate Benchmarking Proposal: To Unlock the True Value of AI Agent Models," we shift focus from perception (Part 1: Fujitsu Hallucination Benchmark) and retrieval (Part 2: Fujitsu RAG Hard Benchmark) to a critical enterprise domain: legal compliance reasoning.

In this article, we introduce the Fujitsu Assessing Compliance in Enterprise Dataset, a new benchmark designed to train and evaluate AI systems that act not just as text generators, but as AI Paralegals capable of multi-clause reasoning. This work has been accepted as a main conference paper at EACL 2026 (Conference of the European Chapter of the Association for Computational Linguistics), highlighting its contribution to advancing AI capabilities in legal reasoning. For more details about the conference, see: https://2026.eacl.org/


The Core Challenge: Why Current AI Fails at "Lawyering"

Large Language Models (LLMs) have revolutionized text understanding. However, enterprise legal workflows demand more than surface-level comprehension: they require structured reasoning across interconnected clauses.

Consider a simple compliance check:

"A renewal notice is sent 45 days before contract expiration. Is it valid?"

To answer this, a human lawyer must simultaneously evaluate:

  • The definition of "Renewal Notice"
  • The timing requirement clause
  • Any exceptions or overrides

This is multi-clause reasoning.
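The check above can be sketched in code. The following is a hypothetical illustration only: the clause contents and the 30-day minimum are invented, and a real system would extract these structures from contract text rather than hard-code them.

```python
# Hypothetical clause structures for the renewal-notice example.
# The 30-day minimum and clause contents are invented for illustration.
CLAUSES = {
    "definitions": {"Renewal Notice": "written notice of intent to renew"},
    "timing": {"min_days_before_expiration": 30},
    "exceptions": [],  # e.g., a clause waiving the deadline for cause
}

def is_notice_valid(days_before_expiration: int) -> bool:
    """A notice is valid if it satisfies the timing clause,
    unless an exception clause overrides the requirement."""
    if CLAUSES["exceptions"]:  # exception clauses take precedence
        return True
    return days_before_expiration >= CLAUSES["timing"]["min_days_before_expiration"]

print(is_notice_valid(45))  # True: 45 >= 30, so the notice is timely
```

Even this toy version requires consulting three clause types at once, which is exactly what a flat, clause-by-clause reader cannot do.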

However, existing benchmarks like ContractNLI and CUAD evaluate models on isolated clauses, not real-world dependencies. As our research shows, when faced with such multi-hop scenarios:

  • Base models achieve only 34–57% accuracy
  • Performance is often barely above random guessing

This reveals a fundamental limitation:

Current AI can read contracts but cannot reason over them.

The Key Insight: Contracts Are Not Text, They Are Graphs

Contracts are structured as interconnected clause relationships rather than flat text.

Legal documents are not flat sequences of text. They are structured systems of logic, where:

  • Definitions influence obligations
  • Exceptions override rules
  • Temporal conditions trigger actions

To capture this, we introduce COMPACT (Compliance Paralegals via Clause Graph Reasoning over Contracts).

The COMPACT Framework

Instead of treating contracts as unstructured text, COMPACT transforms them into Clause Graphs—structured representations of legal logic.

COMPACT Framework

Step 1: Deontic Logic Extraction

Each clause is decomposed into its core components:

  • Subject (who is responsible/involved)
  • Deontic (obligation, permission, prohibition)
  • Action
  • Object
  • Temporal conditions
  • Contextual Parameters

This enables precise modeling of legal obligations across domains.
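As a rough sketch, the extracted components listed above could be held in a structure like the following. The field names mirror the bullet list, not COMPACT's actual internal schema, and the example clause is invented.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical frame for one extracted clause; field names follow the
# component list above, not COMPACT's actual schema.
@dataclass
class ClauseFrame:
    subject: str                    # who is responsible/involved
    deontic: str                    # "obligation" | "permission" | "prohibition"
    action: str
    obj: str
    temporal: Optional[str] = None  # e.g. "no later than 30 days before expiration"
    context: dict = field(default_factory=dict)  # contextual parameters

# An invented renewal-notice clause, decomposed into its components:
notice_clause = ClauseFrame(
    subject="Tenant",
    deontic="obligation",
    action="deliver",
    obj="Renewal Notice",
    temporal="no later than 30 days before expiration",
)
print(notice_clause.deontic)  # obligation
```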

Step 2: Semantic Clustering

Clauses are grouped by function:

  • Confidentiality
  • Termination
  • Payment
  • Exceptions, etc.

Insight: This mirrors how lawyers mentally organize contracts.

Step 3: Graph Linking

We then connect clauses through relationships such as:

  • DEFINES (definitions → obligations)
  • EXCEPTIONS (overrides)
  • DEPENDS_ON (conditional triggers)
  • CONFLICTS (contradictions)

The result:

👉 A structured reasoning graph that enables multi-hop legal inference.
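A clause graph with these typed edges can be sketched as a simple adjacency structure. The clause IDs and links below are invented for illustration and do not reflect the dataset's actual schema.

```python
from collections import defaultdict

# Hypothetical clause graph: clause_id -> [(relation, target_clause_id)].
# Clause IDs and links are invented for illustration.
graph = defaultdict(list)

def link(src: str, relation: str, dst: str) -> None:
    graph[src].append((relation, dst))

link("def_renewal_notice", "DEFINES", "timing_clause")
link("timing_clause", "DEPENDS_ON", "expiration_clause")
link("waiver_clause", "EXCEPTIONS", "timing_clause")

# Multi-hop inference starts from queries like: which clauses
# directly affect the timing clause?
affecting = [src for src, edges in graph.items()
             for rel, dst in edges if dst == "timing_clause"]
print(sorted(affecting))  # ['def_renewal_notice', 'waiver_clause']
```

Answering a compliance question then becomes a traversal over these typed edges rather than a single-clause lookup.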

Introducing the ACE Dataset

From these clause graphs, we generate the Fujitsu Assessing Compliance in Enterprise (ACE) Dataset, the first benchmark designed specifically for multi-clause compliance reasoning.

Distribution of legal document types in the ACE Dataset (N=633 documents across 26 agreement categories) from CUAD and ContractNLI.

Dataset Composition

  • 4,700 compliance scenarios
  • Derived from 633 real-world contracts
  • Covers 26 agreement types
  • Balanced across:
    • Compliant (33.6%)
    • Non-Compliant (34.0%)
    • Non-Applicable (32.3%)

Each scenario requires reasoning across multiple interconnected clauses, not just one.

The "Adversarial" Advantage

Unlike traditional datasets, ACE is adversarially constructed to prevent shallow pattern-matching.

We introduce challenging scenario types:

  1. Compliant-with-Distractor
    Technically valid, but appears suspicious
    (e.g., informal but legally acceptable notice)

  2. Violation-with-Plausible-Defense
    Clearly non-compliant, but masked by reasonable justification

  3. Cross-Clause Non-Applicable
    Relevant-looking scenarios that are actually governed by different clauses

This ensures:

Models must reason, not just match keywords.

Why Multi-Clause Reasoning Is Hard

Legal complexity arises from patterns such as:

  • Definitional chains (A → B → C dependencies)
  • Temporal conflicts (overlapping deadlines)
  • Exception hierarchies
  • Conditional webs

Our dataset explicitly targets these patterns, forcing models to perform multi-hop reasoning across clause graphs.
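The definitional-chain pattern (A → B → C) can be illustrated with a short traversal. The chain below is invented for illustration: resolving whether a "Renewal Notice" is valid may first require resolving what a "Notice" is, and so on.

```python
# Hypothetical definitional chain: each term is defined in terms of
# another, so resolving one term requires multi-hop traversal.
DEFINES = {
    "Renewal Notice": "Notice",          # defined via "Notice"
    "Notice": "Written Communication",   # defined via "Written Communication"
}

def resolve_chain(term: str) -> list[str]:
    """Follow DEFINES edges until a base (undefined) term is reached."""
    chain = [term]
    while chain[-1] in DEFINES:
        chain.append(DEFINES[chain[-1]])
    return chain

print(resolve_chain("Renewal Notice"))
# ['Renewal Notice', 'Notice', 'Written Communication']
```

A model that reads clauses in isolation never follows this chain, which is one reason base models land near random on such scenarios.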

Training on ACE leads to significant improvements:

Performance comparison of base and fine-tuned models on the ACE test set.

1. Massive Performance Gains

  • +22 to +43 percentage points improvement
  • Small models (3B) show dramatic gains

2. Efficiency at Scale

  • 3B models outperform larger base models
  • Enables cost-effective enterprise deployment

Performance comparison across PrivaCIBench's EU AI Act and HIPAA task subsets.

3. Cross-Domain Generalization

Models trained on ACE improve performance on:

  • EU AI Act
  • HIPAA
  • Legal entailment benchmarks

This shows:

The model learns general legal reasoning, not just dataset patterns.

Key Takeaway: From Reading to Reasoning

The central limitation of today's legal AI is not understanding language; it is understanding relationships.

COMPACT + the Fujitsu Assessing Compliance in Enterprise (ACE) Dataset introduce a paradigm shift:

  • From clause-level understanding → to graph-based reasoning
  • From text prediction → to legal decision-making

Conclusion

The ACE dataset represents a foundational step toward enterprise-grade AI agents that can reason like legal professionals.

Our contributions include:

  • A graph-based framework for modeling contracts
  • A multi-clause benchmark for realistic compliance reasoning
  • Adversarial scenarios that prevent shortcut learning
  • Demonstrated efficiency and generalization

As AI systems move into high-stakes enterprise domains, reasoning—not just generation—will define their true value.

Resources