This article is the third installment in our TechBlog series titled 'Fujitsu's Corporate Benchmarking Proposal: To Unlock the True Value of AI Agent Models.' The series consists of three posts, published on the following schedule:
- Part 1: When AI 'Sees' What Isn't There: Introducing a Benchmark for Diagnosing Hallucinations in Multimodal Large Language Models (MLLMs) (Published)
- Part 2: AAAI 2026 AABA4ET Participation Report and Introduction to the Fujitsu RAG Hard Benchmark (Published)
- Part 3: From Reading to Reasoning: Introducing the Fujitsu Assessing Compliance in Enterprise Dataset for Enterprise Legal Compliance Agents (This article)
From Reading to Reasoning: Introducing the Fujitsu Assessing Compliance in Enterprise Dataset for Enterprise Legal Compliance Agents
Hello. We are Pranav Bhagat, Dishank Aggarwal, and Ayush Singh, a team from Fujitsu Research of India.
As part of our ongoing TechBlog series, "Fujitsu's Corporate Benchmarking Proposal: To Unlock the True Value of AI Agent Models," we shift focus from perception (Part 1: Fujitsu Hallucination Benchmark) and retrieval (Part 2: Fujitsu RAG Hard Benchmark) to a critical enterprise domain: legal compliance reasoning.
In this article, we introduce the Fujitsu Assessing Compliance in Enterprise Dataset, a new benchmark designed to train and evaluate AI systems that act not just as text generators, but as AI Paralegals capable of multi-clause reasoning. This work has been accepted as a main-conference paper at EACL 2026 (the Conference of the European Chapter of the Association for Computational Linguistics), highlighting its contribution to advancing AI capabilities in legal reasoning. For details about the conference, see: https://2026.eacl.org/
The Core Challenge: Why Current AI Fails at "Lawyering"
Large Language Models (LLMs) have revolutionized text understanding. However, enterprise legal workflows demand more than surface-level comprehension; they require structured reasoning across interconnected clauses.
Consider a simple compliance check:
"A renewal notice is sent 45 days before contract expiration. Is it valid?"
To answer this, a human lawyer must simultaneously evaluate:
- The definition of "Renewal Notice"
- The timing requirement clause
- Any exceptions or overrides
This is multi-clause reasoning.
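To make the dependency concrete, here is a minimal Python sketch of that renewal-notice check. The 60-day requirement, the expiration date, and the force-majeure exception are invented for illustration; the point is only that the answer depends on three separate clauses at once.

```python
from datetime import date

# Hypothetical values drawn from three different parts of a contract.
RENEWAL_NOTICE_MIN_DAYS = 60          # timing clause: "at least 60 days before expiration"
EXPIRATION_DATE = date(2026, 6, 30)   # term clause
FORCE_MAJEURE_EXTENSION = False       # exception clause that could override the deadline

def renewal_notice_is_valid(notice_date: date) -> bool:
    """A notice is valid only when definition, timing, and exceptions all line up."""
    days_before_expiration = (EXPIRATION_DATE - notice_date).days
    if FORCE_MAJEURE_EXTENSION:
        return True                   # the exception overrides the timing requirement
    return days_before_expiration >= RENEWAL_NOTICE_MIN_DAYS

# A notice sent 45 days before expiration fails the 60-day requirement.
print(renewal_notice_is_valid(date(2026, 5, 16)))  # False
```

Even this toy version already needs three clauses evaluated together; real contracts chain many more.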
However, existing benchmarks like ContractNLI and CUAD evaluate models on isolated clauses, not real-world dependencies. As our research shows, when faced with such multi-hop scenarios:
- Base models achieve only 34–57% accuracy
- Performance is often barely above random guessing
This reveals a fundamental limitation:
Current AI can read contracts but cannot reason over them.
The Key Insight: Contracts Are Not Text, They Are Graphs
Legal documents are not flat sequences of text. They are structured systems of logic, where:
- Definitions influence obligations
- Exceptions override rules
- Temporal conditions trigger actions
To capture this, we introduce COMPACT (Compliance Paralegals via Clause Graph Reasoning over Contracts).
The COMPACT Framework
Instead of treating contracts as unstructured text, COMPACT transforms them into Clause Graphs—structured representations of legal logic.

Step 1: Deontic Logic Extraction
Each clause is decomposed into its core components:
- Subject (who is responsible/involved)
- Deontic (obligation, permission, prohibition)
- Action
- Object
- Temporal conditions
- Contextual Parameters
This enables precise modeling of legal obligations across domains.
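As a rough illustration, a decomposed clause can be represented as a small structured record. The field names below are ours, chosen to mirror the components listed above, and are not the exact schema used in COMPACT.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ClauseComponents:
    """One possible structured view of a clause (field names are illustrative)."""
    subject: str                      # who is responsible / involved
    deontic: str                      # "obligation", "permission", or "prohibition"
    action: str
    obj: str
    temporal: Optional[str] = None    # e.g. "no later than 60 days before expiration"
    context: dict = field(default_factory=dict)

clause = ClauseComponents(
    subject="Tenant",
    deontic="obligation",
    action="deliver",
    obj="renewal notice",
    temporal="no later than 60 days before contract expiration",
    context={"delivery_method": "written"},
)
print(clause.deontic, clause.action, clause.obj)
```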
Step 2: Semantic Clustering
Clauses are grouped by function:
- Confidentiality
- Termination
- Payment
- Exceptions, etc.
Insight: This mirrors how lawyers mentally organize contracts.
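One generic way to perform this kind of functional grouping is to embed clauses and cluster them by similarity. The sketch below uses the sentence-transformers and scikit-learn libraries purely for illustration; the paper's actual clustering method may differ.

```python
# A generic embed-and-cluster sketch (not necessarily the method used in COMPACT).
from sentence_transformers import SentenceTransformer  # assumption: library available
from sklearn.cluster import KMeans

clauses = [
    "The Receiving Party shall keep all Confidential Information secret.",
    "Either party may terminate this Agreement with 30 days' written notice.",
    "Fees are payable within 45 days of invoice receipt.",
    "Section 4 does not apply to information already in the public domain.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(clauses)

# Four illustrative functional groups: confidentiality, termination, payment, exceptions.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(embeddings)
for clause, label in zip(clauses, labels):
    print(label, clause[:50])
```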
Step 3: Graph Linking
We then connect clauses through relationships such as:
- DEFINES (definitions → obligations)
- EXCEPTIONS (overrides)
- DEPENDS_ON (conditional triggers)
- CONFLICTS (contradictions)
The result:
👉 A structured reasoning graph that enables multi-hop legal inference.
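For intuition, a clause graph with these typed edges can be sketched with an off-the-shelf graph library. The clause labels and edges below are invented, and networkx is used only to show how multi-hop lookups become simple traversals.

```python
import networkx as nx  # assumption: networkx used only for illustration

G = nx.DiGraph()

# Nodes are clauses; edge types mirror the relationships listed above.
G.add_edge("Def. 1.2 'Renewal Notice'", "Sec. 8.1 Renewal timing", relation="DEFINES")
G.add_edge("Sec. 8.3 Force majeure", "Sec. 8.1 Renewal timing", relation="EXCEPTIONS")
G.add_edge("Sec. 2.4 Term of agreement", "Sec. 8.1 Renewal timing", relation="DEPENDS_ON")
G.add_edge("Sec. 9.2 Automatic renewal", "Sec. 8.1 Renewal timing", relation="CONFLICTS")

# Multi-hop inference becomes graph traversal: start from the clause a scenario
# mentions and collect everything that defines, overrides, or conditions it.
for src, dst, data in G.in_edges("Sec. 8.1 Renewal timing", data=True):
    print(f"{src} --{data['relation']}--> {dst}")
```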
Introducing the ACE Dataset
From these clause graphs, we generate the Fujitsu Assessing Compliance in Enterprise Dataset (ACE), the first benchmark designed specifically for multi-clause compliance reasoning.
Dataset Composition
- 4,700 compliance scenarios
- Derived from 633 real-world contracts
- Covers 26 agreement types
- Balanced across:
  - Compliant (33.6%)
  - Non-Compliant (34.0%)
  - Non-Applicable (32.3%)
Each scenario requires reasoning across multiple interconnected clauses, not just one.
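For a sense of what a scenario looks like in practice, here is a hypothetical record together with a trivial three-way label check. The field names are ours for illustration and do not reflect the released file format.

```python
# Hypothetical shape of one ACE scenario (field names are illustrative, not the released schema).
scenario = {
    "scenario_id": "ace-0001",
    "contract_type": "Master Services Agreement",
    "scenario": "A renewal notice is emailed 45 days before contract expiration.",
    "relevant_clauses": ["Def. 1.2", "Sec. 8.1", "Sec. 8.3"],  # multi-clause by construction
    "label": "Non-Compliant",  # one of: Compliant / Non-Compliant / Non-Applicable
}

def evaluate(model_answer: str, gold: dict) -> bool:
    """Simple exact-match check over the three-way label."""
    return model_answer.strip().lower() == gold["label"].lower()

print(evaluate("non-compliant", scenario))  # True
```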
The "Adversarial" Advantage
Unlike traditional datasets, ACE is adversarially constructed to prevent shallow pattern-matching.
We introduce challenging scenario types:
- Compliant-with-Distractor: technically valid, but appears suspicious (e.g., an informal but legally acceptable notice)
- Violation-with-Plausible-Defense: clearly non-compliant, but masked by a reasonable justification
- Cross-Clause Non-Applicable: scenarios that look relevant but are actually governed by different clauses
This ensures:
Models must reason, not just match keywords.
Why Multi-Clause Reasoning Is Hard
Legal complexity arises from patterns such as:
- Definitional chains (A → B → C dependencies)
- Temporal conflicts (overlapping deadlines)
- Exception hierarchies
- Conditional webs
Our dataset explicitly targets these patterns, forcing models to perform multi-hop reasoning across clause graphs.
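A definitional chain is the simplest of these patterns to picture: interpreting one clause requires resolving the definitions it depends on, transitively. The toy traversal below (clause names invented) shows why each link in the chain is one extra reasoning hop.

```python
# Toy definitional chain: Sec. 8.1 depends on Def. 1.2, which depends on Def. 1.1 (A -> B -> C).
DEFINES = {
    "Sec. 8.1 Renewal timing": ["Def. 1.2 'Renewal Notice'"],
    "Def. 1.2 'Renewal Notice'": ["Def. 1.1 'Notice'"],
    "Def. 1.1 'Notice'": [],
}

def resolve_chain(clause, graph):
    """Depth-first walk over DEFINES edges; each hop is one more reasoning step."""
    chain = []
    for dep in graph.get(clause, []):
        chain.append(dep)
        chain.extend(resolve_chain(dep, graph))
    return chain

print(resolve_chain("Sec. 8.1 Renewal timing", DEFINES))
# -> ["Def. 1.2 'Renewal Notice'", "Def. 1.1 'Notice'"]
```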
Results: From Models to Legal Agents
Training on ACE leads to significant improvements:

1. Massive Performance Gains
- +22 to +43 percentage points improvement
- Small models (3B) show dramatic gains
2. Efficiency at Scale
- 3B models outperform larger base models
- Enables cost-effective enterprise deployment
3. Cross-Domain Generalization
Models trained on ACE improve performance on:
- EU AI Act
- HIPAA
- Legal entailment benchmarks
This shows:
Models learn general legal reasoning, not just dataset-specific patterns.
Key Takeaway: From Reading to Reasoning
The central limitation of today's legal AI is not understanding language; it is understanding relationships.
COMPACT + Fujitsu Assessing Compliance in Enterprise Dataset introduces a paradigm shift:
- From clause-level understanding → to graph-based reasoning
- From text prediction → to legal decision-making
Conclusion
The ACE dataset represents a foundational step toward enterprise-grade AI agents that can reason like legal professionals.
Our contributions include:
- A graph-based framework for modeling contracts
- A multi-clause benchmark for realistic compliance reasoning
- Adversarial scenarios that prevent shortcut learning
- Demonstrated efficiency and generalization
As AI systems move into high-stakes enterprise domains, reasoning—not just generation—will define their true value.
Resources
- 📄 Paper: COMPACT: Building Compliance Paralegals via Clause Graph Reasoning over Contracts (ACL Anthology)
- 🤗 Dataset (Fujitsu Assessing Compliance in Enterprise Dataset): https://github.com/FujitsuResearch/Fujitsu-Assessing-Compliance-in-Enterprise-Dataset.git
- 📧 Contact: {Pranav.bhagat, Ayush.singh, Dishank.aggarwal}@fujitsu.com