Hello. We are Taro Togawa and Takao Nakagawa from the Artificial Intelligence Laboratory at Fujitsu Research.
To promote the use of generative AI in enterprises, Fujitsu has developed a generative AI framework for enterprises that can flexibly respond to diverse and changing corporate needs and easily adapt to the vast amounts of data held by companies as well as to laws and regulations. The framework has been rolled out in stages since July 2024 as part of the AI service lineup of Fujitsu Kozuchi (R&D). In this article, we focus on the transformation that generative AI brings to system operations and maintenance, and introduce Test Specification Generation Technology, which automates the creation of test cases from existing design documents.
How generative AI transforms and automates enterprise system development and operation processes
Generative AI is redefining the very foundations of software development. Code completion and automated test generation are just the beginning: we are moving toward an "AI-based development process" that spans every step from requirements definition through design, implementation, and operation. Large language models (LLMs) extract knowledge from internal documents and logs, accelerating the drafting of specifications, the estimation of impact scope, and the creation of release notes. At the same time, issues such as how to ensure quality, security, and accountability are being addressed.
Enterprise system development is particularly challenging. Complex business domains, integration with legacy assets, strict audit and availability requirements, and frequent legal and regulatory changes: these are issues that cannot be resolved with simple automated generation. It is essential to verbalize requirements accurately, make non-functional requirements visible, and maintain architectural consistency while controlling the spread of changes.
Fujitsu is engaged in research and development of advanced technologies to solve these issues. The core is Fujitsu Knowledge Graph enhanced RAG technology, which enables accurate referencing and utilization of large amounts of data. While this technology can be used in a variety of general-purpose scenarios, this series focuses on automating and streamlining system development and operations and introduces the following seven technologies. In the future, we aim to build a multi-agent system that carries out requirements definition, design, implementation, and operation in a highly reliable and well-controlled manner, with Knowledge Graph enhanced RAG serving as an integrated database that AI can access across a wide range of development and operation tasks.

| Technology | Requirement definition | Design | Implementation | Test | Operation | Maintenance |
|---|---|---|---|---|---|---|
| (1) System Specification Visualization |  | ✓ |  |  |  | ✓ |
| (2) Design Review Assistant |  | ✓ |  |  |  |  |
| (3) Code Specification Consistency Analysis |  |  | ✓ |  |  | ✓ |
| (4) Test Specification Generation |  |  |  | ✓ |  | ✓ |
| (5) Failure Analysis |  |  |  |  | ✓ | ✓ |
| (6) Log Analysis |  |  |  |  | ✓ | ✓ |
| (7) QA Automation | ✓ |  |  |  | ✓ | ✓ |
(1) System Specification Visualization (Knowledge Graph enhanced RAG for Software Engineering, now available)
This technology not only analyzes and understands source code but also generates high-level functional design documents and summaries, enabling modernization.
(2) Design Review Assistant (now available)
This technology converts complex system design documents into a form that generative AI can understand, automating checks for ambiguity and inconsistency in those documents.
(3) Code Specification Consistency Analysis (now available)
This technology compares source code with specifications to detect differences and identify problem areas, shortening the time needed to investigate the cause of a failure when one occurs.
(4) Test Specification Generation (this article)
This technology extracts the rules used to identify test cases from existing design documents and test specifications, making it possible to generate comprehensive test specifications that reflect the characteristics of the project.
(5) Failure Analysis (Knowledge Graph enhanced RAG for Root Cause Analysis, now available)
When a failure occurs, this technology creates a report based on system logs and data from past failure cases, and suggests countermeasures drawn from similar cases.
(6) Log Analysis (Knowledge Graph enhanced RAG for Log Analysis, now available)
This technology automatically analyzes system log files and answers highly specialized questions related to identifying the cause of failures, detecting anomalies, and preventive maintenance.
(7) QA Automation (Knowledge Graph enhanced RAG for Q&A, now available)
This technology enables advanced Q&A with a bird's-eye view of large amounts of document data, such as product manuals.
In this article, we will introduce "(4) Test Specification Generation" technology in detail. This technology utilizes generative AI to read and analyze large volumes of documentation, such as design specifications and development guidelines. By leveraging the knowledge of Fujitsu's system engineers, it enables the generation of test cases (Figure 1).

What is testing in software development?
Fujitsu develops a wide variety of software, including mission-critical systems for finance and government. In software development spanning business requirements definition, design, and operation/maintenance, the testing phase is an indispensable and critically important process for ensuring system reliability prior to release. Figure 2 illustrates the relationship between the design phases and the testing phases in software development, known as the V-model. In this model, each design and implementation phase corresponds to a specific testing phase. Test specifications appropriate to the level of abstraction of each design are therefore expected to be defined alongside the design, and the corresponding tests are performed after implementation.

Traditionally, experts thoroughly familiar with the overall system design and operations created the test cases corresponding to each software modification and carried out the testing (Figure 3). However, as systems grew in scale, the scope of impact of a change broadened, and the dependencies between design documents became complex and largely implicit, making it extremely difficult to identify the test scope affected by a design change and to formulate the corresponding test cases. In addition, every system update required an enormous amount of time to review all design documents and past test materials, placing a heavy workload on test personnel.

Meanwhile, recent advances in generative AI have begun to produce AI systems capable of understanding design concepts, which makes it possible to read large volumes of test specifications and design documents automatically. Introducing such technology is expected to automate the previously complex and labor-intensive task of updating test specifications, reducing the workload involved. Combined with human double-checking, it also promises to improve the comprehensiveness and quality of test cases.
Challenges in the Testing Process
So, what exactly is needed to have AI solve such problems? For example, consider the task: "For every line in the modified design document, compare it against every line in the existing test specification to determine whether it has an impact." This approach is straightforward, but it requires a massive number of queries to the AI, proportional to "number of modifications × number of specification lines." If a single test specification affects multiple design documents, the number of patterns to consider becomes astronomical. Furthermore, even if all patterns were covered, the AI's inherent uncertainty would likely result in a large number of erroneous revision proposals, ultimately degrading the accuracy of the automated revision process itself.
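To see why this explodes, here is a minimal sketch of that naive approach, not an actual implementation. It assumes a hypothetical `llm_judge` callable that asks the AI about a single (design change, test line) pair.

```python
# Naive pairwise impact check: one AI query per (changed design line, test spec line) pair.
# `llm_judge` is a hypothetical stand-in for any LLM call returning True/False.
from typing import Callable

def naive_impact_check(
    changed_design_lines: list[str],
    test_spec_lines: list[str],
    llm_judge: Callable[[str, str], bool],
) -> list[tuple[int, int]]:
    """Return (design_line_index, test_line_index) pairs judged as impacted."""
    impacted = []
    for i, change in enumerate(changed_design_lines):
        for j, test in enumerate(test_spec_lines):
            if llm_judge(change, test):      # one LLM query per pair
                impacted.append((i, j))
    return impacted

# The query count grows as len(changed_design_lines) * len(test_spec_lines):
# e.g. 200 changed lines x 5,000 test specification lines = 1,000,000 LLM calls.
```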
Test Specification Revision Scenario
To solve this problem, we focused on the "process by which the original test specifications were created." Development projects typically have plans or standards defining which tests to perform for each design item. Therefore, the basis for each test specification should be documented somewhere in the design documents, test plans, or standards. Based on this relationship, determining in advance how design document revisions affect test specifications allows us to narrow down which documents and tests need to be compared. This enables efficient and highly accurate automated revision.
Unfortunately, this relationship (traceability) between "test specifications" and "justification descriptions" is often unavailable or, if present, incomplete. That is, the relationship has become implicit knowledge or has been lost. The test specification document shown in Figure 4 lists test cases concerning the startup and shutdown of various servers that make up the system. Here, the rule that the "Detailed Category" of the test specification should list each server described in the "Server Configuration Design Document" is actually implicit knowledge.

Now, suppose the Server Configuration Design Document is revised as shown in Figure 5. What was previously an SQL server has changed to a different type of server called an Integ. DB Server. This change should be reflected in the test specification. But what if all the knowledgeable individuals who understood the implicit relationship described earlier are no longer present? Until someone notices this relationship, the revision to the test specification might be overlooked. In the worst case, tests based on the old description might simply be deleted.

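To make the idea concrete, the implicit rule behind Figure 4 could be written down explicitly, for example as a simple structured record like the one below. The field names and values are illustrative only and do not reflect an actual schema used by this technology.

```python
# A hypothetical, hand-written encoding of the implicit rule behind Figure 4,
# illustrating what "making implicit knowledge explicit" could look like.
descriptive_rule = {
    "rule_id": "R-001",
    "source_document": "Server Configuration Design Document",
    "source_element": "list of servers in the configuration table",
    "target_document": "startup/shutdown test specification",
    "target_element": "Detailed Category column",
    "relation": "each server listed in the source must appear as a Detailed Category "
                "with its own startup and shutdown test cases",
}

# With such a rule recorded, the change "SQL Server -> Integ. DB Server" in the
# design document points directly to the test cases that must be revised.
```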
Previous Methods and Their Problems
In fact, there is an existing line of research that seeks to rediscover these implicit relationships between documents. This field, called trace link recovery, identifies dependencies, derivations, and similarities between artifacts (documents or source code) using information retrieval and machine learning. Combined with recent advances in generative AI (especially text-generating AI), a task called trace link explanation has emerged, which describes the meaning of an identified relationship (link), that is, what exactly the nature of the relationship is. Trace link explanation appears to be a powerful technology, but precisely because it explains relationships in natural language, the number of possible explanation perspectives is extremely large, and the appropriate perspective must be chosen according to the objective of the task. It has also been pointed out that the choice of explanatory perspective fundamentally influences the types and quality of trace links that can be discovered. As of now, no trace link explanation method has been proposed that can clarify how a modification to one document should affect the other.
Technical Features and Overview
Fortunately, our objective is clear: (Objective A) to identify the design documents that influenced the test specifications and (Objective B) to explain how modifying those design documents would affect the test specifications. We therefore believed that if we could describe the work procedures (descriptive rules) that led to the creation of a given test specification (or set of specifications), we could achieve link recovery and explanation from the perspective of modification impact propagation. With that in mind, we divided the problem into the three steps shown in Figure 6.

Step 1. AI Agent-Based Source-of-Influence Retrieval
In the first step, documents that may have influenced a given document are identified autonomously. We provided keyword search and similarity search tools to a ReAct-type AI agent capable of multi-turn autonomous reasoning, and instructed it to explore the design document collection and list documents likely to serve as sources of influence (Figure 7). Because content accuracy is ensured in Step 2, where the results are consolidated into high-certainty, high-abstraction rule descriptions, this stage does not need to be perfect; its sole purpose is to discover plausible candidate sources.

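As a rough illustration of this step (not Fujitsu's actual implementation), the agent loop could look roughly like the following, assuming a hypothetical `llm` text-generation callable and two assumed search tools passed in as functions.

```python
# Simplified ReAct-style loop: the agent alternates "decide an action -> call a tool
# -> observe the result" until it finishes with a list of candidate source documents.
import json
from typing import Callable

def find_influence_sources(
    test_spec_excerpt: str,
    keyword_search: Callable[[str], list[str]],      # returns matching document names
    similarity_search: Callable[[str], list[str]],   # returns similar document names
    llm: Callable[[str], str],                       # hypothetical text-generation call
    max_turns: int = 8,
) -> list[str]:
    """List design documents that plausibly influenced the given test specification."""
    transcript = f"Test specification excerpt:\n{test_spec_excerpt}\n"
    for _ in range(max_turns):
        reply = llm(
            transcript
            + '\nDecide the next action as JSON: {"tool": "keyword" | "similarity" | "finish", '
              '"argument": "..."}. For "finish", put a comma-separated list of candidate '
              'documents in "argument".'
        )
        action = json.loads(reply)
        if action["tool"] == "finish":
            return [name.strip() for name in action["argument"].split(",")]
        tool = keyword_search if action["tool"] == "keyword" else similarity_search
        observation = tool(action["argument"])
        transcript += f"\nAction: {json.dumps(action)}\nObservation: {observation}"
    return []  # turn budget exhausted without a confident answer
```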
Step 2. Generative AI-Based Rule Abstraction
Next, we present the set of links collected in Step 1 to the generative AI and issue the request: "Identify the rules for describing test specifications." This consolidates the scattered links into a structured set of rules (Figure 8). Of course, given the current capabilities of generative AI, simply providing the above instruction cannot yield high-quality descriptive rules.

When exploring candidate justifications in Step 1, we enumerate candidates at a specific, granular level, such as individual lines or individual test cases. In practice, however, as illustrated in Figure 4, a single descriptive rule often derives multiple test specifications. Moreover, the candidates extracted in Step 1 may contain errors, so we need criteria for deciding which links to keep as the justification for the final descriptive rules. We therefore apply the heuristic (rule of thumb) of selecting the minimal set of source-of-influence links that explains the greatest number of outcomes, aiming to explain as much as possible with as few descriptive rules as possible. This matches actual software development practice, where comprehensive test specifications are created from concise plans and standards. However, to cover cases where the available documents contain no source for a description, or where the source is too complex to identify, we also distinguish between common rules that explain many cases and exceptional rules that describe a small number of exceptional cases. This lets us handle test cases that might otherwise slip through the cracks of abstraction. Note that Steps 1 and 2 constitute the analysis of existing assets and can be executed before any modification occurs.
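The selection heuristic can be viewed as a greedy set-cover problem. The following is a minimal sketch of that idea only, under the assumption that each candidate link is annotated with the set of test cases it explains; the actual abstraction of links into descriptive rules is performed by the generative AI.

```python
# Greedy set-cover sketch of "keep the fewest links that explain the most test cases".
def select_common_rules(
    link_coverage: dict[str, set[str]],   # candidate link id -> test case ids it explains
    all_cases: set[str],
) -> tuple[list[str], set[str]]:
    """Return (selected links, test cases left unexplained -> exceptional-rule candidates)."""
    uncovered = set(all_cases)
    selected = []
    while uncovered:
        best = max(link_coverage, key=lambda k: len(link_coverage[k] & uncovered))
        gain = link_coverage[best] & uncovered
        if not gain:            # remaining cases cannot be explained by any candidate link
            break
        selected.append(best)
        uncovered -= gain
    return selected, uncovered

# Illustrative data only:
links = {
    "server-config -> startup-tests": {"T1", "T2", "T3", "T4"},
    "network-design -> failover-tests": {"T5", "T6"},
    "old-memo -> T4-only": {"T4"},
}
common, exceptions = select_common_rules(links, {"T1", "T2", "T3", "T4", "T5", "T6", "T7"})
# common == ["server-config -> startup-tests", "network-design -> failover-tests"]
# exceptions == {"T7"}  (no source found; covered by an exceptional rule instead)
```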
Step 3. Application of Descriptive Rules Based on Modification Differences (Diff)
In the third step, when actual modifications occur, we compare their content against the descriptive rules and update the test specifications (Figure 9). The key point here is how to provide the modification details from the design documents. Design documents vary widely in format across projects and can sometimes be hundreds of pages long. Therefore, we adopted a method where we mechanically extract the differences (hereafter, Diff) from the design document, feed them to the generative AI, and have it identify chapter and section breaks to split the Diff into smaller parts. After that, for each modification difference, we simply have the generative AI determine if there is a corresponding relationship with the rules. If there is a relevant relationship, it proposes the modification content for the test specification in JSON format.

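As a rough sketch of this step, assuming a hypothetical `llm` callable and an assumed JSON output schema (not the product's actual format), the flow could look like this.

```python
# Illustrative Step 3 flow: compute the design-document diff mechanically, let the
# generative AI split it at chapter/section boundaries, then check each chunk
# against each descriptive rule and collect proposed test specification updates.
import difflib
import json
from typing import Callable

def propose_test_spec_updates(
    old_design: str,
    new_design: str,
    rules: list[str],
    llm: Callable[[str], str],   # hypothetical text-generation call returning JSON strings
) -> list[dict]:
    diff = "\n".join(
        difflib.unified_diff(old_design.splitlines(), new_design.splitlines(), lineterm="")
    )
    # Split the diff into smaller parts at chapter/section breaks (JSON list of strings).
    chunks = json.loads(
        llm("Split this diff at chapter/section boundaries and return a JSON list of strings:\n" + diff)
    )
    proposals = []
    for chunk in chunks:
        for rule in rules:
            reply = json.loads(llm(
                "Design change:\n" + chunk + "\nDescriptive rule:\n" + rule +
                '\nIf the rule applies, return {"applies": true, "test_id": "...", '
                '"action": "add" | "update" | "delete", "new_description": "..."}; '
                'otherwise return {"applies": false}.'
            ))
            if reply.get("applies"):
                proposals.append(reply)
    return proposals
```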
It would be ideal if everything up to this point could be executed seamlessly at any time, but in reality many AI-based judgments are involved along the way. In Step 2 (rule abstraction) in particular, we sometimes obtain rules whose abstraction is slightly insufficient or whose content overlaps. The key to success therefore lies in manually verifying and refining the rules to obtain even higher quality. Once a rule is confirmed as correct, it can be reused in future revisions, so this is a one-time effort that verbalizes implicit knowledge and is valuable in that sense as well.
Contributions of the technology
① Anyone can automatically design test cases!
By introducing this technology into the testing phase of software development, it becomes possible to comprehensively extract test cases whose design previously required system development knowledge and experience and was therefore difficult.
A verification evaluation applying this technology to the test design of a specific project confirmed that it could extract 70.6% (12 cases out of 17 cases) of the test cases that were manually determined. While further accuracy improvements are needed for practical implementation, it is expected that in the future, even less experienced engineers, with the support of generative AI, will be able to perform testing equivalent to that of experienced engineers.
② Reduce workload during system changes!
By automatically generating the test cases that need to be added or modified from extensive design documents (old and new versions) containing diagrams and charts, generative AI is expected to reduce the workload of test personnel, for example by eliminating the need to reread large volumes of materials.
Conclusion
This article introduced "Test Specification Generation Technology," which automatically generates test cases by leveraging generative AI. This technology received the Outstanding Presentation Award at the 8th Machine Learning Systems Engineering Workshop (MLSE Summer Camp 2025) and was featured as an invited talk at the 42nd Annual Conference of the Japan Society for Software Science and Technology. The technology is also integrated into Fujitsu Kozuchi and provided as a web application: users simply upload their system design documents (old and new versions), test plans, and existing test specifications via a browser, and the system generates test cases based on the changes in the design documents. We therefore strongly encourage operations and maintenance teams planning future system enhancements to try this technology.
Currently, we are working to improve the accuracy of test case generation and enhance its versatility. We are also advancing integration with technology that automatically generates corresponding test code. Our future goal is to automate the entire testing process for various industries and business systems.