Hello. This is Yuji Mizobuchi from the Artificial Intelligence Laboratory.
To promote the use of generative AI at enterprises, Fujitsu has developed an "Enterprise-wide Generative AI Framework Technology" that can flexibly respond to diverse and changing corporate needs, handle the vast amounts of data companies hold, and easily comply with laws and regulations. The framework has been rolled out since July 2024 as part of the Fujitsu Kozuchi (R&D) AI service lineup.
Some of the challenges that enterprise customers face when leveraging specialized generative AI models include:
- Difficulty handling the large amounts of data the enterprise requires
- Inability of generative AI to meet cost, response-speed, and other requirements
- The need to comply with corporate rules and regulations
To address these challenges, the framework consists of the following technologies:
- Fujitsu Knowledge Graph Enhanced RAG ( *1 )
- Generative AI Amalgamation Technology
- Generative AI Audit Technology
In this series, we introduce "Fujitsu Knowledge Graph Enhanced RAG" in weekly installments. We hope it helps you solve your problems. At the end of the article, we also explain how to try out the technology.
Fujitsu Knowledge Graph Enhanced RAG Overcomes a Weakness of Generative AI: Inaccurate Referencing of Large-Scale Data
Existing RAG techniques, which make generative AI refer to related documents such as internal documents, struggle to reference large-scale data accurately. To solve this problem, we developed Fujitsu Knowledge Graph Enhanced RAG (hereinafter, Fujitsu KG Enhanced RAG). By extending existing RAG technology and automatically creating a knowledge graph that structures the huge amounts of data companies hold, such as corporate regulations, laws, manuals, and videos, it expands the amount of data an LLM can reference from hundreds of thousands or millions of tokens to more than 10 million tokens. Knowledge based on the relationships in the knowledge graph can then be fed to the generative AI accurately, enabling logical reasoning and the presentation of output rationale.
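To make the idea concrete, here is a minimal, generic sketch of knowledge-graph-backed retrieval: triples are stored, the neighborhood around a query entity is collected, and the resulting facts are serialized into the prompt. All names and structures here are illustrative assumptions, not Fujitsu's actual implementation.

```python
from collections import defaultdict

class KnowledgeGraph:
    """Stores (subject, relation, object) triples and retrieves a neighborhood."""

    def __init__(self):
        self.by_subject = defaultdict(list)

    def add(self, subject, relation, obj):
        self.by_subject[subject].append((subject, relation, obj))

    def neighborhood(self, entity, depth=2):
        """Collect triples reachable from `entity` within `depth` hops."""
        seen, frontier, result = {entity}, [entity], []
        for _ in range(depth):
            next_frontier = []
            for node in frontier:
                for s, r, o in self.by_subject[node]:
                    result.append((s, r, o))
                    if o not in seen:
                        seen.add(o)
                        next_frontier.append(o)
            frontier = next_frontier
        return result

def build_prompt(question, kg, entity):
    """Serialize the relevant subgraph as grounded context for the LLM call."""
    facts = "\n".join(f"{s} --{r}--> {o}" for s, r, o in kg.neighborhood(entity))
    return f"Answer using only these facts:\n{facts}\n\nQuestion: {question}"

kg = KnowledgeGraph()
kg.add("payroll_batch", "calls", "tax_module")
kg.add("tax_module", "reads", "tax_rate_table")
print(build_prompt("Which tables does payroll_batch depend on?", kg, "payroll_batch"))
```

Because only the subgraph relevant to the question is serialized, the prompt stays small even when the underlying graph structures many millions of tokens of source data.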
This technology comprises five component technologies, each suited to different target data and application scenarios:
- Root Cause Analysis (Now Showing): creates a report on the occurrence of a failure based on system logs and failure case data, and suggests countermeasures based on similar failure cases.
- Question & Answer (Now Showing): makes it possible to conduct advanced Q&A based on a comprehensive view of a large amount of document data, such as product manuals.
- Software Engineering (This article): not only understands source code, but also generates high-level functional design documents and summaries, and enables modernization.
- Vision Analytics (Now Showing): detects specific events and dangerous actions from video data, and can even propose countermeasures.
- Log Analysis (Now Showing): answers various questions related to system logs in natural language, including fault cause analysis, anomaly detection, and summarization.
In this article, I will introduce the third of these, Software Engineering (hereinafter, SE), in detail.
What is Fujitsu Knowledge Graph Enhanced RAG for SE?
Have you ever developed software? While creating something is a lot of fun, software development also involves tedious, painstaking work, such as understanding code written by others and identifying the impact of code modifications. LLMs can make these tasks easier: an LLM can read many lines of code at once, take the relationships between them into account, and provide a sophisticated explanation of even obscure variables by inferring their meaning from context. It can also be applied to code generation, code transformation, test case generation, and more; the possibilities are endless. Of course, it does not eliminate human work entirely, since not everything an LLM generates can be relied on, but it can significantly reduce the effort involved in software development.
Fujitsu Knowledge Graph Enhanced RAG for SE aims to bring these benefits of LLMs to software development for large-scale IT systems. Needless to say, IT systems today are the cornerstone of corporate management, and there is a longstanding need for agile maintenance and modification based on continuous monitoring and a full understanding of the system. However, many systems, including legacy systems, face difficulties such as shortages of engineers, loss of vendor support, and missing design documentation, which are far from the ideal state of systems management. To overcome this situation, Fujitsu Knowledge Graph Enhanced RAG for SE is being researched and developed to support decision making through an understanding of current assets.
How does Fujitsu Knowledge Graph Enhanced RAG for SE work?
Here we would like to explain how Fujitsu Knowledge Graph Enhanced RAG for SE works. Fig. 1 shows the overall picture of the framework. It consists of three phases: information extraction from IT assets, knowledgeisation, and knowledge utilization. In each phase, an LLM is at the core, and existing tools are used as needed to process data according to the purpose. The following sections describe each phase in more detail.
Phase 1: Information extraction
In Phase 1, information useful for understanding the assets is extracted from the IT assets. The flow of information extraction is shown in Fig. 2, which assumes that IT assets include business information such as business flows and manuals, operational information such as IT system execution logs, and source code. The information to be extracted includes business flows, design information, flow information (control flow, data flow, etc.), and call relationships between functions and files. Where the extracted information must be as definitive and accurate as possible, conventional analysis tools are used; where a certain degree of imprecision is acceptable, an LLM is used to produce results that take the business perspective into account. Combining the strengths of both is one of the distinctive features of this framework.
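The division of labor described above can be sketched as follows: a deterministic extractor handles facts where precision matters (here, a toy regex over COBOL/PL/I-style CALL statements), while a fuzzier, business-oriented summary is delegated to an LLM. The pattern and function names are illustrative assumptions, not the framework's actual tooling.

```python
import re

def extract_calls(source: str) -> set[str]:
    """Deterministic extraction: find CALL targets in COBOL/PL/I-style code.
    A conventional-analysis stand-in where accuracy must be guaranteed."""
    return set(re.findall(r"CALL\s+'?(\w+)'?", source))

def summarize_business_purpose(source: str, llm) -> str:
    """Imprecision-tolerant extraction delegated to an LLM (passed in as a
    callable here; stubbed because no real model is attached to this sketch)."""
    return llm(f"In one sentence, what business task does this code perform?\n{source}")

code = """
    CALL 'TAXCALC' USING WS-GROSS WS-TAX.
    CALL 'PRINTRPT'.
"""
print(extract_calls(code))  # deterministic facts can go straight into the graph
```

Deterministic results feed the knowledge graph directly; LLM-derived summaries would be stored alongside them, flagged as interpretive rather than authoritative.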
Phase 2: Knowledgeisation
In Phase 2, the information collected from the IT assets in Phase 1 is organized and supplemented according to pre-defined schemas. As Fig. 3 shows, a wide range of related information should be prepared comprehensively in anticipation of the knowledge utilization in Phase 3. This is because IT systems have various stakeholders, and the information they need differs depending on their position and situation. For this reason, missing information must be supplemented through multifaceted, multidimensional analysis, not limited to the output of Phase 1. The LLM enables generalization and concretization across different levels of abstraction, such as the business, functional, and implementation levels, as well as the association of different types of information.
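One simple way to picture a schema spanning abstraction levels is as linked nodes where each business-level entry is realized by functional-level entries, which in turn are realized by implementation-level artifacts. The schema below is a hypothetical illustration, not the framework's actual data model.

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeNode:
    """One entry in the knowledge store; `level` marks the abstraction level."""
    name: str
    level: str                      # "business" | "functional" | "implementation"
    description: str = ""
    realized_by: list = field(default_factory=list)  # links to lower-level nodes

payroll = KnowledgeNode("Monthly payroll", "business")
tax = KnowledgeNode("Tax calculation", "functional")
taxcalc = KnowledgeNode("TAXCALC", "implementation", "COBOL subprogram")
payroll.realized_by.append(tax)
tax.realized_by.append(taxcalc)

def view_for(node, want_level):
    """Walk down the realization links until the requested level is reached,
    so each stakeholder sees the abstraction appropriate to them."""
    if node.level == want_level:
        return [node.name]
    names = []
    for child in node.realized_by:
        names.extend(view_for(child, want_level))
    return names

print(view_for(payroll, "implementation"))
```

A business analyst would query the business level, while a maintenance engineer would drill down to the implementation level of the same graph.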
Phase 3: Knowledge application
In Phase 3, the knowledge built up in Phase 2 is used in purpose-specific applications to meet the needs of each stakeholder. At the current stage, we are considering generating documentation for various types of information and providing a chat-style interface for inquiring about it. Fig. 4 shows the program design of the payroll chart creation code*2 generated with Fujitsu Knowledge Graph Enhanced RAG for SE (Sept. version). Note that a modified version of this payroll payment totalization table creation code is used in the explanations that follow. Instructions on how to use this function are provided at the end of this blog, so please read to the end.
What are the technical points of design information generation?
Here we present some technical points about the design information generation of the Fujitsu Knowledge Graph Enhanced RAG for SE (Sept. version).
Challenges in design information generation using LLM
To begin with, rule-based tools for generating design information have existed for some time. These tools prepare templates according to the control structure of the program and fill them with information such as variable names. However, because they simply restate the processing performed by the code in natural language, the result cannot be described at the business level of abstraction.
In contrast, an LLM can refer not only to the code but also to variable names and comments, and can use them as hints to describe the processing at the business level. Conversely, if the comments and variable names in the code are unclear, it is difficult to generate design information that reflects the business context, which is one reason generating design information from legacy code is hard. Fig. 5 shows a salary payment tabulation table generation code with easy-to-understand comments and variable names (left) and the same code with abbreviated comments and variables (right). The code on the right is harder for a human to understand than the code on the left, and for an LLM, too, the less information it can draw on, the harder the code is to understand, resulting in lower-quality output.
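The contrast in Fig. 5 holds in any language. As a quick stand-in (the original examples are PL/I), the two functions below are semantically identical, yet the first gives an LLM, like a human reader, business-level hints and the second gives almost none:

```python
# Readable version: names and a docstring carry business-level hints.
def total_salary(base_salary, overtime_pay, commuting_allowance):
    """Sum the monthly payment items for one employee."""
    return base_salary + overtime_pay + commuting_allowance

# Abbreviated version: computes the same value, but offers no clue
# that this is a payroll calculation.
def ts(b, o, c):
    return b + o + c
```

Both return the same result, but only the first lets a model infer that the output belongs in a salary payment table.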
Approach
Fujitsu Knowledge Graph Enhanced RAG for SE addresses these issues by using program analysis technology when generating design information. Flow analysis and call relationship analysis are performed on the code in advance, and the results are incorporated into the LLM's design-information-generation prompts to help it understand the code. The specific flow is shown in Fig. 6. When programmers read code, they follow the flow of variables and control to understand the specific processing and grasp the overall picture; this is almost equivalent to performing data flow and control flow analysis by hand. Likewise, when a piece of code is unclear, a programmer can deepen their understanding by reading the code it calls, which corresponds to call relationship analysis. This experience informs how we guide the LLM's generation to ensure high quality.
Incidentally, it has been reported that accuracy improves when an LLM is asked to solve a task by breaking it down into smaller steps and solving them one at a time*3. The paper that proposed Chain of Thought (CoT) prompting gives the example of answering how many tennis balls Roger has by breaking the calculation into smaller steps and computing them one at a time (Fig. 7). Similarly, this framework achieves gradual understanding by performing program analysis in advance, rather than generating design information in one go.
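The idea of feeding pre-decomposed steps to the model can be sketched as a prompt template that places each analysis result in its own step before asking for the design document. The wording and step layout below are illustrative assumptions, not the framework's actual prompts.

```python
def design_prompt(code: str, dataflow: str, controlflow: str, calls: str) -> str:
    """Assemble a design-generation prompt that stages pre-computed program
    analysis results so the LLM can build up understanding step by step."""
    return "\n\n".join([
        "You are generating a functional design document.",
        "Step 1 - data flow (where each variable is defined and used):\n" + dataflow,
        "Step 2 - control flow (possible execution paths):\n" + controlflow,
        "Step 3 - call relationships (functions this code calls):\n" + calls,
        "Step 4 - using the steps above, describe the business-level design of:\n" + code,
    ])

prompt = design_prompt(
    code="ts(b, o, c): return b + o + c",
    dataflow="b, o, c: defined as parameters, used once in the return",
    controlflow="single straight-line path",
    calls="(none)",
)
print(prompt)
```

Each step grounds the next, mirroring the gradual, CoT-like understanding described above instead of asking for the design document in one shot.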
Fig. 8 shows examples of data flow analysis, control flow analysis, and call relationship analysis applied to the low-readability code in Fig. 5. The data flow analysis enumerates where each variable is defined and used, the control flow analysis enumerates execution paths, and the call relationship analysis enumerates the functions called from the code.
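As a toy illustration of what such analyses compute, the snippet below runs a miniature data-flow and call-relationship analysis over a Python fragment using the standard `ast` module. This is only a stand-in: the actual framework analyzes COBOL and PL/I, not Python, and real flow analysis is far more involved.

```python
import ast
from collections import defaultdict

def analyze(source: str):
    """Toy analysis: record where each variable is defined and used
    (data flow) and which functions are called (call relationships)."""
    tree = ast.parse(source)
    defs, uses, calls = defaultdict(list), defaultdict(list), set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defs[node.id].append(node.lineno)   # variable definition site
            else:
                uses[node.id].append(node.lineno)   # variable use site
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            calls.add(node.func.id)                 # called function name
    return dict(defs), dict(uses), calls

src = "g = base + bonus\nt = tax(g)\nprint(t)"
defs, uses, calls = analyze(src)
print(defs, calls)
```

Even this crude pass recovers facts an abbreviated variable name hides, such as that `g` flows from line 1 into the `tax` call on line 2, which is exactly the kind of hint fed into the generation prompt.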
Examples of improvements
Finally, let us show how the generated design information changes with and without program analysis. Fig. 9 shows the design information generated from a deliberately less readable version of the bonus file creation code*4. The design items are program name, process summary, input record schema, input data example, output record schema, and output data example. Compared with the result generated without program analysis, don't you think the design information generated with program analysis reflects a better understanding of the code's meaning?

Without program analysis, the information is written at the implementation level of abstraction, and reading it is little better than reading the code itself. With program analysis, the design information takes the business perspective into account, and the input and output items relate to bonus calculation; we believe the processing patterns were judged to be typical of bonus calculation based on the flow analysis and the digit counts of the variables. In our in-house experiment, conducting program analysis improved quality by about 40%. We believe that using program analysis results in this way is effective for design information generation, and Fujitsu Knowledge Graph Enhanced RAG for SE will continue to pursue effective ways of using LLMs for design information generation and IT asset understanding.
Why not try the Knowledge Graph Enhanced RAG for SE?
Here we show how to generate design information using Fujitsu Knowledge Graph Enhanced RAG for SE. It is very easy: design information can be generated in just two steps.
- Step 1: Registering and selecting code
- Step 2: Checking the analysis results
Step 1: Registering and selecting code
- Zip the code you want to generate design information from, then either drag and drop it or select it via Browse files, and press the 'Upload' button. The system currently supports PL/I, COBOL, CLIST, and JCL.
- Select the files you want to analyze from the files registered in the zip file. If you do not select a file, all files will be analyzed.
Step 2: Checking the analysis results
- The analysis results can be checked from the preview screen. Click on the analysis result you want to see to view the corresponding design description in the application.
- Analysis results are created in Markdown format and can also be downloaded in a batch using the 'Download results' button.
*1: RAG (Retrieval-Augmented Generation). A technology that combines generative AI with external data sources to extend its capabilities.
*2:https://web.archive.org/web/20201023191843/https://sites.google.com/a/offshorejp.com/www/pl1/02/02-02
*3: Wei, Jason, et al. "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." Advances in Neural Information Processing Systems 35 (2022): 24824-24837.
*4:https://web.archive.org/web/20201023191839/https://sites.google.com/a/offshorejp.com/www/pl1/02/02-01