
Hello everyone. We are Yuya Edazawa, Yuma Muto, and Takuya Okamoto from Fujitsu’s Advanced Technology Development Division. We are working on the development of the FUJITSU-MONAKA series, including the next-generation Arm processor FUJITSU-MONAKA, which will form the foundation for cutting-edge domains such as AI, HPC, and cloud. In addition, in June 2025 we launched the development of FugakuNEXT, the successor system to Fugaku; this project is being carried out jointly with RIKEN and NVIDIA.
To share these initiatives and their latest progress with the global community, we participated on-site in SC25, the International Conference for High Performance Computing, Networking, Storage and Analysis (hereafter, SC25 | https://sc25.supercomputing.org/), held in St. Louis, USA from November 16 to 21, 2025.
In this article, we report on the exhibit content related to FUJITSU-MONAKA and FugakuNEXT, as well as the latest trends we observed through attending technical sessions and visiting other exhibitors’ booths.
FUJITSU-MONAKA
At Fujitsu’s booth at SC25, we introduced the key features and core technologies of the next-generation Arm-based CPU FUJITSU-MONAKA, together with its product roadmap.
FUJITSU-MONAKA adopts the Armv9-A architecture and Arm Scalable Vector Extension 2 (SVE2). It is a processor designed to simultaneously pursue high real-application performance and energy efficiency for AI and HPC workloads, while also providing strong security through Confidential Computing. Leveraging the technologies cultivated with Fugaku, FUJITSU-MONAKA aims to support next-generation computing infrastructure across data centers, HPC, and telecommunications.
According to the development roadmap, FUJITSU-MONAKA is scheduled for release in 2027. In 2029, we plan to introduce FUJITSU-MONAKA-X, featuring an integrated NPU and next-generation process node technology, which has already been selected for use in FugakuNEXT. By 2031, FUJITSU-MONAKA will evolve into FUJITSU-MONAKA-XX, further fusing CPU and NPU and utilizing cutting-edge process technologies to deliver even higher performance and better energy efficiency.

FUJITSU-MONAKA Processor Technologies
FUJITSU-MONAKA applies two of Fujitsu’s unique technologies—our original microarchitecture optimized for 3D packaging, and an ultra-low-voltage operation technology—to achieve both high performance and low power consumption.
At SC25, our exhibit attracted significant attention, in part because it is still rare for vendors to design and own their own CPU microarchitecture. Below, we introduce two key processor technologies.
3D Many-Core Architecture

FUJITSU-MONAKA adopts a 3D chiplet structure in which multiple dies are vertically stacked. The main components are:
- Core dies responsible for computation
- SRAM dies that contain the last-level cache (LLC)
- An IO die that integrates external interfaces such as PCIe and DDR5
The Core die and SRAM die are vertically stacked and tightly connected using TSV (Through-Silicon Via), enabling low latency and high throughput.
We also apply heterogeneous process integration: the Core die uses a cutting-edge 2 nm process, while the SRAM and IO dies use a 5 nm process. By applying the 2 nm process only to the Core die, we limit use of the most advanced process to below 30%, optimizing cost while maintaining performance.
Through this 3D many-core architecture, which combines stacking technologies and heterogeneous process integration, FUJITSU-MONAKA delivers high performance, energy efficiency, and cost effectiveness for demanding workloads.
Ultra-Low-Voltage Operation Technology

This technology aims to fundamentally reduce CPU power consumption by lowering the operating voltage of the entire CPU. In general, reducing voltage can significantly reduce power, but it also introduces challenges, especially instability in SRAM operation.
To address this, Fujitsu has developed SRAM that remains stable even at ultra-low voltages by combining dedicated CAD tools with assist circuits. This technology aims to go beyond the limits of the 2 nm process in terms of energy efficiency and enable power-efficient and stable operation for AI and HPC workloads.
Accelerating AI and HPC with FUJITSU-MONAKA and Our Software Efforts
FUJITSU-MONAKA targets a wide range of domains, including AI, HPC, and cloud. To help customers in each domain easily adopt and exploit FUJITSU-MONAKA, we are committed to supporting products from independent software vendors (ISVs) as well as open-source software (OSS) that are de facto standards in their respective areas. We are working closely with ISVs and OSS communities on development and validation.
At previous ISC and SC conferences, we showcased initiatives such as improving the performance and quality of LLVM/GCC, and software development to leverage the Arm Confidential Computing Architecture (CCA). These activities are introduced in our tech blog article “Participating and Exhibiting at ISC2025 #2: The Latest Technologies and OSS Rollout of the Next-Generation Arm Processor FUJITSU-MONAKA” (in Japanese).
In this SC25 exhibit, we focused particularly on AI and HPC applications, and introduced three key topics:
- FUJITSU-MONAKA performance for AI and HPC workloads
- Fujitsu’s contributions to improving the performance of AI/HPC OSS
- R&D for industrial AI adoption
This article provides more detail on these initiatives.

This slide showed an estimated performance comparison between FUJITSU-MONAKA, scheduled for release in 2027, and competing CPUs in a similar price band expected to be released in 2027. The estimates are based on current CPU performance data, together with projected performance gains from architectural evolution for both FUJITSU-MONAKA and competing CPUs.
As the graphs indicate, FUJITSU-MONAKA is expected to deliver high performance for both AI and HPC workloads. Fujitsu is using these projections to help customers evaluate the potential benefits of adopting FUJITSU-MONAKA and is also conducting Proof of Concept (PoC) activities—for example, accelerating and reducing the power consumption of customers’ applications using current Arm processors—so that customers can experience the benefits firsthand before adopting FUJITSU-MONAKA.
While these performance values are primarily derived from hardware improvements, Fujitsu is also actively optimizing software for further acceleration of AI and HPC workloads, including contributing patches to OSS projects. At SC25, we introduced our performance enhancement work on the following OSS commonly used in AI and HPC:
- llama.cpp and vLLM, widely used for large language model (LLM) inference
- OpenBLAS, a matrix computation library used in many applications

As the graphs show, Fujitsu’s contributions have significantly improved the performance of these OSS components. For llama.cpp, we improved performance by introducing Arm’s 8-bit integer matrix multiply instruction (SMMLA) into the GGML library’s matrix-multiplication kernels.
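For readers unfamiliar with the instruction, the arithmetic that a single SMMLA performs can be sketched in plain Python. This is an illustrative model only, not the actual GGML kernel: SMMLA treats two 128-bit registers as 2×8 int8 matrices and accumulates their product into a 2×2 int32 tile.

```python
def smmla_block(acc, a, b):
    """Emulate the arithmetic of one SMMLA instruction in plain Python:
    a and b are 2x8 blocks of int8 values, acc is a 2x2 int32 accumulator.
    SMMLA computes acc += a @ b^T using widening int8 dot products."""
    for i in range(2):
        for j in range(2):
            acc[i][j] += sum(a[i][k] * b[j][k] for k in range(8))
    return acc

# One tile: with a holding two rows of A and b holding two rows of B^T,
# each output element accumulates eight int8 products in a single step.
a = [[1, 2, 3, 4, 5, 6, 7, 8], [-1, -2, -3, -4, -5, -6, -7, -8]]
b = [[1] * 8, [2] * 8]
acc = smmla_block([[0, 0], [0, 0]], a, b)   # -> [[36, 72], [-36, -72]]
```

A GEMM kernel tiles the full int8 matrices into such 2×8 blocks, so a large share of the multiply-accumulate work collapses into single instructions.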
For vLLM, we implemented Paged Attention—a memory-efficient attention mechanism—using SVE2 for the OpenVINO backend, which had not previously supported Arm. This greatly increased throughput. Since these improvements rely only on standard Arm features, they benefit not only FUJITSU-MONAKA but also other Arm CPUs.
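The idea behind Paged Attention can be sketched without SVE2 or OpenVINO. In this illustrative Python model (the names and layout are our own, not vLLM's internals), the KV cache lives in fixed-size physical pages and a per-sequence block table maps logical pages to physical slots, so attention simply gathers keys and values through the table:

```python
import math

def paged_attention(query, kv_pages, block_table):
    """Single-head attention over a paged KV cache (illustrative sketch).
    kv_pages[p] = (keys, values) stored in physical page p; block_table
    lists the physical pages of one sequence in logical order."""
    keys, values = [], []
    for phys in block_table:                 # gather K/V via the block table
        k_page, v_page = kv_pages[phys]
        keys.extend(k_page)
        values.extend(v_page)
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]    # numerically stable softmax
    z = sum(w)
    return [sum(wi * v[j] for wi, v in zip(w, values)) / z
            for j in range(len(values[0]))]

# A sequence with one cached token returns that token's value vector.
out = paged_attention([1.0, 0.0], [([[2.0, 0.0]], [[3.0, 4.0]])], [0])
```

Because the gather is indirection through the block table, physical page placement is free to be non-contiguous; this is the property that makes the KV cache memory-efficient.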
For OpenBLAS, we improved the matrix partitioning logic used in multi-threaded matrix operations so that the matrix blocks assigned to each thread become as close to square as possible. The previous logic often produced extremely elongated rectangular blocks, limiting scalability. The new logic significantly improves this. This enhancement is effective not only on Arm CPUs but also on other architectures such as x86, providing value to a wide range of users.
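The partitioning idea can be illustrated with a toy version of the block-shape choice (our own sketch of the principle, not the actual OpenBLAS code): among all factorizations p × q of the thread count, pick the one whose per-thread (m/p) × (n/q) block is closest to square.

```python
def choose_thread_grid(m, n, nthreads):
    """Pick a p x q split of an m x n matrix over nthreads so that each
    thread's (m/p) x (n/q) block is as close to square as possible.
    Illustrative sketch of the idea, not the OpenBLAS implementation."""
    best = None
    for p in range(1, nthreads + 1):
        if nthreads % p:
            continue
        q = nthreads // p
        bm, bn = m / p, n / q
        aspect = max(bm, bn) / min(bm, bn)   # 1.0 means perfectly square
        if best is None or aspect < best[0]:
            best = (aspect, p, q)
    return best[1], best[2]
```

For a 4096 × 1024 matrix on 16 threads this picks an 8 × 2 grid (512 × 512 blocks), whereas a naive 1 × 16 split would hand each thread a 4096 × 64 sliver with poor cache reuse.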
Below are links to the Pull Requests that include these performance improvements and to a poster presented at ISC25 on the vLLM optimization:
Pull Request: llama.cpp, vLLM, OpenBLAS
Poster at ISC25: Enabling vLLM on ARM for scalable LLM inference on resource-constrained servers

So far we have focused mainly on performance, but this slide introduced our R&D on surrogate models as part of enabling industrial AI.
A surrogate model is an AI model that substitutes for a numerical simulation. At Fujitsu, one of the key use cases is reducing cost and improving efficiency in CAE design in manufacturing. Today, CAE design generally relies on numerical simulation, which is computationally expensive and therefore often applied only in later design stages. While we expect the need for high-accuracy simulations to remain, introducing surrogate models can significantly reduce simulation cost.
If simulation becomes cheaper, it will be possible to run simulations from the early stages of design, enabling early design validation and improving the overall efficiency of CAE design.
A major challenge for realizing this use case is improving the accuracy and versatility of surrogate models. To address this, Fujitsu is working on building surrogate models using Graph Neural Networks (GNNs). In initial experiments comparing 2D simulations produced by OpenFOAM with GNN-based surrogate models, we achieved high accuracy across various object positions and shapes. We are now working to construct high-accuracy, highly versatile surrogate models for more advanced 3D simulations.
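As a rough illustration of the building block such surrogates rely on, here is a minimal message-passing step over a mesh graph. This is our own simplification: real surrogate models learn these weights from simulation data and stack many such layers.

```python
def message_passing_step(node_feats, edges, w_self=0.5, w_msg=0.5):
    """One round of mean-aggregation message passing, the core GNN
    operation: each mesh node mixes its own features with the average
    of its neighbors' features. Weights here are fixed for illustration;
    a trained surrogate learns them."""
    n = len(node_feats)
    dim = len(node_feats[0])
    agg = [[0.0] * dim for _ in range(n)]
    deg = [0] * n
    for src, dst in edges:                   # accumulate neighbor features
        for j, v in enumerate(node_feats[src]):
            agg[dst][j] += v
        deg[dst] += 1
    out = []
    for i in range(n):
        msg = [a / deg[i] if deg[i] else 0.0 for a in agg[i]]
        out.append([w_self * s + w_msg * m
                    for s, m in zip(node_feats[i], msg)])
    return out
```

Because the update is defined per node and per edge, the same learned model applies to meshes of different sizes and object shapes, which is what makes GNNs attractive for versatile surrogates.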
We are also preparing to make the outcomes of this R&D available to customers through Fujitsu’s AI Surrogate Model Trial Platform (Kozuchi Research Portal).
FugakuNEXT
The FugakuNEXT project aims to begin operations in 2030. Under the "Made with Japan" concept of leveraging Japan’s strengths while engaging in global development partnerships, we are collaborating with RIKEN and NVIDIA. Our goal is to achieve technical breakthroughs that deliver more than 100× application performance compared to Fugaku, using heterogeneous nodes that tightly integrate CPUs and GPUs.
We also place strong emphasis on sustainability and continuity by building a sustainable software ecosystem, modernizing applications, and implementing advanced energy-efficient operation technologies.
The ultimate goal of the FugakuNEXT ecosystem is to accelerate scientific progress through "AI for Science". By establishing R&D leadership in advanced computational science and AI technologies, and by continuously providing computing resources, we aim to powerfully support the future of science and technology in Japan.

Key Specifications of FugakuNEXT
In our exhibit, we explained how FugakuNEXT improves on the performance of Fugaku. Key points of comparison include:
- Number of nodes: FugakuNEXT will have more than 3,400 nodes, compared to Fugaku’s 158,976 nodes. While Fugaku has many more nodes, each FugakuNEXT node will be far larger and more powerful, as explained below.
- FP64 vector performance: FugakuNEXT’s CPUs will deliver 48 PFLOPS or more, and its GPUs 2.6 EFLOPS or more, in FP64 vector performance. Comparing the GPU portion alone to Fugaku’s FP64 CPU performance of 537 PFLOPS, this represents approximately a 4.9× increase.
- FP16/BF16 matrix performance: For AI workloads such as deep learning, FP16/BF16 matrix performance is particularly important. FugakuNEXT’s GPUs are projected to deliver 150 EFLOPS or more, roughly a 70× increase over Fugaku’s 2.15 EFLOPS of FP16/BF16 CPU performance.
- FP8 matrix performance: FugakuNEXT newly specifies FP8 matrix performance of 3.0 EFLOPS or more on CPUs and 300 EFLOPS or more on GPUs. In addition, GPUs will reach 600 EFLOPS or more for sparse FP8 matrix performance, a crucial metric for state-of-the-art machine learning models.
- Memory capacity: The combined memory capacity across FugakuNEXT’s CPUs and GPUs will exceed 10 PiB, about 2.06× the 4.85 PiB of Fugaku.
- Memory bandwidth: FugakuNEXT’s CPUs will achieve at least 7 PB/s of memory bandwidth, and its GPUs 800 PB/s or more. Compared with Fugaku’s 163 PB/s, this represents about a 4.9× improvement on the GPU side.

Hardware Architecture of FugakuNEXT

The FUJITSU-MONAKA-X processor adopted for FugakuNEXT is a next-generation Arm-based processor optimized for AI and HPC. It is designed to maximize AI-HPC performance through tight integration of CPU and GPU.
For HPC, FUJITSU-MONAKA-X uses a next-generation 3D many-core architecture and a 1.4 nm process, and accelerates workloads with extended SIMD capabilities while maintaining compatibility with existing HPC applications.
For AI acceleration, FUJITSU-MONAKA-X implements Arm Scalable Matrix Extension 2 (SME2)—for the first time in a server-class Arm CPU—working in concert with GPUs to further boost application performance.
In terms of energy efficiency and reliability, FUJITSU-MONAKA-X provides ultra-low-voltage control, enhanced security through confidential computing, and RAS features to deliver high reliability.
CPU and GPU are connected using NVLink Fusion, enabling high-bandwidth, low-latency, memory-coherent access to accelerate integrated AI-HPC workloads. CPUs excel at complex control flows, latency-sensitive tasks, and irregular memory access patterns (for example, simulations, real-time AI, multi-modal processing, signal processing, and database access). GPUs excel at massively parallel data-parallel workloads with regular memory access patterns (for example, large-scale DL/ML training and large-scale graphics processing). By combining these strengths, FugakuNEXT will efficiently execute a wide range of workloads from simulation to AI.

Each FugakuNEXT compute node is built around multiple FUJITSU-MONAKA-X CPUs and NVIDIA GPUs. Within a node, a scale-up network is constructed to tightly couple multiple GPUs and connect CPUs and GPUs, enabling high-performance data transfer and cooperative computation. This is particularly effective for workloads requiring fast intra-node communication, such as large-scale AI model training.
Across nodes, a scale-out network interconnects multiple compute nodes, enabling large-scale distributed computation. This network is crucial for large HPC simulations and scenarios where many AI inference jobs are processed concurrently. By seamlessly integrating these two types of networks, FugakuNEXT maximizes both flexibility and efficiency.
In AI-HPC platforms, unifying scale-up and scale-out is critically important. The FugakuNEXT network is designed to combine both characteristics so that tightly integrated AI and HPC workloads achieve optimal performance. For example, it enables complex workflows, such as analyzing and predicting the results of HPC simulations with AI, or verifying AI-generated models with HPC simulations, to run seamlessly.
Software Technologies Aimed at 100× Application Performance
As mentioned earlier, FugakuNEXT is expected to deliver a dramatic increase in low-precision performance—especially for AI workloads—compared to Fugaku, while the increase in double-precision performance for scientific computing is estimated at around 5×. Yet FugakuNEXT sets an ambitious goal of up to 100× application performance improvement. Achieving this target will require significant software-driven acceleration.
Our exhibit introduced the software technologies we are exploring to realize such acceleration. Here we highlight three especially important technologies.

Leveraging AI to accelerate simulations
While the application optimization techniques described below will be effective, achieving a 100× gain calls for more fundamental approaches. As introduced in the FUJITSU-MONAKA exhibit, Fujitsu is focusing R&D on surrogate models. We are also developing quantization techniques that reduce model size while preserving accuracy.
We believe leveraging such AI technologies is essential to achieving the 100× goal, and we will continue to expand AI application domains and pursue further acceleration.
Harnessing CPU-GPU tightly coupled architecture for application speedup
In FugakuNEXT, GPUs will account for the majority of raw compute performance. However, previous research has shown that application performance can improve significantly when workloads are carefully optimized to exploit both CPUs and GPUs in combination. Making effective use of both will become increasingly important.
Fujitsu is developing compilers, math libraries, and AI frameworks that enable application acceleration on FUJITSU-MONAKA-X alone, and is also conducting R&D on maximizing application performance by combining FUJITSU-MONAKA-X, NVIDIA GPUs, and NVLink Fusion.
Accelerating high-precision operations with low-precision arithmetic
The Ozaki Scheme is a technique that enables high-precision computations to be performed efficiently using low-precision compute units. By applying the Ozaki Scheme, we can leverage the excellent low-precision performance of NVIDIA GPUs for scientific computing.
The Ozaki Scheme is particularly effective for accelerating matrix computations and remains an active research area. Fujitsu is evaluating its effectiveness across a wide variety of applications and exploring its applicability to FUJITSU-MONAKA-X, which itself has strong low-precision performance.
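The core trick can be demonstrated in a few lines. The following is a simplified scalar toy, not the full matrix algorithm: each double is split into slices with short significands, so every pairwise product of slices is exact even on a narrow multiplier, and an accurate summation of the partial products reproduces the full-precision result.

```python
import math

def split_fp64(x, bits=8):
    """Split x into slices whose significands each fit in `bits` bits, so
    any product of two slices is exact even in low-precision arithmetic
    (a simplified scalar version of the idea behind the Ozaki Scheme)."""
    slices = []
    r = x
    while r != 0.0:
        _, e = math.frexp(r)                 # r = m * 2**e, 0.5 <= |m| < 1
        scale = 2.0 ** (e - bits)
        hi = math.floor(r / scale) * scale   # keep only the top `bits` bits
        slices.append(hi)
        r -= hi                              # exact: removes those top bits
    return slices

a, b = math.pi, math.e
sa, sb = split_fp64(a), split_fp64(b)
# Each cross product has at most ~2*bits significand bits, so it is exact;
# accurately summing all cross products recovers the full fp64 product.
prod = math.fsum(x * y for x in sa for y in sb)
```

Here `math.fsum` plays the role of the accurate accumulation step; in the real scheme the slice products are matrix multiplications executed on fast low-precision units.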
These technologies represent our current outlook and may evolve as detailed design progresses and research advances. By continuing R&D and closely tracking the latest research trends, we aim to contribute to achieving the ambitious goals of the FugakuNEXT project.
The Future of AI×HPC Seen from Technical Sessions and Exhibits
Between explaining our exhibits, we attended technical sessions and visited other booths. Below we introduce several topics we found particularly interesting.
Platforms for Optimizing AI×HPC Workflow Execution
In recent years, there has been growing interest in combining AI and HPC to accelerate scientific discovery. Such combined workloads are often implemented and executed as workflows. Unlike traditional workflows consisting solely of large, long-running MPI jobs, modern AI×HPC workflows include tasks of different natures—for example, short-running Python scripts or numerous small AI inference tasks executed concurrently. These differ significantly from conventional HPC jobs.
At SC25, one presentation addressed optimization of such workflows and task execution:
- "Integrating and Characterizing HPC Task Runtime Systems for hybrid AI-HPC workloads" (paper link)
In many systems, the tasks that make up a workflow are treated as jobs and executed through batch job schedulers such as Slurm. However, Slurm (particularly when launching tasks with srun) has limitations in terms of concurrency and launch throughput. For workloads that execute many short-running tasks, task startup latency can dominate overall runtime, leading to underutilization of reserved compute resources and increased total makespan.
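A back-of-the-envelope model makes the problem concrete. This is our own toy model with made-up numbers, not a measurement of Slurm or any other launcher:

```python
def makespan_serial_launch(n_tasks, runtime, launch_overhead):
    """Toy model of a launcher that issues tasks strictly one at a time:
    the last task can only start after (n_tasks - 1) launches, so for many
    short tasks the launch term dominates no matter how many nodes are idle."""
    return (n_tasks - 1) * launch_overhead + runtime

# 10,000 one-second inference tasks with 0.1 s per-task launch overhead
# take ~1000 s of wall time, even though 1,000 concurrent slots could in
# principle finish the 10,000 s of task work in about 10 s.
span = makespan_serial_launch(10_000, 1.0, 0.1)
```

Under these assumed numbers the reserved nodes sit roughly 99% idle, which is exactly the underutilization the paper targets.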
The paper tackled this issue using two task runtime systems called Flux and Dragon, integrated via a framework named RADICAL-Pilot (RP). RP serves as an execution substrate for workflow tools and applications, providing a Python API that can route tasks to appropriate runtimes depending on their characteristics. Flux is used for MPI jobs and other HPC-style tasks, while Dragon is used for large numbers of short inference or scripting tasks. This configuration enables execution that matches the nature of each task within the workflow.
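The routing idea can be sketched as follows. The rule and the task fields here are hypothetical, chosen only to mirror the paper's setup; this is not RADICAL-Pilot's real API:

```python
def route_task(task):
    """Hypothetical routing rule in the spirit of the paper's configuration:
    multi-rank MPI tasks go to the Flux runtime, while single-process
    inference or scripting tasks go to Dragon."""
    return "flux" if task.get("mpi_ranks", 1) > 1 else "dragon"

# A drug-discovery-style mixed workflow routed per task characteristics.
workflow = [{"name": "md_sim", "mpi_ranks": 512},
            {"name": "score_ligand", "mpi_ranks": 1}]
plan = {t["name"]: route_task(t) for t in workflow}
```

The point is the separation of concerns: the workflow author describes tasks, and the execution substrate picks the runtime best suited to each one.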
Using IMPECCABLE, a real-world workflow from the drug discovery domain, the authors showed that this approach reduced workflow makespan by about 30–60% compared to using Slurm (srun) alone. Performance optimization is often discussed in terms of tuning applications themselves, but this research is noteworthy in showing that simply choosing and combining suitable execution runtimes can dramatically shorten the overall execution time of AI×HPC hybrid workloads.
RP is not a workflow description language or workflow engine in itself. Instead, it is designed as an execution substrate underneath workflow tools and applications. It is already supported by workflow tools such as Parsl and has been used in projects funded by the U.S. Department of Energy and the National Science Foundation. By delegating workflow control and dependency management to upper-level tools and focusing solely on execution, RP allows applications to benefit from a general-purpose optimized runtime without each application having to implement its own execution optimization logic. This clear separation of concerns is a notable strength of the approach.
System / Interconnect: Trends in Scale-up and Scale-out Networks
The integration of scale-up and scale-out networks, as envisioned for FugakuNEXT, is also a widely discussed topic in AI×HPC systems in general.
At the Birds of a Feather session “UALink and Ultra Ethernet: Addressing AI Networking Challenges in an Open Ecosystem”, there was an active discussion on standardization efforts for these networks.
Due to the rapid increase in model sizes, bandwidth has become a major bottleneck in AI system networks. In addition, AI workloads rely heavily on GPU-to-GPU synchronization, making low-latency communication essential. Ultra Accelerator Link (UALink) is an effort to standardize an open network specification that meets these requirements. UALink focuses on scale-up networks, connecting many GPUs and presenting them as a single large GPU to software, with a shared address space across GPUs. The session introduced the UALink 200G 1.0 Specification, which was first released in April 2025.
The session also covered Ultra Ethernet (UE), a scale-out network standard being developed by the Ultra Ethernet Consortium. The Ultra Ethernet 1.0 Specification was released in June 2025. Many existing network specifications date back more than 20 years, and with today’s much higher compute capabilities and more complex processing requirements, updated specifications are needed.
The session highlighted several key features of UE 1.0, including:
- Support for lossy operation, which allows controlled data loss to reduce overhead
- Out-of-order delivery and per-packet multipath routing for load balancing
- Built-in security features such as cluster-wide keying and zero-state replay protection
Memory Technologies for High Bandwidth and Low Latency
The session “Energy-efficient Memory Technology for Maximizing Bandwidth and Reducing Latency” discussed the importance of memory and its future direction.
In recent years, the gap between compute performance and memory performance—the so-called “memory wall”—has widened, making memory one of the key determinants of overall system performance. Many applications, including large-language-model inference and HPCG, require high memory bandwidth. However, increasing bandwidth poses a major challenge: the energy consumed by data movement. As bandwidth increases, the energy required for data transfer increases proportionally, pushing up against thermal and energy-efficiency limits. Therefore, improving energy efficiency is essential for further expanding bandwidth.
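The proportionality between bandwidth and data-movement energy can be made concrete with a one-line model. This is our own illustration; the energy-per-bit values are assumed placeholders, not figures from the session or any vendor:

```python
def transfer_power_watts(bandwidth_bytes_per_s, pj_per_bit):
    """Power spent purely on moving data: bytes/s * 8 bits * energy per bit.
    The pJ/bit inputs are illustrative assumptions, not vendor data."""
    return bandwidth_bytes_per_s * 8 * pj_per_bit * 1e-12

# At an assumed 5 pJ/bit, sustaining 1 TB/s costs ~40 W for transfers alone;
# scaling bandwidth 10x without lowering pJ/bit scales this power 10x too.
power = transfer_power_watts(1e12, 5.0)
```

This is why the session framed energy per bit, rather than raw bandwidth, as the quantity that must improve for bandwidth scaling to continue.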
An effective way to reduce data movement energy is to pack everything as closely as possible, driving the adoption of advanced packaging technologies such as wafer-scale integration, 2.5D/3D integration, and chiplets. The goal is to place large-capacity memory near compute units to minimize data movement. However, stacking introduces significant thermal challenges, as higher power density demands advanced cooling solutions.
Latency is another critical issue. Increasing bandwidth without regard to latency can increase the number of in-flight memory operations, requiring large buffers to hide latency and introducing new constraints.
As memory systems evolve, extracting their full performance potential is becoming a cross-cutting challenge across hardware, software, and applications. With memory technologies diversifying from conventional DRAM to HBM and CXL, this represents a major opportunity to fully utilize hardware capabilities. In addition, the discussion raised the issue that focusing solely on computational complexity—such as Big-O notation—is insufficient; algorithms should be reconsidered to explicitly account for bandwidth, memory, and communication constraints.
Conclusion
At the FUJITSU-MONAKA and FugakuNEXT exhibits, many visitors expressed interest in the fact that Fujitsu designs its own CPU microarchitecture. This reinforced our recognition that the technologies we have cultivated over many years of processor development are a key strength.
We also received feedback that FugakuNEXT is drawing attention not only in Japan but also overseas, which is a great encouragement as we move forward with its design and development.
Leveraging the insights gained from technical sessions and other exhibits at SC25, we will continue to advance the development of both FUJITSU-MONAKA and FugakuNEXT from the perspectives of both software and hardware.
- FUJITSU-MONAKA: The new technologies applied to FUJITSU-MONAKA are based on results obtained from a project subsidized by the New Energy and Industrial Technology Development Organization (NEDO).