
Namaskara! We are software engineers on the FUJITSU-MONAKA Software R&D team at Fujitsu Research of India Pvt Ltd (FRIPL). Our goal is to expand and optimize the HPC-AI software ecosystem for Arm CPUs, with a special focus on maximizing performance for FUJITSU-MONAKA, together with our counterparts at Fujitsu Research Japan. Our work spans several software verticals, including databases, machine learning frameworks, deep learning and GenAI frameworks, and confidential computing. Recently, the three of us had the opportunity to present some of our work at the ISC25 conference in Germany. In this article, we share our experience of the event and what we learned from it.
Authors: Nishant Prabhu, Shreyas K Shankar, Divya Kotadiya
1 Introduction
The International Supercomputing Conference (ISC) High Performance 2025 was held in Hamburg, Germany, from June 10 to 13, 2025. The event focuses on the intersection of high-performance computing disciplines, including hardware design, cooling technologies, algorithmic advancements, artificial intelligence, and quantum computing, and attracts more than 3,500 industry professionals, academics, and journalists. A delegation of three members from the FUJITSU-MONAKA team at Fujitsu Research India participated in ISC25 to present three posters accepted into the project poster exhibition. Our posters showcased our team's efforts in expanding the HPC-AI software ecosystem and bridging the performance gap on Arm CPUs, especially for our upcoming 2nm Arm-based processor, FUJITSU-MONAKA.
2 Attendance at ISC25
ISC25 consisted largely of booth exhibits, poster exhibitions, research paper presentations, birds-of-a-feather (BoF) sessions, and talks. More than 100 groups, including Fujitsu Research, organized booth exhibits presenting their products and latest developments to visitors. The exhibits tackled various challenges in the HPC ecosystem, including but not limited to:
- Fast data storage and retrieval systems
- Network and interconnect technologies for HPC clusters
- Cooling and rack technology innovations
- Algorithmic advancements for optimizing hardware utilization
- Hardware design innovations
In addition, 28 research papers and about 30 project posters were accepted, covering topics including:
- System architecture and hardware components
- Programming environments and system software
- Algorithms, methods and performance
- Applications and use-cases
- Machine learning and AI
- Quantum computing
Visitors included industry professionals, students, academics, and journalists with expertise or interest in various sub-domains of HPC. Many startups and self-employed individuals also took part, either as visitors or as exhibitors.
3 Our posters
Despite the advancements in the compute capabilities and power efficiency of Arm-based processors, their software ecosystem has only recently begun to mature. The FUJITSU-MONAKA team at FRIPL has made significant contributions to enabling and optimizing the HPC-AI software ecosystem for Arm CPUs, including compute libraries (oneDAL, oneDNN, OpenBLAS), threading backends (OpenMP, oneTBB), machine learning frameworks (scikit-learn, XGBoost), and deep learning and GenAI frameworks (PyTorch, TensorFlow, OpenVINO, ONNX, llama.cpp), to name a few. The delegation from the FUJITSU-MONAKA team of Fujitsu Research India presented three posters in the project poster exhibition, showcasing some of these recent developments in more detail.
3.1 Enabling vLLM on ARM for scalable LLM inference on resource-constrained servers
[Link to poster]

Poster Summary
Serving LLMs and GenAI models requires efficient handling of compute and memory resources across large volumes of asynchronous requests. vLLM introduced two software-level techniques to achieve this. First, continuous batching dynamically batches incoming requests, evicting completed requests while others continue to be processed, which minimizes the average latency of any individual request. Second, paged attention allocates memory for KV caches in fixed-size blocks, mirroring the memory paging done by operating systems, to minimize memory fragmentation. This combination has delivered throughput improvements of 4-5x on GPUs. However, vLLM was originally designed to work only with x86 CPUs and Nvidia GPUs. This poster showcased our work enabling and optimizing vLLM on Arm CPUs for the PyTorch and OpenVINO backends, yielding throughput improvements of ~1.5x with PyTorch and ~51x with OpenVINO.
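To make the paging analogy concrete, here is a minimal sketch of the idea behind paged KV-cache management. This is illustrative only: the class and method names are our own invention, not vLLM's actual API, and real implementations track per-layer tensors rather than simple counters.

```python
class BlockAllocator:
    """Toy model of a paged KV cache: the cache is split into fixed-size
    blocks, and each sequence keeps a "page table" mapping its logical
    blocks to physical ones, like OS virtual-memory paging."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.page_tables = {}   # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> tokens cached so far

    def append_token(self, seq_id: int):
        """Reserve cache space for one new token of a sequence."""
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:  # last block is full: grab a fresh one
            block = self.free_blocks.pop()
            self.page_tables.setdefault(seq_id, []).append(block)
        self.lengths[seq_id] = n + 1

    def free(self, seq_id: int):
        """Evict a finished request, returning its blocks to the pool."""
        self.free_blocks.extend(self.page_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because blocks are uniform and recycled on eviction, a completed request's memory is immediately reusable by new requests, which is what lets continuous batching keep the hardware saturated without fragmenting the cache.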
Audience interaction and feedback
The audience consisted largely of students and academics working on AI technologies, along with industry professionals curious about vLLM and our work in general. The most common reaction was surprise that a framework as important as vLLM was not yet available to the Arm ecosystem. Attendees were also interested in more technical details, including performance comparisons between our work on Arm and other CPU architectures (e.g., x86), the rationale for choosing the PyTorch and OpenVINO backends specifically, and the future of the project. Overall, the feedback was quite positive.
3.2 Optimizing Matrix Math: Batch-Reduced GEMM (BRGEMM) for Accelerated Deep Learning on ARM HPC Systems
[Link to poster]

Poster Summary
This poster presents our work on developing a high-performance BRGEMM kernel using Arm’s Scalable Vector Extension (SVE). BRGEMM is a crucial algorithm for accelerating matrix multiplications — a core operation in models like Transformers and large language models (LLMs). We implemented this kernel within oneDNN, an open-source performance library used by popular frameworks such as PyTorch and TensorFlow. Our optimized kernel delivers 1.2×–1.4× speedup at the kernel level and up to 3× inference acceleration for models like ResNet50, Whisper, T5, and LLaMA on Arm-based platforms.
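For readers unfamiliar with BRGEMM, the arithmetic contract is easy to state: accumulate a batch of small matrix products into a single output tile. The sketch below is a NumPy reference for that contract only; the actual kernel in our work is written with SVE intrinsics inside oneDNN, where the output tile stays resident in vector registers while the A/B batch streams through.

```python
import numpy as np

def brgemm_reference(a_batch, b_batch, c):
    """Batch-reduced GEMM: C += sum_i A[i] @ B[i].

    a_batch : (batch, M, K) array
    b_batch : (batch, K, N) array
    c       : (M, N) accumulator, updated in place and returned

    Keeping one shared accumulator across the batch (rather than doing
    `batch` independent GEMMs) is what reduces loop overhead and improves
    memory locality in real BRGEMM kernels.
    """
    for a, b in zip(a_batch, b_batch):
        c += a @ b
    return c
```

Workloads such as convolutions lowered to matrix form, or attention blocks in Transformers, naturally produce many small GEMMs sharing an accumulator, which is why this primitive maps well onto deep learning inference.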
Questions Asked
During the poster session, attendees raised several thoughtful technical questions. A common question was why we focused on deep learning on CPUs instead of GPUs. We discussed scenarios where CPUs are preferable due to availability, power efficiency, or deployment constraints. Another key question was about the reasoning behind batching matrices and processing them together — we explained how BRGEMM reduces overhead and improves memory locality. Attendees were also curious if the kernel was written using intrinsics (yes, to exploit SVE directly), and what compute library the baseline comparisons were based on (we used the default oneDNN implementation as reference).
Who Attended
The poster drew attention from both academia and industry. Notably, researchers and professors from the Alan Turing Institute showed interest in the implications of this work for scalable AI on heterogeneous hardware. Industry professionals from leading companies such as Arm and IBM also engaged in detailed discussions, sharing insights and expressing interest in collaborative possibilities to further enhance performance across hardware and software layers.
Feedback Received
The feedback we received was encouraging and constructive. While many appreciated the significant software-level optimization, some suggested exploring hardware-level enhancements in future iterations — such as adding architectural support for matrix operations to complement BRGEMM. The poster was well-received for its practical impact, with several attendees acknowledging the importance of such optimizations in driving forward deep learning workloads on Arm HPC systems.
3.3 Accelerating XGBoost on ARM CPUs with Scalable Vector Extension for High-Performance Data Science
[Link to poster]

Poster Summary
This poster presents the acceleration of XGBoost, a widely used gradient boosting library known for its scalability and performance on structured data. In this work, we accelerate XGBoost on ARM CPUs by incorporating ARM Scalable Vector Extension (SVE), a SIMD architecture designed for high-performance and energy-efficient computing. We focus on optimizing the histogram building kernel—one of the most compute-intensive components—by integrating key SVE features such as vector-length agnostic programming, predication, and masked memory operations. Our SVE-optimized implementation achieves up to 2× speedup on Graviton3 CPUs compared to the baseline. This highlights the potential of SVE to significantly enhance XGBoost’s performance on ARM platforms, paving the way for efficient ML training in HPC and edge environments.
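The histogram-building kernel at the heart of this optimization can be summarized in a few lines. The NumPy sketch below shows the scalar semantics only, with illustrative names of our own choosing; the SVE version vectorizes the scatter-add using vector-length agnostic loops and predicated (masked) loads and stores to handle tail elements without a scalar cleanup loop.

```python
import numpy as np

def build_histogram(bin_idx, grad, hess, n_bins):
    """Accumulate per-bin gradient statistics for one feature.

    bin_idx : int array, pre-binned feature value for each row
    grad    : float array, per-row gradients
    hess    : float array, per-row hessians

    The scatter-add below (g_hist[bin_idx[i]] += grad[i]) is the
    memory-bound inner loop that the SVE kernel accelerates.
    """
    g_hist = np.zeros(n_bins)
    h_hist = np.zeros(n_bins)
    np.add.at(g_hist, bin_idx, grad)  # unbuffered scatter-add
    np.add.at(h_hist, bin_idx, hess)
    return g_hist, h_hist
```

Tree learners then scan these per-bin sums to evaluate candidate split gains, so speeding up histogram construction directly shortens each boosting round.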
Highlights from Academic and Industry Conversations
The poster attracted significant interest from both academic and industry attendees. Researchers and professors from institutions such as the Alan Turing Institute and the University of Cambridge engaged with the work, particularly intrigued by its broader implications for scientific computing. Industry professionals, including engineers from Arm, contributed to in-depth discussions around implementation details. Key areas of curiosity included the nature and functionality of SVE intrinsics, performance behavior on smaller datasets, and the practical applications of XGBoost in real-world scenarios. The feedback highlighted both the technical relevance and broader applicability of this optimization effort.

4 Other relevant and interesting posters
We found many other research and project posters displayed during the event discussing interesting developments. We present some selected works below which we found relevant to our work.
4.1 DisCostiC: Digital Twin Performance Simulations Unlocking Hardware-Software Interplay (research poster award winner)
The poster challenges the common hypothesis that the total runtime of parallel applications can be accurately predicted by simply summing computation and communication times using analytical models such as Roofline, ECM, Hockney, or LogGP. In practice, actual runtimes often diverge from these predictions due to complex hardware-software interactions, system noise, and overlapping resource usage, which the traditional models fail to capture. Existing techniques rely on trace-based analysis or real-system execution, both of which introduce ambiguity and noise that obscure the true causes of performance variation.
To address this, the poster proposes DisCostiC, a full-scale simulator built on first-principles analytic models spanning all system hierarchies—from CPU cores and ccNUMA regions to networked clusters. Unlike trace-based approaches, DisCostiC uses DSEL-based application skeletons to model inter-process dependencies, enabling precise simulation of parallel program behavior without executing on real hardware. It offers capabilities for performance prediction, system design exploration, and bottleneck identification, with high accuracy and efficiency. This makes it a powerful tool for scalable studies, helping developers understand and optimize application behavior across complex, multi-layered HPC systems.
4.2 Design and Implementation of a GPU-Aware MPI Collective library for Intel GPUs
The poster introduces a GPU-aware MPI collective library designed to optimize communication between CPUs and Intel GPUs, a critical aspect of heterogeneous computing. It focuses on two primary classes of operations: data movement, which handles efficient transfer of data across devices, and reductions, which combine values from different processes or devices. Both have been carefully optimized to minimize latency and maximize throughput, and these improvements drive the performance gains demonstrated in their results, particularly in large-scale, GPU-accelerated HPC workloads.
4.3 pyGinkgo: A Sparse Linear Algebra Operator Framework For Python
The poster presents a sparse matrix multiplication kernel developed as part of the Ginkgo library, written in C++ with support for both CPU and GPU architectures. The focus is on efficient Sparse Matrix-Vector Multiplication (SpMV), a critical operation in many scientific and engineering applications. The team benchmarked their kernel against well-known libraries like CuPy and SciPy, demonstrating superior performance and scalability across hardware platforms. Notably, their results on Intel Xeon Platinum 8368 CPUs and NVIDIA A100 GPUs show significant improvements, highlighting the kernel’s portability and optimization across heterogeneous systems.
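For context, the SpMV operation being benchmarked has a simple reference form when the matrix is stored in compressed sparse row (CSR) format. The sketch below shows that baseline pattern; libraries like Ginkgo parallelize and vectorize this row loop per backend, and the function here is our own illustration rather than Ginkgo code.

```python
import numpy as np

def csr_spmv(indptr, indices, data, x):
    """Sparse matrix-vector product y = A @ x, with A in CSR form.

    indptr  : row pointers; nonzeros of row i live in [indptr[i], indptr[i+1])
    indices : column index of each nonzero
    data    : value of each nonzero
    """
    n_rows = len(indptr) - 1
    y = np.zeros(n_rows)
    for i in range(n_rows):
        start, end = indptr[i], indptr[i + 1]
        # dot product of row i's nonzeros with the gathered entries of x
        y[i] = data[start:end] @ x[indices[start:end]]
    return y
```

The irregular gather `x[indices[...]]` is what makes SpMV memory-bound and hardware-sensitive, which explains why portable, per-architecture tuning of this kernel pays off.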
4.4 The EUPILOT: Pilot using Independent Local & Open Technologies
The poster highlights efforts toward building a sustainable HPC ecosystem powered by Arm-based architectures. It showcases how software optimizations, such as the BRGEMM kernel integrated into oneDNN, align with broader goals of energy-efficient AI computing. Interestingly, the software stack they aim to support closely mirrors ours, indicating strong alignment in ecosystem direction. Their use of 12nm chips further underscores a focus on balancing performance with power efficiency, reinforcing the sustainability theme across both hardware and software layers.
4.5 OEHI Undertakings-RISC Matrix Extensions and ARM yourself for the Compute-Continuum
The poster primarily focuses on three key objectives to strengthen the HPC ecosystem with SME (Scalable Matrix Extension) support. First, it aims to expand the availability of software tools and environments that enable easier development and testing using SME capabilities. Second, it seeks to build a global knowledge base to educate and showcase the performance and efficiency benefits of SME in real-world applications. Lastly, it emphasizes the importance of creating an active, collaborative HPC-SME community that brings together researchers, developers, and industry partners to drive innovation and adoption.
5 Conclusion
Our participation broadened our perspective on advancements in HPC, spanning not only software innovations but also hardware development. The event also presented an opportunity for Fujitsu to introduce our cutting-edge software innovations to the HPC community, alongside the hardware innovations behind products that have topped benchmarks for several years. We hope to participate in ISC at a larger scale in the coming years, as we find it an excellent platform for learning and for networking with the wider HPC community, with whom we can work together to make HPC-AI more capable, sustainable, and accessible.
Acknowledgement
This article is based on results obtained from a project subsidized by the New Energy and Industrial Technology Development Organization (NEDO).