
fltech - Technology Blog of Fujitsu Research

A technology blog where Fujitsu researchers talk about a variety of topics

FUJITSU-MONAKA team's presence in PyTorch Conference 2025

Namaskara! We are members of the FUJITSU-MONAKA Software R&D team at Fujitsu Research of India Pvt Ltd (FRIPL). Our unit is dedicated to advancing and optimizing the High-Performance Computing (HPC) and Artificial Intelligence (AI) software ecosystem for Arm CPUs. A significant part of our focus is geared towards maximizing performance for FUJITSU-MONAKA, a collaborative effort with our esteemed colleagues at Fujitsu Limited Japan. Our expertise spans a comprehensive range of software verticals, including databases, machine learning frameworks, deep learning and Generative AI frameworks, as well as confidential computing.

Recently, our team had the distinct honor of presenting a poster at the PyTorch Conference 2025 in San Francisco, US. One of our members attended in person to present the poster, which was authored by all three of us. In this blog post, we are pleased to share our experiences from this prestigious event and the invaluable insights we garnered.

Authors: N Maajid Khan, Devang Choudhary & Abhishek Jain

1. Introduction

The PyTorch Conference 2025, held in San Francisco, USA on 22–23 October 2025, is recognized as one of the largest global conferences for the deep learning community. The event brought together 3,432 developers, researchers, and innovators from 1,026 organizations across industry, academia, and open-source communities.

The FUJITSU-MONAKA Software R&D team, based at Fujitsu Research of India (FRIPL), presented its work at the prestigious PyTorch Conference 2025, detailing efforts to optimize and expand the HPC-AI software ecosystem for Arm CPUs. This initiative directly supports Fujitsu's upcoming 2nm Arm-based FUJITSU-MONAKA processor, a next-generation chip driven by architectural innovation from Fujitsu’s Advanced Technology Development Unit (ATDU), Japan. The project highlights Fujitsu Research's ongoing dedication to open-source AI frameworks and to enabling high-performance, portable AI across Arm computing platforms.

2. Highlights from the PyTorch Conference 2025

The PyTorch Conference 2025 in San Francisco brought together contributors across the AI ecosystem—framework and compiler engineers, researchers, hardware architects, and industry practitioners. The event featured keynotes, technical talks, poster exhibitions, hands-on workshops, and BoF sessions, covering the latest developments across the PyTorch stack.

The technical sessions focused on several important themes, including:

  • Large-scale model training and distributed systems
  • High-performance inference and quantization
  • Compiler technologies such as TorchDynamo, TorchInductor, and AOT Autograd
  • Edge and mobile deployment with ExecuTorch
  • Hardware acceleration across CPUs, GPUs, and NPUs
  • Model compression, PTQ, QAT, and scalable LLM serving
  • Efficient MoE architectures and responsible AI practices
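Several of the themes above revolve around post-training quantization (PTQ), in which a trained FP32 model is calibrated on sample data to derive quantization parameters. As a conceptual illustration only (not the PyTorch PT2E API; the class and method names below are made up for this sketch), a minimal min/max observer that derives an affine scale and zero-point for unsigned INT8 might look like this:

```python
# Conceptual PTQ sketch: a min/max observer records the activation range
# seen during calibration and derives affine INT8 (0..255) parameters.
# Illustrative only -- real frameworks use observer/quantizer classes
# with many more options (per-channel scales, histogram observers, etc.).

class MinMaxObserver:
    def __init__(self):
        self.min_val = float("inf")
        self.max_val = float("-inf")

    def observe(self, batch):
        # Track the running min/max over all calibration batches.
        self.min_val = min(self.min_val, min(batch))
        self.max_val = max(self.max_val, max(batch))

    def qparams(self):
        # Include 0.0 in the range so zero is exactly representable.
        lo, hi = min(self.min_val, 0.0), max(self.max_val, 0.0)
        scale = (hi - lo) / 255.0 or 1.0
        zero_point = round(-lo / scale)
        return scale, zero_point

obs = MinMaxObserver()
for batch in ([0.1, 0.9, -0.2], [1.5, 0.4, -0.6]):  # calibration data
    obs.observe(batch)
scale, zero_point = obs.qparams()
```

QAT differs in that fake-quantization is inserted during training so the model learns to compensate for rounding error, rather than deriving parameters after the fact.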

Poster presentations were organized into domains such as Compilers & Kernels, Performance, Hardware Acceleration, Applied AI, Large Models, and Responsible AI. Participants included contributors from Meta AI, NVIDIA, AMD, Arm, Microsoft, Amazon, Intel, Google, Hugging Face, and several leading research labs and startups.

3. Poster Presentation by Fujitsu Research

Fujitsu Drives Arm HPC-AI Evolution: Scaling Software for Next-Gen Performance

As Arm processors achieve rapid compute and energy efficiency gains, the FUJITSU-MONAKA team at Fujitsu Research of India (FRIPL) is aggressively maturing the software ecosystem around them. The team's contributions span optimizing critical HPC-AI software components for Arm CPUs, including compute libraries (oneDAL, oneDNN, OpenBLAS), threading backends (OpenMP, oneTBB), ML frameworks (scikit-learn, XGBoost), and leading deep learning and GenAI frameworks such as PyTorch, TensorFlow, OpenVINO, ONNX Runtime, and llama.cpp. At the recent PyTorch Conference 2025, the team proudly showcased its latest progress in this foundational effort via an impactful poster presentation.

3.1 Efficient INT8 Inference on Arm: Leveraging PyTorch 2 Export Quantization

Poster Summary

This poster presents a comprehensive effort to bring high-performance INT8 acceleration to Arm CPUs through the PyTorch 2 Export (PT2E) pipeline. The work extends PT2E quantization support, previously focused on x86, to Arm by integrating oneDNN JIT and Arm Compute Library (ACL) INT8 kernels. A new ArmInductorQuantizer enables quantization recipes, fusion patterns, and both PTQ and QAT workflows. Arm-specific lowering in torch.compile maps quantized ops to optimized INT8 kernels, and new oneDNN INT8 BRGEMM kernels improve matmul and convolution performance. Benchmark results on AWS Graviton3E demonstrate up to 2.1× inference speedup over FP32 with minimal accuracy loss on models such as BERT, T5, ResNet50, and ViT. This enables high-performance, portable INT8 inference across Arm cloud and edge platforms, and advances scalable AI deployment on next-generation Arm architectures including FUJITSU-MONAKA.
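To make the accuracy-vs-speed trade-off concrete, here is a pure-Python sketch of the arithmetic behind an INT8 matmul: both inputs are quantized to signed INT8, products are accumulated in a wide (int32-style) accumulator, and the result is dequantized once at the end. This only mirrors the idea behind oneDNN's INT8 GEMM kernels conceptually; the real kernels are vectorized, fused, and far more sophisticated than these loops.

```python
# Illustrative INT8 matmul: symmetric per-tensor quantization, integer
# accumulation, and a single dequantization of the output. Conceptual
# sketch only -- not the oneDNN or ACL implementation.

def scale_for(mat):
    # Symmetric per-tensor scale for signed INT8 (-127..127).
    amax = max(abs(v) for row in mat for v in row) or 1.0
    return amax / 127.0

def quantize(mat, scale):
    return [[max(-127, min(127, round(v / scale))) for v in row] for row in mat]

def int8_matmul(a, b):
    sa, sb = scale_for(a), scale_for(b)
    qa, qb = quantize(a, sa), quantize(b, sb)
    n, k, m = len(qa), len(qb), len(qb[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0                       # int32-style accumulator
            for t in range(k):
                acc += qa[i][t] * qb[t][j]
            out[i][j] = acc * sa * sb     # dequantize the result
    return out

a = [[0.5, -1.0], [2.0, 0.25]]
b = [[1.5, 0.0], [-0.5, 1.0]]
y = int8_matmul(a, b)  # close to the FP32 product [[1.25, -1.0], [2.875, 0.25]]
```

The speedup in practice comes from the integer inner loop: INT8 operands quadruple the useful width of each SIMD register relative to FP32 and halve memory traffic versus FP16, which is what the BRGEMM kernels exploit.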

Actual Poster presented at the conference:

Audience interaction and feedback

The poster attracted strong interest from members of the PyTorch quantization and compiler teams, along with industry practitioners and researchers engaged in performance optimization. Feedback emphasized the value of integrating INT8 kernels into the PT2E workflow and extending performance portability toward Arm-based systems. Additional suggestions included expanding kernel coverage, enhancing calibration tools for PTQ, and improving ONNX export compatibility for Arm architectures.

4. Conclusion

Participation in the PyTorch Conference 2025 provided valuable insight into the rapid evolution of AI systems, model optimization techniques, compiler advancements, and scalable deployment strategies. The conference highlighted PyTorch’s continued evolution toward unified, modular, and hardware-aware frameworks that enhance the efficiency and accessibility of large AI models.

For Fujitsu Research, the event served as an opportunity to share progress in Arm-based performance optimization, INT8 acceleration, and scalable inference systems aligned with future hardware such as FUJITSU-MONAKA. Continued collaboration with the PyTorch community will further advance open-source innovation and contribute to the development of high-performance and efficient AI systems.

Acknowledgement

This article is based on results obtained from a project subsidized by the New Energy and Industrial Technology Development Organization (NEDO).