
Scaling Graph AI to Billion-sized Graphs - fltech - Technology Blog of Fujitsu Research


Scaling Graph AI to Billion-sized Graphs

Hello. We are Mohit Meena, Yash Punjabi, and Mahesh Chandran from the Artificial Intelligence Research Laboratory at Fujitsu Research of India (FRIPL). We are excited to share our latest research focused on addressing scalability challenges in large-scale Graph AI.

Nowadays, graphs are ubiquitous. They form the backbone of many real-world systems we interact with every day. Social media platforms connect billions of users through friendship networks. E-commerce platforms represent users, products, and interactions as interconnected structures. Knowledge graphs link entities across the web at a global scale. Fig. 1 illustrates a large graph, with finer topological details becoming visible as we zoom in.

Fig. 1: A snapshot of a 200k-node, 7B-edge graph visualized using our in-house developed tool

Despite this widespread presence, learning from such graphs is far from straightforward. When graphs grow to billions of nodes and edges, many assumptions that work in academic settings begin to break down. As we worked with such giant graphs, two fundamental questions kept surfacing.

The first question concerns the feasibility of training graph AI models at scale under limited memory and time constraints. In practice, memory becomes a bottleneck long before model capacity is reached; even loading node features for large graphs can exceed available system memory, preventing training altogether. This makes resource-efficient training pipelines a fundamental requirement.

The second question is about modeling choice. Even if training becomes possible, should we rely on a single Graph Neural Network (GNN) or sophisticated architecture across the entire graph? Real-world graphs are structurally very diverse. Some regions are dense, others sparse. Some are homophilic, others highly heterophilic. Expecting one model to perform well everywhere is a strong assumption, and it fails in most cases.

This blog brings these two questions together. We present them not as isolated challenges, but as complementary parts of a larger story.


1. Breaking the Memory Barrier in Large-Scale Graph Learning

When working with large graphs in production, the first problem is rarely about choosing a better model. It is about getting the system to run at all. Real-world graphs can easily scale to billions of nodes and edges, and at that point memory becomes the first hard limit. Long before a GNN reaches its modeling potential, training often breaks down simply because the system cannot hold all node features in memory.

In domains such as fraud detection, this challenge is especially pronounced, as node features may include rich, high-dimensional information such as user profiles and transaction histories, which can quickly overwhelm available resources. This is a common pain point in production pipelines, where hardware is finite and systems must scale reliably under real-world constraints.

Most existing graph learning pipelines assume that node features are fully resident in memory. This assumption works for small and medium-sized graphs, but it breaks down at scale. Once the feature store grows beyond available GPU or CPU memory, training stalls. Techniques such as sampling or distributed execution can stretch the limit, but they do not remove the core dependency between graph size and memory capacity.

To address this, we redesigned the pipeline around a simple idea. Node features should be treated as data that can be fetched when needed, not as something that must always stay in memory.

Fig. 2: Overview of Pipeline for a fraud detection use-case

As shown in Fig. 2, the pipeline takes a simple but effective approach. The graph structure is kept in memory, which requires minimal space since it only stores sparse connectivity information. In contrast, high-dimensional node and edge features are stored in a disk-backed feature store, using lightweight databases such as SQLite or PostgreSQL, indexed by node or edge identifiers.

During training, only the features needed for the current mini-batch are fetched on demand. A dedicated cache layer keeps frequently accessed features readily available, ensuring fast retrieval while keeping overall memory usage under control.
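As a concrete illustration, here is a minimal sketch of such a disk-backed feature store, assuming SQLite with a single `features` table and an LRU cache in front of it. The table layout, cache size, and API are illustrative assumptions, not the actual pipeline.

```python
import sqlite3
from array import array
from functools import lru_cache

class DiskFeatureStore:
    """Sketch: features live on disk in SQLite; only requested rows enter memory."""

    def __init__(self, db_path):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS features (node_id INTEGER PRIMARY KEY, vec BLOB)"
        )

    def put(self, node_id, vec):
        # Store the feature vector as a packed float32 blob, keyed by node id.
        self.conn.execute(
            "INSERT OR REPLACE INTO features VALUES (?, ?)",
            (node_id, array("f", vec).tobytes()),
        )
        self.conn.commit()

    @lru_cache(maxsize=100_000)  # cache layer; assumes features are static during training
    def get(self, node_id):
        row = self.conn.execute(
            "SELECT vec FROM features WHERE node_id = ?", (node_id,)
        ).fetchone()
        vec = array("f")
        vec.frombytes(row[0])
        return list(vec)

    def fetch_batch(self, node_ids):
        # Only the features for the current mini-batch are materialized in memory.
        return [self.get(n) for n in node_ids]
```

With this shape, memory usage is bounded by the cache size rather than by the total number of nodes.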

Importantly, this pipeline naturally enables more granular optimization strategies. Feature access can be further improved through efficient memory mapping, while multiple cache tiers can be introduced to prioritize frequently accessed data. For instance, nodes or edges that are accessed more often, such as high-degree or hub nodes, can be placed in faster caches. These adaptive caching and access policies enhance data locality and throughput, all while preserving the memory-efficient nature of the pipeline.
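One possible sketch of such a tiered policy, assuming node degrees are known up front: hub nodes stay pinned in a fast tier, while everything else passes through a bounded LRU tier. The threshold, capacity, and two-tier layout are illustrative assumptions.

```python
from collections import OrderedDict

class TieredCache:
    """Sketch of a degree-aware, two-tier feature cache."""

    def __init__(self, degrees, hub_threshold=100, cold_capacity=1000):
        self.degrees = degrees            # node_id -> degree
        self.hub_threshold = hub_threshold
        self.hot = {}                     # pinned tier for high-degree hub nodes
        self.cold = OrderedDict()         # LRU tier for everything else
        self.cold_capacity = cold_capacity

    def put(self, node_id, vec):
        if self.degrees.get(node_id, 0) >= self.hub_threshold:
            self.hot[node_id] = vec       # hubs stay resident
        else:
            self.cold[node_id] = vec
            self.cold.move_to_end(node_id)
            if len(self.cold) > self.cold_capacity:
                self.cold.popitem(last=False)  # evict least recently used

    def get(self, node_id):
        if node_id in self.hot:
            return self.hot[node_id]
        if node_id in self.cold:
            self.cold.move_to_end(node_id)  # refresh recency
            return self.cold[node_id]
        return None                         # caller falls back to the disk store
```

A miss in both tiers would fall through to the disk-backed store described above.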

This design makes a practical difference. Graph size is no longer tied to available memory, allowing the same pipeline to scale from millions to billions of nodes without architectural changes. Moreover, the pipeline is model independent. Once the system is in place, different GNN models can be trained and evaluated on the same large graph with minimal overhead.

At this point, one major barrier to large scale graph learning is removed. We can train models reliably, efficiently, and at scale. But this leads to the next question.

If we can efficiently train multiple models on the same large graph, do we really need to rely on just one?

Real graphs are far from uniform, and this realization sets the stage for our work, termed SAGMM, where the system dynamically selects, weighs, and prunes models from a predefined expert pool based on the graph structure.


2. Self-Adaptive Graph Mixture of Experts (SAGMM)

arXiv paper link: https://arxiv.org/abs/2511.13062

Code link: https://github.com/ast-fri/SAGMM

As we began experimenting with large real-world graphs, one observation kept surfacing. There was no single GNN architecture that worked best everywhere. A model that performed extremely well on one dataset often struggled on another dataset or task. Even more surprisingly, within the same graph, we noticed that different regions seemed to favor different modeling assumptions.

At first, this felt counterintuitive. The standard workflow in graph learning is to pick one GNN architecture after extensive trial and error and apply it uniformly across the entire graph. But real graphs are not uniform. Some regions are smooth and homophilic, others are sparse or noisy, and some contain highly irregular connectivity patterns. Expecting one inductive bias to handle all of this equally well is a strong assumption.

This raised a simple but powerful question for us.

Instead of asking which single GNN to use, what if different parts of the graph could use different GNNs?

This question led directly to the development of our SAGMM framework, which has been accepted for publication in the Main Technical Track of the Association for the Advancement of Artificial Intelligence Conference (AAAI) 2026, one of the top conferences in AI.

In this section, we explain the core idea behind SAGMM in an intuitive way. Fig. 3 presents an overview of the SAGMM framework, illustrating its three key components: a routing mechanism that determines expert selection, a diverse pool of expert models, and an adaptive pruning strategy that improves efficiency by removing less useful experts. Fig. 4 complements this overview by systematically summarizing the core contributions of SAGMM and highlighting the distinctive features associated with each component.

Fig. 3: The overall illustration of the SAGMM framework.

Fig. 4: Highlights of Core Contributions of SAGMM Framework


From Single Models to Diverse Pool of Models

To understand why mixtures of GNNs are useful, it helps to step back and look at how most GNNs work under the hood. At a high level, GNNs update a node’s representation by aggregating information from its neighbors and passing it through a learnable transformation and a nonlinearity. The general update rule is shown in the equation below, where each node collects messages from its neighbors using a convolution or aggregation operator and updates its embedding layer by layer.
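In standard message-passing notation, this update can be sketched as follows (a generic reconstruction; the exact symbols in the original equation may differ):

```latex
h_v^{(l+1)} = \sigma\!\left( W^{(l)} \cdot \mathrm{AGG}\left( \left\{ h_u^{(l)} : u \in \mathcal{N}(v) \right\} \cup \left\{ h_v^{(l)} \right\} \right) \right)
```

Here \(\mathcal{N}(v)\) is the neighborhood of node \(v\), \(\mathrm{AGG}\) is the architecture-specific aggregation operator, \(W^{(l)}\) is the learnable transformation of layer \(l\), and \(\sigma\) is a nonlinearity.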

While this equation looks generic, the key difference between popular GNN architectures lies in how the aggregation term is defined.

In spectral-based models such as GCN [1] and GraphCNN [2], the aggregation weights are fixed and derived directly from the graph structure, typically using a normalized adjacency matrix or Laplacian. This works well when neighboring nodes are similar, as in homophilic graphs, but it treats all neighbors uniformly.

In contrast, spatial and aggregation-based models such as GraphSAGE [3], GIN [4], and GAT [5] take a different approach. Instead of fixed aggregation, they learn how to combine neighbor information using functions like mean, sum, or attention. In these models, the aggregation weights are learned from data, allowing the model to adaptively decide how much each neighbor should contribute. This flexibility makes them more robust to noisy, sparse, or heterogeneous neighborhoods and enables inductive learning.
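To make the contrast concrete, here is an illustrative, dependency-free sketch of two aggregation choices over a node's neighbor embeddings: a uniform mean (GraphSAGE-style) and a softmax-weighted combination (GAT-style, heavily simplified; real GAT uses learned attention parameters rather than a raw dot product).

```python
import math

def mean_aggregate(neighbor_vecs):
    # Every neighbor contributes equally, regardless of content.
    dim = len(neighbor_vecs[0])
    return [sum(v[d] for v in neighbor_vecs) / len(neighbor_vecs) for d in range(dim)]

def attention_aggregate(node_vec, neighbor_vecs):
    # Score each neighbor against the target node, then softmax the scores
    # so more relevant neighbors contribute more.
    scores = [sum(a * b for a, b in zip(node_vec, v)) for v in neighbor_vecs]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(node_vec)
    return [sum(w * v[d] for w, v in zip(weights, neighbor_vecs)) for d in range(dim)]
```

The mean treats all neighbors uniformly, while the attention variant lets the target node downweight dissimilar neighbors, which is exactly the kind of inductive-bias difference discussed above.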

The important takeaway is that each GNN architecture makes a different design choice about how information flows through the graph. These choices translate into different inductive biases. Some models favor smooth, local neighborhoods. Others emphasize adaptive weighting or long-range interactions. No single choice is universally optimal.

This observation is what naturally leads to SAGMM. Instead of committing to one aggregation rule or one inductive bias everywhere, SAGMM treats each GNN as an expert that implements a particular version of the update rule. Mathematically, as shown in the mixture-of-experts formulation below, the final node representation h_v is computed as a weighted combination of expert-specific message passing outputs, where the routing weights g(v, e_i) are learned dynamically.
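The mixture-of-experts formulation referenced above can be sketched in the text's own symbols as:

```latex
h_v = \sum_{i=1}^{K} g(v, e_i)\, f_{e_i}(v)
```

where \(K\) is the number of experts in the pool.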

Here, f_{e_i} is the expert-specific message passing function. Intuitively, you can think of this as each node asking multiple experts for advice but only listening to the ones that are most relevant.


Topology Aware Attention Gating and Adaptive Expert Pruning

Selecting the right experts is just as important as having a diverse pool. Many existing MoE approaches rely on fixed top-k selection, which introduces sensitivity and requires careful tuning. As shown in Fig. 5, performance can vary significantly depending on this choice, and most nodes end up activating the same number of experts regardless of their structural complexity.

Fig. 5: (a) Performance variation across different Top-k values in GMoE-GCN for various datasets. (b) Distribution of expert activation counts by SAGMM for the ogbn-proteins dataset.

To address this, we introduce Topology Aware Attention Gating (TAAG). Instead of relying only on node features, TAAG incorporates both local neighborhood information and global structural signals when making routing decisions. TAAG also uses a learnable threshold to decide how many experts a node actually needs. Simple nodes may activate a single expert, while structurally complex nodes may draw from multiple experts. This leads to sparse, node-wise routing that is both stable and efficient.
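A heavily simplified sketch of threshold-based sparse gating in this spirit is shown below. The real TAAG conditions its scores on node features plus local and global structural signals and learns the threshold; here both the scores and the threshold are plain inputs, purely for illustration.

```python
import math

def sparse_gate(expert_scores, threshold):
    """Softmax over expert scores, then keep only experts above a threshold."""
    m = max(expert_scores)
    exps = [math.exp(s - m) for s in expert_scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Keep only experts whose gate weight clears the (learned) threshold.
    kept = {i: w for i, w in enumerate(weights) if w >= threshold}
    if not kept:  # always activate at least the top-scoring expert
        top = max(range(len(weights)), key=lambda i: weights[i])
        kept = {top: weights[top]}
    # Renormalize so the active experts' weights sum to one.
    norm = sum(kept.values())
    return {i: w / norm for i, w in kept.items()}
```

A structurally simple node with one dominant score activates a single expert, while a node with several comparable scores draws from multiple experts, which is the node-wise sparsity described above.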

As training progresses, not all experts remain equally useful. Some consistently contribute very little across nodes. Rather than keeping them indefinitely, SAGMM tracks expert importance and prunes underperforming experts at scheduled intervals. The pruning update formulation is listed below.

The key idea is simple:

  • Dynamically estimate expert importance based on historical contribution
  • Gradually remove persistently underutilized experts
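The two steps above can be sketched as follows, assuming importance is tracked as an exponential moving average of each expert's gate contribution. The decay and threshold values are illustrative, not those from the paper.

```python
class ExpertPruner:
    """Sketch: EMA-based expert importance tracking with scheduled pruning."""

    def __init__(self, num_experts, decay=0.9, threshold=0.05):
        self.importance = [1.0 / num_experts] * num_experts
        self.active = set(range(num_experts))
        self.decay = decay
        self.threshold = threshold

    def update(self, batch_gate_weights):
        # batch_gate_weights[i]: average gate weight of expert i on this batch.
        for i in self.active:
            self.importance[i] = (
                self.decay * self.importance[i]
                + (1 - self.decay) * batch_gate_weights[i]
            )

    def prune(self):
        # Called at scheduled intervals: drop persistently underutilized experts.
        removed = {i for i in self.active if self.importance[i] < self.threshold}
        self.active -= removed
        return removed
```

Pruned experts are no longer evaluated, which is where the memory and inference savings come from.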

This reduces memory and inference cost without sacrificing performance, making the framework more efficient and sustainable. The detailed TAAG and pruning equations, along with the full algorithm, are provided in the paper for interested readers.


Frozen Experts and Sustainable Training

SAGMM also supports a pretrained expert setting, referred to as SAGMM-PE. In this variant, experts are pretrained once and then frozen. During SAGMM training, only the lightweight router and task head are updated.
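A toy sketch of the parameter split this implies is shown below; the module names and parameter counts are invented for illustration only.

```python
class Module:
    """Toy stand-in for a network component with a trainability flag."""

    def __init__(self, name, num_params, trainable=True):
        self.name = name
        self.num_params = num_params
        self.trainable = trainable

def trainable_parameters(modules):
    # Only modules left trainable contribute to gradient updates.
    return sum(m.num_params for m in modules if m.trainable)

# Pretrained experts are frozen; only the router and task head are updated.
experts = [Module(f"expert_{i}", 1_000_000, trainable=False) for i in range(4)]
router = Module("router", 10_000)
task_head = Module("task_head", 5_000)
model = experts + [router, task_head]
```

In this toy setup, only 15,000 of roughly 4 million parameters receive gradient updates, which is the source of the compute and energy savings described above.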

This design avoids repeated end-to-end retraining of large models, significantly reducing compute, memory usage, and energy consumption. In practice, this makes SAGMM more suitable for large-scale and long-running systems, where sustainability and cost efficiency matter.


Results and Insights

We evaluate SAGMM across node classification, graph-level prediction, and link prediction tasks on a diverse set of benchmark datasets. Tables 1–4 report the detailed quantitative results of SAGMM across these common graph learning tasks. Across tasks and datasets, SAGMM consistently outperforms strong baselines while adaptively selecting the most effective experts. Fig. 6 further summarizes the results by highlighting the largest performance gains achieved by SAGMM for each task category.

Table 1: Results of SAGMM on Node Classification Task
Table 2: Results of SAGMM on Link Prediction Task
Table 3: Results of SAGMM on Graph Classification Task
Table 4: Results of SAGMM on Graph Regression Task
Fig. 6: Plot showcasing best gains of SAGMM per task
To better understand what drives SAGMM’s performance, we conducted an ablation study by systematically modifying its core components. We evaluated four variants: removing expert diversity by using identical GNNs across experts, replacing the proposed gating mechanism with noisy Top-k gating, adopting Top-Any gating with dynamic expert selection, and disabling the adaptive expert pruning module. The results are listed in the table below.
Table 5: Results of ablation study
To conclude the ablation study:

  • Expert diversity is critical: removing architectural heterogeneity causes the largest performance drop.
  • TAAG gating and pruning matter: alternative gating degrades selection quality, while pruning preserves accuracy but significantly improves memory efficiency.

Conclusions

In this work, we addressed two complementary challenges: how to scale graph learning systems under memory constraints, and how to adapt model capacity to graph heterogeneity. SAGMM tackles the latter by making model selection a learned, node-level decision.

Looking forward, we are exploring a divide and conquer strategy that pushes this idea one level deeper. Instead of routing at the node level alone, we aim to decompose graphs into finer structural units and learn expert selection at an even more granular scale. We believe this direction can further improve efficiency and adaptability, and we plan to share these ideas in future work.

Beyond efficiency and performance, another equally important question is understanding why models make certain decisions. In Part 2, our colleagues will discuss a complementary and independent line of work, focusing on explainability and the technologies that the AI Lab at FRIPL is developing to make graph learning models more transparent and interpretable.


References

[1] Kipf, T. N., and Welling, M. Semi-Supervised Classification with Graph Convolutional Networks.
International Conference on Learning Representations (ICLR), 2017.
https://arxiv.org/abs/1609.02907

[2] Defferrard, M., Bresson, X., and Vandergheynst, P.
Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering.
Advances in Neural Information Processing Systems (NeurIPS), 2016.
https://arxiv.org/abs/1606.09375

[3] Hamilton, W. L., Ying, R., and Leskovec, J.
Inductive Representation Learning on Large Graphs.
Advances in Neural Information Processing Systems (NeurIPS), 2017.
https://arxiv.org/abs/1706.02216

[4] Xu, K., Hu, W., Leskovec, J., and Jegelka, S.
How Powerful Are Graph Neural Networks?
International Conference on Learning Representations (ICLR), 2019.
https://arxiv.org/abs/1810.00826

[5] Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and Bengio, Y.
Graph Attention Networks.
International Conference on Learning Representations (ICLR), 2018.
https://arxiv.org/abs/1710.10903

[6] Wu, Q., Zhao, W., Yang, C., Zhang, H., Nie, F., Jiang, H., Bian, Y., and Yan, J.
SGFormer: Simplifying and Empowering Transformers for Large-Graph Representations.
Advances in Neural Information Processing Systems (NeurIPS), 2023.
https://arxiv.org/abs/2306.10759