
Hello, this is Koji Maruhashi from the Artificial Intelligence Laboratory. This article is the third in the GraphAI series. In the first post, we introduced methods for learning large-scale graphs with billions of nodes. In the second post, we explained how to make inherently black-box GraphAI explainable and interpretable. In this article, we will introduce our efforts on a further challenge: "uncovering the underlying phenomena of large-scale graphs."
Why is it necessary to uncover "underlying phenomena"?
Large-scale graphs are no longer something that humans can directly understand. Not only billion-node-scale graphs, but even graphs with several hundred nodes and several thousand edges appear as almost solid black masses when visualized. For example, Figure 1 shows differences in gene regulatory networks under different conditions. While it is possible to list which nodes are connected to which, it is extremely difficult to obtain meaningful insights from the overall structure.

However, the real challenge is not observing the massive structure itself, but understanding what is happening behind it. A network is merely a collection of observed relationships. What we want to know are the causes, mechanisms, or structural dynamics that produce those relationships. In other words, the essential question is how to capture the phenomena behind the observed network and approach the deeper meanings and principles that give rise to those phenomena.

In the example of Figure 1, the differences in gene networks are shown between patients for whom the lung cancer therapeutic drug erlotinib is effective and those who have acquired resistance and for whom it is no longer effective. These differences can be understood as phenomena, such as the activation of specific signaling pathways. Behind these phenomena lie deeper meanings and principles, such as cancer malignancy and drug resistance.

In recent years, large-scale graph analysis has been progressing from "the stage of estimating structure" to "the stage of extracting phenomena behind the structure," and further to "the stage of understanding these phenomena as meaning by connecting them with existing knowledge." This article first focuses on the extraction of "phenomena" as an intermediate layer. However, we aim to extract not just phenomena as statistical patterns, but phenomenon structures that can be connected to the level of meaning and principles through structures common to multiple networks.

Medical domain
Gene expression data, protein interaction networks, disease-associated genes, and drug target information can all be represented as massive networks. What is observed here are statistical and structural relationships, such as co-expression relationships between genes and interactions between proteins. However, these networks show more than just connection relationships. Behind them, phenomena such as the activation of specific molecular pathways, enhancement of inflammatory responses, disruption of cell cycle control, and reorganization of metabolic pathways are occurring. Further beyond these are biological meanings and principles such as signal transduction pathways, evolutionarily conserved functional units, and drug action mechanisms. Even in cancers of the same organ, patient conditions differ not because of differences in network structure, but because the dominant phenomena, meanings, and principles behind them differ. What is important in drug discovery is not which genes are connected, but which molecular mechanisms drive the whole system. What is needed here is not a description of the network, but an estimation of phenomena, and ultimately, an elucidation of their meanings and principles.
Crime detection and financial domains
A similar situation is seen in crime detection and financial domains. Financial transaction networks, inter-corporate relationship networks, and human relationship networks on social media are observable network structures. Edges are records of specific actions such as transactions, communications, and fund transfers. However, these are merely the result of phenomena occurring behind them. In money laundering, dynamic behaviors such as money circulation patterns and concealment actions appear as phenomena; in markets, dynamic behaviors such as simultaneous upward trends across sectors or increased volatility appear as phenomena. Further behind these are deeper meanings and principles such as geopolitical risks, organizational crime strategies, financial policies, and macroeconomic structures. Markets move not because individual stocks are connected, but because underlying economic structures and policy decisions create phenomena, which in turn form network structures. Here too, what is important is not the network itself, but understanding "which semantic and principled factors drive markets and society through which phenomena."
Environmental and global-scale data
This is even clearer in the domain of environmental and global-scale data. Observation data from artificial satellites and sensors form a vast spatio-temporally linked network. Correlations and dependencies between observation points can be represented as graphs. However, behind them are phenomena such as El Niño, extreme weather, droughts, and heavy rains. And further behind these are the physical meanings and dynamic principles of the Earth system, such as ocean currents, atmospheric convection, and energy balance. The observation network is a superficial projection of phenomena created by dynamic principles based on physical laws. In disaster prediction and policy formulation, what is important is not to precisely depict the observation structure, but to understand which dynamic principles drive which phenomena.
What these domains have in common is that the observed network is the outermost layer; behind it are phenomena, and deeper still are meanings and principles. The network is the result of phenomena, and phenomena arise based on meaning. To understand huge and complex graphs is to move from structure to phenomena, and then to the meaning behind them. With this awareness of the problem, the following question naturally arises:
How can we extract the structure of phenomena from huge and complex graphs? And how can we build a foundation to connect them to the meaning beyond?
Hypothesis: Large-scale graphs can be understood by a few hidden factors
To address this problem, we are conducting research based on the following hypotheses.
Hypothesis 1: Behind large-scale graphs, there exist a small number of hidden factors, and the overall graph structure can be understood as simpler interactions among these factors. This hypothesis is not necessarily self-evident, but in natural and social sciences, there are many known cases where complex phenomena can be approximated by a few main factors. As seen in fundamental equations of physics and macroeconomic indicators, the idea of assuming a low-dimensional structure behind observations has been widely accepted.
Hypothesis 2: Hidden factors are expressed as linear combinations of the original nodes. Unlike black-box latent variables, it is important that the model can explicitly show which nodes are involved in which factors. The simple framework of linear combinations allows intuitive tracking of structural contributions and makes interpretation of results easier. We believe that this interpretability has great significance for scientific understanding and practical applications.
In the following sections, we will introduce specific efforts based on this hypothesis. The first technique is a method for directly handling large-scale graphs and estimating a small number of hidden factors behind them. Factors are expressed as linear combinations of the original nodes, allowing explicit identification of which nodes contribute to which factors. Although the computational complexity is enormous, it is possible to return from the estimated factors to the original graph for detailed analysis. The second technique is a method for quickly estimating factors and their sparse dependencies without explicitly calculating large-scale graphs.
Extracting hidden common factors via tensor representation of large-scale networks
Tensor representation of graphs

Our first effort starts with how to mathematically represent graph data [1]. The basic representation of a graph is an "adjacency matrix." If nodes are connected, 1 is assigned; otherwise, 0 (or a weight), representing it as a 2D matrix of node_count × node_count. However, real-world graphs are not sufficiently described by this alone. Each node and edge is accompanied by information such as expression levels, sales, temperature, multiple attributes or conditions, and even time. To include these "relationships" along with "features per node/edge" and "time/conditions," a 2D matrix is insufficient. Therefore, we use a "tensor." A tensor is a multi-dimensional array that extends a matrix to higher dimensions, for example, it can be represented as 3D data with three axes: node i, node j, and feature k (Figure 2). In this way, the entire graph, including node/edge attributes and conditional information, can naturally be viewed as a tensor of three or more dimensions.
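As a concrete illustration, a small attributed graph can be packed into a 3-way NumPy array. The edge list and feature names below are hypothetical, chosen only to show the indexing convention:

```python
import numpy as np

# Illustrative sketch: encode a small attributed graph as a 3-way tensor
# where X[i, j, k] = value of feature k on the edge (i, j).
n_nodes, n_features = 4, 3
X = np.zeros((n_nodes, n_nodes, n_features))

# Hypothetical edge list: (source, target, feature vector)
edges = [
    (0, 1, [1.0, 0.2, 0.0]),   # e.g. weight, expression level, condition flag
    (1, 2, [0.5, 0.9, 1.0]),
    (2, 3, [0.8, 0.1, 0.0]),
]
for i, j, feats in edges:
    X[i, j, :] = feats

# Slicing X[:, :, k] recovers the weighted adjacency matrix for feature k alone.
print(X.shape)      # (4, 4, 3)
```

Each frontal slice of the tensor is an ordinary adjacency matrix, so the tensor view strictly generalizes the matrix view.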
Approximating a tensor as a linear combination of a few factors
We hypothesize that this observed tensor can be approximated as a linear combination of a few factors using a mathematical technique called tensor decomposition. In mathematical notation, it can be written as:

X ≈ Σ_{r=1}^{R} a_r ∘ b_r ∘ c_r

Here, a_r, b_r, and c_r are factor vectors corresponding to each mode (node, node, feature, etc.), respectively, and "∘" denotes the outer product that composes the tensor from them. The important point is that this decomposition explicitly shows "which node contributes to which factor and to what extent." Because factors are expressed as linear combinations of the original variables, interpretability is preserved.
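A minimal NumPy sketch of this rank-R form, composing a tensor from outer products of factor vectors (the factors here are random and purely illustrative):

```python
import numpy as np

# A rank-R model: the tensor is a sum of outer products of factor
# vectors a_r, b_r, c_r, one vector per mode.
rng = np.random.default_rng(0)
I, J, K, R = 5, 5, 3, 2            # two node modes, one feature mode, rank 2
A = rng.normal(size=(I, R))         # node factors: each column A[:, r] says
B = rng.normal(size=(J, R))         # how strongly each node joins factor r
C = rng.normal(size=(K, R))         # feature factors

# Compose the tensor: X[i, j, k] = sum_r A[i, r] * B[j, r] * C[k, r]
X = np.einsum('ir,jr,kr->ijk', A, B, C)
print(X.shape)  # (5, 5, 3)
```

Because the reconstruction is linear in each factor matrix, the contribution of every original node to every factor can be read off directly from the columns of A and B.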
Extracting complex structures between factors
However, the essence of our technology is not simply to decompose a tensor. In typical tensor decomposition, the goal is to successfully reconstruct observed data, so the resulting factors remain at the level of statistical covariation patterns, i.e., "phenomena." On the other hand, our method optimizes factors common to multiple graphs to align with a semantic axis (e.g., cancer malignancy) through classification performance. Specifically, a neural network that can accurately classify graphs is simultaneously trained using the few hidden factors obtained by decomposition as input [1][2]. Through this training, the weights of the linear combination (i.e., the relationship between original variables and hidden factors) and the parameters of the classification network are simultaneously optimized. In other words, we aim not just to extract phenomena, but to construct a factor space where meaning and principle-level structures become visible. This achieves a structure where:
- The relationship between original observed variables and hidden factors remains linear.
- However, classification boundaries in the hidden factor space are learned non-linearly.
Since the hidden factors are few, their space can be visualized in two or three dimensions. This allows visual confirmation of non-linear classification boundaries. Structures that were invisible in the huge high-dimensional graph space become clear in the low-dimensional factor space.
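The training scheme can be caricatured in a few dozen lines. Note this is a toy stand-in, not the actual model of [1][2]: it replaces the tensor decomposition with a plain linear projection W to two factors, followed by a small non-linear classifier, with both parts updated jointly by manual gradient descent. All dimensions and learning rates are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, H = 400, 10, 16

# Ground truth: an XOR-like class boundary in a 2-D factor space,
# observed only through a random linear mixing into d dimensions.
z_true = rng.uniform(-1, 1, (n, 2))
y = (z_true[:, 0] * z_true[:, 1] > 0).astype(float)
X = z_true @ rng.normal(size=(2, d)) + 0.05 * rng.normal(size=(n, d))

W = 0.1 * rng.normal(size=(d, 2))   # linear projection: the interpretable part
V = 0.5 * rng.normal(size=(2, H))   # non-linear classifier on the factor space
b1, u, b2 = np.zeros(H), 0.1 * rng.normal(size=H), 0.0

losses, lr = [], 0.5
for step in range(3000):
    z = X @ W                        # hidden factors: linear in the inputs
    h = np.tanh(z @ V + b1)          # non-linearity lives in factor space only
    p = 1.0 / (1.0 + np.exp(-(h @ u + b2)))
    losses.append(-np.mean(y * np.log(p + 1e-12)
                           + (1 - y) * np.log(1 - p + 1e-12)))
    g = (p - y) / n                  # gradient of cross-entropy w.r.t. logits
    dh = np.outer(g, u) * (1 - h**2)
    dz = dh @ V.T
    u -= lr * h.T @ g;  b2 -= lr * g.sum()
    V -= lr * z.T @ dh; b1 -= lr * dh.sum(axis=0)
    W -= lr * X.T @ dz               # projection and classifier learn together

print(losses[0], losses[-1])         # joint loss should decrease
```

The key property mirrors the text: the map from observations to factors stays linear (W), while the classification boundary drawn inside the 2-D factor space is non-linear.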

Experiments using artificial data demonstrate this feature vividly (Figure 3) [2]. In this experiment, a spiral non-linear classification boundary is artificially defined in a "hidden factor space." That is, a structure is created where classes are distributed in a spiral shape in a 2D factor space. Next, "observed variables" are generated as linear combinations of those factors. Specifically, 100-dimensional data is generated by adding 98 noise dimensions to the 2D factors, and observed variables are created by a linear transformation that applies a rotation in that 100-dimensional space. From the observation side, the spiral structure is buried in noise and not clearly visible. In this situation, we verify whether the underlying factor space and its non-linear classification boundary can be recovered from the observed data alone. To clearly demonstrate the technical features, the adjacency matrix of the graph is given a special 1×100 structure, representing relations between a single node and 100 nodes. This design shows that even under a simple structure, a non-linear factor structure can be hidden behind it.
The results clearly demonstrate the effectiveness of the proposed method. Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) cannot capture the spiral structure. Classification boundaries can only be expressed linearly, and classes remain mixed. However, our method correctly restores the hidden factor space and clearly draws the spiral non-linear classification plane on it. As shown in Figure 3, a non-linear classification structure appears in a visualizable form. This result shows that even when observed variables are only linear combinations of underlying factors, non-linear structures can still be successfully extracted.
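The data-generation step of such an experiment can be sketched as follows. The spiral parameters and noise scale below are illustrative, not the exact values used in [2]:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Two classes on interleaved spirals in a 2-D hidden-factor space
t = rng.uniform(0, 3 * np.pi, n)
labels = rng.integers(0, 2, n)
radius = 0.5 + 0.3 * t
z = np.stack([radius * np.cos(t + np.pi * labels),
              radius * np.sin(t + np.pi * labels)], axis=1)

# Bury the 2 informative dimensions: append 98 noise dimensions,
# then rotate everything in 100-D with a random orthogonal matrix.
noise = rng.normal(size=(n, 98))
hidden = np.hstack([z, noise])                      # (n, 100)
Q, _ = np.linalg.qr(rng.normal(size=(100, 100)))    # random rotation
X_obs = hidden @ Q                                  # observed variables

print(X_obs.shape)  # (500, 100)
```

Every observed coordinate is a linear mix of all 100 hidden coordinates, so no single observed variable reveals the spiral; only recovering the right 2-D linear projection makes it visible again.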
Application to gene data: Discovering hidden mechanisms behind cancer malignancy
We applied this method to biological data [3]. The target was gene expression data related to cancer [4]. In particular, we focused on the biological phenomenon of Epithelial–Mesenchymal Transition (EMT), which is an important mechanism involved in cancer cells acquiring invasiveness and metastatic potential. There is a known set of indicator genes for EMT, and an EMT score can be defined from their expression patterns. The question is whether a latent structure consistent with the biological axis of EMT can be found, in a data-driven manner, from thousands of genes.

Figure 4 visualizes the hidden factors extracted by applying our method to this data in a two-dimensional plane. It is shown that the distribution of the extracted hidden factors is strongly correlated with the EMT score. That is, even though EMT labels were not directly provided in advance, a direction consistent with the EMT axis spontaneously appeared within the factor space.
Furthermore, enrichment analysis of gene groups contributing to the extracted factors identified biologically plausible pathways such as EMT-related pathways, cell adhesion, and cytoskeleton reorganization. Enrichment analysis is a method that statistically examines whether genes related to a specific function or pathway are "more abundant than by chance" within a given group of genes. In other words, it is an analytical method to objectively clarify the biological characteristics of that gene group. This result indicates that not just a mathematical decomposition of data, but a biologically meaningful structure can be extracted.
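The statistical core of enrichment analysis is typically an over-representation test based on the hypergeometric distribution. A minimal sketch with illustrative numbers (not from the study):

```python
from scipy.stats import hypergeom

# Given a gene group of size N drawn from a background of M genes, of which
# n are annotated to a pathway: is the observed count k of annotated genes
# in the group larger than chance would predict?
M = 12000   # background genes (hypothetical)
n = 200     # genes annotated to the pathway, e.g. EMT-related (hypothetical)
N = 150     # genes contributing strongly to one factor (hypothetical)
k = 12      # of those, annotated to the pathway (hypothetical)

# P(X >= k) under the hypergeometric null
p_value = hypergeom.sf(k - 1, M, n, N)
print(p_value)
```

Here the expected overlap by chance is N·n/M = 2.5 genes, so observing 12 yields a very small p-value, i.e. the factor's gene group is enriched for that pathway.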
Thus, while it is difficult to directly interpret a massive gene network, by reducing it to a few factors and visualizing the non-linear classification structure in that factor space, it becomes possible to highlight biologically plausible phenomena. Tensor representation and factor extraction based on linear combinations serve as powerful tools to uncover hidden non-linear phenomenon structures while maintaining interpretability.
Quickly and directly estimating graph structures between hidden common factors
So far, we have extracted hidden factors from observed large-scale graphs and uncovered non-linear classification structures within that factor space. However, when applying this approach to large-scale real data, there was another major challenge: "the enormous computational cost involved in estimating the large-scale graph structure itself." For example, consider directly estimating dependencies between tens of thousands of genes. When the number of genes reaches tens of thousands, statistically stable estimation of dependencies requires extremely large computational resources. In fact, precisely estimating networks on the scale of tens of thousands of genes can take several months of computation time, even with the highest-level supercomputers available to typical research institutions. However, in a rapidly advancing research field, methods that presuppose such long computation times are not practical. Reconstructing large-scale networks every time new data is obtained does not lead to rapid hypothesis generation or clinical applications. Therefore, our next approach was to "quickly estimate the dependencies between a small number of hidden factors, rather than directly estimating the huge dependency structure between observed variables."

Conventional method (Graphical Lasso)
Assuming a multivariate Gaussian distribution, conditional independence between variables is expressed by the "precision matrix (inverse of the covariance matrix)." While the covariance matrix shows overall correlations, the precision matrix indicates "which variables are directly dependent on each other when other variables are fixed." Graphical Lasso [5] uses a technique called L1 regularization (Lasso) when estimating this precision matrix. L1 regularization is a method that imposes a penalty on the sum of the absolute values of parameters, pushing many parameters to zero. As a result, the estimated precision matrix becomes sparse. That is, a structure remains where only "truly necessary dependencies" are present. However, as can be seen from Figure 5(b), even the dependencies between 100 nodes form a very complex network when visualized. With many edges, it is difficult for humans to intuitively understand. If the number of nodes increases further, the number of potential dependencies grows quadratically, and the network quickly becomes unreadable.
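For reference, scikit-learn's GraphicalLasso recovers such a sparse precision matrix from simulated data. This is a minimal sketch; the chain structure and penalty value are illustrative:

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)

# Simulate data whose true precision matrix is sparse: a chain of
# dependencies (each variable directly depends only on its neighbors).
p = 10
theta = np.eye(p) * 1.5
for i in range(p - 1):
    theta[i, i + 1] = theta[i + 1, i] = 0.4
cov = np.linalg.inv(theta)
X = rng.multivariate_normal(np.zeros(p), cov, size=500)

# L1-penalized maximum-likelihood estimation of the precision matrix
model = GraphicalLasso(alpha=0.05).fit(X)
precision = model.precision_

# Count surviving off-diagonal dependencies (edges of the estimated graph)
n_edges = (np.abs(precision[np.triu_indices(p, k=1)]) > 1e-4).sum()
print(n_edges)   # many spurious entries are shrunk exactly to zero
```

The penalty alpha trades off sparsity against fit: larger values delete more edges, smaller values keep more.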
Proposed method: Estimating dependency structure in hidden factor space (Meta Graphical Lasso)
Therefore, we adopted the idea of applying L1 regularization to the precision matrix "among hidden (latent) factors" instead of between observed variables [6]. Refer to Figure 5(c). First, observed data is mapped to a few latent factors through a linear transformation. This linear transformation can be estimated in a way that is shared across multiple datasets. Then, L1 regularization is applied to the precision matrix among those latent factors to estimate dependencies. The difference here is clear:
- Conventional Graphical Lasso applies L1 regularization to the precision matrix "between observed variables."
- This method applies L1 regularization to the precision matrix "between latent factors."
Even if there are tens of thousands of observed variables, if there are only about 20 latent factors, the dependencies to be estimated are significantly reduced. The computational cost is dramatically smaller, and because the number of factors is small, it becomes a size that humans can understand even when visualized. Even more importantly, this linear transformation can be estimated to be common across multiple datasets. This means that by using a common factor space as a foundation, it is possible to compare different dependency structures across datasets. By estimating dependency structures based on factors common to multiple datasets, it becomes possible to approach more essential meaning and principle-level common mechanisms, such as drug resistance or social structures, beyond statistical structures within a single dataset.
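The idea can be caricatured as a two-stage pipeline: one linear projection to a few factors shared across datasets, then Graphical Lasso among the factor scores of each dataset. Note this is only a simplified stand-in; the actual Meta Graphical Lasso [6] estimates the projection and the latent precision matrices jointly, and the data below is random placeholder data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
p, k = 200, 5                       # many observed variables, few factors
datasets = [rng.normal(size=(300, p)) for _ in range(2)]   # placeholder data

# Stage 1: a single linear map to k factors, fitted on the pooled data
# (the paper learns this shared map jointly with the precision matrices).
pca = PCA(n_components=k).fit(np.vstack(datasets))

# Stage 2: sparse precision estimation among the k factors, per dataset
precisions = []
for X in datasets:
    Z = pca.transform(X)            # (n_samples, k) factor scores
    gl = GraphicalLasso(alpha=0.1).fit(Z)
    precisions.append(gl.precision_)

# One k x k precision matrix per dataset: small enough to visualize
# side by side and compare, since the factor space is shared.
print(precisions[0].shape)  # (5, 5)
```

Because every dataset's dependency network lives in the same k-dimensional factor space, the matrices are directly comparable, which is what enables the group-by-group comparisons described next.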
Application to gene data: Hidden dependencies that differ by 5-FU sensitivity
A representative problem to which this method was applied is the estimation of hidden mechanisms that give rise to differences in sensitivity to the anticancer drug 5-FU (5-fluorouracil). The data used consists of expression data for about 12,000 genes, comprising several hundred samples. Samples are classified into four groups, B1 to B4, based on 5-FU sensitivity. B1 is the most sensitive group, B4 is the least sensitive group, and B2 and B3 are intermediate. This classification is based on clinical indicators of drug response. The purpose of this study is to identify "what hidden molecular mechanisms produce differences in 5-FU sensitivity." Instead of merely observing which genes are differentially expressed, we analyze how the dependency structure between factors changes in each group.

Referring to Figure 6, the sparse dependency network among 20 latent factors is clearly different across groups B1–B4. For example, the following interactions were observed to be strongly associated with 5-FU sensitivity:
- A positive interaction between Factor 0 (ECM-receptor interaction) and Factor 2 (Phagosome) is strongly observed in 5-FU-sensitive cell lines (B1), but it weakens as sensitivity decreases and disappears in 5-FU-resistant cell lines (B4).
- Interactions between Factor 2 (Phagosome) and Factor 6 (Transcriptional misregulation in cancer), and between Factor 2 (Phagosome) and Factor 12 (Antigen processing and presentation), were also confirmed as 5-FU sensitivity-specific interactions. These appear prominently in sensitive cell lines and are lost with resistance.
- Furthermore, the interaction between Factor 1 (Focal adhesion) and Factor 11 (Protein digestion and absorption) shows a negative correlation in B1, disappears in B2, and reverses to a positive correlation in B3 and B4. This sign reversal suggests a change in molecular mechanisms associated with the transition from 5-FU sensitivity to resistance.
The extracted latent factors are associated with KEGG pathways through enrichment analysis. This makes it possible to interpret which biological pathway corresponds to each factor. In other words, it was shown that differences in 5-FU response can be explained as a 20-factor network, rather than directly dealing with complex dependencies of thousands of genes. This means that medically meaningful hypotheses can be generated quickly.
Application to financial data
For the application to financial data, we used daily price data for approximately 4,000 U.S. stocks, covering about 1,500 trading days. These were divided into 7 one-year periods from 2008 onward, and dependency structures were estimated for each year. Strictly speaking, stock prices form time-series data with strong temporal dependencies; however, for this simplified experiment, we analyzed the data assuming independence between daily samples. Nevertheless, some interesting insights were obtained.

Referring to Figure 7, it is visually confirmed how the dependencies between factors change year by year. There are clear trends visible to the eye. Especially important insights are the following two points:
- Factor 1 was interpreted as related to rare metals, and Factor 3 as related to electronic devices. Factors 1 and 3 were conditionally dependent in both 2010 and 2012; however, the corresponding entries in the precision matrices had opposite signs. This pattern may reflect the trade friction over rare metals between China and Europe during that period. These results suggest that specific macroeconomic events can be captured in the inter-factor dependency structure.
- Factor 8 was interpreted as a gas and oil-related factor. Strong dependencies with other factors were observed in 2012 and 2014 in the latent factor dependency network (not shown), suggesting a connection to the shale revolution, which was a major topic at the time. It is confirmed that structural changes in the energy market appear as a form of factor network.
A key achievement of this method is the ability to grasp market dynamics as a dependency structure among a few factors, rather than directly analyzing a massive inter-stock network. Thus, Meta Graphical Lasso provides a framework that, instead of directly estimating a huge observed graph, maps it to a sparsely dependent structure in a small latent factor space and estimates dependencies among those factors in a fast and interpretable manner. It has become possible to provide concrete insights into real data while balancing computational efficiency and interpretability.
Finally
Future initiatives
This article has focused on the question of "how to extract the phenomena behind networks." Factor models based on linear projection provide a powerful framework to reduce the outermost structure of large-scale graphs to a few interpretable factors. As shown by their application to gene and financial data, they can clearly highlight phenomenon-level patterns underlying complex connection structures.
With the methods introduced in this paper, we are gaining a foothold in approaching phenomenon structures consistent with meaning and principle levels. However, their ultimate semantic interpretation still heavily relies on alignment with expert knowledge. Understanding which meaning or principle the extracted factors ultimately correspond to still requires specialized background knowledge. Furthermore, real-world structures are inherently non-linear, and some aspects cannot be sufficiently captured by a mere superposition of linear factors. That is, while "from network to phenomenon" can be automated to some extent, the connection "from phenomenon to meaning and principle" still largely depends on human intervention.
Here, the recent advances in generative AI, particularly large language models, are opening up new possibilities. These models store vast amounts of knowledge accumulated in biological papers, financial reports, policy documents, and more. If we can connect the phenomenon structures extracted by factor models with the conceptual structures embedded in text, it will be possible to semi-automatically interpret the meaning of factors. This would allow us, as shown in Figure 1, to understand the phenomena, meanings, and principles behind large-scale networks by closely linking them with the vast amount of knowledge accumulated by humankind.
We are currently pursuing research in this direction: enhancing semantic interpretation using generative models while building upon structurally sound factor models. Our goal is to develop next-generation graph interpretation technology that integrates "structure" and "context."
Summary
Large-scale graphs are difficult to understand as they are. Directly analyzing the collection of nodes and edges alone cannot sufficiently capture the events occurring behind them. On the other hand, by decomposing structures into factors, it becomes possible to organize and understand complex networks at the level of phenomena. The tensor decomposition and the estimation of inter-factor dependency structures introduced in this article are fundamental methods for this purpose. Through these efforts, a path has emerged to systematically extract phenomenal patterns underlying observed network structures. Furthermore, by connecting numerically obtained structures with existing knowledge and contextual information, this work is expected to develop into a framework that moves from networks to phenomena, and then to meanings and principles. This article illustrates a core approach within this trend. Treating large-scale graphs not just as structures but as clues to understanding phenomena: this perspective will drive the next generation of developments.
Over three posts, we have introduced Fujitsu Research's efforts in GraphAI. With the advancement of observation technologies, the graph data to be utilized for analysis is rapidly increasing in scale and diversity, ranging from macro perspectives at cosmic and global scales to micro perspectives at cellular and molecular levels. The importance of realizing billion-node scale large-scale GraphAI, ensuring high explainability and interpretability, and further understanding the phenomena, meanings, and principles behind large-scale graphs is expected to grow even more in the future. We will continue to work diligently toward the further development of GraphAI.
References
[1] Koji Maruhashi, Masaru Todoriki, Takuya Ohwa, Keisuke Goto, Yu Hasegawa, Hiroya Inakoshi, Hirokazu Anai: Learning Multi-Way Relations via Tensor Decomposition With Neural Networks. AAAI 2018: 3770-3777
[2] Koji Maruhashi, Heewon Park, Rui Yamaguchi, Satoru Miyano: Linear Tensor Projection Revealing Nonlinearity. CoRR abs/2007.03912 (2020)
[3] Park H, Maruhashi K, Yamaguchi R, Imoto S, Miyano S (2020) Global gene network exploration based on explainable artificial intelligence approach. PLoS ONE 15(11): e0241508.
[4] Shimamura T, Imoto S, Shimada Y, Hosono Y, Niida A, Nagasaki M, et al. (2011) A Novel Network Profiling Analysis Reveals System Changes in Epithelial-Mesenchymal Transition. PLoS ONE 6(6): e20804.
[5] Friedman, J., Hastie, T. & Tibshirani, R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9, 432–441 (2008).
[6] Maruhashi, K., Kashima, H., Miyano, S. et al. Meta graphical lasso: uncovering hidden interactions among latent mechanisms. Sci Rep 14, 18105 (2024)
Related articles
Scaling Graph AI to Billion-sized Graphs
Lifting the veil on Graph AI: From black box explainability to self-interpretable models