
Introduction of Attention Augmented Hallucination Mitigation Technologies for Multimodal Large Language Models

Hello, we are Fei Li, Ziqiang Shi and Jingyi Wang from the Generative AI research group of Fujitsu Research & Development Center (FRDC) in China. Today, we would like to introduce the technologies we have developed for hallucination mitigation in Multimodal Large Language Models (MLLMs). The three related papers have been accepted by the international conferences WACV (IEEE/CVF Winter Conference on Applications of Computer Vision) 2026 and ICASSP (IEEE International Conference on Acoustics, Speech, and Signal Processing) 2026.

Hallucination in MLLMs

Nowadays, MLLMs excel at processing visual and textual information, advancing many tasks such as Visual Question Answering (VQA), image captioning, and visual reasoning. Although they greatly enhance cross-modal understanding and generation, MLLMs still face a critical challenge: hallucination. Hallucination refers to factual inconsistencies between the model’s decision (or inference) and the input image, often manifesting as fabricated objects or inaccurate descriptions of attributes and relationships. In other words, the model does not base its decisions and analysis entirely on the input; instead, it sometimes incorporates fictional content. Figure 1 shows an example of hallucination for the task of VQA. The consumption tax for beer used to be 10% for a long time, but it has recently been adjusted to 8%, and this 8% is clearly stated in the table of the input image. However, when an MLLM answers the question, it still mistakenly calculates with 10%.

Question: Based on the image, what is the total tax on all the goods?
Correct answer: 80,000 x 8% + 60,000 x 4% = 8,800
Hallucinated answer: 80,000 x 10% + 60,000 x 4% = 10,400
Figure 1. Example of hallucination in MLLMs.
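To make the numbers in Figure 1 easy to check, here is a minimal Python snippet that reproduces both calculations. The amounts and rates are taken from the example; we assume the 80,000 item is the beer mentioned above, and the variable names are our own.

```python
# Amounts and tax rates from the example in Figure 1.
beer_amount = 80_000   # assumed to be the beer, taxed at the reduced 8% rate stated in the image
other_amount = 60_000  # taxed at 4% in the example

correct_total = beer_amount * 0.08 + other_amount * 0.04       # = 8,800
hallucinated_total = beer_amount * 0.10 + other_amount * 0.04  # = 10,400 (model wrongly uses 10%)

print(correct_total, hallucinated_total)  # 8800.0 10400.0
```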

The hallucination issue stems from multiple factors. One of the most important is over-reliance on prior knowledge, such as biases in the training data or inherent language priors within Large Language Models (LLMs). The hallucinated answer in Figure 1 is caused by this. Other key factors include limitations in visual encoder localization, misalignment of multimodal information, and suboptimal attention modeling during decoding. All of these factors hinder accurate associations between the user’s inputs and the model’s outputs.

Among the various types of hallucination, a typical one occurs when the model focuses on the incorrect area of the image. An example is illustrated in Figure 2. The question is about the fruit in the picture, but the MLLM pays attention to the mouse and mistakes it for a peach. Our main research work is centered on this type of hallucination, and we attempt to shift the model’s focus from the incorrect area to the correct one. More specifically, we address the problem from two different perspectives. On the one hand, we refine attention activations from hallucinated ones to trusted ones. On the other hand, we enhance the attention on task-relevant text and visual tokens. Neither of our approaches requires re-training the model, and only a small amount of additional computation is introduced.

Question: What fruit is in the picture?
Correct answer: Apple.
Hallucinated answer: Peach.
Figure 2. Example of hallucination caused by incorrect focused area. (a) Input image; (b) Correct attention map; (c) Incorrect attention map.

Attention activation refinement for hallucination mitigation

Current MLLMs are usually based on the Transformer architecture. For the input token sequence, the output of the (l+1)-th Transformer layer, denoted as h^{l+1}, is computed using multi-head self-attention as

h^{l+1} = h^l + \sum_{n=1}^{N_h} P_n^l A_n^l(h^l),

where N_h is the total number of heads in each Transformer layer, A_n^l(·) is the attention operator of the n-th head in the l-th layer, and P_n^l maps the activated output of the attention head back to the operation space of the Transformer layer.
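As a concrete illustration of this update, the NumPy sketch below computes only the attention contribution to the layer output; the head operators and projection matrices are passed in as plain functions and arrays, and everything else in the real model (MLP blocks, normalization, etc.) is omitted.

```python
import numpy as np

def attention_layer_output(h_l, attn_heads, out_projs):
    """Attention part of the layer update: h^{l+1} = h^l + sum_n P_n^l A_n^l(h^l).

    h_l:        (seq_len, d_model) hidden states entering layer l
    attn_heads: list of N_h callables, attn_heads[n](h_l) -> (seq_len, d_head),
                playing the role of the attention operator A_n^l
    out_projs:  list of N_h arrays of shape (d_head, d_model), playing the
                role of P_n^l (per-head output projection)
    """
    h_next = h_l.copy()
    for A_n, P_n in zip(attn_heads, out_projs):
        h_next = h_next + A_n(h_l) @ P_n  # residual sum over the N_h heads
    return h_next
```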

The basic idea of attention activation refinement is to analyze the difference between trusted and hallucinated attention activations, namely A_n^l(·). For MLLMs, the trusted attention activation distribution is built from valid (image, question, answer) triplets, while the hallucinated distribution is created by perturbing images or altering specific image regions while keeping the question-answer pairs correct. Our goal is to map the hallucinated distribution to the trusted one, enabling corrective adjustments to the attention activations.
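As a rough sketch of how such activations could be gathered, the PyTorch snippet below registers forward hooks on the attention modules and stores their outputs. The model, the data, and the exact list of attention modules are assumptions here, since the collection details depend on the specific MLLM.

```python
import torch

class ActivationCollector:
    """Collect per-head attention activations A_n^l(h^l) via forward hooks,
    so that trusted and hallucinated activation sets can be compared offline."""

    def __init__(self, attn_modules):
        self.store, self.hooks = [], []
        for m in attn_modules:
            self.hooks.append(m.register_forward_hook(self._save))

    def _save(self, module, inputs, output):
        # We assume `output` is the head activation tensor of interest.
        self.store.append(output.detach().cpu())

    def close(self):
        for h in self.hooks:
            h.remove()

# Usage sketch (model, triplets, and attention_modules are hypothetical):
# collector = ActivationCollector(attention_modules)
# for image, question, answer in trusted_triplets:  # valid triplets -> trusted set
#     model.generate(image, question)
# trusted_acts = torch.cat(collector.store)
# ...repeat with perturbed images to build the hallucinated set.
```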

Scalpel: Attention activation distribution alignment via mixture Gaussian bridges

Our first method is named Scalpel, and its main framework is illustrated in Figure 3. In this method, Gaussian mixture models (GMMs) are adopted to model the attention activations from both trusted and hallucinated data and to approximate their distributions. Both the trusted and hallucinated GMMs contain multiple components. When an activation belongs to a hallucinated component, the related transport mapping requires not only identifying its trusted counterpart but also determining the optimal transformation path. To address this, we propose an effective alignment algorithm. Intuitively, minimal corrections preserve the integrity of the data distribution, so we seek a minimum-cost mapping between components. The theory of the Schrödinger bridge problem is applied to compute the optimal transport between the hallucinated and trusted GMMs. This allows for component-specific correction based on the current activation. After obtaining the transfer vector v_r between the hallucinated component and its trusted counterpart, the vector is added to the attention activation, and the output of the (l+1)-th Transformer layer is re-calculated as

h^{l+1} = h^l + \sum_{n=1}^{N_h} P_n^l (A_n^l(h^l) + α v_r),

where α controls the intervention strength. This correction is applied at each generation step to mitigate hallucination while preserving generation coherence.

Figure 3. Schematic of Scalpel.
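The toy NumPy/scikit-learn sketch below illustrates the component-level correction described above on synthetic activations: two GMMs are fitted, components are paired with a simple minimum-cost assignment (a cheap stand-in for the Schrödinger bridge coupling used in the paper), and the resulting transfer vector is added with strength α. The data, the number of components, and the value of α are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from scipy.optimize import linear_sum_assignment

# Synthetic stand-ins for collected head activations (real ones come from the MLLM).
rng = np.random.default_rng(0)
trusted_acts = rng.normal(0.0, 1.0, size=(2000, 64))
halluc_acts = rng.normal(0.5, 1.0, size=(2000, 64))

# Model each activation set with a GMM.
gmm_t = GaussianMixture(n_components=4, random_state=0).fit(trusted_acts)
gmm_h = GaussianMixture(n_components=4, random_state=0).fit(halluc_acts)

# Simplified component pairing: minimum total cost between component means.
# (Scalpel solves a Schrödinger bridge problem between the two GMMs; the
#  assignment below only illustrates the idea of a minimum-cost coupling.)
cost = np.linalg.norm(gmm_h.means_[:, None, :] - gmm_t.means_[None, :, :], axis=-1)
rows, cols = linear_sum_assignment(cost)
transfer_vectors = gmm_t.means_[cols] - gmm_h.means_[rows]  # one v_r per hallucinated component

def correct_activation(a, alpha=0.5):
    """Shift an activation toward its trusted counterpart: a + alpha * v_r."""
    k = gmm_h.predict(a[None, :])[0]        # hallucinated component that a falls into
    return a + alpha * transfer_vectors[k]

corrected = correct_activation(halluc_acts[0])
```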

We conduct experiments on the Polling-based Object Probing Evaluation (POPE) dataset. POPE evaluates object hallucinations in MLLMs via binary queries (e.g., “Is there a chair?”). Unlike caption-based methods, it directly probes object recognition and hallucination. Our experimental results, shown in Table 1, demonstrate the effectiveness of Scalpel.

Table 1. Performance comparison of Scalpel and other baseline methods on the POPE benchmark across three datasets (MSCOCO, A-OKVQA, and GQA) using the LLaVA-1.5-7B model. Bold values indicate the best performance on each subset and metric.
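Since POPE reduces hallucination evaluation to yes/no answers, the metrics commonly reported for it (accuracy, precision, recall, F1) can be computed in a few lines. The snippet below is a generic implementation for illustration, not our actual evaluation code.

```python
def pope_metrics(predictions, labels):
    """Accuracy/precision/recall/F1 for POPE-style yes/no object probes.

    predictions, labels: lists of "yes"/"no" strings; "yes" is the positive class.
    """
    tp = sum(p == "yes" and l == "yes" for p, l in zip(predictions, labels))
    fp = sum(p == "yes" and l == "no" for p, l in zip(predictions, labels))
    fn = sum(p == "no" and l == "yes" for p, l in zip(predictions, labels))
    tn = sum(p == "no" and l == "no" for p, l in zip(predictions, labels))
    acc = (tp + tn) / max(len(labels), 1)
    prec = tp / max(tp + fp, 1)
    rec = tp / max(tp + fn, 1)
    f1 = 2 * prec * rec / max(prec + rec, 1e-9)
    return {"accuracy": acc, "precision": prec, "recall": rec, "f1": f1}

print(pope_metrics(["yes", "no", "yes"], ["yes", "no", "no"]))
```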

SchröMind: Point-wise attention activation alignment

Although Scalpel is effective, it only aligns hallucinated and trusted activations at the component level with the help of GMMs. That is to say, the transfer vector is calculated based on the hallucinated component and its trusted counterpart. To make the alignment more accurate, we improve Scalpel and propose a point-wise alignment method named SchröMind. It is also implemented by solving the Schrödinger bridge problem; however, it establishes a point-level mapping between hallucinated and trusted activations with minimal transport cost through lightweight training.

The main framework of SchröMind is shown in Figure 4. The left part is the same as that of Scalpel: the trusted attention activations are collected from valid (image, question, answer) triplets, and the hallucinated ones from perturbed images or altered regions. The main difference from Scalpel is that GMMs are not adopted to model the activation distributions; instead, SchröMind directly minimizes the transport cost with entropy regularization via a Schrödinger Bridge Problem / Entropy-regularized Optimal Transport (SBP/EOT) formulation for each point.

Figure 4. Schematic of SchröMind.
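For intuition about the point-wise coupling, the NumPy sketch below computes a static entropy-regularized OT plan between small sets of hallucinated and trusted activation points with the Sinkhorn iteration, and derives a point-level transfer vector from the barycentric projection. This is only an illustration of the EOT coupling idea; SchröMind itself learns the mapping through lightweight training, and the data, dimensions, and ε here are arbitrary.

```python
import numpy as np

def sinkhorn_plan(X_h, X_t, eps=0.1, n_iters=200):
    """Entropy-regularized OT plan between hallucinated (X_h) and trusted (X_t)
    activation points -- a static stand-in for the SBP/EOT coupling."""
    C = np.linalg.norm(X_h[:, None, :] - X_t[None, :, :], axis=-1) ** 2
    C = C / C.mean()                          # normalize the cost for numerical stability
    K = np.exp(-C / eps)
    a = np.full(len(X_h), 1.0 / len(X_h))
    b = np.full(len(X_t), 1.0 / len(X_t))
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iters):                  # Sinkhorn iterations
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]        # transport plan

# Point-wise targets: barycentric projection of each hallucinated point onto the
# trusted set; the difference is its point-level transfer vector.
rng = np.random.default_rng(0)
X_h = rng.normal(0.5, 1.0, size=(200, 16))
X_t = rng.normal(0.0, 1.0, size=(200, 16))
P = sinkhorn_plan(X_h, X_t)
targets = (P @ X_t) / P.sum(axis=1, keepdims=True)
transfer_vectors = targets - X_h
```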

Experimental results on the POPE dataset are shown in Table 2. SchröMind achieves better performance in most settings.

Table 2. Performance comparison of SchröMind and baselines on the POPE benchmark using the LLaVA-1.5-7B model on three datasets: MSCOCO, A-OKVQA, and GQA. Bold and underlined values show the best and second-best, respectively.

Visual-aware attention and logits enhancement (VAALE) for hallucination mitigation

As more new tokens are generated, MLLMs increasingly allocate attention to text tokens. Therefore, one specific type of hallucination is caused by insufficient attention to visual tokens. However, simply enhancing the attention weights of all visual tokens is not optimal, since visual understanding requires focused attention on task-relevant image regions rather than the whole image. Based on the observation that task-relevant tokens typically exhibit higher visual-textual similarities in visual understanding tasks, we introduce a novel plug-and-play method named VAALE. It mainly consists of two modules: attention refocusing and visual beam search. The main framework is illustrated in Figure 5.

Figure 5. Schematic of VAALE.

Attention refocusing

This module refocuses attention on task-relevant tokens. It is inspired by the method of self-augmentation via self-reweighting (SASR) for LLMs. First, we use a fixed instruction, such as “Please describe this image in detail.”, to generate a description of the image. Then, we concatenate the generated description before the original instruction, so the new prompt is guaranteed to contain tokens semantically aligned with the visual tokens. Next, based on the cross-attention between visual and instruction tokens, we calculate the vision correlation matrix under the instruction as well as the instruction correlation matrix under the vision. Based on these correlation matrices, the original visual and instruction attention values are reweighted so that tokens with high visual-textual similarity receive more attention.
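The sketch below gives one simple way such a reweighting could look: relevance scores are derived from the visual-instruction cross-attention and used to boost the original attention values. The function and parameter names, the mean-based relevance, and the exponential boosting are our own illustrative choices, not the exact correlation matrices used in VAALE.

```python
import numpy as np

def refocus_attention(cross_attn, vis_attn, txt_attn, tau=1.0):
    """Illustrative attention refocusing.

    cross_attn: (num_visual, num_text) cross-attention between visual and
                instruction tokens (higher = more related)
    vis_attn:   (num_visual,) original attention on visual tokens
    txt_attn:   (num_text,)   original attention on instruction tokens
    """
    vis_relevance = cross_attn.mean(axis=1)  # relevance of each visual token to the instruction
    txt_relevance = cross_attn.mean(axis=0)  # relevance of each instruction token to the image

    def reweight(weights, relevance):
        boosted = weights * np.exp(relevance / tau)  # emphasize high-similarity tokens
        return boosted / boosted.sum()               # renormalize to a distribution

    return reweight(vis_attn, vis_relevance), reweight(txt_attn, txt_relevance)
```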

Visual beam search: Optimal answer search by considering visual information

To further mitigate hallucination caused by insufficient visual attention, we propose a new decoding method called visual beam search, under the assumption that a response that interacts more with the image is more reliable. The basic decoding process is similar to traditional beam search. The main difference is that the Visual Interaction Degree (VID) values of candidate text tokens are used to enhance the original predicted logits, and the decoding process depends on the enhanced logits. In our implementation, the VID is defined as the average cross-attention weight between the text token and all the visual tokens. By effectively introducing visual interaction information into the logits, visual beam search strengthens the role of the image during decoding.
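A simplified single step of this idea is sketched below: each candidate token’s log-probability is augmented with its VID before the usual beam expansion and pruning. The weighting factor and the exact way VID enters the score are assumptions for illustration; see the paper for the actual formulation.

```python
import numpy as np

def visual_beam_step(beams, log_probs, vid, beam_size=3, lam=0.5):
    """One simplified step of visual beam search.

    beams:     list of (token_ids, cumulative_score) tuples
    log_probs: (num_beams, vocab_size) log-probabilities predicted for each beam
    vid:       (num_beams, vocab_size) Visual Interaction Degree of each candidate
               token, i.e. its average cross-attention weight over all visual tokens
    lam:       strength of the visual enhancement (illustrative hyper-parameter)
    """
    enhanced = log_probs + lam * vid                   # enhance the logits with VID
    candidates = []
    for b, (tokens, score) in enumerate(beams):
        for tok in np.argsort(enhanced[b])[-beam_size:]:   # top candidates for this beam
            candidates.append((tokens + [int(tok)], score + enhanced[b, tok]))
    candidates.sort(key=lambda c: c[1], reverse=True)  # keep the best beams overall
    return candidates[:beam_size]
```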

Experimental results

We also conduct experiments on the MSCOCO subset of the POPE dataset, and the experimental results are shown in Table 3. Compared to the baselines, our method achieves notable improvements.

Table 3. Performance on MSCOCO POPE. Results are averaged across the three sampling options (i.e., random, popular and adversarial). Bold values indicate the best performance for each base model.

Conclusion and future work

We have conducted in-depth research on hallucination mitigation for MLLMs. The methods based on attention activation refinement, namely Scalpel and SchröMind, treat the problem of incorrect attention as a black box: they do not analyze its causes and simply try to align the hallucinated activations with the trusted ones. Comparing the two methods, Scalpel only adjusts the hallucinated activations at the component level, while SchröMind achieves more accurate point-wise alignment. In contrast, VAALE designs two modules, attention refocusing and visual beam search, to selectively enhance attention on the task-relevant text and visual tokens, which aims to address the underlying causes of incorrect attention. Experimental results demonstrate the effectiveness of our proposed methods on a public benchmark for the VQA task. More details can be found in our papers: https://openaccess.thecvf.com/content/WACV2026/html/Shi_Scalpel_Fine-Grained_Alignment_of_Attention_Activation_Manifolds_via_Mixture_Gaussian_WACV_2026_paper.html (Scalpel), https://arxiv.org/abs/2602.09528 (SchröMind), and https://arxiv.org/abs/2602.09521 (VAALE).

Next, we will extend the applications of our methods and deploy them in customers’ real-world scenarios. We hope our continued R&D efforts will make Fujitsu’s MLLMs more powerful and more widely adopted.