
Hello, we are Fei Li, Jiaqi Ning, and Ming Yang from the Generative AI research group at Fujitsu Research & Development Center (FRDC) in China. Today, we would like to introduce the grounding technologies we have developed for Multimodal Large Language Models (MLLMs).
Grounding for MLLMs
In recent years, MLLMs have demonstrated impressive capabilities in visual understanding tasks such as Visual Question Answering (VQA). Given a question and the corresponding image, most existing MLLMs can output the final answer as well as the reasoning process. However, these outputs still lack explainability, since we do not know where the output information comes from. The technology for localizing the related areas in the image is referred to as grounding. Unfortunately, the grounding capability of current MLLMs is not as strong as their reasoning capability, especially for document images.
Our main research work centers on developing grounded MLLMs, and we address the problem from two different perspectives. On the one hand, with an effective training strategy, we localize the answer areas and develop a conclusion-grounded model. On the other hand, with additional preprocessing operations, we localize all the related information in the reasoning process as well as the final answer, and develop a thinking-grounded model. The effectiveness of both methods is demonstrated on public VQA benchmarks.
Conclusion-grounded model
The goal of our first method is to train a model that can localize the answer areas. Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) are the two common strategies for MLLM training. SFT distills knowledge from high-quality expert data, enabling models to fit the target distribution rapidly and accurately, but it often limits the model’s exploration, potentially leading to overfitting on training data and poor performance on test data. RL allows the model to explore actively and improve based on feedback from verified results. However, RL presupposes that the base model already possesses a certain level of the target capability; because the grounding capability of existing MLLMs is limited, RL alone struggles to achieve satisfactory results.
We have developed a joint training strategy that combines SFT and RL. The basic idea is as follows: for each training sample, multiple outputs are generated for model optimization. If most generated outputs are incorrect, meaning the current model has limited ability to handle the problem, we use SFT to imitate the labeled data. Otherwise, if most generated outputs are correct, meaning the current model can already deal with the problem, we adopt RL to explore better possibilities.
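To make the idea concrete, here is a minimal Python sketch of the per-sample switching rule. The rollout count, the 0.5 correctness threshold, and the group-normalized advantages in the RL branch are our illustrative assumptions, not details confirmed in this post.

```python
def choose_update_mode(rewards, threshold=0.5):
    """Decide between SFT and RL for one training sample.

    `rewards` holds per-rollout correctness (1.0 correct, 0.0 wrong)
    from a verifier. Below `threshold` mean accuracy, the model is
    weak on this sample, so we imitate the labeled data (SFT);
    otherwise we let it explore (RL).
    """
    accuracy = sum(rewards) / len(rewards)
    return "sft" if accuracy < threshold else "rl"

def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages for the RL branch (a common choice;
    the actual RL objective is an assumption here)."""
    mean_r = sum(rewards) / len(rewards)
    std_r = (sum((r - mean_r) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean_r) / (std_r + eps) for r in rewards]

# Example: 8 rollouts, only 2 verified correct -> fall back to SFT.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
print(choose_update_mode(rewards))  # -> "sft"
```

In a full training loop, the “sft” branch would apply a cross-entropy loss on the labeled answer, while the “rl” branch would apply a policy-gradient update weighted by these advantages.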
We conduct extensive experiments with Qwen-VL-7B on multiple datasets, including chart data from ChartQA, document data from DocVQA, and poster data from DORG, and we compare our joint SFT+RL training strategy with the commonly used “first SFT, then RL” method. The experimental results are shown in Table 1. The base model cannot localize answer areas, and our method outperforms “first SFT, then RL” in both VQA accuracy and grounding accuracy.

An example is shown in Figure 1. The VQA case is from the open ChartQA benchmark. The input is a line chart image, and the question asks for the difference between the highest and lowest female life expectancy at birth from 2008 to 2018. The final answer of the non-reasoning model is correct but lacks explainability. Our conclusion-grounded model not only outputs the correct answer and its reasoning, but also localizes the answer-related areas in the image, namely the highest and lowest values on the “female” line, which makes the output answer more explainable.
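For reference, a grounded answer of this kind could be serialized as follows. This JSON layout, its field names, and the placeholder values are purely hypothetical, chosen only to illustrate what “answer plus localized areas” might look like; it is not the actual output format of our model.

```python
import json

# Hypothetical grounded output with placeholder values: the answer,
# a short reasoning trace, and (x1, y1, x2, y2) boxes for the
# answer-related areas in the chart image.
grounded_output = {
    "answer": "<difference between the two values>",
    "reasoning": "<comparison of the highest and lowest points on the female line>",
    "grounding": [
        {"label": "highest value on female line", "box": [0, 0, 0, 0]},
        {"label": "lowest value on female line", "box": [0, 0, 0, 0]},
    ],
}
print(json.dumps(grounded_output, indent=2))
```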

Thinking-grounded model
The goal of our second method is to train a model that can localize all the related information in the thinking process. Obviously, the number of areas to be localized is much larger, which requires extremely strong grounding capability. To ease the grounding task, our basic idea is to resort to preprocessing. For VQA on document images, text is one of the most important components. Therefore, we detect all the text in the input image with an external Optical Character Recognition (OCR) detector, and feed the detection results as additional information for model training with SFT. In this way, the model only needs to select bounding boxes from the detection results instead of generating coordinates for localization.
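Below is a minimal Python sketch of this preprocessing step, assuming the OCR detector returns (text, box) pairs. The prompt template and the [i] citation convention are our illustrative assumptions, not the exact format used in our model.

```python
def build_grounded_prompt(question, ocr_results):
    """Format OCR detections so the model selects boxes by index.

    `ocr_results` is a list of (text, (x1, y1, x2, y2)) pairs from an
    external OCR detector. Instead of regressing coordinates, the model
    only has to cite region indices, which is a much easier task.
    """
    region_lines = [
        f'[{i}] "{text}" at ({x1}, {y1}, {x2}, {y2})'
        for i, (text, (x1, y1, x2, y2)) in enumerate(ocr_results)
    ]
    return (
        "Detected text regions:\n"
        + "\n".join(region_lines)
        + f"\n\nQuestion: {question}\n"
        + "Cite every region you use in your reasoning by its index, e.g. [3]."
    )

# Toy example with made-up text and coordinates.
ocr = [("Tajikistan", (40, 120, 130, 140)), ("171.3", (150, 120, 196, 140))]
print(build_grounded_prompt("What is the mean height of adult men in Tajikistan?", ocr))
```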
We conduct extensive experiments with Qwen-VL-3B on chart data from ChartQA and table data from TabMWP, and compare our thinking-grounded model with the base model. The experimental results are shown in Table 2. Our model not only improves VQA accuracy, but also provides satisfactory grounding capability.

An example is shown in Figure 2. This VQA case is also from the open ChartQA benchmark. The input is a bar chart image, and the question asks for the difference in mean height of adult men between Tajikistan and Algeria. The answer of the non-reasoning model is wrong, but we cannot tell why, since no additional explanation is available. Our thinking-grounded model localizes all the information used in its thinking, pairing each highlighted piece of text with the corresponding bounding box in the image. In this way, the output of our model can be easily understood.
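Rendering such paired output only requires mapping the cited indices back to the OCR detections; here is a sketch under the same assumed [i] citation convention as in the prompt sketch above.

```python
import re

def extract_cited_boxes(reasoning, ocr_results):
    """Map [i] citations in the reasoning back to OCR (text, box) pairs.

    Uses the hypothetical [i] citation convention from the prompt sketch;
    the returned pairs can then be highlighted on the input image.
    """
    cited_ids = {int(i) for i in re.findall(r"\[(\d+)\]", reasoning)}
    return [(i, ocr_results[i]) for i in sorted(cited_ids) if i < len(ocr_results)]

# Toy reasoning trace citing the two regions from the earlier example.
reasoning = "Tajikistan [0] has a mean height of 171.3 [1]; subtract Algeria's value."
ocr = [("Tajikistan", (40, 120, 130, 140)), ("171.3", (150, 120, 196, 140))]
print(extract_cited_boxes(reasoning, ocr))
```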

Conclusion and future work
We have conducted in-depth research on grounding technologies for MLLMs and developed two grounded models. For the conclusion-grounded technology, while we currently focus on document understanding tasks, our joint training strategy can be easily extended to other kinds of images and generalized to other application fields, such as medical diagnosis (localizing abnormal tissues) and industrial quality inspection (marking defects). For the thinking-grounded technology, since it depends on OCR detection results, the main applications are document-oriented, such as invoice visual-text understanding (highlighting key information) and intelligent education (illustrating the detailed analysis and solution of exercises).
Next, we will make Fujitsu’s grounded MLLMs more powerful and apply them to more customized real-world scenarios.