Hello. We are Hirai, Moteki, and Masui from Artificial Intelligence Laboratory.
In October 2024, Fujitsu began offering "Fujitsu Kozuchi AI Agent", which enables AI to collaborate with humans to autonomously perform advanced tasks. An AI agent is an evolving form of generative AI that can analyze what individual tasks are needed for a given goal, plans the entire processing flow, and autonomously achieves the goal by utilizing available resources. So far, various AI agents have been proposed and announced, but what we are targeting is not only conventional office work, but also field work such as manufacturing, logistics, public and road management, and construction. At work sites, it is becoming difficult to secure the high level of proficiency and man-hours of workers due to the aging of the population and the shortage of human resources. This has led to increased burdens on field managers and workers for safety management and training, creating a social problem.
If an AI agent built with a general multi-modal LLM (Large Language Model) is used to understand work field from a single monocular camera installed at the site and apply to safety monitoring of on-site work, the following issues arise and are known to hinder practical applications.
- It is not possible to measure the accurate distance between workers and moving objects such as work vehicles, and then, it is not possible to confirm compliance with the safety rule guidelines.
- When grasping the condition of field workers, it is not possible to accurately recognize those who are wearing personal protective equipment (PPE) or operating vehicles (forklifts, etc.) whose entire body is not captured due to occlusion (hiding).
These issues reveal the lack of understanding abilities of the work space by a general multi-modal LLM, and these abilities are essential functions for a video-analysis-type field work support agent (hereafter referred to as field work support agent) to reduce the workload of field managers and workers. In this article, we introduce the basic operation of the field work support agent and the associated fine-tuning technology to enhance its abilities to grasp the work space under development at our Human Reasoning CPJ, Artificial Intelligence Laboratory. Please read this press release to learn more about the overall scope, evaluation environment of field work support agent, and schedule for practical applications.
Field work support agent
The field work support agent uses video input from a monocular camera installed in the field, along with documents such as work instructions, safety rules, and other relevant materials. Leveraging its enhanced spatial understanding abilities, the agent can detect near-miss incidents at worksites and issue associated incident reports to provide practical supports for field managers and workers.
Fine-tuning technology to enhance spatial understanding abilities
The figure below shows the training flow for a multi-modal LLM to enhance 3D spatial understanding abilities from monocular 2D camera footage. First, objects (e.g., a forklift or person) described in site-related documents (e.g., safety rules) are selected. Next, from monocular 2D video acquired on-site, the positions of these selected objects are identified using object detection technology. 3D depth images, with depth information generated for each pixel of the 2D camera images, are then created and combined with the object locations to generate 3D data that distinguishes individual objects. This 3D data allows for the understanding of positional relationships of objects in 3D space. Subsequently, question-answer pairs (e.g., "How far is the worker from the forklift?" – "1.5 m") are generated and associated with the 3D data to fine-tune the model. Further training data includes images depicting PPE usage and work vehicle operators (like forklift operators) whose bodies is partially obscured. With this comprehensive training data, the multi-modal LLM is fine-tuned to enhance spatial understanding abilities in field settings.
We use a multi-modal LLM with integrated 3D spatial understanding, rather than a separated spatial reasoning AI engine supported by an LLM, to maximize processing speed. In the field applications, real-time support for field managers and workers on safety issues is critical. Combining language processing and spatial understanding with the fine-tuned multi-modal LLM allows for direct answer generation from video and text inputs to minimize the processing time.
The actual fine-tuning process to provide existing multi-modal LLMs the abilities to understand spatial relationship between objects uses LLaVA-format datasets, converted images and texts in json format, to learn the spatial context and relative position of objects in images. By using the LoRA*1 in the fine-tuning process, only limited parameters of the base LLM are tuned, making it possible to improve the learning efficiency while reducing the calculation cost, and at the same time, to improve the spatial understanding abilities by optimizing the extraction of spatial features.
High precision 3D data generation for fine-tuning
To generate 3D data that can accurately infer the distance between objects from monocular 2D camera images,
- Monocular camera calibration
- Depth scale estimation
are of importance. Up to now, camera calibration (internal and external parameter estimation) using a chessboard has been frequently used; however, since the burden of placing and scanning the chessboard by field workers is enormous, automatic calibration from monocular camera images becomes essential. In addition, since the 3D depth information estimated by the monocular camera often does not match the actual scale of analyzed space, it is found necessary to estimate the correlation between the depth information and the actual scale of analyzed space (Depth scale estimation).
Monocular Camera Calibration
To extract internal camera parameters (focal length and distortion factor) and external camera parameters (camera angle) from monocular camera images simultaneously, GeoCalib*2, a combination of deep learning and geometric optimization, can be used as a tool. However, the problem was that this tool could not estimate the position of the camera (camera height). Therefore, we developed a method to simultaneously optimize the height of the person and the camera height by using the estimated parameters of GeoCalib and the scale of the person determined from the skeleton generated from a pose estimation tool.
Here is an example of a camera calibration result. The image on the right plots the 2D coordinate trajectories of the feet (red) and head (blue) of the skeletons of the detected people. The image on the left plots the 3D coordinates of the foot (red) and head (blue) as deduced from these 2D coordinate trajectories. In this way, it is possible to estimate camera parameters from monocular camera images and infer the motion of a person's 3D position.
Depth scale estimation
To estimate the 3D depth from a monocular camera image and convert that depth information into 3D data, we need a focal length and depth scale. While the focal length is determined by the camera calibration described above, the depth scale is derived as follows:
- A reference plane (e.g. floor) is estimated.
- A plurality of two-dimensional coordinates are extracted on a reference plane, and three-dimensional coordinates X are calculated from the two-dimensional coordinates by using camera parameters.
- Assuming that the initial value of depth scale is 1, a three-dimensional coordinate X' is calculated from the two-dimensional coordinates by using the depth information.
- The distance between the plurality of points is respectively calculated using the three-dimensional coordinates X calculated using the camera parameters at the plurality of points and the three-dimensional coordinates X' calculated using the depth information, and the depth scale is estimated from the ratio.
The two techniques described above, monocular camera calibration and depth scale estimation, enable accurate estimation of the three-dimensional position of an object based only on the image information of a monocular camera. The following figure shows examples of the results whether depth scale is estimated or not. In this example, it can be noted that the depth scale is 0.45, and if the depth scale is not correctly estimated, the distances, equal to the training data, are far from the actual values.
As a result of fine-tuning with the training data generated using the above technologies, the distance to the operator closest to the forklift and the position of the worker who is not wearing PPE are accurately derived as output from the field work support agent as shown in the figures below.
Conclusion
This article summarizes an overview of field work support agents that can support workers at manufacturing and logistics sites, fine-tuning technologies necessary to acquire and strengthen required spatial understanding abilities, and monocular camera calibration and depth scale estimation technologies to increase the accuracy of learning data. As a result, it became possible to measure the exact distance between a worker and a moving object, and to check the wearing of PPE. In order to apply this technology to a wider range of fields, we will develop this technology in cooperation with the "KG Extended RAG for VA" technology developed by Human Reasoning CPJ. Read more about KG Extended RAG for VA here.
The technologies involved in this article were developed by the following members. Let me take this opportunity to introduce you.
- Human Reasoning CPJ, Artificial Intelligence Laboratory, Fujitsu Research: Shan Jiang, Shoichi Masui, Atsunori Moteki, Fan Yang, Yukio Hirai, Ikuo Kusajima, Marinho Soares Mauro, Yosie Kobayashi
- AI Innovation CPJ, Artificial Intelligence Laboratory, Fujitsu Research: Yasuto Watanabe
- Application Guild Division, Japan Global Gateway : Hiroyuki Ishida, Yuhei Shibahara, Ken Tsushima