
Hello! We are researchers from the Converging Technologies lab at Fujitsu Research. We are presenting a novel semi-automated solution to the problem of converting and integrating data into desired structures and formats. In data-driven systems, integrating disparate data sources becomes challenging when incoming data doesn’t conform to the system’s data specifications. Despite advances in automated schema matching systems, data integration tasks involving complex semantic interrelationships still require users to manually identify and define elaborate transformations between datasets. This process consumes substantial manual time and effort and remains a bottleneck in modern data integration workflows. Our DataSemantics technology employs an AI-driven human-in-the-loop system to automate data conversion end to end. It uses LLMs to analyze semantic relationships and generate step-by-step transformation pipelines autonomously, requesting the user’s attention only to resolve specific semantic ambiguities through its user interface.

Existing Challenges in Data Integration and Conversion
Getting data into the desired structure and format is a major part of data wrangling. The need arises in many contexts: when collaborating with external organizations, handling crowdsourced or consumer data, converting old data into newer formats, processing data from diverse sensors or hardware manufacturers, or applying previous analyses to new datasets. Almost every business dealing with data faces these issues. Data from different sources can communicate similar information, such as public census information, yet be structured, formatted, and expressed in wildly different ways. Reconciling it requires a lot of manual effort: understanding the data, strategizing the conversion process, identifying and implementing the individual data transformation operations, and finalizing the results. No existing tool automates this end-to-end process, which is a major point of friction.
The DataSemantics Solution
DataSemantics performs the automated conversion in three steps:
1. Analyzing Datasets and Metadata: It extracts and interprets the structural and semantic characteristics of the existing data, the specifications, and the incoming new data sources.
2. Establishing Semantic Relationships: It establishes meaningful mappings (direct and derived) between source and specification.
3. Implementing Conversion Pipeline: It translates the generated transformation strategies into executable code, forming the transformation pipeline, and generates the final converted data.
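
The three steps can be illustrated with a minimal Python sketch. This is not the DataSemantics implementation: every name here (`Mapping`, `analyze`, `validate`, `convert`, and the sample columns) is hypothetical, and the mappings are written by hand, whereas DataSemantics derives them with LLMs. The sketch only shows the difference between a direct mapping (copy one source field) and a derived mapping (compute a target field from several source fields).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Mapping:
    """One target field and how to obtain it (hypothetical structure)."""
    target: str          # field name in the target specification
    sources: list        # source columns the target is computed from
    transform: Callable  # identity for direct mappings, a formula for derived ones


def analyze(rows):
    """Step 1 (stand-in): profile the source by recording a type name per column."""
    profile = {}
    for row in rows:
        for name, value in row.items():
            profile.setdefault(name, type(value).__name__)
    return profile


def validate(profile, mappings):
    """Step 2 (stand-in): check that hand-written mappings only use known columns."""
    missing = sorted({s for m in mappings for s in m.sources if s not in profile})
    if missing:
        raise KeyError(f"mappings reference unknown source columns: {missing}")


def convert(rows, mappings):
    """Step 3: apply every mapping to every row to build the converted data."""
    return [
        {m.target: m.transform(*(row[s] for s in m.sources)) for m in mappings}
        for row in rows
    ]


# Invented sample data, for illustration only.
source_rows = [
    {"region": "A", "male": 10, "female": 12},
    {"region": "B", "male": 7, "female": 5},
]
mappings = [
    Mapping("region_name", ["region"], lambda v: v),                      # direct
    Mapping("total_population", ["male", "female"], lambda m, f: m + f),  # derived
]

profile = analyze(source_rows)
validate(profile, mappings)
converted = convert(source_rows, mappings)
print(converted)
```

Keeping the mappings as data, separate from the code that applies them, is one way a person could review or correct an ambiguous mapping before the pipeline runs.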

Sample Use Case
As a joint collaboration between our labs in the US, Europe, and Japan, we ran a successful field trial on converting diverse census data into a specific format for traffic simulation. Census data for each region or country can differ significantly in structure and format even though it carries similar information. In this case, a downstream task used specific census information and required the data to be processed into a specific format. Without automation, the developer would need to write a separate data conversion script for the census data of every new region. As the figure shows, US Census data is structured very differently from the data of a Japanese prefecture. This is where we simply use DataSemantics to obtain the conversion scripts automatically. The final script and converted data were 100% accurate, and the entire run took only 1 hour, compared to the 1-2 weeks of developer time needed to write the conversion scripts manually.
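
To make the per-region scripting burden concrete, here is a small Python sketch of the kind of conversion code a developer would otherwise write by hand. The two layouts, the column names, and the target fields are all invented for illustration. They are not the actual US or Japanese census formats, nor the format used in the field trial. The sketch assumes one source stores age bands as columns (wide) and the other stores them as rows (long).

```python
from collections import defaultdict

# Hypothetical target specification shared by all regions.
TARGET_FIELDS = ("region_id", "children", "adults", "seniors")


def from_wide(rows):
    """Wide layout (assumed): one row per region, one column per age band."""
    return [
        {
            "region_id": r["tract"],
            "children": r["age_0_17"],
            "adults": r["age_18_64"],
            "seniors": r["age_65_plus"],
        }
        for r in rows
    ]


# Maps the long layout's age-group labels onto target field names.
LONG_GROUPS = {"0-17": "children", "18-64": "adults", "65+": "seniors"}


def from_long(rows):
    """Long layout (assumed): one row per (region, age group) pair."""
    by_region = defaultdict(lambda: {"children": 0, "adults": 0, "seniors": 0})
    for r in rows:
        by_region[r["area"]][LONG_GROUPS[r["age_group"]]] += r["count"]
    return [{"region_id": area, **counts} for area, counts in by_region.items()]


# Invented sample records in each layout.
wide_rows = [{"tract": "T1", "age_0_17": 100, "age_18_64": 300, "age_65_plus": 50}]
long_rows = [
    {"area": "P1", "age_group": "0-17", "count": 80},
    {"area": "P1", "age_group": "18-64", "count": 250},
    {"area": "P1", "age_group": "65+", "count": 120},
]

wide_result = from_wide(wide_rows)
long_result = from_long(long_rows)
print(wide_result)
print(long_result)
```

Even in this toy case, each layout needs its own function, and a new region with a third layout would need another one.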

Potential Use Cases
- Public Census Data from Different Cities: A company that uses demographic data to generate traffic simulations can use DataSemantics as shown in our sample use case.
- E-commerce data from different companies: Company A is acquiring Company B, and B’s order data needs to be converted to the same structure and format as A’s. We use DataSemantics to successfully perform this conversion.
- Retail store sales data from different stores: Different stores of a particular company have their own local formats for storing data. The company now wants to unify all the data into a single format. We use DataSemantics to successfully perform this style of conversion.
- Road Accident Data from Different Countries: A consulting company wants to analyze road accident data from different countries to find unique patterns. However, each country’s data is in a very different format. DataSemantics enables conversion of every dataset into a prescribed format so that it can be used easily for analysis.