
Hello, I'm Yamada from Fujitsu Research's Artificial Intelligence Laboratory. Fujitsu participated in the prestigious international AI conference "The 40th Annual AAAI Conference on Artificial Intelligence (AAAI-26)", held in Singapore from January 20 to 27, 2026, presenting multiple papers and hosting a workshop. We are publishing a series of articles about AAAI-26.
This post is the third article, focusing on techniques that strengthen a large language model’s (LLM’s) ability to solve new tasks by leveraging past experience. The other posts in this series are:
- Part 1: AAAI-26 Participation and Exhibition #1 - Report on Hosting the Workshop (Published)
- Part 2: AAAI-26 Participation and Exhibition #2 - Report on the Paper Presentation on Causal AI Technology (Published)
- Part 3: AAAI-26 Participation and Exhibition #3 - Report on our Paper Presentation on AI Reasoning Technology (This article)
Publication Information
- Title: Hypothesis-Driven Reasoning for Large Language Models
- Authors: Aakash Kumar Agarwal, Moyuru Yamada (co-first authors)
- Venue: 40th AAAI Conference on Artificial Intelligence (AAAI 2026)
- Link: External site
* This research was conducted in collaboration with an intern during my assignment to Fujitsu Research of India Private Limited (FRIPL).
Background: Long-Term Memory in LLMs
When you start a new chat with a generative AI service, it typically does not remember what you discussed in previous, separate chats. However, it would be extremely useful if the model could provide better answers by drawing on past interactions. For that, the LLM needs to “remember” past information—often referred to as long-term memory in LLMs.
The same applies to enterprise use cases. If a model can solve new tasks based on previously provided data, we can build more useful services. For example, imagine you previously labeled shapes as anomalous or normal. Later, you might want the system to solve a new task: count how many anomalous shapes exist in the entire image. If the model can reuse what it learned from the earlier labels, that would be very helpful.
In principle, you can try to achieve this simply by including the past content in the prompt. But our research started from the observation that naively providing past content often fails to deliver good performance. To address this, we developed a technique that enables LLMs to solve new tasks more effectively by leveraging past experience.
Problem Setup: Knowledge Transfer Tasks
To evaluate an LLM’s ability to solve new tasks based on past experience, we designed a set of knowledge transfer tasks. In this setup, the LLM is given:
- Past experience (“Episodes”): examples of objects with labels, and
- A new target task to solve based on that experience.
Concretely, after seeing labels indicating whether each shape is anomalous or normal, the model must solve the target task of counting the number of anomalous shapes in a full image.

We defined three difficulty levels:
- Level 1:
  - Background: white only
  - Colors: blue, red, yellow, green (4 colors)
  - Shapes: circle and square (2 shapes)
  - The past experience covers the mapping between labels and all 8 types of colored shapes.
- Level 2:
  - Colors and shapes: same as Level 1
  - Background: white or black (2 backgrounds)
  - Crucially, the label judgment flips when the background is black. For example, a shape that is normal on a white background becomes an anomaly on a black background.
  - The past experience covers the relationships between labels and all 16 combinations (shape × color × background).
- Level 3:
  - Same as Level 2, but some of the 16 combinations are hidden from the episode. For example, the model might see a blue circle on a white background but never a blue circle on a black background.
  - This makes it harder to discover the rule needed for the target task from past experience.
  - We hide 4 of the 16 combinations, but the rule is still discoverable by humans.
The target task is to count the number of anomalous objects. Humans can usually infer the rule and answer correctly—so the question is: can LLMs do the same?
We chose to create a synthetic dataset rather than using an existing benchmark because existing datasets may have been included in the LLM’s training data. By embedding our own rules into synthetic data, we can more reliably evaluate whether the model truly learns from the episode to solve the new task.
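For illustration, here is a minimal Python sketch of the kind of rule that can be embedded in such synthetic data. The concrete base rule used here (squares are anomalous on a white background), the data format, and all function names are assumptions for this post; what the benchmark actually specifies is the factor space, the flip of the judgment on black backgrounds, and the hiding of 4 of the 16 combinations at Level 3.

```python
import itertools
import random

COLORS = ["blue", "red", "yellow", "green"]
SHAPES = ["circle", "square"]
BACKGROUNDS = ["white", "black"]   # Level 1 uses the white background only

def is_anomalous(shape: str, color: str, background: str) -> bool:
    """Hypothetical labeling rule in the spirit of Level 2.

    The base rule on a white background (here: squares are anomalous) is an
    assumption for illustration. The key property from the post is that the
    judgment flips when the background is black.
    """
    base = (shape == "square")          # assumed base rule on white
    return not base if background == "black" else base

def build_episode(hide: int = 0, seed: int = 0):
    """Return labeled examples for all shape x color x background combinations,
    optionally hiding `hide` of them, as in Level 3."""
    combos = list(itertools.product(SHAPES, COLORS, BACKGROUNDS))
    random.Random(seed).shuffle(combos)
    kept = combos[hide:]                # Level 3 hides 4 of the 16 combinations
    return [
        {"shape": s, "color": c, "background": b,
         "label": "anomalous" if is_anomalous(s, c, b) else "normal"}
        for (s, c, b) in kept
    ]

def count_anomalies(objects) -> int:
    """Ground-truth answer for the target task: count the anomalous objects."""
    return sum(is_anomalous(o["shape"], o["color"], o["background"]) for o in objects)

# Example: a Level 3 episode (12 labeled combinations) and a small target scene.
episode = build_episode(hide=4)
scene = [{"shape": "circle", "color": "blue", "background": "black"},
         {"shape": "square", "color": "red", "background": "white"}]
print(len(episode), count_anomalies(scene))
```

In the actual benchmark the episode and the target scene are images rather than dictionaries; the sketch only makes the rule structure explicit.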
Limitations of LLMs
We first evaluated a straightforward baseline: providing the episode directly in a standard prompt.
Surprisingly, under Level 3 with Chain-of-Thought (CoT), the average accuracy across three LLMs was only 38.4%. This indicates that LLMs struggle to solve a new task based on multimodal past experience. We also observed that during reasoning, the LLM often failed to discover the rule needed for the target task—or inferred an incorrect rule.
Then we asked: what if we provide the correct rule (i.e., a semantic/oracle rule) explicitly in text?
In fact, giving the correct rule significantly improved the LLM’s ability to solve the target task. This suggests a key limitation: the bottleneck is not executing the task once the rule is known, but discovering the rule from past experience. In other words, simply pasting experiences into the prompt is not enough.
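To make the two settings concrete, the following rough sketch contrasts a baseline prompt, where the labeled episode is pasted directly and the model is asked to reason step by step, with an oracle prompt, where the correct rule is stated in text. The prompt wording, the example episode, and the rule text are hypothetical and are not the exact prompts used in our experiments.

```python
def build_baseline_prompt(episode, question: str) -> str:
    """Baseline: paste the labeled episode directly into the prompt and ask for CoT."""
    lines = [
        f"- {e['color']} {e['shape']} on a {e['background']} background: {e['label']}"
        for e in episode
    ]
    return (
        "Here are labeled examples from past experience:\n"
        + "\n".join(lines)
        + f"\n\nNow solve this task: {question}\n"
        "Think step by step before giving the final count."
    )

def build_oracle_prompt(rule_text: str, question: str) -> str:
    """Oracle setting: the correct rule is stated explicitly in text."""
    return (
        f"Rule: {rule_text}\n\n"
        f"Using this rule, solve the task: {question}\n"
        "Think step by step before giving the final count."
    )

question = "How many anomalous shapes are in the attached image?"
example_episode = [
    {"shape": "square", "color": "red", "background": "white", "label": "anomalous"},
    {"shape": "circle", "color": "blue", "background": "black", "label": "anomalous"},
]
print(build_baseline_prompt(example_episode, question))
# The oracle variant instead states the (hypothetical) rule explicitly:
# print(build_oracle_prompt("Squares are anomalous; the judgment flips on black backgrounds.",
#                           question))
```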

Proposed Method: Hypothesis-Driven Reasoning
Our proposal is to add a module outside the LLM that extracts implicit knowledge—like patterns or rules—from past experience as hypotheses, and then stores and utilizes those hypotheses. We call this approach Hypothesis-Driven Reasoning.
Hypothesis-Driven Reasoning enables the LLM to discover rules from past experience and apply them to new tasks.

To extract reliable hypotheses from data, we developed a new method consisting of two main stages:
- Factor Extraction: From the episode, we extract the factors that are necessary for solving the target task, such as "color", "shape", and "background color".
- Hypothesis Generation and Verification: Based on the extracted factors, the method generates hypotheses for the rule needed to solve the target task and verifies them against the episodes. For instance, during initial hypothesis generation, the model might propose: "Blue shapes are normal." During verification, it can detect that this hypothesis is inconsistent with the episode and reject it. By repeating this generate-verify loop, the system identifies more reliable implicit knowledge from past experience (a minimal sketch of this loop follows after this list).
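As a rough illustration of the generate-verify loop, the sketch below enumerates simple candidate hypotheses over the extracted factors and keeps only those that are consistent with every labeled example in the episode. In the actual method the LLM itself proposes and refines the hypotheses; the exhaustive enumeration and the single-factor hypothesis form used here are assumptions made purely to show the verification step.

```python
FACTORS = {  # factors extracted in the first stage (color, shape, background color)
    "shape": ["circle", "square"],
    "color": ["blue", "red", "yellow", "green"],
    "background": ["white", "black"],
}

def candidate_hypotheses():
    """Enumerate simple rule candidates of the form
    'anomalous iff <factor> == <value>, optionally flipped on a black background'.
    This hypothesis space is an assumption made for illustration."""
    for factor, values in FACTORS.items():
        for value in values:
            for flips_on_black in (False, True):
                yield (factor, value, flips_on_black)

def predict(hypothesis, example) -> str:
    """Predict the label of one example under a candidate hypothesis."""
    factor, value, flips_on_black = hypothesis
    anomalous = (example[factor] == value)
    if flips_on_black and example["background"] == "black":
        anomalous = not anomalous
    return "anomalous" if anomalous else "normal"

def verify(hypothesis, episode) -> bool:
    """A hypothesis survives only if it is consistent with every labeled example."""
    return all(predict(hypothesis, e) == e["label"] for e in episode)

def discover_rules(episode):
    """Generate-verify loop: propose candidates, keep the consistent ones."""
    return [h for h in candidate_hypotheses() if verify(h, episode)]

# Reusing the `episode` list from the earlier dataset sketch:
# surviving = discover_rules(episode)
# print(surviving)
```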

Experimental Results
We compared the quality of the generated hypotheses with that of other methods; our approach produced more reliable hypotheses than the competing baselines. As a result, it achieved performance close to the setting where the LLM is provided with the correct oracle rule.
These results show that Hypothesis-Driven Reasoning can significantly improve an LLM’s ability to solve new tasks based on past experience.

Summary and Future Directions
We proposed Hypothesis-Driven Reasoning, a novel framework to strengthen an LLM’s capability to solve new tasks using past experience. Experimental results demonstrated that our method can substantially improve performance on knowledge transfer tasks by enabling the model to discover and leverage reliable rules from prior data.
At the AAAI 2026 poster session, we received a lot of feedback such as:
- “I’ve had the same question, so this is very interesting,” and
- “Could this be applied to xxx?”
So many people stopped by our poster that discussions continued well beyond the scheduled end time.
Going forward, we plan to validate applications where the system discovers domain-specific tacit knowledge—knowledge not written in manuals—directly from data and uses it to support real-world operations.