
Efficient Task-Specific Hybrid Attention Model Construction

Hello. We are Xiaojie Xia, Chaoliang Zhong, and Jun Sun from Fujitsu Research and Development Center (FRDC) and Yusuke Oishi from the AI Laboratory in FRJ. We are excited to share our latest research focused on constructing task-specific hybrid models.

Transformers are powerful, but their attention mechanism scales quadratically with sequence length. That makes them slow and memory-hungry on long inputs such as lengthy documents and extended conversations.

To fix this, new models like RetNet, Mamba, and Gated DeltaNet use linear-complexity attention. They compress past context into a small hidden state, cutting memory and computation cost dramatically. But this compression often hurts performance on tasks that need fine-grained understanding.
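To make this concrete, here is a minimal, un-normalized sketch of the linear-attention idea (not the exact formulation used by RetNet, Mamba, or Gated DeltaNet, which add gating, decay, and normalization): the entire past is folded into a fixed-size state matrix, so per-token cost no longer grows with context length.

```python
import torch

def linear_attention_step(state, q_t, k_t, v_t):
    """One decoding step of a bare-bones linear attention layer.

    Instead of attending over all past keys and values, the past is
    summarized in a fixed-size matrix `state` of shape (d_k, d_v), so
    memory and per-token compute stay constant as the sequence grows.
    """
    state = state + torch.outer(k_t, v_t)  # fold the new token into the summary
    o_t = q_t @ state                      # read out with the current query
    return o_t, state

# Toy usage with d_k = d_v = 64 and a short sequence.
d_k = d_v = 64
state = torch.zeros(d_k, d_v)
for _ in range(8):
    q_t, k_t, v_t = torch.randn(d_k), torch.randn(d_k), torch.randn(d_v)
    o_t, state = linear_attention_step(state, q_t, k_t, v_t)
```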

Hybrid models offer a middle ground: mix a few full-attention layers (where accuracy matters most) with efficient linear attention layers elsewhere. In theory, you get the best of both worlds: high quality and fast speed.

In practice, though, two problems remain:

  1. Training hybrid models from scratch is as costly as training full-attention transformer models.
  2. There’s no clear rule for where to place full-attention layers or linear layers—it mostly depends on experience.

Method

We developed a method for the efficient, task-specific construction of high-performance hybrid models from existing pretrained full-attention transformer models.

“Clone” full-attention blocks into linear attention counterparts

We take each full-attention block in the original transformer model and teach a lightweight linear attention block to mimic its output. Think of it as training a student to copy an expert’s answers—not by solving the problem from scratch, but by learning to reproduce the expert’s replies on the same inputs.

We do this block by block, independently. For every layer, we feed the same hidden input into both the original full-attention block and its linear counterpart. Then, we adjust the linear block’s weights until its output closely matches the original’s. This process is called block-wise local distillation.

Figure 1: Linear attention weights obtained by block-wise local distillation. (a) Overall distillation framework from full attention to linear attention. (b) BLD (block-wise local distillation), in which the linear blocks are trained in parallel and independently.

Since the blocks are trained independently and in parallel, we don’t need to backpropagate through the entire network. That means:

  • Training is much faster and cheaper than pretraining a new model,
  • And the final linear blocks stay faithful to the original behavior—making them easy to plug into the existing pretrained model.
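As a rough illustration of block-wise local distillation, the sketch below trains one linear block to mimic one full-attention block on shared hidden states. The plain MSE loss, the optimizer, and the `block(hidden)` call signature are our simplifying assumptions for readability; the actual training recipe and block interfaces may differ.

```python
import torch
import torch.nn.functional as F

def distill_block(full_attn_block, linear_block, hidden_batches, lr=1e-4, steps=1000):
    """Block-wise local distillation for a single layer (sketch).

    Both blocks see the same hidden states; only the linear block is updated
    so that its output matches the full-attention output. No gradient flows
    through any other layer, so every block can be distilled independently
    (and in parallel across devices).
    """
    full_attn_block.eval()
    optimizer = torch.optim.AdamW(linear_block.parameters(), lr=lr)
    for _, hidden in zip(range(steps), hidden_batches):
        with torch.no_grad():
            target = full_attn_block(hidden)   # teacher output, frozen
        prediction = linear_block(hidden)      # student output
        loss = F.mse_loss(prediction, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return linear_block
```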

Construct a task-specific hybrid model

Now that we have both the original pretrained model and its distilled linear attention counterparts, the next step is to construct a hybrid model tailored to the target task.

Instead of running expensive architecture searches or retraining the whole model, we use a simple, greedy strategy: replace layers one by one, guided by validation performance.

Figure 2: Greedy layer replacement guided by target-domain validation data.

Here’s how it works:

  1. Start with the full-attention model and measure its performance on your target task (e.g., accuracy or F1 score).
  2. Try replacing each layer individually with its linear attention block, and test the performance on validation data.
  3. Keep the replacement that causes the smallest drop (or even a gain!) in performance, and lock that layer in as linear attention.
  4. Repeat the process: with that layer now fixed as linear, test the remaining full-attention layers one by one, and replace the “safest” one next.

Because linear attention layers don’t need to store a large key-value cache, every replacement reduces memory use and boosts speed. The more layers we can swap, the faster the model runs.
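As a back-of-the-envelope example (our own arithmetic, not a figure from the paper): Llama-3.1-8B uses grouped-query attention with 8 key-value heads of dimension 128, so with an fp16 cache each full-attention layer stores roughly 4 KB per token per sequence, which adds up quickly at long context lengths.

```python
# Rough KV-cache cost of ONE full-attention layer in Llama-3.1-8B
# (GQA with 8 KV heads, head_dim 128, fp16 cache assumed).
n_kv_heads, head_dim, bytes_per_value = 8, 128, 2

def kv_cache_per_layer_mb(context_len):
    # The factor 2 accounts for storing both keys and values.
    return 2 * context_len * n_kv_heads * head_dim * bytes_per_value / 2**20

for ctx in (512, 2_048, 16_384, 65_536):
    print(f"{ctx:>6} tokens: {kv_cache_per_layer_mb(ctx):6.1f} MB per layer per sequence")
# At 65,536 tokens, each layer swapped to linear attention frees ~256 MB per sequence.
```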

We stop the iteration when:

  • Performance falls below an acceptable threshold, or
  • All layers have been replaced with linear attention (rare in practice, but ideal).

Each candidate replacement needs only a forward pass over the validation set; the whole procedure requires no backpropagation and avoids costly joint fine-tuning.
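Putting the loop and the stopping criterion together, here is a minimal sketch of the greedy search. The interfaces `model.layers`, `linear_blocks`, `evaluate`, and the `max_drop` threshold are illustrative assumptions, not the actual implementation.

```python
def greedy_layer_replacement(model, linear_blocks, evaluate, max_drop=0.01):
    """Greedy, validation-guided replacement of full-attention layers (sketch).

    `linear_blocks[i]` is the distilled counterpart of layer i and
    `evaluate(model)` returns the validation score on the target task.
    Only forward evaluations are needed; no weights are updated.
    """
    base_score = evaluate(model)
    remaining = set(range(len(model.layers)))
    while remaining:
        best_layer, best_score = None, float("-inf")
        for i in remaining:                        # try each remaining layer
            original = model.layers[i]
            model.layers[i] = linear_blocks[i]     # tentative swap
            score = evaluate(model)
            model.layers[i] = original             # undo the swap
            if score > best_score:
                best_layer, best_score = i, score
        if best_score < base_score - max_drop:     # stop: drop is unacceptable
            break
        model.layers[best_layer] = linear_blocks[best_layer]  # lock in the safest swap
        remaining.remove(best_layer)
    return model
```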

Result

The table below compares the base model (Llama-3.1-8B-Instruct) with the best hybrid variants we built by replacing full-attention layers with linear alternatives: Gated Linear Attention (GLA), Gated DeltaNet (GDN), and Jet-Block (JET). For each task, we report:

  • P_base: performance of the original model,
  • P_best: performance of the best hybrid,
  • #Rep: how many layers were replaced with linear attention.

Table 1: Performance of the base model (Llama-3.1-8B-Instruct) and the best searched hybrid models with replaced linear blocks (GLA, GDN, and JET) across tasks.

Surprisingly, in most cases, the hybrid models match or even outperform the original—despite being faster and more memory-efficient. At first glance, this seems counterintuitive: how can a “lighter” model do better?

However, careful analysis reveals a plausible explanation: full-attention layers often contain redundant or task-irrelevant information across the model depth. In contrast, the linear attention blocks obtained by block-wise local distillation from the pretrained backbone can encode distilled, task-adapted representations that are, in some cases, more effective for specific downstream tasks.

By strategically interleaving full-attention and linear attention blocks through greedy, validation-guided layer replacement, our method constructs a hybrid architecture that leverages the complementary strengths of both mechanisms, thereby preserving or even enhancing performance on the target task.

Throughput comparison

The figure below shows that the decoding throughput increases with the number of linear layers, and greater speedups are observed at longer context lengths.

Figure 3: Throughput comparison under context lengths of 512, 2,048, 16,384, and 65,536. The numbers above the points indicate the speedup relative to the base full-attention model.

Layer replacement trajectories

We tested our greedy layer replacement strategy across different base models, linear attention variants, and downstream tasks. Figure 4 below indicates that the order in which layers can be safely replaced varies—revealing that knowledge isn’t evenly spread across the model, and different tasks rely on different layers.

Figure 4: Layer replacement trajectories on PubMedQA (PM) and CommonsenseQA (CQ) using Qwen2.5-1.5B and Llama3.2-3B-Instruct (28 layers each) with linear attention variants: Gated Linear Attention (GLA) and Jet-Block (JET).

This confirms that there is no one-size-fits-all hybrid model: you need a task-specific architecture.

Surprisingly, though, the replacement trajectories were very similar across different linear attention types. This suggests that the optimal placement order of linear attention blocks is primarily governed by the base model architecture and task characteristics, rather than the specific design of the linear attention mechanism itself. In other words: where to replace matters far more than what you replace it with.

Experimental cost

We measured the total runtime of our method on a single NVIDIA A800 GPU using the PubMedQA dataset and various base models. Even in the worst case—where we greedily replace all attention layers—the entire process finishes in just a few hours (Table 2).

Table 2: GPU hours across base models during BLD and greedy replacement.

The data requirement is minimal:

  • About 100M general-domain tokens for distilling linear blocks,
  • A small task-specific validation set for layer replacement selection.

This means you can turn any pretrained model into a task-optimized hybrid quickly, cheaply, and without large-scale infrastructure.

Use cases

OpenDRIVE is widely used in autonomous driving simulation, and OpenDRIVE maps often suffer from missing road segments caused by occlusion or invisibility in traffic video capture, which significantly hinders downstream planning and verification tasks. We localize the missing roads by matching the OpenDRIVE road network against OpenStreetMap. Then, an LLM-based framework built on the Qwen3-0.6B model completes the missing roads. Our fine-tuned Qwen3-0.6B model achieves a 98.17% success rate, generating high-precision OpenDRIVE files. Furthermore, a hybrid model is built with the proposed block-wise local distillation (“cloning” full-attention blocks into linear attention blocks) and the greedy search strategy that replaces layers one by one. Experimental results demonstrate that 13 of the 28 full-attention layers can be replaced with linear layers, yielding a 1.9x throughput speedup without accuracy loss. This showcases our efficient hybrid model construction technology successfully applied to the OpenDRIVE missing-road completion task.

Conclusion

We’ve introduced a lightweight, practical framework for building task-specific hybrid models that smartly combine full and linear attention. By first distilling linear counterparts layer-by-layer and then greedily replacing layers based on validation performance, we avoid costly pretraining or architecture search.

Crucially, our approach is both backbone-agnostic and task-general: it can be seamlessly applied to any pretrained full-attention model and adapted to diverse downstream tasks with minimal overhead. We believe this paradigm offers a scalable and deployment-friendly pathway toward efficient adaptation of foundation models, particularly in resource-constrained or latency-sensitive scenarios.

Although our approach can improve inference efficiency while largely preserving or even improving model performance on specific tasks, combining it with LLM distillation can further enhance the resulting model. LLM distillation allows the optimized hybrid models to better absorb knowledge from large teacher models, leading to improved generalization and task performance. We refer readers to our concurrent work on scheduled checkpoint distillation for domain-specific LLMs for more details: Following the Teacher’s Footsteps: Scheduled Checkpoint Distillation for Domain-Specific LLMs

References

  • Press Release
  • Distill-then-Replace: Efficient Task-Specific Hybrid Attention Model Construction
  • Following the Teacher’s Footsteps: Scheduled Checkpoint Distillation for Domain-Specific LLMs