
Hello. We are Cheng Feng, Chaoliang Zhong, and Jun Sun from Fujitsu Research and Development Center (FRDC), and Yusuke Oishi from the AI Laboratory in FRJ. We are excited to share our latest research on domain-specific Large Language Model (LLM) distillation.
Overview
Although transformer-based LLMs have achieved strong performance across various tasks, two key limitations remain for domain-specific applications: suboptimal zero-shot accuracy and prohibitive model size for deployment, particularly in resource-constrained environments such as Edge AI, Embodied AI, TinyML, and Smart Home applications.
To enhance domain performance, a common approach is to fine-tune a base model on domain-specific data via Supervised Fine-Tuning (SFT). As illustrated in Figure 1, domain SFT significantly improves both the Llama 3.1 8B and Llama 3.2 3B models, e.g., raising Llama 3.1 8B from 0.462 to 0.773 and Llama 3.2 3B from 0.333 to 0.655.
However, a trade-off generally exists between model size and performance: Llama 3.1 8B delivers notably better performance than Llama 3.2 3B, yet requires far more memory (16.8 GB vs. 6.4 GB) and runs more slowly at inference (42.0 vs. 57.4 tokens/second). To bridge this gap, traditional distillation uses the larger model as a teacher to boost the smaller model's performance.
Regrettably, due to the capacity gap between student and teacher, the distilled student still lags behind the teacher: for example, distilling Llama 3.2 3B improves its SFT score from 0.655 to 0.727, versus the teacher's 0.773.

To this end, we develop a general LLM distillation method that narrows the teacher-student gap globally, as shown in Figure 1, and moreover enables the student to surpass the teacher on about 40% of the sub-tasks, making distillation ideal for resource-limited AI applications.
For certain mission-critical use cases, such as sales negotiation prediction with CRM data, there is a stringent demand for local and secure access to models and data. Additionally, these scenarios often feature exceptionally long contexts, which traditional models such as XGBoost are unable to handle. Consequently, lightweight LLMs become essential for local deployment and long-context reasoning. Our LLM distillation method is well suited to these use cases.
Method
Insight 1: Learning "how" the teacher learns is more important than learning "what" the teacher has learned.
Traditional Distillation: The student performs SFT while learning to replicate the teacher's output, i.e., learning "what" the teacher has learned as shown in Figure 2.

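For readers who prefer code, the following is a minimal sketch of this traditional setup, assuming a standard PyTorch loss that mixes SFT cross-entropy on the domain labels with a KL term pulling the student's token distribution toward the fixed, fully fine-tuned teacher. The mixing weight alpha and the temperature are illustrative hyper-parameters, not our actual training configuration.

```python
import torch.nn.functional as F

def traditional_distillation_loss(student_logits, teacher_logits, labels,
                                  alpha=0.5, temperature=2.0):
    """Illustrative loss: SFT cross-entropy + KL divergence to the final teacher.

    student_logits, teacher_logits: [batch, seq_len, vocab]
    labels: [batch, seq_len], with -100 marking ignored positions.
    """
    t = temperature
    vocab = student_logits.size(-1)

    # "What" the teacher has learned: match the teacher's output distribution.
    kd = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1).view(-1, vocab),
        F.softmax(teacher_logits / t, dim=-1).view(-1, vocab),
        reduction="batchmean",
    ) * (t * t)

    # Standard SFT cross-entropy on the domain labels.
    ce = F.cross_entropy(student_logits.view(-1, vocab), labels.view(-1),
                         ignore_index=-100)

    return alpha * kd + (1.0 - alpha) * ce
```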
Checkpoint Distillation: The student mimics the teacher's training trajectory by extending traditional distillation to step-by-step distillation, as shown in Figure 3.

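For contrast, a minimal sketch of this naive step-by-step alignment is given below, assuming the teacher's intermediate SFT checkpoints have been saved and that models follow a Hugging Face-style interface returning logits; the helper names and loop structure are illustrative, not the released baseline code. The teacher checkpoints are simply visited in their original training order, regardless of how hard each one is for the current student to follow.

```python
import torch

def naive_checkpoint_distillation(student, teacher_checkpoints, data_loader,
                                  optimizer, distill_loss_fn):
    """Naive step-by-step alignment: walk the teacher's saved checkpoints in
    their original training order and distill against each one in turn."""
    student.train()
    for teacher in teacher_checkpoints:          # fixed, in-order schedule
        teacher.eval()
        for batch in data_loader:
            with torch.no_grad():
                teacher_logits = teacher(**batch).logits
            student_logits = student(**batch).logits
            loss = distill_loss_fn(student_logits, teacher_logits, batch["labels"])
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return student
```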
However, Checkpoint Distillation adopts a naive step-by-step alignment strategy, which accounts for neither the principles that should guide the student's learning nor the optimal strategy for selecting teacher checkpoints.
Our Distillation: Based on theoretical analysis, we identify a first principle for optimal teacher selection: balance teacher performance against student learning difficulty. The closer the teacher is to the student, the easier it is for the student to mimic; however, such a teacher generally exhibits inferior performance. As the student's performance progressively improves, the teacher it requires accordingly shifts toward the optimal one. Guided by this principle, we propose our method as shown in Figure 4.

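One way to make this selection principle concrete is sketched below: among the saved teacher checkpoints, pick the weakest one that still leads the current student by at least a small margin, so the imitation target stays learnable and automatically moves toward the fully fine-tuned teacher as the student improves. The margin, the validation scores, and the checkpoint names are assumptions for illustration; the exact schedule in our method is derived from the theoretical analysis in the paper.

```python
def select_teacher_checkpoint(student_score, checkpoint_scores, margin=0.02):
    """Pick which teacher checkpoint to distill from at the current stage.

    student_score: current validation score of the student.
    checkpoint_scores: dict mapping checkpoint name -> validation score.
    margin: assumed minimum lead the selected teacher should have.
    """
    # Keep only checkpoints that are meaningfully ahead of the student.
    ahead = {name: score for name, score in checkpoint_scores.items()
             if score >= student_score + margin}
    if not ahead:
        # The student has caught up everywhere: use the strongest teacher.
        return max(checkpoint_scores, key=checkpoint_scores.get)
    # Among checkpoints still ahead, choose the closest one (easiest to mimic).
    return min(ahead, key=ahead.get)

# A weak student first follows an early checkpoint, a strong one the final teacher.
scores = {"ckpt_1k": 0.55, "ckpt_2k": 0.68, "ckpt_final": 0.77}
print(select_teacher_checkpoint(0.40, scores))  # -> ckpt_1k
print(select_teacher_checkpoint(0.70, scores))  # -> ckpt_final
```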
Insight 2: Selective imitation that preserves the student's strengths enables the distilled student to surpass its teacher.
Imitation alone is not enough for the student to outperform the teacher. We find that, on domain data, both teacher and student exhibit inherent strengths in distinct subdomains, observable from the post-SFT models.
As illustrated in Figure 5 and Figure 6, we seek to preserve the student's inherent strengths so that it can ultimately surpass the teacher. Building on a sample-wise adaptive weight (AW) derived from the domain-specific SFT student and teacher models, we implement an instance-level AW mechanism.


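A minimal sketch of what such an instance-level adaptive weight could look like is shown below: for each training sample, the per-sample losses of the already fine-tuned (SFT) teacher and student are compared, and the imitation term is down-weighted wherever the SFT student is already stronger, so the student's own strengths are preserved. The sigmoid form of the weight and the tau temperature are illustrative assumptions; the exact AW formulation is defined in the paper.

```python
import torch

def adaptive_weight(sft_student_loss, sft_teacher_loss, tau=1.0):
    """Per-sample weight on the imitation (distillation) term.

    sft_student_loss, sft_teacher_loss: per-sample losses of the domain-SFT
    student and teacher on the same instances, shape [batch].
    Returns weights in (0, 1): near 1 where the teacher is clearly better
    (teacher-favored subdomain), near 0 where the SFT student already wins
    (student-favored subdomain).
    """
    gap = sft_student_loss - sft_teacher_loss   # > 0 where the teacher is better
    return torch.sigmoid(gap / tau)

def weighted_distillation_loss(kd_per_sample, ce_per_sample, aw):
    """Blend per-sample imitation (kd) and SFT (ce) losses using the weight."""
    return (aw * kd_per_sample + (1.0 - aw) * ce_per_sample).mean()

# Example: the teacher is better on sample 0, the student on sample 1.
aw = adaptive_weight(torch.tensor([2.0, 0.8]), torch.tensor([1.2, 1.5]))
print(aw)  # approx. [0.69, 0.33]: imitate on sample 0, trust the student on 1
```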
Results
We evaluate on two domain-specific benchmarks in distinct languages: PubMedQA (English) and JMED-LLM (Japanese Medical Evaluation Dataset for Large Language Models, https://github.com/sociocom/JMED-LLM).
Teacher model: Llama-3.1-8B-Instruct
Student model: Llama-3.2-3B-Instruct

As shown in Table 1, our proposed methods consistently outperform existing distillation techniques. SCD (Scheduled Checkpoint Distillation) achieves competitive performance (Avg: 0.742), while SCD w/AW demonstrates superior effectiveness (Avg: 0.763), excelling in key tasks such as JMMLU (+4.7% vs. CD) and NRNER Partial F1 (+6.9% vs. the best baseline). Notably, SCD w/AW enables the student to surpass the teacher in multiple tasks: it outperforms the teacher SFT model on both NRNER metrics (Exact F1: 0.711 vs. 0.667, Partial F1: 0.944 vs. 0.932) and matches or exceeds teacher performance on SMDIS (0.986 vs. 0.985).
Conclusion
In this blog, we explore a fundamental question: when and how can a student model match or surpass its teacher in domain-specific LLM distillation? Through theoretical analysis, we establish that this is achievable when the student's advantage on the student-favored subdomain (SFS) outweighs its deficit on the teacher-favored subdomain (TFS). Guided by this insight, we introduce SCD to systematically reduce the TFS deficit by emulating the teacher's training trajectory, and AW to preserve the student's inherent strengths on the SFS. Extensive experiments across diverse multilingual tasks show that our method surpasses existing distillation techniques and often allows the student model to match or even exceed the teacher's performance. Our work provides both a theoretical foundation and a practical framework for developing efficient, high-performance domain-specific language models.
While LLM distillation focuses on transferring knowledge from a large teacher model to a compact student model, the resulting student architecture can still be further optimized for inference efficiency. Such architecture optimization is complementary to LLM distillation and can be applied on top of distilled models to further reduce inference cost. We refer readers to our concurrent work on efficient task-specific hybrid attention model construction for more details: Distill-then-Replace: Efficient Task-Specific Hybrid Attention Model Construction
References
Following the Teacher's Footsteps: Scheduled Checkpoint Distillation for Domain-Specific LLMs
Distill-then-Replace: Efficient Task-Specific Hybrid Attention Model Construction