1 Jahangirnagar University, Dhaka, Bangladesh.
2 Uttara University, Dhaka, Bangladesh.
3 Daffodil International University, Dhaka, Bangladesh.
International Journal of Science and Research Archive, 2025, 16(03), 314–321
Article DOI: 10.30574/ijsra.2025.16.3.2569
Received on 29 July 2025; revised on 06 September 2025; accepted on 08 September 2025
Knowledge distillation (KD) is a widely used method for condensing large language models (LLMs) into smaller, faster, more efficient versions without sacrificing performance. However, as LLMs grow to hundreds of billions of parameters, running distillation on a single device is no longer feasible; it is simply too computationally demanding. In this paper, we investigate how to leverage multi-GPU configurations to make KD scale. To overcome communication bottlenecks and accelerate training, our method combines adaptive gradient compression with tensor, pipeline, and data parallelism. Tested on transformer-based LLMs, our framework maintains strong accuracy on both understanding and generation tasks, reduces communication overhead by 27%, and trains up to 3.4× faster than single-GPU baselines.
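As a concrete illustration of the distillation objective summarized in the abstract, the sketch below shows a minimal PyTorch-style KD training loss that blends a temperature-softened teacher/student KL term with standard cross-entropy. The temperature, loss weighting, and tensor shapes are illustrative assumptions, not the paper's exact configuration, and the parallelism and gradient-compression machinery of the full framework is omitted.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Combine a soft-target KL loss (teacher -> student) with hard-label cross-entropy.
    Hyperparameters are illustrative defaults, not the paper's reported settings."""
    # Soft targets: teacher and student distributions softened by the temperature.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # Scale by temperature^2 to keep gradient magnitudes comparable across temperatures.
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    # Hard targets: standard cross-entropy against ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

# Example step with random tensors standing in for teacher and student outputs.
batch, vocab = 4, 32000
student_logits = torch.randn(batch, vocab, requires_grad=True)
teacher_logits = torch.randn(batch, vocab)
labels = torch.randint(0, vocab, (batch,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()

In a multi-GPU setting, a loss of this form would be computed per data-parallel shard, with tensor and pipeline parallelism partitioning the teacher and student models themselves and gradient compression applied before all-reduce; those components are specific to the paper's framework and are not reproduced here.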
Knowledge Distillation (KD); Large Language Models (LLMs); Model Compression; Multi-GPU Systems; Distributed Training; Hybrid Parallelism (Data, Tensor, Pipeline); Gradient Compression
Wary Hossain Rabby, A.S.S.M.Q-E-Elahy, Gias Uddin, Emran Sikder, Rafiqul Islam and Hasibul Islam. Scalable knowledge distillation for large language models on multi-GPU systems. International Journal of Science and Research Archive, 2025, 16(03), 314–321. Article DOI: https://doi.org/10.30574/ijsra.2025.16.3.2569.
Copyright © 2025. The author(s) retain the copyright of this article. This article is published under the terms of the Creative Commons Attribution License 4.0.