About
I am a third-year Ph.D. student in the Microelectronics Thrust at the Hong Kong University of Science and Technology (Guangzhou), supervised by Prof. Jiayi Huang and Prof. Xiaowen Chu. My research interests lie in ML Systems, Efficient AI, and HPC, with a particular focus on system-algorithm co-design for Mixture-of-Experts models.
Awards
2025 - 🏅 HKUST(GZ) FUNH Dean's Award for PhD Research Excellence 2025/26, Merit Prize
2021 - 🥇 The 5th "Sunway Cup" China Parallel Application Challenge on Domestic CPU (CPC2021), National Champion
2021 - 🥉 The 9th "Intel Cup" Parallel Application Challenge (PAC2021), Third Prize
Publications
[MLSys '26] FlexTrain: Scalable Hybrid-Parallel Training with Elastic Resource Utilization and Consistent Accuracy
Weilin Cai*, Diandian Gu*, Baoquan Zhong*, Jun Wang*, Zhuolin Zheng*, Gaohong Liu, Kaihua Jiang, Shuguang Wang, Wencong Xiao, Jiayi Huang (*: Equal contribution)
Ninth Annual Conference on Machine Learning and Systems, May 2026.
[ICLR '26] Capacity-Aware Inference: Mitigating the Straggler Effect in Mixture of Experts
Shwai He, Weilin Cai, Jiayi Huang, Ang Li
The Fourteenth International Conference on Learning Representations, April 2026.
[MICRO '25] Optimizing All-to-All Collective Communication with Fault Tolerance on Torus Networks
Le Qin, Junwei Cui, Weilin Cai, Meng Niu, Yan Yang, Jiayi Huang
IEEE/ACM International Symposium on Microarchitecture, October 2025.
[ICML '25] Shortcut-Connected Expert Parallelism for Accelerating Mixture of Experts
Weilin Cai, Juyong Jiang, Le Qin, Junwei Cui, Sunghun Kim, Jiayi Huang
Forty-Second International Conference on Machine Learning, July 2025.
Impact: Adopted in Meituan's LongCat-Flash language model
[ISCA '25] Chimera: Communication Fusion for Hybrid Parallelism in Large Language Models
Le Qin, Junwei Cui, Weilin Cai, Jiayi Huang
ACM/IEEE International Symposium on Computer Architecture, June 2025.
Distinguished Artifact Award
[ASPLOS '25] MoC-System: Efficient Fault Tolerance for Sparse Mixture-of-Experts Model Training
Weilin Cai, Le Qin, Jiayi Huang
ACM International Conference on Architectural Support for Programming Languages and Operating Systems, April 2025.
[TKDE '25] A Survey on Mixture of Experts in Large Language Models
Weilin Cai*, Juyong Jiang*, Fan Wang*, Jing Tang, Sunghun Kim, Jiayi Huang (*: Equal contribution)
IEEE Transactions on Knowledge and Data Engineering, 37(7):3896–3915, March 2025.
[NPC '22] Flexible Supervision System: A Fast Fault-Tolerance Strategy for Cloud Applications in Cloud-Edge Collaborative Environments
Weilin Cai, Heng Chen, Zhimin Zhuo, Ziheng Wang, Ninggang An
IFIP International Conference on Network and Parallel Computing, 2022.
[TJSC '22] Implementation and Optimization of ChaCha20 Stream Cipher on Sunway TaihuLight Supercomputer
Weilin Cai, Heng Chen, Ziheng Wang, Xingjun Zhang
The Journal of Supercomputing, pp. 4199–4216, Springer, 2022.
Preprints
[ArXiv '25] DualSparse-MoE: Coordinating Tensor/Neuron-Level Sparsity with Expert Partition and Reconstruction
Weilin Cai, Le Qin, Shwai He, Junwei Cui, Ang Li, Jiayi Huang
arXiv preprint arXiv:2508.18376. 2025.
[ArXiv '25] Accelerating Mixture-of-Experts Inference by Hiding Offloading Latency with Speculative Decoding
Zhibin Wang, Zhonghui Zhang, Yuhang Zhou, Zibo Wang, Mo Zhou, Peng Jiang, Weilin Cai, Chengying Huan, Rong Gu, Sheng Zhong, Chen Tian
arXiv preprint arXiv:2508.21706. 2025.
