The Machine Learning Systems (ML Sys) course focuses on integrating algorithms, system software, and hardware to build efficient and scalable machine learning services. It covers the evolution of machine learning architectures, including GPUs and specialized AI accelerators, and explores optimization techniques such as CUDA programming, parallel training, efficient inference, and model optimization. The course examines the practical challenges of large-scale neural network development, training, and inference, emphasizing a holistic view of communication, storage, and computing optimization together with parallelism. Ethical considerations, including privacy, safety, and sustainability, are also addressed. Through hands-on programming and project work, students gain real-world experience with ML systems, from design to deployment, and learn to build machine learning systems that are efficient, scalable, and reliable for fast-evolving and diverse AI applications.
| Week | Topic | Materials |
|---|---|---|
| 1 | Introduction: The Intellectual Map of Machine Learning Systems | Book Chapter 1, Slides |
| 2 | CPU Foundations and GPU Emergence | Book Chapter 2, Slides |
| 3 | GPU Architecture for Machine Learning Systems | Book Chapter 3, Slides |
| 4 | CUDA Programming Begins with Architecture | Book Chapter 4, Slides |
| 5 | TBD | TBD |
| 6 | TBD | TBD |
| 7 | TBD | TBD |
| 8 | TBD | TBD |
| 9 | TBD | TBD |
| 10 | TBD | TBD |
| 11 | TBD | TBD |
| 12 | TBD | TBD |
| 13 | TBD | TBD |
| 14 | TBD | TBD |
| 15 | TBD | TBD |
| 16 | TBD | TBD |
Students will complete a course project on the design, optimization, or analysis of machine learning systems. Projects may involve GPU programming, distributed training systems, model efficiency techniques, or system–hardware co-design.
For course-related questions, please contact the instructor.