Survey & Book


  • Toward Highly Available, Intelligent Cloud and ML Systems [Slide]
  • A curated list of awesome System Designing articles, videos and resources for distributed computing, AKA Big Data. [GitHub]
  • awesome-production-machine-learning: A curated list of awesome open source libraries to deploy, monitor, version and scale your machine learning [GitHub]
  • Opportunities and Challenges Of Machine Learning Accelerators In Production [Paper]
    • Ananthanarayanan, Rajagopal, et al. 2019 USENIX Conference on Operational Machine Learning (OpML 19). 2019.
  • How (and How Not) to Write a Good Systems Paper [Advice]
  • Applied machine learning at Facebook: a datacenter infrastructure perspective [Paper]
    • Hazelwood, Kim, et al. (HPCA 2018)
  • Infrastructure for Usable Machine Learning: The Stanford DAWN Project
    • Bailis, Peter, Kunle Olukotun, Christopher Ré, and Matei Zaharia. (preprint 2017)
  • Hidden technical debt in machine learning systems [Paper]
    • Sculley, David, et al. (NIPS 2015)
  • End-to-end arguments in system design [Paper]
    • Saltzer, Jerome H., David P. Reed, and David D. Clark.
  • System Design for Large Scale Machine Learning [Thesis]
  • Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations and Hardware Implications [Paper]
    • Park, Jongsoo, Maxim Naumov, Protonu Basu et al. arXiv 2018
    • Summary: This paper characterizes DL models and then presents new design principles for DL hardware.
  • A Berkeley View of Systems Challenges for AI [Paper]


  • Computer Architecture: A Quantitative Approach [Must read]
  • Streaming Systems [Book]
  • Kubernetes in Action (start to read) [Book]
  • Machine Learning Systems: Designs that scale [Website]
  • Trust in Machine Learning [Website]
  • Automated Machine Learning in Action [Website]