
Machine Learning Infrastructure

Frameworks, infrastructure, and useful toolkits (e.g., visualization) for training, inference, or both. See the [AI infrastructures list] for more.


  • The Case for Learning-and-System Co-design [Paper]
    • Liang, C.-J. M., Xue, H., Yang, M. and Zhou, L., 2019.
    • ACM SIGOPS Operating Systems Review, 53(1), pp.68-74.
    • Summary: Makes the system learnable. Proposes a framework named AutoSys that contains both a training plane and an inference plane.

The following three papers apply to Big Data as well as ML, and they are too good to be ignored.

  • Large-scale cluster management at Google with Borg [Paper]
    • Verma, Abhishek, et al.
    • Proceedings of the Tenth European Conference on Computer Systems. 2015.
  • Apache Hadoop YARN: Yet another resource negotiator [Paper]
    • Vavilapalli, Vinod Kumar, et al.
    • Proceedings of the 4th annual Symposium on Cloud Computing. 2013.
  • Mesos: A platform for fine-grained resource sharing in the data center [Paper]
    • Hindman, Benjamin, et al.
    • NSDI. 2011.

ML Platform

  • Submitit is a lightweight tool for submitting Python functions for computation within a Slurm cluster. [GitHub]
  • Jittor is a high-performance deep learning framework based on JIT compiling and meta-operators. [GitHub]
  • MindSpore is a new open source deep learning training/inference framework that can be used for mobile, edge, and cloud scenarios. [GitHub]
  • MegEngine is a fast, scalable, and easy-to-use numerical evaluation framework with auto-differentiation. [GitHub]
  • cortexlabs/cortex: Deploy machine learning applications without worrying about setting up infrastructure, managing dependencies, or orchestrating data pipelines. [GitHub]
  • Osquery is a SQL-powered operating system instrumentation, monitoring, and analytics framework. [Facebook Project]
  • Kubeflow: A machine learning (ML) toolkit dedicated to making deployments of ML workflows on Kubernetes simple, portable, and scalable. [GitHub]
  • Polyaxon: A platform for reproducible and scalable machine learning and deep learning on Kubernetes. [GitHub]
  • MLOps on Azure [GitHub]
  • Flame: An ML framework to accelerate research and its path to production. [GitHub]
  • Ludwig is a toolbox built on top of TensorFlow that lets you train and test deep learning models without writing code. [GitHub]
  • intel-analytics/analytics-zoo: Distributed TensorFlow, Keras, and BigDL on Apache Spark [GitHub]
  • Machine Learning for .NET [GitHub]
    • ML.NET is a cross-platform open-source machine learning framework which makes machine learning accessible to .NET developers.
    • ML.NET allows .NET developers to develop their own models and infuse custom machine learning into their applications, using .NET, even without prior expertise in developing or tuning machine learning models.
  • ONNX: Open Neural Network Exchange [GitHub]
  • ONNX Runtime: Has an open architecture that is continually evolving to address the newest developments and challenges in AI and deep learning. It stays up to date with the ONNX standard, supporting all ONNX releases with future compatibility while maintaining backwards compatibility with prior releases. [GitHub]
  • BentoML: Machine Learning Toolkit for packaging and deploying models [GitHub]
  • EuclidesDB: A multi-model machine learning feature embedding database [GitHub]
  • Prefect: Prefect is a new workflow management system, designed for modern infrastructure and powered by the open-source Prefect Core workflow engine. [GitHub]
  • MindsDB: MindsDB's goal is to make it very simple for developers to use the power of artificial neural networks in their projects. [GitHub]
  • PAI: OpenPAI is an open source platform that provides complete AI model training and resource management capabilities. [Microsoft Project]
  • Bistro: Scheduling Data-Parallel Jobs Against Live Production Systems [Facebook Project]
  • GNES (Generic Neural Elastic Search) is a cloud-native semantic search system based on deep neural networks. [GitHub]

GPU Sharing

  • Yu, P. and Chowdhury, M., 2019. Salus: Fine-Grained GPU Sharing Primitives for Deep Learning Applications. arXiv preprint arXiv:1902.04610. [Paper] [GitHub]
  • gpushare-scheduler-extender [GitHub]
    • More and more data scientists run their NVIDIA GPU-based inference tasks on Kubernetes. Some of these tasks can run on the same NVIDIA GPU device to increase GPU utilization, so one important challenge is how to share GPUs between pods.
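
The placement problem described above (several inference pods sharing one GPU) can be illustrated with a minimal best-fit sketch in Python. It is not how gpushare-scheduler-extender or Salus is implemented; the `Gpu` class, the `place_pod` function, the pod names, and the memory figures are all hypothetical and chosen only for illustration. It assumes each pod declares a GPU memory request in MiB and that a GPU can host any set of pods whose requests fit within its capacity.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Gpu:
    """One physical GPU with a fixed memory capacity in MiB (hypothetical model)."""
    name: str
    total_mib: int
    pods: Dict[str, int] = field(default_factory=dict)  # pod name -> requested MiB

    @property
    def free_mib(self) -> int:
        # Memory not yet promised to any pod.
        return self.total_mib - sum(self.pods.values())


def place_pod(gpus: List[Gpu], pod: str, request_mib: int) -> Optional[str]:
    """Best-fit placement: choose the fullest GPU that can still hold the pod.

    Returns the chosen GPU name, or None if no GPU has enough free memory.
    """
    candidates = [g for g in gpus if g.free_mib >= request_mib]
    if not candidates:
        return None
    target = min(candidates, key=lambda g: g.free_mib)
    target.pods[pod] = request_mib
    return target.name


# Illustrative cluster: two GPUs and four inference pods (made-up numbers).
gpus = [Gpu("gpu0", 16000), Gpu("gpu1", 16000)]
requests = [("infer-a", 6000), ("infer-b", 6000), ("infer-c", 6000), ("infer-d", 12000)]
placements = {pod: place_pod(gpus, pod, mib) for pod, mib in requests}

for pod, gpu_name in placements.items():
    print(pod, "->", gpu_name or "unschedulable")
```

Best-fit packs pods onto already-used devices first, which raises utilization but can leave a large request such as `infer-d` unschedulable once memory is fragmented across GPUs. A real scheduler would also need to account for compute contention and isolation, which this sketch ignores.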

Useful Tools


  • Performance issues analysis (Off-CPU) [Website]
  • Collective Knowledge repository to automate MLPerf - a broad ML benchmark suite for measuring performance of ML software frameworks, ML hardware accelerators, and ML cloud platforms [GitHub]
  • NetworKit is a growing open-source toolkit for large-scale network analysis. [GitHub]
  • gpu-sentry: Flask-based package for monitoring utilisation of NVIDIA GPUs. [GitHub]
  • anderskm/gputil: A Python module for getting the GPU status from NVIDIA GPUs programmatically using nvidia-smi. [GitHub]
  • Pytorch-Memory-Utils: track your GPU memory usage during training with PyTorch. [GitHub]
  • torchstat: a lightweight neural network analyzer based on PyTorch. [GitHub]
  • NVIDIA GPU Monitoring Tools [GitHub]
  • PyTorch/cpuinfo: cpuinfo is a library that detects information about the host CPU that is essential for performance optimization. [GitHub]
  • Popular Network memory consumption and FLOP counts [GitHub]
  • Intel® VTune™ Amplifier [Website]
    • Stop guessing why software is slow. Advanced sampling and profiling techniques quickly analyze your code, isolate issues, and deliver insights for optimizing performance on modern processors.
  • Pyflame: A Ptracing Profiler For Python [GitHub]


  • Facebook AI Performance Evaluation Platform [GitHub]
  • Netron: Visualizer for deep learning and machine learning models [GitHub]
  • Facebook/FBGEMM: FBGEMM (Facebook GEneral Matrix Multiplication) is a low-precision, high-performance matrix-matrix multiplication and convolution library for server-side inference. [GitHub]
  • Dslabs: Distributed Systems Labs and Framework for UW system course [GitHub]
  • Machine Learning Model Zoo [Website]
  • Faiss: A library for efficient similarity search and clustering of dense vectors [GitHub]
  • Microsoft/MMdnn: A comprehensive, cross-framework solution to convert, visualize and diagnose deep neural network models. [GitHub]
  • Example recipes for Kubernetes Network Policies that you can just copy and paste [GitHub]
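
Several of the GPU monitoring tools listed above get their data from nvidia-smi. As a rough sketch of that approach (not the actual code of gputil or gpu-sentry), the Python below parses the CSV output of `nvidia-smi --query-gpu=... --format=csv,noheader,nounits`. The helper names `parse_gpu_csv` and `query_gpus` are hypothetical, and the `SAMPLE` text is made-up illustrative output, not a real measurement.

```python
import csv
import io
import subprocess
from typing import Dict, List

# Fields requested from nvidia-smi, in the order they appear in each CSV row.
QUERY_FIELDS = ["index", "name", "utilization.gpu", "memory.used", "memory.total"]


def parse_gpu_csv(text: str) -> List[Dict[str, object]]:
    """Parse 'csv,noheader,nounits' output into one dict per GPU."""
    rows = []
    for rec in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not rec:
            continue  # skip blank lines
        index, name, util, used, total = (f.strip() for f in rec)
        rows.append({
            "index": int(index),
            "name": name,
            "util_percent": float(util),
            "memory_used_mib": float(used),
            "memory_total_mib": float(total),
        })
    return rows


def query_gpus() -> List[Dict[str, object]]:
    """Run nvidia-smi and parse its output (requires an NVIDIA driver)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)


# Made-up sample output so the parser can be tried on a machine without a GPU.
SAMPLE = "0, Example GPU A, 35, 1024, 16160\n1, Example GPU B, 0, 3, 16160\n"
status = parse_gpu_csv(SAMPLE)
for gpu in status:
    print(gpu["index"], gpu["name"], gpu["util_percent"], gpu["memory_used_mib"])
```

On a machine with an NVIDIA driver installed, calling `query_gpus()` should return the same structure for the real devices. Keeping the parsing separate from the subprocess call makes the parser easy to test without hardware.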