Data analytics systems [W2-3]
MapReduce: Simplified Data Processing on Large Clusters
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Storm @Twitter
PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs
NoSQL and Distributed storage [W4]
Bigtable: A Distributed Storage System for Structured Data
Dynamo: Amazon’s Highly Available Key-value Store
Cluster management [W5-6]
Apache Hadoop YARN: Yet Another Resource Negotiator
Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
Dominant Resource Fairness: Fair Allocation of Multiple Resource Types
Networking [W7-8]
A Scalable, Commodity Data Center Network Architecture
Data Center TCP (DCTCP)
Azure Accelerated Networking: SmartNICs in the Public Cloud
Efficient Coflow Scheduling with Varys
Machine learning systems [W9-12]
Scaling Distributed Machine Learning with the Parameter Server
TensorFlow: A System for Large-Scale Machine Learning
TVM: An Automated End-to-End Optimizing Compiler for Deep Learning
Rammer: Enabling Holistic Deep Learning Compiler Optimizations with rTasks
Tiresias: A GPU Cluster Manager for Distributed Deep Learning
Clipper: A Low-Latency Online Prediction Serving System
Serving DNNs like Clockwork: Performance Predictability from the Bottom Up
Data analytics systems
Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks
GraphX: Graph Processing in a Distributed Dataflow Framework
Spark SQL: Relational Data Processing in Spark
Distributed storage
The Google File System
Cassandra - A Decentralized Structured Storage System
Flat Datacenter Storage
f4: Facebook’s Warm BLOB Storage System
Cluster management
Borg, Omega, and Kubernetes
Large-Scale Cluster Management at Google with Borg
Tetris: Multi-Resource Packing for Cluster Schedulers
Networking I: Architecture
OpenFlow: Enabling Innovation in Campus Networks
The Road to SDN: An Intellectual History of Programmable Networks
Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google's Datacenter Network
Networking II: Performance
Andromeda: Performance, Isolation, and Velocity at Scale in Cloud Network Virtualization
pFabric: Minimal Near-Optimal Datacenter Transport
CONGA: Distributed Congestion-Aware Load Balancing for Datacenters
Expeditus: Congestion-Aware Load Balancing in Clos Data Center Networks
Congestion Control for Large-Scale RDMA Deployments
Machine learning systems
Ray: A Distributed Framework for Emerging AI Applications
A Configurable Cloud-Scale DNN Processor for Real-Time AI
PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems
Project Adam: Building an Efficient and Scalable Deep Learning Training System
Gandiva: Introspective Cluster Scheduling for Deep Learning
HiveD: Sharing a GPU Cluster for Deep Learning with Guarantees
AntMan: Dynamic Scaling on GPU Clusters for Deep Learning