Below you will find a list of the available topics. The literature given is intended as a starting point; further literature should be researched independently.
Topic Block 1: Data Infrastructure
- Distributed File Systems
- Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI'04: Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation, pages 137-150, Berkeley, CA, USA, 2004. USENIX Association.
- Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google File System. SIGOPS Oper. Syst. Rev., 37(5):29-43, 2003.
- Hadoop File System. http://hadoop.apache.org/common/docs/current/hdfs_design.html
- Cluster Scheduling
- Vavilapalli et al., Apache Hadoop YARN: Yet Another Resource Negotiator, Proceedings of the Symposium on Cloud Computing (SoCC), ACM, 2013, https://www.cs.cmu.edu/~garth/15719/papers/yarn.pdf
- Hindman et al., Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center, Technical Report, University of California, Berkeley, 2010
- Verma et al., Large-scale cluster management at Google with Borg, Proceedings of the European Conference on Computer Systems (EuroSys), ACM, 2015
- Kubernetes, https://kubernetes.io/
- Burns et al., Borg, Omega, and Kubernetes, ACM Queue, 2016
- Distributed Processing Engines (a MapReduce word-count sketch follows at the end of this topic block)
- Dean et al., MapReduce: Simplified Data Processing on Large Clusters, OSDI, 2004
- Zaharia et al., Spark: Cluster Computing with Working Sets, HotCloud, 2010
- Carbone et al., Apache Flink: Stream and Batch Processing in a Single Engine, IEEE Data Engineering Bulletin, 2015
- Rocklin, Dask: Parallel Computation with Blocked Algorithms and Task Scheduling, Proceedings of the 14th Python in Science Conference, 2015
- SQL Engines
- Stream Processing
- Cloud-Services for Data
- Storage and SQL Services
- Higher-Level Services
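
To make the MapReduce programming model from the Dean and Ghemawat papers above concrete, here is a minimal single-process Python sketch of the map, shuffle, and reduce phases. It is only an illustration under simplifying assumptions: a real engine partitions and distributes these steps across a cluster, and the names map_fn, shuffle, and reduce_fn are ours, not part of any framework.

    from collections import defaultdict
    from itertools import chain

    # map phase: emit (key, value) pairs for each input record
    def map_fn(document):
        for word in document.split():
            yield (word, 1)

    # shuffle phase: group all emitted values by key
    def shuffle(pairs):
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups.items()

    # reduce phase: combine the values of one key into a result
    def reduce_fn(key, values):
        return (key, sum(values))

    documents = ["the quick brown fox", "the lazy dog", "the fox"]
    pairs = chain.from_iterable(map_fn(d) for d in documents)
    counts = dict(reduce_fn(k, v) for k, v in shuffle(pairs))
    print(counts)  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, ...}

Spark, Flink, and Dask generalize this pattern to richer operator graphs, but the per-key grouping step remains the conceptual core.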
Topic Block 2: Data Science and Machine Learning
- Data Science Overview: Methods and Frameworks (a Scikit-Learn sketch follows at the end of this topic block)
- Peng, Matsui, The Art of Data Science, 2017
- Hey et al., The Fourth Paradigm: Data-Intensive Scientific Discovery, 2009
- Cao, Data Science: Challenges and Directions, Communications of the ACM, 2017
- Donoho, 50 Years of Data Science, 2015
- Scikit-Learn, 2017
- Machine Learning Frameworks (Github), 2017
- Machine Learning Best Practices
- Distributed Machine Learning (a parameter-server sketch follows at the end of this topic block)
- Dean et al., Large Scale Distributed Deep Networks, 2012
- Li et al., Scaling Distributed Machine Learning with the Parameter Server, OSDI, 2014
- Xing et al., Petuum: A New Platform for Distributed Machine Learning on Big Data, KDD, 2015
- Meng et al., MLlib: Machine Learning in Apache Spark, Journal of Machine Learning Research, 2016
- H2O, H2O version 3, 2017.
- Model Deployment
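
Since the overview readings above point to Scikit-Learn, the following minimal sketch shows its uniform estimator API (fit, then predict); the dataset and model choice are arbitrary illustrative assumptions.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    # load a small benchmark dataset and hold out a test split
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

    # every scikit-learn estimator follows the same fit/predict pattern
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    print(accuracy_score(y_test, model.predict(X_test)))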
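
As a rough illustration of the parameter-server architecture described by Li et al., this Python sketch runs two worker threads that pull the current parameters, compute a gradient on their own data shard, and push updates back to a shared server. It is a toy single-machine stand-in under stated assumptions: the class and method names (ParameterServer, pull, push) are ours, and the model is a two-parameter linear regression.

    import threading
    import random

    class ParameterServer:
        # holds the global model and applies pushed gradients
        def __init__(self, dim, lr=0.1):
            self.w = [0.0] * dim
            self.lr = lr
            self.lock = threading.Lock()

        def pull(self):
            with self.lock:
                return list(self.w)

        def push(self, grad):
            with self.lock:
                for i, g in enumerate(grad):
                    self.w[i] -= self.lr * g

    def worker(server, shard, steps=200):
        # each worker trains asynchronously on its own shard
        for _ in range(steps):
            w = server.pull()
            x, y = random.choice(shard)
            err = w[0] * x + w[1] - y      # error of the model y = w0*x + w1
            server.push([err * x, err])    # gradient of the squared error

    # noiseless synthetic data for y = 2x + 1, split into two shards
    data = [(i / 10, 2 * (i / 10) + 1) for i in range(-20, 20)]
    shards = [data[0::2], data[1::2]]

    server = ParameterServer(dim=2)
    threads = [threading.Thread(target=worker, args=(server, s)) for s in shards]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(server.w)  # approaches [2.0, 1.0]

The lock serializes every update here; the systems in the readings relax this with sharded key ranges and asynchronous, bounded-staleness updates.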
Topic Block 3: Deep Learning
- Deep Learning: Convolutional Neural Networks (a minimal CNN sketch follows at the end of this topic block)
- LeCun, Bengio, Hinton, Deep Learning, Nature, 2015
- Krizhevsky et al., ImageNet Classification with Deep Convolutional Neural Networks, 2012
- Ujjwal Karn, An intuitive explanation of convolutional neural networks, 2016
- Alex Krizhevsky, One weird trick for parallelizing convolutional neural networks, 2014
- Stanford, CS231n: Convolutional Neural Networks for Visual Recognition, 2017
- Deep Learning Frameworks: Caffe, Torch/PyTorch, TensorFlow, CNTK, MXNet
- Jia et al., Caffe: Convolutional Architecture for Fast Feature Embedding, MM, 2014.
- Torch, 2017.
- PyTorch, 2017.
- Abadi et al., TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems, White Paper, 2015.
- Seide et al., CNTK: Microsoft's Open Source Deep-Learning Toolkit, White Paper, 2016.
- Chen et al., MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems, 2015.
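
To accompany the CNN readings and framework list above, here is a minimal convolutional network in PyTorch (one of the frameworks cited in this topic block). It only sketches the standard convolution/pooling/fully-connected pattern for 28x28 grayscale inputs; all layer sizes are arbitrary illustrative choices, not taken from any of the papers.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SmallCNN(nn.Module):
        # conv -> pool -> conv -> pool -> fully connected classifier
        def __init__(self, num_classes=10):
            super().__init__()
            self.conv1 = nn.Conv2d(1, 8, kernel_size=3, padding=1)   # 1x28x28 -> 8x28x28
            self.conv2 = nn.Conv2d(8, 16, kernel_size=3, padding=1)  # 8x14x14 -> 16x14x14
            self.fc = nn.Linear(16 * 7 * 7, num_classes)

        def forward(self, x):
            x = F.max_pool2d(F.relu(self.conv1(x)), 2)  # -> 8x14x14
            x = F.max_pool2d(F.relu(self.conv2(x)), 2)  # -> 16x7x7
            return self.fc(x.flatten(1))                # class logits

    model = SmallCNN()
    logits = model(torch.randn(4, 1, 28, 28))  # batch of 4 fake images
    print(logits.shape)                        # torch.Size([4, 10])

Training would add a loss (e.g. cross-entropy) and an optimizer on top; the readings above cover both the architecture choices and how the frameworks parallelize training.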