Master 2015-2016
Internships of the SAR specialization
Dynamic Adaptation in Stream Processing Engines


Site: Inria Myriads project team
Location: Inria Rennes Bretagne Atlantique
Supervisors: Cédric Tedeschi, Associate Professor (Maître de conférences), Univ. Rennes 1; Matthieu Simonin, Engineer, Inria
Dates: from 01/02/2016 to 31/06/2016
Remuneration: 554 euros
Keywords: Master SAR, other than ATIAM

Description

Distributed stream processing has become a leading approach for analysing large amounts of data in real time. The Internet of Things, stock trading and web traffic monitoring all continuously produce data that call for immediate processing. To address the challenge of handling high-volume, high-velocity data, different stream processing engines have emerged, including Spark Streaming (based on Spark) [1], Storm [2], Flink [3] and Samza [4]. These platforms are known to be used intensively at large scale by different actors (e.g. Yahoo!, Twitter, LinkedIn). The reputation of these frameworks is often measured in terms of raw performance (number of events processed per second). From our point of view, another aspect to take into account is the capacity of the framework to adapt to changes caused by external factors: for example, these systems are subject to overload and failures. Several steps have been taken in this direction in [5,6,7,8].

These works primarily focus on dynamically adapting the mapping of the operators of the stream-processing application onto the computing resources that host it. Some of them assume a limited amount of resources and propose different techniques to adapt without significantly degrading the throughput of the application (but at the cost of a degraded accuracy of the output). Other approaches consider elasticity when cloud-based resources are available, in which case the goal is to minimize the gap between allocated resources and their actual utilisation.
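To make the elasticity goal concrete, here is a minimal sketch of a capacity-based scaling rule. It is not taken from any of the cited systems: the function names, the per-worker capacity model and the 0.8 target utilisation are all assumptions chosen for illustration.

```python
import math


def target_workers(input_rate, per_worker_capacity,
                   target_utilisation=0.8, min_workers=1, max_workers=None):
    """Return the smallest worker count that keeps utilisation at or below
    target_utilisation (hypothetical model: capacity scales linearly)."""
    if per_worker_capacity <= 0:
        raise ValueError("per_worker_capacity must be positive")
    if not 0 < target_utilisation <= 1:
        raise ValueError("target_utilisation must be in (0, 1]")
    # Events/s one worker may absorb while staying under the target.
    usable = per_worker_capacity * target_utilisation
    needed = max(min_workers, math.ceil(input_rate / usable))
    if max_workers is not None:
        # Limited-resource case: the allocation is capped.
        needed = min(max_workers, needed)
    return needed


def utilisation_gap(input_rate, workers, per_worker_capacity):
    """Fraction of allocated capacity left unused (negative means overload)."""
    return 1.0 - input_rate / (workers * per_worker_capacity)


if __name__ == "__main__":
    rate, capacity = 1000, 500  # events/s, illustrative values only
    n = target_workers(rate, capacity)
    print(n, utilisation_gap(rate, n, capacity))
```

When max_workers caps the allocation, the gap can become negative, which corresponds to the limited-resource setting where the application has to trade accuracy for throughput.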

Some work remains to be done, as several points are still unclear, in particular:

1) The comparison of these approaches in terms of performance is still a wide open issue, as no systematic benchmark or tool has been devised for this purpose.

2) The optimization of their performance in a dynamic environment has only been addressed in a non-systematic fashion: each algorithm targets a subset of this software family, with its own specificities.

The internship will first focus on studying several representatives of this family of software and devising a model able to cover them all, so as to develop a generic framework for their experimental evaluation. In a second part, the work will aim to include dynamic adaptation in the model devised. Finally, this framework will make it possible to test the different systems in the field (mentioned above) so as to conduct a more comprehensive experimental campaign. The Grid’5000 platform will be used to perform these experiments.

Bibliography

[1] http://spark.apache.org/streaming/

[2] https://storm.apache.org/

[3] https://flink.apache.org/

[4] http://samza.apache.org/

[5] Waldemar Hummer, Benjamin Satzger, Schahram Dustdar: Elastic stream processing in the Cloud. Wiley Interdisc. Rev.: Data Mining and Knowledge Discovery 3(5): 333-345 (2013)

[6] Vincenzo Gulisano, Ricardo Jiménez-Peris, Marta Patiño-Martínez, Claudio Soriente, Patrick Valduriez: StreamCloud: An Elastic and Scalable Data Streaming System. IEEE Trans. Parallel Distrib. Syst. 23(12): 2351-2365 (2012)

[7] Javier Cerviño, Evangelia Kalyvianaki, Joaquín Salvachúa, Peter Pietzuch: Adaptive Provisioning of Stream Processing Systems in the Cloud. ICDE Workshops 2012: 295-301

[8] Daniel J. Abadi, Yanif Ahmad, Magdalena Balazinska, Ugur Çetintemel, Mitch Cherniack, Jeong-Hyon Hwang, Wolfgang Lindner, Anurag Maskey, Alex Rasin, Esther Ryvkina, Nesime Tatbul, Ying Xing, Stanley B. Zdonik: The Design of the Borealis Stream Processing Engine. CIDR 2005: 277-289