Master 2014-2015
Internships of the SAR speciality
Version management for big-data storage

Host team: Regal, LIP6 / Inria
Location: LIP6, 4 place Jussieu, 75005 Paris
Advisor: Marc Shapiro
Dates: 1/3/2015 to 31/8/2015 (negotiable)
Paid: yes
Keywords: Master SAR, other than ATIAM


Many modern applications access very large objects, such as the friendship graph of a social network, or the directory used for storing a large email collection. The cost of copying such a large object, either to or from disk or over the network, may overwhelm the computational cost of using the object. However, copying is a frequent operation, for instance when in-place update is undesirable, e.g., to accommodate concurrent updates, to ensure atomicity, or for fault tolerance.

Techniques for addressing this problem include copy-on-write (CoW) page management (as used in operating-system kernels), and so-called persistent objects (as used in the implementation of functional languages). Unfortunately, existing CoW techniques are low-level and hard to use, and furthermore are not applicable in modern managed language environments such as Java or Erlang. Conversely, persistent-object techniques are general but have not been applied to a distributed big-data setting. One recent result is the LSM-tree, used in the LevelDB database, for the special case of a large index tree.

The above techniques all follow a common pattern. A frozen snapshot of the main (large) data structure is created. A new version consists of a pointer to the previous version and some kind of buffer. An update is stored in the buffer, and does not modify the snapshot. Reading the object state first checks the buffer, then (possibly) the snapshot. Versions can be stacked or chained, several possibly sharing the same base snapshot. When the chain of recent-update buffers becomes too large or too complex, the buffers are merged (according to the object's semantics) into a new frozen snapshot.
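The pattern above can be sketched in a few lines. This is a minimal, illustrative implementation for a map (directory) object; all names (`Version`, `put`, `fork`, `compact`) are ours, not taken from any existing system:

```python
# Minimal sketch of the snapshot-plus-buffer versioning pattern, for a map.
# Illustrative only; names and structure are assumptions, not project code.

class Version:
    """An immutable base snapshot plus a buffer of recent updates."""

    def __init__(self, snapshot=None, parent=None):
        self.snapshot = snapshot if snapshot is not None else {}  # frozen base
        self.parent = parent    # previous version in the chain, if any
        self.buffer = {}        # recent updates, kept out of the snapshot

    def put(self, key, value):
        self.buffer[key] = value  # an update only ever touches the buffer

    def get(self, key):
        # A read checks the buffer first, then falls back along the chain
        # to the base snapshot.
        if key in self.buffer:
            return self.buffer[key]
        if self.parent is not None:
            return self.parent.get(key)
        return self.snapshot.get(key)

    def fork(self):
        # A new version pointing at this one; both share the base snapshot.
        return Version(parent=self)

    def compact(self):
        # Merge the chain of buffers into a new frozen snapshot. For a plain
        # map the merge is a dictionary overlay; richer types (trees, graphs)
        # would merge according to their own semantics.
        base = self.parent.compact().snapshot if self.parent else self.snapshot
        return Version(snapshot={**base, **self.buffer})
```

For example, forking `v1` into `v2` and updating `v2` leaves `v1` unchanged, and `v2.compact()` produces a fresh frozen snapshot with an empty buffer.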

This pattern has been used so far mostly in an ad-hoc way and has not been studied scientifically. We wish to study its theoretical and practical foundations and limitations, and to leverage the mergeability, commutativity, associativity and idempotence properties of Conflict-Free Replicated Data Types (CRDTs) to optimise and generalise it in a principled way. The specific objectives of the internship are the following:

- Develop a library of big-data versionable data types based on this technique. Data types of interest include maps (directories), trees and directed graphs.
- Implement them efficiently in the context of large multicore computers and/or clusters.
- Deploy them in some representative applications.
- Perform experiments and measure performance, compared to alternative implementations.
- Publish the findings in the scientific literature.
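The CRDT merge properties mentioned above can be illustrated with a toy example. A grow-only set, whose merge operation is plain set union, is the simplest CRDT; the code below is an illustration, not project code:

```python
# Toy illustration of the CRDT merge properties (commutativity,
# associativity, idempotence), using a grow-only set whose merge is union.

def merge(a, b):
    """G-Set merge: plain set union."""
    return a | b

x, y, z = {1, 2}, {2, 3}, {3, 4}
assert merge(x, y) == merge(y, x)                      # commutative
assert merge(merge(x, y), z) == merge(x, merge(y, z))  # associative
assert merge(x, x) == x                                # idempotent
```

These properties mean that recent-update buffers can be merged in any order, any grouping, and any number of times without changing the result, which is what makes compacting a chain of buffers into a new snapshot safe to do in the background.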

Requirements:
- Enrolled in a Master's in Computer Science / Informatics or a related field.
- An excellent academic record.
- A strong interest in, and good knowledge of, big-data applications, distributed systems, distributed algorithms, or distributed databases.
- Motivated by experimental research. Knowledge of node.js is a plus.


The internship is fully funded, and will take place in the Regal group, at Laboratoire d’Informatique de Paris-6 (LIP6), in Paris. A successful intern will be invited to apply for a PhD.

To apply, contact Marc Shapiro with the following information:
- A resume or Curriculum Vitæ.
- A list of courses and grades for the last two years of study (an informal transcript is OK).
- Names and contact details of two references (people who can recommend you). We will contact these people directly.