Master 2014 2015
Stages de la spécialité SAR
Geo-Distributed Indexes Bucket

Site :Scality
Lieu :in Scality office at Paris 9th
Encadrant : Vianney Rancurel, Marc Shapiro
Dates :du 01/03/2015 au 31/08/2015
Rémunération :oui
Mots-clés : Master SAR, autre qu’ATIAM


Simple key-value stores have become the building block in a lot of large-scale distributed storage systems. These stores follow a simple design decision in which keys are spread and retrieved over the network using a consistent hash function[1] ; the hash function is the key for load balancing and system resiliency, however, it usually doesn’t preserve the data locality. Nevertheless, data locality is a desired feature to support range querying in more complex storage.

Approaches that attempt to bring range-query feature into distributed storage systems usually come with some trade-off. Distributed tree-based systems such as PHT[2] bring range-query to DHT (Distributed Hash Table), however, these systems are difficult to scale in geo-replicated environments and they do not provide eventual consistency. Column-based partitioning techniques, such as those used in BigTable[3] and Cassandra[4], also preserve data locality, but they either do not support eventual consistency or replace consistent hashing by order-preserving hash functions, which requires manual load balancing of the system.

We would like to study another distributed storage system which has the following properties :
- Geo-replication aware ; number of geo-sites is not limited
- Load-balancing and network resiliency of DHT
- Range-query supporting
- Strong eventual consistency supporting
- Scalable

- The system should be prototyped as a middleware, that works with different components in our system
- Develop a library based on the findings. Data types of interest include maps (directories), trees and directed graphs ; conflicts scenarios and their merging semantics should be studied.
- Implement them efficiently in the context of large multi-core computers and/or clusters.
- Deploy them in some representative applications.
- Perform experiments and measure performance, compared to alternative implementations.
- Publish the findings in the scientific literature.

Requirements :
- Enrolled in a Masters’ in Computer Science / Informatics or a related field.
- An excellent academic record.
- A strong interest and good knowledge of big-data applications, distributed systems, distributed algorithms, or distributed databases.
- Motivated by experimental research. A strong programming ability is required ; knowledge of node.js is a plus.

Applying To apply, contact Vianney Rancurel , with the following information :
- A CV
- A transcript of the last two years of study
- Names and contact details of two references (to whom we will contact directly) The internship is fully funded, and will take place at Scality (at least 3 days/week) and LIP6 in Paris.


[1] Karger, David, et al. "Consistent hashing and random trees : Distributed caching protocols for relieving hot spots on the World Wide Web." Proceedings of the twenty-ninth annual ACM symposium on Theory of computing. ACM, 1997.

[2] Ramabhadran, Sriram, et al. "Prefix hash tree : An indexing data structure over distributed hash tables." Proceedings of the 23rd ACM symposium on principles of distributed computing. Vol. 37. 2004.

[3] Chang, Fay, et al. "Bigtable : A distributed storage system for structured data."ACM Transactions on Computer Systems (TOCS) 26.2 (2008) : 4.

[4] Lakshman, Avinash, and Prashant Malik. "Cassandra : a decentralized structured storage system." ACM SIGOPS Operating Systems Review 44.2 (2010) : 35-40.