Master's in Computer Science 2008-2009
PhD thesis proposal (IRISA)

Title : Parallelism and distribution for very large scale content-based multimedia retrieval systems

Place : IRISA Campus de Beaulieu 35042 Rennes, FRANCE

Advisor : Laurent Amsaleg Email :

Proposal :

1) Context

Thanks to Web search engines, digital cameras and camcorders, and sharing platforms such as YouTube or Flickr, it is now very easy to create, copy, store, share, find, and modify digital material. While this makes digital media more enjoyable for users, it also raises many copyright infringement problems, as that material is quite often illegally uploaded and used, sometimes by malicious pirates. Various techniques enforcing copyright protection have been developed, including content-based image retrieval systems (CBIRS), where potentially illegal content is used to query a database containing the material to protect. These systems are extremely complex but quite effective: they detect violations even if the stolen material has been severely modified, and they can protect rather large collections of content.

These systems, however, are still slow : several seconds are needed to check the copyright status of one image. They can hardly be used in real-world settings, where throughput is crucial because thousands of verifications per second must be performed. This is impossible today, as such systems are typically built on a centralized architecture. It is therefore key to start investigating the issues raised by building very large scale CBIRS on top of architectures supporting distribution and parallelism.

2) Subject

Investigating these issues forms the core of this PhD proposal. This includes designing distributed global memory management policies (prefetching, caching, data shipping, ...), thread scheduling strategies (thread shipping, load balancing, coping with node failures, ...), distributing the queries and the data collection across nodes, etc. All these policies need to take into account the specificities of CBIRS, where approximate nearest-neighbor searches are performed in a multi-dimensional feature space, in contrast with more standard applications that also have throughput requirements.
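To make the idea of distributing the data collection and the queries across nodes concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not the design this thesis is expected to produce: the function names (partition, local_knn, distributed_knn) are hypothetical, the "nodes" are simulated by threads in a single process, the collection is split round-robin, and each shard is searched exactly by brute force, standing in for the approximate index (e.g., an NV-tree) a real CBIRS would use.

```python
import heapq
import math
from concurrent.futures import ThreadPoolExecutor


def partition(collection, num_nodes):
    """Split (item_id, vector) pairs round-robin into num_nodes shards.

    Round-robin placement is an assumption made for this sketch; a real
    system might place data according to the index structure instead.
    """
    shards = [[] for _ in range(num_nodes)]
    for position, item in enumerate(collection):
        shards[position % num_nodes].append(item)
    return shards


def local_knn(shard, query, k):
    """Return the k closest (distance, item_id) pairs found in one shard.

    Exact brute-force search, used here as a stand-in for the approximate
    nearest-neighbor index that each node would actually host.
    """
    scored = ((math.dist(vector, query), item_id) for item_id, vector in shard)
    return heapq.nsmallest(k, scored)


def distributed_knn(shards, query, k):
    """Send the query to every shard in parallel and merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(lambda shard: local_knn(shard, query, k), shards)
        # Each shard returns at most k candidates; the global top-k is
        # necessarily contained in the union of the per-shard top-k lists.
        return heapq.nsmallest(k, (hit for part in partials for hit in part))


if __name__ == "__main__":
    # Toy 2-D descriptors; real image descriptors have many more dimensions.
    collection = [(i, (float(i), 0.0)) for i in range(10)]
    shards = partition(collection, num_nodes=3)
    for distance, item_id in distributed_knn(shards, (2.2, 0.0), k=3):
        print(item_id, round(distance, 3))
```

In this sketch every query is broadcast to all shards, so adding nodes shortens the per-node search but does not reduce the number of nodes touched per query. Deciding where data lives, which nodes a query must visit, and how threads and memory pages move between nodes are precisely the policy questions listed above.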

This work will build on state-of-the-art technologies : the CBIRS to start from will be one of the best systems proposed in the literature in terms of efficiency and ability to cope with scale (e.g., NV-tree or VideoGoogle) ; the underlying Operating System architecture (Kerrighed) is one of the best platforms for providing a single system image operating system for high performance computing on clusters.

3) Environment and Funding

This PhD will take place in the context of the European project Quaero, which aims at building Web-scale content-based multimedia search engines. Three years of funding are therefore already available, starting in September 2009. In addition, the Quaero environment provides opportunities to meet many scientists concerned with these problems, as searching the contents of videos, audio archives, still image collections or even digital libraries calls for throughput-oriented systems.

The applicant must have a very solid background in operating systems, memory management, distribution and parallelism. No expertise in image (or signal) processing is needed ; some knowledge of databases is beneficial. Very good programming skills are required.