ENSAE Paris - The engineering school for economics, data science, finance, and actuarial science

Analysis of Matrix Data

Teacher

LERASLE Matthieu

Department: Statistics

Objective

The objective of this course is to study high-dimensional statistics problems in order to highlight three fundamental ideas in this area that can subsequently be applied to many other problems related to data science. We briefly outline these principles:

A large amount of real-world data belongs to high-dimensional spaces in which classical statistical methods are ineffective (see "curse of dimensionality"). However, most of this data is structured. As a result, the "true" dimension of the problem is no longer that of the ambient space but rather that of the structure that contains the useful information of the data. This is referred to as structured or sparse data. Constructing bases or dictionaries that reveal the low-dimensional structures of this data is an important component of high-dimensional statistics.

At first glance, finding these low-dimensional structures seems to require a combinatorial search in a high-dimensional space. Such procedures are impractical. Therefore, an important aspect of high-dimensional statistics is to propose and analyze algorithms that can be implemented even in high-dimensional spaces. Two approaches have received particular attention: convex relaxation (coupled with the convex optimization toolbox) and iterative algorithms, which sometimes allow solving non-convex optimization problems.
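
To make the convex relaxation idea concrete, here is a minimal sketch, not taken from the course material, of its textbook instance: the combinatorial search for the sparsest solution of a linear system is replaced by ℓ1-norm minimization, which can be solved as a linear program. The dimensions, the random seed, and the use of scipy.optimize.linprog are illustrative choices.

    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(0)
    n, d, s = 40, 100, 3  # measurements, ambient dimension, sparsity

    # Sparse ground truth and a random Gaussian measurement matrix.
    x_true = np.zeros(d)
    x_true[rng.choice(d, size=s, replace=False)] = rng.standard_normal(s)
    A = rng.standard_normal((n, d)) / np.sqrt(n)
    b = A @ x_true

    # Basis pursuit: minimize ||x||_1 subject to Ax = b.
    # Splitting x = u - v with u, v >= 0 makes the objective sum(u + v) linear.
    c = np.ones(2 * d)
    res = linprog(c, A_eq=np.hstack([A, -A]), b_eq=b, bounds=[(0, None)] * (2 * d))
    x_hat = res.x[:d] - res.x[d:]

    print("recovery error:", np.linalg.norm(x_hat - x_true))

With a number of random measurements of order s log(d/s), the ℓ1 minimizer typically coincides with the sparse ground truth; this is exactly the compressed sensing setting of the first topic below.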

Finally, the third component is the role played by randomness in high-dimensional statistics. It turns out that low-dimensional structures are generally revealed by random objects: so far, deterministic constructions have not been able to uncover these structures as efficiently as random ones, such as random measurement matrices.

A course on high-dimensional statistics can thus cover several areas of mathematics, including approximation theory, convex optimization, and probability theory. In this course, we will mainly study the algorithmic and probabilistic aspects of this theory. Approximation theory will be addressed only briefly, through the example of images. This course will approach the paradigm of high-dimensional statistics through three main topics:

  • Compressed sensing: reconstructing a high-dimensional signal, exactly or approximately, from a small number of linear measurements, knowing that the signal has small support;
  • Matrix completion/recommendation systems: how to complete a matrix from a small number of observed entries, knowing that the matrix has low rank (a minimal sketch follows this list);
  • Community detection in graphs: finding densely connected subgraphs in 'large' graphs.
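
As a minimal sketch of the second topic, the following iteration replaces the (non-convex) rank by its convex surrogate, the nuclear norm, in the spirit of the Soft-Impute algorithm of the Hastie, Mazumder, Lee, and Zadeh reference below; the matrix sizes, the sampling rate, and the threshold lam are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(1)
    n, m, r = 50, 40, 2  # matrix size and true rank

    # Low-rank ground truth, observed on a random subset of its entries.
    M = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))
    mask = rng.random((n, m)) < 0.5  # roughly half of the entries observed

    # Soft-Impute: fill the missing entries with the current guess, then
    # soft-threshold the singular values.
    lam, X = 1.0, np.zeros((n, m))
    for _ in range(200):
        filled = np.where(mask, M, X)
        U, sing, Vt = np.linalg.svd(filled, full_matrices=False)
        X = (U * np.maximum(sing - lam, 0.0)) @ Vt

    rel_err = np.linalg.norm((X - M)[~mask]) / np.linalg.norm(M[~mask])
    print("relative error on the unobserved entries:", rel_err)

Soft-thresholding the singular values is the proximal step for the nuclear norm, so this loop is an instance of the iterative algorithms mentioned above.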

We will thus address the problem of high-dimensional statistics through three key objects/types of data in data science: high-dimensional but sparse vectors; large but low-rank matrices; and, finally, 'large' graphs whose nodes are organized into communities.
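
For the graph setting, here is a minimal spectral sketch on a graph with two planted communities (a stochastic block model; the graph size and edge probabilities are illustrative assumptions, and the course covers more robust approaches, such as the Grothendieck-inequality method of Guédon and Vershynin listed below).

    import numpy as np

    rng = np.random.default_rng(2)
    n, p, q = 300, 0.10, 0.02  # nodes, within- and between-community edge probabilities

    # Two planted communities of equal size (a stochastic block model).
    labels = np.repeat([1, -1], n // 2)
    probs = np.where(np.outer(labels, labels) > 0, p, q)
    A = (rng.random((n, n)) < probs).astype(float)
    A = np.triu(A, 1)
    A = A + A.T  # symmetric adjacency matrix, no self-loops

    # Spectral heuristic: after centering, the leading eigenvector of the
    # adjacency matrix aligns with the community labels.
    B = A - A.mean()
    eigvals, eigvecs = np.linalg.eigh(B)
    guess = np.sign(eigvecs[:, -1])

    # The labeling is recovered up to a global sign flip.
    accuracy = max(np.mean(guess == labels), np.mean(guess == -labels))
    print("fraction of correctly classified nodes:", accuracy)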

In terms of mathematical techniques, we will focus on the following themes:

  1. Concentration of random variables and complexity calculations (a prototypical statement follows this list);
  2. Methods and analysis of algorithms in convex optimization.
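
As an illustration of the first theme, here is the prototypical concentration statement, written for an average of standard Gaussians (the notation is ours, not the syllabus'):

    \[
      \mathbb{P}\Bigl( \Bigl| \frac{1}{n}\sum_{i=1}^{n} X_i \Bigr| \ge t \Bigr)
      \le 2 \exp\bigl( -n t^2 / 2 \bigr),
      \qquad X_1, \dots, X_n \ \text{i.i.d. } \mathcal{N}(0,1), \quad t \ge 0.
    \]

Sub-Gaussian variants of this bound are what typically make the complexity calculations work: they quantify how many random measurements suffice for the low-dimensional structures above to be revealed with high probability.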

At the end of this course, students should be able to:

  • Identify the computational properties of certain optimization problems, in particular recognize computationally hard problems;
  • Find convex relaxations for these (non-convex) optimization problems;
  • Understand the role of randomness in these problems, particularly in constructing appropriate measurement vectors;
  • Use the problems of compressed sensing, matrix completion, and community detection as standard problems;
  • Build algorithms that approximate the solutions of “convexified” optimization problems.

Course outline

1. Compressed sensing

2. Matrix completion

3. Community detection

References

Venkat Chandrasekaran, Benjamin Recht, Pablo A. Parrilo, and Alan S. Willsky. The convex geometry of linear inverse problems. Found. Comput. Math., 12(6):805–849, 2012.

Simon Foucart. A simple proof of Kashin's decomposition theorem. Technical report, Drexel University, 2012.

Olivier Guédon and Roman Vershynin. Community detection in sparse networks via Grothendieck's inequality. Technical report, 2014.

Trevor Hastie, Rahul Mazumder, Jason D. Lee, and Reza Zadeh. Matrix completion and low-rank SVD via fast alternating least squares. Technical report, Statistics Department and ICME, Stanford University, 2014.

Felix Krahmer and Rachel Ward. New and improved Johnson-Lindenstrauss embeddings via the restricted isometry property. SIAM J. Math. Anal., 43(3):1269–1281, 2011.

Emile Richard, Guillaume Obozinski, and Jean-Philippe Vert. Tight convex relaxations for sparse matrix factorization. Technical report, 2012.

Ruslan Salakhutdinov, Andriy Mnih, and Geoffrey Hinton. Restricted Boltzmann machines for collaborative filtering. Technical report, University of Toronto, 2010.

Joel A. Tropp. Convex recovery of a structured signal from independent random linear measurements. Technical report, to appear in Sampling Theory, a Renaissance, 2014.