Data Science Seminar

The LTCI Data Science Seminar is a joint seminar between the DIG team and the S2A team of the LTCI. It features talks related to machine learning and data management.

The seminar organisers are Antoine Amarilli (DIG) and Olivier Fercoq (S2A). Attendance is generally open to the public; feel free to contact the organisers if you are interested in coming. Talks are held at Télécom ParisTech, 46 rue Barrault, Paris, France (métro Corvisart).

March 14, 2019

The seminar takes place from 2PM to 4PM (room C49), and will consist of two talks:

Talk 1: Thomas Bonald (Télécom ParisTech): Hierarchical graph clustering: quality metrics and algorithms

You can download the slides of this talk.

Abstract: This talk is about hierarchical clustering for graph-structured data. We present a novel agglomerative algorithm as well as a quality metric based on the sampling distribution of node pairs induced by the graph. A fundamental property of the proposed metric is that it is interpretable in terms of graph reconstruction: it is exactly the information loss induced by the representation of the graph as a tree. It is applicable to any tree and, in particular, to a tree of height 2 corresponding to a usual graph clustering, i.e., a partition of the set of nodes. It can also be used to compress the binary tree returned by any hierarchical clustering algorithm so as to get a compact, meaningful representation of the hierarchical structure of the graph. The results are illustrated on both synthetic and real datasets.
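
As a point of comparison, a generic agglomerative scheme (not the specific algorithm or quality metric of the talk) can be sketched in a few lines: repeatedly merge the two clusters joined by the largest total inter-cluster edge weight, recording the merge tree.

```python
# A generic agglomerative sketch: repeatedly merge the two clusters joined
# by the largest total inter-cluster edge weight, recording the merge tree.

def agglomerate(edges, n):
    """edges: dict mapping node pairs (u, v) to positive weights;
    n: number of nodes, labelled 0..n-1.
    Returns the merge list [(cluster_a, cluster_b, new_cluster_id), ...]."""
    # Normalize pair keys and sum inter-cluster weights.
    weight = {}
    for (u, v), w in edges.items():
        key = (min(u, v), max(u, v))
        weight[key] = weight.get(key, 0) + w
    merges = []
    next_id = n  # merged clusters get fresh ids, as in a dendrogram
    while weight:
        # Greedily pick the heaviest inter-cluster connection.
        (a, b), _ = max(weight.items(), key=lambda kv: kv[1])
        merges.append((a, b, next_id))
        # Re-wire every edge touching a or b to the merged cluster.
        new_weight = {}
        for (u, v), w in weight.items():
            if (u, v) == (a, b):
                continue
            u2 = next_id if u in (a, b) else u
            v2 = next_id if v in (a, b) else v
            key = (min(u2, v2), max(u2, v2))
            new_weight[key] = new_weight.get(key, 0) + w
        weight = new_weight
        next_id += 1
    return merges
```

The returned merge list encodes the binary tree that a hierarchical clustering produces; the metric presented in the talk would then score such a tree by its graph-reconstruction loss.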

Talk 2: Alessandro Rudi (Inria Sierra): Structured prediction via implicit embeddings

You can download the slides of this talk.

Abstract: In this talk we analyze a regularization approach for structured prediction problems. We characterize a large class of loss functions that allows structured outputs to be naturally embedded in a linear space. We exploit this fact to design learning algorithms using a surrogate loss approach and regularization techniques. We prove universal consistency and finite sample bounds characterizing the generalization properties of the proposed methods. Experimental results are provided to demonstrate the practical usefulness of the proposed approach.

December 11, 2018

The seminar takes place from 2PM to 3PM (room C48), and will consist of one talk:

Talk: Shai Ben-David (University of Waterloo): Settling the sample complexity of learning mixtures of Gaussians via sample compression schemes

You can download the slides of this talk.

Abstract: Estimating distributions from observed data is a fundamental task in statistics that has been studied for more than a century. This task also arises in applied machine learning. In many scenarios it is common to assume that the distribution can be modelled using a mixture of Gaussians.

We prove that Θ(kd^2/ε^2) samples are necessary and sufficient for learning a mixture of k Gaussians in R^d, up to error ε in total variation distance. This improves both the known upper bounds and lower bounds for this problem. For mixtures of axis-aligned Gaussians, we show that O(kd/ε^2) samples suffice, matching a known lower bound. The upper bound is based on a novel technique for distribution learning based on a notion of sample compression. Any class of distributions that allows such a sample compression scheme can also be learned with few samples. Moreover, if a class of distributions has such a compression scheme, then so do the classes of products and mixtures of those distributions. The core of our main result is showing that the class of Gaussians in R^d admits an efficient sample compression scheme.

The talk is based on joint work with Hassan Ashtiani, Nicholas J. A. Harvey, Christopher Liaw, Abbas Mehrabian and Yaniv Plan.

This work has just received a best paper award at NeurIPS 2018.

November 29, 2018

The seminar takes place from 2PM to 4PM (room C48), and consists of the following talks:

Talk 1: Rodrigo Mello (University of Sao Paulo): Steps, concepts and issues involved in providing learning guarantees in the Deep Learning scenario (especially Convolutional Neural Networks)

You can download the slides of this talk.

Abstract: This one-hour talk starts from the generalization bound of Statistical Learning Theory, then estimates the shattering coefficient of a single-hyperplane-based algorithm for input spaces of different dimensionalities, so that we end up questioning the complexity of Deep Learning biases and how it could be estimated.
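
For reference, one classical form of the bound in question (constants vary across textbooks): with probability at least 1 − δ over an i.i.d. sample of size n, every classifier f in the class F satisfies

```latex
R(f) \;\le\; \widehat{R}_n(f) \;+\; \sqrt{\frac{8}{n}\left(\ln \mathcal{S}(\mathcal{F}, 2n) + \ln\frac{4}{\delta}\right)},
```

where S(F, 2n) is the shattering coefficient. For a single hyperplane in R^d the shattering coefficient grows only polynomially in n (with degree governed by d), which is what makes the bound informative; the question raised in the talk is what plays this role for deep, composed architectures.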

Talk 2: Olivier Sigaud (Sorbonne University): Efficient exploration for learning continuous action policies

You can download the slides of this talk.

Abstract: This talk is based on our ICML 2018 paper with Cédric Colas and Pierre-Yves Oudeyer. I will first give a tutorial introduction to reinforcement learning (RL) and the DQN and DDPG deep RL algorithms. Then I will highlight a simple exploration issue faced by these systems, I will present Goal Exploration Processes (GEPs) as an efficient exploration method, and I will propose a way of combining GEPs with DDPG which provides a satisfactory solution to the issue. Finally, I will wrap these results into the broader context of Artificial Intelligence from a developmental learning perspective.

February 22, 2018

The seminar takes place from 2PM to 4PM (room C48), and consists of the following talks:

Talk 1: Odalric-Ambrym Maillard (Inria Lille, Sequel team): A survey of some recent results in structured bandits

The slides for this talk are not available.

Abstract: I will especially focus on the use of multi-armed bandit theory in the context of linearly correlated distributions, optimization of Lipschitz functions, and the problem of sequential ranking. The talk will be based on recent results obtained by Stefan Magureanu and by Audrey Durand during their PhD.
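
For readers unfamiliar with multi-armed bandits, the unstructured baseline that structured-bandit results build on is easy to state. A minimal UCB1 sketch (illustrative only, not an algorithm from the talk):

```python
import math
import random

def ucb1(pull, n_arms, horizon, seed=0):
    """Minimal UCB1: play each arm once, then pick the arm maximizing
    empirical mean + exploration bonus. pull(arm) returns a reward in [0, 1].
    Returns the number of pulls of each arm."""
    random.seed(seed)
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # initialization round: try every arm once
        else:
            arm = max(range(n_arms),
                      key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        r = pull(arm)
        counts[arm] += 1
        sums[arm] += r
    return counts
```

Structured settings such as linearly correlated arms or Lipschitz functions sharpen the exploration bonus by sharing information across arms, which is where the talk's results come in.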

Talk 2: Pierre Senellart (ENS): Reinforcement learning for intensional data management

You can download the slides of this talk.

Abstract: We will review how techniques from reinforcement learning (bandits, Markov decision processes) naturally occur in a wide variety of tasks related to intensional data management, i.e., management of data whose access is associated to a cost. This covers a range of applications, from Web crawling to crowdsourcing, from query optimization to management of virtual machines. In contrast with traditional reinforcement learning applications, data management scenarios often involve a very large but heavily structured state or action space, requiring adaptations of traditional techniques.

November 16, 2017

The seminar takes place from 2PM to 4PM (room C48), and consists of the following talks:

Talk 1: Luis Galárraga (Inria Rennes), How to know how much we know

You can download the slides of this talk.

Abstract: Current RDF knowledge bases (KBs) are highly incomplete. Such incompleteness is a serious problem both for data consumers and data producers. Data consumers have no guarantee about the completeness of the results of queries run on a KB, which diminishes the practical value of the available data. Data producers, on the other hand, are blind to the parts of the KB that should be populated. Yet, completeness information management is poorly supported in the Semantic Web: completeness information for KBs is scarce, and no RDF storage engine nowadays provides completeness guarantees for queries. In this talk we present a vision of a completeness-aware Semantic Web, and explain our ongoing work in this direction, namely on the tasks of automatic generation of completeness annotations, and computation of completeness guarantees for SPARQL queries.

Talk 2: Vincent Audigier (CNAM): Multiple imputation with principal component methods

You can download the slides of this talk.

Abstract: Missing data are common in the domain of statistics. They are a key problem because most statistical methods cannot be applied to incomplete data sets. Multiple imputation is a classical strategy for dealing with this issue. Although many multiple imputation methods have been suggested in the literature, they still have some weaknesses. In particular, it is difficult to impute categorical data, mixed data and data sets with many variables, or a small number of individuals with respect to the number of variables. This is due to overfitting caused by the large number of parameters to be estimated. This work proposes new multiple imputation methods that are based on principal component methods, which were initially used for exploratory analysis and visualisation of continuous, categorical and mixed multidimensional data. The study of principal component methods for imputation, never previously attempted, offers the possibility to deal with many types and sizes of data. This is because the number of estimated parameters is limited due to dimensionality reduction.

First, we describe a single imputation method based on factor analysis of mixed data. We study its properties and focus on its ability to handle complex relationships between variables, as well as infrequent categories. Its high prediction quality is highlighted with respect to the state-of-the-art single imputation method based on random forests.

Next, a multiple imputation method for categorical data using multiple correspondence analysis (MCA) is proposed. The variability of prediction of missing values is introduced via a non-parametric bootstrap approach. This helps to tackle the combinatorial issues which arise from the large number of categories and variables. We show that multiple imputation using MCA outperforms the best current methods.
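
The core idea, imputing missing cells from a low-rank (principal-component) approximation, can be illustrated with a bare-bones single-imputation loop. This is a toy sketch of the general approach, not the talk's exact method (which handles mixed data and adds multiple-imputation variability):

```python
import numpy as np

def iterative_pca_impute(X, rank=1, n_iter=200):
    """Single imputation by iterated low-rank (PCA) approximation.
    X: float array with np.nan marking missing entries."""
    miss = np.isnan(X)
    Xf = np.where(miss, np.nanmean(X, axis=0), X)  # start from column means
    for _ in range(n_iter):
        mu = Xf.mean(axis=0)
        U, s, Vt = np.linalg.svd(Xf - mu, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank] + mu
        Xf[miss] = low_rank[miss]  # refresh only the missing cells
    return Xf
```

On a rank-1 matrix with one missing cell, the loop recovers the missing value; on real data the rank must be chosen carefully, since too large a rank reintroduces the overfitting discussed above.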

October 5, 2017

The seminar takes place from 2PM to 4PM in Amphi Grenat, and consists of the two following talks:

Talk 1: Giovanna Varni (Télécom ParisTech): Automated analysis of socio-affective signals in interpersonal interaction

The slides for this talk are not available.

Abstract: The establishment of interpersonal interaction, understood as human-human and human-machine interaction, is grounded in the ability to exchange and manage verbal and nonverbal socio-affective signals. Becoming an effective partner in interaction requires, first, perceiving and understanding these signals and, then, continuously adapting one's own signals by predicting the future development of the interaction. Endowing systems and artificial agents with the skill to automatically analyze these signals and their dynamics is not a trivial task, whether offline or on the fly. Still, this is a crucial challenge for improving the effectiveness and naturalness of human-artificial agent interaction in several scenarios, e.g., education, manufacturing, and assistive technology.

In this talk, I will report on my recent work on Social Signal Processing and Affective Computing, with particular reference to laughter, emotional interdependence and personality traits. This work was performed in the framework of European and French national projects.

Talk 2: James Eagan (Télécom ParisTech): Reinventing the mind’s bicycle

The slides for this talk are not available.

Abstract: Steve Jobs once described the computer as “a bicycle for the mind.” My work focuses on making this bicycle a better, more expressive tool for humans: by making richer, more expressive interactions, by reinventing how we build software so users can adapt it to their own idiosyncratic needs, and by providing richer tools to interact with and understand data. In this talk, I will present an overview of my work in these three areas.

Bio: James Eagan is an Associate Professor (maître de conférences) in the DIVA group at Télécom ParisTech, with a research focus on Human-Computer Interaction and Information Visualization. He is also co-chair of the MS Big Data program. Before joining Télécom ParisTech, he was a post-doctoral researcher at LRI (CNRS & Université Paris-Sud). He holds a Ph.D. and an M.Sc. in Computer Science from the Georgia Institute of Technology and a B.A. in Mathematics/Computer Science from Lawrence University.

September 7, 2017

The seminar takes place from 2PM to 4PM in Amphi Saphir, and consists of the two following talks:

Talk 1: Albert Bifet (Télécom ParisTech): Massive Online Analytics for the Internet of Things (IoT)

You can download the slides of this talk.

Abstract: Big Data and the Internet of Things (IoT) have the potential to fundamentally shift the way we interact with our surroundings. The challenge of deriving insights from the IoT has been recognized as one of the most exciting and key opportunities for both academia and industry. Advanced analysis of big data streams from sensors and devices is bound to become a key area of data mining research as the number of applications requiring such processing increases. Dealing with the evolution over time of such data streams, i.e., with concepts that drift or change completely, is one of the core issues in stream mining. In this talk, I will present an overview of data stream mining, and I will introduce some popular open source tools for data stream mining.
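
To make the concept-drift issue concrete, here is a toy detector (an illustrative sketch, not one of the open-source tools discussed in the talk) that compares the means of two adjacent sliding windows and raises an alarm when they diverge:

```python
from collections import deque

def drift_detector(stream, window=100, threshold=0.25):
    """Flag indices where the mean of the last `window` items diverges from
    the mean of the `window` items before them by more than `threshold`."""
    recent = deque(maxlen=window)
    older = deque(maxlen=window)
    drifts = []
    for i, x in enumerate(stream):
        if len(recent) == window:
            older.append(recent[0])  # keep the element about to be evicted
        recent.append(x)
        if len(older) == window:
            gap = abs(sum(recent) / window - sum(older) / window)
            if gap > threshold:
                drifts.append(i)
                recent.clear()  # restart both windows after an alarm
                older.clear()
    return drifts
```

Production stream miners use adaptive windows and statistical tests rather than a fixed threshold, but the principle, comparing recent data against older data, is the same.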

Talk 2: François Roueff (Télécom ParisTech): Prediction of weakly locally stationary processes by auto-regression

You can download the slides of this talk.

Abstract: We introduce locally stationary time series through the local approximation of the non-stationary covariance structure by a stationary one. This allows us to define autoregression coefficients in a non-stationary context, which, in the particular case of a locally stationary Time Varying Autoregressive (TVAR) process, coincide with the generating coefficients. We provide and study an estimator of the time varying autoregression coefficients in a general setting. The proposed estimator of these coefficients enjoys an optimal minimax convergence rate under limited smoothness conditions. In a second step, using a bias reduction technique, we derive a minimax-rate estimator for arbitrarily smooth time-evolving coefficients, which outperforms the previous one for large data sets. In turn, for TVAR processes, the predictor derived from the estimator exhibits an optimal minimax prediction rate.
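
The local-stationarity idea can be illustrated with a toy estimator: fit the AR coefficient by least squares on a short window around time t, so that the estimate tracks slow variation. This hypothetical sketch (AR(1) only, window chosen by hand) is far simpler than the minimax estimators of the talk:

```python
import numpy as np

def local_ar1(x, t, window):
    """Least-squares AR(1) coefficient fitted on the window x[t-window .. t]."""
    lo = max(1, t - window)
    past, cur = x[lo - 1:t], x[lo:t + 1]  # pairs (x[s-1], x[s]) in the window
    return float(past @ cur / (past @ past))
```

The window length plays the role of the smoothness conditions in the talk: a short window tracks fast-varying coefficients but is noisy, a long one is stable but biased.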


June 22, 2017

The seminar takes place from 2PM to 4PM in Amphi Saphir, and consists of the two following talks:

Talk 1: Pascal Bianchi (Télécom ParisTech): Distributed optimization on graphs using operator splitting methods

You can download the slides of this talk.

Abstract: Consider a network of N agents (computing units) having private objective functions and seeking to find a minimum of the aggregate objective. The aim is to design iterative algorithms where, at each iteration, an agent updates a local estimate of the minimizer based on the sole knowledge of its private function and the information received from its neighbors. In this talk, I will first provide an overview of standard distributed optimization methods. Then, I will explain how recent and generic results in stochastic optimization can be used to design asynchronous and adaptive distributed optimization algorithms.
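
A generic instance of such schemes is decentralized gradient descent: each agent averages its estimate with its neighbours (via a doubly stochastic mixing matrix W matching the graph) and then takes a local gradient step. A minimal synchronous sketch, not the specific operator-splitting methods of the talk:

```python
import numpy as np

def decentralized_gd(grads, W, x0, steps=300, lr=0.05):
    """Synchronous decentralized gradient descent on a graph.
    grads: one gradient function per agent; W: doubly stochastic mixing
    matrix whose sparsity pattern matches the communication graph."""
    X = np.array([x0 for _ in grads], dtype=float)  # one local estimate each
    for _ in range(steps):
        X = W @ X                                   # gossip with neighbours
        X = X - lr * np.array([g(x) for g, x in zip(grads, X)])  # local step
    return X
```

With a constant step size the agents only agree up to an O(lr) bias around the aggregate minimizer; removing this bias, and handling asynchrony, is where the stochastic optimization results mentioned in the talk come in.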

Talk 2: Maximilien Danisch (UPMC): Towards real-world graph algorithmics

You can download the slides of this talk.

Abstract: Real-world graphs (a.k.a. complex networks) are ubiquitous: the web, Facebook, brain networks, protein interaction networks, etc. Studying these graphs and solving computational problems on them (say maximum clique, partitioning or dense subgraph) has applications in many fields. I will first show that the structure of some real-world graphs is special. In particular, they are not adversarial, and some difficult problems (say NP-hard problems) can be solved on some huge real-world graphs (say 2 billion edges or more). I will then present two works along the lines of "understanding and leveraging the structure of real-world graphs in order to build better algorithms": (i) enumerating all k-cliques and (ii) computing the density-friendly graph decomposition. The latter was published at WWW 2017.
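
For the first problem, the classical trick behind k-clique enumeration is to extend partial cliques only with higher-ordered common neighbours, so that each clique is found exactly once. A minimal sketch (illustrative only; practical implementations typically reorder nodes, e.g. by degeneracy, to exploit the sparsity of real-world graphs):

```python
def k_cliques(adj, k):
    """List all k-cliques of an undirected graph given as adj: {node: set}.
    Partial cliques are extended only with higher-numbered common
    neighbours, so each clique is emitted exactly once."""
    out = []

    def extend(clique, candidates):
        if len(clique) == k:
            out.append(tuple(clique))
            return
        for v in sorted(candidates):
            # Candidates for the next level: common neighbours after v.
            extend(clique + [v], {u for u in candidates & adj[v] if u > v})

    extend([], set(adj))
    return out
```

With k = 2 this simply lists each edge once; larger k reuses the same pruned search tree.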