Data Science Seminar

The LTCI Data Science Seminar is a joint seminar between the DIG team and the S2A team of the LTCI. It features talks related to machine learning and data management.

Attendance is generally open to the public, feel free to contact us if you are interested in coming. Talks are held at Télécom ParisTech, 46 rue Barrault, Paris, France, métro Corvisart.

February 22, 2018

The seminar takes place from 2PM to 4PM (room C48), and consists of the following talks:

Talk 1: Odalric-Ambrym Maillard (Inria Lille, Sequel team): A survey of some recent results in structured bandits

Abstract: I will especially focus on the use of multi-armed bandit theory in the context of linearly correlated distributions, optimization of Lipschitz functions, and the problem of sequential ranking. The talk will be based on recent results obtained by Stefan Magureanu and by Audrey Durand during their PhD.

Talk 2: Pierre Senellart (ENS): Reinforcement learning for intensional data management

Abstract: We will review how techniques from reinforcement learning (bandits, Markov decision processes) naturally occur in a wide variety of tasks related to intensional data management, i.e., management of data whose access is associated to a cost. This covers a range of applications, from Web crawling to crowdsourcing, from query optimization to management of virtual machines. In contrast with traditional reinforcement learning applications, data management scenarios often involve a very large but heavily structured state or action space, requiring adaptations of traditional techniques.

November 16, 2017

The seminar takes place from 2PM to 4PM (room C48), and consists of the following talks:

Talk 1: Luis Galárraga, How to know how much we know

You can download the slides of this talk.

Abstract: Current RDF knowledge bases (KBs) are highly incomplete. Such incompleteness is a serious problem both for data consumers and data producers. Data consumers do not have any guarantee about the completeness of the results of queries run on a KB. This diminishes the practical value of the available data. On the other hand, data producers are blind about the parts of the KB that should be populated. Yet, completeness information management is poorly supported in the Semantic Web: Completeness information for KBs is scarce and no RDF storage engine can nowadays provide completeness guarantees for queries. In this talk we present a vision of a completeness- aware Semantic Web, and explain our ongoing work in this direction, namely on the tasks of automatic generation of completeness annotations, and computation of completeness guarantees for SPARQL queries.

Talk 2: Vincent Audigier (CNAM): Multiple imputation with principal component methods

You can download the slides of this talk.

Abstract: Missing data are common in the domain of statistics. They are a key problem because most statistical methods cannot be applied to incomplete data sets. Multiple imputation is a classical strategy for dealing with this issue. Although many multiple imputation methods have been suggested in the literature, they still have some weaknesses. In particular, it is difficult to impute categorical data, mixed data and data sets with many variables, or a small number of individuals with respect to the number of variables. This is due to overfitting caused by the large number of parameters to be estimated. This work proposes new multiple imputation methods that are based on principal component methods, which were initially used for exploratory analysis and visualisation of continuous, categorical and mixed multidimensional data. The study of principal component methods for imputation, never previously attempted, offers the possibility to deal with many types and sizes of data. This is because the number of estimated parameters is limited due to dimensionality reduction. First, we describe a single imputation method based on factor analysis of mixed data. We study its properties and focus on its ability to handle complex relationships between variables, as well as infrequent categories. Its high prediction quality is highlighted with respect to the state-of-the-art single imputation method based on random forests. Next, a multiple imputation method for categorical data using multiple correspondence analysis (MCA) is proposed. The variability of prediction of missing values is introduced via a non-parametric bootstrap approach. This helps to tackle the combinatorial issues which arise from the large number of categories and variables. We show that multiple imputation using MCA outperforms the best current methods.

October 5, 2017

The seminar takes place from 2PM to 4PM in Amphi Grenat, and consists of the two following talks:

Talk 1: Giovanna Varni, Automated analysis of socio-affective signals in interpersonal interaction

The slides for this talk are not available.

Abstract: The establishment of interpersonal interaction, understood as human-human and human-machine interaction, grounds on the skills to exchange and manage verbal and nonverbal socio-affective signals. Becoming an effective partner in interaction requires, first, the perception and the understanding of these signals and, then, a continuous adaptation of one’s own signals by predicting the future development of the interaction. Endowing systems and artificial agents with the skill to automatically analyze these signals and their dynamics is not a trivial task both offline and on the fly. Still, this is a crucial challenge for improving the effectiveness and naturalness of human-artificial agent interaction in several scenarios e.g. education, manufacturing industries, assistive technology and so on.

In this talk, I report my recent works on Social Signals Processing and Affective Computing with particular reference to laughter, emotional interdependence and personality traits. These works were performed in the framework of European and national French projects.

Talk 2: James Eagan, Reinventing the mind’s bicycle

The slides for this talk are not available.

Abstract: Steve Jobs once described the computer as “a bicycle for the mind.” My work focuses on making this bicycle a better, more expressive tool for humans: by making richer, more expressive interactions, by reinventing how we build software so users can adapt it to their own idiosyncratic needs, and by providing richer tools to interact with and understand data. In this talk, I will present an overview of my work in these three areas.

Bio: James Eagan is an Associate Professor (maître de conférences) in the DIVA group at Télécom ParisTech <> with a research focus on Human-Computer Interaction and Information Visualization. He is also co-chair of the MS Big Data program. Before joining Télécom ParisTech, he was a post-doctoral researcher at LRI (CNRS & Université Paris-Sud). He holds a Ph.D. and M.Sc. in Computer Science from the Georgia Institute of Technology and B.A. in Mathematics/Computer Science from Lawrence University.

September 7, 2017

The seminar takes place from 2PM to 4PM in Amphi Saphir, and consists of the two following talks:

Talk 1: Albert Bifet
Massive Online Analytics for the Internet of Things (IoT)

You can download the slides of this talk.

Abstract: Big Data and the Internet of Things (IoT) have the potential to fundamentally shift the way we interact with our surroundings. The challenge of deriving insights from the Internet of Things (IoT) has been recognized as one of the most exciting and key opportunities for both academia and industry. Advanced analysis of big data streams from sensors and devices is bound to become a key area of data mining research as the number of applications requiring such processing increases. Dealing with the evolution over time of such data streams, i.e., with concepts that drift or change completely, is one of the core issues in stream mining. In this talk, I will present an overview of data stream mining, and I will introduce some popular open source tools for data stream mining.

Talk 2: François Roueff,
Prediction of weakly locally stationary processes by auto-regression

You can download the slides of this talk.

Abstract: We introduce locally stationary time series through the  local approximation of the non-stationary covariance structure by a  stationary one. This allows us to define autoregression coefficients in a  non-stationary context, which, in the particular case of a locally stationary  Time Varying Autoregressive (TVAR) process, coincide with the generating coefficients. We provide and study an estimator of the time varying autoregression coefficients in a general setting. The proposed estimator of  these coefficients enjoys an optimal minimax convergence rate under limited  smoothness conditions. In a second step, using a bias reduction technique, we  derive a minimax-rate estimator for arbitrarily smooth time-evolving  coefficients, which outperforms the previous one for large data sets. In  turn, for TVAR processes, the predictor derived from the estimator  exhibits an optimal minimax prediction rate.


June 22, 2017

The seminar takes place from 2PM to 4PM in Amphi Saphir, and consists of the two following talks:

Talk 1: Pascal Bianchi,
Distributed optimization on graphs using operator splitting methods

You can download the slides of this talk.

Abstract: Consider a network of N agents (computing units) having private objective functions and seeking to find a minimum of the aggregate objective. The aim is to design iterative algorithms where, at a each iteration, an agent updates a local estimate of the minimizer based on the sole knowledge of its private function and the information received from its neighbors. In this talk, i will first provide an overview of standard distributed optimization methods. Then, i will explain how recent and generic results in stochastic optimization can be used in order to design asynchronous and adaptive distributed optimization algorithms.

Talk 2: Maximilien Danisch,
Towards real-world graph algorithmics

You can download the slides of this talk.

Abstract: Real-world graphs (a.k.a. complex networks) are ubiquitous: the web, Facebook, brain networks, protein interaction networks, etc. Studying these graphs and solving computational problems on them (say maximum clique, partitioning or dense subgraph) has applications in many fields. I will first show that the structure of some real-world graphs is special. In particular, they are not adversarial and some difficult problems (say NP-hard problems) can be solved on some huge real-world graphs (say 2G edges or more). I will then present two works along the lines of “understanding and leveraging the structure of real-world graphs in order to build better algorithms”: (i) Enumerating all k-cliques and (ii) Computing the density-friendly graph decomposition. The latter one has been published in WWW2017.