Laboratoire d'Informatique de Grenoble (LIG)
Towards Interactive User Data Analytics
User data can be acquired from various domains. This data is characterized
by a combination of demographics such as age and occupation and user actions
such as rating a movie, reviewing a restaurant or buying groceries. User data
is appealing to analysts in their role as data scientists who seek to conduct
large-scale population studies and gain insights into various population segments.
It is also appealing to novice users in their role as information consumers who
use the social Web for routine tasks such as finding a book club or choosing a
restaurant. User data exploration has been formulated as identifying group-level behavior
such as "Asian women who publish regularly in databases". Group-level exploration
enables new findings and addresses issues raised by the peculiarities of user
data such as noise and sparsity. I will review our work on one-shot and interactive
user data exploration. I will then describe the challenges of developing a
visual analytics tool for finding and connecting users and groups.
Sihem Amer-Yahia is a CNRS Research Director in Grenoble where she leads
the SLIDE team. Her interests are at the intersection of large-scale
data management and data analytics. Before joining CNRS, she was
Principal Scientist at QCRI, Senior Scientist at Yahoo! Research and
Member of Technical Staff at AT&T Labs. Sihem served on the SIGMOD
Executive Board, the VLDB Endowment, and the EDBT Board. She is
Editor-in-Chief of the VLDB Journal. Sihem is PC co-chair for
VLDB 2018. Sihem received her Ph.D. in CS from Paris-Orsay
and INRIA in 1999, and her Diplôme d'Ingénieur from INI, Algeria.
University of Oxford, CTO Wrapidity Ltd
Wrapping Millions of Documents Per Day ‐ and How that's Just the Beginning
Companies and researchers have been painfully maintaining manually created
programs for wrapping (aka scraping) web sites for decades ‐ employing hundreds
of engineers to wrap thousands or even tens of thousands of sources. Where wrappers
break, engineers must scramble to fix the wrappers manually or face the ire of their
users. In DIADEM we demonstrated the effectiveness of hybrid AIs for automatically
generating wrapper programs in an academic setting. Our hybrid approach combined
knowledge-based rule systems with large-scale analytics founded in techniques from
the NLP and ML communities. Recently, we commercialized that technology into Wrapidity
Ltd. and quickly joined up with Meltwater, the leading global media-intelligence company.
At Meltwater, we are applying Wrapidity's techniques and insight to the wrapping of tens
of thousands of news sources, processing over 10M unique documents per day. But that's
just the beginning... Together with Meltwater engineers and researchers from San Francisco,
Stockholm, Bangalore, and Budapest, we are on the way to making the collection and analysis
of "outside" data a breeze through fairhair.ai, a platform for accessing and enriching the
vast content stored and collected by Meltwater every day.
Tim Furche is co-founder and CTO of Wrapidity and Lecturer at the Department of Computer
Science of the University of Oxford. Tim has been extracting data from the web for over a decade, starting
from one of the most-cited works on XPath during his undergraduate studies. When not extracting
data, he loves working on large-scale data management and query languages. He has managed a number
of large-scale research grants (>$12M in total) and is currently a co-investigator in the
$5M EPSRC Programme Grant VADA.
Joint Keynote with
Politecnico di Torino
Opening the Black Box: Deriving Rules from Data
A huge amount of data is currently being made available for exploration and analysis in many application domains.
Patterns and models are extracted from data to describe their characteristics and predict variable values.
Unfortunately, many high-quality models are hard to interpret.
Rules mined from data may provide easily interpretable knowledge, both for exploration and classification
(or prediction) purposes. In this talk I will introduce different types of rules
(e.g., several variations on association rules, classification rules) and will discuss their capability
of describing phenomena and highlighting interesting correlations in data.
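As a minimal illustration of the kind of rule mentioned above, the following sketch computes the support and confidence of a single association rule over a toy set of transactions. The transactions, item names, and the function `rule_metrics` are hypothetical and are not taken from the talk.

```python
from typing import FrozenSet, List, Tuple


def rule_metrics(
    transactions: List[FrozenSet[str]],
    antecedent: FrozenSet[str],
    consequent: FrozenSet[str],
) -> Tuple[float, float]:
    """Return (support, confidence) of the rule antecedent -> consequent."""
    n = len(transactions)
    both = antecedent | consequent
    # Transactions containing every item of the antecedent.
    count_antecedent = sum(1 for t in transactions if antecedent <= t)
    # Transactions containing every item of the whole rule.
    count_both = sum(1 for t in transactions if both <= t)
    support = count_both / n if n else 0.0
    confidence = count_both / count_antecedent if count_antecedent else 0.0
    return support, confidence


# Hypothetical market-basket data, for illustration only.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

support, confidence = rule_metrics(
    transactions, frozenset({"bread"}), frozenset({"milk"})
)
print(f"support={support:.2f} confidence={confidence:.2f}")
```

Here the rule {bread} -> {milk} holds in two of four transactions, and in two of the three transactions that contain bread, which is what makes such a rule easy to read and to check by hand.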
Elena Baralis is head of the Database and Data Mining Group and has
been full professor in the Control and Computer Engineering Department
of the Politecnico di Torino since January 2005. She received her M.S.
in Electrical Engineering, and her Ph.D. in Computer and Systems
Engineering from the Politecnico di Torino. Her current research
interests are in the field of database systems and data mining. More
specifically, her activity focuses on algorithms for diverse data mining
tasks on Big Data and on the different domains in which data mining
techniques find their application. She has published over 100 papers in
international peer-reviewed journals and conference proceedings. She is
involved in several national and European research projects focused on
her research topics.
Programming Models and Tools for Distributed Graph Processing
Graphs capture relationships between data items, such as interactions or dependencies, and their analysis can reveal valuable
insights for machine learning tasks, anomaly detection, clustering, recommendations, social influence analysis, bioinformatics,
and other application domains. This tutorial reviews the state of the art in high-level abstractions for distributed graph processing. First, we present six models that were developed specifically for distributed graph processing, namely vertex-centric, scatter-gather, gather-sum-apply-scatter, subgraph-centric, filter-process, and graph traversals. Then, we consider general-purpose distributed programming models that have been used for graph analysis, such as MapReduce, dataflow, linear algebra primitives, Datalog, and shared partitioned tables.
The tutorial aims at making a qualitative comparison of popular graph programming abstractions. We further consider performance
limitations of some graph programming models and we summarize proposed extensions and optimizations.
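To give a flavor of the vertex-centric model named above, here is a minimal single-machine sketch that simulates synchronous supersteps to label connected components. It is an illustration only, not code from the tutorial or from any particular system; the function `connected_components` and the toy graph are hypothetical, and the sketch assumes an undirected graph in which every vertex appears as a key of the adjacency map.

```python
from typing import Dict, List, Set


def connected_components(adjacency: Dict[int, List[int]]) -> Dict[int, int]:
    """Label each vertex with the smallest vertex id in its component."""
    # Every vertex starts with its own id as its label.
    labels = {v: v for v in adjacency}
    # In the first superstep all vertices are active.
    active: Set[int] = set(adjacency)
    while active:
        # Message passing: active vertices send their label to neighbours.
        inbox: Dict[int, List[int]] = {}
        for v in active:
            for u in adjacency[v]:
                inbox.setdefault(u, []).append(labels[v])
        # Compute: a vertex adopts the smallest label it received and stays
        # active only if its own label changed.
        active = set()
        for u, messages in inbox.items():
            smallest = min(messages)
            if smallest < labels[u]:
                labels[u] = smallest
                active.add(u)
    return labels


# Hypothetical toy graph with two components: {1, 2, 3} and {4, 5}.
graph = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
components = connected_components(graph)
print(components)
```

The loop ends when no vertex changes its label in a superstep, which mirrors the "vote to halt" convention commonly used in vertex-centric systems.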
Vasiliki Kalavri is a postdoctoral researcher at the ETH Zurich
Systems group, where she is working on distributed data processing,
data center performance, and graph streaming algorithms.
She is a PMC member of Apache Flink and a core developer of its graph
processing API, Gelly. Vasiliki has a PhD in Distributed Computing
from KTH, Stockholm and UCLouvain, Belgium,
and she has previously interned at Telefonica Research and data Artisans.
Declarative Approaches to Data Quality Assessment and Cleaning
Data quality is an increasingly important problem and concern in business intelligence. Data quality in general,
and data cleaning and quality assessment in particular, are relative properties and activities
that largely depend on additional semantic information, data, and metadata. This extra information and its use
for data quality purposes can be specified in declarative terms. In this tutorial we review and discuss some
forms of declarative semantic conditions that have their origin in integrity constraints, quality constraints,
matching dependencies, and ontological contexts. They can all be used to characterize, assess, and obtain
quality data from possibly dirty data sources. In this tutorial the emphasis is on declarative approaches to
inconsistency, certain forms of incompleteness, and duplicate data. The latter give rise to the entity resolution
problem, for which we show declarative approaches that can be naturally combined with machine learning methods.
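As one small example of a declaratively specified condition, the sketch below checks a functional dependency, a classical integrity constraint, against a toy table and reports the groups of rows that violate it. The function `fd_violations`, the attribute names, and the data are hypothetical and are not taken from the tutorial.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def fd_violations(
    rows: List[Dict[str, str]], lhs: List[str], rhs: List[str]
) -> Dict[Tuple[str, ...], Set[Tuple[str, ...]]]:
    """Return the lhs values that map to more than one rhs value."""
    seen: Dict[Tuple[str, ...], Set[Tuple[str, ...]]] = defaultdict(set)
    for row in rows:
        key = tuple(row[a] for a in lhs)
        seen[key].add(tuple(row[a] for a in rhs))
    # The dependency lhs -> rhs is violated wherever one key has several values.
    return {k: v for k, v in seen.items() if len(v) > 1}


# Hypothetical dirty data: the same zip code appears with two city spellings.
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "NYC"},
    {"zip": "94105", "city": "San Francisco"},
]

violations = fd_violations(rows, ["zip"], ["city"])
print(violations)
```

The constraint itself is only the pair of attribute lists (`zip` determines `city`); the checking code is generic, which is the sense in which the condition is stated declaratively.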
Leopoldo Bertossi has been Full Professor at the School of Computer Science, Carleton University (Ottawa, Canada) since 2001.
He has been a Faculty Fellow of the IBM Center for Advanced Studies (IBM Toronto Lab), and the theme leader for "Data Quality
and Data Cleaning" of the "NSERC Strategic Network for Data Management for Business Intelligence" (BIN, 2009-2014).
Until 2001 he was professor and departmental chair (1993-1995) at the Department of Computer Science, PUC-Chile; and also the
President of the Chilean Computer Science Society (SCCC) in 1996 and 1999-2000. He obtained a PhD in Mathematics from the Pontifical
Catholic University of Chile (PUC) in 1988. Prof. Bertossi's research interests include database theory, business intelligence, data quality, data integration, semantic web,
intelligent information systems, knowledge representation, logic programming, computational logic, probabilistic reasoning,
and machine learning.