31st British International Conference on Databases

10-12 July 2017, London, UK

Keynote Speakers

Sihem Amer-Yahia
Laboratoire d'Informatique de Grenoble (LIG)
Towards Interactive User Data Analytics

User data can be acquired from various domains. This data is characterized by a combination of demographics such as age and occupation and user actions such as rating a movie, reviewing a restaurant or buying groceries. User data is appealing to analysts in their role as data scientists who seek to conduct large-scale population studies, and gain insights on various population segments. It is also appealing to novice users in their role as information consumers who use the social Web for routine tasks such as finding a book club or choosing a restaurant. User data exploration has been formulated as identifying group-level behavior such as "Asian women who publish regularly in databases". Group-level exploration enables new findings and addresses issues raised by the peculiarities of user data such as noise and sparsity. I will review our work on one-shot and interactive user data exploration. I will then describe the challenges of developing a visual analytics tool for finding and connecting users and groups.

Sihem Amer-Yahia is a CNRS Research Director in Grenoble where she leads the SLIDE team. Her interests are at the intersection of large-scale data management and data analytics. Before joining CNRS, she was Principal Scientist at QCRI, Senior Scientist at Yahoo! Research and Member of Technical Staff at AT&T Labs. Sihem served on the SIGMOD Executive Board, the VLDB Endowment, and the EDBT Board. She is Editor-in-Chief of the VLDB Journal. Sihem is PC co-chair for VLDB 2018. Sihem received her Ph.D. in CS from Paris-Orsay and INRIA in 1999, and her Diplôme d'Ingénieur from INI, Algeria.

Tim Furche
University of Oxford, CTO Wrapidity Ltd
Wrapping Millions of Documents Per Day ‐ and How that's Just the Beginning

Companies and researchers have been painfully maintaining manually created programs for wrapping (aka scraping) web sites for decades ‐ employing hundreds of engineers to wrap thousands or even tens of thousands of sources. Where wrappers break, engineers must scramble to fix the wrappers manually or face the ire of their users. In DIADEM we demonstrated the effectiveness of hybrid AIs for automatically generating wrapper programs in an academic setting. Our hybrid approach combined knowledge-based rule systems with large-scale analytics founded in techniques from the NLP and ML communities. Recently, we commercialized that technology into Wrapidity Ltd. and quickly joined up with Meltwater, the leading global media-intelligence company. At Meltwater, we are applying Wrapidity's techniques and insight to the wrapping of tens of thousands of news sources, processing over 10M unique documents per day. But that's just the beginning... Together with Meltwater engineers and researchers from San Francisco, Stockholm, Bangalore, and Budapest, we are on the way to making the collection and analysis of "outside" data a breeze through fairhair.ai, a platform for accessing and enriching the vast content stored and collected by Meltwater every day.

Tim Furche is co-founder and CTO of Wrapidity and Lecturer at the Department of Computer Science of the University of Oxford. Tim has been extracting data from the web for over a decade, starting from one of the most-cited works on XPath during his undergraduate studies. When not extracting data, he loves working on large-scale data management and query languages. He has managed a number of large-scale research grants (>$12M in total) and is currently a co-investigator in the $5M EPSRC Programme Grant VADA.

Joint Keynote with RuleML+RR 2017
Elena Baralis
Politecnico di Torino
Opening the Black Box: Deriving Rules from Data

A huge amount of data is currently being made available for exploration and analysis in many application domains. Patterns and models are extracted from data to describe their characteristics and predict variable values. Unfortunately, many high-quality models are hard to interpret. Rules mined from data may provide easily interpretable knowledge, both for exploration and classification (or prediction) purposes. In this talk I will introduce different types of rules (e.g., several variations on association rules, classification rules) and will discuss their capability of describing phenomena and highlighting interesting correlations in data.

Elena Baralis is head of the Database and Data Mining Group and has been full professor in the Control and Computer Engineering Department of the Politecnico di Torino since January 2005. She received her M.S. in Electrical Engineering and her Ph.D. in Computer and Systems Engineering from the Politecnico di Torino. Her current research interests are in the field of database systems and data mining. More specifically, her activity focuses on algorithms for diverse data mining tasks on Big Data and on the different domains in which data mining techniques find their application. She has published over 100 papers in international peer-reviewed journals and conference proceedings. She is involved in several national and European research projects focused on her research topics.


Tutorials

Programming Models and Tools for Distributed Graph Processing
Vasiliki Kalavri
ETH Zürich

Graphs capture relationships between data items, such as interactions or dependencies, and their analysis can reveal valuable insights for machine learning tasks, anomaly detection, clustering, recommendations, social influence analysis, bioinformatics, and other application domains. This tutorial reviews the state of the art in high-level abstractions for distributed graph processing. First, we present six models that were developed specifically for distributed graph processing, namely vertex-centric, scatter-gather, gather-sum-apply-scatter, subgraph-centric, filter-process, and graph traversals. Then, we consider general-purpose distributed programming models that have been used for graph analysis, such as MapReduce, dataflow, linear algebra primitives, datalog, and shared partitioned tables. The tutorial aims at making a qualitative comparison of popular graph programming abstractions. We further consider performance limitations of some graph programming models and we summarize proposed extensions and optimizations.

Vasiliki Kalavri is a postdoctoral researcher at the ETH Zurich Systems group, where she is working on distributed data processing, data center performance, and graph streaming algorithms. She is a PMC member of Apache Flink and a core developer of its graph processing API, Gelly. Vasiliki has a PhD in Distributed Computing from KTH, Stockholm and UCLouvain, Belgium, and she has previously interned at Telefonica Research and data Artisans.

Declarative Approaches to Data Quality Assessment and Cleaning
Leopoldo Bertossi
Carleton University

Data quality is an increasingly important problem and concern in business intelligence. Data quality in general, and data cleaning and quality assessment in particular, are relative properties and activities that largely depend on additional semantic information, data, and metadata. This extra information and its use for data quality purposes can be specified in declarative terms. In this tutorial we review and discuss some forms of declarative semantic conditions that have their origin in integrity constraints, quality constraints, matching dependencies, and ontological contexts, among others. They can all be used to characterize, assess, and obtain quality data from possibly dirty data sources. The emphasis is on declarative approaches to inconsistency, certain forms of incompleteness, and duplicate data. The latter give rise to the entity resolution problem, for which we show declarative approaches that can be naturally combined with machine learning methods.

Leopoldo Bertossi has been Full Professor at the School of Computer Science, Carleton University (Ottawa, Canada) since 2001. He has been a Faculty Fellow of the IBM Center for Advanced Studies (IBM Toronto Lab), and the theme leader for "Data Quality and Data Cleaning" of the "NSERC Strategic Network for Data Management for Business Intelligence" (BIN, 2009-2014). Until 2001 he was professor and departmental chair (1993-1995) at the Department of Computer Science, PUC-Chile, and also the President of the Chilean Computer Science Society (SCCC) in 1996 and 1999-2000. He obtained a PhD in Mathematics from the Pontifical Catholic University of Chile (PUC) in 1988. Prof. Bertossi's research interests include database theory, business intelligence, data quality, data integration, semantic web, intelligent information systems, knowledge representation, logic programming, computational logic, probabilistic reasoning, and machine learning.