On Describing the Contents of a Dataset.

Companies and organizations collect huge amounts of data from almost every aspect of their activities, often without knowing in advance how to actually exploit them. A significant first step before the collected datasets can be leveraged by data analytics is to understand the information they contain. This task is known as data exploration. In this work, we study the problem of generating informative descriptions of the data contained in a dataset. We aim at descriptions that are complete, i.e., cover as much of the data as possible, and avoid redundancy, i.e., repeating the same kind of information multiple times. We present different algorithms for achieving this and study their properties through an extensive set of experimental evaluations.

Source code & Datasets

Description Miner
Source code (MD5: 89209c14eda6bcf696db1171ca042511).
Prototype that implements the three proposed solutions: Naive, Vertical, and Adaptive.

Datasets
Archive (MD5: 7efbff244e74225f04626098c4a1764f).
Archive containing the datasets used for testing the proposed solutions.
The datasets are divided into three folders:

[R] containing real-world data.
[V] containing synthetic data of high variety (heterogeneity); used to study the behaviour of domain pruning and of the three algorithms.
[S] containing synthetic data generated to evaluate the scalability of the techniques.

Data Generator
Source code (MD5: 56e94ca888232c5199dbc2b6a76b83bc).
A data generator we developed that provides fine control over both the structure of the dataset and the distribution of the patterns in its data.
This generator was used to produce the data in [V] and [S].
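
Verifying the downloads
The MD5 digests listed above can be used to check that a download arrived intact. The following is a minimal sketch in Python, not part of the released code. The three file names are hypothetical placeholders, since this page does not state the actual archive names, and should be replaced with the names of the files as saved locally.

```python
import hashlib
from pathlib import Path

# Digests copied from this page. The file names are hypothetical
# placeholders (assumption): replace them with the local file names.
EXPECTED_MD5 = {
    "description-miner-src.zip": "89209c14eda6bcf696db1171ca042511",
    "datasets.zip": "7efbff244e74225f04626098c4a1764f",
    "data-generator-src.zip": "56e94ca888232c5199dbc2b6a76b83bc",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives are not loaded at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected):
    """Return True if the file's MD5 digest matches the expected one."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    for name, expected in EXPECTED_MD5.items():
        if not Path(name).exists():
            print(f"{name}: not found, skipping")
            continue
        status = "OK" if verify(name, expected) else "MISMATCH"
        print(f"{name}: {status}")
```

A matching digest only indicates that the file was not corrupted in transit; MD5 is not suitable as a guarantee against deliberate tampering.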