On Generating Dataset Descriptions.

A critical first step in any analytic task is to understand the kind of data to which the analytics will be applied. Unfortunately, for modern large and heterogeneous datasets, this understanding is not always provided explicitly. Instead, the analyst has to spend a significant amount of time exploring the data to obtain it. In this work, we address the problem of generating informative descriptions that provide such an understanding as effectively as possible. We aim for descriptions that are complete, i.e., that cover as much of the data as possible, and that avoid redundancy, i.e., repetitions that may overwhelm the analyst. We present several algorithms for achieving this and study their properties through an extensive set of experimental evaluations.

Datasets & Prototypes

Data Generator
Source code (MD5: 3174a4b1a918dce16794ae94f83e4c00).
A data generator we developed that provides fine control over both the structure of the dataset and the distribution of the patterns in its data.

Description Miner
Source code (MD5: 5e4ac6e52ad108262910edc89a96148b).
Prototype for both the Naive and Vertical solutions.

Archive (MD5: 7f9922a29a0b133bf705d064ffb1735d).
Archive with two CSV files containing the seeds and parameters to regenerate the datasets.
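
Verifying a download
The MD5 values listed above can be used to check that a downloaded file arrived intact. The following is a minimal sketch using only the Python standard library; the helper names md5_of_file and matches_md5 and the file name in the usage note are illustrative assumptions, not part of the released code.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`.

    The file is read in chunks so that large archives do not have to
    fit in memory. `chunk_size` is an arbitrary choice (1 MiB here).
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, expected_hex):
    """Compare the file's digest with a published MD5 string."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

For example, assuming the archive was saved locally as datasets.zip (a hypothetical name), matches_md5("datasets.zip", "7f9922a29a0b133bf705d064ffb1735d") should return True when the file matches the published checksum. MD5 is adequate here for detecting accidental corruption; it is not a defense against deliberate tampering.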