Modeling and Computing Probabilistic Skyline on Incomplete Data in Java

Abstract:

The skyline query is an important operation in the database community. In recent years, research on incomplete data has received increasing attention, especially for skyline queries. However, the existing skyline definition on incomplete data cannot provide users with valuable references. In this paper, we propose a novel skyline definition that applies a probabilistic model to incomplete data, in which each point has a probability of being in the skyline. In particular, the query returns the K points with the highest skyline probabilities. In addition, we propose incomplete-data models and estimate the probability density functions of missing values under independent, correlated, and anti-correlated distributions, respectively. Computing the probabilistic skyline on incomplete data is a significant challenge. We propose three efficient algorithms, SPISkyline, SPCSkyline, and SPASkyline, for probabilistic skyline computation on incomplete data that follows independent, correlated, and anti-correlated distributions, respectively. They employ a pruning strategy, an optimized probability computation, and a sorting technique to improve the efficiency of probabilistic skyline computation on incomplete data. Our experimental results demonstrate that the proposed probabilistic skyline is an effective way to handle skyline queries on incomplete data, and that our algorithms are tens of times faster than the naive algorithm on both synthetic and real datasets.
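To make the query semantics concrete, the following is a minimal Java sketch of a naive baseline, not the SPISkyline, SPCSkyline, or SPASkyline algorithms. It rests on several assumptions that are not taken from the paper: smaller values are preferred in every dimension, a missing value is encoded as Double.NaN, each missing value is drawn independently from a uniform distribution on [0, 1) (a stand-in for the estimated probability density functions), and skyline probabilities are estimated by Monte Carlo sampling. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

// Hypothetical illustration: naive Monte Carlo estimate of skyline probabilities.
final class ProbabilisticSkylineSketch {

    // a dominates b if a is no worse in every dimension and strictly better in
    // at least one (assumption: smaller values are better).
    static boolean dominates(double[] a, double[] b) {
        boolean strictlyBetter = false;
        for (int i = 0; i < a.length; i++) {
            if (a[i] > b[i]) return false;
            if (a[i] < b[i]) strictlyBetter = true;
        }
        return strictlyBetter;
    }

    // Estimates, for each point, the fraction of sampled completions of the
    // dataset in which that point is not dominated by any other point.
    static double[] skylineProbabilities(double[][] data, int samples, long seed) {
        Random rng = new Random(seed);
        int n = data.length;
        int[] hits = new int[n];
        double[][] world = new double[n][];
        for (int s = 0; s < samples; s++) {
            // Fill every missing (NaN) value with an independent draw.
            // Assumption: uniform on [0, 1); a real model would plug in an estimated density.
            for (int i = 0; i < n; i++) {
                world[i] = data[i].clone();
                for (int j = 0; j < world[i].length; j++) {
                    if (Double.isNaN(world[i][j])) world[i][j] = rng.nextDouble();
                }
            }
            // Conventional skyline test on the completed dataset.
            for (int i = 0; i < n; i++) {
                boolean dominated = false;
                for (int k = 0; k < n && !dominated; k++) {
                    if (k != i && dominates(world[k], world[i])) dominated = true;
                }
                if (!dominated) hits[i]++;
            }
        }
        double[] prob = new double[n];
        for (int i = 0; i < n; i++) prob[i] = (double) hits[i] / samples;
        return prob;
    }

    // Returns the indices of the k points with the highest estimated probability;
    // ties keep their original index order because List.sort is stable.
    static List<Integer> topK(double[] prob, int k) {
        List<Integer> idx = new ArrayList<>();
        for (int i = 0; i < prob.length; i++) idx.add(i);
        idx.sort(Comparator.comparingDouble((Integer i) -> -prob[i]));
        return new ArrayList<>(idx.subList(0, Math.min(k, idx.size())));
    }
}
```

For example, calling skylineProbabilities on the three points (0.2, 0.2), (0.6, 0.6), and (missing, 0.1) and then topK with K = 2 ranks the incomplete point first: under the stated assumptions no other point can dominate it, while (0.6, 0.6) is always dominated by (0.2, 0.2). Each sample costs a quadratic number of dominance checks, which is the kind of overhead that pruning and sorting are meant to reduce.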