A Transformation Based Framework for KNN Set Similarity Search in Java

Abstract:

Set similarity search is a fundamental operation in a variety of applications. While many previous studies focus on threshold-based set similarity search and join, little effort has been devoted to KNN set similarity search. In this paper, we propose a transformation-based framework to solve the problem of KNN set similarity search, which, given a collection of set records and a query set, returns the k records with the largest similarity to the query. We devise an effective transformation mechanism that converts sets of various lengths into fixed-length vectors such that similar sets are mapped close to each other. We then index these vectors with a tiny tree structure. Next, we propose efficient search algorithms and pruning strategies to perform exact KNN set similarity search. We also design an estimation technique that leverages the data distribution to support approximate KNN search, which can speed up the search while retaining high recall. Experimental results on real-world datasets show that our framework significantly outperforms state-of-the-art methods in both memory-based and disk-based settings.
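To make the problem statement concrete, the following is a minimal Java sketch of exact KNN set similarity search with a vector-based pruning step. It is an illustration under stated assumptions, not the framework's actual implementation: Jaccard is assumed as the similarity function, tokens are assumed to be integers, the grouping of tokens by `token mod GROUPS` is a hypothetical stand-in for the transformation mechanism, and a linear scan replaces the tree index. All class, method, and parameter names are invented for this example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;

class KnnSetSearchSketch {
    // Hypothetical parameter: number of token groups, i.e. the fixed vector length.
    static final int GROUPS = 4;

    /** Jaccard similarity: size of the intersection divided by size of the union. */
    static double jaccard(Set<Integer> a, Set<Integer> b) {
        if (a.isEmpty() && b.isEmpty()) return 1.0;
        int overlap = 0;
        for (int t : a) if (b.contains(t)) overlap++;
        return (double) overlap / (a.size() + b.size() - overlap);
    }

    /** Maps a set of any size to a vector of length GROUPS holding per-group token counts. */
    static int[] transform(Set<Integer> s) {
        int[] v = new int[GROUPS];
        for (int t : s) v[Math.floorMod(t, GROUPS)]++; // assumed grouping rule
        return v;
    }

    /**
     * Upper bound on Jaccard computed from the vectors alone. Two sets can share
     * at most min(x[i], y[i]) tokens in group i, so the true overlap never
     * exceeds the sum of these minima, and Jaccard grows with the overlap.
     */
    static double upperBound(int[] x, int[] y) {
        int minSum = 0, sizeX = 0, sizeY = 0;
        for (int i = 0; i < GROUPS; i++) {
            minSum += Math.min(x[i], y[i]);
            sizeX += x[i];
            sizeY += y[i];
        }
        int union = sizeX + sizeY - minSum;
        return union == 0 ? 1.0 : (double) minSum / union;
    }

    /** Returns the indices of the k most similar records, most similar first. */
    static List<Integer> knn(List<Set<Integer>> records, Set<Integer> query, int k) {
        List<Integer> result = new ArrayList<>();
        if (k <= 0) return result;
        int[] q = transform(query);
        // Min-heap of {similarity, record index}; the head is the current k-th best.
        PriorityQueue<double[]> heap =
            new PriorityQueue<>(Comparator.comparingDouble((double[] e) -> e[0]));
        for (int i = 0; i < records.size(); i++) {
            Set<Integer> r = records.get(i);
            // Pruning: skip the exact computation when the bound cannot beat the k-th best.
            // A real index would precompute the vectors instead of transforming here.
            if (heap.size() == k && upperBound(transform(r), q) <= heap.peek()[0]) continue;
            double sim = jaccard(r, query);
            if (heap.size() < k) {
                heap.add(new double[] {sim, i});
            } else if (sim > heap.peek()[0]) {
                heap.poll();
                heap.add(new double[] {sim, i});
            }
        }
        while (!heap.isEmpty()) result.add(0, (int) heap.poll()[1]);
        return result;
    }
}
```

Calling `KnnSetSearchSketch.knn(records, query, k)` on a list of sets returns record indices ordered from most to least similar. The design point the sketch tries to convey is that the fixed-length vectors allow a cheap similarity upper bound, so records that cannot enter the top k are discarded before the exact similarity is computed. How the grouping is chosen and how the vectors are organized in a tree are exactly the parts this sketch leaves out.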