We study an indexing architecture for storing and searching a database of high-dimensional vectors from the perspective of statistical signal processing and decision theory. The architecture is composed of several memory units, each of which summarizes a fraction of the database by a single representative vector. The potential similarity of the query to one of the vectors stored in a memory unit is gauged by a simple correlation with the unit's representative vector. This representative optimizes the following hypothesis test: the query is independent of every vector in the memory unit vs. the query is a simple perturbation of one of the stored vectors.
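As a concrete illustration, the scheme can be sketched as follows. This is a minimal toy version under our own assumptions, not the paper's exact construction: the representative here is simply the sum of a unit's vectors (one simple choice), and the names `search` and `n_probe` are ours. Each unit is gauged with a single dot product, and exact search then runs only inside the highest-scoring units:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_units, unit_size = 256, 50, 10

# Each memory unit holds unit_size database vectors and is summarized
# by one representative vector (here simply the sum of its members).
units = [rng.standard_normal((unit_size, d)) for _ in range(n_units)]
units = [u / np.linalg.norm(u, axis=1, keepdims=True) for u in units]
reps = np.stack([u.sum(axis=0) for u in units])

def search(query, n_probe=5):
    """Gauge each unit with one correlation against its representative,
    then run exact search only inside the n_probe highest-scoring units."""
    scores = reps @ query                        # one dot product per unit
    probed = np.argsort(-scores)[:n_probe]
    cands = [(i, j, float(units[i][j] @ query))
             for i in probed for j in range(unit_size)]
    return max(cands, key=lambda c: c[2])

# Hypothesis H1: the query is a small perturbation of a stored vector.
query = units[3][7] + 0.02 * rng.standard_normal(d)
query /= np.linalg.norm(query)
unit_id, vec_id, sim = search(query)
```

Only `n_probe` out of `n_units` memory units are scanned exhaustively, which is where the complexity reduction over brute-force search comes from.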
Compared to exhaustive search, our approach finds the most similar database vectors significantly faster without a noticeable reduction in search quality. Interestingly, the complexity reduction is provably greater in high-dimensional spaces. We empirically demonstrate the practical interest of our approach in a large-scale image search scenario with off-the-shelf state-of-the-art descriptors.
Some strategies have been proposed to (partly) overcome this problem. For instance, the vector approximation file first relies on exhaustive search with approximate measurements and then computes the exact similarities only for a subset of vectors deemed of interest. The cosine sketch approximates cosine similarity with the much faster Hamming distance. Other works like spectral hashing, Euclidean sketches, product quantization and the inverted multi-index also rely on compact codes to speed up neighbor search while compressing the data.
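To make the cosine-sketch idea concrete, here is a minimal sketch of one common construction (random-hyperplane sign bits; the helper names `sketch` and `hamming` are ours): two vectors' sketches disagree on a fraction of bits proportional to the angle between them, so the cheap Hamming distance between bit strings tracks cosine similarity.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_bits = 128, 256

# Random hyperplanes: a vector is sketched as the sign pattern of
# n_bits random projections (one common cosine-sketch construction).
planes = rng.standard_normal((n_bits, d))

def sketch(x):
    return planes @ x > 0                 # n_bits booleans

def hamming(a, b):
    return int(np.count_nonzero(a != b))

# The expected Hamming distance is n_bits * angle(x, y) / pi: similar
# vectors get close sketches, unrelated ones differ on ~half the bits.
x = rng.standard_normal(d)
y = x + 0.3 * rng.standard_normal(d)      # small perturbation of x
z = rng.standard_normal(d)                # unrelated vector
h_similar = hamming(sketch(x), sketch(y))
h_unrelated = hamming(sketch(x), sketch(z))
```

In practice the boolean sketches would be packed into machine words so that the Hamming distance reduces to XOR and popcount instructions.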
An interesting strategy is the Set Compression Tree, which uses a structure similar to a k-d tree to compress a set of vectors into an extremely compact representation. Again, this method is dedicated to low-dimensional vectors (its authors recommend that the dimension be smaller than log2(N), where N is the size of the database), so it must be combined with a drastic dimensionality reduction via PCA to handle classical computer vision descriptors.
Our approach is similar to the descriptor pooling problem in computer vision, but operates at a higher level. Many successful descriptors, such as BOV, VLAD, FV, and EMK, encode and aggregate a set of local features into a global representation. Yet, the new representation has a larger dimension than the local features. We adopt a similar strategy one level up: instead of aggregating local features into one global image representation, we aggregate global representations into group representations, so-called memory vectors. These have no semantic meaning; their sole purpose is to enable efficient search. Another difference is that we keep the same ambient dimension.
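A toy illustration of aggregating global descriptors into a memory vector of the same ambient dimension (a sketch under our own assumptions, not the paper's exact recipe): plain summation is the simplest construction, and a minimum-norm least-squares variant, shown here purely as an illustration, makes every stored descriptor correlate to exactly one with the memory vector.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 512, 8                      # n global descriptors of dimension d

X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)

# Simplest construction: the memory vector is the sum of the group's
# descriptors; note the ambient dimension d is unchanged.
m_sum = X.sum(axis=0)

# An alternative construction (our illustration): the minimum-norm
# solution of X m = 1, so each stored descriptor scores exactly 1
# against the memory vector while interference is minimized.
m_pinv = np.linalg.pinv(X) @ np.ones(n)

scores = X @ m_pinv                # one correlation per stored descriptor
```

Either way, the group representation lives in the same d-dimensional space as the descriptors themselves, unlike BOV- or FV-style pooling, which inflates the dimension.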