QuEval: Beyond high-dimensional indexing à la carte

Reliable and Reproducible Evaluations of High-Dimensional Index Structures

In case you use material from this website, such as source code, please include the following reference to your list of references: QuEval: Beyond high-dimensional indexing à la carte has been accepted for the final issue of PVLDB 2013 and thus, will be presented at VLDB 2014.

Martin Schäler, Alexander Grebhahn, Reimar Schröter, Sandro Schulze, Veit Köppen, and Gunter Saake. PVLDB, 6(14):1654–1665, 2013. (bibtex)

News

A new QuEval paper has been accepted at RCIS 2014 issuing the challenge of managing the large amount of tailored index implementation variants: Toward variability management to tailor high-dimensional index implementations.

Description

Multimedia data, or high-dimensional data in general, have been subject to research for more than two decades and gain momentum even more in the communication technology age. From a database point of view, the myriads of gigabyte of data pose the problem of managing these data. In this course, query processing is a challenging task due to the high dimensionality of such data. In the past, dozens of index structures for high-dimensional data have been proposed and some of them are even standard-like references. However, it is still some kind of black magic to decide which index structure fits to a certain problem or outweighs other index structures. In particular, the following questions have to be answered to make a decision in favor or against a certain index structure:

This is where QuEval, a framework for quantitative comparison and evaluation of high-dimensional index structures comes into play. QuEval is a Java-based framework that supports comparison of index strucutres regarding certain characteristics such as dimensionality, accuracy, or performance. Currently, the framework contains of six different index structures. However, a main focus of the framework is its extensibility and we encourage people to contribute to QuEval by providing more index structures or other interesting aspects for their comparisons.

QuEval is under constant development. The following features are implemented:

Overview

Generally, our framework consists of three parts: Data Generator, Query-Point Generator, and HDI Tester. In the following we describe briefly each of these components and how they interact (which is also indicated by the figure below).

Data Generator: This component is responsible for generating the data set that is used for a certain scenario. To this end, the user can specify certain characteristics of the data. In particular, we currently support the specification of dimensionality (i.e., number of dimensions), the amount of data (i.e., number of data points in the data set), the stochastic distribution of data, and the range of values for each dimension. Furthermore, it is possible to save the generated data set and thus, reuse it for replicating or extending a former scenario.

Query-Point Generator: Analogously, this component is responsible for generating the query points used within a scenario. This can be done in two different ways: First, this component takes the generated data set as input and selects randomly a specified number of query points out of this data set. Consequently, the query points exhibit the same properties as the data points in the data set. Alternatively, the Query-Point Generator can generate query points independent of the underlying data set (nevertheless, with the same properties). Furthermore, the generated query points can be saved and reused for replication.

HDI Tester: This is the core component of our framework since it performs the actual comparison of index structures. To this end, the user has to specify a scenario description, containing the following information:

The user can create this scenario description by means of a graphical user interface. To compare/evaluate the selected indexes, the HDI Tester determines the accuracy (only if chosen) and time efficiency. In the case that the framework computes the accuracy (currently based on the euclidean distance), it uses the sequential scan to determine the correct query response and compares this result to the query result of the evaluated index structures. The results of a scenario are saved in a CSV file that can be deployed for detailed analyses of the experimental results.

Implemented Index Structures

Currently, the following index structures are available within the QuEval framework.
NameFurther information/original papersLink(s)
Pyramid technique Original paper: S.Berchthold et al., The pyramid-technique: towards breaking the curse of dimensionality, SIGMOD Records, 1998
Implementation based on: D.-H. Lee et al., An efficient technique for nearest-neighbor query processing on the SPY-TEC, Trans. Knowl. Data Eng., 2003
paper access

paper access
VA-File Original paper: R.Weber et al., An approximation-based data structure for similarity search, technical report, 1997 paper access
Prototype-based approach Original paper: E.C.Gonzalez et al., Effective proximity retrieval by ordering permutations, Trans. Pattern Analysis and Machine Intelligence, 2008 paper access
original implementation (in C#)
LSH-based approach Original paper: P.Indyk et al., Approximate nearest neighbors: towards removing the curse of dimensionality., ACM SAC, 1998
Implementation based on: M.Data et al., Locality-sensitive hashing scheme based on p-stable distributions, Int. Symp. on Comp. Geometry, 2004
paper access

paper access
R-Tree Original paper: A.Guttman, A dynamic index structure for spatial searching, SIGMOD, 1984
Original implementation from the GeoCraft open source framework; adjusted in order to add exact match functionality to the R-Tree
paper access
original implementation
R-Variant Adapted form of original R-Tree with configurable split and insert algorithms (e.g., R*-Tree).
N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger, The R*-Tree: An efficient and robust access method for points and rectangles in SIGMOD. ACM, 1990, pp. 322-331.
paper access
k-d Tree Original paper: J.L.Bentley, Multidimensional binary search trees used for associative searching, Comm. ACM, 1975
Original implementation from Simon Levy's software archive
paper access
original implementation
M-tree Original paper: P. Ciaccia et al., M-tree: An Efficient Access Method for Similarity Search in Metric Spaces, VLDB, 1997
paper access

Integrating New Index Structures

One of our aims is to maintain an extensible framework so that other researcher can contribute to it and use it for their specific purposes. A central aspect of this extensibility is the possibility to add new index structures to the framework. To this end, an API exists that has to be implemented for each new index structure. You can find a documentation (JavaDoc style) of this interface at AIndexStructure -- Documentation

Advanced Features

For improving the extensiblety of the framework, we implemented some advanced features. These features are not needed to perform tests with the framework but are useful to extend the user experience.

Download

The current version of QuEval can be downloaded as ZIP-compressed archive: Download page or use the direct zip file link. After downloading the file, extract it to your file system. Afterwards, proceed as follows (where QUEVAL_PATH denotes the main directory of the extracted framework):

Windows

  1. Go to QUEVAL_PATH and execute the file start-framework.bat. The framework should start (see the screenshot below for an example).

Mac OS/Linux

  1. Open the terminal and type cd QUEVAL_PATH/Framework.
  2. Start the framework by executing the file framework.jar as follows: java -jar framework.jar
Screenshot of the QuEval GUI immediately after starting the application.

Source Code

QuEval is released under the EPL License.

Example Data

Recently, we conducted a large-scale case study for the implemented index structures. You can access the data as zip-archive here. Simply copy the unzipped data sets int the /data folder of the framework and restart QuEval (if running).
Scenario description#Dimensions#Points in data setFurther information/paper referenceLink(s)
Use case 1: Forensic evidences (handwriting samples)
Real-world data 16 10,992 F. Alimoglu and E. Alpaydin, Methods of combining multiple classifiers based on different representations for pen-based handwriting recognition in ICDAR. IEEE, 1997, pp. 637-640. paper access
uniform data
Gaussian data
real-world data
Uniform data Produced by internal data generator
Gaussian data Produced by extension to R
Use case 2: Sensor parameter optimization
Real-world data 43 411,961 T. Kiertscher, R. Fischer, and C. Vielhauer, Latent fingerprint detection using a spectral texture feature in MMSec. ACM, 2011, pp. 27-32. paper access
uniform data
Gaussian data
real-world data
Unifom data Produced by internal data generator
Gaussian data Produced by extension to R
Use case 3: Low populated high-dimensional database
Real-world data 50 130,064 B. P. Roe, H.-J. Yang, J. Zhu, Y. Liu, I. Stancu, and G. McGregor, Boosted decision trees as an alternative to artificial neural networks for particle identification. NIMPA, vol. 543, no. 2-3, pp. 577-584, 2005. paper access
uniform data
Gaussian data
real-world data
Unifom data Produced by internal data generator
Gaussian data Produced by extension to R
Use case 4: Dense populated high-dimensional database
Real-world data 51 3,850,505 A. Reiss and D. Stricker, Creating and benchmarking a new dataset for physical activity monitoring. PETRA, pp. 40:1–40:8, 2012. paper access
compressed data (rar)
Unifom data Produced by internal data generator
Gaussian data Produced by extension to R

Publications

Contact

QuEval is developed mainly at the University of Magdeburg, Germany. It is open source and we are currently working on the first release so that everybody who is interested can extend/improve it. For information about the project, technical questions and bug reports: please contact the development team via Martin Schäler.

Project members: Former project members:

Related projects

Copyright Notice: This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.