QuEval: Beyond high-dimensional indexing á la carte

Home

Docs

In case you use material from this website, such as source code, please include the following reference to your list of references: QuEval: Beyond high-dimensional indexing à la carte has been accepted for the final issue of PVLDB 2013 and thus, will be presented at VLDB 2014.

Martin Schäler, Alexander Grebhahn, Reimar Schröter, Sandro Schulze, Veit Köppen, and Gunter Saake. PVLDB, 6(14):16541665, 2013. (bibtex)

News

A new QuEval paper has been accepted at RCIS 2014 issuing the challenge of managing the large amount of tailored index implementation variants: Toward variability management to tailor high-dimensional index implementations.

Description

Multimedia data, or high-dimensional data in general, have been subject to research for more than two decades and gain momentum even more in the communication technology age. From a database point of view, the myriads of gigabyte of data pose the problem of managing these data. In this course, query processing is a challenging task due to the high dimensionality of such data. In the past, dozens of index structures for high-dimensional data have been proposed and some of them are even standard-like references. However, it is still some kind of black magic to decide which index structure fits to a certain problem or outweighs other index structures. In particular, the following questions have to be answered to make a decision in favor or against a certain index structure:

In which case to use which index structure (e.g., regarding dimensionality)?
What kind of index is the best for a specific problem?
How do we adjust existing parameters of a chosen index structure to achieve best results?

This is where QuEval, a framework for quantitative comparison and evaluation of high-dimensional index structures comes into play. QuEval is a Java-based framework that supports comparison of index strucutres regarding certain characteristics such as dimensionality, accuracy, or performance. Currently, the framework contains of six different index structures. However, a main focus of the framework is its extensibility and we encourage people to contribute to QuEval by providing more index structures or other interesting aspects for their comparisons.

QuEval is under constant development. The following features are implemented:

A Data Generator that generates the actual data for the comparison, based on some user-specified properties.
A Query-Point Generator that randomly generates the query points for a comparison scenario.
Persistent storage of datasets and query points that allow for replication of certain case studies.
The HDI Tester that is the core component and responsible for the actual comparison of index structures.
Six index structures, integrated within the framework. For details see below.
Support for generating different kinds of data such as equally distributed or clustered data.
A graphical user interface (GUI) supporting an intuitive usage of the framework.

Overview

Generally, our framework consists of three parts: Data Generator, Query-Point Generator, and HDI Tester. In the following we describe briefly each of these components and how they interact (which is also indicated by the figure below).

Data Generator: This component is responsible for generating the data set that is used for a certain scenario. To this end, the user can specify certain characteristics of the data. In particular, we currently support the specification of dimensionality (i.e., number of dimensions), the amount of data (i.e., number of data points in the data set), the stochastic distribution of data, and the range of values for each dimension. Furthermore, it is possible to save the generated data set and thus, reuse it for replicating or extending a former scenario.

Query-Point Generator: Analogously, this component is responsible for generating the query points used within a scenario. This can be done in two different ways: First, this component takes the generated data set as input and selects randomly a specified number of query points out of this data set. Consequently, the query points exhibit the same properties as the data points in the data set. Alternatively, the Query-Point Generator can generate query points independent of the underlying data set (nevertheless, with the same properties). Furthermore, the generated query points can be saved and reused for replication.

HDI Tester: This is the core component of our framework since it performs the actual comparison of index structures. To this end, the user has to specify a scenario description, containing the following information:

A data set, generated by the Data Generator, or a point collection stored as comma separated values (CSV).
A set of query points randomly generated by the Query-Point Generator for exact match queries that not exists in the data set.
A set of points from the data set as further query points for exact match and knn queries.
A selection of indexes that are subject to comparison within the scenario.
The number of test rounds and amount of nearest neighbors (k) that shall be computed.
Whether accuracy shall be determined or not.

The user can create this scenario description by means of a graphical user interface. To compare/evaluate the selected indexes, the HDI Tester determines the accuracy (only if chosen) and time efficiency. In the case that the framework computes the accuracy (currently based on the euclidean distance), it uses the sequential scan to determine the correct query response and compares this result to the query result of the evaluated index structures. The results of a scenario are saved in a CSV file that can be deployed for detailed analyses of the experimental results.

Implemented Index Structures

Currently, the following index structures are available within the QuEval framework.

Name	Further information/original papers	Link(s)
Pyramid technique	Original paper: S.Berchthold et al., The pyramid-technique: towards breaking the curse of dimensionality, SIGMOD Records, 1998 Implementation based on: D.-H. Lee et al., An efficient technique for nearest-neighbor query processing on the SPY-TEC, Trans. Knowl. Data Eng., 2003	paper access paper access
VA-File	Original paper: R.Weber et al., An approximation-based data structure for similarity search, technical report, 1997	paper access
Prototype-based approach	Original paper: E.C.Gonzalez et al., Effective proximity retrieval by ordering permutations, Trans. Pattern Analysis and Machine Intelligence, 2008	paper access original implementation (in C#)
LSH-based approach	Original paper: P.Indyk et al., Approximate nearest neighbors: towards removing the curse of dimensionality., ACM SAC, 1998 Implementation based on: M.Data et al., Locality-sensitive hashing scheme based on p-stable distributions, Int. Symp. on Comp. Geometry, 2004	paper access paper access
R-Tree	Original paper: A.Guttman, A dynamic index structure for spatial searching, SIGMOD, 1984 Original implementation from the GeoCraft open source framework; adjusted in order to add exact match functionality to the R-Tree	paper access original implementation
R-Variant	Adapted form of original R-Tree with configurable split and insert algorithms (e.g., R-Tree). N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger, The R-Tree: An efficient and robust access method for points and rectangles in SIGMOD. ACM, 1990, pp. 322-331.	paper access
k-d Tree	Original paper: J.L.Bentley, Multidimensional binary search trees used for associative searching, Comm. ACM, 1975 Original implementation from Simon Levy's software archive	paper access original implementation
M-tree	Original paper: P. Ciaccia et al., M-tree: An Efficient Access Method for Similarity Search in Metric Spaces, VLDB, 1997	paper access

Integrating New Index Structures

One of our aims is to maintain an extensible framework so that other researcher can contribute to it and use it for their specific purposes. A central aspect of this extensibility is the possibility to add new index structures to the framework. To this end, an API exists that has to be implemented for each new index structure. You can find a documentation (JavaDoc style) of this interface at AIndexStructure -- Documentation

Advanced Features

For improving the extensiblety of the framework, we implemented some advanced features. These features are not needed to perform tests with the framework but are useful to extend the user experience.

C++ Handler. Since we are aware that many existing index structure implementations are written in C++ and we also want to support evaluations using these implementations. We decided to extend the framework with an interface to include and evaluate existing C++ implementations. Find the details here.
Visualisation extension. To improve the user experience for understanding different index structures, we offer visualisation extensions for showing the partition done by different index structures.
Extension to R. With our framework, we will also evaluate the performace impact of complex stoachstic distributed data on the different index structres. Since, we are aware that there are a lot of programming languages to create stochastic distributed data, we extended our framework with an extension to R.

Download

The current version of QuEval can be downloaded as ZIP-compressed archive: Download page or use the direct zip file link. After downloading the file, extract it to your file system. Afterwards, proceed as follows (where QUEVAL_PATH denotes the main directory of the extracted framework):

Windows

Go to QUEVAL_PATH and execute the file start-framework.bat. The framework should start (see the screenshot below for an example).

Mac OS/Linux

Open the terminal and type cd QUEVAL_PATH/Framework.
Start the framework by executing the file framework.jar as follows: java -jar framework.jar

: Screenshot of the QuEval GUI immediately after starting the application.

Source Code

QuEval is released under the EPL License.

Example Data

Recently, we conducted a large-scale case study for the implemented index structures. You can access the data as zip-archive here. Simply copy the unzipped data sets int the /data folder of the framework and restart QuEval (if running).

Scenario description	#Dimensions	#Points in data set	Further information/paper reference	Link(s)
Use case 1: Forensic evidences (handwriting samples)
Real-world data	16	10,992	F. Alimoglu and E. Alpaydin, Methods of combining multiple classifiers based on different representations for pen-based handwriting recognition in ICDAR. IEEE, 1997, pp. 637-640.	paper access uniform data Gaussian data real-world data
Uniform data			Produced by internal data generator
Gaussian data			Produced by extension to R
Use case 2: Sensor parameter optimization
Real-world data	43	411,961	T. Kiertscher, R. Fischer, and C. Vielhauer, Latent fingerprint detection using a spectral texture feature in MMSec. ACM, 2011, pp. 27-32.	paper access uniform data Gaussian data real-world data
Unifom data			Produced by internal data generator
Gaussian data			Produced by extension to R
Use case 3: Low populated high-dimensional database
Real-world data	50	130,064	B. P. Roe, H.-J. Yang, J. Zhu, Y. Liu, I. Stancu, and G. McGregor, Boosted decision trees as an alternative to artificial neural networks for particle identification. NIMPA, vol. 543, no. 2-3, pp. 577-584, 2005.	paper access uniform data Gaussian data real-world data
Unifom data			Produced by internal data generator
Gaussian data			Produced by extension to R
Use case 4: Dense populated high-dimensional database
Real-world data	51	3,850,505	A. Reiss and D. Stricker, Creating and benchmarking a new dataset for physical activity monitoring. PETRA, pp. 40:140:8, 2012.	paper access compressed data (rar)
Unifom data			Produced by internal data generator
Gaussian data			Produced by extension to R

Publications

Veit Köppen, Martin Schäler, and Reimar Schröter. Toward variability management to tailor high-dimensional index implementations . In International Conference on Research Challenges in Information Science (RCIS). IEEE, 2014 - to appear.
Martin Schäler, Alexander Grebhahn, Reimar Schröter, Sandro Schulze, Veit Köppen, and Gunter Saake. QuEval: Beyond high-dimensional indexing à la carte . In Proc Int' Conf. on Very Large Databases (VLDB), pages 1654–1665. VLDB Endowment, 2013.
David Broneske, Martin Schäler, and Alexander Grebhahn. Extending an Index-Benchmarking Framework with Non-Invasive Visualization Capability . In Workshop on Information Systems in Digital Engineering (ISDE), BTW-Workshops, pages 151–160. Köllen-Verlag, 2013.
Alexander Grebhahn, Martin Schäler, Veit Köppen, and Gunter Saake. Privacy-Aware Multidimensional Indexing . In 15. GI-Fachtagung Datenbanksysteme für Business, Technologie und Web (BTW), pages 133–147. Köllen-Verlag, 2013.
Alexander Grebhahn, David Broneske, Martin Schäler, Reimar Schröter, Veit Köppen, and Gunter Saake. Challenges in finding an appropriate multi-dimensional index structure with respect to specific use cases . In Proceedings of the 24th GI-Workshop "Grundlagen von Datenbanken 2012", pages 77–82. CEUR-WS, 2012. urn:nbn:de:0074-850-4.
Jan Hentschel, Thomas Meyer, Thomas Rommel. Trying to outperform a well-known index with a sequential scan . In EDBT/ICDT Workshops 2013 pages 349-356.
Martin Schäler, Sandro Schulze, Alexander Grebhahn, Veit Köppen, Andreas Lübcke, and Gunter Saake (Hrsg.). Techniken zur forensischen Datenhaltung - Ausgewählte studentische Beiträge . Collection FIN-05-2012, University of Magdeburg, Germany, October 2012. In German.

David Broneske, Denis Dietze, Maik Lampe, Andreas Meier, Vergleich von Indexverfahren für hochdimensionale Datenräume.
Björn-Erik Aust, Alexander Grebhahn, Tran Tuan Nguyen, Reimar Schröter, Implementierung und Evaluierung hochdimensionaler Indexstrukturen.
Matthias Koch, Martin Tobies, Christoph Neubüse, Implementierung und Evaluierung von Distanzfunktionen im Digi-Dak-Framework.
Sebastian Dorok, Martin Tobies, Andre Zaske. Einfluss von multivariat schief normalverteilten Daten auf multidimensionale Indexstrukturen.

Tim Hering. Parallelization of K-D Trees for kNN-queries in main memory. BTW Studierendenprogramm (Students program), 2013, Köllen-Verlag.
Sarah Heckel. Optimierung der Exact-Match Anfrage eines Lokal Sensitiven Hashverfahrens. BTW Studierendenprogramm (Students program), 2013, Köllen-Verlag, In German.

Contact

QuEval is developed mainly at the University of Magdeburg, Germany. It is open source and we are currently working on the first release so that everybody who is interested can extend/improve it. For information about the project, technical questions and bug reports: please contact the development team via Martin Schäler.

Project members:

Martin Schäler (Uni Magdeburg)
Reimar Schröter (Uni Magdeburg)
Alexander Grebhahn (Uni Passau)
Sandro Schulze (TU Braunschweig)
Veit Köppen (Uni Magdeburg)

Former project members:

Martin Tobies
Tim Hering
Christina Pielach
Björn-Erik Aust
David Broneske
Denis Dietze
Maik Lampe
Tran Tuan Ngyuen

Related projects

Optimal CPU/GPU utilization in databases: CoGa DB - A column-oriented GPU-accelerated DBMS
Feature IDE - An Eclipse-based IDE that supports all phases of feature-oriented software development for the development of SPLs

Copyright Notice: This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.

QuEval: Beyond high-dimensional indexing à la carte

Reliable and Reproducible Evaluations of High-Dimensional Index Structures

Latest Version November 25, 2013