Worst-case Optimal Graph Joins in Almost No Space
Online material for the paper Worst-case Optimal Graph Joins in Almost No Space. Here you can find:
- Proofs and additional comments (Updated February 2021).
- The source code of our engine.
- The instructions to use our code.
- Resources referenced from our paper.
- Appendix (Previous submission).
How to use our code
1. Install the library SDSL
The first step is to install an extended version of the library SDSL, which is a C++11 library for working with succinct data structures. You can download the library here. Then, to install the library, you have to follow the instructions available in the file README.md.
2. Run our code
After the extended version of SDSL is installed, you have to download our code. Then follow these steps:
2.1. Create our build folder and compile the code:
mkdir build
cd build
cmake ..
make
Check that there is no errors.
2.2. Download the version of Wikidata that you want to use:
- a) Wikidata Filtered (about 80M triples).
- b) Wikidata (about 1000M triples). Note that this file is compressed.
Now put the .dat file inside a folder.
2.3. You have to build the index. After compiling the code you should have an executable called build-index in build. Now run:
./build-index <absolute-path-to-the-.dat-file> <type-ring>
<type-ring> can take two values: ring or c-ring. Both are implementations of our ring index but using plain and compressed bitvectors, respectively. This will generate the index in the folder where the .dat file is located. The index is suffixed with .ring or .c-ring according to the second argument.
2.4. Querying the index. In build folder, you should find another executable file called query-index. To solve the queries you should run:
./query-index <absoulute-path-to-the-index-file> <absolute-path-to-the-query-file>
Note that the second argument is the path to a file that contains all the queries. The queries of our benchmark are in Queries:
- The file
Queries-wikidata-benchmark.txtcan be run withwikidata-filtered-enumerated.dat. - The file
Queries-bgps-limit1000.txtcontains the queries ofwikidata-enumerated.dat.
After running that command, you should see the number of the query, the number of results, and the elapsed time of each one of the queries with the following format:
<query number>;<number of results>;<elapsed time>
Optional: if you want to run the Ring (Fixed P) code instead, you should download this version of our source code and repeat the previous steps.
Other resources referenced from our paper
- Upper Bounds on the Number of Indexes (Table 3).
- The Wikidata Benchmark official webpage. There you can find instructions for loading the data in Apache Jena, LFTJ Jena, Virtuoso and Blazegraph.
- RDF-3X source code.
- EmptyHeaded source code.
- The instructions to replicate our procedures with Graphflow.
Instructions for using SPARQL Engines
First you have to download our scripts and the instructions.
The instructions on how to install, configure and load data into Blazegraph, Jena, JenaLTJ and Virtuoso are in files:
blazegraph.txt.jena.txt (also for JenaLTJ).rdf3x.txt.virtuoso.txt.
To run queries, the following scripts expect a *.txt file with one
SPARQL query per line in a file. Add this query file to a new folder.
You can have multiple query files in a given folder. Our query files are:
For benchmarking SPARQL over HTTP (BlazeGraph, Jena, JenaLTJ, Virtuoso):
benchmark.py: benchmark using SPARQL HTTP API (where available). For
this you need python-sparqlwrapper (you can install it with pip or
sudo apt-get install python-sparqlwrapper)
Then use one of the following bash scripts (bash *-benchmark.sh). Note
that benchmark.py must be in the same folder as the script. Call the
script to see the arguments.
blazegraph-benchmark.sh.jena-benchmark.sh(pass a different Fuseki jar to run queries for standard Jena or JenaLTJ).virtuoso-benchmark.sh.
For benchmarking on the command line (RDF3x):
benchmark-rdfs3x.py: benchmark using CLI of RDF3x (does not have a HTTP
API available)
Then use the following bash script (bash rdf3x-benchmark.sh). Note that
benchmark.py must be in the same folder as the script.
Call the script to see the arguments (rdf3x-benchmark.sh).