How to replicate our procedures with Graphflow
Here you will find the instructions to replicate the errors we encountered when we tried to run the Wikidata Benchmark on Graphflow. First of all, you have to download some scripts and resources. Unzipping the file gives you a folder called graphflow-resources.
Then you have to clone the Graphflow repository and follow its instructions to compile the code. First move to the root folder and run:
./gradlew clean build installDist
to do a full clean build and:
./gradlew build installDist
for all subsequent builds. Then you have to run the following command:
. ./env.sh
As we mention in the paper, it is not possible to load the smaller graph of the Wikidata Benchmark within 550GB of memory. This graph has 52 million nodes (unique subjects/objects), 2 thousand edge labels (properties), and 81 million edges (triples). To check this, you can download the dataset and transform it into the format required by Graphflow with the script transform_all_data.py (this script is inside the graphflow-resources folder that you downloaded before). The dataset and the script must be in the same folder.
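The exact conversion is performed by transform_all_data.py; the following is only a minimal sketch of the idea, assuming the benchmark dump is an N-Triples file and that the loader expects one comma-separated source,label,destination line per edge with integer node IDs (the real script defines the exact column order and separators, and the name sketch_transform.py below is just a placeholder):

# sketch_transform.py: hypothetical stand-in for transform_all_data.py
import sys

node_ids = {}  # maps each subject/object term to a consecutive integer ID

def node_id(term):
    if term not in node_ids:
        node_ids[term] = len(node_ids)
    return node_ids[term]

with open(sys.argv[1]) as src, open(sys.argv[2], "w") as dst:
    for line in src:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # naive N-Triples split: subject, predicate, object (trailing '.' stripped)
        subj, pred, obj = line.rstrip(" .").split(" ", 2)
        dst.write(f"{node_id(subj)},{pred},{node_id(obj)}\n")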
The source code of Graphflow provides a script for loading the data, located in the scripts folder. To move to this folder run:
cd scripts
Then execute the following command:
JAVA_OPTS='-Xmx550G' python3 serialize_dataset.py /absolute/path/wiki-graphflow-all.dat /absolute/path/folder
This script creates a folder with the data. However, even with 550GB assigned to the JVM, this script threw a heap space error.
Reducing the number of properties of the benchmark
We wanted to obtain a lower bound for the times of the distinct query patterns of the Wikidata Benchmark. Thus we tried to reduce the number of properties of the benchmark with the following idea. Consider the file named J3.txt inside the queries folder (this folder is in the graphflow-resources folder). This file contains all the queries of one of the query patterns of the benchmark. We collected the labels of the distinct properties used by this set of queries and stored them in the file props/J3.txt. Then we used the script transform_pattern_data.py to create a file in the format required by Graphflow where every property of the dataset is kept if it appears in props/J3.txt, and is replaced by p0 otherwise (a sketch of this relabelling is given below). Then we run:
JAVA_OPTS='-Xmx550G' python3 serialize_dataset.py /absolute/path/wiki-graphflow-J3.dat /absolute/path/folder
Unfortunately, it was not possible to load the dataset. To the best of our knowledge, and according to what we discussed with the authors of Graphflow, the engine is not designed to work with a high number of distinct properties, so we ran experiments to determine how many properties we could use.
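For reference, the relabelling described above could be sketched as follows, assuming the same one-edge-per-line source,label,destination format as before and that props/J3.txt lists one property label per line (the real transform_pattern_data.py may differ in details; sketch_relabel.py is a placeholder name):

# sketch_relabel.py: hypothetical stand-in for transform_pattern_data.py
import sys

with open("props/J3.txt") as f:
    kept = {line.strip() for line in f if line.strip()}

with open(sys.argv[1]) as src, open(sys.argv[2], "w") as dst:
    for line in src:
        source, label, dest = line.rstrip("\n").split(",", 2)
        # keep the property if the J3 queries use it; collapse everything else into p0
        if label not in kept:
            label = "p0"
        dst.write(f"{source},{label},{dest}\n")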
Hashing the properties
To understand how many properties we could load, we designed the following experiment: we hashed all the properties of the dataset down to a fixed number of properties. With the script hash_wiki.py it is possible to replicate this procedure (a sketch of the idea is given at the end of this section). For instance, we can start with 50 distinct properties. When we execute the script we get a dataset with at most 50 properties, called wikidata-hashed-50.dat. To load this dataset we run:
JAVA_OPTS='-Xmx550G' python3 serialize_dataset.py /absolute/path/wikidata-hashed-50.dat /absolute/path/folder
Then we get a folder of size ~50GB. In order to run the queries, we need to run another command:
JAVA_OPTS='-Xmx550G' python3 serialize_catalog.py /absolute/path/folder
Here we have to give the path to the folder created in the previous step. Unfortunately, this procedure throws a heap space error even though we gave 550GB to the JVM.
We repeated this procedure for 20, 10, and 5 distinct properties, but all three times we got a heap space error. Finally, we could load the graph using 2 distinct properties, but with so few properties many of the queries of the benchmark become duplicates of one another (since Graphflow uses homomorphism-based semantics), and a massive number of results is generated for each query.
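For reference, the hashing performed by hash_wiki.py could look roughly like the following; this is only a sketch under the same one-edge-per-line source,label,destination assumption, using a stable CRC32 hash to map every property into one of K buckets (the real script may pick the buckets differently; sketch_hash.py is a placeholder name):

# sketch_hash.py: hypothetical stand-in for hash_wiki.py
import sys
import zlib

K = 50  # target number of distinct properties; 20, 10, 5, or 2 in the other runs

with open(sys.argv[1]) as src, open(f"wikidata-hashed-{K}.dat", "w") as dst:
    for line in src:
        source, label, dest = line.rstrip("\n").split(",", 2)
        # map every original property to one of K synthetic properties p0..p{K-1}
        hashed = f"p{zlib.crc32(label.encode()) % K}"
        dst.write(f"{source},{hashed},{dest}\n")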