Toward the end of 2020 I spent some time blackbox testing data load times for WDQS and Blazegraph to try and find out which possible setting tweaks might make things faster.
I didn’t come to any major conclusions as part of this effort but will write up the approach and data nonetheless incase it is useful for others.
I expect the next step toward trying to make this go faster would be via some whitebox testing, consulting with some of the original developers or with people that have taken a deep dive into the code (which I started but didn’t complete).
Blazegraph, WDQS and Java have a variety of settings that can be changed, either specifically to do with data loading, or that might have some affect. The hardware resources available might also have an impact on loading times.
I added all of these variables to a big spreadsheet and started working through nearly 100 VMs on Google cloud slightly tweaking the settings and recording the results.
Setting wise this boils down to:
- Buffer Capacity
- Queue Capacity
- Gzip Buffer
- Write Retention Queue
And hardware wise:
- CPU Cores
- CPU Clock
- Storage / Disk speed
For the tests I tried loading the first set of chunks of the Wikidata RDF data from some point in 2020.
Nothing truly scientific going on here, and I didn’t load a whole data set at any point during these tests. But some things were pretty easy to spot.
Pretty obviously faster hardware often makes the data load faster.
CPU clock speed makes a big difference with the first 10 batches of RDF taking 2 hours less on a c2-standard-8 3.1-3.8 Ghz machine instead of a n1-standard-8 2.2-3.7 Ghz machine. This was a reduction from 6 hours to 4 hours.
SSDs should obviously be used for storage. There is a small difference between different RAID configurations. Using too many disks can lower performance. Using a single disk for read and write operations (reading the dump and writing the journal) can lower performance. The first 10 batches of RDF took 3.9 hours with 3 disks for reading and 5 for writing. They took 3.6 hours with only 2 disks for reading and 3 for writing. Using 5 disks raided together took 3.7 hours.
Due to the fast disks and Java being allocated set amounts of memory, increasing the available memory didn’t really have much impact. This might be different if using slower disks, as memory would be used by the OS disk caching.
BufferMode: Disk vs DiskRW
DiskRW made the data load go much faster, also producing a much larger journal file. After performing the experiment I went to investigate why.
RW indicates that it is not a WORM with append only semantics but rather a disk alloc/realloc mechanism that supports updates to values.https://blazegraph.com/database/apidocs/com/bigdata/journal/BufferMode.html#DiskRW
I forget how quickly the disk got eaten up, but the journal was already bigger than a normal final result after only a few hours.
This could be feasible if you have ample disk space and don’t need to write back to the journal and want faster load times.
Seemingly different combinations of available memory, allocated HEAP, and CPU speed lead to a variety of different load time results. I can only image this is due to different elements of the load process being impacted if there is too much or too little of a single resource. Garbage collection likely plays a big role here.
For example, on a machine with 30GB of memory and all other settings being the same:
- 4GB HEAP allocation, 5.8 hours to complete batch 10
- 8GB HEAP allocation, 5.6 hours to complete batch 10
- 16 GB HEAP allocation, 7.7 hours to complete batch 10
I saw some of the most promising changes to load time when altering the queue capacity setting.
My understanding of where this sits in the loading sequence is shown in the diagram below.
By this point I had stopped timing my results instead monitoring the rate reported by the loading output for a “goodSet”.
The problem with changing this value is that often things would start off faster, but eventually everything would slow down to a similar basic rate.
There are still other settings and options that I didn’t try looking at.
For example other journal buffer modes, such as
MemStore which is optimized for primary memory.
At the end of the day I think a better understanding of some of the Blazegraph itnernals would lead to some faster answers. I did start looking at some profiling while performing data loads, but didn’t figure out anything concrete.
You can find all of my raw data in this Google Sheet. If you have any questions please ask!