TDB2 parallel load on cloud SSD and other observations/questions

Isroel Kogan Sat, 20 Jun 2020 08:10:48 -0700

Hi,

I am also a newcomer to the RDF world, and particularly to Jena, which I
started using this week.


Here are a couple of observations I have made over the last few days while
exploring different options.

Local Machine (specs):

Ubuntu 18.04
Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz (8 CPU)
16GB RAM
512GB SSD (NVMe).


The following compares loading a file in compressed vs. decompressed format,
both with the parallel loader.

file:
docstrings_triples.nq
size: 28GB

cmd:
time tdb2.tdbloader --loader=parallel --loc=test1graphdb docstrings_triples.nq > tdb2.log1 2>&1

:: Time = 1,364.310 seconds : Quads = 127,206,280 : Rate = 93,239 /s

real    22m46.346s
user    120m46.591s
sys    3m22.698s


file:
docstrings_triples.nq.bz2
size: 542M

cmd:

time tdb2.tdbloader --loader=parallel --loc=test2graphdb docstrings_triples.nq.bz2 > tdb2.log2 2>&1

:: Time = 2,225.871 seconds : Quads = 127,206,280 : Rate = 57,149 /s


real    37m8.182s
user    109m42.970s
sys    6m27.426s
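
The reported rates can be cross-checked from the summary lines themselves,
since the rate should simply be quads divided by elapsed seconds. A minimal
sketch, assuming the summary line has the shape shown in the two excerpts
above; the function name recompute_rate is only illustrative:

```shell
# Recompute the load rate (quads per second) from a loader summary line.
# Assumes the line contains "Time = <seconds>" and "Quads = <count>" with
# comma thousands separators, as in the excerpts above.
recompute_rate() {
  printf '%s\n' "$1" | tr -d ',' | awk '
    {
      for (i = 1; i <= NF - 2; i++) {
        if ($i == "Time")  secs  = $(i + 2)   # value follows "Time ="
        if ($i == "Quads") quads = $(i + 2)   # value follows "Quads ="
      }
    }
    END { if (secs + 0 > 0) printf "%.0f\n", quads / secs }'
}

recompute_rate ":: Time = 1,364.310 seconds : Quads = 127,206,280 : Rate = 93,239 /s"
recompute_rate ":: Time = 2,225.871 seconds : Quads = 127,206,280 : Rate = 57,149 /s"
```

Both calls reproduce the logged rates (93239 and 57149), so the compressed
load is roughly 39% lower in quads per second for the same input.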

resulting DB size
30GB

confirmed equal via diff.

pbzip2 ran in 84s
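
The output of "time" also allows a rough estimate of how many cores were busy
on average (user time divided by real time), and a comparison of the two
routes end to end. A rough sketch, assuming the 84s pbzip2 figure is the time
taken to decompress the .bz2 file before loading; the helper names are made up
for illustration:

```shell
# Convert a duration printed by "time" (e.g. 22m46.346s) to seconds.
to_seconds() {
  printf '%s\n' "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# Average number of busy cores: user CPU time divided by wall-clock time.
avg_cores() {
  awk -v u="$(to_seconds "$1")" -v r="$(to_seconds "$2")" \
    'BEGIN { printf "%.2f\n", u / r }'
}

# How many times longer the direct .bz2 load takes than decompressing first
# ($3 seconds, an assumption) and then loading the plain file.
direct_vs_decompress_first() {
  awk -v d="$(to_seconds "$1")" -v p="$(to_seconds "$2")" -v x="$3" \
    'BEGIN { printf "%.2f\n", d / (p + x) }'
}

avg_cores 120m46.591s 22m46.346s      # plain .nq load
avg_cores 109m42.970s 37m8.182s       # .nq.bz2 load
direct_vs_decompress_first 37m8.182s 22m46.346s 84
```

With the figures above this gives about 5.30 busy cores for the plain file
against about 2.95 for the compressed one, and a ratio of about 1.54 in favour
of decompressing first. That would be consistent with decompression being a
bottleneck in the direct load, though that is only a guess from these numbers.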

Less rigorously, I noticed a similar gain in speed for other files.
Is this expected behaviour? What factors influence it?

SSD - local vs cloud.

On my local machine, when running the parallel loader, the cores were working
at over 70% capacity and there was little IO-induced downtime.

GCP instance specs:

20 CPU
32GB RAM
6TB "local SSD" storage
The local SSD storage offers the best performance for reducing IO latency; as
per GCP, it is physically close to the instance.

A few cores were working at near capacity, while the vast majority were idle
(near 0%) with occasional spikes. The average load translates to 20%
utilization. Others here have reported the same difference.
How can this be addressed? Buffer size? (I don't have a deep enough
understanding.)
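
One way to check whether the idle cores are waiting on storage is to look at
the share of CPU time spent in iowait while the loader runs. A minimal sketch
for Linux, reading the aggregate cpu line of /proc/stat; the function name
iowait_pct is illustrative, and this only helps diagnose the symptom rather
than fix it:

```shell
# Percentage of CPU time spent in iowait between two samples of the aggregate
# "cpu" line from /proc/stat (Linux only).
# Fields after the "cpu" label: user nice system idle iowait irq softirq steal
iowait_pct() {
  # $1 = earlier "cpu ..." line, $2 = later "cpu ..." line
  printf '%s\n%s\n' "$1" "$2" | awk '
    {
      total = 0
      for (i = 2; i <= 9; i++) total += $i   # sum user..steal
      t[NR] = total
      w[NR] = $6                             # iowait column
    }
    END {
      d = t[2] - t[1]
      if (d > 0) printf "%.1f\n", 100 * (w[2] - w[1]) / d
      else print "0.0"
    }'
}

# Example: sample twice, 5 seconds apart, while the loader is running.
# a=$(head -n 1 /proc/stat); sleep 5; b=$(head -n 1 /proc/stat)
# iowait_pct "$a" "$b"
```

A high value alongside mostly idle cores would point towards the disk rather
than the loader's threading; a low value would suggest looking elsewhere.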


Another recurring pattern is the reduction in batch size.
I've been running a load job on my GCP instance for almost a day (23+h).

file size: 93GB
triples: 472m

The batch size decreased from the 160k range to under 1k, while the processing
time per batch increased from a few seconds to over 10 min. All this time,
average CPU usage has remained steady, as has RAM usage.

I don't understand how all of this works with indexing. Is this expected
behaviour? Besides a locally proximate SSD, I've thrown an overkill of hardware
at it.

thanks
