Hi, I am also a newcomer to the RDF world - and particularly Jena, which I started using this week.
Here are a couple of observations I have made over the last few days while exploring different options.

Local machine (specs):
  Ubuntu 18.04
  Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz (8 CPU)
  16GB RAM
  512GB SSD (NVMe)

The following compares loading the same file in compressed vs decompressed format, both with the parallel loader.

file: docstrings_triples.nq
size: 28GB
cmd:  time tdb2.tdbloader --loader=parallel --loc=test1graphdb docstrings_triples.nq > tdb2.log1 2>&1

  :: Time = 1,364.310 seconds : Quads = 127,206,280 : Rate = 93,239 /s
  real 22m46.346s
  user 120m46.591s
  sys  3m22.698s

file: docstrings_triples.nq.bz2
size: 542M
cmd:  time tdb2.tdbloader --loader=parallel --loc=test2graphdb docstrings_triples.nq.bz2 > tdb2.log2 2>&1

  :: Time = 2,225.871 seconds : Quads = 127,206,280 : Rate = 57,149 /s
  real 37m8.182s
  user 109m42.970s
  sys  6m27.426s

The resulting DB size is 30GB in both cases, confirmed equal via diff. pbzip2 ran in 84s.

Less rigorously, I noticed a similar gain in speed for other files. Is this expected behaviour? What factors influence this?

SSD - local vs cloud.

On my local machine, when running the parallel loader, the cores were working at over 70% capacity and there was little IO-induced downtime.

GCP instance specs:
  20 CPU
  32GB RAM
  6TB "local SSD" storage

According to GCP, the local SSD storage offers the best performance for reducing IO latency, because it is physically close to the instance. On this instance a few cores were working at near capacity, while the vast majority were idle (near 0%) with occasional spikes. The average load translates to about 20% utilization. Others have noted the same difference here. How can this be addressed? Buffer size? (I don't have a deep enough understanding.)

Another recurring pattern is the reduction in batch size. I've been running a load job on my GCP instance for almost a day (23+ h).

file size: 93GB
triples:   472m

Batch size decreased from the 160k range to under 1k, while processing time per batch increased from a few seconds to over 10 min.
All this time, average CPU usage has remained steady, as has RAM usage. I don't understand how all of this interacts with indexing. Is this expected behaviour? Besides a locally proximate SSD, I've thrown an overkill of hardware at it.

Thanks
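
P.S. For what it's worth, here is a rough sketch of the arithmetic behind the compressed vs decompressed comparison above. The timings and quad counts are copied from the two log excerpts. The one assumption is that the 84s pbzip2 run was a full decompression of the .bz2 file, so the "decompress first, then load" total is only an estimate. The function names (rate, summarize) are mine, purely for illustration.

```python
# Figures copied from the two tdb2.tdbloader runs quoted in this post.
QUADS = 127_206_280
PLAIN_LOAD_S = 1364.310   # load time for docstrings_triples.nq
BZ2_LOAD_S = 2225.871     # load time for docstrings_triples.nq.bz2
PBZIP2_S = 84.0           # ASSUMPTION: the 84 s pbzip2 run was a full decompression


def rate(quads: int, seconds: float) -> float:
    """Average load rate in quads per second."""
    return quads / seconds


def summarize() -> dict:
    """Compare loading the .bz2 directly with decompressing first, then loading."""
    decompress_then_load = PBZIP2_S + PLAIN_LOAD_S
    return {
        "plain_rate": rate(QUADS, PLAIN_LOAD_S),
        "bz2_rate": rate(QUADS, BZ2_LOAD_S),
        # How many times longer the direct .bz2 load took than the plain load.
        "direct_bz2_slowdown": BZ2_LOAD_S / PLAIN_LOAD_S,
        "decompress_then_load_s": decompress_then_load,
        # Seconds saved by decompressing first (under the assumption above).
        "saved_s": BZ2_LOAD_S - decompress_then_load,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:,.2f}")
```

If that assumption holds, decompressing with pbzip2 first and then loading the plain file would still come out ahead of loading the .bz2 directly, which is part of why I am asking whether the slowdown is expected.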