Hi there,
Thanks for reporting the findings.
On 20/06/2020 16:10, Isroel Kogan wrote:
Hi,
I am also a newcomer to the RDF world - and particularly Jena, which I started
using this week.
A couple of observations I have made over the last few days exploring different
options.
Local Machine (specs):
Ubuntu 18.04
Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz (8 CPU)
which is 4 cores with hyper-threading. For this workload that is more like
4 threads. HT does not give a full x2 for this sort of continuous
processing.
And pre-emptive timeslicing is not nice!
16GB RAM
512 SSD (NVMe).
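To see the cores-versus-threads picture on a given machine, something like the following may help. It is only a sketch: `cpu_summary` is a made-up helper name, and it assumes a Linux box with `nproc` and `lscpu` available.

```shell
# Report logical CPUs and physical cores. With hyper-threading the first
# is typically twice the second.
# cpu_summary is a hypothetical helper name, not part of any tool.
cpu_summary() {
    logical=$(nproc)
    # lscpu -p prints one line per logical CPU; '#' lines are comments.
    # Counting unique (core, socket) pairs gives the physical core count.
    physical=$(lscpu -p=CORE,SOCKET | grep -v '^#' | sort -u | wc -l | tr -d ' ')
    echo "logical CPUs: $logical, physical cores: $physical"
}
```

Run `cpu_summary` on both the laptop and the GCP instance and compare the physical core count with the number of busy loader threads.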
the following compares loading a file in compressed vs decompressed format
-both w parallel loader.
file:
docstrings_triples.nq
size: 28GB
cmd:
time tdb2.tdbloader --loader=parallel --loc=test1graphdb docstrings_triples.nq > tdb2.log1 2>&1
:: Time = 1,364.310 seconds : Quads = 127,206,280 : Rate = 93,239 /s
real 22m46.346s
user 120m46.591s
sys 3m22.698s
file:
docstrings_triples.nq.bz2
size: 542M
cmd:
time tdb2.tdbloader --loader=parallel --loc=test2graphdb docstrings_triples.nq.bz2 > tdb2.log2 2>&1
:: Time = 2,225.871 seconds : Quads = 127,206,280 : Rate = 57,149 /s
real 37m8.182s
user 109m42.970s
sys 6m27.426s
resulting DB size
30GB
confirmed equal via diff.
pbzip2 ran in 84s
Less rigorously I noticed a similar gain in speed for other files.
For gz files, the difference in loading speed between compressed and
uncompressed is usually not very much. It does look like bz2
Using a separate process and faster decompressor may help:
bzip2 -d < docstrings_triples.nq.bz2 | \
time tdb2.tdbloader --loader=parallel --loc=test2graphdb \
-- - > tdb2.log2 2>&1
When Jena decompresses a bz2 file, it uses Apache Commons Compress, so
it is a Java decompressor, which will take time to get optimized by the
JIT and is likely slower than a specialized tool like bzip2.
But with 4 cores, it may have the opposite effect - using more processes
causes preemptive timeslicing.
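To check how much of the difference is decompression itself, the decompressor can be timed on its own with the output thrown away. A rough sketch, assuming `bzip2` (or `pbzip2`, which you already used) is installed; `time_decompress` is a hypothetical helper, not something Jena provides.

```shell
# Time decompression alone, discarding the output, so the cost can be
# separated from the load itself.
# time_decompress is a hypothetical helper name.
time_decompress() {
    file=$1
    tool=${2:-bzip2}    # e.g. bzip2 or pbzip2; both accept -d -c
    start=$(date +%s)
    "$tool" -d -c "$file" > /dev/null || return 1
    end=$(date +%s)
    echo "$tool: $((end - start)) s"
}
```

For example, `time_decompress docstrings_triples.nq.bz2` and then `time_decompress docstrings_triples.nq.bz2 pbzip2`. The resolution is whole seconds, which should be enough at this file size.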
It may be that one of the other loaders is faster because it is a better
match to the hardware.
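Trying the other loaders could look something like this. It is a sketch under assumptions: the loader names are the ones I recall from the TDB2 documentation (check `tdb2.tdbloader --help` for your version), and `compare_loaders` and the `db-NAME` directory names are made up for illustration.

```shell
# Run the same data through several loaders, each into its own fresh
# database directory. The loader prints its own Time/Quads/Rate line.
# compare_loaders and the db-NAME directories are hypothetical names.
compare_loaders() {
    data=$1
    run=${2:-}    # pass "echo" for a dry run that only prints the commands
    for loader in phased parallel sequential; do
        $run tdb2.tdbloader --loader="$loader" --loc="db-$loader" "$data"
    done
}
```

`compare_loaders docstrings_triples.nq echo` prints the three commands without running them; drop the `echo` to run them. Each `--loc` directory should not exist beforehand so every run starts from an empty database.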
Is this expected behaviour? What factors influence this?
SSD - local vs cloud.
on my local machine, when running parallel loader, cores were working at over
70% capacity and there was little IO induced down time.
How many cores were active?
And when it says "nq", is it really quads or is all the data for the
default graph? (there is more indexing work for named graphs).
Some of that will be the bz2 decompression but it looks to me like it's
"more threads than cores" causing timeslicing.
GCP instance specs:
20 CPU
32GB RAM
And same heap size?
While the parallel loader is using multiple threads it is a fixed number
so more CPU will help only if
More RAM is going to help because the OS will use it for file system
cache, delaying writes.
But with more read threads, it could be there is less preemptive
scheduling and that could be a big gain.
6TB "local SSD" storage
the local SSD storage offers the best performance to reduce IO latency - it has
physical proximity to instance - as per GCP.
a few cores were working at near capacity, while the vast majority idle (near
0%) w occasional spikes. average load translates to 20% utilization. As I've
seen others write here, this is a difference others have noted.
How can this be addressed? buffer size? (I don't have a deep enough
understanding).
My guess is that on the GCP instance it is one thread-one core.
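One way to test that guess is to list the loader's threads together with the CPU each one last ran on. A minimal sketch, assuming the procps `ps` found on most Linux systems; `show_threads` is a hypothetical helper name.

```shell
# List the threads of a process with the CPU each last ran on (PSR)
# and its CPU share. Note that ps reports %CPU averaged over the
# thread's lifetime, not an instantaneous value.
# show_threads is a hypothetical helper name.
show_threads() {
    pid=$1
    ps -L -o tid,psr,pcpu,comm -p "$pid"
}
```

Call it with the PID of the loader's java process on both machines. A handful of busy threads each sitting on its own CPU would fit "one thread-one core"; more busy threads than physical cores would fit the timeslicing explanation for the laptop.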
Another recurring pattern is the reduction in batch size.
I've been running a load job on my gcp instance for almost a day (23+h).
file size: 93GB
triples: 472m
batch size decreased from 160k range to under 1k, while processing time per
batch increased from a few seconds to over 10 min. All this time average CPU
usage has remained steady, as has RAM usage.
Not sure I quite understand - is this adding more data to an existing
database? And 10 mins for 1k? While it will be slower, that does sound
rather extreme.
I don't understand how all of this works with indexing. Is this expected
behaviour? besides a locally proximate SSD, I've thrown an overkill of hardware
at it.
thanks
Andy