> And when it says "nq" is really quads or all data for the default
> graph? (there is more indexing work for named graphs).
>> : Quads = 127,206,280
OK - it's quads. There are 6 quad indexes and in full parallel mode it
will use 2 more threads to parse and to build the node table.
Full parallel loading is going to use up all the cores, and HT threads
aren't full threads for this purpose.
The phased loader (the default) uses fewer threads.
Phase 1:
one thread to decompress and parse
one thread to build the node table.
one thread for the GSPO index
(and one for SPO, but you seem to have no triples)
=3
Phase 2:
two threads
=2
Phase 3:
three threads
=3
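If thread count is the issue on a 4-core machine, it is easy to compare
loaders directly - the loader names below are the ones I believe current
TDB2 ships with; check `tdb2.tdbloader --help` for your version:

```shell
# Same data, different loaders; "phased" is the default.
time tdb2.tdbloader --loader=phased     --loc=db-phased   docstrings_triples.nq
time tdb2.tdbloader --loader=parallel   --loc=db-parallel docstrings_triples.nq
time tdb2.tdbloader --loader=sequential --loc=db-seq      docstrings_triples.nq
```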
Andy
On 21/06/2020 22:11, Andy Seaborne wrote:
Hi there,
Thanks for reporting the findings.
On 20/06/2020 16:10, Isroel Kogan wrote:
Hi,
I am also a newcomer to the RDF world - and particularly Jena, which I
started using this week.
A couple of observations I have made over the last few days exploring
different options.
Local Machine (specs):
Ubuntu 18.04
Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz (8 CPU)
which is 4 cores with hyperthreading. For this workload that is more
like 4 threads - HT is not a full x2 for this sort of continuous
processing. And pre-emptive timeslicing is not nice!
16GB RAM
512 SSD (NVMe).
The following compares loading a file in compressed vs decompressed
format - both with the parallel loader.
file:
docstrings_triples.nq
size: 28GB
cmd:
time tdb2.tdbloader --loader=parallel --loc=test1graphdb
docstrings_triples.nq > tdb2.log1 2>&1
:: Time = 1,364.310 seconds : Quads = 127,206,280 : Rate = 93,239 /s
real 22m46.346s
user 120m46.591s
sys 3m22.698s
file:
docstrings_triples.nq.bz2
size: 542M
cmd:
time tdb2.tdbloader --loader=parallel --loc=test2graphdb
docstrings_triples.nq.bz2 > tdb2.log2 2>&1
:: Time = 2,225.871 seconds : Quads = 127,206,280 : Rate = 57,149 /s
real 37m8.182s
user 109m42.970s
sys 6m27.426s
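One thing the timings above already tell us: dividing user CPU time by
wall-clock time gives the average number of busy cores. A quick sketch,
with the numbers copied from the two runs:

```shell
# Average parallelism = user CPU time / wall-clock (real) time.
# Uncompressed run: real 22m46.346s, user 120m46.591s
awk 'BEGIN { printf "uncompressed: %.1f cores busy\n", (120*60+46.591)/(22*60+46.346) }'
# Compressed run: real 37m8.182s, user 109m42.970s
awk 'BEGIN { printf "compressed:   %.1f cores busy\n", (109*60+42.970)/(37*60+8.182) }'
```

Roughly 5.3 busy cores for the uncompressed run vs 3.0 for the
compressed one - consistent with single-threaded bz2 decompression
starving the other loader threads.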
resulting DB size
30GB
confirmed equal via diff.
pbzip2 ran in 84s
Less rigorously I noticed a similar gain in speed for other files.
For gz files, the difference in loading speed between compressed and
uncompressed is usually not very much. It does look like bz2 is
significantly slower to decompress.
Using a separate process and faster decompressor may help:
time bzip2 -d < docstrings_triples.nq.bz2 | \
tdb2.tdbloader --loader=parallel --loc=test2graphdb \
-- - > tdb2.log2 2>&1
When Jena decompresses a bz2 file, it uses Apache Commons Compress, a
Java decompressor, which takes time to be optimized by the JIT and is
likely slower than a specialized native tool like bzip2.
But with 4 cores, it may have the opposite effect - using more processes
causes pre-emptive timeslicing.
It may be that one of the other loaders is faster because it is a better
match to the hardware.
Is this expected behaviour? What factors influence this?
SSD - local vs cloud.
on my local machine, when running the parallel loader, cores were
working at over 70% capacity and there was little IO-induced downtime.
How many cores were active?
And when it says "nq" is really quads or all data for the default graph?
(there is more indexing work for named graphs).
Some of that will be the bz2 decompression, but it looks to me like it's
"more threads than cores" causing timeslicing.
GCP instance specs:
20 CPU
32GB RAM
And same heap size?
While the parallel loader is using multiple threads, it is a fixed
number, so more CPUs will only help up to that thread count.
More RAM is going to help because the OS will use it for file system
cache, delaying writes.
But with more read threads, it could be there is less preemptive
scheduling and that could be a big gain.
6TB "local SSD" storage
the local SSD storage offers the best performance to reduce IO latency
- it has physical proximity to the instance - as per GCP.
A few cores were working at near capacity, while the vast majority sat
idle (near 0%) with occasional spikes; the average load translates to
20% utilization. As I've seen written here, this difference has been
noted by others as well.
How can this be addressed? Buffer size? (I don't have a deep enough
understanding.)
My guess is that on the GCP instance it is one thread-one core.
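One way to check that guess (assuming the sysstat tools are installed)
is to watch per-thread and per-core usage while the load runs:

```shell
# Per-thread CPU of the loader process, sampled every 5 s (sysstat).
pidstat -t -p "$(pgrep -f tdb2.tdbloader)" 5
# Per-core utilization for the whole machine:
mpstat -P ALL 5
```

If each loader thread is pinned near 100% of a single core while the
rest sit idle, it is one-thread-one-core rather than an IO stall.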
Another recurring pattern is the reduction in batch size.
I've been running a load job on my GCP instance for almost a day (23+h).
file size: 93GB
triples: 472m
batch size decreased from 160k range to under 1k, while processing
time per batch increased from a few seconds to over 10 min. All this
time average CPU usage has remained steady, as has RAM usage.
Not sure I quite understand - this is adding more data to an existing
database? And 10mins for 1k? While it will be slower, that does sound
rather extreme.
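To put rough numbers on how extreme - assuming ~3 s per 160k batch early
on (the "few seconds" above; the exact figure is a guess) and 10 min for
a 1k batch now:

```shell
# Early: ~160k quads in ~3 s; late: ~1k quads in ~600 s (10 min).
awk 'BEGIN { printf "early: ~%d quads/s, late: ~%.1f quads/s\n", 160000/3, 1000/600 }'
```

That is a drop of more than four orders of magnitude, which points at
something other than normal index growth.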
I don't understand how all of this works with indexing. Is this
expected behaviour? Besides a locally proximate SSD, I've thrown an
overkill of hardware at it.
thanks
Andy