> And when it says "nq" is really quads or all data for the default
> graph? (there is more indexing work for named graphs).
>> : Quads = 127,206,280
OK - it's quads. There are 6 quad indexes and in full parallel mode it
will use 2 more threads to parse and to build the node table.
Full parallel loading is going to use up all the cores, and HT threads
aren't full threads for this purpose.
The phased loader (the default) uses fewer threads.
Phase 1:
one thread to decompress and parse
one thread to build the node table.
one thread for the GSPO index
(and one for SPO, but you seem to have no triples)
=3
Phase 2:
two threads
=2
Phase 3:
three threads
=3
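If thread count is the issue on a 4-core machine, it is easy to compare
loaders directly - the loader names below are the ones I believe current
TDB2 ships with; check `tdb2.tdbloader --help` for your version:

```shell
# Same data, different loaders; "phased" is the default.
time tdb2.tdbloader --loader=phased     --loc=db-phased   docstrings_triples.nq
time tdb2.tdbloader --loader=parallel   --loc=db-parallel docstrings_triples.nq
time tdb2.tdbloader --loader=sequential --loc=db-seq      docstrings_triples.nq
```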
Andy
On 21/06/2020 22:11, Andy Seaborne wrote:
Hi there,
Thanks for reporting the findings.
On 20/06/2020 16:10, Isroel Kogan wrote:
Hi,
I am also a newcomer to the RDF world - and particularly Jena, which I
started using this week.
A couple of observations I have made over the last few days exploring
different options.
Local Machine (specs):
Ubuntu 18.04
Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz (8 CPU)
which is 4 cores with hyperthreading. For this workload that is more
like 4 threads - HT is not a full x2 for this sort of continuous
processing. And pre-emptive timeslicing is not nice!
16GB RAM
512 SSD (NVMe).
The following compares loading a file in compressed vs decompressed
format - both with the parallel loader.
file:
docstrings_triples.nq
size: 28GB
cmd:
time tdb2.tdbloader --loader=parallel --loc=test1graphdb
docstrings_triples.nq > tdb2.log1 2>&1
:: Time = 1,364.310 seconds : Quads = 127,206,280 : Rate = 93,239 /s
real 22m46.346s
user 120m46.591s
sys 3m22.698s
file:
docstrings_triples.nq.bz2
size: 542M
cmd:
time tdb2.tdbloader --loader=parallel --loc=test2graphdb
docstrings_triples.nq.bz2 > tdb2.log2 2>&1
:: Time = 2,225.871 seconds : Quads = 127,206,280 : Rate = 57,149 /s
real 37m8.182s
user 109m42.970s
sys 6m27.426s
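One thing the timings above already tell us: dividing user CPU time by
wall-clock time gives the average number of busy cores. A quick sketch,
with the numbers copied from the two runs:

```shell
# Average parallelism = user CPU time / wall-clock (real) time.
# Uncompressed run: real 22m46.346s, user 120m46.591s
awk 'BEGIN { printf "uncompressed: %.1f cores busy\n", (120*60+46.591)/(22*60+46.346) }'
# Compressed run: real 37m8.182s, user 109m42.970s
awk 'BEGIN { printf "compressed:   %.1f cores busy\n", (109*60+42.970)/(37*60+8.182) }'
```

Roughly 5.3 busy cores for the uncompressed run vs 3.0 for the
compressed one - consistent with single-threaded bz2 decompression
starving the other loader threads.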
resulting DB size
30GB
confirmed equal via diff.
pbzip2 ran in 84s
Less rigorously I noticed a similar gain in speed for other files.
For gz files, the difference in loading speed between compressed and
uncompressed is usually not very much. It does look like bz2 is
significantly slower to decompress.
Using a separate process and faster decompressor may help:
time bzip2 -d < docstrings_triples.nq.bz2 | \
tdb2.tdbloader --loader=parallel --loc=test2graphdb \
-- - > tdb2.log2 2>&1
When Jena decompresses a bz2 file, it uses Apache Commons Compress, a
Java decompressor, which takes time to be optimized by the JIT and is
likely slower than a specialized native tool like bzip2.
But with 4 cores, it may have the opposite effect - using more processes
causes pre-emptive timeslicing.
It may be that one of the other loaders is faster because it is a better
match to the hardware.
Is this expected behaviour? What factors influence this?
SSD - local vs cloud.
on my local machine, when running the parallel loader, cores were
working at over 70% capacity and there was little IO-induced downtime.
How many cores were active?
And when it says "nq" is really quads or all data for the default graph?
(there is more indexing work for named graphs).
Some of that will be the bz2 decompression, but it looks to me like it's
"more threads than cores" causing timeslicing.
GCP instance specs:
20 CPU
32GB RAM
And same heap size?
While the parallel loader is using multiple threads, it is a fixed
number, so more CPUs will only help up to that thread count.
More RAM is going to help because the OS will use it for file system
cache, delaying writes.
But with more read threads, it could be there is less preemptive
scheduling and that could be a big gain.
6TB "local SSD" storage
the local SSD storage offers the best performance to reduce IO latency
- it has physical proximity to the instance - as per GCP.
A few cores were working at near capacity, while the vast majority sat
idle (near 0%) with occasional spikes; the average load translates to
20% utilization. As I've seen written here, this difference has been
noted by others as well.
How can this be addressed? Buffer size? (I don't have a deep enough
understanding.)
My guess is that on the GCP instance it is one thread-one core.
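One way to check that guess (assuming the sysstat tools are installed)
is to watch per-thread and per-core usage while the load runs:

```shell
# Per-thread CPU of the loader process, sampled every 5 s (sysstat).
pidstat -t -p "$(pgrep -f tdb2.tdbloader)" 5
# Per-core utilization for the whole machine:
mpstat -P ALL 5
```

If each loader thread is pinned near 100% of a single core while the
rest sit idle, it is one-thread-one-core rather than an IO stall.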
Another recurring pattern is the reduction in batch size.
I've been running a load job on my GCP instance for almost a day (23+h).
file size: 93GB
triples: 472m
batch size decreased from 160k range to under 1k, while processing
time per batch increased from a few seconds to over 10 min. All this
time average CPU usage has remained steady, as has RAM usage.
Not sure I quite understand - this is adding more data to an existing
database? And 10mins for 1k? While it will be slower, that does sound
rather extreme.
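To put rough numbers on how extreme - assuming ~3 s per 160k batch early
on (the "few seconds" above; the exact figure is a guess) and 10 min for
a 1k batch now:

```shell
# Early: ~160k quads in ~3 s; late: ~1k quads in ~600 s (10 min).
awk 'BEGIN { printf "early: ~%d quads/s, late: ~%.1f quads/s\n", 160000/3, 1000/600 }'
```

That is a drop of more than four orders of magnitude, which points at
something other than normal index growth.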
I don't understand how all of this works with indexing. Is this
expected behaviour? Besides a locally proximate SSD, I've thrown an
overkill of hardware at it.
thanks
Andy