> And when it says "nq", is it really quads, or all data in the default
> graph? (There is more indexing work for named graphs.)

>> : Quads = 127,206,280


OK - it's quads. There are 6 quad indexes, and in full parallel mode it will use 2 more threads: one to parse and one to build the node table.

Full parallel loading is going to use all the cores, and HT threads aren't full threads for this purpose.

The phased loader (the default) uses fewer threads.

Phase 1:
one thread to decompress and parse
one thread to build the node table
one thread for the GSPO index
(and one for SPO, but you seem to have no triples)
=3

Phase 2:
two threads
=2

Phase 3:
three threads
=3
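For reference, the loader is selected with `--loader=`, and the phased loader is the default, so the two commands below should be equivalent (file and location names are the ones from this thread; that `phased` is accepted as an explicit loader name is an assumption based on the `--loader=parallel` usage shown later):

```shell
# Phased (default) loader - peaks at 3 threads, per the phases above:
tdb2.tdbloader --loc=test1graphdb docstrings_triples.nq

# Equivalent, naming the loader explicitly:
tdb2.tdbloader --loader=phased --loc=test1graphdb docstrings_triples.nq
```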

    Andy


On 21/06/2020 22:11, Andy Seaborne wrote:
Hi there,

Thanks for reporting the findings.

On 20/06/2020 16:10, Isroel Kogan wrote:
Hi,

I am also a newcomer to the RDF world - and particularly Jena, which I started using this week.

A couple of observations I have made over the last few days exploring different options.

Local Machine (specs):

Ubuntu 18.04
Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz (8 CPU)

which is 4 cores with hyperthreading. For this workload that is more like 4 threads: HT is not a full 2x for this sort of continuous processing.

And pre-emptive timeslicing is not nice!

16GB RAM
512GB SSD (NVMe).


The following compares loading a file in compressed vs decompressed format - both with the parallel loader.

file:
docstrings_triples.nq
size: 28GB

cmd:
time tdb2.tdbloader --loader=parallel --loc=test1graphdb docstrings_triples.nq > tdb2.log1 2>&1

:: Time = 1,364.310 seconds : Quads = 127,206,280 : Rate = 93,239 /s

real    22m46.346s
user    120m46.591s
sys    3m22.698s


file:
docstrings_triples.nq.bz2
size: 542M

cmd:

time tdb2.tdbloader --loader=parallel --loc=test2graphdb docstrings_triples.nq.bz2 > tdb2.log2 2>&1

:: Time = 2,225.871 seconds : Quads = 127,206,280 : Rate = 57,149 /s


real    37m8.182s
user    109m42.970s
sys    6m27.426s

resulting DB size
30GB

confirmed equal via diff.

pbzip2 ran in 84s

Less rigorously I noticed a similar gain in speed for other files.

For gz files, the difference in loading speed between compressed and uncompressed is usually not very much.  It does look like bz2 is noticeably more expensive.

Using a separate process and faster decompressor may help:

bzip2 -d < docstrings_triples.nq.bz2 | \
time tdb2.tdbloader --loader=parallel --loc=test2graphdb \
     -- - > tdb2.log2 2>&1

When Jena decompresses a bz2 file, it uses Apache Commons Compress, so it is a Java decompressor, which will take time to get optimized by the JIT and is likely slower than a specialized tool like bzip2.

But with 4 cores, it may have the opposite effect: using more processes causes pre-emptive timeslicing.
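As a concrete sketch of the decompress-in-a-separate-process pattern: the pipeline below uses `wc -l` as a stand-in consumer so it runs anywhere; in practice the consumer would be the `tdb2.tdbloader ... -- -` command shown earlier, reading N-Quads from stdin. The sample file name and contents are made up for illustration.

```shell
# Demonstrate decompressing in a separate process and streaming the result.
# `wc -l` stands in for the real consumer, which would be
# `tdb2.tdbloader --loader=parallel --loc=DB -- -` reading from stdin.
printf '%s\n' \
  '<urn:s> <urn:p> <urn:o> <urn:g> .' \
  '<urn:s> <urn:p> <urn:o2> <urn:g> .' > sample.nq
bzip2 -kf sample.nq              # writes sample.nq.bz2, keeps the original
bzip2 -dc sample.nq.bz2 | wc -l  # the consumer sees the decompressed stream
```

If pbzip2 is installed, `pbzip2 -dc` is a drop-in replacement here that decompresses with multiple threads.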

It may be that one of the other loaders is faster because it is a better match for the hardware.

Is this expected behaviour? What factors influence this?

SSD - local vs cloud.

On my local machine, when running the parallel loader, cores were working at over 70% capacity and there was little IO-induced downtime.

How many cores were active?
And when it says "nq", is it really quads, or all data in the default graph? (There is more indexing work for named graphs.)

Some of that will be the bz2 decompression, but it looks to me like it's "more threads than cores" causing timeslicing.


GCP instance specs:

20 CPU
32GB RAM

And the same heap size?

While the parallel loader is using multiple threads, it is a fixed number, so more CPUs will help only up to that number.

More RAM is going to help because the OS will use it for file system cache, delaying writes.

But with more real cores per thread, it could be that there is less pre-emptive scheduling, and that could be a big gain.

6TB "local SSD" storage
The local SSD storage offers the best performance for reducing IO latency - it is physically close to the instance - as per GCP.

A few cores were working at near capacity, while the vast majority were idle (near 0%) with occasional spikes; the average load translates to 20% utilization. This is a difference others have noted here as well. How can this be addressed? Buffer size? (I don't have a deep enough understanding.)

My guess is that on the GCP instance it is one thread per core.



Another recurring pattern is the reduction in batch size.
I've been running a load job on my GCP instance for almost a day (23+ hours).

file size: 93GB
triples: 472m

Batch size decreased from the 160k range to under 1k, while processing time per batch increased from a few seconds to over 10 minutes. All this time average CPU usage has remained steady, as has RAM usage.

Not sure I quite understand - this is adding more data to an existing database? And 10 minutes for 1k? While it will be slower, that does sound rather extreme.


I don't understand how all of this works with indexing. Is this expected behaviour? Besides a locally proximate SSD, I've thrown an overkill of hardware at it.

thanks


     Andy
