I spotted an interesting difference in the performance gain when using a smaller dataset for Europe:

On the server we have

- the ZFS raid with less powerful drives, i.e. only SATA SSDs (4 x Samsung 870 QVO)

- a 2 TB NVMe drive mounted separately


On the ZFS raid:

    with Jena 4.6.0:

        Triples = 54,821,333
        3,047.89 sec : 54,821,333 Triples : 17,986.64 per second : 0 errors : 10 warnings

    with Jena 4.7.0 patched with the BufferedInputStream wrapper:

        Triples = 54,821,333
        308.05 sec : 54,821,333 Triples : 177,963.61 per second : 0 errors : 10 warnings


On the NVMe:

    with Jena 4.6.0:

        Triples = 54,821,333
        824.11 sec : 54,821,333 Triples : 66,521.62 per second : 0 errors : 10 warnings

    with Jena 4.7.0 patched with the BufferedInputStream wrapper:

        Triples = 54,821,333
        303.07 sec : 54,821,333 Triples : 180,888.49 per second : 0 errors : 10 warnings


Observations:

- on the ZFS raid, the buffered stream makes it roughly 10x faster

- on the NVMe disk, it is "only" about 3x faster


It looks like the Bzip2 implementation in Apache Commons Compress does lots of small I/O operations, which is why it suffers much more from the missing buffered stream on the ZFS raid than on the faster NVMe disk.

Nevertheless, it is always worth using the buffered stream.
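
For anyone who wants to reproduce the effect in isolation, here is a minimal sketch (the class name, the 128 KiB buffer size and the plain timing loop are my own choices, not something from Jena or this thread) that decompresses the same .bz2 file once through a raw FileInputStream and once through a BufferedInputStream, both via Commons Compress:

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.InputStream;

    import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;

    public class Bzip2BufferingCheck {
        public static void main(String[] args) throws Exception {
            String file = args[0]; // e.g. river_europe-latest.osm.pbf.ttl.bz2
            // Note: the second run benefits from the OS page cache, so repeat or
            // swap the order if you want a fair comparison.
            System.out.println("unbuffered: " + secs(new FileInputStream(file)) + " s");
            System.out.println("buffered  : " + secs(new BufferedInputStream(new FileInputStream(file), 128 * 1024)) + " s");
        }

        // Decompress the whole stream, discard the output, return wall-clock seconds.
        static long secs(InputStream raw) throws Exception {
            long start = System.nanoTime();
            try (InputStream in = new BZip2CompressorInputStream(raw)) {
                byte[] buf = new byte[8192];
                while (in.read(buf) != -1) {
                    // drop the decompressed bytes, we only care about the read pattern
                }
            }
            return (System.nanoTime() - start) / 1_000_000_000L;
        }
    }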


On 29.08.22 15:53, Simon Bin wrote:
I was asked to try it on my system (Samsung 970 Evo+ NVMe, Intel 11850H), but I used a slightly smaller data set (river_europe); it is not quite as bad as on Lorenz's system, but the buffering helps nevertheless:

main      : river_europe-latest.osm.pbf.ttl.bz2   : 815.14 sec : 72,098,221 Triples :  88,449.21 per second : 0 errors : 10 warnings
fix/bzip2 : river_europe-latest.osm.pbf.ttl.bz2   : 376.64 sec : 72,098,221 Triples : 191,424.76 per second : 0 errors : 10 warnings
pbzip2 -dc  river_europe-latest.osm.pbf.ttl.bz2 | : 155.24 sec : 72,098,221 Triples : 464,442.66 per second : 0 errors : 10 warnings
            river_europe-latest.osm.pbf.ttl       : 136.92 sec : 72,098,221 Triples : 526,587.26 per second : 0 errors : 10 warnings

Cheers,

On Mon, 2022-08-29 at 13:09 +0200, Lorenz Buehmann wrote:
In addition I used the OS tool in a pipe:

bunzip2 -c river_planet-latest.osm.pbf.ttl.bz2 | riot --time --count
--syntax "Turtle"

Triples = 163,310,838
stdin           : 717.78 sec : 163,310,838 Triples : 227,523.09 per second : 0 errors : 31 warnings


Unsurprisingly, this is more or less exactly the decompression time plus the parsing time of the uncompressed file - still way faster than the Apache Commons one; even with my suggested fix, the OS variant is ~5 min faster.


On 29.08.22 11:24, Lorenz Buehmann wrote:
riot --time --count river_planet-latest.osm.pbf.ttl

Triples = 163,310,838
351.00 sec : 163,310,838 Triples : 465,271.72 per second : 0 errors : 31 warnings


riot --time --count river_planet-latest.osm.pbf.ttl.gz

Triples = 163,310,838
431.74 sec : 163,310,838 Triples : 378,258.50 per second : 0 errors : 31 warnings


riot --time --count river_planet-latest.osm.pbf.ttl.bz2

Triples = 163,310,838
9,948.17 sec : 163,310,838 Triples : 16,416.17 per second : 0 errors : 31 warnings


Takes ages with Bzip2 ... there must be something going wrong ...


We checked the code and the Apache Commons Compress docs; a colleague spotted the hint at
https://commons.apache.org/proper/commons-compress/examples.html#Buffering :

    The stream classes all wrap around streams provided by the calling code and they work on them directly without any additional buffering. On the other hand most of them will benefit from buffering so it is highly recommended that users wrap their stream in Buffered(In|Out)putStreams before using the Commons Compress API.

We were curious about this statement, checked the org.apache.jena.atlas.io.IO class, and added one line in openFileEx:

    in = new BufferedInputStream(in);

which wraps the file stream before it is passed to the decompressor streams.
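
To make the pattern explicit, here is an illustrative sketch (class and method names are made up for this mail, it is not the actual body of openFileEx): the raw file stream gets buffered before the Commons Compress decompressor is layered on top of it.

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;

    class BufferedBzip2Open {
        static InputStream openBzip2(String filename) throws IOException {
            InputStream in = new FileInputStream(filename);
            in = new BufferedInputStream(in);            // the added line: buffer the raw file reads
            return new BZip2CompressorInputStream(in);   // the decompressor now reads from the buffer
        }
    }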


Running the parsing again:


riot --time --count river_planet-latest.osm.pbf.ttl.bz2 (Jena 4.7.0-SNAPSHOT fork with a BufferedInputStream wrapping the file stream in the IO class)

Triples = 163,310,838
1,004.68 sec : 163,310,838 Triples : 162,550.10 per second : 0 errors : 31 warnings


What do you think?


On 28.08.22 14:22, Andy Seaborne wrote:
If you are relying on Jena to do the bz2 decompression, then it is using Commons Compress.

gz is done (via Commons Compress) in native code. I use gz, and if I get a bz2 file, I decompress it with OS tools.

Could you try an experiment, please?

Run these on the same hardware that the loader was run on:

riot --time --count river_planet-latest.osm.pbf.ttl
riot --time --count river_planet-latest.osm.pbf.ttl.bz2

     Andy

gz vs plain: NVMe m2 SSD : Dell XPS 13 9310

riot --time --count .../BSBM/bsbm-25m.nt.gz
Triples = 24,997,044
118.02 sec : 24,997,044 Triples : 211,808.84 per second

riot --time --count .../BSBM/bsbm-25m.nt
Triples = 24,997,044
109.97 sec : 24,997,044 Triples : 227,314.05 per second
