In addition I used the OS tool in a pipe:
bunzip2 -c river_planet-latest.osm.pbf.ttl.bz2 | riot --time --count --syntax "Turtle"
Triples = 163,310,838
stdin : 717.78 sec : 163,310,838 Triples : 227,523.09 per second : 0 errors : 31 warnings
Unsurprisingly, this is more or less exactly the decompression time plus the
parsing time of the uncompressed file. It is still far faster than the Apache
Commons Compress path: even with my suggested fix, the OS variant is ~5 min faster.
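
For anyone who wants the same pipe from inside a JVM rather than a shell, a minimal sketch follows. It assumes a bunzip2 binary is on the PATH. The class and method names (ExternalDecompress, openViaCommand, countBytes) are made up for illustration and are not part of Jena.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.List;

public class ExternalDecompress {

    /** Starts the given command and returns its standard output as a buffered stream. */
    static InputStream openViaCommand(List<String> command) throws IOException {
        Process process = new ProcessBuilder(command)
                // Send the tool's stderr to our stderr so the child never blocks on a full pipe.
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        return new BufferedInputStream(process.getInputStream());
    }

    /** Drains the stream and returns the number of bytes read. */
    static long countBytes(InputStream in) throws IOException {
        byte[] buffer = new byte[64 * 1024];
        long total = 0;
        for (int n; (n = in.read(buffer)) != -1; ) {
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: args[0] is the path to a .bz2 file and bunzip2 is installed.
        try (InputStream in = openViaCommand(List.of("bunzip2", "-c", args[0]))) {
            System.out.println(countBytes(in) + " bytes decompressed");
        }
    }
}
```

The returned stream could be handed to a parser in place of a file stream; the byte counter here only stands in for that consumer.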
On 29.08.22 11:24, Lorenz Buehmann wrote:
riot --time --count river_planet-latest.osm.pbf.ttl
Triples = 163,310,838
351.00 sec : 163,310,838 Triples : 465,271.72 per second : 0 errors : 31 warnings
riot --time --count river_planet-latest.osm.pbf.ttl.gz
Triples = 163,310,838
431.74 sec : 163,310,838 Triples : 378,258.50 per second : 0 errors : 31 warnings
riot --time --count river_planet-latest.osm.pbf.ttl.bz2
Triples = 163,310,838
9,948.17 sec : 163,310,838 Triples : 16,416.17 per second : 0 errors : 31 warnings
Takes ages with Bzip2 ... there must be something going wrong ...
We checked the code and the Apache Commons Compress docs, and a colleague
spotted this hint at
https://commons.apache.org/proper/commons-compress/examples.html#Buffering:
The stream classes all wrap around streams provided by the calling
code and they work on them directly without any additional buffering.
On the other hand most of them will benefit from buffering so it is
highly recommended that users wrap their stream in
Buffered(In|Out)putStreams before using the Commons Compress API.
We were curious about this statement, so we checked the
org.apache.jena.atlas.io.IO class and added one line in openFileEx:
in = new BufferedInputStream(in);
This wraps the file stream before it is passed to the decompressor streams.
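
To make the wrapping order concrete, here is a minimal standalone sketch of the idea. It is not the actual Jena code: openBuffered and the Decompressor interface are hypothetical names, and gzip from the JDK stands in for bzip2 so the example runs without extra dependencies. With Commons Compress on the classpath, BZip2CompressorInputStream::new should fit the same hook.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class BufferedDecompress {

    /** Hypothetical hook for any decompressor constructor that takes an InputStream. */
    @FunctionalInterface
    interface Decompressor {
        InputStream wrap(InputStream raw) throws IOException;
    }

    /** Opens the file, buffers the raw stream, then hands it to the decompressor. */
    static InputStream openBuffered(Path file, Decompressor decompressor) throws IOException {
        InputStream in = Files.newInputStream(file);
        try {
            // The one-line fix: buffer the file stream *before* the decompressor sees it.
            in = new BufferedInputStream(in, 128 * 1024);
            return decompressor.wrap(in);
        } catch (IOException | RuntimeException e) {
            in.close(); // do not leak the file handle if the decompressor rejects the input
            throw e;
        }
    }

    public static void main(String[] args) throws IOException {
        // Write a tiny gzip file so the sketch is self-contained.
        Path file = Files.createTempFile("demo", ".gz");
        try (GZIPOutputStream out = new GZIPOutputStream(Files.newOutputStream(file))) {
            out.write("<s> <p> <o> .\n".getBytes(StandardCharsets.UTF_8));
        }
        try (InputStream in = openBuffered(file, GZIPInputStream::new)) {
            System.out.print(new String(in.readAllBytes(), StandardCharsets.UTF_8));
        }
        Files.delete(file);
    }
}
```

The 128 KiB buffer size is an arbitrary choice for the sketch; the one-line change above uses the BufferedInputStream default.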
Running the parse again:
riot --time --count river_planet-latest.osm.pbf.ttl.bz2
(Jena 4.7.0-SNAPSHOT fork with a BufferedInputStream wrapping the file stream in the IO class)
Triples = 163,310,838
1,004.68 sec : 163,310,838 Triples : 162,550.10 per second : 0 errors : 31 warnings
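
To put the four runs side by side, the small sketch below recomputes throughput and the slowdown relative to the uncompressed file from the timings reported in this message. The class and method names are illustrative only. Recomputed rates can differ slightly from riot's own figures because the printed seconds are rounded.

```java
import java.util.Locale;

public class RiotTimings {

    /** Throughput in triples per second. */
    static double triplesPerSecond(long triples, double seconds) {
        return triples / seconds;
    }

    /** How many times slower a run is than the baseline run. */
    static double slowdown(double seconds, double baselineSeconds) {
        return seconds / baselineSeconds;
    }

    public static void main(String[] args) {
        long triples = 163_310_838L;
        // Wall-clock seconds as reported by riot --time in this message.
        String[] labels = {"plain", "gz", "bz2 (unbuffered)", "bz2 (buffered fork)"};
        double[] seconds = {351.00, 431.74, 9_948.17, 1_004.68};

        for (int i = 0; i < labels.length; i++) {
            System.out.printf(Locale.ROOT, "%-20s %12.2f triples/s  %6.2fx plain%n",
                    labels[i],
                    triplesPerSecond(triples, seconds[i]),
                    slowdown(seconds[i], seconds[0]));
        }
    }
}
```

By this arithmetic the unbuffered bz2 run is roughly 28 times slower than the plain file, and the buffered fork brings that down to a little under 3 times.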
What do you think?
On 28.08.22 14:22, Andy Seaborne wrote:
If you are relying on Jena to do the bz2 decompress, then it is
using Commons Compress.
gz is done (via Commons Compress) in native code. I use gz and if I
get a bz2 file, I decompress it with OS tools.
Could you try an experiment please?
Run on the same hardware as the loader was run:
riot --time --count river_planet-latest.osm.pbf.ttl
riot --time --count river_planet-latest.osm.pbf.ttl.bz2
Andy
gz vs plain: NVMe m2 SSD : Dell XPS 13 9310
riot --time --count .../BSBM/bsbm-25m.nt.gz
Triples = 24,997,044
118.02 sec : 24,997,044 Triples : 211,808.84 per second
riot --time --count .../BSBM/bsbm-25m.nt
Triples = 24,997,044
109.97 sec : 24,997,044 Triples : 227,314.05 per second