I was asked to try it on my system (Samsung 970 EVO Plus NVMe, Intel
11850H), but I used a slightly smaller data set (river_europe). The
slowdown is not quite as bad as on Lorenz's machine, but the buffering
would help nevertheless:

main      : river_europe-latest.osm.pbf.ttl.bz2   : 815.14 sec : 72,098,221 Triples :  88,449.21 per second : 0 errors : 10 warnings
fix/bzip2 : river_europe-latest.osm.pbf.ttl.bz2   : 376.64 sec : 72,098,221 Triples : 191,424.76 per second : 0 errors : 10 warnings
pbzip2 -dc  river_europe-latest.osm.pbf.ttl.bz2 | : 155.24 sec : 72,098,221 Triples : 464,442.66 per second : 0 errors : 10 warnings
            river_europe-latest.osm.pbf.ttl       : 136.92 sec : 72,098,221 Triples : 526,587.26 per second : 0 errors : 10 warnings
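
For anyone who wants to see why the one-line fix quoted below can matter so much, here is a minimal sketch using only the JDK. It assumes the decompressor pulls data from the stream it wraps in very small reads, which is what the Commons Compress buffering note suggests; I have not measured Jena's actual read pattern this way. CountingInputStream, drainByteByByte and underlyingReads are hypothetical names made up for this illustration, and the byte-by-byte loop merely stands in for a decompressor such as BZip2CompressorInputStream.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

final class BufferingSketch {

    /** Hypothetical helper: counts how often the wrapped source is asked for data. */
    static final class CountingInputStream extends FilterInputStream {
        long calls = 0;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            calls++;
            return super.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            calls++;
            return super.read(b, off, len);
        }
    }

    /** Stand-in for a decompressor that consumes its input one byte at a time. */
    static long drainByteByByte(InputStream in) throws IOException {
        long n = 0;
        while (in.read() != -1) {
            n++;
        }
        return n;
    }

    /** Number of read calls that reach the source, with or without a buffer in between. */
    static long underlyingReads(byte[] data, boolean buffered) throws IOException {
        CountingInputStream source = new CountingInputStream(new ByteArrayInputStream(data));
        // Same idea as the patch: in = new BufferedInputStream(in);
        InputStream in = buffered ? new BufferedInputStream(source) : source;
        drainByteByByte(in);
        return source.calls;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[1 << 20]; // 1 MiB of dummy input
        System.out.println("reads reaching the source, unbuffered: " + underlyingReads(data, false));
        System.out.println("reads reaching the source, buffered:   " + underlyingReads(data, true));
    }
}
```

With a real file stream each of those source reads can end up as a system call, so cutting their number is plausibly where the difference between main and fix/bzip2 above comes from.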

Cheers,

On Mon, 2022-08-29 at 13:09 +0200, Lorenz Buehmann wrote:
> In addition I used the OS tool in a pipe:
> 
> bunzip2 -c river_planet-latest.osm.pbf.ttl.bz2 | riot --time --count --syntax "Turtle"
> 
> Triples = 163,310,838
> stdin           : 717.78 sec : 163,310,838 Triples : 227,523.09 per second : 0 errors : 31 warnings
> 
> 
> Unsurprisingly, that is more or less exactly the decompression time plus
> the parsing time of the uncompressed file. It is still way faster than
> the Apache Commons one: even with my suggested fix, the OS variant is
> ~5 min faster.
> 
> 
> On 29.08.22 11:24, Lorenz Buehmann wrote:
> > riot --time --count river_planet-latest.osm.pbf.ttl
> > 
> > Triples = 163,310,838
> > 351.00 sec : 163,310,838 Triples : 465,271.72 per second : 0 errors : 31 warnings
> > 
> > 
> > riot --time --count river_planet-latest.osm.pbf.ttl.gz
> > 
> > Triples = 163,310,838
> > 431.74 sec : 163,310,838 Triples : 378,258.50 per second : 0 errors : 31 warnings
> > 
> > 
> > riot --time --count river_planet-latest.osm.pbf.ttl.bz2
> > 
> > Triples = 163,310,838
> > 9,948.17 sec : 163,310,838 Triples : 16,416.17 per second : 0 errors : 31 warnings
> > 
> > 
> > Takes ages with Bzip2 ... there must be something going wrong ...
> > 
> > 
> > We checked the code and the Apache Commons Compress docs, and a colleague
> > spotted the hint at
> > https://commons.apache.org/proper/commons-compress/examples.html#Buffering :
> > 
> > > The stream classes all wrap around streams provided by the calling
> > > code and they work on them directly without any additional buffering.
> > > On the other hand most of them will benefit from buffering so it is
> > > highly recommended that users wrap their stream in
> > > Buffered(In|Out)putStreams before using the Commons Compress API.
> >
> > We were curious about this statement, checked the
> > org.apache.jena.atlas.io.IO class, and added one line in openFileEx:
> > 
> > in = new BufferedInputStream(in);
> > 
> > which wraps the file stream before it is passed to the decompressor
> > streams.
> > 
> > 
> > Running the parse again:
> > 
> > 
> > riot --time --count river_planet-latest.osm.pbf.ttl.bz2
> > (Jena 4.7.0-SNAPSHOT fork with a BufferedInputStream wrapping the file
> > stream in the IO class)
> > 
> > Triples = 163,310,838
> > 1,004.68 sec : 163,310,838 Triples : 162,550.10 per second : 0 errors : 31 warnings
> > 
> > 
> > What do you think?
> > 
> > 
> > On 28.08.22 14:22, Andy Seaborne wrote:
> > > 
> > > > 
> > > > If you are relying on Jena to do the bz2 decompress, then it is
> > > > using Commons Compress.
> > > > 
> > > > gz is done (via Commons Compress) in native code. I use gz and if I
> > > > get a bz2 file, I decompress it with OS tools.
> > > 
> > > Could you try an experiment please?
> > > 
> > > Run on the same hardware as the loader was run:
> > > 
> > > riot --time --count river_planet-latest.osm.pbf.ttl
> > > riot --time --count river_planet-latest.osm.pbf.ttl.bz2
> > > 
> > >     Andy
> > > 
> > > gz vs plain: NVMe m2 SSD : Dell XPS 13 9310
> > > 
> > > riot --time --count .../BSBM/bsbm-25m.nt.gz
> > > Triples = 24,997,044
> > > 118.02 sec : 24,997,044 Triples : 211,808.84 per second
> > > 
> > > riot --time --count .../BSBM/bsbm-25m.nt
> > > Triples = 24,997,044
> > > 109.97 sec : 24,997,044 Triples : 227,314.05 per second
> 
