I was asked to try it on my system (Samsung 970 EVO Plus NVMe, Intel
11850H), but with a slightly smaller data set (river_europe). It is
not quite as bad as on Lorenz's machine, but the buffering would help
nevertheless:
main : river_europe-latest.osm.pbf.ttl.bz2
  815.14 sec : 72,098,221 Triples : 88,449.21 per second : 0 errors : 10 warnings

fix/bzip2 : river_europe-latest.osm.pbf.ttl.bz2
  376.64 sec : 72,098,221 Triples : 191,424.76 per second : 0 errors : 10 warnings

pbzip2 -dc river_europe-latest.osm.pbf.ttl.bz2 |
  155.24 sec : 72,098,221 Triples : 464,442.66 per second : 0 errors : 10 warnings

river_europe-latest.osm.pbf.ttl
  136.92 sec : 72,098,221 Triples : 526,587.26 per second : 0 errors : 10 warnings
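
To illustrate why the extra buffer matters, here is a minimal Java sketch
(not the Jena or Commons Compress code). It assumes the decompressor issues
many small reads against the stream it wraps, as the Commons Compress note
on buffering suggests. The names BufferingSketch, CountingInputStream and
callsReachingSource are made up for this illustration, and an in-memory
byte array stands in for the file stream:

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

class BufferingSketch {
    /** Hypothetical wrapper: counts read calls that reach the wrapped "file" stream. */
    static final class CountingInputStream extends FilterInputStream {
        long calls = 0;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            calls++;
            return super.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            calls++;
            return super.read(b, off, len);
        }
    }

    /** Drains the stream one byte at a time, standing in for a consumer doing small reads. */
    static long drainByteWise(InputStream in) throws IOException {
        long n = 0;
        while (in.read() != -1) {
            n++;
        }
        return n;
    }

    /** Returns how many read calls hit the source, with or without a buffer in between. */
    static long callsReachingSource(int size, boolean buffered) throws IOException {
        CountingInputStream source =
                new CountingInputStream(new ByteArrayInputStream(new byte[size]));
        // Same idea as the one-line fix: wrap the raw stream before handing it on.
        InputStream in = buffered ? new BufferedInputStream(source, 8192) : source;
        drainByteWise(in);
        return source.calls;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("unbuffered calls: " + callsReachingSource(65536, false));
        System.out.println("buffered calls:   " + callsReachingSource(65536, true));
    }
}
```

Without the buffer, every single-byte read reaches the source; with it, the
source is only touched once per buffer refill. On a real file stream each of
those calls can be a system call, which would be consistent with the
difference between main and fix/bzip2 above.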
Cheers,
On Mon, 2022-08-29 at 13:09 +0200, Lorenz Buehmann wrote:
> In addition I used the OS tool in a pipe:
>
> bunzip2 -c river_planet-latest.osm.pbf.ttl.bz2 | riot --time --count
> --syntax "Turtle"
>
> Triples = 163,310,838
> stdin : 717.78 sec : 163,310,838 Triples : 227,523.09 per second : 0 errors : 31 warnings
>
>
> Unsurprisingly, this is more or less exactly the decompression time
> plus the parsing time of the uncompressed file. It is still way faster
> than the Apache Commons one: even with my suggested fix, the OS
> variant is ~5 min faster.
>
>
> On 29.08.22 11:24, Lorenz Buehmann wrote:
> > riot --time --count river_planet-latest.osm.pbf.ttl
> >
> > Triples = 163,310,838
> > 351.00 sec : 163,310,838 Triples : 465,271.72 per second : 0 errors : 31 warnings
> >
> >
> > riot --time --count river_planet-latest.osm.pbf.ttl.gz
> >
> > Triples = 163,310,838
> > 431.74 sec : 163,310,838 Triples : 378,258.50 per second : 0 errors : 31 warnings
> >
> >
> > riot --time --count river_planet-latest.osm.pbf.ttl.bz2
> >
> > Triples = 163,310,838
> > 9,948.17 sec : 163,310,838 Triples : 16,416.17 per second : 0 errors : 31 warnings
> >
> >
> > Takes ages with Bzip2 ... there must be something going wrong ...
> >
> >
> > We checked the code and the Apache Commons Compress docs; a colleague
> > spotted the hint at
> > https://commons.apache.org/proper/commons-compress/examples.html#Buffering
> > :
> >
> > > The stream classes all wrap around streams provided by the
> > > calling
> > > code and they work on them directly without any additional
> > > buffering.
> > > On the other hand most of them will benefit from buffering so it
> > > is
> > > highly recommended that users wrap their stream in
> > > Buffered(In|Out)putStreams before using the Commons Compress API.
> > We were curious about this statement, so we checked the
> > org.apache.jena.atlas.io.IO class and added one line in openFileEx:
> >
> > in = new BufferedInputStream(in);
> >
> > This wraps the file stream before it is passed to the decompressor
> > streams.
> >
> >
> > Ran the parsing again:
> >
> >
> > riot --time --count river_planet-latest.osm.pbf.ttl.bz2 (Jena
> > 4.7.0-SNAPSHOT fork with a BufferedInputStream wrapping the file
> > stream in IO class)
> >
> > Triples = 163,310,838
> > 1,004.68 sec : 163,310,838 Triples : 162,550.10 per second : 0 errors : 31 warnings
> >
> >
> > What do you think?
> >
> >
> > On 28.08.22 14:22, Andy Seaborne wrote:
> > >
> > > >
> > > > If you are relying on Jena to do the bz2 decompress, then it is
> > > > using Commons Compress.
> > > >
> > > > gz is done (via Commons Compress) in native code. I use gz and
> > > > if I
> > > > get a bz2 file, I decompress it with OS tools.
> > >
> > > Could you try an experiment please?
> > >
> > > Run on the same hardware as the loader was run:
> > >
> > > riot --time --count river_planet-latest.osm.pbf.ttl
> > > riot --time --count river_planet-latest.osm.pbf.ttl.bz2
> > >
> > > Andy
> > >
> > > gz vs plain: NVMe m2 SSD : Dell XPS 13 9310
> > >
> > > riot --time --count .../BSBM/bsbm-25m.nt.gz
> > > Triples = 24,997,044
> > > 118.02 sec : 24,997,044 Triples : 211,808.84 per second
> > >
> > > riot --time --count .../BSBM/bsbm-25m.nt
> > > Triples = 24,997,044
> > > 109.97 sec : 24,997,044 Triples : 227,314.05 per second
>