On 29/08/2022 18:58, Andy Seaborne wrote:


On 29/08/2022 10:24, Lorenz Buehmann wrote:
...

We checked the code and the Apache Commons Compress docs, and a colleague spotted the hint at https://commons.apache.org/proper/commons-compress/examples.html#Buffering :

The stream classes all wrap around streams provided by the calling code and they work on them directly without any additional buffering. On the other hand most of them will benefit from buffering so it is highly recommended that users wrap their stream in Buffered(In|Out)putStreams before using the Commons Compress API.
We were curious about this statement, so we checked the org.apache.jena.atlas.io.IO class and added one line in openFileEx:

in = new BufferedInputStream(in);

which wraps the file stream before it is passed to the decompressor streams.


Running the parse again:


riot --time --count river_planet-latest.osm.pbf.ttl.bz2
(Jena 4.7.0-SNAPSHOT fork with a BufferedInputStream wrapping the file stream in the IO class)

Triples = 163,310,838
1,004.68 sec : 163,310,838 Triples : 162,550.10 per second : 0 errors : 31 warnings


What do you think?

Yes.

IO.ensureBuffered.

It buffers if not already buffered and if not a ByteArrayInputStream.
It also makes all buffering findable in the IDE.
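
As a rough illustration of the rule just described (buffer unless already buffered, and unless it is a ByteArrayInputStream), here is a minimal sketch using only the JDK. It is not the Jena source: the class name EnsureBufferedSketch and the BUFFER_SIZE value are assumptions made for this example.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.InputStream;

class EnsureBufferedSketch {
    // Hypothetical buffer size for illustration; not taken from Jena.
    static final int BUFFER_SIZE = 128 * 1024;

    /** Wrap the stream in a BufferedInputStream unless that would be redundant. */
    static InputStream ensureBuffered(InputStream in) {
        if (in == null)
            return null;
        // Already buffered: wrapping again only adds another copy step.
        if (in instanceof BufferedInputStream)
            return in;
        // Memory-backed: there is no I/O cost to amortise.
        if (in instanceof ByteArrayInputStream)
            return in;
        return new BufferedInputStream(in, BUFFER_SIZE);
    }

    public static void main(String[] args) {
        InputStream raw = new java.io.PushbackInputStream(new ByteArrayInputStream(new byte[] {1, 2, 3}));
        InputStream buffered = ensureBuffered(raw);
        System.out.println(buffered.getClass().getSimpleName());
        // A second call returns the same object instead of stacking buffers.
        System.out.println(ensureBuffered(buffered) == buffered);
    }
}
```

Keeping the decision in one helper is what makes every buffering site easy to find from the IDE.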

RIOT buffers (128K char buffer), so calls down through chars-UTF8-bytes are made in chunks.  Your observation indicates BZip2CompressorInputStream is not exploiting read(byte[] dest) calls ... yep - it's a loop calling the internal one-byte "read0".
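
To see why a byte-at-a-time consumer benefits from a BufferedInputStream underneath it, the following sketch counts how many read calls reach the source stream with and without buffering. It is JDK-only and does not use Commons Compress; drainByteAtATime merely stands in for a decoder built on single-byte reads, and all class and method names here are made up for the example.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class ReadCallCount {
    /** Counts how many read calls reach the wrapped stream. */
    static class CountingInputStream extends FilterInputStream {
        long calls = 0;
        CountingInputStream(InputStream in) { super(in); }
        @Override public int read() throws IOException {
            calls++;
            return super.read();
        }
        @Override public int read(byte[] b, int off, int len) throws IOException {
            calls++;
            return super.read(b, off, len);
        }
    }

    /** Drain a stream one byte at a time, as a single-byte-read decoder would. */
    static long drainByteAtATime(InputStream in) throws IOException {
        long n = 0;
        while (in.read() != -1)
            n++;
        return n;
    }

    /** Read calls that hit the source when the consumer reads it directly. */
    static long callsUnbuffered(byte[] data) {
        try {
            CountingInputStream src = new CountingInputStream(new ByteArrayInputStream(data));
            drainByteAtATime(src);
            return src.calls;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Read calls that hit the source when a BufferedInputStream sits in between. */
    static long callsBuffered(byte[] data) {
        try {
            CountingInputStream src = new CountingInputStream(new ByteArrayInputStream(data));
            drainByteAtATime(new BufferedInputStream(src));
            return src.calls;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = new byte[1 << 20];
        System.out.println("unbuffered source reads: " + callsUnbuffered(data));
        System.out.println("buffered source reads:   " + callsBuffered(data));
    }
}
```

With a real file stream each of those source reads can be a system call, which is where the cost would come from.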

GZIPInputStream has a default 512 byte buffer - maybe a bigger one there will help a bit.

A quick test on BSBM-25 million...

Adding buffering in gzip caused a 0.1% slow-down. (Data from SSD)
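
For anyone wanting to repeat the gzip experiment, java.util.zip.GZIPInputStream takes the internal buffer size as a second constructor argument (the one-argument constructor uses 512 bytes). A minimal JDK-only sketch follows; the 64K value and the class name GzipBufferSize are arbitrary choices for illustration, and this is not necessarily how the test above was run.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipBufferSize {
    // Hypothetical size chosen for illustration; the JDK default is 512 bytes.
    static final int INFLATER_BUFFER = 64 * 1024;

    /** Compress bytes in memory so the example needs no files. */
    static byte[] gzip(byte[] plain) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
                out.write(plain);
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Decompress using an explicit internal buffer size. */
    static byte[] gunzip(byte[] compressed, int bufferSize) {
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed), bufferSize)) {
            return in.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] plain = "<s> <p> <o> .\n".repeat(1000).getBytes(StandardCharsets.UTF_8);
        byte[] back = gunzip(gzip(plain), INFLATER_BUFFER);
        System.out.println(Arrays.equals(plain, back));
    }
}
```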

    Andy


SnappyCompressorInputStream has a 32k buffer.

So it is bz2 needing IO.ensureBuffered, the others may benefit - or may go slower.

     Andy



On 28.08.22 14:22, Andy Seaborne wrote:


If you are relying on Jena to do the bz2 decompress, then it is using Commons Compress.

gz is done (via Commons Compress) in native code. I use gz and if I get a bz2 file, I decompress it with OS tools.

Could you try an experiment please?

Run these on the same hardware the loader was run on:

riot --time --count river_planet-latest.osm.pbf.ttl
riot --time --count river_planet-latest.osm.pbf.ttl.bz2

    Andy

gz vs plain: NVMe m2 SSD : Dell XPS 13 9310

riot --time --count .../BSBM/bsbm-25m.nt.gz
Triples = 24,997,044
118.02 sec : 24,997,044 Triples : 211,808.84 per second

riot --time --count .../BSBM/bsbm-25m.nt
Triples = 24,997,044
109.97 sec : 24,997,044 Triples : 227,314.05 per second
