Re: TDB2 bulk loader - multiple files into different graph per file
On 29/08/2022 10:24, Lorenz Buehmann wrote:
> ...
> We checked code and the Apache Commons Compress docs, a colleague
> spotted the hint at
> https://commons.apache.org/proper/commons-compress/examples.html#Buffering :
>
> > The stream classes all wrap around streams provided by the calling
> > code and they work on them directly without any additional buffering.
> > On the other hand most of them will benefit from buffering so it is
> > highly recommended that users wrap their stream in
> > Buffered(In|Out)putStreams before using the Commons Compress API.
>
> we were curious about this statement, checked
> org.apache.jena.atlas.io.IO class and added one line in openFileEx
>
>     in = new BufferedInputStream(in);
>
> which wraps the file stream before it is passed to the decompressor
> streams
>
> Run again the parsing:
>
> riot --time --count river_planet-latest.osm.pbf.ttl.bz2 (Jena 4.7.0-SNAPSHOT fork with a BufferedInputStream wrapping the file stream in IO class)
>
> Triples = 163,310,838
> 1,004.68 sec : 163,310,838 Triples : 162,550.10 per second : 0 errors : 31 warnings
>
> What do you think?

Yes. IO.ensureBuffered.

It buffers if not already buffered and if not a ByteArrayInputStream. It also makes all buffering findable in the IDE.

RIOT buffers (128K char buffer) so calls down to chars-UTF8-bytes are in chunks.

Your observation indicates BZip2CompressorInputStream is not exploiting read(byte[] dest) calls ... yep - it's a loop internally calling the one-byte "read0".

GZIPInputStream has a default 512 byte buffer - maybe a bigger one there will help a bit.

SnappyCompressorInputStream has a 32k buffer.

So it is bz2 needing IO.ensureBuffered; the others may benefit - or may go slower.

    Andy
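To make the "buffer if not already buffered and if not a ByteArrayInputStream" rule concrete, here is a minimal sketch using only the JDK. It is not Jena's IO class: the class name BufferSketch, the buffer sizes and the helper methods are assumptions for illustration, and GZIPInputStream stands in for the Commons Compress decompressor so that the example runs without extra dependencies.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class for illustration; this is not Jena's IO class.
final class BufferSketch {
    // Assumed buffer size, chosen only for the example.
    static final int BUFFER_SIZE = 128 * 1024;

    // Wrap the stream unless it is already buffered or memory-backed.
    static InputStream ensureBuffered(InputStream in) {
        if (in instanceof BufferedInputStream || in instanceof ByteArrayInputStream)
            return in;
        return new BufferedInputStream(in, BUFFER_SIZE);
    }

    // Buffer the raw stream first, then hand it to the decompressor.
    // The second argument replaces GZIPInputStream's default 512 byte buffer.
    static String gunzipToString(InputStream raw) {
        try (InputStream in = new GZIPInputStream(ensureBuffered(raw), 64 * 1024)) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Helper for the demo: gzip a string in memory.
    static byte[] gzip(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] data = gzip("<http://example/s> <http://example/p> \"o\" .\n");
        // PushbackInputStream stands in for an unbuffered file stream.
        InputStream raw = new PushbackInputStream(new ByteArrayInputStream(data));
        System.out.print(gunzipToString(raw));
    }
}
```

Keeping the check in one helper means every call site gets the same behaviour and an already buffered stream is never wrapped twice.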
Re: TDB2 bulk loader - multiple files into different graph per file
On 29/08/2022 14:53, Simon Bin wrote:
> I was asked to try it on my system (samsung 970 evo+ nvme, intel
> 11850h), but I used a slightly smaller data set (river_europe); it is
> not quite as bad as on Lorenz' but the buffering would help
> nevertheless:
>
> main : river_europe-latest.osm.pbf.ttl.bz2 : 815.14 sec : 72,098,221 Triples : 88,449.21 per second : 0 errors : 10 warnings
> fix/bzip2 : river_europe-latest.osm.pbf.ttl.bz2 : 376.64 sec : 72,098,221 Triples : 191,424.76 per second : 0 errors : 10 warnings
> pbzip2 -dc river_europe-latest.osm.pbf.ttl.bz2 | : 155.24 sec : 72,098,221 Triples : 464,442.66 per second : 0 errors : 10 warnings
> river_europe-latest.osm.pbf.ttl : 136.92 sec : 72,098,221 Triples : 526,587.26 per second : 0 errors : 10 warnings

Are these two datasets (this dataset and river_planet-latest.osm.pbf.ttl) publicly available?

Different datasets have different performance characteristics. I'm not surprised BSBM is slower - it has a lot of large literals so there is a lot of basic byte shifting.

I also tried on a laptop, which typically has slower buses. (I had a hardware crash a couple of weeks ago so I don't have the desktop I was using for comparison but, from memory, the 8yo desktop is faster for riot parsing than the 1yo laptop.)

    Andy

If you want excellent figures, use LUBM. It has a small node/triple ratio (there are fewer bytes to shift) and high locality of URI use (better memory cache usage). It is unrealistic for parsing and loading.

> Cheers,
Re: Re: Re: Re: Re: TDB2 bulk loader - multiple files into different graph per file
I spotted an interesting difference in the performance gain when using a smaller dataset for Europe.

On the server we have

- the ZFS raid with less powerful hard-disks, i.e. only SATA with 4 x Samsung 870 QVO
- a 2TB NVMe mounted separately

On the ZFS raid:

with Jena 4.6.0:

Triples = 54,821,333
3,047.89 sec : 54,821,333 Triples : 17,986.64 per second : 0 errors : 10 warnings

with Jena 4.7.0 patched with the BufferedInputStream wrapper:

Triples = 54,821,333
308.05 sec : 54,821,333 Triples : 177,963.61 per second : 0 errors : 10 warnings

On the NVMe:

with Jena 4.6.0:

Triples = 54,821,333
824.11 sec : 54,821,333 Triples : 66,521.62 per second : 0 errors : 10 warnings

with Jena 4.7.0 patched with the BufferedInputStream wrapper:

Triples = 54,821,333
303.07 sec : 54,821,333 Triples : 180,888.49 per second : 0 errors : 10 warnings

Observation:

- on the ZFS raid the buffered stream is about 10x faster
- on the NVMe disk it is "only" about 3x faster

It looks like the Bzip2 implementation of Apache Commons Compress issues a lot of small IO requests, which is why it suffers far more from the missing buffered stream on the ZFS raid than on the faster NVMe disk.

Nevertheless, it is always worth using the buffered stream.
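The "factor 10" and "3x" figures follow directly from the elapsed times reported for the Europe dataset. A small sketch of the arithmetic, with the class and method names being hypothetical and the timings copied from the runs on the ZFS raid and the NVMe:

```java
import java.util.Locale;

// Hypothetical helper that only redoes the arithmetic on the reported timings.
final class SpeedupSketch {
    // Ratio of elapsed times: how many times faster the second run was.
    static double speedup(double secondsBefore, double secondsAfter) {
        return secondsBefore / secondsAfter;
    }

    public static void main(String[] args) {
        // ZFS raid: Jena 4.6.0 vs. patched 4.7.0 (seconds).
        double zfs = speedup(3047.89, 308.05);
        // NVMe: Jena 4.6.0 vs. patched 4.7.0 (seconds).
        double nvme = speedup(824.11, 303.07);
        System.out.printf(Locale.ROOT, "ZFS raid: %.2fx%n", zfs);
        System.out.printf(Locale.ROOT, "NVMe:     %.2fx%n", nvme);
    }
}
```

The two patched runs take nearly the same time (308.05 s vs. 303.07 s), which suggests that with buffering the disk is no longer the limiting factor on either device.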
Re: Re: Re: Re: TDB2 bulk loader - multiple files into different graph per file
I was asked to try it on my system (samsung 970 evo+ nvme, intel 11850h), but I used a slightly smaller data set (river_europe); it is not quite as bad as on Lorenz' but the buffering would help nevertheless:

main : river_europe-latest.osm.pbf.ttl.bz2 : 815.14 sec : 72,098,221 Triples : 88,449.21 per second : 0 errors : 10 warnings
fix/bzip2 : river_europe-latest.osm.pbf.ttl.bz2 : 376.64 sec : 72,098,221 Triples : 191,424.76 per second : 0 errors : 10 warnings
pbzip2 -dc river_europe-latest.osm.pbf.ttl.bz2 | : 155.24 sec : 72,098,221 Triples : 464,442.66 per second : 0 errors : 10 warnings
river_europe-latest.osm.pbf.ttl : 136.92 sec : 72,098,221 Triples : 526,587.26 per second : 0 errors : 10 warnings

Cheers,
Re: Re: Re: TDB2 bulk loader - multiple files into different graph per file
In addition I used the OS tool in a pipe:

bunzip2 -c river_planet-latest.osm.pbf.ttl.bz2 | riot --time --count --syntax "Turtle"

Triples = 163,310,838
stdin : 717.78 sec : 163,310,838 Triples : 227,523.09 per second : 0 errors : 31 warnings

Unsurprisingly, this is more or less exactly the time of decompression plus the parsing time of the uncompressed file. It is still way faster than the Apache Commons one: even with my suggested fix, the OS variant is ~5 min faster.
fuseki backup process / policy - similar capabilities to autopostgresqlbackup ?
Hello,

We are using fuseki and we would like to implement a backup policy similar in capabilities to what [autopostgresqlbackup] has to offer.
Are there any existing solutions out there that can do all / part of these?

We would like to take:

* daily backups for a week
* weekly backups - 1 per week, last 4 weeks
* monthly backups - 1 per month, last 6 months

I believe this could be scripted via the HTTP API + directory access.
The backup API in [fuseki-server-protocol] can trigger a backup and can also list existing backups.

Unfortunately, in the current implementation, backup is not consistent.
In case of a server crash during backup, the file will remain there incomplete.
Also, since tasks are stored in memory and cleaned (periodically / on restart), there is no way to know for sure if the backup was successful or not.
I have encountered the above quite often in some workloads.

The inconsistency could be solved by writing the backup to a temporary file name and, on completion, renaming it to the final file name.
Rename is usually an atomic operation on POSIX file systems.

The /backup-list API can list all backups or split backups into complete / incomplete.
IMO for now, it can list all of them.

The in-progress backup could be stored alongside the other backups with a file marker like: dataset_date.nq.gz.INCOMPLETE .
Once it's done it can be renamed to dataset_date.nq.gz .

Cleanup might be handled externally.
In case of a crash, the file will remain INCOMPLETE until the system removes it, after checking that a specific amount of time has passed since the backup was started (1-2 days).

WDYT?

[autopostgresqlbackup] https://github.com/k0lter/autopostgresqlbackup
[fuseki-server-protocol] https://jena.apache.org/documentation/fuseki2/fuseki-server-protocol.html

Thanks,
--
Eugen Stan
+40770 941 271 / https://www.netdava.com
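The two ideas in the proposal, the write-then-rename step and the daily / weekly / monthly retention, could be sketched roughly as below. Nothing here is Fuseki code: the class BackupSketch, both method names and the exact window rules (7 days, 4 weeks, 6 months, oldest backup per ISO week and per calendar month) are assumptions made for illustration, and the backup content is a plain byte array rather than a real dataset dump.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.time.LocalDate;
import java.time.YearMonth;
import java.time.temporal.IsoFields;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration; not part of Fuseki.
final class BackupSketch {

    // Write under a marker name, then rename once the content is complete.
    // A crash before the rename leaves only the ".INCOMPLETE" file behind.
    static Path writeThenRename(Path finalFile, byte[] content) {
        Path partial = finalFile.resolveSibling(finalFile.getFileName() + ".INCOMPLETE");
        try {
            Files.write(partial, content);
            return Files.move(partial, finalFile, StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Pick the backup dates to keep: every backup of the last 7 days,
    // one per ISO week over the last 4 weeks, one per month over the last 6 months.
    static Set<LocalDate> datesToKeep(Collection<LocalDate> backups, LocalDate today) {
        Set<LocalDate> keep = new TreeSet<>();
        Map<String, LocalDate> perWeek = new HashMap<>();
        Map<YearMonth, LocalDate> perMonth = new HashMap<>();
        for (LocalDate d : new TreeSet<>(backups)) {   // oldest first
            if (d.isAfter(today))
                continue;                              // ignore dates in the future
            if (d.isAfter(today.minusDays(7)))
                keep.add(d);
            if (d.isAfter(today.minusWeeks(4))) {
                String week = d.get(IsoFields.WEEK_BASED_YEAR) + "-W"
                        + d.get(IsoFields.WEEK_OF_WEEK_BASED_YEAR);
                perWeek.putIfAbsent(week, d);          // oldest backup of that week
            }
            if (d.isAfter(today.minusMonths(6)))
                perMonth.putIfAbsent(YearMonth.from(d), d);
        }
        keep.addAll(perWeek.values());
        keep.addAll(perMonth.values());
        return keep;
    }
}
```

An external cleanup job would delete every completed backup whose date is not in the returned set, plus any ".INCOMPLETE" file older than the chosen grace period. Because the 4-week and 6-month windows are not aligned to calendar boundaries, they can touch one extra week or month, so the set may hold slightly more than 4 weekly and 6 monthly entries.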
Re: Re: TDB2 bulk loader - multiple files into different graph per file
On Sun, Aug 28, 2022 at 11:00 AM Lorenz Buehmann wrote:
>
> Hi Andy,
>
> thanks for fast response.
>
> I see - the only drawback with wrapping the streams into TriG is when we
> have Turtle syntax files (or let's say any non N-Triples format) - afaik,
> prefixes aren't allowed inside graphs, i.e. at that point you're lost.
> What I did now is to pipe those files into riot first which then
> generates N-Triples which then can be wrapped in TriG graphs. Indeed, we
> have the riot overhead here, i.e. the data is parsed twice. Still faster
> though than loading graphs in separate TDB loader calls, so I guess I
> can live with this.

I had a similar question a few years ago, and Claus responded:
https://stackoverflow.com/questions/63467067/converting-rdf-triples-to-quads-from-command-line/63716278

> Having a follow up question:
>
> I could see a huge difference between reading a compressed (Bzip) vs
> an uncompressed file:
>
> I put the output until the triples have been loaded here as the index
> creation should not be affected by the compression:
>
>
> # uncompressed with tdb2.tdbloader
>
> 14:24:40 INFO loader :: Add: 163,000,000 river_planet-latest.osm.pbf.ttl (Batch: 144,320 / Avg: 140,230)
> 14:24:42 INFO loader :: Finished: output/river_planet-latest.osm.pbf.ttl: 163,310,838 tuples in 1165.30s (Avg: 140,145)
>
>
> # compressed with tdb2.tdbloader
>
> 17:37:37 INFO loader :: Add: 163,000,000 river_planet-latest.osm.pbf.ttl.bz2 (Batch: 19,424 / Avg: 16,050)
> 17:37:40 INFO loader :: Finished: output/river_planet-latest.osm.pbf.ttl.bz2: 163,310,838 tuples in 10158.16s (Avg: 16,076)
>
>
> So loading the compressed file is ~9x slower than the uncompressed one.
> Can we consider this as expected? Note, here we have a geospatial
> dataset with millions of geometry literals. Not sure if this is also
> something that makes things worse.
>
> What are your experiences with loading compressed vs uncompressed data?
>
>
> Cheers,
>
> Lorenz
>
>
> On 26.08.22 17:02, Andy Seaborne wrote:
> > Hi Lorenz,
> >
> > No - there isn't an option.
> >
> > The way to do it is to prepare the load as quads by, for example,
> > wrapping in TriG syntax around the files or adding the G to N-triples.
> >
> > This can be done streaming and piped into the loader (with --syntax=
> > if not N-quads).
> >
> > > By the way, the tdb2.xloader has no option for named graphs at all?
> >
> > The input needs to be prepared as quads.
> >
> >     Andy
> >
> > On 26/08/2022 15:03, Lorenz Buehmann wrote:
> >> Hi all,
> >>
> >> is there any option to use TDB2 bulk loader (tdb2.xloader or just
> >> tdb2.loader) to load multiple files into multiple different named
> >> graphs? Like
> >>
> >> tdb2.loader --loc ./tdb2/dataset --graph file1 --graph
> >> file2 ...
> >>
> >> I'm asking because I thought the initial loading is way faster than
> >> iterating over multiple (graph, file) pairs and running the TDB2
> >> loader for each pair?
> >>
> >>
> >> By the way, the tdb2.xloader has no option for named graphs at all?
> >>
> >>
> >> Cheers,
> >>
> >> Lorenz
> >>
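"Adding the G to N-triples" can be done as a streaming line filter. The following is a minimal sketch, not a Jena tool: the class QuadSketch and its method are hypothetical, and it assumes one triple per line with nothing after the terminating dot, as in N-Triples written by riot. It does not validate the input.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;

// Hypothetical streaming filter: N-Triples on stdin, N-Quads on stdout.
final class QuadSketch {

    // Insert the graph IRI before the terminating '.' of one N-Triples line.
    // Blank lines and comment lines are passed through unchanged.
    static String toQuad(String ntLine, String graphIri) {
        String t = ntLine.trim();
        if (t.isEmpty() || t.startsWith("#"))
            return ntLine;
        if (!t.endsWith("."))
            throw new IllegalArgumentException("Not a complete N-Triples line: " + ntLine);
        String body = t.substring(0, t.length() - 1).trim();
        return body + " <" + graphIri + "> .";
    }

    public static void main(String[] args) throws IOException {
        String graphIri = args[0];   // graph name to attach, passed on the command line
        try (BufferedReader in = new BufferedReader(
                     new InputStreamReader(System.in, StandardCharsets.UTF_8));
             PrintWriter out = new PrintWriter(new BufferedWriter(
                     new OutputStreamWriter(System.out, StandardCharsets.UTF_8)))) {
            String line;
            while ((line = in.readLine()) != null)
                out.println(toQuad(line, graphIri));
        }
    }
}
```

Run once per input file with that file's graph IRI, after riot has converted the file to N-Triples. The concatenated output is N-Quads and can be piped into the loader in a single call, which avoids one loader run per graph.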
Re: Re: TDB2 bulk loader - multiple files into different graph per file
riot --time --count river_planet-latest.osm.pbf.ttl

Triples = 163,310,838
351.00 sec : 163,310,838 Triples : 465,271.72 per second : 0 errors : 31 warnings


riot --time --count river_planet-latest.osm.pbf.ttl.gz

Triples = 163,310,838
431.74 sec : 163,310,838 Triples : 378,258.50 per second : 0 errors : 31 warnings


riot --time --count river_planet-latest.osm.pbf.ttl.bz2

Triples = 163,310,838
9,948.17 sec : 163,310,838 Triples : 16,416.17 per second : 0 errors : 31 warnings


Takes ages with Bzip2 ... there must be something going wrong ...

We checked code and the Apache Commons Compress docs, a colleague spotted the hint at
https://commons.apache.org/proper/commons-compress/examples.html#Buffering :

> The stream classes all wrap around streams provided by the calling
> code and they work on them directly without any additional buffering.
> On the other hand most of them will benefit from buffering so it is
> highly recommended that users wrap their stream in
> Buffered(In|Out)putStreams before using the Commons Compress API.

we were curious about this statement, checked org.apache.jena.atlas.io.IO class and added one line in openFileEx

    in = new BufferedInputStream(in);

which wraps the file stream before it is passed to the decompressor streams


Run again the parsing:

riot --time --count river_planet-latest.osm.pbf.ttl.bz2 (Jena 4.7.0-SNAPSHOT fork with a BufferedInputStream wrapping the file stream in IO class)

Triples = 163,310,838
1,004.68 sec : 163,310,838 Triples : 162,550.10 per second : 0 errors : 31 warnings


What do you think?


On 28.08.22 14:22, Andy Seaborne wrote:
> > If you are relying on Jena to do the bz2 decompress, then it is
> > using Commons Compress.
> >
> > gz is done (via Commons Compress) in native code. I use gz and if I
> > get a bz2 file, I decompress it with OS tools.
>
> Could you try an experiment please?
>
> Run on the same hardware as the loader was run:
>
> riot --time --count river_planet-latest.osm.pbf.ttl
> riot --time --count river_planet-latest.osm.pbf.ttl.bz2
>
>     Andy
>
> gz vs plain: NVMe m2 SSD : Dell XPS 13 9310
>
> riot --time --count .../BSBM/bsbm-25m.nt.gz
> Triples = 24,997,044
> 118.02 sec : 24,997,044 Triples : 211,808.84 per second
>
> riot --time --count .../BSBM/bsbm-25m.nt
> Triples = 24,997,044
> 109.97 sec : 24,997,044 Triples : 227,314.05 per second
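Why the one-line BufferedInputStream change matters so much can be shown without any decompressor. The sketch below is an illustration under assumptions, not the Commons Compress code path: a hypothetical CountingInputStream counts how often the underlying "file" stream is asked for data while a consumer pulls single bytes, once directly and once through a BufferedInputStream with its default buffer.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical demo class; it only models a consumer that reads one byte at a time.
final class ReadCountSketch {

    // Counts how often the wrapped stream is asked for data.
    static final class CountingInputStream extends FilterInputStream {
        long calls = 0;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            calls++;
            return super.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            calls++;
            return super.read(b, off, len);
        }
    }

    // Consume the stream with single-byte reads until end of stream.
    static void drainByteByByte(InputStream in) {
        try {
            while (in.read() != -1) {
                // discard the byte
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Requests reaching the source when the consumer reads it directly.
    static long callsUnbuffered(int size) {
        CountingInputStream source = new CountingInputStream(new ByteArrayInputStream(new byte[size]));
        drainByteByByte(source);
        return source.calls;
    }

    // Requests reaching the source when a BufferedInputStream sits in between.
    static long callsBuffered(int size) {
        CountingInputStream source = new CountingInputStream(new ByteArrayInputStream(new byte[size]));
        drainByteByByte(new BufferedInputStream(source));
        return source.calls;
    }

    public static void main(String[] args) {
        int size = 100_000;
        System.out.println("unbuffered requests: " + callsUnbuffered(size));
        System.out.println("buffered requests:   " + callsBuffered(size));
    }
}
```

Without the buffer every byte is a separate request to the source, and for a real file stream each request is a call into the operating system. With the buffer the source is asked only once per buffer refill, so the number of requests drops by several orders of magnitude.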