Yep, I already realized that I forgot to mention the hardware and other details:


- file size compressed: 5.9G

- file size uncompressed: 23G


- Server:

    - AMD EPYC 7443P 24-Core Processor
    - 256GB RAM
    - 4 x 8TB SSD  Samsung_SSD_870 as a ZFS raid, i.e. ~30TB


- Jena version (latest release, 4.6.0):

TDB2:       VERSION: 4.6.0
TDB2:       BUILD_DATE: 2022-08-20T08:22:47Z

- TDB2 loader is the default one, i.e. it should be 'phased'?

- I reran the loader, phased vs. parallel, on compressed vs. uncompressed data:

https://gist.github.com/LorenzBuehmann/27f232a1fd2c2a95600115b18958458b


-> compressed one degrades immediately to an avg of 16,000/s vs 140,000/s on the uncompressed data - looks horrible


And yes, I also tend to decompress with an OS tool before loading.




On 28.08.22 13:55, Andy Seaborne wrote:


On 28/08/2022 09:58, Lorenz Buehmann wrote:
Hi Andy,

thanks for the fast response.

I see - the only drawback with wrapping the streams in TriG is when we have Turtle syntax files (or, let's say, any non-N-Triples format) - afaik, prefix declarations aren't allowed inside graph blocks, i.e. at that point you're lost.

What I did now is to pipe those files into riot first, which generates N-Triples that can then be wrapped in TriG graphs. Indeed, we have the riot overhead here, i.e. the data is parsed twice. Still faster, though, than loading graphs in separate TDB loader calls, so I guess I can live with this.
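The wrapping step described above could be sketched roughly like this in Python. This is only an illustration: the function name and the example graph IRI are made up here, and it assumes the input really is N-Triples (one statement per line), e.g. as produced by riot.

```python
from typing import Iterable, Iterator


def wrap_in_trig_graph(ntriples_lines: Iterable[str], graph_iri: str) -> Iterator[str]:
    """Wrap a stream of N-Triples lines in a single TriG graph block.

    Hypothetical helper: assumes every input line is a complete N-Triples
    statement, so no prefix declarations can end up inside the block.
    """
    # Open the named graph block.
    yield f"<{graph_iri}> {{\n"
    for line in ntriples_lines:
        # Pass each statement through unchanged, normalising the line ending.
        yield line if line.endswith("\n") else line + "\n"
    # Close the named graph block.
    yield "}\n"
```

Since it is a generator, it works in a streaming fashion: feeding it sys.stdin and writing the result with sys.stdout.writelines(...) would let it sit in a pipe between the N-Triples producer and the loader.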


Exercise in text processing :-)

Spit out the prefixes into a separate TTL file (grep!) and load that file as well.
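A rough Python equivalent of that grep, as a sketch only: it assumes each prefix declaration sits on its own line, and the function name is invented for illustration. It does not handle @base/BASE or declarations split across lines.

```python
import re
from typing import Iterable, Iterator

# Turtle "@prefix p: <iri> ." (case-sensitive) or SPARQL-style "PREFIX p: <iri>"
# (case-insensitive), each followed by whitespace.
PREFIX_LINE = re.compile(r"^\s*(?:@prefix|(?i:prefix))\s")


def extract_prefix_lines(turtle_lines: Iterable[str]) -> Iterator[str]:
    """Yield only the lines that look like prefix declarations."""
    for line in turtle_lines:
        if PREFIX_LINE.match(line):
            yield line
```

The yielded lines can be written to a separate .ttl file and loaded alongside the data.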


Having a follow up question:

I could see a huge difference between reading a compressed (bzip2) vs. an uncompressed file:

I put the output up to the point where the triples have been loaded here, as the index creation should not be affected by the compression:


# uncompressed with tdb2.tdbloader

Which loader?
And what hardware?

(--loader=parallel may not make much of a difference at 100m)


14:24:40 INFO  loader          :: Add: 163,000,000 river_planet-latest.osm.pbf.ttl (Batch: 144,320 / Avg: 140,230)
14:24:42 INFO  loader          :: Finished: output/river_planet-latest.osm.pbf.ttl: 163,310,838 tuples in 1165.30s (Avg: 140,145)


# compressed with tdb2.tdbloader

17:37:37 INFO  loader          :: Add: 163,000,000 river_planet-latest.osm.pbf.ttl.bz2 (Batch: 19,424 / Avg: 16,050)
17:37:40 INFO  loader          :: Finished: output/river_planet-latest.osm.pbf.ttl.bz2: 163,310,838 tuples in 10158.16s (Avg: 16,076)

That is bad!
Was it consistently slow through the load?

If you are relying on Jena to do the bz2 decompress, then it is using Commons Compress.

gz is done (via Commons Compress) in native code. I use gz and if I get a bz2 file, I decompress it with OS tools.
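For completeness, decompressing up front can also be scripted. The sketch below uses Python's standard bz2 module as a stand-in for an OS tool; the function name and default chunk size are arbitrary choices made here, and no claim is made about how its speed compares with a native tool.

```python
import bz2
import shutil
from pathlib import Path


def decompress_bz2(src: str, dst: str, chunk_size: int = 1024 * 1024) -> int:
    """Stream-decompress a .bz2 file to dst and return the uncompressed size in bytes."""
    # Copy in fixed-size chunks so the whole file never has to fit in memory.
    with bz2.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout, chunk_size)
    return Path(dst).stat().st_size
```

The decompressed file can then be handed to the loader as usual.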

So loading the compressed file is ~9x slower than loading the uncompressed one. Can we consider this as expected? Note, here we have a geospatial dataset with millions of geometry literals. Not sure if this is also something that makes things worse.
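The slowdown factor can be checked from the two "Finished" log lines above. A small Python calculation using only the numbers reported there (the variable names are just for illustration):

```python
# Figures copied from the two "Finished" log lines.
tuples = 163_310_838
uncompressed_s = 1165.30
compressed_s = 10158.16

# Ratio of wall-clock load times, compressed vs. uncompressed input.
ratio = compressed_s / uncompressed_s

print(f"slowdown: {ratio:.1f}x")
print(f"uncompressed avg: {tuples / uncompressed_s:,.0f} tuples/s")
print(f"compressed avg:   {tuples / compressed_s:,.0f} tuples/s")
```

The ratio comes out at roughly 8.7, so "about 9x" is a fair summary.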

What are your experiences with loading compressed vs uncompressed data?

bz2 is expensive - it focuses on max compression. Coupled with being Java (not so much the Java itself, as the decompression code not being highly tuned), it could be a factor.

Usually (gz) there is a slight slow down if using SSD as source. HDD can be either way.

    Andy



Cheers,

Lorenz


On 26.08.22 17:02, Andy Seaborne wrote:
Hi Lorenz,

No - there isn't an option.

The way to do it is to prepare the load as quads by, for example, wrapping TriG syntax around the files or adding the G to N-Triples.

This can be done streaming and piped into the loader (with --syntax= if not N-Quads).
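"Adding the G to N-Triples" could look something like the following streaming sketch in Python. The function name is hypothetical, and it assumes each input line is one complete statement with no trailing comment after the final dot.

```python
from typing import Iterable, Iterator


def ntriples_to_nquads(lines: Iterable[str], graph_iri: str) -> Iterator[str]:
    """Turn N-Triples statements into N-Quads by appending a graph term.

    Assumption: one statement per line, nothing after the terminating '.'.
    """
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comment lines
        if not stripped.endswith("."):
            raise ValueError(f"not a complete N-Triples statement: {stripped!r}")
        # Drop the terminating '.', then append the graph term and a new '.'.
        yield f"{stripped[:-1].rstrip()} <{graph_iri}> .\n"
```

Because it only ever looks at one line at a time, it can run in a pipe in front of the loader without buffering the whole file.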

> By the way, the tdb2.xloader has no option for named graphs at all?

The input needs to be prepared as quads.

    Andy

On 26/08/2022 15:03, Lorenz Buehmann wrote:
Hi all,

is there any option to use the TDB2 bulk loader (tdb2.xloader or just tdb2.tdbloader) to load multiple files into multiple different named graphs? Like

tdb2.tdbloader --loc ./tdb2/dataset --graph <g1> file1 --graph <g2> file2 ...

I'm asking because I thought the initial loading is way faster than iterating over multiple (graph, file) pairs and running the TDB2 loader for each pair?


By the way, the tdb2.xloader has no option for named graphs at all?


Cheers,

Lorenz
