Hi Andy,
thanks for the fast response.
I see - the only drawback of wrapping the streams in TriG shows up when
we have Turtle files (or, let's say, any non-N-Triples format) - afaik,
prefix declarations aren't allowed inside graph blocks, i.e. at that
point you're stuck.
What I do now is pipe those files through riot first, which generates
N-Triples that can then be wrapped in TriG graphs. Admittedly, we have
the riot overhead here, i.e. the data is parsed twice. It is still
faster than loading the graphs in separate TDB loader calls, though, so
I guess I can live with this.
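For illustration, the wrapping step could look roughly like the sketch
below. It is not part of Jena; the function name and the example IRIs
are made up, and it assumes the input is already N-Triples (e.g. the
output of riot), one triple per line.

```python
from typing import Iterable, Iterator


def wrap_in_trig_graph(ntriples_lines: Iterable[str], graph_iri: str) -> Iterator[str]:
    """Stream N-Triples lines into a single TriG graph block.

    Assumption: the input contains no prefix declarations, which is
    guaranteed for N-Triples but not for Turtle.
    """
    yield f"<{graph_iri}> {{\n"          # open the graph block
    for line in ntriples_lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue                      # skip blank lines and comments
        yield stripped + "\n"
    yield "}\n"                           # close the graph block


# Small in-memory demo with hypothetical IRIs.
demo_input = [
    "<http://example.org/s> <http://example.org/p> <http://example.org/o> .\n",
    "\n",
]
demo_output = "".join(wrap_in_trig_graph(demo_input, "http://example.org/g1"))
print(demo_output, end="")
```

Since it is a generator, passing sys.stdin as the input would keep the
whole thing streaming, so nothing has to be held in memory.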
I have a follow-up question:
I see a huge difference between reading a compressed (bzip2) and an
uncompressed file.
Below is the output up to the point where the triples have been loaded,
as the index creation shouldn't be affected by the compression:
# uncompressed with tdb2.tdbloader
14:24:40 INFO loader :: Add: 163,000,000
river_planet-latest.osm.pbf.ttl (Batch: 144,320 / Avg: 140,230)
14:24:42 INFO loader :: Finished:
output/river_planet-latest.osm.pbf.ttl: 163,310,838 tuples in 1165.30s
(Avg: 140,145)
# compressed with tdb2.tdbloader
17:37:37 INFO loader :: Add: 163,000,000
river_planet-latest.osm.pbf.ttl.bz2 (Batch: 19,424 / Avg: 16,050)
17:37:40 INFO loader :: Finished:
output/river_planet-latest.osm.pbf.ttl.bz2: 163,310,838 tuples in
10158.16s (Avg: 16,076)
So loading the compressed file is ~9x slower than loading the uncompressed one.
Can we consider this as expected? Note, here we have a geospatial
dataset with millions of geometry literals. Not sure if this is also
something that makes things worse.
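As a quick sanity check of the ~9x figure, the numbers from the two
"Finished" lines above can be plugged into a few lines of Python. Only
the tuple count and the two durations come from the log; the variable
names are just for illustration.

```python
# Values copied from the two "Finished" log lines.
tuples = 163_310_838
uncompressed_seconds = 1165.30
compressed_seconds = 10158.16

# Average load rates in tuples per second.
uncompressed_rate = tuples / uncompressed_seconds
compressed_rate = tuples / compressed_seconds

# Slowdown factor of the compressed load relative to the uncompressed one.
slowdown = compressed_seconds / uncompressed_seconds

print(f"uncompressed: {uncompressed_rate:,.0f} tuples/s")
print(f"compressed:   {compressed_rate:,.0f} tuples/s")
print(f"slowdown:     {slowdown:.1f}x")
```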
What are your experiences with loading compressed vs uncompressed data?
Cheers,
Lorenz
On 26.08.22 17:02, Andy Seaborne wrote:
Hi Lorenz,
No - there isn't an option.
The way to do it is to prepare the load as quads by, for example,
wrapping in TriG syntax around the files or adding the G to N-triples.
This can be done streaming and piped into the loader (with --syntax=
if not N-quads).
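A rough sketch of the "adding the G to N-triples" variant, in Python.
This is only an illustration, not Jena code: the function name and IRIs
are invented, and it assumes one triple per line, each ending in "."
with no trailing comment.

```python
from typing import Iterable, Iterator


def ntriples_to_nquads(ntriples_lines: Iterable[str], graph_iri: str) -> Iterator[str]:
    """Turn each N-Triples line into an N-Quads line in the given graph."""
    for line in ntriples_lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue                      # skip blank lines and comments
        if not stripped.endswith("."):
            raise ValueError(f"not a complete triple: {stripped!r}")
        # Drop the final '.', append the graph IRI, then re-add the '.'.
        yield f"{stripped[:-1].rstrip()} <{graph_iri}> .\n"


# Small in-memory demo with hypothetical IRIs.
demo_triples = ['<http://example.org/s> <http://example.org/p> "a value." .\n']
demo_quads = list(ntriples_to_nquads(demo_triples, "http://example.org/g1"))
print("".join(demo_quads), end="")
```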
> By the way, the tdb2.xloader has no option for named graphs at all?
The input needs to be prepared as quads.
Andy
On 26/08/2022 15:03, Lorenz Buehmann wrote:
Hi all,
is there any option to use the TDB2 bulk loader (tdb2.xloader or just
tdb2.tdbloader) to load multiple files into multiple different named
graphs? Like
tdb2.tdbloader --loc ./tdb2/dataset --graph <g1> file1 --graph <g2>
file2 ...
I'm asking because I thought the initial loading is way faster than
iterating over multiple (graph, file) pairs and running the TDB2
loader for each pair?
By the way, the tdb2.xloader has no option for named graphs at all?
Cheers,
Lorenz