On Sun, Aug 28, 2022 at 11:00 AM Lorenz Buehmann
<buehm...@informatik.uni-leipzig.de> wrote:
>
> Hi Andy,
>
> thanks for the fast response.
>
> I see - the only drawback with wrapping the streams into TriG is when we
> have Turtle syntax files (or, let's say, any non-N-Triples format) -
> afaik, prefixes aren't allowed inside graph blocks, i.e. at that point
> you're lost. What I did now is to pipe those files through riot first,
> which generates N-Triples that can then be wrapped in TriG graphs.
> Admittedly, we have the riot overhead here, i.e. the data is parsed
> twice. It is still faster than loading the graphs with separate TDB
> loader calls, though, so I guess I can live with this.

I had a similar question a few years ago, and Claus responded:
https://stackoverflow.com/questions/63467067/converting-rdf-triples-to-quads-from-command-line/63716278
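
For reference, a rough sketch of that kind of pipeline (graph IRI, file
name and database location are placeholders; per Andy's note below the
stream can be piped into the loader with --syntax=, though the exact
stdin handling may depend on the Jena version):

    # Re-serialize the Turtle as N-Triples with riot, wrap the stream in
    # a TriG graph block, and pipe the result into the TDB2 loader.
    ( echo '<http://example.org/graph1> {'
      riot --output=NT file1.ttl
      echo '}'
    ) | tdb2.tdbloader --loc ./tdb2/dataset --syntax=trig

Because the stream is already N-Triples at that point, no prefix
declarations are needed inside the graph block.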

>
> Having a follow up question:
>
> I saw a huge difference between reading a compressed (bzip2) file vs an
> uncompressed one:
>
> I'm only showing the output up to the point where the triples have been
> loaded, as the index creation shouldn't be affected by the compression:
>
>
> # uncompressed with tdb2.tdbloader
>
> 14:24:40 INFO  loader          :: Add: 163,000,000
> river_planet-latest.osm.pbf.ttl (Batch: 144,320 / Avg: 140,230)
> 14:24:42 INFO  loader          :: Finished:
> output/river_planet-latest.osm.pbf.ttl: 163,310,838 tuples in 1165.30s
> (Avg: 140,145)
>
>
> # compressed with tdb2.tdbloader
>
> 17:37:37 INFO  loader          :: Add: 163,000,000
> river_planet-latest.osm.pbf.ttl.bz2 (Batch: 19,424 / Avg: 16,050)
> 17:37:40 INFO  loader          :: Finished:
> output/river_planet-latest.osm.pbf.ttl.bz2: 163,310,838 tuples in
> 10158.16s (Avg: 16,076)
>
>
> So loading the compressed file is ~9x slower than the uncompressed one.
> Can we consider this expected? Note that this is a geospatial dataset
> with millions of geometry literals; I'm not sure whether that also makes
> things worse.
>
> What are your experiences with loading compressed vs uncompressed data?
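
As far as I know, the loader decompresses .bz2 input in-process while
parsing, and bzip2 decompression (single-threaded, in Java) is quite
expensive, so a large slowdown is not too surprising. The geometry
literals are unlikely to be the cause, since both runs load exactly the
same data. One way to check is to decompress in a separate process and
pipe the result in (paths are placeholders; stdin handling as in the
earlier sketch):

    # Decompress outside the loader (lbzip2/pbzip2 decompress in
    # parallel; plain bzcat works too) and stream the Turtle in.
    lbzip2 -dc output/river_planet-latest.osm.pbf.ttl.bz2 \
      | tdb2.tdbloader --loc ./tdb2/dataset --syntax=turtle

Gzip-compressed input also tends to decompress much faster than bzip2.
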
>
>
> Cheers,
>
> Lorenz
>
>
> On 26.08.22 17:02, Andy Seaborne wrote:
> > Hi Lorenz,
> >
> > No - there isn't an option.
> >
> > The way to do it is to prepare the load as quads by, for example,
> > wrapping TriG syntax around the files or adding the G to the N-Triples.
> >
> > This can be done streaming and piped into the loader (with --syntax=
> > if not N-quads).
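
For the "adding the G to N-Triples" route, assuming canonical N-Triples
output (one statement per line, as riot emits) and a placeholder graph
IRI, a sketch could look like this - the result is N-Quads, so no
--syntax= should be needed:

    # Turn N-Triples into N-Quads by inserting the graph IRI before the
    # statement-terminating dot, then pipe into the loader.
    riot --output=NT file1.ttl \
      | sed 's|[[:space:]]*\.[[:space:]]*$| <http://example.org/graph1> .|' \
      | tdb2.tdbloader --loc ./tdb2/dataset
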
> >
> > > By the way, the tdb2.xloader has no option for named graphs at all?
> >
> > The input needs to be prepared as quads.
> >
> >     Andy
> >
> > On 26/08/2022 15:03, Lorenz Buehmann wrote:
> >> Hi all,
> >>
> >> is there any option to use the TDB2 bulk loader (tdb2.xloader or just
> >> tdb2.tdbloader) to load multiple files into multiple different named
> >> graphs? Like
> >>
> >> tdb2.tdbloader --loc ./tdb2/dataset --graph <g1> file1 --graph <g2>
> >> file2 ...
> >>
> >> I'm asking because I thought a single initial bulk load would be way
> >> faster than iterating over multiple (graph, file) pairs and running
> >> the TDB2 loader for each pair.
> >>
> >>
> >> By the way, the tdb2.xloader has no option for named graphs at all?
> >>
> >>
> >> Cheers,
> >>
> >> Lorenz
> >>
