Re: tdbloader and

Andy Seaborne Tue, 12 Feb 2013 01:29:46 -0800

On 11/02/13 12:51, Sarven Capadisli wrote:

Hi,


I just wanted to drop in some ballpark stats from my end, and hopefully
get some feedback about your own experience.

As it is fairly well-known, there is a drastic speed difference between
N-Triples and RDF/XML on first load. Giving 12GB of memory to tdbloader,
I can see N-Triples adding around 110k triples/second, whereas RDF/XML
is around 33k triples/second.


It will slow down as the load grows ...


However, there is not all that much of a speed difference between
loading N-Triples and RDF/XML into an existing TDB store. N-Triples gets
added around 33k triples/second, whereas RDF/XML around 18k
triples/second i.e., N-Triples is ballpark 15k triples per second faster.


That sounds about right.
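
As a rough check on those figures, the ratios can be worked out directly. This is only arithmetic on the numbers quoted in this thread, not a new measurement; `ratio` is a hypothetical helper name.

```shell
# Rates quoted in this thread, in thousands of triples per second.
empty_nt=110; empty_xml=33      # first load into an empty store
exist_nt=33;  exist_xml=18      # load into an existing store

# ratio A B: print A/B to one decimal place (hypothetical helper).
ratio() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f", a / b }'; }

echo "empty store:    N-Triples is $(ratio "$empty_nt" "$empty_xml")x RDF/XML"
echo "existing store: N-Triples is $(ratio "$exist_nt" "$exist_xml")x RDF/XML"
echo "existing store: difference is $((exist_nt - exist_xml))k triples/second"
```

The gap shrinks from about 3.3x on an empty store to about 1.8x on an existing one, which fits the parser mattering less once the indexing work dominates.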

Bulkloading an empty database is a special case, especially in tdbloader2, which builds trees by knowing about the disk layout of B+Trees. Even tdbloader, which does not play raw index games, relies on the order of actions to maximise caching efficiency.

Incremental loads are done triple-by-triple to check for duplicates with existing data. Large additional loads could have some of the optimizations of tdbloader (not tdbloader2); I don't know where the crossover would be between adding a few triples in a loop and going in and incrementally adding in bulk to existing indexes.


The NTriples vs RDF/XML differences come down to parser speeds.

NTriples is fastest - even faster than Turtle despite shifting more bytes, because large files are usually written sequentially to disk, so reading is at disk interface speeds, with few costly seeks.

NTriples parsing is fast enough that making it run in parallel with the loader was no gain and even a small drop in performance.

RDF/XML is expensive to parse and parallelism would be more beneficial. You can achieve that effect in part by running



   riot data.rdf | tdbloader --loc=DB -- -

because the parser and the loader are then different processes so different threads.

Personally, I parse all data first to get NTriples before loading to check for errors in the data and make sure the load will be clean. I keep the intermediate files to avoid needing to recreate them if I reload. Then I load the NTriples at speed.
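
A minimal sketch of that parse-first workflow follows. `parse_then_load` is a hypothetical helper name, not a Jena command; `riot` and `tdbloader` are invoked in the same form used elsewhere in this message. It assumes `riot` exits with a non-zero status when the input has errors, which is worth confirming for your version.

```shell
# Sketch only: parse_then_load is a hypothetical helper, not part of Jena.
parse_then_load() {
    src=$1      # input file, e.g. data.rdf
    nt=$2       # intermediate N-Triples file to keep, e.g. data.nt
    db=$3       # TDB location, e.g. DB

    # Parse only if there is no intermediate file from an earlier run.
    if [ ! -s "$nt" ]; then
        # Write to a temporary name so a failed parse never leaves a
        # half-written file that a later run would reuse.
        # ASSUMPTION: riot signals bad input with a non-zero exit status.
        if ! riot "$src" > "$nt.tmp"; then
            rm -f "$nt.tmp"
            echo "parse failed, nothing loaded: $src" >&2
            return 1
        fi
        mv "$nt.tmp" "$nt"
    fi

    # Load the already-checked N-Triples.
    tdbloader --loc="$db" "$nt"
}
```

Usage would be along the lines of `parse_then_load data.rdf data.nt DB`. A second run with the same arguments skips the parse and reloads from the kept `data.nt`.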


If intermediate disk size is a problem, you can use .nt.gz files.

The NTriples parser is nearly as fast working from these - they are expensive to create (compression is significantly slower than decompression - hence Google's snappy).


Parallel execution

  riot data.rdf | gzip > data.nt.gz

may work well at scale.
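
To get a feel for the disk saving from .nt.gz, here is a small self-contained sketch that needs only awk and gzip, not Jena. The file is synthetic and the example.org URIs are made up for illustration, so real data will compress differently.

```shell
# Build a throwaway N-Triples file (synthetic data, illustration only).
tmp=$(mktemp -d)
awk 'BEGIN {
    for (i = 0; i < 100000; i++)
        printf "<http://example.org/s%d> <http://example.org/p> \"%d\" .\n", i, i
}' > "$tmp/data.nt"

# Compress once; this is the expensive direction.
gzip -c "$tmp/data.nt" > "$tmp/data.nt.gz"

# Compare sizes on disk.
plain=$(wc -c < "$tmp/data.nt")
packed=$(wc -c < "$tmp/data.nt.gz")
echo "data.nt: $plain bytes, data.nt.gz: $packed bytes"

# Streaming decompression gives back exactly the same triples.
gzip -dc "$tmp/data.nt.gz" | cmp - "$tmp/data.nt" && echo "decompressed stream matches"
```

The last line is the same streaming read a loader performs when it works from the compressed file.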

Thanks for the report,

        Andy


-Sarven
http://csarven.ca/#i
