On 11/02/13 12:51, Sarven Capadisli wrote:
Hi,

I just wanted to drop in some ballpark stats from my end, and hopefully
get some feedback about your own experience.

As is fairly well known, there is a drastic speed difference between
N-Triples and RDF/XML on first load. Giving 12GB of memory to tdbloader,
I can see N-Triples adding around 110k triples/second, whereas RDF/XML
is around 33k triples/second.

It will slow down as the load grows ...


However, there is not all that much of a speed difference between
loading N-Triples and RDF/XML into an existing TDB store. N-Triples gets
added at around 33k triples/second, whereas RDF/XML is around 18k
triples/second, i.e., N-Triples is roughly 15k triples per second faster.

That sounds about right.
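
To put those rates in perspective, here is a rough back-of-the-envelope sketch. The 100 million triple dataset size is a made-up figure for illustration, not something from the report, and the calculation assumes a constant rate, so treat the results as lower bounds since the rate drops as the load grows.

```python
def load_minutes(triples, triples_per_second):
    """Minutes needed to load `triples` at a constant rate (an idealisation)."""
    return triples / triples_per_second / 60

# Rates quoted in the report above, in triples/second.
RATES = {
    "N-Triples, empty store": 110_000,
    "RDF/XML, empty store": 33_000,
    "N-Triples, existing store": 33_000,
    "RDF/XML, existing store": 18_000,
}

# Hypothetical dataset size, chosen only for illustration.
DATASET_TRIPLES = 100_000_000

for label, rate in RATES.items():
    print(f"{label}: {load_minutes(DATASET_TRIPLES, rate):.1f} minutes")
```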

Bulk loading an empty database is a special case, especially in tdbloader2, which builds the trees directly using knowledge of the on-disk layout of B+Trees. Even tdbloader, which does not play raw index games, relies on the order of actions to maximise caching efficiency.

Incremental loads are done triple-by-triple to check for duplicates against existing data. Large additional loads could use some of the optimizations of tdbloader (not tdbloader2); I don't know where the crossover would be between adding a few triples in a loop and adding in bulk to the existing indexes.

The NTriples vs RDF/XML differences come down to parser speeds.

NTriples is fastest - even faster than Turtle despite shifting more bytes - because large files are usually written sequentially to disk, so reading runs at disk interface speeds with few costly seeks.

NTriples parsing is fast enough that making it run in parallel with the loader gave no gain and even caused a small drop in performance.

RDF/XML is expensive to parse and parallelism would be more beneficial. You can achieve that effect in part by running


   riot data.rdf | tdbloader --loc=DB -- -

because the parser and the loader are then separate processes, and so run on separate threads.
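
If you drive loads from a script rather than a shell, the same two-process arrangement can be sketched in Python. This is only an illustration of the pipe, not part of Jena; `run_pipeline` is a made-up helper name, and the commented usage assumes riot and tdbloader are on the PATH.

```python
import subprocess

def run_pipeline(producer_cmd, consumer_cmd):
    """Run `producer | consumer` as two OS processes and return both exit codes.

    Hypothetical helper: it only wires stdout of the first command to stdin
    of the second, as the shell pipe does.
    """
    producer = subprocess.Popen(producer_cmd, stdout=subprocess.PIPE)
    consumer = subprocess.Popen(consumer_cmd, stdin=producer.stdout)
    # Close our copy of the pipe so the producer sees a broken pipe
    # if the consumer exits early.
    producer.stdout.close()
    consumer_rc = consumer.wait()
    producer_rc = producer.wait()
    return producer_rc, consumer_rc

# Usage, mirroring the command line above:
# run_pipeline(["riot", "data.rdf"], ["tdbloader", "--loc=DB", "--", "-"])
```

The parser and loader then overlap in time just as they do with the shell pipe; the script only adds a place to check both exit codes.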

Personally, I parse all data to NTriples before loading, to check for errors in the data and make sure the load will be clean. I keep the intermediate files to avoid needing to recreate them if I reload. Then I load the NTriples at speed.

If intermediate disk size is a problem, you can use .nt.gz files.
The NTriples parser is nearly as fast working from these, although they are expensive to create (compression is significantly slower than decompression - hence Google's snappy).
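
As a quick sanity check on such a file, a minimal sketch like the following streams a .nt.gz without unpacking it to disk. It is not a parser: it only counts non-blank, non-comment lines, relying on NTriples being one statement per line, and `count_statement_lines` is a name made up for this example.

```python
import gzip

def count_statement_lines(path):
    """Count non-blank, non-comment lines in a gzip-compressed N-Triples file.

    This is a line count, not a syntax check: malformed triples are
    counted too.
    """
    count = 0
    # gzip.open in text mode decompresses on the fly, line by line.
    with gzip.open(path, "rt", encoding="utf-8") as lines:
        for line in lines:
            stripped = line.strip()
            if stripped and not stripped.startswith("#"):
                count += 1
    return count

# Usage: count_statement_lines("data.nt.gz")
```

Comparing the count before and after a load gives a rough check, bearing in mind that duplicate triples in the file will not add to the store.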

Parallel execution

  riot data.rdf | gzip > data.nt.gz

may work well at scale.

Thanks for the report,

        Andy


-Sarven
http://csarven.ca/#i

