On 11/02/13 12:51, Sarven Capadisli wrote:
> Hi,
> I just wanted to drop in some ballpark stats from my end, and hopefully
> get some feedback about your own experience.
> As it is fairly well-known, there is a drastic speed difference between
> N-Triples and RDF/XML on first load. Giving 12GB of memory to tdbloader,
> I can see N-Triples adding around 110k triples/second, whereas RDF/XML
> is around 33k triples/second.
It will slow down as the load grows ...
> However, there is not all that much of a speed difference between
> loading N-Triples and RDF/XML into an existing TDB store. N-Triples gets
> added around 33k triples/second, whereas RDF/XML around 18k
> triples/second i.e., N-Triples is ballpark 15k triples per second faster.
That sounds about right.
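To put those rates in terms of wall-clock time, here is a rough sketch. It assumes a constant rate, which is optimistic because loading slows down as the store grows, so treat the results as lower bounds. The function names are made up for illustration; the rates are the ones quoted in this thread.

```python
# Rates quoted in this thread, in triples/second.
RATES = {
    ("empty", "N-Triples"): 110_000,
    ("empty", "RDF/XML"): 33_000,
    ("existing", "N-Triples"): 33_000,
    ("existing", "RDF/XML"): 18_000,
}


def load_seconds(triples, rate_per_second):
    """Estimated seconds to load `triples` at a constant rate."""
    if rate_per_second <= 0:
        raise ValueError("rate must be positive")
    return triples / rate_per_second


def estimate_table(triples):
    """Map each (store state, syntax) pair to an estimated load time in seconds."""
    return {key: load_seconds(triples, rate) for key, rate in RATES.items()}
```

Calling estimate_table(n) with your own triple count gives one estimate per combination of store state and syntax.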
Bulkloading an empty database is a special case, especially in
tdbloader2 which builds trees by knowing about the disk layout of
B+Trees. Even tdbloader, which does not play raw index games, relies on
the order of actions to maximise caching efficiency.
Incremental loads are done triple-by-triple to check for duplicates with
existing data. Large additional loads could have some of the
optimizations of tdbloader (not tdbloader2); I don't know where the
crossover between adding a few triples in a loop and incrementally
adding in bulk to existing indexes would be.
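As a toy illustration of the two paths (this is not TDB's code, and sorted Python lists stand in for B+Trees), the sketch below contrasts a per-triple insert with a duplicate check against building each index once from sorted data. The class and method names are hypothetical.

```python
import bisect

# Each index order is a permutation of (subject, predicate, object) positions.
ORDERS = {"SPO": (0, 1, 2), "POS": (1, 2, 0), "OSP": (2, 0, 1)}


def _key(triple, order):
    return tuple(triple[i] for i in order)


class ToyStore:
    def __init__(self):
        self.indexes = {name: [] for name in ORDERS}

    def contains(self, triple):
        spo = self.indexes["SPO"]
        key = _key(triple, ORDERS["SPO"])
        i = bisect.bisect_left(spo, key)
        return i < len(spo) and spo[i] == key

    def add(self, triple):
        """Incremental path: duplicate check, then one insert per index."""
        if self.contains(triple):
            return False
        for name, order in ORDERS.items():
            bisect.insort(self.indexes[name], _key(triple, order))
        return True

    def bulk_load_empty(self, triples):
        """Bulk path for an empty store: dedupe once, sort once per index."""
        if any(self.indexes.values()):
            raise ValueError("bulk path in this sketch requires an empty store")
        unique = set(triples)
        for name, order in ORDERS.items():
            self.indexes[name] = sorted(_key(t, order) for t in unique)
        return len(unique)
```

Both paths end with the same index contents; the difference is that add() pays for a lookup and three ordered inserts per triple, while bulk_load_empty() does one pass per index.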
The NTriples vs RDF/XML differences come down to parser speeds.
NTriples is fastest, even faster than Turtle despite shifting more
bytes, because large files are usually written sequentially to disk, so
reading runs at disk interface speeds, with few costly seeks.
NTriples parsing is fast enough that making it run in parallel with the
loader gave no gain, and even a small drop in performance.
RDF/XML is expensive to parse and parallelism would be more beneficial.
You can achieve that effect in part by running
    riot data.rdf | tdbloader --loc=DB -- -
because the parser and the loader are then different processes, and so
different threads.
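If you drive loads from a script rather than a shell, the same two-process pipe can be sketched in Python as below. The helper name run_pipeline is made up here; the two command lists simply repeat the commands above and assume riot and tdbloader are on the PATH.

```python
import subprocess

# Same commands as in the shell pipeline above.
RIOT_CMD = ["riot", "data.rdf"]
LOADER_CMD = ["tdbloader", "--loc=DB", "--", "-"]


def run_pipeline(producer_cmd, consumer_cmd):
    """Run `producer | consumer` as two OS processes; return both exit codes."""
    producer = subprocess.Popen(producer_cmd, stdout=subprocess.PIPE)
    consumer = subprocess.Popen(consumer_cmd, stdin=producer.stdout)
    # Close the parent's copy of the pipe so the producer can receive
    # SIGPIPE if the consumer exits early.
    producer.stdout.close()
    consumer_rc = consumer.wait()
    producer_rc = producer.wait()
    return producer_rc, consumer_rc
```

Calling run_pipeline(RIOT_CMD, LOADER_CMD) then returns the exit codes of the parser and the loader, so a failed parse does not go unnoticed.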
Personally, I parse all data first to get NTriples before loading to
check for errors in the data and make sure the load will be clean. I
keep the intermediate files to avoid needing to recreate them if I
reload. Then I load the NTriples at speed.
If intermediate disk size is a problem, you can use .nt.gz files.
The NTriples parser is nearly as fast working from these, although they
are expensive to create (compression is significantly slower than
decompression - hence Google's snappy).
Parallel execution, as in
    riot data.rdf | gzip > data.nt.gz
may work well at scale.
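For a quick sanity check on such a file before loading, a minimal sketch that streams a .nt.gz and counts triple lines is below. It only counts non-blank, non-comment lines and does not validate syntax; the function names are made up for illustration.

```python
import gzip


def write_nt_gz(path, lines):
    """Write N-Triples lines to a gzip file, one triple per line."""
    with gzip.open(path, "wt", encoding="utf-8") as out:
        for line in lines:
            out.write(line.rstrip("\n") + "\n")


def count_triples_gz(path):
    """Stream a .nt.gz file and count non-blank, non-comment lines."""
    count = 0
    with gzip.open(path, "rt", encoding="utf-8") as src:
        for line in src:
            stripped = line.strip()
            if stripped and not stripped.startswith("#"):
                count += 1
    return count
```

Because the file is read as a stream, memory use stays flat regardless of file size.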
Thanks for the report,
Andy
> -Sarven
> http://csarven.ca/#i