On 22/06/12 17:22, Rob Vesse wrote:
Off the top of my head I believe loading into an empty database is always
faster because of the way it generates the index files and node tables.
When loading to an existing dataset it tends to be slower because it has
to add to the existing files rather than generating them from scratch.
Yes.
It performs the operations in an order that is index friendly - it avoids
inserting in a random fashion and tries to make inserts roughly sequential.
That makes disk caches more effective, and real disk access is
expensive: of the order of 1e6 instructions.
When empty:
tdbloader loads SPO and the nodes together, then creates the secondary
indexes one at a time.
In normal use, loading SPO is a sequential process - data arrives in
blocks of same-subject. The code does not depend on this but it is
faster if it is. The SPO index is written in (at a macroscopic level)
sequential order - the loader completes a B+Tree block and does not need
to come back to it later. POS and OSP are made by a sequential walk
through SPO.
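
A toy sketch of that "primary first, secondaries from a walk of the primary" shape. This is not TDB code: an index here is just a sorted Python list, and the name build_indexes is made up for the illustration.

```python
def build_indexes(triples):
    """Toy model of the load order: SPO first, then POS and OSP from SPO.

    Each "index" is a sorted list of tuples standing in for a B+Tree.
    """
    # Primary index: triples in (subject, predicate, object) order.
    spo = sorted(set(triples))
    # Secondary indexes: one sequential pass over SPO each, permuting
    # the columns; sorted() stands in for building the tree.
    pos = sorted((p, o, s) for (s, p, o) in spo)
    osp = sorted((o, s, p) for (s, p, o) in spo)
    return spo, pos, osp


triples = [("s2", "p1", "o1"), ("s1", "p2", "o2"), ("s1", "p1", "o3")]
spo, pos, osp = build_indexes(triples)
```

The point is only the ordering of the work: each secondary index is produced in one go from data that is already in a known order, rather than being updated triple by triple as input arrives.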
tdbloader2 is more extreme - it loads the node data and outputs a stream
of triples to a temporary file (as a text format!). It sorts the
temporary file into the necessary order for an index, then loads it into
the index. It repeats this for all the indexes.
The sorting is done by unix sort(1) - while it seems like more work, this
is a very efficient program and, for large data, it is faster. Where the
cut-over is depends on the data shape and the machine.
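
A rough sketch of the write-text, sort, then load-in-order idea. It is an illustration only: sorted_stream is a hypothetical name, Python's sorted() stands in for sort(1), and it assumes terms contain no whitespace. It does not reproduce the real temporary file format.

```python
import os
import tempfile


def sorted_stream(triples, order):
    """Yield triples permuted into `order` (e.g. "pos"), in sorted order.

    The triples go through a temporary text file, one per line, to
    mimic the dump / sort / load pipeline.
    """
    cols = ["spo".index(c) for c in order]
    with tempfile.NamedTemporaryFile("w+", suffix=".txt", delete=False) as tmp:
        for t in triples:
            tmp.write(" ".join(t[i] for i in cols) + "\n")
        path = tmp.name
    try:
        with open(path) as f:
            lines = sorted(f)  # stand-in for sort(1) on the temp file
    finally:
        os.remove(path)
    # The consumer now sees tuples in index order and can append them
    # sequentially instead of inserting at random positions.
    for line in lines:
        yield tuple(line.split())


pos_order = list(sorted_stream([("s2", "p1", "o1"), ("s1", "p2", "o2")], "pos"))
```

Running it once per index order ("spo", "pos", "osp") corresponds to the "repeats for all the indexes" step.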
tdbloader3 is like tdbloader2 except that it is pure Java, uses a binary
format, and does parallel block sorts.
Load time: 16 minutes
average loading: ca. 81,000 triples / second
index time: 40 minutes
store size: 9.3 GB
The second test was to store the same data into an already filled store.
Before I started the import I created a store with 348,398,593 triples
from DNB and HBZ (which are German libraries; store size: 33 GB).
Then I started to load the German DBpedia in.
Load time: 3 hours and 4 minutes
average loading: ca 7200 / second
Looks like tdbloader - it needs to check for the existence of each
triple when loading into a pre-filled store. That is random access, and
slow.
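
As a rough cross-check of the figures quoted in this thread (a sketch using only the rounded times and rates reported above; implied_triples is a name made up here):

```python
def implied_triples(minutes, triples_per_second):
    """Total triples implied by a load time and an average rate."""
    return minutes * 60 * triples_per_second


# Load into the empty store: 16 minutes at ca. 81,000 triples/second.
empty_store = implied_triples(16, 81_000)        # 77,760,000
# Load into the pre-filled store: 3 h 4 min at ca. 7,200 triples/second.
prefilled = implied_triples(3 * 60 + 4, 7_200)   # 79,488,000
# Ratio of the two average rates.
slowdown = 81_000 / 7_200                        # 11.25
```

Both runs imply roughly 78-79 million triples, so the two sets of numbers are consistent with the same data being loaded, about 11 times slower per triple the second time.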
index time: 38 minutes
store size: 19 GB!!!!!
I don't know why for sure, but jumps in size can happen because the
indexes need to be only slightly bigger but the unit of allocation is 8M
(memory mapped files). That is why an empty database can look quite
large - there are 8M files and, although they are sparse, some OSs (Mac)
count the space.
If there are some bNodes then that can lead to some new nodes and
triples causing index sizes to jump.
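
To illustrate the allocation granularity only (it is not an explanation of the whole size difference): a minimal sketch assuming "8M" means 8 MiB and that a file grows in whole units of that size. allocated_size is a hypothetical helper, not a TDB function.

```python
SEGMENT = 8 * 1024 * 1024  # assumption: "8M" is 8 MiB


def allocated_size(used_bytes, segment=SEGMENT):
    """Round used_bytes up to a whole number of allocation units."""
    segments = -(-used_bytes // segment)  # ceiling division
    return segments * segment


just_full = allocated_size(SEGMENT)       # exactly one unit
one_over = allocated_size(SEGMENT + 1)    # one extra byte costs a whole unit
```

So a file that needs one byte more than a multiple of 8M is reported as a full 8M larger, and the same applies separately to each memory mapped file in the database directory.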
What is "ls -lh" saying?
Andy