On 22/06/12 17:22, Rob Vesse wrote:
Off the top of my head I believe loading into an empty database is always
faster because of the way it generates the index files and node tables.
When loading to an existing dataset it tends to be slower because it has
to add to the existing files rather than generating them from scratch.

Yes.

It performs the operations in an order that is index friendly - it avoids inserting in a random fashion and tries to make insertion roughly sequential. That makes disk caches more efficient, and real disk access is expensive: of the order of 1e6 instructions.
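As a rough illustration of why insertion order matters, here is a minimal sketch that counts misses against a small LRU block cache. The block size, the cache size and the function name `count_block_misses` are made up for the example; it is a toy model, not TDB's actual block manager.

```python
import random
from collections import OrderedDict

def count_block_misses(keys, keys_per_block=100, cache_blocks=8):
    """Count cache misses when keys are inserted in the given order.

    Assumption for the sketch: key k lives in block k // keys_per_block,
    and the cache is a small LRU holding `cache_blocks` blocks.
    """
    cache = OrderedDict()
    misses = 0
    for key in keys:
        block = key // keys_per_block
        if block in cache:
            cache.move_to_end(block)       # mark block as recently used
        else:
            misses += 1                    # would be a real disk access
            cache[block] = True
            if len(cache) > cache_blocks:
                cache.popitem(last=False)  # evict least recently used block
    return misses

keys = list(range(10_000))
sequential_misses = count_block_misses(keys)

shuffled = keys[:]
random.Random(42).shuffle(shuffled)
random_misses = count_block_misses(shuffled)

print(sequential_misses, random_misses)
```

In sequential order each of the 100 blocks is fetched exactly once; in shuffled order most inserts land on a block that is no longer cached.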

When empty:

tdbloader loads SPO and the nodes together, then creates the secondary indexes one at a time.

In normal use, loading SPO is a sequential process - data arrives in blocks of same-subject triples. The code does not depend on this, but it is faster when it holds. The SPO index is written in (at a macroscopic level) sequential order - a B+Tree block is completed and there is no need to come back to it later. POS and OSP are then made by a sequential walk through SPO.
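A minimal sketch of the idea of deriving the secondary indexes from SPO, assuming triples are tuples of integer node ids and using sorted Python lists in place of B+Trees. The name `build_indexes` is hypothetical; the real loader works on disk blocks, not in-memory lists.

```python
def build_indexes(triples):
    """Build SPO first, then derive POS and OSP by walking SPO once each."""
    spo = sorted(set(triples))                     # primary index, duplicates dropped
    pos = sorted((p, o, s) for (s, p, o) in spo)   # one pass over SPO, reordered
    osp = sorted((o, s, p) for (s, p, o) in spo)   # one pass over SPO, reordered
    return spo, pos, osp

triples = [(2, 10, 3), (1, 11, 3), (1, 10, 2), (1, 10, 2)]
spo, pos, osp = build_indexes(triples)
print(spo)
print(pos)
print(osp)
```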

tdbloader2 is more extreme - it loads the node data and outputs a stream of triples to a temporary file (as a text format!). It sorts the temporary file into the necessary order for an index, then loads it into the index. It repeats this for all the indexes.

The sorting is done by unix sort(1) - while it seems like more work, this is a very efficient program and, for large data, it is faster. Where the cut-over is depends on data shape and machine.
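The dump, sort, load cycle can be sketched roughly as follows. This is not the tdbloader2 script: the fixed-width hex line format, the function names and the in-memory `sort_text_file` (standing in for sort(1), which can spill to disk) are all assumptions for the example.

```python
import os
import tempfile

def dump_triples(triples, order, path):
    """Write one fixed-width hex line per triple, columns permuted into `order`."""
    perm = ["SPO".index(c) for c in order]
    with open(path, "w") as out:
        for t in triples:
            out.write(" ".join(f"{t[i]:016x}" for i in perm) + "\n")

def sort_text_file(src, dst):
    """Stand-in for sort(1): sorts in memory, so only suitable for small inputs."""
    with open(src) as f:
        lines = f.readlines()
    lines.sort()  # fixed-width hex makes text order match numeric order
    with open(dst, "w") as out:
        out.writelines(lines)

def load_sorted(path):
    """Read the sorted file back in order, as a bulk index load would."""
    with open(path) as f:
        return [tuple(int(field, 16) for field in line.split()) for line in f]

def build_index(triples, order):
    with tempfile.TemporaryDirectory() as tmp:
        raw = os.path.join(tmp, "triples.txt")
        ordered = os.path.join(tmp, order + ".txt")
        dump_triples(triples, order, raw)
        sort_text_file(raw, ordered)
        return load_sorted(ordered)

triples = [(2, 10, 3), (1, 11, 3), (1, 10, 2)]
pos_index = build_index(triples, "POS")
print(pos_index)
```

Because the file arrives already in index order, the load step only ever appends, which is the point of sorting first.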

tdbloader3 is like tdbloader2 except that it is pure Java, binary, and does parallel block sorts.
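For the general shape of a parallel block sort, here is a small sketch: split the input into blocks, sort the blocks concurrently, then merge the sorted runs. It is not tdbloader3's code; the block size, worker count and the name `parallel_block_sort` are assumptions, and threads are used only to show the structure (in CPython they will not speed up a pure-Python sort).

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def parallel_block_sort(items, block_size=4, workers=2):
    """Sort fixed-size blocks concurrently, then k-way merge the sorted runs."""
    blocks = [items[i:i + block_size] for i in range(0, len(items), block_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sorted_blocks = list(pool.map(sorted, blocks))
    return list(heapq.merge(*sorted_blocks))

triples = [(3, 1, 2), (1, 2, 3), (2, 2, 2), (1, 1, 1), (3, 3, 3), (2, 1, 1)]
result = parallel_block_sort(triples, block_size=2)
print(result)
```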


Load time: 16 minutes
average loading: ca 81.000 triple / second
index time: 40 minutes
store size: 9,3GB


The second test was to store the same data into an already filled store.
As I started the import I created a store with 348.398.593 triples from DNB and
HBZ (which are German libraries, store size: 33 GB).
Then I started to load the German DBpedia in.

Load time: 3 hours and 4 minutes
average loading: ca 7200 / second

Looks like tdbloader - it needs to check for the existence of each triple when loading into a pre-filled store. That is random access and slow.

index time: 38 minutes
store size: 19 GB!!!!!

I don't know why for sure, but jumps in size can happen because the indexes need to be slightly bigger but the unit of allocation is 8M (memory mapped files). It's why an empty database can look quite large - there are 8M files and, although they are sparse, some OSs (Mac) count the space.
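The effect of an 8M allocation unit is plain rounding arithmetic, sketched below. The function name is hypothetical, and treating an empty file as occupying one full unit is an assumption based on the remark about empty databases.

```python
SEGMENT = 8 * 1024 * 1024  # the 8M allocation unit mentioned above

def allocated_size(bytes_used, segment=SEGMENT):
    """Round a file's used bytes up to a whole number of allocation units.

    Assumption: even an empty file takes up one full unit.
    """
    if bytes_used <= 0:
        return segment
    return -(-bytes_used // segment) * segment  # ceiling division

print(allocated_size(0))            # one unit for an empty file
print(allocated_size(SEGMENT))      # exactly full: still one unit
print(allocated_size(SEGMENT + 1))  # one byte over: jumps to two units
```

So an index that grows by a single byte past a unit boundary shows up as an 8M jump on disk.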

If there are some bNodes then that can lead to some new nodes and triples causing index sizes to jump.

What is "ls -lh" saying?

        Andy
