Re: TDBLoader2 Performance on Empty vs Existing Store (WAS: Import Messures)

Andy Seaborne Mon, 25 Jun 2012 06:08:22 -0700

On 22/06/12 17:22, Rob Vesse wrote:

Off the top of my head I believe loading into an empty database is always
faster because of the way it generates the index files and node tables.
When loading to an existing dataset it tends to be slower because it has
to add to the existing files rather than generating them from scratch.


Yes.

It performs the operations in an order that is index friendly - it
avoids inserting in a random fashion and tries to keep inserts roughly
sequential. That makes disk caches more efficient, and real disk access
is expensive: of the order of 1e6 instructions.
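
To illustrate why ordering matters, here is a toy model (not TDB code):
keys are grouped into fixed-size blocks, and we count how often
consecutive inserts land in a different block. BLOCK_SIZE, the key range
and block_switches are all made up for the illustration; a real B+Tree
and disk cache behave in more complicated ways.

```python
import random

BLOCK_SIZE = 100  # keys per block; an arbitrary value for this toy model


def block_switches(keys, block_size=BLOCK_SIZE):
    """Count how often consecutive inserts land in a different block."""
    switches = 0
    previous_block = None
    for key in keys:
        block = key // block_size
        if previous_block is not None and block != previous_block:
            switches += 1
        previous_block = block
    return switches


keys = list(range(10_000))
sequential = block_switches(keys)

shuffled = keys[:]
random.Random(0).shuffle(shuffled)  # fixed seed so runs are repeatable
scattered = block_switches(shuffled)

print("sequential order:", sequential, "block switches")
print("random order:    ", scattered, "block switches")
```

In sequential order each block is entered once and then left for good;
in random order nearly every insert moves to a different block, which is
the access pattern that defeats caching.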


When empty:

tdbloader loads SPO and the nodes together, then creates the secondary
indexes one at a time.

In normal use, loading SPO is a sequential process - data arrives in
blocks of same-subject triples. The code does not depend on this but it
is faster if it is. The SPO index is written in (at a macroscopic level)
sequential order - each B+Tree block is completed and need not be
revisited later. POS and OSP are then built by a sequential walk
through SPO.

tdbloader2 is more extreme - it loads the node data and outputs a stream
of triples to a temporary file (as a text format!). It sorts the
temporary file into the necessary order for an index, then loads it into
the index. It repeats this for all the indexes.

The sorting is done by unix sort(1) - while it seems like more work,
this is a very efficient program and, for large data, it is faster.
Where the cut-over is depends on data shape and machine.
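
As a rough sketch of the sort-then-load idea (an illustration only, not
tdbloader2's code or file formats): each index is the same set of
triples with the columns reordered and then sorted, so it can be written
in key order. Here Python's in-memory sorted() stands in for sort(1) on
the temporary file, and the sample triples, INDEX_ORDERS and build_index
are hypothetical names.

```python
# Hypothetical sample data; the real loader works from the dumped triple stream.
triples = [
    ("s2", "p1", "o1"),
    ("s1", "p2", "o3"),
    ("s1", "p1", "o2"),
    ("s3", "p1", "o1"),
]

# Each index is named after its column order; map the name to positions in (S, P, O).
INDEX_ORDERS = {"SPO": (0, 1, 2), "POS": (1, 2, 0), "OSP": (2, 0, 1)}


def build_index(triples, order):
    """Reorder each triple's columns, then sort so rows come out in key order."""
    return sorted(tuple(t[i] for i in order) for t in triples)


# One sort pass per index, repeated for all the indexes.
indexes = {name: build_index(triples, order) for name, order in INDEX_ORDERS.items()}

for name, rows in indexes.items():
    print(name, rows)
```

Because every index is produced already sorted, loading it is a
sequential write rather than a series of random inserts.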

tdbloader3 is like tdbloader2 except pure Java, binary and does parallel
block sorts.


Load time: 16 minutes
average loading: ca 81.000 triple / second
index time: 40 minutes
store size: 9,3GB


The second test was to store the same data into an already filled store.
As I started the import I created a store with 348.398.593 triples from DNB and
HBZ (which are German libraries, store size: 33 GB).
Then I started to load the German DBpedia in.

Load time: 3 hours and 4 minutes
average loading: ca 7200 / second

Looks to be like tdbloader - it needs to check for the existence of any
triple when loading a pre-filled store. That is random access and slow.

index time: 38 minutes
store size: 19 GB!!!!!

I don't know why for sure, but jumps in size can be because the indexes
need to be slightly bigger but the unit of allocation is 8M (memory
mapped files). It's why an empty database can look quite large - there
are 8M files, and although sparse, some OSs (Mac) count the space.

If there are some bNodes then that can lead to some new nodes and
triples causing index sizes to jump.
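
To see the sparse-file effect in isolation, here is a small sketch that
assumes a POSIX system and a filesystem with sparse file support. It
creates one 8M file holding only 4K of real data and compares the
apparent size with the space actually allocated. The file name is made
up and this is not how TDB creates its files.

```python
import os
import tempfile

SEGMENT = 8 * 1024 * 1024  # 8M, the allocation unit mentioned above

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "index.dat")  # hypothetical file name
    with open(path, "wb") as f:
        f.write(b"x" * 4096)   # a little real data
        f.truncate(SEGMENT)    # extend to a full segment without writing the rest

    st = os.stat(path)
    apparent = st.st_size            # the length "ls -l" shows
    allocated = st.st_blocks * 512   # blocks actually allocated (st_blocks is Unix-only)

print("apparent size: ", apparent)
print("allocated size:", allocated)
```

Where sparse files are supported the allocated figure is usually far
below 8M; tools that report only the apparent size make the store look
bigger than the disk space it really uses.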


What is "ls -lh" saying?

        Andy
