On 24/07/12 12:24, Michael Brunnbauer wrote:

Hello Andy,

On Tue, Jul 24, 2012 at 01:13:59PM +0200, Michael Brunnbauer wrote:
BTW: Here is some output from tdbloader2 for this TDB which shows that
the tdbloader2 data phase runtime gets quite non-linear for very big datasets.
I called tdbloader2 with JVM_ARGS="-Xmx32768M -server" and it did not seem to
run into memory problems.

I should be more specific here: Whenever I watched it after 10^9 quads it was
doing disk IO (I think mostly writes, probably to node2id.dat and nodes.dat).
Would it be possible to generate node2id.dat and nodes.dat without random
access?

(see also tdbloader4)

Yes - it looks like it is the node table, part of which is a B+Tree mapping a hash (128 bits) to NodeId. This is used to see if the node has already been encountered. There is a cache - maybe it needs greatly increasing in size, or the node table needs a more explicit in-memory structure in front of it for bulk loading. At query time, this isn't such an important lookup.
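
To make the idea concrete, here is a rough sketch of an in-memory cache sitting in front of a hash-to-NodeId index. It is not TDB code: the class names, the LRU policy, the plain dict standing in for the on-disk B+Tree, and the use of MD5 (picked only because it gives a 128-bit digest) are all assumptions for illustration.

```python
import hashlib
from collections import OrderedDict


class NodeTable:
    """Hypothetical stand-in for the on-disk hash -> NodeId index.

    A dict is used here; in the real store this would be a B+Tree on disk,
    so every call to get() would be a potential random disk access.
    """

    def __init__(self):
        self._index = {}
        self.lookups = 0  # counts how often the "disk" index is consulted

    def get(self, key):
        self.lookups += 1
        return self._index.get(key)

    def allocate(self, key):
        # Assign the next sequential id to a node seen for the first time.
        node_id = len(self._index)
        self._index[key] = node_id
        return node_id


class CachedNodeTable:
    """LRU cache in front of the node table (assumed policy, for illustration)."""

    def __init__(self, table, capacity):
        self._table = table
        self._capacity = capacity
        self._cache = OrderedDict()

    def get_or_allocate(self, node):
        # 16-byte (128-bit) digest of the node's string form.
        key = hashlib.md5(node.encode("utf-8")).digest()
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as most recently used
            return self._cache[key]
        node_id = self._table.get(key)
        if node_id is None:
            node_id = self._table.allocate(key)
        self._cache[key] = node_id
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)  # evict least recently used
        return node_id
```

With a cache large enough to hold the frequently repeated nodes, most "already seen?" checks during a bulk load would be answered from memory, and only cache misses would touch the index.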

How big are the node* files (node2id.dat, .idn, nodes.dat) in the resulting database in this case?

        Andy

Regards,

Michael Brunnbauer

