On 24/07/12 12:24, Michael Brunnbauer wrote:
Hello Andy,
On Tue, Jul 24, 2012 at 01:13:59PM +0200, Michael Brunnbauer wrote:
BTW: Here is some output from tdbloader2 for this TDB, which shows that
the tdbloader2 data-phase runtime becomes quite non-linear for very big datasets.
I called tdbloader2 with JVM_ARGS="-Xmx32768M -server" and it did not seem to
run into memory problems.
I should be more specific here: whenever I watched it after 10^9 quads, it was
doing disk I/O (I think mostly writes, probably to node2id.dat and nodes.dat).
Would it be possible to generate node2id.dat and nodes.dat without random
access?
(see also tdbloader4)
Yes - it looks like it is the node table, part of which is a B+Tree mapping a
hash (128 bits) to a NodeId. This is used to check whether a node has already
been encountered. There is a cache - maybe it needs to be greatly increased in
size, or a more explicit in-memory structure should front the node table during
bulk loading. At query time, this isn't such an important lookup.
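For illustration, here is a minimal sketch (Java) of the kind of in-memory structure that could front the node table during bulk loading. The NodeTable interface, its method names, and the string-keyed map are assumptions made for the sketch, not Jena's actual internals:

import java.util.HashMap;
import java.util.Map;

class BulkLoadNodeCache {

    // Hypothetical view of the on-disk node table (node2id.dat / nodes.dat).
    interface NodeTable {
        long lookup(String node);   // NodeId, or -1 if the node is not stored yet
        long store(String node);    // append the node, return its new NodeId
    }

    private final NodeTable diskTable;
    // In-memory front: node value -> NodeId. A real loader would bound this,
    // or key it on the 128-bit hash rather than the full node value.
    private final Map<String, Long> front = new HashMap<>();

    BulkLoadNodeCache(NodeTable diskTable) {
        this.diskTable = diskTable;
    }

    // Resolve a node to its NodeId, touching the on-disk B+Tree only on a miss.
    long getOrAllocate(String node) {
        Long id = front.get(node);
        if (id != null) return id;            // hit: no disk access
        long diskId = diskTable.lookup(node);
        if (diskId == -1) diskId = diskTable.store(node);
        front.put(node, diskId);
        return diskId;
    }
}

With heavy node reuse, most lookups during a load would be answered from the map, avoiding the random B+Tree reads that seem to dominate the data phase; memory consumption is the obvious trade-off.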
How big are the node* files (node2id.dat, node2id.idn, nodes.dat) in the
resulting database in this case?
Andy
Regards,
Michael Brunnbauer