On 31/03/15 12:12, Michael Brunnbauer wrote:

Hello Andy,

On Tue, Mar 31, 2015 at 10:25:32AM +0100, Andy Seaborne wrote:
Also, tdbloader2 seems to be gradually slowed down from 100k triples/s to
< 1000 triples/s on a normal disk drive by random access after ca. 10 million
triples. Is this unavoidable? I made this change to tdbloader2 but I think it
is not relevant during the data phase:

-    SORT_ARGS="--buffer-size=50%"
+    SORT_ARGS="--buffer-size=2048M"
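For context: GNU sort interprets a percentage --buffer-size relative to the machine's physical memory, so on a 48GB box 50% would request roughly 24GB per sort process, while a fixed value caps it. A minimal standalone sketch of the flag (the file names here are hypothetical, not from the tdbloader2 script):

```shell
# GNU sort with a fixed in-memory buffer instead of a percentage of RAM.
# --buffer-size=50% on a 48GB machine means ~24GB per sort process;
# --buffer-size=2048M caps it at 2GB.
printf 'b\na\nc\n' > unsorted.txt
sort --buffer-size=2048M -o sorted.txt unsorted.txt
head -n1 sorted.txt   # prints "a"
```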

I have tried with Jena 2.13.0 and 2.11.1.

What's the machine it's running on?  OS?

Xeon E5502 with 48GB RAM, Linux 3.4.105 with glibc 2.19 and jdk-8u31-linux-x64.

As this is the data phase, tdbloader2 is, roughly, streaming the parser to
disk, allocating nodeids (which is a bad access pattern).  What size are the
node-related files?

I have it running right now at

"INFO  Add: 138,800,000 Data (Batch: 15,792 / Avg: 9,656)"

-rw-r--r-- 1 java java 7070640000 Mar 31 13:10 data-triples.17513
-rw-r--r-- 1 java java 2021654528 Mar 31 13:10 node2id.dat
-rw-r--r-- 1 java java   16777216 Mar 31 13:10 node2id.idn
-rw-r--r-- 1 java java 3858513162 Mar 31 13:10 nodes.dat

Wow. That looks like the unique node/triple ratio is quite high. I take it the data has a lot of content-like literals in it, or autogenerated URIs.

Lots of unique nodes can slow things down because of all the node writing.
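A rough back-of-envelope from the file sizes listed above; the ~100 bytes per serialized node is an assumption for illustration, not a measured figure:

```shell
# Estimate bytes/triple and the unique node/triple ratio from the
# directory listing above. 100 bytes per node record is an ASSUMPTION.
awk -v t=138800000 -v d=7070640000 -v n=3858513162 'BEGIN {
  printf "%.1f bytes/triple in the data file\n", d/t
  printf "~%.0fM unique nodes -> node/triple ratio ~%.2f\n", n/100/1e6, n/100/t
}'
```

Under that assumption the node table would hold tens of millions of unique nodes for ~139M triples, which is consistent with the node writing dominating the data phase.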


Does tdbloader do better? (sometimes it does, sometimes it doesn't).

I will try it if I fail with tdbloader2, but I guess it will work now because
I switched to an SSD for the tdb dir.

SSD+database => :-)


Regards,

Michael Brunnbauer

