Hi Andy,

Thank you so much. I fixed the heap issue, and will check and report my 
findings.

Cheers, Joachim

-----Ursprüngliche Nachricht-----
Von: Andy Seaborne [mailto:a...@apache.org] 
Gesendet: Montag, 28. Oktober 2013 16:58
An: users@jena.apache.org
Betreff: Re: Declining TDB load performance with larger files

Hi Joachim,

What is happing is that the system is running out of working space and the disk 
is being used for real.

 > JAVA_OPTS: -d64 -Xms6g -Xmx10g

Don't set -Xmx10g.  Try a 2G heap.  Don't bother with -Xms.

More heap does not help - in fact, it can make it worse.  TDB uses memory 
mapped files - these are not in Java heap space.  The operating system manages 
how much real RAM is devoted to the virtual address space for the file.  As 
your JVM grows, it is reducing the space for file caching.


There is another effect.  The OS is managing memory but sometimes it 
gets its policy wrong.   Oddly, the faster the initial part of the load, 
the slower the speed drops off to when workspace RAM runs out.  My guess is 
that the OS guesses some acecss style and then code then breaks that 
assumption.  It can even different from run to run on the same machine.

There is also tdbloader2 - it may be faster, it may not.  It is vulnerable to 
OS in different ways.

As it is so per-system specific, try each and see what happens, after fixing 
the heap issue.

        Andy


On 28/10/13 12:01, Neubert Joachim wrote:
> I'm loading a 111 million triples file (GND German Authority files).
> For the first roughly 70 million triples, it's really fast (more than
> 60,000 avg), but then throughput declines continuously to a thousand 
> or just some hundred triples (which brings down the avg to less than 
> 7000). During the last part of triples data phase, java is down to 
> 1-2% CPU usage, while disk usage goes up to 100%.
>
> As TDB writes to disk, I'd expect rather linear loading times. The 
> Centos 6 64bit machine (11.5 GB memory) runs on a VMware vSphere 
> cluster, with SAN hardware under-laying. As I observed the same 
> behavior at different times a day, with for sure different load 
> situations, there is no indication that it depended on parallel 
> actions on the cluster.
>
> Perhaps there is something wrong in my config, but I could not figure 
> out what it may be. I add an extract of the log below - it would be 
> great if somebody could help me with hints.
>
> Cheers, Joachim
>

 > ---------------
 >
 > 2013-10-25 13:33:33 start run
 >
 > Configuration:
 > java version "1.6.0_24"
 > Java(TM) SE Runtime Environment (build 1.6.0_24-b07)  > Java HotSpot(TM) 
 > 64-Bit Server VM (build 19.1-b02, mixed mode)  > JAVA_OPTS: -d64 -Xms6g 
 > -Xmx10g
 > Jena:       VERSION: 2.11.0
 > Jena:       BUILD_DATE: 2013-09-12T10:49:49+0100
 > ARQ:        VERSION: 2.11.0
 > ARQ:        BUILD_DATE: 2013-09-12T10:49:49+0100
 > RIOT:       VERSION: 2.11.0
 > RIOT:       BUILD_DATE: 2013-09-12T10:49:49+0100
 > TDB:        VERSION: 1.0.0
 > TDB:        BUILD_DATE: 2013-09-12T10:49:49+0100
 >
 > Use fuseki tdb.tdbloader on file /opt/thes/var/gnd/latest/src/GND.ttl.gz
 > INFO  -- Start triples data phase
 > INFO  ** Load empty triples table
 > INFO  Load: /opt/thes/var/gnd/latest/src/GND.ttl.gz -- 2013/10/25
13:33:35 MESZ
 > INFO  Add: 10.000.000 triples (Batch: 64.766 / Avg: 59.984)
 > INFO    Elapsed: 166,71 seconds [2013/10/25 13:36:21 MESZ]
 > INFO  Add: 20.000.000 triples (Batch: 71.839 / Avg: 58.653)
 > INFO    Elapsed: 340,99 seconds [2013/10/25 13:39:16 MESZ]
 > INFO  Add: 30.000.000 triples (Batch: 67.750 / Avg: 60.271)
 > INFO    Elapsed: 497,75 seconds [2013/10/25 13:41:52 MESZ]
 > INFO  Add: 40.000.000 triples (Batch: 68.212 / Avg: 60.422)
 > INFO    Elapsed: 662,01 seconds [2013/10/25 13:44:37 MESZ]
 > INFO  Add: 50.000.000 triples (Batch: 54.171 / Avg: 60.645)
 > INFO    Elapsed: 824,47 seconds [2013/10/25 13:47:19 MESZ]
 > INFO  Add: 60.000.000 triples (Batch: 58.823 / Avg: 60.569)
 > INFO    Elapsed: 990,60 seconds [2013/10/25 13:50:05 MESZ]
 > INFO  Add: 70.000.000 triples (Batch: 45.495 / Avg: 60.468)
 > INFO    Elapsed: 1.157,63 seconds [2013/10/25 13:52:52 MESZ]
 > INFO  Add: 80.000.000 triples (Batch: 50.050 / Avg: 57.998)
 > INFO    Elapsed: 1.379,36 seconds [2013/10/25 13:56:34 MESZ]
 > INFO  Add: 90.000.000 triples (Batch: 13.954 / Avg: 52.447)
 > INFO    Elapsed: 1.716,02 seconds [2013/10/25 14:02:11 MESZ]
 > INFO  Add: 100.000.000 triples (Batch: 1.134 / Avg: 19.024)
 > INFO    Elapsed: 5.256,29 seconds [2013/10/25 15:01:11 MESZ]
 > INFO  Add: 110.000.000 triples (Batch: 944 / Avg: 7.643)
 > INFO    Elapsed: 15.942,27 seconds [2013/10/25 17:59:17 MESZ]
 > INFO  -- Finish triples data phase
 > INFO  111.813.447 triples loaded in 16.288,16 seconds [Rate: 6.864,71 per 
 > second]  >  > Indexing phase also takes its time, and has some decrease in 
 > performance too, but does not show a sharp drop.
 >
 > INFO  -- Start triples index phase
 > INFO    Elapsed: 20.563,36 seconds [2013/10/25 19:16:18 MESZ]
 > INFO  ** Index SPO->POS: 111.786.233 slots indexed in 4.371,67 seconds 
 > [Rate: 25.570,57 per second]  > ...
 > INFO  -- Finish triples index phase
 > INFO  ** 111.786.233 triples indexed in 19.973,81 seconds [Rate: 
5.596,64 per second]
 > INFO  -- Finish triples load
 > INFO  ** Completed: 111.813.447 triples loaded in 36.261,98 seconds
[Rate: 3.083,49 per second]
 >
 >

Reply via email to