Hi Andy,

On this machine it's Red Hat EL 5.9, and, as you suspected, sort has no 
"--parallel" option.

Thank you very much for the explanations. I just filed an improvement 
suggestion to include tdbloader2 in the Fuseki distribution, to avoid having 
to experiment with the different options on large datasets.

Cheers, Joachim

-----Original Message-----
From: Andy Seaborne [mailto:a...@apache.org] 
Sent: Saturday, 2 November 2013 13:40
To: users@jena.apache.org
Subject: Re: AW: Declining TDB load performance with larger files

On 01/11/13 12:42, Neubert Joachim wrote:
> I did a comparison of tdbloader vs. tdbloader2. The results are not reliable 
> (machine-dependent, and perhaps even influenced by varying background load 
> on the VM cluster), but even so they may be interesting to others:
>

Thanks - I'm interested - and if anyone has examples that would be useful 
information as well.

Which OS is this? And specifically, does sort(1) have the --parallel option? 
(Ubuntu does; CentOS does not at the moment.)
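
A quick way to check is to probe the option directly and fall back to a plain 
invocation when it is missing; a small sketch (GNU sort specifics assumed, 
sample data is illustrative):

```shell
# Probe whether this sort(1) understands --parallel (GNU coreutils >= 8.6);
# sorting /dev/null succeeds only if the flag is recognised.
SORT_OPTS=""
if sort --parallel=2 /dev/null >/dev/null 2>&1; then
  SORT_OPTS="--parallel=2"
fi

# Use the detected options; works either way.
printf 'banana\napple\ncherry\n' | sort $SORT_OPTS
# prints apple, banana, cherry
```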

> tdbloader w/ 2G heap
> 4:15 Data phase
> 4:30 Index phase
>
> tdbloader2 w/ 2G heap
> 1:30 Data phase
> 6:30 Index phase
>
> So in sum tdbloader2 shows a slight advantage in my current configuration.

This is what I've been finding when the load is a bit more than the machine has 
RAM for its current working set.

The sorting in the index phase of tdbloader2 sometimes, on some systems, takes 
much longer than I think it should.  I even had an example last week where 
giving sort(1) less space sped it up, which I find bizarre.
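
For anyone experimenting with that: GNU sort's -S flag caps the in-memory 
buffer before it spills to temporary files, so the effect can be tried 
directly (a sketch with toy data; whether a smaller or larger buffer wins is, 
as noted, system-dependent):

```shell
# -S limits sort's main-memory buffer (here 1 MB); beyond that, sort
# spills sorted runs to temp files and merges them.
printf '3\n1\n2\n' | sort -n -S 1M
# prints 1, 2, 3
```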

> The reduction of heap space did indeed bring an improvement:
>
> tdbloader w/ 10G heap
> 4:30 Data phase
> 5:45 Index phase
>
> Could I expect a larger improvement by adding more memory (for example 
> upgrading from 11 to 32 GB)? Is there any experience with estimating an 
> optimal memory size for TDB loading?

Most likely.  The loading rate is susceptible to system issues, as we've all 
been finding, and at the moment I don't have a good model for what the 
factors are.  But for a given setup, more RAM is good, especially for the 
all-Java tdbloader.
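
Concretely, the heap advice from the earlier mail comes down to something 
like this (a sketch only; the TDB directory and data file paths are 
placeholders):

```shell
# Sketch: modest 2G heap and no -Xms, so most RAM stays with the OS page
# cache that backs TDB's memory-mapped files. Paths are placeholders.
export JAVA_OPTS="-d64 -Xmx2g"
tdbloader --loc=/path/to/tdbdir /path/to/data.ttl.gz
```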

        Andy

>
> Cheers, Joachim
>
> -----Original Message-----
> From: Andy Seaborne [mailto:a...@apache.org]
> Sent: Monday, 28 October 2013 16:58
> To: users@jena.apache.org
> Subject: Re: Declining TDB load performance with larger files
>
> Hi Joachim,
>
> What is happening is that the system is running out of working space and the 
> disk is being used for real.
>
>   > JAVA_OPTS: -d64 -Xms6g -Xmx10g
>
> Don't set -Xmx10g.  Try a 2G heap.  Don't bother with -Xms.
>
> More heap does not help - in fact, it can make it worse.  TDB uses 
> memory-mapped files - these are not in Java heap space.  The operating 
> system manages how much real RAM is devoted to the virtual address space 
> for the file.  As your JVM grows, it reduces the space for file caching.
>
>
> There is another effect.  The OS is managing memory but sometimes it
> gets its policy wrong.   Oddly, the faster the initial part of the load,
> the lower the speed drops to when workspace RAM runs out.  My guess is that 
> the OS guesses some access pattern and the code then breaks that 
> assumption.  It can even differ from run to run on the same machine.
>
> There is also tdbloader2 - it may be faster, it may not.  It is vulnerable to 
> OS behaviour in different ways.
>
> As it is so system-specific, try each and see what happens, after fixing 
> the heap issue.
>
>       Andy
>
>
> On 28/10/13 12:01, Neubert Joachim wrote:
>> I'm loading a 111 million triples file (GND German Authority files).
>> For the first roughly 70 million triples, it's really fast (more than
>> 60,000 triples/s average), but then throughput declines continuously to 
>> a thousand or just a few hundred triples per second (which brings the 
>> average down to less than 7,000). During the last part of the triples 
>> data phase, Java is down to 1-2% CPU usage, while disk usage goes up 
>> to 100%.
>>
>> As TDB writes to disk, I'd expect rather linear loading times. The 
>> CentOS 6 64-bit machine (11.5 GB memory) runs on a VMware vSphere 
>> cluster with underlying SAN hardware. As I observed the same 
>> behaviour at different times of day, with certainly different load 
>> situations, there is no indication that it depended on parallel 
>> activity on the cluster.
>>
>> Perhaps there is something wrong in my config, but I could not figure 
>> out what it might be. I've added an extract of the log below - it would 
>> be great if somebody could help me with hints.
>>
>> Cheers, Joachim
>>
>
>   > ---------------
>   >
>   > 2013-10-25 13:33:33 start run
>   >
>   > Configuration:
>   > java version "1.6.0_24"
>   > Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
>   > Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)
>   > JAVA_OPTS: -d64 -Xms6g -Xmx10g
>   > Jena:       VERSION: 2.11.0
>   > Jena:       BUILD_DATE: 2013-09-12T10:49:49+0100
>   > ARQ:        VERSION: 2.11.0
>   > ARQ:        BUILD_DATE: 2013-09-12T10:49:49+0100
>   > RIOT:       VERSION: 2.11.0
>   > RIOT:       BUILD_DATE: 2013-09-12T10:49:49+0100
>   > TDB:        VERSION: 1.0.0
>   > TDB:        BUILD_DATE: 2013-09-12T10:49:49+0100
>   >
>   > Use fuseki tdb.tdbloader on file /opt/thes/var/gnd/latest/src/GND.ttl.gz
>   > INFO  -- Start triples data phase
>   > INFO  ** Load empty triples table
>   > INFO  Load: /opt/thes/var/gnd/latest/src/GND.ttl.gz -- 2013/10/25 13:33:35 MESZ
>   > INFO  Add: 10.000.000 triples (Batch: 64.766 / Avg: 59.984)
>   > INFO    Elapsed: 166,71 seconds [2013/10/25 13:36:21 MESZ]
>   > INFO  Add: 20.000.000 triples (Batch: 71.839 / Avg: 58.653)
>   > INFO    Elapsed: 340,99 seconds [2013/10/25 13:39:16 MESZ]
>   > INFO  Add: 30.000.000 triples (Batch: 67.750 / Avg: 60.271)
>   > INFO    Elapsed: 497,75 seconds [2013/10/25 13:41:52 MESZ]
>   > INFO  Add: 40.000.000 triples (Batch: 68.212 / Avg: 60.422)
>   > INFO    Elapsed: 662,01 seconds [2013/10/25 13:44:37 MESZ]
>   > INFO  Add: 50.000.000 triples (Batch: 54.171 / Avg: 60.645)
>   > INFO    Elapsed: 824,47 seconds [2013/10/25 13:47:19 MESZ]
>   > INFO  Add: 60.000.000 triples (Batch: 58.823 / Avg: 60.569)
>   > INFO    Elapsed: 990,60 seconds [2013/10/25 13:50:05 MESZ]
>   > INFO  Add: 70.000.000 triples (Batch: 45.495 / Avg: 60.468)
>   > INFO    Elapsed: 1.157,63 seconds [2013/10/25 13:52:52 MESZ]
>   > INFO  Add: 80.000.000 triples (Batch: 50.050 / Avg: 57.998)
>   > INFO    Elapsed: 1.379,36 seconds [2013/10/25 13:56:34 MESZ]
>   > INFO  Add: 90.000.000 triples (Batch: 13.954 / Avg: 52.447)
>   > INFO    Elapsed: 1.716,02 seconds [2013/10/25 14:02:11 MESZ]
>   > INFO  Add: 100.000.000 triples (Batch: 1.134 / Avg: 19.024)
>   > INFO    Elapsed: 5.256,29 seconds [2013/10/25 15:01:11 MESZ]
>   > INFO  Add: 110.000.000 triples (Batch: 944 / Avg: 7.643)
>   > INFO    Elapsed: 15.942,27 seconds [2013/10/25 17:59:17 MESZ]
>   > INFO  -- Finish triples data phase
>   > INFO  111.813.447 triples loaded in 16.288,16 seconds [Rate: 6.864,71 per second]
>   >
>   > The indexing phase also takes its time, and shows some decrease in 
> performance too, but no sharp drop.
>   >
>   > INFO  -- Start triples index phase
>   > INFO    Elapsed: 20.563,36 seconds [2013/10/25 19:16:18 MESZ]
>   > INFO  ** Index SPO->POS: 111.786.233 slots indexed in 4.371,67 seconds [Rate: 25.570,57 per second]
>   > ...
>   > INFO  -- Finish triples index phase
>   > INFO  ** 111.786.233 triples indexed in 19.973,81 seconds [Rate: 5.596,64 per second]
>   > INFO  -- Finish triples load
>   > INFO  ** Completed: 111.813.447 triples loaded in 36.261,98 seconds [Rate: 3.083,49 per second]
>   >
>   >
>
