On 01/11/13 12:42, Neubert Joachim wrote:
I did a comparison of tdbloader vs. tdbloader2. The results are not reliable
(machine-dependent, and perhaps even influenced by different background load on
the VM cluster), but even so they may be interesting to others:
Thanks - I'm interested - and if anyone else has examples, that would be
useful information as well.
Which OS is this? And specifically, does sort(1) have the --parallel
option? (Ubuntu does; CentOS does not at the moment.)
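A quick way to check, assuming GNU coreutils (--parallel appeared around
coreutils 8.6, if I remember rightly):

    sort --version
    sort --help | grep -- '--parallel'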
tdbloader w/ 2G heap
4:15 Data phase
4:30 Index phase
tdbloader2 w/ 2G heap
1:30 Data phase
6:30 Index phase
So in sum tdbloader2 shows a slight advantage in my current configuration.
This is what I've been finding when the load is a bit more than the
machine has RAM for its current working set.
The sorting in the index phase of tdbloader2 sometimes, on some systems,
takes much longer than I think it should. I even had an example last
week where giving sort(1) less space sped it up, which I find bizarre.
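(The space in question is sort's in-memory buffer - with GNU sort that is the
-S / --buffer-size option, e.g.

    # hypothetical standalone example: 1G buffer, 4 threads
    sort -S 1G --parallel=4 -o sorted.txt input.txt

- whether tdbloader2 gives you a handle on that will depend on the version of
the script.)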
Reducing the heap space did indeed bring an improvement:
tdbloader w/ 10G heap
4:30 Data phase
5:45 Index phase
Could I expect a larger improvement by adding more memory (for example
upgrading from 11 to 32 GB)? Is there any experience to draw on for
estimating an optimal memory size for TDB loading?
Most likely. The loading rate is susceptible to system issues, as we've
all been finding, and at the moment I don't have a good model for what
the factors are. But for a given setup, more RAM is good, especially for
the all-Java tdbloader.
Andy
Cheers, Joachim
-----Original Message-----
From: Andy Seaborne [mailto:[email protected]]
Sent: Monday, 28 October 2013 16:58
To: [email protected]
Subject: Re: Declining TDB load performance with larger files
Hi Joachim,
What is happening is that the system is running out of working space and the disk
is being used for real.
> JAVA_OPTS: -d64 -Xms6g -Xmx10g
Don't set -Xmx10g. Try a 2G heap. Don't bother with -Xms.
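With the standard apache-jena scripts the heap is normally set through
JVM_ARGS (your wrapper seems to use JAVA_OPTS, so adjust to whatever it
actually reads), roughly:

    # 2G heap for the loader; drop -Xms and the 10G -Xmx
    export JVM_ARGS="-Xmx2g"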
More heap does not help - in fact, it can make it worse. TDB uses
memory-mapped files - these are not in Java heap space. The operating system
manages how much real RAM is devoted to the virtual address space for the
files. As your JVM heap grows, it reduces the space available for file caching.
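You can watch this on Linux while a load is running - the mapped files show up
as OS page cache, not as part of the Java process. Roughly (assuming the
loader is the only java process on the box):

    free -m                          # 'cached' is largely the memory-mapped TDB files
    ps -C java -o pid,rss,vsz,cmd    # the JVM's own footprint is separate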
There is another effect. The OS is managing memory, but sometimes it
gets its policy wrong. Oddly, the faster the initial part of the load,
the lower the rate it drops to when workspace RAM runs out. My guess is
that the OS guesses some access pattern and the code then breaks that
assumption. It can even differ from run to run on the same machine.
There is also tdbloader2 - it may be faster, it may not. It is vulnerable to
OS effects in different ways.
As it is so system-specific, try each and see what happens, after fixing
the heap issue.
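That is, with the heap fixed, something like this (the database directories
are only placeholders for wherever you keep the TDB data):

    tdbloader  --loc /opt/thes/var/tdb  /opt/thes/var/gnd/latest/src/GND.ttl.gz
    tdbloader2 --loc /opt/thes/var/tdb2 /opt/thes/var/gnd/latest/src/GND.ttl.gz

each into a fresh, empty location, and compare the wall-clock times.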
Andy
On 28/10/13 12:01, Neubert Joachim wrote:
I'm loading a 111 million triples file (GND German Authority files).
For roughly the first 70 million triples, it's really fast (more than
60,000 triples/s on average), but then throughput declines continuously to a
thousand or even just a few hundred triples per second (which brings the
average down to less than 7,000). During the last part of the triples data
phase, Java is down to 1-2% CPU usage, while disk usage goes up to 100%.
As TDB writes to disk, I'd expect fairly linear loading times. The
CentOS 6 64-bit machine (11.5 GB memory) runs on a VMware vSphere
cluster with underlying SAN storage. As I observed the same
behavior at different times of day, with certainly different load
situations, there is no indication that it depended on parallel
activity on the cluster.
Perhaps there is something wrong in my config, but I could not figure
out what it might be. I've added an extract of the log below - it would be
great if somebody could help me with hints.
Cheers, Joachim
> ---------------
>
> 2013-10-25 13:33:33 start run
>
> Configuration:
> java version "1.6.0_24"
> Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
> Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)
> JAVA_OPTS: -d64 -Xms6g -Xmx10g
> Jena: VERSION: 2.11.0
> Jena: BUILD_DATE: 2013-09-12T10:49:49+0100
> ARQ: VERSION: 2.11.0
> ARQ: BUILD_DATE: 2013-09-12T10:49:49+0100
> RIOT: VERSION: 2.11.0
> RIOT: BUILD_DATE: 2013-09-12T10:49:49+0100
> TDB: VERSION: 1.0.0
> TDB: BUILD_DATE: 2013-09-12T10:49:49+0100
>
> Use fuseki tdb.tdbloader on file /opt/thes/var/gnd/latest/src/GND.ttl.gz
> INFO -- Start triples data phase
> INFO ** Load empty triples table
> INFO Load: /opt/thes/var/gnd/latest/src/GND.ttl.gz -- 2013/10/25 13:33:35 MESZ
> INFO Add: 10.000.000 triples (Batch: 64.766 / Avg: 59.984)
> INFO Elapsed: 166,71 seconds [2013/10/25 13:36:21 MESZ]
> INFO Add: 20.000.000 triples (Batch: 71.839 / Avg: 58.653)
> INFO Elapsed: 340,99 seconds [2013/10/25 13:39:16 MESZ]
> INFO Add: 30.000.000 triples (Batch: 67.750 / Avg: 60.271)
> INFO Elapsed: 497,75 seconds [2013/10/25 13:41:52 MESZ]
> INFO Add: 40.000.000 triples (Batch: 68.212 / Avg: 60.422)
> INFO Elapsed: 662,01 seconds [2013/10/25 13:44:37 MESZ]
> INFO Add: 50.000.000 triples (Batch: 54.171 / Avg: 60.645)
> INFO Elapsed: 824,47 seconds [2013/10/25 13:47:19 MESZ]
> INFO Add: 60.000.000 triples (Batch: 58.823 / Avg: 60.569)
> INFO Elapsed: 990,60 seconds [2013/10/25 13:50:05 MESZ]
> INFO Add: 70.000.000 triples (Batch: 45.495 / Avg: 60.468)
> INFO Elapsed: 1.157,63 seconds [2013/10/25 13:52:52 MESZ]
> INFO Add: 80.000.000 triples (Batch: 50.050 / Avg: 57.998)
> INFO Elapsed: 1.379,36 seconds [2013/10/25 13:56:34 MESZ]
> INFO Add: 90.000.000 triples (Batch: 13.954 / Avg: 52.447)
> INFO Elapsed: 1.716,02 seconds [2013/10/25 14:02:11 MESZ]
> INFO Add: 100.000.000 triples (Batch: 1.134 / Avg: 19.024)
> INFO Elapsed: 5.256,29 seconds [2013/10/25 15:01:11 MESZ]
> INFO Add: 110.000.000 triples (Batch: 944 / Avg: 7.643)
> INFO Elapsed: 15.942,27 seconds [2013/10/25 17:59:17 MESZ]
> INFO -- Finish triples data phase
> INFO 111.813.447 triples loaded in 16.288,16 seconds [Rate: 6.864,71 per second]
>
> Indexing phase also takes its time, and has some decrease in performance too,
> but does not show a sharp drop.
>
> INFO -- Start triples index phase
> INFO Elapsed: 20.563,36 seconds [2013/10/25 19:16:18 MESZ]
> INFO ** Index SPO->POS: 111.786.233 slots indexed in 4.371,67 seconds [Rate: 25.570,57 per second]
> ...
> INFO -- Finish triples index phase
> INFO ** 111.786.233 triples indexed in 19.973,81 seconds [Rate: 5.596,64 per second]
> INFO -- Finish triples load
> INFO ** Completed: 111.813.447 triples loaded in 36.261,98 seconds [Rate: 3.083,49 per second]
>
>