> Threads will not help a single load except for tdbloader2 (which is for TDB1) 
> if tuned - see the command help and notes.  It uses sort(1) which can utilize 
> multiple threads.

This was worth tuning for me. sort generally picks good parameters for a 
system, but I got noticeably better performance by manually increasing the 
parallelism. Of course, that only buys a limited amount of improvement. 
(It's also worth making sure your locale is set appropriately: avoiding 
Unicode collation speeds things up impressively.)
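
A minimal sketch of the two knobs mentioned above, assuming GNU coreutils 
sort. The sample data, the thread count of 4 and the 64M buffer are 
placeholders for illustration, not values measured in this thread. How 
options reach sort when it is run by tdbloader2 is described in the 
command help and is not shown here.

```shell
# Assumption: GNU coreutils sort (--parallel and --buffer-size are GNU options).
workdir=$(mktemp -d)

# Hypothetical stand-in for the real data: mixed-case single-letter keys.
printf 'b\nB\na\nA\n' > "$workdir/sample.txt"

# LC_ALL=C makes sort compare raw bytes instead of applying locale collation.
# --parallel caps the number of sort threads; --buffer-size sets the in-memory
# buffer. Both values here are illustrative only.
LC_ALL=C sort --parallel=4 --buffer-size=64M \
    -o "$workdir/sorted.txt" "$workdir/sample.txt"

cat "$workdir/sorted.txt"
```

In byte order the uppercase letters sort before the lowercase ones, which 
is a quick way to confirm that locale collation is not in effect.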

ajs6f

> On Dec 2, 2017, at 4:16 PM, Andy Seaborne <[email protected]> wrote:
> 
> 
> 
> On 01/12/17 22:28, Laura Morales wrote:
>> Thank you very much, this is great feedback!
>> Your setup was very similar to mine, except:
>> - I have 8GB RAM single bank, you have 16GB probably on two banks
>> - my CPU is "half" of yours, 2 cores 4 threads
>>  despite this, the results are very similar; maybe yours are slightly 
>> better. I don't understand why this "60K" seems so hard to beat. What's so 
>> special about it?? It's so difficult to understand what to do to improve the 
>> conversion speed... do I buy more ram? Faster ram? A faster CPU? More cores? 
>> Or a CPU with more cache? Or more memory channels? I still can't find an 
>> answer... Why would more cores help if tdb2.tdbloader 
> 
> As already said - tdb2.tdbloader in its current form is not suitable for 
> loading billion triple datasets (unless there is a lot of RAM ... I'd guess 
> upward of 256G for truthy and a tuned server (vm.swappiness=0 for example), not that 
> I've tried).
> 
>> runs in a single thread? Maybe the reason is that with more cores, your xeon 
>> can handle more RAM concurrently? I don't understand...
>> With your xeon, you said you were able to get to 120K? Right?
> 
> "concurrent 120K"
> 
> I understood that to mean more than one load running at once.  Dick's system 
> has multiple TDB databases and a large disk cache.
> 
> (I got 76K, single load, on somewhat less hardware so that suggests 120K may 
> be affected by I/O contention.)
> 
>> What xeon, mobo, and RAM did you use?
>> If anybody has any xeon or opteron, it would be nice if they could offer 
>> more feedback too. Even with slower RAM such as DDR3-1333. I certainly can't 
>> wait to read your feedback with the Threadripper :)
> 
> Threads will not help a single load except for tdbloader2 (which is for TDB1) 
> if tuned - see the command help and notes.  It uses sort(1) which can utilize 
> multiple threads.
> 
>    Andy
> 
>> keep us posted!
>> Sent: Friday, December 01, 2017 at 9:11 PM
>> From: "Dick Murray" <[email protected]>
>> To: [email protected]
>> Subject: Re: tdb2.tdbloader performance
>> Hi.
>> Sorry for the delay :-)
>> Short story I used the following "reasonable" device
>> Dell M3800
>> Fedora 27
>> 16GB SODIMM DDR3 Synchronous 1600 MHz
>> CPU cache L1/256KB,L2/1MB,L3/6MB
>> Intel(R) Core(TM) i7-4702HQ CPU @ 2.20GHz 4 cores 8 threads
>> to load part of the latest-truthy.nt from a USB3.0 1TB drive to a 6GB RAM
>> disk and;
>> @800% 60K/Sec
>> @100% 40K/Sec
>> @50% 20K/Sec
>> The full source file contains 2.2G of triples in 10GB bz2 which
>> decompresses to 250GB nt, which I split into 10M triple chunks and used the
>> first one to test.
>> Check with Andy but I think it's limited by CPU, which is why my 24 core (4
>> x Xeon 6 Core @2.5GHz) 128GB server is able to run concurrent loads with no
>> performance hit.
>> I might have access to an AMD ThreadRipper 12 core 24 thread 5GHz in the
>> next few days and I will try and test against it.
>> I haven't run the full import because a: I'm guessing the resulting TDB2
>> will be "large" b: my servers are currently importing other "large"
>> TDB2s!!!
>> Long story follows...
>> decompress the file;
>> pbzip2 -dv -p4 -m1024 latest-truthy.nt.bz2
>> Parallel BZIP2 v1.1.12 [Dec 21, 2014]
>> By: Jeff Gilchrist [http://compression.ca]
>> Major contributions: Yavor Nikolov 
>> [http://javornikolov.wordpress.com]
>> Uses libbzip2 by Julian Seward
>> # CPUs: 4
>> Maximum Memory: 1024 MB
>> Ignore Trailing Garbage: off
>> -------------------------------------------
>> File #: 1 of 1
>> Input Name: latest-truthy.nt.bz2
>> Output Name: latest-truthy.nt
>> BWT Block Size: 900k
>> Input Size: 9965955258 bytes
>> Decompressing data...
>> Output Size: 277563574685 bytes
>> -------------------------------------------
>> Wall Clock: 5871.550948 seconds
>> count the lines;
>> wc -l latest-truthy.nt
>> 2199382887 latest-truthy.nt
>> Just short of 2200M...
>> split the file into 10M chunks;
>> split -d -l 10485760 -a 3 --verbose latest-truthy.nt latest-truthy.nt.
>> creating file 'latest-truthy.nt.000'
>> creating file 'latest-truthy.nt.001'
>> creating file 'latest-truthy.nt.002'
>> creating file 'latest-truthy.nt.003'
>> creating file 'latest-truthy.nt.004'
>> creating file 'latest-truthy.nt.005'
>> ...
>> Restart!
>> sudo cpulimit -v -l 100 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v
>> --loc /media/ramdisk/ latest-truthy.000.nt
>> ps aux | grep tdb2
>> root 3358 0.0 0.0 222844 5756 pts/0 S+ 19:22 0:00 sudo
>> cpulimit -v -l 100 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc
>> /media/ramdisk/ latest-truthy.000.nt
>> root 3359 0.0 0.0 4500 776 pts/0 S+ 19:22 0:00 cpulimit
>> -v -l 100 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc
>> /media/ramdisk/ latest-truthy.000.nt
>> root 3360 0.0 0.0 120304 3288 pts/0 S+ 19:22 0:00 sh
>> ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc /media/ramdisk/
>> latest-truthy.000.nt
>> root 3361 4.9 0.0 4500 92 pts/0 S<+ 19:22 0:05 cpulimit
>> -v -l 100 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc
>> /media/ramdisk/ latest-truthy.000.nt
>> root 3366 95.7 14.8 7866116 2418768 pts/0 Sl+ 19:22 1:42 java
>> -Dlog4j.configuration=file:/run/media/dick/KVM/jena/apache-jena-3.5.0/jena-log4j.properties
>> -cp /run/media/dick/KVM/jena/apache-jena-3.5.0/lib/* tdb2.tdbloader -v
>> --loc /media/ramdisk/ latest-truthy.000.nt
>> dick 3477 0.0 0.0 119728 972 pts/1 S+ 19:24 0:00 grep
>> --color=auto tdb2
>> Notice PID 3366 is -Xmx2G default.
>> 19:26:49 INFO TDB2 :: Finished: 10,485,760
>> latest-truthy.000.nt 247.28s (Avg: 42,404)
>> After the first pass there is no read from the 1TB source as the OS has
>> cached the 1.2G source.
>> 19:33:50 INFO TDB2 :: Finished: 10,485,760
>> latest-truthy.000.nt 245.70s (Avg: 42,677)
>> export JVM_ARGS="-Xmx4G" i.e. increase the max heap and help the GC
>> sudo ps aux | grep tdb2
>> root 4317 0.0 0.0 222848 6236 pts/0 S+ 19:35 0:00 sudo
>> cpulimit -v -l 100 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc
>> /media/ramdisk/ latest-truthy.000.nt
>> root 4321 0.0 0.0 4500 924 pts/0 S+ 19:35 0:00 cpulimit
>> -v -l 100 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc
>> /media/ramdisk/ latest-truthy.000.nt
>> root 4322 0.0 0.0 120304 3356 pts/0 S+ 19:35 0:00 sh
>> ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc /media/ramdisk/
>> latest-truthy.000.nt
>> root 4323 4.8 0.0 4500 88 pts/0 S<+ 19:35 0:09 cpulimit
>> -v -l 100 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc
>> /media/ramdisk/ latest-truthy.000.nt
>> root 4328 94.8 18.5 8406788 3036188 pts/0 Sl+ 19:35 3:01 java
>> -Dlog4j.configuration=file:/run/media/dick/KVM/jena/apache-jena-3.5.0/jena-log4j.properties
>> -cp /run/media/dick/KVM/jena/apache-jena-3.5.0/lib/* tdb2.tdbloader -v
>> --loc /media/ramdisk/ latest-truthy.000.nt
>> dick 4594 0.0 0.0 119728 1024 pts/1 S+ 19:38 0:00 grep
>> --color=auto tdb2
>> At 800K PID was 3GB and peaked at 3.4GB just prior to completion.
>> 19:39:23 INFO TDB2 :: Finished: 10,485,760
>> latest-truthy.000.nt 247.65s (Avg: 42,340)
>> Throw all CPU resources at it i.e. 800
>> sudo cpulimit -v -l 800 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v
>> --loc /media/ramdisk/ latest-truthy.000.nt
>> Average was at +45K by 350K and +60K by 1.2M
>> 19:43:38 INFO TDB2 :: Finished: 10,485,760
>> latest-truthy.000.nt 166.91s (Avg: 62,823)
>> sudo ps aux | grep tdb2
>> root 4740 0.0 0.0 222848 6264 pts/0 S+ 19:40 0:00 sudo
>> cpulimit -v -l 800 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc
>> /media/ramdisk/ latest-truthy.000.nt
>> root 4744 0.0 0.0 4500 720 pts/0 S+ 19:40 0:00 cpulimit
>> -v -l 800 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc
>> /media/ramdisk/ latest-truthy.000.nt
>> root 4745 0.0 0.0 120304 3208 pts/0 S+ 19:40 0:00 sh
>> ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc /media/ramdisk/
>> latest-truthy.000.nt
>> root 4746 4.7 0.0 4500 92 pts/0 R<+ 19:40 0:07 cpulimit
>> -v -l 800 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc
>> /media/ramdisk/ latest-truthy.000.nt
>> root 4751 131 21.1 8693508 3448252 pts/0 Sl+ 19:40 3:32 java
>> -Dlog4j.configuration=file:/run/media/dick/KVM/jena/apache-jena-3.5.0/jena-log4j.properties
>> -cp /run/media/dick/KVM/jena/apache-jena-3.5.0/lib/* tdb2.tdbloader -v
>> --loc /media/ramdisk/ latest-truthy.000.nt
>> dick 4808 0.0 0.0 119728 1060 pts/1 S+ 19:43 0:00 grep
>> --color=auto tdb2
>> Heap peaked at 3.4GB
>> sudo cpulimit -v -l 50 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v
>> --loc /media/ramdisk/ latest-truthy.000.nt
>> sudo ps aux | grep tdb2
>> root 4898 0.0 0.0 222844 5672 pts/0 S+ 19:45 0:00 sudo
>> cpulimit -v -l 50 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc
>> /media/ramdisk/ latest-truthy.000.nt
>> root 4899 0.0 0.0 4500 724 pts/0 S+ 19:45 0:00 cpulimit
>> -v -l 50 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc
>> /media/ramdisk/ latest-truthy.000.nt
>> root 4900 0.0 0.0 120304 3244 pts/0 T+ 19:45 0:00 sh
>> ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc /media/ramdisk/
>> latest-truthy.000.nt
>> root 4901 5.5 0.0 4500 92 pts/0 S<+ 19:45 0:25 cpulimit
>> -v -l 50 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc
>> /media/ramdisk/ latest-truthy.000.nt
>> root 4906 50.5 20.7 8685316 3395236 pts/0 Tl+ 19:45 3:55 java
>> -Dlog4j.configuration=file:/run/media/dick/KVM/jena/apache-jena-3.5.0/jena-log4j.properties
>> -cp /run/media/dick/KVM/jena/apache-jena-3.5.0/lib/* tdb2.tdbloader -v
>> --loc /media/ramdisk/ latest-truthy.000.nt
>> dick 4983 0.0 0.0 119728 1072 pts/1 S+ 19:53 0:00 grep
>> --color=auto tdb2
>> 19:53:38 INFO TDB2 :: Finished: 10,485,760
>> latest-truthy.000.nt 482.27s (Avg: 21,742)
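
For reference, the averages in the quoted log lines are just the triple 
count divided by the elapsed time. The sketch below reproduces that 
arithmetic and extends it to the full 2,199,382,887-line file. The helper 
names `rate` and `naive_hours` are made up for this illustration, and the 
extrapolation assumes the rate stays constant over the whole load, which 
the thread suggests it would not, so treat the result as an optimistic 
lower bound at best.

```shell
# rate TRIPLES SECONDS: average triples per second, rounded to an integer.
rate() {
  awk -v n="$1" -v t="$2" 'BEGIN { printf "%.0f\n", n / t }'
}

# naive_hours TOTAL_TRIPLES RATE: hours for a full load IF the rate never
# dropped (an assumption, not something measured in this thread).
naive_hours() {
  awk -v total="$1" -v r="$2" 'BEGIN { printf "%.1f\n", total / r / 3600 }'
}

# Figures copied from the quoted run limited to 100% CPU.
chunk_triples=10485760
chunk_seconds=247.28
total_triples=2199382887

r=$(rate "$chunk_triples" "$chunk_seconds")
echo "average rate: $r triples/s"
echo "naive full load: $(naive_hours "$total_triples" "$r") hours"
```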
