Hi. Sorry for the delay :-)
Short story

I used the following "reasonable" device:

Dell M3800
Fedora 27
16GB SODIMM DDR3 Synchronous 1600 MHz
CPU cache L1/256KB, L2/1MB, L3/6MB
Intel(R) Core(TM) i7-4702HQ CPU @ 2.20GHz, 4 cores, 8 threads

to load part of latest-truthy.nt from a USB3.0 1TB drive to a 6GB RAM disk, and got the following at each cpulimit setting:

@800% 60K/Sec
@100% 40K/Sec
@50% 20K/Sec

The full source file contains 2.2G triples in a 10GB bz2, which decompresses to 250GB of nt. I split it into 10M triple chunks and used the first one to test.

Check with Andy, but I think it's limited by CPU, which is why my 24 core (4 x Xeon 6 Core @2.5GHz) 128GB server is able to run concurrent loads with no performance hit.

I might have access to an AMD ThreadRipper 12 core 24 thread 5GHz in the next few days and I will try to test against it.

I haven't run the full import because
a: I'm guessing the resulting TDB2 will be "large"
b: my servers are currently importing other "large" TDB2s!!!

Long story follows...

decompress the file;

pbzip2 -dv -p4 -m1024 latest-truthy.nt.bz2

Parallel BZIP2 v1.1.12 [Dec 21, 2014]
By: Jeff Gilchrist [http://compression.ca]
Major contributions: Yavor Nikolov [http://javornikolov.wordpress.com]
Uses libbzip2 by Julian Seward

         # CPUs: 4
 Maximum Memory: 1024 MB
 Ignore Trailing Garbage: off
-------------------------------------------
         File #: 1 of 1
     Input Name: latest-truthy.nt.bz2
    Output Name: latest-truthy.nt

 BWT Block Size: 900k
     Input Size: 9965955258 bytes
Decompressing data...
    Output Size: 277563574685 bytes
-------------------------------------------

     Wall Clock: 5871.550948 seconds

count the lines;

wc -l latest-truthy.nt
2199382887 latest-truthy.nt

Just short of 2200M...

split the file into 10M chunks;

split -d -l 10485760 -a 3 --verbose latest-truthy.nt latest-truthy.nt.
creating file 'latest-truthy.nt.000'
creating file 'latest-truthy.nt.001'
creating file 'latest-truthy.nt.002'
creating file 'latest-truthy.nt.003'
creating file 'latest-truthy.nt.004'
creating file 'latest-truthy.nt.005'
...

Restart!
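For what it's worth, the chunk count and a back-of-envelope full-load time fall out of the figures above. This is only a sketch: the hours_at helper is made up for illustration, and it assumes the rate measured on a single 10M chunk stays flat across the whole file, which is an assumption and not something measured here.

```shell
#!/bin/sh
# Figures copied from the wc and split commands above.
total_lines=2199382887      # wc -l latest-truthy.nt
chunk_lines=10485760        # split -l value (10 * 1024 * 1024)

# Number of chunks split produces (ceiling division).
chunks=$(( (total_lines + chunk_lines - 1) / chunk_lines ))
# Lines left over for the final, short chunk.
last_chunk=$(( total_lines - (chunks - 1) * chunk_lines ))

# Hours to load every triple at a constant rate (triples/sec).
# ASSUMPTION: the single-chunk rate holds for the full file.
# hours_at is a hypothetical helper, not part of Jena.
hours_at() {
    awk -v n="$total_lines" -v r="$1" 'BEGIN { printf "%.1f\n", n / r / 3600 }'
}

echo "chunks: $chunks (last one has $last_chunk lines)"
for rate in 60000 40000 20000; do
    echo "at $rate/sec: $(hours_at "$rate") hours"
done
```

The three rates in the loop are the rounded 800%, 100% and 50% figures from the short story, so swap in whatever your own hardware manages.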
sudo cpulimit -v -l 100 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc /media/ramdisk/ latest-truthy.000.nt

ps aux | grep tdb2
root 3358 0.0 0.0 222844 5756 pts/0 S+ 19:22 0:00 sudo cpulimit -v -l 100 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc /media/ramdisk/ latest-truthy.000.nt
root 3359 0.0 0.0 4500 776 pts/0 S+ 19:22 0:00 cpulimit -v -l 100 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc /media/ramdisk/ latest-truthy.000.nt
root 3360 0.0 0.0 120304 3288 pts/0 S+ 19:22 0:00 sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc /media/ramdisk/ latest-truthy.000.nt
root 3361 4.9 0.0 4500 92 pts/0 S<+ 19:22 0:05 cpulimit -v -l 100 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc /media/ramdisk/ latest-truthy.000.nt
root 3366 95.7 14.8 7866116 2418768 pts/0 Sl+ 19:22 1:42 java -Dlog4j.configuration=file:/run/media/dick/KVM/jena/apache-jena-3.5.0/jena-log4j.properties -cp /run/media/dick/KVM/jena/apache-jena-3.5.0/lib/* tdb2.tdbloader -v --loc /media/ramdisk/ latest-truthy.000.nt
dick 3477 0.0 0.0 119728 972 pts/1 S+ 19:24 0:00 grep --color=auto tdb2

Notice PID 3366 is -Xmx2G default.

19:26:49 INFO TDB2 :: Finished: 10,485,760 latest-truthy.000.nt 247.28s (Avg: 42,404)

After the first pass there is no read from the 1TB source as the OS has cached the 1.2G source.

19:33:50 INFO TDB2 :: Finished: 10,485,760 latest-truthy.000.nt 245.70s (Avg: 42,677)

export JVM_ARGS="-Xmx4G"

i.e. increase the max heap and help the GC

sudo ps aux | grep tdb2
root 4317 0.0 0.0 222848 6236 pts/0 S+ 19:35 0:00 sudo cpulimit -v -l 100 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc /media/ramdisk/ latest-truthy.000.nt
root 4321 0.0 0.0 4500 924 pts/0 S+ 19:35 0:00 cpulimit -v -l 100 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc /media/ramdisk/ latest-truthy.000.nt
root 4322 0.0 0.0 120304 3356 pts/0 S+ 19:35 0:00 sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc /media/ramdisk/ latest-truthy.000.nt
root 4323 4.8 0.0 4500 88 pts/0 S<+ 19:35 0:09 cpulimit -v -l 100 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc /media/ramdisk/ latest-truthy.000.nt
root 4328 94.8 18.5 8406788 3036188 pts/0 Sl+ 19:35 3:01 java -Dlog4j.configuration=file:/run/media/dick/KVM/jena/apache-jena-3.5.0/jena-log4j.properties -cp /run/media/dick/KVM/jena/apache-jena-3.5.0/lib/* tdb2.tdbloader -v --loc /media/ramdisk/ latest-truthy.000.nt
dick 4594 0.0 0.0 119728 1024 pts/1 S+ 19:38 0:00 grep --color=auto tdb2

At 800K PID was 3GB and peaked at 3.4GB just prior to completion.

19:39:23 INFO TDB2 :: Finished: 10,485,760 latest-truthy.000.nt 247.65s (Avg: 42,340)

Throw all CPU resources at it i.e.
800

sudo cpulimit -v -l 800 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc /media/ramdisk/ latest-truthy.000.nt

Average was at +45K by 350K and +60K by 1.2M

19:43:38 INFO TDB2 :: Finished: 10,485,760 latest-truthy.000.nt 166.91s (Avg: 62,823)

sudo ps aux | grep tdb2
root 4740 0.0 0.0 222848 6264 pts/0 S+ 19:40 0:00 sudo cpulimit -v -l 800 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc /media/ramdisk/ latest-truthy.000.nt
root 4744 0.0 0.0 4500 720 pts/0 S+ 19:40 0:00 cpulimit -v -l 800 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc /media/ramdisk/ latest-truthy.000.nt
root 4745 0.0 0.0 120304 3208 pts/0 S+ 19:40 0:00 sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc /media/ramdisk/ latest-truthy.000.nt
root 4746 4.7 0.0 4500 92 pts/0 R<+ 19:40 0:07 cpulimit -v -l 800 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc /media/ramdisk/ latest-truthy.000.nt
root 4751 131 21.1 8693508 3448252 pts/0 Sl+ 19:40 3:32 java -Dlog4j.configuration=file:/run/media/dick/KVM/jena/apache-jena-3.5.0/jena-log4j.properties -cp /run/media/dick/KVM/jena/apache-jena-3.5.0/lib/* tdb2.tdbloader -v --loc /media/ramdisk/ latest-truthy.000.nt
dick 4808 0.0 0.0 119728 1060 pts/1 S+ 19:43 0:00 grep --color=auto tdb2

Heap peaked at 3.4GB

sudo cpulimit -v -l 50 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc /media/ramdisk/ latest-truthy.000.nt

sudo ps aux | grep tdb2
root 4898 0.0 0.0 222844 5672 pts/0 S+ 19:45 0:00 sudo cpulimit -v -l 50 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc /media/ramdisk/ latest-truthy.000.nt
root 4899 0.0 0.0 4500 724 pts/0 S+ 19:45 0:00 cpulimit -v -l 50 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc /media/ramdisk/ latest-truthy.000.nt
root 4900 0.0 0.0 120304 3244 pts/0 T+ 19:45 0:00 sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc /media/ramdisk/ latest-truthy.000.nt
root 4901 5.5 0.0 4500 92 pts/0 S<+ 19:45 0:25 cpulimit -v -l 50 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc /media/ramdisk/ latest-truthy.000.nt
root 4906 50.5 20.7 8685316 3395236 pts/0 Tl+ 19:45 3:55 java -Dlog4j.configuration=file:/run/media/dick/KVM/jena/apache-jena-3.5.0/jena-log4j.properties -cp /run/media/dick/KVM/jena/apache-jena-3.5.0/lib/* tdb2.tdbloader -v --loc /media/ramdisk/ latest-truthy.000.nt
dick 4983 0.0 0.0 119728 1072 pts/1 S+ 19:53 0:00 grep --color=auto tdb2

19:53:38 INFO TDB2 :: Finished: 10,485,760 latest-truthy.000.nt 482.27s (Avg: 21,742)

On 28 November 2017 at 19:08, Laura Morales <laure...@mail.com> wrote:

> > I've had loads take over 24 hours and produce 350GB TDB1 instances...
>
> Yeah 24H is still acceptable, but it's very borderline. Running a
> conversion that takes days becomes frustrating very soon. Of course I'm not
> trying to be mean here, but I think it's good to push the limits because we
> are already at a point where graphs have several billions triples. If my
> computer, which is an average consumer PC at best, can do 60-70K, two
> "average grade" nodes could already outperform your beefy server if only I
> could share the load on multiple PCs.
>
> > Ok with the data, I have that somewhere and will run it through,
> > hopefully tonight if paid work doesn't get in the way ;-)
>
> Thank you very much for trying this and for offering feedback. I'd be
> interested to know
>
> - what components do you have (cpu/ram/disks/...)
> - the AVG number of triples/second
> - the final size of the TDB2 store
>
> Also since you're already running this test, would you mind sharing the
> final TDB2 store instead of deleting it? :) If the output is not too
> large...