Hi.
Sorry for the delay :-)
Short story: I used the following "reasonable" device
Dell M3800
Fedora 27
16GB SODIMM DDR3 Synchronous 1600 MHz
CPU cache L1/256KB,L2/1MB,L3/6MB
Intel(R) Core(TM) i7-4702HQ CPU @ 2.20GHz 4 cores 8 threads
to load part of latest-truthy.nt from a USB3.0 1TB drive onto a 6GB RAM
disk, with these average load rates (triples/sec) per cpulimit setting:
@800% 60K/Sec
@100% 40K/Sec
@50% 20K/Sec
The full source file contains 2.2G triples in a 10GB bz2, which
decompresses to roughly 278GB of nt. I split that into chunks of
10,485,760 triples (10M) and used the first one to test.
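
As a rough guide to what those rates would mean for the full dump, here is a small sketch. It assumes the load rate stays constant over all 2.2G triples, which is optimistic for a store that keeps growing, so treat the numbers as a lower bound rather than a prediction. The function name is just for illustration.

```python
# Rough extrapolation from the first chunk to the full dump.
# Assumption: the rate measured on the first 10M triples holds for the whole
# load. In practice it will likely drop as the indexes grow.
TOTAL_TRIPLES = 2_200_000_000  # "2.2G triples" from the post


def hours_to_load(total_triples: int, triples_per_sec: float) -> float:
    """Hours needed to load total_triples at a constant rate."""
    return total_triples / triples_per_sec / 3600


if __name__ == "__main__":
    for label, rate in [("800%", 60_000), ("100%", 40_000), ("50%", 20_000)]:
        hours = hours_to_load(TOTAL_TRIPLES, rate)
        print(f"cpulimit {label}: ~{hours:.1f} h")
```

Even at the best rate measured here, that is a load of many hours on this laptop before any slowdown is taken into account.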
Check with Andy, but I think it's limited by CPU, which is why my 24 core
(4 x Xeon 6 Core @2.5GHz) 128GB server is able to run concurrent loads with
no performance hit.
I might have access to an AMD ThreadRipper 12 core 24 thread 5GHz in the
next few days and I will try and test against it.
I haven't run the full import because a) I'm guessing the resulting TDB2
will be "large" and b) my servers are currently importing other "large"
TDB2s!
Long story follows...
decompress the file;
pbzip2 -dv -p4 -m1024 latest-truthy.nt.bz2
Parallel BZIP2 v1.1.12 [Dec 21, 2014]
By: Jeff Gilchrist [http://compression.ca]
Major contributions: Yavor Nikolov [http://javornikolov.wordpress.com]
Uses libbzip2 by Julian Seward
# CPUs: 4
Maximum Memory: 1024 MB
Ignore Trailing Garbage: off
-------------------------------------------
File #: 1 of 1
Input Name: latest-truthy.nt.bz2
Output Name: latest-truthy.nt
BWT Block Size: 900k
Input Size: 9965955258 bytes
Decompressing data...
Output Size: 277563574685 bytes
-------------------------------------------
Wall Clock: 5871.550948 seconds
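
For reference, the pbzip2 figures above work out as follows. This is only arithmetic on the numbers printed in the log; the function name is made up for the example.

```python
# Figures copied from the pbzip2 output above.
INPUT_BYTES = 9_965_955_258
OUTPUT_BYTES = 277_563_574_685
WALL_SECONDS = 5871.550948


def decompression_stats(in_bytes: int, out_bytes: int, seconds: float) -> dict:
    """Derive ratio, output size and throughput from the raw pbzip2 numbers."""
    return {
        "ratio": out_bytes / in_bytes,              # expansion factor
        "out_gib": out_bytes / 2**30,               # output size in GiB
        "out_mb_per_s": out_bytes / seconds / 1e6,  # write rate in MB/s
        "minutes": seconds / 60,                    # wall clock in minutes
    }


stats = decompression_stats(INPUT_BYTES, OUTPUT_BYTES, WALL_SECONDS)

if __name__ == "__main__":
    for key, value in stats.items():
        print(f"{key}: {value:.2f}")
```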
count the lines;
wc -l latest-truthy.nt
2199382887 latest-truthy.nt
Just short of 2200M...
split the file into 10M chunks;
split -d -l 10485760 -a 3 --verbose latest-truthy.nt latest-truthy.nt.
creating file 'latest-truthy.nt.000'
creating file 'latest-truthy.nt.001'
creating file 'latest-truthy.nt.002'
creating file 'latest-truthy.nt.003'
creating file 'latest-truthy.nt.004'
creating file 'latest-truthy.nt.005'
...
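
A quick check of how many chunks that split produces, using the line count from wc above. This is a sketch of the arithmetic only; the helper name is hypothetical.

```python
# Line count from "wc -l" and chunk size from the split command above.
TOTAL_LINES = 2_199_382_887
LINES_PER_CHUNK = 10_485_760  # 10 * 2**20


def chunk_plan(total_lines: int, lines_per_chunk: int) -> tuple[int, int]:
    """Return (number of chunks, lines in the last chunk)."""
    chunks = -(-total_lines // lines_per_chunk)  # ceiling division
    last = total_lines - (chunks - 1) * lines_per_chunk
    return chunks, last


chunks, last_chunk_lines = chunk_plan(TOTAL_LINES, LINES_PER_CHUNK)

if __name__ == "__main__":
    # With "split -d -a 3" the numeric suffixes run from 000 upwards.
    print(f"{chunks} chunks, suffixes 000..{chunks - 1:03d}")
    print(f"last chunk holds {last_chunk_lines:,} lines")
```

So the three-digit suffix (-a 3) leaves plenty of room for the number of chunks produced.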
Restart!
sudo cpulimit -v -l 100 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc /media/ramdisk/ latest-truthy.000.nt
ps aux | grep tdb2
root 3358 0.0 0.0 222844 5756 pts/0 S+ 19:22 0:00 sudo
cpulimit -v -l 100 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc
/media/ramdisk/ latest-truthy.000.nt
root 3359 0.0 0.0 4500 776 pts/0 S+ 19:22 0:00 cpulimit
-v -l 100 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc
/media/ramdisk/ latest-truthy.000.nt
root 3360 0.0 0.0 120304 3288 pts/0 S+ 19:22 0:00 sh
./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc /media/ramdisk/
latest-truthy.000.nt
root 3361 4.9 0.0 4500 92 pts/0 S<+ 19:22 0:05 cpulimit
-v -l 100 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc
/media/ramdisk/ latest-truthy.000.nt
root 3366 95.7 14.8 7866116 2418768 pts/0 Sl+ 19:22 1:42 java
-Dlog4j.configuration=file:/run/media/dick/KVM/jena/apache-jena-3.5.0/jena-log4j.properties
-cp /run/media/dick/KVM/jena/apache-jena-3.5.0/lib/* tdb2.tdbloader -v
--loc /media/ramdisk/ latest-truthy.000.nt
dick 3477 0.0 0.0 119728 972 pts/1 S+ 19:24 0:00 grep
--color=auto tdb2
Note that PID 3366 (the java process) is running with the default -Xmx2G.
19:26:49 INFO TDB2 :: Finished: 10,485,760
latest-truthy.000.nt 247.28s (Avg: 42,404)
After the first pass there are no reads from the 1TB source, as the OS has
cached the 1.2G chunk.
19:33:50 INFO TDB2 :: Finished: 10,485,760
latest-truthy.000.nt 245.70s (Avg: 42,677)
Increase the max heap to help the GC:
export JVM_ARGS="-Xmx4G"
sudo ps aux | grep tdb2
root 4317 0.0 0.0 222848 6236 pts/0 S+ 19:35 0:00 sudo
cpulimit -v -l 100 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc
/media/ramdisk/ latest-truthy.000.nt
root 4321 0.0 0.0 4500 924 pts/0 S+ 19:35 0:00 cpulimit
-v -l 100 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc
/media/ramdisk/ latest-truthy.000.nt
root 4322 0.0 0.0 120304 3356 pts/0 S+ 19:35 0:00 sh
./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc /media/ramdisk/
latest-truthy.000.nt
root 4323 4.8 0.0 4500 88 pts/0 S<+ 19:35 0:09 cpulimit
-v -l 100 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc
/media/ramdisk/ latest-truthy.000.nt
root 4328 94.8 18.5 8406788 3036188 pts/0 Sl+ 19:35 3:01 java
-Dlog4j.configuration=file:/run/media/dick/KVM/jena/apache-jena-3.5.0/jena-log4j.properties
-cp /run/media/dick/KVM/jena/apache-jena-3.5.0/lib/* tdb2.tdbloader -v
--loc /media/ramdisk/ latest-truthy.000.nt
dick 4594 0.0 0.0 119728 1024 pts/1 S+ 19:38 0:00 grep
--color=auto tdb2
At 800K triples PID 4328 (java) was at 3GB, and it peaked at 3.4GB just
prior to completion.
19:39:23 INFO TDB2 :: Finished: 10,485,760
latest-truthy.000.nt 247.65s (Avg: 42,340)
Throw all CPU resources at it, i.e. -l 800 (8 threads x 100%):
sudo cpulimit -v -l 800 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc /media/ramdisk/ latest-truthy.000.nt
The average was above 45K by 350K triples and above 60K by 1.2M.
19:43:38 INFO TDB2 :: Finished: 10,485,760
latest-truthy.000.nt 166.91s (Avg: 62,823)
sudo ps aux | grep tdb2
root 4740 0.0 0.0 222848 6264 pts/0 S+ 19:40 0:00 sudo
cpulimit -v -l 800 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc
/media/ramdisk/ latest-truthy.000.nt
root 4744 0.0 0.0 4500 720 pts/0 S+ 19:40 0:00 cpulimit
-v -l 800 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc
/media/ramdisk/ latest-truthy.000.nt
root 4745 0.0 0.0 120304 3208 pts/0 S+ 19:40 0:00 sh
./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc /media/ramdisk/
latest-truthy.000.nt
root 4746 4.7 0.0 4500 92 pts/0 R<+ 19:40 0:07 cpulimit
-v -l 800 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc
/media/ramdisk/ latest-truthy.000.nt
root 4751 131 21.1 8693508 3448252 pts/0 Sl+ 19:40 3:32 java
-Dlog4j.configuration=file:/run/media/dick/KVM/jena/apache-jena-3.5.0/jena-log4j.properties
-cp /run/media/dick/KVM/jena/apache-jena-3.5.0/lib/* tdb2.tdbloader -v
--loc /media/ramdisk/ latest-truthy.000.nt
dick 4808 0.0 0.0 119728 1060 pts/1 S+ 19:43 0:00 grep
--color=auto tdb2
Heap peaked at 3.4GB
sudo cpulimit -v -l 50 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc /media/ramdisk/ latest-truthy.000.nt
sudo ps aux | grep tdb2
root 4898 0.0 0.0 222844 5672 pts/0 S+ 19:45 0:00 sudo
cpulimit -v -l 50 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc
/media/ramdisk/ latest-truthy.000.nt
root 4899 0.0 0.0 4500 724 pts/0 S+ 19:45 0:00 cpulimit
-v -l 50 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc
/media/ramdisk/ latest-truthy.000.nt
root 4900 0.0 0.0 120304 3244 pts/0 T+ 19:45 0:00 sh
./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc /media/ramdisk/
latest-truthy.000.nt
root 4901 5.5 0.0 4500 92 pts/0 S<+ 19:45 0:25 cpulimit
-v -l 50 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc
/media/ramdisk/ latest-truthy.000.nt
root 4906 50.5 20.7 8685316 3395236 pts/0 Tl+ 19:45 3:55 java
-Dlog4j.configuration=file:/run/media/dick/KVM/jena/apache-jena-3.5.0/jena-log4j.properties
-cp /run/media/dick/KVM/jena/apache-jena-3.5.0/lib/* tdb2.tdbloader -v
--loc /media/ramdisk/ latest-truthy.000.nt
dick 4983 0.0 0.0 119728 1072 pts/1 S+ 19:53 0:00 grep
--color=auto tdb2
19:53:38 INFO TDB2 :: Finished: 10,485,760
latest-truthy.000.nt 482.27s (Avg: 21,742)
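
To compare the runs side by side, here is a small sketch that parses the "Finished" lines quoted above and recomputes the rates. The regex is my own guess at the line layout based on these samples, not something taken from the Jena source, so adjust it if your log differs.

```python
import re

# "Finished" lines from the 100%, 800% and 50% runs above (wrapped as in the mail).
LOG = """\
19:26:49 INFO TDB2 :: Finished: 10,485,760
latest-truthy.000.nt 247.28s (Avg: 42,404)
19:43:38 INFO TDB2 :: Finished: 10,485,760
latest-truthy.000.nt 166.91s (Avg: 62,823)
19:53:38 INFO TDB2 :: Finished: 10,485,760
latest-truthy.000.nt 482.27s (Avg: 21,742)
"""

# Assumed layout: Finished: <triples> <file> <seconds>s (Avg: <rate>)
# \s+ also matches the line break introduced by mail wrapping.
FINISHED = re.compile(
    r"Finished:\s+([\d,]+)\s+(\S+)\s+([\d.]+)s\s+\(Avg:\s+([\d,]+)\)"
)


def parse_runs(text: str) -> list[dict]:
    """Extract triples, seconds and reported average from each Finished line."""
    runs = []
    for triples, name, secs, avg in FINISHED.findall(text):
        count = int(triples.replace(",", ""))
        seconds = float(secs)
        runs.append({
            "file": name,
            "triples": count,
            "seconds": seconds,
            "reported_avg": int(avg.replace(",", "")),
            "rate": count / seconds,  # recomputed triples/sec
        })
    return runs


runs = parse_runs(LOG)
base, full, half = runs  # 100%, 800% and 50% cpulimit runs, in log order

if __name__ == "__main__":
    print(f"800% vs 100%: {base['seconds'] / full['seconds']:.2f}x faster")
    print(f"50% vs 100%: {half['seconds'] / base['seconds']:.2f}x slower")
```

Going from 100% to 800% gives well under a 2x speedup, while halving the limit roughly doubles the time. That fits with the ps sample showing java at 131% CPU under the 800 limit, although ps reports an average over the process lifetime, so it is only a rough indication.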
On 28 November 2017 at 19:08, Laura Morales wrote:
> > I've had loads take over 24 hours and produce 350GB TDB1 instances...
>
> Yeah 24H is still acceptable, but it's very borderline. Running a
> conversion that takes days becomes frustrating very soon. Of course I'm not
> trying to be mean here, but I think it's good to push the limits because we
> are already at a point where graphs have several billions triples. If my
> computer, which is an average consumer PC at best, can do 60-70K, two
> "average grade" nodes could already outperform your beefy server if only I
> could share the load on multiple PCs.
>
> > Ok with the data, I have that somewhere and will run it through,
> > hopefully tonight if paid work doesn't get in the way ;-)
>
> Thank you very much for trying this and for offering feedback. I'd be
> interested to know
>
> - what components do you have (cpu/ram/disks/...)
> - the AVG number of triples/second
> - the final size of the TDB2 store
>
> Also since you're already running this test, would you mind sharing the
> final TDB2 store instead of deleting it? :) If the output is not too
> large...
>