Hi.

Sorry for the delay :-)

Short story: I used the following "reasonable" machine

    Dell M3800
    Fedora 27
    16GB SODIMM DDR3 Synchronous 1600 MHz
    CPU cache L1/256KB,L2/1MB,L3/6MB
    Intel(R) Core(TM) i7-4702HQ CPU @ 2.20GHz 4 cores 8 threads

to load part of latest-truthy.nt from a USB 3.0 1TB drive into a 6GB RAM
disk, and got these average load rates (triples/sec) at each cpulimit
setting:

@800%    60K/sec
@100%    40K/sec
@50%     20K/sec

The full source file contains about 2.2 billion triples in a 10GB bz2
archive, which decompresses to roughly 278GB (about 258GiB) of N-Triples.
I split it into chunks of 10M triples (10,485,760 lines each) and used the
first chunk to test.
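
As a quick sanity check on the chunking, the arithmetic can be sketched in
shell. The line count and chunk size are the ones reported by wc and split
further down; the variable names are only illustrative.

```sh
#!/bin/sh
# Figures taken from the wc -l and split commands later in this mail.
total_lines=2199382887     # triples (lines) in latest-truthy.nt
chunk_lines=10485760       # lines per chunk, i.e. 10 * 1024 * 1024

# Ceiling division: number of chunk files split will create.
chunks=$(( (total_lines + chunk_lines - 1) / chunk_lines ))

# Lines left over for the final, shorter chunk.
last_chunk=$(( total_lines - (chunks - 1) * chunk_lines ))

echo "chunks=$chunks last_chunk_lines=$last_chunk"
```

That gives 210 chunks (suffixes 000 to 209), which fits in the three-digit
suffix requested with -a 3.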

Check with Andy, but I think the load is CPU-bound, which is why my 24-core
server (4 x 6-core Xeon @ 2.5GHz, 128GB RAM) is able to run concurrent loads
with no performance hit.

I might have access to an AMD Threadripper (12 cores, 24 threads, 5GHz) in
the next few days and I will try to test against it.

I haven't run the full import because (a) I'm guessing the resulting TDB2
database will be "large", and (b) my servers are currently importing other
"large" TDB2 databases!
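
For a rough idea of what a full import might take, the averages logged
further down can be extrapolated to the whole file. This is only a
back-of-envelope sketch: it assumes the rate seen on the first 10M-triple
chunk holds for all 2.2 billion triples, which may well not be the case as
the database grows, and the eta_hours helper is just an illustrative name.

```sh
#!/bin/sh
total_triples=2199382887   # from wc -l on latest-truthy.nt

# eta_hours RATE: hours to load total_triples at RATE triples/sec,
# assuming (optimistically) that the rate never drops.
eta_hours() {
    awk -v n="$total_triples" -v r="$1" 'BEGIN { printf "%.1f", n / r / 3600 }'
}

# Averages logged by tdb2.tdbloader for the 50%, 100% and 800% runs.
for rate in 21742 42404 62823; do
    echo "$rate triples/sec -> $(eta_hours "$rate") hours"
done
```

On those assumptions the estimate ranges from roughly 10 hours at the 800%
setting to roughly 28 hours at 50%.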

Long story follows...

decompress the file;

pbzip2 -dv -p4 -m1024 latest-truthy.nt.bz2
Parallel BZIP2 v1.1.12 [Dec 21, 2014]
By: Jeff Gilchrist [http://compression.ca]
Major contributions: Yavor Nikolov [http://javornikolov.wordpress.com]
Uses libbzip2 by Julian Seward

         # CPUs: 4
 Maximum Memory: 1024 MB
 Ignore Trailing Garbage: off
-------------------------------------------
         File #: 1 of 1
     Input Name: latest-truthy.nt.bz2
    Output Name: latest-truthy.nt

 BWT Block Size: 900k
     Input Size: 9965955258 bytes
Decompressing data...
    Output Size: 277563574685 bytes
-------------------------------------------

     Wall Clock: 5871.550948 seconds

count the lines;

wc -l latest-truthy.nt
2199382887 latest-truthy.nt

Just short of 2200M...

split the file into 10M chunks;

split -d -l 10485760 -a 3 --verbose latest-truthy.nt latest-truthy.nt.
creating file 'latest-truthy.nt.000'
creating file 'latest-truthy.nt.001'
creating file 'latest-truthy.nt.002'
creating file 'latest-truthy.nt.003'
creating file 'latest-truthy.nt.004'
creating file 'latest-truthy.nt.005'
...
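
The loader commands below use latest-truthy.000.nt rather than the
latest-truthy.nt.000 created by split, so a rename step is implied but not
shown. A minimal sketch of one way to do it, so that each chunk ends in .nt
(as far as I know Jena picks the parser from the file extension);
rename_chunks is a made-up helper name, and the demo runs on empty
placeholder files in a scratch directory:

```sh
#!/bin/sh
# rename_chunks DIR: turn latest-truthy.nt.NNN into latest-truthy.NNN.nt
# so that each chunk ends in .nt. Hypothetical helper, not from the original.
rename_chunks() {
    for f in "$1"/latest-truthy.nt.[0-9][0-9][0-9]; do
        [ -e "$f" ] || continue          # glob matched nothing
        n=${f##*.}                       # numeric suffix, e.g. 000
        mv -- "$f" "$1/latest-truthy.$n.nt"
    done
}

# Demo on empty placeholder files in a scratch directory.
demo_dir=$(mktemp -d)
touch "$demo_dir/latest-truthy.nt.000" "$demo_dir/latest-truthy.nt.001"
rename_chunks "$demo_dir"
ls "$demo_dir"
```

With a recent GNU split, passing --additional-suffix=.nt and the prefix
"latest-truthy." should produce the same names directly and avoid the
rename.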

Restart!

sudo cpulimit -v -l 100 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v
--loc /media/ramdisk/ latest-truthy.000.nt

ps aux | grep tdb2
root      3358  0.0  0.0 222844  5756 pts/0    S+   19:22   0:00 sudo
cpulimit -v -l 100 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc
/media/ramdisk/ latest-truthy.000.nt
root      3359  0.0  0.0   4500   776 pts/0    S+   19:22   0:00 cpulimit
-v -l 100 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc
/media/ramdisk/ latest-truthy.000.nt
root      3360  0.0  0.0 120304  3288 pts/0    S+   19:22   0:00 sh
./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc /media/ramdisk/
latest-truthy.000.nt
root      3361  4.9  0.0   4500    92 pts/0    S<+  19:22   0:05 cpulimit
-v -l 100 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc
/media/ramdisk/ latest-truthy.000.nt
root      3366 95.7 14.8 7866116 2418768 pts/0 Sl+  19:22   1:42 java
-Dlog4j.configuration=file:/run/media/dick/KVM/jena/apache-jena-3.5.0/jena-log4j.properties
-cp /run/media/dick/KVM/jena/apache-jena-3.5.0/lib/* tdb2.tdbloader -v
--loc /media/ramdisk/ latest-truthy.000.nt
dick      3477  0.0  0.0 119728   972 pts/1    S+   19:24   0:00 grep
--color=auto tdb2

Note that PID 3366 (the java process) is running with the default max heap,
-Xmx2G.

19:26:49 INFO  TDB2                 :: Finished: 10,485,760
latest-truthy.000.nt 247.28s (Avg: 42,404)

After the first pass there are no further reads from the 1TB source drive,
as the OS has cached the 1.2GB chunk.

19:33:50 INFO  TDB2                 :: Finished: 10,485,760
latest-truthy.000.nt 245.70s (Avg: 42,677)
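
The "Avg" figure in these log lines is just triples loaded divided by
elapsed seconds, which is easy to check. A small sketch (avg_rate is an
illustrative helper; the loader may round slightly differently, so other
runs can be off by a triple or so per second):

```sh
#!/bin/sh
# avg_rate TRIPLES SECONDS: whole triples per second, truncated.
avg_rate() {
    awk -v t="$1" -v s="$2" 'BEGIN { printf "%d", t / s }'
}

echo "run 1: $(avg_rate 10485760 247.28) triples/sec"
echo "run 2: $(avg_rate 10485760 245.70) triples/sec"
```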

Increase the max heap to help the GC;

export JVM_ARGS="-Xmx4G"

sudo ps aux | grep tdb2
root      4317  0.0  0.0 222848  6236 pts/0    S+   19:35   0:00 sudo
cpulimit -v -l 100 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc
/media/ramdisk/ latest-truthy.000.nt
root      4321  0.0  0.0   4500   924 pts/0    S+   19:35   0:00 cpulimit
-v -l 100 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc
/media/ramdisk/ latest-truthy.000.nt
root      4322  0.0  0.0 120304  3356 pts/0    S+   19:35   0:00 sh
./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc /media/ramdisk/
latest-truthy.000.nt
root      4323  4.8  0.0   4500    88 pts/0    S<+  19:35   0:09 cpulimit
-v -l 100 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc
/media/ramdisk/ latest-truthy.000.nt
root      4328 94.8 18.5 8406788 3036188 pts/0 Sl+  19:35   3:01 java
-Dlog4j.configuration=file:/run/media/dick/KVM/jena/apache-jena-3.5.0/jena-log4j.properties
-cp /run/media/dick/KVM/jena/apache-jena-3.5.0/lib/* tdb2.tdbloader -v
--loc /media/ramdisk/ latest-truthy.000.nt
dick      4594  0.0  0.0 119728  1024 pts/1    S+   19:38   0:00 grep
--color=auto tdb2

At 800K triples the java process (PID 4328) was at 3GB resident, and it
peaked at 3.4GB just prior to completion.

19:39:23 INFO  TDB2                 :: Finished: 10,485,760
latest-truthy.000.nt 247.65s (Avg: 42,340)

Throw all CPU resources at it, i.e. -l 800 (8 threads x 100%);

sudo cpulimit -v -l 800 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v
--loc /media/ramdisk/ latest-truthy.000.nt

The average was above 45K triples/sec by 350K triples loaded, and above 60K
by 1.2M.

19:43:38 INFO  TDB2                 :: Finished: 10,485,760
latest-truthy.000.nt 166.91s (Avg: 62,823)

sudo ps aux | grep tdb2
root      4740  0.0  0.0 222848  6264 pts/0    S+   19:40   0:00 sudo
cpulimit -v -l 800 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc
/media/ramdisk/ latest-truthy.000.nt
root      4744  0.0  0.0   4500   720 pts/0    S+   19:40   0:00 cpulimit
-v -l 800 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc
/media/ramdisk/ latest-truthy.000.nt
root      4745  0.0  0.0 120304  3208 pts/0    S+   19:40   0:00 sh
./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc /media/ramdisk/
latest-truthy.000.nt
root      4746  4.7  0.0   4500    92 pts/0    R<+  19:40   0:07 cpulimit
-v -l 800 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc
/media/ramdisk/ latest-truthy.000.nt
root      4751  131 21.1 8693508 3448252 pts/0 Sl+  19:40   3:32 java
-Dlog4j.configuration=file:/run/media/dick/KVM/jena/apache-jena-3.5.0/jena-log4j.properties
-cp /run/media/dick/KVM/jena/apache-jena-3.5.0/lib/* tdb2.tdbloader -v
--loc /media/ramdisk/ latest-truthy.000.nt
dick      4808  0.0  0.0 119728  1060 pts/1    S+   19:43   0:00 grep
--color=auto tdb2

Memory use (RSS in the ps output above) peaked at 3.4GB.

Now limit it to half a core, i.e. -l 50;

sudo cpulimit -v -l 50 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v
--loc /media/ramdisk/ latest-truthy.000.nt

sudo ps aux | grep tdb2
root      4898  0.0  0.0 222844  5672 pts/0    S+   19:45   0:00 sudo
cpulimit -v -l 50 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc
/media/ramdisk/ latest-truthy.000.nt
root      4899  0.0  0.0   4500   724 pts/0    S+   19:45   0:00 cpulimit
-v -l 50 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc
/media/ramdisk/ latest-truthy.000.nt
root      4900  0.0  0.0 120304  3244 pts/0    T+   19:45   0:00 sh
./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc /media/ramdisk/
latest-truthy.000.nt
root      4901  5.5  0.0   4500    92 pts/0    S<+  19:45   0:25 cpulimit
-v -l 50 -i sh ./apache-jena-3.5.0/bin/tdb2.tdbloader -v --loc
/media/ramdisk/ latest-truthy.000.nt
root      4906 50.5 20.7 8685316 3395236 pts/0 Tl+  19:45   3:55 java
-Dlog4j.configuration=file:/run/media/dick/KVM/jena/apache-jena-3.5.0/jena-log4j.properties
-cp /run/media/dick/KVM/jena/apache-jena-3.5.0/lib/* tdb2.tdbloader -v
--loc /media/ramdisk/ latest-truthy.000.nt
dick      4983  0.0  0.0 119728  1072 pts/1    S+   19:53   0:00 grep
--color=auto tdb2

19:53:38 INFO  TDB2                 :: Finished: 10,485,760
latest-truthy.000.nt 482.27s (Avg: 21,742)
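
Putting the three runs side by side suggests why this looks CPU-bound but
not very parallel: halving the CPU roughly halves the rate, while going from
one core to all eight gains well under 2x (ps shows the java process at
about 131% CPU in the 800% run). A sketch of that comparison, using the
logged averages; the speedup helper is illustrative only:

```sh
#!/bin/sh
baseline=42404   # Avg at cpulimit -l 100 (first run)

# speedup RATE: RATE relative to the 100% baseline, two decimals.
speedup() {
    awk -v r="$1" -v b="$baseline" 'BEGIN { printf "%.2f", r / b }'
}

echo "limit  50%: $(speedup 21742)x"
echo "limit 100%: $(speedup 42404)x"
echo "limit 800%: $(speedup 62823)x"
```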

On 28 November 2017 at 19:08, Laura Morales <laure...@mail.com> wrote:

> > I've had loads take over 24 hours and produce 350GB TDB1 instances...
>
> Yeah 24H is still acceptable, but it's very borderline. Running a
> conversion that takes days becomes frustrating very soon. Of course I'm not
> trying to be mean here, but I think it's good to push the limits because we
> are already at a point where graphs have several billions triples. If my
> computer, which is an average consumer PC at best, can do 60-70K, two
> "average grade" nodes could already outperform your beefy server if only I
> could share the load on multiple PCs.
>
> > Ok with the data, I have that somewhere and will run it through,
> hopefully tonight if paid work doesn't get in the way ;-)
>
> Thank you very much for trying this and for offering feedback. I'd be
> interested to know
>
> - what components do you have (cpu/ram/disks/...)
> - the AVG number of triples/second
> - the final size of the TDB2 store
>
> Also since you're already running this test, would you mind sharing the
> final TDB2 store instead of deleting it? :) If the output is not too
> large...
>
