Re: tdb2.tdbloader performance

Andy Seaborne Sat, 02 Dec 2017 12:56:03 -0800

Short story I used the following "reasonable" device


     Dell M3800
     Fedora 27
     16GB SODIMM DDR3 Synchronous 1600 MHz
     CPU cache L1/256KB,L2/1MB,L3/6MB
     Intel(R) Core(TM) i7-4702HQ CPU @ 2.20GHz 4 cores 8 threads

to load part of the latest-truthy.nt from a USB3.0 1TB drive to a 6GB RAM
disk and;

@800%    60K/Sec
@100%    40K/Sec
@50%    20K/Sec

The full source file contains 2.2G of triples in 10GB bz2 which
decompresses to 250GB nt, which I split into 10M triple chunks and used the
first one to test.


Which tdb loader?

For TDB1, the two loaders behave very differently.

I loaded truthy, 2.199 billion triples, on a 16G Dell XPS with SSD in 8 hours (76K triples/s) using TDB1 tdbloader2.
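
As a quick check of that rate, here is a minimal sketch of the arithmetic. The function names are made up for illustration and are not part of any Jena tool. The second helper assumes a constant rate, which the discussion further down says does not hold at scale, so treat its result as an optimistic lower bound on load time.

```python
def triples_per_second(triples: int, hours: float) -> float:
    """Average load rate over the whole run."""
    return triples / (hours * 3600)


def hours_at_rate(triples: int, rate_per_second: float) -> float:
    """Load time in hours, ASSUMING the rate stays constant (it usually drops)."""
    return triples / rate_per_second / 3600


# Figures quoted above: 2.199 billion triples in 8 hours.
rate = triples_per_second(2_199_000_000, 8)
print(f"{rate / 1000:.0f}K triples/s")

# Hypothetical extrapolation: the whole file at the 60K/s seen on the 10M chunk.
estimate = hours_at_rate(2_199_000_000, 60_000)
print(f"{estimate:.1f} hours at a constant 60K/s")
```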


I'll write it up soon.

Check with Andy but I think it's limited by CPU, which is why my 24 core (4
x Xeon 6 Core @2.5GHz) 128GB server is able to run concurrent loads with no
performance hit.

The limit at scale is the I/O handling and disk cache. 128G RAM gives a better disk cache and that server machine probably has better I/O. It's big enough to fit one whole index (if all RAM is available - and that depends on the swappiness setting which should be set to zero ideally).

CPU is a limit for a while but you'll see the load speed slows down so it is not purely CPU as the limit. (As the indexes are 200-way trees, they don't get very deep.)
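
To put a number on "not very deep", here is a rough sketch that counts the levels a 200-way tree needs for a given number of entries. It assumes every node is completely full with a fanout of exactly 200, which is a simplification (real B+tree nodes are usually only partly full), and `tree_levels` is a hypothetical helper, not a TDB function.

```python
def tree_levels(entries: int, fanout: int = 200) -> int:
    """Smallest number of levels such that fanout**levels >= entries.

    ASSUMPTION: every node is full, so this is a best-case depth.
    """
    levels = 1
    capacity = fanout
    while capacity < entries:
        capacity *= fanout
        levels += 1
    return levels


print(tree_levels(10_000_000))      # the 10M-triple test chunk
print(tree_levels(2_199_000_000))   # the full truthy dump
```

Under these assumptions the full dump needs only one more level than the 10M chunk, which fits the point that tree depth is not what slows a large load down.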

tdbloader (loader1) does one index at a time so that the I/O is constrained, unlike simply adding triples to all 3 indexes together (which is what TDB2 loader does currently).

loader1 degrades at large scale due to random I/O write patterns on secondary indexes. Hence an SSD makes a big difference.

loader2 (which has high overhead) avoids the problems and only writes indexes from sorted input so no random access to the indexes. An SSD makes less difference.
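
The idea of writing each index only from sorted input can be illustrated with a toy sketch. This is not the tdbloader2 implementation: it sorts in memory, uses integers as stand-ins for node ids, and the names `ORDERS`, `build_sorted_index` and `build_all` are invented here. The SPO/POS/OSP orderings are assumed to match the usual TDB triple index names.

```python
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[int, int, int]  # (subject, predicate, object) as node ids

# ASSUMPTION: the three triple indexes are SPO, POS and OSP.
ORDERS: Dict[str, Tuple[int, int, int]] = {
    "SPO": (0, 1, 2),
    "POS": (1, 2, 0),
    "OSP": (2, 0, 1),
}


def build_sorted_index(triples: List[Triple], order: Tuple[int, int, int]) -> List[Triple]:
    """Permute each triple into index order, then sort once.

    Writing the sorted result front to back is sequential I/O; no entry
    is ever inserted into the middle of an already written index.
    """
    return sorted((t[order[0]], t[order[1]], t[order[2]]) for t in triples)


def build_all(triples: Iterable[Triple]) -> Dict[str, List[Triple]]:
    """Build the indexes one at a time, each from its own sorted pass."""
    data = list(triples)
    return {name: build_sorted_index(data, order) for name, order in ORDERS.items()}


indexes = build_all([(3, 1, 2), (1, 2, 3), (2, 1, 1)])
for name, rows in indexes.items():
    print(name, rows)
```

The extra sort per index is where the "high overhead" comes from in this sketch; the payoff is that the write pattern stays sequential however large the data gets.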

I might have access to an AMD ThreadRipper 12 core 24 thread 5GHz in the
next few days and I will try and test against it.

I haven't run the full import because a: i'm guessing the resulting TDB2
will be "large" b: my servers are currently importing other "large"
TDB2's!!!

The TDB2 database for a single graph will be the same size as TDB1 using tdbloader (not tdbloader2).


Long story follows...


<lots of interesting numbers>
