Short story: I used the following "reasonable" device:
Dell M3800
Fedora 27
16GB SODIMM DDR3 Synchronous 1600 MHz
CPU cache: L1 256KB, L2 1MB, L3 6MB
Intel(R) Core(TM) i7-4702HQ CPU @ 2.20GHz 4 cores 8 threads
to load part of latest-truthy.nt from a USB 3.0 1TB drive to a 6GB RAM
disk, and saw:
@800% 60K/Sec
@100% 40K/Sec
@50% 20K/Sec
The full source file contains 2.2G triples in a 10GB bz2, which
decompresses to 250GB of N-Triples. I split that into 10M-triple chunks
and used the first one to test.
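
For anyone wanting to reproduce the chunking, here is a minimal sketch of one way to do it. It is not the script actually used above; the function name, the file naming scheme and the default chunk size are all assumptions for illustration. It relies on N-Triples being line-oriented and counts lines, so comment or blank lines (if any) count towards a chunk.

```python
from pathlib import Path


def split_ntriples(src, out_dir, lines_per_chunk=10_000_000, prefix="chunk"):
    """Split a line-oriented N-Triples file into numbered chunk files.

    Hypothetical helper: streams the input, so memory use stays flat
    regardless of the size of the source file.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    out = None
    with open(src, "rb") as fin:
        for n, line in enumerate(fin):
            # Start a new chunk file every lines_per_chunk lines.
            if n % lines_per_chunk == 0:
                if out is not None:
                    out.close()
                path = out_dir / f"{prefix}-{len(paths):05d}.nt"
                out = open(path, "wb")
                paths.append(path)
            out.write(line)
    if out is not None:
        out.close()
    return paths
```

Called as split_ntriples("latest-truthy.nt", "chunks"), this would leave chunk-00000.nt, chunk-00001.nt, ... in the chunks directory, and the first of those is the kind of file used for the test above.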
Which tdb loader?
For TDB1, the two loaders behave very differently.
I loaded truthy, 2.199 billion triples, on a 16G Dell XPS with SSD in 8
hours (76K triples/s) using TDB1 tdbloader2.
I'll write it up soon.
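
As a quick check on the arithmetic, the figures quoted above (2.199 billion triples in 8 hours) do work out to about 76K triples/s. The sketch below also gives a rough way to turn a measured rate into a full-load estimate; it assumes the rate stays constant for the whole load, which may well not hold at scale, so treat the result as a lower bound. The function names are just for illustration.

```python
def triples_per_second(triples, hours):
    """Average load rate over a whole run."""
    return triples / (hours * 3600)


def hours_at_rate(triples, rate):
    """Naive load-time estimate, assuming the rate never degrades."""
    return triples / rate / 3600


rate = triples_per_second(2.199e9, 8)
print(f"{rate:,.0f} triples/s")  # about 76K triples/s

# Naive extrapolation of the chunk-test rates to the full 2.2G triples.
for measured in (60_000, 40_000, 20_000):
    print(f"{measured:>6}/s -> {hours_at_rate(2.2e9, measured):.1f} hours")
```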
Check with Andy, but I think it's limited by CPU, which is why my 24-core
(4 x Xeon 6 Core @2.5GHz) 128GB server is able to run concurrent loads
with no performance hit.
The limit at scale is the I/O handling and disk cache. 128G RAM gives a
better disk cache and that server machine probably has better I/O. It's
big enough to fit one whole index (if all RAM is available - and that
depends on the swappiness setting which should be set to zero ideally).
CPU is a limit for a while, but you'll see the load speed slow down, so
CPU is not the only limit. (As the indexes are 200-way trees,
they don't get very deep.)
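
To put a number on "not very deep", here is a back-of-the-envelope sketch. It assumes every node is completely full with a fanout of 200, which real trees will not be, so the actual depth may be a level or so more. The function name is made up for the example.

```python
def levels_needed(entries, fanout=200):
    """Smallest number of levels whose capacity (fanout ** levels)
    covers the given number of entries, assuming full nodes."""
    levels, capacity = 1, fanout
    while capacity < entries:
        capacity *= fanout
        levels += 1
    return levels


truthy = 2_199_000_000
print(levels_needed(truthy))            # 200-way tree
print(levels_needed(truthy, fanout=2))  # binary tree, for comparison
```

Under those assumptions a 200-way tree over 2.199 billion entries needs 5 levels, against 32 for a binary tree, which is why walking the tree is cheap compared with the I/O needed to touch the blocks.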
tdbloader (loader1) does one index at a time so that the I/O is
constrained, unlike simply adding triples to all 3 indexes together
(which is what TDB2 loader does currently).
loader1 degrades at large scale due to random I/O write patterns on
secondary indexes. Hence an SSD makes a big difference.
loader2 (which has high overhead) avoids the problem: it only writes
indexes from sorted input, so there is no random access to the indexes.
An SSD makes less difference.
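
The difference between the two write patterns can be seen in a toy model. This is only an illustration with an in-memory sorted list, not how either loader is implemented; the P-O-S key ordering is an assumed stand-in for a secondary index, and the data is random. It counts how many inserts land at the end of the index (sequential) rather than somewhere in the middle (random access).

```python
import bisect
import random


def append_fraction(keys):
    """Insert keys one by one into a sorted list and return the share
    of inserts that landed at the end, i.e. were purely sequential."""
    index = []
    appends = 0
    for key in keys:
        pos = bisect.bisect_right(index, key)
        if pos == len(index):
            appends += 1
        index.insert(pos, key)
    return appends / len(keys)


random.seed(0)
triples = [(random.randrange(1000), random.randrange(50), random.randrange(1000))
           for _ in range(5000)]
# Permute (s, p, o) into an assumed secondary-index ordering (p, o, s).
pos_keys = [(p, o, s) for s, p, o in triples]

arrival = append_fraction(pos_keys)            # arrival order: writes scatter
presorted = append_fraction(sorted(pos_keys))  # sorted input: always append
print(f"arrival order: {arrival:.1%} sequential")
print(f"sorted input:  {presorted:.1%} sequential")
```

With sorted input every write is an append, which is the property loader2 pays its sorting overhead to get; in arrival order almost every write goes somewhere in the middle of the index.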
I might have access to an AMD ThreadRipper 12 core 24 thread 5GHz in the
next few days and I will try to test against it.
I haven't run the full import because (a) I'm guessing the resulting TDB2
will be "large" and (b) my servers are currently importing other "large"
TDB2s!
The TDB2 database for a single graph will be the same size as a TDB1
database built using tdbloader (not tdbloader2).
Long story follows...
<lots of interesting numbers>