Hi there,
Thanks for the information and experience report. Always good to hear
what happens in a variety of situations.
A few details:
tdb2.tdbloader has a number of loading algorithms - which one are you
using? While some are variations of a common algorithm, they have
different characteristics. (The fastest - the parallel loader - is
not the best at large scale.)
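For reference, the loader can be chosen on the command line. A sketch, assuming a recent Jena release where tdb2.tdbloader accepts a --loader argument (the path and filename below are placeholders):

```shell
# Pick the loading algorithm explicitly; recent releases offer
# basic, sequential, phased (the default) and parallel.
tdb2.tdbloader --loader=phased --loc /data/wikidata-db latest-all.nt.gz

# The parallel loader is the fastest on small/medium data but,
# as noted above, not necessarily the best choice at this scale.
tdb2.tdbloader --loader=parallel --loc /data/wikidata-db latest-all.nt.gz
```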
What's the hardware being used?
How big is the machine (RAM size, heap size)?
Do you know how the SSD is connected? SATA? NVMe?
It should be possible to port tdbloader2 to TDB2. tdbloader2 is
fundamentally different from the other loaders. For the majority of use
cases, its advantages don't show up with an SSD (it originates from the
spinning-disk era!). But Wikidata isn't in that majority.
The tops of the B+trees currently being worked on should naturally end
up cached in the OS file system cache in RAM. As memory-mapped byte
buffers, access is as fast as, or faster than, heap RAM.
Related thought:
I wonder if we could create Wikidata databases once and then publish the
database. A database can be published as a compressed zip file of the
directory and the compression ratio is quite high. Even so, working
with large files is still going to be non-trivial and we'd need
somewhere to put them that can also supply the bandwidth.
(Also - HDT maybe - I don't know how that performs on read at this scale.)
Andy
On 12/09/2021 20:12, Cristóbal Miranda wrote:
SSD. First phase was 50-90k triples per second until 3B triples
where it started going down from 50k to 20k per second (took 3 days).
SPO => SPO->POS, SPO->OSP phase was 25-50k per second
until 1B where it went from 25k to 4k triples per second,
currently at 3.7B triples.
On Sun, 12 Sept 2021 at 04:59, Laura Morales <[email protected]> wrote:
Just a personal curiosity... are you building it on a SSD or HDD? What is
your "triples loaded per second" rate?
Sent: Sunday, September 12, 2021 at 2:39 AM
From: "Cristóbal Miranda" <[email protected]>
To: [email protected]
Subject: Faster TDB2 build?
Hi,
I'm running tdb2.tdbloader on Wikidata, but it's
taking too long; it's now on day 11 and still indexing,
whereas tdbloader2 (for TDB) didn't take as long for me.
I was wondering if something could be done to give the build
phase more RAM in order to make it faster, for example by
passing a memory budget parameter to the loader. I'm not sure
exactly how the extra RAM would be used, but I was thinking
that if more B+tree blocks were kept in RAM this processing
would be faster, for example keeping the two upper levels of
the tree in primary memory, or even the whole tree if the
given budget allowed it.
What would it take to implement such a feature? Maybe in a
tdb2.tdbloader2? I was looking at the code for a way to do something,
but couldn't find an easy modification to achieve this.