Hi there,
Thanks for the information and experience report. Always good to hear
what happens in a variety of situations.
A few details:
tdb2.tdbloader has a number of loading algorithms - which one are you
using? While some are variations of a common algorithm, they have
different characteristics. (The fastest - the parallel loader - is
not the best at large scale.)
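For reference, the loader can be chosen on the command line. A sketch, assuming a recent Jena release where tdb2.tdbloader accepts a --loader argument (the path and filename below are placeholders):

```shell
# Pick the loading algorithm explicitly; recent releases offer
# basic, sequential, phased (the default) and parallel.
tdb2.tdbloader --loader=phased --loc /data/wikidata-db latest-all.nt.gz

# The parallel loader is the fastest on small/medium data but,
# as noted above, not necessarily the best choice at this scale.
tdb2.tdbloader --loader=parallel --loc /data/wikidata-db latest-all.nt.gz
```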
What's the hardware being used?
How big is the machine (RAM size, heap size)?
Do you know how the SSD is connected? SATA? NVMe?
It should be possible to port tdbloader2 to TDB2. tdbloader2 is
fundamentally different from the other loaders. For the majority of use
cases, its advantages don't show up with an SSD (it originates from the
spinning-disk era!). But Wikidata isn't in that majority.
The tops of the B+trees currently being worked on should naturally end
up cached in the OS file system cache in RAM. As memory-mapped byte
buffers, access is as fast as, or faster than, heap RAM.
Related thought:
I wonder if we could create Wikidata databases once and then publish the
database. A database can be published as a compressed zip file of the
directory and the compression ratio is quite high. Even so, working
with large files is still going to be non-trivial and we'd need
somewhere to put them that can also supply the bandwidth.
(Also - HDT maybe - I don't know how that performs on read at this scale.)
Andy
On 12/09/2021 20:12, Cristóbal Miranda wrote:
SSD. First phase was 50-90k triples per second until 3B triples
where it started going down from 50k to 20k per second (took 3 days).
SPO => SPO->POS, SPO->OSP phase was 25-50k per second
until 1B where it went from 25k to 4k triples per second,
currently at 3.7B triples.
On Sun, 12 Sept 2021 at 04:59, Laura Morales <[email protected]> wrote:
Just a personal curiosity... are you building it on a SSD or HDD? What is
your "triples loaded per second" rate?
Sent: Sunday, September 12, 2021 at 2:39 AM
From: "Cristóbal Miranda" <[email protected]>
To: [email protected]
Subject: Faster TDB2 build?
Hi,
I'm running tdb2.tdbloader on Wikidata, but it's
taking too long; it's now on day 11 and still indexing,
whereas tdbloader2 (for TDB) didn't take as long for me.
I was wondering if something could be done to give the build
phase more RAM in order to make it faster, for example by
passing a memory budget parameter to the loader. I'm not sure
exactly how the extra RAM would be used, but I was thinking
that if more B+tree blocks were kept in RAM this processing
would be faster, for example keeping the two upper levels of
the tree in primary memory, or even the whole tree if the
given budget allowed it.
What would it take to implement such a feature? Maybe in a
tdb2.tdbloader2? I was looking at the code for a way to do something,
but couldn't find an easy modification to achieve this.