On 14/09/2021 17:26, Cristóbal Miranda wrote:

tdb2.tdbloader has a number of loading algorithms - which one are you
using?


The default one, phased.

  How big is the machine (RAM size, heap size)?


RAM size: 736G
For heap size do you mean Xmx? if that is the case it is 60G,
but I see that no more than 6GB are being used. However, I see
almost 60GB of swap memory being used.

It doesn't need a 60G heap. 8G is probably enough.

It should not need to swap, but whether the swap figure includes mapped files is unclear. Different machines seem to report things in different ways.

What is causing the slowness is I/O saturation, and it's the bottom of the trees, which have blocks that are used infrequently.

Presumably the CPU load is not very high?


Do you know how the SSD is connected? SATA? NVMe?


I don't know that, I could ask someone if necessary.

The tops of B+trees currently being worked on should naturally end up
cached from the filing system in the OS filing system cache in RAM. As
mapped byte buffers it is as fast, or faster, than heap RAM.
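A minimal illustration of that mapped-file behaviour, using Python's mmap as a stand-in for Java's MappedByteBuffer (the file name and sizes here are just for the demo):

```python
import mmap
import os
import tempfile

BLOCK_SIZE = 8 * 1024  # TDB2 uses 8k blocks

# Create a small demo file of four zeroed blocks.
path = os.path.join(tempfile.mkdtemp(), "demo.dat")
with open(path, "wb") as f:
    f.write(b"\x00" * (4 * BLOCK_SIZE))

with open(path, "r+b") as f:
    mm = mmap.mmap(f.fileno(), 0)  # map the whole file
    # Reads and writes go through the OS page cache; hot pages (the
    # upper levels of a B+Tree) stay resident in RAM automatically.
    mm[2 * BLOCK_SIZE] = 0x7F      # write into block 2
    block2 = mm[2 * BLOCK_SIZE : 3 * BLOCK_SIZE]
    assert block2[0] == 0x7F       # the write is visible via the mapping
    mm.flush()                     # explicit sync back to the file
    mm.close()
```

Repeated access to the same blocks never leaves RAM once the pages are cached; only cold (lower-level) blocks cause disk I/O.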


What is being cached?

Areas of a file - blocks.

A block is 8k; the trees are roughly 200-way B+Trees for triples. The key is 24 or 32 bytes, no value.

Operations on the B+trees happen directly on blocks.
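Given those figures, a rough back-of-envelope calculation shows how the levels divide up (the 6B triple count is an illustrative Wikidata-ish scale, not an exact figure):

```python
FANOUT = 200             # ~200-way B+Tree, as above
BLOCK = 8 * 1024         # 8k blocks
TRIPLES = 6_000_000_000  # illustrative scale, not an exact figure

# Number of blocks at each level, from the leaves up to the root.
levels = []
n = -(-TRIPLES // FANOUT)    # leaf blocks (ceiling division)
while n > 1:
    levels.append(n)
    n = -(-n // FANOUT)      # blocks in the level above
levels.append(1)             # the root

for depth, blocks in enumerate(reversed(levels)):
    mb = blocks * BLOCK / (1024 * 1024)
    print(f"level {depth} (0 = root): {blocks:>10} blocks ~ {mb:,.1f} MB")
```

With these assumed numbers, the top three levels total a few MB (so they stay cached), the level just above the leaves is around a GB, and the leaves run to hundreds of GB, which is exactly where the I/O saturates.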

the nodes on the current branch of the tree, or complete upper levels? I'm thinking that if blocks from upper levels are retrieved from disk repeatedly between insertions (which also have to do splits), performance can degrade a lot, especially as the amount of data gets big, because too much random access has to be done.
I see that ids are used to find the blocks in the file. Could it be possible, for example, to have a HashMap mapping ids to blocks in BlockAccessMapped, retrieve from the HashMap when the id corresponds to an upper-level block, and sync when everything is done? The idea is that those upper levels will only occupy some MBs, which is not that expensive to keep in memory; it would require fewer disk accesses and also trade random access for more sequential writes of the lower-level blocks. This, of course, would only happen when building the index.
Do you think that something like this could improve build performance?
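The proposal above could be sketched roughly like this. All names here are hypothetical for illustration; this is not Jena's actual BlockAccessMapped API, and the backing store is modelled as a plain dict:

```python
# Sketch: keep upper-level blocks in a map during the bulk load,
# write lower-level blocks straight through, flush the map at the end.

class UpperLevelCache:
    def __init__(self, backing, is_upper_level):
        self.backing = backing          # id -> bytes, stands in for the file
        self.is_upper = is_upper_level  # predicate: keep this id in RAM?
        self.cache = {}

    def read(self, block_id):
        if block_id in self.cache:
            return self.cache[block_id]
        block = self.backing[block_id]
        if self.is_upper(block_id):
            self.cache[block_id] = block
        return block

    def write(self, block_id, block):
        if self.is_upper(block_id):
            self.cache[block_id] = block    # defer the disk write
        else:
            self.backing[block_id] = block  # lower levels go straight through

    def sync(self):
        # One ordered flush of the cached upper levels when the load is done.
        for block_id, block in sorted(self.cache.items()):
            self.backing[block_id] = block

disk = {i: b"\x00" * 8 for i in range(6)}
cache = UpperLevelCache(disk, is_upper_level=lambda i: i < 2)
cache.write(1, b"root....")
assert disk[1] == b"\x00" * 8   # upper-level write not yet flushed
cache.sync()
assert disk[1] == b"root...."   # flushed in one pass
```

The trade-off is crash safety during the load and picking a good `is_upper_level` predicate, but for a build-only phase that may be acceptable.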

Maybe, but I think the better approach is that of tdbloader2, which causes the I/O to be ordered because it builds the trees bottom-up in sorted order.

For the majority of cases this sorting cost isn't worth it with SSDs because there is no large seek time on random I/O.

But at wikidata scale, the I/O bandwidth gets used up. NVMe/PCIe SSDs are better than SATA-connected ones for this.
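A toy illustration of that bottom-up build, greatly simplified (a tiny fanout so the example is readable; the real loader uses an external sort and ~200-way nodes):

```python
# Sort the keys first, then build the B+Tree bottom-up,
# writing each level out sequentially.

FANOUT = 4  # tiny fanout so the example stays readable

def build_level(entries):
    """Pack sorted entries into nodes of up to FANOUT entries each."""
    return [entries[i:i + FANOUT] for i in range(0, len(entries), FANOUT)]

def bulk_build(keys):
    keys = sorted(keys)          # an external sort in the real loader
    level = build_level(keys)    # leaves, written in one sequential pass
    tree = [level]
    while len(level) > 1:
        # Each internal level holds the first key of every child node.
        level = build_level([node[0] for node in level])
        tree.append(level)
    return tree                  # [leaves, ..., root]

tree = bulk_build(range(20))
assert len(tree[0]) == 5         # 20 keys / fanout 4 = 5 leaves
assert len(tree[-1]) == 1        # a single root node
```

Because every level is emitted in key order, the writes are sequential rather than scattered block updates, which is what makes the approach pay off once random I/O bandwidth is the bottleneck.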


Related to this, how many children can a block have? 2048, 1024?

About 200.




I wonder if we can create wikidata databases once then publish the
database


That would be nice, but as you say it can be troublesome, especially
keeping a version that is not too old compared to their latest dump.



On Mon, 13 Sept 2021 at 06:44, Andy Seaborne <[email protected]> wrote:

Hi there,

Thanks for the information and experience report.  Always good to hear
what happens in a variety of situations.

A few details:

tdb2.tdbloader has a number of loading algorithms - which one are you
using? While they are variations on a common algorithm, they
have different characteristics. (The fastest - the parallel loader - is
not the best at large scale)

What's the hardware being used?
    How big is the machine (RAM size, heap size)?
    Do you know how the SSD is connected? SATA? NVMe?


It should be possible to port tdbloader2 to TDB2.  tdbloader2 is
fundamentally different to the other loaders. For the majority of use
cases, its advantages don't show up with an SSD (it originates from the
disk-era!). But wikidata isn't one of those majority cases.

The tops of B+trees currently being worked on should naturally end up
cached from the filing system in the OS filing system cache in RAM. As
mapped byte buffers it is as fast, or faster, than heap RAM.

Related thought:

I wonder if we can create wikidata databases once then publish the
database. A database can be published as a compressed zip file of the
directory and the compression ratio is quite high. Even so, working
with large files is still going to be non-trivial and we'd need
somewhere to put them that can also supply the bandwidth.

(Also - HDT maybe - don't know how that performs on read at this scale)

      Andy

On 12/09/2021 20:12, Cristóbal Miranda wrote:
SSD. First phase was 50-90k triples per second until 3B triples
where it started going down from 50k to 20k per second (took 3 days).
SPO => SPO->POS, SPO->OSP phase was 25-50k per second
until 1B where it went from 25k to 4k triples per second,
currently at 3.7B triples.



On Sun, 12 Sept 2021 at 04:59, Laura Morales <[email protected]> wrote:

Just a personal curiosity... are you building it on a SSD or HDD? What
is
your "triples loaded per second" rate?


Sent: Sunday, September 12, 2021 at 2:39 AM
From: "Cristóbal Miranda" <[email protected]>
To: [email protected]
Subject: Faster TDB2 build?

Hi,

I'm running tdb2.tdbloader on Wikidata, but it's
taking too long, now it's on day 11 and still indexing,
whereas tdbloader2 (for TDB) didn't take as much for me.
I was wondering if something could be done to allow
more space on RAM for the build phase in order to be faster,
for example passing a memory budget parameter to the
loader. Not sure exactly how the extra RAM space would be
used, but I was thinking that maybe if more b+tree blocks
were kept in RAM this processing would be faster, for
example keeping 2 upper levels of the tree in primary memory,
or even everything in there if the given budget allowed it.

What would it take to implement such a feature? Maybe in a
tdb2.tdbloader2? I was looking at the code for a way to do something
but couldn't find an easy modification to achieve this.




