Hmm, I think this answers my last email. Thank you, Andy. So I think the final 
answer for preventing this huge slowdown is to use one of:

- average disks (SATA is OK) but lots of RAM to keep the whole node table in 
memory. In the case of Wikidata this would mean at least 18GB, right? That 
seems to be the size of the node2id file

- high-performance disks (SSD/M.2/U.2) if little RAM is available, in order to 
speed up the memory-mapped reads

- or the best of both worlds (lots of RAM and fast disks) if you have the money. 
One relevant note, however: the Threadripper with 32GB RAM and 3x M.2 disks 
"only" averaged 175K TPS. On the other hand, I guess this setup would be ideal 
for loading multiple stores in parallel (Mosaic), considering that Dick got 
400K TPS on 5400rpm notebook disks
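
A rough way to check the first option is to compare the size of the node table index file with the machine's physical RAM. The sketch below assumes Linux (it relies on `os.sysconf`). The path passed to `check` and the 25% headroom figure are placeholders of mine, not anything TDB prescribes.

```python
import os


def physical_ram_bytes():
    """Total physical RAM in bytes (Linux/Unix only, via sysconf)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def fits_in_ram(file_bytes, ram_bytes, headroom=0.25):
    """True if the file fits in RAM with `headroom` left for everything else.

    `headroom` is an assumed safety margin, not a TDB setting.
    """
    return file_bytes <= ram_bytes * (1.0 - headroom)


def check(path):
    """Report whether the file at `path` (hypothetical) could stay cached."""
    size = os.path.getsize(path)
    ram = physical_ram_bytes()
    verdict = "fits" if fits_in_ram(size, ram) else "does NOT fit"
    print(f"{path}: {size / 2**30:.1f} GiB, RAM {ram / 2**30:.1f} GiB -> {verdict}")
```

With an 18GB file and 32GB of RAM, `fits_in_ram(18 * 2**30, 32 * 2**30)` is true under the assumed headroom; with 16GB of RAM it is false.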
 
 

Sent: Tuesday, December 12, 2017 at 11:54 PM
From: "Andy Seaborne" <[email protected]>
To: [email protected]
Subject: Re: Report on loading wikidata

On 12/12/17 21:06, Laura Morales wrote:
> 2) from my tests, tdbloader2 starts by parsing triples rather quickly (130K 
> TPS) but then it quickly slows down *a lot* over time,

That's memory.

When the node table index exceeds RAM, updating slows down because disk
I/O happens on what used to be RAM access to check whether a node has
been seen before.

Creating the node table index may be amenable to the same approach as
index building, caveat details.

> And I'm not convinced it's a problem of disk cache either, because I tried to 
> flush it several times

Does not help - it's a read workload.

(It is a memory mapped file)

> (1MB/s writes!!!)

Presumably because random-pattern writes are occurring as pages are
flushed. The entries are keyed by a large hash, hence have a random
pattern.
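
To illustrate the point about hash-keyed entries, here is a minimal sketch (not TDB's actual code). It compares how many pages are touched when 1000 entries are appended in order versus placed by a hash of the key. MD5 is only a stand-in for "a large hash", and the page count, entries per page, and example URIs are made-up values.

```python
import hashlib

ENTRIES_PER_PAGE = 100   # assumed, for illustration only
N_PAGES = 100_000        # assumed size of the index, in pages


def hashed_page(key):
    """Page an entry lands on when its position follows a hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % N_PAGES


def sequential_page(i):
    """Page an entry lands on when entries are simply appended in order."""
    return i // ENTRIES_PER_PAGE


# Hypothetical node URIs, generated in a perfectly regular order.
keys = [f"http://example.org/node/{i}" for i in range(1000)]
hashed = {hashed_page(k) for k in keys}
sequential = {sequential_page(i) for i in range(len(keys))}

print(len(sequential), "pages touched when appending")
print(len(hashed), "pages touched when placed by hash")
```

Appending fills a handful of adjacent pages, while almost every hashed entry lands on a different page, so flushing those dirty pages turns into scattered small writes.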

Andy
