mmm I think this answers my last email. Thank you Andy. So I think the final answer for preventing this huge slowdown is using one of:
- average disks (SATA is OK) but lots of RAM, to keep the whole node table in memory. In the case of Wikidata this would mean at least 18GB, right? Since this seems to be the size of the node2id file
- high performance disks (SSD/M2/U2) if little RAM is available, in order to speed up the memory mapped reads
- or the best of both worlds (lots of RAM and fast disks) if you've got the money.

One relevant note, however, is that the Threadripper with 32GB RAM and 3x M2 disks "only" averaged 175K TPS. This setup, on the other hand, would I guess be ideal for loading multiple stores in parallel (Mosaic), considering that Dick got 400K TPS on 5400rpm notebook disks.

Sent: Tuesday, December 12, 2017 at 11:54 PM
From: "Andy Seaborne" <[email protected]>
To: [email protected]
Subject: Re: Report on loading wikidata

On 12/12/17 21:06, Laura Morales wrote:
> 2) from my tests, tdbloader2 starts by parsing triples rather quickly (130K
> TPS) but then it quickly slows down *a lot* over time,

That's memory. When the node table index exceeds RAM, updating slows down because disk I/O happens on what used to be RAM access to check whether a node has been seen before.

Creating the node table index may be amenable to the same approach as index building, caveat details.

> And I'm not convinced it's a problem of disk cache either, because I tried to
> flush it several times

Does not help - it's a read workload. (It is a memory mapped file.)

> (1MB/s writes!!!)

Presumably because random-pattern writes are occurring as pages are flushed. The entries are keyed by a large hash, hence have a random pattern.

Andy
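
To make the "keyed by a large hash, hence random pattern" point concrete, here is a minimal sketch, not TDB's actual code. It compares how many distinct pages get touched when 10,000 entries are placed by hash versus appended in order. The page size, entry size, total page count, and the use of MD5 are all assumptions chosen for illustration, and the example.org URIs are made up.

```python
import hashlib

# All sizes below are illustrative assumptions, not TDB's real layout.
PAGE_SIZE = 8 * 1024                       # assumed bytes per page
ENTRY_SIZE = 32                            # assumed bytes per index entry
ENTRIES_PER_PAGE = PAGE_SIZE // ENTRY_SIZE  # 256 entries per page
TOTAL_PAGES = 100_000                      # assumed size of the index file, in pages


def page_of_hashed(term: str, total_pages: int = TOTAL_PAGES) -> int:
    """Page that a hash-keyed entry lands on (MD5 is only a stand-in hash)."""
    digest = hashlib.md5(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % total_pages


def page_of_sequential(position: int) -> int:
    """Page that the n-th entry lands on when entries are simply appended."""
    return position // ENTRIES_PER_PAGE


def distinct_pages(pages) -> int:
    """Number of different pages touched by a batch of accesses."""
    return len(set(pages))


# Hypothetical node terms, standing in for the URIs seen while loading.
terms = [f"http://example.org/entity/Q{i}" for i in range(10_000)]

hashed = [page_of_hashed(t) for t in terms]
sequential = [page_of_sequential(i) for i in range(len(terms))]

print("pages touched, hash-keyed:", distinct_pages(hashed))
print("pages touched, sequential:", distinct_pages(sequential))
```

With these assumed numbers the sequential layout touches only a few dozen pages, while the hash-keyed layout touches almost one page per entry. Once the file no longer fits in RAM, each of those pages is a potential disk read (and later a scattered write when it is flushed), which would be consistent with the slowdown and the low write throughput described above.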
