Re: Report on loading wikidata

Andy Seaborne Mon, 11 Dec 2017 07:33:11 -0800

This is for the large amount of temporary space that tdbloader2 uses?

I got "latest-all" to load, but I had to make tdbloader2 work with a compressed data-triples.tmp.gz and also have sort write compressed temporary files (I messed up a bit and set the gzip compression too high, so it slowed things down).

There are some small problems with tdbloader2 with complex --sort-args (it only handles one single arg/value correctly). My main trick was to put in a script for "sort" that had the required settings built-in. I wanted to set --compress, -T and the buffer size.
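The "sort" wrapper trick could look roughly like the sketch below. This is not the script actually used for the load; it is a guess at the idea, assuming GNU sort (--compress-program, -T and -S are its options). The directory /tmp/sort-wrapper and the variables WRAPPER_DIR, SORT_TMPDIR and SORT_BUFFER are made up for illustration.

```shell
# Sketch only: generate a "sort" wrapper with the settings built in.
# WRAPPER_DIR, SORT_TMPDIR and SORT_BUFFER are hypothetical names.
wrapper_dir="${WRAPPER_DIR:-/tmp/sort-wrapper}"

# Resolve the real sort first, before the wrapper directory is on PATH,
# so the wrapper cannot end up calling itself.
real_sort="$(command -v sort)"

mkdir -p "$wrapper_dir"
cat > "$wrapper_dir/sort" <<EOF
#!/bin/sh
# Compress temporary files, choose the temporary directory, set the buffer size,
# then pass through whatever arguments the caller supplied.
exec "$real_sort" --compress-program=gzip -T "\${SORT_TMPDIR:-/tmp}" -S "\${SORT_BUFFER:-1G}" "\$@"
EOF
chmod +x "$wrapper_dir/sort"

# Possible use (not run here): put the wrapper first on PATH for the load only.
#   PATH="$wrapper_dir:$PATH" SORT_TMPDIR=/big/disk/tmp tdbloader2 --loc DB latest-all.nt.gz
```

GNU sort runs the compression program with no arguments, so picking a lower gzip level would need its own small wrapper that also handles the -d (decompress) call.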


On 10/12/17 21:18, Dick Murray wrote:

Ryzen 1920X 3.5GHz, 32GB DDR4 quad channel, 3 x M.2 Samsung 960 EVO,
172K/sec 3h45m for truthy.

Is it possible to split the index files into separate folders?


Not built-in.  Symbolic links will work.

I'm keen on symbolic links here because built-in support would make it hard to keep all cases covered.


Or sym link the files, if I run the data phase, sym link, then run the
index phase?


Symbolic links will work.

"sort" can be configured to use a temporary folder as well.

The only place symbolic links will not work is for data-triples.tmp. It must not exist at all - we ought to change that to make it OK to have a zero-length file in place so it can be redirected ahead of time.
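For a file that already exists in the database directory, the move-and-link step can be sketched as below. The function name relocate_with_symlink and the paths in the usage comment are hypothetical; which files to relocate, and at which point between the phases, is not spelled out in this thread. Per the above, data-triples.tmp is not a candidate.

```shell
# Sketch only: move FILE to TARGET_DIR and leave a symbolic link in its place.
# Usage: relocate_with_symlink FILE TARGET_DIR   (TARGET_DIR must be absolute)
relocate_with_symlink() {
  rws_src="$1"
  rws_dest_dir="$2"

  # An absolute target keeps the link valid wherever it is resolved from.
  case "$rws_dest_dir" in
    /*) ;;
    *) echo "target directory must be an absolute path" >&2; return 1 ;;
  esac

  mkdir -p "$rws_dest_dir" || return 1
  mv "$rws_src" "$rws_dest_dir/" || return 1
  ln -s "$rws_dest_dir/$(basename "$rws_src")" "$rws_src"
}

# Hypothetical use, with placeholder names:
#   relocate_with_symlink DB/some-index-file /mnt/disk2/DB
```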


    Andy


Point me in the right direction and I'll extend the TDB file open code.

Dick


On 7 Dec 2017 22:21, "Andy Seaborne" <[email protected]> wrote:



On 07/12/17 19:01, Laura Morales wrote:

Thank you a lot Andy, very informative (special thanks for specifying the
hardware).
For anybody reading this, I'd like to highlight the fact that the data
source is "latest-truthy" and not "latest-all".
From what I understand, truthy leaves out a lot of data (50% ??) and "all"
is more than 4 billion triples.


4,787,194,669 Triples

Dick reported figures for truthy as well.

I used a *16G* machine, and it is a portable with all its memory
architecture tradeoffs.

"all" is running ATM - it will be much slower due to RAM needs of
tdbloader2 for the data phase.  Not sure the figures will mean anything for
you.

I'd need a machine with (guess) 32G RAM which is still a small server these
days.

(A similar tree builder technique could be applied to the node index and
reduce the max RAM needs but - hey, ho - that's free software for you.)

     Andy
