Re: Report on loading wikidata

Dick Murray Tue, 12 Dec 2017 02:06:40 -0800

Similar here.

I hacked (i.e. no checking/setup/params) the data/index scripts to create
s, p, o folders soft linked to three separate devices, moved in the
respective .dat and .idn files, hard linked back to data-triples.tmp,
and ran the three triple indexes in parallel. sort was set to parallel 8
and an 8GB buffer. It built the three indexes in the time taken to build one.
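
A minimal sketch of the relocation step described above, for illustration only; it is not the script mentioned in this message. It assumes the three triple indexes are named SPO, POS and OSP, that each has a .dat and a .idn file, and that one index goes to each device. The function name relocate_indexes and the directory arguments are hypothetical.

```python
import os
import shutil
from pathlib import Path

# Assumption: index names and file suffixes; adjust to the actual database layout.
INDEXES = ("SPO", "POS", "OSP")
SUFFIXES = (".dat", ".idn")


def relocate_indexes(db_dir, device_dirs):
    """Move each index's files to its own device directory and leave
    symbolic links behind in the database directory.

    Returns a dict mapping each moved file name to its new location.
    """
    db_dir = Path(db_dir)
    moved = {}
    for index, device in zip(INDEXES, device_dirs):
        target_dir = Path(device)
        target_dir.mkdir(parents=True, exist_ok=True)
        for suffix in SUFFIXES:
            src = db_dir / (index + suffix)
            if src.is_symlink() or not src.exists():
                continue  # already relocated, or not created yet
            dst = target_dir / src.name
            shutil.move(str(src), str(dst))
            os.symlink(dst, src)  # the database directory now points at the device
            moved[src.name] = dst
    return moved
```

Calling relocate_indexes("/path/to/db", ["/mnt/a/s", "/mnt/b/p", "/mnt/c/o"]) after the data phase and before the index phase would match the order of steps described here; the paths are placeholders.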


As an aside, there are duplicate entries in the data-triples.tmp file. Is
this by design? If you run sort data-triples.tmp | uniq, the output is a
smaller file, and I've checked visually that there are duplicate entries...
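
A small sketch of the same duplicate check, for illustration. It treats each line as an opaque string and makes no assumption about the record format of data-triples.tmp. It counts in memory, so it only suits a sample of the file; for the full file the external sort | uniq route above is the practical one. The function name duplicate_lines is hypothetical.

```python
from collections import Counter


def duplicate_lines(lines):
    """Return a dict of line -> count for every line seen more than once."""
    counts = Counter(line.rstrip("\n") for line in lines)
    return {line: n for line, n in counts.items() if n > 1}
```

Passing an open file object (or the first few million lines of one) as lines gives the duplicated entries and how often each occurs.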

I'll tidy the script and make it available if anyone wants to perform a
tweaked load; it is only really useful for large datasets.

On 11 December 2017 at 15:32, Andy Seaborne <[email protected]> wrote:

> This is for the large amount of temporary space that tdbloader2 uses?
>
> I got "latest-all" to load but I had to do some things with tdbloader2 to
> make it work with a compressed data-triples.tmp.gz and also have sort write
> compressed temporary files (I messed up a bit and set the gzip compression
> too high so it slowed things down).
>
> There are some small problems with tdbloader2 with complex --sort-args (it
> only handles one single arg/value correctly).  My main trick was to put in
> a script for "sort" that had the required settings built-in. I wanted to
> set --compress, -T and the buffer size.
>
> On 10/12/17 21:18, Dick Murray wrote:
>
>> Ryzen 1920X 3.5GHz, 32GB DDR4 quad channel, 3 x M.2 Samsung 960 EVO,
>> 172K/sec 3h45m for truthy.
>>
>> Is it possible to split the index files into separate folders?
>>
>
> Not built-in.  Symbolic links will work.
>
> I'm keen on symbolic links here because built-in support would be hard to
> keep all cases covered.
>
>
>> Or sym link the files, if I run the data phase, sym link, then run the
>> index phase?
>>
>
> Symbolic links will work.
>
> "sort" can be configured to use a temporary folder as well.
>
> The only place symbolic links will not work is for data-triples.tmp. It
> must not exist at all - we ought to change that to make it OK to have a
> zero-length file in place so it can be redirected ahead of time.
>
>     Andy
>
>
>
>> Point me in the right direction and I'll extend the TDB file open code.
>>
>> Dick
>>
>>
>> On 7 Dec 2017 22:21, "Andy Seaborne" <[email protected]> wrote:
>>
>>
>>
>> On 07/12/17 19:01, Laura Morales wrote:
>>
>> Thank you a lot Andy, very informative (special thanks for specifying the
>>> hardware).
>>> For anybody reading this, I'd like to highlight the fact that the data
>>> source is "latest-truthy" and not "latest-all".
>>>  From what I understand, truthy leaves out a lot of data (50% ??) and
>>> "all"
>>> is more than 4 billion triples.
>>>
>>>
>> 4,787,194,669 Triples
>>
>> Dick reported figures for truthy as well.
>>
>> I used a *16G* machine, and it is a portable with all its memory
>> architecture tradeoffs.
>>
>> "all" is running ATM - it will be much slower due to RAM needs of
>> tdbloader2 for the data phase.  Not sure the figures will mean anything
>> for
>> you.
>>
>> I'd need a machine with (guess) 32G RAM which is still a small server
>> these
>> days.
>>
>> (A similar tree builder technique could be applied to the node index and
>> reduce the max RAM needs but - hey, ho - that's free software for you.)
>>
>>      Andy
>>
>>
