Similar here. I hacked (i.e. no checking/setup/params) the data/index scripts to create s, p, o folders on three separate soft-linked devices, moved in the respective .dat and .idn files, hard linked back to the data-triples.tmp, and ran the three triple indexes in parallel. sort was set to 8 parallel threads and an 8GB buffer. It built the three indexes in the time taken to build one.
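For anyone who wants the shape of it before I tidy the real script, here is a rough Python sketch of the idea only, not the tdbloader2 code. It assumes GNU sort (--parallel, --buffer-size, -T and -o are real GNU sort options); the mount points, the sorted.tmp file name and the helper names are made up for illustration, and the sort key options the index scripts pass per index are left out.

```python
import shlex
import subprocess
from pathlib import Path

# Hypothetical mount points: one folder per index, each on its own device.
INDEX_DIRS = {
    "s": Path("/mnt/dev1/s"),
    "p": Path("/mnt/dev2/p"),
    "o": Path("/mnt/dev3/o"),
}


def sort_command(src, tmp_dir, out, parallel=8, buffer="8G"):
    """Build (but do not run) a GNU sort argv.

    The per-index sort keys are deliberately omitted; this only shows
    the parallelism, buffer size and temporary folder settings.
    """
    return [
        "sort",
        f"--parallel={parallel}",
        f"--buffer-size={buffer}",
        "-T", str(tmp_dir),   # keep sort's temporary files on the index's own device
        "-o", str(out),
        str(src),
    ]


def build_commands(src):
    """One sort command per index folder (output file name is an assumption)."""
    return [
        sort_command(src, folder, folder / "sorted.tmp")
        for folder in INDEX_DIRS.values()
    ]


def run_parallel(commands):
    """Start every command at once, then wait for all; returns the exit codes."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [proc.wait() for proc in procs]


if __name__ == "__main__":
    # Print the commands instead of running them, since the paths are placeholders.
    for cmd in build_commands("data-triples.tmp"):
        print(shlex.join(cmd))
```

Passing the result of build_commands() to run_parallel() would start the three sorts together, which is where the saving comes from when each index folder sits on its own device.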
As an aside, there are duplicate entries in the data-triples.tmp file. Is this by design? If you sort data-triples.tmp | uniq > it returns a smaller file, and I've checked visually and there are duplicate entries...

I'll tidy the script and make it available if anyone wants to perform a tweaked load; it's only really useful for large datasets.

On 11 December 2017 at 15:32, Andy Seaborne <[email protected]> wrote:

> This is for the large amount of temporary space that tdbloader2 uses?
>
> I got "latest-all" to load but I had to do some things with tdbloader2 to
> work with a compressed data-triples.tmp.gz and also have sort write
> compressed temporary files (I messed up a bit and set the gzip compression
> too high so it slowed things down).
>
> There are some small problems with tdbloader2 with complex --sort-args (it
> only handles one single arg/value correctly). My main trick was to put in
> a script for "sort" that had the required settings built-in. I wanted to
> set --compress, -T and the buffer size.
>
> On 10/12/17 21:18, Dick Murray wrote:
>
>> Ryzen 1920X 3.5GHz, 32GB DDR4 quad channel, 3 x M.2 Samsung 960 EVO,
>> 172K/sec 3h45m for truthy.
>>
>> Is it possible to split the index files into separate folders?
>>
>
> Not built-in. Symbolic links will work.
>
> I'm keen on symbolic links here because built-in support would be hard to
> keep all cases covered.
>
>
>> Or sym link the files, if I run the data phase, sym link, then run the
>> index phase?
>>
>
> Symbolic links will work.
>
> "sort" can be configured to use a temporary folder as well.
>
> The only place symbolic links will not work is for data-triples.tmp. It
> must not exist at all - we ought to change that to make it OK to have a
> zero-length file in place so it can be redirected ahead of time.
>
> Andy
>
>
>
>> Point me in the right direction and I'll extend the TDB file open code.
>>
>> Dick
>>
>>
>> On 7 Dec 2017 22:21, "Andy Seaborne" <[email protected]> wrote:
>>
>>
>>
>> On 07/12/17 19:01, Laura Morales wrote:
>>
>> Thank you a lot Andy, very informative (special thanks for specifying the
>>> hardware).
>>> For anybody reading this, I'd like to highlight the fact that the data
>>> source is "latest-truthy" and not "latest-all".
>>> From what I understand, truthy leaves out a lot of data (50% ??) and
>>> "all" is more than 4 billion triples.
>>>
>>>
>> 4,787,194,669 Triples
>>
>> Dick reported figures for truthy as well.
>>
>> I used a *16G* machine, and it is a portable with all its memory
>> architecture tradeoffs.
>>
>> "all" is running ATM - it will be much slower due to RAM needs of
>> tdbloader2 for the data phase. Not sure the figures will mean anything
>> for you.
>>
>> I'd need a machine with (guess) 32G RAM which is still a small server
>> these days.
>>
>> (A similar tree builder technique could be applied to the node index and
>> reduce the max RAM needs but - hey, ho - that's free software for you.)
>>
>> Andy
