Since it's N-Triples, and therefore one triple per line, why not use Unix utilities (e.g. 'split') to divide it into lots of smaller chunks and do a series of tdbloader loads? It should be fairly straightforward to script in bash or another scripting language of your choice. That should have a lower memory requirement and so avoid the massive slowdown. Or am I missing something?
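
Something along these lines, as a rough sketch only (the file name data.nt, the chunk size, and the ./DB directory are placeholders to adapt):

    # N-Triples is line-oriented, so split by line count
    # (here ~10M triples per chunk; names below are examples only):
    split -l 10000000 data.nt chunk_

    # Load each chunk into the same TDB location in turn:
    for f in chunk_*; do
        tdbloader --loc=./DB "$f"
    done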
Bill

On 26 Feb 2013, at 18:23, Aaron Coburn <[email protected]> wrote:

> I recently had a need to load ~225M triples into a TDB triplestore, and when
> allocating only ~12G to the triple loader, I experienced the very same
> slowdowns you described. As an alternative, I just reserved an on-demand,
> high memory (i.e. ~60GB) instance in the public cloud, and the processing
> completed in only a few hours. I then just moved the files onto my local
> server and proceeded from there.
>
> Aaron Coburn
>
>
> On Feb 25, 2013, at 1:25 PM, Andy Seaborne <[email protected]> wrote:
>
>> On 25/02/13 20:07, Joshua Greben wrote:
>>> Hello All,
>>>
>>> I am new to this list and to Jena and was wondering if anyone could
>>> offer advice for loading a large triplestore.
>>>
>>> I am trying to load 670M Ntriples into a store using tdbloader on a
>>> single machine with 64-bit hardware and 8GB of memory. However, I am
>>> running into a massive slowdown. When the load starts the tdbloader
>>> is processing around 30K tps but by the time it has loaded 130M
>>> triples it can essentially no longer load any more and slows down to
>>> 2300 tps. At that point I have to kill the process because it will
>>> basically never finish.
>>>
>>> Is 8GB of memory enough or is there a more efficient way to load this
>>> data? I am trying to load the data into a single DB location. Should
>>> I be splitting up the triples and loading them into different DBs?
>>>
>>> Advice from anyone who has experience successfully loading a large
>>> triplestore is much appreciated.
>>
>> Only 8G is pushing it somewhat for 670M triples. It will finish; it will
>> take a very long time. Faster loads have been reported by using a larger
>> machine (e.g. Freebase in 8 hours on an IBM Power7 and 48G RAM).
>>
>> tdbloader2 (Linux only) may get you there a bit quicker but really you need
>> a bigger machine.
>>
>> Once built, you can copy the dataset as files to other machines.
>>
>> Andy
>>
>>>
>>> Thanks!
>>>
>>> - Josh
>>>
>>>
>>>
>>> Joshua Greben
>>> Library Systems Programmer & Analyst
>>> Stanford University Libraries
>>> (650) 714-1937
>>> [email protected]
>>>
>>>
>>
>
