On 26/02/13 19:00, Bill Roberts wrote:
Since it's N-triples and so one triple per line, why not use unix
utilities (eg 'split') to divide it into lots of smaller chunks and
do a series of tdbloader uploads.  Should be fairly straightforward
to script in bash or other scripting language of your choice.  That
should have a lower memory requirement and so avoid the massive
slowdown.  Or am I missing something?
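
A minimal, untested sketch of what I mean (chunk size, file names
and the database location are placeholders):

    # Split the N-triples file into 10M-line chunks, then load each
    # chunk with a separate tdbloader run into the same database.
    split -l 10000000 data.nt chunk_
    for f in chunk_*; do
        tdbloader --loc=/data/tdb "$f"
    done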

Maybe, but: it can avoid the situation where the machine starts
swapping processes (which happens when it is not configured to
balance the memory-mapped files against the running applications),
but the effect just moves elsewhere.

But loading into a dataset that is not empty is not optimized.

Both tdbloader and tdbloader2 treat an empty dataset differently and directly manipulate either the indexes or the files.

Restarting the loader for each new batch will

(1) not use optimized loading

(2) not use any part of the data structures that might have been cached.

The problem at large scale with relatively small RAM is that the effective cache is small; the lack of caching behaves much like swapping.
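
To illustrate the balancing point: with the standard Jena scripts
(which read the JVM_ARGS environment variable) you keep the JVM
heap modest so the OS page cache can hold the memory-mapped index
files.  The figure below is illustrative, not a recommendation:

    # Small-ish heap for the loader; leave the rest of RAM to the
    # OS page cache for TDB's memory-mapped files.
    JVM_ARGS="-Xmx2g" tdbloader --loc=/data/tdb data.nt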

        Andy


Bill

On 26 Feb 2013, at 18:23, Aaron Coburn <[email protected]> wrote:

I recently had a need to load ~225M triples into a TDB triplestore,
and when allocating only ~12GB to the triple loader, I experienced
the very same slowdowns you described. As an alternative, I just
reserved an on-demand, high memory (i.e. ~60GB) instance in the
public cloud, and the processing completed in only a few hours. I
then just moved the files onto my local server and proceeded from
there.
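
The "move" was nothing fancier than copying the TDB database
directory; roughly this (host names and paths are made up):

    # Load on the big cloud instance ...
    ssh bigbox 'tdbloader --loc=/mnt/tdb dump.nt'
    # ... then copy the finished database files to the local server.
    rsync -av bigbox:/mnt/tdb/ /var/lib/tdb/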

Aaron Coburn


On Feb 25, 2013, at 1:25 PM, Andy Seaborne <[email protected]>
wrote:

On 25/02/13 20:07, Joshua Greben wrote:
Hello All,

I am new to this list and to Jena and was wondering if anyone
could offer advice for loading a large triplestore.

I am trying to load 670M N-triples into a store using tdbloader
on a single machine with 64-bit hardware and 8GB of memory.
However, I am running into a massive slowdown. When the load
starts the tdbloader is processing around 30K tps but by the
time it has loaded 130M triples it can essentially no longer
load any more and slows down to 2300 tps. At that point I have
to kill the process because it will basically never finish.

Is 8GB of memory enough or is there a more efficient way to
load this data? I am trying to load the data into a single DB
location. Should I be splitting up the triples and loading them
into different DBs?

Advice from anyone who has experience successfully loading a
large triplestore is much appreciated.

Only 8G is pushing it somewhat for 670M triples.  It will finish;
it will just take a very long time.  Faster loads have been
reported on larger machines (e.g. Freebase in 8 hours on an IBM
Power7 with 48G RAM).

tdbloader2 (Linux only) may get you there a bit quicker but
really you need a bigger machine.
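
It takes the same shape of invocation as tdbloader; the paths
below are placeholders, and note it must target a fresh, empty
location:

    # tdbloader2 builds the indexes from scratch (empty DB only).
    tdbloader2 --loc /data/tdb dump.nt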

Once built, you can copy the dataset as files to other machines.

Andy


Thanks!

- Josh



Joshua Greben
Library Systems Programmer & Analyst
Stanford University Libraries
(650) 714-1937
[email protected]
