This is an interesting subject. What steps are involved, and how is the work partitioned, when loading data? I ask because, if everything goes well, I will be loading tens of billions of triples a day into a Redshift-backed SDB store. Loading Redshift with plain SQL INSERTs is inefficient, so I plan to prep the data destined for the temp tables as flat files written to S3, then use the COPY command for a distributed load. From there the standard INSERT INTO ... SELECT pattern can be used for the Nodes and Triples/Quads tables. I want to leverage all the existing code that processes the triples, but instead of inserting directly, write batches to S3. S3 really likes 32-64MB chunks.
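Roughly, something like the sketch below is what I have in mind. It assumes the AWS SDK v1 S3 client, the Redshift JDBC driver on the classpath, and placeholder bucket, role, table, and column names (nodes_stage, Nodes, etc. are illustrative, not necessarily SDB's actual schema):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

/*
 * Sketch of the staged-load pattern: buffer prepared rows, flush ~48MB
 * objects to S3, then COPY into a temp table and INSERT INTO ... SELECT
 * into the permanent table.
 */
public class S3StagedLoader {
    // Mid-point of the 32-64MB range; char count approximates bytes for ASCII rows.
    private static final int FLUSH_SIZE = 48 * 1024 * 1024;

    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    private final String bucket;
    private final String prefix;
    private final StringBuilder buffer = new StringBuilder();
    private int part = 0;

    public S3StagedLoader(String bucket, String prefix) {
        this.bucket = bucket;
        this.prefix = prefix;
    }

    /* Append one pipe-delimited row destined for the staging table. */
    public void add(String row) {
        buffer.append(row).append('\n');
        if (buffer.length() >= FLUSH_SIZE)
            flush();
    }

    /* Write the current buffer as one S3 object (one COPY input chunk). */
    public void flush() {
        if (buffer.length() == 0)
            return;
        String key = String.format("%s/part-%05d", prefix, part++);
        s3.putObject(bucket, key, buffer.toString());
        buffer.setLength(0);
    }

    /* Run the distributed COPY, then the standard INSERT INTO ... SELECT. */
    public void load(Connection conn) throws Exception {
        try (Statement st = conn.createStatement()) {
            // Redshift parallelises the COPY across slices, one object per slice.
            st.execute("COPY nodes_stage FROM 's3://" + bucket + "/" + prefix + "/' "
                    + "CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/CopyRole' "
                    + "DELIMITER '|'");
            // Move only new rows into the permanent Nodes table.
            st.execute("INSERT INTO Nodes (hash, lex, lang, datatype, type) "
                    + "SELECT s.hash, s.lex, s.lang, s.datatype, s.type "
                    + "FROM nodes_stage s LEFT JOIN Nodes n ON n.hash = s.hash "
                    + "WHERE n.hash IS NULL");
        }
    }

    public static void main(String[] args) throws Exception {
        S3StagedLoader loader = new S3StagedLoader("my-sdb-bucket", "batches/nodes");
        loader.add("12345|\"example\"|||2");   // illustrative row only
        loader.flush();
        try (Connection conn = DriverManager.getConnection(
                "jdbc:redshift://example.cluster:5439/sdb", "user", "password")) {
            loader.load(conn);
        }
    }
}

The flush threshold and the duplicate-skipping join are just one way to do it; the same COPY-then-INSERT...SELECT step would repeat for the Triples/Quads tables.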
I will read the code, but if anybody has thoughts on the pattern, or some hard-won tips, it would be much appreciated.

Dom

On Tue, Feb 26, 2013 at 10:23 AM, Aaron Coburn <[email protected]> wrote:
> I recently had a need to load ~225M triples into a TDB triplestore, and
> when allocating only ~12G to the triple loader, I experienced the very same
> slowdowns you described. As an alternative, I just reserved an on-demand,
> high memory (i.e. ~60GB) instance in the public cloud, and the processing
> completed in only a few hours. I then just moved the files onto my local
> server and proceeded from there.
>
> Aaron Coburn
>
>
> On Feb 25, 2013, at 1:25 PM, Andy Seaborne <[email protected]> wrote:
>
> > On 25/02/13 20:07, Joshua Greben wrote:
> >> Hello All,
> >>
> >> I am new to this list and to Jena and was wondering if anyone could
> >> offer advice for loading a large triplestore.
> >>
> >> I am trying to load 670M Ntriples into a store using tdbloader on a
> >> single machine with 64-bit hardware and 8GB of memory. However, I am
> >> running into a massive slowdown. When the load starts the tdbloader
> >> is processing around 30K tps but by the time it has loaded 130M
> >> triples it can essentially no longer load any more and slows down to
> >> 2300 tps. At that point I have to kill the process because it will
> >> basically never finish.
> >>
> >> Is 8GB of memory enough or is there a more efficient way to load this
> >> data? I am trying to load the data into a single DB location. Should
> >> I be splitting up the triples and loading them into different DBs?
> >>
> >> Advice from anyone who has experience successfully loading a large
> >> triplestore is much appreciated.
> >
> > Only 8G is pushing it somewhat for 670M triples. It will finish; it
> > will take a very long time. Faster loads have been reported by using a
> > larger machine (e.g. Freebase in 8 hours on a IBM Power7 and 48G RAM).
> >
> > tdbloader2 (Linux only) may get you there a bit quicker but really you
> > need a bigger machine.
> >
> > Once built, you can copy the dataset as files to other machines.
> >
> > Andy
> >
> >> Thanks!
> >>
> >> - Josh
> >>
> >> Joshua Greben
> >> Library Systems Programmer & Analyst
> >> Stanford University Libraries
> >> (650) 714-1937
> >> [email protected]
