Hello,

Thank you very much for all your exhaustive answers.

I redid my loading test today, changing my Java heap setting from 40GB to 4GB as advised, and I used tdbloader, since the thread here reported some promising numbers: http://markmail.org/message/npwvg65x77mgr7mr#query:+page:1+mid:2a23v4pi4pifcttd+state:results

The load took 15 minutes, and both phases took approximately equal time.

Thanks! I will let you know how the load of the whole Freebase dump goes once I have the correct data for it.

Ewa
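For reference, a minimal sketch of the kind of invocation described above (the database directory and input file name are placeholders, not values from this thread; JVM_ARGS is the environment variable the Jena command scripts read for JVM options):

    # Keep the Java heap modest: on a 64-bit JVM, TDB works through
    # memory-mapped files, so spare RAM is better left to the OS file cache.
    export JVM_ARGS="-Xmx4G"

    # Bulk load a gzip-compressed N-Triples slice into an empty TDB location.
    tdbloader --loc=/data/tdb/freebase freebase-slice.nt.gz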
2013/11/5 Rob Vesse <[email protected]>

> Hi
>
> Comments inline:
>
> On 04/11/2013 22:57, "Ewa Szwed" <[email protected]> wrote:
>
> >Hello,
> >
> >I am currently working on a project that loads a full Freebase dump
> >into a triple store.
> >
> >The whole Freebase dump is around 2 billion triples at the moment
> >(260 GB of uncompressed data).
> >
> >We chose to investigate Apache Jena TDB as the first product for this.
> >
> >I run Jena on a virtual machine with a Red Hat Linux distribution,
> >8 CPU cores, 64 GB RAM and a 1.2 TB hard drive.
> >
> >Which data loader would be recommended here (are the loaders tdbloader3
> >and tdbloader4 even of concern)? I have done my first test, loading 2.5%
> >of Freebase into Jena with tdbloader2, and it took 3.48 hours, which is
> >not very promising even if the import time scales linearly.
>
> tdbloader2 is generally the recommended one, though whether it gives you
> much advantage may depend on whether your OS sort command supports the
> --parallel option.
>
> >Is there a way to make the import parallel (run a few instances of the
> >loader at the same time against one Jena instance)?
>
> No. tdbloader2 will perform some parallelisation if your sort command
> supports --parallel as per above, but otherwise there is no
> parallelisation. tdbloader2 needs exclusive access to the disk location
> since it creates the data files from scratch, and more recent versions
> should refuse to attempt to write to a non-empty disk location.
>
> >Is there a way to tune the loader so that the data load is faster? (I
> >did not find any information on that.)
>
> See the recent thread on this for tips -
> http://markmail.org/message/npwvg65x77mgr7mr
>
> >I do not understand the idea of Jena indexing; the second phase of the
> >load - the one that is actually time consuming - is the index phase. Is
> >this indexing required at all for querying with SPARQL, or is it a
> >'full text search' type of indexing? I am wondering if I could maybe
> >skip this phase entirely.
>
> No, this is not full text indexing. TDB loading consists of two phases.
> The data phase involves reading in the raw data and dictionary encoding
> it, i.e. assigning a unique Node ID to each unique RDF node and building
> the mapping tables of RDF node -> TDB Node ID and TDB Node ID -> RDF node.
>
> The index phase builds the B+Tree indices that are needed to answer
> actual queries. In principle I believe you can build fewer indices
> (Andy - am I remembering this right?) but this isn't exposed via the
> command line and may have performance impacts later.
>
> >I am basically trying to think how I can make the import faster.
> >
> >And the last question:
> >
> >Would you recommend running the import with a compressed or an
> >uncompressed input file?
>
> Compressed input, since it will reduce disk IO, though if you have a
> fast disk, i.e. an SSD, this may make little or no difference.
>
> Rob
>
> >Regards,
> >
> >Ewa
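Pulling the suggestions above together, a rough sketch of a full load might look like the following (assuming GNU coreutils sort and the standard Jena TDB scripts on the PATH; the database location and file name are placeholders):

    # tdbloader2 relies on the system sort for its index phase; check
    # whether it supports --parallel before expecting any parallelism.
    sort --help | grep -- --parallel

    # tdbloader2 builds the database files from scratch, so point it at an
    # empty (or new) directory.
    mkdir -p /data/tdb/freebase

    # Compressed N-Triples input is fine; it is decompressed on the fly and
    # reduces disk IO during the data phase.
    tdbloader2 --loc /data/tdb/freebase freebase.nt.gz

    # Sanity check: count the loaded triples with a SPARQL query.
    tdbquery --loc /data/tdb/freebase 'SELECT (COUNT(*) AS ?count) WHERE { ?s ?p ?o }'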
