Hi

Comments inline:

On 04/11/2013 22:57, "Ewa Szwed" <[email protected]> wrote:

>Hello,
>
>I am currently working on a project that loads a full Freebase dump into
>a triple store.
>
>The whole Freebase dump is around 2 billion triples at the moment (260 GB
>of uncompressed data).
>
>We chose to investigate Apache Jena TDB as a first product for this.
>
>I run Jena on a virtual machine with a Red Hat Linux distribution, 8 CPU
>cores, 64 GB RAM and a 1.2 TB hard drive.
>
>Which data loader would be recommended here (are the loaders tdbloader3
>and tdbloader4 even of concern)? I have done a first test loading 2.5% of
>Freebase into Jena with tdbloader2 and it took 3.48 hours, which is not
>very promising even if the import time scales linearly.

tdbloader2 is generally the recommended loader, though whether it gives
you much advantage may depend on whether your OS sort command supports
the --parallel option.
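
A quick way to check is to ask sort directly; exact behaviour varies by
platform but GNU coreutils 8.6 or later has the option:

  sort --version              # look for GNU coreutils >= 8.6
  echo | sort --parallel=4    # fails with "unrecognized option" if unsupported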

>
>Is there a way to make the import parallel (run a few instances of the
>loader at the same time against one Jena instance)?

No. tdbloader2 will perform some parallelisation if your sort command
supports --parallel, as noted above, but otherwise there is no
parallelisation.  tdbloader2 needs exclusive access to the disk location
since it creates the data files from scratch, and more recent versions
should refuse to write to a non-empty disk location.
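
For illustration, an invocation looks something like this (paths are made
up - adjust to your setup):

  mkdir /data/freebase-tdb                   # fresh, empty location
  tdbloader2 --loc /data/freebase-tdb freebase-dump.nt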

>
>Is there a way to tune the loader so that the data load is faster (I did
>not find any information on that)?

See the recent thread on this for tips -
http://markmail.org/message/npwvg65x77mgr7mr

>
>I do not understand the idea of Jena indexing; the second phase of the
>load - the one that is actually time consuming - is the index phase. Is
>this indexing required at all for querying with SPARQL, or is it 'full
>text search' type indexing? I am wondering if I could skip this phase
>entirely.

No, this is not full-text indexing.  TDB loading consists of two phases:
the data phase involves reading in the raw data and dictionary encoding
it, i.e. assigning a unique Node ID to each unique RDF node and building
the mapping tables of RDF node -> TDB Node ID and TDB Node ID -> RDF node.
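
If it helps, here is a toy sketch of the dictionary encoding idea in plain
Java - purely illustrative of the concept, not TDB's actual implementation
(TDB keeps these tables on disk, not in memory):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DictionaryEncoding {
    // each distinct RDF term is assigned a numeric ID once;
    // triples are then stored as three IDs rather than three strings
    static final Map<String, Long> nodeToId = new HashMap<String, Long>();
    static final List<String> idToNode = new ArrayList<String>();

    static long encode(String node) {
        Long id = nodeToId.get(node);
        if (id == null) {
            id = (long) idToNode.size();   // next free ID
            nodeToId.put(node, id);
            idToNode.add(node);
        }
        return id;
    }

    public static void main(String[] args) {
        System.out.println(encode("<http://example/s>") + " "
            + encode("<http://example/p>") + " "
            + encode("\"o\""));            // prints: 0 1 2
    }
}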

The index phase builds the B+Tree indices that are needed to answer actual
queries.  In principle I believe you can build fewer indices (Andy - am I
remembering this right?), but this isn't exposed via the command line and
may have performance impacts later.
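
Once the indices are built you query as normal; a minimal sketch against
the TDB Java API (store path is made up, package names are for the current
2.x releases):

import com.hp.hpl.jena.query.Dataset;
import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.ResultSetFormatter;
import com.hp.hpl.jena.tdb.TDBFactory;

public class QueryTDB {
    public static void main(String[] args) {
        // open the store that tdbloader2 built
        Dataset ds = TDBFactory.createDataset("/data/freebase-tdb");
        QueryExecution qe = QueryExecutionFactory.create(
            "SELECT * WHERE { ?s ?p ?o } LIMIT 10", ds);
        try {
            ResultSetFormatter.out(qe.execSelect());
        } finally {
            qe.close();
            ds.close();
        }
    }
}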

>
>I am basically trying to think how I can make the import faster.
>
>And the last question:
>
>Would you recommend running the import with a compressed or an
>uncompressed input file?

Compressed input, since it will reduce disk IO, though if you have a fast
disk, i.e. an SSD, then this may make little or no difference.
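
As far as I recall the loader (via the riot parser underneath) detects a
.gz extension and decompresses on the fly, so something like this should
work (filename made up):

  tdbloader2 --loc /data/freebase-tdb freebase-dump.nt.gz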

Rob

>
>Regards,
>
>Ewa
