Hello,

Thank you very much for all your exhaustive answers.

I redid my loading test today, changing my Java heap setting from 40GB to 4GB as advised, and I used tdbloader, since the thread here reported some promising numbers: http://markmail.org/message/npwvg65x77mgr7mr#query:+page:1+mid:2a23v4pi4pifcttd+state:results

The load took 15 minutes, and both phases took approximately equal time.

Thanks! I will let you know how the load of the whole Freebase dump goes once I have the correct data for it.

Ewa
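For reference, a minimal sketch of the kind of invocation described above (the database directory and input file name are placeholders, not values from this thread; JVM_ARGS is the environment variable the Jena command scripts read for JVM options):

    # Keep the Java heap modest: on a 64-bit JVM, TDB works through
    # memory-mapped files, so spare RAM is better left to the OS file cache.
    export JVM_ARGS="-Xmx4G"

    # Bulk load a gzip-compressed N-Triples slice into an empty TDB location.
    tdbloader --loc=/data/tdb/freebase freebase-slice.nt.gz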
2013/11/5 Rob Vesse <[email protected]>

> Hi
>
> Comments inline:
>
> On 04/11/2013 22:57, "Ewa Szwed" <[email protected]> wrote:
>
> >Hello,
> >
> >I am currently working on a project that loads a full Freebase dump
> >into a triple store.
> >
> >The whole Freebase dump is around 2 billion triples at the moment
> >(260 GB of uncompressed data).
> >
> >We chose to investigate Apache Jena TDB as the first product for this.
> >
> >I run Jena on a virtual machine with a Red Hat Linux distribution,
> >8 CPU cores, 64 GB RAM and a 1.2 TB hard drive.
> >
> >Which data loader would be recommended here (are the loaders tdbloader3
> >and tdbloader4 even of concern)? I have done my first test, loading 2.5%
> >of Freebase into Jena with tdbloader2, and it took 3.48 hours, which is
> >not very promising even if the import time scales linearly.
>
> tdbloader2 is generally the recommended one, though whether it gives you
> much advantage may depend on whether your OS sort command supports the
> --parallel option.
>
> >Is there a way to make the import parallel (run a few instances of the
> >loader at the same time against one Jena instance)?
>
> No. tdbloader2 will perform some parallelisation if your sort command
> supports --parallel as per above, but otherwise there is no
> parallelisation. tdbloader2 needs exclusive access to the disk location
> since it creates the data files from scratch, and more recent versions
> should refuse to attempt to write to a non-empty disk location.
>
> >Is there a way to tune the loader so that the data load is faster? (I
> >did not find any information on that.)
>
> See the recent thread on this for tips -
> http://markmail.org/message/npwvg65x77mgr7mr
>
> >I do not understand the idea of Jena indexing; the second phase of the
> >load - the one that is actually time consuming - is the index phase. Is
> >this indexing required at all for querying with SPARQL, or is it a
> >'full text search' type of indexing? I am wondering if I could maybe
> >skip this phase entirely.
>
> No, this is not full text indexing. TDB loading consists of two phases.
> The data phase involves reading in the raw data and dictionary encoding
> it, i.e. assigning a unique Node ID to each unique RDF node and building
> the mapping tables of RDF node -> TDB Node ID and TDB Node ID -> RDF node.
>
> The index phase builds the B+Tree indices that are needed to answer
> actual queries. In principle I believe you can build fewer indices
> (Andy - am I remembering this right?) but this isn't exposed via the
> command line and may have performance impacts later.
>
> >I am basically trying to think how I can make the import faster.
> >
> >And the last question:
> >
> >Would you recommend running the import with a compressed or an
> >uncompressed input file?
>
> Compressed input, since it will reduce disk IO, though if you have a
> fast disk, i.e. an SSD, this may make little or no difference.
>
> Rob
>
> >Regards,
> >
> >Ewa
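Pulling the suggestions above together, a rough sketch of a full load might look like the following (assuming GNU coreutils sort and the standard Jena TDB scripts on the PATH; the database location and file name are placeholders):

    # tdbloader2 relies on the system sort for its index phase; check
    # whether it supports --parallel before expecting any parallelism.
    sort --help | grep -- --parallel

    # tdbloader2 builds the database files from scratch, so point it at an
    # empty (or new) directory.
    mkdir -p /data/tdb/freebase

    # Compressed N-Triples input is fine; it is decompressed on the fly and
    # reduces disk IO during the data phase.
    tdbloader2 --loc /data/tdb/freebase freebase.nt.gz

    # Sanity check: count the loaded triples with a SPARQL query.
    tdbquery --loc /data/tdb/freebase 'SELECT (COUNT(*) AS ?count) WHERE { ?s ?p ?o }'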
