Hello, I am currently working on a project that loads a full Freebase dump into a triple store.
The whole Freebase dump is around 2 billion triples at the moment (260 GB of uncompressed data). We chose to investigate Apache Jena TDB as the first product for this. I run Jena on a virtual machine with a Red Hat Linux distribution, an 8-core CPU, 64 GB RAM and a 1.2 TB hard drive.

Which data loader would you recommend here? Are tdbloader3 and tdbloader4 even worth considering? In my first test I loaded 2.5% of Freebase into Jena with tdbloader2 and it took 3.48 hours. That is not very promising: even if the import time scales linearly, the full dump would take about 139 hours, or almost 6 days.

Is there a way to make the import parallel, that is, to run a few instances of the loader at the same time against one Jena instance?

Is there a way to tune the loader so that the data load is faster? I did not find any information on that.

I do not understand the idea of Jena indexing. The second phase of the load, the index phase, is the one that is actually time consuming. Is this indexing required at all for querying with SPARQL, or is it a 'full text search' type of indexing? I am wondering whether I could skip this phase entirely.

I am basically trying to work out how I can make the import faster.

And the last question: would you recommend running the import with a compressed or an uncompressed file as the input file?

Regards,
Ewa
