On 05/03/16 18:04, A. Soroka wrote:
I’m writing an ETL-ish utility that extracts triples from some directories of 
application-specific XML to assemble into a from-scratch TDB database. Of 
course I want to take advantage of the bulk loader facilities for best results. 
The TDBLoader methods that I’m looking at all accept InputStreams or URIs from 
which to get serialized RDF. It happens that I am already using Jena to 
transform the XML into RDF, so I’ve got actual Jena Triples in hand when I come 
to the bulk loading apparatus. It seems silly to serialize the triples only for 
the bulk loader to deserialize them, so I’d like to get at a StreamRDF instance 
or something similar that I can use to give Triples in a flow directly to the 
bulk loader, but at a first glance it looks like that’s hidden as 
BulkLoader.DestinationGraphs.

As additional context, the extraction is easily parallelized, but I do not see 
any note that the bulk loading is threadsafe, so I had intended to run a couple 
of threads of extraction loading a queue with a thread feeding the bulk loading 
gear from that queue.

Am I misunderstanding the action of the bulk loader, and more to the point, 
what is the most efficient way I can build a from-scratch TDB database from 
Triples?

Thanks for any help or advice!

---
A. Soroka
The University of Virginia Library


Hi,

StreamRDF came after BulkLoader, so it might not be fully exposed, though note that the loader uses "BulkStreamRDF", which adds to the StreamRDF contract. Because parsing many files causes a start/finish call pair for each file, something has to handle the overall bulk process, and that is what startBulk/finishBulk adds.

Bulkloading is not thread safe.
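
Since the loader must only ever be driven from one thread, the queue arrangement described in the question could look roughly like the sketch below. It uses only java.util.concurrent. The class name QueueFedLoad, the use of plain strings in place of Jena Triples, and the "sink" callback standing in for whatever wraps the bulk loader are all assumptions made for illustration, not Jena API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

/** Hypothetical sketch: several extractor threads, one loader thread, a bounded queue between them. */
class QueueFedLoad {
    // Unique sentinel object (compared by identity) that tells the loader thread to stop.
    private static final String END = new String("END");

    /**
     * Runs one extractor thread per batch and a single loader thread.
     * "sink" stands in for the non-thread-safe loader; it is only called from the loader thread.
     * Returns the number of items handed to the sink.
     */
    static int run(List<List<String>> batches, Consumer<String> sink) {
        // Bounded queue: extractors block when the loader falls behind.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);
        int[] loaded = {0};

        Thread loader = new Thread(() -> {
            try {
                for (String t = queue.take(); t != END; t = queue.take()) {
                    sink.accept(t);
                    loaded[0]++;
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        loader.start();

        List<Thread> extractors = new ArrayList<>();
        for (List<String> batch : batches) {
            Thread t = new Thread(() -> {
                try {
                    for (String triple : batch) queue.put(triple);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            extractors.add(t);
            t.start();
        }

        try {
            for (Thread t : extractors) t.join(); // wait until all extraction is finished
            queue.put(END);                       // then signal the loader to stop
            loader.join();                        // join makes loaded[0] visible to this thread
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while loading", e);
        }
        return loaded[0];
    }
}
```

The bounded queue is the point of the design: it keeps memory use flat when extraction outruns loading. A real version would also need to stop the extractors if the sink fails, which this sketch does not do.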

Serializing isn't so bad.  It makes the parallel extraction simple.
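
As a rough illustration of why serializing keeps the extraction side simple: each worker can write its own file and never share state with the others, and the resulting files can then be checked and handed to the loader one after another. The sketch below uses only the JDK. The class name ParallelDump, the part-N.nt file naming, and the assumption that each triple has already been rendered as one N-Triples line are all hypothetical.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch: each extraction task writes its own N-Triples file, with no shared state. */
class ParallelDump {
    /** Writes one file per batch into outDir (which must exist) and returns the files in batch order. */
    static List<Path> dump(List<List<String>> batches, Path outDir) {
        int threads = Math.max(1, Math.min(batches.size(), Runtime.getRuntime().availableProcessors()));
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Path>> pending = new ArrayList<>();
            for (int i = 0; i < batches.size(); i++) {
                List<String> batch = batches.get(i);
                Path file = outDir.resolve("part-" + i + ".nt");
                Callable<Path> task = () -> writePart(batch, file);
                pending.add(pool.submit(task));
            }
            List<Path> files = new ArrayList<>();
            for (Future<Path> f : pending) files.add(f.get()); // waits for each task, surfaces failures
            return files;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while writing", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("extraction task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    // Writes the lines as UTF-8, one statement per line.
    private static Path writePart(List<String> lines, Path file) {
        try (BufferedWriter w = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            for (String line : lines) {
                w.write(line);
                w.newLine();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return file;
    }
}
```

The files produced this way can be parsed on their own before any loading starts, and the loader then reads them sequentially from a single thread.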

A bonus here is that you are running two processes in parallel and also you can check the data. Checking before a large bulk load is a good idea for a reliable process.

Realistically, on one general purpose machine, running the extractor process at the same time as the bulk load is going to slow down bulkloading due to I/O interactions (even if separate disks). Write/parse is CPU-dominated and faster than the bulkloader.

    Andy
