Re: StreamRDF or similar for a TDB bulk load?

Andy Seaborne Sun, 06 Mar 2016 08:37:18 -0800

On 05/03/16 18:04, A. Soroka wrote:

I’m writing an ETL-ish utility that extracts triples from some directories of 
application-specific XML to assemble into a from-scratch TDB database. Of 
course I want to take advantage of the bulk loader facilities for best results. 
The TDBLoader methods that I’m looking at all accept InputStreams or URIs from 
which to get serialized RDF. It happens that I am already using Jena to 
transform the XML into RDF, so I’ve got actual Jena Triples in hand when I come 
to the bulk loading apparatus. It seems silly to serialize the triples only for 
the bulk loader to deserialize them, so I’d like to get at a StreamRDF instance 
or something similar that I can use to give Triples in a flow directly to the 
bulk loader, but at a first glance it looks like that’s hidden as 
BulkLoader.DestinationGraphs.


As additional context, the extraction is easily parallelized, but I do not see 
any note that the bulk loading is threadsafe, so I had intended to run a couple 
of threads of extraction loading a queue with a thread feeding the bulk loading 
gear from that queue.

Am I misunderstanding the action of the bulk loader, and more to the point, 
what is the most efficient way I can build a from-scratch TDB database from 
Triples?

Thanks for any help or advice!

---
A. Soroka
The University of Virginia Library

Hi,

StreamRDF came after BulkLoader, so it might not be fully exposed, though
note it uses "BulkStreamRDF", which adds to the StreamRDF contract. As
parsing many files each cause start/finish calls, there has to be some
handling of the overall bulk process, which is what startBulk/finishBulk
adds.


Bulkloading is not thread safe.
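
If you do feed the loader from a queue as you describe, the point is that
only one thread ever touches the loading side. A rough sketch of that shape,
using only java.util.concurrent, is below. Everything named here
(QueueFedLoad, run, the sink) is hypothetical and not Jena API, and triples
are stood in for by plain strings; in a real job the sink would be whatever
hands triples to the bulk loader.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;

// Hypothetical sketch: none of these names are Jena API.
class QueueFedLoad {
    // Marker object put on the queue when one extractor has finished.
    // Compared by identity, so it cannot collide with real data.
    private static final String DONE = new String("<done>");

    /**
     * Starts one extractor thread per batch; the calling thread is the only
     * consumer and the only thread that ever calls the sink.
     * Returns the number of items handed to the sink.
     */
    static int run(List<List<String>> batches, Consumer<String> sink) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);
        ExecutorService extractors =
                Executors.newFixedThreadPool(Math.max(1, batches.size()));
        for (List<String> batch : batches) {
            // Returning a value makes this a Callable, so put() may throw.
            extractors.submit(() -> {
                for (String triple : batch) {
                    queue.put(triple);   // blocks when the queue is full
                }
                queue.put(DONE);
                return null;
            });
        }
        extractors.shutdown();           // no new tasks; running ones continue

        int delivered = 0;
        int remaining = batches.size();
        try {
            while (remaining > 0) {
                String item = queue.take();
                if (item == DONE) {      // identity check on the marker
                    remaining--;
                } else {
                    sink.accept(item);   // single-threaded access to the sink
                    delivered++;
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while loading", e);
        }
        return delivered;
    }

    public static void main(String[] args) {
        List<String> loaded = new ArrayList<>();
        int n = run(List.of(List.of("t1", "t2"), List.of("t3")), loaded::add);
        System.out.println(n + " items delivered");
    }
}
```

The bounded queue gives back-pressure: extractors block when the consumer
falls behind, so memory use stays flat. Order across batches is not
preserved, which should not matter for a set of triples.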

Serializing isn't so bad.  It makes the parallel extraction simple.

A bonus here is that you are running two processes in parallel and also
you can check the data. Checking before a large bulk load is a good idea
for a reliable process.

Realistically, on one general purpose machine, running the extractor
process at the same time as the bulk load is going to slow down
bulkloading due to I/O interactions (even if separate disks).
Write/parse is CPU-dominated and faster than the bulkloader.
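
To illustrate the serialize-first route: each extractor writes its own
N-Triples file with no shared state, and the files are checked and
bulk-loaded afterwards as a separate step. This is only a sketch under
assumptions. ParallelExtract and writeAll are made-up names, the "part-N.nt"
naming is arbitrary, and the lines are assumed to be N-Triples text already
(in practice a Jena writer would produce them).

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: names and file layout are assumptions, not Jena API.
class ParallelExtract {
    /**
     * Writes each batch of N-Triples lines to its own file under outDir,
     * one task per batch, and returns the files in batch order.
     */
    static List<Path> writeAll(List<List<String>> batches, Path outDir) {
        int threads = Math.max(1, Math.min(batches.size(),
                Runtime.getRuntime().availableProcessors()));
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Path>> pending = new ArrayList<>();
            for (int i = 0; i < batches.size(); i++) {
                final List<String> lines = batches.get(i);
                final Path out = outDir.resolve("part-" + i + ".nt");
                // Each task owns its file, so no locking is needed.
                pending.add(pool.submit(
                        () -> Files.write(out, lines, StandardCharsets.UTF_8)));
            }
            List<Path> written = new ArrayList<>();
            for (Future<Path> f : pending) {
                written.add(f.get());    // waits for the task, surfaces failures
            }
            return written;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while writing", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("extraction failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

The returned files can then be validated and passed to the bulk loader once
extraction has finished, so the two do not compete for I/O.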


    Andy
