On 05/03/16 18:04, A. Soroka wrote:
I’m writing an ETL-ish utility that extracts triples from some directories of
application-specific XML to assemble into a from-scratch TDB database. Of
course I want to take advantage of the bulk loader facilities for best results.
The TDBLoader methods that I’m looking at all accept InputStreams or URIs from
which to get serialized RDF. It happens that I am already using Jena to
transform the XML into RDF, so I’ve got actual Jena Triples in hand when I come
to the bulk loading apparatus. It seems silly to serialize the triples only for
the bulk loader to deserialize them, so I’d like to get at a StreamRDF instance
or something similar that I can use to give Triples in a flow directly to the
bulk loader, but at first glance it looks like that’s hidden as
BulkLoader.DestinationGraphs.
As additional context, the extraction is easily parallelized, but I do not see
any note that the bulk loading is threadsafe, so I had intended to run a couple
of threads of extraction loading a queue with a thread feeding the bulk loading
gear from that queue.
Am I misunderstanding the action of the bulk loader, and more to the point,
what is the most efficient way I can build a from-scratch TDB database from
Triples?
Thanks for any help or advice!
---
A. Soroka
The University of Virginia Library
Hi,
StreamRDF came after BulkLoader, so it might not be fully exposed, though
note that the bulk loader uses "BulkStreamRDF", which adds to the StreamRDF
contract. Because parsing many files causes a start/finish pair of calls for
each file, there has to be some handling of the overall bulk process, which
is what startBulk/finishBulk adds.
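To illustrate the call ordering described above, here is a minimal sketch. The interfaces and class names below (TripleStream, BulkTripleStream, RecordingBulkStream) are hypothetical stand-ins written for this example, not Jena's actual StreamRDF or BulkStreamRDF types; only the startBulk/finishBulk bracketing around per-file start/finish calls is taken from the description.

```java
import java.util.ArrayList;
import java.util.List;

public class BulkLifecycleSketch {

    /** Hypothetical per-parse contract: one start/finish pair per file parsed. */
    public interface TripleStream {
        void start();
        void triple(String s, String p, String o);
        void finish();
    }

    /** Hypothetical bulk contract: brackets many parses with one bulk start/finish. */
    public interface BulkTripleStream extends TripleStream {
        void startBulk();
        void finishBulk();
    }

    /** Records lifecycle events so the ordering can be inspected. */
    public static class RecordingBulkStream implements BulkTripleStream {
        public final List<String> events = new ArrayList<>();
        public int triples = 0;

        @Override public void startBulk()  { events.add("startBulk"); }
        @Override public void start()      { events.add("start"); }
        @Override public void triple(String s, String p, String o) { triples++; }
        @Override public void finish()     { events.add("finish"); }
        @Override public void finishBulk() { events.add("finishBulk"); }
    }

    /** Simulates loading several files: one bulk bracket around many parses. */
    public static RecordingBulkStream loadFiles(int files, int triplesPerFile) {
        RecordingBulkStream dest = new RecordingBulkStream();
        dest.startBulk();                      // once, before any file
        for (int f = 0; f < files; f++) {
            dest.start();                      // each parse begins
            for (int t = 0; t < triplesPerFile; t++) {
                dest.triple("s" + f, "p", "o" + t);
            }
            dest.finish();                     // each parse ends
        }
        dest.finishBulk();                     // once, after the last file
        return dest;
    }
}
```

The point of the extra pair of calls is that expensive work tied to the whole load can happen once per load rather than once per file.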
Bulkloading is not thread safe.
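If you do go the in-memory route, the arrangement you describe (several extraction threads, one thread feeding the loader) could be sketched roughly as follows. This uses only the Java standard library: strings stand in for Triples, and the `sink` parameter is a hypothetical placeholder for whatever single-threaded loader entry point you end up with. The class name, queue size, and end-of-input marker are all assumptions made for the example.

```java
import java.util.Optional;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

public class SingleWriterPipeline {

    /**
     * Runs `extractors` producer threads and drains their output on the
     * calling thread, so the sink is only ever touched by one thread.
     * Returns the number of items handed to the sink.
     */
    public static int run(int extractors, int itemsPerExtractor, Consumer<String> sink) {
        // Bounded queue: producers block when the consumer falls behind.
        BlockingQueue<Optional<String>> queue = new ArrayBlockingQueue<>(1024);

        for (int i = 0; i < extractors; i++) {
            final int id = i;
            Thread producer = new Thread(() -> {
                try {
                    for (int j = 0; j < itemsPerExtractor; j++) {
                        queue.put(Optional.of("item-" + id + "-" + j));
                    }
                    // Empty Optional marks the end of this producer's output.
                    queue.put(Optional.empty());
                } catch (InterruptedException e) {
                    // Sketch only: an interrupted producer never sends its
                    // end marker, so nothing here should interrupt it.
                    Thread.currentThread().interrupt();
                }
            });
            producer.start();
        }

        int delivered = 0;
        int finished = 0;
        try {
            // Single consumer: stop once every producer has signalled its end.
            while (finished < extractors) {
                Optional<String> next = queue.take();
                if (next.isPresent()) {
                    sink.accept(next.get());
                    delivered++;
                } else {
                    finished++;
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while draining queue", e);
        }
        return delivered;
    }
}
```

The bounded queue keeps memory use flat if extraction outpaces loading, and because only the calling thread invokes the sink, the loader side needs no locking.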
Serializing isn't so bad. It makes the parallel extraction simple.
A bonus here is that you are running two processes in parallel and also
you can check the data. Checking before a large bulk load is a good idea
for a reliable process.
Realistically, on one general-purpose machine, running the extractor
process at the same time as the bulk load is going to slow down
bulk loading due to I/O interactions (even with separate disks).
Writing and parsing are CPU-dominated and faster than the bulk loader.
Andy