I’m writing an ETL-ish utility that extracts triples from some directories of application-specific XML and assembles them into a from-scratch TDB database. Of course I want to take advantage of the bulk loader facilities for best results. The TDBLoader methods that I’m looking at all accept InputStreams or URIs from which to read serialized RDF.

It happens that I am already using Jena to transform the XML into RDF, so I’ve got actual Jena Triples in hand when I come to the bulk loading apparatus. It seems silly to serialize the triples only for the bulk loader to deserialize them again, so I’d like to get at a StreamRDF instance or something similar that I can use to feed Triples directly to the bulk loader as a stream. At first glance, though, it looks like that’s hidden as BulkLoader.DestinationGraphs.
As additional context, the extraction is easily parallelized, but I do not see any note that the bulk loading is thread-safe, so I had intended to run a couple of extraction threads that load a queue, with a single thread feeding the bulk loading gear from that queue.

Am I misunderstanding the action of the bulk loader, and more to the point, what is the most efficient way I can build a from-scratch TDB database from Triples?

Thanks for any help or advice!

---
A. Soroka
The University of Virginia Library
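
P.S. To make the threading arrangement described above concrete, here is a rough sketch in plain Java. It deliberately uses no Jena classes: `SketchTriple` stands in for a Jena Triple, and `TripleSink` is a hypothetical single-threaded destination standing in for whatever the bulk loader might expose (something StreamRDF-like). All names here are made up for illustration, and the "extraction" is just a loop that fabricates placeholder triples.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical stand-in for a Jena Triple, so the sketch is self-contained. */
final class SketchTriple {
    final String s, p, o;
    SketchTriple(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
}

/** Hypothetical destination; assumed NOT thread-safe, so only one thread may call it. */
interface TripleSink {
    void triple(SketchTriple t);
}

final class QueueFedLoadSketch {
    // Sentinel object: each extractor enqueues it once when it has finished.
    private static final SketchTriple DONE = new SketchTriple("", "", "");

    /** Runs the extractors in parallel and returns how many triples reached the sink. */
    static long run(int extractors, int triplesPerExtractor, TripleSink sink) throws Exception {
        // Bounded queue, so extraction cannot run arbitrarily far ahead of loading.
        BlockingQueue<SketchTriple> queue = new ArrayBlockingQueue<>(1024);
        ExecutorService pool = Executors.newFixedThreadPool(extractors + 1);
        try {
            // The single feeder thread: the only caller of sink.triple(...).
            Future<Long> fed = pool.submit(() -> {
                long count = 0;
                int finished = 0;
                while (finished < extractors) {
                    SketchTriple t = queue.take();
                    if (t == DONE) {
                        finished++;
                    } else {
                        sink.triple(t);
                        count++;
                    }
                }
                return count;
            });

            // The extraction threads: placeholder work that just fabricates triples.
            List<Future<?>> producers = new ArrayList<>();
            for (int i = 0; i < extractors; i++) {
                final int id = i;
                producers.add(pool.submit(() -> {
                    try {
                        for (int n = 0; n < triplesPerExtractor; n++) {
                            queue.put(new SketchTriple("s" + id, "p", "o" + n));
                        }
                    } finally {
                        queue.put(DONE); // always signal completion, even on failure
                    }
                    return null;
                }));
            }

            for (Future<?> f : producers) {
                f.get(); // surface any extraction failure
            }
            return fed.get();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        long n = run(4, 1000, t -> { /* hand t to the loader here */ });
        System.out.println("fed " + n + " triples");
    }
}
```

The sketch does not handle the case where the sink itself throws partway through (the extractors would then block on a full queue), so a real version would need some cancellation logic. The part I cannot fill in is the body of the `TripleSink` lambda, which is exactly the question above.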
