Re: StreamRDF or similar for a TDB bulk load?

A. Soroka Sun, 06 Mar 2016 09:29:44 -0800

> On Mar 6, 2016, at 11:36 AM, Andy Seaborne <[email protected]> wrote:
> Hi,
> 
> StreamRDF came after BulkLoader, so it might not be fully exposed, though 
> note that it uses "BulkStreamRDF", which adds to the StreamRDF contract.  As 
> parsing many files causes start/finish calls for each file, there has to be 
> some handling of the overall bulk process, which is what startBulk/finishBulk 
> adds.
> 
> Bulkloading is not thread safe.
> 
> Serializing isn't so bad.  It makes the parallel extraction simple.
> 
> A bonus here is that you are running two processes in parallel and also you 
> can check the data. Checking before a large bulk load is a good idea for a 
> reliable process.
> 
> Realistically, on one general purpose machine, running the extractor process 
> at the same time as the bulk load is going to slow down bulkloading due to 
> I/O interactions (even if separate disks). Write/parse is CPU-dominated  and 
> faster than the bulkloader.
> 
>    Andy


Okay, sounds like for the moment, serializing is the thing to do. In that case, 
I can drive the bulk loader with a PipedInputStream that I feed with N-Triples. 
I think I still might use the queue because a large enough number of Triple 
instances will take up less space than their serialization, assuming that they 
share enough nodes, which is a safe assumption here. I will take a crack at 
some point at getting an exposure of BulkStreamRDF out of the bulk loader, 
after everything else I’m supposed to do for Jena is done. [grin] I know that 
bandwidth will get divided between the "sides" of the process-as-a-whole, but 
there’s not much I can do about that in the particular circumstances.
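
For the record, a rough sketch of the arrangement I have in mind, using only the JDK so it runs on its own. Everything named here is hypothetical: IriTriple stands in for Jena's Triple (IRIs only, no literals or blank nodes), the END sentinel is just my way of signalling that extraction is done, and countLines() stands in for the bulk loader, which in the real thing would be handed the same InputStream. The queue and pipe sizes are arbitrary.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipedNTriplesSketch {

    /** Hypothetical stand-in for a Jena Triple whose three terms are all IRIs. */
    static final class IriTriple {
        final String s, p, o;
        IriTriple(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
        /** One N-Triples statement per line. */
        String toNTriples() { return "<" + s + "> <" + p + "> <" + o + "> .\n"; }
    }

    /** Sentinel (compared by identity) marking the end of extraction. */
    static final IriTriple END = new IriTriple("", "", "");

    /** Starts a thread that drains the queue into a pipe; returns the read end. */
    static InputStream serialize(BlockingQueue<IriTriple> queue) throws IOException {
        PipedInputStream in = new PipedInputStream(64 * 1024);
        PipedOutputStream out = new PipedOutputStream(in);
        Thread t = new Thread(() -> {
            // Closing the writer closes the pipe, so the reader sees end-of-stream.
            try (Writer w = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
                for (IriTriple tr = queue.take(); tr != END; tr = queue.take()) {
                    w.write(tr.toNTriples());
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "ntriples-serializer");
        t.setDaemon(true);
        t.start();
        return in;
    }

    /** Stand-in for the bulk loader: just counts the lines it is fed. */
    static long countLines(InputStream in) throws IOException {
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            return r.lines().count();
        }
    }

    /** Extractor thread fills the queue; the calling thread plays the loader. */
    static long run(int n) throws Exception {
        BlockingQueue<IriTriple> queue = new ArrayBlockingQueue<>(1024);
        InputStream in = serialize(queue);
        Thread extractor = new Thread(() -> {
            try {
                for (int i = 0; i < n; i++) {
                    // Subject and predicate strings are shared across triples.
                    queue.put(new IriTriple("http://example.org/s",
                            "http://example.org/p", "http://example.org/o" + i));
                }
                queue.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "extractor");
        extractor.start();
        long loaded = countLines(in);
        extractor.join();
        return loaded;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run(1000));
    }
}
```

The bounded queue and the bounded pipe both apply backpressure, so the extractor can't run arbitrarily far ahead of the loader. The pipe's reader and writer have to be on different threads, which is why the serializer gets a thread of its own.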

---
A. Soroka
The University of Virginia Library