> On Mar 6, 2016, at 11:36 AM, Andy Seaborne <[email protected]> wrote:
>
> Hi,
>
> StreamRDF came after BulkLoader so it might not be fully exposed, though note
> it uses "BulkStreamRDF" which adds to the StreamRDF contract. As parsing
> many files each cause start/finish calls, there has to be some handling of
> the overall bulk process which is what startBulk/finishBulk adds.
>
> Bulkloading is not thread safe.
>
> Serializing isn't so bad. It makes the parallel extraction simple.
>
> A bonus here is that you are running two processes in parallel and also you
> can check the data. Checking before a large bulk load is a good idea for a
> reliable process.
>
> Realistically, on one general purpose machine, running the extractor process
> at the same time as the bulk load is going to slow down bulkloading due to
> I/O interactions (even if separate disks). Write/parse is CPU-dominated and
> faster than the bulkloader.
>
> Andy

Okay, it sounds like serializing is the thing to do for the moment. In that case, I can drive the bulk loader with a PipedInputStream that I feed with N-Triples. I think I might still use the queue, because a large enough number of Triple instances will take up less space than their serialization, assuming that they share enough nodes, which is a safe assumption here.

At some point I will take a crack at getting an exposure of BulkStreamRDF out of the bulk loader, after everything else I’m supposed to do for Jena is done. [grin]

I know that bandwidth will get divided between the "sides" of the process as a whole, but there’s not much I can do about that in the particular circumstances.

---
A. Soroka
The University of Virginia Library
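
For concreteness, here is a rough sketch of the queue-plus-pipe arrangement described above, using only the JDK. It is an illustration under stated assumptions, not the actual loader wiring: SimpleTriple, END, and pump are hypothetical names, no Jena API is called, and the reading loop at the end merely stands in for whatever would hand the PipedInputStream to the bulk loader.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipedNTriplesSketch {

    /** Hypothetical stand-in for a triple of IRIs; NOT a Jena class. */
    public static final class SimpleTriple {
        final String s, p, o;
        public SimpleTriple(String s, String p, String o) { this.s = s; this.p = p; this.o = o; }
        String toNTriples() { return "<" + s + "> <" + p + "> <" + o + "> ."; }
    }

    /** Sentinel (compared by identity) that tells the serializer to stop. */
    private static final SimpleTriple END = new SimpleTriple("", "", "");

    /** Pushes triples through a bounded queue and a pipe; returns the lines the consumer saw. */
    public static List<String> pump(List<SimpleTriple> triples)
            throws IOException, InterruptedException {
        BlockingQueue<SimpleTriple> queue = new ArrayBlockingQueue<>(1024);
        PipedInputStream in = new PipedInputStream(64 * 1024);
        PipedOutputStream out = new PipedOutputStream(in);

        // Extractor side: enqueue in-memory triples, then the sentinel.
        Thread extractor = new Thread(() -> {
            try {
                for (SimpleTriple t : triples) queue.put(t);
                queue.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Serializer side: drain the queue and write N-Triples lines into the pipe.
        // Closing the writer is what signals end-of-stream to the reading side.
        Thread serializer = new Thread(() -> {
            try (Writer w = new BufferedWriter(
                    new OutputStreamWriter(out, StandardCharsets.UTF_8))) {
                for (SimpleTriple t = queue.take(); t != END; t = queue.take()) {
                    w.write(t.toNTriples());
                    w.write('\n');
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        extractor.start();
        serializer.start();

        // Consumer side (assumption: the bulk loader would read `in` here instead).
        // It runs on a different thread from the writer, as piped streams require.
        List<String> lines = new ArrayList<>();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            for (String line = r.readLine(); line != null; line = r.readLine()) {
                lines.add(line);
            }
        }
        extractor.join();
        serializer.join();
        return lines;
    }

    public static void main(String[] args) throws Exception {
        List<String> seen = pump(List.of(
                new SimpleTriple("http://example.org/s", "http://example.org/p", "http://example.org/o")));
        seen.forEach(System.out::println);
    }
}
```

The bounded queue keeps the extractor from running arbitrarily far ahead of the loader, and only the serializer thread ever touches the pipe, so the single-threaded consumer is never written to concurrently.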
