I’ve taken similar approaches in data generators in the past. Using one file 
per thread is by far the best way to do things and requires the least 
coordination.
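
A minimal sketch of the one-file-per-thread pattern, using only the JDK. The 
class name, the part-N.nt file names and the triple() helper are all 
hypothetical stand-ins; a real generator would presumably hand each thread its 
own Jena writer instead of building N-Triples lines by hand.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PerThreadFiles {

    // Hypothetical stand-in for the real generator: builds one N-Triples line.
    static String triple(int worker, int i) {
        return "<http://example.org/s" + worker + "_" + i + "> "
             + "<http://example.org/p> \"" + i + "\" .\n";
    }

    // Each worker writes to its own file, so no locking is needed on the output.
    static List<Path> writeAll(Path dir, int workers, int triplesPerWorker) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Path>> futures = new ArrayList<>();
            for (int w = 0; w < workers; w++) {
                final int worker = w;
                futures.add(pool.submit(() -> {
                    Path out = dir.resolve("part-" + worker + ".nt");
                    try (BufferedWriter bw =
                             Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
                        for (int i = 0; i < triplesPerWorker; i++) {
                            bw.write(triple(worker, i));
                        }
                    }
                    return out;
                }));
            }
            // Wait for every worker and collect the files it produced.
            List<Path> files = new ArrayList<>();
            for (Future<Path> f : futures) {
                files.add(f.get());
            }
            return files;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    // Runs 4 workers writing 3 triples each and returns the total line count.
    static int demo() {
        try {
            Path dir = Files.createTempDirectory("per-thread-demo");
            int total = 0;
            for (Path p : writeAll(dir, 4, 3)) {
                total += Files.readAllLines(p, StandardCharsets.UTF_8).size();
            }
            return total;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("lines written: " + demo());
    }
}
```

The only coordination point is waiting on the futures at the end; the writers 
never touch each other's streams.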

If you are using a concatenable format, you can always have a secondary thread 
that tracks which files you have finished writing and generates the final 
concatenated output partly in parallel. Whether that approach will work 
depends on the exact structure of your data generator, i.e. whether there are 
logical points at which a given output file can be considered complete.
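
One way the secondary-thread idea could look, again as a JDK-only sketch. The 
queue, the DONE sentinel and the class and file names are my assumptions for 
illustration, not anything Jena provides. Workers announce a file only after 
closing it, and the merger appends files in whatever order they arrive, which 
is fine for a line-based format such as N-Triples.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class ConcatMerger {

    // Sentinel telling the merger that no more files will arrive.
    static final Path DONE = Paths.get("");

    // Starts a thread that appends each announced file to `target`.
    static Thread startMerger(BlockingQueue<Path> completed, Path target) {
        Thread t = new Thread(() -> {
            try (OutputStream out = Files.newOutputStream(target)) {
                while (true) {
                    Path next = completed.take();
                    if (next == DONE) {
                        break; // identity check on the sentinel
                    }
                    Files.copy(next, out); // append the finished part
                    Files.delete(next);    // the part is no longer needed
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "merger");
        t.start();
        return t;
    }

    // Three workers write two triples each; returns the merged line count.
    static int demo() {
        try {
            Path dir = Files.createTempDirectory("merge-demo");
            Path target = dir.resolve("all.nt");
            BlockingQueue<Path> completed = new LinkedBlockingQueue<>();
            Thread merger = startMerger(completed, target);

            Thread[] workers = new Thread[3];
            for (int w = 0; w < workers.length; w++) {
                final int id = w;
                workers[w] = new Thread(() -> {
                    try {
                        Path part = dir.resolve("part-" + id + ".nt");
                        // Files.write closes the file before returning.
                        Files.write(part, Arrays.asList(
                            "<http://example.org/s" + id + "> <http://example.org/p> \"a\" .",
                            "<http://example.org/s" + id + "> <http://example.org/p> \"b\" ."));
                        completed.put(part); // announce only once complete
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
                workers[w].start();
            }
            for (Thread t : workers) {
                t.join();
            }
            completed.put(DONE); // all workers finished, stop the merger
            merger.join();
            return Files.readAllLines(target).size();
        } catch (IOException | InterruptedException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("merged lines: " + demo());
    }
}
```

If a worker produces several files over its lifetime, it can put each one on 
the queue as it closes it, so merging overlaps with generation.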

Rob

On 26/11/2017, 15:56, "ajs6f" <[email protected]> wrote:

    I had a similar task a while ago: I did wrap the StreamRDF in a wrapper 
that synchronized the relevant methods, and that worked fine. Then I tried 
using several independent output files, one for each thread, and performance 
improved enormously.
    
    Keep in mind that if you use N-Triples or TriG, merging two files (for later 
processing) is just concatenating them.
    
    ajs6f
    
    > On Nov 26, 2017, at 9:15 AM, Zak Mc Kracken <[email protected]> 
wrote:
    > 
    > Hi Andy,
    > 
    > thank you for your reply. Good to know. My use case is an RDF exporter 
that takes data from a relatively slow data source (like a DBMS). In order to 
speed things up, it has multiple threads reading data, converting it to RDF and 
then sending generated RDF to their own Jena Model (one per thread). At the 
end, they stream the model to a common sink/stream, such as a file.
    > 
    > Actually I'm designing this with some flexibility: one can choose to pass 
a java.util.function.Consumer<Model> to the exporter, that is, a handler that 
does something with a thread's model once it is ready. That's because I want to 
reuse the upstream processing for either an RDF file exporter, or a Neo4J 
uploader (which should be able to manage concurrent writes at a finer-grained 
level), or, in general, some other kind of converter.
    > 
    > That said, I'm OK with making the file writing part synchronized and 
hence not really parallel; my question was meant to better understand how Jena 
works with this.
    > 
    > Best,
    > Marco.
    > 
    > On 26/11/2017 11:14, Andy Seaborne wrote:
    >> If the output stream is shared, then no.  It's buffered internally.
    >> 
    >> So at small scale, it'll look safe because the whole output is one 
buffer or the order was OK.  But beyond that, the buffered flushes will be 
interleaved, and buffer boundaries are based on characters, not on logical 
units of the RDF output.
    >> 
    >> Parallel writing to a shared OutputStream is a bad idea.
    >> 
    >> What's the use case you have for a shared output stream?
    >> 
    >>     Andy
    >> 
    >> 
    > 
    
    



