I’ve taken similar approaches in data generators in the past. Using one file per thread is by far the best way to do things and requires the least coordination.
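For what it's worth, here is a rough sketch of the shape I mean, in plain Java with no Jena calls. The class and method names (PerThreadExport, writePart, export) and the example.org triples are made up for illustration, and it assumes a line-based format such as N-Triples so that the part files can simply be appended to each other. In a real exporter each worker would drive its own RDF writer bound to its own file instead of writing strings by hand.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: one output file per worker, merged by concatenation.
class PerThreadExport {

    // Each worker owns its file, so no locking is needed while writing.
    static Path writePart(Path dir, int worker, int count) throws IOException {
        Path part = dir.resolve("part-" + worker + ".nt");
        try (BufferedWriter w = Files.newBufferedWriter(part, StandardCharsets.UTF_8)) {
            for (int i = 0; i < count; i++) {
                // Stand-in for "read from the slow source and convert to RDF".
                w.write("<http://example.org/s" + worker + "_" + i + "> "
                        + "<http://example.org/p> \"" + i + "\" .\n");
            }
        }
        return part;
    }

    // Runs the workers in parallel, then appends each finished part to one file.
    static Path export(Path dir, int workers, int triplesPerWorker) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        List<Future<Path>> parts = new ArrayList<>();
        try {
            for (int t = 0; t < workers; t++) {
                final int worker = t;
                parts.add(pool.submit(() -> writePart(dir, worker, triplesPerWorker)));
            }
            Path merged = dir.resolve("merged.nt");
            try (OutputStream out = Files.newOutputStream(merged)) {
                for (Future<Path> part : parts) {
                    // get() blocks until that worker is done; later workers may
                    // still be writing while earlier parts are being copied.
                    Files.copy(part.get(), out);
                }
            }
            return merged;
        } finally {
            pool.shutdown();
        }
    }
}
```

Calling PerThreadExport.export(Files.createTempDirectory("rdf-export"), 4, 1000) would leave four part files plus a merged.nt in the temporary directory. The only coordination point is waiting for each part to finish before it is copied.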
If you are using a concatenable format you can always have a secondary thread which tracks which files you are done writing to and generates the final concatenated output partly in parallel. Whether that approach will work depends on the exact structure of your data generator, i.e. whether there are logical points at which a given output file can be considered complete.

Rob

On 26/11/2017, 15:56, "ajs6f" <[email protected]> wrote:

I had a similar task a while ago: I wrapped the StreamRDF in a wrapper that synchronized the relevant methods, and that worked fine. Then I tried using several independent output files, one for each thread, and performance improved enormously. Keep in mind that if you use N-Triples or TriG, merging two files (for later processing) is just concatenating them.

ajs6f

> On Nov 26, 2017, at 9:15 AM, Zak Mc Kracken <[email protected]> wrote:
>
> Hi Andy,
>
> thank you for your reply. Good to know. My use case is an RDF exporter that takes data from a relatively slow data source (like a DBMS). In order to speed things up, it has multiple threads reading data, converting it to RDF and then sending the generated RDF to their own Jena Model (one per thread). At the end, they stream the model to a common sink/stream, such as a file.
>
> Actually I'm designing this with some flexibility: one can choose to pass a java.util.function.Consumer<Model> to the exporter, that is, a handler that does something with a thread's model once it is ready. That's because I want to reuse the upstream processing for either an RDF file exporter, or a Neo4J uploader (which should be able to manage concurrent writes at a finer-grained level), or, in general, some other kind of converter.
>
> That said, I'm OK with making the file writing part synchronized and hence not really parallel; my question was meant to better understand how Jena handles this.
>
> Best,
> Marco.
>
> On 26/11/2017 11:14, Andy Seaborne wrote:
>> If the output stream is shared, then no. It's buffered internally.
>>
>> So at small scale, it'll look safe because the whole output is one buffer or the order was OK. But beyond that, the buffered flushes will be interleaved and buffer boundaries are based on characters, not logical units of the RDF output.
>>
>> Parallel writing to a shared OutputStream is a bad idea.
>>
>> What's the use case you have for a shared output stream?
>>
>> Andy
