Sorry if this is a double post; I tried using the web client but didn't see the post show up, so I've subscribed to the list and am sending this via my email client instead.
We're working on an Apache Beam pipeline that gathers messages from various Pub/Sub topics, decodes them, bundles them into hourly groups (using a FixedWindow), and sends them to Azure. The eventual format of the records is line-delimited JSON. There isn't a sink written for Azure yet, and we want control over which file names get created for each window firing (we want the date/time appended to the file name; it looks like the existing sinks don't give you much control over the filename beyond specifying sharding). There's a rough sketch of the windowing at the bottom of this mail.

We decided to try writing our data to Azure via a DoFn instead. In processElement it writes a single record to a temp file, and in finishBundle we acquire a lease on a blockId based on the hash of our temp file and upload the data. This would work fine if each instance of our DoFn handled more than one record. However, we're running into the same issue as here: http://stackoverflow.com/questions/42914198/buffer-and-flush-apache-beam-streaming-data; each of our records is processed in its own bundle, so each one effectively gets its own DoFn instance and its own finishBundle upload.

So my question is: what's the best way to proceed? I see a few options:

1.) Implement the stateful processing/timer mechanism suggested in the Stack Overflow post: keep the records in this state instead of in temp files, and flush writes when a timer fires (rough sketch below).

2.) Implement a FileBasedSink. I'd prefer the flexibility of the Sink abstract class, but looking at the Apache Beam Jira project, that might be going away soon (https://issues.apache.org/jira/browse/BEAM-1897), so I'd like to stay away from it given that uncertainty. I saw in this Stack Overflow post http://stackoverflow.com/questions/41386717/azure-blob-support-in-apache-beam that it might be nice to have a dedicated Azure sink; if I could get a good rundown of what needs to be done, I might be able to contribute it.

3.) Do a GroupBy when our window fires, so that instead of processing one String at a time, the DoFn receives an Iterable<String> (rough sketch below). I'm not sure what would happen if the result of the GroupBy couldn't fit in a single machine's memory, though (say we uploaded a bunch of data during that hour). Would the Iterable<String>'s records be split across multiple machines? If so, I like this solution best.

4.) Something else?
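For context, the windowing side of the pipeline is essentially the following (illustrative names, not our exact code):

    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    // Decoded Pub/Sub messages, one line-delimited JSON record each,
    // bucketed into hourly fixed windows.
    PCollection<String> hourly = decoded.apply(
        Window.<String>into(FixedWindows.of(Duration.standardHours(1))));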
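If it helps the discussion, here's roughly what I picture option 1 looking like (untested sketch: BufferAndUploadFn and uploadToAzure are made-up names, and I'm assuming the state/timer API from the current Java SDK, which requires a keyed input):

    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.state.*;  // BagState, StateSpec(s), Timer(Spec)(s), TimeDomain
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
    import org.apache.beam.sdk.values.KV;

    // State and timers only work on keyed input, so the records would
    // first be mapped to KV<shardKey, line>.
    class BufferAndUploadFn extends DoFn<KV<String, String>, Void> {

      @StateId("lines")
      private final StateSpec<BagState<String>> linesSpec =
          StateSpecs.bag(StringUtf8Coder.of());

      @TimerId("flush")
      private final TimerSpec flushSpec = TimerSpecs.timer(TimeDomain.EVENT_TIME);

      @ProcessElement
      public void processElement(
          ProcessContext c,
          BoundedWindow window,
          @StateId("lines") BagState<String> lines,
          @TimerId("flush") Timer flush) {
        lines.add(c.element().getValue());
        // Fire once when the window closes; re-setting the timer to the
        // same timestamp on every element is harmless.
        flush.set(window.maxTimestamp());
      }

      @OnTimer("flush")
      public void onFlush(@StateId("lines") BagState<String> lines) {
        // Placeholder for our lease/blockId upload; the blob name could
        // embed the window's date/time here.
        uploadToAzure(lines.read());
        lines.clear();
      }

      private void uploadToAzure(Iterable<String> records) { /* ... */ }
    }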
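And a sketch of option 3, assuming we fan the lines out across a few hash-based shard keys so a single key doesn't funnel the whole hour through one worker (again untested, names made up):

    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.GroupByKey;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.WithKeys;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TypeDescriptors;

    // Key each line to one of 10 shards, then group per window firing.
    PCollection<KV<Integer, Iterable<String>>> grouped = hourly
        .apply(WithKeys.of((String line) -> Math.floorMod(line.hashCode(), 10))
            .withKeyType(TypeDescriptors.integers()))
        .apply(GroupByKey.create());

    grouped.apply(ParDo.of(new DoFn<KV<Integer, Iterable<String>>, Void>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        // Whether this Iterable is paged in lazily by the runner or held
        // entirely in memory is exactly my question above.
        for (String line : c.element().getValue()) {
          // append to the blob / temp file
        }
      }
    }));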
