Sorry if this is a double post; I tried using the web client but didn't see
the post show up, so I've subscribed to the list and am sending this via my
email client.

We're working on an Apache Beam pipeline that gathers messages from various
Pub/Sub topics, decodes them, bundles them into hourly groups (using
FixedWindows), and sends them to Azure.  The eventual format of the records
is line-delimited JSON.
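
For reference, the windowing portion of the pipeline looks roughly like
this (the topic name is made up and DecodeFn stands in for our own decoding
logic; exact method names vary a bit across Beam versions):

    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    // Read, decode, and bundle records into hourly windows.
    PCollection<String> hourly = pipeline
        .apply("ReadMessages", PubsubIO.readStrings()
            .fromTopic("projects/our-project/topics/our-topic"))
        .apply("Decode", ParDo.of(new DecodeFn()))
        .apply("HourlyWindows", Window.<String>into(
            FixedWindows.of(Duration.standardHours(1))));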

There isn't a sink written for Azure yet, and we want to control which file
names get created for each window firing (we want the date/time appended to
the file name; existing sinks don't seem to give you much control over the
filename beyond specifying sharding).

We decided to try writing our data to Azure via a DoFn instead.  In
processElement, it writes a single record to a temp file.  In finishBundle,
we acquire a lease on a block ID based on the hash of our temp file and
upload the data.
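
Here's a trimmed-down sketch of that DoFn (uploadToAzure() is just a
stand-in for our lease-and-upload code, not a real Azure API):

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;
    import org.apache.beam.sdk.transforms.DoFn;

    class WriteToAzureFn extends DoFn<String, Void> {
      private transient Path tempFile;

      @StartBundle
      public void startBundle() throws IOException {
        tempFile = Files.createTempFile("azure-upload", ".json");
      }

      @ProcessElement
      public void processElement(ProcessContext c) throws IOException {
        // Append one line-delimited JSON record per element.
        Files.write(tempFile, (c.element() + "\n").getBytes(StandardCharsets.UTF_8),
            StandardOpenOption.APPEND);
      }

      @FinishBundle
      public void finishBundle() throws IOException {
        // Acquire a lease on a block ID based on the hash of the temp file,
        // then upload its contents.
        uploadToAzure(tempFile);
        Files.delete(tempFile);
      }

      private void uploadToAzure(Path file) {
        // stand-in for our Azure block blob upload
      }
    }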

This would work fine if each instance of our DoFn were used for more than
one record.  However, we're running into the same issue described here:
http://stackoverflow.com/questions/42914198/buffer-and-flush-apache-beam-streaming-data
In streaming mode, each of our records ends up in its own bundle, so
finishBundle (and therefore an upload) fires for every single element.

So my question is: what's the best way to proceed?  I see a few options:

1.) Try implementing the stateful processing mechanism suggested in the
stackoverflow post: buffer records in state instead of temp files and fire
the writes via timers (see the first sketch after this list).
2.) Try implementing a FileBasedSink.  I'd prefer the flexibility of the
Sink abstract class, but looking at the Apache Beam Jira project, that
class might be going away soon
(https://issues.apache.org/jira/browse/BEAM-1897), so I'd like to stay away
from it given that uncertainty.

I saw in this stackoverflow post,
http://stackoverflow.com/questions/41386717/azure-blob-support-in-apache-beam,
that it might be nice to have a dedicated Azure sink.  If I could get a good
rundown of what needs to be done, I might be able to contribute this.
3.) Do a GroupByKey when our window fires.  That way, instead of a DoFn
that takes a single String, we'd have a DoFn that takes an Iterable<String>
(see the second sketch after this list).  I'm not sure what would happen if
the result of the GroupByKey couldn't fit in a single machine's memory,
though (say we uploaded a bunch of data during that one hour).  Would the
Iterable<String>'s records be split across multiple machines?  If so, I
like this solution best.
4.) Something else?
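
For option 1, I'm picturing something along these lines (stateful DoFns
need keyed input, so we'd key the records first; package names are from a
recent Beam snapshot, and my understanding is the state/timer API isn't
supported on every runner yet):

    import org.apache.beam.sdk.coders.StringUtf8Coder;
    import org.apache.beam.sdk.state.BagState;
    import org.apache.beam.sdk.state.StateSpec;
    import org.apache.beam.sdk.state.StateSpecs;
    import org.apache.beam.sdk.state.TimeDomain;
    import org.apache.beam.sdk.state.Timer;
    import org.apache.beam.sdk.state.TimerSpec;
    import org.apache.beam.sdk.state.TimerSpecs;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
    import org.apache.beam.sdk.values.KV;

    class BufferAndFlushFn extends DoFn<KV<String, String>, Void> {
      @StateId("buffer")
      private final StateSpec<BagState<String>> bufferSpec =
          StateSpecs.bag(StringUtf8Coder.of());

      @TimerId("flush")
      private final TimerSpec flushSpec = TimerSpecs.timer(TimeDomain.EVENT_TIME);

      @ProcessElement
      public void process(ProcessContext c, BoundedWindow window,
          @StateId("buffer") BagState<String> buffer,
          @TimerId("flush") Timer flush) {
        // Buffer the record in state instead of a temp file.
        buffer.add(c.element().getValue());
        // Fire once per key when the window closes.
        flush.set(window.maxTimestamp());
      }

      @OnTimer("flush")
      public void onFlush(@StateId("buffer") BagState<String> buffer) {
        // Upload everything buffered for this key/window to Azure here.
        for (String record : buffer.read()) {
          // ... append to the blob upload ...
        }
        buffer.clear();
      }
    }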
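
And option 3 would look something like this, using the hourly collection
from my first snippet (keying everything with a single constant key, which
is exactly the part I'm unsure about memory-wise):

    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.GroupByKey;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.WithKeys;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;

    // Group all records in each hourly window under one key.
    PCollection<KV<String, Iterable<String>>> grouped = hourly
        .apply("KeyAll", WithKeys.of("all"))
        .apply("GroupPerWindow", GroupByKey.<String, String>create());

    grouped.apply("UploadGroup", ParDo.of(
        new DoFn<KV<String, Iterable<String>>, Void>() {
          @ProcessElement
          public void process(ProcessContext c) {
            // This Iterable is the thing I'm asking about: does it have to
            // fit in one worker's memory, or do runners page it in lazily?
            for (String record : c.element().getValue()) {
              // ... stream the record into the Azure upload ...
            }
          }
        }));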
