On Thu, Dec 6, 2018 at 6:43 AM Jeff Klukas <[email protected]> wrote:

> Thanks for the response, Raghu. Does this sound like a use case that would
> be worth supporting in the future, or does that fall outside of
> Beam/Dataflow's goals?
>

Beam does not hinder runners from providing these. Beam itself has stayed
above state checkpointing and associated guarantees so far, letting runners
define those semantics. What Beam should do here, especially in light of being
a widely applicable data processing API, is an important question. I am
sure all proposals are welcome ('DoFn.RequiresStableInput' is an
example). Often, one of the tricky issues in Beam is how a feature affects
batch, and in streaming, how it affects orthogonal checkpointing semantics in
runners like Flink and runners like Dataflow & Spark.

Coming to one of its runners, Dataflow: yes, it should certainly aim to make
these use cases easier to do. Currently it has 'update' (gives the same
exactly-once guarantees across a job restart as within a single job) &
'drain' (supports graceful shutdown). While these two address a large range
of consistency requirements for customers, you want something in
between: the ability to save the state and restart from it. Flink
has support for this; I think Dataflow will support it too.


> Is there a general design principle of optimizing to make complex
> pipelines possible in favor of allowing detailed control for a simple
> pipeline like this?
>

'Save points' / snapshots have been the generic way of supporting this. Note
that resuming from a snapshot has the same caveats as an 'update'
(compatibility of coders for serialized state, etc.). Another approach is
what Kstreams does, where the streaming engine's state, input, and output
are all Kafka topics that can be atomically checkpointed. Beam, however,
needs to support a wide range of input sources.


> In effect, would you recommend we write a custom application using the
> Pub/Sub and Storage SDKs directly rather than trying to use Beam's
> abstractions?
>

Certainly not :). These are important use cases and we love it when users
raise them, like you did. I think the current approach taken by Beam is to let
the runners define consistency semantics.

Raghu.


> On Wed, Dec 5, 2018 at 2:24 PM Raghu Angadi <[email protected]> wrote:
>
>>
>> On Wed, Dec 5, 2018 at 11:13 AM Jeff Klukas <[email protected]> wrote:
>>
>>> We are attempting to build a Beam pipeline running on Dataflow that
>>> reads from a Cloud Pub/Sub subscription and writes 10 minute windows to
>>> files on GCS via FileIO. At-least-once completeness is critical in our use
>>> case, so we are seeking the simplest possible solution with obvious and
>>> verifiable delivery guarantees.
>>>
>>> We had assumed we'd be able to avoid intermediate storage, but it
>>> appears that it's necessary to specify sharding for FileIO when reading
>>> from Pub/Sub which implicitly calls GroupByKey [1] and that PubsubIO will
>>> ack messages at the first GroupByKey operation [0]. Thus, we become
>>> vulnerable to data loss scenarios if the Dataflow service becomes
>>> unavailable or if we improperly drain a pipeline on deletion or update.
>>>
>>> Is this assessment correct?
>>>
>> Yes.
>>
>>
>>> Would it be possible to delay pubsub acks until after the FileIO writes
>>> complete?
>>>
>>
>> There is no option to delay the acks for downstream stages.
>>
>>
>>> Other workarounds we could use to avoid Dataflow checkpoint data
>>> becoming critical state for the completeness of our pipeline?
>>>
>>
>> I don't see a good way not to depend on Dataflow checkpoint for
>> guarantees on a Dataflow pipeline. Others might have ideas here. If you are
>> willing to use external storage (or another pubsub topic) you could achieve
>> at least once. In that case, check other threads on the mailing list about
>> the Wait() transform, which lets you take an action after a certain operation
>> is done (in this case FileIO).
>>
>>
>>> [0]
>>> https://beam.apache.org/releases/javadoc/2.8.0/org/apache/beam/sdk/io/gcp/pubsub/PubsubUnboundedSource.html
>>> [1]
>>> https://beam.apache.org/releases/javadoc/2.8.0/org/apache/beam/sdk/io/FileIO.html
>>>
>>
