Sorry, I mistyped: we need "at least once" processing. <facepalm/>
On Wed, Aug 22, 2018 at 9:39 AM, Micah Whitacre <[email protected]> wrote:
>
> > Could you describe your durability requirements a bit more?
>
> The requirement is that we need "at most once" processing of the data. So
> I'm perfectly happy retrying the processing. I'm more concerned about data
> loss/skipping data in the event of processing failures, pipeline operations
> (starting/restarting), failures in the runner's underlying infrastructure,
> and the fun use cases when they might happen at the same time (e.g.
> underlying infrastructure problems that cause processing failures and need
> us to restart the pipeline :)).
>
> Are there any good resources talking about the differences at the
> boundaries or the assumed guarantees?
>
> On Tue, Aug 21, 2018 at 5:05 PM, Raghu Angadi <[email protected]> wrote:
>
>> On Tue, Aug 21, 2018 at 2:49 PM Micah Whitacre <[email protected]>
>> wrote:
>>
>>> > Is there a reason you can't trust the runner to be durable storage
>>> > for in-process work?
>>>
>>> That's a fair question. Are there any good resources documenting the
>>> durability/stability of the different runners? I assume there are some
>>> stability requirements regarding their handling of "bundles", but it
>>> would be nice to have that info available. One of the reasons we are
>>> targeting the Direct runner is to let us work with the project and
>>> temporarily delay picking a runner. Durability seems like another
>>> important aspect to evaluate.
>>
>> Could you describe your durability requirements a bit more?
>> All the major runners have comparable durability guarantees for
>> processing within a running pipeline (these are required by the Beam
>> model). The differences arise at the boundaries: what happens when you
>> stop the pipeline, can the pipeline be updated with new code while
>> keeping the old state, etc.
>>
>> An often confusing area is side effects (like committing Kafka offsets
>> in your case): users always have to assume that processing might be
>> retried (even if it rarely occurs).
>>
>>> On Tue, Aug 21, 2018 at 4:24 PM, Raghu Angadi <[email protected]>
>>> wrote:
>>>
>>>> On Tue, Aug 21, 2018 at 2:04 PM Lukasz Cwik <[email protected]> wrote:
>>>>
>>>>> Is there a reason you can't trust the runner to be durable storage
>>>>> for in-process work?
>>>>>
>>>>> I can understand that the DirectRunner only stores things in memory,
>>>>> but other runners have stronger durability guarantees.
>>>>
>>>> I think the requirement is about producing a side effect (committing
>>>> offsets to Kafka) after some processing completes in the pipeline. The
>>>> Wait() transform helps with that, but the user still has to commit the
>>>> offsets explicitly; there is no equivalent built into KafkaIO.
>>>>
>>>>> On Tue, Aug 21, 2018 at 9:58 AM Raghu Angadi <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> I think by 'KafkaUnboundedSource checkpointing' you mean enabling
>>>>>> 'commitOffsetsInFinalize()' on the KafkaIO source.
>>>>>> It is a better option than enable.auto.commit, but it does not do
>>>>>> exactly what you want here: the commit happens after the first stage
>>>>>> ('Simple Transformation' in your case) completes. This is certainly
>>>>>> true for Dataflow, and I think it is also the case for the
>>>>>> DirectRunner.
>>>>>>
>>>>>> I don't see a way to leverage the built-in checkpointing for external
>>>>>> consistency. You would have to commit the offsets manually.
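[Inline note from me, to check my understanding of the two options above.]

If I'm reading this right, the checkpoint-based commit is just a flag on the source. A minimal sketch of what I think that looks like (the broker address, topic name, and String deserializers are placeholders for our actual config, and the usual Pipeline setup is omitted):

import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.kafka.common.serialization.StringDeserializer;

// Ask the runner to commit consumed offsets back to Kafka when it finalizes
// a checkpoint, i.e. after the first (fused) stage has processed the records,
// not after the whole pipeline has persisted them.
KafkaIO.Read<String, String> reader =
    KafkaIO.<String, String>read()
        .withBootstrapServers("broker1:9092")            // placeholder
        .withTopic("events")                             // placeholder
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        .commitOffsetsInFinalize();                      // instead of enable.auto.commit

And the Wait()-based alternative would, as I understand it, have roughly the shape below. This is only a sketch under my assumptions: TransformFn, PersistFn, ExtractOffsetsFn, and CommitOffsetsFn are hypothetical DoFns we would have to write ourselves (KafkaIO does not provide them), and the 'persisted' signal needs windowing whose windows eventually close so that Wait.on releases the held-back offsets.

import org.apache.beam.sdk.io.kafka.KafkaRecord;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Wait;
import org.apache.beam.sdk.values.PCollection;

// The offsets ride alongside the records; they are only emitted (and then
// committed back to Kafka by our own DoFn) once the corresponding window of
// the 'persisted' signal is done.
PCollection<KafkaRecord<String, String>> records =
    pipeline.apply("ReadKafka", reader);

PCollection<Void> persisted = records
    .apply("SimpleTransformation", ParDo.of(new TransformFn()))  // hypothetical
    .apply("Persist", ParDo.of(new PersistFn()));                // hypothetical

records
    .apply("ExtractOffsets", ParDo.of(new ExtractOffsetsFn()))   // hypothetical
    .apply("WaitForPersist", Wait.on(persisted))
    .apply("CommitOffsets", ParDo.of(new CommitOffsetsFn()));    // hypothetical, e.g. wrapping KafkaConsumer.commitSync()

Does that match what you had in mind?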
>>>>>> On Tue, Aug 21, 2018 at 8:55 AM Micah Whitacre <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> I'm starting with a very simple pipeline that will read from Kafka
>>>>>>> -> Simple Transformation -> GroupByKey -> Persist the data. We are
>>>>>>> also applying some simple windowing/triggering that will persist the
>>>>>>> data after every 100 elements or every 60 seconds, to balance slow
>>>>>>> trickles of data against not storing too much in memory. For now I'm
>>>>>>> running with the DirectRunner since this is just a small processing
>>>>>>> problem.
>>>>>>>
>>>>>>> With the potential for failure during the persisting of the data, we
>>>>>>> want to ensure that the Kafka offsets are not updated until we have
>>>>>>> successfully persisted the data. Looking at KafkaIO, it seems like
>>>>>>> our two options for persisting offsets are:
>>>>>>> * Kafka's enable.auto.commit
>>>>>>> * KafkaUnboundedSource checkpointing
>>>>>>>
>>>>>>> The first option would commit prematurely, before we could guarantee
>>>>>>> the data was persisted. Unfortunately I can't find many details about
>>>>>>> the checkpointing, so I was wondering if there is a way to configure
>>>>>>> or tune it more appropriately.
>>>>>>>
>>>>>>> Specifically, I'm hoping to understand the flow so I can rely on the
>>>>>>> built-in KafkaIO functionality without having to write our own offset
>>>>>>> management. Or is it more common to write your own?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Micah
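PS, mostly for the archives: the windowing/triggering from my original question (quoted above) looks roughly like the following. This is a sketch under my assumptions (global window with a repeated trigger, discarding fired panes); TransformFn and PersistFn are hypothetical stand-ins for our own transformation and sink DoFns, and the broker/topic values are placeholders.

import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.windowing.AfterFirst;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.joda.time.Duration;

pipeline
    .apply("ReadKafka", KafkaIO.<String, String>read()
        .withBootstrapServers("broker1:9092")             // placeholder
        .withTopic("events")                              // placeholder
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        .withoutMetadata())                               // -> PCollection<KV<String, String>>
    .apply("SimpleTransformation", ParDo.of(new TransformFn()))  // hypothetical DoFn
    // Fire a pane after 100 elements or 60 seconds, whichever comes first,
    // to balance slow trickles of data against buffering too much in memory.
    .apply(Window.<KV<String, String>>into(new GlobalWindows())
        .triggering(Repeatedly.forever(AfterFirst.of(
            AfterPane.elementCountAtLeast(100),
            AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardSeconds(60)))))
        .discardingFiredPanes()
        .withAllowedLateness(Duration.ZERO))
    .apply(GroupByKey.<String, String>create())
    .apply("Persist", ParDo.of(new PersistFn()));                 // hypothetical sink DoFn

The open question is where in this graph the offsets should be committed so that a failure in "Persist" never skips data.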
