Is there a reason you can't trust the runner to be durable storage for inprocess work?
I can understand that the DirectRunner only stores things in memory but other runners have stronger durability guarantees. On Tue, Aug 21, 2018 at 9:58 AM Raghu Angadi <[email protected]> wrote: > I think by 'KafkaUnboundedSource checkpointing' you mean enabling > 'commitOffsetsInFinalize()' on KafkaIO source. > It is better option than enable.auto.commit, but does not exactly do what > you want in this moment. It is invoked after the first stage ('Simple > Transformation' in your case). This is certainly true for Dataflow and I > think is also the case for DirectRunner. > > I don't see way to leverage built-in checkpoint for consistency > externally. You would have to manually commit offsets. > > On Tue, Aug 21, 2018 at 8:55 AM Micah Whitacre <[email protected]> > wrote: > >> I'm starting with a very simple pipeline that will read from Kafka -> >> Simple Transformation -> GroupByKey -> Persist the data. We are also >> applying some simple windowing/triggering that will persist the data after >> every 100 elements or every 60 seconds to balance slow trickles of data as >> well as not storing too much in memory. For now I'm just running with the >> DirectRunner since this is just a small processing problem. >> >> With the potential for failure during the persisting of the data, we want >> to ensure that the Kafka offsets are not updated until we have successfully >> persisted the data. Looking at KafkaIO it seems like our two options for >> persisting offsets are: >> * Kafka's enable.auto.commit >> * KafkaUnboundedSource checkpointing. >> >> The first option would commit prematurely before we could guarantee the >> data was persisted. I can't unfortunately find many details about the >> checkpointing so I was wondering if there was a way to configure it or tune >> it more appropriately. >> >> Specifically I'm hoping to understand the flow so I can rely on the built >> in KafkaIO functionality without having to write our own offset >> management. Or is it more common to write your own? >> >> Thanks, >> Micah >> >
