On Tue, Aug 21, 2018 at 2:04 PM Lukasz Cwik <lc...@google.com> wrote:

> Is there a reason you can't trust the runner to be durable storage for
> in-process work?
>
> I can understand that the DirectRunner only stores things in memory but
> other runners have stronger durability guarantees.
>

I think the requirement is about producing a side effect (committing offsets
to Kafka) only after some processing in the pipeline completes. The Wait()
transform helps with that, but the user still has to commit the offsets
explicitly; KafkaIO does not provide equivalent functionality built in.
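
Roughly, the manual-commit pattern looks like this. A minimal, untested
sketch: CommitOffsetsFn, the "topic:partition" key encoding, and the consumer
config are my own illustrative assumptions, not KafkaIO API, and the config
map must be serializable and include group.id plus byte-array deserializers.

import java.util.Collections;
import java.util.Map;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Wait;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

// Commits "topic:partition" -> max-persisted-offset pairs back to Kafka.
class CommitOffsetsFn extends DoFn<KV<String, Long>, Void> {
  private final Map<String, Object> config;
  private transient KafkaConsumer<byte[], byte[]> consumer;

  CommitOffsetsFn(Map<String, Object> config) { this.config = config; }

  @Setup
  public void setup() { consumer = new KafkaConsumer<>(config); }

  @ProcessElement
  public void process(ProcessContext c) {
    String[] tp = c.element().getKey().split(":");
    consumer.commitSync(Collections.singletonMap(
        new TopicPartition(tp[0], Integer.parseInt(tp[1])),
        // Kafka convention: commit the offset of the next record to read.
        new OffsetAndMetadata(c.element().getValue() + 1)));
  }

  @Teardown
  public void teardown() { if (consumer != null) consumer.close(); }
}

// Wait.on() holds the offsets back until the corresponding window of
// 'persisted' (the output of the persist step) is complete, so the commit
// happens only after the data is durably written.
static PCollection<Void> commitAfterPersist(
    PCollection<KV<String, Long>> offsets, PCollection<?> persisted,
    Map<String, Object> consumerConfig) {
  return offsets
      .apply(Wait.on(persisted))
      .apply(ParDo.of(new CommitOffsetsFn(consumerConfig)));
}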

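For completeness, the built-in option that comes up below is enabled with a
single call on the source. A minimal sketch, where the broker address, topic,
and group id are placeholders, and updateConsumerProperties is the consumer
config hook in the Java SDK of that era:

import java.util.Collections;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;

pipeline.apply(
    KafkaIO.<String, String>read()
        .withBootstrapServers("broker:9092")              // placeholder
        .withTopic("my-topic")                            // placeholder
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        // A group.id is required for committing offsets back to Kafka.
        .updateConsumerProperties(Collections.singletonMap(
            ConsumerConfig.GROUP_ID_CONFIG, (Object) "my-group"))
        // Commit offsets when a checkpoint is finalized, i.e. after the
        // first fused stage, not after the whole pipeline.
        .commitOffsetsInFinalize()
        .withoutMetadata());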

> On Tue, Aug 21, 2018 at 9:58 AM Raghu Angadi <rang...@google.com> wrote:
>
>> I think by 'KafkaUnboundedSource checkpointing' you mean enabling
>> 'commitOffsetsInFinalize()' on the KafkaIO source.
>> It is a better option than enable.auto.commit, but it does not do exactly
>> what you want here: it is invoked after the first stage ('Simple
>> Transformation' in your case), not after the whole pipeline. This is
>> certainly true for Dataflow, and I think it is also the case for the
>> DirectRunner.
>>
>> I don't see a way to leverage the built-in checkpointing for external
>> consistency. You would have to commit the offsets manually.
>>
>> On Tue, Aug 21, 2018 at 8:55 AM Micah Whitacre <mkwhita...@gmail.com>
>> wrote:
>>
>>> I'm starting with a very simple pipeline that will read from Kafka ->
>>> Simple Transformation -> GroupByKey -> Persist the data.  We are also
>>> applying some simple windowing/triggering that persists the data after
>>> every 100 elements or every 60 seconds, to balance handling slow trickles
>>> of data against not holding too much in memory.  For now I'm running with
>>> the DirectRunner since this is a small processing problem.
>>>
>>> Since persisting the data can fail, we want to ensure that the Kafka
>>> offsets are not committed until the data has been successfully persisted.
>>> Looking at KafkaIO, it seems our two options for committing offsets are:
>>> * Kafka's enable.auto.commit
>>> * KafkaUnboundedSource checkpointing.
>>>
>>> The first option would commit offsets prematurely, before we could
>>> guarantee the data was persisted.  Unfortunately, I can't find many
>>> details about the checkpointing, so I was wondering if there is a way to
>>> configure or tune it more appropriately.
>>>
>>> Specifically, I'm hoping to understand the flow so I can rely on the
>>> built-in KafkaIO functionality without having to write our own offset
>>> management.  Or is it more common to write your own?
>>>
>>> Thanks,
>>> Micah
>>>
>>
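
For reference, the '100 elements or every 60 seconds' triggering described in
the original message above maps onto something like the following. A sketch
only: the input collection, its element type, and the discarding mode are
assumptions.

import org.apache.beam.sdk.transforms.windowing.AfterFirst;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Duration;

input.apply(Window.<KV<String, String>>into(new GlobalWindows())
    .triggering(Repeatedly.forever(AfterFirst.of(
        // Fire after 100 elements ...
        AfterPane.elementCountAtLeast(100),
        // ... or 60 seconds after the first element, whichever comes first.
        AfterProcessingTime.pastFirstElementInPane()
            .plusDelayOf(Duration.standardSeconds(60)))))
    // A non-default trigger requires allowed lateness to be set explicitly.
    .withAllowedLateness(Duration.ZERO)
    .discardingFiredPanes());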
