I think Beam's checkpointing is what should be stated explicitly. I would say something like: Kafka offset commits and group.ids aren't necessary because Beam has its own checkpointing.
The next question a person would have is whether checkpointing is distributed and redundant like Kafka's offset topic.

On Wed, Jun 8, 2016 at 4:07 PM Raghu Angadi <[email protected]> wrote:

> Thanks for the feedback.
>
> On Wed, Jun 8, 2016 at 12:35 PM, Jesse Anderson <[email protected]> wrote:
>
>> The second thing is about the JavaDoc on the read method. I think the
>> JavaDocs should talk about the differences between this and a default
>> KafkaConsumer. Since there isn't a group.id required to be set, I think
>> the JavaDoc should mention the implications of this. It would say something
>> about always starting from the latest data. Also, it should mention that
>> offsets won't be kept and restarting the processing will not start at the
>> last offset; it will start at the latest data again.
>
> The JavaDoc currently states this
> <https://github.com/apache/incubator-beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L156>
> about starting from the latest offset:
>
>     When the pipeline starts for the first time without any checkpoint, the
>     source starts consuming from the <em>latest</em> offsets. You can
>     override this behavior to consume from the beginning by setting
>     appropriate properties in {@link ConsumerConfig}, through
>     {@link Read#updateConsumerProperties(Map)}
>
> It says that by default it reads from the latest position. I deliberately left
> out very Kafka-specific details (e.g. the actual Kafka configuration you need
> to set in order to start from the beginning, or to enable Kafka auto-commit,
> which would provide an (external) soft checkpoint of the read position).
> KafkaIO gives you full control over the configuration passed to KafkaConsumer
> (see the "Advanced Kafka Configuration
> <https://github.com/apache/incubator-beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L198>"
> section in the JavaDoc).
> I didn't explicitly mention the implications of 'starting a job' vs
> 'updating a job', which is a Beam/Dataflow feature common to all sources. An
> update starts from the checkpoint (ensuring KafkaIO resumes reading from
> where it left off), but a fresh start does not have any checkpointed state,
> so KafkaIO reads from the latest offset (by default). It would be better to
> call this out explicitly, as many new users may not be aware of it.
>
> Please suggest any other improvements.
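To make the "consume from the beginning" override concrete, a sketch of what it might look like with KafkaIO. updateConsumerProperties comes from the quoted JavaDoc; the topic name, broker address, and key/value types are made up for illustration, and the other builder method names are assumptions that may differ between Beam versions:

```java
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import com.google.common.collect.ImmutableMap;

// Sketch: a KafkaIO source that, on a fresh start with no checkpoint,
// begins from the earliest available offsets instead of the default
// 'latest'. On an *update* of a running job, the checkpointed offsets
// win and this setting is not consulted.
KafkaIO.Read<byte[], byte[]> read =
    KafkaIO.read()
        .withBootstrapServers("broker-1:9092")      // hypothetical broker
        .withTopics(ImmutableList.of("my-topic"))   // hypothetical topic
        // Properties here are passed straight through to the underlying
        // KafkaConsumer, so any ConsumerConfig key can be set.
        .updateConsumerProperties(
            ImmutableMap.<String, Object>of(
                ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest"));
```

Note that no group.id is set: Beam's own checkpointing, not Kafka's consumer-group offset commits, is what makes the source resumable.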
