I think Beam's checkpointing is what should be stated explicitly. I would say something like: Kafka offset commits and group.ids aren't necessary because Beam has its own checkpointing.
The next question a person would have is whether checkpointing is distributed and redundant like Kafka's offset topic.

On Wed, Jun 8, 2016 at 4:07 PM Raghu Angadi <[email protected]> wrote:

> Thanks for the feedback.
>
> On Wed, Jun 8, 2016 at 12:35 PM, Jesse Anderson <[email protected]> wrote:
>
>> The second thing is about the JavaDoc on the read method. I think the
>> JavaDocs should talk about the differences between this and a default
>> KafkaConsumer. Since there isn't a group.id required to be set, I think
>> the JavaDoc should mention the implications of this. It would say something
>> about always starting from the latest data. Also, it should mention that
>> offsets won't be kept and restarting the processing will not start at the
>> last offset; it will start at the latest data again.
>
> The JavaDoc currently states this
> <https://github.com/apache/incubator-beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L156>
> about starting from the latest offset:
>
>     When the pipeline starts for the first time without any checkpoint, the
>     source starts consuming from the <em>latest</em> offsets. You can
>     override this behavior to consume from the beginning by setting
>     appropriate properties in {@link ConsumerConfig}, through
>     {@link Read#updateConsumerProperties(Map)}
>
> It says that by default it reads from the latest position. I deliberately left
> out very Kafka-specific details (e.g. the actual Kafka configuration you need
> to set in order to start from the beginning, or to enable Kafka auto-commit,
> which would provide an (external) soft checkpoint of the read position).
> KafkaIO gives you full control over the configuration passed to KafkaConsumer
> (see the "Advanced Kafka Configuration
> <https://github.com/apache/incubator-beam/blob/master/sdks/java/io/kafka/src/main/java/org/apache/beam/sdk/io/kafka/KafkaIO.java#L198>"
> section in the JavaDoc).
> I didn't explicitly mention the implications of 'starting a job' vs
> 'updating a job', which is a Beam/Dataflow feature common to all sources. An
> update starts from the checkpoint (ensuring KafkaIO resumes reading from
> where it left off), but a fresh start does not have any checkpointed state,
> so KafkaIO reads from the latest offset (by default). It would be better to
> call this out explicitly, as many new users may not be aware of it.
>
> Please suggest any other improvements.
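To make the "consume from the beginning" override concrete, a sketch of what it might look like with KafkaIO. updateConsumerProperties comes from the quoted JavaDoc; the topic name, broker address, and key/value types are made up for illustration, and the other builder method names are assumptions that may differ between Beam versions:

```java
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import com.google.common.collect.ImmutableMap;

// Sketch: a KafkaIO source that, on a fresh start with no checkpoint,
// begins from the earliest available offsets instead of the default
// 'latest'. On an *update* of a running job, the checkpointed offsets
// win and this setting is not consulted.
KafkaIO.Read<byte[], byte[]> read =
    KafkaIO.read()
        .withBootstrapServers("broker-1:9092")      // hypothetical broker
        .withTopics(ImmutableList.of("my-topic"))   // hypothetical topic
        // Properties here are passed straight through to the underlying
        // KafkaConsumer, so any ConsumerConfig key can be set.
        .updateConsumerProperties(
            ImmutableMap.<String, Object>of(
                ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest"));
```

Note that no group.id is set: Beam's own checkpointing, not Kafka's consumer-group offset commits, is what makes the source resumable.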
