On Sun, Sep 25, 2022 at 4:56 AM Yomal de Silva <yomal.prav...@gmail.com> wrote:
> Hi all,
>
> I have started using KafkaIO to read a data stream and have the following
> questions. Appreciate it if you could provide a few clarifications on the
> following.
>
> 1. Does KafkaIO ignore the offset stored in the broker and use the offset
> stored during checkpointing when consuming messages?
>

Generally yes, as that's the only way to guarantee consistency (we can't atomically commit to the runner and to Kafka). However, when starting a new pipeline, you should be able to start reading at the offset committed to the broker.

> 2. How many threads will be used by the Kafka consumer?
>

This depends somewhat on the runner, but you can expect one thread per partition.

> 3. If the consumer polls a set of messages A and then later B while A is
> still being processed, is there a possibility of set B finishing before A?
> Does parallelism control this?
>

Yes. Beam doesn't currently have any notion of ordering. All messages are independent and can be processed at different times (the source also reserves the right to process different ranges of a single Kafka partition on different threads, though it doesn't currently do this).

> 4. In the above scenario if B is committed back to the broker and somehow
> A failed, upon a restart is there any way we can consume A again without
> losing data?
>

Data should never be lost. If B is processed, then you can assume that the A data is checkpointed inside the Beam runner and will be processed too.

> Thank you.
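
In case it helps with questions 1 and 4, here is a rough toy model of the offset precedence described above. It is plain Python, not Beam or KafkaIO code: the function names `resume_offset` and `read_batch` and the sample offsets are made up for illustration, and the real source is considerably more involved.

```python
from typing import List, Optional, Tuple


def resume_offset(runner_checkpoint: Optional[int], broker_committed: int) -> int:
    """Pick the offset a reader starts from (toy model, not Beam's actual code)."""
    # A restored pipeline trusts the runner's checkpoint; only a brand-new
    # pipeline (no checkpoint yet) falls back to the broker's committed offset.
    if runner_checkpoint is not None:
        return runner_checkpoint
    return broker_committed


def read_batch(log: List[str], start: int, size: int) -> Tuple[List[str], int]:
    """Return up to `size` records from `start`, plus the next offset to read."""
    batch = log[start:start + size]
    return batch, start + len(batch)


# One partition with six records, indexed by offset.
log = [f"msg-{i}" for i in range(6)]

# New pipeline: no runner checkpoint, so reading starts at the broker's offset.
start = resume_offset(runner_checkpoint=None, broker_committed=2)
batch_a, checkpoint = read_batch(log, start, size=2)

# Suppose the broker-side offset has since moved ahead to 6. On restart the
# runner checkpoint still wins, so the records after batch A are not skipped.
restart = resume_offset(runner_checkpoint=checkpoint, broker_committed=6)
batch_after_restart, _ = read_batch(log, restart, size=2)

print(start, batch_a, checkpoint)
print(restart, batch_after_restart)
```

The point of the sketch is only that the broker offset is consulted when there is no runner checkpoint to restore from, so an offset committed "too far ahead" on the broker does not by itself cause records to be dropped.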