Hi Folks,
I am trying to handle Kafka messages in a Storm topology and make sure
messages are processed at least once, by maintaining a queue in the
spouts and evicting entries when an ack is received.
Any ideas on which of the two approaches below would be better?
Approach 1:
1. Use the low-level API and manage offset state in Zookeeper per
partition.
2. nextTuple gets a batch of tuples from a Kafka partition whenever the
in-memory queue is empty, and keeps them in that queue.
3. Keep an in-memory sorted set of the offset of each emitted tuple.
4. When an ack is received, remove that offset from the sorted set.
5. Periodically write the smallest offset in the sorted set to Zookeeper,
per partition, because offsets are sequential within a partition but not
across partitions.
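The offset bookkeeping in steps 3-5 could be sketched roughly as below. This is a minimal illustration, not Storm code; the class and method names are my own, and the periodic Zookeeper write is left out (commitOffset() just computes the value that would be written):

```java
import java.util.concurrent.ConcurrentSkipListSet;

// Hypothetical sketch of Approach 1's per-partition offset tracking.
// Offsets of emitted-but-unacked tuples live in a sorted set; the
// smallest pending offset is the safe point to commit to Zookeeper.
public class OffsetTracker {
    // Sorted set of offsets that have been emitted but not yet acked.
    private final ConcurrentSkipListSet<Long> pending = new ConcurrentSkipListSet<>();
    // One past the highest offset emitted so far; used when nothing is pending.
    private long nextOffset;

    public OffsetTracker(long startOffset) {
        this.nextOffset = startOffset;
    }

    // Called when the spout emits a tuple read at this offset.
    public void emitted(long offset) {
        pending.add(offset);
        nextOffset = Math.max(nextOffset, offset + 1);
    }

    // Called from the spout's ack(): drop the offset from the sorted set.
    public void acked(long offset) {
        pending.remove(offset);
    }

    // The offset that is safe to write to Zookeeper: every tuple below it
    // has been acked, so a restart from here replays at most the unacked ones.
    public long commitOffset() {
        return pending.isEmpty() ? nextOffset : pending.first();
    }
}
```

Note the commit point only advances past an offset once every smaller offset is acked, which is what gives at-least-once behaviour on restart (some acked tuples above a pending one may be replayed, but none are lost).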
Approach 2:
1. Maintain a max size on the Storm Kafka fetcher queue: if the queue's
load factor is < 0.25, fetch items using the high-level consumer; don't
fetch if the load factor is > 0.8.
This way we don't have to keep track of the Kafka internals, although it
is less flexible than Approach 1.
(Kafka fetcher queue: keeps a queue of tuples to be replayed in case of
node failures, and evicts them when they are successfully acked.)
The queue's max size and load-factor thresholds can be made configurable
in the config file based on the machine's capabilities.
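The load-factor gate in Approach 2 could look something like this. The text doesn't say what happens between the two thresholds, so this sketch assumes hysteresis: keep the previous decision while the load factor sits between 0.25 and 0.8, so the consumer doesn't flap around a single cutoff. The class name and constructor are assumptions:

```java
// Hypothetical sketch of Approach 2's fetch gating (names and the
// in-between behaviour are my assumptions; thresholds come from the text).
public class FetchGate {
    private final int maxSize;     // configurable max queue size
    private final double lowWater; // e.g. 0.25: resume fetching below this
    private final double highWater; // e.g. 0.8: stop fetching above this
    private boolean fetching = true;

    public FetchGate(int maxSize, double lowWater, double highWater) {
        this.maxSize = maxSize;
        this.lowWater = lowWater;
        this.highWater = highWater;
    }

    // Decide whether to issue a high-level-consumer fetch, given the
    // current number of unacked tuples sitting in the replay queue.
    public boolean shouldFetch(int queueSize) {
        double loadFactor = (double) queueSize / maxSize;
        if (loadFactor < lowWater) {
            fetching = true;        // queue has drained: resume fetching
        } else if (loadFactor > highWater) {
            fetching = false;       // queue is nearly full: stop fetching
        }
        // Between the two thresholds, keep the previous decision (hysteresis).
        return fetching;
    }
}
```

The spout would consult shouldFetch() in nextTuple() before calling the high-level consumer, with maxSize and the two thresholds read from the topology config as described above.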
--
Best Regards,
Jyotirmoy Sundi