Raman Gupta created KAFKA-10229:
-----------------------------------

             Summary: Kafka stream dies when earlier shut-down node leaves group, no errors logged on client
                 Key: KAFKA-10229
                 URL: https://issues.apache.org/jira/browse/KAFKA-10229
             Project: Kafka
          Issue Type: Bug
          Components: streams
    Affects Versions: 2.4.1
            Reporter: Raman Gupta


My broker and clients are on 2.4.1. I'm currently running a single broker. I have 
a Kafka stream with exactly-once processing turned on, and an uncaught exception 
handler defined on the client. I noticed this stream was lagging; upon 
investigation, I saw that the consumer group was empty.

On restarting the consumers, the consumer group re-established itself, but 
after about 8 minutes, the group became empty again. There is nothing logged on 
the client side about any stream errors, despite the existence of an uncaught 
exception handler.
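
Group membership can be confirmed from the admin client as well, roughly like this (bootstrap address is a placeholder); when the problem occurs, the members collection comes back empty:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ConsumerGroupDescription;

public class CheckGroupMembers {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            ConsumerGroupDescription group = admin
                    .describeConsumerGroups(Collections.singleton("produs-cisFileIndexer-stream"))
                    .all().get()
                    .get("produs-cisFileIndexer-stream");
            System.out.println("State: " + group.state());
            group.members().forEach(member -> System.out.println(
                    "Member: " + member.consumerId() + " (client " + member.clientId() + ")"));
        }
    }
}
```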

In the broker logs, about 8 minutes after the clients restart and the stream 
transitions to the RUNNING state, I see:

```
[2020-07-02 17:34:47,033] INFO [GroupCoordinator 0]: Member 
cis-d7fb64c95-kl9wl-1-630af77f-138e-49d1-b76a-6034801ee359 in group 
produs-cisFileIndexer-stream has failed, removing it from the group 
(kafka.coordinator.group.GroupCoordinator)
[2020-07-02 17:34:47,033] INFO [GroupCoordinator 0]: Preparing to rebalance 
group produs-cisFileIndexer-stream in state PreparingRebalance with old 
generation 228 (__consumer_offsets-3) (reason: removing member 
cis-d7fb64c95-kl9wl-1-630af77f-138e-49d1-b76a-6034801ee359 on heartbeat 
expiration) (kafka.coordinator.group.GroupCoordinator)
```

So according to this, the consumer heartbeat has expired. I don't know why this 
would be: client logging shows that the stream was running and processing 
messages normally, then simply stopped processing anything about 4 minutes 
before it died, with no apparent errors or issues and nothing logged via the 
uncaught exception handler.
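
For context, these are the consumer settings that govern heartbeat expiration and poll stalls (the values shown are the stock 2.4 consumer defaults, not necessarily what this application runs with); they can be passed through to the Streams-managed consumers via the consumer prefix:

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.streams.StreamsConfig;

public class HeartbeatRelatedTimeouts {
    public static void main(String[] args) {
        Properties props = new Properties();
        // The broker removes a member on "heartbeat expiration" when no heartbeat
        // arrives within session.timeout.ms; heartbeats are sent by a background
        // thread every heartbeat.interval.ms, independently of poll().
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG), 10000);
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG), 3000);
        // max.poll.interval.ms bounds the time between poll() calls; a thread that
        // stalls longer than this leaves the group voluntarily instead.
        props.put(StreamsConfig.consumerPrefix(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG), 300000);
        props.forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

Since heartbeats come from their own background thread, their expiration here suggests the whole client was stalled or unreachable rather than merely slow to poll.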

It doesn't appear to be related to any specific poison-pill message: restarting 
the stream causes it to reprocess a bunch more messages from the backlog and 
then die again approximately 8 minutes later. At the time of the last message 
consumed by the stream, there are no logs at `INFO` level or above on either 
the client or the broker, nor any errors whatsoever. Stream consumption simply 
stops.

There are two consumers; even if I limit consumption to only a single consumer, 
the same thing happens.

The runtime environment is Kubernetes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
