[jira] [Updated] (KAFKA-2909) Another Instance of Gap in Consumption after Restart

Ben Stopford (JIRA) Mon, 30 Nov 2015 07:10:36 -0800

     [ 
https://issues.apache.org/jira/browse/KAFKA-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Ben Stopford updated KAFKA-2909:
--------------------------------
    Description: 
This seems very similar to Rajini's reported KAFAK-2891

*Context*
The context is Seurity Rolling Upgrade with 30s consumer timeout. There was a 
2s sleep between restarts. Throughput was limited to 1000 messages per second. 

*Failure*
At least one acked message did not appear in the consumed messages. 
acked_minus_consumed: set(36802, 36804, 36805, 36807, 36808, 36810, 36811, 
64403, 64406, 64409, 36799)


Missing data was correctly written to Kafka data files:
{quote}
value 36802 -> partition 1,offset: 12216

kafka/bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files 
worker7/kafka-data-logs/test_topic-1/00000000000000000000.log | grep 'offset: 
12216'

-> offset: 12216 position: 374994 isvalid: true payloadsize: 5 magic: 0 
compresscodec: NoCompressionCodec crc: 3001177408

in all three data files. So the data is there. 
{quote}

The first missing value was written at: 20:42:30,185, which is around the time 
the third node goes down. 

The failed writes correlate with the consumer logging out 
NOT_COORDINATOR_FOR_GROUP and Marking the coordinator. There are many of these 
messages though over a long period so it’s hard to infer this as being the 
cause or specifically correlating with the error. 

*Timeline*
{quote}
grep -r 'shutdown complete' *
20:42:06,132 - Node 1 shutdown completed 
20:42:18,560 - Node 2 shutdown completed 
20:42:30,185 - *Writes that never make it are written by producer*
20:42:31,164 - Node 3 shutdown completed 
20:42:57,872 - Node 1 shutdown completed 
…
{quote}

Logging was verbose during this run so log files can be found here: [click 
me|https://www.dropbox.com/s/owwzh37cs304qh4/2015-11-26--001%20%283%29.tar.gz?dl=0]


  was:
This seems very similar to Rajini's reported KAFAK-2891

*Context*
The context is Seurity Rolling Upgrade with 30s consumer timeout. There was a 
2s sleep between restarts. Throughput was limited to 1000 messages per second. 

*Failure*
At least one acked message did not appear in the consumed messages. 
acked_minus_consumed: set(36802, 36804, 36805, 36807, 36808, 36810, 36811, 
64403, 64406, 64409, 36799)


Missing data was correctly written to Kafka data files:
{quote}
value 36802 -> partition 1,offset: 12216

kafka/bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files 
worker7/kafka-data-logs/test_topic-1/00000000000000000000.log | grep 'offset: 
12216'

-> offset: 12216 position: 374994 isvalid: true payloadsize: 5 magic: 0 
compresscodec: NoCompressionCodec crc: 3001177408

in all three data files. So the data is there. 
{quote}

The first missing value was written at: 20:42:30,185, which is around the time 
the third node goes down. 

The failed writes correlate with the consumer logging out 
NOT_COORDINATOR_FOR_GROUP and Marking the coordinator. There are many of these 
messages though over a long period so it’s hard to infer this as being the 
cause or specifically correlating with the error. 

*Timeline*
{quote}
grep -r 'shutdown complete' *
20:42:06,132 - Node 1 shutdown completed 
20:42:18,560 - Node 2 shutdown completed 
20:42:30,185 - *Writes that never make it are written by producer*
20:42:31,164 - Node 3 shutdown completed 
20:42:57,872 - Node 1 shutdown completed 
…
{quote}

All logs for this incident are attached



> Another Instance of Gap in Consumption after Restart
> ----------------------------------------------------
>
>                 Key: KAFKA-2909
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2909
>             Project: Kafka
>          Issue Type: Sub-task
>          Components: consumer
>            Reporter: Ben Stopford
>
> This seems very similar to Rajini's reported KAFAK-2891
> *Context*
> The context is Seurity Rolling Upgrade with 30s consumer timeout. There was a 
> 2s sleep between restarts. Throughput was limited to 1000 messages per 
> second. 
> *Failure*
> At least one acked message did not appear in the consumed messages. 
> acked_minus_consumed: set(36802, 36804, 36805, 36807, 36808, 36810, 36811, 
> 64403, 64406, 64409, 36799)
> Missing data was correctly written to Kafka data files:
> {quote}
> value 36802 -> partition 1,offset: 12216
> kafka/bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files 
> worker7/kafka-data-logs/test_topic-1/00000000000000000000.log | grep 'offset: 
> 12216'
> -> offset: 12216 position: 374994 isvalid: true payloadsize: 5 magic: 0 
> compresscodec: NoCompressionCodec crc: 3001177408
> in all three data files. So the data is there. 
> {quote}
> The first missing value was written at: 20:42:30,185, which is around the 
> time the third node goes down. 
> The failed writes correlate with the consumer logging out 
> NOT_COORDINATOR_FOR_GROUP and Marking the coordinator. There are many of 
> these messages though over a long period so it’s hard to infer this as being 
> the cause or specifically correlating with the error. 
> *Timeline*
> {quote}
> grep -r 'shutdown complete' *
> 20:42:06,132 - Node 1 shutdown completed 
> 20:42:18,560 - Node 2 shutdown completed 
> 20:42:30,185 - *Writes that never make it are written by producer*
> 20:42:31,164 - Node 3 shutdown completed 
> 20:42:57,872 - Node 1 shutdown completed 
> …
> {quote}
> Logging was verbose during this run so log files can be found here: [click 
> me|https://www.dropbox.com/s/owwzh37cs304qh4/2015-11-26--001%20%283%29.tar.gz?dl=0]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (KAFKA-2909) Another Instance of Gap in Consumption after Restart

Reply via email to