[ 
https://issues.apache.org/jira/browse/KAFKA-1788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14247766#comment-14247766
 ] 

Bob Potter commented on KAFKA-1788:
-----------------------------------

I've been digging into this a little bit. In addition to an individual 
partition being unavailable, there is also a case where all brokers become 
unavailable and we are unable to refresh metadata. This is a distinct case 
because the producer still thinks it has a leader for the partition (AFAICT, 
the metadata is never updated). The behavior I have seen is that the producer 
will continue to accept records for any partition which previously had a 
leader, but the batches will never exit the accumulator.

It seems like we could track how long it has been since we were last able to 
connect to any known broker and, after a certain threshold, complete all 
outstanding record batches with an error and reset the metadata so that new 
production attempts don't end up in the accumulator.
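
A rough sketch of that tracking idea, only to make the proposal concrete. 
Everything here is hypothetical: BrokerAvailabilityTracker, its methods, and 
the threshold parameter are illustrative names and are not part of the 
producer code. Time is passed in explicitly so the logic is easy to test.

```java
// Hypothetical helper, not an existing Kafka class: remembers the last time
// any broker connection succeeded and reports when the cluster has been
// unreachable for at least a configured threshold.
class BrokerAvailabilityTracker {
    private final long thresholdMs;   // assumed config value, name is made up
    private long lastSuccessMs;

    BrokerAvailabilityTracker(long thresholdMs, long nowMs) {
        this.thresholdMs = thresholdMs;
        this.lastSuccessMs = nowMs;
    }

    // Call whenever a connection to any known broker succeeds.
    void onConnectionSuccess(long nowMs) {
        this.lastSuccessMs = nowMs;
    }

    // True once no broker has been reachable for thresholdMs or longer.
    // At that point the caller could fail outstanding batches and reset
    // its metadata.
    boolean clusterUnavailable(long nowMs) {
        return nowMs - lastSuccessMs >= thresholdMs;
    }
}
```

The sender loop would presumably check clusterUnavailable() on each pass and, 
when it returns true, complete the outstanding batches with an error.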

On the other hand, we could just start failing record batches if they have been 
in the accumulator for too long. That would solve both failure scenarios, 
although it seems like we should still be resetting the metadata for an 
unavailable cluster at some point.
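
To illustrate the second option, here is a minimal sketch of age-based 
expiry. It is not the real RecordAccumulator: ExpiringAccumulator, Batch, and 
maxAgeMs are hypothetical names, and a real change would complete each expired 
batch's callbacks with a timeout error rather than just returning the batches.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the accumulator, for illustration only.
class ExpiringAccumulator {
    // Hypothetical batch that only records its partition and creation time.
    static final class Batch {
        final String topicPartition;
        final long createdMs;

        Batch(String topicPartition, long createdMs) {
            this.topicPartition = topicPartition;
            this.createdMs = createdMs;
        }
    }

    private final Deque<Batch> batches = new ArrayDeque<>();
    private final long maxAgeMs;   // assumed config value, name is made up

    ExpiringAccumulator(long maxAgeMs) {
        this.maxAgeMs = maxAgeMs;
    }

    void append(String topicPartition, long nowMs) {
        batches.addLast(new Batch(topicPartition, nowMs));
    }

    // Removes and returns every batch that has waited longer than maxAgeMs.
    // The caller would fail these batches instead of leaving them queued.
    List<Batch> expire(long nowMs) {
        List<Batch> expired = new ArrayList<>();
        Iterator<Batch> it = batches.iterator();
        while (it.hasNext()) {
            Batch batch = it.next();
            if (nowMs - batch.createdMs > maxAgeMs) {
                expired.add(batch);
                it.remove();
            }
        }
        return expired;
    }

    int size() {
        return batches.size();
    }
}
```

Because expiry depends only on batch age, it covers both the leaderless 
partition and the unreachable cluster without needing to know which one 
happened.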

> producer record can stay in RecordAccumulator forever if leader is no 
> available
> -------------------------------------------------------------------------------
>
>                 Key: KAFKA-1788
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1788
>             Project: Kafka
>          Issue Type: Bug
>          Components: core, producer 
>    Affects Versions: 0.8.2
>            Reporter: Jun Rao
>            Assignee: Jun Rao
>              Labels: newbie++
>             Fix For: 0.8.3
>
>
> In the new producer, when a partition has no leader for a long time (e.g., 
> all replicas are down), the records for that partition will stay in the 
> RecordAccumulator until the leader is available. This may cause the 
> bufferpool to be full and the callback for the produced message to block for 
> a long time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
