[jira] [Resolved] (KAFKA-5178) Potential Performance Degradation in Kafka Producer when using Multiple Threads

2017-05-05 Thread Ben Stopford (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Stopford resolved KAFKA-5178.
-
Resolution: Not A Problem

> Potential Performance Degradation in Kafka Producer when using Multiple 
> Threads
> ---
>
> Key: KAFKA-5178
> URL: https://issues.apache.org/jira/browse/KAFKA-5178
> Project: Kafka
>  Issue Type: Bug
>Reporter: Ben Stopford
>
> There is evidence that the Kafka Producer's performance degrades as we 
> increase the number of threads using it. 
> This is based on some benchmarking done in the community. I have not 
> independently validated these results. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (KAFKA-5178) Potential Performance Degradation in Kafka Producer when using Multiple Threads

2017-05-05 Thread Ben Stopford (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Stopford updated KAFKA-5178:

Attachment: (was: TestDetails.pdf)

> Potential Performance Degradation in Kafka Producer when using Multiple 
> Threads
> ---
>
> Key: KAFKA-5178
> URL: https://issues.apache.org/jira/browse/KAFKA-5178
> Project: Kafka
>  Issue Type: Bug
>Reporter: Ben Stopford
>
> There is evidence that the Kafka Producer's performance degrades as we 
> increase the number of threads using it. 
> This is based on some benchmarking done in the community. I have not 
> independently validated these results. Details of the test setup attached.
> ...
> *Effect of Shared KafkaProducer across threads*
>  
> The Kafka documentation recommends sharing the KafkaProducer across multiple 
> worker threads.
>  
> ||#Producers||#Consumers||#Topics||#Partitions per topic||RoundTrip 
> Throughput (events/sec)||Approx Broker Events (Millions/sec)||
> |1|1|1|1|268,312|0.5|
> |4|4|4|4|759,186|1.5|
> |8|8|8|8|640,738|1.2|
> |8|8|8|16|847,314|1.7|
> |8|8|8|48|17,791|0.035|
> |16|16|16|64|5,997|0.01|
>  
> Something appears to be wrong here: with 48 and 64 partitions the shared 
> KafkaProducer struggled, to the point that performance became quite bad.
>  
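For context, the pattern being benchmarked is a single producer instance shared by several worker threads, along these lines (a minimal sketch in Scala against the Java client; broker address, topic name, and thread/message counts are illustrative, not the benchmark's actual harness):

{code}
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object SharedProducerSketch extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092") // illustrative
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

  // One KafkaProducer instance, shared by all workers (the class is thread-safe).
  val producer = new KafkaProducer[String, String](props)
  val workers = (1 to 8).map { id =>
    new Thread(() => {
      for (i <- 0 until 100000)
        producer.send(new ProducerRecord("test-topic", s"key-$id", s"value-$i"))
    })
  }
  workers.foreach(_.start())
  workers.foreach(_.join())
  producer.close()
}
{code}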



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (KAFKA-5178) Potential Performance Degradation in Kafka Producer when using Multiple Threads

2017-05-05 Thread Ben Stopford (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Stopford updated KAFKA-5178:

Description: 
There is evidence that the Kafka Producer's performance degrades as we increase 
the number of threads using it. 

This is based on some benchmarking done in the community. I have not 
independently validated these results. 


  was:
There is evidence that the Kafka Producer's performance degrades as we increase 
the number of threads using it. 

This is based on some benchmarking done in the community. I have not 
independently validated these results. Details of the test setup attached.
...

*Effect of Shared KafkaProducer across threads*
 
The Kafka documentation recommends sharing the KafkaProducer across multiple 
worker threads.
 
||#Producers||#Consumers||#Topics||#Partitions per topic||RoundTrip Throughput 
(events/sec)||Approx Broker Events (Millions/sec)||
|1|1|1|1|268,312|0.5|
|4|4|4|4|759,186|1.5|
|8|8|8|8|640,738|1.2|
|8|8|8|16|847,314|1.7|
|8|8|8|48|17,791|0.035|
|16|16|16|64|5,997|0.01|
 
Something appears to be wrong here: with 48 and 64 partitions the shared 
KafkaProducer struggled, to the point that performance became quite bad.
 



> Potential Performance Degradation in Kafka Producer when using Multiple 
> Threads
> ---
>
> Key: KAFKA-5178
> URL: https://issues.apache.org/jira/browse/KAFKA-5178
> Project: Kafka
>  Issue Type: Bug
>Reporter: Ben Stopford
>
> There is evidence that the Kafka Producer's performance degrades as we 
> increase the number of threads using it. 
> This is based on some benchmarking done in the community. I have not 
> independently validated these results. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (KAFKA-5178) Potential Performance Degradation in Kafka Producer when using Multiple Threads

2017-05-05 Thread Ben Stopford (JIRA)
Ben Stopford created KAFKA-5178:
---

 Summary: Potential Performance Degradation in Kafka Producer when 
using Multiple Threads
 Key: KAFKA-5178
 URL: https://issues.apache.org/jira/browse/KAFKA-5178
 Project: Kafka
  Issue Type: Bug
Reporter: Ben Stopford
 Attachments: TestDetails.pdf

There is evidence that the Kafka Producer's performance degrades as we increase 
the number of threads using it. 

This is based on some benchmarking done in the community. I have not 
independently validated these results. Details of the test setup attached.
...

*Effect of Shared KafkaProducer across threads*
 
The Kafka documentation recommends sharing the KafkaProducer across multiple 
worker threads.
 
||#Producers||#Consumers||#Topics||#Partitions per topic||RoundTrip Throughput 
(events/sec)||Approx Broker Events (Millions/sec)||
|1|1|1|1|268,312|0.5|
|4|4|4|4|759,186|1.5|
|8|8|8|8|640,738|1.2|
|8|8|8|16|847,314|1.7|
|8|8|8|48|17,791|0.035|
|16|16|16|64|5,997|0.01|
 
Something appears to be wrong here: with 48 and 64 partitions the shared 
KafkaProducer struggled, to the point that performance became quite bad.
 




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (KAFKA-4623) Change Default unclean.leader.election.enabled from True to False

2017-05-02 Thread Ben Stopford (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Stopford updated KAFKA-4623:

Issue Type: New Feature  (was: Task)

> Change Default unclean.leader.election.enabled from True to False
> -
>
> Key: KAFKA-4623
> URL: https://issues.apache.org/jira/browse/KAFKA-4623
> Project: Kafka
>  Issue Type: New Feature
>Reporter: Ben Stopford
>Assignee: Ben Stopford
> Fix For: 0.11.0.0
>
>
> See KIP-106
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-106+-+Change+Default+unclean.leader.election.enabled+from+True+to+False



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (KAFKA-4623) Change Default unclean.leader.election.enabled from True to False

2017-05-02 Thread Ben Stopford (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Stopford updated KAFKA-4623:

Issue Type: Improvement  (was: New Feature)

> Change Default unclean.leader.election.enabled from True to False
> -
>
> Key: KAFKA-4623
> URL: https://issues.apache.org/jira/browse/KAFKA-4623
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Ben Stopford
>Assignee: Ben Stopford
> Fix For: 0.11.0.0
>
>
> See KIP-106
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-106+-+Change+Default+unclean.leader.election.enabled+from+True+to+False



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (KAFKA-4623) Change Default unclean.leader.election.enabled from True to False

2017-05-02 Thread Ben Stopford (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Stopford updated KAFKA-4623:

Issue Type: Task  (was: Improvement)

> Change Default unclean.leader.election.enabled from True to False
> -
>
> Key: KAFKA-4623
> URL: https://issues.apache.org/jira/browse/KAFKA-4623
> Project: Kafka
>  Issue Type: Task
>Reporter: Ben Stopford
>Assignee: Sharad
> Fix For: 0.11.0.0
>
>
> See KIP-106
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-106+-+Change+Default+unclean.leader.election.enabled+from+True+to+False



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (KAFKA-4623) Change Default unclean.leader.election.enabled from True to False

2017-05-02 Thread Ben Stopford (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Stopford reassigned KAFKA-4623:
---

Assignee: Ben Stopford  (was: Sharad)

> Change Default unclean.leader.election.enabled from True to False
> -
>
> Key: KAFKA-4623
> URL: https://issues.apache.org/jira/browse/KAFKA-4623
> Project: Kafka
>  Issue Type: Task
>Reporter: Ben Stopford
>Assignee: Ben Stopford
> Fix For: 0.11.0.0
>
>
> See KIP-106
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-106+-+Change+Default+unclean.leader.election.enabled+from+True+to+False



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (KAFKA-4623) Change Default unclean.leader.election.enabled from True to False

2017-05-02 Thread Ben Stopford (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Stopford reassigned KAFKA-4623:
---

Assignee: Ben Stopford  (was: Sharad)

> Change Default unclean.leader.election.enabled from True to False
> -
>
> Key: KAFKA-4623
> URL: https://issues.apache.org/jira/browse/KAFKA-4623
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Ben Stopford
>Assignee: Ben Stopford
> Fix For: 0.11.0.0
>
>
> See KIP-106
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-106+-+Change+Default+unclean.leader.election.enabled+from+True+to+False



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (KAFKA-5036) Followups from KIP-101

2017-04-10 Thread Ben Stopford (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962827#comment-15962827
 ] 

Ben Stopford edited comment on KAFKA-5036 at 4/10/17 6:42 PM:
--

Points 2, 5 are addressed in PR: https://github.com/apache/kafka/pull/2831
Point 3 is addressed in PR: https://github.com/apache/kafka/pull/2821 (now 
merged)
Point 4 has been addressed (fixing tests). Changed the log line to an assert 
(which [~junrao] and I discussed previously)


was (Author: benstopford):
Points 2, 5 are addressed in PR: https://github.com/apache/kafka/pull/2831
Point 3 is addressed in PR: https://github.com/apache/kafka/pull/2821 (now 
merged)
Point 4 is an issue only in the test. The LogTest sets up a Log with an empty 
epoch cache. Not addressed at this time.  

> Followups from KIP-101
> --
>
> Key: KAFKA-5036
> URL: https://issues.apache.org/jira/browse/KAFKA-5036
> Project: Kafka
>  Issue Type: Improvement
>Affects Versions: 0.11.0.0
>Reporter: Jun Rao
>Assignee: Jun Rao
> Fix For: 0.11.0.0
>
>
> 1. It would be safer to hold onto the leader lock in Partition while serving 
> an OffsetForLeaderEpoch request.
> 2. Currently, we update the leader epoch in epochCache after log append in 
> the follower but before log append in the leader. It would be more consistent 
> to always do this after log append. This also avoids issues related to 
> failure in log append.
> 3. OffsetsForLeaderEpochRequest/OffsetsForLeaderEpochResponse:
> The code that does grouping can probably be replaced by calling 
> CollectionUtils.groupDataByTopic(). Done: 
> https://github.com/apache/kafka/commit/359a68510801a22630a7af275c9935fb2d4c8dbf
> 4. The following line in LeaderEpochFileCache is hit several times when 
> LogTest is executed:
> {code}
>if (cachedLatestEpoch == None) error("Attempt to assign log end offset 
> to epoch before epoch has been set. This should never happen.")
> {code}
> This should be an assert (with the tests fixed up)
> 5. The constructor of LeaderEpochFileCache has the following:
> {code}
> lock synchronized { ListBuffer(checkpoint.read(): _*) }
> {code}
> But everywhere else uses a read or write lock. We should use consistent 
> locking.
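For point 5, a minimal sketch of what consistent locking in the constructor could look like, assuming the class keeps a ReentrantReadWriteLock named lock as the rest of the class does (the class skeleton and helper below are illustrative, not the actual Kafka source):

{code}
import java.util.concurrent.locks.ReentrantReadWriteLock
import scala.collection.mutable.ListBuffer

case class EpochEntry(epoch: Int, startOffset: Long)
trait Checkpoint { def read(): Seq[EpochEntry] }

class EpochCacheSketch(checkpoint: Checkpoint) {
  private val lock = new ReentrantReadWriteLock()

  private def inWriteLock[T](body: => T): T = {
    lock.writeLock().lock()
    try body finally lock.writeLock().unlock()
  }

  // Take the write lock here too, rather than `lock synchronized { ... }`,
  // so the constructor follows the same locking discipline as everything else.
  private val epochs: ListBuffer[EpochEntry] =
    inWriteLock { ListBuffer(checkpoint.read(): _*) }
}
{code}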



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (KAFKA-5036) Followups from KIP-101

2017-04-10 Thread Ben Stopford (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Stopford updated KAFKA-5036:

Description: 
1. It would be safer to hold onto the leader lock in Partition while serving an 
OffsetForLeaderEpoch request.

2. Currently, we update the leader epoch in epochCache after log append in the 
follower but before log append in the leader. It would be more consistent to 
always do this after log append. This also avoids issues related to failure in 
log append.

3. OffsetsForLeaderEpochRequest/OffsetsForLeaderEpochResponse:
The code that does grouping can probably be replaced by calling 
CollectionUtils.groupDataByTopic(). Done: 
https://github.com/apache/kafka/commit/359a68510801a22630a7af275c9935fb2d4c8dbf

4. The following line in LeaderEpochFileCache is hit several times when LogTest 
is executed:

{code}
   if (cachedLatestEpoch == None) error("Attempt to assign log end offset 
to epoch before epoch has been set. This should never happen.")
{code}
This should be an assert (with the tests fixed up)
5. The constructor of LeaderEpochFileCache has the following:

{code}
lock synchronized { ListBuffer(checkpoint.read(): _*) }
{code}

But everywhere else uses a read or write lock. We should use consistent locking.


  was:
1. It would be safer to hold onto the leader lock in Partition while serving an 
OffsetForLeaderEpoch request.

2. Currently, we update the leader epoch in epochCache after log append in the 
follower but before log append in the leader. It would be more consistent to 
always do this after log append. This also avoids issues related to failure in 
log append.

3. OffsetsForLeaderEpochRequest/OffsetsForLeaderEpochResponse:
The code that does grouping can probably be replaced by calling 
CollectionUtils.groupDataByTopic(). Done: 
https://github.com/apache/kafka/commit/359a68510801a22630a7af275c9935fb2d4c8dbf

4. The following line in LeaderEpochFileCache is hit several times when LogTest 
is executed:

{code}
   if (cachedLatestEpoch == None) error("Attempt to assign log end offset 
to epoch before epoch has been set. This should never happen.")
{code}

5. The constructor of LeaderEpochFileCache has the following:

{code}
lock synchronized { ListBuffer(checkpoint.read(): _*) }
{code}

But everywhere else uses a read or write lock. We should use consistent locking.



> Followups from KIP-101
> --
>
> Key: KAFKA-5036
> URL: https://issues.apache.org/jira/browse/KAFKA-5036
> Project: Kafka
>  Issue Type: Improvement
>Affects Versions: 0.11.0.0
>Reporter: Jun Rao
>Assignee: Jun Rao
> Fix For: 0.11.0.0
>
>
> 1. It would be safer to hold onto the leader lock in Partition while serving 
> an OffsetForLeaderEpoch request.
> 2. Currently, we update the leader epoch in epochCache after log append in 
> the follower but before log append in the leader. It would be more consistent 
> to always do this after log append. This also avoids issues related to 
> failure in log append.
> 3. OffsetsForLeaderEpochRequest/OffsetsForLeaderEpochResponse:
> The code that does grouping can probably be replaced by calling 
> CollectionUtils.groupDataByTopic(). Done: 
> https://github.com/apache/kafka/commit/359a68510801a22630a7af275c9935fb2d4c8dbf
> 4. The following line in LeaderEpochFileCache is hit several times when 
> LogTest is executed:
> {code}
>if (cachedLatestEpoch == None) error("Attempt to assign log end offset 
> to epoch before epoch has been set. This should never happen.")
> {code}
> This should be an assert (with the tests fixed up)
> 5. The constructor of LeaderEpochFileCache has the following:
> {code}
> lock synchronized { ListBuffer(checkpoint.read(): _*) }
> {code}
> But everywhere else uses a read or write lock. We should use consistent 
> locking.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KAFKA-5036) Followups from KIP-101

2017-04-10 Thread Ben Stopford (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15962827#comment-15962827
 ] 

Ben Stopford commented on KAFKA-5036:
-

Points 2, 5 are addressed in PR: https://github.com/apache/kafka/pull/2831
Point 3 is addressed in PR: https://github.com/apache/kafka/pull/2821 (now 
merged)
Point 4 is an issue only in the test. The LogTest sets up a Log with an empty 
epoch cache. Not addressed at this time.  

> Followups from KIP-101
> --
>
> Key: KAFKA-5036
> URL: https://issues.apache.org/jira/browse/KAFKA-5036
> Project: Kafka
>  Issue Type: Improvement
>Affects Versions: 0.11.0.0
>Reporter: Jun Rao
>Assignee: Jun Rao
> Fix For: 0.11.0.0
>
>
> 1. It would be safer to hold onto the leader lock in Partition while serving 
> an OffsetForLeaderEpoch request.
> 2. Currently, we update the leader epoch in epochCache after log append in 
> the follower but before log append in the leader. It would be more consistent 
> to always do this after log append. This also avoids issues related to 
> failure in log append.
> 3. OffsetsForLeaderEpochRequest/OffsetsForLeaderEpochResponse:
> The code that does grouping can probably be replaced by calling 
> CollectionUtils.groupDataByTopic(). Done: 
> https://github.com/apache/kafka/commit/359a68510801a22630a7af275c9935fb2d4c8dbf
> 4. The following line in LeaderEpochFileCache is hit several times when 
> LogTest is executed:
> {code}
>if (cachedLatestEpoch == None) error("Attempt to assign log end offset 
> to epoch before epoch has been set. This should never happen.")
> {code}
> 5. The constructor of LeaderEpochFileCache has the following:
> {code}
> lock synchronized { ListBuffer(checkpoint.read(): _*) }
> {code}
> But everywhere else uses a read or write lock. We should use consistent 
> locking.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (KAFKA-4825) Likely Data Loss in ReassignPartitionsTest System Test

2017-03-09 Thread Ben Stopford (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Stopford resolved KAFKA-4825.
-
Resolution: Later

Resolving this as it will be fixed by KIP-101 in 3.3

> Likely Data Loss in ReassignPartitionsTest System Test
> --
>
> Key: KAFKA-4825
> URL: https://issues.apache.org/jira/browse/KAFKA-4825
> Project: Kafka
>  Issue Type: Bug
>Reporter: Ben Stopford
>  Labels: reliability
> Attachments: problem.zip
>
>
> A failure in the test below may point to a genuine missing message. 
> kafkatest.tests.core.reassign_partitions_test.ReassignPartitionsTest.test_reassign_partitions.bounce_brokers=True.security_protocol=PLAINTEXT
> The test - which reassigns partitions whilst bouncing cluster members - 
> reconciles messages ack'd with messages received in the consumer. 
> The interesting part is that we received two acks for the same offset, with 
> different messages:
> {"topic":"test_topic","partition":11,"name":"producer_send_success","value":"7447","time_ms":1488349980718,"offset":372,"key":null}
> {"topic":"test_topic","partition":11,"name":"producer_send_success","value":"7487","time_ms":1488349981780,"offset":372,"key":null}
> When searching the log files, via kafka.tools.DumpLogSegments, only the later 
> message is found. 
> The missing message lies midway through the test and appears to occur after a 
> leader moves (after 7447 is sent there is a ~1s pause, then 7487 is sent, 
> along with a backlog of messages for partitions 11, 16, 6). 
> The overall implication is that a message appears to be acknowledged but is 
> later lost. 
> Looking at the test itself, it seems valid. The producer is initialised with 
> acks = -1. The onCompletion callback checks for an exception and uses this to 
> track acknowledgement in the test. 
> https://jenkins.confluent.io/job/system-test-kafka/521/console
> http://testing.confluent.io/confluent-kafka-system-test-results/?prefix=2017-03-01--001.1488363091--apache--trunk--c9872cb/ReassignPartitionsTest/test_reassign_partitions/bounce_brokers=True.security_protocol=PLAINTEXT/



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KAFKA-4825) Likely Data Loss in ReassignPartitionsTest System Test

2017-03-09 Thread Ben Stopford (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902877#comment-15902877
 ] 

Ben Stopford commented on KAFKA-4825:
-

Thanks Jun. Makes sense.

> Likely Data Loss in ReassignPartitionsTest System Test
> --
>
> Key: KAFKA-4825
> URL: https://issues.apache.org/jira/browse/KAFKA-4825
> Project: Kafka
>  Issue Type: Bug
>Reporter: Ben Stopford
>  Labels: reliability
> Attachments: problem.zip
>
>
> A failure in the test below may point to a genuine missing message. 
> kafkatest.tests.core.reassign_partitions_test.ReassignPartitionsTest.test_reassign_partitions.bounce_brokers=True.security_protocol=PLAINTEXT
> The test - which reassigns partitions whilst bouncing cluster members - 
> reconciles messages ack'd with messages received in the consumer. 
> The interesting part is that we received two acks for the same offset, with 
> different messages:
> {"topic":"test_topic","partition":11,"name":"producer_send_success","value":"7447","time_ms":1488349980718,"offset":372,"key":null}
> {"topic":"test_topic","partition":11,"name":"producer_send_success","value":"7487","time_ms":1488349981780,"offset":372,"key":null}
> When searching the log files, via kafka.tools.DumpLogSegments, only the later 
> message is found. 
> The missing message lies midway through the test and appears to occur after a 
> leader moves (after 7447 is sent there is a ~1s pause, then 7487 is sent, 
> along with a backlog of messages for partitions 11, 16, 6). 
> The overall implication is that a message appears to be acknowledged but is 
> later lost. 
> Looking at the test itself, it seems valid. The producer is initialised with 
> acks = -1. The onCompletion callback checks for an exception and uses this to 
> track acknowledgement in the test. 
> https://jenkins.confluent.io/job/system-test-kafka/521/console
> http://testing.confluent.io/confluent-kafka-system-test-results/?prefix=2017-03-01--001.1488363091--apache--trunk--c9872cb/ReassignPartitionsTest/test_reassign_partitions/bounce_brokers=True.security_protocol=PLAINTEXT/



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (KAFKA-4825) Likely Data Loss in ReassignPartitionsTest System Test

2017-03-02 Thread Ben Stopford (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15890661#comment-15890661
 ] 

Ben Stopford edited comment on KAFKA-4825 at 3/2/17 2:55 PM:
-

This could be a result of KIP-101 
(https://cwiki.apache.org/confluence/display/KAFKA/KIP-101+-+Alter+Replication+Protocol+to+use+Leader+Epoch+rather+than+High+Watermark+for+Truncation)
 although the use of clean shutdown in the test makes it less likely. 


was (Author: benstopford):
This could be a result of KIP-101 
(https://cwiki.apache.org/confluence/display/KAFKA/KIP-101+-+Alter+Replication+Protocol+to+use+Leader+Epoch+rather+than+High+Watermark+for+Truncation)

> Likely Data Loss in ReassignPartitionsTest System Test
> --
>
> Key: KAFKA-4825
> URL: https://issues.apache.org/jira/browse/KAFKA-4825
> Project: Kafka
>  Issue Type: Bug
>Reporter: Ben Stopford
> Attachments: problem.zip
>
>
> A failure in the test below may point to a genuine missing message. 
> kafkatest.tests.core.reassign_partitions_test.ReassignPartitionsTest.test_reassign_partitions.bounce_brokers=True.security_protocol=PLAINTEXT
> The test - which reassigns partitions whilst bouncing cluster members - 
> reconciles messages ack'd with messages received in the consumer. 
> The interesting part is that we received two acks for the same offset, with 
> different messages:
> {"topic":"test_topic","partition":11,"name":"producer_send_success","value":"7447","time_ms":1488349980718,"offset":372,"key":null}
> {"topic":"test_topic","partition":11,"name":"producer_send_success","value":"7487","time_ms":1488349981780,"offset":372,"key":null}
> When searching the log files, via kafka.tools.DumpLogSegments, only the later 
> message is found. 
> The missing message lies midway through the test and appears to occur after a 
> leader moves (after 7447 is sent there is a ~1s pause, then 7487 is sent, 
> along with a backlog of messages for partitions 11, 16, 6). 
> The overall implication is that a message appears to be acknowledged but is 
> later lost. 
> Looking at the test itself, it seems valid. The producer is initialised with 
> acks = -1. The onCompletion callback checks for an exception and uses this to 
> track acknowledgement in the test. 
> https://jenkins.confluent.io/job/system-test-kafka/521/console
> http://testing.confluent.io/confluent-kafka-system-test-results/?prefix=2017-03-01--001.1488363091--apache--trunk--c9872cb/ReassignPartitionsTest/test_reassign_partitions/bounce_brokers=True.security_protocol=PLAINTEXT/



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KAFKA-4825) Likely Data Loss in ReassignPartitionsTest System Test

2017-03-01 Thread Ben Stopford (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15890661#comment-15890661
 ] 

Ben Stopford commented on KAFKA-4825:
-

This could be a result of KIP-101 
(https://cwiki.apache.org/confluence/display/KAFKA/KIP-101+-+Alter+Replication+Protocol+to+use+Leader+Epoch+rather+than+High+Watermark+for+Truncation)

> Likely Data Loss in ReassignPartitionsTest System Test
> --
>
> Key: KAFKA-4825
> URL: https://issues.apache.org/jira/browse/KAFKA-4825
> Project: Kafka
>  Issue Type: Bug
>Reporter: Ben Stopford
> Attachments: problem.zip
>
>
> A failure in the test below may point to a genuine missing message. 
> kafkatest.tests.core.reassign_partitions_test.ReassignPartitionsTest.test_reassign_partitions.bounce_brokers=True.security_protocol=PLAINTEXT
> The test - which reassigns partitions whilst bouncing cluster members - 
> reconciles messages ack'd with messages received in the consumer. 
> The interesting part is that we received two acks for the same offset, with 
> different messages:
> {"topic":"test_topic","partition":11,"name":"producer_send_success","value":"7447","time_ms":1488349980718,"offset":372,"key":null}
> {"topic":"test_topic","partition":11,"name":"producer_send_success","value":"7487","time_ms":1488349981780,"offset":372,"key":null}
> When searching the log files, via kafka.tools.DumpLogSegments, only the later 
> message is found. 
> The missing message lies midway through the test and appears to occur after a 
> leader moves (after 7447 is sent there is a ~1s pause, then 7487 is sent, 
> along with a backlog of messages for partitions 11, 16, 6). 
> The overall implication is that a message appears to be acknowledged but is 
> later lost. 
> Looking at the test itself, it seems valid. The producer is initialised with 
> acks = -1. The onCompletion callback checks for an exception and uses this to 
> track acknowledgement in the test. 
> https://jenkins.confluent.io/job/system-test-kafka/521/console
> http://testing.confluent.io/confluent-kafka-system-test-results/?prefix=2017-03-01--001.1488363091--apache--trunk--c9872cb/ReassignPartitionsTest/test_reassign_partitions/bounce_brokers=True.security_protocol=PLAINTEXT/



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (KAFKA-4825) Likely Data Loss in ReassignPartitionsTest System Test

2017-03-01 Thread Ben Stopford (JIRA)
Ben Stopford created KAFKA-4825:
---

 Summary: Likely Data Loss in ReassignPartitionsTest System Test
 Key: KAFKA-4825
 URL: https://issues.apache.org/jira/browse/KAFKA-4825
 Project: Kafka
  Issue Type: Bug
Reporter: Ben Stopford
 Attachments: problem.zip

A failure in the test below may point to a genuine missing message. 

kafkatest.tests.core.reassign_partitions_test.ReassignPartitionsTest.test_reassign_partitions.bounce_brokers=True.security_protocol=PLAINTEXT

The test - which reassigns partitions whilst bouncing cluster members - 
reconciles messages ack'd with messages received in the consumer. 

The interesting part is that we received two acks for the same offset, with 
different messages:

{"topic":"test_topic","partition":11,"name":"producer_send_success","value":"7447","time_ms":1488349980718,"offset":372,"key":null}

{"topic":"test_topic","partition":11,"name":"producer_send_success","value":"7487","time_ms":1488349981780,"offset":372,"key":null}

When searching the log files, via kafka.tools.DumpLogSegments, only the later 
message is found. 

The missing message lies midway through the test and appears to occur after a 
leader moves (after 7447 is sent there is a ~1s pause, then 7487 is sent, along 
with a backlog of messages for partitions 11, 16, 6). 

The overall implication is that a message appears to be acknowledged but is 
later lost. 

Looking at the test itself, it seems valid. The producer is initialised with 
acks = -1. The onCompletion callback checks for an exception and uses this to 
track acknowledgement in the test. 


https://jenkins.confluent.io/job/system-test-kafka/521/console
http://testing.confluent.io/confluent-kafka-system-test-results/?prefix=2017-03-01--001.1488363091--apache--trunk--c9872cb/ReassignPartitionsTest/test_reassign_partitions/bounce_brokers=True.security_protocol=PLAINTEXT/
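For reference, the ack-tracking pattern the test relies on looks roughly like this (a sketch in Scala against the Java producer API; the broker address and value are illustrative, and the real logic lives in the kafkatest framework):

{code}
import java.util.Properties
import org.apache.kafka.clients.producer.{Callback, KafkaProducer, ProducerRecord, RecordMetadata}

object AckTrackingSketch extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092") // illustrative
  props.put("acks", "-1") // wait for all in-sync replicas before acking
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

  val producer = new KafkaProducer[String, String](props)
  // (topic, value) constructor: key is null, as in the records above.
  producer.send(new ProducerRecord[String, String]("test_topic", "7447"), new Callback {
    override def onCompletion(metadata: RecordMetadata, exception: Exception): Unit = {
      // Only exception-free completions count as acknowledged in the reconciliation.
      if (exception == null)
        println(s"acked: partition=${metadata.partition()} offset=${metadata.offset()}")
    }
  })
  producer.close()
}
{code}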




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (KAFKA-2363) ProducerSendTest.testCloseWithZeroTimeoutFromCallerThread Transient Failure

2017-01-23 Thread Ben Stopford (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Stopford reassigned KAFKA-2363:
---

Assignee: (was: Ben Stopford)

> ProducerSendTest.testCloseWithZeroTimeoutFromCallerThread Transient Failure
> ---
>
> Key: KAFKA-2363
> URL: https://issues.apache.org/jira/browse/KAFKA-2363
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Fangmin Lv
>  Labels: newbie, transient-unit-test-failure
> Fix For: 0.10.2.0
>
>
> {code}
> kafka.api.ProducerSendTest > testCloseWithZeroTimeoutFromCallerThread 
> STANDARD_OUT
> [2015-07-24 23:13:05,148] WARN fsync-ing the write ahead log in SyncThread:0 
> took 1084ms which will adversely effect operation latency. See the ZooKeeper 
> troubleshooting guide (org.apache.zookeeper.s
> erver.persistence.FileTxnLog:334)
> kafka.api.ProducerSendTest > testCloseWithZeroTimeoutFromCallerThread FAILED
> java.lang.AssertionError: No request is complete.
> at org.junit.Assert.fail(Assert.java:92)
> at org.junit.Assert.assertTrue(Assert.java:44)
> at 
> kafka.api.ProducerSendTest$$anonfun$testCloseWithZeroTimeoutFromCallerThread$1.apply$mcVI$sp(ProducerSendTest.scala:343)
> at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> at 
> kafka.api.ProducerSendTest.testCloseWithZeroTimeoutFromCallerThread(ProducerSendTest.scala:340)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-4596) KIP-73 rebalance throttling breaks on plans for specific partitions

2017-01-18 Thread Ben Stopford (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Stopford updated KAFKA-4596:

Status: Patch Available  (was: In Progress)

> KIP-73 rebalance throttling breaks on plans for specific partitions
> ---
>
> Key: KAFKA-4596
> URL: https://issues.apache.org/jira/browse/KAFKA-4596
> Project: Kafka
>  Issue Type: Bug
> Environment: Kafka 0.10.1.1
>Reporter: Tom Crayford
>Assignee: Ben Stopford
> Fix For: 0.10.2.0
>
>
> The reassign-partitions.sh command fails if you both *throttle* and give it a 
> specific partition reassignment. For example, upon reassigning 
> {code}__consumer_offsets{code} partition 19, you get the following error:
> {code}
> Save this to use as the --reassignment-json-file option during rollback
> Warning: You must run Verify periodically, until the reassignment completes, 
> to ensure the throttle is removed. You can also alter the throttle by 
> rerunning the Execute command passing a new value.
> The throttle limit was set to 1048576 B/s
> Partitions reassignment failed due to key not found: [__consumer_offsets,30]
> java.util.NoSuchElementException: key not found: [__consumer_offsets,30]
> at scala.collection.MapLike$class.default(MapLike.scala:228)
> at scala.collection.AbstractMap.default(Map.scala:59)
> at scala.collection.MapLike$class.apply(MapLike.scala:141)
> at scala.collection.AbstractMap.apply(Map.scala:59)
> at 
> kafka.admin.ReassignPartitionsCommand$$anonfun$kafka$admin$ReassignPartitionsCommand$$preRebalanceReplicaForMovingPartitions$1.apply(ReassignPartitionsCommand.scala:372)
> at 
> kafka.admin.ReassignPartitionsCommand$$anonfun$kafka$admin$ReassignPartitionsCommand$$preRebalanceReplicaForMovingPartitions$1.apply(ReassignPartitionsCommand.scala:371)
> at 
> scala.collection.TraversableLike$$anonfun$filterImpl$1.apply(TraversableLike.scala:248)
> at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
> at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
> at 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
> at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
> at 
> scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247)
> at 
> scala.collection.TraversableLike$class.filter(TraversableLike.scala:259)
> at scala.collection.AbstractTraversable.filter(Traversable.scala:104)
> at 
> kafka.admin.ReassignPartitionsCommand.kafka$admin$ReassignPartitionsCommand$$preRebalanceReplicaForMovingPartitions(ReassignPartitionsCommand.scala:371)
> at 
> kafka.admin.ReassignPartitionsCommand$$anonfun$assignThrottledReplicas$2.apply(ReassignPartitionsCommand.scala:347)
> at 
> kafka.admin.ReassignPartitionsCommand$$anonfun$assignThrottledReplicas$2.apply(ReassignPartitionsCommand.scala:343)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> kafka.admin.ReassignPartitionsCommand.assignThrottledReplicas(ReassignPartitionsCommand.scala:343)
> at 
> kafka.admin.ReassignPartitionsCommand.maybeThrottle(ReassignPartitionsCommand.scala:317)
> at 
> kafka.admin.ReassignPartitionsCommand.reassignPartitions(ReassignPartitionsCommand.scala:387)
> at 
> kafka.admin.ReassignPartitionsCommand$.executeAssignment(ReassignPartitionsCommand.scala:165)
> at 
> kafka.admin.ReassignPartitionsCommand$.executeAssignment(ReassignPartitionsCommand.scala:149)
> at 
> kafka.admin.ReassignPartitionsCommand$.main(ReassignPartitionsCommand.scala:46)
> at 
> kafka.admin.ReassignPartitionsCommand.main(ReassignPartitionsCommand.scala)
> {code}
> This effectively breaks the throttling feature unless you want to rebalance 
> many many partitions at once.
> For reference the command that was run is:
> {code}
> kafka-reassign-partitions.sh --reassignment-json-file 
> 9b137f13_7e91_4757_a7b9_554ff458e7df.3d4269d5-e829-4e86-9d05-91e9e2fcb7e7.reassignment_plan.json --throttle 1048576
> {code}
> and the contents of the plan file is:
> {code}
> {"version":1,"partitions":[{"topic":"__consumer_offsets","partition":19,"replicas":[2,1,0]}]}
> {code}
> This seems like a simple logic error to me, where we're trying to look up a 
> partition that's not been proposed, when we should not be. It looks like the 
> logic assumes that {code}Map.apply{code} doesn't error if the lookup value 
> isn't found, when in fact it does.
> I checked that this 
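The Map.apply failure mode is easy to demonstrate in isolation (standalone Scala; the tuple key mirrors the [topic, partition] pairs above):

{code}
val proposed = Map(("__consumer_offsets", 19) -> Seq(2, 1, 0))

proposed.get(("__consumer_offsets", 30)) // None - the safe lookup
proposed(("__consumer_offsets", 30))     // throws java.util.NoSuchElementException: key not found
{code}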

[jira] [Work started] (KAFKA-4596) KIP-73 rebalance throttling breaks on plans for specific partitions

2017-01-18 Thread Ben Stopford (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on KAFKA-4596 started by Ben Stopford.
---
> KIP-73 rebalance throttling breaks on plans for specific partitions
> ---
>
> Key: KAFKA-4596
> URL: https://issues.apache.org/jira/browse/KAFKA-4596
> Project: Kafka
>  Issue Type: Bug
> Environment: Kafka 0.10.1.1
>Reporter: Tom Crayford
>Assignee: Ben Stopford
> Fix For: 0.10.2.0
>
>
> The reassign-partitions.sh command fails if you both *throttle* and give it a 
> specific partition reassignment. For example, upon reassigning 
> {code}__consumer_offsets{code} partition 19, you get the following error:
> {code}
> Save this to use as the --reassignment-json-file option during rollback
> Warning: You must run Verify periodically, until the reassignment completes, 
> to ensure the throttle is removed. You can also alter the throttle by 
> rerunning the Execute command passing a new value.
> The throttle limit was set to 1048576 B/s
> Partitions reassignment failed due to key not found: [__consumer_offsets,30]
> java.util.NoSuchElementException: key not found: [__consumer_offsets,30]
> at scala.collection.MapLike$class.default(MapLike.scala:228)
> at scala.collection.AbstractMap.default(Map.scala:59)
> at scala.collection.MapLike$class.apply(MapLike.scala:141)
> at scala.collection.AbstractMap.apply(Map.scala:59)
> at 
> kafka.admin.ReassignPartitionsCommand$$anonfun$kafka$admin$ReassignPartitionsCommand$$preRebalanceReplicaForMovingPartitions$1.apply(ReassignPartitionsCommand.scala:372)
> at 
> kafka.admin.ReassignPartitionsCommand$$anonfun$kafka$admin$ReassignPartitionsCommand$$preRebalanceReplicaForMovingPartitions$1.apply(ReassignPartitionsCommand.scala:371)
> at 
> scala.collection.TraversableLike$$anonfun$filterImpl$1.apply(TraversableLike.scala:248)
> at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
> at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
> at 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
> at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
> at 
> scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247)
> at 
> scala.collection.TraversableLike$class.filter(TraversableLike.scala:259)
> at scala.collection.AbstractTraversable.filter(Traversable.scala:104)
> at 
> kafka.admin.ReassignPartitionsCommand.kafka$admin$ReassignPartitionsCommand$$preRebalanceReplicaForMovingPartitions(ReassignPartitionsCommand.scala:371)
> at 
> kafka.admin.ReassignPartitionsCommand$$anonfun$assignThrottledReplicas$2.apply(ReassignPartitionsCommand.scala:347)
> at 
> kafka.admin.ReassignPartitionsCommand$$anonfun$assignThrottledReplicas$2.apply(ReassignPartitionsCommand.scala:343)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> kafka.admin.ReassignPartitionsCommand.assignThrottledReplicas(ReassignPartitionsCommand.scala:343)
> at 
> kafka.admin.ReassignPartitionsCommand.maybeThrottle(ReassignPartitionsCommand.scala:317)
> at 
> kafka.admin.ReassignPartitionsCommand.reassignPartitions(ReassignPartitionsCommand.scala:387)
> at 
> kafka.admin.ReassignPartitionsCommand$.executeAssignment(ReassignPartitionsCommand.scala:165)
> at 
> kafka.admin.ReassignPartitionsCommand$.executeAssignment(ReassignPartitionsCommand.scala:149)
> at 
> kafka.admin.ReassignPartitionsCommand$.main(ReassignPartitionsCommand.scala:46)
> at 
> kafka.admin.ReassignPartitionsCommand.main(ReassignPartitionsCommand.scala)
> {code}
> This effectively breaks the throttling feature unless you want to rebalance 
> many many partitions at once.
> For reference the command that was run is:
> {code}
> kafka-reassign-partitions.sh --reassignment-json-file 
> 9b137f13_7e91_4757_a7b9_554ff458e7df.3d4269d5-e829-4e86-9d05-91e9e2fcb7e7.reassignment_plan.json --throttle 1048576
> {code}
> and the contents of the plan file is:
> {code}
> {"version":1,"partitions":[{"topic":"__consumer_offsets","partition":19,"replicas":[2,1,0]}]}
> {code}
> This seems like a simple logic error to me, where we're trying to look up a 
> partition that's not been proposed, when we should not be. It looks like the 
> logic assumes that {code}Map.apply{code} doesn't error if the lookup value 
> isn't found, when in fact it does.
> I checked that this cluster does indeed have the 

[jira] [Commented] (KAFKA-1120) Controller could miss a broker state change

2016-12-15 Thread Ben Stopford (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15750982#comment-15750982
 ] 

Ben Stopford commented on KAFKA-1120:
-

[~junrao] added a useful comment on a mail thread about this topic: 
https://www.mail-archive.com/dev@kafka.apache.org/msg62261.html


> Controller could miss a broker state change 
> 
>
> Key: KAFKA-1120
> URL: https://issues.apache.org/jira/browse/KAFKA-1120
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.8.1
>Reporter: Jun Rao
>  Labels: reliability
>
> When the controller is in the middle of processing a task (e.g., preferred 
> leader election, broker change), it holds a controller lock. During this 
> time, a broker could have de-registered and re-registered itself in ZK. After 
> the controller finishes processing the current task, it will start processing 
> the logic in the broker change listener. However, it will see no broker 
> change and therefore won't do anything to the restarted broker. This broker 
> will be in a weird state since the controller doesn't inform it to become the 
> leader of any partition. Yet, the cached metadata in other brokers could 
> still list that broker as the leader for some partitions. Client requests 
> routed to that broker will then get a TopicOrPartitionNotExistException. This 
> broker will continue to be in this bad state until it's restarted again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-4522) Using Disruptor instead of Array Blocking queue in Kafka Producer

2016-12-14 Thread Ben Stopford (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15748169#comment-15748169
 ] 

Ben Stopford commented on KAFKA-4522:
-

It'd be good to have some empirical evidence that this was indeed a bottleneck 
in the wider flow.

> Using Disruptor instead of Array Blocking queue in Kafka Producer
> -
>
> Key: KAFKA-4522
> URL: https://issues.apache.org/jira/browse/KAFKA-4522
> Project: Kafka
>  Issue Type: Improvement
>  Components: producer 
>Reporter: Pratik kumar
>
> The Kafka Producer currently uses Java's ArrayBlockingQueue to store outbound 
> Kafka messages before batching them in async mode. At high message production 
> rates, this adds lock contention for the user, and it is generally hidden from 
> the user (speaking from personal experience).
> Usage of the LMAX Disruptor could reduce the lock contention overhead imposed 
> by the Kafka Producer. 
> LMAX Disruptor -> https://github.com/LMAX-Exchange/disruptor
> Also, can someone help me understand whether the blocking queue gives any 
> guarantees inherent to Kafka's design (and hence is irreplaceable)?
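The contention claim is easy to sketch in isolation, though this toy is not the producer's actual accumulator path (thread count, queue size, and iteration count are illustrative):

{code}
import java.util.concurrent.ArrayBlockingQueue

object QueueContentionSketch extends App {
  val queue = new ArrayBlockingQueue[Array[Byte]](65536)
  val payload = new Array[Byte](100)
  // All eight threads serialize on the queue's single internal lock.
  val threads = (1 to 8).map { _ =>
    new Thread(() => {
      var i = 0
      while (i < 1000000) { queue.offer(payload); queue.poll(); i += 1 }
    })
  }
  val start = System.nanoTime()
  threads.foreach(_.start())
  threads.foreach(_.join())
  println(s"elapsed ms: ${(System.nanoTime() - start) / 1e6}")
}
{code}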



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-4178) Replication Throttling: Consolidate Rate Classes

2016-11-23 Thread Ben Stopford (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15691178#comment-15691178
 ] 

Ben Stopford commented on KAFKA-4178:
-

Yes, that should be fine. 

> Replication Throttling: Consolidate Rate Classes
> 
>
> Key: KAFKA-4178
> URL: https://issues.apache.org/jira/browse/KAFKA-4178
> Project: Kafka
>  Issue Type: Improvement
>  Components: replication
>Affects Versions: 0.10.1.0
>Reporter: Ben Stopford
> Fix For: 0.10.2.0
>
>
> Replication throttling is using a different implementation of Rate to client 
> throttling (Rate & SimpleRate). These should be consolidated so both use the 
> same approach. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (KAFKA-4330) Leader/Follower Quota Rates in JMX have ambiguous names

2016-10-21 Thread Ben Stopford (JIRA)
Ben Stopford created KAFKA-4330:
---

 Summary: Leader/Follower Quota Rates in JMX have ambiguous names
 Key: KAFKA-4330
 URL: https://issues.apache.org/jira/browse/KAFKA-4330
 Project: Kafka
  Issue Type: Bug
Reporter: Ben Stopford


The names of the JMX attributes for the replication-throttled rates don't 
indicate that the rates are throttled ones. This is a bit confusing. 

MBean:kafka.server:type=LeaderReplication,name=byte-rate
MBean:kafka.server:type=FollowerReplication,name=byte-rate

Suggest ThrottledLeaderReplication etc. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (KAFKA-4309) Allow "pluggable" properties in KafkaService in System Tests

2016-10-17 Thread Ben Stopford (JIRA)
Ben Stopford created KAFKA-4309:
---

 Summary: Allow "pluggable" properties in KafkaService in System 
Tests
 Key: KAFKA-4309
 URL: https://issues.apache.org/jira/browse/KAFKA-4309
 Project: Kafka
  Issue Type: Bug
Reporter: Ben Stopford






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (KAFKA-4282) View / Delete Replication Quotas via Config Command EntityName Wildcards for Topic/Broker

2016-10-10 Thread Ben Stopford (JIRA)
Ben Stopford created KAFKA-4282:
---

 Summary: View / Delete Replication Quotas via Config Command 
EntityName Wildcards for Topic/Broker
 Key: KAFKA-4282
 URL: https://issues.apache.org/jira/browse/KAFKA-4282
 Project: Kafka
  Issue Type: Improvement
Affects Versions: 0.10.1.1
Reporter: Ben Stopford


It is likely that people will need a utility for checking if a quota is set and 
removing it. This would be useful if something untoward happened, say the 
throttle was not removed from a certain broker due to a network glitch at the 
time of removal. 

Currently it is possible to view and delete replication quota configs using the 
config command, but you'd need to script up something that called the command 
for each topic and each broker. It's also possible to delete the throttle for an 
assignment using the ReassignPartitionsCommand, but that requires an outstanding 
assignment and a matching JSON doc. 

Thus it would be good to add something to the ConfigCommand that wrapped this 
functionality. 

Probably the best way would be to simply add wildcard support to the EntityName 
field for Broker and Topic entity types. 
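For example, describing quotas across all topics might then look like this (the '*' wildcard is the proposed extension, not an existing flag value; the rest is the existing ConfigCommand syntax):

{code}
bin/kafka-configs.sh --zookeeper localhost:2181 --describe \
  --entity-type topics --entity-name '*'
{code}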




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-4266) Replication Quota Tests: Ensure ZK updated before tests start

2016-10-08 Thread Ben Stopford (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Stopford updated KAFKA-4266:

Description: 


Alter the temporal test in ReassignPartitionsClusterTest so that it doesn't fail 
if ZK is not updated before the throttle starts. 

  was:
Alter the integration tests around replication quotas (ReplicationQuotasTest) 
so they ensure ZK has been updated with the throttle before the test starts.

Ideally we'd have similar functionality in ReassignPartitionsClusterTest but 
there is probably no deterministic way to do that easily. 


> Replication Quota Tests: Ensure ZK updated before tests start
> -
>
> Key: KAFKA-4266
> URL: https://issues.apache.org/jira/browse/KAFKA-4266
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Ben Stopford
>Assignee: Ben Stopford
>
> Alter the temporal test in ReassignPartitionsClusterTest so that it doesn't 
> fail if ZK is not updated before the throttle starts. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (KAFKA-4266) Replication Quota Tests: Ensure ZK updated before tests start

2016-10-07 Thread Ben Stopford (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Stopford reassigned KAFKA-4266:
---

Assignee: Ben Stopford

> Replication Quota Tests: Ensure ZK updated before tests start
> -
>
> Key: KAFKA-4266
> URL: https://issues.apache.org/jira/browse/KAFKA-4266
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Ben Stopford
>Assignee: Ben Stopford
>
> Alter the integration tests around replication quotas (ReplicationQuotasTest) 
> so they ensure ZK has been updated with the throttle before the test starts.
> Ideally we'd have similar functionality in ReassignPartitionsClusterTest but 
> there is probably no deterministic way to do that easily. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (KAFKA-4266) Replication Quota Tests: Ensure ZK updated before tests start

2016-10-07 Thread Ben Stopford (JIRA)
Ben Stopford created KAFKA-4266:
---

 Summary: Replication Quota Tests: Ensure ZK updated before tests 
start
 Key: KAFKA-4266
 URL: https://issues.apache.org/jira/browse/KAFKA-4266
 Project: Kafka
  Issue Type: Improvement
Reporter: Ben Stopford


Alter the integration tests around replication quotas (ReplicationQuotasTest) 
so they ensure ZK has been updated with the throttle before the test starts.

Ideally we'd have similar functionality in ReassignPartitionsClusterTest but 
there is probably no deterministic way to do that easily. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-4225) Replication Quotas: Control Leader & Follower Limit Separately

2016-09-27 Thread Ben Stopford (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Stopford updated KAFKA-4225:

Description: 
As we split the throttled replicas into Leader and Follower configs, it makes 
sense to also split the throttle limit:

replication.quota.throttled.rate
   =>
replication.leader.quota.throttled.rate
& 
replication.follower.quota.throttled.rate

So that admins have fine-grained control over both sides of the replication 
process and the properties match for leader/follower applicability.

  was:
As we split the throttled replicas into Leader and Follower configs, it makes 
sense to also split the throttle limit:

replication.quota.throttled.rate
   =>
replication.leader.quota.throttled.rate
& 
replication.leader.quota.throttled.rate

So that admins have fine-grained control over both sides of the replication 
process and the properties match for leader/follower applicability.


> Replication Quotas: Control Leader & Follower Limit Separately
> --
>
> Key: KAFKA-4225
> URL: https://issues.apache.org/jira/browse/KAFKA-4225
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.1.0
>Reporter: Ben Stopford
>Assignee: Ben Stopford
>
> As we split the throttled replicas into Leader and Follower configs, it makes 
> sense to also split the throttle limit:
> replication.quota.throttled.rate
>=>
> replication.leader.quota.throttled.rate
> & 
> replication.follower.quota.throttled.rate
> So that admins have fine-grained control over both sides of the replication 
> process and the properties match for leader/follower applicability.
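For illustration, with the split limits a broker config could carry a different value for each side (property names as proposed in this ticket; the rates are illustrative):

{code}
replication.leader.quota.throttled.rate=20971520
replication.follower.quota.throttled.rate=10485760
{code}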



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-4225) Replication Quotas: Control Leader & Follower Limit Separately

2016-09-27 Thread Ben Stopford (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Stopford updated KAFKA-4225:

 Assignee: Ben Stopford
Affects Version/s: 0.10.1.0

> Replication Quotas: Control Leader & Follower Limit Separately
> --
>
> Key: KAFKA-4225
> URL: https://issues.apache.org/jira/browse/KAFKA-4225
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.1.0
>Reporter: Ben Stopford
>Assignee: Ben Stopford
>
> As we split the throttled replicas into Leader and Follower configs, it makes 
> sense to also split the throttle limit:
> replication.quota.throttled.rate
>=>
> replication.leader.quota.throttled.rate
> & 
> replication.leader.quota.throttled.rate
> So that admins have fine-grained control over both sides of the replication 
> process and the properties match for leader/follower applicability.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (KAFKA-4225) Replication Quotas: Control Leader & Follower Limit Separately

2016-09-27 Thread Ben Stopford (JIRA)
Ben Stopford created KAFKA-4225:
---

 Summary: Replication Quotas: Control Leader & Follower Limit 
Separately
 Key: KAFKA-4225
 URL: https://issues.apache.org/jira/browse/KAFKA-4225
 Project: Kafka
  Issue Type: Bug
Reporter: Ben Stopford


As we split the throttled replicas into Leader and Follower configs, it makes 
sense to also split the throttle limit:

replication.quota.throttled.rate
   =>
replication.leader.quota.throttled.rate
& 
replication.leader.quota.throttled.rate

So that admins have fine-grained control over both sides of the replication 
process and the properties match for leader/follower applicability.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-4216) Replication Quotas: Control Leader & Follower Throttled Replicas Separately

2016-09-26 Thread Ben Stopford (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Stopford updated KAFKA-4216:

Status: Patch Available  (was: In Progress)

> Replication Quotas: Control Leader & Follower Throttled Replicas Separately
> ---
>
> Key: KAFKA-4216
> URL: https://issues.apache.org/jira/browse/KAFKA-4216
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.1.0
>Reporter: Ben Stopford
>Assignee: Ben Stopford
>
> The current algorithm in kafka-reassign-partitions applies a throttle to all 
> moving replicas, be they leader-side or follower-side. 
> A preferable solution would be to change the throttled replica list to 
> specify whether the throttle applies to leader or follower. That way we can 
> ensure that the regular replication will not be throttled.  
> To do this we should change the way the throttled replica list is specified 
> so it is spread over two separate properties: one that corresponds to the 
> leader-side throttle, and the other that corresponds to the follower-side 
> throttle.
> quota.leader.replication.throttled.replicas = 
> [partId]:[replica],[partId]:[replica],[partId]:[replica] 
> quota.follower.replication.throttled.replicas = 
> [partId]:[replica],[partId]:[replica],[partId]:[replica] 
> Then, when applying the throttle, the leader quota can be applied to all 
> current replicas, and the follower quota can be applied only to the new 
> replicas we are creating as part of the rebalance. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Work started] (KAFKA-4216) Replication Quotas: Control Leader & Follower Throttled Replicas Separately

2016-09-26 Thread Ben Stopford (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on KAFKA-4216 started by Ben Stopford.
---
> Replication Quotas: Control Leader & Follower Throttled Replicas Separately
> ---
>
> Key: KAFKA-4216
> URL: https://issues.apache.org/jira/browse/KAFKA-4216
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.1.0
>Reporter: Ben Stopford
>Assignee: Ben Stopford
>
> The current algorithm in kafka-reassign-partitions applies a throttle to all 
> moving replicas, be they leader-side or follower-side. 
> A preferable solution would be to change the throttled replica list to 
> specify whether the throttle applies to leader or follower. That way we can 
> ensure that the regular replication will not be throttled.  
> To do this we should change the way the throttled replica list is specified 
> so it is spread over two separate properties: one that corresponds to the 
> leader-side throttle, and the other that corresponds to the follower-side 
> throttle.
> quota.leader.replication.throttled.replicas = 
> [partId]:[replica],[partId]:[replica],[partId]:[replica] 
> quota.follower.replication.throttled.replicas = 
> [partId]:[replica],[partId]:[replica],[partId]:[replica] 
> Then, when applying the throttle, the leader quota can be applied to all 
> current replicas, and the follower quota can be applied only to the new 
> replicas we are creating as part of the rebalance. 
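For illustration, the split lists would let an admin throttle, say, partition 0's leader on broker 1 and its incoming follower on broker 3 (property names as proposed here; the values are illustrative):

{code}
quota.leader.replication.throttled.replicas=0:1
quota.follower.replication.throttled.replicas=0:3
{code}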



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-4216) Replication Quotas: Control Leader & Follower Throttled Replicas Separately

2016-09-26 Thread Ben Stopford (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Stopford updated KAFKA-4216:

Summary: Replication Quotas: Control Leader & Follower Throttled Replicas 
Separately  (was: Replication Throttling: Leader may not be throttled if it is 
not "moving")

> Replication Quotas: Control Leader & Follower Throttled Replicas Separately
> ---
>
> Key: KAFKA-4216
> URL: https://issues.apache.org/jira/browse/KAFKA-4216
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.1.0
>Reporter: Ben Stopford
>Assignee: Ben Stopford
>
> The current algorithm in kafka-reassign-partitions applies a throttle to all 
> moving replicas, be they leader-side or follower-side. 
> A preferable solution would be to change the throttled replica list to 
> specify whether the throttle applies to leader or follower. That way we can 
> ensure that the regular replication will not be throttled.  
> To do this we should change the way the throttled replica list is specified 
> so it is spread over two separate properties. One that corresponds to the 
> leader-side throttle, and the other that corresponds to the follower-side 
> throttle.
> quota.leader.replication.throttled.replicas = 
> [partId]:[replica],[partId]:[replica],[partId]:[replica] 
> quota.follower.replication.throttled.replicas = 
> [partId]:[replica],[partId]:[replica],[partId]:[replica] 
> Then, when applying the throttle, the leader quota can be applied to all 
> current replicas, and the follower quota can be applied only to the new 
> replicas we are creating as part of the rebalance. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-4216) Replication Throttling: Leader may not be throttled if it is not "moving"

2016-09-25 Thread Ben Stopford (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Stopford updated KAFKA-4216:

Description: 
The current algorithm in kafka-reassign-partitions applies a throttle to all 
moving replicas, be they leader-side or follower-side. 

A preferable solution would be to change the throttled replica list to specify 
whether the throttle applies to leader or follower. That way we can ensure that 
the regular replication will not be throttled.  

To do this we should change the way the throttled replica list is specified so 
it is spread over two separate properties. One that corresponds to the 
leader-side throttle, and the other that corresponds to the follower-side 
throttle.

quota.leader.replication.throttled.replicas = 
[partId]:[replica],[partId]:[replica],[partId]:[replica] 
quota.follower.replication.throttled.replicas = 
[partId]:[replica],[partId]:[replica],[partId]:[replica] 

Then, when applying the throttle, the leader quota can be applied to all 
current replicas, and the follower quota can be applied only to the new 
replicas we are creating as part of the rebalance. 
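
To make the format concrete, a hypothetical example (topic, partition, and 
broker ids are made up for illustration): with partitions 0 and 1 currently 
led from broker 101 and each gaining a new replica on broker 102, the 
throttles might be set as:

{code}
# Hypothetical values for illustration only.
# Throttle leader-side traffic on the existing replicas...
quota.leader.replication.throttled.replicas = 0:101,1:101
# ...and follower-side traffic only on the replicas being created.
quota.follower.replication.throttled.replicas = 0:102,1:102
{code}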


  was:
The current algorithm in kafka-reassign-partitions applies a throttle to all 
moving replicas, be they leader-side or follower-side. 

The problem is that the leader may not be moving, which would mean it would not 
have a throttle applied. So the throttle should be applied to:
[all existing replicas] ++  [all proposed replicas that moved]

A preferable solution would be to change the throttled replica list to specify 
whether the throttle applies to leader or follower. That way we can ensure that 
the regular replication will not be throttled.  


> Replication Throttling: Leader may not be throttled if it is not "moving"
> -
>
> Key: KAFKA-4216
> URL: https://issues.apache.org/jira/browse/KAFKA-4216
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.1.0
>Reporter: Ben Stopford
>Assignee: Ben Stopford
>
> The current algorithm in kafka-reassign-partitions applies a throttle to all 
> moving replicas, be they leader-side or follower-side. 
> A preferable solution would be to change the throttled replica list to 
> specify whether the throttle applies to leader or follower. That way we can 
> ensure that the regular replication will not be throttled.  
> To do this we should change the way the throttled replica list is specified 
> so it is spread over two separate properties. One that corresponds to the 
> leader-side throttle, and the other that corresponds to the follower-side 
> throttle.
> quota.leader.replication.throttled.replicas = 
> [partId]:[replica],[partId]:[replica],[partId]:[replica] 
> quota.follower.replication.throttled.replicas = 
> [partId]:[replica],[partId]:[replica],[partId]:[replica] 
> Then, when applying the throttle, the leader quota can be applied to all 
> current replicas, and the follower quota can be applied only to the new 
> replicas we are creating as part of the rebalance. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-4216) Replication Throttling: Leader may not be throttled if it is not "moving"

2016-09-24 Thread Ben Stopford (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Stopford updated KAFKA-4216:

Description: 
The current algorithm in kafka-reassign-partitions applies a throttle to all 
moving replicas, be they leader-side or follower-side. 

The problem is that the leader may not be moving, which would mean it would not 
have a throttle applied. So the throttle should be applied to:
[all existing replicas] ++  [all proposed replicas that moved]

A preferable solution would be to change the throttled replica list to specify 
whether the throttle applies to leader or follower. That way we can ensure that 
the regular replication will not be throttled.  

  was:
The current algorithm in kafka-reassign-partitions applies a throttle to all 
moving replicas, be they leader-side or follower-side. 

The problem is that the leader may not be moving, which would mean it would not 
have a throttle applied. So the throttle should be applied to:
[all existing replicas] ++  [all proposed replicas that moved]

 


> Replication Throttling: Leader may not be throttled if it is not "moving"
> -
>
> Key: KAFKA-4216
> URL: https://issues.apache.org/jira/browse/KAFKA-4216
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.1.0
>Reporter: Ben Stopford
>Assignee: Ben Stopford
>
> The current algorithm in kafka-reassign-partitions applies a throttle to all 
> moving replicas, be they leader-side or follower-side. 
> The problem is that the leader may not be moving, which would mean it would 
> not have a throttle applied. So the throttle should be applied to:
> [all existing replicas] ++  [all proposed replicas that moved]
> A preferable solution would be to change the throttled replica list to 
> specify whether the throttle applies to leader or follower. That way we can 
> ensure that the regular replication will not be throttled.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (KAFKA-4216) Replication Throttling: Leader may not be throttled if it is not "moving"

2016-09-24 Thread Ben Stopford (JIRA)
Ben Stopford created KAFKA-4216:
---

 Summary: Replication Throttling: Leader may not be throttled if it 
is not "moving"
 Key: KAFKA-4216
 URL: https://issues.apache.org/jira/browse/KAFKA-4216
 Project: Kafka
  Issue Type: Bug
Affects Versions: 0.10.1.0
Reporter: Ben Stopford
Assignee: Ben Stopford


The current algorithm in kafka-reassign-partitions applies a throttle to all 
moving replicas, be they leader-side or follower-side. 

The problem is that the leader may not be moving, which would mean it would not 
have a throttle applied. So the throttle should be applied to:
[all existing replicas] ++  [all proposed replicas that moved]

 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (KAFKA-4210) Improve Replication Follower Throttling Stability By Checking Quota During Backoff

2016-09-23 Thread Ben Stopford (JIRA)
Ben Stopford created KAFKA-4210:
---

 Summary: Improve Replication Follower Throttling Stability By 
Checking Quota During Backoff
 Key: KAFKA-4210
 URL: https://issues.apache.org/jira/browse/KAFKA-4210
 Project: Kafka
  Issue Type: Improvement
Affects Versions: 0.11.0.0
Reporter: Ben Stopford
Assignee: Ben Stopford


In replication throttling, the leader is more stable than the follower, 
particularly at low throttle throughputs, because the leader uses Purgatory's 
Delayed Fetch feature to terminate a blocked request early should the quota no 
longer be breached. 

We can simulate the same behaviour simply on the follower by altering the way 
the back-off works in AbstractFetcherThread so that, for replication 
throttling, it checks the quota during the pause to see if a new request 
should be sent early. 
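
A minimal sketch of the idea (hypothetical names; the real change would live 
in AbstractFetcherThread's back-off handling, and ReplicaQuota here stands in 
for whatever quota abstraction the fetcher consults): rather than sleeping for 
the full back-off, sleep in small increments and return early once the quota 
is no longer breached.

{code:title=QuotaAwareBackoff.java|borderStyle=solid}
// Hedged sketch only, not the actual Kafka implementation.
interface ReplicaQuota {
    boolean isQuotaExceeded();
}

final class QuotaAwareBackoff {
    /**
     * Sleeps for up to maxBackoffMs, polling the quota every pollMs and
     * returning early once the quota is no longer breached, so the next
     * fetch request can be sent as soon as there is throughput headroom.
     */
    static void await(ReplicaQuota quota, long maxBackoffMs, long pollMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + maxBackoffMs;
        while (quota.isQuotaExceeded()) {
            long remaining = deadline - System.currentTimeMillis();
            if (remaining <= 0)
                return;
            Thread.sleep(Math.min(pollMs, remaining));
        }
    }
}
{code}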



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-4178) Replication Throttling: Consolidate Rate Classes

2016-09-22 Thread Ben Stopford (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15512485#comment-15512485
 ] 

Ben Stopford commented on KAFKA-4178:
-

Thanks Joel. Actually this comment, from that thread, makes some sense: 
"Basically, we were having issues with very large metric values when the metric 
was very recently created."

My guess is that this difference in requirements comes from the fact that 
client quotas throttle by imposing a delay, so if you overestimate the metric, 
as is possible when using the Elapsed Window method, you could calculate a very 
long delay which might cause a client to time out. Replication throttling 
doesn't have this issue in the same way, as an overestimate will only affect 
replication for as long as the metric is actually overestimated, which is never 
more than one or two sub-windows in practice. But replication throttling does 
have an issue with the Fixed Window approach, as it consistently underestimates 
for the entire first window (i.e. ten sub-windows). 
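
To make that concrete with illustrative numbers (not from the PR): suppose ten 
1-second sub-windows and a single measurement of 1,000 bytes recorded 100 ms 
after the windows are created. Fixed divides by the full 10 s span and reports 
1,000 / 10 = 100 B/s, underestimating until the first window has genuinely 
elapsed; Elapsed divides by the 0.1 s actually observed and reports 
1,000 / 0.1 = 10,000 B/s, a large transient overestimate.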

So if we really want to merge the approaches: I actually implemented another 
type of rate (removed from this PR for simplicity), so I'll bring it up here. 
You can see it in this commit; it's called FixedSubWindowPolicy: 
https://github.com/benstopford/kafka/blob/edb51d1d0df04b06a980940f9688a0ab06112784/clients/src/main/java/org/apache/kafka/common/metrics/stats/Window.java

This is essentially a simple hybrid of the two approaches, and I believe it 
would be the best choice if we do consolidate on one. I'll replicate it here as 
it's very simple:

{code:title=Window.java|borderStyle=solid}
/**
 * This policy fixes the first sub-window. If measurements do not span
 * more than one sub-window then the whole sub-window duration is used
 * to calculate the rate.
 *
 * However if there are measurements spanning multiple sub-windows this rate
 * behaves identically to the elapsed window policy.
 *
 * So this provides a slow start, in a similar fashion to FixedWindows,
 * but only over the duration of the first sub-window rather than all
 * sub-windows.
 *
 * This policy provides a balance between the other two. It has a short
 * "slow start", in comparison to the Fixed policy, after which it will have
 * the accuracy of the Elapsed policy.
 */
private static class FixedSubWindowPolicy implements Policy {
    @Override
    public long windowSize(long first, long last, MetricConfig config) {
        long elapsed = last - first;
        return elapsed < config.timeWindowMs() ? config.timeWindowMs() : elapsed;
    }
}
{code}
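
For the same illustrative numbers as above, this hybrid would use the 1 s 
sub-window as the divisor, reporting 1,000 / 1 = 1,000 B/s: still an 
underestimate, but only until the first sub-window has elapsed, after which it 
behaves exactly like Elapsed.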

So this approach will only underestimate in the first sub-window (rather than 
all ten in Fixed, or just the first measurement in Elapsed), so, unless your 
sub-window size is small in relation to the measurement frequency, it should 
work well for client throttling. 

Certainly it appears the best compromise to me. Alternatively we just stick 
with both approaches. I still think there is a reasonable argument for both. 

> Replication Throttling: Consolidate Rate Classes
> 
>
> Key: KAFKA-4178
> URL: https://issues.apache.org/jira/browse/KAFKA-4178
> Project: Kafka
>  Issue Type: Improvement
>  Components: replication
>Affects Versions: 0.10.1.0
>Reporter: Ben Stopford
>
> Replication throttling is using a different implementation of Rate to client 
> throttling (Rate & SimpleRate). These should be consolidated so both use the 
> same approach. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-4200) Minor issue with throttle argument in kafka-reassign-partitions.sh

2016-09-21 Thread Ben Stopford (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Stopford updated KAFKA-4200:

Status: Patch Available  (was: In Progress)

> Minor issue with throttle argument in kafka-reassign-partitions.sh
> --
>
> Key: KAFKA-4200
> URL: https://issues.apache.org/jira/browse/KAFKA-4200
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.1.0
>Reporter: Ben Stopford
>Assignee: Ben Stopford
>
> 1.
> kafka-reassign-partitions --verify prints "Throttle was removed" regardless 
> of whether a throttle was applied. It should only print this if the value was 
> actually changed. 
> 2.
> --verify should raise an exception if the --throttle argument is supplied 
> (check --generate too).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Work started] (KAFKA-4200) Minor issue with throttle argument in kafka-reassign-partitions.sh

2016-09-21 Thread Ben Stopford (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on KAFKA-4200 started by Ben Stopford.
---
> Minor issue with throttle argument in kafka-reassign-partitions.sh
> --
>
> Key: KAFKA-4200
> URL: https://issues.apache.org/jira/browse/KAFKA-4200
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.1.0
>Reporter: Ben Stopford
>Assignee: Ben Stopford
>
> 1.
> kafka-reassign-partitions --verify prints "Throttle was removed" regardless 
> of whether a throttle was applied. It should only print this if the value was 
> actually changed. 
> 2.
> --verify should raise an exception if the --throttle argument is supplied 
> (check --generate too).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (KAFKA-4200) Minor issue with throttle argument in kafka-reassign-partitions.sh

2016-09-20 Thread Ben Stopford (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Stopford reassigned KAFKA-4200:
---

Assignee: Ben Stopford

> Minor issue with throttle argument in kafka-reassign-partitions.sh
> --
>
> Key: KAFKA-4200
> URL: https://issues.apache.org/jira/browse/KAFKA-4200
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.1.0
>Reporter: Ben Stopford
>Assignee: Ben Stopford
>
> 1.
> kafka-reassign-partitions --verify prints "Throttle was removed" regardless 
> of whether a throttle was applied. It should only print this if the value was 
> actually changed. 
> 2.
> --verify should raise an exception if the --throttle argument is supplied 
> (check --generate too).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (KAFKA-4200) Minor issue with throttle argument in kafka-reassign-partitions.sh

2016-09-20 Thread Ben Stopford (JIRA)
Ben Stopford created KAFKA-4200:
---

 Summary: Minor issue with throttle argument in 
kafka-reassign-partitions.sh
 Key: KAFKA-4200
 URL: https://issues.apache.org/jira/browse/KAFKA-4200
 Project: Kafka
  Issue Type: Bug
Affects Versions: 0.10.1.0
Reporter: Ben Stopford


1.
kafka-reassign-partitions --verify prints "Throttle was removed" regardless of 
whether a throttle was applied. It should only print this if the value was 
actually changed. 

2.
--verify should raise an exception if the --throttle argument is supplied 
(check --generate too).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-4197) Make ReassignPartitionsTest System Test move data

2016-09-20 Thread Ben Stopford (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15506446#comment-15506446
 ] 

Ben Stopford commented on KAFKA-4197:
-

https://github.com/apache/kafka/pull/1892

> Make ReassignPartitionsTest System Test move data
> -
>
> Key: KAFKA-4197
> URL: https://issues.apache.org/jira/browse/KAFKA-4197
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.1.0
>Reporter: Ben Stopford
>Assignee: Ben Stopford
>Priority: Minor
>
> The ReassignPartitionsTest system test doesn't reassign any replicas (i.e. 
> move data).
> This is a simple issue. It uses a 3-node cluster with a replication factor of 
> 3, so whilst the replicas are jumbled around, nothing is actually moved from 
> machine to machine when the assignment is executed.
> This fix just ups the number of nodes to 4 so things move.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-4197) Make ReassignPartitionsTest System Test move data

2016-09-20 Thread Ben Stopford (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Stopford updated KAFKA-4197:

Assignee: Ben Stopford
  Status: Patch Available  (was: Open)

> Make ReassignPartitionsTest System Test move data
> -
>
> Key: KAFKA-4197
> URL: https://issues.apache.org/jira/browse/KAFKA-4197
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.10.1.0
>Reporter: Ben Stopford
>Assignee: Ben Stopford
>Priority: Minor
>
> The ReassignPartitionsTest system test doesn't reassign any replicas (i.e. 
> move data).
> This is a simple issue. It uses a 3-node cluster with a replication factor of 
> 3, so whilst the replicas are jumbled around, nothing is actually moved from 
> machine to machine when the assignment is executed.
> This fix just ups the number of nodes to 4 so things move.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (KAFKA-4197) Make ReassignPartitionsTest System Test move data

2016-09-20 Thread Ben Stopford (JIRA)
Ben Stopford created KAFKA-4197:
---

 Summary: Make ReassignPartitionsTest System Test move data
 Key: KAFKA-4197
 URL: https://issues.apache.org/jira/browse/KAFKA-4197
 Project: Kafka
  Issue Type: Bug
Affects Versions: 0.10.1.0
Reporter: Ben Stopford
Priority: Minor


The ReassignPartitionsTest system test doesn't reassign any replicas (i.e. 
move data).

This is a simple issue. It uses a 3-node cluster with a replication factor of 
3, so whilst the replicas are jumbled around, nothing is actually moved from 
machine to machine when the assignment is executed.

This fix just ups the number of nodes to 4 so things move.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-4177) Replication Throttling: Remove ThrottledReplicationRateLimit from Server Config

2016-09-19 Thread Ben Stopford (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15504230#comment-15504230
 ] 

Ben Stopford commented on KAFKA-4177:
-

Patch submitted here: https://github.com/apache/kafka/pull/1864

> Replication Throttling: Remove ThrottledReplicationRateLimit from Server 
> Config
> ---
>
> Key: KAFKA-4177
> URL: https://issues.apache.org/jira/browse/KAFKA-4177
> Project: Kafka
>  Issue Type: Improvement
>  Components: replication
>Affects Versions: 0.10.1.0
>Reporter: Ben Stopford
>Assignee: Ben Stopford
>
> Replication throttling included the concept of a dynamic broker config 
> (currently for just one property: ThrottledReplicationRateLimit). 
> On reflection it seems best to not include this in KafkaConfig, but rather 
> validate only in AdminUtils. Remove the property 
> ThrottledReplicationRateLimit and related config from KafkaConfig and add 
> validation in AdminUtils where the value can be applied/changed. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-4177) Replication Throttling: Remove ThrottledReplicationRateLimit from Server Config

2016-09-19 Thread Ben Stopford (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Stopford updated KAFKA-4177:

Status: Patch Available  (was: In Progress)

> Replication Throttling: Remove ThrottledReplicationRateLimit from Server 
> Config
> ---
>
> Key: KAFKA-4177
> URL: https://issues.apache.org/jira/browse/KAFKA-4177
> Project: Kafka
>  Issue Type: Improvement
>  Components: replication
>Affects Versions: 0.10.1.0
>Reporter: Ben Stopford
>Assignee: Ben Stopford
>
> Replication throttling included the concept of a dynamic broker config 
> (currently for just one property: ThrottledReplicationRateLimit). 
> On reflection it seems best to not include this in KafkaConfig, but rather 
> validate only in AdminUtils. Remove the property 
> ThrottledReplicationRateLimit and related config from KafkaConfig and add 
> validation in AdminUtils where the value can be applied/changed. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-4193) FetcherTest fails intermittently

2016-09-19 Thread Ben Stopford (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Stopford updated KAFKA-4193:

Status: Patch Available  (was: In Progress)

> FetcherTest fails intermittently 
> -
>
> Key: KAFKA-4193
> URL: https://issues.apache.org/jira/browse/KAFKA-4193
> Project: Kafka
>  Issue Type: Bug
>Reporter: Ben Stopford
>Assignee: Ben Stopford
>
> Running FetcherTest.testFetcher many times results in a fairly predictable 
> failure. 
> This appears to be a regression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Work started] (KAFKA-4193) FetcherTest fails intermittently

2016-09-19 Thread Ben Stopford (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on KAFKA-4193 started by Ben Stopford.
---
> FetcherTest fails intermittently 
> -
>
> Key: KAFKA-4193
> URL: https://issues.apache.org/jira/browse/KAFKA-4193
> Project: Kafka
>  Issue Type: Bug
>Reporter: Ben Stopford
>Assignee: Ben Stopford
>
> Running FetcherTest.testFetcher many times results in a fairly predictable 
> failure. 
> This appears to be a regression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (KAFKA-4193) FetcherTest fails intermittently in the presence of the fixed message size change

2016-09-19 Thread Ben Stopford (JIRA)
Ben Stopford created KAFKA-4193:
---

 Summary: FetcherTest fails intermittently in the presence of the 
fixed message size change
 Key: KAFKA-4193
 URL: https://issues.apache.org/jira/browse/KAFKA-4193
 Project: Kafka
  Issue Type: Bug
Reporter: Ben Stopford


Running FetcherTest.testFetcher many times results in a fairly predictable 
failure. 

This appears to be a regression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (KAFKA-4193) FetcherTest fails intermittently

2016-09-19 Thread Ben Stopford (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Stopford reassigned KAFKA-4193:
---

Assignee: Ben Stopford

> FetcherTest fails intermittently 
> -
>
> Key: KAFKA-4193
> URL: https://issues.apache.org/jira/browse/KAFKA-4193
> Project: Kafka
>  Issue Type: Bug
>Reporter: Ben Stopford
>Assignee: Ben Stopford
>
> Running FetcherTest.testFetcher many times results in a fairly predictable 
> failure. 
> This appears to be a regression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-4193) FetcherTest fails intermittently

2016-09-19 Thread Ben Stopford (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Stopford updated KAFKA-4193:

Summary: FetcherTest fails intermittently   (was: FetcherTest fails 
intermittently in the presence of the fixed message size change)

> FetcherTest fails intermittently 
> -
>
> Key: KAFKA-4193
> URL: https://issues.apache.org/jira/browse/KAFKA-4193
> Project: Kafka
>  Issue Type: Bug
>Reporter: Ben Stopford
>
> Running FetcherTest.testFetcher many times results in a fairly predictable 
> failure. 
> This appears to be a regression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Work started] (KAFKA-4184) Test failure in ReplicationQuotasTest.shouldBootstrapTwoBrokersWithFollowerThrottle

2016-09-17 Thread Ben Stopford (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on KAFKA-4184 started by Ben Stopford.
---
> Test failure in 
> ReplicationQuotasTest.shouldBootstrapTwoBrokersWithFollowerThrottle
> ---
>
> Key: KAFKA-4184
> URL: https://issues.apache.org/jira/browse/KAFKA-4184
> Project: Kafka
>  Issue Type: Bug
>  Components: unit tests
>Reporter: Jason Gustafson
>Assignee: Ben Stopford
>
> Seen here: https://builds.apache.org/job/kafka-trunk-jdk7/1544/.
> {code}
> unit.kafka.server.ReplicationQuotasTest > 
> shouldBootstrapTwoBrokersWithFollowerThrottle FAILED
> java.lang.AssertionError: expected:<3.0E7> but was:<2.060725619683527E7>
> at org.junit.Assert.fail(Assert.java:88)
> at org.junit.Assert.failNotEquals(Assert.java:834)
> at org.junit.Assert.assertEquals(Assert.java:553)
> at org.junit.Assert.assertEquals(Assert.java:683)
> at 
> unit.kafka.server.ReplicationQuotasTest$$anonfun$shouldMatchQuotaReplicatingThroughAnAsymmetricTopology$14.apply(ReplicationQuotasTest.scala:172)
> at 
> unit.kafka.server.ReplicationQuotasTest$$anonfun$shouldMatchQuotaReplicatingThroughAnAsymmetricTopology$14.apply(ReplicationQuotasTest.scala:168)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at 
> unit.kafka.server.ReplicationQuotasTest.shouldMatchQuotaReplicatingThroughAnAsymmetricTopology(ReplicationQuotasTest.scala:168)
> at 
> unit.kafka.server.ReplicationQuotasTest.shouldBootstrapTwoBrokersWithFollowerThrottle(ReplicationQuotasTest.scala:74)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-3438) Rack Aware Replica Reassignment should warn of overloaded brokers

2016-09-17 Thread Ben Stopford (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-3438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15498293#comment-15498293
 ] 

Ben Stopford commented on KAFKA-3438:
-

Sorry, missed your original comment. Go for it. 

> Rack Aware Replica Reassignment should warn of overloaded brokers
> -
>
> Key: KAFKA-3438
> URL: https://issues.apache.org/jira/browse/KAFKA-3438
> Project: Kafka
>  Issue Type: Improvement
>Affects Versions: 0.10.0.0
>Reporter: Ben Stopford
>Assignee: Vahid Hashemian
> Fix For: 0.10.1.0
>
>
> We've changed the replica reassignment code to be rack aware.
> One problem that might catch users out would be that they rebalance the 
> cluster using kafka-reassign-partitions.sh but their rack configuration means 
> that some high proportion of replicas are pushed onto a single, or small 
> number of, brokers. 
> This should be an easy problem to avoid, by changing the rack assignment 
> information, but we should probably warn users if they are going to create 
> something that is unbalanced. 
> So imagine I have a Kafka cluster of 12 nodes spread over two racks with rack 
> awareness enabled. If I add a 13th machine, on a new rack, and run the 
> rebalance tool, that new machine will get ~6x as many replicas as the least 
> loaded broker. 
> Suggest a warning be added to the tool output when --generate is called: 
> "The most loaded broker has 2.3x as many replicas as the least loaded 
> broker. This is likely due to an uneven distribution of brokers across racks. 
> You're advised to alter the rack config so there are approximately the same 
> number of brokers per rack", and that the tool display the individual 
> rack→#brokers and broker→#replicas data for the proposed move.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (KAFKA-4177) Replication Throttling: Remove ThrottledReplicationRateLimit from Server Config

2016-09-16 Thread Ben Stopford (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Stopford reassigned KAFKA-4177:
---

Assignee: Ben Stopford

> Replication Throttling: Remove ThrottledReplicationRateLimit from Server 
> Config
> ---
>
> Key: KAFKA-4177
> URL: https://issues.apache.org/jira/browse/KAFKA-4177
> Project: Kafka
>  Issue Type: Improvement
>  Components: replication
>Affects Versions: 0.10.1.0
>Reporter: Ben Stopford
>Assignee: Ben Stopford
>
> Replication throttling included the concept of a dynamic broker config 
> (currently for just one property: ThrottledReplicationRateLimit). 
> On reflection it seems best to not include this in KafkaConfig, but rather 
> validate only in AdminUtils. Remove the property 
> ThrottledReplicationRateLimit and related config from KafkaConfig and add 
> validation in AdminUtils where the value can be applied/changed. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Work started] (KAFKA-4177) Replication Throttling: Remove ThrottledReplicationRateLimit from Server Config

2016-09-16 Thread Ben Stopford (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on KAFKA-4177 started by Ben Stopford.
---
> Replication Throttling: Remove ThrottledReplicationRateLimit from Server 
> Config
> ---
>
> Key: KAFKA-4177
> URL: https://issues.apache.org/jira/browse/KAFKA-4177
> Project: Kafka
>  Issue Type: Improvement
>  Components: replication
>Affects Versions: 0.10.1.0
>Reporter: Ben Stopford
>Assignee: Ben Stopford
>
> Replication throttling included the concept of a dynamic broker config 
> (currently for just one property: ThrottledReplicationRateLimit). 
> On reflection it seems best to not include this in KafkaConfig, but rather 
> validate only in AdminUtils. Remove the property 
> ThrottledReplicationRateLimit and related config from KafkaConfig and add 
> validation in AdminUtils where the value can be applied/changed. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-4179) Replication Throttling: Add Usability Metrics PartitionBytesInRate & SumReplicaLag

2016-09-16 Thread Ben Stopford (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-4179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Stopford updated KAFKA-4179:

Affects Version/s: 0.10.1.0
  Description: 
Add two new metrics to Kafka 

PartitionBytesInRate: Equivalent to BytesInPerSec, but at a partition level 
(i.e. total traffic, both throttled and not throttled), in B/s. This is 
required for estimating how long a rebalance will take to complete. See the 
usability section below.

SumReplicaLag: This is the sum of all replica lag values on the broker. This 
metric is used to monitor progress of a rebalance and is particularly useful 
for determining if the rebalance has become stuck due to an overly harsh 
throttle value (as the metric will stop decreasing).

As covered in KIP-73 
https://cwiki.apache.org/confluence/display/KAFKA/KIP-73+Replication+Quotas

These make it possible for an administrator to calculate how long a rebalance 
will take. 
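
As an illustrative back-of-envelope calculation (numbers made up): if 
SumReplicaLag is 400 GB, the throttle is 50 MB/s and PartitionBytesInRate 
shows the moving partitions receiving 10 MB/s of new data, the lag closes at 
roughly 40 MB/s, so the rebalance should take about 400,000 MB / 40 MB/s 
≈ 10,000 s, or just under three hours.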

> Replication Throttling: Add Usability Metrics PartitionBytesInRate & 
> SumReplicaLag
> --
>
> Key: KAFKA-4179
> URL: https://issues.apache.org/jira/browse/KAFKA-4179
> Project: Kafka
>  Issue Type: Improvement
>  Components: replication
>Affects Versions: 0.10.1.0
>Reporter: Ben Stopford
>
> Add two new metrics to Kafka 
> PartitionBytesInRate: Equivalent to BytesInPerSec, but at a partition level 
> (i.e. total traffic, both throttled and not throttled), in B/s. This is 
> required for estimating how long a rebalance will take to complete. See the 
> usability section below.
> SumReplicaLag: This is the sum of all replica lag values on the broker. This 
> metric is used to monitor progress of a rebalance and is particularly useful 
> for determining if the rebalance has become stuck due to an overly harsh 
> throttle value (as the metric will stop decreasing).
> As covered in KIP-73 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-73+Replication+Quotas
> These make it possible for an administrator to calculate how long a rebalance 
> will take. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (KAFKA-4179) Replication Throttling: Add Usability Metrics PartitionBytesInRate & SumReplicaLag

2016-09-16 Thread Ben Stopford (JIRA)
Ben Stopford created KAFKA-4179:
---

 Summary: Replication Throttling: Add Usability Metrics 
PartitionBytesInRate & SumReplicaLag
 Key: KAFKA-4179
 URL: https://issues.apache.org/jira/browse/KAFKA-4179
 Project: Kafka
  Issue Type: Improvement
  Components: replication
Reporter: Ben Stopford






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (KAFKA-4178) Replication Throttling: Consolidate Rate Classes

2016-09-16 Thread Ben Stopford (JIRA)
Ben Stopford created KAFKA-4178:
---

 Summary: Replication Throttling: Consolidate Rate Classes
 Key: KAFKA-4178
 URL: https://issues.apache.org/jira/browse/KAFKA-4178
 Project: Kafka
  Issue Type: Improvement
  Components: replication
Affects Versions: 0.10.1.0
Reporter: Ben Stopford


Replication throttling is using a different implementation of Rate to client 
throttling (Rate & SimpleRate). These should be consolidated so both use the 
same approach. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (KAFKA-4177) Replication Throttling: Remove ThrottledReplicationRateLimit from Server Config

2016-09-16 Thread Ben Stopford (JIRA)
Ben Stopford created KAFKA-4177:
---

 Summary: Replication Throttling: Remove 
ThrottledReplicationRateLimit from Server Config
 Key: KAFKA-4177
 URL: https://issues.apache.org/jira/browse/KAFKA-4177
 Project: Kafka
  Issue Type: Improvement
  Components: replication
Affects Versions: 0.10.1.0
Reporter: Ben Stopford


Replication throttling included the concept of a dynamic broker config 
(currently for just one property: ThrottledReplicationRateLimit). 

On reflection it seems best to not include this in KafkaConfig, but rather 
validate only in AdminUtils. Remove the property ThrottledReplicationRateLimit 
and related config from KafkaConfig and add validation in AdminUtils where the 
value can be applied/changed. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-2063) Bound fetch response size

2016-07-29 Thread Ben Stopford (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15399960#comment-15399960
 ] 

Ben Stopford commented on KAFKA-2063:
-

[~jkreps] OK, let's see what GZ says on 1. I think it should be ok though. 

On (2), that's a fair point. Happy to stick with randomisation. 

> Bound fetch response size
> -
>
> Key: KAFKA-2063
> URL: https://issues.apache.org/jira/browse/KAFKA-2063
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Jay Kreps
>
> Currently the only bound on the fetch response size is 
> max.partition.fetch.bytes * num_partitions. There are two problems:
> 1. First this bound is often large. You may choose 
> max.partition.fetch.bytes=1MB to enable messages of up to 1MB. However if you 
> also need to consume 1k partitions this means you may receive a 1GB response 
> in the worst case!
> 2. The actual memory usage is unpredictable. Partition assignment changes, 
> and you only actually get the full fetch amount when you are behind and there 
> is a full chunk of data ready. This means an application that seems to work 
> fine will suddenly OOM when partitions shift or when the application falls 
> behind.
> We need to decouple the fetch response size from the number of partitions.
> The proposal for doing this would be to add a new field to the fetch request, 
> max_bytes which would control the maximum data bytes we would include in the 
> response.
> The implementation on the server side would grab data from each partition in 
> the fetch request until it hit this limit, then send back just the data for 
> the partitions that fit in the response. The implementation would need to 
> start from a random position in the list of topics included in the fetch 
> request to ensure that in a case of backlog we fairly balance between 
> partitions (to avoid first giving just the first partition until that is 
> exhausted, then the next partition, etc).
> This setting will make the max.partition.fetch.bytes field in the fetch 
> request much less useful and we should discuss just getting rid of it.
> I believe this also solves the same thing we were trying to address in 
> KAFKA-598. The max_bytes setting now becomes the new limit that would need to 
> be compared to max_message size. This can be much larger--e.g. setting a 50MB 
> max_bytes setting would be okay, whereas now if you set 50MB you may need to 
> allocate 50MB*num_partitions.
> This will require evolving the fetch request protocol version to add the new 
> field and we should do a KIP for it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-2063) Bound fetch response size

2016-07-29 Thread Ben Stopford (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15399855#comment-15399855
 ] 

Ben Stopford commented on KAFKA-2063:
-

Thanks for kicking this one off, Andrey. A couple of things come to mind on 
this:

1. I think it would make some sense if we also removed/deprecated the 
partition-level maxBytes. Having both is a little confusing to the user. That 
way each request would just fill up to the request-level limitBytes using as 
many partitions as are needed to do so. 

2. I do wonder if randomisation is actually the best approach to this. A 
potentially better approach would be to adjust the order of the partitions 
passed by the consumer. So if we have a request for 8 partitions, we would pass 
partitions 0-7 to the server. The server might use partitions 0-3 to fill the 
response up to limitBytes. The consumer would then send the next fetch request 
with partitions ordered: 4,5,6,7,0,1,2,3. In this way we'd achieve reasonable 
fairness whilst also retaining some level of determinism. 
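
A hedged sketch of that reordering on the consumer side (hypothetical names, 
not the actual client code): after each response, rotate the partitions the 
server consumed from to the back of the list for the next request.

{code:title=FetchOrderRotation.java|borderStyle=solid}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: keeps fetch requests fair by moving the partitions that
// were used to fill the last response to the end of the next request's order.
final class FetchOrderRotation {
    static <T> List<T> nextOrder(List<T> current, int partitionsServed) {
        List<T> next = new ArrayList<>(current.subList(partitionsServed, current.size()));
        next.addAll(current.subList(0, partitionsServed));
        return next;
    }

    public static void main(String[] args) {
        List<Integer> order = Arrays.asList(0, 1, 2, 3, 4, 5, 6, 7);
        // Suppose the server used partitions 0-3 to fill the response up to limitBytes:
        System.out.println(nextOrder(order, 4)); // prints [4, 5, 6, 7, 0, 1, 2, 3]
    }
}
{code}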



> Bound fetch response size
> -
>
> Key: KAFKA-2063
> URL: https://issues.apache.org/jira/browse/KAFKA-2063
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Jay Kreps
>
> Currently the only bound on the fetch response size is 
> max.partition.fetch.bytes * num_partitions. There are two problems:
> 1. First this bound is often large. You may choose 
> max.partition.fetch.bytes=1MB to enable messages of up to 1MB. However if you 
> also need to consume 1k partitions this means you may receive a 1GB response 
> in the worst case!
> 2. The actual memory usage is unpredictable. Partition assignment changes, 
> and you only actually get the full fetch amount when you are behind and there 
> is a full chunk of data ready. This means an application that seems to work 
> fine will suddenly OOM when partitions shift or when the application falls 
> behind.
> We need to decouple the fetch response size from the number of partitions.
> The proposal for doing this would be to add a new field to the fetch request, 
> max_bytes which would control the maximum data bytes we would include in the 
> response.
> The implementation on the server side would grab data from each partition in 
> the fetch request until it hit this limit, then send back just the data for 
> the partitions that fit in the response. The implementation would need to 
> start from a random position in the list of topics included in the fetch 
> request to ensure that in a case of backlog we fairly balance between 
> partitions (to avoid first giving just the first partition until that is 
> exhausted, then the next partition, etc).
> This setting will make the max.partition.fetch.bytes field in the fetch 
> request much less useful and we should discuss just getting rid of it.
> I believe this also solves the same thing we were trying to address in 
> KAFKA-598. The max_bytes setting now becomes the new limit that would need to 
> be compared to max_message size. This can be much larger--e.g. setting a 50MB 
> max_bytes setting would be okay, whereas now if you set 50MB you may need to 
> allocate 50MB*num_partitions.
> This will require evolving the fetch request protocol version to add the new 
> field and we should do a KIP for it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (KAFKA-1464) Add a throttling option to the Kafka replication tool

2016-07-01 Thread Ben Stopford (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Stopford reassigned KAFKA-1464:
---

Assignee: Ben Stopford  (was: Ismael Juma)

> Add a throttling option to the Kafka replication tool
> -
>
> Key: KAFKA-1464
> URL: https://issues.apache.org/jira/browse/KAFKA-1464
> Project: Kafka
>  Issue Type: New Feature
>  Components: replication
>Affects Versions: 0.8.0
>Reporter: mjuarez
>Assignee: Ben Stopford
>Priority: Minor
>  Labels: replication, replication-tools
> Fix For: 0.10.1.0
>
>
> When performing replication on new nodes of a Kafka cluster, the replication 
> process will use all available resources to replicate as fast as possible.  
> This causes performance issues (mostly disk IO and sometimes network 
> bandwidth) when doing this in a production environment, in which you're 
> trying to serve downstream applications, at the same time you're performing 
> maintenance on the Kafka cluster.
> An option to throttle the replication to a specific rate (in either MB/s or 
> activities/second) would help production systems to better handle maintenance 
> tasks while still serving downstream applications.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-3726) Enable cold storage option

2016-05-23 Thread Ben Stopford (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-3726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15296178#comment-15296178
 ] 

Ben Stopford commented on KAFKA-3726:
-

Ah, OK. Yes, I quite like this idea, particularly in conjunction with 
compacted topics. It would be subject to the disk format remaining backwards 
compatible, but considering consumers get the disk format directly, that's 
already, to some extent, part of the contract. 

So my interpretation of your use case, from your post, is: if you're ingesting 
data into Kafka with the aim of getting it into file-based storage for offline 
processing, it would be simpler to just copy the Kafka data files directly, 
rather than consume them and recreate new files in cold storage. 

Is that correct? I mention this partially because you mention backing files up 
here, which makes me think of database backups etc, which is a slightly 
different use case. 


> Enable cold storage option
> --
>
> Key: KAFKA-3726
> URL: https://issues.apache.org/jira/browse/KAFKA-3726
> Project: Kafka
>  Issue Type: Wish
>Reporter: Radoslaw Gruchalski
> Attachments: kafka-cold-storage.txt
>
>
> This JIRA builds on the cold storage article I have published on Medium. A 
> copy of the article is attached here.
> The need for cold storage or an "indefinite" log seems to be quite often 
> discussed on the user mailing list.
> The cold storage idea would give the operator the opportunity to keep the 
> raw Kafka offset files in third-party storage and allow retrieving the data 
> for re-consumption.
> The two possible options for enabling such functionality are, from the 
> article:
> First approach: if Kafka provided a notification mechanism and could trigger 
> a program when a segment file is to be discarded, it would become feasible to 
> provide a standard method of moving data to cold storage in reaction to those 
> events. Once the program finishes backing the segments up, it could tell 
> Kafka “it is now safe to delete these segments”.
> The second option is to provide an additional value for the 
> log.cleanup.policy setting, call it cold-storage. In case of this value, 
> Kafka would move the segment files — which otherwise would be deleted — to 
> another destination on the server. They can be picked up from there and moved 
> to the cold storage.
> Both have their limitations. The former one is simply a mechanism exposed to 
> allow the operator to build up the tooling necessary to enable this. Events 
> could 
> be published in a manner similar to Mesos Event Bus 
> (https://mesosphere.github.io/marathon/docs/event-bus.html) or Kafka itself 
> could provide a control topic on which such info would be published. The 
> outcome is, the operator can subscribe to the event bus and get notified 
> about, at least, two events:
> - log segment is complete and can be backed up
> - partition leader changed
> These two, together with an option to keep the log segment safe from 
> compaction for a certain amount of time, would be sufficient to reliably 
> implement cold storage.
> The latter option, the {{log.cleanup.policy}} setting, would be a more 
> complete feature but it is also much more difficult to implement. All brokers 
> would have to keep a backup of the data in the cold storage, significantly 
> increasing the size requirements; also, the de-duplication of the replicated 
> data would be left completely to the operator.
> In any case, the thing to stay away from is having Kafka deal with the 
> physical aspect of moving the data to and back from the cold storage. This is 
> not Kafka's task. The intent is to provide a method for reliable cold storage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-3736) Add http metrics reporter

2016-05-20 Thread Ben Stopford (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-3736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15293328#comment-15293328
 ] 

Ben Stopford commented on KAFKA-3736:
-

I think it's the "Any change that impacts the public interfaces of the project" 
part that's triggering inclusion here. 

> Add http metrics reporter
> -
>
> Key: KAFKA-3736
> URL: https://issues.apache.org/jira/browse/KAFKA-3736
> Project: Kafka
>  Issue Type: New Feature
>  Components: core
>Reporter: Adrian Muraru
> Fix For: 0.10.1.0
>
>
> The current builtin JMX metrics reporter is pretty heavy in terms of load and 
> collection. A new http lightweight reporter is proposed to expose the metrics 
> via a local http port.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-3736) Add http metrics reporter

2016-05-20 Thread Ben Stopford (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-3736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15293234#comment-15293234
 ] 

Ben Stopford commented on KAFKA-3736:
-

Nice little idea! I think it might require a KIP as technically it's a public 
interface. [~ijuma] can confirm.

https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals

> Add http metrics reporter
> -
>
> Key: KAFKA-3736
> URL: https://issues.apache.org/jira/browse/KAFKA-3736
> Project: Kafka
>  Issue Type: New Feature
>  Components: core
>Reporter: Adrian Muraru
> Fix For: 0.10.0.0
>
>
> The current builtin JMX metrics reporter is pretty heavy in terms of load and 
> collection. A new http lightweight reporter is proposed to expose the metrics 
> via a local http port.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-3726) Enable cold storage option

2016-05-18 Thread Ben Stopford (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-3726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15289608#comment-15289608
 ] 

Ben Stopford commented on KAFKA-3726:
-

The standard approach to this sort of problem would be to use Kafka Connect 
to move data to HDFS or S3 etc. Would this not suffice?

> Enable cold storage option
> --
>
> Key: KAFKA-3726
> URL: https://issues.apache.org/jira/browse/KAFKA-3726
> Project: Kafka
>  Issue Type: Wish
>Reporter: Radoslaw Gruchalski
> Attachments: kafka-cold-storage.txt
>
>
> This JIRA builds on the cold storage article I have published on Medium. A 
> copy of the article is attached here.
> The need for cold storage or an "indefinite" log seems to be quite often 
> discussed on the user mailing list.
> The cold storage idea would give the operator the opportunity to keep the 
> raw Kafka offset files in third-party storage and allow retrieving the data 
> for re-consumption.
> The two possible options for enabling such functionality are, from the 
> article:
> First approach: if Kafka provided a notification mechanism and could trigger 
> a program when a segment file is to be discarded, it would become feasible to 
> provide a standard method of moving data to cold storage in reaction to those 
> events. Once the program finishes backing the segments up, it could tell 
> Kafka “it is now safe to delete these segments”.
> The second option is to provide an additional value for the 
> log.cleanup.policy setting, call it cold-storage. In case of this value, 
> Kafka would move the segment files — which otherwise would be deleted — to 
> another destination on the server. They can be picked up from there and moved 
> to the cold storage.
> Both have their limitations. The former one is simply a mechanism exposed to 
> allow the operator to build up the tooling necessary to enable this. Events 
> could 
> be published in a manner similar to Mesos Event Bus 
> (https://mesosphere.github.io/marathon/docs/event-bus.html) or Kafka itself 
> could provide a control topic on which such info would be published. The 
> outcome is, the operator can subscribe to the event bus and get notified 
> about, at least, two events:
> - log segment is complete and can be backed up
> - partition leader changed
> These two, together with an option to keep the log segment safe from 
> compaction for a certain amount of time, would be sufficient to reliably 
> implement cold storage.
> The latter option, the {{log.cleanup.policy}} setting, would be a more 
> complete feature but it is also much more difficult to implement. All brokers 
> would have to keep a backup of the data in the cold storage, significantly 
> increasing the size requirements; also, the de-duplication of the replicated 
> data would be left completely to the operator.
> In any case, the thing to stay away from is having Kafka deal with the 
> physical aspect of moving the data to and back from the cold storage. This is 
> not Kafka's task. The intent is to provide a method for reliable cold storage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (KAFKA-3438) Rack Aware Replica Reassignment should warn of overloaded brokers

2016-03-21 Thread Ben Stopford (JIRA)
Ben Stopford created KAFKA-3438:
---

 Summary: Rack Aware Replica Reassignment should warn of overloaded 
brokers
 Key: KAFKA-3438
 URL: https://issues.apache.org/jira/browse/KAFKA-3438
 Project: Kafka
  Issue Type: Improvement
Reporter: Ben Stopford


We've changed the replica reassignment code to be rack aware.

One problem that might catch users out would be that they rebalance the cluster 
using kafka-reassign-partitions.sh but their rack configuration means that some 
high proportion of replicas are pushed onto a single, or small number of, 
brokers. 

This should be an easy problem to avoid, by changing the rack assignment 
information, but we should probably warn users if they are going to create 
something that is unbalanced. 

So imagine I have a Kafka cluster of 12 nodes spread over two racks with rack 
awareness enabled. If I add a 13th machine, on a new rack, and run the 
rebalance tool, that new machine will get ~6x as many replicas as the least 
loaded broker. 

Suggest a warning be added to the tool output when --generate is called: "The 
most loaded broker has 2.3x as many replicas as the least loaded broker. This 
is likely due to an uneven distribution of brokers across racks. You're 
advised to alter the rack config so there are approximately the same number of 
brokers per rack", and that the tool display the individual rack→#brokers and 
broker→#replicas data for the proposed move.
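
A small sketch of the proposed check (hypothetical shape; the real change 
would live in the reassignment tool, and the 1.5x threshold is an assumption, 
not from this JIRA): count replicas per broker in the proposed assignment and 
warn when the max/min ratio is high.

{code:title=ImbalanceWarning.java|borderStyle=solid}
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only. proposedReplicas maps each partition to its replica list
// (broker ids). Brokers assigned no replicas at all are not counted here.
final class ImbalanceWarning {
    static void warnIfUnbalanced(Map<Integer, List<Integer>> proposedReplicas) {
        Map<Integer, Integer> replicasPerBroker = new HashMap<>();
        for (List<Integer> replicas : proposedReplicas.values())
            for (int broker : replicas)
                replicasPerBroker.merge(broker, 1, Integer::sum);
        double ratio = (double) Collections.max(replicasPerBroker.values())
                / Collections.min(replicasPerBroker.values());
        if (ratio > 1.5)  // assumed threshold
            System.out.printf("Warning: the most loaded broker has %.1fx as many "
                + "replicas as the least loaded broker. This is likely due to an "
                + "uneven distribution of brokers across racks.%n", ratio);
    }
}
{code}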






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (KAFKA-3270) Add Basic Integration Tests Around ReassignPartitionsCommand

2016-02-23 Thread Ben Stopford (JIRA)
Ben Stopford created KAFKA-3270:
---

 Summary: Add Basic Integration Tests Around 
ReassignPartitionsCommand
 Key: KAFKA-3270
 URL: https://issues.apache.org/jira/browse/KAFKA-3270
 Project: Kafka
  Issue Type: Improvement
Reporter: Ben Stopford


Currently this has no tests, and we want to do some work in this area, so 
let's add some. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (KAFKA-3270) Add Basic Integration Tests Around ReassignPartitionsCommand

2016-02-23 Thread Ben Stopford (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-3270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Stopford reassigned KAFKA-3270:
---

Assignee: Ben Stopford

> Add Basic Integration Tests Around ReassignPartitionsCommand
> 
>
> Key: KAFKA-3270
> URL: https://issues.apache.org/jira/browse/KAFKA-3270
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Ben Stopford
>Assignee: Ben Stopford
>
> Currently this has no tests, and we want to do some work in this area, so 
> let's add some. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-3244) TopicCommand doesn't allow to add partitions with specific assignment

2016-02-17 Thread Ben Stopford (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-3244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15151110#comment-15151110
 ] 

Ben Stopford commented on KAFKA-3244:
-

Hey - [~granthenke] - I'm working with [~gwenshap] and could pick this up if 
you like?

> TopicCommand doesn't allow to add partitions with specific assignment
> -
>
> Key: KAFKA-3244
> URL: https://issues.apache.org/jira/browse/KAFKA-3244
> Project: Kafka
>  Issue Type: Bug
>Reporter: Gwen Shapira
>Assignee: Grant Henke
>
> The combination of "alter" "partitions" and "replica-assignment" is marked as 
> invalid here:
> https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/admin/TopicCommand.scala#L337
> Although the alter method actually allowed using both parameters:
> https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/admin/TopicCommand.scala#L144
> I think the command line limitation can be safely removed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-3223) Add System (ducktape) Test that asserts strict partition ordering despite node failure

2016-02-09 Thread Ben Stopford (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-3223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Stopford updated KAFKA-3223:

Description: 
Kafka provides strong ordering when max.inflight.requests.per.connection = 1. 
There currently exists a bug (patched but not released) against this feature. 
It's an important feature for many customers so we should add a test to assert 
that the contract is honoured.

Suggest a similar format to ReassignPartitionsTest 
(reassign_partitions_test.py) or ReplicationTest (replication_test.py). It 
should be simply a case of asserting on the order of the messages (which are 
monotonically increasing numbers in these tests) in each partition with the 
inflight configuration whilst incurring repeated node failure.

Note that this jira depends on the merge of KAFKA-3197 

  was:
Kafka provides strong ordering when max.inflight.requests.per.connection = 1. 
There currently exists a bug (patched but not released) against this feature. 
It's an important feature for many customers so we should add a test to assert 
that the contract is honoured.

Suggest a similar format to ReassignPartitionsTest 
(reassign_partitions_test.py) or ReplicationTest (replication_test.py). It 
should simply be a case of asserting on the order of the messages (which are 
monotonically increasing numbers in these tests) in each partition with the 
inflight configuration.

Note that this jira depends on the merge of KAFKA-3197 


> Add System (ducktape) Test that asserts strict partition ordering despite 
> node failure
> --
>
> Key: KAFKA-3223
> URL: https://issues.apache.org/jira/browse/KAFKA-3223
> Project: Kafka
>  Issue Type: Test
>Reporter: Ben Stopford
>
> Kafka provides strong ordering when max.inflight.requests.per.connection = 1. 
> There currently exists a bug (patched but not released) against this feature. 
> It's an important feature for many customers so we should add a test to 
> assert that the contract is honoured.
> Suggest a similar format to ReassignPartitionsTest 
> (reassign_partitions_test.py) or ReplicationTest (replication_test.py). It 
> should simply be a case of asserting on the order of the messages (which are 
> monotonically increasing numbers in these tests) in each partition with the 
> inflight configuration whilst incurring repeated node failure.
> Note that this jira depends on the merge of KAFKA-3197 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (KAFKA-3223) Add System (ducktape) Test that asserts strict partition ordering despite node failure

2016-02-09 Thread Ben Stopford (JIRA)
Ben Stopford created KAFKA-3223:
---

 Summary: Add System (ducktape) Test that asserts strict partition 
ordering despite node failure
 Key: KAFKA-3223
 URL: https://issues.apache.org/jira/browse/KAFKA-3223
 Project: Kafka
  Issue Type: Test
Reporter: Ben Stopford


Kafka provides strong ordering when max.inflight.requests.per.connection = 1. 
There currently exists a bug (patched but not released) against this feature. 
It's an important feature for many customers so we should add a test to assert 
that the contract is honoured.

Suggest a similar format to ReassignPartitionsTest 
(reassign_partitions_test.py) or ReplicationTest (replication_test.py). It 
should simply be a case of asserting on the order of the messages (which are 
monotonically increasing numbers in these tests) in each partition with the 
inflight configuration.

Note that this jira depends on the merge of KAFKA-3197 
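As a rough sketch of the core assertion such a test would make (plain Python;
the function name and data shape are illustrative, not the existing ducktape
API):

{code}
def assert_strict_ordering(consumed_by_partition):
    # consumed_by_partition: dict of partition id -> list of integer payloads
    # in the order the consumer received them. The producers in these tests
    # write monotonically increasing numbers, so within each partition every
    # value must be larger than its predecessor.
    for partition, values in consumed_by_partition.items():
        for prev, curr in zip(values, values[1:]):
            assert prev < curr, (
                "Ordering violated in partition %d: %d seen after %d"
                % (partition, curr, prev))
{code}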



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-2637) Cipher suite setting should be configurable for SSL

2015-12-11 Thread Ben Stopford (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Stopford updated KAFKA-2637:

Resolution: Fixed
Status: Resolved  (was: Patch Available)

> Cipher suite setting should be configurable for SSL
> ---
>
> Key: KAFKA-2637
> URL: https://issues.apache.org/jira/browse/KAFKA-2637
> Project: Kafka
>  Issue Type: Improvement
>  Components: security
>Reporter: Ben Stopford
>Assignee: Ben Stopford
>
> Currently the Cipher Suite property (SSL_CIPHER_SUITES_CONFIG) is configured 
> but ignored, meaning you can't change the cipher suites via the regular SSL 
> config. This should be enabled.
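For context, the property in question is the standard ssl.cipher.suites
client/broker setting. A minimal sketch of a client config that exercises it
once wired up (the helper, file name, paths, and chosen suite are
illustrative):

{code}
def write_ssl_client_config(path="client-ssl.properties"):
    # ssl.cipher.suites is the property behind SSL_CIPHER_SUITES_CONFIG;
    # restricting it to a single suite is an easy way to verify it is honoured.
    props = {
        "security.protocol": "SSL",
        "ssl.truststore.location": "/etc/kafka/client.truststore.jks",
        "ssl.truststore.password": "changeit",
        "ssl.cipher.suites": "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256",
    }
    with open(path, "w") as f:
        for key, value in props.items():
            f.write("%s=%s\n" % (key, value))
{code}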



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-2771) Add Rolling Upgrade to Secured Cluster to System Tests

2015-12-11 Thread Ben Stopford (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Stopford updated KAFKA-2771:

Resolution: Fixed
Status: Resolved  (was: Patch Available)

> Add Rolling Upgrade to Secured Cluster to System Tests
> --
>
> Key: KAFKA-2771
> URL: https://issues.apache.org/jira/browse/KAFKA-2771
> Project: Kafka
>  Issue Type: Improvement
>  Components: security
>Reporter: Ben Stopford
>Assignee: Ben Stopford
>
> Ensure we can perform a rolling upgrade to enable SSL, SASL_PLAINTEXT &  on a 
> running cluster.
> *Method*
> - Start with a 0.9.0 cluster with security disabled
> - Upgrade the Client and Inter-Broker ports to SSL (this will take two rounds 
> of bounces: one to open the SSL port and one to close the PLAINTEXT port; see 
> the sketch below)
> - Ensure you can produce (acks = -1) and consume during the process.
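A sketch of the double rolling bounce described above (the broker objects and
reconfigure callbacks are illustrative, not the actual system-test API):

{code}
def rolling_bounce(brokers, reconfigure):
    # Restart one broker at a time so the cluster stays available and
    # produce/consume can be verified throughout; 'reconfigure' rewrites
    # that broker's listener config before it comes back up.
    for broker in brokers:
        broker.stop()
        reconfigure(broker)
        broker.start()

# Round one: open the SSL port alongside PLAINTEXT on every broker.
# Round two: close the PLAINTEXT port once all brokers accept SSL.
{code}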



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (KAFKA-2964) Split Security Rolling Upgrade Test By Client and Broker Protocols

2015-12-08 Thread Ben Stopford (JIRA)
Ben Stopford created KAFKA-2964:
---

 Summary: Split Security Rolling Upgrade Test By Client and Broker 
Protocols
 Key: KAFKA-2964
 URL: https://issues.apache.org/jira/browse/KAFKA-2964
 Project: Kafka
  Issue Type: Improvement
Reporter: Ben Stopford
Priority: Minor


We should ensure the security rolling upgrade test runs with different 
client-broker and broker-broker protocols (previously it only ran with matching 
protocol pairs).
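Ducktape's @matrix annotation is one way to get the cross product of protocols;
a sketch (the class, test name, and setup are illustrative):

{code}
from ducktape.mark import matrix

class SecurityRollingUpgradeTest(object):  # illustrative skeleton only

    @matrix(client_protocol=["SSL", "SASL_PLAINTEXT"],
            broker_protocol=["SSL", "SASL_PLAINTEXT"])
    def test_rolling_upgrade(self, client_protocol, broker_protocol):
        # Each client-broker / broker-broker protocol combination runs as
        # its own test case rather than only matching pairs.
        pass
{code}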



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (KAFKA-2908) Consumer Sporadically Stops Consuming From Partition After Server Restart

2015-12-03 Thread Ben Stopford (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Stopford resolved KAFKA-2908.
-
Resolution: Cannot Reproduce

I haven't been able to reproduce this issue. I'll add a check to the system 
tests that identifies this type of problem should it occur again. 

> Consumer Sporadically Stops Consuming From Partition After Server Restart
> -
>
> Key: KAFKA-2908
> URL: https://issues.apache.org/jira/browse/KAFKA-2908
> Project: Kafka
>  Issue Type: Sub-task
>  Components: consumer
>Reporter: Ben Stopford
>Assignee: Jason Gustafson
> Attachments: 2015-11-28--001.tar.gz
>
>
> *Context:*
> Instance of the rolling upgrade test. 10s sleeps have been put around node 
> restarts to ensure stability when subsequent nodes go down. Consumer timeout 
> has been set to 60s. Producer has been throttled to 100 messages per second. 
> Failure is rare: It occurred once in 60 executions (6 executions per run #276 
> -> #285 of the system_test_branch_builder)
> *Reported Failure:*
> At least one acked message did not appear in the consumed messages. 
> acked_minus_consumed: 16385, 16388, 16391, 16394, 16397, 16400, 16403, 16406, 
> 16409, 16412, 16415, 16418, 16421, 16424, 16427, 16430, 16433, 16436, 16439, 
> 16442, ...plus 1669 more
> *Immediate Observations:*
> * The messages that were not consumed are all in partition 1.
> * Production and Consumption continue throughout the test (there is no 
> complete write or read failure as we have seen elsewhere)
> * The messages ARE present in the data files:
> e.g. 
> {quote}
> Examining missing value 16385. This was written to p1,offset:5453
> => there is an entry in all data files for partition 1 for this (presumably 
> meaning it was replicated correctly)
> kafka/bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files 
> worker10/kafka-data-logs/test_topic-1/.log | grep 
> 'offset: 5453'
> worker10,9,8:
> offset: 5453 position: 165346 isvalid: true payloadsize: 5 magic: 0 
> compresscodec: NoCompressionCodec crc: 1502953075 payload: 16385
> => ID 16385 is definitely present in the Kafka logs suggesting a problem 
> consumer-side
> {quote}
> These entries definitely do not appear in the consumer stdout file either. 
> *Timeline:*
> {quote}
> 15:27:02,232 - Producer sends first message
> ..servers 1, 2 are restarted (clean shutdown)
> 15:29:42,718 - Server 3 shutdown complete
> 15:29:42,712 - (Controller fails over): Broker 2 starting become controller 
> state transition (kafka.controller.KafkaController)
> 15:29:42,743 - New leader is 2 (LeaderChangeListener)
> 15:29:43,239 - WARN Broker 2 ignoring LeaderAndIsr request from controller 2 
> with correlation id 0 epoch 7 for partition (test_topic,1) since its 
> associated leader epoch 8 is old. Current leader epoch is 8 
> (state.change.logger)
> 15:29:45,642 - Producer starts writing messages that are never consumed
> 15:30:10,804 - Last message sent by producer
> 15:31:10,983 - Consumer times out after 60s wait without messages
> {quote}
> Logs for this run are attached
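A sketch of the check performed above, wrapping DumpLogSegments to confirm
whether a given offset appears in a partition's log file (the helper and paths
are illustrative; the command mirrors the one quoted in the description):

{code}
import subprocess

def dump_line_for_offset(log_file, offset):
    # Returns the DumpLogSegments output line for the given offset,
    # or None if the offset is absent from this log file.
    out = subprocess.check_output(
        ["kafka/bin/kafka-run-class.sh", "kafka.tools.DumpLogSegments",
         "--files", log_file])
    needle = "offset: %d " % offset
    for line in out.decode("utf-8").splitlines():
        if needle in line:
            return line
    return None
{code}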



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (KAFKA-2908) Another, possibly different, Gap in Consumption after Restart

2015-12-02 Thread Ben Stopford (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Stopford resolved KAFKA-2908.
-
Resolution: Fixed

Was bug in the way the test defines min.insync.replicas as found by [~rsivaram]

> Another, possibly different, Gap in Consumption after Restart
> -
>
> Key: KAFKA-2908
> URL: https://issues.apache.org/jira/browse/KAFKA-2908
> Project: Kafka
>  Issue Type: Sub-task
>  Components: consumer
>Reporter: Ben Stopford
>Assignee: Jason Gustafson
> Attachments: 2015-11-28--001.tar.gz
>
>
> *Context:*
> Instance of the rolling upgrade test. 10s sleeps have been put around node 
> restarts to ensure stability when subsequent nodes go down. Consumer timeout 
> has been set to 60s. Producer has been throttled to 100 messages per second. 
> Failure is rare: It occurred once in 60 executions (6 executions per run #276 
> -> #285 of the system_test_branch_builder)
> *Reported Failure:*
> At least one acked message did not appear in the consumed messages. 
> acked_minus_consumed: 16385, 16388, 16391, 16394, 16397, 16400, 16403, 16406, 
> 16409, 16412, 16415, 16418, 16421, 16424, 16427, 16430, 16433, 16436, 16439, 
> 16442, ...plus 1669 more
> *Immediate Observations:*
> * The messages that were not consumed are all in partition 1.
> * Production and Consumption continue throughout the test (there is no 
> complete write or read failure as we have seen elsewhere)
> * The messages ARE present in the data files:
> e.g. 
> {quote}
> Examining missing value 16385. This was written to p1,offset:5453
> => there is an entry in all data files for partition 1 for this (presumably 
> meaning it was replicated correctly)
> kafka/bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files 
> worker10/kafka-data-logs/test_topic-1/.log | grep 
> 'offset: 5453'
> worker10,9,8:
> offset: 5453 position: 165346 isvalid: true payloadsize: 5 magic: 0 
> compresscodec: NoCompressionCodec crc: 1502953075 payload: 16385
> => ID 16385 is definitely present in the Kafka logs suggesting a problem 
> consumer-side
> {quote}
> These entries definitely do not appear in the consumer stdout file either. 
> *Timeline:*
> {quote}
> 15:27:02,232 - Producer sends first message
> ..servers 1, 2 are restarted (clean shutdown)
> 15:29:42,718 - Server 3 shutdown complete
> 15:29:42,712 - (Controller fails over): Broker 2 starting become controller 
> state transition (kafka.controller.KafkaController)
> 15:29:42,743 - New leader is 2 (LeaderChangeListener)
> 15:29:43,239 - WARN Broker 2 ignoring LeaderAndIsr request from controller 2 
> with correlation id 0 epoch 7 for partition (test_topic,1) since its 
> associated leader epoch 8 is old. Current leader epoch is 8 
> (state.change.logger)
> 15:29:45,642 - Producer starts writing messages that are never consumed
> 15:30:10,804 - Last message sent by producer
> 15:31:10,983 - Consumer times out after 60s wait without messages
> {quote}
> Logs for this run are attached



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Issue Comment Deleted] (KAFKA-2908) Another, possibly different, Gap in Consumption after Restart

2015-12-02 Thread Ben Stopford (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Stopford updated KAFKA-2908:

Comment: was deleted

(was: Was bug in the way the test defines min.insync.replicas as found by 
[~rsivaram])

> Another, possibly different, Gap in Consumption after Restart
> -
>
> Key: KAFKA-2908
> URL: https://issues.apache.org/jira/browse/KAFKA-2908
> Project: Kafka
>  Issue Type: Sub-task
>  Components: consumer
>Reporter: Ben Stopford
>Assignee: Jason Gustafson
> Attachments: 2015-11-28--001.tar.gz
>
>
> *Context:*
> Instance of the rolling upgrade test. 10s sleeps have been put around node 
> restarts to ensure stability when subsequent nodes go down. Consumer timeout 
> has been set to 60s. Producer has been throttled to 100 messages per second. 
> Failure is rare: It occurred once in 60 executions (6 executions per run #276 
> -> #285 of the system_test_branch_builder)
> *Reported Failure:*
> At least one acked message did not appear in the consumed messages. 
> acked_minus_consumed: 16385, 16388, 16391, 16394, 16397, 16400, 16403, 16406, 
> 16409, 16412, 16415, 16418, 16421, 16424, 16427, 16430, 16433, 16436, 16439, 
> 16442, ...plus 1669 more
> *Immediate Observations:*
> * The messages that were not consumed are all in partition 1.
> * Production and Consumption continue throughout the test (there is no 
> complete write or read failure as we have seen elsewhere)
> * The messages ARE present in the data files:
> e.g. 
> {quote}
> Examining missing value 16385. This was written to p1,offset:5453
> => there is an entry in all data files for partition 1 for this (presumably 
> meaning it was replicated correctly)
> kafka/bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files 
> worker10/kafka-data-logs/test_topic-1/.log | grep 
> 'offset: 5453'
> worker10,9,8:
> offset: 5453 position: 165346 isvalid: true payloadsize: 5 magic: 0 
> compresscodec: NoCompressionCodec crc: 1502953075 payload: 16385
> => ID 16385 is definitely present in the Kafka logs suggesting a problem 
> consumer-side
> {quote}
> These entries definitely do not appear in the consumer stdout file either. 
> *Timeline:*
> {quote}
> 15:27:02,232 - Producer sends first message
> ..servers 1, 2 are restarted (clean shutdown)
> 15:29:42,718 - Server 3 shutdown complete
> 15:29:42,712 - (Controller fails over): Broker 2 starting become controller 
> state transition (kafka.controller.KafkaController)
> 15:29:42,743 - New leader is 2 (LeaderChangeListener)
> 15:29:43,239 - WARN Broker 2 ignoring LeaderAndIsr request from controller 2 
> with correlation id 0 epoch 7 for partition (test_topic,1) since its 
> associated leader epoch 8 is old. Current leader epoch is 8 
> (state.change.logger)
> 15:29:45,642 - Producer starts writing messages that are never consumed
> 15:30:10,804 - Last message sent by producer
> 15:31:10,983 - Consumer times out after 60s wait without messages
> {quote}
> Logs for this run are attached



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-2908) Consumer Sporadically Stops Consuming From Partition After Server Restart

2015-12-02 Thread Ben Stopford (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Stopford updated KAFKA-2908:

Summary: Consumer Sporadically Stops Consuming From Partition After Server 
Restart  (was: Another, possibly different, Gap in Consumption after Restart)

> Consumer Sporadically Stops Consuming From Partition After Server Restart
> -
>
> Key: KAFKA-2908
> URL: https://issues.apache.org/jira/browse/KAFKA-2908
> Project: Kafka
>  Issue Type: Sub-task
>  Components: consumer
>Reporter: Ben Stopford
>Assignee: Jason Gustafson
> Attachments: 2015-11-28--001.tar.gz
>
>
> *Context:*
> Instance of the rolling upgrade test. 10s sleeps have been put around node 
> restarts to ensure stability when subsequent nodes go down. Consumer timeout 
> has been set to 60s. Producer has been throttled to 100 messages per second. 
> Failure is rare: It occurred once in 60 executions (6 executions per run #276 
> -> #285 of the system_test_branch_builder)
> *Reported Failure:*
> At least one acked message did not appear in the consumed messages. 
> acked_minus_consumed: 16385, 16388, 16391, 16394, 16397, 16400, 16403, 16406, 
> 16409, 16412, 16415, 16418, 16421, 16424, 16427, 16430, 16433, 16436, 16439, 
> 16442, ...plus 1669 more
> *Immediate Observations:*
> * The messages that were not consumed are all in partition 1.
> * Production and Consumption continue throughout the test (there is no 
> complete write or read failure as we have seen elsewhere)
> * The messages ARE present in the data files:
> e.g. 
> {quote}
> Examining missing value 16385. This was written to p1,offset:5453
> => there is an entry in all data files for partition 1 for this (presumably 
> meaning it was replicated correctly)
> kafka/bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files 
> worker10/kafka-data-logs/test_topic-1/.log | grep 
> 'offset: 5453'
> worker10,9,8:
> offset: 5453 position: 165346 isvalid: true payloadsize: 5 magic: 0 
> compresscodec: NoCompressionCodec crc: 1502953075 payload: 16385
> => ID 16385 is definitely present in the Kafka logs suggesting a problem 
> consumer-side
> {quote}
> These entries definitely do not appear in the consumer stdout file either. 
> *Timeline:*
> {quote}
> 15:27:02,232 - Producer sends first message
> ..servers 1, 2 are restarted (clean shutdown)
> 15:29:42,718 - Server 3 shutdown complete
> 15:29:42,712 - (Controller fails over): Broker 2 starting become controller 
> state transition (kafka.controller.KafkaController)
> 15:29:42,743 - New leader is 2 (LeaderChangeListener)
> 15:29:43,239 - WARN Broker 2 ignoring LeaderAndIsr request from controller 2 
> with correlation id 0 epoch 7 for partition (test_topic,1) since its 
> associated leader epoch 8 is old. Current leader epoch is 8 
> (state.change.logger)
> 15:29:45,642 - Producer starts writing messages that are never consumed
> 15:30:10,804 - Last message sent by producer
> 15:31:10,983 - Consumer times out after 60s wait without messages
> {quote}
> Logs for this run are attached



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-2909) Example of Gap in Consumption after Restart

2015-12-02 Thread Ben Stopford (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15035688#comment-15035688
 ] 

Ben Stopford commented on KAFKA-2909:
-

Was bug in the way the test defines min.insync.replicas as found by [~rsivaram]

> Example of Gap in Consumption after Restart
> ---
>
> Key: KAFKA-2909
> URL: https://issues.apache.org/jira/browse/KAFKA-2909
> Project: Kafka
>  Issue Type: Sub-task
>  Components: consumer
>Reporter: Ben Stopford
>Assignee: Jason Gustafson
>
> This seems very similar to Rajini's reported KAFKA-2891
> *Context*
> The context is Security Rolling Upgrade with 30s consumer timeout. There was a 
> 2s sleep between restarts. Throughput was limited to 1000 messages per 
> second. 
> *Failure*
> At least one acked message did not appear in the consumed messages. 
> acked_minus_consumed: set(36802, 36804, 36805, 36807, 36808, 36810, 36811, 
> 64403, 64406, 64409, 36799)
> Missing data was correctly written to Kafka data files:
> {quote}
> value 36802 -> partition 1,offset: 12216
> kafka/bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files 
> worker7/kafka-data-logs/test_topic-1/.log | grep 'offset: 
> 12216'
> ->offset: 12216 position: 374994 isvalid: true payloadsize: 5 magic: 0 
> compresscodec: NoCompressionCodec crc: 3001177408 payload: 47482
> In this case offset 12216 does not contain the value we expected. The 
> implication is that the data was either not written or was lost during 
> rebalancing. 
> {quote}
> The first missing value was written at: 20:42:30,185, which is around the 
> time the third node goes down. 
> The failed writes correlate with the consumer logging NOT_COORDINATOR_FOR_GROUP 
> and "Marking the coordinator dead" messages. There are many of these messages 
> over a long period, though, so it's hard to infer this as the cause or to 
> correlate it specifically with the error. 
> *Timeline*
> {quote}
> grep -r 'shutdown complete' *
> 20:42:06,132 - Node 1 shutdown completed 
> 20:42:18,560 - Node 2 shutdown completed 
> 20:42:30,185 - *Writes that never make it are written by producer*
> 20:42:31,164 - Node 3 shutdown completed 
> 20:42:57,872 - Node 1 shutdown completed 
> …
> {quote}
> Logging was verbose during this run so log files can be found here: [click 
> me|https://www.dropbox.com/s/owwzh37cs304qh4/2015-11-26--001%20%283%29.tar.gz?dl=0]
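For reference, the acked_minus_consumed figure reported by these tests is just
a set difference between what the producer received acks for and what the
consumer printed (a sketch; the function name and data shape are illustrative):

{code}
def acked_minus_consumed(acked, consumed):
    # Both arguments are iterables of integer message payloads. Any value
    # the producer had acked but the consumer never saw is a test failure.
    return sorted(set(acked) - set(consumed))
{code}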



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Reopened] (KAFKA-2908) Another, possibly different, Gap in Consumption after Restart

2015-12-02 Thread Ben Stopford (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Stopford reopened KAFKA-2908:
-

> Another, possibly different, Gap in Consumption after Restart
> -
>
> Key: KAFKA-2908
> URL: https://issues.apache.org/jira/browse/KAFKA-2908
> Project: Kafka
>  Issue Type: Sub-task
>  Components: consumer
>Reporter: Ben Stopford
>Assignee: Jason Gustafson
> Attachments: 2015-11-28--001.tar.gz
>
>
> *Context:*
> Instance of the rolling upgrade test. 10s sleeps have been put around node 
> restarts to ensure stability when subsequent nodes go down. Consumer timeout 
> has been set to 60s. Producer has been throttled to 100 messages per second. 
> Failure is rare: It occurred once in 60 executions (6 executions per run #276 
> -> #285 of the system_test_branch_builder)
> *Reported Failure:*
> At least one acked message did not appear in the consumed messages. 
> acked_minus_consumed: 16385, 16388, 16391, 16394, 16397, 16400, 16403, 16406, 
> 16409, 16412, 16415, 16418, 16421, 16424, 16427, 16430, 16433, 16436, 16439, 
> 16442, ...plus 1669 more
> *Immediate Observations:*
> * The messages that were not consumed are all in partition 1.
> * Production and Consumption continue throughout the test (there is no 
> complete write or read failure as we have seen elsewhere)
> * The messages ARE present in the data files:
> e.g. 
> {quote}
> Examining missing value 16385. This was written to p1,offset:5453
> => there is an entry in all data files for partition 1 for this (presumably 
> meaning it was replicated correctly)
> kafka/bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files 
> worker10/kafka-data-logs/test_topic-1/.log | grep 
> 'offset: 5453'
> worker10,9,8:
> offset: 5453 position: 165346 isvalid: true payloadsize: 5 magic: 0 
> compresscodec: NoCompressionCodec crc: 1502953075 payload: 16385
> => ID 16385 is definitely present in the Kafka logs suggesting a problem 
> consumer-side
> {quote}
> These entries definitely do not appear in the consumer stdout file either. 
> *Timeline:*
> {quote}
> 15:27:02,232 - Producer sends first message
> ..servers 1, 2 are restarted (clean shutdown)
> 15:29:42,718 - Server 3 shutdown complete
> 15:29:42,712 - (Controller fails over): Broker 2 starting become controller 
> state transition (kafka.controller.KafkaController)
> 15:29:42,743 - New leader is 2 (LeaderChangeListener)
> 15:29:43,239 - WARN Broker 2 ignoring LeaderAndIsr request from controller 2 
> with correlation id 0 epoch 7 for partition (test_topic,1) since its 
> associated leader epoch 8 is old. Current leader epoch is 8 
> (state.change.logger)
> 15:29:45,642 - Producer starts writing messages that are never consumed
> 15:30:10,804 - Last message sent by producer
> 15:31:10,983 - Consumer times out after 60s wait without messages
> {quote}
> Logs for this run are attached



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (KAFKA-2909) Example of Gap in Consumption after Restart

2015-12-02 Thread Ben Stopford (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Stopford resolved KAFKA-2909.
-
Resolution: Fixed

> Example of Gap in Consumption after Restart
> ---
>
> Key: KAFKA-2909
> URL: https://issues.apache.org/jira/browse/KAFKA-2909
> Project: Kafka
>  Issue Type: Sub-task
>  Components: consumer
>Reporter: Ben Stopford
>Assignee: Jason Gustafson
>
> This seems very similar to Rajini's reported KAFKA-2891
> *Context*
> The context is Security Rolling Upgrade with 30s consumer timeout. There was a 
> 2s sleep between restarts. Throughput was limited to 1000 messages per 
> second. 
> *Failure*
> At least one acked message did not appear in the consumed messages. 
> acked_minus_consumed: set(36802, 36804, 36805, 36807, 36808, 36810, 36811, 
> 64403, 64406, 64409, 36799)
> Missing data was correctly written to Kafka data files:
> {quote}
> value 36802 -> partition 1,offset: 12216
> kafka/bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files 
> worker7/kafka-data-logs/test_topic-1/.log | grep 'offset: 
> 12216'
> ->offset: 12216 position: 374994 isvalid: true payloadsize: 5 magic: 0 
> compresscodec: NoCompressionCodec crc: 3001177408 payload: 47482
> In this case offset 12216 does not contain the value we expected. The 
> implication is that the data was either not written or was lost during 
> rebalancing. 
> {quote}
> The first missing value was written at: 20:42:30,185, which is around the 
> time the third node goes down. 
> The failed writes correlate with the consumer logging NOT_COORDINATOR_FOR_GROUP 
> and "Marking the coordinator dead" messages. There are many of these messages 
> over a long period, though, so it's hard to infer this as the cause or to 
> correlate it specifically with the error. 
> *Timeline*
> {quote}
> grep -r 'shutdown complete' *
> 20:42:06,132 - Node 1 shutdown completed 
> 20:42:18,560 - Node 2 shutdown completed 
> 20:42:30,185 - *Writes that never make it are written by producer*
> 20:42:31,164 - Node 3 shutdown completed 
> 20:42:57,872 - Node 1 shutdown completed 
> …
> {quote}
> Logging was verbose during this run so log files can be found here: [click 
> me|https://www.dropbox.com/s/owwzh37cs304qh4/2015-11-26--001%20%283%29.tar.gz?dl=0]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-2908) Another, possibly different, Gap in Consumption after Restart

2015-12-01 Thread Ben Stopford (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Stopford updated KAFKA-2908:

Description: 
*Context:*
Instance of the rolling upgrade test. 10s sleeps have been put around node 
restarts to ensure stability when subsequent nodes go down. Consumer timeout 
has been set to 60s. Producer has been throttled to 100 messages per second. 

Failure is rare: It occurred once in 60 executions (6 executions per run #276 
-> #285 of the system_test_branch_builder)

*Reported Failure:*

At least one acked message did not appear in the consumed messages. 
acked_minus_consumed: 16385, 16388, 16391, 16394, 16397, 16400, 16403, 16406, 
16409, 16412, 16415, 16418, 16421, 16424, 16427, 16430, 16433, 16436, 16439, 
16442, ...plus 1669 more

*Immediate Observations:*

* The messages that were not consumed are all in partition 1.
* Production and Consumption continue throughout the test (there is no 
complete write or read failure as we have seen elsewhere)
* The messages ARE present in the data files:
e.g. 
{quote}
Examining missing value 16385. This was written to p1,offset:5453
=> there is an entry in all data files for partition 1 for this (presumably 
meaning it was replicated correctly)

kafka/bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files 
worker10/kafka-data-logs/test_topic-1/.log | grep 'offset: 
5453'

worker10,9,8:
offset: 5453 position: 165346 isvalid: true payloadsize: 5 magic: 0 
compresscodec: NoCompressionCodec crc: 1502953075 payload: 16385
{quote}

These entries definitely do not appear in the consumer stdout file either. 

*Timeline:*
{quote}
15:27:02,232 - Producer sends first message
..servers 1, 2 are restarted (clean shutdown)
15:29:42,718 - Server 3 shutdown complete
15:29:42,712 - (Controller fails over): Broker 2 starting become controller 
state transition (kafka.controller.KafkaController)
15:29:42,743 - New leader is 2 (LeaderChangeListener)
15:29:43,239 - WARN Broker 2 ignoring LeaderAndIsr request from controller 2 
with correlation id 0 epoch 7 for partition (test_topic,1) since its associated 
leader epoch 8 is old. Current leader epoch is 8 (state.change.logger)
15:29:45,642 - Producer starts writing messages that are never consumed
15:30:10,804 - Last message sent by producer
15:31:10,983 - Consumer times out after 60s wait without messages
{quote}

Logs for this run are attached



  was:
*Context:*
Instance of the rolling upgrade test. 10s sleeps have been put around node 
restarts to ensure stability when subsequent nodes go down. Consumer timeout 
has been set to 60s. Producer has been throttled to 100 messages per second. 

Failure is rare: It occurred once in 60 executions (6 executions per run #276 
-> #285 of the system_test_branch_builder)

*Reported Failure:*

At least one acked message did not appear in the consumed messages. 
acked_minus_consumed: 16385, 16388, 16391, 16394, 16397, 16400, 16403, 16406, 
16409, 16412, 16415, 16418, 16421, 16424, 16427, 16430, 16433, 16436, 16439, 
16442, ...plus 1669 more

*Immediate Observations:*

* The messages that were not consumed are all in partition 1.
* Production and Consumption continue throughout the test (there is no 
complete write or read failure as we have seen elsewhere)
* The messages ARE present in the data files:
e.g. 
{quote}
Examining missing value 16385. This was written to p1,offset:5453
=> there is an entry in all data files for partition 1 for this (presumably 
meaning it was replicated correctly)

kafka/bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files 
worker10/kafka-data-logs/test_topic-1/.log | grep 'offset: 
5453'

worker10,9,8 respectively:
offset: 5453 position: 165346 isvalid: true payloadsize: 5 magic: 0 
compresscodec: NoCompressionCodec crc: 1502953075
offset: 5453 position: 165346 isvalid: true payloadsize: 5 magic: 0 
compresscodec: NoCompressionCodec crc: 1502953075
offset: 5453 position: 165346 isvalid: true payloadsize: 5 magic: 0 
compresscodec: NoCompressionCodec crc: 1502953075 
{quote}

These entries definitely do not appear in the consumer stdout file either. 

*Timeline:*
{quote}
15:27:02,232 - Producer sends first message
..servers 1, 2 are restarted (clean shutdown)
15:29:42,718 - Server 3 shutdown complete
15:29:42,712 - (Controller fails over): Broker 2 starting become controller 
state transition (kafka.controller.KafkaController)
15:29:42,743 - New leader is 2 (LeaderChangeListener)
15:29:43,239 - WARN Broker 2 ignoring LeaderAndIsr request from controller 2 
with correlation id 0 epoch 7 for partition (test_topic,1) since its associated 
leader epoch 8 is old. Current leader epoch is 8 (state.change.logger)
15:29:45,642 - Producer starts writing messages that are never consumed
15:30:10,804 - Last message sent by producer
15:31:10,983 - Consumer times out after 60s wait without messages
{quote}

Logs for this run are 

[jira] [Commented] (KAFKA-2909) Example of Gap in Consumption after Restart

2015-12-01 Thread Ben Stopford (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15034158#comment-15034158
 ] 

Ben Stopford commented on KAFKA-2909:
-

This problem is producer-side

> Example of Gap in Consumption after Restart
> ---
>
> Key: KAFKA-2909
> URL: https://issues.apache.org/jira/browse/KAFKA-2909
> Project: Kafka
>  Issue Type: Sub-task
>  Components: consumer
>Reporter: Ben Stopford
>Assignee: Jason Gustafson
>
> This seems very similar to Rajini's reported KAFKA-2891
> *Context*
> The context is Security Rolling Upgrade with 30s consumer timeout. There was a 
> 2s sleep between restarts. Throughput was limited to 1000 messages per 
> second. 
> *Failure*
> At least one acked message did not appear in the consumed messages. 
> acked_minus_consumed: set(36802, 36804, 36805, 36807, 36808, 36810, 36811, 
> 64403, 64406, 64409, 36799)
> Missing data was correctly written to Kafka data files:
> {quote}
> value 36802 -> partition 1,offset: 12216
> kafka/bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files 
> worker7/kafka-data-logs/test_topic-1/.log | grep 'offset: 
> 12216'
> ->offset: 12216 position: 374994 isvalid: true payloadsize: 5 magic: 0 
> compresscodec: NoCompressionCodec crc: 3001177408 payload: 47482
> So in this instance the data is not there! 
> {quote}
> The first missing value was written at: 20:42:30,185, which is around the 
> time the third node goes down. 
> The failed writes correlate with the consumer logging NOT_COORDINATOR_FOR_GROUP 
> and "Marking the coordinator dead" messages. There are many of these messages 
> over a long period, though, so it's hard to infer this as the cause or to 
> correlate it specifically with the error. 
> *Timeline*
> {quote}
> grep -r 'shutdown complete' *
> 20:42:06,132 - Node 1 shutdown completed 
> 20:42:18,560 - Node 2 shutdown completed 
> 20:42:30,185 - *Writes that never make it are written by producer*
> 20:42:31,164 - Node 3 shutdown completed 
> 20:42:57,872 - Node 1 shutdown completed 
> …
> {quote}
> Logging was verbose during this run so log files can be found here: [click 
> me|https://www.dropbox.com/s/owwzh37cs304qh4/2015-11-26--001%20%283%29.tar.gz?dl=0]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-2908) Another, possibly different, Gap in Consumption after Restart

2015-12-01 Thread Ben Stopford (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Stopford updated KAFKA-2908:

Description: 
*Context:*
Instance of the rolling upgrade test. 10s sleeps have been put around node 
restarts to ensure stability when subsequent nodes go down. Consumer timeout 
has been set to 60s. Producer has been throttled to 100 messages per second. 

Failure is rare: It occurred once in 60 executions (6 executions per run #276 
-> #285 of the system_test_branch_builder)

*Reported Failure:*

At least one acked message did not appear in the consumed messages. 
acked_minus_consumed: 16385, 16388, 16391, 16394, 16397, 16400, 16403, 16406, 
16409, 16412, 16415, 16418, 16421, 16424, 16427, 16430, 16433, 16436, 16439, 
16442, ...plus 1669 more

*Immediate Observations:*

* The messages that were not consumed are all in partition 1.
* Production and Consumption continue throughout the test (there is no 
complete write or read failure as we have seen elsewhere)
* The messages ARE present in the data files:
e.g. 
{quote}
Examining missing value 16385. This was written to p1,offset:5453
=> there is an entry in all data files for partition 1 for this (presumably 
meaning it was replicated correctly)

kafka/bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files 
worker10/kafka-data-logs/test_topic-1/.log | grep 'offset: 
5453'

worker10,9,8:
offset: 5453 position: 165346 isvalid: true payloadsize: 5 magic: 0 
compresscodec: NoCompressionCodec crc: 1502953075 payload: 16385

=> ID 16385 is definitely present in the Kafka logs suggesting a problem 
consumer-side
{quote}

These entries definitely do not appear in the consumer stdout file either. 

*Timeline:*
{quote}
15:27:02,232 - Producer sends first message
..servers 1, 2 are restarted (clean shutdown)
15:29:42,718 - Server 3 shutdown complete
15:29:42,712 - (Controller fails over): Broker 2 starting become controller 
state transition (kafka.controller.KafkaController)
15:29:42,743 - New leader is 2 (LeaderChangeListener)
15:29:43,239 - WARN Broker 2 ignoring LeaderAndIsr request from controller 2 
with correlation id 0 epoch 7 for partition (test_topic,1) since its associated 
leader epoch 8 is old. Current leader epoch is 8 (state.change.logger)
15:29:45,642 - Producer starts writing messages that are never consumed
15:30:10,804 - Last message sent by producer
15:31:10,983 - Consumer times out after 60s wait without messages
{quote}

Logs for this run are attached



  was:
*Context:*
Instance of the rolling upgrade test. 10s sleeps have been put around node 
restarts to ensure stability when subsequent nodes go down. Consumer timeout 
has been set to 60s. Producer has been throttled to 100 messages per second. 

Failure is rare: It occurred once in 60 executions (6 executions per run #276 
-> #285 of the system_test_branch_builder)

*Reported Failure:*

At least one acked message did not appear in the consumed messages. 
acked_minus_consumed: 16385, 16388, 16391, 16394, 16397, 16400, 16403, 16406, 
16409, 16412, 16415, 16418, 16421, 16424, 16427, 16430, 16433, 16436, 16439, 
16442, ...plus 1669 more

*Immediate Observations:*

* The messages that were not consumed are all in partition 1.
* Production and Consumption continue throughout the test (there is no 
complete write or read failure as we have seen elsewhere)
* The messages ARE present in the data files:
e.g. 
{quote}
Examining missing value 16385. This was written to p1,offset:5453
=> there is an entry in all data files for partition 1 for this (presumably 
meaning it was replicated correctly)

kafka/bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files 
worker10/kafka-data-logs/test_topic-1/.log | grep 'offset: 
5453'

worker10,9,8:
offset: 5453 position: 165346 isvalid: true payloadsize: 5 magic: 0 
compresscodec: NoCompressionCodec crc: 1502953075 payload: 16385
{quote}

These entries definitely do not appear in the consumer stdout file either. 

*Timeline:*
{quote}
15:27:02,232 - Producer sends first message
..servers 1, 2 are restarted (clean shutdown)
15:29:42,718 - Server 3 shutdown complete
15:29:42,712 - (Controller fails over): Broker 2 starting become controller 
state transition (kafka.controller.KafkaController)
15:29:42,743 - New leader is 2 (LeaderChangeListener)
15:29:43,239 - WARN Broker 2 ignoring LeaderAndIsr request from controller 2 
with correlation id 0 epoch 7 for partition (test_topic,1) since its associated 
leader epoch 8 is old. Current leader epoch is 8 (state.change.logger)
15:29:45,642 - Producer starts writing messages that are never consumed
15:30:10,804 - Last message sent by producer
15:31:10,983 - Consumer times out after 60s wait without messages
{quote}

Logs for this run are attached




> Another, possibly different, Gap in Consumption after Restart
> -
>
> 

[jira] [Commented] (KAFKA-2908) Another, possibly different, Gap in Consumption after Restart

2015-12-01 Thread Ben Stopford (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15034160#comment-15034160
 ] 

Ben Stopford commented on KAFKA-2908:
-

This problem is consumer-side

> Another, possibly different, Gap in Consumption after Restart
> -
>
> Key: KAFKA-2908
> URL: https://issues.apache.org/jira/browse/KAFKA-2908
> Project: Kafka
>  Issue Type: Sub-task
>  Components: consumer
>Reporter: Ben Stopford
> Attachments: 2015-11-28--001.tar.gz
>
>
> *Context:*
> Instance of the rolling upgrade test. 10s sleeps have been put around node 
> restarts to ensure stability when subsequent nodes go down. Consumer timeout 
> has been set to 60s. Producer has been throttled to 100 messages per second. 
> Failure is rare: It occurred once in 60 executions (6 executions per run #276 
> -> #285 of the system_test_branch_builder)
> *Reported Failure:*
> At least one acked message did not appear in the consumed messages. 
> acked_minus_consumed: 16385, 16388, 16391, 16394, 16397, 16400, 16403, 16406, 
> 16409, 16412, 16415, 16418, 16421, 16424, 16427, 16430, 16433, 16436, 16439, 
> 16442, ...plus 1669 more
> *Immediate Observations:*
> * The messages that were not consumed are all in partition 1.
> * Production and Consumption continue throughout the test (there is no 
> complete write or read failure as we have seen elsewhere)
> * The messages ARE present in the data files:
> e.g. 
> {quote}
> Examining missing value 16385. This was written to p1,offset:5453
> => there is an entry in all data files for partition 1 for this (presumably 
> meaning it was replicated correctly)
> kafka/bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files 
> worker10/kafka-data-logs/test_topic-1/.log | grep 
> 'offset: 5453'
> worker10,9,8:
> offset: 5453 position: 165346 isvalid: true payloadsize: 5 magic: 0 
> compresscodec: NoCompressionCodec crc: 1502953075 payload: 16385
> {quote}
> These entries definitely do not appear in the consumer stdout file either. 
> *Timeline:*
> {quote}
> 15:27:02,232 - Producer sends first message
> ..servers 1, 2 are restarted (clean shutdown)
> 15:29:42,718 - Server 3 shutdown complete
> 15:29:42,712 - (Controller fails over): Broker 2 starting become controller 
> state transition (kafka.controller.KafkaController)
> 15:29:42,743 - New leader is 2 (LeaderChangeListener)
> 15:29:43,239 - WARN Broker 2 ignoring LeaderAndIsr request from controller 2 
> with correlation id 0 epoch 7 for partition (test_topic,1) since its 
> associated leader epoch 8 is old. Current leader epoch is 8 
> (state.change.logger)
> 15:29:45,642 - Producer starts writing messages that are never consumed
> 15:30:10,804 - Last message sent by producer
> 15:31:10,983 - Consumer times out after 60s wait without messages
> {quote}
> Logs for this run are attached



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-2909) Example of Gap in Consumption after Restart

2015-12-01 Thread Ben Stopford (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Stopford updated KAFKA-2909:

Description: 
This seems very similar to Rajini's reported KAFKA-2891

*Context*
The context is Security Rolling Upgrade with 30s consumer timeout. There was a 
2s sleep between restarts. Throughput was limited to 1000 messages per second. 

*Failure*
At least one acked message did not appear in the consumed messages. 
acked_minus_consumed: set(36802, 36804, 36805, 36807, 36808, 36810, 36811, 
64403, 64406, 64409, 36799)


Missing data was correctly written to Kafka data files:
{quote}
value 36802 -> partition 1,offset: 12216

kafka/bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files 
worker7/kafka-data-logs/test_topic-1/.log | grep 'offset: 
12216'

->offset: 12216 position: 374994 isvalid: true payloadsize: 5 magic: 0 
compresscodec: NoCompressionCodec crc: 3001177408 payload: 47482

So in this instance the data is not there! 
{quote}

The first missing value was written at: 20:42:30,185, which is around the time 
the third node goes down. 

The failed writes correlate with the consumer logging NOT_COORDINATOR_FOR_GROUP 
and "Marking the coordinator dead" messages. There are many of these messages 
over a long period, though, so it's hard to infer this as the cause or to 
correlate it specifically with the error. 

*Timeline*
{quote}
grep -r 'shutdown complete' *
20:42:06,132 - Node 1 shutdown completed 
20:42:18,560 - Node 2 shutdown completed 
20:42:30,185 - *Writes that never make it are written by producer*
20:42:31,164 - Node 3 shutdown completed 
20:42:57,872 - Node 1 shutdown completed 
…
{quote}

Logging was verbose during this run so log files can be found here: [click 
me|https://www.dropbox.com/s/owwzh37cs304qh4/2015-11-26--001%20%283%29.tar.gz?dl=0]


  was:
This seems very similar to Rajini's reported KAFKA-2891

*Context*
The context is Security Rolling Upgrade with 30s consumer timeout. There was a 
2s sleep between restarts. Throughput was limited to 1000 messages per second. 

*Failure*
At least one acked message did not appear in the consumed messages. 
acked_minus_consumed: set(36802, 36804, 36805, 36807, 36808, 36810, 36811, 
64403, 64406, 64409, 36799)


Missing data was correctly written to Kafka data files:
{quote}
value 36802 -> partition 1,offset: 12216

kafka/bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files 
worker7/kafka-data-logs/test_topic-1/.log | grep 'offset: 
12216'

-> offset: 12216 position: 374994 isvalid: true payloadsize: 5 magic: 0 
compresscodec: NoCompressionCodec crc: 3001177408

in all three data files. So the data is there. 
{quote}

The first missing value was written at: 20:42:30,185, which is around the time 
the third node goes down. 

The failed writes correlate with the consumer logging NOT_COORDINATOR_FOR_GROUP 
and "Marking the coordinator dead" messages. There are many of these messages 
over a long period, though, so it's hard to infer this as the cause or to 
correlate it specifically with the error. 

*Timeline*
{quote}
grep -r 'shutdown complete' *
20:42:06,132 - Node 1 shutdown completed 
20:42:18,560 - Node 2 shutdown completed 
20:42:30,185 - *Writes that never make it are written by producer*
20:42:31,164 - Node 3 shutdown completed 
20:42:57,872 - Node 1 shutdown completed 
…
{quote}

Logging was verbose during this run so log files can be found here: [click 
me|https://www.dropbox.com/s/owwzh37cs304qh4/2015-11-26--001%20%283%29.tar.gz?dl=0]



> Example of Gap in Consumption after Restart
> ---
>
> Key: KAFKA-2909
> URL: https://issues.apache.org/jira/browse/KAFKA-2909
> Project: Kafka
>  Issue Type: Sub-task
>  Components: consumer
>Reporter: Ben Stopford
>Assignee: Jason Gustafson
>
> This seems very similar to Rajini's reported KAFKA-2891
> *Context*
> The context is Security Rolling Upgrade with 30s consumer timeout. There was a 
> 2s sleep between restarts. Throughput was limited to 1000 messages per 
> second. 
> *Failure*
> At least one acked message did not appear in the consumed messages. 
> acked_minus_consumed: set(36802, 36804, 36805, 36807, 36808, 36810, 36811, 
> 64403, 64406, 64409, 36799)
> Missing data was correctly written to Kafka data files:
> {quote}
> value 36802 -> partition 1,offset: 12216
> kafka/bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files 
> worker7/kafka-data-logs/test_topic-1/.log | grep 'offset: 
> 12216'
> ->offset: 12216 position: 374994 isvalid: true payloadsize: 5 magic: 0 
> compresscodec: NoCompressionCodec crc: 3001177408 payload: 47482
> So in this instance the data is not there! 
> {quote}
> The first missing value was written at: 20:42:30,185, which is around the 
> time the third node goes down. 
> The failed writes 

[jira] [Updated] (KAFKA-2909) Example of Gap in Consumption after Restart

2015-12-01 Thread Ben Stopford (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Stopford updated KAFKA-2909:

Description: 
This seems very similar to Rajini's reported KAFKA-2891

*Context*
The context is Security Rolling Upgrade with 30s consumer timeout. There was a 
2s sleep between restarts. Throughput was limited to 1000 messages per second. 

*Failure*
At least one acked message did not appear in the consumed messages. 
acked_minus_consumed: set(36802, 36804, 36805, 36807, 36808, 36810, 36811, 
64403, 64406, 64409, 36799)


Missing data was correctly written to Kafka data files:
{quote}
value 36802 -> partition 1,offset: 12216

kafka/bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files 
worker7/kafka-data-logs/test_topic-1/.log | grep 'offset: 
12216'

->offset: 12216 position: 374994 isvalid: true payloadsize: 5 magic: 0 
compresscodec: NoCompressionCodec crc: 3001177408 payload: 47482

In this case offset 12216 does not contain the value we expected. The 
implication is that the data was either not written or was lost during 
rebalancing. 
{quote}

The first missing value was written at: 20:42:30,185, which is around the time 
the third node goes down. 

The failed writes correlate with the consumer logging NOT_COORDINATOR_FOR_GROUP 
and "Marking the coordinator dead" messages. There are many of these messages 
over a long period, though, so it's hard to infer this as the cause or to 
correlate it specifically with the error. 

*Timeline*
{quote}
grep -r 'shutdown complete' *
20:42:06,132 - Node 1 shutdown completed 
20:42:18,560 - Node 2 shutdown completed 
20:42:30,185 - *Writes that never make it are written by producer*
20:42:31,164 - Node 3 shutdown completed 
20:42:57,872 - Node 1 shutdown completed 
…
{quote}

Logging was verbose during this run so log files can be found here: [click 
me|https://www.dropbox.com/s/owwzh37cs304qh4/2015-11-26--001%20%283%29.tar.gz?dl=0]


  was:
This seems very similar to Rajini's reported KAFKA-2891

*Context*
The context is Security Rolling Upgrade with 30s consumer timeout. There was a 
2s sleep between restarts. Throughput was limited to 1000 messages per second. 

*Failure*
At least one acked message did not appear in the consumed messages. 
acked_minus_consumed: set(36802, 36804, 36805, 36807, 36808, 36810, 36811, 
64403, 64406, 64409, 36799)


Missing data was correctly written to Kafka data files:
{quote}
value 36802 -> partition 1,offset: 12216

kafka/bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files 
worker7/kafka-data-logs/test_topic-1/.log | grep 'offset: 
12216'

->offset: 12216 position: 374994 isvalid: true payloadsize: 5 magic: 0 
compresscodec: NoCompressionCodec crc: 3001177408 payload: 47482

So in this instance the data is not there! 
{quote}

The first missing value was written at: 20:42:30,185, which is around the time 
the third node goes down. 

The failed writes correlate with the consumer logging NOT_COORDINATOR_FOR_GROUP 
and "Marking the coordinator dead" messages. There are many of these messages 
over a long period, though, so it's hard to infer this as the cause or to 
correlate it specifically with the error. 

*Timeline*
{quote}
grep -r 'shutdown complete' *
20:42:06,132 - Node 1 shutdown completed 
20:42:18,560 - Node 2 shutdown completed 
20:42:30,185 - *Writes that never make it are written by producer*
20:42:31,164 - Node 3 shutdown completed 
20:42:57,872 - Node 1 shutdown completed 
…
{quote}

Logging was verbose during this run so log files can be found here: [click 
me|https://www.dropbox.com/s/owwzh37cs304qh4/2015-11-26--001%20%283%29.tar.gz?dl=0]



> Example of Gap in Consumption after Restart
> ---
>
> Key: KAFKA-2909
> URL: https://issues.apache.org/jira/browse/KAFKA-2909
> Project: Kafka
>  Issue Type: Sub-task
>  Components: consumer
>Reporter: Ben Stopford
>Assignee: Jason Gustafson
>
> This seems very similar to Rajini's reported KAFKA-2891
> *Context*
> The context is Security Rolling Upgrade with 30s consumer timeout. There was a 
> 2s sleep between restarts. Throughput was limited to 1000 messages per 
> second. 
> *Failure*
> At least one acked message did not appear in the consumed messages. 
> acked_minus_consumed: set(36802, 36804, 36805, 36807, 36808, 36810, 36811, 
> 64403, 64406, 64409, 36799)
> Missing data was correctly written to Kafka data files:
> {quote}
> value 36802 -> partition 1,offset: 12216
> kafka/bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files 
> worker7/kafka-data-logs/test_topic-1/.log | grep 'offset: 
> 12216'
> ->offset: 12216 position: 374994 isvalid: true payloadsize: 5 magic: 0 
> compresscodec: NoCompressionCodec crc: 3001177408 payload: 47482
> In this case offset 12216 does not contain the value we expected. 

[jira] [Commented] (KAFKA-2891) Gaps in messages delivered by new consumer after Kafka restart

2015-12-01 Thread Ben Stopford (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15034222#comment-15034222
 ] 

Ben Stopford commented on KAFKA-2891:
-

[~rsivaram] I found an error in my analysis of KAFKA-2909, meaning that jira 
refers to actual data loss. KAFKA-2908 remains a client-side issue. This puts 
more evidence behind your theory that nodes are being killed before data is 
replicated. I'll be interested to see if this change is stable on EC2.

> Gaps in messages delivered by new consumer after Kafka restart
> --
>
> Key: KAFKA-2891
> URL: https://issues.apache.org/jira/browse/KAFKA-2891
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer
>Affects Versions: 0.9.0.0
>Reporter: Rajini Sivaram
>Priority: Critical
>
> Replication tests when run with the new consumer with SSL/SASL were failing 
> very often because messages were not being consumed from some topics after a 
> Kafka restart. The fix in KAFKA-2877 has made this a lot better. But I am 
> still seeing some failures (less often now) because a small set of messages 
> are not received after Kafka restart. This failure looks slightly different 
> from the one before the fix for KAFKA-2877 was applied, hence the new defect. 
> The test fails because not all acked messages are received by the consumer, 
> and the number of messages missing is quite small.
> [~benstopford] Are the upgrade tests working reliably with KAFKA-2877 now?
> Not sure if any of these log entries are important:
> {quote}
> [2015-11-25 14:41:12,342] INFO SyncGroup for group test-consumer-group failed 
> due to NOT_COORDINATOR_FOR_GROUP, will find new coordinator and rejoin 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-25 14:41:12,342] INFO Marking the coordinator 2147483644 dead. 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-25 14:41:12,958] INFO Attempt to join group test-consumer-group 
> failed due to unknown member id, resetting and retrying. 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-25 14:41:42,437] INFO Fetch offset null is out of range, resetting 
> offset (org.apache.kafka.clients.consumer.internals.Fetcher)
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (KAFKA-2915) Tests that use bootstrap.servers embedded in jinja files are not working

2015-12-01 Thread Ben Stopford (JIRA)
Ben Stopford created KAFKA-2915:
---

 Summary: Tests that use bootstrap.servers embedded in jinja files 
are not working
 Key: KAFKA-2915
 URL: https://issues.apache.org/jira/browse/KAFKA-2915
 Project: Kafka
  Issue Type: Bug
Reporter: Ben Stopford


Regression due to changes in the way the tests handle security. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (KAFKA-2915) System Tests that use bootstrap.servers embedded in jinja files are not working

2015-12-01 Thread Ben Stopford (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Stopford updated KAFKA-2915:

Summary: System Tests that use bootstrap.servers embedded in jinja files 
are not working  (was: Tests that use bootstrap.servers embedded in jinja files 
are not working)

> System Tests that use bootstrap.servers embedded in jinja files are not 
> working
> ---
>
> Key: KAFKA-2915
> URL: https://issues.apache.org/jira/browse/KAFKA-2915
> Project: Kafka
>  Issue Type: Bug
>Reporter: Ben Stopford
>
> Regression due to changes in the way the tests handle security. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-2891) Gaps in messages delivered by new consumer after Kafka restart

2015-12-01 Thread Ben Stopford (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15033560#comment-15033560
 ] 

Ben Stopford commented on KAFKA-2891:
-

[~rsivaram] That sounds reasonable to me. I'm also surprised it works reliably 
with hard bounce currently.

Note also that there are a couple of examples (in subtasks) of intermittent 
failures which look consumer related (as data makes it to kafka). Jason kindly 
took a look at this yesterday with one related fix 
[KAFKA-2913|https://issues.apache.org/jira/browse/KAFKA-2913]. 




> Gaps in messages delivered by new consumer after Kafka restart
> --
>
> Key: KAFKA-2891
> URL: https://issues.apache.org/jira/browse/KAFKA-2891
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer
>Affects Versions: 0.9.0.0
>Reporter: Rajini Sivaram
>Priority: Critical
>
> Replication tests when run with the new consumer with SSL/SASL were failing 
> very often because messages were not being consumed from some topics after a 
> Kafka restart. The fix in KAFKA-2877 has made this a lot better. But I am 
> still seeing some failures (less often now) because a small set of messages 
> are not received after Kafka restart. This failure looks slightly different 
> from the one before the fix for KAFKA-2877 was applied, hence the new defect. 
> The test fails because not all acked messages are received by the consumer, 
> and the number of missing messages is quite small.
> [~benstopford] Are the upgrade tests working reliably with KAFKA-2877 now?
> Not sure if any of these log entries are important:
> {quote}
> [2015-11-25 14:41:12,342] INFO SyncGroup for group test-consumer-group failed 
> due to NOT_COORDINATOR_FOR_GROUP, will find new coordinator and rejoin 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-25 14:41:12,342] INFO Marking the coordinator 2147483644 dead. 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-25 14:41:12,958] INFO Attempt to join group test-consumer-group 
> failed due to unknown member id, resetting and retrying. 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-25 14:41:42,437] INFO Fetch offset null is out of range, resetting 
> offset (org.apache.kafka.clients.consumer.internals.Fetcher)
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (KAFKA-2891) Gaps in messages delivered by new consumer after Kafka restart

2015-12-01 Thread Ben Stopford (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033560#comment-15033560
 ] 

Ben Stopford edited comment on KAFKA-2891 at 12/1/15 12:18 PM:
---

[~rsivaram] That sounds reasonable to me. I'm also surprised it currently works 
reliably with hard bounce like that. Although doesn't the test set 
min.insync.replicas to 2 in its constructor?

Note also that there are a couple of examples (in subtasks) of intermittent 
failures which look consumer-related (as the data makes it to Kafka). Jason kindly 
took a look at this yesterday with one related fix 
[KAFKA-2913|https://issues.apache.org/jira/browse/KAFKA-2913]. 





was (Author: benstopford):
[~rsivaram] That sounds reasonable to me. I'm also surprised it works reliably 
with hard bounce currently.

Note also that there are a couple of examples (in subtasks) of intermittent 
failures which look consumer-related (as the data makes it to Kafka). Jason kindly 
took a look at this yesterday with one related fix 
[KAFKA-2913|https://issues.apache.org/jira/browse/KAFKA-2913]. 




> Gaps in messages delivered by new consumer after Kafka restart
> --
>
> Key: KAFKA-2891
> URL: https://issues.apache.org/jira/browse/KAFKA-2891
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer
>Affects Versions: 0.9.0.0
>Reporter: Rajini Sivaram
>Priority: Critical
>
> Replication tests when run with the new consumer with SSL/SASL were failing 
> very often because messages were not being consumed from some topics after a 
> Kafka restart. The fix in KAFKA-2877 has made this a lot better. But I am 
> still seeing some failures (less often now) because a small set of messages 
> are not received after Kafka restart. This failure looks slightly different 
> from the one before the fix for KAFKA-2877 was applied, hence the new defect. 
> The test fails because not all acked messages are received by the consumer, 
> and the number of missing messages is quite small.
> [~benstopford] Are the upgrade tests working reliably with KAFKA-2877 now?
> Not sure if any of these log entries are important:
> {quote}
> [2015-11-25 14:41:12,342] INFO SyncGroup for group test-consumer-group failed 
> due to NOT_COORDINATOR_FOR_GROUP, will find new coordinator and rejoin 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-25 14:41:12,342] INFO Marking the coordinator 2147483644 dead. 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-25 14:41:12,958] INFO Attempt to join group test-consumer-group 
> failed due to unknown member id, resetting and retrying. 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-25 14:41:42,437] INFO Fetch offset null is out of range, resetting 
> offset (org.apache.kafka.clients.consumer.internals.Fetcher)
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-2891) Gaps in messages delivered by new consumer after Kafka restart

2015-12-01 Thread Ben Stopford (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033724#comment-15033724
 ] 

Ben Stopford commented on KAFKA-2891:
-

[~rsivaram] so, in my investigations, even with min.insync.replicas = 2 and 
clean_shutdown, additional pauses are needed between bounces to get long-term 
stability on EC2. My theory is that this is a consumer-side problem, because I 
don't see evidence of data loss in Kafka. Maybe by waiting for the ISR to hit 2 
you are getting similar behaviour. Your test is a little more extreme, though, 
due to the hard_bounce.
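
A minimal sketch, using the kafka-python client purely for illustration (the 
system tests themselves drive the Java clients), of the producer-side half of 
the guarantee being discussed:

{code}
# Minimal sketch, assuming the kafka-python package; broker, topic, and value
# are invented. With the topic configured with min.insync.replicas=2 and the
# producer using acks='all', an acked write has reached at least two replicas,
# so cleanly bouncing one broker should not lose it.
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="worker1:9092",
    acks="all",   # leader waits for the full in-sync replica set
    retries=5,    # ride out transient errors while leadership moves
)
future = producer.send("test_topic", b"36802")
metadata = future.get(timeout=30)  # raises if the ack never arrives
producer.flush()
{code}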

> Gaps in messages delivered by new consumer after Kafka restart
> --
>
> Key: KAFKA-2891
> URL: https://issues.apache.org/jira/browse/KAFKA-2891
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer
>Affects Versions: 0.9.0.0
>Reporter: Rajini Sivaram
>Priority: Critical
>
> Replication tests when run with the new consumer with SSL/SASL were failing 
> very often because messages were not being consumed from some topics after a 
> Kafka restart. The fix in KAFKA-2877 has made this a lot better. But I am 
> still seeing some failures (less often now) because a small set of messages 
> are not received after Kafka restart. This failure looks slightly different 
> from the one before the fix for KAFKA-2877 was applied, hence the new defect. 
> The test fails because not all acked messages are received by the consumer, 
> and the number of missing messages is quite small.
> [~benstopford] Are the upgrade tests working reliably with KAFKA-2877 now?
> Not sure if any of these log entries are important:
> {quote}
> [2015-11-25 14:41:12,342] INFO SyncGroup for group test-consumer-group failed 
> due to NOT_COORDINATOR_FOR_GROUP, will find new coordinator and rejoin 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-25 14:41:12,342] INFO Marking the coordinator 2147483644 dead. 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-25 14:41:12,958] INFO Attempt to join group test-consumer-group 
> failed due to unknown member id, resetting and retrying. 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-25 14:41:42,437] INFO Fetch offset null is out of range, resetting 
> offset (org.apache.kafka.clients.consumer.internals.Fetcher)
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (KAFKA-2904) Consumer Fails to Reconnect after 30s post restarts

2015-11-30 Thread Ben Stopford (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032091#comment-15032091
 ] 

Ben Stopford commented on KAFKA-2904:
-

This hasn't recurred since the timeout was increased to 60s. 
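
For reference, a minimal sketch, using the kafka-python client purely for 
illustration (the test itself drives the console consumer), of the timeout 
semantics involved: if no record arrives within consumer_timeout_ms, 
consumption stops, which is how the 30s limit expired mid-rebalance.

{code}
# Minimal sketch, assuming the kafka-python package; broker, topic, and group
# names are invented. Iteration simply ends once consumer_timeout_ms elapses
# with no records, so 60000 rides out a coordinator failover that 30000 did not.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "test_topic",
    bootstrap_servers="worker1:9092",
    group_id="unique-test-group",
    consumer_timeout_ms=60000,  # previously the equivalent of 30000
)
for record in consumer:  # loop exits after 60s of silence
    print(record.offset, record.value)
{code}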

> Consumer Fails to Reconnect after 30s post restarts
> ---
>
> Key: KAFKA-2904
> URL: https://issues.apache.org/jira/browse/KAFKA-2904
> Project: Kafka
>  Issue Type: Bug
>Reporter: Ben Stopford
>Assignee: Ben Stopford
> Attachments: 2015-11-27--001 (1).tar.gz
>
>
> This problem occurs in around 1 in 20 executions of the security rolling 
> upgrade test. 
> The test scenario is a rolling upgrade where each of the three servers is 
> restarted in turn whilst producer and consumers run. A ten-second sleep 
> between the start and stop of each node has been added to ensure there is 
> time for failover to occur (re KAFKA-2827). 
> Failure results in no consumed messages after the failure point. 
> Periodically the consumer does not reconnect within its 30s timeout. The 
> consumer’s log at this point is at the bottom of this JIRA.
> ISRs appear normal at the time of the failure.
> The producer is able to produce throughout this period. 
> *TIMELINE:*
> {quote}
> 20:39:23 - Test starts Consumer and Producer
> 20:39:27 - Consumer starts consuming produced messages
> 20:39:30 - Node 1 shutdown complete
> 20:39:45 - Node 1 restarts
> 20:39:59 - Node 2 shutdown complete
> 20:40:14 - Node 2 restarts 
> 20:40:27 - Consumer stops consuming
> 20:40:28 - Node 2 becomes controller
> 20:40:28 - Node 3 shutdown complete
> 20:40:34 - GroupCoordinator 2: Preparing to restabilize group 
> unique-test-group...
> 20:40:42 - Node 3 restarts
> *20:41:03 - Consumer times out*
> 20:41:03 - GroupCoordinator 2: Stabilized group unique-test-group...
> 20:41:03 - GroupCoordinator 2: Assignment received from leader for group 
> unique-test-group...
> 20:41:03 - GroupCoordinator 2: Preparing to restabilize group 
> unique-test-group...
> 20:41:03 - GroupCoordinator 2: Group unique-test-group... is dead and removed 
> 20:41:53 - Producer shuts down
> {quote}
> Consumer log at time of failure:
> {quote}
> [2015-11-27 20:40:27,268] INFO Current consumption count is 10100 
> (kafka.tools.ConsoleConsumer$)
> [2015-11-27 20:40:27,321] ERROR Error ILLEGAL_GENERATION occurred while 
> committing offsets for group unique-test-group-0.952644842527 
> (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
> [2015-11-27 20:40:27,321] WARN Auto offset commit failed: Commit cannot be 
> completed due to group rebalance 
> (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
> [2015-11-27 20:40:27,322] ERROR Error ILLEGAL_GENERATION occurred while 
> committing offsets for group unique-test-group-0.952644842527 
> (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
> [2015-11-27 20:40:27,322] WARN Auto offset commit failed:  
> (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
> [2015-11-27 20:40:27,329] INFO Attempt to join group 
> unique-test-group-0.952644842527 failed due to unknown member id, resetting 
> and retrying. 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-27 20:40:27,347] INFO SyncGroup for group 
> unique-test-group-0.952644842527 failed due to UNKNOWN_MEMBER_ID, rejoining 
> the group (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-27 20:40:27,357] INFO SyncGroup for group 
> unique-test-group-0.952644842527 failed due to NOT_COORDINATOR_FOR_GROUP, 
> will find new coordinator and rejoin 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-27 20:40:27,357] INFO Marking the coordinator 2147483644 dead. 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-27 20:40:28,097] INFO Attempt to join group 
> unique-test-group-0.952644842527 failed due to unknown member id, resetting 
> and retrying. 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-27 20:40:33,627] INFO Marking the coordinator 2147483646 dead. 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-27 20:40:33,627] INFO Attempt to join group 
> unique-test-group-0.952644842527 failed due to obsolete coordinator 
> information, retrying. 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-27 20:41:03,704] ERROR Error processing message, terminating 
> consumer process:  (kafka.tools.ConsoleConsumer$)
> kafka.consumer.ConsumerTimeoutException
>   at kafka.consumer.NewShinyConsumer.receive(BaseConsumer.scala:59)
>   at kafka.tools.ConsoleConsumer$.process(ConsoleConsumer.scala:112)
>   at kafka.tools.ConsoleConsumer$.run(ConsoleConsumer.scala:69)
>   at 

[jira] [Comment Edited] (KAFKA-2904) Consumer Fails to Reconnect after 30s post restarts

2015-11-30 Thread Ben Stopford (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-2904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032091#comment-15032091
 ] 

Ben Stopford edited comment on KAFKA-2904 at 11/30/15 5:11 PM:
---

This hasn't recurred since the timeout was increased to 60s (based on ~10 x 6 
executions). 


was (Author: benstopford):
This hasn't recurred since the timeout was increased to 60s. 

> Consumer Fails to Reconnect after 30s post restarts
> ---
>
> Key: KAFKA-2904
> URL: https://issues.apache.org/jira/browse/KAFKA-2904
> Project: Kafka
>  Issue Type: Bug
>Reporter: Ben Stopford
>Assignee: Ben Stopford
> Attachments: 2015-11-27--001 (1).tar.gz
>
>
> This problem occurs in around 1 in 20 executions of the security rolling 
> upgrade test. 
> The test scenario is a rolling upgrade where each of the three servers is 
> restarted in turn whilst producer and consumers run. A ten-second sleep 
> between the start and stop of each node has been added to ensure there is 
> time for failover to occur (re KAFKA-2827). 
> Failure results in no consumed messages after the failure point. 
> Periodically the consumer does not reconnect within its 30s timeout. The 
> consumer’s log at this point is at the bottom of this JIRA.
> ISRs appear normal at the time of the failure.
> The producer is able to produce throughout this period. 
> *TIMELINE:*
> {quote}
> 20:39:23 - Test starts Consumer and Producer
> 20:39:27 - Consumer starts consuming produced messages
> 20:39:30 - Node 1 shutdown complete
> 20:39:45 - Node 1 restarts
> 20:39:59 - Node 2 shutdown complete
> 20:40:14 - Node 2 restarts 
> 20:40:27 - Consumer stops consuming
> 20:40:28 - Node 2 becomes controller
> 20:40:28 - Node 3 shutdown complete
> 20:40:34 - GroupCoordinator 2: Preparing to restabilize group 
> unique-test-group...
> 20:40:42 - Node 3 restarts
> *20:41:03 - Consumer times out*
> 20:41:03 - GroupCoordinator 2: Stabilized group unique-test-group...
> 20:41:03 - GroupCoordinator 2: Assignment received from leader for group 
> unique-test-group...
> 20:41:03 - GroupCoordinator 2: Preparing to restabilize group 
> unique-test-group...
> 20:41:03 - GroupCoordinator 2: Group unique-test-group... is dead and removed 
> 20:41:53 - Producer shuts down
> {quote}
> Consumer log at time of failure:
> {quote}
> [2015-11-27 20:40:27,268] INFO Current consumption count is 10100 
> (kafka.tools.ConsoleConsumer$)
> [2015-11-27 20:40:27,321] ERROR Error ILLEGAL_GENERATION occurred while 
> committing offsets for group unique-test-group-0.952644842527 
> (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
> [2015-11-27 20:40:27,321] WARN Auto offset commit failed: Commit cannot be 
> completed due to group rebalance 
> (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
> [2015-11-27 20:40:27,322] ERROR Error ILLEGAL_GENERATION occurred while 
> committing offsets for group unique-test-group-0.952644842527 
> (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
> [2015-11-27 20:40:27,322] WARN Auto offset commit failed:  
> (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
> [2015-11-27 20:40:27,329] INFO Attempt to join group 
> unique-test-group-0.952644842527 failed due to unknown member id, resetting 
> and retrying. 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-27 20:40:27,347] INFO SyncGroup for group 
> unique-test-group-0.952644842527 failed due to UNKNOWN_MEMBER_ID, rejoining 
> the group (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-27 20:40:27,357] INFO SyncGroup for group 
> unique-test-group-0.952644842527 failed due to NOT_COORDINATOR_FOR_GROUP, 
> will find new coordinator and rejoin 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-27 20:40:27,357] INFO Marking the coordinator 2147483644 dead. 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-27 20:40:28,097] INFO Attempt to join group 
> unique-test-group-0.952644842527 failed due to unknown member id, resetting 
> and retrying. 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-27 20:40:33,627] INFO Marking the coordinator 2147483646 dead. 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-27 20:40:33,627] INFO Attempt to join group 
> unique-test-group-0.952644842527 failed due to obsolete coordinator 
> information, retrying. 
> (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
> [2015-11-27 20:41:03,704] ERROR Error processing message, terminating 
> consumer process:  (kafka.tools.ConsoleConsumer$)
> kafka.consumer.ConsumerTimeoutException
>   at kafka.consumer.NewShinyConsumer.receive(BaseConsumer.scala:59)
>   at 

[jira] [Updated] (KAFKA-2909) Example of Gap in Consumption after Restart

2015-11-30 Thread Ben Stopford (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Stopford updated KAFKA-2909:

Summary: Example of Gap in Consumption after Restart  (was: Another 
Instance of Gap in Consumption after Restart)

> Example of Gap in Consumption after Restart
> ---
>
> Key: KAFKA-2909
> URL: https://issues.apache.org/jira/browse/KAFKA-2909
> Project: Kafka
>  Issue Type: Sub-task
>  Components: consumer
>Reporter: Ben Stopford
>
> This seems very similar to Rajini's reported KAFKA-2891.
> *Context*
> The context is the Security Rolling Upgrade test with a 30s consumer timeout. There was a 
> 2s sleep between restarts. Throughput was limited to 1000 messages per 
> second. 
> *Failure*
> At least one acked message did not appear in the consumed messages. 
> acked_minus_consumed: set(36802, 36804, 36805, 36807, 36808, 36810, 36811, 
> 64403, 64406, 64409, 36799)
> Missing data was correctly written to Kafka data files:
> {quote}
> value 36802 -> partition 1, offset: 12216
> kafka/bin/kafka-run-class.sh kafka.tools.DumpLogSegments --files 
> worker7/kafka-data-logs/test_topic-1/.log | grep 'offset: 
> 12216'
> -> offset: 12216 position: 374994 isvalid: true payloadsize: 5 magic: 0 
> compresscodec: NoCompressionCodec crc: 3001177408
> in all three data files. So the data is there. 
> {quote}
> The first missing value was written at 20:42:30,185, which is around the 
> time the third node goes down. 
> The failed writes correlate with the consumer logging 
> NOT_COORDINATOR_FOR_GROUP and "Marking the coordinator ... dead". There are 
> many of these messages over a long period, though, so it’s hard to infer 
> that they are the cause or that they specifically correlate with the error. 
> *Timeline*
> {quote}
> grep -r 'shutdown complete' *
> 20:42:06,132 - Node 1 shutdown completed 
> 20:42:18,560 - Node 2 shutdown completed 
> 20:42:30,185 - *Writes that never make it are written by producer*
> 20:42:31,164 - Node 3 shutdown completed 
> 20:42:57,872 - Node 1 shutdown completed 
> …
> {quote}
> Logging was verbose during this run so log files can be found here: [click 
> me|https://www.dropbox.com/s/owwzh37cs304qh4/2015-11-26--001%20%283%29.tar.gz?dl=0]
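
A small sketch (illustrative Python, not the actual kafkatest validation code) 
of the check that produced the acked_minus_consumed set above:

{code}
# Illustrative only: the real test tracks every acked and consumed value; the
# example values below are taken from the report above.
def acked_minus_consumed(acked, consumed):
    """Return values the producer saw acked that the consumer never delivered."""
    return set(acked) - set(consumed)

missing = acked_minus_consumed(acked=[36799, 36802, 36804], consumed=[36804])
assert missing == {36799, 36802}
{code}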



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

