[jira] [Commented] (KAFKA-7824) Require member.id for initial join group request

2019-03-23 Thread Stanislav Kozlovski (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16799914#comment-16799914
 ] 

Stanislav Kozlovski commented on KAFKA-7824:


This was merged, right? Could we update the status on the KIP?

> Require member.id for initial join group request
> 
>
> Key: KAFKA-7824
> URL: https://issues.apache.org/jira/browse/KAFKA-7824
> Project: Kafka
>  Issue Type: Sub-task
>  Components: consumer
>Reporter: Boyang Chen
>Assignee: Boyang Chen
>Priority: Major
> Fix For: 2.2.0
>
>
> For a request with an unknown member id, the broker will blindly accept the 
> new join group request, store the member metadata and return a UUID to the 
> consumer. The edge case is that if the initial join group request keeps 
> failing due to connection timeouts, or the consumer keeps restarting, or the 
> max.poll.interval.ms configured on the client is set to infinite (so no 
> rebalance timeout kicks in to clean up the member metadata map), 
> MemberMetadata entries will accumulate in the group metadata cache and 
> eventually exhaust broker memory. Detecting and fencing invalid join group 
> requests is crucial for broker stability.
>  
> The proposed solution is to require one additional round trip, in which the 
> consumer rejoins the group with a valid member.id. Details in this 
> [KIP|https://cwiki.apache.org/confluence/display/KAFKA/KIP-394%3A+Require+member.id+for+initial+join+group+request]
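For readers of the archive, a minimal sketch of the two-round join the KIP describes, using purely hypothetical class and method names (this is not Kafka's actual GroupCoordinator code): an unknown-member join is answered with a freshly generated member.id but nothing is stored, and only the retry carrying that id registers member metadata.

{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical illustration of the KIP-394 idea; names are invented and do not
// mirror Kafka's real broker-side code.
public class JoinGroupSketch {

    private final Map<String, String> memberMetadata = new HashMap<>();

    /** Returns the member.id the client must use on its next join request. */
    public String handleJoin(String memberId) {
        if (memberId == null || memberId.isEmpty()) {
            // Round 1: hand back an id (together with a MEMBER_ID_REQUIRED-style
            // error in the real protocol) but register nothing, so failing or
            // restarting clients cannot accumulate metadata on the broker.
            return UUID.randomUUID().toString();
        }
        // Round 2: the client echoed a broker-issued id, so register it now.
        memberMetadata.put(memberId, "member metadata placeholder");
        return memberId;
    }
}
{code}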



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-8152) Offline partition state not propagated by controller

2019-03-23 Thread Jason Gustafson (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson updated KAFKA-8152:
---
Description: 
Currently when the controller starts up, only the state of online partitions 
will be sent to other brokers. Any broker which is started or restarted after 
the controller will see only a subset of the partitions of any topic which has 
offline partitions. If all the partitions for a topic are offline, then the 
broker will not know of the topic at all. As far as I can tell, the bug is the 
fact that `ReplicaStateMachine.startup` only does an initial state change for 
replicas which are online.

This can be reproduced with the following steps:
 # Start up two brokers
 # Create a single-partition topic with rf=1
 # Shut down the broker where the replica landed
 # Shut down the other broker
 # Restart the broker without the replica
 # Run `kafka-topics --describe --bootstrap-server \{server ip}`

Note that the metadata inconsistency will only be apparent when using 
`bootstrap-server` in `kafka-topics.sh`. Using ZooKeeper, everything will seem 
normal.

  was:Currently when the controller starts up, only the state of online 
partitions will be sent to other brokers. Any broker which is started or 
restarted after the controller will see only a subset of the partitions of any 
topic which has offline partitions. If all the partitions for a topic are 
offline, then the broker will not know of the topic at all. As far as I can 
tell, the bug is the fact that `ReplicaStateMachine.startup` only does an 
initial state change for replicas which are online.


> Offline partition state not propagated by controller
> 
>
> Key: KAFKA-8152
> URL: https://issues.apache.org/jira/browse/KAFKA-8152
> Project: Kafka
>  Issue Type: Bug
>Reporter: Jason Gustafson
>Assignee: Jose Armando Garcia Sancio
>Priority: Major
>
> Currently when the controller starts up, only the state of online partitions 
> will be sent to other brokers. Any broker which is started or restarted after 
> the controller will see only a subset of the partitions of any topic which 
> has offline partitions. If all the partitions for a topic are offline, then 
> the broker will not know of the topic at all. As far as I can tell, the bug 
> is the fact that `ReplicaStateMachine.startup` only does an initial state 
> change for replicas which are online.
> This can be reproduced with the following steps:
>  # Start up two brokers
>  # Create a single-partition topic with rf=1
>  # Shut down the broker where the replica landed
>  # Shut down the other broker
>  # Restart the broker without the replica
>  # Run `kafka-topics --describe --bootstrap-server \{server ip}`
> Note that the metadata inconsistency will only be apparent when using 
> `bootstrap-server` in `kafka-topics.sh`. Using ZooKeeper, everything will 
> seem normal.
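
To make the inconsistency concrete, here is a hedged client-side check (topic name and bootstrap address are placeholders) that queries broker metadata through the AdminClient, the same path the `--bootstrap-server` mode of `kafka-topics.sh` relies on:

{code:java}
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class DescribeViaBrokerMetadata {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder address: point this at the broker restarted in step 5.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // The answer comes from broker metadata, so partitions the controller
            // never propagated are missing from the description, and the call fails
            // with an unknown-topic error if every partition of the topic is offline.
            Map<String, TopicDescription> topics =
                admin.describeTopics(Collections.singleton("test-topic")).all().get();
            topics.forEach((name, description) ->
                System.out.println(name + " -> " + description.partitions()));
        }
    }
}
{code}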



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (KAFKA-8152) Offline partition state not propagated by controller

2019-03-23 Thread Jason Gustafson (JIRA)
Jason Gustafson created KAFKA-8152:
--

 Summary: Offline partition state not propagated by controller
 Key: KAFKA-8152
 URL: https://issues.apache.org/jira/browse/KAFKA-8152
 Project: Kafka
  Issue Type: Bug
Reporter: Jason Gustafson
Assignee: Jose Armando Garcia Sancio


Currently when the controller starts up, only the state of online partitions 
will be sent to other brokers. Any broker which is started or restarted after 
the controller will see only a subset of the partitions of any topic which has 
offline partitions. If all the partitions for a topic are offline, then the 
broker will not know of the topic at all. As far as I can tell, the bug is the 
fact that `ReplicaStateMachine.startup` only does an initial state change for 
replicas which are online.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (KAFKA-8147) Add changelog topic configuration to KTable suppress

2019-03-23 Thread Bill Bejeck (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16799819#comment-16799819
 ] 

Bill Bejeck edited comment on KAFKA-8147 at 3/23/19 10:25 PM:
--

[~mjduijn]  

I've added you to the contributors list and assigned this ticket to you.  Going 
forward you'll be able to assign yourself to tickets.

 EDIT:

Do you have an account on 
[https://cwiki.apache.org/confluence|https://cwiki.apache.org/confluence]? I 
looked for the username you use for Jira but couldn't find you. I'll need your 
username for Apache Confluence to give you permissions for creating/editing a 
KIP.

Thanks,

Bill


was (Author: bbejeck):
[~mjduijn]  

I've added you to the contributors list and assigned this ticket to you.  Going 
forward you'll be able to assign yourself to tickets.

 

Thanks,

Bill

> Add changelog topic configuration to KTable suppress
> 
>
> Key: KAFKA-8147
> URL: https://issues.apache.org/jira/browse/KAFKA-8147
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Affects Versions: 2.1.1
>Reporter: Maarten
>Assignee: Maarten
>Priority: Minor
>  Labels: needs-kip
>
> The streams DSL does not provide a way to configure the changelog topic 
> created by KTable.suppress.
> From the perspective of an external user this could be implemented similar to 
> the configuration of aggregate + materialized, i.e., 
> {code:java}
> changelogTopicConfigs = // Configs
> materialized = Materialized.as(..).withLoggingEnabled(changelogTopicConfigs)
> ..
> KGroupedStream.aggregate(..,materialized)
> {code}
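
For reference, a hedged, self-contained sketch of the existing pattern the reporter alludes to (topic names, window size, and configs are illustrative): changelog configs can be attached to an aggregation's store via Materialized, whereas the buffer created by suppress() currently offers no equivalent hook.

{code:java}
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Suppressed;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.state.WindowStore;

public class SuppressChangelogExample {
    public static void main(String[] args) {
        // Example topic-level configs for the aggregation's changelog topic.
        Map<String, String> changelogConfigs = new HashMap<>();
        changelogConfigs.put("retention.ms", "86400000");
        changelogConfigs.put("cleanup.policy", "compact,delete");

        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("input")
            .groupByKey()
            .windowedBy(TimeWindows.of(Duration.ofMinutes(5)))
            // Changelog topic of the count store is configurable through Materialized.
            .count(Materialized.<String, Long, WindowStore<Bytes, byte[]>>as("counts")
                .withLoggingEnabled(changelogConfigs))
            // suppress() also creates a changelog-backed buffer, but (as of 2.1.x)
            // exposes no comparable way to configure that topic -- the gap this
            // ticket asks to close.
            .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))
            .toStream((windowedKey, count) -> windowedKey.key())
            .to("output"); // serde configuration elided for brevity
    }
}
{code}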



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-8147) Add changelog topic configuration to KTable suppress

2019-03-23 Thread Bill Bejeck (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16799819#comment-16799819
 ] 

Bill Bejeck commented on KAFKA-8147:


[~mjduijn]  

I've added you to the contributors list and assigned this ticket to you.  Going 
forward you'll be able to assign yourself to tickets.

 

Thanks,

Bill

> Add changelog topic configuration to KTable suppress
> 
>
> Key: KAFKA-8147
> URL: https://issues.apache.org/jira/browse/KAFKA-8147
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Affects Versions: 2.1.1
>Reporter: Maarten
>Assignee: Maarten
>Priority: Minor
>  Labels: needs-kip
>
> The streams DSL does not provide a way to configure the changelog topic 
> created by KTable.suppress.
> From the perspective of an external user this could be implemented similar to 
> the configuration of aggregate + materialized, i.e., 
> {code:java}
> changelogTopicConfigs = // Configs
> materialized = Materialized.as(..).withLoggingEnabled(changelogTopicConfigs)
> ..
> KGroupedStream.aggregate(..,materialized)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (KAFKA-8147) Add changelog topic configuration to KTable suppress

2019-03-23 Thread Bill Bejeck (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Bejeck reassigned KAFKA-8147:
--

Assignee: Maarten

> Add changelog topic configuration to KTable suppress
> 
>
> Key: KAFKA-8147
> URL: https://issues.apache.org/jira/browse/KAFKA-8147
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Affects Versions: 2.1.1
>Reporter: Maarten
>Assignee: Maarten
>Priority: Minor
>  Labels: needs-kip
>
> The streams DSL does not provide a way to configure the changelog topic 
> created by KTable.suppress.
> From the perspective of an external user this could be implemented similar to 
> the configuration of aggregate + materialized, i.e., 
> {code:java}
> changelogTopicConfigs = // Configs
> materialized = Materialized.as(..).withLoggingEnabled(changelogTopicConfigs)
> ..
> KGroupedStream.aggregate(..,materialized)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-3539) KafkaProducer.send() may block even though it returns the Future

2019-03-23 Thread Steven Zhen Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-3539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16799776#comment-16799776
 ] 

Steven Zhen Wu commented on KAFKA-3539:
---

Just to echo the last two comments:

There are many applications that just want best-effort delivery to Kafka and 
can't tolerate blocking behavior at all. Right now many people have to reinvent 
the wheel to work around this problem; e.g., we implemented exactly the same 
thing that [~tu...@avast.com] mentioned.

Improved documentation can help, but an intuitive API is definitely much better.

> KafkaProducer.send() may block even though it returns the Future
> 
>
> Key: KAFKA-3539
> URL: https://issues.apache.org/jira/browse/KAFKA-3539
> Project: Kafka
>  Issue Type: Bug
>  Components: producer 
>Reporter: Oleg Zhurakousky
>Priority: Critical
>
> You can get more details from the us...@kafka.apache.org list by searching for 
> the thread with the subject "KafkaProducer block on send".
> The bottom line is that a method that returns a Future must never block, since 
> blocking essentially violates the Future contract: it was specifically designed 
> to return immediately, passing control back to the user to check for 
> completion, cancel, etc.
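
Several commenters mention working around this; a hedged sketch of one common workaround (not the fix this ticket asks for, and with illustrative names) is to hand the send() call to a dedicated executor so the caller is never blocked by metadata fetches or a full buffer:

{code:java}
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class NonBlockingSend {

    private final KafkaProducer<String, String> producer;
    private final ExecutorService sendPool = Executors.newSingleThreadExecutor();

    public NonBlockingSend(Properties producerProps) {
        this.producer = new KafkaProducer<>(producerProps);
    }

    /**
     * Returns immediately; the potentially blocking producer.send() call runs
     * on sendPool instead of the caller's thread.
     */
    public Future<Future<RecordMetadata>> sendAsync(String topic, String key, String value) {
        return sendPool.submit(() -> producer.send(new ProducerRecord<>(topic, key, value)));
    }
}
{code}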



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7986) distinguish the logging from different ZooKeeperClient instances

2019-03-23 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16799761#comment-16799761
 ] 

ASF GitHub Bot commented on KAFKA-7986:
---

ivanyu commented on pull request #6493: KAFKA-7986: Distinguish logging from 
different ZooKeeperClient instances
URL: https://github.com/apache/kafka/pull/6493
 
 
   A broker can have more than one instance of ZooKeeperClient. For example, 
SimpleAclAuthorizer creates a separate ZooKeeperClient instance when configured.
   
   This commit makes it possible to optionally specify a name for a 
ZooKeeperClient instance. The name is specified only for a broker's 
ZooKeeperClient instances, not for those created by commands and tests.
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> distinguish the logging from different ZooKeeperClient instances
> 
>
> Key: KAFKA-7986
> URL: https://issues.apache.org/jira/browse/KAFKA-7986
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Jun Rao
>Assignee: Ivan Yurchenko
>Priority: Major
>  Labels: newbie
>
> It's possible for each broker to have more than 1 ZooKeeperClient instance. 
> For example, SimpleAclAuthorizer creates a separate ZooKeeperClient instance 
> when configured. It would be useful to distinguish the logging from different 
> ZooKeeperClient instances.
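
As a side note, a hedged Java illustration of the general idea (the actual patch is in Kafka's Scala ZooKeeperClient and uses different names): give each client instance an optional name and fold it into the log prefix so concurrent instances can be told apart in the logs.

{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative only; class and method names are invented and do not mirror
// kafka.zookeeper.ZooKeeperClient.
public class NamedZkClient {

    private static final Logger log = LoggerFactory.getLogger(NamedZkClient.class);

    private final String logPrefix;

    public NamedZkClient(String name) {
        // e.g. "[ZooKeeperClient ACL authorizer] " vs "[ZooKeeperClient Kafka server] "
        this.logPrefix = "[ZooKeeperClient " + name + "] ";
    }

    public void waitUntilConnected() {
        log.info("{}Waiting until connected.", logPrefix);
        // ... connection handling elided ...
    }
}
{code}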



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (KAFKA-7986) distinguish the logging from different ZooKeeperClient instances

2019-03-23 Thread Ivan Yurchenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Yurchenko reassigned KAFKA-7986:
-

Assignee: Ivan Yurchenko

> distinguish the logging from different ZooKeeperClient instances
> 
>
> Key: KAFKA-7986
> URL: https://issues.apache.org/jira/browse/KAFKA-7986
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Jun Rao
>Assignee: Ivan Yurchenko
>Priority: Major
>  Labels: newbie
>
> It's possible for each broker to have more than 1 ZooKeeperClient instance. 
> For example, SimpleAclAuthorizer creates a separate ZooKeeperClient instance 
> when configured. It would be useful to distinguish the logging from different 
> ZooKeeperClient instances.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-8106) Remove unnecessary decompression operation when logValidator do validation.

2019-03-23 Thread qiaochao (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16799597#comment-16799597
 ] 

qiaochao commented on KAFKA-8106:
-

In the inPlaceAssignment case, I think the server should expose a flag so that 
users can choose to skip server-side decompression and reach a higher 
performance ceiling. In addition, even the current change, which decompresses 
only a small part, greatly improves the upper performance limit, so it still 
makes sense.  [~guozhang]

> Remove unnecessary decompression operation when logValidator  do validation.
> 
>
> Key: KAFKA-8106
> URL: https://issues.apache.org/jira/browse/KAFKA-8106
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.1.1
> Environment: Server : 
> cpu:2*16 ; 
> MemTotal : 256G;
> Ethernet controller:Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network 
> Connection ; 
> SSD.
>Reporter: Flower.min
>Assignee: Flower.min
>Priority: Major
>  Labels: performance
>
>       We did performance testing of Kafka in the specific scenario described 
> below. We built a Kafka cluster with one broker and created topics with 
> different numbers of partitions, then started many producer processes to send 
> large amounts of messages to one of the topics in each test.
> *_Specific Scenario_*
>   
>  *_1. Main config of the Kafka server_*  
>  # num.network.threads=6; num.io.threads=128; queued.max.requests=500
>  # Number of TopicPartitions: 50~2000
>  # Size of a single message: 1024B
>  
>  *_2. Config of KafkaProducer_* 
> ||compression.type||linger.ms||batch.size||buffer.memory||
> |lz4|1000ms~5000ms|16KB/10KB/100KB|128MB|
> *_3. Best result of the performance testing_*  
> ||Network inflow rate||CPU used (%)||Disk write speed||Production throughput||
> |550MB/s~610MB/s|97%~99%|550MB/s~610MB/s|23,000,000 messages/s|
> *_4. Phenomenon and our doubt_*
>     _The upper limit of CPU usage is reached, but the bandwidth of the 
> server's network is not saturated. *We are unsure what costs so much CPU time, 
> and we want to improve performance and reduce the CPU usage of the Kafka 
> server.*_
>   
>  _*5. Analysis*_
>         We analyzed a JFR recording of the Kafka server taken during the 
> performance test. The hot-spot methods are 
> *_"java.io.DataInputStream.readFully(byte[],int,int)"_* and 
> *_"org.apache.kafka.common.record.KafkaLZ4BlockInputStream.read(byte[],int,int)"_*. 
> Checking thread stack information, we also found that most of the CPU is 
> occupied by many threads busy decompressing messages. We then read the Kafka 
> source code.
>        There is a double-layer nested iterator when LogValidator validates 
> every record, and each message is decompressed while traversing every 
> RecordBatch iterator. This decompression consumes CPU and hurts overall 
> performance. _*The only purpose of decompressing every message is to obtain 
> the total size in bytes of a record and the size in bytes of the record body, 
> even when the magic value in use is above 1 and no format conversion or value 
> overwriting is required for compressed messages. This is negative for 
> performance in common usage scenarios.*_ Therefore, we suggest *_removing the 
> unnecessary decompression operation_* when validating compressed messages 
> whose magic value is above 1 and which require no format conversion or value 
> overwriting.
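
A rough sketch of the distinction being made, using the public record classes and assuming the 2.1-era clients API: batch-level sizes are available from the batch header without decompression, while iterating the records inside a compressed batch forces decompression.

{code:java}
import java.nio.ByteBuffer;

import org.apache.kafka.common.record.MemoryRecords;
import org.apache.kafka.common.record.Record;
import org.apache.kafka.common.record.RecordBatch;

public class BatchSizeVsRecordIteration {
    public static void main(String[] args) {
        // Placeholder: on the broker this buffer would hold the produced (compressed) data.
        MemoryRecords records = MemoryRecords.readableRecords(ByteBuffer.allocate(0));

        // Batch-level accounting: sizes come from the batch header, no decompression.
        int batchBytes = 0;
        for (RecordBatch batch : records.batches()) {
            batchBytes += batch.sizeInBytes();
        }

        // Record-level accounting: iterating a compressed batch decompresses it --
        // the cost the reporter wants to avoid when the magic value is above 1 and
        // no conversion or value overwriting is needed.
        int recordBytes = 0;
        for (RecordBatch batch : records.batches()) {
            for (Record record : batch) {
                recordBytes += record.sizeInBytes();
            }
        }
        System.out.println(batchBytes + " vs " + recordBytes);
    }
}
{code}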



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (KAFKA-8151) Broker hangs and lockups after Zookeeper outages

2019-03-23 Thread Joe Ammann (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joe Ammann updated KAFKA-8151:
--
Attachment: symptom3_lxgurten_kafka_dump3.txt
symptom3_lxgurten_kafka_dump2.txt
symptom3_lxgurten_kafka_dump1.txt

> Broker hangs and lockups after Zookeeper outages
> 
>
> Key: KAFKA-8151
> URL: https://issues.apache.org/jira/browse/KAFKA-8151
> Project: Kafka
>  Issue Type: Bug
>  Components: controller, core, zkclient
>Affects Versions: 2.1.1
>Reporter: Joe Ammann
>Priority: Major
> Attachments: symptom3_lxgurten_kafka_dump1.txt, 
> symptom3_lxgurten_kafka_dump2.txt, symptom3_lxgurten_kafka_dump3.txt
>
>
> We're running several clusters (mostly with 3 brokers) with 2.1.1, where we 
> see at least 3 different symptoms, all resulting in broker/controller lockups.
> We are pretty sure that the triggering cause for all these symptoms is a 
> temporary outage (normally 3-5 minutes) of the Zookeeper cluster. The Linux 
> VMs where the ZK nodes run regularly get stalled for a couple of minutes. The 
> ZK nodes always very quickly reunite and build a quorum after the situation 
> clears, but the Kafka brokers (which run on the same Linux VMs) quite often 
> show problems after this procedure.
> I've seen 3 different kinds of problems (this is why I put "reproduce" in 
> quotes, I can never predict what will happen):
>  # the brokers get their ZK sessions expired (obviously) and sometimes only 2 
> of 3 re-register under /brokers/ids. The 3rd broker doesn't re-register for 
> some reason (that's the problem I originally described)
>  # the brokers all re-register and re-elect a new controller, but that new 
> controller does not fully work. For example it doesn't process partition 
> reassignment requests and/or does not transfer partition leadership after I 
> kill a broker
>  # the previous controller gets "dead-locked" (it has 3-4 of the important 
> controller threads in a lock) and hence does not perform any of its 
> controller duties, but it still regards itself as the valid controller and 
> is accepted by the other brokers
> I'll try to describe each one of the problems in more detail below, and hope 
> to be able to clearly separate them.
> I'm able to provoke these problems in our DEV environment quite regularly 
> using the following procedure:
> * make sure all ZK nodes and Kafka brokers are stable and reacting normally
> * freeze 2 out of 3 ZK nodes with {{kill -STOP}} for some minutes
> * leave the Kafka brokers running; of course they will start complaining that 
> they are unable to reach ZK
> * thaw the processes with {{kill -CONT}}
> * now all Kafka brokers get notified that their ZK session has expired, and 
> they start to reorganize the cluster
> In about 20% of the tests I'm able to produce one of the symptoms above; I 
> cannot predict which one though. I'm varying this procedure sometimes by 
> also freezing one Kafka broker (most often the controller), but so far I 
> haven't been able to find a clear pattern or really force one specific 
> symptom
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-8151) Broker hangs and lockups after Zookeeper outages

2019-03-23 Thread Joe Ammann (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16799591#comment-16799591
 ] 

Joe Ammann commented on KAFKA-8151:
---

For symptom 3 (all brokers seem OK after freezing and thawing ZK only) I see 
that the broker that was the controller before the start of the problems has 
some locked threads. I assume this is why it doesn't fulfill its controller 
tasks. It's still registered in ZK, as are the 2 other nodes, so no new 
controller election is triggered.

The first messages in Kafka while ZK is down look like

{code}
[2019-03-21 03:31:42,170] WARN Client session timed out, have not heard from 
server in 5336ms for sessionid 0x30016a539c63065 
(org.apache.zookeeper.ClientCnxn) 
{code}

When I'm especially nasty and simulate the ZK "coming and going" by freezing 
and thawing it again and again for several seconds (that's about what happens 
in reality when our Linux VMs stall), sometimes there is just one of these 
messages, sometimes a few. And sometimes we see this:
{code:java}
[2019-03-21 03:37:44,265] WARN Client session timed out, have not heard from 
server in 5335ms for sessionid 0x30016a539c63065 
(org.apache.zookeeper.ClientCnxn)
[2019-03-21 03:37:44,369] INFO [ZooKeeperClient] Waiting until connected. 
(kafka.zookeeper.ZooKeeperClient)
[2019-03-21 03:37:44,370] INFO [ZooKeeperClient] Connected. 
(kafka.zookeeper.ZooKeeperClient)
[2019-03-21 03:37:57,674] WARN Client session timed out, have not heard from 
server in 5333ms for sessionid 0x30016a539c63065 
(org.apache.zookeeper.ClientCnxn)
[2019-03-21 03:38:00,758] WARN Client session timed out, have not heard from 
server in 2669ms for sessionid 0x30016a539c63065 
(org.apache.zookeeper.ClientCnxn)
[2019-03-21 03:37:57,802] INFO [ZooKeeperClient] Waiting until connected. 
(kafka.zookeeper.ZooKeeperClient)
[2019-03-21 03:38:02,853] INFO [ZooKeeperClient] Connected. 
(kafka.zookeeper.ZooKeeperClient)
[2019-03-21 03:38:06,958] WARN Client session timed out, have not heard from 
server in 5335ms for sessionid 0x30016a539c63065 
(org.apache.zookeeper.ClientCnxn)
[2019-03-21 03:38:07,059] INFO [ZooKeeperClient] Waiting until connected. 
(kafka.zookeeper.ZooKeeperClient)
[2019-03-21 03:38:07,059] INFO [ZooKeeperClient] Connected. 
(kafka.zookeeper.ZooKeeperClient)
[2019-03-21 03:38:28,402] WARN Client session timed out, have not heard from 
server in 5336ms for sessionid 0x30016a539c63065 
(org.apache.zookeeper.ClientCnxn)
[2019-03-21 03:38:28,504] INFO [ZooKeeperClient] Waiting until connected. 
(kafka.zookeeper.ZooKeeperClient)
[2019-03-21 03:38:28,504] INFO [ZooKeeperClient] Connected. 
(kafka.zookeeper.ZooKeeperClient)
[2019-03-21 03:38:38,862] WARN Client session timed out, have not heard from 
server in 5335ms for sessionid 0x30016a539c63065 
(org.apache.zookeeper.ClientCnxn)
[2019-03-21 03:38:42,330] WARN Client session timed out, have not heard from 
server in 2667ms for sessionid 0x30016a539c63065 
(org.apache.zookeeper.ClientCnxn)
[2019-03-21 03:38:38,963] INFO [ZooKeeperClient] Waiting until connected. 
(kafka.zookeeper.ZooKeeperClient)
[2019-03-21 03:38:44,923] INFO [ZooKeeperClient] Connected. 
(kafka.zookeeper.ZooKeeperClient)
[2019-03-21 03:38:50,256] WARN Client session timed out, have not heard from 
server in 5334ms for sessionid 0x30016a539c63065 
(org.apache.zookeeper.ClientCnxn)
[2019-03-21 03:38:50,357] INFO [ZooKeeperClient] Waiting until connected. 
(kafka.zookeeper.ZooKeeperClient)
[2019-03-21 03:38:50,358] INFO [ZooKeeperClient] Connected. 
(kafka.zookeeper.ZooKeeperClient)
[2019-03-21 03:38:55,791] WARN Client session timed out, have not heard from 
server in 5335ms for sessionid 0x30016a539c63065 
(org.apache.zookeeper.ClientCnxn)
[2019-03-21 03:38:55,892] INFO [ZooKeeperClient] Waiting until connected. 
(kafka.zookeeper.ZooKeeperClient)
[2019-03-21 03:38:55,893] INFO [ZooKeeperClient] Connected. 
(kafka.zookeeper.ZooKeeperClient)
[2019-03-21 03:39:03,471] WARN Client session timed out, have not heard from 
server in 5336ms for sessionid 0x30016a539c63065 
(org.apache.zookeeper.ClientCnxn)
[2019-03-21 03:39:08,655] INFO [ZooKeeperClient] Waiting until connected. 
(kafka.zookeeper.ZooKeeperClient)
[2019-03-21 03:39:08,655] INFO [ZooKeeperClient] Connected. 
(kafka.zookeeper.ZooKeeperClient)
[2019-03-21 03:39:29,140] WARN Client session timed out, have not heard from 
server in 5335ms for sessionid 0x30016a539c63065 
(org.apache.zookeeper.ClientCnxn)
[2019-03-21 03:39:29,242] INFO [ZooKeeperClient] Waiting until connected. 
(kafka.zookeeper.ZooKeeperClient)
[2019-03-21 03:39:29,242] INFO [ZooKeeperClient] Connected. 
(kafka.zookeeper.ZooKeeperClient)
[2019-03-21 03:39:30,824] WARN Unable to reconnect to ZooKeeper service, 
session 0x30016a539c63065 has expired (org.apache.zookeeper.ClientCnxn)
[2019-03-21 03:39:30,827] INFO [ZooKeeperClient] Session expired. 
(kafka.zookeeper.ZooKeeperClient)
[2019-03-21 03:39:34,266] INFO 

[jira] [Commented] (KAFKA-8151) Broker hangs and lockups after Zookeeper outages

2019-03-23 Thread Joe Ammann (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-8151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16799586#comment-16799586
 ] 

Joe Ammann commented on KAFKA-8151:
---

For symptom 1 (failure of a broker to re-register with ZK) I have to freeze ZK 
plus 1 broker. This then normally shows itself as follows:
 * cluster is running OK
 * ZK and one broker get frozen
 * partitions go underreplicated on the remaining 2 brokers, as expected
 * the failed broker and ZK come back, and the broker reports that its ZK 
session was expired

{code:java}
[2019-03-18 02:27:13,043] INFO [ZooKeeperClient] Session expired. 
(kafka.zookeeper.ZooKeeperClient){code}
 * the broker that failed does *not* re-appear under /brokers/ids in Zookeeper, 
but I also can't find specific error messages about failed attempts to 
re-register. It almost looks as if it just doesn't try to re-register
 * some of the brokers that were OK report leader election problems

{code:java}
[2019-03-18 02:27:20,283] ERROR [Controller id=3 epoch=94562] Controller 3 
epoch 94562 failed to change state for partition __consumer_offsets-4 from 
OnlinePartition to OnlinePartition (state.change.logger) 
kafka.common.StateChangeFailedException: Failed to elect leader for partition 
__consumer_offsets-4 under strategy 
PreferredReplicaPartitionLeaderElectionStrategy
at 
kafka.controller.PartitionStateMachine$$anonfun$doElectLeaderForPartitions$3.apply(PartitionStateMachine.scala:366)
at 
kafka.controller.PartitionStateMachine$$anonfun$doElectLeaderForPartitions$3.apply(PartitionStateMachine.scala:364)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at 
kafka.controller.PartitionStateMachine.doElectLeaderForPartitions(PartitionStateMachine.scala:364)
at 
kafka.controller.PartitionStateMachine.electLeaderForPartitions(PartitionStateMachine.scala:292)
at 
kafka.controller.PartitionStateMachine.doHandleStateChanges(PartitionStateMachine.scala:210)
at 
kafka.controller.PartitionStateMachine.handleStateChanges(PartitionStateMachine.scala:133)
at 
kafka.controller.KafkaController.kafka$controller$KafkaController$$onPreferredReplicaElection(KafkaController.scala:624)
at 
kafka.controller.KafkaController$$anonfun$kafka$controller$KafkaController$$checkAndTriggerAutoLeaderRebalance$3.apply(KafkaController.scala:974)
at 
kafka.controller.KafkaController$$anonfun$kafka$controller$KafkaController$$checkAndTriggerAutoLeaderRebalance$3.apply(KafkaController.scala:955)
at scala.collection.immutable.Map$Map4.foreach(Map.scala:188)
at 
kafka.controller.KafkaController.kafka$controller$KafkaController$$checkAndTriggerAutoLeaderRebalance(KafkaController.scala:955)
at 
kafka.controller.KafkaController$AutoPreferredReplicaLeaderElection$.process(KafkaController.scala:986)
at 
kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply$mcV$sp(ControllerEventManager.scala:89)
at 
kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply(ControllerEventManager.scala:89)
at 
kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply(ControllerEventManager.scala:89)
at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:31)
at 
kafka.controller.ControllerEventManager$ControllerEventThread.doWork(ControllerEventManager.scala:88)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82){code}
 * the failed/revived broker continuously logs errors about the expired session

{code:java}
[2019-03-18 02:28:34,493] ERROR Uncaught exception in scheduled task 
'isr-expiration' (kafka.utils.KafkaScheduler)
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = 
Session expired for /brokers/topics/__consumer_offsets/partitions/9/state
at org.apache.zookeeper.KeeperException.create(KeeperException.java:130)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
at 
kafka.zookeeper.AsyncResponse.resultException(ZooKeeperClient.scala:539)
at kafka.zk.KafkaZkClient.conditionalUpdatePath(KafkaZkClient.scala:717)
at 
kafka.utils.ReplicationUtils$.updateLeaderAndIsr(ReplicationUtils.scala:33)
at 
kafka.cluster.Partition.kafka$cluster$Partition$$updateIsr(Partition.scala:969)
at kafka.cluster.Partition$$anonfun$2.apply$mcZ$sp(Partition.scala:642)
at kafka.cluster.Partition$$anonfun$2.apply(Partition.scala:633)
at kafka.cluster.Partition$$anonfun$2.apply(Partition.scala:633)
at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:251)
at kafka.utils.CoreUtils$.inWriteLock(CoreUtils.scala:259)
at kafka.cluster.Partition.maybeShrinkIsr(Partition.scala:632)
at 

[jira] [Updated] (KAFKA-8151) Broker hangs and lockups after Zookeeper outages

2019-03-23 Thread Joe Ammann (JIRA)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joe Ammann updated KAFKA-8151:
--
Description: 
We're running several clusters (mostly with 3 brokers) with 2.1.1, where we see 
at least 3 different symptoms, all resulting on broker/controller lockups.

We are pretty sure that the triggering cause for all these symptoms are 
temporary (for 3-5 minutes normally) of the Zookeeper cluster. The Linux VMs 
where the ZK nodes run on regularly get stalled for a couple of minutes. The ZK 
nodes always very quickly reunite and build a Quorum after the situation 
clears, but the Kafka brokers (which run on then same Linux VMs) quite often 
show problems after this procedure.

I've seen 3 different kinds of problems (this is why I put "reproduce" in 
quotes, I can never predict what will happen)
 # the brokers get their ZK sessions expired (obviously) and sometimes only 2 
of 3 re-register under /brokers/ids. The 3rd broker doesn't re-register for 
some reason (that's the problem I originally described)
 # the brokers all re-register and re-elect a new controller. But that new 
controller does not fully work. For example it doesn't process partition 
reassignment requests and or does not transfer partition leadership after I 
kill a broker
 # the previous controller gets "dead-locked" (it has 3-4 of the important 
controller threads in a lock) and hence does not perform any of it's controller 
duties. But it regards itsself still as the valid controller and is accepted by 
the other brokers

I'll try to describe each one of the problems in more detail below, and hope to 
be able to cleary separate them.

I'm able to provoke these problems in our DEV environment quite regularly using 
the following procedure
* make sure all ZK nodes and Kafka brokers are stable and reacting normally
* freeze 2 out of 3 ZK nodes with {{kill -STOP}} for some minutes
* let the Kafka broker running, of course they will start complaining to be 
unable to reach ZK
* thaw the processes with {{kill -CONT}}
* now all Kafka brokers get notified that their ZK session has expired, and 
they start to reorganize the cluster

In about 20% of the tests, I'm able to produce one of the symptoms above. I can 
not predict which one though. I'm varying this procedure sometimes by also 
freezing one Kafka broker (most often the controller), but until now I haven't 
been able to create a clear pattern or really force one specific symptom
 

  was:
We're running several clusters (mostly with 3 brokers) with 2.1.1, where we see 
at least 3 different symptoms, all resulting on broker/controller lockups.

We are pretty sure that the triggering cause for all these symptoms are 
temporary (for 3-5 minutes normally) of the Zookeeper cluster. The Linux VMs 
where the ZK nodes run on regularly get stalled for a couple of minutes. The ZK 
nodes always very quickly reunite and build a Quorum after the situation 
clears, but the Kafka brokers (which run on then same Linux VMs) quite often 
show problems after this procedure.

I've seen 3 different kinds of problems (this is why I put "reproduce" in 
quotes, I can never predict what will happen)

# the brokers get their ZK sessions expired (obviously) and sometimes only 2 of 
3 re-register under /brokers/ids. The 3rd broker doesn't re-register for some 
reason (that's the problem I originally described)
# the brokers all re-register and re-elect a new controller. But that new 
controller does not fully work. For example it doesn't process partition 
reassignment requests and or does not transfer partition leadership after I 
kill a broker
# the previous controller gets "dead-locked" (it has 3-4 of the important 
controller threads in a lock) and hence does not perform any of it's controller 
duties. But it regards itsself still as the valid controller and is accepted by 
the other brokers

I'll try to describe each one of the problems in more detail below, and hope to 
be able to cleary separate them. 


> Broker hangs and lockups after Zookeeper outages
> 
>
> Key: KAFKA-8151
> URL: https://issues.apache.org/jira/browse/KAFKA-8151
> Project: Kafka
>  Issue Type: Bug
>  Components: controller, core, zkclient
>Affects Versions: 2.1.1
>Reporter: Joe Ammann
>Priority: Major
>
> We're running several clusters (mostly with 3 brokers) with 2.1.1, where we 
> see at least 3 different symptoms, all resulting in broker/controller lockups.
> We are pretty sure that the triggering cause for all these symptoms is a 
> temporary outage (normally 3-5 minutes) of the Zookeeper cluster. The Linux 
> VMs where the ZK nodes run regularly get stalled for a couple of minutes. The 
> ZK nodes always very quickly reunite and build a quorum after the situation 
> clears, but the Kafka brokers (which 

[jira] [Created] (KAFKA-8151) Broker hangs and lockups after Zookeeper outages

2019-03-23 Thread Joe Ammann (JIRA)
Joe Ammann created KAFKA-8151:
-

 Summary: Broker hangs and lockups after Zookeeper outages
 Key: KAFKA-8151
 URL: https://issues.apache.org/jira/browse/KAFKA-8151
 Project: Kafka
  Issue Type: Bug
  Components: controller, core, zkclient
Affects Versions: 2.1.1
Reporter: Joe Ammann


We're running several clusters (mostly with 3 brokers) with 2.1.1, where we see 
at least 3 different symptoms, all resulting in broker/controller lockups.

We are pretty sure that the triggering cause for all these symptoms is a 
temporary outage (normally 3-5 minutes) of the Zookeeper cluster. The Linux VMs 
where the ZK nodes run regularly get stalled for a couple of minutes. The ZK 
nodes always very quickly reunite and build a quorum after the situation 
clears, but the Kafka brokers (which run on the same Linux VMs) quite often 
show problems after this procedure.

I've seen 3 different kinds of problems (this is why I put "reproduce" in 
quotes, I can never predict what will happen)

# the brokers get their ZK sessions expired (obviously) and sometimes only 2 of 
3 re-register under /brokers/ids. The 3rd broker doesn't re-register for some 
reason (that's the problem I originally described)
# the brokers all re-register and re-elect a new controller, but that new 
controller does not fully work. For example it doesn't process partition 
reassignment requests and/or does not transfer partition leadership after I 
kill a broker
# the previous controller gets "dead-locked" (it has 3-4 of the important 
controller threads in a lock) and hence does not perform any of its controller 
duties, but it still regards itself as the valid controller and is accepted by 
the other brokers

I'll try to describe each one of the problems in more detail below, and hope to 
be able to clearly separate them. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)