[jira] [Commented] (KAFKA-9290) Update IQ related JavaDocs

2020-04-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17087383#comment-17087383
 ] 

ASF GitHub Bot commented on KAFKA-9290:
---

mjsax commented on issue #8114:
URL: https://github.com/apache/kafka/pull/8114#issuecomment-616325249


   Build failed with checkstyle errors:
   ```
   [ant:checkstyle] [ERROR] 
/home/jenkins/jenkins-slave/workspace/kafka-pr-jdk11-scala2.13/streams/src/test/java/org/apache/kafka/streams/state/internals/RocksDBStoreTest.java:34:8:
 Unused import - org.apache.kafka.streams.processor.ProcessorContext. 
[UnusedImports]
   11:59:28 [ant:checkstyle] [ERROR] 
/home/jenkins/jenkins-slave/workspace/kafka-pr-jdk11-scala2.13/streams/src/test/java/org/apache/kafka/streams/state/internals/RocksDBStoreTest.java:163:21:
 '=' is not preceded with whitespace. [WhitespaceAround]
   ```
   
   @highluck Can you update the PR?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Update IQ related JavaDocs
> --
>
> Key: KAFKA-9290
> URL: https://issues.apache.org/jira/browse/KAFKA-9290
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: Matthias J. Sax
>Assignee: highluck
>Priority: Minor
>  Labels: beginner, newbie
>
> In Kafka 2.1.0 we deprecated a couple of methods (KAFKA-7277) in favor of passing 
> timestamps into the IQ API via Duration/Instant parameters instead of plain longs.
> In Kafka 2.3.0 we introduced TimestampedXxxStores (KAFKA-3522) and allowed IQ 
> to return the stored timestamp.
> However, we never updated our JavaDocs that contain code snippets to 
> illustrate how a local store can be queried. For example 
> `KGroupedStream#count(Materialized)` 
> ([https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/kstream/KGroupedStream.java#L116-L122]):
>  
> {code:java}
> * {@code
> * KafkaStreams streams = ... // counting words
> * String queryableStoreName = "storeName"; // the store name should be the name of the store as defined by the Materialized instance
> * ReadOnlyKeyValueStore<String, Long> localStore = streams.store(queryableStoreName, QueryableStoreTypes.<String, Long>keyValueStore());
> * String key = "some-word";
> * Long countForWord = localStore.get(key); // key must be local (application state is shared over all running Kafka Streams instances)
> * }
> {code}
> We should update all JavaDocs to use `TimestampedXxxStore` and the new 
> Duration/Instant methods in all those code snippets.
>  
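For illustration only (not part of the ticket text): with the 2.3+ API, the same word-count snippet could be rewritten roughly as follows, using the timestamped store type so the query also returns the stored timestamp. The store name and types mirror the example above; the exact wording of the final JavaDocs is up to the PR.

{code:java}
KafkaStreams streams = ... // counting words, as in the snippet above
String queryableStoreName = "storeName"; // the store name as defined by the Materialized instance
// Query the store as a TimestampedKeyValueStore instead of a plain KeyValueStore:
ReadOnlyKeyValueStore<String, ValueAndTimestamp<Long>> localStore =
    streams.store(queryableStoreName, QueryableStoreTypes.<String, Long>timestampedKeyValueStore());
String key = "some-word";
ValueAndTimestamp<Long> countForWord = localStore.get(key); // key must be local
Long count = countForWord.value();          // the count itself
long timestamp = countForWord.timestamp();  // timestamp of the record that last updated it
{code}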



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9224) State store should not see uncommitted transaction result

2020-04-19 Thread Guozhang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17087270#comment-17087270
 ] 

Guozhang Wang commented on KAFKA-9224:
--

Hi [~mjsax], I think we are talking about the same thing: in restoration, let's 
say the previous committed offset is 40, we could opt to NOT make IQ available 
until we've completed restoration, i.e., until the state is back to RUNNING -- at that 
moment we are guaranteed to be at least at offset 40 already. KIP-535 allows IQ 
on RESTORING tasks (active or standby) and hence exposed the issue.

I think this JIRA is orthogonal to KIP-535, and we can put it this way: KIP-535 
exposed new root causes to break monotonicity, but even before KIP-535 we still 
have other root causes that can break monotonicity. This JIRA is for the "other 
root causes" other than KIP-535 itself, AND here one can argue that under ALOS 
the other root cause is allowed while under EOS it should not be allowed.

> State store should not see uncommitted transaction result
> -
>
> Key: KAFKA-9224
> URL: https://issues.apache.org/jira/browse/KAFKA-9224
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Boyang Chen
>Assignee: Boyang Chen
>Priority: Major
>
> Currently under EOS, an uncommitted write can be reflected in the state 
> store before the ongoing transaction is finished. This means an interactive 
> query could see uncommitted data within the state store, which is not ideal for 
> users relying on state stores for strong consistency. Ideally, we should have 
> an option to include the state store commit as part of the ongoing transaction; 
> however, an immediate step towards a better reasoned system is to `write after 
> transaction commit`, which means we always buffer data within the stream cache 
> for EOS until the ongoing transaction is committed.
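To make the proposed `write after transaction commit` idea concrete, here is a conceptual sketch only (not Kafka's implementation and not code from this ticket): writes are held in an in-memory buffer and are applied to the underlying store only when the transaction commits, so interactive queries never observe uncommitted data.

{code:java}
import java.util.LinkedHashMap;
import java.util.Map;

// Conceptual sketch: all class and method names here are hypothetical.
class TransactionBufferedStore<K, V> {
    private final Map<K, V> committedStore = new LinkedHashMap<>();    // stands in for the real state store
    private final Map<K, V> uncommittedBuffer = new LinkedHashMap<>(); // stands in for the stream cache

    void put(K key, V value) {
        uncommittedBuffer.put(key, value);        // buffered; invisible to interactive queries
    }

    V get(K key) {
        return committedStore.get(key);           // interactive queries only ever see committed data
    }

    void onCommit() {
        committedStore.putAll(uncommittedBuffer); // flush buffered writes on transaction commit
        uncommittedBuffer.clear();
    }

    void onAbort() {
        uncommittedBuffer.clear();                // an aborted transaction leaves no trace in the store
    }
}
{code}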



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9224) State store should not see uncommitted transaction result

2020-04-19 Thread Matthias J. Sax (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17087259#comment-17087259
 ] 

Matthias J. Sax commented on KAFKA-9224:


Isn't that the same as what I said? Not sure if (1) and (2) are two different 
issues though? They seem to be the same to me (while the issue of this Jira is 
neither (1) nor (2)). What is your reasoning to distinguish both?
{quote}with ALOS, since "uncommitted" data may be queried, I think it is 
appropriate that if it crashed, and you query again those uncommitted data 
would not be returned, and hence monotonicity would be broken
{quote}
I am not sure about this. Assume we just do a `builder.table(..., 
Materialized.as(...))` and we commit (in at-least-once mode) every 10 records 
(just for the sake of argument). Hence, if a user queries the store between two 
commits, let's say after commit "40" at offset 45, it would see uncommitted 
data. However, if we fail, restore, and serve queries during restoration, the 
user may also query at offset 20 during restore. While this is technically 
committed data, it would still violate monotonicity. IMHO, returning the state at 
offset 20 during restore is a different "violation" of monotonicity than going 
back from 45 to 40 (assuming we don't query during restore, roll back 
uncommitted state to the latest committed state, and query when we are in 
running state again, but did not (re)process data yet).
{quote}But then with EOS, one can argue that this should not happen, i.e. as 
long as we do not return "uncommitted" data, then even upon crashing we would 
not go backwards.
{quote}
I don't think that holds: The same argument as above applies for the EOS case, 
too: even if we don't allow users to query the uncommitted state at offset 45 
(but would return the committed state at offset 40), on restore the user might 
still query an older committed state, e.g., the non-monotonic state at offset 20. 
I don't think that this Jira covers the "query during restore phase" case, but 
only the "return state of offset 40 when querying at offset 45 during running 
state" case?

From my point of view, querying standby state and querying during restore are 
the same problem? Querying only committed state during "running" is a 
different problem, and IMHO this ticket only covers that case? Maybe we are both 
saying the same thing but just don't understand each other?

 

> State store should not see uncommitted transaction result
> -
>
> Key: KAFKA-9224
> URL: https://issues.apache.org/jira/browse/KAFKA-9224
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Boyang Chen
>Assignee: Boyang Chen
>Priority: Major
>
> Currently under EOS, an uncommitted write can be reflected in the state 
> store before the ongoing transaction is finished. This means an interactive 
> query could see uncommitted data within the state store, which is not ideal for 
> users relying on state stores for strong consistency. Ideally, we should have 
> an option to include the state store commit as part of the ongoing transaction; 
> however, an immediate step towards a better reasoned system is to `write after 
> transaction commit`, which means we always buffer data within the stream cache 
> for EOS until the ongoing transaction is committed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (KAFKA-9313) Set default for client.dns.lookup to use_all_dns_ips

2020-04-19 Thread Jason Gustafson (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Gustafson reassigned KAFKA-9313:
--

Assignee: Badai Aqrandista

> Set default for client.dns.lookup to use_all_dns_ips
> 
>
> Key: KAFKA-9313
> URL: https://issues.apache.org/jira/browse/KAFKA-9313
> Project: Kafka
>  Issue Type: Improvement
>  Components: clients
>Reporter: Yeva Byzek
>Assignee: Badai Aqrandista
>Priority: Minor
>
> The default setting of the configuration parameter {{client.dns.lookup}} is 
> *not* {{use_all_dns_ips}}. Consequently, by default, if there are multiple 
> IP addresses and the first one fails, the connection will fail.
>  
> It is desirable to change the default to 
> {{client.dns.lookup=use_all_dns_ips}} for two reasons:
>  # reduce connection failure rates by trying the other resolved IP addresses when the first one fails
>  # users are often surprised that this is not already the default
>  
>  
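As an aside (not from the ticket), clients can already opt in to this behavior explicitly until the default changes; a minimal sketch using a plain consumer, with a hypothetical broker address:

{code:java}
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class DnsLookupConfigExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker.example.com:9092"); // hypothetical address
        // Opt in explicitly, since use_all_dns_ips is not (yet) the default:
        props.put("client.dns.lookup", "use_all_dns_ips");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // subscribe/poll as usual; connection setup now tries every resolved IP
        }
    }
}
{code}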



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9224) State store should not see uncommitted transaction result

2020-04-19 Thread Guozhang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17087152#comment-17087152
 ] 

Guozhang Wang commented on KAFKA-9224:
--

I think there are two issues here:

1) If there are active AND standby tasks, and with KIP-535 where standby tasks 
can be queried, then it is possible that querying the active task and then 
querying the standby could break monotonicity; this is not related to EOS, but 
related to KIP-535.
2) Even if there are only active tasks, it is still possible that querying the 
active task, having it migrate due to a failover, and then querying again 
could break monotonicity --- this is because when crashing, upon restoration we 
may not restore all the previously un-flushed entries.

I'm primarily thinking about case 2) here, on whether this is a real issue or 
not: with ALOS, since "uncommitted" data may be queried, I think it is 
appropriate that, if it crashed and you query again, those uncommitted data 
would not be returned, and hence monotonicity would be broken. But then with 
EOS, one can argue that this should not happen, i.e. as long as we do not 
return "uncommitted" data, then even upon crashing we would not go backwards. 
And this is what, I think, this JIRA was for -- i.e. it is not really a 
duplicate of 8870.

Of course, only fixing 2) for EOS does not guarantee monotonicity would never 
be broken since we still have 1). And personally I think 1) should be fixed 
differently, probably even out of Streams' scope. For 

> State store should not see uncommitted transaction result
> -
>
> Key: KAFKA-9224
> URL: https://issues.apache.org/jira/browse/KAFKA-9224
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Boyang Chen
>Assignee: Boyang Chen
>Priority: Major
>
> Currently under EOS, an uncommitted write can be reflected in the state 
> store before the ongoing transaction is finished. This means an interactive 
> query could see uncommitted data within the state store, which is not ideal for 
> users relying on state stores for strong consistency. Ideally, we should have 
> an option to include the state store commit as part of the ongoing transaction; 
> however, an immediate step towards a better reasoned system is to `write after 
> transaction commit`, which means we always buffer data within the stream cache 
> for EOS until the ongoing transaction is committed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-7870) Error sending fetch request (sessionId=1578860481, epoch=INITIAL) to node 2: java.io.IOException: Connection to 2 was disconnected before the response was read.

2020-04-19 Thread Ismael Juma (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-7870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17087147#comment-17087147
 ] 

Ismael Juma commented on KAFKA-7870:


Disconnections can happen for many reasons. Are the errors persistent?

> Error sending fetch request (sessionId=1578860481, epoch=INITIAL) to node 2: 
> java.io.IOException: Connection to 2 was disconnected before the response was 
> read.
> 
>
> Key: KAFKA-7870
> URL: https://issues.apache.org/jira/browse/KAFKA-7870
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Affects Versions: 2.1.0
>Reporter: Chakhsu Lau
>Priority: Blocker
>
> We built a Kafka cluster with 5 brokers, but one of the brokers suddenly stopped 
> running. It happened twice on the same broker. Here is the 
> log; is this a bug in Kafka?
> {code:java}
> [2019-01-25 12:57:14,686] INFO [ReplicaFetcher replicaId=3, leaderId=2, 
> fetcherId=0] Error sending fetch request (sessionId=1578860481, 
> epoch=INITIAL) to node 2: java.io.IOException: Connection to 2 was 
> disconnected before the response was read. 
> (org.apache.kafka.clients.FetchSessionHandler)
> [2019-01-25 12:57:14,687] WARN [ReplicaFetcher replicaId=3, leaderId=2, 
> fetcherId=0] Error in response for fetch request (type=FetchRequest, 
> replicaId=3, maxWait=500, minBytes=1, maxBytes=10485760, 
> fetchData={api-result-bi-heatmap-8=(offset=0, logStartOffset=0, 
> maxBytes=1048576, currentLeaderEpoch=Optional[4]), 
> api-result-bi-heatmap-save-12=(offset=0, logStartOffset=0, maxBytes=1048576, 
> currentLeaderEpoch=Optional[4]), api-result-bi-heatmap-task-2=(offset=2, 
> logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[4]), 
> api-result-bi-flow-39=(offset=1883206, logStartOffset=0, maxBytes=1048576, 
> currentLeaderEpoch=Optional[4]), __consumer_offsets-47=(offset=349437, 
> logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[4]), 
> api-result-bi-heatmap-track-6=(offset=1039889, logStartOffset=0, 
> maxBytes=1048576, currentLeaderEpoch=Optional[4]), 
> api-result-bi-heatmap-task-17=(offset=0, logStartOffset=0, maxBytes=1048576, 
> currentLeaderEpoch=Optional[4]), __consumer_offsets-2=(offset=0, 
> logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[4]), 
> api-result-bi-heatmap-aggs-19=(offset=1255056, logStartOffset=0, 
> maxBytes=1048576, currentLeaderEpoch=Optional[4])}, 
> isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=1578860481, 
> epoch=INITIAL)) (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 2 was disconnected before the response was 
> read
> at 
> org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:97)
> at 
> kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:97)
> at 
> kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:190)
> at 
> kafka.server.AbstractFetcherThread.kafka$server$AbstractFetcherThread$$processFetchRequest(AbstractFetcherThread.scala:241)
> at 
> kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:130)
> at 
> kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:129)
> at scala.Option.foreach(Option.scala:257)
> at 
> kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:129)
> at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:111)
> at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-9243) Update the javadocs from KeyValueStore to TimestampKeyValueStore

2020-04-19 Thread Matthias J. Sax (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthias J. Sax resolved KAFKA-9243.

Resolution: Duplicate

Closing this ticket in favor of KAFKA-9290.

> Update the javadocs from KeyValueStore to TimestampKeyValueStore
> 
>
> Key: KAFKA-9243
> URL: https://issues.apache.org/jira/browse/KAFKA-9243
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Affects Versions: 2.3.0
>Reporter: Walker Carlson
>Assignee: Demitri Swan
>Priority: Minor
>  Labels: beginner, newbie
>
> As of version 2.3, the DSL uses `TimestampedStores` to represent KTables. 
> However, the JavaDocs of all table-related operators still refer to plain 
> `KeyValueStores` etc. instead of `TimestampedKeyValueStore` etc. Hence, all 
> those JavaDocs should be updated (the JavaDocs are technically not incorrect, 
> because one can access a TimestampedKeyValueStore as a KeyValueStore, too – 
> hence this ticket is not a "bug" but an improvement).
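For illustration only (not from the ticket): assuming a running KafkaStreams instance named `streams` and a KTable materialized as "counts-store" with String keys and Long values (both names hypothetical), the same store can be queried either way, which is why the current JavaDocs are not wrong, just outdated:

{code:java}
// Plain view, as the current JavaDocs show (values only):
ReadOnlyKeyValueStore<String, Long> plainView =
    streams.store("counts-store", QueryableStoreTypes.<String, Long>keyValueStore());
Long count = plainView.get("some-word");

// Timestamped view, as the updated JavaDocs would show (value plus timestamp):
ReadOnlyKeyValueStore<String, ValueAndTimestamp<Long>> timestampedView =
    streams.store("counts-store", QueryableStoreTypes.<String, Long>timestampedKeyValueStore());
ValueAndTimestamp<Long> countAndTimestamp = timestampedView.get("some-word");
{code}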



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9243) Update the javadocs from KeyValueStore to TimestampKeyValueStore

2020-04-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17087134#comment-17087134
 ] 

ASF GitHub Bot commented on KAFKA-9243:
---

mjsax commented on pull request #7848: KAFKA-9243 Update the javadocs to 
include TimestampKeyValueStore
URL: https://github.com/apache/kafka/pull/7848
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Update the javadocs from KeyValueStore to TimestampKeyValueStore
> 
>
> Key: KAFKA-9243
> URL: https://issues.apache.org/jira/browse/KAFKA-9243
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Affects Versions: 2.3.0
>Reporter: Walker Carlson
>Assignee: Demitri Swan
>Priority: Minor
>  Labels: beginner, newbie
>
> As of version 2.3, the DSL uses `TimestampedStores` to represent KTables. 
> However, the JavaDocs of all table-related operators still refer to plain 
> `KeyValueStores` etc. instead of `TimestampedKeyValueStore` etc. Hence, all 
> those JavaDocs should be updated (the JavaDocs are technically not incorrect, 
> because one can access a TimestampedKeyValueStore as a KeyValueStore, too – 
> hence this ticket is not a "bug" but an improvement).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-7870) Error sending fetch request (sessionId=1578860481, epoch=INITIAL) to node 2: java.io.IOException: Connection to 2 was disconnected before the response was read.

2020-04-19 Thread shiva (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-7870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17087129#comment-17087129
 ] 

shiva commented on KAFKA-7870:
--

I wonder how we handle these in production? I'm currently using 2.1.0 and 
keep restarting the node whenever it stops responding to other nodes. Please 
let me know if this can be solved by any config.

> Error sending fetch request (sessionId=1578860481, epoch=INITIAL) to node 2: 
> java.io.IOException: Connection to 2 was disconnected before the response was 
> read.
> 
>
> Key: KAFKA-7870
> URL: https://issues.apache.org/jira/browse/KAFKA-7870
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Affects Versions: 2.1.0
>Reporter: Chakhsu Lau
>Priority: Blocker
>
> We built a Kafka cluster with 5 brokers, but one of the brokers suddenly stopped 
> running. It happened twice on the same broker. Here is the 
> log; is this a bug in Kafka?
> {code:java}
> [2019-01-25 12:57:14,686] INFO [ReplicaFetcher replicaId=3, leaderId=2, 
> fetcherId=0] Error sending fetch request (sessionId=1578860481, 
> epoch=INITIAL) to node 2: java.io.IOException: Connection to 2 was 
> disconnected before the response was read. 
> (org.apache.kafka.clients.FetchSessionHandler)
> [2019-01-25 12:57:14,687] WARN [ReplicaFetcher replicaId=3, leaderId=2, 
> fetcherId=0] Error in response for fetch request (type=FetchRequest, 
> replicaId=3, maxWait=500, minBytes=1, maxBytes=10485760, 
> fetchData={api-result-bi-heatmap-8=(offset=0, logStartOffset=0, 
> maxBytes=1048576, currentLeaderEpoch=Optional[4]), 
> api-result-bi-heatmap-save-12=(offset=0, logStartOffset=0, maxBytes=1048576, 
> currentLeaderEpoch=Optional[4]), api-result-bi-heatmap-task-2=(offset=2, 
> logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[4]), 
> api-result-bi-flow-39=(offset=1883206, logStartOffset=0, maxBytes=1048576, 
> currentLeaderEpoch=Optional[4]), __consumer_offsets-47=(offset=349437, 
> logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[4]), 
> api-result-bi-heatmap-track-6=(offset=1039889, logStartOffset=0, 
> maxBytes=1048576, currentLeaderEpoch=Optional[4]), 
> api-result-bi-heatmap-task-17=(offset=0, logStartOffset=0, maxBytes=1048576, 
> currentLeaderEpoch=Optional[4]), __consumer_offsets-2=(offset=0, 
> logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[4]), 
> api-result-bi-heatmap-aggs-19=(offset=1255056, logStartOffset=0, 
> maxBytes=1048576, currentLeaderEpoch=Optional[4])}, 
> isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=1578860481, 
> epoch=INITIAL)) (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 2 was disconnected before the response was 
> read
> at 
> org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:97)
> at 
> kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:97)
> at 
> kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:190)
> at 
> kafka.server.AbstractFetcherThread.kafka$server$AbstractFetcherThread$$processFetchRequest(AbstractFetcherThread.scala:241)
> at 
> kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:130)
> at 
> kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:129)
> at scala.Option.foreach(Option.scala:257)
> at 
> kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:129)
> at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:111)
> at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Issue Comment Deleted] (KAFKA-7870) Error sending fetch request (sessionId=1578860481, epoch=INITIAL) to node 2: java.io.IOException: Connection to 2 was disconnected before the response was r

2020-04-19 Thread shiva (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shiva updated KAFKA-7870:
-
Comment: was deleted

(was: I wonder how we handle these in production? I'm currently using 2.1.0 
and keep restarting the node whenever it stops responding to other nodes. 
Please let me know if this can be solved by any config.)

> Error sending fetch request (sessionId=1578860481, epoch=INITIAL) to node 2: 
> java.io.IOException: Connection to 2 was disconnected before the response was 
> read.
> 
>
> Key: KAFKA-7870
> URL: https://issues.apache.org/jira/browse/KAFKA-7870
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Affects Versions: 2.1.0
>Reporter: Chakhsu Lau
>Priority: Blocker
>
> We built a Kafka cluster with 5 brokers, but one of the brokers suddenly stopped 
> running. It happened twice on the same broker. Here is the 
> log; is this a bug in Kafka?
> {code:java}
> [2019-01-25 12:57:14,686] INFO [ReplicaFetcher replicaId=3, leaderId=2, 
> fetcherId=0] Error sending fetch request (sessionId=1578860481, 
> epoch=INITIAL) to node 2: java.io.IOException: Connection to 2 was 
> disconnected before the response was read. 
> (org.apache.kafka.clients.FetchSessionHandler)
> [2019-01-25 12:57:14,687] WARN [ReplicaFetcher replicaId=3, leaderId=2, 
> fetcherId=0] Error in response for fetch request (type=FetchRequest, 
> replicaId=3, maxWait=500, minBytes=1, maxBytes=10485760, 
> fetchData={api-result-bi-heatmap-8=(offset=0, logStartOffset=0, 
> maxBytes=1048576, currentLeaderEpoch=Optional[4]), 
> api-result-bi-heatmap-save-12=(offset=0, logStartOffset=0, maxBytes=1048576, 
> currentLeaderEpoch=Optional[4]), api-result-bi-heatmap-task-2=(offset=2, 
> logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[4]), 
> api-result-bi-flow-39=(offset=1883206, logStartOffset=0, maxBytes=1048576, 
> currentLeaderEpoch=Optional[4]), __consumer_offsets-47=(offset=349437, 
> logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[4]), 
> api-result-bi-heatmap-track-6=(offset=1039889, logStartOffset=0, 
> maxBytes=1048576, currentLeaderEpoch=Optional[4]), 
> api-result-bi-heatmap-task-17=(offset=0, logStartOffset=0, maxBytes=1048576, 
> currentLeaderEpoch=Optional[4]), __consumer_offsets-2=(offset=0, 
> logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[4]), 
> api-result-bi-heatmap-aggs-19=(offset=1255056, logStartOffset=0, 
> maxBytes=1048576, currentLeaderEpoch=Optional[4])}, 
> isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=1578860481, 
> epoch=INITIAL)) (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 2 was disconnected before the response was 
> read
> at 
> org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:97)
> at 
> kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:97)
> at 
> kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:190)
> at 
> kafka.server.AbstractFetcherThread.kafka$server$AbstractFetcherThread$$processFetchRequest(AbstractFetcherThread.scala:241)
> at 
> kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:130)
> at 
> kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:129)
> at scala.Option.foreach(Option.scala:257)
> at 
> kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:129)
> at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:111)
> at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (KAFKA-7870) Error sending fetch request (sessionId=1578860481, epoch=INITIAL) to node 2: java.io.IOException: Connection to 2 was disconnected before the response was read.

2020-04-19 Thread shiva (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-7870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17087128#comment-17087128
 ] 

shiva edited comment on KAFKA-7870 at 4/19/20, 6:32 PM:


I wonder how we handle these in production? I'm currently using 2.1.0 and 
keep restarting the node whenever it stops responding to other nodes. Please 
let me know if this can be solved by any config.


was (Author: schikkam):
I wonder how we handle these in production? I'm currently using 2.1.0 and 
keep restarting the node whenever it is facing this issue. Please let me know if 
this can be solved by any config.

> Error sending fetch request (sessionId=1578860481, epoch=INITIAL) to node 2: 
> java.io.IOException: Connection to 2 was disconnected before the response was 
> read.
> 
>
> Key: KAFKA-7870
> URL: https://issues.apache.org/jira/browse/KAFKA-7870
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Affects Versions: 2.1.0
>Reporter: Chakhsu Lau
>Priority: Blocker
>
> We built a Kafka cluster with 5 brokers, but one of the brokers suddenly stopped 
> running. It happened twice on the same broker. Here is the 
> log; is this a bug in Kafka?
> {code:java}
> [2019-01-25 12:57:14,686] INFO [ReplicaFetcher replicaId=3, leaderId=2, 
> fetcherId=0] Error sending fetch request (sessionId=1578860481, 
> epoch=INITIAL) to node 2: java.io.IOException: Connection to 2 was 
> disconnected before the response was read. 
> (org.apache.kafka.clients.FetchSessionHandler)
> [2019-01-25 12:57:14,687] WARN [ReplicaFetcher replicaId=3, leaderId=2, 
> fetcherId=0] Error in response for fetch request (type=FetchRequest, 
> replicaId=3, maxWait=500, minBytes=1, maxBytes=10485760, 
> fetchData={api-result-bi-heatmap-8=(offset=0, logStartOffset=0, 
> maxBytes=1048576, currentLeaderEpoch=Optional[4]), 
> api-result-bi-heatmap-save-12=(offset=0, logStartOffset=0, maxBytes=1048576, 
> currentLeaderEpoch=Optional[4]), api-result-bi-heatmap-task-2=(offset=2, 
> logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[4]), 
> api-result-bi-flow-39=(offset=1883206, logStartOffset=0, maxBytes=1048576, 
> currentLeaderEpoch=Optional[4]), __consumer_offsets-47=(offset=349437, 
> logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[4]), 
> api-result-bi-heatmap-track-6=(offset=1039889, logStartOffset=0, 
> maxBytes=1048576, currentLeaderEpoch=Optional[4]), 
> api-result-bi-heatmap-task-17=(offset=0, logStartOffset=0, maxBytes=1048576, 
> currentLeaderEpoch=Optional[4]), __consumer_offsets-2=(offset=0, 
> logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[4]), 
> api-result-bi-heatmap-aggs-19=(offset=1255056, logStartOffset=0, 
> maxBytes=1048576, currentLeaderEpoch=Optional[4])}, 
> isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=1578860481, 
> epoch=INITIAL)) (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 2 was disconnected before the response was 
> read
> at 
> org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:97)
> at 
> kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:97)
> at 
> kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:190)
> at 
> kafka.server.AbstractFetcherThread.kafka$server$AbstractFetcherThread$$processFetchRequest(AbstractFetcherThread.scala:241)
> at 
> kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:130)
> at 
> kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:129)
> at scala.Option.foreach(Option.scala:257)
> at 
> kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:129)
> at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:111)
> at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-7870) Error sending fetch request (sessionId=1578860481, epoch=INITIAL) to node 2: java.io.IOException: Connection to 2 was disconnected before the response was read.

2020-04-19 Thread shiva (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-7870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17087128#comment-17087128
 ] 

shiva commented on KAFKA-7870:
--

I wonder how we handle these in production? I'm currently using 2.1.0 and 
keep restarting the node whenever it is facing this issue. Please let me know if 
this can be solved by any config.

> Error sending fetch request (sessionId=1578860481, epoch=INITIAL) to node 2: 
> java.io.IOException: Connection to 2 was disconnected before the response was 
> read.
> 
>
> Key: KAFKA-7870
> URL: https://issues.apache.org/jira/browse/KAFKA-7870
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Affects Versions: 2.1.0
>Reporter: Chakhsu Lau
>Priority: Blocker
>
> We built a Kafka cluster with 5 brokers, but one of the brokers suddenly stopped 
> running. It happened twice on the same broker. Here is the 
> log; is this a bug in Kafka?
> {code:java}
> [2019-01-25 12:57:14,686] INFO [ReplicaFetcher replicaId=3, leaderId=2, 
> fetcherId=0] Error sending fetch request (sessionId=1578860481, 
> epoch=INITIAL) to node 2: java.io.IOException: Connection to 2 was 
> disconnected before the response was read. 
> (org.apache.kafka.clients.FetchSessionHandler)
> [2019-01-25 12:57:14,687] WARN [ReplicaFetcher replicaId=3, leaderId=2, 
> fetcherId=0] Error in response for fetch request (type=FetchRequest, 
> replicaId=3, maxWait=500, minBytes=1, maxBytes=10485760, 
> fetchData={api-result-bi-heatmap-8=(offset=0, logStartOffset=0, 
> maxBytes=1048576, currentLeaderEpoch=Optional[4]), 
> api-result-bi-heatmap-save-12=(offset=0, logStartOffset=0, maxBytes=1048576, 
> currentLeaderEpoch=Optional[4]), api-result-bi-heatmap-task-2=(offset=2, 
> logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[4]), 
> api-result-bi-flow-39=(offset=1883206, logStartOffset=0, maxBytes=1048576, 
> currentLeaderEpoch=Optional[4]), __consumer_offsets-47=(offset=349437, 
> logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[4]), 
> api-result-bi-heatmap-track-6=(offset=1039889, logStartOffset=0, 
> maxBytes=1048576, currentLeaderEpoch=Optional[4]), 
> api-result-bi-heatmap-task-17=(offset=0, logStartOffset=0, maxBytes=1048576, 
> currentLeaderEpoch=Optional[4]), __consumer_offsets-2=(offset=0, 
> logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[4]), 
> api-result-bi-heatmap-aggs-19=(offset=1255056, logStartOffset=0, 
> maxBytes=1048576, currentLeaderEpoch=Optional[4])}, 
> isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=1578860481, 
> epoch=INITIAL)) (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 2 was disconnected before the response was 
> read
> at 
> org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:97)
> at 
> kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:97)
> at 
> kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:190)
> at 
> kafka.server.AbstractFetcherThread.kafka$server$AbstractFetcherThread$$processFetchRequest(AbstractFetcherThread.scala:241)
> at 
> kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:130)
> at 
> kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:129)
> at scala.Option.foreach(Option.scala:257)
> at 
> kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:129)
> at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:111)
> at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9013) Flaky Test MirrorConnectorsIntegrationTest#testReplication

2020-04-19 Thread Matthias J. Sax (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17087126#comment-17087126
 ] 

Matthias J. Sax commented on KAFKA-9013:


[https://builds.apache.org/job/kafka-pr-jdk11-scala2.13/5861/testReport/junit/org.apache.kafka.connect.mirror/MirrorConnectorsIntegrationTest/testReplication/]

> Flaky Test MirrorConnectorsIntegrationTest#testReplication
> --
>
> Key: KAFKA-9013
> URL: https://issues.apache.org/jira/browse/KAFKA-9013
> Project: Kafka
>  Issue Type: Bug
>  Components: KafkaConnect
>Reporter: Bruno Cadonna
>Priority: Major
>  Labels: flaky-test
>
> h1. Stacktrace:
> {code:java}
> java.lang.AssertionError: Condition not met within timeout 2. Offsets not 
> translated downstream to primary cluster.
>   at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:377)
>   at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:354)
>   at 
> org.apache.kafka.connect.mirror.MirrorConnectorsIntegrationTest.testReplication(MirrorConnectorsIntegrationTest.java:239)
> {code}
> h1. Standard Error
> {code}
> Standard Error
> Oct 09, 2019 11:32:00 PM org.glassfish.jersey.internal.inject.Providers 
> checkProviderRuntime
> WARNING: A provider 
> org.apache.kafka.connect.runtime.rest.resources.ConnectorPluginsResource 
> registered in SERVER runtime does not implement any provider interfaces 
> applicable in the SERVER runtime. Due to constraint configuration problems 
> the provider 
> org.apache.kafka.connect.runtime.rest.resources.ConnectorPluginsResource will 
> be ignored. 
> Oct 09, 2019 11:32:00 PM org.glassfish.jersey.internal.inject.Providers 
> checkProviderRuntime
> WARNING: A provider 
> org.apache.kafka.connect.runtime.rest.resources.LoggingResource registered in 
> SERVER runtime does not implement any provider interfaces applicable in the 
> SERVER runtime. Due to constraint configuration problems the provider 
> org.apache.kafka.connect.runtime.rest.resources.LoggingResource will be 
> ignored. 
> Oct 09, 2019 11:32:00 PM org.glassfish.jersey.internal.inject.Providers 
> checkProviderRuntime
> WARNING: A provider 
> org.apache.kafka.connect.runtime.rest.resources.ConnectorsResource registered 
> in SERVER runtime does not implement any provider interfaces applicable in 
> the SERVER runtime. Due to constraint configuration problems the provider 
> org.apache.kafka.connect.runtime.rest.resources.ConnectorsResource will be 
> ignored. 
> Oct 09, 2019 11:32:00 PM org.glassfish.jersey.internal.inject.Providers 
> checkProviderRuntime
> WARNING: A provider 
> org.apache.kafka.connect.runtime.rest.resources.RootResource registered in 
> SERVER runtime does not implement any provider interfaces applicable in the 
> SERVER runtime. Due to constraint configuration problems the provider 
> org.apache.kafka.connect.runtime.rest.resources.RootResource will be ignored. 
> Oct 09, 2019 11:32:01 PM org.glassfish.jersey.internal.Errors logErrors
> WARNING: The following warnings have been detected: WARNING: The 
> (sub)resource method listLoggers in 
> org.apache.kafka.connect.runtime.rest.resources.LoggingResource contains 
> empty path annotation.
> WARNING: The (sub)resource method listConnectors in 
> org.apache.kafka.connect.runtime.rest.resources.ConnectorsResource contains 
> empty path annotation.
> WARNING: The (sub)resource method createConnector in 
> org.apache.kafka.connect.runtime.rest.resources.ConnectorsResource contains 
> empty path annotation.
> WARNING: The (sub)resource method listConnectorPlugins in 
> org.apache.kafka.connect.runtime.rest.resources.ConnectorPluginsResource 
> contains empty path annotation.
> WARNING: The (sub)resource method serverInfo in 
> org.apache.kafka.connect.runtime.rest.resources.RootResource contains empty 
> path annotation.
> Oct 09, 2019 11:32:02 PM org.glassfish.jersey.internal.inject.Providers 
> checkProviderRuntime
> WARNING: A provider 
> org.apache.kafka.connect.runtime.rest.resources.ConnectorPluginsResource 
> registered in SERVER runtime does not implement any provider interfaces 
> applicable in the SERVER runtime. Due to constraint configuration problems 
> the provider 
> org.apache.kafka.connect.runtime.rest.resources.ConnectorPluginsResource will 
> be ignored. 
> Oct 09, 2019 11:32:02 PM org.glassfish.jersey.internal.inject.Providers 
> checkProviderRuntime
> WARNING: A provider 
> org.apache.kafka.connect.runtime.rest.resources.ConnectorsResource registered 
> in SERVER runtime does not implement any provider interfaces applicable in 
> the SERVER runtime. Due to constraint configuration problems the provider 
> org.apache.kafka.connect.runtime.rest.resources.ConnectorsResource will be 
> ignored. 
> Oct 09, 2019 11:32:02 PM 

[jira] [Commented] (KAFKA-7965) Flaky Test ConsumerBounceTest#testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup

2020-04-19 Thread Matthias J. Sax (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-7965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17087125#comment-17087125
 ] 

Matthias J. Sax commented on KAFKA-7965:


[https://builds.apache.org/job/kafka-pr-jdk11-scala2.13/5861/testReport/junit/kafka.api/ConsumerBounceTest/testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup/]

> Flaky Test 
> ConsumerBounceTest#testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup
> 
>
> Key: KAFKA-7965
> URL: https://issues.apache.org/jira/browse/KAFKA-7965
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, consumer, unit tests
>Affects Versions: 1.1.1, 2.2.0, 2.3.0
>Reporter: Matthias J. Sax
>Assignee: David Jacot
>Priority: Critical
>  Labels: flaky-test
> Fix For: 2.3.0
>
>
> To get stable nightly builds for `2.2` release, I create tickets for all 
> observed test failures.
> [https://jenkins.confluent.io/job/apache-kafka-test/job/2.2/21/]
> {quote}java.lang.AssertionError: Received 0, expected at least 68 at 
> org.junit.Assert.fail(Assert.java:88) at 
> org.junit.Assert.assertTrue(Assert.java:41) at 
> kafka.api.ConsumerBounceTest.receiveAndCommit(ConsumerBounceTest.scala:557) 
> at 
> kafka.api.ConsumerBounceTest.$anonfun$testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup$1(ConsumerBounceTest.scala:320)
>  at 
> kafka.api.ConsumerBounceTest.$anonfun$testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup$1$adapted(ConsumerBounceTest.scala:319)
>  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) 
> at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) 
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at 
> kafka.api.ConsumerBounceTest.testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup(ConsumerBounceTest.scala:319){quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Issue Comment Deleted] (KAFKA-7802) Connection to Broker Disconnected Taking Down the Whole Cluster

2020-04-19 Thread shiva (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shiva updated KAFKA-7802:
-
Comment: was deleted

(was: Is this fixed? I'm wondering how we handle this in production? I keep 
seeing my cluster having the same issue. Is there a permanent solution? 
Looking forward to it. BTW I'm facing it in 2.1.0)

> Connection to Broker Disconnected Taking Down the Whole Cluster
> ---
>
> Key: KAFKA-7802
> URL: https://issues.apache.org/jira/browse/KAFKA-7802
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.1.0
>Reporter: Candice Wan
>Priority: Critical
> Attachments: thread_dump.log
>
>
> We recently upgraded to 2.1.0. Since then, several times per day, we observe 
> that some brokers get disconnected when other brokers are trying to fetch 
> replicas. This issue took down the whole cluster, leaving all producers 
> and consumers unable to publish or consume messages. It could be quickly 
> fixed by restarting the problematic broker.
> Here is an example of what we're seeing in the broker which was trying to 
> send fetch request to the problematic one:
> 2019-01-09 08:05:10.445 [ReplicaFetcherThread-0-3] INFO 
> o.a.k.clients.FetchSessionHandler - [ReplicaFetcher replicaId=1, leaderId=3, 
> fetcherId=0] Error sending fetch request (sessionId=937967566, epoch=1599941) 
> to node 3: java.io.IOException: Connection to 3 was disconnected before the 
> response was read.
>  2019-01-09 08:05:10.445 [ReplicaFetcherThread-1-3] INFO 
> o.a.k.clients.FetchSessionHandler - [ReplicaFetcher replicaId=1, leaderId=3, 
> fetcherId=1] Error sending fetch request (sessionId=506217047, epoch=1375749) 
> to node 3: java.io.IOException: Connection to 3 was disconnected before the 
> response was read.
>  2019-01-09 08:05:10.445 [ReplicaFetcherThread-0-3] WARN 
> kafka.server.ReplicaFetcherThread - [ReplicaFetcher replicaId=1, leaderId=3, 
> fetcherId=0] Error in response for fetch request (type=FetchRequest, 
> replicaId=1, maxWait=500, minBytes=1, maxBytes=10485760, 
> fetchData={__consumer_offsets-11=(offset=421032847, logStartOffset=0, 
> maxBytes=1048576, currentLeaderEpoch=Optional[178])}, 
> isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=937967566, 
> epoch=1599941))
>  java.io.IOException: Connection to 3 was disconnected before the response 
> was read
>  at 
> org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:100)
>  at 
> kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:99)
>  at 
> kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:199)
>  at 
> kafka.server.AbstractFetcherThread.kafka$server$AbstractFetcherThread$$processFetchRequest(AbstractFetcherThread.scala:241)
>  at 
> kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:130)
>  at 
> kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:129)
>  at scala.Option.foreach(Option.scala:257)
>  at 
> kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:129)
>  at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:111)
>  at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
>  
>  
>  Below is the suspicious log of the problematic broker when the issue 
> happened:
> 2019-01-09 08:04:50.177 [executor-Heartbeat] INFO 
> k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Member 
> consumer-2-7d46fda9-afef-4705-b632-17f0255d5045 in group talon-instance1 has 
> failed, rem
>  oving it from the group
>  2019-01-09 08:04:50.177 [executor-Heartbeat] INFO 
> k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Preparing to 
> rebalance group talon-instance1 in state PreparingRebalance with old 
> generation 27
>  0 (__consumer_offsets-47) (reason: removing member 
> consumer-2-7d46fda9-afef-4705-b632-17f0255d5045 on heartbeat expiration)
>  2019-01-09 08:04:50.297 [executor-Heartbeat] INFO 
> k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Member 
> consumer-5-94b7eb6d-bc39-48ed-99b8-2e0f55edd60b in group 
> Notifications.ASIA1546980352799 has failed, removing it from the group
>  2019-01-09 08:04:50.297 [executor-Heartbeat] INFO 
> k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Preparing to 
> rebalance group Notifications.ASIA1546980352799 in state PreparingRebalance 
> with old generation 1 (__consumer_offsets-44) (reason: removing member 
> consumer-5-94b7eb6d-bc39-48ed-99b8-2e0f55edd60b on heartbeat expiration)
>  2019-01-09 08:04:50.297 [executor-Heartbeat] INFO 
> k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Group 
> Notifications.ASIA1546980352799 with generation 2 is now empty 
> 

[jira] [Comment Edited] (KAFKA-7802) Connection to Broker Disconnected Taking Down the Whole Cluster

2020-04-19 Thread shiva (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-7802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17087099#comment-17087099
 ] 

shiva edited comment on KAFKA-7802 at 4/19/20, 5:22 PM:


Is this fixed? I'm wondering how we handle this in production? I keep seeing my 
cluster having the same issue. Is there a permanent solution? Looking 
forward to it. BTW I'm facing it in 2.1.0


was (Author: schikkam):
Is this fixed? I'm wondering how we handle this in production? I keep seeing my 
cluster having the same issue. Is there a permanent solution? Looking 
forward to it.

> Connection to Broker Disconnected Taking Down the Whole Cluster
> ---
>
> Key: KAFKA-7802
> URL: https://issues.apache.org/jira/browse/KAFKA-7802
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.1.0
>Reporter: Candice Wan
>Priority: Critical
> Attachments: thread_dump.log
>
>
> We recently upgraded to 2.1.0. Since then, several times per day, we observe 
> that some brokers get disconnected when other brokers are trying to fetch 
> replicas. This issue took down the whole cluster, leaving all producers 
> and consumers unable to publish or consume messages. It could be quickly 
> fixed by restarting the problematic broker.
> Here is an example of what we're seeing in the broker which was trying to 
> send fetch request to the problematic one:
> 2019-01-09 08:05:10.445 [ReplicaFetcherThread-0-3] INFO 
> o.a.k.clients.FetchSessionHandler - [ReplicaFetcher replicaId=1, leaderId=3, 
> fetcherId=0] Error sending fetch request (sessionId=937967566, epoch=1599941) 
> to node 3: java.io.IOException: Connection to 3 was disconnected before the 
> response was read.
>  2019-01-09 08:05:10.445 [ReplicaFetcherThread-1-3] INFO 
> o.a.k.clients.FetchSessionHandler - [ReplicaFetcher replicaId=1, leaderId=3, 
> fetcherId=1] Error sending fetch request (sessionId=506217047, epoch=1375749) 
> to node 3: java.io.IOException: Connection to 3 was disconnected before the 
> response was read.
>  2019-01-09 08:05:10.445 [ReplicaFetcherThread-0-3] WARN 
> kafka.server.ReplicaFetcherThread - [ReplicaFetcher replicaId=1, leaderId=3, 
> fetcherId=0] Error in response for fetch request (type=FetchRequest, 
> replicaId=1, maxWait=500, minBytes=1, maxBytes=10485760, 
> fetchData={__consumer_offsets-11=(offset=421032847, logStartOffset=0, 
> maxBytes=1048576, currentLeaderEpoch=Optional[178])}, 
> isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=937967566, 
> epoch=1599941))
>  java.io.IOException: Connection to 3 was disconnected before the response 
> was read
>  at 
> org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:100)
>  at 
> kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:99)
>  at 
> kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:199)
>  at 
> kafka.server.AbstractFetcherThread.kafka$server$AbstractFetcherThread$$processFetchRequest(AbstractFetcherThread.scala:241)
>  at 
> kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:130)
>  at 
> kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:129)
>  at scala.Option.foreach(Option.scala:257)
>  at 
> kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:129)
>  at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:111)
>  at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
>  
>  
>  Below is the suspicious log of the problematic broker when the issue 
> happened:
> 2019-01-09 08:04:50.177 [executor-Heartbeat] INFO 
> k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Member 
> consumer-2-7d46fda9-afef-4705-b632-17f0255d5045 in group talon-instance1 has 
> failed, rem
>  oving it from the group
>  2019-01-09 08:04:50.177 [executor-Heartbeat] INFO 
> k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Preparing to 
> rebalance group talon-instance1 in state PreparingRebalance with old 
> generation 27
>  0 (__consumer_offsets-47) (reason: removing member 
> consumer-2-7d46fda9-afef-4705-b632-17f0255d5045 on heartbeat expiration)
>  2019-01-09 08:04:50.297 [executor-Heartbeat] INFO 
> k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Member 
> consumer-5-94b7eb6d-bc39-48ed-99b8-2e0f55edd60b in group 
> Notifications.ASIA1546980352799 has failed, removing it from the group
>  2019-01-09 08:04:50.297 [executor-Heartbeat] INFO 
> k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Preparing to 
> rebalance group Notifications.ASIA1546980352799 in state PreparingRebalance 
> with old generation 1 (__consumer_offsets-44) (reason: removing member 
> 

[jira] [Commented] (KAFKA-7802) Connection to Broker Disconnected Taking Down the Whole Cluster

2020-04-19 Thread shiva (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-7802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17087099#comment-17087099
 ] 

shiva commented on KAFKA-7802:
--

Is this fixed? I'm wondering how we handle this in production? I keep seeing my 
cluster having the same issue. Is there a permanent solution? Looking 
forward to it.

> Connection to Broker Disconnected Taking Down the Whole Cluster
> ---
>
> Key: KAFKA-7802
> URL: https://issues.apache.org/jira/browse/KAFKA-7802
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 2.1.0
>Reporter: Candice Wan
>Priority: Critical
> Attachments: thread_dump.log
>
>
> We recently upgraded to 2.1.0. Since then, several times per day, we observe 
> that some brokers get disconnected when other brokers are trying to fetch 
> replicas. This issue took down the whole cluster, leaving all producers 
> and consumers unable to publish or consume messages. It could be quickly 
> fixed by restarting the problematic broker.
> Here is an example of what we're seeing in the broker which was trying to 
> send fetch request to the problematic one:
> 2019-01-09 08:05:10.445 [ReplicaFetcherThread-0-3] INFO 
> o.a.k.clients.FetchSessionHandler - [ReplicaFetcher replicaId=1, leaderId=3, 
> fetcherId=0] Error sending fetch request (sessionId=937967566, epoch=1599941) 
> to node 3: java.io.IOException: Connection to 3 was disconnected before the 
> response was read.
>  2019-01-09 08:05:10.445 [ReplicaFetcherThread-1-3] INFO 
> o.a.k.clients.FetchSessionHandler - [ReplicaFetcher replicaId=1, leaderId=3, 
> fetcherId=1] Error sending fetch request (sessionId=506217047, epoch=1375749) 
> to node 3: java.io.IOException: Connection to 3 was disconnected before the 
> response was read.
>  2019-01-09 08:05:10.445 [ReplicaFetcherThread-0-3] WARN 
> kafka.server.ReplicaFetcherThread - [ReplicaFetcher replicaId=1, leaderId=3, 
> fetcherId=0] Error in response for fetch request (type=FetchRequest, 
> replicaId=1, maxWait=500, minBytes=1, maxBytes=10485760, 
> fetchData={__consumer_offsets-11=(offset=421032847, logStartOffset=0, 
> maxBytes=1048576, currentLeaderEpoch=Optional[178])}, 
> isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=937967566, 
> epoch=1599941))
>  java.io.IOException: Connection to 3 was disconnected before the response 
> was read
>  at 
> org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:100)
>  at 
> kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:99)
>  at 
> kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:199)
>  at 
> kafka.server.AbstractFetcherThread.kafka$server$AbstractFetcherThread$$processFetchRequest(AbstractFetcherThread.scala:241)
>  at 
> kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:130)
>  at 
> kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:129)
>  at scala.Option.foreach(Option.scala:257)
>  at 
> kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:129)
>  at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:111)
>  at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
>  
>  
>  Below is the suspicious log of the problematic broker when the issue 
> happened:
> 2019-01-09 08:04:50.177 [executor-Heartbeat] INFO 
> k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Member 
> consumer-2-7d46fda9-afef-4705-b632-17f0255d5045 in group talon-instance1 has 
> failed, rem
>  oving it from the group
>  2019-01-09 08:04:50.177 [executor-Heartbeat] INFO 
> k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Preparing to 
> rebalance group talon-instance1 in state PreparingRebalance with old 
> generation 27
>  0 (__consumer_offsets-47) (reason: removing member 
> consumer-2-7d46fda9-afef-4705-b632-17f0255d5045 on heartbeat expiration)
>  2019-01-09 08:04:50.297 [executor-Heartbeat] INFO 
> k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Member 
> consumer-5-94b7eb6d-bc39-48ed-99b8-2e0f55edd60b in group 
> Notifications.ASIA1546980352799 has failed, removing it from the group
>  2019-01-09 08:04:50.297 [executor-Heartbeat] INFO 
> k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Preparing to 
> rebalance group Notifications.ASIA1546980352799 in state PreparingRebalance 
> with old generation 1 (__consumer_offsets-44) (reason: removing member 
> consumer-5-94b7eb6d-bc39-48ed-99b8-2e0f55edd60b on heartbeat expiration)
>  2019-01-09 08:04:50.297 [executor-Heartbeat] INFO 
> k.coordinator.group.GroupCoordinator - [GroupCoordinator 3]: Group 
> Notifications.ASIA1546980352799 with generation 2 is now empty 
> 

[jira] [Comment Edited] (KAFKA-4084) automated leader rebalance causes replication downtime for clusters with too many partitions

2020-04-19 Thread GEORGE LI (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-4084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17086772#comment-17086772
 ] 

GEORGE LI edited comment on KAFKA-4084 at 4/19/20, 7:28 AM:


[~blodsbror]

I am not very familiar with the 5.4 setup.

Do you have the error message of the crash in the log? Is it missing the
zkclient jar, like below?

{code}
$ ls -l zk*.jar
-rw-r--r-- 1 georgeli engineering 74589 Nov 18 18:21 zkclient-0.11.jar
$ jar tvf zkclient-0.11.jar 
 0 Mon Nov 18 18:11:58 UTC 2019 META-INF/
  1135 Mon Nov 18 18:11:58 UTC 2019 META-INF/MANIFEST.MF
 0 Mon Nov 18 18:11:58 UTC 2019 org/
 0 Mon Nov 18 18:11:58 UTC 2019 org/I0Itec/
 0 Mon Nov 18 18:11:58 UTC 2019 org/I0Itec/zkclient/
  3486 Mon Nov 18 18:11:58 UTC 2019 org/I0Itec/zkclient/ContentWatcher.class
   263 Mon Nov 18 18:11:58 UTC 2019 org/I0Itec/zkclient/DataUpdater.class
{code}

If this jar file was there before, please copy it back. I need to find out
why it was missing after the build; maybe some dependency setup issue in
Gradle. I have also updated the [install doc
|https://docs.google.com/document/d/14vlPkbaog_5Xdd-HB4vMRaQQ7Fq4SlxddULsvc3PlbY/edit]
to use `./gradlew clean build -x test`.

Also make sure the startup script for Kafka is not hard-coding the 5.4 jars,
but instead takes the jars from the lib classpath, e.g.:

{code}
/usr/lib/jvm/java-8-openjdk-amd64/bin/java 
-Dlog4j.configuration=file:/etc/kafka/log4j.xml -Xms22G -Xmx22G -XX:+UseG1GC 
-XX:MaxGCPauseMillis=20 -XX:NewSize=16G -XX:MaxNewSize=16G 
-XX:InitiatingHeapOccupancyPercent=3 -XX:G1MixedGCCountTarget=1 
-XX:G1HeapWastePercent=1 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps 
-XX:+PrintGCDateStamps -verbose:gc -Xloggc:/var/log/kafka/gc-kafka.log -server 
-Dcom.sun.management.jmxremote 
-Dcom.sun.management.jmxremote.authenticate=false 
-Dcom.sun.management.jmxremote.ssl=false 
-Dcom.sun.management.jmxremote.port=29010 
-Djava.rmi.server.hostname=kafka12345-dca4 -cp '.:/usr/share/kafka/lib/*' 
kafka.Kafka /etc/kafka/server.properties
{code}


If you give us more details, we can help more. 

Thanks


Actually, I just patched and added back the zkclient libs for the Gradle build.
Please "git clone https://github.com/sql888/kafka.git" (or git pull) and try
to build again. I suspect that was the issue. Otherwise, we need to see the
errors of the crash from the Kafka logs.




was (Author: sql_consulting):
[~blodsbror]

I am not very familiar with the 5.4 setup.

Do you have the error message of the crash in the log? Is it missing the
zkclient jar, like below?

{code}
$ ls -l zk*.jar
-rw-r--r-- 1 georgeli engineering 74589 Nov 18 18:21 zkclient-0.11.jar
$ jar tvf zkclient-0.11.jar 
 0 Mon Nov 18 18:11:58 UTC 2019 META-INF/
  1135 Mon Nov 18 18:11:58 UTC 2019 META-INF/MANIFEST.MF
 0 Mon Nov 18 18:11:58 UTC 2019 org/
 0 Mon Nov 18 18:11:58 UTC 2019 org/I0Itec/
 0 Mon Nov 18 18:11:58 UTC 2019 org/I0Itec/zkclient/
  3486 Mon Nov 18 18:11:58 UTC 2019 org/I0Itec/zkclient/ContentWatcher.class
   263 Mon Nov 18 18:11:58 UTC 2019 org/I0Itec/zkclient/DataUpdater.class
{code}

If this jar file was there before, please copy it back. I need to find out
why it was missing after the build; maybe some dependency setup issue in
Gradle. I have also updated the [install doc
|https://docs.google.com/document/d/14vlPkbaog_5Xdd-HB4vMRaQQ7Fq4SlxddULsvc3PlbY/edit]
to use `./gradlew clean build -x test`.

Also make sure the startup script for Kafka is not hard-coding the 5.4 jars,
but instead takes the jars from the lib classpath, e.g.:

{code}
/usr/lib/jvm/java-8-openjdk-amd64/bin/java 
-Dlog4j.configuration=file:/etc/kafka/log4j.xml -Xms22G -Xmx22G -XX:+UseG1GC 
-XX:MaxGCPauseMillis=20 -XX:NewSize=16G -XX:MaxNewSize=16G 
-XX:InitiatingHeapOccupancyPercent=3 -XX:G1MixedGCCountTarget=1 
-XX:G1HeapWastePercent=1 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps 
-XX:+PrintGCDateStamps -verbose:gc -Xloggc:/var/log/kafka/gc-kafka.log -server 
-Dcom.sun.management.jmxremote 
-Dcom.sun.management.jmxremote.authenticate=false 
-Dcom.sun.management.jmxremote.ssl=false 
-Dcom.sun.management.jmxremote.port=29010 
-Djava.rmi.server.hostname=kafka12345-dca4 -cp '.:/usr/share/kafka/lib/*' 
kafka.Kafka /etc/kafka/server.properties
{code}


If you give us more details, we can help more. 

Thanks


Actually, I just patched and added back the zkclient libs for the Gradle build.
Please "git clone https://github.com/sql888/kafka.git" and try to build again.
I suspect that was the issue. Otherwise, we need to see the errors of the
crash from the Kafka logs.



> automated leader rebalance causes replication downtime for clusters with too 
> many partitions
> 
>
> Key: KAFKA-4084
> URL: https://issues.apache.org/jira/browse/KAFKA-4084
> Project: Kafka
>  Issue Type: Bug
> 

[jira] [Comment Edited] (KAFKA-4084) automated leader rebalance causes replication downtime for clusters with too many partitions

2020-04-19 Thread GEORGE LI (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-4084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17086772#comment-17086772
 ] 

GEORGE LI edited comment on KAFKA-4084 at 4/19/20, 7:20 AM:


[~blodsbror]

I am not very familiar with the 5.4 setup.

Do you have the error message of the crash in the log? Is it missing the
zkclient jar, like below?

{code}
$ ls -l zk*.jar
-rw-r--r-- 1 georgeli engineering 74589 Nov 18 18:21 zkclient-0.11.jar
$ jar tvf zkclient-0.11.jar 
 0 Mon Nov 18 18:11:58 UTC 2019 META-INF/
  1135 Mon Nov 18 18:11:58 UTC 2019 META-INF/MANIFEST.MF
 0 Mon Nov 18 18:11:58 UTC 2019 org/
 0 Mon Nov 18 18:11:58 UTC 2019 org/I0Itec/
 0 Mon Nov 18 18:11:58 UTC 2019 org/I0Itec/zkclient/
  3486 Mon Nov 18 18:11:58 UTC 2019 org/I0Itec/zkclient/ContentWatcher.class
   263 Mon Nov 18 18:11:58 UTC 2019 org/I0Itec/zkclient/DataUpdater.class
{code}

If this jar file was there before, please copy it back. I need to find out
why it was missing after the build; maybe some dependency setup issue in
Gradle. I have also updated the [install doc
|https://docs.google.com/document/d/14vlPkbaog_5Xdd-HB4vMRaQQ7Fq4SlxddULsvc3PlbY/edit]
to use `./gradlew clean build -x test`.

Also make sure the startup script for Kafka is not hard-coding the 5.4 jars,
but instead takes the jars from the lib classpath, e.g.:

{code}
/usr/lib/jvm/java-8-openjdk-amd64/bin/java 
-Dlog4j.configuration=file:/etc/kafka/log4j.xml -Xms22G -Xmx22G -XX:+UseG1GC 
-XX:MaxGCPauseMillis=20 -XX:NewSize=16G -XX:MaxNewSize=16G 
-XX:InitiatingHeapOccupancyPercent=3 -XX:G1MixedGCCountTarget=1 
-XX:G1HeapWastePercent=1 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps 
-XX:+PrintGCDateStamps -verbose:gc -Xloggc:/var/log/kafka/gc-kafka.log -server 
-Dcom.sun.management.jmxremote 
-Dcom.sun.management.jmxremote.authenticate=false 
-Dcom.sun.management.jmxremote.ssl=false 
-Dcom.sun.management.jmxremote.port=29010 
-Djava.rmi.server.hostname=kafka12345-dca4 -cp '.:/usr/share/kafka/lib/*' 
kafka.Kafka /etc/kafka/server.properties
{code}


If you give us more details, we can help more. 

Thanks


Actually, I just patched and added back the zkclient libs for the Gradle build.
Please "git clone https://github.com/sql888/kafka.git" and try to build again.
I suspect that was the issue. Otherwise, we need to see the errors of the
crash from the Kafka logs.




was (Author: sql_consulting):
[~blodsbror]

I am not very familiar with the 5.4 setup.

Do you have the error message of the crash in the log? Is it missing the
zkclient jar, like below?

{code}
$ ls -l zk*.jar
-rw-r--r-- 1 georgeli engineering 74589 Nov 18 18:21 zkclient-0.11.jar
$ jar tvf zkclient-0.11.jar 
 0 Mon Nov 18 18:11:58 UTC 2019 META-INF/
  1135 Mon Nov 18 18:11:58 UTC 2019 META-INF/MANIFEST.MF
 0 Mon Nov 18 18:11:58 UTC 2019 org/
 0 Mon Nov 18 18:11:58 UTC 2019 org/I0Itec/
 0 Mon Nov 18 18:11:58 UTC 2019 org/I0Itec/zkclient/
  3486 Mon Nov 18 18:11:58 UTC 2019 org/I0Itec/zkclient/ContentWatcher.class
   263 Mon Nov 18 18:11:58 UTC 2019 org/I0Itec/zkclient/DataUpdater.class
{code}

If this jar file was there before, please copy it back. I need to find out
why it was missing after the build; maybe some dependency setup issue in
Gradle. I have also updated the [install doc
|https://docs.google.com/document/d/14vlPkbaog_5Xdd-HB4vMRaQQ7Fq4SlxddULsvc3PlbY/edit]
to use `./gradlew clean build -x test`.

Also make sure the startup script for Kafka is not hard-coding the 5.4 jars,
but instead takes the jars from the lib classpath, e.g.:

{code}
/usr/lib/jvm/java-8-openjdk-amd64/bin/java 
-Dlog4j.configuration=file:/etc/kafka/log4j.xml -Xms22G -Xmx22G -XX:+UseG1GC 
-XX:MaxGCPauseMillis=20 -XX:NewSize=16G -XX:MaxNewSize=16G 
-XX:InitiatingHeapOccupancyPercent=3 -XX:G1MixedGCCountTarget=1 
-XX:G1HeapWastePercent=1 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps 
-XX:+PrintGCDateStamps -verbose:gc -Xloggc:/var/log/kafka/gc-kafka.log -server 
-Dcom.sun.management.jmxremote 
-Dcom.sun.management.jmxremote.authenticate=false 
-Dcom.sun.management.jmxremote.ssl=false 
-Dcom.sun.management.jmxremote.port=29010 
-Djava.rmi.server.hostname=kafka12345-dca4 -cp '.:/usr/share/kafka/lib/*' 
kafka.Kafka /etc/kafka/server.properties
{code}


If you give us more details, we can help more. 

Thanks

> automated leader rebalance causes replication downtime for clusters with too 
> many partitions
> 
>
> Key: KAFKA-4084
> URL: https://issues.apache.org/jira/browse/KAFKA-4084
> Project: Kafka
>  Issue Type: Bug
>  Components: controller
>Affects Versions: 0.8.2.2, 0.9.0.0, 0.9.0.1, 0.10.0.0, 0.10.0.1
>Reporter: Tom Crayford
>Priority: Major
>  Labels: reliability
> Fix For: 1.1.0
>
>
> If you enable {{auto.leader.rebalance.enable}} 

[jira] [Commented] (KAFKA-9224) State store should not see uncommitted transaction result

2020-04-19 Thread Matthias J. Sax (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17086781#comment-17086781
 ] 

Matthias J. Sax commented on KAFKA-9224:


I agree with John that these seem to be two independent problems. This ticket
(from my understanding) is only about not serving uncommitted state. Also note
that the EOS issue and the "non-monotonicity" issue were the same thing in a
pre-KIP-535 world: we only allowed querying the active task when the state was
"running", and thus non-monotonicity could only happen in the EOS case when
querying uncommitted state.

With KIP-535, the monotonicity issue is a "general" issue and seems to be
totally independent of the processing mode. I think this ticket won't solve
the monotonicity issue for the EOS case anyway. With or without EOS, we allow
users to query potentially lagging StandbyTasks and also allow querying
restoring active tasks. For both cases and both processing modes, a
non-monotonic answer could be returned. Does this sound correct?

If we fix this ticket, it seems that users can avoid the monotonicity issue
only by querying the active task in "running" state; hence, it becomes a
consistency-vs-HA tradeoff they can pick.
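
To make that tradeoff concrete, here is a small sketch using the 2.5
StoreQueryParameters API (the {{streams}} instance and the store name are
placeholders for an existing application):

{code:java}
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

// assume `streams` is a running KafkaStreams instance and "storeName" an existing store

// Consistency first: only the active, fully-restored copy serves the query;
// the call fails while the local task is restoring or is only a standby.
ReadOnlyKeyValueStore<String, Long> activeOnly = streams.store(
    StoreQueryParameters.fromNameAndType(
        "storeName", QueryableStoreTypes.<String, Long>keyValueStore()));

// Availability first: also allow standby and restoring copies to answer,
// accepting potentially stale (and, across copies, non-monotonic) results.
ReadOnlyKeyValueStore<String, Long> staleAllowed = streams.store(
    StoreQueryParameters.fromNameAndType(
        "storeName", QueryableStoreTypes.<String, Long>keyValueStore())
    .enableStaleStores());
{code}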

I am not even sure if we should address the monotonicity issue (at least not
at the moment), as KIP-535 is pretty new and we might want to wait for user
feedback.

One solution I see for the monotonicity issue (please share if you have a
better idea) is to have some "global agreement" across active and standby
tasks about what the current valid state is that is used to answer queries:
this sounds like a hard distributed-systems problem that requires a single
source of truth shared in a distributed setting... Uff... From my current
understanding, we can only solve this by letting the _client_ (this could
maybe be the query routing layer) maintain this "query barrier" (similar to
the high watermark maintained by the brokers to track replication). We can
think of it as an offset: the client would include its desired "query offset"
in each query, and the stores would need to keep a corresponding history to
answer the query based on that "query offset" (thus, it does not matter which
instance is queried; they all return the same answer). Furthermore, all
instances (ie, the active and all standbys) would regularly notify the client
about their highest available query offset based on their local state copy.
The client can pick the minimum of those numbers to get a consistent view of
the world (like a high watermark).

Hence, it becomes a data-freshness-vs-consistency-vs-HA tradeoff, and a client
would not query the latest state of the active task in most cases (as the
latest state would not have been replicated to the standbys yet). If an
individual instance crashes and state needs to be restored, the instance would
effectively be "offline" for IQ until it caught up. However, the instance
itself would not know anything about being offline; the client would just not
use this instance until the instance's newly reported "query offset" is larger
than the client's currently desired "query offset".

This is somewhat similar to the current "max lag" approach, for which the
client also decides which instance it wants to query. However, instead of
having a relative offset, i.e., "current active minus max.lag", and allowing
queries within a whole range of offsets, we "fix" the "query offset" to a
single value and just advance it gradually. We can still use the current
"max.lag" and stop serving queries (or warn the user about it) if data becomes
too stale compared to the active task (ie, standby replication falls too far
behind). Being even more fancy, we could have a "min.isr" configuration to not
stop serving queries if just a subset of standbys falls behind...
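
To make the "query offset" idea a bit more concrete, a purely illustrative
client-side sketch could look like the following (all class and method names
are made up for this comment; nothing like this exists in Streams today):

{code:java}
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical routing-layer helper: tracks the query offsets reported by all
// copies of a store and only routes queries to copies that reached the barrier.
public class QueryBarrierRouter {

    // Latest query offset each instance has reported for its local state copy.
    private final Map<String, Long> reportedOffsets = new ConcurrentHashMap<>();

    // The client's currently desired query offset; it only moves forward.
    private volatile long desiredQueryOffset = -1L;

    // Instances (the active and all standbys) report their locally available
    // query offset regularly.
    public void onOffsetReport(final String instanceId, final long queryOffset) {
        reportedOffsets.put(instanceId, queryOffset);
        // Advance the barrier to the minimum reported offset, i.e., the offset
        // every copy is guaranteed to have reached (like a high watermark).
        final long min = reportedOffsets.values().stream()
            .mapToLong(Long::longValue).min().orElse(-1L);
        desiredQueryOffset = Math.max(desiredQueryOffset, min);
    }

    // Route the query to any instance whose copy has reached the desired offset;
    // all such copies return the same answer for that offset. A restoring
    // instance is effectively "offline" until it reports a large enough offset.
    public Optional<String> pickInstance() {
        final long target = desiredQueryOffset;
        return reportedOffsets.entrySet().stream()
            .filter(e -> e.getValue() >= target)
            .map(Map.Entry::getKey)
            .findAny();
    }
}
{code}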

Anyway. I would propose to have two independent tickets :), and not work on the 
monotonicity issue for now.

Thoughts?

 

> State store should not see uncommitted transaction result
> -
>
> Key: KAFKA-9224
> URL: https://issues.apache.org/jira/browse/KAFKA-9224
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Boyang Chen
>Assignee: Boyang Chen
>Priority: Major
>
> Currently under EOS, the uncommitted write could be reflected in the state 
> store before the ongoing transaction is finished. This means interactive 
> query could see uncommitted data within state store which is not ideal for 
> users relying on state stores for strong consistency. Ideally, we should have 
> an option to include state store commit as part of ongoing transaction, 
> however an immediate step towards a better reasoned system is to `write after 
> transaction commit`, which means we always buffer data within stream cache 
> for EOS until the 
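
As an editorial illustration only, the "write after transaction commit" idea
sketched in this description could look roughly like the following toy buffer
(not actual Streams code; all names are made up):

{code:java}
import java.util.HashMap;
import java.util.Map;

// Writes are buffered per transaction and only pushed to the underlying store
// (and thus made visible to interactive queries) once the transaction commits.
public class WriteAfterCommitBuffer<K, V> {

    private final Map<K, V> uncommitted = new HashMap<>();
    private final Map<K, V> store; // stands in for the real state store

    public WriteAfterCommitBuffer(final Map<K, V> store) {
        this.store = store;
    }

    // Writes within the ongoing transaction stay in the buffer.
    public void put(final K key, final V value) {
        uncommitted.put(key, value);
    }

    // The processor still reads its own pending writes first.
    public V get(final K key) {
        final V pending = uncommitted.get(key);
        return pending != null ? pending : store.get(key);
    }

    // Interactive queries only ever see committed data.
    public V getCommittedOnly(final K key) {
        return store.get(key);
    }

    // On commit, flush the buffered writes into the store; on abort, drop them.
    public void onCommit() {
        store.putAll(uncommitted);
        uncommitted.clear();
    }

    public void onAbort() {
        uncommitted.clear();
    }
}
{code}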

[jira] [Commented] (KAFKA-4084) automated leader rebalance causes replication downtime for clusters with too many partitions

2020-04-19 Thread GEORGE LI (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-4084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17086772#comment-17086772
 ] 

GEORGE LI commented on KAFKA-4084:
--

[~blodsbror]

I am not very familiar with the 5.4 setup.

Do you have the error message of the crash in the log? Is it missing the
zkclient jar, like below?

{code}
$ ls -l zk*.jar
-rw-r--r-- 1 georgeli engineering 74589 Nov 18 18:21 zkclient-0.11.jar
$ jar tvf zkclient-0.11.jar 
 0 Mon Nov 18 18:11:58 UTC 2019 META-INF/
  1135 Mon Nov 18 18:11:58 UTC 2019 META-INF/MANIFEST.MF
 0 Mon Nov 18 18:11:58 UTC 2019 org/
 0 Mon Nov 18 18:11:58 UTC 2019 org/I0Itec/
 0 Mon Nov 18 18:11:58 UTC 2019 org/I0Itec/zkclient/
  3486 Mon Nov 18 18:11:58 UTC 2019 org/I0Itec/zkclient/ContentWatcher.class
   263 Mon Nov 18 18:11:58 UTC 2019 org/I0Itec/zkclient/DataUpdater.class
{code}

If this jar file was there before, please copy it back. I need to find out
why it was missing after the build; maybe some dependency setup issue in
Gradle. I have also updated the [install doc
|https://docs.google.com/document/d/14vlPkbaog_5Xdd-HB4vMRaQQ7Fq4SlxddULsvc3PlbY/edit]
to use `./gradlew clean build -x test`.

Also make sure the startup script for Kafka is not hard-coding the 5.4 jars,
but instead takes the jars from the lib classpath, e.g.:

{code}
/usr/lib/jvm/java-8-openjdk-amd64/bin/java 
-Dlog4j.configuration=file:/etc/kafka/log4j.xml -Xms22G -Xmx22G -XX:+UseG1GC 
-XX:MaxGCPauseMillis=20 -XX:NewSize=16G -XX:MaxNewSize=16G 
-XX:InitiatingHeapOccupancyPercent=3 -XX:G1MixedGCCountTarget=1 
-XX:G1HeapWastePercent=1 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps 
-XX:+PrintGCDateStamps -verbose:gc -Xloggc:/var/log/kafka/gc-kafka.log -server 
-Dcom.sun.management.jmxremote 
-Dcom.sun.management.jmxremote.authenticate=false 
-Dcom.sun.management.jmxremote.ssl=false 
-Dcom.sun.management.jmxremote.port=29010 
-Djava.rmi.server.hostname=kafka12345-dca4 -cp '.:/usr/share/kafka/lib/*' 
kafka.Kafka /etc/kafka/server.properties
{code}


If you give us more details, we can help more. 

Thanks

> automated leader rebalance causes replication downtime for clusters with too 
> many partitions
> 
>
> Key: KAFKA-4084
> URL: https://issues.apache.org/jira/browse/KAFKA-4084
> Project: Kafka
>  Issue Type: Bug
>  Components: controller
>Affects Versions: 0.8.2.2, 0.9.0.0, 0.9.0.1, 0.10.0.0, 0.10.0.1
>Reporter: Tom Crayford
>Priority: Major
>  Labels: reliability
> Fix For: 1.1.0
>
>
> If you enable {{auto.leader.rebalance.enable}} (which is on by default), and 
> you have a cluster with many partitions, there is a severe amount of 
> replication downtime following a restart. This causes 
> `UnderReplicatedPartitions` to fire, and replication is paused.
> This is because the current automated leader rebalance mechanism changes 
> leaders for *all* imbalanced partitions at once, instead of doing it 
> gradually. This effectively stops all replica fetchers in the cluster 
> (assuming there are enough imbalanced partitions), and restarts them. This 
> can take minutes on busy clusters, during which no replication is happening 
> and user data is at risk. Clients with {{acks=-1}} also see issues at this 
> time, because replication is effectively stalled.
> To quote Todd Palino from the mailing list:
> bq. There is an admin CLI command to trigger the preferred replica election 
> manually. There is also a broker configuration “auto.leader.rebalance.enable” 
> which you can set to have the broker automatically perform the PLE when 
> needed. DO NOT USE THIS OPTION. There are serious performance issues when 
> doing so, especially on larger clusters. It needs some development work that 
> has not been fully identified yet.
> This setting is extremely useful for smaller clusters, but with high 
> partition counts causes the huge issues stated above.
> One potential fix could be adding a new configuration for the number of 
> partitions to do automated leader rebalancing for at once, and *stop* once 
> that number of leader rebalances are in flight, until they're done. There may 
> be better mechanisms, and I'd love to hear if anybody has any ideas.
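
As an editorial illustration of the batching fix proposed above (a rough sketch
against assumed interfaces, not the actual controller implementation):

{code:java}
import java.util.ArrayList;
import java.util.List;

// Cap the number of in-flight preferred-leader elections and wait for each
// batch to complete before starting the next, instead of moving leadership
// for all imbalanced partitions at once.
public class ThrottledPreferredLeaderElection {

    public interface Controller {
        void electPreferredLeader(String topicPartition);     // async leader change
        boolean isLeaderMoveComplete(String topicPartition);  // poll for completion
    }

    public static void rebalance(final Controller controller,
                                 final List<String> imbalancedPartitions,
                                 final int maxInFlight) throws InterruptedException {
        final List<String> inFlight = new ArrayList<>();
        for (final String partition : imbalancedPartitions) {
            controller.electPreferredLeader(partition);
            inFlight.add(partition);
            if (inFlight.size() >= maxInFlight) {
                // Stop issuing new moves until the current batch has completed,
                // so replica fetchers are only disturbed for a bounded set.
                while (!inFlight.stream().allMatch(controller::isLeaderMoveComplete)) {
                    Thread.sleep(100L);
                }
                inFlight.clear();
            }
        }
    }
}
{code}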



--
This message was sent by Atlassian Jira
(v8.3.4#803005)