[jira] [Commented] (KAFKA-7209) Kafka stream does not rebalance when one node gets down
[ https://issues.apache.org/jira/browse/KAFKA-7209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16565772#comment-16565772 ]

Matthias J. Sax commented on KAFKA-7209:

About `offsets.topic.replication.factor`: if you set it to 1 and the corresponding broker goes down, you can no longer commit offsets; thus, you might want to set it to 3. Also note that the minimum in-sync replicas config (`min.insync.replicas`) is important: the broker default is used, and you can only write to the topic if enough in-sync replicas are online. Thus, you should not set the in-sync replicas config to 3, but to at most 2, to survive a single broker failure.

For the `transaction.state.log.XXX` configs: as long as you don't use exactly-once, you can ignore those settings.

For the failure scenarios: can you provide DEBUG logs for the brokers and the Streams application so we can dig into it? For the first scenario, the state directories should be created after the rebalance, but we will need Streams DEBUG logs to see what happens. For scenario (2), there should not be any data loss; we might need Streams and broker logs to dig into it.

For a clean restart with the same application.id, you should check out the application reset tool: [https://kafka.apache.org/20/documentation/streams/developer-guide/app-reset-tool.html]

Btw: you report this error for 0.11.0.1, and 0.11.0.3 was released recently. I would highly recommend upgrading to 0.11.0.3 and checking whether the issue is still there; there are many bug fixes and the issue might be resolved already.
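The arithmetic behind the replication advice in this comment can be sketched as follows. This is a minimal illustration, not Kafka code: `ReplicationMath` and `tolerableBrokerFailures` are hypothetical names, and the sketch assumes writes that require the full in-sync set (acks=all) and replicas placed on distinct brokers.

```java
// Hypothetical helper, not part of Kafka: it only illustrates the arithmetic.
final class ReplicationMath {
    private ReplicationMath() {}

    /**
     * Number of brokers holding replicas of a partition that may be offline
     * while the partition still has at least minInsyncReplicas replicas left.
     */
    static int tolerableBrokerFailures(int replicationFactor, int minInsyncReplicas) {
        if (minInsyncReplicas < 1 || minInsyncReplicas > replicationFactor) {
            throw new IllegalArgumentException(
                "need 1 <= minInsyncReplicas <= replicationFactor");
        }
        return replicationFactor - minInsyncReplicas;
    }

    public static void main(String[] args) {
        // {replication factor, min in-sync replicas}
        int[][] cases = {{1, 1}, {3, 3}, {3, 2}};
        for (int[] c : cases) {
            System.out.printf(
                "replication factor %d, min in-sync replicas %d: tolerates %d broker failure(s)%n",
                c[0], c[1], tolerableBrokerFailures(c[0], c[1]));
        }
    }
}
```

With a replication factor of 3, requiring 3 in-sync replicas leaves no room for a failure, while requiring 2 leaves room for exactly one, which is the point made in the comment.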
> Kafka stream does not rebalance when one node gets down
> -------------------------------------------------------
>
>                 Key: KAFKA-7209
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7209
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 0.11.0.1
>            Reporter: Yogesh BG
>            Priority: Critical
>
> I use Kafka Streams 0.11.0.1, with session timeout 6, retries set to int.max, and the default backoff time.
>
> I have 3 nodes running a Kafka cluster of 3 brokers, and I am running 3 Kafka Streams instances with the same application.id. Each node has one broker and one Kafka Streams application.
> Everything works fine during setup.
> I bring down one node, so one Kafka broker and one streaming app are down.
> Now I see exceptions in the other two streaming apps, and the group never gets rebalanced. I waited for hours and it never came back to normal.
> Is there anything I am missing?
> I also tried calling stream.close and cleanup and then restarting when one broker is down; this doesn't help either.
> Can anyone help me?
>
> One thing I observed lately is that Kafka topics with one partition get reassigned, but I have topics with 16 partitions and replication factor 3, and those never settle.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (KAFKA-7209) Kafka stream does not rebalance when one node gets down
[ https://issues.apache.org/jira/browse/KAFKA-7209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16560143#comment-16560143 ]

Yogesh BG commented on KAFKA-7209:

Even with offsets.topic.replication.factor set to 1, I receive something like the following:

{{Received GroupCoordinator response ClientResponse(receivedTimeMs=1532716985393, latencyMs=15, disconnected=false, requestHeader={api_key=10,api_version=1,correlation_id=6157,client_id=ks_0_inst_THUNDER_METRICS-StreamThread-37-consumer}, responseBody=FindCoordinatorResponse(throttleTimeMs=0, errorMessage='null', error=COORDINATOR_NOT_AVAILABLE, node=:-1 (id: -1 rack: null))) for group aggregation-framework030_THUNDER_METRICS}}
{{18:43:05.394 [ks_0_inst_THUNDER_LOG_L4-StreamThread-70] DEBUG o.a.k.c.consumer.internals.Fetcher - Fetch READ_UNCOMMITTED at offset 0 for partition THUNDER_LOG_L4_PC-17 returned fetch data (error=NONE, highWaterMark=0, lastStableOffset = -1, logStartOffset = 0, abortedTransactions = null, recordsSizeInBytes=0)}}
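A note on the COORDINATOR_NOT_AVAILABLE error in this comment: the coordinator for a consumer group is the broker leading the partition of the internal __consumer_offsets topic that the group id maps to, so with a single replica that partition has no leader while its broker is down. Also, offsets.topic.replication.factor only takes effect when the offsets topic is first created. The sketch below shows the group-to-partition mapping as I understand it (hash of the group id modulo the partition count); the class and method names are hypothetical, and 50 is the broker default for offsets.topic.num.partitions.

```java
// Sketch only: mirrors, to my understanding, how a group id is mapped to a
// partition of the internal __consumer_offsets topic. Not Kafka's own code.
final class OffsetsPartitionSketch {
    private OffsetsPartitionSketch() {}

    static int partitionFor(String groupId, int offsetsTopicPartitions) {
        int hash = groupId.hashCode();
        // Math.abs(Integer.MIN_VALUE) is still negative, so handle it explicitly.
        int nonNegative = (hash == Integer.MIN_VALUE) ? 0 : Math.abs(hash);
        return nonNegative % offsetsTopicPartitions;
    }

    public static void main(String[] args) {
        // Group name taken from the log line in the comment above.
        String group = "aggregation-framework030_THUNDER_METRICS";
        int partitions = 50; // broker default for offsets.topic.num.partitions
        System.out.println(group + " maps to __consumer_offsets-"
            + partitionFor(group, partitions));
    }
}
```

Checking which broker leads that partition (and how many replicas it has) would show whether the group's coordinator can move to a surviving broker.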
[jira] [Commented] (KAFKA-7209) Kafka stream does not rebalance when one node gets down
[ https://issues.apache.org/jira/browse/KAFKA-7209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16559935#comment-16559935 ]

Yogesh BG commented on KAFKA-7209:

Hi, can you suggest anything I am missing? We are blocked on our product release due to this bug. Is there any way I can safely clean up the Kafka Streams applications and restart them with the same application.id? Some amount of data loss during this process is acceptable. Alternatively, a confirmation that this is a bug in the streaming app would help me decide which alternative restart process to use.
[jira] [Commented] (KAFKA-7209) Kafka stream does not rebalance when one node gets down
[ https://issues.apache.org/jira/browse/KAFKA-7209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16559492#comment-16559492 ]

Yogesh BG commented on KAFKA-7209:

I tried setting these configurations, but no luck:

{{conf.put("retries",Integer.MAX_VALUE);}}
{{conf.put("rebalance.max.retries",Integer.MAX_VALUE);}}
{{conf.put("zookeeper.session.timeout.ms",1000);}}
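For context on the settings tried in this comment: to my knowledge, rebalance.max.retries and zookeeper.session.timeout.ms belong to the old ZooKeeper-based consumer, while Streams 0.11 uses the newer consumer, whose liveness timeout is session.timeout.ms. The sketch below builds a configuration with only client-side keys. It is an illustration under assumptions: the class name, the application id, and the broker addresses are made up, the session timeout value is illustrative rather than the reporter's, and keys are plain strings to avoid a Kafka dependency.

```java
import java.util.Properties;

// Illustrative sketch: only java.util.Properties is used, no Kafka classes.
final class StreamsClientSettingsSketch {
    private StreamsClientSettingsSketch() {}

    static Properties build() {
        Properties conf = new Properties();
        // Hypothetical values, not taken from the report.
        conf.put("application.id", "example-streams-app");
        conf.put("bootstrap.servers", "broker-0:9092,broker-1:9092,broker-2:9092");
        // As in the report: retry sends without giving up.
        conf.put("retries", Integer.MAX_VALUE);
        // Consumer session timeout; illustrative value only.
        conf.put("session.timeout.ms", 10000);
        // retry.backoff.ms is left unset so the client default applies.
        return conf;
    }

    public static void main(String[] args) {
        build().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

In a real application the same keys would normally come from the StreamsConfig and ConsumerConfig constants rather than string literals.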
[jira] [Commented] (KAFKA-7209) Kafka stream does not rebalance when one node gets down
[ https://issues.apache.org/jira/browse/KAFKA-7209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16558874#comment-16558874 ]

Yogesh BG commented on KAFKA-7209:

Here are my observations.

3 brokers and 3 Streams apps: initially working fine. When I kill one app, the group gets rebalanced and streaming resumes without data loss. I can see the logs below:

{{20:15:26.627 [ks_0_inst_CSV_LOG-StreamThread-22] INFO o.a.k.c.c.i.ConsumerCoordinator - Setting newly assigned partitions [PR-35, PR_vThunder-27, PR-27, PR_vThunder-35] for group aggregation-framework03_CSV_LOG}}
{{20:15:26.627 [ks_0_inst_CSV_LOG-StreamThread-20] INFO o.a.k.s.p.internals.StreamThread - stream-thread [ks_0_inst_CSV_LOG-StreamThread-20] State transition from PARTITIONS_REVOKED to PARTITIONS_ASSIGNED.}}
{{20:16:32.174 [ks_0_inst_THUNDER_LOG_L7-StreamThread-90] INFO o.a.k.c.c.i.ConsumerCoordinator - Setting newly assigned partitions [THUNDER_LOG_L7_PR-15, THUNDER_LOG_L7_PE-15] for group aggregation-framework03_THUNDER_LOG_L7}}
{{20:16:32.175 [ks_0_inst_THUNDER_LOG_L7-StreamThread-86] INFO o.a.k.c.c.i.ConsumerCoordinator - Setting newly assigned partitions [THUNDER_LOG_L7_PR-9, THUNDER_LOG_L7_PE-9] for group aggregation-framework03_THUNDER_LOG_L7}}
{{20:16:32.175 [ks_0_inst_THUNDER_LOG_L7-StreamThread-85] INFO o.a.k.c.c.i.ConsumerCoordinator - Setting newly assigned partitions [THUNDER_LOG_L7_PE-35, THUNDER_LOG_L7_PR-27, THUNDER_LOG_L7_PE-27, THUNDER_LOG_L7_PR-35] for group aggregation-framework03_THUNDER_LOG_L7}}

What I don't understand is that when I look into the state dir, I don't see partition folders being created for the newly assigned partitions. Below is the initial state before I killed the first one [rtp-worker-2]; for the other two it remains the same and does not change at all:

{{[root@rtp-worker-2 /]# ls /tmp/data/kstreams/aggregation-framework_THUNDER_LOG_L7/}}
{{0_0 0_10 0_12 0_14 0_16 0_18 0_2 0_21 0_23 0_25 0_27 0_29 0_30 0_32 0_34 0_4 0_6 0_8}}
{{0_1 0_11 0_13 0_15 0_17 0_19 0_20 0_22 0_24 0_26 0_28 0_3 0_31 0_33 0_35 0_5 0_7 0_9}}
{{[root@rtp-worker-0 /]# ls /tmp/data/kstreams/aggregation-framework_THUNDER_LOG_L7/}}
{{0_0 0_1 0_10 0_11 0_2 0_3 0_4 0_5 0_6 0_7 0_8 0_9}}
{{[root@rtp-worker-1 /]# ls /tmp/data/kstreams/aggregation-framework_THUNDER_LOG_L7/}}
{{0_11 0_12 0_13 0_14 0_15 0_16 0_17 0_18 0_19 0_20 0_21 0_22 0_23}}

Another case: all 3 apps are running successfully and I bring down one broker. The brokers rebalance themselves, and the apps also rebalance and start streaming data, *but data loss is observed while the broker rebalance is happening. Is there a way to avoid this? Do the other two brokers become unresponsive while the cluster is rebalancing?*

*The next case is when a broker and a Streams app go down at the same time: I can see the brokers rebalance and some communication messages being received by the apps, but the apps never get back to streaming, especially for topics with multiple partitions. Topics with a single partition resume streaming after some time.*
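To compare state directories before and after a rebalance more reliably than by eye, the listing in this comment can be scripted. Streams keeps one sub-directory per task under the state dir and application.id, named like `0_17`. The sketch below lists those sub-directories; `TaskDirLister` is a hypothetical helper, and the path passed to it is whatever directory is being inspected. One thing worth checking first: the log lines show the group as aggregation-framework03_THUNDER_LOG_L7, while the listed directory is aggregation-framework_THUNDER_LOG_L7, so the listing may be for a different application.id (or this may just be a transcription difference).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for inspecting a Streams state directory.
final class TaskDirLister {
    private TaskDirLister() {}

    /** Returns the sorted names of sub-directories that look like task ids, e.g. "0_17". */
    static List<String> taskDirs(Path appStateDir) throws IOException {
        try (Stream<Path> children = Files.list(appStateDir)) {
            return children
                .filter(Files::isDirectory)
                .map(path -> path.getFileName().toString())
                .filter(name -> name.matches("\\d+_\\d+"))
                .sorted()
                .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: TaskDirLister <state.dir>/<application.id>");
            return;
        }
        taskDirs(Paths.get(args[0])).forEach(System.out::println);
    }
}
```

Running it on each surviving node before and after killing an instance, and diffing the output, shows whether directories for the newly assigned tasks appear.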
[jira] [Commented] (KAFKA-7209) Kafka stream does not rebalance when one node gets down
[ https://issues.apache.org/jira/browse/KAFKA-7209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16558772#comment-16558772 ]

Yogesh BG commented on KAFKA-7209:

In one of the forums I saw the statement below. Does this affect anything? We keep the default value of 3 for these configurations.

"What is your offsets.topic.replication.factor and transaction.state.log.replication.factor? If those are set to 3 and you have less than 3 brokers I don't imagine things will go well."
[jira] [Commented] (KAFKA-7209) Kafka stream does not rebalance when one node gets down
[ https://issues.apache.org/jira/browse/KAFKA-7209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16558748#comment-16558748 ]

Matthias J. Sax commented on KAFKA-7209:

Would be good to verify. If broker-only and streams-only fail-over works, it seems to be a bug if "double fail-over" does not work. It's unclear, though, whether it's a Streams, consumer, or broker bug. Do the brokers fail over correctly if you kill the machine (i.e., broker and Streams at once)?
[jira] [Commented] (KAFKA-7209) Kafka stream does not rebalance when one node gets down
[ https://issues.apache.org/jira/browse/KAFKA-7209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16558731#comment-16558731 ]

Yogesh BG commented on KAFKA-7209:

Is there any manual workaround to resolve this? I think killing only the broker or killing only the application works.
[jira] [Commented] (KAFKA-7209) Kafka stream does not rebalance when one node gets down
[ https://issues.apache.org/jira/browse/KAFKA-7209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16558702#comment-16558702 ]

Matthias J. Sax commented on KAFKA-7209:

Not sure atm. Note that it is not recommended to run brokers and Streams applications on the same machine. Does it work if you only kill the broker or if you only kill the Streams application? Also, can you try out 0.11.0.3?