[jira] [Commented] (KAFKA-7209) Kafka stream does not rebalance when one node gets down

2018-08-01 Thread Matthias J. Sax (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16565772#comment-16565772
 ] 

Matthias J. Sax commented on KAFKA-7209:


About `offsets.topic.replication.factor`: if you set it to one and the 
corresponding broker goes down, you can no longer commit offsets, so you 
might want to set it to 3. Also note that the in-sync replicas config 
(`min.insync.replicas`) is important: the broker default is used, and you can 
only write to the topic if enough in-sync replicas are online. Thus, you should 
not set that config to 3, but to at most 2, to survive a single broker failure.
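
To make the arithmetic above concrete, here is a minimal sketch in Java. It 
assumes that a partition accepts writes only while the number of live in-sync 
replicas is at least `min.insync.replicas`, and that every surviving replica is 
in sync. The class and method names are hypothetical and not part of Kafka.

```java
// Hypothetical helper, not part of Kafka: models when a replicated topic
// (such as the offsets topic) stays writable after broker failures.
final class IsrMath {

    // Assumption: each failed broker hosted one replica of the partition,
    // and every surviving replica is in sync.
    static boolean writable(int replicationFactor, int minInsyncReplicas, int failedBrokers) {
        int liveReplicas = Math.max(0, replicationFactor - failedBrokers);
        return liveReplicas >= minInsyncReplicas;
    }

    // Largest number of broker failures after which writes still succeed.
    static int toleratedFailures(int replicationFactor, int minInsyncReplicas) {
        return Math.max(0, replicationFactor - minInsyncReplicas);
    }

    public static void main(String[] args) {
        System.out.println("RF=3, min ISR=2 tolerates " + toleratedFailures(3, 2));
        System.out.println("RF=3, min ISR=3 tolerates " + toleratedFailures(3, 3));
        System.out.println("RF=1, one broker down, writable: " + writable(1, 1, 1));
    }
}
```

Under these assumptions, replication factor 3 with `min.insync.replicas` of 2 
tolerates one broker failure, while a value of 3 tolerates none.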

For the `transaction.state.log.XXX` configs: as long as you don't use 
exactly-once, you can ignore those settings.

For the failure scenarios: can you provide DEBUG logs for the brokers and the 
Streams application so we can dig into it? For the first scenario, after the 
rebalance, the state directories should be created, but we will need Streams 
DEBUG logs to see. For scenario (2) there should not be any data loss – we 
might need Streams and broker logs to dig into it.

For a clean restart with the same application.id, you should check out the 
application reset tool: 
[https://kafka.apache.org/20/documentation/streams/developer-guide/app-reset-tool.html]

Btw: you report this error for 0.11.0.1, and 0.11.0.3 was released recently. I 
would highly recommend upgrading to 0.11.0.3 and checking whether the issue is 
still there; it contains many bug fixes and the issue might be resolved already.

> Kafka stream does not rebalance when one node gets down
> ---
>
> Key: KAFKA-7209
> URL: https://issues.apache.org/jira/browse/KAFKA-7209
> Project: Kafka
> Issue Type: Bug
> Components: streams
> Affects Versions: 0.11.0.1
> Reporter: Yogesh BG
> Priority: Critical
>
> I use Kafka Streams 0.11.0.1, session timeout 6, retries set to int.max, and 
> the default backoff time.
>  
> I have 3 nodes running a Kafka cluster of 3 brokers,
> and I am running 3 Kafka Streams instances with the same application.id.
> Each node has one broker and one Kafka Streams application.
> Everything works fine during setup.
> I bring down one node, so one Kafka broker and one streaming app are down.
> Now I see exceptions in the other two streaming apps, and they never get 
> rebalanced; I waited for hours and they never came back to normal.
> Is there anything I am missing?
> I also tried, when one broker is down, calling stream.close and cleanup 
> and restarting; this also doesn't help.
> Can anyone help me?
>  
> One thing I observed lately is that Kafka topics with one partition get 
> reassigned, but I have topics with 16 partitions and replication factor 3. 
> They never settle.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-7209) Kafka stream does not rebalance when one node gets down

2018-07-27 Thread Yogesh BG (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16560143#comment-16560143
 ] 

Yogesh BG commented on KAFKA-7209:
--

Even with offsets.topic.replication.factor set to 1, I still receive something 
like the log below:

{{ Received GroupCoordinator response 
ClientResponse(receivedTimeMs=1532716985393, latencyMs=15, disconnected=false, 
requestHeader=\{api_key=10,api_version=1,correlation_id=6157,client_id=ks_0_inst_THUNDER_METRICS-StreamThread-37-consumer},
 responseBody=FindCoordinatorResponse(throttleTimeMs=0, errorMessage='null', 
error=COORDINATOR_NOT_AVAILABLE, node=:-1 (id: -1 rack: null))) for group 
aggregation-framework030_THUNDER_METRICS}}

{{18:43:05.394 
[ks_0_inst_THUNDER_LOG_L4-StreamThread-70] DEBUG 
o.a.k.c.consumer.internals.Fetcher - Fetch READ_UNCOMMITTED at offset 0 for 
partition THUNDER_LOG_L4_PC-17 returned fetch data (error=NONE, 
highWaterMark=0, lastStableOffset = -1, logStartOffset = 0, abortedTransactions 
= null, recordsSizeInBytes=0)}}
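
One possible reading of COORDINATOR_NOT_AVAILABLE in this setup: the 
coordinator of a group is the broker that leads one partition of the internal 
offsets topic, so with a replication factor of 1 no other replica can take 
over when that broker is down. The Java sketch below shows how a group id could 
map to such a partition. The formula is an assumption about broker behavior, 
not something confirmed in this thread, and the class and method names are 
hypothetical.

```java
// Hypothetical illustration, not Kafka source code.
final class CoordinatorPartition {

    // Assumption: the broker maps a group id to a partition of the internal
    // offsets topic as abs(hashCode) % partitionCount, treating
    // Integer.MIN_VALUE as 0 so the result is never negative.
    static int partitionFor(String groupId, int offsetsTopicPartitions) {
        int hash = groupId.hashCode();
        int positive = (hash == Integer.MIN_VALUE) ? 0 : Math.abs(hash);
        return positive % offsetsTopicPartitions;
    }

    public static void main(String[] args) {
        // Group id taken from the log above; 50 is the broker default for
        // offsets.topic.num.partitions.
        String group = "aggregation-framework030_THUNDER_METRICS";
        System.out.println(group + " -> offsets partition " + partitionFor(group, 50));
    }
}
```

If the leader of the printed partition is the broker that went down and the 
topic has only one replica, the group would have no coordinator until that 
broker returns.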



[jira] [Commented] (KAFKA-7209) Kafka stream does not rebalance when one node gets down

2018-07-27 Thread Yogesh BG (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16559935#comment-16559935
 ] 

Yogesh BG commented on KAFKA-7209:
--

Hi, can you suggest anything I am missing? We are blocked on our product 
release due to this bug. Is there any way I can safely clean up the Kafka 
Streams instances and restart them with the same application.id? Some amount of 
data loss during this process is fine.

Alternatively, a confirmation that this is a bug in the streaming app would 
help me decide which alternative restart process to use.



[jira] [Commented] (KAFKA-7209) Kafka stream does not rebalance when one node gets down

2018-07-27 Thread Yogesh BG (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16559492#comment-16559492
 ] 

Yogesh BG commented on KAFKA-7209:
--

I tried setting these configurations, but no luck:

 

{{conf.put("retries",Integer.MAX_VALUE);}}

{{conf.put("rebalance.max.retries",Integer.MAX_VALUE);}}

{{conf.put("zookeeper.session.timeout.ms",1000);}}
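
For comparison, here is a hedged Java sketch of settings that a Streams 
application itself reads. As far as I know, `rebalance.max.retries` and 
`zookeeper.session.timeout.ms` belong to the old ZooKeeper-based consumer, so a 
Streams application on 0.11 would probably ignore them. Plain string keys are 
used so the example needs no Kafka dependency; the application id, host names, 
and all values are placeholders, not recommendations.

```java
import java.util.Properties;

// Hypothetical sketch: the class name and every value are placeholders.
final class StreamsFailoverConfig {

    static Properties build() {
        Properties conf = new Properties();
        conf.put("application.id", "my-streams-app");   // placeholder id
        conf.put("bootstrap.servers",
                "broker1:9092,broker2:9092,broker3:9092"); // placeholder hosts
        // Replication factor for the internal topics that Streams creates.
        conf.put("replication.factor", 3);
        // Producer retries, passed through with the "producer." prefix.
        conf.put("producer.retries", Integer.MAX_VALUE);
        // Consumer session timeout, passed through with the "consumer." prefix.
        conf.put("consumer.session.timeout.ms", 30000);
        // Deliberately not set: "rebalance.max.retries" and
        // "zookeeper.session.timeout.ms" (old ZooKeeper-based consumer).
        return conf;
    }

    public static void main(String[] args) {
        build().forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

The resulting `Properties` object could be passed to the Streams configuration 
in place of the `conf` map above.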



[jira] [Commented] (KAFKA-7209) Kafka stream does not rebalance when one node gets down

2018-07-26 Thread Yogesh BG (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16558874#comment-16558874
 ] 

Yogesh BG commented on KAFKA-7209:
--

Here are my observations.

3 brokers and 3 Streams apps: initially working fine.

Kill one app: the group rebalances and streaming resumes without data loss.

I could see the logs below:

{{20:15:26.627 [ks_0_inst_CSV_LOG-StreamThread-22] INFO  o.a.k.c.c.i.ConsumerCoordinator - Setting newly assigned partitions [PR-35, PR_vThunder-27, PR-27, PR_vThunder-35] for group aggregation-framework03_CSV_LOG}}

{{20:15:26.627 [ks_0_inst_CSV_LOG-StreamThread-20] INFO  o.a.k.s.p.internals.StreamThread - stream-thread [ks_0_inst_CSV_LOG-StreamThread-20] State transition from PARTITIONS_REVOKED to PARTITIONS_ASSIGNED.}}

{{20:16:32.174 [ks_0_inst_THUNDER_LOG_L7-StreamThread-90] INFO  o.a.k.c.c.i.ConsumerCoordinator - Setting newly assigned partitions [THUNDER_LOG_L7_PR-15, THUNDER_LOG_L7_PE-15] for group aggregation-framework03_THUNDER_LOG_L7}}

{{20:16:32.175 [ks_0_inst_THUNDER_LOG_L7-StreamThread-86] INFO  o.a.k.c.c.i.ConsumerCoordinator - Setting newly assigned partitions [THUNDER_LOG_L7_PR-9, THUNDER_LOG_L7_PE-9] for group aggregation-framework03_THUNDER_LOG_L7}}

{{20:16:32.175 [ks_0_inst_THUNDER_LOG_L7-StreamThread-85] INFO  o.a.k.c.c.i.ConsumerCoordinator - Setting newly assigned partitions [THUNDER_LOG_L7_PE-35, THUNDER_LOG_L7_PR-27, THUNDER_LOG_L7_PE-27, THUNDER_LOG_L7_PR-35] for group aggregation-framework03_THUNDER_LOG_L7}}

 

But what I don't understand is that when I look into the state dir, I don't see 
partition folders being created for the newly assigned partitions.

Below is the initial state before I killed the first one [rtp-worker-2]; for 
the other two it remains the same and does not change at all.

 

{{[root@rtp-worker-2 /]# ls /tmp/data/kstreams/aggregation-framework_THUNDER_LOG_L7/}}
{{0_0  0_10  0_12  0_14  0_16  0_18  0_2   0_21  0_23  0_25  0_27  0_29  0_30  0_32  0_34  0_4  0_6  0_8}}
{{0_1  0_11  0_13  0_15  0_17  0_19  0_20  0_22  0_24  0_26  0_28  0_3   0_31  0_33  0_35  0_5  0_7  0_9}}

{{[root@rtp-worker-0 /]# ls /tmp/data/kstreams/aggregation-framework_THUNDER_LOG_L7/}}
{{0_0  0_1  0_10  0_11  0_2  0_3  0_4  0_5  0_6  0_7  0_8  0_9}}

{{[root@rtp-worker-1 /]# ls /tmp/data/kstreams/aggregation-framework_THUNDER_LOG_L7/}}
{{0_11  0_12  0_13  0_14  0_15  0_16  0_17  0_18  0_19  0_20  0_21  0_22  0_23}}
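
To compare such listings against the partitions reported in the rebalance logs, 
a small diagnostic along the following lines might help. It is a sketch in 
Java, assuming that task directories are named `<subTopologyId>_<partition>` as 
in the listings above; the class and method names are hypothetical and not part 
of Kafka Streams.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical diagnostic helper, not part of Kafka Streams.
final class TaskDirs {

    // Assumption: task directories are named <subTopologyId>_<partition>,
    // as in the listings above (0_0, 0_1, ...).
    private static final Pattern TASK_DIR = Pattern.compile("(\\d+)_(\\d+)");

    // Returns the partition numbers that have a task directory under stateDir.
    // Partitions of different sub-topologies are merged into one set.
    static Set<Integer> partitionsWithState(Path stateDir) throws IOException {
        Set<Integer> partitions = new TreeSet<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(stateDir)) {
            for (Path entry : entries) {
                Matcher m = TASK_DIR.matcher(entry.getFileName().toString());
                if (Files.isDirectory(entry) && m.matches()) {
                    partitions.add(Integer.parseInt(m.group(2)));
                }
            }
        }
        return partitions;
    }

    public static void main(String[] args) throws IOException {
        // Default path taken from the listings above; pass another as args[0].
        // Fails with an IOException if the directory does not exist.
        Path dir = Paths.get(args.length > 0 ? args[0]
                : "/tmp/data/kstreams/aggregation-framework_THUNDER_LOG_L7");
        System.out.println(partitionsWithState(dir));
    }
}
```

Running it on each node before and after the rebalance would show whether any 
new task directories appeared.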

 

Another case: all 3 apps are running successfully and I bring down one broker. 
The brokers rebalance by themselves, and the apps also rebalance and start 
streaming data, *but data loss is observed while the broker rebalancing is 
happening. Is there a way to avoid this? Do the other two brokers become 
unresponsive while the cluster is rebalancing?*

 

*Next is when a broker and a Streams app go down at the same time: I can see 
the brokers rebalance, and I see some communication messages being received by 
the apps, but they never get back to streaming, especially for topics with 
multiple partitions. Topics that have one partition resume streaming after 
some time.*



[jira] [Commented] (KAFKA-7209) Kafka stream does not rebalance when one node gets down

2018-07-26 Thread Yogesh BG (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16558772#comment-16558772
 ] 

Yogesh BG commented on KAFKA-7209:
--

In one of the forums I saw the statement below. Does this impact anything? We 
keep the default value of 3 for these configurations.

 

what is your offsets.topic.replication.factor and 
transaction.state.log.replication.factor? If those are set to 3 and you have 
less than 3 brokers I don't imagine things will go well



[jira] [Commented] (KAFKA-7209) Kafka stream does not rebalance when one node gets down

2018-07-26 Thread Matthias J. Sax (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16558748#comment-16558748
 ] 

Matthias J. Sax commented on KAFKA-7209:


Would be good to verify. If broker-only and streams-only fail-over work, it 
seems to be a bug if "double fail-over" does not work. It's unclear, though, 
whether it's a Streams, consumer, or broker bug. Do the brokers fail over 
correctly if you kill the machine (i.e., broker and Streams at once)?



[jira] [Commented] (KAFKA-7209) Kafka stream does not rebalance when one node gets down

2018-07-26 Thread Yogesh BG (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16558731#comment-16558731
 ] 

Yogesh BG commented on KAFKA-7209:
--

Is there any manual workaround to resolve this? I think killing only the 
broker or killing only the application works.



[jira] [Commented] (KAFKA-7209) Kafka stream does not rebalance when one node gets down

2018-07-26 Thread Matthias J. Sax (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-7209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16558702#comment-16558702
 ] 

Matthias J. Sax commented on KAFKA-7209:


Not sure atm. Note that it is not recommended to run brokers and Streams 
applications on the same machine. Does it work if you only kill the broker or 
if you only kill the Streams application? Also, can you try out 0.11.0.3?
