[jira] [Commented] (KAFKA-6249) Interactive query downtime when node goes down even with standby replicas

2017-11-28 Thread Matthias J. Sax (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16268997#comment-16268997
 ] 

Matthias J. Sax commented on KAFKA-6249:


Thanks. I understand what you are saying. This is related to another issue, 
KAFKA-6145: your fail-over behaves the same way as the scale-out scenario, for 
which StandbyTasks don't help.

Will close this as duplicate.

> Interactive query downtime when node goes down even with standby replicas
> -
>
> Key: KAFKA-6249
> URL: https://issues.apache.org/jira/browse/KAFKA-6249
> Project: Kafka
> Issue Type: Bug
> Components: streams
> Affects Versions: 1.0.0
> Reporter: Charles Crain
>
> In a multi-node Kafka Streams application that uses interactive queries, the 
> queryable store will become unavailable (throw InvalidStateStoreException) 
> for up to several minutes when a node goes down.  This happens regardless of 
> how many nodes are in the application as well as how many standby replicas 
> are configured.
> My expectation is that, if a standby replica is present, the interactive 
> query would fail over to the live replica immediately, causing negligible 
> downtime for interactive queries.  Instead, what appears to happen is that 
> the queryable store is down for however long it takes for the nodes to 
> completely rebalance (this takes a few minutes for a couple GB of total data 
> in the queryable store's backing topic).
> I am filing this as a bug, realizing that it may in fact be a feature 
> request.  However, until there is a way we can use interactive queries with 
> minimal (~zero) downtime on node failure, we are having to entertain other 
> strategies for serving queries (e.g. manually materializing the topic to an 
> external resilient store such as Cassandra) in order to meet our SLAs.
> If there is a way to minimize the downtime of interactive queries on node 
> failure that I am missing, I would like to know what it is.
> Our team is super-enthusiastic about Kafka Streams and we're keen to use it 
> for just about everything!  This is our only major roadblock.
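
For illustration, a caller could ride out the unavailable window described above by retrying the lookup until the store is queryable again. The following is only a minimal sketch under stated assumptions: `StoreUnavailableException` is a stand-in for Kafka Streams' `InvalidStateStoreException` so that the example compiles without Kafka on the classpath, and `queryWithRetry`, the attempt count, and the back-off are hypothetical names and values. Retrying does not shorten the downtime; it only hides short gaps from the client.

```java
import java.util.function.Supplier;

final class StoreRetrySketch {

    /** Stand-in for InvalidStateStoreException (assumption: keeps the sketch Kafka-free). */
    static final class StoreUnavailableException extends RuntimeException {
        StoreUnavailableException(String message) {
            super(message);
        }
    }

    /**
     * Runs the query, retrying with a fixed back-off while the store is unavailable.
     * Rethrows the last StoreUnavailableException once maxAttempts is exhausted.
     */
    static <T> T queryWithRetry(Supplier<T> query, int maxAttempts, long backoffMs) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        StoreUnavailableException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return query.get();
            } catch (StoreUnavailableException e) {
                last = e;
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(backoffMs);
                    } catch (InterruptedException ie) {
                        // Preserve the interrupt flag and give up early.
                        Thread.currentThread().interrupt();
                        throw new IllegalStateException("interrupted while waiting for store", ie);
                    }
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) {
        // Simulated store: unavailable for the first two calls, as if a rebalance were in progress.
        int[] calls = {0};
        String value = queryWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new StoreUnavailableException("store is rebalancing");
            }
            return "value-for-key";
        }, 5, 10L);
        System.out.println(value + " (attempts: " + calls[0] + ")");
    }
}
```

In a real application the supplier would wrap the store lookup and the catch clause would name `InvalidStateStoreException`. With a rebalance that lasts minutes, as reported here, the retry budget would be exhausted long before the store returns.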



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6249) Interactive query downtime when node goes down even with standby replicas

2017-11-28 Thread Charles Crain (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16268753#comment-16268753
 ] 

Charles Crain commented on KAFKA-6249:
--

It is possible I am seeing the delay due to the new instance.  We use 
Kubernetes as our cluster manager, so if a node crashes, one is immediately 
brought back up to replace it.  So our actual failure scenario is one node 
leaving, then another joining shortly afterwards.  During this time, the 
cluster spends a long time in the rebalancing state, and some or all partitions 
of data are unavailable even when standby replicas are configured.

If KAFKA-6144 relates to querying standby replicas, and will also allow 
querying of stale data (i.e. data from replicas that are in the process of 
rebalancing but have not finished yet), then yes, I think you can close this 
as a duplicate.

Ultimately our desired behavior is: as long as the number of offline nodes 
during a rebalance is never greater than the number of standby replicas, then 
the data should always be present on at least one node, and therefore queries 
should continue to work.



[jira] [Commented] (KAFKA-6249) Interactive query downtime when node goes down even with standby replicas

2017-11-22 Thread Matthias J. Sax (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16263424#comment-16263424
 ] 

Matthias J. Sax commented on KAFKA-6249:


At the moment, it is not possible to query StandbyTasks -- for this reason, the 
metadata is not exposed. KAFKA-6144 is exactly about making StandbyTasks 
queryable (the JIRA title is a little bit obscure, but exposing the metadata 
implies making StandbyTasks queryable -- otherwise, exposing the metadata would 
not make any sense).

So your observation/experiment makes total sense. Can we close this JIRA as 
a duplicate of KAFKA-6144 then?

I am still wondering why there is such a long period during which you cannot 
query -- in a failure scenario, StandbyTasks should become active tasks and 
take over -- thus, downtime for IQ should be very short. (Note: this does not 
apply to scale-out scenarios -- newly added instances are not queryable for 
some time, as they start with "nothing" and thus need to rebuild state, which 
can take some time. This is a known issue and will be addressed in the future, 
though.)



[jira] [Commented] (KAFKA-6249) Interactive query downtime when node goes down even with standby replicas

2017-11-22 Thread Charles Crain (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16263231#comment-16263231
 ] 

Charles Crain commented on KAFKA-6249:
--

KAFKA-6144 is definitely related.  Solving it may solve my issue as long as 
stale data includes data from standby replicas.

Question: is it possible to query a standby replica of a state store?  Let me 
elaborate: I did an experiment where I ran 3 replicas of a Kafka Streams app, 
and I printed out the results of querying a particular key from a state store 
on all 3.  As expected, with zero standby replicas, 1 replica returned the data 
while the other 2 returned null.

However when I set the standby replicas config to 1, this still happened.  I 
would have naively expected 2 of the 3 replicas to return valid data for a 
particular key.  Perhaps this is intended behavior, i.e. the standby replica is 
"hidden" somehow until it is made live.  But, it would be very useful if the 
replica of the state store data were able to be queried somehow.
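
For reference, the experiment above corresponds roughly to the settings sketched below. `num.standby.replicas` and `application.server` are the documented Kafka Streams configuration keys; the application id, broker address, and host:port values are hypothetical placeholders, and plain `java.util.Properties` is used so that the sketch compiles without Kafka on the classpath.

```java
import java.util.Properties;

final class StandbyConfigSketch {

    /** Builds Streams properties with the given number of standby replicas per state store. */
    static Properties streamsProperties(int standbyReplicas, String hostAndPort) {
        Properties props = new Properties();
        props.setProperty("application.id", "iq-demo-app");        // hypothetical application id
        props.setProperty("bootstrap.servers", "localhost:9092");  // hypothetical broker address
        // Number of extra copies of each state store kept warm on other instances.
        props.setProperty("num.standby.replicas", Integer.toString(standbyReplicas));
        // Endpoint this instance advertises to other instances for interactive queries.
        props.setProperty("application.server", hostAndPort);
        return props;
    }

    public static void main(String[] args) {
        Properties props = streamsProperties(1, "node-a:8080");
        System.out.println("num.standby.replicas = " + props.getProperty("num.standby.replicas"));
        System.out.println("application.server = " + props.getProperty("application.server"));
    }
}
```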

In fact, it would be ideal if metadataForKey() would return all nodes that have 
data for a particular key available, including standbys.  That way, if one 
replica fails we could try another.  That, combined with KAFKA-6144, should 
allow implementation of queryable stores with zero downtime on node failure, 
as long as the number of standby replicas >= the total number of nodes that 
fail before the rebalance is complete.
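
To make the proposal concrete, here is a minimal sketch of the client-side failover I have in mind. It assumes a hypothetical lookup that returns every host holding a key (active first, then standbys), which metadataForKey() does not provide today. The names `queryFirstAvailable`, `HostUnavailableException`, and the `remoteGet` callback are all made up for illustration and are not part of Kafka Streams.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.BiFunction;

final class StandbyFailoverSketch {

    /** Hypothetical signal that a host could not serve the query (down or still rebalancing). */
    static final class HostUnavailableException extends RuntimeException {
        HostUnavailableException(String host) {
            super("host unavailable: " + host);
        }
    }

    /**
     * Tries each candidate host in order and returns the answer from the first one that responds.
     * An empty Optional means a host responded but holds no value for the key.
     * Throws HostUnavailableException if every candidate fails.
     */
    static <V> Optional<V> queryFirstAvailable(List<String> candidates,
                                               String key,
                                               BiFunction<String, String, V> remoteGet) {
        HostUnavailableException last = null;
        for (String host : candidates) {
            try {
                return Optional.ofNullable(remoteGet.apply(host, key));
            } catch (HostUnavailableException e) {
                last = e; // remember the failure and fall through to the next replica
            }
        }
        throw last != null ? last : new HostUnavailableException("no candidates");
    }

    public static void main(String[] args) {
        // Hypothetical candidates: the active host is down, the standby still answers.
        List<String> hosts = Arrays.asList("node-a:8080", "node-b:8080");
        Optional<String> result = queryFirstAvailable(hosts, "some-key", (host, key) -> {
            if (host.equals("node-a:8080")) {
                throw new HostUnavailableException(host);
            }
            return "value-from-" + host;
        });
        System.out.println(result.orElse("<no value>"));
    }
}
```

With N standby replicas the candidate list would hold N + 1 hosts, so the lookup succeeds as long as at most N of them are down at once, which is the condition stated above.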



[jira] [Commented] (KAFKA-6249) Interactive query downtime when node goes down even with standby replicas

2017-11-21 Thread Matthias J. Sax (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16261575#comment-16261575
 ] 

Matthias J. Sax commented on KAFKA-6249:


Thanks for giving feedback :) It seems that this is a duplicate of KAFKA-6144?

What I am wondering though is: why does it take so long to recover? If 
StandbyTasks are configured, those should read the changelog topic and be 
almost up-to-date with the active Task. Thus, rebalance and restore should be 
super short as the StandbyTask only needs to read the remaining tail of the 
changelog. Can you elaborate here? Would be useful to learn how this behaves 
"in the wild" -- do StandbyTasks lag behind and cannot keep up maintaining the 
hot standby stores?
