[jira] [Created] (KAFKA-13170) Flaky Test InternalTopicManagerTest.shouldRetryDeleteTopicWhenTopicUnknown

2021-08-05 Thread A. Sophie Blee-Goldman (Jira)
A. Sophie Blee-Goldman created KAFKA-13170:
--

 Summary: Flaky Test 
InternalTopicManagerTest.shouldRetryDeleteTopicWhenTopicUnknown
 Key: KAFKA-13170
 URL: https://issues.apache.org/jira/browse/KAFKA-13170
 Project: Kafka
  Issue Type: Bug
  Components: streams, unit tests
Reporter: A. Sophie Blee-Goldman


[https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-11176/2/testReport/org.apache.kafka.streams.processor.internals/InternalTopicManagerTest/Build___JDK_8_and_Scala_2_12___shouldRetryDeleteTopicWhenTopicUnknown_2/]
{code:java}
Stacktrace
java.lang.AssertionError: unexpected exception type thrown; 
expected: but 
was:
  at org.junit.Assert.assertThrows(Assert.java:1020)
  at org.junit.Assert.assertThrows(Assert.java:981)
  at 
org.apache.kafka.streams.processor.internals.InternalTopicManagerTest.shouldRetryDeleteTopicWhenRetriableException(InternalTopicManagerTest.java:526)
  at 
org.apache.kafka.streams.processor.internals.InternalTopicManagerTest.shouldRetryDeleteTopicWhenTopicUnknown(InternalTopicManagerTest.java:497)
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-13169) Flaky Test QueryableStateIntegrationTest.shouldBeAbleToQueryStateWithNonZeroSizedCache

2021-08-05 Thread A. Sophie Blee-Goldman (Jira)
A. Sophie Blee-Goldman created KAFKA-13169:
--

 Summary: Flaky Test 
QueryableStateIntegrationTest.shouldBeAbleToQueryStateWithNonZeroSizedCache
 Key: KAFKA-13169
 URL: https://issues.apache.org/jira/browse/KAFKA-13169
 Project: Kafka
  Issue Type: Bug
  Components: streams, unit tests
Reporter: A. Sophie Blee-Goldman


[https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-11176/2/testReport/org.apache.kafka.streams.processor.internals/InternalTopicManagerTest/Build___JDK_8_and_Scala_2_12___shouldRetryDeleteTopicWhenTopicUnknown/]
{code:java}
Stacktrace
java.lang.AssertionError: unexpected exception type thrown; 
expected: but 
was:
  at org.junit.Assert.assertThrows(Assert.java:1020)
  at org.junit.Assert.assertThrows(Assert.java:981)
  at 
org.apache.kafka.streams.processor.internals.InternalTopicManagerTest.shouldRetryDeleteTopicWhenRetriableException(InternalTopicManagerTest.java:526)
  at 
org.apache.kafka.streams.processor.internals.InternalTopicManagerTest.shouldRetryDeleteTopicWhenTopicUnknown(InternalTopicManagerTest.java:497)
{code}
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-13145) Renaming the time interval window for better understanding

2021-08-03 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-13145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17392615#comment-17392615
 ] 

A. Sophie Blee-Goldman commented on KAFKA-13145:


Fine with me. FWIW the "InclusiveExclusiveWindow" name was my idea, but that 
was just to avoid using something called "SessionWindow" in the _Sliding_ 
window processor – making a new SlidingWindow class works too.

> Renaming the time interval window for better understanding
> --
>
> Key: KAFKA-13145
> URL: https://issues.apache.org/jira/browse/KAFKA-13145
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: Luke Chen
>Assignee: Luke Chen
>Priority: Major
>
>  
> I have another thought, which is to rename the time interval related windows. 
> Currently, we have 3 types of time interval window:
>  {{TimeWindow}} -> to have {{[start,end)}} time interval
>  {{SessionWindow}} -> to have {{[start,end]}} time interval
>  {{UnlimitedWindow}} -> to have {{[start, MAX_VALUE)}} time interval
> I think the name {{SessionWindow}} is definitely not good here, especially since we 
> want to use it in {{SlidingWindows}} now, although it was only used for 
> {{SessionWindows}} before. We should name them after the time interval meaning, 
> not the streaming window function meaning. Because these 3 window types 
> are for internal use only, it is safe to rename them.
>  
> {{TimeWindow}} --> {{InclusiveExclusiveWindow}}
>  {{SessionWindow}} / {{SlidingWindow}} --> {{InclusiveInclusiveWindow}}
>  {{UnlimitedWindow}} --> {{InclusiveUnboundedWindow}}
> See the discussion here: 
> [https://github.com/apache/kafka/pull/11124#issuecomment-887989639]
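
For reference, a minimal standalone sketch of the three interval semantics described above (illustrative only; the real classes live in the Kafka Streams internals and look different):
{code:java}
public class WindowIntervalSketch {

    // TimeWindow semantics: [start, end)
    static boolean inTimeWindow(final long ts, final long start, final long end) {
        return ts >= start && ts < end;
    }

    // SessionWindow (and proposed SlidingWindow) semantics: [start, end]
    static boolean inSessionWindow(final long ts, final long start, final long end) {
        return ts >= start && ts <= end;
    }

    // UnlimitedWindow semantics: [start, Long.MAX_VALUE)
    static boolean inUnlimitedWindow(final long ts, final long start) {
        return ts >= start;
    }

    public static void main(final String[] args) {
        System.out.println(inTimeWindow(10, 0, 10));     // false: end is exclusive
        System.out.println(inSessionWindow(10, 0, 10));  // true: end is inclusive
        System.out.println(inUnlimitedWindow(10, 0));    // true: no upper bound
    }
}
{code}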



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-12559) Add a top-level Streams config for bounding off-heap memory

2021-08-03 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17392610#comment-17392610
 ] 

A. Sophie Blee-Goldman commented on KAFKA-12559:


[~brz] I'd say give [~msundeq] another day or two to respond and if you don't 
hear back then feel free to assign this ticket to yourself. I just added you as 
a contributor on the project so you should be able to self-assign tickets from 
now on.

> Add a top-level Streams config for bounding off-heap memory
> ---
>
> Key: KAFKA-12559
> URL: https://issues.apache.org/jira/browse/KAFKA-12559
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: A. Sophie Blee-Goldman
>Assignee: Martin Sundeqvist
>Priority: Major
>  Labels: needs-kip, newbie, newbie++
>
> At the moment we provide an example of how to bound the memory usage of 
> rocksdb in the [Memory 
> Management|https://kafka.apache.org/27/documentation/streams/developer-guide/memory-mgmt.html#rocksdb]
>  section of the docs. This requires implementing a custom RocksDBConfigSetter 
> class and setting a number of rocksdb options for relatively advanced 
> concepts and configurations. It seems a fair number of users either fail to 
> find this or consider it to be for more advanced use cases/users. But RocksDB 
> can eat up a lot of off-heap memory and it's not uncommon for users to come 
> across a {{RocksDBException: Cannot allocate memory}}
> It would probably be a much better user experience if we implemented this 
> memory bound out-of-the-box and just gave users a top-level StreamsConfig to 
> tune the off-heap memory given to rocksdb, like we have for on-heap cache 
> memory with cache.max.bytes.buffering. More advanced users can continue to 
> fine-tune their memory bounding and apply other configs with a custom config 
> setter, while new or more casual users can cap the off-heap memory without 
> getting their hands dirty with rocksdb.
> I would propose to add the following top-level config:
> rocksdb.max.bytes.off.heap: medium priority, default to -1 (unbounded), valid 
> values are [0, inf]
> I'd also want to consider adding a second, lower priority top-level config to 
> give users a knob for adjusting how much of that total off-heap memory goes 
> to the block cache + index/filter blocks, and how much of it is afforded to 
> the write buffers. I'm struggling to come up with a good name for this 
> config, but it would be something like
> rocksdb.memtable.to.block.cache.off.heap.memory.ratio: low priority, default 
> to 0.5, valid values are [0, 1]
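
For context, here is a rough sketch of the RocksDBConfigSetter approach that the linked Memory Management docs describe and that this ticket wants to make unnecessary for common cases. The sizes and exact option choices are illustrative, not a recommendation:
{code:java}
import java.util.Map;

import org.apache.kafka.streams.state.RocksDBConfigSetter;
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.Cache;
import org.rocksdb.LRUCache;
import org.rocksdb.Options;
import org.rocksdb.WriteBufferManager;

public class BoundedMemoryRocksDBConfig implements RocksDBConfigSetter {

    // Static so the bound is shared across every store instance in the application (example sizes).
    private static final long TOTAL_OFF_HEAP_BYTES = 512 * 1024 * 1024L;
    private static final long TOTAL_MEMTABLE_BYTES = 128 * 1024 * 1024L;

    private static final Cache CACHE = new LRUCache(TOTAL_OFF_HEAP_BYTES);
    private static final WriteBufferManager WRITE_BUFFER_MANAGER =
        new WriteBufferManager(TOTAL_MEMTABLE_BYTES, CACHE);

    @Override
    public void setConfig(final String storeName, final Options options, final Map<String, Object> configs) {
        final BlockBasedTableConfig tableConfig = (BlockBasedTableConfig) options.tableFormatConfig();
        tableConfig.setBlockCache(CACHE);                     // data blocks and index/filter blocks share the cache
        tableConfig.setCacheIndexAndFilterBlocks(true);
        options.setWriteBufferManager(WRITE_BUFFER_MANAGER);  // memtable memory is charged to the same cache
        options.setTableFormatConfig(tableConfig);
    }

    @Override
    public void close(final String storeName, final Options options) {
        // Do not close the shared cache here -- it is reused by every store.
    }
}
{code}
With the proposed top-level config, a user would instead only set something like {{rocksdb.max.bytes.off.heap}} in their StreamsConfig (the name as proposed above, not an existing config).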



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-12994) Migrate all Tests to New API and Remove Suppression for Deprecation Warnings related to KIP-633

2021-08-03 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman updated KAFKA-12994:
---
Labels: kip-633 newbie newbie++  (was: kip kip-633)

> Migrate all Tests to New API and Remove Suppression for Deprecation Warnings 
> related to KIP-633
> ---
>
> Key: KAFKA-12994
> URL: https://issues.apache.org/jira/browse/KAFKA-12994
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams, unit tests
>Affects Versions: 3.0.0
>Reporter: Israel Ekpo
>Priority: Major
>  Labels: kip-633, newbie, newbie++
> Fix For: 3.1.0
>
>
> Due to the API changes for KIP-633 a lot of deprecation warnings have been 
> generated in tests that are using the old deprecated APIs. There are a lot of 
> tests using the deprecated methods. We should absolutely migrate them all to 
> the new APIs and then get rid of all the applicable annotations for 
> suppressing the deprecation warnings.
> This applies to all Java and Scala examples and tests using the deprecated 
> APIs in the JoinWindows, SessionWindows, TimeWindows and SlidingWindows 
> classes.
>  
> This is based on the feedback from reviewers in this PR
>  
> https://github.com/apache/kafka/pull/10926
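
As an illustration of the kind of migration this ticket asks for, a minimal before/after sketch using the KIP-633 API (method names as accepted in the KIP; check the 3.0 javadocs for the exact signatures):
{code:java}
import java.time.Duration;

import org.apache.kafka.streams.kstream.SessionWindows;
import org.apache.kafka.streams.kstream.TimeWindows;

public class Kip633MigrationSketch {
    public static void main(final String[] args) {
        final Duration size = Duration.ofMinutes(5);
        final Duration grace = Duration.ofMinutes(1);

        // Old, now deprecated: the grace period silently defaulted to 24 hours unless .grace() was called.
        // final TimeWindows oldWindows = TimeWindows.of(size);

        // New: the grace period is always stated explicitly.
        final TimeWindows timeWindows = TimeWindows.ofSizeWithNoGrace(size);
        final SessionWindows sessionWindows = SessionWindows.ofInactivityGapAndGrace(size, grace);

        System.out.println(timeWindows + " / " + sessionWindows);
    }
}
{code}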



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (KAFKA-12994) Migrate all Tests to New API and Remove Suppression for Deprecation Warnings related to KIP-633

2021-08-03 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman reassigned KAFKA-12994:
--

Assignee: A. Sophie Blee-Goldman

> Migrate all Tests to New API and Remove Suppression for Deprecation Warnings 
> related to KIP-633
> ---
>
> Key: KAFKA-12994
> URL: https://issues.apache.org/jira/browse/KAFKA-12994
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams, unit tests
>Affects Versions: 3.0.0
>Reporter: Israel Ekpo
>Assignee: A. Sophie Blee-Goldman
>Priority: Major
>  Labels: kip-633, newbie, newbie++
> Fix For: 3.1.0
>
>
> Due to the API changes for KIP-633 a lot of deprecation warnings have been 
> generated in tests that are using the old deprecated APIs. There are a lot of 
> tests using the deprecated methods. We should absolutely migrate them all to 
> the new APIs and then get rid of all the applicable annotations for 
> suppressing the deprecation warnings.
> This applies to all Java and Scala examples and tests using the deprecated 
> APIs in the JoinWindows, SessionWindows, TimeWindows and SlidingWindows 
> classes.
>  
> This is based on the feedback from reviewers in this PR
>  
> https://github.com/apache/kafka/pull/10926



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (KAFKA-12994) Migrate all Tests to New API and Remove Suppression for Deprecation Warnings related to KIP-633

2021-08-03 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman reassigned KAFKA-12994:
--

Assignee: (was: Israel Ekpo)

> Migrate all Tests to New API and Remove Suppression for Deprecation Warnings 
> related to KIP-633
> ---
>
> Key: KAFKA-12994
> URL: https://issues.apache.org/jira/browse/KAFKA-12994
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams, unit tests
>Affects Versions: 3.0.0
>Reporter: Israel Ekpo
>Priority: Major
>  Labels: kip, kip-633
> Fix For: 3.1.0
>
>
> Due to the API changes for KIP-633 a lot of deprecation warnings have been 
> generated in tests that are using the old deprecated APIs. There are a lot of 
> tests using the deprecated methods. We should absolutely migrate them all to 
> the new APIs and then get rid of all the applicable annotations for 
> suppressing the deprecation warnings.
> This applies to all Java and Scala examples and tests using the deprecated 
> APIs in the JoinWindows, SessionWindows, TimeWindows and SlidingWindows 
> classes.
>  
> This is based on the feedback from reviewers in this PR
>  
> https://github.com/apache/kafka/pull/10926



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (KAFKA-12994) Migrate all Tests to New API and Remove Suppression for Deprecation Warnings related to KIP-633

2021-08-03 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman reassigned KAFKA-12994:
--

Assignee: (was: A. Sophie Blee-Goldman)

> Migrate all Tests to New API and Remove Suppression for Deprecation Warnings 
> related to KIP-633
> ---
>
> Key: KAFKA-12994
> URL: https://issues.apache.org/jira/browse/KAFKA-12994
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams, unit tests
>Affects Versions: 3.0.0
>Reporter: Israel Ekpo
>Priority: Major
>  Labels: kip-633, newbie, newbie++
> Fix For: 3.1.0
>
>
> Due to the API changes for KIP-633 a lot of deprecation warnings have been 
> generated in tests that are using the old deprecated APIs. There are a lot of 
> tests using the deprecated methods. We should absolutely migrate them all to 
> the new APIs and then get rid of all the applicable annotations for 
> suppressing the deprecation warnings.
> This applies to all Java and Scala examples and tests using the deprecated 
> APIs in the JoinWindows, SessionWindows, TimeWindows and SlidingWindows 
> classes.
>  
> This is based on the feedback from reviewers in this PR
>  
> https://github.com/apache/kafka/pull/10926



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-12994) Migrate all Tests to New API and Remove Suppression for Deprecation Warnings related to KIP-633

2021-08-03 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17392530#comment-17392530
 ] 

A. Sophie Blee-Goldman commented on KAFKA-12994:


Hey [~iekpo], I'm unassigning this in case someone else wants to pick it up. If 
you already started working on this and have a partial PR ready with some 
subset of the tests migrated over, you can just open that PR and we can merge 
this in pieces. It seems like a lot of tests so splitting it up into multiple 
PRs is probably a good idea anyway. (And obviously feel free to re-assign it to 
yourself if you want to continue working on it, and/or split this ticket up 
into sub-tasks covering different sets of tests)

> Migrate all Tests to New API and Remove Suppression for Deprecation Warnings 
> related to KIP-633
> ---
>
> Key: KAFKA-12994
> URL: https://issues.apache.org/jira/browse/KAFKA-12994
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams, unit tests
>Affects Versions: 3.0.0
>Reporter: Israel Ekpo
>Assignee: Israel Ekpo
>Priority: Major
>  Labels: kip, kip-633
> Fix For: 3.1.0
>
>
> Due to the API changes for KIP-633 a lot of deprecation warnings have been 
> generated in tests that are using the old deprecated APIs. There are a lot of 
> tests using the deprecated methods. We should absolutely migrate them all to 
> the new APIs and then get rid of all the applicable annotations for 
> suppressing the deprecation warnings.
> This applies to all Java and Scala examples and tests using the deprecated 
> APIs in the JoinWindows, SessionWindows, TimeWindows and SlidingWindows 
> classes.
>  
> This is based on the feedback from reviewers in this PR
>  
> https://github.com/apache/kafka/pull/10926



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-12994) Migrate all Tests to New API and Remove Suppression for Deprecation Warnings related to KIP-633

2021-08-03 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman updated KAFKA-12994:
---
Fix Version/s: (was: 3.0.0)
   3.1.0

> Migrate all Tests to New API and Remove Suppression for Deprecation Warnings 
> related to KIP-633
> ---
>
> Key: KAFKA-12994
> URL: https://issues.apache.org/jira/browse/KAFKA-12994
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams, unit tests
>Affects Versions: 3.0.0
>Reporter: Israel Ekpo
>Assignee: Israel Ekpo
>Priority: Major
>  Labels: kip, kip-633
> Fix For: 3.1.0
>
>
> Due to the API changes for KIP-633 a lot of deprecation warnings have been 
> generated in tests that are using the old deprecated APIs. There are a lot of 
> tests using the deprecated methods. We should absolutely migrate them all to 
> the new APIs and then get rid of all the applicable annotations for 
> suppressing the deprecation warnings.
> This applies to all Java and Scala examples and tests using the deprecated 
> APIs in the JoinWindows, SessionWindows, TimeWindows and SlidingWindows 
> classes.
>  
> This is based on the feedback from reviewers in this PR
>  
> https://github.com/apache/kafka/pull/10926



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9897) Flaky Test StoreQueryIntegrationTest#shouldQuerySpecificActivePartitionStores

2021-07-30 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390799#comment-17390799
 ] 

A. Sophie Blee-Goldman commented on KAFKA-9897:
---

[https://github.com/apache/kafka/pull/3|https://github.com/apache/kafka/pull/3]

> Flaky Test StoreQueryIntegrationTest#shouldQuerySpecificActivePartitionStores
> -
>
> Key: KAFKA-9897
> URL: https://issues.apache.org/jira/browse/KAFKA-9897
> Project: Kafka
>  Issue Type: Bug
>  Components: streams, unit tests
>Affects Versions: 2.6.0
>Reporter: Matthias J. Sax
>Priority: Critical
>  Labels: flaky-test
>
> [https://builds.apache.org/job/kafka-pr-jdk14-scala2.13/22/testReport/junit/org.apache.kafka.streams.integration/StoreQueryIntegrationTest/shouldQuerySpecificActivePartitionStores/]
> {quote}org.apache.kafka.streams.errors.InvalidStateStoreException: Cannot get 
> state store source-table because the stream thread is PARTITIONS_ASSIGNED, 
> not RUNNING at 
> org.apache.kafka.streams.state.internals.StreamThreadStateStoreProvider.stores(StreamThreadStateStoreProvider.java:85)
>  at 
> org.apache.kafka.streams.state.internals.QueryableStoreProvider.getStore(QueryableStoreProvider.java:61)
>  at org.apache.kafka.streams.KafkaStreams.store(KafkaStreams.java:1183) at 
> org.apache.kafka.streams.integration.StoreQueryIntegrationTest.shouldQuerySpecificActivePartitionStores(StoreQueryIntegrationTest.java:178){quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-10246) AbstractProcessorContext topic() throws NullPointerException when modifying a state store within the DSL from a punctuator

2021-07-30 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman resolved KAFKA-10246.

Resolution: Fixed

> AbstractProcessorContext topic() throws NullPointerException when modifying a 
> state store within the DSL from a punctuator
> --
>
> Key: KAFKA-10246
> URL: https://issues.apache.org/jira/browse/KAFKA-10246
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.5.0
> Environment: linux, windows, java 11
>Reporter: Peter Pringle
>Priority: Major
>
> NullPointerException seen when a KTable state store is being modified by a 
> punctuated method which is added to a topology via the DSL processor/ktable 
> valueTransformer methods.
> It seems valid for AbstractProcessorContext.topic() to return null; however 
> the check below throws a NullPointerException before a null can be returned.
> {quote}if (topic.equals(NONEXIST_TOPIC)) {
> {quote}
> Made a local fix to reverse the ordering of the check (i.e. avoid the null) 
> and this appears to fix the issue and send the change to the state store's 
> changelog topic.
> {quote}if (NONEXIST_TOPIC.equals(topic)) {
> {quote}
> Stacktrace below
> {{2020-07-02 07:29:46,829 
> [ABC_aggregator-551a90c1-d7c3-4357-a608-3ea79951f4e8-StreamThread-5] ERROR 
> [o.a.k.s.p.i.StreamThread]: stream-thread [ABC_aggregator-5}}
>  {{51a90c1-d7c3-4357-a608-3ea79951f4e8-StreamThread-5] Encountered the 
> following error during processing:}}
>  {{java.lang.NullPointerException: null}}
>  \{{ at 
> org.apache.kafka.streams.processor.internals.AbstractProcessorContext.topic(AbstractProcessorContext.java:115)}}
>  \{{ at 
> org.apache.kafka.streams.state.internals.CachingKeyValueStore.putInternal(CachingKeyValueStore.java:141)}}
>  \{{ at 
> org.apache.kafka.streams.state.internals.CachingKeyValueStore.put(CachingKeyValueStore.java:123)}}
>  \{{ at 
> org.apache.kafka.streams.state.internals.CachingKeyValueStore.put(CachingKeyValueStore.java:36)}}
>  \{{ at 
> org.apache.kafka.streams.state.internals.MeteredKeyValueStore.lambda$put$3(MeteredKeyValueStore.java:144)}}
>  \{{ at 
> org.apache.kafka.streams.processor.internals.metrics.StreamsMetricsImpl.maybeMeasureLatency(StreamsMetricsImpl.java:806)}}
>  \{{ at 
> org.apache.kafka.streams.state.internals.MeteredKeyValueStore.put(MeteredKeyValueStore.java:144)}}
>  \{{ at 
> org.apache.kafka.streams.processor.internals.ProcessorContextImpl$KeyValueStoreReadWriteDecorator.put(ProcessorContextImpl.java:487)}}
>  \{{ at 
> org.apache.kafka.streams.kstream.internals.KTableKTableJoinMerger$KTableKTableJoinMergeProcessor.process(KTableKTableJoinMerger.java:118)}}
>  \{{ at 
> org.apache.kafka.streams.kstream.internals.KTableKTableJoinMerger$KTableKTableJoinMergeProcessor.process(KTableKTableJoinMerger.java:97)}}
>  \{{ at 
> org.apache.kafka.streams.processor.internals.ProcessorNode.lambda$process$2(ProcessorNode.java:142)}}
>  \{{ at 
> org.apache.kafka.streams.processor.internals.metrics.StreamsMetricsImpl.maybeMeasureLatency(StreamsMetricsImpl.java:806)}}
>  \{{ at 
> org.apache.kafka.streams.processor.internals.ProcessorNode.process(ProcessorNode.java:142)}}
>  \{{ at 
> org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:201)}}
>  \{{ at 
> org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:180)}}
>  \{{ at 
> org.apache.kafka.streams.kstream.internals.KTableKTableOuterJoin$KTableKTableOuterJoinProcessor.process(KTableKTableOuterJoin.java:118)}}
>  \{{ at 
> org.apache.kafka.streams.kstream.internals.KTableKTableOuterJoin$KTableKTableOuterJoinProcessor.process(KTableKTableOuterJoin.java:65)}}
>  \{{ at 
> org.apache.kafka.streams.processor.internals.ProcessorNode.lambda$process$2(ProcessorNode.java:142)}}
>  \{{ at 
> org.apache.kafka.streams.processor.internals.metrics.StreamsMetricsImpl.maybeMeasureLatency(StreamsMetricsImpl.java:806)}}
>  \{{ at 
> org.apache.kafka.streams.processor.internals.ProcessorNode.process(ProcessorNode.java:142)}}
>  \{{ at 
> org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:201)}}
>  \{{ at 
> org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:180)}}
>  \{{ at 
> org.apache.kafka.streams.kstream.internals.TimestampedCacheFlushListener.apply(TimestampedCacheFlushListener.java:45)}}
>  \{{ at 
> org.apache.kafka.streams.kstream.internals.TimestampedCacheFlushListener.apply(TimestampedCacheFlushListener.java:28)}}
>  \{{ at 
> org.apache.kafka.streams.state.internals.MeteredKeyValueStore.lambda$setFlushListener$1(MeteredKeyValueStore.java:119)}}

[jira] [Resolved] (KAFKA-13150) How is Kafkastream configured to consume data from a specified offset ?

2021-07-29 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman resolved KAFKA-13150.

Resolution: Invalid

> How is Kafkastream configured to consume data from a specified offset ?
> ---
>
> Key: KAFKA-13150
> URL: https://issues.apache.org/jira/browse/KAFKA-13150
> Project: Kafka
>  Issue Type: New Feature
>  Components: streams
>Reporter: wangjh
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-13150) How is Kafkastream configured to consume data from a specified offset ?

2021-07-29 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-13150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390096#comment-17390096
 ] 

A. Sophie Blee-Goldman commented on KAFKA-13150:


[~wangjh] Jira is intended only for bug reports and feature requests; please use 
the mailing list for questions.

> How is Kafkastream configured to consume data from a specified offset ?
> ---
>
> Key: KAFKA-13150
> URL: https://issues.apache.org/jira/browse/KAFKA-13150
> Project: Kafka
>  Issue Type: New Feature
>  Components: streams
>Reporter: wangjh
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-8295) Optimize count() using RocksDB merge operator

2021-07-27 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17388304#comment-17388304
 ] 

A. Sophie Blee-Goldman commented on KAFKA-8295:
---

I was just re-reading the wiki page on the Merge Operator, and now I wonder if 
it may not be _as_ helpful as I originally thought – but probably still can 
offer some improvement. Here's my take, let me know what you think.

Regardless of whether a custom MergeOperator suffers from the same performance 
impact of crossing the jni, I would bet that use cases such as list-append 
would still be more performant (since reading out an entire list, appending to 
it, and then writing the entire thing back is a lot of I/O). There are also the 
built-in, native MergeOperators that wouldn't need to cross the jni such as the 
UInt64AddOperator as you point out. So there are definitely cases where a 
MergeOperator would still outperform an RMW sequence.

The thing I didn't fully appreciate before (but seems kind of obvious now that 
I think of it lol) is that the merge() call doesn't actually return the current 
value, either before or after the merge. So if we have to know this value in 
addition to updating it, we need to do a get(), and using merge()  instead of 
RMW is only saving us the cost of `put(full_merged_value) - 
put(single_update_value)` – which means for constant-size values, like the 
uint64 unfortunately, there's pretty much no savings at all. So we don't even 
need to worry about whether/how to handle the fact that this is now a 
ValueAndTimestamp instead of a plain Value, ie a Long in the case of count(), 
because I don't think there's likely to be any performance improvement there.

I didn't realize that at the time of filing this ticket, so maybe we should 
look past the current title of this ticket. This still leaves some cases that 
could potentially benefit from even a custom MergeOperator, such as list-append 
or any other where the difference in size between the full_merged_value and the 
single_update_value is very large. So it could be worth doing a POC of 
something like this and benchmarking that for a KIP.  But tbh, having seen how 
messy it is to add new operators to the StateStore interface at the moment, I 
think we should probably try to avoid doing so unless there's good motivation 
and a clear benefit. In this case, while there may be a benefit, I'm not sure 
there's a good motivation to do so since no user has requested this feature 
yet. Of course that could just be because they aren't aware of the possibility, 
so how about this: we update the title of this ticket to describe this possible 
new feature and then see if any users chime in here or vote on the ticket. If 
we gauge real user interest then it makes more sense to put time into doing 
this. WDYT?
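
For anyone following along, here is a small standalone RocksJava sketch (not Kafka Streams code) of the list-append case discussed above, using the built-in StringAppendOperator; the equivalent RMW sequence would have to read the whole list, append, and write it all back on every update:
{code:java}
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.StringAppendOperator;

public class MergeOperatorSketch {
    public static void main(final String[] args) throws RocksDBException {
        RocksDB.loadLibrary();
        try (final Options options = new Options()
                 .setCreateIfMissing(true)
                 .setMergeOperator(new StringAppendOperator());  // appends with ',' as the delimiter
             final RocksDB db = RocksDB.open(options, "/tmp/merge-sketch")) {

            final byte[] key = "my-list".getBytes();
            db.merge(key, "a".getBytes());  // no read of the existing value is needed
            db.merge(key, "b".getBytes());

            System.out.println(new String(db.get(key)));  // prints "a,b"
        }
    }
}
{code}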

> Optimize count() using RocksDB merge operator
> -
>
> Key: KAFKA-8295
> URL: https://issues.apache.org/jira/browse/KAFKA-8295
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: A. Sophie Blee-Goldman
>Assignee: Sagar Rao
>Priority: Major
>
> In addition to regular put/get/delete RocksDB provides a fourth operation, 
> merge. This essentially provides an optimized read/update/write path in a 
> single operation. One of the built-in (C++) merge operators exposed over the 
> Java API is a counter. We should be able to leverage this for a more 
> efficient implementation of count()
>  
> (Note: Unfortunately it seems unlikely we can use this to optimize general 
> aggregations, even if RocksJava allowed for a custom merge operator, unless 
> we provide a way for the user to specify and connect a C++ implemented 
> aggregator – otherwise we incur too much cost crossing the jni for a net 
> performance benefit)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9897) Flaky Test StoreQueryIntegrationTest#shouldQuerySpecificActivePartitionStores

2021-07-26 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17387652#comment-17387652
 ] 

A. Sophie Blee-Goldman commented on KAFKA-9897:
---

[~mjsax] that last failure is for a different test that was added recently, we 
merged a fix three days ago which I think was just after that build was 
triggered...but if you see that specific test fail again on a recent build can 
you reopen KAFKA-13128 instead?

> Flaky Test StoreQueryIntegrationTest#shouldQuerySpecificActivePartitionStores
> -
>
> Key: KAFKA-9897
> URL: https://issues.apache.org/jira/browse/KAFKA-9897
> Project: Kafka
>  Issue Type: Bug
>  Components: streams, unit tests
>Affects Versions: 2.6.0
>Reporter: Matthias J. Sax
>Priority: Critical
>  Labels: flaky-test
>
> [https://builds.apache.org/job/kafka-pr-jdk14-scala2.13/22/testReport/junit/org.apache.kafka.streams.integration/StoreQueryIntegrationTest/shouldQuerySpecificActivePartitionStores/]
> {quote}org.apache.kafka.streams.errors.InvalidStateStoreException: Cannot get 
> state store source-table because the stream thread is PARTITIONS_ASSIGNED, 
> not RUNNING at 
> org.apache.kafka.streams.state.internals.StreamThreadStateStoreProvider.stores(StreamThreadStateStoreProvider.java:85)
>  at 
> org.apache.kafka.streams.state.internals.QueryableStoreProvider.getStore(QueryableStoreProvider.java:61)
>  at org.apache.kafka.streams.KafkaStreams.store(KafkaStreams.java:1183) at 
> org.apache.kafka.streams.integration.StoreQueryIntegrationTest.shouldQuerySpecificActivePartitionStores(StoreQueryIntegrationTest.java:178){quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-13120) Flesh out `streams_static_membership_test` to be more robust

2021-07-26 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-13120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17387506#comment-17387506
 ] 

A. Sophie Blee-Goldman commented on KAFKA-13120:


Thanks [~Reggiehsu111]! I assigned the ticket to you and added you as a 
contributor so you can self-assign tickets from now on. Let me or [~lct45] know 
if you have any questions

> Flesh out `streams_static_membership_test` to be more robust
> 
>
> Key: KAFKA-13120
> URL: https://issues.apache.org/jira/browse/KAFKA-13120
> Project: Kafka
>  Issue Type: Task
>  Components: streams, system tests
>Reporter: Leah Thomas
>Assignee: Reggie Hsu
>Priority: Minor
>  Labels: newbie++
>
> When fixing the `streams_static_membership_test.py` we noticed that the test 
> is pretty bare bones, it creates a streams application but doesn't do much 
> with the streams application, eg has no stateful processing. We should flesh 
> this out a bit to be more realistic and potentially consider testing with EOS 
> as well. The full java test is in `StaticMembershipTestClient`



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (KAFKA-13120) Flesh out `streams_static_membership_test` to be more robust

2021-07-26 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman reassigned KAFKA-13120:
--

Assignee: Reggie Hsu

> Flesh out `streams_static_membership_test` to be more robust
> 
>
> Key: KAFKA-13120
> URL: https://issues.apache.org/jira/browse/KAFKA-13120
> Project: Kafka
>  Issue Type: Task
>  Components: streams, system tests
>Reporter: Leah Thomas
>Assignee: Reggie Hsu
>Priority: Minor
>  Labels: newbie++
>
> When fixing the `streams_static_membership_test.py` we noticed that the test 
> is pretty bare bones, it creates a streams application but doesn't do much 
> with the streams application, eg has no stateful processing. We should flesh 
> this out a bit to be more realistic and potentially consider testing with EOS 
> as well. The full java test is in `StaticMembershipTestClient`



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-13021) Improve Javadocs for API Changes and address followup from KIP-633

2021-07-23 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman resolved KAFKA-13021.

Resolution: Fixed

> Improve Javadocs for API Changes and address followup from KIP-633
> --
>
> Key: KAFKA-13021
> URL: https://issues.apache.org/jira/browse/KAFKA-13021
> Project: Kafka
>  Issue Type: Improvement
>  Components: documentation, streams
>Affects Versions: 3.0.0
>Reporter: Israel Ekpo
>Assignee: Israel Ekpo
>Priority: Major
> Fix For: 3.0.0
>
>
> There are Javadoc changes from the following PR that need to be completed 
> prior to the 3.0 release. This Jira item is to track that work.
> [https://github.com/apache/kafka/pull/10926]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-13021) Improve Javadocs for API Changes and address followup from KIP-633

2021-07-23 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman updated KAFKA-13021:
---
Fix Version/s: 3.0.0

> Improve Javadocs for API Changes and address followup from KIP-633
> --
>
> Key: KAFKA-13021
> URL: https://issues.apache.org/jira/browse/KAFKA-13021
> Project: Kafka
>  Issue Type: Improvement
>  Components: documentation, streams
>Affects Versions: 3.0.0
>Reporter: Israel Ekpo
>Assignee: Israel Ekpo
>Priority: Major
> Fix For: 3.0.0
>
>
> There are Javadoc changes from the following PR that need to be completed 
> prior to the 3.0 release. This Jira item is to track that work.
> [https://github.com/apache/kafka/pull/10926]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-13128) Flaky Test StoreQueryIntegrationTest.shouldQueryStoresAfterAddingAndRemovingStreamThread

2021-07-23 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman updated KAFKA-13128:
---
Affects Version/s: 2.8.1
   3.0.0

> Flaky Test 
> StoreQueryIntegrationTest.shouldQueryStoresAfterAddingAndRemovingStreamThread
> 
>
> Key: KAFKA-13128
> URL: https://issues.apache.org/jira/browse/KAFKA-13128
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 3.0.0, 2.8.1
>Reporter: A. Sophie Blee-Goldman
>Assignee: A. Sophie Blee-Goldman
>Priority: Blocker
>  Labels: flaky-test
> Fix For: 3.0.0, 2.8.1
>
>
> h3. Stacktrace
> java.lang.AssertionError: Expected: is not null but: was null 
>   at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20) 
>   at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6)
>   at 
> org.apache.kafka.streams.integration.StoreQueryIntegrationTest.lambda$shouldQueryStoresAfterAddingAndRemovingStreamThread$19(StoreQueryIntegrationTest.java:461)
>   at 
> org.apache.kafka.streams.integration.StoreQueryIntegrationTest.until(StoreQueryIntegrationTest.java:506)
>   at 
> org.apache.kafka.streams.integration.StoreQueryIntegrationTest.shouldQueryStoresAfterAddingAndRemovingStreamThread(StoreQueryIntegrationTest.java:455)
>  
> https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-11085/5/testReport/org.apache.kafka.streams.integration/StoreQueryIntegrationTest/Build___JDK_16_and_Scala_2_13___shouldQueryStoresAfterAddingAndRemovingStreamThread_2/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-13130) Deprecate long based range queries in SessionStore

2021-07-23 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-13130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17386523#comment-17386523
 ] 

A. Sophie Blee-Goldman commented on KAFKA-13130:


Hey [~pstuedi], just a heads up there is actually a KIP for this that's already 
been accepted. I suspect [~jeqo] wouldn't mind if you took over the 
implementation.

[KIP-666|https://cwiki.apache.org/confluence/display/KAFKA/KIP-666%3A+Add+Instant-based+methods+to+ReadOnlySessionStore]

> Deprecate long based range queries in SessionStore
> --
>
> Key: KAFKA-13130
> URL: https://issues.apache.org/jira/browse/KAFKA-13130
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: Patrick Stuedi
>Assignee: Patrick Stuedi
>Priority: Minor
>  Labels: needs-kip, newbie, newbie++
> Fix For: 3.1.0
>
>
> Migrate long based queries in ReadOnlySessionStore (fetchSession*, 
> findSession*, etc.) to object based interfaces. Deprecate old long based 
> interface, similar to how it was done for the ReadOnlyWindowStore.
> Related KIPs:
> [https://cwiki.apache.org/confluence/display/KAFKA/KIP-358%3A+Migrate+Streams+API+to+Duration+instead+of+long+ms+times]
> [https://cwiki.apache.org/confluence/display/KAFKA/KIP-617%3A+Allow+Kafka+Streams+State+Stores+to+be+iterated+backwards]
>  
> Related Jiras:
> https://issues.apache.org/jira/browse/KAFKA-12419
> https://issues.apache.org/jira/browse/KAFKA-12526
> https://issues.apache.org/jira/browse/KAFKA-12451
>  
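
A hypothetical mini-interface sketching the migration pattern this describes (the real ReadOnlySessionStore methods differ; this only illustrates deprecating a long-based query in favor of an Instant-based one, as in KIP-358):
{code:java}
import java.time.Instant;

public interface SessionQuerySketch<K, V> {

    // Existing long-based query, kept but deprecated.
    @Deprecated
    Iterable<V> findSessions(K key, long earliestSessionEndTimeMs, long latestSessionStartTimeMs);

    // New Instant-based overload; a default delegating implementation keeps old stores compiling.
    default Iterable<V> findSessions(final K key,
                                     final Instant earliestSessionEndTime,
                                     final Instant latestSessionStartTime) {
        return findSessions(key, earliestSessionEndTime.toEpochMilli(), latestSessionStartTime.toEpochMilli());
    }
}
{code}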



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-13120) Flesh out `streams_static_membership_test` to be more robust

2021-07-23 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman updated KAFKA-13120:
---
Labels: newbie++  (was: )

> Flesh out `streams_static_membership_test` to be more robust
> 
>
> Key: KAFKA-13120
> URL: https://issues.apache.org/jira/browse/KAFKA-13120
> Project: Kafka
>  Issue Type: Task
>  Components: streams, system tests
>Reporter: Leah Thomas
>Priority: Minor
>  Labels: newbie++
>
> When fixing the `streams_static_membership_test.py` we noticed that the test 
> is pretty bare bones, it creates a streams application but doesn't do much 
> with the streams application, eg has no stateful processing. We should flesh 
> this out a bit to be more realistic and potentially consider testing with EOS 
> as well. The full java test is in `StaticMembershipTestClient`



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-13021) Improve Javadocs for API Changes and address followup from KIP-633

2021-07-22 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-13021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17385929#comment-17385929
 ] 

A. Sophie Blee-Goldman commented on KAFKA-13021:


PR: https://github.com/apache/kafka/pull/4

> Improve Javadocs for API Changes and address followup from KIP-633
> --
>
> Key: KAFKA-13021
> URL: https://issues.apache.org/jira/browse/KAFKA-13021
> Project: Kafka
>  Issue Type: Improvement
>  Components: documentation, streams
>Affects Versions: 3.0.0
>Reporter: Israel Ekpo
>Assignee: Israel Ekpo
>Priority: Major
>
> There are Javadoc changes from the following PR that need to be completed 
> prior to the 3.0 release. This Jira item is to track that work.
> [https://github.com/apache/kafka/pull/10926]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-13021) Improve Javadocs for API Changes and address followup from KIP-633

2021-07-22 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman updated KAFKA-13021:
---
Summary: Improve Javadocs for API Changes and address followup from KIP-633 
 (was: Improve Javadocs for API Changes from KIP-633)

> Improve Javadocs for API Changes and address followup from KIP-633
> --
>
> Key: KAFKA-13021
> URL: https://issues.apache.org/jira/browse/KAFKA-13021
> Project: Kafka
>  Issue Type: Improvement
>  Components: documentation, streams
>Affects Versions: 3.0.0
>Reporter: Israel Ekpo
>Assignee: Israel Ekpo
>Priority: Major
>
> There are Javadoc changes from the following PR that need to be completed 
> prior to the 3.0 release. This Jira item is to track that work.
> [https://github.com/apache/kafka/pull/10926]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (KAFKA-6948) Avoid overflow in timestamp comparison

2021-07-22 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-6948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman reassigned KAFKA-6948:
-

Assignee: A. Sophie Blee-Goldman

> Avoid overflow in timestamp comparison
> --
>
> Key: KAFKA-6948
> URL: https://issues.apache.org/jira/browse/KAFKA-6948
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Giovanni Liva
>Assignee: A. Sophie Blee-Goldman
>Priority: Major
>
> Some comparisons with timestamp values are not safe. This comparisons can 
> trigger errors that were found in some other issues, e.g. KAFKA-4290 or 
> KAFKA-6608.
> The following classes contains some comparison between timestamps that can 
> overflow.
>  * org.apache.kafka.clients.NetworkClientUtils
>  * org.apache.kafka.clients.consumer.internals.ConsumerCoordinator
>  * org.apache.kafka.common.security.kerberos.KerberosLogin
>  * org.apache.kafka.connect.runtime.WorkerSinkTask
>  * org.apache.kafka.connect.tools.MockSinkTask
>  * org.apache.kafka.connect.tools.MockSourceTask
>  * org.apache.kafka.streams.processor.internals.GlobalStreamThread
>  * org.apache.kafka.streams.processor.internals.StateDirectory
>  * org.apache.kafka.streams.processor.internals.StreamThread
>  
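
As a minimal illustration of the overflow pattern this ticket is about (not code from any of the classes listed above): adding a very large timeout to a timestamp wraps around, while comparing elapsed time does not.
{code:java}
public class OverflowSafeTimeoutSketch {
    public static void main(final String[] args) {
        final long startMs = System.currentTimeMillis();
        final long timeoutMs = Long.MAX_VALUE;        // e.g. a timeout configured to "never expire"
        final long nowMs = startMs + 1000;

        final boolean unsafeExpired = nowMs > startMs + timeoutMs;   // startMs + timeoutMs overflows -> spuriously true
        final boolean safeExpired   = nowMs - startMs >= timeoutMs;  // compares elapsed time -> correctly false

        System.out.println(unsafeExpired + " " + safeExpired);       // prints "true false"
    }
}
{code}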



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-13128) Flaky Test StoreQueryIntegrationTest.shouldQueryStoresAfterAddingAndRemovingStreamThread

2021-07-22 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman updated KAFKA-13128:
---
Priority: Blocker  (was: Major)

> Flaky Test 
> StoreQueryIntegrationTest.shouldQueryStoresAfterAddingAndRemovingStreamThread
> 
>
> Key: KAFKA-13128
> URL: https://issues.apache.org/jira/browse/KAFKA-13128
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: A. Sophie Blee-Goldman
>Assignee: A. Sophie Blee-Goldman
>Priority: Blocker
>  Labels: flaky-test
> Fix For: 3.0.0, 2.8.1
>
>
> h3. Stacktrace
> java.lang.AssertionError: Expected: is not null but: was null 
>   at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20) 
>   at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6)
>   at 
> org.apache.kafka.streams.integration.StoreQueryIntegrationTest.lambda$shouldQueryStoresAfterAddingAndRemovingStreamThread$19(StoreQueryIntegrationTest.java:461)
>   at 
> org.apache.kafka.streams.integration.StoreQueryIntegrationTest.until(StoreQueryIntegrationTest.java:506)
>   at 
> org.apache.kafka.streams.integration.StoreQueryIntegrationTest.shouldQueryStoresAfterAddingAndRemovingStreamThread(StoreQueryIntegrationTest.java:455)
>  
> https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-11085/5/testReport/org.apache.kafka.streams.integration/StoreQueryIntegrationTest/Build___JDK_16_and_Scala_2_13___shouldQueryStoresAfterAddingAndRemovingStreamThread_2/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-13128) Flaky Test StoreQueryIntegrationTest.shouldQueryStoresAfterAddingAndRemovingStreamThread

2021-07-22 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman updated KAFKA-13128:
---
Affects Version/s: (was: 3.1.0)

> Flaky Test 
> StoreQueryIntegrationTest.shouldQueryStoresAfterAddingAndRemovingStreamThread
> 
>
> Key: KAFKA-13128
> URL: https://issues.apache.org/jira/browse/KAFKA-13128
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: A. Sophie Blee-Goldman
>Assignee: A. Sophie Blee-Goldman
>Priority: Major
>  Labels: flaky-test
> Fix For: 3.0.0, 2.8.1
>
>
> h3. Stacktrace
> java.lang.AssertionError: Expected: is not null but: was null 
>   at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20) 
>   at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6)
>   at 
> org.apache.kafka.streams.integration.StoreQueryIntegrationTest.lambda$shouldQueryStoresAfterAddingAndRemovingStreamThread$19(StoreQueryIntegrationTest.java:461)
>   at 
> org.apache.kafka.streams.integration.StoreQueryIntegrationTest.until(StoreQueryIntegrationTest.java:506)
>   at 
> org.apache.kafka.streams.integration.StoreQueryIntegrationTest.shouldQueryStoresAfterAddingAndRemovingStreamThread(StoreQueryIntegrationTest.java:455)
>  
> https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-11085/5/testReport/org.apache.kafka.streams.integration/StoreQueryIntegrationTest/Build___JDK_16_and_Scala_2_13___shouldQueryStoresAfterAddingAndRemovingStreamThread_2/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-13128) Flaky Test StoreQueryIntegrationTest.shouldQueryStoresAfterAddingAndRemovingStreamThread

2021-07-22 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman updated KAFKA-13128:
---
Fix Version/s: 2.8.1
   3.0.0

> Flaky Test 
> StoreQueryIntegrationTest.shouldQueryStoresAfterAddingAndRemovingStreamThread
> 
>
> Key: KAFKA-13128
> URL: https://issues.apache.org/jira/browse/KAFKA-13128
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 3.1.0
>Reporter: A. Sophie Blee-Goldman
>Assignee: A. Sophie Blee-Goldman
>Priority: Major
>  Labels: flaky-test
> Fix For: 3.0.0, 2.8.1
>
>
> h3. Stacktrace
> java.lang.AssertionError: Expected: is not null but: was null 
>   at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20) 
>   at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6)
>   at 
> org.apache.kafka.streams.integration.StoreQueryIntegrationTest.lambda$shouldQueryStoresAfterAddingAndRemovingStreamThread$19(StoreQueryIntegrationTest.java:461)
>   at 
> org.apache.kafka.streams.integration.StoreQueryIntegrationTest.until(StoreQueryIntegrationTest.java:506)
>   at 
> org.apache.kafka.streams.integration.StoreQueryIntegrationTest.shouldQueryStoresAfterAddingAndRemovingStreamThread(StoreQueryIntegrationTest.java:455)
>  
> https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-11085/5/testReport/org.apache.kafka.streams.integration/StoreQueryIntegrationTest/Build___JDK_16_and_Scala_2_13___shouldQueryStoresAfterAddingAndRemovingStreamThread_2/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (KAFKA-13128) Flaky Test StoreQueryIntegrationTest.shouldQueryStoresAfterAddingAndRemovingStreamThread

2021-07-22 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman reassigned KAFKA-13128:
--

Assignee: A. Sophie Blee-Goldman

> Flaky Test 
> StoreQueryIntegrationTest.shouldQueryStoresAfterAddingAndRemovingStreamThread
> 
>
> Key: KAFKA-13128
> URL: https://issues.apache.org/jira/browse/KAFKA-13128
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 3.1.0
>Reporter: A. Sophie Blee-Goldman
>Assignee: A. Sophie Blee-Goldman
>Priority: Major
>  Labels: flaky-test
>
> h3. Stacktrace
> java.lang.AssertionError: Expected: is not null but: was null 
>   at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20) 
>   at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6)
>   at 
> org.apache.kafka.streams.integration.StoreQueryIntegrationTest.lambda$shouldQueryStoresAfterAddingAndRemovingStreamThread$19(StoreQueryIntegrationTest.java:461)
>   at 
> org.apache.kafka.streams.integration.StoreQueryIntegrationTest.until(StoreQueryIntegrationTest.java:506)
>   at 
> org.apache.kafka.streams.integration.StoreQueryIntegrationTest.shouldQueryStoresAfterAddingAndRemovingStreamThread(StoreQueryIntegrationTest.java:455)
>  
> https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-11085/5/testReport/org.apache.kafka.streams.integration/StoreQueryIntegrationTest/Build___JDK_16_and_Scala_2_13___shouldQueryStoresAfterAddingAndRemovingStreamThread_2/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-13128) Flaky Test StoreQueryIntegrationTest.shouldQueryStoresAfterAddingAndRemovingStreamThread

2021-07-22 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-13128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17385841#comment-17385841
 ] 

A. Sophie Blee-Goldman commented on KAFKA-13128:


The failure is from the second line in this series of assertions 
{code:java}
assertThat(store1.get(key), is(notNullValue()));
assertThat(store1.get(key2), is(notNullValue()));
assertThat(store1.get(key3), is(notNullValue()));
{code}
which is basically the first time we attempt IQ after starting up. The test 
setup includes starting Streams and waiting for it to reach RUNNING, then 
adding a new thread, and finally producing a set of 100 records for each of the 
three keys. After that it waits for all records to be processed *for _key3_* 
and then proceeds to the above assertions.

I suspect the problem is that we only wait for all data to be processed for 
_key3_, but not the other two keys. In theory this should work, since the data 
for _key3_ is produced last and would have the largest timestamps meaning the 
keys should be processed more or less in order. However the input topic 
actually has two partitions, so it could be that _key1_ and _key3_ correspond 
to task 1 while _key2_ corresponds to task 2. Again, that shouldn't affect the 
order in which records are processed – as long as the tasks are on the same 
thread.

But we started up a new thread in between waiting for Streams to reach RUNNING 
and producing data to the input topics. This new thread has to be assigned one 
of the tasks, but due to cooperative rebalancing it will take two full (though 
short) rebalances before the new thread can actually start processing any 
tasks. Therefore as long as the original thread continues to own the task 
corresponding to _key3_ after the new thread is added, it can easily get 
through all records for _key3_. Which would mean the test can proceed to the 
above assertions while the new thread is still waiting to start processing any 
data for _key2_ at all.

There are a few ways we can address this given how many things had to happen 
exactly right in order to see this failure, but the simplest fix is to just 
wait on all three keys to be fully processed rather than just the one. This 
also seems to align best with the original intention of the test.
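
A self-contained sketch of the idea behind that fix (the names and the retry helper are illustrative, not the actual test code): poll until every key is visible before running the IQ assertions, instead of only key3.
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BooleanSupplier;

public class WaitForAllKeysSketch {

    // Simplified stand-in for the test's until(...) helper: retry the condition until it holds or we time out.
    static void until(final BooleanSupplier condition) throws InterruptedException {
        final long deadlineMs = System.currentTimeMillis() + 60_000L;
        while (!condition.getAsBoolean()) {
            if (System.currentTimeMillis() > deadlineMs) {
                throw new AssertionError("condition not met within timeout");
            }
            Thread.sleep(100);
        }
    }

    public static void main(final String[] args) throws InterruptedException {
        // Stand-in for the queryable state store; in the real test this is filled by the stream threads.
        final Map<Integer, Long> store = new ConcurrentHashMap<>();
        store.put(1, 100L);
        store.put(2, 100L);
        store.put(3, 100L);

        // Wait on all three keys, not just key3, before asserting on the store contents.
        until(() -> store.get(1) != null && store.get(2) != null && store.get(3) != null);
        System.out.println("all keys processed, safe to query");
    }
}
{code}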

> Flaky Test 
> StoreQueryIntegrationTest.shouldQueryStoresAfterAddingAndRemovingStreamThread
> 
>
> Key: KAFKA-13128
> URL: https://issues.apache.org/jira/browse/KAFKA-13128
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 3.1.0
>Reporter: A. Sophie Blee-Goldman
>Priority: Major
>  Labels: flaky-test
>
> h3. Stacktrace
> java.lang.AssertionError: Expected: is not null but: was null 
>   at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20) 
>   at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6)
>   at 
> org.apache.kafka.streams.integration.StoreQueryIntegrationTest.lambda$shouldQueryStoresAfterAddingAndRemovingStreamThread$19(StoreQueryIntegrationTest.java:461)
>   at 
> org.apache.kafka.streams.integration.StoreQueryIntegrationTest.until(StoreQueryIntegrationTest.java:506)
>   at 
> org.apache.kafka.streams.integration.StoreQueryIntegrationTest.shouldQueryStoresAfterAddingAndRemovingStreamThread(StoreQueryIntegrationTest.java:455)
>  
> https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-11085/5/testReport/org.apache.kafka.streams.integration/StoreQueryIntegrationTest/Build___JDK_16_and_Scala_2_13___shouldQueryStoresAfterAddingAndRemovingStreamThread_2/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-13128) Flaky Test StoreQueryIntegrationTest.shouldQueryStoresAfterAddingAndRemovingStreamThread

2021-07-22 Thread A. Sophie Blee-Goldman (Jira)
A. Sophie Blee-Goldman created KAFKA-13128:
--

 Summary: Flaky Test 
StoreQueryIntegrationTest.shouldQueryStoresAfterAddingAndRemovingStreamThread
 Key: KAFKA-13128
 URL: https://issues.apache.org/jira/browse/KAFKA-13128
 Project: Kafka
  Issue Type: Bug
  Components: streams
Affects Versions: 3.1.0
Reporter: A. Sophie Blee-Goldman


h3. Stacktrace

java.lang.AssertionError: Expected: is not null but: was null 
  at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20) 
  at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6)
  at 
org.apache.kafka.streams.integration.StoreQueryIntegrationTest.lambda$shouldQueryStoresAfterAddingAndRemovingStreamThread$19(StoreQueryIntegrationTest.java:461)
  at 
org.apache.kafka.streams.integration.StoreQueryIntegrationTest.until(StoreQueryIntegrationTest.java:506)
  at 
org.apache.kafka.streams.integration.StoreQueryIntegrationTest.shouldQueryStoresAfterAddingAndRemovingStreamThread(StoreQueryIntegrationTest.java:455)

 

https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-11085/5/testReport/org.apache.kafka.streams.integration/StoreQueryIntegrationTest/Build___JDK_16_and_Scala_2_13___shouldQueryStoresAfterAddingAndRemovingStreamThread_2/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (KAFKA-7497) Kafka Streams should support self-join on streams

2021-07-22 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman reassigned KAFKA-7497:
-

Assignee: (was: A. Sophie Blee-Goldman)

> Kafka Streams should support self-join on streams
> -
>
> Key: KAFKA-7497
> URL: https://issues.apache.org/jira/browse/KAFKA-7497
> Project: Kafka
>  Issue Type: New Feature
>  Components: streams
>Reporter: Robin Moffatt
>Priority: Major
>  Labels: needs-kip
>
> There are valid reasons to want to join a stream to itself, but Kafka Streams 
> does not currently support this ({{Invalid topology: Topic foo has already 
> been registered by another source.}}).  To perform the join requires creating 
> a second stream as a clone of the first, and then doing a join between the 
> two. This is a clunky workaround and results in unnecessary duplication of 
> data.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-13126) Overflow in joinGroupTimeoutMs when max.poll.interval.ms is MAX_VALUE leads to missing rebalances

2021-07-22 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman updated KAFKA-13126:
---
Description: 
In older versions of Kafka Streams, the {{max.poll.interval.ms}} config was 
overridden by default to {{Integer.MAX_VALUE}}. Even after we removed this 
override, users of both the plain consumer client and kafka streams still set 
the poll interval to MAX_VALUE somewhat often. Unfortunately, this causes an 
overflow when computing the {{joinGroupTimeoutMs}} and results in it being set 
to the {{request.timeout.ms}} instead, which is much lower.

This can easily make consumers drop out of the group, since they must now rejoin 
within 30s (by default) but have almost no obligation to ever call poll() 
given the high {{max.poll.interval.ms}} – basically they will only do so after 
processing the last record from the previously polled batch. So in heavy 
processing cases, where each record takes a long time to process, or when using 
a very large  {{max.poll.records}}, it can be difficult to make any progress at 
all before dropping out and needing to rejoin. And of course, the rebalance 
that is kicked off upon this member rejoining can result in many of the other 
members in the group dropping out as well, leading to an endless cycle of 
missed rebalances.

We just need to check for overflow and fix it to {{Integer.MAX_VALUE}} when it 
occurs. The workaround until then is of course to just set the 
{{max.poll.interval.ms}} to MAX_VALUE - 5000 (5s is the 
JOIN_GROUP_TIMEOUT_LAPSE)

  was:
In older versions of Kafka Streams, the {{max.poll.interval.ms}} config was 
overridden by default to {{Integer.MAX_VALUE}}. Even after we removed this 
override, users of both the plain consumer client and kafka streams still set 
the poll interval to MAX_VALUE somewhat often. Unfortunately, this causes an 
overflow when computing the {{joinGroupTimeoutMs}} and results in it being set 
to the {{request.timeout.ms}} instead, which is much lower.

This can easily make consumers drop out of the group, since they must rejoin 
now within 30s (by default) but have almost no obligation to ever call poll() 
given the high {{max.poll.interval.ms}} – basically they will only do so after 
processing the last record from the previously polled batch. So in heavy 
processing cases, where each record takes a long time to process, or when using 
a very large  {{max.poll.records}}, it can be difficult to make any progress at 
all before dropping out and needing to rejoin. And of course, the rebalance 
that is kicked off upon this member rejoining can result in many of the other 
members in the group dropping out as well, leading to an endless cycle of 
missed rebalances.

We just need to check for overflow and fix it to {{Integer.MAX_VALUE}} when it 
occurs.


> Overflow in joinGroupTimeoutMs when max.poll.interval.ms is MAX_VALUE leads 
> to missing rebalances
> -
>
> Key: KAFKA-13126
> URL: https://issues.apache.org/jira/browse/KAFKA-13126
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer
>Reporter: A. Sophie Blee-Goldman
>Assignee: A. Sophie Blee-Goldman
>Priority: Major
> Fix For: 3.1.0
>
>
> In older versions of Kafka Streams, the {{max.poll.interval.ms}} config was 
> overridden by default to {{Integer.MAX_VALUE}}. Even after we removed this 
> override, users of both the plain consumer client and kafka streams still set 
> the poll interval to MAX_VALUE somewhat often. Unfortunately, this causes an 
> overflow when computing the {{joinGroupTimeoutMs}} and results in it being 
> set to the {{request.timeout.ms}} instead, which is much lower.
> This can easily make consumers drop out of the group, since they must rejoin 
> now within 30s (by default) but have almost no obligation to ever call poll() 
> given the high {{max.poll.interval.ms}} – basically they will only do so 
> after processing the last record from the previously polled batch. So in 
> heavy processing cases, where each record takes a long time to process, or 
> when using a very large  {{max.poll.records}}, it can be difficult to make 
> any progress at all before dropping out and needing to rejoin. And of course, 
> the rebalance that is kicked off upon this member rejoining can result in 
> many of the other members in the group dropping out as well, leading to an 
> endless cycle of missed rebalances.
> We just need to check for overflow and fix it to {{Integer.MAX_VALUE}} when 
> it occurs. The workaround until then is of course to just set the 
> {{max.poll.interval.ms}} to MAX_VALUE - 5000 (5s is the 
> JOIN_GROUP_TIMEOUT_LAPSE)
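
To make the intended fix concrete, here is a minimal sketch of an overflow-safe computation; the constant and method names are illustrative only, not the actual consumer code:
{code:java}
public class JoinGroupTimeoutSketch {
    // assumed name for the 5s lapse mentioned above
    static final int JOIN_GROUP_TIMEOUT_LAPSE_MS = 5000;

    // Widen to long before adding so max.poll.interval.ms = Integer.MAX_VALUE
    // cannot wrap around to a negative int, then clamp back down to MAX_VALUE.
    static int joinGroupTimeoutMs(final int maxPollIntervalMs) {
        final long withLapse = (long) maxPollIntervalMs + JOIN_GROUP_TIMEOUT_LAPSE_MS;
        return (int) Math.min(withLapse, Integer.MAX_VALUE);
    }
}
{code}
Without the widening, Integer.MAX_VALUE + 5000 wraps around to a large negative value, which per the description above is what causes the much lower request.timeout.ms to be used instead.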



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-13126) Overflow in joinGroupTimeoutMs when max.poll.interval.ms is MAX_VALUE leads to missing rebalances

2021-07-22 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman updated KAFKA-13126:
---
Description: 
In older versions of Kafka Streams, the {{max.poll.interval.ms}} config was 
overridden by default to {{Integer.MAX_VALUE}}. Even after we removed this 
override, users of both the plain consumer client and kafka streams still set 
the poll interval to MAX_VALUE somewhat often. Unfortunately, this causes an 
overflow when computing the {{joinGroupTimeoutMs}} and results in it being set 
to the {{request.timeout.ms}} instead, which is much lower.

This can easily make consumers drop out of the group, since they must rejoin 
now within 30s (by default) but have almost no obligation to ever call poll() 
given the high {{max.poll.interval.ms}} – basically they will only do so after 
processing the last record from the previously polled batch. So in heavy 
processing cases, where each record takes a long time to process, or when using 
a very large  {{max.poll.records}}, it can be difficult to make any progress at 
all before dropping out and needing to rejoin. And of course, the rebalance 
that is kicked off upon this member rejoining can result in many of the other 
members in the group dropping out as well, leading to an endless cycle of 
missed rebalances.

We just need to check for overflow and fix it to {{Integer.MAX_VALUE}} when it 
occurs.

  was:
In older versions of Kafka Streams, the {{max.poll.interval.ms}} config was 
overridden by default to {{Integer.MAX_VALUE}}. Even after we removed this 
override, users of both the plain consumer client and kafka streams still set 
the poll interval to MAX_VALUE somewhat often. Unfortunately, this causes an 
overflow when computing the {{joinGroupTimeoutMs}} and results in it being set 
to the {{request.timeout.ms}} instead, which is much lower.

This can easily make consumers drop out of the group, since they must rejoin 
now within 30s (by default) yet have almost no obligation to ever call poll() 
given the high {{max.poll.interval.ms}}. We just need to check for overflow and 
fix it to {{Integer.MAX_VALUE}} when it occurs.


> Overflow in joinGroupTimeoutMs when max.poll.interval.ms is MAX_VALUE leads 
> to missing rebalances
> -
>
> Key: KAFKA-13126
> URL: https://issues.apache.org/jira/browse/KAFKA-13126
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer
>Reporter: A. Sophie Blee-Goldman
>Assignee: A. Sophie Blee-Goldman
>Priority: Major
> Fix For: 3.1.0
>
>
> In older versions of Kafka Streams, the {{max.poll.interval.ms}} config was 
> overridden by default to {{Integer.MAX_VALUE}}. Even after we removed this 
> override, users of both the plain consumer client and kafka streams still set 
> the poll interval to MAX_VALUE somewhat often. Unfortunately, this causes an 
> overflow when computing the {{joinGroupTimeoutMs}} and results in it being 
> set to the {{request.timeout.ms}} instead, which is much lower.
> This can easily make consumers drop out of the group, since they must rejoin 
> now within 30s (by default) but have almost no obligation to ever call poll() 
> given the high {{max.poll.interval.ms}} – basically they will only do so 
> after processing the last record from the previously polled batch. So in 
> heavy processing cases, where each record takes a long time to process, or 
> when using a very large  {{max.poll.records}}, it can be difficult to make 
> any progress at all before dropping out and needing to rejoin. And of course, 
> the rebalance that is kicked off upon this member rejoining can result in 
> many of the other members in the group dropping out as well, leading to an 
> endless cycle of missed rebalances.
> We just need to check for overflow and fix it to {{Integer.MAX_VALUE}} when 
> it occurs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-13126) Overflow in joinGroupTimeoutMs when max.poll.interval.ms is MAX_VALUE leads to missing rebalances

2021-07-22 Thread A. Sophie Blee-Goldman (Jira)
A. Sophie Blee-Goldman created KAFKA-13126:
--

 Summary: Overflow in joinGroupTimeoutMs when max.poll.interval.ms 
is MAX_VALUE leads to missing rebalances
 Key: KAFKA-13126
 URL: https://issues.apache.org/jira/browse/KAFKA-13126
 Project: Kafka
  Issue Type: Bug
  Components: consumer
Reporter: A. Sophie Blee-Goldman
Assignee: A. Sophie Blee-Goldman
 Fix For: 3.1.0


In older versions of Kafka Streams, the {{max.poll.interval.ms}} config was 
overridden by default to {{Integer.MAX_VALUE}}. Even after we removed this 
override, users of both the plain consumer client and kafka streams still set 
the poll interval to MAX_VALUE somewhat often. Unfortunately, this causes an 
overflow when computing the {{joinGroupTimeoutMs}} and results in it being set 
to the {{request.timeout.ms}} instead, which is much lower.

This can easily make consumers drop out of the group, since they must rejoin 
now within 30s (by default) yet have almost no obligation to ever call poll() 
given the high {{max.poll.interval.ms}}. We just need to check for overflow and 
fix it to {{Integer.MAX_VALUE}} when it occurs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-13121) Flaky Test TopicBasedRemoteLogMetadataManagerTest.testNewPartitionUpdates()

2021-07-21 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-13121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17385093#comment-17385093
 ] 

A. Sophie Blee-Goldman commented on KAFKA-13121:


Hey [~jolshan], thought you might have some context on this since it (sort of) 
had to do with topic ids.

The full logs aren't much but I'll upload them in case that helps: 
[^TopicBasedRemoteLogMetadataManagerTest.testNewPartitionUpdates.rtf]

 

> Flaky Test TopicBasedRemoteLogMetadataManagerTest.testNewPartitionUpdates()
> ---
>
> Key: KAFKA-13121
> URL: https://issues.apache.org/jira/browse/KAFKA-13121
> Project: Kafka
>  Issue Type: Bug
>  Components: log
>Reporter: A. Sophie Blee-Goldman
>Priority: Major
> Attachments: 
> TopicBasedRemoteLogMetadataManagerTest.testNewPartitionUpdates.rtf
>
>
> h4. Stack Trace
> {code:java}
> org.apache.kafka.server.log.remote.storage.RemoteResourceNotFoundException: 
> No resource found for partition: 
> TopicIdPartition{topicId=2B9rDu44TE6c8pLG8A0RAg, topicPartition=new-leader-0}
> at 
> org.apache.kafka.server.log.remote.metadata.storage.RemotePartitionMetadataStore.getRemoteLogMetadataCache(RemotePartitionMetadataStore.java:112)
>  
> at 
> org.apache.kafka.server.log.remote.metadata.storage.RemotePartitionMetadataStore.listRemoteLogSegments(RemotePartitionMetadataStore.java:98)
> at 
> org.apache.kafka.server.log.remote.metadata.storage.TopicBasedRemoteLogMetadataManager.listRemoteLogSegments(TopicBasedRemoteLogMetadataManager.java:212)
> at 
> org.apache.kafka.server.log.remote.metadata.storage.TopicBasedRemoteLogMetadataManagerTest.testNewPartitionUpdates(TopicBasedRemoteLogMetadataManagerTest.java:99){code}
> https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-10921/11/testReport/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-13121) Flaky Test TopicBasedRemoteLogMetadataManagerTest.testNewPartitionUpdates()

2021-07-21 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman updated KAFKA-13121:
---
Attachment: 
TopicBasedRemoteLogMetadataManagerTest.testNewPartitionUpdates.rtf

> Flaky Test TopicBasedRemoteLogMetadataManagerTest.testNewPartitionUpdates()
> ---
>
> Key: KAFKA-13121
> URL: https://issues.apache.org/jira/browse/KAFKA-13121
> Project: Kafka
>  Issue Type: Bug
>  Components: log
>Reporter: A. Sophie Blee-Goldman
>Priority: Major
> Attachments: 
> TopicBasedRemoteLogMetadataManagerTest.testNewPartitionUpdates.rtf
>
>
> h4. Stack Trace
> {code:java}
> org.apache.kafka.server.log.remote.storage.RemoteResourceNotFoundException: 
> No resource found for partition: 
> TopicIdPartition{topicId=2B9rDu44TE6c8pLG8A0RAg, topicPartition=new-leader-0}
> at 
> org.apache.kafka.server.log.remote.metadata.storage.RemotePartitionMetadataStore.getRemoteLogMetadataCache(RemotePartitionMetadataStore.java:112)
>  
> at 
> org.apache.kafka.server.log.remote.metadata.storage.RemotePartitionMetadataStore.listRemoteLogSegments(RemotePartitionMetadataStore.java:98)
> at 
> org.apache.kafka.server.log.remote.metadata.storage.TopicBasedRemoteLogMetadataManager.listRemoteLogSegments(TopicBasedRemoteLogMetadataManager.java:212)
> at 
> org.apache.kafka.server.log.remote.metadata.storage.TopicBasedRemoteLogMetadataManagerTest.testNewPartitionUpdates(TopicBasedRemoteLogMetadataManagerTest.java:99){code}
> https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-10921/11/testReport/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-13121) Flaky Test TopicBasedRemoteLogMetadataManagerTest.testNewPartitionUpdates()

2021-07-21 Thread A. Sophie Blee-Goldman (Jira)
A. Sophie Blee-Goldman created KAFKA-13121:
--

 Summary: Flaky Test 
TopicBasedRemoteLogMetadataManagerTest.testNewPartitionUpdates()
 Key: KAFKA-13121
 URL: https://issues.apache.org/jira/browse/KAFKA-13121
 Project: Kafka
  Issue Type: Bug
  Components: log
Reporter: A. Sophie Blee-Goldman


h4. Stack Trace
{code:java}
org.apache.kafka.server.log.remote.storage.RemoteResourceNotFoundException: No 
resource found for partition: TopicIdPartition{topicId=2B9rDu44TE6c8pLG8A0RAg, 
topicPartition=new-leader-0}
at 
org.apache.kafka.server.log.remote.metadata.storage.RemotePartitionMetadataStore.getRemoteLogMetadataCache(RemotePartitionMetadataStore.java:112)
 
at 
org.apache.kafka.server.log.remote.metadata.storage.RemotePartitionMetadataStore.listRemoteLogSegments(RemotePartitionMetadataStore.java:98)
at 
org.apache.kafka.server.log.remote.metadata.storage.TopicBasedRemoteLogMetadataManager.listRemoteLogSegments(TopicBasedRemoteLogMetadataManager.java:212)
at 
org.apache.kafka.server.log.remote.metadata.storage.TopicBasedRemoteLogMetadataManagerTest.testNewPartitionUpdates(TopicBasedRemoteLogMetadataManagerTest.java:99){code}
https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-10921/11/testReport/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-12291) Fix Ignored Upgrade Tests in streams_upgrade_test.py: test_upgrade_downgrade_brokers

2021-07-20 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman updated KAFKA-12291:
---
Fix Version/s: (was: 3.0.0)
   3.1.0

> Fix Ignored Upgrade Tests in streams_upgrade_test.py: 
> test_upgrade_downgrade_brokers
> 
>
> Key: KAFKA-12291
> URL: https://issues.apache.org/jira/browse/KAFKA-12291
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams, system tests
>Reporter: Bruno Cadonna
>Priority: Blocker
> Fix For: 3.1.0
>
>
> Fix in the oldest branch that ignores the test and cherry-pick forward.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-13010) Flaky test org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation()

2021-07-20 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman updated KAFKA-13010:
---
Fix Version/s: (was: 3.0.1)
   3.1.0

> Flaky test 
> org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation()
> ---
>
> Key: KAFKA-13010
> URL: https://issues.apache.org/jira/browse/KAFKA-13010
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: Bruno Cadonna
>Assignee: Walker Carlson
>Priority: Major
>  Labels: flaky-test
> Fix For: 3.1.0
>
> Attachments: 
> TaskMetadataIntegrationTest#shouldReportCorrectEndOffsetInformation.rtf
>
>
> Integration test {{test 
> org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation()}}
>  sometimes fails with
> {code:java}
> java.lang.AssertionError: only one task
>   at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:26)
>   at 
> org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.getTaskMetadata(TaskMetadataIntegrationTest.java:162)
>   at 
> org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation(TaskMetadataIntegrationTest.java:117)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-13010) Flaky test org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation()

2021-07-19 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-13010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383504#comment-17383504
 ] 

A. Sophie Blee-Goldman commented on KAFKA-13010:


Yeah it's honestly pretty hard to imagine how that could have introduced a bug 
like this, but the timing definitely is suspicious. Seeing as Walker was able 
to reproduce it after a "reasonable" number of retries, it should be easy to 
confirm the suspicion by just running the test again with sufficient repeats on 
the commit just before KIP-744 was merged.

> Flaky test 
> org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation()
> ---
>
> Key: KAFKA-13010
> URL: https://issues.apache.org/jira/browse/KAFKA-13010
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: Bruno Cadonna
>Assignee: Walker Carlson
>Priority: Major
>  Labels: flaky-test
> Attachments: 
> TaskMetadataIntegrationTest#shouldReportCorrectEndOffsetInformation.rtf
>
>
> Integration test {{test 
> org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation()}}
>  sometimes fails with
> {code:java}
> java.lang.AssertionError: only one task
>   at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:26)
>   at 
> org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.getTaskMetadata(TaskMetadataIntegrationTest.java:162)
>   at 
> org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation(TaskMetadataIntegrationTest.java:117)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-13010) Flaky test org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation()

2021-07-19 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-13010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383445#comment-17383445
 ] 

A. Sophie Blee-Goldman commented on KAFKA-13010:


It's worth noting that this test seemed to be pretty stable for roughly a full 
release cycle, then started failing pretty frequently right after 
[KIP-744|https://cwiki.apache.org/confluence/display/KAFKA/KIP-744%3A+Migrate+TaskMetadata+and+ThreadMetadata+to+an+interface+with+internal+implementation]
 / KAFKA-12849 was merged (June 25th). [~wcarlson5] I wonder if the changes to 
the metadata hierarchy could have introduced a potentially serious bug?

> Flaky test 
> org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation()
> ---
>
> Key: KAFKA-13010
> URL: https://issues.apache.org/jira/browse/KAFKA-13010
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: Bruno Cadonna
>Assignee: Walker Carlson
>Priority: Major
>  Labels: flaky-test
> Attachments: 
> TaskMetadataIntegrationTest#shouldReportCorrectEndOffsetInformation.rtf
>
>
> Integration test {{test 
> org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation()}}
>  sometimes fails with
> {code:java}
> java.lang.AssertionError: only one task
>   at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:26)
>   at 
> org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.getTaskMetadata(TaskMetadataIntegrationTest.java:162)
>   at 
> org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation(TaskMetadataIntegrationTest.java:117)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-13096) QueryableStoreProvider is not updated when threads are added/removed/replaced rendering IQ impossible

2021-07-15 Thread A. Sophie Blee-Goldman (Jira)
A. Sophie Blee-Goldman created KAFKA-13096:
--

 Summary: QueryableStoreProvider is not updated when threads are 
added/removed/replaced rendering IQ impossible
 Key: KAFKA-13096
 URL: https://issues.apache.org/jira/browse/KAFKA-13096
 Project: Kafka
  Issue Type: Bug
  Components: streams
Affects Versions: 2.8.0
Reporter: A. Sophie Blee-Goldman
 Fix For: 3.0.0, 2.8.1


The QueryableStoreProvider class is used to route queries to the correct state 
store on the owning StreamThread, making it a critical piece of IQ. It gets 
instantiated when you create a new KafkaStreams and is passed a list of 
StreamThreadStateStoreProviders, which it then copies and stores. Because it 
only stores a copy, it only ever contains providers for the StreamThreads that 
were created during the app's startup, and unfortunately it is never updated 
during an add/remove/replace thread event. 

This means that IQ can’t get a handle on any stores that belong to a thread 
that wasn’t in the original set. If the app is starting up new threads through 
the #addStreamThread API or following a REPLACE_THREAD event, none of the data 
in any of the stores owned by that new thread will be accessible by IQ. If a 
user is removing threads through #removeStreamThread, or threads die and get 
replaced, you can fall into an endless loop of {{InvalidStateStoreException}} 
from doing a lookup into stores that have been closed since the thread was 
removed/died.

If over time all of the original threads are removed or replaced, then IQ won’t 
work at all.
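
Reduced to a sketch (the class shape and method names here are simplified and hypothetical, not the actual implementation), the problem looks roughly like this:
{code:java}
import java.util.ArrayList;
import java.util.List;

// Stand-in for the real provider type mentioned above, only to keep the sketch self-contained.
interface StreamThreadStateStoreProvider { }

// Sketch of the problem: the provider list is snapshotted once at construction,
// so providers for threads that are added, removed, or replaced later are never
// reflected, and IQ keeps routing queries through the stale snapshot.
class QueryableStoreProviderSketch {
    private final List<StreamThreadStateStoreProvider> storeProviders;

    QueryableStoreProviderSketch(final List<StreamThreadStateStoreProvider> providers) {
        this.storeProviders = new ArrayList<>(providers); // copy taken once at startup
    }

    List<StreamThreadStateStoreProvider> providersForQuery() {
        return storeProviders; // always the original snapshot
    }

    // A fix needs hooks along these lines (hypothetical names), invoked from
    // addStreamThread()/removeStreamThread() and the thread-replacement path:
    // void addStoreProviderForThread(String threadName, StreamThreadStateStoreProvider provider);
    // void removeStoreProviderForThread(String threadName);
}
{code}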



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-12896) Group rebalance loop caused by repeated group leader JoinGroups

2021-07-15 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman resolved KAFKA-12896.

Resolution: Fixed

> Group rebalance loop caused by repeated group leader JoinGroups
> ---
>
> Key: KAFKA-12896
> URL: https://issues.apache.org/jira/browse/KAFKA-12896
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Affects Versions: 2.6.0
>Reporter: Lucas Bradstreet
>Assignee: David Jacot
>Priority: Blocker
> Fix For: 3.0.0
>
>
> We encountered a strange case of a rebalance loop with the 
> "cooperative-sticky" assignor. The logs show the following for several hours:
>  
> {{Apr 7, 2021 @ 03:58:36.040  [GroupCoordinator 7]: Stabilized group mygroup 
> generation 19830137 (__consumer_offsets-7)}}
> {{Apr 7, 2021 @ 03:58:35.992  [GroupCoordinator 7]: Preparing to rebalance 
> group mygroup in state PreparingRebalance with old generation 19830136 
> (__consumer_offsets-7) (reason: Updating metadata for member 
> mygroup-1-7ad27e07-3784-4588-97e1-d796a74d4ecc during CompletingRebalance)}}
> {{Apr 7, 2021 @ 03:58:35.988  [GroupCoordinator 7]: Stabilized group mygroup 
> generation 19830136 (__consumer_offsets-7)}}
> {{Apr 7, 2021 @ 03:58:35.972  [GroupCoordinator 7]: Preparing to rebalance 
> group mygroup in state PreparingRebalance with old generation 19830135 
> (__consumer_offsets-7) (reason: Updating metadata for member mygroup during 
> CompletingRebalance)}}
> {{Apr 7, 2021 @ 03:58:35.965  [GroupCoordinator 7]: Stabilized group mygroup 
> generation 19830135 (__consumer_offsets-7)}}
> {{Apr 7, 2021 @ 03:58:35.953  [GroupCoordinator 7]: Preparing to rebalance 
> group mygroup in state PreparingRebalance with old generation 19830134 
> (__consumer_offsets-7) (reason: Updating metadata for member 
> mygroup-7ad27e07-3784-4588-97e1-d796a74d4ecc during CompletingRebalance)}}
> {{Apr 7, 2021 @ 03:58:35.941  [GroupCoordinator 7]: Stabilized group mygroup 
> generation 19830134 (__consumer_offsets-7)}}
> {{Apr 7, 2021 @ 03:58:35.926  [GroupCoordinator 7]: Preparing to rebalance 
> group mygroup in state PreparingRebalance with old generation 19830133 
> (__consumer_offsets-7) (reason: Updating metadata for member mygroup during 
> CompletingRebalance)}}
> Every single time, it was the same member that triggered the JoinGroup and it 
> was always the leader of the group.
> The leader has the privilege of being able to trigger a rebalance by sending 
> `JoinGroup` even if its subscription metadata has not changed. But why would 
> it do so?
> It is possible that this is due to the same issue or a similar bug to 
> https://issues.apache.org/jira/browse/KAFKA-12890.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-12896) Group rebalance loop caused by repeated group leader JoinGroups

2021-07-15 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381120#comment-17381120
 ] 

A. Sophie Blee-Goldman commented on KAFKA-12896:


Yep, that should do it

> Group rebalance loop caused by repeated group leader JoinGroups
> ---
>
> Key: KAFKA-12896
> URL: https://issues.apache.org/jira/browse/KAFKA-12896
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Affects Versions: 2.6.0
>Reporter: Lucas Bradstreet
>Assignee: David Jacot
>Priority: Blocker
> Fix For: 3.0.0
>
>
> We encountered a strange case of a rebalance loop with the 
> "cooperative-sticky" assignor. The logs show the following for several hours:
>  
> {{Apr 7, 2021 @ 03:58:36.040  [GroupCoordinator 7]: Stabilized group mygroup 
> generation 19830137 (__consumer_offsets-7)}}
> {{Apr 7, 2021 @ 03:58:35.992  [GroupCoordinator 7]: Preparing to rebalance 
> group mygroup in state PreparingRebalance with old generation 19830136 
> (__consumer_offsets-7) (reason: Updating metadata for member 
> mygroup-1-7ad27e07-3784-4588-97e1-d796a74d4ecc during CompletingRebalance)}}
> {{Apr 7, 2021 @ 03:58:35.988  [GroupCoordinator 7]: Stabilized group mygroup 
> generation 19830136 (__consumer_offsets-7)}}
> {{Apr 7, 2021 @ 03:58:35.972  [GroupCoordinator 7]: Preparing to rebalance 
> group mygroup in state PreparingRebalance with old generation 19830135 
> (__consumer_offsets-7) (reason: Updating metadata for member mygroup during 
> CompletingRebalance)}}
> {{Apr 7, 2021 @ 03:58:35.965  [GroupCoordinator 7]: Stabilized group mygroup 
> generation 19830135 (__consumer_offsets-7)}}
> {{Apr 7, 2021 @ 03:58:35.953  [GroupCoordinator 7]: Preparing to rebalance 
> group mygroup in state PreparingRebalance with old generation 19830134 
> (__consumer_offsets-7) (reason: Updating metadata for member 
> mygroup-7ad27e07-3784-4588-97e1-d796a74d4ecc during CompletingRebalance)}}
> {{Apr 7, 2021 @ 03:58:35.941  [GroupCoordinator 7]: Stabilized group mygroup 
> generation 19830134 (__consumer_offsets-7)}}
> {{Apr 7, 2021 @ 03:58:35.926  [GroupCoordinator 7]: Preparing to rebalance 
> group mygroup in state PreparingRebalance with old generation 19830133 
> (__consumer_offsets-7) (reason: Updating metadata for member mygroup during 
> CompletingRebalance)}}
> Every single time, it was the same member that triggered the JoinGroup and it 
> was always the leader of the group.
> The leader has the privilege of being able to trigger a rebalance by sending 
> `JoinGroup` even if its subscription metadata has not changed. But why would 
> it do so?
> It is possible that this is due to the same issue or a similar bug to 
> https://issues.apache.org/jira/browse/KAFKA-12890.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-13010) Flaky test org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation()

2021-07-14 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-13010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17380939#comment-17380939
 ] 

A. Sophie Blee-Goldman commented on KAFKA-13010:


Not the exact same test, but I did manage to reproduce this same "only one 
task" failure in 
TaskMetadataIntegrationTest.shouldReportCorrectEndOffsetInformation:
{code:java}
java.lang.AssertionError: only one task
at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:26)
at 
org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.getTaskMetadata(TaskMetadataIntegrationTest.java:163)
at 
org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectEndOffsetInformation(TaskMetadataIntegrationTest.java:144)
{code}
I saved the logs since they should actually be the full, un-truncated logs – 
hope this helps: 
[^TaskMetadataIntegrationTest#shouldReportCorrectEndOffsetInformation.rtf]

> Flaky test 
> org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation()
> ---
>
> Key: KAFKA-13010
> URL: https://issues.apache.org/jira/browse/KAFKA-13010
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: Bruno Cadonna
>Priority: Major
>  Labels: flaky-test
> Attachments: 
> TaskMetadataIntegrationTest#shouldReportCorrectEndOffsetInformation.rtf
>
>
> Integration test {{test 
> org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation()}}
>  sometimes fails with
> {code:java}
> java.lang.AssertionError: only one task
>   at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:26)
>   at 
> org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.getTaskMetadata(TaskMetadataIntegrationTest.java:162)
>   at 
> org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation(TaskMetadataIntegrationTest.java:117)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-13010) Flaky test org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation()

2021-07-14 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman updated KAFKA-13010:
---
Attachment: 
TaskMetadataIntegrationTest#shouldReportCorrectEndOffsetInformation.rtf

> Flaky test 
> org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation()
> ---
>
> Key: KAFKA-13010
> URL: https://issues.apache.org/jira/browse/KAFKA-13010
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: Bruno Cadonna
>Priority: Major
>  Labels: flaky-test
> Attachments: 
> TaskMetadataIntegrationTest#shouldReportCorrectEndOffsetInformation.rtf
>
>
> Integration test {{test 
> org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation()}}
>  sometimes fails with
> {code:java}
> java.lang.AssertionError: only one task
>   at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:26)
>   at 
> org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.getTaskMetadata(TaskMetadataIntegrationTest.java:162)
>   at 
> org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation(TaskMetadataIntegrationTest.java:117)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-13010) Flaky test org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation()

2021-07-14 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-13010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17380838#comment-17380838
 ] 

A. Sophie Blee-Goldman commented on KAFKA-13010:


[~wcarl...@confluent.io] I'm guessing you wrote this test, so you have the most 
context. Can you reproduce this locally and take a minute or two to look 
through the logs and see if anything jumps out at you? 

> Flaky test 
> org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation()
> ---
>
> Key: KAFKA-13010
> URL: https://issues.apache.org/jira/browse/KAFKA-13010
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: Bruno Cadonna
>Priority: Major
>  Labels: flaky-test
>
> Integration test {{test 
> org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation()}}
>  sometimes fails with
> {code:java}
> java.lang.AssertionError: only one task
>   at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:26)
>   at 
> org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.getTaskMetadata(TaskMetadataIntegrationTest.java:162)
>   at 
> org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation(TaskMetadataIntegrationTest.java:117)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-13037) "Thread state is already PENDING_SHUTDOWN" log spam

2021-07-13 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-13037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17380295#comment-17380295
 ] 

A. Sophie Blee-Goldman commented on KAFKA-13037:


Thanks for the fix – btw, I added you as a contributor so you should be able to 
self-assign tickets from now on. 

> "Thread state is already PENDING_SHUTDOWN" log spam
> ---
>
> Key: KAFKA-13037
> URL: https://issues.apache.org/jira/browse/KAFKA-13037
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.8.0, 2.7.1
>Reporter: John Gray
>Assignee: John Gray
>Priority: Major
> Fix For: 3.0.0, 2.8.1
>
>
> KAFKA-12462 introduced a 
> [change|https://github.com/apache/kafka/commit/4fe4cdc4a61cbac8e070a8b5514403235194015b#diff-76f629d0df8bd30b2593cbcf4a2dc80de3167ebf55ef8b5558e6e6285a057496R722]
>  that increased this "Thread state is already {}" logger to info instead of 
> debug. We are running into a problem with our streams apps: when they hit an 
> unrecoverable exception that shuts down the streams thread, we get this log 
> printed about 50,000 times per second per thread. I am guessing it is once 
> per record we have queued up when the exception happens? We have temporarily 
> raised the StreamThread logger to WARN instead of INFO to avoid the spam, but 
> we do miss the other good logs we get on INFO in that class. Could this log 
> be reverted back to debug? Thank you! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (KAFKA-13037) "Thread state is already PENDING_SHUTDOWN" log spam

2021-07-13 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman reassigned KAFKA-13037:
--

Assignee: John Gray

> "Thread state is already PENDING_SHUTDOWN" log spam
> ---
>
> Key: KAFKA-13037
> URL: https://issues.apache.org/jira/browse/KAFKA-13037
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.8.0, 2.7.1
>Reporter: John Gray
>Assignee: John Gray
>Priority: Major
> Fix For: 3.0.0, 2.8.1
>
>
> KAFKA-12462 introduced a 
> [change|https://github.com/apache/kafka/commit/4fe4cdc4a61cbac8e070a8b5514403235194015b#diff-76f629d0df8bd30b2593cbcf4a2dc80de3167ebf55ef8b5558e6e6285a057496R722]
>  that increased this "Thread state is already {}" logger to info instead of 
> debug. We are running into a problem with our streams apps: when they hit an 
> unrecoverable exception that shuts down the streams thread, we get this log 
> printed about 50,000 times per second per thread. I am guessing it is once 
> per record we have queued up when the exception happens? We have temporarily 
> raised the StreamThread logger to WARN instead of INFO to avoid the spam, but 
> we do miss the other good logs we get on INFO in that class. Could this log 
> be reverted back to debug? Thank you! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-13081) Port sticky assignor fixes (KAFKA-12984) back to 2.8

2021-07-13 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman updated KAFKA-13081:
---
Component/s: consumer

> Port sticky assignor fixes (KAFKA-12984) back to 2.8
> 
>
> Key: KAFKA-13081
> URL: https://issues.apache.org/jira/browse/KAFKA-13081
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer
>Reporter: A. Sophie Blee-Goldman
>Assignee: Luke Chen
>Priority: Blocker
> Fix For: 2.8.1
>
>
> We should make sure that fix #1 and #2 of 
> [#10985|https://github.com/apache/kafka/pull/10985] make it back to the 2.8 
> sticky assignor, since it's pretty much impossible to smoothly cherrypick 
> that commit from 3.0 to 2.8 due to all the recent improvements and 
> refactoring in the AbstractStickyAssignor. Either we can just extract and 
> apply those two fixes to 2.8 directly, or go back and port all the commits 
> that made this cherrypick difficult over to 2.8 as well. If we do so then 
> cherrypicking the original commit should be easy.
> See [this 
> comment|https://github.com/apache/kafka/pull/10985#issuecomment-879521196]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-13081) Port sticky assignor fixes (KAFKA-12984) back to 2.8

2021-07-13 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-13081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17380251#comment-17380251
 ] 

A. Sophie Blee-Goldman commented on KAFKA-13081:


Personally I would probably advocate for just applying the two fixes to 2.8 
rather than porting over the full set of improvements and refactorings. While 
the improvements are great to have, they're not strictly necessary and 
relatively complex, so porting them back carries some risk in case of 
as-yet-undiscovered bugs that were introduced during the large-scale 
refactoring of this assignor. Given that it's hard to be 100% confident in the 
correctness of such a complicated algorithm, regardless of how good the test 
coverage is, I would prefer to just apply the minimal required fixes and leave 
2.8 as a "safe" branch. That way we can fall back to it in case of unexpected 
issues in the 3.0 assignor, and point users who want to downgrade to the last 
stable version with a good "cooperative-sticky"/"sticky" assignor, which they 
can continue to use until the new assignor is stabilized.

Just my 2 cents

> Port sticky assignor fixes (KAFKA-12984) back to 2.8
> 
>
> Key: KAFKA-13081
> URL: https://issues.apache.org/jira/browse/KAFKA-13081
> Project: Kafka
>  Issue Type: Bug
>Reporter: A. Sophie Blee-Goldman
>Assignee: Luke Chen
>Priority: Blocker
> Fix For: 2.8.1
>
>
> We should make sure that fix #1 and #2 of 
> [#10985|https://github.com/apache/kafka/pull/10985] make it back to the 2.8 
> sticky assignor, since it's pretty much impossible to smoothly cherrypick 
> that commit from 3.0 to 2.8 due to all the recent improvements and 
> refactoring in the AbstractStickyAssignor. Either we can just extract and 
> apply those two fixes to 2.8 directly, or go back and port all the commits 
> that made this cherrypick difficult over to 2.8 as well. If we do so then 
> cherrypicking the original commit should be easy.
> See [this 
> comment|https://github.com/apache/kafka/pull/10985#issuecomment-879521196]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-13081) Port sticky assignor fixes (KAFKA-12984) back to 2.8

2021-07-13 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman updated KAFKA-13081:
---
Description: 
We should make sure that fix #1 and #2 of 
[#10985|https://github.com/apache/kafka/pull/10985] make it back to the 2.8 
sticky assignor, since it's pretty much impossible to smoothly cherrypick that 
commit from 3.0 to 2.8 due to all the recent improvements and refactoring in 
the AbstractStickyAssignor. Either we can just extract and apply those two 
fixes to 2.8 directly, or go back and port all the commits that made this 
cherrypick difficult over to 2.8 as well. If we do so then cherrypicking the 
original commit should be easy.

See [this 
comment|https://github.com/apache/kafka/pull/10985#issuecomment-879521196]

  was:We should make sure that fix #1 and #2 of 
[#10985|https://github.com/apache/kafka/pull/10985] make it back to the 2.8 
sticky assignor, since it's pretty much impossible to smoothly cherrypick that 
commit from 3.0 to 2.8 due to all the recent improvements and refactoring in 
the AbstractStickyAssignor. Either we can just extract and apply those two 
fixes to 2.8 directly, or go back and port all the commits that made this 
cherrypick difficult over to 2.8 as well. If we do so then cherrypicking the 
original commit should be easy


> Port sticky assignor fixes (KAFKA-12984) back to 2.8
> 
>
> Key: KAFKA-13081
> URL: https://issues.apache.org/jira/browse/KAFKA-13081
> Project: Kafka
>  Issue Type: Bug
>Reporter: A. Sophie Blee-Goldman
>Priority: Blocker
> Fix For: 2.8.1
>
>
> We should make sure that fix #1 and #2 of 
> [#10985|https://github.com/apache/kafka/pull/10985] make it back to the 2.8 
> sticky assignor, since it's pretty much impossible to smoothly cherrypick 
> that commit from 3.0 to 2.8 due to all the recent improvements and 
> refactoring in the AbstractStickyAssignor. Either we can just extract and 
> apply those two fixes to 2.8 directly, or go back and port all the commits 
> that made this cherrypick difficult over to 2.8 as well. If we do so then 
> cherrypicking the original commit should be easy.
> See [this 
> comment|https://github.com/apache/kafka/pull/10985#issuecomment-879521196]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-13081) Port sticky assignor fixes (KAFKA-12984) back to 2.8

2021-07-13 Thread A. Sophie Blee-Goldman (Jira)
A. Sophie Blee-Goldman created KAFKA-13081:
--

 Summary: Port sticky assignor fixes (KAFKA-12984) back to 2.8
 Key: KAFKA-13081
 URL: https://issues.apache.org/jira/browse/KAFKA-13081
 Project: Kafka
  Issue Type: Bug
Reporter: A. Sophie Blee-Goldman
 Fix For: 2.8.1


We should make sure that fix #1 and #2 of 
[#10985|https://github.com/apache/kafka/pull/10985] make it back to the 2.8 
sticky assignor, since it's pretty much impossible to smoothly cherrypick that 
commit from 3.0 to 2.8 due to all the recent improvements and refactoring in 
the AbstractStickyAssignor. Either we can just extract and apply those two 
fixes to 2.8 directly, or go back and port all the commits that made this 
cherrypick difficult over to 2.8 as well. If we do so then cherrypicking the 
original commit should be easy



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-12984) Cooperative sticky assignor can get stuck with invalid SubscriptionState input metadata

2021-07-13 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman updated KAFKA-12984:
---
Fix Version/s: (was: 2.8.1)

> Cooperative sticky assignor can get stuck with invalid SubscriptionState 
> input metadata
> ---
>
> Key: KAFKA-12984
> URL: https://issues.apache.org/jira/browse/KAFKA-12984
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer
>Reporter: A. Sophie Blee-Goldman
>Assignee: A. Sophie Blee-Goldman
>Priority: Blocker
> Fix For: 3.0.0
>
>
> Some users have reported seeing their consumer group become stuck in the 
> CompletingRebalance phase when using the cooperative-sticky assignor. Based 
> on the request metadata we were able to deduce that multiple consumers were 
> reporting the same partition(s) in their "ownedPartitions" field of the 
> consumer protocol. Since this is an invalid state, the input causes the 
> cooperative-sticky assignor to detect that something is wrong and throw an 
> IllegalStateException. If the consumer application is set up to simply retry, 
> this will cause the group to appear to hang in the rebalance state.
> The "ownedPartitions" field is encoded based on the ConsumerCoordinator's 
> SubscriptionState, which was assumed to always be up to date. However there 
> may be cases where the consumer has dropped out of the group but fails to 
> clear the SubscriptionState, allowing it to report some partitions as owned 
> that have since been reassigned to another member.
> We should (a) fix the sticky assignment algorithm to resolve cases of 
> improper input conditions by invalidating the "ownedPartitions" in cases of 
> double ownership, and (b) shore up the ConsumerCoordinator logic to better 
> handle rejoining the group and keeping its internal state consistent. See 
> KAFKA-12983 for more details on (b)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (KAFKA-13010) Flaky test org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation()

2021-07-13 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-13010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17380212#comment-17380212
 ] 

A. Sophie Blee-Goldman edited comment on KAFKA-13010 at 7/13/21, 10:39 PM:
---

Failed again 
[https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-10985/6/testReport/junit/org.apache.kafka.streams.integration/TaskMetadataIntegrationTest/Build___JDK_11_and_Scala_2_13___shouldReportCorrectCommittedOffsetInformation/]

FWIW I didn't see the above log message about that subscribed topic not being 
assigned to any members. The logs were truncated so it's possible that it 
actually was there, but I don't think that's the case since AFAICT the 
truncated logs are mostly from kafka/zookeeper. The relevant logs from the 
rebalance seem to be present


was (Author: ableegoldman):
Failed again 
https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-10985/6/testReport/junit/org.apache.kafka.streams.integration/TaskMetadataIntegrationTest/Build___JDK_11_and_Scala_2_13___shouldReportCorrectCommittedOffsetInformation/

> Flaky test 
> org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation()
> ---
>
> Key: KAFKA-13010
> URL: https://issues.apache.org/jira/browse/KAFKA-13010
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: Bruno Cadonna
>Priority: Major
>  Labels: flaky-test
>
> Integration test {{test 
> org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation()}}
>  sometimes fails with
> {code:java}
> java.lang.AssertionError: only one task
>   at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:26)
>   at 
> org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.getTaskMetadata(TaskMetadataIntegrationTest.java:162)
>   at 
> org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation(TaskMetadataIntegrationTest.java:117)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-13010) Flaky test org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation()

2021-07-13 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-13010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17380212#comment-17380212
 ] 

A. Sophie Blee-Goldman commented on KAFKA-13010:


Failed again 
https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-10985/6/testReport/junit/org.apache.kafka.streams.integration/TaskMetadataIntegrationTest/Build___JDK_11_and_Scala_2_13___shouldReportCorrectCommittedOffsetInformation/

> Flaky test 
> org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation()
> ---
>
> Key: KAFKA-13010
> URL: https://issues.apache.org/jira/browse/KAFKA-13010
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: Bruno Cadonna
>Priority: Major
>  Labels: flaky-test
>
> Integration test {{test 
> org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation()}}
>  sometimes fails with
> {code:java}
> java.lang.AssertionError: only one task
>   at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:26)
>   at 
> org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.getTaskMetadata(TaskMetadataIntegrationTest.java:162)
>   at 
> org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation(TaskMetadataIntegrationTest.java:117)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-13075) Consolidate RocksDBStoreTest and RocksDBKeyValueStoreTest

2021-07-13 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman resolved KAFKA-13075.

Fix Version/s: 3.1.0
   Resolution: Fixed

> Consolidate RocksDBStoreTest and RocksDBKeyValueStoreTest
> -
>
> Key: KAFKA-13075
> URL: https://issues.apache.org/jira/browse/KAFKA-13075
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: A. Sophie Blee-Goldman
>Assignee: Chun-Hao Tang
>Priority: Major
>  Labels: newbie, newbie++
> Fix For: 3.1.0
>
>
> Looks like we have two different test classes covering pretty much the same 
> thing: RocksDBStore. It seems like RocksDBKeyValueStoreTest was the original 
> test class for RocksDBStore, but someone later added RocksDBStoreTest, most 
> likely because they didn't notice the RocksDBKeyValueStoreTest which didn't 
> follow the usual naming scheme for test classes. 
> We should consolidate these two into a single file, ideally retaining the 
> RocksDBStoreTest name since that conforms to the test naming pattern used 
> throughout Streams (and so this same thing doesn't happen again). It should 
> also extend AbstractKeyValueStoreTest like the RocksDBKeyValueStoreTest 
> currently does so we continue to get the benefit of all the tests in there as 
> well



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-13075) Consolidate RocksDBStoreTest and RocksDBKeyValueStoreTest

2021-07-13 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman updated KAFKA-13075:
---
Summary: Consolidate RocksDBStoreTest and RocksDBKeyValueStoreTest  (was: 
Consolidate RocksDBStore and RocksDBKeyValueStoreTest)

> Consolidate RocksDBStoreTest and RocksDBKeyValueStoreTest
> -
>
> Key: KAFKA-13075
> URL: https://issues.apache.org/jira/browse/KAFKA-13075
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: A. Sophie Blee-Goldman
>Assignee: Chun-Hao Tang
>Priority: Major
>  Labels: newbie, newbie++
>
> Looks like we have two different test classes covering pretty much the same 
> thing: RocksDBStore. It seems like RocksDBKeyValueStoreTest was the original 
> test class for RocksDBStore, but someone later added RocksDBStoreTest, most 
> likely because they didn't notice the RocksDBKeyValueStoreTest which didn't 
> follow the usual naming scheme for test classes. 
> We should consolidate these two into a single file, ideally retaining the 
> RocksDBStoreTest name since that conforms to the test naming pattern used 
> throughout Streams (and so this same thing doesn't happen again). It should 
> also extend AbstractKeyValueStoreTest like the RocksDBKeyValueStoreTest 
> currently does so we continue to get the benefit of all the tests in there as 
> well



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-8529) Flakey test ConsumerBounceTest#testCloseDuringRebalance

2021-07-13 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17380138#comment-17380138
 ] 

A. Sophie Blee-Goldman commented on KAFKA-8529:
---

This has been failing _very_ frequently as of late, for example 
[https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-11009/4/tests].

I took a brief look at the logs and it appears to be an issue with the 
connection, or possibly something to do with the incremental fetch (not sure if 
this warning is a red herring or not):

 
{code:java}
[2021-07-13 12:31:46,774] WARN [ReplicaFetcher replicaId=1, leaderId=2, 
fetcherId=0] Error in response for fetch request (type=FetchRequest, 
replicaId=1, maxWait=500, minBytes=1, maxBytes=10485760, 
fetchData={closetest-1=PartitionData(fetchOffset=0, logStartOffset=0, 
maxBytes=1048576, currentLeaderEpoch=Optional[0], 
lastFetchedEpoch=Optional.empty), closetest-7=PartitionData(fetchOffset=0, 
logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[0], 
lastFetchedEpoch=Optional.empty), closetest-4=PartitionData(fetchOffset=0, 
logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[0], 
lastFetchedEpoch=Optional.empty)}, isolationLevel=READ_UNCOMMITTED, toForget=, 
metadata=(sessionId=1829936913, epoch=1), rackId=) 
(kafka.server.ReplicaFetcherThread:72)org.apache.kafka.common.errors.FetchSessionTopicIdException:
 The fetch session encountered inconsistent topic ID usage[2021-07-13 
12:31:47,002] WARN [ReplicaFetcher replicaId=1, leaderId=0, fetcherId=0] Error 
in response for fetch request (type=FetchRequest, replicaId=1, maxWait=500, 
minBytes=1, maxBytes=10485760, fetchData={}, isolationLevel=READ_UNCOMMITTED, 
toForget=, metadata=(sessionId=2081224068, epoch=1), rackId=) 
(kafka.server.ReplicaFetcherThread:72)org.apache.kafka.common.errors.FetchSessionTopicIdException:
 The fetch session encountered inconsistent topic ID usage[2021-07-13 
12:31:48,545] WARN [Consumer clientId=ConsumerTestConsumer, groupId=group1] 
Close timed out with 3 pending requests to coordinator, terminating client 
connections 
(org.apache.kafka.clients.consumer.internals.ConsumerCoordinator:1024)[2021-07-13
 12:31:48,549] WARN [ReplicaFetcher replicaId=0, leaderId=2, fetcherId=0] Error 
in response for fetch request (type=FetchRequest, replicaId=0, maxWait=500, 
minBytes=1, maxBytes=10485760, 
fetchData={closetest-1=PartitionData(fetchOffset=0, logStartOffset=0, 
maxBytes=1048576, currentLeaderEpoch=Optional[0], 
lastFetchedEpoch=Optional.empty), closetest-7=PartitionData(fetchOffset=0, 
logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[0], 
lastFetchedEpoch=Optional.empty), topic-1=PartitionData(fetchOffset=0, 
logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[0], 
lastFetchedEpoch=Optional.empty), closetest-4=PartitionData(fetchOffset=0, 
logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[0], 
lastFetchedEpoch=Optional.empty)}, isolationLevel=READ_UNCOMMITTED, toForget=, 
metadata=(sessionId=1532743228, epoch=INITIAL), rackId=) 
(kafka.server.ReplicaFetcherThread:72)java.io.IOException: Connection to 2 was 
disconnected before the response was read 
at 
org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:100)
at 
kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:109)
at 
kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:219)
at 
kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:313)
at 
kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3(AbstractFetcherThread.scala:137)
at 
kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3$adapted(AbstractFetcherThread.scala:136)
at scala.Option.foreach(Option.scala:437) at 
kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:136)
at 
kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:119)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:96)
{code}
 

Wondering if we should bump the priority on this? Can someone more 
familiar with this test maybe take a few minutes to glance over these logs and 
chime in on whether this is potentially concerning or not? cc [~hachikuji] 
[~cmccabe]

> Flakey test ConsumerBounceTest#testCloseDuringRebalance
> ---
>
> Key: KAFKA-8529
> URL: https://issues.apache.org/jira/browse/KAFKA-8529
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer
>Reporter: Boyang Chen
>Priority: Major
>
> [https://builds.apache.org/job/kafka-pr-jdk11-scala2.12/5450/consoleFull]
>  
> *16:16:10* kafka.api.ConsumerBounceTest > testCloseDuringRebalance 
> STARTED*16:16:22* kafka.api.ConsumerBounceTest.testCloseDuringRebalance 
> failed, 

[jira] [Comment Edited] (KAFKA-8529) Flakey test ConsumerBounceTest#testCloseDuringRebalance

2021-07-13 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17380138#comment-17380138
 ] 

A. Sophie Blee-Goldman edited comment on KAFKA-8529 at 7/13/21, 7:57 PM:
-

This has been failing _very_ frequently as of late, for example 
[https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-11009/4/tests].

I took a brief look at the logs and it appears to be an issue with the 
connection, or possibly something to do with the incremental fetch (not sure if 
the *FetchSessionTopicIdException* is a red herring or not):
{code:java}
[2021-07-13 12:31:46,774] WARN [ReplicaFetcher replicaId=1, leaderId=2, 
fetcherId=0] Error in response for fetch request (type=FetchRequest, 
replicaId=1, maxWait=500, minBytes=1, maxBytes=10485760, 
fetchData={closetest-1=PartitionData(fetchOffset=0, logStartOffset=0, 
maxBytes=1048576, currentLeaderEpoch=Optional[0], 
lastFetchedEpoch=Optional.empty), closetest-7=PartitionData(fetchOffset=0, 
logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[0], 
lastFetchedEpoch=Optional.empty), closetest-4=PartitionData(fetchOffset=0, 
logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[0], 
lastFetchedEpoch=Optional.empty)}, isolationLevel=READ_UNCOMMITTED, toForget=, 
metadata=(sessionId=1829936913, epoch=1), rackId=) 
(kafka.server.ReplicaFetcherThread:72)org.apache.kafka.common.errors.FetchSessionTopicIdException:
 The fetch session encountered inconsistent topic ID usage[2021-07-13 
12:31:47,002] WARN [ReplicaFetcher replicaId=1, leaderId=0, fetcherId=0] Error 
in response for fetch request (type=FetchRequest, replicaId=1, maxWait=500, 
minBytes=1, maxBytes=10485760, fetchData={}, isolationLevel=READ_UNCOMMITTED, 
toForget=, metadata=(sessionId=2081224068, epoch=1), rackId=) 
(kafka.server.ReplicaFetcherThread:72)org.apache.kafka.common.errors.FetchSessionTopicIdException:
 The fetch session encountered inconsistent topic ID usage[2021-07-13 
12:31:48,545] WARN [Consumer clientId=ConsumerTestConsumer, groupId=group1] 
Close timed out with 3 pending requests to coordinator, terminating client 
connections 
(org.apache.kafka.clients.consumer.internals.ConsumerCoordinator:1024)[2021-07-13
 12:31:48,549] WARN [ReplicaFetcher replicaId=0, leaderId=2, fetcherId=0] Error 
in response for fetch request (type=FetchRequest, replicaId=0, maxWait=500, 
minBytes=1, maxBytes=10485760, 
fetchData={closetest-1=PartitionData(fetchOffset=0, logStartOffset=0, 
maxBytes=1048576, currentLeaderEpoch=Optional[0], 
lastFetchedEpoch=Optional.empty), closetest-7=PartitionData(fetchOffset=0, 
logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[0], 
lastFetchedEpoch=Optional.empty), topic-1=PartitionData(fetchOffset=0, 
logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[0], 
lastFetchedEpoch=Optional.empty), closetest-4=PartitionData(fetchOffset=0, 
logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[0], 
lastFetchedEpoch=Optional.empty)}, isolationLevel=READ_UNCOMMITTED, toForget=, 
metadata=(sessionId=1532743228, epoch=INITIAL), rackId=) 
(kafka.server.ReplicaFetcherThread:72)java.io.IOException: Connection to 2 was 
disconnected before the response was read 
at 
org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:100)
at 
kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:109)
at 
kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:219)
at 
kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:313)
at 
kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3(AbstractFetcherThread.scala:137)
at 
kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3$adapted(AbstractFetcherThread.scala:136)
at scala.Option.foreach(Option.scala:437) at 
kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:136)
at 
kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:119)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:96)
{code}
Wondering if we should bump the priority on this? Can someone more 
familiar with this test maybe take a few minutes to glance over these logs and 
chime in on whether this is potentially concerning or not? cc [~hachikuji] 
[~cmccabe]


was (Author: ableegoldman):
This has been failing _very_ frequently as of late, for example 
[https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-11009/4/tests].

I took a brief look at the logs and it appears to be an issue with the 
connection, or possibly something to do with the incremental fetch (not sure if 
this warning is a red herring or not):

 
{code:java}
[2021-07-13 12:31:46,774] WARN [ReplicaFetcher replicaId=1, leaderId=2, 
fetcherId=0] Error in response for fetch request (type=FetchRequest, 

[jira] [Created] (KAFKA-13075) Consolidate RocksDBStore and RocksDBKeyValueStoreTest

2021-07-12 Thread A. Sophie Blee-Goldman (Jira)
A. Sophie Blee-Goldman created KAFKA-13075:
--

 Summary: Consolidate RocksDBStore and RocksDBKeyValueStoreTest
 Key: KAFKA-13075
 URL: https://issues.apache.org/jira/browse/KAFKA-13075
 Project: Kafka
  Issue Type: Improvement
  Components: streams
Reporter: A. Sophie Blee-Goldman


Looks like we have two different test classes covering pretty much the same 
thing: RocksDBStore. It seems like RocksDBKeyValueStoreTest was the original 
test class for RocksDBStore, but someone later added RocksDBStoreTest, most 
likely because they didn't notice the RocksDBKeyValueStoreTest which didn't 
follow the usual naming scheme for test classes. 

We should consolidate these two into a single file, ideally retaining the 
RocksDBStoreTest name since that conforms to the test naming pattern used 
throughout Streams (and so this same thing doesn't happen again). It should 
also extend AbstractKeyValueStoreTest, like RocksDBKeyValueStoreTest 
currently does, so we continue to get the benefit of all the tests in there as 
well.
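
Just to illustrate the target shape, here's a rough sketch (purely illustrative; the exact 
abstract hook to override is whatever AbstractKeyValueStoreTest defines, and the store 
name and serdes here are made up):
{code:java}
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.Stores;

// Illustrative only: one test class, named to match the production class,
// that builds a RocksDB-backed store so the inherited key-value store tests
// all run against RocksDBStore.
public class RocksDBStoreTest /* extends AbstractKeyValueStoreTest */ {

    protected KeyValueStore<String, String> createKeyValueStore() {
        return Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore("rocksdb-store"),   // RocksDB-backed supplier
                Serdes.String(),
                Serdes.String())
            .build();
    }

    // The RocksDB-specific tests currently split across RocksDBStoreTest and
    // RocksDBKeyValueStoreTest would then be merged in here as well.
}
{code}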



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-12925) prefixScan missing from intermediate interfaces

2021-07-12 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman updated KAFKA-12925:
---
Priority: Critical  (was: Major)

> prefixScan missing from intermediate interfaces
> ---
>
> Key: KAFKA-12925
> URL: https://issues.apache.org/jira/browse/KAFKA-12925
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.8.0
>Reporter: Michael Viamari
>Assignee: Sagar Rao
>Priority: Critical
> Fix For: 3.0.0, 2.8.1
>
>
> [KIP-614|https://cwiki.apache.org/confluence/display/KAFKA/KIP-614%3A+Add+Prefix+Scan+support+for+State+Stores]
>  and [KAFKA-10648|https://issues.apache.org/jira/browse/KAFKA-10648] 
> introduced support for {{prefixScan}} to StateStores.
> It seems that many of the intermediate {{StateStore}} interfaces are missing 
> a definition for {{prefixScan}}, and as such it is not accessible in all cases.
> For example, when accessing the state stores through the processor context, 
> the {{KeyValueStoreReadWriteDecorator}} and associated interfaces do not 
> define {{prefixScan}}, so it falls back to the default implementation in 
> {{KeyValueStore}}, which throws {{UnsupportedOperationException}}.
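
To make the failure mode concrete, here is a simplified sketch of the mechanism (not the 
actual Streams interfaces; the SimpleKeyValueStore / ReadWriteDecorator names are made 
up): a default interface method that throws, plus a decorator that never overrides it, 
means every caller of the decorator hits the default.
{code:java}
// Simplified illustration of the failure mode (not the actual Streams classes).
interface SimpleKeyValueStore<K, V> {
    V get(K key);

    default Iterable<V> prefixScan(final String prefix) {
        throw new UnsupportedOperationException("prefixScan not supported");
    }
}

class ReadWriteDecorator<K, V> implements SimpleKeyValueStore<K, V> {
    private final SimpleKeyValueStore<K, V> inner;

    ReadWriteDecorator(final SimpleKeyValueStore<K, V> inner) {
        this.inner = inner;
    }

    @Override
    public V get(final K key) {
        return inner.get(key);   // explicitly delegated
    }

    // prefixScan is NOT overridden, so even if `inner` implements it,
    // calling decorator.prefixScan(...) throws UnsupportedOperationException.
}
{code}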



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-13008) Stream will stop processing data for a long time while waiting for the partition lag

2021-07-12 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-13008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17379505#comment-17379505
 ] 

A. Sophie Blee-Goldman commented on KAFKA-13008:


Thanks Guozhang, +1 on that approach (though I'll let Colin confirm whether #1 
does make sense or not). We'll definitely need a Streams/client side fix if the 
'real' fix is going to be on the broker side. My only question is whether this 
is something that might trip up other plain consumer client users in addition 
to Streams, and if so, whether there's anything we could do in the consumer 
client itself. But AFAIK it's only Streams that really relies on this metadata 
in this critical way, so I'm happy with the Streams-side fix as well
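
For what it's worth, here is a back-of-the-envelope sketch of what the client/Streams-side 
fallback could look like (purely illustrative, not the actual Streams code; it assumes the 
cached lag is exposed as an OptionalLong, e.g. via the currentLag API from KIP-695, and 
the IdlingDecision/readyToProcess names are made up):
{code:java}
import java.util.OptionalLong;

// Rough sketch, not the actual Streams code: if the cached lag for a partition
// is unknown, fall back to waiting out max.task.idle.ms instead of blocking
// indefinitely on metadata that may never arrive.
class IdlingDecision {
    private static final long NO_DEADLINE = -1L;
    private long deadlineMs = NO_DEADLINE;

    boolean readyToProcess(final OptionalLong cachedLag,
                           final long nowMs,
                           final long maxTaskIdleMs) {
        if (cachedLag.isPresent() && cachedLag.getAsLong() == 0) {
            deadlineMs = NO_DEADLINE;   // caught up on this partition, no need to idle
            return true;
        }
        // lag unknown or > 0: start the idling deadline the first time we get here
        if (deadlineMs == NO_DEADLINE) {
            deadlineMs = nowMs + maxTaskIdleMs;
        }
        return nowMs >= deadlineMs;     // enforce at most max.task.idle.ms of waiting
    }
}
{code}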

> Stream will stop processing data for a long time while waiting for the 
> partition lag
> 
>
> Key: KAFKA-13008
> URL: https://issues.apache.org/jira/browse/KAFKA-13008
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Luke Chen
>Priority: Blocker
> Fix For: 3.0.0
>
> Attachments: image-2021-07-07-11-19-55-630.png
>
>
> In KIP-695, we improved the task idling mechanism by checking partition lag. 
> It's a good improvement for timestamp sync. But I found it can cause the 
> stream to stop processing data for a long time while waiting for the 
> partition metadata.
>  
> I've been investigating this case for a while, and found that the issue 
> can happen in the situation below (or a similar one):
>  # start 2 streams (each with 1 thread) to consume from a topicA (with 3 
> partitions: A-0, A-1, A-2)
>  # After 2 streams started, the partitions assignment are: (I skipped some 
> other processing related partitions for simplicity)
>  stream1-thread1: A-0, A-1 
>  stream2-thread1: A-2
>  # start processing some data, assume now, the position and high watermark is:
>  A-0: offset: 2, highWM: 2
>  A-1: offset: 2, highWM: 2
>  A-2: offset: 2, highWM: 2
>  # Now, stream3 joined, so trigger rebalance with this assignment:
>  stream1-thread1: A-0 
>  stream2-thread1: A-2
>  stream3-thread1: A-1
>  # Suddenly, stream3 left, so now, rebalance again, with the step 2 
> assignment:
>  stream1-thread1: A-0, *A-1* 
>  stream2-thread1: A-2
>  (note: after initialization, the  position of A-1 will be: position: null, 
> highWM: null)
>  # Now, note that, the partition A-1 used to get assigned to stream1-thread1, 
> and now, it's back. And also, assume the partition A-1 has slow input (ex: 1 
> record per 30 mins), and partition A-0 has fast input (ex: 10K records / 
> sec). So, now, the stream1-thread1 won't process any data until we got input 
> from partition A-1 (even if partition A-0 is buffered a lot, and we have 
> `{{max.task.idle.ms}}` set to 0).
>  
> The reason why the stream1-thread1 won't process any data is because we can't 
> get the lag of partition A-1. And why we can't get the lag? It's because
>  # In KIP-695, we use consumer's cache to get the partition lag, to avoid 
> remote call
>  # The lag for a partition will be cleared if the assignment in this round 
> doesn't have this partition. check 
> [here|https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/consumer/internals/SubscriptionState.java#L272].
>  So, in the above example, the metadata cache for partition A-1 will be 
> cleared in step 4, and re-initialized (to null) in step 5
>  # In KIP-227, we introduced a fetch session to have incremental fetch 
> request/response. That is, if the session existed, the client(consumer) will 
> get the update only when the fetched partition have update (ex: new data). 
> So, in the above case, the partition A-1 has slow input (ex: 1 record per 30 
> mins), it won't have update until next 30 mins, or wait for the fetch session 
> become inactive for (default) 2 mins to be evicted. Either case, the metadata 
> won't be updated for a while.
>  
> In KIP-695, if we don't get the partition lag, we can't determine the 
> partition data status to do timestamp sync, so we'll keep waiting and not 
> processing any data. That's why this issue will happen.
>  
> *Proposed solution:*
>  # If we don't get the current lag for a partition, or the current lag > 0, 
> we start to wait for max.task.idle.ms, and reset the deadline when we get the 
> partition lag, like what we did in previous KIP-353
>  # Introduce a waiting time config when no partition lag, or partition lag 
> keeps > 0 (need KIP)
> [~vvcephei] [~guozhang] , any suggestions?
>  
> cc [~ableegoldman]  [~mjsax] , this is the root cause that in 
> [https://github.com/apache/kafka/pull/10736,] we discussed and thought 
> there's a data loss situation. FYI.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-12993) Formatting of Streams 'Memory Management' docs is messed up

2021-07-12 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman updated KAFKA-12993:
---
Fix Version/s: (was: 2.8.1)

> Formatting of Streams 'Memory Management' docs is messed up 
> 
>
> Key: KAFKA-12993
> URL: https://issues.apache.org/jira/browse/KAFKA-12993
> Project: Kafka
>  Issue Type: Bug
>  Components: docs, streams
>Reporter: A. Sophie Blee-Goldman
>Assignee: Luke Chen
>Priority: Blocker
> Fix For: 3.0.0
>
>
> The formatting of this page is all messed up, starting in the RocksDB 
> section. It looks like there's a missing closing tag after the example 
> BoundedMemoryRocksDBConfig class



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (KAFKA-7493) Rewrite test_broker_type_bounce_at_start

2021-07-12 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman reassigned KAFKA-7493:
-

Assignee: (was: A. Sophie Blee-Goldman)

> Rewrite test_broker_type_bounce_at_start
> 
>
> Key: KAFKA-7493
> URL: https://issues.apache.org/jira/browse/KAFKA-7493
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams, system tests
>Reporter: John Roesler
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, the test test_broker_type_bounce_at_start in 
> streams_broker_bounce_test.py is ignored.
> As written, there are a couple of race conditions that lead to flakiness.
> It should be possible to re-write the test to wait on log messages, as the 
> other tests do, instead of just sleeping to more deterministically transition 
> the test from one state to the next.
> Once the test is fixed, the fix should be back-ported to all prior branches, 
> back to 0.10.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-7493) Rewrite test_broker_type_bounce_at_start

2021-07-12 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman updated KAFKA-7493:
--
Fix Version/s: (was: 3.0.0)
   3.1.0

> Rewrite test_broker_type_bounce_at_start
> 
>
> Key: KAFKA-7493
> URL: https://issues.apache.org/jira/browse/KAFKA-7493
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams, system tests
>Reporter: John Roesler
>Priority: Major
> Fix For: 3.1.0
>
>
> Currently, the test test_broker_type_bounce_at_start in 
> streams_broker_bounce_test.py is ignored.
> As written, there are a couple of race conditions that lead to flakiness.
> It should be possible to re-write the test to wait on log messages, as the 
> other tests do, instead of just sleeping to more deterministically transition 
> the test from one state to the next.
> Once the test is fixed, the fix should be back-ported to all prior branches, 
> back to 0.10.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-13008) Stream will stop processing data for a long time while waiting for the partition lag

2021-07-12 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-13008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17379462#comment-17379462
 ] 

A. Sophie Blee-Goldman commented on KAFKA-13008:


{quote}Re-reading KIP-227, it seems like there should be a way for the client 
to add re-acquired partitions like this to the incremental fetch request so 
that it can reinitialize its metadata cache. In other words, it seems like 
getting a partition assigned that you haven't owned for a while is effectively 
the same case as getting a partition that you've never owned, and there does 
seem to be a mechanism for the latter.
{quote}
Thanks John, that is exactly what I was trying to suggest above, but I may have 
mangled it with my lack of understanding of the incremental fetch design. Given 
how long this bug went unnoticed and the in-depth investigation it took to 
uncover the bug (again, nicely done [~showuon]), it seems like any user of the 
plain consumer client in addition to Streams could be easily tripped up by 
this. And just personally, I had to read the analysis twice to really 
understand what was going on, since the behavior was/is so unintuitive to me.

> Stream will stop processing data for a long time while waiting for the 
> partition lag
> 
>
> Key: KAFKA-13008
> URL: https://issues.apache.org/jira/browse/KAFKA-13008
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Luke Chen
>Priority: Blocker
> Fix For: 3.0.0
>
> Attachments: image-2021-07-07-11-19-55-630.png
>
>
> In KIP-695, we improved the task idling mechanism by checking partition lag. 
> It's a good improvement for timestamp sync. But I found it can cause the 
> stream to stop processing data for a long time while waiting for the 
> partition metadata.
>  
> I've been investigating this case for a while, and found that the issue 
> can happen in the situation below (or a similar one):
>  # start 2 streams (each with 1 thread) to consume from a topicA (with 3 
> partitions: A-0, A-1, A-2)
>  # After 2 streams started, the partitions assignment are: (I skipped some 
> other processing related partitions for simplicity)
>  stream1-thread1: A-0, A-1 
>  stream2-thread1: A-2
>  # start processing some data, assume now, the position and high watermark is:
>  A-0: offset: 2, highWM: 2
>  A-1: offset: 2, highWM: 2
>  A-2: offset: 2, highWM: 2
>  # Now, stream3 joined, so trigger rebalance with this assignment:
>  stream1-thread1: A-0 
>  stream2-thread1: A-2
>  stream3-thread1: A-1
>  # Suddenly, stream3 left, so now, rebalance again, with the step 2 
> assignment:
>  stream1-thread1: A-0, *A-1* 
>  stream2-thread1: A-2
>  (note: after initialization, the  position of A-1 will be: position: null, 
> highWM: null)
>  # Now, note that, the partition A-1 used to get assigned to stream1-thread1, 
> and now, it's back. And also, assume the partition A-1 has slow input (ex: 1 
> record per 30 mins), and partition A-0 has fast input (ex: 10K records / 
> sec). So, now, the stream1-thread1 won't process any data until we got input 
> from partition A-1 (even if partition A-0 is buffered a lot, and we have 
> `{{max.task.idle.ms}}` set to 0).
>  
> The reason why the stream1-thread1 won't process any data is because we can't 
> get the lag of partition A-1. And why we can't get the lag? It's because
>  # In KIP-695, we use consumer's cache to get the partition lag, to avoid 
> remote call
>  # The lag for a partition will be cleared if the assignment in this round 
> doesn't have this partition. check 
> [here|https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/consumer/internals/SubscriptionState.java#L272].
>  So, in the above example, the metadata cache for partition A-1 will be 
> cleared in step 4, and re-initialized (to null) in step 5
>  # In KIP-227, we introduced a fetch session to have incremental fetch 
> request/response. That is, if the session existed, the client(consumer) will 
> get the update only when the fetched partition have update (ex: new data). 
> So, in the above case, the partition A-1 has slow input (ex: 1 record per 30 
> mins), it won't have update until next 30 mins, or wait for the fetch session 
> become inactive for (default) 2 mins to be evicted. Either case, the metadata 
> won't be updated for a while.
>  
> In KIP-695, if we don't get the partition lag, we can't determine the 
> partition data status to do timestamp sync, so we'll keep waiting and not 
> processing any data. That's why this issue will happen.
>  
> *Proposed solution:*
>  # If we don't get the current lag for a partition, or the current lag > 0, 
> we start to wait for max.task.idle.ms, and reset the deadline when we get the 
> partition lag, 

[jira] [Commented] (KAFKA-12477) Smart rebalancing with dynamic protocol selection

2021-07-08 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17377635#comment-17377635
 ] 

A. Sophie Blee-Goldman commented on KAFKA-12477:


No, we already decided to postpone this work to 3.1 so we can focus on some 
related things. Moved the Fix Version to 3.1.0

> Smart rebalancing with dynamic protocol selection
> -
>
> Key: KAFKA-12477
> URL: https://issues.apache.org/jira/browse/KAFKA-12477
> Project: Kafka
>  Issue Type: Improvement
>  Components: consumer
>Reporter: A. Sophie Blee-Goldman
>Assignee: A. Sophie Blee-Goldman
>Priority: Major
> Fix For: 3.1.0
>
>
> Users who want to upgrade their applications and enable the COOPERATIVE 
> rebalancing protocol in their consumer apps are required to follow a double 
> rolling bounce upgrade path. The reason for this is laid out in the [Consumer 
> Upgrades|https://cwiki.apache.org/confluence/display/KAFKA/KIP-429%3A+Kafka+Consumer+Incremental+Rebalance+Protocol#KIP429:KafkaConsumerIncrementalRebalanceProtocol-Consumer]
>  section of KIP-429. Basically, the ConsumerCoordinator picks a rebalancing 
> protocol in its constructor based on the list of supported partition 
> assignors. The protocol is selected as the highest protocol that is commonly 
> supported by all assignors in the list, and never changes after that.
> This is a bit unfortunate because it may end up using an older protocol even 
> after every member in the group has been updated to support the newer 
> protocol. After the first rolling bounce of the upgrade, all members will 
> have two assignors: "cooperative-sticky" and "range" (or 
> sticky/round-robin/etc). At this point the EAGER protocol will still be 
> selected due to the presence of the "range" assignor, but it's the 
> "cooperative-sticky" assignor that will ultimately be selected for use in 
> rebalances if that assignor is preferred (ie positioned first in the list). 
> The only reason for the second rolling bounce is to strip off the "range" 
> assignor and allow the upgraded members to switch over to COOPERATIVE. We 
> can't allow them to use cooperative rebalancing until everyone has been 
> upgraded, but once they have it's safe to do so.
> And there is already a way for the client to detect that everyone is on the 
> new byte code: if the CooperativeStickyAssignor is selected by the group 
> coordinator, then that means it is supported by all consumers in the group 
> and therefore everyone must be upgraded. 
> We may be able to save the second rolling bounce by dynamically updating the 
> rebalancing protocol inside the ConsumerCoordinator as "the highest protocol 
> supported by the assignor chosen by the group coordinator". This means we'll 
> still be using EAGER at the first rebalance, since we of course need to wait 
> for this initial rebalance to get the response from the group coordinator. 
> But we should take the hint from the chosen assignor rather than dropping 
> this information on the floor and sticking with the original protocol.
> Concrete Proposal:
> This assumes we will change the default assignor to ["cooperative-sticky", 
> "range"] in KIP-726. It also acknowledges that users may attempt any kind of 
> upgrade without reading the docs, and so we need to put in safeguards against 
> data corruption rather than assume everyone will follow the safe upgrade path.
> With this proposal, 
> 1) New applications on 3.0 will enable cooperative rebalancing by default
> 2) Existing applications which don’t set an assignor can safely upgrade to 
> 3.0 using a single rolling bounce with no extra steps, and will automatically 
> transition to cooperative rebalancing
> 3) Existing applications which do set an assignor that uses EAGER can 
> likewise upgrade their applications to COOPERATIVE with a single rolling 
> bounce
> 4) Once on 3.0, applications can safely go back and forth between EAGER and 
> COOPERATIVE
> 5) Applications can safely downgrade away from 3.0
> The high-level idea for dynamic protocol upgrades is that the group will 
> leverage the assignor selected by the group coordinator to determine when 
> it’s safe to upgrade to COOPERATIVE, and trigger a fail-safe to protect the 
> group in case of rare events or user misconfiguration. The group coordinator 
> selects the most preferred assignor that’s supported by all members of the 
> group, so we know that all members will support COOPERATIVE once we receive 
> the “cooperative-sticky” assignor after a rebalance. At this point, each 
> member can upgrade their own protocol to COOPERATIVE. However, there may be 
> situations in which an EAGER member may join the group even after upgrading 
> to COOPERATIVE. For example, during a rolling upgrade if the last remaining 
> member on 

[jira] [Updated] (KAFKA-12477) Smart rebalancing with dynamic protocol selection

2021-07-08 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman updated KAFKA-12477:
---
Fix Version/s: (was: 3.0.0)
   3.1.0

> Smart rebalancing with dynamic protocol selection
> -
>
> Key: KAFKA-12477
> URL: https://issues.apache.org/jira/browse/KAFKA-12477
> Project: Kafka
>  Issue Type: Improvement
>  Components: consumer
>Reporter: A. Sophie Blee-Goldman
>Assignee: A. Sophie Blee-Goldman
>Priority: Major
> Fix For: 3.1.0
>
>
> Users who want to upgrade their applications and enable the COOPERATIVE 
> rebalancing protocol in their consumer apps are required to follow a double 
> rolling bounce upgrade path. The reason for this is laid out in the [Consumer 
> Upgrades|https://cwiki.apache.org/confluence/display/KAFKA/KIP-429%3A+Kafka+Consumer+Incremental+Rebalance+Protocol#KIP429:KafkaConsumerIncrementalRebalanceProtocol-Consumer]
>  section of KIP-429. Basically, the ConsumerCoordinator picks a rebalancing 
> protocol in its constructor based on the list of supported partition 
> assignors. The protocol is selected as the highest protocol that is commonly 
> supported by all assignors in the list, and never changes after that.
> This is a bit unfortunate because it may end up using an older protocol even 
> after every member in the group has been updated to support the newer 
> protocol. After the first rolling bounce of the upgrade, all members will 
> have two assignors: "cooperative-sticky" and "range" (or 
> sticky/round-robin/etc). At this point the EAGER protocol will still be 
> selected due to the presence of the "range" assignor, but it's the 
> "cooperative-sticky" assignor that will ultimately be selected for use in 
> rebalances if that assignor is preferred (ie positioned first in the list). 
> The only reason for the second rolling bounce is to strip off the "range" 
> assignor and allow the upgraded members to switch over to COOPERATIVE. We 
> can't allow them to use cooperative rebalancing until everyone has been 
> upgraded, but once they have it's safe to do so.
> And there is already a way for the client to detect that everyone is on the 
> new byte code: if the CooperativeStickyAssignor is selected by the group 
> coordinator, then that means it is supported by all consumers in the group 
> and therefore everyone must be upgraded. 
> We may be able to save the second rolling bounce by dynamically updating the 
> rebalancing protocol inside the ConsumerCoordinator as "the highest protocol 
> supported by the assignor chosen by the group coordinator". This means we'll 
> still be using EAGER at the first rebalance, since we of course need to wait 
> for this initial rebalance to get the response from the group coordinator. 
> But we should take the hint from the chosen assignor rather than dropping 
> this information on the floor and sticking with the original protocol.
> Concrete Proposal:
> This assumes we will change the default assignor to ["cooperative-sticky", 
> "range"] in KIP-726. It also acknowledges that users may attempt any kind of 
> upgrade without reading the docs, and so we need to put in safeguards against 
> data corruption rather than assume everyone will follow the safe upgrade path.
> With this proposal, 
> 1) New applications on 3.0 will enable cooperative rebalancing by default
> 2) Existing applications which don’t set an assignor can safely upgrade to 
> 3.0 using a single rolling bounce with no extra steps, and will automatically 
> transition to cooperative rebalancing
> 3) Existing applications which do set an assignor that uses EAGER can 
> likewise upgrade their applications to COOPERATIVE with a single rolling 
> bounce
> 4) Once on 3.0, applications can safely go back and forth between EAGER and 
> COOPERATIVE
> 5) Applications can safely downgrade away from 3.0
> The high-level idea for dynamic protocol upgrades is that the group will 
> leverage the assignor selected by the group coordinator to determine when 
> it’s safe to upgrade to COOPERATIVE, and trigger a fail-safe to protect the 
> group in case of rare events or user misconfiguration. The group coordinator 
> selects the most preferred assignor that’s supported by all members of the 
> group, so we know that all members will support COOPERATIVE once we receive 
> the “cooperative-sticky” assignor after a rebalance. At this point, each 
> member can upgrade their own protocol to COOPERATIVE. However, there may be 
> situations in which an EAGER member may join the group even after upgrading 
> to COOPERATIVE. For example, during a rolling upgrade if the last remaining 
> member on the old bytecode misses a rebalance, the other members will be 
> allowed to upgrade to COOPERATIVE. If 

[jira] [Commented] (KAFKA-9295) KTableKTableForeignKeyInnerJoinMultiIntegrationTest#shouldInnerJoinMultiPartitionQueryable

2021-07-08 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17377616#comment-17377616
 ] 

A. Sophie Blee-Goldman commented on KAFKA-9295:
---

[~kkonstantine] Luke was able to figure out the likely root cause and it 
actually does seem to be a potentially severe bug. See 
https://issues.apache.org/jira/browse/KAFKA-13008

> KTableKTableForeignKeyInnerJoinMultiIntegrationTest#shouldInnerJoinMultiPartitionQueryable
> --
>
> Key: KAFKA-9295
> URL: https://issues.apache.org/jira/browse/KAFKA-9295
> Project: Kafka
>  Issue Type: Bug
>  Components: streams, unit tests
>Affects Versions: 2.4.0, 2.6.0
>Reporter: Matthias J. Sax
>Assignee: Luke Chen
>Priority: Blocker
>  Labels: flaky-test
> Fix For: 3.0.0
>
>
> [https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/27106/testReport/junit/org.apache.kafka.streams.integration/KTableKTableForeignKeyInnerJoinMultiIntegrationTest/shouldInnerJoinMultiPartitionQueryable/]
> {quote}java.lang.AssertionError: Did not receive all 1 records from topic 
> output- within 6 ms Expected: is a value equal to or greater than <1> 
> but: <0> was less than <1> at 
> org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18) at 
> org.apache.kafka.streams.integration.utils.IntegrationTestUtils.lambda$waitUntilMinKeyValueRecordsReceived$1(IntegrationTestUtils.java:515)
>  at 
> org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:417)
>  at 
> org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:385)
>  at 
> org.apache.kafka.streams.integration.utils.IntegrationTestUtils.waitUntilMinKeyValueRecordsReceived(IntegrationTestUtils.java:511)
>  at 
> org.apache.kafka.streams.integration.utils.IntegrationTestUtils.waitUntilMinKeyValueRecordsReceived(IntegrationTestUtils.java:489)
>  at 
> org.apache.kafka.streams.integration.KTableKTableForeignKeyInnerJoinMultiIntegrationTest.verifyKTableKTableJoin(KTableKTableForeignKeyInnerJoinMultiIntegrationTest.java:200)
>  at 
> org.apache.kafka.streams.integration.KTableKTableForeignKeyInnerJoinMultiIntegrationTest.shouldInnerJoinMultiPartitionQueryable(KTableKTableForeignKeyInnerJoinMultiIntegrationTest.java:183){quote}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-13037) "Thread state is already PENDING_SHUTDOWN" log spam

2021-07-06 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-13037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17376066#comment-17376066
 ] 

A. Sophie Blee-Goldman commented on KAFKA-13037:


Hey [~gray.john], would you be interested in submitting a PR for this? I 
completely agree that logging this at INFO on every iteration is wildly 
inappropriate; I just didn't push it at the time since I figured someone would 
file a ticket if it was really bothering people. And here we are :) 
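
For reference, the change being asked for is basically a one-liner; here is a simplified 
sketch of the before/after (illustrative only, not the actual StreamThread code):
{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative only, not the actual StreamThread code: the fix is simply to
// log the "already in target state" case at DEBUG again so a thread spinning
// in PENDING_SHUTDOWN can't flood the logs.
class StateTransitionExample {
    private static final Logger log = LoggerFactory.getLogger(StateTransitionExample.class);

    enum State { RUNNING, PENDING_SHUTDOWN, DEAD }

    private State state = State.PENDING_SHUTDOWN;

    boolean setState(final State newState) {
        if (state == newState) {
            // was log.info(...) after KAFKA-12462; DEBUG keeps it out of normal output
            log.debug("Thread state is already {}", newState);
            return false;
        }
        state = newState;
        return true;
    }
}
{code}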

> "Thread state is already PENDING_SHUTDOWN" log spam
> ---
>
> Key: KAFKA-13037
> URL: https://issues.apache.org/jira/browse/KAFKA-13037
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.8.0, 2.7.1
>Reporter: John Gray
>Priority: Major
>
> KAFKA-12462 introduced a 
> [change|https://github.com/apache/kafka/commit/4fe4cdc4a61cbac8e070a8b5514403235194015b#diff-76f629d0df8bd30b2593cbcf4a2dc80de3167ebf55ef8b5558e6e6285a057496R722]
>  that increased this "Thread state is already {}" logger to info instead of 
> debug. We are running into a problem with our streams apps when they hit an 
> unrecoverable exception that shuts down the streams thread: we get this log 
> printed about 50,000 times per second per thread. I am guessing it is once 
> per record we have queued up when the exception happens? We have temporarily 
> raised the StreamThread logger to WARN instead of INFO to avoid the spam, but 
> we do miss the other good logs we get on INFO in that class. Could this log 
> be reverted back to debug? Thank you! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-13008) Stream will stop processing data for a long time while waiting for the partition lag

2021-07-06 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-13008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17376019#comment-17376019
 ] 

A. Sophie Blee-Goldman commented on KAFKA-13008:


Nice find! I agree, this does not seem like the expected behavior, and given 
that it's been causing a test to fail regularly I think we can assert that this 
should not happen. 

One thing I don't understand, and maybe this is because I don't have much 
context on the incremental fetch internals, is: why would we not get the 
metadata again after the partition is re-assigned in step 5? Surely if a 
partition was removed from the assignment and then added back, this should 
constitute a new 'session', and thus it should get the metadata again on 
assignment? If not that sounds like a bug in the incremental fetch to me, but 
again, I'm not too familiar with it so there could be a valid reason it works 
this way. 

If so, then maybe we should consider allowing metadata to remain around after a 
partition is unassigned, in case it gets this same partition back within the 
session? Could there be other consequences of this lack of metadata, outside of 
Streams?

> Stream will stop processing data for a long time while waiting for the 
> partition lag
> 
>
> Key: KAFKA-13008
> URL: https://issues.apache.org/jira/browse/KAFKA-13008
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Luke Chen
>Priority: Major
>
> In KIP-695, we improved the task idling mechanism by checking partition lag. 
> It's a good improvement for timestamp sync. But I found it can cause the 
> stream to stop processing data for a long time while waiting for the 
> partition metadata.
>  
> I've been investigating this case for a while, and found that the issue 
> can happen in the situation below (or a similar one):
>  # start 2 streams (each with 1 thread) to consume from a topicA (with 3 
> partitions: A-0, A-1, A-2)
>  # After 2 streams started, the partitions assignment are: (I skipped some 
> other processing related partitions for simplicity)
>  stream1-thread1: A-0, A-1 
>  stream2-thread1: A-2
>  # start processing some data, assume now, the position and high watermark is:
>  A-0: offset: 2, highWM: 2
>  A-1: offset: 2, highWM: 2
>  A-2: offset: 2, highWM: 2
>  # Now, stream3 joined, so trigger rebalance with this assignment:
>  stream1-thread1: A-0 
>  stream2-thread1: A-2
>  stream3-thread1: A-1
>  # Suddenly, stream3 left, so now, rebalance again, with the step 2 
> assignment:
>  stream1-thread1: A-0, *A-1* 
>  stream2-thread1: A-2
>  (note: after initialization, the  position of A-1 will be: position: null, 
> highWM: null)
>  # Now, note that, the partition A-1 used to get assigned to stream1-thread1, 
> and now, it's back. And also, assume the partition A-1 has slow input (ex: 1 
> record per 30 mins), and partition A-0 has fast input (ex: 10K records / 
> sec). So, now, the stream1-thread1 won't process any data until we got input 
> from partition A-1 (even if partition A-0 is buffered a lot, and we have 
> `{{max.task.idle.ms}}` set to 0).
>  
> The reason why the stream1-thread1 won't process any data is because we can't 
> get the lag of partition A-1. And why we can't get the lag? It's because
>  # In KIP-695, we use consumer's cache to get the partition lag, to avoid 
> remote call
>  # The lag for a partition will be cleared if the assignment in this round 
> doesn't have this partition. check 
> [here|https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/consumer/internals/SubscriptionState.java#L272].
>  So, in the above example, the metadata cache for partition A-1 will be 
> cleared in step 4, and re-initialized (to null) in step 5
>  # In KIP-227, we introduced a fetch session to have incremental fetch 
> request/response. That is, if the session existed, the client(consumer) will 
> get the update only when the fetched partition have update (ex: new data). 
> So, in the above case, the partition A-1 has slow input (ex: 1 record per 30 
> mins), it won't have update until next 30 mins, or wait for the fetch session 
> become inactive for (default) 2 mins to be evicted. Either case, the metadata 
> won't be updated for a while.
>  
> In KIP-695, if we don't get the partition lag, we can't determine the 
> partition data status to do timestamp sync, so we'll keep waiting and not 
> processing any data. That's why this issue will happen.
>  
> *Proposed solution:*
>  # If we don't get the current lag for a partition, or the current lag > 0, 
> we start to wait for max.task.idle.ms, and reset the deadline when we get the 
> partition lag, like what we did in previous KIP-353
>  # Introduce a waiting time config when no partition lag, or 

[jira] [Commented] (KAFKA-9177) Pause completed partitions on restore consumer

2021-06-25 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17369771#comment-17369771
 ] 

A. Sophie Blee-Goldman commented on KAFKA-9177:
---

[~apolyakov] this ticket was specifically about pausing those completed 
changelogs in the restore consumer's assignment so that it doesn't continue to 
fetch records beyond the "end" of the changelog. The log message you're seeing 
doesn't indicate that this wasn't done; it's just a debug message to confirm 
we've finished restoring all changelogs when we run through the no-op restore 
phase after restoration is complete.

I wouldn't worry about it.

 

> Pause completed partitions on restore consumer
> --
>
> Key: KAFKA-9177
> URL: https://issues.apache.org/jira/browse/KAFKA-9177
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: A. Sophie Blee-Goldman
>Assignee: Guozhang Wang
>Priority: Major
> Fix For: 2.6.0
>
>
> The StoreChangelogReader is responsible for tracking and restoring active 
> tasks, but once a store has finished restoring it will continue polling for 
> records on that partition.
> Ordinarily this doesn't make a difference as a store is not completely 
> restored until its entire changelog has been read, so there are no more 
> records for poll to return anyway. But if the restoring state is actually an 
> optimized source KTable, the changelog is just the source topic and poll will 
> keep returning records for that partition until all stores have been restored.
> Note that this isn't a correctness issue since it's just the restore 
> consumer, but it is wasteful to be polling for records and throwing them 
> away. We should pause completed partitions in StoreChangelogReader so we 
> don't slow down the restore consumer in reading from the unfinished changelog 
> topics, and avoid wasted network.
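
A rough sketch of the idea behind the fix (illustrative only, not the actual 
StoreChangelogReader code; the RestorePauser name is made up): once a changelog partition 
has caught up to its end offset, pause it on the restore consumer so poll() stops 
fetching records for it.
{code:java}
import java.util.HashSet;
import java.util.Set;
import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.common.TopicPartition;

// Rough sketch only: pause changelog partitions that have finished restoring.
class RestorePauser {
    private final Set<TopicPartition> paused = new HashSet<>();

    void maybePauseCompleted(final Consumer<byte[], byte[]> restoreConsumer,
                             final Set<TopicPartition> completedChangelogs) {
        final Set<TopicPartition> newlyCompleted = new HashSet<>(completedChangelogs);
        newlyCompleted.removeAll(paused);
        if (!newlyCompleted.isEmpty()) {
            // pause() only stops fetching; the assignment itself is unchanged
            restoreConsumer.pause(newlyCompleted);
            paused.addAll(newlyCompleted);
        }
    }
}
{code}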



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-12993) Formatting of Streams 'Memory Management' docs is messed up

2021-06-25 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17369758#comment-17369758
 ] 

A. Sophie Blee-Goldman commented on KAFKA-12993:


Oh, you already have a PR for kafka-site. I should have known :P

> Formatting of Streams 'Memory Management' docs is messed up 
> 
>
> Key: KAFKA-12993
> URL: https://issues.apache.org/jira/browse/KAFKA-12993
> Project: Kafka
>  Issue Type: Bug
>  Components: docs, streams
>Reporter: A. Sophie Blee-Goldman
>Assignee: Luke Chen
>Priority: Blocker
> Fix For: 3.0.0, 2.8.1
>
>
> The formatting of this page is all messed up, starting in the RocksDB 
> section. It looks like there's a missing closing tag after the example 
> BoundedMemoryRocksDBConfig class



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-12993) Formatting of Streams 'Memory Management' docs is messed up

2021-06-25 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17369756#comment-17369756
 ] 

A. Sophie Blee-Goldman commented on KAFKA-12993:


Ah, I didn't realize this was already addressed. Thanks for the heads up Bruno. 
I agree with Luke's take on this: not every docs change needs to go into 
kafka-site right away, but for something like this which is a pretty egregious 
formatting issue that makes it very difficult to read, we should fix it on the 
live site ASAP.

[~showuon] do you want to port this PR over to kafka-site?

> Formatting of Streams 'Memory Management' docs is messed up 
> 
>
> Key: KAFKA-12993
> URL: https://issues.apache.org/jira/browse/KAFKA-12993
> Project: Kafka
>  Issue Type: Bug
>  Components: docs, streams
>Reporter: A. Sophie Blee-Goldman
>Assignee: Luke Chen
>Priority: Blocker
> Fix For: 3.0.0, 2.8.1
>
>
> The formatting of this page is all messed up, starting in the RocksDB 
> section. It looks like there's a missing closing tag after the example 
> BoundedMemoryRocksDBConfig class



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-12993) Formatting of Streams 'Memory Management' docs is messed up

2021-06-24 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman updated KAFKA-12993:
---
Component/s: docs

> Formatting of Streams 'Memory Management' docs is messed up 
> 
>
> Key: KAFKA-12993
> URL: https://issues.apache.org/jira/browse/KAFKA-12993
> Project: Kafka
>  Issue Type: Bug
>  Components: docs, streams
>Reporter: A. Sophie Blee-Goldman
>Priority: Blocker
> Fix For: 3.0.0, 2.8.1
>
>
> The formatting of this page is all messed up, starting in the RocksDB 
> section. It looks like there's a missing closing tag after the example 
> BoundedMemoryRocksDBConfig class



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-12993) Formatting of Streams 'Memory Management' docs is messed up

2021-06-24 Thread A. Sophie Blee-Goldman (Jira)
A. Sophie Blee-Goldman created KAFKA-12993:
--

 Summary: Formatting of Streams 'Memory Management' docs is messed 
up 
 Key: KAFKA-12993
 URL: https://issues.apache.org/jira/browse/KAFKA-12993
 Project: Kafka
  Issue Type: Bug
  Components: streams
Reporter: A. Sophie Blee-Goldman
 Fix For: 3.0.0, 2.8.1


The formatting of this page is all messed up, starting in the RocksDB section. 
It looks like there's a missing closing tag after the example 
BoundedMemoryRocksDBConfig class



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-12984) Cooperative sticky assignor can get stuck with invalid SubscriptionState input metadata

2021-06-23 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368542#comment-17368542
 ] 

A. Sophie Blee-Goldman commented on KAFKA-12984:


[~mjsax] Technically yes, the issue with the SubscriptionState potentially 
providing invalid "ownedPartitions" input can affect Kafka Streams as well. 
However the impact for Streams is considerably less severe, as the 
assignment algorithm doesn't make any assumptions about the previous 
assignment being valid. The worst that should happen to a Streams application 
is that the assignment could be slightly sub-optimal, with a partition/active 
task being assigned to a member that had dropped out of the group since being 
assigned that partition, instead of its true current owner. 

> Cooperative sticky assignor can get stuck with invalid SubscriptionState 
> input metadata
> ---
>
> Key: KAFKA-12984
> URL: https://issues.apache.org/jira/browse/KAFKA-12984
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer
>Reporter: A. Sophie Blee-Goldman
>Assignee: A. Sophie Blee-Goldman
>Priority: Blocker
> Fix For: 3.0.0, 2.8.1
>
>
> Some users have reported seeing their consumer group become stuck in the 
> CompletingRebalance phase when using the cooperative-sticky assignor. Based 
> on the request metadata we were able to deduce that multiple consumers were 
> reporting the same partition(s) in their "ownedPartitions" field of the 
> consumer protocol. Since this is an invalid state, the input causes the 
> cooperative-sticky assignor to detect that something is wrong and throw an 
> IllegalStateException. If the consumer application is set up to simply retry, 
> this will cause the group to appear to hang in the rebalance state.
> The "ownedPartitions" field is encoded based on the ConsumerCoordinator's 
> SubscriptionState, which was assumed to always be up to date. However there 
> may be cases where the consumer has dropped out of the group but fails to 
> clear the SubscriptionState, allowing it to report some partitions as owned 
> that have since been reassigned to another member.
> We should (a) fix the sticky assignment algorithm to resolve cases of 
> improper input conditions by invalidating the "ownedPartitions" in cases of 
> double ownership, and (b) shore up the ConsumerCoordinator logic to better 
> handle rejoining the group and keeping its internal state consistent. See 
> KAFKA-12983 for more details on (b)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-12896) Group rebalance loop caused by repeated group leader JoinGroups

2021-06-22 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17367805#comment-17367805
 ] 

A. Sophie Blee-Goldman commented on KAFKA-12896:


I believe this is caused by https://issues.apache.org/jira/browse/KAFKA-12984, 
which in turn is caused by https://issues.apache.org/jira/browse/KAFKA-12983. 
I'll prepare a fix for both of these issues

> Group rebalance loop caused by repeated group leader JoinGroups
> ---
>
> Key: KAFKA-12896
> URL: https://issues.apache.org/jira/browse/KAFKA-12896
> Project: Kafka
>  Issue Type: Bug
>  Components: clients
>Affects Versions: 2.6.0
>Reporter: Lucas Bradstreet
>Assignee: David Jacot
>Priority: Blocker
> Fix For: 3.0.0
>
>
> We encountered a strange case of a rebalance loop with the 
> "cooperative-sticky" assignor. The logs show the following for several hours:
>  
> {{Apr 7, 2021 @ 03:58:36.040  [GroupCoordinator 7]: Stabilized group mygroup 
> generation 19830137 (__consumer_offsets-7)}}
> {{Apr 7, 2021 @ 03:58:35.992  [GroupCoordinator 7]: Preparing to rebalance 
> group mygroup in state PreparingRebalance with old generation 19830136 
> (__consumer_offsets-7) (reason: Updating metadata for member 
> mygroup-1-7ad27e07-3784-4588-97e1-d796a74d4ecc during CompletingRebalance)}}
> {{Apr 7, 2021 @ 03:58:35.988  [GroupCoordinator 7]: Stabilized group mygroup 
> generation 19830136 (__consumer_offsets-7)}}
> {{Apr 7, 2021 @ 03:58:35.972  [GroupCoordinator 7]: Preparing to rebalance 
> group mygroup in state PreparingRebalance with old generation 19830135 
> (__consumer_offsets-7) (reason: Updating metadata for member mygroup during 
> CompletingRebalance)}}
> {{Apr 7, 2021 @ 03:58:35.965  [GroupCoordinator 7]: Stabilized group mygroup 
> generation 19830135 (__consumer_offsets-7)}}
> {{Apr 7, 2021 @ 03:58:35.953  [GroupCoordinator 7]: Preparing to rebalance 
> group mygroup in state PreparingRebalance with old generation 19830134 
> (__consumer_offsets-7) (reason: Updating metadata for member 
> mygroup-7ad27e07-3784-4588-97e1-d796a74d4ecc during CompletingRebalance)}}
> {{Apr 7, 2021 @ 03:58:35.941  [GroupCoordinator 7]: Stabilized group mygroup 
> generation 19830134 (__consumer_offsets-7)}}
> {{Apr 7, 2021 @ 03:58:35.926  [GroupCoordinator 7]: Preparing to rebalance 
> group mygroup in state PreparingRebalance with old generation 19830133 
> (__consumer_offsets-7) (reason: Updating metadata for member mygroup during 
> CompletingRebalance)}}
> Every single time, it was the same member that triggered the JoinGroup and it 
> was always the leader of the group.{{}}
> The leader has the privilege of being able to trigger a rebalance by sending 
> `JoinGroup` even if its subscription metadata has not changed. But why would 
> it do so?
> It is possible that this is due to the same issue or a similar bug to 
> https://issues.apache.org/jira/browse/KAFKA-12890.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (KAFKA-12983) onJoinPrepare is not always invoked before joining the group

2021-06-22 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman reassigned KAFKA-12983:
--

Assignee: A. Sophie Blee-Goldman

> onJoinPrepare is not always invoked before joining the group
> 
>
> Key: KAFKA-12983
> URL: https://issues.apache.org/jira/browse/KAFKA-12983
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer
>Reporter: A. Sophie Blee-Goldman
>Assignee: A. Sophie Blee-Goldman
>Priority: Blocker
> Fix For: 3.0.0, 2.8.1
>
>
> As the title suggests, the #onJoinPrepare callback is not always invoked 
> before a member (re)joins the group, but only once when it first enters the 
> rebalance. This means that any updates or events that occur during the join 
> phase can be lost in the internal state: for example, clearing the 
> SubscriptionState (and thus the "ownedPartitions" that are used for 
> cooperative rebalancing) after losing its memberId during a rebalance.
> We should reset the `needsJoinPrepare` flag inside the resetStateAndRejoin() 
> method



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (KAFKA-12984) Cooperative sticky assignor can get stuck with invalid SubscriptionState input metadata

2021-06-22 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman reassigned KAFKA-12984:
--

Assignee: A. Sophie Blee-Goldman

> Cooperative sticky assignor can get stuck with invalid SubscriptionState 
> input metadata
> ---
>
> Key: KAFKA-12984
> URL: https://issues.apache.org/jira/browse/KAFKA-12984
> Project: Kafka
>  Issue Type: Bug
>  Components: consumer
>Reporter: A. Sophie Blee-Goldman
>Assignee: A. Sophie Blee-Goldman
>Priority: Blocker
> Fix For: 3.0.0, 2.8.1
>
>
> Some users have reported seeing their consumer group become stuck in the 
> CompletingRebalance phase when using the cooperative-sticky assignor. Based 
> on the request metadata we were able to deduce that multiple consumers were 
> reporting the same partition(s) in their "ownedPartitions" field of the 
> consumer protocol. Since this is an invalid state, the input causes the 
> cooperative-sticky assignor to detect that something is wrong and throw an 
> IllegalStateException. If the consumer application is set up to simply retry, 
> this will cause the group to appear to hang in the rebalance state.
> The "ownedPartitions" field is encoded based on the ConsumerCoordinator's 
> SubscriptionState, which was assumed to always be up to date. However there 
> may be cases where the consumer has dropped out of the group but fails to 
> clear the SubscriptionState, allowing it to report some partitions as owned 
> that have since been reassigned to another member.
> We should (a) fix the sticky assignment algorithm to resolve cases of 
> improper input conditions by invalidating the "ownedPartitions" in cases of 
> double ownership, and (b) shore up the ConsumerCoordinator logic to better 
> handle rejoining the group and keeping its internal state consistent. See 
> KAFKA-12983 for more details on (b)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-12984) Cooperative sticky assignor can get stuck with invalid SubscriptionState input metadata

2021-06-22 Thread A. Sophie Blee-Goldman (Jira)
A. Sophie Blee-Goldman created KAFKA-12984:
--

 Summary: Cooperative sticky assignor can get stuck with invalid 
SubscriptionState input metadata
 Key: KAFKA-12984
 URL: https://issues.apache.org/jira/browse/KAFKA-12984
 Project: Kafka
  Issue Type: Bug
  Components: consumer
Reporter: A. Sophie Blee-Goldman
 Fix For: 3.0.0, 2.8.1


Some users have reported seeing their consumer group become stuck in the 
CompletingRebalance phase when using the cooperative-sticky assignor. Based on 
the request metadata we were able to deduce that multiple consumers were 
reporting the same partition(s) in their "ownedPartitions" field of the 
consumer protocol. Since this is an invalid state, the input causes the 
cooperative-sticky assignor to detect that something is wrong and throw an 
IllegalStateException. If the consumer application is set up to simply retry, 
this will cause the group to appear to hang in the rebalance state.

The "ownedPartitions" field is encoded based on the ConsumerCoordinator's 
SubscriptionState, which was assumed to always be up to date. However there may 
be cases where the consumer has dropped out of the group but fails to clear the 
SubscriptionState, allowing it to report some partitions as owned that have 
since been reassigned to another member.

We should (a) fix the sticky assignment algorithm to resolve cases of improper 
input conditions by invalidating the "ownedPartitions" in cases of double 
ownership, and (b) shore up the ConsumerCoordinator logic to better handle 
rejoining the group and keeping its internal state consistent. See KAFKA-12983 
for more details on (b)
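
For illustration, here is a rough standalone sketch of what (a) could look like, i.e. 
invalidating any partition that shows up in more than one member's "ownedPartitions". 
The class and method names are made up for this example and this is not the actual 
assignor code:
{code:java}
import org.apache.kafka.common.TopicPartition;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OwnedPartitionsSanitizer {

    // Drops every partition that is claimed by more than one member, so the assignor
    // treats doubly-claimed partitions as unowned instead of throwing IllegalStateException
    public static Map<String, List<TopicPartition>> invalidateDoubleOwnership(
            final Map<String, List<TopicPartition>> ownedByMember) {

        final Map<TopicPartition, Integer> claims = new HashMap<>();
        for (final List<TopicPartition> owned : ownedByMember.values()) {
            for (final TopicPartition tp : owned) {
                claims.merge(tp, 1, Integer::sum);
            }
        }

        final Map<String, List<TopicPartition>> sanitized = new HashMap<>();
        for (final Map.Entry<String, List<TopicPartition>> entry : ownedByMember.entrySet()) {
            final List<TopicPartition> valid = new ArrayList<>();
            for (final TopicPartition tp : entry.getValue()) {
                if (claims.get(tp) == 1) {
                    valid.add(tp); // uniquely claimed: keep the sticky ownership
                }
            }
            sanitized.put(entry.getKey(), valid);
        }
        return sanitized;
    }
}
{code}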



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-12983) onJoinPrepare is not always invoked before joining the group

2021-06-22 Thread A. Sophie Blee-Goldman (Jira)
A. Sophie Blee-Goldman created KAFKA-12983:
--

 Summary: onJoinPrepare is not always invoked before joining the 
group
 Key: KAFKA-12983
 URL: https://issues.apache.org/jira/browse/KAFKA-12983
 Project: Kafka
  Issue Type: Bug
  Components: consumer
Reporter: A. Sophie Blee-Goldman
 Fix For: 3.0.0, 2.8.1


As the title suggests, the #onJoinPrepare callback is not always invoked before 
a member (re)joins the group, but only once when it first enters the rebalance. 
This means that any updates or events that occur during the join phase can be 
lost in the internal state: for example, clearing the SubscriptionState (and 
thus the "ownedPartitions" that are used for cooperative rebalancing) after 
losing its memberId during a rebalance.

We should reset the `needsJoinPrepare` flag inside the resetStateAndRejoin() 
method
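
A minimal sketch of the proposed fix; the field and method names mirror the description 
above, but this is illustrative pseudocode of the coordinator flow rather than the actual 
AbstractCoordinator:
{code:java}
public abstract class CoordinatorSketch {

    // true whenever onJoinPrepare() must run before the next JoinGroup request
    private boolean needsJoinPrepare = true;

    protected void resetStateAndRejoin() {
        // ... existing logic: clear memberId/generation and request a rejoin ...
        needsJoinPrepare = true; // proposed fix: force onJoinPrepare() to run again before rejoining
    }

    protected void joinGroupIfNeeded() {
        if (needsJoinPrepare) {
            onJoinPrepare();      // e.g. revoke/clear owned partitions in the SubscriptionState
            needsJoinPrepare = false;
        }
        // ... send the JoinGroup request ...
    }

    protected abstract void onJoinPrepare();
}
{code}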



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-12960) WindowStore and SessionStore do not enforce strict retention time

2021-06-22 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17367795#comment-17367795
 ] 

A. Sophie Blee-Goldman commented on KAFKA-12960:


+1 on pushing this responsibility to the state stores. Imo this was always 
intended to be a contract of the state store interface itself and was just not 
strictly enforced, i.e. it's more like a bug to fix than a matter of "pushing" 
the responsibility onto them.

> WindowStore and SessionStore do not enforce strict retention time
> -
>
> Key: KAFKA-12960
> URL: https://issues.apache.org/jira/browse/KAFKA-12960
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: Matthias J. Sax
>Priority: Major
>
> WindowedStore and SessionStore do not implement a strict retention time in 
> general. We should consider making retention time strict: even if we still 
> have some records in the store (due to the segmented implementation), we might 
> want to filter expired records on-read. This might benefit PAPI users.
> Atm, the InMemoryWindowStore does already enforce a strict retention time.
> As an alternative, we could also inject such a filter in the wrapping 
> `MeteredStore`; this might lift the burden from users who implement a custom 
> state store.
> As an alternative, we could change all DSL operators to verify whether data 
> from a state store has already expired. It might be better to push this 
> responsibility into the stores though.
> It's especially an issue for stream-stream joins, because the operator relies 
> on the retention time to implement its grace period.
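
For example, the on-read filter mentioned above could be a small check against the 
expiration bound. Here is an illustrative, standalone sketch; the names are made up, 
and it assumes the wrapping store has access to the observed stream time and the 
retention period:
{code:java}
public class ExpiredRecordFilter {

    private final long retentionMs;

    public ExpiredRecordFilter(final long retentionMs) {
        this.retentionMs = retentionMs;
    }

    // A record is expired once its window end falls behind (observedStreamTime - retention)
    public boolean isExpired(final long windowEndMs, final long observedStreamTimeMs) {
        return windowEndMs < observedStreamTimeMs - retentionMs;
    }

    // Returns null for expired entries so callers can skip them on read
    public <V> V filterOnRead(final long windowEndMs, final long observedStreamTimeMs, final V value) {
        return isExpired(windowEndMs, observedStreamTimeMs) ? null : value;
    }
}
{code}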



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-8295) Optimize count() using RocksDB merge operator

2021-06-14 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17363356#comment-17363356
 ] 

A. Sophie Blee-Goldman commented on KAFKA-8295:
---

Yes, any of the count DSL operators. It may be a bit more tricky than it 
appears on the surface because count is actually converted into a generic 
aggregation under the covers, so you'd have to tease it out into its own 
independent optimized implementation. To be honest, I don't have a good sense 
of whether it's even worth the additional code complexity, because I don't know 
how much additional code and/or code paths this will introduce :) I recommend 
looking into that before jumping straight in.

Of course, we could consider introducing some kind of top-level merge-based 
operator to the DSL as a feature in its own right. Then count could just be 
converted to use that instead of the aggregation implementation. 

Not sure what that would look like, or if it would even be useful at all – just 
throwing out thoughts here. Anyways I just thought it would be interesting to 
explore what we might be able to do with this merge operator in Kafka Streams, 
whether that's an optimization of existing operators or some kind of first 
class operator of its own. That's really the point of this ticket: to explore 
the merge operator.

> Optimize count() using RocksDB merge operator
> -
>
> Key: KAFKA-8295
> URL: https://issues.apache.org/jira/browse/KAFKA-8295
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: A. Sophie Blee-Goldman
>Assignee: Sagar Rao
>Priority: Major
>
> In addition to the regular put/get/delete operations, RocksDB provides a fourth 
> operation: merge. This essentially provides an optimized read/update/write path 
> in a single operation. One of the built-in (C++) merge operators exposed over the 
> Java API is a counter. We should be able to leverage this for a more 
> efficient implementation of count()
>  
> (Note: Unfortunately it seems unlikely we can use this to optimize general 
> aggregations, even if RocksJava allowed for a custom merge operator, unless 
> we provide a way for the user to specify and connect a C++ implemented 
> aggregator – otherwise we incur too much cost crossing the JNI for a net 
> performance benefit)
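
For reference, a small standalone sketch of the built-in counter merge operator over plain 
RocksJava (outside of Streams). The operator name and the 8-byte little-endian value encoding 
are assumptions about the bundled C++ "uint64add" operator, so double-check them against the 
RocksDB version in use:
{code:java}
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class MergeCounterDemo {

    public static void main(final String[] args) throws RocksDBException {
        RocksDB.loadLibrary();
        try (final Options options = new Options()
                 .setCreateIfMissing(true)
                 .setMergeOperatorName("uint64add"); // built-in C++ counter merge operator
             final RocksDB db = RocksDB.open(options, "/tmp/merge-counter-demo")) {

            final byte[] key = "clicks".getBytes(StandardCharsets.UTF_8);
            final byte[] one = encode(1L);

            // each merge() is a read-free "+= 1"; operands are folded on read/compaction
            db.merge(key, one);
            db.merge(key, one);
            db.merge(key, one);

            System.out.println("count = " + decode(db.get(key))); // 3
        }
    }

    private static byte[] encode(final long v) {
        return ByteBuffer.allocate(Long.BYTES).order(ByteOrder.LITTLE_ENDIAN).putLong(v).array();
    }

    private static long decode(final byte[] bytes) {
        return ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getLong();
    }
}
{code}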



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-12844) KIP-740 follow up: clean up TaskId

2021-06-14 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17363355#comment-17363355
 ] 

A. Sophie Blee-Goldman commented on KAFKA-12844:


As mentioned elsewhere, this ticket is marked for 4.0 and as such, cannot be 
worked on yet. When 4.0 is announced you are free to pick this up and work on 
it again, but as of this point we don't yet know when version 4.0 will be 
released. Most likely we will have 3.1 after the in-progress 3.0, though I 
suppose it depends on the Zookeeper removal work.

> KIP-740 follow up: clean up TaskId
> --
>
> Key: KAFKA-12844
> URL: https://issues.apache.org/jira/browse/KAFKA-12844
> Project: Kafka
>  Issue Type: Task
>  Components: streams
>Reporter: A. Sophie Blee-Goldman
>Assignee: loboxu
>Priority: Blocker
> Fix For: 4.0.0
>
>
> See 
> [KIP-740|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=181306557]
>  – for the TaskId class, we need to remove the following deprecated APIs:
>  # The public partition and topicGroupId fields should be "removed", ie made 
> private (can also now rename topicGroupId to subtopology to match the getter)
>  # The two #readFrom and two #writeTo methods can be removed (they have 
> already been converted to internal utility methods we now use instead, so 
> just remove them)
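
For anyone picking this up later, a hypothetical sketch of the resulting shape; this is 
illustrative only, the real class has more members and the readFrom/writeTo logic now lives 
in internal utilities:
{code:java}
public class TaskIdSketch {

    private final int subtopology; // formerly the public topicGroupId field
    private final int partition;   // formerly public as well

    public TaskIdSketch(final int subtopology, final int partition) {
        this.subtopology = subtopology;
        this.partition = partition;
    }

    public int subtopology() {
        return subtopology;
    }

    public int partition() {
        return partition;
    }
}
{code}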



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-12843) KIP-740 follow up: clean up TaskMetadata

2021-06-14 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17363354#comment-17363354
 ] 

A. Sophie Blee-Goldman commented on KAFKA-12843:


This was pointed out on the PR, but just leaving a note on the ticket for 
visibility, plus to clarify for anyone else who comes across this:

This ticket is marked for fix in version 4.0, which means we can't work on it 
yet. We are only in the process of releasing 3.0 at the moment, and it's likely 
that 3.1 will come after that. Once the bump to 4.0 has been decided, you can 
pick this up again and actually work on it. 

> KIP-740 follow up: clean up TaskMetadata
> 
>
> Key: KAFKA-12843
> URL: https://issues.apache.org/jira/browse/KAFKA-12843
> Project: Kafka
>  Issue Type: Task
>  Components: streams
>Reporter: A. Sophie Blee-Goldman
>Assignee: loboxu
>Priority: Blocker
> Fix For: 4.0.0
>
>
> See 
> [KIP-740|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=181306557]
>  – for the TaskMetadata class, we need to:
>  # Deprecate the TaskMetadata#getTaskId method
>  # "Remove" the deprecated TaskMetadata#taskId method, then re-add a taskId() 
> API that returns a TaskId instead of a String
>  # Remove the deprecated constructor
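
As a rough illustration of the target surface (a hypothetical sketch, not the actual class):
{code:java}
public interface TaskMetadataSketch {

    // step 2: re-added taskId() now returns a TaskId instead of a String
    org.apache.kafka.streams.processor.TaskId taskId();

    // step 1: getTaskId() is deprecated, to be removed in a later major release
    @Deprecated
    String getTaskId();
}
{code}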



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-12690) Remove deprecated Producer#sendOffsetsToTransaction

2021-06-14 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17363343#comment-17363343
 ] 

A. Sophie Blee-Goldman commented on KAFKA-12690:


Hey [~loboxu], as pointed out elsewhere this ticket is marked for 4.0, and is 
not yet ready to be worked on. You're absolutely free to pick this up when the 
time comes, so you can leave yourself assigned if you'd like, but it may be a 
while before 4.0 comes around. Just a heads up :) 

> Remove deprecated Producer#sendOffsetsToTransaction
> ---
>
> Key: KAFKA-12690
> URL: https://issues.apache.org/jira/browse/KAFKA-12690
> Project: Kafka
>  Issue Type: Task
>  Components: producer 
>Reporter: A. Sophie Blee-Goldman
>Assignee: loboxu
>Priority: Blocker
> Fix For: 4.0.0
>
>
> In 
> [KIP-732|https://cwiki.apache.org/confluence/display/KAFKA/KIP-732%3A+Deprecate+eos-alpha+and+replace+eos-beta+with+eos-v2]
>  we deprecated the EXACTLY_ONCE and EXACTLY_ONCE_BETA configs in 
> StreamsConfig, to be removed in 4.0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-12689) Remove deprecated EOS configs

2021-06-14 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17363342#comment-17363342
 ] 

A. Sophie Blee-Goldman commented on KAFKA-12689:


Hey [~loboxu], this ticket is marked with 4.0 as the Fix Version, which means 
it can't be worked on until version 4.0. That version has yet to be announced, 
and since 3.0 is only just about to be released I would not assume that 4.0 is 
definitely right around the corner. Most likely 4.0 will come soon-ish, but I 
would look for other tickets to work on for now.

> Remove deprecated EOS configs
> -
>
> Key: KAFKA-12689
> URL: https://issues.apache.org/jira/browse/KAFKA-12689
> Project: Kafka
>  Issue Type: Task
>  Components: streams
>Reporter: A. Sophie Blee-Goldman
>Assignee: loboxu
>Priority: Blocker
> Fix For: 4.0.0
>
>
> In 
> [KIP-732|https://cwiki.apache.org/confluence/display/KAFKA/KIP-732%3A+Deprecate+eos-alpha+and+replace+eos-beta+with+eos-v2]
>  we deprecated the EXACTLY_ONCE and EXACTLY_ONCE_BETA configs in 
> StreamsConfig, to be removed in 4.0
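
For context, a minimal sketch of switching an application to the replacement config, eos-v2; 
per KIP-732 this requires brokers on 2.5+:
{code:java}
import java.util.Properties;

import org.apache.kafka.streams.StreamsConfig;

public class EosV2ConfigExample {

    public static void main(final String[] args) {
        final Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "eos-v2-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // EXACTLY_ONCE_V2 replaces the deprecated EXACTLY_ONCE (eos-alpha) and
        // EXACTLY_ONCE_BETA values that this ticket tracks removing in 4.0
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);
        System.out.println(props);
    }
}
{code}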



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-12935) Flaky Test RestoreIntegrationTest.shouldRecycleStateFromStandbyTaskPromotedToActiveTaskAndNotRestore

2021-06-10 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361374#comment-17361374
 ] 

A. Sophie Blee-Goldman commented on KAFKA-12935:


Filed https://issues.apache.org/jira/browse/KAFKA-12936

> Flaky Test 
> RestoreIntegrationTest.shouldRecycleStateFromStandbyTaskPromotedToActiveTaskAndNotRestore
> 
>
> Key: KAFKA-12935
> URL: https://issues.apache.org/jira/browse/KAFKA-12935
> Project: Kafka
>  Issue Type: Test
>  Components: streams, unit tests
>Reporter: Matthias J. Sax
>Priority: Critical
>  Labels: flaky-test
>
> {quote}java.lang.AssertionError: Expected: <0L> but: was <5005L> at 
> org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20) at 
> org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6) at 
> org.apache.kafka.streams.integration.RestoreIntegrationTest.shouldRecycleStateFromStandbyTaskPromotedToActiveTaskAndNotRestore(RestoreIntegrationTest.java:374)
> {quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-12936) In-memory stores are always restored from scratch after dropping out of the group

2021-06-10 Thread A. Sophie Blee-Goldman (Jira)
A. Sophie Blee-Goldman created KAFKA-12936:
--

 Summary: In-memory stores are always restored from scratch after 
dropping out of the group
 Key: KAFKA-12936
 URL: https://issues.apache.org/jira/browse/KAFKA-12936
 Project: Kafka
  Issue Type: Bug
  Components: streams
Reporter: A. Sophie Blee-Goldman


Whenever an in-memory store is closed, the actual store contents are garbage 
collected and the state will need to be restored from scratch if the task is 
reassigned and re-initialized. We introduced the recycling feature to prevent 
this from occurring when a task is transitioned from standby to active (or vice 
versa), but it's still possible for the in-memory state to be unnecessarily 
wiped out in the case the member has dropped out of the group. In this case, 
the onPartitionsLost callback is invoked, which will close all active tasks as 
dirty before the member rejoins the group. This means that all these tasks will 
need to be restored from scratch if they are reassigned back to this consumer.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-12935) Flaky Test RestoreIntegrationTest.shouldRecycleStateFromStandbyTaskPromotedToActiveTaskAndNotRestore

2021-06-10 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361367#comment-17361367
 ] 

A. Sophie Blee-Goldman commented on KAFKA-12935:


Hm, actually, I guess that could be considered a bug in itself, or at least a 
flaw in the recycling feature: for persistent stores with ALOS, dropping out 
of the group only causes tasks to be closed dirty, it doesn't force them to be 
wiped out and restored from the changelog from scratch. But with in-memory 
stores, simply closing them is akin to physically wiping out the state 
directory for that task. Avoiding that was the basis for this recycling feature 
in the first place.

This does kind of suck, but at least it should be a relatively rare event. I'm 
a bit worried about how much complexity it would introduce to the code to fix 
this "bug", but I'll at least file a ticket for it now and we can go from there

> Flaky Test 
> RestoreIntegrationTest.shouldRecycleStateFromStandbyTaskPromotedToActiveTaskAndNotRestore
> 
>
> Key: KAFKA-12935
> URL: https://issues.apache.org/jira/browse/KAFKA-12935
> Project: Kafka
>  Issue Type: Test
>  Components: streams, unit tests
>Reporter: Matthias J. Sax
>Priority: Critical
>  Labels: flaky-test
>
> {quote}java.lang.AssertionError: Expected: <0L> but: was <5005L> at 
> org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20) at 
> org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6) at 
> org.apache.kafka.streams.integration.RestoreIntegrationTest.shouldRecycleStateFromStandbyTaskPromotedToActiveTaskAndNotRestore(RestoreIntegrationTest.java:374)
> {quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-12935) Flaky Test RestoreIntegrationTest.shouldRecycleStateFromStandbyTaskPromotedToActiveTaskAndNotRestore

2021-06-10 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361359#comment-17361359
 ] 

A. Sophie Blee-Goldman commented on KAFKA-12935:


Hm, yeah I think I've seen this fail once or twice before. I did look into it a 
bit a while back, and just could not figure out whether it was a possible bug 
or an issue with the test itself. My money's definitely on the latter, but it 
might be worth taking another look sometime if we have the chance. 

If there is a bug that this is uncovering, at least it would not be a 
correctness bug, only an annoyance in restoring when it's not necessary. I 
think it's more likely that the test is flaky because for example the consumer 
dropped out of the group and invoked onPartitionsLost, which would close the 
task as dirty and require restoring from the changelog

> Flaky Test 
> RestoreIntegrationTest.shouldRecycleStateFromStandbyTaskPromotedToActiveTaskAndNotRestore
> 
>
> Key: KAFKA-12935
> URL: https://issues.apache.org/jira/browse/KAFKA-12935
> Project: Kafka
>  Issue Type: Test
>  Components: streams, unit tests
>Reporter: Matthias J. Sax
>Priority: Critical
>  Labels: flaky-test
>
> {quote}java.lang.AssertionError: Expected: <0L> but: was <5005L> at 
> org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20) at 
> org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6) at 
> org.apache.kafka.streams.integration.RestoreIntegrationTest.shouldRecycleStateFromStandbyTaskPromotedToActiveTaskAndNotRestore(RestoreIntegrationTest.java:374)
> {quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-8940) Flaky Test SmokeTestDriverIntegrationTest.shouldWorkWithRebalance

2021-06-10 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361196#comment-17361196
 ] 

A. Sophie Blee-Goldman commented on KAFKA-8940:
---

Hey [~mjsax] [~josep.prat] (and others), if you read my last comment on this 
ticket it explains exactly why this test is failing. Luckily it only has to do 
with the test itself: it's a bug in the assumptions about the generated input. 
It's just not necessarily a quick fix. 

Maybe we should @Ignore it for now, and then file a separate ticket to circle 
back and correct the assumptions in this test.

> Flaky Test SmokeTestDriverIntegrationTest.shouldWorkWithRebalance
> -
>
> Key: KAFKA-8940
> URL: https://issues.apache.org/jira/browse/KAFKA-8940
> Project: Kafka
>  Issue Type: Bug
>  Components: streams, unit tests
>Reporter: Guozhang Wang
>Priority: Major
>  Labels: flaky-test, newbie++
>
> The test does not properly account for windowing. See this comment for full 
> details.
> We can patch this test by fixing the timestamps of the input data to avoid 
> crossing over a window boundary, or account for this when verifying the 
> output. Since we have access to the input data it should be possible to 
> compute whether/when we do cross a window boundary, and adjust the expected 
> output accordingly
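
A minimal sketch of the boundary check suggested above, assuming tumbling windows; the 
window size constant and method names are illustrative, not taken from the test:
{code:java}
import java.util.List;

public class WindowBoundaryCheck {

    static final long WINDOW_SIZE_MS = 30_000L; // assumed window size, for illustration only

    // True if every record timestamp falls into the same tumbling window
    static boolean sameWindow(final List<Long> timestamps) {
        final long firstWindowStart = timestamps.get(0) - (timestamps.get(0) % WINDOW_SIZE_MS);
        return timestamps.stream()
                .allMatch(ts -> ts - (ts % WINDOW_SIZE_MS) == firstWindowStart);
    }

    public static void main(final String[] args) {
        System.out.println(sameWindow(List.of(1_000L, 15_000L, 29_999L))); // true
        System.out.println(sameWindow(List.of(1_000L, 31_000L)));          // false: crosses a boundary
    }
}
{code}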



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-12920) Consumer's cooperative sticky assignor need to clear generation / assignment data upon `onPartitionsLost`

2021-06-09 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman resolved KAFKA-12920.

Resolution: Not A Bug

> Consumer's cooperative sticky assignor need to clear generation / assignment 
> data upon `onPartitionsLost`
> -
>
> Key: KAFKA-12920
> URL: https://issues.apache.org/jira/browse/KAFKA-12920
> Project: Kafka
>  Issue Type: Bug
>Reporter: Guozhang Wang
>Priority: Major
>  Labels: bug, consumer
>
> Consumer's cooperative-sticky assignor does not track the owned partitions 
> inside the assignor --- i.e. when it resets its state in the event of 
> ``onPartitionsLost``, the ``memberAssignment`` and ``generation`` inside the 
> assignor would not be cleared. This would cause a member to join with an empty 
> generation on the protocol while its non-empty user-data still encodes the old 
> assignment (and hence passes the validation check on the broker side during 
> JoinGroup), and eventually cause a single partition to be assigned to 
> multiple consumers within a generation.
> We should let the assignor also clear its assignment/generation when 
> ``onPartitionsLost`` is triggered in order to avoid this scenario.
> Note that 1) for the regular sticky assignor the generation would still have 
> an older value, and this would cause the previously owned partitions to be 
> discarded during the assignment, and 2) for Streams' sticky assignor, its 
> encoding would indeed be cleared along with ``onPartitionsLost``. Hence only 
> the Consumer's cooperative-sticky assignor has this issue to solve.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-12920) Consumer's cooperative sticky assignor need to clear generation / assignment data upon `onPartitionsLost`

2021-06-09 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17360480#comment-17360480
 ] 

A. Sophie Blee-Goldman commented on KAFKA-12920:


I believe we've found the real issue, and are just discussing what the correct 
behavior here should be. I'm going to close this ticket as 'Not a Bug' and file 
a separate issue with the problems we've found in the 
Consumer/AbstractCoordinator that have resulted in 
[this|https://issues.apache.org/jira/browse/KAFKA-12896] odd behavior we 
observed with the cooperative-sticky assignor

> Consumer's cooperative sticky assignor need to clear generation / assignment 
> data upon `onPartitionsLost`
> -
>
> Key: KAFKA-12920
> URL: https://issues.apache.org/jira/browse/KAFKA-12920
> Project: Kafka
>  Issue Type: Bug
>Reporter: Guozhang Wang
>Priority: Major
>  Labels: bug, consumer
>
> Consumer's cooperative-sticky assignor does not track the owned partitions 
> inside the assignor --- i.e. when it resets its state in the event of 
> ``onPartitionsLost``, the ``memberAssignment`` and ``generation`` inside the 
> assignor would not be cleared. This would cause a member to join with an empty 
> generation on the protocol while its non-empty user-data still encodes the old 
> assignment (and hence passes the validation check on the broker side during 
> JoinGroup), and eventually cause a single partition to be assigned to 
> multiple consumers within a generation.
> We should let the assignor also clear its assignment/generation when 
> ``onPartitionsLost`` is triggered in order to avoid this scenario.
> Note that 1) for the regular sticky assignor the generation would still have 
> an older value, and this would cause the previously owned partitions to be 
> discarded during the assignment, and 2) for Streams' sticky assignor, its 
> encoding would indeed be cleared along with ``onPartitionsLost``. Hence only 
> the Consumer's cooperative-sticky assignor has this issue to solve.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (KAFKA-12925) prefixScan missing from intermediate interfaces

2021-06-09 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17360472#comment-17360472
 ] 

A. Sophie Blee-Goldman edited comment on KAFKA-12925 at 6/10/21, 12:51 AM:
---

[~sagarrao] do you think you'll be able to get in a fix for this in the next 
few weeks? It would be good to have this working in full capacity by 3.0

(I wouldn't consider it a blocker for 3.0, but we have enough time before code 
freeze that this should not be an issue, provided someone can pick it up)


was (Author: ableegoldman):
[~sagarrao] do you think you'll be able to get in a fix for this in the next 
few weeks? It would be good to have this working in full capacity by 3.0

> prefixScan missing from intermediate interfaces
> ---
>
> Key: KAFKA-12925
> URL: https://issues.apache.org/jira/browse/KAFKA-12925
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.8.0
>Reporter: Michael Viamari
>Priority: Major
> Fix For: 3.0.0, 2.8.1
>
>
> [KIP-614|https://cwiki.apache.org/confluence/display/KAFKA/KIP-614%3A+Add+Prefix+Scan+support+for+State+Stores]
>  and [KAFKA-10648|https://issues.apache.org/jira/browse/KAFKA-10648] 
> introduced support for {{prefixScan}} to StateStores.
> It seems that many of the intermediate {{StateStore}} interfaces are missing 
> a definition for {{prefixScan}}, and as such it is not accessible in all cases.
> For example, when accessing the state stores through the processor context, 
> the {{KeyValueStoreReadWriteDecorator}} and associated interfaces do not 
> define {{prefixScan}}, so it falls back to the default implementation in 
> {{KeyValueStore}}, which throws {{UnsupportedOperationException}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-12925) prefixScan missing from intermediate interfaces

2021-06-09 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17360472#comment-17360472
 ] 

A. Sophie Blee-Goldman commented on KAFKA-12925:


[~sagarrao] do you think you'll be able to get in a fix for this in the next 
few weeks? It would be good to have this working in full capacity by 3.0

> prefixScan missing from intermediate interfaces
> ---
>
> Key: KAFKA-12925
> URL: https://issues.apache.org/jira/browse/KAFKA-12925
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.8.0
>Reporter: Michael Viamari
>Priority: Major
>
> [KIP-614|https://cwiki.apache.org/confluence/display/KAFKA/KIP-614%3A+Add+Prefix+Scan+support+for+State+Stores]
>  and [KAFKA-10648|https://issues.apache.org/jira/browse/KAFKA-10648] 
> introduced support for {{prefixScan}} to StateStores.
> It seems that many of the intermediate {{StateStore}} interfaces are missing 
> a definition for {{prefixScan}}, and as such it is not accessible in all cases.
> For example, when accessing the state stores through the processor context, 
> the {{KeyValueStoreReadWriteDecorator}} and associated interfaces do not 
> define {{prefixScan}}, so it falls back to the default implementation in 
> {{KeyValueStore}}, which throws {{UnsupportedOperationException}}.
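
To make the gap concrete, a rough sketch of the missing delegation; the wrapper class name 
is made up, and the real fix would touch the internal decorator/wrapper classes rather than 
a standalone class like this:
{code:java}
import org.apache.kafka.common.serialization.Serializer;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

public class PrefixScanForwardingWrapper<K, V> {

    private final KeyValueStore<K, V> wrapped;

    public PrefixScanForwardingWrapper(final KeyValueStore<K, V> wrapped) {
        this.wrapped = wrapped;
    }

    // Forward to the underlying store instead of inheriting the default implementation,
    // which just throws UnsupportedOperationException
    public <PS extends Serializer<P>, P> KeyValueIterator<K, V> prefixScan(final P prefix,
                                                                           final PS prefixKeySerializer) {
        return wrapped.prefixScan(prefix, prefixKeySerializer);
    }
}
{code}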



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

