[jira] [Created] (KAFKA-13170) Flaky Test InternalTopicManagerTest.shouldRetryDeleteTopicWhenTopicUnknown
A. Sophie Blee-Goldman created KAFKA-13170: -- Summary: Flaky Test InternalTopicManagerTest.shouldRetryDeleteTopicWhenTopicUnknown Key: KAFKA-13170 URL: https://issues.apache.org/jira/browse/KAFKA-13170 Project: Kafka Issue Type: Bug Components: streams, unit tests Reporter: A. Sophie Blee-Goldman [https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-11176/2/testReport/org.apache.kafka.streams.processor.internals/InternalTopicManagerTest/Build___JDK_8_and_Scala_2_12___shouldRetryDeleteTopicWhenTopicUnknown_2/] {code:java} Stacktrace java.lang.AssertionError: unexpected exception type thrown; expected: but was: at org.junit.Assert.assertThrows(Assert.java:1020) at org.junit.Assert.assertThrows(Assert.java:981) at org.apache.kafka.streams.processor.internals.InternalTopicManagerTest.shouldRetryDeleteTopicWhenRetriableException(InternalTopicManagerTest.java:526) at org.apache.kafka.streams.processor.internals.InternalTopicManagerTest.shouldRetryDeleteTopicWhenTopicUnknown(InternalTopicManagerTest.java:497) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-13169) Flaky Test QueryableStateIntegrationTest.shouldBeAbleToQueryStateWithNonZeroSizedCache
A. Sophie Blee-Goldman created KAFKA-13169: -- Summary: Flaky Test QueryableStateIntegrationTest.shouldBeAbleToQueryStateWithNonZeroSizedCache Key: KAFKA-13169 URL: https://issues.apache.org/jira/browse/KAFKA-13169 Project: Kafka Issue Type: Bug Components: streams, unit tests Reporter: A. Sophie Blee-Goldman [https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-11176/2/testReport/org.apache.kafka.streams.processor.internals/InternalTopicManagerTest/Build___JDK_8_and_Scala_2_12___shouldRetryDeleteTopicWhenTopicUnknown/] {code:java} Stacktrace java.lang.AssertionError: unexpected exception type thrown; expected: but was: at org.junit.Assert.assertThrows(Assert.java:1020) at org.junit.Assert.assertThrows(Assert.java:981) at org.apache.kafka.streams.processor.internals.InternalTopicManagerTest.shouldRetryDeleteTopicWhenRetriableException(InternalTopicManagerTest.java:526) at org.apache.kafka.streams.processor.internals.InternalTopicManagerTest.shouldRetryDeleteTopicWhenTopicUnknown(InternalTopicManagerTest.java:497) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-13145) Renaming the time interval window for better understanding
[ https://issues.apache.org/jira/browse/KAFKA-13145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17392615#comment-17392615 ] A. Sophie Blee-Goldman commented on KAFKA-13145: Fine with me. FWIW the "InclusiveExclusiveWindow" name was my idea, but that was just to avoid using something called "SessionWindow" in the _Sliding_ window processor – making a new SlidingWindow class works too. > Renaming the time interval window for better understanding > -- > > Key: KAFKA-13145 > URL: https://issues.apache.org/jira/browse/KAFKA-13145 > Project: Kafka > Issue Type: Improvement > Components: streams >Reporter: Luke Chen >Assignee: Luke Chen >Priority: Major > > > I have another thought, which is to rename the time interval related windows. > Currently, we have 3 types of time interval window: > {{TimeWindow}} -> to have {{[start,end)}} time interval > {{SessionWindow}} -> to have {{[start,end]}} time interval > {{UnlimitedWindow}} -> to have {{[start, MAX_VALUE)}} time interval > I think the name {{SessionWindow}} is definitely not good here, especially since we > want to use it in {{SlidingWindows}} now, although it is only used for > {{SessionWindows}} before. We should name them with time interval meaning, > not the streaming window functions meaning. Because these 3 window types > are internal use only, it is safe to rename them. > > {{TimeWindow}} --> {{InclusiveExclusiveWindow}} > {{SessionWindow}} / {{SlidingWindow}} --> {{InclusiveInclusiveWindow}} > {{UnlimitedWindow}} --> {{InclusiveUnboundedWindow}} > > See the discussion here: > [https://github.com/apache/kafka/pull/11124#issuecomment-887989639] -- This message was sent by Atlassian Jira (v8.3.4#803005)
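The three interval semantics in the proposal differ only in which bounds are inclusive. A minimal sketch of the containment checks, using hypothetical helper names rather than the actual Kafka internals:

```java
// Hypothetical sketch (not the real Kafka window classes): the three
// time-interval semantics from the KAFKA-13145 proposal, distinguished
// only by whether each bound is inclusive or exclusive.
public class WindowIntervals {
    // [start, end): TimeWindow, i.e. the proposed InclusiveExclusiveWindow
    static boolean inclusiveExclusive(long start, long end, long ts) {
        return ts >= start && ts < end;
    }

    // [start, end]: SessionWindow (and the proposed new SlidingWindow),
    // i.e. the proposed InclusiveInclusiveWindow
    static boolean inclusiveInclusive(long start, long end, long ts) {
        return ts >= start && ts <= end;
    }

    // [start, MAX_VALUE): UnlimitedWindow, i.e. the proposed InclusiveUnboundedWindow
    static boolean inclusiveUnbounded(long start, long ts) {
        return ts >= start;
    }

    public static void main(String[] args) {
        // The only observable difference is at the end boundary:
        System.out.println(inclusiveExclusive(0, 10, 10)); // false: end excluded
        System.out.println(inclusiveInclusive(0, 10, 10)); // true: end included
    }
}
```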
[jira] [Commented] (KAFKA-12559) Add a top-level Streams config for bounding off-heap memory
[ https://issues.apache.org/jira/browse/KAFKA-12559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17392610#comment-17392610 ] A. Sophie Blee-Goldman commented on KAFKA-12559: [~brz] I'd say give [~msundeq] another day or two to respond and if you don't hear back then feel free to assign this ticket to yourself. I just added you as a contributor on the project so you should be able to self-assign tickets from now on. > Add a top-level Streams config for bounding off-heap memory > --- > > Key: KAFKA-12559 > URL: https://issues.apache.org/jira/browse/KAFKA-12559 > Project: Kafka > Issue Type: Improvement > Components: streams >Reporter: A. Sophie Blee-Goldman >Assignee: Martin Sundeqvist >Priority: Major > Labels: needs-kip, newbie, newbie++ > > At the moment we provide an example of how to bound the memory usage of > rocksdb in the [Memory > Management|https://kafka.apache.org/27/documentation/streams/developer-guide/memory-mgmt.html#rocksdb] > section of the docs. This requires implementing a custom RocksDBConfigSetter > class and setting a number of rocksdb options for relatively advanced > concepts and configurations. It seems a fair number of users either fail to > find this or consider it to be for more advanced use cases/users. But RocksDB > can eat up a lot of off-heap memory and it's not uncommon for users to come > across a {{RocksDBException: Cannot allocate memory}}. > It would probably be a much better user experience if we implemented this > memory bound out-of-the-box and just gave users a top-level StreamsConfig to > tune the off-heap memory given to rocksdb, like we have for on-heap cache > memory with cache.max.bytes.buffering. More advanced users can continue to > fine-tune their memory bounding and apply other configs with a custom config > setter, while new or more casual users can cap the off-heap memory without > getting their hands dirty with rocksdb. 
> I would propose to add the following top-level config: > rocksdb.max.bytes.off.heap: medium priority, default to -1 (unbounded), valid > values are [0, inf] > I'd also want to consider adding a second, lower priority top-level config to > give users a knob for adjusting how much of that total off-heap memory goes > to the block cache + index/filter blocks, and how much of it is afforded to > the write buffers. I'm struggling to come up with a good name for this > config, but it would be something like > rocksdb.memtable.to.block.cache.off.heap.memory.ratio: low priority, default > to 0.5, valid values are [0, 1] -- This message was sent by Atlassian Jira (v8.3.4#803005)
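For illustration, here is how the proposed keys might sit next to the existing on-heap cache bound in a Streams Properties object. Note that `rocksdb.max.bytes.off.heap` and the ratio config are only proposals from this ticket and do not exist in StreamsConfig today; plain string keys are used here to keep the sketch self-contained:

```java
import java.util.Properties;

// Sketch of the proposed top-level configs alongside the existing on-heap
// cache bound. The two "rocksdb.*" keys below are HYPOTHETICAL names from
// this ticket, not real StreamsConfig options.
public class OffHeapBoundExample {
    static Properties streamsProps() {
        Properties props = new Properties();
        // Existing top-level config bounding the on-heap record cache:
        props.put("cache.max.bytes.buffering", "10485760"); // 10 MiB
        // Proposed: bound RocksDB's total off-heap memory;
        // default -1 = unbounded, valid values [0, inf]
        props.put("rocksdb.max.bytes.off.heap", "134217728"); // 128 MiB
        // Proposed (lower priority): fraction of the off-heap budget going to
        // write buffers vs block cache; default 0.5, valid values [0, 1]
        props.put("rocksdb.memtable.to.block.cache.off.heap.memory.ratio", "0.5");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(streamsProps().getProperty("rocksdb.max.bytes.off.heap"));
    }
}
```

The point of the sketch is the shape of the knob: one size bound plus one ratio, mirroring how cache.max.bytes.buffering already works for on-heap memory.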
[jira] [Updated] (KAFKA-12994) Migrate all Tests to New API and Remove Suppression for Deprecation Warnings related to KIP-633
[ https://issues.apache.org/jira/browse/KAFKA-12994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman updated KAFKA-12994: --- Labels: kip-633 newbie newbie++ (was: kip kip-633) > Migrate all Tests to New API and Remove Suppression for Deprecation Warnings > related to KIP-633 > --- > > Key: KAFKA-12994 > URL: https://issues.apache.org/jira/browse/KAFKA-12994 > Project: Kafka > Issue Type: Improvement > Components: streams, unit tests >Affects Versions: 3.0.0 >Reporter: Israel Ekpo >Priority: Major > Labels: kip-633, newbie, newbie++ > Fix For: 3.1.0 > > > Due to the API changes for KIP-633 a lot of deprecation warnings have been > generated in tests that are using the old deprecated APIs. There are a lot of > tests using the deprecated methods. We should absolutely migrate them all to > the new APIs and then get rid of all the applicable annotations for > suppressing the deprecation warnings. > This applies to all Java and Scala examples and tests using the deprecated > APIs in the JoinWindows, SessionWindows, TimeWindows and SlidingWindows > classes. > > This is based on the feedback from reviewers in this PR > > https://github.com/apache/kafka/pull/10926 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (KAFKA-12994) Migrate all Tests to New API and Remove Suppression for Deprecation Warnings related to KIP-633
[ https://issues.apache.org/jira/browse/KAFKA-12994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman reassigned KAFKA-12994: -- Assignee: A. Sophie Blee-Goldman > Migrate all Tests to New API and Remove Suppression for Deprecation Warnings > related to KIP-633 > --- > > Key: KAFKA-12994 > URL: https://issues.apache.org/jira/browse/KAFKA-12994 > Project: Kafka > Issue Type: Improvement > Components: streams, unit tests >Affects Versions: 3.0.0 >Reporter: Israel Ekpo >Assignee: A. Sophie Blee-Goldman >Priority: Major > Labels: kip-633, newbie, newbie++ > Fix For: 3.1.0 > > > Due to the API changes for KIP-633 a lot of deprecation warnings have been > generated in tests that are using the old deprecated APIs. There are a lot of > tests using the deprecated methods. We should absolutely migrate them all to > the new APIs and then get rid of all the applicable annotations for > suppressing the deprecation warnings. > This applies to all Java and Scala examples and tests using the deprecated > APIs in the JoinWindows, SessionWindows, TimeWindows and SlidingWindows > classes. > > This is based on the feedback from reviewers in this PR > > https://github.com/apache/kafka/pull/10926 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (KAFKA-12994) Migrate all Tests to New API and Remove Suppression for Deprecation Warnings related to KIP-633
[ https://issues.apache.org/jira/browse/KAFKA-12994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman reassigned KAFKA-12994: -- Assignee: (was: Israel Ekpo) > Migrate all Tests to New API and Remove Suppression for Deprecation Warnings > related to KIP-633 > --- > > Key: KAFKA-12994 > URL: https://issues.apache.org/jira/browse/KAFKA-12994 > Project: Kafka > Issue Type: Improvement > Components: streams, unit tests >Affects Versions: 3.0.0 >Reporter: Israel Ekpo >Priority: Major > Labels: kip, kip-633 > Fix For: 3.1.0 > > > Due to the API changes for KIP-633 a lot of deprecation warnings have been > generated in tests that are using the old deprecated APIs. There are a lot of > tests using the deprecated methods. We should absolutely migrate them all to > the new APIs and then get rid of all the applicable annotations for > suppressing the deprecation warnings. > This applies to all Java and Scala examples and tests using the deprecated > APIs in the JoinWindows, SessionWindows, TimeWindows and SlidingWindows > classes. > > This is based on the feedback from reviewers in this PR > > https://github.com/apache/kafka/pull/10926 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (KAFKA-12994) Migrate all Tests to New API and Remove Suppression for Deprecation Warnings related to KIP-633
[ https://issues.apache.org/jira/browse/KAFKA-12994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman reassigned KAFKA-12994: -- Assignee: (was: A. Sophie Blee-Goldman) > Migrate all Tests to New API and Remove Suppression for Deprecation Warnings > related to KIP-633 > --- > > Key: KAFKA-12994 > URL: https://issues.apache.org/jira/browse/KAFKA-12994 > Project: Kafka > Issue Type: Improvement > Components: streams, unit tests >Affects Versions: 3.0.0 >Reporter: Israel Ekpo >Priority: Major > Labels: kip-633, newbie, newbie++ > Fix For: 3.1.0 > > > Due to the API changes for KIP-633 a lot of deprecation warnings have been > generated in tests that are using the old deprecated APIs. There are a lot of > tests using the deprecated methods. We should absolutely migrate them all to > the new APIs and then get rid of all the applicable annotations for > suppressing the deprecation warnings. > This applies to all Java and Scala examples and tests using the deprecated > APIs in the JoinWindows, SessionWindows, TimeWindows and SlidingWindows > classes. > > This is based on the feedback from reviewers in this PR > > https://github.com/apache/kafka/pull/10926 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-12994) Migrate all Tests to New API and Remove Suppression for Deprecation Warnings related to KIP-633
[ https://issues.apache.org/jira/browse/KAFKA-12994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17392530#comment-17392530 ] A. Sophie Blee-Goldman commented on KAFKA-12994: Hey [~iekpo], I'm unassigning this in case someone else wants to pick it up. If you already started working on this and have a partial PR ready with some subset of the tests migrated over, you can just open that PR and we can merge this in pieces. It seems like a lot of tests so splitting it up into multiple PRs is probably a good idea anyway. (And obviously feel free to re-assign it to yourself if you want to continue working on it, and/or split this ticket up into sub-tasks covering different sets of tests) > Migrate all Tests to New API and Remove Suppression for Deprecation Warnings > related to KIP-633 > --- > > Key: KAFKA-12994 > URL: https://issues.apache.org/jira/browse/KAFKA-12994 > Project: Kafka > Issue Type: Improvement > Components: streams, unit tests >Affects Versions: 3.0.0 >Reporter: Israel Ekpo >Assignee: Israel Ekpo >Priority: Major > Labels: kip, kip-633 > Fix For: 3.1.0 > > > Due to the API changes for KIP-633 a lot of deprecation warnings have been > generated in tests that are using the old deprecated APIs. There are a lot of > tests using the deprecated methods. We should absolutely migrate them all to > the new APIs and then get rid of all the applicable annotations for > suppressing the deprecation warnings. > This applies to all Java and Scala examples and tests using the deprecated > APIs in the JoinWindows, SessionWindows, TimeWindows and SlidingWindows > classes. > > This is based on the feedback from reviewers in this PR > > https://github.com/apache/kafka/pull/10926 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-12994) Migrate all Tests to New API and Remove Suppression for Deprecation Warnings related to KIP-633
[ https://issues.apache.org/jira/browse/KAFKA-12994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman updated KAFKA-12994: --- Fix Version/s: (was: 3.0.0) 3.1.0 > Migrate all Tests to New API and Remove Suppression for Deprecation Warnings > related to KIP-633 > --- > > Key: KAFKA-12994 > URL: https://issues.apache.org/jira/browse/KAFKA-12994 > Project: Kafka > Issue Type: Improvement > Components: streams, unit tests >Affects Versions: 3.0.0 >Reporter: Israel Ekpo >Assignee: Israel Ekpo >Priority: Major > Labels: kip, kip-633 > Fix For: 3.1.0 > > > Due to the API changes for KIP-633 a lot of deprecation warnings have been > generated in tests that are using the old deprecated APIs. There are a lot of > tests using the deprecated methods. We should absolutely migrate them all to > the new APIs and then get rid of all the applicable annotations for > suppressing the deprecation warnings. > This applies to all Java and Scala examples and tests using the deprecated > APIs in the JoinWindows, SessionWindows, TimeWindows and SlidingWindows > classes. > > This is based on the feedback from reviewers in this PR > > https://github.com/apache/kafka/pull/10926 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-9897) Flaky Test StoreQueryIntegrationTest#shouldQuerySpecificActivePartitionStores
[ https://issues.apache.org/jira/browse/KAFKA-9897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390799#comment-17390799 ] A. Sophie Blee-Goldman commented on KAFKA-9897: --- [https://github.com/apache/kafka/pull/3|https://github.com/apache/kafka/pull/3] > Flaky Test StoreQueryIntegrationTest#shouldQuerySpecificActivePartitionStores > - > > Key: KAFKA-9897 > URL: https://issues.apache.org/jira/browse/KAFKA-9897 > Project: Kafka > Issue Type: Bug > Components: streams, unit tests >Affects Versions: 2.6.0 >Reporter: Matthias J. Sax >Priority: Critical > Labels: flaky-test > > [https://builds.apache.org/job/kafka-pr-jdk14-scala2.13/22/testReport/junit/org.apache.kafka.streams.integration/StoreQueryIntegrationTest/shouldQuerySpecificActivePartitionStores/] > {quote}org.apache.kafka.streams.errors.InvalidStateStoreException: Cannot get > state store source-table because the stream thread is PARTITIONS_ASSIGNED, > not RUNNING at > org.apache.kafka.streams.state.internals.StreamThreadStateStoreProvider.stores(StreamThreadStateStoreProvider.java:85) > at > org.apache.kafka.streams.state.internals.QueryableStoreProvider.getStore(QueryableStoreProvider.java:61) > at org.apache.kafka.streams.KafkaStreams.store(KafkaStreams.java:1183) at > org.apache.kafka.streams.integration.StoreQueryIntegrationTest.shouldQuerySpecificActivePartitionStores(StoreQueryIntegrationTest.java:178){quote} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (KAFKA-10246) AbstractProcessorContext topic() throws NullPointerException when modifying a state store within the DSL from a punctuator
[ https://issues.apache.org/jira/browse/KAFKA-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman resolved KAFKA-10246. Resolution: Fixed > AbstractProcessorContext topic() throws NullPointerException when modifying a > state store within the DSL from a punctuator > -- > > Key: KAFKA-10246 > URL: https://issues.apache.org/jira/browse/KAFKA-10246 > Project: Kafka > Issue Type: Bug > Components: streams >Affects Versions: 2.5.0 > Environment: linux, windows, java 11 >Reporter: Peter Pringle >Priority: Major > > NullPointerException seen when a KTable statestore is being modified by a > punctuated method which is added to a topology via the DSL processor/ktable > valueTransformer methods. > It seems valid for AbstractProcessorContext.topic() to return null; however > the check below throws a NullPointerException before a null can be returned. > {quote}if (topic.equals(NONEXIST_TOPIC)) { > {quote} > Made a local fix to reverse the ordering of the check (i.e. avoid the null) > and this appears to fix the issue and sends the change to the state store's > changelog topic. 
> {quote}if (NONEXIST_TOPIC.equals(topic)) {
> {quote}
> Stacktrace below
> 2020-07-02 07:29:46,829 [ABC_aggregator-551a90c1-d7c3-4357-a608-3ea79951f4e8-StreamThread-5] ERROR [o.a.k.s.p.i.StreamThread]: stream-thread [ABC_aggregator-551a90c1-d7c3-4357-a608-3ea79951f4e8-StreamThread-5] Encountered the following error during processing:
> java.lang.NullPointerException: null
> at org.apache.kafka.streams.processor.internals.AbstractProcessorContext.topic(AbstractProcessorContext.java:115)
> at org.apache.kafka.streams.state.internals.CachingKeyValueStore.putInternal(CachingKeyValueStore.java:141)
> at org.apache.kafka.streams.state.internals.CachingKeyValueStore.put(CachingKeyValueStore.java:123)
> at org.apache.kafka.streams.state.internals.CachingKeyValueStore.put(CachingKeyValueStore.java:36)
> at org.apache.kafka.streams.state.internals.MeteredKeyValueStore.lambda$put$3(MeteredKeyValueStore.java:144)
> at org.apache.kafka.streams.processor.internals.metrics.StreamsMetricsImpl.maybeMeasureLatency(StreamsMetricsImpl.java:806)
> at org.apache.kafka.streams.state.internals.MeteredKeyValueStore.put(MeteredKeyValueStore.java:144)
> at org.apache.kafka.streams.processor.internals.ProcessorContextImpl$KeyValueStoreReadWriteDecorator.put(ProcessorContextImpl.java:487)
> at org.apache.kafka.streams.kstream.internals.KTableKTableJoinMerger$KTableKTableJoinMergeProcessor.process(KTableKTableJoinMerger.java:118)
> at org.apache.kafka.streams.kstream.internals.KTableKTableJoinMerger$KTableKTableJoinMergeProcessor.process(KTableKTableJoinMerger.java:97)
> at org.apache.kafka.streams.processor.internals.ProcessorNode.lambda$process$2(ProcessorNode.java:142)
> at org.apache.kafka.streams.processor.internals.metrics.StreamsMetricsImpl.maybeMeasureLatency(StreamsMetricsImpl.java:806)
> at org.apache.kafka.streams.processor.internals.ProcessorNode.process(ProcessorNode.java:142)
> at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:201)
> at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:180)
> at org.apache.kafka.streams.kstream.internals.KTableKTableOuterJoin$KTableKTableOuterJoinProcessor.process(KTableKTableOuterJoin.java:118)
> at org.apache.kafka.streams.kstream.internals.KTableKTableOuterJoin$KTableKTableOuterJoinProcessor.process(KTableKTableOuterJoin.java:65)
> at org.apache.kafka.streams.processor.internals.ProcessorNode.lambda$process$2(ProcessorNode.java:142)
> at org.apache.kafka.streams.processor.internals.metrics.StreamsMetricsImpl.maybeMeasureLatency(StreamsMetricsImpl.java:806)
> at org.apache.kafka.streams.processor.internals.ProcessorNode.process(ProcessorNode.java:142)
> at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:201)
> at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:180)
> at org.apache.kafka.streams.kstream.internals.TimestampedCacheFlushListener.apply(TimestampedCacheFlushListener.java:45)
> at org.apache.kafka.streams.kstream.internals.TimestampedCacheFlushListener.apply(TimestampedCacheFlushListener.java:28)
> at org.apache.kafka.streams.state.internals.MeteredKeyValueStore.lambda$setFlushListener$1(MeteredKeyValueStore.java:119)
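The one-line fix described in the report comes down to calling equals() on the non-null constant rather than on the possibly-null variable. A standalone illustration (NONEXIST_TOPIC here is a stand-in constant, not the actual Kafka internal):

```java
// Minimal illustration of the null-safe equals ordering from the report:
// with a possibly-null topic, topic.equals(CONST) throws, CONST.equals(topic)
// simply returns false. NONEXIST_TOPIC is a hypothetical stand-in value.
public class NullSafeEquals {
    static final String NONEXIST_TOPIC = "__nonexistent_topic__";

    // The original check: NPEs when topic() legitimately returns null
    static boolean isNonExistTopicUnsafe(String topic) {
        return topic.equals(NONEXIST_TOPIC);
    }

    // The reversed check from the local fix: null-safe
    static boolean isNonExistTopicSafe(String topic) {
        return NONEXIST_TOPIC.equals(topic);
    }

    public static void main(String[] args) {
        System.out.println(isNonExistTopicSafe(null)); // false, no exception
        try {
            isNonExistTopicUnsafe(null);
        } catch (NullPointerException e) {
            System.out.println("NPE from topic.equals(...)");
        }
    }
}
```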
[jira] [Resolved] (KAFKA-13150) How is Kafkastream configured to consume data from a specified offset ?
[ https://issues.apache.org/jira/browse/KAFKA-13150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman resolved KAFKA-13150. Resolution: Invalid > How is Kafkastream configured to consume data from a specified offset ? > --- > > Key: KAFKA-13150 > URL: https://issues.apache.org/jira/browse/KAFKA-13150 > Project: Kafka > Issue Type: New Feature > Components: streams >Reporter: wangjh >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-13150) How is Kafkastream configured to consume data from a specified offset ?
[ https://issues.apache.org/jira/browse/KAFKA-13150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390096#comment-17390096 ] A. Sophie Blee-Goldman commented on KAFKA-13150: [~wangjh] Jira is intended only for bug reports and feature requests; please use the mailing list for questions > How is Kafkastream configured to consume data from a specified offset ? > --- > > Key: KAFKA-13150 > URL: https://issues.apache.org/jira/browse/KAFKA-13150 > Project: Kafka > Issue Type: New Feature > Components: streams >Reporter: wangjh >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-8295) Optimize count() using RocksDB merge operator
[ https://issues.apache.org/jira/browse/KAFKA-8295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17388304#comment-17388304 ] A. Sophie Blee-Goldman commented on KAFKA-8295: --- I was just re-reading the wiki page on the Merge Operator, and now I wonder if it may not be _as_ helpful as I originally thought – but probably still can offer some improvement. Here's my take, let me know what you think. Regardless of whether a custom MergeOperator suffers from the same performance impact of crossing the jni, I would bet that use cases such as list-append would still be more performant (since reading out an entire list, appending to it, and then writing the entire thing back is a lot of I/O). There are also the built-in, native MergeOperators that wouldn't need to cross the jni such as the UInt64AddOperator as you point out. So there are definitely cases where a MergeOperator would still outperform an RMW sequence. The thing I didn't fully appreciate before (but seems kind of obvious now that I think of it lol) is that the merge() call doesn't actually return the current value, either before or after the merge. So if we have to know this value in addition to updating it, we need to do a get(), and using merge() instead of RMW is only saving us the cost of `put(full_merged_value) - put(single_update_value)` – which means for constant-size values, like the uint64 unfortunately, there's pretty much no savings at all. So we don't even need to worry about whether/how to handle the fact that this is now a ValueAndTimestamp instead of a plain Value, ie a Long in the case of count(), because I don't think there's likely to be any performance improvement there. I didn't realize that at the time of filing this ticket, so maybe we should look past the current title of this ticket. 
This still leaves some cases that could potentially benefit from even a custom MergeOperator, such as list-append or any other where the difference in size between the full_merged_value and the single_update_value is very large. So it could be worth doing a POC of something like this and benchmarking that for a KIP. But tbh, having seen how messy it is to add new operators to the StateStore interface at the moment, I think we should probably try to avoid doing so unless there's good motivation and a clear benefit. In this case, while there may be a benefit, I'm not sure there's a good motivation to do so since no user has requested this feature yet. Of course that could just be because they aren't aware of the possibility, so how about this: we update the title of this ticket to describe this possible new feature and then see if any users chime in here or vote on the ticket. If we gauge real user interest then it makes more sense to put time into doing this. WDYT? > Optimize count() using RocksDB merge operator > - > > Key: KAFKA-8295 > URL: https://issues.apache.org/jira/browse/KAFKA-8295 > Project: Kafka > Issue Type: Improvement > Components: streams >Reporter: A. Sophie Blee-Goldman >Assignee: Sagar Rao >Priority: Major > > In addition to regular put/get/delete RocksDB provides a fourth operation, > merge. This essentially provides an optimized read/update/write path in a > single operation. One of the built-in (C++) merge operators exposed over the > Java API is a counter. We should be able to leverage this for a more > efficient implementation of count() > > (Note: Unfortunately it seems unlikely we can use this to optimize general > aggregations, even if RocksJava allowed for a custom merge operator, unless > we provide a way for the user to specify and connect a C++ implemented > aggregator – otherwise we incur too much cost crossing the jni for a net > performance benefit) -- This message was sent by Atlassian Jira (v8.3.4#803005)
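The I/O argument above, that RMW rewrites the whole list while merge hands the store only the delta, can be sketched with a toy in-memory store. The merge-style append here is a simplified stand-in for RocksDB's merge operator, not the RocksJava API:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the list-append case from the comment above: read-modify-write
// rewrites the entire list on every update, while a merge-style append only
// passes the single new element to the store. Illustration only; this is not
// how RocksDB exposes merge operators over JNI.
public class MergeVsRmw {
    final Map<String, List<String>> store = new HashMap<>();
    long bytesWritten = 0; // rough proxy for write I/O

    // Read the full list, append, write the full list back
    void rmwAppend(String key, String value) {
        List<String> full = new ArrayList<>(store.getOrDefault(key, List.of()));
        full.add(value);
        store.put(key, full);
        for (String v : full) bytesWritten += v.length(); // whole list rewritten
    }

    // Hand the store only the delta; the store does the combining
    void mergeAppend(String key, String value) {
        store.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        bytesWritten += value.length(); // only the new element written
    }

    public static void main(String[] args) {
        MergeVsRmw rmw = new MergeVsRmw();
        MergeVsRmw merge = new MergeVsRmw();
        for (int i = 0; i < 100; i++) {
            rmw.rmwAppend("k", "v" + i);
            merge.mergeAppend("k", "v" + i);
        }
        // Same final state, very different write volume:
        System.out.println(rmw.store.get("k").equals(merge.store.get("k"))); // true
        System.out.println(rmw.bytesWritten > merge.bytesWritten); // true
    }
}
```

The gap between the two write counters grows with list length, which is why list-append is the case where a merge operator plausibly pays off and constant-size values are not.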
[jira] [Commented] (KAFKA-9897) Flaky Test StoreQueryIntegrationTest#shouldQuerySpecificActivePartitionStores
[ https://issues.apache.org/jira/browse/KAFKA-9897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17387652#comment-17387652 ] A. Sophie Blee-Goldman commented on KAFKA-9897: --- [~mjsax] that last failure is for a different test that was added recently, we merged a fix three days ago which I think was just after that build was triggered...but if you see that specific test fail again on a recent build can you reopen KAFKA-13128 instead? > Flaky Test StoreQueryIntegrationTest#shouldQuerySpecificActivePartitionStores > - > > Key: KAFKA-9897 > URL: https://issues.apache.org/jira/browse/KAFKA-9897 > Project: Kafka > Issue Type: Bug > Components: streams, unit tests >Affects Versions: 2.6.0 >Reporter: Matthias J. Sax >Priority: Critical > Labels: flaky-test > > [https://builds.apache.org/job/kafka-pr-jdk14-scala2.13/22/testReport/junit/org.apache.kafka.streams.integration/StoreQueryIntegrationTest/shouldQuerySpecificActivePartitionStores/] > {quote}org.apache.kafka.streams.errors.InvalidStateStoreException: Cannot get > state store source-table because the stream thread is PARTITIONS_ASSIGNED, > not RUNNING at > org.apache.kafka.streams.state.internals.StreamThreadStateStoreProvider.stores(StreamThreadStateStoreProvider.java:85) > at > org.apache.kafka.streams.state.internals.QueryableStoreProvider.getStore(QueryableStoreProvider.java:61) > at org.apache.kafka.streams.KafkaStreams.store(KafkaStreams.java:1183) at > org.apache.kafka.streams.integration.StoreQueryIntegrationTest.shouldQuerySpecificActivePartitionStores(StoreQueryIntegrationTest.java:178){quote} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-13120) Flesh out `streams_static_membership_test` to be more robust
[ https://issues.apache.org/jira/browse/KAFKA-13120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17387506#comment-17387506 ] A. Sophie Blee-Goldman commented on KAFKA-13120: Thanks [~Reggiehsu111]! I assigned the ticket to you and added you as a contributor so you can self-assign tickets from now on. Let me or [~lct45] know if you have any questions > Flesh out `streams_static_membership_test` to be more robust > > > Key: KAFKA-13120 > URL: https://issues.apache.org/jira/browse/KAFKA-13120 > Project: Kafka > Issue Type: Task > Components: streams, system tests >Reporter: Leah Thomas >Assignee: Reggie Hsu >Priority: Minor > Labels: newbie++ > > When fixing the `streams_static_membership_test.py` we noticed that the test > is pretty bare bones, it creates a streams application but doesn't do much > with the streams application, eg has no stateful processing. We should flesh > this out a bit to be more realistic and potentially consider testing with EOS > as well. The full java test is in `StaticMembershipTestClient` -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (KAFKA-13120) Flesh out `streams_static_membership_test` to be more robust
[ https://issues.apache.org/jira/browse/KAFKA-13120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman reassigned KAFKA-13120: -- Assignee: Reggie Hsu > Flesh out `streams_static_membership_test` to be more robust > > > Key: KAFKA-13120 > URL: https://issues.apache.org/jira/browse/KAFKA-13120 > Project: Kafka > Issue Type: Task > Components: streams, system tests >Reporter: Leah Thomas >Assignee: Reggie Hsu >Priority: Minor > Labels: newbie++ > > When fixing the `streams_static_membership_test.py` we noticed that the test > is pretty bare bones, it creates a streams application but doesn't do much > with the streams application, eg has no stateful processing. We should flesh > this out a bit to be more realistic and potentially consider testing with EOS > as well. The full java test is in `StaticMembershipTestClient` -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (KAFKA-13021) Improve Javadocs for API Changes and address followup from KIP-633
[ https://issues.apache.org/jira/browse/KAFKA-13021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman resolved KAFKA-13021. Resolution: Fixed > Improve Javadocs for API Changes and address followup from KIP-633 > -- > > Key: KAFKA-13021 > URL: https://issues.apache.org/jira/browse/KAFKA-13021 > Project: Kafka > Issue Type: Improvement > Components: documentation, streams >Affects Versions: 3.0.0 >Reporter: Israel Ekpo >Assignee: Israel Ekpo >Priority: Major > Fix For: 3.0.0 > > > There are Javadoc changes from the following PR that need to be completed > prior to the 3.0 release. This Jira item is to track that work > [https://github.com/apache/kafka/pull/10926] > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-13021) Improve Javadocs for API Changes and address followup from KIP-633
[ https://issues.apache.org/jira/browse/KAFKA-13021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman updated KAFKA-13021: --- Fix Version/s: 3.0.0 > Improve Javadocs for API Changes and address followup from KIP-633 > -- > > Key: KAFKA-13021 > URL: https://issues.apache.org/jira/browse/KAFKA-13021 > Project: Kafka > Issue Type: Improvement > Components: documentation, streams >Affects Versions: 3.0.0 >Reporter: Israel Ekpo >Assignee: Israel Ekpo >Priority: Major > Fix For: 3.0.0 > > > There are Javadoc changes from the following PR that needs to be completed > prior to the 3.0 release. This Jira item is to track that work > [https://github.com/apache/kafka/pull/10926] > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-13128) Flaky Test StoreQueryIntegrationTest.shouldQueryStoresAfterAddingAndRemovingStreamThread
[ https://issues.apache.org/jira/browse/KAFKA-13128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman updated KAFKA-13128: --- Affects Version/s: 2.8.1 3.0.0 > Flaky Test > StoreQueryIntegrationTest.shouldQueryStoresAfterAddingAndRemovingStreamThread > > > Key: KAFKA-13128 > URL: https://issues.apache.org/jira/browse/KAFKA-13128 > Project: Kafka > Issue Type: Bug > Components: streams >Affects Versions: 3.0.0, 2.8.1 >Reporter: A. Sophie Blee-Goldman >Assignee: A. Sophie Blee-Goldman >Priority: Blocker > Labels: flaky-test > Fix For: 3.0.0, 2.8.1 > > > h3. Stacktrace > java.lang.AssertionError: Expected: is not null but: was null > at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20) > at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6) > at > org.apache.kafka.streams.integration.StoreQueryIntegrationTest.lambda$shouldQueryStoresAfterAddingAndRemovingStreamThread$19(StoreQueryIntegrationTest.java:461) > at > org.apache.kafka.streams.integration.StoreQueryIntegrationTest.until(StoreQueryIntegrationTest.java:506) > at > org.apache.kafka.streams.integration.StoreQueryIntegrationTest.shouldQueryStoresAfterAddingAndRemovingStreamThread(StoreQueryIntegrationTest.java:455) > > https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-11085/5/testReport/org.apache.kafka.streams.integration/StoreQueryIntegrationTest/Build___JDK_16_and_Scala_2_13___shouldQueryStoresAfterAddingAndRemovingStreamThread_2/ -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-13130) Deprecate long based range queries in SessionStore
[ https://issues.apache.org/jira/browse/KAFKA-13130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17386523#comment-17386523 ] A. Sophie Blee-Goldman commented on KAFKA-13130: Hey [~pstuedi], just a heads up there is actually a KIP for this that's already been accepted. I suspect [~jeqo] wouldn't mind if you took over the implementation. [KIP-666|https://cwiki.apache.org/confluence/display/KAFKA/KIP-666%3A+Add+Instant-based+methods+to+ReadOnlySessionStore] > Deprecate long based range queries in SessionStore > -- > > Key: KAFKA-13130 > URL: https://issues.apache.org/jira/browse/KAFKA-13130 > Project: Kafka > Issue Type: Improvement > Components: streams >Reporter: Patrick Stuedi >Assignee: Patrick Stuedi >Priority: Minor > Labels: needs-kip, newbie, newbie++ > Fix For: 3.1.0 > > > Migrate long based queries in ReadOnlySessionStore (fetchSession*, > findSession*, etc.) to object based interfaces. Deprecate old long based > interface, similar to how it was done for the ReadOnlyWindowStore. > Related KIPs: > [https://cwiki.apache.org/confluence/display/KAFKA/KIP-358%3A+Migrate+Streams+API+to+Duration+instead+of+long+ms+times] > [https://cwiki.apache.org/confluence/display/KAFKA/KIP-617%3A+Allow+Kafka+Streams+State+Stores+to+be+iterated+backwards] > > Related Jiras: > https://issues.apache.org/jira/browse/KAFKA-12419 > https://issues.apache.org/jira/browse/KAFKA-12526 > https://issues.apache.org/jira/browse/KAFKA-12451 > -- This message was sent by Atlassian Jira (v8.3.4#803005)
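The migration described in this ticket (mirroring what KIP-358 did for the window stores) generally keeps the long-based overloads but deprecates them in favor of Instant-based ones that delegate to the same logic. A minimal sketch of that shape in plain Java; the class and method names below are illustrative stand-ins, not Kafka's actual ReadOnlySessionStore signatures:

```java
import java.time.Instant;

// Hypothetical sketch of the KIP-666 style migration: the old long-based
// method is kept but deprecated, and simply delegates to the new
// Instant-based overload so both call paths share one implementation.
public class SessionStoreMigrationSketch {

    // New preferred variant: callers pass Instants instead of raw epoch millis.
    public static String findSessions(String key, Instant earliest, Instant latest) {
        // Stand-in for a real range query; returns a printable description.
        return key + "[" + earliest.toEpochMilli() + "," + latest.toEpochMilli() + "]";
    }

    // Old variant, deprecated in favor of the Instant overload above.
    @Deprecated
    public static String findSessions(String key, long earliestMs, long latestMs) {
        return findSessions(key, Instant.ofEpochMilli(earliestMs), Instant.ofEpochMilli(latestMs));
    }
}
```

Delegating old-to-new (rather than the reverse) keeps the deprecated surface thin, so it can be deleted in one step once the deprecation period ends.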
[jira] [Updated] (KAFKA-13120) Flesh out `streams_static_membership_test` to be more robust
[ https://issues.apache.org/jira/browse/KAFKA-13120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman updated KAFKA-13120: --- Labels: newbie++ (was: ) > Flesh out `streams_static_membership_test` to be more robust > > > Key: KAFKA-13120 > URL: https://issues.apache.org/jira/browse/KAFKA-13120 > Project: Kafka > Issue Type: Task > Components: streams, system tests >Reporter: Leah Thomas >Priority: Minor > Labels: newbie++ > > When fixing the `streams_static_membership_test.py` we noticed that the test > is pretty bare bones, it creates a streams application but doesn't do much > with the streams application, eg has no stateful processing. We should flesh > this out a bit to be more realistic and potentially consider testing with EOS > as well. The full java test is in `StaticMembershipTestClient` -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-13021) Improve Javadocs for API Changes and address followup from KIP-633
[ https://issues.apache.org/jira/browse/KAFKA-13021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17385929#comment-17385929 ] A. Sophie Blee-Goldman commented on KAFKA-13021: PR: https://github.com/apache/kafka/pull/4 > Improve Javadocs for API Changes and address followup from KIP-633 > -- > > Key: KAFKA-13021 > URL: https://issues.apache.org/jira/browse/KAFKA-13021 > Project: Kafka > Issue Type: Improvement > Components: documentation, streams >Affects Versions: 3.0.0 >Reporter: Israel Ekpo >Assignee: Israel Ekpo >Priority: Major > > There are Javadoc changes from the following PR that needs to be completed > prior to the 3.0 release. This Jira item is to track that work > [https://github.com/apache/kafka/pull/10926] > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-13021) Improve Javadocs for API Changes and address followup from KIP-633
[ https://issues.apache.org/jira/browse/KAFKA-13021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman updated KAFKA-13021: --- Summary: Improve Javadocs for API Changes and address followup from KIP-633 (was: Improve Javadocs for API Changes from KIP-633) > Improve Javadocs for API Changes and address followup from KIP-633 > -- > > Key: KAFKA-13021 > URL: https://issues.apache.org/jira/browse/KAFKA-13021 > Project: Kafka > Issue Type: Improvement > Components: documentation, streams >Affects Versions: 3.0.0 >Reporter: Israel Ekpo >Assignee: Israel Ekpo >Priority: Major > > There are Javadoc changes from the following PR that needs to be completed > prior to the 3.0 release. This Jira item is to track that work > [https://github.com/apache/kafka/pull/10926] > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (KAFKA-6948) Avoid overflow in timestamp comparison
[ https://issues.apache.org/jira/browse/KAFKA-6948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman reassigned KAFKA-6948: - Assignee: A. Sophie Blee-Goldman > Avoid overflow in timestamp comparison > -- > > Key: KAFKA-6948 > URL: https://issues.apache.org/jira/browse/KAFKA-6948 > Project: Kafka > Issue Type: Improvement >Reporter: Giovanni Liva >Assignee: A. Sophie Blee-Goldman >Priority: Major > > Some comparisons with timestamp values are not safe. This comparisons can > trigger errors that were found in some other issues, e.g. KAFKA-4290 or > KAFKA-6608. > The following classes contains some comparison between timestamps that can > overflow. > * org.apache.kafka.clients.NetworkClientUtils > * org.apache.kafka.clients.consumer.internals.ConsumerCoordinator > * org.apache.kafka.common.security.kerberos.KerberosLogin > * org.apache.kafka.connect.runtime.WorkerSinkTask > * org.apache.kafka.connect.tools.MockSinkTask > * org.apache.kafka.connect.tools.MockSourceTask > * org.apache.kafka.streams.processor.internals.GlobalStreamThread > * org.apache.kafka.streams.processor.internals.StateDirectory > * org.apache.kafka.streams.processor.internals.StreamThread > -- This message was sent by Atlassian Jira (v8.3.4#803005)
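The hazard in the classes listed above is the classic absolute-deadline pattern: adding a timeout to a timestamp can wrap past Long.MAX_VALUE, silently flipping the comparison. A minimal sketch of the unsafe and safe forms (illustrative code, not lifted from the Kafka classes themselves):

```java
// Demonstrates the timestamp-comparison overflow: computing an absolute
// deadline can wrap negative when the timeout is huge, while comparing the
// elapsed difference against the timeout stays correct as long as the true
// distance between the two timestamps fits in a long.
public class TimestampCompare {

    // Unsafe: startMs + timeoutMs may overflow to a large negative value,
    // making the "deadline" appear to be in the distant past.
    public static boolean unsafeExpired(long nowMs, long startMs, long timeoutMs) {
        long deadline = startMs + timeoutMs; // may overflow
        return nowMs >= deadline;
    }

    // Safe: compare elapsed time against the timeout instead.
    public static boolean safeExpired(long nowMs, long startMs, long timeoutMs) {
        return nowMs - startMs >= timeoutMs;
    }
}
```

With `timeoutMs = Long.MAX_VALUE`, the unsafe form reports an immediate expiry (the sum wraps negative) while the safe form correctly reports not-yet-expired.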
[jira] [Updated] (KAFKA-13128) Flaky Test StoreQueryIntegrationTest.shouldQueryStoresAfterAddingAndRemovingStreamThread
[ https://issues.apache.org/jira/browse/KAFKA-13128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman updated KAFKA-13128: --- Priority: Blocker (was: Major) > Flaky Test > StoreQueryIntegrationTest.shouldQueryStoresAfterAddingAndRemovingStreamThread > > > Key: KAFKA-13128 > URL: https://issues.apache.org/jira/browse/KAFKA-13128 > Project: Kafka > Issue Type: Bug > Components: streams >Reporter: A. Sophie Blee-Goldman >Assignee: A. Sophie Blee-Goldman >Priority: Blocker > Labels: flaky-test > Fix For: 3.0.0, 2.8.1 > > > h3. Stacktrace > java.lang.AssertionError: Expected: is not null but: was null > at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20) > at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6) > at > org.apache.kafka.streams.integration.StoreQueryIntegrationTest.lambda$shouldQueryStoresAfterAddingAndRemovingStreamThread$19(StoreQueryIntegrationTest.java:461) > at > org.apache.kafka.streams.integration.StoreQueryIntegrationTest.until(StoreQueryIntegrationTest.java:506) > at > org.apache.kafka.streams.integration.StoreQueryIntegrationTest.shouldQueryStoresAfterAddingAndRemovingStreamThread(StoreQueryIntegrationTest.java:455) > > https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-11085/5/testReport/org.apache.kafka.streams.integration/StoreQueryIntegrationTest/Build___JDK_16_and_Scala_2_13___shouldQueryStoresAfterAddingAndRemovingStreamThread_2/ -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-13128) Flaky Test StoreQueryIntegrationTest.shouldQueryStoresAfterAddingAndRemovingStreamThread
[ https://issues.apache.org/jira/browse/KAFKA-13128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman updated KAFKA-13128: --- Affects Version/s: (was: 3.1.0) > Flaky Test > StoreQueryIntegrationTest.shouldQueryStoresAfterAddingAndRemovingStreamThread > > > Key: KAFKA-13128 > URL: https://issues.apache.org/jira/browse/KAFKA-13128 > Project: Kafka > Issue Type: Bug > Components: streams >Reporter: A. Sophie Blee-Goldman >Assignee: A. Sophie Blee-Goldman >Priority: Major > Labels: flaky-test > Fix For: 3.0.0, 2.8.1 > > > h3. Stacktrace > java.lang.AssertionError: Expected: is not null but: was null > at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20) > at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6) > at > org.apache.kafka.streams.integration.StoreQueryIntegrationTest.lambda$shouldQueryStoresAfterAddingAndRemovingStreamThread$19(StoreQueryIntegrationTest.java:461) > at > org.apache.kafka.streams.integration.StoreQueryIntegrationTest.until(StoreQueryIntegrationTest.java:506) > at > org.apache.kafka.streams.integration.StoreQueryIntegrationTest.shouldQueryStoresAfterAddingAndRemovingStreamThread(StoreQueryIntegrationTest.java:455) > > https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-11085/5/testReport/org.apache.kafka.streams.integration/StoreQueryIntegrationTest/Build___JDK_16_and_Scala_2_13___shouldQueryStoresAfterAddingAndRemovingStreamThread_2/ -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-13128) Flaky Test StoreQueryIntegrationTest.shouldQueryStoresAfterAddingAndRemovingStreamThread
[ https://issues.apache.org/jira/browse/KAFKA-13128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman updated KAFKA-13128: --- Fix Version/s: 2.8.1 3.0.0 > Flaky Test > StoreQueryIntegrationTest.shouldQueryStoresAfterAddingAndRemovingStreamThread > > > Key: KAFKA-13128 > URL: https://issues.apache.org/jira/browse/KAFKA-13128 > Project: Kafka > Issue Type: Bug > Components: streams >Affects Versions: 3.1.0 >Reporter: A. Sophie Blee-Goldman >Assignee: A. Sophie Blee-Goldman >Priority: Major > Labels: flaky-test > Fix For: 3.0.0, 2.8.1 > > > h3. Stacktrace > java.lang.AssertionError: Expected: is not null but: was null > at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20) > at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6) > at > org.apache.kafka.streams.integration.StoreQueryIntegrationTest.lambda$shouldQueryStoresAfterAddingAndRemovingStreamThread$19(StoreQueryIntegrationTest.java:461) > at > org.apache.kafka.streams.integration.StoreQueryIntegrationTest.until(StoreQueryIntegrationTest.java:506) > at > org.apache.kafka.streams.integration.StoreQueryIntegrationTest.shouldQueryStoresAfterAddingAndRemovingStreamThread(StoreQueryIntegrationTest.java:455) > > https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-11085/5/testReport/org.apache.kafka.streams.integration/StoreQueryIntegrationTest/Build___JDK_16_and_Scala_2_13___shouldQueryStoresAfterAddingAndRemovingStreamThread_2/ -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (KAFKA-13128) Flaky Test StoreQueryIntegrationTest.shouldQueryStoresAfterAddingAndRemovingStreamThread
[ https://issues.apache.org/jira/browse/KAFKA-13128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman reassigned KAFKA-13128: -- Assignee: A. Sophie Blee-Goldman > Flaky Test > StoreQueryIntegrationTest.shouldQueryStoresAfterAddingAndRemovingStreamThread > > > Key: KAFKA-13128 > URL: https://issues.apache.org/jira/browse/KAFKA-13128 > Project: Kafka > Issue Type: Bug > Components: streams >Affects Versions: 3.1.0 >Reporter: A. Sophie Blee-Goldman >Assignee: A. Sophie Blee-Goldman >Priority: Major > Labels: flaky-test > > h3. Stacktrace > java.lang.AssertionError: Expected: is not null but: was null > at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20) > at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6) > at > org.apache.kafka.streams.integration.StoreQueryIntegrationTest.lambda$shouldQueryStoresAfterAddingAndRemovingStreamThread$19(StoreQueryIntegrationTest.java:461) > at > org.apache.kafka.streams.integration.StoreQueryIntegrationTest.until(StoreQueryIntegrationTest.java:506) > at > org.apache.kafka.streams.integration.StoreQueryIntegrationTest.shouldQueryStoresAfterAddingAndRemovingStreamThread(StoreQueryIntegrationTest.java:455) > > https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-11085/5/testReport/org.apache.kafka.streams.integration/StoreQueryIntegrationTest/Build___JDK_16_and_Scala_2_13___shouldQueryStoresAfterAddingAndRemovingStreamThread_2/ -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-13128) Flaky Test StoreQueryIntegrationTest.shouldQueryStoresAfterAddingAndRemovingStreamThread
[ https://issues.apache.org/jira/browse/KAFKA-13128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17385841#comment-17385841 ] A. Sophie Blee-Goldman commented on KAFKA-13128:

The failure is from the second line in this series of assertions, which is basically the first time we attempt IQ after starting up:

{code:java}
assertThat(store1.get(key), is(notNullValue()));
assertThat(store1.get(key2), is(notNullValue()));
assertThat(store1.get(key3), is(notNullValue()));
{code}

The test setup includes starting Streams and waiting for it to reach RUNNING, then adding a new thread, and finally producing a set of 100 records for each of the three keys. After that it waits for all records to be processed *for _key3_* and then proceeds to the above assertions. I suspect the problem is that we only wait for all data to be processed for _key3_, but not for the other two keys.

In theory this should work, since the data for _key3_ is produced last and would have the largest timestamps, meaning the keys should be processed more or less in order. However, the input topic actually has two partitions, so it could be that _key1_ and _key3_ correspond to task 1 while _key2_ corresponds to task 2. Again, that shouldn't affect the order in which records are processed – as long as the tasks are on the same thread. But we started up a new thread in between waiting for Streams to reach RUNNING and producing data to the input topics. This new thread has to be assigned one of the tasks, but due to cooperative rebalancing it will take two full (though short) rebalances before the new thread can actually start processing any tasks. Therefore, as long as the original thread continues to own the task corresponding to _key3_ after the new thread is added, it can easily get through all records for _key3_, which means the test can proceed to the above assertions while the new thread is still waiting to start processing any data for _key2_ at all.
There are a few ways we can address this given how many things had to happen exactly right in order to see this failure, but the simplest fix is to just wait on all three keys to be fully processed rather than just the one. This seems to align with the original intention of the test best as well > Flaky Test > StoreQueryIntegrationTest.shouldQueryStoresAfterAddingAndRemovingStreamThread > > > Key: KAFKA-13128 > URL: https://issues.apache.org/jira/browse/KAFKA-13128 > Project: Kafka > Issue Type: Bug > Components: streams >Affects Versions: 3.1.0 >Reporter: A. Sophie Blee-Goldman >Priority: Major > Labels: flaky-test > > h3. Stacktrace > java.lang.AssertionError: Expected: is not null but: was null > at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20) > at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6) > at > org.apache.kafka.streams.integration.StoreQueryIntegrationTest.lambda$shouldQueryStoresAfterAddingAndRemovingStreamThread$19(StoreQueryIntegrationTest.java:461) > at > org.apache.kafka.streams.integration.StoreQueryIntegrationTest.until(StoreQueryIntegrationTest.java:506) > at > org.apache.kafka.streams.integration.StoreQueryIntegrationTest.shouldQueryStoresAfterAddingAndRemovingStreamThread(StoreQueryIntegrationTest.java:455) > > https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-11085/5/testReport/org.apache.kafka.streams.integration/StoreQueryIntegrationTest/Build___JDK_16_and_Scala_2_13___shouldQueryStoresAfterAddingAndRemovingStreamThread_2/ -- This message was sent by Atlassian Jira (v8.3.4#803005)
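The simplest fix suggested above (waiting on all three keys rather than just _key3_) amounts to a poll-until-true loop over every key before the assertions run. A sketch of that shape, using stand-in names rather than the test's actual until() helper or Kafka's test utilities:

```java
import java.util.List;
import java.util.function.Predicate;

// Minimal sketch of the proposed fix: poll until a condition holds for
// *every* key before asserting on the stores, instead of blocking on only
// the last-produced key. The predicate is a stand-in for "all records for
// this key have been processed".
public class WaitForAllKeys {

    // Spin (with a deadline) until the predicate holds for each key in turn.
    // Returns false if the deadline passes before all keys are ready.
    public static <K> boolean untilAllProcessed(List<K> keys,
                                                Predicate<K> fullyProcessed,
                                                long timeoutMs) {
        // Note: a real implementation should guard this addition against
        // overflow for very large timeouts.
        long deadline = System.currentTimeMillis() + timeoutMs;
        for (K key : keys) {
            while (!fullyProcessed.test(key)) {
                if (System.currentTimeMillis() > deadline) {
                    return false;
                }
                Thread.onSpinWait(); // a real test would sleep briefly instead
            }
        }
        return true;
    }
}
```

Because the loop only returns true once every key's condition has held at least once, the new thread is guaranteed to have started processing its task (including the one owning _key2_) before the IQ assertions execute.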
[jira] [Created] (KAFKA-13128) Flaky Test StoreQueryIntegrationTest.shouldQueryStoresAfterAddingAndRemovingStreamThread
A. Sophie Blee-Goldman created KAFKA-13128: -- Summary: Flaky Test StoreQueryIntegrationTest.shouldQueryStoresAfterAddingAndRemovingStreamThread Key: KAFKA-13128 URL: https://issues.apache.org/jira/browse/KAFKA-13128 Project: Kafka Issue Type: Bug Components: streams Affects Versions: 3.1.0 Reporter: A. Sophie Blee-Goldman h3. Stacktrace java.lang.AssertionError: Expected: is not null but: was null at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20) at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6) at org.apache.kafka.streams.integration.StoreQueryIntegrationTest.lambda$shouldQueryStoresAfterAddingAndRemovingStreamThread$19(StoreQueryIntegrationTest.java:461) at org.apache.kafka.streams.integration.StoreQueryIntegrationTest.until(StoreQueryIntegrationTest.java:506) at org.apache.kafka.streams.integration.StoreQueryIntegrationTest.shouldQueryStoresAfterAddingAndRemovingStreamThread(StoreQueryIntegrationTest.java:455) https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-11085/5/testReport/org.apache.kafka.streams.integration/StoreQueryIntegrationTest/Build___JDK_16_and_Scala_2_13___shouldQueryStoresAfterAddingAndRemovingStreamThread_2/ -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (KAFKA-7497) Kafka Streams should support self-join on streams
[ https://issues.apache.org/jira/browse/KAFKA-7497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman reassigned KAFKA-7497: - Assignee: (was: A. Sophie Blee-Goldman) > Kafka Streams should support self-join on streams > - > > Key: KAFKA-7497 > URL: https://issues.apache.org/jira/browse/KAFKA-7497 > Project: Kafka > Issue Type: New Feature > Components: streams >Reporter: Robin Moffatt >Priority: Major > Labels: needs-kip > > There are valid reasons to want to join a stream to itself, but Kafka Streams > does not currently support this ({{Invalid topology: Topic foo has already > been registered by another source.}}). To perform the join requires creating > a second stream as a clone of the first, and then doing a join between the > two. This is a clunky workaround and results in unnecessary duplication of > data. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-13126) Overflow in joinGroupTimeoutMs when max.poll.interval.ms is MAX_VALUE leads to missing rebalances
[ https://issues.apache.org/jira/browse/KAFKA-13126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman updated KAFKA-13126: --- Description: In older versions of Kafka Streams, the {{max.poll.interval.ms}} config was overridden by default to {{Integer.MAX_VALUE}}. Even after we removed this override, users of both the plain consumer client and kafka streams still set the poll interval to MAX_VALUE somewhat often. Unfortunately, this causes an overflow when computing the {{joinGroupTimeoutMs}} and results in it being set to the {{request.timeout.ms}} instead, which is much lower. This can easily make consumers drop out of the group, since they must rejoin now within 30s (by default) but have no obligation to almost ever call poll() given the high {{max.poll.interval.ms}} – basically they will only do so after processing the last record from the previously polled batch. So in heavy processing cases, where each record takes a long time to process, or when using a very large {{max.poll.records}}, it can be difficult to make any progress at all before dropping out and needing to rejoin. And of course, the rebalance that is kicked off upon this member rejoining can result in many of the other members in the group dropping out as well, leading to an endless cycle of missed rebalances. We just need to check for overflow and fix it to {{Integer.MAX_VALUE}} when it occurs. The workaround until then is of course to just set the {{max.poll.interval.ms}} to MAX_VALUE - 5000 (5s is the JOIN_GROUP_TIMEOUT_LAPSE) was: In older versions of Kafka Streams, the {{max.poll.interval.ms}} config was overridden by default to {{Integer.MAX_VALUE}}. Even after we removed this override, users of both the plain consumer client and kafka streams still set the poll interval to MAX_VALUE somewhat often. 
Unfortunately, this causes an overflow when computing the {{joinGroupTimeoutMs}} and results in it being set to the {{request.timeout.ms}} instead, which is much lower. This can easily make consumers drop out of the group, since they must rejoin now within 30s (by default) but have no obligation to almost ever call poll() given the high {{max.poll.interval.ms}} – basically they will only do so after processing the last record from the previously polled batch. So in heavy processing cases, where each record takes a long time to process, or when using a very large {{max.poll.records}}, it can be difficult to make any progress at all before dropping out and needing to rejoin. And of course, the rebalance that is kicked off upon this member rejoining can result in many of the other members in the group dropping out as well, leading to an endless cycle of missed rebalances. We just need to check for overflow and fix it to {{Integer.MAX_VALUE}} when it occurs. > Overflow in joinGroupTimeoutMs when max.poll.interval.ms is MAX_VALUE leads > to missing rebalances > - > > Key: KAFKA-13126 > URL: https://issues.apache.org/jira/browse/KAFKA-13126 > Project: Kafka > Issue Type: Bug > Components: consumer >Reporter: A. Sophie Blee-Goldman >Assignee: A. Sophie Blee-Goldman >Priority: Major > Fix For: 3.1.0 > > > In older versions of Kafka Streams, the {{max.poll.interval.ms}} config was > overridden by default to {{Integer.MAX_VALUE}}. Even after we removed this > override, users of both the plain consumer client and kafka streams still set > the poll interval to MAX_VALUE somewhat often. Unfortunately, this causes an > overflow when computing the {{joinGroupTimeoutMs}} and results in it being > set to the {{request.timeout.ms}} instead, which is much lower. 
> This can easily make consumers drop out of the group, since they must rejoin > now within 30s (by default) but have no obligation to almost ever call poll() > given the high {{max.poll.interval.ms}} – basically they will only do so > after processing the last record from the previously polled batch. So in > heavy processing cases, where each record takes a long time to process, or > when using a very large {{max.poll.records}}, it can be difficult to make > any progress at all before dropping out and needing to rejoin. And of course, > the rebalance that is kicked off upon this member rejoining can result in > many of the other members in the group dropping out as well, leading to an > endless cycle of missed rebalances. > We just need to check for overflow and fix it to {{Integer.MAX_VALUE}} when > it occurs. The workaround until then is of course to just set the > {{max.poll.interval.ms}} to MAX_VALUE - 5000 (5s is the > JOIN_GROUP_TIMEOUT_LAPSE) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-13126) Overflow in joinGroupTimeoutMs when max.poll.interval.ms is MAX_VALUE leads to missing rebalances
[ https://issues.apache.org/jira/browse/KAFKA-13126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman updated KAFKA-13126: --- Description: In older versions of Kafka Streams, the {{max.poll.interval.ms}} config was overridden by default to {{Integer.MAX_VALUE}}. Even after we removed this override, users of both the plain consumer client and kafka streams still set the poll interval to MAX_VALUE somewhat often. Unfortunately, this causes an overflow when computing the {{joinGroupTimeoutMs}} and results in it being set to the {{request.timeout.ms}} instead, which is much lower. This can easily make consumers drop out of the group, since they must rejoin now within 30s (by default) but have no obligation to almost ever call poll() given the high {{max.poll.interval.ms}} – basically they will only do so after processing the last record from the previously polled batch. So in heavy processing cases, where each record takes a long time to process, or when using a very large {{max.poll.records}}, it can be difficult to make any progress at all before dropping out and needing to rejoin. And of course, the rebalance that is kicked off upon this member rejoining can result in many of the other members in the group dropping out as well, leading to an endless cycle of missed rebalances. We just need to check for overflow and fix it to {{Integer.MAX_VALUE}} when it occurs. was: In older versions of Kafka Streams, the {{max.poll.interval.ms}} config was overridden by default to {{Integer.MAX_VALUE}}. Even after we removed this override, users of both the plain consumer client and kafka streams still set the poll interval to MAX_VALUE somewhat often. Unfortunately, this causes an overflow when computing the {{joinGroupTimeoutMs}} and results in it being set to the {{request.timeout.ms}} instead, which is much lower. 
This can easily make consumers drop out of the group, since they must rejoin now within 30s (by default) yet have no obligation to almost ever call poll() given the high {{max.poll.interval.ms}}. We just need to check for overflow and fix it to {{Integer.MAX_VALUE}} when it occurs. > Overflow in joinGroupTimeoutMs when max.poll.interval.ms is MAX_VALUE leads > to missing rebalances > - > > Key: KAFKA-13126 > URL: https://issues.apache.org/jira/browse/KAFKA-13126 > Project: Kafka > Issue Type: Bug > Components: consumer >Reporter: A. Sophie Blee-Goldman >Assignee: A. Sophie Blee-Goldman >Priority: Major > Fix For: 3.1.0 > > > In older versions of Kafka Streams, the {{max.poll.interval.ms}} config was > overridden by default to {{Integer.MAX_VALUE}}. Even after we removed this > override, users of both the plain consumer client and kafka streams still set > the poll interval to MAX_VALUE somewhat often. Unfortunately, this causes an > overflow when computing the {{joinGroupTimeoutMs}} and results in it being > set to the {{request.timeout.ms}} instead, which is much lower. > This can easily make consumers drop out of the group, since they must rejoin > now within 30s (by default) but have no obligation to almost ever call poll() > given the high {{max.poll.interval.ms}} – basically they will only do so > after processing the last record from the previously polled batch. So in > heavy processing cases, where each record takes a long time to process, or > when using a very large {{max.poll.records}}, it can be difficult to make > any progress at all before dropping out and needing to rejoin. And of course, > the rebalance that is kicked off upon this member rejoining can result in > many of the other members in the group dropping out as well, leading to an > endless cycle of missed rebalances. > We just need to check for overflow and fix it to {{Integer.MAX_VALUE}} when > it occurs. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-13126) Overflow in joinGroupTimeoutMs when max.poll.interval.ms is MAX_VALUE leads to missing rebalances
A. Sophie Blee-Goldman created KAFKA-13126: -- Summary: Overflow in joinGroupTimeoutMs when max.poll.interval.ms is MAX_VALUE leads to missing rebalances Key: KAFKA-13126 URL: https://issues.apache.org/jira/browse/KAFKA-13126 Project: Kafka Issue Type: Bug Components: consumer Reporter: A. Sophie Blee-Goldman Assignee: A. Sophie Blee-Goldman Fix For: 3.1.0 In older versions of Kafka Streams, the {{max.poll.interval.ms}} config was overridden by default to {{Integer.MAX_VALUE}}. Even after we removed this override, users of both the plain consumer client and kafka streams still set the poll interval to MAX_VALUE somewhat often. Unfortunately, this causes an overflow when computing the {{joinGroupTimeoutMs}} and results in it being set to the {{request.timeout.ms}} instead, which is much lower. This can easily make consumers drop out of the group, since they must rejoin now within 30s (by default) yet have no obligation to almost ever call poll() given the high {{max.poll.interval.ms}}. We just need to check for overflow and fix it to {{Integer.MAX_VALUE}} when it occurs. -- This message was sent by Atlassian Jira (v8.3.4#803005)
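The overflow described in this ticket can be reproduced and clamped in a few lines. The 5000 ms grace period below matches the JOIN_GROUP_TIMEOUT_LAPSE mentioned in the ticket; the widen-then-clamp via Math.min is a sketch of the proposed fix, not the actual consumer code:

```java
// Sketch of the KAFKA-13126 overflow: joinGroupTimeoutMs is derived by
// adding a small grace period to max.poll.interval.ms, which wraps to a
// large negative int when the interval is Integer.MAX_VALUE (causing the
// fallback to the much lower request.timeout.ms).
public class JoinGroupTimeout {

    static final int JOIN_GROUP_TIMEOUT_LAPSE_MS = 5000;

    // Buggy: int addition silently overflows negative.
    public static int buggyJoinGroupTimeoutMs(int maxPollIntervalMs) {
        return maxPollIntervalMs + JOIN_GROUP_TIMEOUT_LAPSE_MS;
    }

    // Fixed: widen to long before adding, then clamp to Integer.MAX_VALUE.
    public static int fixedJoinGroupTimeoutMs(int maxPollIntervalMs) {
        return (int) Math.min((long) maxPollIntervalMs + JOIN_GROUP_TIMEOUT_LAPSE_MS,
                              Integer.MAX_VALUE);
    }
}
```

For any sane interval the two agree (e.g. a 5-minute interval yields 305000 ms either way); only at the extreme does the clamp matter, which is also why the workaround of setting max.poll.interval.ms to MAX_VALUE - 5000 avoids the bug.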
[jira] [Commented] (KAFKA-13121) Flaky Test TopicBasedRemoteLogMetadataManagerTest.testNewPartitionUpdates()
[ https://issues.apache.org/jira/browse/KAFKA-13121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17385093#comment-17385093 ] A. Sophie Blee-Goldman commented on KAFKA-13121: Hey [~jolshan], thought you might have some context on this since it (sort of) had to do with topic ids. The full logs aren't much but I'll upload them in case that helps: [^TopicBasedRemoteLogMetadataManagerTest.testNewPartitionUpdates.rtf] > Flaky Test TopicBasedRemoteLogMetadataManagerTest.testNewPartitionUpdates() > --- > > Key: KAFKA-13121 > URL: https://issues.apache.org/jira/browse/KAFKA-13121 > Project: Kafka > Issue Type: Bug > Components: log >Reporter: A. Sophie Blee-Goldman >Priority: Major > Attachments: > TopicBasedRemoteLogMetadataManagerTest.testNewPartitionUpdates.rtf > > > h4. Stack Trace > {code:java} > org.apache.kafka.server.log.remote.storage.RemoteResourceNotFoundException: > No resource found for partition: > TopicIdPartition{topicId=2B9rDu44TE6c8pLG8A0RAg, topicPartition=new-leader-0} > at > org.apache.kafka.server.log.remote.metadata.storage.RemotePartitionMetadataStore.getRemoteLogMetadataCache(RemotePartitionMetadataStore.java:112) > > at > org.apache.kafka.server.log.remote.metadata.storage.RemotePartitionMetadataStore.listRemoteLogSegments(RemotePartitionMetadataStore.java:98) > at > org.apache.kafka.server.log.remote.metadata.storage.TopicBasedRemoteLogMetadataManager.listRemoteLogSegments(TopicBasedRemoteLogMetadataManager.java:212) > at > org.apache.kafka.server.log.remote.metadata.storage.TopicBasedRemoteLogMetadataManagerTest.testNewPartitionUpdates(TopicBasedRemoteLogMetadataManagerTest.java:99){code} > https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-10921/11/testReport/ -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-13121) Flaky Test TopicBasedRemoteLogMetadataManagerTest.testNewPartitionUpdates()
[ https://issues.apache.org/jira/browse/KAFKA-13121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman updated KAFKA-13121: --- Attachment: TopicBasedRemoteLogMetadataManagerTest.testNewPartitionUpdates.rtf > Flaky Test TopicBasedRemoteLogMetadataManagerTest.testNewPartitionUpdates() > --- > > Key: KAFKA-13121 > URL: https://issues.apache.org/jira/browse/KAFKA-13121 > Project: Kafka > Issue Type: Bug > Components: log >Reporter: A. Sophie Blee-Goldman >Priority: Major > Attachments: > TopicBasedRemoteLogMetadataManagerTest.testNewPartitionUpdates.rtf > > > h4. Stack Trace > {code:java} > org.apache.kafka.server.log.remote.storage.RemoteResourceNotFoundException: > No resource found for partition: > TopicIdPartition{topicId=2B9rDu44TE6c8pLG8A0RAg, topicPartition=new-leader-0} > at > org.apache.kafka.server.log.remote.metadata.storage.RemotePartitionMetadataStore.getRemoteLogMetadataCache(RemotePartitionMetadataStore.java:112) > > at > org.apache.kafka.server.log.remote.metadata.storage.RemotePartitionMetadataStore.listRemoteLogSegments(RemotePartitionMetadataStore.java:98) > at > org.apache.kafka.server.log.remote.metadata.storage.TopicBasedRemoteLogMetadataManager.listRemoteLogSegments(TopicBasedRemoteLogMetadataManager.java:212) > at > org.apache.kafka.server.log.remote.metadata.storage.TopicBasedRemoteLogMetadataManagerTest.testNewPartitionUpdates(TopicBasedRemoteLogMetadataManagerTest.java:99){code} > https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-10921/11/testReport/ -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-13121) Flaky Test TopicBasedRemoteLogMetadataManagerTest.testNewPartitionUpdates()
A. Sophie Blee-Goldman created KAFKA-13121: -- Summary: Flaky Test TopicBasedRemoteLogMetadataManagerTest.testNewPartitionUpdates() Key: KAFKA-13121 URL: https://issues.apache.org/jira/browse/KAFKA-13121 Project: Kafka Issue Type: Bug Components: log Reporter: A. Sophie Blee-Goldman h4. Stack Trace {code:java} org.apache.kafka.server.log.remote.storage.RemoteResourceNotFoundException: No resource found for partition: TopicIdPartition{topicId=2B9rDu44TE6c8pLG8A0RAg, topicPartition=new-leader-0} at org.apache.kafka.server.log.remote.metadata.storage.RemotePartitionMetadataStore.getRemoteLogMetadataCache(RemotePartitionMetadataStore.java:112) at org.apache.kafka.server.log.remote.metadata.storage.RemotePartitionMetadataStore.listRemoteLogSegments(RemotePartitionMetadataStore.java:98) at org.apache.kafka.server.log.remote.metadata.storage.TopicBasedRemoteLogMetadataManager.listRemoteLogSegments(TopicBasedRemoteLogMetadataManager.java:212) at org.apache.kafka.server.log.remote.metadata.storage.TopicBasedRemoteLogMetadataManagerTest.testNewPartitionUpdates(TopicBasedRemoteLogMetadataManagerTest.java:99){code} https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-10921/11/testReport/ -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-12291) Fix Ignored Upgrade Tests in streams_upgrade_test.py: test_upgrade_downgrade_brokers
[ https://issues.apache.org/jira/browse/KAFKA-12291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman updated KAFKA-12291: --- Fix Version/s: (was: 3.0.0) 3.1.0 > Fix Ignored Upgrade Tests in streams_upgrade_test.py: > test_upgrade_downgrade_brokers > > > Key: KAFKA-12291 > URL: https://issues.apache.org/jira/browse/KAFKA-12291 > Project: Kafka > Issue Type: Improvement > Components: streams, system tests >Reporter: Bruno Cadonna >Priority: Blocker > Fix For: 3.1.0 > > > Fix in the oldest branch that ignores the test and cherry-pick forward. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-13010) Flaky test org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation()
[ https://issues.apache.org/jira/browse/KAFKA-13010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman updated KAFKA-13010: --- Fix Version/s: (was: 3.0.1) 3.1.0 > Flaky test > org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation() > --- > > Key: KAFKA-13010 > URL: https://issues.apache.org/jira/browse/KAFKA-13010 > Project: Kafka > Issue Type: Bug > Components: streams >Reporter: Bruno Cadonna >Assignee: Walker Carlson >Priority: Major > Labels: flaky-test > Fix For: 3.1.0 > > Attachments: > TaskMetadataIntegrationTest#shouldReportCorrectEndOffsetInformation.rtf > > > Integration test {{test > org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation()}} > sometimes fails with > {code:java} > java.lang.AssertionError: only one task > at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:26) > at > org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.getTaskMetadata(TaskMetadataIntegrationTest.java:162) > at > org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation(TaskMetadataIntegrationTest.java:117) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-13010) Flaky test org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation()
[ https://issues.apache.org/jira/browse/KAFKA-13010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383504#comment-17383504 ] A. Sophie Blee-Goldman commented on KAFKA-13010: Yeah it's honestly pretty hard to imagine how that could have introduced a bug like this, but the timing definitely is suspicious. Seeing as Walker was able to reproduce it after a "reasonable" number of retries, it should be easy to confirm the suspicion by just running the test again with sufficient repeats on the commit just before KIP-744 was merged. > Flaky test > org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation() > --- > > Key: KAFKA-13010 > URL: https://issues.apache.org/jira/browse/KAFKA-13010 > Project: Kafka > Issue Type: Bug > Components: streams >Reporter: Bruno Cadonna >Assignee: Walker Carlson >Priority: Major > Labels: flaky-test > Attachments: > TaskMetadataIntegrationTest#shouldReportCorrectEndOffsetInformation.rtf > > > Integration test {{test > org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation()}} > sometimes fails with > {code:java} > java.lang.AssertionError: only one task > at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:26) > at > org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.getTaskMetadata(TaskMetadataIntegrationTest.java:162) > at > org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation(TaskMetadataIntegrationTest.java:117) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-13010) Flaky test org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation()
[ https://issues.apache.org/jira/browse/KAFKA-13010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17383445#comment-17383445 ] A. Sophie Blee-Goldman commented on KAFKA-13010: It's worth noting that this test seemed to be pretty stable for roughly a full release cycle, then started failing pretty frequently right after [KIP-744|https://cwiki.apache.org/confluence/display/KAFKA/KIP-744%3A+Migrate+TaskMetadata+and+ThreadMetadata+to+an+interface+with+internal+implementation] / KAFKA-12849 was merged (June 25th). [~wcarlson5] I wonder if the changes to the metadata hierarchy could have introduced a potentially serious bug? > Flaky test > org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation() > --- > > Key: KAFKA-13010 > URL: https://issues.apache.org/jira/browse/KAFKA-13010 > Project: Kafka > Issue Type: Bug > Components: streams >Reporter: Bruno Cadonna >Assignee: Walker Carlson >Priority: Major > Labels: flaky-test > Attachments: > TaskMetadataIntegrationTest#shouldReportCorrectEndOffsetInformation.rtf > > > Integration test {{test > org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation()}} > sometimes fails with > {code:java} > java.lang.AssertionError: only one task > at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:26) > at > org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.getTaskMetadata(TaskMetadataIntegrationTest.java:162) > at > org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation(TaskMetadataIntegrationTest.java:117) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-13096) QueryableStoreProvider is not updated when threads are added/removed/replaced rendering IQ impossible
A. Sophie Blee-Goldman created KAFKA-13096: -- Summary: QueryableStoreProvider is not updated when threads are added/removed/replaced rendering IQ impossible Key: KAFKA-13096 URL: https://issues.apache.org/jira/browse/KAFKA-13096 Project: Kafka Issue Type: Bug Components: streams Affects Versions: 2.8.0 Reporter: A. Sophie Blee-Goldman Fix For: 3.0.0, 2.8.1 The QueryableStoreProviders class is used to route queries to the correct state store on the owning StreamThread, making it a critical piece of IQ. It gets instantiated when you create a new KafkaStreams, and is passed in a list of StreamThreadStateStoreProviders which it then copies and stores. Because it stores only a copy, it only ever contains providers for the StreamThreads that were created during the app's startup, and unfortunately is never updated during an add/remove/replace thread event. This means that IQ can’t get a handle on any stores that belong to a thread that wasn’t in the original set. If the app is starting up new threads through the #addStreamThread API or following a REPLACE_THREAD event, none of the data in any of the stores owned by that new thread will be accessible by IQ. If a user is removing threads through #removeStreamThread, or threads die and get replaced, you can fall into an endless loop of {{InvalidStateStoreException}} from doing a lookup into stores that have been closed since the thread was removed/died. If over time all of the original threads are removed or replaced, then IQ won’t work at all. -- This message was sent by Atlassian Jira (v8.3.4#803005)
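The stale-copy behavior above boils down to a registry that defensively copies a list at construction and never sees later mutations. A minimal sketch (class and method names are illustrative stand-ins, not the actual Streams internals):

```java
import java.util.ArrayList;
import java.util.List;

public class StaleProviderDemo {
    // Stand-in for QueryableStoreProvider: it copies the per-thread
    // provider list once at construction, so add/remove/replace events
    // on the live thread list are invisible to it afterwards.
    static class CopyingRegistry {
        private final List<String> providers;

        CopyingRegistry(List<String> initial) {
            this.providers = new ArrayList<>(initial); // frozen at startup
        }

        int providerCount() {
            return providers.size();
        }
    }

    public static void main(String[] args) {
        List<String> liveThreads = new ArrayList<>();
        liveThreads.add("StreamThread-1");
        CopyingRegistry registry = new CopyingRegistry(liveThreads);

        liveThreads.add("StreamThread-2");    // e.g. KafkaStreams#addStreamThread
        liveThreads.remove("StreamThread-1"); // e.g. thread dies and is replaced

        // The registry still routes queries only to the original thread,
        // whose stores may already be closed.
        System.out.println(registry.providerCount()); // prints 1
    }
}
```

The fix direction implied by the ticket is to have the registry observe the live set of threads (or be updated on every add/remove/replace event) rather than snapshotting it at startup.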
[jira] [Resolved] (KAFKA-12896) Group rebalance loop caused by repeated group leader JoinGroups
[ https://issues.apache.org/jira/browse/KAFKA-12896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman resolved KAFKA-12896. Resolution: Fixed > Group rebalance loop caused by repeated group leader JoinGroups > --- > > Key: KAFKA-12896 > URL: https://issues.apache.org/jira/browse/KAFKA-12896 > Project: Kafka > Issue Type: Bug > Components: clients >Affects Versions: 2.6.0 >Reporter: Lucas Bradstreet >Assignee: David Jacot >Priority: Blocker > Fix For: 3.0.0 > > > We encountered a strange case of a rebalance loop with the > "cooperative-sticky" assignor. The logs show the following for several hours: > > {{Apr 7, 2021 @ 03:58:36.040 [GroupCoordinator 7]: Stabilized group mygroup > generation 19830137 (__consumer_offsets-7)}} > {{Apr 7, 2021 @ 03:58:35.992 [GroupCoordinator 7]: Preparing to rebalance > group mygroup in state PreparingRebalance with old generation 19830136 > (__consumer_offsets-7) (reason: Updating metadata for member > mygroup-1-7ad27e07-3784-4588-97e1-d796a74d4ecc during CompletingRebalance)}} > {{Apr 7, 2021 @ 03:58:35.988 [GroupCoordinator 7]: Stabilized group mygroup > generation 19830136 (__consumer_offsets-7)}} > {{Apr 7, 2021 @ 03:58:35.972 [GroupCoordinator 7]: Preparing to rebalance > group mygroup in state PreparingRebalance with old generation 19830135 > (__consumer_offsets-7) (reason: Updating metadata for member mygroup during > CompletingRebalance)}} > {{Apr 7, 2021 @ 03:58:35.965 [GroupCoordinator 7]: Stabilized group mygroup > generation 19830135 (__consumer_offsets-7)}} > {{Apr 7, 2021 @ 03:58:35.953 [GroupCoordinator 7]: Preparing to rebalance > group mygroup in state PreparingRebalance with old generation 19830134 > (__consumer_offsets-7) (reason: Updating metadata for member > mygroup-7ad27e07-3784-4588-97e1-d796a74d4ecc during CompletingRebalance)}} > {{Apr 7, 2021 @ 03:58:35.941 [GroupCoordinator 7]: Stabilized group mygroup > generation 19830134 (__consumer_offsets-7)}} > {{Apr 7, 2021 @ 
03:58:35.926 [GroupCoordinator 7]: Preparing to rebalance > group mygroup in state PreparingRebalance with old generation 19830133 > (__consumer_offsets-7) (reason: Updating metadata for member mygroup during > CompletingRebalance)}} > Every single time, it was the same member that triggered the JoinGroup and it > was always the leader of the group.{{}} > The leader has the privilege of being able to trigger a rebalance by sending > `JoinGroup` even if its subscription metadata has not changed. But why would > it do so? > It is possible that this is due to the same issue or a similar bug to > https://issues.apache.org/jira/browse/KAFKA-12890. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-12896) Group rebalance loop caused by repeated group leader JoinGroups
[ https://issues.apache.org/jira/browse/KAFKA-12896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381120#comment-17381120 ] A. Sophie Blee-Goldman commented on KAFKA-12896: Yep, that should do it > Group rebalance loop caused by repeated group leader JoinGroups > --- > > Key: KAFKA-12896 > URL: https://issues.apache.org/jira/browse/KAFKA-12896 > Project: Kafka > Issue Type: Bug > Components: clients >Affects Versions: 2.6.0 >Reporter: Lucas Bradstreet >Assignee: David Jacot >Priority: Blocker > Fix For: 3.0.0 > > > We encountered a strange case of a rebalance loop with the > "cooperative-sticky" assignor. The logs show the following for several hours: > > {{Apr 7, 2021 @ 03:58:36.040 [GroupCoordinator 7]: Stabilized group mygroup > generation 19830137 (__consumer_offsets-7)}} > {{Apr 7, 2021 @ 03:58:35.992 [GroupCoordinator 7]: Preparing to rebalance > group mygroup in state PreparingRebalance with old generation 19830136 > (__consumer_offsets-7) (reason: Updating metadata for member > mygroup-1-7ad27e07-3784-4588-97e1-d796a74d4ecc during CompletingRebalance)}} > {{Apr 7, 2021 @ 03:58:35.988 [GroupCoordinator 7]: Stabilized group mygroup > generation 19830136 (__consumer_offsets-7)}} > {{Apr 7, 2021 @ 03:58:35.972 [GroupCoordinator 7]: Preparing to rebalance > group mygroup in state PreparingRebalance with old generation 19830135 > (__consumer_offsets-7) (reason: Updating metadata for member mygroup during > CompletingRebalance)}} > {{Apr 7, 2021 @ 03:58:35.965 [GroupCoordinator 7]: Stabilized group mygroup > generation 19830135 (__consumer_offsets-7)}} > {{Apr 7, 2021 @ 03:58:35.953 [GroupCoordinator 7]: Preparing to rebalance > group mygroup in state PreparingRebalance with old generation 19830134 > (__consumer_offsets-7) (reason: Updating metadata for member > mygroup-7ad27e07-3784-4588-97e1-d796a74d4ecc during CompletingRebalance)}} > {{Apr 7, 2021 @ 03:58:35.941 [GroupCoordinator 7]: Stabilized group mygroup > generation 19830134 
(__consumer_offsets-7)}} > {{Apr 7, 2021 @ 03:58:35.926 [GroupCoordinator 7]: Preparing to rebalance > group mygroup in state PreparingRebalance with old generation 19830133 > (__consumer_offsets-7) (reason: Updating metadata for member mygroup during > CompletingRebalance)}} > Every single time, it was the same member that triggered the JoinGroup and it > was always the leader of the group.{{}} > The leader has the privilege of being able to trigger a rebalance by sending > `JoinGroup` even if its subscription metadata has not changed. But why would > it do so? > It is possible that this is due to the same issue or a similar bug to > https://issues.apache.org/jira/browse/KAFKA-12890. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-13010) Flaky test org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation()
[ https://issues.apache.org/jira/browse/KAFKA-13010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17380939#comment-17380939 ] A. Sophie Blee-Goldman commented on KAFKA-13010: Not the exact same test, but I did manage to reproduce this same "only one task" failure in TaskMetadataIntegrationTest.shouldReportCorrectEndOffsetInformation: {code:java} java.lang.AssertionError: only one task at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:26) at org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.getTaskMetadata(TaskMetadataIntegrationTest.java:163) at org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectEndOffsetInformation(TaskMetadataIntegrationTest.java:144) {code} I saved the logs since they should actually be the full, un-truncated logs – hope this helps: [^TaskMetadataIntegrationTest#shouldReportCorrectEndOffsetInformation.rtf] > Flaky test > org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation() > --- > > Key: KAFKA-13010 > URL: https://issues.apache.org/jira/browse/KAFKA-13010 > Project: Kafka > Issue Type: Bug > Components: streams >Reporter: Bruno Cadonna >Priority: Major > Labels: flaky-test > Attachments: > TaskMetadataIntegrationTest#shouldReportCorrectEndOffsetInformation.rtf > > > Integration test {{test > org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation()}} > sometimes fails with > {code:java} > java.lang.AssertionError: only one task > at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:26) > at > org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.getTaskMetadata(TaskMetadataIntegrationTest.java:162) > at > org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation(TaskMetadataIntegrationTest.java:117) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-13010) Flaky test org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation()
[ https://issues.apache.org/jira/browse/KAFKA-13010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman updated KAFKA-13010: --- Attachment: TaskMetadataIntegrationTest#shouldReportCorrectEndOffsetInformation.rtf > Flaky test > org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation() > --- > > Key: KAFKA-13010 > URL: https://issues.apache.org/jira/browse/KAFKA-13010 > Project: Kafka > Issue Type: Bug > Components: streams >Reporter: Bruno Cadonna >Priority: Major > Labels: flaky-test > Attachments: > TaskMetadataIntegrationTest#shouldReportCorrectEndOffsetInformation.rtf > > > Integration test {{test > org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation()}} > sometimes fails with > {code:java} > java.lang.AssertionError: only one task > at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:26) > at > org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.getTaskMetadata(TaskMetadataIntegrationTest.java:162) > at > org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation(TaskMetadataIntegrationTest.java:117) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-13010) Flaky test org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation()
[ https://issues.apache.org/jira/browse/KAFKA-13010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17380838#comment-17380838 ] A. Sophie Blee-Goldman commented on KAFKA-13010: [~wcarl...@confluent.io] I'm guessing you wrote this test so you have the most context, can you reproduce this locally and take a minute or two to look through the logs and see if anything jumps out at you? > Flaky test > org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation() > --- > > Key: KAFKA-13010 > URL: https://issues.apache.org/jira/browse/KAFKA-13010 > Project: Kafka > Issue Type: Bug > Components: streams >Reporter: Bruno Cadonna >Priority: Major > Labels: flaky-test > > Integration test {{test > org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation()}} > sometimes fails with > {code:java} > java.lang.AssertionError: only one task > at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:26) > at > org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.getTaskMetadata(TaskMetadataIntegrationTest.java:162) > at > org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation(TaskMetadataIntegrationTest.java:117) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-13037) "Thread state is already PENDING_SHUTDOWN" log spam
[ https://issues.apache.org/jira/browse/KAFKA-13037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17380295#comment-17380295 ] A. Sophie Blee-Goldman commented on KAFKA-13037: Thanks for the fix – btw, I added you as a contributor so you should be able to self-assign tickets from now on. > "Thread state is already PENDING_SHUTDOWN" log spam > --- > > Key: KAFKA-13037 > URL: https://issues.apache.org/jira/browse/KAFKA-13037 > Project: Kafka > Issue Type: Bug > Components: streams >Affects Versions: 2.8.0, 2.7.1 >Reporter: John Gray >Assignee: John Gray >Priority: Major > Fix For: 3.0.0, 2.8.1 > > > KAFKA-12462 introduced a > [change|https://github.com/apache/kafka/commit/4fe4cdc4a61cbac8e070a8b5514403235194015b#diff-76f629d0df8bd30b2593cbcf4a2dc80de3167ebf55ef8b5558e6e6285a057496R722] > that increased this "Thread state is already {}" logger to info instead of > debug. We are running into a problem with our streams apps when they hit an > unrecoverable exception that shuts down the streams thread, we get this log > printed about 50,000 times per second per thread. I am guessing it is once > per record we have queued up when the exception happens? We have temporarily > raised the StreamThread logger to WARN instead of INFO to avoid the spam, but > we do miss the other good logs we get on INFO in that class. Could this log > be reverted back to debug? Thank you! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (KAFKA-13037) "Thread state is already PENDING_SHUTDOWN" log spam
[ https://issues.apache.org/jira/browse/KAFKA-13037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman reassigned KAFKA-13037: -- Assignee: John Gray > "Thread state is already PENDING_SHUTDOWN" log spam > --- > > Key: KAFKA-13037 > URL: https://issues.apache.org/jira/browse/KAFKA-13037 > Project: Kafka > Issue Type: Bug > Components: streams >Affects Versions: 2.8.0, 2.7.1 >Reporter: John Gray >Assignee: John Gray >Priority: Major > Fix For: 3.0.0, 2.8.1 > > > KAFKA-12462 introduced a > [change|https://github.com/apache/kafka/commit/4fe4cdc4a61cbac8e070a8b5514403235194015b#diff-76f629d0df8bd30b2593cbcf4a2dc80de3167ebf55ef8b5558e6e6285a057496R722] > that increased this "Thread state is already {}" logger to info instead of > debug. We are running into a problem with our streams apps when they hit an > unrecoverable exception that shuts down the streams thread, we get this log > printed about 50,000 times per second per thread. I am guessing it is once > per record we have queued up when the exception happens? We have temporarily > raised the StreamThread logger to WARN instead of INFO to avoid the spam, but > we do miss the other good logs we get on INFO in that class. Could this log > be reverted back to debug? Thank you! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-13081) Port sticky assignor fixes (KAFKA-12984) back to 2.8
[ https://issues.apache.org/jira/browse/KAFKA-13081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman updated KAFKA-13081: --- Component/s: consumer > Port sticky assignor fixes (KAFKA-12984) back to 2.8 > > > Key: KAFKA-13081 > URL: https://issues.apache.org/jira/browse/KAFKA-13081 > Project: Kafka > Issue Type: Bug > Components: consumer >Reporter: A. Sophie Blee-Goldman >Assignee: Luke Chen >Priority: Blocker > Fix For: 2.8.1 > > > We should make sure that fix #1 and #2 of > [#10985|https://github.com/apache/kafka/pull/10985] make it back to the 2.8 > sticky assignor, since it's pretty much impossible to smoothly cherrypick > that commit from 3.0 to 2.8 due to all the recent improvements and > refactoring in the AbstractStickyAssignor. Either we can just extract and > apply those two fixes to 2.8 directly, or go back and port all the commits > that made this cherrypick difficult over to 2.8 as well. If we do so then > cherrypicking the original commit should be easy. > See [this > comment|https://github.com/apache/kafka/pull/10985#issuecomment-879521196] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-13081) Port sticky assignor fixes (KAFKA-12984) back to 2.8
[ https://issues.apache.org/jira/browse/KAFKA-13081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17380251#comment-17380251 ] A. Sophie Blee-Goldman commented on KAFKA-13081: Personally I would probably advocate for just applying the two fixes to 2.8 rather than porting over the full set of improvements and refactorings. While the improvements are great to have, they're not strictly necessary and relatively complex, so porting them back carries some risk in case of as-yet-undiscovered bugs that were introduced during the large-scale refactoring of this assignor. Given that it's hard to be 100% confident in the correctness of such a complicated algorithm, regardless of how good the test coverage is, I would prefer to just apply the minimal required fixes and leave 2.8 as a "safe" branch. That way we can fall back to it in case of unexpected issues in the 3.0 assignor, and recommend users who want to downgrade to the last stable version with a good "cooperative-sticky"/"sticky" assignor they can continue to use until the new assignor is stabilized. Just my 2 cents > Port sticky assignor fixes (KAFKA-12984) back to 2.8 > > > Key: KAFKA-13081 > URL: https://issues.apache.org/jira/browse/KAFKA-13081 > Project: Kafka > Issue Type: Bug >Reporter: A. Sophie Blee-Goldman >Assignee: Luke Chen >Priority: Blocker > Fix For: 2.8.1 > > > We should make sure that fix #1 and #2 of > [#10985|https://github.com/apache/kafka/pull/10985] make it back to the 2.8 > sticky assignor, since it's pretty much impossible to smoothly cherrypick > that commit from 3.0 to 2.8 due to all the recent improvements and > refactoring in the AbstractStickyAssignor. Either we can just extract and > apply those two fixes to 2.8 directly, or go back and port all the commits > that made this cherrypick difficult over to 2.8 as well. If we do so then > cherrypicking the original commit should be easy. 
> See [this > comment|https://github.com/apache/kafka/pull/10985#issuecomment-879521196] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-13081) Port sticky assignor fixes (KAFKA-12984) back to 2.8
[ https://issues.apache.org/jira/browse/KAFKA-13081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman updated KAFKA-13081: --- Description: We should make sure that fix #1 and #2 of [#10985|https://github.com/apache/kafka/pull/10985] make it back to the 2.8 sticky assignor, since it's pretty much impossible to smoothly cherrypick that commit from 3.0 to 2.8 due to all the recent improvements and refactoring in the AbstractStickyAssignor. Either we can just extract and apply those two fixes to 2.8 directly, or go back and port all the commits that made this cherrypick difficult over to 2.8 as well. If we do so then cherrypicking the original commit should be easy. See [this comment|https://github.com/apache/kafka/pull/10985#issuecomment-879521196] was:We should make sure that fix #1 and #2 of [#10985|https://github.com/apache/kafka/pull/10985] make it back to the 2.8 sticky assignor, since it's pretty much impossible to smoothly cherrypick that commit from 3.0 to 2.8 due to all the recent improvements and refactoring in the AbstractStickyAssignor. Either we can just extract and apply those two fixes to 2.8 directly, or go back and port all the commits that made this cherrypick difficult over to 2.8 as well. If we do so then cherrypicking the original commit should be easy > Port sticky assignor fixes (KAFKA-12984) back to 2.8 > > > Key: KAFKA-13081 > URL: https://issues.apache.org/jira/browse/KAFKA-13081 > Project: Kafka > Issue Type: Bug >Reporter: A. Sophie Blee-Goldman >Priority: Blocker > Fix For: 2.8.1 > > > We should make sure that fix #1 and #2 of > [#10985|https://github.com/apache/kafka/pull/10985] make it back to the 2.8 > sticky assignor, since it's pretty much impossible to smoothly cherrypick > that commit from 3.0 to 2.8 due to all the recent improvements and > refactoring in the AbstractStickyAssignor. 
Either we can just extract and > apply those two fixes to 2.8 directly, or go back and port all the commits > that made this cherrypick difficult over to 2.8 as well. If we do so then > cherrypicking the original commit should be easy. > See [this > comment|https://github.com/apache/kafka/pull/10985#issuecomment-879521196] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-13081) Port sticky assignor fixes (KAFKA-12984) back to 2.8
A. Sophie Blee-Goldman created KAFKA-13081: -- Summary: Port sticky assignor fixes (KAFKA-12984) back to 2.8 Key: KAFKA-13081 URL: https://issues.apache.org/jira/browse/KAFKA-13081 Project: Kafka Issue Type: Bug Reporter: A. Sophie Blee-Goldman Fix For: 2.8.1 We should make sure that fix #1 and #2 of [#10985|https://github.com/apache/kafka/pull/10985] make it back to the 2.8 sticky assignor, since it's pretty much impossible to smoothly cherrypick that commit from 3.0 to 2.8 due to all the recent improvements and refactoring in the AbstractStickyAssignor. Either we can just extract and apply those two fixes to 2.8 directly, or go back and port all the commits that made this cherrypick difficult over to 2.8 as well. If we do so then cherrypicking the original commit should be easy -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-12984) Cooperative sticky assignor can get stuck with invalid SubscriptionState input metadata
[ https://issues.apache.org/jira/browse/KAFKA-12984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman updated KAFKA-12984: --- Fix Version/s: (was: 2.8.1) > Cooperative sticky assignor can get stuck with invalid SubscriptionState > input metadata > --- > > Key: KAFKA-12984 > URL: https://issues.apache.org/jira/browse/KAFKA-12984 > Project: Kafka > Issue Type: Bug > Components: consumer >Reporter: A. Sophie Blee-Goldman >Assignee: A. Sophie Blee-Goldman >Priority: Blocker > Fix For: 3.0.0 > > > Some users have reported seeing their consumer group become stuck in the > CompletingRebalance phase when using the cooperative-sticky assignor. Based > on the request metadata we were able to deduce that multiple consumers were > reporting the same partition(s) in their "ownedPartitions" field of the > consumer protocol. Since this is an invalid state, the input causes the > cooperative-sticky assignor to detect that something is wrong and throw an > IllegalStateException. If the consumer application is set up to simply retry, > this will cause the group to appear to hang in the rebalance state. > The "ownedPartitions" field is encoded based on the ConsumerCoordinator's > SubscriptionState, which was assumed to always be up to date. However there > may be cases where the consumer has dropped out of the group but fails to > clear the SubscriptionState, allowing it to report some partitions as owned > that have since been reassigned to another member. > We should (a) fix the sticky assignment algorithm to resolve cases of > improper input conditions by invalidating the "ownedPartitions" in cases of > double ownership, and (b) shore up the ConsumerCoordinator logic to better > handle rejoining the group and keeping its internal state consistent. See > KAFKA-12983 for more details on (b) -- This message was sent by Atlassian Jira (v8.3.4#803005)
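Fix (a) above, invalidating "ownedPartitions" under double ownership, can be sketched roughly as below. The method and type names are illustrative (partitions are modeled as plain strings), not the actual AbstractStickyAssignor code; the point is detecting conflicting claims and dropping them instead of throwing.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class OwnedPartitionsValidator {
    // Returns the partitions claimed by more than one member. A tolerant
    // assignor can treat these partitions as unowned (dropping the stale
    // claims) rather than throwing IllegalStateException, which would
    // leave a retrying consumer group stuck in CompletingRebalance.
    static Set<String> doublyOwned(Map<String, List<String>> ownedByMember) {
        Map<String, Integer> claims = new HashMap<>();
        for (List<String> owned : ownedByMember.values()) {
            for (String partition : owned) {
                claims.merge(partition, 1, Integer::sum);
            }
        }
        Set<String> conflicted = new HashSet<>();
        claims.forEach((partition, count) -> {
            if (count > 1) {
                conflicted.add(partition);
            }
        });
        return conflicted;
    }
}
```

Claims on any partition returned here would be ignored during assignment, so a member whose SubscriptionState lagged behind a reassignment cannot wedge the whole group.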
[jira] [Comment Edited] (KAFKA-13010) Flaky test org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation()
[ https://issues.apache.org/jira/browse/KAFKA-13010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17380212#comment-17380212 ] A. Sophie Blee-Goldman edited comment on KAFKA-13010 at 7/13/21, 10:39 PM: --- Failed again [https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-10985/6/testReport/junit/org.apache.kafka.streams.integration/TaskMetadataIntegrationTest/Build___JDK_11_and_Scala_2_13___shouldReportCorrectCommittedOffsetInformation/] FWIW I didn't see the above log message about that subscribed topic not being assigned to any members. The logs were truncated so it's possible that it actually was there, but I don't think that's the case since AFAICT the truncated logs are mostly from kafka/zookeeper. The relevant logs from the rebalance seem to be present was (Author: ableegoldman): Failed again https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-10985/6/testReport/junit/org.apache.kafka.streams.integration/TaskMetadataIntegrationTest/Build___JDK_11_and_Scala_2_13___shouldReportCorrectCommittedOffsetInformation/ > Flaky test > org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation() > --- > > Key: KAFKA-13010 > URL: https://issues.apache.org/jira/browse/KAFKA-13010 > Project: Kafka > Issue Type: Bug > Components: streams >Reporter: Bruno Cadonna >Priority: Major > Labels: flaky-test > > Integration test {{test > org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation()}} > sometimes fails with > {code:java} > java.lang.AssertionError: only one task > at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:26) > at > org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.getTaskMetadata(TaskMetadataIntegrationTest.java:162) > at > org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation(TaskMetadataIntegrationTest.java:117) > {code} -- This message 
was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-13010) Flaky test org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation()
[ https://issues.apache.org/jira/browse/KAFKA-13010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17380212#comment-17380212 ] A. Sophie Blee-Goldman commented on KAFKA-13010: Failed again https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-10985/6/testReport/junit/org.apache.kafka.streams.integration/TaskMetadataIntegrationTest/Build___JDK_11_and_Scala_2_13___shouldReportCorrectCommittedOffsetInformation/ > Flaky test > org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation() > --- > > Key: KAFKA-13010 > URL: https://issues.apache.org/jira/browse/KAFKA-13010 > Project: Kafka > Issue Type: Bug > Components: streams >Reporter: Bruno Cadonna >Priority: Major > Labels: flaky-test > > Integration test {{test > org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation()}} > sometimes fails with > {code:java} > java.lang.AssertionError: only one task > at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:26) > at > org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.getTaskMetadata(TaskMetadataIntegrationTest.java:162) > at > org.apache.kafka.streams.integration.TaskMetadataIntegrationTest.shouldReportCorrectCommittedOffsetInformation(TaskMetadataIntegrationTest.java:117) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (KAFKA-13075) Consolidate RocksDBStoreTest and RocksDBKeyValueStoreTest
[ https://issues.apache.org/jira/browse/KAFKA-13075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman resolved KAFKA-13075. Fix Version/s: 3.1.0 Resolution: Fixed > Consolidate RocksDBStoreTest and RocksDBKeyValueStoreTest > - > > Key: KAFKA-13075 > URL: https://issues.apache.org/jira/browse/KAFKA-13075 > Project: Kafka > Issue Type: Improvement > Components: streams >Reporter: A. Sophie Blee-Goldman >Assignee: Chun-Hao Tang >Priority: Major > Labels: newbie, newbie++ > Fix For: 3.1.0 > > > Looks like we have two different test classes covering pretty much the same > thing: RocksDBStore. It seems like RocksDBKeyValueStoreTest was the original > test class for RocksDBStore, but someone later added RocksDBStoreTest, most > likely because they didn't notice the RocksDBKeyValueStoreTest which didn't > follow the usual naming scheme for test classes. > We should consolidate these two into a single file, ideally retaining the > RocksDBStoreTest name since that conforms to the test naming pattern used > throughout Streams (and so this same thing doesn't happen again). It should > also extend AbstractKeyValueStoreTest like the RocksDBKeyValueStoreTest > currently does so we continue to get the benefit of all the tests in there as > well -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-13075) Consolidate RocksDBStoreTest and RocksDBKeyValueStoreTest
[ https://issues.apache.org/jira/browse/KAFKA-13075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman updated KAFKA-13075: --- Summary: Consolidate RocksDBStoreTest and RocksDBKeyValueStoreTest (was: Consolidate RocksDBStore and RocksDBKeyValueStoreTest) > Consolidate RocksDBStoreTest and RocksDBKeyValueStoreTest > - > > Key: KAFKA-13075 > URL: https://issues.apache.org/jira/browse/KAFKA-13075 > Project: Kafka > Issue Type: Improvement > Components: streams >Reporter: A. Sophie Blee-Goldman >Assignee: Chun-Hao Tang >Priority: Major > Labels: newbie, newbie++ > > Looks like we have two different test classes covering pretty much the same > thing: RocksDBStore. It seems like RocksDBKeyValueStoreTest was the original > test class for RocksDBStore, but someone later added RocksDBStoreTest, most > likely because they didn't notice the RocksDBKeyValueStoreTest which didn't > follow the usual naming scheme for test classes. > We should consolidate these two into a single file, ideally retaining the > RocksDBStoreTest name since that conforms to the test naming pattern used > throughout Streams (and so this same thing doesn't happen again). It should > also extend AbstractKeyValueStoreTest like the RocksDBKeyValueStoreTest > currently does so we continue to get the benefit of all the tests in there as > well -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-8529) Flakey test ConsumerBounceTest#testCloseDuringRebalance
[ https://issues.apache.org/jira/browse/KAFKA-8529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17380138#comment-17380138 ] A. Sophie Blee-Goldman commented on KAFKA-8529: --- This has been failing _very_ frequently as of late, for example [https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-11009/4/tests]. I took a brief look at the logs and it appears to be an issue with the connection, or possibly something to do with the incremental fetch (not sure if this warning is a red herring or not): {code:java} [2021-07-13 12:31:46,774] WARN [ReplicaFetcher replicaId=1, leaderId=2, fetcherId=0] Error in response for fetch request (type=FetchRequest, replicaId=1, maxWait=500, minBytes=1, maxBytes=10485760, fetchData={closetest-1=PartitionData(fetchOffset=0, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[0], lastFetchedEpoch=Optional.empty), closetest-7=PartitionData(fetchOffset=0, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[0], lastFetchedEpoch=Optional.empty), closetest-4=PartitionData(fetchOffset=0, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[0], lastFetchedEpoch=Optional.empty)}, isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=1829936913, epoch=1), rackId=) (kafka.server.ReplicaFetcherThread:72)org.apache.kafka.common.errors.FetchSessionTopicIdException: The fetch session encountered inconsistent topic ID usage[2021-07-13 12:31:47,002] WARN [ReplicaFetcher replicaId=1, leaderId=0, fetcherId=0] Error in response for fetch request (type=FetchRequest, replicaId=1, maxWait=500, minBytes=1, maxBytes=10485760, fetchData={}, isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=2081224068, epoch=1), rackId=) (kafka.server.ReplicaFetcherThread:72)org.apache.kafka.common.errors.FetchSessionTopicIdException: The fetch session encountered inconsistent topic ID usage[2021-07-13 12:31:48,545] WARN [Consumer clientId=ConsumerTestConsumer, 
groupId=group1] Close timed out with 3 pending requests to coordinator, terminating client connections (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator:1024)[2021-07-13 12:31:48,549] WARN [ReplicaFetcher replicaId=0, leaderId=2, fetcherId=0] Error in response for fetch request (type=FetchRequest, replicaId=0, maxWait=500, minBytes=1, maxBytes=10485760, fetchData={closetest-1=PartitionData(fetchOffset=0, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[0], lastFetchedEpoch=Optional.empty), closetest-7=PartitionData(fetchOffset=0, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[0], lastFetchedEpoch=Optional.empty), topic-1=PartitionData(fetchOffset=0, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[0], lastFetchedEpoch=Optional.empty), closetest-4=PartitionData(fetchOffset=0, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[0], lastFetchedEpoch=Optional.empty)}, isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=1532743228, epoch=INITIAL), rackId=) (kafka.server.ReplicaFetcherThread:72)java.io.IOException: Connection to 2 was disconnected before the response was read at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:100) at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:109) at kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:219) at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:313) at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3(AbstractFetcherThread.scala:137) at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3$adapted(AbstractFetcherThread.scala:136) at scala.Option.foreach(Option.scala:437) at kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:136) at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:119) at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:96) 
{code} Wondering if we should bump the priority on this? Can someone more familiar with this test maybe take a few minutes to glance over these logs and chime in on whether this is potentially concerning or not? cc [~hachikuji] [~cmccabe] > Flakey test ConsumerBounceTest#testCloseDuringRebalance > --- > > Key: KAFKA-8529 > URL: https://issues.apache.org/jira/browse/KAFKA-8529 > Project: Kafka > Issue Type: Bug > Components: consumer >Reporter: Boyang Chen >Priority: Major > > [https://builds.apache.org/job/kafka-pr-jdk11-scala2.12/5450/consoleFull] > > *16:16:10* kafka.api.ConsumerBounceTest > testCloseDuringRebalance > STARTED*16:16:22* kafka.api.ConsumerBounceTest.testCloseDuringRebalance > failed,
[jira] [Comment Edited] (KAFKA-8529) Flakey test ConsumerBounceTest#testCloseDuringRebalance
[ https://issues.apache.org/jira/browse/KAFKA-8529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17380138#comment-17380138 ] A. Sophie Blee-Goldman edited comment on KAFKA-8529 at 7/13/21, 7:57 PM: - This has been failing _very_ frequently as of late, for example [https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-11009/4/tests]. I took a brief look at the logs and it appears to be an issue with the connection, or possibly something to do with the incremental fetch (not sure if the *FetchSessionTopicIdException* is a red herring or not): {code:java} [2021-07-13 12:31:46,774] WARN [ReplicaFetcher replicaId=1, leaderId=2, fetcherId=0] Error in response for fetch request (type=FetchRequest, replicaId=1, maxWait=500, minBytes=1, maxBytes=10485760, fetchData={closetest-1=PartitionData(fetchOffset=0, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[0], lastFetchedEpoch=Optional.empty), closetest-7=PartitionData(fetchOffset=0, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[0], lastFetchedEpoch=Optional.empty), closetest-4=PartitionData(fetchOffset=0, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[0], lastFetchedEpoch=Optional.empty)}, isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=1829936913, epoch=1), rackId=) (kafka.server.ReplicaFetcherThread:72)org.apache.kafka.common.errors.FetchSessionTopicIdException: The fetch session encountered inconsistent topic ID usage[2021-07-13 12:31:47,002] WARN [ReplicaFetcher replicaId=1, leaderId=0, fetcherId=0] Error in response for fetch request (type=FetchRequest, replicaId=1, maxWait=500, minBytes=1, maxBytes=10485760, fetchData={}, isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=2081224068, epoch=1), rackId=) (kafka.server.ReplicaFetcherThread:72)org.apache.kafka.common.errors.FetchSessionTopicIdException: The fetch session encountered inconsistent topic ID usage[2021-07-13 12:31:48,545] WARN 
[Consumer clientId=ConsumerTestConsumer, groupId=group1] Close timed out with 3 pending requests to coordinator, terminating client connections (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator:1024)[2021-07-13 12:31:48,549] WARN [ReplicaFetcher replicaId=0, leaderId=2, fetcherId=0] Error in response for fetch request (type=FetchRequest, replicaId=0, maxWait=500, minBytes=1, maxBytes=10485760, fetchData={closetest-1=PartitionData(fetchOffset=0, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[0], lastFetchedEpoch=Optional.empty), closetest-7=PartitionData(fetchOffset=0, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[0], lastFetchedEpoch=Optional.empty), topic-1=PartitionData(fetchOffset=0, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[0], lastFetchedEpoch=Optional.empty), closetest-4=PartitionData(fetchOffset=0, logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[0], lastFetchedEpoch=Optional.empty)}, isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=1532743228, epoch=INITIAL), rackId=) (kafka.server.ReplicaFetcherThread:72)java.io.IOException: Connection to 2 was disconnected before the response was read at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:100) at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:109) at kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:219) at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:313) at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3(AbstractFetcherThread.scala:137) at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3$adapted(AbstractFetcherThread.scala:136) at scala.Option.foreach(Option.scala:437) at kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:136) at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:119) at 
kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:96) {code} Wondering if we should bump the priority on this? Can someone more familiar with this test maybe take a few minutes to glance over these logs and chime in on whether this is potentially concerning or not? cc [~hachikuji] [~cmccabe] was (Author: ableegoldman): This has been failing _very_ frequently as of late, for example [https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-11009/4/tests]. I took a brief look at the logs and it appears to be an issue with the connection, or possibly something to do with the incremental fetch (not sure if this warning is a red herring or not): {code:java} [2021-07-13 12:31:46,774] WARN [ReplicaFetcher replicaId=1, leaderId=2, fetcherId=0] Error in response for fetch request (type=FetchRequest,
[jira] [Created] (KAFKA-13075) Consolidate RocksDBStore and RocksDBKeyValueStoreTest
A. Sophie Blee-Goldman created KAFKA-13075: -- Summary: Consolidate RocksDBStore and RocksDBKeyValueStoreTest Key: KAFKA-13075 URL: https://issues.apache.org/jira/browse/KAFKA-13075 Project: Kafka Issue Type: Improvement Components: streams Reporter: A. Sophie Blee-Goldman Looks like we have two different test classes covering pretty much the same thing: RocksDBStore. It seems like RocksDBKeyValueStoreTest was the original test class for RocksDBStore, but someone later added RocksDBStoreTest, most likely because they didn't notice the RocksDBKeyValueStoreTest which didn't follow the usual naming scheme for test classes. We should consolidate these two into a single file, ideally retaining the RocksDBStoreTest name since that conforms to the test naming pattern used throughout Streams (and so this same thing doesn't happen again). It should also extend AbstractKeyValueStoreTest like the RocksDBKeyValueStoreTest currently does so we continue to get the benefit of all the tests in there as well -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-12925) prefixScan missing from intermediate interfaces
[ https://issues.apache.org/jira/browse/KAFKA-12925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman updated KAFKA-12925: --- Priority: Critical (was: Major) > prefixScan missing from intermediate interfaces > --- > > Key: KAFKA-12925 > URL: https://issues.apache.org/jira/browse/KAFKA-12925 > Project: Kafka > Issue Type: Bug > Components: streams >Affects Versions: 2.8.0 >Reporter: Michael Viamari >Assignee: Sagar Rao >Priority: Critical > Fix For: 3.0.0, 2.8.1 > > > [KIP-614|https://cwiki.apache.org/confluence/display/KAFKA/KIP-614%3A+Add+Prefix+Scan+support+for+State+Stores] > and [KAFKA-10648|https://issues.apache.org/jira/browse/KAFKA-10648] > introduced support for {{prefixScan}} to StateStores. > It seems that many of the intermediate {{StateStore}} interfaces are missing > a definition for {{prefixScan}}, and as such it is not accessible in all cases. > For example, when accessing the state stores through the processor context, > the {{KeyValueStoreReadWriteDecorator}} and associated interfaces do not > define {{prefixScan}} and it falls back to the default implementation in > {{KeyValueStore}}, which throws {{UnsupportedOperationException}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
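The failure mode described in KAFKA-12925 is the classic default-method trap, sketched below with made-up names (the real interfaces are KeyValueStore and the internal KeyValueStoreReadWriteDecorator): a decorator forwards only the methods it declares, so a call to the newer method falls through to the interface default, which throws, even though the wrapped store implements it.

```java
// Minimal sketch (illustrative names only) of why the bug occurs: an interface
// gains a default method that throws, and a decorator written before that
// method existed never overrides it, so calls fall through to the default.
public class DefaultMethodTrap {

    interface Store {
        String get(String key);

        // Added later with a throwing default, like KeyValueStore#prefixScan
        default String prefixScan(String prefix) {
            throw new UnsupportedOperationException("prefixScan not supported");
        }
    }

    // Decorator predating prefixScan: forwards get() only.
    static class ReadWriteDecorator implements Store {
        private final Store inner;
        ReadWriteDecorator(Store inner) { this.inner = inner; }

        @Override
        public String get(String key) { return inner.get(key); }
        // No prefixScan override -> inherits the throwing default, even though
        // 'inner' supports it. The fix is to forward the call:
        // @Override public String prefixScan(String p) { return inner.prefixScan(p); }
    }

    static class RealStore implements Store {
        @Override public String get(String key) { return "v:" + key; }
        @Override public String prefixScan(String prefix) { return "scan:" + prefix; }
    }
}
```

Calling prefixScan through the decorator throws UnsupportedOperationException until the decorator explicitly forwards it, which mirrors the fix needed in the intermediate Streams interfaces.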
[jira] [Commented] (KAFKA-13008) Stream will stop processing data for a long time while waiting for the partition lag
[ https://issues.apache.org/jira/browse/KAFKA-13008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17379505#comment-17379505 ] A. Sophie Blee-Goldman commented on KAFKA-13008: Thanks Guozhang, +1 on that approach (though I'll let Colin confirm whether #1 makes sense or not). We'll definitely need a Streams/client side fix if the 'real' fix is going to be on the broker side. My only question is whether this is something that might trip up other plain consumer client users in addition to Streams, and if so, whether there's anything we could do in the consumer client itself. But AFAIK it's only Streams that really relies on this metadata in this critical way, so I'm happy with the Streams-side fix as well > Stream will stop processing data for a long time while waiting for the > partition lag > > > Key: KAFKA-13008 > URL: https://issues.apache.org/jira/browse/KAFKA-13008 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.0.0 >Reporter: Luke Chen >Priority: Blocker > Fix For: 3.0.0 > > Attachments: image-2021-07-07-11-19-55-630.png > > > In KIP-695, we improved the task idling mechanism by checking partition lag. > It's a good improvement for timestamp sync. But I found it will cause the > stream to stop processing data for a long time while waiting for the > partition metadata. 
> > I've been investigating this case for a while, and figured out that the issue > will happen in the situation below (or a similar one): > # start 2 streams (each with 1 thread) to consume from a topicA (with 3 > partitions: A-0, A-1, A-2) > # After the 2 streams started, the partition assignment is: (I skipped some > other processing-related partitions for simplicity) > stream1-thread1: A-0, A-1 > stream2-thread1: A-2 > # start processing some data; assume now the positions and high watermarks are: > A-0: offset: 2, highWM: 2 > A-1: offset: 2, highWM: 2 > A-2: offset: 2, highWM: 2 > # Now, stream3 joined, so a rebalance is triggered with this assignment: > stream1-thread1: A-0 > stream2-thread1: A-2 > stream3-thread1: A-1 > # Suddenly, stream3 left, so now we rebalance again, with the step 2 > assignment: > stream1-thread1: A-0, *A-1* > stream2-thread1: A-2 > (note: after initialization, the position of A-1 will be: position: null, > highWM: null) > # Now, note that partition A-1 used to be assigned to stream1-thread1, > and now it's back. Also, assume that partition A-1 has slow input (ex: 1 > record per 30 mins) and partition A-0 has fast input (ex: 10K records / > sec). So now stream1-thread1 won't process any data until we get input > from partition A-1 (even if partition A-0 has buffered a lot, and we have > `{{max.task.idle.ms}}` set to 0). > > The reason stream1-thread1 won't process any data is that we can't > get the lag of partition A-1. And why can't we get the lag? It's because > # In KIP-695, we use the consumer's cache to get the partition lag, to avoid a > remote call > # The lag for a partition will be cleared if the assignment in this round > doesn't have this partition. See > [here|https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/consumer/internals/SubscriptionState.java#L272]. 
> So, in the above example, the metadata cache for partition A-1 will be > cleared in step 4, and re-initialized (to null) in step 5 > # In KIP-227, we introduced a fetch session to allow incremental fetch > requests/responses. That is, if the session exists, the client (consumer) will > get an update only when the fetched partitions have updates (ex: new data). > So, in the above case, since partition A-1 has slow input (ex: 1 record per 30 > mins), it won't have an update for the next 30 mins, unless the fetch session > becomes inactive for the (default) 2 mins and is evicted. In either case, the metadata > won't be updated for a while. > > In KIP-695, if we don't get the partition lag, we can't determine the > partition data status to do timestamp sync, so we'll keep waiting and not > process any data. That's why this issue happens. > > *Proposed solution:* > # If we don't get the current lag for a partition, or the current lag > 0, > we start to wait for max.task.idle.ms, and reset the deadline when we get the > partition lag, like what we did previously in KIP-353 > # Introduce a waiting-time config for when there is no partition lag, or the partition lag > keeps > 0 (needs a KIP) > [~vvcephei] [~guozhang] , any suggestions? > > cc [~ableegoldman] [~mjsax] , this is the root cause that in > [https://github.com/apache/kafka/pull/10736,] we discussed and thought > there's a data loss situation. FYI. -- This message was sent by Atlassian Jira (v8.3.4#803005)
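Proposal #1 in the description, waiting for at most max.task.idle.ms when the cached lag is unavailable and resetting the deadline once a lag value arrives, can be sketched roughly as below. All names are illustrative assumptions, not the actual Streams code:

```java
import java.util.OptionalLong;

// Rough sketch of proposal #1 (illustrative names, not real Streams code):
// fall back to a bounded wait when the consumer's cached lag is unavailable
// or stays above zero, instead of blocking indefinitely on missing metadata.
public class LagAwareIdling {
    private final long maxTaskIdleMs;
    private long deadlineMs = -1; // -1: no idle deadline currently running

    public LagAwareIdling(long maxTaskIdleMs) {
        this.maxTaskIdleMs = maxTaskIdleMs;
    }

    /** Decide whether the task may process its buffered records now. */
    public boolean readyToProcess(OptionalLong cachedLag, long nowMs) {
        if (cachedLag.isPresent() && cachedLag.getAsLong() == 0) {
            deadlineMs = -1;   // fully caught up: no need to idle
            return true;
        }
        // Lag unknown or > 0: start the idle deadline if not already running
        if (deadlineMs == -1)
            deadlineMs = nowMs + maxTaskIdleMs;
        // Once we've idled for max.task.idle.ms, process anyway
        return nowMs >= deadlineMs;
    }

    /** A real lag observation arrived: reset the deadline, as in KIP-353. */
    public void onLagUpdate() {
        deadlineMs = -1;
    }
}
```

Note that with max.task.idle.ms set to 0 this sketch processes immediately even when the lag is unknown, which is exactly the behavior the reporter expected in step 6 of the scenario above.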
[jira] [Updated] (KAFKA-12993) Formatting of Streams 'Memory Management' docs is messed up
[ https://issues.apache.org/jira/browse/KAFKA-12993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman updated KAFKA-12993: --- Fix Version/s: (was: 2.8.1) > Formatting of Streams 'Memory Management' docs is messed up > > > Key: KAFKA-12993 > URL: https://issues.apache.org/jira/browse/KAFKA-12993 > Project: Kafka > Issue Type: Bug > Components: docs, streams >Reporter: A. Sophie Blee-Goldman >Assignee: Luke Chen >Priority: Blocker > Fix For: 3.0.0 > > > The formatting of this page is all messed up, starting in the RocksDB > section. It looks like there's a missing closing tag after the example > BoundedMemoryRocksDBConfig class -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (KAFKA-7493) Rewrite test_broker_type_bounce_at_start
[ https://issues.apache.org/jira/browse/KAFKA-7493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman reassigned KAFKA-7493: - Assignee: (was: A. Sophie Blee-Goldman) > Rewrite test_broker_type_bounce_at_start > > > Key: KAFKA-7493 > URL: https://issues.apache.org/jira/browse/KAFKA-7493 > Project: Kafka > Issue Type: Improvement > Components: streams, system tests >Reporter: John Roesler >Priority: Major > Fix For: 3.0.0 > > > Currently, the test test_broker_type_bounce_at_start in > streams_broker_bounce_test.py is ignored. > As written, there are a couple of race conditions that lead to flakiness. > It should be possible to re-write the test to wait on log messages, as the > other tests do, instead of just sleeping to more deterministically transition > the test from one state to the next. > Once the test is fixed, the fix should be back-ported to all prior branches, > back to 0.10. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-7493) Rewrite test_broker_type_bounce_at_start
[ https://issues.apache.org/jira/browse/KAFKA-7493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman updated KAFKA-7493: -- Fix Version/s: (was: 3.0.0) 3.1.0 > Rewrite test_broker_type_bounce_at_start > > > Key: KAFKA-7493 > URL: https://issues.apache.org/jira/browse/KAFKA-7493 > Project: Kafka > Issue Type: Improvement > Components: streams, system tests >Reporter: John Roesler >Priority: Major > Fix For: 3.1.0 > > > Currently, the test test_broker_type_bounce_at_start in > streams_broker_bounce_test.py is ignored. > As written, there are a couple of race conditions that lead to flakiness. > It should be possible to re-write the test to wait on log messages, as the > other tests do, instead of just sleeping to more deterministically transition > the test from one state to the next. > Once the test is fixed, the fix should be back-ported to all prior branches, > back to 0.10. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-13008) Stream will stop processing data for a long time while waiting for the partition lag
[ https://issues.apache.org/jira/browse/KAFKA-13008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17379462#comment-17379462 ] A. Sophie Blee-Goldman commented on KAFKA-13008: {quote}Re-reading KIP-227, it seems like there should be a way for the client to add re-acquired partitions like this to the incremental fetch request so that it can reinitialize its metadata cache. In other words, it seems like getting a partition assigned that you haven't owned for a while is effectively the same case as getting a partition that you've never owned, and there does seem to be a mechanism for the latter. {quote} Thanks John, that is exactly what I was trying to suggest above, but I may have mangled it with my lack of understanding of the incremental fetch design. Given how long this bug went unnoticed and the in-depth investigation it took to uncover the bug (again, nicely done [~showuon]), it seems like any user of the plain consumer client in addition to Streams could be easily tripped up by this. And just personally, I had to read the analysis twice to really understand what was going on, since the behavior was/is so unintuitive to me. > Stream will stop processing data for a long time while waiting for the > partition lag > > > Key: KAFKA-13008 > URL: https://issues.apache.org/jira/browse/KAFKA-13008 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.0.0 >Reporter: Luke Chen >Priority: Blocker > Fix For: 3.0.0 > > Attachments: image-2021-07-07-11-19-55-630.png > > > In KIP-695, we improved the task idling mechanism by checking partition lag. > It's a good improvement for timestamp sync. But I found it will cause the > stream to stop processing data for a long time while waiting for the > partition metadata. 
> > I've been investigating this case for a while, and figured out that the issue > will happen in the situation below (or a similar one): > # start 2 streams (each with 1 thread) to consume from a topicA (with 3 > partitions: A-0, A-1, A-2) > # After the 2 streams started, the partition assignment is: (I skipped some > other processing-related partitions for simplicity) > stream1-thread1: A-0, A-1 > stream2-thread1: A-2 > # start processing some data; assume now the positions and high watermarks are: > A-0: offset: 2, highWM: 2 > A-1: offset: 2, highWM: 2 > A-2: offset: 2, highWM: 2 > # Now, stream3 joined, so a rebalance is triggered with this assignment: > stream1-thread1: A-0 > stream2-thread1: A-2 > stream3-thread1: A-1 > # Suddenly, stream3 left, so now we rebalance again, with the step 2 > assignment: > stream1-thread1: A-0, *A-1* > stream2-thread1: A-2 > (note: after initialization, the position of A-1 will be: position: null, > highWM: null) > # Now, note that partition A-1 used to be assigned to stream1-thread1, > and now it's back. Also, assume that partition A-1 has slow input (ex: 1 > record per 30 mins) and partition A-0 has fast input (ex: 10K records / > sec). So now stream1-thread1 won't process any data until we get input > from partition A-1 (even if partition A-0 has buffered a lot, and we have > `{{max.task.idle.ms}}` set to 0). > > The reason stream1-thread1 won't process any data is that we can't > get the lag of partition A-1. And why can't we get the lag? It's because > # In KIP-695, we use the consumer's cache to get the partition lag, to avoid a > remote call > # The lag for a partition will be cleared if the assignment in this round > doesn't have this partition. See > [here|https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/consumer/internals/SubscriptionState.java#L272]. 
> So, in the above example, the metadata cache for partition A-1 will be > cleared in step 4, and re-initialized (to null) in step 5 > # In KIP-227, we introduced a fetch session to allow incremental fetch > requests/responses. That is, if the session exists, the client (consumer) will > get an update only when the fetched partitions have updates (ex: new data). > So, in the above case, since partition A-1 has slow input (ex: 1 record per 30 > mins), it won't have an update for the next 30 mins, unless the fetch session > becomes inactive for the (default) 2 mins and is evicted. In either case, the metadata > won't be updated for a while. > > In KIP-695, if we don't get the partition lag, we can't determine the > partition data status to do timestamp sync, so we'll keep waiting and not > process any data. That's why this issue happens. > > *Proposed solution:* > # If we don't get the current lag for a partition, or the current lag > 0, > we start to wait for max.task.idle.ms, and reset the deadline when we get the > partition lag,
[jira] [Commented] (KAFKA-12477) Smart rebalancing with dynamic protocol selection
[ https://issues.apache.org/jira/browse/KAFKA-12477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17377635#comment-17377635 ] A. Sophie Blee-Goldman commented on KAFKA-12477: No, we already decided to postpone this work to 3.1 so we can focus on some related things. Moved the Fix Version to 3.1.0 > Smart rebalancing with dynamic protocol selection > - > > Key: KAFKA-12477 > URL: https://issues.apache.org/jira/browse/KAFKA-12477 > Project: Kafka > Issue Type: Improvement > Components: consumer >Reporter: A. Sophie Blee-Goldman >Assignee: A. Sophie Blee-Goldman >Priority: Major > Fix For: 3.1.0 > > > Users who want to upgrade their applications and enable the COOPERATIVE > rebalancing protocol in their consumer apps are required to follow a double > rolling bounce upgrade path. The reason for this is laid out in the [Consumer > Upgrades|https://cwiki.apache.org/confluence/display/KAFKA/KIP-429%3A+Kafka+Consumer+Incremental+Rebalance+Protocol#KIP429:KafkaConsumerIncrementalRebalanceProtocol-Consumer] > section of KIP-429. Basically, the ConsumerCoordinator picks a rebalancing > protocol in its constructor based on the list of supported partition > assignors. The protocol is selected as the highest protocol that is commonly > supported by all assignors in the list, and never changes after that. > This is a bit unfortunate because it may end up using an older protocol even > after every member in the group has been updated to support the newer > protocol. After the first rolling bounce of the upgrade, all members will > have two assignors: "cooperative-sticky" and "range" (or > sticky/round-robin/etc). At this point the EAGER protocol will still be > selected due to the presence of the "range" assignor, but it's the > "cooperative-sticky" assignor that will ultimately be selected for use in > rebalances if that assignor is preferred (ie positioned first in the list). 
> The only reason for the second rolling bounce is to strip off the "range" > assignor and allow the upgraded members to switch over to COOPERATIVE. We > can't allow them to use cooperative rebalancing until everyone has been > upgraded, but once they have it's safe to do so. > And there is already a way for the client to detect that everyone is on the > new byte code: if the CooperativeStickyAssignor is selected by the group > coordinator, then that means it is supported by all consumers in the group > and therefore everyone must be upgraded. > We may be able to save the second rolling bounce by dynamically updating the > rebalancing protocol inside the ConsumerCoordinator as "the highest protocol > supported by the assignor chosen by the group coordinator". This means we'll > still be using EAGER at the first rebalance, since we of course need to wait > for this initial rebalance to get the response from the group coordinator. > But we should take the hint from the chosen assignor rather than dropping > this information on the floor and sticking with the original protocol. > Concrete Proposal: > This assumes we will change the default assignor to ["cooperative-sticky", > "range"] in KIP-726. It also acknowledges that users may attempt any kind of > upgrade without reading the docs, and so we need to put in safeguards against > data corruption rather than assume everyone will follow the safe upgrade path. 
> With this proposal, > 1) New applications on 3.0 will enable cooperative rebalancing by default > 2) Existing applications which don’t set an assignor can safely upgrade to > 3.0 using a single rolling bounce with no extra steps, and will automatically > transition to cooperative rebalancing > 3) Existing applications which do set an assignor that uses EAGER can > likewise upgrade their applications to COOPERATIVE with a single rolling > bounce > 4) Once on 3.0, applications can safely go back and forth between EAGER and > COOPERATIVE > 5) Applications can safely downgrade away from 3.0 > The high-level idea for dynamic protocol upgrades is that the group will > leverage the assignor selected by the group coordinator to determine when > it’s safe to upgrade to COOPERATIVE, and trigger a fail-safe to protect the > group in case of rare events or user misconfiguration. The group coordinator > selects the most preferred assignor that’s supported by all members of the > group, so we know that all members will support COOPERATIVE once we receive > the “cooperative-sticky” assignor after a rebalance. At this point, each > member can upgrade their own protocol to COOPERATIVE. However, there may be > situations in which an EAGER member may join the group even after upgrading > to COOPERATIVE. For example, during a rolling upgrade if the last remaining > member on
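The protocol-selection rule described above (the highest rebalance protocol commonly supported by all configured assignors) can be sketched as follows. This is an illustrative model, not the actual ConsumerCoordinator code; all class and method names here are hypothetical:

```java
import java.util.*;

// Illustrative sketch: pick the highest rebalance protocol supported by
// every assignor in the configured list (hypothetical names).
public class ProtocolSelection {
    enum RebalanceProtocol { EAGER, COOPERATIVE }

    // Each assignor advertises the set of protocols it supports.
    static RebalanceProtocol highestCommonProtocol(List<? extends Set<RebalanceProtocol>> assignorSupport) {
        // Start with all protocols and intersect with each assignor's supported set.
        EnumSet<RebalanceProtocol> common = EnumSet.allOf(RebalanceProtocol.class);
        for (Set<RebalanceProtocol> supported : assignorSupport)
            common.retainAll(supported);
        if (common.isEmpty())
            throw new IllegalArgumentException("No commonly supported rebalance protocol");
        // The highest protocol that everyone supports wins (enum ordinal order).
        return Collections.max(common);
    }
}
```

With the assignor list ["cooperative-sticky", "range"], the intersection contains only EAGER, which is exactly why today's upgrade path needs a second rolling bounce to strip the "range" assignor before COOPERATIVE can be used.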
[jira] [Updated] (KAFKA-12477) Smart rebalancing with dynamic protocol selection
[ https://issues.apache.org/jira/browse/KAFKA-12477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman updated KAFKA-12477: --- Fix Version/s: (was: 3.0.0) 3.1.0 > Smart rebalancing with dynamic protocol selection > - > > Key: KAFKA-12477 > URL: https://issues.apache.org/jira/browse/KAFKA-12477 > Project: Kafka > Issue Type: Improvement > Components: consumer >Reporter: A. Sophie Blee-Goldman >Assignee: A. Sophie Blee-Goldman >Priority: Major > Fix For: 3.1.0 > > > Users who want to upgrade their applications and enable the COOPERATIVE > rebalancing protocol in their consumer apps are required to follow a double > rolling bounce upgrade path. The reason for this is laid out in the [Consumer > Upgrades|https://cwiki.apache.org/confluence/display/KAFKA/KIP-429%3A+Kafka+Consumer+Incremental+Rebalance+Protocol#KIP429:KafkaConsumerIncrementalRebalanceProtocol-Consumer] > section of KIP-429. Basically, the ConsumerCoordinator picks a rebalancing > protocol in its constructor based on the list of supported partition > assignors. The protocol is selected as the highest protocol that is commonly > supported by all assignors in the list, and never changes after that. > This is a bit unfortunate because it may end up using an older protocol even > after every member in the group has been updated to support the newer > protocol. After the first rolling bounce of the upgrade, all members will > have two assignors: "cooperative-sticky" and "range" (or > sticky/round-robin/etc). At this point the EAGER protocol will still be > selected due to the presence of the "range" assignor, but it's the > "cooperative-sticky" assignor that will ultimately be selected for use in > rebalances if that assignor is preferred (ie positioned first in the list). > The only reason for the second rolling bounce is to strip off the "range" > assignor and allow the upgraded members to switch over to COOPERATIVE. 
We > can't allow them to use cooperative rebalancing until everyone has been > upgraded, but once they have it's safe to do so. > And there is already a way for the client to detect that everyone is on the > new byte code: if the CooperativeStickyAssignor is selected by the group > coordinator, then that means it is supported by all consumers in the group > and therefore everyone must be upgraded. > We may be able to save the second rolling bounce by dynamically updating the > rebalancing protocol inside the ConsumerCoordinator as "the highest protocol > supported by the assignor chosen by the group coordinator". This means we'll > still be using EAGER at the first rebalance, since we of course need to wait > for this initial rebalance to get the response from the group coordinator. > But we should take the hint from the chosen assignor rather than dropping > this information on the floor and sticking with the original protocol. > Concrete Proposal: > This assumes we will change the default assignor to ["cooperative-sticky", > "range"] in KIP-726. It also acknowledges that users may attempt any kind of > upgrade without reading the docs, and so we need to put in safeguards against > data corruption rather than assume everyone will follow the safe upgrade path. 
> With this proposal, > 1) New applications on 3.0 will enable cooperative rebalancing by default > 2) Existing applications which don’t set an assignor can safely upgrade to > 3.0 using a single rolling bounce with no extra steps, and will automatically > transition to cooperative rebalancing > 3) Existing applications which do set an assignor that uses EAGER can > likewise upgrade their applications to COOPERATIVE with a single rolling > bounce > 4) Once on 3.0, applications can safely go back and forth between EAGER and > COOPERATIVE > 5) Applications can safely downgrade away from 3.0 > The high-level idea for dynamic protocol upgrades is that the group will > leverage the assignor selected by the group coordinator to determine when > it’s safe to upgrade to COOPERATIVE, and trigger a fail-safe to protect the > group in case of rare events or user misconfiguration. The group coordinator > selects the most preferred assignor that’s supported by all members of the > group, so we know that all members will support COOPERATIVE once we receive > the “cooperative-sticky” assignor after a rebalance. At this point, each > member can upgrade their own protocol to COOPERATIVE. However, there may be > situations in which an EAGER member may join the group even after upgrading > to COOPERATIVE. For example, during a rolling upgrade if the last remaining > member on the old bytecode misses a rebalance, the other members will be > allowed to upgrade to COOPERATIVE. If
[jira] [Commented] (KAFKA-9295) KTableKTableForeignKeyInnerJoinMultiIntegrationTest#shouldInnerJoinMultiPartitionQueryable
[ https://issues.apache.org/jira/browse/KAFKA-9295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17377616#comment-17377616 ] A. Sophie Blee-Goldman commented on KAFKA-9295: --- [~kkonstantine] Luke was able to figure out the likely root cause and it actually does seem to be a potentially severe bug. See https://issues.apache.org/jira/browse/KAFKA-13008 > KTableKTableForeignKeyInnerJoinMultiIntegrationTest#shouldInnerJoinMultiPartitionQueryable > -- > > Key: KAFKA-9295 > URL: https://issues.apache.org/jira/browse/KAFKA-9295 > Project: Kafka > Issue Type: Bug > Components: streams, unit tests >Affects Versions: 2.4.0, 2.6.0 >Reporter: Matthias J. Sax >Assignee: Luke Chen >Priority: Blocker > Labels: flaky-test > Fix For: 3.0.0 > > > [https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/27106/testReport/junit/org.apache.kafka.streams.integration/KTableKTableForeignKeyInnerJoinMultiIntegrationTest/shouldInnerJoinMultiPartitionQueryable/] > {quote}java.lang.AssertionError: Did not receive all 1 records from topic > output- within 6 ms Expected: is a value equal to or greater than <1> > but: <0> was less than <1> at > org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18) at > org.apache.kafka.streams.integration.utils.IntegrationTestUtils.lambda$waitUntilMinKeyValueRecordsReceived$1(IntegrationTestUtils.java:515) > at > org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:417) > at > org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:385) > at > org.apache.kafka.streams.integration.utils.IntegrationTestUtils.waitUntilMinKeyValueRecordsReceived(IntegrationTestUtils.java:511) > at > org.apache.kafka.streams.integration.utils.IntegrationTestUtils.waitUntilMinKeyValueRecordsReceived(IntegrationTestUtils.java:489) > at > org.apache.kafka.streams.integration.KTableKTableForeignKeyInnerJoinMultiIntegrationTest.verifyKTableKTableJoin(KTableKTableForeignKeyInnerJoinMultiIntegrationTest.java:200) > at > 
org.apache.kafka.streams.integration.KTableKTableForeignKeyInnerJoinMultiIntegrationTest.shouldInnerJoinMultiPartitionQueryable(KTableKTableForeignKeyInnerJoinMultiIntegrationTest.java:183){quote} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-13037) "Thread state is already PENDING_SHUTDOWN" log spam
[ https://issues.apache.org/jira/browse/KAFKA-13037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17376066#comment-17376066 ] A. Sophie Blee-Goldman commented on KAFKA-13037: Hey [~gray.john], would you be interested in submitting a PR for this? I completely agree that logging this at INFO on every iteration is wildly inappropriate, I just didn't push it at the time since I figured someone would file a ticket if it was really bothering people. And here we are :) > "Thread state is already PENDING_SHUTDOWN" log spam > --- > > Key: KAFKA-13037 > URL: https://issues.apache.org/jira/browse/KAFKA-13037 > Project: Kafka > Issue Type: Bug > Components: streams >Affects Versions: 2.8.0, 2.7.1 >Reporter: John Gray >Priority: Major > > KAFKA-12462 introduced a > [change|https://github.com/apache/kafka/commit/4fe4cdc4a61cbac8e070a8b5514403235194015b#diff-76f629d0df8bd30b2593cbcf4a2dc80de3167ebf55ef8b5558e6e6285a057496R722] > that increased this "Thread state is already {}" logger to info instead of > debug. We are running into a problem with our streams apps when they hit an > unrecoverable exception that shuts down the streams thread, we get this log > printed about 50,000 times per second per thread. I am guessing it is once > per record we have queued up when the exception happens? We have temporarily > raised the StreamThread logger to WARN instead of INFO to avoid the spam, but > we do miss the other good logs we get on INFO in that class. Could this log > be reverted back to debug? Thank you! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-13008) Stream will stop processing data for a long time while waiting for the partition lag
[ https://issues.apache.org/jira/browse/KAFKA-13008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17376019#comment-17376019 ] A. Sophie Blee-Goldman commented on KAFKA-13008: Nice find! I agree, this does not seem like the expected behavior, and given that it's been causing a test to fail regularly I think we can assert that this should not happen. One thing I don't understand, and maybe this is because I don't have much context on the incremental fetch internals, is: why would we not get the metadata again after the partition is re-assigned in step 5? Surely if a partition was removed from the assignment and then added back, this should constitute a new 'session', and thus it should get the metadata again on assignment? If not, that sounds like a bug in the incremental fetch to me, but again, I'm not too familiar with it so there could be a valid reason it works this way. If so, then maybe we should consider allowing metadata to remain around after a partition is unassigned, in case it gets this same partition back within the session? Could there be other consequences of this lack of metadata, outside of Streams? > Stream will stop processing data for a long time while waiting for the > partition lag > > > Key: KAFKA-13008 > URL: https://issues.apache.org/jira/browse/KAFKA-13008 > Project: Kafka > Issue Type: Bug >Affects Versions: 3.0.0 >Reporter: Luke Chen >Priority: Major > > In KIP-695, we improved the task idling mechanism by checking partition lag. > It's a good improvement for timestamp sync. But I found it will cause the > stream to stop processing data for a long time while waiting for the > partition metadata. 
> > I've been investigating this case for a while, and figured out that the issue > will happen in the situation below (or a similar one): > # start 2 streams (each with 1 thread) to consume from a topicA (with 3 > partitions: A-0, A-1, A-2) > # After the 2 streams started, the partition assignment is: (I skipped some > other processing related partitions for simplicity) > stream1-thread1: A-0, A-1 > stream2-thread1: A-2 > # start processing some data; assume now the positions and high watermarks are: > A-0: offset: 2, highWM: 2 > A-1: offset: 2, highWM: 2 > A-2: offset: 2, highWM: 2 > # Now, stream3 joined, so we trigger a rebalance with this assignment: > stream1-thread1: A-0 > stream2-thread1: A-2 > stream3-thread1: A-1 > # Suddenly, stream3 left, so now we rebalance again, with the step 2 > assignment: > stream1-thread1: A-0, *A-1* > stream2-thread1: A-2 > (note: after initialization, the position of A-1 will be: position: null, > highWM: null) > # Now, note that the partition A-1 used to be assigned to stream1-thread1, > and now it's back. And also, assume the partition A-1 has slow input (ex: 1 > record per 30 mins), and partition A-0 has fast input (ex: 10K records / > sec). So now, stream1-thread1 won't process any data until we get input > from partition A-1 (even if partition A-0 has buffered a lot, and we have > `{{max.task.idle.ms}}` set to 0). > > The reason why stream1-thread1 won't process any data is that we can't > get the lag of partition A-1. And why can't we get the lag? It's because > # In KIP-695, we use the consumer's cache to get the partition lag, to avoid a > remote call > # The lag for a partition will be cleared if the assignment in this round > doesn't have this partition; check > [here|https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/consumer/internals/SubscriptionState.java#L272]. 
> So, in the above example, the metadata cache for partition A-1 will be > cleared in step 4, and re-initialized (to null) in step 5 > # In KIP-227, we introduced a fetch session to have incremental fetch > request/response. That is, if the session exists, the client (consumer) will > get an update only when the fetched partitions have updates (ex: new data). > So, in the above case, since partition A-1 has slow input (ex: 1 record per 30 > mins), it won't get an update for up to 30 mins, unless the fetch session > becomes inactive for the (default) 2 mins and is evicted. In either case, the metadata > won't be updated for a while. > > In KIP-695, if we don't get the partition lag, we can't determine the > partition data status to do timestamp sync, so we'll keep waiting and not > processing any data. That's why this issue will happen. > > *Proposed solution:* > # If we don't get the current lag for a partition, or the current lag > 0, > we start to wait for max.task.idle.ms, and reset the deadline when we get the > partition lag, as we did previously in KIP-353 > # Introduce a waiting time config when no partition lag, or
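The caching behavior described in the two points above can be modeled with a small sketch. This is a toy stand-in with hypothetical names (the real logic lives in SubscriptionState and the fetcher): the lag is dropped when a partition leaves the assignment, and a re-added partition reports no lag until a fetch response refreshes it.

```java
import java.util.*;

// Toy model of the lag cache behavior described in this ticket.
public class LagCacheModel {
    // partition -> last known lag (null means "unknown")
    private final Map<String, Long> lagByPartition = new HashMap<>();

    void assign(Set<String> partitions) {
        // Steps 4/5 above: state for partitions not in the new assignment is
        // cleared; partitions that come back start with no lag information.
        lagByPartition.keySet().retainAll(partitions);
        for (String tp : partitions)
            lagByPartition.putIfAbsent(tp, null);
    }

    void onFetchResponse(String tp, long lag) {
        lagByPartition.put(tp, lag);
    }

    // KIP-695-style check: without a cached lag we cannot tell whether the
    // partition has data, so the task keeps idling.
    OptionalLong currentLag(String tp) {
        Long lag = lagByPartition.get(tp);
        return lag == null ? OptionalLong.empty() : OptionalLong.of(lag);
    }
}
```

Running the scenario from the description through this model: A-1 gets a lag from a fetch response, is revoked in the rebalance, and comes back with an unknown lag, which then stays unknown until the (slow) partition produces new data or the fetch session is evicted.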
[jira] [Commented] (KAFKA-9177) Pause completed partitions on restore consumer
[ https://issues.apache.org/jira/browse/KAFKA-9177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17369771#comment-17369771 ] A. Sophie Blee-Goldman commented on KAFKA-9177: --- [~apolyakov] this ticket was specifically about pausing those completed changelogs in the restore consumer's assignment so that it doesn't continue to fetch records beyond the "end" of the changelog. The log message you're seeing doesn't indicate that this wasn't done, it's just a debug message to confirm we've finished restoring all changelogs when we run through the no-op restore phase after restoration is complete. I wouldn't worry about it. > Pause completed partitions on restore consumer > -- > > Key: KAFKA-9177 > URL: https://issues.apache.org/jira/browse/KAFKA-9177 > Project: Kafka > Issue Type: Improvement > Components: streams >Reporter: A. Sophie Blee-Goldman >Assignee: Guozhang Wang >Priority: Major > Fix For: 2.6.0 > > > The StoreChangelogReader is responsible for tracking and restoring active > tasks, but once a store has finished restoring it will continue polling for > records on that partition. > Ordinarily this doesn't make a difference as a store is not completely > restored until its entire changelog has been read, so there are no more > records for poll to return anyway. But if the restoring state is actually an > optimized source KTable, the changelog is just the source topic and poll will > keep returning records for that partition until all stores have been restored. > Note that this isn't a correctness issue since it's just the restore > consumer, but it is wasteful to be polling for records and throwing them > away. We should pause completed partitions in StoreChangelogReader so we > don't slow down the restore consumer in reading from the unfinished changelog > topics, and avoid wasted network. -- This message was sent by Atlassian Jira (v8.3.4#803005)
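The fix this ticket describes can be sketched as follows, with a minimal stand-in for the restore consumer (illustrative names, not the actual StoreChangelogReader code): once a changelog partition has caught up to its end offset, pause it so poll() stops fetching records that would only be thrown away.

```java
import java.util.*;

// Sketch: pause changelog partitions whose restoration has completed.
public class RestorePauser {
    // Minimal stand-in for the restore consumer's pause() API.
    interface RestoreConsumer { void pause(Collection<String> partitions); }

    // A partition is done restoring once its restored offset reaches the
    // end offset captured when restoration began.
    static List<String> completedPartitions(Map<String, Long> restoredOffsets,
                                            Map<String, Long> endOffsets) {
        List<String> completed = new ArrayList<>();
        for (Map.Entry<String, Long> e : restoredOffsets.entrySet()) {
            Long end = endOffsets.get(e.getKey());
            if (end != null && e.getValue() >= end)
                completed.add(e.getKey());
        }
        return completed;
    }

    static void pauseCompleted(RestoreConsumer consumer,
                               Map<String, Long> restoredOffsets,
                               Map<String, Long> endOffsets) {
        List<String> completed = completedPartitions(restoredOffsets, endOffsets);
        if (!completed.isEmpty())
            consumer.pause(completed); // stop fetching past the restore point
    }
}
```

This matters mostly for optimized source KTables, where the "changelog" is the live source topic and poll() would otherwise keep returning records past the restore point.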
[jira] [Commented] (KAFKA-12993) Formatting of Streams 'Memory Management' docs is messed up
[ https://issues.apache.org/jira/browse/KAFKA-12993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17369758#comment-17369758 ] A. Sophie Blee-Goldman commented on KAFKA-12993: Oh, you already have a PR for kafka-site. I should have known :P > Formatting of Streams 'Memory Management' docs is messed up > > > Key: KAFKA-12993 > URL: https://issues.apache.org/jira/browse/KAFKA-12993 > Project: Kafka > Issue Type: Bug > Components: docs, streams >Reporter: A. Sophie Blee-Goldman >Assignee: Luke Chen >Priority: Blocker > Fix For: 3.0.0, 2.8.1 > > > The formatting of this page is all messed up, starting in the RocksDB > section. It looks like there's a missing closing tag after the example > BoundedMemoryRocksDBConfig class -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-12993) Formatting of Streams 'Memory Management' docs is messed up
[ https://issues.apache.org/jira/browse/KAFKA-12993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17369756#comment-17369756 ] A. Sophie Blee-Goldman commented on KAFKA-12993: Ah, I didn't realize this was already addressed. Thanks for the heads up Bruno. I agree with Luke's take on this: not every docs change needs to go into kafka-site right away, but for something like this which is a pretty egregious formatting issue that makes it very difficult to read, we should fix it on the live site ASAP. [~showuon] do you want to port this PR over to kafka-site? > Formatting of Streams 'Memory Management' docs is messed up > > > Key: KAFKA-12993 > URL: https://issues.apache.org/jira/browse/KAFKA-12993 > Project: Kafka > Issue Type: Bug > Components: docs, streams >Reporter: A. Sophie Blee-Goldman >Assignee: Luke Chen >Priority: Blocker > Fix For: 3.0.0, 2.8.1 > > > The formatting of this page is all messed up, starting in the RocksDB > section. It looks like there's a missing closing tag after the example > BoundedMemoryRocksDBConfig class -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-12993) Formatting of Streams 'Memory Management' docs is messed up
[ https://issues.apache.org/jira/browse/KAFKA-12993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman updated KAFKA-12993: --- Component/s: docs > Formatting of Streams 'Memory Management' docs is messed up > > > Key: KAFKA-12993 > URL: https://issues.apache.org/jira/browse/KAFKA-12993 > Project: Kafka > Issue Type: Bug > Components: docs, streams >Reporter: A. Sophie Blee-Goldman >Priority: Blocker > Fix For: 3.0.0, 2.8.1 > > > The formatting of this page is all messed up, starting in the RocksDB > section. It looks like there's a missing closing tag after the example > BoundedMemoryRocksDBConfig class -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-12993) Formatting of Streams 'Memory Management' docs is messed up
A. Sophie Blee-Goldman created KAFKA-12993: -- Summary: Formatting of Streams 'Memory Management' docs is messed up Key: KAFKA-12993 URL: https://issues.apache.org/jira/browse/KAFKA-12993 Project: Kafka Issue Type: Bug Components: streams Reporter: A. Sophie Blee-Goldman Fix For: 3.0.0, 2.8.1 The formatting of this page is all messed up, starting in the RocksDB section. It looks like there's a missing closing tag after the example BoundedMemoryRocksDBConfig class -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-12984) Cooperative sticky assignor can get stuck with invalid SubscriptionState input metadata
[ https://issues.apache.org/jira/browse/KAFKA-12984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17368542#comment-17368542 ] A. Sophie Blee-Goldman commented on KAFKA-12984: [~mjsax] Technically yes, the issue with the SubscriptionState potentially providing invalid "ownedPartitions" input can affect Kafka Streams as well. However the impact for Streams is considerably less severe, as the assignment algorithm doesn't make any assumptions about the previous assignment being valid. The worst that should happen to a Streams application is that the assignment could be slightly sub-optimal, with a partition/active task being assigned to a member that had dropped out of the group since being assigned that partition, instead of its true current owner. > Cooperative sticky assignor can get stuck with invalid SubscriptionState > input metadata > --- > > Key: KAFKA-12984 > URL: https://issues.apache.org/jira/browse/KAFKA-12984 > Project: Kafka > Issue Type: Bug > Components: consumer >Reporter: A. Sophie Blee-Goldman >Assignee: A. Sophie Blee-Goldman >Priority: Blocker > Fix For: 3.0.0, 2.8.1 > > > Some users have reported seeing their consumer group become stuck in the > CompletingRebalance phase when using the cooperative-sticky assignor. Based > on the request metadata we were able to deduce that multiple consumers were > reporting the same partition(s) in their "ownedPartitions" field of the > consumer protocol. Since this is an invalid state, the input causes the > cooperative-sticky assignor to detect that something is wrong and throw an > IllegalStateException. If the consumer application is set up to simply retry, > this will cause the group to appear to hang in the rebalance state. > The "ownedPartitions" field is encoded based on the ConsumerCoordinator's > SubscriptionState, which was assumed to always be up to date. 
However there > may be cases where the consumer has dropped out of the group but fails to > clear the SubscriptionState, allowing it to report some partitions as owned > that have since been reassigned to another member. > We should (a) fix the sticky assignment algorithm to resolve cases of > improper input conditions by invalidating the "ownedPartitions" in cases of > double ownership, and (b) shore up the ConsumerCoordinator logic to better > handle rejoining the group and keeping its internal state consistent. See > KAFKA-12983 for more details on (b) -- This message was sent by Atlassian Jira (v8.3.4#803005)
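Part (a) of the proposed fix, invalidating "ownedPartitions" claims on double ownership instead of throwing, could look roughly like this. An illustrative sketch with hypothetical names, not the actual CooperativeStickyAssignor code:

```java
import java.util.*;

// Sketch of fix (a): if two members both claim a partition in their
// "ownedPartitions", trust neither claim, so the assignor can make
// progress on bad input instead of throwing IllegalStateException.
public class OwnedPartitionsValidator {
    static Map<String, Set<String>> invalidateDoubleOwnership(Map<String, Set<String>> ownedByMember) {
        // Count how many members claim each partition.
        Map<String, Integer> claims = new HashMap<>();
        for (Set<String> owned : ownedByMember.values())
            for (String tp : owned)
                claims.merge(tp, 1, Integer::sum);

        // Keep only claims that are unique to a single member; drop the rest.
        Map<String, Set<String>> sanitized = new HashMap<>();
        for (Map.Entry<String, Set<String>> e : ownedByMember.entrySet()) {
            Set<String> kept = new HashSet<>();
            for (String tp : e.getValue())
                if (claims.get(tp) == 1)
                    kept.add(tp);
            sanitized.put(e.getKey(), kept);
        }
        return sanitized;
    }
}
```

Dropping the conflicting claims costs at most some stickiness for the affected partitions, which is a far better outcome than a group stuck retrying in CompletingRebalance.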
[jira] [Commented] (KAFKA-12896) Group rebalance loop caused by repeated group leader JoinGroups
[ https://issues.apache.org/jira/browse/KAFKA-12896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17367805#comment-17367805 ] A. Sophie Blee-Goldman commented on KAFKA-12896: I believe this is caused by https://issues.apache.org/jira/browse/KAFKA-12984, which in turn is caused by https://issues.apache.org/jira/browse/KAFKA-12983. I'll prepare a fix for both of these issues > Group rebalance loop caused by repeated group leader JoinGroups > --- > > Key: KAFKA-12896 > URL: https://issues.apache.org/jira/browse/KAFKA-12896 > Project: Kafka > Issue Type: Bug > Components: clients >Affects Versions: 2.6.0 >Reporter: Lucas Bradstreet >Assignee: David Jacot >Priority: Blocker > Fix For: 3.0.0 > > > We encountered a strange case of a rebalance loop with the > "cooperative-sticky" assignor. The logs show the following for several hours: > > {{Apr 7, 2021 @ 03:58:36.040 [GroupCoordinator 7]: Stabilized group mygroup > generation 19830137 (__consumer_offsets-7)}} > {{Apr 7, 2021 @ 03:58:35.992 [GroupCoordinator 7]: Preparing to rebalance > group mygroup in state PreparingRebalance with old generation 19830136 > (__consumer_offsets-7) (reason: Updating metadata for member > mygroup-1-7ad27e07-3784-4588-97e1-d796a74d4ecc during CompletingRebalance)}} > {{Apr 7, 2021 @ 03:58:35.988 [GroupCoordinator 7]: Stabilized group mygroup > generation 19830136 (__consumer_offsets-7)}} > {{Apr 7, 2021 @ 03:58:35.972 [GroupCoordinator 7]: Preparing to rebalance > group mygroup in state PreparingRebalance with old generation 19830135 > (__consumer_offsets-7) (reason: Updating metadata for member mygroup during > CompletingRebalance)}} > {{Apr 7, 2021 @ 03:58:35.965 [GroupCoordinator 7]: Stabilized group mygroup > generation 19830135 (__consumer_offsets-7)}} > {{Apr 7, 2021 @ 03:58:35.953 [GroupCoordinator 7]: Preparing to rebalance > group mygroup in state PreparingRebalance with old generation 19830134 > (__consumer_offsets-7) (reason: Updating metadata for member > 
mygroup-7ad27e07-3784-4588-97e1-d796a74d4ecc during CompletingRebalance)}} > {{Apr 7, 2021 @ 03:58:35.941 [GroupCoordinator 7]: Stabilized group mygroup > generation 19830134 (__consumer_offsets-7)}} > {{Apr 7, 2021 @ 03:58:35.926 [GroupCoordinator 7]: Preparing to rebalance > group mygroup in state PreparingRebalance with old generation 19830133 > (__consumer_offsets-7) (reason: Updating metadata for member mygroup during > CompletingRebalance)}} > Every single time, it was the same member that triggered the JoinGroup and it > was always the leader of the group.{{}} > The leader has the privilege of being able to trigger a rebalance by sending > `JoinGroup` even if its subscription metadata has not changed. But why would > it do so? > It is possible that this is due to the same issue or a similar bug to > https://issues.apache.org/jira/browse/KAFKA-12890. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (KAFKA-12983) onJoinPrepare is not always invoked before joining the group
[ https://issues.apache.org/jira/browse/KAFKA-12983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman reassigned KAFKA-12983: -- Assignee: A. Sophie Blee-Goldman > onJoinPrepare is not always invoked before joining the group > > > Key: KAFKA-12983 > URL: https://issues.apache.org/jira/browse/KAFKA-12983 > Project: Kafka > Issue Type: Bug > Components: consumer >Reporter: A. Sophie Blee-Goldman >Assignee: A. Sophie Blee-Goldman >Priority: Blocker > Fix For: 3.0.0, 2.8.1 > > > As the title suggests, the #onJoinPrepare callback is not always invoked > before a member (re)joins the group, but only once when it first enters the > rebalance. This means that any updates or events that occur during the join > phase can be lost in the internal state: for example, clearing the > SubscriptionState (and thus the "ownedPartitions" that are used for > cooperative rebalancing) after losing its memberId during a rebalance. > We should reset the `needsJoinPrepare` flag inside the resetStateAndRejoin() > method -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (KAFKA-12984) Cooperative sticky assignor can get stuck with invalid SubscriptionState input metadata
[ https://issues.apache.org/jira/browse/KAFKA-12984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman reassigned KAFKA-12984: -- Assignee: A. Sophie Blee-Goldman > Cooperative sticky assignor can get stuck with invalid SubscriptionState > input metadata > --- > > Key: KAFKA-12984 > URL: https://issues.apache.org/jira/browse/KAFKA-12984 > Project: Kafka > Issue Type: Bug > Components: consumer >Reporter: A. Sophie Blee-Goldman >Assignee: A. Sophie Blee-Goldman >Priority: Blocker > Fix For: 3.0.0, 2.8.1 > > > Some users have reported seeing their consumer group become stuck in the > CompletingRebalance phase when using the cooperative-sticky assignor. Based > on the request metadata we were able to deduce that multiple consumers were > reporting the same partition(s) in their "ownedPartitions" field of the > consumer protocol. Since this is an invalid state, the input causes the > cooperative-sticky assignor to detect that something is wrong and throw an > IllegalStateException. If the consumer application is set up to simply retry, > this will cause the group to appear to hang in the rebalance state. > The "ownedPartitions" field is encoded based on the ConsumerCoordinator's > SubscriptionState, which was assumed to always be up to date. However there > may be cases where the consumer has dropped out of the group but fails to > clear the SubscriptionState, allowing it to report some partitions as owned > that have since been reassigned to another member. > We should (a) fix the sticky assignment algorithm to resolve cases of > improper input conditions by invalidating the "ownedPartitions" in cases of > double ownership, and (b) shore up the ConsumerCoordinator logic to better > handle rejoining the group and keeping its internal state consistent. See > KAFKA-12983 for more details on (b) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-12984) Cooperative sticky assignor can get stuck with invalid SubscriptionState input metadata
A. Sophie Blee-Goldman created KAFKA-12984: -- Summary: Cooperative sticky assignor can get stuck with invalid SubscriptionState input metadata Key: KAFKA-12984 URL: https://issues.apache.org/jira/browse/KAFKA-12984 Project: Kafka Issue Type: Bug Components: consumer Reporter: A. Sophie Blee-Goldman Fix For: 3.0.0, 2.8.1 Some users have reported seeing their consumer group become stuck in the CompletingRebalance phase when using the cooperative-sticky assignor. Based on the request metadata we were able to deduce that multiple consumers were reporting the same partition(s) in their "ownedPartitions" field of the consumer protocol. Since this is an invalid state, the input causes the cooperative-sticky assignor to detect that something is wrong and throw an IllegalStateException. If the consumer application is set up to simply retry, this will cause the group to appear to hang in the rebalance state. The "ownedPartitions" field is encoded based on the ConsumerCoordinator's SubscriptionState, which was assumed to always be up to date. However there may be cases where the consumer has dropped out of the group but fails to clear the SubscriptionState, allowing it to report some partitions as owned that have since been reassigned to another member. We should (a) fix the sticky assignment algorithm to resolve cases of improper input conditions by invalidating the "ownedPartitions" in cases of double ownership, and (b) shore up the ConsumerCoordinator logic to better handle rejoining the group and keeping its internal state consistent. See KAFKA-12983 for more details on (b) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-12983) onJoinPrepare is not always invoked before joining the group
A. Sophie Blee-Goldman created KAFKA-12983: -- Summary: onJoinPrepare is not always invoked before joining the group Key: KAFKA-12983 URL: https://issues.apache.org/jira/browse/KAFKA-12983 Project: Kafka Issue Type: Bug Components: consumer Reporter: A. Sophie Blee-Goldman Fix For: 3.0.0, 2.8.1 As the title suggests, the #onJoinPrepare callback is not always invoked before a member (re)joins the group, but only once when it first enters the rebalance. This means that any updates or events that occur during the join phase can be lost in the internal state: for example, clearing the SubscriptionState (and thus the "ownedPartitions" that are used for cooperative rebalancing) after losing its memberId during a rebalance. We should reset the `needsJoinPrepare` flag inside the resetStateAndRejoin() method -- This message was sent by Atlassian Jira (v8.3.4#803005)
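The proposed fix — resetting the `needsJoinPrepare` flag inside resetStateAndRejoin() — can be illustrated with a minimal sketch. The class and method names only loosely mirror the real AbstractCoordinator; this is not Kafka's actual implementation.

```java
// Sketch: the coordinator tracks whether #onJoinPrepare still needs to run
// before the next (re)join. Without resetting the flag when state is reset,
// a member that loses its memberId mid-rebalance rejoins without ever
// re-running the prepare callback, so stale state (e.g. "ownedPartitions")
// survives into the new join.
public class CoordinatorSketch {
    private boolean needsJoinPrepare = true;
    private int joinPrepareInvocations = 0;

    public void joinGroupIfNeeded() {
        if (needsJoinPrepare) {
            onJoinPrepare();
            needsJoinPrepare = false;
        }
        // ... send the JoinGroup request ...
    }

    public void resetStateAndRejoin() {
        // ... clear generation / memberId ...
        needsJoinPrepare = true; // the proposed fix: rerun onJoinPrepare on the next join
    }

    private void onJoinPrepare() {
        // e.g. clear the SubscriptionState so stale owned partitions aren't reported
        joinPrepareInvocations++;
    }

    public int joinPrepareInvocations() {
        return joinPrepareInvocations;
    }
}
```

With the flag reset in place, every rejoin that follows a state reset is guaranteed to be preceded by the prepare callback, while repeated join attempts within a single rebalance still invoke it only once.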
[jira] [Commented] (KAFKA-12960) WindowStore and SessionStore do not enforce strict retention time
[ https://issues.apache.org/jira/browse/KAFKA-12960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17367795#comment-17367795 ] A. Sophie Blee-Goldman commented on KAFKA-12960: +1 on pushing this responsibility to the state stores. Imo this was always intended to be a contract of the state store interface itself and was just not strictly enforced, ie it's more like a bug to fix than a matter of "pushing" the responsibility onto them > WindowStore and SessionStore do not enforce strict retention time > - > > Key: KAFKA-12960 > URL: https://issues.apache.org/jira/browse/KAFKA-12960 > Project: Kafka > Issue Type: Improvement > Components: streams >Reporter: Matthias J. Sax >Priority: Major > > WindowedStore and SessionStore do not implement a strict retention time in > general. We should consider making retention time strict: even if we still > have some record in the store (due to the segmented implementation), we might > want to filter expired records on-read. This might benefit PAPI users. > Atm, InMemoryWindowStore does already enforce a strict retention time. > As an alternative, we could also inject such a filter in the wrapping > `MeteredStore` – this might lift the burden from users who implement a custom > state store. > As an alternative, we could change all DSL operators to verify if data from a > state store is already expired or not. It might be better to push this > responsibility into the stores though. > It's especially an issue for stream-stream joins, because the operator relies > on the retention time to implement its grace period. -- This message was sent by Atlassian Jira (v8.3.4#803005)
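The "filter expired records on-read" idea from the ticket can be sketched with a toy wrapper store. This is an illustrative model only — a plain map stands in for a segmented WindowStore, and the class name is hypothetical, not Kafka's MeteredStore API.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: a wrapping store that enforces strict retention on read. Even if a
// segment still physically holds an expired record, fetch() never returns it,
// because expiry is checked against observed stream-time minus retention.
public class RetentionFilteringStore {
    private final long retentionMs;
    // key -> {record timestamp, value}; stand-in for the inner segmented store
    private final Map<String, long[]> inner = new HashMap<>();
    private long observedStreamTime = 0L;

    public RetentionFilteringStore(long retentionMs) {
        this.retentionMs = retentionMs;
    }

    public void put(String key, long value, long timestamp) {
        observedStreamTime = Math.max(observedStreamTime, timestamp);
        inner.put(key, new long[] {timestamp, value});
    }

    public Long fetch(String key) {
        long[] entry = inner.get(key);
        if (entry == null) {
            return null;
        }
        // strict retention: treat anything older than stream-time minus
        // retention as already expired, even if it is still physically present
        if (entry[0] < observedStreamTime - retentionMs) {
            return null;
        }
        return entry[1];
    }
}
```

Injecting exactly this kind of check into the wrapping layer is what would lift the burden from custom state store implementers, and it is also what the stream-stream join needs to make its grace period reliable.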
[jira] [Commented] (KAFKA-8295) Optimize count() using RocksDB merge operator
[ https://issues.apache.org/jira/browse/KAFKA-8295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17363356#comment-17363356 ] A. Sophie Blee-Goldman commented on KAFKA-8295: --- Yes, any of the count DSL operators. It may be a bit more tricky than it appears on the surface because count is actually converted into a generic aggregation under the covers, so you'd have to tease it out into its own independent optimized implementation. To be honest, I don't have a good sense of whether it's even worth the additional code complexity, because I don't know how much additional code and/or code paths this will introduce :) I recommend looking into that before jumping straight in. Of course, we could consider introducing some kind of top-level merge-based operator to the DSL as a feature in its own right. Then count could just be converted to use that instead of the aggregation implementation. Not sure what that would look like, or if it would even be useful at all – just throwing out thoughts here. Anyways I just thought it would be interesting to explore what we might be able to do with this merge operator in Kafka Streams, whether that's an optimization of existing operators or some kind of first class operator of its own. That's really the point of this ticket: to explore the merge operator. > Optimize count() using RocksDB merge operator > - > > Key: KAFKA-8295 > URL: https://issues.apache.org/jira/browse/KAFKA-8295 > Project: Kafka > Issue Type: Improvement > Components: streams >Reporter: A. Sophie Blee-Goldman >Assignee: Sagar Rao >Priority: Major > > In addition to regular put/get/delete RocksDB provides a fourth operation, > merge. This essentially provides an optimized read/update/write path in a > single operation. One of the built-in (C++) merge operators exposed over the > Java API is a counter. 
We should be able to leverage this for a more > efficient implementation of count() > > (Note: Unfortunately it seems unlikely we can use this to optimize general > aggregations, even if RocksJava allowed for a custom merge operator, unless > we provide a way for the user to specify and connect a C++ implemented > aggregator – otherwise we incur too much cost crossing the jni for a net > performance benefit) -- This message was sent by Atlassian Jira (v8.3.4#803005)
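The merge-operator idea can be modeled in pure Java to show why it helps: instead of a read-modify-write per count, each increment is an O(1) operand append, and operands are folded with an associative function on read. This is only a sketch of the semantics — RocksDB's real counter operator is implemented in C++ and, in RocksJava, would be selected by name (e.g. via `Options#setMergeOperatorName("uint64add")`) with `db.merge(key, encodedDelta)` on the write path.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Pure-Java model of RocksDB merge-operator semantics for count()
public class MergeCounterSketch {
    // pending merge operands per key, as RocksDB buffers them internally
    private final Map<String, List<Long>> operands = new HashMap<>();

    // merge(): append-only write path -- no read of the current value,
    // which is exactly the cost the merge operator saves over get+put
    public void merge(String key, long delta) {
        operands.computeIfAbsent(key, k -> new ArrayList<>()).add(delta);
    }

    // get(): lazily fold the operands with the associative merge function
    // (addition, for a counter)
    public long get(String key) {
        long total = 0L;
        for (long delta : operands.getOrDefault(key, Collections.emptyList())) {
            total += delta;
        }
        return total;
    }
}
```

As the ticket notes, this only works because addition is expressible as a built-in C++ operator; a general user-supplied aggregator would have to cross the JNI boundary per merge, likely erasing the benefit.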
[jira] [Commented] (KAFKA-12844) KIP-740 follow up: clean up TaskId
[ https://issues.apache.org/jira/browse/KAFKA-12844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17363355#comment-17363355 ] A. Sophie Blee-Goldman commented on KAFKA-12844: As mentioned elsewhere, this ticket is marked for 4.0 and as such, cannot be worked on yet. When 4.0 is announced you are free to pick this up and work on it again, but as of this point we don't yet know when version 4.0 will be released. Most likely we will have 3.1 after the in-progress 3.0, though I suppose it depends on the Zookeeper removal work. > KIP-740 follow up: clean up TaskId > -- > > Key: KAFKA-12844 > URL: https://issues.apache.org/jira/browse/KAFKA-12844 > Project: Kafka > Issue Type: Task > Components: streams >Reporter: A. Sophie Blee-Goldman >Assignee: loboxu >Priority: Blocker > Fix For: 4.0.0 > > > See > [KIP-740|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=181306557] > – for the TaskId class, we need to remove the following deprecated APIs: > # The public partition and topicGroupId fields should be "removed", ie made > private (can also now rename topicGroupId to subtopology to match the getter) > # The two #readFrom and two #writeTo methods can be removed (they have > already been converted to internal utility methods we now use instead, so > just remove them) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-12843) KIP-740 follow up: clean up TaskMetadata
[ https://issues.apache.org/jira/browse/KAFKA-12843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17363354#comment-17363354 ] A. Sophie Blee-Goldman commented on KAFKA-12843: This was pointed out on the PR, but just leaving a note on the ticket for visibility, plus to clarify for anyone else who comes across this: This ticket is marked for fix in version 4.0, which means we can't work on it yet. We are only in the process of releasing 3.0 at the moment, and it's likely that 3.1 will come after that. Once the bump to 4.0 has been decided, you can pick this up again and actually work on it. > KIP-740 follow up: clean up TaskMetadata > > > Key: KAFKA-12843 > URL: https://issues.apache.org/jira/browse/KAFKA-12843 > Project: Kafka > Issue Type: Task > Components: streams >Reporter: A. Sophie Blee-Goldman >Assignee: loboxu >Priority: Blocker > Fix For: 4.0.0 > > > See > [KIP-740|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=181306557] > – for the TaskMetadata class, we need to: > # Deprecate the TaskMetadata#getTaskId method > # "Remove" the deprecated TaskMetadata#taskId method, then re-add a taskId() > API that returns a TaskId instead of a String > # Remove the deprecated constructor -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-12690) Remove deprecated Producer#sendOffsetsToTransaction
[ https://issues.apache.org/jira/browse/KAFKA-12690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17363343#comment-17363343 ] A. Sophie Blee-Goldman commented on KAFKA-12690: Hey [~loboxu], as pointed out elsewhere this ticket is marked for 4.0, and is not yet ready to be worked on. You're absolutely free to pick this up when the time comes around, so you can leave yourself assigned if you'd like, but it may be a while before 4.0 comes around. Just a heads up :) > Remove deprecated Producer#sendOffsetsToTransaction > --- > > Key: KAFKA-12690 > URL: https://issues.apache.org/jira/browse/KAFKA-12690 > Project: Kafka > Issue Type: Task > Components: producer >Reporter: A. Sophie Blee-Goldman >Assignee: loboxu >Priority: Blocker > Fix For: 4.0.0 > > > In > [KIP-732|https://cwiki.apache.org/confluence/display/KAFKA/KIP-732%3A+Deprecate+eos-alpha+and+replace+eos-beta+with+eos-v2] > we deprecated the EXACTLY_ONCE and EXACTLY_ONCE_BETA configs in > StreamsConfig, to be removed in 4.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-12689) Remove deprecated EOS configs
[ https://issues.apache.org/jira/browse/KAFKA-12689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17363342#comment-17363342 ] A. Sophie Blee-Goldman commented on KAFKA-12689: Hey [~loboxu], this ticket is marked with 4.0 as the Fix Version, which means it can't be worked on until version 4.0. This version has yet to be announced, and since 3.0 is only just about to be released I would not assume that 4.0 is definitely right around the corner. Most likely 4.0 will be soon-ish, but I would look for other tickets to work on for now. > Remove deprecated EOS configs > - > > Key: KAFKA-12689 > URL: https://issues.apache.org/jira/browse/KAFKA-12689 > Project: Kafka > Issue Type: Task > Components: streams >Reporter: A. Sophie Blee-Goldman >Assignee: loboxu >Priority: Blocker > Fix For: 4.0.0 > > > In > [KIP-732|https://cwiki.apache.org/confluence/display/KAFKA/KIP-732%3A+Deprecate+eos-alpha+and+replace+eos-beta+with+eos-v2] > we deprecated the EXACTLY_ONCE and EXACTLY_ONCE_BETA configs in > StreamsConfig, to be removed in 4.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-12935) Flaky Test RestoreIntegrationTest.shouldRecycleStateFromStandbyTaskPromotedToActiveTaskAndNotRestore
[ https://issues.apache.org/jira/browse/KAFKA-12935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361374#comment-17361374 ] A. Sophie Blee-Goldman commented on KAFKA-12935: Filed https://issues.apache.org/jira/browse/KAFKA-12936 > Flaky Test > RestoreIntegrationTest.shouldRecycleStateFromStandbyTaskPromotedToActiveTaskAndNotRestore > > > Key: KAFKA-12935 > URL: https://issues.apache.org/jira/browse/KAFKA-12935 > Project: Kafka > Issue Type: Test > Components: streams, unit tests >Reporter: Matthias J. Sax >Priority: Critical > Labels: flaky-test > > {quote}java.lang.AssertionError: Expected: <0L> but: was <5005L> at > org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20) at > org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6) at > org.apache.kafka.streams.integration.RestoreIntegrationTest.shouldRecycleStateFromStandbyTaskPromotedToActiveTaskAndNotRestore(RestoreIntegrationTest.java:374) > {quote} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-12936) In-memory stores are always restored from scratch after dropping out of the group
A. Sophie Blee-Goldman created KAFKA-12936: -- Summary: In-memory stores are always restored from scratch after dropping out of the group Key: KAFKA-12936 URL: https://issues.apache.org/jira/browse/KAFKA-12936 Project: Kafka Issue Type: Bug Components: streams Reporter: A. Sophie Blee-Goldman Whenever an in-memory store is closed, the actual store contents are garbage collected and the state will need to be restored from scratch if the task is reassigned and re-initialized. We introduced the recycling feature to prevent this from occurring when a task is transitioned from standby to active (or vice versa), but it's still possible for the in-memory state to be unnecessarily wiped out in the case the member has dropped out of the group. In this case, the onPartitionsLost callback is invoked, which will close all active tasks as dirty before the member rejoins the group. This means that all these tasks will need to be restored from scratch if they are reassigned back to this consumer. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-12935) Flaky Test RestoreIntegrationTest.shouldRecycleStateFromStandbyTaskPromotedToActiveTaskAndNotRestore
[ https://issues.apache.org/jira/browse/KAFKA-12935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361367#comment-17361367 ] A. Sophie Blee-Goldman commented on KAFKA-12935: Hm, actually, I guess that could be considered a bug in itself, or at least a flaw in the recycling feature – for persistent stores with ALOS, dropping out of the group only causes tasks to be closed dirty, it doesn't force them to be wiped out to restore from the changelog from scratch. But with in-memory stores, simply closing them is akin to physically wiping out the state directory for that task. Avoiding that was the basis for this recycling feature in the first place. This does kind of suck, but at least it should be a relatively rare event. I'm a bit worried about how much complexity it would introduce to the code to fix this "bug", but I'll at least file a ticket for it now and we can go from there > Flaky Test > RestoreIntegrationTest.shouldRecycleStateFromStandbyTaskPromotedToActiveTaskAndNotRestore > > > Key: KAFKA-12935 > URL: https://issues.apache.org/jira/browse/KAFKA-12935 > Project: Kafka > Issue Type: Test > Components: streams, unit tests >Reporter: Matthias J. Sax >Priority: Critical > Labels: flaky-test > > {quote}java.lang.AssertionError: Expected: <0L> but: was <5005L> at > org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20) at > org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6) at > org.apache.kafka.streams.integration.RestoreIntegrationTest.shouldRecycleStateFromStandbyTaskPromotedToActiveTaskAndNotRestore(RestoreIntegrationTest.java:374) > {quote} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-12935) Flaky Test RestoreIntegrationTest.shouldRecycleStateFromStandbyTaskPromotedToActiveTaskAndNotRestore
[ https://issues.apache.org/jira/browse/KAFKA-12935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361359#comment-17361359 ] A. Sophie Blee-Goldman commented on KAFKA-12935: Hm, yeah I think I've seen this fail once or twice before. I did look into it a bit a while back, and just could not figure out whether it was a possible bug or an issue with the test itself. My money's definitely on the latter, but it might be worth taking another look sometime if we have the chance. If there is a bug that this is uncovering, at least it would not be a correctness bug, only an annoyance in restoring when it's not necessary. I think it's more likely that the test is flaky because for example the consumer dropped out of the group and invoked onPartitionsLost, which would close the task as dirty and require restoring from the changelog > Flaky Test > RestoreIntegrationTest.shouldRecycleStateFromStandbyTaskPromotedToActiveTaskAndNotRestore > > > Key: KAFKA-12935 > URL: https://issues.apache.org/jira/browse/KAFKA-12935 > Project: Kafka > Issue Type: Test > Components: streams, unit tests >Reporter: Matthias J. Sax >Priority: Critical > Labels: flaky-test > > {quote}java.lang.AssertionError: Expected: <0L> but: was <5005L> at > org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20) at > org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6) at > org.apache.kafka.streams.integration.RestoreIntegrationTest.shouldRecycleStateFromStandbyTaskPromotedToActiveTaskAndNotRestore(RestoreIntegrationTest.java:374) > {quote} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-8940) Flaky Test SmokeTestDriverIntegrationTest.shouldWorkWithRebalance
[ https://issues.apache.org/jira/browse/KAFKA-8940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17361196#comment-17361196 ] A. Sophie Blee-Goldman commented on KAFKA-8940: --- Hey [~mjsax] [~josep.prat] (and others), if you read my last comment on this ticket it explains exactly why this test is failing. Luckily it has to do with only the test itself, as it's a bug in the assumptions for the generated input. It's just not necessarily a quick fix. Maybe we should @Ignore it for now, and then file a separate ticket to circle back and correct the assumptions in this test. > Flaky Test SmokeTestDriverIntegrationTest.shouldWorkWithRebalance > - > > Key: KAFKA-8940 > URL: https://issues.apache.org/jira/browse/KAFKA-8940 > Project: Kafka > Issue Type: Bug > Components: streams, unit tests >Reporter: Guozhang Wang >Priority: Major > Labels: flaky-test, newbie++ > > The test does not properly account for windowing. See this comment for full > details. > We can patch this test by fixing the timestamps of the input data to avoid > crossing over a window boundary, or account for this when verifying the > output. Since we have access to the input data it should be possible to > compute whether/when we do cross a window boundary, and adjust the expected > output accordingly -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (KAFKA-12920) Consumer's cooperative sticky assignor need to clear generation / assignment data upon `onPartitionsLost`
[ https://issues.apache.org/jira/browse/KAFKA-12920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman resolved KAFKA-12920. Resolution: Not A Bug > Consumer's cooperative sticky assignor need to clear generation / assignment > data upon `onPartitionsLost` > - > > Key: KAFKA-12920 > URL: https://issues.apache.org/jira/browse/KAFKA-12920 > Project: Kafka > Issue Type: Bug >Reporter: Guozhang Wang >Priority: Major > Labels: bug, consumer > > Consumer's cooperative-sticky assignor does not track the owned partitions > inside the assignor --- i.e. when it reset its state in event of > ``onPartitionsLost``, the ``memberAssignment`` and ``generation`` inside the > assignor would not be cleared. This would cause a member to join with empty > generation on the protocol while with non-empty user-data encoding the old > assignment still (and hence pass the validation check on broker side during > JoinGroup), and eventually cause a single partition to be assigned to > multiple consumers within a generation. > We should let the assignor to also clear its assignment/generation when > ``onPartitionsLost`` is triggered in order to avoid this scenario. > Note that 1) for the regular sticky assignor the generation would still have > an older value, and this would cause the previously owned partitions to be > discarded during the assignment, and 2) for Streams' sticky assignor, it’s > encoding would indeed be cleared along with ``onPartitionsLost``. Hence only > Consumer's cooperative-sticky assignor have this issue to solve. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-12920) Consumer's cooperative sticky assignor need to clear generation / assignment data upon `onPartitionsLost`
[ https://issues.apache.org/jira/browse/KAFKA-12920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17360480#comment-17360480 ] A. Sophie Blee-Goldman commented on KAFKA-12920: I believe we've found the real issue, and are just discussing what the correct behavior here should be. I'm going to close this ticket as 'Not a Bug' and file a separate issue with the problems we've found in the Consumer/AbstractCoordinator that have resulted in [this|https://issues.apache.org/jira/browse/KAFKA-12896] odd behavior we observed with the cooperative-sticky assignor > Consumer's cooperative sticky assignor need to clear generation / assignment > data upon `onPartitionsLost` > - > > Key: KAFKA-12920 > URL: https://issues.apache.org/jira/browse/KAFKA-12920 > Project: Kafka > Issue Type: Bug >Reporter: Guozhang Wang >Priority: Major > Labels: bug, consumer > > Consumer's cooperative-sticky assignor does not track the owned partitions > inside the assignor --- i.e. when it reset its state in event of > ``onPartitionsLost``, the ``memberAssignment`` and ``generation`` inside the > assignor would not be cleared. This would cause a member to join with empty > generation on the protocol while with non-empty user-data encoding the old > assignment still (and hence pass the validation check on broker side during > JoinGroup), and eventually cause a single partition to be assigned to > multiple consumers within a generation. > We should let the assignor to also clear its assignment/generation when > ``onPartitionsLost`` is triggered in order to avoid this scenario. > Note that 1) for the regular sticky assignor the generation would still have > an older value, and this would cause the previously owned partitions to be > discarded during the assignment, and 2) for Streams' sticky assignor, it’s > encoding would indeed be cleared along with ``onPartitionsLost``. Hence only > Consumer's cooperative-sticky assignor have this issue to solve. 
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (KAFKA-12925) prefixScan missing from intermediate interfaces
[ https://issues.apache.org/jira/browse/KAFKA-12925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17360472#comment-17360472 ] A. Sophie Blee-Goldman edited comment on KAFKA-12925 at 6/10/21, 12:51 AM: --- [~sagarrao] do you think you'll be able to get in a fix for this in the next few weeks? It would be good to have this working in full capacity by 3.0 (I wouldn't consider it a blocker for 3.0, but we have enough time before code freeze that this should not be an issue, provided someone can pick it up) was (Author: ableegoldman): [~sagarrao] do you think you'll be able to get in a fix for this in the next few weeks? It would be good to have this working in full capacity by 3.0 > prefixScan missing from intermediate interfaces > --- > > Key: KAFKA-12925 > URL: https://issues.apache.org/jira/browse/KAFKA-12925 > Project: Kafka > Issue Type: Bug > Components: streams >Affects Versions: 2.8.0 >Reporter: Michael Viamari >Priority: Major > Fix For: 3.0.0, 2.8.1 > > > [KIP-614|https://cwiki.apache.org/confluence/display/KAFKA/KIP-614%3A+Add+Prefix+Scan+support+for+State+Stores] > and [KAFKA-10648|https://issues.apache.org/jira/browse/KAFKA-10648] > introduced support for {{prefixScan}} to StateStores. > It seems that many of the intermediate {{StateStore}} interfaces are missing > a definition for {{prefixScan}}, and as such it is not accessible in all cases. > For example, when accessing the state stores through the processor context, > the {{KeyValueStoreReadWriteDecorator}} and associated interfaces do not > define {{prefixScan}} and it falls back to the default implementation in > {{KeyValueStore}}, which throws {{UnsupportedOperationException}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-12925) prefixScan missing from intermediate interfaces
[ https://issues.apache.org/jira/browse/KAFKA-12925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17360472#comment-17360472 ] A. Sophie Blee-Goldman commented on KAFKA-12925: [~sagarrao] do you think you'll be able to get in a fix for this in the next few weeks? It would be good to have this working in full capacity by 3.0 > prefixScan missing from intermediate interfaces > --- > > Key: KAFKA-12925 > URL: https://issues.apache.org/jira/browse/KAFKA-12925 > Project: Kafka > Issue Type: Bug > Components: streams >Affects Versions: 2.8.0 >Reporter: Michael Viamari >Priority: Major > > [KIP-614|https://cwiki.apache.org/confluence/display/KAFKA/KIP-614%3A+Add+Prefix+Scan+support+for+State+Stores] > and [KAFKA-10648|https://issues.apache.org/jira/browse/KAFKA-10648] > introduced support for {{prefixScan}} to StateStores. > It seems that many of the intermediate {{StateStore}} interfaces are missing > a definition for {{prefixScan}}, and as such it is not accessible in all cases. > For example, when accessing the state stores through the processor context, > the {{KeyValueStoreReadWriteDecorator}} and associated interfaces do not > define {{prefixScan}} and it falls back to the default implementation in > {{KeyValueStore}}, which throws {{UnsupportedOperationException}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
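The failure mode in this ticket — a throwing interface default inherited by a decorator that forgets to override it — can be reproduced in miniature. The names below are simplified stand-ins for the real KeyValueStore and KeyValueStoreReadWriteDecorator, not Kafka's actual API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// The intermediate interface falls back to a default that throws
interface SketchStore {
    String get(String key);

    default List<String> prefixScan(String prefix) {
        throw new UnsupportedOperationException("prefixScan not supported");
    }
}

// The underlying store actually supports prefixScan
class InnerStore implements SketchStore {
    private final TreeMap<String, String> data = new TreeMap<>();

    public void put(String k, String v) { data.put(k, v); }
    public String get(String k) { return data.get(k); }

    @Override
    public List<String> prefixScan(String prefix) {
        List<String> result = new ArrayList<>();
        // sorted map: keys with the prefix form a contiguous range
        for (Map.Entry<String, String> e : data.tailMap(prefix).entrySet()) {
            if (!e.getKey().startsWith(prefix)) break;
            result.add(e.getValue());
        }
        return result;
    }
}

// The bug: get() is forwarded, but prefixScan() inherits the throwing default
class BrokenDecorator implements SketchStore {
    final SketchStore inner;
    BrokenDecorator(SketchStore inner) { this.inner = inner; }
    public String get(String k) { return inner.get(k); }
}

// The fix: delegate prefixScan() to the wrapped store as well
class FixedDecorator extends BrokenDecorator {
    FixedDecorator(SketchStore inner) { super(inner); }

    @Override
    public List<String> prefixScan(String prefix) {
        return inner.prefixScan(prefix);
    }
}
```

This is why the exception only appears when the store is accessed through the processor context: direct access hits the concrete store's override, while the decorated path hits the interface default.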