[jira] [Updated] (KAFKA-12740) 3. Resume processing from last-cleared processor after soft crash
[ https://issues.apache.org/jira/browse/KAFKA-12740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

A. Sophie Blee-Goldman updated KAFKA-12740:
-------------------------------------------
Description:

Building off of that, we can go one step further and avoid duplicate work within the subtopology itself. Any time a record is partially processed through a subtopology before hitting an error, all of the processors up to that point will be applied again when the record is reprocessed. If we can keep track of how far along the subtopology a record was processed, then we can start reprocessing it from the last processor it had cleared before hitting an error. We’ll need to make sure to commit and/or flush everything up to that point in the subtopology as well.

This proposal is the most likely to benefit from letting a StreamThread recover after an unexpected exception rather than letting it die and starting up a new one, as we don’t need to worry about handing anything off from the dying thread to its replacement.

Note: we should consider whether we should only allow this (A) if we can be sure the task is re/still assigned to the same client (ie the user does not select SHUTDOWN_CLIENT), or (B) we allow it for ALOS and just skip this under EOS until we can implement (A), at which time we can re-enable it for EOS as well.

(The previous description was the same, minus the final "Note" paragraph.)

> Key: KAFKA-12740
> URL: https://issues.apache.org/jira/browse/KAFKA-12740
> Project: Kafka
> Issue Type: Sub-task
> Components: streams
> Reporter: A. Sophie Blee-Goldman
> Priority: Major

--
This message was sent by Atlassian Jira (v8.3.4#803005)
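The bookkeeping this ticket proposes can be sketched outside of Kafka Streams. The following is a minimal, hypothetical illustration — none of the names (`ResumableSubtopology`, `process`) are Kafka APIs, and the processors are simplified to string transforms. The task records how far a record got through the processor chain, so a retry after a soft crash resumes at the last-cleared processor instead of re-applying the whole subtopology:

```java
import java.util.List;
import java.util.function.UnaryOperator;

class ResumableSubtopology {
    private final List<UnaryOperator<String>> processors;
    private int lastCleared = 0;   // index of the next processor to apply
    private String pending = null; // value after the already-cleared processors

    ResumableSubtopology(List<UnaryOperator<String>> processors) {
        this.processors = processors;
    }

    // Runs the record through the remaining processors, tracking progress so
    // that a retry resumes from the last-cleared processor rather than
    // re-applying earlier processors to the original record.
    String process(String record) {
        if (pending == null) {
            pending = record; // fresh record: start from the first processor
        }
        while (lastCleared < processors.size()) {
            pending = processors.get(lastCleared).apply(pending); // may throw
            lastCleared++; // this processor is now "cleared"
        }
        String result = pending;
        pending = null;   // fully processed: reset for the next record
        lastCleared = 0;
        return result;
    }
}
```

Note that both the processor index and the intermediate value must survive the failure, which is why this fits best with a StreamThread that recovers in place: nothing has to be handed off to a replacement thread.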
[jira] [Created] (KAFKA-12746) Allow StreamThreads to recover from some exceptions without killing/replacing the thread
A. Sophie Blee-Goldman created KAFKA-12746:
-------------------------------------------
Summary: Allow StreamThreads to recover from some exceptions without killing/replacing the thread
Key: KAFKA-12746
URL: https://issues.apache.org/jira/browse/KAFKA-12746
Project: Kafka
Issue Type: Sub-task
Components: streams
Reporter: A. Sophie Blee-Goldman

As the title says, in some cases it should be possible to recover the StreamThread without killing and restarting it (it may in fact be _possible_ for all exception types, though in some cases it may be preferable to just let the thread die and start up a fresh one). This would greatly mitigate the impact of errors in the subtopology and allow us to retry with little overhead. It may also make it easier to improve the processing semantics in the face of exceptions.

This will also be useful if/when we implement more sophisticated task scheduling which can, for example, deprioritize tasks that are frequently experiencing errors without crashing the whole thread or dropping the task entirely.
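The recover-in-place idea can be illustrated with a generic sketch — this is not Kafka code, and the names (`RecoveringLoop`, `Cycle`, the `isRecoverable` policy) are invented for the example. The same thread simply retries its processing cycle when the exception is deemed recoverable, instead of dying and being replaced:

```java
class RecoveringLoop {
    interface Cycle { void run(); }

    // Runs one "processing cycle", retrying in place on recoverable
    // exceptions instead of killing the thread and starting a replacement.
    // Returns the number of attempts it took to succeed.
    static int runWithRecovery(Cycle cycle, int maxRetries) {
        int attempt = 0;
        while (true) {
            attempt++;
            try {
                cycle.run();
                return attempt;
            } catch (RuntimeException e) {
                if (attempt > maxRetries || !isRecoverable(e)) {
                    throw e; // fatal: fall back to killing/replacing the thread
                }
                // recoverable: the same thread retries, so there is no
                // state to hand off to a replacement thread
            }
        }
    }

    // Illustrative policy only: treat IllegalStateException as fatal.
    static boolean isRecoverable(RuntimeException e) {
        return !(e instanceof IllegalStateException);
    }
}
```

The interesting design question the ticket raises is precisely the `isRecoverable` policy: which exception types are safe to retry on the same thread, and which should still kill it.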
[jira] [Created] (KAFKA-12742) 5. Checkpoint all uncorrupted state stores within the subtopology
A. Sophie Blee-Goldman created KAFKA-12742:
-------------------------------------------
Summary: 5. Checkpoint all uncorrupted state stores within the subtopology
Key: KAFKA-12742
URL: https://issues.apache.org/jira/browse/KAFKA-12742
Project: Kafka
Issue Type: Sub-task
Components: streams
Reporter: A. Sophie Blee-Goldman

Once we have [KAFKA-12740|https://issues.apache.org/jira/browse/KAFKA-12740], we can close the loop on EOS by checkpointing not only those state stores attached to processors that the record has successfully passed, but also any remaining state stores further downstream in the subtopology that aren't connected to the processor where the error occurred.

At that point, outside of a hard crash (eg the process is killed) or dropping out of the group, we’ll only ever need to restore a state store from scratch if the exception came from the specific processor node it’s attached to. Which is pretty darn cool.
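The store-selection rule above reduces to a simple filter over the store-to-processor mapping. A hypothetical sketch (invented names, not Kafka Streams internals): given which processor node each store is attached to, every store except those attached to the failing node is safe to checkpoint:

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class SelectiveCheckpoint {
    // Given each store's attached processor node and the node where the
    // error occurred, returns the stores that are still safe to checkpoint.
    static Set<String> checkpointable(Map<String, String> storeToProcessor,
                                      String failedProcessor) {
        Set<String> safe = new TreeSet<>();
        for (Map.Entry<String, String> e : storeToProcessor.entrySet()) {
            if (!e.getValue().equals(failedProcessor)) {
                safe.add(e.getKey()); // not attached to the failing node
            }
        }
        return safe;
    }
}
```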
[jira] [Updated] (KAFKA-12740) 3. Resume processing from last-cleared processor after soft crash
[ https://issues.apache.org/jira/browse/KAFKA-12740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

A. Sophie Blee-Goldman updated KAFKA-12740:
-------------------------------------------
Description: added the sentence "We’ll need to make sure to commit and/or flush everything up to that point in the subtopology as well." The rest of the description is unchanged.

> Key: KAFKA-12740
> URL: https://issues.apache.org/jira/browse/KAFKA-12740
> Project: Kafka
> Issue Type: Sub-task
> Components: streams
> Reporter: A. Sophie Blee-Goldman
> Priority: Major
[jira] [Created] (KAFKA-12741) 4. Handle thread-wide errors
A. Sophie Blee-Goldman created KAFKA-12741:
-------------------------------------------
Summary: 4. Handle thread-wide errors
Key: KAFKA-12741
URL: https://issues.apache.org/jira/browse/KAFKA-12741
Project: Kafka
Issue Type: Sub-task
Components: streams
Reporter: A. Sophie Blee-Goldman

Although just implementing 1-3 will likely be a significant improvement, we will eventually want to tackle the case of an exception that affects all tasks on a thread. I think in this case we would just need to keep retrying the specific call that’s failing, whatever that may be. This would almost certainly require solution #3/4 as a prerequisite, as we would need to keep retrying on that thread’s Producer.

Of course, with KIP-572 we’ll already retry most kinds of errors on the Producer, if not all. So this would most likely only apply to a few kinds of exceptions, such as Consumer calls that fail for reasons besides having dropped out of the group.
[jira] [Updated] (KAFKA-12739) 2. Commit any cleanly-processed records within a corrupted task
[ https://issues.apache.org/jira/browse/KAFKA-12739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

A. Sophie Blee-Goldman updated KAFKA-12739:
-------------------------------------------
Description:

Within a task, there will typically be a number of records that have been successfully processed through the subtopology but not yet committed. If the next record to be picked up hits an unexpected exception, we’ll dirty-close the entire task and essentially throw away all the work we did on those previous records. We should be able to drop only the corrupted record and just commit the offsets up to that point.

Again, for some exceptions such as de/serialization or user-code errors, this can be straightforward as the thread/task is otherwise in a healthy state. Other cases, such as an error in the Producer, will need to be tackled separately, since a Producer error cannot be isolated to a single task.

The challenge here will be in handling records sent to the changelog while processing the record that hits an error: we may need to buffer those records so they aren’t sent to the RecordCollector until the record has been fully processed, otherwise they will be committed and break EOS semantics (unless we can immediately implement [KAFKA-12740|https://issues.apache.org/jira/browse/KAFKA-12740]).

(The previous description was the same, minus the final paragraph about buffering changelog records.)

> Key: KAFKA-12739
> URL: https://issues.apache.org/jira/browse/KAFKA-12739
> Project: Kafka
> Issue Type: Sub-task
> Components: streams
> Reporter: A. Sophie Blee-Goldman
> Priority: Major
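The changelog-buffering challenge described above can be sketched with a toy collector. This is an illustrative model only — `BufferingCollector` and its methods are invented, and the real RecordCollector works with producer records, not strings. Writes produced while a record is in flight are held back until the record clears the subtopology, so a corrupted record never contributes committed changelog entries:

```java
import java.util.ArrayList;
import java.util.List;

class BufferingCollector {
    private final List<String> buffered = new ArrayList<>(); // in-flight record's writes
    private final List<String> sent = new ArrayList<>();     // stand-in for the RecordCollector

    // Changelog writes made while processing a record are buffered,
    // not yet handed to the collector (and so never committed).
    void sendToChangelog(String changelogRecord) {
        buffered.add(changelogRecord);
    }

    // The record cleared the subtopology: release its buffered writes.
    void recordProcessed() {
        sent.addAll(buffered);
        buffered.clear();
    }

    // The record hit an error: discard its writes so EOS is preserved.
    void recordFailed() {
        buffered.clear();
    }

    List<String> sentRecords() {
        return sent;
    }
}
```

Under KAFKA-12740's resume-from-last-cleared-processor approach, this buffering could be relaxed, since partially applied work would no longer be thrown away and redone.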
[jira] [Created] (KAFKA-12740) 3. Resume processing from last-cleared processor after soft crash
A. Sophie Blee-Goldman created KAFKA-12740:
-------------------------------------------
Summary: 3. Resume processing from last-cleared processor after soft crash
Key: KAFKA-12740
URL: https://issues.apache.org/jira/browse/KAFKA-12740
Project: Kafka
Issue Type: Sub-task
Components: streams
Reporter: A. Sophie Blee-Goldman

Building off of that, we can go one step further and avoid duplicate work within the subtopology itself. Any time a record is partially processed through a subtopology before hitting an error, all of the processors up to that point will be applied again when the record is reprocessed. If we can keep track of how far along the subtopology a record was processed, then we can start reprocessing it from the last processor it had cleared before hitting an error.

This proposal is the most likely to benefit from letting a StreamThread recover after an unexpected exception rather than letting it die and starting up a new one, as we don’t need to worry about handing anything off from the dying thread to its replacement.
[jira] [Created] (KAFKA-12739) 2. Commit any cleanly-processed records within a corrupted task
A. Sophie Blee-Goldman created KAFKA-12739:
-------------------------------------------
Summary: 2. Commit any cleanly-processed records within a corrupted task
Key: KAFKA-12739
URL: https://issues.apache.org/jira/browse/KAFKA-12739
Project: Kafka
Issue Type: Sub-task
Components: streams
Reporter: A. Sophie Blee-Goldman

Within a task, there will typically be a number of records that have been successfully processed through the subtopology but not yet committed. If the next record to be picked up hits an unexpected exception, we’ll dirty-close the entire task and essentially throw away all the work we did on those previous records. We should be able to drop only the corrupted record and just commit the offsets up to that point.

Again, for some exceptions such as de/serialization or user-code errors, this can be straightforward as the thread/task is otherwise in a healthy state. Other cases, such as an error in the Producer, will need to be tackled separately, since a Producer error cannot be isolated to a single task.
[jira] [Updated] (KAFKA-12737) 1. Commit all healthy tasks after a task-specific error for better task isolation
[ https://issues.apache.org/jira/browse/KAFKA-12737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

A. Sophie Blee-Goldman updated KAFKA-12737:
-------------------------------------------
Summary: 1. Commit all healthy tasks after a task-specific error for better task isolation
(was: Commit all healthy tasks after a task-specific error for better task isolation and reduced overcounting under ALOS)

> Key: KAFKA-12737
> URL: https://issues.apache.org/jira/browse/KAFKA-12737
> Project: Kafka
> Issue Type: Sub-task
> Components: streams
> Reporter: A. Sophie Blee-Goldman
> Priority: Major
[jira] [Updated] (KAFKA-12737) Commit all healthy tasks after a task-specific error for better task isolation and reduced overcounting under ALOS
[ https://issues.apache.org/jira/browse/KAFKA-12737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

A. Sophie Blee-Goldman updated KAFKA-12737:
-------------------------------------------
Parent: KAFKA-12738
Issue Type: Sub-task (was: Improvement)

> Key: KAFKA-12737
> URL: https://issues.apache.org/jira/browse/KAFKA-12737
> Project: Kafka
> Issue Type: Sub-task
> Components: streams
> Reporter: A. Sophie Blee-Goldman
> Priority: Major
[jira] [Created] (KAFKA-12738) Improved error handling for better at-least-once semantics and faster EOS
A. Sophie Blee-Goldman created KAFKA-12738:
-------------------------------------------
Summary: Improved error handling for better at-least-once semantics and faster EOS
Key: KAFKA-12738
URL: https://issues.apache.org/jira/browse/KAFKA-12738
Project: Kafka
Issue Type: Improvement
Components: streams
Reporter: A. Sophie Blee-Goldman

Umbrella ticket for various ideas I had to improve the error-handling behavior of Streams.
[jira] [Created] (KAFKA-12737) Commit all healthy tasks after a task-specific error for better task isolation and reduced overcounting under ALOS
A. Sophie Blee-Goldman created KAFKA-12737:
-------------------------------------------
Summary: Commit all healthy tasks after a task-specific error for better task isolation and reduced overcounting under ALOS
Key: KAFKA-12737
URL: https://issues.apache.org/jira/browse/KAFKA-12737
Project: Kafka
Issue Type: Improvement
Components: streams
Reporter: A. Sophie Blee-Goldman

At the moment, any time we hit the exception handler, an unclean shutdown will be triggered on that thread, which means no tasks will be committed. For certain kinds of exceptions this is unavoidable: for example, if the consumer has dropped out of the group then by definition it can’t commit during shutdown, and the task will have already been reassigned to another StreamThread.

However, there are many common scenarios in which we can (and should) attempt to commit all the tasks which are in a clean state, i.e. every task except the one being processed when the exception occurred. Good examples are de/serialization or user-code errors, as well as exceptions that occur during an operation like closing or suspending a particular task. In all those cases, there’s no need to throw away all the progress made by the unaffected tasks that just happened to be assigned to the same StreamThread.
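The commit-time selection the ticket proposes is a one-liner once framed this way. A minimal illustrative sketch (invented names, using task-id strings in place of real task handles): during the error-triggered shutdown, every assigned task except the one that hit the exception is still eligible to commit:

```java
import java.util.ArrayList;
import java.util.List;

class HealthyTaskCommit {
    // On a task-specific error, return the tasks whose offsets should still
    // be committed: everything except the task that was being processed
    // when the exception occurred.
    static List<String> tasksToCommit(List<String> assignedTasks, String failedTask) {
        List<String> healthy = new ArrayList<>(assignedTasks);
        healthy.remove(failedTask); // only the failed task is dirty-closed
        return healthy;
    }
}
```

The hard part in practice is not the selection but guaranteeing that the "healthy" tasks really are in a committable state at that point, which is why consumer-group-membership errors are excluded.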
[jira] [Commented] (KAFKA-12718) SessionWindows are closed too early
[ https://issues.apache.org/jira/browse/KAFKA-12718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17336997#comment-17336997 ]

A. Sophie Blee-Goldman commented on KAFKA-12718:
------------------------------------------------
{quote}I was just looking into `suppress()` implementation, that obviously does not know anything about semantics of upstream window definitions{quote}

[~mjsax] can you elaborate? Suppression does in fact search the processor graph to find the grace period of the windowed operator upstream of the suppression. It's called GraphGraceSearchUtil or something.

> SessionWindows are closed too early
> -----------------------------------
>
> Key: KAFKA-12718
> URL: https://issues.apache.org/jira/browse/KAFKA-12718
> Project: Kafka
> Issue Type: Bug
> Components: streams
> Reporter: Matthias J. Sax
> Priority: Major
> Labels: beginner, easy-fix, newbie
> Fix For: 3.0.0
>
> SessionWindows are defined based on a {{gap}} parameter, and also support an additional {{grace-period}} configuration to handle out-of-order data.
> To incorporate the session gap, a session window should only be closed at {{window-end + gap}}, and to incorporate the grace period, the close time should be pushed out further to {{window-end + gap + grace}}. However, atm we compute the window close time as {{window-end + grace}}, omitting the {{gap}} parameter.
> Because the default grace period is 24h, most users might not notice this issue. Even if they set a grace period explicitly (eg, when using suppress()), they would most likely set a grace period larger than the gap time, not hitting the issue (or maybe only realizing it when inspecting the behavior closely). However, if a user wants to disable the grace period and sets it to zero (or any other value smaller than the gap time), sessions might be closed too early and the user might notice.
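The arithmetic in the quoted ticket is easy to check side by side. These two methods are purely illustrative (not Kafka Streams code): one computes the close time the ticket says is intended, the other the behavior it reports, and with grace set to zero the difference is exactly the session gap:

```java
class SessionWindowCloseTime {
    // Intended behavior per the ticket: a session can only be closed once no
    // new record can extend it (gap) and late records are no longer admitted
    // (grace): window-end + gap + grace.
    static long fixedCloseTime(long windowEnd, long gapMs, long graceMs) {
        return windowEnd + gapMs + graceMs;
    }

    // Behavior the ticket reports: the gap is omitted, so with a small grace
    // period the session is closed too early: window-end + grace.
    static long buggyCloseTime(long windowEnd, long gapMs, long graceMs) {
        return windowEnd + graceMs;
    }
}
```

With a gap of 300 ms and grace disabled (0), the session should stay open until 300 ms after the window end, but the reported behavior closes it immediately at the window end — the scenario where users would notice.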
[jira] [Resolved] (KAFKA-6409) LogRecoveryTest (testHWCheckpointWithFailuresSingleLogSegment) is flaky
[ https://issues.apache.org/jira/browse/KAFKA-6409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

A. Sophie Blee-Goldman resolved KAFKA-6409.
-------------------------------------------
Resolution: Duplicate

> LogRecoveryTest (testHWCheckpointWithFailuresSingleLogSegment) is flaky
> -----------------------------------------------------------------------
>
> Key: KAFKA-6409
> URL: https://issues.apache.org/jira/browse/KAFKA-6409
> Project: Kafka
> Issue Type: Bug
> Components: log
> Reporter: Wladimir Schmidt
> Priority: Major
>
> In the LogRecoveryTest, the test named testHWCheckpointWithFailuresSingleLogSegment is not stable: sometimes it passes, sometimes it does not.
> Scala 2.12, JDK9
> java.lang.AssertionError: Timing out after 3 ms since a new leader that is different from 1 was not elected for partition new-topic-0, leader is Some(1)
> at kafka.utils.TestUtils$.fail(TestUtils.scala:351)
> at kafka.utils.TestUtils$.$anonfun$waitUntilLeaderIsElectedOrChanged$8(TestUtils.scala:828)
> at scala.Option.getOrElse(Option.scala:121)
> at kafka.utils.TestUtils$.waitUntilLeaderIsElectedOrChanged(TestUtils.scala:818)
> at kafka.server.LogRecoveryTest.testHWCheckpointWithFailuresSingleLogSegment(LogRecoveryTest.scala:152)
> ... (remaining frames are JUnit and Gradle test-runner internals)
[jira] [Commented] (KAFKA-5492) LogRecoveryTest.testHWCheckpointWithFailuresSingleLogSegment transient failure
[ https://issues.apache.org/jira/browse/KAFKA-5492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17335786#comment-17335786 ]

A. Sophie Blee-Goldman commented on KAFKA-5492:
-----------------------------------------------
Failed again:

Stacktrace:
org.opentest4j.AssertionFailedError: Server 1 is not able to join the ISR after restart
at org.junit.jupiter.api.AssertionUtils.fail(AssertionUtils.java:39)
at org.junit.jupiter.api.Assertions.fail(Assertions.java:117)
at kafka.server.LogRecoveryTest.testHWCheckpointWithFailuresSingleLogSegment(LogRecoveryTest.scala:152)

https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-10608/2/testReport/junit/kafka.server/LogRecoveryTest/Build___JDK_11_and_Scala_2_13___testHWCheckpointWithFailuresSingleLogSegment__/

> LogRecoveryTest.testHWCheckpointWithFailuresSingleLogSegment transient failure
> ------------------------------------------------------------------------------
>
> Key: KAFKA-5492
> URL: https://issues.apache.org/jira/browse/KAFKA-5492
> Project: Kafka
> Issue Type: Sub-task
> Components: unit tests
> Reporter: Jason Gustafson
> Priority: Major
>
> {code}
> java.lang.AssertionError: Timing out after 3 ms since a new leader that is different from 1 was not elected for partition new-topic-0, leader is Some(1)
> at kafka.utils.TestUtils$.fail(TestUtils.scala:333)
> at kafka.utils.TestUtils$.$anonfun$waitUntilLeaderIsElectedOrChanged$8(TestUtils.scala:808)
> at scala.Option.getOrElse(Option.scala:121)
> at kafka.utils.TestUtils$.waitUntilLeaderIsElectedOrChanged(TestUtils.scala:798)
> at kafka.server.LogRecoveryTest.testHWCheckpointWithFailuresSingleLogSegment(LogRecoveryTest.scala:152)
> {code}
[jira] [Commented] (KAFKA-12629) Flaky Test RaftClusterTest
[ https://issues.apache.org/jira/browse/KAFKA-12629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17335783#comment-17335783 ]

A. Sophie Blee-Goldman commented on KAFKA-12629:
------------------------------------------------
This test has multiple failures on almost every PR build I see lately (and I would say at least one failure on _every_ PR)...

> Flaky Test RaftClusterTest
> --------------------------
>
> Key: KAFKA-12629
> URL: https://issues.apache.org/jira/browse/KAFKA-12629
> Project: Kafka
> Issue Type: Test
> Components: core, unit tests
> Reporter: Matthias J. Sax
> Assignee: Luke Chen
> Priority: Critical
> Labels: flaky-test
>
> {quote}
> java.util.concurrent.ExecutionException: java.lang.ClassNotFoundException: org.apache.kafka.controller.NoOpSnapshotWriterBuilder
> at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
> at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
> at kafka.testkit.KafkaClusterTestKit.startup(KafkaClusterTestKit.java:364)
> at kafka.server.RaftClusterTest.testCreateClusterAndCreateAndManyTopicsWithManyPartitions(RaftClusterTest.scala:181)
> {quote}
[jira] [Commented] (KAFKA-7964) Flaky Test ConsumerBounceTest#testConsumerReceivesFatalExceptionWhenGroupPassesMaxSize
[ https://issues.apache.org/jira/browse/KAFKA-7964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17335780#comment-17335780 ] A. Sophie Blee-Goldman commented on KAFKA-7964: --- Failed again, with a different error this time. Seems to be from the test setup and not the test itself: Stacktrace org.apache.kafka.common.KafkaException: Socket server failed to bind to localhost:33521: Address already in use. at kafka.network.Acceptor.openServerSocket(SocketServer.scala:667) at kafka.network.Acceptor.(SocketServer.scala:560) at kafka.network.SocketServer.createAcceptor(SocketServer.scala:288) at kafka.network.SocketServer.$anonfun$createDataPlaneAcceptorsAndProcessors$1(SocketServer.scala:261) at kafka.network.SocketServer.$anonfun$createDataPlaneAcceptorsAndProcessors$1$adapted(SocketServer.scala:259) at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:563) at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:561) at scala.collection.AbstractIterable.foreach(Iterable.scala:919) at kafka.network.SocketServer.createDataPlaneAcceptorsAndProcessors(SocketServer.scala:259) at kafka.network.SocketServer.startup(SocketServer.scala:131) at kafka.server.KafkaServer.startup(KafkaServer.scala:290) at kafka.utils.TestUtils$.createServer(TestUtils.scala:166) at kafka.integration.KafkaServerTestHarness.$anonfun$setUp$1(KafkaServerTestHarness.scala:107) at scala.collection.immutable.Vector.foreach(Vector.scala:1856) at kafka.integration.KafkaServerTestHarness.setUp(KafkaServerTestHarness.scala:102) at kafka.api.IntegrationTestHarness.doSetup(IntegrationTestHarness.scala:93) at kafka.api.IntegrationTestHarness.setUp(IntegrationTestHarness.scala:84) https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-10608/2/testReport/junit/kafka.api/ConsumerBounceTest/Build___JDK_11_and_Scala_2_13___testConsumerReceivesFatalExceptionWhenGroupPassesMaxSize__/ > Flaky Test > ConsumerBounceTest#testConsumerReceivesFatalExceptionWhenGroupPassesMaxSize > 
-- > > Key: KAFKA-7964 > URL: https://issues.apache.org/jira/browse/KAFKA-7964 > Project: Kafka > Issue Type: Bug > Components: clients, consumer, unit tests >Affects Versions: 2.2.0 >Reporter: Matthias J. Sax >Priority: Critical > Labels: flaky-test > > To get stable nightly builds for `2.2` release, I create tickets for all > observed test failures. > [https://jenkins.confluent.io/job/apache-kafka-test/job/2.2/21/] > {quote}java.lang.AssertionError: expected:<100> but was:<0> at > org.junit.Assert.fail(Assert.java:88) at > org.junit.Assert.failNotEquals(Assert.java:834) at > org.junit.Assert.assertEquals(Assert.java:645) at > org.junit.Assert.assertEquals(Assert.java:631) at > kafka.api.ConsumerBounceTest.receiveExactRecords(ConsumerBounceTest.scala:551) > at > kafka.api.ConsumerBounceTest.$anonfun$testConsumerReceivesFatalExceptionWhenGroupPassesMaxSize$2(ConsumerBounceTest.scala:409) > at > kafka.api.ConsumerBounceTest.$anonfun$testConsumerReceivesFatalExceptionWhenGroupPassesMaxSize$2$adapted(ConsumerBounceTest.scala:408) > at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at > kafka.api.ConsumerBounceTest.testConsumerReceivesFatalExceptionWhenGroupPassesMaxSize(ConsumerBounceTest.scala:408){quote} -- This message was sent by Atlassian Jira (v8.3.4#803005)
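The "Address already in use" failure above is a classic test-harness race: a port chosen ahead of time can be grabbed by another process before the broker binds it. A common mitigation, sketched here in general form (this is not Kafka's actual test code), is to bind to port 0 so the OS assigns a currently free ephemeral port:

```java
import java.io.IOException;
import java.net.ServerSocket;

public class EphemeralPort {
    // Bind to port 0 so the OS assigns a currently free ephemeral port.
    // Returns -1 if no socket could be opened at all.
    static int findFreePort() {
        try (ServerSocket socket = new ServerSocket(0)) {
            return socket.getLocalPort();
        } catch (IOException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println("allocated ephemeral port: " + findFreePort());
    }
}
```

Note the remaining race: the probed port can be reclaimed between closing this probe socket and the real bind, so having the server itself bind to port 0 directly is the more robust fix where the harness allows it.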
[jira] [Commented] (KAFKA-9295) KTableKTableForeignKeyInnerJoinMultiIntegrationTest#shouldInnerJoinMultiPartitionQueryable
[ https://issues.apache.org/jira/browse/KAFKA-9295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17335773#comment-17335773 ] A. Sophie Blee-Goldman commented on KAFKA-9295: --- Failed again, same stacktrace. I take back what I said in the earlier comment, it is actually failing semi-frequently now https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-10608/2/testReport/junit/org.apache.kafka.streams.integration/KTableKTableForeignKeyInnerJoinMultiIntegrationTest/Build___JDK_15_and_Scala_2_13___shouldInnerJoinMultiPartitionQueryable_2/ > KTableKTableForeignKeyInnerJoinMultiIntegrationTest#shouldInnerJoinMultiPartitionQueryable > -- > > Key: KAFKA-9295 > URL: https://issues.apache.org/jira/browse/KAFKA-9295 > Project: Kafka > Issue Type: Bug > Components: streams, unit tests >Affects Versions: 2.4.0, 2.6.0 >Reporter: Matthias J. Sax >Assignee: Luke Chen >Priority: Critical > Labels: flaky-test > Fix For: 3.0.0 > > > [https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/27106/testReport/junit/org.apache.kafka.streams.integration/KTableKTableForeignKeyInnerJoinMultiIntegrationTest/shouldInnerJoinMultiPartitionQueryable/] > {quote}java.lang.AssertionError: Did not receive all 1 records from topic > output- within 6 ms Expected: is a value equal to or greater than <1> > but: <0> was less than <1> at > org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18) at > org.apache.kafka.streams.integration.utils.IntegrationTestUtils.lambda$waitUntilMinKeyValueRecordsReceived$1(IntegrationTestUtils.java:515) > at > org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:417) > at > org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:385) > at > org.apache.kafka.streams.integration.utils.IntegrationTestUtils.waitUntilMinKeyValueRecordsReceived(IntegrationTestUtils.java:511) > at > 
org.apache.kafka.streams.integration.utils.IntegrationTestUtils.waitUntilMinKeyValueRecordsReceived(IntegrationTestUtils.java:489) > at > org.apache.kafka.streams.integration.KTableKTableForeignKeyInnerJoinMultiIntegrationTest.verifyKTableKTableJoin(KTableKTableForeignKeyInnerJoinMultiIntegrationTest.java:200) > at > org.apache.kafka.streams.integration.KTableKTableForeignKeyInnerJoinMultiIntegrationTest.shouldInnerJoinMultiPartitionQueryable(KTableKTableForeignKeyInnerJoinMultiIntegrationTest.java:183){quote} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (KAFKA-6655) CleanupThread: Failed to lock the state directory due to an unexpected exception (Windows OS)
[ https://issues.apache.org/jira/browse/KAFKA-6655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman resolved KAFKA-6655. --- Resolution: Fixed This should be fixed: a number of improvements were made around the locking a few versions ago, and we actually took it a step further and removed this kind of file-based locking for task directories entirely as of 3.0 > CleanupThread: Failed to lock the state directory due to an unexpected > exception (Windows OS) > - > > Key: KAFKA-6655 > URL: https://issues.apache.org/jira/browse/KAFKA-6655 > Project: Kafka > Issue Type: Bug > Components: streams >Affects Versions: 1.0.1 > Environment: Windows Operating System >Reporter: Srini >Priority: Major > > This issue happens on Windows OS. The code works fine on Linux. This ticket is > related to KAFKA-6647, which also reports locking issues on Windows. However, > there, the issue occurs if the user calls KafkaStreams#cleanUp() explicitly, > while this ticket is related to the KafkaStreams background CleanupThread (note > that both use `StateDirectory.cleanRemovedTasks`, but behavior is still > slightly different as different parameters are passed into the method). 
> {quote}[CleanupThread] Failed to lock the state directory due to an > unexpected exceptionjava.nio.file.DirectoryNotEmptyException: > \tmp\kafka-streams\srini-20171208\0_9 > at > sun.nio.fs.WindowsFileSystemProvider.implDelete(WindowsFileSystemProvider.java:266) > at > sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:103) > at java.nio.file.Files.delete(Files.java:1126) > at org.apache.kafka.common.utils.Utils$1.postVisitDirectory(Utils.java:636) > at org.apache.kafka.common.utils.Utils$1.postVisitDirectory(Utils.java:619) > at java.nio.file.Files.walkFileTree(Files.java:2688) > at java.nio.file.Files.walkFileTree(Files.java:2742) > at org.apache.kafka.common.utils.Utils.delete(Utils.java:619) > at > org.apache.kafka.streams.processor.internals.StateDirectory.cleanRemovedTasks(StateDirectory.java:245) > at org.apache.kafka.streams.KafkaStreams$3.run(KafkaStreams.java:761) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {quote} -- This message was sent by Atlassian Jira (v8.3.4#803005)
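The DirectoryNotEmptyException in the stack trace above is typical of Windows, where files with still-open handles cannot be deleted immediately. One common mitigation pattern, shown here as an illustrative sketch rather than the actual StateDirectory implementation, is to retry the recursive delete with a short backoff:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

public class RetryingDelete {
    // Recursively delete a directory, retrying to tolerate transient
    // failures such as DirectoryNotEmptyException on Windows, where
    // still-open file handles can delay deletion.
    static boolean deleteWithRetries(Path dir, int attempts) {
        for (int i = 0; i < attempts; i++) {
            try (Stream<Path> paths = Files.walk(dir)) {
                // Delete children before parents by visiting paths deepest-first.
                paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                    try {
                        Files.deleteIfExists(p);
                    } catch (IOException ignored) {
                        // Leave it for the next attempt.
                    }
                });
            } catch (IOException ignored) {
                // Directory may already be gone, or the walk failed; re-check below.
            }
            if (!Files.exists(dir)) {
                return true;
            }
            try {
                Thread.sleep(100L * (i + 1)); // simple linear backoff
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return !Files.exists(dir);
    }

    // Small self-contained demonstration: create a fake task directory
    // and delete it with retries.
    static boolean demo() {
        try {
            Path dir = Files.createTempDirectory("state-dir");
            Files.createFile(dir.resolve("0_9"));
            return deleteWithRetries(dir, 3);
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("deleted: " + demo());
    }
}
```

Retrying only papers over the symptom, of course; removing the file-based locking entirely, as was done in 3.0, eliminates the contention at its source.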
[jira] [Commented] (KAFKA-9295) KTableKTableForeignKeyInnerJoinMultiIntegrationTest#shouldInnerJoinMultiPartitionQueryable
[ https://issues.apache.org/jira/browse/KAFKA-9295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334937#comment-17334937 ] A. Sophie Blee-Goldman commented on KAFKA-9295: --- It's failing on {noformat} startApplicationAndWaitUntilRunning(kafkaStreamsList, ofSeconds(60)); {noformat} At this point a session timeout seems unlikely, since [~showuon] observed a Streams instance dropping out on the heartbeat interval only a couple of times even when it failed, and all it has to do here is get to RUNNING once. It doesn't require that all KafkaStreams in the list get to RUNNING and then stay there, so all the instance has to do is start up and go through at least one successful rebalance in that time. There's nothing to restore so the transition to RUNNING should be immediate after the rebalance. Now technically 60s is a typical timeout for startApplicationAndWaitUntilRunning in the Streams integration tests, but the difference between this and other tests is that most have only one or two KafkaStreams to start up whereas this test has three. They're not started up and waited on sequentially so that shouldn't _really_ matter that much, but still it might just be that a longer timeout should be used in this case. I'm open to other theories however. Also note that we should soon have a larger default session timeout, so once Jason's KIP for that has been implemented we'll be able to get that improvement for free. Even if we think the session timeout is the problem with this test, it probably makes sense to just wait for that KIP rather than to hardcode in some special value. 
If it starts to fail very frequently we can reconsider, but I haven't observed it doing so since the last fix was merged > KTableKTableForeignKeyInnerJoinMultiIntegrationTest#shouldInnerJoinMultiPartitionQueryable > -- > > Key: KAFKA-9295 > URL: https://issues.apache.org/jira/browse/KAFKA-9295 > Project: Kafka > Issue Type: Bug > Components: streams, unit tests >Affects Versions: 2.4.0, 2.6.0 >Reporter: Matthias J. Sax >Assignee: Luke Chen >Priority: Critical > Labels: flaky-test > Fix For: 3.0.0 > > > [https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/27106/testReport/junit/org.apache.kafka.streams.integration/KTableKTableForeignKeyInnerJoinMultiIntegrationTest/shouldInnerJoinMultiPartitionQueryable/] > {quote}java.lang.AssertionError: Did not receive all 1 records from topic > output- within 6 ms Expected: is a value equal to or greater than <1> > but: <0> was less than <1> at > org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18) at > org.apache.kafka.streams.integration.utils.IntegrationTestUtils.lambda$waitUntilMinKeyValueRecordsReceived$1(IntegrationTestUtils.java:515) > at > org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:417) > at > org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:385) > at > org.apache.kafka.streams.integration.utils.IntegrationTestUtils.waitUntilMinKeyValueRecordsReceived(IntegrationTestUtils.java:511) > at > org.apache.kafka.streams.integration.utils.IntegrationTestUtils.waitUntilMinKeyValueRecordsReceived(IntegrationTestUtils.java:489) > at > org.apache.kafka.streams.integration.KTableKTableForeignKeyInnerJoinMultiIntegrationTest.verifyKTableKTableJoin(KTableKTableForeignKeyInnerJoinMultiIntegrationTest.java:200) > at > org.apache.kafka.streams.integration.KTableKTableForeignKeyInnerJoinMultiIntegrationTest.shouldInnerJoinMultiPartitionQueryable(KTableKTableForeignKeyInnerJoinMultiIntegrationTest.java:183){quote} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
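For reference, the shared-deadline behavior discussed above can be sketched generically. This is a simplified stand-in for startApplicationAndWaitUntilRunning, with hypothetical boolean conditions in place of real KafkaStreams state listeners:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.function.BooleanSupplier;

public class WaitForAll {
    // Poll every condition until all report true, sharing one overall
    // deadline rather than giving each instance its own timeout.
    static boolean waitUntilAll(List<BooleanSupplier> conditions, Duration timeout) {
        Instant deadline = Instant.now().plus(timeout);
        for (BooleanSupplier condition : conditions) {
            while (!condition.getAsBoolean()) {
                if (Instant.now().isAfter(deadline)) {
                    return false; // e.g. an instance stuck in REBALANCING
                }
                try {
                    Thread.sleep(10);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return true;
    }
}
```

Because the deadline is shared, three slow-starting instances draw on the same 60s budget that a single instance would get, which is why a longer timeout for this particular test is at least a plausible remedy.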
[jira] [Reopened] (KAFKA-9295) KTableKTableForeignKeyInnerJoinMultiIntegrationTest#shouldInnerJoinMultiPartitionQueryable
[ https://issues.apache.org/jira/browse/KAFKA-9295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman reopened KAFKA-9295: --- Still failing, saw this on both the Java 8 and Java 11 build of a PR: Stacktrace java.lang.AssertionError: Application did not reach a RUNNING state for all streams instances. Non-running instances: {org.apache.kafka.streams.KafkaStreams@cce6a1d=REBALANCING, org.apache.kafka.streams.KafkaStreams@45af6187=REBALANCING, org.apache.kafka.streams.KafkaStreams@5d8ba53=REBALANCING} at org.junit.Assert.fail(Assert.java:89) at org.apache.kafka.streams.integration.utils.IntegrationTestUtils.startApplicationAndWaitUntilRunning(IntegrationTestUtils.java:904) at org.apache.kafka.streams.integration.KTableKTableForeignKeyInnerJoinMultiIntegrationTest.verifyKTableKTableJoin(KTableKTableForeignKeyInnerJoinMultiIntegrationTest.java:197) at org.apache.kafka.streams.integration.KTableKTableForeignKeyInnerJoinMultiIntegrationTest.shouldInnerJoinMultiPartitionQueryable(KTableKTableForeignKeyInnerJoinMultiIntegrationTest.java:185) https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-10573/8/testReport/org.apache.kafka.streams.integration/KTableKTableForeignKeyInnerJoinMultiIntegrationTest/Build___JDK_8_and_Scala_2_12___shouldInnerJoinMultiPartitionQueryable/ > KTableKTableForeignKeyInnerJoinMultiIntegrationTest#shouldInnerJoinMultiPartitionQueryable > -- > > Key: KAFKA-9295 > URL: https://issues.apache.org/jira/browse/KAFKA-9295 > Project: Kafka > Issue Type: Bug > Components: streams, unit tests >Affects Versions: 2.4.0, 2.6.0 >Reporter: Matthias J. 
Sax >Assignee: Luke Chen >Priority: Critical > Labels: flaky-test > Fix For: 3.0.0 > > > [https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/27106/testReport/junit/org.apache.kafka.streams.integration/KTableKTableForeignKeyInnerJoinMultiIntegrationTest/shouldInnerJoinMultiPartitionQueryable/] > {quote}java.lang.AssertionError: Did not receive all 1 records from topic > output- within 6 ms Expected: is a value equal to or greater than <1> > but: <0> was less than <1> at > org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18) at > org.apache.kafka.streams.integration.utils.IntegrationTestUtils.lambda$waitUntilMinKeyValueRecordsReceived$1(IntegrationTestUtils.java:515) > at > org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:417) > at > org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:385) > at > org.apache.kafka.streams.integration.utils.IntegrationTestUtils.waitUntilMinKeyValueRecordsReceived(IntegrationTestUtils.java:511) > at > org.apache.kafka.streams.integration.utils.IntegrationTestUtils.waitUntilMinKeyValueRecordsReceived(IntegrationTestUtils.java:489) > at > org.apache.kafka.streams.integration.KTableKTableForeignKeyInnerJoinMultiIntegrationTest.verifyKTableKTableJoin(KTableKTableForeignKeyInnerJoinMultiIntegrationTest.java:200) > at > org.apache.kafka.streams.integration.KTableKTableForeignKeyInnerJoinMultiIntegrationTest.shouldInnerJoinMultiPartitionQueryable(KTableKTableForeignKeyInnerJoinMultiIntegrationTest.java:183){quote} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-12727) Flaky test ListConsumerGroupTest.testListConsumerGroupsWithStates
A. Sophie Blee-Goldman created KAFKA-12727: -- Summary: Flaky test ListConsumerGroupTest.testListConsumerGroupsWithStates Key: KAFKA-12727 URL: https://issues.apache.org/jira/browse/KAFKA-12727 Project: Kafka Issue Type: Bug Components: consumer Reporter: A. Sophie Blee-Goldman Stacktrace org.opentest4j.AssertionFailedError: Expected to show groups Set((groupId='simple-group', isSimpleConsumerGroup=true, state=Optional[Empty]), (groupId='test.group', isSimpleConsumerGroup=false, state=Optional[Stable])), but found Set((groupId='test.group', isSimpleConsumerGroup=false, state=Optional[Stable])) at org.junit.jupiter.api.AssertionUtils.fail(AssertionUtils.java:39) at org.junit.jupiter.api.Assertions.fail(Assertions.java:117) at kafka.admin.ListConsumerGroupTest.testListConsumerGroupsWithStates(ListConsumerGroupTest.scala:66) https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-10573/8/testReport/kafka.admin/ListConsumerGroupTest/Build___JDK_11_and_Scala_2_13___testListConsumerGroupsWithStates__/ -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-12666) Fix flaky kafka.server.RaftClusterTest.testCreateClusterAndCreateListDeleteTopic
[ https://issues.apache.org/jira/browse/KAFKA-12666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334205#comment-17334205 ] A. Sophie Blee-Goldman commented on KAFKA-12666: There are quite a few tickets for the (very frequently failing) RaftClusterTest issues, it seems we have [KAFKA-12629|https://issues.apache.org/jira/browse/KAFKA-12629] as an umbrella ticket so can we close this one as a duplicate to keep everything in one place? > Fix flaky > kafka.server.RaftClusterTest.testCreateClusterAndCreateListDeleteTopic > > > Key: KAFKA-12666 > URL: https://issues.apache.org/jira/browse/KAFKA-12666 > Project: Kafka > Issue Type: Test >Reporter: Bruno Cadonna >Priority: Major > > Found two similar failures of this test on a PR that was unrelated: > {code:java} > java.util.concurrent.ExecutionException: > org.apache.kafka.common.errors.TimeoutException: Call(callName=createTopics, > deadlineMs=1618341006330, tries=583, nextAllowedTryMs=1618341006437) timed > out at 1618341006337 after 583 attempt(s) > at > org.apache.kafka.common.internals.KafkaFutureImpl.wrapAndThrow(KafkaFutureImpl.java:45) > at > org.apache.kafka.common.internals.KafkaFutureImpl.access$000(KafkaFutureImpl.java:32) > at > org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:89) > at > org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:260) > at > kafka.server.RaftClusterTest.testCreateClusterAndCreateListDeleteTopic(RaftClusterTest.scala:94) > {code} > > {code:java} > java.util.concurrent.ExecutionException: > org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node > assignment. 
Call: createTopics > at > org.apache.kafka.common.internals.KafkaFutureImpl.wrapAndThrow(KafkaFutureImpl.java:45) > at > org.apache.kafka.common.internals.KafkaFutureImpl.access$000(KafkaFutureImpl.java:32) > at > org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:89) > at > org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:260) > at > kafka.server.RaftClusterTest.testCreateClusterAndCreateListDeleteTopic(RaftClusterTest.scala:94) > {code} > > [https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-10529/4/testReport/?cloudbees-analytics-link=scm-reporting%2Ftests%2Ffailed] > > Might be related to KAFKA-12561. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-12629) Flaky Test RaftClusterTest
[ https://issues.apache.org/jira/browse/KAFKA-12629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334174#comment-17334174 ] A. Sophie Blee-Goldman commented on KAFKA-12629: I'm seeing a lot of that exception as well, ie the {noformat} java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.UnknownServerException: The server experienced an unexpected error when processing the request. {noformat} Four different RaftClusterTest failures with this same error on https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-8923/6/testReport/?cloudbees-analytics-link=scm-reporting%2Ftests%2Ffailed > Flaky Test RaftClusterTest > -- > > Key: KAFKA-12629 > URL: https://issues.apache.org/jira/browse/KAFKA-12629 > Project: Kafka > Issue Type: Test > Components: core, unit tests >Reporter: Matthias J. Sax >Assignee: Luke Chen >Priority: Critical > Labels: flaky-test > > {quote} {{java.util.concurrent.ExecutionException: > java.lang.ClassNotFoundException: > org.apache.kafka.controller.NoOpSnapshotWriterBuilder > at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122) > at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191) > at > kafka.testkit.KafkaClusterTestKit.startup(KafkaClusterTestKit.java:364) > at > kafka.server.RaftClusterTest.testCreateClusterAndCreateAndManyTopicsWithManyPartitions(RaftClusterTest.scala:181)}}{quote} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (KAFKA-12666) Fix flaky kafka.server.RaftClusterTest.testCreateClusterAndCreateListDeleteTopic
[ https://issues.apache.org/jira/browse/KAFKA-12666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman resolved KAFKA-12666. Resolution: Duplicate > Fix flaky > kafka.server.RaftClusterTest.testCreateClusterAndCreateListDeleteTopic > > > Key: KAFKA-12666 > URL: https://issues.apache.org/jira/browse/KAFKA-12666 > Project: Kafka > Issue Type: Test >Reporter: Bruno Cadonna >Priority: Major > > Found two similar failures of this test on a PR that was unrelated: > {code:java} > java.util.concurrent.ExecutionException: > org.apache.kafka.common.errors.TimeoutException: Call(callName=createTopics, > deadlineMs=1618341006330, tries=583, nextAllowedTryMs=1618341006437) timed > out at 1618341006337 after 583 attempt(s) > at > org.apache.kafka.common.internals.KafkaFutureImpl.wrapAndThrow(KafkaFutureImpl.java:45) > at > org.apache.kafka.common.internals.KafkaFutureImpl.access$000(KafkaFutureImpl.java:32) > at > org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:89) > at > org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:260) > at > kafka.server.RaftClusterTest.testCreateClusterAndCreateListDeleteTopic(RaftClusterTest.scala:94) > {code} > > {code:java} > java.util.concurrent.ExecutionException: > org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node > assignment. 
Call: createTopics > at > org.apache.kafka.common.internals.KafkaFutureImpl.wrapAndThrow(KafkaFutureImpl.java:45) > at > org.apache.kafka.common.internals.KafkaFutureImpl.access$000(KafkaFutureImpl.java:32) > at > org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:89) > at > org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:260) > at > kafka.server.RaftClusterTest.testCreateClusterAndCreateListDeleteTopic(RaftClusterTest.scala:94) > {code} > > [https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-10529/4/testReport/?cloudbees-analytics-link=scm-reporting%2Ftests%2Ffailed] > > Might be related to KAFKA-12561. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-12721) Flaky Test ResetConsumerGroupOffsetTest.testResetOffsetsAllTopicsAllGroups
A. Sophie Blee-Goldman created KAFKA-12721: -- Summary: Flaky Test ResetConsumerGroupOffsetTest.testResetOffsetsAllTopicsAllGroups Key: KAFKA-12721 URL: https://issues.apache.org/jira/browse/KAFKA-12721 Project: Kafka Issue Type: Bug Components: consumer Reporter: A. Sophie Blee-Goldman https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-8923/6/testReport/kafka.admin/ResetConsumerGroupOffsetTest/Build___JDK_8_and_Scala_2_12___testResetOffsetsAllTopicsAllGroups__/ Stacktrace org.opentest4j.AssertionFailedError: Expected that consumer group has consumed all messages from topic/partition. Expected offset: 100. Actual offset: 0 at org.junit.jupiter.api.AssertionUtils.fail(AssertionUtils.java:39) at org.junit.jupiter.api.Assertions.fail(Assertions.java:117) at kafka.admin.ResetConsumerGroupOffsetTest.awaitConsumerProgress(ResetConsumerGroupOffsetTest.scala:455) at kafka.admin.ResetConsumerGroupOffsetTest.$anonfun$testResetOffsetsAllTopicsAllGroups$5(ResetConsumerGroupOffsetTest.scala:140) at kafka.admin.ResetConsumerGroupOffsetTest.$anonfun$testResetOffsetsAllTopicsAllGroups$5$adapted(ResetConsumerGroupOffsetTest.scala:137) at scala.collection.Iterator.foreach(Iterator.scala:943) at scala.collection.Iterator.foreach$(Iterator.scala:943) at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) at scala.collection.IterableLike.foreach(IterableLike.scala:74) at scala.collection.IterableLike.foreach$(IterableLike.scala:73) at scala.collection.AbstractIterable.foreach(Iterable.scala:56) at kafka.admin.ResetConsumerGroupOffsetTest.$anonfun$testResetOffsetsAllTopicsAllGroups$4(ResetConsumerGroupOffsetTest.scala:137) at kafka.admin.ResetConsumerGroupOffsetTest.$anonfun$testResetOffsetsAllTopicsAllGroups$4$adapted(ResetConsumerGroupOffsetTest.scala:136) at scala.collection.Iterator.foreach(Iterator.scala:943) at scala.collection.Iterator.foreach$(Iterator.scala:943) at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) at 
scala.collection.IterableLike.foreach(IterableLike.scala:74) at scala.collection.IterableLike.foreach$(IterableLike.scala:73) at scala.collection.AbstractIterable.foreach(Iterable.scala:56) at kafka.admin.ResetConsumerGroupOffsetTest.testResetOffsetsAllTopicsAllGroups(ResetConsumerGroupOffsetTest.scala:136) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-6435) Application Reset Tool might delete incorrect internal topics
[ https://issues.apache.org/jira/browse/KAFKA-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman updated KAFKA-6435: -- Fix Version/s: 3.0.0 > Application Reset Tool might delete incorrect internal topics > - > > Key: KAFKA-6435 > URL: https://issues.apache.org/jira/browse/KAFKA-6435 > Project: Kafka > Issue Type: Bug > Components: streams, tools >Affects Versions: 1.0.0 >Reporter: Matthias J. Sax >Assignee: Joel Wee >Priority: Major > Labels: bug, help-wanted, newbie > Fix For: 3.0.0 > > > The streams application reset tool deletes all topics that start with > {{-}}. > If people have two versions of the same application and name them {{"app"}} > and {{"app-v2"}}, resetting {{"app"}} would also delete the internal topics > of {{"app-v2"}}. > We either need to disallow the dash in the application ID, or improve the > topic identification logic in the reset tool to fix this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (KAFKA-6435) Application Reset Tool might delete incorrect internal topics
[ https://issues.apache.org/jira/browse/KAFKA-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman resolved KAFKA-6435. --- Resolution: Fixed > Application Reset Tool might delete incorrect internal topics > - > > Key: KAFKA-6435 > URL: https://issues.apache.org/jira/browse/KAFKA-6435 > Project: Kafka > Issue Type: Bug > Components: streams, tools >Affects Versions: 1.0.0 >Reporter: Matthias J. Sax >Assignee: Joel Wee >Priority: Major > Labels: bug, help-wanted, newbie > Fix For: 3.0.0 > > > The streams application reset tool deletes all topics that start with > {{-}}. > If people have two versions of the same application and name them {{"app"}} > and {{"app-v2"}}, resetting {{"app"}} would also delete the internal topics > of {{"app-v2"}}. > We either need to disallow the dash in the application ID, or improve the > topic identification logic in the reset tool to fix this. -- This message was sent by Atlassian Jira (v8.3.4#803005)
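The overlap described in the ticket comes down to simple prefix matching. A minimal sketch of the ambiguity (illustrative only, not the reset tool's actual code):

```java
public class ResetToolPrefixMatch {
    // Internal topics are identified by the "<applicationId>-" prefix.
    // Because application IDs may themselves contain dashes, the internal
    // topics of "app-v2" also carry the prefix of "app", so resetting
    // "app" would match (and delete) them too.
    static boolean matchesApplication(String topic, String applicationId) {
        return topic.startsWith(applicationId + "-");
    }

    public static void main(String[] args) {
        String topic = "app-v2-store-changelog";
        System.out.println("matches app-v2: " + matchesApplication(topic, "app-v2"));
        System.out.println("also matches app: " + matchesApplication(topic, "app"));
    }
}
```

Note that a suffix check (e.g. requiring -changelog or -repartition) would not resolve this, since app-v2's internal topics end the same way; hence the ticket's two options of disallowing dashes in the application ID or improving the identification logic outright.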
[jira] [Commented] (KAFKA-8147) Add changelog topic configuration to KTable suppress
[ https://issues.apache.org/jira/browse/KAFKA-8147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17332711#comment-17332711 ] A. Sophie Blee-Goldman commented on KAFKA-8147: --- [~philbour] that would be a bug; you should be able to set these configs in any order. Seems like BufferConfigInternal#emitEarlyWhenFull creates a new EagerBufferConfigImpl and passes the two original configs (maxRecords and maxBytes) into the constructor, but loses the logging configs at that point. Same thing for BufferConfigInternal#shutDownWhenFull. Looks like the PR for this feature just missed updating this; I notice that it did remember to add this parameter in the constructor calls inside EagerBufferConfigImpl and StrictBufferConfigImpl. That said, this looks like kind of an abuse of this pattern so I'm not surprised bugs slipped through. Maybe instead of just patching the current problem by adding this parameter to the constructor calls in BufferConfigInternal we can try to refactor things a bit so we aren't calling constructors all over the place and making things vulnerable to future changes. For example in Materialized all of the non-static .withXXX methods just set that parameter directly instead of creating a new Materialized object every time you set some configuration. But I'm sure there was a reason to do it this way initially... [~philbour] can you file a separate ticket for this? And would you be interested in submitting a PR to fix the bug you found? > Add changelog topic configuration to KTable suppress > > > Key: KAFKA-8147 > URL: https://issues.apache.org/jira/browse/KAFKA-8147 > Project: Kafka > Issue Type: Improvement > Components: streams >Affects Versions: 2.1.1 >Reporter: Maarten >Assignee: highluck >Priority: Minor > Labels: kip > Fix For: 2.6.0 > > > The streams DSL does not provide a way to configure the changelog topic > created by KTable.suppress. 
> From the perspective of an external user this could be implemented similar to > the configuration of aggregate + materialized, i.e., > {code:java} > changelogTopicConfigs = // Configs > materialized = Materialized.as(..).withLoggingEnabled(changelogTopicConfigs) > .. > KGroupedStream.aggregate(..,materialized) > {code} > [KIP-446: > https://cwiki.apache.org/confluence/display/KAFKA/KIP-446%3A+Add+changelog+topic+configuration+to+KTable+suppress|https://cwiki.apache.org/confluence/display/KAFKA/KIP-446%3A+Add+changelog+topic+configuration+to+KTable+suppress] -- This message was sent by Atlassian Jira (v8.3.4#803005)
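The constructor-forwarding bug described in the comment above is a common failure mode of the immutable-builder pattern. A minimal self-contained sketch (the names loosely mirror the Streams BufferConfig API, but this is not the actual implementation):

```java
import java.util.Collections;
import java.util.Map;

public class BufferConfigSketch {
    final long maxRecords;
    final Map<String, String> logConfig;

    BufferConfigSketch(long maxRecords, Map<String, String> logConfig) {
        this.maxRecords = maxRecords;
        this.logConfig = logConfig;
    }

    BufferConfigSketch withLoggingEnabled(Map<String, String> config) {
        return new BufferConfigSketch(maxRecords, config);
    }

    // Buggy variant: builds a fresh instance but forgets to forward the
    // logging configs, so calling it after withLoggingEnabled loses them.
    BufferConfigSketch emitEarlyWhenFullBuggy() {
        return new BufferConfigSketch(maxRecords, Collections.emptyMap());
    }

    // Fixed variant: forwards every field of the current instance.
    BufferConfigSketch emitEarlyWhenFullFixed() {
        return new BufferConfigSketch(maxRecords, logConfig);
    }

    public static void main(String[] args) {
        BufferConfigSketch base =
            new BufferConfigSketch(100, Map.of("cleanup.policy", "compact"));
        System.out.println("buggy keeps configs: "
            + !base.emitEarlyWhenFullBuggy().logConfig.isEmpty());
        System.out.println("fixed keeps configs: "
            + !base.emitEarlyWhenFullFixed().logConfig.isEmpty());
    }
}
```

The patch in the real code would similarly need to pass the logging config through each constructor call, though the refactor suggested in the comment would remove the constructor duplication that made the bug possible in the first place.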
[jira] [Resolved] (KAFKA-12344) Support SlidingWindows in the Scala API
[ https://issues.apache.org/jira/browse/KAFKA-12344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman resolved KAFKA-12344. Fix Version/s: 3.0.0 Resolution: Fixed > Support SlidingWindows in the Scala API > --- > > Key: KAFKA-12344 > URL: https://issues.apache.org/jira/browse/KAFKA-12344 > Project: Kafka > Issue Type: Improvement > Components: streams >Affects Versions: 2.7.0 >Reporter: Leah Thomas >Assignee: Ketul Gupta >Priority: Major > Labels: newbie, scala > Fix For: 3.0.0 > > > In KIP-450 we implemented sliding windows for the Java API but left out a few > crucial methods to allow sliding windows to work through the Scala API. We > need to add those methods to make the Scala API fully leverage sliding windows. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-12710) Consider enabling (at least some) optimizations by default
[ https://issues.apache.org/jira/browse/KAFKA-12710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17330995#comment-17330995 ] A. Sophie Blee-Goldman commented on KAFKA-12710: Thanks, I'd forgotten about that KIP. Being able to selectively disable optimizations would make enabling (at least some) optimizations by default much more palatable, if we can do them together that would be ideal. > Consider enabling (at least some) optimizations by default > -- > > Key: KAFKA-12710 > URL: https://issues.apache.org/jira/browse/KAFKA-12710 > Project: Kafka > Issue Type: Improvement > Components: streams >Reporter: A. Sophie Blee-Goldman >Priority: Major > > Topology optimizations such as the repartition consolidation and source topic > changelog are extremely useful at reducing the footprint of a Kafka Streams > application on the broker. The additional storage and resource utilization > due to changelogs and repartitions is a very real pain point, and has even > been cited as the reason for turning to other stream processing frameworks in > the past (though of course I question that judgement) > The repartition topic optimization, at the very least, should be enabled by > default. The problem is that we can't just flip the switch without breaking > existing applications during upgrade, since the location and name of such > topics in the topology may change. One possibility is to just detect this > situation and disable the optimization if we find that it would produce an > incompatible topology for an existing application. We can determine that this > is the case simply by looking for pre-existing repartition topics. If any > such topics are present, and match the set of repartition topics in the > un-optimized topology, then we know we need to switch the optimization off. > If we don't find any repartition topics, or they match the optimized > topology, then we're safe to enable it by default. 
> Alternatively, we could just do a KIP to indicate that we intend to change > the default in the next breaking release and that existing applications > should override this config if necessary. We should be able to implement a > fail-safe and shut down if a user misses or forgets to do so, using the > method mentioned above. > The source topic optimization is perhaps more controversial, as there have > been a few issues raised with regards to things like [restoring bad data and > asymmetric serdes|https://issues.apache.org/jira/browse/KAFKA-8037], or more > recently the bug discovered in the [emit-on-change semantics for > KTables|https://issues.apache.org/jira/browse/KAFKA-12508?focusedCommentId=17306323=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17306323]. > However for this case at least there are no compatibility concerns. It's > safe to upgrade from using a separate changelog for a source KTable to just > using that source topic directly, although the reverse is not true. We could > even automatically delete the no-longer-necessary changelog for upgrading > applications -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-12710) Consider enabling (at least some) optimizations by default
A. Sophie Blee-Goldman created KAFKA-12710: -- Summary: Consider enabling (at least some) optimizations by default Key: KAFKA-12710 URL: https://issues.apache.org/jira/browse/KAFKA-12710 Project: Kafka Issue Type: Improvement Components: streams Reporter: A. Sophie Blee-Goldman Topology optimizations such as the repartition consolidation and source topic changelog are extremely useful at reducing the footprint of a Kafka Streams application on the broker. The additional storage and resource utilization due to changelogs and repartitions is a very real pain point, and has even been cited as the reason for turning to other stream processing frameworks in the past (though of course I question that judgement). The repartition topic optimization, at the very least, should be enabled by default. The problem is that we can't just flip the switch without breaking existing applications during upgrade, since the location and name of such topics in the topology may change. One possibility is to just detect this situation and disable the optimization if we find that it would produce an incompatible topology for an existing application. We can determine that this is the case simply by looking for pre-existing repartition topics. If any such topics are present, and match the set of repartition topics in the un-optimized topology, then we know we need to switch the optimization off. If we don't find any repartition topics, or they match the optimized topology, then we're safe to enable it by default. Alternatively, we could just do a KIP to indicate that we intend to change the default in the next breaking release and that existing applications should override this config if necessary. We should be able to implement a fail-safe and shut down if a user misses or forgets to do so, using the method mentioned above. 
The source topic optimization is perhaps more controversial, as there have been a few issues raised with regards to things like [restoring bad data and asymmetric serdes|https://issues.apache.org/jira/browse/KAFKA-8037], or more recently the bug discovered in the [emit-on-change semantics for KTables|https://issues.apache.org/jira/browse/KAFKA-12508?focusedCommentId=17306323&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17306323]. However, for this case at least, there are no compatibility concerns. It's safe to upgrade from using a separate changelog for a source KTable to just using that source topic directly, although the reverse is not true. We could even automatically delete the no-longer-necessary changelog for upgrading applications. -- This message was sent by Atlassian Jira (v8.3.4#803005)
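The compatibility check sketched in the description above (look for pre-existing repartition topics and only enable the optimization when they are absent or already match the optimized topology) can be illustrated in plain Java. This is a hypothetical sketch, not actual Streams internals; the helper names are made up, though the "-repartition" topic-name suffix follows the usual Streams naming convention:

```java
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch of the compatibility heuristic described above: compare
// the repartition topics that already exist on the broker against the
// repartition topics the un-optimized topology would use.
public class RepartitionCompat {

    // Streams repartition topics conventionally end in "-repartition" and are
    // prefixed with the application id.
    static Set<String> repartitionTopics(Set<String> allTopics, String appId) {
        return allTopics.stream()
                .filter(t -> t.startsWith(appId) && t.endsWith("-repartition"))
                .collect(Collectors.toSet());
    }

    // Safe to enable the optimization if no repartition topics exist yet
    // (a new app), or the existing ones already match the optimized topology.
    static boolean safeToOptimize(Set<String> existing, Set<String> optimized) {
        return existing.isEmpty() || existing.equals(optimized);
    }
}
```

In a real implementation the topic list would come from an AdminClient topic listing; the sketch only captures the decision rule.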
[jira] [Commented] (KAFKA-10493) KTable out-of-order updates are not being ignored
[ https://issues.apache.org/jira/browse/KAFKA-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17329888#comment-17329888 ] A. Sophie Blee-Goldman commented on KAFKA-10493: IMO we should find a way to fix this that doesn't prevent users from leveraging the source topic optimization. As I've mentioned before, the additional storage footprint from changelogs is a very real complaint and has been cited as the reason for not using Kafka Streams in the past. And it sounds to me like this would make it even worse, as we would need to not only use a dedicated changelog for all source KTables but also disable compaction entirely, IIUC. That just does not sound like a feasible path forward. I haven't fully digested this current discussion about the impact of dropping out-of-order updates with a compacted changelog, but perhaps we could store some information in the committed offset metadata to help us here? > KTable out-of-order updates are not being ignored > - > > Key: KAFKA-10493 > URL: https://issues.apache.org/jira/browse/KAFKA-10493 > Project: Kafka > Issue Type: Bug > Components: streams >Affects Versions: 2.6.0 >Reporter: Pedro Gontijo >Assignee: Matthias J. Sax >Priority: Blocker > Fix For: 3.0.0 > > Attachments: KTableOutOfOrderBug.java > > > On a materialized KTable, out-of-order records for a given key (records whose > timestamps are older than the current value in the store) are not being ignored > but are used to update the local store value and are also forwarded. > I believe the bug is here: > [https://github.com/apache/kafka/blob/2.6.0/streams/src/main/java/org/apache/kafka/streams/state/internals/ValueAndTimestampSerializer.java#L77] > It should return true, not false (see javadoc) > The bug impacts here: > [https://github.com/apache/kafka/blob/2.6.0/streams/src/main/java/org/apache/kafka/streams/kstream/internals/KTableSource.java#L142-L148] > I have attached a simple stream app that shows the issue happening. > Thank you! 
-- This message was sent by Atlassian Jira (v8.3.4#803005)
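The intended behavior the bug report describes boils down to a timestamp comparison: an update should only be applied to the store (and forwarded) if its timestamp is at least as new as the timestamp of the value currently stored. A minimal sketch of that rule follows; the class and method names are hypothetical, not the actual ValueAndTimestampSerializer or KTableSource code:

```java
// Hypothetical sketch of the out-of-order rule the report asks for: an update
// is applied and forwarded only if its timestamp is >= the timestamp of the
// value currently in the store. Mirrors the intent, not the real implementation.
public class OutOfOrderCheck {

    static boolean isOutOfOrder(long storedTimestamp, long newTimestamp) {
        return newTimestamp < storedTimestamp;
    }

    // Returns the value that should end up in the store after the update:
    // out-of-order updates are ignored, in-order updates win.
    static <V> V apply(V storedValue, long storedTs, V newValue, long newTs) {
        return isOutOfOrder(storedTs, newTs) ? storedValue : newValue;
    }
}
```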
[jira] [Commented] (KAFKA-12699) Streams no longer overrides the java default uncaught exception handler
[ https://issues.apache.org/jira/browse/KAFKA-12699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327007#comment-17327007 ] A. Sophie Blee-Goldman commented on KAFKA-12699: Marking it down for 3.0 so we don't forget. > Streams no longer overrides the java default uncaught exception handler > - > > Key: KAFKA-12699 > URL: https://issues.apache.org/jira/browse/KAFKA-12699 > Project: Kafka > Issue Type: Bug > Components: streams >Affects Versions: 2.8.0 >Reporter: Walker Carlson >Priority: Minor > Fix For: 3.0.0 > > > If a user used `Thread.setUncaughtExceptionHandler()` to set the handler for > all threads in the runtime, Streams would override that with its own handler. > However, since Streams no longer uses the `Thread` handler, it will no > longer do so. This can cause problems if the user does something like > `System.exit(1)` in the handler. > > If the old handler is used in Streams, it will still work as it used to. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-12699) Streams no longer overrides the java default uncaught exception handler
[ https://issues.apache.org/jira/browse/KAFKA-12699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman updated KAFKA-12699: --- Fix Version/s: 3.0.0 > Streams no longer overrides the java default uncaught exception handler > - > > Key: KAFKA-12699 > URL: https://issues.apache.org/jira/browse/KAFKA-12699 > Project: Kafka > Issue Type: Bug > Components: streams >Affects Versions: 2.8.0 >Reporter: Walker Carlson >Priority: Minor > Fix For: 3.0.0 > > > If a user used `Thread.setUncaughtExceptionHandler()` to set the handler for > all threads in the runtime, Streams would override that with its own handler. > However, since Streams no longer uses the `Thread` handler, it will no > longer do so. This can cause problems if the user does something like > `System.exit(1)` in the handler. > > If the old handler is used in Streams, it will still work as it used to. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-12699) Streams no longer overrides the java default uncaught exception handler
[ https://issues.apache.org/jira/browse/KAFKA-12699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327006#comment-17327006 ] A. Sophie Blee-Goldman commented on KAFKA-12699: > we should have the exception raised again to make sure it is handled properly. I agree with the sentiment here, but if we disable the uncaught exception handler then there's no other handling that is/can be done. I'm also not sure what EOS has to do with this; in both ALOS and EOS we want to (and have to) do a "dirty close", otherwise we're committing bad data. > stream user and the owner of the runtime are not entirely equal so might have > different requirements Good point. > Streams no longer overrides the java default uncaught exception handler > - > > Key: KAFKA-12699 > URL: https://issues.apache.org/jira/browse/KAFKA-12699 > Project: Kafka > Issue Type: Bug > Components: streams >Affects Versions: 2.8.0 >Reporter: Walker Carlson >Priority: Minor > > If a user used `Thread.setUncaughtExceptionHandler()` to set the handler for > all threads in the runtime, Streams would override that with its own handler. > However, since Streams no longer uses the `Thread` handler, it will no > longer do so. This can cause problems if the user does something like > `System.exit(1)` in the handler. > > If the old handler is used in Streams, it will still work as it used to. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-12706) Consider adding reason and source of error in APPLICATION_SHUTDOWN
A. Sophie Blee-Goldman created KAFKA-12706: -- Summary: Consider adding reason and source of error in APPLICATION_SHUTDOWN Key: KAFKA-12706 URL: https://issues.apache.org/jira/browse/KAFKA-12706 Project: Kafka Issue Type: Improvement Components: streams Reporter: A. Sophie Blee-Goldman At the moment, when a user opts to shut down the application in the streams uncaught exception handler, we just send a signal to all members of the group, who then shut down. If there are a large number of application instances running, it can be annoying and time-consuming to locate the client that hit this error. It would be nice if we could let each client log the exception that triggered this, and possibly also the client who requested the shutdown. That will make it much easier to identify the problem and figure out which set of logs to look into for further information. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-12699) Streams no longer overrides the java default uncaught exception handler
[ https://issues.apache.org/jira/browse/KAFKA-12699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326215#comment-17326215 ] A. Sophie Blee-Goldman commented on KAFKA-12699: To be fair, I think claiming that this is a bug that Streams "no longer overrides the java default handler" is a bit harsh, since (a) as you point out, this isn't a regression but just a different behavior in a new feature, and (b) it was never specified or implied that the new feature would do exactly the same thing as the old feature. I would actually argue that continuing to override the java default handler when the user has not actually supplied a Thread.UncaughtExceptionHandler to override it would be more unexpected. However, I do think it's a bit odd that we would end up invoking the default thread uncaught exception handler at all, since Streams will have already caught and invoked the user-defined handler at that point. The "bug" from my perspective is that we rethrow the exception up through run(), vs swallowing it once we reach the outer try in StreamThread#run. If we think that the user-defined default handler should not be invoked when using the new StreamsUncaughtExceptionHandler (and I think that assumption is not a given, although I don't feel strongly for or against), and should be overridden with a no-op handler, then why continue to throw the exception? And if we don't throw the exception, then why do we need to override with a no-op? -- just my line of thinking; again, I'm not sure what a realistic expected user behavior is here. 
But setting a global exception handler and then just blindly calling System.exit in a multi-threaded app where you've consciously implemented a handler that makes it clear threads can die and be replaced...seems like user error to me. > Streams no longer overrides the java default uncaught exception handler > - > > Key: KAFKA-12699 > URL: https://issues.apache.org/jira/browse/KAFKA-12699 > Project: Kafka > Issue Type: Bug > Components: streams >Affects Versions: 2.8.0 >Reporter: Walker Carlson >Priority: Minor > > If a user used `Thread.setUncaughtExceptionHandler()` to set the handler for > all threads in the runtime, Streams would override that with its own handler. > However, since Streams no longer uses the `Thread` handler, it will no > longer do so. This can cause problems if the user does something like > `System.exit(1)` in the handler. > > If the old handler is used in Streams, it will still work as it used to. -- This message was sent by Atlassian Jira (v8.3.4#803005)
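For context, the handler under discussion lets the application decide what happens to a thread that hits a fatal error, instead of relying on the Thread-level default handler. The sketch below mirrors the shape of the responses offered by the StreamsUncaughtExceptionHandler introduced in Streams 2.8 (REPLACE_THREAD, SHUTDOWN_CLIENT, SHUTDOWN_APPLICATION) using a local enum; the decision logic itself is a made-up example, not a recommendation:

```java
// Local enum mirroring the response options of the 2.8+ Kafka Streams
// StreamsUncaughtExceptionHandler. The mapping from exception type to
// response below is purely illustrative.
public class HandlerSketch {

    enum Response { REPLACE_THREAD, SHUTDOWN_CLIENT, SHUTDOWN_APPLICATION }

    static Response decide(Throwable t) {
        if (t instanceof OutOfMemoryError) {
            // Treat as unrecoverable for the whole app: signal every instance.
            return Response.SHUTDOWN_APPLICATION;
        }
        if (t instanceof IllegalStateException) {
            // Treat as a local problem: stop just this client.
            return Response.SHUTDOWN_CLIENT;
        }
        // Otherwise assume a transient failure and replace the dead thread.
        return Response.REPLACE_THREAD;
    }
}
```

With the real API this logic would be registered via KafkaStreams#setUncaughtExceptionHandler before start(); as the comments above note, choosing a response here replaces the old Thread-level handler semantics rather than layering on top of them.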
[jira] [Created] (KAFKA-12700) The admin.listeners config has wonky valid values in the docs
A. Sophie Blee-Goldman created KAFKA-12700: -- Summary: The admin.listeners config has wonky valid values in the docs Key: KAFKA-12700 URL: https://issues.apache.org/jira/browse/KAFKA-12700 Project: Kafka Issue Type: Bug Components: docs, KafkaConnect Reporter: A. Sophie Blee-Goldman Noticed this while updating the docs for the 2.6.2 release. The docs for these configs are generated from the config definition, including info such as the default, type, valid values, etc. When defining WorkerConfig.ADMIN_LISTENERS_CONFIG we seem to pass an actual `new AdminListenersValidator()` object in for the "valid values" parameter, causing this field to display a useless object reference in the docs. See https://kafka.apache.org/documentation/#connectconfigs_admin.listeners: Valid Values: org.apache.kafka.connect.runtime.WorkerConfig$AdminListenersValidator@383534aa -- This message was sent by Atlassian Jira (v8.3.4#803005)
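The garbled docs happen because the generated table renders the "valid values" column with toString(), and a validator object that doesn't override it falls back to Object's default classname@hash form. A minimal sketch of the usual fix, assuming a simplified validator interface (this is not the actual Connect WorkerConfig code):

```java
// Sketch of why the docs show "...$AdminListenersValidator@383534aa" and how
// it is typically fixed: override toString() so the doc generator has
// something human-readable to print. Hypothetical minimal validator types.
public class ValidatorSketch {

    interface Validator {
        void ensureValid(String name, Object value);
    }

    static class AdminListenersValidator implements Validator {
        @Override
        public void ensureValid(String name, Object value) {
            // validation logic elided in this sketch
        }

        // The fix: a readable description for the generated docs.
        @Override
        public String toString() {
            return "Comma-separated list of listener URIs";
        }
    }
}
```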
[jira] [Commented] (KAFKA-12696) Add standard getters to LagInfo class to allow automatic serialization
[ https://issues.apache.org/jira/browse/KAFKA-12696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17326075#comment-17326075 ] A. Sophie Blee-Goldman commented on KAFKA-12696: Thanks for the PR [~mihasya]. You need to be added as a contributor to assign yourself issues, by the way. I just added you, so you should be able to self-assign any tickets from now on. > Add standard getters to LagInfo class to allow automatic serialization > -- > > Key: KAFKA-12696 > URL: https://issues.apache.org/jira/browse/KAFKA-12696 > Project: Kafka > Issue Type: Improvement > Components: streams >Reporter: Mikhail Panchenko >Assignee: Mikhail Panchenko >Priority: Trivial > Original Estimate: 0.5h > Remaining Estimate: 0.5h > > The LagInfo class has non-standard getters for its member variables. This > means that Jackson and other serialization frameworks do not know how to > serialize them without additional annotations. So when implementing the sort > of system that KAFKA-6144 is meant to enable, as documented here in docs like > [Kafka Streams Interactive > Queries|https://docs.confluent.io/platform/current/streams/developer-guide/interactive-queries.html#querying-state-stores-during-a-rebalance], > one has to either inject a bunch of custom serialization logic into Jersey > or wrap this class. > The patch to fix this is trivial, and I will be putting one up shortly. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (KAFKA-12696) Add standard getters to LagInfo class to allow automatic serialization
[ https://issues.apache.org/jira/browse/KAFKA-12696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman reassigned KAFKA-12696: -- Assignee: Mikhail Panchenko > Add standard getters to LagInfo class to allow automatic serialization > -- > > Key: KAFKA-12696 > URL: https://issues.apache.org/jira/browse/KAFKA-12696 > Project: Kafka > Issue Type: Improvement > Components: streams >Reporter: Mikhail Panchenko >Assignee: Mikhail Panchenko >Priority: Trivial > Original Estimate: 0.5h > Remaining Estimate: 0.5h > > The LagInfo class has non-standard getters for its member variables. This > means that Jackson and other serialization frameworks do not know how to > serialize them without additional annotations. So when implementing the sort > of system that KAFKA-6144 is meant to enable, as documented here in docs like > [Kafka Streams Interactive > Queries|https://docs.confluent.io/platform/current/streams/developer-guide/interactive-queries.html#querying-state-stores-during-a-rebalance], > one has to either inject a bunch of custom serialization logic into Jersey > or wrap this class. > The patch to fix this is trivial, and I will be putting one up shortly. -- This message was sent by Atlassian Jira (v8.3.4#803005)
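The problem is purely a naming convention: Jackson's default bean introspection looks for getX() methods, while LagInfo exposes accessors like currentOffsetPosition(). Until standard getters land, the "wrap this class" workaround mentioned above can be a thin DTO with conventional getters. A sketch (the DTO itself is hypothetical; the field names follow LagInfo's accessors):

```java
// A Jackson-friendly wrapper: a class whose accessors are named
// currentOffsetPosition() etc. serializes to {} under default bean
// introspection, while conventional getX() getters are picked up
// automatically. Hypothetical DTO for illustration.
public class LagInfoDto {
    private final long currentOffsetPosition;
    private final long endOffsetPosition;
    private final long offsetLag;

    public LagInfoDto(long current, long end, long lag) {
        this.currentOffsetPosition = current;
        this.endOffsetPosition = end;
        this.offsetLag = lag;
    }

    public long getCurrentOffsetPosition() { return currentOffsetPosition; }
    public long getEndOffsetPosition() { return endOffsetPosition; }
    public long getOffsetLag() { return offsetLag; }
}
```

In practice you would construct it from a LagInfo instance before handing it to Jersey/Jackson for serialization.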
[jira] [Updated] (KAFKA-12690) Remove deprecated Producer#sendOffsetsToTransaction
[ https://issues.apache.org/jira/browse/KAFKA-12690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman updated KAFKA-12690: --- Description: In [KIP-732|https://cwiki.apache.org/confluence/display/KAFKA/KIP-732%3A+Deprecate+eos-alpha+and+replace+eos-beta+with+eos-v2] we deprecated the EXACTLY_ONCE and EXACTLY_ONCE_BETA configs in StreamsConfig, to be removed in 4.0 > Remove deprecated Producer#sendOffsetsToTransaction > --- > > Key: KAFKA-12690 > URL: https://issues.apache.org/jira/browse/KAFKA-12690 > Project: Kafka > Issue Type: Task > Components: producer >Reporter: A. Sophie Blee-Goldman >Priority: Blocker > Fix For: 4.0.0 > > > In > [KIP-732|https://cwiki.apache.org/confluence/display/KAFKA/KIP-732%3A+Deprecate+eos-alpha+and+replace+eos-beta+with+eos-v2] > we deprecated the EXACTLY_ONCE and EXACTLY_ONCE_BETA configs in > StreamsConfig, to be removed in 4.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-12690) Remove deprecated Producer#sendOffsetsToTransaction
A. Sophie Blee-Goldman created KAFKA-12690: -- Summary: Remove deprecated Producer#sendOffsetsToTransaction Key: KAFKA-12690 URL: https://issues.apache.org/jira/browse/KAFKA-12690 Project: Kafka Issue Type: Task Components: producer Reporter: A. Sophie Blee-Goldman Fix For: 4.0.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-12689) Remove deprecated EOS configs
A. Sophie Blee-Goldman created KAFKA-12689: -- Summary: Remove deprecated EOS configs Key: KAFKA-12689 URL: https://issues.apache.org/jira/browse/KAFKA-12689 Project: Kafka Issue Type: Task Components: streams Reporter: A. Sophie Blee-Goldman Fix For: 4.0.0 In [KIP-732|https://cwiki.apache.org/confluence/display/KAFKA/KIP-732%3A+Deprecate+eos-alpha+and+replace+eos-beta+with+eos-v2] we deprecated the EXACTLY_ONCE and EXACTLY_ONCE_BETA configs in StreamsConfig, to be removed in 4.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-9168) Integrate JNI direct buffer support to RocksDBStore
[ https://issues.apache.org/jira/browse/KAFKA-9168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17325187#comment-17325187 ] A. Sophie Blee-Goldman commented on KAFKA-9168: --- [~sagarrao] Feel free to pick this up, but you may need to hold off on working on it until [~cadonna] has completed KAFKA-8897. I believe he's battling some weird runtime issues in RocksDB with the upgrade at the moment, but once we've figured that out then this should be unblocked. > Integrate JNI direct buffer support to RocksDBStore > --- > > Key: KAFKA-9168 > URL: https://issues.apache.org/jira/browse/KAFKA-9168 > Project: Kafka > Issue Type: Improvement > Components: streams >Reporter: Sagar Rao >Priority: Major > Labels: perfomance > > There has been a PR created on rocksdb Java client to support direct > ByteBuffers in Java. We can look at integrating it whenever it gets merged. > Link to PR: [https://github.com/facebook/rocksdb/pull/2283] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-12675) Improve sticky general assignor scalability and performance
[ https://issues.apache.org/jira/browse/KAFKA-12675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325181#comment-17325181 ] A. Sophie Blee-Goldman commented on KAFKA-12675: Nice! It would be great if we could get these improvements into 3.0, since as Luke mentioned we plan to make this the default assignor. > Improve sticky general assignor scalability and performance > --- > > Key: KAFKA-12675 > URL: https://issues.apache.org/jira/browse/KAFKA-12675 > Project: Kafka > Issue Type: Improvement >Reporter: Luke Chen >Assignee: Luke Chen >Priority: Major > > Currently, we have a "general assignor" for the non-equal subscription case and > a "constrained assignor" for the all-equal subscription case. There's a performance > test for the constrained assignor with: > topicCount = 500; > partitionCount = 2000; > consumerCount = 2000; > in _testLargeAssignmentAndGroupWithUniformSubscription,_ total 1 million > partitions, and we can complete the assignment within 2 seconds on my machine. > However, if we let 1 of the consumers subscribe to only 1 topic, it'll use the > "general assignor", and the result with the same setting as above is: > *OutOfMemory.* > Even when we lower the counts to: > topicCount = 50; > partitionCount = 1000; > consumerCount = 1000; > we still get *OutOfMemory*. > With this setting: > topicCount = 50; > partitionCount = 800; > consumerCount = 800; > we can complete in 10 seconds on my machine, which is still slow. > > Since we are going to set the default assignment strategy to > "CooperativeStickyAssignor" soon, we should improve the scalability and > performance of the sticky general assignor. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-12675) Improve sticky general assignor scalability and performance
[ https://issues.apache.org/jira/browse/KAFKA-12675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17324090#comment-17324090 ] A. Sophie Blee-Goldman commented on KAFKA-12675: [~twmb] would you be interested in submitting a PR for your algorithm? You'd need to translate it (and the tests) into Java, but other than that it seems pretty much complete. > Improve sticky general assignor scalability and performance > --- > > Key: KAFKA-12675 > URL: https://issues.apache.org/jira/browse/KAFKA-12675 > Project: Kafka > Issue Type: Improvement >Reporter: Luke Chen >Assignee: Luke Chen >Priority: Major > > Currently, we have a "general assignor" for the non-equal subscription case and > a "constrained assignor" for the all-equal subscription case. There's a performance > test for the constrained assignor with: > topicCount = 500; > partitionCount = 2000; > consumerCount = 2000; > in _testLargeAssignmentAndGroupWithUniformSubscription,_ total 1 million > partitions, and we can complete the assignment within 2 seconds on my machine. > However, if we let 1 of the consumers subscribe to only 1 topic, it'll use the > "general assignor", and the result with the same setting as above is: > *OutOfMemory.* > Even when we lower the counts to: > topicCount = 50; > partitionCount = 1000; > consumerCount = 1000; > we still get *OutOfMemory*. > With this setting: > topicCount = 50; > partitionCount = 800; > consumerCount = 800; > we can complete in 10 seconds on my machine, which is still slow. > > Since we are going to set the default assignment strategy to > "CooperativeStickyAssignor" soon, we should improve the scalability and > performance of the sticky general assignor. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-12676) Improve sticky general assignor underlying algorithm for the unequal subscriptions case
[ https://issues.apache.org/jira/browse/KAFKA-12676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman updated KAFKA-12676: --- Summary: Improve sticky general assignor underlying algorithm for the unequal subscriptions case (was: Improve sticky general assignor underlying algorithm for the imbalanced case) > Improve sticky general assignor underlying algorithm for the unequal > subscriptions case > --- > > Key: KAFKA-12676 > URL: https://issues.apache.org/jira/browse/KAFKA-12676 > Project: Kafka > Issue Type: Improvement >Reporter: Luke Chen >Priority: Major > > As discussed in KAFKA-12675, we think the general assignor algorithm might be > able to be improved to make some edge cases more balanced, and to improve the > scalability and performance. > Ref: the algorithm is here: > [https://github.com/twmb/franz-go/blob/master/pkg/kgo/internal/sticky/graph.go] > and the discussion is here: > [https://issues.apache.org/jira/browse/KAFKA-12675?focusedCommentId=17322603&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17322603] > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-12675) Improve sticky general assignor scalability and performance
[ https://issues.apache.org/jira/browse/KAFKA-12675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17324089#comment-17324089 ] A. Sophie Blee-Goldman commented on KAFKA-12675: Yeah, just to clarify, what [~twmb] proposed would not require a KIP since it's just an improvement to the existing (somewhat lacking) algorithm for the general case. There shouldn't be any public-facing impact, except perhaps for the memory consumption. But since the current algorithm can't even handle the partition counts he described testing, I'd still consider this an improvement across the board. > Improve sticky general assignor scalability and performance > --- > > Key: KAFKA-12675 > URL: https://issues.apache.org/jira/browse/KAFKA-12675 > Project: Kafka > Issue Type: Improvement >Reporter: Luke Chen >Assignee: Luke Chen >Priority: Major > > Currently, we have a "general assignor" for the non-equal subscription case and > a "constrained assignor" for the all-equal subscription case. There's a performance > test for the constrained assignor with: > topicCount = 500; > partitionCount = 2000; > consumerCount = 2000; > in _testLargeAssignmentAndGroupWithUniformSubscription,_ total 1 million > partitions, and we can complete the assignment within 2 seconds on my machine. > However, if we let 1 of the consumers subscribe to only 1 topic, it'll use the > "general assignor", and the result with the same setting as above is: > *OutOfMemory.* > Even when we lower the counts to: > topicCount = 50; > partitionCount = 1000; > consumerCount = 1000; > we still get *OutOfMemory*. > With this setting: > topicCount = 50; > partitionCount = 800; > consumerCount = 800; > we can complete in 10 seconds on my machine, which is still slow. 
> > Since we are going to set the default assignment strategy to > "CooperativeStickyAssignor" soon, we should improve the scalability and > performance of the sticky general assignor. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-12477) Smart rebalancing with dynamic protocol selection
[ https://issues.apache.org/jira/browse/KAFKA-12477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman updated KAFKA-12477: --- Description: Users who want to upgrade their applications and enable the COOPERATIVE rebalancing protocol in their consumer apps are required to follow a double rolling bounce upgrade path. The reason for this is laid out in the [Consumer Upgrades|https://cwiki.apache.org/confluence/display/KAFKA/KIP-429%3A+Kafka+Consumer+Incremental+Rebalance+Protocol#KIP429:KafkaConsumerIncrementalRebalanceProtocol-Consumer] section of KIP-429. Basically, the ConsumerCoordinator picks a rebalancing protocol in its constructor based on the list of supported partition assignors. The protocol is selected as the highest protocol that is commonly supported by all assignors in the list, and never changes after that. This is a bit unfortunate because it may end up using an older protocol even after every member in the group has been updated to support the newer protocol. After the first rolling bounce of the upgrade, all members will have two assignors: "cooperative-sticky" and "range" (or sticky/round-robin/etc). At this point the EAGER protocol will still be selected due to the presence of the "range" assignor, but it's the "cooperative-sticky" assignor that will ultimately be selected for use in rebalances if that assignor is preferred (ie positioned first in the list). The only reason for the second rolling bounce is to strip off the "range" assignor and allow the upgraded members to switch over to COOPERATIVE. We can't allow them to use cooperative rebalancing until everyone has been upgraded, but once they have it's safe to do so. And there is already a way for the client to detect that everyone is on the new byte code: if the CooperativeStickyAssignor is selected by the group coordinator, then that means it is supported by all consumers in the group and therefore everyone must be upgraded. 
We may be able to save the second rolling bounce by dynamically updating the rebalancing protocol inside the ConsumerCoordinator as "the highest protocol supported by the assignor chosen by the group coordinator". This means we'll still be using EAGER at the first rebalance, since we of course need to wait for this initial rebalance to get the response from the group coordinator. But we should take the hint from the chosen assignor rather than dropping this information on the floor and sticking with the original protocol.
Concrete Proposal: This assumes we will change the default assignor to ["cooperative-sticky", "range"] in KIP-726. It also acknowledges that users may attempt any kind of upgrade without reading the docs, and so we need to put in safeguards against data corruption rather than assume everyone will follow the safe upgrade path. With this proposal,
1) New applications on 3.0 will enable cooperative rebalancing by default
2) Existing applications which don’t set an assignor can safely upgrade to 3.0 using a single rolling bounce with no extra steps, and will automatically transition to cooperative rebalancing
3) Existing applications which do set an assignor that uses EAGER can likewise upgrade their applications to COOPERATIVE with a single rolling bounce
4) Once on 3.0, applications can safely go back and forth between EAGER and COOPERATIVE
5) Applications can safely downgrade away from 3.0
The high-level idea for dynamic protocol upgrades is that the group will leverage the assignor selected by the group coordinator to determine when it’s safe to upgrade to COOPERATIVE, and trigger a fail-safe to protect the group in case of rare events or user misconfiguration. The group coordinator selects the most preferred assignor that’s supported by all members of the group, so we know that all members will support COOPERATIVE once we receive the “cooperative-sticky” assignor after a rebalance. 
At this point, each member can upgrade their own protocol to COOPERATIVE. However, there may be situations in which an EAGER member may join the group even after upgrading to COOPERATIVE. For example, during a rolling upgrade if the last remaining member on the old bytecode misses a rebalance, the other members will be allowed to upgrade to COOPERATIVE. If the old member rejoins and is chosen to be the group leader before it’s upgraded to 3.0, it won’t be aware that the other members of the group have not yet revoked their partitions when computing the assignment. Short Circuit: The risk of mixing the cooperative and eager rebalancing protocols is that a partition may be assigned to one member while it has yet to be revoked from its previous owner. The danger is that the new owner may begin processing and committing offsets for this partition while the previous owner is also committing offsets in its #onPartitionsRevoked callback, which is invoked at the end
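The upgrade path described above hinges on listing both assignors in the consumer config, most-preferred first, so the group coordinator only selects "cooperative-sticky" once every member in the group supports it. A configuration sketch (the assignor class names are the standard implementations shipped with the Java client; the rest of a real consumer config is omitted):

```java
import java.util.Properties;

// Consumer configuration for the rolling-upgrade path discussed above: list
// the cooperative assignor first and the old eager assignor second, so the
// group coordinator falls back to "range" until every member supports
// "cooperative-sticky". Illustrative sketch, not a complete consumer config.
public class UpgradeConfig {

    static Properties upgradeProps() {
        Properties props = new Properties();
        props.put("partition.assignment.strategy",
                "org.apache.kafka.clients.consumer.CooperativeStickyAssignor,"
              + "org.apache.kafka.clients.consumer.RangeAssignor");
        return props;
    }
}
```

Once every member has been bounced with this config, the coordinator picks the cooperative assignor, which is the signal (per the proposal above) that the client may safely switch its protocol to COOPERATIVE.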
[jira] [Updated] (KAFKA-12669) Add deleteRange to WindowStore / KeyValueStore interfaces
[ https://issues.apache.org/jira/browse/KAFKA-12669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman updated KAFKA-12669: --- Issue Type: Improvement (was: Bug) > Add deleteRange to WindowStore / KeyValueStore interfaces > - > > Key: KAFKA-12669 > URL: https://issues.apache.org/jira/browse/KAFKA-12669 > Project: Kafka > Issue Type: Improvement > Components: streams >Reporter: Guozhang Wang >Priority: Major > Labels: needs-kip > > We can consider adding such APIs where the underlying implementation classes > have better optimizations than deleting the keys as get-and-delete one by one. -- This message was sent by Atlassian Jira (v8.3.4#803005)
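To make the deleteRange idea concrete, here is a minimal sketch of the naive one-by-one fallback the ticket wants to improve on. The class below is a stand-in backed by a TreeMap, not the real org.apache.kafka.streams.state.KeyValueStore interface; a RocksDB-backed store could replace the loop-free clear() with a single native DeleteRange call, which is the optimization the ticket is after.

```java
import java.util.TreeMap;

// Stand-in for a sorted key-value store; names and signature are illustrative.
public class DeleteRangeSketch {
    private final TreeMap<String, String> store = new TreeMap<>();

    public void put(String key, String value) { store.put(key, value); }
    public String get(String key) { return store.get(key); }

    // Naive fallback: remove every key in the inclusive range [from, to].
    // Implementations with native range tombstones can do this in O(1)
    // writes instead of one delete per key.
    public void deleteRange(String from, String to) {
        store.subMap(from, true, to, true).clear();
    }
}
```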
[jira] [Resolved] (KAFKA-12667) Incorrect error log on StateDirectory close
[ https://issues.apache.org/jira/browse/KAFKA-12667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman resolved KAFKA-12667. Resolution: Fixed > Incorrect error log on StateDirectory close > --- > > Key: KAFKA-12667 > URL: https://issues.apache.org/jira/browse/KAFKA-12667 > Project: Kafka > Issue Type: Bug > Components: streams >Affects Versions: 2.7.0, 2.6.1, 2.8.0 >Reporter: Jan Justesen >Assignee: A. Sophie Blee-Goldman >Priority: Major > Fix For: 3.0.0, 2.6.3, 2.7.2, 2.8.1 > > > {{In StateDirectory.close() an error is logged about unclean shutdown if all > locks are in fact released, and nothing is logged in case of an unclean > shutdown.}} > > {code:java} > // all threads should be stopped and cleaned up by now, so none should remain > holding a lock > if (locks.isEmpty()) { > log.error("Some task directories still locked while closing state, this > indicates unclean shutdown: {}", locks); > } > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
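The bug quoted above is an inverted condition: the error fires when the lock map IS empty (a clean shutdown) and stays silent when locks remain held. Presumably the fix is simply to negate the check, sketched below with a plain map standing in for the real StateDirectory lock bookkeeping; the method name is illustrative.

```java
import java.util.Map;

public class StateDirectoryCloseSketch {
    // Returns true when an unclean-shutdown error should be logged:
    // only when some task directory locks are still held at close time.
    // The quoted StateDirectory.close() code had this inverted,
    // logging when locks.isEmpty() was true.
    public static boolean shouldLogUncleanShutdown(Map<Long, ?> locks) {
        return !locks.isEmpty();
    }
}
```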
[jira] [Updated] (KAFKA-12667) Incorrect error log on StateDirectory close
[ https://issues.apache.org/jira/browse/KAFKA-12667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman updated KAFKA-12667: --- Fix Version/s: 2.6.3 > Incorrect error log on StateDirectory close > --- > > Key: KAFKA-12667 > URL: https://issues.apache.org/jira/browse/KAFKA-12667 > Project: Kafka > Issue Type: Bug > Components: streams >Affects Versions: 2.7.0, 2.6.1, 2.8.0 >Reporter: Jan Justesen >Assignee: A. Sophie Blee-Goldman >Priority: Major > Fix For: 3.0.0, 2.6.3, 2.7.2, 2.8.1 > > > {{In StateDirectory.close() an error is logged about unclean shutdown if all > locks are in fact released, and nothing is logged in case of an unclean > shutdown.}} > > {code:java} > // all threads should be stopped and cleaned up by now, so none should remain > holding a lock > if (locks.isEmpty()) { > log.error("Some task directories still locked while closing state, this > indicates unclean shutdown: {}", locks); > } > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-12667) Incorrect error log on StateDirectory close
[ https://issues.apache.org/jira/browse/KAFKA-12667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman updated KAFKA-12667: --- Affects Version/s: 2.8.0 > Incorrect error log on StateDirectory close > --- > > Key: KAFKA-12667 > URL: https://issues.apache.org/jira/browse/KAFKA-12667 > Project: Kafka > Issue Type: Bug > Components: streams >Affects Versions: 2.7.0, 2.6.1, 2.8.0 >Reporter: Jan Justesen >Priority: Major > Fix For: 3.0.0, 2.7.2, 2.8.1 > > > {{In StateDirectory.close() an error is logged about unclean shutdown if all > locks are in fact released, and nothing is logged in case of an unclean > shutdown.}} > > {code:java} > // all threads should be stopped and cleaned up by now, so none should remain > holding a lock > if (locks.isEmpty()) { > log.error("Some task directories still locked while closing state, this > indicates unclean shutdown: {}", locks); > } > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-12667) Incorrect error log on StateDirectory close
[ https://issues.apache.org/jira/browse/KAFKA-12667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17321201#comment-17321201 ] A. Sophie Blee-Goldman commented on KAFKA-12667: Thanks [~antrenta]. As Bruno mentioned, we did discover this bug and fixed it in trunk, but unfortunately missed the 2.6.2, 2.7.1, and 2.8.0 releases which have all been in the release process. I'm happy to backport the fix to the 2.8 branch once 2.8.0 is finally out the door, and maybe to the 2.7 branch as well. It seems unlikely that there will be a 2.6.3 release however. Sorry for the trouble, I know it's a concerning message to see. > Incorrect error log on StateDirectory close > --- > > Key: KAFKA-12667 > URL: https://issues.apache.org/jira/browse/KAFKA-12667 > Project: Kafka > Issue Type: Bug > Components: streams >Affects Versions: 2.7.0, 2.6.1 >Reporter: Jan Justesen >Priority: Major > Fix For: 2.6.3, 2.7.2 > > > {{In StateDirectory.close() an error is logged about unclean shutdown if all > locks are in fact released, and nothing is logged in case of an unclean > shutdown.}} > > {code:java} > // all threads should be stopped and cleaned up by now, so none should remain > holding a lock > if (locks.isEmpty()) { > log.error("Some task directories still locked while closing state, this > indicates unclean shutdown: {}", locks); > } > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (KAFKA-12667) Incorrect error log on StateDirectory close
[ https://issues.apache.org/jira/browse/KAFKA-12667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman reassigned KAFKA-12667: -- Assignee: A. Sophie Blee-Goldman > Incorrect error log on StateDirectory close > --- > > Key: KAFKA-12667 > URL: https://issues.apache.org/jira/browse/KAFKA-12667 > Project: Kafka > Issue Type: Bug > Components: streams >Affects Versions: 2.7.0, 2.6.1, 2.8.0 >Reporter: Jan Justesen >Assignee: A. Sophie Blee-Goldman >Priority: Major > Fix For: 3.0.0, 2.7.2, 2.8.1 > > > {{In StateDirectory.close() an error is logged about unclean shutdown if all > locks are in fact released, and nothing is logged in case of an unclean > shutdown.}} > > {code:java} > // all threads should be stopped and cleaned up by now, so none should remain > holding a lock > if (locks.isEmpty()) { > log.error("Some task directories still locked while closing state, this > indicates unclean shutdown: {}", locks); > } > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-12667) Incorrect error log on StateDirectory close
[ https://issues.apache.org/jira/browse/KAFKA-12667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman updated KAFKA-12667: --- Fix Version/s: (was: 2.6.3) 2.8.1 3.0.0 > Incorrect error log on StateDirectory close > --- > > Key: KAFKA-12667 > URL: https://issues.apache.org/jira/browse/KAFKA-12667 > Project: Kafka > Issue Type: Bug > Components: streams >Affects Versions: 2.7.0, 2.6.1 >Reporter: Jan Justesen >Priority: Major > Fix For: 3.0.0, 2.7.2, 2.8.1 > > > {{In StateDirectory.close() an error is logged about unclean shutdown if all > locks are in fact released, and nothing is logged in case of an unclean > shutdown.}} > > {code:java} > // all threads should be stopped and cleaned up by now, so none should remain > holding a lock > if (locks.isEmpty()) { > log.error("Some task directories still locked while closing state, this > indicates unclean shutdown: {}", locks); > } > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-9295) KTableKTableForeignKeyInnerJoinMultiIntegrationTest#shouldInnerJoinMultiPartitionQueryable
[ https://issues.apache.org/jira/browse/KAFKA-9295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17320699#comment-17320699 ] A. Sophie Blee-Goldman commented on KAFKA-9295: --- [~cadonna] Hm...github says the commit that kicked off that build was merged 10 hours ago, and the PR to improve this test was also merged 10 hours ago. But based on the line numbers in the stack trace, I think that build did not contain the fix -- line 187 in the test used to be verifyKTableKTableJoin(), but now is just an empty line. So let's leave the ticket resolved but keep an eye out for further failures. Still, thanks for reporting this! Fingers crossed that this fix will be sufficient and we won't have to dig any further or mess with the session interval > KTableKTableForeignKeyInnerJoinMultiIntegrationTest#shouldInnerJoinMultiPartitionQueryable > -- > > Key: KAFKA-9295 > URL: https://issues.apache.org/jira/browse/KAFKA-9295 > Project: Kafka > Issue Type: Bug > Components: streams, unit tests >Affects Versions: 2.4.0, 2.6.0 >Reporter: Matthias J. 
Sax >Assignee: Luke Chen >Priority: Critical > Labels: flaky-test > Fix For: 3.0.0 > > > [https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/27106/testReport/junit/org.apache.kafka.streams.integration/KTableKTableForeignKeyInnerJoinMultiIntegrationTest/shouldInnerJoinMultiPartitionQueryable/] > {quote}java.lang.AssertionError: Did not receive all 1 records from topic > output- within 6 ms Expected: is a value equal to or greater than <1> > but: <0> was less than <1> at > org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18) at > org.apache.kafka.streams.integration.utils.IntegrationTestUtils.lambda$waitUntilMinKeyValueRecordsReceived$1(IntegrationTestUtils.java:515) > at > org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:417) > at > org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:385) > at > org.apache.kafka.streams.integration.utils.IntegrationTestUtils.waitUntilMinKeyValueRecordsReceived(IntegrationTestUtils.java:511) > at > org.apache.kafka.streams.integration.utils.IntegrationTestUtils.waitUntilMinKeyValueRecordsReceived(IntegrationTestUtils.java:489) > at > org.apache.kafka.streams.integration.KTableKTableForeignKeyInnerJoinMultiIntegrationTest.verifyKTableKTableJoin(KTableKTableForeignKeyInnerJoinMultiIntegrationTest.java:200) > at > org.apache.kafka.streams.integration.KTableKTableForeignKeyInnerJoinMultiIntegrationTest.shouldInnerJoinMultiPartitionQueryable(KTableKTableForeignKeyInnerJoinMultiIntegrationTest.java:183){quote} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-7499) Extend ProductionExceptionHandler to cover serialization exceptions
[ https://issues.apache.org/jira/browse/KAFKA-7499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman updated KAFKA-7499: -- Labels: beginner kip newbie newbie++ (was: beginner kip newbie) > Extend ProductionExceptionHandler to cover serialization exceptions > --- > > Key: KAFKA-7499 > URL: https://issues.apache.org/jira/browse/KAFKA-7499 > Project: Kafka > Issue Type: Improvement > Components: streams >Reporter: Matthias J. Sax >Priority: Major > Labels: beginner, kip, newbie, newbie++ > > In > [KIP-210|https://cwiki.apache.org/confluence/display/KAFKA/KIP-210+-+Provide+for+custom+error+handling++when+Kafka+Streams+fails+to+produce], > an exception handler for the write path was introduced. This exception > handler covers exception that are raised in the producer callback. > However, serialization happens before the data is handed to the producer with > Kafka Streams itself and the producer uses `byte[]/byte[]` key-value-pair > types. > Thus, we might want to extend the ProductionExceptionHandler to cover > serialization exception, too, to skip over corrupted output messages. An > example could be a "String" message that contains invalid JSON and should be > serialized as JSON. > KIP-399 (not voted yet; feel free to pick it up): > [https://cwiki.apache.org/confluence/display/KAFKA/KIP-399%3A+Extend+ProductionExceptionHandler+to+cover+serialization+exceptions] -- This message was sent by Atlassian Jira (v8.3.4#803005)
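The "skip over corrupted output messages" use case might look like the following sketch. The interface and enum here are simplified stand-ins, not the real org.apache.kafka.streams.errors.ProductionExceptionHandler API; they only illustrate the shape an extension for serialization errors could take.

```java
public class SerializationHandlerSketch {
    public enum Response { CONTINUE, FAIL }

    // Hypothetical extension point: consulted when serialization of an
    // outgoing record fails (which happens inside Streams, before the
    // byte[]/byte[] producer ever sees the record).
    public interface SerializationExceptionHandler {
        Response handle(Object record, Exception serializationError);
    }

    // Example policy: log and skip any record that fails to serialize,
    // e.g. a String value holding invalid JSON fed to a JSON serializer,
    // instead of killing the stream thread.
    public static class LogAndContinueHandler implements SerializationExceptionHandler {
        @Override
        public Response handle(Object record, Exception serializationError) {
            System.err.println("Skipping unserializable record: "
                    + serializationError.getMessage());
            return Response.CONTINUE;
        }
    }
}
```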
[jira] [Assigned] (KAFKA-12574) Deprecate eos-alpha
[ https://issues.apache.org/jira/browse/KAFKA-12574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman reassigned KAFKA-12574: -- Assignee: A. Sophie Blee-Goldman > Deprecate eos-alpha > --- > > Key: KAFKA-12574 > URL: https://issues.apache.org/jira/browse/KAFKA-12574 > Project: Kafka > Issue Type: Improvement > Components: streams >Reporter: A. Sophie Blee-Goldman >Assignee: A. Sophie Blee-Goldman >Priority: Blocker > Labels: needs-kip > Fix For: 3.0.0 > > > In KIP-447 we introduced a new thread-producer which is capable of > exactly-once semantics across multiple tasks. The new mode of EOS, called > eos-beta, is intended to eventually be the preferred processing mode for EOS > as it improves the performance and scaling of partitions/tasks. The only > downside is that it requires brokers to be on version 2.5+ in order to > understand the latest APIs that are necessary for this thread-producer. > We should consider deprecating the eos-alpha config, ie > StreamsConfig.EXACTLY_ONCE, to encourage new and existing EOS users to migrate > to the new-and-improved processing mode, and upgrade their brokers if > necessary. > Eventually we would like to be able to remove the eos-alpha code paths from > Streams as this will help to simplify the logic and reduce the processing > mode branching. But since this will break client-broker compatibility, and > 2.5 is still a relatively recent version, we probably can't actually remove > eos-alpha in the near future. -- This message was sent by Atlassian Jira (v8.3.4#803005)
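For reference, switching an application from eos-alpha to eos-beta is a one-line config change. The sketch below uses the literal config strings; in real code you would use the StreamsConfig constants (PROCESSING_GUARANTEE_CONFIG, EXACTLY_ONCE_BETA), which resolve to these values. The application id and bootstrap servers are placeholder examples, and eos-beta requires all brokers to be on 2.5+.

```java
import java.util.Properties;

public class EosConfigSketch {
    // Builds Streams properties using the new eos-beta processing mode
    // instead of the deprecated "exactly_once" (eos-alpha) value.
    public static Properties eosBetaProps() {
        Properties props = new Properties();
        props.put("application.id", "my-eos-app");        // placeholder
        props.put("bootstrap.servers", "localhost:9092"); // placeholder
        props.put("processing.guarantee", "exactly_once_beta");
        return props;
    }
}
```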
[jira] [Updated] (KAFKA-9295) KTableKTableForeignKeyInnerJoinMultiIntegrationTest#shouldInnerJoinMultiPartitionQueryable
[ https://issues.apache.org/jira/browse/KAFKA-9295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman updated KAFKA-9295: -- Fix Version/s: 3.0.0 > KTableKTableForeignKeyInnerJoinMultiIntegrationTest#shouldInnerJoinMultiPartitionQueryable > -- > > Key: KAFKA-9295 > URL: https://issues.apache.org/jira/browse/KAFKA-9295 > Project: Kafka > Issue Type: Bug > Components: streams, unit tests >Affects Versions: 2.4.0, 2.6.0 >Reporter: Matthias J. Sax >Assignee: Luke Chen >Priority: Critical > Labels: flaky-test > Fix For: 3.0.0 > > > [https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/27106/testReport/junit/org.apache.kafka.streams.integration/KTableKTableForeignKeyInnerJoinMultiIntegrationTest/shouldInnerJoinMultiPartitionQueryable/] > {quote}java.lang.AssertionError: Did not receive all 1 records from topic > output- within 6 ms Expected: is a value equal to or greater than <1> > but: <0> was less than <1> at > org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18) at > org.apache.kafka.streams.integration.utils.IntegrationTestUtils.lambda$waitUntilMinKeyValueRecordsReceived$1(IntegrationTestUtils.java:515) > at > org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:417) > at > org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:385) > at > org.apache.kafka.streams.integration.utils.IntegrationTestUtils.waitUntilMinKeyValueRecordsReceived(IntegrationTestUtils.java:511) > at > org.apache.kafka.streams.integration.utils.IntegrationTestUtils.waitUntilMinKeyValueRecordsReceived(IntegrationTestUtils.java:489) > at > org.apache.kafka.streams.integration.KTableKTableForeignKeyInnerJoinMultiIntegrationTest.verifyKTableKTableJoin(KTableKTableForeignKeyInnerJoinMultiIntegrationTest.java:200) > at > 
org.apache.kafka.streams.integration.KTableKTableForeignKeyInnerJoinMultiIntegrationTest.shouldInnerJoinMultiPartitionQueryable(KTableKTableForeignKeyInnerJoinMultiIntegrationTest.java:183){quote} > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (KAFKA-12637) Remove deprecated PartitionAssignor interface
[ https://issues.apache.org/jira/browse/KAFKA-12637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman resolved KAFKA-12637. Resolution: Fixed > Remove deprecated PartitionAssignor interface > - > > Key: KAFKA-12637 > URL: https://issues.apache.org/jira/browse/KAFKA-12637 > Project: Kafka > Issue Type: Improvement > Components: consumer >Reporter: A. Sophie Blee-Goldman >Assignee: dengziming >Priority: Blocker > Labels: newbie, newbie++ > Fix For: 3.0.0 > > > In KIP-429, we deprecated the existing PartitionAssignor interface in order > to move it out of the internals package and better align the name with other > pluggable Consumer interfaces. We added an adapter to convert from existing > o.a.k.clients.consumer.internals.PartitionAssignor to the new > o.a.k.clients.consumer.ConsumerPartitionAssignor and support the deprecated > interface. This was deprecated in 2.4, so we should be ok to remove it and > the PartitionAssignorAdaptor in 3.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Reopened] (KAFKA-8391) Flaky Test RebalanceSourceConnectorsIntegrationTest#testDeleteConnector
[ https://issues.apache.org/jira/browse/KAFKA-8391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman reopened KAFKA-8391: --- [~rhauch] [~kkonstantine] this test failed again -- based on the error message it looks like it may be a "real" failure this time, not environmental. Stacktrace java.lang.AssertionError: Tasks are imbalanced: localhost:35163=[seq-source11-0, seq-source11-3, seq-source10-1, seq-source12-1] localhost:36961=[seq-source11-1, seq-source10-2, seq-source12-2] localhost:39023=[seq-source11-2, seq-source10-0, seq-source10-3, seq-source12-0, seq-source12-3] at org.junit.Assert.fail(Assert.java:89) at org.junit.Assert.assertTrue(Assert.java:42) at org.apache.kafka.connect.integration.RebalanceSourceConnectorsIntegrationTest.assertConnectorAndTasksAreUniqueAndBalanced(RebalanceSourceConnectorsIntegrationTest.java:365) at org.apache.kafka.test.TestUtils.lambda$waitForCondition$3(TestUtils.java:319) at org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:367) at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:316) at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:300) at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:290) at org.apache.kafka.connect.integration.RebalanceSourceConnectorsIntegrationTest.testDeleteConnector(RebalanceSourceConnectorsIntegrationTest.java:213) https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-10512/5/testReport/org.apache.kafka.connect.integration/RebalanceSourceConnectorsIntegrationTest/Build___JDK_15_and_Scala_2_13___testDeleteConnector/ > Flaky Test RebalanceSourceConnectorsIntegrationTest#testDeleteConnector > --- > > Key: KAFKA-8391 > URL: https://issues.apache.org/jira/browse/KAFKA-8391 > Project: Kafka > Issue Type: Bug > Components: KafkaConnect >Affects Versions: 2.3.0 >Reporter: Matthias J. 
Sax >Assignee: Randall Hauch >Priority: Critical > Labels: flaky-test > Fix For: 2.3.2, 2.6.0, 2.4.2, 2.5.1 > > Attachments: 100-gradle-builds.tar > > > [https://builds.apache.org/job/kafka-pr-jdk11-scala2.12/4747/testReport/junit/org.apache.kafka.connect.integration/RebalanceSourceConnectorsIntegrationTest/testDeleteConnector/] > {quote}java.lang.AssertionError: Condition not met within timeout 3. > Connector tasks did not stop in time. at > org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:375) at > org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:352) at > org.apache.kafka.connect.integration.RebalanceSourceConnectorsIntegrationTest.testDeleteConnector(RebalanceSourceConnectorsIntegrationTest.java:166){quote} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-9013) Flaky Test MirrorConnectorsIntegrationTest#testReplication
[ https://issues.apache.org/jira/browse/KAFKA-9013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17319797#comment-17319797 ] A. Sophie Blee-Goldman commented on KAFKA-9013: --- Failed again: https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-10512/5/testReport/org.apache.kafka.connect.mirror.integration/MirrorConnectorsIntegrationTest/Build___JDK_11_and_Scala_2_13___testReplication__/ > Flaky Test MirrorConnectorsIntegrationTest#testReplication > -- > > Key: KAFKA-9013 > URL: https://issues.apache.org/jira/browse/KAFKA-9013 > Project: Kafka > Issue Type: Bug > Components: KafkaConnect >Reporter: Bruno Cadonna >Priority: Major > Labels: flaky-test > > h1. Stacktrace: > {code:java} > java.lang.AssertionError: Condition not met within timeout 2. Offsets not > translated downstream to primary cluster. > at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:377) > at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:354) > at > org.apache.kafka.connect.mirror.MirrorConnectorsIntegrationTest.testReplication(MirrorConnectorsIntegrationTest.java:239) > {code} > h1. Standard Error > {code} > Standard Error > Oct 09, 2019 11:32:00 PM org.glassfish.jersey.internal.inject.Providers > checkProviderRuntime > WARNING: A provider > org.apache.kafka.connect.runtime.rest.resources.ConnectorPluginsResource > registered in SERVER runtime does not implement any provider interfaces > applicable in the SERVER runtime. Due to constraint configuration problems > the provider > org.apache.kafka.connect.runtime.rest.resources.ConnectorPluginsResource will > be ignored. > Oct 09, 2019 11:32:00 PM org.glassfish.jersey.internal.inject.Providers > checkProviderRuntime > WARNING: A provider > org.apache.kafka.connect.runtime.rest.resources.LoggingResource registered in > SERVER runtime does not implement any provider interfaces applicable in the > SERVER runtime. 
Due to constraint configuration problems the provider > org.apache.kafka.connect.runtime.rest.resources.LoggingResource will be > ignored. > Oct 09, 2019 11:32:00 PM org.glassfish.jersey.internal.inject.Providers > checkProviderRuntime > WARNING: A provider > org.apache.kafka.connect.runtime.rest.resources.ConnectorsResource registered > in SERVER runtime does not implement any provider interfaces applicable in > the SERVER runtime. Due to constraint configuration problems the provider > org.apache.kafka.connect.runtime.rest.resources.ConnectorsResource will be > ignored. > Oct 09, 2019 11:32:00 PM org.glassfish.jersey.internal.inject.Providers > checkProviderRuntime > WARNING: A provider > org.apache.kafka.connect.runtime.rest.resources.RootResource registered in > SERVER runtime does not implement any provider interfaces applicable in the > SERVER runtime. Due to constraint configuration problems the provider > org.apache.kafka.connect.runtime.rest.resources.RootResource will be ignored. > Oct 09, 2019 11:32:01 PM org.glassfish.jersey.internal.Errors logErrors > WARNING: The following warnings have been detected: WARNING: The > (sub)resource method listLoggers in > org.apache.kafka.connect.runtime.rest.resources.LoggingResource contains > empty path annotation. > WARNING: The (sub)resource method listConnectors in > org.apache.kafka.connect.runtime.rest.resources.ConnectorsResource contains > empty path annotation. > WARNING: The (sub)resource method createConnector in > org.apache.kafka.connect.runtime.rest.resources.ConnectorsResource contains > empty path annotation. > WARNING: The (sub)resource method listConnectorPlugins in > org.apache.kafka.connect.runtime.rest.resources.ConnectorPluginsResource > contains empty path annotation. > WARNING: The (sub)resource method serverInfo in > org.apache.kafka.connect.runtime.rest.resources.RootResource contains empty > path annotation. 
> Oct 09, 2019 11:32:02 PM org.glassfish.jersey.internal.inject.Providers > checkProviderRuntime > WARNING: A provider > org.apache.kafka.connect.runtime.rest.resources.ConnectorPluginsResource > registered in SERVER runtime does not implement any provider interfaces > applicable in the SERVER runtime. Due to constraint configuration problems > the provider > org.apache.kafka.connect.runtime.rest.resources.ConnectorPluginsResource will > be ignored. > Oct 09, 2019 11:32:02 PM org.glassfish.jersey.internal.inject.Providers > checkProviderRuntime > WARNING: A provider > org.apache.kafka.connect.runtime.rest.resources.ConnectorsResource registered > in SERVER runtime does not implement any provider interfaces applicable in > the SERVER runtime. Due to constraint configuration problems the provider >
[jira] [Reopened] (KAFKA-12284) Flaky Test MirrorConnectorsIntegrationSSLTest#testOneWayReplicationWithAutoOffsetSync
[ https://issues.apache.org/jira/browse/KAFKA-12284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman reopened KAFKA-12284: Assignee: (was: Luke Chen) Failed again, on both the SSL and plain version of this test: https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-10512/5/testReport/org.apache.kafka.connect.mirror.integration/MirrorConnectorsIntegrationSSLTest/Build___JDK_15_and_Scala_2_13___testOneWayReplicationWithAutoOffsetSync__/ https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-10512/5/testReport/org.apache.kafka.connect.mirror.integration/MirrorConnectorsIntegrationTest/Build___JDK_15_and_Scala_2_13___testOneWayReplicationWithAutoOffsetSync__/ java.lang.RuntimeException: java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.TimeoutException: The request timed out. at org.apache.kafka.connect.util.clusters.EmbeddedKafkaCluster.createTopic(EmbeddedKafkaCluster.java:365) at org.apache.kafka.connect.util.clusters.EmbeddedKafkaCluster.createTopic(EmbeddedKafkaCluster.java:340) at org.apache.kafka.connect.mirror.integration.MirrorConnectorsIntegrationBaseTest.createTopics(MirrorConnectorsIntegrationBaseTest.java:609) > Flaky Test > MirrorConnectorsIntegrationSSLTest#testOneWayReplicationWithAutoOffsetSync > - > > Key: KAFKA-12284 > URL: https://issues.apache.org/jira/browse/KAFKA-12284 > Project: Kafka > Issue Type: Test > Components: mirrormaker, unit tests >Reporter: Matthias J. Sax >Priority: Critical > Labels: flaky-test > Fix For: 3.0.0 > > > [https://github.com/apache/kafka/pull/9997/checks?check_run_id=1820178470] > {quote} {{java.lang.RuntimeException: > java.util.concurrent.ExecutionException: > org.apache.kafka.common.errors.TopicExistsException: Topic > 'primary.test-topic-2' already exists. 
> at > org.apache.kafka.connect.util.clusters.EmbeddedKafkaCluster.createTopic(EmbeddedKafkaCluster.java:366) > at > org.apache.kafka.connect.util.clusters.EmbeddedKafkaCluster.createTopic(EmbeddedKafkaCluster.java:341) > at > org.apache.kafka.connect.mirror.integration.MirrorConnectorsIntegrationBaseTest.testOneWayReplicationWithAutoOffsetSync(MirrorConnectorsIntegrationBaseTest.java:419)}} > [...] > > {{Caused by: java.util.concurrent.ExecutionException: > org.apache.kafka.common.errors.TopicExistsException: Topic > 'primary.test-topic-2' already exists. > at > org.apache.kafka.common.internals.KafkaFutureImpl.wrapAndThrow(KafkaFutureImpl.java:45) > at > org.apache.kafka.common.internals.KafkaFutureImpl.access$000(KafkaFutureImpl.java:32) > at > org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:89) > at > org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:260) > at > org.apache.kafka.connect.util.clusters.EmbeddedKafkaCluster.createTopic(EmbeddedKafkaCluster.java:364) > ... 92 more > Caused by: org.apache.kafka.common.errors.TopicExistsException: Topic > 'primary.test-topic-2' already exists.}} > {quote} > STDOUT > {quote} {{2021-02-03 04ː19ː15,975] ERROR [MirrorHeartbeatConnector|task-0] > WorkerSourceTask\{id=MirrorHeartbeatConnector-0} failed to send record to > heartbeats: (org.apache.kafka.connect.runtime.WorkerSourceTask:354) > org.apache.kafka.common.KafkaException: Producer is closed forcefully. > at > org.apache.kafka.clients.producer.internals.RecordAccumulator.abortBatches(RecordAccumulator.java:750) > at > org.apache.kafka.clients.producer.internals.RecordAccumulator.abortIncompleteBatches(RecordAccumulator.java:737) > at > org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:282) > at java.lang.Thread.run(Thread.java:748)}}{quote} > {quote} {{[2021-02-03 04ː19ː36,767] ERROR Could not check connector state > info. 
> (org.apache.kafka.connect.util.clusters.EmbeddedConnectClusterAssertions:420) > org.apache.kafka.connect.runtime.rest.errors.ConnectRestException: Could not > read connector state. Error response: \{"error_code":404,"message":"No status > found for connector MirrorSourceConnector"} > at > org.apache.kafka.connect.util.clusters.EmbeddedConnectCluster.connectorStatus(EmbeddedConnectCluster.java:466) > at > org.apache.kafka.connect.util.clusters.EmbeddedConnectClusterAssertions.checkConnectorState(EmbeddedConnectClusterAssertions.java:413) > at > org.apache.kafka.connect.util.clusters.EmbeddedConnectClusterAssertions.lambda$assertConnectorAndAtLeastNumTasksAreRunning$16(EmbeddedConnectClusterAssertions.java:286) > at > org.apache.kafka.test.TestUtils.lambda$waitForCondition$3(TestUtils.java:303) > at >
[jira] [Commented] (KAFKA-12629) Flaky Test RaftClusterTest
[ https://issues.apache.org/jira/browse/KAFKA-12629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17319794#comment-17319794 ] A. Sophie Blee-Goldman commented on KAFKA-12629: I'm seeing at least one failure in RaftClusterTest on almost every PR build lately. Looks like it's typically the same TimeoutException mentioned above. Seen on both testCreateClusterAndCreateAndManyTopics() and testCreateClusterAndCreateAndManyTopicsWithManyPartitions() > Flaky Test RaftClusterTest > -- > > Key: KAFKA-12629 > URL: https://issues.apache.org/jira/browse/KAFKA-12629 > Project: Kafka > Issue Type: Test > Components: core, unit tests >Reporter: Matthias J. Sax >Priority: Critical > Labels: flaky-test > > {quote} {{java.util.concurrent.ExecutionException: > java.lang.ClassNotFoundException: > org.apache.kafka.controller.NoOpSnapshotWriterBuilder > at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122) > at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191) > at > kafka.testkit.KafkaClusterTestKit.startup(KafkaClusterTestKit.java:364) > at > kafka.server.RaftClusterTest.testCreateClusterAndCreateAndManyTopicsWithManyPartitions(RaftClusterTest.scala:181)}}{quote} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-12654) separate checkAllSubscriptionEqual and getConsumerToOwnedPartitions methods
[ https://issues.apache.org/jira/browse/KAFKA-12654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17319675#comment-17319675 ] A. Sophie Blee-Goldman commented on KAFKA-12654: To be honest, the sticky assignment algorithm for the general case is so slow that the additional time spent populating this map is probably negligible. That said, the general algorithm ends up populating exactly the same map in #prepopulateCurrentAssignments, so we could just pass it along to the generalAssign as well. But it also has to populate an additional map (prevAssignment) so it would still need to loop over and deserialize this info again anyways > separate checkAllSubscriptionEqual and getConsumerToOwnedPartitions methods > --- > > Key: KAFKA-12654 > URL: https://issues.apache.org/jira/browse/KAFKA-12654 > Project: Kafka > Issue Type: Improvement >Reporter: Luke Chen >Assignee: Luke Chen >Priority: Major > > Currently, when entering sticky assignor, we'll check if all consumers have > the same subscription to decide which assignor to use (constrained assignor > or general assignor). While checking the subscription, we also deserialize > the subscription user data to get the partitions owned by the consumer > (consumerToOwnedPartitions). However, the consumerToOwnedPartitions info is > not used in general assignor (we'll actually deserialize it inside general > assignor), so, we don't need to deserialize data for general assignor again. > We should separate these 2 things into 2 methods, and only deserialize the > user data for constrained assignor, to improve the general sticky assignor > performance. -- This message was sent by Atlassian Jira (v8.3.4#803005)
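The split the ticket proposes can be sketched as follows. Subscriptions are modeled here as plain topic lists; the real code works on consumer Subscription objects and deserializes owned partitions from user data. The point is only the separation: the cheap equality check runs for everyone, while the deserialization work is confined to the constrained-assignor path.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;

public class StickyAssignorSplitSketch {
    // Cheap check used to pick between the constrained and general
    // assignors: do all consumers subscribe to the same topics?
    // Previously this was fused with deserializing each consumer's
    // owned partitions, which the general assignor never uses.
    public static boolean allSubscriptionsEqual(Map<String, List<String>> consumerToTopics) {
        return new HashSet<>(consumerToTopics.values()).size() <= 1;
    }
}
```

Per the comment above, the general assignor rebuilds the same owned-partitions map in #prepopulateCurrentAssignments anyway, so the practical win of the split may be modest.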
[jira] [Commented] (KAFKA-12650) NPE in InternalTopicManager#cleanUpCreatedTopics
[ https://issues.apache.org/jira/browse/KAFKA-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17318404#comment-17318404 ] A. Sophie Blee-Goldman commented on KAFKA-12650: This was from InternalTopicManagerTest.shouldOnlyRetryNotSuccessfulFuturesDuringSetup() btw. https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-10411/12/testReport/junit/org.apache.kafka.streams.processor.internals/InternalTopicManagerTest/shouldOnlyRetryNotSuccessfulFuturesDuringSetup/ > NPE in InternalTopicManager#cleanUpCreatedTopics > > > Key: KAFKA-12650 > URL: https://issues.apache.org/jira/browse/KAFKA-12650 > Project: Kafka > Issue Type: Bug > Components: streams >Reporter: A. Sophie Blee-Goldman >Priority: Blocker > Fix For: 3.0.0 > > > {code:java} > java.lang.NullPointerException > at > org.apache.kafka.streams.processor.internals.InternalTopicManager.cleanUpCreatedTopics(InternalTopicManager.java:675) > at > org.apache.kafka.streams.processor.internals.InternalTopicManager.maybeThrowTimeoutExceptionDuringSetup(InternalTopicManager.java:755) > at > org.apache.kafka.streams.processor.internals.InternalTopicManager.processCreateTopicResults(InternalTopicManager.java:652) > at > org.apache.kafka.streams.processor.internals.InternalTopicManager.setup(InternalTopicManager.java:599) > at > org.apache.kafka.streams.processor.internals.InternalTopicManagerTest.shouldOnlyRetryNotSuccessfulFuturesDuringSetup(InternalTopicManagerTest.java:180) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-12650) NPE in InternalTopicManager#cleanUpCreatedTopics
[ https://issues.apache.org/jira/browse/KAFKA-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman updated KAFKA-12650: --- Fix Version/s: 3.0.0 > NPE in InternalTopicManager#cleanUpCreatedTopics > > > Key: KAFKA-12650 > URL: https://issues.apache.org/jira/browse/KAFKA-12650 > Project: Kafka > Issue Type: Bug > Components: streams >Reporter: A. Sophie Blee-Goldman >Priority: Major > Fix For: 3.0.0 > > > {code:java} > java.lang.NullPointerException > at > org.apache.kafka.streams.processor.internals.InternalTopicManager.cleanUpCreatedTopics(InternalTopicManager.java:675) > at > org.apache.kafka.streams.processor.internals.InternalTopicManager.maybeThrowTimeoutExceptionDuringSetup(InternalTopicManager.java:755) > at > org.apache.kafka.streams.processor.internals.InternalTopicManager.processCreateTopicResults(InternalTopicManager.java:652) > at > org.apache.kafka.streams.processor.internals.InternalTopicManager.setup(InternalTopicManager.java:599) > at > org.apache.kafka.streams.processor.internals.InternalTopicManagerTest.shouldOnlyRetryNotSuccessfulFuturesDuringSetup(InternalTopicManagerTest.java:180) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-12650) NPE in InternalTopicManager#cleanUpCreatedTopics
[ https://issues.apache.org/jira/browse/KAFKA-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman updated KAFKA-12650: --- Priority: Blocker (was: Major) > NPE in InternalTopicManager#cleanUpCreatedTopics > > > Key: KAFKA-12650 > URL: https://issues.apache.org/jira/browse/KAFKA-12650 > Project: Kafka > Issue Type: Bug > Components: streams >Reporter: A. Sophie Blee-Goldman >Priority: Blocker > Fix For: 3.0.0 > > > {code:java} > java.lang.NullPointerException > at > org.apache.kafka.streams.processor.internals.InternalTopicManager.cleanUpCreatedTopics(InternalTopicManager.java:675) > at > org.apache.kafka.streams.processor.internals.InternalTopicManager.maybeThrowTimeoutExceptionDuringSetup(InternalTopicManager.java:755) > at > org.apache.kafka.streams.processor.internals.InternalTopicManager.processCreateTopicResults(InternalTopicManager.java:652) > at > org.apache.kafka.streams.processor.internals.InternalTopicManager.setup(InternalTopicManager.java:599) > at > org.apache.kafka.streams.processor.internals.InternalTopicManagerTest.shouldOnlyRetryNotSuccessfulFuturesDuringSetup(InternalTopicManagerTest.java:180) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-12650) NPE in InternalTopicManager#cleanUpCreatedTopics
[ https://issues.apache.org/jira/browse/KAFKA-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17318403#comment-17318403 ] A. Sophie Blee-Goldman commented on KAFKA-12650: cc [~cadonna] can you take a look at this? > NPE in InternalTopicManager#cleanUpCreatedTopics > > > Key: KAFKA-12650 > URL: https://issues.apache.org/jira/browse/KAFKA-12650 > Project: Kafka > Issue Type: Bug > Components: streams >Reporter: A. Sophie Blee-Goldman >Priority: Major > > {code:java} > java.lang.NullPointerException > at > org.apache.kafka.streams.processor.internals.InternalTopicManager.cleanUpCreatedTopics(InternalTopicManager.java:675) > at > org.apache.kafka.streams.processor.internals.InternalTopicManager.maybeThrowTimeoutExceptionDuringSetup(InternalTopicManager.java:755) > at > org.apache.kafka.streams.processor.internals.InternalTopicManager.processCreateTopicResults(InternalTopicManager.java:652) > at > org.apache.kafka.streams.processor.internals.InternalTopicManager.setup(InternalTopicManager.java:599) > at > org.apache.kafka.streams.processor.internals.InternalTopicManagerTest.shouldOnlyRetryNotSuccessfulFuturesDuringSetup(InternalTopicManagerTest.java:180) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-12650) NPE in InternalTopicManager#cleanUpCreatedTopics
A. Sophie Blee-Goldman created KAFKA-12650: -- Summary: NPE in InternalTopicManager#cleanUpCreatedTopics Key: KAFKA-12650 URL: https://issues.apache.org/jira/browse/KAFKA-12650 Project: Kafka Issue Type: Bug Components: streams Reporter: A. Sophie Blee-Goldman {code:java} java.lang.NullPointerException at org.apache.kafka.streams.processor.internals.InternalTopicManager.cleanUpCreatedTopics(InternalTopicManager.java:675) at org.apache.kafka.streams.processor.internals.InternalTopicManager.maybeThrowTimeoutExceptionDuringSetup(InternalTopicManager.java:755) at org.apache.kafka.streams.processor.internals.InternalTopicManager.processCreateTopicResults(InternalTopicManager.java:652) at org.apache.kafka.streams.processor.internals.InternalTopicManager.setup(InternalTopicManager.java:599) at org.apache.kafka.streams.processor.internals.InternalTopicManagerTest.shouldOnlyRetryNotSuccessfulFuturesDuringSetup(InternalTopicManagerTest.java:180) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-12649) Expose cache resizer for dynamic memory allocation
A. Sophie Blee-Goldman created KAFKA-12649: -- Summary: Expose cache resizer for dynamic memory allocation Key: KAFKA-12649 URL: https://issues.apache.org/jira/browse/KAFKA-12649 Project: Kafka Issue Type: New Feature Components: streams Reporter: A. Sophie Blee-Goldman When we added the add/removeStreamThread() APIs to Streams, we implemented a cache resizer to adjust the allocated cache memory per thread accordingly. We could expose that to users as a public API to let them dynamically increase/decrease the amount of memory for the Streams cache (ie cache.max.bytes.buffering). -- This message was sent by Atlassian Jira (v8.3.4#803005)
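The arithmetic the internal resizer performs when threads are added or removed is simple: the total cache.max.bytes.buffering budget is divided evenly across live StreamThreads. A minimal sketch, with a hypothetical class name (the real resizer is internal to Kafka Streams):

```java
// Hypothetical sketch of the per-thread cache split behind KAFKA-12649:
// re-divide the total cache budget whenever the live thread count changes.
public class CacheResizeSketch {

    // Bytes each thread's cache receives for a given total budget.
    public static long cachePerThread(long totalCacheBytes, int numLiveThreads) {
        if (numLiveThreads <= 0) {
            throw new IllegalArgumentException("need at least one live thread");
        }
        return totalCacheBytes / numLiveThreads;
    }
}
```

A public resize API would effectively let users change `totalCacheBytes` at runtime and have this split re-applied.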
[jira] [Updated] (KAFKA-12648) Experiment with resilient isomorphic topologies
[ https://issues.apache.org/jira/browse/KAFKA-12648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman updated KAFKA-12648: --- Description: We're not ready to make this a public feature yet, but I want to start experimenting with some ways to make Streams applications more resilient in the face of isomorphic topological changes (eg adding/removing/reordering subtopologies). If this turns out to be stable and useful, we can circle back on doing a KIP to bring this feature into the public API was:We're not ready to make this a public feature yet, but I want to start experimenting with some ways to make Streams applications more resilient in the face of isomorphic topological changes (eg adding/removing/reordering subtopologies) > Experiment with resilient isomorphic topologies > --- > > Key: KAFKA-12648 > URL: https://issues.apache.org/jira/browse/KAFKA-12648 > Project: Kafka > Issue Type: New Feature > Components: streams >Reporter: A. Sophie Blee-Goldman >Assignee: A. Sophie Blee-Goldman >Priority: Major > > We're not ready to make this a public feature yet, but I want to start > experimenting with some ways to make Streams applications more resilient in > the face of isomorphic topological changes (eg adding/removing/reordering > subtopologies). > If this turns out to be stable and useful, we can circle back on doing a KIP > to bring this feature into the public API -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-12648) Experiment with resilient isomorphic topologies
A. Sophie Blee-Goldman created KAFKA-12648: -- Summary: Experiment with resilient isomorphic topologies Key: KAFKA-12648 URL: https://issues.apache.org/jira/browse/KAFKA-12648 Project: Kafka Issue Type: New Feature Components: streams Reporter: A. Sophie Blee-Goldman Assignee: A. Sophie Blee-Goldman We're not ready to make this a public feature yet, but I want to start experimenting with some ways to make Streams applications more resilient in the face of isomorphic topological changes (eg adding/removing/reordering subtopologies) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-12638) Remove default implementation of ConsumerRebalanceListener#onPartitionsLost
[ https://issues.apache.org/jira/browse/KAFKA-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317638#comment-17317638 ] A. Sophie Blee-Goldman commented on KAFKA-12638: If you're interested in this ticket, we can't do the whole thing because of the compatibility concerns I mentioned but feel free to pick up the first part, and just log a warning if the user has not implemented the #onPartitionsLost callback. Something like this: https://github.com/apache/kafka/blob/2.8/streams/src/main/java/org/apache/kafka/streams/state/RocksDBConfigSetter.java#L61 > Remove default implementation of ConsumerRebalanceListener#onPartitionsLost > --- > > Key: KAFKA-12638 > URL: https://issues.apache.org/jira/browse/KAFKA-12638 > Project: Kafka > Issue Type: Improvement > Components: consumer >Reporter: A. Sophie Blee-Goldman >Priority: Major > > When we added the #onPartitionsLost callback to the ConsumerRebalanceListener > in KIP-429, we gave it a default implementation that just invoked the > existing #onPartitionsRevoked method for backwards compatibility. This is > somewhat inconvenient, since we generally want to invoke #onPartitionsLost in > order to skip the committing of offsets on revoked partitions, which is > exactly what #onPartitionsRevoked does. > I don't think we can just remove it in 3.0 since we haven't indicated that we > "deprecated" the default implementation or logged a warning that we intend to > remove the default in a future release (as we did for the > RocksDBConfigSetter#close method in Streams, for example). We should try to > add such a warning now, so we can remove it in a future release. -- This message was sent by Atlassian Jira (v8.3.4#803005)
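The "warn but keep the fallback" pattern described above might look like the sketch below. This is not the actual ConsumerRebalanceListener source — the interface here is a simplified, hypothetical stand-in used only to show a default method that warns before delegating to #onPartitionsRevoked.

```java
import java.util.List;

// Sketch of the warn-on-default pattern suggested for KAFKA-12638;
// names and the interface itself are illustrative, not Kafka's API.
public class LostCallbackSketch {

    public interface RebalanceListener {
        void onPartitionsRevoked(List<String> partitions);

        // Default delegates for backwards compatibility, but warns so the
        // fallback can eventually be removed, as the ticket proposes.
        default void onPartitionsLost(List<String> partitions) {
            System.err.println("WARN: onPartitionsLost is not overridden; "
                + "falling back to onPartitionsRevoked. This default will be "
                + "removed in a future release.");
            onPartitionsRevoked(partitions);
        }
    }

    // Exercises the default path: a listener that only implements the
    // revoked callback still has its revoked logic run on "lost".
    public static int simulateLost() {
        final int[] revokedCalls = {0};
        RebalanceListener listener = partitions -> revokedCalls[0]++;
        listener.onPartitionsLost(List.of("topic-0"));
        return revokedCalls[0];
    }
}
```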
[jira] [Commented] (KAFKA-12638) Remove default implementation of ConsumerRebalanceListener#onPartitionsLost
[ https://issues.apache.org/jira/browse/KAFKA-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317635#comment-17317635 ] A. Sophie Blee-Goldman commented on KAFKA-12638: Nope, no such things as credits in Kafka -- technically anyone can pick up anything, and the major deciding factor is just your own confidence and familiarity with Kafka. If you're relatively new to Kafka and pick up something large that you need a lot of help with, you might struggle to get it done just because everyone who works on this is always busy (not because they don't want to help -- they do). We try to label things with newbie and/or newbie++ to indicate good entry-level tickets. That said, it never hurts to ask before picking something up -- often if it's a blocker ticket, or maybe critical, then it's likely that the person who reported it or someone they know already plan to work on it. But you can always ask -- and "Major" is the default priority which many tickets are just left at, so don't let that stop you. > Remove default implementation of ConsumerRebalanceListener#onPartitionsLost > --- > > Key: KAFKA-12638 > URL: https://issues.apache.org/jira/browse/KAFKA-12638 > Project: Kafka > Issue Type: Improvement > Components: consumer >Reporter: A. Sophie Blee-Goldman >Priority: Major > > When we added the #onPartitionsLost callback to the ConsumerRebalanceListener > in KIP-429, we gave it a default implementation that just invoked the > existing #onPartitionsRevoked method for backwards compatibility. This is > somewhat inconvenient, since we generally want to invoke #onPartitionsLost in > order to skip the committing of offsets on revoked partitions, which is > exactly what #onPartitionsRevoked does. 
> I don't think we can just remove it in 3.0 since we haven't indicated that we > "deprecated" the default implementation or logged a warning that we intend to > remove the default in a future release (as we did for the > RocksDBConfigSetter#close method in Streams, for example). We should try to > add such a warning now, so we can remove it in a future release. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-12492) Formatting of example RocksDBConfigSetter is messed up
[ https://issues.apache.org/jira/browse/KAFKA-12492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317629#comment-17317629 ] A. Sophie Blee-Goldman commented on KAFKA-12492: Merged, thanks for the PR! Unfortunately we just barely missed the 2.8 release, as John cut the RC earlier today. If you want to see this fix in the 2.8 docs then you'll need to submit this exact PR against the kafka-site repo as we discussed, but the 2.8 RC is still under vote at the moment so you'd need to wait for that to be released at which point the docs in kafka/2.8 will be copied over to a new 28 subdirectory in kafka-site. > Formatting of example RocksDBConfigSetter is messed up > -- > > Key: KAFKA-12492 > URL: https://issues.apache.org/jira/browse/KAFKA-12492 > Project: Kafka > Issue Type: Bug > Components: documentation, streams >Reporter: A. Sophie Blee-Goldman >Assignee: Ben Chen >Priority: Trivial > Labels: docs, newbie > Fix For: 3.0.0 > > > See the example implementation class CustomRocksDBConfig in the docs for the > rocksdb.config.setter > https://kafka.apache.org/documentation/streams/developer-guide/config-streams.html#rocksdb-config-setter -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-12634) Should checkpoint after restore finished
[ https://issues.apache.org/jira/browse/KAFKA-12634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317632#comment-17317632 ] A. Sophie Blee-Goldman commented on KAFKA-12634: I don't think bulk loading should affect the checkpoint, the data up to an offset is either in the state store or it isn't. I'd be fine with just doing a small KIP to fix in 3.0+ though > Should checkpoint after restore finished > > > Key: KAFKA-12634 > URL: https://issues.apache.org/jira/browse/KAFKA-12634 > Project: Kafka > Issue Type: Improvement > Components: streams >Affects Versions: 2.5.0 >Reporter: Matthias J. Sax >Priority: Major > > For state stores, Kafka Streams maintains local checkpoint files to track the > offsets of the state store changelog topics. The checkpoint is updated on > commit or when a task is closed cleanly. > However, after a successful restore, the checkpoint is not written. Thus, if > an instance crashes after restore but before committing, even if the state is > on local disk the checkpoint file is missing (indicating that there is no > state) and thus state would be restored from scratch. > While for most cases, the time between restore end and next commit is small, > there are cases when this time could be large, for example if there is no new > input data to be processed (if there is no input data, the commit would be > skipped). > Thus, we should write the checkpoint file after a successful restore to close > this gap (or course, only for at-least-once processing). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-12492) Formatting of example RocksDBConfigSetter is messed up
[ https://issues.apache.org/jira/browse/KAFKA-12492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman updated KAFKA-12492: --- Fix Version/s: 3.0.0 > Formatting of example RocksDBConfigSetter is messed up > -- > > Key: KAFKA-12492 > URL: https://issues.apache.org/jira/browse/KAFKA-12492 > Project: Kafka > Issue Type: Bug > Components: documentation, streams >Reporter: A. Sophie Blee-Goldman >Assignee: Ben Chen >Priority: Trivial > Labels: docs, newbie > Fix For: 3.0.0 > > > See the example implementation class CustomRocksDBConfig in the docs for the > rocksdb.config.setter > https://kafka.apache.org/documentation/streams/developer-guide/config-streams.html#rocksdb-config-setter -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-12638) Remove default implementation of ConsumerRebalanceListener#onPartitionsLost
A. Sophie Blee-Goldman created KAFKA-12638: -- Summary: Remove default implementation of ConsumerRebalanceListener#onPartitionsLost Key: KAFKA-12638 URL: https://issues.apache.org/jira/browse/KAFKA-12638 Project: Kafka Issue Type: Improvement Components: consumer Reporter: A. Sophie Blee-Goldman When we added the #onPartitionsLost callback to the ConsumerRebalanceListener in KIP-429, we gave it a default implementation that just invoked the existing #onPartitionsRevoked method for backwards compatibility. This is somewhat inconvenient, since we generally want to invoke #onPartitionsLost in order to skip the committing of offsets on revoked partitions, which is exactly what #onPartitionsRevoked does. I don't think we can just remove it in 3.0 since we haven't indicated that we "deprecated" the default implementation or logged a warning that we intend to remove the default in a future release (as we did for the RocksDBConfigSetter#close method in Streams, for example). We should try to add such a warning now, so we can remove it in a future release. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-12637) Remove deprecated PartitionAssignor interface
[ https://issues.apache.org/jira/browse/KAFKA-12637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman updated KAFKA-12637: --- Description: In KIP-429, we deprecated the existing PartitionAssignor interface in order to move it out of the internals package and better align the name with other pluggable Consumer interfaces. We added an adapter to convert from existing o.a.k.clients.consumer.internals.PartitionAssignor to the new o.a.k.clients.consumer.ConsumerPartitionAssignor and support the deprecated interface. This was deprecated in 2.4, so we should be ok to remove it and the PartitionAssignorAdaptor in 3.0 (was: In KIP-429, we deprecated the existing PartitionAssignor interface in order to move it out of the internals package and better align the name with other pluggable Consumer interfaces. We added an adapter to convert from existing o.a.k.clients.consumer.internals.PartitionAssignor to the new o.a.k.clients.consumer.ConsumerPartitionAssignor and support the deprecated interface. This was deprecated in 2.4, so we should be ok to remove it and the adaptor in 3.0) > Remove deprecated PartitionAssignor interface > - > > Key: KAFKA-12637 > URL: https://issues.apache.org/jira/browse/KAFKA-12637 > Project: Kafka > Issue Type: Improvement > Components: consumer >Reporter: A. Sophie Blee-Goldman >Priority: Blocker > Labels: newbie, newbie++ > Fix For: 3.0.0 > > > In KIP-429, we deprecated the existing PartitionAssignor interface in order > to move it out of the internals package and better align the name with other > pluggable Consumer interfaces. We added an adapter to convert from existing > o.a.k.clients.consumer.internals.PartitionAssignor to the new > o.a.k.clients.consumer.ConsumerPartitionAssignor and support the deprecated > interface. This was deprecated in 2.4, so we should be ok to remove it and > the PartitionAssignorAdaptor in 3.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-12637) Remove deprecated PartitionAssignor interface
[ https://issues.apache.org/jira/browse/KAFKA-12637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman updated KAFKA-12637: --- Labels: newbie newbie++ (was: ) > Remove deprecated PartitionAssignor interface > - > > Key: KAFKA-12637 > URL: https://issues.apache.org/jira/browse/KAFKA-12637 > Project: Kafka > Issue Type: Improvement > Components: consumer >Reporter: A. Sophie Blee-Goldman >Priority: Blocker > Labels: newbie, newbie++ > Fix For: 3.0.0 > > > In KIP-429, we deprecated the existing PartitionAssignor interface in order > to move it out of the internals package and better align the name with other > pluggable Consumer interfaces. We added an adapter to convert from existing > o.a.k.clients.consumer.internals.PartitionAssignor to the new > o.a.k.clients.consumer.ConsumerPartitionAssignor and support the deprecated > interface. This was deprecated in 2.4, so we should be ok to remove it and > the adaptor in 3.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (KAFKA-12637) Remove deprecated PartitionAssignor interface
A. Sophie Blee-Goldman created KAFKA-12637: -- Summary: Remove deprecated PartitionAssignor interface Key: KAFKA-12637 URL: https://issues.apache.org/jira/browse/KAFKA-12637 Project: Kafka Issue Type: Improvement Components: consumer Reporter: A. Sophie Blee-Goldman Fix For: 3.0.0 In KIP-429, we deprecated the existing PartitionAssignor interface in order to move it out of the internals package and better align the name with other pluggable Consumer interfaces. We added an adapter to convert from existing o.a.k.clients.consumer.internals.PartitionAssignor to the new o.a.k.clients.consumer.ConsumerPartitionAssignor and support the deprecated interface. This was deprecated in 2.4, so we should be ok to remove it and the adaptor in 3.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-8613) Set default grace period to 0
[ https://issues.apache.org/jira/browse/KAFKA-8613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317519#comment-17317519 ] A. Sophie Blee-Goldman commented on KAFKA-8613: --- Just replied on the KIP thread -- will go ahead and assign this ticket to you. It should be pretty straightforward but let me know if you have any questions on the implementation > Set default grace period to 0 > - > > Key: KAFKA-8613 > URL: https://issues.apache.org/jira/browse/KAFKA-8613 > Project: Kafka > Issue Type: Improvement > Components: streams >Reporter: Bruno Cadonna >Assignee: A. Sophie Blee-Goldman >Priority: Blocker > Fix For: 3.0.0 > > > Currently, the grace period is set to retention time if the grace period is > not specified explicitly. The reason for setting the default grace period to > retention time was backward compatibility. Topologies that were implemented > before the introduction of the grace period, added late arriving records to a > window as long as the window existed, i.e., as long as its retention time was > not elapsed. > This unintuitive default grace period has already caused confusion among > users. > For the next major release, we should set the default grace period to > {{Duration.ZERO}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
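The behavior change is easy to state as arithmetic: today an unspecified grace period falls back to whatever is left of the retention time after the window size, while the proposal defaults it to zero. A hedged sketch of that difference, with illustrative method names (not Kafka's API):

```java
// Illustrative sketch of the default-grace change in KAFKA-8613.
public class GraceDefaultSketch {

    // Old behavior: unspecified grace ~ retention minus window size,
    // so late records are accepted for as long as the window is retained.
    public static long oldDefaultGraceMs(long retentionMs, long windowSizeMs) {
        return Math.max(retentionMs - windowSizeMs, 0L);
    }

    // Proposed behavior: unspecified grace defaults to Duration.ZERO.
    public static long newDefaultGraceMs() {
        return 0L;
    }
}
```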
[jira] [Assigned] (KAFKA-8613) Set default grace period to 0
[ https://issues.apache.org/jira/browse/KAFKA-8613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman reassigned KAFKA-8613: - Assignee: Israel Ekpo (was: A. Sophie Blee-Goldman) > Set default grace period to 0 > - > > Key: KAFKA-8613 > URL: https://issues.apache.org/jira/browse/KAFKA-8613 > Project: Kafka > Issue Type: Improvement > Components: streams >Reporter: Bruno Cadonna >Assignee: Israel Ekpo >Priority: Blocker > Fix For: 3.0.0 > > > Currently, the grace period is set to retention time if the grace period is > not specified explicitly. The reason for setting the default grace period to > retention time was backward compatibility. Topologies that were implemented > before the introduction of the grace period, added late arriving records to a > window as long as the window existed, i.e., as long as its retention time was > not elapsed. > This unintuitive default grace period has already caused confusion among > users. > For the next major release, we should set the default grace period to > {{Duration.ZERO}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-12634) Should checkpoint after restore finished
[ https://issues.apache.org/jira/browse/KAFKA-12634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317516#comment-17317516 ] A. Sophie Blee-Goldman commented on KAFKA-12634: Oh. :/ By the way, shouldn't we also checkpoint _during_ a restore? We probably don't want to do so at the same frequency as the commit interval, but it would be a bummer if an app with an hour-long restore crashed in the middle and had to start up again from scratch. Maybe we could pick an interval like 5min that should allow users to avoid losing all progress without severely slowing down the restore even further? > Should checkpoint after restore finished > > > Key: KAFKA-12634 > URL: https://issues.apache.org/jira/browse/KAFKA-12634 > Project: Kafka > Issue Type: Improvement > Components: streams >Affects Versions: 2.5.0 >Reporter: Matthias J. Sax >Priority: Major > > For state stores, Kafka Streams maintains local checkpoint files to track the > offsets of the state store changelog topics. The checkpoint is updated on > commit or when a task is closed cleanly. > However, after a successful restore, the checkpoint is not written. Thus, if > an instance crashes after restore but before committing, even if the state is > on local disk the checkpoint file is missing (indicating that there is no > state) and thus state would be restored from scratch. > While for most cases, the time between restore end and next commit is small, > there are cases when this time could be large, for example if there is no new > input data to be processed (if there is no input data, the commit would be > skipped). > Thus, we should write the checkpoint file after a successful restore to close > this gap (of course, only for at-least-once processing). -- This message was sent by Atlassian Jira (v8.3.4#803005)
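The interval-based restore checkpointing floated in that comment could be sketched as below: persist restore progress at most every `intervalMs`, so a crash mid-restore resumes from the last checkpointed offset instead of from scratch. All names are hypothetical; the real checkpoint is a file write inside Streams' state management, modeled here as a stored offset.

```java
// Hedged sketch of checkpointing *during* restore at a fixed interval
// (e.g. 5 min), per the discussion on KAFKA-12634. Not Kafka's actual code.
public class RestoreCheckpointSketch {

    private final long intervalMs;
    private long lastCheckpointAtMs;
    private long checkpointedOffset = -1L;

    public RestoreCheckpointSketch(long intervalMs, long startMs) {
        this.intervalMs = intervalMs;
        this.lastCheckpointAtMs = startMs;
    }

    // Called as restored batches are applied; returns true when a
    // checkpoint was "written" (here: the offset is recorded).
    public boolean maybeCheckpoint(long nowMs, long restoredOffset) {
        if (nowMs - lastCheckpointAtMs >= intervalMs) {
            checkpointedOffset = restoredOffset; // stand-in for the file write
            lastCheckpointAtMs = nowMs;
            return true;
        }
        return false;
    }

    public long checkpointedOffset() {
        return checkpointedOffset;
    }
}
```

The interval trades restore throughput against how much progress a crash can lose.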
[jira] [Commented] (KAFKA-12506) Expand AdjustStreamThreadCountTest
[ https://issues.apache.org/jira/browse/KAFKA-12506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317487#comment-17317487 ] A. Sophie Blee-Goldman commented on KAFKA-12506: It looks like the build might be broken with some unrelated tests that are just flaky. You can navigate to the tab that says "Tests" at the top on those links you provided and it will display a list of the failed tests. eg https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-10432/7/tests You might not see those when running locally. Some tests are just flaky when run on Jenkins, like org.apache.kafka.streams.integration.KTableKTableForeignKeyInnerJoinMultiIntegrationTest. Others might not show up because you're only running Streams tests locally, and they're from another module, like kafka.server.RaftClusterTest. Don't worry about these flaky tests, especially if they're in core or connect, or anywhere outside of Streams since none of those have a dependency on Streams. If there are tests failing in Streams, but they're not a class or test you touched in the PR, then we need to use some judgement to figure out if it might be due to the changes or just flaky. We can always re-run the tests to try again. > Expand AdjustStreamThreadCountTest > -- > > Key: KAFKA-12506 > URL: https://issues.apache.org/jira/browse/KAFKA-12506 > Project: Kafka > Issue Type: Sub-task > Components: streams, unit tests >Reporter: A. Sophie Blee-Goldman >Assignee: Aviral Srivastava >Priority: Major > Labels: newbie, newbie++ > > Right now the AdjustStreamThreadCountTest runs a minimal topology that just > consumes a single input topic, and doesn't produce any data to this topic.
> Some of the complex concurrency bugs that we've found only showed up when we > had some actual data to process and a stateful topology: > [KAFKA-12503|https://issues.apache.org/jira/browse/KAFKA-12503] KAFKA-12500 > See the umbrella ticket for the list of improvements we need to make this a > more effective integration test -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (KAFKA-12634) Should checkpoint after restore finished
[ https://issues.apache.org/jira/browse/KAFKA-12634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Sophie Blee-Goldman updated KAFKA-12634: --- Affects Version/s: 2.5.0 > Should checkpoint after restore finished > > > Key: KAFKA-12634 > URL: https://issues.apache.org/jira/browse/KAFKA-12634 > Project: Kafka > Issue Type: Improvement > Components: streams >Affects Versions: 2.5.0 >Reporter: Matthias J. Sax >Priority: Major > > For state stores, Kafka Streams maintains local checkpoint files to track the > offsets of the state store changelog topics. The checkpoint is updated on > commit or when a task is closed cleanly. > However, after a successful restore, the checkpoint is not written. Thus, if > an instance crashes after restore but before committing, even if the state is > on local disk the checkpoint file is missing (indicating that there is no > state) and thus state would be restored from scratch. > While for most cases, the time between restore end and next commit is small, > there are cases when this time could be large, for example if there is no new > input data to be processed (if there is no input data, the commit would be > skipped). > Thus, we should write the checkpoint file after a successful restore to close > this gap (or course, only for at-least-once processing). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-12634) Should checkpoint after restore finished
[ https://issues.apache.org/jira/browse/KAFKA-12634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317484#comment-17317484 ] A. Sophie Blee-Goldman commented on KAFKA-12634: Well we had that big refactor of task management in 2.6 so I wouldn't be surprised if it was fixed since then > Should checkpoint after restore finished > > > Key: KAFKA-12634 > URL: https://issues.apache.org/jira/browse/KAFKA-12634 > Project: Kafka > Issue Type: Improvement > Components: streams >Reporter: Matthias J. Sax >Priority: Major > > For state stores, Kafka Streams maintains local checkpoint files to track the > offsets of the state store changelog topics. The checkpoint is updated on > commit or when a task is closed cleanly. > However, after a successful restore, the checkpoint is not written. Thus, if > an instance crashes after restore but before committing, even if the state is > on local disk the checkpoint file is missing (indicating that there is no > state) and thus state would be restored from scratch. > While for most cases, the time between restore end and next commit is small, > there are cases when this time could be large, for example if there is no new > input data to be processed (if there is no input data, the commit would be > skipped). > Thus, we should write the checkpoint file after a successful restore to close > this gap (or course, only for at-least-once processing). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (KAFKA-12559) Add a top-level Streams config for bounding off-heap memory
[ https://issues.apache.org/jira/browse/KAFKA-12559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317460#comment-17317460 ]

A. Sophie Blee-Goldman commented on KAFKA-12559:
------------------------------------------------
Sure thing -- I recommend checking out the example implementation in the Memory Management section (linked to in the ticket description) and reading up on the KIP process if you're not familiar with it yet: https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals

The idea here is to add one (or two) new StreamsConfigs which let the user control the rocksdb memory without having to implement a RocksDBConfigSetter. So if a user has set these configs but not a RocksDBConfigSetter, then we would have a default RocksDBConfigSetter similar to the one in the Memory Management section, where the rocksdb.max.bytes.off.heap config determines the value of TOTAL_OFF_HEAP_MEMORY in the example.

We probably also want to give users a way to control the other parameter in that example, TOTAL_MEMTABLE_MEMORY. The basic formula for memory usage in rocksdb is TOTAL_OFF_HEAP_MEMORY = TOTAL_MEMTABLE_MEMORY + TOTAL_CACHE_MEMORY -- so just think about the best way to let users specify how much of the total memory should go to the block cache vs towards the memtables. Maybe you just want one config for TOTAL_MEMTABLE_MEMORY, or you could consider a config like rocksdb.memtable.to.block.cache.off.heap.memory.ratio, which represents the ratio of memtable memory, i.e. TOTAL_MEMTABLE_MEMORY / TOTAL_OFF_HEAP_MEMORY.

Does that make sense? Let me know if you have any specific questions.

> Add a top-level Streams config for bounding off-heap memory
> -----------------------------------------------------------
>
> Key: KAFKA-12559
> URL: https://issues.apache.org/jira/browse/KAFKA-12559
> Project: Kafka
> Issue Type: Improvement
> Components: streams
> Reporter: A. Sophie Blee-Goldman
> Priority: Major
> Labels: needs-kip, newbie, newbie++
>
> At the moment we provide an example of how to bound the memory usage of rocksdb in the [Memory Management|https://kafka.apache.org/27/documentation/streams/developer-guide/memory-mgmt.html#rocksdb] section of the docs. This requires implementing a custom RocksDBConfigSetter class and setting a number of rocksdb options for relatively advanced concepts and configurations. It seems a fair number of users either fail to find this or consider it to be for more advanced use cases/users. But RocksDB can eat up a lot of off-heap memory, and it's not uncommon for users to come across a {{RocksDBException: Cannot allocate memory}}.
> It would probably be a much better user experience if we implemented this memory bound out-of-the-box and just gave users a top-level StreamsConfig to tune the off-heap memory given to rocksdb, like we have for on-heap cache memory with cache.max.bytes.buffering. More advanced users can continue to fine-tune their memory bounding and apply other configs with a custom config setter, while new or more casual users can cap the off-heap memory without getting their hands dirty with rocksdb.
> I would propose to add the following top-level config:
> rocksdb.max.bytes.off.heap: medium priority, default to -1 (unbounded), valid values are [0, inf]
> I'd also want to consider adding a second, lower-priority top-level config to give users a knob for adjusting how much of that total off-heap memory goes to the block cache + index/filter blocks, and how much of it is afforded to the write buffers. I'm struggling to come up with a good name for this config, but it would be something like:
> rocksdb.memtable.to.block.cache.off.heap.memory.ratio: low priority, default to 0.5, valid values are [0, 1]
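To make the interplay of the two proposed configs concrete, here is a minimal sketch of the arithmetic a default config setter might apply. The config names come straight from the ticket, but the helper class and method names are hypothetical, purely for illustration:

```java
// Hypothetical helper showing how the two proposed configs could be combined.
// rocksdb.max.bytes.off.heap      -> totalOffHeapBytes (total rocksdb budget)
// rocksdb.memtable.to.block.cache.off.heap.memory.ratio -> memtableRatio
public class OffHeapMemorySplit {

    // Memory handed to the write buffers (memtables)
    public static long memtableBytes(long totalOffHeapBytes, double memtableRatio) {
        return (long) (totalOffHeapBytes * memtableRatio);
    }

    // Remainder goes to the block cache + index/filter blocks, per the formula
    // TOTAL_OFF_HEAP_MEMORY = TOTAL_MEMTABLE_MEMORY + TOTAL_CACHE_MEMORY
    public static long blockCacheBytes(long totalOffHeapBytes, double memtableRatio) {
        return totalOffHeapBytes - memtableBytes(totalOffHeapBytes, memtableRatio);
    }

    public static void main(String[] args) {
        long total = 512L * 1024 * 1024; // e.g. rocksdb.max.bytes.off.heap = 512 MiB
        double ratio = 0.5;              // the proposed default ratio
        System.out.println(memtableBytes(total, ratio));   // 268435456
        System.out.println(blockCacheBytes(total, ratio)); // 268435456
    }
}
```

With the proposed default of 0.5, the off-heap budget is simply split evenly between memtables and the block cache; the actual wiring into a default RocksDBConfigSetter (WriteBufferManager, LRUCache) would follow the Memory Management docs.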
[jira] [Commented] (KAFKA-12506) Expand AdjustStreamThreadCountTest
[ https://issues.apache.org/jira/browse/KAFKA-12506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317455#comment-17317455 ]

A. Sophie Blee-Goldman commented on KAFKA-12506:
------------------------------------------------
Hey [~kebab-mai-haddi], if I understand correctly you are running `./gradlew streams:test` locally on your changes and it's passing, but the build is failing on the PR? If so, it may be due to upstream changes, as Jenkins will always merge your branch with the latest trunk before running the unit tests on the PR. I recommend pulling or rebasing from trunk and re-running the tests.

> Expand AdjustStreamThreadCountTest
> ----------------------------------
>
> Key: KAFKA-12506
> URL: https://issues.apache.org/jira/browse/KAFKA-12506
> Project: Kafka
> Issue Type: Sub-task
> Components: streams, unit tests
> Reporter: A. Sophie Blee-Goldman
> Assignee: Aviral Srivastava
> Priority: Major
> Labels: newbie, newbie++
>
> Right now the AdjustStreamThreadCountTest runs a minimal topology that just consumes a single input topic, and doesn't produce any data to this topic. Some of the complex concurrency bugs that we've found only showed up when we had some actual data to process and a stateful topology: [KAFKA-12503|https://issues.apache.org/jira/browse/KAFKA-12503], KAFKA-12500
> See the umbrella ticket for the list of improvements we need to make this a more effective integration test
[jira] [Commented] (KAFKA-8295) Optimize count() using RocksDB merge operator
[ https://issues.apache.org/jira/browse/KAFKA-8295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317410#comment-17317410 ]

A. Sophie Blee-Goldman commented on KAFKA-8295:
-----------------------------------------------
Thanks Sagar -- I think the custom MergeOperator did exist back then, but I was hesitant to assume that it would always perform better, given our experience with the performance of a custom ByteComparator. We would want to run some benchmarks to determine whether this is or isn't the case for the merge operator as well. It's possible that MergeOperator isn't as severely affected by the overhead of crossing the jni, or that it's been improved over the years. I actually recall reading a blog post about rocksdb tuning recently where they mentioned using the merge operator gave them a significant improvement for certain data types -- I think this was in Flink, maybe?

This is definitely something we could add to the StateStore interface. It would require a KIP, and we'd need to be careful since not all store backends will necessarily support this. You could check out [KIP-617: Allow Kafka Streams State Stores to be iterated backwards|https://cwiki.apache.org/confluence/display/KAFKA/KIP-617%3A+Allow+Kafka+Streams+State+Stores+to+be+iterated+backwards] as a reference -- reverse iteration is another rocksdb feature we wanted to expose but which isn't necessarily supported by all store types.

> Optimize count() using RocksDB merge operator
> ---------------------------------------------
>
> Key: KAFKA-8295
> URL: https://issues.apache.org/jira/browse/KAFKA-8295
> Project: Kafka
> Issue Type: Improvement
> Components: streams
> Reporter: A. Sophie Blee-Goldman
> Priority: Major
>
> In addition to regular put/get/delete, RocksDB provides a fourth operation, merge. This essentially provides an optimized read/update/write path in a single operation. One of the built-in (C++) merge operators exposed over the Java API is a counter. We should be able to leverage this for a more efficient implementation of count()
>
> (Note: Unfortunately it seems unlikely we can use this to optimize general aggregations, even if RocksJava allowed for a custom merge operator, unless we provide a way for the user to specify and connect a C++ implemented aggregator -- otherwise we incur too much cost crossing the jni for a net performance benefit)
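To illustrate the read/update/write-in-one-call semantics the ticket describes, here is a plain-Java analogy using `java.util.Map#merge` rather than the rocksdb API -- the class and method below are illustrative only, not the proposed StateStore change:

```java
import java.util.HashMap;
import java.util.Map;

// Conceptual analogy: Map.merge folds the get + compute + put cycle into a
// single call, just as rocksdb's built-in counter merge operator folds
// read/update/write into one native operation for count().
public class MergeCountSketch {

    static long countAfterNIncrements(String key, int n) {
        Map<String, Long> counts = new HashMap<>();
        for (int i = 0; i < n; i++) {
            // Today's path would be: get(key), add 1, put(key, newValue).
            // The merge form expresses the same thing in one operation:
            counts.merge(key, 1L, Long::sum);
        }
        return counts.get(key);
    }

    public static void main(String[] args) {
        System.out.println(countAfterNIncrements("clicks", 3)); // 3
    }
}
```

With rocksdb the equivalent would be configuring the built-in uint64 counter merge operator and issuing merge() calls instead of get()/put() pairs, saving one store round-trip (and potentially one jni crossing) per record.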
[jira] [Commented] (KAFKA-12634) Should checkpoint after restore finished
[ https://issues.apache.org/jira/browse/KAFKA-12634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317393#comment-17317393 ]

A. Sophie Blee-Goldman commented on KAFKA-12634:
------------------------------------------------
Nice catch. Do you know which versions this affects?
[jira] [Commented] (KAFKA-12624) Fix LICENSE in 2.6
[ https://issues.apache.org/jira/browse/KAFKA-12624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316849#comment-17316849 ]

A. Sophie Blee-Goldman commented on KAFKA-12624:
------------------------------------------------
Had to change a handful of dependencies whose versions changed between 2.6 and 2.8 in the cherry-pick:

{code:java}
-javassist-3.27.0-GA
+javassist-3.25.0-GA
+javassist-3.26.0-GA
-scala-collection-compat_2.13-2.3.0
+scala-collection-compat_2.13-2.1.6
-scala-library-2.13.5
+scala-library-2.13.2
-scala-logging_2.13-3.9.2
+scala-logging_2.13-3.9.2
-scala-reflect-2.13.5
+scala-reflect-2.13.2
-snappy-java-1.1.8.1
+snappy-java-1.1.7.3
-zstd-jni-1.4.9-1, see: licenses/zstd-jni-BSD-2-clause
+zstd-jni-1.4.4-7, see: licenses/zstd-jni-BSD-2-clause
{code}

> Fix LICENSE in 2.6
> ------------------
>
> Key: KAFKA-12624
> URL: https://issues.apache.org/jira/browse/KAFKA-12624
> Project: Kafka
> Issue Type: Sub-task
> Reporter: John Roesler
> Assignee: A. Sophie Blee-Goldman
> Priority: Blocker
> Fix For: 2.6.2
>
> Just splitting this out as a sub-task.
> I've fixed the parent ticket on trunk and 2.8.
> You'll need to cherry-pick the fix from 2.8 (see [https://github.com/apache/kafka/pull/10474])
> Then, you can follow the manual verification steps I detailed here: https://issues.apache.org/jira/browse/KAFKA-12622
[jira] [Resolved] (KAFKA-12624) Fix LICENSE in 2.6
[ https://issues.apache.org/jira/browse/KAFKA-12624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

A. Sophie Blee-Goldman resolved KAFKA-12624.
--------------------------------------------
    Resolution: Fixed
[jira] [Updated] (KAFKA-12622) Automate LICENSE file validation
[ https://issues.apache.org/jira/browse/KAFKA-12622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

A. Sophie Blee-Goldman updated KAFKA-12622:
-------------------------------------------
    Description:

In https://issues.apache.org/jira/browse/KAFKA-12602, we manually constructed a correct license file for 2.8.0. This file will certainly become wrong again in later releases, so we need to write some kind of script to automate a check.

It crossed my mind to automate the generation of the file, but it seems to be an intractable problem, considering that each dependency may change licenses, may package license files, link to them from their poms, link to them from their repos, etc. I've also found multiple URLs listed with various delimiters, broken links that I have to chase down, etc.

Therefore, it seems like the solution to aim for is simply: list all the jars that we package, and print out a report of each jar that's extra or missing vs. the ones in our `LICENSE-binary` file.

The check should be part of the release script at least, if not part of the regular build (so we keep it up to date as dependencies change).

Here's how I do this manually right now:

{code:java}
// build the binary artifacts
$ ./gradlewAll releaseTarGz

// unpack the binary artifact
$ tar xf core/build/distributions/kafka_2.13-X.Y.Z.tgz
$ cd kafka_2.13-X.Y.Z

// list the packaged jars
// (you can ignore the jars for our own modules, like kafka, kafka-clients, etc.)
$ ls libs/

// cross check the jars with the packaged LICENSE
// make sure all dependencies are listed with the right versions
$ cat LICENSE

// also double check all the mentioned license files are present
$ ls licenses
{code}

    was:

In https://issues.apache.org/jira/browse/KAFKA-12602, we manually constructed a correct license file for 2.8.0. This file will certainly become wrong again in later releases, so we need to write some kind of script to automate a check.

It crossed my mind to automate the generation of the file, but it seems to be an intractable problem, considering that each dependency may change licenses, may package license files, link to them from their poms, link to them from their repos, etc. I've also found multiple URLs listed with various delimiters, broken links that I have to chase down, etc.

Therefore, it seems like the solution to aim for is simply: list all the jars that we package, and print out a report of each jar that's extra or missing vs. the ones in our `LICENSE-binary` file.

The check should be part of the release script at least, if not part of the regular build (so we keep it up to date as dependencies change).

Here's how I do this manually right now:

{code:java}
// build the binary artifacts
$ ./gradlewAll releaseTarGz

// unpack the binary artifact
$ cd core/build/distributions/
$ tar xf kafka_2.13-X.Y.Z.tgz
$ cd kafka_2.13-X.Y.Z

// list the packaged jars
// (you can ignore the jars for our own modules, like kafka, kafka-clients, etc.)
$ ls libs/

// cross check the jars with the packaged LICENSE
// make sure all dependencies are listed with the right versions
$ cat LICENSE

// also double check all the mentioned license files are present
$ ls licenses
{code}

> Automate LICENSE file validation
> --------------------------------
>
> Key: KAFKA-12622
> URL: https://issues.apache.org/jira/browse/KAFKA-12622
> Project: Kafka
> Issue Type: Task
> Reporter: John Roesler
> Priority: Major
> Fix For: 3.0.0, 2.8.1
[jira] [Resolved] (KAFKA-8924) Default grace period (-1) of TimeWindows causes suppress to emit events after 24h
[ https://issues.apache.org/jira/browse/KAFKA-8924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

A. Sophie Blee-Goldman resolved KAFKA-8924.
-------------------------------------------
    Resolution: Duplicate

> Default grace period (-1) of TimeWindows causes suppress to emit events after 24h
> ---------------------------------------------------------------------------------
>
> Key: KAFKA-8924
> URL: https://issues.apache.org/jira/browse/KAFKA-8924
> Project: Kafka
> Issue Type: Improvement
> Components: streams
> Affects Versions: 2.3.0
> Reporter: Michał
> Assignee: Michał
> Priority: Major
> Labels: needs-kip
>
> h2. Problem
> The default creation of TimeWindows, like
> {code}
> TimeWindows.of(ofMillis(xxx))
> {code}
> calls an internal constructor
> {code}
> return new TimeWindows(sizeMs, sizeMs, -1, DEFAULT_RETENTION_MS);
> {code}
> And the *-1* parameter is the default grace period, which I think is here for backward compatibility:
> {code}
> @SuppressWarnings("deprecation") // continuing to support Windows#maintainMs/segmentInterval in fallback mode
> @Override
> public long gracePeriodMs() {
>     // NOTE: in the future, when we remove maintainMs,
>     // we should default the grace period to 24h to maintain the default behavior,
>     // or we can default to (24h - size) if you want to be super accurate.
>     return graceMs != -1 ? graceMs : maintainMs() - size();
> }
> {code}
> The problem is that if you use a TimeWindows with gracePeriod of *-1* together with suppress *untilWindowCloses*, it never emits an event.
> You can check the Suppress tests (SuppressScenarioTest.shouldSupportFinalResultsForTimeWindows), where [~vvcephei] was (maybe) aware of that, and all the scenarios specify the gracePeriod.
> I will add a test without it on my branch and it will fail.
> The test: https://github.com/atais/kafka/commit/221a04dc40d636ffe93a0ad95dfc6bcad653f4db
>
> h2. Now what can be done
> One easy fix would be to change the default value to 0, which works fine for me in my project; however, I am not aware of the impact it would have due to the changes in the *gracePeriodMs* method mentioned before.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
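The "emit after 24h" behavior in the ticket title follows directly from the fallback `maintainMs() - size()` above: with the default 24h retention, a suppressed window only closes a full day after it starts. A small arithmetic sketch (the class and method names here are illustrative, not Kafka Streams API):

```java
import java.time.Duration;

// Illustrative only: computes when a suppressed (untilWindowCloses) window
// emits its final result, i.e. windowStart + size + grace.
public class WindowCloseSketch {

    static long closeTimeMs(long windowStartMs, long sizeMs, long graceMs) {
        return windowStartMs + sizeMs + graceMs;
    }

    public static void main(String[] args) {
        Duration size = Duration.ofMinutes(5);
        // fallback default: grace = retention (24h) - size
        Duration legacyGrace = Duration.ofHours(24).minus(size);

        // window closes a full 24h after it starts -> suppress emits a day late
        System.out.println(closeTimeMs(0, size.toMillis(), legacyGrace.toMillis())); // 86400000

        // explicit zero grace: window closes as soon as stream time passes its end
        System.out.println(closeTimeMs(0, size.toMillis(), 0L)); // 300000
    }
}
```

This is why changing the default grace period to 0 (or requiring an explicit grace, as KIP-633 later did) makes suppress emit as soon as stream time passes the window end instead of 24 hours later.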
[jira] [Commented] (KAFKA-8924) Default grace period (-1) of TimeWindows causes suppress to emit events after 24h
[ https://issues.apache.org/jira/browse/KAFKA-8924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17315217#comment-17315217 ]

A. Sophie Blee-Goldman commented on KAFKA-8924:
-----------------------------------------------
Can we close this as a duplicate? We have KIP-633 currently under voting, which should improve things; we can track the progress from here with https://issues.apache.org/jira/browse/KAFKA-8613
[jira] [Assigned] (KAFKA-8613) Set default grace period to 0
[ https://issues.apache.org/jira/browse/KAFKA-8613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

A. Sophie Blee-Goldman reassigned KAFKA-8613:
---------------------------------------------
    Assignee: A. Sophie Blee-Goldman

> Set default grace period to 0
> -----------------------------
>
> Key: KAFKA-8613
> URL: https://issues.apache.org/jira/browse/KAFKA-8613
> Project: Kafka
> Issue Type: Improvement
> Components: streams
> Reporter: Bruno Cadonna
> Assignee: A. Sophie Blee-Goldman
> Priority: Blocker
> Fix For: 3.0.0
>
> Currently, the grace period is set to retention time if the grace period is not specified explicitly. The reason for setting the default grace period to retention time was backward compatibility. Topologies that were implemented before the introduction of the grace period added late-arriving records to a window as long as the window existed, i.e., as long as its retention time had not elapsed.
> This unintuitive default grace period has already caused confusion among users.
> For the next major release, we should set the default grace period to {{Duration.ZERO}}.