[jira] [Updated] (KAFKA-12740) 3. Resume processing from last-cleared processor after soft crash

2021-05-03 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman updated KAFKA-12740:
---
Description: 
Building off of that, we can go one step further and avoid duplicate work 
within the subtopology itself. Any time a record is partially processed through 
a subtopology before hitting an error, all of the processors up to that point 
will be applied again when the record is reprocessed. If we can keep track of 
how far along the subtopology a record was processed, then we can start 
reprocessing it from the last processor it had cleared before hitting an error. 
We’ll need to make sure to commit and/or flush everything up to that point in 
the subtopology as well.
This proposal is the most likely to benefit from letting a StreamThread recover 
after an unexpected exception rather than letting it die and starting up a new 
one, as we don’t need to worry about handing anything off from the dying thread 
to its replacement. 

Note: we should consider whether we should only allow this (A) if we can be 
sure the task is re/still assigned to the same client (ie user does not select 
SHUTDOWN_CLIENT), or (B) we allow it for ALOS and just skip this under EOS 
until we can implement (A) (at which time we can re-enable for EOS as well)


  was:
Building off of that, we can go one step further and avoid duplicate work 
within the subtopology itself. Any time a record is partially processed through 
a subtopology before hitting an error, all of the processors up to that point 
will be applied again when the record is reprocessed. If we can keep track of 
how far along the subtopology a record was processed, then we can start 
reprocessing it from the last processor it had cleared before hitting an error. 
We’ll need to make sure to commit and/or flush everything up to that point in 
the subtopology as well.
This proposal is the most likely to benefit from letting a StreamThread recover 
after an unexpected exception rather than letting it die and starting up a new 
one, as we don’t need to worry about handing anything off from the dying thread 
to its replacement. 



> 3. Resume processing from last-cleared processor after soft crash
> -
>
> Key: KAFKA-12740
> URL: https://issues.apache.org/jira/browse/KAFKA-12740
> Project: Kafka
>  Issue Type: Sub-task
>  Components: streams
>Reporter: A. Sophie Blee-Goldman
>Priority: Major
>
> Building off of that, we can go one step further and avoid duplicate work 
> within the subtopology itself. Any time a record is partially processed 
> through a subtopology before hitting an error, all of the processors up to 
> that point will be applied again when the record is reprocessed. If we can 
> keep track of how far along the subtopology a record was processed, then we 
> can start reprocessing it from the last processor it had cleared before 
> hitting an error. We’ll need to make sure to commit and/or flush everything 
> up to that point in the subtopology as well.
> This proposal is the most likely to benefit from letting a StreamThread 
> recover after an unexpected exception rather than letting it die and starting 
> up a new one, as we don’t need to worry about handing anything off from the 
> dying thread to its replacement. 
> Note: we should consider whether we should only allow this (A) if we can be 
> sure the task is re/still assigned to the same client (ie user does not 
> select SHUTDOWN_CLIENT), or (B) we allow it for ALOS and just skip this under 
> EOS until we can implement (A) (at which time we can re-enable for EOS as 
> well)
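
To make the idea concrete, here is a minimal, self-contained sketch (illustrative only, with made-up names; this is not actual Streams code) of the progress tracking the description calls for, i.e. remembering the last processor a record cleared so that a retry can resume right after it:

{code:java}
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch: track how far a record made it through the processor
// chain of one subtopology, so a retry can skip the processors it already cleared.
public class ResumableSubtopology<R> {
    private final List<Consumer<R>> processors;  // ordered processors of the subtopology
    private int lastCleared = -1;                // index of the last successfully applied processor

    public ResumableSubtopology(List<Consumer<R>> processors) {
        this.processors = processors;
    }

    /** Applies the remaining processors to the record, resuming after the last cleared one. */
    public void process(R record) {
        for (int i = lastCleared + 1; i < processors.size(); i++) {
            processors.get(i).accept(record);    // may throw; progress up to i-1 is preserved
            lastCleared = i;                     // only advance once the processor succeeded
        }
        lastCleared = -1;                        // record fully processed; reset for the next one
    }
}
{code}

On a retry after an exception, calling process(record) again picks up right after the last cleared processor; per the description, the real implementation would also need to commit and/or flush everything up to that point in the subtopology.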



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-12746) Allow StreamThreads to recover from some exceptions without killing/replacing the thread

2021-05-03 Thread A. Sophie Blee-Goldman (Jira)
A. Sophie Blee-Goldman created KAFKA-12746:
--

 Summary: Allow StreamThreads to recover from some exceptions 
without killing/replacing the thread
 Key: KAFKA-12746
 URL: https://issues.apache.org/jira/browse/KAFKA-12746
 Project: Kafka
  Issue Type: Sub-task
  Components: streams
Reporter: A. Sophie Blee-Goldman


As the title says, in some cases it should be possible to recover the 
StreamThread without killing and restarting it (it may in fact be _possible_ 
for all exception types, though in some cases it may be preferable to just let 
the thread die and start up a fresh one). 

This would obviously go a long way toward mitigating the impact of errors in 
the subtopology, and allow us to retry with little overhead. It may also make 
it easier to improve the processing semantics in the face of exceptions.

This will also be useful if/when we implement more sophisticated task 
scheduling which can, for example, work to deprioritize tasks that are 
frequently experiencing errors without crashing the whole thread or dropping 
the task entirely.
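
As a purely illustrative sketch (hypothetical names, not actual StreamThread code), the core of the idea is to retry the loop body in place on a recoverable exception instead of letting the thread die and be replaced:

{code:java}
// Hypothetical sketch: keep the thread alive and retry on recoverable exceptions,
// only giving up (and falling back to thread replacement) after repeated failures.
public class RecoveringLoop {
    interface Step { void runOnce() throws Exception; }

    private volatile boolean running = true;

    public void run(Step step, int maxConsecutiveFailures) throws Exception {
        int failures = 0;
        while (running) {
            try {
                step.runOnce();
                failures = 0;                    // healthy iteration: reset the failure count
            } catch (Exception e) {
                if (++failures > maxConsecutiveFailures) {
                    throw e;                     // give up: let the thread be replaced as today
                }
                // otherwise swallow the exception and retry on the same thread
            }
        }
    }

    public void shutdown() { running = false; }
}
{code}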



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-12742) 5. Checkpoint all uncorrupted state stores within the subtopology

2021-04-30 Thread A. Sophie Blee-Goldman (Jira)
A. Sophie Blee-Goldman created KAFKA-12742:
--

 Summary: 5. Checkpoint all uncorrupted state stores within the 
subtopology
 Key: KAFKA-12742
 URL: https://issues.apache.org/jira/browse/KAFKA-12742
 Project: Kafka
  Issue Type: Sub-task
  Components: streams
Reporter: A. Sophie Blee-Goldman


Once we have [KAFKA-12740|https://issues.apache.org/jira/browse/KAFKA-12740], 
we can close the loop on EOS by checkpointing not only those state stores which 
are attached to processors that the record has successfully passed, but also 
any remaining state stores further downstream in the subtopology that aren't 
connected to the processor where the error occurred.
At this point, outside of a hard crash (eg the process is killed) or dropping 
out of the group, we’ll only ever need to restore state stores from scratch if 
the exception came from the specific processor node they’re attached to, which 
is pretty darn cool
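
A tiny illustrative sketch (hypothetical types, not Streams internals) of that checkpointing rule: write a checkpoint for every store except the ones attached to the processor node where the error occurred:

{code:java}
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: checkpoint all stores that are not connected to the failing node.
public class CheckpointOnError {
    interface StateStore { void checkpoint(); }

    public static void checkpointUncorrupted(Map<String, Set<StateStore>> storesByProcessor,
                                             String failedProcessor) {
        storesByProcessor.forEach((processor, stores) -> {
            if (!processor.equals(failedProcessor)) {
                stores.forEach(StateStore::checkpoint);  // safe: not attached to the failure
            }
        });
    }
}
{code}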



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-12740) 3. Resume processing from last-cleared processor after soft crash

2021-04-30 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman updated KAFKA-12740:
---
Description: 
Building off of that, we can go one step further and avoid duplicate work 
within the subtopology itself. Any time a record is partially processed through 
a subtopology before hitting an error, all of the processors up to that point 
will be applied again when the record is reprocessed. If we can keep track of 
how far along the subtopology a record was processed, then we can start 
reprocessing it from the last processor it had cleared before hitting an error. 
We’ll need to make sure to commit and/or flush everything up to that point in 
the subtopology as well.
This proposal is the most likely to benefit from letting a StreamThread recover 
after an unexpected exception rather than letting it die and starting up a new 
one, as we don’t need to worry about handing anything off from the dying thread 
to its replacement. 


  was:
Building off of that, we can go one step further and avoid duplicate work 
within the subtopology itself. Any time a record is partially processed through 
a subtopology before hitting an error, all of the processors up to that point 
will be applied again when the record is reprocessed. If we can keep track of 
how far along the subtopology a record was processed, then we can start 
reprocessing it from the last processor it had cleared before hitting an error.
This proposal is the most likely to benefit from letting a StreamThread recover 
after an unexpected exception rather than letting it die and starting up a new 
one, as we don’t need to worry about handing anything off from the dying thread 
to its replacement. 



> 3. Resume processing from last-cleared processor after soft crash
> -
>
> Key: KAFKA-12740
> URL: https://issues.apache.org/jira/browse/KAFKA-12740
> Project: Kafka
>  Issue Type: Sub-task
>  Components: streams
>Reporter: A. Sophie Blee-Goldman
>Priority: Major
>
> Building off of that, we can go one step further and avoid duplicate work 
> within the subtopology itself. Any time a record is partially processed 
> through a subtopology before hitting an error, all of the processors up to 
> that point will be applied again when the record is reprocessed. If we can 
> keep track of how far along the subtopology a record was processed, then we 
> can start reprocessing it from the last processor it had cleared before 
> hitting an error. We’ll need to make sure to commit and/or flush everything 
> up to that point in the subtopology as well.
> This proposal is the most likely to benefit from letting a StreamThread 
> recover after an unexpected exception rather than letting it die and starting 
> up a new one, as we don’t need to worry about handing anything off from the 
> dying thread to its replacement. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-12741) 4. Handle thread-wide errors

2021-04-30 Thread A. Sophie Blee-Goldman (Jira)
A. Sophie Blee-Goldman created KAFKA-12741:
--

 Summary: 4. Handle thread-wide errors
 Key: KAFKA-12741
 URL: https://issues.apache.org/jira/browse/KAFKA-12741
 Project: Kafka
  Issue Type: Sub-task
  Components: streams
Reporter: A. Sophie Blee-Goldman


Although just implementing 1-3 will likely be a significant improvement, we 
will eventually want to tackle the case of an exception that affects all tasks 
on that thread. I think in this case we would just need to keep retrying the 
specific call that’s failing, whatever that may be. This would almost certainly 
require solution #3/4 as a prerequisite as we would need to keep retrying on 
that thread’s Producer. Of course, with KIP-572 we’ll already retry most kinds 
of errors on the Producer, if not all. So this would most likely only apply to 
a few kinds of exceptions, such as any Consumer calls that fail for reasons 
besides having dropped out of the group.
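
For illustration only (hypothetical helper, not an existing Kafka API), "keep retrying the specific call that's failing" could look like a simple bounded retry with backoff around the offending Consumer or Producer call:

{code:java}
import java.time.Duration;
import java.util.concurrent.Callable;

// Hypothetical sketch: retry one failing call with a backoff instead of crashing the thread.
public class RetryCall {
    public static <T> T retry(Callable<T> call, int maxAttempts, Duration backoff) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {  // assumes maxAttempts >= 1
            try {
                return call.call();
            } catch (Exception e) {
                last = e;                          // remember the failure and retry
                Thread.sleep(backoff.toMillis());
            }
        }
        throw last;                                // exhausted retries: surface the error
    }
}
{code}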



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-12739) 2. Commit any cleanly-processed records within a corrupted task

2021-04-30 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman updated KAFKA-12739:
---
Description: 
Within a task, there will typically be a number of records that have been 
successfully processed through the subtopology but not yet committed. If the 
next record to be picked up hits an unexpected exception, we’ll dirty close the 
entire task and essentially throw away all the work we did on those previous 
records. We should be able to drop only the corrupted record and just commit 
the offsets up to that point.
Again, for some exceptions such as de/serialization or user code errors, this 
can be straightforward as the thread/task is otherwise in a healthy state. 
Other cases such as an error in the Producer will need to be tackled 
separately, since a Producer error cannot be isolated to a single task.

The challenge here will be in handling records sent to the changelog while 
processing the record that hits an error – we may need to buffer those records 
so they aren’t sent to the RecordCollector until a record has been fully 
processed, otherwise they will be committed and break EOS semantics (unless we 
can immediately implement 
[KAFKA-12740|https://issues.apache.org/jira/browse/KAFKA-12740])

  was:Within a task, there will typically be a number of records that have been 
successfully processed through the subtopology but not yet committed. If the 
next record to be picked up hits an unexpected exception, we’ll dirty close the 
entire task and essentially throw away all the work we did on those previous 
records. We should be able to drop only the corrupted record and just commit 
the offsets up to that point. Again, for some exceptions such as 
de/serialization or user code errors, this can be straightforward as the 
thread/task is otherwise in a healthy state. Other cases such as an error in 
the Producer will need to be tackled separately, since a Producer error cannot 
be isolated to a single task.


> 2. Commit any cleanly-processed records within a corrupted task
> ---
>
> Key: KAFKA-12739
> URL: https://issues.apache.org/jira/browse/KAFKA-12739
> Project: Kafka
>  Issue Type: Sub-task
>  Components: streams
>Reporter: A. Sophie Blee-Goldman
>Priority: Major
>
> Within a task, there will typically be a number of records that have been 
> successfully processed through the subtopology but not yet committed. If the 
> next record to be picked up hits an unexpected exception, we’ll dirty close 
> the entire task and essentially throw away all the work we did on those 
> previous records. We should be able to drop only the corrupted record and 
> just commit the offsets up to that point.
> Again, for some exceptions such as de/serialization or user code errors, this 
> can be straightforward as the thread/task is otherwise in a healthy state. 
> Other cases such as an error in the Producer will need to be tackled 
> separately, since a Producer error cannot be isolated to a single task.
> The challenge here will be in handling records sent to the changelog while 
> processing the record that hits an error – we may need to buffer those 
> records so they aren’t sent to the RecordCollector until a record has been 
> fully processed, otherwise they will be committed and break EOS semantics 
> (unless we can immediately implement 
> [KAFKA-12740|https://issues.apache.org/jira/browse/KAFKA-12740])
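
To make the buffering idea concrete, here is a minimal illustrative sketch (made-up names; the real RecordCollector works differently) of holding back changelog writes for the in-flight record until it has been fully processed:

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;

// Hypothetical sketch: buffer changelog writes for the record currently being processed,
// and only hand them downstream once the record has fully cleared the subtopology, so a
// failure mid-record does not end up committing partial state under EOS.
public class BufferedChangelogWriter<K, V> {
    private final List<K> keys = new ArrayList<>();
    private final List<V> values = new ArrayList<>();
    private final BiConsumer<K, V> collector;      // stand-in for the real collector

    public BufferedChangelogWriter(BiConsumer<K, V> collector) {
        this.collector = collector;
    }

    public void write(K key, V value) {            // called while processing the record
        keys.add(key);
        values.add(value);
    }

    public void recordFullyProcessed() {           // flush the buffered writes downstream
        for (int i = 0; i < keys.size(); i++) {
            collector.accept(keys.get(i), values.get(i));
        }
        discard();
    }

    public void discard() {                        // called if the record hit an error
        keys.clear();
        values.clear();
    }
}
{code}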



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-12740) 3. Resume processing from last-cleared processor after soft crash

2021-04-30 Thread A. Sophie Blee-Goldman (Jira)
A. Sophie Blee-Goldman created KAFKA-12740:
--

 Summary: 3. Resume processing from last-cleared processor after 
soft crash
 Key: KAFKA-12740
 URL: https://issues.apache.org/jira/browse/KAFKA-12740
 Project: Kafka
  Issue Type: Sub-task
  Components: streams
Reporter: A. Sophie Blee-Goldman


Building off of that, we can go one step further and avoid duplicate work 
within the subtopology itself. Any time a record is partially processed through 
a subtopology before hitting an error, all of the processors up to that point 
will be applied again when the record is reprocessed. If we can keep track of 
how far along the subtopology a record was processed, then we can start 
reprocessing it from the last processor it had cleared before hitting an error.
This proposal is the most likely to benefit from letting a StreamThread recover 
after an unexpected exception rather than letting it die and starting up a new 
one, as we don’t need to worry about handing anything off from the dying thread 
to its replacement. 




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-12739) 2. Commit any cleanly-processed records within a corrupted task

2021-04-30 Thread A. Sophie Blee-Goldman (Jira)
A. Sophie Blee-Goldman created KAFKA-12739:
--

 Summary: 2. Commit any cleanly-processed records within a 
corrupted task
 Key: KAFKA-12739
 URL: https://issues.apache.org/jira/browse/KAFKA-12739
 Project: Kafka
  Issue Type: Sub-task
  Components: streams
Reporter: A. Sophie Blee-Goldman


Within a task, there will typically be a number of records that have been 
successfully processed through the subtopology but not yet committed. If the 
next record to be picked up hits an unexpected exception, we’ll dirty close the 
entire task and essentially throw away all the work we did on those previous 
records. We should be able to drop only the corrupted record and just commit 
the offsets up to that point. Again, for some exceptions such as 
de/serialization or user code errors, this can be straightforward as the 
thread/task is otherwise in a healthy state. Other cases such as an error in 
the Producer will need to be tackled separately, since a Producer error cannot 
be isolated to a single task.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-12737) 1. Commit all healthy tasks after a task-specific error for better task isolation

2021-04-30 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman updated KAFKA-12737:
---
Summary: 1. Commit all healthy tasks after a task-specific error for better 
task isolation  (was: Commit all healthy tasks after a task-specific error for 
better task isolation and reduced overcounting under ALOS)

> 1. Commit all healthy tasks after a task-specific error for better task 
> isolation
> -
>
> Key: KAFKA-12737
> URL: https://issues.apache.org/jira/browse/KAFKA-12737
> Project: Kafka
>  Issue Type: Sub-task
>  Components: streams
>Reporter: A. Sophie Blee-Goldman
>Priority: Major
>
> At the moment any time we hit the exception handler, an unclean shutdown will 
> be triggered on that thread, which means no tasks will be committed. For 
> certain kinds of exceptions this is unavoidable: for example if the consumer 
> has dropped out of the group then by definition it can’t commit during 
> shutdown, and the task will have already been reassigned to another 
> StreamThread. However there are many common scenarios in which we can (and 
> should) attempt to commit all the tasks which are in a clean state, ie 
> everyone except for the task currently being processed when the exception 
> occurred. A good example of this is de/serialization or user code errors, as 
> well as exceptions that occur during an operation like closing or suspending 
> a particular task. In all those cases, there’s no need to throw away all the 
> progress that has been made by the unaffected tasks who just happened to be 
> assigned to the same StreamThread.
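
As a minimal sketch of the proposed behavior (hypothetical interface, not the real Task API): on an exception from one task, commit every other task on the thread rather than throwing all of their progress away:

{code:java}
import java.util.Collection;

// Hypothetical sketch: commit all clean tasks, skipping only the one that hit the exception.
public class CommitHealthyTasks {
    interface Task { String id(); void commit(); }

    public static void commitAllExcept(Collection<Task> tasks, String failedTaskId) {
        for (Task task : tasks) {
            if (!task.id().equals(failedTaskId)) {
                task.commit();                     // task is in a clean state, safe to commit
            }
        }
    }
}
{code}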



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-12737) Commit all healthy tasks after a task-specific error for better task isolation and reduced overcounting under ALOS

2021-04-30 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman updated KAFKA-12737:
---
Parent: KAFKA-12738
Issue Type: Sub-task  (was: Improvement)

> Commit all healthy tasks after a task-specific error for better task 
> isolation and reduced overcounting under ALOS
> --
>
> Key: KAFKA-12737
> URL: https://issues.apache.org/jira/browse/KAFKA-12737
> Project: Kafka
>  Issue Type: Sub-task
>  Components: streams
>Reporter: A. Sophie Blee-Goldman
>Priority: Major
>
> At the moment any time we hit the exception handler, an unclean shutdown will 
> be triggered on that thread, which means no tasks will be committed. For 
> certain kinds of exceptions this is unavoidable: for example if the consumer 
> has dropped out of the group then by definition it can’t commit during 
> shutdown, and the task will have already been reassigned to another 
> StreamThread. However there are many common scenarios in which we can (and 
> should) attempt to commit all the tasks which are in a clean state, ie 
> everyone except for the task currently being processed when the exception 
> occurred. A good example of this is de/serialization or user code errors, as 
> well as exceptions that occur during an operation like closing or suspending 
> a particular task. In all those cases, there’s no need to throw away all the 
> progress that has been made by the unaffected tasks who just happened to be 
> assigned to the same StreamThread.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-12738) Improved error handling for better at-least-once semantics and faster EOS

2021-04-30 Thread A. Sophie Blee-Goldman (Jira)
A. Sophie Blee-Goldman created KAFKA-12738:
--

 Summary: Improved error handling for better at-least-once 
semantics and faster EOS
 Key: KAFKA-12738
 URL: https://issues.apache.org/jira/browse/KAFKA-12738
 Project: Kafka
  Issue Type: Improvement
  Components: streams
Reporter: A. Sophie Blee-Goldman


Umbrella ticket for various ideas I had to improve the error handling behavior 
of Streams



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-12737) Commit all healthy tasks after a task-specific error for better task isolation and reduced overcounting under ALOS

2021-04-30 Thread A. Sophie Blee-Goldman (Jira)
A. Sophie Blee-Goldman created KAFKA-12737:
--

 Summary: Commit all healthy tasks after a task-specific error for 
better task isolation and reduced overcounting under ALOS
 Key: KAFKA-12737
 URL: https://issues.apache.org/jira/browse/KAFKA-12737
 Project: Kafka
  Issue Type: Improvement
  Components: streams
Reporter: A. Sophie Blee-Goldman


At the moment any time we hit the exception handler, an unclean shutdown will 
be triggered on that thread, which means no tasks will be committed. For 
certain kinds of exceptions this is unavoidable: for example if the consumer 
has dropped out of the group then by definition it can’t commit during 
shutdown, and the task will have already been reassigned to another 
StreamThread. However there are many common scenarios in which we can (and 
should) attempt to commit all the tasks which are in a clean state, ie everyone 
except for the task currently being processed when the exception occurred. A 
good example of this is de/serialization or user code errors, as well as 
exceptions that occur during an operation like closing or suspending a 
particular task. In all those cases, there’s no need to throw away all the 
progress that has been made by the unaffected tasks who just happened to be 
assigned to the same StreamThread.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-12718) SessionWindows are closed too early

2021-04-29 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17336997#comment-17336997
 ] 

A. Sophie Blee-Goldman commented on KAFKA-12718:


{quote}I was just looking into `suppress()` implementation, that obviously does 
not know anything about semantics of upstream window definitions{quote}
[~mjsax] can you elaborate? Suppression does in fact search the processor graph 
to find the grace period of the windowed operator which is upstream of the 
suppression. It's called GraphGraceSearchUtil or something

> SessionWindows are closed too early
> ---
>
> Key: KAFKA-12718
> URL: https://issues.apache.org/jira/browse/KAFKA-12718
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: Matthias J. Sax
>Priority: Major
>  Labels: beginner, easy-fix, newbie
> Fix For: 3.0.0
>
>
> SessionWindows are defined based on a {{gap}} parameter, and also support an 
> additional {{grace-period}} configuration to handle out-of-order data.
> To incorporate the session-gap a session window should only be closed at 
> {{window-end + gap}} and to incorporate grace-period, the close time should 
> be pushed out further to {{window-end + gap + grace}}.
> However, atm we compute the window close time as {{window-end + grace}} 
> omitting the {{gap}} parameter.
> Because the default grace-period is 24h, most users might not notice this 
> issue. Even if they set a grace period explicitly (eg, when using 
> suppress()), they would most likely set a grace-period larger than the 
> gap-time, not hitting the issue (or maybe only realize it when inspecting the 
> behavior closely).
> However, if a user wants to disable the grace-period and sets it to zero (or 
> any other value smaller than the gap-time), sessions might be closed too 
> early and users might notice.
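
For concreteness, a tiny worked example of the close-time formulas from the description (the 10s/5s/0s values are arbitrary, chosen only to show the gap being dropped):

{code:java}
public class SessionWindowCloseTime {
    public static void main(String[] args) {
        // Illustration only: gap = 5s, grace = 0s, times in milliseconds.
        long windowEnd = 10_000L;
        long gap = 5_000L;
        long grace = 0L;

        long intendedClose = windowEnd + gap + grace;  // 15000: incorporates the session gap
        long actualClose = windowEnd + grace;          // 10000: closes too early when grace < gap
        System.out.println("intended=" + intendedClose + " actual=" + actualClose);
    }
}
{code}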



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-6409) LogRecoveryTest (testHWCheckpointWithFailuresSingleLogSegment) is flaky

2021-04-29 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-6409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman resolved KAFKA-6409.
---
Resolution: Duplicate

> LogRecoveryTest (testHWCheckpointWithFailuresSingleLogSegment) is flaky
> ---
>
> Key: KAFKA-6409
> URL: https://issues.apache.org/jira/browse/KAFKA-6409
> Project: Kafka
>  Issue Type: Bug
>  Components: log
>Reporter: Wladimir Schmidt
>Priority: Major
>
> In the LogRecoveryTest the test named 
> testHWCheckpointWithFailuresSingleLogSegment is affected and not stable. 
> Sometimes it passes, sometimes it does not. 
> Scala 2.12. JDK9
> java.lang.AssertionError: Timing out after 3 ms since a new leader that 
> is different from 1 was not elected for partition new-topic-0, leader is 
> Some(1)
>   at kafka.utils.TestUtils$.fail(TestUtils.scala:351)
>   at 
> kafka.utils.TestUtils$.$anonfun$waitUntilLeaderIsElectedOrChanged$8(TestUtils.scala:828)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> kafka.utils.TestUtils$.waitUntilLeaderIsElectedOrChanged(TestUtils.scala:818)
>   at 
> kafka.server.LogRecoveryTest.testHWCheckpointWithFailuresSingleLogSegment(LogRecoveryTest.scala:152)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:564)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at 
> org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecuter.runTestClass(JUnitTestClassExecuter.java:114)
>   at 
> org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecuter.execute(JUnitTestClassExecuter.java:57)
>   at 
> org.gradle.api.internal.tasks.testing.junit.JUnitTestClassProcessor.processTestClass(JUnitTestClassProcessor.java:66)
>   at 
> org.gradle.api.internal.tasks.testing.SuiteTestClassProcessor.processTestClass(SuiteTestClassProcessor.java:51)
>   at jdk.internal.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
>   at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:564)
>   at 
> org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:35)
>   at 
> org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24)
>   at 
> org.gradle.internal.dispatch.ContextClassLoaderDispatch.dispatch(ContextClassLoaderDispatch.java:32)
>   at 
> org.gradle.internal.dispatch.ProxyDispatchAdapter$DispatchingInvocationHandler.invoke(ProxyDispatchAdapter.java:93)
>   at com.sun.proxy.$Proxy1.processTestClass(Unknown Source)
>   at 
> org.gradle.api.internal.tasks.testing.worker.TestWorker.processTestClass(TestWorker.java:108)
>   at jdk.internal.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
>   at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.base/java.lang.reflect.Method.invoke(Method.java:564)
>   at 
> 

[jira] [Commented] (KAFKA-5492) LogRecoveryTest.testHWCheckpointWithFailuresSingleLogSegment transient failure

2021-04-29 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-5492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17335786#comment-17335786
 ] 

A. Sophie Blee-Goldman commented on KAFKA-5492:
---

Failed again: 

Stacktrace
org.opentest4j.AssertionFailedError: Server 1 is not able to join the ISR after 
restart
at org.junit.jupiter.api.AssertionUtils.fail(AssertionUtils.java:39)
at org.junit.jupiter.api.Assertions.fail(Assertions.java:117)
at 
kafka.server.LogRecoveryTest.testHWCheckpointWithFailuresSingleLogSegment(LogRecoveryTest.scala:152)

https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-10608/2/testReport/junit/kafka.server/LogRecoveryTest/Build___JDK_11_and_Scala_2_13___testHWCheckpointWithFailuresSingleLogSegment__/

> LogRecoveryTest.testHWCheckpointWithFailuresSingleLogSegment transient failure
> --
>
> Key: KAFKA-5492
> URL: https://issues.apache.org/jira/browse/KAFKA-5492
> Project: Kafka
>  Issue Type: Sub-task
>  Components: unit tests
>Reporter: Jason Gustafson
>Priority: Major
>
> {code}
> ava.lang.AssertionError: Timing out after 3 ms since a new leader that is 
> different from 1 was not elected for partition new-topic-0, leader is Some(1)
>   at kafka.utils.TestUtils$.fail(TestUtils.scala:333)
>   at 
> kafka.utils.TestUtils$.$anonfun$waitUntilLeaderIsElectedOrChanged$8(TestUtils.scala:808)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> kafka.utils.TestUtils$.waitUntilLeaderIsElectedOrChanged(TestUtils.scala:798)
>   at 
> kafka.server.LogRecoveryTest.testHWCheckpointWithFailuresSingleLogSegment(LogRecoveryTest.scala:152)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-12629) Flaky Test RaftClusterTest

2021-04-29 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17335783#comment-17335783
 ] 

A. Sophie Blee-Goldman commented on KAFKA-12629:


This test has multiple failures on almost every PR build I see lately (and I 
would say at least one failure on _every_ PR)...

> Flaky Test RaftClusterTest
> --
>
> Key: KAFKA-12629
> URL: https://issues.apache.org/jira/browse/KAFKA-12629
> Project: Kafka
>  Issue Type: Test
>  Components: core, unit tests
>Reporter: Matthias J. Sax
>Assignee: Luke Chen
>Priority: Critical
>  Labels: flaky-test
>
> {quote} {{java.util.concurrent.ExecutionException: 
> java.lang.ClassNotFoundException: 
> org.apache.kafka.controller.NoOpSnapshotWriterBuilder
>   at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
>   at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
>   at 
> kafka.testkit.KafkaClusterTestKit.startup(KafkaClusterTestKit.java:364)
>   at 
> kafka.server.RaftClusterTest.testCreateClusterAndCreateAndManyTopicsWithManyPartitions(RaftClusterTest.scala:181)}}{quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-7964) Flaky Test ConsumerBounceTest#testConsumerReceivesFatalExceptionWhenGroupPassesMaxSize

2021-04-29 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-7964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17335780#comment-17335780
 ] 

A. Sophie Blee-Goldman commented on KAFKA-7964:
---

Failed again, with a different error this time. Seems to be from the test setup 
and not the test itself:

Stacktrace
org.apache.kafka.common.KafkaException: Socket server failed to bind to 
localhost:33521: Address already in use.
at kafka.network.Acceptor.openServerSocket(SocketServer.scala:667)
at kafka.network.Acceptor.<init>(SocketServer.scala:560)
at kafka.network.SocketServer.createAcceptor(SocketServer.scala:288)
at 
kafka.network.SocketServer.$anonfun$createDataPlaneAcceptorsAndProcessors$1(SocketServer.scala:261)
at 
kafka.network.SocketServer.$anonfun$createDataPlaneAcceptorsAndProcessors$1$adapted(SocketServer.scala:259)
at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:563)
at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:561)
at scala.collection.AbstractIterable.foreach(Iterable.scala:919)
at 
kafka.network.SocketServer.createDataPlaneAcceptorsAndProcessors(SocketServer.scala:259)
at kafka.network.SocketServer.startup(SocketServer.scala:131)
at kafka.server.KafkaServer.startup(KafkaServer.scala:290)
at kafka.utils.TestUtils$.createServer(TestUtils.scala:166)
at 
kafka.integration.KafkaServerTestHarness.$anonfun$setUp$1(KafkaServerTestHarness.scala:107)
at scala.collection.immutable.Vector.foreach(Vector.scala:1856)
at 
kafka.integration.KafkaServerTestHarness.setUp(KafkaServerTestHarness.scala:102)
at 
kafka.api.IntegrationTestHarness.doSetup(IntegrationTestHarness.scala:93)
at 
kafka.api.IntegrationTestHarness.setUp(IntegrationTestHarness.scala:84)

https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-10608/2/testReport/junit/kafka.api/ConsumerBounceTest/Build___JDK_11_and_Scala_2_13___testConsumerReceivesFatalExceptionWhenGroupPassesMaxSize__/

> Flaky Test 
> ConsumerBounceTest#testConsumerReceivesFatalExceptionWhenGroupPassesMaxSize
> --
>
> Key: KAFKA-7964
> URL: https://issues.apache.org/jira/browse/KAFKA-7964
> Project: Kafka
>  Issue Type: Bug
>  Components: clients, consumer, unit tests
>Affects Versions: 2.2.0
>Reporter: Matthias J. Sax
>Priority: Critical
>  Labels: flaky-test
>
> To get stable nightly builds for `2.2` release, I create tickets for all 
> observed test failures.
> [https://jenkins.confluent.io/job/apache-kafka-test/job/2.2/21/]
> {quote}java.lang.AssertionError: expected:<100> but was:<0> at 
> org.junit.Assert.fail(Assert.java:88) at 
> org.junit.Assert.failNotEquals(Assert.java:834) at 
> org.junit.Assert.assertEquals(Assert.java:645) at 
> org.junit.Assert.assertEquals(Assert.java:631) at 
> kafka.api.ConsumerBounceTest.receiveExactRecords(ConsumerBounceTest.scala:551)
>  at 
> kafka.api.ConsumerBounceTest.$anonfun$testConsumerReceivesFatalExceptionWhenGroupPassesMaxSize$2(ConsumerBounceTest.scala:409)
>  at 
> kafka.api.ConsumerBounceTest.$anonfun$testConsumerReceivesFatalExceptionWhenGroupPassesMaxSize$2$adapted(ConsumerBounceTest.scala:408)
>  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) 
> at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) 
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at 
> kafka.api.ConsumerBounceTest.testConsumerReceivesFatalExceptionWhenGroupPassesMaxSize(ConsumerBounceTest.scala:408){quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9295) KTableKTableForeignKeyInnerJoinMultiIntegrationTest#shouldInnerJoinMultiPartitionQueryable

2021-04-29 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17335773#comment-17335773
 ] 

A. Sophie Blee-Goldman commented on KAFKA-9295:
---

Failed again, same stacktrace. I take back what I said in the earlier comment; 
it is actually failing semi-frequently now

https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-10608/2/testReport/junit/org.apache.kafka.streams.integration/KTableKTableForeignKeyInnerJoinMultiIntegrationTest/Build___JDK_15_and_Scala_2_13___shouldInnerJoinMultiPartitionQueryable_2/

> KTableKTableForeignKeyInnerJoinMultiIntegrationTest#shouldInnerJoinMultiPartitionQueryable
> --
>
> Key: KAFKA-9295
> URL: https://issues.apache.org/jira/browse/KAFKA-9295
> Project: Kafka
>  Issue Type: Bug
>  Components: streams, unit tests
>Affects Versions: 2.4.0, 2.6.0
>Reporter: Matthias J. Sax
>Assignee: Luke Chen
>Priority: Critical
>  Labels: flaky-test
> Fix For: 3.0.0
>
>
> [https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/27106/testReport/junit/org.apache.kafka.streams.integration/KTableKTableForeignKeyInnerJoinMultiIntegrationTest/shouldInnerJoinMultiPartitionQueryable/]
> {quote}java.lang.AssertionError: Did not receive all 1 records from topic 
> output- within 6 ms Expected: is a value equal to or greater than <1> 
> but: <0> was less than <1> at 
> org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18) at 
> org.apache.kafka.streams.integration.utils.IntegrationTestUtils.lambda$waitUntilMinKeyValueRecordsReceived$1(IntegrationTestUtils.java:515)
>  at 
> org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:417)
>  at 
> org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:385)
>  at 
> org.apache.kafka.streams.integration.utils.IntegrationTestUtils.waitUntilMinKeyValueRecordsReceived(IntegrationTestUtils.java:511)
>  at 
> org.apache.kafka.streams.integration.utils.IntegrationTestUtils.waitUntilMinKeyValueRecordsReceived(IntegrationTestUtils.java:489)
>  at 
> org.apache.kafka.streams.integration.KTableKTableForeignKeyInnerJoinMultiIntegrationTest.verifyKTableKTableJoin(KTableKTableForeignKeyInnerJoinMultiIntegrationTest.java:200)
>  at 
> org.apache.kafka.streams.integration.KTableKTableForeignKeyInnerJoinMultiIntegrationTest.shouldInnerJoinMultiPartitionQueryable(KTableKTableForeignKeyInnerJoinMultiIntegrationTest.java:183){quote}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-6655) CleanupThread: Failed to lock the state directory due to an unexpected exception (Windows OS)

2021-04-28 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-6655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman resolved KAFKA-6655.
---
Resolution: Fixed

This should be fixed: a number of improvements were made around the locking a 
few versions ago, and we actually took it a step further and removed this kind 
of file-based locking for task directories entirely as of 3.0

> CleanupThread: Failed to lock the state directory due to an unexpected 
> exception (Windows OS)
> -
>
> Key: KAFKA-6655
> URL: https://issues.apache.org/jira/browse/KAFKA-6655
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 1.0.1
> Environment: Windows Operating System
>Reporter: Srini
>Priority: Major
>
> This issue happens on Windows OS. The code works fine on Linux. This ticket is 
> related to KAFKA-6647, that also reports locking issues on Windows. However, 
> there, the issue occurs if users calls KafkaStreams#cleanUp() explicitly, 
> while this ticket is related to KafkaStreams background CleanupThread (note, 
> that both use `StateDirectory.cleanRemovedTasks`, but behavior is still 
> slightly different as different parameters are passed into the method).
> {quote}[CleanupThread] Failed to lock the state directory due to an 
> unexpected exceptionjava.nio.file.DirectoryNotEmptyException: 
> \tmp\kafka-streams\srini-20171208\0_9
>  at 
> sun.nio.fs.WindowsFileSystemProvider.implDelete(WindowsFileSystemProvider.java:266)
>  at 
> sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:103)
>  at java.nio.file.Files.delete(Files.java:1126)
>  at org.apache.kafka.common.utils.Utils$1.postVisitDirectory(Utils.java:636)
>  at org.apache.kafka.common.utils.Utils$1.postVisitDirectory(Utils.java:619)
>  at java.nio.file.Files.walkFileTree(Files.java:2688)
>  at java.nio.file.Files.walkFileTree(Files.java:2742)
>  at org.apache.kafka.common.utils.Utils.delete(Utils.java:619)
>  at 
> org.apache.kafka.streams.processor.internals.StateDirectory.cleanRemovedTasks(StateDirectory.java:245)
>  at org.apache.kafka.streams.KafkaStreams$3.run(KafkaStreams.java:761)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> {quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9295) KTableKTableForeignKeyInnerJoinMultiIntegrationTest#shouldInnerJoinMultiPartitionQueryable

2021-04-28 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334937#comment-17334937
 ] 

A. Sophie Blee-Goldman commented on KAFKA-9295:
---

It's failing on 
{noformat}
startApplicationAndWaitUntilRunning(kafkaStreamsList, ofSeconds(60));
{noformat}

At this point a session timeout seems unlikely, since [~showuon] observed a 
Streams instance dropping out on the heartbeat interval only a couple of times 
even when it failed, and all it has to do here is get to RUNNING once. It 
doesn't require that all KafkaStreams in the list get to RUNNING and then stay 
there, so all the instance has to do is start up and go through at least one 
successful rebalance in that time. There's nothing to restore, so the transition 
to RUNNING should be immediate after the rebalance.

Now technically 60s is a typical timeout for 
startApplicationAndWaitUntilRunning in the Streams integration tests, but the 
difference between this and other tests is that most have only one or two 
KafkaStreams to start up whereas this test has three. They're not started up 
and waited on sequentially so that shouldn't _really_ matter that much, but 
still it might just be that a longer timeout should be used in this case. I'm 
open to other theories, however.

Also note that we should soon have a larger default session interval, so once 
Jason's KIP for that has been implemented we'll be able to get that improvement 
for free. Even if we think the session interval is the problem with this test, 
it probably makes sense to just wait for that KIP rather than hardcode in some 
special value. If it starts to fail very frequently we can reconsider, but I 
haven't observed it doing so since the last fix was merged
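
For illustration, a longer timeout would just be a one-line change to the call quoted above; the 120 seconds below is an arbitrary value, not a tested recommendation:
{noformat}
startApplicationAndWaitUntilRunning(kafkaStreamsList, ofSeconds(120));
{noformat}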

> KTableKTableForeignKeyInnerJoinMultiIntegrationTest#shouldInnerJoinMultiPartitionQueryable
> --
>
> Key: KAFKA-9295
> URL: https://issues.apache.org/jira/browse/KAFKA-9295
> Project: Kafka
>  Issue Type: Bug
>  Components: streams, unit tests
>Affects Versions: 2.4.0, 2.6.0
>Reporter: Matthias J. Sax
>Assignee: Luke Chen
>Priority: Critical
>  Labels: flaky-test
> Fix For: 3.0.0
>
>
> [https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/27106/testReport/junit/org.apache.kafka.streams.integration/KTableKTableForeignKeyInnerJoinMultiIntegrationTest/shouldInnerJoinMultiPartitionQueryable/]
> {quote}java.lang.AssertionError: Did not receive all 1 records from topic 
> output- within 6 ms Expected: is a value equal to or greater than <1> 
> but: <0> was less than <1> at 
> org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18) at 
> org.apache.kafka.streams.integration.utils.IntegrationTestUtils.lambda$waitUntilMinKeyValueRecordsReceived$1(IntegrationTestUtils.java:515)
>  at 
> org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:417)
>  at 
> org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:385)
>  at 
> org.apache.kafka.streams.integration.utils.IntegrationTestUtils.waitUntilMinKeyValueRecordsReceived(IntegrationTestUtils.java:511)
>  at 
> org.apache.kafka.streams.integration.utils.IntegrationTestUtils.waitUntilMinKeyValueRecordsReceived(IntegrationTestUtils.java:489)
>  at 
> org.apache.kafka.streams.integration.KTableKTableForeignKeyInnerJoinMultiIntegrationTest.verifyKTableKTableJoin(KTableKTableForeignKeyInnerJoinMultiIntegrationTest.java:200)
>  at 
> org.apache.kafka.streams.integration.KTableKTableForeignKeyInnerJoinMultiIntegrationTest.shouldInnerJoinMultiPartitionQueryable(KTableKTableForeignKeyInnerJoinMultiIntegrationTest.java:183){quote}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (KAFKA-9295) KTableKTableForeignKeyInnerJoinMultiIntegrationTest#shouldInnerJoinMultiPartitionQueryable

2021-04-28 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman reopened KAFKA-9295:
---

Still failing; saw this on both the Java 8 and Java 11 builds of a PR:

Stacktrace
java.lang.AssertionError: Application did not reach a RUNNING state for all 
streams instances. Non-running instances: 
{org.apache.kafka.streams.KafkaStreams@cce6a1d=REBALANCING, 
org.apache.kafka.streams.KafkaStreams@45af6187=REBALANCING, 
org.apache.kafka.streams.KafkaStreams@5d8ba53=REBALANCING}
at org.junit.Assert.fail(Assert.java:89)
at 
org.apache.kafka.streams.integration.utils.IntegrationTestUtils.startApplicationAndWaitUntilRunning(IntegrationTestUtils.java:904)
at 
org.apache.kafka.streams.integration.KTableKTableForeignKeyInnerJoinMultiIntegrationTest.verifyKTableKTableJoin(KTableKTableForeignKeyInnerJoinMultiIntegrationTest.java:197)
at 
org.apache.kafka.streams.integration.KTableKTableForeignKeyInnerJoinMultiIntegrationTest.shouldInnerJoinMultiPartitionQueryable(KTableKTableForeignKeyInnerJoinMultiIntegrationTest.java:185)


https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-10573/8/testReport/org.apache.kafka.streams.integration/KTableKTableForeignKeyInnerJoinMultiIntegrationTest/Build___JDK_8_and_Scala_2_12___shouldInnerJoinMultiPartitionQueryable/

> KTableKTableForeignKeyInnerJoinMultiIntegrationTest#shouldInnerJoinMultiPartitionQueryable
> --
>
> Key: KAFKA-9295
> URL: https://issues.apache.org/jira/browse/KAFKA-9295
> Project: Kafka
>  Issue Type: Bug
>  Components: streams, unit tests
>Affects Versions: 2.4.0, 2.6.0
>Reporter: Matthias J. Sax
>Assignee: Luke Chen
>Priority: Critical
>  Labels: flaky-test
> Fix For: 3.0.0
>
>
> [https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/27106/testReport/junit/org.apache.kafka.streams.integration/KTableKTableForeignKeyInnerJoinMultiIntegrationTest/shouldInnerJoinMultiPartitionQueryable/]
> {quote}java.lang.AssertionError: Did not receive all 1 records from topic 
> output- within 6 ms Expected: is a value equal to or greater than <1> 
> but: <0> was less than <1> at 
> org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18) at 
> org.apache.kafka.streams.integration.utils.IntegrationTestUtils.lambda$waitUntilMinKeyValueRecordsReceived$1(IntegrationTestUtils.java:515)
>  at 
> org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:417)
>  at 
> org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:385)
>  at 
> org.apache.kafka.streams.integration.utils.IntegrationTestUtils.waitUntilMinKeyValueRecordsReceived(IntegrationTestUtils.java:511)
>  at 
> org.apache.kafka.streams.integration.utils.IntegrationTestUtils.waitUntilMinKeyValueRecordsReceived(IntegrationTestUtils.java:489)
>  at 
> org.apache.kafka.streams.integration.KTableKTableForeignKeyInnerJoinMultiIntegrationTest.verifyKTableKTableJoin(KTableKTableForeignKeyInnerJoinMultiIntegrationTest.java:200)
>  at 
> org.apache.kafka.streams.integration.KTableKTableForeignKeyInnerJoinMultiIntegrationTest.shouldInnerJoinMultiPartitionQueryable(KTableKTableForeignKeyInnerJoinMultiIntegrationTest.java:183){quote}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-12727) Flaky test ListConsumerGroupTest.testListConsumerGroupsWithStates

2021-04-28 Thread A. Sophie Blee-Goldman (Jira)
A. Sophie Blee-Goldman created KAFKA-12727:
--

 Summary: Flaky test 
ListConsumerGroupTest.testListConsumerGroupsWithStates
 Key: KAFKA-12727
 URL: https://issues.apache.org/jira/browse/KAFKA-12727
 Project: Kafka
  Issue Type: Bug
  Components: consumer
Reporter: A. Sophie Blee-Goldman


Stacktrace
org.opentest4j.AssertionFailedError: Expected to show groups 
Set((groupId='simple-group', isSimpleConsumerGroup=true, 
state=Optional[Empty]), (groupId='test.group', isSimpleConsumerGroup=false, 
state=Optional[Stable])), but found Set((groupId='test.group', 
isSimpleConsumerGroup=false, state=Optional[Stable]))
at org.junit.jupiter.api.AssertionUtils.fail(AssertionUtils.java:39)
at org.junit.jupiter.api.Assertions.fail(Assertions.java:117)
at 
kafka.admin.ListConsumerGroupTest.testListConsumerGroupsWithStates(ListConsumerGroupTest.scala:66)

https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-10573/8/testReport/kafka.admin/ListConsumerGroupTest/Build___JDK_11_and_Scala_2_13___testListConsumerGroupsWithStates__/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-12666) Fix flaky kafka.server.RaftClusterTest.testCreateClusterAndCreateListDeleteTopic

2021-04-27 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334205#comment-17334205
 ] 

A. Sophie Blee-Goldman commented on KAFKA-12666:


There are quite a few tickets for the (very frequently failing) RaftClusterTest 
issues; it seems we have 
[KAFKA-12629|https://issues.apache.org/jira/browse/KAFKA-12629] as an umbrella 
ticket, so can we close this one as a duplicate to keep everything in one place? 

> Fix flaky 
> kafka.server.RaftClusterTest.testCreateClusterAndCreateListDeleteTopic
> 
>
> Key: KAFKA-12666
> URL: https://issues.apache.org/jira/browse/KAFKA-12666
> Project: Kafka
>  Issue Type: Test
>Reporter: Bruno Cadonna
>Priority: Major
>
> Found two similar failures of this test on a PR that was unrelated:
> {code:java}
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.TimeoutException: Call(callName=createTopics, 
> deadlineMs=1618341006330, tries=583, nextAllowedTryMs=1618341006437) timed 
> out at 1618341006337 after 583 attempt(s)
>   at 
> org.apache.kafka.common.internals.KafkaFutureImpl.wrapAndThrow(KafkaFutureImpl.java:45)
>   at 
> org.apache.kafka.common.internals.KafkaFutureImpl.access$000(KafkaFutureImpl.java:32)
>   at 
> org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:89)
>   at 
> org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:260)
>   at 
> kafka.server.RaftClusterTest.testCreateClusterAndCreateListDeleteTopic(RaftClusterTest.scala:94)
>  {code}
>  
> {code:java}
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node 
> assignment. Call: createTopics
>   at 
> org.apache.kafka.common.internals.KafkaFutureImpl.wrapAndThrow(KafkaFutureImpl.java:45)
>   at 
> org.apache.kafka.common.internals.KafkaFutureImpl.access$000(KafkaFutureImpl.java:32)
>   at 
> org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:89)
>   at 
> org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:260)
>   at 
> kafka.server.RaftClusterTest.testCreateClusterAndCreateListDeleteTopic(RaftClusterTest.scala:94)
>  {code}
>  
> [https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-10529/4/testReport/?cloudbees-analytics-link=scm-reporting%2Ftests%2Ffailed]
>  
> Might be related to KAFKA-12561.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-12629) Flaky Test RaftClusterTest

2021-04-27 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334174#comment-17334174
 ] 

A. Sophie Blee-Goldman commented on KAFKA-12629:


I'm seeing a lot of that exception as well, ie the 

{noformat}
java.util.concurrent.ExecutionException: 
org.apache.kafka.common.errors.UnknownServerException: The server experienced 
an unexpected error when processing the request.
{noformat}

Four different RaftClusterTest failures with this same error on 
https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-8923/6/testReport/?cloudbees-analytics-link=scm-reporting%2Ftests%2Ffailed


> Flaky Test RaftClusterTest
> --
>
> Key: KAFKA-12629
> URL: https://issues.apache.org/jira/browse/KAFKA-12629
> Project: Kafka
>  Issue Type: Test
>  Components: core, unit tests
>Reporter: Matthias J. Sax
>Assignee: Luke Chen
>Priority: Critical
>  Labels: flaky-test
>
> {quote} {{java.util.concurrent.ExecutionException: 
> java.lang.ClassNotFoundException: 
> org.apache.kafka.controller.NoOpSnapshotWriterBuilder
>   at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
>   at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
>   at 
> kafka.testkit.KafkaClusterTestKit.startup(KafkaClusterTestKit.java:364)
>   at 
> kafka.server.RaftClusterTest.testCreateClusterAndCreateAndManyTopicsWithManyPartitions(RaftClusterTest.scala:181)}}{quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-12666) Fix flaky kafka.server.RaftClusterTest.testCreateClusterAndCreateListDeleteTopic

2021-04-27 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman resolved KAFKA-12666.

Resolution: Duplicate

> Fix flaky 
> kafka.server.RaftClusterTest.testCreateClusterAndCreateListDeleteTopic
> 
>
> Key: KAFKA-12666
> URL: https://issues.apache.org/jira/browse/KAFKA-12666
> Project: Kafka
>  Issue Type: Test
>Reporter: Bruno Cadonna
>Priority: Major
>
> Found two similar failures of this test on a PR that was unrelated:
> {code:java}
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.TimeoutException: Call(callName=createTopics, 
> deadlineMs=1618341006330, tries=583, nextAllowedTryMs=1618341006437) timed 
> out at 1618341006337 after 583 attempt(s)
>   at 
> org.apache.kafka.common.internals.KafkaFutureImpl.wrapAndThrow(KafkaFutureImpl.java:45)
>   at 
> org.apache.kafka.common.internals.KafkaFutureImpl.access$000(KafkaFutureImpl.java:32)
>   at 
> org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:89)
>   at 
> org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:260)
>   at 
> kafka.server.RaftClusterTest.testCreateClusterAndCreateListDeleteTopic(RaftClusterTest.scala:94)
>  {code}
>  
> {code:java}
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node 
> assignment. Call: createTopics
>   at 
> org.apache.kafka.common.internals.KafkaFutureImpl.wrapAndThrow(KafkaFutureImpl.java:45)
>   at 
> org.apache.kafka.common.internals.KafkaFutureImpl.access$000(KafkaFutureImpl.java:32)
>   at 
> org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:89)
>   at 
> org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:260)
>   at 
> kafka.server.RaftClusterTest.testCreateClusterAndCreateListDeleteTopic(RaftClusterTest.scala:94)
>  {code}
>  
> [https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-10529/4/testReport/?cloudbees-analytics-link=scm-reporting%2Ftests%2Ffailed]
>  
> Might be related to KAFKA-12561.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-12721) Flaky Test ResetConsumerGroupOffsetTest.testResetOffsetsAllTopicsAllGroups

2021-04-27 Thread A. Sophie Blee-Goldman (Jira)
A. Sophie Blee-Goldman created KAFKA-12721:
--

 Summary: Flaky Test 
ResetConsumerGroupOffsetTest.testResetOffsetsAllTopicsAllGroups
 Key: KAFKA-12721
 URL: https://issues.apache.org/jira/browse/KAFKA-12721
 Project: Kafka
  Issue Type: Bug
  Components: consumer
Reporter: A. Sophie Blee-Goldman


https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-8923/6/testReport/kafka.admin/ResetConsumerGroupOffsetTest/Build___JDK_8_and_Scala_2_12___testResetOffsetsAllTopicsAllGroups__/

Stacktrace
org.opentest4j.AssertionFailedError: Expected that consumer group has consumed 
all messages from topic/partition. Expected offset: 100. Actual offset: 0
at org.junit.jupiter.api.AssertionUtils.fail(AssertionUtils.java:39)
at org.junit.jupiter.api.Assertions.fail(Assertions.java:117)
at 
kafka.admin.ResetConsumerGroupOffsetTest.awaitConsumerProgress(ResetConsumerGroupOffsetTest.scala:455)
at 
kafka.admin.ResetConsumerGroupOffsetTest.$anonfun$testResetOffsetsAllTopicsAllGroups$5(ResetConsumerGroupOffsetTest.scala:140)
at 
kafka.admin.ResetConsumerGroupOffsetTest.$anonfun$testResetOffsetsAllTopicsAllGroups$5$adapted(ResetConsumerGroupOffsetTest.scala:137)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at 
kafka.admin.ResetConsumerGroupOffsetTest.$anonfun$testResetOffsetsAllTopicsAllGroups$4(ResetConsumerGroupOffsetTest.scala:137)
at 
kafka.admin.ResetConsumerGroupOffsetTest.$anonfun$testResetOffsetsAllTopicsAllGroups$4$adapted(ResetConsumerGroupOffsetTest.scala:136)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at 
kafka.admin.ResetConsumerGroupOffsetTest.testResetOffsetsAllTopicsAllGroups(ResetConsumerGroupOffsetTest.scala:136)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-6435) Application Reset Tool might delete incorrect internal topics

2021-04-27 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman updated KAFKA-6435:
--
Fix Version/s: 3.0.0

> Application Reset Tool might delete incorrect internal topics
> -
>
> Key: KAFKA-6435
> URL: https://issues.apache.org/jira/browse/KAFKA-6435
> Project: Kafka
>  Issue Type: Bug
>  Components: streams, tools
>Affects Versions: 1.0.0
>Reporter: Matthias J. Sax
>Assignee: Joel Wee
>Priority: Major
>  Labels: bug, help-wanted, newbie
> Fix For: 3.0.0
>
>
> The streams application reset tool deletes all topics that start with 
> {{-}}.
> If people have two versions of the same application and name them {{"app"}} 
> and {{"app-v2"}}, resetting {{"app"}} would also delete the internal topics 
> of {{"app-v2"}}.
> We either need to disallow the dash in the application ID, or improve the 
> topic identification logic in the reset tool to fix this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-6435) Application Reset Tool might delete incorrect internal topics

2021-04-27 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman resolved KAFKA-6435.
---
Resolution: Fixed

> Application Reset Tool might delete incorrect internal topics
> -
>
> Key: KAFKA-6435
> URL: https://issues.apache.org/jira/browse/KAFKA-6435
> Project: Kafka
>  Issue Type: Bug
>  Components: streams, tools
>Affects Versions: 1.0.0
>Reporter: Matthias J. Sax
>Assignee: Joel Wee
>Priority: Major
>  Labels: bug, help-wanted, newbie
> Fix For: 3.0.0
>
>
> The streams application reset tool deletes all topics that start with 
> {{-}}.
> If people have two versions of the same application and name them {{"app"}} 
> and {{"app-v2"}}, resetting {{"app"}} would also delete the internal topics 
> of {{"app-v2"}}.
> We either need to disallow the dash in the application ID, or improve the 
> topic identification logic in the reset tool to fix this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-8147) Add changelog topic configuration to KTable suppress

2021-04-26 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17332711#comment-17332711
 ] 

A. Sophie Blee-Goldman commented on KAFKA-8147:
---

[~philbour] that would be a bug; you should be able to set these configs in any 
order. It seems like BufferConfigInternal#emitEarlyWhenFull creates a new 
EagerBufferConfigImpl and passes the two original configs (maxRecords and 
maxBytes) into the constructor, but loses the logging configs at that point. 
The same thing happens in BufferConfigInternal#shutDownWhenFull. It looks like 
the PR for this feature just missed updating this; I notice that it did 
remember to add this parameter in the constructor calls inside 
EagerBufferConfigImpl and StrictBufferConfigImpl.

That said, this looks like kind of an abuse of this pattern, so I'm not 
surprised bugs slipped through. Maybe instead of just patching the current 
problem by adding this parameter to the constructor calls in 
BufferConfigInternal, we can try to refactor things a bit so we aren't calling 
constructors all over the place and leaving things vulnerable to future 
changes. For example, in Materialized all of the non-static .withXXX methods 
just set that parameter directly instead of creating a new Materialized object 
every time you set some configuration. But I'm sure there was a reason to do it 
this way initially...

[~philbour] can you file a separate ticket for this? And would you be 
interested in submitting a PR to fix the bug you found?
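For illustration, here is a simplified sketch of the pattern being described -- 
not the actual Kafka Streams internals, and the class and field names are made 
up for the example -- showing how reconstructing the config object can silently 
drop the logging configs:

{code:java}
import java.util.Collections;
import java.util.Map;

// Hypothetical, simplified stand-in for the internal buffer config classes.
class BufferConfigSketch {
    final long maxRecords;
    final long maxBytes;
    final Map<String, String> logConfig; // the piece that gets lost

    BufferConfigSketch(final long maxRecords, final long maxBytes, final Map<String, String> logConfig) {
        this.maxRecords = maxRecords;
        this.maxBytes = maxBytes;
        this.logConfig = logConfig;
    }

    // Buggy pattern: builds a new instance from only two of the three fields,
    // so any logging config set before this call is silently dropped.
    BufferConfigSketch emitEarlyWhenFull() {
        return new BufferConfigSketch(maxRecords, maxBytes, Collections.emptyMap());
    }

    // Safer pattern: carry every field forward (or set only the changed field directly).
    BufferConfigSketch emitEarlyWhenFullFixed() {
        return new BufferConfigSketch(maxRecords, maxBytes, logConfig);
    }
}
{code}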

> Add changelog topic configuration to KTable suppress
> 
>
> Key: KAFKA-8147
> URL: https://issues.apache.org/jira/browse/KAFKA-8147
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Affects Versions: 2.1.1
>Reporter: Maarten
>Assignee: highluck
>Priority: Minor
>  Labels: kip
> Fix For: 2.6.0
>
>
> The streams DSL does not provide a way to configure the changelog topic 
> created by KTable.suppress.
> From the perspective of an external user this could be implemented similar to 
> the configuration of aggregate + materialized, i.e.,
> {code:java}
> changelogTopicConfigs = // Configs
> materialized = Materialized.as(..).withLoggingEnabled(changelogTopicConfigs)
> ..
> KGroupedStream.aggregate(..,materialized)
> {code}
> [KIP-446: 
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-446%3A+Add+changelog+topic+configuration+to+KTable+suppress|https://cwiki.apache.org/confluence/display/KAFKA/KIP-446%3A+Add+changelog+topic+configuration+to+KTable+suppress]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-12344) Support SlidingWindows in the Scala API

2021-04-26 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman resolved KAFKA-12344.

Fix Version/s: 3.0.0
   Resolution: Fixed

> Support SlidingWindows in the Scala API
> ---
>
> Key: KAFKA-12344
> URL: https://issues.apache.org/jira/browse/KAFKA-12344
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Affects Versions: 2.7.0
>Reporter: Leah Thomas
>Assignee: Ketul Gupta
>Priority: Major
>  Labels: newbie, scala
> Fix For: 3.0.0
>
>
> In KIP-450 we implemented sliding windows for the Java API but left out a few 
> crucial methods needed for sliding windows to work through the Scala API. We 
> need to add those methods so the Scala API can fully leverage sliding windows.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-12710) Consider enabling (at least some) optimizations by default

2021-04-23 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17330995#comment-17330995
 ] 

A. Sophie Blee-Goldman commented on KAFKA-12710:


Thanks, I'd forgotten about that KIP. Being able to selectively disable 
optimizations would make enabling (at least some) optimizations by default much 
more palatable; if we can do them together, that would be ideal.

> Consider enabling (at least some) optimizations by default
> --
>
> Key: KAFKA-12710
> URL: https://issues.apache.org/jira/browse/KAFKA-12710
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: A. Sophie Blee-Goldman
>Priority: Major
>
> Topology optimizations such as the repartition consolidation and source topic 
> changelog are extremely useful at reducing the footprint of a Kafka Streams 
> application on the broker. The additional storage and resource utilization 
> due to changelogs and repartitions is a very real pain point, and has even 
> been cited as the reason for turning to other stream processing frameworks in 
> the past (though of course I question that judgement)
> The repartition topic optimization, at the very least, should be enabled by 
> default. The problem is that we can't just flip the switch without breaking 
> existing applications during upgrade, since the location and name of such 
> topics in the topology may change. One possibility is to just detect this 
> situation and disable the optimization if we find that it would produce an 
> incompatible topology for an existing application. We can determine that this 
> is the case simply by looking for pre-existing repartition topics. If any 
> such topics are present, and match the set of repartition topics in the 
> un-optimized topology, then we know we need to switch the optimization off. 
> If we don't find any repartition topics, or they match the optimized 
> topology, then we're safe to enable it by default.
> Alternatively, we could just do a KIP to indicate that we intend to change 
> the default in the next breaking release and that existing applications 
> should override this config if necessary. We should be able to implement a 
> fail-safe and shut down if a user misses or forgets to do so, using the 
> method mentioned above.
> The source topic optimization is perhaps more controversial, as there have 
> been a few issues raised with regards to things like [restoring bad data and 
> asymmetric serdes|https://issues.apache.org/jira/browse/KAFKA-8037], or more 
> recently the bug discovered in the [emit-on-change semantics for 
> KTables|https://issues.apache.org/jira/browse/KAFKA-12508?focusedCommentId=17306323=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17306323].
>  However for this case at least there are no compatibility concerns. It's 
> safe to upgrade from using a separate changelog for a source KTable to just 
> using that source topic directly, although the reverse is not true. We could 
> even automatically delete the no-longer-necessary changelog for upgrading 
> applications



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-12710) Consider enabling (at least some) optimizations by default

2021-04-22 Thread A. Sophie Blee-Goldman (Jira)
A. Sophie Blee-Goldman created KAFKA-12710:
--

 Summary: Consider enabling (at least some) optimizations by default
 Key: KAFKA-12710
 URL: https://issues.apache.org/jira/browse/KAFKA-12710
 Project: Kafka
  Issue Type: Improvement
  Components: streams
Reporter: A. Sophie Blee-Goldman


Topology optimizations such as the repartition consolidation and source topic 
changelog are extremely useful at reducing the footprint of a Kafka Streams 
application on the broker. The additional storage and resource utilization due 
to changelogs and repartitions is a very real pain point, and has even been 
cited as the reason for turning to other stream processing frameworks in the 
past (though of course I question that judgement)

The repartition topic optimization, at the very least, should be enabled by 
default. The problem is that we can't just flip the switch without breaking 
existing applications during upgrade, since the location and name of such 
topics in the topology may change. One possibility is to just detect this 
situation and disable the optimization if we find that it would produce an 
incompatible topology for an existing application. We can determine that this 
is the case simply by looking for pre-existing repartition topics. If any such 
topics are present, and match the set of repartition topics in the un-optimized 
topology, then we know we need to switch the optimization off. If we don't find 
any repartition topics, or they match the optimized topology, then we're safe 
to enable it by default.
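A rough sketch of the compatibility check described above (the helper and its 
inputs are hypothetical, not an existing Streams API):

{code:java}
import java.util.Set;

class RepartitionCompatibilityCheckSketch {
    // Safe to enable the repartition optimization if the app is brand new (no
    // repartition topics exist yet) or if the topics that already exist are
    // exactly the ones the optimized topology would use.
    static boolean safeToEnableOptimization(final Set<String> existingRepartitionTopics,
                                            final Set<String> optimizedRepartitionTopics) {
        return existingRepartitionTopics.isEmpty()
            || existingRepartitionTopics.equals(optimizedRepartitionTopics);
    }
}
{code}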

Alternatively, we could just do a KIP to indicate that we intend to change the 
default in the next breaking release and that existing applications should 
override this config if necessary. We should be able to implement a fail-safe 
and shut down if a user misses or forgets to do so, using the method mentioned 
above.

The source topic optimization is perhaps more controversial, as there have been 
a few issues raised with regards to things like [restoring bad data and 
asymmetric serdes|https://issues.apache.org/jira/browse/KAFKA-8037], or more 
recently the bug discovered in the [emit-on-change semantics for 
KTables|https://issues.apache.org/jira/browse/KAFKA-12508?focusedCommentId=17306323=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17306323].
 However for this case at least there are no compatibility concerns. It's safe 
to upgrade from using a separate changelog for a source KTable to just using 
that source topic directly, although the reverse is not true. We could even 
automatically delete the no-longer-necessary changelog for upgrading 
applications



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-10493) KTable out-of-order updates are not being ignored

2021-04-22 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17329888#comment-17329888
 ] 

A. Sophie Blee-Goldman commented on KAFKA-10493:


Imo we should find a way to fix this that doesn't prevent users from leveraging 
the source topic optimization. As I've mentioned before, the additional storage 
footprint from changelogs is a very real complaint and has been cited as the 
reason for not using Kafka Streams in the past. And it sounds to me like this 
would make it even worse, as we would need to not only use a dedicated 
changelog for all source KTables but also disable compaction entirely IIUC. 
That just does not sound like a feasible path forward.

I haven't fully digested this current discussion about the impact of dropping 
out-of-order updates with a compacted changelog, but perhaps we could store 
some information in the committed offset metadata to help us here?

> KTable out-of-order updates are not being ignored
> -
>
> Key: KAFKA-10493
> URL: https://issues.apache.org/jira/browse/KAFKA-10493
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.6.0
>Reporter: Pedro Gontijo
>Assignee: Matthias J. Sax
>Priority: Blocker
> Fix For: 3.0.0
>
> Attachments: KTableOutOfOrderBug.java
>
>
> On a materialized KTable, out-of-order records for a given key (records whose 
> timestamp is older than the current value in the store) are not being ignored 
> but are used to update the local store value and are also forwarded.
> I believe the bug is here: 
> [https://github.com/apache/kafka/blob/2.6.0/streams/src/main/java/org/apache/kafka/streams/state/internals/ValueAndTimestampSerializer.java#L77]
>  It should return true, not false (see javadoc)
> The bug impacts here: 
> [https://github.com/apache/kafka/blob/2.6.0/streams/src/main/java/org/apache/kafka/streams/kstream/internals/KTableSource.java#L142-L148]
> I have attached a simple stream app that shows the issue happening.
> Thank you!
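For reference, the semantics of the check being described, as a standalone 
sketch (illustrative only; this is not the actual ValueAndTimestampSerializer 
code):

{code:java}
class OutOfOrderCheckSketch {
    // A record is out-of-order for a materialized KTable -- and should be ignored
    // rather than overwrite the store -- when its timestamp is strictly older than
    // the timestamp already stored for that key.
    static boolean isOutOfOrder(final long recordTimestamp, final long storedTimestamp) {
        return recordTimestamp < storedTimestamp;
    }
}
{code}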



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-12699) Streams no longer overrides the java default uncaught exception handler

2021-04-21 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17327007#comment-17327007
 ] 

A. Sophie Blee-Goldman commented on KAFKA-12699:


Marking it down for 3.0 so we don't forget

> Streams no longer overrides the java default uncaught exception handler  
> -
>
> Key: KAFKA-12699
> URL: https://issues.apache.org/jira/browse/KAFKA-12699
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.8.0
>Reporter: Walker Carlson
>Priority: Minor
> Fix For: 3.0.0
>
>
> If a user used `Thread.setUncaughtExceptionHandler()` to set the handler for 
> all threads in the runtime, streams would override that with its own handler. 
> However, since streams does not use the `Thread` handler anymore, it will no 
> longer do so. This can cause problems if the user does something like 
> `System.exit(1)` in the handler. 
>  
> If using the old handler in streams, it will still work as it used to.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-12699) Streams no longer overrides the java default uncaught exception handler

2021-04-21 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman updated KAFKA-12699:
---
Fix Version/s: 3.0.0

> Streams no longer overrides the java default uncaught exception handler  
> -
>
> Key: KAFKA-12699
> URL: https://issues.apache.org/jira/browse/KAFKA-12699
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.8.0
>Reporter: Walker Carlson
>Priority: Minor
> Fix For: 3.0.0
>
>
> If a user used `Thread.setUncaughtExceptionHandler()` to set the handler for 
> all threads in the runtime, streams would override that with its own handler. 
> However, since streams does not use the `Thread` handler anymore, it will no 
> longer do so. This can cause problems if the user does something like 
> `System.exit(1)` in the handler. 
>  
> If using the old handler in streams, it will still work as it used to.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-12699) Streams no longer overrides the java default uncaught exception handler

2021-04-21 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17327006#comment-17327006
 ] 

A. Sophie Blee-Goldman commented on KAFKA-12699:


> we should have the exception raised again to make sure it is handled properly.
I agree with the sentiment here, but if we disable the uncaught exception 
handler then there's no other handling that is/can be done. I'm also not sure 
what EOS has to do with this; in both ALOS and EOS we want to (and have to) do 
a "dirty close", otherwise we're committing bad data.

> stream user and the owner of the runtime are not entirely equal so might have 
> different requirements
Good point


> Streams no longer overrides the java default uncaught exception handler  
> -
>
> Key: KAFKA-12699
> URL: https://issues.apache.org/jira/browse/KAFKA-12699
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.8.0
>Reporter: Walker Carlson
>Priority: Minor
>
> If a user used `Thread.setUncaughtExceptionHandler()` to set the handler for 
> all threads in the runtime, streams would override that with its own handler. 
> However, since streams does not use the `Thread` handler anymore, it will no 
> longer do so. This can cause problems if the user does something like 
> `System.exit(1)` in the handler. 
>  
> If using the old handler in streams, it will still work as it used to.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-12706) Consider adding reason and source of error in APPLICATION_SHUTDOWN

2021-04-21 Thread A. Sophie Blee-Goldman (Jira)
A. Sophie Blee-Goldman created KAFKA-12706:
--

 Summary: Consider adding reason and source of error in 
APPLICATION_SHUTDOWN
 Key: KAFKA-12706
 URL: https://issues.apache.org/jira/browse/KAFKA-12706
 Project: Kafka
  Issue Type: Improvement
  Components: streams
Reporter: A. Sophie Blee-Goldman


At the moment, when a user opts to shut down the application in the streams 
uncaught exception handler, we just send a signal to all members of the group, 
who then shut down. If there are a large number of application instances 
running, it can be annoying and time-consuming to locate the client that hit 
this error.

It would be nice if we could let each client log the exception that triggered 
this, and possibly also the client who requested the shutdown. That would make 
it much easier to identify the problem and figure out which set of logs to 
look into for further information.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-12699) Streams no longer overrides the java default uncaught exception handler

2021-04-20 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17326215#comment-17326215
 ] 

A. Sophie Blee-Goldman commented on KAFKA-12699:


To be fair, I think claiming that this is a bug that Streams "no longer 
overrides the java default handler" is a bit harsh since (a) as you point out, 
this isn't a regression but just a different behavior in a new feature, and (b) 
it was never specified or implied that the new feature would do exactly the 
same thing as the old feature. I would actually argue that continuing to 
override the java default handler when the user has not actually supplied a 
Thread.UncaughtExceptionHandler to override it would be more unexpected.

However, I do think it's a bit odd that we would end up invoking the default 
thread uncaught exception handler at all, since Streams will have already 
caught and invoked the user-defined handler at that point. The "bug" from my 
perspective is that we rethrow the exception up through run(), vs swallowing it 
once we reach the outer try in StreamThread#run.

If we think that the user-defined default handler should not be invoked when 
using the new StreamsUncaughtExceptionHandler (and I think that assumption is 
not a given, although I don't feel strongly for or against), and should be 
overridden with a no-op handler, then why continue to throw the exception? And 
if we don't throw the exception, then why do we need to override with a no-op? 
-- just my line of thinking; again, I'm not sure what a realistic user-expected 
behavior is here. But setting a global exception handler and then just blindly 
calling System.exit in a multi-threaded app where you've consciously 
implemented a handler that makes it clear threads can die and be 
replaced...seems like user error to me.
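For concreteness, a minimal sketch of how the two handlers in question get 
wired up (assuming the 2.8+ StreamsUncaughtExceptionHandler API; the logging 
and the chosen response are illustrative):

{code:java}
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.errors.StreamsUncaughtExceptionHandler.StreamThreadExceptionResponse;

class HandlerWiringSketch {
    // `streams` is assumed to be an already-constructed KafkaStreams instance.
    static void configureHandlers(final KafkaStreams streams) {
        // New Streams-level handler: decide whether to replace the dead thread,
        // shut down this client, or shut down the whole application.
        streams.setUncaughtExceptionHandler(exception -> {
            System.err.println("StreamThread died: " + exception);
            return StreamThreadExceptionResponse.REPLACE_THREAD;
        });

        // JVM-wide default handler set by the user; the discussion above is about
        // whether this should still fire for exceptions Streams has already handled.
        Thread.setDefaultUncaughtExceptionHandler((thread, throwable) ->
            System.err.println("Uncaught on " + thread.getName() + ": " + throwable));
    }
}
{code}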

> Streams no longer overrides the java default uncaught exception handler  
> -
>
> Key: KAFKA-12699
> URL: https://issues.apache.org/jira/browse/KAFKA-12699
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.8.0
>Reporter: Walker Carlson
>Priority: Minor
>
> If a user used `Thread.setUncaughtExceptionHandler()` to set the handler for 
> all threads in the runtime, streams would override that with its own handler. 
> However, since streams does not use the `Thread` handler anymore, it will no 
> longer do so. This can cause problems if the user does something like 
> `System.exit(1)` in the handler. 
>  
> If using the old handler in streams, it will still work as it used to.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-12700) The admin.listeners config has wonky valid values in the docs

2021-04-20 Thread A. Sophie Blee-Goldman (Jira)
A. Sophie Blee-Goldman created KAFKA-12700:
--

 Summary: The admin.listeners config has wonky valid values in the 
docs
 Key: KAFKA-12700
 URL: https://issues.apache.org/jira/browse/KAFKA-12700
 Project: Kafka
  Issue Type: Bug
  Components: docs, KafkaConnect
Reporter: A. Sophie Blee-Goldman


Noticed this while updating the docs for the 2.6.2 release: the docs for these 
configs are generated from the config definition, including info such as 
default, type, valid values, etc. 

When defining WorkerConfig.ADMIN_LISTENERS_CONFIG we seem to pass an actual 
`new AdminListenersValidator()` object in for the "valid values" parameter, 
causing this field to display a wonky, useless object reference in the docs. 
See https://kafka.apache.org/documentation/#connectconfigs_admin.listeners:

Valid Values:   
org.apache.kafka.connect.runtime.WorkerConfig$AdminListenersValidator@383534aa
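For reference, a sketch of why this happens and what a fix could look like: the 
generated docs render the validator via its toString(), so a Validator without 
a toString() override shows up as a bare object reference. This is illustrative 
only, not the actual Connect WorkerConfig code:

{code:java}
import java.util.List;
import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.common.config.ConfigDef.Importance;
import org.apache.kafka.common.config.ConfigDef.Type;
import org.apache.kafka.common.config.ConfigException;

class ValidatorDocsSketch {
    static class ListenersValidator implements ConfigDef.Validator {
        @Override
        public void ensureValid(final String name, final Object value) {
            if (value != null && !(value instanceof List)) {
                throw new ConfigException(name, value, "must be a list of listener URLs");
            }
        }

        // Without this override, the generated docs would show something like
        // ValidatorDocsSketch$ListenersValidator@383534aa as the "Valid Values".
        @Override
        public String toString() {
            return "A comma-separated list of listener URLs";
        }
    }

    static ConfigDef configDef() {
        return new ConfigDef().define("admin.listeners", Type.LIST, null,
                new ListenersValidator(), Importance.LOW, "Admin REST API listeners.");
    }
}
{code}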



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-12696) Add standard getters to LagInfo class to allow automatic serialization

2021-04-20 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17326075#comment-17326075
 ] 

A. Sophie Blee-Goldman commented on KAFKA-12696:


Thanks for the PR [~mihasya]. You need to be added as a contributor to assign 
yourself issues, by the way. I just added you, so you should be able to 
self-assign any tickets from now on.

> Add standard getters to LagInfo class to allow automatic serialization
> --
>
> Key: KAFKA-12696
> URL: https://issues.apache.org/jira/browse/KAFKA-12696
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: Mikhail Panchenko
>Assignee: Mikhail Panchenko
>Priority: Trivial
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> The LagInfo class has non-standard getters for its member variables. This 
> means that Jackson and other serialization frameworks do not know how to 
> serialize them without additional annotations. So when implementing the sort 
> of system that KAFKA-6144 is meant to enable, as documented here in docs like 
> [Kafka Streams Interactive 
> Queries|https://docs.confluent.io/platform/current/streams/developer-guide/interactive-queries.html#querying-state-stores-during-a-rebalance],
>  one has to either inject a bunch of custom serialization logic into Jersey 
> or wrap this class.
> The patch to fix this is trivial, and I will be putting one up shortly.
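For anyone hitting this before a fix lands, a small wrapper along these lines 
is one workaround (the DTO class is hypothetical, not part of Kafka; it just 
exposes JavaBean-style getters that Jackson can discover):

{code:java}
import org.apache.kafka.streams.LagInfo;

public class LagInfoDto {
    private final long currentOffsetPosition;
    private final long endOffsetPosition;
    private final long offsetLag;

    public LagInfoDto(final LagInfo lagInfo) {
        this.currentOffsetPosition = lagInfo.currentOffsetPosition();
        this.endOffsetPosition = lagInfo.endOffsetPosition();
        this.offsetLag = lagInfo.offsetLag();
    }

    // Standard getters, so Jackson and friends can serialize without extra annotations.
    public long getCurrentOffsetPosition() { return currentOffsetPosition; }
    public long getEndOffsetPosition() { return endOffsetPosition; }
    public long getOffsetLag() { return offsetLag; }
}
{code}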



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (KAFKA-12696) Add standard getters to LagInfo class to allow automatic serialization

2021-04-20 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman reassigned KAFKA-12696:
--

Assignee: Mikhail Panchenko

> Add standard getters to LagInfo class to allow automatic serialization
> --
>
> Key: KAFKA-12696
> URL: https://issues.apache.org/jira/browse/KAFKA-12696
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: Mikhail Panchenko
>Assignee: Mikhail Panchenko
>Priority: Trivial
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> The LagInfo class has non-standard getters for its member variables. This 
> means that Jackson and other serialization frameworks do not know how to 
> serialize them without additional annotations. So when implementing the sort 
> of system that KAFKA-6144 is meant to enable, as documented here in docs like 
> [Kafka Streams Interactive 
> Queries|https://docs.confluent.io/platform/current/streams/developer-guide/interactive-queries.html#querying-state-stores-during-a-rebalance],
>  one has to either inject a bunch of custom serialization logic into Jersey 
> or wrap this class.
> The patch to fix this is trivial, and I will be putting one up shortly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-12690) Remove deprecated Producer#sendOffsetsToTransaction

2021-04-19 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman updated KAFKA-12690:
---
Description: In 
[KIP-732|https://cwiki.apache.org/confluence/display/KAFKA/KIP-732%3A+Deprecate+eos-alpha+and+replace+eos-beta+with+eos-v2]
 we deprecated the EXACTLY_ONCE and EXACTLY_ONCE_BETA configs in StreamsConfig, 
to be removed in 4.0

> Remove deprecated Producer#sendOffsetsToTransaction
> ---
>
> Key: KAFKA-12690
> URL: https://issues.apache.org/jira/browse/KAFKA-12690
> Project: Kafka
>  Issue Type: Task
>  Components: producer 
>Reporter: A. Sophie Blee-Goldman
>Priority: Blocker
> Fix For: 4.0.0
>
>
> In 
> [KIP-732|https://cwiki.apache.org/confluence/display/KAFKA/KIP-732%3A+Deprecate+eos-alpha+and+replace+eos-beta+with+eos-v2]
>  we deprecated the EXACTLY_ONCE and EXACTLY_ONCE_BETA configs in 
> StreamsConfig, to be removed in 4.0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-12690) Remove deprecated Producer#sendOffsetsToTransaction

2021-04-19 Thread A. Sophie Blee-Goldman (Jira)
A. Sophie Blee-Goldman created KAFKA-12690:
--

 Summary: Remove deprecated Producer#sendOffsetsToTransaction
 Key: KAFKA-12690
 URL: https://issues.apache.org/jira/browse/KAFKA-12690
 Project: Kafka
  Issue Type: Task
  Components: producer 
Reporter: A. Sophie Blee-Goldman
 Fix For: 4.0.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-12689) Remove deprecated EOS configs

2021-04-19 Thread A. Sophie Blee-Goldman (Jira)
A. Sophie Blee-Goldman created KAFKA-12689:
--

 Summary: Remove deprecated EOS configs
 Key: KAFKA-12689
 URL: https://issues.apache.org/jira/browse/KAFKA-12689
 Project: Kafka
  Issue Type: Task
  Components: streams
Reporter: A. Sophie Blee-Goldman
 Fix For: 4.0.0


In 
[KIP-732|https://cwiki.apache.org/confluence/display/KAFKA/KIP-732%3A+Deprecate+eos-alpha+and+replace+eos-beta+with+eos-v2]
 we deprecated the EXACTLY_ONCE and EXACTLY_ONCE_BETA configs in StreamsConfig, 
to be removed in 4.0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9168) Integrate JNI direct buffer support to RocksDBStore

2021-04-19 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17325187#comment-17325187
 ] 

A. Sophie Blee-Goldman commented on KAFKA-9168:
---

[~sagarrao] Feel free to pick this up, but you may need to hold off on working 
on it until [~cadonna] has completed KAFKA-8897. I believe he's battling some 
weird runtime issues in RocksDB with the upgrade at the moment, but once we've 
figured that out then this should be unblocked. 

> Integrate JNI direct buffer support to RocksDBStore
> ---
>
> Key: KAFKA-9168
> URL: https://issues.apache.org/jira/browse/KAFKA-9168
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: Sagar Rao
>Priority: Major
>  Labels: perfomance
>
> There has been a PR created on rocksdb Java client to support direct 
> ByteBuffers in Java. We can look at integrating it whenever it gets merged. 
> Link to PR: [https://github.com/facebook/rocksdb/pull/2283]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-12675) Improve sticky general assignor scalability and performance

2021-04-19 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17325181#comment-17325181
 ] 

A. Sophie Blee-Goldman commented on KAFKA-12675:


Nice! It would be great if we could get these improvements into 3.0 since, as 
Luke mentioned, we plan to make this the default assignor.

> Improve sticky general assignor scalability and performance
> ---
>
> Key: KAFKA-12675
> URL: https://issues.apache.org/jira/browse/KAFKA-12675
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Luke Chen
>Assignee: Luke Chen
>Priority: Major
>
> Currently, we have "general assignor" for non-equal subscription case and 
> "constrained assignor" for all equal subscription case. There's a performance 
> test for constrained assignor with:
> topicCount = 500;
>  partitionCount = 2000; 
>  consumerCount = 2000;
> in _testLargeAssignmentAndGroupWithUniformSubscription,_ total 1 million 
> partitions and we can complete the assignment within 2 second in my machine.
> However, if we let 1 of the consumer subscribe to only 1 topic, it'll use 
> "general assignor", and the result with the same setting as above is: 
> *OutOfMemory,* 
>  Even we down the count to:
> topicCount = 50;
>  partitionCount = 1000; 
>  consumerCount = 1000;
> We still got *OutOfMemory*.
> With this setting:
> topicCount = 50;
>  partitionCount = 800; 
>  consumerCount = 800;
> We can complete in 10 seconds in my machine, which is still slow.
>  
> Since we are going to set default assignment strategy to 
> "CooperativeStickyAssignor" soon,  we should improve the scalability and 
> performance for sticky general assignor.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-12675) Improve sticky general assignor scalability and performance

2021-04-16 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17324090#comment-17324090
 ] 

A. Sophie Blee-Goldman commented on KAFKA-12675:


[~twmb] would you be interested in submitting a PR for your algorithm? You'd 
need to translate it (and tests) into Java, but other than that it seems pretty 
much complete.

> Improve sticky general assignor scalability and performance
> ---
>
> Key: KAFKA-12675
> URL: https://issues.apache.org/jira/browse/KAFKA-12675
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Luke Chen
>Assignee: Luke Chen
>Priority: Major
>
> Currently, we have "general assignor" for non-equal subscription case and 
> "constrained assignor" for all equal subscription case. There's a performance 
> test for constrained assignor with:
> topicCount = 500;
>  partitionCount = 2000; 
>  consumerCount = 2000;
> in _testLargeAssignmentAndGroupWithUniformSubscription,_ total 1 million 
> partitions and we can complete the assignment within 2 second in my machine.
> However, if we let 1 of the consumer subscribe to only 1 topic, it'll use 
> "general assignor", and the result with the same setting as above is: 
> *OutOfMemory,* 
>  Even we down the count to:
> topicCount = 50;
>  partitionCount = 1000; 
>  consumerCount = 1000;
> We still got *OutOfMemory*.
> With this setting:
> topicCount = 50;
>  partitionCount = 800; 
>  consumerCount = 800;
> We can complete in 10 seconds in my machine, which is still slow.
>  
> Since we are going to set default assignment strategy to 
> "CooperativeStickyAssignor" soon,  we should improve the scalability and 
> performance for sticky general assignor.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-12676) Improve sticky general assignor underlying algorithm for the unequal subscriptions case

2021-04-16 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman updated KAFKA-12676:
---
Summary: Improve sticky general assignor underlying algorithm for the 
unequal subscriptions case  (was: Improve sticky general assignor underlying 
algorithm for the imbalanced case)

> Improve sticky general assignor underlying algorithm for the unequal 
> subscriptions case
> ---
>
> Key: KAFKA-12676
> URL: https://issues.apache.org/jira/browse/KAFKA-12676
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Luke Chen
>Priority: Major
>
> As discussed in KAFKA-12675, we think the general assignor algorithm might be 
> improved to make some edge cases more balanced, and to improve the 
> scalability and performance.
> Ref: The algorithm is here: 
> [https://github.com/twmb/franz-go/blob/master/pkg/kgo/internal/sticky/graph.go]
>  
> [https://issues.apache.org/jira/browse/KAFKA-12675?focusedCommentId=17322603=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17322603]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-12675) Improve sticky general assignor scalability and performance

2021-04-16 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17324089#comment-17324089
 ] 

A. Sophie Blee-Goldman commented on KAFKA-12675:


Yeah just to clarify, what [~twmb] proposed would not require a KIP since it's 
just an improvement to the existing (somewhat lacking) algorithm for the 
general case. There shouldn't be any public facing impact, except perhaps for 
the memory consumption. But since the current algorithm can't even handle the 
partition counts he described testing, I'd still consider this an improvement 
across the board.

> Improve sticky general assignor scalability and performance
> ---
>
> Key: KAFKA-12675
> URL: https://issues.apache.org/jira/browse/KAFKA-12675
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Luke Chen
>Assignee: Luke Chen
>Priority: Major
>
> Currently, we have "general assignor" for non-equal subscription case and 
> "constrained assignor" for all equal subscription case. There's a performance 
> test for constrained assignor with:
> topicCount = 500;
>  partitionCount = 2000; 
>  consumerCount = 2000;
> in _testLargeAssignmentAndGroupWithUniformSubscription,_ total 1 million 
> partitions and we can complete the assignment within 2 second in my machine.
> However, if we let 1 of the consumer subscribe to only 1 topic, it'll use 
> "general assignor", and the result with the same setting as above is: 
> *OutOfMemory,* 
>  Even we down the count to:
> topicCount = 50;
>  partitionCount = 1000; 
>  consumerCount = 1000;
> We still got *OutOfMemory*.
> With this setting:
> topicCount = 50;
>  partitionCount = 800; 
>  consumerCount = 800;
> We can complete in 10 seconds in my machine, which is still slow.
>  
> Since we are going to set default assignment strategy to 
> "CooperativeStickyAssignor" soon,  we should improve the scalability and 
> performance for sticky general assignor.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-12477) Smart rebalancing with dynamic protocol selection

2021-04-14 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman updated KAFKA-12477:
---
Description: 
Users who want to upgrade their applications and enable the COOPERATIVE 
rebalancing protocol in their consumer apps are required to follow a double 
rolling bounce upgrade path. The reason for this is laid out in the [Consumer 
Upgrades|https://cwiki.apache.org/confluence/display/KAFKA/KIP-429%3A+Kafka+Consumer+Incremental+Rebalance+Protocol#KIP429:KafkaConsumerIncrementalRebalanceProtocol-Consumer]
 section of KIP-429. Basically, the ConsumerCoordinator picks a rebalancing 
protocol in its constructor based on the list of supported partition assignors. 
The protocol is selected as the highest protocol that is commonly supported by 
all assignors in the list, and never changes after that.

This is a bit unfortunate because it may end up using an older protocol even 
after every member in the group has been updated to support the newer protocol. 
After the first rolling bounce of the upgrade, all members will have two 
assignors: "cooperative-sticky" and "range" (or sticky/round-robin/etc). At 
this point the EAGER protocol will still be selected due to the presence of the 
"range" assignor, but it's the "cooperative-sticky" assignor that will 
ultimately be selected for use in rebalances if that assignor is preferred (ie 
positioned first in the list). The only reason for the second rolling bounce is 
to strip off the "range" assignor and allow the upgraded members to switch over 
to COOPERATIVE. We can't allow them to use cooperative rebalancing until 
everyone has been upgraded, but once they have it's safe to do so.
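For context, a sketch of what the consumer configuration looks like during that 
first rolling bounce (values are illustrative): both assignors are listed, with 
"cooperative-sticky" preferred, so the group coordinator keeps selecting 
"range"/EAGER until every member supports COOPERATIVE.

{code:java}
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.CooperativeStickyAssignor;
import org.apache.kafka.clients.consumer.RangeAssignor;

class FirstBounceConfigSketch {
    static Properties firstBounceConfig() {
        final Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // illustrative
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");                // illustrative
        // cooperative-sticky listed first (preferred), range kept as the common fallback
        props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
                  Arrays.asList(CooperativeStickyAssignor.class.getName(),
                                RangeAssignor.class.getName()));
        return props;
    }
}
{code}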

And there is already a way for the client to detect that everyone is on the new 
byte code: if the CooperativeStickyAssignor is selected by the group 
coordinator, then that means it is supported by all consumers in the group and 
therefore everyone must be upgraded. 

We may be able to save the second rolling bounce by dynamically updating the 
rebalancing protocol inside the ConsumerCoordinator as "the highest protocol 
supported by the assignor chosen by the group coordinator". This means we'll 
still be using EAGER at the first rebalance, since we of course need to wait 
for this initial rebalance to get the response from the group coordinator. But 
we should take the hint from the chosen assignor rather than dropping this 
information on the floor and sticking with the original protocol.


Concrete Proposal:
This assumes we will change the default assignor to ["cooperative-sticky", 
"range"] in KIP-726. It also acknowledges that users may attempt any kind of 
upgrade without reading the docs, and so we need to put in safeguards against 
data corruption rather than assume everyone will follow the safe upgrade path.

With this proposal, 
1) New applications on 3.0 will enable cooperative rebalancing by default
2) Existing applications which don’t set an assignor can safely upgrade to 3.0 
using a single rolling bounce with no extra steps, and will automatically 
transition to cooperative rebalancing
3) Existing applications which do set an assignor that uses EAGER can likewise 
upgrade their applications to COOPERATIVE with a single rolling bounce
4) Once on 3.0, applications can safely go back and forth between EAGER and 
COOPERATIVE
5) Applications can safely downgrade away from 3.0

The high-level idea for dynamic protocol upgrades is that the group will 
leverage the assignor selected by the group coordinator to determine when it’s 
safe to upgrade to COOPERATIVE, and trigger a fail-safe to protect the group in 
case of rare events or user misconfiguration. The group coordinator selects the 
most preferred assignor that’s supported by all members of the group, so we 
know that all members will support COOPERATIVE once we receive the 
“cooperative-sticky” assignor after a rebalance. At this point, each member can 
upgrade their own protocol to COOPERATIVE. However, there may be situations in 
which an EAGER member may join the group even after upgrading to COOPERATIVE. 
For example, during a rolling upgrade if the last remaining member on the old 
bytecode misses a rebalance, the other members will be allowed to upgrade to 
COOPERATIVE. If the old member rejoins and is chosen to be the group leader 
before it’s upgraded to 3.0, it won’t be aware that the other members of the 
group have not yet revoked their partitions when computing the assignment.

Short Circuit:
The risk of mixing the cooperative and eager rebalancing protocols is that a 
partition may be assigned to one member while it has yet to be revoked from its 
previous owner. The danger is that the new owner may begin processing and 
committing offsets for this partition while the previous owner is also 
committing offsets in its #onPartitionsRevoked callback, which is invoked at 
the end 

[jira] [Updated] (KAFKA-12669) Add deleteRange to WindowStore / KeyValueStore interfaces

2021-04-14 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman updated KAFKA-12669:
---
Issue Type: Improvement  (was: Bug)

> Add deleteRange to WindowStore / KeyValueStore interfaces
> -
>
> Key: KAFKA-12669
> URL: https://issues.apache.org/jira/browse/KAFKA-12669
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: Guozhang Wang
>Priority: Major
>  Labels: needs-kip
>
> We can consider adding such APIs where the underlying implementation classes 
> have better optimizations than deleting the keys as get-and-delete one by one.
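A rough sketch of what such an API could look like (a hypothetical interface, 
not an existing Streams API), with a naive default that optimized 
implementations such as RocksDB-backed stores could override with a native 
range delete:

{code:java}
import java.util.ArrayList;
import java.util.List;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

interface RangeDeletableStore<K, V> extends KeyValueStore<K, V> {
    // Naive default: collect the keys in the range, then delete them one by one.
    // An optimized implementation would override this with a single range delete.
    default void deleteRange(final K from, final K to) {
        final List<K> keys = new ArrayList<>();
        try (KeyValueIterator<K, V> iter = range(from, to)) {
            while (iter.hasNext()) {
                keys.add(iter.next().key);
            }
        }
        for (final K key : keys) {
            delete(key);
        }
    }
}
{code}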



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-12667) Incorrect error log on StateDirectory close

2021-04-14 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman resolved KAFKA-12667.

Resolution: Fixed

> Incorrect error log on StateDirectory close
> ---
>
> Key: KAFKA-12667
> URL: https://issues.apache.org/jira/browse/KAFKA-12667
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.7.0, 2.6.1, 2.8.0
>Reporter: Jan Justesen
>Assignee: A. Sophie Blee-Goldman
>Priority: Major
> Fix For: 3.0.0, 2.6.3, 2.7.2, 2.8.1
>
>
> {{In StateDirectory.close() an error is logged about unclean shutdown if all 
> locks are in fact released, and nothing is logged in case of an unclean 
> shutdown.}}
>  
> {code:java}
> // all threads should be stopped and cleaned up by now, so none should remain 
> holding a lock
> if (locks.isEmpty()) {
>  log.error("Some task directories still locked while closing state, this 
> indicates unclean shutdown: {}", locks);
> }
> {code}
>  
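For reference, the presumably-intended condition, as a standalone sketch 
(an illustrative helper, not a patch of the actual StateDirectory class): the 
error should be logged when locks are still held, i.e. when the collection is 
NOT empty.

{code:java}
import java.util.Collection;
import org.slf4j.Logger;

class UncleanShutdownCheckSketch {
    static void logIfUncleanShutdown(final Collection<?> locks, final Logger log) {
        // all threads should be stopped and cleaned up by now, so none should remain holding a lock
        if (!locks.isEmpty()) {
            log.error("Some task directories still locked while closing state, "
                    + "this indicates unclean shutdown: {}", locks);
        }
    }
}
{code}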



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-12667) Incorrect error log on StateDirectory close

2021-04-14 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman updated KAFKA-12667:
---
Fix Version/s: 2.6.3

> Incorrect error log on StateDirectory close
> ---
>
> Key: KAFKA-12667
> URL: https://issues.apache.org/jira/browse/KAFKA-12667
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.7.0, 2.6.1, 2.8.0
>Reporter: Jan Justesen
>Assignee: A. Sophie Blee-Goldman
>Priority: Major
> Fix For: 3.0.0, 2.6.3, 2.7.2, 2.8.1
>
>
> {{In StateDirectory.close() an error is logged about unclean shutdown if all 
> locks are in fact released, and nothing is logged in case of an unclean 
> shutdown.}}
>  
> {code:java}
> // all threads should be stopped and cleaned up by now, so none should remain 
> holding a lock
> if (locks.isEmpty()) {
>  log.error("Some task directories still locked while closing state, this 
> indicates unclean shutdown: {}", locks);
> }
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-12667) Incorrect error log on StateDirectory close

2021-04-14 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman updated KAFKA-12667:
---
Affects Version/s: 2.8.0

> Incorrect error log on StateDirectory close
> ---
>
> Key: KAFKA-12667
> URL: https://issues.apache.org/jira/browse/KAFKA-12667
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.7.0, 2.6.1, 2.8.0
>Reporter: Jan Justesen
>Priority: Major
> Fix For: 3.0.0, 2.7.2, 2.8.1
>
>
> {{In StateDirectory.close() an error is logged about unclean shutdown if all 
> locks are in fact released, and nothing is logged in case of an unclean 
> shutdown.}}
>  
> {code:java}
> // all threads should be stopped and cleaned up by now, so none should remain 
> holding a lock
> if (locks.isEmpty()) {
>  log.error("Some task directories still locked while closing state, this 
> indicates unclean shutdown: {}", locks);
> }
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-12667) Incorrect error log on StateDirectory close

2021-04-14 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17321201#comment-17321201
 ] 

A. Sophie Blee-Goldman commented on KAFKA-12667:


Thanks [~antrenta]. As Bruno mentioned, we did discover this bug and fixed it 
in trunk, but the fix unfortunately missed the 2.6.2, 2.7.1, and 2.8.0 releases, 
which are all already in the release process.

I'm happy to backport the fix to the 2.8 branch once 2.8.0 is finally out the 
door, and maybe to the 2.7 branch as well. It seems unlikely that there will be 
a 2.6.3 release however.

Sorry for the trouble, I know it's a concerning message to see.

> Incorrect error log on StateDirectory close
> ---
>
> Key: KAFKA-12667
> URL: https://issues.apache.org/jira/browse/KAFKA-12667
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.7.0, 2.6.1
>Reporter: Jan Justesen
>Priority: Major
> Fix For: 2.6.3, 2.7.2
>
>
> {{In StateDirectory.close() an error is logged about unclean shutdown if all 
> locks are in fact released, and nothing is logged in case of an unclean 
> shutdown.}}
>  
> {code:java}
> // all threads should be stopped and cleaned up by now, so none should remain 
> holding a lock
> if (locks.isEmpty()) {
>  log.error("Some task directories still locked while closing state, this 
> indicates unclean shutdown: {}", locks);
> }
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (KAFKA-12667) Incorrect error log on StateDirectory close

2021-04-14 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman reassigned KAFKA-12667:
--

Assignee: A. Sophie Blee-Goldman

> Incorrect error log on StateDirectory close
> ---
>
> Key: KAFKA-12667
> URL: https://issues.apache.org/jira/browse/KAFKA-12667
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.7.0, 2.6.1, 2.8.0
>Reporter: Jan Justesen
>Assignee: A. Sophie Blee-Goldman
>Priority: Major
> Fix For: 3.0.0, 2.7.2, 2.8.1
>
>
> {{In StateDirectory.close() an error is logged about unclean shutdown if all 
> locks are in fact released, and nothing is logged in case of an unclean 
> shutdown.}}
>  
> {code:java}
> // all threads should be stopped and cleaned up by now, so none should remain 
> holding a lock
> if (locks.isEmpty()) {
>  log.error("Some task directories still locked while closing state, this 
> indicates unclean shutdown: {}", locks);
> }
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-12667) Incorrect error log on StateDirectory close

2021-04-14 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman updated KAFKA-12667:
---
Fix Version/s: (was: 2.6.3)
   2.8.1
   3.0.0

> Incorrect error log on StateDirectory close
> ---
>
> Key: KAFKA-12667
> URL: https://issues.apache.org/jira/browse/KAFKA-12667
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Affects Versions: 2.7.0, 2.6.1
>Reporter: Jan Justesen
>Priority: Major
> Fix For: 3.0.0, 2.7.2, 2.8.1
>
>
> {{In StateDirectory.close() an error is logged about unclean shutdown if all 
> locks are in fact released, and nothing is logged in case of an unclean 
> shutdown.}}
>  
> {code:java}
> // all threads should be stopped and cleaned up by now, so none should remain 
> holding a lock
> if (locks.isEmpty()) {
>  log.error("Some task directories still locked while closing state, this 
> indicates unclean shutdown: {}", locks);
> }
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9295) KTableKTableForeignKeyInnerJoinMultiIntegrationTest#shouldInnerJoinMultiPartitionQueryable

2021-04-13 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17320699#comment-17320699
 ] 

A. Sophie Blee-Goldman commented on KAFKA-9295:
---

[~cadonna] Hm...GitHub says the commit that kicked off that build was merged 10 
hours ago, and the PR to improve this test was also merged 10 hours ago. But 
based on the line numbers in the stack trace, I think that build did not 
contain the fix -- line 187 in the test used to be verifyKTableKTableJoin(), 
but now is just an empty line. So let's leave the ticket resolved but keep an 
eye out for further failures. 

Still, thanks for reporting this! Fingers crossed that this fix will be 
sufficient and we won't have to dig any further or mess with the session 
interval.

> KTableKTableForeignKeyInnerJoinMultiIntegrationTest#shouldInnerJoinMultiPartitionQueryable
> --
>
> Key: KAFKA-9295
> URL: https://issues.apache.org/jira/browse/KAFKA-9295
> Project: Kafka
>  Issue Type: Bug
>  Components: streams, unit tests
>Affects Versions: 2.4.0, 2.6.0
>Reporter: Matthias J. Sax
>Assignee: Luke Chen
>Priority: Critical
>  Labels: flaky-test
> Fix For: 3.0.0
>
>
> [https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/27106/testReport/junit/org.apache.kafka.streams.integration/KTableKTableForeignKeyInnerJoinMultiIntegrationTest/shouldInnerJoinMultiPartitionQueryable/]
> {quote}java.lang.AssertionError: Did not receive all 1 records from topic 
> output- within 6 ms Expected: is a value equal to or greater than <1> 
> but: <0> was less than <1> at 
> org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18) at 
> org.apache.kafka.streams.integration.utils.IntegrationTestUtils.lambda$waitUntilMinKeyValueRecordsReceived$1(IntegrationTestUtils.java:515)
>  at 
> org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:417)
>  at 
> org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:385)
>  at 
> org.apache.kafka.streams.integration.utils.IntegrationTestUtils.waitUntilMinKeyValueRecordsReceived(IntegrationTestUtils.java:511)
>  at 
> org.apache.kafka.streams.integration.utils.IntegrationTestUtils.waitUntilMinKeyValueRecordsReceived(IntegrationTestUtils.java:489)
>  at 
> org.apache.kafka.streams.integration.KTableKTableForeignKeyInnerJoinMultiIntegrationTest.verifyKTableKTableJoin(KTableKTableForeignKeyInnerJoinMultiIntegrationTest.java:200)
>  at 
> org.apache.kafka.streams.integration.KTableKTableForeignKeyInnerJoinMultiIntegrationTest.shouldInnerJoinMultiPartitionQueryable(KTableKTableForeignKeyInnerJoinMultiIntegrationTest.java:183){quote}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-7499) Extend ProductionExceptionHandler to cover serialization exceptions

2021-04-13 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-7499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman updated KAFKA-7499:
--
Labels: beginner kip newbie newbie++  (was: beginner kip newbie)

> Extend ProductionExceptionHandler to cover serialization exceptions
> ---
>
> Key: KAFKA-7499
> URL: https://issues.apache.org/jira/browse/KAFKA-7499
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: Matthias J. Sax
>Priority: Major
>  Labels: beginner, kip, newbie, newbie++
>
> In 
> [KIP-210|https://cwiki.apache.org/confluence/display/KAFKA/KIP-210+-+Provide+for+custom+error+handling++when+Kafka+Streams+fails+to+produce],
>  an exception handler for the write path was introduced. This exception 
> handler covers exceptions that are raised in the producer callback.
> However, serialization happens within Kafka Streams itself, before the data is 
> handed to the producer, and the producer uses `byte[]/byte[]` key-value-pair 
> types.
> Thus, we might want to extend the ProductionExceptionHandler to cover 
> serialization exceptions, too, to skip over corrupted output messages. An 
> example could be a "String" message that contains invalid JSON and should be 
> serialized as JSON.
> KIP-399 (not voted yet; feel free to pick it up): 
> [https://cwiki.apache.org/confluence/display/KAFKA/KIP-399%3A+Extend+ProductionExceptionHandler+to+cover+serialization+exceptions]
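
For reference, a minimal sketch of a custom handler against the existing KIP-210 
interface (assuming the current o.a.k.streams.errors.ProductionExceptionHandler 
signature; note it is only invoked for send/callback errors today, which is 
exactly the gap this ticket describes -- serialization failures never reach it):

{code:java}
import java.util.Map;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.errors.RecordTooLargeException;
import org.apache.kafka.streams.errors.ProductionExceptionHandler;

public class SkipRecordTooLargeHandler implements ProductionExceptionHandler {

    @Override
    public ProductionExceptionHandlerResponse handle(final ProducerRecord<byte[], byte[]> record,
                                                     final Exception exception) {
        // Drop records that the broker rejects as too large; fail on everything else
        if (exception instanceof RecordTooLargeException) {
            return ProductionExceptionHandlerResponse.CONTINUE;
        }
        return ProductionExceptionHandlerResponse.FAIL;
    }

    @Override
    public void configure(final Map<String, ?> configs) {
        // nothing to configure in this sketch
    }
}
{code}

Such a handler is registered via the default.production.exception.handler config; 
covering serialization errors (per KIP-399) would require a new callback, since 
serialization happens before the record ever reaches the producer.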



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (KAFKA-12574) Deprecate eos-alpha

2021-04-13 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman reassigned KAFKA-12574:
--

Assignee: A. Sophie Blee-Goldman

> Deprecate eos-alpha
> ---
>
> Key: KAFKA-12574
> URL: https://issues.apache.org/jira/browse/KAFKA-12574
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: A. Sophie Blee-Goldman
>Assignee: A. Sophie Blee-Goldman
>Priority: Blocker
>  Labels: needs-kip
> Fix For: 3.0.0
>
>
> In KIP-447 we introduced a new thread-producer which is capable of 
> exactly-once semantics across multiple tasks. The new mode of EOS, called 
> eos-beta, is intended to eventually be the preferred processing mode for EOS 
> as it improves the performance and scaling of partitions/tasks. The only 
> downside is that it requires brokers to be on version 2.5+ in order to 
> understand the latest APIs that are necessary for this thread-producer.
> We should consider deprecating the eos-alpha config, ie 
> StreamsConfig.EXACTLY_ONCE, to encourage new & existing EOS users to migrate 
> to the new-and-improved processing mode, and upgrade their brokers if 
> necessary.
> Eventually we would like to be able to remove the eos-alpha code paths from 
> Streams as this will help to simplify the logic and reduce the processing 
> mode branching. But since this will break client-broker compatibility, and 
> 2.5 is still a relatively recent version, we probably can't actually remove 
> eos-alpha in the near future
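
As a concrete illustration of the migration path, a sketch assuming the config 
names available since 2.6, where eos-beta is exposed as the "exactly_once_beta" 
processing guarantee (application id and bootstrap servers below are placeholders):

{code:java}
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class EosBetaConfigExample {
    public static void main(final String[] args) {
        final Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-eos-app");         // placeholder app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder brokers

        // eos-alpha, the config this ticket proposes to deprecate:
        // props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);

        // eos-beta (KIP-447), which requires brokers on 2.5+:
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_BETA);
    }
}
{code}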



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-9295) KTableKTableForeignKeyInnerJoinMultiIntegrationTest#shouldInnerJoinMultiPartitionQueryable

2021-04-13 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-9295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman updated KAFKA-9295:
--
Fix Version/s: 3.0.0

> KTableKTableForeignKeyInnerJoinMultiIntegrationTest#shouldInnerJoinMultiPartitionQueryable
> --
>
> Key: KAFKA-9295
> URL: https://issues.apache.org/jira/browse/KAFKA-9295
> Project: Kafka
>  Issue Type: Bug
>  Components: streams, unit tests
>Affects Versions: 2.4.0, 2.6.0
>Reporter: Matthias J. Sax
>Assignee: Luke Chen
>Priority: Critical
>  Labels: flaky-test
> Fix For: 3.0.0
>
>
> [https://builds.apache.org/job/kafka-pr-jdk8-scala2.11/27106/testReport/junit/org.apache.kafka.streams.integration/KTableKTableForeignKeyInnerJoinMultiIntegrationTest/shouldInnerJoinMultiPartitionQueryable/]
> {quote}java.lang.AssertionError: Did not receive all 1 records from topic 
> output- within 6 ms Expected: is a value equal to or greater than <1> 
> but: <0> was less than <1> at 
> org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18) at 
> org.apache.kafka.streams.integration.utils.IntegrationTestUtils.lambda$waitUntilMinKeyValueRecordsReceived$1(IntegrationTestUtils.java:515)
>  at 
> org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:417)
>  at 
> org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:385)
>  at 
> org.apache.kafka.streams.integration.utils.IntegrationTestUtils.waitUntilMinKeyValueRecordsReceived(IntegrationTestUtils.java:511)
>  at 
> org.apache.kafka.streams.integration.utils.IntegrationTestUtils.waitUntilMinKeyValueRecordsReceived(IntegrationTestUtils.java:489)
>  at 
> org.apache.kafka.streams.integration.KTableKTableForeignKeyInnerJoinMultiIntegrationTest.verifyKTableKTableJoin(KTableKTableForeignKeyInnerJoinMultiIntegrationTest.java:200)
>  at 
> org.apache.kafka.streams.integration.KTableKTableForeignKeyInnerJoinMultiIntegrationTest.shouldInnerJoinMultiPartitionQueryable(KTableKTableForeignKeyInnerJoinMultiIntegrationTest.java:183){quote}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-12637) Remove deprecated PartitionAssignor interface

2021-04-12 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman resolved KAFKA-12637.

Resolution: Fixed

> Remove deprecated PartitionAssignor interface
> -
>
> Key: KAFKA-12637
> URL: https://issues.apache.org/jira/browse/KAFKA-12637
> Project: Kafka
>  Issue Type: Improvement
>  Components: consumer
>Reporter: A. Sophie Blee-Goldman
>Assignee: dengziming
>Priority: Blocker
>  Labels: newbie, newbie++
> Fix For: 3.0.0
>
>
> In KIP-429, we deprecated the existing PartitionAssignor interface in order 
> to move it out of the internals package and better align the name with other 
> pluggable Consumer interfaces. We added an adapter to convert from existing 
> o.a.k.clients.consumer.internals.PartitionAssignor to the new 
> o.a.k.clients.consumer.ConsumerPartitionAssignor and support the deprecated 
> interface. This was deprecated in 2.4, so we should be ok to remove it and 
> the PartitionAssignorAdaptor in 3.0
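
For anyone picking this up, the replacement interface already lives in the public 
package. A minimal sketch of a custom assignor built on 
o.a.k.clients.consumer.ConsumerPartitionAssignor is shown below; the round-robin 
assignment logic is illustrative only, not a recommended strategy:

{code:java}
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import org.apache.kafka.clients.consumer.ConsumerPartitionAssignor;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.TopicPartition;

public class SimpleRoundRobinAssignor implements ConsumerPartitionAssignor {

    @Override
    public String name() {
        return "simple-round-robin";  // protocol name reported to the group coordinator
    }

    @Override
    public GroupAssignment assign(final Cluster metadata, final GroupSubscription groupSubscription) {
        final Map<String, Subscription> subscriptions = groupSubscription.groupSubscription();
        final List<String> members = new ArrayList<>(subscriptions.keySet());

        final Map<String, List<TopicPartition>> assigned = new HashMap<>();
        members.forEach(member -> assigned.put(member, new ArrayList<>()));

        // Collect every topic any member subscribed to (ignores per-member differences for brevity)
        final Set<String> topics = new TreeSet<>();
        subscriptions.values().forEach(subscription -> topics.addAll(subscription.topics()));

        // Hand out partitions round-robin across the members
        int next = 0;
        for (final String topic : topics) {
            final Integer partitionCount = metadata.partitionCountForTopic(topic);
            if (partitionCount == null) {
                continue;
            }
            for (int partition = 0; partition < partitionCount; partition++) {
                assigned.get(members.get(next++ % members.size())).add(new TopicPartition(topic, partition));
            }
        }

        final Map<String, Assignment> assignments = new HashMap<>();
        assigned.forEach((member, partitions) -> assignments.put(member, new Assignment(partitions)));
        return new GroupAssignment(assignments);
    }
}
{code}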



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (KAFKA-8391) Flaky Test RebalanceSourceConnectorsIntegrationTest#testDeleteConnector

2021-04-12 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman reopened KAFKA-8391:
---

[~rhauch] [~kkonstantine] this test failed again -- based on the error message 
it looks like it may be a "real" failure this time, not environmental.

Stacktrace
java.lang.AssertionError: Tasks are imbalanced: 
localhost:35163=[seq-source11-0, seq-source11-3, seq-source10-1, seq-source12-1]
localhost:36961=[seq-source11-1, seq-source10-2, seq-source12-2]
localhost:39023=[seq-source11-2, seq-source10-0, seq-source10-3, 
seq-source12-0, seq-source12-3]
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.assertTrue(Assert.java:42)
at 
org.apache.kafka.connect.integration.RebalanceSourceConnectorsIntegrationTest.assertConnectorAndTasksAreUniqueAndBalanced(RebalanceSourceConnectorsIntegrationTest.java:365)
at 
org.apache.kafka.test.TestUtils.lambda$waitForCondition$3(TestUtils.java:319)
at 
org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:367)
at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:316)
at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:300)
at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:290)
at 
org.apache.kafka.connect.integration.RebalanceSourceConnectorsIntegrationTest.testDeleteConnector(RebalanceSourceConnectorsIntegrationTest.java:213)


https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-10512/5/testReport/org.apache.kafka.connect.integration/RebalanceSourceConnectorsIntegrationTest/Build___JDK_15_and_Scala_2_13___testDeleteConnector/

> Flaky Test RebalanceSourceConnectorsIntegrationTest#testDeleteConnector
> ---
>
> Key: KAFKA-8391
> URL: https://issues.apache.org/jira/browse/KAFKA-8391
> Project: Kafka
>  Issue Type: Bug
>  Components: KafkaConnect
>Affects Versions: 2.3.0
>Reporter: Matthias J. Sax
>Assignee: Randall Hauch
>Priority: Critical
>  Labels: flaky-test
> Fix For: 2.3.2, 2.6.0, 2.4.2, 2.5.1
>
> Attachments: 100-gradle-builds.tar
>
>
> [https://builds.apache.org/job/kafka-pr-jdk11-scala2.12/4747/testReport/junit/org.apache.kafka.connect.integration/RebalanceSourceConnectorsIntegrationTest/testDeleteConnector/]
> {quote}java.lang.AssertionError: Condition not met within timeout 3. 
> Connector tasks did not stop in time. at 
> org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:375) at 
> org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:352) at 
> org.apache.kafka.connect.integration.RebalanceSourceConnectorsIntegrationTest.testDeleteConnector(RebalanceSourceConnectorsIntegrationTest.java:166){quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-9013) Flaky Test MirrorConnectorsIntegrationTest#testReplication

2021-04-12 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-9013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17319797#comment-17319797
 ] 

A. Sophie Blee-Goldman commented on KAFKA-9013:
---

Failed again: 
https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-10512/5/testReport/org.apache.kafka.connect.mirror.integration/MirrorConnectorsIntegrationTest/Build___JDK_11_and_Scala_2_13___testReplication__/

> Flaky Test MirrorConnectorsIntegrationTest#testReplication
> --
>
> Key: KAFKA-9013
> URL: https://issues.apache.org/jira/browse/KAFKA-9013
> Project: Kafka
>  Issue Type: Bug
>  Components: KafkaConnect
>Reporter: Bruno Cadonna
>Priority: Major
>  Labels: flaky-test
>
> h1. Stacktrace:
> {code:java}
> java.lang.AssertionError: Condition not met within timeout 2. Offsets not 
> translated downstream to primary cluster.
>   at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:377)
>   at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:354)
>   at 
> org.apache.kafka.connect.mirror.MirrorConnectorsIntegrationTest.testReplication(MirrorConnectorsIntegrationTest.java:239)
> {code}
> h1. Standard Error
> {code}
> Standard Error
> Oct 09, 2019 11:32:00 PM org.glassfish.jersey.internal.inject.Providers 
> checkProviderRuntime
> WARNING: A provider 
> org.apache.kafka.connect.runtime.rest.resources.ConnectorPluginsResource 
> registered in SERVER runtime does not implement any provider interfaces 
> applicable in the SERVER runtime. Due to constraint configuration problems 
> the provider 
> org.apache.kafka.connect.runtime.rest.resources.ConnectorPluginsResource will 
> be ignored. 
> Oct 09, 2019 11:32:00 PM org.glassfish.jersey.internal.inject.Providers 
> checkProviderRuntime
> WARNING: A provider 
> org.apache.kafka.connect.runtime.rest.resources.LoggingResource registered in 
> SERVER runtime does not implement any provider interfaces applicable in the 
> SERVER runtime. Due to constraint configuration problems the provider 
> org.apache.kafka.connect.runtime.rest.resources.LoggingResource will be 
> ignored. 
> Oct 09, 2019 11:32:00 PM org.glassfish.jersey.internal.inject.Providers 
> checkProviderRuntime
> WARNING: A provider 
> org.apache.kafka.connect.runtime.rest.resources.ConnectorsResource registered 
> in SERVER runtime does not implement any provider interfaces applicable in 
> the SERVER runtime. Due to constraint configuration problems the provider 
> org.apache.kafka.connect.runtime.rest.resources.ConnectorsResource will be 
> ignored. 
> Oct 09, 2019 11:32:00 PM org.glassfish.jersey.internal.inject.Providers 
> checkProviderRuntime
> WARNING: A provider 
> org.apache.kafka.connect.runtime.rest.resources.RootResource registered in 
> SERVER runtime does not implement any provider interfaces applicable in the 
> SERVER runtime. Due to constraint configuration problems the provider 
> org.apache.kafka.connect.runtime.rest.resources.RootResource will be ignored. 
> Oct 09, 2019 11:32:01 PM org.glassfish.jersey.internal.Errors logErrors
> WARNING: The following warnings have been detected: WARNING: The 
> (sub)resource method listLoggers in 
> org.apache.kafka.connect.runtime.rest.resources.LoggingResource contains 
> empty path annotation.
> WARNING: The (sub)resource method listConnectors in 
> org.apache.kafka.connect.runtime.rest.resources.ConnectorsResource contains 
> empty path annotation.
> WARNING: The (sub)resource method createConnector in 
> org.apache.kafka.connect.runtime.rest.resources.ConnectorsResource contains 
> empty path annotation.
> WARNING: The (sub)resource method listConnectorPlugins in 
> org.apache.kafka.connect.runtime.rest.resources.ConnectorPluginsResource 
> contains empty path annotation.
> WARNING: The (sub)resource method serverInfo in 
> org.apache.kafka.connect.runtime.rest.resources.RootResource contains empty 
> path annotation.
> Oct 09, 2019 11:32:02 PM org.glassfish.jersey.internal.inject.Providers 
> checkProviderRuntime
> WARNING: A provider 
> org.apache.kafka.connect.runtime.rest.resources.ConnectorPluginsResource 
> registered in SERVER runtime does not implement any provider interfaces 
> applicable in the SERVER runtime. Due to constraint configuration problems 
> the provider 
> org.apache.kafka.connect.runtime.rest.resources.ConnectorPluginsResource will 
> be ignored. 
> Oct 09, 2019 11:32:02 PM org.glassfish.jersey.internal.inject.Providers 
> checkProviderRuntime
> WARNING: A provider 
> org.apache.kafka.connect.runtime.rest.resources.ConnectorsResource registered 
> in SERVER runtime does not implement any provider interfaces applicable in 
> the SERVER runtime. Due to constraint configuration problems the provider 
> 

[jira] [Reopened] (KAFKA-12284) Flaky Test MirrorConnectorsIntegrationSSLTest#testOneWayReplicationWithAutoOffsetSync

2021-04-12 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman reopened KAFKA-12284:

  Assignee: (was: Luke Chen)

Failed again, on both the SSL and plain versions of this test:

https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-10512/5/testReport/org.apache.kafka.connect.mirror.integration/MirrorConnectorsIntegrationSSLTest/Build___JDK_15_and_Scala_2_13___testOneWayReplicationWithAutoOffsetSync__/

https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-10512/5/testReport/org.apache.kafka.connect.mirror.integration/MirrorConnectorsIntegrationTest/Build___JDK_15_and_Scala_2_13___testOneWayReplicationWithAutoOffsetSync__/
 
java.lang.RuntimeException: java.util.concurrent.ExecutionException: 
org.apache.kafka.common.errors.TimeoutException: The request timed out.
at 
org.apache.kafka.connect.util.clusters.EmbeddedKafkaCluster.createTopic(EmbeddedKafkaCluster.java:365)
at 
org.apache.kafka.connect.util.clusters.EmbeddedKafkaCluster.createTopic(EmbeddedKafkaCluster.java:340)
at 
org.apache.kafka.connect.mirror.integration.MirrorConnectorsIntegrationBaseTest.createTopics(MirrorConnectorsIntegrationBaseTest.java:609)

> Flaky Test 
> MirrorConnectorsIntegrationSSLTest#testOneWayReplicationWithAutoOffsetSync
> -
>
> Key: KAFKA-12284
> URL: https://issues.apache.org/jira/browse/KAFKA-12284
> Project: Kafka
>  Issue Type: Test
>  Components: mirrormaker, unit tests
>Reporter: Matthias J. Sax
>Priority: Critical
>  Labels: flaky-test
> Fix For: 3.0.0
>
>
> [https://github.com/apache/kafka/pull/9997/checks?check_run_id=1820178470]
> {quote} {{java.lang.RuntimeException: 
> java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.TopicExistsException: Topic 
> 'primary.test-topic-2' already exists.
>   at 
> org.apache.kafka.connect.util.clusters.EmbeddedKafkaCluster.createTopic(EmbeddedKafkaCluster.java:366)
>   at 
> org.apache.kafka.connect.util.clusters.EmbeddedKafkaCluster.createTopic(EmbeddedKafkaCluster.java:341)
>   at 
> org.apache.kafka.connect.mirror.integration.MirrorConnectorsIntegrationBaseTest.testOneWayReplicationWithAutoOffsetSync(MirrorConnectorsIntegrationBaseTest.java:419)}}
> [...]
>  
> {{Caused by: java.util.concurrent.ExecutionException: 
> org.apache.kafka.common.errors.TopicExistsException: Topic 
> 'primary.test-topic-2' already exists.
>   at 
> org.apache.kafka.common.internals.KafkaFutureImpl.wrapAndThrow(KafkaFutureImpl.java:45)
>   at 
> org.apache.kafka.common.internals.KafkaFutureImpl.access$000(KafkaFutureImpl.java:32)
>   at 
> org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:89)
>   at 
> org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:260)
>   at 
> org.apache.kafka.connect.util.clusters.EmbeddedKafkaCluster.createTopic(EmbeddedKafkaCluster.java:364)
>   ... 92 more
> Caused by: org.apache.kafka.common.errors.TopicExistsException: Topic 
> 'primary.test-topic-2' already exists.}}
> {quote}
> STDOUT
> {quote} {{[2021-02-03 04:19:15,975] ERROR [MirrorHeartbeatConnector|task-0] 
> WorkerSourceTask\{id=MirrorHeartbeatConnector-0} failed to send record to 
> heartbeats:  (org.apache.kafka.connect.runtime.WorkerSourceTask:354)
> org.apache.kafka.common.KafkaException: Producer is closed forcefully.
>   at 
> org.apache.kafka.clients.producer.internals.RecordAccumulator.abortBatches(RecordAccumulator.java:750)
>   at 
> org.apache.kafka.clients.producer.internals.RecordAccumulator.abortIncompleteBatches(RecordAccumulator.java:737)
>   at 
> org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:282)
>   at java.lang.Thread.run(Thread.java:748)}}{quote}
> {quote} {{[2021-02-03 04:19:36,767] ERROR Could not check connector state 
> info. 
> (org.apache.kafka.connect.util.clusters.EmbeddedConnectClusterAssertions:420)
> org.apache.kafka.connect.runtime.rest.errors.ConnectRestException: Could not 
> read connector state. Error response: \{"error_code":404,"message":"No status 
> found for connector MirrorSourceConnector"}
>   at 
> org.apache.kafka.connect.util.clusters.EmbeddedConnectCluster.connectorStatus(EmbeddedConnectCluster.java:466)
>   at 
> org.apache.kafka.connect.util.clusters.EmbeddedConnectClusterAssertions.checkConnectorState(EmbeddedConnectClusterAssertions.java:413)
>   at 
> org.apache.kafka.connect.util.clusters.EmbeddedConnectClusterAssertions.lambda$assertConnectorAndAtLeastNumTasksAreRunning$16(EmbeddedConnectClusterAssertions.java:286)
>   at 
> org.apache.kafka.test.TestUtils.lambda$waitForCondition$3(TestUtils.java:303)
>   at 
> 

[jira] [Commented] (KAFKA-12629) Flaky Test RaftClusterTest

2021-04-12 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17319794#comment-17319794
 ] 

A. Sophie Blee-Goldman commented on KAFKA-12629:


I'm seeing at least one failure in RaftClusterTest on almost every PR build 
lately.

Looks like it's typically the same TimeoutException mentioned above. Seen on 
both testCreateClusterAndCreateAndManyTopics() and 
testCreateClusterAndCreateAndManyTopicsWithManyPartitions()


> Flaky Test RaftClusterTest
> --
>
> Key: KAFKA-12629
> URL: https://issues.apache.org/jira/browse/KAFKA-12629
> Project: Kafka
>  Issue Type: Test
>  Components: core, unit tests
>Reporter: Matthias J. Sax
>Priority: Critical
>  Labels: flaky-test
>
> {quote} {{java.util.concurrent.ExecutionException: 
> java.lang.ClassNotFoundException: 
> org.apache.kafka.controller.NoOpSnapshotWriterBuilder
>   at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
>   at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
>   at 
> kafka.testkit.KafkaClusterTestKit.startup(KafkaClusterTestKit.java:364)
>   at 
> kafka.server.RaftClusterTest.testCreateClusterAndCreateAndManyTopicsWithManyPartitions(RaftClusterTest.scala:181)}}{quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-12654) separate checkAllSubscriptionEqual and getConsumerToOwnedPartitions methods

2021-04-12 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17319675#comment-17319675
 ] 

A. Sophie Blee-Goldman commented on KAFKA-12654:


To be honest, the sticky assignment algorithm for the general case is so slow 
that the additional time spent populating this map is probably negligible. That 
said, the general algorithm ends up populating exactly the same map in 
#prepopulateCurrentAssignments, so we could just pass it along to the 
generalAssign as well. But it also has to populate an additional map 
(prevAssignment), so it would still need to loop over and deserialize this info 
again anyway.

> separate checkAllSubscriptionEqual and getConsumerToOwnedPartitions methods
> ---
>
> Key: KAFKA-12654
> URL: https://issues.apache.org/jira/browse/KAFKA-12654
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Luke Chen
>Assignee: Luke Chen
>Priority: Major
>
> Currently, when entering sticky assignor, we'll check if all consumers have 
> the same subscription to decide which assignor to use (constrained assignor 
> or general assignor). While checking the subscription, we also deserialize 
> the subscription user data to get the partitions owned by the consumer 
> (consumerToOwnedPartitions). However, the consumerToOwnedPartitions info is 
> not used by the general assignor (which deserializes it again internally), so 
> we don't need to deserialize the data up front for the general assignor. 
> We should separate these 2 things into 2 methods, and only deserialize the 
> user data for constrained assignor, to improve the general sticky assignor 
> performance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-12650) NPE in InternalTopicManager#cleanUpCreatedTopics

2021-04-09 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17318404#comment-17318404
 ] 

A. Sophie Blee-Goldman commented on KAFKA-12650:


This was from 
InternalTopicManagerTest.shouldOnlyRetryNotSuccessfulFuturesDuringSetup() btw.

https://ci-builds.apache.org/job/Kafka/job/kafka-pr/job/PR-10411/12/testReport/junit/org.apache.kafka.streams.processor.internals/InternalTopicManagerTest/shouldOnlyRetryNotSuccessfulFuturesDuringSetup/

> NPE in InternalTopicManager#cleanUpCreatedTopics
> 
>
> Key: KAFKA-12650
> URL: https://issues.apache.org/jira/browse/KAFKA-12650
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: A. Sophie Blee-Goldman
>Priority: Blocker
> Fix For: 3.0.0
>
>
> {code:java}
> java.lang.NullPointerException
>   at 
> org.apache.kafka.streams.processor.internals.InternalTopicManager.cleanUpCreatedTopics(InternalTopicManager.java:675)
>   at 
> org.apache.kafka.streams.processor.internals.InternalTopicManager.maybeThrowTimeoutExceptionDuringSetup(InternalTopicManager.java:755)
>   at 
> org.apache.kafka.streams.processor.internals.InternalTopicManager.processCreateTopicResults(InternalTopicManager.java:652)
>   at 
> org.apache.kafka.streams.processor.internals.InternalTopicManager.setup(InternalTopicManager.java:599)
>   at 
> org.apache.kafka.streams.processor.internals.InternalTopicManagerTest.shouldOnlyRetryNotSuccessfulFuturesDuringSetup(InternalTopicManagerTest.java:180)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-12650) NPE in InternalTopicManager#cleanUpCreatedTopics

2021-04-09 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman updated KAFKA-12650:
---
Fix Version/s: 3.0.0

> NPE in InternalTopicManager#cleanUpCreatedTopics
> 
>
> Key: KAFKA-12650
> URL: https://issues.apache.org/jira/browse/KAFKA-12650
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: A. Sophie Blee-Goldman
>Priority: Major
> Fix For: 3.0.0
>
>
> {code:java}
> java.lang.NullPointerException
>   at 
> org.apache.kafka.streams.processor.internals.InternalTopicManager.cleanUpCreatedTopics(InternalTopicManager.java:675)
>   at 
> org.apache.kafka.streams.processor.internals.InternalTopicManager.maybeThrowTimeoutExceptionDuringSetup(InternalTopicManager.java:755)
>   at 
> org.apache.kafka.streams.processor.internals.InternalTopicManager.processCreateTopicResults(InternalTopicManager.java:652)
>   at 
> org.apache.kafka.streams.processor.internals.InternalTopicManager.setup(InternalTopicManager.java:599)
>   at 
> org.apache.kafka.streams.processor.internals.InternalTopicManagerTest.shouldOnlyRetryNotSuccessfulFuturesDuringSetup(InternalTopicManagerTest.java:180)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-12650) NPE in InternalTopicManager#cleanUpCreatedTopics

2021-04-09 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman updated KAFKA-12650:
---
Priority: Blocker  (was: Major)

> NPE in InternalTopicManager#cleanUpCreatedTopics
> 
>
> Key: KAFKA-12650
> URL: https://issues.apache.org/jira/browse/KAFKA-12650
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: A. Sophie Blee-Goldman
>Priority: Blocker
> Fix For: 3.0.0
>
>
> {code:java}
> java.lang.NullPointerException
>   at 
> org.apache.kafka.streams.processor.internals.InternalTopicManager.cleanUpCreatedTopics(InternalTopicManager.java:675)
>   at 
> org.apache.kafka.streams.processor.internals.InternalTopicManager.maybeThrowTimeoutExceptionDuringSetup(InternalTopicManager.java:755)
>   at 
> org.apache.kafka.streams.processor.internals.InternalTopicManager.processCreateTopicResults(InternalTopicManager.java:652)
>   at 
> org.apache.kafka.streams.processor.internals.InternalTopicManager.setup(InternalTopicManager.java:599)
>   at 
> org.apache.kafka.streams.processor.internals.InternalTopicManagerTest.shouldOnlyRetryNotSuccessfulFuturesDuringSetup(InternalTopicManagerTest.java:180)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-12650) NPE in InternalTopicManager#cleanUpCreatedTopics

2021-04-09 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17318403#comment-17318403
 ] 

A. Sophie Blee-Goldman commented on KAFKA-12650:


cc [~cadonna] can you take a look at this? 

> NPE in InternalTopicManager#cleanUpCreatedTopics
> 
>
> Key: KAFKA-12650
> URL: https://issues.apache.org/jira/browse/KAFKA-12650
> Project: Kafka
>  Issue Type: Bug
>  Components: streams
>Reporter: A. Sophie Blee-Goldman
>Priority: Major
>
> {code:java}
> java.lang.NullPointerException
>   at 
> org.apache.kafka.streams.processor.internals.InternalTopicManager.cleanUpCreatedTopics(InternalTopicManager.java:675)
>   at 
> org.apache.kafka.streams.processor.internals.InternalTopicManager.maybeThrowTimeoutExceptionDuringSetup(InternalTopicManager.java:755)
>   at 
> org.apache.kafka.streams.processor.internals.InternalTopicManager.processCreateTopicResults(InternalTopicManager.java:652)
>   at 
> org.apache.kafka.streams.processor.internals.InternalTopicManager.setup(InternalTopicManager.java:599)
>   at 
> org.apache.kafka.streams.processor.internals.InternalTopicManagerTest.shouldOnlyRetryNotSuccessfulFuturesDuringSetup(InternalTopicManagerTest.java:180)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-12650) NPE in InternalTopicManager#cleanUpCreatedTopics

2021-04-09 Thread A. Sophie Blee-Goldman (Jira)
A. Sophie Blee-Goldman created KAFKA-12650:
--

 Summary: NPE in InternalTopicManager#cleanUpCreatedTopics
 Key: KAFKA-12650
 URL: https://issues.apache.org/jira/browse/KAFKA-12650
 Project: Kafka
  Issue Type: Bug
  Components: streams
Reporter: A. Sophie Blee-Goldman


{code:java}
java.lang.NullPointerException
at 
org.apache.kafka.streams.processor.internals.InternalTopicManager.cleanUpCreatedTopics(InternalTopicManager.java:675)
at 
org.apache.kafka.streams.processor.internals.InternalTopicManager.maybeThrowTimeoutExceptionDuringSetup(InternalTopicManager.java:755)
at 
org.apache.kafka.streams.processor.internals.InternalTopicManager.processCreateTopicResults(InternalTopicManager.java:652)
at 
org.apache.kafka.streams.processor.internals.InternalTopicManager.setup(InternalTopicManager.java:599)
at 
org.apache.kafka.streams.processor.internals.InternalTopicManagerTest.shouldOnlyRetryNotSuccessfulFuturesDuringSetup(InternalTopicManagerTest.java:180)
{code}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-12649) Expose cache resizer for dynamic memory allocation

2021-04-09 Thread A. Sophie Blee-Goldman (Jira)
A. Sophie Blee-Goldman created KAFKA-12649:
--

 Summary: Expose cache resizer for dynamic memory allocation
 Key: KAFKA-12649
 URL: https://issues.apache.org/jira/browse/KAFKA-12649
 Project: Kafka
  Issue Type: New Feature
  Components: streams
Reporter: A. Sophie Blee-Goldman


When we added the add/removeStreamThread() APIs to Streams, we implemented a 
cache resizer to adjust the allocated cache memory per thread accordingly. We 
could expose that to users as a public API to let them dynamically 
increase/decrease the amount of memory for the Streams cache (ie 
cache.max.bytes.buffering).
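
If this were exposed, usage might look something like the sketch below. The 
resizeCache method is purely hypothetical -- no such public API exists today, 
only the static cache.max.bytes.buffering bound does -- and the topics, app id, 
and sizes are placeholders:

{code:java}
import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class DynamicCacheExample {
    public static void main(final String[] args) {
        final Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "cache-resize-demo");         // placeholder app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");          // placeholder brokers
        props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 10 * 1024 * 1024L); // initial 10 MB bound

        final StreamsBuilder builder = new StreamsBuilder();
        builder.stream("input-topic").to("output-topic");  // placeholder topics, just to have a topology

        final KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();

        // Hypothetical API proposed by this ticket -- does NOT exist on KafkaStreams today:
        // streams.resizeCache(50 * 1024 * 1024L);  // grow the total cache to 50 MB at runtime
    }
}
{code}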



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-12648) Experiment with resilient isomorphic topologies

2021-04-09 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman updated KAFKA-12648:
---
Description: 
We're not ready to make this a public feature yet, but I want to start 
experimenting with some ways to make Streams applications more resilient in the 
face of isomorphic topological changes (eg adding/removing/reordering 
subtopologies).

If this turns out to be stable and useful, we can circle back on doing a KIP to 
bring this feature into the public API

  was:We're not ready to make this a public feature yet, but I want to start 
experimenting with some ways to make Streams applications more resilient in the 
face of isomorphic topological changes (eg adding/removing/reordering 
subtopologies)


> Experiment with resilient isomorphic topologies
> ---
>
> Key: KAFKA-12648
> URL: https://issues.apache.org/jira/browse/KAFKA-12648
> Project: Kafka
>  Issue Type: New Feature
>  Components: streams
>Reporter: A. Sophie Blee-Goldman
>Assignee: A. Sophie Blee-Goldman
>Priority: Major
>
> We're not ready to make this a public feature yet, but I want to start 
> experimenting with some ways to make Streams applications more resilient in 
> the face of isomorphic topological changes (eg adding/removing/reordering 
> subtopologies).
> If this turns out to be stable and useful, we can circle back on doing a KIP 
> to bring this feature into the public API



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-12648) Experiment with resilient isomorphic topologies

2021-04-09 Thread A. Sophie Blee-Goldman (Jira)
A. Sophie Blee-Goldman created KAFKA-12648:
--

 Summary: Experiment with resilient isomorphic topologies
 Key: KAFKA-12648
 URL: https://issues.apache.org/jira/browse/KAFKA-12648
 Project: Kafka
  Issue Type: New Feature
  Components: streams
Reporter: A. Sophie Blee-Goldman
Assignee: A. Sophie Blee-Goldman


We're not ready to make this a public feature yet, but I want to start 
experimenting with some ways to make Streams applications more resilient in the 
face of isomorphic topological changes (eg adding/removing/reordering 
subtopologies)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-12638) Remove default implementation of ConsumerRebalanceListener#onPartitionsLost

2021-04-08 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317638#comment-17317638
 ] 

A. Sophie Blee-Goldman commented on KAFKA-12638:


If you're interested in this ticket, we can't do the whole thing because of the 
compatibility concerns I mentioned but feel free to pick up the first part, and 
just log a warning if the user has not implemented the #onPartitionsLost 
callback. Something like this: 
https://github.com/apache/kafka/blob/2.8/streams/src/main/java/org/apache/kafka/streams/state/RocksDBConfigSetter.java#L61
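
To make the user-facing side concrete, a rough sketch of a listener that overrides 
#onPartitionsLost so lost partitions never fall through to the commit logic in 
#onPartitionsRevoked (this assumes the current ConsumerRebalanceListener API; the 
warning log itself would live in the consumer internals, not in user code):

{code:java}
import java.util.Collection;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.common.TopicPartition;

public class SafeRebalanceListener implements ConsumerRebalanceListener {

    @Override
    public void onPartitionsRevoked(final Collection<TopicPartition> partitions) {
        // Partitions were revoked cooperatively: it is safe to commit offsets here,
        // e.g. consumer.commitSync(...) for the revoked partitions
    }

    @Override
    public void onPartitionsAssigned(final Collection<TopicPartition> partitions) {
        // Initialize any per-partition state for the newly assigned partitions
    }

    @Override
    public void onPartitionsLost(final Collection<TopicPartition> partitions) {
        // Partitions were lost (e.g. the member fell out of the group): do NOT commit,
        // since another consumer may already own them -- just clean up local state.
        // Without this override, the default falls through to onPartitionsRevoked above.
    }
}
{code}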

> Remove default implementation of ConsumerRebalanceListener#onPartitionsLost
> ---
>
> Key: KAFKA-12638
> URL: https://issues.apache.org/jira/browse/KAFKA-12638
> Project: Kafka
>  Issue Type: Improvement
>  Components: consumer
>Reporter: A. Sophie Blee-Goldman
>Priority: Major
>
> When we added the #onPartitionsLost callback to the ConsumerRebalanceListener 
> in KIP-429, we gave it a default implementation that just invoked the 
> existing #onPartitionsRevoked method for backwards compatibility. This is 
> somewhat inconvenient, since we generally want to invoke #onPartitionsLost in 
> order to skip the committing of offsets on revoked partitions, which is 
> exactly what #onPartitionsRevoked does.
> I don't think we can just remove it in 3.0 since we haven't indicated that we 
> "deprecated" the default implementation or logged a warning that we intend to 
> remove the default in a future release (as we did for the 
> RocksDBConfigSetter#close method in Streams, for example). We should try to 
> add such a warning now, so we can remove it in a future release.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-12638) Remove default implementation of ConsumerRebalanceListener#onPartitionsLost

2021-04-08 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317635#comment-17317635
 ] 

A. Sophie Blee-Goldman commented on KAFKA-12638:


Nope, no such thing as credits in Kafka -- technically anyone can pick up 
anything, and the major deciding factor is just your own confidence and 
familiarity with Kafka. If you're relatively new to Kafka and pick up something 
large that you need a lot of help with, you might struggle to get it done just 
because everyone who works on this is always busy (not because they don't want 
to help -- they do). We try to label things with newbie and/or newbie++ to 
indicate good entry-level tickets. 

That said, it never hurts to ask before picking something up -- often if it's a 
blocker ticket, or maybe critical, then it's likely that the person who 
reported it or someone they know already plans to work on it. But you can always 
ask -- and "Major" is the default priority which many tickets are just left at, 
so don't let that stop you.

> Remove default implementation of ConsumerRebalanceListener#onPartitionsLost
> ---
>
> Key: KAFKA-12638
> URL: https://issues.apache.org/jira/browse/KAFKA-12638
> Project: Kafka
>  Issue Type: Improvement
>  Components: consumer
>Reporter: A. Sophie Blee-Goldman
>Priority: Major
>
> When we added the #onPartitionsLost callback to the ConsumerRebalanceListener 
> in KIP-429, we gave it a default implementation that just invoked the 
> existing #onPartitionsRevoked method for backwards compatibility. This is 
> somewhat inconvenient, since we generally want to invoke #onPartitionsLost in 
> order to skip the committing of offsets on revoked partitions, which is 
> exactly what #onPartitionsRevoked does.
> I don't think we can just remove it in 3.0 since we haven't indicated that we 
> "deprecated" the default implementation or logged a warning that we intend to 
> remove the default in a future release (as we did for the 
> RocksDBConfigSetter#close method in Streams, for example). We should try to 
> add such a warning now, so we can remove it in a future release.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-12492) Formatting of example RocksDBConfigSetter is messed up

2021-04-08 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317629#comment-17317629
 ] 

A. Sophie Blee-Goldman commented on KAFKA-12492:


Merged, thanks for the PR! Unfortunately we just barely missed the 2.8 release, 
as John cut the RC earlier today. If you want to see this fix in the 2.8 docs 
then you'll need to submit this exact PR against the kafka-site repo as we 
discussed, but the 2.8 RC is still under vote at the moment, so you'd need to 
wait for that to be released, at which point the docs in kafka/2.8 will be 
copied over to a new 28 subdirectory in kafka-site. 

> Formatting of example RocksDBConfigSetter is messed up
> --
>
> Key: KAFKA-12492
> URL: https://issues.apache.org/jira/browse/KAFKA-12492
> Project: Kafka
>  Issue Type: Bug
>  Components: documentation, streams
>Reporter: A. Sophie Blee-Goldman
>Assignee: Ben Chen
>Priority: Trivial
>  Labels: docs, newbie
> Fix For: 3.0.0
>
>
> See the example implementation class CustomRocksDBConfig in the docs for the 
> rocksdb.config.setter
> https://kafka.apache.org/documentation/streams/developer-guide/config-streams.html#rocksdb-config-setter



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-12634) Should checkpoint after restore finished

2021-04-08 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317632#comment-17317632
 ] 

A. Sophie Blee-Goldman commented on KAFKA-12634:


I don't think bulk loading should affect the checkpoint -- the data up to an 
offset is either in the state store or it isn't. I'd be fine with just doing a 
small KIP to fix this in 3.0+ though.

> Should checkpoint after restore finished
> 
>
> Key: KAFKA-12634
> URL: https://issues.apache.org/jira/browse/KAFKA-12634
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Affects Versions: 2.5.0
>Reporter: Matthias J. Sax
>Priority: Major
>
> For state stores, Kafka Streams maintains local checkpoint files to track the 
> offsets of the state store changelog topics. The checkpoint is updated on 
> commit or when a task is closed cleanly.
> However, after a successful restore, the checkpoint is not written. Thus, if 
> an instance crashes after restore but before committing, even if the state is 
> on local disk the checkpoint file is missing (indicating that there is no 
> state) and thus state would be restored from scratch.
> While for most cases, the time between restore end and next commit is small, 
> there are cases when this time could be large, for example if there is no new 
> input data to be processed (if there is no input data, the commit would be 
> skipped).
> Thus, we should write the checkpoint file after a successful restore to close 
> this gap (of course, only for at-least-once processing).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-12492) Formatting of example RocksDBConfigSetter is messed up

2021-04-08 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman updated KAFKA-12492:
---
Fix Version/s: 3.0.0

> Formatting of example RocksDBConfigSetter is messed up
> --
>
> Key: KAFKA-12492
> URL: https://issues.apache.org/jira/browse/KAFKA-12492
> Project: Kafka
>  Issue Type: Bug
>  Components: documentation, streams
>Reporter: A. Sophie Blee-Goldman
>Assignee: Ben Chen
>Priority: Trivial
>  Labels: docs, newbie
> Fix For: 3.0.0
>
>
> See the example implementation class CustomRocksDBConfig in the docs for the 
> rocksdb.config.setter
> https://kafka.apache.org/documentation/streams/developer-guide/config-streams.html#rocksdb-config-setter



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-12638) Remove default implementation of ConsumerRebalanceListener#onPartitionsLost

2021-04-08 Thread A. Sophie Blee-Goldman (Jira)
A. Sophie Blee-Goldman created KAFKA-12638:
--

 Summary: Remove default implementation of 
ConsumerRebalanceListener#onPartitionsLost
 Key: KAFKA-12638
 URL: https://issues.apache.org/jira/browse/KAFKA-12638
 Project: Kafka
  Issue Type: Improvement
  Components: consumer
Reporter: A. Sophie Blee-Goldman


When we added the #onPartitionsLost callback to the ConsumerRebalanceListener 
in KIP-429, we gave it a default implementation that just invoked the existing 
#onPartitionsRevoked method for backwards compatibility. This is somewhat 
inconvenient, since we generally want to invoke #onPartitionsLost in order to 
skip the committing of offsets on revoked partitions, which is exactly what 
#onPartitionsRevoked does.

I don't think we can just remove it in 3.0 since we haven't indicated that we 
"deprecated" the default implementation or logged a warning that we intend to 
remove the default in a future release (as we did for the 
RocksDBConfigSetter#close method in Streams, for example). We should try to add 
such a warning now, so we can remove it in a future release.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-12637) Remove deprecated PartitionAssignor interface

2021-04-08 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman updated KAFKA-12637:
---
Description: In KIP-429, we deprecated the existing PartitionAssignor 
interface in order to move it out of the internals package and better align the 
name with other pluggable Consumer interfaces. We added an adapter to convert 
from existing o.a.k.clients.consumer.internals.PartitionAssignor to the new 
o.a.k.clients.consumer.ConsumerPartitionAssignor and support the deprecated 
interface. This was deprecated in 2.4, so we should be ok to remove it and the 
PartitionAssignorAdaptor in 3.0  (was: In KIP-429, we deprecated the existing 
PartitionAssignor interface in order to move it out of the internals package 
and better align the name with other pluggable Consumer interfaces. We added an 
adapter to convert from existing 
o.a.k.clients.consumer.internals.PartitionAssignor to the new 
o.a.k.clients.consumer.ConsumerPartitionAssignor and support the deprecated 
interface. This was deprecated in 2.4, so we should be ok to remove it and the 
adaptor in 3.0)

> Remove deprecated PartitionAssignor interface
> -
>
> Key: KAFKA-12637
> URL: https://issues.apache.org/jira/browse/KAFKA-12637
> Project: Kafka
>  Issue Type: Improvement
>  Components: consumer
>Reporter: A. Sophie Blee-Goldman
>Priority: Blocker
>  Labels: newbie, newbie++
> Fix For: 3.0.0
>
>
> In KIP-429, we deprecated the existing PartitionAssignor interface in order 
> to move it out of the internals package and better align the name with other 
> pluggable Consumer interfaces. We added an adapter to convert from existing 
> o.a.k.clients.consumer.internals.PartitionAssignor to the new 
> o.a.k.clients.consumer.ConsumerPartitionAssignor and support the deprecated 
> interface. This was deprecated in 2.4, so we should be ok to remove it and 
> the PartitionAssignorAdaptor in 3.0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-12637) Remove deprecated PartitionAssignor interface

2021-04-08 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman updated KAFKA-12637:
---
Labels: newbie newbie++  (was: )

> Remove deprecated PartitionAssignor interface
> -
>
> Key: KAFKA-12637
> URL: https://issues.apache.org/jira/browse/KAFKA-12637
> Project: Kafka
>  Issue Type: Improvement
>  Components: consumer
>Reporter: A. Sophie Blee-Goldman
>Priority: Blocker
>  Labels: newbie, newbie++
> Fix For: 3.0.0
>
>
> In KIP-429, we deprecated the existing PartitionAssignor interface in order 
> to move it out of the internals package and better align the name with other 
> pluggable Consumer interfaces. We added an adapter to convert from existing 
> o.a.k.clients.consumer.internals.PartitionAssignor to the new 
> o.a.k.clients.consumer.ConsumerPartitionAssignor and support the deprecated 
> interface. This was deprecated in 2.4, so we should be ok to remove it and 
> the adaptor in 3.0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-12637) Remove deprecated PartitionAssignor interface

2021-04-08 Thread A. Sophie Blee-Goldman (Jira)
A. Sophie Blee-Goldman created KAFKA-12637:
--

 Summary: Remove deprecated PartitionAssignor interface
 Key: KAFKA-12637
 URL: https://issues.apache.org/jira/browse/KAFKA-12637
 Project: Kafka
  Issue Type: Improvement
  Components: consumer
Reporter: A. Sophie Blee-Goldman
 Fix For: 3.0.0


In KIP-429, we deprecated the existing PartitionAssignor interface in order to 
move it out of the internals package and better align the name with other 
pluggable Consumer interfaces. We added an adapter to convert from existing 
o.a.k.clients.consumer.internals.PartitionAssignor to the new 
o.a.k.clients.consumer.ConsumerPartitionAssignor and support the deprecated 
interface. This was deprecated in 2.4, so we should be ok to remove it and the 
adaptor in 3.0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-8613) Set default grace period to 0

2021-04-08 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317519#comment-17317519
 ] 

A. Sophie Blee-Goldman commented on KAFKA-8613:
---

Just replied on the KIP thread -- will go ahead and assign this ticket to you. 
It should be pretty straightforward but let me know if you have any questions 
on the implementation

> Set default grace period to 0
> -
>
> Key: KAFKA-8613
> URL: https://issues.apache.org/jira/browse/KAFKA-8613
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: Bruno Cadonna
>Assignee: A. Sophie Blee-Goldman
>Priority: Blocker
> Fix For: 3.0.0
>
>
> Currently, the grace period is set to retention time if the grace period is 
> not specified explicitly. The reason for setting the default grace period to 
> retention time was backward compatibility. Topologies that were implemented 
> before the introduction of the grace period, added late arriving records to a 
> window as long as the window existed, i.e., as long as its retention time was 
> not elapsed.  
> This unintuitive default grace period has already caused confusion among 
> users.
> For the next major release, we should set the default grace period to 
> {{Duration.ZERO}}.
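
For context, setting the grace period explicitly (rather than relying on the 
current retention-time default) looks roughly like the sketch below, assuming 
the 2.x windowed-aggregation API; the topic name is a placeholder:

{code:java}
import java.time.Duration;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.TimeWindows;

public class GracePeriodExample {
    public static void main(final String[] args) {
        final StreamsBuilder builder = new StreamsBuilder();
        final KStream<String, String> input = builder.stream("input-topic");  // placeholder topic

        input.groupByKey()
             // 5-minute windows that accept no late records (grace = 0),
             // which is the default this ticket proposes
             .windowedBy(TimeWindows.of(Duration.ofMinutes(5)).grace(Duration.ZERO))
             .count();

        builder.build();
    }
}
{code}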



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (KAFKA-8613) Set default grace period to 0

2021-04-08 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman reassigned KAFKA-8613:
-

Assignee: Israel Ekpo  (was: A. Sophie Blee-Goldman)

> Set default grace period to 0
> -
>
> Key: KAFKA-8613
> URL: https://issues.apache.org/jira/browse/KAFKA-8613
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: Bruno Cadonna
>Assignee: Israel Ekpo
>Priority: Blocker
> Fix For: 3.0.0
>
>
> Currently, the grace period is set to retention time if the grace period is 
> not specified explicitly. The reason for setting the default grace period to 
> retention time was backward compatibility. Topologies that were implemented 
> before the introduction of the grace period, added late arriving records to a 
> window as long as the window existed, i.e., as long as its retention time was 
> not elapsed.  
> This unintuitive default grace period has already caused confusion among 
> users.
> For the next major release, we should set the default grace period to 
> {{Duration.ZERO}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-12634) Should checkpoint after restore finished

2021-04-08 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317516#comment-17317516
 ] 

A. Sophie Blee-Goldman commented on KAFKA-12634:


Oh. :/

By the way, shouldn't we also checkpoint _during_ a restore? We probably don't 
want to do so at the same frequency as the commit interval, but it would be a 
bummer if an app with an hour-long restore crashed in the middle and had to 
start up again from scratch. Maybe we could pick an interval like 5min that 
should allow users to avoid losing all progress without severely slowing down 
the restore even further?

> Should checkpoint after restore finished
> 
>
> Key: KAFKA-12634
> URL: https://issues.apache.org/jira/browse/KAFKA-12634
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Affects Versions: 2.5.0
>Reporter: Matthias J. Sax
>Priority: Major
>
> For state stores, Kafka Streams maintains local checkpoint files to track the 
> offsets of the state store changelog topics. The checkpoint is updated on 
> commit or when a task is closed cleanly.
> However, after a successful restore, the checkpoint is not written. Thus, if 
> an instance crashes after restore but before committing, even if the state is 
> on local disk the checkpoint file is missing (indicating that there is no 
> state) and thus state would be restored from scratch.
> While for most cases, the time between restore end and next commit is small, 
> there are cases when this time could be large, for example if there is no new 
> input data to be processed (if there is no input data, the commit would be 
> skipped).
> Thus, we should write the checkpoint file after a successful restore to close 
> this gap (of course, only for at-least-once processing).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-12506) Expand AdjustStreamThreadCountTest

2021-04-08 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317487#comment-17317487
 ] 

A. Sophie Blee-Goldman commented on KAFKA-12506:


It looks like the build might be broken with some unrelated tests that are just 
flaky. You can navigate to the tab that says "Tests" at the top on those links 
you provided and it will display a list of the failed tests, e.g. 
https://ci-builds.apache.org/blue/organizations/jenkins/Kafka%2Fkafka-pr/detail/PR-10432/7/tests

You might not see those when running locally. Some tests are just flaky when 
run on Jenkins, like 
org.apache.kafka.streams.integration.KTableKTableForeignKeyInnerJoinMultiIntegrationTest.
 Others might not show up because you're only running Streams tests locally, 
and they're from another module, like kafka.server.RaftClusterTest

Don't worry about these flaky tests, especially if they're in core or connect, 
or anywhere outside of Streams since none of those have a dependency on 
Streams. If there are tests failing in Streams, but they're not a class or test 
you touched in the PR, then we need to use some judgement to figure out if it 
might be due to the changes or just flaky. We can always re-run the tests to 
try again.

> Expand AdjustStreamThreadCountTest
> --
>
> Key: KAFKA-12506
> URL: https://issues.apache.org/jira/browse/KAFKA-12506
> Project: Kafka
>  Issue Type: Sub-task
>  Components: streams, unit tests
>Reporter: A. Sophie Blee-Goldman
>Assignee: Aviral Srivastava
>Priority: Major
>  Labels: newbie, newbie++
>
> Right now the AdjustStreamThreadCountTest runs a minimal topology that just 
> consumes a single input topic, and doesn't produce any data to this topic. 
> Some of the complex concurrency bugs that we've found only showed up when we 
> had some actual data to process and a stateful topology: 
> [KAFKA-12503|https://issues.apache.org/jira/browse/KAFKA-12503] KAFKA-12500
> See the umbrella ticket for the list of improvements we need to make this a 
> more effective integration test



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-12634) Should checkpoint after restore finished

2021-04-08 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman updated KAFKA-12634:
---
Affects Version/s: 2.5.0

> Should checkpoint after restore finished
> 
>
> Key: KAFKA-12634
> URL: https://issues.apache.org/jira/browse/KAFKA-12634
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Affects Versions: 2.5.0
>Reporter: Matthias J. Sax
>Priority: Major
>
> For state stores, Kafka Streams maintains local checkpoint files to track the 
> offsets of the state store changelog topics. The checkpoint is updated on 
> commit or when a task is closed cleanly.
> However, after a successful restore, the checkpoint is not written. Thus, if 
> an instance crashes after restore but before committing, even if the state is 
> on local disk the checkpoint file is missing (indicating that there is no 
> state) and thus state would be restored from scratch.
> While for most cases, the time between restore end and next commit is small, 
> there are cases when this time could be large, for example if there is no new 
> input data to be processed (if there is no input data, the commit would be 
> skipped).
> Thus, we should write the checkpoint file after a successful restore to close 
> this gap (of course, only for at-least-once processing).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-12634) Should checkpoint after restore finished

2021-04-08 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317484#comment-17317484
 ] 

A. Sophie Blee-Goldman commented on KAFKA-12634:


Well we had that big refactor of task management in 2.6 so I wouldn't be 
surprised if it was fixed since then

> Should checkpoint after restore finished
> 
>
> Key: KAFKA-12634
> URL: https://issues.apache.org/jira/browse/KAFKA-12634
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: Matthias J. Sax
>Priority: Major
>
> For state stores, Kafka Streams maintains local checkpoint files to track the 
> offsets of the state store changelog topics. The checkpoint is updated on 
> commit or when a task is closed cleanly.
> However, after a successful restore, the checkpoint is not written. Thus, if 
> an instance crashes after restore but before committing, even if the state is 
> on local disk the checkpoint file is missing (indicating that there is no 
> state) and thus state would be restored from scratch.
> While for most cases, the time between restore end and next commit is small, 
> there are cases when this time could be large, for example if there is no new 
> input data to be processed (if there is no input data, the commit would be 
> skipped).
> Thus, we should write the checkpoint file after a successful restore to close 
> this gap (of course, only for at-least-once processing).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-12559) Add a top-level Streams config for bounding off-heap memory

2021-04-08 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317460#comment-17317460
 ] 

A. Sophie Blee-Goldman commented on KAFKA-12559:


Sure thing -- I recommend checking out the example implementation in the Memory 
Management section (linked to in the ticket description) and reading up on the KIP 
process if you're not familiar with it yet: 
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals

The idea here is to add one (or two) new StreamsConfigs which let the user 
control the rocksdb memory without having to implement a RocksDBConfigSetter. 
So if a user has set these configs but not a RocksDBConfigSetter, then we would 
have a default RocksDBConfigSetter similar to the one in the Memory Management 
section, where the rocksdb.max.bytes.off.heap config determines the value of 
TOTAL_OFF_HEAP_MEMORY in the example.

We probably also want to give users a way to control the other parameter in 
that example, TOTAL_MEMTABLE_MEMORY. The basic formula for memory usage in 
rocksdb is TOTAL_OFF_HEAP_MEMORY = TOTAL_MEMTABLE_MEMORY + TOTAL_CACHE_MEMORY 
-- so just think about the best way to let users specify how much of the total 
memory should go to the block cache vs. the memtables. Maybe you just want one 
config for TOTAL_MEMTABLE_MEMORY, or you could consider a config like 
rocksdb.memtable.to.block.cache.off.heap.memory.ratio which represents the 
ratio of memtable memory, i.e. TOTAL_MEMTABLE_MEMORY / TOTAL_OFF_HEAP_MEMORY.

Does that make sense? Let me know if you have any specific questions
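
To make that a bit more concrete, here is a rough sketch (closely following the 
Memory Management example) of what a built-in, bounded-memory config setter 
could look like. The constants stand in for wherever the values of the proposed 
configs would be plugged in -- none of the config names exist yet, and the 
exact knobs are up to the KIP:

{code:java}
import java.util.Map;
import org.apache.kafka.streams.state.RocksDBConfigSetter;
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.Cache;
import org.rocksdb.LRUCache;
import org.rocksdb.Options;
import org.rocksdb.WriteBufferManager;

public class BoundedMemoryRocksDBConfig implements RocksDBConfigSetter {

    // These would be derived from the proposed top-level configs, e.g.
    // rocksdb.max.bytes.off.heap and the memtable-vs-cache ratio/size config
    private static final long TOTAL_OFF_HEAP_MEMORY = 128 * 1024 * 1024L;
    private static final long TOTAL_MEMTABLE_MEMORY = 64 * 1024 * 1024L;

    // A single cache and write buffer manager shared by every store instance,
    // so the bound applies to the whole application rather than per store
    private static final Cache CACHE = new LRUCache(TOTAL_OFF_HEAP_MEMORY);
    private static final WriteBufferManager WRITE_BUFFER_MANAGER =
        new WriteBufferManager(TOTAL_MEMTABLE_MEMORY, CACHE);

    @Override
    public void setConfig(final String storeName, final Options options, final Map<String, Object> configs) {
        final BlockBasedTableConfig tableConfig = (BlockBasedTableConfig) options.tableFormatConfig();

        // Charge the memtables against the shared block cache so everything
        // counts towards the single off-heap bound
        options.setWriteBufferManager(WRITE_BUFFER_MANAGER);
        tableConfig.setBlockCache(CACHE);

        // Also count the index and filter blocks against the cache
        tableConfig.setCacheIndexAndFilterBlocks(true);

        options.setTableFormatConfig(tableConfig);
    }

    @Override
    public void close(final String storeName, final Options options) {
        // Do not close the shared Cache/WriteBufferManager here, they are reused by all stores
    }
}
{code}

With something like that wired in by default, a user would only need to set the 
new top-level config(s) to get an off-heap bound, and could still override 
everything with their own RocksDBConfigSetter as before.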

> Add a top-level Streams config for bounding off-heap memory
> ---
>
> Key: KAFKA-12559
> URL: https://issues.apache.org/jira/browse/KAFKA-12559
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: A. Sophie Blee-Goldman
>Priority: Major
>  Labels: needs-kip, newbie, newbie++
>
> At the moment we provide an example of how to bound the memory usage of 
> rocksdb in the [Memory 
> Management|https://kafka.apache.org/27/documentation/streams/developer-guide/memory-mgmt.html#rocksdb]
>  section of the docs. This requires implementing a custom RocksDBConfigSetter 
> class and setting a number of rocksdb options for relatively advanced 
> concepts and configurations. It seems a fair number of users either fail to 
> find this or consider it to be for more advanced use cases/users. But RocksDB 
> can eat up a lot of off-heap memory and it's not uncommon for users to come 
> across a {{RocksDBException: Cannot allocate memory}}
> It would probably be a much better user experience if we implemented this 
> memory bound out-of-the-box and just gave users a top-level StreamsConfig to 
> tune the off-heap memory given to rocksdb, like we have for on-heap cache 
> memory with cache.max.bytes.buffering. More advanced users can continue to 
> fine-tune their memory bounding and apply other configs with a custom config 
> setter, while new or more casual users can cap the off-heap memory without 
> getting their hands dirty with rocksdb.
> I would propose to add the following top-level config:
> rocksdb.max.bytes.off.heap: medium priority, default to -1 (unbounded), valid 
> values are [0, inf]
> I'd also want to consider adding a second, lower priority top-level config to 
> give users a knob for adjusting how much of that total off-heap memory goes 
> to the block cache + index/filter blocks, and how much of it is afforded to 
> the write buffers. I'm struggling to come up with a good name for this 
> config, but it would be something like
> rocksdb.memtable.to.block.cache.off.heap.memory.ratio: low priority, default 
> to 0.5, valid values are [0, 1]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-12506) Expand AdjustStreamThreadCountTest

2021-04-08 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317455#comment-17317455
 ] 

A. Sophie Blee-Goldman commented on KAFKA-12506:


Hey [~kebab-mai-haddi], if I understand correctly you are running `./gradlew 
streams:test` locally on your changes and it's passing, but the build is 
failing on the PR? If so, it may be due to upstream changes, as Jenkins will 
always merge your branch with the latest trunk before running the unit tests on 
the PR. I recommend pulling or rebasing from trunk and re-running the tests.

> Expand AdjustStreamThreadCountTest
> --
>
> Key: KAFKA-12506
> URL: https://issues.apache.org/jira/browse/KAFKA-12506
> Project: Kafka
>  Issue Type: Sub-task
>  Components: streams, unit tests
>Reporter: A. Sophie Blee-Goldman
>Assignee: Aviral Srivastava
>Priority: Major
>  Labels: newbie, newbie++
>
> Right now the AdjustStreamThreadCountTest runs a minimal topology that just 
> consumes a single input topic, and doesn't produce any data to this topic. 
> Some of the complex concurrency bugs that we've found only showed up when we 
> had some actual data to process and a stateful topology: 
> [KAFKA-12503|https://issues.apache.org/jira/browse/KAFKA-12503] KAFKA-12500
> See the umbrella ticket for the list of improvements we need to make this a 
> more effective integration test



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-8295) Optimize count() using RocksDB merge operator

2021-04-08 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317410#comment-17317410
 ] 

A. Sophie Blee-Goldman commented on KAFKA-8295:
---

Thanks Sagar -- I think the custom MergeOperator did exist back then, but I was 
hesitant to assume that it would always perform better given our experience 
with the performance of a custom ByteComparator. We would want to run some 
benchmarks to determine whether or not this is the case for the merge operator 
as well. It's possible that the MergeOperator isn't as severely affected by the 
overhead of crossing the JNI, or that it's been improved over the years. I 
actually recall reading a blog post about rocksdb tuning recently where they 
mentioned that using the merge operator gave them a significant improvement for 
certain data types -- I think this was in Flink, maybe?

This is definitely something we could add to the StateStore interface. It would 
require a KIP, and we'd need to be careful since not all store backends will 
necessarily support this. You could check out [KIP-617: Allow Kafka Streams 
State Stores to be iterated 
backwards|https://cwiki.apache.org/confluence/display/KAFKA/KIP-617%3A+Allow+Kafka+Streams+State+Stores+to+be+iterated+backwards]
 as a reference; reverse iteration is another rocksdb feature we wanted to 
expose but which isn't necessarily supported by all store implementations.
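
In case it helps whoever picks this up, here is a tiny standalone sketch of the 
built-in counter merge operator over plain RocksJava (not Streams code) showing 
the read-modify-write path that count() could leverage. The db path is made up, 
and the fixed-width little-endian encoding of the operands is my assumption 
about what uint64add expects, so it should be verified against the native 
operator:

{code:java}
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class MergeCounterExample {

    public static void main(final String[] args) throws RocksDBException {
        RocksDB.loadLibrary();

        try (final Options options = new Options()
                 .setCreateIfMissing(true)
                 // the built-in C++ counter merge operator, exposed by name over the Java API
                 .setMergeOperatorName("uint64add");
             final RocksDB db = RocksDB.open(options, "/tmp/merge-counter-example")) {

            final byte[] key = "some-windowed-key".getBytes();

            // each merge() is a single read-modify-write instead of a get() followed by a put()
            db.merge(key, encode(1L));
            db.merge(key, encode(1L));
            db.merge(key, encode(1L));

            System.out.println("count = " + decode(db.get(key))); // prints: count = 3
        }
    }

    // assumption: operands are fixed-width 64-bit integers in little-endian byte order
    private static byte[] encode(final long value) {
        return ByteBuffer.allocate(Long.BYTES).order(ByteOrder.LITTLE_ENDIAN).putLong(value).array();
    }

    private static long decode(final byte[] bytes) {
        return ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getLong();
    }
}
{code}

A benchmark comparing that against the current get/deserialize/add/put path in 
count() would tell us whether the jni overhead eats up the gain.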

> Optimize count() using RocksDB merge operator
> -
>
> Key: KAFKA-8295
> URL: https://issues.apache.org/jira/browse/KAFKA-8295
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: A. Sophie Blee-Goldman
>Priority: Major
>
> In addition to regular put/get/delete RocksDB provides a fourth operation, 
> merge. This essentially provides an optimized read/update/write path in a 
> single operation. One of the built-in (C++) merge operators exposed over the 
> Java API is a counter. We should be able to leverage this for a more 
> efficient implementation of count()
>  
> (Note: Unfortunately it seems unlikely we can use this to optimize general 
> aggregations, even if RocksJava allowed for a custom merge operator, unless 
> we provide a way for the user to specify and connect a C++ implemented 
> aggregator – otherwise we incur too much cost crossing the jni for a net 
> performance benefit)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-12634) Should checkpoint after restore finished

2021-04-08 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317393#comment-17317393
 ] 

A. Sophie Blee-Goldman commented on KAFKA-12634:


Nice catch. Do you know which versions this affects?

> Should checkpoint after restore finished
> 
>
> Key: KAFKA-12634
> URL: https://issues.apache.org/jira/browse/KAFKA-12634
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: Matthias J. Sax
>Priority: Major
>
> For state stores, Kafka Streams maintains local checkpoint files to track the 
> offsets of the state store changelog topics. The checkpoint is updated on 
> commit or when a task is closed cleanly.
> However, after a successful restore, the checkpoint is not written. Thus, if 
> an instance crashes after restore but before committing, even if the state is 
> on local disk the checkpoint file is missing (indicating that there is no 
> state) and thus state would be restored from scratch.
> While for most cases, the time between restore end and next commit is small, 
> there are cases when this time could be large, for example if there is no new 
> input data to be processed (if there is no input data, the commit would be 
> skipped).
> Thus, we should write the checkpoint file after a successful restore to close 
> this gap (of course, only for at-least-once processing).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-12624) Fix LICENSE in 2.6

2021-04-07 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-12624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316849#comment-17316849
 ] 

A. Sophie Blee-Goldman commented on KAFKA-12624:


Had to change a handful of dependencies whose versions changed between 2.6 and 
2.8 in the cherrypick:

{code:java}
-javassist-3.27.0-GA
+javassist-3.25.0-GA
+javassist-3.26.0-GA

-scala-collection-compat_2.13-2.3.0
-scala-library-2.13.5
-scala-logging_2.13-3.9.2
-scala-reflect-2.13.5
+scala-collection-compat_2.13-2.1.6
-snappy-java-1.1.8.1
+scala-library-2.13.2
+scala-logging_2.13-3.9.2
+scala-reflect-2.13.2
+scala-reflect-2.13.2
+snappy-java-1.1.7.3
 
-zstd-jni-1.4.9-1, see: licenses/zstd-jni-BSD-2-clause
+zstd-jni-1.4.4-7, see: licenses/zstd-jni-BSD-2-clause
{code}


> Fix LICENSE in 2.6
> --
>
> Key: KAFKA-12624
> URL: https://issues.apache.org/jira/browse/KAFKA-12624
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: John Roesler
>Assignee: A. Sophie Blee-Goldman
>Priority: Blocker
> Fix For: 2.6.2
>
>
> Just splitting this out as a sub-task.
> I've fixed the parent ticket on trunk and 2.8.
> You'll need to cherry-pick the fix from 2.8 (see 
> https://github.com/apache/kafka/pull/10474)
> Then, you can follow the manual verification steps I detailed here: 
> https://issues.apache.org/jira/browse/KAFKA-12622



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (KAFKA-12624) Fix LICENSE in 2.6

2021-04-07 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman resolved KAFKA-12624.

Resolution: Fixed

> Fix LICENSE in 2.6
> --
>
> Key: KAFKA-12624
> URL: https://issues.apache.org/jira/browse/KAFKA-12624
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: John Roesler
>Assignee: A. Sophie Blee-Goldman
>Priority: Blocker
> Fix For: 2.6.2
>
>
> Just splitting this out as a sub-task.
> I've fixed the parent ticket on trunk and 2.8.
> You'll need to cherry-pick the fix from 2.8 (see 
> https://github.com/apache/kafka/pull/10474)
> Then, you can follow the manual verification steps I detailed here: 
> https://issues.apache.org/jira/browse/KAFKA-12622



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-12622) Automate LICENCSE file validation

2021-04-07 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-12622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman updated KAFKA-12622:
---
Description: 
In https://issues.apache.org/jira/browse/KAFKA-12602, we manually constructed a 
correct license file for 2.8.0. This file will certainly become wrong again in 
later releases, so we need to write some kind of script to automate a check.

It crossed my mind to automate the generation of the file, but it seems to be 
an intractable problem, considering that each dependency may change licenses, 
may package license files, link to them from their poms, link to them from 
their repos, etc. I've also found multiple URLs listed with various delimiters, 
broken links that I have to chase down, etc.

Therefore, it seems like the solution to aim for is simply: list all the jars 
that we package, and print out a report of each jar that's extra or missing vs. 
the ones in our `LICENSE-binary` file.

The check should be part of the release script at least, if not part of the 
regular build (so we keep it up to date as dependencies change).

 

Here's how I do this manually right now:
{code:java}
// build the binary artifacts
$ ./gradlewAll releaseTarGz

// unpack the binary artifact 
$ tar xf core/build/distributions/kafka_2.13-X.Y.Z.tgz
$ cd kafka_2.13-X.Y.Z

// list the packaged jars 
// (you can ignore the jars for our own modules, like kafka, kafka-clients, 
etc.)
$ ls libs/

// cross check the jars with the packaged LICENSE
// make sure all dependencies are listed with the right versions
$ cat LICENSE

// also double check all the mentioned license files are present
$ ls licenses {code}
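
As a starting point, a crude sketch of the jar-vs-LICENSE report described 
above could look like the following. It assumes the unpacked distribution 
directory is passed as an argument and that each bundled dependency appears in 
the LICENSE file as a single "artifact-version" line (possibly followed by a 
", see: ..." pointer), which is a simplification that would need to be 
hardened for the real script:

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LicenseCheck {

    public static void main(final String[] args) throws IOException {
        // usage: LicenseCheck <path to unpacked kafka_2.13-X.Y.Z directory>
        final Path distDir = Paths.get(args[0]);

        // the jars we actually ship in the binary distribution
        final Set<String> packagedJars;
        try (Stream<Path> libs = Files.list(distDir.resolve("libs"))) {
            packagedJars = libs.map(p -> p.getFileName().toString())
                               .filter(name -> name.endsWith(".jar"))
                               .map(name -> name.substring(0, name.length() - ".jar".length()))
                               .collect(Collectors.toSet());
        }

        // dependencies mentioned in the packaged LICENSE file (assumed format, see above)
        final Set<String> licensedDeps;
        try (Stream<String> lines = Files.lines(distDir.resolve("LICENSE"))) {
            licensedDeps = lines.map(String::trim)
                                .filter(line -> line.matches("[A-Za-z0-9_.\\-]+-\\d.*"))
                                .map(line -> line.split(",")[0].trim())
                                .collect(Collectors.toSet());
        }

        // report jars that are missing from, or extra in, the LICENSE file
        packagedJars.stream()
                    .filter(jar -> !licensedDeps.contains(jar))
                    .forEach(jar -> System.out.println("MISSING from LICENSE: " + jar));
        licensedDeps.stream()
                    .filter(dep -> !packagedJars.contains(dep))
                    .forEach(dep -> System.out.println("EXTRA in LICENSE (not shipped): " + dep));
    }
}
{code}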

  was:
In https://issues.apache.org/jira/browse/KAFKA-12602, we manually constructed a 
correct license file for 2.8.0. This file will certainly become wrong again in 
later releases, so we need to write some kind of script to automate a check.

It crossed my mind to automate the generation of the file, but it seems to be 
an intractable problem, considering that each dependency may change licenses, 
may package license files, link to them from their poms, link to them from 
their repos, etc. I've also found multiple URLs listed with various delimiters, 
broken links that I have to chase down, etc.

Therefore, it seems like the solution to aim for is simply: list all the jars 
that we package, and print out a report of each jar that's extra or missing vs. 
the ones in our `LICENSE-binary` file.

The check should be part of the release script at least, if not part of the 
regular build (so we keep it up to date as dependencies change).

 

Here's how I do this manually right now:
{code:java}
// build the binary artifacts
$ ./gradlewAll releaseTarGz

// unpack the binary artifact
$ cd core/build/distributions/
$ tar xf kafka_2.13-X.Y.Z.tgz
$ cd kafka_2.13-X.Y.Z

// list the packaged jars 
// (you can ignore the jars for our own modules, like kafka, kafka-clients, 
etc.)
$ ls libs/

// cross check the jars with the packaged LICENSE
// make sure all dependencies are listed with the right versions
$ cat LICENSE

// also double check all the mentioned license files are present
$ ls licenses {code}


> Automate LICENCSE file validation
> -
>
> Key: KAFKA-12622
> URL: https://issues.apache.org/jira/browse/KAFKA-12622
> Project: Kafka
>  Issue Type: Task
>Reporter: John Roesler
>Priority: Major
> Fix For: 3.0.0, 2.8.1
>
>
> In https://issues.apache.org/jira/browse/KAFKA-12602, we manually constructed 
> a correct license file for 2.8.0. This file will certainly become wrong again 
> in later releases, so we need to write some kind of script to automate a 
> check.
> It crossed my mind to automate the generation of the file, but it seems to be 
> an intractable problem, considering that each dependency may change licenses, 
> may package license files, link to them from their poms, link to them from 
> their repos, etc. I've also found multiple URLs listed with various 
> delimiters, broken links that I have to chase down, etc.
> Therefore, it seems like the solution to aim for is simply: list all the jars 
> that we package, and print out a report of each jar that's extra or missing 
> vs. the ones in our `LICENSE-binary` file.
> The check should be part of the release script at least, if not part of the 
> regular build (so we keep it up to date as dependencies change).
>  
> Here's how I do this manually right now:
> {code:java}
> // build the binary artifacts
> $ ./gradlewAll releaseTarGz
> // unpack the binary artifact 
> $ tar xf core/build/distributions/kafka_2.13-X.Y.Z.tgz
> $ cd kafka_2.13-X.Y.Z
> // list the packaged jars 
> // (you can ignore the jars for our own modules, like kafka, kafka-clients, 
> etc.)
> $ ls libs/
> // cross check the jars with the packaged 

[jira] [Resolved] (KAFKA-8924) Default grace period (-1) of TimeWindows causes suppress to emit events after 24h

2021-04-05 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman resolved KAFKA-8924.
---
Resolution: Duplicate

> Default grace period (-1) of TimeWindows causes suppress to emit events after 
> 24h
> -
>
> Key: KAFKA-8924
> URL: https://issues.apache.org/jira/browse/KAFKA-8924
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Affects Versions: 2.3.0
>Reporter: Michał
>Assignee: Michał
>Priority: Major
>  Labels: needs-kip
>
> h2. Problem 
> The default creation of TimeWindows, like
> {code}
> TimeWindows.of(ofMillis(xxx))
> {code}
> calls an internal constructor
> {code}
> return new TimeWindows(sizeMs, sizeMs, -1, DEFAULT_RETENTION_MS);
> {code}
> And the *-1* parameter is the default grace period which I think is here for 
> backward compatibility
> {code}
> @SuppressWarnings("deprecation") // continuing to support 
> Windows#maintainMs/segmentInterval in fallback mode
> @Override
> public long gracePeriodMs() {
> // NOTE: in the future, when we remove maintainMs,
> // we should default the grace period to 24h to maintain the default 
> behavior,
> // or we can default to (24h - size) if you want to be super accurate.
> return graceMs != -1 ? graceMs : maintainMs() - size();
> }
> {code}
> The problem is that if you use a TimeWindows with gracePeriod of *-1* 
> together with suppress *untilWindowCloses*, it never emits an event.
> You can check the Suppress tests 
> (SuppressScenarioTest.shouldSupportFinalResultsForTimeWindows), where 
> [~vvcephei] was (maybe) aware of that and all the scenarios specify the 
> gracePeriod.
> I will add a test without it on my branch and it will fail.
> The test: 
> https://github.com/atais/kafka/commit/221a04dc40d636ffe93a0ad95dfc6bcad653f4db
>  
> h2. Now what can be done
> One easy fix would be to change the default value to 0, which works fine for 
> me in my project; however, I am not aware of the impact it would have due to 
> the changes in the *gracePeriodMs* method mentioned before.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-8924) Default grace period (-1) of TimeWindows causes suppress to emit events after 24h

2021-04-05 Thread A. Sophie Blee-Goldman (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-8924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17315217#comment-17315217
 ] 

A. Sophie Blee-Goldman commented on KAFKA-8924:
---

Can we close this as a duplicate? We have KIP-633 currently under voting, which 
should improve things; we can track the progress from there with 
https://issues.apache.org/jira/browse/KAFKA-8613
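
In the meantime, the usual workaround is to set the grace period explicitly so 
that suppression can actually emit final results. A minimal sketch for 
reference (topic names are made up, and it assumes String default serdes are 
configured):

{code:java}
import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.Suppressed;
import org.apache.kafka.streams.kstream.TimeWindows;

public class FinalWindowCountsExample {

    public static StreamsBuilder buildTopology() {
        final StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("events")
               .groupByKey()
               // an explicit grace period avoids the huge default of (retention - window size),
               // so untilWindowCloses() emits each final count shortly after the window ends
               .windowedBy(TimeWindows.of(Duration.ofMinutes(5)).grace(Duration.ZERO))
               .count()
               .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))
               // flatten the windowed key to a plain String so we can write with simple serdes
               .toStream((windowedKey, count) -> windowedKey.toString())
               .to("event-counts", Produced.with(Serdes.String(), Serdes.Long()));
        return builder;
    }
}
{code}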

> Default grace period (-1) of TimeWindows causes suppress to emit events after 
> 24h
> -
>
> Key: KAFKA-8924
> URL: https://issues.apache.org/jira/browse/KAFKA-8924
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Affects Versions: 2.3.0
>Reporter: Michał
>Assignee: Michał
>Priority: Major
>  Labels: needs-kip
>
> h2. Problem 
> The default creation of TimeWindows, like
> {code}
> TimeWindows.of(ofMillis(xxx))
> {code}
> calls an internal constructor
> {code}
> return new TimeWindows(sizeMs, sizeMs, -1, DEFAULT_RETENTION_MS);
> {code}
> And the *-1* parameter is the default grace period which I think is here for 
> backward compatibility
> {code}
> @SuppressWarnings("deprecation") // continuing to support 
> Windows#maintainMs/segmentInterval in fallback mode
> @Override
> public long gracePeriodMs() {
> // NOTE: in the future, when we remove maintainMs,
> // we should default the grace period to 24h to maintain the default 
> behavior,
> // or we can default to (24h - size) if you want to be super accurate.
> return graceMs != -1 ? graceMs : maintainMs() - size();
> }
> {code}
> The problem is that if you use a TimeWindows with gracePeriod of *-1* 
> together with suppress *untilWindowCloses*, it never emits an event.
> You can check the Suppress tests 
> (SuppressScenarioTest.shouldSupportFinalResultsForTimeWindows), where 
> [~vvcephei] was (maybe) aware of that and all the scenarios specify the 
> gracePeriod.
> I will add a test without it on my branch and it will fail.
> The test: 
> https://github.com/atais/kafka/commit/221a04dc40d636ffe93a0ad95dfc6bcad653f4db
>  
> h2. Now what can be done
> One easy fix would be to change the default value to 0, which works fine for 
> me in my project; however, I am not aware of the impact it would have due to 
> the changes in the *gracePeriodMs* method mentioned before.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (KAFKA-8613) Set default grace period to 0

2021-04-05 Thread A. Sophie Blee-Goldman (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-8613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

A. Sophie Blee-Goldman reassigned KAFKA-8613:
-

Assignee: A. Sophie Blee-Goldman

> Set default grace period to 0
> -
>
> Key: KAFKA-8613
> URL: https://issues.apache.org/jira/browse/KAFKA-8613
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: Bruno Cadonna
>Assignee: A. Sophie Blee-Goldman
>Priority: Blocker
> Fix For: 3.0.0
>
>
> Currently, the grace period is set to retention time if the grace period is 
> not specified explicitly. The reason for setting the default grace period to 
> retention time was backward compatibility. Topologies that were implemented 
> before the introduction of the grace period added late-arriving records to a 
> window as long as the window existed, i.e., as long as its retention time had 
> not elapsed.
> This unintuitive default grace period has already caused confusion among 
> users.
> For the next major release, we should set the default grace period to 
> {{Duration.ZERO}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

