[GitHub] [kafka] junrao commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…
junrao commented on pull request #8657: URL: https://github.com/apache/kafka/pull/8657#issuecomment-689838971 @chia7712 : Thanks a lot for staying on this tricky issue and finding a simpler solution! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [kafka] junrao commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…
junrao commented on pull request #8657: URL: https://github.com/apache/kafka/pull/8657#issuecomment-689241869 @ijuma @hachikuji @rajinisivaram : I think this PR is ready to be merged. Any further comments from you? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [kafka] junrao commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…
junrao commented on pull request #8657: URL: https://github.com/apache/kafka/pull/8657#issuecomment-685831526 @chia7712 It seems there are some compilation errors in jenkins? https://ci-builds.apache.org/job/Kafka/job/kafka-pr/view/change-requests/job/PR-8657/1/consoleFull ` 00:31:51 [Error] /home/jenkins/jenkins-agent/workspace/Kafka_kafka-pr_PR-8657/streams/streams-scala/src/test/scala/org/apache/kafka/streams/scala/StreamToTableJoinScalaIntegrationTestImplicitSerdes.scala:136: value stringSerde is not a member of object org.apache.kafka.streams.scala.Serdes ` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [kafka] junrao commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…
junrao commented on pull request #8657: URL: https://github.com/apache/kafka/pull/8657#issuecomment-683859366 @chia7712 : Thanks for the performance results. It seems that the average across multiple runs doesn't change much? Also, 1 failure in the latest system test run. http://confluent-kafka-branch-builder-system-test-results.s3-us-west-2.amazonaws.com/2020-08-30--001.1598849124--chia7712--fix_8334_avoid_deadlock--960f19b29/report.html This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [kafka] junrao commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…
junrao commented on pull request #8657: URL: https://github.com/apache/kafka/pull/8657#issuecomment-683052000 @ijuma @hachikuji @rajinisivaram : Do you want to take another look at the latest solution from Chia-Ping? It (1) solves the known issues completely; (2) doesn't require new threads; (3) adds minimal changes to existing code; (4) simplifies existing code by removing the tryLock logic. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [kafka] junrao commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…
junrao commented on pull request #8657: URL: https://github.com/apache/kafka/pull/8657#issuecomment-681361030 The following is my understand. The current PR introduces a new deadlock through the following path. path 1 hold group lock -> joinPurgatory.tryCompleteElseWatch(delayedJoin) -> watchForOperation (now delayedJoin visible through other threads) -> operation.maybeTryComplete() -> hold delayedJoin.lock path 2 delayedJoin.maybeTryComplete -> hold hold delayedJoin.lock -> tryComplete() -> hold group lock The existing code doesn't have this deadlock since (1) delayedJoin.lock is the same as the group lock held in the caller and (2) a delayed join operation is registered under the group key (so each time we check completeness for a group key, only one delayed join operation will be affected). By switching back to this code, we avoid the new deadlock. The existing code has a different deadlock issue that groupManager.storeGroup() in GroupCoordinator.onCompleteJoin may need to complete to other delayed operations and potentially hold a different group lock while already holding a group lock. This issue will be resolved if we complete those delayed operations due to groupManager.storeGroup() elsewhere without holding any locks. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [kafka] junrao commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…
junrao commented on pull request #8657: URL: https://github.com/apache/kafka/pull/8657#issuecomment-681109748 @chia7712 : Thanks for the reply. I like your overall idea and I think it can be used to solve the problem completely in a simpler way. 1. Instead of at `Partition`, we collect all pending delayed check operations in a queue in ReplicaManager. All callers to ReplicaManager.appendRecords() are expected to take up to 1 item from that queue and check the completeness for all affected partitions, without holding any conflicting locks. 2. Most callers to ReplicaManager.appendRecords() are from KafkaApis. We can just add the logic to check the ReplicaManager queue at the end of KafkaApis.handle(), at which point, no conflicting locks will be held. 3. Another potentially caller is the expiration thread in a purgatory. SystemTimer always runs the expiration logic in a separate thread and DelayedOperation.onExpiration() is always called without holding any conflicting lock. So, for those delayed operations using ReplicaManager.appendRecords(), we can pass down a flag to DelayedOperation so that at the end of onExpiration, we check the ReplicaManager queue if the flag is set. 4. We keep `DelayedJoin` as it is, still passing in the group lock to DelayedOperation to avoid deadlocks due to two levels of locking. 5. We can still get rid of the `tryLock` logic in DelayedOperation for simplification since there is no opportunity for deadlock. What do you think? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [kafka] junrao commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…
junrao commented on pull request #8657: URL: https://github.com/apache/kafka/pull/8657#issuecomment-679256967 @chia7712 : Sorry for the late response. I just realized there seems to be another issue in addition to the above one that I mentioned. The second issue is that we hold a group lock while calling `joinPurgatory.tryCompleteElseWatch`. In this call, it's possible that DelayedJoin.onComplete() will be called. In that case, since the caller holds the group lock, we won't be completing partitionsToComplete in completeDelayedRequests(). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [kafka] junrao commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…
junrao commented on pull request #8657: URL: https://github.com/apache/kafka/pull/8657#issuecomment-671520491 @chia7712 : I agree mostly with your assessment. For most delayed operations, the checking for the completeness of the operation and the calling of onComplete() don't have to be protected under the same lock. The only one that I am not quite sure is DelayedJoin. Currently, DelayedJoin.tryCompleteJoin() checks if all members have joined and DelayedJoin.onComplete() modifies the state of the group. Both operations are done under the same group lock. If we relax the lock, it seems that the condition "all members have joined" may no longer be true when we get to DelayedJoin.onComplete() even though that condition was true during the DelayedJoin.tryCompleteJoin() check. It's not clear what we should do in that case. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [kafka] junrao commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…
junrao commented on pull request #8657: URL: https://github.com/apache/kafka/pull/8657#issuecomment-670113688 @chia7712 : If we could solve the issue by simplifying DelayedOperation, it would be ideal. I am not sure how your proposal avoids the above potential deadlock. Could you provide a bit more detail? Thanks. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [kafka] junrao commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…
junrao commented on pull request #8657: URL: https://github.com/apache/kafka/pull/8657#issuecomment-669330984 @chia7712 : The following is my thought after thinking about this a bit more. The changes that we made in DelayedJoin is complicated and it still doesn't completely solve the deadlock issue. Adding more complexity there to solve the issue is probably not ideal. As a compromise, I was thinking another approach. We could pass down a flag to ReplicaManager.appendRecords() to complete the delayed requests in a separate thread there. Only GroupCoordinator will set this flag. So, the background completeness check is limited to the offset_commit topic. Since this topic is low volume, a single thread is likely enough to handle the load. So, we don't have to make the number of thread configurable and don't have to worry about this thread being overwhelmed. The benefit of this approach is that the code is probably easier to understand since we could keep most existing logic in GroupCoordinator unchanged. What do you think? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [kafka] junrao commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…
junrao commented on pull request #8657: URL: https://github.com/apache/kafka/pull/8657#issuecomment-667233872 @chia7712 : So, it's just a rebase and there is no change in your PR? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [kafka] junrao commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…
junrao commented on pull request #8657: URL: https://github.com/apache/kafka/pull/8657#issuecomment-666753554 @ijuma : I think the proposal is to complete all delayed operations in a separate thread pool. My concerns for that approach are the following: (1) Configuration: How many threads should be used? Should that be dynamically configurable? Should the queue to this thread pool be unbounded or fixed size? If it's the latter, how do we configure the queue size? (2) Monitoring: How do we monitor the utilization of the new thread pool and the potential back pressure it creates (if the queue is fixed size)? (3) Quota: Do we need to account for the resource utilization of this new thread pool per client/user for quota purpose? (4) Debugging: If there is a latency issue, how do we know whether this is due to delays in the new thread pool? So, overall, I feel that we should be careful with adding another thread pool used in the common code path since it adds other complexity. Given that this issue is isolated in GroupCoordinator, I am not sure adding a thread pool is an overall simpler solution. @chia7712 : I saw that you consolidated the commit history. Were there significant changes since the last review? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [kafka] junrao commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…
junrao commented on pull request #8657: URL: https://github.com/apache/kafka/pull/8657#issuecomment-664652669 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [kafka] junrao commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…
junrao commented on pull request #8657: URL: https://github.com/apache/kafka/pull/8657#issuecomment-663597341 3 system test failures with trunk. http://confluent-kafka-system-test-results.s3-us-west-2.amazonaws.com/2020-07-23--001.1595551051--apache--trunk--0b181fdde/report.html This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [kafka] junrao commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…
junrao commented on pull request #8657: URL: https://github.com/apache/kafka/pull/8657#issuecomment-663109443 @chia7712: Only 6 test failures in the latest run with your PR. http://confluent-kafka-branch-builder-system-test-results.s3-us-west-2.amazonaws.com/2020-07-23--001.1595503536--chia7712--fix_8334_avoid_deadlock--3462b0008/report.html I will do another run on trunk for comparison. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [kafka] junrao commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…
junrao commented on pull request #8657: URL: https://github.com/apache/kafka/pull/8657#issuecomment-662642708 @chia7712 : I just merged #9026. Could you rebase again? I will run the system tests after that. Thanks. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [kafka] junrao commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…
junrao commented on pull request #8657: URL: https://github.com/apache/kafka/pull/8657#issuecomment-658281873 @chia7712 : Here is the latest system test result. http://confluent-kafka-branch-builder-system-test-results.s3-us-west-2.amazonaws.com/2020-07-14--001.1594711494--chia7712--fix_8334_avoid_deadlock--9ebe39447/report.html The number of failures went up to 17. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [kafka] junrao commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…
junrao commented on pull request #8657: URL: https://github.com/apache/kafka/pull/8657#issuecomment-656365594 @chia7712 : All 3 PRs you fixed above have been merged. Do you want to rebase again so that I can run system tests one more time? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [kafka] junrao commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…
junrao commented on pull request #8657: URL: https://github.com/apache/kafka/pull/8657#issuecomment-653953935 @chia7712 : Thanks. Are the unit test failures also due to flaky tests? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [kafka] junrao commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…
junrao commented on pull request #8657: URL: https://github.com/apache/kafka/pull/8657#issuecomment-653227128 @chia7712 : Thanks for the investigation. You don't need to fix all those flaky tests. It would be helpful if you could file separate jiras to track them, if not already. Overall, do you think there is no new system test failure introduced by this PR? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [kafka] junrao commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…
junrao commented on pull request #8657: URL: https://github.com/apache/kafka/pull/8657#issuecomment-650786185 Latest system test results. Down to 15 failures. http://confluent-kafka-branch-builder-system-test-results.s3-us-west-2.amazonaws.com/2020-06-28--001.1593338982--chia7712--fix_8334_avoid_deadlock--c69ab94d3/report.html Will do another trunk run for comparison. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [kafka] junrao commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…
junrao commented on pull request #8657: URL: https://github.com/apache/kafka/pull/8657#issuecomment-650583070 @chia7712 I just merged the PR for KAFKA-10180. Perhaps you can rebase again. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [kafka] junrao commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…
junrao commented on pull request #8657: URL: https://github.com/apache/kafka/pull/8657#issuecomment-650240109 For comparison, 29 test failures in trunk http://testing.confluent.io/confluent-kafka-system-test-results/?prefix=2020-06-25--001.1593132363--apache--trunk--3348fc49d/ This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [kafka] junrao commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…
junrao commented on pull request #8657: URL: https://github.com/apache/kafka/pull/8657#issuecomment-649615869 Latest test results with 31 failures. http://confluent-kafka-branch-builder-system-test-results.s3-us-west-2.amazonaws.com/2020-06-24--001.1593050486--chia7712--fix_8334_avoid_deadlock--4cd68e7a9/report.html I will trigger a system tests for trunk. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [kafka] junrao commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…
junrao commented on pull request #8657: URL: https://github.com/apache/kafka/pull/8657#issuecomment-648259565 Thanks. Triggering another round of system tests. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [kafka] junrao commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…
junrao commented on pull request #8657: URL: https://github.com/apache/kafka/pull/8657#issuecomment-648254407 @chia7712 : I think the client compatibility test failures are probably because you haven't rebased the PR. #8841 was fixed 6 days ago. Could you rebase your PR? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [kafka] junrao commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…
junrao commented on pull request #8657: URL: https://github.com/apache/kafka/pull/8657#issuecomment-648249808 There were still lots of client compatibility related failures http://confluent-kafka-branch-builder-system-test-results.s3-us-west-2.amazonaws.com/2020-06-22--001.1592885123--chia7712--fix_8334_avoid_deadlock--142a6c4c0/report.html . Not sure why since those tests were passing in http://confluent-kafka-branch-builder-system-test-results.s3-us-west-2.amazonaws.com/2020-06-16--001.1592335112--mumrah--KAFKA-10123--63c1b14a4/report.html This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [kafka] junrao commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…
junrao commented on pull request #8657: URL: https://github.com/apache/kafka/pull/8657#issuecomment-647688705 It seems we just fixed a bunch of client compatibility related failures in https://github.com/apache/kafka/pull/8841. Another 18 test failures were due to TLSv1.3 and are tracked in https://issues.apache.org/jira/browse/KAFKA-10180. I started another run of system tests. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [kafka] junrao commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…
junrao commented on pull request #8657: URL: https://github.com/apache/kafka/pull/8657#issuecomment-647020292 This is a second run of the system tests. http://testing.confluent.io/confluent-kafka-branch-builder-system-test-results/?prefix=2020-06-19--001.1592614513--chia7712--fix_8334_avoid_deadlock--142a6c4c0/ This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [kafka] junrao commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…
junrao commented on pull request #8657: URL: https://github.com/apache/kafka/pull/8657#issuecomment-646685855 @chia7712 : The system test results can be found in http://testing.confluent.io/confluent-kafka-branch-builder-system-test-results/?prefix=2020-06-18--001.1592550132--chia7712--fix_8334_avoid_deadlock--142a6c4c0/ If you click on report.html, it shows there were 76 failures. Could you take a look and see if it's related to this PR? I will trigger another run just to see if any of those failures were transient. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org