[GitHub] [kafka] chia7712 commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…

2020-09-09 Thread GitBox


chia7712 commented on pull request #8657:
URL: https://github.com/apache/kafka/pull/8657#issuecomment-689921735


   > Thanks a lot for staying on this tricky issue and finding a simpler 
solution!
   
   thanks for all suggestions. I benefit a lot from it.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [kafka] chia7712 commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…

2020-09-08 Thread GitBox


chia7712 commented on pull request #8657:
URL: https://github.com/apache/kafka/pull/8657#issuecomment-689256048


   ```
   Build / JDK 15 / 
kafka.network.ConnectionQuotasTest.testNoConnectionLimitsByDefault
   Build / JDK 11 / 
kafka.network.DynamicConnectionQuotaTest.testDynamicListenerConnectionCreationRateQuota
   Build / JDK 11 / 
org.apache.kafka.streams.integration.EosBetaUpgradeIntegrationTest.shouldUpgradeFromEosAlphaToEosBeta[true]
   ```
   ```
   Module: kafkatest.tests.connect.connect_distributed_test
   Class:  ConnectDistributedTest
   Method: test_bounce
   Arguments:
   {
 "clean": true,
 "connect_protocol": "sessioned"
   }
   ```
   
   On my local, they are flaky on trunk branch.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [kafka] chia7712 commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…

2020-09-07 Thread GitBox


chia7712 commented on pull request #8657:
URL: https://github.com/apache/kafka/pull/8657#issuecomment-688481474


   @junrao Thanks for all reviews again 👍 
   
   > Do you plan to remove some of the unused methods in DelayedOperations in 
Partition?
   
   my bad. I forgot this request :(
   
   Expect for ```checkAndCompleteFetch```, the other unused methods (in 
production scope) are removed by this PR.
   
   > Currently, when calling checkAndComplete() for the 
produce/fetch/deleteRecords purgatory, we still hold replicaStateChangeLock. 
This doesn't seem to cause any deadlock for now. In the future, we can 
potentially improve this by calling checkAndComplete() outside of the 
replicaStateChangeLock by passing leader epoch into those delayed operations 
and checking if leader epoch has changed in tryComplete().
   
   It seems we can remove ```delayedOperations``` from```Partition```. That is 
similar to this PR and ```Partition``` SHOULD NOT complete delayed request 
anymore. I can take over this in separate PR :)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [kafka] chia7712 commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…

2020-09-02 Thread GitBox


chia7712 commented on pull request #8657:
URL: https://github.com/apache/kafka/pull/8657#issuecomment-685838009


   Retest this please



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [kafka] chia7712 commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…

2020-09-02 Thread GitBox


chia7712 commented on pull request #8657:
URL: https://github.com/apache/kafka/pull/8657#issuecomment-685837334


   
```kafka.network.DynamicConnectionQuotaTest.testDynamicListenerConnectionCreationRateQuota```
 pass on my local. retest this please



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [kafka] chia7712 commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…

2020-09-02 Thread GitBox


chia7712 commented on pull request #8657:
URL: https://github.com/apache/kafka/pull/8657#issuecomment-685833800


   > 00:31:51 [Error] 
/home/jenkins/jenkins-agent/workspace/Kafka_kafka-pr_PR-8657/streams/streams-scala/src/test/scala/org/apache/kafka/streams/scala/StreamToTableJoinScalaIntegrationTestImplicitSerdes.scala:136:
 value stringSerde is not a member of object 
org.apache.kafka.streams.scala.Serdes
   
   this error is caused by #8955 and #9245 is going to fix it



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [kafka] chia7712 commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…

2020-08-31 Thread GitBox


chia7712 commented on pull request #8657:
URL: https://github.com/apache/kafka/pull/8657#issuecomment-683863356


   > It seems that the average across multiple runs doesn't change much?
   
   yep. I didn't observe obvious regression caused by this patch.
   
   > Also, 1 failure in the latest system test run.
   
   ```kafkatest.tests.connect.connect_distributed_test``` was flaky (see 
https://issues.apache.org/jira/browse/KAFKA-10289)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [kafka] chia7712 commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…

2020-08-31 Thread GitBox


chia7712 commented on pull request #8657:
URL: https://github.com/apache/kafka/pull/8657#issuecomment-683602732


   @junrao the result of ```benchmark_test.py``` is attached (see description)
   
   The main regression ({"records_per_sec": 3653635.3672, "mb_per_sec": 
348.4378} -> {"records_per_sec": 2992220.2274, "mb_per_sec": 285.3604}) happens 
in case ```test_id: 
kafkatest.benchmarks.core.benchmark_test.Benchmark.test_consumer_throughput.interbroker_security_protocol=PLAINTEXT.tls_version=TLSv1.3.security_protocol=SSL.compression_type=snappy```.
 
   
   I re-run the case 5 times and it seems the throughput of that case is not 
stable.
   
   **BEFORE**
   
   1. {"records_per_sec": 3653635.3672, "mb_per_sec": 348.4378} 
   1. {"records_per_sec": 3812428.517, "mb_per_sec": 363.5815}
   1. {"records_per_sec": 3012048.1928, "mb_per_sec": 287.2513}
   1. {"records_per_sec": 3182686.1871, "mb_per_sec": 303.5246}
   1. {"records_per_sec": 2997601.9185, "mb_per_sec": 285.8736}
   
   **AFTER**
   
   1. {"records_per_sec": 2992220.2274, "mb_per_sec": 285.3604}
   1. {"records_per_sec": 3698224.8521, "mb_per_sec": 352.6902}
   1. {"records_per_sec": 2977076.5109, "mb_per_sec": 283.9161}
   1. {"records_per_sec": 3676470.5882, "mb_per_sec": 350.6156}
   1. {"records_per_sec": 3681885.1252, "mb_per_sec": 351.1319}
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [kafka] chia7712 commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…

2020-08-30 Thread GitBox


chia7712 commented on pull request #8657:
URL: https://github.com/apache/kafka/pull/8657#issuecomment-683434648


   > Could you do some perf tests so that we know the performance of fetch 
requests doesn't change noticeably?
   
   the result of ```Benchmark.test_producer_and_consumer``` is attached below. 
It seems the patch gets better throughput :)
   
   I will run more performance tests tomorrow.
   
   **BEFORE**
   ```
   test_id:
kafkatest.benchmarks.core.benchmark_test.Benchmark.test_producer_and_consumer.interbroker_security_protocol=PLAINTEXT.tls_version=TLSv1.2.security_protocol=SSL.compression_type=none
   status: PASS
   run time:   1 minute 8.041 seconds
   {"consumer": {"records_per_sec": 570060.4264, "mb_per_sec": 54.3652}, 
"producer": {"records_per_sec": 611583.389395, "mb_per_sec": 58.33}}
   

   test_id:
kafkatest.benchmarks.core.benchmark_test.Benchmark.test_producer_and_consumer.interbroker_security_protocol=PLAINTEXT.tls_version=TLSv1.2.security_protocol=SSL.compression_type=snappy
   status: PASS
   run time:   58.549 seconds
   {"consumer": {"records_per_sec": 1331203.4079, "mb_per_sec": 126.9535}, 
"producer": {"records_per_sec": 1257387.149503, "mb_per_sec": 119.91}}
   

   test_id:
kafkatest.benchmarks.core.benchmark_test.Benchmark.test_producer_and_consumer.interbroker_security_protocol=PLAINTEXT.tls_version=TLSv1.3.security_protocol=SSL.compression_type=none
   status: PASS
   run time:   1 minute 7.757 seconds
   {"consumer": {"records_per_sec": 577133.9528, "mb_per_sec": 55.0398}, 
"producer": {"records_per_sec": 582106.059724, "mb_per_sec": 55.51}}
   

   test_id:
kafkatest.benchmarks.core.benchmark_test.Benchmark.test_producer_and_consumer.interbroker_security_protocol=PLAINTEXT.tls_version=TLSv1.3.security_protocol=SSL.compression_type=snappy
   status: PASS
   run time:   59.287 seconds
   {"consumer": {"records_per_sec": 1421868.335, "mb_per_sec": 135.5999}, 
"producer": {"records_per_sec": 1332267.519318, "mb_per_sec": 127.05}}
   

   test_id:
kafkatest.benchmarks.core.benchmark_test.Benchmark.test_producer_and_consumer.security_protocol=PLAINTEXT.compression_type=none
   status: PASS
   run time:   1 minute 1.819 seconds
   {"consumer": {"records_per_sec": 987361.7694, "mb_per_sec": 94.1622}, 
"producer": {"records_per_sec": 942862.530643, "mb_per_sec": 89.92}}
   

   test_id:
kafkatest.benchmarks.core.benchmark_test.Benchmark.test_producer_and_consumer.security_protocol=PLAINTEXT.compression_type=snappy
   status: PASS
   run time:   55.877 seconds
   {"consumer": {"records_per_sec": 1479508.8031, "mb_per_sec": 141.097}, 
"producer": {"records_per_sec": 1339046.598822, "mb_per_sec": 127.7}}
   

   ```
   
   
   **AFTER**
   ```
   test_id:
kafkatest.benchmarks.core.benchmark_test.Benchmark.test_producer_and_consumer.interbroker_security_protocol=PLAINTEXT.tls_version=TLSv1.2.security_protocol=SSL.compression_type=none
   status: PASS
   run time:   1 minute 9.619 seconds
   {"consumer": {"records_per_sec": 615422.4875, "mb_per_sec": 58.6913}, 
"producer": {"records_per_sec": 640779.187492, "mb_per_sec": 61.11}}
   

   test_id:
kafkatest.benchmarks.core.benchmark_test.Benchmark.test_producer_and_consumer.interbroker_security_protocol=PLAINTEXT.tls_version=TLSv1.2.security_protocol=SSL.compression_type=snappy
   status: PASS
   run time:   57.667 seconds
   {"consumer": {"records_per_sec": 1444043.3213, "mb_per_sec": 137.7147}, 
"producer": {"records_per_sec": 1398014.818957, "mb_per_sec": 133.33}}
   

   test_id:
kafkatest.benchmarks.core.benchmark_test.Benchmark.test_producer_and_consumer.interbroker_security_protocol=PLAINTEXT.tls_version=TLSv1.3.security_protocol=SSL.compression_type=none
   status: PASS
   run time:   1 minute 6.802 seconds
   {"consumer": {"records_per_sec": 619770.6848, "mb_per_sec": 59.1059}, 
"producer": {"records_per_sec": 646203.55412, "mb_per_sec": 61.63}}
   

   test_id:
kafkatest.benchmarks.core.benchmark_test.Benchmark.test_producer_and_consumer.interbroker_security_protocol=PLAINTEXT.tls_version=TLSv1.3.security_protocol=SSL.compression_type=snappy
   status: PASS
   run time:   1 minute 0.790 seconds
   {"consumer": {"records_per_sec": 1

[GitHub] [kafka] chia7712 commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…

2020-08-30 Thread GitBox


chia7712 commented on pull request #8657:
URL: https://github.com/apache/kafka/pull/8657#issuecomment-683400677


   > Could you do some perf tests so that we know the performance of fetch 
requests doesn't change noticeably?
   
   @junrao Are there any suggested official benchmark tools?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [kafka] chia7712 commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…

2020-08-29 Thread GitBox


chia7712 commented on pull request #8657:
URL: https://github.com/apache/kafka/pull/8657#issuecomment-683284825


   > If it's meant to be called by every request, then maybe we should have the 
delayed actions in a separate class instead of ReplicaManager. Other classes 
could, in theory, add their own delayed actions to this queue too.
   
   I preferred this idea as it prevent us from missing any action.
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [kafka] chia7712 commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…

2020-08-28 Thread GitBox


chia7712 commented on pull request #8657:
URL: https://github.com/apache/kafka/pull/8657#issuecomment-682879578


   @junrao Thanks for all suggestions!
   
   Is Jenkins on vacation? Could you trigger a system test?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [kafka] chia7712 commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…

2020-08-27 Thread GitBox


chia7712 commented on pull request #8657:
URL: https://github.com/apache/kafka/pull/8657#issuecomment-681708424


   @junrao I have submitted  a draft patch. As this new approach is totally 
different from previous code, I delete previous commits in order to clean git 
history.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [kafka] chia7712 commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…

2020-08-26 Thread GitBox


chia7712 commented on pull request #8657:
URL: https://github.com/apache/kafka/pull/8657#issuecomment-681375395


   > This issue will be resolved if we complete those delayed operations due to 
groupManager.storeGroup() elsewhere without holding any locks.
   
   So ```DelayedJoin``` DOES NOT complete any delayed requests in this path 
anymore and we expect that someone who don't hold lock should complete them?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [kafka] chia7712 commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…

2020-08-26 Thread GitBox


chia7712 commented on pull request #8657:
URL: https://github.com/apache/kafka/pull/8657#issuecomment-681318790


   > We keep DelayedJoin as it is, still passing in the group lock to 
DelayedOperation to avoid deadlocks due to two levels of locking.
   
   Just double check. We use a separate thread to handle 
```groupManager.storeGroup``` and ```joinPurgatory.tryCompleteElseWatch``` in 
```GroupCoordinator.onCompleteJoin```, right? 
```GroupCoordinator.onCompleteJoin``` is called by ```DelayedJoin.onComplete``` 
only and it is possible to hold a group lock already.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [kafka] chia7712 commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…

2020-08-25 Thread GitBox


chia7712 commented on pull request #8657:
URL: https://github.com/apache/kafka/pull/8657#issuecomment-679855792


   > The second issue is that we hold a group lock while calling 
joinPurgatory.tryCompleteElseWatch. In this call, it's possible that 
DelayedJoin.onComplete() will be called. In that case, since the caller holds 
the group lock, we won't be completing partitionsToComplete in 
completeDelayedRequests().
   
   @junrao I go through group lock again and it is almost used anywhere :( 
   
   I'm worry about deadlock caused by current approach so I'd like to address 
@hachikuji refactor and your approach (separate thread). The following changes 
are included.
   
   1. ```Partition``` does not complete delayed operations. Instead, it adds 
those delayed operations to a queue. The callers ought to call the new method 
```Partition.completeDelayedActions``` to complete those delayed operations in 
proper place (to avoid conflicting locking).
   1. apple above new method to all callers who need to complete delay 
operations.
   1. ```GroupCoordinator.onCompleteJoin``` is called by ```DelayedJoin``` only 
but it always held a group lock. To resolve it, we complete delayed requests in 
a separate thread. 
   
   @junrao WDYT?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [kafka] chia7712 commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…

2020-08-10 Thread GitBox


chia7712 commented on pull request #8657:
URL: https://github.com/apache/kafka/pull/8657#issuecomment-671749755


   > Both operations are done under the same group lock. If we relax the lock, 
it seems that the condition "all members have joined" may no longer be true 
when we get to DelayedJoin.onComplete() even though that condition was true 
during the DelayedJoin.tryCompleteJoin() check. It's not clear what we should 
do in that case.
   
   Your feedback always makes sense. 👍 
   
   It seems to me the approach has to address following issue.
   
   1. avoid conflicting (multiples) locks
   1. small change (don't introduce complicated mechanism)
   1. keep behavior compatibility (```hasAllMembersJoined``` and 
```onCompleteJoin``` should be included in same lock)
   
   I'd like to add an new method (default implementation is empty body) to 
```DelayedOperation```. The new method is almost  same to ```onComplete``` 
except for that it is executed without locking. Currently, only 
```DelayedJoin``` has to use it.
   
   ```scala
   
 def cleanup(): Unit = {}
   
 private[server] def maybeTryComplete(): Boolean = {
   lock.lock()
   if (try tryComplete() finally lock.unlock()) {
 cleanup()
 true
   } else false  
 }
   
 override def run(): Unit = {
   if (forceComplete()) {
 cleanup()
 onExpiration()
   }
 }
   ```
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [kafka] chia7712 commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…

2020-08-06 Thread GitBox


chia7712 commented on pull request #8657:
URL: https://github.com/apache/kafka/pull/8657#issuecomment-670293057


   > Could you provide a bit more detail?
   
   The new deadlock you mentioned is caused by PR. This PR introduces extra 
lock (```ReentrantLock```) to ```DelayedJoin``` (By contrast, previous 
```DelayedJoin``` has a group lock only). The thread has to get two locks 
(group lock and ReentrantLock) to complete ```DelayedJoin``` and so deadlock 
happens when two locks are held by different thread. Hence, my approach is to 
remove the extra lock introduced by this PR and so deadlock will be gone.
   
   The changes to ```DelayedJoin``` in this PR is to resolve deadlock happening 
on ```DelayedJoin#onComplete```. ```onComplete``` has to get other group lock 
to resolve delayed request but it is executed within a group lock. In order to 
resolve deadlock, this PR tries to move ```onComplete``` out of group lock. 
However, after thinking about this a bit more, why ```onComplete``` is executed 
within a lock? It is thread-safe already and it is not always executed within a 
lock (if it is executed on timeout). Hence, the implementation of my approach 
is to refactor ```DelayedOperation``` to move ```onComplete```  out of 
unnecessary lock. The refactor includes following changes.
   
   1. ```forceComplete``` should be executed by ```DelayedOperation``` other 
than subclasss
   1. ```forceComplete``` is executed after releasing lock
   ```scala
   private[server] def maybeTryComplete(): Boolean = {
 lock.lock
 val timeToComplete = try tryComplete() finally lock.unlock
 timeToComplete && forceComplete()
   }
   ```
   3. ```forceComplete``` is renamed to ```canComplete```
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [kafka] chia7712 commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…

2020-08-05 Thread GitBox


chia7712 commented on pull request #8657:
URL: https://github.com/apache/kafka/pull/8657#issuecomment-669647781


   @junrao I also don't want to include more mechanisms to complicate this 
story. It seems to me we can do a litter factor for ```DelayedOperation``` to 
resolve this issue. ```forceComplete``` is thread-safe already so it does not 
need lock. It means ```forceComplete``` can be decoupled from 
```tryComplete```.  With this change, ```forceComplete ``` is involved by 
```maybeTryComplete``` and it is called after releasing lock. for example:
   ```scala
   private[server] def maybeTryComplete(): Boolean = {
 boolean timeToComplete = try lock.lock tryComplete() finally lock.unlock
 timeToComplete && forceComplete()
   }
   ```
   
   Also, ```tryComplete``` can be renamed to ```canComplete```. the method name 
is consistent to its actual behavior.
   
   The benefit from this change are shown below. 
   1. simplify behavior of ```tryComplete```. the subclasses don't need to call 
```forceComplete``` always
   1. lower possibility of conflicting lock



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [kafka] chia7712 commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…

2020-07-31 Thread GitBox


chia7712 commented on pull request #8657:
URL: https://github.com/apache/kafka/pull/8657#issuecomment-667442912


   > So, it's just a rebase and there is no change in your PR?
   
   The last change to this PR is to rename a class (according to @ijuma’s 
comment)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [kafka] chia7712 commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…

2020-07-30 Thread GitBox


chia7712 commented on pull request #8657:
URL: https://github.com/apache/kafka/pull/8657#issuecomment-666882466


   > Were there significant changes since the last review?
   
   I rebased code base to include 
https://github.com/apache/kafka/commit/96e0719e421a91b5fa289dc0d23276977655633e 
which fixes flaky ```streams_standby_replica_test```.
   
   retest this please



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [kafka] chia7712 commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…

2020-07-30 Thread GitBox


chia7712 commented on pull request #8657:
URL: https://github.com/apache/kafka/pull/8657#issuecomment-666398632


   > 2 system test failures in the latest PR
   
   those 2 failed tests are flaky on my local and there are issue/PR related to 
them.
   
   @junrao @ijuma @hachikuji @rajinisivaram please take a look.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [kafka] chia7712 commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…

2020-07-29 Thread GitBox


chia7712 commented on pull request #8657:
URL: https://github.com/apache/kafka/pull/8657#issuecomment-665149860


   ```streams_standby_replica_test```  -> 
https://issues.apache.org/jira/browse/KAFKA-10287
   
   I will take a look at ```streams_broker_bounce_test```



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [kafka] chia7712 commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…

2020-07-24 Thread GitBox


chia7712 commented on pull request #8657:
URL: https://github.com/apache/kafka/pull/8657#issuecomment-663818439


   @junrao 
   
   I have rebased this PR to include fix of ```group_mode_transactions_test```. 
Could you run system tests again? Except for ```streams_eos_test``` and 
transaction tests, other tests work well on my local.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [kafka] chia7712 commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…

2020-07-24 Thread GitBox


chia7712 commented on pull request #8657:
URL: https://github.com/apache/kafka/pull/8657#issuecomment-663618045


   @junrao thanks for the reports.
   
   Unfortunately, the failed tests are totally different in both results :(
   
   This PR has been rebased so the fix for ```group_mode_transactions_test``` 
is included.
   
   ```streams_standby_replica_test``` will get fixed by #9066.
   
   I will test the others on my local later.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [kafka] chia7712 commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…

2020-07-22 Thread GitBox


chia7712 commented on pull request #8657:
URL: https://github.com/apache/kafka/pull/8657#issuecomment-662810186


   > Could you rebase again? I will run the system tests after that. Thanks.
   
   done. the known flaky ```group_mode_transactions_test.py``` is traced by 
#9095



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [kafka] chia7712 commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…

2020-07-21 Thread GitBox


chia7712 commented on pull request #8657:
URL: https://github.com/apache/kafka/pull/8657#issuecomment-661862318


   @junrao Could you run system tests on both trunk and this PR?
   
   the tests which fail frequently on my local are shown below.
   
   1. transactions_test.py #9026
   1. group_mode_transactions_test.py #9026
   1. security_rolling_upgrade_test.py #9021
   1. streams_standby_replica_test.py (I'm digging it)
   
   Also, I have filed tickets for other failed system tests.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [kafka] chia7712 commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…

2020-07-14 Thread GitBox


chia7712 commented on pull request #8657:
URL: https://github.com/apache/kafka/pull/8657#issuecomment-658294532


   > The number of failures went up to 17.
   
   ok. Let me dig in them again :)
   
   there are 3 failed tests related to ```security_rolling_upgrade_test``` and 
I have filed #9021 to fix them.

   
   
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [kafka] chia7712 commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…

2020-07-14 Thread GitBox


chia7712 commented on pull request #8657:
URL: https://github.com/apache/kafka/pull/8657#issuecomment-658027392


   ```
   
org.apache.kafka.streams.integration.EosIntegrationTest.shouldNotViolateEosIfOneTaskFailsWithState[exactly_once]
   ```
   
   It is traced by https://issues.apache.org/jira/browse/KAFKA-9831



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [kafka] chia7712 commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…

2020-07-13 Thread GitBox


chia7712 commented on pull request #8657:
URL: https://github.com/apache/kafka/pull/8657#issuecomment-657739902


   @junrao could you trigger system tests again?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [kafka] chia7712 commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…

2020-07-09 Thread GitBox


chia7712 commented on pull request #8657:
URL: https://github.com/apache/kafka/pull/8657#issuecomment-656382338


   > Do you want to rebase again so that I can run system tests one more time?
   
   sure!



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [kafka] chia7712 commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…

2020-07-05 Thread GitBox


chia7712 commented on pull request #8657:
URL: https://github.com/apache/kafka/pull/8657#issuecomment-653956055


   > Are the unit test failures also due to flaky tests?
   
   I think so. However, I’m trying to resolve all of them so as to make this PR 
more safe :)
   
   #8981 #8974 #8913 are pending for reviewing. 
   
   The remaining failed tests on this PR are downgrade_test and upgrade_test. I 
will check them carefully tomorrow.
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [kafka] chia7712 commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…

2020-07-02 Thread GitBox


chia7712 commented on pull request #8657:
URL: https://github.com/apache/kafka/pull/8657#issuecomment-653232232


   > Overall, do you think there is no new system test failure introduced by 
this PR?
   
   I did not observe obvious relation between flaky and this PR. I have filed 4 
PRs to fix flaky system tests. Maybe we can run system tests again after those 
PRs are merged. Hope there is good luck to me :)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [kafka] chia7712 commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…

2020-07-02 Thread GitBox


chia7712 commented on pull request #8657:
URL: https://github.com/apache/kafka/pull/8657#issuecomment-653224351


   > core/transactions_test
   core/downgrade_test
   core/upgrade_test
   
   those are flaky on my local as well. @junrao Should I fix all flaky before 
merging this PR? 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [kafka] chia7712 commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…

2020-06-28 Thread GitBox


chia7712 commented on pull request #8657:
URL: https://github.com/apache/kafka/pull/8657#issuecomment-650923843


   connect_rest_test.py is traced by #8944



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [kafka] chia7712 commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…

2020-06-27 Thread GitBox


chia7712 commented on pull request #8657:
URL: https://github.com/apache/kafka/pull/8657#issuecomment-650586275


   > Perhaps you can rebase again.
   
   done



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [kafka] chia7712 commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…

2020-06-26 Thread GitBox


chia7712 commented on pull request #8657:
URL: https://github.com/apache/kafka/pull/8657#issuecomment-650503360


   > For comparison, 29 test failures in trunk 
http://testing.confluent.io/confluent-kafka-system-test-results/?prefix=2020-06-25--001.1593132363--apache--trunk--3348fc49d/
   
   It seems to me there are many flaky in system tests and I need more time to 
dig in them (TLSv1.3 is traced by 
https://issues.apache.org/jira/browse/KAFKA-10180)
   
   **Following failed tests happens with this PR**
   ```
   Module: kafkatest.tests.streams.streams_upgrade_test
   Class:  StreamsUpgradeTest
   Method: test_simple_upgrade_downgrade
   Arguments:
   {
 "from_version": "0.10.1.1",
 "to_version": "0.10.2.2"
   }
   ```
   
   ```
   Module: kafkatest.tests.streams.streams_cooperative_rebalance_upgrade_test
   Class:  StreamsCooperativeRebalanceUpgradeTest
   Method: test_upgrade_to_cooperative_rebalance
   Arguments:
   {
 "upgrade_from_version": "1.0.2"
   }
   ```
   
   ```
   Module: kafkatest.tests.core.round_trip_fault_test
   Class:  RoundTripFaultTest
   Method: test_round_trip_workload_with_broker_partition
   ```
   ```
   Module: kafkatest.tests.connect.connect_distributed_test
   Class:  ConnectDistributedTest
   Method: test_bounce
   Arguments:
   {
 "clean": true,
 "connect_protocol": "compatible"
   }
   ```
   
   ```
   Module: kafkatest.tests.core.group_mode_transactions_test
   Class:  GroupModeTransactionsTest
   Method: test_transactions
   Arguments:
   {
 "bounce_target": "clients",
 "failure_mode": "clean_bounce"
   }
   ```
   
   ```
   Module: kafkatest.tests.core.replica_scale_test
   Class:  ReplicaScaleTest
   Method: test_produce_consume
   Arguments:
   {
 "partition_count": 34,
 "replication_factor": 3,
 "topic_count": 500
   }
   ```
   ```
   Module: kafkatest.tests.core.zookeeper_security_upgrade_test
   Class:  ZooKeeperSecurityUpgradeTest
   Method: test_zk_security_upgrade
   Arguments:
   {
 "security_protocol": "SASL_PLAINTEXT"
   }
   ```
   ```
   Module: kafkatest.tests.streams.streams_eos_test
   Class:  StreamsEosTest
   Method: test_failure_and_recovery
   Arguments:
   {
 "processing_guarantee": "exactly_once"
   }
   ```
   
   ---
   **Following failed tests happens without this PR**
   ```
   Module: kafkatest.tests.streams.streams_upgrade_test
   Class:  StreamsUpgradeTest
   Method: test_metadata_upgrade
   Arguments:
   {
 "from_version": "0.10.0.1",
 "to_version": "0.11.0.3"
   }
   ```
   ```
   Module: kafkatest.tests.streams.streams_upgrade_test
   Class:  StreamsUpgradeTest
   Method: test_metadata_upgrade
   Arguments:
   {
 "from_version": "0.11.0.3",
 "to_version": "2.4.1"
   }
   ```
   ```
   Module: kafkatest.tests.core.transactions_test
   Class:  TransactionsTest
   Method: test_transactions
   Arguments:
   {
 "bounce_target": "brokers",
 "check_order": true,
 "failure_mode": "clean_bounce",
 "use_group_metadata": false
   }
   ```
   ```
   Module: kafkatest.tests.core.transactions_test
   Class:  TransactionsTest
   Method: test_transactions
   Arguments:
   {
 "bounce_target": "clients",
 "check_order": false,
 "failure_mode": "hard_bounce",
 "use_group_metadata": true
   }
   ```
   ```
   Module: kafkatest.tests.core.security_rolling_upgrade_test
   Class:  TestSecurityRollingUpgrade
   Method: test_rolling_upgrade_phase_two
   Arguments:
   {
 "broker_protocol": "SASL_SSL",
 "client_protocol": "SASL_SSL"
   }
   ```
   ```
   Module: kafkatest.tests.client.consumer_test
   Class:  OffsetValidationTest
   Method: test_consumer_failure
   Arguments:
   {
 "clean_shutdown": true,
 "enable_autocommit": true
   }
   ```
   ```
   Module: kafkatest.tests.core.upgrade_test
   Class:  TestUpgrade
   Method: test_upgrade
   Arguments:
   {
 "compression_types": [
   "snappy"
 ],
 "from_kafka_version": "0.10.0.1",
 "to_message_format_version": null
   }
   ```



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [kafka] chia7712 commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…

2020-06-25 Thread GitBox


chia7712 commented on pull request #8657:
URL: https://github.com/apache/kafka/pull/8657#issuecomment-649635872


   > I will trigger a system tests for trunk.
   
   @junrao big thanks !



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [kafka] chia7712 commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…

2020-06-23 Thread GitBox


chia7712 commented on pull request #8657:
URL: https://github.com/apache/kafka/pull/8657#issuecomment-648254987


   > Could you rebase your PR?
   
   done



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [kafka] chia7712 commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…

2020-06-22 Thread GitBox


chia7712 commented on pull request #8657:
URL: https://github.com/apache/kafka/pull/8657#issuecomment-647352301


   I have filed https://issues.apache.org/jira/browse/KAFKA-10191 to trace 
```StreamsOptimizedTest```



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [kafka] chia7712 commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…

2020-06-21 Thread GitBox


chia7712 commented on pull request #8657:
URL: https://github.com/apache/kafka/pull/8657#issuecomment-647233460


   > This is a second run of the system tests. 
http://testing.confluent.io/confluent-kafka-branch-builder-system-test-results/?prefix=2020-06-19--001.1592614513--chia7712--fix_8334_avoid_deadlock--142a6c4c0/
   
   ```ConsumeBenchTest``` pass on second run.
   
   
   ```StreamsOptimizedTest``` still fail and the error message is 
   ```
   Exception in thread 
"StreamsOptimizedTest-239c8c2e-5338-4a4c-a009-729ca21e1673-StreamThread-1" 
java.lang.IllegalStateException: Tried to lookup lag for unknown task 2_0
at 
org.apache.kafka.streams.processor.internals.assignment.ClientState.lagFor(ClientState.java:306)
at 
java.util.Comparator.lambda$comparingLong$6043328a$1(Comparator.java:511)
at 
java.util.Comparator.lambda$thenComparing$36697e65$1(Comparator.java:216)
at java.util.TreeMap.put(TreeMap.java:552)
at java.util.TreeSet.add(TreeSet.java:255)
at java.util.AbstractCollection.addAll(AbstractCollection.java:344)
at java.util.TreeSet.addAll(TreeSet.java:312)
at 
org.apache.kafka.streams.processor.internals.StreamsPartitionAssignor.getPreviousTasksByLag(StreamsPartitionAssignor.java:1250)
at 
org.apache.kafka.streams.processor.internals.StreamsPartitionAssignor.assignTasksToThreads(StreamsPartitionAssignor.java:1164)
at 
org.apache.kafka.streams.processor.internals.StreamsPartitionAssignor.computeNewAssignment(StreamsPartitionAssignor.java:920)
at 
org.apache.kafka.streams.processor.internals.StreamsPartitionAssignor.assign(StreamsPartitionAssignor.java:391)
at 
org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.performAssignment(ConsumerCoordinator.java:583)
   ```
   
   need to dig in it :(
   
   @junrao Could you trigger a system test for master branch ?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [kafka] chia7712 commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…

2020-06-19 Thread GitBox


chia7712 commented on pull request #8657:
URL: https://github.com/apache/kafka/pull/8657#issuecomment-646718970


   > Could you take a look and see if it's related to this PR? I will trigger 
another run just to see if any of those failures were transient.
   
   I compare the failed tests to 
http://confluent-kafka-system-test-results.s3-us-west-2.amazonaws.com/2020-06-16--001.1592310680--confluentinc--master--d07ee594d/report.html
 (the link is attached to https://issues.apache.org/jira/browse/KAFKA-10180). 
There are two failed tests happens only on this PR.
   
   1. ConsumeBenchTest
   2. StreamsOptimizedTest
   
   wait for the new run.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [kafka] chia7712 commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…

2020-06-19 Thread GitBox


chia7712 commented on pull request #8657:
URL: https://github.com/apache/kafka/pull/8657#issuecomment-646487562


   @junrao thanks for all your reviews. 
   
   > I triggered a system test on this PR. 
https://jenkins.confluent.io/job/system-test-kafka-branch-builder/3970/
   
   How to see the result from this url? 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [kafka] chia7712 commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…

2020-06-15 Thread GitBox


chia7712 commented on pull request #8657:
URL: https://github.com/apache/kafka/pull/8657#issuecomment-644521032


   > It will be helpful if you could preserve the commit history in future 
updates to the PR since that makes it easier to identify the delta changes.
   
   my bad :(
   
   I'll keep that in mind



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [kafka] chia7712 commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…

2020-06-14 Thread GitBox


chia7712 commented on pull request #8657:
URL: https://github.com/apache/kafka/pull/8657#issuecomment-643890644


   > Another way that doesn't require checking lock.isHeldByCurrentThread is 
the following. But your approach seems simpler.
   
   So... could we keep it simpler?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [kafka] chia7712 commented on pull request #8657: KAFKA-8334 Make sure the thread which tries to complete delayed reque…

2020-06-02 Thread GitBox


chia7712 commented on pull request #8657:
URL: https://github.com/apache/kafka/pull/8657#issuecomment-637668386


   @junrao Could you take a look? 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org