[jira] [Commented] (CASSANDRA-12784) ReplicationAwareTokenAllocatorTest times out almost every time for 3.X and trunk
[ https://issues.apache.org/jira/browse/CASSANDRA-12784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581844#comment-15581844 ] Branimir Lambov commented on CASSANDRA-12784: - +1 > ReplicationAwareTokenAllocatorTest times out almost every time for 3.X and > trunk > > > Key: CASSANDRA-12784 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12784 > Project: Cassandra > Issue Type: Bug >Reporter: Stefania >Assignee: Stefania > Fix For: 3.x > > Attachments: ReplicationAwareTokenAllocatorTest.jfr.gz > > > Example failure: > http://cassci.datastax.com/view/cassandra-3.X/job/cassandra-3.X_testall/lastCompletedBuild/testReport/org.apache.cassandra.dht.tokenallocator/ReplicationAwareTokenAllocatorTest/testNewClusterWithMurmur3Partitioner/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-12784) ReplicationAwareTokenAllocatorTest times out almost every time for 3.X and trunk
[ https://issues.apache.org/jira/browse/CASSANDRA-12784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581546#comment-15581546 ]

Stefania commented on CASSANDRA-12784:

{{testExistingCluster}} is not deterministic for the random partitioner because:

{code}
public BigIntegerToken getRandomToken(Random random)
{
    BigInteger token = FBUtilities.hashToBigInteger(GuidGenerator.guidAsBytes(random));
    if ( token.signum() == -1 )
        token = token.multiply(BigInteger.valueOf(-1L));
    return new BigIntegerToken(token);
}
{code}

and

{code}
public static ByteBuffer guidAsBytes(Random random)
{
    StringBuilder sbValueBeforeMD5 = new StringBuilder();
    long time = System.currentTimeMillis();
    long rand = 0;
    rand = random.nextLong();
    sbValueBeforeMD5.append(s_id)
                    .append(":")
                    .append(Long.toString(time))
                    .append(":")
                    .append(Long.toString(rand));

    String valueBeforeMD5 = sbValueBeforeMD5.toString();
    return ByteBuffer.wrap(FBUtilities.threadLocalMD5Digest().digest(valueBeforeMD5.getBytes()));
}
{code}

There is a dependency on the current time: {{guidAsBytes}} mixes {{System.currentTimeMillis()}} into the hashed value, so seeding the {{Random}} is not enough to make the token reproducible. I've removed it in the latest [commit|https://github.com/stef1927/cassandra/commit/b6e39d1105df6f1845c2a3928b454339b0e21201] and launched all tests:

||3.X||trunk||
|[patch|https://github.com/stef1927/cassandra/tree/12784-3.X]|[patch|https://github.com/stef1927/cassandra/tree/12784]|
|[testall|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-12784-3.X-testall/]|[testall|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-12784-testall/]|
|[dtest|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-12784-3.X-dtest/]|[dtest|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-12784-dtest/]|

The only other caller of {{RandomPartitioner.getRandomToken(Random)}} is compaction-stress (CASSANDRA-11844), and the documentation of {{CompactionStress.generateTokens()}} also indicates that this method should generate tokens in a deterministic way.
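To illustrate the time dependency discussed in the comment above, here is a minimal, self-contained sketch. It is not the code from the linked commit: {{DeterministicTokenSketch}}, {{randomToken}}, {{timeDependentToken}} and {{hashToNonNegative}} are hypothetical names, and the hashing step (MD5 of a string, interpreted as a signed big integer and made non-negative) is only assumed to resemble what {{FBUtilities}} does. It shows that hashing only the seeded random value is reproducible, while mixing in a timestamp is not.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Random;

// Hypothetical stand-in for the token generation discussed above; not Cassandra's code.
final class DeterministicTokenSketch
{
    // Hashes only the seeded random value, so the same seed always yields the same token.
    static BigInteger randomToken(Random random)
    {
        String valueBeforeMD5 = Long.toString(random.nextLong());
        return hashToNonNegative(valueBeforeMD5);
    }

    // For contrast: the timestamp becomes part of the hashed value,
    // so two runs with the same seed but different clocks disagree.
    static BigInteger timeDependentToken(Random random, long timeMillis)
    {
        String valueBeforeMD5 = timeMillis + ":" + random.nextLong();
        return hashToNonNegative(valueBeforeMD5);
    }

    // Assumed hashing scheme: MD5 digest read as a signed integer, then made non-negative.
    private static BigInteger hashToNonNegative(String value)
    {
        try
        {
            byte[] digest = MessageDigest.getInstance("MD5")
                                         .digest(value.getBytes(StandardCharsets.UTF_8));
            return new BigInteger(digest).abs();
        }
        catch (NoSuchAlgorithmException e)
        {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args)
    {
        BigInteger first = randomToken(new Random(42L));
        BigInteger second = randomToken(new Random(42L));
        System.out.println(first.equals(second)); // prints "true"
    }
}
```

With this shape, a test that seeds its {{Random}} once should see the same token sequence on every run, which is the property {{testExistingCluster}} relies on.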
[jira] [Commented] (CASSANDRA-12784) ReplicationAwareTokenAllocatorTest times out almost every time for 3.X and trunk
[ https://issues.apache.org/jira/browse/CASSANDRA-12784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15574732#comment-15574732 ]

Stefania commented on CASSANDRA-12784:

bq. testExistingCluster is supposed to be deterministic as it uses a seeded random and should either always fail or always succeed. That random must be somehow changing in the multiplexed run.

Thank you, I'll have a look on Monday at why the random generator is changing.
[jira] [Commented] (CASSANDRA-12784) ReplicationAwareTokenAllocatorTest times out almost every time for 3.X and trunk
[ https://issues.apache.org/jira/browse/CASSANDRA-12784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15574676#comment-15574676 ]

Branimir Lambov commented on CASSANDRA-12784:

{{testExistingCluster}} is supposed to be deterministic as it uses a seeded random and should either always fail or always succeed. That random must be somehow changing in the multiplexed run.

bq. The stack printout still there

That's great, I did not know this trick.
[jira] [Commented] (CASSANDRA-12784) ReplicationAwareTokenAllocatorTest times out almost every time for 3.X and trunk
[ https://issues.apache.org/jira/browse/CASSANDRA-12784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15574654#comment-15574654 ]

Stefania commented on CASSANDRA-12784:

Thanks for the review.

The multiplexed run returned a number of [failures|https://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-testall-multiplex/26/] in {{RandomReplicationAwareTokenAllocatorTest.testExistingCluster}}. Is this test also expected to be flaky, or is this a problem?

Splitting the test caused a failure in testall; I've fixed it and relaunched.

bq. The stack printout for single flakes could be useful to track the history of a failure; I would prefer not to lose it, but I wouldn't stop the commit if that is something you think is worth sacrificing.

The stack printout is still there; the logger is smart enough to work it out. I've tested it locally, see the sample output below.

bq. I would rename flakyTestNewCluster in the base class to just testNewCluster since the individual runner is the one that declares it flaky and handles that.

Done.

bq. Is there a reason to post the commits view instead of the branch one? There's no direct way to get from the commits view to compare which is the one most useful for reviews.

Not really:

||3.X||trunk||
|[patch|https://github.com/stef1927/cassandra/tree/12784-3.X]|[patch|https://github.com/stef1927/cassandra/tree/12784]|
|[testall|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-12784-3.X-testall/]|[testall|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-12784-testall/]|

--

Sample stack printout obtained by inserting a fake {{Assert.fail("Test message")}}:

{code}
[junit] INFO [main] 2016-10-14 16:10:44,810 ?:? - Test failed. It tends to fail sometimes due to the random selection of the tokens in the first few nodes.
[junit] java.lang.AssertionError: Test message
[junit] at org.junit.Assert.fail(Assert.java:88) ~[junit-4.12.jar:4.12]
[junit] at org.apache.cassandra.dht.tokenallocator.AbstractReplicationAwareTokenAllocatorTest.testNewCluster(AbstractReplicationAwareTokenAllocatorTest.java:592) ~[classes/:na]
[junit] at org.apache.cassandra.dht.tokenallocator.AbstractReplicationAwareTokenAllocatorTest.testNewCluster(AbstractReplicationAwareTokenAllocatorTest.java:563) ~[classes/:na]
[junit] at org.apache.cassandra.dht.tokenallocator.RandomReplicationAwareTokenAllocatorTest.flakyTestNewCluster(RandomReplicationAwareTokenAllocatorTest.java:51) [classes/:na]
[junit] at org.apache.cassandra.Util.runCatchingAssertionError(Util.java:581) ~[classes/:na]
[junit] at org.apache.cassandra.Util.flakyTest(Util.java:611) ~[classes/:na]
[junit] at org.apache.cassandra.dht.tokenallocator.RandomReplicationAwareTokenAllocatorTest.testNewClusterr(RandomReplicationAwareTokenAllocatorTest.java:44) [classes/:na]
[junit] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.8.0_101]
[junit] at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[na:1.8.0_101]
[junit] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_101]
[junit] at java.lang.reflect.Method.invoke(Method.java:498) ~[na:1.8.0_101]
[junit] at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) [junit-4.12.jar:4.12]
[junit] at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) [junit-4.12.jar:4.12]
[junit] at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) [junit-4.12.jar:4.12]
[junit] at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) [junit-4.12.jar:4.12]
[junit] at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) [junit-4.12.jar:4.12]
[junit] at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) [junit-4.12.jar:4.12]
[junit] at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) [junit-4.12.jar:4.12]
[junit] at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) [junit-4.12.jar:4.12]
[junit] at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) [junit-4.12.jar:4.12]
[junit] at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) [junit-4.12.jar:4.12]
[junit] at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) [junit-4.12.jar:4.12]
[junit] at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) [junit-4.12.jar:4.12]
[junit] at org.junit.runners.ParentRunner.run(ParentRunner.java:363) [junit-4.12.jar:4.12]
[junit] at junit.framework.JUnit4TestAdapter.run(JUnit4TestAdapter.java:38) [junit-4.12.jar:na]
[junit] at
[jira] [Commented] (CASSANDRA-12784) ReplicationAwareTokenAllocatorTest times out almost every time for 3.X and trunk
[ https://issues.apache.org/jira/browse/CASSANDRA-12784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15574514#comment-15574514 ]

Branimir Lambov commented on CASSANDRA-12784:

+1.

Nits:
- The stack printout for single flakes could be useful to track the history of a failure; I would prefer not to lose it, but I wouldn't stop the commit if that is something you think is worth sacrificing.
- I would rename [{{flakyTestNewCluster}} in the base class|https://github.com/apache/cassandra/compare/trunk...stef1927:12784-3.X#diff-f32c9e3d5921a2a50fe56db4612f14b4R548] to just {{testNewCluster}} since the individual runner is the one that declares it flaky and handles that.

Is there a reason to post the commits view instead of [the branch one|https://github.com/stef1927/cassandra/tree/12784-3.X]? There's no direct way to get from the commits view to [compare|https://github.com/apache/cassandra/compare/trunk...stef1927:12784-3.X], which is the one most useful for reviews.
[jira] [Commented] (CASSANDRA-12784) ReplicationAwareTokenAllocatorTest times out almost every time for 3.X and trunk
[ https://issues.apache.org/jira/browse/CASSANDRA-12784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15574329#comment-15574329 ]

Stefania commented on CASSANDRA-12784:

These are the new timings on my laptop with 64 vnodes for Murmur3 and 16 vnodes for Random:

{code}
... ...
{code}

Unfortunately Jenkins is much slower. I multiplexed the test 10 times [here|https://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-testall-multiplex/25/testReport/org.apache.cassandra.dht.tokenallocator/] and each iteration took approximately 5.5 minutes. There were no failures, so I could not reproduce the problem mentioned above.

Because {{testNewClusterWithMurmur3Partitioner}} takes 2.5 minutes on Jenkins, a flaky test will time out if it performs more than 2 additional runs, so I changed the iterations to 2 for Murmur3 and 3 for Random. I've also split the class into two sub-classes, so that the total limit of 10 minutes is doubled. Otherwise, if both tests are flaky in the same run, it will certainly time out, and even if only {{testNewClusterWithMurmur3Partitioner}} is flaky, the total time is very close to the 10-minute limit.

This is the full patch:

||3.X||trunk||
|[patch|https://github.com/stef1927/cassandra/commits/12784-3.X]|[patch|https://github.com/stef1927/cassandra/commits/12784]|
|[testall|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-12784-3.X-testall/]|[testall|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-12784-testall/]|

I'm also multiplexing the random partitioner tests 50 times [here|https://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-testall-multiplex/26/], to see if we can reproduce any failures despite the flaky utility.

[~blambov], [~dikanggu]: who wants to be the reviewer?
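The retry budget reasoning in the comment above can be checked with a small calculation. This is only a sketch: {{FlakyRetryBudget}} and {{maxAdditionalRuns}} are hypothetical names, and it assumes every run takes the same time and that the test must finish strictly before the timeout. With the figures quoted above (600 seconds total, 150 seconds per run of {{testNewClusterWithMurmur3Partitioner}} on Jenkins) it yields 2 additional runs.

```java
// Hypothetical helper, not part of Cassandra's test utilities.
final class FlakyRetryBudget
{
    // Returns how many re-runs fit after the first run while the total
    // stays strictly below the timeout. Returns 0 if not even one retry fits
    // (including the case where the first run alone already exceeds the timeout).
    static int maxAdditionalRuns(long timeoutSeconds, long runSeconds)
    {
        if (runSeconds <= 0)
            throw new IllegalArgumentException("runSeconds must be positive");

        int totalRuns = 0;
        // Count how many complete runs finish strictly before the timeout.
        while ((long) (totalRuns + 1) * runSeconds < timeoutSeconds)
            totalRuns++;

        // The first run is not a retry.
        return Math.max(0, totalRuns - 1);
    }

    public static void main(String[] args)
    {
        // 10 minute limit, 2.5 minutes per run, as reported for Jenkins above.
        System.out.println(maxAdditionalRuns(600, 150)); // prints "2"
    }
}
```

The same calculation with the roughly 500 second laptop timing for the full class gives no room for retries at all, which is consistent with splitting the class so that each half gets its own limit.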
[jira] [Commented] (CASSANDRA-12784) ReplicationAwareTokenAllocatorTest times out almost every time for 3.X and trunk
[ https://issues.apache.org/jira/browse/CASSANDRA-12784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15574319#comment-15574319 ]

Stefania commented on CASSANDRA-12784:

Sure, will do!
[jira] [Commented] (CASSANDRA-12784) ReplicationAwareTokenAllocatorTest times out almost every time for 3.X and trunk
[ https://issues.apache.org/jira/browse/CASSANDRA-12784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15572959#comment-15572959 ]

Dikang Gu commented on CASSANDRA-12784:

Thanks [~Stefania] and [~blambov], let me know if there is anything I can help with.
[jira] [Commented] (CASSANDRA-12784) ReplicationAwareTokenAllocatorTest times out almost every time for 3.X and trunk
[ https://issues.apache.org/jira/browse/CASSANDRA-12784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15571794#comment-15571794 ]

Stefania commented on CASSANDRA-12784:

The failure was reproduced locally using the JUnit version specified in build.xml, which is 4.6 (rather old, actually). At least in 4.6, {{junit.framework.AssertionFailedError}} extends {{AssertionError}}.

Let me add some debug information and see if I can reproduce it again tomorrow. I'll try to multiplex it on Jenkins, and then we should know for sure what is going on.

Noted about changing the assert import.

One more thing: if we want to leave {{testNewClusterWithMurmur3Partitioner}} at 64 vnodes, then I suggest lowering the number of iterations in the flaky utility, or else a flaky run will time out.
[jira] [Commented] (CASSANDRA-12784) ReplicationAwareTokenAllocatorTest times out almost every time for 3.X and trunk
[ https://issues.apache.org/jira/browse/CASSANDRA-12784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15571481#comment-15571481 ]

Branimir Lambov commented on CASSANDRA-12784:

bq. AssertionFailedError is a sub-class of AssertionError

Not in all JUnit versions. Your local one is probably different from the one in CI.

Additionally, the test itself imports {{junit.framework.Assert}} (from JUnit 3) while it should import {{org.junit.Assert}} (JUnit 4).
[jira] [Commented] (CASSANDRA-12784) ReplicationAwareTokenAllocatorTest times out almost every time for 3.X and trunk
[ https://issues.apache.org/jira/browse/CASSANDRA-12784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15571415#comment-15571415 ]

Stefania commented on CASSANDRA-12784:

bq. Since the same test is being done for a large vnode count for the Murmur partitioner, I have absolutely nothing against reducing the scope of the random version to no higher than 16 vnodes.

Thanks, I'll change the test accordingly.

bq. The flaky utility failure is caused by not catching the right error type for some junit versions. The fix is to include | AssertionFailedError in the catch in runCatchingAssertionError.

That's what I thought as well initially, but actually {{AssertionFailedError}} is a sub-class of {{AssertionError}}, so that is not the reason. I verified in IntelliJ that if we call {{Assert.fail()}} in the same method, the exception is caught. I must admit I don't understand this yet, unless it really ran 5 times and failed every time but the suppressed exceptions were not displayed. The trouble is that it is hard to reproduce.
[jira] [Commented] (CASSANDRA-12784) ReplicationAwareTokenAllocatorTest times out almost every time for 3.X and trunk
[ https://issues.apache.org/jira/browse/CASSANDRA-12784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15571350#comment-15571350 ]

Branimir Lambov commented on CASSANDRA-12784:

Since the same test is being done for a large vnode count for the Murmur partitioner, I have absolutely nothing against reducing the scope of the random version to no higher than 16 vnodes.

The flaky utility failure is caused by not catching the right error type for _some_ junit versions. The fix is to include {{| AssertionFailedError}} in the [{{catch}}|https://github.com/apache/cassandra/blob/trunk/test/unit/org/apache/cassandra/Util.java#L579] in {{runCatchingAssertionError}}.
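As a rough illustration of the point about error types, the sketch below shows a retry helper whose {{catch}} covers two unrelated error classes. It is not the code in {{Util.java}}: {{FlakyRetrySketch}}, its {{flakyTest}} signature and {{LegacyAssertionFailedError}} are hypothetical, the latter standing in for an assertion failure type that extends {{Error}} directly instead of {{AssertionError}}, as is claimed above for some JUnit versions. Later failures are attached to the first one as suppressed exceptions, which is an assumption about how such a utility might report repeated failures.

```java
// Hypothetical sketch; the real utility lives in org.apache.cassandra.Util and may differ.
final class FlakyRetrySketch
{
    // Stand-in for an assertion failure type that is NOT a subclass of AssertionError.
    static final class LegacyAssertionFailedError extends Error
    {
        LegacyAssertionFailedError(String message)
        {
            super(message);
        }
    }

    // Runs the test up to maxAttempts times and returns on the first success.
    // If every attempt fails, rethrows the first failure with the later ones suppressed.
    static void flakyTest(Runnable test, int maxAttempts)
    {
        if (maxAttempts < 1)
            throw new IllegalArgumentException("maxAttempts must be at least 1");

        Error firstFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++)
        {
            try
            {
                test.run();
                return;
            }
            // Catching only AssertionError would let the legacy type escape on the first attempt.
            catch (AssertionError | LegacyAssertionFailedError e)
            {
                if (firstFailure == null)
                    firstFailure = e;
                else
                    firstFailure.addSuppressed(e);
            }
        }
        throw firstFailure;
    }

    public static void main(String[] args)
    {
        int[] calls = { 0 };
        // Fails twice with the legacy type, then passes on the third attempt.
        flakyTest(() -> {
            if (++calls[0] < 3)
                throw new LegacyAssertionFailedError("flake");
        }, 3);
        System.out.println(calls[0]); // prints "3"
    }
}
```

If the second alternative is dropped from the {{catch}}, the first {{LegacyAssertionFailedError}} propagates immediately and no retry happens, which would look like a flaky test failing despite the retry utility.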
[jira] [Commented] (CASSANDRA-12784) ReplicationAwareTokenAllocatorTest times out almost every time for 3.X and trunk
[ https://issues.apache.org/jira/browse/CASSANDRA-12784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15571222#comment-15571222 ]

Stefania commented on CASSANDRA-12784:

This test has been timing out since it was introduced by CASSANDRA-12647. In fact, the problem is not {{testNewClusterWithMurmur3Partitioner}}, which was previously called {{testNewCluster}} and was timing out very rarely (due to the flaky utility?), but the new random partitioner tests. They take approximately twice as long as the murmur3 tests.

On my laptop, with the current test configuration of 64 VNODES, the full test completes in approximately 500 seconds, see timings below. The total timeout on Jenkins is 600 seconds, therefore the test is almost always timing out given that the Jenkins VMs are slower. This analysis does not take the flaky utility into account; when the latter kicks in, there is no chance that the test completes within the timeout.

I've attached a JFR profile, [^ReplicationAwareTokenAllocatorTest.jfr.gz]; the slowness of the random partitioner tests is due to the big integer math in {{BigIntegerToken.size()}}.

Unless we plan on improving the performance of the algorithm or of the big integer math, may I suggest reducing the scope of the test? I don't think it's reasonable to run a unit test that takes longer than 10 minutes; the full test can be moved to a burn test if required. One way to reduce the scope would be to reduce the number of iterations by reducing VNODES. Do you have any other suggestions, [~blambov] or [~dikanggu]?

h5. Measurements on my laptop:

*64-VNODES:*
{code}
... ...
{code}

*32-VNODES:*
{code}
... ...
{code}

*16-VNODES:*
{code}
... ...
{code}

*8-VNODES:*
{code}
... ...
{code}

*4-VNODES:*
{code}
... ...
{code}

--

I've also noticed two failures with the following exception:

{code}
[junit] - ---
[junit] Testcase: testNewClusterWithRandomPartitioner(org.apache.cassandra.dht.tokenallocator.ReplicationAwareTokenAllocatorTest): FAILED
[junit] Expected max unit size below 1.2000, was 1.2241
[junit] junit.framework.AssertionFailedError: Expected max unit size below 1.2000, was 1.2241
[junit] at org.apache.cassandra.dht.tokenallocator.ReplicationAwareTokenAllocatorTest.grow(ReplicationAwareTokenAllocatorTest.java:698)
[junit] at org.apache.cassandra.dht.tokenallocator.ReplicationAwareTokenAllocatorTest.testNewCluster(ReplicationAwareTokenAllocatorTest.java:629)
[junit] at org.apache.cassandra.dht.tokenallocator.ReplicationAwareTokenAllocatorTest.flakyTestNewCluster(ReplicationAwareTokenAllocatorTest.java:611)
[junit] at org.apache.cassandra.dht.tokenallocator.ReplicationAwareTokenAllocatorTest.flakyTestNewClusterWithRandomPartitioner(ReplicationAwareTokenAllocatorTest.java:583)
[junit] at org.apache.cassandra.Util.runCatchingAssertionError(Util.java:576)
[junit] at org.apache.cassandra.Util.flakyTest(Util.java:601)
[junit] at org.apache.cassandra.dht.tokenallocator.ReplicationAwareTokenAllocatorTest.testNewClusterWithRandomPartitioner(ReplicationAwareTokenAllocatorTest.java:568)
{code}

Is the flaky utility effective for the random partitioner tests?