[jira] [Commented] (CASSANDRA-12784) ReplicationAwareTokenAllocatorTest times out almost every time for 3.X and trunk

2016-10-17 Thread Branimir Lambov (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581844#comment-15581844
 ] 

Branimir Lambov commented on CASSANDRA-12784:
-

+1

> ReplicationAwareTokenAllocatorTest times out almost every time for 3.X and 
> trunk
> 
>
> Key: CASSANDRA-12784
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12784
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Stefania
>Assignee: Stefania
> Fix For: 3.x
>
> Attachments: ReplicationAwareTokenAllocatorTest.jfr.gz
>
>
> Example failure: 
> http://cassci.datastax.com/view/cassandra-3.X/job/cassandra-3.X_testall/lastCompletedBuild/testReport/org.apache.cassandra.dht.tokenallocator/ReplicationAwareTokenAllocatorTest/testNewClusterWithMurmur3Partitioner/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-12784) ReplicationAwareTokenAllocatorTest times out almost every time for 3.X and trunk

2016-10-17 Thread Stefania (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15581546#comment-15581546
 ] 

Stefania commented on CASSANDRA-12784:
--

{{testExistingCluster}} is not deterministic for the random partitioner because of this method:

{code}
public BigIntegerToken getRandomToken(Random random)
{
    BigInteger token = FBUtilities.hashToBigInteger(GuidGenerator.guidAsBytes(random));
    if ( token.signum() == -1 )
        token = token.multiply(BigInteger.valueOf(-1L));
    return new BigIntegerToken(token);
}
{code}

and

{code}
public static ByteBuffer guidAsBytes(Random random)
{
    StringBuilder sbValueBeforeMD5 = new StringBuilder();
    long time = System.currentTimeMillis();
    long rand = 0;
    rand = random.nextLong();
    sbValueBeforeMD5.append(s_id)
                    .append(":")
                    .append(Long.toString(time))
                    .append(":")
                    .append(Long.toString(rand));

    String valueBeforeMD5 = sbValueBeforeMD5.toString();
    return ByteBuffer.wrap(FBUtilities.threadLocalMD5Digest().digest(valueBeforeMD5.getBytes()));
}
{code}

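{{guidAsBytes}} depends on the current time ({{System.currentTimeMillis()}}), so 
the generated tokens change between runs even when the {{Random}} is seeded. 
I've removed this dependency in the latest 
[commit|https://github.com/stef1927/cassandra/commit/b6e39d1105df6f1845c2a3928b454339b0e21201]
 and launched all tests: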

||3.X||trunk||
|[patch|https://github.com/stef1927/cassandra/tree/12784-3.X]|[patch|https://github.com/stef1927/cassandra/tree/12784]|
|[testall|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-12784-3.X-testall/]|[testall|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-12784-testall/]|
|[dtest|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-12784-3.X-dtest/]|[dtest|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-12784-dtest/]|

The only other caller of {{RandomPartitioner.getRandomToken(Random)}} is 
compaction-stress (CASSANDRA-11844), and the documentation of 
{{CompactionStress.generateTokens()}} also indicates that this method should 
generate tokens in a deterministic way.
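
To illustrate the point, here is a minimal, self-contained sketch of a token generator whose output depends only on the supplied {{Random}}. It is not the code from the commit above: the class name {{DeterministicTokenSketch}}, the fixed {{"id:"}} prefix and the use of {{BigInteger.abs()}} are assumptions made for the example.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Random;

public class DeterministicTokenSketch
{
    // Hypothetical stand-in for guidAsBytes(Random): the hashed string is built
    // only from a fixed prefix and the random value, with no current-time component.
    static byte[] guidAsBytes(Random random)
    {
        String valueBeforeMD5 = "id:" + random.nextLong();
        try
        {
            return MessageDigest.getInstance("MD5")
                                .digest(valueBeforeMD5.getBytes(StandardCharsets.UTF_8));
        }
        catch (NoSuchAlgorithmException e)
        {
            throw new IllegalStateException(e);
        }
    }

    // Interprets the digest as a signed integer and takes its absolute value,
    // so the resulting token is never negative.
    static BigInteger randomToken(Random random)
    {
        return new BigInteger(guidAsBytes(random)).abs();
    }

    public static void main(String[] args)
    {
        // Two generators created with the same seed must yield the same token.
        BigInteger first = randomToken(new Random(42));
        BigInteger second = randomToken(new Random(42));
        System.out.println(first.equals(second)); // prints "true"
    }
}
```

With a time component in the hashed string, the two calls in {{main}} would produce different tokens whenever the clock value differs between them, and always across separate test runs.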



[jira] [Commented] (CASSANDRA-12784) ReplicationAwareTokenAllocatorTest times out almost every time for 3.X and trunk

2016-10-14 Thread Stefania (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15574732#comment-15574732
 ] 

Stefania commented on CASSANDRA-12784:
--

bq. testExistingCluster is supposed to be deterministic as it uses a seeded 
random and should either always fail or always succeed. That random must be 
somehow changing in the multiplexed run.

Thank you, I'll have a look on Monday at why the random generator is changing.



[jira] [Commented] (CASSANDRA-12784) ReplicationAwareTokenAllocatorTest times out almost every time for 3.X and trunk

2016-10-14 Thread Branimir Lambov (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15574676#comment-15574676
 ] 

Branimir Lambov commented on CASSANDRA-12784:
-

{{testExistingCluster}} is supposed to be deterministic as it uses a seeded 
random and should either always fail or always succeed. That random must be 
somehow changing in the multiplexed run.

bq. The stack printout still there

That's great, I did not know this trick.



[jira] [Commented] (CASSANDRA-12784) ReplicationAwareTokenAllocatorTest times out almost every time for 3.X and trunk

2016-10-14 Thread Stefania (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15574654#comment-15574654
 ] 

Stefania commented on CASSANDRA-12784:
--

Thanks for the review.

The multiplexed run returned a number of 
[failures|https://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-testall-multiplex/26/]
 in {{RandomReplicationAwareTokenAllocatorTest.testExistingCluster}}. Is this 
test also expected to be flaky, or is this a problem?

Splitting the test caused a failure in testall; I've fixed it and relaunched.

bq. The stack printout for single flakes could be useful to track the history 
of a failure; I would prefer not to lose it, but I wouldn't stop the commit if 
that is something you think is worth sacrificing.

The stack printout is still there; the logger is smart enough to work it out. 
I've tested it locally, see the sample output below.

bq. I would rename flakyTestNewCluster in the base class to just testNewCluster 
since the individual runner is the one that declares it flaky and handles that.

Done.

bq. Is there a reason to post the commits view instead of the branch 
one? There's no direct way to get from the commits view to compare which is the 
one most useful for reviews.

Not really:

||3.X||trunk||
|[patch|https://github.com/stef1927/cassandra/tree/12784-3.X]|[patch|https://github.com/stef1927/cassandra/tree/12784]|
|[testall|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-12784-3.X-testall/]|[testall|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-12784-testall/]|

--

Sample stack printout obtained by inserting a fake {{Assert.fail("Test 
message")}}:

{code}
[junit] INFO  [main] 2016-10-14 16:10:44,810 ?:? - Test failed. It tends to fail sometimes due to the random selection of the tokens in the first few nodes.
[junit] java.lang.AssertionError: Test message
[junit] at org.junit.Assert.fail(Assert.java:88) ~[junit-4.12.jar:4.12]
[junit] at org.apache.cassandra.dht.tokenallocator.AbstractReplicationAwareTokenAllocatorTest.testNewCluster(AbstractReplicationAwareTokenAllocatorTest.java:592) ~[classes/:na]
[junit] at org.apache.cassandra.dht.tokenallocator.AbstractReplicationAwareTokenAllocatorTest.testNewCluster(AbstractReplicationAwareTokenAllocatorTest.java:563) ~[classes/:na]
[junit] at org.apache.cassandra.dht.tokenallocator.RandomReplicationAwareTokenAllocatorTest.flakyTestNewCluster(RandomReplicationAwareTokenAllocatorTest.java:51) [classes/:na]
[junit] at org.apache.cassandra.Util.runCatchingAssertionError(Util.java:581) ~[classes/:na]
[junit] at org.apache.cassandra.Util.flakyTest(Util.java:611) ~[classes/:na]
[junit] at org.apache.cassandra.dht.tokenallocator.RandomReplicationAwareTokenAllocatorTest.testNewClusterr(RandomReplicationAwareTokenAllocatorTest.java:44) [classes/:na]
[junit] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.8.0_101]
[junit] at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[na:1.8.0_101]
[junit] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_101]
[junit] at java.lang.reflect.Method.invoke(Method.java:498) ~[na:1.8.0_101]
[junit] at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) [junit-4.12.jar:4.12]
[junit] at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) [junit-4.12.jar:4.12]
[junit] at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) [junit-4.12.jar:4.12]
[junit] at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) [junit-4.12.jar:4.12]
[junit] at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) [junit-4.12.jar:4.12]
[junit] at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) [junit-4.12.jar:4.12]
[junit] at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) [junit-4.12.jar:4.12]
[junit] at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) [junit-4.12.jar:4.12]
[junit] at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) [junit-4.12.jar:4.12]
[junit] at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) [junit-4.12.jar:4.12]
[junit] at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) [junit-4.12.jar:4.12]
[junit] at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) [junit-4.12.jar:4.12]
[junit] at org.junit.runners.ParentRunner.run(ParentRunner.java:363) [junit-4.12.jar:4.12]
[junit] at junit.framework.JUnit4TestAdapter.run(JUnit4TestAdapter.java:38) [junit-4.12.jar:na]
[junit] at 

[jira] [Commented] (CASSANDRA-12784) ReplicationAwareTokenAllocatorTest times out almost every time for 3.X and trunk

2016-10-14 Thread Branimir Lambov (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15574514#comment-15574514
 ] 

Branimir Lambov commented on CASSANDRA-12784:
-

+1. Nits:

- The stack printout for single flakes could be useful to track the history of 
a failure; I would prefer not to lose it, but I wouldn't stop the commit if 
that is something you think is worth sacrificing.

- I would rename [{{flakyTestNewCluster}} in the base 
class|https://github.com/apache/cassandra/compare/trunk...stef1927:12784-3.X#diff-f32c9e3d5921a2a50fe56db4612f14b4R548]
 to just {{testNewCluster}} since the individual runner is the one that 
declares it flaky and handles that.

Is there a reason to post the commits view instead of [the branch 
one|https://github.com/stef1927/cassandra/tree/12784-3.X]? There's no direct 
way to get from the commits view to 
[compare|https://github.com/apache/cassandra/compare/trunk...stef1927:12784-3.X]
 which is the one most useful for reviews.





[jira] [Commented] (CASSANDRA-12784) ReplicationAwareTokenAllocatorTest times out almost every time for 3.X and trunk

2016-10-13 Thread Stefania (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15574329#comment-15574329
 ] 

Stefania commented on CASSANDRA-12784:
--

These are the new timings on my laptop with 64 vnodes for Murmur3 and 16 vnodes 
for Random:

{code}

...




...




{code}

Unfortunately Jenkins is much slower. I multiplexed the test 10 times 
[here|https://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-testall-multiplex/25/testReport/org.apache.cassandra.dht.tokenallocator/]
 and each iteration took approximately 5.5 minutes. There were no failures, so 
I could not reproduce the problem mentioned above.

Because {{testNewClusterWithMurmur3Partitioner}} takes 2.5 minutes on Jenkins, 
a flaky test will time out if it performs more than 2 additional runs, so I 
changed the iterations to 2 for Murmur3 and 3 for Random. I've also split the 
class into two sub-classes, so that the total limit of 10 minutes is doubled. 
Otherwise, if both tests are flaky in the same run, the class will certainly 
time out, and even if only {{testNewClusterWithMurmur3Partitioner}} is flaky, 
the total time is very close to the 10-minute limit.
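
The arithmetic behind the "2 additional runs" figure can be sketched as follows. This is only an illustration: the helper name {{maxRunsWithinLimit}} is made up, and it assumes the 10-minute limit is the only budget and that nothing else in the class consumes time.

```java
public class TimeoutBudgetSketch
{
    // Returns how many complete runs of a test fit strictly inside the limit.
    static int maxRunsWithinLimit(int secondsPerRun, int limitSeconds)
    {
        int runs = 0;
        while ((runs + 1) * secondsPerRun < limitSeconds)
            runs++;
        return runs;
    }

    public static void main(String[] args)
    {
        // 2.5 minutes per run against a 10-minute limit:
        // 3 runs take 450 s, a 4th would reach 600 s and hit the limit.
        int runs = maxRunsWithinLimit(150, 600);
        System.out.println(runs + " runs, i.e. " + (runs - 1) + " additional"); // prints "3 runs, i.e. 2 additional"
    }
}
```

In practice the other tests in the same class also count towards the limit, which is why the margin is tighter than this sketch suggests.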

This is the full patch:

||3.X||trunk||
|[patch|https://github.com/stef1927/cassandra/commits/12784-3.X]|[patch|https://github.com/stef1927/cassandra/commits/12784]|
|[testall|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-12784-3.X-testall/]|[testall|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-12784-testall/]|

I'm also multiplexing the random partitioner tests 50 times 
[here|https://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-testall-multiplex/26/],
 to see if we can reproduce any failures despite the flaky utility.

[~blambov], [~dikanggu]: who wants to be the reviewer?



[jira] [Commented] (CASSANDRA-12784) ReplicationAwareTokenAllocatorTest times out almost every time for 3.X and trunk

2016-10-13 Thread Stefania (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15574319#comment-15574319
 ] 

Stefania commented on CASSANDRA-12784:
--

Sure, will do!



[jira] [Commented] (CASSANDRA-12784) ReplicationAwareTokenAllocatorTest times out almost every time for 3.X and trunk

2016-10-13 Thread Dikang Gu (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15572959#comment-15572959
 ] 

Dikang Gu commented on CASSANDRA-12784:
---

Thanks [~Stefania] and [~blambov], let me know if there is anything I can help with.



[jira] [Commented] (CASSANDRA-12784) ReplicationAwareTokenAllocatorTest times out almost every time for 3.X and trunk

2016-10-13 Thread Stefania (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15571794#comment-15571794
 ] 

Stefania commented on CASSANDRA-12784:
--

The failure was reproduced locally using the junit version specified in 
build.xml, which is 4.6, kind of old actually.

At least in 4.6, {{junit.framework.AssertionFailedError}} extends 
{{AssertionError}}. Let me add some debug information and see if I can 
reproduce it again tomorrow. I'll try to multiplex it on Jenkins, and then we 
should know for sure what is going on.

Noted about changing the assert import. 

One more thing, if we want to leave {{testNewClusterWithMurmur3Partitioner}} at 
64 vnodes, then I suggest lowering the number of iterations in the flaky 
utility, or else a flaky run will time out.





[jira] [Commented] (CASSANDRA-12784) ReplicationAwareTokenAllocatorTest times out almost every time for 3.X and trunk

2016-10-13 Thread Branimir Lambov (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15571481#comment-15571481
 ] 

Branimir Lambov commented on CASSANDRA-12784:
-

bq. AssertionFailedError is a sub-class of AssertionError

Not in all JUnit versions. Your local one is probably different from the one in 
CI.

Additionally, the test itself imports {{junit.framework.Assert}} (from JUnit 3) 
while it should import {{org.junit.Assert}} (JUnit 4).



[jira] [Commented] (CASSANDRA-12784) ReplicationAwareTokenAllocatorTest times out almost every time for 3.X and trunk

2016-10-13 Thread Stefania (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15571415#comment-15571415
 ] 

Stefania commented on CASSANDRA-12784:
--

bq. Since the same test is being done for a large vnode count for the Murmur 
partitioner, I have absolutely nothing against reducing the scope of the random 
version to no higher than 16 vnodes.

Thanks, I'll change the test accordingly.

bq. The flaky utility failure is caused by not catching the right error type 
for some junit versions. The fix is to include || AssertionFailedError in the 
catch in runCatchingAssertionError.

That's what I thought as well initially, but actually AssertionFailedError is a 
sub-class of AssertionError, so that is not the reason. I verified in IntelliJ 
that if we call Assert.fail() in the same method, the exception is caught. I 
must admit I don't understand this yet, unless it really ran 5 times and 
failed every time but the suppressed exceptions were not displayed. The trouble 
is that it is hard to reproduce.



[jira] [Commented] (CASSANDRA-12784) ReplicationAwareTokenAllocatorTest times out almost every time for 3.X and trunk

2016-10-13 Thread Branimir Lambov (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15571350#comment-15571350
 ] 

Branimir Lambov commented on CASSANDRA-12784:
-

Since the same test is being done for a large vnode count for the Murmur 
partitioner, I have absolutely nothing against reducing the scope of the random 
version to no higher than 16 vnodes.

The flaky utility failure is caused by not catching the right error type for 
_some_ junit versions. The fix is to include {{|| AssertionFailedError}} in the 
[{{catch}}|https://github.com/apache/cassandra/blob/trunk/test/unit/org/apache/cassandra/Util.java#L579]
 in {{runCatchingAssertionError}}.
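
As an illustration of the idea (not the actual {{Util.java}} code), a retry helper that catches two unrelated failure types could look like the sketch below. {{LegacyAssertionFailedError}} is a made-up stand-in for an assertion failure class that does not extend {{AssertionError}}, and the method bodies are assumptions; only the method names are borrowed from the discussion.

```java
public class FlakyRetrySketch
{
    // Hypothetical stand-in for a failure type that is NOT a subclass of AssertionError.
    static class LegacyAssertionFailedError extends Error
    {
        LegacyAssertionFailedError(String message)
        {
            super(message);
        }
    }

    // Runs the test once; returns the assertion-style failure, or null on success.
    static Throwable runCatchingAssertionError(Runnable test)
    {
        try
        {
            test.run();
            return null;
        }
        catch (AssertionError | LegacyAssertionFailedError e)
        {
            // Multi-catch only compiles when neither type is a subclass of the other.
            return e;
        }
    }

    // Retries up to maxRuns times (maxRuns >= 1); fails only if every run failed.
    static void flakyTest(Runnable test, int maxRuns)
    {
        Throwable first = null;
        for (int i = 0; i < maxRuns; i++)
        {
            Throwable failure = runCatchingAssertionError(test);
            if (failure == null)
                return;
            if (first == null)
                first = failure;
            else
                first.addSuppressed(failure);
        }
        throw new AssertionError("Test failed " + maxRuns + " times", first);
    }

    public static void main(String[] args)
    {
        int[] attempts = { 0 };
        // Fails twice with the legacy error type, then passes on the third attempt.
        flakyTest(() -> {
            if (++attempts[0] < 3)
                throw new LegacyAssertionFailedError("simulated flake");
        }, 5);
        System.out.println("passed after " + attempts[0] + " attempts"); // prints "passed after 3 attempts"
    }
}
```

If the catch clause listed only {{AssertionError}}, the first simulated failure would propagate straight out of {{flakyTest}} without any retry, which matches the symptom described here.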



[jira] [Commented] (CASSANDRA-12784) ReplicationAwareTokenAllocatorTest times out almost every time for 3.X and trunk

2016-10-13 Thread Stefania (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15571222#comment-15571222
 ] 

Stefania commented on CASSANDRA-12784:
--

This test has been timing out since it was introduced by CASSANDRA-12647. In 
fact, the problem is not {{testNewClusterWithMurmur3Partitioner}}, which was 
previously called {{testNewCluster}} and was timing out very rarely (due to the 
flaky utility?), but the new random partitioner tests. They take approximately 
twice as long as the murmur3 tests. On my laptop, with the current test 
configuration of 64 VNODES, the full test completes in approximately 500 
seconds; see the timings below. The total timeout on Jenkins is 600 seconds, 
so the test almost always times out, given that the Jenkins VMs are slower. 
This analysis does not take the flaky utility into account: when it kicks in, 
there is no chance that the test completes within the timeout.

I've attached a JFR profile, [^ReplicationAwareTokenAllocatorTest.jfr.gz]. The 
slowness of the random partitioner tests is due to the big integer math in 
{{BigIntegerToken.size()}}. Unless we plan on improving the performance of the 
algorithm or of the big integer math, may I suggest reducing the scope of the 
test? I don't think it's reasonable to run a unit test that takes longer than 
10 minutes; the full test can be moved to a burn test if required.

One way to reduce the scope would be to reduce the number of iterations by 
reducing VNODES. Do you have any other suggestions, [~blambov] or [~dikanggu]?

h5. Measurements on my laptop:

*64-VNODES:*

{code}

...




...




{code}

*32-VNODES:*

{code}

...




...




{code}

*16-VNODES:*

{code}

...




...




{code}

*8-VNODES:*

{code}

...




...




{code}


*4-VNODES:*

{code}

...




...




{code}


--

I've also noticed two failures with the following exception:

{code}
[junit] -  ---
[junit] Testcase: testNewClusterWithRandomPartitioner(org.apache.cassandra.dht.tokenallocator.ReplicationAwareTokenAllocatorTest): FAILED
[junit] Expected max unit size below 1.2000, was 1.2241
[junit] junit.framework.AssertionFailedError: Expected max unit size below 1.2000, was 1.2241
[junit] at org.apache.cassandra.dht.tokenallocator.ReplicationAwareTokenAllocatorTest.grow(ReplicationAwareTokenAllocatorTest.java:698)
[junit] at org.apache.cassandra.dht.tokenallocator.ReplicationAwareTokenAllocatorTest.testNewCluster(ReplicationAwareTokenAllocatorTest.java:629)
[junit] at org.apache.cassandra.dht.tokenallocator.ReplicationAwareTokenAllocatorTest.flakyTestNewCluster(ReplicationAwareTokenAllocatorTest.java:611)
[junit] at org.apache.cassandra.dht.tokenallocator.ReplicationAwareTokenAllocatorTest.flakyTestNewClusterWithRandomPartitioner(ReplicationAwareTokenAllocatorTest.java:583)
[junit] at org.apache.cassandra.Util.runCatchingAssertionError(Util.java:576)
[junit] at org.apache.cassandra.Util.flakyTest(Util.java:601)
[junit] at org.apache.cassandra.dht.tokenallocator.ReplicationAwareTokenAllocatorTest.testNewClusterWithRandomPartitioner(ReplicationAwareTokenAllocatorTest.java:568)
{code}

Is the flaky utility effective for the random partitioner tests?
