Re: Cassandra project status update 2022-08-03

2022-08-11 Thread Andrés de la Peña
>
> > I think if we want to do this, it should be extremely easy - by which I
> mean automatic, really. This shouldn’t be too tricky I think? We just need
> to produce a diff of new test classes and methods within existing classes.


Having a CircleCI job that automatically runs all new/modified tests would
be a great way to prevent most of the new flakies. We would still miss some
cases, like unmodified tests that turn flaky after a change to the tested
code, but I'd say that's less common.

> I can probably help out by putting together something to output @Test
> annotated methods within a source tree, if others are able to turn this
> into a part of the CircleCI pre-commit task (i.e. to pick the common
> ancestor with trunk, 4.1 etc, and run this task for each of the outputs)


I think we would need a bash/sh shell script taking a diff file and test
directory, and returning the file path and qualified class name of every
modified test class. I'd say we don't need the method names for Java tests
because quite often we see flaky tests that only fail when running their
entire class, so it's probably better to repeatedly run entire test classes
instead of particular methods.
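
To make that concrete, a minimal sketch of what such a script could look
like (the diff format, directory layout and naming here are assumptions,
not a final design):

    #!/bin/sh
    # Sketch: given a unified diff and a test source dir (e.g. test/unit),
    # print "<file path> <qualified class name>" for every touched test class.
    DIFF_FILE="$1"
    TEST_DIR="${2:-test/unit}"
    grep -E '^\+\+\+ b/' "$DIFF_FILE" \
      | sed 's|^+++ b/||' \
      | grep "^$TEST_DIR/.*Test.*\.java$" \
      | while read -r path; do
          class=$(echo "$path" | sed "s|^$TEST_DIR/||; s|/|.|g; s|\.java$||")
          echo "$path $class"
        done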

We would also need a similar script for Python dtests. We would probably
want it to provide the full path of the modified tests (as
in cqlsh_tests/test_cqlsh.py::TestCqlshSmoke::test_create_index) because
those tests can be quite resource-intensive.
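
A rough equivalent for the dtests might look like this (again just a
sketch: it only lists touched files and newly added test functions, and
building full file::Class::method ids would need a bit more parsing of the
diff hunks):

    #!/bin/sh
    # Sketch: list Python dtest files touched by a diff, plus the test
    # functions the diff adds.
    DIFF_FILE="$1"
    grep -E '^\+\+\+ b/.*test_.*\.py$' "$DIFF_FILE" | sed 's|^+++ b/||'
    grep -E '^\+[[:space:]]*def test_' "$DIFF_FILE" \
      | sed -E 's/^\+[[:space:]]*def (test_[A-Za-z0-9_]+).*/\1/'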

I think once we have those scripts we could plug their output into the
CircleCI commands for repeating tests.

Putting all this together seems relatively involved, so it may take us some
time to get it ready. In the meantime, I think it's good practice to just
manually include any new/modified tests in the CircleCI config. Doing so
only requires passing a few additional options to the script that generates
the config, which doesn't seem like too much effort.
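
For reference, a hypothetical invocation could look like the following; the
script name, flags and variable names are assumptions on my side, so treat
the .circleci documentation in the branch as the authority:

    # Regenerate the CircleCI config so the listed test is run repeatedly
    # in the pre-commit workflow (names below are illustrative only).
    .circleci/generate.sh -p \
      -e REPEATED_UTESTS=org.apache.cassandra.cql3.MyNewTest \
      -e REPEATED_UTESTS_COUNT=500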

On Wed, 10 Aug 2022 at 19:47, Brandon Williams  wrote:

> > Side note, Butler is reporting CASSANDRA-17348 as open (it's resolved as
> a duplicate).
>
> This is fixed.
>


Re: Cassandra project status update 2022-08-03

2022-08-10 Thread Brandon Williams
> Side note, Butler is reporting CASSANDRA-17348 as open (it's resolved as a 
> duplicate).

This is fixed.


Re: Cassandra project status update 2022-08-03

2022-08-10 Thread Mick Semb Wever
On Wed, 10 Aug 2022 at 17:54, Josh McKenzie  wrote:

> “ We can start by putting the bar at a lower level and raise the level
> over time when most of the flakies that we hit are above that level.”
> My only concern is who will track that, and how.
>
> What's Butler's logic for flagging things flaky? Maybe a "flaky low" vs.
> "flaky high" distinction based on failure frequency (or some much better
> name I'm sure someone else will come up with) could make sense?
>


I'd be keen to see orders of magnitude, rather than arbitrary labels. Also
per CI system (having the data for basic correlation between systems will
be useful in other discussions and decisions).



> Then we could focus our efforts on the ones that are flagged as failing at
> whatever high water mark threshold we set.
>


Maybe obvious, but this only works so long as there's a way to bypass it
when a flaky is identified as being a legit bug and/or in a critical
component (even a 1:1M flakiness in certain components can be disastrous).

Some other questions…
 - how to measure the flakiness
 - how to measure post-commit rates across both CI systems
 - where the flakiness labels(/orders-of-magnitude) should be recorded
 - how we label flakies as being legit/critical/blocking (currently you
often have to read through the comments)


Applying this manually to the remaining 4.1 blockers we have:
- CASSANDRA-17461 CASTest. 1:40 failures on circle. looks to be about 1:2 on
ci-cassandra
- CASSANDRA-17618 InternodeEncryptionEnforcementTest. 1:167 circle. no
flaky in ci-cassandra
- CASSANDRA-17804 AutoSnapshotTtlTest. unknown flakiness in both ci.
- CASSANDRA-17573 PaxosRepairTest. 1:20 circle. no flakies in ci-cassandra.
- CASSANDRA-17658 KeyspaceMetricsTest. 1:20 circle. no flakies in
ci-cassandra.

In addition to these, Butler lists a number of flakies against 4.1, but
these are not regressions in 4.1 and hence are not blockers. The jira board
is currently not blocking a 4.1-beta release on non-regression flakies. This
means our releases are not blocked on overall flakies, regardless of how
many of them there are. How do we square this with our recent stance of no
releases unless green…? (loops back to my "fewer overall flakies than the
previous release / campground-cleaner" suggestion)

Side note, Butler is reporting CASSANDRA-17348 as open (it's resolved as a
duplicate).


Re: Cassandra project status update 2022-08-03

2022-08-10 Thread Josh McKenzie
> “ We can start by putting the bar at a lower level and raise the level over 
> time when most of the flakies that we hit are above that level.”
> My only concern is who will track that, and how.
What's Butler's logic for flagging things flaky? Maybe a "flaky low" vs. "flaky 
high" distinction based on failure frequency (or some much better name I'm sure 
someone else will come up with) could make sense? Then we could focus our 
efforts on the ones that are flagged as failing at whatever high water mark 
threshold we set.

It'd be trivial for me to update the script that parses test failure output for 
JIRA updates to flag things based on their failure frequency.
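
Purely to illustrate the kind of bucketing I mean, assuming an input of
"<test> <failures> <runs>" lines and an arbitrary threshold:

    #!/bin/sh
    # Sketch: tag each test as "flaky-high" or "flaky-low" by observed
    # failure rate; the 1-in-100 cutoff is a placeholder, not a proposal.
    THRESHOLD=0.01
    awk -v t="$THRESHOLD" '{
      rate = ($3 > 0) ? $2 / $3 : 0
      printf "%-60s %.4f %s\n", $1, rate, (rate >= t ? "flaky-high" : "flaky-low")
    }' test_failure_counts.txt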

On Tue, Aug 9, 2022, at 5:24 PM, Ekaterina Dimitrova wrote:
> “ In my opinion, not all flakies are equals. Some fails every 10 runs, some 
> fails 1 in a 1000 runs.”
> Agreed, for all not new tests/regressions which are also not infra related.
> 
> “ We can start by putting the bar at a lower level and raise the level over 
> time when most of the flakies that we hit are above that level.”
> My only concern is who will track that, and how.
> Also, metric for non-infra issues I guess
> 
> “ At the same time we should make sure that we do not introduce new flakies. 
> One simple approach that has been mentioned several time is to run the new 
> tests added by a given patch in a loop using one of the CircleCI tasks. ”
> +1, I personally find this very valuable and more efficient than bisecting 
> and getting back to works done in some cases months ago
> 
> 
> “ We should also probably revert newly committed patch if we detect that they 
> introduced flakies.”
> +1, not that I like my patches to be reverted but it seems as the most fair 
> way to stick to our stated goals. But I think last time we talked about 
> reverting, we discussed it only for trunk? Or do I remember it wrong?
> 
> 
> 
> On Tue, 9 Aug 2022 at 7:58, Benjamin Lerer  wrote:
>> At this point it is clear that we will probably never be able to remove some 
>> level of flakiness from our tests. For me the questions are: 1) Where do we 
>> draw the line for a release ? and 2) How do we maintain that line over time?
>> 
>> In my opinion, not all flakies are equals. Some fails every 10 runs, some 
>> fails 1 in a 1000 runs. I would personally draw the line based on that 
>> metric. With the circleci tasks that Andres has added we can easily get that 
>> information for a given test.
>> We can start by putting the bar at a lower level and raise the level over 
>> time when most of the flakies that we hit are above that level.
>> 
>> That would allow us to minimize the risk of introducing flaky tests. We 
>> should also probably revert newly committed patch if we detect that they 
>> introduced flakies.
>> 
>> What do you think?
>> 
>> 
>> 
>> 
>> 
>> Le dim. 7 août 2022 à 12:24, Mick Semb Wever  a écrit :
>>> 
>>> 
 With that said, I guess we can just revise on a regular basis what exactly 
 are the last flakes and not numbers which also change quickly up and down 
 with the first change in the Infra. 
 
>>> 
>>> 
>>> +1, I am in favour of taking a pragmatic approach.
>>> 
>>> If flakies are identified and triaged enough that, with correlation from 
>>> both CI systems, we are confident that no legit bugs are behind them, I'm 
>>> in favour of going beta.
>>> 
>>> I still remain in favour of somehow incentivising reducing other flakies as 
>>> well. Flakies that expose poor/limited CI infra, and/or tests that are not 
>>> as resilient as they could be, are still noise that indirectly reduce our 
>>> QA (and increase efforts to find and tackle those legit runtime problems). 
>>> Interested in hearing input from others here that have been spending a lot 
>>> of time on this front. 
>>> 
>>> Could it work if we say: all flakies must be ticketed, and test/infra 
>>> related flakies do not block a beta release so long as there are fewer than 
>>> the previous release? The intent here being pragmatic, but keeping us on a 
>>> "keep the campground cleaner" trajectory… 


Re: Cassandra project status update 2022-08-03

2022-08-10 Thread Claude Warren, Jr via dev
Perhaps flaky tests need to be handled differently. Is there a way to
build a statistical model of the current flakiness of a test that we can
then use during testing to accept the failures? Once an acceptable level of
flakiness is established, a failing test would be run again, one or more
times, to get a sample and check that the failure is not statistically
significant.
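
As a toy illustration of that idea (all numbers assumed): given an accepted
flakiness rate p, rerun the failing test n times and ask how likely k or
more failures would be under p alone; a very small probability points at a
real regression rather than known flakiness.

    #!/bin/sh
    # Toy binomial tail calculation; p, n and k are placeholders.
    p=0.01; n=10; k=2
    awk -v p="$p" -v n="$n" -v k="$k" 'BEGIN {
      tail = 0
      for (i = k; i <= n; i++) {
        c = 1
        for (j = 0; j < i; j++) c = c * (n - j) / (j + 1)   # C(n, i)
        tail += c * p^i * (1 - p)^(n - i)
      }
      printf "P(>= %d failures in %d reruns | p = %.2f) = %.4f\n", k, n, p, tail
    }'

With those placeholder numbers the answer is roughly 0.004, i.e. two
failures in ten reruns would be hard to explain by an accepted 1-in-100
flakiness alone.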



On Wed, Aug 10, 2022 at 8:51 AM Benedict Elliott Smith 
wrote:

> 
> > We can start by putting the bar at a lower level and raise the level
> over time
>
> +1
>
> > One simple approach that has been mentioned several time is to run the
> new tests added by a given patch in a loop using one of the CircleCI tasks
>
> I think if we want to do this, it should be extremely easy - by which I
> mean automatic, really. This shouldn’t be too tricky I think? We just need
> to produce a diff of new test classes and methods within existing classes.
> If there doesn’t already exist tooling to do this, I can probably help out
> by putting together something to output @Test annotated methods within a
> source tree, if others are able to turn this into a part of the CircleCI
> pre-commit task (i.e. to pick the common ancestor with trunk, 4.1 etc, and
> run this task for each of the outputs). We might want to start
> standardising branch naming structures to support picking the upstream
> branch.
>
> > We should also probably revert newly committed patch if we detect that
> they introduced flakies.
>
> There should be a strict time limit for reverting a patch for this reason,
> as environments change and what is flaky now was not necessarily before.
>
> On 9 Aug 2022, at 12:57, Benjamin Lerer  wrote:
>
> At this point it is clear that we will probably never be able to remove
> some level of flakiness from our tests. For me the questions are: 1) Where
> do we draw the line for a release ? and 2) How do we maintain that line
> over time?
>
> In my opinion, not all flakies are equals. Some fails every 10 runs, some
> fails 1 in a 1000 runs. I would personally draw the line based on that
> metric. With the circleci tasks that Andres has added we can easily get
> that information for a given test.
> We can start by putting the bar at a lower level and raise the level over
> time when most of the flakies that we hit are above that level.
>
> At the same time we should make sure that we do not introduce new flakies.
> One simple approach that has been mentioned several time is to run the new
> tests added by a given patch in a loop using one of the CircleCI tasks.
> That would allow us to minimize the risk of introducing flaky tests. We
> should also probably revert newly committed patch if we detect that they
> introduced flakies.
>
> What do you think?
>
>
>
>
>
> Le dim. 7 août 2022 à 12:24, Mick Semb Wever  a écrit :
>
>>
>>
>> With that said, I guess we can just revise on a regular basis what
>>> exactly are the last flakes and not numbers which also change quickly up
>>> and down with the first change in the Infra.
>>>
>>
>>
>> +1, I am in favour of taking a pragmatic approach.
>>
>> If flakies are identified and triaged enough that, with correlation from
>> both CI systems, we are confident that no legit bugs are behind them, I'm
>> in favour of going beta.
>>
>> I still remain in favour of somehow incentivising reducing other flakies
>> as well. Flakies that expose poor/limited CI infra, and/or tests that are
>> not as resilient as they could be, are still noise that indirectly reduce
>> our QA (and increase efforts to find and tackle those legit runtime
>> problems). Interested in hearing input from others here that have been
>> spending a lot of time on this front.
>>
>> Could it work if we say: all flakies must be ticketed, and test/infra
>> related flakies do not block a beta release so long as there are fewer than
>> the previous release? The intent here being pragmatic, but keeping us on a
>> "keep the campground cleaner" trajectory…
>>
>>
>


Re: Cassandra project status update 2022-08-03

2022-08-10 Thread Benedict Elliott Smith

> We can start by putting the bar at a lower level and raise the level over time

+1

> One simple approach that has been mentioned several time is to run the new 
> tests added by a given patch in a loop using one of the CircleCI tasks

I think if we want to do this, it should be extremely easy - by which I mean 
automatic, really. This shouldn’t be too tricky I think? We just need to 
produce a diff of new test classes and methods within existing classes. If 
there doesn’t already exist tooling to do this, I can probably help out by 
putting together something to output @Test annotated methods within a source 
tree, if others are able to turn this into a part of the CircleCI pre-commit 
task (i.e. to pick the common ancestor with trunk, 4.1 etc, and run this task 
for each of the outputs). We might want to start standardising branch naming 
structures to support picking the upstream branch.
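
For what it's worth, a crude sketch of the diff step (not the tooling
offered above; the branch name, paths and the "methods start with test"
shortcut are assumptions):

    #!/bin/sh
    # Sketch: find the merge base with the upstream branch and list test
    # methods the patch adds under test/. Methods annotated @Test but not
    # named test* would need smarter parsing.
    UPSTREAM="${1:-trunk}"
    BASE=$(git merge-base "$UPSTREAM" HEAD)
    git diff "$BASE" HEAD -- 'test/' \
      | grep -E '^\+.*public void test[A-Za-z0-9_]*\(' \
      | sed -E 's/^\+.*void (test[A-Za-z0-9_]*)\(.*/\1/'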

> We should also probably revert newly committed patch if we detect that they 
> introduced flakies.

There should be a strict time limit for reverting a patch for this reason, as 
environments change and what is flaky now was not necessarily before.

> On 9 Aug 2022, at 12:57, Benjamin Lerer  wrote:
> 
> At this point it is clear that we will probably never be able to remove some 
> level of flakiness from our tests. For me the questions are: 1) Where do we 
> draw the line for a release ? and 2) How do we maintain that line over time?
> 
> In my opinion, not all flakies are equals. Some fails every 10 runs, some 
> fails 1 in a 1000 runs. I would personally draw the line based on that 
> metric. With the circleci tasks that Andres has added we can easily get that 
> information for a given test.
> We can start by putting the bar at a lower level and raise the level over 
> time when most of the flakies that we hit are above that level.
> 
> At the same time we should make sure that we do not introduce new flakies. 
> One simple approach that has been mentioned several time is to run the new 
> tests added by a given patch in a loop using one of the CircleCI tasks. That 
> would allow us to minimize the risk of introducing flaky tests. We should 
> also probably revert newly committed patch if we detect that they introduced 
> flakies.
> 
> What do you think?
> 
> 
> 
> 
> 
> Le dim. 7 août 2022 à 12:24, Mick Semb Wever  a écrit :
>> 
>> 
>>> With that said, I guess we can just revise on a regular basis what exactly 
>>> are the last flakes and not numbers which also change quickly up and down 
>>> with the first change in the Infra. 
>> 
>> 
>> 
>> +1, I am in favour of taking a pragmatic approach.
>> 
>> If flakies are identified and triaged enough that, with correlation from 
>> both CI systems, we are confident that no legit bugs are behind them, I'm in 
>> favour of going beta.
>> 
>> I still remain in favour of somehow incentivising reducing other flakies as 
>> well. Flakies that expose poor/limited CI infra, and/or tests that are not 
>> as resilient as they could be, are still noise that indirectly reduce our QA 
>> (and increase efforts to find and tackle those legit runtime problems). 
>> Interested in hearing input from others here that have been spending a lot 
>> of time on this front. 
>> 
>> Could it work if we say: all flakies must be ticketed, and test/infra 
>> related flakies do not block a beta release so long as there are fewer than 
>> the previous release? The intent here being pragmatic, but keeping us on a 
>> "keep the campground cleaner" trajectory… 



Re: Cassandra project status update 2022-08-03

2022-08-09 Thread Ekaterina Dimitrova
“ In my opinion, not all flakies are equals. Some fails every 10 runs, some
fails 1 in a 1000 runs.”
Agreed, for everything that is not a new test/regression and also not infra related.

“ We can start by putting the bar at a lower level and raise the level over
time when most of the flakies that we hit are above that level.”
My only concern is who will track that, and how.
Also, a metric for non-infra issues, I guess.

“ At the same time we should make sure that we do not introduce new
flakies. One simple approach that has been mentioned several time is to run
the new tests added by a given patch in a loop using one of the CircleCI
tasks. ”
+1, I personally find this very valuable and more efficient than bisecting
and going back to work done, in some cases, months ago


“ We should also probably revert newly committed patch if we detect that
they introduced flakies.”
+1, not that I like my patches to be reverted, but it seems like the fairest
way to stick to our stated goals. But I think the last time we talked about
reverting, we discussed it only for trunk? Or do I remember it wrong?



On Tue, 9 Aug 2022 at 7:58, Benjamin Lerer  wrote:

> At this point it is clear that we will probably never be able to remove
> some level of flakiness from our tests. For me the questions are: 1) Where
> do we draw the line for a release ? and 2) How do we maintain that line
> over time?
>
> In my opinion, not all flakies are equals. Some fails every 10 runs, some
> fails 1 in a 1000 runs. I would personally draw the line based on that
> metric. With the circleci tasks that Andres has added we can easily get
> that information for a given test.
> We can start by putting the bar at a lower level and raise the level over
> time when most of the flakies that we hit are above that level.
>
> That would allow us to minimize the risk of introducing flaky tests. We
> should also probably revert newly committed patch if we detect that they
> introduced flakies.
>
> What do you think?
>
>
>
>
>
> Le dim. 7 août 2022 à 12:24, Mick Semb Wever  a écrit :
>
>>
>>
>> With that said, I guess we can just revise on a regular basis what
>>> exactly are the last flakes and not numbers which also change quickly up
>>> and down with the first change in the Infra.
>>>
>>
>>
>> +1, I am in favour of taking a pragmatic approach.
>>
>> If flakies are identified and triaged enough that, with correlation from
>> both CI systems, we are confident that no legit bugs are behind them, I'm
>> in favour of going beta.
>>
>> I still remain in favour of somehow incentivising reducing other flakies
>> as well. Flakies that expose poor/limited CI infra, and/or tests that are
>> not as resilient as they could be, are still noise that indirectly reduce
>> our QA (and increase efforts to find and tackle those legit runtime
>> problems). Interested in hearing input from others here that have been
>> spending a lot of time on this front.
>>
>> Could it work if we say: all flakies must be ticketed, and test/infra
>> related flakies do not block a beta release so long as there are fewer than
>> the previous release? The intent here being pragmatic, but keeping us on a
>> "keep the campground cleaner" trajectory…
>>
>>


Re: Cassandra project status update 2022-08-03

2022-08-09 Thread Benjamin Lerer
At this point it is clear that we will probably never be able to remove
some level of flakiness from our tests. For me the questions are: 1) Where
do we draw the line for a release ? and 2) How do we maintain that line
over time?

In my opinion, not all flakies are equal. Some fail every 10 runs, some
fail 1 in 1000 runs. I would personally draw the line based on that
metric. With the CircleCI tasks that Andres has added we can easily get
that information for a given test.
We can start by putting the bar at a lower level and raise the level over
time when most of the flakies that we hit are above that level.
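
As a rough feel for what a given bar implies (numbers purely illustrative),
a test that is flaky at exactly 1 in 100 runs still comes up all green over
long repeated-run sessions surprisingly often, so the bar also dictates how
many repetitions the CircleCI task needs:

    # Probability that a 1-in-100 flaky shows no failure in n repeated runs.
    awk 'BEGIN { p = 0.01; for (n = 100; n <= 500; n += 100)
      printf "P(all green over %d runs | p = %.2f) = %.3f\n", n, p, (1 - p)^n }'

So distinguishing a 1-in-100 flaky from a stable test already takes a few
hundred repetitions, and a stricter bar needs proportionally more.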

At the same time we should make sure that we do not introduce new flakies.
One simple approach that has been mentioned several times is to run the new
tests added by a given patch in a loop using one of the CircleCI tasks.
That would allow us to minimize the risk of introducing flaky tests. We
should also probably revert newly committed patches if we detect that they
introduced flakies.

What do you think?





Le dim. 7 août 2022 à 12:24, Mick Semb Wever  a écrit :

>
>
> With that said, I guess we can just revise on a regular basis what exactly
>> are the last flakes and not numbers which also change quickly up and down
>> with the first change in the Infra.
>>
>
>
> +1, I am in favour of taking a pragmatic approach.
>
> If flakies are identified and triaged enough that, with correlation from
> both CI systems, we are confident that no legit bugs are behind them, I'm
> in favour of going beta.
>
> I still remain in favour of somehow incentivising reducing other flakies
> as well. Flakies that expose poor/limited CI infra, and/or tests that are
> not as resilient as they could be, are still noise that indirectly reduce
> our QA (and increase efforts to find and tackle those legit runtime
> problems). Interested in hearing input from others here that have been
> spending a lot of time on this front.
>
> Could it work if we say: all flakies must be ticketed, and test/infra
> related flakies do not block a beta release so long as there are fewer than
> the previous release? The intent here being pragmatic, but keeping us on a
> "keep the campground cleaner" trajectory…
>
>


Re: Cassandra project status update 2022-08-03

2022-08-07 Thread Mick Semb Wever
> With that said, I guess we can just revise on a regular basis what exactly
> are the last flakes and not numbers which also change quickly up and down
> with the first change in the Infra.
>


+1, I am in favour of taking a pragmatic approach.

If flakies are identified and triaged enough that, with correlation from
both CI systems, we are confident that no legit bugs are behind them, I'm
in favour of going beta.

I still remain in favour of somehow incentivising reducing other flakies as
well. Flakies that expose poor/limited CI infra, and/or tests that are not
as resilient as they could be, are still noise that indirectly reduce our
QA (and increase efforts to find and tackle those legit runtime problems).
Interested in hearing input from others here that have been spending a lot
of time on this front.

Could it work if we say: all flakies must be ticketed, and test/infra
related flakies do not block a beta release so long as there are fewer than
the previous release? The intent here being pragmatic, but keeping us on a
"keep the campground cleaner" trajectory…


Re: Cassandra project status update 2022-08-03

2022-08-03 Thread Ekaterina Dimitrova
Re: 17738 - the ticket was about any new properties which are actually not
of the new types. It had to guarantee that there is no disconnect between
updating the Settings Virtual Table after startup and the JMX
setters/getters. (In one of its “brother” tickets, the issues we found have
existed since 4.0.) I bring it up because we need to ensure that
configuration parameters updated from JMX also update the original Config
parameters, if we want the Settings Virtual Table to be properly updated
after startup and thus cut the confusion for users. This is actually a
stated goal for this VT, both in our docs and in the original ticket.
I'm raising the point again because, while we still have both the VT and
JMX, we need to be sure we provide consistent information for our users. I
will also put a note in the Config docs to stress this and remind people.
Probably when we add the update option for the Settings Virtual Table in
the next version we will need to think of a better way to keep this in
sync, or even start deprecating JMX, but for now this is what we have in
place and we need to maintain it.

Thank you Josh for the report, it is always valuable!

About flaky tests - in my personal opinion it is more about what
outstanding flaky tests we have than how many. We can have 3 which surface
legit bugs, or we can have 10 presenting only timeouts which are due to
environmental issues. These days I see CircleCI green all the time, which
is really promising as many of our legit bugs were discovered there. With
that said, I guess we can just revise on a regular basis what exactly are
the last flakes and not numbers which also change quickly up and down with
the first change in the Infra.

On Wed, 3 Aug 2022 at 13:17, Josh McKenzie  wrote:

> Greetings everyone! Let's check in on 4.1, see how we're doing:
>
> https://butler.cassandra.apache.org/#/
> We had 4 failures on our last run. We've gone back and forth a bit with
> the CASTest failure, a test introduced back in CASSANDRA-12126 @Ignore'd,
> however that showed some legitimate failures that should be addressed by
> Paxos V2. If anyone from the discussion has the cycles (or someone with
> familiarity with the area) could take assignee on the test failure ticket
> (17461) and responsibility for driving it to resolution that would help
> clarify our efforts there. (
> https://issues.apache.org/jira/browse/CASSANDRA-17461)
>
> Along with that, we saw a failure in
> TopPartitionsTest.testServiceTopPartitionsSingleTable (cdc) and
> TestBootstrap.test_simultaneous_bootstrap (offheap). Given both are
> specific configurations of tests that ran successfully to completion in
> other configurations there's a reasonable chance they're flaky, be it from
> the logic of the test or the CI environment in which they're executing.
> Neither failure appears to have an active JIRA associated with it in Butler
> or in the kanban board (
> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=496=2252)
> so we could use a volunteer here to both create those tickets and to drive
> them.
>
> We're close enough that we're ready to again visit how we want to treat
> the requirement for no flaky failures before we cut beta (
> https://cwiki.apache.org/confluence/display/CASSANDRA/Release+Lifecycle,
> "No flaky tests - All tests (Unit Tests and DTests) should pass
> consistently"). After seeing a couple releases with this requirement (4.0
> and now 4.1), I'm inclined to agree with the comment from Dinesh that we
> should revise this requirement formally if we're going to effectively
> release with flaky tests anyway; best to be honest with ourselves and
> acknowledge it's not proving to be a forcing function for changing
> behavior. If this email doesn't see much traction on this topic I'll hit up
> the dev list with a DISCUSS thread on it.
>
> The kanban for 4.1 blockers show us 13 tickets:
> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=484=2455.
> Most of them are assigned and many in progress, however we have 3
> unassigned if anyone wants to pick those up:
> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=484=2455=2160
>
>
> [New Contributors Getting Started]
> One of the three issues on 4.1 blocker list or either of the 2 failing
> tests listed above would be great areas to focus your attention!
>
> Nuts and bolts / env / etc: here's an explanation of various types of
> contribution:
> https://cassandra.apache.org/_/community.html#how-to-contribute
> An overview of the C* architecture:
> https://cassandra.apache.org/doc/latest/cassandra/architecture/overview.html
> And here's our getting started contributing guide:
> https://cassandra.apache.org/_/development/index.html
> We hang out in #cassandra-dev on https://the-asf.slack.com, and you can
> ping the @cassandra_mentors alias to reach 13 of us who have volunteered to
> mentor new contributors on the project. Looking forward to seeing you there.
>
>
> [Dev list Digest]
> https://lists.apache.org/list?dev@cassandra.apache.org:lte=2w:
>
> The