Re: Cassandra project status update 2022-08-03
> I think if we want to do this, it should be extremely easy - by which I mean automatic, really. This shouldn’t be too tricky I think? We just need to produce a diff of new test classes and methods within existing classes.

Having a CircleCI job that automatically runs all new/modified tests would be a great way to prevent most of the new flakies. We would still miss some cases, like unmodified tests that turn flaky after changing the tested code, but I'd say those are less common.

> I can probably help out by putting together something to output @Test annotated methods within a source tree, if others are able to turn this into a part of the CircleCI pre-commit task (i.e. to pick the common ancestor with trunk, 4.1 etc, and run this task for each of the outputs)

I think we would need a bash/sh shell script taking a diff file and a test directory, and returning the file path and qualified class name of every modified test class. I'd say we don't need the method names for Java tests because quite often we see flaky tests that only fail when running their entire class, so it's probably better to repeatedly run entire test classes instead of particular methods.

We would also need a similar script for Python dtests. We would probably want it to provide the full path of the modified tests (as in cqlsh_tests/test_cqlsh.py::TestCqlshSmoke::test_create_index) because those tests can be quite resource-intensive.

I think once we have those scripts we could plug their output into the CircleCI commands for repeating tests. Putting all this together seems relatively involved, so it can take us some time to get it ready. In the meantime, I think it's good practice to manually include any new/modified tests in the CircleCI config. Doing so only requires passing a few additional options to the script that generates the config, which doesn't seem to require too much effort.
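To make the idea concrete, here is a minimal sketch of what such a script could look like. It is written in Python purely for illustration (the suggestion above was bash/sh), and the assumed test layout (`test/<suite>/<package path>/<Class>Test.java`) is a guess at the project structure, not a confirmed convention:

```python
import re

def modified_test_classes(diff_text, test_root="test"):
    """Scan a unified diff for touched Java test files and return
    (file path, qualified class name) pairs.

    Assumes the hypothetical layout test/<suite>/<package dirs>/<Class>Test.java,
    where the package path starts one directory below the test root."""
    pattern = re.compile(r"^\+\+\+ b/(%s/[^/]+/.+Test\.java)$" % re.escape(test_root))
    pairs = []
    for line in diff_text.splitlines():
        m = pattern.match(line)
        if m:
            path = m.group(1)
            # Drop "test/<suite>/", strip ".java", and dot-join the package path.
            qualified = "/".join(path.split("/")[2:])[:-len(".java")].replace("/", ".")
            pairs.append((path, qualified))
    return pairs
```

Fed the output of something like `git diff $(git merge-base HEAD trunk)`, this would emit one path/class pair per modified test class, which could then be handed to the CircleCI repeated-test commands.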
On Wed, 10 Aug 2022 at 19:47, Brandon Williams wrote:
> > Side note, Butler is reporting CASSANDRA-17348 as open (it's resolved as a duplicate).
>
> This is fixed.
Re: Cassandra project status update 2022-08-03
> Side note, Butler is reporting CASSANDRA-17348 as open (it's resolved as a duplicate).

This is fixed.
Re: Cassandra project status update 2022-08-03
On Wed, 10 Aug 2022 at 17:54, Josh McKenzie wrote:
> > “We can start by putting the bar at a lower level and raise the level over time when most of the flakies that we hit are above that level.”
> > My only concern is only who and how will track that.
>
> What's Butler's logic for flagging things flaky? Maybe a "flaky low" vs. "flaky high" distinction based on failure frequency (or some much better name I'm sure someone else will come up with) could make sense?

I'd be keen to see orders of magnitude, rather than arbitrary labels. Also per CI system (having the data for basic correlation between systems will be useful in other discussions and decisions).

> Then we could focus our efforts on the ones that are flagged as failing at whatever high water mark threshold we set.

Maybe obvious, but so long as there's a way to bypass this when a flaky is identified as being a legit bug and/or in a critical component (even a 1:1M flakiness in certain components can be disastrous).

Some other questions…
- how to measure the flakiness
- how to measure post-commit rates across both CI systems
- where the flakiness labels (/orders-of-magnitude) should be recorded
- how we label flakies as being legit/critical/blocking (currently you often have to read through the comments)

Applying this manually to the remaining 4.1 blockers we have:
- CASSANDRA-17461 CASTest. 1:40 failures on circle. Looks to be about 1:2 on ci-cassandra.
- CASSANDRA-17618 InternodeEncryptionEnforcementTest. 1:167 circle. No flakies in ci-cassandra.
- CASSANDRA-17804 AutoSnapshotTtlTest. Unknown flakiness in both CIs.
- CASSANDRA-17573 PaxosRepairTest. 1:20 circle. No flakies in ci-cassandra.
- CASSANDRA-17658 KeyspaceMetricsTest. 1:20 circle. No flakies in ci-cassandra.

In addition to these, Butler lists a number of flakies against 4.1, but these are not regressions in 4.1 hence are not blockers. The jira board is currently not blocking a 4.1-beta release on non-regression flakies.

This means our releases are not blocked on overall flakies, regardless of whether there are more or fewer of them. How are we to square this with our recent stance of no releases unless green…? (loops back to my "fewer overall flakies than previous release / campground-cleaner" suggestion)

Side note, Butler is reporting CASSANDRA-17348 as open (it's resolved as a duplicate).
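As a sketch of the "orders of magnitude rather than arbitrary labels" idea: each flaky could be recorded, per CI system, as the order of magnitude of its observed failure rate. The function below is illustrative only; the bucketing convention is an assumption, not agreed project policy:

```python
from math import floor, log10

def flakiness_order(failures, runs):
    """Return the order of magnitude of a test's observed failure rate,
    e.g. 1 failure in 40 runs -> 1 (between 1:10 and 1:100), and
    1 in 167 -> 2 (between 1:100 and 1:1000). Placeholder convention."""
    if failures == 0:
        return None  # no observed flakiness in this CI system
    return floor(log10(runs / failures))
```

Applied to the blocker list above, CASTest (1:40 on circle) would be recorded as order 1 on circle and order 0 on ci-cassandra (roughly 1:2), making the relative severity visible at a glance.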
Re: Cassandra project status update 2022-08-03
> “We can start by putting the bar at a lower level and raise the level over time when most of the flakies that we hit are above that level.”
> My only concern is only who and how will track that.

What's Butler's logic for flagging things flaky? Maybe a "flaky low" vs. "flaky high" distinction based on failure frequency (or some much better name I'm sure someone else will come up with) could make sense? Then we could focus our efforts on the ones that are flagged as failing at whatever high water mark threshold we set.

It'd be trivial for me to update the script that parses test failure output for JIRA updates to flag things based on their failure frequency.

On Tue, Aug 9, 2022, at 5:24 PM, Ekaterina Dimitrova wrote:
> “In my opinion, not all flakies are equals. Some fails every 10 runs, some fails 1 in a 1000 runs.”
> Agreed, for all not new tests/regressions which are also not infra related.
>
> “We can start by putting the bar at a lower level and raise the level over time when most of the flakies that we hit are above that level.”
> My only concern is only who and how will track that.
> Also, metric for non-infra issues I guess
>
> “At the same time we should make sure that we do not introduce new flakies. One simple approach that has been mentioned several time is to run the new tests added by a given patch in a loop using one of the CircleCI tasks.”
> +1, I personally find this very valuable and more efficient than bisecting and getting back to works done in some cases months ago
>
> “We should also probably revert newly committed patch if we detect that they introduced flakies.”
> +1, not that I like my patches to be reverted but it seems as the most fair way to stick to our stated goals. But I think last time we talked about reverting, we discussed it only for trunk? Or do I remember it wrong?
> On Tue, 9 Aug 2022 at 7:58, Benjamin Lerer wrote:
>> At this point it is clear that we will probably never be able to remove some level of flakiness from our tests. For me the questions are: 1) Where do we draw the line for a release? and 2) How do we maintain that line over time?
>>
>> In my opinion, not all flakies are equals. Some fails every 10 runs, some fails 1 in a 1000 runs. I would personally draw the line based on that metric. With the circleci tasks that Andres has added we can easily get that information for a given test.
>> We can start by putting the bar at a lower level and raise the level over time when most of the flakies that we hit are above that level.
>>
>> That would allow us to minimize the risk of introducing flaky tests. We should also probably revert newly committed patch if we detect that they introduced flakies.
>>
>> What do you think?
>>
>> Le dim. 7 août 2022 à 12:24, Mick Semb Wever a écrit :
>>> With that said, I guess we can just revise on a regular basis what exactly are the last flakes and not numbers which also change quickly up and down with the first change in the Infra.
>>>
>>> +1, I am in favour of taking a pragmatic approach.
>>>
>>> If flakies are identified and triaged enough that, with correlation from both CI systems, we are confident that no legit bugs are behind them, I'm in favour of going beta.
>>>
>>> I still remain in favour of somehow incentivising reducing other flakies as well. Flakies that expose poor/limited CI infra, and/or tests that are not as resilient as they could be, are still noise that indirectly reduce our QA (and increase efforts to find and tackle those legit runtime problems). Interested in hearing input from others here that have been spending a lot of time on this front.
>>>
>>> Could it work if we say: all flakies must be ticketed, and test/infra related flakies do not block a beta release so long as there are fewer than the previous release? The intent here being pragmatic, but keeping us on a "keep the campground cleaner" trajectory…
Re: Cassandra project status update 2022-08-03
Perhaps flaky tests need to be handled differently. Is there a way to build a statistical model of the current flakiness of each test that we can then use during testing to accept the failures? If an acceptable level of flakiness is established, then when the test fails it can be run again, or multiple times, to get a sample and ensure that the failure is not statistically significant.

On Wed, Aug 10, 2022 at 8:51 AM Benedict Elliott Smith wrote:
> > We can start by putting the bar at a lower level and raise the level over time
>
> +1
>
> > One simple approach that has been mentioned several time is to run the new tests added by a given patch in a loop using one of the CircleCI tasks
>
> I think if we want to do this, it should be extremely easy - by which I mean automatic, really. This shouldn’t be too tricky I think? We just need to produce a diff of new test classes and methods within existing classes. If there doesn’t already exist tooling to do this, I can probably help out by putting together something to output @Test annotated methods within a source tree, if others are able to turn this into a part of the CircleCI pre-commit task (i.e. to pick the common ancestor with trunk, 4.1 etc, and run this task for each of the outputs). We might want to start standardising branch naming structures to support picking the upstream branch.
>
> > We should also probably revert newly committed patch if we detect that they introduced flakies.
>
> There should be a strict time limit for reverting a patch for this reason, as environments change and what is flaky now was not necessarily before.
>
> On 9 Aug 2022, at 12:57, Benjamin Lerer wrote:
>
> At this point it is clear that we will probably never be able to remove some level of flakiness from our tests. For me the questions are: 1) Where do we draw the line for a release? and 2) How do we maintain that line over time?
>
> In my opinion, not all flakies are equals. Some fails every 10 runs, some fails 1 in a 1000 runs. I would personally draw the line based on that metric. With the circleci tasks that Andres has added we can easily get that information for a given test.
> We can start by putting the bar at a lower level and raise the level over time when most of the flakies that we hit are above that level.
>
> At the same time we should make sure that we do not introduce new flakies. One simple approach that has been mentioned several time is to run the new tests added by a given patch in a loop using one of the CircleCI tasks. That would allow us to minimize the risk of introducing flaky tests. We should also probably revert newly committed patch if we detect that they introduced flakies.
>
> What do you think?
>
> Le dim. 7 août 2022 à 12:24, Mick Semb Wever a écrit :
>
>> With that said, I guess we can just revise on a regular basis what exactly are the last flakes and not numbers which also change quickly up and down with the first change in the Infra.
>>
>> +1, I am in favour of taking a pragmatic approach.
>>
>> If flakies are identified and triaged enough that, with correlation from both CI systems, we are confident that no legit bugs are behind them, I'm in favour of going beta.
>>
>> I still remain in favour of somehow incentivising reducing other flakies as well. Flakies that expose poor/limited CI infra, and/or tests that are not as resilient as they could be, are still noise that indirectly reduce our QA (and increase efforts to find and tackle those legit runtime problems). Interested in hearing input from others here that have been spending a lot of time on this front.
>>
>> Could it work if we say: all flakies must be ticketed, and test/infra related flakies do not block a beta release so long as there are fewer than the previous release? The intent here being pragmatic, but keeping us on a "keep the campground cleaner" trajectory…
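The statistical-model suggestion above could look something like the following sketch: given a test's historical flakiness rate, rerun the failing test several times and ask how likely the observed number of failures would be under flakiness alone (a binomial tail probability). The function name and the alpha threshold are illustrative assumptions, not an agreed procedure:

```python
from math import comb

def failure_is_significant(p_flaky, failures, reruns, alpha=0.01):
    """Probability of seeing at least `failures` failures in `reruns`
    attempts if the test only fails at its historical flaky rate p_flaky
    (binomial upper tail). If that probability falls below alpha, the
    failure is unlikely to be flakiness alone and should be treated as
    a possible real regression."""
    tail = sum(comb(reruns, k) * p_flaky**k * (1 - p_flaky)**(reruns - k)
               for k in range(failures, reruns + 1))
    return tail < alpha
```

For example, a test with a known 1:100 flakiness that fails 5 out of 10 reruns is almost certainly broken, whereas 1 failure in 10 reruns is consistent with its historical noise. A practical caveat: the historical rate itself is environment-dependent, so it would need to be tracked per CI system.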
Re: Cassandra project status update 2022-08-03
> We can start by putting the bar at a lower level and raise the level over time

+1

> One simple approach that has been mentioned several time is to run the new tests added by a given patch in a loop using one of the CircleCI tasks

I think if we want to do this, it should be extremely easy - by which I mean automatic, really. This shouldn’t be too tricky I think? We just need to produce a diff of new test classes and methods within existing classes. If there doesn’t already exist tooling to do this, I can probably help out by putting together something to output @Test annotated methods within a source tree, if others are able to turn this into a part of the CircleCI pre-commit task (i.e. to pick the common ancestor with trunk, 4.1 etc, and run this task for each of the outputs). We might want to start standardising branch naming structures to support picking the upstream branch.

> We should also probably revert newly committed patch if we detect that they introduced flakies.

There should be a strict time limit for reverting a patch for this reason, as environments change and what is flaky now was not necessarily before.

On 9 Aug 2022, at 12:57, Benjamin Lerer wrote:
> At this point it is clear that we will probably never be able to remove some level of flakiness from our tests. For me the questions are: 1) Where do we draw the line for a release? and 2) How do we maintain that line over time?
>
> In my opinion, not all flakies are equals. Some fails every 10 runs, some fails 1 in a 1000 runs. I would personally draw the line based on that metric. With the circleci tasks that Andres has added we can easily get that information for a given test.
> We can start by putting the bar at a lower level and raise the level over time when most of the flakies that we hit are above that level.
>
> At the same time we should make sure that we do not introduce new flakies. One simple approach that has been mentioned several time is to run the new tests added by a given patch in a loop using one of the CircleCI tasks. That would allow us to minimize the risk of introducing flaky tests. We should also probably revert newly committed patch if we detect that they introduced flakies.
>
> What do you think?
>
> Le dim. 7 août 2022 à 12:24, Mick Semb Wever a écrit :
>
>> With that said, I guess we can just revise on a regular basis what exactly are the last flakes and not numbers which also change quickly up and down with the first change in the Infra.
>>
>> +1, I am in favour of taking a pragmatic approach.
>>
>> If flakies are identified and triaged enough that, with correlation from both CI systems, we are confident that no legit bugs are behind them, I'm in favour of going beta.
>>
>> I still remain in favour of somehow incentivising reducing other flakies as well. Flakies that expose poor/limited CI infra, and/or tests that are not as resilient as they could be, are still noise that indirectly reduce our QA (and increase efforts to find and tackle those legit runtime problems). Interested in hearing input from others here that have been spending a lot of time on this front.
>>
>> Could it work if we say: all flakies must be ticketed, and test/infra related flakies do not block a beta release so long as there are fewer than the previous release? The intent here being pragmatic, but keeping us on a "keep the campground cleaner" trajectory…
Re: Cassandra project status update 2022-08-03
“In my opinion, not all flakies are equals. Some fails every 10 runs, some fails 1 in a 1000 runs.”

Agreed, for all not new tests/regressions which are also not infra related.

“We can start by putting the bar at a lower level and raise the level over time when most of the flakies that we hit are above that level.”

My only concern is only who and how will track that. Also, a metric for non-infra issues I guess.

“At the same time we should make sure that we do not introduce new flakies. One simple approach that has been mentioned several time is to run the new tests added by a given patch in a loop using one of the CircleCI tasks.”

+1, I personally find this very valuable and more efficient than bisecting and getting back to work done in some cases months ago.

“We should also probably revert newly committed patch if we detect that they introduced flakies.”

+1, not that I like my patches to be reverted but it seems the fairest way to stick to our stated goals. But I think last time we talked about reverting, we discussed it only for trunk? Or do I remember it wrong?

On Tue, 9 Aug 2022 at 7:58, Benjamin Lerer wrote:
> At this point it is clear that we will probably never be able to remove some level of flakiness from our tests. For me the questions are: 1) Where do we draw the line for a release? and 2) How do we maintain that line over time?
>
> In my opinion, not all flakies are equals. Some fails every 10 runs, some fails 1 in a 1000 runs. I would personally draw the line based on that metric. With the circleci tasks that Andres has added we can easily get that information for a given test.
> We can start by putting the bar at a lower level and raise the level over time when most of the flakies that we hit are above that level.
>
> That would allow us to minimize the risk of introducing flaky tests. We should also probably revert newly committed patch if we detect that they introduced flakies.
>
> What do you think?
>
> Le dim. 7 août 2022 à 12:24, Mick Semb Wever a écrit :
>
>> With that said, I guess we can just revise on a regular basis what exactly are the last flakes and not numbers which also change quickly up and down with the first change in the Infra.
>>
>> +1, I am in favour of taking a pragmatic approach.
>>
>> If flakies are identified and triaged enough that, with correlation from both CI systems, we are confident that no legit bugs are behind them, I'm in favour of going beta.
>>
>> I still remain in favour of somehow incentivising reducing other flakies as well. Flakies that expose poor/limited CI infra, and/or tests that are not as resilient as they could be, are still noise that indirectly reduce our QA (and increase efforts to find and tackle those legit runtime problems). Interested in hearing input from others here that have been spending a lot of time on this front.
>>
>> Could it work if we say: all flakies must be ticketed, and test/infra related flakies do not block a beta release so long as there are fewer than the previous release? The intent here being pragmatic, but keeping us on a "keep the campground cleaner" trajectory…
Re: Cassandra project status update 2022-08-03
At this point it is clear that we will probably never be able to remove some level of flakiness from our tests. For me the questions are: 1) Where do we draw the line for a release? and 2) How do we maintain that line over time?

In my opinion, not all flakies are equals. Some fails every 10 runs, some fails 1 in a 1000 runs. I would personally draw the line based on that metric. With the circleci tasks that Andres has added we can easily get that information for a given test.
We can start by putting the bar at a lower level and raise the level over time when most of the flakies that we hit are above that level.

At the same time we should make sure that we do not introduce new flakies. One simple approach that has been mentioned several time is to run the new tests added by a given patch in a loop using one of the CircleCI tasks. That would allow us to minimize the risk of introducing flaky tests. We should also probably revert newly committed patch if we detect that they introduced flakies.

What do you think?

Le dim. 7 août 2022 à 12:24, Mick Semb Wever a écrit :
>> With that said, I guess we can just revise on a regular basis what exactly are the last flakes and not numbers which also change quickly up and down with the first change in the Infra.
>
> +1, I am in favour of taking a pragmatic approach.
>
> If flakies are identified and triaged enough that, with correlation from both CI systems, we are confident that no legit bugs are behind them, I'm in favour of going beta.
>
> I still remain in favour of somehow incentivising reducing other flakies as well. Flakies that expose poor/limited CI infra, and/or tests that are not as resilient as they could be, are still noise that indirectly reduce our QA (and increase efforts to find and tackle those legit runtime problems). Interested in hearing input from others here that have been spending a lot of time on this front.
>
> Could it work if we say: all flakies must be ticketed, and test/infra related flakies do not block a beta release so long as there are fewer than the previous release? The intent here being pragmatic, but keeping us on a "keep the campground cleaner" trajectory…
Re: Cassandra project status update 2022-08-03
> With that said, I guess we can just revise on a regular basis what exactly are the last flakes and not numbers which also change quickly up and down with the first change in the Infra.

+1, I am in favour of taking a pragmatic approach.

If flakies are identified and triaged enough that, with correlation from both CI systems, we are confident that no legit bugs are behind them, I'm in favour of going beta.

I still remain in favour of somehow incentivising reducing other flakies as well. Flakies that expose poor/limited CI infra, and/or tests that are not as resilient as they could be, are still noise that indirectly reduce our QA (and increase efforts to find and tackle those legit runtime problems). Interested in hearing input from others here that have been spending a lot of time on this front.

Could it work if we say: all flakies must be ticketed, and test/infra related flakies do not block a beta release so long as there are fewer than the previous release? The intent here being pragmatic, but keeping us on a "keep the campground cleaner" trajectory…
Re: Cassandra project status update 2022-08-03
Re: 17738 - the ticket was about any new properties which are actually not of the new types. It had to guarantee that there is no disconnect between updating the Settings Virtual Table after startup and the JMX setters/getters. (In one of its “brother” tickets, the issues we found have existed since 4.0.) I bring it up as we need to ensure configuration parameters update the original Config parameters from JMX if we want the Settings Virtual Table to be properly updated after startup and thus cut the confusion for the users. This is actually a goal for this VT, stated also in our docs and the original ticket. Raising the point again as, while we still have both the VT and JMX, we need to be sure we provide consistent information for our users. I will also put a note in the Config docs to stress this and remind people. Probably when we add the update option for the Settings Virtual Table in the next version we will need to think of a better way to keep this in sync, or even start deprecating JMX, but for now this is what we have in place and we need to maintain it.

Thank you Josh for the report, it is always valuable!

About flaky tests - in my personal opinion it is more about what outstanding flaky tests we have than how many. We can have 3 which surface legit bugs, we can have 10 presenting only timeouts which are due to environmental issues. These days I see CircleCI green all the time, which is really promising, as many of our legit bugs were discovered there. With that said, I guess we can just revise on a regular basis what exactly are the last flakes and not numbers, which also change quickly up and down with the first change in the infra.

On Wed, 3 Aug 2022 at 13:17, Josh McKenzie wrote:
> Greetings everyone! Let's check in on 4.1, see how we're doing:
>
> https://butler.cassandra.apache.org/#/
> We had 4 failures on our last run. We've gone back and forth a bit with the CASTest failure, a test introduced back in CASSANDRA-12126 @Ignore'd, however that showed some legitimate failures that should be addressed by Paxos V2. If anyone from the discussion has the cycles (or someone with familiarity with the area) could take assignee on the test failure ticket (17461) and responsibility for driving it to resolution that would help clarify our efforts there. (https://issues.apache.org/jira/browse/CASSANDRA-17461)
>
> Along with that, we saw a failure in TopPartitionsTest.testServiceTopPartitionsSingleTable (cdc) and TestBootstrap.test_simultaneous_bootstrap (offheap). Given both are specific configurations of tests that ran successfully to completion in other configurations, there's a reasonable chance they're flaky, be it from the logic of the test or the CI environment in which they're executing. Neither appears to have an active JIRA associated with it in butler or in the kanban board (https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=496=2252) so we could use a volunteer here to both create those tickets and to drive them.
>
> We're close enough that we're ready to again visit how we want to treat the requirement for no flaky failures before we cut beta (https://cwiki.apache.org/confluence/display/CASSANDRA/Release+Lifecycle, "No flaky tests - All tests (Unit Tests and DTests) should pass consistently"). After seeing a couple releases with this requirement (4.0 and now 4.1), I'm inclined to agree with the comment from Dinesh that we should revise this requirement formally if we're going to effectively release with flaky tests anyway; best to be honest with ourselves and acknowledge it's not proving to be a forcing function for changing behavior. If this email doesn't see much traction on this topic I'll hit up the dev list with a DISCUSS thread on it.
>
> The kanban for 4.1 blockers shows us 13 tickets: https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=484=2455. Most of them are assigned and many in progress, however we have 3 unassigned if anyone wants to pick those up: https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=484=2455=2160
>
> [New Contributors Getting Started]
> One of the three issues on the 4.1 blocker list or either of the 2 failing tests listed above would be great areas to focus your attention!
>
> Nuts and bolts / env / etc: here's an explanation of various types of contribution: https://cassandra.apache.org/_/community.html#how-to-contribute
> An overview of the C* architecture: https://cassandra.apache.org/doc/latest/cassandra/architecture/overview.html
> And here's our getting started contributing guide: https://cassandra.apache.org/_/development/index.html
> We hang out in #cassandra-dev on https://the-asf.slack.com, and you can ping the @cassandra_mentors alias to reach 13 of us who have volunteered to mentor new contributors on the project. Looking forward to seeing you there.
>
> [Dev list Digest]
> https://lists.apache.org/list?dev@cassandra.apache.org:lte=2w:
>
> The