Re: [DISCUSS] Formalizing requirements for pre-commit patches on new CI
> choose a consistent, representative subset of stable tests that we feel give us a reasonable level of confidence in return for a reasonable amount of runtime
> ...
> Currently a dtest is being run in j8 w/wo vnodes, j8/j11 w/wo vnodes and j11 w/wo vnodes. That is 6 times total. I wonder about that ROI.
> ...
> test with the default number of vnodes, test with the default compression settings, and test with the default heap/off-heap buffers.

If I take these at face value to be true (I happen to agree with them, so I'm going to do this :)), what falls out for me:

1. Pre-commit should be an intentional smoke-testing suite, much smaller relative to post-commit than it is today
2. We should aggressively cull all low-signal pre-commit tests, suites, and configurations that aren't needed to keep post-commit stable

High signal in pre-commit (indicative; non-exhaustive):

1. Only the most commonly used JDK (JDK11 atm?)
2. Config defaults (vnodes, compression, heap/off-heap buffers, memtable format, sstable format)
3. The most popular / general / run-of-the-mill Linux distro (Debian?)

Low signal in pre-commit (indicative; non-exhaustive):

1. No vnodes
2. JDK8; JDK17
3. Non-default settings (compression off; fully mmap / no mmap; trie memtables or sstables; CDC enabled)

So this shape of thinking - I'm curious what it triggers for you Brandon, Berenguer, Andres, Ekaterina, and Mick (when you're back from the mountains ;)). You all paid down a lot of this debt in the run-up to 4.1, so you have the most recent expertise and I trust your perspectives here.

The caveat: if a failure makes it to post-commit, it's much more expensive to root cause and figure out, with much higher costs to the community's collective productivity. That said, I think we can make a lot of progress along this line of thinking. (A rough sketch of what such a pre-commit/post-commit split could look like follows the quoted history below.)

On Wed, Jul 5, 2023, at 5:54 AM, Jacek Lewandowski wrote:
> Perhaps pre-commit checks should include mostly the typical configuration of Cassandra rather than some subset of possible combinations. Like it was said somewhere above - test with the default number of vnodes, test with the default compression settings, and test with the default heap/off-heap buffers.
>
> A longer-term goal could be to isolate what depends on particular configuration options. Instead of blindly running everything with, say, vnodes enabled and disabled, isolate those tests that need to be run with those two configurations and run the rest with the default one.
>
>> ... the rule of multiplexing new or changed tests might go a long way to mitigating that ...
>
> I wonder if there is some commonality in the flaky tests reported so far, like the presence of certain statements? Also, there could be a tool that inspects coverage analysis reports and chooses the proper tests to run/multiplex because, in the end, we want to verify the changed production code in addition to the modified test files.
>
> thanks,
> Jacek
>
> On Wed, 5 Jul 2023 at 06:28, Berenguer Blasi wrote:
>> Currently a dtest is being run in j8 w/wo vnodes, j8/j11 w/wo vnodes and j11 w/wo vnodes. That is 6 times total. I wonder about that ROI.
>>
>> On dtest cluster reuse, yes, I stopped that as at the time we had lots of CI changes, an upcoming release and priorities. But when the CI starts flexing its muscles that'd be easy to pick up again as the dtest code shouldn't have changed much.
>>
>> On 4/7/23 17:11, Derek Chen-Becker wrote:
>>> Ultimately I think we have to invest in two directions: first, choose a consistent, representative subset of stable tests that we feel give us a reasonable level of confidence in return for a reasonable amount of runtime. Second, we need to invest in figuring out why certain tests fail. I strongly dislike the term "flaky" because it suggests that it's some inconsequential issue causing problems. The truth is that a test that fails is either a bug in the service code or a bug in the test. I've come to realize that the CI and build framework is way too complex for me to be able to help with much, but I would love to start chipping away at failing test bugs. I'm getting settled into my new job and I should be able to commit some regular time each week to triage and fixing starting in August, and if there are any other folks who are interested let me know.
>>>
>>> Cheers,
>>>
>>> Derek
>>>
>>> On Mon, Jul 3, 2023, 12:30 PM Josh McKenzie wrote:
>>>>> Instead of running all the tests through available CI agents every time we can have presets of tests:
>>>>
>>>> Back when I joined the project in 2014, unit tests took ~5 minutes to run on a local machine. We had pre-commit and post-commit tests as a distinction as well, but also had flakes in the pre and post batch. I'd love to see us get back to a unit test regime like that.
>>>>
>>>> The challenge we've always had is flaky tests showing up in either the pre-commit or post-commit groups and difficulty in attribution on a flaky failure to where it was introduced (not to lay blame but to educate and learn and prevent recurrence).
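To make the high/low-signal split above concrete, here is a minimal sketch of how a pre-commit smoke profile might differ from the full post-commit matrix. The profile names, environment variables, and script arguments are placeholders assumed for illustration - the real entry points would be the in-tree .build scripts from CASSANDRA-18133, discussed further down-thread - so treat this as a shape, not the project's actual interface.

```bash
#!/usr/bin/env bash
# Sketch only: profile names, env vars and script arguments below are assumptions,
# not existing build targets. The intent is that pre-commit stays a small,
# default-config, single-JDK smoke suite and everything else moves post-commit.
set -euo pipefail

PROFILE="${1:-precommit}"

case "${PROFILE}" in
  precommit)
    JDKS=(11)                    # most commonly used JDK only
    CONFIGS=(default)            # vnodes on, default compression, default buffers
    ;;
  postcommit)
    JDKS=(8 11 17)               # full JDK matrix
    CONFIGS=(default novnodes compression_off trie cdc)  # non-default configs
    ;;
  *) echo "usage: $0 [precommit|postcommit]" >&2; exit 1 ;;
esac

for jdk in "${JDKS[@]}"; do
  for cfg in "${CONFIGS[@]}"; do
    echo "=== JDK ${jdk}, config ${cfg} ==="
    # Placeholder invocation; real argument handling lives in .build/run-tests.sh
    # (CASSANDRA-18133) once that lands.
    JAVA_VERSION="${jdk}" TEST_CONFIG="${cfg}" .build/run-tests.sh unit
  done
done
```

The exact suites are the open question (that's the point of this thread); the structural idea is just that the smoke profile is a deliberate allow-list rather than "everything minus whatever we had to disable".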
Re: [DISCUSS] Formalizing requirements for pre-commit patches on new CI
Perhaps pre-commit checks should include mostly the typical configuration of Cassandra rather than some subset of possible combinations. Like it was said somewhere above - test with the default number of vnodes, test with the default compression settings, and test with the default heap/off-heap buffers. A longer-term goal could be to isolate what depends on particular configuration options. Instead of blindly running everything with, say, vnodes enabled and disabled, isolate those tests that need to be run with those two configurations and run the rest with the default one. ... the rule of multiplexing new or changed tests might go a long way to > mitigating that ... I wonder if there is some commonality in the flaky tests reported so far, like the presence of certain statements? Also, there could be a tool that inspects coverage analysis reports and chooses the proper tests to run/multiplex because, in the end, we want to verify the changed production code in addition to the modified test files. thanks, Jacek śr., 5 lip 2023 o 06:28 Berenguer Blasi napisał(a): > Currently a dtest is being ran in j8 w/wo vnodes , j8/j11 w/wo vnodes and > j11 w/wo vnodes. That is 6 times total. I wonder about that ROI. > > On dtest cluster reusage yes, I stopped that as at the time we had lots of > CI changes, an upcoming release and priorities. But when the CI starts > flexing it's muscles that'd be easy to pick up again as dtests code > shouldn't have changed much. > On 4/7/23 17:11, Derek Chen-Becker wrote: > > Ultimately I think we have to invest in two directions: first, choose a > consistent, representative subset of stable tests that we feel give us a > reasonable level of confidence in return for a reasonable amount of > runtime. Second, we need to invest in figuring out why certain tests fail. > I strongly dislike the term "flaky" because it suggests that it's some > inconsequential issue causing problems. The truth is that a test that fails > is either a bug in the service code or a bug in the test. I've come to > realize that the CI and build framework is way too complex for me to be > able to help with much, but I would love to start chipping away at failing > test bugs. I'm getting settled into my new job and I should be able to > commit some regular time each week to triage and fixing starting in August, > and if there are any other folks who are interested let me know. > > Cheers, > > Derek > > On Mon, Jul 3, 2023, 12:30 PM Josh McKenzie wrote: > >> Instead of running all the tests through available CI agents every time >> we can have presets of tests: >> >> Back when I joined the project in 2014, unit tests took ~ 5 minutes to >> run on a local machine. We had pre-commit and post-commit tests as a >> distinction as well, but also had flakes in the pre and post batch. I'd >> love to see us get back to a unit test regime like that. >> >> The challenge we've always had is flaky tests showing up in either the >> pre-commit or post-commit groups and difficulty in attribution on a flaky >> failure to where it was introduced (not to lay blame but to educate and >> learn and prevent recurrence). While historically further reduced smoke >> testing suites would just mean more flakes showing up downstream, the rule >> of multiplexing new or changed tests might go a long way to mitigating that. >> >> Should we mention in this concept how we will build the sub-projects >> (e.g. Accord) alongside Cassandra? 
>> >> I think it's an interesting question, but I also think there's no real >> dependency of process between primary mainline branches and feature >> branches. My intuition is that having the same bar (green CI, multiplex, >> don't introduce flakes, smart smoke suite tiering) would be a good idea on >> feature branches so there's not a death march right before merge, squashing >> flakes when you have to multiplex hundreds of tests before merge to >> mainline (since presumably a feature branch would impact a lot of tests). >> >> Now that I write that all out it does sound Painful. =/ >> >> On Mon, Jul 3, 2023, at 10:38 AM, Maxim Muzafarov wrote: >> >> For me, the biggest benefit of keeping the build scripts and CI >> configurations as well in the same project is that these files are >> versioned in the same way as the main sources do. This ensures that we >> can build past releases without having any annoying errors in the >> scripts, so I would say that this is a pretty necessary change. >> >> I'd like to mention the approach that could work for the projects with >> a huge amount of tests. Instead of running all the tests through >> available CI agents every time we can have presets of tests: >> - base tests (to make sure that your design basically works, the set >> will not run longer than 30 min); >> - pre-commit tests (the number of tests to make sure that we can >> safely commit new changes and fit the run into the 1-2 hour build >> timeframe); >> - nightly builds (scheduled task to build everything we ha
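The "multiplex new or changed tests" rule Jacek quotes is mechanical enough to sketch. Below is a rough illustration assuming a hypothetical wrapper around git and ant; the real multiplexer discussed later in the thread would live in the in-tree .build scripts, and the ant arguments here are assumptions rather than a documented interface.

```bash
#!/usr/bin/env bash
# Hypothetical multiplexer sketch: run every test class touched by a branch N times
# so new flakiness is caught pre-commit instead of downstream. The 'ant testsome'
# flags are assumptions; adjust to whatever the in-tree scripts actually expose.
set -euo pipefail

BASE="${1:-origin/trunk}"
REPEATS="${2:-100}"

# Test classes added or modified relative to the base branch.
mapfile -t changed < <(git diff --name-only "${BASE}...HEAD" | grep -E '^test/.*Test\.java$' || true)

[ "${#changed[@]}" -gt 0 ] || { echo "no changed tests to multiplex"; exit 0; }

for path in "${changed[@]}"; do
  # test/unit/org/apache/cassandra/FooTest.java -> org.apache.cassandra.FooTest
  class="$(echo "${path}" | sed -E 's#^test/[^/]+/##; s#\.java$##; s#/#.#g')"
  echo "multiplexing ${class} x${REPEATS}"
  for i in $(seq 1 "${REPEATS}"); do
    ant testsome -Dtest.name="${class}" \
      || { echo "FLAKY: ${class} failed on run ${i}/${REPEATS}" >&2; exit 1; }
  done
done
```

Jacek's coverage-report idea is the natural extension: instead of keying only off modified test files, key off which tests exercise the modified production classes, so a change to (say) compaction code multiplexes the compaction tests even when no test file was touched.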
Re: [DISCUSS] Formalizing requirements for pre-commit patches on new CI
Currently a dtest is being run in j8 w/wo vnodes, j8/j11 w/wo vnodes and j11 w/wo vnodes. That is 6 times total. I wonder about that ROI.

On dtest cluster reuse, yes, I stopped that as at the time we had lots of CI changes, an upcoming release and priorities. But when the CI starts flexing its muscles that'd be easy to pick up again as the dtest code shouldn't have changed much.

On 4/7/23 17:11, Derek Chen-Becker wrote:
> Ultimately I think we have to invest in two directions: first, choose a consistent, representative subset of stable tests that we feel give us a reasonable level of confidence in return for a reasonable amount of runtime. Second, we need to invest in figuring out why certain tests fail. I strongly dislike the term "flaky" because it suggests that it's some inconsequential issue causing problems. The truth is that a test that fails is either a bug in the service code or a bug in the test. I've come to realize that the CI and build framework is way too complex for me to be able to help with much, but I would love to start chipping away at failing test bugs. I'm getting settled into my new job and I should be able to commit some regular time each week to triage and fixing starting in August, and if there are any other folks who are interested let me know.
>
> Cheers,
>
> Derek
>
> On Mon, Jul 3, 2023, 12:30 PM Josh McKenzie wrote:
>>> Instead of running all the tests through available CI agents every time we can have presets of tests:
>>
>> Back when I joined the project in 2014, unit tests took ~5 minutes to run on a local machine. We had pre-commit and post-commit tests as a distinction as well, but also had flakes in the pre and post batch. I'd love to see us get back to a unit test regime like that.
>>
>> The challenge we've always had is flaky tests showing up in either the pre-commit or post-commit groups and difficulty in attribution on a flaky failure to where it was introduced (not to lay blame but to educate and learn and prevent recurrence). While historically further reduced smoke testing suites would just mean more flakes showing up downstream, the rule of multiplexing new or changed tests might go a long way to mitigating that.
>>
>>> Should we mention in this concept how we will build the sub-projects (e.g. Accord) alongside Cassandra?
>>
>> I think it's an interesting question, but I also think there's no real dependency of process between primary mainline branches and feature branches. My intuition is that having the same bar (green CI, multiplex, don't introduce flakes, smart smoke suite tiering) would be a good idea on feature branches so there's not a death march right before merge, squashing flakes when you have to multiplex hundreds of tests before merge to mainline (since presumably a feature branch would impact a lot of tests).
>>
>> Now that I write that all out it does sound Painful. =/
>>
>> On Mon, Jul 3, 2023, at 10:38 AM, Maxim Muzafarov wrote:
>>> For me, the biggest benefit of keeping the build scripts and CI configurations in the same project is that these files are versioned in the same way as the main sources. This ensures that we can build past releases without having any annoying errors in the scripts, so I would say that this is a pretty necessary change.
>>>
>>> I'd like to mention an approach that could work for projects with a huge amount of tests. Instead of running all the tests through available CI agents every time, we can have presets of tests:
>>> - base tests (to make sure that your design basically works; the set will not run longer than 30 min);
>>> - pre-commit tests (the number of tests to make sure that we can safely commit new changes and fit the run into the 1-2 hour build timeframe);
>>> - nightly builds (scheduled task to build everything we have once a day and notify the ML if that build fails);
>>>
>>> My question here is: should we mention in this concept how we will build the sub-projects (e.g. Accord) alongside Cassandra?
>>>
>>> On Fri, 30 Jun 2023 at 23:19, Josh McKenzie wrote:
>>>>> Not everyone will have access to such resources, if all you have is 1 such pod you'll be waiting a long time (in theory one month, and you actually need a few bigger pods for some of the more extensive tests, e.g. large upgrade tests)….
>>>>
>>>> One thing worth calling out: I believe we have a lot of low hanging fruit in the domain of "find long running tests and speed them up". Early 2022 I was poking around at our unit tests on CASSANDRA-17371 and found that 2.62% of our tests made up 20.4% of our runtime (https://docs.google.com/spreadsheets/d/1-tkH-hWBlEVInzMjLmJz4wABV6_mGs-2-NNM2XoVTcA/edit#gid=1501761592). This kind of finding is pretty consistent; I remember Carl Yeksigian at NGCC back in like 2015 axing an hour plus of aggregate runtime by just devoting an afternoon to looking at a few badly behaving tests.
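One way to keep Maxim's three presets honest is to give each tier an explicit time budget and fail the run when it is exceeded, so the 30-minute and 1-2-hour targets stay enforced rather than aspirational. A rough sketch follows; the target names and budget values are placeholders, not existing build.xml targets.

```bash
#!/usr/bin/env bash
# Sketch of base / pre-commit / nightly presets with enforced runtime caps.
# Target names and budgets are illustrative assumptions.
set -euo pipefail

preset="${1:?usage: $0 base|precommit|nightly}"

case "${preset}" in
  base)      budget="30m";  cmds=("ant jar" "ant test-smoke") ;;        # hypothetical smoke target
  precommit) budget="120m"; cmds=("ant test" "ant test-jvm-dtest") ;;   # placeholder targets
  nightly)   budget="";     cmds=(".build/run-tests.sh all") ;;         # uncapped, scheduled daily
  *) echo "unknown preset: ${preset}" >&2; exit 1 ;;
esac

for cmd in "${cmds[@]}"; do
  if [ -z "${budget}" ]; then
    bash -c "${cmd}"
  else
    # timeout exits non-zero if the command overruns, failing the preset loudly.
    timeout --foreground "${budget}" bash -c "${cmd}" \
      || { echo "preset '${preset}' blew its ${budget} budget on: ${cmd}" >&2; exit 1; }
  fi
done
```

A nightly job would run the uncapped preset on a schedule and mail the list on failure, matching Maxim's third bullet.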
Re: [DISCUSS] Formalizing requirements for pre-commit patches on new CI
Ultimately I think we have to invest in two directions: first, choose a consistent, representative subset of stable tests that we feel give us a reasonable level of confidence in return for a reasonable amount of runtime. Second, we need to invest in figuring out why certain tests fail. I strongly dislike the term "flaky" because it suggests that it's some inconsequential issue causing problems. The truth is that a test that fails is either a bug in the service code or a bug in the test. I've come to realize that the CI and build framework is way too complex for me to be able to help with much, but I would love to start chipping away at failing test bugs. I'm getting settled into my new job and I should be able to commit some regular time each week to triage and fixing starting in August, and if there are any other folks who are interested let me know. Cheers, Derek On Mon, Jul 3, 2023, 12:30 PM Josh McKenzie wrote: > Instead of running all the tests through available CI agents every time we > can have presets of tests: > > Back when I joined the project in 2014, unit tests took ~ 5 minutes to run > on a local machine. We had pre-commit and post-commit tests as a > distinction as well, but also had flakes in the pre and post batch. I'd > love to see us get back to a unit test regime like that. > > The challenge we've always had is flaky tests showing up in either the > pre-commit or post-commit groups and difficulty in attribution on a flaky > failure to where it was introduced (not to lay blame but to educate and > learn and prevent recurrence). While historically further reduced smoke > testing suites would just mean more flakes showing up downstream, the rule > of multiplexing new or changed tests might go a long way to mitigating that. > > Should we mention in this concept how we will build the sub-projects (e.g. > Accord) alongside Cassandra? > > I think it's an interesting question, but I also think there's no real > dependency of process between primary mainline branches and feature > branches. My intuition is that having the same bar (green CI, multiplex, > don't introduce flakes, smart smoke suite tiering) would be a good idea on > feature branches so there's not a death march right before merge, squashing > flakes when you have to multiplex hundreds of tests before merge to > mainline (since presumably a feature branch would impact a lot of tests). > > Now that I write that all out it does sound Painful. =/ > > On Mon, Jul 3, 2023, at 10:38 AM, Maxim Muzafarov wrote: > > For me, the biggest benefit of keeping the build scripts and CI > configurations as well in the same project is that these files are > versioned in the same way as the main sources do. This ensures that we > can build past releases without having any annoying errors in the > scripts, so I would say that this is a pretty necessary change. > > I'd like to mention the approach that could work for the projects with > a huge amount of tests. Instead of running all the tests through > available CI agents every time we can have presets of tests: > - base tests (to make sure that your design basically works, the set > will not run longer than 30 min); > - pre-commit tests (the number of tests to make sure that we can > safely commit new changes and fit the run into the 1-2 hour build > timeframe); > - nightly builds (scheduled task to build everything we have once a > day and notify the ML if that build fails); > > > My question here is: > Should we mention in this concept how we will build the sub-projects > (e.g. 
Accord) alongside Cassandra? > > On Fri, 30 Jun 2023 at 23:19, Josh McKenzie wrote: > > > > Not everyone will have access to such resources, if all you have is 1 > such pod you'll be waiting a long time (in theory one month, and you > actually need a few bigger pods for some of the more extensive tests, e.g. > large upgrade tests)…. > > > > One thing worth calling out: I believe we have a lot of low hanging > fruit in the domain of "find long running tests and speed them up". Early > 2022 I was poking around at our unit tests on CASSANDRA-17371 and found > that 2.62% of our tests made up 20.4% of our runtime ( > https://docs.google.com/spreadsheets/d/1-tkH-hWBlEVInzMjLmJz4wABV6_mGs-2-NNM2XoVTcA/edit#gid=1501761592). > This kind of finding is pretty consistent; I remember Carl Yeksigian at > NGCC back in like 2015 axing an hour plus of aggregate runtime by just > devoting an afternoon to looking at a few badly behaving tests. > > > > I'd like to see us move from "1 pod 1 month" down to something a lot > more manageable. :) > > > > Shout-out to Berenger's work on CASSANDRA-16951 for dtest cluster reuse > (not yet merged), and I have CASSANDRA-15196 to remove the CDC vs. non > segment allocator distinction and axe the test-cdc target entirely. > > > > Ok. Enough of that. Don't want to derail us, just wanted to call out > that the state of things today isn't the way it has to be. > > > > On Fri, Jun 30, 2023, at 4:41 PM, Mic
Re: [DISCUSS] Formalizing requirements for pre-commit patches on new CI
> Instead of running all the tests through available CI agents every time we > can have presets of tests: Back when I joined the project in 2014, unit tests took ~ 5 minutes to run on a local machine. We had pre-commit and post-commit tests as a distinction as well, but also had flakes in the pre and post batch. I'd love to see us get back to a unit test regime like that. The challenge we've always had is flaky tests showing up in either the pre-commit or post-commit groups and difficulty in attribution on a flaky failure to where it was introduced (not to lay blame but to educate and learn and prevent recurrence). While historically further reduced smoke testing suites would just mean more flakes showing up downstream, the rule of multiplexing new or changed tests might go a long way to mitigating that. > Should we mention in this concept how we will build the sub-projects (e.g. > Accord) alongside Cassandra? I think it's an interesting question, but I also think there's no real dependency of process between primary mainline branches and feature branches. My intuition is that having the same bar (green CI, multiplex, don't introduce flakes, smart smoke suite tiering) would be a good idea on feature branches so there's not a death march right before merge, squashing flakes when you have to multiplex hundreds of tests before merge to mainline (since presumably a feature branch would impact a lot of tests). Now that I write that all out it does sound Painful. =/ On Mon, Jul 3, 2023, at 10:38 AM, Maxim Muzafarov wrote: > For me, the biggest benefit of keeping the build scripts and CI > configurations as well in the same project is that these files are > versioned in the same way as the main sources do. This ensures that we > can build past releases without having any annoying errors in the > scripts, so I would say that this is a pretty necessary change. > > I'd like to mention the approach that could work for the projects with > a huge amount of tests. Instead of running all the tests through > available CI agents every time we can have presets of tests: > - base tests (to make sure that your design basically works, the set > will not run longer than 30 min); > - pre-commit tests (the number of tests to make sure that we can > safely commit new changes and fit the run into the 1-2 hour build > timeframe); > - nightly builds (scheduled task to build everything we have once a > day and notify the ML if that build fails); > > > My question here is: > Should we mention in this concept how we will build the sub-projects > (e.g. Accord) alongside Cassandra? > > On Fri, 30 Jun 2023 at 23:19, Josh McKenzie wrote: > > > > Not everyone will have access to such resources, if all you have is 1 such > > pod you'll be waiting a long time (in theory one month, and you actually > > need a few bigger pods for some of the more extensive tests, e.g. large > > upgrade tests)…. > > > > One thing worth calling out: I believe we have a lot of low hanging fruit > > in the domain of "find long running tests and speed them up". Early 2022 I > > was poking around at our unit tests on CASSANDRA-17371 and found that 2.62% > > of our tests made up 20.4% of our runtime > > (https://docs.google.com/spreadsheets/d/1-tkH-hWBlEVInzMjLmJz4wABV6_mGs-2-NNM2XoVTcA/edit#gid=1501761592). > > This kind of finding is pretty consistent; I remember Carl Yeksigian at > > NGCC back in like 2015 axing an hour plus of aggregate runtime by just > > devoting an afternoon to looking at a few badly behaving tests. 
> > > > I'd like to see us move from "1 pod 1 month" down to something a lot more > > manageable. :) > > > > Shout-out to Berenger's work on CASSANDRA-16951 for dtest cluster reuse > > (not yet merged), and I have CASSANDRA-15196 to remove the CDC vs. non > > segment allocator distinction and axe the test-cdc target entirely. > > > > Ok. Enough of that. Don't want to derail us, just wanted to call out that > > the state of things today isn't the way it has to be. > > > > On Fri, Jun 30, 2023, at 4:41 PM, Mick Semb Wever wrote: > > > > - There are hw constraints, is there any approximation on how long it will > > take to run all tests? Or is there a stated goal that we will strive to > > reach as a project? > > > > Have to defer to Mick on this; I don't think the changes outlined here will > > materially change the runtime on our currently donated nodes in CI. > > > > > > > > A recent comparison between CircleCI and the jenkins code underneath > > ci-cassandra.a.o was done (not yet shared) to whether a 'repeatable CI' can > > be both lower cost and same turn around time. The exercise undercovered > > that there's a lot of waste in our jenkins builds, and once the jenkinsfile > > becomes standalone it can stash and unstash the build results. From this a > > conservative estimate was even if we only brought the build time to be > > double that of circleci it will still be significantly lower cost while
Re: [DISCUSS] Formalizing requirements for pre-commit patches on new CI
For me, the biggest benefit of keeping the build scripts and CI configurations as well in the same project is that these files are versioned in the same way as the main sources do. This ensures that we can build past releases without having any annoying errors in the scripts, so I would say that this is a pretty necessary change. I'd like to mention the approach that could work for the projects with a huge amount of tests. Instead of running all the tests through available CI agents every time we can have presets of tests: - base tests (to make sure that your design basically works, the set will not run longer than 30 min); - pre-commit tests (the number of tests to make sure that we can safely commit new changes and fit the run into the 1-2 hour build timeframe); - nightly builds (scheduled task to build everything we have once a day and notify the ML if that build fails); My question here is: Should we mention in this concept how we will build the sub-projects (e.g. Accord) alongside Cassandra? On Fri, 30 Jun 2023 at 23:19, Josh McKenzie wrote: > > Not everyone will have access to such resources, if all you have is 1 such > pod you'll be waiting a long time (in theory one month, and you actually need > a few bigger pods for some of the more extensive tests, e.g. large upgrade > tests)…. > > One thing worth calling out: I believe we have a lot of low hanging fruit in > the domain of "find long running tests and speed them up". Early 2022 I was > poking around at our unit tests on CASSANDRA-17371 and found that 2.62% of > our tests made up 20.4% of our runtime > (https://docs.google.com/spreadsheets/d/1-tkH-hWBlEVInzMjLmJz4wABV6_mGs-2-NNM2XoVTcA/edit#gid=1501761592). > This kind of finding is pretty consistent; I remember Carl Yeksigian at NGCC > back in like 2015 axing an hour plus of aggregate runtime by just devoting an > afternoon to looking at a few badly behaving tests. > > I'd like to see us move from "1 pod 1 month" down to something a lot more > manageable. :) > > Shout-out to Berenger's work on CASSANDRA-16951 for dtest cluster reuse (not > yet merged), and I have CASSANDRA-15196 to remove the CDC vs. non segment > allocator distinction and axe the test-cdc target entirely. > > Ok. Enough of that. Don't want to derail us, just wanted to call out that the > state of things today isn't the way it has to be. > > On Fri, Jun 30, 2023, at 4:41 PM, Mick Semb Wever wrote: > > - There are hw constraints, is there any approximation on how long it will > take to run all tests? Or is there a stated goal that we will strive to reach > as a project? > > Have to defer to Mick on this; I don't think the changes outlined here will > materially change the runtime on our currently donated nodes in CI. > > > > A recent comparison between CircleCI and the jenkins code underneath > ci-cassandra.a.o was done (not yet shared) to whether a 'repeatable CI' can > be both lower cost and same turn around time. The exercise undercovered that > there's a lot of waste in our jenkins builds, and once the jenkinsfile > becomes standalone it can stash and unstash the build results. From this a > conservative estimate was even if we only brought the build time to be double > that of circleci it will still be significantly lower cost while still using > on-demand ec2 instances. (The goal is to use spot instances.) > > The real problem here is that our CI pipeline uses ~1000 containers. > ci-cassandra.a.o only has 100 executors (and a few of these at any time are > often down for disk self-cleaning). 
The idea with 'repeatable CI', and to a > broader extent Josh's opening email, is that no one will need to use > ci-cassandra.a.o for pre-commit work anymore. For post-commit we don't care > if it takes 7 hours (we care about stability of results, which 'repeatable > CI' also helps us with). > > While pre-commit testing will be more accessible to everyone, it will still > depend on the resources you have access to. For the fastest turn-around > times you will need a k8s cluster that can spawn 1000 pods (4cpu, 8GB ram) > which will run for up to 1-30 minutes, or the equivalent. Not everyone will > have access to such resources, if all you have is 1 such pod you'll be > waiting a long time (in theory one month, and you actually need a few bigger > pods for some of the more extensive tests, e.g. large upgrade tests)…. > >
Re: [DISCUSS] Formalizing requirements for pre-commit patches on new CI
> Not everyone will have access to such resources, if all you have is 1 such pod you'll be waiting a long time (in theory one month, and you actually need a few bigger pods for some of the more extensive tests, e.g. large upgrade tests)….

One thing worth calling out: I believe we have *a lot* of low hanging fruit in the domain of "find long running tests and speed them up". Early 2022 I was poking around at our unit tests on CASSANDRA-17371 and found that *2.62% of our tests made up 20.4% of our runtime* (https://docs.google.com/spreadsheets/d/1-tkH-hWBlEVInzMjLmJz4wABV6_mGs-2-NNM2XoVTcA/edit#gid=1501761592). This kind of finding is pretty consistent; I remember Carl Yeksigian at NGCC back in like 2015 axing an hour plus of aggregate runtime by just devoting an afternoon to looking at a few badly behaving tests.

I'd like to see us move from "1 pod 1 month" down to something a lot more manageable. :)

Shout-out to Berenguer's work on CASSANDRA-16951 for dtest cluster reuse (not yet merged), and I have CASSANDRA-15196 to remove the CDC vs. non-CDC segment allocator distinction and axe the test-cdc target entirely.

Ok. Enough of that. Don't want to derail us, just wanted to call out that the state of things today isn't the way it has to be.

On Fri, Jun 30, 2023, at 4:41 PM, Mick Semb Wever wrote:
>>> - There are hw constraints, is there any approximation on how long it will take to run all tests? Or is there a stated goal that we will strive to reach as a project?
>>
>> Have to defer to Mick on this; I don't think the changes outlined here will materially change the runtime on our currently donated nodes in CI.
>
> A recent comparison between CircleCI and the jenkins code underneath ci-cassandra.a.o was done (not yet shared) as to whether a 'repeatable CI' can be both lower cost and the same turnaround time. The exercise uncovered that there's a lot of waste in our jenkins builds, and once the jenkinsfile becomes standalone it can stash and unstash the build results. From this, a conservative estimate was that even if we only brought the build time down to double that of circleci it would still be significantly lower cost while still using on-demand ec2 instances. (The goal is to use spot instances.)
>
> The real problem here is that our CI pipeline uses ~1000 containers. ci-cassandra.a.o only has 100 executors (and a few of these at any time are often down for disk self-cleaning). The idea with 'repeatable CI', and to a broader extent Josh's opening email, is that no one will need to use ci-cassandra.a.o for pre-commit work anymore. For post-commit we don't care if it takes 7 hours (we care about stability of results, which 'repeatable CI' also helps us with).
>
> While pre-commit testing will be more accessible to everyone, it will still depend on the resources you have access to. For the fastest turnaround times you will need a k8s cluster that can spawn 1000 pods (4cpu, 8GB ram) which will run for up to 1-30 minutes, or the equivalent. Not everyone will have access to such resources; if all you have is 1 such pod you'll be waiting a long time (in theory one month, and you actually need a few bigger pods for some of the more extensive tests, e.g. large upgrade tests)….
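For anyone who wants to chase that low-hanging fruit, the CASSANDRA-17371-style analysis is easy to approximate locally from the JUnit XML that ant already writes. A rough sketch - the report directory is an assumption for your checkout, and the sed pattern relies on the suite's name attribute appearing before time, which is how ant orders attributes:

```bash
#!/usr/bin/env bash
# Rank test suites by runtime and show how much of the total the slowest 20 consume.
# Assumes ant-style TEST-*.xml reports; adjust REPORT_DIR to where your run writes them.
set -euo pipefail

REPORT_DIR="${1:-build/test/output}"

grep -rho '<testsuite[^>]*>' "${REPORT_DIR}" \
  | sed -En 's/.*name="([^"]+)".*time="([^"]+)".*/\2 \1/p' \
  | sort -rn \
  | awk '{ t[NR] = $1; n[NR] = $2; total += $1 }
         END {
           cum = 0
           for (i = 1; i <= 20 && i <= NR; i++) {
             cum += t[i]
             printf "%8.1fs  %5.1f%% cumulative  %s\n", t[i], 100 * cum / total, n[i]
           }
         }'
```

In Josh's 2022 numbers, the top sliver of a list like this (2.62% of tests) accounted for 20.4% of the runtime, which is why targeted speedups pay off so quickly.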
Re: [DISCUSS] Formalizing requirements for pre-commit patches on new CI
>> - There are hw constraints, is there any approximation on how long it will take to run all tests? Or is there a stated goal that we will strive to reach as a project?
>
> Have to defer to Mick on this; I don't think the changes outlined here will materially change the runtime on our currently donated nodes in CI.

A recent comparison between CircleCI and the jenkins code underneath ci-cassandra.a.o was done (not yet shared) as to whether a 'repeatable CI' can be both lower cost and the same turnaround time. The exercise uncovered that there's a lot of waste in our jenkins builds, and once the jenkinsfile becomes standalone it can stash and unstash the build results. From this, a conservative estimate was that even if we only brought the build time down to double that of circleci it would still be significantly lower cost while still using on-demand ec2 instances. (The goal is to use spot instances.)

The real problem here is that our CI pipeline uses ~1000 containers. ci-cassandra.a.o only has 100 executors (and a few of these at any time are often down for disk self-cleaning). The idea with 'repeatable CI', and to a broader extent Josh's opening email, is that no one will need to use ci-cassandra.a.o for pre-commit work anymore. For post-commit we don't care if it takes 7 hours (we care about stability of results, which 'repeatable CI' also helps us with).

While pre-commit testing will be more accessible to everyone, it will still depend on the resources you have access to. For the fastest turnaround times you will need a k8s cluster that can spawn 1000 pods (4cpu, 8GB ram) which will run for up to 1-30 minutes, or the equivalent. Not everyone will have access to such resources; if all you have is 1 such pod you'll be waiting a long time (in theory one month, and you actually need a few bigger pods for some of the more extensive tests, e.g. large upgrade tests)….
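A quick back-of-envelope check on those figures (the 1000-container and 100-executor counts are Mick's; the ~25-minute average is an assumption, since jobs run "up to 1-30 minutes"):

```bash
containers=1000
avg_minutes=25   # assumed mean per container; actual jobs run roughly 1-30 minutes

# One 4cpu/8GB pod running everything serially:
echo "$(( containers * avg_minutes / 60 / 24 )) days"    # ~17 days, i.e. "in theory one month"
                                                         # once overhead and the larger
                                                         # upgrade-test pods are included
# 100 executors with perfect packing:
echo "$(( containers * avg_minutes / 60 / 100 )) hours"  # ~4 hours; closer to 7 on
                                                         # ci-cassandra.a.o with queueing,
                                                         # retries, and dead agents
```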
Re: [DISCUSS] Formalizing requirements for pre-commit patches on new CI
All great questions I don't have answers to, Ekaterina. :) Thoughts though:

> - Currently we run at most two parallel CI runs in Jenkins-dev, I guess you will try to improve that limitation?

If we get to using cloud-based resources for CI instead of our donated hardware w/ a budget, we could theoretically have a world where we could run more jobs at a time on ASF infra. Managing a monthly recurring spend on CI for a bunch of committers around the world with different sponsors is outside the scope of what we're targeting, but the work we're doing now will enable us to pursue that as a potential option in the future.

> - There are hw constraints, is there any approximation on how long it will take to run all tests? Or is there a stated goal that we will strive to reach as a project?

Have to defer to Mick on this; I don't think the changes outlined here will materially change the runtime on our currently donated nodes in CI. It'd be faster if we spun up cloud resources; we've gone back and forth on that topic too (using spot instances, more resilience in the face of that, etc.), but we're keeping that path separate so we can bite off manageable chunks at a time.

> - Bringing scripts in-tree will make it easier to add a multiplexer which we miss at the moment, that's great. (Running jobs in a loop helps a lot with flaky tests.) Also makes it easier to add any new test suites

Definitely; this should have been in the doc (and is in a few others on the topic that cover related bits). I'll add a bullet about multiplexing changed or newly added tests.

On Fri, Jun 30, 2023, at 2:38 PM, Ekaterina Dimitrova wrote:
> Thank you, Josh and Mick
>
> Immediate questions on my mind:
> - Currently we run at most two parallel CI runs in Jenkins-dev, I guess you will try to improve that limitation?
> - There are hw constraints, is there any approximation on how long it will take to run all tests? Or is there a stated goal that we will strive to reach as a project?
> - Bringing scripts in-tree will make it easier to add a multiplexer which we miss at the moment, that's great. (Running jobs in a loop helps a lot with flaky tests.) Also makes it easier to add any new test suites
>
> On Fri, 30 Jun 2023 at 13:35, Derek Chen-Becker wrote:
>> Thanks Josh, this looks great! I think the constraints you've outlined are reasonable for an initial attempt. We can always evolve if we run into issues.
>>
>> Cheers,
>>
>> Derek
>>
>> On Fri, Jun 30, 2023 at 11:19 AM Josh McKenzie wrote:
>>> Context: we're looking to get away from having split CircleCI and ASF CI as well as getting ASF CI to a stable state. There's a variety of reasons why it's flaky (orchestration, heterogeneous hardware, hardware failures, flaky tests, non-deterministic runs, noisy neighbors, etc), many of which Mick has been making great headway on starting to address.
>>>
>>> If you're curious see:
>>> - Mick's 2023/01/09 email thread on CI: https://lists.apache.org/thread/fqdvqkjmz6w8c864vw98ymvb1995lcy4
>>> - Mick's 2023/04/26 email thread on CI: https://lists.apache.org/thread/xb80v6r857dz5rlm5ckcn69xcl4shvbq
>>> - CASSANDRA-18137: epic for "Repeatable ci-cassandra.a.o": https://issues.apache.org/jira/browse/CASSANDRA-18137
>>> - CASSANDRA-18133: In-tree build scripts: https://issues.apache.org/jira/browse/CASSANDRA-18133
>>>
>>> What's fallen out from this: the new reference CI will have the following logical layers:
>>> 1. ant
>>> 2. build/test scripts that set up the env. See run-tests.sh and run-python-dtests.sh here: https://github.com/thelastpickle/cassandra/tree/0aecbd873ff4de5474fe15efac4cdde10b603c7b/.build
>>> 3. dockerized build/test scripts that have containerized the flow of 1 and 2. See: https://github.com/thelastpickle/cassandra/tree/0aecbd873ff4de5474fe15efac4cdde10b603c7b/.build/docker
>>> 4. CI integrations. See generation of the unified test report in build.xml: https://github.com/thelastpickle/cassandra/blame/mck/18133/trunk/build.xml#L1794-L1817
>>> 5. Optional full CI lifecycle w/ Jenkins running in a container (full stack setup, run, teardown; pending)
>>>
>>> **I want to let everyone know the high level structure of how this is shaping up, as this is a change that will directly impact the work of *all of us* on the project.**
>>>
>>> In terms of our goals, the chief goals I'd like to call out in this context are:
>>> * ASF CI needs to be and remain consistent
>>> * contributors need a turnkey way to validate their work before merging that they can accelerate by throwing resources at it.
>>>
>>> We as a project need to determine what is *required* to run in a CI environment to consider that run certified for merge. Where Mick and I landed through a lot of back and forth is that the following would be required:
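Layers 2 and 3 above are concrete enough to sketch what a "certified for merge" local run might look like. The scripts exist in the linked .build tree, but their exact interface is what CASSANDRA-18133 is still settling, so the image name, script arguments, and limits below are assumptions; the 4 CPU / 8 GB shape comes from Mick's pod description elsewhere in the thread, and the pinning addresses the "same hardware and time constraints as ASF CI" requirement quoted in full further down.

```bash
#!/usr/bin/env bash
# Sketch: run the in-tree reference test script inside a container pinned to roughly
# the same shape as an ASF CI executor. Image name, script arguments and limits are
# illustrative assumptions, not the project's documented interface.
set -euo pipefail

IMAGE="apache/cassandra-testing-ubuntu2004-java11:latest"   # placeholder image tag

docker run --rm \
  --cpus=4 --memory=8g --memory-swap=8g \
  -v "$(pwd):/home/cassandra/cassandra" \
  -w /home/cassandra/cassandra \
  "${IMAGE}" \
  .build/run-tests.sh unit
```

The reason to pin resources at all: a green run on an over-provisioned 64-core workstation says little about how the same tests behave on the donated hardware, so capping CPU and memory keeps local results comparable to ASF CI.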
Re: [DISCUSS] Formalizing requirements for pre-commit patches on new CI
Thank you, Josh and Mick Immediate questions on my mind: - Currently we run at most two parallel CI runs in Jenkins-dev, I guess you will try to improve that limitation? - There are hw constraints, is there any approximation on how long it will take to run all tests? Or is there a stated goal that we will strive to reach as a project? - Bringing scripts in-tree will make it easier to add a multiplexer which we miss at the moment, that’s great. (Running jobs in a loop, helps a lot with flaky tests) . Also makes it easier to add any new test suites On Fri, 30 Jun 2023 at 13:35, Derek Chen-Becker wrote: > Thanks Josh, this looks great! I think the constraints you've outlined are > reasonable for an initial attempt. We can always evolve if we run into > issues. > > Cheers, > > Derek > > On Fri, Jun 30, 2023 at 11:19 AM Josh McKenzie > wrote: > >> Context: we're looking to get away from having split CircleCI and ASF CI >> as well >> as getting ASF CI to a stable state. There's a variety of reasons why >> it's flaky >> (orchestration, heterogenous hardware, hardware failures, flaky tests, >> non-deterministic runs, noisy neighbors, etc), many of which Mick has been >> making great headway on starting to address. >> >> If you're curious see: >> - Mick's 2023/01/09 email thread on CI: >> https://lists.apache.org/thread/fqdvqkjmz6w8c864vw98ymvb1995lcy4 >> - Mick's 2023/04/26 email thread on CI: >> https://lists.apache.org/thread/xb80v6r857dz5rlm5ckcn69xcl4shvbq >> - CASSANDRA-18137: epic for "Repeatable ci-cassandra.a.o": >> https://issues.apache.org/jira/browse/CASSANDRA-18137 >> - CASSANDRA-18133: In-tree build scripts: >> https://issues.apache.org/jira/browse/CASSANDRA-18133 >> >> What's fallen out from this: the new reference CI will have the following >> logical layers: >> 1. ant >> 2. build/test scripts that setup the env. See run-tests.sh and >> run-python-dtests.sh here: >> >> https://github.com/thelastpickle/cassandra/tree/0aecbd873ff4de5474fe15efac4cdde10b603c7b/.build >> 3. dockerized build/test scripts that have containerized the flow of 1 >> and 2. See: >> >> https://github.com/thelastpickle/cassandra/tree/0aecbd873ff4de5474fe15efac4cdde10b603c7b/.build/docker >> 4. CI integrations. See generation of unified test report in build.xml: >> >> https://github.com/thelastpickle/cassandra/blame/mck/18133/trunk/build.xml#L1794-L1817 >> ) >> 5. Optional full CI lifecycle w/Jenkins running in a container (full stack >> setup, run, teardown, pending) >> >> >> *I want to let everyone know the high level structure of how this is >> shaping up,* >> >> *as this is a change that will directly impact the work of *all of us* on >> the* >> *project.* >> >> In terms of our goals, the chief goals I'd like to call out in this >> context are: >> * ASF CI needs to be and remain consistent >> * contributors need a turnkey way to validate their work before merging >> that >> they can accelerate by throwing resources at it. >> >> We as a project need to determine what is *required* to run in a CI >> environment >> to consider that run certified for merge. Where Mick and I landed >> through a lot >> of back and forth is that the following would be required: >> 1. used ant / pytest to build and run tests >> 2. used the reference scripts being changed in CASSANDRA-18133 (in-tree >> .build/) >> to setup and execute your test environment >> 3. 
constrained your runtime environment to the same hardware and time >> constraints we use in ASF CI, within reason (CPU count independent of >> speed, >> memory size and disk size independent of hardware specs, etc) >> 4. reported test results in a unified fashion that has all the >> information we >> normally get from a test run >> 5. (maybe) Parallelized the tests across the same split lines as upstream >> ASF >> (i.e. no weird env specific neighbor / scheduling flakes) >> >> Last but not least is the "What do we do with CircleCI?" angle. The >> current >> thought is we allow people to continue using it with the stated goal of >> migrating the circle config over to using the unified build scripts as >> well and >> get it in compliance with the above requirements. >> >> For reference, here's a gdoc where we've hashed this out: >> >> https://docs.google.com/document/d/1TaYMvE5ryOYX03cxzY6XzuUS651fktVER02JHmZR5FU/edit?usp=sharing >> >> So my questions for the community here: >> 1. What's missing from the above conceptualization of the problem? >> 2. Are the constraints too strong? Too weak? Just right? >> >> Thanks everyone, and happy Friday. ;) >> >> ~Josh >> > > > -- > +---+ > | Derek Chen-Becker | > | GPG Key available at https://keybase.io/dchenbecker and | > | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org | > | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7 7F42 AFC5 AFEE 96E4 6ACC | > +---
Re: [DISCUSS] Formalizing requirements for pre-commit patches on new CI
Thanks Josh, this looks great! I think the constraints you've outlined are reasonable for an initial attempt. We can always evolve if we run into issues.

Cheers,

Derek

On Fri, Jun 30, 2023 at 11:19 AM Josh McKenzie wrote:
> Context: we're looking to get away from having split CircleCI and ASF CI as well as getting ASF CI to a stable state. There's a variety of reasons why it's flaky (orchestration, heterogeneous hardware, hardware failures, flaky tests, non-deterministic runs, noisy neighbors, etc), many of which Mick has been making great headway on starting to address.
>
> If you're curious see:
> - Mick's 2023/01/09 email thread on CI: https://lists.apache.org/thread/fqdvqkjmz6w8c864vw98ymvb1995lcy4
> - Mick's 2023/04/26 email thread on CI: https://lists.apache.org/thread/xb80v6r857dz5rlm5ckcn69xcl4shvbq
> - CASSANDRA-18137: epic for "Repeatable ci-cassandra.a.o": https://issues.apache.org/jira/browse/CASSANDRA-18137
> - CASSANDRA-18133: In-tree build scripts: https://issues.apache.org/jira/browse/CASSANDRA-18133
>
> What's fallen out from this: the new reference CI will have the following logical layers:
> 1. ant
> 2. build/test scripts that set up the env. See run-tests.sh and run-python-dtests.sh here: https://github.com/thelastpickle/cassandra/tree/0aecbd873ff4de5474fe15efac4cdde10b603c7b/.build
> 3. dockerized build/test scripts that have containerized the flow of 1 and 2. See: https://github.com/thelastpickle/cassandra/tree/0aecbd873ff4de5474fe15efac4cdde10b603c7b/.build/docker
> 4. CI integrations. See generation of the unified test report in build.xml: https://github.com/thelastpickle/cassandra/blame/mck/18133/trunk/build.xml#L1794-L1817
> 5. Optional full CI lifecycle w/ Jenkins running in a container (full stack setup, run, teardown; pending)
>
> *I want to let everyone know the high level structure of how this is shaping up, as this is a change that will directly impact the work of all of us on the project.*
>
> In terms of our goals, the chief goals I'd like to call out in this context are:
> * ASF CI needs to be and remain consistent
> * contributors need a turnkey way to validate their work before merging that they can accelerate by throwing resources at it.
>
> We as a project need to determine what is *required* to run in a CI environment to consider that run certified for merge. Where Mick and I landed through a lot of back and forth is that the following would be required:
> 1. used ant / pytest to build and run tests
> 2. used the reference scripts being changed in CASSANDRA-18133 (in-tree .build/) to set up and execute your test environment
> 3. constrained your runtime environment to the same hardware and time constraints we use in ASF CI, within reason (CPU count independent of speed, memory size and disk size independent of hardware specs, etc)
> 4. reported test results in a unified fashion that has all the information we normally get from a test run
> 5. (maybe) parallelized the tests across the same split lines as upstream ASF (i.e. no weird env-specific neighbor / scheduling flakes)
>
> Last but not least is the "What do we do with CircleCI?" angle. The current thought is we allow people to continue using it, with the stated goal of migrating the circle config over to using the unified build scripts as well and getting it in compliance with the above requirements.
>
> For reference, here's a gdoc where we've hashed this out: https://docs.google.com/document/d/1TaYMvE5ryOYX03cxzY6XzuUS651fktVER02JHmZR5FU/edit?usp=sharing
>
> So my questions for the community here:
> 1. What's missing from the above conceptualization of the problem?
> 2. Are the constraints too strong? Too weak? Just right?
>
> Thanks everyone, and happy Friday. ;)
>
> ~Josh

--
+----------------------------------------------------------------+
| Derek Chen-Becker                                               |
| GPG Key available at https://keybase.io/dchenbecker and         |
| https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org   |
| Fngrprnt: EB8A 6480 F0A3 C8EB C1E7 7F42 AFC5 AFEE 96E4 6ACC     |
+----------------------------------------------------------------+
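Requirement 4 above (unified reporting) is the piece most often hand-rolled per environment, so here is a minimal sketch of rolling one run's JUnit XML into a single summary regardless of where it ran. The output directory is an assumption, the real unified report comes from the build.xml target linked in Josh's email, and the sed pattern leans on ant writing testsuite attributes in alphabetical order (errors, failures, ..., tests, time):

```bash
#!/usr/bin/env bash
# Roll every TEST-*.xml from a run into one pass/fail/runtime summary.
# Paths and attribute ordering are assumptions; the reference report is
# generated from build.xml (see the link in Josh's email above).
set -euo pipefail

REPORT_DIR="${1:-build/test/output}"

find "${REPORT_DIR}" -name 'TEST-*.xml' -print0 \
  | xargs -0 -r grep -ho '<testsuite[^>]*>' \
  | sed -En 's/.*errors="([0-9]+)".*failures="([0-9]+)".*tests="([0-9]+)".*time="([0-9.]+)".*/\1 \2 \3 \4/p' \
  | awk '{ e += $1; f += $2; t += $3; s += $4 }
         END { printf "suites=%d tests=%d failures=%d errors=%d runtime=%.0fs\n", NR, t, f, e, s }'
```

Whatever produces the reports - CircleCI, a local docker run, or ci-cassandra.a.o - a merge-blocking check then only has to look at one artifact.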
