Re: On our unit tests...
Rocking effort Stack!!! Thanks.

Regards
Ram

On Fri, Nov 6, 2015 at 1:40 AM, Stack wrote:

> On Thu, Nov 5, 2015 at 8:07 AM, Andrew Purtell wrote:
>
> > > Hanging tests have been fixed and/or disabled to be put back after
> > > scrubbing.
> >
> > What do you think about an interim step that adds a flakey test
> > category and a profile that disables them only on builds.a.o., i.e.
> > the Jenkins job configuration turns them off. Is that possible? I'd
> > like to continue running these on my build rigs since they are better
> > endowed than build.a.o resources. Or at least a profile that can turn
> > them on?
>
> We could do such a thing. Probably better than the current hackery where
> the test is just disabled with JIRAs to fix ...sometime.
>
> > > This is a petition that we go out of our way going forward to keep
> > > OUR test suite blue.
> >
> > Big +1 here
>
> Yeah. It's got to be a group thing.
>
> > BTW it turns out after seeing the results of your effort that most of
> > my issues with builds.a.o were probably due to the broken zombie
> > killing thing. That's why locally run stuff (also under Jenkins
> > sometimes btw) was just so much more stable. Can we have review and
> > SCM of our build configurations somehow going forward?
>
> Makes sense (and still work to do on zombie detector). Let me work on it.
>
> St.Ack
>
> > > On Oct 23, 2015, at 2:54 PM, Stack wrote:
> > >
> > > A few of us have been doing cleanup over the last month or so (see
> > > HBASE-14420). As a project, we had let our unit test suite go to
> > > seed. It was an anthology of mysterious crashes, zombies and flakes.
> > >
> > > We are not done yet but tests are mostly stable again with patch
> > > builds passing close to 100% of the time as long as the patch is
> > > good and trunk and branch-1/branch-1.2 are tending back toward being
> > > blue always. Hanging tests have been fixed and/or disabled to be put
> > > back after scrubbing. Mysterious surefire crashes/timeouts have been
> > > addressed by purging a problematic test set that we intend to re-add
> > > after tuneup and fix. There are still a few flakies in the mix.
> > >
> > > This is a petition that we go out of our way going forward to keep
> > > OUR test suite blue. We'll all be more productive if we can keep it
> > > this way. Patches will land faster because there'll be less friction
> > > getting them in (Landing big patches was taking me a week before
> > > starting in on this effort). We'll catch a slew of problems before
> > > commit. New devs won't be confounded by mysterious unrelated test
> > > fails. There'll be no need to keep up an arcane knowledge of 'known
> > > flakies' or hanging tests or the need for expending extra effort and
> > > resources doing 'look-it-works-locally-for-me' test runs locally.
> > >
> > > St.Ack
> > >
> > > Below are some further notes for those interested in build and work
> > > done to our test rig recently; ugly detail is over in HBASE-14420.
> > >
> > > Until an alternative shows up, our Apache Jenkins needs to run blue
> > > always if we want to do community development. True, Apache Jenkins
> > > is a trying environment in which to run tests, but it is shared,
> > > public, and I have yet to come across a hang or failure that was
> > > Apache-Jenkins-only; the only difference I've seen is that the
> > > incidence of hangs and flakies is higher on Apache.
> > >
> > > The test-patch.sh script had some hacking done to it mostly removing
> > > code that was finding and killing zombies. We were reporting ANY
> > > concurrent build as a zombie, even those that were not hbase tests,
> > > and killing them in the belief that they were leftovers from
> > > previous runs (the script had a few different techniques for finding
> > > and executing adjacent processes). This made some sense when we were
> > > supposed to be the only test running on the box but this has not
> > > been true for a long time. Killing was papering-over the fact that
> > > we were leaving zombies after us.
> > >
> > > The Jenkins build configuration also had zombie code from
> > > test-patch.sh in it (still does -- a TODO). Builds now dump out test
> > > machine load and listing of what else is running on the box at test
> > > start to give a sense of how loaded the test box is.
> > >
> > > I feel particularly bad for the new contributors. They have it hard
> > > enough already checking out a fat project with a slow build system
> > > with hours of tests to run to verify changes. Let's spare them the
> > > added barrier of a confounding experience when their nice patch
> > > throws up a mysterious jenkins fail on submit.
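An aside on the zombie notes quoted above: the problem described is that ANY concurrent build was reported as a zombie. A minimal sketch of the narrower check, assuming our own test JVMs carry an identifying marker on their command line, might look like the following. The marker string, the sample process list and the function names are all hypothetical; this is not the code in test-patch.sh.

```python
from typing import Iterable, List, NamedTuple, Set


class Proc(NamedTuple):
    pid: int
    command: str


def parse_ps_lines(lines: Iterable[str]) -> List[Proc]:
    """Parse lines shaped like '<pid> <command line>', e.g. from `ps -eo pid,args`."""
    procs = []
    for line in lines:
        parts = line.strip().split(None, 1)
        # Skip the header row and anything that does not start with a numeric pid.
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        procs.append(Proc(int(parts[0]), parts[1]))
    return procs


def find_own_leftovers(procs: Iterable[Proc], marker: str,
                       current_pids: Set[int]) -> List[Proc]:
    """Return processes that carry our marker but do not belong to the current run.

    Processes without the marker (other projects' builds) are never reported.
    """
    return [p for p in procs if marker in p.command and p.pid not in current_pids]


# Hypothetical listing: two of our test JVMs plus an unrelated build.
sample = [
    "  PID COMMAND",
    " 101 java -Dexample.test.marker=true surefirebooter1.jar",
    " 202 java -jar other-project-build.jar",
    " 303 java -Dexample.test.marker=true surefirebooter2.jar",
]
procs = parse_ps_lines(sample)
leftovers = find_own_leftovers(procs, "-Dexample.test.marker", current_pids={303})
```

Here only pid 101 is reported: pid 202 belongs to someone else's build and pid 303 is part of the current run. Reporting leftovers rather than killing them keeps the underlying leak visible instead of papering over it.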
Re: On our unit tests...
On Thu, Nov 5, 2015 at 8:07 AM, Andrew Purtell wrote:

> > Hanging tests have been fixed and/or disabled to be put back after
> > scrubbing.
>
> What do you think about an interim step that adds a flakey test category
> and a profile that disables them only on builds.a.o., i.e. the Jenkins
> job configuration turns them off. Is that possible? I'd like to continue
> running these on my build rigs since they are better endowed than
> build.a.o resources. Or at least a profile that can turn them on?

We could do such a thing. Probably better than the current hackery where
the test is just disabled with JIRAs to fix ...sometime.

> > This is a petition that we go out of our way going forward to keep OUR
> > test suite blue.
>
> Big +1 here

Yeah. It's got to be a group thing.

> BTW it turns out after seeing the results of your effort that most of my
> issues with builds.a.o were probably due to the broken zombie killing
> thing. That's why locally run stuff (also under Jenkins sometimes btw)
> was just so much more stable. Can we have review and SCM of our build
> configurations somehow going forward?

Makes sense (and still work to do on zombie detector). Let me work on it.

St.Ack

> > On Oct 23, 2015, at 2:54 PM, Stack wrote:
> >
> > A few of us have been doing cleanup over the last month or so (see
> > HBASE-14420). As a project, we had let our unit test suite go to seed.
> > It was an anthology of mysterious crashes, zombies and flakes.
> >
> > We are not done yet but tests are mostly stable again with patch
> > builds passing close to 100% of the time as long as the patch is good
> > and trunk and branch-1/branch-1.2 are tending back toward being blue
> > always. Hanging tests have been fixed and/or disabled to be put back
> > after scrubbing. Mysterious surefire crashes/timeouts have been
> > addressed by purging a problematic test set that we intend to re-add
> > after tuneup and fix. There are still a few flakies in the mix.
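An aside on the build notes quoted in this thread, which say builds now dump machine load and a listing of what else is running at test start: a minimal sketch of such a snapshot follows. It assumes a Unix-like box with a `ps` that accepts `-eo pid,args`; the function name and output layout are hypothetical and are not taken from the actual Jenkins job.

```python
import os
import subprocess


def load_snapshot() -> str:
    """Return the load averages plus a process listing as one block of text."""
    try:
        # 1-, 5- and 15-minute load averages; Unix-only.
        one, five, fifteen = os.getloadavg()
        header = f"load average: {one:.2f} {five:.2f} {fifteen:.2f}"
    except (OSError, AttributeError):
        header = "load average: unavailable"

    try:
        # Every process with its full command line, so a reader can see
        # which other builds were sharing the box when the tests started.
        listing = subprocess.run(
            ["ps", "-eo", "pid,args"],
            capture_output=True, text=True, check=False,
        ).stdout
    except OSError:
        listing = "(process listing unavailable)\n"

    return header + "\n" + listing


snapshot = load_snapshot()
```

Printing `snapshot` at the top of a build log gives later readers some context when a test hangs only on a busy box.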
Re: On our unit tests...
Huge kudos to you, Stack, for making the time to run these down. As a contributor, I'm very moved by the thought of treating what Jenkins reports as truth.

Stack wrote:

Since I wrote the below, we've figured out who the surefire-killer was [HBASE-14589]. 9 of the last 10 1.2 builds passed (even though blue builds are harder to achieve now since they are a compound of a jdk 1.7 and a jdk 1.8 run). 1.3 is failing on a few tests that seem legitimately flakey; I'm looking into them. Trunk is settling down after being made into a jdk7/8 matrix; it should stabilize soon.

Repeating my petition from below, can we start putting our trust back in apache builds and start relying on them again? It found a flakey end of last week soon after it went in because builds mostly pass now so the flakey shone through. It can find more if we all make the effort to keep it blue. In particular, can we end the passes-locally-for-me practice since tests that go zombie or hang usually run fine on boxes where there is no contention?

Thanks,
St.Ack

On Fri, Oct 23, 2015 at 2:54 PM, Stack wrote:

A few of us have been doing cleanup over the last month or so (see HBASE-14420). As a project, we had let our unit test suite go to seed. It was an anthology of mysterious crashes, zombies and flakes.

We are not done yet but tests are mostly stable again with patch builds passing close to 100% of the time as long as the patch is good and trunk and branch-1/branch-1.2 are tending back toward being blue always. Hanging tests have been fixed and/or disabled to be put back after scrubbing. Mysterious surefire crashes/timeouts have been addressed by purging a problematic test set that we intend to re-add after tuneup and fix. There are still a few flakies in the mix.

This is a petition that we go out of our way going forward to keep OUR test suite blue. We'll all be more productive if we can keep it this way.
Re: On our unit tests...
> In particular, can we end the passes-locally-for-me practice

+1

Although this depends on the sanity and stability of precommit builds. We
(at least I) resorted to posting locally sourced "proof" of clean test
suite runs to make forward progress in the limited amount of time I had to
work on a particular issue. Anyway, let's give it a shot with renewed
confidence.

> On Nov 4, 2015, at 4:23 PM, Stack wrote:
>
> Since I wrote the below, we've figured out who the surefire-killer was
> [HBASE-14589]. 9 of the last 10 1.2 builds passed (even though blue
> builds are harder to achieve now since they are a compound of a jdk 1.7
> and a jdk 1.8 run). 1.3 is failing on a few tests that seem legitimately
> flakey; I'm looking into them. Trunk is settling down after being made
> into a jdk7/8 matrix; it should stabilize soon.
>
> Repeating my petition from below, can we start putting our trust back in
> apache builds and start relying on them again? It found a flakey end of
> last week soon after it went in because builds mostly pass now so the
> flakey shone through. It can find more if we all make the effort to keep
> it blue. In particular, can we end the passes-locally-for-me practice
> since tests that go zombie or hang usually run fine on boxes where there
> is no contention?
>
> Thanks,
> St.Ack
>
>> On Fri, Oct 23, 2015 at 2:54 PM, Stack wrote:
>>
>> A few of us have been doing cleanup over the last month or so (see
>> HBASE-14420). As a project, we had let our unit test suite go to seed.
>> It was an anthology of mysterious crashes, zombies and flakes.
>>
>> We are not done yet but tests are mostly stable again with patch builds
>> passing close to 100% of the time as long as the patch is good and trunk
>> and branch-1/branch-1.2 are tending back toward being blue always.
>> Hanging tests have been fixed and/or disabled to be put back after
>> scrubbing.
Re: On our unit tests...
Thanks so much for banging on our tests and builds.a.o setup such that
some sanity there has now been restored!

> Hanging tests have been fixed and/or disabled to be put back after
> scrubbing.

What do you think about an interim step that adds a flakey test category
and a profile that disables them only on builds.a.o., i.e. the Jenkins job
configuration turns them off. Is that possible? I'd like to continue
running these on my build rigs since they are better endowed than
build.a.o resources. Or at least a profile that can turn them on?

> This is a petition that we go out of our way going forward to keep OUR
> test suite blue.

Big +1 here

BTW it turns out after seeing the results of your effort that most of my
issues with builds.a.o were probably due to the broken zombie killing
thing. That's why locally run stuff (also under Jenkins sometimes btw) was
just so much more stable. Can we have review and SCM of our build
configurations somehow going forward?

> On Oct 23, 2015, at 2:54 PM, Stack wrote:
>
> A few of us have been doing cleanup over the last month or so (see
> HBASE-14420). As a project, we had let our unit test suite go to seed. It
> was an anthology of mysterious crashes, zombies and flakes.
>
> We are not done yet but tests are mostly stable again with patch builds
> passing close to 100% of the time as long as the patch is good and trunk
> and branch-1/branch-1.2 are tending back toward being blue always.
> Hanging tests have been fixed and/or disabled to be put back after
> scrubbing. Mysterious surefire crashes/timeouts have been addressed by
> purging a problematic test set that we intend to re-add after tuneup and
> fix. There are still a few flakies in the mix.
>
> This is a petition that we go out of our way going forward to keep OUR
> test suite blue. We'll all be more productive if we can keep it this way.
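An aside on the flakey-category proposal above: the idea is that tests tagged as flakey stay in the tree and run by default, while one switch lets a shared build job turn them off. The sketch below illustrates only that idea, using Python's standard `unittest` as a stand-in for the project's own test framework and build profiles. The `SKIP_FLAKY_TESTS` variable, the `flaky` decorator and the example tests are all hypothetical.

```python
import os
import unittest


def flaky(test_item, environ=os.environ):
    """Tag a test as flaky; it is skipped only when SKIP_FLAKY_TESTS=1.

    SKIP_FLAKY_TESTS is a hypothetical switch that a shared CI job would
    export, while better-provisioned private rigs leave it unset.
    """
    skip = environ.get("SKIP_FLAKY_TESTS") == "1"
    return unittest.skipIf(skip, "flaky test disabled on shared CI")(test_item)


class ExampleTest(unittest.TestCase):
    def test_stable(self):
        self.assertEqual(1 + 1, 2)

    @flaky
    def test_timing_sensitive(self):
        # Stand-in for a test that only misbehaves on a heavily loaded box.
        self.assertTrue(True)


def run_suite():
    """Run ExampleTest and return (tests_run, skipped_count)."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExampleTest)
    result = unittest.TestResult()
    suite.run(result)
    return result.testsRun, len(result.skipped)
```

Because a skipped test is still reported with a reason, the flaky set remains visible in the shared job's results rather than silently disappearing, which is the gap in simply disabling a test and filing a JIRA.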
Re: On our unit tests...
Since I wrote the below, we've figured out who the surefire-killer was
[HBASE-14589]. 9 of the last 10 1.2 builds passed (even though blue builds
are harder to achieve now since they are a compound of a jdk 1.7 and a jdk
1.8 run). 1.3 is failing on a few tests that seem legitimately flakey; I'm
looking into them. Trunk is settling down after being made into a jdk7/8
matrix; it should stabilize soon.

Repeating my petition from below, can we start putting our trust back in
apache builds and start relying on them again? It found a flakey end of
last week soon after it went in because builds mostly pass now so the
flakey shone through. It can find more if we all make the effort to keep it
blue. In particular, can we end the passes-locally-for-me practice since
tests that go zombie or hang usually run fine on boxes where there is no
contention?

Thanks,
St.Ack

On Fri, Oct 23, 2015 at 2:54 PM, Stack wrote:

> A few of us have been doing cleanup over the last month or so (see
> HBASE-14420). As a project, we had let our unit test suite go to seed. It
> was an anthology of mysterious crashes, zombies and flakes.
>
> We are not done yet but tests are mostly stable again with patch builds
> passing close to 100% of the time as long as the patch is good and trunk
> and branch-1/branch-1.2 are tending back toward being blue always.
> Hanging tests have been fixed and/or disabled to be put back after
> scrubbing. Mysterious surefire crashes/timeouts have been addressed by
> purging a problematic test set that we intend to re-add after tuneup and
> fix. There are still a few flakies in the mix.
>
> This is a petition that we go out of our way going forward to keep OUR
> test suite blue. We'll all be more productive if we can keep it this way.
> Patches will land faster because there'll be less friction getting them
> in (Landing big patches was taking me a week before starting in on this
> effort). We'll catch a slew of problems before commit.
