Re: Tegra build backlog is too big!
Justin Wood wrote:
> We also have had developer confusion around this change (and some relatively minor unforeseen problems with the patch, detailed in the bug) that caused sheriffs to ask for this to be backed out.

If you omit part of the try syntax then you get a default set of options, but as far as I can see Try Chooser doesn't support this, so it's all or nothing. Perhaps if Try Chooser had an option to use the defaults, then they could be adjusted as necessary to something sensible for developers who don't really need every single combination of builds and tests?

--
Warning: May contain traces of nuts.
___
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform
Re: Tegra build backlog is too big!
Thanks John, this looks like it strikes a good balance between reducing tegra load and avoiding false positives in try runs.

Just for anybody following along though, it looks like bug 915465 (which deployed this change) is still in flux, so things may not have completely settled yet. See also https://bugzilla.mozilla.org/show_bug.cgi?id=915465#c8 for the new trychooser syntax to enable testing on Tegras (although from comment 10 it looks like it may not be working properly yet).

Cheers,
kats

On 13-09-12 01:11, John O'Duinn wrote:
> hi kats (cross-posting to dev-b2g);
> tl;dr: we think all is ok again, details below. To avoid this happening again this week, we're changing tryserver to reduce the number of Android-tests-run-on-tegra-by-default. If you specifically want tegra testing on tryserver, you will need to state that when pushing to try.
Re: Tegra build backlog is too big!
> hi kats (cross-posting to dev-b2g);
> tl;dr: we think all is ok again, details below. To avoid this happening again this week, we're changing tryserver to reduce the number of Android-tests-run-on-tegra-by-default. If you specifically want tegra testing on tryserver, you will need to state that when pushing to try.

Hey everyone, this change was just backed out.

After last night's and today's recovery of tegra devices, we are in a much better state than we were when kats was prompted to start this thread. We have also had developer confusion around this change (and some relatively minor unforeseen problems with the patch, detailed in the bug) that caused sheriffs to ask for this to be backed out.

For now, we expect wait times to return to roughly what they were last week.

~Justin Wood (Callek)
Tegra build backlog is too big!
Earlier today the backlog on Android build jobs was on the order of 1300. It seems to be coming down a little now, but for a while there I was worried it was going to grow unboundedly. Try pushes from over 10 hours ago still have pending jobs - as I'm sure you all know, having a 10-hour turnaround on try pushes is something of a productivity killer.

I brought this up in #releng and one of the proposed solutions was to try to tweak the prioritization of jobs between Try and Inbound a little bit. I personally do like that Inbound jobs are prioritized above Try, but perhaps they don't need to be prioritized quite so much. However, changing this will affect a number of people, so it was suggested I bring the discussion here to get other people's comments.

So, anybody have thoughts on a good way to solve this problem?

Cheers,
kats
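To make the "prioritized above Try, but not quite so much" idea concrete, here is a minimal sketch of priority aging. It is not how the build scheduler actually works; the base priorities, the aging rate, and the function names are all hypothetical values chosen for illustration. Inbound still wins against a fresh try job, but a try job that has waited long enough eventually overtakes it.

```python
# Hypothetical base priorities: a lower number is served first.
BASE_PRIORITY = {"inbound": 0, "try": 4}

# Hypothetical aging rate: one priority level gained per hour of waiting.
AGING_PER_HOUR = 1.0


def effective_priority(branch, hours_waiting):
    """Base priority of the branch, improved by how long the job has waited."""
    return BASE_PRIORITY[branch] - AGING_PER_HOUR * hours_waiting


def pick_next(pending):
    """Return the (branch, hours_waiting) job with the best effective priority."""
    return min(pending, key=lambda job: effective_priority(*job))


# A fresh inbound job beats a try job that has waited one hour,
# but loses to a try job that has been pending for ten hours.
print(pick_next([("inbound", 0), ("try", 1)]))
print(pick_next([("inbound", 0), ("try", 10)]))
```

Tuning the gap between the two base priorities and the aging rate is the knob that decides how much Inbound is favoured over Try.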
Re: Tegra build backlog is too big!
Do we know why it's backed up that much? I started noticing it yesterday. Is it because of lots of inbound pushes? Lots of try pushes? Lots of clobbering? Lots of tests?

Jim

On 9/11/13 5:31 PM, Kartikaya Gupta wrote:
> Earlier today the backlog on Android build jobs was on the order of 1300. It seems to be coming down a little now but for a while there I was worried it was going to grow unboundedly.
Re: Tegra build backlog is too big!
Quite possibly we don't need all those jobs running on tegras. I don't know of a bug in the product that has broken on either the tegra or panda platform but not the other.

Joel
Re: Tegra build backlog is too big!
On Wed, Sep 11, 2013 at 04:39:37PM -0700, jmaher wrote:
> quite possibly we don't need all those jobs running on tegras. I don't know of a bug in the product that has broken on either the tegra or panda platform but not the other.

Off the top of my head:
- I have broken one but not the other on several occasions, involving differences in the handling of instruction and data caches, but unless you're touching the linker or the jit, it shouldn't matter.
- Tegras don't have neon instructions, so wrong build flags or wrong runtime detection could trigger failures on one end and not the other.
- GPUs on tegras and pandas, as well as their supporting libraries, differ, too. But unless you are touching graphics code, that shouldn't matter, unless your changes trigger some pre-existing bug.

So, while chances of breaking one and not the other are slim, they do exist.

Mike
Re: Tegra build backlog is too big!
Fixing bugs like bug 884972 would probably help quite a bit. Posting patches with checkin info and marking the bug with checkin-needed, so the work lands together with other patches, would help too. I always try to do this with simple front end patches.

Jim

Kartikaya Gupta <kgu...@mozilla.com> wrote in message news:jl6dnrwrpogjfk3pnz2dnuvz_umdn...@mozilla.org...
> So, anybody have thoughts on a good way to solve this problem?
Re: Tegra build backlog is too big!
2013/9/11 Mike Hommey <m...@glandium.org>

> On Wed, Sep 11, 2013 at 04:39:37PM -0700, jmaher wrote:
> > quite possibly we don't need all those jobs running on tegras. I don't know of a bug in the product that has broken on either the tegra or panda platform but not the other.
>
> Off the top of my head:
> - I have broken one but not the other on several occasions, involving differences in the handling of instruction and data caches, but unless you're touching the linker or the jit, it shouldn't matter.
> - Tegras don't have neon instructions, so wrong build flags or wrong runtime detection could trigger failures on one end and not the other.
> - GPUs on tegras and pandas, as well as their supporting libraries, differ, too. But unless you are touching graphics code, that shouldn't matter, unless your changes trigger some pre-existing bug.

And Panda boards have 1 GB of RAM, which is more than the Tegra boards have, right? Surely that can help avoid OOM problems on Pandas.

At some point earlier this year, WebGL conformance tests were perma-orange on Tegras but only intermittently orange on Pandas. RAM differences were likely the cause, as WebGL tests were OOM'ing a lot.

Benoit

> So, while chances of breaking one and not the other are slim, they do exist.
>
> Mike
Re: Tegra build backlog is too big!
hi kats (cross-posting to dev-b2g);

tl;dr: we think all is ok again, details below. To avoid this happening again this week, we're changing tryserver to reduce the number of Android-tests-run-on-tegra-by-default. If you specifically want tegra testing on tryserver, you will need to state that when pushing to try.

Yes, load on try was unusually heavy today (the b2g workweek is in full force). All other platforms, including our pool of panda Android4 test boards, were handling this heavy load just fine. However, our pool of tegra boards is small (hard to get boards), had a more-than-usual percentage offline, and was not able to keep up. We were distracted by an unrelated 64-bit windows problem, but that's no excuse; we should have detected this earlier. After this is all calm, we'll postmortem.

1) As of earlier this afternoon, 976 of the pending 1078 tegra jobs were from try. This was no single abuse of try; it was simply an accumulation of a lot of pushes-to-try spread across the day.

2) We manually repoked every one of the offline tegras, and most are now working correctly again. As of now, our tegra pool is back to a much healthier size, as-good-as-we-can-hope-for-with-poolsize, and quickly chewing through the remaining backlog. At this time, we are down to 240 pending jobs, and dropping fast.
http://builddata.pub.build.mozilla.org/reports/pending/pending.html
Details in https://bugzilla.mozilla.org/show_bug.cgi?id=915457 "Triage tegras with no completed jobs within last 24 hours"

3) We cancelled all pending tegra test jobs on try that had been waiting over 6 hours (the longest was 10 hours). Note: we did not cancel panda test jobs, and we did not cancel tegra test jobs on other branches. If we cancelled a tegra test job on try that you do still need run, please let us know in #releng, and we'll sort it out.
Details in https://bugzilla.mozilla.org/show_bug.cgi?id=915481 "Stop try tegra jobs pending for 6 hours"

4) We are changing the default when you push to try. Until now, by default Try has generated Android builds and then run *all* unittests+talos for Android2.2 (tegras) *and* Android4 (pandas). We are now testing a change to the default as follows:
4a) Android builds: no change, still built by default
4b) Android pandas tests: no change, still run unittest and talos by default
4c) Android tegras tests: unittests would not be run by default, but talos would still be run by default. Because of details in how TryServer works, talos has to stay enabled on tegras by default in order to keep scheduling talos on pandas also.
Details in https://bugzilla.mozilla.org/show_bug.cgi?id=915465 "Pushes to try should not trigger tegra test jobs by default"

Again, this is just changing the default: anyone who wants Android2.2 specific unittests run on tegras on try can still get that by specifying it explicitly when pushing to try.

Note: this change is for the default on try *only*. There is no change to what any other (non-try) branches do for android testing on tegras; those remain as-is.

As mhommey and joel discussed earlier in this thread, changing the try default to test on pandas, but not also run all tests on tegras, does introduce a slim-but-non-zero risk of missing a problem that only a tegra would have caught with that try push. Note that we are still running tegra testing on non-try branches, as usual, so even if a problem like this is missed on try, it will be caught the first time it lands on any other branch that has Android coverage. After this workweek, we can revisit whether this default setting needs to be reverted.

Let me know if you want any further info, ok?

John.
==
On 9/11/13 2:31 PM, Kartikaya Gupta wrote:
> Earlier today the backlog on Android build jobs was on the order of 1300. It seems to be coming down a little now but for a while there I was worried it was going to grow unboundedly.
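For anyone who finds the 4a-4c defaults easier to read as a table, here is a rough sketch of the proposed behaviour. It only illustrates the rules as described in this message; the table name, the (platform, suite) keys, and the function are hypothetical and do not correspond to the actual TryServer configuration or try syntax.

```python
# Hypothetical table of the proposed try defaults (4a-4c above).
# True means the job is scheduled without being asked for.
NEW_TRY_DEFAULTS = {
    ("android", "build"): True,    # 4a: builds unchanged
    ("panda", "unittest"): True,   # 4b: panda tests unchanged
    ("panda", "talos"): True,      # 4b
    ("tegra", "unittest"): False,  # 4c: tegra unittests now opt-in
    ("tegra", "talos"): True,      # 4c: kept so panda talos keeps scheduling
}


def scheduled_jobs(explicit_requests=()):
    """Return the sorted (platform, suite) pairs a try push would schedule.

    explicit_requests holds the pairs the developer asked for when pushing,
    which are added on top of the defaults.
    """
    jobs = {job for job, enabled in NEW_TRY_DEFAULTS.items() if enabled}
    jobs.update(explicit_requests)
    return sorted(jobs)


# Default push: no tegra unittests.
print(scheduled_jobs())
# Push that explicitly asks for tegra unittests.
print(scheduled_jobs([("tegra", "unittest")]))
```

The point of the sketch is that only one entry flips: tegra unittests go from on-by-default to opt-in, and everything else stays as it was.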