Re: Tegra build backlog is too big!

2013-09-13 Thread Neil

Justin Wood wrote:


We also have had developer confusion around this change (and some relatively 
minor unforeseen problems with the patch, detailed in the bug) that caused sheriffs 
to ask for this to be backed out.
 

If you omit part of the try syntax then you get a default set of 
options, but as far as I can see Try Chooser doesn't support this, so 
it's all or nothing. Perhaps if Try Chooser had an option to use the 
defaults then they could be adjusted as necessary to something sensible 
for developers who don't really need every single combination of builds 
and tests?
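
To make the "omit part of the syntax and you get defaults" behaviour concrete, here is a minimal Python sketch. The flag letters (-b, -p, -u, -t) are the usual try syntax ones, but the default values and the function name are assumptions made up for illustration; this thread does not say what the real defaults are.

```python
import shlex

# Hypothetical defaults, for illustration only; the real try server's
# defaults are not given in this thread.
HYPOTHETICAL_DEFAULTS = {"-b": "do", "-p": "all", "-u": "all", "-t": "none"}


def fill_try_defaults(message, defaults=HYPOTHETICAL_DEFAULTS):
    """Return the try options in `message`, filling omitted flags from `defaults`."""
    # Everything after "try:" is treated as the option string.
    _, _, syntax = message.partition("try:")
    tokens = shlex.split(syntax)
    options = dict(defaults)
    # Walk the tokens: a known flag followed by its value overrides the default.
    i = 0
    while i < len(tokens) - 1:
        if tokens[i] in options:
            options[tokens[i]] = tokens[i + 1]
            i += 2
        else:
            i += 1
    return options


print(fill_try_defaults("try: -b o -p android"))
```

A "use the defaults" option in Try Chooser would then amount to leaving the corresponding flags out and letting a table like this one decide.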


--
Warning: May contain traces of nuts.
___
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform


Re: Tegra build backlog is too big!

2013-09-12 Thread Kartikaya Gupta
Thanks John, this looks like it strikes a good balance between reducing 
tegra load and getting false-positives in try runs.


Just for anybody following along though, it looks like bug 915465 (which 
deployed this change) is still in flux, so things may not have completely 
settled yet. See also 
https://bugzilla.mozilla.org/show_bug.cgi?id=915465#c8 for the new 
trychooser syntax to enable testing on Tegras (although from comment 10 
it looks like it may not be working properly yet).


Cheers,
kats


Re: Tegra build backlog is too big!

2013-09-12 Thread Justin Wood
 hi kats (cross-posting to dev-b2g);
 
 tl;dr: we think all is ok again, details below. To avoid this happening
 again this week, we're changing tryserver to reduce the number of
 Android-tests-run-on-tegra-by-default. If you specifically want tegra
 testing on tryserver, you will need to state that when pushing to try.
 

Hey everyone, this change was just backed out.

After last night's and today's recovery of tegra devices we are in a much better 
state than we were when kats was prompted to start this thread. We have also 
had developer confusion around this change (and some relatively minor unforeseen 
problems with the patch, detailed in the bug) that caused sheriffs to ask for it 
to be backed out.

For now, we expect wait times to return to roughly what they were last week.

~Justin Wood (Callek)


Tegra build backlog is too big!

2013-09-11 Thread Kartikaya Gupta
Earlier today the backlog on Android build jobs was on the order of 
1300. It seems to be coming down a little now but for a while there I 
was worried it was going to grow unboundedly. Try jobs from over 10 
hours ago still have pending jobs - as I'm sure you all know, having a 
10-hour turnaround on try pushes is something of a productivity killer.


I brought this up in #releng and one of the proposed solutions was to 
try to tweak the prioritization of jobs between Try and Inbound a little 
bit. I personally do like that Inbound jobs are prioritized above Try, 
but perhaps they don't need to be prioritized quite so much. However, 
changing this will affect a number of people, so it was suggested I 
bring the discussion here to get other people's comments.
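
One way to picture "prioritized, but not quite so much" is a score that favours Inbound but lets a long-waiting Try job eventually overtake a fresh Inbound job. The sketch below is only an illustration of that idea: the penalty, the aging rate, and all the names are made up, and nothing here reflects how the real scheduler works.

```python
# Hypothetical numbers; the job with the lowest score is served first.
BRANCH_PENALTY = {"inbound": 0.0, "try": 4.0}  # try starts 4 "hours" behind
AGING_PER_HOUR = 1.0  # each hour of waiting cancels one hour of penalty


def score(branch, hours_waiting):
    """Lower is better: branch penalty minus credit for time already waited."""
    return BRANCH_PENALTY[branch] - AGING_PER_HOUR * hours_waiting


def pick_next(pending):
    """Pick the (branch, hours_waiting) job with the lowest score."""
    return min(pending, key=lambda job: score(*job))


pending = [("inbound", 0.5), ("try", 1.0), ("try", 10.0)]
print(pick_next(pending))
```

With these made-up numbers a fresh Try job still waits behind Inbound, but a Try job that has been pending for 10 hours goes first.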


So, anybody have thoughts on a good way to solve this problem?

Cheers,
kats


Re: Tegra build backlog is too big!

2013-09-11 Thread Jim Chen
Do we know why it's backed up that much? I started noticing
it yesterday. Is it because of lots of inbound pushes? Lots
of try pushes? Lots of clobbering? Lots of tests?

Jim




Re: Tegra build backlog is too big!

2013-09-11 Thread jmaher
Quite possibly we don't need all those jobs running on tegras. I don't know of 
a bug in the product that has broken on either the tegra or panda platform but 
not the other.

Joel


Re: Tegra build backlog is too big!

2013-09-11 Thread Mike Hommey
On Wed, Sep 11, 2013 at 04:39:37PM -0700, jmaher wrote:
 quite possibly we don't need all those jobs running on tegras.  I
 don't know of a bug in the product that has broken on either the tegra
 or panda platform but not the other.

Off the top of my head:

- I have broken one but not the other on several occasions, involving
differences in the handling of instruction and data caches, but unless
you're touching the linker or the JIT, it shouldn't matter.

- Tegras don't have NEON instructions, so wrong build flags or wrong
runtime detection could trigger failures on one and not the other.

- GPUs on tegras and pandas, as well as their supporting libraries,
differ, too. That shouldn't matter unless you are touching graphics
code or your changes trigger some pre-existing bug.

So, while chances of breaking one and not the other are slim, they do
exist.
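
As a rough illustration of the runtime-detection point: on 32-bit ARM Linux and Android, the kernel lists CPU features on the "Features" line of /proc/cpuinfo, and "neon" appears there only when the CPU supports it. The Python sketch below parses text in that format; the two sample strings are illustrative rather than copied from real boards, and the function name is made up.

```python
def has_neon(cpuinfo_text):
    """Return True if a Features line in /proc/cpuinfo-style text lists neon."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        # Match the whole word so that e.g. "vfpv3" never counts as neon.
        if key.strip() == "Features" and "neon" in value.split():
            return True
    return False


# Illustrative samples, not captured from actual devices.
with_neon = "Processor : ARMv7\nFeatures : swp half thumb fastmult vfp edsp neon vfpv3"
without_neon = "Processor : ARMv7\nFeatures : swp half thumb fastmult vfp edsp vfpv3 vfpv3d16"

print(has_neon(with_neon), has_neon(without_neon))
```

Code that picks a NEON path without a check like this, or with a check that gets the answer wrong, would only fail on the boards that lack the instructions.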

Mike


Re: Tegra build backlog is too big!

2013-09-11 Thread Jim Mathies
Fixing bugs like bug 884972 would probably help quite a bit. So would posting
patches with checkin info and marking the bug checkin-needed, so that the
work lands together with other patches. I always try to do this with simple
front-end patches.

Jim




Re: Tegra build backlog is too big!

2013-09-11 Thread Benoit Jacob
2013/9/11 Mike Hommey m...@glandium.org

 On Wed, Sep 11, 2013 at 04:39:37PM -0700, jmaher wrote:
  quite possibly we don't need all those jobs running on tegras.  I
  don't know of a bug in the product that has broken on either the tegra
  or panda platform but not the other.

 Off the top of my head:

 - I have broken one but not the other on several occasions, involving
 differences in the handling of instruction and data caches, but unless
 you're touching the linker or the JIT, it shouldn't matter.

 - Tegras don't have NEON instructions, so wrong build flags or wrong
 runtime detection could trigger failures on one and not the other.

 - GPUs on tegras and pandas, as well as their supporting libraries,
 differ, too. That shouldn't matter unless you are touching graphics
 code or your changes trigger some pre-existing bug.


And Panda boards have 1 GB of RAM, which is more than the Tegra boards have,
right? Surely that helps avoid OOM problems on Pandas.

At some point earlier this year, WebGL conformance tests were perma-orange
on Tegras but only intermittently orange on Pandas. RAM differences were
likely the cause, as WebGL tests were OOM'ing a lot.

Benoit




 So, while chances of breaking one and not the other are slim, they do
 exist.

 Mike


Re: Tegra build backlog is too big!

2013-09-11 Thread John O'Duinn
hi kats (cross-posting to dev-b2g);

tl;dr: we think all is ok again, details below. To avoid this happening
again this week, we're changing tryserver to reduce the number of
Android-tests-run-on-tegra-by-default. If you specifically want tegra
testing on tryserver, you will need to state that when pushing to try.


Yes, load on try was unusually heavy today (the b2g workweek is in full
force). All other platforms, including our pool of panda Android4 test
boards, were handling this heavy load just fine. However, our pool of
tegra boards is small (boards are hard to get), had a more-than-usual
percentage offline, and was not able to keep up. We were distracted by
an unrelated 64-bit Windows problem, but that's no excuse; we should have
detected this earlier. Once this has all calmed down, we'll do a postmortem.

1) As of earlier this afternoon, 976 of the pending 1078 tegra jobs were
from try. This was not any single abuse of try; it was simply an
accumulation of a lot of pushes-to-try spread across the day.
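
For scale, the two numbers in item 1 work out as follows (a trivial Python check using only the figures quoted there; the variable names are mine):

```python
pending_tegra_jobs = 1078
from_try = 976

# Fraction of the pending tegra queue that came from try pushes.
share = from_try / pending_tegra_jobs
print(f"{share:.1%} from try, {pending_tegra_jobs - from_try} from other branches")
```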

2) We manually re-poked every one of the offline tegras, and most are now
working correctly again. As of now, our tegra pool is back to a much
healthier size, as good as we can hope for given the pool size, and is
quickly chewing through the remaining backlog. At this time, we are down
to 240 pending jobs, and dropping fast.
http://builddata.pub.build.mozilla.org/reports/pending/pending.html
Details in https://bugzilla.mozilla.org/show_bug.cgi?id=915457 ("Triage
tegras with no completed jobs within last 24 hours").
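
The triage rule named in that bug title (no completed jobs within the last 24 hours) can be sketched roughly like this in Python. The device names, timestamps, and function name are made up for illustration; this is not the tooling releng actually used.

```python
from datetime import datetime, timedelta


def stale_devices(last_completed, now, max_idle=timedelta(hours=24)):
    """Return names of devices whose last completed job is older than max_idle.

    `last_completed` maps a device name to the datetime of its last completed
    job, or None if it has never completed one.
    """
    return sorted(
        name
        for name, finished in last_completed.items()
        if finished is None or now - finished > max_idle
    )


# Made-up device names and timestamps.
now = datetime(2013, 9, 11, 18, 0)
last_completed = {
    "tegra-001": datetime(2013, 9, 11, 17, 30),
    "tegra-002": datetime(2013, 9, 10, 9, 0),
    "tegra-003": None,
}
print(stale_devices(last_completed, now))
```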

3) We cancelled all pending tegra test jobs on try that had been waiting
over 6 hours (the longest was 10 hours). Note: we did not cancel panda
test jobs, and we did not cancel tegra test jobs on other branches. If
we cancelled a tegra test job on try that you still need run, please
let us know in #releng, and we'll sort it out.
Details in https://bugzilla.mozilla.org/show_bug.cgi?id=915481 ("Stop try
tegra jobs pending for 6 hours").
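
Spelled out as a filter, the cancellation rule in item 3 looks roughly like the Python sketch below. The PendingJob record and its field names are hypothetical, as is the sample data; only the rule itself (try, tegra, test jobs, pending more than 6 hours) comes from the text above.

```python
from dataclasses import dataclass


@dataclass
class PendingJob:
    branch: str        # e.g. "try", "mozilla-inbound"
    platform: str      # e.g. "tegra", "panda"
    kind: str          # "test" or "build"
    hours_pending: float


def jobs_to_cancel(jobs, max_hours=6):
    """Select only try tegra test jobs that have waited longer than max_hours."""
    return [
        job for job in jobs
        if job.branch == "try"
        and job.platform == "tegra"
        and job.kind == "test"
        and job.hours_pending > max_hours
    ]


# Made-up queue: only the first job matches every condition.
jobs = [
    PendingJob("try", "tegra", "test", 10),
    PendingJob("try", "tegra", "test", 2),
    PendingJob("try", "panda", "test", 8),
    PendingJob("mozilla-inbound", "tegra", "test", 7),
]
print(jobs_to_cancel(jobs))
```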

4) We are changing the default when you push to try. Until now, by
default Try has generated Android builds and then run *all*
unittests+talos for Android2.2 (tegras) *and* Android4 (pandas). We are
now testing a change to the default as follows:

4a) Android builds: no change, still built by default
4b) Android pandas tests: no change, still run unittest and talos by default
4c) Android tegras tests: unittests would not be run by default, but
talos would still be run by default. Because of details in how TryServer
works, talos has to stay enabled on tegras by default in order to keep
scheduling talos on pandas. Details in
https://bugzilla.mozilla.org/show_bug.cgi?id=915465 ("Pushes to try
should not trigger tegra test jobs by default"). Again, this is just
changing the default: anyone who wants Android2.2-specific unittests
run on tegras on try can still get that by specifying it explicitly when
pushing to try.

Note: this change is for default on try *only*. There is no change to
what any other (non-try) branches do for android testing on tegras,
those remain as-is.
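
Put as a small lookup, the proposed default in item 4 behaves roughly like the Python sketch below. The table and function names are made up, non-try branches are simply modelled as "run everything", and this is not how TryServer is actually implemented.

```python
# Suites scheduled by default on try under the proposed change (items 4b, 4c).
PROPOSED_TRY_DEFAULTS = {
    "panda": {"unittest", "talos"},
    "tegra": {"talos"},
}
ALL_SUITES = {"unittest", "talos"}


def suites_to_run(branch, platform, explicitly_requested=frozenset()):
    """Return the test suites to schedule for one Android test platform."""
    if branch != "try":
        # Assumption: non-try branches are unchanged, modelled as all suites.
        return set(ALL_SUITES)
    # On try, start from the new default and add whatever was asked for.
    return PROPOSED_TRY_DEFAULTS[platform] | set(explicitly_requested)


print(suites_to_run("try", "tegra"))
print(suites_to_run("try", "tegra", {"unittest"}))
```

The second call is the "state that when pushing to try" case: asking for tegra unittests explicitly brings them back.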



As mhommey and joel discussed earlier in this thread, changing try
default to test on pandas, but not also run all tests on tegras, does
introduce a slim-but-non-zero risk of missing a problem that only a
tegra would have caught with that try push. Note that we are still
running tegra testing on non-try branches, as usual, so even if a
problem like this is missed on try, it will be caught the first time it
lands on any other branch that has Android coverage. After this
workweek, we can revisit whether this default setting needs to be reverted.


Let me know if you want any further info, ok?
John.