[chromium-dev] Re: Buildbot performance issue.

2009-07-14 Thread Evan Martin

On Tue, Jul 14, 2009 at 2:18 PM, Nicolas Sylvain nsylv...@chromium.org wrote:
   The underlying problem with buildbot is the database format, which is just
 hundreds of thousands of files on the hard drive, with no seek capability,
 and the fact that the webserver itself is single-threaded.
   We currently have 63 slaves on our main waterfall. I think this is too
 many for what buildbot can really support. We would ideally need to split it.

Can we get upstream buildbot devs involved in this discussion?  It
seems they ought to be able to scale better than they have.

It seems to me that a caching layer that only hit the backend once every
ten seconds would give it ten seconds to grind through its
computations, which should be more than sufficient, without any extra
splitting up required.  That is, we should (a) fix the proxy and (b)
make everyone use the proxy.
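
A minimal sketch of the caching idea above, not the actual proxy: it keeps
one cached response per URL and goes back to the backend at most once per
ten seconds. The class name TTLCache and the fetch callable are hypothetical
and stand in for whatever really talks to the buildbot.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Serve a cached body per URL, refetching at most once per `ttl` seconds."""

    def __init__(self, fetch: Callable[[str], str], ttl: float = 10.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch    # hypothetical: the function that hits the backend
        self._ttl = ttl
        self._clock = clock    # injectable so the behaviour can be tested
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, url: str) -> str:
        now = self._clock()
        entry = self._entries.get(url)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]    # still fresh: the backend is not touched
        body = self._fetch(url)
        self._entries[url] = (now, body)
        return body
```

With this shape, the load on the backend depends on the number of distinct
URLs and the TTL, not on how many people have the page open.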

   Q3: What kind of auto-refresh do we need?
   We used to be at 60 secs for a long time, and I changed it a couple of
 weeks ago to 90 secs. No one complained, so I guess this is good. Should
 we go even higher than that?

I personally hate auto-refresh.  We should make it opt-in since I
doubt most users need it and it adds load.  I expect many people
(myself included) leave the buildbot page open in a background tab and
have it continually refreshing despite not looking at it.

(My other common use case: the tree is red, I start scrolling down to
see what's gone wrong, and then the page refreshes out from under me
and I lose where I was looking.)

 - Get a better machine. It's already running on a dedicated dual quad-core
   Nehalem server with 24 GB of RAM and 15k rpm drives.

This is absurdly powerful!  It should have all the data necessary to
generate the page in RAM already, no need for even touching the disks
(?).

--~--~-~--~~~---~--~~
Chromium Developers mailing list: chromium-dev@googlegroups.com 
View archives, change email options, or unsubscribe: 
http://groups.google.com/group/chromium-dev
-~--~~~~--~~--~--~---



[chromium-dev] Re: Buildbot performance issue.

2009-07-14 Thread Ojan Vafai
On Tue, Jul 14, 2009 at 2:18 PM, Nicolas Sylvain nsylv...@chromium.org wrote:

   Q3: What kind of auto-refresh do we need?

   We used to be at 60 secs for a long time, and I changed it a couple of
 weeks ago to 90 secs. No one complained, so I guess this is good. Should
 we go even higher than that?


Why do we even default the waterfall to autorefresh? Most of the time when I
commit, I refresh it myself obsessively anyways. I get the impression that
most people do this. I imagine that turning off autorefresh and allowing
people that really want it to opt in would save a lot of unnecessary
overhead (e.g. people leave the waterfall page open for hours when they're
not at their desk).

I'll note that upstream WebKit's waterfall does not autorefresh and I've
never heard someone complain about it.

Ojan




[chromium-dev] Re: Buildbot performance issue.

2009-07-14 Thread Paweł Hajdan Jr.

I was just thinking... doesn't it have a reverse proxy in front of it?
It could also force a content cache time of 60s or even more...
Something like Squid or Varnish.

Oh, and this morning was probably me scraping the logs. :-( Sorry.

On Tue, Jul 14, 2009 at 14:18, Nicolas Sylvain nsylv...@chromium.org wrote:
 Hello,
   We have reached a point where we have too many build slaves and too many
 users to get good performance/latency out of our current buildbot waterfall
 and console view page.
   We've been tweaking the page a lot lately to make it faster to load, but
 the number of new slaves every day and the number of new users make it slow
 down faster than we can address the problem. This morning buildbot was
 completely unresponsive because it was not able to keep up with the demand.
   To bring it back to life, I disabled two features, which account for
 about 50% of all traffic: the buildbot Chrome extensions, and the 3 overview
 bars at the top of the waterfall. This should keep it online for a while,
 but it is not ideal. It's time to think of a better solution.
   The underlying problem with buildbot is the database format, which is just
 hundreds of thousands of files on the hard drive, with no seek capability,
 and the fact that the webserver itself is single-threaded.
   We currently have 63 slaves on our main waterfall. I think this is too
 many for what buildbot can really support. We would ideally need to split it.
   Q1: What kind of split would you prefer? mac/linux/windows, or
 chromium/webkit/modules, or full/windows/mac/linux/memory, etc.?
   The main buildbot page would most likely become a bunch of iframes to
 display all the slaves at the same time. The console view integration might
 be a little less nice. If there is anyone with web development experience
 who wants to help, we could modify the current waterfall to fetch only JSON
 data from the buildbots and merge it together, client side, to get a
 combined view.
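
 A rough sketch of the merge step described in Q1, written in Python only to
 show the logic (the real thing would run client side). The JSON shape
 {"builders": {name: {"state": ...}}} and the function name merge_waterfalls
 are assumptions for illustration; the thread does not say what the buildbot
 would actually emit.

```python
import json
from typing import Dict


def merge_waterfalls(payloads: Dict[str, str]) -> Dict[str, dict]:
    """Combine per-master JSON documents into one builder -> status mapping.

    `payloads` maps a master name to a JSON string of the assumed form
    {"builders": {"<builder>": {"state": "<green|red|...>"}}}.
    """
    combined: Dict[str, dict] = {}
    for master, raw in sorted(payloads.items()):
        builders = json.loads(raw).get("builders", {})
        for name, status in builders.items():
            # Prefix with the master name so builders from different
            # masters cannot collide in the combined view.
            combined[f"{master}/{name}"] = dict(status, master=master)
    return combined
```

 Each split buildbot then only has to serve its own small JSON document, and
 the combined view is assembled away from the single-threaded web server.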
   Q2: How many changes do we need to display on the console view?
   We are currently displaying the last 50 changes, which is usually half a
 day. If people don't mind, we could scale back to 30. This would make it a
 little faster to load.
   Q3: What kind of auto-refresh do we need?
   We used to be at 60 secs for a long time, and I changed it a couple of
 weeks ago to 90 secs. No one complained, so I guess this is good. Should
 we go even higher than that?
   Q4: How much build history do we need?
   Right now stdio logs are kept for 3 weeks and build results (green, red)
 are kept for 1 month. Older build results are archived but can't be accessed
 directly by the buildbot.
   If you have any other suggestions, please let me know!
 Some things that we can't do:
 - Get a better machine. It's already running on a dedicated dual quad-core
   Nehalem server with 24 GB of RAM and 15k rpm drives.
 - Change buildbot to use a non-single-threaded web server. This is way too
   involved.
 WHAT I NEED YOUR HELP WITH:
 1. No more scraping of the waterfall please! If you need to crawl the logs,
    let me know and I can run your script on the database directly.
 2. If you know about Apache mod_cache / mod_proxy and want to help, please
    let me know. build.chromium.org is a proxied cache of the real buildbot
    server, and the cache does not work well. This contributes another
    25-30% of the overall load on the buildbot.
 Thanks!
 Nicolas

 





[chromium-dev] Re: Buildbot performance issue.

2009-07-14 Thread Scott Hess

If I understand right, simply serving the auto-refresh requests is a
substantial load?  At least for the main page, a reverse proxy in
accelerator mode could turn that into a constant load.

[I'd offer to help, but I don't know what kind of technology we're
talking about, here.]

-scott


On Tue, Jul 14, 2009 at 2:25 PM, Evan Martin e...@chromium.org wrote:

 On Tue, Jul 14, 2009 at 2:18 PM, Nicolas Sylvain nsylv...@chromium.org wrote:
   The underlying problem with buildbot is the database format, which is just
 hundreds of thousands of files on the hard drive, with no seek capability,
 and the fact that the webserver itself is single-threaded.
   We currently have 63 slaves on our main waterfall. I think this is too
 many for what buildbot can really support. We would ideally need to split it.

 Can we get upstream buildbot devs involved in this discussion?  It
 seems they ought to be able to scale better than they have.

 It seems to me that a caching layer that only hit the backend once every
 ten seconds would give it ten seconds to grind through its
 computations, which should be more than sufficient, without any extra
 splitting up required.  That is, we should (a) fix the proxy and (b)
 make everyone use the proxy.

   Q3: What kind of auto-refresh do we need?
   We used to be at 60 secs for a long time, and I changed it a couple of
 weeks ago to 90 secs.
 No one complained, so I guess this is good. Should we go even higher than
 that?

 I personally hate auto-refresh.  We should make it opt-in since I
 doubt most users need it and it adds load.  I expect many people
 (myself included) leave the buildbot page open in a background tab and
 have it continually refreshing despite not looking at it.

 (My other common use case: the tree is red, I start scrolling down to
 see what's gone wrong, and then the page refreshes out from under me
 and I lose where I was looking.)

 - Get a better machine. It's already running on a dedicated dual quad core
 nehalem server
   with 24gb of RAM and 15k rpm drives.

 This is absurdly powerful!  It should have all the data necessary to
 generate the page in RAM already, no need for even touching the disks
 (?).

 





[chromium-dev] Re: Buildbot performance issue.

2009-07-14 Thread Nicolas Sylvain
On Tue, Jul 14, 2009 at 2:25 PM, Evan Martin e...@chromium.org wrote:

 On Tue, Jul 14, 2009 at 2:18 PM, Nicolas Sylvain nsylv...@chromium.org wrote:
   The underlying problem with buildbot is the database format, which is just
   hundreds of thousands of files on the hard drive, with no seek capability,
   and the fact that the webserver itself is single-threaded.
   We currently have 63 slaves on our main waterfall. I think this is too
   many for what buildbot can really support. We would ideally need to split it.

 Can we get upstream buildbot devs involved in this discussion?  It
 seems they ought to be able to scale better than they have.


I talked to them a little. They do plan some of that for their 1.0 release,
but they said that it is not on the radar until then.




 It seems to me that a caching layer that only hit the backend once every
 ten seconds would give it ten seconds to grind through its
 computations, which should be more than sufficient, without any extra
 splitting up required.  That is, we should (a) fix the proxy and (b)
 make everyone use the proxy.


That makes a lot of sense. I agree that we should fix the proxy, and
make more people use it. Some direct buildbot access would still be
required internally to force a build and stuff like that.



   Q3: What kind of auto-refresh do we need?
   We used to be at 60 secs for a long time, and I changed it a couple of
   weeks ago to 90 secs. No one complained, so I guess this is good. Should
   we go even higher than that?

 I personally hate auto-refresh.  We should make it opt-in since I
 doubt most users need it and it adds load.  I expect many people
 (myself included) leave the buildbot page open in a background tab and
 have it continually refreshing despite not looking at it.

 (My other common use case: the tree is red, I start scrolling down to
 see what's gone wrong, and then the page refreshes out from under me
 and I lose where I was looking.)


Yep, a lot of people told me the same thing. Some others told me they would
like a faster reload.  Now I'm tempted to let the user control the reload
and not give a default one.




   - Get a better machine. It's already running on a dedicated dual quad-core
     Nehalem server with 24 GB of RAM and 15k rpm drives.

 This is absurdly powerful!  It should have all the data necessary to
 generate the page in RAM already, no need for even touching the disks
 (?).


Yeah, I'm not too sure how to debug this. When I strace the process I only
see file reads, scrolling by like crazy, all the time.

Nicolas




[chromium-dev] Re: Buildbot performance issue.

2009-07-14 Thread 王重傑
On Tue, Jul 14, 2009 at 2:35 PM, Nicolas Sylvain nsylv...@chromium.org wrote:



 On Tue, Jul 14, 2009 at 2:25 PM, Evan Martin e...@chromium.org wrote:

 On Tue, Jul 14, 2009 at 2:18 PM, Nicolas Sylvain nsylv...@chromium.org wrote:
   The underlying problem with buildbot is the database format, which is just
   hundreds of thousands of files on the hard drive, with no seek capability,
   and the fact that the webserver itself is single-threaded.
   We currently have 63 slaves on our main waterfall. I think this is too
   many for what buildbot can really support. We would ideally need to split it.

 Can we get upstream buildbot devs involved in this discussion?  It
 seems they ought to be able to scale better than they have.


 I talked to them a little. They do plan some of that for their 1.0 release,
 but they
 said that it was not on the radar until then.




 It seems to me that a caching layer that only hit the backend once every
 ten seconds would give it ten seconds to grind through its
 computations, which should be more than sufficient, without any extra
 splitting up required.  That is, we should (a) fix the proxy and (b)
 make everyone use the proxy.


 That makes a lot of sense. I agree that we should fix the proxy, and
 make more people use it. Some direct buildbot access would still be
 required internally to force a build and stuff like that.



Q3: What kind of auto-refresh do we need?
We used to be at 60 secs for a long time, and I changed it a couple of
  weeks ago to 90 secs.
  No one complained, so I guess this is good. Should we go even higher
 than
  that?

 I personally hate auto-refresh.  We should make it opt-in since I
 doubt most users need it and it adds load.  I expect many people
 (myself included) leave the buildbot page open in a background tab and
 have it continually refreshing despite not looking at it.

 (My other common use case: the tree is red, I start scrolling down to
 see what's gone wrong, and then the page refreshes out from under me
 and I lose where I was looking.)


 Yep, a lot of people told me the same thing. Some others told me they would
 like a faster reload.  Now I'm tempted to let the user control the reload
 and not give a default one.




  - Get a better machine. It's already running on a dedicated dual quad
 core
  nehalem server
with 24gb of RAM and 15k rpm drives.

 This is absurdly powerful!  It should have all the data necessary to
 generate the page in RAM already, no need for even touching the disks
 (?).


 Yeah, I'm not too sure how to debug this. When I strace the process I only
 see file reads, scrolling by like crazy, all the time.


That is pretty nuts.  Is it calling fsync or something crazy?  Since you
said strace, I'm assuming Linux. In that case, the buffer cache should be
saving you from disk accesses for almost everything.

-Albert




[chromium-dev] Re: Buildbot performance issue.

2009-07-14 Thread Ryosuke Niwa

On Tue, Jul 14, 2009 at 2:35 PM, Nicolas Sylvain nsylv...@chromium.org wrote:
   - Get a better machine. It's already running on a dedicated dual quad-core
     Nehalem server with 24 GB of RAM and 15k rpm drives.

 This is absurdly powerful!  It should have all the data necessary to
 generate the page in RAM already, no need for even touching the disks
 (?).

 Yeah, i'm not too sure how to debug this. When I strace the process I only
 see file reads, scrolling, like crazy, all the time.
 Nicolas

Can't we use a RAM disk?  If we made a RAM disk and moved all the
relevant files there, shouldn't it speed things up?

Ryosuke




[chromium-dev] Re: Buildbot performance issue.

2009-07-14 Thread Adam Langley

On Tue, Jul 14, 2009 at 2:40 PM, Albert J. Wong (王重傑) ajw...@chromium.org wrote:
 That is pretty nuts.  Is it calling fsync or something crazy?  Since you
 said strace, I'm assuming Linux. In that case, the buffer cache should be
 saving you from disk accesses for almost everything.

Of course, vmstat 1 will tell you what disk IO is happening. If you
don't have noatime set on that filesystem, enabling it would probably be good.
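
A small sketch of checking for the noatime option mentioned above, assuming
a Linux box where /proc/mounts lists one mount per line as "device
mountpoint fstype options dump pass". The function name
mounts_without_noatime is made up for illustration.

```python
from typing import List


def mounts_without_noatime(mounts_text: str) -> List[str]:
    """Return the mount points whose option list does not include noatime."""
    result: List[str] = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, options = fields[1], fields[3].split(",")
        if "noatime" not in options:
            result.append(mount_point)
    return result
```

On the buildbot machine this could be run as
mounts_without_noatime(open("/proc/mounts").read()); any filesystem it
reports may be writing access times when files are read.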


AGL




[chromium-dev] Re: Buildbot performance issue.

2009-07-14 Thread Ojan Vafai
On Tue, Jul 14, 2009 at 2:35 PM, Nicolas Sylvain nsylv...@chromium.org wrote:

 On Tue, Jul 14, 2009 at 2:25 PM, Evan Martin e...@chromium.org wrote:

 On Tue, Jul 14, 2009 at 2:18 PM, Nicolas Sylvain nsylv...@chromium.org
 wrote:
 It seems to me that a caching layer that only hit the backend once every
 ten seconds would give it ten seconds to grind through its
 computations, which should be more than sufficient, without any extra
 splitting up required.  That is, we should (a) fix the proxy and (b)
 make everyone use the proxy.


 That makes a lot of sense. I agree that we should fix the proxy, and
 make more people use it. Some direct buildbot access would still be
 required internally to force a build and stuff like that.


Can we make this stuff be based on IP instead of what URL you hit, e.g. have
the force build UI there on the proxy, but have it return a 403 +
descriptive error page when you try to do it?
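
A sketch of the IP-based check suggested above, using Python's standard
ipaddress module. The function name force_build_response and the network
ranges passed to it are hypothetical; the real internal ranges are not given
in this thread.

```python
import ipaddress
from typing import Iterable, Tuple


def force_build_response(client_ip: str,
                         internal_networks: Iterable[str]) -> Tuple[int, str]:
    """Return (HTTP status, message) for a force-build request from client_ip."""
    addr = ipaddress.ip_address(client_ip)
    for net in internal_networks:
        if addr in ipaddress.ip_network(net):
            return 200, "Build request accepted."
    # Same URL for everyone, but outside callers get a descriptive 403.
    return 403, "Forcing a build is only available from the internal network."
```

For example, force_build_response("203.0.113.7", ["10.0.0.0/8"]) returns a
403 with the explanatory message, while an address inside 10.0.0.0/8 is
allowed through.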

Alternately, can we have the internal URLs also use the proxy? Or have them
use a different proxy if that's easier?

Ojan




[chromium-dev] Re: Buildbot performance issue.

2009-07-14 Thread Jeremy Orlow
On Tue, Jul 14, 2009 at 2:42 PM, Adam Langley a...@chromium.org wrote:


 On Tue, Jul 14, 2009 at 2:40 PM, Albert J. Wong (王重傑) ajw...@chromium.org wrote:
   That is pretty nuts.  Is it calling fsync or something crazy?  Since you
   said strace, I'm assuming Linux. In that case, the buffer cache should be
   saving you from disk accesses for almost everything.

 Of course, vmstat 1 will tell you what disk IO is happening. If you
 don't have noatime, that would probably be good.


atop is a really nice program for getting a bird's-eye view of what's going
on with the system.  It's not installed by default, but if you're running
Ubuntu, it's easy to install.


More generally: I think there are a few uses of the build bots:
1) Most people just want to know "can I commit?" and then watch one
specific CL's status.  In this case, not auto-refreshing is fine and not
much history is fine.
2) Sheriffing is the one case where I think you actually do need
auto-refreshing, but normally you don't need a lot of history.  That said,
sometimes things fail and then
3) You're trying to fix things:  In this case you want to see a lot of
history (or at least need the option to see more) and you do NOT want it to
auto-refresh.  I've definitely had times when I wished there was a "show me
more" button.  And I've definitely been reading something far down the
page only to have it refresh on me.

It seems to me that these requirements are diverse enough that one single
configuration isn't going to make everyone happy.  I know you can do a bunch
of customization so you can see exactly what you want, but I assume that
will only chew up more resources.  Maybe the right way to go is a couple of
customized pages for each role?  There's definitely much more information
there than people need most of the time, though.




[chromium-dev] Re: Buildbot performance issue.

2009-07-14 Thread Mohamed Mansour
I would like to see auto refresh turned off by default. That might help the
load.
-- Mohamed Mansour


On Tue, Jul 14, 2009 at 7:15 PM, Michael Nordman micha...@google.com wrote:

 Turning off auto refresh by default sounds like a reasonable idea right now,
 with an option to enable it if really desired: something like autorefresh[=n],
 where n is a number of minutes (defaulting to 1).
 A fair amount of the load may just evaporate with that change.
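
 One way the opt-in autorefresh[=n] parameter could be interpreted, as a
 sketch only. The function name refresh_interval_seconds is hypothetical,
 and treating invalid or non-positive values as "no refresh" is an
 assumption rather than anything agreed in this thread.

```python
from typing import Optional
from urllib.parse import parse_qs


def refresh_interval_seconds(query: str) -> Optional[int]:
    """Map a URL query string to a refresh interval, or None for no refresh."""
    params = parse_qs(query, keep_blank_values=True)
    if "autorefresh" not in params:
        return None                # off unless the user opts in
    value = params["autorefresh"][-1]
    if value == "":
        return 60                  # bare "autorefresh": default of 1 minute
    try:
        minutes = int(value)
    except ValueError:
        return None                # assumption: ignore values that are not numbers
    return minutes * 60 if minutes > 0 else None
```

 A page requested with ?autorefresh=5 would then reload every 300 seconds,
 and a page requested without the parameter would not reload at all.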



 On Tue, Jul 14, 2009 at 3:32 PM, Jeremy Orlow jor...@google.com wrote:

 On Tue, Jul 14, 2009 at 2:42 PM, Adam Langley a...@chromium.org wrote:


 On Tue, Jul 14, 2009 at 2:40 PM, Albert J. Wong (王重傑) ajw...@chromium.org wrote:
   That is pretty nuts.  Is it calling fsync or something crazy?  Since you
   said strace, I'm assuming Linux. In that case, the buffer cache should be
   saving you from disk accesses for almost everything.

 Of course, vmstat 1 will tell you what disk IO is happening. If you
 don't have noatime, that would probably be good.


 atop is a really nice program for getting a bird's-eye view of what's going
 on with the system.  It's not installed by default, but if you're running
 Ubuntu, it's easy to install.


 More generally: I think there are a few uses of the build bots:
 1) Most people just want to know "can I commit?" and then watch one
 specific CL's status.  In this case, not auto-refreshing is fine and not
 much history is fine.
 2) Sheriffing is the one case where I think you actually do need
 auto-refreshing, but normally you don't need a lot of history.  That said,
 sometimes things fail and then
 3) You're trying to fix things:  In this case you want to see a lot of
 history (or at least need the option to see more) and you do NOT want it to
 auto-refresh.  I've definitely had times when I wished there was a "show me
 more" button.  And I've definitely been reading something far down the
 page only to have it refresh on me.

 It seems to me that these requirements are diverse enough that one single
 configuration isn't going to make everyone happy.  I know you can do a bunch
 of customization so you can see exactly what you want, but I assume that
 will only chew up more resources.  Maybe the right way to go is a couple of
 customized pages for each role?  There's definitely much more information
 there than people need most of the time, though.




 

