[chromium-dev] Re: Buildbot performance issue.
On Tue, Jul 14, 2009 at 2:18 PM, Nicolas Sylvain nsylv...@chromium.org wrote:
> The underlying problem with buildbot is the database format, which is just hundreds of thousands of files on the hard drive, with no seek capability, and the fact that the web server itself is single-threaded. We currently have 63 slaves on our main waterfall. I think this is too many for what buildbot can really support. We would ideally need to split it.

Can we get upstream buildbot devs involved in this discussion? It seems they ought to be able to scale better than they have.

It seems to me a caching layer that only ever hit the backend every ten seconds would allow it ten seconds to grind through its computations, which should be more than sufficient, without any extra splitting up required. That is, we should (a) fix the proxy and (b) make everyone use the proxy.

> Q3: What kind of auto-refresh do we need? We used to be at 60 secs for a long time, and I changed it a couple of weeks ago to 90 secs. No one complained, so I guess this is good. Should we go even higher than that?

I personally hate auto-refresh. We should make it opt-in since I doubt most users need it and it adds load. I expect many people (myself included) leave the buildbot page open in a background tab and have it continually refreshing despite not looking at it. (My other common use case: the tree is red, I start scrolling down to see what's gone wrong, and then the page refreshes out from under me and I lose where I was looking.)

> - Get a better machine. It's already running on a dedicated dual quad-core Nehalem server with 24 GB of RAM and 15k rpm drives.

This is absurdly powerful! It should have all the data necessary to generate the page in RAM already, no need for even touching the disks (?).

--~--~-~--~~~---~--~~ Chromium Developers mailing list: chromium-dev@googlegroups.com View archives, change email options, or unsubscribe: http://groups.google.com/group/chromium-dev -~--~~~~--~~--~--~---
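The caching-layer idea in this message can be sketched roughly as follows. This is only an illustration in Python, not how build.chromium.org or any real proxy is configured: `TTLCache`, `fetch`, and `slow_backend` are hypothetical names, and the 10-second lifetime is simply the figure suggested above. The point it shows is that however many clients ask for a page, the backend renders it at most once per interval.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Serve cached responses; refetch a path at most once per `ttl` seconds."""

    def __init__(self, fetch: Callable[[str], str], ttl: float = 10.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch    # hypothetical call into the slow backend
        self._ttl = ttl        # seconds a cached page stays fresh
        self._clock = clock    # injectable so the behaviour can be tested
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, path: str) -> str:
        now = self._clock()
        entry = self._entries.get(path)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]            # still fresh: backend is not touched
        body = self._fetch(path)       # stale or missing: one backend hit
        self._entries[path] = (now, body)
        return body


if __name__ == "__main__":
    calls = []

    def slow_backend(path: str) -> str:
        # Stand-in for the expensive page render.
        calls.append(path)
        return f"rendered {path}"

    cache = TTLCache(slow_backend, ttl=10.0)
    for _ in range(1000):              # 1000 client requests in a burst
        cache.get("/waterfall")
    print(len(calls))                  # number of times the backend was hit
```

With a real proxy the same effect would come from a short cache lifetime on the waterfall pages rather than from custom code.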
[chromium-dev] Re: Buildbot performance issue.
On Tue, Jul 14, 2009 at 2:18 PM, Nicolas Sylvain nsylv...@chromium.org wrote:
> Q3: What kind of auto-refresh do we need? We used to be at 60 secs for a long time, and I changed it a couple of weeks ago to 90 secs. No one complained, so I guess this is good. Should we go even higher than that?

Why do we even default the waterfall to autorefresh? Most of the time when I commit, I refresh it myself obsessively anyway. I get the impression that most people do this. I imagine that turning off autorefresh and allowing people who really want it to opt in would save a lot of unnecessary overhead (e.g. people leave the waterfall page open for hours when they're not at their desk).

I'll note that upstream WebKit's waterfall does not autorefresh and I've never heard anyone complain about it.

Ojan
[chromium-dev] Re: Buildbot performance issue.
I was just thinking... doesn't it have a reverse proxy in front of it? It could also force a content cache time of 60s or even more... Something like Squid or Varnish.

Oh, and this morning was probably me scraping the logs. :-( Sorry.

On Tue, Jul 14, 2009 at 14:18, Nicolas Sylvain nsylv...@chromium.org wrote:
> Hello,
>
> We reached a point where we have too many build slaves and too many users to get good performance/latency out of our current buildbot waterfall and console view page. We've been tweaking the page a lot lately to make it faster to load, but the number of new slaves every day and the number of new users are making it slow down faster than we can address the problem.
>
> This morning buildbot was completely unresponsive because it was not able to keep up with the demand. To bring it back to life, I disabled two features, which account for about 50% of all traffic: the buildbot Chrome extensions, and the 3 overview bars at the top of the waterfall. This should keep it online for a while, but it is not ideal. It's time to think of a better solution.
>
> The underlying problem with buildbot is the database format, which is just hundreds of thousands of files on the hard drive, with no seek capability, and the fact that the web server itself is single-threaded. We currently have 63 slaves on our main waterfall. I think this is too many for what buildbot can really support. We would ideally need to split it.
>
> Q1: What kind of split would you prefer? mac/linux/windows or chromium/webkit/modules or full/windows/mac/linux/memory, etc.? The main buildbot page would most likely become a bunch of iframes to display all the slaves at the same time. The console view integration might be a little bit less nice. If there is anyone with web development experience who wants to help, we could modify the current waterfall to fetch only JSON data from the buildbot and merge it together, client side, to get a combined view.
>
> Q2: How many changes do we need to display on the console view? We are currently displaying the last 50 changes, which is usually half a day. If people don't mind, we could scale back to 30. This would make it a little faster to load.
>
> Q3: What kind of auto-refresh do we need? We used to be at 60 secs for a long time, and I changed it a couple of weeks ago to 90 secs. No one complained, so I guess this is good. Should we go even higher than that?
>
> Q4: How much build history do we need? Right now stdio logs are kept for 3 weeks and build results (green, red) are kept for 1 month. Older build results are archived but can't be accessed directly by the buildbot.
>
> If you have any other suggestions, please let me know!
>
> Some things that we can't do:
> - Get a better machine. It's already running on a dedicated dual quad-core Nehalem server with 24 GB of RAM and 15k rpm drives.
> - Change buildbot to use a non-single-threaded web server. This is way too involved.
>
> WHAT I NEED YOUR HELP WITH:
> 1. No more scraping of the waterfall please! If you need to crawl the logs, let me know and I can run your script on the database directly.
> 2. If you know about Apache mod_cache / mod_proxy and want to help, please let me know. build.chromium.org is a proxied cache of the real buildbot server, and the cache does not work well. This contributes another 25/30% of the overall load on the buildbot.
>
> Thanks!
> Nicolas
[chromium-dev] Re: Buildbot performance issue.
If I understand right, simply serving the auto-refresh requests is substantial? At least for the main page, a reverse proxy in accelerator mode could turn that into a constant load.

[I'd offer to help, but I don't know what kind of technology we're talking about, here.]

-scott

On Tue, Jul 14, 2009 at 2:25 PM, Evan Martin e...@chromium.org wrote:
> On Tue, Jul 14, 2009 at 2:18 PM, Nicolas Sylvain nsylv...@chromium.org wrote:
>> The underlying problem with buildbot is the database format, which is just hundreds of thousands of files on the hard drive, with no seek capability, and the fact that the web server itself is single-threaded. We currently have 63 slaves on our main waterfall. I think this is too many for what buildbot can really support. We would ideally need to split it.
>
> Can we get upstream buildbot devs involved in this discussion? It seems they ought to be able to scale better than they have.
>
> It seems to me a caching layer that only ever hit the backend every ten seconds would allow it ten seconds to grind through its computations, which should be more than sufficient, without any extra splitting up required. That is, we should (a) fix the proxy and (b) make everyone use the proxy.
>
>> Q3: What kind of auto-refresh do we need? We used to be at 60 secs for a long time, and I changed it a couple of weeks ago to 90 secs. No one complained, so I guess this is good. Should we go even higher than that?
>
> I personally hate auto-refresh. We should make it opt-in since I doubt most users need it and it adds load. I expect many people (myself included) leave the buildbot page open in a background tab and have it continually refreshing despite not looking at it. (My other common use case: the tree is red, I start scrolling down to see what's gone wrong, and then the page refreshes out from under me and I lose where I was looking.)
>
>> - Get a better machine. It's already running on a dedicated dual quad-core Nehalem server with 24 GB of RAM and 15k rpm drives.
>
> This is absurdly powerful! It should have all the data necessary to generate the page in RAM already, no need for even touching the disks (?).
[chromium-dev] Re: Buildbot performance issue.
On Tue, Jul 14, 2009 at 2:25 PM, Evan Martin e...@chromium.org wrote:
> On Tue, Jul 14, 2009 at 2:18 PM, Nicolas Sylvain nsylv...@chromium.org wrote:
>> The underlying problem with buildbot is the database format, which is just hundreds of thousands of files on the hard drive, with no seek capability, and the fact that the web server itself is single-threaded. We currently have 63 slaves on our main waterfall. I think this is too many for what buildbot can really support. We would ideally need to split it.
>
> Can we get upstream buildbot devs involved in this discussion? It seems they ought to be able to scale better than they have.

I talked to them a little. They do plan some of that for their 1.0 release, but they said that it was not on the radar until then.

> It seems to me a caching layer that only ever hit the backend every ten seconds would allow it ten seconds to grind through its computations, which should be more than sufficient, without any extra splitting up required. That is, we should (a) fix the proxy and (b) make everyone use the proxy.

That makes a lot of sense. I agree that we should fix the proxy, and make more people use it. Some direct buildbot access would still be required internally to force a build and stuff like that.

>> Q3: What kind of auto-refresh do we need? We used to be at 60 secs for a long time, and I changed it a couple of weeks ago to 90 secs. No one complained, so I guess this is good. Should we go even higher than that?
>
> I personally hate auto-refresh. We should make it opt-in since I doubt most users need it and it adds load. I expect many people (myself included) leave the buildbot page open in a background tab and have it continually refreshing despite not looking at it. (My other common use case: the tree is red, I start scrolling down to see what's gone wrong, and then the page refreshes out from under me and I lose where I was looking.)

Yep, a lot of people told me the same thing. Some others told me they would like a faster reload. Now I'm tempted to let the user control the reload and not give a default one.

>> - Get a better machine. It's already running on a dedicated dual quad-core Nehalem server with 24 GB of RAM and 15k rpm drives.
>
> This is absurdly powerful! It should have all the data necessary to generate the page in RAM already, no need for even touching the disks (?).

Yeah, I'm not too sure how to debug this. When I strace the process I only see file reads, scrolling, like crazy, all the time.

Nicolas
[chromium-dev] Re: Buildbot performance issue.
On Tue, Jul 14, 2009 at 2:35 PM, Nicolas Sylvain nsylv...@chromium.org wrote:
> On Tue, Jul 14, 2009 at 2:25 PM, Evan Martin e...@chromium.org wrote:
>> On Tue, Jul 14, 2009 at 2:18 PM, Nicolas Sylvain nsylv...@chromium.org wrote:
>>> The underlying problem with buildbot is the database format, which is just hundreds of thousands of files on the hard drive, with no seek capability, and the fact that the web server itself is single-threaded. We currently have 63 slaves on our main waterfall. I think this is too many for what buildbot can really support. We would ideally need to split it.
>>
>> Can we get upstream buildbot devs involved in this discussion? It seems they ought to be able to scale better than they have.
>
> I talked to them a little. They do plan some of that for their 1.0 release, but they said that it was not on the radar until then.
>
>> It seems to me a caching layer that only ever hit the backend every ten seconds would allow it ten seconds to grind through its computations, which should be more than sufficient, without any extra splitting up required. That is, we should (a) fix the proxy and (b) make everyone use the proxy.
>
> That makes a lot of sense. I agree that we should fix the proxy, and make more people use it. Some direct buildbot access would still be required internally to force a build and stuff like that.
>
>>> Q3: What kind of auto-refresh do we need? We used to be at 60 secs for a long time, and I changed it a couple of weeks ago to 90 secs. No one complained, so I guess this is good. Should we go even higher than that?
>>
>> I personally hate auto-refresh. We should make it opt-in since I doubt most users need it and it adds load. I expect many people (myself included) leave the buildbot page open in a background tab and have it continually refreshing despite not looking at it. (My other common use case: the tree is red, I start scrolling down to see what's gone wrong, and then the page refreshes out from under me and I lose where I was looking.)
>
> Yep, a lot of people told me the same thing. Some others told me they would like a faster reload. Now I'm tempted to let the user control the reload and not give a default one.
>
>>> - Get a better machine. It's already running on a dedicated dual quad-core Nehalem server with 24 GB of RAM and 15k rpm drives.
>>
>> This is absurdly powerful! It should have all the data necessary to generate the page in RAM already, no need for even touching the disks (?).
>
> Yeah, I'm not too sure how to debug this. When I strace the process I only see file reads, scrolling, like crazy, all the time.

That is pretty nuts. Is it calling fsync or something crazy? Since you said strace, I'm assuming Linux. In that case, the buffer cache should be saving you from disk accesses for most everything.

-Albert
[chromium-dev] Re: Buildbot performance issue.
On Tue, Jul 14, 2009 at 2:35 PM, Nicolas Sylvain nsylv...@chromium.org wrote:
>>> - Get a better machine. It's already running on a dedicated dual quad-core Nehalem server with 24 GB of RAM and 15k rpm drives.
>>
>> This is absurdly powerful! It should have all the data necessary to generate the page in RAM already, no need for even touching the disks (?).
>
> Yeah, I'm not too sure how to debug this. When I strace the process I only see file reads, scrolling, like crazy, all the time.
>
> Nicolas

Can't we use a RAM disk? If we made a RAM disk and moved all the relevant files there, shouldn't it speed things up?

Ryosuke
[chromium-dev] Re: Buildbot performance issue.
On Tue, Jul 14, 2009 at 2:40 PM, Albert J. Wong (王重傑) ajw...@chromium.org wrote:
> That is pretty nuts. Is it calling fsync or something crazy? Since you said strace, I'm assuming Linux. In that case, the buffer cache should be saving you from disk accesses for most everything.

Of course, vmstat 1 will tell you what disk IO is happening. If you don't have noatime set, setting it would probably be good.

AGL
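For the noatime suggestion, one way to see which filesystems still update access times is to look at the mount options. The sketch below is a hypothetical Python helper (the name `mounts_without_noatime` and the sample text are made up for illustration); it assumes input in the Linux /proc/mounts layout of device, mount point, filesystem type, and comma-separated options. It only checks for the literal `noatime` option, so a filesystem mounted with `relatime` would still be listed.

```python
from typing import List


def mounts_without_noatime(mounts_text: str) -> List[str]:
    """Return mount points whose option list does not contain 'noatime'.

    `mounts_text` is expected to look like the contents of /proc/mounts:
    one mount per line, whitespace-separated fields, options in field 4.
    """
    result = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue                      # skip blank or malformed lines
        mount_point = fields[1]
        options = fields[3].split(",")
        if "noatime" not in options:
            result.append(mount_point)
    return result


if __name__ == "__main__":
    # Made-up sample; on a real Linux box the text would be read from /proc/mounts.
    sample = (
        "/dev/sda1 / ext3 rw,errors=remount-ro 0 0\n"
        "/dev/sdb1 /data ext3 rw,noatime 0 0\n"
    )
    print(mounts_without_noatime(sample))
```

Any mount point it reports that holds the buildbot data files would be a candidate for remounting with noatime.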
[chromium-dev] Re: Buildbot performance issue.
On Tue, Jul 14, 2009 at 2:35 PM, Nicolas Sylvain nsylv...@chromium.org wrote:
> On Tue, Jul 14, 2009 at 2:25 PM, Evan Martin e...@chromium.org wrote:
>> On Tue, Jul 14, 2009 at 2:18 PM, Nicolas Sylvain nsylv...@chromium.org wrote:
>>
>> It seems to me a caching layer that only ever hit the backend every ten seconds would allow it ten seconds to grind through its computations, which should be more than sufficient, without any extra splitting up required. That is, we should (a) fix the proxy and (b) make everyone use the proxy.
>
> That makes a lot of sense. I agree that we should fix the proxy, and make more people use it. Some direct buildbot access would still be required internally to force a build and stuff like that.

Can we make this stuff be based on IP instead of what URL you hit, e.g. have the force build UI there on the proxy, but have it return a 403 + descriptive error page when you try to do it? Alternately, can we have the internal URLs also use the proxy? Or have them use a different proxy if that's easier?

Ojan
[chromium-dev] Re: Buildbot performance issue.
On Tue, Jul 14, 2009 at 2:42 PM, Adam Langley a...@chromium.org wrote:
> On Tue, Jul 14, 2009 at 2:40 PM, Albert J. Wong (王重傑) ajw...@chromium.org wrote:
>> That is pretty nuts. Is it calling fsync or something crazy? Since you said strace, I'm assuming Linux. In that case, the buffer cache should be saving you from disk accesses for most everything.
>
> Of course, vmstat 1 will tell you what disk IO is happening. If you don't have noatime, that would probably be good.

atop is a really nice program for getting a bird's-eye view of what's going on with the system. It's not installed by default, but if you're running Ubuntu, it's easy to install.

More generally: I think there are a couple of uses of the build bots:

1) Most people just want to know "can I commit?" and then watch one specific CL's status. In this case, not auto-refreshing is fine and not much history is fine.
2) Sheriffing is the one case where I think you actually do need auto-refreshing, but normally you don't need a lot of history. That said, sometimes things fail and then
3) You're trying to fix things: In this case you want to see a lot of history (or at least need the option to see more) and you do NOT want it to auto-refresh. I've definitely had times when I wished there was a "show me more" button. And I've definitely been reading something far down the page only to have it refresh on me.

It seems to me that these requirements are diverse enough that one single configuration isn't going to make everyone happy. I know you can do a bunch of customization so you can see exactly what you want, but I assume that will only chew up more resources. Maybe the right way to go is a couple of customized pages for each role? There's definitely much more information there than people need most of the time, though.
[chromium-dev] Re: Buildbot performance issue.
I would like to see auto refresh turned off by default. That might help the load.

-- Mohamed Mansour

On Tue, Jul 14, 2009 at 7:15 PM, Michael Nordman micha...@google.com wrote:
> Turning off auto refresh by default sounds like a reasonable idea right now... with an option to enable it if really desired... autorefresh[=n] where n is a number of minutes maybe (defaults to 1) type thing. A fair amount of the load may just evaporate with that change.
>
> On Tue, Jul 14, 2009 at 3:32 PM, Jeremy Orlow jor...@google.com wrote:
>> On Tue, Jul 14, 2009 at 2:42 PM, Adam Langley a...@chromium.org wrote:
>>> On Tue, Jul 14, 2009 at 2:40 PM, Albert J. Wong (王重傑) ajw...@chromium.org wrote:
>>>> That is pretty nuts. Is it calling fsync or something crazy? Since you said strace, I'm assuming Linux. In that case, the buffer cache should be saving you from disk accesses for most everything.
>>>
>>> Of course, vmstat 1 will tell you what disk IO is happening. If you don't have noatime, that would probably be good.
>>
>> atop is a really nice program for getting a bird's-eye view of what's going on with the system. It's not installed by default, but if you're running Ubuntu, it's easy to install.
>>
>> More generally: I think there are a couple of uses of the build bots:
>>
>> 1) Most people just want to know "can I commit?" and then watch one specific CL's status. In this case, not auto-refreshing is fine and not much history is fine.
>> 2) Sheriffing is the one case where I think you actually do need auto-refreshing, but normally you don't need a lot of history. That said, sometimes things fail and then
>> 3) You're trying to fix things: In this case you want to see a lot of history (or at least need the option to see more) and you do NOT want it to auto-refresh. I've definitely had times when I wished there was a "show me more" button. And I've definitely been reading something far down the page only to have it refresh on me.
>>
>> It seems to me that these requirements are diverse enough that one single configuration isn't going to make everyone happy. I know you can do a bunch of customization so you can see exactly what you want, but I assume that will only chew up more resources. Maybe the right way to go is a couple of customized pages for each role? There's definitely much more information there than people need most of the time, though.
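The opt-in `autorefresh[=n]` parameter suggested in this message could be interpreted along these lines. This is a sketch under assumptions, not buildbot's actual query handling: the function name `refresh_seconds` and the example URLs are hypothetical, n is taken to be minutes with a default of 1 as proposed, and a missing or invalid parameter is treated as "no auto-refresh".

```python
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def refresh_seconds(url: str, default_minutes: int = 1) -> Optional[int]:
    """Return the reload interval in seconds, or None when auto-refresh is off."""
    # keep_blank_values=True so a bare "?autorefresh" is seen with value "".
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    if "autorefresh" not in query:
        return None                       # opt-in: no parameter, no refresh
    raw = query["autorefresh"][-1]        # last occurrence wins
    if raw == "":
        return default_minutes * 60       # "?autorefresh" alone: default interval
    try:
        minutes = int(raw)
    except ValueError:
        return None                       # unparseable value: stay off
    return minutes * 60 if minutes > 0 else None


if __name__ == "__main__":
    for url in ("http://example.invalid/waterfall",
                "http://example.invalid/waterfall?autorefresh",
                "http://example.invalid/waterfall?autorefresh=5"):
        print(url, refresh_seconds(url))
```

A page template would then emit a refresh directive only when the result is not None, so visitors who never ask for auto-refresh generate no background reloads.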