<snip>

> True, but does it matter that much for a build? I mean the build is taking
> source and producing something that can be scratched and rebuilt any time.
> Even if a machine fails you haven't lost anything.

True, but... :)

If a machine fails at 2 am, you haven't lost code but you've lost (potentially) 
a weekly build that a small army of developers and testers are planning to use 
at 7am. Is getting the weekly build (or the nightly build) out the door 
reliably worth having enough people on call with enough expertise to replace 
the box, replicate the environment and get the builds running again? I'd rather 
address the problem in software (if possible).

For me, it's an issue of robustness. Does it work every time? If you are 
depending on a commodity PC hard drive or power supply, the answer is no. Then 
you have to face the issue of whether or not your customers (in this case, the 
development community) are losing faith in your ability to deliver product... 
But I digress. 

Our existing solution has a cluster of build machines that provide very nice 
failover, so the feature is "expected". Suggesting we look at a build system 
with a dozen boxes, with each one being a point of failure, wouldn't go over 
well. Given cascading build failure issues, the wrong box dying could take out 
(literally) hundreds of builds.

> 
> > 
> > To give you a little more background, I'm thinking about several hundred
> > developers, 300 projects (plus installers) and about 5 million lines of
> > code. (SAS is big.)
> > http://www.pragmaticautomation.com/cgi-bin/pragauto.cgi/Build/CCOnALargeScale.rdoc
> 
> Note: I can't access the doc (access is forbidden).
> 

I can access it. Hmmmm... Try just hitting http://www.pragmaticautomation.com 
and then search on Jared or SAS. The blog entry is titled "CruiseControl on a 
Large Scale". 


> I think that with such a system having a perfect load-balancer (i.e.
> automatic load balancing) would be more important than the fact that a
> machine could crash (which would lead to nothing being lost, as everything
> can always be rebuilt). You could even save logs for whatever is put in the
> build queue, along with the result of each build, so that you know which
> builds have not been processed. You could also have the machine that takes
> a build start by logging it, so that you could replay the build if the
> machine crashes, etc.
> 

I think we agree on this point. Do you agree? :)
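
Just to make sure we're picturing the same thing, here's the kind of journal 
I imagine for the "log it before you build it" part. This is only a sketch 
(the class name and file format are invented, not something CruiseControl 
provides): every request is appended to a journal before any agent touches 
it and marked done afterwards, so after a crash you re-queue whatever was 
never marked done.

    import java.io.IOException;
    import java.nio.file.*;
    import java.util.*;

    /** Append-only journal of build requests so unfinished builds can be replayed. */
    public class BuildJournal {

        private final Path journal;

        public BuildJournal(Path journal) { this.journal = journal; }

        /** Record the request before any agent starts working on it. */
        public synchronized void queued(String project) throws IOException {
            append("QUEUED " + project);
        }

        /** Record that the build finished (pass or fail, it was processed). */
        public synchronized void finished(String project) throws IOException {
            append("DONE " + project);
        }

        /** Builds that were queued but never finished -- re-queue these after a crash. */
        public synchronized List<String> unfinished() throws IOException {
            Set<String> pending = new LinkedHashSet<>();
            if (!Files.exists(journal)) return new ArrayList<>(pending);
            for (String line : Files.readAllLines(journal)) {
                if (line.startsWith("QUEUED ")) pending.add(line.substring(7));
                if (line.startsWith("DONE "))   pending.remove(line.substring(5));
            }
            return new ArrayList<>(pending);
        }

        private void append(String line) throws IOException {
            Files.write(journal, Collections.singletonList(line),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }
    }

Nothing fancy, but it means a dead build box (or a dead dispatcher) costs you 
a re-run rather than a lost build.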

> > 
> > We could put builds 1 through 17 on a single machine, 18 through 30 on
> > another, 
> 
> I think having a "reverse" load balancer would improve the overall
> efficiency of the build farm a lot. If you put builds 1 through 17 on one
> machine and those builds happen to be simple builds, or ones without a lot
> of developers, or whatever, the machine will be under-used. If these builds
> are heavy, then the projects won't get built as often as they could, etc.
> 
> That said, I'm not an expert in this domain so I'd love to be proved wrong
> and learn something in the process! :-)
> 

I'm not following you. What do you mean by reverse load balancer? 


> > etc, but if/when a single machine crashes, the recovery time
> > would become a real issue. 
> 
> It is if you have a controller that decides which machine each build goes
> to, because that controller has to understand that a crashed machine should
> be removed from the farm and not be given any more jobs, whereas with the
> "reverse" load balancer there's no logic to implement for this.
> 
> > Also, you can't load balance this way. Someone would end up "tuning" the
> > load to get projects that run in parallel off of the same machine.
> 
> I don't understand this point.

:)  If you have builds assigned to a given machine (and only that machine) then 
a person will have to move builds from machine to machine if you want to 
efficiently use the hardware. Random (or alphabetical) assignments will not be 
an effective load-spreading mechanism.


> 
> > 
> > As to "knowing the state", if the proxy/manager issues the job and is
> > notified (return code? Good log file?) of the job's completion, it's
> > pretty easy to keep track of state. CPU, RAM, etc. become a side issue if
> > you just issue one build at a time to each box in the cluster. Once a box
> > is finished, you send another job. The faster boxes process more builds
> > and the slow ones process fewer. Automatic load balancing.
> 
> Yes, true. The hard part is knowing which machine to send the next build
> job to, so you need to get some answer from the build machines. And you
> need to modify the scheduler if a new machine is added.
> 
> But yes, I think both solutions have pros and cons. I haven't really
> implemented the "reverse" load balancer solution but I have always found it
> extremely elegant in terms of architecture. It seems to me it has more pros
> than cons but maybe the devil is in the implementation details... :-)
> 
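
If by "reverse" load balancer you mean the build boxes pull work from a 
shared queue instead of a scheduler pushing work at them, then something 
like the sketch below is how I'd picture it (again, just an illustration; 
the names are invented, and in a real farm the queue would live on a server 
the agents talk to over the network rather than in-process):

    import java.util.concurrent.BlockingQueue;

    /** One of these runs per build box: pull a project, build it, come back for more. */
    public class BuildAgent implements Runnable {

        private final BlockingQueue<String> queue;   // shared build-request queue

        public BuildAgent(BlockingQueue<String> queue) { this.queue = queue; }

        public void run() {
            try {
                while (true) {
                    String project = queue.take();   // block until a build is requested
                    build(project);                  // a faster box returns sooner and takes more work
                }
            } catch (InterruptedException shuttingDown) {
                // a box going away simply stops taking work; no scheduler to reconfigure
            }
        }

        private void build(String project) {
            // kick off the real build (ant, make, whatever) here; stubbed for the sketch
            System.out.println("building " + project);
        }
    }

Adding capacity is just pointing one more agent at the same queue, and the 
journal idea from earlier would sit right around queue.take() so a half-done 
build can be replayed.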
