Hi Pierre,

As always, thanks for the reply. I've trimmed a bit to try to keep things clear.

On 8/25/2016 4:31 AM, Pierre Tardy wrote:
Do you still have the problem with the master not continuing the build after the command has terminated on the worker?

Not at the moment, though since I just restarted the master, I wouldn't expect that particular problem just yet.

Our primary master currently has 5 workers which have one or more builders acquiring locks, which is usually a bad sign (I'll look in on them later). And it has one worker that has build requests queued on multiple builders, but no running builds. That machine seems to be having communications problems, though, so it's probably not a buildbot problem.

On our second master, which runs just a few builders on a few workers, we have a buildrequest that probably would never build, even though that worker isn't building anything else. The previous buildrequest sat for 20 hours or so (I cancelled the queue and forced a new build). I stopped its worker and let our usual process start it back up, and it's running now.

It's just a gut feeling, but I think that there's a single basic problem somewhere that's manifesting itself in a few different ways. As you said previously, I think a deferred is getting lost somewhere.

The recent documentation on multimaster is here:
http://docs.buildbot.net/latest/manual/concepts.html#multimaster


Thanks for the pointer.

This latter information is correct.


I thought so, but it's better to get confirmation.

In nine, there is the new concept of clustered services, which are services that run on only one master. Masters compete to run those services, and the first master to claim a service runs it. Schedulers and change sources are all clustered services. The database acts as an arbitrator (hence multimaster cannot work with sqlite).
https://github.com/buildbot/buildbot/blob/master/master/buildbot/db/model.py#L254
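To make the "first master to claim it wins" idea concrete, here is a minimal sketch of a database acting as arbitrator. It is not Buildbot's actual code: the table `service_claims` and the helpers `make_db`, `try_claim` and `owner` are hypothetical names invented for illustration. It uses an in-memory sqlite database only so that the sketch runs on its own in a single process; a real multimaster setup needs a server database shared by all masters.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a hypothetical claims table."""
    conn = sqlite3.connect(":memory:")
    # One row per clustered service. The primary key guarantees that a
    # service can have at most one owning master at a time.
    conn.execute(
        "CREATE TABLE service_claims ("
        " service_name TEXT PRIMARY KEY,"
        " master_name TEXT NOT NULL)"
    )
    return conn


def try_claim(conn, service_name, master_name):
    """Return True if this master won the claim, False if it was taken."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO service_claims (service_name, master_name)"
                " VALUES (?, ?)",
                (service_name, master_name),
            )
        return True
    except sqlite3.IntegrityError:
        # Another master already holds the row for this service.
        return False


def owner(conn, service_name):
    """Return the name of the master that holds the claim, or None."""
    row = conn.execute(
        "SELECT master_name FROM service_claims WHERE service_name = ?",
        (service_name,),
    ).fetchone()
    return row[0] if row else None


conn = make_db()
print(try_claim(conn, "nightly-scheduler", "master-a"))  # first claim
print(try_claim(conn, "nightly-scheduler", "master-b"))  # competing claim
print(owner(conn, "nightly-scheduler"))
```

The point of the sketch is that the uniqueness constraint, not the masters themselves, decides who runs the service, which is why every master has to talk to the same database.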


What is not implemented is load balancing between masters. Basically, if you run a symmetric multimaster configuration (as per concepts.rst), the first master to start will take all the schedulers and change sources.

(snipped the rest)

That's what I figured.

What you seem to have missed is that for multimaster to work you need a common message queue. At the moment, only crossbar.io <http://crossbar.io> is implemented.
http://docs.buildbot.net/latest/manual/cfg-global.html#mq-specification
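For reference, a master.cfg fragment along these lines might look like the sketch below. The router URL, realm, and database URL are placeholder values, not settings from this thread; each master would point at the same crossbar router and the same non-sqlite database.

```python
# master.cfg fragment; `c` is the usual BuildmasterConfig dictionary.
c = BuildmasterConfig = {}

# Shared message queue: all masters connect to the same crossbar.io router.
c['mq'] = {
    'type': 'wamp',
    'router_url': 'ws://localhost:8080/ws',  # placeholder router address
    'realm': 'realm1',                       # placeholder realm name
}

# Shared database: must be a server database, not sqlite.
c['db'] = {
    # placeholder credentials and host
    'db_url': 'postgresql://buildbot:password@dbhost/buildbot',
}
```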


Yes, I missed it.

Messages are important so that the other master is aware that a new buildrequest has been sent to the database.


I'm not a database guy, per se, but wouldn't any database you'd want to run multi-master on be able to notify the other masters? Postgres, for example, has NOTIFY and LISTEN. I'm not much of a SQLAlchemy guy, either, but a cursory search shows an Event API.

If you don't configure a multimaster-capable mq, then builds will not start instantly on the second master. They will only start when another event happens on that second master (like a new worker (dis)connection or a build finishing).


That may be acceptable. With the volume we have, things finish pretty often. And with some of these builds taking a long time, the wait may be insignificant to us.


If I understand correctly, you are running rc1 for the Python code and rc2 for the UI? That should be fine, but I would recommend updating everything to rc2, as a number of bugs have been fixed. No new features have been added on this stable branch, so I expect this limits the risk of regression.


Not exactly, though I can't be certain as I didn't do that work. As far as I know, only the builders page is rc2 + our change to show the last build, and the cancel queue just came along with it.

Neil Gilmore
grammatech.com