Good afternoon everyone, I have more anecdotes!

Well, we've had multi-master running since Thurs. or so. It's been a mixed bag, but I'll start with the good.

Having the UI and force schedulers on their own master is definitely a good thing. With a single master, it would sometimes take many minutes to populate a page. Now it may take a minute, tops.

Separating out our results process into its own master also seems to be good. That process is pretty stable, but people would complain if I had to take down the master when they needed it.

In general, having our 'real' builds separated across two masters seems to work at least as well as a single master. Maybe a touch better.

Now the neutral:

I had been using 4 identical copies of our master.cfg, 1 per master. This was a bit silly, but got things running quickly. I did a little experiment, and yes, you can have multiple masters pointing to the same master.cfg in their respective buildbot.tac files. In our case, using an absolute path works well. We only want a single copy because it's in our version control, and we don't want to have to remember all the places it turns up. Also, much of it, like the dictionaries containing the workers and their directories, is reused among the masters.

Unfortunately, converting from 4 copies to a single copy requires taking down each master, editing its buildbot.tac, then bringing it back up, because reconfig only works on master.cfg, not buildbot.tac. Oh well...
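For anyone wanting to try the same thing, the buildbot.tac edit amounts to roughly the fragment below. This is only a sketch: the path is a made-up example, not our real layout, and the rest of the generated buildbot.tac is assumed to stay untouched.

```python
# Fragment of buildbot.tac for each master. Only this assignment changes;
# everything else in the generated file is left as it was.
# The path is a hypothetical example: use the absolute path of the one
# version-controlled master.cfg that all masters should share.
configfile = '/srv/buildbot/shared/master.cfg'
```

Each master needs a stop and start after the edit, since reconfig does not re-read buildbot.tac.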

We're also having trouble with checkconfig. I rewrote our master.cfg to decide which master was being configured by comparing against the variable basedir. Unfortunately, basedir is usually '.' when doing a reconfig, and that's not in our dictionary of masters. So we get a KeyError. I'll need to fix that. I'd want to anyway, because checking the config for one particular master means some things don't get checked. I'll work around this by disabling the code, called when builders etc. are added, that compares the current master to the master the object should belong to.
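A minimal sketch of one way to avoid the KeyError, assuming the dictionary of masters is keyed by absolute directory and that the command is run from inside the master's own directory. The name identify_master, the MASTERS dictionary, and all the paths are hypothetical, not what our master.cfg actually contains.

```python
import os

# Hypothetical mapping from each master's base directory to its role.
MASTERS = {
    "/srv/buildbot/ui": "ui",
    "/srv/buildbot/results": "results",
    "/srv/buildbot/build1": "build1",
    "/srv/buildbot/build2": "build2",
}


def identify_master(basedir, masters=MASTERS, cwd=None):
    """Return the role for basedir, or None if it is not a known master.

    A relative basedir such as '.' is resolved against cwd (the current
    working directory by default) before the lookup, and dict.get() is
    used so an unknown directory yields None instead of a KeyError.
    """
    start = os.getcwd() if cwd is None else cwd
    resolved = os.path.normpath(os.path.join(start, basedir))
    return masters.get(resolved)
```

In master.cfg this could be called as identify_master(basedir). A None result could then be treated as "no particular master", so that the per-master ownership checks are skipped rather than failing.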

I was going to move the masters to the single master.cfg, but I had trouble shutting them down. I got logs that mostly looked like this:

2016-10-10T14:45:27-0400 [-] while publishing event org.buildbot.mq.steps.88804.logs.stdio.append
        Traceback (most recent call last):
          File "/usr/local/lib/python2.7/dist-packages/buildbot-0.9.0rc2-py2.7.egg/buildbot/mq/wamp.py", line 37, in produce
            d = self._produce(routingKey, data)
          File "/usr/local/lib/python2.7/dist-packages/buildbot-0.9.0rc2-py2.7.egg/buildbot/mq/wamp.py", line 57, in _produce
            return self.master.wamp.publish(self.messageTopic(routingKey), _data, options=options)
          File "/usr/local/lib/python2.7/dist-packages/Twisted-16.3.0-py2.7-linux-x86_64.egg/twisted/internet/defer.py", line 1274, in unwindGenerator
            return _inlineCallbacks(None, gen, Deferred())
          File "/usr/local/lib/python2.7/dist-packages/Twisted-16.3.0-py2.7-linux-x86_64.egg/twisted/internet/defer.py", line 1128, in _inlineCallbacks
            result = g.send(result)
        --- <exception caught here> ---
          File "/usr/local/lib/python2.7/dist-packages/buildbot-0.9.0rc2-py2.7.egg/buildbot/wamp/connector.py", line 109, in publish
            ret = yield service.publish(topic, data, options=options)
          File "/usr/local/lib/python2.7/dist-packages/autobahn-0.16.0-py2.7.egg/autobahn/wamp/protocol.py", line 1109, in publish
            raise exception.TransportLost()
        autobahn.wamp.exception.TransportLost:

It's worth noting that I'm told we had some network hiccups affecting things last night. This might be nothing. But plain old kill won't stop a buildbot if it doesn't want to stop.

The bad:

Using multi-master doesn't seem to have stopped our lost deferred/stuck build problems. Restarting the master seems to have remedied the problem, but it's disappointing, especially since that's the major problem we were trying to deal with by going to multi-master. The only good news there is that when I have to stop a master because of stuck builds, it doesn't stop everything.

If I could have a single bugfix, it would be to stop losing deferred objects so builds wouldn't stall (assuming that's the problem). If I could have a single new feature, it would be a way to reset a worker completely without having to take down the master, so as to not need the bugfix.

Neil Gilmore
grammatech.com
