Hi Everyone,

Well, I put 0.9.3 multi-master plus Pierre's reconfig patches into production Tues. afternoon. After running a few days, it mostly works.

Unfortunately, there are always problems. Our current problem is that we've added a few workers since then, and while the builders associated with those workers are having builds scheduled, those builds never start. Even forced builds do not start.

Here's what the worker log shows:

2017-02-03 12:39:08-0500 [-] Loading buildbot.tac...
2017-02-03 12:39:09-0500 [-] Loaded.
2017-02-03 12:39:09-0500 [-] twistd 16.2.0 (/usr/bin/python 2.7.6) starting up.
2017-02-03 12:39:09-0500 [-] reactor class: twisted.internet.epollreactor.EPollReactor.
2017-02-03 12:39:09-0500 [-] Starting Worker -- version: 0.9.0rc2
2017-02-03 12:39:09-0500 [-] recording hostname in twistd.hostname
2017-02-03 12:39:09-0500 [-] Starting factory <buildbot_worker.pb.BotFactory instance at 0x7f5ed09fcd88>
2017-02-03 12:39:09-0500 [-] Connecting to buildbot:9984
2017-02-03 12:39:09-0500 [Broker,client] message from master: attached
2017-02-03 12:39:09-0500 [Broker,client] Connected to <host:port>; worker is ready
2017-02-03 12:39:09-0500 [Broker,client] sending application-level keepalives every 600 seconds

And here's what the master log shows (yes, I've redacted host names, etc.). The masters are pretty busy, so I hope I have the relevant entries here:

2017-02-03T12:39:09-0500 [Broker,28906,10.233.216.43] worker '<name>' attaching from IPv4Address(TCP, '<ip>', 35642)
2017-02-03T12:39:09-0500 [Broker,28906,10.233.216.43] Got workerinfo from '<name>'
2017-02-03T12:39:09-0500 [-] bot attached
2017-02-03T12:39:09-0500 [-] worker <name> cannot attach
        Traceback (most recent call last):
Failure: twisted.internet.error.AlreadyCalled: Tried to cancel an already-called event.

This is consistent for all the added workers. The UI shows that the workers are attached, and the builds are scheduled as normal. They just never seem to start. Workers present when we started the 0.9.3 masters (using the same database as before) appear to be working correctly.

The 'cannot attach' entry comes from Worker.attached() after an exception in AbstractWorker.attached(). But it comes very late in AbstractWorker.attached(), as these are the only lines after the 'bot attached' entry is generated:

        self.messageReceivedFromWorker()
        self.stopMissingTimer()
        yield self.updateWorker()
        yield self.botmaster.maybeStartBuildsForWorker(self.name)

I have no clue which might be the problem, or which event was already called. When it's a forced build, the worker is on a different master, as the force schedulers are all on our UI master, but this happens with scheduled builds, too. And those schedulers should be on the same master as the builder and worker.
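
For anyone not familiar with the error itself: in Twisted, AlreadyCalled is what a delayed call's cancel() raises when the call has already fired. The sketch below is a toy, stdlib-only model of that behaviour, not Buildbot or Twisted code. FakeDelayedCall, stop_timer_unguarded and stop_timer_guarded are hypothetical names made up for illustration, and I am only assuming (not claiming) that something like this is what happens with one of the timers touched in attached().

```python
class AlreadyCalled(Exception):
    """Stand-in for twisted.internet.error.AlreadyCalled (assumption: toy model)."""


class FakeDelayedCall:
    """Hypothetical one-shot timer that mimics cancel-after-fire semantics."""

    def __init__(self, func):
        self.func = func
        self.called = False
        self.cancelled = False

    def fire(self):
        # Simulates the reactor running the timer when it comes due.
        self.called = True
        self.func()

    def active(self):
        # True only while the timer can still fire.
        return not (self.called or self.cancelled)

    def cancel(self):
        # Cancelling after the timer has fired is an error.
        if self.called:
            raise AlreadyCalled("Tried to cancel an already-called event.")
        self.cancelled = True


def stop_timer_unguarded(timer):
    # Cancels unconditionally; raises if the timer already fired.
    timer.cancel()


def stop_timer_guarded(timer):
    # Cancels only a timer that is still pending.
    if timer is not None and timer.active():
        timer.cancel()


def demo():
    fired = []
    timer = FakeDelayedCall(lambda: fired.append("missing"))
    timer.fire()  # the timer goes off before anyone cancels it
    try:
        stop_timer_unguarded(timer)
        unguarded = "ok"
    except AlreadyCalled:
        unguarded = "AlreadyCalled"
    stop_timer_guarded(timer)  # no exception: the timer is no longer active
    return fired, unguarded
```

If this is roughly what is going on, a timer that fired before the worker attached and was never cleared would explain why the exception shows up this late in attached(), but that is a guess on my part.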

I've tried a number of things to correct this, short of just shutting everything down and/or using a new database.

We also had several builders showing 2 builds building at the same time. This appears to be benign, as going in through the manhole and looking at masters.botmaster.namedServices['<name>'].building shows only 1 build on a builder.

Neil Gilmore
grammatech.com


