We had a little problem here today, and as a result a few machines were
rebooted, including one that has a particular worker.
Here we don't start workers using cron, we mostly start them using
buildbot builds (except for the worker whose builds start the other
workers). We have a build that logs in to other machines, determines
whether the worker is running, and starts it if it isn't. It runs every
hour. The build logs are also useful to monitor which workers are up, as
I find it a bit quicker to scan that than the builders page.
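
For anyone wanting to do something similar, here is a minimal sketch of the check-and-start step, not our actual build. It assumes passwordless ssh to the host, that the worker's pid file is twistd.pid in its base directory, and that buildbot-worker is on the remote PATH. The function name and the injectable `run` argument are hypothetical, added only so the logic can be exercised without a real host.

```python
import shlex
import subprocess


def ensure_worker_running(host, basedir, run=subprocess.run):
    """Start the worker on `host` if it is not running.

    Returns True if a start was issued, False if the worker was already up.
    """
    pidfile = shlex.quote(f"{basedir}/twistd.pid")
    # Remote exit status is 0 only if the pid file exists and that pid is alive.
    check = f'test -f {pidfile} && kill -0 "$(cat {pidfile})"'
    if run(["ssh", host, check]).returncode == 0:
        return False  # already running, nothing to do
    # Assumption: buildbot-worker is installed and on the remote PATH.
    start = f"buildbot-worker start {shlex.quote(basedir)}"
    run(["ssh", host, start], check=True)
    return True
```

A stale pid file whose pid has been reused by another process would make this report the worker as running, so a stricter check may be wanted after a reboot.
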
Unfortunately, the buildbot UI was unresponsive (15 minutes and it
hadn't given me the builders page). Its last knowledge appeared to be
that the builds on the rebooted worker were still in progress (even
though that certainly wasn't true).
I had to kill the master and restart it (that particular worker's builds
are ones everyone notices). By the time it was fully restarted, our
builds to start workers had run, and the rebooted worker's builds were
running, the 'BuildMaster is running' line had already rotated down to
twistd.log.11.
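
To find which rotated log a line like that has ended up in, something along these lines would do. It is only a sketch: it assumes the rotated files sit next to twistd.log in the master's base directory, and the function name is made up for illustration.

```python
from pathlib import Path


def find_marker(log_dir, marker="BuildMaster is running"):
    """Return the names of twistd.log* files in log_dir that contain marker."""
    hits = []
    for path in sorted(Path(log_dir).glob("twistd.log*")):
        # errors="replace" keeps stray non-UTF-8 bytes from aborting the scan.
        with open(path, errors="replace") as fh:
            if any(marker in line for line in fh):
                hits.append(path.name)
    return hits
```

Called on the master's base directory, it returns a list of file names, such as twistd.log.11 in the case above.
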
I'd forced a build to get the rebooted host's worker started. It took
about 15 minutes for it to start.
And I did notice that upon our startup we do get a lot of unauthorized
login entries as the workers start attempting to connect as soon as the
master is up. They go on for several minutes until the master catches up
with things. I see a lot of buildstep activity going on in between.
At least this time I didn't have to clear the database.