Hi everyone,

Another anecdote...

We had a little problem here today, and as a result a few machines were rebooted, including one that has a particular worker.

Here we don't start workers using cron; we mostly start them using buildbot builds (except for the worker whose builds start the other workers). We have a build that logs in to the other machines, determines whether the worker is running, and starts it if it isn't. It runs every hour. The build logs are also useful for monitoring which workers are up, as I find them a bit quicker to scan than the builders page.
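
For illustration, the check-and-start logic of such a build could look roughly like the Python sketch below. It is not the actual build: the host name, the worker base directory, and the helper names (run_ssh, ensure_worker) are hypothetical, and it assumes passwordless ssh plus a twistd.pid file in the worker's base directory.

```python
import subprocess


def run_ssh(host, command):
    """Run a shell command on host over ssh and return its exit status."""
    return subprocess.run(["ssh", host, command]).returncode


def ensure_worker(host, basedir, run=run_ssh):
    """Start the worker on host if it is not running; report what happened."""
    # Assumption: the worker's PID is stored in basedir/twistd.pid.
    # kill -0 sends no signal; it only tests whether that process exists.
    check = f"kill -0 $(cat {basedir}/twistd.pid) 2>/dev/null"
    if run(host, check) == 0:
        return "running"
    # Not running (or no pid file): try to start it.
    if run(host, f"buildbot-worker start {basedir}") == 0:
        return "started"
    return "failed"
```

Calling ensure_worker("worker-host", "/home/buildbot/worker") for each machine and printing the result would give the same kind of quick up/down overview in the build log. Note that a PID check like this can be fooled after a reboot if another process happens to reuse the old PID.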

Unfortunately, the buildbot UI was unresponsive (after 15 minutes it still hadn't given me the builders page). Its last knowledge appeared to be that the builds on the rebooted worker were still in progress (even though that certainly wasn't true).

I had to kill the master and restart it (that particular worker's builds are ones everyone notices). By the time it had fully restarted, our builds to start workers had run, and the rebooted worker's builds were running, the 'BuildMaster is running' line had already rotated down into twistd.log.11.

I'd forced a build to get the rebooted host's worker started. That build took about 15 minutes to start.

And I did notice that on startup we get a lot of unauthorized login entries, as the workers start attempting to connect as soon as the master is up. They go on for several minutes until the master catches up with things. I see a lot of buildstep activity going on in between.

At least this time I didn't have to clear the database.

Neil Gilmore