Hi again,

A further update on that DB issue.

It seems that the last such incident caused two of the new branch schedulers I was adding not to be registered properly. The schedulers are listed, but AFAICT the git polling for them does not work, and the restart I did to fix the DB issue does not appear to have fixed this problem either.

On Sat, 16 Mar 2019 19:22:14 +0100, Yngve N. Pettersen <yn...@vivaldi.com> wrote:

Hi again,

An update about one of the issues, the lost database connection.

This seems to affect the GitPoller instance. Other database activity, such as forced_scheduler and triggered jobs, works as normal.

It seems like the GitPoller (maybe all pollers) is not able to recover from a lost database connection; a full shutdown and start is needed to recover. This seems similar to the worker reconnect failures I've mentioned before: that code is not able to recover from a failed worker subscription either, and the connection ends up as a zombie, a connection that is technically live but effectively dead.
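
To make it concrete, here is a minimal, purely illustrative sketch (plain asyncio, not Buildbot's actual code) of what I mean by a zombie: the poll loop keeps running and keeps logging the same error, but nothing in it ever re-establishes the lost connection, so only a full process restart helps.

import asyncio

class FakeDB:
    """Stand-in for a database connection that has silently died."""
    def __init__(self):
        self.alive = True

    async def set_state(self, key, value):
        if not self.alive:
            raise ConnectionError("connection to server was lost")

class ZombiePoller:
    """The poll loop stays alive, but never re-opens the connection."""
    def __init__(self, db):
        self.db = db

    async def poll_forever(self, interval=1.0, cycles=3):
        for _ in range(cycles):
            try:
                # No reconnect step here: once the connection is dead,
                # every later poll fails the same way until a full restart.
                await self.db.set_state('lastRev', 'abc123')
                print("poll ok")
            except Exception as err:
                print("while polling for changes:", err)
            await asyncio.sleep(interval)

async def main():
    poller = ZombiePoller(FakeDB())
    poller.db.alive = False      # simulate the database connection dropping
    await poller.poll_forever()

asyncio.run(main())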

In the case earlier today, I got an exception during the sighup operation:

2019-03-16 13:02:31+0000 [-] while polling for changes
    Traceback (most recent call last):
      File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
        result = g.send(result)
      File "sandbox/lib/python3.6/site-packages/buildbot/changes/gitpoller.py", line 233, in poll
        yield self.setState('lastRev', self.lastRev)
      File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1613, in unwindGenerator
        return _cancellableInlineCallbacks(gen)
      File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1529, in _cancellableInlineCallbacks
        _inlineCallbacks(None, g, status)
    --- <exception caught here> ---
      File "sandbox/lib/python3.6/site-packages/buildbot/changes/gitpoller.py", line 233, in poll
        yield self.setState('lastRev', self.lastRev)
      File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
        result = g.send(result)
      File "sandbox/lib/python3.6/site-packages/buildbot/util/state.py", line 43, in setState
        yield self.master.db.state.setState(self._objectid, key, value)
    builtins.AttributeError: 'NoneType' object has no attribute 'db'

2019-03-16 13:02:31+0000 [-] Caught exception while deactivating ClusteredService(...)
    Traceback (most recent call last):
      File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1475, in gotResult
        _inlineCallbacks(r, g, status)
      File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
        result = g.send(result)
      File "sandbox/lib/python3.6/site-packages/buildbot/util/service.py", line 341, in stopService
        log.err(e, _why="Caught exception while deactivating ClusteredService(%s)" % self.name)
    --- <exception caught here> ---
      File "sandbox/lib/python3.6/site-packages/buildbot/util/service.py", line 339, in stopService
        yield self._unclaimService()
      File "sandbox/lib/python3.6/site-packages/buildbot/changes/base.py", line 51, in _unclaimService
        return self.master.data.updates.trySetChangeSourceMaster(self.serviceid,
    builtins.AttributeError: 'NoneType' object has no attribute 'data'
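
If I read these tracebacks right, self.master on the poller is already None by the time setState() and _unclaimService() run, presumably because the sighup detached the service from the master while a poll was still in flight. A tiny toy example (again not Buildbot code, just the shape of the error as I understand it):

import asyncio

class StateStore:
    async def set_state(self, key, value):
        pass

class Master:
    def __init__(self):
        self.db = StateStore()

class Poller:
    def __init__(self, master):
        self.master = master

    async def poll(self):
        await asyncio.sleep(0.1)                  # pretend we are polling git
        # If the service was deactivated meanwhile, self.master is None,
        # which gives: 'NoneType' object has no attribute 'db'
        await self.master.db.set_state('lastRev', 'abc123')

    def deactivate(self):
        self.master = None                        # detached during reconfig

async def main():
    poller = Poller(Master())
    task = asyncio.create_task(poller.poll())     # a poll is already in flight
    poller.deactivate()                           # sighup happens now
    try:
        await task
    except AttributeError as err:
        print("AttributeError:", err)

asyncio.run(main())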


On Fri, 15 Mar 2019 02:29:30 +0100, Yngve N. Pettersen <yn...@vivaldi.com> wrote:

Hi,

About a month ago we transferred our build system from the old Chromium-developed buildbot system to one based on Buildbot 2.0. In that period we have had a couple of major issues that I thought I'd summarize:

* We have had two crashes of the buildbot master process. I do not know what caused the crashes, and twisted.log does not contain any information about what happened, so my guess is that either the Ubuntu 18 Python 3.6 interpreter crashed, or the Twisted/buildbot code did so in a way that left nothing in the log.

* We have had at least two cases where the master lost its connection to the database server and did not recover; restarting the master was the only option. The probable common factor is that both cases seem to have happened while using the reconfigure/sighup option to update the buildbot configuration. In at least one case the log seemed to include an exception regarding the database connection (the database is a remote PostgreSQL server).

* We have had a couple of cases where the network connection between the master and some of the workers was interrupted. In the major case this led to having to restart the worker instances on all the affected workers; that was the topic of an email to this list a few weeks ago. In that case the logs show that the workers connected correctly, but that the master then failed (due to an exception) to register the workers properly, and failed to cut the connection to a worker (so that it could try to reconnect again) either when the registration failed or later when checking open connections (if it does such a check). The master apparently also kept responding to pings from the worker, and it did not detect that the worker was not really connected when it pinged it while trying to assign it a job.

This reconnect issue is such a major problem and hassle that, when we did a restart of that network connection, we shut down the *master* instance while the network connection was down, and restarted it afterwards.





--
Sincerely,
Yngve N. Pettersen
Vivaldi Technologies AS
