Hi again,

A further update on that DB issue.

It seems that the last such incident caused two of the new branch schedulers I was adding not to be registered properly. The schedulers are listed, but AFAICT the git polling for them does not work, and the restart I did to fix the DB issue does not appear to have fixed this problem either.

On Sat, 16 Mar 2019 19:22:14 +0100, Yngve N. Pettersen <yn...@vivaldi.com> wrote:

Hi again,

An update about one of the issues, the lost database connection.

This seems to affect the GitPoller instance. Other database activity, such as forced_scheduler and triggered jobs, works as normal.

It seems like the GitPoller (maybe all pollers) is not able to recover from a lost database connection; a full shutdown and start is needed to recover. This seems similar to the worker reconnect failures I've mentioned before: that code is not able to recover from a failed worker subscription either, and the connection ends up as a zombie, a connection that is technically live but effectively dead.
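
To make it concrete, here is a minimal, purely illustrative sketch (plain asyncio, not Buildbot's actual code) of what I mean by a zombie: the poll loop keeps running and keeps logging the same error, but nothing in it ever re-establishes the lost connection, so only a full process restart helps.

import asyncio

class FakeDB:
    """Stand-in for a database connection that has silently died."""
    def __init__(self):
        self.alive = True

    async def set_state(self, key, value):
        if not self.alive:
            raise ConnectionError("connection to server was lost")

class ZombiePoller:
    """The poll loop stays alive, but never re-opens the connection."""
    def __init__(self, db):
        self.db = db

    async def poll_forever(self, interval=1.0, cycles=3):
        for _ in range(cycles):
            try:
                # No reconnect step here: once the connection is dead,
                # every later poll fails the same way until a full restart.
                await self.db.set_state('lastRev', 'abc123')
                print("poll ok")
            except Exception as err:
                print("while polling for changes:", err)
            await asyncio.sleep(interval)

async def main():
    poller = ZombiePoller(FakeDB())
    poller.db.alive = False      # simulate the database connection dropping
    await poller.poll_forever()

asyncio.run(main())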

In the case earlier today, I got an exception during the sighup operation:

2019-03-16 13:02:31+0000 [-] while polling for changes
    Traceback (most recent call last):
      File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
        result = g.send(result)
      File "sandbox/lib/python3.6/site-packages/buildbot/changes/gitpoller.py", line 233, in poll
        yield self.setState('lastRev', self.lastRev)
      File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1613, in unwindGenerator
        return _cancellableInlineCallbacks(gen)
      File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1529, in _cancellableInlineCallbacks
        _inlineCallbacks(None, g, status)
    --- <exception caught here> ---
      File "sandbox/lib/python3.6/site-packages/buildbot/changes/gitpoller.py", line 233, in poll
        yield self.setState('lastRev', self.lastRev)
      File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
        result = g.send(result)
      File "sandbox/lib/python3.6/site-packages/buildbot/util/state.py", line 43, in setState
        yield self.master.db.state.setState(self._objectid, key, value)
    builtins.AttributeError: 'NoneType' object has no attribute 'db'

2019-03-16 13:02:31+0000 [-] Caught exception while deactivating ClusteredService(...)
    Traceback (most recent call last):
      File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1475, in gotResult
        _inlineCallbacks(r, g, status)
      File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
        result = g.send(result)
      File "sandbox/lib/python3.6/site-packages/buildbot/util/service.py", line 341, in stopService
        log.err(e, _why="Caught exception while deactivating ClusteredService(%s)" % self.name)
    --- <exception caught here> ---
      File "sandbox/lib/python3.6/site-packages/buildbot/util/service.py", line 339, in stopService
        yield self._unclaimService()
      File "sandbox/lib/python3.6/site-packages/buildbot/changes/base.py", line 51, in _unclaimService
        return self.master.data.updates.trySetChangeSourceMaster(self.serviceid,
    builtins.AttributeError: 'NoneType' object has no attribute 'data'
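
If I read these tracebacks right, self.master on the poller is already None by the time setState() and _unclaimService() run, presumably because the sighup detached the service from the master while a poll was still in flight. A tiny toy example (again not Buildbot code, just the shape of the error as I understand it):

import asyncio

class StateStore:
    async def set_state(self, key, value):
        pass

class Master:
    def __init__(self):
        self.db = StateStore()

class Poller:
    def __init__(self, master):
        self.master = master

    async def poll(self):
        await asyncio.sleep(0.1)                  # pretend we are polling git
        # If the service was deactivated meanwhile, self.master is None,
        # which gives: 'NoneType' object has no attribute 'db'
        await self.master.db.set_state('lastRev', 'abc123')

    def deactivate(self):
        self.master = None                        # detached during reconfig

async def main():
    poller = Poller(Master())
    task = asyncio.create_task(poller.poll())     # a poll is already in flight
    poller.deactivate()                           # sighup happens now
    try:
        await task
    except AttributeError as err:
        print("AttributeError:", err)

asyncio.run(main())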


On Fri, 15 Mar 2019 02:29:30 +0100, Yngve N. Pettersen <yn...@vivaldi.com> wrote:

Hi,

About a month ago we transferred our build system from the old Chromium-developed buildbot system to one based on Buildbot 2.0. In that period we have had a couple of major issues that I thought I'd summarize:

* We have had two crashes of the buildbot master process. I do not know what caused the crashes, and twisted.log does not contain any information about what happened, so my guess is that either the Ubuntu 18 Python 3.6 interpreter crashed, or the Twisted/buildbot code did so in a way that left nothing in the log.

* We have had at least two cases where the master lost its connection to the database server and did not recover; restarting the master was the only option. The probable common factor is that both cases seem to have happened while using the reconfigure/sighup option to update the buildbot configuration. In at least one case the log seemed to include an exception regarding the database connection (the database is a remote PostgreSQL server).

* We have had a couple of cases where the network connection between the master and some of the workers was interrupted. In the major case this led to having to restart the worker instances on all the affected workers; that was the topic of an email to this list a few weeks ago. In that case the logs show that the workers connected correctly, but that the master then failed (due to an exception) to register the workers properly, and failed to cut the connection to a worker (so that it could try to reconnect again) either when the registration failed or later when checking open connections (if it does such a check). The master apparently also kept responding to pings from the worker, and it did not detect that the worker was not really connected when it pinged it while trying to assign it a job.

This reconnect issue is such a major problem and hassle that, when we did a restart of that network connection, we shut down the *master* instance while the network connection was down, and restarted it afterwards.





--
Sincerely,
Yngve N. Pettersen
Vivaldi Technologies AS
