Hello all,

Yesterday we had a network event in which some of our buildbot workers lost their network connection to the master for about 10 minutes.

However, although the logs on both the master and the workers show that the workers successfully reconnected within 10 minutes of the network connection being restored, the status displays showed the workers as missing. It eventually took a stop/start or reboot of the workers to get them reconnected, an hour after the network connection was lost.

What I am seeing is that the master log has entries like this when a worker ("arbeider") reconnected:

2019-03-01 11:24:05+0000 [Broker (TLSMemoryBIOProtocol),236,1.2.3.4] worker 'arbeider' attaching from IPv4Address(type='TCP', host='1.2.3.4', port=51630)
2019-03-01 11:24:05+0000 [Broker (TLSMemoryBIOProtocol),236,1.2.3.4] Got duplication connection from 'arbeider' starting arbitration procedure
2019-03-01 11:24:15+0000 [-] Connected worker 'arbeider' ping timed out after 10 seconds
2019-03-01 11:24:15+0000 [-] Old connection for 'arbeider' was lost, accepting new
2019-03-01 11:24:15+0000 [Broker (TLSMemoryBIOProtocol),236,1.2.3.4] Got workerinfo from 'arbeider'
2019-03-01 11:24:15+0000 [Broker (TLSMemoryBIOProtocol),236,1.2.3.4] worker arbeider cannot attach
        Traceback (most recent call last):
          File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1529, in _cancellableInlineCallbacks
            _inlineCallbacks(None, g, status)
          File "sandbox/lib/python3.6/site-packages/twisted/internet/defer.py", line 1416, in _inlineCallbacks
            result = result.throwExceptionIntoGenerator(g)
          File "sandbox/lib/python3.6/site-packages/twisted/python/failure.py", line 491, in throwExceptionIntoGenerator
            return g.throw(self.type, self.value, self.tb)
          File "sandbox/lib/python3.6/site-packages/buildbot/worker/base.py", line 638, in attached
            log.err(e, "worker %s cannot attach" % (self.name,))
        --- <exception caught here> ---
          File "sandbox/lib/python3.6/site-packages/buildbot/worker/base.py", line 636, in attached
            yield AbstractWorker.attached(self, bot)
        builtins.AssertionError:
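
In case anyone wants to check their own master log for the same pattern, below is a rough sketch of how the affected workers can be picked out. It is not part of Buildbot; the function name and the sample data are made up for illustration, and the two patterns are simply copied from the log messages quoted above, so they may not match other Buildbot versions. It only considers lines that start with a timestamp, so the "cannot attach" text inside the traceback is not counted.

```python
import re
from collections import defaultdict

# Both patterns are taken from the log excerpt above (assumption: the
# wording is the same in the log being scanned). Only lines that begin
# with a "YYYY-MM-DD HH:MM:SS" timestamp are considered.
TIMESTAMP = r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\S* "
DUPLICATE = re.compile(TIMESTAMP + r".*Got duplication connection from '([^']+)'")
CANNOT_ATTACH = re.compile(TIMESTAMP + r".*worker (\S+) cannot attach")


def summarize_attach_failures(lines):
    """Count, per worker, duplicate-connection events and failed attaches.

    Hypothetical helper for illustration only.
    """
    summary = defaultdict(lambda: {"duplicate": 0, "cannot_attach": 0})
    for line in lines:
        match = DUPLICATE.match(line)
        if match:
            summary[match.group(1)]["duplicate"] += 1
        match = CANNOT_ATTACH.match(line)
        if match:
            summary[match.group(1)]["cannot_attach"] += 1
    return dict(summary)


# Sample input: a few lines from the excerpt above, including one
# traceback line that must not be counted as a failed attach.
SAMPLE_LOG = [
    "2019-03-01 11:24:05+0000 [Broker (TLSMemoryBIOProtocol),236,1.2.3.4] "
    "Got duplication connection from 'arbeider' starting arbitration procedure",
    "2019-03-01 11:24:15+0000 [-] Old connection for 'arbeider' was lost, accepting new",
    "2019-03-01 11:24:15+0000 [Broker (TLSMemoryBIOProtocol),236,1.2.3.4] "
    "worker arbeider cannot attach",
    '            log.err(e, "worker %s cannot attach" % (self.name,))',
]

if __name__ == "__main__":
    for worker, counts in sorted(summarize_attach_failures(SAMPLE_LOG).items()):
        print(worker, counts)
```

To run it against a real log, the lines of the master's log file (twistd.log in a default setup, which is an assumption about the installation) can be passed in place of SAMPLE_LOG, for example by opening the file and passing the file object to the function.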


Does anyone have any ideas about why the reconnects failed?

In one case, a job was started on one of the workers (which was shown as "online"), and the master just kept showing the task as "Pinging worker" for 20+ minutes, until we stopped the task (and even that took a while).

If this happens every time the network connection is lost (which admittedly does not happen that frequently, but could happen during network maintenance), it is going to be a serious inconvenience, since some of the workers need special handling when being restarted.


Relevant information about the configuration:

* Buildbot v2.0.1

* The PB connections are TLS protected, using a workaround based on the one from <https://github.com/buildbot/buildbot/issues/2866>

* Workers run Python 2

* The master is running the current Twisted version

* The workers are running Twisted 18.7.0 (fixed version, due to installation problems with the current version; on Windows it goes looking for a compiler and does not find one, even when one is installed)


--
Sincerely,
Yngve N. Pettersen
Vivaldi Technologies AS