Hi Pierre,

We've had notify_on_missing configured for some time -- certainly it was present when I restarted the masters on Tues. I also restarted the master that has those workers. Shouldn't all the workers in master.cfg have attached then?

This could cause us some trouble. Here's how we start workers:

We have a builder consisting of ShellCommands that does an ssh login into the worker machine, sees if the worker is running, and runs it if it is not. This allows us to make certain that all the workers that are in master.cfg also have matching worker processes on the worker machines. This builder runs every hour. It is entirely possible that a worker machine without a worker process could sit for more than an hour before having its worker process started.
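
A minimal sketch of the kind of check-and-start command described above, written in Python because master.cfg is Python. The host name and base directory are hypothetical placeholders, and checking the twistd.pid file in the worker's base directory is an assumption for illustration, not necessarily how the actual builder tests for a running worker:

```python
import shlex


def ensure_worker_command(host, basedir):
    """Build an ssh command that starts a worker only if none is running.

    `host` and `basedir` are hypothetical placeholders. The remote shell
    line assumes the worker keeps its pid in <basedir>/twistd.pid and that
    `buildbot-worker` is on the PATH of the worker machine.
    """
    base = shlex.quote(basedir)
    pidfile = shlex.quote(basedir + "/twistd.pid")
    # `kill -0` only tests whether the pid is alive. It fails when the pid
    # file is missing or stale, so `||` then runs the start command.
    remote = (
        'kill -0 "$(cat {pid} 2>/dev/null)" 2>/dev/null '
        "|| buildbot-worker start {base}"
    ).format(pid=pidfile, base=base)
    return ["ssh", host, remote]


print(ensure_worker_command("worker1.example.com", "/home/bb/worker"))
```

In master.cfg, the returned list would be passed as the command of a ShellCommand step, one step per worker host.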

What you guys seem to be telling me is that if I were to stop a worker process and let more than an hour go by, that worker would never, ever have its builds run. Even though the worker is attached, and the builds are queued.

That sounds pretty bad to me. Am I understanding correctly? Or can I wait until a worker process is running, then reconfigure without that worker in master.cfg (and get unknown worker errors, I suppose), then reconfigure with the worker back in master.cfg, so that it attempts to attach before the timeout?

Let me emphasize that the thing that brought this to my attention was adding a new worker and its builders to master.cfg. The worker process would have been started sometime after that by the builder that starts worker processes.

Neil Gilmore
grammatech.com

On 2/3/2017 3:17 PM, Pierre Tardy wrote:
Hi Neil,

The timer starts when the worker is first configured, but only if notify_on_missing is configured.

That may be why you do not see the bug with older workers.

Pierre

On Fri, Feb 3, 2017 at 21:59, Neil Gilmore <[email protected]> wrote:

    Hi Andrej,

    Thanks for the reply.

    I don't see missing_timeout in our master.cfg anywhere. But I do
    see this:

    c['workers'] = [Worker(host, '<password>',
    notify_on_missing=bots_email[host]) for host in bots_list]

    Let's see if I understood you. The default missing_timeout is 60
    minutes. If I start the master and wait 60 minutes, then start the
    worker, the worker won't attach?

    In our case, we're not even adding the worker to master.cfg until well
    after that 60 minutes (a couple days after). We're adding new workers.
    Do you figure this could be the same problem?

    What happens with a default notify_on_missing? I figure I can try the
    patch in your PR when we restart the masters.

    Neil Gilmore
    [email protected]

    On 2/3/2017 2:42 PM, Andrej Rode wrote:
    > Hi Neil,
    >
    >> 2017-02-03T12:39:09-0500 [Broker,28906,10.233.216.43] worker '<name>' attaching from IPv4Address(TCP, '<ip>', 35642)
    >> 2017-02-03T12:39:09-0500 [Broker,28906,10.233.216.43] Got workerinfo from '<name>'
    >> 2017-02-03T12:39:09-0500 [-] bot attached
    >> 2017-02-03T12:39:09-0500 [-] worker <name> cannot attach
    >>          Traceback (most recent call last):
    >>          Failure: twisted.internet.error.AlreadyCalled: Tried to cancel an already-called event.
    > I had the same problems, but with a single-master setup. By any chance
    > are you using a non-default `missing_timeout` and/or `notify_on_missing`
    > on your workers?
    >
    > For my issue I have a PR up [0], and now I can detach and attach workers
    > as I like. But it is still not clear why we even run into problems here.
    >
    > I figured out that attaching a worker more than `missing_timeout`
    > after a master start results in this problem on my setup.
    > (Default `missing_timeout` is 60 minutes.)
    >
    > Cheers,
    > Andrej
    >
    > [0] https://github.com/buildbot/buildbot/pull/2708
    > _______________________________________________
    > users mailing list
    > [email protected]
    > https://lists.buildbot.net/mailman/listinfo/users
