Re: [[email protected]] Another anecdote from the multi-master trenches.

Neil Gilmore Thu, 08 Dec 2016 13:34:32 -0800

Hi Pierre,

Thanks for the specifics. If it comes up again, I'll have a look in the places you describe.

It's just possible that this happened during a reconfiguration, though that's unlikely. I'm still trying to train the users to use my supplied script to reconfigure, as it does a checkconfig and reconfigures all 4 masters.
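
For illustration, a wrapper of that kind could look roughly like the Python sketch below. It is not the actual script: the master directory names are hypothetical, and it only assumes the standard `buildbot checkconfig` and `buildbot reconfig` subcommands. Every config is checked first, so no master is reconfigured if any check fails.

```python
import subprocess

# Hypothetical base directories for the four masters; adjust to your layout.
MASTER_DIRS = ["master-ui", "master-logging", "master-producer", "master-consumer"]


def reconfigure_all(master_dirs, run=subprocess.run):
    """Check every master's config, then reconfigure each master.

    Returns the list of commands that were executed. `run` is injectable so
    the function can be exercised without a Buildbot installation.
    """
    executed = []
    # Phase 1: validate all configs before touching any running master.
    for basedir in master_dirs:
        cmd = ["buildbot", "checkconfig", basedir]
        executed.append(cmd)
        if run(cmd).returncode != 0:
            return executed  # stop early: a bad config must not be loaded
    # Phase 2: every config passed, so reconfigure each master in turn.
    for basedir in master_dirs:
        cmd = ["buildbot", "reconfig", basedir]
        executed.append(cmd)
        run(cmd)
    return executed
```

Calling `reconfigure_all(MASTER_DIRS)` would run the real commands; passing a fake `run` makes it a dry run.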

I think a problem during the health-check period is much more likely. Though none of the masters crashed, it seems more likely that some network problem made it appear so. Next time, I'll try the reconfig first. In this case, restarting wasn't a problem. If nothing else, it corrected some of the non-reconfigurable scheduler troubles, if only temporarily.


Neil Gilmore
grammatech.com

On 12/8/2016 2:25 PM, Pierre Tardy wrote:

Hi Neil,
Thanks for the detailed report. I see little chance that the symptoms you are describing could be explained by a failure of the multi-master messaging.
If the data API showed the builders, that means the builders were seen as attached to 0 masters.
There is a "show old builders" checkbox that could have confirmed that. A builder is considered "old" when it has no master.
The builders REST API has a masterids attribute that will tell you that.
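As a rough sketch, the check could be scripted in Python like this. The base URL is a placeholder, the sample payload is made up, and the response shape (a top-level "builders" list whose entries carry "name" and "masterids") is my assumption of what /api/v2/builders returns; verify it against your own master.

```python
import json
from urllib.request import urlopen


def fetch_builders(base_url):
    """Fetch the builders collection from the REST API (base_url is a placeholder)."""
    with urlopen(base_url.rstrip("/") + "/api/v2/builders") as resp:
        return json.load(resp)


def builders_without_master(payload):
    """Return the names of builders whose `masterids` list is empty or missing."""
    return [
        b["name"]
        for b in payload.get("builders", [])
        if not b.get("masterids")
    ]


# Made-up sample in the assumed response shape, so the filter can be tried offline.
sample = {
    "builders": [
        {"builderid": 1, "name": "build-linux", "masterids": [2]},
        {"builderid": 2, "name": "test-linux", "masterids": []},
    ]
}
print(builders_without_master(sample))  # prints ['test-linux']
```

Against a live master, something like `builders_without_master(fetch_builders("http://localhost:8010"))` would list the "old" builders, with the URL replaced by your own.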
There are several actions that will make the list of builders of a given master go to 0:
- During a reconfiguration. The BotMaster service will write the new list of builders to the database (it could go to 0 in case of a misconfiguration).
- At master shutdown. The master will set itself inactive and unregister from all its builders.
- After the master health-check period. Each master has a timestamp which it needs to update regularly in the database to inform the other masters that it is still alive. During that heartbeat callback, the master also checks whether the other masters have correctly updated their own timestamps. If one has not done so for the previous 10 minutes, this means that it somehow crashed without telling, so the first master to detect this marks the quiet master as disconnected.
In your case, this could explain the behaviour. Maybe there was a time when the consumer and producer masters were unavailable, blocked or off-network. The third master marked them as away; the 2 then came back but did not figure out that they were marked disconnected, and still continued to take buildrequests. I think this is a design bug that we need to fix. A single reconfig would have fixed the situation (no need for a restart).
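To make the health-check scenario concrete, here is a toy model in Python. It is not Buildbot's code: the dictionary stands in for the masters table, the field names and master names are invented, and only the 10-minute window comes from the description above. It shows how a master that comes back and resumes its heartbeat can stay marked as disconnected.

```python
HEARTBEAT_TIMEOUT = 10 * 60  # seconds; the 10-minute window described above


def heartbeat(masters, self_id, now, timeout=HEARTBEAT_TIMEOUT):
    """One heartbeat tick for `self_id` in this toy model.

    Refreshes the caller's own timestamp, then marks every other active
    master whose timestamp is older than `timeout` as inactive.
    Returns the sorted ids of the masters that were just marked inactive.
    """
    masters[self_id]["last_active"] = now
    expired = []
    for mid, row in masters.items():
        if mid != self_id and row["active"] and now - row["last_active"] > timeout:
            row["active"] = False
            expired.append(mid)
    return sorted(expired)


# Hypothetical state: all three masters last reported at t=0.
masters = {
    "producer": {"last_active": 0, "active": True},
    "consumer": {"last_active": 0, "active": True},
    "logging": {"last_active": 0, "active": True},
}

# The logging master ticks after the other two have been silent for 700 s.
expired = heartbeat(masters, "logging", now=700)

# The producer comes back and ticks again, but nothing resets its flag.
heartbeat(masters, "producer", now=800)

print(expired, masters["producer"]["active"])  # prints ['consumer', 'producer'] False
```

In this model the flag is only ever cleared, never set again by the heartbeat itself, which mirrors the suspected design bug; a reconfig would be the step that re-registers the master.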
In any case I would expect that the twisted.log can tell you something. Either you would see some exceptions during a reconfiguration, or you may find a period of time with suspicious activity, which could explain a missed heartbeat timer.
Let us know if you reproduce the problem and whether this advice helped you better understand it.
regards,
Pierre
On Thu, 8 Dec 2016 at 17:45, Neil Gilmore <[email protected]> wrote:
    Hi everyone.

    First, a bit of good news. My current top priority is to make the
    schedulers reconfigurable. Not conceptually difficult, but I wasn't
    well-versed in Python argument passing (which figures prominently in
    this), so I've had a couple aborted tries on that score. I think I've
    got all that sorted out for now. It's just biting us way too badly to
    not be able to reconfigure schedulers.

    Now, the anecdote. As you may remember, we're running 4 masters. 1
    just has the UI and force schedulers. 1 has our overall logging
    system. The other 2 are split between producing builds, and
    consuming them for tests.

    Sometime between when I left yesterday and when the test lead looked
    this morning, the UI stopped displaying the builders for the producer
    and consumer masters. Looking at all the masters, they were running,
    and I didn't immediately see anything suspicious in the logs. Looking
    at the data API, I could see all the builders and workers. The
    workers all showed connected_to being valid, but only the logging
    workers showed anything in configured_on. I restarted our UI master
    and that didn't help. Restarting the producer and consumer seems to
    have solved the problem. I can see the builders in the UI, and
    looking at the workers in the data API, I see that most appear to
    have configured_on set. I have no idea what actually happened. My
    wild conjecture is that the inter-master communication got screwed up
    somehow. Either that or they lost connection to the database (less
    likely, I think; Postgres is pretty stable that way).

    Neil Gilmore
    grammatech.com <http://grammatech.com>
    _______________________________________________
    users mailing list
    [email protected] <mailto:[email protected]>
    https://lists.buildbot.net/mailman/listinfo/users

