Hi Honza,

On Tue, Sep 3, 2019 at 7:20 PM Jan Friesse <[email protected]> wrote:
> Jeevan,
>
> Jeevan Patnaik wrote:
> > Hi Honza,
> >
> > Thanks for the response.
> >
> > If you increase token timeout even higher
> > (let's say 12sec) is it still appearing or not?
> > - I will try this.
> >
> > If you try to run it without RT priority, does it help?
> > - Can RT priority affect the process scheduling negatively?
>
> Actually we've had a report that it can, because it blocks the kernel thread
> which is responsible for sending/receiving packets. I was not able to
> reproduce this behavior myself, and it seemed to be kernel specific, but
> the resolution was that behavior without RT was better.

Thanks. I will check this. Also, in theory, can blocking the kernel thread
responsible for sending/receiving packets affect scheduling of the corosync
process (with RT priority)?

> > I don't see any irregular IO activity during the time when we got these
> > errors. Also, swap usage and swap IO are not much at all, only in KBs.
> > We have vm.swappiness set to 1. So, I don't think swap is causing any
> > issue.
> >
> > However, I see slight network activity during the issue times (what I
> > understand is that network activity should not affect the CPU jobs as
> > long as CPU load is normal and there is no blocking IO).
>
> It shouldn't
>
> > I am thinking of debugging in the following way, unless there is an
> > option to restart corosync in debugger mode:
>
> You can turn on debug messages (debug: on in logging section of
> corosync.conf).

Yes, I found this later. Will try debugging. Hoping it will help in finding
where the problem is.

> > -> Run a process strace in background on the corosync process and
> > redirect log to an output
> > -> Add a frequent cron job to rotate the output log (delete old ones),
> > unless there is a flag file to keep the old log
> > -> Add another frequent cron job to check corosync log for the specific
> > token timeout error and add the above mentioned flag file to not delete
> > the strace output.
> >
> > Don't know if the above process is safe to run on a production server,
> > without creating much impact on the system resources. Need to check.
>
> Yep. Hopefully you find something.
>
> Regards,
> Honza
>
> > On Mon, Sep 2, 2019 at 5:50 PM Jan Friesse <[email protected]> wrote:
> >
> >> Jeevan,
> >>
> >> Jeevan Patnaik wrote:
> >>> Hi,
> >>>
> >>> Also, both are physical machines.
> >>>
> >>> On Fri, Aug 30, 2019 at 7:23 PM Jeevan Patnaik <[email protected]> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> We see the following messages almost everyday in our 2 node cluster and
> >>>> resources gets migrated when it happens:
> >>>>
> >>>> [16187] node1 corosyncwarning [MAIN ] Corosync main process was not scheduled for 2889.8477 ms (threshold is 800.0000 ms). Consider token timeout increase.
> >>>> [16187] node1 corosyncnotice [TOTEM ] c.
> >>>> [16187] node1 corosyncnotice [TOTEM ] A new membership (192.168.0.1:1268) was formed. Members joined: 2 left: 2
> >>>> [16187] node1 corosyncnotice [TOTEM ] Failed to receive the leave message. failed: 2
> >>>>
> >>>> After setting the token timeout to 6000ms, at least the "Failed to
> >>>> receive the leave message" doesn't appear anymore. But we see corosync
> >>>> timeout errors:
> >>>> [16395] node1 corosyncwarning [MAIN ] Corosync main process was not scheduled for 6660.9043 ms (threshold is 4800.0000 ms). Consider token timeout increase.
> >>>>
> >>>> 1. Why is the set timeout not in effect? It's 4800ms instead of 6000ms.
> >>
> >> It is in effect. Threshold for pause detector is set as 0.8 * token
> >> timeout.
> >>
> >>>> 2. How to fix this? We have not much load on the nodes, the corosync is
> >>>> already running with RT priority.
> >>
> >> There must be something wrong. If you increase token timeout even higher
> >> (let's say 12sec) is it still appearing or not? If so, isn't the machine
> >> swapping (for example) or waiting for IO?
> >> If you try to run it without RT priority, does it help?
> >>
> >> Regards,
> >> Honza
> >>
> >>>> The following are the details of OS and packages:
> >>>>
> >>>> Kernel: 3.10.0-957.el7.x86_64
> >>>> OS: Oracle Linux Server 7.6
> >>>>
> >>>> corosync-2.4.3-4.el7.x86_64
> >>>> corosynclib-2.4.3-4.el7.x86_64
> >>>>
> >>>> Thanks in advance.
> >>>>
> >>>> --
> >>>> Regards,
> >>>> Jeevan.
> >>>
> >>> _______________________________________________
> >>> Manage your subscription:
> >>> https://lists.clusterlabs.org/mailman/listinfo/users
> >>>
> >>> ClusterLabs home: https://www.clusterlabs.org/
> >
> > Regards,
> > Jeevan.

Regards,
Jeevan
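
For reference, the relation Honza gives in the quoted reply (pause detector threshold = 0.8 * token timeout) can be checked against the two warning lines quoted in this thread with a small script. This is only a rough sketch: the regular expression is an assumption based on those two log lines and may not match other corosync versions, and the names pause_threshold_ms and parse_pause_warning are made up for illustration.

```python
import re

# Pattern assumed from the two warning lines quoted in this thread;
# other corosync versions may word the message differently.
PAUSE_RE = re.compile(
    r"not scheduled for\s+(?P<pause>[\d.]+)\s*ms\s*"
    r"\(threshold is\s+(?P<threshold>[\d.]+)\s*ms\)"
)

# Per the quoted reply: threshold = 0.8 * token timeout.
THRESHOLD_FACTOR = 0.8


def pause_threshold_ms(token_ms):
    """Pause-detector threshold (ms) for a given token timeout (ms)."""
    return THRESHOLD_FACTOR * token_ms


def parse_pause_warning(line):
    """Return (pause_ms, threshold_ms, implied_token_ms), or None if no match."""
    m = PAUSE_RE.search(line)
    if m is None:
        return None
    pause = float(m.group("pause"))
    threshold = float(m.group("threshold"))
    # Invert the relation to recover the token timeout that was in effect.
    return pause, threshold, threshold / THRESHOLD_FACTOR


if __name__ == "__main__":
    sample = (
        "[16395] node1 corosyncwarning [MAIN ] Corosync main process was not "
        "scheduled for 6660.9043 ms (threshold is 4800.0000 ms). "
        "Consider token timeout increase."
    )
    print(pause_threshold_ms(6000))
    print(parse_pause_warning(sample))
```

With the second warning line, the implied token timeout comes out at about 6000 ms, which matches the configured value. The first warning (threshold 800 ms) is likewise consistent with a token timeout of 1000 ms.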
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
