Are the machines where your DUCC daemons and/or agents run extremely busy? Otherwise, I would expect the default heartbeat configuration to work as is.
Lou.

On Wed, Dec 10, 2014 at 4:06 AM, reshu.agarwal <[email protected]> wrote:

> Dear Lou,
>
> My problem has been resolved. I just increased the max time of receiving
> Heartbeats of agents.
>
> The "unstable behavior" of DUCC 1.1.0 in my case was the node up and down
> problem in both cases either on single instance of DUCC 1.1.0
> or running both ducc versions simultaneously.
>
> And Now, I am able to run DUCC 1.1.0 alone. So, Only DUCC 1.1.0 is
> configured.
>
> Thanks for your help. :-)
>
> Reshu.
>
> On 12/08/2014 04:24 PM, Lou DeGenaro wrote:
>
>> What is the "unstable behavior" of DUCC 1.1.0 when running it alone?
>>
>> All kinds of bad things can happen if you run 2 DUCCs on the same set of
>> machines. I'm willing to help, but am not sure I can if you are running 2
>> DUCCs - that's fairly complex. Instead I urge you to run a single DUCC
>> 1.1.0 and let's try to fix what's wrong with running it alone.
>>
>> Lou.
>>
>> On Sun, Dec 7, 2014 at 11:40 PM, reshu.agarwal <[email protected]> wrote:
>>
>>> Yes, I am running both at same time. But I tried only 1.1.0 version to
>>> check the performance. But, due to unstable behaviour I had to run DUCC
>>> 1.0.0 and DUCC 1.1.0 at the same time. I am running DUCC 1.0.0 for
>>> running Jobs and DUCC 1.1.0 for testing purpose.
>>>
>>> Do I need to increase heartbeats timing to greater than to 60 sec?
>>>
>>> Reshu.
>>>
>>> On 12/05/2014 05:57 PM, Lou DeGenaro wrote:
>>>
>>>> You can fetch the latest code containing the bug fix from SVN and build
>>>> your own snapshot. However, this bug is of minimal impact so there is
>>>> no pressing need to do so.
>>>>
>>>> Are you trying to run 1.0 and 1.1 at the same time? This can be very
>>>> tricky. You need to be sure of no overlaps. I highly recommend that
>>>> you pick one or the other.
>>>>
>>>> Lou.
>>>>
>>>> On Fri, Dec 5, 2014 at 6:31 AM, reshu.agarwal <[email protected]> wrote:
>>>>
>>>>> Dear Lou,
>>>>>
>>>>> Thanks for confirming this.
>>>>>
>>>>> Is Bug fixing version available for use?
>>>>>
>>>>> What can be the reason of delaying in heartbeats? Because machines were
>>>>> not able to send heartbeats with in 60 seconds so node gets down in
>>>>> DUCC 1.1.0 but DUCC 1.0.0 is working fine on same machines.
>>>>>
>>>>> My master node is physical and client is on virtual. Can this be a
>>>>> reason for delaying in heartbeats as well as increase of processing
>>>>> time of job?
>>>>>
>>>>> Thanks.
>>>>>
>>>>> Reshu.
>>>>>
>>>>> On 12/05/2014 04:45 PM, Lou DeGenaro wrote:
>>>>>
>>>>>> Each node has a DUCC Agent daemon that sends heartbeats.
>>>>>>
>>>>>> There was a bug discovered after the release of 1.1 whereby the share
>>>>>> calculation is incorrect (a rounding up problem that you observe).
>>>>>> The impact of this bug should be minimal. The bug has been fixed.
>>>>>>
>>>>>> Lou.
>>>>>>
>>>>>> On Fri, Dec 5, 2014 at 12:41 AM, reshu.agarwal <[email protected]> wrote:
>>>>>>
>>>>>>> Lou,
>>>>>>>
>>>>>>> How can a node send heartbeats in DUCC? If you can tell me this I
>>>>>>> will be able to identify problem of down in my nodes.
>>>>>>>
>>>>>>> The other problem which I am facing is:
>>>>>>>
>>>>>>> Memory(GB):total  : 31
>>>>>>> Memory(GB):usable : 16
>>>>>>> Shares:total      : 8
>>>>>>> Shares:inuse      : 9
>>>>>>>
>>>>>>> Means actual RAM which is available is 30 GB so shares available
>>>>>>> should be 15(2GB per share) but it is showing Memory(GB):usable : 16
>>>>>>> and Shares:total : 8.
>>>>>>>
>>>>>>> In DUCC 1.0.0, I don't face this problem.
>>>>>>>
>>>>>>> Please explain me its reason.
>>>>>>>
>>>>>>> Reshu.
>>>>>>>
>>>>>>> On 12/04/2014 06:42 PM, Lou DeGenaro wrote:
>>>>>>>
>>>>>>>> Which of these are not understandable? If you hover over the column
>>>>>>>> heading a little more explanation is given (though not much).
>>>>>>>>
>>>>>>>> For example, If you hover over Heartbeat(last) you'll see "The
>>>>>>>> elapsed time (in seconds) since the last heartbeat". This should
>>>>>>>> usually be around 60 seconds. On the system I'm looking at live
>>>>>>>> presently, I see a range from 9 to 66. If the number gets too large,
>>>>>>>> the DUCC system will consider the node down. As best as I can tell,
>>>>>>>> it looks like your numbers are 58 & 59 which is perfect.
>>>>>>>>
>>>>>>>> Lou.
>>>>>>>>
>>>>>>>> On Thu, Dec 4, 2014 at 7:41 AM, reshu.agarwal <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Please look this stats:
>>>>>>>>>
>>>>>>>>> Status  Name   Memory(GB):usable  Memory(GB):total  Swap(GB):inuse  Swap(GB):free  Alien PIDs  Shares:total  Shares:inuse  Heartbeat(last)
>>>>>>>>> Total          58                 70                0               29             9           29            3
>>>>>>>>> up      S144   36                 39                0               20             8           18            2             59
>>>>>>>>> down    S143   22                 31                0               9              1           11            11            58
>>>>>>>>>
>>>>>>>>> I am not able to understand this stats.
>>>>>>>>>
>>>>>>>>> Please help.
>>>>>>>>>
>>>>>>>>> Reshu.
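The two checks discussed in this thread (heartbeat age and share arithmetic) can be sketched in a few lines of Python. This is only an illustration, not DUCC's actual logic: the function names and the 180-second cutoff are hypothetical, since the thread says only that heartbeats normally arrive about every 60 seconds and that a node is marked down when the age gets "too large". The 2 GB share size is the figure quoted in the thread.

```python
# Hypothetical cutoff, chosen for illustration only. The thread says
# heartbeats are usually around 60 seconds old and that a node is
# considered down when the age gets "too large"; it gives no exact limit.
HEARTBEAT_LIMIT_SECONDS = 180

# Share size as stated in the thread ("2GB per share").
SHARE_SIZE_GB = 2


def heartbeat_is_stale(age_seconds, limit_seconds=HEARTBEAT_LIMIT_SECONDS):
    """Return True when the last heartbeat is older than the cutoff."""
    return age_seconds > limit_seconds


def expected_shares(usable_gb, share_size_gb=SHARE_SIZE_GB):
    """Whole shares that fit in usable memory, rounding down."""
    return usable_gb // share_size_gb


def shares_overcommitted(shares_inuse, usable_gb, share_size_gb=SHARE_SIZE_GB):
    """Return True when more shares are in use than usable memory allows."""
    return shares_inuse > expected_shares(usable_gb, share_size_gb)


if __name__ == "__main__":
    # Heartbeat ages reported for the two nodes in the thread.
    for name, age in (("S144", 59), ("S143", 58)):
        print(name, "stale" if heartbeat_is_stale(age) else "ok")
    # Memory(GB):usable : 16 and Shares:inuse : 9 from the thread.
    print("expected shares:", expected_shares(16))
    print("overcommitted:", shares_overcommitted(9, 16))
```

With the figures from the thread, 16 GB usable at 2 GB per share gives 8 shares, which matches the reported Shares:total, and 9 shares in use exceeds that, which is consistent with the rounding-up bug Lou describes. Both reported heartbeat ages (58 and 59 seconds) fall well under any cutoff above 60 seconds.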
