What is the "unstable behavior" of DUCC 1.1.0 when running it alone?
All kinds of bad things can happen if you run 2 DUCCs on the same set of
machines. I'm willing to help, but am not sure I can if you are running 2
DUCCs - that's fairly complex. Instead I urge you to run a single DUCC 1.1.0
and let's try to fix what's wrong with running it alone.

Lou.

On Sun, Dec 7, 2014 at 11:40 PM, reshu.agarwal <[email protected]> wrote:

> Yes, I am running both at same time. But I tried only 1.1.0 version to
> check the performance. But, due to unstable behaviour I had to run DUCC
> 1.0.0 and DUCC 1.1.0 at the same time. I am running DUCC 1.0.0 for running
> Jobs and DUCC 1.1.0 for testing purpose.
>
> Do I need to increase heartbeats timing to greater than 60 sec?
>
> Reshu.
>
> On 12/05/2014 05:57 PM, Lou DeGenaro wrote:
>
>> You can fetch the latest code containing the bug fix from SVN and build
>> your own snapshot. However, this bug is of minimal impact so there is no
>> pressing need to do so.
>>
>> Are you trying to run 1.0 and 1.1 at the same time? This can be very
>> tricky. You need to be sure of no overlaps. I highly recommend that you
>> pick one or the other.
>>
>> Lou.
>>
>> On Fri, Dec 5, 2014 at 6:31 AM, reshu.agarwal <[email protected]>
>> wrote:
>>
>>> Dear Lou,
>>>
>>> Thanks for confirming this.
>>>
>>> Is Bug fixing version available for use?
>>>
>>> What can be the reason of delaying in heartbeats? Because machines were
>>> not able to send heartbeats within 60 seconds so node gets down in DUCC
>>> 1.1.0 but DUCC 1.0.0 is working fine on same machines.
>>>
>>> My master node is physical and client is on virtual. Can this be a reason
>>> for delaying in heartbeats as well as increase of processing time of job?
>>>
>>> Thanks.
>>>
>>> Reshu.
>>>
>>> On 12/05/2014 04:45 PM, Lou DeGenaro wrote:
>>>
>>>> Each node has a DUCC Agent daemon that sends heartbeats.
>>>>
>>>> There was a bug discovered after the release of 1.1 whereby the share
>>>> calculation is incorrect (a rounding up problem that you observe). The
>>>> impact of this bug should be minimal. The bug has been fixed.
>>>>
>>>> Lou.
>>>>
>>>> On Fri, Dec 5, 2014 at 12:41 AM, reshu.agarwal <[email protected]>
>>>> wrote:
>>>>
>>>>> Lou,
>>>>>
>>>>> How can a node send heartbeats in DUCC? If you can tell me this I will
>>>>> be able to identify problem of down in my nodes.
>>>>>
>>>>> The other problem which I am facing is:
>>>>>
>>>>> Memory(GB):total  : 31
>>>>> Memory(GB):usable : 16
>>>>> Shares:total      : 8
>>>>> Shares:inuse      : 9
>>>>>
>>>>> Means actual RAM which is available is 30 GB so shares available should
>>>>> be 15 (2GB per share) but it is showing Memory(GB):usable : 16 and
>>>>> Shares:total : 8.
>>>>>
>>>>> In DUCC 1.0.0, I don't face this problem.
>>>>>
>>>>> Please explain me its reason.
>>>>>
>>>>> Reshu.
>>>>>
>>>>> On 12/04/2014 06:42 PM, Lou DeGenaro wrote:
>>>>>
>>>>>> Which of these are not understandable? If you hover over the column
>>>>>> heading a little more explanation is given (though not much).
>>>>>>
>>>>>> For example, if you hover over Heartbeat(last) you'll see "The elapsed
>>>>>> time (in seconds) since the last heartbeat". This should usually be
>>>>>> around 60 seconds. On the system I'm looking at live presently, I see
>>>>>> a range from 9 to 66. If the number gets too large, the DUCC system
>>>>>> will consider the node down. As best as I can tell, it looks like your
>>>>>> numbers are 58 & 59 which is perfect.
>>>>>>
>>>>>> Lou.
>>>>>>
>>>>>> On Thu, Dec 4, 2014 at 7:41 AM, reshu.agarwal <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Please look at these stats:
>>>>>>>
>>>>>>> Status  Name   Memory(GB)     Swap(GB)     Alien  Shares        Heartbeat
>>>>>>>                usable  total  inuse  free  PIDs   total  inuse  (last)
>>>>>>> Total          58      70     0      29    9      29     3
>>>>>>> up      S144   36      39     0      20    8      18     2      59
>>>>>>> down    S143   22      31     0      9     1      11     11     58
>>>>>>>
>>>>>>> I am not able to understand these stats.
>>>>>>>
>>>>>>> Please help.
>>>>>>>
>>>>>>> Reshu.
