Re: Curator barriers missing watch events

Brian Phillips Tue, 25 Mar 2014 17:40:20 -0700

Yes, there are two barrier sessions, but with different barrier instances and 
different barrier paths. ):


Sent from my iPhone

> On Mar 25, 2014, at 8:34 PM, "Jordan Zimmerman" <[email protected]> 
> wrote:
> 
> Are you saying there are two barrier sessions? The first one works, but the 
> second doesn’t? Are you re-using the same path? I wonder if there are znodes 
> left in the path or something. Before running the second barrier session, 
> double check that the path is empty (do a getChildren on it). If it’s not 
> empty that could be the problem.
> 
> -JZ
> 
> 
> From: Brian Phillips [email protected]
> Reply: [email protected] [email protected]
> Date: March 25, 2014 at 6:10:46 PM
> To: [email protected] [email protected]
> Subject:  Re: Curator barriers missing watch events 
> 
>> I’ve tried, but it seems to be timing specific. It’s in a rather large, 
>> complicated program, where the first barrier always works but the one at the 
>> end of the program usually gets stuck. I’ve spent all day trying to make 
>> sense of it, as my project really needs it to work.
>> 
>> I’d like to be able to figure out if the zookeeper server is actually 
>> sending my clients the watch events. 
>> 
>> _B
>> 
>> 
>> On Mar 25, 2014, at 6:53 PM, "Jordan Zimmerman" <[email protected]> 
>> wrote:
>> 
>>> There’s no way you can distill your usage into a test?
>>> 
>>> -JZ
>>> 
>>> 
>>> From: Brian Phillips [email protected]
>>> Reply: [email protected] [email protected]
>>> Date: March 25, 2014 at 5:51:37 PM
>>> To: [email protected] [email protected]
>>> Subject:  Re: Curator barriers missing watch events
>>> 
>>>> Hmm, I made that change, but it didn't seem to help. The first program 
>>>> made it to the barrier enter, then the second program entered, exited, and 
>>>> the first program never left the barrier.
>>>> 
>>>> The second program got a node created event, but the first program never 
>>>> got any event from its watcher.
>>>> 
>>>> I appreciate the help! Must be something else.
>>>> 
>>>> _B
>>>> 
>>>> On Mar 25, 2014, at 6:28 PM, "Jordan Zimmerman" 
>>>> <[email protected]> wrote:
>>>> 
>>>>> Look at line 313 and line 331. The noarg version of enter() causes 
>>>>> internalEnter() to call wait even though the watcher may have already 
>>>>> notified. I believe line 331 should be:
>>>>> 
>>>>> else if ( !hasBeenNotified.get() )
>>>>> 
>>>>> -JZ
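
A minimal sketch of the guard suggested above, written in plain Java with no Curator dependency. It is not the Curator source: `NotifiedGate`, `signal()` and `await()` are hypothetical names chosen for illustration, and only the `hasBeenNotified` flag is taken from the thread. It assumes the watcher sets the flag before it notifies, and shows that checking the flag before each wait() keeps a notification that arrives early from being lost.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for the barrier's wait/notify logic; not Curator code. */
final class NotifiedGate {
    // Set by the watcher thread before it calls notifyAll().
    private final AtomicBoolean hasBeenNotified = new AtomicBoolean(false);

    /** Would be called from the watcher callback when the ready node appears. */
    synchronized void signal() {
        hasBeenNotified.set(true);
        notifyAll();
    }

    /**
     * Blocks until signal() has been called or timeoutMs elapses.
     * Returns true if the signal was seen, false on timeout.
     */
    synchronized boolean await(long timeoutMs) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        // Re-check the flag before every wait: if signal() already ran,
        // this loop body is skipped and the caller never blocks.
        while (!hasBeenNotified.get()) {
            long remaining = deadline - System.currentTimeMillis();
            if (remaining <= 0) {
                return false;
            }
            wait(remaining);
        }
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        NotifiedGate gate = new NotifiedGate();
        gate.signal();                         // watcher fires before the waiter arrives
        System.out.println(gate.await(1000));  // prints "true" without blocking
    }
}
```

Without the flag check, a waiter that calls wait() after the notification has already been delivered would block until the timeout, or indefinitely if no timeout is given, which matches the hang described in this thread.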
>>>>> 
>>>>> 
>>>>> From: Brian Phillips [email protected]
>>>>> Reply: [email protected] [email protected]
>>>>> Date: March 25, 2014 at 5:25:48 PM
>>>>> To: [email protected] [email protected]
>>>>> Subject:  Re: Curator barriers missing watch events
>>>>> 
>>>>>> I am using the no arg version! What's the bug?
>>>>>> 
>>>>>> _B
>>>>>> 
>>>>>> On Mar 25, 2014, at 6:23 PM, "Jordan Zimmerman" 
>>>>>> <[email protected]> wrote:
>>>>>> 
>>>>>>> Which version of enter() are you using? I see a potential bug when the 
>>>>>>> no arg version of enter() is used.
>>>>>>> 
>>>>>>> 
>>>>>>> From: Brian Phillips [email protected]
>>>>>>> Reply: Brian Phillips [email protected]
>>>>>>> Date: March 25, 2014 at 4:19:36 PM
>>>>>>> To: Jordan Zimmerman [email protected]
>>>>>>> Subject:  Re: Curator barriers missing watch events
>>>>>>> 
>>>>>>>> Good idea, but yes I am. The connection state doesn’t change while I’m 
>>>>>>>> executing the barrier code. It seems to be some kind of race condition 
>>>>>>>> I think, as sometimes it works and sometimes it doesn’t. I’ve looked 
>>>>>>>> through the recipe code and it looks good as far as I can tell though. 
>>>>>>>> I’m practically pulling my hair out at this point.
>>>>>>>> 
>>>>>>>> I may try a non-curator zookeeper only barrier tomorrow. See if that 
>>>>>>>> works. Or I may start trying to debug the zookeeper client, see if it’s 
>>>>>>>> actually getting the watches but not delivering them.
>>>>>>>> 
>>>>>>>> _B
>>>>>>>> 
>>>>>>>>> On Mar 25, 2014, at 4:54 PM, Jordan Zimmerman 
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>> 
>>>>>>>>> Are you setting a ConnectionStateListener? If the connection gets 
>>>>>>>>> SUSPENDED or LOST then you’d need to reinitialize your barrier.
>>>>>>>>> 
>>>>>>>>> -JZ
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> From: Brian Phillips [email protected]
>>>>>>>>> Reply: [email protected] [email protected]
>>>>>>>>> Date: March 25, 2014 at 2:51:42 PM
>>>>>>>>> To: [email protected] [email protected]
>>>>>>>>> Subject:  Re: Curator barriers missing watch events 
>>>>>>>>> 
>>>>>>>>>> I have tried writing a test program which launches two programs in 
>>>>>>>>>> the same manner; each makes a connection, then loops over barriers 
>>>>>>>>>> with a Thread.sleep(random) in between. This runs indefinitely and 
>>>>>>>>>> everything works out fine.
>>>>>>>>>> 
>>>>>>>>>> I have also tried writing my own barrier, which uses a SharedCount, 
>>>>>>>>>> where each guy tries to increment it until it hits a memberQty. This 
>>>>>>>>>> too misses watch events and does not work properly.
>>>>>>>>>> 
>>>>>>>>>> It’s almost as if something else that I’ve done during the running 
>>>>>>>>>> of my program has broken zookeeper’s watch events somehow. Is there 
>>>>>>>>>> any good way to debug watch events in general? I’ve tried to look at 
>>>>>>>>>> the DEBUG output for my zookeeper server log, but it looks the same 
>>>>>>>>>> for the working vs non-working barriers...
>>>>>>>>>> 
>>>>>>>>>> _B
>>>>>>>>>> 
>>>>>>>>>>> On Mar 25, 2014, at 3:42 PM, Jordan Zimmerman 
>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Unfortunately, the barrier recipes aren’t widely used (from what I 
>>>>>>>>>>> know). So, there may well be a bug. If you could get a test to show 
>>>>>>>>>>> the problem that would be ideal.
>>>>>>>>>>> 
>>>>>>>>>>> -JZ
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> From: Brian Phillips [email protected]
>>>>>>>>>>> Reply: [email protected] [email protected]
>>>>>>>>>>> Date: March 25, 2014 at 2:38:40 PM
>>>>>>>>>>> To: [email protected] [email protected]
>>>>>>>>>>> Subject:  Curator barriers missing watch events 
>>>>>>>>>>> 
>>>>>>>>>>>> Hi guys, 
>>>>>>>>>>>> 
>>>>>>>>>>>> I’ve been integrating curator into my project and have recently 
>>>>>>>>>>>> run into an issue I just can’t seem to make sense of. 
>>>>>>>>>>>> 
>>>>>>>>>>>> I’m running two JVMs on the same host machine, each with their own 
>>>>>>>>>>>> curator connection. At the beginning of my program I’m using the 
>>>>>>>>>>>> DistributedDoubleBarrier recipe, and once again at the end of my 
>>>>>>>>>>>> program. A bunch of work is done in-between, including zookeeper 
>>>>>>>>>>>> set/get/watches of other nodes. 
>>>>>>>>>>>> 
>>>>>>>>>>>> I’m finding that everyone always makes it through the first double 
>>>>>>>>>>>> barrier. At the job-end barrier, sometimes everyone gets through, 
>>>>>>>>>>>> but more often than not one of the programs hangs in enter's 
>>>>>>>>>>>> wait(), and never gets the watch event for the ready path which 
>>>>>>>>>>>> notifies it to proceed. If I look in zookeeper, I can see that the 
>>>>>>>>>>>> ready path is actually set in there. 
>>>>>>>>>>>> 
>>>>>>>>>>>> It would seem that the watch for one of the programs just never 
>>>>>>>>>>>> triggers. 
>>>>>>>>>>>> 
>>>>>>>>>>>> To simplify debugging, I’ve set both double barriers to only ever 
>>>>>>>>>>>> call enter() and not leave(). Both barriers have their own 
>>>>>>>>>>>> separate path. 
>>>>>>>>>>>> 
>>>>>>>>>>>> Also, the program never shuts down or disconnects from zookeeper. 
>>>>>>>>>>>> It just sleeps infinitely after it gets out of the final barrier. 
>>>>>>>>>>>> 
>>>>>>>>>>>> Any idea on how to debug this issue? I don’t mind hacking up 
>>>>>>>>>>>> zookeeper/curator code to insert my own debugging statements if it 
>>>>>>>>>>>> comes to that. 
>>>>>>>>>>>> 
>>>>>>>>>>>> _Brian
