Re: permanent ZSESSIONMOVED

2010-03-18 Thread Ted Dunning
Thanks to you for figuring out that it was a problem (and for having a new
and strange environment to test ZK in!)

On Thu, Mar 18, 2010 at 1:36 PM, Łukasz Osipiuk  wrote:

> That is great. Thank you  very much for your work!


Re: permanent ZSESSIONMOVED

2010-03-18 Thread Łukasz Osipiuk
Hi!


On Thu, Mar 18, 2010 at 21:23, Patrick Hunt  wrote:
> We've identified the problem and are working on a fix. Full details here:
>
> https://issues.apache.org/jira/browse/ZOOKEEPER-710
>

That is great. Thank you  very much for your work!

All the best, Łukasz


> Patrick
>
> Łukasz Osipiuk wrote:
>>
>> Hello Mahadev,
>>
>> 2010/3/18 Mahadev Konar :
>>>
>>> Hi Lukasz,
>>>  I looked at the logs and havent been able to determine how the session
>>> move can happen.
>>>
>>>
>>>  You should go ahead and open a jira for this! Please attach the logs to
>>> that jira. Mark it for 3.3.0 until we can say that it isnt a problem as
>>> of
>>> now.
>>
>> I created issue for this
>> https://issues.apache.org/jira/browse/ZOOKEEPER-710. I marked it as
>> 3.2.2 as this is version we are using.
>>
>>> Also, please go ahead and attach non filtered logs to the jira for the
>>> same
>>> timeline as you had done earlier (by non filtered I mean not grepping for
>>> the session id). Also, do attach the earlier logs you had emailed.
>>
>> I attached logs in jira.
>>
>>> Also, may I know what function are you using to print that statement
>>>
>>>
>>> Mar 16 22:09:01 ggmail-c1-mbox-1 ggmail_mbox[7976]: ERROR: ZooKeeper:
>>> ZooKeeper::getChildren() path='/ggmboxlocks/16314'; callback
>>> error='session
>>> moved to another server, so operation is ignored'
>>
>> info added to jira
>>
>>> (If you can cut and paste that code on the jira that you create, it would
>>> be
>>> helpful!)...
>>>
>>> Thanks
>>> mahadev
>>>
>>>
>>> On 3/17/10 2:56 AM, "Łukasz Osipiuk"  wrote:
>>>
 Ok.

 I analysed some logs from time when problem last occurred.

 brokensession.txt - logs concerning session which had a problem
 leader.txt - logs concerning leader election (i grepped for Leader in
 zookeeper.log)

 Some observations below
 - Network glitch which resulted in problem occurred at about 22:08.
 - From what I see since 17:48 node2 was the leader and it did not
 change later yesterday.
 - Client was connected to node2 since 17:50
 - At around 22:09 client tried to connect to every node (1,2,3).
 Connections to node1 and node3 were closed
  with exception "Exception causing close of session 0x22767e1c963
 due to java.io.IOException: Read error".
  Connection to node2 stood alive.
 - All subsequent operations were refused with ZSESSIONMOVED error.
 Error visible both on client and on server side.

 Some more clarifications concerning questions from different posts:

 1. Saying we have huge packet drop I mean we have huge packet drop for
 a local network. We are still investigating source of it. From time to
 time we observe that machines loose contact with others for few
 seconds.

 2. We do not use any fancy network tricks. All machines are in single
 vlan and use permanently assigned IP addresses.

 Let me know if anything more can help.

 PS. Should I already create JIRA issue for this or is not concrete
 enough?

 Regards, Łukasz

 On Tue, Mar 16, 2010 at 20:39, Patrick Hunt  wrote:
>
> Yea, that's great. (no complaints on my end, just wasn't sure what you
> meant, wanted to make sure I was clear.).
>
> Can you identify some sort of pattern? We're discussing on our end, but
> this
> is the first time we've heard of such an issue (outside 3.2.0 bug) and
> we're
> looking for a direction to hunt... anything you could add would help.
>
> Is it possible for you to upgrade your client/server to 3.2.2 and
> report the
> next failure, including any server/client logs if they are available?
> You
> should create a JIRA for this in order to track the details as they
> emerge.
> (incl attaching the logs, etc...)
>
> Patrick
>
> Łukasz Osipiuk wrote:
>>
>> On Tue, Mar 16, 2010 at 20:05, Patrick Hunt  wrote:
>>>
>>> We'll probably need the ZK server/client logs to hunt this down. Can
>>> you
>>> tell if the MOVED happens in some particular scenario, say you are
>>> connected
>>> to a follower and move to a leader, or perhaps you are connected to
>>> server
>>> A, get disconnected and reconnected to server A?  is there some
>>> pattern
>>> that could help us understand what's causing this?
>>>
 Session gets broken after zookeeper library reconnects to cluster,
 although most of reconnection leave session if working state.
>>>
>>> I'm not sure I follow this ^^^, could you restate it? (I get the
>>> first
>>> about
>>> about the session is broken after reconnect, what's the second part
>>> saying?)
>>
>> Sorry for my English.
>>
>> I meant that it only happens from time to time. We get reconnects
>> quite often due to large packet loss in our network, but most of the
>> time they do not break the session.

Re: permanent ZSESSIONMOVED

2010-03-18 Thread Patrick Hunt

We've identified the problem and are working on a fix. Full details here:

https://issues.apache.org/jira/browse/ZOOKEEPER-710

Patrick

Łukasz Osipiuk wrote:

Hello Mahadev,

2010/3/18 Mahadev Konar :

Hi Lukasz,
 I looked at the logs and havent been able to determine how the session
move can happen.


 You should go ahead and open a jira for this! Please attach the logs to
that jira. Mark it for 3.3.0 until we can say that it isnt a problem as of
now.


I created issue for this
https://issues.apache.org/jira/browse/ZOOKEEPER-710. I marked it as
3.2.2 as this is version we are using.


Also, please go ahead and attach non filtered logs to the jira for the same
timeline as you had done earlier (by non filtered I mean not grepping for
the session id). Also, do attach the earlier logs you had emailed.


I attached logs in jira.


Also, may I know what function are you using to print that statement


Mar 16 22:09:01 ggmail-c1-mbox-1 ggmail_mbox[7976]: ERROR: ZooKeeper:
ZooKeeper::getChildren() path='/ggmboxlocks/16314'; callback error='session
moved to another server, so operation is ignored'


info added to jira


(If you can cut and paste that code on the jira that you create, it would be
helpful!)...

Thanks
mahadev


On 3/17/10 2:56 AM, "Łukasz Osipiuk"  wrote:


Ok.

I analysed some logs from time when problem last occurred.

brokensession.txt - logs concerning session which had a problem
leader.txt - logs concerning leader election (i grepped for Leader in
zookeeper.log)

Some observations below
- Network glitch which resulted in problem occurred at about 22:08.
- From what I see since 17:48 node2 was the leader and it did not
change later yesterday.
- Client was connected to node2 since 17:50
- At around 22:09 client tried to connect to every node (1,2,3).
Connections to node1 and node3 were closed
  with exception "Exception causing close of session 0x22767e1c963
due to java.io.IOException: Read error".
  Connection to node2 stood alive.
- All subsequent operations were refused with ZSESSIONMOVED error.
Error visible both on client and on server side.

Some more clarifications concerning questions from different posts:

1. Saying we have huge packet drop I mean we have huge packet drop for
a local network. We are still investigating source of it. From time to
time we observe that machines loose contact with others for few
seconds.

2. We do not use any fancy network tricks. All machines are in single
vlan and use permanently assigned IP addresses.

Let me know if anything more can help.

PS. Should I already create JIRA issue for this or is not concrete enough?

Regards, Łukasz

On Tue, Mar 16, 2010 at 20:39, Patrick Hunt  wrote:

Yea, that's great. (no complaints on my end, just wasn't sure what you
meant, wanted to make sure I was clear.).

Can you identify some sort of pattern? We're discussing on our end, but this
is the first time we've heard of such an issue (outside 3.2.0 bug) and we're
looking for a direction to hunt... anything you could add would help.

Is it possible for you to upgrade your client/server to 3.2.2 and report the
next failure, including any server/client logs if they are available? You
should create a JIRA for this in order to track the details as they emerge.
(incl attaching the logs, etc...)

Patrick

Łukasz Osipiuk wrote:

On Tue, Mar 16, 2010 at 20:05, Patrick Hunt  wrote:

We'll probably need the ZK server/client logs to hunt this down. Can you
tell if the MOVED happens in some particular scenario, say you are
connected
to a follower and move to a leader, or perhaps you are connected to
server
A, get disconnected and reconnected to server A?  is there some
pattern
that could help us understand what's causing this?


Session gets broken after zookeeper library reconnects to cluster,
although most of reconnection leave session if working state.

I'm not sure I follow this ^^^, could you restate it? (I get the first
about
about the session is broken after reconnect, what's the second part
saying?)

Sorry for my English.

I meant that it only happens from time to time. We get reconnects
quite often due to large packet loss in our network, but most of the
time they do not break the session.

Is it clear now?

Regards, Łukasz Osipiuk


Regards,

Patrick

Łukasz Osipiuk wrote:

On Tue, Mar 16, 2010 at 19:22, Patrick Hunt  wrote:

Łukasz Osipiuk wrote:

On Tue, Mar 16, 2010 at 17:18, Patrick Hunt  wrote:

Can you verify that you are using 3.2.2 on all servers? You can do
this
by
running the "stat" command against each of your servers and look at
the
very
top of the output (we include the version of zk server there).
http://bit.ly/dglVld

Thanks for hint.

We had 3.2.2 jars installed on all server - I checked it previously
but apparently our script which restarts instance
after installing .deb does not work very well and node2 and node3 were
running version 3.2.1.
Could using 3.2.1 be source of the problem?


> I heard this story where customer service for dell asks customers to replug
> in the mouse because it might be "a dusty connection".

Re: permanent ZSESSIONMOVED

2010-03-18 Thread Łukasz Osipiuk
Hello Mahadev,

2010/3/18 Mahadev Konar :
> Hi Lukasz,
>  I looked at the logs and havent been able to determine how the session
> move can happen.
>
>
>  You should go ahead and open a jira for this! Please attach the logs to
> that jira. Mark it for 3.3.0 until we can say that it isnt a problem as of
> now.

I created an issue for this:
https://issues.apache.org/jira/browse/ZOOKEEPER-710. I marked it as
3.2.2 as this is the version we are using.

> Also, please go ahead and attach non filtered logs to the jira for the same
> timeline as you had done earlier (by non filtered I mean not grepping for
> the session id). Also, do attach the earlier logs you had emailed.

I attached the logs to the jira.

> Also, may I know what function are you using to print that statement
>
>
> Mar 16 22:09:01 ggmail-c1-mbox-1 ggmail_mbox[7976]: ERROR: ZooKeeper:
> ZooKeeper::getChildren() path='/ggmboxlocks/16314'; callback error='session
> moved to another server, so operation is ignored'

Info added to the jira.

>
> (If you can cut and paste that code on the jira that you create, it would be
> helpful!)...
>
> Thanks
> mahadev
>
>
> On 3/17/10 2:56 AM, "Łukasz Osipiuk"  wrote:
>
>> Ok.
>>
>> I analysed some logs from time when problem last occurred.
>>
>> brokensession.txt - logs concerning session which had a problem
>> leader.txt - logs concerning leader election (i grepped for Leader in
>> zookeeper.log)
>>
>> Some observations below
>> - Network glitch which resulted in problem occurred at about 22:08.
>> - From what I see since 17:48 node2 was the leader and it did not
>> change later yesterday.
>> - Client was connected to node2 since 17:50
>> - At around 22:09 client tried to connect to every node (1,2,3).
>> Connections to node1 and node3 were closed
>>   with exception "Exception causing close of session 0x22767e1c963
>> due to java.io.IOException: Read error".
>>   Connection to node2 stood alive.
>> - All subsequent operations were refused with ZSESSIONMOVED error.
>> Error visible both on client and on server side.
>>
>> Some more clarifications concerning questions from different posts:
>>
>> 1. Saying we have huge packet drop I mean we have huge packet drop for
>> a local network. We are still investigating source of it. From time to
>> time we observe that machines loose contact with others for few
>> seconds.
>>
>> 2. We do not use any fancy network tricks. All machines are in single
>> vlan and use permanently assigned IP addresses.
>>
>> Let me know if anything more can help.
>>
>> PS. Should I already create JIRA issue for this or is not concrete enough?
>>
>> Regards, Łukasz
>>
>> On Tue, Mar 16, 2010 at 20:39, Patrick Hunt  wrote:
>>> Yea, that's great. (no complaints on my end, just wasn't sure what you
>>> meant, wanted to make sure I was clear.).
>>>
>>> Can you identify some sort of pattern? We're discussing on our end, but this
>>> is the first time we've heard of such an issue (outside 3.2.0 bug) and we're
>>> looking for a direction to hunt... anything you could add would help.
>>>
>>> Is it possible for you to upgrade your client/server to 3.2.2 and report the
>>> next failure, including any server/client logs if they are available? You
>>> should create a JIRA for this in order to track the details as they emerge.
>>> (incl attaching the logs, etc...)
>>>
>>> Patrick
>>>
>>> Łukasz Osipiuk wrote:

 On Tue, Mar 16, 2010 at 20:05, Patrick Hunt  wrote:
>
> We'll probably need the ZK server/client logs to hunt this down. Can you
> tell if the MOVED happens in some particular scenario, say you are
> connected
> to a follower and move to a leader, or perhaps you are connected to
> server
> A, get disconnected and reconnected to server A?  is there some
> pattern
> that could help us understand what's causing this?
>
>> Session gets broken after zookeeper library reconnects to cluster,
>> although most of reconnection leave session if working state.
>
> I'm not sure I follow this ^^^, could you restate it? (I get the first
> about
> about the session is broken after reconnect, what's the second part
> saying?)

 Sorry for my English.

 I meant that it only happens from time to time. We get reconnects
 quite often due to large packet loss in our network, but most of the
 time they do not break the session.

 Is it clear now?

 Regards, Łukasz Osipiuk

> Regards,
>
> Patrick
>
> Łukasz Osipiuk wrote:
>>
>> On Tue, Mar 16, 2010 at 19:22, Patrick Hunt  wrote:
>>>
>>> Łukasz Osipiuk wrote:

 On Tue, Mar 16, 2010 at 17:18, Patrick Hunt  wrote:
>
> Can you verify that you are using 3.2.2 on all servers? You can do
> this
> by
> running the "stat" command against each of your servers and look at
> the
> very
> top of the output (we include the version of zk server there).
> http://bit.ly/dglVld

Re: permanent ZSESSIONMOVED

2010-03-18 Thread Michael Bauland
Hi Patrick,

thanks for your detailed answer.


>> I'm not sure, but maybe my input could help, too. As I mentioned
>> earlier, I also run the three zookeeper servers not in a local
>> environment but across two sites in different countries (soon, in
>> production, it'll be three sites).
>> I'm accessing the zookeeper ensemble continuously with about 5 clients
>> for testing purposes. These clients are running on the same machine as
>> one of the three zookeeper servers. I have attached logs of two of the
>> zookeeper servers (one leader and one follower).
>> My clients have a way to recover from connectionloss by trying ten
>> times, then they wait for 5 seconds, close the zookeeper connection and
>> open a new one and try up to ten times again; then again waiting 10
>> secs, closing, reconnecting, trying, etc., up to 50 seconds, then they
>> fail.
> 
> The zk client lib will handle connection loss automatically, not sure
> what you mean by "try ten times" to recover from conn loss.
> 
> Basically the client lib will notify your watcher when the session is
> disconnected from the cluster. If you have "in process" operations they
> will get the "connection loss" exception. However once this happens you
> want to wait, the ZK client lib will attempt to connect to the cluster
> again (one of the servers in the connect string), once it does it will
> notify you again via the watcher (sync connected event).
> 
> If the session reconnects to the cluster within the session timeout then
> all watches are restored and nothing is necessary on your part.

Unfortunately there are things I have to take care of. As explained in
http://wiki.apache.org/hadoop/ZooKeeper/ErrorHandling, I don't know
whether my previous request went through. Of course this is no problem
for idempotent requests (I can just reissue the request), but for
non-idempotent requests I first have to find out whether the request
succeeded or failed.
So for read requests I just have to repeat them to get the information I
need, and for write requests I first have to issue a read request to find
out whether the write went through. Those follow-up recovery requests are
the ones I meant by trying ten times before I close the connection, wait
and re-connect to try again.
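
To illustrate what I mean for a non-idempotent create (just a simplified
sketch, not my actual code; the bounded-retry bookkeeping is left out):

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Sketch of "verify, then retry" for a non-idempotent create.
public final class CreateWithRecovery
{
  public static void createWithRecovery (ZooKeeper zk, String path, byte[] data)
      throws KeeperException, InterruptedException
  {
    try
    {
      zk.create (path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }
    catch (KeeperException.ConnectionLossException e)
    {
      // The create may or may not have reached the server. Issue a read
      // request first to find out whether it went through before retrying.
      if (zk.exists (path, false) == null)
      {
        createWithRecovery (zk, path, data); // not there yet, so retry
      }
      // else: the original request did succeed, nothing more to do
    }
  }
}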

If I understand you correctly, I should not retry my request right away,
but wait until I get notified.
So, when I establish a connection to the Zookeeper ensemble, I also
provide a watcher:
zk = new ZooKeeper (connectString, timeOut, watcher);

This watcher is very simple in the way that it has a
public final Integer mutex = -1;

and a method

@Override
public void process (WatchedEvent event)
{
  synchronized (mutex)
  {
mutex.notifyAll ();
  }
}

So probably I will have to add some code that checks the event and
stores a special flag in the watcher for the "sync connected" event and
another one for the "session expired" event.
Whenever I get a connectionloss or sessionmoved exception (those are the
recoverable ones) during my work with the zk object, I will need to issue a
watcher.mutex.wait ();
call and only retry once the watcher shows the sync connected flag. And
if I see the session expired flag in the watcher, I know I'll have to
close the connection and create a new zookeeper one (roughly as in the
sketch below).
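
Something along these lines is what I have in mind (only a sketch; the
field and method names are made up):

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;

public class ConnectionWatcher implements Watcher
{
  private boolean connected = false;
  private boolean expired = false;

  @Override
  public synchronized void process (WatchedEvent event)
  {
    switch (event.getState ())
    {
      case SyncConnected:
        connected = true;
        break;
      case Disconnected:
        connected = false;
        break;
      case Expired:
        connected = false;
        expired = true;
        break;
      default:
        break;
    }
    notifyAll (); // wake up anyone waiting for a state change
  }

  // Block until we are (re)connected or the session has expired.
  public synchronized boolean waitUntilConnected () throws InterruptedException
  {
    while (!connected && !expired)
    {
      wait ();
    }
    return !expired; // false means: close zk and create a new session
  }
}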

Is that the correct way to handle those recoverable errors?




>> Sometimes all of the clients fail at about the same time. When I look at
>> the zookeeper logs I see that they report that the client connection
>> limit has been reached (which is set to 10, since I didn't set the value
>> at all), although this shouldn't happen since I have just 5 clients
>> running and none of them opens two connections at the same time, since
>> they're just single threaded.
>>
> 
> These clients are all on the same ip right?

At the moment, yes, and it's the same IP as one of the zookeeper
servers. Though this is just for testing. Later in production the
clients will be on different IPs (though there may still be several
threads running on the same IP).


> Are you sure you are closing the sessions in all cases (the old ones)?

Yes, I always call zk.close (); in the end.

> It could also be the case that you
> 
> 1) create a session, it gets disco, so you close it and
> 2) create a new session (and so on)
> 
> the session created in 1) (even if you closed it) may still be
> considered alive by the server until a) it expires, b) the session close
> in 1) eventually makes it's way back to the server.

I see. But I set the session time-out to 3000 (i.e., 3 seconds), so they
shouldn't stay active too long after I disconnect, I guess.


>> To me it seems that there are some connections which actually shouldn't
>> be there (anymore).
> 
> The easiest way to see this is using the "dump" command on the leader.
> It will tell you the sessions that the cluster thinks is still active.
> Run this command while running your test and see what it reports...
> http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin

Re: permanent ZSESSIONMOVED

2010-03-17 Thread Patrick Hunt
JIRA is great because we can move the discussion off the list and track 
explicitly along with possible fixes. Also if you create the jira you'll 
get updates on any changes to the record (you could 'watch' the JIRA 
instead, that's fine too).


Can you include the session id in your client logs? That would be very 
helpful for debugging. (it's available from ZooKeeper class)
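
In the Java client that's just a one-liner, something like this (sketch
only, log it however you already log):

import org.apache.zookeeper.ZooKeeper;

public final class SessionIdLogging {
    // Print the session id the same way the servers log it (hex), so
    // client and server logs can be correlated.
    public static void logSessionId(ZooKeeper zk) {
        System.out.println("zk session id: 0x" + Long.toHexString(zk.getSessionId()));
    }
}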


Best if you tar/gz the logs (zip is fine too). Having the full server logs
(from all servers) and the client logs for the time/incident in question,
we can "follow along", if you will, pretty closely with what's happening.
It should really shed some light.


Thanks for your help on this, we'd really like to nail this one down.

Regards,

Patrick

Mahadev Konar wrote:

Hi Lukasz,
  I looked at the logs and havent been able to determine how the session
move can happen. 



  You should go ahead and open a jira for this! Please attach the logs to
that jira. Mark it for 3.3.0 until we can say that it isnt a problem as of
now. 


Also, please go ahead and attach non filtered logs to the jira for the same
timeline as you had done earlier (by non filtered I mean not grepping for
the session id). Also, do attach the earlier logs you had emailed.

Also, may I know what function are you using to print that statement


Mar 16 22:09:01 ggmail-c1-mbox-1 ggmail_mbox[7976]: ERROR: ZooKeeper:
ZooKeeper::getChildren() path='/ggmboxlocks/16314'; callback error='session
moved to another server, so operation is ignored'


(If you can cut and paste that code on the jira that you create, it would be
helpful!)... 


Thanks
mahadev


On 3/17/10 2:56 AM, "Łukasz Osipiuk"  wrote:


Ok.

I analysed some logs from time when problem last occurred.

brokensession.txt - logs concerning session which had a problem
leader.txt - logs concerning leader election (i grepped for Leader in
zookeeper.log)

Some observations below
- Network glitch which resulted in problem occurred at about 22:08.
- From what I see since 17:48 node2 was the leader and it did not
change later yesterday.
- Client was connected to node2 since 17:50
- At around 22:09 client tried to connect to every node (1,2,3).
Connections to node1 and node3 were closed
  with exception "Exception causing close of session 0x22767e1c963
due to java.io.IOException: Read error".
  Connection to node2 stood alive.
- All subsequent operations were refused with ZSESSIONMOVED error.
Error visible both on client and on server side.

Some more clarifications concerning questions from different posts:

1. Saying we have huge packet drop I mean we have huge packet drop for
a local network. We are still investigating source of it. From time to
time we observe that machines loose contact with others for few
seconds.

2. We do not use any fancy network tricks. All machines are in single
vlan and use permanently assigned IP addresses.

Let me know if anything more can help.

PS. Should I already create JIRA issue for this or is not concrete enough?

Regards, Łukasz

On Tue, Mar 16, 2010 at 20:39, Patrick Hunt  wrote:

Yea, that's great. (no complaints on my end, just wasn't sure what you
meant, wanted to make sure I was clear.).

Can you identify some sort of pattern? We're discussing on our end, but this
is the first time we've heard of such an issue (outside 3.2.0 bug) and we're
looking for a direction to hunt... anything you could add would help.

Is it possible for you to upgrade your client/server to 3.2.2 and report the
next failure, including any server/client logs if they are available? You
should create a JIRA for this in order to track the details as they emerge.
(incl attaching the logs, etc...)

Patrick

Łukasz Osipiuk wrote:

On Tue, Mar 16, 2010 at 20:05, Patrick Hunt  wrote:

We'll probably need the ZK server/client logs to hunt this down. Can you
tell if the MOVED happens in some particular scenario, say you are
connected
to a follower and move to a leader, or perhaps you are connected to
server
A, get disconnected and reconnected to server A?  is there some
pattern
that could help us understand what's causing this?


Session gets broken after zookeeper library reconnects to cluster,
although most of reconnection leave session if working state.

I'm not sure I follow this ^^^, could you restate it? (I get the first
about
about the session is broken after reconnect, what's the second part
saying?)

Sorry for my English.

I meant that it only happens from time to time. We get reconnects
quite often due to large packet loss in our network, but most of the
time they do not break the session.

Is it clear now?

Regards, Łukasz Osipiuk


Regards,

Patrick

Łukasz Osipiuk wrote:

On Tue, Mar 16, 2010 at 19:22, Patrick Hunt  wrote:

Łukasz Osipiuk wrote:

On Tue, Mar 16, 2010 at 17:18, Patrick Hunt  wrote:

Can you verify that you are using 3.2.2 on all servers? You can do
this
by
running the "stat" command against each of your servers and look at
the
very
top of the output (we include the version of zk server there).

Re: permanent ZSESSIONMOVED

2010-03-17 Thread Mahadev Konar
Hi Lukasz,
  I looked at the logs and haven't been able to determine how the session
move can happen.


  You should go ahead and open a jira for this! Please attach the logs to
that jira. Mark it for 3.3.0 until we can say that it isn't a problem as of
now.

Also, please go ahead and attach non filtered logs to the jira for the same
timeline as you had done earlier (by non filtered I mean not grepping for
the session id). Also, do attach the earlier logs you had emailed.

Also, may I know what function you are using to print this statement?


Mar 16 22:09:01 ggmail-c1-mbox-1 ggmail_mbox[7976]: ERROR: ZooKeeper:
ZooKeeper::getChildren() path='/ggmboxlocks/16314'; callback error='session
moved to another server, so operation is ignored'


(If you can cut and paste that code on the jira that you create, it would be
helpful!)... 

Thanks
mahadev


On 3/17/10 2:56 AM, "Łukasz Osipiuk"  wrote:

> Ok.
> 
> I analysed some logs from time when problem last occurred.
> 
> brokensession.txt - logs concerning session which had a problem
> leader.txt - logs concerning leader election (i grepped for Leader in
> zookeeper.log)
> 
> Some observations below
> - Network glitch which resulted in problem occurred at about 22:08.
> - From what I see since 17:48 node2 was the leader and it did not
> change later yesterday.
> - Client was connected to node2 since 17:50
> - At around 22:09 client tried to connect to every node (1,2,3).
> Connections to node1 and node3 were closed
>   with exception "Exception causing close of session 0x22767e1c963
> due to java.io.IOException: Read error".
>   Connection to node2 stood alive.
> - All subsequent operations were refused with ZSESSIONMOVED error.
> Error visible both on client and on server side.
> 
> Some more clarifications concerning questions from different posts:
> 
> 1. Saying we have huge packet drop I mean we have huge packet drop for
> a local network. We are still investigating source of it. From time to
> time we observe that machines loose contact with others for few
> seconds.
> 
> 2. We do not use any fancy network tricks. All machines are in single
> vlan and use permanently assigned IP addresses.
> 
> Let me know if anything more can help.
> 
> PS. Should I already create JIRA issue for this or is not concrete enough?
> 
> Regards, Łukasz
> 
> On Tue, Mar 16, 2010 at 20:39, Patrick Hunt  wrote:
>> Yea, that's great. (no complaints on my end, just wasn't sure what you
>> meant, wanted to make sure I was clear.).
>> 
>> Can you identify some sort of pattern? We're discussing on our end, but this
>> is the first time we've heard of such an issue (outside 3.2.0 bug) and we're
>> looking for a direction to hunt... anything you could add would help.
>> 
>> Is it possible for you to upgrade your client/server to 3.2.2 and report the
>> next failure, including any server/client logs if they are available? You
>> should create a JIRA for this in order to track the details as they emerge.
>> (incl attaching the logs, etc...)
>> 
>> Patrick
>> 
>> Łukasz Osipiuk wrote:
>>> 
>>> On Tue, Mar 16, 2010 at 20:05, Patrick Hunt  wrote:
 
 We'll probably need the ZK server/client logs to hunt this down. Can you
 tell if the MOVED happens in some particular scenario, say you are
 connected
 to a follower and move to a leader, or perhaps you are connected to
 server
 A, get disconnected and reconnected to server A?  is there some
 pattern
 that could help us understand what's causing this?
 
> Session gets broken after zookeeper library reconnects to cluster,
> although most of reconnection leave session if working state.
 
 I'm not sure I follow this ^^^, could you restate it? (I get the first
 about
 about the session is broken after reconnect, what's the second part
 saying?)
>>> 
>>> Sorry for my English.
>>> 
>>> I meant that it only happens from time to time. We get reconnects
>>> quite often due to large packet loss in our network, but most of the
>>> time they do not break the session.
>>> 
>>> Is it clear now?
>>> 
>>> Regards, Łukasz Osipiuk
>>> 
 Regards,
 
 Patrick
 
 Łukasz Osipiuk wrote:
> 
> On Tue, Mar 16, 2010 at 19:22, Patrick Hunt  wrote:
>> 
>> Łukasz Osipiuk wrote:
>>> 
>>> On Tue, Mar 16, 2010 at 17:18, Patrick Hunt  wrote:
 
 Can you verify that you are using 3.2.2 on all servers? You can do
 this
 by
 running the "stat" command against each of your servers and look at
 the
 very
 top of the output (we include the version of zk server there).
 http://bit.ly/dglVld
>>> 
>>> Thanks for hint.
>>> 
>>> We had 3.2.2 jars installed on all server - I checked it previously
>>> but apparently our script which restarts instance
>>> after installing .deb does not work very well and node2 and node3 were
>>> running version 3.2.1.
>>> Could using 3.2.1 be source of the problem?

Re: permanent ZSESSIONMOVED

2010-03-17 Thread Patrick Hunt

Thanks for this Michael, a bit of clarification/advice inline below:

Michael Bauland wrote:

Hi,


This is very different from most uses of ZK.  Normally Zookeeper is used on
a private network with almost no packet loss.  Your high rate of loss may be
tickling a protocol bug that other people
just never see.


I'm not sure, but maybe my input could help, too. As I mentioned
earlier, I also run the three zookeeper servers not in a local
environment but across two sites in different countries (soon, in
production, it'll be three sites).
I'm accessing the zookeeper ensemble continuously with about 5 clients
for testing purposes. These clients are running on the same machine as
one of the three zookeeper servers. I have attached logs of two of the
zookeeper servers (one leader and one follower).
My clients have a way to recover from connectionloss by trying ten
times, then they wait for 5 seconds, close the zookeeper connection and
open a new one and try up to ten times again; then again waiting 10
secs, closing, reconnecting, trying, etc., up to 50 seconds, then they fail.


The zk client lib will handle connection loss automatically, not sure 
what you mean by "try ten times" to recover from conn loss.


Basically the client lib will notify your watcher when the session is
disconnected from the cluster. If you have "in process" operations they
will get the "connection loss" exception. However, once this happens you
want to wait; the ZK client lib will attempt to connect to the cluster
again (one of the servers in the connect string), and once it does it will
notify you again via the watcher (sync connected event).


If the session reconnects to the cluster within the session timeout then 
all watches are restored and nothing is necessary on your part. If you 
exceed the timeout then the cluster will have expired your session and 
you would be notified of this immediately upon reconnect to the cluster.


See item 3 in the faq, there's some detail there:
http://wiki.apache.org/hadoop/ZooKeeper/FAQ#A3

So really in your code you want to:

1) if you get disconnected just wait till you get reconnected, our 
library handles all that for you.

2) if you get expired then you must create a new session.

I seem to remember you may have had some issues with local vs remote 
though... so it may not be as clearcut. But in general our client lib 
handles all this for you.
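
To sketch the shape of it (the connect string and timeout below are just
placeholders, not recommendations):

import java.io.IOException;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Minimal sketch: on Disconnected just wait, on SyncConnected resume,
// on Expired build a brand new session.
public final class SessionHandlingSketch {
    public static void main(String[] args) throws IOException, InterruptedException {
        final CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("host1:2181,host2:2181,host3:2181", 30000,
            new Watcher() {
                public void process(WatchedEvent event) {
                    switch (event.getState()) {
                        case SyncConnected:
                            connected.countDown(); // reconnected: safe to resume
                            break;
                        case Disconnected:
                            // just wait; the client lib is already reconnecting
                            break;
                        case Expired:
                            // session is gone: must close and create a new ZooKeeper
                            break;
                        default:
                            break;
                    }
                }
            });
        connected.await(); // block until the initial sync connected event
        System.out.println("connected, session 0x" + Long.toHexString(zk.getSessionId()));
        zk.close();
    }
}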



Sometimes all of the clients fail at about the same time. When I look at
the zookeeper logs I see that they report that the client connection
limit has been reached (which is set to 10, since I didn't set the value
at all), although this shouldn't happen since I have just 5 clients
running and none of them opens two connections at the same time, since
they're just single threaded.



These clients are all on the same ip right?

Are you sure you are closing the sessions in all cases (the old ones)? 
It could also be the case that you


1) create a session, it gets disco, so you close it and
2) create a new session (and so on)

the session created in 1) (even if you closed it) may still be 
considered alive by the server until a) it expires, or b) the session close 
in 1) eventually makes its way back to the server.



To me it seems that there are some connections which actually shouldn't
be there (anymore).


The easiest way to see this is using the "dump" command on the leader. 
It will tell you the sessions that the cluster thinks are still active. 
Run this command while running your test and see what it reports...

http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_zkCommands
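
If telnet/nc is awkward in your setup, the four letter words can also be
sent from a few lines of code, e.g. a rough Java sketch (host/port are
placeholders):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;

// Send the "dump" four letter word to a server and print the reply.
// Note that "dump" only gives the full session/ephemeral list on the leader.
public final class FourLetterWord {
    public static void main(String[] args) throws Exception {
        Socket sock = new Socket("localhost", 2181);
        try {
            OutputStream out = sock.getOutputStream();
            out.write("dump".getBytes("US-ASCII"));
            out.flush();
            BufferedReader in = new BufferedReader(
                new InputStreamReader(sock.getInputStream(), "US-ASCII"));
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        } finally {
            sock.close();
        }
    }
}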


For the logs, note that the clients are running also on the server which
is at that time the follower and I replaced my IPs by letters.


Take a look at this session 0x226f49bb5d78ea8 for example:

2010-03-08 15:27:20,525 - INFO  [NIOServerCxn.Factory:2181:NIOServerCnxn@639] - Creating new session 0x226f49bb5d78ea8
2010-03-08 15:27:22,253 - INFO  [CommitProcessor:0:NIOServerCnxn@992] - Finished init of 0x226f49bb5d78ea8 valid:true
2010-03-08 15:27:22,254 - WARN  [NIOServerCxn.Factory:2181:NIOServerCnxn@518] - Exception causing close of session 0x226f49bb5d78ea8 due to java.io.IOException: Read error
2010-03-08 15:27:22,254 - INFO  [NIOServerCxn.Factory:2181:NIOServerCnxn@857] - closing session:0x226f49bb5d78ea8 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/A.B.C.E:2181 remote=/A.B.C.D:49258]
2010-03-08 15:27:28,000 - INFO  [SessionTracker:SessionTrackerImpl@133] - Expiring session 0x226f49bb5d78ea8
2010-03-08 15:27:28,001 - INFO  [SessionTracker:ZooKeeperServer@326] - Expiring session 0x226f49bb5d78ea8
2010-03-08 15:27:28,001 - INFO  [ProcessThread:-1:PrepRequestProcessor@384] - Processed session termination request for id: 0x226f49bb5d78ea8


Looks like it's using a 5sec timeout, I can see that it's connecting to 
the server, then the connection is failing (or the client closing? I don't 
have the log for that) after just a fraction of a second.

Re: permanent ZSESSIONMOVED

2010-03-17 Thread Łukasz Osipiuk
Ok.

I analysed some logs from the time when the problem last occurred.

brokensession.txt - logs concerning the session which had the problem
leader.txt - logs concerning leader election (I grepped for Leader in
zookeeper.log)

Some observations below:
- The network glitch which resulted in the problem occurred at about 22:08.
- From what I see, node2 was the leader since 17:48 and this did not
change later yesterday.
- The client was connected to node2 since 17:50.
- At around 22:09 the client tried to connect to every node (1, 2, 3).
Connections to node1 and node3 were closed with the exception
"Exception causing close of session 0x22767e1c963 due to
java.io.IOException: Read error". The connection to node2 stayed alive.
- All subsequent operations were refused with the ZSESSIONMOVED error.
The error was visible both on the client and on the server side.

Some more clarifications concerning questions from different posts:

1. When I say we have huge packet drop, I mean huge packet drop for a
local network. We are still investigating its source. From time to time
we observe that machines lose contact with the others for a few seconds.

2. We do not use any fancy network tricks. All machines are in a single
VLAN and use permanently assigned IP addresses.

Let me know if anything more can help.

PS. Should I already create a JIRA issue for this, or is it not concrete enough yet?

Regards, Łukasz

On Tue, Mar 16, 2010 at 20:39, Patrick Hunt  wrote:
> Yea, that's great. (no complaints on my end, just wasn't sure what you
> meant, wanted to make sure I was clear.).
>
> Can you identify some sort of pattern? We're discussing on our end, but this
> is the first time we've heard of such an issue (outside 3.2.0 bug) and we're
> looking for a direction to hunt... anything you could add would help.
>
> Is it possible for you to upgrade your client/server to 3.2.2 and report the
> next failure, including any server/client logs if they are available? You
> should create a JIRA for this in order to track the details as they emerge.
> (incl attaching the logs, etc...)
>
> Patrick
>
> Łukasz Osipiuk wrote:
>>
>> On Tue, Mar 16, 2010 at 20:05, Patrick Hunt  wrote:
>>>
>>> We'll probably need the ZK server/client logs to hunt this down. Can you
>>> tell if the MOVED happens in some particular scenario, say you are
>>> connected
>>> to a follower and move to a leader, or perhaps you are connected to
>>> server
>>> A, get disconnected and reconnected to server A?  is there some
>>> pattern
>>> that could help us understand what's causing this?
>>>
 Session gets broken after zookeeper library reconnects to cluster,
 although most of reconnection leave session if working state.
>>>
>>> I'm not sure I follow this ^^^, could you restate it? (I get the first
>>> about
>>> about the session is broken after reconnect, what's the second part
>>> saying?)
>>
>> Sorry for my English.
>>
>> I meant that it only happens from time to time. We get reconnects
>> quite often due to large packet loss in our network, but most of the
>> time they do not break the session.
>>
>> Is it clear now?
>>
>> Regards, Łukasz Osipiuk
>>
>>> Regards,
>>>
>>> Patrick
>>>
>>> Łukasz Osipiuk wrote:

 On Tue, Mar 16, 2010 at 19:22, Patrick Hunt  wrote:
>
> Łukasz Osipiuk wrote:
>>
>> On Tue, Mar 16, 2010 at 17:18, Patrick Hunt  wrote:
>>>
>>> Can you verify that you are using 3.2.2 on all servers? You can do
>>> this
>>> by
>>> running the "stat" command against each of your servers and look at
>>> the
>>> very
>>> top of the output (we include the version of zk server there).
>>> http://bit.ly/dglVld
>>
>> Thanks for hint.
>>
>> We had 3.2.2 jars installed on all server - I checked it previously
>> but apparently our script which restarts instance
>> after installing .deb does not work very well and node2 and node3 were
>> running version 3.2.1.
>> Could using 3.2.1 be source of the problem?
>>
> I heard this story where customer service for dell asks customers to
> replug
> in the mouse because it might be "a dusty connection". Invariably when
> ppl
> look at the back of the box they find that the mouse is not actually
> plugged
> in. Brilliant ideas from CS perspective. ;-)
>
> I think that issue was fixed in 3.2.1, so I don't think that's an issue
> (version).

 Damn :(

>>> Are you using synchronous or async operations in your client?
>>
>> We are only using async operations.
>
> Ok. good info.
>
>>> I see snippets from the server logs, but I don't see anything from
>>> you
>>> client logs. It would really help if you could zip up all the logs
>>> (server
>>> and client) and provide them.
>>
>> We do not really have much logging on application side :(. Failed
>> operations are logged with error code (ZSESSIONEXPIRED in our case).
>> We also log session state changes but we noticed only changes between 1
>> and 3.

Re: permanent ZSESSIONMOVED

2010-03-16 Thread Ted Dunning
Hmm... this inspires me to have a thought as well.

Łukasz, there isn't any fancy network stuff going on here is there?  No
NATing or fancy load balancing or reassignment of IP addresses of servers,
right?

On Tue, Mar 16, 2010 at 4:51 PM, Patrick Hunt  wrote:

> It will be good to see the logs, however I had one additional thought.
>
> The leader (the zk leader) is the one checking for session MOVED. It keeps
> track of which server the session is currently attached to and will throw
> the moved exception if the session proposes a request through a server other
> than who the leader thinks is the owner.
>


Re: permanent ZSESSIONMOVED

2010-03-16 Thread Patrick Hunt

It will be good to see the logs, however I had one additional thought.

The leader (the zk leader) is the one checking for session MOVED. It 
keeps track of which server the session is currently attached to and 
will throw the moved exception if the session proposes a request through 
a server other than who the leader thinks is the owner.
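
Conceptually the check is something like the sketch below (this is not
the actual server code, just the shape of it):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Conceptual sketch only: the leader remembers which server "owns" each
// session and rejects a request that arrives via a different server.
public final class SessionOwnerCheckSketch {
    private final Map<Long, Object> ownerBySession = new ConcurrentHashMap<Long, Object>();

    // Called when a session (re)attaches through some server.
    public void setOwner(long sessionId, Object owner) {
        ownerBySession.put(Long.valueOf(sessionId), owner);
    }

    // Called when a request for the session is proposed through 'owner'.
    public void checkOwner(long sessionId, Object owner) throws SessionMovedException {
        Object current = ownerBySession.get(Long.valueOf(sessionId));
        if (current != null && !current.equals(owner)) {
            throw new SessionMovedException(); // surfaces as ZSESSIONMOVED on the client
        }
    }

    public static class SessionMovedException extends Exception {
        private static final long serialVersionUID = 1L;
    }
}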


I'm wondering, if/when you see this again, if you restart the server 
that the session is attached to (use netstat on the client for this) 
what would happen. The client will re-attach to the cluster, I'm 
wondering if this would fix the problem. (rather than trying to restart 
the client as you have been doing).


Not sure if you can try this (production env?) but it would be an 
interesting additional data point if you can give it a try.


Regards,

Patrick

Patrick Hunt wrote:
Yes, if you search "back" (older entries) in the server log you will be 
able to see who the leader is, it will say something like "LEADING" or 
"FOLLOWING", but this may change over time (which is why you need to 
search "back" as I mention) if leadership within the ZK cluster changes 
(say due to networking issue). This is why I mention the logs so highly 
- it really will give us much additional insight into the issue.


here's an example of a 5 server ensemble:
phunt@valhalla:~/dev/workspace/zkconf/test5[master]$ egrep LEAD local*/*.log
localhost:2184/zoo.log:2010-03-16 12:50:13,711 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2184:QuorumPeer@632] - LEADING
phunt@valhalla:~/dev/workspace/zkconf/test5[master]$ egrep FOLLOW local*/*.log
localhost:2181/zoo.log:2010-03-16 12:50:13,649 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:QuorumPeer@620] - FOLLOWING
localhost:2182/zoo.log:2010-03-16 12:50:13,933 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2182:QuorumPeer@620] - FOLLOWING
localhost:2183/zoo.log:2010-03-16 12:50:13,901 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2183:QuorumPeer@620] - FOLLOWING
localhost:2185/zoo.log:2010-03-16 12:50:13,661 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2185:QuorumPeer@620] - FOLLOWING



Additionally if you use the "stat" 4letter word you will see the current 
status of the server, leader or follower. (JMX as well)


You might also find this useful: http://github.com/phunt/zktop

Patrick

Łukasz Osipiuk wrote:

On Tue, Mar 16, 2010 at 20:05, Patrick Hunt  wrote:

We'll probably need the ZK server/client logs to hunt this down. Can you
tell if the MOVED happens in some particular scenario, say you are 
connected
to a follower and move to a leader, or perhaps you are connected to 
server
A, get disconnected and reconnected to server A?  is there some 
pattern

that could help us understand what's causing this?



When I get to office tomorrow I will try to investigate logs and maybe
i will be able to find out what the error scenario is.
But I am not sure if I will be able to find out what was the role of
each node when problem occurred?
Does zookeeper server log when node state changes between follower and
leader. Or can I make it log it?

Regards, Łukasz





Re: permanent ZSESSIONMOVED

2010-03-16 Thread Patrick Hunt

Agree.

Patrick

Ted Dunning wrote:

This is very different from most uses of ZK.  Normally Zookeeper is used on
a private network with almost no packet loss.  Your high rate of loss may be
tickling a protocol bug that other people
just never see.

On Tue, Mar 16, 2010 at 12:24 PM, Łukasz Osipiuk  wrote:


 We get reconnects
quite often due to large packet loss in our network, but most of the
time they do not break the session.





Re: permanent ZSESSIONMOVED

2010-03-16 Thread Ted Dunning
This is very different from most uses of ZK.  Normally Zookeeper is used on
a private network with almost no packet loss.  Your high rate of loss may be
tickling a protocol bug that other people
just never see.

On Tue, Mar 16, 2010 at 12:24 PM, Łukasz Osipiuk  wrote:

>  We get reconnects
> quite often due to large packet loss in our network, but most of the
> time they do not break the session.
>


Re: permanent ZSESSIONMOVED

2010-03-16 Thread Patrick Hunt
Yes, if you search "back" (older entries) in the server log you will be 
able to see who the leader is, it will say something like "LEADING" or 
"FOLLOWING", but this may change over time (which is why you need to 
search "back" as I mention) if leadership within the ZK cluster changes 
(say due to networking issue). This is why I mention the logs so highly 
- it really will give us much additional insight into the issue.


here's an example of a 5 server ensemble:
phunt@valhalla:~/dev/workspace/zkconf/test5[master]$ egrep LEAD local*/*.log
localhost:2184/zoo.log:2010-03-16 12:50:13,711 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2184:QuorumPeer@632] - LEADING
phunt@valhalla:~/dev/workspace/zkconf/test5[master]$ egrep FOLLOW local*/*.log
localhost:2181/zoo.log:2010-03-16 12:50:13,649 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:QuorumPeer@620] - FOLLOWING
localhost:2182/zoo.log:2010-03-16 12:50:13,933 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2182:QuorumPeer@620] - FOLLOWING
localhost:2183/zoo.log:2010-03-16 12:50:13,901 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2183:QuorumPeer@620] - FOLLOWING
localhost:2185/zoo.log:2010-03-16 12:50:13,661 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2185:QuorumPeer@620] - FOLLOWING



Additionally if you use the "stat" 4letter word you will see the current 
status of the server, leader or follower. (JMX as well)


You might also find this useful: http://github.com/phunt/zktop

Patrick

Łukasz Osipiuk wrote:

On Tue, Mar 16, 2010 at 20:05, Patrick Hunt  wrote:

We'll probably need the ZK server/client logs to hunt this down. Can you
tell if the MOVED happens in some particular scenario, say you are connected
to a follower and move to a leader, or perhaps you are connected to server
A, get disconnected and reconnected to server A?  is there some pattern
that could help us understand what's causing this?



When I get to office tomorrow I will try to investigate logs and maybe
i will be able to find out what the error scenario is.
But I am not sure if I will be able to find out what was the role of
each node when problem occurred?
Does zookeeper server log when node state changes between follower and
leader. Or can I make it log it?

Regards, Łukasz





Re: permanent ZSESSIONMOVED

2010-03-16 Thread Patrick Hunt
Yea, that's great. (no complaints on my end, just wasn't sure what you 
meant, wanted to make sure I was clear.).


Can you identify some sort of pattern? We're discussing on our end, but 
this is the first time we've heard of such an issue (outside 3.2.0 bug) 
and we're looking for a direction to hunt... anything you could add 
would help.


Is it possible for you to upgrade your client/server to 3.2.2 and report 
the next failure, including any server/client logs if they are 
available? You should create a JIRA for this in order to track the 
details as they emerge. (incl attaching the logs, etc...)


Patrick

Łukasz Osipiuk wrote:

On Tue, Mar 16, 2010 at 20:05, Patrick Hunt  wrote:

We'll probably need the ZK server/client logs to hunt this down. Can you
tell if the MOVED happens in some particular scenario, say you are connected
to a follower and move to a leader, or perhaps you are connected to server
A, get disconnected and reconnected to server A?  is there some pattern
that could help us understand what's causing this?


Session gets broken after zookeeper library reconnects to cluster,
although most of reconnection leave session if working state.

I'm not sure I follow this ^^^, could you restate it? (I get the first about
about the session is broken after reconnect, what's the second part saying?)


Sorry for my English.

I meant that it only happens from time to time. We get reconnects
quite often due to large packet loss in our network, but most of the
time they do not break the session.

Is it clear now?

Regards, Łukasz Osipiuk


Regards,

Patrick

Łukasz Osipiuk wrote:

On Tue, Mar 16, 2010 at 19:22, Patrick Hunt  wrote:

Łukasz Osipiuk wrote:

On Tue, Mar 16, 2010 at 17:18, Patrick Hunt  wrote:

Can you verify that you are using 3.2.2 on all servers? You can do this
by
running the "stat" command against each of your servers and look at the
very
top of the output (we include the version of zk server there).
http://bit.ly/dglVld

Thanks for hint.

We had 3.2.2 jars installed on all server - I checked it previously
but apparently our script which restarts instance
after installing .deb does not work very well and node2 and node3 were
running version 3.2.1.
Could using 3.2.1 be source of the problem?


I heard this story where customer service for dell asks customers to
replug
in the mouse because it might be "a dusty connection". Invariably when
ppl
look at the back of the box they find that the mouse is not actually
plugged
in. Brilliant ideas from CS perspective. ;-)

I think that issue was fixed in 3.2.1, so I don't think that's an issue
(version).

Damn :(


Are you using synchronous or async operations in your client?

We are only using async operations.

Ok. good info.


I see snippets from the server logs, but I don't see anything from you
client logs. It would really help if you could zip up all the logs
(server
and client) and provide them.

We do not really have much logging on application side :(. Failed
operations are logged with error code (ZSESSIONEXPIRED in our case).
We also log session state changes but we noticed only changes between 1
and 3.
Unfortunately we do not have logs coming directly from zookeeper
library as it logs to stderr/stdout and we do not store any of them at
the moment.


I see, in this case I think that the log messages themselves are the
problem, i.e. we are warning about this, but it's not a real problem. We
warn on things that we don't like but can handle; we error if we don't
like / can't handle.

MOVED basically means that a client initiated a request on one server,
then
moved to another server before the request could be processed.

In your case you are using async requests, that can actually make this
problem show up more (more "in flight" requests vs sync operation).

So if you are not seeing MOVED on the client then this is fine. If you
see
them in the server log it means what I said above.

That make sense, or am I missing something?

It seems what I wrote was not clear. We are noticing errors on the
client side. Any operation after the session gets into the broken state
fails with the ZSESSIONMOVED error code (I wrote ZSESSIONEXPIRED by
mistake in my previous post). We get ZSESSIONMOVED from the callback's
rc parameter.

The session gets broken after the zookeeper library reconnects to the
cluster, although most reconnections leave the session in working state.

Regards, Łukasz



Patrick




Regards, Łukasz


Patrick

Łukasz Osipiuk wrote:

not really - it happens occasionally - every few days :(
I believe it is somewhat connected with our network environment which
suffers from some packet loss which leads to
connection timeouts.

I can switch on some more logging if you can lead me which categories
are worth to enable DEBUG for them.

Regards, Łukasz Osipiuk

On Tue, Mar 16, 2010 at 16:35, Benjamin Reed 
wrote:

weird, this does sound like a bug. do you have a reliable way of
reproducing
the problem?

thanx
ben

On 03/16/2010 08:27 AM, Łukasz Osipiuk wrote:

nope.

I always pass 0 as clientid.

Re: permanent ZSESSIONMOVED

2010-03-16 Thread Łukasz Osipiuk
On Tue, Mar 16, 2010 at 20:05, Patrick Hunt  wrote:
> We'll probably need the ZK server/client logs to hunt this down. Can you
> tell if the MOVED happens in some particular scenario, say you are connected
> to a follower and move to a leader, or perhaps you are connected to server
> A, get disconnected and reconnected to server A?  is there some pattern
> that could help us understand what's causing this?
>

When I get to the office tomorrow I will try to investigate the logs and
maybe I will be able to find out what the error scenario is.
But I am not sure I will be able to find out what the role of each node
was when the problem occurred.
Does the zookeeper server log when a node's state changes between
follower and leader? Or can I make it log that?

Regards, Łukasz



-- 
-- 
Łukasz Osipiuk
mailto:luk...@osipiuk.net


Re: permanent ZSESSIONMOVED

2010-03-16 Thread Łukasz Osipiuk
On Tue, Mar 16, 2010 at 20:05, Patrick Hunt  wrote:
> We'll probably need the ZK server/client logs to hunt this down. Can you
> tell if the MOVED happens in some particular scenario, say you are connected
> to a follower and move to a leader, or perhaps you are connected to server
> A, get disconnected and reconnected to server A?  is there some pattern
> that could help us understand what's causing this?
>
>> Session gets broken after zookeeper library reconnects to cluster,
>> although most of reconnection leave session if working state.
>
> I'm not sure I follow this ^^^, could you restate it? (I get the first about
> about the session is broken after reconnect, what's the second part saying?)

Sorry for my English.

I meant that it only happens from time to time. We get reconnects
quite often due to large packet loss in our network, but most of the
time they do not break the session.

Is it clear now?

Regards, Łukasz Osipiuk

> Regards,
>
> Patrick
>
> Łukasz Osipiuk wrote:
>>
>> On Tue, Mar 16, 2010 at 19:22, Patrick Hunt  wrote:
>>>
>>> Łukasz Osipiuk wrote:

 On Tue, Mar 16, 2010 at 17:18, Patrick Hunt  wrote:
>
> Can you verify that you are using 3.2.2 on all servers? You can do this
> by
> running the "stat" command against each of your servers and look at the
> very
> top of the output (we include the version of zk server there).
> http://bit.ly/dglVld

 Thanks for hint.

 We had 3.2.2 jars installed on all server - I checked it previously
 but apparently our script which restarts instance
 after installing .deb does not work very well and node2 and node3 were
 running version 3.2.1.
 Could using 3.2.1 be source of the problem?

>>> I heard this story where customer service for dell asks customers to
>>> replug
>>> in the mouse because it might be "a dusty connection". Invariably when
>>> ppl
>>> look at the back of the box they find that the mouse is not actually
>>> plugged
>>> in. Brilliant ideas from CS perspective. ;-)
>>>
>>> I think that issue was fixed in 3.2.1, so I don't think that's an issue
>>> (version).
>>
>> Damn :(
>>
> Are you using synchronous or async operations in your client?

 We are only using async operations.
>>>
>>> Ok. good info.
>>>
> I see snippets from the server logs, but I don't see anything from you
> client logs. It would really help if you could zip up all the logs
> (server
> and client) and provide them.

 We do not really have much logging on application side :(. Failed
 operations are logged with error code (ZSESSIONEXPIRED in our case).
 We also log session state changes but we noticed only changes between 1
 and 3.
 Unfortunately we do not have logs coming directly from zookeeper
 library as it logs to stderr/stdout and we do not store any of them at
 the moment.

>>> I see, in this case I think that the log messages theyselves are the
>>> problem, ie we are warning about this, but it's not a real problem. We
>>> warn
>>> on things that we don't like but can handle, we error if we
>>> don'tlike/can'thandle.
>>>
>>> MOVED basically means that a client initiated a request on one server,
>>> then
>>> moved to another server before the request could be processed.
>>>
>>> In your case you are using async requests, that can actually make this
>>> problem show up more (more "in flight" requests vs sync operation).
>>>
>>> So if you are not seeing MOVED on the client then this is fine. If you
>>> see
>>> them in the server log it means what I said above.
>>>
>>> That make sense, or am I missing something?
>>
>> It seems I was what I wrote was not clear. We are noticing errors on
>> client side. Any operation after
>> session gets into broken state fails with ZSESSIONMOVE error code (I
>> wrote ZSESSIONEXPIRED by mistake in previous post).
>> We get ZSESSIONMOVE from callback on rc parameter.
>>
>> Session gets broken after zookeeper library reconnects to cluster,
>> although most of reconnection leave session if working state.
>>
>> Regards, Łukasz
>>
>>
>>> Patrick
>>>
>>>
>>>
 Regards, Łukasz

> Patrick
>
> Łukasz Osipiuk wrote:
>>
>> not really - it happens occasionally - every few days :(
>> I believe it is somewhat connected with our network environment, which
>> suffers from some packet loss that leads to
>> connection timeouts.
>>
>> I can switch on some more logging if you can tell me which categories
>> are worth enabling DEBUG for.
>>
>> Regards, Łukasz Osipiuk
>>
>> On Tue, Mar 16, 2010 at 16:35, Benjamin Reed 
>> wrote:
>>>
>>> weird, this does sound like a bug. do you have a reliable way of
>>> reproducing
>>> the problem?
>>>
>>> thanx
>>> ben
>>>
>>> On 03/16/2010 08:27 AM, Łukasz Osipiuk wrote:

 nope.

 I always pass 0 as clientid.

 Łukasz

Re: permanent ZSESSIONMOVED

2010-03-16 Thread Patrick Hunt
We'll probably need the ZK server/client logs to hunt this down. Can you 
tell if the MOVED happens in some particular scenario, say you are 
connected to a follower and move to a leader, or perhaps you are 
connected to server A, get disconnected and reconnected to server A? 
Is there some pattern that could help us understand what's causing
this?


> The session gets broken after the zookeeper library reconnects to the
> cluster, although most reconnections leave the session in working state.

I'm not sure I follow this ^^^, could you restate it? (I get the first
part about the session being broken after the reconnect, but what's the
second part saying?)


Regards,

Patrick

Łukasz Osipiuk wrote:

On Tue, Mar 16, 2010 at 19:22, Patrick Hunt  wrote:

Łukasz Osipiuk wrote:

On Tue, Mar 16, 2010 at 17:18, Patrick Hunt  wrote:

Can you verify that you are using 3.2.2 on all servers? You can do this
by
running the "stat" command against each of your servers and look at the
very
top of the output (we include the version of zk server there).
http://bit.ly/dglVld

Thanks for the hint.

We had 3.2.2 jars installed on all servers - I checked it previously,
but apparently our script which restarts the instance
after installing the .deb does not work very well, and node2 and node3 were
running version 3.2.1.
Could using 3.2.1 be the source of the problem?


I heard this story where customer service for dell asks customers to replug
in the mouse because it might be "a dusty connection". Invariably when ppl
look at the back of the box they find that the mouse is not actually plugged
in. Brilliant ideas from CS perspective. ;-)

I think that issue was fixed in 3.2.1, so I don't think that's an issue
(version).


Damn :(


Are you using synchronous or async operations in your client?

We are only using async operations.

Ok. good info.


I see snippets from the server logs, but I don't see anything from your
client logs. It would really help if you could zip up all the logs
(server
and client) and provide them.

We do not really have much logging on the application side :(. Failed
operations are logged with an error code (ZSESSIONEXPIRED in our case).
We also log session state changes, but we noticed only changes between
states 1 and 3.
Unfortunately we do not have logs coming directly from the zookeeper
library, as it logs to stderr/stdout and we do not store any of that at
the moment.


I see. In this case I think that the log messages themselves are the
problem, i.e. we are warning about this, but it's not a real problem. We warn
on things that we don't like but can handle; we error on things we
don't like and can't handle.

MOVED basically means that a client initiated a request on one server, then
moved to another server before the request could be processed.

In your case you are using async requests, that can actually make this
problem show up more (more "in flight" requests vs sync operation).

So if you are not seeing MOVED on the client then this is fine. If you see
them in the server log it means what I said above.

Does that make sense, or am I missing something?


It seems what I wrote was not clear. We are noticing errors on the
client side. Once the session gets into the broken state, any operation
fails with the ZSESSIONMOVED error code (I wrote ZSESSIONEXPIRED by
mistake in the previous post).
We get ZSESSIONMOVED from the rc parameter of the callback.

The session gets broken after the zookeeper library reconnects to the
cluster, although most reconnections leave the session in working state.

Regards, Łukasz



Patrick




Regards, Łukasz


Patrick

Łukasz Osipiuk wrote:

not really - it happens occasionally - every few days :(
I believe it is somewhat connected with our network environment, which
suffers from some packet loss that leads to
connection timeouts.

I can switch on some more logging if you can tell me which categories
are worth enabling DEBUG for.

Regards, Łukasz Osipiuk

On Tue, Mar 16, 2010 at 16:35, Benjamin Reed 
wrote:

weird, this does sound like a bug. do you have a reliable way of
reproducing
the problem?

thanx
ben

On 03/16/2010 08:27 AM, Łukasz Osipiuk wrote:

nope.

I always pass 0 as clientid.

Łukasz

On Tue, Mar 16, 2010 at 16:20, Benjamin Reed
 wrote:


do you ever use zookeeper_init() with the clientid field set to
something
other than null?

ben

On 03/16/2010 07:43 AM, Łukasz Osipiuk wrote:


Hi everyone!

I am writing to this group because recently we are getting some
strange errors with our production zookeeper setup.

From time to time we are observing that our client application (C++
based) disconnects from zookeeper (session state is changed to 1) and
reconnects (state changed to 3).
This itself is not a problem - usually the application continues to run
without problems after the reconnect.
But from time to time, after the above happens, all subsequent operations
start to return the ZSESSIONMOVED error. To make it work again we have to
restart the application (which creates a new zookeeper session).

I noticed that 3.2.0 introduced a bug,
http://issues.apache.org/jira/browse/ZOOKEEPER-449 

Re: permanent ZSESSIONMOVED

2010-03-16 Thread Łukasz Osipiuk
On Tue, Mar 16, 2010 at 19:22, Patrick Hunt  wrote:
> Łukasz Osipiuk wrote:
>>
>> On Tue, Mar 16, 2010 at 17:18, Patrick Hunt  wrote:
>>>
>>> Can you verify that you are using 3.2.2 on all servers? You can do this
>>> by
>>> running the "stat" command against each of your servers and look at the
>>> very
>>> top of the output (we include the version of zk server there).
>>> http://bit.ly/dglVld
>>
>> Thanks for the hint.
>>
>> We had 3.2.2 jars installed on all servers - I checked it previously,
>> but apparently our script which restarts the instance
>> after installing the .deb does not work very well, and node2 and node3 were
>> running version 3.2.1.
>> Could using 3.2.1 be the source of the problem?
>>
>
> I heard this story where customer service for dell asks customers to replug
> in the mouse because it might be "a dusty connection". Invariably when ppl
> look at the back of the box they find that the mouse is not actually plugged
> in. Brilliant ideas from CS perspective. ;-)
>
> I think that issue was fixed in 3.2.1, so I don't think that's an issue
> (version).

Damn :(

>>> Are you using synchronous or async operations in your client?
>>
>> We are only using async operations.
>
> Ok. good info.
>
>>
>>> I see snippets from the server logs, but I don't see anything from your
>>> client logs. It would really help if you could zip up all the logs
>>> (server
>>> and client) and provide them.
>>
>> We do not really have much logging on the application side :(. Failed
>> operations are logged with an error code (ZSESSIONEXPIRED in our case).
>> We also log session state changes, but we noticed only changes between
>> states 1 and 3.
>> Unfortunately we do not have logs coming directly from the zookeeper
>> library, as it logs to stderr/stdout and we do not store any of that at
>> the moment.
>>
>
> I see. In this case I think that the log messages themselves are the
> problem, i.e. we are warning about this, but it's not a real problem. We warn
> on things that we don't like but can handle; we error on things we
> don't like and can't handle.
>
> MOVED basically means that a client initiated a request on one server, then
> moved to another server before the request could be processed.
>
> In your case you are using async requests, that can actually make this
> problem show up more (more "in flight" requests vs sync operation).
>
> So if you are not seeing MOVED on the client then this is fine. If you see
> them in the server log it means what I said above.
>
> Does that make sense, or am I missing something?

It seems what I wrote was not clear. We are noticing errors on the
client side. Once the session gets into the broken state, any operation
fails with the ZSESSIONMOVED error code (I wrote ZSESSIONEXPIRED by
mistake in the previous post).
We get ZSESSIONMOVED from the rc parameter of the callback.

The session gets broken after the zookeeper library reconnects to the
cluster, although most reconnections leave the session in working state.

Regards, Łukasz
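
A minimal sketch of how this looks from the C client's async API, where the
completion callback's rc parameter carries error codes such as ZSESSIONMOVED;
the path and function names here are only illustrative, not the actual
application code:

#include <stdio.h>
#include <zookeeper/zookeeper.h>

/* Completion for zoo_aget_children(): rc is ZOK on success, or an error
 * code such as ZSESSIONMOVED in the situation described above. */
static void children_completion(int rc, const struct String_vector *strings,
                                const void *data)
{
    const char *path = (const char *)data;
    (void)strings;
    if (rc == ZSESSIONMOVED)
        fprintf(stderr, "getChildren(%s) failed: session moved\n", path);
    else if (rc != ZOK)
        fprintf(stderr, "getChildren(%s) failed: %s\n", path, zerror(rc));
}

/* Somewhere in the application, with zh obtained from zookeeper_init(): */
void list_locks(zhandle_t *zh)
{
    zoo_aget_children(zh, "/locks/9871253", 0,
                      children_completion, "/locks/9871253");
}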


>
> Patrick
>
>
>
>> Regards, Łukasz
>>
>>> Patrick
>>>
>>> Łukasz Osipiuk wrote:

 not really - it happens occasionally - every few days :(
 I believe it is somewhat connected with our network environment, which
 suffers from some packet loss that leads to
 connection timeouts.

 I can switch on some more logging if you can tell me which categories
 are worth enabling DEBUG for.

 Regards, Łukasz Osipiuk

 On Tue, Mar 16, 2010 at 16:35, Benjamin Reed 
 wrote:
>
> weird, this does sound like a bug. do you have a reliable way of
> reproducing
> the problem?
>
> thanx
> ben
>
> On 03/16/2010 08:27 AM, Łukasz Osipiuk wrote:
>>
>> nope.
>>
>> I always pass 0 as clientid.
>>
>> Łukasz
>>
>> On Tue, Mar 16, 2010 at 16:20, Benjamin Reed
>>  wrote:
>>
>>> do you ever use zookeeper_init() with the clientid field set to
>>> something
>>> other than null?
>>>
>>> ben
>>>
>>> On 03/16/2010 07:43 AM, Łukasz Osipiuk wrote:
>>>
 Hi everyone!

 I am writing to this group because recently we are getting some
 strange errors with our production zookeeper setup.

 From time to time we are observing that our client application (C++
 based) disconnects from zookeeper (session state is changed to 1) and
 reconnects (state changed to 3).
 This itself is not a problem - usually the application continues to run
 without problems after the reconnect.
 But from time to time, after the above happens, all subsequent operations
 start to return the ZSESSIONMOVED error. To make it work again we have to
 restart the application (which creates a new zookeeper session).

 I noticed that 3.2.0 introduced a bug,
 http://issues.apache.org/jira/browse/ZOOKEEPER-449, but we are using
 zookeeper v. 3.2.2.
 I just noticed that the app at compile time used the 3.2.0 library but
 

Re: permanent ZSESSIONMOVED

2010-03-16 Thread Patrick Hunt

Łukasz Osipiuk wrote:

On Tue, Mar 16, 2010 at 17:18, Patrick Hunt  wrote:

Can you verify that you are using 3.2.2 on all servers? You can do this by
running the "stat" command against each of your servers and look at the very
top of the output (we include the version of zk server there).
http://bit.ly/dglVld


Thanks for the hint.

We had 3.2.2 jars installed on all servers - I checked it previously,
but apparently our script which restarts the instance
after installing the .deb does not work very well, and node2 and node3 were
running version 3.2.1.
Could using 3.2.1 be the source of the problem?



I heard this story where customer service for dell asks customers to 
replug in the mouse because it might be "a dusty connection". Invariably 
when ppl look at the back of the box they find that the mouse is not 
actually plugged in. Brilliant ideas from CS perspective. ;-)


I think that issue was fixed in 3.2.1, so I don't think that's an issue 
(version).



Are you using synchronous or async operations in your client?


We are only using async operations.


Ok. good info.




I see snippets from the server logs, but I don't see anything from your
client logs. It would really help if you could zip up all the logs (server
and client) and provide them.


We do not really have much logging on the application side :(. Failed
operations are logged with an error code (ZSESSIONEXPIRED in our case).
We also log session state changes, but we noticed only changes between states 1 and 3.
Unfortunately we do not have logs coming directly from the zookeeper
library, as it logs to stderr/stdout and we do not store any of that at
the moment.



I see. In this case I think that the log messages themselves are the
problem, i.e. we are warning about this, but it's not a real problem. We
warn on things that we don't like but can handle; we error on things we
don't like and can't handle.


MOVED basically means that a client initiated a request on one server, 
then moved to another server before the request could be processed.


In your case you are using async requests, that can actually make this 
problem show up more (more "in flight" requests vs sync operation).
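
To make the in-flight point concrete, a rough sketch (paths invented for
illustration) contrasting a blocking zoo_get_children() call, which has at
most one request outstanding, with a burst of zoo_aget_children() calls that
can all be pending when a reconnect happens:

#include <stdio.h>
#include <zookeeper/zookeeper.h>

static void on_children(int rc, const struct String_vector *strings,
                        const void *data)
{
    (void)strings;
    printf("async getChildren(%s) -> rc=%d\n", (const char *)data, rc);
}

void compare_sync_and_async(zhandle_t *zh)
{
    /* Synchronous: blocks, so only one request is in flight at a time. */
    struct String_vector children;
    if (zoo_get_children(zh, "/locks", 0, &children) == ZOK)
        deallocate_String_vector(&children);

    /* Asynchronous: all three requests can be in flight at once, so a
     * reconnect in the middle can affect several pending operations. */
    const char *paths[] = { "/locks/1", "/locks/2", "/locks/3" };
    for (int i = 0; i < 3; i++)
        zoo_aget_children(zh, paths[i], 0, on_children, paths[i]);
}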


So if you are not seeing MOVED on the client then this is fine. If you 
see them in the server log it means what I said above.


Does that make sense, or am I missing something?


Patrick




Regards, Łukasz


Patrick

Łukasz Osipiuk wrote:

not really - it happens occasionally - every few days :(
I believe it is somewhat connected with our network environment, which
suffers from some packet loss that leads to
connection timeouts.

I can switch on some more logging if you can tell me which categories
are worth enabling DEBUG for.

Regards, Łukasz Osipiuk

On Tue, Mar 16, 2010 at 16:35, Benjamin Reed  wrote:

weird, this does sound like a bug. do you have a reliable way of
reproducing
the problem?

thanx
ben

On 03/16/2010 08:27 AM, Łukasz Osipiuk wrote:

nope.

I always pass 0 as clientid.

Łukasz

On Tue, Mar 16, 2010 at 16:20, Benjamin Reed
 wrote:


do you ever use zookeeper_init() with the clientid field set to
something
other than null?

ben

On 03/16/2010 07:43 AM, Łukasz Osipiuk wrote:


Hi everyone!

I am writing to this group because recently we are getting some
strange errors with our production zookeeper setup.

From time to time we are observing that our client application (C++
based) disconnects from zookeeper (session state is changed to 1) and
reconnects (state changed to 3).
This itself is not a problem - usually the application continues to run
without problems after the reconnect.
But from time to time, after the above happens, all subsequent operations
start to return the ZSESSIONMOVED error. To make it work again we have to
restart the application (which creates a new zookeeper session).

I noticed that 3.2.0 introduced a bug,
http://issues.apache.org/jira/browse/ZOOKEEPER-449, but we are using
zookeeper v. 3.2.2.
I just noticed that the app at compile time used the 3.2.0 library but the
patches fixing bug 449 did not touch the C client lib, so I believe that
our problems are not
related to that.

In the zookeeper logs, at the moment which initiated the problem with the
client application, I have:

node1:
2010-03-16 14:21:43,510 - INFO
[NIOServerCxn.Factory:2181:nioserverc...@607] - Connected to
/10.1.112.61:37197 lastZxid 42992576502
2010-03-16 14:21:43,510 - INFO
[NIOServerCxn.Factory:2181:nioserverc...@636] - Renewing session
0x324dcc1ba580085
2010-03-16 14:21:49,443 - INFO
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:nioserverc...@992] - Finished init
of 0x324dcc1ba580085 valid:true
2010-03-16 14:21:49,443 - WARN
[NIOServerCxn.Factory:2181:nioserverc...@518] - Exception causing
close of session 0x324dcc1ba580085 due to java.io.IOException: Read
error
2010-03-16 14:21:49,444 - INFO
[NIOServerCxn.Factory:2181:nioserverc...@857] - closing
session:0x324dcc1ba580085 NIOServerCnxn:
java.nio.channels.SocketChannel[connected local=/10.1.112.62:2181
remote=/10.1.112.61:37197]

node2:
2010-03-16 14:21:40,580 - WARN
[NIOServerCxn.Factory:2181:nioserverc...@4

Re: permanent ZSESSIONMOVED

2010-03-16 Thread Łukasz Osipiuk
On Tue, Mar 16, 2010 at 17:18, Patrick Hunt  wrote:
> Can you verify that you are using 3.2.2 on all servers? You can do this by
> running the "stat" command against each of your servers and look at the very
> top of the output (we include the version of zk server there).
> http://bit.ly/dglVld

Thanks for the hint.

We had 3.2.2 jars installed on all servers - I checked it previously,
but apparently our script which restarts the instance
after installing the .deb does not work very well, and node2 and node3 were
running version 3.2.1.
Could using 3.2.1 be the source of the problem?

> Are you using synchronous or async operations in your client?

We are only using async operations.

> I see snippets from the server logs, but I don't see anything from your
> client logs. It would really help if you could zip up all the logs (server
> and client) and provide them.

We do not really have much logging on the application side :(. Failed
operations are logged with an error code (ZSESSIONEXPIRED in our case).
We also log session state changes, but we noticed only changes between states 1 and 3.
Unfortunately we do not have logs coming directly from the zookeeper
library, as it logs to stderr/stdout and we do not store any of that at
the moment.

Regards, Łukasz
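
As an aside, the C client's own log output can be captured rather than lost
on stderr/stdout; a minimal sketch, with the file name chosen arbitrarily
(zoo_set_log_stream() and zoo_set_debug_level() are the relevant calls):

#include <stdio.h>
#include <zookeeper/zookeeper.h>

/* Redirect the ZooKeeper C client library's logging to a file and raise
 * the verbosity, so reconnect/session messages are kept somewhere. */
void setup_zookeeper_client_logging(void)
{
    FILE *log = fopen("/var/log/myapp/zookeeper-client.log", "a");
    if (log != NULL)
        zoo_set_log_stream(log);
    zoo_set_debug_level(ZOO_LOG_LEVEL_DEBUG);
}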

> Patrick
>
> Łukasz Osipiuk wrote:
>>
>> not really - it happens occasionally - every few days :(
>> I believe it is somewhat connected with our network environment, which
>> suffers from some packet loss that leads to
>> connection timeouts.
>>
>> I can switch on some more logging if you can tell me which categories
>> are worth enabling DEBUG for.
>>
>> Regards, Łukasz Osipiuk
>>
>> On Tue, Mar 16, 2010 at 16:35, Benjamin Reed  wrote:
>>>
>>> weird, this does sound like a bug. do you have a reliable way of
>>> reproducing
>>> the problem?
>>>
>>> thanx
>>> ben
>>>
>>> On 03/16/2010 08:27 AM, Łukasz Osipiuk wrote:

 nope.

 I always pass 0 as clientid.

 Łukasz

 On Tue, Mar 16, 2010 at 16:20, Benjamin Reed
  wrote:

> do you ever use zookeeper_init() with the clientid field set to
> something
> other than null?
>
> ben
>
> On 03/16/2010 07:43 AM, Łukasz Osipiuk wrote:
>
>> Hi everyone!
>>
>> I am writing to this group because recently we are getting some
>> strange errors with our production zookeeper setup.
>>
>> From time to time we are observing that our client application (C++
>> based) disconnects from zookeeper (session state is changed to 1) and
>> reconnects (state changed to 3).
>> This itself is not a problem - usually the application continues to run
>> without problems after the reconnect.
>> But from time to time, after the above happens, all subsequent operations
>> start to return the ZSESSIONMOVED error. To make it work again we have to
>> restart the application (which creates a new zookeeper session).
>>
>> I noticed that 3.2.0 introduced a bug,
>> http://issues.apache.org/jira/browse/ZOOKEEPER-449, but we are using
>> zookeeper v. 3.2.2.
>> I just noticed that the app at compile time used the 3.2.0 library but the
>> patches fixing bug 449 did not touch the C client lib, so I believe that
>> our problems are not
>> related to that.
>>
>> In the zookeeper logs, at the moment which initiated the problem with the
>> client application, I have:
>>
>> node1:
>> 2010-03-16 14:21:43,510 - INFO
>> [NIOServerCxn.Factory:2181:nioserverc...@607] - Connected to
>> /10.1.112.61:37197 lastZxid 42992576502
>> 2010-03-16 14:21:43,510 - INFO
>> [NIOServerCxn.Factory:2181:nioserverc...@636] - Renewing session
>> 0x324dcc1ba580085
>> 2010-03-16 14:21:49,443 - INFO
>> [QuorumPeer:/0:0:0:0:0:0:0:0:2181:nioserverc...@992] - Finished init
>> of 0x324dcc1ba580085 valid:true
>> 2010-03-16 14:21:49,443 - WARN
>> [NIOServerCxn.Factory:2181:nioserverc...@518] - Exception causing
>> close of session 0x324dcc1ba580085 due to java.io.IOException: Read
>> error
>> 2010-03-16 14:21:49,444 - INFO
>> [NIOServerCxn.Factory:2181:nioserverc...@857] - closing
>> session:0x324dcc1ba580085 NIOServerCnxn:
>> java.nio.channels.SocketChannel[connected local=/10.1.112.62:2181
>> remote=/10.1.112.61:37197]
>>
>> node2:
>> 2010-03-16 14:21:40,580 - WARN
>> [NIOServerCxn.Factory:2181:nioserverc...@494] - Exception causing
>> close of session 0x324dcc1ba580085 due to java.io.IOException: Read
>> error
>> 2010-03-16 14:21:40,581 - INFO
>> [NIOServerCxn.Factory:2181:nioserverc...@833] - closing
>> session:0x324dcc1ba580085 NIOServerCnxn:
>> java.nio.channels.SocketChannel[connected local=/10.1.112.63:2181
>> remote=/10.1.112.61:60693]
>> 2010-03-16 14:21:46,839 - INFO
>> [NIOServerCxn.Factory:2181:nioserverc...@583] - Connected to
>> /10.1.112.61:48336 lastZxid 42992576502
>> 2010-03-16 14:21:46,839 - INFO
>> [NIOServerCxn.Factory:2181:nioserverc...@612] - Rene

Re: permanent ZSESSIONMOVED

2010-03-16 Thread Patrick Hunt
Can you verify that you are using 3.2.2 on all servers? You can do this 
by running the "stat" command against each of your servers and look at 
the very top of the output (we include the version of zk server there).

http://bit.ly/dglVld
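
For anyone who wants to script that check: "stat" is one of ZooKeeper's
plain-text four-letter commands on the client port, so a tiny program can
send it and print the reply (the address below is just an example taken from
the logs in this thread):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Connect to a ZooKeeper server, send the four-letter "stat" command and
 * print the response; the first line includes the server version. */
int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(2181);
    inet_pton(AF_INET, "10.1.112.62", &addr.sin_addr);

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("connect");
        return 1;
    }

    write(fd, "stat", 4);

    char buf[4096];
    ssize_t n;
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        fwrite(buf, 1, (size_t)n, stdout);

    close(fd);
    return 0;
}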

Are you using synchronous or async operations in your client?

I see snippets from the server logs, but I don't see anything from your
client logs. It would really help if you could zip up all the logs 
(server and client) and provide them.


Patrick

Łukasz Osipiuk wrote:

not really - it happens occasionally - every few days :(
I believe it is somewhat connected with our network environment, which
suffers from some packet loss that leads to
connection timeouts.

I can switch on some more logging if you can tell me which categories
are worth enabling DEBUG for.

Regards, Łukasz Osipiuk

On Tue, Mar 16, 2010 at 16:35, Benjamin Reed  wrote:

weird, this does sound like a bug. do you have a reliable way of reproducing
the problem?

thanx
ben

On 03/16/2010 08:27 AM, Łukasz Osipiuk wrote:

nope.

I always pass 0 as clientid.

Łukasz

On Tue, Mar 16, 2010 at 16:20, Benjamin Reed  wrote:


do you ever use zookeeper_init() with the clientid field set to something
other than null?

ben

On 03/16/2010 07:43 AM, Łukasz Osipiuk wrote:


Hi everyone!

I am writing to this group because recently we are getting some
strange errors with our production zookeeper setup.

From time to time we are observing that our client application (C++
based) disconnects from zookeeper (session state is changed to 1) and
reconnects (state changed to 3).
This itself is not a problem - usually the application continues to run
without problems after the reconnect.
But from time to time, after the above happens, all subsequent operations
start to return the ZSESSIONMOVED error. To make it work again we have to
restart the application (which creates a new zookeeper session).

I noticed that 3.2.0 introduced a bug,
http://issues.apache.org/jira/browse/ZOOKEEPER-449, but we are using
zookeeper v. 3.2.2.
I just noticed that the app at compile time used the 3.2.0 library but the
patches fixing bug 449 did not touch the C client lib, so I believe that
our problems are not
related to that.

In the zookeeper logs, at the moment which initiated the problem with the
client application, I have:

node1:
2010-03-16 14:21:43,510 - INFO
[NIOServerCxn.Factory:2181:nioserverc...@607] - Connected to
/10.1.112.61:37197 lastZxid 42992576502
2010-03-16 14:21:43,510 - INFO
[NIOServerCxn.Factory:2181:nioserverc...@636] - Renewing session
0x324dcc1ba580085
2010-03-16 14:21:49,443 - INFO
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:nioserverc...@992] - Finished init
of 0x324dcc1ba580085 valid:true
2010-03-16 14:21:49,443 - WARN
[NIOServerCxn.Factory:2181:nioserverc...@518] - Exception causing
close of session 0x324dcc1ba580085 due to java.io.IOException: Read
error
2010-03-16 14:21:49,444 - INFO
[NIOServerCxn.Factory:2181:nioserverc...@857] - closing
session:0x324dcc1ba580085 NIOServerCnxn:
java.nio.channels.SocketChannel[connected local=/10.1.112.62:2181
remote=/10.1.112.61:37197]

node2:
2010-03-16 14:21:40,580 - WARN
[NIOServerCxn.Factory:2181:nioserverc...@494] - Exception causing
close of session 0x324dcc1ba580085 due to java.io.IOException: Read
error
2010-03-16 14:21:40,581 - INFO
[NIOServerCxn.Factory:2181:nioserverc...@833] - closing
session:0x324dcc1ba580085 NIOServerCnxn:
java.nio.channels.SocketChannel[connected local=/10.1.112.63:2181
remote=/10.1.112.61:60693]
2010-03-16 14:21:46,839 - INFO
[NIOServerCxn.Factory:2181:nioserverc...@583] - Connected to
/10.1.112.61:48336 lastZxid 42992576502
2010-03-16 14:21:46,839 - INFO
[NIOServerCxn.Factory:2181:nioserverc...@612] - Renewing session
0x324dcc1ba580085
2010-03-16 14:21:49,439 - INFO
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:nioserverc...@964] - Finished init
of 0x324dcc1ba580085 valid:true

node3:
2010-03-16 02:14:48,961 - WARN
[NIOServerCxn.Factory:2181:nioserverc...@494] - Exception causing
close of session 0x324dcc1ba580085 due to java.io.IOException: Read
error
2010-03-16 02:14:48,962 - INFO
[NIOServerCxn.Factory:2181:nioserverc...@833] - closing
session:0x324dcc1ba580085 NIOServerCnxn:
java.nio.channels.SocketChannel[connected local=/10.1.112.64:2181
remote=/10.1.112.61:57309]

and then lots of entries like this
2010-03-16 02:14:54,696 - WARN
[ProcessThread:-1:preprequestproces...@402] - Got exception when
processing sessionid:0x324dcc1ba580085 type:create cxid:0x4b9e9e49
zxid:0xfffe txntype:unknown
/locks/9871253/lock-8589943989-
org.apache.zookeeper.KeeperException$SessionMovedException:
KeeperErrorCode = Session moved
at

org.apache.zookeeper.server.SessionTrackerImpl.checkSession(SessionTrackerImpl.java:231)
at

org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:211)
at

org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:114)
2010-03-16 14:22:06,428 - WARN
[ProcessThread:-1:preprequestproces...@402] - Got exception when
proces

Re: permanent ZSESSIONMOVED

2010-03-16 Thread Łukasz Osipiuk
not really - it happens occasionally - every few days :(
I believe it is somewhat connected with our network environment, which
suffers from some packet loss that leads to
connection timeouts.

I can switch on some more logging if you can tell me which categories
are worth enabling DEBUG for.

Regards, Łukasz Osipiuk

On Tue, Mar 16, 2010 at 16:35, Benjamin Reed  wrote:
> weird, this does sound like a bug. do you have a reliable way of reproducing
> the problem?
>
> thanx
> ben
>
> On 03/16/2010 08:27 AM, Łukasz Osipiuk wrote:
>>
>> nope.
>>
>> I always pass 0 as clientid.
>>
>> Łukasz
>>
>> On Tue, Mar 16, 2010 at 16:20, Benjamin Reed  wrote:
>>
>>>
>>> do you ever use zookeeper_init() with the clientid field set to something
>>> other than null?
>>>
>>> ben
>>>
>>> On 03/16/2010 07:43 AM, Łukasz Osipiuk wrote:
>>>

 Hi everyone!

 I am writing to this group because recently we are getting some
 strange errors with our production zookeeper setup.

 From time to time we are observing that our client application (C++
 based) disconnects from zookeeper (session state is changed to 1) and
 reconnects (state changed to 3).
 This itself is not a problem - usually the application continues to run
 without problems after the reconnect.
 But from time to time, after the above happens, all subsequent operations
 start to return the ZSESSIONMOVED error. To make it work again we have to
 restart the application (which creates a new zookeeper session).

 I noticed that 3.2.0 introduced a bug,
 http://issues.apache.org/jira/browse/ZOOKEEPER-449, but we are using
 zookeeper v. 3.2.2.
 I just noticed that the app at compile time used the 3.2.0 library but the
 patches fixing bug 449 did not touch the C client lib, so I believe that
 our problems are not
 related to that.

 In the zookeeper logs, at the moment which initiated the problem with the
 client application, I have:

 node1:
 2010-03-16 14:21:43,510 - INFO
 [NIOServerCxn.Factory:2181:nioserverc...@607] - Connected to
 /10.1.112.61:37197 lastZxid 42992576502
 2010-03-16 14:21:43,510 - INFO
 [NIOServerCxn.Factory:2181:nioserverc...@636] - Renewing session
 0x324dcc1ba580085
 2010-03-16 14:21:49,443 - INFO
 [QuorumPeer:/0:0:0:0:0:0:0:0:2181:nioserverc...@992] - Finished init
 of 0x324dcc1ba580085 valid:true
 2010-03-16 14:21:49,443 - WARN
 [NIOServerCxn.Factory:2181:nioserverc...@518] - Exception causing
 close of session 0x324dcc1ba580085 due to java.io.IOException: Read
 error
 2010-03-16 14:21:49,444 - INFO
 [NIOServerCxn.Factory:2181:nioserverc...@857] - closing
 session:0x324dcc1ba580085 NIOServerCnxn:
 java.nio.channels.SocketChannel[connected local=/10.1.112.62:2181
 remote=/10.1.112.61:37197]

 node2:
 2010-03-16 14:21:40,580 - WARN
 [NIOServerCxn.Factory:2181:nioserverc...@494] - Exception causing
 close of session 0x324dcc1ba580085 due to java.io.IOException: Read
 error
 2010-03-16 14:21:40,581 - INFO
 [NIOServerCxn.Factory:2181:nioserverc...@833] - closing
 session:0x324dcc1ba580085 NIOServerCnxn:
 java.nio.channels.SocketChannel[connected local=/10.1.112.63:2181
 remote=/10.1.112.61:60693]
 2010-03-16 14:21:46,839 - INFO
 [NIOServerCxn.Factory:2181:nioserverc...@583] - Connected to
 /10.1.112.61:48336 lastZxid 42992576502
 2010-03-16 14:21:46,839 - INFO
 [NIOServerCxn.Factory:2181:nioserverc...@612] - Renewing session
 0x324dcc1ba580085
 2010-03-16 14:21:49,439 - INFO
 [QuorumPeer:/0:0:0:0:0:0:0:0:2181:nioserverc...@964] - Finished init
 of 0x324dcc1ba580085 valid:true

 node3:
 2010-03-16 02:14:48,961 - WARN
 [NIOServerCxn.Factory:2181:nioserverc...@494] - Exception causing
 close of session 0x324dcc1ba580085 due to java.io.IOException: Read
 error
 2010-03-16 02:14:48,962 - INFO
 [NIOServerCxn.Factory:2181:nioserverc...@833] - closing
 session:0x324dcc1ba580085 NIOServerCnxn:
 java.nio.channels.SocketChannel[connected local=/10.1.112.64:2181
 remote=/10.1.112.61:57309]

 and then lots of entries like this
 2010-03-16 02:14:54,696 - WARN
 [ProcessThread:-1:preprequestproces...@402] - Got exception when
 processing sessionid:0x324dcc1ba580085 type:create cxid:0x4b9e9e49
 zxid:0xfffe txntype:unknown
 /locks/9871253/lock-8589943989-
 org.apache.zookeeper.KeeperException$SessionMovedException:
 KeeperErrorCode = Session moved
         at

 org.apache.zookeeper.server.SessionTrackerImpl.checkSession(SessionTrackerImpl.java:231)
         at

 org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:211)
         at

 org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:114)
 2010-03-16 14:22:06,428 - WARN
 [ProcessThread:-1:preprequestproces...@402] - Got excepti

Re: permanent ZSESSIONMOVED

2010-03-16 Thread Benjamin Reed
weird, this does sound like a bug. do you have a reliable way of 
reproducing the problem?


thanx
ben

On 03/16/2010 08:27 AM, Łukasz Osipiuk wrote:

nope.

I always pass 0 as clientid.

Łukasz

On Tue, Mar 16, 2010 at 16:20, Benjamin Reed  wrote:
   

do you ever use zookeeper_init() with the clientid field set to something
other than null?

ben

On 03/16/2010 07:43 AM, Łukasz Osipiuk wrote:
 

Hi everyone!

I am writing to this group because recently we are getting some
strange errors with our production zookeeper setup.

From time to time we are observing that our client application (C++
based) disconnects from zookeeper (session state is changed to 1) and
reconnects (state changed to 3).
This itself is not a problem - usually the application continues to run
without problems after the reconnect.
But from time to time, after the above happens, all subsequent operations
start to return the ZSESSIONMOVED error. To make it work again we have to
restart the application (which creates a new zookeeper session).

I noticed that 3.2.0 introduced a bug,
http://issues.apache.org/jira/browse/ZOOKEEPER-449, but we are using
zookeeper v. 3.2.2.
I just noticed that the app at compile time used the 3.2.0 library but the
patches fixing bug 449 did not touch the C client lib, so I believe that
our problems are not
related to that.

In the zookeeper logs, at the moment which initiated the problem with the
client application, I have:

node1:
2010-03-16 14:21:43,510 - INFO
[NIOServerCxn.Factory:2181:nioserverc...@607] - Connected to
/10.1.112.61:37197 lastZxid 42992576502
2010-03-16 14:21:43,510 - INFO
[NIOServerCxn.Factory:2181:nioserverc...@636] - Renewing session
0x324dcc1ba580085
2010-03-16 14:21:49,443 - INFO
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:nioserverc...@992] - Finished init
of 0x324dcc1ba580085 valid:true
2010-03-16 14:21:49,443 - WARN
[NIOServerCxn.Factory:2181:nioserverc...@518] - Exception causing
close of session 0x324dcc1ba580085 due to java.io.IOException: Read
error
2010-03-16 14:21:49,444 - INFO
[NIOServerCxn.Factory:2181:nioserverc...@857] - closing
session:0x324dcc1ba580085 NIOServerCnxn:
java.nio.channels.SocketChannel[connected local=/10.1.112.62:2181
remote=/10.1.112.61:37197]

node2:
2010-03-16 14:21:40,580 - WARN
[NIOServerCxn.Factory:2181:nioserverc...@494] - Exception causing
close of session 0x324dcc1ba580085 due to java.io.IOException: Read
error
2010-03-16 14:21:40,581 - INFO
[NIOServerCxn.Factory:2181:nioserverc...@833] - closing
session:0x324dcc1ba580085 NIOServerCnxn:
java.nio.channels.SocketChannel[connected local=/10.1.112.63:2181
remote=/10.1.112.61:60693]
2010-03-16 14:21:46,839 - INFO
[NIOServerCxn.Factory:2181:nioserverc...@583] - Connected to
/10.1.112.61:48336 lastZxid 42992576502
2010-03-16 14:21:46,839 - INFO
[NIOServerCxn.Factory:2181:nioserverc...@612] - Renewing session
0x324dcc1ba580085
2010-03-16 14:21:49,439 - INFO
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:nioserverc...@964] - Finished init
of 0x324dcc1ba580085 valid:true

node3:
2010-03-16 02:14:48,961 - WARN
[NIOServerCxn.Factory:2181:nioserverc...@494] - Exception causing
close of session 0x324dcc1ba580085 due to java.io.IOException: Read
error
2010-03-16 02:14:48,962 - INFO
[NIOServerCxn.Factory:2181:nioserverc...@833] - closing
session:0x324dcc1ba580085 NIOServerCnxn:
java.nio.channels.SocketChannel[connected local=/10.1.112.64:2181
remote=/10.1.112.61:57309]

and then lots of entries like this
2010-03-16 02:14:54,696 - WARN
[ProcessThread:-1:preprequestproces...@402] - Got exception when
processing sessionid:0x324dcc1ba580085 type:create cxid:0x4b9e9e49
zxid:0xfffe txntype:unknown
/locks/9871253/lock-8589943989-
org.apache.zookeeper.KeeperException$SessionMovedException:
KeeperErrorCode = Session moved
 at
org.apache.zookeeper.server.SessionTrackerImpl.checkSession(SessionTrackerImpl.java:231)
 at
org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:211)
 at
org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:114)
2010-03-16 14:22:06,428 - WARN
[ProcessThread:-1:preprequestproces...@402] - Got exception when
processing sessionid:0x324dcc1ba580085 type:create cxid:0x4b9f6603
zxid:0xfffe txntype:unknown
/locks/1665960/lock-8589961006-
org.apache.zookeeper.KeeperException$SessionMovedException:
KeeperErrorCode = Session moved
 at
org.apache.zookeeper.server.SessionTrackerImpl.checkSession(SessionTrackerImpl.java:231)
 at
org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:211)
 at
org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:114)


To work around the disconnections I am going to increase the session timeout
from 5 to 15 seconds, but even if it helps at all it is just a
workaround.

Do you have an idea where the source of my problem is?

Regards, Łukasz Osipiuk



Re: permanent ZSESSIONMOVED

2010-03-16 Thread Łukasz Osipiuk
nope.

I always pass 0 as clientid.

Łukasz

On Tue, Mar 16, 2010 at 16:20, Benjamin Reed  wrote:
> do you ever use zookeeper_init() with the clientid field set to something
> other than null?
>
> ben
>
> On 03/16/2010 07:43 AM, Łukasz Osipiuk wrote:
>>
>> Hi everyone!
>>
>> I am writing to this group because recently we are getting some
>> strange errors with our production zookeeper setup.
>>
>> From time to time we are observing that our client application (C++
>> based) disconnects from zookeeper (session state is changed to 1) and
>> reconnects (state changed to 3).
>> This itself is not a problem - usually the application continues to run
>> without problems after the reconnect.
>> But from time to time, after the above happens, all subsequent operations
>> start to return the ZSESSIONMOVED error. To make it work again we have to
>> restart the application (which creates a new zookeeper session).
>>
>> I noticed that 3.2.0 introduced a bug,
>> http://issues.apache.org/jira/browse/ZOOKEEPER-449, but we are using
>> zookeeper v. 3.2.2.
>> I just noticed that the app at compile time used the 3.2.0 library but the
>> patches fixing bug 449 did not touch the C client lib, so I believe that
>> our problems are not
>> related to that.
>>
>> In the zookeeper logs, at the moment which initiated the problem with the
>> client application, I have:
>>
>> node1:
>> 2010-03-16 14:21:43,510 - INFO
>> [NIOServerCxn.Factory:2181:nioserverc...@607] - Connected to
>> /10.1.112.61:37197 lastZxid 42992576502
>> 2010-03-16 14:21:43,510 - INFO
>> [NIOServerCxn.Factory:2181:nioserverc...@636] - Renewing session
>> 0x324dcc1ba580085
>> 2010-03-16 14:21:49,443 - INFO
>> [QuorumPeer:/0:0:0:0:0:0:0:0:2181:nioserverc...@992] - Finished init
>> of 0x324dcc1ba580085 valid:true
>> 2010-03-16 14:21:49,443 - WARN
>> [NIOServerCxn.Factory:2181:nioserverc...@518] - Exception causing
>> close of session 0x324dcc1ba580085 due to java.io.IOException: Read
>> error
>> 2010-03-16 14:21:49,444 - INFO
>> [NIOServerCxn.Factory:2181:nioserverc...@857] - closing
>> session:0x324dcc1ba580085 NIOServerCnxn:
>> java.nio.channels.SocketChannel[connected local=/10.1.112.62:2181
>> remote=/10.1.112.61:37197]
>>
>> node2:
>> 2010-03-16 14:21:40,580 - WARN
>> [NIOServerCxn.Factory:2181:nioserverc...@494] - Exception causing
>> close of session 0x324dcc1ba580085 due to java.io.IOException: Read
>> error
>> 2010-03-16 14:21:40,581 - INFO
>> [NIOServerCxn.Factory:2181:nioserverc...@833] - closing
>> session:0x324dcc1ba580085 NIOServerCnxn:
>> java.nio.channels.SocketChannel[connected local=/10.1.112.63:2181
>> remote=/10.1.112.61:60693]
>> 2010-03-16 14:21:46,839 - INFO
>> [NIOServerCxn.Factory:2181:nioserverc...@583] - Connected to
>> /10.1.112.61:48336 lastZxid 42992576502
>> 2010-03-16 14:21:46,839 - INFO
>> [NIOServerCxn.Factory:2181:nioserverc...@612] - Renewing session
>> 0x324dcc1ba580085
>> 2010-03-16 14:21:49,439 - INFO
>> [QuorumPeer:/0:0:0:0:0:0:0:0:2181:nioserverc...@964] - Finished init
>> of 0x324dcc1ba580085 valid:true
>>
>> node3:
>> 2010-03-16 02:14:48,961 - WARN
>> [NIOServerCxn.Factory:2181:nioserverc...@494] - Exception causing
>> close of session 0x324dcc1ba580085 due to java.io.IOException: Read
>> error
>> 2010-03-16 02:14:48,962 - INFO
>> [NIOServerCxn.Factory:2181:nioserverc...@833] - closing
>> session:0x324dcc1ba580085 NIOServerCnxn:
>> java.nio.channels.SocketChannel[connected local=/10.1.112.64:2181
>> remote=/10.1.112.61:57309]
>>
>> and then lots of entries like this
>> 2010-03-16 02:14:54,696 - WARN
>> [ProcessThread:-1:preprequestproces...@402] - Got exception when
>> processing sessionid:0x324dcc1ba580085 type:create cxid:0x4b9e9e49
>> zxid:0xfffe txntype:unknown
>> /locks/9871253/lock-8589943989-
>> org.apache.zookeeper.KeeperException$SessionMovedException:
>> KeeperErrorCode = Session moved
>>         at
>> org.apache.zookeeper.server.SessionTrackerImpl.checkSession(SessionTrackerImpl.java:231)
>>         at
>> org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:211)
>>         at
>> org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:114)
>> 2010-03-16 14:22:06,428 - WARN
>> [ProcessThread:-1:preprequestproces...@402] - Got exception when
>> processing sessionid:0x324dcc1ba580085 type:create cxid:0x4b9f6603
>> zxid:0xfffe txntype:unknown
>> /locks/1665960/lock-8589961006-
>> org.apache.zookeeper.KeeperException$SessionMovedException:
>> KeeperErrorCode = Session moved
>>         at
>> org.apache.zookeeper.server.SessionTrackerImpl.checkSession(SessionTrackerImpl.java:231)
>>         at
>> org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:211)
>>         at
>> org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:114)
>>
>>
>> To work around the disconnections I am going to increase the session
>> timeout from 5 to 15 seconds, but even if it helps at all it is just a
>> workaround.
>>
>> Do you have an idea where the source of my problem is

Re: permanent ZSESSIONMOVED

2010-03-16 Thread Benjamin Reed
do you ever use zookeeper_init() with the clientid field set to 
something other than null?


ben
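
For context, the clientid in question is the fourth argument of
zookeeper_init(); passing NULL (or 0) always asks for a brand-new session,
whereas passing a previously saved clientid_t re-attaches to an existing
session. A hedged sketch - the connect string, timeout and watcher are
placeholders, not the poster's actual code:

#include <stddef.h>
#include <zookeeper/zookeeper.h>

static void my_watcher(zhandle_t *zh, int type, int state,
                       const char *path, void *ctx)
{
    /* session and node events arrive here */
    (void)zh; (void)type; (void)state; (void)path; (void)ctx;
}

zhandle_t *connect_with_new_session(void)
{
    /* clientid == NULL: start a fresh session; 15000 ms is the requested
     * session timeout (the recv_timeout argument). */
    return zookeeper_init("10.1.112.62:2181,10.1.112.63:2181,10.1.112.64:2181",
                          my_watcher, 15000, NULL, NULL, 0);
}

zhandle_t *reconnect_to_same_session(zhandle_t *old)
{
    /* Reusing a saved clientid would instead re-attach to the old session. */
    const clientid_t *id = zoo_client_id(old);
    return zookeeper_init("10.1.112.62:2181,10.1.112.63:2181,10.1.112.64:2181",
                          my_watcher, 15000, id, NULL, 0);
}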

On 03/16/2010 07:43 AM, Łukasz Osipiuk wrote:

Hi everyone!

I am writing to this group because recently we are getting some
strange errors with our production zookeeper setup.

From time to time we are observing that our client application (C++
based) disconnects from zookeeper (session state is changed to 1) and
reconnects (state changed to 3).
This itself is not a problem - usually the application continues to run
without problems after the reconnect.
But from time to time, after the above happens, all subsequent operations
start to return the ZSESSIONMOVED error. To make it work again we have to
restart the application (which creates a new zookeeper session).

I noticed that 3.2.0 introduced a bug,
http://issues.apache.org/jira/browse/ZOOKEEPER-449, but we are using
zookeeper v. 3.2.2.
I just noticed that the app at compile time used the 3.2.0 library but the
patches fixing bug 449 did not touch the C client lib, so I believe that
our problems are not
related to that.

In the zookeeper logs, at the moment which initiated the problem with the
client application, I have:

node1:
2010-03-16 14:21:43,510 - INFO
[NIOServerCxn.Factory:2181:nioserverc...@607] - Connected to
/10.1.112.61:37197 lastZxid 42992576502
2010-03-16 14:21:43,510 - INFO
[NIOServerCxn.Factory:2181:nioserverc...@636] - Renewing session
0x324dcc1ba580085
2010-03-16 14:21:49,443 - INFO
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:nioserverc...@992] - Finished init
of 0x324dcc1ba580085 valid:true
2010-03-16 14:21:49,443 - WARN
[NIOServerCxn.Factory:2181:nioserverc...@518] - Exception causing
close of session 0x324dcc1ba580085 due to java.io.IOException: Read
error
2010-03-16 14:21:49,444 - INFO
[NIOServerCxn.Factory:2181:nioserverc...@857] - closing
session:0x324dcc1ba580085 NIOServerCnxn:
java.nio.channels.SocketChannel[connected local=/10.1.112.62:2181
remote=/10.1.112.61:37197]

node2:
2010-03-16 14:21:40,580 - WARN
[NIOServerCxn.Factory:2181:nioserverc...@494] - Exception causing
close of session 0x324dcc1ba580085 due to java.io.IOException: Read
error
2010-03-16 14:21:40,581 - INFO
[NIOServerCxn.Factory:2181:nioserverc...@833] - closing
session:0x324dcc1ba580085 NIOServerCnxn:
java.nio.channels.SocketChannel[connected local=/10.1.112.63:2181
remote=/10.1.112.61:60693]
2010-03-16 14:21:46,839 - INFO
[NIOServerCxn.Factory:2181:nioserverc...@583] - Connected to
/10.1.112.61:48336 lastZxid 42992576502
2010-03-16 14:21:46,839 - INFO
[NIOServerCxn.Factory:2181:nioserverc...@612] - Renewing session
0x324dcc1ba580085
2010-03-16 14:21:49,439 - INFO
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:nioserverc...@964] - Finished init
of 0x324dcc1ba580085 valid:true

node3:
2010-03-16 02:14:48,961 - WARN
[NIOServerCxn.Factory:2181:nioserverc...@494] - Exception causing
close of session 0x324dcc1ba580085 due to java.io.IOException: Read
error
2010-03-16 02:14:48,962 - INFO
[NIOServerCxn.Factory:2181:nioserverc...@833] - closing
session:0x324dcc1ba580085 NIOServerCnxn:
java.nio.channels.SocketChannel[connected local=/10.1.112.64:2181
remote=/10.1.112.61:57309]

and then lots of entries like this
2010-03-16 02:14:54,696 - WARN
[ProcessThread:-1:preprequestproces...@402] - Got exception when
processing sessionid:0x324dcc1ba580085 type:create cxid:0x4b9e9e49
zxid:0xfffe txntype:unknown
/locks/9871253/lock-8589943989-
org.apache.zookeeper.KeeperException$SessionMovedException:
KeeperErrorCode = Session moved
 at 
org.apache.zookeeper.server.SessionTrackerImpl.checkSession(SessionTrackerImpl.java:231)
 at 
org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:211)
 at 
org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:114)
2010-03-16 14:22:06,428 - WARN
[ProcessThread:-1:preprequestproces...@402] - Got exception when
processing sessionid:0x324dcc1ba580085 type:create cxid:0x4b9f6603
zxid:0xfffe txntype:unknown
/locks/1665960/lock-8589961006-
org.apache.zookeeper.KeeperException$SessionMovedException:
KeeperErrorCode = Session moved
 at 
org.apache.zookeeper.server.SessionTrackerImpl.checkSession(SessionTrackerImpl.java:231)
 at 
org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:211)
 at 
org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:114)


To work around the disconnections I am going to increase the session timeout
from 5 to 15 seconds, but even if it helps at all it is just a
workaround.

Do you have an idea where the source of my problem is?

Regards, Łukasz Osipiuk



permanent ZSESSIONMOVED

2010-03-16 Thread Łukasz Osipiuk
Hi everyone!

I am writing to this group because recently we are getting some
strange errors with our production zookeeper setup.

From time to time we are observing that our client application (C++
based) disconnects from zookeeper (session state is changed to 1) and
reconnects (state changed to 3).
This itself is not a problem - usually the application continues to run
without problems after the reconnect.
But from time to time, after the above happens, all subsequent operations
start to return the ZSESSIONMOVED error. To make it work again we have to
restart the application (which creates a new zookeeper session).

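The state numbers above correspond to the C client's session-state constants
(ZOO_CONNECTING_STATE is 1, ZOO_CONNECTED_STATE is 3). A minimal sketch of a
global watcher that logs these transitions, with the logging style purely
illustrative:

#include <stdio.h>
#include <zookeeper/zookeeper.h>

/* Global watcher passed to zookeeper_init(); only session events are
 * interesting here. ZOO_EXPIRED_SESSION_STATE would mean the session is
 * really gone and the handle must be recreated. */
static void session_watcher(zhandle_t *zh, int type, int state,
                            const char *path, void *ctx)
{
    (void)zh; (void)path; (void)ctx;
    if (type != ZOO_SESSION_EVENT)
        return;
    if (state == ZOO_CONNECTING_STATE)
        fprintf(stderr, "zookeeper session: disconnected, reconnecting (1)\n");
    else if (state == ZOO_CONNECTED_STATE)
        fprintf(stderr, "zookeeper session: connected (3)\n");
    else if (state == ZOO_EXPIRED_SESSION_STATE)
        fprintf(stderr, "zookeeper session: expired, reopen the handle\n");
    else
        fprintf(stderr, "zookeeper session: state=%d\n", state);
}
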
I noticed that 3.2.0 introduced a bug,
http://issues.apache.org/jira/browse/ZOOKEEPER-449, but we are using
zookeeper v. 3.2.2.
I just noticed that the app at compile time used the 3.2.0 library but the
patches fixing bug 449 did not touch the C client lib, so I believe that
our problems are not
related to that.

In the zookeeper logs, at the moment which initiated the problem with the
client application, I have:

node1:
2010-03-16 14:21:43,510 - INFO
[NIOServerCxn.Factory:2181:nioserverc...@607] - Connected to
/10.1.112.61:37197 lastZxid 42992576502
2010-03-16 14:21:43,510 - INFO
[NIOServerCxn.Factory:2181:nioserverc...@636] - Renewing session
0x324dcc1ba580085
2010-03-16 14:21:49,443 - INFO
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:nioserverc...@992] - Finished init
of 0x324dcc1ba580085 valid:true
2010-03-16 14:21:49,443 - WARN
[NIOServerCxn.Factory:2181:nioserverc...@518] - Exception causing
close of session 0x324dcc1ba580085 due to java.io.IOException: Read
error
2010-03-16 14:21:49,444 - INFO
[NIOServerCxn.Factory:2181:nioserverc...@857] - closing
session:0x324dcc1ba580085 NIOServerCnxn:
java.nio.channels.SocketChannel[connected local=/10.1.112.62:2181
remote=/10.1.112.61:37197]

node2:
2010-03-16 14:21:40,580 - WARN
[NIOServerCxn.Factory:2181:nioserverc...@494] - Exception causing
close of session 0x324dcc1ba580085 due to java.io.IOException: Read
error
2010-03-16 14:21:40,581 - INFO
[NIOServerCxn.Factory:2181:nioserverc...@833] - closing
session:0x324dcc1ba580085 NIOServerCnxn:
java.nio.channels.SocketChannel[connected local=/10.1.112.63:2181
remote=/10.1.112.61:60693]
2010-03-16 14:21:46,839 - INFO
[NIOServerCxn.Factory:2181:nioserverc...@583] - Connected to
/10.1.112.61:48336 lastZxid 42992576502
2010-03-16 14:21:46,839 - INFO
[NIOServerCxn.Factory:2181:nioserverc...@612] - Renewing session
0x324dcc1ba580085
2010-03-16 14:21:49,439 - INFO
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:nioserverc...@964] - Finished init
of 0x324dcc1ba580085 valid:true

node3:
2010-03-16 02:14:48,961 - WARN
[NIOServerCxn.Factory:2181:nioserverc...@494] - Exception causing
close of session 0x324dcc1ba580085 due to java.io.IOException: Read
error
2010-03-16 02:14:48,962 - INFO
[NIOServerCxn.Factory:2181:nioserverc...@833] - closing
session:0x324dcc1ba580085 NIOServerCnxn:
java.nio.channels.SocketChannel[connected local=/10.1.112.64:2181
remote=/10.1.112.61:57309]

and then lots of entries like this
2010-03-16 02:14:54,696 - WARN
[ProcessThread:-1:preprequestproces...@402] - Got exception when
processing sessionid:0x324dcc1ba580085 type:create cxid:0x4b9e9e49
zxid:0xfffe txntype:unknown
/locks/9871253/lock-8589943989-
org.apache.zookeeper.KeeperException$SessionMovedException:
KeeperErrorCode = Session moved
at 
org.apache.zookeeper.server.SessionTrackerImpl.checkSession(SessionTrackerImpl.java:231)
at 
org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:211)
at 
org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:114)
2010-03-16 14:22:06,428 - WARN
[ProcessThread:-1:preprequestproces...@402] - Got exception when
processing sessionid:0x324dcc1ba580085 type:create cxid:0x4b9f6603
zxid:0xfffe txntype:unknown
/locks/1665960/lock-8589961006-
org.apache.zookeeper.KeeperException$SessionMovedException:
KeeperErrorCode = Session moved
at 
org.apache.zookeeper.server.SessionTrackerImpl.checkSession(SessionTrackerImpl.java:231)
at 
org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:211)
at 
org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:114)


To work around the disconnections I am going to increase the session timeout
from 5 to 15 seconds, but even if it helps at all it is just a
workaround.

Do you have an idea where the source of my problem is?

Regards, Łukasz Osipiuk



-- 
-- 
Łukasz Osipiuk
mailto:luk...@osipiuk.net