We've identified the problem and are working on a fix. Full details here:

https://issues.apache.org/jira/browse/ZOOKEEPER-710

Patrick

Łukasz Osipiuk wrote:
Hello Mahadev,

2010/3/18 Mahadev Konar <maha...@yahoo-inc.com>:
Hi Lukasz,
 I looked at the logs and haven't been able to determine how the session
move can happen.


 You should go ahead and open a jira for this! Please attach the logs to
that jira. Mark it for 3.3.0 until we can say that it isn't a problem as of
now.

I created an issue for this:
https://issues.apache.org/jira/browse/ZOOKEEPER-710. I marked it as
3.2.2, as this is the version we are using.

Also, please go ahead and attach non-filtered logs to the jira for the same
timeline as you had done earlier (by non-filtered I mean not grepping for
the session id). Also, do attach the earlier logs you had emailed.

I attached logs in jira.

Also, may I know what function are you using to print that statement


Mar 16 22:09:01 ggmail-c1-mbox-1 ggmail_mbox[7976]: ERROR: ZooKeeper:
ZooKeeper::getChildren() path='/ggmboxlocks/16314'; callback error='session
moved to another server, so operation is ignored'

info added to jira

(If you can cut and paste that code on the jira that you create, it would be
helpful!)...

Thanks
mahadev


On 3/17/10 2:56 AM, "Łukasz Osipiuk" <luk...@osipiuk.net> wrote:

Ok.

I analysed some logs from the time when the problem last occurred.

brokensession.txt - logs concerning the session which had the problem
leader.txt - logs concerning leader election (I grepped for Leader in
zookeeper.log)

Some observations below:
- The network glitch which resulted in the problem occurred at about 22:08.
- From what I can see, node2 was the leader since 17:48 and it did not
change later that day.
- The client was connected to node2 since 17:50.
- At around 22:09 the client tried to connect to every node (1, 2, 3).
Connections to node1 and node3 were closed
  with the exception "Exception causing close of session 0x22767e1c9630000
due to java.io.IOException: Read error".
  The connection to node2 stayed alive.
- All subsequent operations were refused with the ZSESSIONMOVED error.
The error is visible both on the client and on the server side.

Some more clarifications concerning questions from different posts:

1. By saying we have huge packet drop, I mean huge for a local network.
We are still investigating its source. From time to time we observe
that machines lose contact with the others for a few seconds.

2. We do not use any fancy network tricks. All machines are in a single
vlan and use permanently assigned IP addresses.

Let me know if there is anything more that would help.

PS. Should I already create a JIRA issue for this, or is it not concrete enough yet?

Regards, Łukasz

On Tue, Mar 16, 2010 at 20:39, Patrick Hunt <ph...@apache.org> wrote:
Yeah, that's great. (No complaints on my end, I just wasn't sure what you
meant and wanted to make sure I was clear.)

Can you identify some sort of pattern? We're discussing it on our end, but this
is the first time we've heard of such an issue (outside the 3.2.0 bug) and we're
looking for a direction to hunt... anything you could add would help.

Is it possible for you to upgrade your client/server to 3.2.2 and report the
next failure, including any server/client logs if they are available? You
should create a JIRA for this in order to track the details as they emerge.
(incl attaching the logs, etc...)

Patrick

Łukasz Osipiuk wrote:
On Tue, Mar 16, 2010 at 20:05, Patrick Hunt <ph...@apache.org> wrote:
We'll probably need the ZK server/client logs to hunt this down. Can you
tell if the MOVED happens in some particular scenario, say you are
connected to a follower and move to a leader, or perhaps you are connected
to server A, get disconnected and reconnected to server A? .... is there
some pattern that could help us understand what's causing this?

Session gets broken after the zookeeper library reconnects to the cluster,
although most reconnections leave the session in working state.
I'm not sure I follow this ^^^, could you restate it? (I get the first
part, about the session being broken after reconnect; what's the second
part saying?)
Sorry for my English.

I meant that it only happens from time to time. We get reconnects
quite often due to large packet loss in our network, but most of the
time they do not break the session.

Is it clear now?

Regards, Łukasz Osipiuk

Regards,

Patrick

Łukasz Osipiuk wrote:
On Tue, Mar 16, 2010 at 19:22, Patrick Hunt <ph...@apache.org> wrote:
Łukasz Osipiuk wrote:
On Tue, Mar 16, 2010 at 17:18, Patrick Hunt <ph...@apache.org> wrote:
Can you verify that you are using 3.2.2 on all servers? You can do this by
running the "stat" command against each of your servers and looking at the
very top of the output (we include the version of the zk server there).
http://bit.ly/dglVld
Thanks for the hint.

We had the 3.2.2 jars installed on all servers - I checked it previously -
but apparently our script which restarts the instance after installing
the .deb does not work very well, and node2 and node3 were running
version 3.2.1.
Could using 3.2.1 be the source of the problem?

I heard this story where customer service for Dell asks customers to replug
the mouse because it might be "a dusty connection". Invariably, when people
look at the back of the box, they find that the mouse is not actually
plugged in. A brilliant idea from a CS perspective. ;-)

I think that issue was fixed in 3.2.1, so I don't think the version is
the problem.
Damn :(

Are you using synchronous or async operations in your client?
We are only using async operations.
Ok. good info.

I see snippets from the server logs, but I don't see anything from your
client logs. It would really help if you could zip up all the logs (server
and client) and provide them.
We do not really have much logging on the application side :(. Failed
operations are logged with an error code (ZSESSIONEXPIRED in our case).
We also log session state changes, but we noticed only changes between 1
and 3.
Unfortunately we do not have logs coming directly from the zookeeper
library, as it logs to stderr/stdout and we do not store either of them at
the moment.

I see; in this case I think that the log messages themselves are the
problem, i.e. we are warning about this, but it's not a real problem. We
warn on things that we don't like but can handle; we error if we
don't like / can't handle.

MOVED basically means that a client initiated a request on one server,
then moved to another server before the request could be processed.

In your case you are using async requests; that can actually make this
problem show up more (more "in flight" requests vs sync operation).

So if you are not seeing MOVED on the client then this is fine. If you
see them in the server log it means what I said above.

Does that make sense, or am I missing something?
It seems what I wrote was not clear. We are noticing errors on the
client side. Any operation after the session gets into the broken state
fails with the ZSESSIONMOVED error code (I wrote ZSESSIONEXPIRED by
mistake in my previous post).
We get ZSESSIONMOVED from the rc parameter of the callback.

The session gets broken after the zookeeper library reconnects to the
cluster, although most reconnections leave the session in a working state.

Regards, Łukasz


Patrick



Regards, Łukasz

Patrick

Łukasz Osipiuk wrote:
Not really - it happens occasionally - every few days :(
I believe it is somewhat connected with our network environment, which
suffers from some packet loss that leads to connection timeouts.

I can switch on some more logging if you can point me to which categories
are worth enabling DEBUG for.

Regards, Łukasz Osipiuk

On Tue, Mar 16, 2010 at 16:35, Benjamin Reed <br...@yahoo-inc.com>
wrote:
weird, this does sound like a bug. do you have a reliable way of
reproducing the problem?

thanx
ben

On 03/16/2010 08:27 AM, Łukasz Osipiuk wrote:
nope.

I always pass 0 as clientid.

Łukasz

On Tue, Mar 16, 2010 at 16:20, Benjamin Reed <br...@yahoo-inc.com> wrote:

do you ever use zookeeper_init() with the clientid field set to something
other than null?

ben

On 03/16/2010 07:43 AM, Łukasz Osipiuk wrote:

Hi everyone!

I am writing to this group because recently we have been getting some
strange errors with our production zookeeper setup.

From time to time we observe that our client application (C++ based)
disconnects from zookeeper (session state is changed to 1) and
reconnects (state changed to 3).
This itself is not a problem - usually the application continues to run
without problems after the reconnect.
But from time to time, after the above happens, all subsequent operations
start to return the ZSESSIONMOVED error. To make it work again we have to
restart the application (which creates a new zookeeper session).

I noticed that 3.2.0 introduced a bug
(http://issues.apache.org/jira/browse/ZOOKEEPER-449), but we are using
zookeeper v. 3.2.2.
I just noticed that the app at compile time used the 3.2.0 library, but
the patches fixing bug 449 did not touch the C client lib, so I believe
that our problems are not related to that.

In the zookeeper logs, at the moment which initiated the problem with the
client application, I have:

node1:
2010-03-16 14:21:43,510 - INFO [NIOServerCxn.Factory:2181:nioserverc...@607] - Connected to /10.1.112.61:37197 lastZxid 42992576502
2010-03-16 14:21:43,510 - INFO [NIOServerCxn.Factory:2181:nioserverc...@636] - Renewing session 0x324dcc1ba580085
2010-03-16 14:21:49,443 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:nioserverc...@992] - Finished init of 0x324dcc1ba580085 valid:true
2010-03-16 14:21:49,443 - WARN [NIOServerCxn.Factory:2181:nioserverc...@518] - Exception causing close of session 0x324dcc1ba580085 due to java.io.IOException: Read error
2010-03-16 14:21:49,444 - INFO [NIOServerCxn.Factory:2181:nioserverc...@857] - closing session:0x324dcc1ba580085 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.1.112.62:2181 remote=/10.1.112.61:37197]

node2:
2010-03-16 14:21:40,580 - WARN [NIOServerCxn.Factory:2181:nioserverc...@494] - Exception causing close of session 0x324dcc1ba580085 due to java.io.IOException: Read error
2010-03-16 14:21:40,581 - INFO [NIOServerCxn.Factory:2181:nioserverc...@833] - closing session:0x324dcc1ba580085 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.1.112.63:2181 remote=/10.1.112.61:60693]
2010-03-16 14:21:46,839 - INFO [NIOServerCxn.Factory:2181:nioserverc...@583] - Connected to /10.1.112.61:48336 lastZxid 42992576502
2010-03-16 14:21:46,839 - INFO [NIOServerCxn.Factory:2181:nioserverc...@612] - Renewing session 0x324dcc1ba580085
2010-03-16 14:21:49,439 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:nioserverc...@964] - Finished init of 0x324dcc1ba580085 valid:true

node3:
2010-03-16 02:14:48,961 - WARN [NIOServerCxn.Factory:2181:nioserverc...@494] - Exception causing close of session 0x324dcc1ba580085 due to java.io.IOException: Read error
2010-03-16 02:14:48,962 - INFO [NIOServerCxn.Factory:2181:nioserverc...@833] - closing session:0x324dcc1ba580085 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.1.112.64:2181 remote=/10.1.112.61:57309]

and then lots of entries like this:
2010-03-16 02:14:54,696 - WARN [ProcessThread:-1:preprequestproces...@402] - Got exception when processing sessionid:0x324dcc1ba580085 type:create cxid:0x4b9e9e49 zxid:0xfffffffffffffffe txntype:unknown /locks/9871253/lock-8589943989-
org.apache.zookeeper.KeeperException$SessionMovedException: KeeperErrorCode = Session moved
    at org.apache.zookeeper.server.SessionTrackerImpl.checkSession(SessionTrackerImpl.java:231)
    at org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:211)
    at org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:114)
2010-03-16 14:22:06,428 - WARN [ProcessThread:-1:preprequestproces...@402] - Got exception when processing sessionid:0x324dcc1ba580085 type:create cxid:0x4b9f6603 zxid:0xfffffffffffffffe txntype:unknown /locks/1665960/lock-8589961006-
org.apache.zookeeper.KeeperException$SessionMovedException: KeeperErrorCode = Session moved
    at org.apache.zookeeper.server.SessionTrackerImpl.checkSession(SessionTrackerImpl.java:231)
    at org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepRequestProcessor.java:211)
    at org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestProcessor.java:114)


To work around the disconnections I am going to increase the session
timeout from 5 to 15 seconds, but even if that helps, it is just a
workaround.

Do you have an idea where the source of my problem is?

Regards, Łukasz Osipiuk
