Hello Mahadev,

2010/3/18 Mahadev Konar <maha...@yahoo-inc.com>:
> Hi Lukasz,
>  I looked at the logs and havent been able to determine how the session
> move can happen.
>
>
>  You should go ahead and open a jira for this! Please attach the logs to
> that jira. Mark it for 3.3.0 until we can say that it isnt a problem as of
> now.

I created issue for this
https://issues.apache.org/jira/browse/ZOOKEEPER-710. I marked it as
3.2.2 as this is version we are using.

> Also, please go ahead and attach non filtered logs to the jira for the same
> timeline as you had done earlier (by non filtered I mean not grepping for
> the session id). Also, do attach the earlier logs you had emailed.

I attached logs in jira.

> Also, may I know what function are you using to print that statement
>
>
> Mar 16 22:09:01 ggmail-c1-mbox-1 ggmail_mbox[7976]: ERROR: ZooKeeper:
> ZooKeeper::getChildren() path='/ggmboxlocks/16314'; callback error='session
> moved to another server, so operation is ignored'

info added to jira

>
> (If you can cut and paste that code on the jira that you create, it would be
> helpful!)...
>
> Thanks
> mahadev
>
>
> On 3/17/10 2:56 AM, "Łukasz Osipiuk" <luk...@osipiuk.net> wrote:
>
>> Ok.
>>
>> I analysed some logs from time when problem last occurred.
>>
>> brokensession.txt - logs concerning session which had a problem
>> leader.txt - logs concerning leader election (i grepped for Leader in
>> zookeeper.log)
>>
>> Some observations below
>> - Network glitch which resulted in problem occurred at about 22:08.
>> - From what I see since 17:48 node2 was the leader and it did not
>> change later yesterday.
>> - Client was connected to node2 since 17:50
>> - At around 22:09 client tried to connect to every node (1,2,3).
>> Connections to node1 and node3 were closed
>>   with exception "Exception causing close of session 0x22767e1c9630000
>> due to java.io.IOException: Read error".
>>   Connection to node2 stood alive.
>> - All subsequent operations were refused with ZSESSIONMOVED error.
>> Error visible both on client and on server side.
>>
>> Some more clarifications concerning questions from different posts:
>>
>> 1. Saying we have huge packet drop I mean we have huge packet drop for
>> a local network. We are still investigating source of it. From time to
>> time we observe that machines loose contact with others for few
>> seconds.
>>
>> 2. We do not use any fancy network tricks. All machines are in single
>> vlan and use permanently assigned IP addresses.
>>
>> Let me know if anything more can help.
>>
>> PS. Should I already create JIRA issue for this or is not concrete enough?
>>
>> Regards, Łukasz
>>
>> On Tue, Mar 16, 2010 at 20:39, Patrick Hunt <ph...@apache.org> wrote:
>>> Yea, that's great. (no complaints on my end, just wasn't sure what you
>>> meant, wanted to make sure I was clear.).
>>>
>>> Can you identify some sort of pattern? We're discussing on our end, but this
>>> is the first time we've heard of such an issue (outside 3.2.0 bug) and we're
>>> looking for a direction to hunt... anything you could add would help.
>>>
>>> Is it possible for you to upgrade your client/server to 3.2.2 and report the
>>> next failure, including any server/client logs if they are available? You
>>> should create a JIRA for this in order to track the details as they emerge.
>>> (incl attaching the logs, etc...)
>>>
>>> Patrick
>>>
>>> Łukasz Osipiuk wrote:
>>>>
>>>> On Tue, Mar 16, 2010 at 20:05, Patrick Hunt <ph...@apache.org> wrote:
>>>>>
>>>>> We'll probably need the ZK server/client logs to hunt this down. Can you
>>>>> tell if the MOVED happens in some particular scenario, say you are
>>>>> connected
>>>>> to a follower and move to a leader, or perhaps you are connected to
>>>>> server
>>>>> A, get disconnected and reconnected to server A? .... is there some
>>>>> pattern
>>>>> that could help us understand what's causing this?
>>>>>
>>>>>> Session gets broken after zookeeper library reconnects to cluster,
>>>>>> although most of reconnection leave session if working state.
>>>>>
>>>>> I'm not sure I follow this ^^^, could you restate it? (I get the first
>>>>> about
>>>>> about the session is broken after reconnect, what's the second part
>>>>> saying?)
>>>>
>>>> Sorry for my English.
>>>>
>>>> I meant that it only happens from time to time. We get reconnects
>>>> quite often due to large packet loss in our network, but most of the
>>>> time they do not break the session.
>>>>
>>>> Is it clear now?
>>>>
>>>> Regards, Łukasz Osipiuk
>>>>
>>>>> Regards,
>>>>>
>>>>> Patrick
>>>>>
>>>>> Łukasz Osipiuk wrote:
>>>>>>
>>>>>> On Tue, Mar 16, 2010 at 19:22, Patrick Hunt <ph...@apache.org> wrote:
>>>>>>>
>>>>>>> Łukasz Osipiuk wrote:
>>>>>>>>
>>>>>>>> On Tue, Mar 16, 2010 at 17:18, Patrick Hunt <ph...@apache.org> wrote:
>>>>>>>>>
>>>>>>>>> Can you verify that you are using 3.2.2 on all servers? You can do
>>>>>>>>> this
>>>>>>>>> by
>>>>>>>>> running the "stat" command against each of your servers and look at
>>>>>>>>> the
>>>>>>>>> very
>>>>>>>>> top of the output (we include the version of zk server there).
>>>>>>>>> http://bit.ly/dglVld
>>>>>>>>
>>>>>>>> Thanks for hint.
>>>>>>>>
>>>>>>>> We had 3.2.2 jars installed on all server - I checked it previously
>>>>>>>> but apparently our script which restarts instance
>>>>>>>> after installing .deb does not work very well and node2 and node3 were
>>>>>>>> running version 3.2.1.
>>>>>>>> Could using 3.2.1 be source of the problem?
>>>>>>>>
>>>>>>> I heard this story where customer service for dell asks customers to
>>>>>>> replug
>>>>>>> in the mouse because it might be "a dusty connection". Invariably when
>>>>>>> ppl
>>>>>>> look at the back of the box they find that the mouse is not actually
>>>>>>> plugged
>>>>>>> in. Brilliant ideas from CS perspective. ;-)
>>>>>>>
>>>>>>> I think that issue was fixed in 3.2.1, so I don't think that's an issue
>>>>>>> (version).
>>>>>>
>>>>>> Damn :(
>>>>>>
>>>>>>>>> Are you using synchronous or async operations in your client?
>>>>>>>>
>>>>>>>> We are only using async operations.
>>>>>>>
>>>>>>> Ok. good info.
>>>>>>>
>>>>>>>>> I see snippets from the server logs, but I don't see anything from
>>>>>>>>> you
>>>>>>>>> client logs. It would really help if you could zip up all the logs
>>>>>>>>> (server
>>>>>>>>> and client) and provide them.
>>>>>>>>
>>>>>>>> We do not really have much logging on application side :(. Failed
>>>>>>>> operations are logged with error code (ZSESSIONEXPIRED in our case).
>>>>>>>> We also log session state changes but we noticed only changes between
>>>>>>>> 1
>>>>>>>> and 3.
>>>>>>>> Unfortunately we do not have logs coming directly from zookeeper
>>>>>>>> library as it logs to stderr/stdout and we do not store any of them at
>>>>>>>> the moment.
>>>>>>>>
>>>>>>> I see, in this case I think that the log messages theyselves are the
>>>>>>> problem, ie we are warning about this, but it's not a real problem. We
>>>>>>> warn
>>>>>>> on things that we don't like but can handle, we error if we
>>>>>>> don'tlike/can'thandle.
>>>>>>>
>>>>>>> MOVED basically means that a client initiated a request on one server,
>>>>>>> then
>>>>>>> moved to another server before the request could be processed.
>>>>>>>
>>>>>>> In your case you are using async requests, that can actually make this
>>>>>>> problem show up more (more "in flight" requests vs sync operation).
>>>>>>>
>>>>>>> So if you are not seeing MOVED on the client then this is fine. If you
>>>>>>> see
>>>>>>> them in the server log it means what I said above.
>>>>>>>
>>>>>>> That make sense, or am I missing something?
>>>>>>
>>>>>> It seems I was what I wrote was not clear. We are noticing errors on
>>>>>> client side. Any operation after
>>>>>> session gets into broken state fails with ZSESSIONMOVE error code (I
>>>>>> wrote ZSESSIONEXPIRED by mistake in previous post).
>>>>>> We get ZSESSIONMOVE from callback on rc parameter.
>>>>>>
>>>>>> Session gets broken after zookeeper library reconnects to cluster,
>>>>>> although most of reconnection leave session if working state.
>>>>>>
>>>>>> Regards, Łukasz
>>>>>>
>>>>>>
>>>>>>> Patrick
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> Regards, Łukasz
>>>>>>>>
>>>>>>>>> Patrick
>>>>>>>>>
>>>>>>>>> Łukasz Osipiuk wrote:
>>>>>>>>>>
>>>>>>>>>> not really - it happens occasionally - every few days :(
>>>>>>>>>> I believe it is somewhat connected with our network environment
>>>>>>>>>> which
>>>>>>>>>> suffers from some packet loss which leads to
>>>>>>>>>> connection timeouts.
>>>>>>>>>>
>>>>>>>>>> I can switch on some more logging if you can lead me which
>>>>>>>>>> categories
>>>>>>>>>> are worth to enable DEBUG for them.
>>>>>>>>>>
>>>>>>>>>> Regards, Łukasz Osipiuk
>>>>>>>>>>
>>>>>>>>>> On Tue, Mar 16, 2010 at 16:35, Benjamin Reed <br...@yahoo-inc.com>
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> weird, this does sound like a bug. do you have a reliable way of
>>>>>>>>>>> reproducing
>>>>>>>>>>> the problem?
>>>>>>>>>>>
>>>>>>>>>>> thanx
>>>>>>>>>>> ben
>>>>>>>>>>>
>>>>>>>>>>> On 03/16/2010 08:27 AM, Łukasz Osipiuk wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> nope.
>>>>>>>>>>>>
>>>>>>>>>>>> I always pass 0 as clientid.
>>>>>>>>>>>>
>>>>>>>>>>>> Łukasz
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Mar 16, 2010 at 16:20, Benjamin Reed<br...@yahoo-inc.com>
>>>>>>>>>>>>  wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> do you ever use zookeeper_init() with the clientid field set to
>>>>>>>>>>>>> something
>>>>>>>>>>>>> other than null?
>>>>>>>>>>>>>
>>>>>>>>>>>>> ben
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 03/16/2010 07:43 AM, Łukasz Osipiuk wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi everyone!
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am writing to this group because recently we are getting some
>>>>>>>>>>>>> strange errors with our production zookeeper setup.
>>>>>>>>>>>>>
>>>>>>>>>>>>>  From time to time we are observing that our client application
>>>>>>>>>>>>> (C++
>>>>>>>>>>>>> based) disconnects from zookeeper (session state is changed to
>>>>>>>>>>>>> 1)
>>>>>>>>>>>>> and
>>>>>>>>>>>>> reconnects (state changed to 3).
>>>>>>>>>>>>> This itself is not a problem - usually application continues to
>>>>>>>>>>>>> run
>>>>>>>>>>>>> without problems after reconnect.
>>>>>>>>>>>>> But from time to time after above happens all subsequent
>>>>>>>>>>>>> operations
>>>>>>>>>>>>> start to return ZSESSIONMOVED error. To make it work again we
>>>>>>>>>>>>> have
>>>>>>>>>>>>> to
>>>>>>>>>>>>> restart application (which creates new zookeeper session).
>>>>>>>>>>>>>
>>>>>>>>>>>>> I noticed that in 3.2.0 introduced a bug
>>>>>>>>>>>>> http://issues.apache.org/jira/browse/ZOOKEEPER-449 but we are
>>>>>>>>>>>>> using
>>>>>>>>>>>>> zookeeper v. 3.2.2.
>>>>>>>>>>>>> I just noticed that app at compile time used 3.2.0 library but
>>>>>>>>>>>>> patches
>>>>>>>>>>>>> fixing bug 449 did not touch C client lib so I believe that our
>>>>>>>>>>>>> problems are not
>>>>>>>>>>>>> related with that.
>>>>>>>>>>>>>
>>>>>>>>>>>>> In zookeeper logs at moment which initiated the problem with
>>>>>>>>>>>>> client
>>>>>>>>>>>>> application I have
>>>>>>>>>>>>>
>>>>>>>>>>>>> node1:
>>>>>>>>>>>>> 2010-03-16 14:21:43,510 - INFO
>>>>>>>>>>>>> [NIOServerCxn.Factory:2181:nioserverc...@607] - Connected to
>>>>>>>>>>>>> /10.1.112.61:37197 lastZxid 42992576502
>>>>>>>>>>>>> 2010-03-16 14:21:43,510 - INFO
>>>>>>>>>>>>> [NIOServerCxn.Factory:2181:nioserverc...@636] - Renewing session
>>>>>>>>>>>>> 0x324dcc1ba580085
>>>>>>>>>>>>> 2010-03-16 14:21:49,443 - INFO
>>>>>>>>>>>>> [QuorumPeer:/0:0:0:0:0:0:0:0:2181:nioserverc...@992] - Finished
>>>>>>>>>>>>> init
>>>>>>>>>>>>> of 0x324dcc1ba580085 valid:true
>>>>>>>>>>>>> 2010-03-16 14:21:49,443 - WARN
>>>>>>>>>>>>> [NIOServerCxn.Factory:2181:nioserverc...@518] - Exception
>>>>>>>>>>>>> causing
>>>>>>>>>>>>> close of session 0x324dcc1ba580085 due to java.io.IOException:
>>>>>>>>>>>>> Read
>>>>>>>>>>>>> error
>>>>>>>>>>>>> 2010-03-16 14:21:49,444 - INFO
>>>>>>>>>>>>> [NIOServerCxn.Factory:2181:nioserverc...@857] - closing
>>>>>>>>>>>>> session:0x324dcc1ba580085 NIOServerCnxn:
>>>>>>>>>>>>> java.nio.channels.SocketChannel[connected
>>>>>>>>>>>>> local=/10.1.112.62:2181
>>>>>>>>>>>>> remote=/10.1.112.61:37197]
>>>>>>>>>>>>>
>>>>>>>>>>>>> node2:
>>>>>>>>>>>>> 2010-03-16 14:21:40,580 - WARN
>>>>>>>>>>>>> [NIOServerCxn.Factory:2181:nioserverc...@494] - Exception
>>>>>>>>>>>>> causing
>>>>>>>>>>>>> close of session 0x324dcc1ba580085 due to java.io.IOException:
>>>>>>>>>>>>> Read
>>>>>>>>>>>>> error
>>>>>>>>>>>>> 2010-03-16 14:21:40,581 - INFO
>>>>>>>>>>>>> [NIOServerCxn.Factory:2181:nioserverc...@833] - closing
>>>>>>>>>>>>> session:0x324dcc1ba580085 NIOServerCnxn:
>>>>>>>>>>>>> java.nio.channels.SocketChannel[connected
>>>>>>>>>>>>> local=/10.1.112.63:2181
>>>>>>>>>>>>> remote=/10.1.112.61:60693]
>>>>>>>>>>>>> 2010-03-16 14:21:46,839 - INFO
>>>>>>>>>>>>> [NIOServerCxn.Factory:2181:nioserverc...@583] - Connected to
>>>>>>>>>>>>> /10.1.112.61:48336 lastZxid 42992576502
>>>>>>>>>>>>> 2010-03-16 14:21:46,839 - INFO
>>>>>>>>>>>>> [NIOServerCxn.Factory:2181:nioserverc...@612] - Renewing session
>>>>>>>>>>>>> 0x324dcc1ba580085
>>>>>>>>>>>>> 2010-03-16 14:21:49,439 - INFO
>>>>>>>>>>>>> [QuorumPeer:/0:0:0:0:0:0:0:0:2181:nioserverc...@964] - Finished
>>>>>>>>>>>>> init
>>>>>>>>>>>>> of 0x324dcc1ba580085 valid:true
>>>>>>>>>>>>>
>>>>>>>>>>>>> node3:
>>>>>>>>>>>>> 2010-03-16 02:14:48,961 - WARN
>>>>>>>>>>>>> [NIOServerCxn.Factory:2181:nioserverc...@494] - Exception
>>>>>>>>>>>>> causing
>>>>>>>>>>>>> close of session 0x324dcc1ba580085 due to java.io.IOException:
>>>>>>>>>>>>> Read
>>>>>>>>>>>>> error
>>>>>>>>>>>>> 2010-03-16 02:14:48,962 - INFO
>>>>>>>>>>>>> [NIOServerCxn.Factory:2181:nioserverc...@833] - closing
>>>>>>>>>>>>> session:0x324dcc1ba580085 NIOServerCnxn:
>>>>>>>>>>>>> java.nio.channels.SocketChannel[connected
>>>>>>>>>>>>> local=/10.1.112.64:2181
>>>>>>>>>>>>> remote=/10.1.112.61:57309]
>>>>>>>>>>>>>
>>>>>>>>>>>>> and then lots of entries like this
>>>>>>>>>>>>> 2010-03-16 02:14:54,696 - WARN
>>>>>>>>>>>>> [ProcessThread:-1:preprequestproces...@402] - Got exception when
>>>>>>>>>>>>> processing sessionid:0x324dcc1ba580085 type:create
>>>>>>>>>>>>> cxid:0x4b9e9e49
>>>>>>>>>>>>> zxid:0xfffffffffffffffe txntype:unknown
>>>>>>>>>>>>> /locks/9871253/lock-8589943989-
>>>>>>>>>>>>> org.apache.zookeeper.KeeperException$SessionMovedException:
>>>>>>>>>>>>> KeeperErrorCode = Session moved
>>>>>>>>>>>>>     at
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> org.apache.zookeeper.server.SessionTrackerImpl.checkSession(Session
>>>>>>>>>>>>> TrackerImpl.java:231)
>>>>>>>>>>>>>     at
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepReque
>>>>>>>>>>>>> stProcessor.java:211)
>>>>>>>>>>>>>     at
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestPro
>>>>>>>>>>>>> cessor.java:114)
>>>>>>>>>>>>> 2010-03-16 14:22:06,428 - WARN
>>>>>>>>>>>>> [ProcessThread:-1:preprequestproces...@402] - Got exception when
>>>>>>>>>>>>> processing sessionid:0x324dcc1ba580085 type:create
>>>>>>>>>>>>> cxid:0x4b9f6603
>>>>>>>>>>>>> zxid:0xfffffffffffffffe txntype:unknown
>>>>>>>>>>>>> /locks/1665960/lock-8589961006-
>>>>>>>>>>>>> org.apache.zookeeper.KeeperException$SessionMovedException:
>>>>>>>>>>>>> KeeperErrorCode = Session moved
>>>>>>>>>>>>>     at
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> org.apache.zookeeper.server.SessionTrackerImpl.checkSession(Session
>>>>>>>>>>>>> TrackerImpl.java:231)
>>>>>>>>>>>>>     at
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> org.apache.zookeeper.server.PrepRequestProcessor.pRequest(PrepReque
>>>>>>>>>>>>> stProcessor.java:211)
>>>>>>>>>>>>>     at
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> org.apache.zookeeper.server.PrepRequestProcessor.run(PrepRequestPro
>>>>>>>>>>>>> cessor.java:114)
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> To workaround disconnections I am going to increase session
>>>>>>>>>>>>> timeout
>>>>>>>>>>>>> from 5 to 15 seconds but event if it helps at all it is just a
>>>>>>>>>>>>> workaround.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Do you have an idea where is the source of my problem.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards, Łukasz Osipiuk
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>
>



-- 
-- 
Łukasz Osipiuk
mailto:luk...@osipiuk.net

Reply via email to