Re: problems on EC2?

2009-04-16 Thread Ted Dunning
Yes.  I had seen that before, but it is worth reading about once a month.

On Thu, Apr 16, 2009 at 11:45 AM, Patrick Hunt  wrote:

> Ted Dunning wrote:
>
>> On a related note, what is best practice for handling session expiration?
>> Just deal with it as if it is a new start?
>>
>
> See this re handling the errors ZK can throw at you:
> http://wiki.apache.org/hadoop/ZooKeeper/ErrorHandling
>
> Patrick
>



-- 
Ted Dunning, CTO
DeepDyve


Re: problems on EC2?

2009-04-16 Thread Patrick Hunt

Ted Dunning wrote:

On a related note, what is best practice for handling session expiration?
Just deal with it as if it is a new start?


See this re handling the errors ZK can throw at you:
http://wiki.apache.org/hadoop/ZooKeeper/ErrorHandling

Patrick


Re: problems on EC2?

2009-04-16 Thread Ted Dunning
Once we have a bit more experience, that would be fine.  Best would be to
present solutions as well as non-specific problems.

On Thu, Apr 16, 2009 at 11:41 AM, Patrick Hunt  wrote:

> ps. please consider presenting your "experiences running ZK inside EC2" at
> an upcoming Hadoop social or even at the summit. I know I'd really be
> interested to hear your experiences and I think it would be useful for both
> new and existing ZK users.
>
>


Re: problems on EC2?

2009-04-16 Thread Patrick Hunt
ps. please consider presenting your "experiences running ZK inside EC2" 
at an upcoming Hadoop social or even at the summit. I know I'd really be 
interested to hear your experiences and I think it would be useful for 
both new and existing ZK users.


Patrick

Patrick Hunt wrote:
Well that's good - 300ms max latency means that the server can round 
trip any requests pretty quickly. It would lead me to look at the client 
VMs or (intermittent) network problems...


Keep in mind though that's one of your servers (unless you are saying 
you checked all X of the servers in the cluster and that was the overall 
max?). You may discover one server that has issues while the other 
servers are fine. In which case only clients connected to the "bad" 
server(s) will experience problems (and since clients can jump between 
servers, that might be contributing to the randomness in observed 
occurrence).


Good luck and keep us posted. EC2 is very interesting, I'd like to learn 
more about the operating environment and in particular the issues 
involved with running ZK there.


Patrick

Ted Dunning wrote:

Patrick,

Thanks enormously.

This hasn't helped yet, but that is just because it was a very large bite of
the apple.  Once I digest it, I can tell that it will be very helpful.

I did have a chance to look at the "stat" output and maximum latency was
<300ms.  How that connects with what you are saying isn't clear yet, but I
can see how that might not be diagnostic of whether the server side timeout
is sufficiently long.

Thanks again.

On Thu, Apr 16, 2009 at 10:57 AM, Patrick Hunt  wrote:

lots of stuff about monitoring ... jmx ... packet loss ... vm latencies ...
timeout details.
... Hope this helps.

Patrick







Re: problems on EC2?

2009-04-16 Thread Patrick Hunt
Well that's good - 300ms max latency means that the server can round 
trip any requests pretty quickly. It would lead me to look at the client 
VMs or (intermittent) network problems...


Keep in mind though that's one of your servers (unless you are saying 
you checked all X of the servers in the cluster and that was the overall 
max?). You may discover one server that has issues while the other 
servers are fine. In which case only clients connected to the "bad" 
server(s) will experience problems (and since clients can jump between 
servers, that might be contributing to the randomness in observed 
occurrence).


Good luck and keep us posted. EC2 is very interesting, I'd like to learn 
more about the operating environment and in particular the issues 
involved with running ZK there.


Patrick

Ted Dunning wrote:

Patrick,

Thanks enormously.

This hasn't helped yet, but that is just because it was a very large bite of
the apple.  Once I digest it, I can tell that it will be very helpful.

I did have a chance to look at the "stat" output and maximum latency was
<300ms.  How that connects with what you are saying isn't clear yet, but I
can see how that might not be diagnostic of whether the server side timeout
is sufficiently long.

Thanks again.

On Thu, Apr 16, 2009 at 10:57 AM, Patrick Hunt  wrote:


lots of stuff about monitoring ... jmx ... packet loss ... vm latencies ...
timeout details.
... Hope this helps.

Patrick







Re: problems on EC2?

2009-04-16 Thread Ted Dunning
Patrick,

Thanks enormously.

This hasn't helped yet, but that is just because it was a very large bite of
the apple.  Once I digest it, I can tell that it will be very helpful.

I did have a chance to look at the "stat" output and maximum latency was
<300ms.  How that connects with what you are saying isn't clear yet, but I
can see how that might not be diagnostic of whether the server side timeout
is sufficiently long.

Thanks again.

On Thu, Apr 16, 2009 at 10:57 AM, Patrick Hunt  wrote:

> lots of stuff about monitoring ... jmx ... packet loss ... vm latencies ...
> timeout details.
> ... Hope this helps.
>
> Patrick
>
>
>


Re: problems on EC2?

2009-04-16 Thread Patrick Hunt

Take a look at this section to start:
http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_commonProblems

What type of monitoring are you doing on your cluster? You could monitor 
at both the host and at the java (jmx) level. That will give you some 
insight on where to look; cpu, memory, disk, network, etc... Also the 
ZooKeeper JMX will give you information about latencies and such (you 
can even use the "four letter words" for that if you want to hack up 
some scripts instead of using jmx). JMX will also give you insight into 
the JVM workings - so for example you could confirm or rule out the 
scenario outlined by Nitay (gc causing the jvm java threads to hang for 
> 30sec at a time, including the ZK heartbeat).
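
For anyone who wants to script the "four letter words" check mentioned above, here is a minimal sketch in Python. It assumes the server listens on the default client port 2181 and that the "stat" reply contains a line of the form "Latency min/avg/max: a/b/c"; both are assumptions to verify against your own servers, and the function names are made up for this example.

```python
import re
import socket


def four_letter_word(host, port=2181, word=b"stat", timeout=5.0):
    """Send a four-letter command and return the server's full text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(word)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_latency(stat_text):
    """Return (min, avg, max) latency in ms from 'stat' output, or None.

    Assumes a line such as 'Latency min/avg/max: 0/12/298'.
    """
    match = re.search(
        r"Latency min/avg/max:\s*([\d.]+)/([\d.]+)/([\d.]+)", stat_text
    )
    if match is None:
        return None
    return tuple(float(group) for group in match.groups())
```

Calling parse_latency(four_letter_word(host)) for every server in the ensemble, not just one, gives the per-server maximum; a single slow server would only show up that way.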


I've seen problems similar to what you describe a few times now; in each 
case it was something different. In one case for example there was a 
cluster of 5k clients attaching to a ZK cluster, and ~20% of the clients 
had misconfigured NICs. That was causing high TCP packet loss (and 
therefore high network latency), which caused a similar situation to 
what you are seeing, but only under fairly high network load (which made 
it hard to track down!).


I've also seen situations where people run the entire ZK cluster on a set 
of VMware VMs, all on the same host system. Latency on this 
configuration was much greater than 10 sec in some cases due to resource 
issues (in particular io - see the link I provided above, dedicated log 
devices are critical to low latency operation of the ZK cluster).



In your scenario I think 5 sec timeout is too low, probably much too 
low. Why? You are running in virtualized environments on non-dedicated 
hardware outside your control/inspection. There is typically no way to 
tell (unless you are running on the 8 core ec2 systems) if the ec2 host 
you are running on is over/under subscribed (other vms). There is no way 
to control disk latency either. You could be seeing large latencies due 
to resource contention on the ec2 host alone. In addition to that I've 
heard that network latencies in ec2 are high relative to what you would 
see if you were running on your own dedicated environment. It's hard to 
tell what latency you are seeing between the servers, and from client to 
server, within the ec2 environment without measuring it.


Keep in mind that the timeout period is used by both the client and the 
server. If the ZK leader doesn't hear from the client w/in the timeout 
(say it's 5 sec) it will expire the session. The client is sending a 
ping after 1/3 of the timeout period. It expects to hear a response 
before another 1/3 of the timeout elapses, after which it will attempt 
to re-sync to another server in the cluster. In the 5 sec timeout case 
you are allowing about 1.7 seconds (1/3 of 5 sec) for the request to go 
to the server, the server to respond back to the client, and the client 
to process the response. Check the latencies in ZK's JMX as I suggested 
to the hbase team in order to get insight into this (i.e. if the server 
latency is high, say because of io issues, or jvm swapping, vm latency, 
etc... that will cause the client/sessions to timeout).
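
To make the thirds concrete, here is a small arithmetic sketch in Python of the rule described above (ping after 1/3 of the timeout, response expected within the next 1/3, remainder left for reaching another server before the session expires). The function and key names are made up for this example; this is not ZooKeeper code.

```python
def session_budget(timeout_s):
    """Split a session timeout into the three client-side windows."""
    third = timeout_s / 3.0
    return {
        # idle time before the client sends a ping
        "ping_after_s": third,
        # time allowed for the ping to reach the server and the reply to return
        "response_window_s": third,
        # what is left to re-sync to another server before the session expires
        "reconnect_window_s": timeout_s - 2 * third,
    }


for timeout in (5, 30):
    budget = session_budget(timeout)
    print(timeout, {name: round(value, 2) for name, value in budget.items()})
```

With a 5 sec timeout each window is roughly 1.67 sec; with 30 sec each is 10 sec, which leaves far more room for a slow disk, a GC pause, or a noisy network.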


Hope this helps.

Patrick

Mahadev Konar wrote:

Hi Ted,

These problems seem to manifest around getting lots of anomalous disconnects
and session expirations even though we have the timeout values set to 2
seconds on the server side and 5 seconds on the client side.



Your scenario might be a little different from what Nitay (Hbase) is
seeing. In their scenario the zookeeper client was not able to send out
pings to the server due to gc stalling threads in their zookeeper
application process.

The latencies seen by zookeeper clients are directly related to the
Zookeeper server machines. They depend very much on the disk io latencies
that you get on the zookeeper servers and the network latencies within
your cluster.

I am not sure how sensitive you want your zookeeper application to be
-- but increasing the timeout should help. Also, we recommend using a
dedicated disk for the zookeeper transaction log.

http://hadoop.apache.org/zookeeper/docs/r3.1.1/zookeeperAdmin.html#sc_strengthsAndLimitations

Also, we have seen NTP having problems and clocks going back on one of our
vm setups. This would lead to sessions getting timed out earlier than the
set session timeout.

I hope this helps.


mahadev

On 4/14/09 5:48 PM, "Ted Dunning"  wrote:


We have been using EC2 as a substrate for our search cluster with zookeeper
as our coordination layer and have been seeing some strange problems.

These problems seem to manifest around getting lots of anomalous disconnects
and session expirations even though we have the timeout values set to 2
seconds on the server side and 5 seconds on the client side.

Has anybody else been seeing this?

Is this related to clock jumps in a virtualized setting?

On a related note, what is best practice for handling session expiration?
Just deal with it as if it is a new start?




Re: problems on EC2?

2009-04-14 Thread Mahadev Konar
Hi Ted,
> These problems seem to manifest around getting lots of anomalous disconnects
> and session expirations even though we have the timeout values set to 2
> seconds on the server side and 5 seconds on the client side.
> 

Your scenario might be a little different from what Nitay (Hbase) is
seeing. In their scenario the zookeeper client was not able to send out
pings to the server due to gc stalling threads in their zookeeper
application process.

The latencies seen by zookeeper clients are directly related to the
Zookeeper server machines. They depend very much on the disk io latencies
that you get on the zookeeper servers and the network latencies within
your cluster.

I am not sure how sensitive you want your zookeeper application to be
-- but increasing the timeout should help. Also, we recommend using a
dedicated disk for the zookeeper transaction log.

http://hadoop.apache.org/zookeeper/docs/r3.1.1/zookeeperAdmin.html#sc_strengthsAndLimitations

Also, we have seen NTP having problems and clocks going back on one of our
vm setups. This would lead to sessions getting timed out earlier than the
set session timeout.

I hope this helps.


mahadev

On 4/14/09 5:48 PM, "Ted Dunning"  wrote:

> We have been using EC2 as a substrate for our search cluster with zookeeper
> as our coordination layer and have been seeing some strange problems.
> 
> These problems seem to manifest around getting lots of anomalous disconnects
> and session expirations even though we have the timeout values set to 2
> seconds on the server side and 5 seconds on the client side.
> 
> Has anybody else been seeing this?
> 
> Is this related to clock jumps in a virtualized setting?
> 
> On a related note, what is best practice for handling session expiration?
> Just deal with it as if it is a new start?



Re: problems on EC2?

2009-04-14 Thread Nitay
Yes, we are. We currently don't handle SessionExpired very well at all in
HBase. There are two things going on in parallel to fix it:

1) Reinitialize the ZooKeeper handler (and everything else that depends on
it) on the node in question when a SessionExpired event occurs.
2) Reduce the number of SessionExpired events we get by using Joey's JNI
solution. After the various talks about session timeout, different GC flags,
etc, we decided to pursue the JNI solution. We plan on contributing his work
back to ZooKeeper, under some contrib, so that others can use it.
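
Item 1 above (reinitialize the handler and everything that depends on it) might be sketched roughly as follows in Python. This is only an illustration of the idea, not HBase or ZooKeeper code: the connect and on_new_session callables are hypothetical, and the state strings are stand-ins loosely modeled on the client's Disconnected and Expired states.

```python
class SessionManager:
    """Rebuilds a client handle, and everything derived from it, on expiry."""

    def __init__(self, connect, on_new_session):
        self._connect = connect                # hypothetical: returns a fresh handle
        self._on_new_session = on_new_session  # hypothetical: re-creates dependent state
        self.handle = None
        self.sessions_started = 0
        self._start()

    def _start(self):
        # Open a brand-new session and rebuild whatever hangs off it
        # (ephemeral nodes, watches, cached state).
        self.handle = self._connect()
        self.sessions_started += 1
        self._on_new_session(self.handle)

    def handle_event(self, state):
        if state == "Disconnected":
            # The session may still be alive on the server side; wait for
            # the client library to reconnect rather than starting over.
            return "wait"
        if state == "Expired":
            # The old handle is dead and its ephemeral state is gone,
            # so treat this like a new start.
            self._start()
            return "reinitialized"
        return "ok"
```

The point of separating the two cases is that a disconnect is recoverable within the same session, while an expiration is not, so only the latter should trigger a full rebuild.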

In the really short term, for folks that are seeing it, using the concurrent
GC and bumping up the session timeout to 30 seconds or so seems to reduce
the frequency of the problem.

I'm curious if your problems are the same as ours. You should try tweaking
the GC parameters and session timeout to see if the problems you're having
are the same as ours.

Cheers,
-n

On Tue, Apr 14, 2009 at 6:34 PM, Ted Dunning  wrote:

> Very good pointer.  Thanks.
>
> Are you still having your problems?
>
> On Tue, Apr 14, 2009 at 6:09 PM, Nitay  wrote:
>
> > Hi Ted,
> >
> > Fellow user coming from HBase. We were recently seeing lots of
> > SessionExpired events as well. Check out this mail thread:
> >
> >
> >
> http://markmail.org/search/?q=SessionExpired#query:SessionExpired+page:1+mid:gt4c2kn4n4f5s5kw+state:results
> >
> > Perhaps this might have something to do with what you're seeing.
> >
> > Cheers,
> > -n
> >
> > On Tue, Apr 14, 2009 at 5:48 PM, Ted Dunning 
> > wrote:
> >
> > > We have been using EC2 as a substrate for our search cluster with
> > zookeeper
> > > as our coordination layer and have been seeing some strange problems.
> > >
> > > These problems seem to manifest around getting lots of anomalous
> > > disconnects
> > > and session expirations even though we have the timeout values set to 2
> > > seconds on the server side and 5 seconds on the client side.
> > >
> > > Has anybody else been seeing this?
> > >
> > > Is this related to clock jumps in a virtualized setting?
> > >
> > > On a related note, what is best practice for handling session
> expiration?
> > > Just deal with it as if it is a new start?
> > >
> >
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>
> 111 West Evelyn Ave. Ste. 202
> Sunnyvale, CA 94086
> www.deepdyve.com
> 858-414-0013 (m)
> 408-773-0220 (fax)
>


Re: problems on EC2?

2009-04-14 Thread Ted Dunning
Very good pointer.  Thanks.

Are you still having your problems?

On Tue, Apr 14, 2009 at 6:09 PM, Nitay  wrote:

> Hi Ted,
>
> Fellow user coming from HBase. We were recently seeing lots of
> SessionExpired events as well. Check out this mail thread:
>
>
> http://markmail.org/search/?q=SessionExpired#query:SessionExpired+page:1+mid:gt4c2kn4n4f5s5kw+state:results
>
> Perhaps this might have something to do with what you're seeing.
>
> Cheers,
> -n
>
> On Tue, Apr 14, 2009 at 5:48 PM, Ted Dunning 
> wrote:
>
> > We have been using EC2 as a substrate for our search cluster with
> zookeeper
> > as our coordination layer and have been seeing some strange problems.
> >
> > These problems seem to manifest around getting lots of anomalous
> > disconnects
> > and session expirations even though we have the timeout values set to 2
> > seconds on the server side and 5 seconds on the client side.
> >
> > Has anybody else been seeing this?
> >
> > Is this related to clock jumps in a virtualized setting?
> >
> > On a related note, what is best practice for handling session expiration?
> > Just deal with it as if it is a new start?
> >
>



-- 
Ted Dunning, CTO
DeepDyve

111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
www.deepdyve.com
858-414-0013 (m)
408-773-0220 (fax)


Re: problems on EC2?

2009-04-14 Thread Nitay
Hi Ted,

Fellow user coming from HBase. We were recently seeing lots of
SessionExpired events as well. Check out this mail thread:

http://markmail.org/search/?q=SessionExpired#query:SessionExpired+page:1+mid:gt4c2kn4n4f5s5kw+state:results

Perhaps this might have something to do with what you're seeing.

Cheers,
-n

On Tue, Apr 14, 2009 at 5:48 PM, Ted Dunning  wrote:

> We have been using EC2 as a substrate for our search cluster with zookeeper
> as our coordination layer and have been seeing some strange problems.
>
> These problems seem to manifest around getting lots of anomalous
> disconnects
> and session expirations even though we have the timeout values set to 2
> seconds on the server side and 5 seconds on the client side.
>
> Has anybody else been seeing this?
>
> Is this related to clock jumps in a virtualized setting?
>
> On a related note, what is best practice for handling session expiration?
> Just deal with it as if it is a new start?
>