Re: Server exception when closing session

2010-01-25 Thread Patrick Hunt
GC and disk IO (transactional log in particular) will cause significant 
latency in some cases. See this for details on the types of things you 
should look at:


http://wiki.apache.org/hadoop/ZooKeeper/Troubleshooting

I've seen cases where the JVM will pause for 2+ minutes for GC, in some 
cases I've seen 4+ and I've heard of worse than that. Tuning GC (in 
particular using incremental/cms gc) is critical for consistently low 
latencies.


Patrick

Josh Scheid wrote:

On Fri, Jan 22, 2010 at 17:48, Mahadev Konar maha...@yahoo-inc.com wrote:

 The server latency does seem huge. What os and hardware are you running it
on?


RHEL4 2.8GHz 2-core AMD.  8GB RAM.
I will check with my admin for evidence of swap activity, but I don't
anticipate any.

This is bursty.  I'm currently seeing ~200 connections, but maxlat in
the last hour has been 185ms with 4ms avg.


What is usage model of zookeeper?


Distributed lock service.  Using the lock recipe.  Hosts hold a lock
for 5s to a couple of minutes with zero to dozens of waiters.


How much memory are you allocating to the server?


That's a good question.  I'm not an expert at java deployment.  I just
use zkServer.sh defaults.
sun-jre 1.6.0_14.  It's taking 1188MB of virtual memory right now,
100MB resident.


The debug well exacerbate the problem.


OK.


A dedicated disk means the following:
Zookeeper has snapshots and transaction logs. The datadir is the directory
that stores the transaction logs. Its highly recommended that this directory
be on a separate disk that isnt being used by any other process. The
snapshots can sit on a disk that is being used by the OS and can be shared.


Yeah, I understand.  I just need to get that set up and was hoping
that my load wouldn't warrant it.


Also, Pat ran some tests for serve lantecies at:

http://wiki.apache.org/hadoop/ZooKeeper/ServiceLatencyOverview

You can take a look at that as well and see what the expected performance
should be for your workload.


I will take a look at that.  Thank you for your time.

-Josh


Re: Server exception when closing session

2010-01-25 Thread Ted Dunning
Be very cautious about misdirection here.  It is easy to focus on the ZK
server-side GC's.  In my experience, I have had many more GC related ZK
problems caused by *client* side GC.  If the client checks out for a minute,
you get disconnects and session expiration which is good for debugging that
code, but generally indicates a really bad user experience.  Lots of times
that bad user experience is somewhat covered up by your load balancer and
other general redundancy to you may notice it first from ZK forcing you to
think about these things.

On Mon, Jan 25, 2010 at 9:43 AM, Patrick Hunt ph...@apache.org wrote:

 GC and disk IO (transactional log in particular) will cause significant
 latency in some cases. See this for details on the types of things you
 should look at:

 http://wiki.apache.org/hadoop/ZooKeeper/Troubleshooting

 I've seen cases where the JVM will pause for 2+ minutes for GC, in some
 cases I've seen 4+ and I've heard of worse than that. Tuning GC (in
 particular using incremental/cms gc) is critical for consistently low
 latencies.

 Patrick

 Josh Scheid wrote:

 On Fri, Jan 22, 2010 at 17:48, Mahadev Konar maha...@yahoo-inc.com
 wrote:

  The server latency does seem huge. What os and hardware are you running
 it
 on?


 RHEL4 2.8GHz 2-core AMD.  8GB RAM.
 I will check with my admin for evidence of swap activity, but I don't
 anticipate any.

 This is bursty.  I'm currently seeing ~200 connections, but maxlat in
 the last hour has been 185ms with 4ms avg.

  What is usage model of zookeeper?


 Distributed lock service.  Using the lock recipe.  Hosts hold a lock
 for 5s to a couple of minutes with zero to dozens of waiters.

  How much memory are you allocating to the server?


 That's a good question.  I'm not an expert at java deployment.  I just
 use zkServer.sh defaults.
 sun-jre 1.6.0_14.  It's taking 1188MB of virtual memory right now,
 100MB resident.

  The debug well exacerbate the problem.


 OK.

  A dedicated disk means the following:
 Zookeeper has snapshots and transaction logs. The datadir is the
 directory
 that stores the transaction logs. Its highly recommended that this
 directory
 be on a separate disk that isnt being used by any other process. The
 snapshots can sit on a disk that is being used by the OS and can be
 shared.


 Yeah, I understand.  I just need to get that set up and was hoping
 that my load wouldn't warrant it.

  Also, Pat ran some tests for serve lantecies at:

 http://wiki.apache.org/hadoop/ZooKeeper/ServiceLatencyOverview

 You can take a look at that as well and see what the expected performance
 should be for your workload.


 I will take a look at that.  Thank you for your time.

 -Josh




-- 
Ted Dunning, CTO
DeepDyve


Re: Server exception when closing session

2010-01-25 Thread Patrick Hunt

Good point Ted.

FYI, this is the case with the HBase-ZK integration. The HBase java 
region server code, acting as a client of the ZK service, was seeing  
2min gc pauses in some cases. This was causing the ZK client session to 
timeout. Tuning has helped, however until the G1 compactor is available 
there are some nasty cases that can still show up regardless of how much 
tuning you do (although gc tuning on the client helps considerably, for 
the most part this is not an issue in HBase if properly tuned).


Patrick

Ted Dunning wrote:

Be very cautious about misdirection here.  It is easy to focus on the ZK
server-side GC's.  In my experience, I have had many more GC related ZK
problems caused by *client* side GC.  If the client checks out for a minute,
you get disconnects and session expiration which is good for debugging that
code, but generally indicates a really bad user experience.  Lots of times
that bad user experience is somewhat covered up by your load balancer and
other general redundancy to you may notice it first from ZK forcing you to
think about these things.

On Mon, Jan 25, 2010 at 9:43 AM, Patrick Hunt ph...@apache.org wrote:


GC and disk IO (transactional log in particular) will cause significant
latency in some cases. See this for details on the types of things you
should look at:

http://wiki.apache.org/hadoop/ZooKeeper/Troubleshooting

I've seen cases where the JVM will pause for 2+ minutes for GC, in some
cases I've seen 4+ and I've heard of worse than that. Tuning GC (in
particular using incremental/cms gc) is critical for consistently low
latencies.

Patrick

Josh Scheid wrote:


On Fri, Jan 22, 2010 at 17:48, Mahadev Konar maha...@yahoo-inc.com
wrote:


 The server latency does seem huge. What os and hardware are you running
it
on?


RHEL4 2.8GHz 2-core AMD.  8GB RAM.
I will check with my admin for evidence of swap activity, but I don't
anticipate any.

This is bursty.  I'm currently seeing ~200 connections, but maxlat in
the last hour has been 185ms with 4ms avg.

 What is usage model of zookeeper?
Distributed lock service.  Using the lock recipe.  Hosts hold a lock
for 5s to a couple of minutes with zero to dozens of waiters.

 How much memory are you allocating to the server?
That's a good question.  I'm not an expert at java deployment.  I just
use zkServer.sh defaults.
sun-jre 1.6.0_14.  It's taking 1188MB of virtual memory right now,
100MB resident.

 The debug well exacerbate the problem.
OK.

 A dedicated disk means the following:

Zookeeper has snapshots and transaction logs. The datadir is the
directory
that stores the transaction logs. Its highly recommended that this
directory
be on a separate disk that isnt being used by any other process. The
snapshots can sit on a disk that is being used by the OS and can be
shared.


Yeah, I understand.  I just need to get that set up and was hoping
that my load wouldn't warrant it.

 Also, Pat ran some tests for serve lantecies at:

http://wiki.apache.org/hadoop/ZooKeeper/ServiceLatencyOverview

You can take a look at that as well and see what the expected performance
should be for your workload.


I will take a look at that.  Thank you for your time.

-Josh






Server exception when closing session

2010-01-22 Thread Josh Scheid
Is it normal for client session close() to cause a server exception?
Things seem to work, but the WARN is a bit disconcerting.

2010-01-22 17:15:01,573 - WARN
[NIOServerCxn.Factory:2181:nioserverc...@518] - Exception causing
close of session 0x126571af282114b due to java.io.IOException: Read
error
2010-01-22 17:15:01,573 - DEBUG
[NIOServerCxn.Factory:2181:nioserverc...@521] - IOException stack
trace
java.io.IOException: Read error
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:396)
at 
org.apache.zookeeper.server.NIOServerCnxn$Factory.run(NIOServerCnxn.java:239)
2010-01-22 17:15:01,573 - INFO
[NIOServerCxn.Factory:2181:nioserverc...@857] - closing
session:0x126571af282114b NIOServerCnxn:
java.nio.channels.SocketChannel[connected local=/10.66.16.96:2181
remote=/10.66.24.94:59591]
2010-01-22 17:15:01,573 - INFO
[ProcessThread:-1:preprequestproces...@384] - Processed session
termination request for id: 0x126571af282114b
2010-01-22 17:15:01,583 - DEBUG
[SyncThread:0:finalrequestproces...@74] - Processing request::
sessionid:0x126571af282114b type:closeSession cxid:0x4b5a4d95
zxid:0x43f3 txntype:-11 n/a
2010-01-22 17:15:01,583 - DEBUG
[SyncThread:0:finalrequestproces...@147] - sessionid:0x126571af282114b
type:closeSession cxid:0x4b5a4d95 zxid:0x43f3 txntype:-11 n/a

zk 3.2.2.  Client is using zkpython.

Nothing is otherwise abnormal.  I can just connect, then close the
session and this occurs.

-Josh


Re: Server exception when closing session

2010-01-22 Thread Mahadev Konar
Hi Josh,
 This warning is not of any concern. Just a quick question, is there any
reason for you to runn the server on a DEBUG level?

Thanks
mahadev


On 1/22/10 5:19 PM, Josh Scheid jsch...@velocetechnologies.com wrote:

 Is it normal for client session close() to cause a server exception?
 Things seem to work, but the WARN is a bit disconcerting.
 
 2010-01-22 17:15:01,573 - WARN
 [NIOServerCxn.Factory:2181:nioserverc...@518] - Exception causing
 close of session 0x126571af282114b due to java.io.IOException: Read
 error
 2010-01-22 17:15:01,573 - DEBUG
 [NIOServerCxn.Factory:2181:nioserverc...@521] - IOException stack
 trace
 java.io.IOException: Read error
 at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:396)
 at 
 org.apache.zookeeper.server.NIOServerCnxn$Factory.run(NIOServerCnxn.java:239)
 2010-01-22 17:15:01,573 - INFO
 [NIOServerCxn.Factory:2181:nioserverc...@857] - closing
 session:0x126571af282114b NIOServerCnxn:
 java.nio.channels.SocketChannel[connected local=/10.66.16.96:2181
 remote=/10.66.24.94:59591]
 2010-01-22 17:15:01,573 - INFO
 [ProcessThread:-1:preprequestproces...@384] - Processed session
 termination request for id: 0x126571af282114b
 2010-01-22 17:15:01,583 - DEBUG
 [SyncThread:0:finalrequestproces...@74] - Processing request::
 sessionid:0x126571af282114b type:closeSession cxid:0x4b5a4d95
 zxid:0x43f3 txntype:-11 n/a
 2010-01-22 17:15:01,583 - DEBUG
 [SyncThread:0:finalrequestproces...@147] - sessionid:0x126571af282114b
 type:closeSession cxid:0x4b5a4d95 zxid:0x43f3 txntype:-11 n/a
 
 zk 3.2.2.  Client is using zkpython.
 
 Nothing is otherwise abnormal.  I can just connect, then close the
 session and this occurs.
 
 -Josh



Re: Server exception when closing session

2010-01-22 Thread Josh Scheid
On Fri, Jan 22, 2010 at 17:22, Mahadev Konar maha...@yahoo-inc.com wrote:
  This warning is not of any concern.

OK.  I'm used to warnings being things that must be addressed.  I'll
ignore this one in the future.

 Just a quick question, is there any reason for you to runn the server on a 
 DEBUG level?

We're having issues with server latency.  Client default timeout of
1ms gets hit.  I saw a stat output showing a 16s max latency
today.
Is DEBUG going to exacerbate that?  Of the recommendations I've seen,
the one I can't yet follow is a dedicated disk: dataDir is in the root
partition of the server right now.

-Josh


Re: Server exception when closing session

2010-01-22 Thread Mahadev Konar
HI Josh,
  The server latency does seem huge. What os and hardware are you running it
on? What is usage model of zookeeper? How much memory are you allocating to
the server? 
The debug well exacerbate the problem.
A dedicated disk means the following:
Zookeeper has snapshots and transaction logs. The datadir is the directory
that stores the transaction logs. Its highly recommended that this directory
be on a separate disk that isnt being used by any other process. The
snapshots can sit on a disk that is being used by the OS and can be shared.

Also, Pat ran some tests for serve lantecies at:

http://wiki.apache.org/hadoop/ZooKeeper/ServiceLatencyOverview

You can take a look at that as well and see what the expected performance
should be for your workload.

Thanks
mahadev


On 1/22/10 5:40 PM, Josh Scheid jsch...@velocetechnologies.com wrote:

 On Fri, Jan 22, 2010 at 17:22, Mahadev Konar maha...@yahoo-inc.com wrote:
  This warning is not of any concern.
 
 OK.  I'm used to warnings being things that must be addressed.  I'll
 ignore this one in the future.
 
 Just a quick question, is there any reason for you to runn the server on a
 DEBUG level?
 
 We're having issues with server latency.  Client default timeout of
 1ms gets hit.  I saw a stat output showing a 16s max latency
 today.
 Is DEBUG going to exacerbate that?  Of the recommendations I've seen,
 the one I can't yet follow is a dedicated disk: dataDir is in the root
 partition of the server right now.
 
 -Josh