Re: Server exception when closing session
Good point Ted. FYI, this is the case with the HBase<->ZK integration. The HBase java region server code, acting as a client of the ZK service, was seeing > 2min gc pauses in some cases. This was causing the ZK client session to timeout. Tuning has helped, however until the G1 compactor is available there are some nasty cases that can still show up regardless of how much tuning you do (although gc tuning on the client helps considerably, for the most part this is not an issue in HBase if properly tuned). Patrick Ted Dunning wrote: Be very cautious about misdirection here. It is easy to focus on the ZK server-side GC's. In my experience, I have had many more GC related ZK problems caused by *client* side GC. If the client checks out for a minute, you get disconnects and session expiration which is good for debugging that code, but generally indicates a really bad user experience. Lots of times that bad user experience is somewhat covered up by your load balancer and other general redundancy to you may notice it first from ZK forcing you to think about these things. On Mon, Jan 25, 2010 at 9:43 AM, Patrick Hunt wrote: GC and disk IO (transactional log in particular) will cause significant latency in some cases. See this for details on the types of things you should look at: http://wiki.apache.org/hadoop/ZooKeeper/Troubleshooting I've seen cases where the JVM will pause for 2+ minutes for GC, in some cases I've seen 4+ and I've heard of worse than that. Tuning GC (in particular using incremental/cms gc) is critical for consistently low latencies. Patrick Josh Scheid wrote: On Fri, Jan 22, 2010 at 17:48, Mahadev Konar wrote: The server latency does seem huge. What os and hardware are you running it on? RHEL4 2.8GHz 2-core AMD. 8GB RAM. I will check with my admin for evidence of swap activity, but I don't anticipate any. This is bursty. I'm currently seeing ~200 connections, but maxlat in the last hour has been 185ms with 4ms avg. What is usage model of zookeeper? Distributed lock service. Using the lock recipe. Hosts hold a lock for 5s to a couple of minutes with zero to dozens of waiters. How much memory are you allocating to the server? That's a good question. I'm not an expert at java deployment. I just use zkServer.sh defaults. sun-jre 1.6.0_14. It's taking 1188MB of virtual memory right now, 100MB resident. The debug well exacerbate the problem. OK. A dedicated disk means the following: Zookeeper has snapshots and transaction logs. The datadir is the directory that stores the transaction logs. Its highly recommended that this directory be on a separate disk that isnt being used by any other process. The snapshots can sit on a disk that is being used by the OS and can be shared. Yeah, I understand. I just need to get that set up and was hoping that my load wouldn't warrant it. Also, Pat ran some tests for serve lantecies at: http://wiki.apache.org/hadoop/ZooKeeper/ServiceLatencyOverview You can take a look at that as well and see what the expected performance should be for your workload. I will take a look at that. Thank you for your time. -Josh
Re: Server exception when closing session
Be very cautious about misdirection here. It is easy to focus on the ZK server-side GC's. In my experience, I have had many more GC related ZK problems caused by *client* side GC. If the client checks out for a minute, you get disconnects and session expiration which is good for debugging that code, but generally indicates a really bad user experience. Lots of times that bad user experience is somewhat covered up by your load balancer and other general redundancy to you may notice it first from ZK forcing you to think about these things. On Mon, Jan 25, 2010 at 9:43 AM, Patrick Hunt wrote: > GC and disk IO (transactional log in particular) will cause significant > latency in some cases. See this for details on the types of things you > should look at: > > http://wiki.apache.org/hadoop/ZooKeeper/Troubleshooting > > I've seen cases where the JVM will pause for 2+ minutes for GC, in some > cases I've seen 4+ and I've heard of worse than that. Tuning GC (in > particular using incremental/cms gc) is critical for consistently low > latencies. > > Patrick > > Josh Scheid wrote: > >> On Fri, Jan 22, 2010 at 17:48, Mahadev Konar >> wrote: >> >>> The server latency does seem huge. What os and hardware are you running >>> it >>> on? >>> >> >> RHEL4 2.8GHz 2-core AMD. 8GB RAM. >> I will check with my admin for evidence of swap activity, but I don't >> anticipate any. >> >> This is bursty. I'm currently seeing ~200 connections, but maxlat in >> the last hour has been 185ms with 4ms avg. >> >> What is usage model of zookeeper? >>> >> >> Distributed lock service. Using the lock recipe. Hosts hold a lock >> for 5s to a couple of minutes with zero to dozens of waiters. >> >> How much memory are you allocating to the server? >>> >> >> That's a good question. I'm not an expert at java deployment. I just >> use zkServer.sh defaults. >> sun-jre 1.6.0_14. It's taking 1188MB of virtual memory right now, >> 100MB resident. >> >> The debug well exacerbate the problem. >>> >> >> OK. >> >> A dedicated disk means the following: >>> Zookeeper has snapshots and transaction logs. The datadir is the >>> directory >>> that stores the transaction logs. Its highly recommended that this >>> directory >>> be on a separate disk that isnt being used by any other process. The >>> snapshots can sit on a disk that is being used by the OS and can be >>> shared. >>> >> >> Yeah, I understand. I just need to get that set up and was hoping >> that my load wouldn't warrant it. >> >> Also, Pat ran some tests for serve lantecies at: >>> >>> http://wiki.apache.org/hadoop/ZooKeeper/ServiceLatencyOverview >>> >>> You can take a look at that as well and see what the expected performance >>> should be for your workload. >>> >> >> I will take a look at that. Thank you for your time. >> >> -Josh >> > -- Ted Dunning, CTO DeepDyve
Re: Server exception when closing session
GC and disk IO (transactional log in particular) will cause significant latency in some cases. See this for details on the types of things you should look at: http://wiki.apache.org/hadoop/ZooKeeper/Troubleshooting I've seen cases where the JVM will pause for 2+ minutes for GC, in some cases I've seen 4+ and I've heard of worse than that. Tuning GC (in particular using incremental/cms gc) is critical for consistently low latencies. Patrick Josh Scheid wrote: On Fri, Jan 22, 2010 at 17:48, Mahadev Konar wrote: The server latency does seem huge. What os and hardware are you running it on? RHEL4 2.8GHz 2-core AMD. 8GB RAM. I will check with my admin for evidence of swap activity, but I don't anticipate any. This is bursty. I'm currently seeing ~200 connections, but maxlat in the last hour has been 185ms with 4ms avg. What is usage model of zookeeper? Distributed lock service. Using the lock recipe. Hosts hold a lock for 5s to a couple of minutes with zero to dozens of waiters. How much memory are you allocating to the server? That's a good question. I'm not an expert at java deployment. I just use zkServer.sh defaults. sun-jre 1.6.0_14. It's taking 1188MB of virtual memory right now, 100MB resident. The debug well exacerbate the problem. OK. A dedicated disk means the following: Zookeeper has snapshots and transaction logs. The datadir is the directory that stores the transaction logs. Its highly recommended that this directory be on a separate disk that isnt being used by any other process. The snapshots can sit on a disk that is being used by the OS and can be shared. Yeah, I understand. I just need to get that set up and was hoping that my load wouldn't warrant it. Also, Pat ran some tests for serve lantecies at: http://wiki.apache.org/hadoop/ZooKeeper/ServiceLatencyOverview You can take a look at that as well and see what the expected performance should be for your workload. I will take a look at that. Thank you for your time. -Josh
Re: Server exception when closing session
On Fri, Jan 22, 2010 at 17:48, Mahadev Konar wrote: > The server latency does seem huge. What os and hardware are you running it > on? RHEL4 2.8GHz 2-core AMD. 8GB RAM. I will check with my admin for evidence of swap activity, but I don't anticipate any. This is bursty. I'm currently seeing ~200 connections, but maxlat in the last hour has been 185ms with 4ms avg. > What is usage model of zookeeper? Distributed lock service. Using the lock recipe. Hosts hold a lock for 5s to a couple of minutes with zero to dozens of waiters. > How much memory are you allocating to the server? That's a good question. I'm not an expert at java deployment. I just use zkServer.sh defaults. sun-jre 1.6.0_14. It's taking 1188MB of virtual memory right now, 100MB resident. > The debug well exacerbate the problem. OK. > A dedicated disk means the following: > Zookeeper has snapshots and transaction logs. The datadir is the directory > that stores the transaction logs. Its highly recommended that this directory > be on a separate disk that isnt being used by any other process. The > snapshots can sit on a disk that is being used by the OS and can be shared. Yeah, I understand. I just need to get that set up and was hoping that my load wouldn't warrant it. > Also, Pat ran some tests for serve lantecies at: > > http://wiki.apache.org/hadoop/ZooKeeper/ServiceLatencyOverview > > You can take a look at that as well and see what the expected performance > should be for your workload. I will take a look at that. Thank you for your time. -Josh
Re: Server exception when closing session
HI Josh, The server latency does seem huge. What os and hardware are you running it on? What is usage model of zookeeper? How much memory are you allocating to the server? The debug well exacerbate the problem. A dedicated disk means the following: Zookeeper has snapshots and transaction logs. The datadir is the directory that stores the transaction logs. Its highly recommended that this directory be on a separate disk that isnt being used by any other process. The snapshots can sit on a disk that is being used by the OS and can be shared. Also, Pat ran some tests for serve lantecies at: http://wiki.apache.org/hadoop/ZooKeeper/ServiceLatencyOverview You can take a look at that as well and see what the expected performance should be for your workload. Thanks mahadev On 1/22/10 5:40 PM, "Josh Scheid" wrote: > On Fri, Jan 22, 2010 at 17:22, Mahadev Konar wrote: >> This warning is not of any concern. > > OK. I'm used to warnings being things that must be addressed. I'll > ignore this one in the future. > >> Just a quick question, is there any reason for you to runn the server on a >> DEBUG level? > > We're having issues with server latency. Client default timeout of > 1ms gets hit. I saw a "stat" output showing a 16s max latency > today. > Is DEBUG going to exacerbate that? Of the recommendations I've seen, > the one I can't yet follow is a dedicated disk: dataDir is in the root > partition of the server right now. > > -Josh
Re: Server exception when closing session
On Fri, Jan 22, 2010 at 17:22, Mahadev Konar wrote: > This warning is not of any concern. OK. I'm used to warnings being things that must be addressed. I'll ignore this one in the future. > Just a quick question, is there any reason for you to runn the server on a > DEBUG level? We're having issues with server latency. Client default timeout of 1ms gets hit. I saw a "stat" output showing a 16s max latency today. Is DEBUG going to exacerbate that? Of the recommendations I've seen, the one I can't yet follow is a dedicated disk: dataDir is in the root partition of the server right now. -Josh
Re: Server exception when closing session
Hi Josh, This warning is not of any concern. Just a quick question, is there any reason for you to runn the server on a DEBUG level? Thanks mahadev On 1/22/10 5:19 PM, "Josh Scheid" wrote: > Is it normal for client session close() to cause a server exception? > Things seem to "work", but the WARN is a bit disconcerting. > > 2010-01-22 17:15:01,573 - WARN > [NIOServerCxn.Factory:2181:nioserverc...@518] - Exception causing > close of session 0x126571af282114b due to java.io.IOException: Read > error > 2010-01-22 17:15:01,573 - DEBUG > [NIOServerCxn.Factory:2181:nioserverc...@521] - IOException stack > trace > java.io.IOException: Read error > at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:396) > at > org.apache.zookeeper.server.NIOServerCnxn$Factory.run(NIOServerCnxn.java:239) > 2010-01-22 17:15:01,573 - INFO > [NIOServerCxn.Factory:2181:nioserverc...@857] - closing > session:0x126571af282114b NIOServerCnxn: > java.nio.channels.SocketChannel[connected local=/10.66.16.96:2181 > remote=/10.66.24.94:59591] > 2010-01-22 17:15:01,573 - INFO > [ProcessThread:-1:preprequestproces...@384] - Processed session > termination request for id: 0x126571af282114b > 2010-01-22 17:15:01,583 - DEBUG > [SyncThread:0:finalrequestproces...@74] - Processing request:: > sessionid:0x126571af282114b type:closeSession cxid:0x4b5a4d95 > zxid:0x43f3 txntype:-11 n/a > 2010-01-22 17:15:01,583 - DEBUG > [SyncThread:0:finalrequestproces...@147] - sessionid:0x126571af282114b > type:closeSession cxid:0x4b5a4d95 zxid:0x43f3 txntype:-11 n/a > > zk 3.2.2. Client is using zkpython. > > Nothing is otherwise abnormal. I can just connect, then close the > session and this occurs. > > -Josh
Server exception when closing session
Is it normal for client session close() to cause a server exception? Things seem to "work", but the WARN is a bit disconcerting. 2010-01-22 17:15:01,573 - WARN [NIOServerCxn.Factory:2181:nioserverc...@518] - Exception causing close of session 0x126571af282114b due to java.io.IOException: Read error 2010-01-22 17:15:01,573 - DEBUG [NIOServerCxn.Factory:2181:nioserverc...@521] - IOException stack trace java.io.IOException: Read error at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:396) at org.apache.zookeeper.server.NIOServerCnxn$Factory.run(NIOServerCnxn.java:239) 2010-01-22 17:15:01,573 - INFO [NIOServerCxn.Factory:2181:nioserverc...@857] - closing session:0x126571af282114b NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.66.16.96:2181 remote=/10.66.24.94:59591] 2010-01-22 17:15:01,573 - INFO [ProcessThread:-1:preprequestproces...@384] - Processed session termination request for id: 0x126571af282114b 2010-01-22 17:15:01,583 - DEBUG [SyncThread:0:finalrequestproces...@74] - Processing request:: sessionid:0x126571af282114b type:closeSession cxid:0x4b5a4d95 zxid:0x43f3 txntype:-11 n/a 2010-01-22 17:15:01,583 - DEBUG [SyncThread:0:finalrequestproces...@147] - sessionid:0x126571af282114b type:closeSession cxid:0x4b5a4d95 zxid:0x43f3 txntype:-11 n/a zk 3.2.2. Client is using zkpython. Nothing is otherwise abnormal. I can just connect, then close the session and this occurs. -Josh