Re: RegionServers Crashing every hour in production env

2013-04-03 Thread Ted Yu
I went over related emails in my Inbox. In one previous case, another task was running on the region server node and consumed a good portion of the CPU. In that case I observed a gap in activity in the region server log. I can send that snippet after anonymization, since there were some IP addresses a

Re: RegionServers Crashing every hour in production env

2013-04-03 Thread Pablo Musa
> Have you looked at http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired , suggested below ? Yes I did, but GC is not the issue here. > I think zookeeper timeout should be more closely watched. What do you mean? My ZK timeout today is 150 secs, which is very big. However, my problem

Re: RegionServers Crashing every hour in production env

2013-04-03 Thread Ted Yu
Thanks for sharing. Have you looked at http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired , suggested below ? I think zookeeper timeout should be more closely watched. On Wed, Apr 3, 2013 at 11:21 AM, Pablo Musa wrote: > Hello guys, > I stopped my research on HBase ZK timeout

Re: RegionServers Crashing every hour in production env

2013-04-03 Thread Pablo Musa
Hello guys, I stopped my research on HBase ZK timeout for a while due to other issues I had to handle, but I am back. A very weird behavior that I would like your comments on is that my RegionServers perform better (fewer crashes) under heavy load than under light load. That is, if I leave HBase alone with

Re: RegionServers Crashing every hour in production env

2013-03-12 Thread Pablo Musa
Guys, thank you very much for the help. Yesterday I spent 14 hours trying to tune the whole cluster. The cluster is not ready yet and needs a lot of tuning, but at least it is working. My first big problem was namenode + datanode GC. They were not using CMS and thus were taking "incremental" time to

Re: RegionServers Crashing every hour in production env

2013-03-10 Thread Andrew Purtell
Be careful with GC tuning; throwing changes at an application without analysis of what is going on with the heap is shooting in the dark. One particularly good treatment of the subject is here: http://java.dzone.com/articles/how-tame-java-gc-pauses If you have made custom changes to blockcache or me

Re: RegionServers Crashing every hour in production env

2013-03-10 Thread Azuryy Yu
Pablo, another question: what's your Java version? On Mon, Mar 11, 2013 at 10:13 AM, Azuryy Yu wrote: > Hi Pablo, > It's terrible to have such a long minor GC. I don't think there is any swapping in > your vmstat log, > but I suggest you > 1) add the following JVM options: > -XX:+DisableExplicitGC -XX:+UseCompre

Re: RegionServers Crashing every hour in production env

2013-03-10 Thread Azuryy Yu
Hi Pablo, It's terrible to have such a long minor GC. I don't think there is any swapping in your vmstat log, but I suggest you 1) add the following JVM options: -XX:+DisableExplicitGC -XX:+UseCompressedOops -XX:GCTimeRatio=19 -XX:SoftRefLRUPolicyMSPerMB=0 -XX:SurvivorRatio=2 -XX:MaxTenuringThreshold=3 -XX:+

Re: RegionServers Crashing every hour in production env

2013-03-10 Thread Stack
You could increase your zookeeper session timeout to 5 minutes while you are figuring out why there are these long pauses. http://hbase.apache.org/book.html#zookeeper.session.timeout Above, there is an outage for almost 5 minutes: >> We slept 225100ms instead of 3000ms, this is likely due to a long You have g
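
The sleeper warning quoted in this message can be picked out of a region server log mechanically. What follows is a minimal sketch in Python, assuming the message text matches the line quoted in this thread ("We slept 225100ms instead of 3000ms"); the function name `find_long_sleeps` and the sample log line are hypothetical and are not part of HBase.

```python
import re

# Pattern taken from the warning quoted in this thread; other HBase
# versions may word the message differently (assumption).
SLEEP_RE = re.compile(r"We slept (\d+)ms instead of (\d+)ms")


def find_long_sleeps(lines, session_timeout_ms):
    """Return (slept_ms, expected_ms) pairs whose pause meets or exceeds the timeout."""
    hits = []
    for line in lines:
        m = SLEEP_RE.search(line)
        if m:
            slept, expected = int(m.group(1)), int(m.group(2))
            if slept >= session_timeout_ms:
                hits.append((slept, expected))
    return hits


if __name__ == "__main__":
    # Hypothetical sample built from the fragment quoted in the thread.
    sample = ["WARN We slept 225100ms instead of 3000ms, this is likely due to a long"]
    # 150 secs is the ZK timeout reported elsewhere in this thread.
    print(find_long_sleeps(sample, 150_000))
```

With a 150 second timeout, the 225100 ms pause (about 3 minutes 45 seconds) is reported as long enough to expire the session; with a 5 minute (300000 ms) timeout it would not be, which is the reasoning behind raising the timeout temporarily while the cause of the pauses is investigated.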

Re: RegionServers Crashing every hour in production env

2013-03-10 Thread Pablo Musa
Hi Sreepathi, they say in the book (or the site) that we could try it to see whether it is really a timeout error or there is something more. But it is not recommended for production environments. I could give it a try if five minutes will tell us whether the problem is the GC or elsewhere!! Anyway,

Re: RegionServers Crashing every hour in production env

2013-03-10 Thread Sreepathi
Hi Stack/Ted/Pablo, Should we increase the hbase.rpc.timeout property to 5 minutes (300000 ms)? Regards, - Sreepathi On Sun, Mar 10, 2013 at 11:59 AM, Pablo Musa wrote: > > That combo should be fine. > > Great!! > > > > If JVM is full GC'ing, the application is stopped. > > The below does

Re: RegionServers Crashing every hour in production env

2013-03-10 Thread Pablo Musa
> That combo should be fine. Great!! > If JVM is full GC'ing, the application is stopped. > The below does not look like a full GC but that is a long pause in system > time, enough to kill your zk session. Exactly. This pause is really making the zk expire the RS, which shuts down (logs in the

Re: RegionServers Crashing every hour in production env

2013-03-08 Thread Stack
On Fri, Mar 8, 2013 at 10:58 AM, Pablo Musa wrote: > 0.94 currently doesn't support hadoop 2.0 >> Can you deploy hadoop 1.1.1 instead ? >> > > I am using cdh4.2.0 which uses this version as default installation. > I think it will be a problem for me to deploy 1.1.1 because I would need to > "upgr

Re: RegionServers Crashing every hour in production env

2013-03-08 Thread Pablo Musa
> 0.94 currently doesn't support hadoop 2.0 > Can you deploy hadoop 1.1.1 instead ? I am using cdh4.2.0, which uses this version as the default installation. I think it will be a problem for me to deploy 1.1.1 because I would need to "upgrade" the whole cluster with 70TB of data (backup everything, go of

Re: RegionServers Crashing every hour in production env

2013-03-08 Thread Stack
What RAM says. 2013-03-07 17:24:57,887 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 159348ms for sessionid 0x13d3c4bcba600a7, closing socket connection and attempting reconnect You Full GC'ing around this time? Put up your configs in a place whe
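
The ClientCnxn line quoted in this message carries both the length of the silence and the session id. A small sketch in Python for extracting them, assuming the wording matches the line quoted here; `parse_session_timeout` is a hypothetical helper name, not a ZooKeeper or HBase API.

```python
import re

# Pattern built from the log line quoted in this thread (assumption:
# other ZooKeeper client versions may phrase the message differently).
TIMEOUT_RE = re.compile(
    r"Client session timed out, have not heard from server in (\d+)ms "
    r"for sessionid (0x[0-9a-fA-F]+)"
)


def parse_session_timeout(line):
    """Return the silent period and session id from a timeout line, or None."""
    m = TIMEOUT_RE.search(line)
    if m is None:
        return None
    return {"silent_ms": int(m.group(1)), "session_id": m.group(2)}
```

Applied to the quoted line, this yields a silent period of 159348 ms, roughly 159 seconds, which can then be compared against the configured ZooKeeper session timeout.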

Re: RegionServers Crashing every hour in production env

2013-03-08 Thread ramkrishna vasudevan
I think the problem is with your GC config. What is your heap size? What data are you pumping in, and how large is the block cache? Regards Ram On Fri, Mar 8, 2013 at 9:31 PM, Ted Yu wrote: > 0.94 currently doesn't support hadoop 2.0 > > Can you deploy hadoop 1.1.1 instead ? > > Are you using

Re: RegionServers Crashing every hour in production env

2013-03-08 Thread Ted Yu
0.94 currently doesn't support hadoop 2.0 Can you deploy hadoop 1.1.1 instead ? Are you using 0.94.5 ? Thanks On Fri, Mar 8, 2013 at 7:44 AM, Pablo Musa wrote: > Hey guys, > as I sent in an email a long time ago, the RSs in my cluster did not get > along > and crashed 3 times a day. I tried a