Re: GC [ParNew...] took 299 secs causing region server to die

2010-08-07 Thread Steve Kuo
We never did pinpoint the underlying cause, but we found a way to resolve this issue -- by upgrading Ubuntu from 9.04 to 9.10. The surmise from our SysAdmin was that there was some incompatibility between the new hardware and the old drivers. So it's most definitely out with the old and in with

Re: GC [ParNew...] took 299 secs causing region server to die

2010-07-30 Thread Steve Kuo
A GC pause of 5 minutes seems to be my bane. I disabled swapping (sysctl -A | grep swap shows vm.swappiness = 0), so I assume the Java heap is not swapped out. How does one check whether a particular Java process is being swapped?

Re: GC [ParNew...] took 299 secs causing region server to die

2010-07-30 Thread Jean-Daniel Cryans
Swappiness is something else. It's good to set it to 0 when you have enough RAM to fit everything, but the machine will still swap when you run out of it, and it will be a big hit. I would advise monitoring the cluster, or at the very least looking at the output of the top command while the job is running.
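For the per-process question above, here is a minimal sketch, assuming a Linux kernel recent enough to report a VmSwap line in /proc/<pid>/status (older kernels, possibly including the one on this cluster, may not have it). The helper names parse_vmswap_kb and swapped_kb are made up for illustration.

```python
import re
from typing import Optional


def parse_vmswap_kb(status_text: str) -> Optional[int]:
    """Extract the VmSwap value (in kB) from the text of /proc/<pid>/status.

    Returns None when the kernel does not report a VmSwap line.
    """
    match = re.search(r"^VmSwap:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def swapped_kb(pid: int) -> Optional[int]:
    """Read /proc/<pid>/status for a live process and return its swapped-out kB."""
    with open(f"/proc/{pid}/status") as status_file:
        return parse_vmswap_kb(status_file.read())
```

Calling swapped_kb with the region server's pid (from jps or ps) and getting a non-zero value would mean part of that process is currently swapped out. None only means this kernel does not expose the field, not that nothing is swapped.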

Re: GC [ParNew...] took 299 secs causing region server to die

2010-07-30 Thread Steve Kuo
I am pretty sure that is the case but will double check. I found another case where RS died without an apparent stop-the-world GC. RS grv-hadoopc05 *** rs.log *** 2010-07-30 10:43:36,028 INFO org.apache.hadoop.hbase.regionserver.HLog: Roll

Re: GC [ParNew...] took 299 secs causing region server to die

2010-07-29 Thread Jean-Daniel Cryans
Well, it says: Times: user=0.17 sys=0.04, real=299.23 secs. So why did it take 0.04 secs of system time but 300 secs of real time? That's insane. Either the region server process was completely starved of CPU cycles (are you on EC2 or any virtualized service like that?), or the computer was put to
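The comparison above (tiny user and sys time, huge real time) can be checked automatically across a GC log. This is a rough sketch, assuming the log lines carry the same "Times: user=... sys=..., real=... secs" suffix quoted in this thread. The function names and the thresholds (min_real_secs, ratio) are hypothetical and would need tuning.

```python
import re

# Matches the timing suffix the JVM prints with GC details enabled,
# e.g. "[Times: user=0.17 sys=0.04, real=299.23 secs]".
TIMES_RE = re.compile(r"Times: user=([\d.]+) sys=([\d.]+), real=([\d.]+) secs")


def parse_gc_times(line):
    """Return (user, sys, real) in seconds, or None if the line has no Times entry."""
    match = TIMES_RE.search(line)
    if match is None:
        return None
    user, sys_time, real = (float(group) for group in match.groups())
    return user, sys_time, real


def looks_stalled(line, min_real_secs=1.0, ratio=10.0):
    """Flag a GC event whose wall-clock time dwarfs the CPU time it consumed.

    With a parallel collector, user time normally meets or exceeds real time,
    so real >> user + sys suggests the process was waiting (swap, CPU
    starvation, a suspended machine) rather than collecting.
    """
    times = parse_gc_times(line)
    if times is None:
        return False
    user, sys_time, real = times
    return real >= min_real_secs and real > ratio * (user + sys_time)
```

Running looks_stalled over each line of the GC log would single out pauses like the 299-second one here while ignoring ordinary collections.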

Re: GC [ParNew...] took 299 secs causing region server to die

2010-07-29 Thread Todd Lipcon
Agree with what JD said - also check for swapping on the machine. GC can take forever if any of the Java heap gets swapped out, since GC by its nature has to traverse most of the pages in the heap. -Todd On Thu, Jul 29, 2010 at 3:41 PM, Jean-Daniel Cryans jdcry...@apache.org wrote: Well it