We never did pinpoint the underlying cause, but we found a way to resolve
this issue: upgrading Ubuntu from 9.04 to 9.10. The surmise from our
SysAdmin was that there was some incompatibility between the new hardware
and the old drivers. So it's most definitely out with the old and in with
GC pauses of 5 minutes seem to be my bane. I disabled swapping:
sysctl -A |grep swap
vm.swappiness = 0
so I assume the Java heap is not swapped out. How does one check whether
swapping is on for a particular Java process?
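One possible way to answer the per-process question on Linux is to add up the "Swap:" fields in /proc/<pid>/smaps. The sketch below assumes a kernel that exposes those fields (older kernels may not) and that you already have the RegionServer's PID, e.g. from jps or ps. The function names are made up for illustration.

```python
import re


def swapped_kb_from_smaps(smaps_text: str) -> int:
    """Sum the per-mapping 'Swap:' fields (in kB) found in /proc/<pid>/smaps text."""
    # Anchoring on 'Swap:' at line start skips related fields such as 'SwapPss:'.
    return sum(
        int(kb)
        for kb in re.findall(r"^Swap:\s+(\d+)\s+kB", smaps_text, re.MULTILINE)
    )


def swapped_kb(pid: int) -> int:
    """Report how many kB of the given process are currently swapped out (Linux only)."""
    with open(f"/proc/{pid}/smaps") as f:
        return swapped_kb_from_smaps(f.read())
```

A non-zero result from swapped_kb(pid) for the Java process would mean part of its address space (possibly heap) is sitting in swap. Reading another user's smaps usually needs the same user or root.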
Swappiness is something else. It's good to set it to 0 when you have
enough RAM to fit everything, but the machine will still swap when you
run out of it, and that will be a big hit.
I would advise monitoring the cluster, or at the very least looking at
the output of the top command while the job is running.
I am pretty sure that is the case but will double check.
I found another case where RS died without an apparent stop-the-world GC.
RS grv-hadoopc05
*** rs.log ***
2010-07-30 10:43:36,028 INFO org.apache.hadoop.hbase.regionserver.HLog: Roll
Well, it says: Times: user=0.17 sys=0.04, real=299.23 secs
So why did it take 0.04 of system time but 300 secs of real time?
That's insane. Either the region server process was completely starved
of CPU cycles (are you on EC2 or any virtualized service like that?),
or the computer was put to
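To make the arithmetic behind that observation explicit, here is a small sketch that pulls the three numbers out of a GC log "Times:" entry and computes how much wall-clock time was not spent on CPU. The helper name and the "off_cpu" key are made up for illustration. Note that with parallel GC threads user+sys can exceed real, so the difference is only a rough signal.

```python
import re


def parse_gc_times(line: str) -> dict:
    """Extract user/sys/real seconds from a GC log 'Times:' entry."""
    m = re.search(r"user=([\d.]+)\s+sys=([\d.]+),\s+real=([\d.]+)\s+secs", line)
    if m is None:
        raise ValueError("no 'Times:' triple found in line")
    user, sys_time, real = (float(x) for x in m.groups())
    # Wall-clock time not accounted for by CPU time: the process was
    # waiting (swap-in, descheduled, suspended host, ...) rather than working.
    return {"user": user, "sys": sys_time, "real": real,
            "off_cpu": real - (user + sys_time)}


times = parse_gc_times("Times: user=0.17 sys=0.04, real=299.23 secs")
print(f"CPU: {times['user'] + times['sys']:.2f}s, waiting: {times['off_cpu']:.2f}s")
```

For the line quoted above, roughly 0.21 secs of CPU against about 299 secs of waiting is what points at starvation or swapping rather than at the collector itself.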
Agree with what JD said - also check for swapping on the machine. GC can
take forever if any of the Java heap gets swapped out, since GC by its
nature has to traverse most of the pages in the heap.
-Todd
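A rough sketch of checking for machine-wide swapping while the job runs: on Linux, /proc/vmstat carries the cumulative pswpin and pswpout counters (pages swapped in and out since boot), so sampling them twice shows whether swap traffic is happening right now. The function names and the default interval are arbitrary choices for illustration.

```python
import time


def parse_swap_counters(vmstat_text: str) -> tuple:
    """Return (pswpin, pswpout) from the text of /proc/vmstat."""
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            counters[parts[0]] = int(parts[1])
    return counters["pswpin"], counters["pswpout"]


def read_swap_counters() -> tuple:
    """Read the current swap-in/swap-out page counters (Linux only)."""
    with open("/proc/vmstat") as f:
        return parse_swap_counters(f.read())


def watch_swap(interval_s: float = 5.0, samples: int = 12) -> None:
    """Print how many pages were swapped in and out during each interval."""
    prev = read_swap_counters()
    for _ in range(samples):
        time.sleep(interval_s)
        cur = read_swap_counters()
        print(f"pages swapped in: {cur[0] - prev[0]}, out: {cur[1] - prev[1]}")
        prev = cur
```

If watch_swap() prints non-zero deltas around the time of a long GC pause, that supports the swapping explanation. The si/so columns of vmstat show the same thing without any code.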
On Thu, Jul 29, 2010 at 3:41 PM, Jean-Daniel Cryans jdcry...@apache.org wrote:
Well it