Thanks for updating the list, Ferdy. St.Ack
On Mon, Jun 14, 2010 at 3:09 AM, Ferdy <[email protected]> wrote:
> After running stable for quite a while (using long configured timeouts), we
> recently noticed regionservers were starting to behave badly again. During
> compaction, regionservers complained that blocks were unavailable. Every
> couple of days, a regionserver decided to terminate itself because it could
> not recover from the DFS errors.
>
> So, after looking into it again, we might have found the actual cause of
> this problem. Prior to a regionserver terminating, the logs of the
> corresponding datanode told us that the "df" command could not be run
> because it could not allocate memory. Indeed, we had fine-tuned our nodes
> to use nearly all RAM for the Hadoop/HBase and child task processes, and we
> had swap disabled. But we assumed that a simple "df" check should not be
> that expensive, right..?
>
> Well, it seems we had to learn a bit about "Linux memory overcommit".
> Without going into much detail, spawning a process on Linux requires the
> new process to have, more or less, the same memory footprint as the
> original process. Therefore, a datanode with a 1.6GB heap (in our case)
> needs about that same amount of memory free when spawning a new process,
> even though the spawned process will do little to nothing. To accommodate
> this, you should either have enough free memory available (physical /
> swap) or you can tweak the 'overcommit' configuration of the operating
> system. We decided to increase the amount of available memory by enabling
> swap files.
>
> We're still running Hadoop 0.20.1 and HBase 0.20.3; presumably the newest
> releases have better handling of errors in the DFSClient/InputStreams.
> Nevertheless, we believe that we have found the root cause of our
> regionserver problems.
>
> Ferdy.
>
> Stack wrote:
>>
>> The culprit might be the fragmentation calculation. See
>> https://issues.apache.org/jira/browse/HBASE-2165.
>> St.Ack
>>
>> On Wed, Mar 10, 2010 at 9:33 AM, Andrew Purtell <[email protected]>
>> wrote:
>>
>>>> However, every once in a while our Nagios (our service monitor) detects
>>>> that requesting the HBase master page takes a long time. Sometimes > 10
>>>> sec, rarely around 30 secs, but most of the time < 10 secs. In the cases
>>>> where the page loads slowly, there is a fair amount of load on HBase.
>>>
>>> I've noticed this also, with 0.20.4-dev. I think others have mentioned it
>>> on the list from time to time. However, I can never seem to jump on to a
>>> console fast enough to grab a stack dump before the UI becomes responsive
>>> again. :-( It is not consistent behavior. It concerns me that perhaps
>>> whatever lock is holding up the UI is also holding up any client
>>> attempting to (re)locate a region. If I manage to capture it I will file
>>> a jira.
>>>
>>> - Andy
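
For anyone hitting the same failure Ferdy describes above, here is a minimal,
hypothetical Java sketch (not the actual Hadoop DU/DF code) of the kind of
subprocess call a datanode makes. The point is that Runtime.exec() has to fork
the JVM first, so with swap disabled and strict overcommit accounting the
kernel can refuse to fork a 1.6GB-heap process, even though the child will
immediately exec a tiny "df" binary:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;

    // Illustrative sketch only: shells out to "df" the way a large-heap
    // daemon might. On an overcommit-starved node the exec() call itself can
    // fail with "java.io.IOException: ... Cannot allocate memory" (ENOMEM
    // returned by fork), which is the error seen in the datanode logs.
    public class DfCheckSketch {
        public static void main(String[] args) throws IOException, InterruptedException {
            // Hypothetical mount point; a real datanode checks its configured data dirs.
            String dir = args.length > 0 ? args[0] : "/";
            Process p = Runtime.getRuntime().exec(new String[] {"df", "-k", dir});
            try (BufferedReader out = new BufferedReader(
                    new InputStreamReader(p.getInputStream()))) {
                String line;
                while ((line = out.readLine()) != null) {
                    System.out.println(line); // header plus the usage line for the mount
                }
            }
            int rc = p.waitFor();
            if (rc != 0) {
                System.err.println("df exited with " + rc);
            }
        }
    }

The kernel side of this is /proc/sys/vm/overcommit_memory (0 = heuristic
overcommit, 1 = always allow, 2 = strict accounting against swap plus a
fraction of RAM). Adding swap, as Ferdy did, or relaxing that setting gives
the short-lived fork the headroom it needs.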
