I appear to have resolved the OOM error by greatly increasing the max process limit (to 64K). Under HDP 2.1, a limit of 1024 seemed to work OK, so I'm surprised I had to make a change of this magnitude.
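For the record, the change looked roughly like this on my nodes. The limits.d file name is my own choice, and I'm assuming the region servers run as the hbase user; adjust for your installation:

    # /etc/security/limits.d/hbase-nproc.conf  (assumed path and name)
    hbase  soft  nproc  65536
    hbase  hard  nproc  65536

    # verify from a fresh login (assumes the hbase account has a shell):
    su - hbase -c 'ulimit -u'

Note that the new limit only applies to processes started after the change, so the region servers need a restart to pick it up.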
Brian

> On Dec 23, 2015, at 7:20 AM, Brian Jeltema <[email protected]> wrote:
>
> Update on this:
>
> Deleting the contents of the /hbase-unsecure/region-in-transition node did
> fix my problem with HBase finding my table regions.
>
> I'm still having a problem, though, possibly related. I'm seeing OutOfMemory
> errors in the region server logs (modified slightly):
>
> 2015-12-23 06:52:37,466 INFO [RS_LOG_REPLAY_OPS-p7:60020-0] handler.HLogSplitterHandler: worker p7.foo.net,60020,1450871487168 done with task /hbase-unsecure/splitWAL/WALs%2Fp15.foo.net%2C60020%2C1450535337455-splitting%2Fp15.foo.net%252C60020%252C1450535337455.1450535339318 in 68348ms
> 2015-12-23 06:52:37,466 ERROR [RS_LOG_REPLAY_OPS-p7:60020-0] executor.EventHandler: Caught throwable while processing event RS_LOG_REPLAY
> java.lang.OutOfMemoryError: unable to create new native thread
>     at java.lang.Thread.start0(Native Method)
>     at java.lang.Thread.start(Thread.java:713)
>     at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:949)
>     at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1360)
>     at java.util.concurrent.ExecutorCompletionService.submit(ExecutorCompletionService.java:181)
>     at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter$LogRecoveredEditsOutputSink.close(HLogSplitter.java:1121)
>     at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter$LogRecoveredEditsOutputSink.finishWritingAndClose(HLogSplitter.java:1086)
>     at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:360)
>     at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:220)
>     at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:143)
>     at org.apache.hadoop.hbase.regionserver.handler.HLogSplitterHandler.process(HLogSplitterHandler.java:82)
>     at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:744)
>
> The region servers are configured with an 8G heap. I initially thought this
> might be a ulimit problem, so I bumped the open file limit to about 10K and
> the process limit up to 2048, but that did not seem to matter. What other
> parameters might be causing an OOM error?
>
> Thanks
> Brian
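(A note for the archives: when a raised limit "doesn't seem to matter", it's worth confirming the running process actually picked it up, since limits are read at process start. A quick check on a Linux node; the jps lookup is just one way to find the pid, and is an assumption about your environment:

    # find the region server pid (use ps if jps isn't on the PATH)
    RS_PID=$(jps | awk '/HRegionServer/ {print $1}')
    # show the limits the live process is actually running under
    grep -E 'processes|open files' /proc/$RS_PID/limits

This at least tells you which values the WAL-splitter threads are being created under.)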
>> On Dec 22, 2015, at 12:46 PM, Brian Jeltema <[email protected]> wrote:
>>
>>> You should really find out where your HMaster UI lives (there is a master
>>> UI for every node provided by the apache project) because it gives you
>>> information on the state of your system,
>>
>> I'm familiar with the HMaster UI. I'm looking at it now. It does not contain
>> the information you describe. There is a list of region servers and a menu
>> bar that contains: Home, Table Details, Local Logs, Debug Dump, Metrics
>> Dump, HBase Configuration.
>>
>> If I click on the Table Details item, I get a list of the tables. If I click
>> on a table, there is a Tasks section that says "No tasks currently running
>> on this node."
>>
>> The region server logs do not contain any records relating to RITs, or
>> really even regions. The master UI does not contain any information about
>> RITs.
>> Version: HDP 2.2 -> HBase 0.98.4
>>
>> The ZooKeeper node /hbase-unsecure/region-in-transition contains a long
>> list of items that are not removed when I restart the service. I think this
>> is a side-effect of problems I had when I did the HDP 2.1 -> HDP 2.2
>> upgrade, which did not go well.
>>
>> I would like to remove or clear the /hbase-unsecure/region-in-transition
>> node as an experiment. I'm just looking for guidance on whether that is a
>> safe thing to do.
>>
>> Brian
>>
>>> but if you want to skip all that, here are the instructions for
>>> OfflineMetaRepair; without knowing what is happening with your system
>>> (logs, master UI info) you can try this, but at your own risk.
>>>
>>> OfflineMetaRepair. Description below:
>>> This code is used to rebuild meta offline from file system data. If there
>>> are any problems detected, it will fail, suggesting actions for the user
>>> to take to "fix" the problems. If it succeeds, it will back up the
>>> previous hbase:meta and -ROOT- dirs and write new tables in place.
>>>
>>> Stop HBase
>>> zookeeper-client rmr /hbase
>>> HADOOP_USER_NAME=hbase hbase org.apache.hadoop.hbase.util.hbck.OfflineMetaRepair
>>> start hbase
>>>
>>> ^ This has worked for me in some situations where I understood HDFS and
>>> ZooKeeper disagreed on region locations, but keep in mind I have tried
>>> this on HBase 1.0.0 and your mileage may vary.
>>>
>>> We don't have your HBase version (you can even find this in the hbase shell)
>>> We don't have log msgs
>>> We don't have the master's view of your RITs
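(Inline note: what ultimately worked for me was narrower than the full OfflineMetaRepair above. With HBase stopped, I cleared just the region-in-transition znode. A sketch follows; the node name matches what Ambari uses for an unsecured cluster, so adjust it if your zookeeper.znode.parent differs:

    # with HBase stopped, look at what's there first
    zookeeper-client ls /hbase-unsecure/region-in-transition
    # then remove only the stale RIT node, not the whole /hbase-unsecure tree
    zookeeper-client rmr /hbase-unsecure/region-in-transition
    # restart HBase and let the master reassign regions

The master should recreate the node on startup, so this is much less invasive than rebuilding meta.)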
>>> On Tue, Dec 22, 2015 at 11:52 AM, Brian Jeltema <[email protected]> wrote:
>>>
>>>> I'm running Ambari 2.0.2 and HDP 2.2. I don't see any of this displayed
>>>> at master:60010.
>>>>
>>>> I really think this problem is the result of cruft in ZooKeeper. Does
>>>> anybody know if it's safe to delete the node?
>>>>
>>>>> On Dec 22, 2015, at 11:40 AM, Geovanie Marquez <[email protected]> wrote:
>>>>>
>>>>> Check hmaster:60010 under Tasks (between Software Attributes and Tables);
>>>>> you will see if you have regions in transition. This will tell you which
>>>>> regions are transitioning, and you can go to those region server logs
>>>>> and check them. I've run into a couple of these, and every time the logs
>>>>> have told me about the problem.
>>>>>
>>>>> Also, under Software Attributes you can check the HBase version.
>>>>>
>>>>> On Tue, Dec 22, 2015 at 11:29 AM, Ted Yu <[email protected]> wrote:
>>>>>
>>>>>> From RegionListTmpl.jamon:
>>>>>>
>>>>>> <%if (onlineRegions != null && onlineRegions.size() > 0) %>
>>>>>> ...
>>>>>> <%else>
>>>>>> <p>Not serving regions</p>
>>>>>> </%if>
>>>>>>
>>>>>> The message means that there was no region online on the underlying
>>>>>> server.
>>>>>>
>>>>>> FYI
>>>>>>
>>>>>> On Tue, Dec 22, 2015 at 7:18 AM, Brian Jeltema <[email protected]> wrote:
>>>>>>
>>>>>>> Following up: if I look at the HBase Master UI in the Ambari console,
>>>>>>> I see links to all of the region servers. If I click on those links,
>>>>>>> the Region Server page comes up, and in the Regions section it
>>>>>>> displays "Not serving regions". I'm not sure if that means something
>>>>>>> is disabled, or it just doesn't have any regions to serve.
>>>>>>>
>>>>>>>> On Dec 22, 2015, at 6:19 AM, Brian Jeltema <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Can you pick a few regions stuck in transition and check related
>>>>>>>>> region server logs to see why they couldn't be assigned?
>>>>>>>>
>>>>>>>> I don't see anything in the region server logs relating to any
>>>>>>>> regions.
>>>>>>>>
>>>>>>>>> Which release were you using previously?
>>>>>>>>
>>>>>>>> HDP 2.1 -> HDP 2.2
>>>>>>>>
>>>>>>>> So is it safe to stop HBase and delete the ZK node?
>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>> On Mon, Dec 21, 2015 at 3:54 PM, Brian Jeltema <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> I am doing a cluster upgrade to the HDP 2.2 stack. For some reason,
>>>>>>>>>> after the upgrade HBase cannot find any regions for existing
>>>>>>>>>> tables. I believe the HDFS file system is OK. But looking at the
>>>>>>>>>> ZooKeeper nodes, I noticed that many (maybe all) of the regions
>>>>>>>>>> were listed in the ZooKeeper /hbase-unsecure/region-in-transition
>>>>>>>>>> node. I suspect this could be causing a problem. Is it safe to stop
>>>>>>>>>> HBase and delete that node?
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Brian
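(Closing note for the archives: before deleting anything in ZooKeeper, hbck is a cheap way to get a consistency report of regions across meta and the filesystem. Run as the hbase user, following the same convention as the OfflineMetaRepair command above:

    # read-only consistency report; -details lists every region
    HADOOP_USER_NAME=hbase hbase hbck -details

Plain hbck only reports; it doesn't change anything unless you pass one of the -fix options.)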
