I appear to have resolved the OOM error by greatly increasing the max process 
limit (to 64K). Under HDP 2.1 a limit of 1024 seemed to work fine, so I'm 
surprised I had to make a change of this magnitude.
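
For reference, the change amounts to something like the following sketch; it
assumes the region servers run as the hbase user, and 65536 is just the 64K
figure above, so adjust both for your environment:

  # /etc/security/limits.d/hbase-nproc.conf
  hbase   soft   nproc   65536
  hbase   hard   nproc   65536

followed by a restart of the region servers so they pick up the new limit.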

Brian

> On Dec 23, 2015, at 7:20 AM, Brian Jeltema <[email protected]> wrote:
> 
> Update on this:
> 
> deleting the contents of the /hbase-unsecure/region-in-transition node did 
> fix my problem with HBase finding my table regions.
> 
> I'm still having a problem though, possibly related. I'm seeing OutOfMemory 
> errors in the region server logs (modified slightly):
> 
> 2015-12-23 06:52:37,466 INFO  [RS_LOG_REPLAY_OPS-p7:60020-0] handler.HLogSplitterHandler: worker p7.foo.net,60020,1450871487168 done with task /hbase-unsecure/splitWAL/WALs%2Fp15.foo.net%2C60020%2C1450535337455-splitting%2Fp15.foo.net%252C60020%252C1450535337455.1450535339318 in 68348ms
> 2015-12-23 06:52:37,466 ERROR [RS_LOG_REPLAY_OPS-p7:60020-0] executor.EventHandler: Caught throwable while processing event RS_LOG_REPLAY
> java.lang.OutOfMemoryError: unable to create new native thread
>        at java.lang.Thread.start0(Native Method)
>        at java.lang.Thread.start(Thread.java:713)
>        at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:949)
>        at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1360)
>        at java.util.concurrent.ExecutorCompletionService.submit(ExecutorCompletionService.java:181)
>        at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter$LogRecoveredEditsOutputSink.close(HLogSplitter.java:1121)
>        at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter$LogRecoveredEditsOutputSink.finishWritingAndClose(HLogSplitter.java:1086)
>        at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:360)
>        at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:220)
>        at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:143)
>        at org.apache.hadoop.hbase.regionserver.handler.HLogSplitterHandler.process(HLogSplitterHandler.java:82)
>        at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
>        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>        at java.lang.Thread.run(Thread.java:744)
> 
> The region servers are configured with an 8G heap. I initially thought this 
> might be a ulimit problem, so I bumped the open file limit to about 10K and 
> the process limit up to 2048, but that did not seem to matter. What other 
> parameters might be causing an OOM error?
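> 
> (A note on this OOM flavor: "unable to create new native thread" means the
> OS refused to create a thread, typically because of the per-user process
> limit or native memory for thread stacks, rather than heap exhaustion, so
> the 8G heap isn't the issue. To see the limits the running process actually
> has, as opposed to what an interactive shell reports, something like this
> works on Linux; HRegionServer is just the region server's main class name:
> 
> cat /proc/$(pgrep -f HRegionServer | head -1)/limits
> 
> The limits in the service's startup environment can differ from a login
> shell's, which can make a bumped ulimit appear not to take.)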
> 
> Thanks
> Brian
> 
>> On Dec 22, 2015, at 12:46 PM, Brian Jeltema <[email protected]> wrote:
>> 
>>> 
>>> You should really find out where your HMaster UI lives (there is a master UI
>>> for every node provided by the Apache project) because it gives you
>>> information on the state of your system,
>> 
>> I’m familiar with the HMaster UI. I’m looking at it now. It does not contain
>> the information you describe. There is a list of region servers and a menu
>> bar that contains: Home, Table Details, Local Logs, Debug Dump, Metrics
>> Dump, HBase Configuration.
>> 
>> If I click on the Table Details item, I get a list of the tables. If I click 
>> on a table, there is a Tasks section that says
>> "No tasks currently running on this node."
>> 
>> The region server logs do not contain any records relating to RITs, or 
>> really even regions.
>> The master UI does not contain any information about RITs.
>> Version: HDP 2.2 -> HBase 0.98.4
>> 
>> The ZooKeeper node /hbase-unsecure/region-in-transition contains a long 
>> list of items that are not removed when I restart the service. I think 
>> this is a side-effect of problems I had when I did the HDP 2.1 -> HDP 2.2 
>> upgrade, which did not go well. 
>> 
>> I would like to remove or clear the /hbase-unsecure/region-in-transition node
>> as an experiment. I’m just looking for guidance on whether that is a safe 
>> thing to do.
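>> 
>> Concretely, what I have in mind is along these lines (a sketch; rmr is the
>> recursive delete in the zookeeper-client shell, and the path is the one on
>> my unsecured HDP install):
>> 
>> # with HBase stopped
>> zookeeper-client
>>   ls /hbase-unsecure/region-in-transition    (inspect the entries first)
>>   rmr /hbase-unsecure/region-in-transition   (recursively delete the node)
>>   quit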
>> 
>> Brian
>> 
>>> but if you want to skip all that,
>>> here are the instructions for OfflineMetaRepair. Without knowing what is
>>> happening with your system (logs, master UI info) you can try this, but at
>>> your own risk.
>>> 
>>> OfflineMetaRepair.
>>> Description (from the class javadoc):
>>> This code is used to rebuild meta offline from file system data. If any
>>> problems are detected, it will fail, suggesting actions for the user to
>>> take to "fix" them. If it succeeds, it will back up the previous
>>> hbase:meta and -ROOT- dirs and write new tables in place.
>>> 
>>> Stop HBase
>>> zookeeper-client rmr /hbase
>>> HADOOP_USER_NAME=hbase hbase org.apache.hadoop.hbase.util.hbck.OfflineMetaRepair
>>> Start HBase
>>> 
>>> ^ This has worked for me in some situations where I understood that HDFS and
>>> ZooKeeper disagreed on region locations, but keep in mind I have only tried
>>> this on HBase 1.0.0 and your mileage may vary.
>>> 
>>> We don't have your HBase version (you can find this from the hbase shell;
>>> see below)
>>> We don't have log messages
>>> We don't have the master's view of your RITs
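>>> 
>>> (On the version: for example, either of these should print it:
>>> hbase version
>>> echo "version" | hbase shell )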
>>> 
>>> 
>>> On Tue, Dec 22, 2015 at 11:52 AM, Brian Jeltema <[email protected]> wrote:
>>> 
>>>> I’m running Ambari 2.0.2 and HDP 2.2. I don’t see any of this displayed at
>>>> master:60010.
>>>> 
>>>> I really think this problem is the result of cruft in ZooKeeper. Does
>>>> anybody know
>>>> if it’s safe to delete the node?
>>>> 
>>>> 
>>>>> On Dec 22, 2015, at 11:40 AM, Geovanie Marquez <
>>>> [email protected]> wrote:
>>>>> 
>>>>> Check hmaster:60010 under TASKS (between Software Attributes and Tables);
>>>>> you will see if you have regions in transition. This will tell you which
>>>>> regions are transitioning, and you can go to those region server logs and
>>>>> check them. I've run into a couple of these, and every time the logs have
>>>>> told me what the problem was.
>>>>> 
>>>>> Also, under Software Attributes you can check the HBase version.
>>>>> 
>>>>> On Tue, Dec 22, 2015 at 11:29 AM, Ted Yu <[email protected]> wrote:
>>>>> 
>>>>>> From RegionListTmpl.jamon :
>>>>>> 
>>>>>> <%if (onlineRegions != null && onlineRegions.size() > 0) %>
>>>>>> ...
>>>>>> <%else>
>>>>>> <p>Not serving regions</p>
>>>>>> </%if>
>>>>>> 
>>>>>> The message means that there was no region online on the underlying
>>>> server.
>>>>>> 
>>>>>> FYI
>>>>>> 
>>>>>> On Tue, Dec 22, 2015 at 7:18 AM, Brian Jeltema <[email protected]>
>>>>>> wrote:
>>>>>> 
>>>>>>> Following up, if I look at the HBase Master UI in the Ambari console I
>>>>>>> see links to all of the region servers. If I click on those links, the
>>>>>>> Region Server page comes up and in the Regions section it displays
>>>>>>> "Not serving regions". I'm not sure if that means something is
>>>>>>> disabled, or it just doesn't have any regions to serve.
>>>>>>> 
>>>>>>>> On Dec 22, 2015, at 6:19 AM, Brian Jeltema <[email protected]>
>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Can you pick a few regions stuck in transition and check related
>>>>>> region
>>>>>>>>> server logs to see why they couldn't be assigned ?
>>>>>>>> 
>>>>>>>> I don’t see anything in the region server logs relating to any regions.
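>>>>>>>> (I searched along these lines; the log location is the HDP default
>>>>>>>> on my nodes, so treat the path as an assumption:
>>>>>>>> grep -iE 'transition|assign' /var/log/hbase/*regionserver*.log
>>>>>>>> and nothing relevant turned up.)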
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Which release were you using previously ?
>>>>>>>> 
>>>>>>>> HDP 2.1 -> HDP 2.2
>>>>>>>> 
>>>>>>>> So is it safe to stop HBase and delete the ZK node?
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Thanks
>>>>>>>>> 
>>>>>>>>> On Mon, Dec 21, 2015 at 3:54 PM, Brian Jeltema <[email protected]>
>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> I am doing a cluster upgrade to the HDP 2.2 stack. For some reason,
>>>>>>>>>> after the upgrade HBase cannot find any regions for existing tables.
>>>>>>>>>> I believe the HDFS file system is OK. But looking at the ZooKeeper
>>>>>>>>>> nodes, I noticed that many (maybe all) of the regions were listed in
>>>>>>>>>> the ZooKeeper /hbase-unsecure/region-in-transition node. I suspect
>>>>>>>>>> this could be causing a problem. Is it safe to stop HBase and delete
>>>>>>>>>> that node?
>>>>>>>>>> 
>>>>>>>>>> Thanks
>>>>>>>>>> Brian
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>> 
>>>> 
>> 
>> 
> 
