Yes, the limit is at 65535.
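
For reference, that is set via entries along these lines in
/etc/security/limits.conf (a sketch; the hdfs and hbase user names are
assumptions for a CDH-style package install, adjust to whichever users
run your daemons):

    # /etc/security/limits.conf: raise the max open file descriptors
    hdfs    -    nofile    65535
    hbase   -    nofile    65535

You can check what a running daemon actually got with "ulimit -n" as
that user, or by reading /proc/<pid>/limits for the running process.

/David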
On Sun, Feb 10, 2013 at 4:22 AM, Marcos Ortiz <[email protected]> wrote:
> Did you increase the number of open files in /etc/security/limits.conf
> on your systems?
>
> On 02/09/2013 09:17 PM, David Koch wrote:
>> Hello,
>>
>> Thank you for your reply. I checked the HDFS logs for error messages
>> indicative of "xceiver" problems but could not find any. The settings
>> suggested here:
>> http://blog.cloudera.com/blog/2012/03/hbase-hadoop-xceivers/ have been
>> applied on our cluster.
>>
>> I ran
>>
>> grep "File does not exist: /hbase/<table_name>/" /var/log/hadoop-hdfs/hadoop-cmf-hdfs1-NAMENODE-big* | wc
>>
>> on the namenode logs and there are millions of such lines for one
>> table only. The count is 0 for all other tables, even though they may
>> be reported as inconsistent by hbck.
>>
>> It seems this is less a performance issue than some stale "where to
>> find what data" problem, possibly related to ZooKeeper. I remember
>> there being some kind of procedure for clearing ZK, even though I
>> cannot recall the steps involved.
>>
>> Any further help would be appreciated.
>>
>> Thanks,
>>
>> /David
>>
>> On Sun, Feb 10, 2013 at 2:24 AM, Dhaval Shah <[email protected]> wrote:
>>
>>> Looking at your error messages, it seems like you need to increase
>>> the limit on the number of xceivers in the HDFS config.
>>>
>>> ------------------------------
>>> On Sun 10 Feb, 2013 6:37 AM IST David Koch wrote:
>>>
>>>> Hello,
>>>>
>>>> Lately, we have been having issues with Region Servers crashing in
>>>> our cluster. This happens in particular while running Map/Reduce
>>>> jobs over HBase tables, but also spontaneously when the cluster is
>>>> seemingly idle.
>>>>
>>>> Restarting the Region Servers, or even HBase entirely along with
>>>> the HDFS and Map/Reduce services, does not fix the problem, and
>>>> jobs fail on the next attempt citing "Region not served"
>>>> exceptions. It is not always the same nodes that crash.
>>>>
>>>> The logs during the minutes leading up to a crash contain many
>>>> "File does not exist /hbase/<table_name>/..." error messages, which
>>>> change to "Too many open files" messages. Finally, there are a few
>>>> "Failed to renew lease for DFSClient" messages followed by several
>>>> "FATAL" messages about HLog not being able to sync, and immediately
>>>> afterwards a terminal "ABORTING region server".
>>>>
>>>> You can find an extract of a Region Server log here:
>>>> http://pastebin.com/G39LQyQT
>>>>
>>>> Running "hbase hbck" reveals inconsistencies in some tables, but
>>>> attempting a repair with "hbase hbck -repair" stalls because some
>>>> regions are in transition, see here: http://pastebin.com/JAbcQ4cc
>>>>
>>>> The setup contains 30 machines with 26GB RAM each. The services are
>>>> managed using CDH4, so the HBase version is 0.92.x. We did not
>>>> tweak any of the default configuration settings; however, table
>>>> scans are done with sensible scan/batch/filter settings.
>>>>
>>>> Data intake is about 100GB/day, which is added at a time when no
>>>> Map/Reduce jobs are running. Tables have between 100 * 10^6 and
>>>> 2 * 10^9 rows, with an average of 10 KVs per row, about 1kb each.
>>>> Very few rows exceed 10^6 KVs.
>>>>
>>>> What can we do to fix these issues? Are they symptomatic of a
>>>> misconfigured setup, or of some critical threshold being reached?
>>>> The cluster used to be stable.
>>>>
>>>> Thank you,
>>>>
>>>> /David
>
> --
> Marcos Ortiz Valmaseda,
> Product Manager && Data Scientist at UCI
> Blog: http://marcosluis2186.posterous.com
> Twitter: @marcosluis2186 <http://twitter.com/marcosluis2186>
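
For reference, the xceiver ceiling discussed above is set in
hdfs-site.xml on the datanodes. A minimal sketch (on CDH4/Hadoop 2 the
property is dfs.datanode.max.transfer.threads, with the older
dfs.datanode.max.xcievers spelling from the Cloudera post kept as a
deprecated alias; 4096 is the value that post suggests, not one taken
from this cluster):

    <!-- hdfs-site.xml on each datanode; restart datanodes to apply -->
    <property>
      <name>dfs.datanode.max.transfer.threads</name>
      <value>4096</value>
    </property>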

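On the ZK-clearing procedure mentioned but not recalled above, a
commonly cited sequence is sketched below (unverified against
0.92/CDH4; it wipes HBase's coordination state in ZooKeeper, which
HBase rebuilds on restart, so stop HBase first; the service names
assume CDH packages and <zk_host> is a placeholder):

    # stop HBase everywhere before touching its znodes
    sudo service hbase-master stop
    sudo service hbase-regionserver stop    # on each regionserver

    # connect with the ZooKeeper CLI ("hbase zkcli" on newer releases)
    zookeeper-client -server <zk_host>:2181
    rmr /hbase
    quit

    # restart HBase; it recreates /hbase and reassigns regions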