Yes, the limit is at 65535.
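
For reference, that is set via entries along these lines in
/etc/security/limits.conf (a sketch; the hdfs and hbase user names are
assumptions for a CDH-style package install, adjust to whichever users
run your daemons):

    # /etc/security/limits.conf: raise the max open file descriptors
    hdfs    -    nofile    65535
    hbase   -    nofile    65535

You can check what a running daemon actually got with "ulimit -n" as
that user, or by reading /proc/<pid>/limits for the running process.

/David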
On Sun, Feb 10, 2013 at 4:22 AM, Marcos Ortiz <[email protected]> wrote:
> Did you increase the number of open files in /etc/security/limits.conf
> on your systems?
>
> On 02/09/2013 09:17 PM, David Koch wrote:
>> Hello,
>>
>> Thank you for your reply. I checked the HDFS logs for error messages
>> indicative of "xceiver" problems but could not find any. The settings
>> suggested here:
>> http://blog.cloudera.com/blog/2012/03/hbase-hadoop-xceivers/ have been
>> applied on our cluster.
>>
>> I ran
>>
>> grep "File does not exist: /hbase/<table_name>/" /var/log/hadoop-hdfs/hadoop-cmf-hdfs1-NAMENODE-big* | wc
>>
>> on the namenode logs and there are millions of such lines for one
>> table only. The count is 0 for all other tables, even though they may
>> be reported as inconsistent by hbck.
>>
>> It seems this is less a performance issue than some stale "where to
>> find what data" problem, possibly related to ZooKeeper. I remember
>> there being some kind of procedure for clearing ZK, even though I
>> cannot recall the steps involved.
>>
>> Any further help would be appreciated.
>>
>> Thanks,
>>
>> /David
>>
>> On Sun, Feb 10, 2013 at 2:24 AM, Dhaval Shah <[email protected]> wrote:
>>
>>> Looking at your error messages, it seems like you need to increase
>>> the limit on the number of xceivers in the HDFS config.
>>>
>>> ------------------------------
>>> On Sun 10 Feb, 2013 6:37 AM IST David Koch wrote:
>>>
>>>> Hello,
>>>>
>>>> Lately, we have been having issues with Region Servers crashing in
>>>> our cluster. This happens in particular while running Map/Reduce
>>>> jobs over HBase tables, but also spontaneously when the cluster is
>>>> seemingly idle.
>>>>
>>>> Restarting the Region Servers, or even HBase entirely along with
>>>> the HDFS and Map/Reduce services, does not fix the problem, and
>>>> jobs fail on the next attempt citing "Region not served"
>>>> exceptions. It is not always the same nodes that crash.
>>>>
>>>> The logs during the minutes leading up to a crash contain many
>>>> "File does not exist /hbase/<table_name>/..." error messages, which
>>>> change to "Too many open files" messages. Finally, there are a few
>>>> "Failed to renew lease for DFSClient" messages followed by several
>>>> "FATAL" messages about HLog not being able to sync, and immediately
>>>> afterwards a terminal "ABORTING region server".
>>>>
>>>> You can find an extract of a Region Server log here:
>>>> http://pastebin.com/G39LQyQT
>>>>
>>>> Running "hbase hbck" reveals inconsistencies in some tables, but
>>>> attempting a repair with "hbase hbck -repair" stalls because some
>>>> regions are in transition, see here: http://pastebin.com/JAbcQ4cc
>>>>
>>>> The setup contains 30 machines with 26GB RAM each. The services are
>>>> managed using CDH4, so the HBase version is 0.92.x. We did not
>>>> tweak any of the default configuration settings; however, table
>>>> scans are done with sensible scan/batch/filter settings.
>>>>
>>>> Data intake is about 100GB/day, which is added at a time when no
>>>> Map/Reduce jobs are running. Tables have between 100 * 10^6 and
>>>> 2 * 10^9 rows, with an average of 10 KVs per row, about 1kb each.
>>>> Very few rows exceed 10^6 KVs.
>>>>
>>>> What can we do to fix these issues? Are they symptomatic of a
>>>> misconfigured setup, or of some critical threshold being reached?
>>>> The cluster used to be stable.
>>>>
>>>> Thank you,
>>>>
>>>> /David
>
> --
> Marcos Ortiz Valmaseda,
> Product Manager && Data Scientist at UCI
> Blog: http://marcosluis2186.posterous.com
> Twitter: @marcosluis2186 <http://twitter.com/marcosluis2186>
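
For reference, the xceiver ceiling discussed above is set in
hdfs-site.xml on the datanodes. A minimal sketch (on CDH4/Hadoop 2 the
property is dfs.datanode.max.transfer.threads, with the older
dfs.datanode.max.xcievers spelling from the Cloudera post kept as a
deprecated alias; 4096 is the value that post suggests, not one taken
from this cluster):

    <!-- hdfs-site.xml on each datanode; restart datanodes to apply -->
    <property>
      <name>dfs.datanode.max.transfer.threads</name>
      <value>4096</value>
    </property>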

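On the ZK-clearing procedure mentioned but not recalled above, a
commonly cited sequence is sketched below (unverified against
0.92/CDH4; it wipes HBase's coordination state in ZooKeeper, which
HBase rebuilds on restart, so stop HBase first; the service names
assume CDH packages and <zk_host> is a placeholder):

    # stop HBase everywhere before touching its znodes
    sudo service hbase-master stop
    sudo service hbase-regionserver stop    # on each regionserver

    # connect with the ZooKeeper CLI ("hbase zkcli" on newer releases)
    zookeeper-client -server <zk_host>:2181
    rmr /hbase
    quit

    # restart HBase; it recreates /hbase and reassigns regions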