I don't see any prior HDFS issues in the 15 minutes before this exception. The logs on the datanode reported as problematic are clean as well.
However, I now see the log is full of errors like this:

2012-03-28 00:15:05,358 DEBUG org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: Processing close of gs_users,731481|Sn+xKryLzdodzMFK0CjKvA==,1331226388691.29929cb2200b3541ead85e17b836ade5.
2012-03-28 00:15:05,359 WARN org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: Error getting node's version in CLOSING state, aborting close of gs_users,731481|Sn+xKryLzdodzMFK0CjKvA==,1331226388691.29929cb2200b3541ead85e17b836ade5.
-eran

On Wed, Mar 28, 2012 at 18:38, Jean-Daniel Cryans <[email protected]> wrote:

> Any chance we can see what happened before that too? Usually you
> should see a lot more HDFS spam before getting that all the datanodes
> are bad.
>
> J-D
>
> On Wed, Mar 28, 2012 at 4:28 AM, Eran Kutner <[email protected]> wrote:
> > Hi,
> >
> > We have region server sporadically stopping under load due supposedly to
> > errors writing to HDFS. Things like:
> >
> > 2012-03-28 00:37:11,210 WARN org.apache.hadoop.hdfs.DFSClient: Error while syncing
> > java.io.IOException: All datanodes 10.1.104.10:50010 are bad. Aborting..
> >
> > It's happening with a different region server and data node every time, so
> > it's not a problem with one specific server and there doesn't seem to be
> > anything really wrong with either of them. I've already increased the file
> > descriptor limit, datanode xceivers and data node handler count. Any idea
> > what can be causing these errors?
> >
> >
> > A more complete log is here: http://pastebin.com/wC90xU2x
> >
> > Thanks.
> >
> > -eran
