Thanks for writing back. I guess you meant 'things are now operating well', below :-)
On Thu, Apr 5, 2012 at 6:25 AM, Eran Kutner <[email protected]> wrote:
> As promised I'm writing back to update the list.
> Seems that after upgrading the hadoop cluster and zookeeper ensemble to
> cdh3u3 (hadoop alone wasn't enough) things are no operating well, with no
> HDFS errors in the logs. I've also set
> hbase.regionserver.logroll.errors.tolerated to 3 just in case. Now that
> the log is clean a new exception shows up, but I'll open a separate
> thread about it.
>
> Thanks everyone.
>
> -eran
>
>
> On Wed, Mar 28, 2012 at 23:06, Eran Kutner <[email protected]> wrote:
>
> > hmmm... I couldn't find it either, so I've looked at the history of
> > that file, and sure enough a few check-ins back it had that message.
> > I have no idea how something like this could happen. I know I had some
> > merge issues when I first got the latest version and built that
> > project, but I've since reverted all local changes and rebuilt. The
> > only thing I can imagine is that the previously compiled class file
> > was not modified and was the one that got included in the JAR,
> > although I don't really know how that can happen.
> >
> > -eran
> >
> >
> > On Wed, Mar 28, 2012 at 18:53, Ted Yu <[email protected]> wrote:
> >
> >> Eran:
> >> The error indicated some zookeeper-related issue.
> >> Do you see a KeeperException after the Error log?
> >>
> >> I searched the 0.90 codebase but couldn't find the exact log phrase:
> >>
> >> zhihyu$ find src/main -name '*.java' -exec grep "getting node's version in CLOSI" {} \; -print
> >> zhihyu$ find src/main -name '*.java' -exec grep 'Error getting ' {} \; -print
> >>
> >> Cheers
> >>
> >> On Wed, Mar 28, 2012 at 9:45 AM, Eran Kutner <[email protected]> wrote:
> >>
> >> > I don't see any prior HDFS issues in the 15 minutes before this
> >> > exception. The logs on the datanode reported as problematic are
> >> > clean as well.
> >> > However, I now see the log is full of errors like this:
> >> > 2012-03-28 00:15:05,358 DEBUG org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: Processing close of gs_users,731481|Sn쒪㝨眳ԫ䂣⫰==,1331226388691.29929cb2200b3541ead85e17b836ade5.
> >> > 2012-03-28 00:15:05,359 WARN org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler: Error getting node's version in CLOSING state, aborting close of gs_users,731481|Sn쒪㝨眳ԫ䂣⫰==,1331226388691.29929cb2200b3541ead85e17b836ade5.
> >> >
> >> > -eran
> >> >
> >> >
> >> > On Wed, Mar 28, 2012 at 18:38, Jean-Daniel Cryans <[email protected]> wrote:
> >> >
> >> > > Any chance we can see what happened before that too? Usually you
> >> > > should see a lot more HDFS spam before getting that all the
> >> > > datanodes are bad.
> >> > >
> >> > > J-D
> >> > >
> >> > > On Wed, Mar 28, 2012 at 4:28 AM, Eran Kutner <[email protected]> wrote:
> >> > > > Hi,
> >> > > >
> >> > > > We have region servers sporadically stopping under load,
> >> > > > supposedly due to errors writing to HDFS. Things like:
> >> > > >
> >> > > > 2012-03-28 00:37:11,210 WARN org.apache.hadoop.hdfs.DFSClient: Error while syncing
> >> > > > java.io.IOException: All datanodes 10.1.104.10:50010 are bad. Aborting..
> >> > > >
> >> > > > It's happening with a different region server and data node
> >> > > > every time, so it's not a problem with one specific server, and
> >> > > > there doesn't seem to be anything really wrong with either of
> >> > > > them. I've already increased the file descriptor limit, the
> >> > > > datanode xceiver count and the datanode handler count. Any idea
> >> > > > what could be causing these errors?
> >> > > >
> >> > > > A more complete log is here: http://pastebin.com/wC90xU2x
> >> > > >
> >> > > > Thanks.
> >> > > >
> >> > > > -eran
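[Archive note: for readers landing on this thread later, here is a sketch of where the settings discussed above live. Property names are from CDH3-era Hadoop/HBase; the values are illustrative, not recommendations.]

```xml
<!-- hbase-site.xml: tolerate a few failed WAL rolls before the region
     server aborts (the setting Eran mentions setting to 3). -->
<property>
  <name>hbase.regionserver.logroll.errors.tolerated</name>
  <value>3</value>
</property>

<!-- hdfs-site.xml: the datanode transceiver limit (note the historical
     misspelling "xcievers") and handler count from the original post. -->
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
</property>
<property>
  <name>dfs.datanode.handler.count</name>
  <value>10</value>
</property>
```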

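[Archive note: the "file descriptor limit" increase mentioned in the original post is usually checked and persisted like this; the usernames and the 32768 value are illustrative, not recommendations.]

```shell
# Show the open-file limits for the current shell (the DataNode and
# RegionServer inherit the limits of the user that starts them).
ulimit -n    # soft limit
ulimit -Hn   # hard limit

# To persist a higher limit, add lines like these to
# /etc/security/limits.conf and re-login:
#   hdfs   -  nofile  32768
#   hbase  -  nofile  32768
```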