On Sun, Feb 17, 2013 at 5:08 PM, Harsh J <[email protected]> wrote:
> Hi Robert,
>
> Are you by any chance adding files carrying unusual encoding?

I don't believe so. The only files I push to HDFS are SequenceFiles (with
protobuf objects in them) and HBase's regions, which again are just protobuf
objects. I don't use any special encodings in the protobufs.

> If it's possible, can we be sent a bundle of the corrupted log set (all of
> the dfs.name.dir contents) to inspect what seems to be causing the
> corruption?

I can give the logs, dfs data dir(s), and 2nn dirs.

https://www.dropbox.com/s/heijq65pmb3esvd/hdfs-bug.tar.gz

> The only identified (but rarely occurring) bug around this part in 1.0.4
> would be https://issues.apache.org/jira/browse/HDFS-4423. The other major
> corruption bug I know of is already fixed in your version, being
> https://issues.apache.org/jira/browse/HDFS-3652 specifically.
>
> We've not had this report from other users, so having a reproduced file
> set (data not required) would be most helpful. If you have logs leading
> to the shutdown and crash as well, that'd be good to have too.
>
> P.S. How exactly are you shutting down the NN each time? A kill -9 or a
> regular SIGTERM shutdown?

I shut down the NN with 'bin/stop-dfs.sh'.

> On Mon, Feb 18, 2013 at 4:31 AM, Robert Dyer <[email protected]> wrote:
> > On Sun, Feb 17, 2013 at 4:41 PM, Mohammad Tariq <[email protected]> wrote:
> >> You can make use of the offline image viewer to diagnose the fsimage file.
> >
> > Is this not included in the 1.0.x branch? All of the documentation I find
> > for it says to run 'bin/hdfs oev' but I do not have a 'bin/hdfs'.
> >
> >> Warm Regards,
> >> Tariq
> >> https://mtariq.jux.com/
> >> cloudfront.blogspot.com
> >>
> >> On Mon, Feb 18, 2013 at 3:31 AM, Robert Dyer <[email protected]> wrote:
> >>> It just happened again. This was after a fresh format of HDFS/HBase, and
> >>> I am attempting to re-import the (backed-up) data.
> >>>
> >>> http://pastebin.com/3fsWCNQY
> >>>
> >>> So now if I restart the namenode, I will lose data from the past 3 hours.
> >>>
> >>> What is causing this? How can I avoid it in the future? Is there an easy
> >>> way to monitor (other than a script grep'ing the logs) the checkpoints to
> >>> see when this happens?
> >>>
> >>> On Sat, Feb 16, 2013 at 2:39 PM, Robert Dyer <[email protected]> wrote:
> >>>> Forgot to mention: Hadoop 1.0.4
> >>>>
> >>>> On Sat, Feb 16, 2013 at 2:38 PM, Robert Dyer <[email protected]> wrote:
> >>>>> I am at a bit of wits' end here. Every single time I restart the
> >>>>> namenode, I get this crash:
> >>>>>
> >>>>> 2013-02-16 14:32:42,616 INFO org.apache.hadoop.hdfs.server.common.Storage: Image file of size 168058 loaded in 0 seconds.
> >>>>> 2013-02-16 14:32:42,618 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NullPointerException
> >>>>>     at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1099)
> >>>>>     at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1111)
> >>>>>     at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:1014)
> >>>>>     at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:208)
> >>>>>     at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:631)
> >>>>>     at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1021)
> >>>>>     at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:839)
> >>>>>     at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:377)
> >>>>>     at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:100)
> >>>>>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:388)
> >>>>>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:362)
> >>>>>     at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:276)
> >>>>>     at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:496)
> >>>>>     at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1279)
> >>>>>     at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1288)
> >>>>>
> >>>>> I am following best practices here, as far as I know. I have the
> >>>>> namenode writing into 3 directories (2 local, 1 NFS). All 3 of these
> >>>>> dirs have the exact same files in them.
> >>>>>
> >>>>> I also run a secondary checkpoint node. This one appears to have
> >>>>> started failing a week ago, so checkpoints were *not* being done since
> >>>>> then. Thus I can get the NN up and running, but with week-old data!
> >>>>>
> >>>>> What is going on here? Why does my NN data *always* wind up causing
> >>>>> this exception over time? Is there some easy way to get notified when
> >>>>> the checkpointing starts to fail?
>
> --
> Harsh J

--
Robert Dyer
[email protected]
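
On the question above about getting notified when checkpointing stops: rather
than grepping the 2NN logs, one option is to alarm on the age of the fsimage
file itself. Below is a minimal cron-able sketch, assuming the Hadoop 1.x
layout where dfs.name.dir/current/fsimage is rewritten only when a checkpoint
completes; the path and the two-hour threshold are placeholders to adjust for
your own setup.

    #!/bin/sh
    # Sketch of a checkpoint-staleness check (assumes Hadoop 1.x, where
    # dfs.name.dir/current/fsimage is only replaced when the secondary
    # namenode completes a checkpoint). Path and threshold are placeholders.
    FSIMAGE=/path/to/dfs/name/current/fsimage
    MAX_AGE_MINUTES=120

    # 'find -mmin -N' prints the path only if the file was modified within
    # the last N minutes; empty output means the image is stale (or missing).
    if [ -z "$(find "$FSIMAGE" -mmin -"$MAX_AGE_MINUTES" 2>/dev/null)" ]; then
      echo "WARNING: $FSIMAGE not updated in the last $MAX_AGE_MINUTES minutes;"
      echo "the secondary namenode may have stopped checkpointing."
      exit 1
    fi

Run from cron, the warning output is mailed to the crontab owner on most
setups, which would have flagged a week of missed checkpoints long before a
restart was needed.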
