Hi Mark,

Thanks for the detailed information about NameNode failure and the High Availability feature.
Wish you all the best in your job search. Thanks again!

Regards,
Chandrash3khar Kotekar
Mobile - +91 8600011455

On Mon, Dec 15, 2014 at 6:29 AM, mark charts <[email protected]> wrote:

> "Prior to the Hadoop 2.x series, the NameNode was a single point of failure in an HDFS cluster — in other words, if the machine on which the single NameNode was configured became unavailable, the entire cluster would be unavailable until the NameNode could be restarted. This was bad news, especially in the case of unplanned outages, which could result in significant downtime if the cluster administrator weren't available to restart the NameNode.
>
> The solution to this problem is addressed by the HDFS High Availability feature. The idea is to run two NameNodes in the same cluster — one active NameNode and one hot standby NameNode. If the active NameNode crashes or needs to be stopped for planned maintenance, it can be quickly failed over to the hot standby NameNode, which now becomes the active NameNode.
>
> The key is to keep the standby node synchronized with the active node; this action is now accomplished by having both nodes access a shared NFS directory. All namespace changes on the active node are logged in the shared directory. The standby node picks up those changes from the directory and applies them to its own namespace. In this way, the standby NameNode acts as a current backup of the active NameNode. The standby node also has current block location information, because DataNode heartbeats are routinely sent to both active and standby NameNodes.
>
> To ensure that only one NameNode is the "active" node at any given time, configure a fencing process for the shared storage directory; then, during a failover, if it appears that the failed NameNode still carries the active state, the configured fencing process prevents that node from accessing the shared directory and permits the newly active node (the former standby node) to complete the failover.
>
> The machines that will serve as the active and standby NameNodes in your High Availability cluster should have equivalent hardware. The shared NFS storage directory, which must be accessible to both active and standby NameNodes, is usually located on a separate machine and can be mounted on each NameNode machine. To prevent this directory from becoming a single point of failure, configure multiple network paths to the storage directory, and ensure that there's redundancy in the storage itself. Use a dedicated network-attached storage (NAS) appliance to contain the shared storage directory." *sic*
>
> Courtesy of Dirk deRoos, Paul C. Zikopoulos, Bruce Brown, Rafael Coss, and Roman B. Melnyk.
>
> P.S. I am looking for work as a Hadoop Admin/Developer (I am an Electrical Engineer with an MSEE). A few months ago I implemented a six-node cluster at work, successfully and to real productivity gains (that's my claim to fame). I was laid off shortly afterwards; no correlation, I suspect. But I am in FL and willing to go anywhere to find contract or permanent work. If anyone knows of a position for a tenacious Hadoop engineer, I am interested.
>
> Thank you.
>
> Mark Charts
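For anyone who wants to try the setup described above: the property keys below are the documented Hadoop 2.x settings for NFS-based NameNode HA, as far as I know. In a real cluster they go into hdfs-site.xml on both NameNode machines; I am writing them through Hadoop's Java Configuration API only so the keys and values are easy to read in one place. The nameservice id ("mycluster"), host names, and NFS mount path are placeholders of mine.

    import org.apache.hadoop.conf.Configuration;

    public class HaConfigSketch {
        public static void main(String[] args) {
            Configuration conf = new Configuration();

            // One logical nameservice backed by two NameNodes.
            conf.set("dfs.nameservices", "mycluster");
            conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
            conf.set("dfs.namenode.rpc-address.mycluster.nn1", "nn1.example.com:8020");
            conf.set("dfs.namenode.rpc-address.mycluster.nn2", "nn2.example.com:8020");

            // Shared NFS directory: the active NameNode logs namespace edits
            // here and the standby reads them back to stay synchronized.
            conf.set("dfs.namenode.shared.edits.dir", "file:///mnt/filer/ha-edits");

            // Fencing: cut a failed active off from the shared directory
            // before the standby takes over.
            conf.set("dfs.ha.fencing.methods", "sshfence");
            conf.set("dfs.ha.fencing.ssh.private-key-files", "/home/hdfs/.ssh/id_rsa");

            // Lets HDFS clients discover which NameNode is currently active.
            conf.set("dfs.client.failover.proxy.provider.mycluster",
                    "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

            System.out.println("Shared edits dir: " + conf.get("dfs.namenode.shared.edits.dir"));
        }
    }

With this manual-failover setup, an administrator moves the active role between nn1 and nn2 with the hdfs haadmin tool.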
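On the fencing paragraph: the invariant being protected is that the shared edits directory has exactly one writer. As a toy illustration of that single-writer idea (my own sketch, not how Hadoop actually fences), picture an exclusive lock that a would-be active node must acquire before it touches the directory. File locks are notoriously unreliable over NFS, which is precisely why Hadoop relies on explicit fencing methods such as sshfence or a custom shell script instead.

    import java.io.RandomAccessFile;
    import java.nio.channels.FileLock;

    public class SharedDirGuard {
        public static void main(String[] args) throws Exception {
            // Toy single-writer guard; the lock file path is invented.
            RandomAccessFile raf =
                    new RandomAccessFile("/mnt/filer/ha-edits/active.lock", "rw");
            FileLock lock = raf.getChannel().tryLock();
            if (lock == null) {
                // Another process holds the lock: someone else is active.
                System.err.println("Another NameNode appears active; refusing to write.");
                System.exit(1);
            }
            System.out.println("Acquired writer lock; safe to log edits.");
            // ... write namespace edits while holding the lock ...
        }
    }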
On Sunday, December 14, 2014 5:30 PM, daemeon reiydelle <[email protected]> wrote:

> I found the terminology of primary and secondary a bit confusing for describing operation after a failure. Perhaps it is helpful to think of the Hadoop instance as being guided to select one node as primary for normal operation. If that node fails, the backup becomes the new primary. From traffic analysis, it appears that the restored node does not become primary again until the whole instance restarts. I myself would welcome clarification on this observed behavior.
>
> "Life should not be a journey to the grave with the intention of arriving safely in a pretty and well preserved body, but rather to skid in broadside in a cloud of smoke, thoroughly used up, totally worn out, and loudly proclaiming 'Wow! What a Ride!'" - Hunter Thompson
> Daemeon C.M. Reiydelle
> USA (+1) 415.501.0198
> London (+44) (0) 20 8144 9872
>
> On Fri, Dec 12, 2014 at 7:56 AM, Rich Haase <[email protected]> wrote:
>
> The remaining cluster services will continue to run. That way, when the NameNode (or any other failed process) is restored, the cluster will resume healthy operation. This is part of Hadoop's ability to handle network partition events.
>
> Rich Haase | Sr. Software Engineer | Pandora
> m 303.887.1146 | [email protected]
>
> From: Chandrashekhar Kotekar <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Friday, December 12, 2014 at 3:57 AM
> To: "[email protected]" <[email protected]>
> Subject: What happens to data nodes when name node has failed for long time?
>
> Hi,
>
> What happens if the NameNode has crashed for more than an hour, but the Secondary NameNode, all the DataNodes, the JobTracker, and the TaskTrackers are running fine? Do those daemon services also automatically shut down after some time, or do they keep running, hoping for the NameNode to come back?
>
> Regards,
> Chandrash3khar Kotekar
> Mobile - +91 8600011455
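Coming back to the original question at the bottom of this thread: as Rich says, the DataNodes, JobTracker, and TaskTrackers do not shut themselves down when the NameNode disappears; they keep retrying the connection and resume normal operation once it returns. The shape of that behavior is roughly the loop below (a toy sketch, not the actual daemon code; the host, port, and retry interval are invented):

    import java.net.InetSocketAddress;
    import java.net.Socket;

    public class WaitForNameNode {
        public static void main(String[] args) throws InterruptedException {
            InetSocketAddress nn = new InetSocketAddress("nn1.example.com", 8020);
            while (true) {
                try (Socket s = new Socket()) {
                    s.connect(nn, 5000); // 5-second connect timeout
                    System.out.println("NameNode reachable again; resume heartbeats.");
                    break;
                } catch (Exception e) {
                    // Still down: log, wait, and retry. No self-shutdown.
                    System.out.println("NameNode unreachable, retrying in 10s...");
                    Thread.sleep(10000);
                }
            }
        }
    }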
