Just looked into HDFS-630 and it looks like it was added in CDH2 0.20.1+169.89, while we're currently on 0.20.1+169.68. Would updating to that release, so we have the patch, help prevent some of these issues?
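In case it's useful, this is how I checked the build level on each node
(a quick check, assuming the stock CDH packaging; the string after the
'+' is Cloudera's patch level):

    # Print the Hadoop build this node is actually running; CDH appends
    # its patch level to the Apache version, e.g. "0.20.1+169.68".
    $> hadoop version

    # On an RPM-based install, the package manager gives the same answer:
    $> rpm -q hadoop-0.20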
Thanks,

On 4 July 2010 18:12, Dan Harvey <[email protected]> wrote:
> Hey,
>
> We're using stock CDH2 without any patches, so I'm not sure if we have
> HDFS-630 or not. For HBase we're currently on 0.20.3 and will be
> testing and moving to 0.20.5 soon.
>
> What I did with this rollout of just config changes was take one
> region server down at a time and restart the datanode on the same
> server. So from what I gather, I should have shut down all the region
> servers before restarting any of the datanodes?
>
> I guess if I split it into different parts it would be :-
>
> - HBase: rolling updates for point/config releases are supported
>   - Update masters first
>   - Then update region servers in turn
>
> - HDFS: datanodes don't support rolling updates? (Maybe better asked
>   on the hdfs list, I guess)
>   - Take down HBase
>   - Take down the datanodes
>   - Update all the datanodes' code/configs
>   - Start the datanodes
>   - Start HBase
>
> Would you be able to let me know which of these I've got right/wrong?
>
> Thanks,
>
> On 29 June 2010 15:50, Michael Segel <[email protected]> wrote:
>>
>> Dan,
>>
>> I don't think you can do that, because your 'new/updated' node will
>> clash with the rest of the cloud. (We're talking code and not just
>> cloud tuning parameters.) [Read: different jars...]
>>
>> If you're going to push an update out, then it has to be an 'all or
>> nothing' push.
>>
>> Since we're using Cloudera's release, moving from CDH2 to CDH3
>> represents a full backup, taking the cloud down, removing the
>> software completely, and then installing the new CDH3. Outside of
>> that major switch, if we were going from one sub-release to another,
>> it would be just a $> yum update hadoop-0.20 call on each node.
>> Again, you have to take the cloud down to do that.
>>
>> So the bottom line... if you're going to do upgrades, you'll need to
>> plan for some downtime.
>>
>> HTH
>>
>> -Mike
>>
>> > From: [email protected]
>> > Date: Tue, 29 Jun 2010 14:43:26 +0100
>> > Subject: Rolling out Hadoop/HBase updates
>> > To: [email protected]
>> >
>> > Hey,
>> >
>> > I've been thinking about how we do our configuration and code
>> > updates for Hadoop and HBase, and was wondering what others do and
>> > what the best practice is for avoiding errors with HBase.
>> >
>> > Currently we do a rolling update where we restart the services on
>> > one node at a time: shutting down the region server, then
>> > restarting the datanode and task tracker, depending on what we are
>> > updating and what has changed. But with this I have occasionally
>> > found errors with the HBase cluster afterwards due to a corrupt
>> > META table, which I think could have been caused by restarting the
>> > datanode, or maybe by not waiting long enough for the cluster to
>> > sort out losing a region server before moving on to the next node.
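(Interjecting on my own question above: pulling Michael's 'all or
nothing' advice together, I think the full-stop sequence would look
roughly like this. A sketch only; it assumes CDH's packaged init
scripts, whose exact service names may differ, plus the stock HBase
bin/ scripts.)

    # 1. Stop HBase cluster-wide first, from the HBase master:
    $> $HBASE_HOME/bin/stop-hbase.sh

    # 2. On every node: stop the datanode, update, and restart it.
    $> /etc/init.d/hadoop-0.20-datanode stop
    $> yum update hadoop-0.20
    $> /etc/init.d/hadoop-0.20-datanode start

    # 3. Once HDFS looks healthy again, bring HBase back up:
    $> hadoop fsck /hbase
    $> $HBASE_HOME/bin/start-hbase.sh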
>> >
>> > The most recent error upon restarting a node was :-
>> >
>> > 2010-06-29 10:46:44,970 ERROR
>> > org.apache.hadoop.hbase.regionserver.HRegionServer: Error closing
>> > files,3822b1ea8ae015f3ec932cafaa282dd211d768ad,1275145898366
>> > java.io.IOException: Filesystem closed
>> >         at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:230)
>> >
>> > 2010-06-29 10:46:44,970 FATAL
>> > org.apache.hadoop.hbase.regionserver.HRegionServer: Shutting down
>> > HRegionServer: file system not available
>> > java.io.IOException: File system is not available
>> >         at org.apache.hadoop.hbase.util.FSUtils.checkFileSystemAvailable(FSUtils.java:129)
>> >
>> > Followed by this for every region being served :-
>> >
>> > 2010-06-29 10:46:44,996 ERROR
>> > org.apache.hadoop.hbase.regionserver.HRegionServer: Error closing
>> > documents,082595c0-6d01-11df-936c-0026b95e484c,1275676410202
>> > java.io.IOException: Filesystem closed
>> >         at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:230)
>> >
>> > After updating all the nodes, all the region servers shut down after
>> > a few minutes, reporting the following :-
>> >
>> > 2010-06-29 11:21:59,508 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> > Recovery for block blk_-1437671530216085093_2565663 bad datanode[0]
>> > 10.0.11.4:50010
>> >
>> > 2010-06-29 11:22:09,481 FATAL org.apache.hadoop.hbase.regionserver.HLog:
>> > Could not append. Requesting close of hlog
>> > java.io.IOException: All datanodes 10.0.11.4:50010 are bad. Aborting...
>> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2542)
>> >
>> > 2010-06-29 11:22:09,482 FATAL
>> > org.apache.hadoop.hbase.regionserver.LogRoller: Log rolling failed
>> > with ioe:
>> > java.io.IOException: All datanodes 10.0.11.4:50010 are bad. Aborting...
>> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2542)
>> >
>> > 2010-06-29 11:22:10,344 ERROR
>> > org.apache.hadoop.hbase.regionserver.HRegionServer: Unable to close
>> > log in abort
>> > java.io.IOException: All datanodes 10.0.11.4:50010 are bad. Aborting...
>> >         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2542)
>> >
>> > This was fixed by restarting the master and starting the region
>> > servers again, but it would be nice to know how to roll out changes
>> > more cleanly.
>> >
>> > How do other people here roll out updates to HBase/Hadoop? What
>> > order do you restart services in, and how long do you wait before
>> > moving to the next node?
>> >
>> > Just so you know, we currently have 5 nodes and are getting another
>> > 10 to add soon.
>> >
>> > Thanks,
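P.S. On the 'how long do you wait before moving to the next node'
question above, we're thinking of polling the cluster between restarts
rather than sleeping for a fixed time. A rough sketch, assuming the
HBase shell's status command and the stock dfsadmin tool:

    # Confirm the restarted datanode has re-registered with the
    # namenode before touching the next node:
    $> hadoop dfsadmin -report

    # Then check the master's view of the cluster; look for dead region
    # servers or a drop in the region count before moving on:
    $> echo "status 'detailed'" | $HBASE_HOME/bin/hbase shell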
--
Dan Harvey | Datamining Engineer
www.mendeley.com/profiles/dan-harvey

Mendeley Limited | London, UK | www.mendeley.com
Registered in England and Wales | Company Number 6419015
