On Sun, Jul 4, 2010 at 10:36 AM, Dan Harvey <[email protected]> wrote:
> Just looked into HDFS-630 and it looks like it was added in CDH2
> 0.20.1+169.89, and we're currently on 0.20.1+169.68. So would it help
> prevent some of these issues if we updated to that release so we have
> the patch?
>

For sure Dan. HDFS-630 will help at a minimum.
St.Ack
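(A quick, hedged way to confirm which CDH2 build a node is actually running, assuming the stock CDH packaging; the yum query assumes an RPM-based install.)

    # Prints the build string, e.g. "0.20.1+169.68" on CDH2.
    hadoop version

    # On RPM-based installs the package manager reports the same patch level.
    yum list installed hadoop-0.20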
> Thanks,
>
> On 4 July 2010 18:12, Dan Harvey <[email protected]> wrote:
>
>> Hey,
>>
>> We're using stock CDH2 without any patches, so I'm not sure if we have
>> HDFS-630 or not. For HBase we're currently on 0.20.3, and we will be
>> testing and moving to 0.20.5 soon.
>>
>> What I did with this rollout of config-only changes was take one region
>> server down at a time and restart the datanode on the same server. From
>> what I gather, I should instead have shut down all the region servers
>> before restarting any of the datanodes?
>>
>> If I split it into its different parts, I guess it would be:
>>
>> - HBase rolling updates for point/config releases are supported:
>>   - Update the masters first
>>   - Then update the region servers in turn (see the sketch below)
>>
>> - HDFS datanodes don't support rolling updates? (Maybe better asked on
>>   the hdfs list, I guess.)
>>   - Take down HBase
>>   - Take down the datanodes
>>   - Update the code/configs on all the datanodes
>>   - Start the datanodes
>>   - Start HBase
>>
>> Would you be able to let me know which of these I've got right/wrong?
>>
>> Thanks,
>>
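(A minimal sketch of the rolling half of that list, assuming the stock 0.20-era hbase-daemon.sh script; rs1..rs5 are hypothetical hostnames, and the fixed sleep is a crude stand-in for watching the master UI until regions settle.)

    #!/bin/bash
    # Roll the region servers one node at a time, after the master has
    # already been updated. HBASE_HOME is assumed to be the same install
    # path on every node.
    for rs in rs1 rs2 rs3 rs4 rs5; do
        ssh "$rs" "$HBASE_HOME/bin/hbase-daemon.sh stop regionserver"
        # Give the master time to notice the server is gone and reassign
        # its regions before bringing the node back.
        sleep 120
        ssh "$rs" "$HBASE_HOME/bin/hbase-daemon.sh start regionserver"
    done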
>> On 29 June 2010 15:50, Michael Segel <[email protected]> wrote:
>>
>>> Dan,
>>>
>>> I don't think you can do that, because your new/updated node will clash
>>> with the rest of the cloud. (We're talking code, not just cloud tuning
>>> parameters; the nodes would be reading different jars.)
>>>
>>> If you're going to push an update out, then it has to be an 'all or
>>> nothing' push.
>>>
>>> Since we're using Cloudera's release, moving from CDH2 to CDH3 means a
>>> full backup, taking the cloud down, removing the software completely,
>>> and then installing CDH3 fresh. Outside of that major switch, going
>>> from one sub-release to another would be just a "$> yum update
>>> hadoop-0.20" call on each node. Again, you have to take the cloud down
>>> to do that.
>>>
>>> So the bottom line: if you're going to do upgrades, you'll need to plan
>>> for some downtime.
>>>
>>> HTH
>>>
>>> -Mike
>>>
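(Sketching Mike's "all or nothing" sub-release path end to end, assuming the stock start/stop scripts and RPM installs; node1..node5 are placeholder hostnames.)

    #!/bin/bash
    # Full-downtime update for a sub-release bump: quiesce HBase, take
    # HDFS down, update every node, then restart in order.
    $HBASE_HOME/bin/stop-hbase.sh
    $HADOOP_HOME/bin/stop-dfs.sh

    for node in node1 node2 node3 node4 node5; do
        ssh "$node" "yum -y update hadoop-0.20"
    done

    $HADOOP_HOME/bin/start-dfs.sh
    # Block until HDFS leaves safe mode before starting HBase on top of it.
    $HADOOP_HOME/bin/hadoop dfsadmin -safemode wait
    $HBASE_HOME/bin/start-hbase.sh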
>>> > From: [email protected]
>>> > Date: Tue, 29 Jun 2010 14:43:26 +0100
>>> > Subject: Rolling out Hadoop/HBase updates
>>> > To: [email protected]
>>> >
>>> > Hey,
>>> >
>>> > I've been thinking about how we do our configuration and code updates
>>> > for Hadoop and HBase, and was wondering what others do and what best
>>> > practice is for avoiding errors with HBase.
>>> >
>>> > Currently we do a rolling update where we restart the services on one
>>> > node at a time: shutting down the region server, then restarting the
>>> > datanode and tasktracker, depending on what we are updating and what
>>> > has changed. But with this I have occasionally found errors in the
>>> > HBase cluster afterwards, due to a corrupt META table, which I think
>>> > could have been caused by restarting the datanode, or maybe by not
>>> > waiting long enough for the cluster to sort out losing a region
>>> > server before moving on to the next.
>>> >
>>> > The most recent error upon restarting a node was:
>>> >
>>> > 2010-06-29 10:46:44,970 ERROR
>>> > org.apache.hadoop.hbase.regionserver.HRegionServer: Error closing
>>> > files,3822b1ea8ae015f3ec932cafaa282dd211d768ad,1275145898366
>>> > java.io.IOException: Filesystem closed
>>> >     at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:230)
>>> >
>>> > 2010-06-29 10:46:44,970 FATAL
>>> > org.apache.hadoop.hbase.regionserver.HRegionServer: Shutting down
>>> > HRegionServer: file system not available
>>> > java.io.IOException: File system is not available
>>> >     at org.apache.hadoop.hbase.util.FSUtils.checkFileSystemAvailable(FSUtils.java:129)
>>> >
>>> > Followed by this for every region being served:
>>> >
>>> > 2010-06-29 10:46:44,996 ERROR
>>> > org.apache.hadoop.hbase.regionserver.HRegionServer: Error closing
>>> > documents,082595c0-6d01-11df-936c-0026b95e484c,1275676410202
>>> > java.io.IOException: Filesystem closed
>>> >     at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:230)
>>> >
>>> > After updating all the nodes, all the region servers shut down after
>>> > a few minutes, reporting the following:
>>> >
>>> > 2010-06-29 11:21:59,508 WARN org.apache.hadoop.hdfs.DFSClient: Error
>>> > Recovery for block blk_-1437671530216085093_2565663 bad datanode[0]
>>> > 10.0.11.4:50010
>>> >
>>> > 2010-06-29 11:22:09,481 FATAL org.apache.hadoop.hbase.regionserver.HLog:
>>> > Could not append. Requesting close of hlog
>>> > java.io.IOException: All datanodes 10.0.11.4:50010 are bad. Aborting...
>>> >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2542)
>>> >
>>> > 2010-06-29 11:22:09,482 FATAL
>>> > org.apache.hadoop.hbase.regionserver.LogRoller: Log rolling failed
>>> > with ioe:
>>> > java.io.IOException: All datanodes 10.0.11.4:50010 are bad. Aborting...
>>> >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2542)
>>> >
>>> > 2010-06-29 11:22:10,344 ERROR
>>> > org.apache.hadoop.hbase.regionserver.HRegionServer: Unable to close
>>> > log in abort
>>> > java.io.IOException: All datanodes 10.0.11.4:50010 are bad. Aborting...
>>> >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2542)
>>> >
>>> > This was fixed by restarting the master and starting the region
>>> > servers again, but it would be nice to know how to roll out changes
>>> > more cleanly.
>>> >
>>> > How do other people here roll out updates to HBase / Hadoop? What
>>> > order do you restart services in, and how long do you wait before
>>> > moving on to the next node?
>>> >
>>> > Just so you know, we currently have 5 nodes and are getting another
>>> > 10 to add soon.
>>> >
>>> > Thanks,
>>> >
>>> > --
>>> > Dan Harvey | Datamining Engineer
>>> > www.mendeley.com/profiles/dan-harvey
>>> >
>>> > Mendeley Limited | London, UK | www.mendeley.com
>>> > Registered in England and Wales | Company Number 6419015
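(On the closing question of how long to wait, a hedged alternative to fixed sleeps is to gate each step on cluster health with the stock 0.20-era checks below before moving to the next node.)

    # HDFS side: confirm the restarted datanode has re-registered and no
    # blocks are missing or corrupt.
    hadoop dfsadmin -report
    hadoop fsck /

    # HBase side: confirm all region servers have checked back in.
    echo "status" | hbase shell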
