Dan,

I don't think you can do that, because your 'new/updated' node will clash with 
the rest of the cloud.
(We're talking code changes and not just cloud tuning parameters.) [Read: 
different jars...]

If you're going to push an update out, it has to be an 'all or nothing' 
push.

Since we're using Cloudera's release, moving from CDH2 to CDH3 means a 
full backup, taking the cloud down, removing the software completely, and 
then installing the new CDH3. Outside of that major switch, if we were going 
from one sub-release to another, it would be just a $> yum update hadoop-0.20 
call on each node.
Again, you have to take the cloud down to do that.
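
For what it's worth, the sub-release path looks roughly like this (a 
sketch; the paths assume CDH's default install locations, so adjust for 
your layout):

  # Take the cloud down: HBase first, then Hadoop (from the master nodes)
  $> /usr/lib/hbase/bin/stop-hbase.sh
  $> /usr/lib/hadoop-0.20/bin/stop-all.sh

  # Pull the new packages on every node
  $> yum update hadoop-0.20

  # Bring it back up: Hadoop first, then HBase
  $> /usr/lib/hadoop-0.20/bin/start-all.sh
  $> /usr/lib/hbase/bin/start-hbase.sh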

So the bottom line... if you're going to do upgrades, you'll need to plan for 
some downtime.

HTH

-Mike

> From: [email protected]
> Date: Tue, 29 Jun 2010 14:43:26 +0100
> Subject: Rolling out Hadoop/HBase updates
> To: [email protected]
> 
> Hey,
> 
> I've been thinking about how we do our configuration and code updates for
> Hadoop and HBase, and was wondering what others do and what the best
> practice is to avoid errors with HBase.
> 
> Currently we do a rolling update where we restart the services on one node
> at a time: shutting down the region server, then restarting the datanode
> and task trackers, depending on what we are updating and what has changed.
> But with this I have occasionally found errors with the HBase cluster
> afterwards due to a corrupt META table, which I think could have been
> caused by restarting the datanode, or maybe by not waiting long enough for
> the cluster to sort out losing a region server before moving on to the next.
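>
> Per node that looks roughly like this (a sketch; the daemon scripts live
> under the Hadoop and HBase bin/ directories on our install):
>
>   $ bin/hbase-daemon.sh stop regionserver
>   $ bin/hadoop-daemon.sh stop tasktracker
>   $ bin/hadoop-daemon.sh stop datanode
>
>   # ...update the configs/jars on this node...
>
>   $ bin/hadoop-daemon.sh start datanode
>   $ bin/hadoop-daemon.sh start tasktracker
>   $ bin/hbase-daemon.sh start regionserver
>
>   # then wait for the master to reassign regions before moving on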
> 
> The most recent error upon restarting a node was :-
> 
> 2010-06-29 10:46:44,970 ERROR
> org.apache.hadoop.hbase.regionserver.HRegionServer: Error closing
> files,3822b1ea8ae015f3ec932cafaa282dd211d768ad,1275145898366
> java.io.IOException: Filesystem closed
>         at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:230)
> 
> 2010-06-29 10:46:44,970 FATAL
> org.apache.hadoop.hbase.regionserver.HRegionServer: Shutting down
> HRegionServer: file system not available
> java.io.IOException: File system is not available
>         at
> org.apache.hadoop.hbase.util.FSUtils.checkFileSystemAvailable(FSUtils.java:129)
> 
> 
> Followed by this for every region being served :-
> 
> 2010-06-29 10:46:44,996 ERROR
> org.apache.hadoop.hbase.regionserver.HRegionServer: Error closing
> documents,082595c0-6d01-11df-936c-0026b95e484c,1275676410202
> java.io.IOException: Filesystem closed
>         at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:230)
> 
> 
> After updating all the nodes, all the region servers shut down after a
> few minutes, reporting the following :-
> 
> 2010-06-29 11:21:59,508 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_-1437671530216085093_2565663 bad datanode[0]
> 10.0.11.4:50010
> 
> 2010-06-29 11:22:09,481 FATAL org.apache.hadoop.hbase.regionserver.HLog:
> Could not append. Requesting close of hlog
> java.io.IOException: All datanodes 10.0.11.4:50010 are bad. Aborting...
>         at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2542)
> 
> 
> 2010-06-29 11:22:09,482 FATAL
> org.apache.hadoop.hbase.regionserver.LogRoller: Log rolling failed with
> ioe:
> java.io.IOException: All datanodes 10.0.11.4:50010 are bad. Aborting...
>         at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2542)
> 
> 2010-06-29 11:22:10,344 ERROR
> org.apache.hadoop.hbase.regionserver.HRegionServer: Unable to close log in
> abort
> java.io.IOException: All datanodes 10.0.11.4:50010 are bad. Aborting...
>         at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2542)
> 
> 
> This was fixed by restarting the master and starting the region servers
> again, but it would be nice to know how to roll out changes more cleanly.
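>
> The fix itself was just a daemon restart, something like the following
> (a sketch using the standard HBase scripts; adjust paths for your
> install):
>
>   # on the master node
>   $ bin/hbase-daemon.sh stop master
>   $ bin/hbase-daemon.sh start master
>
>   # then start the region servers across the cluster
>   $ bin/hbase-daemons.sh start regionserver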
> 
> How do other people here roll out updates to HBase / Hadoop? What order do
> you restart services in and how long do you wait before moving to the next
> node?
> 
> Just so you know, we currently have 5 nodes and will be adding another 10
> soon.
> 
> Thanks,
> 
> -- 
> Dan Harvey | Datamining Engineer
> www.mendeley.com/profiles/dan-harvey
> 
> Mendeley Limited | London, UK | www.mendeley.com
> Registered in England and Wales | Company Number 6419015
                                          
