Hey Dan 

Are you using raw Apache Hadoop? If so, any patches? Do you have HDFS-630?

Looking at the errors below, the regionserver thinks the filesystem is gone. Is
there nothing in the log before the exception pasted below?

In general you want to let the regionservers finish their shutdown.  Any
chance you are not letting this happen?

For a rolling restart you should do the master first, then the regionservers.
You get the best results if the cluster is quiet at the time; otherwise regions
in transition can be "lost" over the master restart (to be fixed in HBase 0.90.0).

Stack

On Jun 29, 2010, at 6:43 AM, Dan Harvey <[email protected]> wrote:

> Hey,
> 
> I've been thinking about how we do our configuration and code updates for
> Hadoop and HBase and was wondering what others do and what is the best
> practice to avoid errors with HBase.
> 
> Currently we do a rolling update where we restart the services on one node
> at a time, shutting down the region server and then restarting the datanode
> and tasktracker depending on what we are updating and what has changed. But
> with this I have occasionally found errors with the HBase cluster afterwards
> due to a corrupt META table, which I think could have been caused by restarting
> the datanode, or maybe by not waiting long enough for the cluster to sort out
> losing a region server before moving on to the next node.
> 
> The most recent error upon restarting a node was :-
> 
> 2010-06-29 10:46:44,970 ERROR
> org.apache.hadoop.hbase.regionserver.HRegionServer: Error closing
> files,3822b1ea8ae015f3ec932cafaa282dd211d768ad,1275145898366
> java.io.IOException: Filesystem closed
>        at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:230)
> 
> 2010-06-29 10:46:44,970 FATAL
> org.apache.hadoop.hbase.regionserver.HRegionServer: Shutting down
> HRegionServer: file system not available
> java.io.IOException: File system is not available
>        at
> org.apache.hadoop.hbase.util.FSUtils.checkFileSystemAvailable(FSUtils.java:129)
> 
> 
> Followed by this for every region being served :-
> 
> 2010-06-29 10:46:44,996 ERROR
> org.apache.hadoop.hbase.regionserver.HRegionServer: Error closing
> documents,082595c0-6d01-11df-936c-0026b95e484c,1275676410202
> java.io.IOException: Filesystem closed
>        at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:230)
> 
> 
> After updating all the nodes, all the region servers shut down after a
> few minutes, reporting the following :-
> 
> 2010-06-29 11:21:59,508 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_-1437671530216085093_2565663 bad datanode[0]
> 10.0.11.4:50010
> 
> 2010-06-29 11:22:09,481 FATAL org.apache.hadoop.hbase.regionserver.HLog:
> Could not append. Requesting close of hlog
> java.io.IOException: All datanodes 10.0.11.4:50010 are bad. Aborting...
>        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2542)
> 
> 
> 2010-06-29 11:22:09,482 FATAL
> org.apache.hadoop.hbase.regionserver.LogRoller: Log rolling failed with
> ioe:
> java.io.IOException: All datanodes 10.0.11.4:50010 are bad. Aborting...
>        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2542)
> 
> 2010-06-29 11:22:10,344 ERROR
> org.apache.hadoop.hbase.regionserver.HRegionServer: Unable to close log in
> abort
> java.io.IOException: All datanodes 10.0.11.4:50010 are bad. Aborting...
>        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2542)
> 
> 
> This was fixed by restarting the master and starting the region servers
> again, but it would be nice to know how to roll out changes more cleanly.
> 
> How do other people here roll out updates to HBase / Hadoop? What order do
> you restart services in and how long do you wait before moving to the next
> node?
> 
> Just so you know, we currently have 5 nodes and will be adding another 10
> soon.
> 
> Thanks,
> 
> -- 
> Dan Harvey | Datamining Engineer
> www.mendeley.com/profiles/dan-harvey
> 
> Mendeley Limited | London, UK | www.mendeley.com
> Registered in England and Wales | Company Number 6419015
