Felix,

After changing hdfs-site.xml, did you run "hadoop dfsadmin -refreshNodes"? That 
should have been enough, but you can also try increasing the replication factor 
of these files, waiting for them to be replicated to the new nodes, and then 
setting it back to its original value.
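
In case it helps, here is a rough sketch of that workaround, not a tested 
procedure. It assumes the fsck output looks like the line you pasted (one 
entry per line) and that raising the factor by one is enough to trigger 
re-replication; the function names and the "bump" parameter are made up for 
illustration. It only builds "hadoop fs -setrep" command strings and does not 
run anything.

```python
import re
import shlex

# Pattern modeled on the fsck line quoted in this thread. Other Hadoop
# versions may word it differently, and paths containing whitespace are
# not handled (both are assumptions of this sketch).
UNDER_REPLICATED = re.compile(
    r"^(?P<path>/\S+):\s+Under replicated\s+(?P<block>\S+)\.\s+"
    r"Target Replicas is (?P<target>\d+) but found (?P<found>\d+) replica\(s\)\.",
    re.MULTILINE,
)


def under_replicated_files(fsck_output):
    """Map each under-replicated file path to its target replication factor."""
    files = {}
    for match in UNDER_REPLICATED.finditer(fsck_output):
        # A file with several under-replicated blocks is recorded once.
        files.setdefault(match.group("path"), int(match.group("target")))
    return files


def setrep_commands(files, bump=1):
    """Build raise-then-restore setrep commands; nothing is executed here."""
    commands = []
    for path, target in files.items():
        quoted = shlex.quote(path)
        # -w makes setrep wait until the new replication factor is reached.
        commands.append(f"hadoop fs -setrep -w {target + bump} {quoted}")
        commands.append(f"hadoop fs -setrep {target} {quoted}")
    return commands


if __name__ == "__main__":
    # Sample line copied from the fsck report in this thread.
    sample = (
        "/user/hive/warehouse/ads_destinations_hosts/part-m-00012:  "
        "Under replicated "
        "BP-1207449144-10.10.10.21-1356639087818:blk_6150201737015349469_121244. "
        "Target Replicas is 3 but found 1 replica(s).\n"
    )
    for command in setrep_commands(under_replicated_files(sample)):
        print(command)
```

To use it for real you would pass the text of "hadoop fsck /" to 
under_replicated_files() instead of the sample string, and review the printed 
commands before running any of them.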

Cheers,
Marcos

On 28-03-2013 17:00, Felix GV wrote:
Hello,

I've been running a virtualized CDH 4.2 cluster. I now want to migrate all my 
data to another (this time physical) set of slaves and then stop using the 
virtualized slaves.

I added the new physical slaves to the cluster, and marked all the old 
virtualized slaves as decommissioned using the dfs.hosts.exclude setting in 
hdfs-site.xml.

Almost all of the data replicated successfully to the new slaves, but when I 
bring down the old slaves, some blocks start showing up as missing or corrupt 
(according to the NN UI as well as fsck*). If I restart the old slaves, then 
there are no missing blocks reported by fsck.

I've tried shutting down the old slaves two by two, and for some of them I saw 
no problem, but then at some point I found two slaves which, when shut down, 
resulted in a couple of blocks being under-replicated (1 out of 3 replicas 
found). For example, fsck would report stuff like this:

/user/hive/warehouse/ads_destinations_hosts/part-m-00012:  Under replicated 
BP-1207449144-10.10.10.21-1356639087818:blk_6150201737015349469_121244. Target 
Replicas is 3 but found 1 replica(s).

The system then stayed in that state apparently forever. It never actually 
fixed the fact some blocks were under-replicated. Does that mean there's 
something wrong with some of the old datanodes...? Why do they keep blocks to 
themselves (even though they're decommissioned) instead of replicating those 
blocks to the new (non-decommissioned) datanodes?

How do I force replication of under-replicated blocks?

*Actually, the NN UI and fsck report slightly different things. The NN UI 
always seems to report 60 under-replicated blocks, whereas fsck only reports 
those 60 under-replicated blocks when I shut down some of the old datanodes... 
When the old nodes are up, fsck reports 0 under-replicated blocks... This is 
very confusing!

Any help would be appreciated! Please don't hesitate to ask if I should provide 
some of my logs, settings, or the output of some commands...!

Thanks :) !

--
Felix
