Re: Why do some blocks refuse to replicate...?

Tapas Sarangi Thu, 28 Mar 2013 17:25:48 -0700

On Mar 28, 2013, at 7:13 PM, Felix GV <[email protected]> wrote:

> I'm using the version of hadoop in CDH 4.2, which is a version of Hadoop 2.0 
> with a bunch of patches on top...
> 
> I've tried copying one block and its .meta file to one of my new DNs, then 
> restarted the DN service, and it did pick up the missing block and replicate 
> it properly within the new slaves. So... that's great, but it's also super 
> annoying, I don't want to have to pick and choose each block manually. 
> Although I guess I could parse the output of fsck, figure out where the 
> blocks are and script the whole thing.


This should flush out all the corrupt or missing blocks:

hadoop fsck <path to HDFS dir> -files -blocks -locations | egrep "CORRUPT|MISSING"

You can put that in a for loop and copy the blocks to another node. It takes nothing but a little scripting.
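
A minimal sketch of the filtering half of that script in Python, in case it helps. It applies the same CORRUPT/MISSING filter as the egrep above and pulls out the block names. It assumes block names look like blk_<id> (as in the fsck sample quoted further down); the function name is made up for illustration, and the copy step is left out because it depends on your datanode directory layout.

```python
import re

# Same filter as the egrep above: keep lines mentioning CORRUPT or MISSING.
FLAGGED = re.compile(r"CORRUPT|MISSING")
# Assumption: block names look like blk_<id>, where the id may be negative.
# The generation stamp that follows (e.g. _121244) is deliberately not captured.
BLOCK_NAME = re.compile(r"blk_-?\d+")


def flagged_blocks(fsck_lines):
    """Return the sorted, de-duplicated block names found on flagged lines."""
    found = set()
    for line in fsck_lines:
        if FLAGGED.search(line):
            found.update(BLOCK_NAME.findall(line))
    return sorted(found)
```

To use it on real output, pipe the fsck command into a script that calls flagged_blocks(sys.stdin) and prints each name, then look for files with those names (plus their .meta files) under the datanode data directories.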

------------


> 
> I'm now trying to rsync all of the data from an old node to a new one, and 
> see if it's gonna be able to pick that up, but I'm afraid the subdir 
> structure might not port over nicely to the new node. Also, this is 
> acceptable today to save me from picking each block manually (or coming up with 
> the script) because I don't have that much data on the old node, but if I had 
> gotten in that situation with a large amount of data, that would not have 
> been a very good solution...
> 
> I'll report back when I've made some more progress...
> 
> --
> Felix
> 
> 
> On Thu, Mar 28, 2013 at 7:01 PM, Azuryy Yu <[email protected]> wrote:
> Which Hadoop version are you using?
> 
> On Mar 29, 2013 5:24 AM, "Felix GV" <[email protected]> wrote:
> >
> > Yes, I didn't specify how I was testing my changes, but basically, here's 
> > what I did:
> >
> > My hdfs-site.xml file was modified to include a reference to a file 
> > containing a list of all datanodes (via dfs.hosts) and a reference to a 
> > file containing decommissioned nodes (via dfs.hosts.exclude). After that, I 
> > just changed these files, not hdfs-site.xml.
> >
> > I first added all my old nodes in the dfs.hosts.exclude file, did hdfs 
> > dfsadmin -refreshNodes, and most of the data replicated correctly.
> >
> > I then tried removing all old nodes from the dfs.hosts file, did hdfs 
> > dfsadmin -refreshNodes, and I saw that I now had a couple of corrupt and 
> > missing blocks (60 of them).
> >
> > I re-added all the old nodes in the dfs.hosts file, and removed them 
> > gradually, each time doing the refreshNodes or restarting the NN, and I 
> > narrowed it down to three datanodes in particular, which seem to be the 
> > three nodes where all of those 60 blocks are located.
> >
> > Is it possible, perhaps, that these three nodes are completely incapable of 
> > replicating what they have (because they're corrupt or something), and so 
> > every block was replicated from other nodes, but the blocks that happened 
> > to be located on these three nodes are... doomed? I can see the data in 
> > those blocks in the NN hdfs browser, so I guess it's not corrupted... I 
> > also tried pinging the new nodes from those old ones and it works too, so I 
> > guess there is no network partition...
> >
> > I'm in the process of increasing replication factor above 3, but I don't 
> > know if that's gonna do anything...
> >
> > --
> > Felix
> >
> >
> > On Thu, Mar 28, 2013 at 4:45 PM, MARCOS MEDRADO RUBINELLI 
> > <[email protected]> wrote:
> >>
> >> Felix,
> >>
> >> After changing hdfs-site.xml, did you run "hadoop dfsadmin -refreshNodes"? 
> >> That should have been enough, but you can try increasing the replication 
> >> factor of these files, waiting for them to be replicated to the new nodes, 
> >> and then setting it back to its original value.
> >>
> >> Cheers,
> >> Marcos
> >>
> >>
> >> On 28-03-2013 17:00, Felix GV wrote:
> >>>
> >>> Hello,
> >>>
> >>> I've been running a virtualized CDH 4.2 cluster. I now want to migrate 
> >>> all my data to another (this time physical) set of slaves and then stop 
> >>> using the virtualized slaves.
> >>>
> >>> I added the new physical slaves in the cluster, and marked all the old 
> >>> virtualized slaves as decommissioned using the dfs.hosts.exclude setting 
> >>> in hdfs-site.xml.
> >>>
> >>> Almost all of the data replicated successfully to the new slaves, but 
> >>> when I bring down the old slaves, some blocks start showing up as missing 
> >>> or corrupt (according to the NN UI as well as fsck*). If I restart the 
> >>> old slaves, then there are no missing blocks reported by fsck.
> >>>
> >>> I've tried shutting down the old slaves two by two, and for some of them 
> >>> I saw no problem, but then at some point I found two slaves which, when 
> >>> shut down, resulted in a couple of blocks being under-replicated (1 out 
> >>> of 3 replicas found). For example, fsck would report stuff like this:
> >>>
> >>> /user/hive/warehouse/ads_destinations_hosts/part-m-00012:  Under 
> >>> replicated 
> >>> BP-1207449144-10.10.10.21-1356639087818:blk_6150201737015349469_121244. 
> >>> Target Replicas is 3 but found 1 replica(s).
> >>>
> >>> The system then stayed in that state apparently forever. It never 
> >>> actually fixed the fact some blocks were under-replicated. Does that mean 
> >>> there's something wrong with some of the old datanodes...? Why do they 
> >>> keep blocks for themselves (even though they're decommissioned) instead 
> >>> of replicating those blocks to the new (non-decommissioned) datanodes?
> >>>
> >>> How do I force replication of under-replicated blocks?
> >>>
> >>> *Actually, the NN UI and fsck report slightly different things. The NN UI 
> >>> always seems to report 60 under-replicated blocks, whereas fsck only 
> >>> reports those 60 under-replicated blocks when I shut down some of the old 
> >>> datanodes... When the old nodes are up, fsck reports 0 under-replicated 
> >>> blocks... This is very confusing!
> >>>
> >>> Any help would be appreciated! Please don't hesitate to ask if I should 
> >>> provide some of my logs, settings, or the output of some commands...!
> >>>
> >>> Thanks :) !
> >>>
> >>> --
> >>> Felix
> >>
> >>
> >
>
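
On the original question of forcing re-replication: Marcos's suggestion (raise the replication factor, wait, then set it back) can be scripted from the fsck output too. Below is a rough sketch in Python. It assumes each under-replicated entry arrives on a single line shaped like the sample quoted above (the mail client wrapped it here) and it uses "hadoop fs -setrep". The function names and the bump of +1 are illustrative assumptions, and the code only builds the commands, it does not run them. Whether raising the factor actually helps in the situation described in this thread is not established here.

```python
import re

# Assumption: one fsck entry per line, shaped like the sample quoted above:
# <path>:  Under replicated <block>. Target Replicas is N but found M replica(s).
UNDER_REPLICATED = re.compile(
    r"^(?P<path>/.*?):\s+Under replicated\s+(?P<block>\S+)\.\s+"
    r"Target Replicas is (?P<target>\d+) but found (?P<found>\d+) replica\(s\)\."
)


def under_replicated_files(fsck_lines):
    """Map each affected file path to its target replication factor."""
    targets = {}
    for line in fsck_lines:
        match = UNDER_REPLICATED.match(line.strip())
        if match:
            targets[match.group("path")] = int(match.group("target"))
    return targets


def setrep_commands(targets, bump=1):
    """Build (raise, restore) command pairs; nothing is executed here."""
    commands = []
    for path, target in sorted(targets.items()):
        # -w makes setrep wait until the new replication factor is reached.
        raise_cmd = ["hadoop", "fs", "-setrep", "-w", str(target + bump), path]
        restore_cmd = ["hadoop", "fs", "-setrep", str(target), path]
        commands.append((raise_cmd, restore_cmd))
    return commands
```

Each pair can be handed to subprocess.run in order, the raise command first and the restore command once it returns.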
