Re: Why do some blocks refuse to replicate...?

Tapas Sarangi Thu, 28 Mar 2013 15:07:37 -0700

Did you check whether any disk is read-only on the nodes that hold the 
missing blocks? If you know which blocks they are, you can manually copy 
each block and its corresponding '.meta' file to another node. Hadoop will 
re-read those blocks and replicate them.
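
A minimal sketch of how the block files could be located before copying them, 
assuming a POSIX shell with grep and find on the datanode. The helper names 
extract_block_ids and find_block_files are hypothetical, and the data 
directory passed in is whatever dfs.datanode.data.dir points to on that node. 
It relies on the on-disk naming where the data file is blk_<id> and the 
checksum file is blk_<id>_<generation stamp>.meta.

```shell
#!/bin/sh

# Read `hdfs fsck` output on stdin and print each distinct block name
# (blk_<id>) it mentions. The trailing _<generation stamp> is dropped
# because the on-disk data file is named blk_<id> only.
extract_block_ids() {
    grep -oE 'blk_-?[0-9]+' | sort -u
}

# Print the data file and the matching .meta file for one block.
#   $1: a datanode data directory (assumed: a dfs.datanode.data.dir entry)
#   $2: a block name such as blk_6150201737015349469
find_block_files() {
    data_dir=$1
    block=$2
    find "$data_dir" -type f \( -name "$block" -o -name "${block}_*.meta" \)
}
```

For example, piping fsck output through extract_block_ids lists the block 
names it mentions, and running find_block_files on an old datanode with one 
of those names prints the two files that need to be copied together into a 
data directory on another node.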


-----



On Mar 28, 2013, at 4:23 PM, Felix GV <[email protected]> wrote:

> Yes, I didn't specify how I was testing my changes, but basically, here's 
> what I did:
> 
> My hdfs-site.xml file was modified to include a reference to a file 
> containing a list of all datanodes (via dfs.hosts) and a reference to a file 
> containing decommissioned nodes (via dfs.hosts.exclude). After that, I just 
> changed these files, not hdfs-site.xml.
> 
> I first added all my old nodes in the dfs.hosts.exclude file, did hdfs 
> dfsadmin -refreshNodes, and most of the data replicated correctly.
> 
> I then tried removing all old nodes from the dfs.hosts file, did hdfs 
> dfsadmin -refreshNodes, and I saw that I now had a couple of corrupt and 
> missing blocks (60 of them).
> 
> I re-added all the old nodes in the dfs.hosts file, and removed them 
> gradually, each time doing the refreshNodes or restarting the NN, and I 
> narrowed it down to three datanodes in particular, which seem to be the three 
> nodes where all of those 60 blocks are located.
> 
> Is it possible, perhaps, that these three nodes are completely incapable of 
> replicating what they have (because they're corrupt or something), and so 
> every block was replicated from other nodes, but the blocks that happened to 
> be located on these three nodes are... doomed? I can see the data in those 
> blocks in the NN hdfs browser, so I guess it's not corrupted... I also tried 
> pinging the new nodes from those old ones and it works too, so I guess there 
> is no network partition...
> 
> I'm in the process of increasing replication factor above 3, but I don't know 
> if that's gonna do anything...
> 
> --
> Felix
> 
> 
> On Thu, Mar 28, 2013 at 4:45 PM, MARCOS MEDRADO RUBINELLI 
> <[email protected]> wrote:
> Felix,
> 
> After changing hdfs-site.xml, did you run "hadoop dfsadmin -refreshNodes"? 
> That should have been enough, but you can try increasing the replication 
> factor of these files, wait for them to be replicated to the new nodes, then 
> setting it back to its original value.
> 
> Cheers,
> Marcos
> 
> 
> On 28-03-2013 17:00, Felix GV wrote:
>> Hello,
>> 
>> I've been running a virtualized CDH 4.2 cluster. I now want to migrate all 
>> my data to another (this time physical) set of slaves and then stop using 
>> the virtualized slaves.
>> 
>> I added the new physical slaves in the cluster, and marked all the old 
>> virtualized slaves as decommissioned using the dfs.hosts.exclude setting in 
>> hdfs-site.xml.
>> 
>> Almost all of the data replicated successfully to the new slaves, but when I 
>> bring down the old slaves, some blocks start showing up as missing or 
>> corrupt (according to the NN UI as well as fsck*). If I restart the old 
>> slaves, then there are no missing blocks reported by fsck.
>> 
>> I've tried shutting down the old slaves two by two, and for some of them I 
>> saw no problem, but then at some point I found two slaves which, when shut 
>> down, resulted in a couple of blocks being under-replicated (1 out of 3 
>> replicas found). For example, fsck would report stuff like this:
>> 
>> /user/hive/warehouse/ads_destinations_hosts/part-m-00012:  Under replicated 
>> BP-1207449144-10.10.10.21-1356639087818:blk_6150201737015349469_121244. 
>> Target Replicas is 3 but found 1 replica(s).
>> 
>> The system then stayed in that state apparently forever. It never actually 
>> fixed the fact some blocks were under-replicated. Does that mean there's 
>> something wrong with some of the old datanodes...? Why do they keep blocks 
>> to themselves (even though they're decommissioned) instead of replicating 
>> those blocks to the new (non-decommissioned) datanodes?
>> 
>> How do I force replication of under-replicated blocks?
>> 
>> *Actually, the NN UI and fsck report slightly different things. The NN UI 
>> always seems to report 60 under-replicated blocks, whereas fsck only reports 
>> those 60 under-replicated blocks when I shut down some of the old 
>> datanodes... When the old nodes are up, fsck reports 0 under-replicated 
>> blocks... This is very confusing!
>> 
>> Any help would be appreciated! Please don't hesitate to ask if I should 
>> provide some of my logs, settings, or the output of some commands...!
>> 
>> Thanks :) !
>> 
>> --
>> Felix
> 
>
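
The suggestion quoted above, to raise the replication factor temporarily and 
then restore it, could be scripted roughly as follows. This is only a sketch: 
bump_replication and the HDFS_CMD override are hypothetical names introduced 
here so the commands can be dry-run with echo, and the path and replication 
values are supplied by the caller.

```shell
#!/bin/sh

# Raise the replication factor of a path, wait for it to take effect,
# then set it back to the original value.
#   $1: HDFS path
#   $2: original replication factor (e.g. 3)
#   $3: temporary, higher replication factor (e.g. 4)
# HDFS_CMD (assumption: an override added for dry runs) defaults to `hdfs`.
bump_replication() {
    path=$1
    original=$2
    temporary=$3
    hdfs_cmd=${HDFS_CMD:-hdfs}

    # -w makes setrep wait until the target replication is reached.
    "$hdfs_cmd" dfs -setrep -w "$temporary" "$path" || return 1

    # Restore the original factor; excess replicas are removed later.
    "$hdfs_cmd" dfs -setrep "$original" "$path"
}
```

Running it with HDFS_CMD=echo prints the two setrep commands instead of 
executing them, which is a cheap way to check the arguments before touching 
the cluster.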
