Re: Why do some blocks refuse to replicate...?

Felix GV Sat, 13 Apr 2013 11:44:22 -0700

Oops,

I just peeked inside my drafts folder and saw this update I never sent out.
Here goes:


Ok well rsyncing everything (including the whole subdirXX hierarchies) and
restarting the destination DN worked.

I'll definitely have to script it with something similar to what you
suggested if I hit this issue with bigger datanodes in the future.
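
A minimal sketch of what such a script might start from, assuming the fsck output carries block names of the form blk_<id>_<genstamp> on its CORRUPT/MISSING lines (as in the fsck line quoted further down) and that a datanode stores each block as a blk_<id> file next to a blk_<id>_<genstamp>.meta file somewhere under its subdirXX hierarchy. The function names are made up for illustration, and the exact fsck line format may differ between Hadoop versions, so the patterns would need checking against real output:

```shell
#!/usr/bin/env bash
# Sketch only: helpers for locating the on-disk files of blocks flagged by fsck.

# Read `hadoop fsck <path> -files -blocks -locations` output on stdin and
# print the unique block IDs (blk_<id>, generation stamp stripped) found on
# lines that mention CORRUPT or MISSING.
extract_block_ids() {
  grep -E 'CORRUPT|MISSING' | grep -oE 'blk_-?[0-9]+' | LC_ALL=C sort -u
}

# Print the block file and its .meta file for one block ID, searching the
# whole subdirXX hierarchy under a datanode data directory.
#   $1 = data directory (placeholder; one entry of dfs.datanode.data.dir)
#   $2 = block ID as printed by extract_block_ids
find_block_files() {
  find "$1" -type f \( -name "$2" -o -name "${2}_*.meta" \)
}

# Example use (not run here; the data directory is a placeholder):
#   hadoop fsck <path to HDFS dir> -files -blocks -locations \
#     | extract_block_ids \
#     | while read -r id; do find_block_files /path/to/dn/data "$id"; done
```

The resulting file list could then be handed to rsync, followed by a restart of the destination DN as described above.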

Thanks for your help :) !

--
Felix


On Thu, Mar 28, 2013 at 8:25 PM, Tapas Sarangi <[email protected]> wrote:

>
> On Mar 28, 2013, at 7:13 PM, Felix GV <[email protected]> wrote:
>
> I'm using the version of hadoop in CDH 4.2, which is a version of Hadoop
> 2.0 with a bunch of patches on top...
>
> I've tried copying one block and its .meta file to one of my new DNs, then
> restarted the DN service, and it did pick up the missing block and
> replicate it properly within the new slaves. So... that's great, but it's
> also super annoying, I don't want to have to pick and choose each block
> manually. Although I guess I could parse the output of fsck, figure out
> where the blocks are and script the whole thing.
>
>
> This should flush out all the corrupt or missing blocks :
>
> hadoop fsck <path to HDFS dir> -files -blocks -locations | egrep "CORRUPT|MISSING"
>
> You can put that in a for loop and copy them to another node. Nothing but a
> little scripting.
>
> ------------
>
>
>
> I'm now trying to rsync all of the data from an old node to a new one, and
> see if it's gonna be able to pick that up, but I'm afraid the subdir
> structure might not port over nicely to the new node. Also, this is
> acceptable today to save me from picking each block manually (or come up
> with the script) because I don't have that much data on the old node, but
> if I had gotten in that situation with a large amount of data, that would
> not have been a very good solution...
>
> I'll report back when I've made some more progress...
>
> --
> Felix
>
>
> On Thu, Mar 28, 2013 at 7:01 PM, Azuryy Yu <[email protected]> wrote:
>
>> Which Hadoop version are you using?
>>
>> On Mar 29, 2013 5:24 AM, "Felix GV" <[email protected]> wrote:
>> >
>> > Yes, I didn't specify how I was testing my changes, but basically,
>> here's what I did:
>> >
>> > My hdfs-site.xml file was modified to include a reference to a file
>> containing a list of all datanodes (via dfs.hosts) and a reference to a
>> file containing decommissioned nodes (via dfs.hosts.exclude). After that, I
>> just changed these files, not hdfs-site.xml.
>> >
>> > I first added all my old nodes in the dfs.hosts.exclude file, did
>> hdfs dfsadmin -refreshNodes, and most of the data replicated correctly.
>> >
>> > I then tried removing all old nodes from the dfs.hosts file, did
>> hdfs dfsadmin -refreshNodes, and I saw that I now had a couple of corrupt
>> and missing blocks (60 of them).
>> >
>> > I re-added all the old nodes in the dfs.hosts file, and removed them
>> gradually, each time doing the refreshNodes or restarting the NN, and I
>> narrowed it down to three datanodes in particular, which seem to be the
>> three nodes where all of those 60 blocks are located.
>> >
>> > Is it possible, perhaps, that these three nodes are completely
>> incapable of replicating what they have (because they're corrupt or
>> something), and so every block was replicated from other nodes, but the
>> blocks that happened to be located on these three nodes are... doomed? I
>> can see the data in those blocks in the NN hdfs browser, so I guess it's
>> not corrupted... I also tried pinging the new nodes from those old ones and
>> it works too, so I guess there is no network partition...
>> >
>> > I'm in the process of increasing replication factor above 3, but I
>> don't know if that's gonna do anything...
>> >
>> > --
>> > Felix
>> >
>> >
>> > On Thu, Mar 28, 2013 at 4:45 PM, MARCOS MEDRADO RUBINELLI <
>> [email protected]> wrote:
>> >>
>> >> Felix,
>> >>
>> >> After changing hdfs-site.xml, did you run "hadoop dfsadmin
>> -refreshNodes"? That should have been enough, but you can try increasing
>> the replication factor of these files, wait for them to be replicated to
>> the new nodes, then setting it back to its original value.
>> >>
>> >> Cheers,
>> >> Marcos
>> >>
>> >>
>> >> On 28-03-2013 17:00, Felix GV wrote:
>> >>>
>> >>> Hello,
>> >>>
>> >>> I've been running a virtualized CDH 4.2 cluster. I now want to
>> migrate all my data to another (this time physical) set of slaves and then
>> stop using the virtualized slaves.
>> >>>
>> >>> I added the new physical slaves in the cluster, and marked all the
>> old virtualized slaves as decommissioned using the dfs.hosts.exclude
>> setting in hdfs-site.xml.
>> >>>
>> >>> Almost all of the data replicated successfully to the new slaves, but
>> when I bring down the old slaves, some blocks start showing up as missing
>> or corrupt (according to the NN UI as well as fsck*). If I restart the old
>> slaves, then there are no missing blocks reported by fsck.
>> >>>
>> >>> I've tried shutting down the old slaves two by two, and for some of
>> them I saw no problem, but then at some point I found two slaves which,
>> when shut down, resulted in a couple of blocks being under-replicated (1
>> out of 3 replicas found). For example, fsck would report stuff like this:
>> >>>
>> >>> /user/hive/warehouse/ads_destinations_hosts/part-m-00012:  Under
>> replicated
>> BP-1207449144-10.10.10.21-1356639087818:blk_6150201737015349469_121244.
>> Target Replicas is 3 but found 1 replica(s).
>> >>>
>> >>> The system then stayed in that state apparently forever. It never
>> actually fixed the fact some blocks were under-replicated. Does that mean
>> there's something wrong with some of the old datanodes...? Why do they keep
>> blocks for themselves (even though they're decommissioned) instead of
>> replicating those blocks to the new (non-decommissioned) datanodes?
>> >>>
>> >>> How do I force replication of under-replicated blocks?
>> >>>
>> >>> *Actually, the NN UI and fsck report slightly different things. The
>> NN UI always seems to report 60 under-replicated blocks, whereas fsck only
>> reports those 60 under-replicated blocks when I shut down some of the old
>> datanodes... When the old nodes are up, fsck reports 0 under-replicated
>> blocks... This is very confusing!
>> >>>
>> >>> Any help would be appreciated! Please don't hesitate to ask if I
>> should provide some of my logs, settings, or the output of some commands...!
>> >>>
>> >>> Thanks :) !
>> >>>
>> >>> --
>> >>> Felix
>> >>
>> >>
>> >
>>
>
>
>
