Re: lost NDFS blocks following network reorg

2006-03-26 Thread Doug Cutting

Ken Krugler wrote:
Anyway, curious if anybody has insights here. We've done a fair amount 
of poking around, to no avail. I don't think there's any way to get the 
blocks back, as they definitely seem to be gone, and file recovery on 
Linux seems pretty iffy. I'm mostly interested in figuring out if this 
is a known issue (Of course you can't change the server names and 
expect it to work), or whether it's a symptom of lurking NDFS bugs.


It's hard to tell, after the fact, whether stuff like this is pilot 
error or a bug.  Others have reported similar things, so it's either a 
bug or it's too easy to make pilot errors.  So something needs to 
change.  But what?


We need to start testing stuff like this systematically.  A reproducible 
test case would make this much easier to diagnose.


I'm sorry I can't be more helpful.  I'm sorry you lost data.

Doug


lost NDFS blocks following network reorg

2006-03-25 Thread Ken Krugler

Hi all,

We had a test crawl, using Nutch 0.8 pre-Hadoop. This consisted of a 
master server and 10 slaves, where NameNode/JobTracker ran on the 
master, and DataNodes/TaskTrackers ran on each of the slaves.


As a test, we removed two of the slaves from the cluster. We did this 
one slave at a time: shutting down Nutch, pulling out a slave, 
starting Nutch again, waiting for all block re-replication to finish, 
and then repeating the same process for the second slave.


After we did this, we seemed to have a valid cluster - running 
mapreduce tasks worked, etc.


Then we moved the servers to a new colo facility, and also changed 
the network setup such that each server got a new name.


Sometime after that, we tried running another mapreduce task, and it 
failed with missing block errors.


Using Andrzej's DFS checker tool, we determined that 20% of the 
blocks were missing.


We also saw various log entries that looked like:

060321 125017 Obsoleting block blk_955860084498163084

One additional note - we'd been using a minimal (2x) replication 
factor, due to disk space constraints.


So first question is whether anybody has run into a similar problem.

Second, it didn't seem like the Nutch (remember, this is pre-Hadoop) 
DFS code saved the server name or IP address with blocks, but I 
haven't dug into this. If it did, that would obviously cause problems 
for us.
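
To illustrate the failure mode I'm worried about (this is a toy sketch, 
not actual NDFS code -- the class and method names are made up): if a 
namenode tracked block locations keyed by DataNode hostname, then after 
every host is renamed, all previously recorded replicas would look like 
they belong to departed nodes.

```python
class TinyNameNode:
    """Toy model of a namenode that keys replica locations by hostname.
    Purely illustrative -- not how NDFS is necessarily implemented."""

    def __init__(self):
        self.block_locations = {}  # block_id -> set of hostnames

    def report(self, hostname, block_ids):
        """A DataNode reports the blocks it holds, under its hostname."""
        for b in block_ids:
            self.block_locations.setdefault(b, set()).add(hostname)

    def stale_replicas(self, live_hostnames):
        """Replicas whose recorded host is not in the current cluster --
        candidates for 'Obsoleting block ...' style log lines."""
        live = set(live_hostnames)
        return {b: hosts - live
                for b, hosts in self.block_locations.items()
                if hosts - live}

nn = TinyNameNode()
nn.report("slave1", ["blk_1", "blk_2"])
nn.report("slave2", ["blk_2", "blk_3"])

# After the colo move, the same machines come back under new names:
print(nn.stale_replicas(["node1", "node2"]))
# every replica recorded under the old names now looks stale
```

Of course, if the DataNodes simply re-report their on-disk blocks under 
the new names, the namenode's map would heal itself -- which is why I 
suspect the interaction with the earlier decommissioning, rather than 
the rename alone.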


Third, it's still suspicious that we removed 20% of the DataNode 
servers, and now we're missing 20% of the blocks. Makes me wonder if 
the auto-replicated blocks (what happened when we took the two 
servers out of the cluster) wound up in a state where they were 
flagged as unused when the network changed.
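
As a rough back-of-envelope check on that coincidence (assuming replicas 
were placed uniformly at random across the 10 nodes, which may not match 
the real placement policy): two of ten nodes hold about 20% of all 
*replicas*, while the fraction of distinct *blocks* that had at least 
one replica on those two nodes -- i.e. blocks touched by re-replication 
-- is closer to 38%. So "20% of blocks missing" lines up better with the 
replica share than with the re-replicated-block share.

```python
import random

def simulate(num_blocks=100_000, nodes=10, removed=2,
             replication=2, seed=42):
    """Place each block's replicas on distinct random nodes, then count
    how removing `removed` nodes touches replicas and blocks.
    Uniform random placement is an assumption."""
    rng = random.Random(seed)
    gone = set(range(removed))  # the two decommissioned nodes
    replicas_on_gone = 0
    blocks_touched = 0
    for _ in range(num_blocks):
        placement = rng.sample(range(nodes), replication)
        on_gone = sum(1 for n in placement if n in gone)
        replicas_on_gone += on_gone
        if on_gone:
            blocks_touched += 1
    return (replicas_on_gone / (num_blocks * replication),
            blocks_touched / num_blocks)

replica_share, block_share = simulate()
print(f"share of replicas on removed nodes:      {replica_share:.2f}")
print(f"share of blocks needing re-replication:  {block_share:.2f}")
```

The exact expectations are 2/10 = 20% for replicas and 
1 - (8*7)/(10*9) = 17/45 ~ 38% for blocks. Make of that what you will; 
it's only suggestive, not proof of any particular mechanism.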


Anyway, curious if anybody has insights here. We've done a fair 
amount of poking around, to no avail. I don't think there's any way 
to get the blocks back, as they definitely seem to be gone, and file 
recovery on Linux seems pretty iffy. I'm mostly interested in 
figuring out if this is a known issue (Of course you can't change 
the server names and expect it to work), or whether it's a symptom 
of lurking NDFS bugs.


Thanks,

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
Find Code, Find Answers