Hi all,
We ran a test crawl using Nutch 0.8 (pre-Hadoop). The cluster
consisted of a master server and 10 slaves, with the
NameNode/JobTracker running on the master and a DataNode/TaskTracker
running on each of the slaves.
As a test, we removed two of the slaves from the cluster. We did
this one at a time: shutting down Nutch, pulling out a slave,
restarting Nutch, waiting for all block replication to finish, and
then repeating the same process for the second slave.
After we did this, we seemed to have a valid cluster - running
mapreduce tasks worked, etc.
Then we moved the servers to a new colo facility, and also changed
the network setup such that each server got a new name.
Sometime after that, we tried running another mapreduce task, and it
failed with missing block errors.
Using Andrzej's DFS checker tool, we determined that 20% of the
blocks were missing.
We also saw various log entries that looked like:
060321 125017 Obsoleting block blk_955860084498163084
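To tally entries like this across the logs, a rough sketch along the
following lines could pull out the block IDs. It assumes every such
entry has exactly the shape of the line above (date, time, message,
block ID); the function name and the allowance for a minus sign in
the ID are assumptions for illustration, not anything taken from the
NDFS code.

```python
import re

# Assumed entry shape, based only on the sample line above:
#   "060321 125017 Obsoleting block blk_955860084498163084"
# The optional minus sign is an assumption in case IDs can be negative.
OBSOLETE_RE = re.compile(r"^\d{6} \d{6} Obsoleting block (blk_-?\d+)\s*$")


def obsoleted_blocks(lines):
    """Return the set of block IDs named in 'Obsoleting block' entries."""
    found = set()
    for line in lines:
        match = OBSOLETE_RE.match(line)
        if match:
            found.add(match.group(1))
    return found
```

Comparing that set against the list of blocks the DFS checker reports
as missing would show whether the obsoleted blocks are the lost ones.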
One additional note - we'd been using a minimal (2x) replication
factor, due to disk space constraints.
So the first question is whether anybody has run into a similar problem.
Second, it doesn't seem like the Nutch (remember, this is pre-Hadoop)
DFS code saves the server name or IP address with blocks, but I
haven't dug into this. If it does, that would obviously cause problems
for us.
Third, it's still suspicious that we removed 20% of the DataNode
servers, and now we're missing 20% of the blocks. Makes me wonder if
the auto-replicated blocks (what happened when we took the two
servers out of the cluster) wound up in a state where they were
flagged as unused when the network changed.
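As a back-of-the-envelope check on that 20% figure, here is a small
sketch. It assumes (purely an assumption, not verified against the
NDFS code) that each block's replicas sit on distinct DataNodes chosen
uniformly at random, and it takes the worst case where the removed
servers vanish at once with no re-replication in between. The function
name is hypothetical.

```python
from fractions import Fraction
from math import comb


def fraction_blocks_lost(nodes, replicas, removed):
    """Fraction of blocks whose every replica is on a removed node.

    Assumes replicas are on distinct nodes picked uniformly at random,
    and that no re-replication happens between removals.
    """
    # Ways to place all replicas on removed nodes / ways to place them anywhere.
    return Fraction(comb(removed, replicas), comb(nodes, replicas))


# Our setup: 10 DataNodes, 2x replication, 2 servers removed.
worst_case = fraction_blocks_lost(nodes=10, replicas=2, removed=2)
```

Under those assumptions only 1 block in 45 (about 2%) would lose both
replicas, so plain bad luck in block placement doesn't look like an
explanation for losing 20%.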
Anyway, curious if anybody has insights here. We've done a fair
amount of poking around, to no avail. I don't think there's any way
to get the blocks back, as they definitely seem to be gone, and file
recovery on Linux seems pretty iffy. I'm mostly interested in
figuring out if this is a known issue ("of course you can't change
the server names and expect it to work"), or whether it's a symptom
of lurking NDFS bugs.
Thanks,
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
Find Code, Find Answers