Thanks for your help, J-D. Answers inline:
On Mar 31, 2011, at 8:00 PM, Jean-Daniel Cryans wrote:
> I wouldn't worry too much at the moment about what seems to be double
> deletes of blocks; I'd like to concentrate on the state of your
> cluster first.
>
> So if you run hbck, do you see any inconsistencies?
>
Clean, no inconsistencies; hadoop fsck comes back clean as well.
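In case it's useful, the checks were just the stock ones, roughly:

  $ hbase hbck       # reports "0 inconsistencies detected"
  $ hadoop fsck /    # reports the filesystem under / as HEALTHY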
> In the datanode logs, do you see any exceptions regarding xcievers
> (just in case).
>
The only place it shows up is this BlockAlreadyExistsException:
2011-03-29 22:55:12,726 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
DatanodeRegistration(10.56.24.12:50010,
storageID=DS-1245200625-10.56.24.12-50010-1297226452434, infoPort=50075,
ipcPort=50020):DataXceiver
org.apache.hadoop.hdfs.server.datanode.BlockAlreadyExistsException: Block
blk_5380327420628627923_1036814 is valid, and cannot be written to.
        at org.apache.hadoop.hdfs.server.datanode.FSDataset.writeToBlock(FSDataset.java:1314)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:99)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:318)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:122)
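For completeness, this is roughly how I searched the datanode logs for
xciever trouble (the log path is just where ours happen to be, and the
"exceeds the limit of concurrent xcievers" string is the message I'd expect
to see if we were bumping into dfs.datanode.max.xcievers):

  $ grep -iE "xciever|xceiver" /var/log/hadoop/*datanode*.log
  $ grep "exceeds the limit of concurrent xcievers" /var/log/hadoop/*datanode*.log

Neither turned up anything beyond the exception above.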
> In the region server logs, after the split has failed, do you see a
> line that starts with "Successful rollback of failed split of..."? I
> would hope so as the region server would have aborted in case the
> split failed.
>
Yes, the rollback was successful.
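(For reference, the check is just a grep for that message in the
regionserver log; the path below is only an assumption about where the logs
live:)

  $ grep "Successful rollback of failed split" /var/log/hbase/hbase-*-regionserver-*.log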
> Actually, would it be possible to see bigger logs? As much as I like
> the deep investigation you did, I don't want to go down a single debug
> path when something else might have happened that we never got a chance
> to see. Very often, what looks like an HDFS issue turns out to be a
> network issue (did you do those checks?).
>
I didn't see any issues with the network (no errors, dropped packets, etc.).
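The checks were the basic per-node counters, along these lines (the
interface name here is just an example):

  $ ifconfig eth0 | grep -E "errors|dropped"
  $ ethtool -S eth0 | grep -iE "err|drop"
  $ netstat -s | grep -iE "retrans|fail"

Nothing stood out on the nodes I looked at.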
Here is more of the master log:
http://pastebin.com/PZqDZ1TV
Regionserver:
http://pastebin.com/pCDMPbB2
Datanode (I had to snip some stuff to fit in pastebin):
http://pastebin.com/DrWzgykH
> Same for the master logs, it'd be nice to look for any weirdness.
>
> Thx!
>
> J-D
>
> On Thu, Mar 31, 2011 at 5:03 PM, Christopher Tarnas <[email protected]> wrote:
>> I've been trying to track down some hbase strangeness from what looks to be
>> lost hbase puts: in one thrift put we insert data into two different column
>> families at different rowkeys, but only one of the rows is there. There
>> were no errors returned to the client or logged by thrift, which is a little
>> disturbing. This is fairly well-tested code that has worked flawlessly up to
>> this point, so I started looking for problems in the hbase and hadoop logs.
>> We are using CDH3b4.
>>
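>> To make the failure mode concrete, the write is shaped roughly like the
>> following (shown here as the equivalent hbase shell puts rather than the
>> actual thrift call; the 'sequence' table and 'core' family are real, but
>> the rowkeys, the second family and the qualifiers are made up for
>> illustration):
>>
>>   put 'sequence', 'rowkey-A', 'core:data', '...'
>>   put 'sequence', 'rowkey-B', 'other_family:data', '...'
>>
>> Both mutations go out in the same thrift call, but afterwards only one of
>> the two rows exists.
>>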
>> This happened during a large (>1 billion row) load that ran over 11 hours. In
>> the regionserver logs it looks like they were having trouble talking to the
>> datanodes during splits. I'd see worrying stuff like this in the
>> regionserver log:
>>
>> 2011-03-29 22:55:12,946 INFO org.apache.hadoop.hdfs.DFSClient: Could not
>> complete file
>> /hbase/sequence/d13ef276819124e550bb6e0be9c5cdc8/splits/c6f102f1781897a0bd2025bd8252c3cd/core/9219969397457794945.d13ef276819124e550bb6e0be9c5cdc8
>> retrying...
>> 2011-03-29 22:55:23,628 INFO
>> org.apache.hadoop.hbase.regionserver.CompactSplitThread: Running rollback of
>> failed split of
>> sequence,MI6Q0pMyu3mQDR7hp71RBA\x093d6ae329777234c5ae9019cf8f5cfe80-A2600002-1\x09,1301367397529.d13ef276819124e550bb6e0be9c5cdc8.;
>> Took too long to split the files and create the references, aborting split
>> 2011-03-29 22:55:23,661 WARN org.apache.hadoop.ipc.Client: interrupted
>> waiting to send params to server
>> java.lang.InterruptedException
>>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1279)
>>
>> More of the regionserver log here: http://pastebin.com/vnzHWXKT
>>
>> I saw those messages at different but close times on each of the nodes I
>> checked. I then looked at the datanode log, and at the same time it had
>> suspicious messages like this:
>>
>> 2011-03-29 22:55:12,726 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
>> DatanodeRegistration(10.56.24.12:50010,
>> storageID=DS-1245200625-10.56.24.12-50010-1297226452434, infoPort=50075,
>> ipcPort=50020):DataXceiver
>> org.apache.hadoop.hdfs.server.datanode.BlockAlreadyExistsException: Block
>> blk_5380327420628627923_1036814 is valid, and cannot be written to.
>>         at org.apache.hadoop.hdfs.server.datanode.FSDataset.writeToBlock(FSDataset.java:1314)
>>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:99)
>>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:318)
>>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:122)
>>
>>
>> Here is more of the datanode log: http://pastebin.com/yPP6EG8k
>>
>> Grepping for that block in the datanode log yields this:
>>
>> 2011-03-29 22:54:54,742 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_5380327420628627923_1036814 src: /10.56.24.12:45205 dest: /10.56.24.12:50010
>> 2011-03-29 22:54:54,748 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.56.24.12:45205, dest: /10.56.24.12:50010, bytes: 81, op: HDFS_WRITE, cliID: DFSClient_-1734106091, offset: 0, srvID: DS-1245200625-10.56.24.12-50010-1297226452434, blockid: blk_5380327420628627923_1036814, duration: 1102000
>> 2011-03-29 22:54:54,749 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 2 for block blk_5380327420628627923_1036814 terminating
>> 2011-03-29 22:55:12,582 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_5380327420628627923_1036814 src: /10.56.24.19:41159 dest: /10.56.24.12:50010
>> 2011-03-29 22:55:12,726 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_5380327420628627923_1036814 received exception org.apache.hadoop.hdfs.server.datanode.BlockAlreadyExistsException: Block blk_5380327420628627923_1036814 is valid, and cannot be written to.
>> org.apache.hadoop.hdfs.server.datanode.BlockAlreadyExistsException: Block blk_5380327420628627923_1036814 is valid, and cannot be written to.
>> 2011-03-29 22:55:17,380 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Scheduling block blk_5380327420628627923_1036814 file /biotique11/titan_data/hdfs/data/current/subdir52/blk_5380327420628627923 for deletion
>> 2011-03-29 22:55:17,381 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Deleted block blk_5380327420628627923_1036814 at file /biotique11/titan_data/hdfs/data/current/subdir52/blk_5380327420628627923
>> 2011-03-29 22:56:12,697 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Unexpected error trying to delete block blk_5380327420628627923_1036814. BlockInfo not found in volumeMap.
>>
>> It also looks like the block is being deleted twice. So I grepped the namenode
>> log for the block and saw this:
>>
>> http://pastebin.com/FDv0vMSM
>>
>> From what I can tell, the block becomes over-replicated and is added to the
>> invalidates set, deleted on the node via that path, and then deleted again
>> when the file itself is deleted. I then grepped for that file in the namenode
>> log to see what that reveals:
>>
>> http://pastebin.com/C6mFpvWc
>>
>> Not sure what to make of that - well out of my depth. It does look like the
>> file is created, deleted, then created again.
>>
>> Any help figuring out what happened would be great. I'm probably going to
>> have to redo the load but I would like to have an idea on what went wrong.
>>
>> thanks,
>> -chris
>>