On Tue, Mar 17, 2015 at 11:42 PM, Mike Dillon <[email protected]> wrote:
> Thanks. I'll look into those suggestions tomorrow. I'm pretty sure that
> short-circuit reads are not turned on, but I'll double check when I follow
> up on this.
>
> The main issue that actually led to me being asked to look into this issue
> was that the cluster had a datanode running at 100% disk usage on all its
> mounts. Since it was already in a compromised state and I didn't fully
> understand what restarting it would do, I haven't done that yet.
>

Understood.

> It turned out that at least part of the reason that the node got to 100%
> capacity was that major compactions had been silently failing for a couple
> weeks due to the aforementioned corrupt block. When I looked into the logs
> of the node at capacity, I was seeing "compaction failed" error messages
> for a particular region, caused by BlockMissingExceptions for a particular
> block. That's what led me to fsck that block file and start digging into
> the underlying data. The weird thing is that the at-capacity node actually
> had one of the good copies of the failed block and it was a different node
> that had the broken one.
>

Ok. HDFS gets a little unpredictable when full or, to put it another way, it
has not been well tested at this extreme. Please paste the exceptions in here
when you get a chance. Will help with
https://issues.apache.org/jira/browse/HBASE-12949

> And of course, the logs for when this broken HFile was created have already
> been aged out, so I'm left to chase shadows to some extent.

Of course. Let us try and help out.

St.Ack
