On Tue, Mar 17, 2015 at 11:42 PM, Mike Dillon <[email protected]> wrote:
> Thanks. I'll look into those suggestions tomorrow. I'm pretty sure that
> short-circuit reads are not turned on, but I'll double check when I follow
> up on this.
>
> The main issue that actually led to me being asked to look into this issue
> was that the cluster had a datanode running at 100% disk usage on all its
> mounts. Since it was already in a compromised state and I didn't fully
> understand what restarting it would do, I haven't done that yet.
>

Understood.

> It turned out that at least part of the reason that the node got to 100%
> capacity was that major compactions had been silently failing for a couple
> weeks due to the aforementioned corrupt block. When I looked into the logs
> of the node at capacity, I was seeing "compaction failed" error messages
> for a particular region, caused by BlockMissingExceptions for a particular
> block. That's what led me to fsck that block file and start digging into
> the underlying data. The weird thing is that the at-capacity node actually
> had one of the good copies of the failed block and it was a different node
> that had the broken one.
>

Ok. HDFS gets a little unpredictable when full or, to put it another way, it
has not been well tested at this extreme. Please paste the exceptions in here
when you get a chance. Will help with
https://issues.apache.org/jira/browse/HBASE-12949

> And of course, the logs for when this broken HFile was created have already
> been aged out, so I'm left to chase shadows to some extent.

Of course. Let us try and help out.

St.Ack
