On Tue, Apr 29, 2014 at 11:53 AM, Stack <[email protected]> wrote:

> On Tue, Apr 29, 2014 at 1:54 AM, Krishna Rao <[email protected]> wrote:
>
>> Thank you for your reply Anoop.
>>
>> However, the confusion is, unfortunately, still there because of the
>> following (from
>> http://hbase.apache.org/book.html#perf.hdfs.configs.localread):
>>
>> "For optimal performance when short-circuit reads are enabled, it is
>> recommended that HDFS checksums are disabled. To maintain data integrity
>> with HDFS checksums disabled, HBase can be configured to write its own
>> checksums into its datablocks and verify against these."
>>
>
> The text is confusing. If you read the next sentence and click on the
> description under hbase.regionserver.checksum.verify
> (http://hbase.apache.org/book.html#hbase.regionserver.checksum.verify)
> it should be a little clearer.
>
> The confusion comes from the little configuration dance that is necessary
> around hbase writing checksums optionally inline into hfiles
Correction: we seem to always write hbase checksums inline with the data. See
http://hbase.apache.org/xref/org/apache/hadoop/hbase/regionserver/HStore.html#901

The HBase checksums are always present. The flag then is just about whether
they are used at read time. If so, at read time we ask HDFS for a stream that
does not validate checksums (if there is an error on this stream, we reopen,
asking HDFS to do checksum validation).

St.Ack

> so they are available inline at read time and the interaction w/ native
> hdfs checksumming. When running with hbase checksumming of hfiles, we want
> a means of telling HDFS to NOT validate the checksum -- i.e. double
> checksumming -- because hbase will be doing it (unless there is an error,
> and then we'll fall back to HDFS validation). Let me try and clean up the
> docs.
>
> St.Ack
>
>
>> To me it implies that HDFS checksums need to be disabled, meaning that
>> HDFS wouldn't write checksums into its datablocks. But HBase would be
>> fine by writing its own checksums.
>>
>>
>> On 29 April 2014 09:32, Anoop John <[email protected]> wrote:
>>
>> > HBase using its own checksum handling doesn't directly affect HDFS. It
>> > will still maintain checksum info. The diff is at read time: HBase
>> > will open a reader with checksum validation false and it will do
>> > checksum validation on its own. So using hbase-handled checksums in a
>> > cluster should not affect other data. Does that solve your doubt?
>> >
>> > -Anoop-
>> >
>> > On Tue, Apr 29, 2014 at 1:58 PM, Krishna Rao <[email protected]>
>> > wrote:
>> >
>> > > Hi Ted,
>> > >
>> > > I had read those, but I'm confused about how this will affect
>> > > non-HBase HDFS data. With HDFS checksumming off, won't it affect
>> > > data integrity?
>> > >
>> > > Krishna
>> > >
>> > >
>> > > On 24 April 2014 15:54, Ted Yu <[email protected]> wrote:
>> > >
>> > > > Please take a look at the following:
>> > > >
>> > > > http://hbase.apache.org/book.html#perf.hdfs.configs.localread
>> > > > http://hbase.apache.org/book.html#hbase.regionserver.checksum.verify
>> > > >
>> > > >
>> > > > On Thu, Apr 24, 2014 at 5:55 AM, Krishna Rao <[email protected]>
>> > > > wrote:
>> > > >
>> > > > > Hi all,
>> > > > >
>> > > > > I understand that there is a significant performance gain when
>> > > > > turning on short-circuit reads, and additionally by setting
>> > > > > HBase to do checksums rather than HDFS.
>> > > > >
>> > > > > However, I'm a little confused by this: do I need to turn off
>> > > > > checksums within HDFS for the entire file system? We don't just
>> > > > > use HBase on our cluster, so this would seem to be a bad idea,
>> > > > > right?
>> > > > >
>> > > > > Cheers,
>> > > > >
>> > > > > Krishna
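[Editor's note: for reference, the settings the thread is discussing look roughly like the following. Property names are the ones documented in the HBase book and HDFS docs; the values and the socket path are illustrative, not recommendations.]

```
<!-- hbase-site.xml: have HBase verify its own inline hfile checksums at
     read time (this is the flag Stack describes above: when true, HBase
     asks HDFS for a stream that does not validate HDFS checksums, and
     falls back to HDFS validation on error) -->
<property>
  <name>hbase.regionserver.checksum.verify</name>
  <value>true</value>
</property>

<!-- hdfs-site.xml (and visible to the RegionServer): enable short-circuit
     local reads; the domain socket path is illustrative -->
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <name>dfs.domain.socket.path</name>
  <value>/var/lib/hadoop-hdfs/dn_socket</value>
</property>
```

Note that none of this disables checksums for the file system as a whole: HDFS keeps writing and storing its own checksums, so non-HBase data is unaffected.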

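[Editor's note: the "checksum written inline with the data, verified by the reader itself" idea can be sketched with a toy example. This is a hypothetical helper, not HBase's code -- the real implementation lives in HFileBlock/ChecksumUtil and checksums per chunk, and on a mismatch HBase falls back to HDFS checksum validation rather than returning null.]

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

public class InlineChecksumDemo {
    // Writer side: append a 4-byte CRC32 after the payload, so the
    // checksum travels inline with the data block.
    static byte[] writeBlock(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        ByteBuffer out = ByteBuffer.allocate(payload.length + 4);
        out.put(payload).putInt((int) crc.getValue());
        return out.array();
    }

    // Reader side: verify the inline checksum ourselves instead of asking
    // the file system to. Returns the payload, or null on mismatch (where
    // real code would retry with HDFS-level checksum validation).
    static byte[] readBlock(byte[] block) {
        ByteBuffer in = ByteBuffer.wrap(block);
        byte[] payload = new byte[block.length - 4];
        in.get(payload);
        int stored = in.getInt();
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        return stored == (int) crc.getValue() ? payload : null;
    }

    public static void main(String[] args) {
        byte[] block = writeBlock("hello hfile".getBytes());
        if (readBlock(block) == null) throw new AssertionError("valid block should verify");
        block[0] ^= 0x1;  // corrupt one payload byte -> inline CRC mismatch
        if (readBlock(block) != null) throw new AssertionError("corruption should be caught");
        System.out.println("inline checksum demo passed");
    }
}
```

The point of the toy: the checksum is part of the block the reader already fetched, so verifying it costs no extra I/O -- which is why double-checking with HDFS checksums on top is wasted work.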