On Tue, Apr 29, 2014 at 11:53 AM, Stack <[email protected]> wrote:

> On Tue, Apr 29, 2014 at 1:54 AM, Krishna Rao <[email protected]> wrote:
>
>> Thank you for your reply Anoop.
>>
>> However, the confusion is, unfortunately, still there because of the
>> following (from
>> http://hbase.apache.org/book.html#perf.hdfs.configs.localread):
>>
>> "For optimal performance when short-circuit reads are enabled, it is
>> recommended that HDFS checksums are disabled. To maintain data integrity
>> with HDFS checksums disabled, HBase can be configured to write its own
>> checksums into its datablocks and verify against these."
>>
>
> The text is confusing. If you read the next sentence and click on the
> description under hbase.regionserver.checksum.verify
> (http://hbase.apache.org/book.html#hbase.regionserver.checksum.verify)
> it should be a little clearer.
>
> The confusion comes from the little configuration dance that is necessary
> around hbase writing checksums optionally inline into hfiles
Correction: we seem to always write hbase checksums inline with the data. See
http://hbase.apache.org/xref/org/apache/hadoop/hbase/regionserver/HStore.html#901

The HBase checksums are always present. The flag then is just about whether
they are used at read time. If so, at read time we ask HDFS for a stream that
does not validate checksums (if there is an error on this stream, we reopen,
asking HDFS to do checksum validation).

St.Ack

> so they are available inline at read time and the interaction w/ native
> hdfs checksumming. When running with hbase checksumming of hfiles, we want
> a means of telling HDFS to NOT validate the checksum -- i.e. double
> checksumming -- because hbase will be doing it (unless there is an error,
> and then we'll fall back to HDFS validation). Let me try and clean up the
> docs.
>
> St.Ack
>
>
>> To me it implies that HDFS checksums need to be disabled, meaning that
>> HDFS wouldn't write checksums into its datablocks. But HBase would be
>> fine by writing its own checksums.
>>
>>
>> On 29 April 2014 09:32, Anoop John <[email protected]> wrote:
>>
>> > HBase using its own checksum handling doesn't directly affect HDFS. It
>> > will still maintain checksum info. The diff is at read time: HBase
>> > will open a reader with checksum validation false and it will do
>> > checksum validation on its own. So using hbase-handled checksums in a
>> > cluster should not affect other data. Does that solve your doubt?
>> >
>> > -Anoop-
>> >
>> > On Tue, Apr 29, 2014 at 1:58 PM, Krishna Rao <[email protected]>
>> > wrote:
>> >
>> > > Hi Ted,
>> > >
>> > > I had read those, but I'm confused about how this will affect
>> > > non-HBase HDFS data. With HDFS checksumming off, won't it affect
>> > > data integrity?
>> > >
>> > > Krishna
>> > >
>> > >
>> > > On 24 April 2014 15:54, Ted Yu <[email protected]> wrote:
>> > >
>> > > > Please take a look at the following:
>> > > >
>> > > > http://hbase.apache.org/book.html#perf.hdfs.configs.localread
>> > > > http://hbase.apache.org/book.html#hbase.regionserver.checksum.verify
>> > > >
>> > > >
>> > > > On Thu, Apr 24, 2014 at 5:55 AM, Krishna Rao <[email protected]>
>> > > > wrote:
>> > > >
>> > > > > Hi all,
>> > > > >
>> > > > > I understand that there is a significant performance gain when
>> > > > > turning on short-circuit reads, and additionally by setting
>> > > > > HBase to do checksums rather than HDFS.
>> > > > >
>> > > > > However, I'm a little confused by this: do I need to turn off
>> > > > > checksums within HDFS for the entire file system? We don't just
>> > > > > use HBase on our cluster, so this would seem to be a bad idea,
>> > > > > right?
>> > > > >
>> > > > > Cheers,
>> > > > >
>> > > > > Krishna
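[Editor's note: for reference, the settings the thread is discussing look roughly like the following. Property names are the ones documented in the HBase book and HDFS docs; the values and the socket path are illustrative, not recommendations.]

```
<!-- hbase-site.xml: have HBase verify its own inline hfile checksums at
     read time (this is the flag Stack describes above: when true, HBase
     asks HDFS for a stream that does not validate HDFS checksums, and
     falls back to HDFS validation on error) -->
<property>
  <name>hbase.regionserver.checksum.verify</name>
  <value>true</value>
</property>

<!-- hdfs-site.xml (and visible to the RegionServer): enable short-circuit
     local reads; the domain socket path is illustrative -->
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <name>dfs.domain.socket.path</name>
  <value>/var/lib/hadoop-hdfs/dn_socket</value>
</property>
```

Note that none of this disables checksums for the file system as a whole: HDFS keeps writing and storing its own checksums, so non-HBase data is unaffected.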

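[Editor's note: the "checksum written inline with the data, verified by the reader itself" idea can be sketched with a toy example. This is a hypothetical helper, not HBase's code -- the real implementation lives in HFileBlock/ChecksumUtil and checksums per chunk, and on a mismatch HBase falls back to HDFS checksum validation rather than returning null.]

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

public class InlineChecksumDemo {
    // Writer side: append a 4-byte CRC32 after the payload, so the
    // checksum travels inline with the data block.
    static byte[] writeBlock(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        ByteBuffer out = ByteBuffer.allocate(payload.length + 4);
        out.put(payload).putInt((int) crc.getValue());
        return out.array();
    }

    // Reader side: verify the inline checksum ourselves instead of asking
    // the file system to. Returns the payload, or null on mismatch (where
    // real code would retry with HDFS-level checksum validation).
    static byte[] readBlock(byte[] block) {
        ByteBuffer in = ByteBuffer.wrap(block);
        byte[] payload = new byte[block.length - 4];
        in.get(payload);
        int stored = in.getInt();
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        return stored == (int) crc.getValue() ? payload : null;
    }

    public static void main(String[] args) {
        byte[] block = writeBlock("hello hfile".getBytes());
        if (readBlock(block) == null) throw new AssertionError("valid block should verify");
        block[0] ^= 0x1;  // corrupt one payload byte -> inline CRC mismatch
        if (readBlock(block) != null) throw new AssertionError("corruption should be caught");
        System.out.println("inline checksum demo passed");
    }
}
```

The point of the toy: the checksum is part of the block the reader already fetched, so verifying it costs no extra I/O -- which is why double-checking with HDFS checksums on top is wasted work.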