Once you instantiate the HFile reader object, you should be able to issue as many random get() calls against the table as you like, until you close the reference.
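To make that open-once / read-many pattern concrete, here is a minimal, self-contained sketch in plain Python (a toy model, not the real HFile format or HBase API; names like `ToyReader` and `write_blocked_file` are invented for illustration). The block index is loaded once when the reader is opened; every subsequent random `get()` is a single positioned read of one block, with no re-open and no index re-read:

```python
import bisect
import json
import os
import tempfile

BLOCK_SIZE = 4096  # toy block size (the thread discusses HBase's 64k default)

def write_blocked_file(path, items):
    """Write sorted (key, value) pairs into fixed-size blocks; return the
    block index as (first_key, offset) entries, one per block."""
    index = []
    with open(path, "wb") as f:
        block, first_key = [], None
        for key, value in sorted(items):
            if first_key is None:
                first_key = key
            block.append([key, value])
            if len(json.dumps(block)) > BLOCK_SIZE - 64:  # block is full
                index.append((first_key, f.tell()))
                f.write(json.dumps(block).encode().ljust(BLOCK_SIZE))
                block, first_key = [], None
        if block:  # final partial block
            index.append((first_key, f.tell()))
            f.write(json.dumps(block).encode().ljust(BLOCK_SIZE))
    return index

class ToyReader:
    """Opened once; the block index then stays in memory for the life of
    the reader, as the thread describes for HFile.reader.loadFileInfo()."""

    def __init__(self, path, index):
        self.fd = os.open(path, os.O_RDONLY)      # one long-lived handle
        self.first_keys = [k for k, _ in index]   # index loaded once
        self.offsets = [o for _, o in index]

    def get(self, key):
        # The index holds each block's first key, so it defines each
        # block's key range in an [a, b) manner; pick the candidate block...
        i = bisect.bisect_right(self.first_keys, key) - 1
        if i < 0:
            return None
        # ...then one positioned read (the analogue of a single pread),
        # and a scan inside the block for the exact key.
        raw = os.pread(self.fd, BLOCK_SIZE, self.offsets[i]).rstrip(b" ")
        return next((v for k, v in json.loads(raw) if k == key), None)

    def close(self):
        os.close(self.fd)

# Usage: open once, then many random gets against the same reader.
fd, demo_path = tempfile.mkstemp()
os.close(fd)
demo_index = write_blocked_file(demo_path,
                                [(f"k{i:05d}", f"v{i}") for i in range(500)])
reader = ToyReader(demo_path, demo_index)
print(reader.get("k00042"))  # -> v42, via one positioned read
reader.close()
os.remove(demo_path)
```

The cache misses the thread mentions would happen inside `get()`, at the block read; the index lookup itself never touches disk after open.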
> Date: Tue, 2 Nov 2010 16:07:29 +0800
> Subject: Re: HBase random access in HDFS and block indices
> From: [email protected]
> To: [email protected]
>
> I read the code and my understanding is that when a RS starts, the
> StoreFiles of each Region will be instantiated. Then
> HFile.reader.loadFileInfo() will read the index and file info. So each
> StoreFile is opened only once and the block index is cached. The cache
> misses are for blocks. I mean, for a random Get, each read does not need
> to open the HFile -> read the index again.
> Is that right?
>
> 2010/10/29 Sean Bigdatafun <[email protected]>
>
> > I have the same doubt here. Let's say I have a totally random read
> > pattern (uniformly distributed).
> >
> > Now let's assume my total data size stored in HBase is 100TB on 10
> > machines (not a big deal considering today's disks), and the total
> > size of my RS' memory is 10 * 6G = 60 GB. That translates into a
> > 60 GB / 100 TB = 0.06% cache hit probability. Under a random read
> > pattern, each read is bound to experience the
> > "open -> read index -> ... -> read datablock" sequence, which would
> > be expensive.
> >
> > Any comment?
> >
> > On Mon, Oct 18, 2010 at 9:30 PM, Matt Corgan <[email protected]> wrote:
> >
> > > I was envisioning the HFiles being opened and closed more often, but
> > > it sounds like they're held open for long periods and that the
> > > indexes are permanently cached. Is it roughly correct to say that
> > > after opening an HFile and loading its checksum/metadata/index/etc,
> > > each random data block access only requires a single pread, where
> > > the pread has some threading and connection overhead but
> > > theoretically only requires one disk seek? I'm curious because I'm
> > > trying to do a lot of random reads, and given enough application
> > > parallelism, the disk seeks should become the bottleneck much sooner
> > > than the network and threading overhead.
> > > Thanks again,
> > > Matt
> > >
> > > On Tue, Oct 19, 2010 at 12:07 AM, Ryan Rawson <[email protected]> wrote:
> > >
> > > > Hi,
> > > >
> > > > Since the file is write-once, no random writes, putting the index
> > > > at the end is the only choice. The loading goes like this:
> > > > - read the fixed file trailer, i.e. at filelen - <fixed size>
> > > > - read the locations of additional variable-length sections, e.g.
> > > >   the block index
> > > > - read those indexes, including the variable-length 'file-info'
> > > >   section
> > > >
> > > > So to give some background: by default an InputStream from
> > > > DFSClient has a single socket, and positioned reads are fairly
> > > > fast. The DFS just sees 'read bytes from pos X length Y' commands
> > > > on an open socket. That is fast, but only 1 thread at a time can
> > > > use this interface. So for 'get' requests we use another interface
> > > > called pread(), which takes a position+length and returns data.
> > > > This involves setting up a 1-use socket and tearing it down when we
> > > > are done, so it is slower by 2-3 TCP RTTs, thread instantiation and
> > > > other misc overhead.
> > > >
> > > > Back to the HFile index: it is indeed stored in RAM, not the block
> > > > cache. Size is generally not an issue, and hasn't been yet. We ship
> > > > with a default block size of 64k, and I'd recommend sticking with
> > > > that unless you have evidence your data set's performance is better
> > > > under a different size. Since the index grows linearly, by a factor
> > > > of 1/64k of the bytes of data, it isn't a huge deal. Also, the
> > > > indexes are spread around the cluster, so you are pushing load to
> > > > all machines.
> > > >
> > > > On Mon, Oct 18, 2010 at 8:53 PM, Matt Corgan <[email protected]> wrote:
> > > >
> > > > > Do you guys ever worry about how big an HFile's index will be?
> > > > > For example, if you have a 512mb HFile with an 8k block size,
> > > > > you will have 64,000 blocks. If each index entry is 50b, then you
> > > > > have a 3.2mb index, which is way out of line with your intention
> > > > > of having a small block size. I believe that's read all at once,
> > > > > so it will be slow the first time... is the index cached
> > > > > somewhere (block cache?) so that index accesses are from memory?
> > > > >
> > > > > And somewhat related - since the index is stored at the end of
> > > > > the HFile, is an additional random access required to find its
> > > > > offset? If it was considered, why was that chosen over putting it
> > > > > in its own file that could be accessed directly?
> > > > >
> > > > > Thanks for all these explanations,
> > > > > Matt
> > > > >
> > > > > On Mon, Oct 18, 2010 at 11:27 PM, Ryan Rawson <[email protected]> wrote:
> > > > >
> > > > >> The primary problem is the namenode memory. It contains entries
> > > > >> for every file and block, so setting the HDFS block size small
> > > > >> limits your scalability.
> > > > >>
> > > > >> There is nothing inherently wrong with in-file random reads;
> > > > >> it's just that the HDFS client was written for a single reader
> > > > >> to read most of a file. Thus to achieve high performance you'd
> > > > >> need to do tricks, such as pipelining sockets and socket pool
> > > > >> reuse. Right now for random reads we open a new socket, read
> > > > >> data, then close it.
> > > > >>
> > > > >> On Oct 18, 2010 8:22 PM, "William Kang" <[email protected]> wrote:
> > > > >> > Hi JG and Ryan,
> > > > >> > Thanks for the excellent answers.
> > > > >> >
> > > > >> > So, I am going to push everything to the extremes without
> > > > >> > considering the memory first.
> > > > >> > In theory, if in HBase every cell size equals the HBase block
> > > > >> > size, then there would not be any in-block traversal. If in
> > > > >> > HDFS every HBase block size equals the HDFS block size, there
> > > > >> > would not be any in-file random access necessary. Would this
> > > > >> > provide the best performance?
> > > > >> >
> > > > >> > But the problem is that if the block in HBase is too large,
> > > > >> > the memory will run out, since HBase loads blocks into memory;
> > > > >> > if the block in HDFS is too small, the NN will run out of
> > > > >> > memory, since each HDFS file takes some memory. So it is a
> > > > >> > trade-off between memory and performance. Is that right?
> > > > >> >
> > > > >> > And would it make any difference between random reading the
> > > > >> > same-size file portion from a small HDFS block and from a
> > > > >> > large HDFS block?
> > > > >> >
> > > > >> > Thanks.
> > > > >> >
> > > > >> > William
> > > > >> >
> > > > >> > On Mon, Oct 18, 2010 at 10:58 PM, Ryan Rawson <[email protected]> wrote:
> > > > >> >> On Mon, Oct 18, 2010 at 7:49 PM, William Kang <[email protected]> wrote:
> > > > >> >>> Hi,
> > > > >> >>> Recently I have spent some effort trying to understand the
> > > > >> >>> mechanisms of HBase to exploit possible performance tuning
> > > > >> >>> options. Many thanks to the folks who helped with my
> > > > >> >>> questions in this community; I have sent a report. But there
> > > > >> >>> are still a few questions left.
> > > > >> >>>
> > > > >> >>> 1. If an HFile block contains more than one keyvalue pair,
> > > > >> >>> will the block index in the HFile point out the offset for
> > > > >> >>> every keyvalue pair in that block?
> > > > >> >>> Or will the block index just point out the key ranges
> > > > >> >>> inside that block, so you have to traverse inside the block
> > > > >> >>> until you meet the key you are looking for?
> > > > >> >>
> > > > >> >> The block index contains the first key of every block. It
> > > > >> >> therefore defines, in an [a,b) manner, the range of each
> > > > >> >> block. Once a block has been selected to read from, it is
> > > > >> >> read into memory, then iterated over until the key in
> > > > >> >> question has been found (or the closest match has been
> > > > >> >> found).
> > > > >> >>
> > > > >> >>> 2. When HBase reads a block to fetch the data or traverse
> > > > >> >>> in it, is this block read into memory?
> > > > >> >>
> > > > >> >> Yes, the entire block is read in a single read operation.
> > > > >> >>
> > > > >> >>> 3. HBase blocks (64k, configurable) live inside HDFS blocks
> > > > >> >>> (64m, configurable); to read the HBase blocks, we have to
> > > > >> >>> randomly access the HDFS blocks. Even though HBase can use
> > > > >> >>> in(p, buf, 0, x) to read a small portion of a larger HDFS
> > > > >> >>> block, it is still a random access. Would this be slow?
> > > > >> >>
> > > > >> >> Random access reads are not necessarily slow; they require
> > > > >> >> several things:
> > > > >> >> - disk seeks to the data in question
> > > > >> >> - disk seeks to the checksum files in question
> > > > >> >> - checksum computation and verification
> > > > >> >>
> > > > >> >> While not particularly slow, this could probably be
> > > > >> >> optimized a bit.
> > > > >> >>
> > > > >> >> Most of the issues with random reads in HDFS are about
> > > > >> >> parallelizing the reads and doing as much
> > > > >> >> io-pushdown/scheduling as possible without consuming an
> > > > >> >> excess of sockets and threads. The actual speed can be
> > > > >> >> excellent, or not, depending on how busy the IO subsystem
> > > > >> >> is.
> > > > >> >>
> > > > >> >>> Many thanks. I would be grateful for your answers.
> > > > >> >>>
> > > > >> >>> William
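Ryan's description of open-time loading (a fixed-size trailer at the end of the file, which then points at the variable-length index) also answers Matt's question about finding the index's offset: it costs one extra positioned read of a known, fixed size, once per open. A minimal sketch of that layout in Python (a toy format invented here; the real HFile trailer holds much more than a single offset):

```python
import json
import os
import struct
import tempfile

TRAILER_FMT = ">Q"  # toy fixed-size trailer: just the index offset (8 bytes)
TRAILER_SIZE = struct.calcsize(TRAILER_FMT)

def write_toy_hfile(path, data_blocks):
    """Write-once layout: [data blocks][block index][fixed-size trailer].
    Since there are no random writes, the index can only go at the end."""
    index = []
    with open(path, "wb") as f:
        for block in data_blocks:
            index.append(f.tell())  # record each block's offset
            f.write(block)
        index_offset = f.tell()
        f.write(json.dumps(index).encode())              # variable-length index
        f.write(struct.pack(TRAILER_FMT, index_offset))  # fixed-size trailer

def open_toy_hfile(path):
    """Open-time loading, as in the thread: read the fixed trailer at
    filelen - <fixed size>, then read the index it points to."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        f.seek(size - TRAILER_SIZE)
        (index_offset,) = struct.unpack(TRAILER_FMT, f.read(TRAILER_SIZE))
        f.seek(index_offset)
        return json.loads(f.read(size - TRAILER_SIZE - index_offset))
```

After this, the index stays in RAM and data blocks are fetched with single positioned reads, so the two extra reads happen once per open, not once per get.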
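The back-of-the-envelope numbers in the thread check out. A quick sketch (the 50-byte index entry is Matt's assumption, and the 60 GB / 100 TB figures are Sean's hypothetical cluster, not measurements):

```python
# Matt's scenario: 512 MB HFile with 8 KB blocks, ~50 bytes per index entry.
hfile_bytes = 512 * 1024 * 1024
blocks_8k = hfile_bytes // (8 * 1024)    # 65,536 blocks (Matt rounds to 64,000)
index_8k = blocks_8k * 50                # ~3.2 MB of index, as he says

# The same file at the default 64 KB block size: the index shrinks 8x.
blocks_64k = hfile_bytes // (64 * 1024)  # 8,192 blocks
index_64k = blocks_64k * 50              # ~0.4 MB

# Sean's uniform-random cache-hit estimate: 60 GB of RS memory vs 100 TB.
hit_rate = 60 / (100 * 1024)             # ~0.0006, i.e. ~0.06%

print(blocks_8k, index_8k // 1024, blocks_64k, round(hit_rate * 100, 2))
```

This is the "grows linearly by a factor of 1/64k" point in concrete terms: at the default block size, even a multi-GB store file carries only a few hundred KB of index per 512 MB of data.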
