Once you instantiate the HFile reader object, you should be able to issue as many random get() calls against the table as you like, until you close the reference.
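To make that open-once / read-many pattern concrete, here is a minimal, self-contained sketch in plain Python (a toy model, not the real HFile format or HBase API; names like `ToyReader` and `write_blocked_file` are invented for illustration). The block index is loaded once when the reader is opened; every subsequent random `get()` is a single positioned read of one block, with no re-open and no index re-read:

```python
import bisect
import json
import os
import tempfile

BLOCK_SIZE = 4096  # toy block size (the thread discusses HBase's 64k default)

def write_blocked_file(path, items):
    """Write sorted (key, value) pairs into fixed-size blocks; return the
    block index as (first_key, offset) entries, one per block."""
    index = []
    with open(path, "wb") as f:
        block, first_key = [], None
        for key, value in sorted(items):
            if first_key is None:
                first_key = key
            block.append([key, value])
            if len(json.dumps(block)) > BLOCK_SIZE - 64:  # block is full
                index.append((first_key, f.tell()))
                f.write(json.dumps(block).encode().ljust(BLOCK_SIZE))
                block, first_key = [], None
        if block:  # final partial block
            index.append((first_key, f.tell()))
            f.write(json.dumps(block).encode().ljust(BLOCK_SIZE))
    return index

class ToyReader:
    """Opened once; the block index then stays in memory for the life of
    the reader, as the thread describes for HFile.reader.loadFileInfo()."""

    def __init__(self, path, index):
        self.fd = os.open(path, os.O_RDONLY)      # one long-lived handle
        self.first_keys = [k for k, _ in index]   # index loaded once
        self.offsets = [o for _, o in index]

    def get(self, key):
        # The index holds each block's first key, so it defines each
        # block's key range in an [a, b) manner; pick the candidate block...
        i = bisect.bisect_right(self.first_keys, key) - 1
        if i < 0:
            return None
        # ...then one positioned read (the analogue of a single pread),
        # and a scan inside the block for the exact key.
        raw = os.pread(self.fd, BLOCK_SIZE, self.offsets[i]).rstrip(b" ")
        return next((v for k, v in json.loads(raw) if k == key), None)

    def close(self):
        os.close(self.fd)

# Usage: open once, then many random gets against the same reader.
fd, demo_path = tempfile.mkstemp()
os.close(fd)
demo_index = write_blocked_file(demo_path,
                                [(f"k{i:05d}", f"v{i}") for i in range(500)])
reader = ToyReader(demo_path, demo_index)
print(reader.get("k00042"))  # -> v42, via one positioned read
reader.close()
os.remove(demo_path)
```

The cache misses the thread mentions would happen inside `get()`, at the block read; the index lookup itself never touches disk after open.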
> Date: Tue, 2 Nov 2010 16:07:29 +0800
> Subject: Re: HBase random access in HDFS and block indices
> From: [email protected]
> To: [email protected]
>
> I read the code and my understanding is that when a RS starts, the
> StoreFiles of each Region will be instantiated. Then
> HFile.reader.loadFileInfo() will read the index and file info. So each
> StoreFile is opened only once and the block index is cached. The cache
> misses are for blocks. I mean, for a random Get, each read does not need
> to open the HFile -> read the index again.
> Is that right?
>
> 2010/10/29 Sean Bigdatafun <[email protected]>
>
> > I have the same doubt here. Let's say I have a totally random read
> > pattern (uniformly distributed).
> >
> > Now let's assume my total data size stored in HBase is 100TB on 10
> > machines (not a big deal considering today's disks), and the total
> > size of my RS' memory is 10 * 6G = 60 GB. That translates into a
> > 60 GB / 100 TB = 0.06% cache hit probability. Under a random read
> > pattern, each read is bound to experience the
> > "open -> read index -> ... -> read datablock" sequence, which would
> > be expensive.
> >
> > Any comment?
> >
> > On Mon, Oct 18, 2010 at 9:30 PM, Matt Corgan <[email protected]> wrote:
> >
> > > I was envisioning the HFiles being opened and closed more often, but
> > > it sounds like they're held open for long periods and that the
> > > indexes are permanently cached. Is it roughly correct to say that
> > > after opening an HFile and loading its checksum/metadata/index/etc,
> > > each random data block access only requires a single pread, where
> > > the pread has some threading and connection overhead but
> > > theoretically only requires one disk seek? I'm curious because I'm
> > > trying to do a lot of random reads, and given enough application
> > > parallelism, the disk seeks should become the bottleneck much sooner
> > > than the network and threading overhead.
> > > Thanks again,
> > > Matt
> > >
> > > On Tue, Oct 19, 2010 at 12:07 AM, Ryan Rawson <[email protected]> wrote:
> > >
> > > > Hi,
> > > >
> > > > Since the file is write-once, no random writes, putting the index
> > > > at the end is the only choice. The loading goes like this:
> > > > - read the fixed file trailer, i.e. at filelen - <fixed size>
> > > > - read the locations of additional variable-length sections, e.g.
> > > >   the block index
> > > > - read those indexes, including the variable-length 'file-info'
> > > >   section
> > > >
> > > > So to give some background: by default an InputStream from
> > > > DFSClient has a single socket, and positioned reads are fairly
> > > > fast. The DFS just sees 'read bytes from pos X length Y' commands
> > > > on an open socket. That is fast, but only 1 thread at a time can
> > > > use this interface. So for 'get' requests we use another interface
> > > > called pread(), which takes a position+length and returns data.
> > > > This involves setting up a 1-use socket and tearing it down when we
> > > > are done, so it is slower by 2-3 TCP RTTs, thread instantiation and
> > > > other misc overhead.
> > > >
> > > > Back to the HFile index: it is indeed stored in RAM, not the block
> > > > cache. Size is generally not an issue, and hasn't been yet. We ship
> > > > with a default block size of 64k, and I'd recommend sticking with
> > > > that unless you have evidence your data set's performance is better
> > > > under a different size. Since the index grows linearly, by a factor
> > > > of 1/64k of the bytes of data, it isn't a huge deal. Also, the
> > > > indexes are spread around the cluster, so you are pushing load to
> > > > all machines.
> > > >
> > > > On Mon, Oct 18, 2010 at 8:53 PM, Matt Corgan <[email protected]> wrote:
> > > >
> > > > > Do you guys ever worry about how big an HFile's index will be?
> > > > > For example, if you have a 512mb HFile with an 8k block size,
> > > > > you will have 64,000 blocks. If each index entry is 50b, then you
> > > > > have a 3.2mb index, which is way out of line with your intention
> > > > > of having a small block size. I believe that's read all at once,
> > > > > so it will be slow the first time... is the index cached
> > > > > somewhere (block cache?) so that index accesses are from memory?
> > > > >
> > > > > And somewhat related - since the index is stored at the end of
> > > > > the HFile, is an additional random access required to find its
> > > > > offset? If it was considered, why was that chosen over putting it
> > > > > in its own file that could be accessed directly?
> > > > >
> > > > > Thanks for all these explanations,
> > > > > Matt
> > > > >
> > > > > On Mon, Oct 18, 2010 at 11:27 PM, Ryan Rawson <[email protected]> wrote:
> > > > >
> > > > >> The primary problem is the namenode memory. It contains entries
> > > > >> for every file and block, so setting the HDFS block size small
> > > > >> limits your scalability.
> > > > >>
> > > > >> There is nothing inherently wrong with in-file random reads;
> > > > >> it's just that the HDFS client was written for a single reader
> > > > >> to read most of a file. Thus to achieve high performance you'd
> > > > >> need to do tricks, such as pipelining sockets and socket pool
> > > > >> reuse. Right now for random reads we open a new socket, read
> > > > >> data, then close it.
> > > > >>
> > > > >> On Oct 18, 2010 8:22 PM, "William Kang" <[email protected]> wrote:
> > > > >> > Hi JG and Ryan,
> > > > >> > Thanks for the excellent answers.
> > > > >> >
> > > > >> > So, I am going to push everything to the extremes without
> > > > >> > considering the memory first.
> > > > >> > In theory, if in HBase every cell size equals the HBase block
> > > > >> > size, then there would not be any in-block traversal. If in
> > > > >> > HDFS every HBase block size equals the HDFS block size, there
> > > > >> > would not be any in-file random access necessary. Would this
> > > > >> > provide the best performance?
> > > > >> >
> > > > >> > But the problem is that if the block in HBase is too large,
> > > > >> > the memory will run out, since HBase loads blocks into memory;
> > > > >> > if the block in HDFS is too small, the NN will run out of
> > > > >> > memory, since each HDFS file takes some memory. So it is a
> > > > >> > trade-off between memory and performance. Is that right?
> > > > >> >
> > > > >> > And would it make any difference between random reading the
> > > > >> > same-size file portion from a small HDFS block and from a
> > > > >> > large HDFS block?
> > > > >> >
> > > > >> > Thanks.
> > > > >> >
> > > > >> > William
> > > > >> >
> > > > >> > On Mon, Oct 18, 2010 at 10:58 PM, Ryan Rawson <[email protected]> wrote:
> > > > >> >> On Mon, Oct 18, 2010 at 7:49 PM, William Kang <[email protected]> wrote:
> > > > >> >>> Hi,
> > > > >> >>> Recently I have spent some effort trying to understand the
> > > > >> >>> mechanisms of HBase to exploit possible performance tuning
> > > > >> >>> options. Many thanks to the folks who helped with my
> > > > >> >>> questions in this community; I have sent a report. But there
> > > > >> >>> are still a few questions left.
> > > > >> >>>
> > > > >> >>> 1. If an HFile block contains more than one keyvalue pair,
> > > > >> >>> will the block index in the HFile point out the offset for
> > > > >> >>> every keyvalue pair in that block?
> > > > >> >>> Or will the block index just point out the key ranges
> > > > >> >>> inside that block, so you have to traverse inside the block
> > > > >> >>> until you meet the key you are looking for?
> > > > >> >>
> > > > >> >> The block index contains the first key of every block. It
> > > > >> >> therefore defines, in an [a,b) manner, the range of each
> > > > >> >> block. Once a block has been selected to read from, it is
> > > > >> >> read into memory, then iterated over until the key in
> > > > >> >> question has been found (or the closest match has been
> > > > >> >> found).
> > > > >> >>
> > > > >> >>> 2. When HBase reads a block to fetch the data or traverse
> > > > >> >>> in it, is this block read into memory?
> > > > >> >>
> > > > >> >> Yes, the entire block is read in a single read operation.
> > > > >> >>
> > > > >> >>> 3. HBase blocks (64k, configurable) live inside HDFS blocks
> > > > >> >>> (64m, configurable); to read the HBase blocks, we have to
> > > > >> >>> randomly access the HDFS blocks. Even though HBase can use
> > > > >> >>> in(p, buf, 0, x) to read a small portion of a larger HDFS
> > > > >> >>> block, it is still a random access. Would this be slow?
> > > > >> >>
> > > > >> >> Random access reads are not necessarily slow; they require
> > > > >> >> several things:
> > > > >> >> - disk seeks to the data in question
> > > > >> >> - disk seeks to the checksum files in question
> > > > >> >> - checksum computation and verification
> > > > >> >>
> > > > >> >> While not particularly slow, this could probably be
> > > > >> >> optimized a bit.
> > > > >> >>
> > > > >> >> Most of the issues with random reads in HDFS are about
> > > > >> >> parallelizing the reads and doing as much
> > > > >> >> io-pushdown/scheduling as possible without consuming an
> > > > >> >> excess of sockets and threads. The actual speed can be
> > > > >> >> excellent, or not, depending on how busy the IO subsystem
> > > > >> >> is.
> > > > >> >>
> > > > >> >>> Many thanks. I would be grateful for your answers.
> > > > >> >>>
> > > > >> >>> William
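Ryan's description of open-time loading (a fixed-size trailer at the end of the file, which then points at the variable-length index) also answers Matt's question about finding the index's offset: it costs one extra positioned read of a known, fixed size, once per open. A minimal sketch of that layout in Python (a toy format invented here; the real HFile trailer holds much more than a single offset):

```python
import json
import os
import struct
import tempfile

TRAILER_FMT = ">Q"  # toy fixed-size trailer: just the index offset (8 bytes)
TRAILER_SIZE = struct.calcsize(TRAILER_FMT)

def write_toy_hfile(path, data_blocks):
    """Write-once layout: [data blocks][block index][fixed-size trailer].
    Since there are no random writes, the index can only go at the end."""
    index = []
    with open(path, "wb") as f:
        for block in data_blocks:
            index.append(f.tell())  # record each block's offset
            f.write(block)
        index_offset = f.tell()
        f.write(json.dumps(index).encode())              # variable-length index
        f.write(struct.pack(TRAILER_FMT, index_offset))  # fixed-size trailer

def open_toy_hfile(path):
    """Open-time loading, as in the thread: read the fixed trailer at
    filelen - <fixed size>, then read the index it points to."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        f.seek(size - TRAILER_SIZE)
        (index_offset,) = struct.unpack(TRAILER_FMT, f.read(TRAILER_SIZE))
        f.seek(index_offset)
        return json.loads(f.read(size - TRAILER_SIZE - index_offset))
```

After this, the index stays in RAM and data blocks are fetched with single positioned reads, so the two extra reads happen once per open, not once per get.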
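The back-of-the-envelope numbers in the thread check out. A quick sketch (the 50-byte index entry is Matt's assumption, and the 60 GB / 100 TB figures are Sean's hypothetical cluster, not measurements):

```python
# Matt's scenario: 512 MB HFile with 8 KB blocks, ~50 bytes per index entry.
hfile_bytes = 512 * 1024 * 1024
blocks_8k = hfile_bytes // (8 * 1024)    # 65,536 blocks (Matt rounds to 64,000)
index_8k = blocks_8k * 50                # ~3.2 MB of index, as he says

# The same file at the default 64 KB block size: the index shrinks 8x.
blocks_64k = hfile_bytes // (64 * 1024)  # 8,192 blocks
index_64k = blocks_64k * 50              # ~0.4 MB

# Sean's uniform-random cache-hit estimate: 60 GB of RS memory vs 100 TB.
hit_rate = 60 / (100 * 1024)             # ~0.0006, i.e. ~0.06%

print(blocks_8k, index_8k // 1024, blocks_64k, round(hit_rate * 100, 2))
```

This is the "grows linearly by a factor of 1/64k" point in concrete terms: at the default block size, even a multi-GB store file carries only a few hundred KB of index per 512 MB of data.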
