I have the same doubt here. Let's say I have a totally random read pattern (uniformly distributed).
Now let's assume my total data size stored in HBase is 100 TB on 10 machines (not a big deal considering today's disks), and the total size of my RS' memory is 10 * 6 GB = 60 GB. That translates into a 60 GB / 100 TB ≈ 0.06% cache hit probability. Under a random read pattern, each read is bound to experience the "open -> read index -> ... -> read data block" sequence, which would be expensive. Any comments?

On Mon, Oct 18, 2010 at 9:30 PM, Matt Corgan <[email protected]> wrote:
> I was envisioning the HFiles being opened and closed more often, but it
> sounds like they're held open for long periods and that the indexes are
> permanently cached. Is it roughly correct to say that after opening an
> HFile and loading its checksum/metadata/index/etc., each random data
> block access only requires a single pread, where the pread has some
> threading and connection overhead but theoretically only requires one
> disk seek? I'm curious because I'm trying to do a lot of random reads,
> and given enough application parallelism, the disk seeks should become
> the bottleneck much sooner than the network and threading overhead.
>
> Thanks again,
> Matt
>
> On Tue, Oct 19, 2010 at 12:07 AM, Ryan Rawson <[email protected]> wrote:
>
> > Hi,
> >
> > Since the file is write-once, no random writes, putting the index at
> > the end is the only choice. The loading goes like this:
> > - read the fixed file trailer, i.e. at offset (file length - <fixed size>)
> > - read the locations of additional variable-length sections, e.g. the
> >   block index
> > - read those indexes, including the variable-length 'file-info' section
> >
> > So to give some background: by default an InputStream from DFSClient
> > has a single socket, and positioned reads are fairly fast. The DFS
> > just sees 'read bytes from pos X length Y' commands on an open socket.
> > That is fast. But only 1 thread at a time can use this interface.
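[As an aside: the back-of-envelope cache-hit estimate earlier in this message can be checked with a tiny sketch. Plain Java, using the 100 TB / 10 x 6 GB figures from this thread; the class and method names are mine, not anything from HBase:]

```java
public class CacheHitEstimate {
    /** Fraction of uniformly random reads expected to hit the block cache. */
    static double hitRatio(long cacheBytes, long dataBytes) {
        return (double) cacheBytes / (double) dataBytes;
    }

    public static void main(String[] args) {
        long cache = 10L * 6L * 1024 * 1024 * 1024;    // 10 RS * 6 GB of heap
        long data  = 100L * 1024L * 1024 * 1024 * 1024; // 100 TB on disk
        System.out.printf("hit ratio ~= %.4f%%%n", 100.0 * hitRatio(cache, data));
        // prints: hit ratio ~= 0.0586%
    }
}
```

[So roughly 1 in 1,700 reads is served from cache; everything else pays the full random-read path.]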
> > So for 'get' requests we use another interface called pread(), which
> > takes a position + length and returns data. This involves setting up a
> > one-use socket and tearing it down when we are done. So it is slower by
> > 2-3 TCP RTTs, thread instantiation and other misc overhead.
> >
> > Back to the HFile index: it is indeed stored in RAM, not in the block
> > cache. Size is generally not an issue; it hasn't been yet. We ship with
> > a default block size of 64k, and I'd recommend sticking with that
> > unless you have evidential proof that your data set performs better
> > under a different size. Since the index grows linearly by a factor of
> > 1/64k of the bytes of the data, it isn't a huge deal. Also the indexes
> > are spread around the cluster, so you are pushing load to all machines.
> >
> > On Mon, Oct 18, 2010 at 8:53 PM, Matt Corgan <[email protected]> wrote:
> > > Do you guys ever worry about how big an HFile's index will be? For
> > > example, if you have a 512mb HFile with an 8k block size, you will
> > > have 64,000 blocks. If each index entry is 50b, then you have a 3.2mb
> > > index, which is way out of line with your intention of having a small
> > > block size. I believe that's read all at once, so it will be slow the
> > > first time... is the index cached somewhere (block cache?) so that
> > > index accesses are from memory?
> > >
> > > And somewhat related - since the index is stored at the end of the
> > > HFile, is an additional random access required to find its offset?
> > > If it was considered, why was that chosen over putting it in its own
> > > file that could be accessed directly?
> > >
> > > Thanks for all these explanations,
> > > Matt
> > >
> > > On Mon, Oct 18, 2010 at 11:27 PM, Ryan Rawson <[email protected]> wrote:
> > >
> > >> The primary problem is the namenode memory. It contains entries for
> > >> every file and block, so setting the HDFS block size small limits
> > >> your scalability.
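[Aside: Matt's index-size arithmetic above (512 MB file, 8 KB blocks, ~50 bytes per entry) is easy to sanity-check. The 50-byte entry size is the thread's assumption, not a figure from HFile code; the helper below is a sketch of mine:]

```java
public class IndexSizeEstimate {
    /** One index entry per block: approximate in-memory index size in bytes. */
    static long indexBytes(long fileBytes, long blockBytes, long bytesPerEntry) {
        long blocks = (fileBytes + blockBytes - 1) / blockBytes; // round up
        return blocks * bytesPerEntry;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Matt's case: 512 MB HFile, 8 KB blocks, ~50 B/entry -> ~3.1 MB index
        System.out.println(indexBytes(512 * mb, 8 * 1024, 50) / (double) mb);  // prints 3.125
        // The 64 KB default shrinks the same file's index by 8x
        System.out.println(indexBytes(512 * mb, 64 * 1024, 50) / (double) mb); // prints 0.390625
    }
}
```

[Which is why the 1/64k growth factor Ryan mentions stays manageable: even a 1 TB region at the default block size yields an index in the tens of megabytes, spread across the cluster.]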
> > >>
> > >> There is nothing inherently wrong with in-file random reads. It's
> > >> just that the HDFS client was written for a single reader to read
> > >> most of a file. Thus to achieve high performance you'd need to do
> > >> tricks, such as pipelining sockets and socket pool reuse. Right now
> > >> for random reads we open a new socket, read data, then close it.
> > >> On Oct 18, 2010 8:22 PM, "William Kang" <[email protected]> wrote:
> > >> > Hi JG and Ryan,
> > >> > Thanks for the excellent answers.
> > >> >
> > >> > So, I am going to push everything to the extremes without
> > >> > considering the memory first. In theory, if in HBase every cell
> > >> > size equals the HBase block size, then there would not be any
> > >> > in-block traversal. If, in HDFS, every HBase block size equals the
> > >> > HDFS block size, there would not be any in-file random access
> > >> > necessary. Would this provide the best performance?
> > >> >
> > >> > But the problem is that if the block in HBase is too large, the
> > >> > memory will run out, since HBase loads blocks into memory; if the
> > >> > block in HDFS is too small, the NN will run out of memory, since
> > >> > each HDFS file takes some memory. So, it is a trade-off problem
> > >> > between memory and performance. Is that right?
> > >> >
> > >> > And would it make any difference between randomly reading the same
> > >> > size file portion from a small HDFS block and from a large HDFS
> > >> > block?
> > >> >
> > >> > Thanks.
> > >> >
> > >> >
> > >> > William
> > >> >
> > >> > On Mon, Oct 18, 2010 at 10:58 PM, Ryan Rawson <[email protected]> wrote:
> > >> >> On Mon, Oct 18, 2010 at 7:49 PM, William Kang <[email protected]> wrote:
> > >> >>> Hi,
> > >> >>> Recently I have spent some effort trying to understand the
> > >> >>> mechanisms of HBase to exploit possible performance tuning options.
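[Aside: the "in-file random access" being discussed is just a positioned read. A minimal stand-in sketch using plain java.io is below; HDFS's pread interface behaves analogously from the caller's point of view, minus the per-call socket setup and teardown Ryan describes. File name and helper are hypothetical:]

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

public class PositionedRead {
    /** Read len bytes at pos, independent of any sequential reader's state. */
    static byte[] pread(Path file, long pos, int len) throws IOException {
        byte[] buf = new byte[len];
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(pos);       // jump to the HBase-block offset inside the file
            raf.readFully(buf);  // read exactly one block's worth
        }
        return buf;
    }

    public static void main(String[] args) throws IOException {
        Path f = Files.createTempFile("hfile-sim", ".bin");
        Files.write(f, "0123456789abcdef".getBytes());
        System.out.println(new String(pread(f, 10, 4))); // prints "abcd"
        Files.delete(f);
    }
}
```

[The seek itself is cheap; per Ryan, the cost in HDFS today is the connection setup around each such read, not the positioned read per se.]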
> > >> >>> And many thanks to the folks who helped with my questions in
> > >> >>> this community; I have sent a report. But there are still a few
> > >> >>> questions left.
> > >> >>>
> > >> >>> 1. If an HFile block contains more than one keyvalue pair, will
> > >> >>> the block index in the HFile point out the offset for every
> > >> >>> keyvalue pair in that block? Or will the block index just point
> > >> >>> out the key ranges inside that block, so you have to traverse
> > >> >>> inside the block until you meet the key you are looking for?
> > >> >>
> > >> >> The block index contains the first key for every block. It
> > >> >> therefore defines, in an [a,b) manner, the range of each block.
> > >> >> Once a block has been selected to read from, it is read into
> > >> >> memory, then iterated over until the key in question has been
> > >> >> found (or the closest match has been found).
> > >> >>
> > >> >>> 2. When HBase reads a block to fetch the data or traverse in it,
> > >> >>> is the block read into memory?
> > >> >>
> > >> >> Yes, the entire block is read in a single read operation.
> > >> >>
> > >> >>>
> > >> >>> 3. HBase blocks (64k, configurable) are inside HDFS blocks (64m,
> > >> >>> configurable); to read the HBase blocks, we have to randomly
> > >> >>> access the HDFS blocks. Even though HBase can use
> > >> >>> in.read(p, buf, 0, x) to read a small portion of the larger HDFS
> > >> >>> blocks, it is still a random access. Would this be slow?
> > >> >>
> > >> >> Random access reads are not necessarily slow; they require
> > >> >> several things:
> > >> >> - disk seeks to the data in question
> > >> >> - disk seeks to the checksum files in question
> > >> >> - checksum computation and verification
> > >> >>
> > >> >> While not particularly slow, this could probably be optimized a
> > >> >> bit.
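[Aside: Ryan's description of the index (first key per block, [a,b) ranges, then a linear scan inside the chosen block) can be sketched as follows. The in-memory layout here is a toy stand-in, not HFile's actual format:]

```java
import java.util.Arrays;

public class BlockIndexLookup {
    /** firstKeys is sorted; block i covers [firstKeys[i], firstKeys[i+1]). */
    static int pickBlock(String[] firstKeys, String key) {
        int i = Arrays.binarySearch(firstKeys, key);
        return i >= 0 ? i : -i - 2; // insertion point - 1 = last firstKey <= key
    }

    /** Linear scan inside the selected block, as HBase iterates a loaded block. */
    static String findInBlock(String[][] blocks, String[] firstKeys, String key) {
        int b = pickBlock(firstKeys, key);
        if (b < 0) return null;     // key sorts before the first block's first key
        for (String k : blocks[b]) if (k.equals(key)) return k;
        return null;
    }

    public static void main(String[] args) {
        String[][] blocks = {{"apple", "bear"}, {"cat", "dog"}, {"emu", "fox"}};
        String[] firstKeys = {"apple", "cat", "emu"};
        System.out.println(pickBlock(firstKeys, "dog"));           // prints 1
        System.out.println(findInBlock(blocks, firstKeys, "dog")); // prints dog
    }
}
```

[Which answers question 1: the index resolves only which block to read; finding the exact keyvalue is a scan within that one in-memory block.]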
> > >> >>
> > >> >> Most of the issue with random reads in HDFS is parallelizing the
> > >> >> reads and doing as much io-pushdown/scheduling as possible
> > >> >> without consuming an excess of sockets and threads. The actual
> > >> >> speed can be excellent, or not, depending on how busy the IO
> > >> >> subsystem is.
> > >> >>
> > >> >>> Many thanks. I would be grateful for your answers.
> > >> >>>
> > >> >>> William
> > >> >
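[Closing aside: the "checksum computation and verification" cost Ryan lists is essentially a CRC over each chunk read. A minimal illustration with java.util.zip is below; HDFS's actual scheme (to my understanding, CRC32 per 512-byte chunk, stored in separate checksum data) differs in layout, and this only shows the verify step itself:]

```java
import java.util.zip.CRC32;

public class BlockChecksum {
    /** Compute the CRC32 of one data block, as done after each chunk read. */
    static long checksum(byte[] block) {
        CRC32 crc = new CRC32();
        crc.update(block, 0, block.length);
        return crc.getValue();
    }

    /** Verify a freshly read block against its stored checksum. */
    static boolean verify(byte[] block, long stored) {
        return checksum(block) == stored;
    }

    public static void main(String[] args) {
        byte[] block = "some block contents".getBytes();
        long stored = checksum(block);             // what the writer recorded
        System.out.println(verify(block, stored)); // prints true
        block[0] ^= 1;                             // simulate on-disk corruption
        System.out.println(verify(block, stored)); // prints false
    }
}
```

[The CPU cost per block is small; as Ryan notes, the dominant costs are the extra seeks for the checksum data and the data itself.]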
