Thanks Zahoor,

> If there is no bloom... you have to load every block and scan to find if
> the row exists..

I could be wrong, but I think the HFile index block (located at the end of the HFile) is a binary search tree containing all row-key values of the HFile. Searching that tree for a specific row-key should easily show whether the row-key exists (some node in the tree has the same row-key value) or not. Why do we need to load every block to find out whether the row exists?

regards,
Lin

On Tue, Aug 21, 2012 at 11:56 PM, jmozah <[email protected]> wrote:

> > 1. After reading the materials you sent to me, I am confused how Bloom
> > Filter could save I/O during random read. Supposing I am not using Bloom
> > Filter, in order to find whether a row (or row-key) exists, we need to scan
> > the index block which is at the end part of an HFile, the scan is in memory
> > (I think index block is always in memory, please feel free to correct me if
> > I am wrong) using binary search -- it should be pretty fast. With Bloom
> > Filter, we could be a bit faster by looking up Bloom Filter bit vector in
> > memory. Since both index block binary search and Bloom Filter bit vector
> > search are doing in memory (no I/O is involved), what kinds of I/O is
> > saved? :-)
>
> If bloom says the Row *may* be present.. the block is loaded otherwise
> not...
> If there is no bloom... you have to load every block and scan to find if
> the row exists..
>
> This may incur more IO
>
> > 2.
> >
> > > One Hadoop job doing random reads is perfectly fine. but , since you
> > > said "Handling directly user traffic"... i assumed you wanted to
> > > expose HBase independently to every client request, thereby having as
> > > many connections as the number of simultaneous req..
> >
> > Sorry I need to confirm again on this point. I think you mean
> > establishing a new connection for each request is not good, using
> > connection pool or asynchronous I/O is preferred?
>
> Yes.
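
For reference, a minimal Python sketch of the two lookup paths discussed in this thread. It is a toy model and not HBase code: ToyHFile, SimpleBloom, block_size and block_loads are made-up names, and it assumes the index keeps only the first row-key of each data block rather than every row-key. Under that assumption the in-memory binary search only picks a candidate block, and that block still has to be loaded to confirm the row, whereas a Bloom filter can rule the row out without the load.

```python
import bisect
import hashlib


class SimpleBloom:
    """Toy Bloom filter (hypothetical, not the HBase implementation)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, key):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "may be present".
        return all(self.bits[pos] for pos in self._positions(key))


class ToyHFile:
    """Sorted row-keys split into blocks, plus a sparse in-memory index."""

    def __init__(self, sorted_keys, block_size=3):
        self.blocks = [sorted_keys[i:i + block_size]
                       for i in range(0, len(sorted_keys), block_size)]
        # Assumption: the index holds only the first key of each block.
        self.index = [block[0] for block in self.blocks]
        self.bloom = SimpleBloom()
        for key in sorted_keys:
            self.bloom.add(key)
        self.block_loads = 0  # stands in for disk I/O

    def _load_block(self, i):
        self.block_loads += 1
        return self.blocks[i]

    def get(self, key, use_bloom):
        if use_bloom and not self.bloom.might_contain(key):
            return False  # ruled out in memory, no block load
        # Binary search finds the only block that could hold the key.
        i = bisect.bisect_right(self.index, key) - 1
        if i < 0:
            return False  # key sorts before the first block
        return key in self._load_block(i)


if __name__ == "__main__":
    hfile = ToyHFile(["row%02d" % n for n in range(0, 20, 2)])
    for use_bloom in (False, True):
        hfile.block_loads = 0
        found = hfile.get("row07", use_bloom=use_bloom)
        print(use_bloom, found, hfile.block_loads)
```

In this model a lookup for an absent row-key without the Bloom filter always costs one block load, because the index alone cannot say whether the key sits inside the candidate block. With the filter, the load happens only when the filter reports that the key may be present.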
