> 1. After reading the materials you sent to me, I am confused about how a
> Bloom Filter could save I/O during a random read. Suppose I am not using a
> Bloom Filter: in order to find whether a row (or row key) exists, we need to
> scan the index block, which is at the end of an HFile. The scan is done in
> memory (I think the index block is always in memory, please feel free to
> correct me if I am wrong) using binary search -- it should be pretty fast.
> With a Bloom Filter, we could be a bit faster by looking up the Bloom Filter
> bit vector in memory. Since both the index block binary search and the Bloom
> Filter bit vector lookup are done in memory (no I/O is involved), what kind
> of I/O is saved? :-)

If the bloom says the row *may* be present, the block is loaded; otherwise it
is not. If there is no bloom, you have to load every block and scan it to find
out whether the row exists, which may incur more I/O (see the sketch at the
end of this message).

> 2.
>
> > One Hadoop job doing random reads is perfectly fine. But, since you said
> > "Handling directly user traffic"... I assumed you wanted to expose HBase
> > independently to every client request, thereby having as many connections
> > as the number of simultaneous requests.
>
> Sorry, I need to confirm again on this point. I think you mean that
> establishing a new connection for each request is not good, and that using a
> connection pool or asynchronous I/O is preferred?

Yes.
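
To make point 1 concrete, here is a minimal sketch of the read path. It is not
HBase internals -- just the principle -- and it assumes Guava's BloomFilter as
a stand-in for the per-HFile bloom, plus a hypothetical
loadBlockFromDiskAndSearch() helper for the expensive block read:

```java
// Sketch: the in-memory bloom check gates whether we pay for a disk read.
// Guava's BloomFilter stands in for the per-file bloom; the disk helper is
// a hypothetical placeholder.
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.nio.charset.StandardCharsets;
import java.util.Optional;

public class BloomGatedRead {

    private final BloomFilter<CharSequence> bloom = BloomFilter.create(
            Funnels.stringFunnel(StandardCharsets.UTF_8),
            1_000_000,  // expected number of row keys in the file
            0.01);      // target false-positive rate

    // At write/flush time: remember every row key that went into the file.
    public void recordRow(String rowKey) {
        bloom.put(rowKey);
    }

    // At read time.
    public Optional<byte[]> get(String rowKey) {
        // Pure in-memory check -- no I/O has happened yet.
        if (!bloom.mightContain(rowKey)) {
            // Definitely absent: the disk read below is skipped entirely.
            return Optional.empty();
        }
        // "May be present" (possibly a false positive): now we pay for the I/O.
        return Optional.ofNullable(loadBlockFromDiskAndSearch(rowKey));
    }

    // Hypothetical placeholder for the expensive part: seeking to a data
    // block, reading it, and searching it for the row.
    private byte[] loadBlockFromDiskAndSearch(String rowKey) {
        return null; // elided
    }
}
```

The bloom check can only ever skip I/O, never add any, since the bit vector is
already in memory; the index-block binary search tells you *where* to look,
but only the bloom can tell you that you do not need to look at all.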

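On point 2, a minimal sketch of what "reuse the connection" looks like with
the HBase client API (1.x and later), assuming a table named "mytable" with a
column family "cf" -- both names are made up for the example. The idea is one
heavyweight Connection per process, shared across requests; Table and Get are
cheap, per-request objects:

```java
// Sketch: share one Connection across all requests instead of opening a new
// one per client request. Table instances are lightweight and can be created
// and closed per request.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

import java.io.IOException;

public class SharedConnectionExample {

    // One Connection for the whole process: it manages RPC connections and
    // the ZooKeeper session internally, so one per request is wasteful.
    private final Connection connection;

    public SharedConnectionExample() throws IOException {
        Configuration conf = HBaseConfiguration.create();
        this.connection = ConnectionFactory.createConnection(conf);
    }

    // Per-request path: borrow a lightweight Table from the shared Connection.
    public Result handleRequest(String rowKey) throws IOException {
        try (Table table = connection.getTable(TableName.valueOf("mytable"))) {
            Get get = new Get(Bytes.toBytes(rowKey));
            get.addFamily(Bytes.toBytes("cf"));
            return table.get(get);
        }
    }

    // Close the shared Connection only when the application shuts down.
    public void shutdown() throws IOException {
        connection.close();
    }
}
```

If you front HBase with your own service, the same idea applies: keep the
Connection for the life of the process (or use the async client / a small pool
if you need more throughput) and hand out Tables per request.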