Hi Ted Yu,

Thanks for your advice; it helped me clear up the “scan after delete” scenario.

About HBase 1.1, I don’t think it will improve efficiency for the problem I
encountered, because it mainly improves network I/O. And I think the most
effective way to improve network I/O is to reduce the number of columns. I
also ran a test to verify this: comparing one column against three columns,
client latency improved by about 5x.
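
For reference, here is a minimal sketch of the kind of narrow scan I tested
(the table name "files", family "cf" and qualifier "col1" are placeholders,
not our real schema):

    import java.io.IOException;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class NarrowScan {
        public static void main(String[] args) throws IOException {
            Scan scan = new Scan();
            // Ask for exactly one column instead of the whole family, so
            // the server ships one cell per row rather than every cell.
            scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col1"));
            try (Connection conn = ConnectionFactory.createConnection();
                 Table table = conn.getTable(TableName.valueOf("files"));
                 ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    System.out.println(Bytes.toString(r.getRow()));
                }
            }
        }
    }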

There’s one thing I want to confirm: at which exact point is it decided
whether a cell will be visible in the scan result? To be more efficient in
network I/O, this point should be before the result is sent back to the
client, and we can also deduce that it should be after the region scanner has
the partial results ready to merge. But the “Background” paragraph of “Scan
Improvements in HBase 1.1.0” at https://blogs.apache.org/hbase/ mentions
that “the ResultScanner decides which Results to make visible to the
application layer.” That confuses me.

Br, Great Soul
[email protected]





> On Jul 3, 2015, at 10:24 AM, Ted Yu <[email protected]> wrote:
> 
> You may have read http://hbase.apache.org/book.html#version.delete
> 
> Please see 'Scan Improvements in HBase 1.1.0' under
> https://blogs.apache.org/hbase/
> 
> Cheers
> 
> On Thu, Jul 2, 2015 at 6:54 PM, Song Geng <[email protected]> wrote:
> 
>> Hi everyone,
>> 
>> I am a complete novice with HBase and this community, and this is my first
>> email, so please forgive me if I cause any trouble.
>> 
>> Here is the issue:
>> 
>> We use HBase to store file information, composing the rowkey from the user
>> id and the file path.
>>        For example: a user’s id is 1000, and he has a file “a.txt” stored
>> in “/root/data/”, so the rowkey is “1000_/root/data/a.txt”.
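>> 
>> A small sketch of this rowkey scheme (the class and method names are just
>> for illustration):
>> 
>> import org.apache.hadoop.hbase.client.Scan;
>> import org.apache.hadoop.hbase.util.Bytes;
>> 
>> public class FileRowKeys {
>>     // Rowkey layout: "<userid>_<absolute file path>".
>>     static byte[] rowKey(long userId, String path) {
>>         return Bytes.toBytes(userId + "_" + path);
>>     }
>> 
>>     public static void main(String[] args) {
>>         // User 1000's file /root/data/a.txt -> "1000_/root/data/a.txt"
>>         System.out.println(Bytes.toString(rowKey(1000L, "/root/data/a.txt")));
>>         // Listing a folder then becomes a prefix scan over its rowkey range.
>>         Scan folderScan = new Scan();
>>         folderScan.setRowPrefixFilter(rowKey(1000L, "/root/data/"));
>>     }
>> }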
>> 
>> A user may store a huge number of files in our system, on the order of
>> millions or even billions. Sometimes he deletes a folder that may hold
>> millions of files, and after this kind of delete, scans often run into a
>> “timeout issue” until we do a major compaction.
>> 
>> To understand this issue, I read the Google Bigtable paper, “HBase in
>> Action”, Nick’s blog posts about the block cache, many other articles
>> about HBase, and also the source code. I ran some tests, and my
>> conclusions are listed below:
>> 
>> The test table has only one column family, and this CF has only one
>> column.
>> 
>> 1. Three aspects influence read latency: key searching, disk I/O, and
>> network I/O.
>> 2. Making the HBase client scan caching smaller reduces latency on the
>> network I/O side.
>> 3. Compared to a normal scan, the “delete” scenario spends more time on
>> searching and disk I/O, and I think mainly on searching. Consider this
>> scenario: I put a batch of data into HBase, which is then flushed into one
>> HFile. Next I delete the majority of that data starting from the start
>> key; the delete markers are recorded in another HFile. If I now do a scan
>> from the start key (assuming there is no compaction), it has to read the
>> deleted data one by one until it reaches the first item that was not
>> deleted. A sketch reproducing this follows the list.
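>> 
>> Here is a minimal sketch of that reproduction (the table “files”, family
>> “cf”, column “col1” and the row counts are placeholders for my real test):
>> 
>> import java.io.IOException;
>> import org.apache.hadoop.hbase.TableName;
>> import org.apache.hadoop.hbase.client.Admin;
>> import org.apache.hadoop.hbase.client.Connection;
>> import org.apache.hadoop.hbase.client.ConnectionFactory;
>> import org.apache.hadoop.hbase.client.Delete;
>> import org.apache.hadoop.hbase.client.Put;
>> import org.apache.hadoop.hbase.client.Result;
>> import org.apache.hadoop.hbase.client.ResultScanner;
>> import org.apache.hadoop.hbase.client.Scan;
>> import org.apache.hadoop.hbase.client.Table;
>> import org.apache.hadoop.hbase.util.Bytes;
>> 
>> public class TombstoneScanDemo {
>>     public static void main(String[] args) throws IOException {
>>         TableName name = TableName.valueOf("files");
>>         byte[] cf = Bytes.toBytes("cf");
>>         byte[] col = Bytes.toBytes("col1");
>>         try (Connection conn = ConnectionFactory.createConnection();
>>              Table table = conn.getTable(name);
>>              Admin admin = conn.getAdmin()) {
>>             // 1. Load rows, then flush so they all land in one HFile.
>>             for (int i = 0; i < 100000; i++) {
>>                 Put put = new Put(Bytes.toBytes(String.format("row%08d", i)));
>>                 put.addColumn(cf, col, Bytes.toBytes("v"));
>>                 table.put(put);
>>             }
>>             admin.flush(name);
>>             // 2. Delete most rows from the start key; only delete markers
>>             //    are written, into a second HFile. The data itself stays.
>>             for (int i = 0; i < 90000; i++) {
>>                 table.delete(new Delete(Bytes.toBytes(String.format("row%08d", i))));
>>             }
>>             admin.flush(name);
>>             // 3. Scanning from the start key must now step over the 90000
>>             //    deleted cells before the first live row comes back.
>>             Scan scan = new Scan();
>>             scan.setCaching(100); // smaller batch per RPC, per number 2
>>             try (ResultScanner scanner = table.getScanner(scan)) {
>>                 Result first = scanner.next(); // expect row00090000
>>                 System.out.println(Bytes.toString(first.getRow()));
>>             }
>>         }
>>     }
>> }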
>> 
>> So running a major compaction is the most effective way to resolve this
>> kind of issue.
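>> From the client it can be triggered via the Admin API, e.g. (a sketch;
>> the “files” table name is again a placeholder):
>> 
>> import java.io.IOException;
>> import org.apache.hadoop.hbase.TableName;
>> import org.apache.hadoop.hbase.client.Admin;
>> import org.apache.hadoop.hbase.client.Connection;
>> import org.apache.hadoop.hbase.client.ConnectionFactory;
>> 
>> public class CompactFiles {
>>     public static void main(String[] args) throws IOException {
>>         try (Connection conn = ConnectionFactory.createConnection();
>>              Admin admin = conn.getAdmin()) {
>>             // Rewrites the store files, dropping delete markers together
>>             // with the cells they cover, so later scans stop stepping
>>             // over them. Note majorCompact() is asynchronous; it only
>>             // queues the request and returns right away.
>>             admin.majorCompact(TableName.valueOf("files"));
>>         }
>>     }
>> }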
>> 
>> I still have some doubts and hope someone can clear them up.
>> First, I am not very sure about the scan process of the “delete scenario”
>> I described in number 3.
>> Second, the block cache seems to have little effect on this scenario.
>> 
>> P.S. I did not attach my test results because I am afraid they would
>> confuse others. I will clean them up and share them if necessary.
>> 
>> Br, Great Soul
>> [email protected]
>> 
