This all very much makes sense and matches my current understanding of
things.

So, basically it's expensive to increment old data.

Thanks for the time and the detailed answer.



On 6/18/11 9:24 PM, Andrew Purtell wrote:
> This is from memory, but I expect someone will chime in if any detail is 
> inaccurate. :-)
>
> If the blocks containing the values you are updating fit into blockcache then 
> read IOPS are avoided: reads are satisfied from cache, not disk. Evictions from 
> blockcache are done on an LRU basis. (Packing related frequently accessed 
> items into a block via multiple columns and/or appropriate row key 
> construction is therefore advisable.)
>
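
A toy model of the packing advice above, not HBase code: the class and function names are made up for illustration, and a "block" is just a string id. It counts simulated disk reads for an LRU cache when four hot counters live in four different blocks versus one shared block.

```python
from collections import OrderedDict


class LruBlockCache:
    """Toy LRU cache keyed by block id; counts simulated disk reads."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()
        self.disk_reads = 0

    def read(self, block_id):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)  # mark as most recently used
            return
        self.disk_reads += 1  # cache miss: the block must come from disk
        self.blocks[block_id] = True
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict the least recently used


def disk_reads_for(accesses, block_of, capacity):
    """Replay accesses, mapping each key to a block id with block_of."""
    cache = LruBlockCache(capacity)
    for key in accesses:
        cache.read(block_of(key))
    return cache.disk_reads


# Four hot counters touched round-robin, cache holds only two blocks.
accesses = ["a", "b", "c", "d"] * 25
scattered = disk_reads_for(accesses, lambda k: "block-" + k, capacity=2)
packed = disk_reads_for(accesses, lambda k: "block-hot", capacity=2)
print(scattered, packed)  # 100 1
```

With the counters scattered, every access misses because the working set (four blocks) exceeds the cache (two blocks); packed into one block, only the first access goes to disk.
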
> When seeking for a value HBase examines storefiles from the most recent 
> backwards and stops as soon as the required number of versions, including 
> delete markers, have been found. Updates only care about the most recent 
> version. The most recent versions of frequently updated values are likely to 
> be found early in the search: the more frequent the updates, the earlier.
>
> By enabling bloom filters you can also avoid reads of storefiles known not to 
> contain the value(s) as HBase searches backwards in time. This is because the 
> bloom filters are held in blockcache, and will be frequently accessed for 
> various queries, so are likely to remain resident even if the index or data 
> blocks of the storefiles in question were evicted.
>
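
A minimal sketch of the newest-first search with filter-based skipping described above. This is a toy model, not HBase code: `StoreFile`, `latest_value` and `count_reads` are hypothetical names, and an exact Python set stands in for the bloom filter (a real bloom filter can return false positives, so it would occasionally read a file that does not hold the row).

```python
class StoreFile:
    """Toy storefile: row -> value, plus a per-file membership filter."""

    def __init__(self, cells):
        self.cells = dict(cells)
        # Stand-in for a bloom filter: an exact set, so no false positives.
        self.row_filter = set(self.cells)
        self.reads = 0

    def get(self, row):
        self.reads += 1  # one simulated seek + read of this file
        return self.cells.get(row)


def latest_value(storefiles_newest_first, row, use_filter):
    """Return the newest value for row, scanning newest to oldest."""
    for sf in storefiles_newest_first:
        if use_filter and row not in sf.row_filter:
            continue  # filter says "definitely absent": skip the file
        value = sf.get(row)
        if value is not None:
            return value  # an update only needs the most recent version
    return None


def count_reads(row, use_filter):
    files = [
        StoreFile({"hot": 42}),             # newest
        StoreFile({"other": 7}),
        StoreFile({"cold": 3, "hot": 41}),  # oldest
    ]
    value = latest_value(files, row, use_filter)
    return value, sum(sf.reads for sf in files)


hot_plain = count_reads("hot", use_filter=False)      # (42, 1)
cold_plain = count_reads("cold", use_filter=False)    # (3, 3)
cold_filtered = count_reads("cold", use_filter=True)  # (3, 1)
```

The recently updated row is found in the first file either way; the old row costs one read per storefile unless the filter lets the search skip the files that cannot contain it.
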
> Once you update a value then the most recent version will be held in MemStore 
> until flush. If the value is updated with an Increment again before the 
> flush, read-update-replace then happens in memory. Increments do not add to 
> the size of the MemStore so if you are only incrementing existing values then 
> you will not trigger flushing. (Here however we can make improvements to 
> HBase to reduce unnecessary overheads in the current code. There are a couple 
> of JIRAs open to this effect. For example, to my understanding, Facebook runs 
> with a local patch that introduces mutable KeyValues for the MemStore, 
> eliminating some unnecessary data structure management overheads.)
>
> Unless you explicitly ask HBase to do otherwise -- by .setWriteToWAL(false) 
> -- then there will be a write to and sync of the write ahead log (WAL) for 
> every commit, every Increment. The RegionServer's RPC handler for the current 
> client operation will block until HDFS acknowledges the write at all 
> DataNodes in the write pipeline, typically configured for 3 replicas. Note 
> that here HBase will do per-row group commit, so you can amortize this cost 
> over multiple updates if you can group them. Put values that are often 
> updated together into the same row.
>
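
A back-of-the-envelope sketch of the grouping advice above, assuming a simplified cost model of one WAL sync per row-level operation. The function name and the sample rows are hypothetical; this only counts syncs and does not call HBase.

```python
def wal_syncs(updates, group_by_row):
    """Count WAL syncs for a list of (row, column, delta) updates.

    Assumed cost model: one sync per operation, where grouping folds
    all updates to the same row into a single operation.
    """
    if not group_by_row:
        return len(updates)
    return len({row for row, _column, _delta in updates})


updates = [
    ("page:home", "views", 1),
    ("page:home", "clicks", 1),
    ("page:home", "uniques", 1),
    ("page:about", "views", 1),
]
ungrouped = wal_syncs(updates, group_by_row=False)  # 4
grouped = wal_syncs(updates, group_by_row=True)     # 2
```

Under this model, keeping the three "page:home" counters in one row turns three syncs into one.
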
> Also note that HBase users typically run on top of an HDFS that has been 
> patched with HDFS-895, so HDFS can concurrently service new WAL writes while 
> syncs of others are in progress.
>
> <wizard>
> Furthermore, the WAL is by default configured to sync at every commit but can 
> also be configured to sync after N commits, e.g. 100 or 1000; or after S 
> seconds, e.g. 1 or 10; whichever comes first, according to the loss window 
> (upon RegionServer failure) your use case can tolerate. So in addition to 
> taking advantage of group commit you can amortize sync overhead further with 
> the tradeoff that under failure conditions your counters (or other data) may 
> become imprecise. For some use cases that is fine.
> </wizard>
>
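
A toy model of the "sync after N commits or S seconds, whichever comes first" policy described above. `DeferredSyncLog` is a hypothetical class, not an HBase API, and it simplifies in one important way: it only checks the time limit when a commit arrives, whereas a real implementation would also need a background timer.

```python
class DeferredSyncLog:
    """Toy write-ahead log that syncs after max_commits or max_seconds."""

    def __init__(self, max_commits, max_seconds):
        self.max_commits = max_commits
        self.max_seconds = max_seconds
        self.pending = 0      # commits appended but not yet synced
        self.last_sync = 0.0  # time of the last sync, in seconds
        self.syncs = 0

    def append(self, now):
        """Record one commit at time `now`; sync if either limit is hit."""
        self.pending += 1
        if (self.pending >= self.max_commits
                or now - self.last_sync >= self.max_seconds):
            self.syncs += 1
            self.pending = 0
            self.last_sync = now


# Burst: 1050 commits, one per millisecond. The commit limit dominates.
burst = DeferredSyncLog(max_commits=100, max_seconds=1.0)
for i in range(1, 1051):
    burst.append(i / 1000.0)

# Trickle: one commit every 2 seconds. The time limit dominates.
trickle = DeferredSyncLog(max_commits=100, max_seconds=1.0)
for i in range(1, 6):
    trickle.append(i * 2.0)

print(burst.syncs, burst.pending)  # 10 50
print(trickle.syncs)               # 5
```

The 50 pending commits at the end of the burst are the loss window: they are what would be lost if the RegionServer failed at that moment.
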
>    - Andy
>
>
> --- On Sat, 6/18/11, Claudio Martella <claudio.marte...@tis.bz.it> wrote:
>
>> From: Claudio Martella <claudio.marte...@tis.bz.it>
>> Subject: on the impact of incremental counters
>> To: user@hbase.apache.org
>> Date: Saturday, June 18, 2011, 9:00 AM
>> Hello list,
>>
>> A few days ago I was at SIGMOD and was happy to attend
>> Facebook's talk on HBase.
>>
>> As far as I could understand, their workflow makes heavy use of incremental
>> counters for analytics, and so does mine. As I understand it, the cost of
>> incrementing a counter is 2 * N + 1 IOPS, where N is the number of
>> sequence files over which my dataset is spread: 2 because you have a seek
>> AND a read, and the final 1 comes from the write to the append-log.
>> As that looks like an expensive operation, I was wondering whether I am
>> missing something and what the strategies are to alleviate such a cost
>> (apart from bloom filters).
>>
>> Thanks!
>>
>> Claudio
>>
>> -- 
>>
>> Claudio Martella
>> Digital Technologies
>> Unit Research & Development - Analyst
>>
>> TIS innovation park
>> Via Siemens 19 | Siemensstr. 19
>> 39100 Bolzano | 39100 Bozen
>> Tel. +39 0471 068 123
>> Fax  +39 0471 068 129
>> claudio.marte...@tis.bz.it
>> http://www.tis.bz.it
>>
>> Short information regarding use of personal data. According
>> to Section 13 of Italian Legislative Decree no. 196 of 30
>> June 2003, we inform you that we process your personal data
>> in order to fulfil contractual and fiscal obligations and
>> also to send you information regarding our services and
>> events. Your personal data are processed with and without
>> electronic means and by respecting data subjects' rights,
>> fundamental freedoms and dignity, particularly with regard
>> to confidentiality, personal identity and the right to
>> personal data protection. At any time and without
>> formalities you can write an e-mail to priv...@tis.bz.it
>> in order to object the processing of your personal data for
>> the purpose of sending advertising materials and also to
>> exercise the right to access personal data and other rights
>> referred to in Section 7 of Decree 196/2003. The data
>> controller is TIS Techno Innovation Alto Adige, Siemens
>> Street n. 19, Bolzano. You can find the complete information
>> on the web site www.tis.bz.it.


-- 
Claudio Martella
Digital Technologies
Unit Research & Development - Analyst

TIS innovation park
Via Siemens 19 | Siemensstr. 19
39100 Bolzano | 39100 Bozen
Tel. +39 0471 068 123
Fax  +39 0471 068 129
claudio.marte...@tis.bz.it http://www.tis.bz.it