If it's possible to make the timestamp a suffix of your rowkey (assuming the rowkey is composite), then you would not run into read/write hotspots. Have a look at the OpenTSDB data model, which scales really, really well.

Sent from my iPhone
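For illustration, a minimal sketch of that composite-key layout, assuming a hypothetical "metrics" table with a "d" column family and fixed-width sensor ids (none of these names come from the thread): the entity id leads the key so writes spread across regions, and the timestamp trails so one entity's readings stay contiguous and time-ordered.

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CompositeKeyWrite {
      public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("metrics"))) { // hypothetical table
          String sensorId = "sensor-0042";      // leading, fixed-width entity id spreads writes
          long ts = System.currentTimeMillis(); // trailing timestamp keeps one entity time-ordered
          byte[] rowKey = Bytes.add(Bytes.toBytes(sensorId), Bytes.toBytes(ts));
          Put put = new Put(rowKey);
          put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("v"), Bytes.toBytes(21.5d));
          table.put(put);
        }
      }
    }

A scan for one sensor over a time window then becomes a single start/stop-row range on that sensor's prefix, which is essentially how OpenTSDB lays out its keys (metric id first, base timestamp after).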
> On Feb 21, 2016, at 10:28 AM, Stephen Durfey <[email protected]> wrote:
>
> I personally don't deal with time series data, so I'm not going to make a
> statement on which is better. I would think from a scanning viewpoint putting
> the timestamp in the row key is easier, but that will introduce scanning
> performance bottlenecks due to the row keys being stored lexicographically.
> All data from the same date range will end up in the same region or regions
> (this causes hot spots), reducing the number of tasks you get for reads and
> thus increasing extraction time.
>
> One method to deal with this is salting your row keys to get an even
> distribution of data around the cluster. Cloudera recently had a good post
> about this on their blog:
> http://blog.cloudera.com/blog/2015/06/how-to-scan-salted-apache-hbase-tables-with-region-specific-key-ranges-in-mapreduce/
>
> On Sun, Feb 21, 2016 at 9:47 AM -0800, "Daniel" <[email protected]> wrote:
>
> Thanks for your sharing, Stephen and Ted. The reference guide recommends
> "rows" over "versions" concerning time series data. Are there advantages of
> using "reversed timestamps" in row keys over the built-in "versions" with
> regard to scanning performance?
>
> ------------------ Original ------------------
> From: "Ted Yu"
> Date: Mon, Feb 22, 2016 01:02 AM
> To: "[email protected]";
> Subject: Re: Two questions about the maximum number of versions of a column family
>
> Thanks for sharing, Stephen.
>
> bq. scan performance on the region servers needing to scan over all that
> data you may not need
>
> When the number of versions is large, try to utilize Filters (where
> appropriate) which implement:
>
>   public Cell getNextCellHint(Cell currentKV) {
>
> See MultiRowRangeFilter for an example.
>
> Please see hbase-shell/src/main/ruby/shell/commands/alter.rb for the syntax on
> how to alter a table. When "hbase.online.schema.update.enable" is true, the
> table can stay online during the change.
>
> Cheers
>
>> On Sun, Feb 21, 2016 at 8:20 AM, Stephen Durfey wrote:
>>
>> Someone please correct me if I am wrong.
>>
>> I've looked into this recently due to some performance reasons with my
>> tables in a production environment. Like the book says, I don't recommend
>> keeping this many versions around unless you really need them. Telling
>> HBase to keep around a very large number doesn't waste space; that's just a
>> value in the table descriptor. So, I wouldn't worry about that. The
>> problems are going to come in when you actually write out those versions.
>>
>> My tables currently have max_versions set, and roughly 40% of the table
>> data is due to historical versions. So, one table in particular is around 25
>> TB. I don't have a need to keep this many versions, so I am working on
>> changing the max versions to the default of 3 (some cells are hundreds or
>> thousands of versions deep). The issue you'll run into is scan performance on
>> the region servers needing to scan over all that data you may not need (due
>> to large store files). This could lead to increased scan time and
>> potentially scanner timeouts, depending upon how large your batch size is
>> set on the scan.
>>
>> I assume this has some performance impact on compactions, both minor and
>> major, but I didn't investigate that, and potentially on the write path,
>> but that is also not something I looked into.
>>
>> Changing the number of versions after the table has been created doesn't
>> have a performance impact, since it is just a metadata change. The table
>> will need to be disabled, changed, and re-enabled again. If this is done
>> through a script, the table could be offline for a couple of seconds. The
>> only concern around that is users of the table: if they have scheduled jobs
>> that hit the table, those would break if they try to read from it while
>> the table is disabled. The only performance impact I can think of around
>> this change would be major compaction of the table, but even that shouldn't
>> be an issue.
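As a rough illustration of the metadata-only change Stephen and Ted describe, here is a sketch that drops a family's max versions back to the default of 3 through the Java Admin API. The table name "my_table" and family "d" are invented for the example; the shell's alter command from alter.rb accomplishes the same thing.

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ReduceMaxVersions {
      public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
          TableName tn = TableName.valueOf("my_table");                  // hypothetical table
          HTableDescriptor desc = admin.getTableDescriptor(tn);
          HColumnDescriptor family = desc.getFamily(Bytes.toBytes("d")); // hypothetical family
          family.setMaxVersions(3);             // back to the default; only a descriptor change
          // If hbase.online.schema.update.enable is false, wrap the modify in
          // admin.disableTable(tn); ... admin.enableTable(tn); as described above.
          admin.modifyTable(tn, desc);
        }
      }
    }

Note that versions beyond the new limit are only physically dropped when compactions rewrite the store files, which matches the point above that major compaction is the main follow-on cost of the change.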
>>
>> _____________________________
>> From: Daniel
>> Sent: Sunday, February 21, 2016 9:22 AM
>> Subject: Two questions about the maximum number of versions of a column family
>> To: user
>>
>> Hi, I have two questions about the maximum number of versions of a column
>> family:
>>
>> (1) Is it OK to set a very large (>100,000) maximum number of versions for
>> a column family?
>>
>> The reference guide says "It is not recommended setting the number of max
>> versions to an exceedingly high level (e.g., hundreds or more) unless those
>> old values are very dear to you because this will greatly increase
>> StoreFile size." (Chapter 36.1)
>>
>> I'm new to the Hadoop ecosystem, and have no idea about the consequences
>> of a very large StoreFile size.
>>
>> Furthermore, is it OK to set a large maximum number of versions but insert
>> only a few versions? Does it waste space?
>>
>> (2) How much performance overhead does it cause to increase the maximum
>> number of versions of a column family after enormous numbers (e.g. billions)
>> of rows have been inserted?
>>
>> Regards,
>>
>> Daniel
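Finally, a sketch of the read-side advice from the thread: a scan that limits the versions returned and uses MultiRowRangeFilter (which implements getNextCellHint, letting the scanner seek past rows outside the requested ranges), plus a batch size so very wide, deeply versioned rows don't cause oversized RPCs or scanner timeouts. The table and row keys are invented for the example.

    import java.util.Arrays;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.filter.MultiRowRangeFilter;
    import org.apache.hadoop.hbase.filter.MultiRowRangeFilter.RowRange;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RangedVersionScan {
      public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("metrics"))) { // hypothetical table
          MultiRowRangeFilter ranges = new MultiRowRangeFilter(Arrays.asList(
              new RowRange(Bytes.toBytes("sensor-0042"), true, Bytes.toBytes("sensor-0043"), false),
              new RowRange(Bytes.toBytes("sensor-0100"), true, Bytes.toBytes("sensor-0101"), false)));
          Scan scan = new Scan();
          scan.setFilter(ranges);  // seeks past everything outside the two ranges
          scan.setMaxVersions(3);  // return at most 3 versions per cell
          scan.setBatch(100);      // cap cells per Result so wide rows don't blow up one RPC
          scan.setCaching(500);    // number of Results fetched per RPC
          try (ResultScanner scanner = table.getScanner(scan)) {
            for (Result r : scanner) {
              System.out.println(Bytes.toString(r.getRow()));
            }
          }
        }
      }
    }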
