If it's possible to make the timestamp a suffix of your rowkey (assuming the rowkey is composite), then you would not run into read/write hotspots. Have a look at the OpenTSDB data model, which scales really, really well.

Sent from my iPhone
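For illustration, a minimal sketch of that composite-key layout, assuming a hypothetical "metrics" table with a "d" column family and fixed-width sensor ids (none of these names come from the thread): the entity id leads the key so writes spread across regions, and the timestamp trails so one entity's readings stay contiguous and time-ordered.

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CompositeKeyWrite {
      public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("metrics"))) { // hypothetical table
          String sensorId = "sensor-0042";      // leading, fixed-width entity id spreads writes
          long ts = System.currentTimeMillis(); // trailing timestamp keeps one entity time-ordered
          byte[] rowKey = Bytes.add(Bytes.toBytes(sensorId), Bytes.toBytes(ts));
          Put put = new Put(rowKey);
          put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("v"), Bytes.toBytes(21.5d));
          table.put(put);
        }
      }
    }

A scan for one sensor over a time window then becomes a single start/stop-row range on that sensor's prefix, which is essentially how OpenTSDB lays out its keys (metric id first, base timestamp after).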
> On Feb 21, 2016, at 10:28 AM, Stephen Durfey <[email protected]> wrote:
>
> I personally don't deal with time series data, so I'm not going to make a
> statement on which is better. I would think from a scanning viewpoint putting
> the timestamp in the row key is easier, but that will introduce scanning
> performance bottlenecks due to the row keys being stored lexicographically.
> All data from the same date range will end up in the same region or regions
> (this causes hot spots), reducing the number of tasks you get for reads and
> thus increasing extraction time.
>
> One method to deal with this is salting your row keys to get an even
> distribution of data around the cluster. Cloudera recently had a good post
> about this on their blog:
> http://blog.cloudera.com/blog/2015/06/how-to-scan-salted-apache-hbase-tables-with-region-specific-key-ranges-in-mapreduce/
>
> On Sun, Feb 21, 2016 at 9:47 AM -0800, "Daniel" <[email protected]> wrote:
>
> Thanks for your sharing, Stephen and Ted. The reference guide recommends
> "rows" over "versions" concerning time series data. Are there advantages of
> using "reversed timestamps" in row keys over the built-in "versions" with
> regard to scanning performance?
>
> ------------------ Original ------------------
> From: "Ted Yu"
> Date: Mon, Feb 22, 2016 01:02 AM
> To: "[email protected]";
> Subject: Re: Two questions about the maximum number of versions of a column family
>
> Thanks for sharing, Stephen.
>
> bq. scan performance on the region servers needing to scan over all that
> data you may not need
>
> When the number of versions is large, try to utilize Filters (where
> appropriate) which implement:
>
>   public Cell getNextCellHint(Cell currentKV) {
>
> See MultiRowRangeFilter for an example.
>
> Please see hbase-shell/src/main/ruby/shell/commands/alter.rb for the syntax on
> how to alter a table. When "hbase.online.schema.update.enable" is true, the
> table can stay online during the change.
>
> Cheers
>
>> On Sun, Feb 21, 2016 at 8:20 AM, Stephen Durfey wrote:
>>
>> Someone please correct me if I am wrong.
>>
>> I've looked into this recently due to some performance reasons with my
>> tables in a production environment. Like the book says, I don't recommend
>> keeping this many versions around unless you really need them. Telling
>> HBase to keep around a very large number doesn't waste space; that's just a
>> value in the table descriptor. So, I wouldn't worry about that. The
>> problems are going to come in when you actually write out those versions.
>>
>> My tables currently have max_versions set, and roughly 40% of the table
>> data is due to historical versions. So, one table in particular is around 25
>> TB. I don't have a need to keep this many versions, so I am working on
>> changing the max versions to the default of 3 (some cells are hundreds or
>> thousands of versions deep). The issue you'll run into is scan performance on
>> the region servers needing to scan over all that data you may not need (due
>> to large store files). This could lead to increased scan time and
>> potentially scanner timeouts, depending upon how large your batch size is
>> set on the scan.
>>
>> I assume this has some performance impact on compactions, both minor and
>> major, but I didn't investigate that, and potentially on the write path,
>> but that is also not something I looked into.
>>
>> Changing the number of versions after the table has been created doesn't
>> have a performance impact, since it is just a metadata change. The table
>> will need to be disabled, changed, and re-enabled again. If this is done
>> through a script, the table could be offline for a couple of seconds. The
>> only concern around that is users of the table: if they have scheduled jobs
>> that hit the table, those would break if they try to read from it while
>> the table is disabled. The only performance impact I can think of around
>> this change would be major compaction of the table, but even that shouldn't
>> be an issue.
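As a rough illustration of the metadata-only change Stephen and Ted describe, here is a sketch that drops a family's max versions back to the default of 3 through the Java Admin API. The table name "my_table" and family "d" are invented for the example; the shell's alter command from alter.rb accomplishes the same thing.

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ReduceMaxVersions {
      public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
          TableName tn = TableName.valueOf("my_table");                  // hypothetical table
          HTableDescriptor desc = admin.getTableDescriptor(tn);
          HColumnDescriptor family = desc.getFamily(Bytes.toBytes("d")); // hypothetical family
          family.setMaxVersions(3);             // back to the default; only a descriptor change
          // If hbase.online.schema.update.enable is false, wrap the modify in
          // admin.disableTable(tn); ... admin.enableTable(tn); as described above.
          admin.modifyTable(tn, desc);
        }
      }
    }

Note that versions beyond the new limit are only physically dropped when compactions rewrite the store files, which matches the point above that major compaction is the main follow-on cost of the change.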
>>
>> _____________________________
>> From: Daniel
>> Sent: Sunday, February 21, 2016 9:22 AM
>> Subject: Two questions about the maximum number of versions of a column family
>> To: user
>>
>> Hi, I have two questions about the maximum number of versions of a column
>> family:
>>
>> (1) Is it OK to set a very large (>100,000) maximum number of versions for
>> a column family?
>>
>> The reference guide says "It is not recommended setting the number of max
>> versions to an exceedingly high level (e.g., hundreds or more) unless those
>> old values are very dear to you because this will greatly increase
>> StoreFile size." (Chapter 36.1)
>>
>> I'm new to the Hadoop ecosystem, and have no idea about the consequences
>> of a very large StoreFile size.
>>
>> Furthermore, is it OK to set a large maximum number of versions but insert
>> only a few versions? Does it waste space?
>>
>> (2) How much performance overhead does it cause to increase the maximum
>> number of versions of a column family after enormous numbers (e.g. billions)
>> of rows have been inserted?
>>
>> Regards,
>>
>> Daniel
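Finally, a sketch of the read-side advice from the thread: a scan that limits the versions returned and uses MultiRowRangeFilter (which implements getNextCellHint, letting the scanner seek past rows outside the requested ranges), plus a batch size so very wide, deeply versioned rows don't cause oversized RPCs or scanner timeouts. The table and row keys are invented for the example.

    import java.util.Arrays;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.filter.MultiRowRangeFilter;
    import org.apache.hadoop.hbase.filter.MultiRowRangeFilter.RowRange;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RangedVersionScan {
      public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("metrics"))) { // hypothetical table
          MultiRowRangeFilter ranges = new MultiRowRangeFilter(Arrays.asList(
              new RowRange(Bytes.toBytes("sensor-0042"), true, Bytes.toBytes("sensor-0043"), false),
              new RowRange(Bytes.toBytes("sensor-0100"), true, Bytes.toBytes("sensor-0101"), false)));
          Scan scan = new Scan();
          scan.setFilter(ranges);  // seeks past everything outside the two ranges
          scan.setMaxVersions(3);  // return at most 3 versions per cell
          scan.setBatch(100);      // cap cells per Result so wide rows don't blow up one RPC
          scan.setCaching(500);    // number of Results fetched per RPC
          try (ResultScanner scanner = table.getScanner(scan)) {
            for (Result r : scanner) {
              System.out.println(Bytes.toString(r.getRow()));
            }
          }
        }
      }
    }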
