Hi, Anil: Thanks for the feedback! I'll proceed with the non-short column-naming. It's good to have some feedback from real-world, production cases.
Thanks again,
- Ken

On Sat, Jun 11, 2016 at 2:47 PM anil gupta <[email protected]> wrote:

> My 2 cents:
>
> #1. HBase version timestamp is purely used for storing & purging
> historical data on the basis of TTL. If you try to build an app that
> toys around with timestamps, you might run into issues. So, you might
> need to be very careful with that.
>
> #2. Usually HBase suggests keeping column names to around 5-6 chars
> because HBase stores data as KVs. But it's hard to keep doing that in
> **real world apps**. When you use block encoding/compression, the
> performance penalty of wide columns is reduced. For example, Apache
> Phoenix uses Fast_Diff encoding by default due to non-short column
> names.
> Here is another blog post that discusses the perf of
> encoding/compression:
> http://hadoop-hbase.blogspot.com/2016/02/hbase-compression-vs-blockencoding_17.html
> I have been using user-friendly column names (more readable rather
> than short abbreviations) and I still get decent performance in my
> apps. (Obviously, YMMV. My apps are performing within our SLA.)
> In prod, I have a table that has 1100+ columns, and the column names
> are not short.
> Hence, I would recommend you go ahead with your non-short column
> naming. You might need to try out different encoding/compression to
> see what provides you the best performance.
>
> HTH,
> Anil Gupta
>
> On Fri, Jun 10, 2016 at 8:16 PM, Ken Hampson <[email protected]> wrote:
>
> > I realize that was probably a bit of a wall of text... =)
> >
> > So, TL;DR: I'm wondering:
> > 1) If people have used and had good experiences with caller-specified
> > version timestamps (esp. given the caveats in the HBase book doc re:
> > issues with deletions and TTLs).
> >
> > 2) About suggestions for optimal column naming for potentially large
> > numbers of different column groupings for very wide tables.
> >
> > Thanks,
> > - Ken
> >
> > On Tue, Jun 7, 2016 at 10:52 PM Ken Hampson <[email protected]> wrote:
> >
> > > Hi:
> > >
> > > I'm currently using HBase 1.1.2 and am in the process of determining
> > > how best to proceed with the column layout for an upcoming expansion
> > > of our data pipeline.
> > >
> > > Background:
> > >
> > > Table A: billions of rows, 1.3 TB (with snappy compression), rowkey
> > > is sha1
> > > Table B: billions of rows (more than Table A), 1.8 TB (with snappy
> > > compression), rowkey is sha1
> > >
> > > These tables represent data obtained via a combination
> > > batch/streaming process. We want to expand our data pipeline to run
> > > an assortment of analyses on these tables (both batch and streaming)
> > > and be able to store the results in each table as appropriate. Table
> > > A is a set of unique entries with some example data, whereas Table B
> > > is correlated to Table A (via Table A's sha1), but is not
> > > de-duplicated (that is to say, it contains contextual data).
> > >
> > > For the expansion of the data pipeline, we want to store the data
> > > either in Table A if context is not needed, or in Table B if context
> > > is needed. We have a theoretically unlimited number of different
> > > analyses that we may want to perform and store the results for (that
> > > is to say, I need to assume there will be a substantial number of
> > > data sets that need to be stored in these tables, which will grow
> > > over time and could each themselves potentially be somewhat wide in
> > > terms of columns).
> > >
> > > Originally, I had considered storing these in column families, where
> > > each analysis is grouped together in a different column family.
> > > However, I have read in the HBase book documentation that HBase does
> > > not perform well with many column families (a few default, ~10 max),
> > > so I have discarded this option.
> > >
> > > The next two options both involve using wide tables with many
> > > columns in a separate column family (e.g. "d"), where all the
> > > various analyses would be grouped into the same family in a
> > > potentially large number of columns in total. Each of these analyses
> > > needs to maintain its own versions so we can correlate the data from
> > > each one. The variants which come to mind to accomplish that, and on
> > > which I would appreciate some feedback, are:
> > >
> > >    1. Use HBase's native versioning to store the version of the
> > >    analysis
> > >    2. Encode a version in the column name itself
> > >
> > > I know the HBase native versions use the server's timestamp by
> > > default, but can take any long value. So we could assign a
> > > particular time value to be a version of a particular analysis.
> > > However, the doc also warned that there could be negative
> > > ramifications of this because HBase uses the versions internally for
> > > things like TTL for deletes/maintenance. Do people use versions in
> > > this way? Are the TTL issues of great concern? (We likely won't be
> > > deleting things often from the tables, but can't guarantee that we
> > > won't ever do so).
> > >
> > > Encoding a version in the column name itself would make the column
> > > names bigger, and I know it's encouraged for column names to be as
> > > small as possible.
> > >
> > > Adjacent to the native-version-or-not question, there's the general
> > > column naming. I was originally thinking maybe having a prefix
> > > followed by the column name, optionally with the version in the
> > > middle depending on whether 1 or 2 is chosen above. This would allow
> > > prefix filters to be used during gets/scans to gather all columns
> > > for a given analysis type, etc. but it would perhaps result in
> > > larger column names across billions of rows.
> > >
> > > e.g. *analysisfoo_4_column1*
> > >
> > > In practice, is this done and can it perform well? Or is it better
> > > to pick a fixed width and use some number in its place, that's then
> > > translated via, say, another table?
> > >
> > > e.g. *100000_1000_100000* (or something to that effect -- fixed
> > > width numbers that are stand-in ids for potentially longer
> > > descriptions).
> > >
> > > Thanks,
> > > - Ken
>
> --
> Thanks & Regards,
> Anil Gupta
