I realize that was probably a bit of a wall of text... =)

So, TL;DR: I'm wondering:

1) Whether people have used caller-specified version timestamps and had good experiences with them (esp. given the caveats in the HBase book doc re: issues with deletions and TTLs).
2) Whether there are suggestions for optimal column naming for potentially large numbers of different column groupings in very wide tables.

Thanks,
- Ken

On Tue, Jun 7, 2016 at 10:52 PM Ken Hampson <[email protected]> wrote:

> Hi:
>
> I'm currently using HBase 1.1.2 and am in the process of determining how
> best to proceed with the column layout for an upcoming expansion of our
> data pipeline.
>
> Background:
>
> Table A: billions of rows, 1.3 TB (with snappy compression), rowkey is sha1
> Table B: billions of rows (more than Table A), 1.8 TB (with snappy
> compression), rowkey is sha1
>
> These tables represent data obtained via a combination batch/streaming
> process. We want to expand our data pipeline to run an assortment of
> analyses on these tables (both batch and streaming) and be able to store
> the results in each table as appropriate. Table A is a set of unique
> entries with some example data, whereas Table B is correlated to Table A
> (via Table A's sha1), but is not de-duplicated (that is to say, it contains
> contextual data).
>
> For the expansion of the data pipeline, we want to store the data either
> in Table A if context is not needed, or in Table B if context is needed.
> We have a theoretically unlimited number of different analyses that we may
> want to perform and store the results for (that is to say, I need to
> assume there will be a substantial number of data sets that need to be
> stored in these tables, which will grow over time and could each themselves
> potentially be somewhat wide in terms of columns).
>
> Originally, I had considered storing these in column families, where each
> analysis is grouped together in a different column family. However, I have
> read in the HBase book documentation that HBase does not perform well with
> many column families (a few default, ~10 max), so I have discarded this
> option.
>
> The next two options both involve using wide tables with many columns in a
> separate column family (e.g. "d"), where all the various analyses would be
> grouped into the same family, across a potentially large number of columns
> in total. Each of these analyses needs to maintain its own versions so we
> can correlate the data from each one. The variants which come to mind to
> accomplish that, and on which I would appreciate some feedback, are:
>
> 1. Use HBase's native versioning to store the version of the analysis
> 2. Encode a version in the column name itself
>
> I know the HBase native versions use the server's timestamp by default,
> but can take any long value. So we could assign a particular time value to
> be a version of a particular analysis. However, the doc also warned that
> there could be negative ramifications of this because HBase uses the
> versions internally for things like TTL for deletes/maintenance. Do people
> use versions in this way? Are the TTL issues of great concern? (We likely
> won't be deleting things often from the tables, but can't guarantee that we
> won't ever do so).
>
> Encoding a version in the column name itself would make the column names
> bigger, and I know it's encouraged for column names to be as small as
> possible.
>
> Adjacent to the native-version-or-not question, there's the general column
> naming. I was originally thinking maybe having a prefix followed by the
> column name, optionally with the version in the middle depending on whether
> 1 or 2 is chosen above. This would allow prefix filters to be used during
> gets/scans to gather all columns for a given analysis type, etc., but it
> would perhaps result in larger column names across billions of rows.
>
> e.g. *analysisfoo_4_column1*
>
> In practice, is this done and can it perform well? Or is it better to pick
> a fixed width and use some number in its place, which is then translated
> via, say, another table?
>
> e.g. *100000_1000_100000* (or something to that effect -- fixed width
> numbers that are stand-in ids for potentially longer descriptions).
>
> Thanks,
> - Ken
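
To make the two column-naming layouts from the quoted message a bit more concrete, here is a minimal Python sketch. It does not touch HBase at all: the helper names, the 6/4/6 digit widths, and the numeric ids are all made up for illustration, and `columns_with_prefix` is only a client-side stand-in for what a column prefix filter would select. It just builds qualifiers in both layouts so their byte lengths and prefix behavior can be compared.

```python
def readable_qualifier(analysis: str, version: int, column: str) -> bytes:
    """Layout like analysisfoo_4_column1: readable, variable width."""
    return f"{analysis}_{version}_{column}".encode("utf-8")


def fixed_width_qualifier(analysis_id: int, version: int, column_id: int) -> bytes:
    """Layout like 100000_1000_100000: zero-padded numeric stand-in ids.

    The 6/4/6 digit widths are assumptions for this sketch; the ids would
    have to be translated back to descriptions via some separate lookup.
    """
    if not (0 <= analysis_id < 10**6 and 0 <= version < 10**4 and 0 <= column_id < 10**6):
        raise ValueError("id out of range for the assumed widths")
    return f"{analysis_id:06d}_{version:04d}_{column_id:06d}".encode("ascii")


def columns_with_prefix(qualifiers, prefix: bytes):
    """Client-side stand-in for a prefix match on column qualifiers."""
    return [q for q in qualifiers if q.startswith(prefix)]


readable = [
    readable_qualifier("analysisfoo", 4, "column1"),
    readable_qualifier("analysisfoo", 4, "column2"),
    readable_qualifier("analysisbar", 1, "column1"),
]
fixed = [
    fixed_width_qualifier(100000, 4, 1),
    fixed_width_qualifier(100000, 4, 2),
    fixed_width_qualifier(100001, 1, 1),
]

# Keep the trailing separator in the prefix so "analysisfoo_" cannot also
# match a longer analysis name such as "analysisfoobar".
foo_v4_readable = columns_with_prefix(readable, b"analysisfoo_4_")
foo_fixed = columns_with_prefix(fixed, b"100000_")

for q in readable + fixed:
    print(len(q), q)
```

With these particular made-up names, the ASCII fixed-width form is only a few bytes shorter than the readable one, so the saving depends heavily on how long the real analysis and column names turn out to be.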
