Hi: I'm currently using HBase 1.1.2 and am in the process of determining how best to proceed with the column layout for an upcoming expansion of our data pipeline.
Background:

Table A: billions of rows, 1.3 TB (with snappy compression), rowkey is sha1
Table B: billions of rows (more than Table A), 1.8 TB (with snappy compression), rowkey is sha1

These tables represent data obtained via a combination batch/streaming process. We want to expand our data pipeline to run an assortment of analyses on these tables (both batch and streaming) and be able to store the results in each table as appropriate. Table A is a set of unique entries with some example data, whereas Table B is correlated to Table A (via Table A's sha1) but is not de-duplicated (that is to say, it contains contextual data).

For the expansion of the data pipeline, we want to store the results in Table A if context is not needed, or in Table B if it is. Since there is a theoretically unlimited number of different analyses that we may want to perform and store the results for, I need to assume that a substantial number of data sets will be stored in these tables, that this number will grow over time, and that each data set could itself be somewhat wide in terms of columns.

Originally, I had considered storing these in column families, with each analysis grouped into a different column family. However, I have read in the HBase book documentation that HBase does not perform well with many column families (a few default, ~10 max), so I have discarded this option.

The next two options both involve using wide tables with many columns in a separate column family (e.g. "d"), where all the various analyses would be grouped into the same family, potentially across a large number of columns in total. Each of these analyses needs to maintain its own versions so we can correlate the data from each one. The variants that come to mind, and on which I would appreciate some feedback, are:

1. Use HBase's native versioning to store the version of the analysis
2. Encode a version in the column name itself

I know the HBase native versions use the server's timestamp by default, but can take any long value, so we could assign a particular time value to be a version of a particular analysis. However, the doc also warned that there could be negative ramifications of this because HBase uses the versions internally for things like TTL for deletes/maintenance. Do people use versions in this way? Are the TTL issues of great concern? (We likely won't be deleting things often from the tables, but can't guarantee that we won't ever do so.)

Encoding a version in the column name itself would make the column names bigger, and I know it's encouraged for column names to be as small as possible.

Adjacent to the native-version-or-not question, there's the general column naming. I was originally thinking of a prefix followed by the column name, optionally with the version in the middle depending on whether 1 or 2 is chosen above. This would allow prefix filters to be used during gets/scans to gather all columns for a given analysis type, but it would perhaps result in larger column names across billions of rows, e.g. *analysisfoo_4_column1*

In practice, is this done and can it perform well? Or is it better to pick a fixed width and use some number in its place that's then translated via, say, another table? e.g. *100000_1000_100000* (or something to that effect: fixed-width numbers that are stand-in ids for potentially longer descriptions).

Thanks,
- Ken
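
P.S. To make the two naming options concrete, here is a rough sketch of what I have in mind. It is plain Python and does not touch HBase at all; it only builds the qualifier bytes. The function names and the fixed-width layout (4-byte analysis id, 2-byte version, 4-byte column id) are made up for illustration, and the id-to-name lookup table is assumed to live elsewhere.

```python
import struct
from typing import Optional, Tuple

# Hypothetical fixed-width layout: 4-byte analysis id, 2-byte version,
# 4-byte column id. Big-endian and unsigned, with no padding (10 bytes).
_FIXED = struct.Struct(">IHI")


def readable_qualifier(analysis: str, version: int, column: str) -> bytes:
    """Readable variant, e.g. b'analysisfoo_4_column1'."""
    return f"{analysis}_{version}_{column}".encode("utf-8")


def readable_prefix(analysis: str, version: int) -> bytes:
    """Prefix for one analysis version. The trailing '_' keeps version 4
    from also matching version 40."""
    return f"{analysis}_{version}_".encode("utf-8")


def fixed_qualifier(analysis_id: int, version: int, column_id: int) -> bytes:
    """Fixed-width variant: numeric stand-in ids packed into 10 bytes."""
    return _FIXED.pack(analysis_id, version, column_id)


def fixed_prefix(analysis_id: int, version: Optional[int] = None) -> bytes:
    """Prefix covering all columns of an analysis, or of one version of it."""
    if version is None:
        return struct.pack(">I", analysis_id)
    return struct.pack(">IH", analysis_id, version)


def decode_fixed(qualifier: bytes) -> Tuple[int, int, int]:
    """Recover (analysis_id, version, column_id) from a fixed-width qualifier."""
    return _FIXED.unpack(qualifier)


if __name__ == "__main__":
    readable = readable_qualifier("analysisfoo", 4, "column1")
    fixed = fixed_qualifier(100000, 1000, 100000)
    print(len(readable), len(fixed))  # prints "21 10"
```

In both variants the bytes returned by the prefix helpers are what I would expect to hand to a column prefix filter (e.g. ColumnPrefixFilter) on a get/scan. The difference is roughly 21 bytes versus 10 bytes per cell for this example, at the cost of an extra lookup to make the fixed-width form readable.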
