There's no expected scaling issue with having each column qualifier in its own unique column family, regardless of how large the number of these becomes. I've ingested random data like this before for testing, and it works fine.
However, there may be an issue trying to create a very large number of locality groups. Locality groups are named, and you must explicitly configure them to store particular column families. That configuration is typically stored in ZooKeeper, and the configuration storage (in ZooKeeper, and/or in your conf/accumulo-site.xml file) does not scale as well as the data storage (HDFS) does. Where and how it will break is probably system-dependent and not directly known (at least, not by me). I would expect dozens, and possibly hundreds, of locality groups to work okay, but thousands seems like too many (though I haven't tried).

On Thu, Oct 19, 2017 at 6:47 PM Mohammad Kargar <[email protected]> wrote:

> That makes sense. So this means that there's no limit or concerns on
> having, potentially, large number of column families (holding only one
> column qualifier), right?
>
> On Thu, Oct 19, 2017 at 3:06 PM, Josh Elser <[email protected]> wrote:
>
>> Yup, that's the intended use case. You have the flexibility to determine
>> what column families make sense to group together. Your only "cost" in
>> changing your mind is the speed at which you can re-compact your data.
>>
>> There is one concern which comes to mind. Though making many locality
>> groups does increase the speed at which you can read from specific columns,
>> it decreases the speed at which you can read from _all_ columns. So, you
>> can do this trick to make Accumulo act more like a columnar database, but
>> beware that you're going to have an impact if you still have a use-case
>> where you read more than just one or two columns at a time.
>>
>> Does that make sense?
>>
>>
>> On 10/19/17 5:50 PM, Mohammad Kargar wrote:
>>
>>> AFAIK in Accumulo we can use "locality groups" to group sets of columns
>>> together on disk which would make it more like a column-oriented database.
>>> Considering that "locality groups" are per column family, I was wondering
>>> what if we treat column families like column qualifiers (creating one
>>> column family per each qualifier) and assigning each to a different
>>> locality group. This way all the data in a given column will be next to
>>> each other on disk which makes it easier for analytical applications to
>>> query the data.
>>>
>>> Any thoughts?
>>>
>>> Thanks,
>>> Mohammad
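To make the configuration concern concrete, here is a rough sketch (not Accumulo code, and it needs no Accumulo dependency) of the per-table properties that a "one locality group per column family" layout would generate. It assumes the property names table.group.<name> and table.groups.enabled, which is how I understand locality groups to be stored; check the documentation for your version. The lg_ group prefix and the col0, col1, ... family names are made up for illustration. In a real client you would normally pass an equivalent map to TableOperations.setLocalityGroups rather than build the properties by hand.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LocalityGroupSketch {

    // Builds the table properties for a layout with one locality group per
    // column family. Property names are assumed; group names are hypothetical.
    static Map<String, String> groupProperties(List<String> families) {
        Map<String, String> props = new LinkedHashMap<>();
        List<String> groupNames = new ArrayList<>();
        for (String family : families) {
            String group = "lg_" + family;              // hypothetical naming scheme
            groupNames.add(group);
            props.put("table.group." + group, family);  // one property per group
        }
        // A single property lists every enabled group, so it grows with the count.
        props.put("table.groups.enabled", String.join(",", groupNames));
        return props;
    }

    public static void main(String[] args) {
        for (int n : new int[] {10, 100, 1000}) {
            List<String> families = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                families.add("col" + i);                // hypothetical family names
            }
            Map<String, String> props = groupProperties(families);
            int chars = 0;
            for (Map.Entry<String, String> e : props.entrySet()) {
                chars += e.getKey().length() + e.getValue().length();
            }
            System.out.println(n + " families: " + props.size()
                + " properties, about " + chars + " characters of configuration");
        }
    }
}
```

The number of properties grows by one per column family, and the single "enabled" property keeps getting longer as well. All of that lives in the configuration store rather than in HDFS, which is why the group count, not the family count, is the part to watch.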
