"dozens, and possibly hundreds of locality groups" per table or per Accumulo instance?

On Thu, Oct 19, 2017 at 6:05 PM, Christopher <ctubb...@apache.org> wrote:

> There's no expected scaling issue with having each column qualifier in its
> own unique column family, regardless of how large the number of these
> becomes. I've ingested random data like this before for testing, and it
> works fine.
>
> However, there may be an issue trying to create a very large number of
> locality groups. Locality groups are named, and you must explicitly
> configure them to store particular column families. That configuration is
> typically stored in ZooKeeper, and the configuration storage (in ZooKeeper,
> and/or in your conf/accumulo-site.xml file) does not scale as well as the
> data storage (HDFS) does. Where, and how, it will break, is probably
> system-dependent and not directly known (at least, not known by me). I
> would expect dozens, and possibly hundreds, of locality groups to work
> okay, but thousands seems like it's too many (but I haven't tried).
>
>
> On Thu, Oct 19, 2017 at 6:47 PM Mohammad Kargar <mkar...@phemi.com> wrote:
>
>> That makes sense. So this means that there's no limit or concerns on
>> having, potentially, a large number of column families (holding only one
>> column qualifier), right?
>>
>> On Thu, Oct 19, 2017 at 3:06 PM, Josh Elser <els...@apache.org> wrote:
>>
>>> Yup, that's the intended use case. You have the flexibility to determine
>>> what column families make sense to group together. Your only "cost" in
>>> changing your mind is the speed at which you can re-compact your data.
>>>
>>> There is one concern which comes to mind. Though making many locality
>>> groups does increase the speed at which you can read from specific columns,
>>> it decreases the speed at which you can read from _all_ columns. So, you
>>> can do this trick to make Accumulo act more like a columnar database, but
>>> beware that you're going to have an impact if you still have a use-case
>>> where you read more than just one or two columns at a time.
>>>
>>> Does that make sense?
>>>
>>>
>>> On 10/19/17 5:50 PM, Mohammad Kargar wrote:
>>>
>>>> AFAIK in Accumulo we can use "locality groups" to group sets of columns
>>>> together on disk which would make it more like a column-oriented database.
>>>> Considering that "locality groups" are per column family, I was wondering
>>>> what if we treat column families like column qualifiers (creating one
>>>> column family per each qualifier) and assigning each to a different
>>>> locality group. This way all the data in a given column will be next to
>>>> each other on disk which makes it easier for analytical applications to
>>>> query the data.
>>>>
>>>> Any thoughts?
>>>>
>>>> Thanks,
>>>> Mohammad
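
To make the configuration-scaling concern in the thread concrete, here is a minimal sketch of the properties the "one column family per locality group" scheme would generate. It assumes, as far as I know, that Accumulo stores locality group assignments in table properties named table.group.<name> and table.groups.enabled; check those names against the documentation for your version. The helper functions, the "lg_" prefix, and the "colNNNN" family names are hypothetical, and the sketch assumes family names are plain ASCII with no commas.

```python
def locality_group_properties(column_families, prefix="lg_"):
    """Map each column family to its own locality group.

    Returns a dict of property name -> value. The property names
    (table.group.<name>, table.groups.enabled) are assumed to match
    Accumulo's table properties; the prefix is a hypothetical choice.
    """
    props = {}
    group_names = []
    for cf in column_families:
        name = f"{prefix}{cf}"
        # One property per group, listing the families it holds (here: one).
        props[f"table.group.{name}"] = cf
        group_names.append(name)
    # A single property enumerates every enabled group, so it grows with
    # the number of groups.
    props["table.groups.enabled"] = ",".join(group_names)
    return props


def config_size_bytes(props):
    """Rough size of the configuration: total bytes of keys and values."""
    return sum(len(k.encode()) + len(v.encode()) for k, v in props.items())


if __name__ == "__main__":
    for n in (10, 100, 1000):
        families = [f"col{i:04d}" for i in range(n)]  # hypothetical names
        props = locality_group_properties(families)
        print(n, "groups:", len(props), "properties,",
              config_size_bytes(props), "bytes")
```

The point of the sketch is only that the number of properties, and the length of the enabled-groups list, grow linearly with the number of column families. It says nothing about where ZooKeeper or a tablet server would actually start to struggle, which the thread leaves open.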