Yep, ticket here: ACCUMULO-3206 <https://issues.apache.org/jira/browse/ACCUMULO-3206>
There is a related movement at ACCUMULO-3005
<https://issues.apache.org/jira/browse/ACCUMULO-3005> to make the number of
entries and number of bytes per tablet / tablet server per table available
via a RESTful web server as an extension of the monitor. With the extra
operations you suggest, number of keys in a range and median key in a
range, we would want to keep that at the API level so that we can introduce
authorizations.

Sounds great! Could you lay out a list of all the stats that Accumulo
tracks already so that we know what to implement, either here or on JIRA?
This will form the basis for extending the API.

~Dylan

On Mon, Oct 6, 2014 at 10:31 AM, Adam Fuchs <[email protected]> wrote:

> A few years ago we hashed out a rough idea of creating a stats API
> that would allow users to ask a variety of questions that leverage
> information that is already present in the system. Those questions
> would include things like:
> * Estimate of number of keys in a range. This would satisfy the "key
> count per tablet" request, but could also be used for things like
> predicting query result sizes.
> * Find the median key in a range. This is useful for doing things
> like parallelizing processing by ranges and predicting sizes of
> intersections.
>
> I think these would best be exposed in both the iterator API and as
> client operations. We never got around to building this before, mostly
> due to prioritization with other features. However, it seems to be
> coming up in conversation frequently these days. There are going to be
> a few tricky parts around cell-level security (information leakage)
> and accuracy of estimates. Is somebody working on creating this ticket
> already?
>
> Adam
>
>
> On Sat, Oct 4, 2014 at 9:23 PM, Josh Elser <[email protected]> wrote:
> > I'll re-state it: I'd be happy to work with you to figure out some
> > Java APIs for clients to consume for these kinds of metrics. A JIRA
> > issue is the best way to encapsulate this. Would also love to help
> > you provide a patch for it, too :)
> >
> > The biggest concern (at least for creating an API for entries in a
> > table -- by tablet/tabletserver/otherwise) is going to be that the
> > number of entries is an approximation, not definitive. This is not
> > prohibitive, though, as long as we're clear that it is an
> > approximation and not an exact metric.
> >
> > Dylan Hutchison wrote:
> >>
> >> It should suffice to list the number of entries for a table, tablet
> >> and tablet server. No need to worry about number of unique rows,
> >> number of unique column families, etc. By entry I mean number of
> >> (key,value)s.
> >>
> >> For load balancing, we care about how much physical data is on each
> >> tablet / tablet server. This is directly proportional to the number
> >> of entries, assuming that the key size and value size in bytes do
> >> not differ too drastically. If they do (say for raw documents of
> >> vastly different sizes), the best measure is the /size of the data
> >> in bytes/ for each tablet / tablet server. I didn't suggest it
> >> because it doesn't look like Accumulo tracks it so it would involve
> >> a lot of new implementation and book-keeping, which could hamper
> >> performance.
> >>
> >> Accumulo does already track the number of entries for tables,
> >> tablets and tablet server. It's just hard to get to, relying on the
> >> format of the metadata table and accessing the non-public Monitor
> >> classes. Bringing it to the public API just looks like a matter of
> >> reworking the API and letting the client gather the information
> >> that the Monitor already does by connecting to each tablet server.
> >> Does that sound reasonable?
> >>
> >> Regards, Dylan
> >>
> >> On Sat, Oct 4, 2014 at 4:11 PM, David Medinets
> >> <[email protected] <mailto:[email protected]>> wrote:
> >>
> >>     Adding this functionality into Accumulo's API would reduce its
> >>     efficiency for users that don't need this level of tracking. Let
> >>     ingest procedures take the performance hit. There are
> >>     synchronization issues that degrade performance. Also what
> >>     would be the appropriate level of tracking - at the row,
> >>     column-family, or every level? Whatever answer you give, someone
> >>     else will ask for something different. And then there are the
> >>     aggregation questions. Not to mention the additional storage
> >>     requirements.
> >>
> >>
> >>
> >> --
> >> www.cs.stevens.edu/~dhutchis <http://www.cs.stevens.edu/~dhutchis>

--
www.cs.stevens.edu/~dhutchis
