Re: Determining tablets assigned to table splits, and the number of rows in each tablet

Dylan Hutchison Mon, 06 Oct 2014 08:44:55 -0700

Yep, ticket here: ACCUMULO-3206
<https://issues.apache.org/jira/browse/ACCUMULO-3206>


There is a related effort at ACCUMULO-3005
<https://issues.apache.org/jira/browse/ACCUMULO-3005> to make the number of
entries and number of bytes per tablet / tablet server, per table,
available via a RESTful web server as an extension of the monitor.  The
extra operations you suggest, number of keys in a range and median key in a
range, we would want to keep at the API level so that we can enforce
authorizations.  Sounds great!
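
To make the authorization point concrete, here is a toy sketch of the two
operations over an in-memory sorted map.  Everything in it is hypothetical
(the class RangeStatsSketch, its methods, and the single-label visibility
model); it is not Accumulo's API.  It also counts exactly, whereas the
operations discussed in this thread would return estimates from statistics
Accumulo already keeps.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical illustration only: none of these names exist in Accumulo.
class RangeStatsSketch {
    // Sorted key -> visibility label; an empty label means "visible to all".
    private final TreeMap<String, String> visibilityByKey = new TreeMap<>();

    public void put(String key, String visibility) {
        visibilityByKey.put(key, visibility);
    }

    // Keys in [start, end) that the caller's authorizations allow it to see.
    private List<String> visibleKeys(String start, String end, Set<String> auths) {
        List<String> keys = new ArrayList<>();
        for (Map.Entry<String, String> e
                : visibilityByKey.subMap(start, true, end, false).entrySet()) {
            String vis = e.getValue();
            if (vis.isEmpty() || auths.contains(vis)) {
                keys.add(e.getKey());
            }
        }
        return keys;
    }

    // Number of visible keys in [start, end).
    public long countKeys(String start, String end, Set<String> auths) {
        return visibleKeys(start, end, auths).size();
    }

    // Middle visible key in [start, end), or null if the range is empty.
    public String medianKey(String start, String end, Set<String> auths) {
        List<String> keys = visibleKeys(start, end, auths);
        return keys.isEmpty() ? null : keys.get(keys.size() / 2);
    }

    public static void main(String[] args) {
        RangeStatsSketch stats = new RangeStatsSketch();
        stats.put("a", "");
        stats.put("b", "secret");
        stats.put("c", "");
        stats.put("d", "");
        stats.put("e", "");

        Set<String> none = Collections.emptySet();
        Set<String> secret = Collections.singleton("secret");

        System.out.println(stats.countKeys("a", "z", none));   // 4
        System.out.println(stats.countKeys("a", "z", secret)); // 5
        System.out.println(stats.medianKey("a", "z", none));   // d
        System.out.println(stats.medianKey("a", "z", secret)); // c
    }
}
```

Both answers change with the caller's authorizations, so an unfiltered
count or median would reveal that hidden entries exist.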

Could you lay out a list of all the stats that Accumulo already tracks, so
that we know what to implement, either here or on JIRA?  This will form the
basis for extending the API.

~Dylan


On Mon, Oct 6, 2014 at 10:31 AM, Adam Fuchs <[email protected]> wrote:

> A few years ago we hashed out a rough idea of creating a stats API
> that would allow users to ask a variety of questions that leverage
> information that is already present in the system. Those questions
> would include things like:
>  * Estimate of number of keys in a range. This would satisfy the "key
> count per tablet" request, but could also be used for things like
> predicting query result sizes.
>  * Find the median key in a range. This is useful for doing things
> like parallelizing processing by ranges and predicting sizes of
> intersections.
>
> I think these would best be exposed in both the iterator API and as
> client operations. We never got around to building this before, mostly
> due to prioritization with other features. However, it seems to be
> coming up in conversation frequently these days. There are going to be
> a few tricky parts around cell-level security (information leakage)
> and accuracy of estimates. Is somebody working on creating this ticket
> already?
>
> Adam
>
>
> On Sat, Oct 4, 2014 at 9:23 PM, Josh Elser <[email protected]> wrote:
> > I'll re-state it: I'd be happy to work with you to figure out some Java
> > APIs for clients to consume for these kinds of metrics. A JIRA issue is
> > the best way to encapsulate this. Would also love to help you provide a
> > patch for it, too :)
> >
> > The biggest concern (at least for creating an API for entries in a table
> > -- by tablet/tabletserver/otherwise) is going to be that the number of
> > entries is an approximation, not definitive. This is not prohibitive,
> > though, as long as we're clear that it is an approximation and not an
> > exact metric.
> >
> > Dylan Hutchison wrote:
> >>
> >> It should suffice to list the number of entries for a table, tablet and
> >> tablet server.  No need to worry about number of unique rows, number of
> >> unique column families, etc.  By entry I mean number of (key,value)s.
> >>
> >> For load balancing, we care about how much physical data is on each
> >> tablet / tablet server.  This is directly proportional to the number of
> >> entries, assuming that the key size and value size in bytes do not
> >> differ too drastically.  If they do (say for raw documents of vastly
> >> different sizes), the best measure is the /size of the data in bytes/
> >> for each tablet / tablet server.  I didn't suggest it because it doesn't
> >> look like Accumulo tracks it so it would involve a lot of new
> >> implementation and book-keeping, which could hamper performance.
> >>
> >> Accumulo does already track the number of entries for tables, tablets
> >> and tablet servers.  It's just hard to get to, relying on the format of
> >> the metadata table and accessing the non-public Monitor classes.
> >> Bringing it to the public API just looks like a matter of reworking the
> >> API and letting the client gather the information that the Monitor
> >> already does by connecting to each tablet server.  Does that sound
> >> reasonable?
> >>
> >> Regards, Dylan
> >>
> >> On Sat, Oct 4, 2014 at 4:11 PM, David Medinets <[email protected]
> >> <mailto:[email protected]>> wrote:
> >>
> >>     Adding this functionality into Accumulo's API would reduce its
> >>     efficiency for users that don't need this level of tracking. Let
> >>     ingest procedures take the performance hit. There are
> >>     synchronization issues that degrade performance. Also what
> >>     would be the appropriate level of tracking - at the row,
> >>     column-family, or every level? Whatever answer you give, someone
> >>     else will ask for something different. And then there are the
> >>     aggregation questions. Not to mention the additional storage
> >>     requirements.
> >>
> >>
> >>
> >> --
> >> www.cs.stevens.edu/~dhutchis <http://www.cs.stevens.edu/~dhutchis>
>



-- 
www.cs.stevens.edu/~dhutchis
