Re: Determining tablets assigned to table splits, and the number of rows in each tablet

Josh Elser Sat, 04 Oct 2014 18:23:35 -0700

I'll re-state it: I'd be happy to work with you to figure out some Java APIs 
for clients to consume for these kinds of metrics. A JIRA issue is the best way 
to encapsulate this. Would also love to help you provide a patch for it, too :)


The biggest concern (at least for creating an API for entries in a table -- by 
tablet/tabletserver/otherwise) is going to be that the number of entries is an 
approximation, not definitive. This is not prohibitive, though, as long as 
we're clear that it is an approximation and not an exact metric.
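
To make that concrete, here is a minimal sketch of what a client might do with such numbers. Every class, field, and method name below is hypothetical and none of it is existing Accumulo API. It only illustrates two points from this thread: the counts are labeled as approximate, and per-tablet estimates roll up to per-tablet-server totals.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical value type: one per-tablet statistic as a client might see it.
// These names are assumptions for illustration, not Accumulo classes.
final class TabletEntryEstimate {
    final String tablet;        // opaque tablet label (made up for this sketch)
    final String tabletServer;  // server currently hosting the tablet
    final long approxEntries;   // approximate number of (key,value) entries

    TabletEntryEstimate(String tablet, String tabletServer, long approxEntries) {
        this.tablet = tablet;
        this.tabletServer = tabletServer;
        this.approxEntries = approxEntries;
    }
}

final class EntryEstimates {
    // Sum the approximate entry counts of all tablets hosted on each server.
    static Map<String, Long> perServer(List<TabletEntryEstimate> tablets) {
        Map<String, Long> totals = new TreeMap<>();
        for (TabletEntryEstimate t : tablets) {
            totals.merge(t.tabletServer, t.approxEntries, Long::sum);
        }
        return totals;
    }

    // Sum the approximate entry counts across every tablet of the table.
    static long total(List<TabletEntryEstimate> tablets) {
        long sum = 0;
        for (TabletEntryEstimate t : tablets) {
            sum += t.approxEntries;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Made-up sample data; a real client would obtain these from the servers.
        List<TabletEntryEstimate> sample = Arrays.asList(
            new TabletEntryEstimate("tablet-a", "server1", 1000L),
            new TabletEntryEstimate("tablet-b", "server1", 500L),
            new TabletEntryEstimate("tablet-c", "server2", 250L));
        System.out.println("approx entries per server: " + perServer(sample));
        System.out.println("approx entries in table:   " + total(sample));
    }
}
```

Putting "approx" in the names is one way to keep the approximation visible to callers; how the estimates would actually be collected from the tablet servers is deliberately left out of the sketch.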

Dylan Hutchison wrote:

It should suffice to list the number of entries for a table, tablet and tablet server. No need to worry about number of unique rows, number of unique column families, etc. By entry I mean number of (key,value)s.
For load balancing, we care about how much physical data is on each tablet / tablet server. This is directly proportional to the number of entries, assuming that the key size and value size in bytes do not differ too drastically. If they do (say for raw documents of vastly different sizes), the best measure is the /size of the data in bytes/ for each tablet / tablet server. I didn't suggest it because it doesn't look like Accumulo tracks it so it would involve a lot of new implementation and book-keeping, which could hamper performance.
Accumulo does already track the number of entries for tables, tablets and tablet server. It's just hard to get to, relying on the format of the metadata table and accessing the non-public Monitor classes. Bringing it to the public API just looks like a matter of reworking the API and letting the client gather the information that the Monitor already does by connecting to each tablet server. Does that sound reasonable?
Regards, Dylan
On Sat, Oct 4, 2014 at 4:11 PM, David Medinets wrote:
    Adding this functionality into Accumulo's API would reduce its
    efficiency for users that don't need this level of tracking. Let
    ingest procedures take the performance hit. There are
    synchronization issues that degrade performance. Also what
    would be the appropriate level of tracking - at the row,
    column-family, or every level? Whatever answer you give, someone
    else will ask for something different. And then there are the
    aggregation questions. Not to mention the additional storage
    requirements.



--
www.cs.stevens.edu/~dhutchis <http://www.cs.stevens.edu/~dhutchis>
