David, thanks for the pointer to the articles. I read them a few months ago but had forgotten them. I will need to read the HyperLogLog paper <https://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/40671.pdf>.
*The number of unique rows within a tablet are not explicitly tracked.*

Right Josh, I misspoke. For load balancing, we're interested in the *number of entries in each tablet*, not the number of unique rows. Counting only the unique rows doesn't distinguish between really big rows and singleton rows, and as David pointed out, we need a client-controlled means of doing unique row counting/estimation.

We can see the number of entries in a Table, and the number of entries a Table has on a particular Tablet Server, because these are listed in the monitor.

[image: Inline image 2]

David, you may recognize the name of this tablet server. Just got Accumulo Vagrant <https://github.com/medined/Accumulo_1_5_0_By_Vagrant> working last week, thanks ;)

[image: Inline image 1]

However, there could be multiple Tablets assigned to the same Tablet Server. Here is an outline of the procedure I followed to read the *TabletStats.numEntries* <https://accumulo.apache.org/1.5/apidocs/org/apache/accumulo/core/tabletserver/thrift/TabletStats.html#numEntries> for the correct Tablet that holds a split range.
Given a table name:

- Get a list of all tablet servers by connecting to the Master and referencing the MasterMonitorInfo <https://accumulo.apache.org/1.6/apidocs/org/apache/accumulo/core/master/thrift/MasterClientService.Client.html#getMasterStats(org.apache.accumulo.trace.thrift.TInfo,%20org.apache.accumulo.core.security.thrift.TCredentials)>
- Get the internal table ID via Tables.getNameToIdMap <https://accumulo.apache.org/1.6/apidocs/org/apache/accumulo/core/client/impl/Tables.html#getNameToIdMap(org.apache.accumulo.core.client.Instance)>
- Connect to each tablet server to get the TabletStats <https://accumulo.apache.org/1.6/apidocs/org/apache/accumulo/core/tabletserver/thrift/TabletStats.html> of the tablets that are on that tablet server under the given internal table ID
- Scan the Metadata table starting at {tableName converted to internal table ID}
  - and ending at {internal table ID}'<' (the last entry for this table in the metadata table)
  - Example row: 1< (if the internal table ID is 1 and this is the last split in the table)
  - Look at the column for the previous row: ~tab:~pr
  - Example row-col-val: 1< ~tab:~pr [] \x00
    - (this table has no table splits: no end row and no previous row start)
- Create an extent for the value using KeyExtent <https://accumulo.apache.org/1.6/apidocs/org/apache/accumulo/core/data/KeyExtent.html>
  - (a shortcut for parsing the metadata table and getting the previous and current end row)
- Among the list of TabletStats, find the one whose previous end row and end row match the result from the Metadata table. Take that tabletStat.numEntries <https://accumulo.apache.org/1.6/apidocs/org/apache/accumulo/core/tabletserver/thrift/TabletStats.html#numEntries> to get the number of entries in this table split range.
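The last two steps (build an extent from the metadata entry, then match it against the collected stats) can be sketched roughly as follows. This is a minimal illustration, not the code I actually ran: Extent and Stat are hypothetical stand-ins for KeyExtent and TabletStats so that it compiles without Accumulo on the classpath, and fromMetadata assumes the metadata encoding as I understand it (row "<id><" for the last tablet or "<id>;<endRow>" otherwise; ~tab:~pr value beginning with \x00 for "no previous row" or \x01 followed by the previous end row). In real code KeyExtent does that parsing for you.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

// Sketch only: Extent and Stat are hypothetical stand-ins for Accumulo's
// KeyExtent and TabletStats, so the example runs without Accumulo.
final class TabletEntryLookup {

    /** A tablet's range (prevEndRow, endRow]; null means unbounded on that side. */
    static final class Extent {
        final String tableId;
        final String prevEndRow;
        final String endRow;

        Extent(String tableId, String prevEndRow, String endRow) {
            this.tableId = tableId;
            this.prevEndRow = prevEndRow;
            this.endRow = endRow;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Extent)) return false;
            Extent e = (Extent) o;
            return tableId.equals(e.tableId)
                && Objects.equals(prevEndRow, e.prevEndRow)
                && Objects.equals(endRow, e.endRow);
        }

        @Override
        public int hashCode() {
            return Objects.hash(tableId, prevEndRow, endRow);
        }
    }

    /** What one tablet server reports for one tablet. */
    static final class Stat {
        final Extent extent;
        final long numEntries;
        final String server;

        Stat(Extent extent, long numEntries, String server) {
            this.extent = extent;
            this.numEntries = numEntries;
            this.server = server;
        }
    }

    /** Builds an extent from a metadata row and its ~tab:~pr value (assumed encoding). */
    static Extent fromMetadata(String row, byte[] prevRowValue) {
        String tableId;
        String endRow;
        int semi = row.indexOf(';');
        if (semi >= 0) {                       // "<id>;<endRow>"
            tableId = row.substring(0, semi);
            endRow = row.substring(semi + 1);
        } else if (row.endsWith("<")) {        // "<id><": last tablet, no end row
            tableId = row.substring(0, row.length() - 1);
            endRow = null;
        } else {
            throw new IllegalArgumentException("not a tablet row: " + row);
        }
        // First byte 0 means there is no previous end row.
        String prev = prevRowValue[0] == 0
            ? null
            : new String(prevRowValue, 1, prevRowValue.length - 1, StandardCharsets.UTF_8);
        return new Extent(tableId, prev, endRow);
    }

    /** Returns the entry count of the tablet matching the extent, or -1 if none matches. */
    static long numEntriesFor(Extent wanted, List<Stat> stats) {
        for (Stat s : stats) {
            if (s.extent.equals(wanted)) {
                return s.numEntries;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Made-up stats for a table with ID 1 split once at row "m".
        List<Stat> stats = Arrays.asList(
            new Stat(new Extent("1", null, "m"), 10, "ts1"),
            new Stat(new Extent("1", "m", null), 32, "ts2"));
        Extent last = fromMetadata("1<", new byte[] {1, 'm'});
        System.out.println(numEntriesFor(last, stats));
    }
}
```

The numbers and server names in main are invented for illustration. Matching on the full (table ID, previous end row, end row) triple is what disambiguates several tablets hosted on the same tablet server.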
Later this information is combined into a method that returns an array of triples (tablet_split_range, tablet_num_entries, tablet_server_list_for_this_tablet).

I recommend adding the ability to get the number of entries for tables, tablet servers and tablets to the public API. It would be nice to reference any of the data from the Accumulo monitor programmatically; in this case we cross-reference monitor data with the Metadata table. Josh, is JIRA the place to file those kinds of suggestions?

Regards,
Dylan
--
www.cs.stevens.edu/~dhutchis
