Josh,

As you suggested, I don't want to pay the price of a CountingIterator. Fortunately, I don't care about visibility in this case. (For a couple of reasons, one of which is that visibility will be uniformly distributed -- I think.)
I'm thinking about doing this:

In mutation-writing clients, sample. Possibly truncate keys to fit what I need. For sampled mutations, write them to a table with a summing combiner. (I'll probably also have historical stats tables 'sample_20140627T10:12' or whatever, so I can see samples evolve.)

Then implement Range.getCountEstimate() by querying the sample table with summing.

Sound reasonable?

--Jamie

On Fri, Jun 27, 2014 at 10:04 AM, Josh Elser <[email protected]> wrote:

> You could do this fairly efficiently by leveraging the CountingIterator to
> get an exact count (taking visibilities into account, as well) for the
> range in question. It isn't going to be as fast as a precomputed answer,
> but you could cache that easily.
>
> The fact that visibilities will affect the cardinality of a term makes it
> harder for us to provide this within Accumulo. The situations where
> Accumulo itself cares about cardinality, it's agnostic of the visibilities.
> It would be possible to try to build an index of this information
> internally, but, like Eric said, that's not there today.
>
>
> On 6/27/14, 10:40 AM, Eric Newton wrote:
>
>> Short answer: no.
>>
>> Long answer:
>>
>> You can scan the metadata table for the count/size of the files.
>>
>> You can query tablet servers for the basic stats of every tablet for a
>> given table. This is used for balancing.
>>
>> But really you should collect the statistics you want during ingest and
>> insert them in another table.
>>
>> -Eric
>>
>>
>> On Fri, Jun 27, 2014 at 9:42 AM, Jamie Stephens <[email protected]
>> <mailto:[email protected]>> wrote:
>>
>>     Is there a way to get a quick estimate of the number of keys in a
>>     given range?
>>
>>     Perhaps more generally, getting an estimate of the amount of work
>>     (and even some sort of confidence based on, say, the age of
>>     something) to iterate over a range.
>>
>>     I'd like to do some query planning, so statistics like these sure
>>     would be nice.
>>
>>     --Jamie
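
For reference, the sampling plan at the top of this message (sample in the writing client, optionally truncate keys, sum the sampled counts, then scale up at query time) could look roughly like the sketch below. It is plain Python with an in-memory dict standing in for the sample table, not Accumulo client code. The names SampleTable, observe, sample_rate, and prefix_len are made up for illustration, and the `+= 1` only imitates what a summing combiner would do on the server side.

```python
import bisect
import random
from collections import defaultdict


class SampleTable:
    """In-memory stand-in (hypothetical) for a sample table with a summing combiner."""

    def __init__(self, sample_rate, prefix_len=None, seed=None):
        if not 0 < sample_rate <= 1:
            raise ValueError("sample_rate must be in (0, 1]")
        self.sample_rate = sample_rate
        self.prefix_len = prefix_len      # None means keys are not truncated
        self._rng = random.Random(seed)
        self._counts = defaultdict(int)   # sampled key -> summed count

    def observe(self, key):
        """Called by the writing client for every mutation it sends."""
        if self._rng.random() >= self.sample_rate:
            return  # this mutation is not part of the sample
        sampled_key = key[:self.prefix_len] if self.prefix_len else key
        # Imitates a summing combiner: repeated writes to a key add up.
        self._counts[sampled_key] += 1

    def get_count_estimate(self, start, end):
        """Estimate how many keys were written in [start, end).

        Truncated keys sort at or before the keys they came from, so range
        boundaries are only accurate to the prefix length.
        """
        keys = sorted(self._counts)
        lo = bisect.bisect_left(keys, start)
        hi = bisect.bisect_left(keys, end)
        sampled = sum(self._counts[k] for k in keys[lo:hi])
        # Scale the sampled count back up by the sampling rate.
        return sampled / self.sample_rate


if __name__ == "__main__":
    table = SampleTable(sample_rate=0.1, seed=42)
    for i in range(100_000):
        table.observe(f"row_{i:06d}")
    # True count for this range is 10,000; the estimate should be close.
    print(table.get_count_estimate("row_020000", "row_030000"))
```

The estimate gets noisier as the range shrinks or the sampling rate drops, which is one way to attach a rough confidence to it for query planning.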
