Nice, not having to worry about visibilities makes the problem easier.

I'd encourage you to even consider forgoing sampling. You might be able to get by with combining/reducing counts in your client and then setting a SummingCombiner on your cardinality table. That may be enough to give you an accurate view of the statistics without a noticeable performance hit. But you know your situation better than I do :)
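
A minimal sketch of the client-side combining step, using only the Java standard library. The class name `ClientSideCounter` and its methods are hypothetical, not part of Accumulo; the actual write to the cardinality table (and the SummingCombiner configured on it) is left out and only described in comments.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper (not an Accumulo class): buffers per-term counts in the
// client so that one pre-summed value per term is written, rather than one
// count mutation per ingested key.
final class ClientSideCounter {
    private final Map<String, Long> pending = new TreeMap<>();

    // Record one ingested key for the given term (for example, a row prefix).
    void observe(String term) {
        pending.merge(term, 1L, Long::sum);
    }

    // Return the buffered counts and reset the buffer. In a real client each
    // entry would become one mutation against the cardinality table; a
    // SummingCombiner on that table would then add it to the stored total.
    Map<String, Long> drain() {
        Map<String, Long> out = new TreeMap<>(pending);
        pending.clear();
        return out;
    }

    public static void main(String[] args) {
        ClientSideCounter counter = new ClientSideCounter();
        for (String term : new String[] {"apple", "apple", "banana", "apple"}) {
            counter.observe(term);
        }
        // Four observations collapse into two pending writes.
        System.out.println(counter.drain()); // prints {apple=3, banana=1}
    }
}
```

How often to call `drain()` is a trade-off: larger batches mean fewer writes, but counts still buffered in a client are lost if that client dies before flushing.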

Let us know how it goes.

On 6/27/14, 11:15 AM, Jamie Stephens wrote:
Josh,

As you suggested, I don't want to pay the price of a CountingIterator.
Fortunately, I don't care about visibility in this case.  (For a couple
of reasons, one of which is that visibility will be uniformly
distributed -- I think.)

I'm thinking about doing this:

In mutation-writing clients, sample.  Possibly truncate keys to fit what
I need.  For sampled mutations, write them to a table with a summing
combiner.  (I'll probably also have historical stats tables
'sample_20140627T10:12' or whatever, so I can see samples evolve.)  Then
implement Range.getCountEstimate() by querying the sample table with
summing.  Sound reasonable?
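
The plan above can be sketched roughly as follows. This is an in-memory stand-in, not Accumulo client code: a `TreeMap` plays the role of the sample table with a summing combiner, and the class name, constructor parameters, and fixed sampling rate are assumptions made for illustration. Only the method name `getCountEstimate` is taken from the message.

```java
import java.util.NavigableMap;
import java.util.Random;
import java.util.TreeMap;

// Hypothetical in-memory model of the sample table. The sorted map stands in
// for a table whose values are summed on write.
final class SampledRangeEstimator {
    private final NavigableMap<String, Long> sampleTable = new TreeMap<>();
    private final double sampleRate; // probability that a written key is sampled
    private final int prefixLength;  // sampled keys are truncated to this length
    private final Random random;

    SampledRangeEstimator(double sampleRate, int prefixLength, long seed) {
        if (sampleRate <= 0.0 || sampleRate > 1.0) {
            throw new IllegalArgumentException("sampleRate must be in (0, 1]");
        }
        if (prefixLength < 1) {
            throw new IllegalArgumentException("prefixLength must be positive");
        }
        this.sampleRate = sampleRate;
        this.prefixLength = prefixLength;
        this.random = new Random(seed);
    }

    // Called for every key the client writes; only a sampled fraction is counted.
    void onWrite(String key) {
        if (random.nextDouble() >= sampleRate) {
            return; // not sampled
        }
        String truncated = key.length() > prefixLength ? key.substring(0, prefixLength) : key;
        sampleTable.merge(truncated, 1L, Long::sum); // summing, as a combiner would
    }

    // Estimated number of keys in [start, end): the sampled count in that
    // range scaled up by 1 / sampleRate.
    long getCountEstimate(String start, String end) {
        long sampled = 0;
        for (long count : sampleTable.subMap(start, true, end, false).values()) {
            sampled += count;
        }
        return Math.round(sampled / sampleRate);
    }

    public static void main(String[] args) {
        SampledRangeEstimator est = new SampledRangeEstimator(0.1, 4, 42L);
        for (int i = 0; i < 10_000; i++) {
            est.onWrite(String.format("a%05d", i));
        }
        // The true count is 10000; the estimate varies with the sample drawn.
        System.out.println(est.getCountEstimate("a", "b"));
    }
}
```

Note that truncation limits resolution: a range whose bounds are finer than the truncated prefix cannot be told apart from its neighbours, so the prefix length caps how precise the estimate can be.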

--Jamie



On Fri, Jun 27, 2014 at 10:04 AM, Josh Elser wrote:

    You could do this fairly efficiently by leveraging the
    CountingIterator to get an exact count (taking visibilities into
    account, as well) for the range in question. It isn't going to be as
    fast as a precomputed answer, but you could cache that easily.

    The fact that visibilities affect the cardinality of a term makes it
    harder for us to provide this within Accumulo. In the situations
    where Accumulo itself cares about cardinality, it is agnostic of
    visibilities. It would be possible to build an index of this
    information internally but, like Eric said, that's not there today.


    On 6/27/14, 10:40 AM, Eric Newton wrote:

        Short answer: no.

        Long answer:

        You can scan the metadata table for the count/size of the files.

        You can query tablet servers for the basic stats of every tablet
        for a given table.  This is used for balancing.

        But really you should collect the statistics you want during
        ingest and insert them in another table.

        -Eric


        On Fri, Jun 27, 2014 at 9:42 AM, Jamie Stephens wrote:

             Is there a way to get a quick estimate of the number of
             keys in a given range?

             Perhaps more generally, getting an estimate of the amount
             of work (and even some sort of confidence based on, say,
             the age of something) to iterate over a range.

             I'd like to do some query planning, so statistics like
             these sure would be nice.

             --Jamie


