Hi Sven,
  Without locality groups, your filtered scan may be reading nearly the
entire table.  The process looks like this:

   1. For each tablet that has one of the 3000 row ids (assuming sufficient
   tablet servers),
      1. *Seek* to the first column family of the first row id out of the
      target row ids in the tablet.
      2. *Read* that row+cf prefix.
      3. Find the next cf (out of the 5k cf's in your filter).
         1. *Read* the next entry and see if it is in the cf.  If it is,
         then you are lucky and go back to step 2.  Repeat this process for 10
         entries (a heuristic number).
         2. If none of the next 10 entries match the cf (or the next row in
         your target ranges), then *seek* to the next target row+cf, as in
         step 1.
      4. Continue until all target row ids in the tablet are scanned.

In the worst case, if the 5k target cf's in your filter are uniformly
spread out among the 500k total cf's (and each row has all 500k cf's, which
is probably not the case in your document-sentence table), then Accumulo
performs 5k seeks per row id, or 5k * 3k rows = 15M seeks, to be divided
among your tablet servers (assuming no significant skew).  You can adjust
this for the actual distribution of column families in your table to get an
idea of how many seeks Accumulo performs.

(On the other hand in the best case, if the 5k target cf's are all clumped
together, then Accumulo need only seek 3k times, or less if some row ids
are consecutive.)

Perhaps others could extend the model by estimating a "seconds/seek"
figure?  If we can estimate this, it would tell you whether your
BatchScanner times are in the right ballpark.  Or it might be sufficient to
compare the number of seeks.

Cheers, Dylan

On Wed, Aug 31, 2016 at 12:06 AM, Sven Hodapp <
sven.hod...@scai.fraunhofer.de> wrote:

> Hi Keith,
>
> I've tried it with 1, 2 or 10 threads. Unfortunately there where no
> amazing differences.
> Maybe it's a problem with the table structure? For example it may happen
> that one row id (e.g. a sentence) has several thousand column families. Can
> this affect the seek performance?
>
> So for my initial example it has about 3000 row ids to seek, which will
> return about 500k entries. If I filter for specific column families (e.g. a
> document without annotations) it will return about 5k entries, but the seek
> time will only be halved.
> Are there to much column families to seek it fast?
>
> Thanks!
>
> Regards,
> Sven
>
> --
> Sven Hodapp, M.Sc.,
> Fraunhofer Institute for Algorithms and Scientific Computing SCAI,
> Department of Bioinformatics
> Schloss Birlinghoven, 53754 Sankt Augustin, Germany
> sven.hod...@scai.fraunhofer.de
> www.scai.fraunhofer.de
>
> ----- Ursprüngliche Mail -----
> > Von: "Keith Turner" <ke...@deenlo.com>
> > An: "user" <user@accumulo.apache.org>
> > Gesendet: Montag, 29. August 2016 22:37:32
> > Betreff: Re: Accumulo Seek performance
>
> > On Wed, Aug 24, 2016 at 9:22 AM, Sven Hodapp
> > <sven.hod...@scai.fraunhofer.de> wrote:
> >> Hi there,
> >>
> >> currently we're experimenting with a two node Accumulo cluster (two
> tablet
> >> servers) setup for document storage.
> >> This documents are decomposed up to the sentence level.
> >>
> >> Now I'm using a BatchScanner to assemble the full document like this:
> >>
> >>     val bscan = instance.createBatchScanner(ARTIFACTS, auths, 10) //
> ARTIFACTS table
> >>     currently hosts ~30GB data, ~200M entries on ~45 tablets
> >>     bscan.setRanges(ranges)  // there are like 3000 Range.exact's in
> the ranges-list
> >>       for (entry <- bscan.asScala) yield {
> >>         val key = entry.getKey()
> >>         val value = entry.getValue()
> >>         // etc.
> >>       }
> >>
> >> For larger full documents (e.g. 3000 exact ranges), this operation will
> take
> >> about 12 seconds.
> >> But shorter documents are assembled blazing fast...
> >>
> >> Is that to much for a BatchScanner / I'm misusing the BatchScaner?
> >> Is that a normal time for such a (seek) operation?
> >> Can I do something to get a better seek performance?
> >
> > How many threads did you configure the batch scanner with and did you
> > try varying this?
> >
> >>
> >> Note: I have already enabled bloom filtering on that table.
> >>
> >> Thank you for any advice!
> >>
> >> Regards,
> >> Sven
> >>
> >> --
> >> Sven Hodapp, M.Sc.,
> >> Fraunhofer Institute for Algorithms and Scientific Computing SCAI,
> >> Department of Bioinformatics
> >> Schloss Birlinghoven, 53754 Sankt Augustin, Germany
> >> sven.hod...@scai.fraunhofer.de
> > > www.scai.fraunhofer.de
>

Reply via email to