Re: Accumulo Seek performance

2016-08-29 Thread Keith Turner
On Wed, Aug 24, 2016 at 9:22 AM, Sven Hodapp wrote:
> Hi there,
>
> currently we're experimenting with a two-node Accumulo cluster (two tablet
> servers) set up for document storage.
> These documents are decomposed down to the sentence level.
>
> Now I'm using a BatchScanner to assemble the full document like this:
>
> // ARTIFACTS table currently hosts ~30GB data, ~200M entries on ~45 tablets
> val bscan = instance.createBatchScanner(ARTIFACTS, auths, 10)
> // there are like 3000 Range.exact's in the ranges-list
> bscan.setRanges(ranges)
> for (entry <- bscan.asScala) yield {
>   val key = entry.getKey()
>   val value = entry.getValue()
>   // etc.
> }
>
> For larger full documents (e.g. 3000 exact ranges), this operation will take 
> about 12 seconds.
> But shorter documents are assembled blazing fast...
>
> Is that too much for a BatchScanner, or am I misusing the BatchScanner?
> Is that a normal time for such a (seek) operation?
> Can I do something to get a better seek performance?

How many threads did you configure the batch scanner with and did you
try varying this?
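
A minimal sketch of how one might vary the thread count and compare
wall-clock times. `ThreadSweep`, `timeScan`, and `runScan` are hypothetical
names. In a real test, `runScan(n)` would create the BatchScanner with `n`
query threads (the third argument to `createBatchScanner` in the quoted
code), set the ranges, and return `bscan.asScala`; a stand-in iterator is
used here so the harness runs on its own.

```scala
object ThreadSweep {
  /** Runs `runScan` once per thread count; returns (threads, entriesRead, elapsedMillis). */
  def timeScan(threadCounts: Seq[Int])(runScan: Int => Iterator[Any]): Seq[(Int, Long, Long)] =
    threadCounts.map { n =>
      val start = System.nanoTime()
      // Drain the iterator fully: the scan is only done once every entry is consumed.
      val entries = runScan(n).foldLeft(0L)((count, _) => count + 1)
      val elapsedMs = (System.nanoTime() - start) / 1000000L
      (n, entries, elapsedMs)
    }

  def main(args: Array[String]): Unit = {
    // Stand-in scan (assumption): replace with code that builds a BatchScanner
    // using `n` query threads and returns its entries.
    val fakeScan: Int => Iterator[Any] = _ => Iterator.range(0, 3000)
    timeScan(Seq(1, 2, 4, 8, 16, 32))(fakeScan).foreach { case (n, entries, ms) =>
      println(s"threads=$n entries=$entries elapsedMs=$ms")
    }
  }
}
```

Repeating each setting a few times and discarding the first run would help
separate cache warm-up effects from the effect of the thread count itself.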

>
> Note: I have already enabled bloom filtering on that table.
>
> Thank you for any advice!
>
> Regards,
> Sven
>
> --
> Sven Hodapp, M.Sc.,
> Fraunhofer Institute for Algorithms and Scientific Computing SCAI,
> Department of Bioinformatics
> Schloss Birlinghoven, 53754 Sankt Augustin, Germany
> sven.hod...@scai.fraunhofer.de
> www.scai.fraunhofer.de


Re: Profile a (batch) scan

2016-08-29 Thread Josh Elser

Right, no, I understand.

I just meant that the metrics we do expose are lacking. It would be a 
huge benefit to everyone if we could find more things that we can 
expose, do that once, and then prevent the need for the next person to 
hand-roll some things like you are doing now :)


Mario Pastorelli wrote:

I think suggestions in this mailing list are useful, Josh, that's why I
keep asking questions. I'm sorry that I'm asking so many questions but
I'm trying to improve my knowledge of Accumulo and documentation is
limited. Ideally, I would like to use only the metrics provided by
Accumulo, because that's less stuff that I have to maintain. The
StopWatch writing to the tracer could help.



On Sun, Aug 28, 2016 at 10:14 PM, Josh Elser wrote:

I know it's not a super-helpful response, but I would love to help
you work through things we *can* expose and help you do that.

I imagine there is significantly more that we can add into the
dist-tracing information for BatchScanners now which would give more
insight into the tserver (amount of data read, number of ranges per
scan RPC, amount of data returned). This would be ideal as it would
prevent you from having to update your application code (although,
the suggestion of writing some iterator for timing purposes is a
simple way to move forward)

Mario Pastorelli wrote:

I would like to understand the performance of a batch scan and I would
like to have some hints on how to proceed. I have enabled the
distributed trace, and it tells me that some batch scanner threads take
much more time than others to complete but this is not helpful enough
because it's not telling me why some threads take more. My gut feeling
is that one batch thread is scanning more data than the others, which
means that the data is not well distributed for a query, but I use a
random shard byte as prefix of the keys which should guarantee that data
of the same range is almost equally distributed among the tservers. I
enabled JMX on the tservers and attached jvisualvm to get an idea of the
state of each tserver but I couldn't find anything meaningful. I would
like to know if there is a way to profile what's going on on a single
tserver for a single scan thread and by this I mean:

  1. where are the tablets required by a scan? Which tablet server?
  2. how fast were the lookups on the index for that scan?
  3. how many bytes/records were read for that scan without the iterators
  4. how many seeks are done by the scan and possibly why

The main Accumulo UI is fine to get an overview of Accumulo but doesn't
really give you any information about the performance of a single query,
which seems to me to be heavily affected by what iterators do.
Profiling a single scan is much more interesting. Is there a way to
profile a single (batch) scan in Accumulo such that I have a complete
overview of the entire process of reading and sending back records to
the driver?
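
One client-side way to test the gut feeling about skew, as a rough sketch:
tally entries and bytes per shard prefix while consuming the scan results.
It assumes the shard is the first byte of a non-empty row, as described
above. `ShardSkew`, `Tally`, and `tally` are hypothetical names, and the
input is modelled as plain (row bytes, value size) pairs rather than
Accumulo Key/Value objects. It only counts what reaches the client after
the iterators have run, so it does not answer item 3.

```scala
object ShardSkew {
  final case class Tally(entries: Long, bytes: Long)

  /** Groups (rowBytes, valueSize) pairs by the first row byte and sums entries and bytes. */
  def tally(results: Iterator[(Array[Byte], Int)]): Map[Byte, Tally] =
    results.foldLeft(Map.empty[Byte, Tally]) { case (acc, (row, valueSize)) =>
      val shard = row(0) // assumption: shard byte is the first byte of the row
      val prev = acc.getOrElse(shard, Tally(0L, 0L))
      acc.updated(shard, Tally(prev.entries + 1, prev.bytes + row.length + valueSize))
    }

  def main(args: Array[String]): Unit = {
    // Made-up sample data, for illustration only.
    val sample = Iterator(
      (Array[Byte](0, 10, 11), 100),
      (Array[Byte](0, 12, 13), 50),
      (Array[Byte](1, 10, 11), 70)
    )
    tally(sample).toSeq.sortBy(_._1).foreach { case (shard, t) =>
      println(s"shard=$shard entries=${t.entries} bytes=${t.bytes}")
    }
  }
}
```

If one shard's totals are far above the rest for a given query, that would
point to data distribution rather than tserver behaviour.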

Thanks,
Mario

--
Mario Pastorelli | TERALYTICS

*software engineer*

Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland
phone: +41794381682
email: mario.pastore...@teralytics.ch

www.teralytics.net

Company registration number: CH-020.3.037.709-7 | Trade register Canton
Zurich
Board of directors: Georg Polzer, Luciano Franceschina, Mark Schmitz,
Yann de Vries

This e-mail message contains confidential information which is for the
sole attention and use of the intended recipient. Please notify us at
once if you think that it may not be intended for you and delete it
immediately.



