Re: Accumulo Seek performance

dlmarion Thu, 25 Aug 2016 05:35:11 -0700

Calling BatchScanner.iterator() is what starts the work on the server side. You 
should do this first for all 6 batch scanners, then iterate over all of them in 
parallel.
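
A minimal sketch of that ordering, assuming only that each scanner is a java.lang.Iterable (which BatchScanner is). The names ParallelDrain and drainAll are hypothetical, and plain Java lists stand in for the six BatchScanners so the snippet runs without a cluster:

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
// Scala 2.13+; on 2.12 and earlier use scala.collection.JavaConverters._ instead
import scala.jdk.CollectionConverters._

object ParallelDrain {
  /** Hypothetical helper: call iterator() on every source first, then drain all iterators concurrently. */
  def drainAll[T](sources: Seq[java.lang.Iterable[T]])(implicit ec: ExecutionContext): Future[Seq[List[T]]] = {
    // Step 1: obtain every iterator up front. For a BatchScanner, this is the
    // call that starts the server-side work, so all lookups get kicked off here.
    val iterators = sources.map(_.iterator())
    // Step 2: drain each iterator in its own Future. Each Future blocks a thread
    // while it reads, so the pool should have at least one thread per source.
    val drained = iterators.map(it => Future(it.asScala.toList))
    Future.sequence(drained)
  }

  def main(args: Array[String]): Unit = {
    val pool = Executors.newFixedThreadPool(6) // one thread per (stand-in) scanner
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
    try {
      // Stand-ins for the six BatchScanners (assumption: real code would pass the scanners here)
      val fakeScanners: Seq[java.lang.Iterable[Int]] =
        (0 until 6).map(i => List(i * 10, i * 10 + 1).asJava)
      val results = Await.result(drainAll(fakeScanners), 10.seconds)
      println(s"entries per scanner: ${results.map(_.size)}")
    } finally {
      pool.shutdown()
    }
  }
}
```

With real BatchScanners you would pass the scanners themselves as the sources and close() each one once its iterator has been drained. Whether this actually shortens the total time still depends on how the tablet servers schedule the lookups.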


----- Original Message -----

From: "Sven Hodapp" <[email protected]> 
To: "user" <[email protected]> 
Sent: Thursday, August 25, 2016 4:53:41 AM 
Subject: Re: Accumulo Seek performance 

Hi, 

I've changed the code a little bit, so that it uses a thread pool (via the 
Future): 

// this means 6 BatchScanners will be created
val ranges500 = ranges.asScala.grouped(500)

for (ranges <- ranges500) {
  val bscan = instance.createBatchScanner(ARTIFACTS, auths, 2)
  bscan.setRanges(ranges.asJava)
  Future {
    time("mult-scanner") {
      bscan.asScala.toList // toList forces the iteration of the iterator
    }
  }
}

Here are the results: 

background log: info: mult-scanner time: 4807.289358 ms 
background log: info: mult-scanner time: 4930.996522 ms 
background log: info: mult-scanner time: 9510.010808 ms 
background log: info: mult-scanner time: 11394.152391 ms 
background log: info: mult-scanner time: 13297.247295 ms 
background log: info: mult-scanner time: 14032.704837 ms 

background log: info: single-scanner time: 15322.624393 ms 

Every Future completes independently, but in return every batch scanner iterator
needs more time to complete. :(
Does this mean the batch scanners aren't really processed in parallel on the
server side?
Should I reconfigure something? Maybe the tablet servers haven't allocated, or
can't allocate, enough threads or memory? (Each of the two nodes has 8 cores,
64GB of memory, and storage with ~300MB/s...)

Regards, 
Sven 

-- 
Sven Hodapp, M.Sc., 
Fraunhofer Institute for Algorithms and Scientific Computing SCAI, 
Department of Bioinformatics 
Schloss Birlinghoven, 53754 Sankt Augustin, Germany 
[email protected] 
www.scai.fraunhofer.de 

----- Original Message -----
> From: "Josh Elser" <[email protected]>
> To: "user" <[email protected]>
> Sent: Wednesday, August 24, 2016 18:36:42
> Subject: Re: Accumulo Seek performance

> Ahh duh. Bad advice from me in the first place :) 
> 
> Throw 'em in a threadpool locally. 
> 
> [email protected] wrote: 
>> Doesn't this use the 6 batch scanners serially? 
>> 
>> ------------------------------------------------------------------------ 
>> *From: *"Sven Hodapp" <[email protected]> 
>> *To: *"user" <[email protected]> 
>> *Sent: *Wednesday, August 24, 2016 11:56:14 AM 
>> *Subject: *Re: Accumulo Seek performance 
>> 
>> Hi Josh, 
>> 
>> thanks for your reply! 
>> 
>> I've tested your suggestion with an implementation like this:
>> 
>> // this means 6 BatchScanners will be created
>> val ranges500 = ranges.asScala.grouped(500)
>>
>> time("mult-scanner") {
>>   for (ranges <- ranges500) {
>>     val bscan = instance.createBatchScanner(ARTIFACTS, auths, 1)
>>     bscan.setRanges(ranges.asJava)
>>     for (entry <- bscan.asScala) yield {
>>       entry.getKey()
>>     }
>>   }
>> }
>> 
>> And the result is a bit disappointing: 
>> 
>> background log: info: mult-scanner time: 18064.969281 ms 
>> background log: info: single-scanner time: 6527.482383 ms 
>> 
>> Am I doing something wrong here?
>> 
>> 
>> Regards, 
>> Sven 
>> 
>> -- 
>> Sven Hodapp, M.Sc., 
>> Fraunhofer Institute for Algorithms and Scientific Computing SCAI, 
>> Department of Bioinformatics 
>> Schloss Birlinghoven, 53754 Sankt Augustin, Germany 
>> [email protected] 
>> www.scai.fraunhofer.de 
>> 
>> ----- Original Message -----
>> > From: "Josh Elser" <[email protected]>
>> > To: "user" <[email protected]>
>> > Sent: Wednesday, August 24, 2016 16:33:37
>> > Subject: Re: Accumulo Seek performance
>> 
>> > This reminded me of https://issues.apache.org/jira/browse/ACCUMULO-3710 
>> > 
>> > I don't feel like 3000 ranges is too many, but this isn't quantitative. 
>> > 
>> > IIRC, the BatchScanner will take each Range you provide, bin each Range
>> > to the TabletServer(s) currently hosting the corresponding data, clip
>> > (truncate) each Range to match the Tablet boundaries, and then do an
>> > RPC to each TabletServer with just the Ranges hosted there.
>> > 
>> > Inside the TabletServer, it will then have many Ranges, binned by Tablet
>> > (KeyExtent, to be precise). This will spawn a
>> > org.apache.accumulo.tserver.scan.LookupTask, which will start collecting
>> > results to send back to the client.
>> > 
>> > The caveat here is that those ranges are processed serially on a 
>> > TabletServer. Maybe, you're swamping one TabletServer with lots of 
>> > Ranges that it could be processing in parallel. 
>> > 
>> > Could you experiment with using multiple BatchScanners and something 
>> > like Guava's Iterables.concat to make it appear like one Iterator? 
>> > 
>> > I'm curious if we should put an optimization into the BatchScanner 
>> > itself to limit the number of ranges we send in one RPC to a 
>> > TabletServer (e.g. one BatchScanner might open multiple 
>> > MultiScanSessions to a TabletServer). 
>> > 
>> > Sven Hodapp wrote: 
>> >> Hi there, 
>> >> 
>> >> currently we're experimenting with a two-node Accumulo cluster (two
>> >> tablet servers) set up for document storage.
>> >> These documents are decomposed down to the sentence level.
>> >> 
>> >> Now I'm using a BatchScanner to assemble the full document like this: 
>> >> 
>> >> // ARTIFACTS table currently hosts ~30GB data, ~200M entries on ~45 tablets
>> >> val bscan = instance.createBatchScanner(ARTIFACTS, auths, 10)
>> >> // there are like 3000 Range.exact's in the ranges-list
>> >> bscan.setRanges(ranges)
>> >> for (entry <- bscan.asScala) yield {
>> >>   val key = entry.getKey()
>> >>   val value = entry.getValue()
>> >>   // etc.
>> >> }
>> >> 
>> >> For larger full documents (e.g. 3000 exact ranges), this operation
>> >> will take about 12 seconds.
>> >> But shorter documents are assembled blazing fast...
>> >>
>> >> Is that too much for a BatchScanner / am I misusing the BatchScanner?
>> >> Is that a normal time for such a (seek) operation?
>> >> Can I do something to get better seek performance?
>> >> 
>> >> Note: I have already enabled bloom filtering on that table. 
>> >> 
>> >> Thank you for any advice! 
>> >> 
>> >> Regards, 
>> >> Sven
