Re: BatchScanner taking too much time to scan rows

Keith Turner Tue, 12 May 2015 11:10:11 -0700

Do you know how much data is being brought back (e.g. 100 megabytes)? I am
wondering what the data rate is in MB/s.  Do you know how many files per
tablet you have?  Do most of the 10,000 ids you are querying for exist?
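
To make the data-rate question concrete, here is a minimal sketch of the
arithmetic in plain Java. The class and method names are hypothetical, and the
100 MB figure is only the illustrative number from the question above, not a
measurement; the 50 seconds and 10,000 ids come from the original report.

```java
// Hypothetical helper for back-of-the-envelope scan throughput.
// The inputs in main() are illustrative, not measured values.
final class ScanRate {

    /** Throughput in MB/s, taking 1 MB = 1,000,000 bytes. */
    static double megabytesPerSecond(long bytes, double seconds) {
        if (seconds <= 0) {
            throw new IllegalArgumentException("seconds must be positive");
        }
        return bytes / 1_000_000.0 / seconds;
    }

    /** Row lookups completed per second. */
    static double lookupsPerSecond(long lookups, double seconds) {
        if (seconds <= 0) {
            throw new IllegalArgumentException("seconds must be positive");
        }
        return lookups / seconds;
    }

    public static void main(String[] args) {
        long bytesReturned = 100_000_000L; // assumed: the 100 MB example figure
        long idsQueried = 10_000L;         // from the original report
        double elapsedSeconds = 50.0;      // from the original report

        System.out.println("MB/s:      " + megabytesPerSecond(bytesReturned, elapsedSeconds));
        System.out.println("lookups/s: " + lookupsPerSecond(idsQueried, elapsedSeconds));
    }
}
```

Under those assumed inputs this works out to about 2 MB/s and 200 lookups per
second, which is the kind of number that helps tell a data-volume problem from
a per-lookup latency problem.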


On Tue, May 12, 2015 at 1:58 PM, vaibhav thapliyal <
[email protected]> wrote:

> I have 194 tablets. Currently I am using 20 threads to create the
> BatchScanner inside the createBatchScanner method.
> On 12-May-2015 11:19 pm, "Keith Turner" <[email protected]> wrote:
>
>> How many tablets do you have?  The batch scanner does not parallelize
>> operations within a tablet.
>>
>> If you give the batch scanner more threads than there are tservers, it
>> will make multiple parallel rpc calls to each tserver if the tserver has
>> multiple tablets.  Each rpc may include multiple tablets and ranges for
>> each tablet.
>>
>> If the batch scanner has fewer threads than tservers, it will make one rpc
>> per tserver per thread.  Each rpc call will include all tablets and
>> associated ranges for that tserver.
>>
>> Keith
>>
>>
>>
>> On Tue, May 12, 2015 at 1:39 PM, vaibhav thapliyal <
>> [email protected]> wrote:
>>
>>> Hi,
>>>
>>> I am using BatchScanner to scan rows from an Accumulo table. The table
>>> has around 187 million entries, and I am using a 3-node cluster running
>>> Accumulo 1.6.1.
>>>
>>> I pass 10,000 ids, which are stored as row ids in my table, as a list
>>> to the setRanges() method.
>>>
>>> This whole process takes around 50 seconds (from adding the ids to the
>>> list to scanning the whole table using the BatchScanner).
>>>
>>> I tried switching on bloom filters but that didn't work.
>>>
>>> Also, if anyone could briefly explain how a BatchScanner works and how it
>>> does parallel scanning, it would help me better understand what I am doing.
>>>
>>> Thanks
>>> Vaibhav
>>>
>>>
>>>
>>
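
The threading description quoted above can be read as an upper bound on how
many rpc calls are in flight at once. The following is a toy model in plain
Java, not Accumulo's actual code: the class and method names are hypothetical,
the even 65/65/64 split of the 194 tablets across the 3 nodes is an assumption,
and the case of exactly as many threads as tservers is treated like the
"fewer threads" case, which the description does not spell out.

```java
import java.util.Arrays;

// Toy model of the quoted description; NOT the Accumulo implementation.
final class BatchScanModel {

    /**
     * Upper bound on concurrent rpc calls.
     * - threads <= tservers: one rpc per tserver per thread, so at most
     *   `threads` calls run at once.
     * - threads > tservers: several rpcs may go to one tserver, but work is
     *   never parallelized within a tablet, so the bound is
     *   min(threads, total tablets).
     */
    static int maxConcurrentRpcs(int threads, int[] tabletsPerServer) {
        if (threads < 1) {
            throw new IllegalArgumentException("threads must be >= 1");
        }
        // Only tservers that actually host tablets for this scan count.
        int servers = (int) Arrays.stream(tabletsPerServer).filter(t -> t > 0).count();
        int totalTablets = Arrays.stream(tabletsPerServer).sum();
        if (threads <= servers) {
            return threads;
        }
        return Math.min(threads, totalTablets);
    }

    public static void main(String[] args) {
        int[] tablets = {65, 65, 64}; // assumed even split of 194 tablets
        System.out.println(maxConcurrentRpcs(20, tablets)); // 20 threads, 3 tservers
        System.out.println(maxConcurrentRpcs(2, tablets));  // fewer threads than tservers
    }
}
```

With 20 threads and 194 tablets the bound is the thread count (20), so under
this reading the thread setting rather than the tablet count is what caps the
parallelism for this table.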