Re: BatchScanner taking too much time to scan rows

Eric Newton Wed, 13 May 2015 12:14:24 -0700

Yes, hot-spotting does affect Accumulo because you have fewer servers and
caches handling your requests.


Let's say your data is spread out evenly, in a uniform distribution from "0".."9".

What if you have only 1 split?  You would want it at "5", to divide the
data in half, and you could host the halves on different servers.  But if
you split at "1", now 10% of your queries go to one tablet, and 90% go to the
other.
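
As a rough illustration of that arithmetic (a toy sketch, not Accumulo code;
the class and method names are made up for this example), assume row keys are
uniformly distributed over the interval [0, 10), standing in for "0".."9".
The fraction of queries each tablet receives is then just the width of its
key range divided by 10:

```java
import java.util.Arrays;

/** Toy model: keys uniformly distributed over [0, 10). Hypothetical helper, not part of Accumulo. */
final class SplitLoad {
    /**
     * Returns the fraction of uniformly distributed keys that fall into each
     * tablet, given split points expressed as positions in [0, 10).
     * N split points produce N + 1 tablets.
     */
    static double[] tabletFractions(double... splits) {
        double[] sorted = splits.clone();
        Arrays.sort(sorted);
        double[] fractions = new double[sorted.length + 1];
        double previous = 0.0;
        for (int i = 0; i < sorted.length; i++) {
            fractions[i] = (sorted[i] - previous) / 10.0;
            previous = sorted[i];
        }
        // The last tablet covers everything after the final split point.
        fractions[sorted.length] = (10.0 - previous) / 10.0;
        return fractions;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tabletFractions(5.0))); // split at "5"
        System.out.println(Arrays.toString(tabletFractions(1.0))); // split at "1"
    }
}
```

With a split at 5.0 the two tablets each get half of the keys; with a split
at 1.0 one tablet gets a tenth and the other gets the rest.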

-Eric

On Wed, May 13, 2015 at 1:56 PM, vaibhav thapliyal <
[email protected]> wrote:

> Thank you Eric. I will surely do the same. Should uneven distribution
> across the tablets affect querying in Accumulo?  In this case, it does. Is
> this behaviour normal?
> On 13-May-2015 10:58 pm, "Eric Newton" <[email protected]> wrote:
>
>> Yes, that's a great way to split the data evenly.
>>
>> Also, since the data set is so small, turn on data caching for your table:
>>
>> shell> config -t mytable -s table.cache.block.enable=true
>>
>> You may want to increase the size of your tserver JVM, and increase the
>> size of the cache:
>>
>> shell> config -s tserver.cache.data.size=1G
>>
>> This will help with repeated random look-ups.
>>
>> -Eric
>>
>> On Wed, May 13, 2015 at 11:31 AM, vaibhav thapliyal <
>> [email protected]> wrote:
>>
>>> Thank you Eric.
>>>
>>> One thing I would like to know. Does pre-splitting the data play a part
>>> in querying accumulo?
>>>
>>> Because I managed to somewhat decrease the querying time.
>>> I did the following steps:
>>> My table was around 1.47 GB, so I explicitly set the split parameter to
>>> 256 MB instead of the default 1 GB.
>>>
>>> So I had just 8 tablets. Now when I carried out the same query, it
>>> finished in 15s.
>>>
>>> Is it because the split points are more evenly distributed?
>>>
>>> The previous table on which the query took 50s had entries unevenly
>>> distributed across the tablets.
>>> Thanks
>>> Vaibhav
>>> On 13-May-2015 7:43 pm, "Eric Newton" <[email protected]> wrote:
>>>
>>>> This use case is one of the things Accumulo was designed to handle
>>>> well. It's the reason there is a BatchScanner.
>>>>
>>>> I've created:
>>>>
>>>> https://issues.apache.org/jira/browse/ACCUMULO-3813
>>>>
>>>> so we can investigate and track down any problems or improvements.
>>>>
>>>> Feel free to add any other details to the JIRA ticket.
>>>>
>>>> -Eric
>>>>
>>>>
>>>> On Wed, May 13, 2015 at 10:03 AM, Emilio Lahr-Vivaz <
>>>> [email protected]> wrote:
>>>>
>>>>> It sounds like each of your ranges is an ID, e.g. a single row. I've
>>>>> found that scanning lots of non-sequential single-row ranges is pretty
>>>>> slow in Accumulo. Your best approach is probably to create an index table
>>>>> on whatever you are originally trying to query (assuming those 10000 ids
>>>>> came from some other query).
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Emilio
>>>>>
>>>>>
>>>>> On 05/13/2015 09:14 AM, vaibhav thapliyal wrote:
>>>>>
>>>>> The number of rf files per tablet varies between 2 and 5. The number of
>>>>> entries returned to me by the BatchScanner is 460000. The approximate
>>>>> average data rate is 0.5 MB/s as seen on the Accumulo monitor page.
>>>>>
>>>>>  A simple scan on the table has an average data rate of about 7-8
>>>>> MB/s.
>>>>>
>>>>>  All the ids exist in the accumulo table.
>>>>>
>>>>> On 12 May 2015 at 23:39, Keith Turner <[email protected]> wrote:
>>>>>
>>>>>> Do you know how much data is being brought back (e.g. 100 megabytes)?
>>>>>> I am wondering what the data rate is in MB/s.  Do you know how many files
>>>>>> per tablet you have?  Do most of the 10,000 ids you are querying for
>>>>>> exist?
>>>>>>
>>>>>> On Tue, May 12, 2015 at 1:58 PM, vaibhav thapliyal <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> I have 194 tablets. Currently I am using 20 threads to create the
>>>>>>> batchscanner inside the createBatchScanner method.
>>>>>>>  On 12-May-2015 11:19 pm, "Keith Turner" <[email protected]> wrote:
>>>>>>>
>>>>>>>>   How many tablets do you have?  The batch scanner does not
>>>>>>>> parallelize operations within a tablet.
>>>>>>>>
>>>>>>>> If you give the batch scanner more threads than there are
>>>>>>>> tservers, it will make multiple parallel rpc calls to each tserver if
>>>>>>>> the tserver has multiple tablets.  Each rpc may include multiple tablets
>>>>>>>> and ranges for each tablet.
>>>>>>>>
>>>>>>>> If the batch scanner has fewer threads than tservers, it will make
>>>>>>>> one rpc per tserver per thread.  Each rpc call will include all tablets
>>>>>>>> and associated ranges for that tserver.
>>>>>>>>
>>>>>>>>  Keith
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, May 12, 2015 at 1:39 PM, vaibhav thapliyal <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I am using BatchScanner to scan rows from an Accumulo table. The
>>>>>>>>> table has around 187m entries and I am using a 3 node cluster which
>>>>>>>>> has Accumulo 1.6.1.
>>>>>>>>>
>>>>>>>>> I have passed 10000 ids, which are stored as row ids in my table,
>>>>>>>>> as a list to the setRanges() method.
>>>>>>>>>
>>>>>>>>> This whole process takes around 50 secs (from adding the ids to
>>>>>>>>> the list to scanning the whole table using the BatchScanner).
>>>>>>>>>
>>>>>>>>> I tried switching on bloom filters but that didn't work.
>>>>>>>>>
>>>>>>>>> Also, if anyone could briefly explain how a BatchScanner works and
>>>>>>>>> how it does parallel scanning, it would help me understand what I am
>>>>>>>>> doing better.
>>>>>>>>>
>>>>>>>>>  Thanks
>>>>>>>>>  Vaibhav

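Keith's point earlier in the thread, that the batch scanner does not
parallelize operations within a tablet, can be illustrated with a small
model of how single-row lookups are binned to tablets. This is only a sketch
under stated assumptions: the class and method names are hypothetical, it is
not Accumulo's client code, and it relies only on the fact that a tablet
holds the rows up to and including its end row (its split point). The tablet
that collects the most lookups bounds how far adding threads can help.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy model of binning row lookups to tablets. Hypothetical helper, not Accumulo code. */
final class RangeBinning {
    /** Label used for the last tablet, which has no end row. */
    static final String LAST_TABLET = "+inf";

    /**
     * Groups row ids by the tablet that would serve them. Each tablet is
     * identified by its end row; a row belongs to the tablet with the
     * smallest end row that is greater than or equal to the row.
     */
    static Map<String, List<String>> binByTablet(TreeSet<String> splits, List<String> rowIds) {
        Map<String, List<String>> bins = new TreeMap<>();
        for (String row : rowIds) {
            String endRow = splits.ceiling(row);
            String tablet = (endRow == null) ? LAST_TABLET : endRow;
            bins.computeIfAbsent(tablet, k -> new ArrayList<>()).add(row);
        }
        return bins;
    }

    /** Size of the busiest bin: the lookups that cannot be spread across threads. */
    static int largestBin(Map<String, List<String>> bins) {
        int max = 0;
        for (List<String> rows : bins.values()) {
            max = Math.max(max, rows.size());
        }
        return max;
    }
}
```

For example, with a single split at "1", the row ids "0a", "2b", "3c" and
"7d" produce one bin of size 1 and one of size 3; with a single split at "2z"
the same ids produce two bins of size 2.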