Re: BatchScanner taking too much time to scan rows

Dylan Hutchison Thu, 14 May 2015 11:17:24 -0700

I didn't have an average query time-- the tablet server crashed.  A quick
solution is to batch the ranges into groups of 50k (or 500k, I forgot which
one) and do many BatchScans-- not ideal.  I think I achieved 33k
entries/second retrieval on a single-node Accumulo.  Accumulo is better for
sequential lookup than random.
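
A minimal sketch of the batching workaround described above, assuming the row
ids are held in a plain Java list. The class name RangeBatcher, the helper
partition(), and the 50,000 chunk size are illustrative choices, not part of
Accumulo. The scan itself is left as a comment, since it depends on the
client setup.

```java
import java.util.ArrayList;
import java.util.List;

public class RangeBatcher {
    // Splits items into consecutive chunks of at most chunkSize elements.
    // Hypothetical helper for illustration; not an Accumulo API.
    static <T> List<List<T>> partition(List<T> items, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int start = 0; start < items.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, items.size());
            // Copy the sublist so each chunk is independent of the input list.
            chunks.add(new ArrayList<>(items.subList(start, end)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> rowIds = new ArrayList<>();
        for (int i = 0; i < 120_000; i++) {
            rowIds.add("row" + i);
        }
        for (List<String> chunk : partition(rowIds, 50_000)) {
            // In a real client, each chunk would get its own BatchScanner:
            // turn the ids into ranges, call setRanges(...), read the
            // results, then close the scanner before starting the next chunk.
            System.out.println("would scan " + chunk.size() + " ranges");
        }
    }
}
```

Running the chunks one after another keeps the number of ranges any single
BatchScanner hands to the tablet servers bounded, at the cost of extra
round trips.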


On Thu, May 14, 2015 at 1:57 PM, vaibhav thapliyal <
[email protected]> wrote:

> Dylan could you elaborate on the average query time you had?
> Thanks
> Vaibhav
> On 14-May-2015 11:03 pm, "Dylan Hutchison" <[email protected]> wrote:
>
>> I think this is the same issue I found for ACCUMULO-3710
>> <https://issues.apache.org/jira/browse/ACCUMULO-3710>, only in my case
>> the tserver ran out of memory.  Accumulo doesn't handle large numbers of
>> small, disjoint ranges well.  I bet there's room for improvement on both
>> the client and tablet server.
>> ~Dylan
>>
>> On Wed, May 13, 2015 at 3:13 PM, Eric Newton <[email protected]>
>> wrote:
>>
>>> Yes, hot-spotting does affect accumulo because you have fewer servers
>>> and caches handling your request.
>>>
>>> Let's say your data is spread out evenly across the row range
>>> "0".."9".
>>>
>>> What if you have only 1 split?  You would want it at "5", to divide the
>>> data in half, and you could host the halves on different servers.  But if
>>> you split at "1", now 10% of your queries go to one tablet, and 90% go to
>>> the other.
>>>
>>> -Eric
>>>
>>>
>>> On Wed, May 13, 2015 at 1:56 PM, vaibhav thapliyal <
>>> [email protected]> wrote:
>>>
>>>> Thank you Eric. I will surely do the same. Should uneven distribution
>>>> across the tablets affect querying in accumulo?  In this case, it does.
>>>> Is this behaviour normal?
>>>> On 13-May-2015 10:58 pm, "Eric Newton" <[email protected]> wrote:
>>>>
>>>>> Yes, that's a great way to split the data evenly.
>>>>>
>>>>> Also, since the data set is so small, turn on data caching for your
>>>>> table:
>>>>>
>>>>> shell> config -t mytable -s table.cache.block.enable=true
>>>>>
>>>>> You may want to increase the size of your tserver JVM, and increase
>>>>> the size of the cache:
>>>>>
>>>>> shell> config -s tserver.cache.data.size=1G
>>>>>
>>>>> This will help with repeated random look-ups.
>>>>>
>>>>> -Eric
>>>>>
>>>>> On Wed, May 13, 2015 at 11:31 AM, vaibhav thapliyal <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Thank you Eric.
>>>>>>
>>>>>> One thing I would like to know. Does pre-splitting the data play a
>>>>>> part in querying accumulo?
>>>>>>
>>>>>> Because I managed to somewhat decrease the querying time.
>>>>>> I did the following steps:
>>>>>> My table was around 1.47 GB, so I explicitly set the split threshold
>>>>>> to 256 MB instead of the default 1 GB.
>>>>>>
>>>>>> So I had just 8 tablets. Now when I carried out the same query, it
>>>>>> finished in 15s.
>>>>>>
>>>>>> Is it because the split points are more evenly distributed?
>>>>>>
>>>>>> The previous table on which the query took 50s had entries unevenly
>>>>>> distributed across the tablets.
>>>>>> Thanks
>>>>>> Vaibhav
>>>>>> On 13-May-2015 7:43 pm, "Eric Newton" <[email protected]> wrote:
>>>>>>
>>>>>>> This use case is one of the things Accumulo was designed to handle
>>>>>>> well. It's the reason there is a BatchScanner.
>>>>>>>
>>>>>>> I've created:
>>>>>>>
>>>>>>> https://issues.apache.org/jira/browse/ACCUMULO-3813
>>>>>>>
>>>>>>> so we can investigate and track down any problems or improvements.
>>>>>>>
>>>>>>> Feel free to add any other details to the JIRA ticket.
>>>>>>>
>>>>>>> -Eric
>>>>>>>
>>>>>>>
>>>>>>> On Wed, May 13, 2015 at 10:03 AM, Emilio Lahr-Vivaz <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>>  It sounds like each of your ranges is an ID, e.g. a single row.
>>>>>>>> I've found that scanning lots of non-sequential single-row ranges is
>>>>>>>> pretty slow in accumulo. Your best approach is probably to create an
>>>>>>>> index table on whatever you are originally trying to query (assuming
>>>>>>>> those 10000 ids came from some other query).
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Emilio
>>>>>>>>
>>>>>>>>
>>>>>>>> On 05/13/2015 09:14 AM, vaibhav thapliyal wrote:
>>>>>>>>
>>>>>>>>  The number of rf files per tablet varies between 2 and 5. The
>>>>>>>> number of entries returned to me by the batchScanner is 460000. The
>>>>>>>> approx. average data rate is 0.5 MB/s as seen on the accumulo monitor
>>>>>>>> page.
>>>>>>>>
>>>>>>>>  A simple scan on the table has an average data rate of about 7-8
>>>>>>>> MB/s.
>>>>>>>>
>>>>>>>>  All the ids exist in the accumulo table.
>>>>>>>>
>>>>>>>> On 12 May 2015 at 23:39, Keith Turner <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Do you know how much data is being brought back (i.e. 100
>>>>>>>>> megabytes)? I am wondering what the data rate is in MB/s.  Do you
>>>>>>>>> know how many files per tablet you have?  Do most of the 10,000 ids
>>>>>>>>> you are querying for exist?
>>>>>>>>>
>>>>>>>>> On Tue, May 12, 2015 at 1:58 PM, vaibhav thapliyal <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> I have 194 tablets. Currently I am creating the batchscanner with
>>>>>>>>>> 20 threads in the createBatchScanner method.
>>>>>>>>>>  On 12-May-2015 11:19 pm, "Keith Turner" <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>>   How many tablets do you have?  The batch scanner does not
>>>>>>>>>>> parallelize operations within a tablet.
>>>>>>>>>>>
>>>>>>>>>>>  If you give the batch scanner more threads than there are
>>>>>>>>>>> tservers, it will make multiple parallel rpc calls to each tserver
>>>>>>>>>>> if the tserver has multiple tablets.  Each rpc may include multiple
>>>>>>>>>>> tablets and ranges for each tablet.
>>>>>>>>>>>
>>>>>>>>>>>  If the batch scanner has fewer threads than tservers, it will
>>>>>>>>>>> make one rpc per tserver per thread.  Each rpc call will include all
>>>>>>>>>>> tablets and associated ranges for that tserver.
>>>>>>>>>>>
>>>>>>>>>>>  Keith
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, May 12, 2015 at 1:39 PM, vaibhav thapliyal <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>>  I am using BatchScanner to scan rows from an accumulo table.
>>>>>>>>>>>> The table has around 187m entries and I am using a 3 node cluster
>>>>>>>>>>>> which has accumulo 1.6.1.
>>>>>>>>>>>>
>>>>>>>>>>>>  I have passed 10000 ids which are stored as row id in my
>>>>>>>>>>>> table as a list in the setRanges() method.
>>>>>>>>>>>>
>>>>>>>>>>>>  This whole process takes around 50 secs (from adding the ids
>>>>>>>>>>>> to the list to scanning the whole table using the BatchScanner).
>>>>>>>>>>>>
>>>>>>>>>>>>  I tried switching on bloom filters but that didn't work.
>>>>>>>>>>>>
>>>>>>>>>>>>  Also, if anyone could briefly explain how a BatchScanner
>>>>>>>>>>>> works and how it does parallel scanning, it would help me better
>>>>>>>>>>>> understand what I am doing.
>>>>>>>>>>>>
>>>>>>>>>>>>  Thanks
>>>>>>>>>>>>  Vaibhav
