I didn't have an average query time; the tablet server crashed before the query finished. A quick workaround is to batch the ranges into groups of 50k (or 500k, I forget which) and run many BatchScans, though that's not ideal. I think I achieved about 33k entries/second retrieval on a single-node Accumulo. Accumulo is better at sequential lookups than random ones.
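
A minimal sketch of that batching workaround, assuming the row ids are already in a list. The class and method names (RangeBatcher, partition) are hypothetical, and the Accumulo calls appear only in comments so the snippet runs without the client library; the 50,000 batch size is just the figure mentioned above, not a tuned value.

```java
import java.util.ArrayList;
import java.util.List;

public class RangeBatcher {

    /** Splits items into consecutive chunks of at most batchSize elements. */
    public static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        int start = 0;
        while (start < items.size()) {
            // Take batchSize elements, or whatever is left for the last chunk.
            int end = start + Math.min(batchSize, items.size() - start);
            // Copy the subList view so each batch is independent of the source.
            batches.add(new ArrayList<>(items.subList(start, end)));
            start = end;
        }
        return batches;
    }

    public static void main(String[] args) {
        // Hypothetical row ids standing in for the real lookup keys.
        List<String> rowIds = new ArrayList<>();
        for (int i = 0; i < 120_000; i++) {
            rowIds.add("row" + i);
        }

        for (List<String> batch : partition(rowIds, 50_000)) {
            // With the Accumulo client on the classpath, each batch would be
            // converted to Range objects and handed to BatchScanner.setRanges(...)
            // before iterating over the scanner and closing it.
            System.out.println("would scan " + batch.size() + " ranges");
        }
    }
}
```

Running one BatchScanner per chunk keeps the number of ranges sent to a tserver in any single request bounded, at the cost of more round trips.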

On Thu, May 14, 2015 at 1:57 PM, vaibhav thapliyal <[email protected]> wrote:

> Dylan could you elaborate on the average query time you had?
> Thanks
> Vaibhav
> On 14-May-2015 11:03 pm, "Dylan Hutchison" <[email protected]> wrote:
>
>> I think this is the same issue I found for ACCUMULO-3710
>> <https://issues.apache.org/jira/browse/ACCUMULO-3710>, only in my case
>> the tserver ran out of memory. Accumulo doesn't handle large numbers of
>> small, disjoint ranges well. I bet there's room for improvement on both
>> the client and tablet server.
>> ~Dylan
>>
>> On Wed, May 13, 2015 at 3:13 PM, Eric Newton <[email protected]> wrote:
>>
>>> Yes, hot-spotting does affect accumulo because you have fewer servers
>>> and caches handling your request.
>>>
>>> Let's say your data is spread out, in a normal distribution from
>>> "0".."9".
>>>
>>> What if you have only 1 split? You would want it at "5", to divide the
>>> data in half, and you could host the halves on different servers. But if
>>> you split at 1, now 10% of your queries go to one tablet, and 90% go to
>>> the other.
>>>
>>> -Eric
>>>
>>> On Wed, May 13, 2015 at 1:56 PM, vaibhav thapliyal <[email protected]> wrote:
>>>
>>>> Thank you Eric. I will surely do the same. Should uneven distribution
>>>> across the tablets affect querying in accumulo? In this case, it does.
>>>> Is this behaviour normal?
>>>> On 13-May-2015 10:58 pm, "Eric Newton" <[email protected]> wrote:
>>>>
>>>>> Yes, that's a great way to split the data evenly.
>>>>>
>>>>> Also, since the data set is so small, turn on data caching for your
>>>>> table:
>>>>>
>>>>> shell> config -t mytable -s table.cache.block.enable=true
>>>>>
>>>>> You may want to increase the size of your tserver JVM, and increase
>>>>> the size of the cache:
>>>>>
>>>>> shell> config -s tserver.cache.data.size=1G
>>>>>
>>>>> This will help with repeated random look-ups.
>>>>>
>>>>> -Eric
>>>>>
>>>>> On Wed, May 13, 2015 at 11:31 AM, vaibhav thapliyal <[email protected]> wrote:
>>>>>
>>>>>> Thank you Eric.
>>>>>>
>>>>>> One thing I would like to know. Does pre-splitting the data play a
>>>>>> part in querying accumulo?
>>>>>>
>>>>>> Because I managed to somewhat decrease the querying time.
>>>>>> I did the following steps:
>>>>>> My table was around 1.47gb so I explicitly set the split parameter to
>>>>>> 256mb instead of the default 1gb.
>>>>>>
>>>>>> So I had just 8 tablets. Now when I carried out the same query, it
>>>>>> finished in 15s.
>>>>>>
>>>>>> Is it because the split points are more evenly distributed?
>>>>>>
>>>>>> The previous table on which the query took 50s had entries unevenly
>>>>>> distributed across the tablets.
>>>>>> Thanks
>>>>>> Vaibhav
>>>>>> On 13-May-2015 7:43 pm, "Eric Newton" <[email protected]> wrote:
>>>>>>
>>>>>>> This use case is one of the things Accumulo was designed to handle
>>>>>>> well. It's the reason there is a BatchScanner.
>>>>>>>
>>>>>>> I've created:
>>>>>>>
>>>>>>> https://issues.apache.org/jira/browse/ACCUMULO-3813
>>>>>>>
>>>>>>> so we can investigate and track down any problems or improvements.
>>>>>>>
>>>>>>> Feel free to add any other details to the JIRA ticket.
>>>>>>>
>>>>>>> -Eric
>>>>>>>
>>>>>>> On Wed, May 13, 2015 at 10:03 AM, Emilio Lahr-Vivaz <[email protected]> wrote:
>>>>>>>
>>>>>>>> It sounds like each of your ranges is an ID, e.g. a single row.
>>>>>>>> I've found that scanning lots of non-sequential single-row ranges
>>>>>>>> is pretty slow in accumulo. Your best approach is probably to
>>>>>>>> create an index table on whatever you are originally trying to
>>>>>>>> query (assuming those 10000 ids came from some other query).
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Emilio
>>>>>>>>
>>>>>>>> On 05/13/2015 09:14 AM, vaibhav thapliyal wrote:
>>>>>>>>
>>>>>>>> The number of rf files per tablet varies between 2 and 5. The
>>>>>>>> number of entries returned to me by the BatchScanner is 460000.
>>>>>>>> The approx. average data rate is 0.5 MB/s as seen on the accumulo
>>>>>>>> monitor page.
>>>>>>>>
>>>>>>>> A simple scan on the table has an average data rate of about 7-8
>>>>>>>> MB/s.
>>>>>>>>
>>>>>>>> All the ids exist in the accumulo table.
>>>>>>>>
>>>>>>>> On 12 May 2015 at 23:39, Keith Turner <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Do you know how much data is being brought back (e.g. 100
>>>>>>>>> megabytes)? I am wondering what the data rate is in MB/s. Do you
>>>>>>>>> know how many files per tablet you have? Do most of the 10,000
>>>>>>>>> ids you are querying for exist?
>>>>>>>>>
>>>>>>>>> On Tue, May 12, 2015 at 1:58 PM, vaibhav thapliyal <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> I have 194 tablets. Currently I am using 20 threads to create the
>>>>>>>>>> batchscanner inside the createBatchScanner method.
>>>>>>>>>> On 12-May-2015 11:19 pm, "Keith Turner" <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> How many tablets do you have? The batch scanner does not
>>>>>>>>>>> parallelize operations within a tablet.
>>>>>>>>>>>
>>>>>>>>>>> If you give the batch scanner more threads than there are
>>>>>>>>>>> tservers, it will make multiple parallel rpc calls to each
>>>>>>>>>>> tserver if the tserver has multiple tablets. Each rpc may
>>>>>>>>>>> include multiple tablets and ranges for each tablet.
>>>>>>>>>>>
>>>>>>>>>>> If the batch scanner has fewer threads than tservers, it will
>>>>>>>>>>> make one rpc per tserver per thread. Each rpc call will include
>>>>>>>>>>> all tablets and associated ranges for that tserver.
>>>>>>>>>>>
>>>>>>>>>>> Keith
>>>>>>>>>>>
>>>>>>>>>>> On Tue, May 12, 2015 at 1:39 PM, vaibhav thapliyal <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> I am using BatchScanner to scan rows from an accumulo table.
>>>>>>>>>>>> The table has around 187m entries and I am using a 3 node
>>>>>>>>>>>> cluster which has accumulo 1.6.1.
>>>>>>>>>>>>
>>>>>>>>>>>> I have passed 10000 ids which are stored as row id in my
>>>>>>>>>>>> table as a list in the setRanges() method.
>>>>>>>>>>>>
>>>>>>>>>>>> This whole process takes around 50 secs (from adding the ids
>>>>>>>>>>>> in the list to scanning the whole table using the BatchScanner).
>>>>>>>>>>>>
>>>>>>>>>>>> I tried switching on bloom filters but that didn't work.
>>>>>>>>>>>>
>>>>>>>>>>>> Also if anyone could briefly explain how a BatchScanner
>>>>>>>>>>>> works, how it does parallel scanning it would help me
>>>>>>>>>>>> understand what I am doing better.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks
>>>>>>>>>>>> Vaibhav
