Yes, that's a great way to split the data evenly. Also, since the data set is so small, turn on data caching for your table:
shell> config -t mytable -s table.cache.block.enable=true

You may want to increase the size of your tserver JVM, and increase the size of the cache:

shell> config -s tserver.cache.data.size=1G

This will help with repeated random look-ups.

-Eric

On Wed, May 13, 2015 at 11:31 AM, vaibhav thapliyal <[email protected]> wrote:

> Thank you Eric.
>
> One thing I would like to know. Does pre-splitting the data play a part in
> querying accumulo?
>
> Because I managed to somewhat decrease the querying time.
> I did the following steps:
> My table was around 1.47gb so I explicitly set the split parameter to 256mb
> instead of the default 1gb.
>
> So I had just 8 tablets. Now when I carried out the same query, it
> finished in 15s.
>
> Is it because the split points are more evenly distributed?
>
> The previous table on which the query took 50s had entries unevenly
> distributed across the tablets.
> Thanks
> Vaibhav
> On 13-May-2015 7:43 pm, "Eric Newton" <[email protected]> wrote:
>
>> This use case is one of the things Accumulo was designed to handle well.
>> It's the reason there is a BatchScanner.
>>
>> I've created:
>>
>> https://issues.apache.org/jira/browse/ACCUMULO-3813
>>
>> so we can investigate and track down any problems or improvements.
>>
>> Feel free to add any other details to the JIRA ticket.
>>
>> -Eric
>>
>>
>> On Wed, May 13, 2015 at 10:03 AM, Emilio Lahr-Vivaz <[email protected]>
>> wrote:
>>
>>> It sounds like each of your ranges is an ID, e.g. a single row. I've
>>> found that scanning lots of non-sequential single-row ranges is pretty slow
>>> in accumulo. Your best approach is probably to create an index table on
>>> whatever you are originally trying to query (assuming those 10000 ids came
>>> from some other query).
>>>
>>> Thanks,
>>>
>>> Emilio
>>>
>>>
>>> On 05/13/2015 09:14 AM, vaibhav thapliyal wrote:
>>>
>>> The number of rf files per tablet varies between 2 and 5. The number of
>>> entries returned to me by the batchScanner is 460000. The approx.
>>> average data rate is 0.5 MB/s as seen on the accumulo monitor page.
>>>
>>> A simple scan on the table has an average data rate of about 7-8 MB/s.
>>>
>>> All the ids exist in the accumulo table.
>>>
>>> On 12 May 2015 at 23:39, Keith Turner <[email protected]> wrote:
>>>
>>>> Do you know how much data is being brought back (i.e. 100 megabytes)? I
>>>> am wondering what the data rate is in MB/s. Do you know how many files per
>>>> tablet you have? Do most of the 10,000 ids you are querying for exist?
>>>>
>>>> On Tue, May 12, 2015 at 1:58 PM, vaibhav thapliyal <
>>>> [email protected]> wrote:
>>>>
>>>>> I have 194 tablets. Currently I am using 20 threads to create the
>>>>> batchscanner inside the createBatchScanner method.
>>>>> On 12-May-2015 11:19 pm, "Keith Turner" <[email protected]> wrote:
>>>>>
>>>>>> How many tablets do you have? The batch scanner does not
>>>>>> parallelize operations within a tablet.
>>>>>>
>>>>>> If you give the batch scanner more threads than there are tservers,
>>>>>> it will make multiple parallel rpc calls to each tserver if the tserver
>>>>>> has multiple tablets. Each rpc may include multiple tablets and ranges
>>>>>> for each tablet.
>>>>>>
>>>>>> If the batch scanner has fewer threads than tservers, it will make
>>>>>> one rpc per tserver per thread. Each rpc call will include all tablets
>>>>>> and associated ranges for that tserver.
>>>>>>
>>>>>> Keith
>>>>>>
>>>>>>
>>>>>> On Tue, May 12, 2015 at 1:39 PM, vaibhav thapliyal <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am using BatchScanner to scan rows from an accumulo table. The
>>>>>>> table has around 187m entries and I am using a 3 node cluster which has
>>>>>>> accumulo 1.6.1.
>>>>>>>
>>>>>>> I have passed 10000 ids which are stored as row id in my table as
>>>>>>> a list in the setRanges() method.
>>>>>>>
>>>>>>> This whole process takes around 50 secs (from adding the ids in the
>>>>>>> list to scanning the whole table using the BatchScanner).
>>>>>>>
>>>>>>> I tried switching on bloom filters but that didn't work.
>>>>>>>
>>>>>>> Also if anyone could briefly explain how a BatchScanner works, how
>>>>>>> it does parallel scanning it would help me understand what I am doing
>>>>>>> better.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Vaibhav
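
As a rough answer to the "how much data is being brought back" question, the figures reported in this thread (0.5 MB/s, about 50 s, 460000 entries) can be turned into a back-of-the-envelope estimate. The sketch below is only that: the class and method names (ScanEstimate, totalMegabytes, bytesPerEntry, secondsAtRate) are hypothetical, 1 MB is taken as 1,000,000 bytes, and it assumes the monitor's average rate held for the full 50 s, which the thread does not confirm (the 50 s also includes building the ID list).

```java
public class ScanEstimate {

    /** Data volume in MB, assuming a constant average rate over the whole duration. */
    public static double totalMegabytes(double rateMbPerSec, double seconds) {
        return rateMbPerSec * seconds;
    }

    /** Average bytes per entry, taking 1 MB = 1,000,000 bytes (an assumption). */
    public static double bytesPerEntry(double megabytes, long entries) {
        return megabytes * 1_000_000.0 / entries;
    }

    /** Seconds needed to move the same volume at a different rate. */
    public static double secondsAtRate(double megabytes, double rateMbPerSec) {
        return megabytes / rateMbPerSec;
    }

    public static void main(String[] args) {
        // Figures as reported in the thread; the constant-rate assumption is ours.
        double mb = totalMegabytes(0.5, 50.0);
        System.out.printf("returned: ~%.0f MB%n", mb);
        System.out.printf("per entry: ~%.0f bytes%n", bytesPerEntry(mb, 460_000L));
        System.out.printf("same volume at 7 MB/s: ~%.1f s%n", secondsAtRate(mb, 7.0));
    }
}
```

Under those assumptions the batch lookup returned roughly 25 MB (around 54 bytes per entry), a volume the plain scan at 7 MB/s would move in under 4 s. That suggests the time goes to per-range seek cost rather than raw throughput, which is consistent with the advice above about block caching and single-row ranges.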
