Yes, hot-spotting does affect Accumulo, because you have fewer servers and caches handling your requests.
Let's say your data is spread out evenly (a uniform distribution) from "0".."9". What if you have only 1 split? You would want it at "5", to divide the data in half, and you could host the halves on different servers. But if you split at "1", now 10% of your queries go to one tablet, and 90% go to the other.

-Eric

On Wed, May 13, 2015 at 1:56 PM, vaibhav thapliyal <[email protected]> wrote:

> Thank you Eric. I will surely do the same. Should uneven distribution
> across the tablets affect querying in Accumulo? In this case, it does. Is
> this behaviour normal?
> On 13-May-2015 10:58 pm, "Eric Newton" <[email protected]> wrote:
>
>> Yes, that's a great way to split the data evenly.
>>
>> Also, since the data set is so small, turn on data caching for your table:
>>
>> shell> config -t mytable -s table.cache.block.enable=true
>>
>> You may want to increase the size of your tserver JVM, and increase the
>> size of the cache:
>>
>> shell> config -s tserver.cache.data.size=1G
>>
>> This will help with repeated random look-ups.
>>
>> -Eric
>>
>> On Wed, May 13, 2015 at 11:31 AM, vaibhav thapliyal <
>> [email protected]> wrote:
>>
>>> Thank you Eric.
>>>
>>> One thing I would like to know. Does pre-splitting the data play a part
>>> in querying Accumulo?
>>>
>>> Because I managed to somewhat decrease the querying time.
>>> I did the following steps:
>>> My table was around 1.47gb so I explicitly set the split parameter to
>>> 256mb instead of the default 1gb.
>>>
>>> So I had just 8 tablets. Now when I carried out the same query, it
>>> finished in 15s.
>>>
>>> Is it because the split points are more evenly distributed?
>>>
>>> The previous table on which the query took 50s had entries unevenly
>>> distributed across the tablets.
>>> Thanks
>>> Vaibhav
>>> On 13-May-2015 7:43 pm, "Eric Newton" <[email protected]> wrote:
>>>
>>>> This use case is one of the things Accumulo was designed to handle
>>>> well. It's the reason there is a BatchScanner.
>>>>
>>>> I've created:
>>>>
>>>> https://issues.apache.org/jira/browse/ACCUMULO-3813
>>>>
>>>> so we can investigate and track down any problems or improvements.
>>>>
>>>> Feel free to add any other details to the JIRA ticket.
>>>>
>>>> -Eric
>>>>
>>>> On Wed, May 13, 2015 at 10:03 AM, Emilio Lahr-Vivaz <
>>>> [email protected]> wrote:
>>>>
>>>>> It sounds like each of your ranges is an ID, i.e. a single row. I've
>>>>> found that scanning lots of non-sequential single-row ranges is pretty
>>>>> slow in Accumulo. Your best approach is probably to create an index
>>>>> table on whatever you are originally trying to query (assuming those
>>>>> 10000 ids came from some other query).
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Emilio
>>>>>
>>>>> On 05/13/2015 09:14 AM, vaibhav thapliyal wrote:
>>>>>
>>>>> The number of rf files per tablet varies between 2 and 5. The number
>>>>> of entries returned to me by the BatchScanner is 460000. The approximate
>>>>> average data rate is 0.5 MB/s, as seen on the Accumulo monitor page.
>>>>>
>>>>> A simple scan on the table has an average data rate of about 7-8 MB/s.
>>>>>
>>>>> All the ids exist in the Accumulo table.
>>>>>
>>>>> On 12 May 2015 at 23:39, Keith Turner <[email protected]> wrote:
>>>>>
>>>>>> Do you know how much data is being brought back (e.g. 100 megabytes)?
>>>>>> I am wondering what the data rate is in MB/s. Do you know how many files
>>>>>> per tablet you have? Do most of the 10,000 ids you are querying for
>>>>>> exist?
>>>>>>
>>>>>> On Tue, May 12, 2015 at 1:58 PM, vaibhav thapliyal <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> I have 194 tablets. Currently I am using 20 threads to create the
>>>>>>> BatchScanner inside the createBatchScanner method.
>>>>>>> On 12-May-2015 11:19 pm, "Keith Turner" <[email protected]> wrote:
>>>>>>>
>>>>>>>> How many tablets do you have? The batch scanner does not
>>>>>>>> parallelize operations within a tablet.
>>>>>>>>
>>>>>>>> If you give the batch scanner more threads than there are
>>>>>>>> tservers, it will make multiple parallel rpc calls to each tserver if
>>>>>>>> the tserver has multiple tablets. Each rpc may include multiple
>>>>>>>> tablets and ranges for each tablet.
>>>>>>>>
>>>>>>>> If the batch scanner has fewer threads than tservers, it will make
>>>>>>>> one rpc per tserver per thread. Each rpc call will include all
>>>>>>>> tablets and associated ranges for that tserver.
>>>>>>>>
>>>>>>>> Keith
>>>>>>>>
>>>>>>>> On Tue, May 12, 2015 at 1:39 PM, vaibhav thapliyal <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I am using BatchScanner to scan rows from an Accumulo table. The
>>>>>>>>> table has around 187m entries and I am using a 3 node cluster which
>>>>>>>>> has Accumulo 1.6.1.
>>>>>>>>>
>>>>>>>>> I have passed 10000 ids, which are stored as row ids in my table,
>>>>>>>>> as a list in the setRanges() method.
>>>>>>>>>
>>>>>>>>> This whole process takes around 50 secs (from adding the ids in
>>>>>>>>> the list to scanning the whole table using the BatchScanner).
>>>>>>>>>
>>>>>>>>> I tried switching on bloom filters but that didn't work.
>>>>>>>>>
>>>>>>>>> Also if anyone could briefly explain how a BatchScanner works and
>>>>>>>>> how it does parallel scanning, it would help me better understand
>>>>>>>>> what I am doing.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Vaibhav
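To make the split-point arithmetic from the top of this thread concrete, here is a rough sketch in plain Python. It does not use the Accumulo API. The function name `tablet_shares` and the generated row ids are hypothetical, and it assumes that a tablet holds the rows greater than the previous split point and less than or equal to its own split point.

```python
from bisect import bisect_left
from collections import Counter


def tablet_shares(row_ids, split_points):
    """Return the fraction of row ids that falls into each tablet.

    Assumption of this sketch: tablet i holds rows r with
    previous_split < r <= split_points[i]; the last tablet holds the rest.
    """
    splits = sorted(split_points)
    # bisect_left returns the index of the first split >= row,
    # which is the index of the tablet under the assumption above.
    counts = Counter(bisect_left(splits, row) for row in row_ids)
    total = len(row_ids)
    return [counts[i] / total for i in range(len(splits) + 1)]


# Hypothetical data: 1000 row ids spread evenly over first characters "0".."9".
rows = [f"{digit}{i:03d}" for digit in "0123456789" for i in range(100)]

print(tablet_shares(rows, ["5"]))  # [0.5, 0.5]
print(tablet_shares(rows, ["1"]))  # [0.1, 0.9]
```

If lookups are spread over the row ids the same way the data is, the second case sends 90% of them to a single tablet, which is the hot spot described above.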
