It sounds like each of your ranges is an ID, i.e. a single row. I've found that scanning lots of non-sequential single-row ranges is pretty slow in Accumulo. Your best approach is probably to create an index table on whatever you are originally trying to query (assuming those 10000 ids came from some other query).
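
To make the index-table idea concrete, here is a minimal sketch in plain Java, not Accumulo code. A TreeMap stands in for a sorted index table; the key layout (attribute value, a NUL separator, then the id) and the names IndexTableSketch, put and idsFor are all hypothetical. The point is only that every id matching one query value sits next to the others in sort order, so they can be read as one contiguous range instead of thousands of single-row seeks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy model of an index table (an assumption for illustration, not the Accumulo API).
class IndexTableSketch {
    // Sorted keys of the form: attributeValue + '\0' + id. Values are unused here.
    private final TreeMap<String, String> index = new TreeMap<>();

    // Record that the row with this id has the given attribute value.
    void put(String attributeValue, String id) {
        index.put(attributeValue + '\0' + id, "");
    }

    // Read all ids for one attribute value with a single contiguous range.
    List<String> idsFor(String attributeValue) {
        String start = attributeValue + '\0';     // inclusive lower bound
        String end = attributeValue + '\u0001';   // exclusive upper bound
        List<String> ids = new ArrayList<>();
        for (String key : index.subMap(start, end).keySet()) {
            ids.add(key.substring(start.length()));
        }
        return ids;
    }

    public static void main(String[] args) {
        IndexTableSketch table = new IndexTableSketch();
        table.put("red", "id2");
        table.put("red", "id1");
        table.put("redder", "id9");
        table.put("blue", "id3");
        System.out.println(table.idsFor("red"));
    }
}
```

Whether the index entries should also carry the fields you need, so you can skip the lookup in the main table entirely, is a separate design choice that depends on your data.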

Thanks,

Emilio

On 05/13/2015 09:14 AM, vaibhav thapliyal wrote:
The number of rf files per tablet varies between 2 and 5. The BatchScanner returned 460000 entries. The approximate average data rate is 0.5 MB/s, as seen on the Accumulo monitor page.

A simple scan on the table has an average data rate of about 7-8 MB/s.

All the ids exist in the Accumulo table.

On 12 May 2015 at 23:39, Keith Turner wrote:

    Do you know how much data is being brought back (i.e. 100
    megabytes)? I am wondering what the data rate is in MB/s.  Do you
    know how many files per tablet you have?  Do most of the 10,000
    ids you are querying for exist?

    On Tue, May 12, 2015 at 1:58 PM, vaibhav thapliyal wrote:

        I have 194 tablets. Currently I pass 20 threads to the
        createBatchScanner method when creating the BatchScanner.

        On 12-May-2015 11:19 pm, "Keith Turner" wrote:

            How many tablets do you have? The batch scanner does not
            parallelize operations within a tablet.

            If you give the batch scanner more threads than there are
            tservers, it will make multiple parallel rpc calls to
            each tserver if the tserver has multiple tablets.  Each
            rpc may include multiple tablets and ranges for each tablet.

            If the batch scanner has fewer threads than tservers, it
            will make one rpc per tserver per thread.  Each rpc call
            will include all tablets and associated ranges for that
            tserver.
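
The two cases above can be sketched as a toy planner in plain Java. This is a simplified model under stated assumptions, not the batch scanner's actual implementation: the class RpcPlanSketch, the method plan, and the chunk-size rule (total tablets divided by thread count, at least 1) are all hypothetical. It only shows the shape of the behaviour: with no more threads than tservers, each tserver gets one rpc covering all of its tablets; with more threads, a tserver's tablets are split across several rpcs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of how tablets might be grouped into rpcs (an assumption, not Accumulo code).
class RpcPlanSketch {
    // For each tserver, returns the list of rpcs, each rpc being a list of tablets.
    static Map<String, List<List<String>>> plan(Map<String, List<String>> tabletsByServer,
                                                int numThreads) {
        int servers = tabletsByServer.size();
        int maxTabletsPerRpc = Integer.MAX_VALUE; // default: everything for a tserver in one rpc
        if (numThreads > servers) {
            int total = 0;
            for (List<String> tablets : tabletsByServer.values()) {
                total += tablets.size();
            }
            // Hypothetical rule: spread the tablets roughly evenly over the threads.
            maxTabletsPerRpc = Math.max(1, total / numThreads);
        }
        Map<String, List<List<String>>> rpcs = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : tabletsByServer.entrySet()) {
            List<String> tablets = e.getValue();
            List<List<String>> chunks = new ArrayList<>();
            for (int i = 0; i < tablets.size(); i += maxTabletsPerRpc) {
                // Use long arithmetic so the default Integer.MAX_VALUE cannot overflow.
                int end = (int) Math.min((long) i + maxTabletsPerRpc, tablets.size());
                chunks.add(new ArrayList<>(tablets.subList(i, end)));
                if (end == tablets.size()) {
                    break;
                }
            }
            rpcs.put(e.getKey(), chunks);
        }
        return rpcs;
    }

    public static void main(String[] args) {
        Map<String, List<String>> tabletsByServer = new LinkedHashMap<>();
        tabletsByServer.put("tserverA", Arrays.asList("t1", "t2", "t3", "t4"));
        tabletsByServer.put("tserverB", Arrays.asList("t5", "t6"));
        System.out.println(plan(tabletsByServer, 1));
        System.out.println(plan(tabletsByServer, 6));
    }
}
```

Either way, a single tablet never appears in more than one rpc in this model, which matches the point that work within a tablet is not parallelized.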

            Keith



            On Tue, May 12, 2015 at 1:39 PM, vaibhav thapliyal wrote:

                Hi,

                I am using a BatchScanner to scan rows from an Accumulo
                table. The table has around 187 million entries and I am
                using a 3-node cluster running Accumulo 1.6.1.

                I pass 10000 ids, which are stored as row ids in my
                table, as a list to the setRanges() method.

                The whole process takes around 50 seconds (from adding
                the ids to the list to finishing the scan with
                the BatchScanner).

                I tried switching on bloom filters but that didn't work.

                Also, if anyone could briefly explain how a
                BatchScanner works and how it does parallel scanning,
                it would help me better understand what I am doing.

                Thanks
                Vaibhav





