Hi Mario,

By default the AccumuloInputFormat merges overlapping ranges, then splits them to align with tablet boundaries (see the setAutoAdjustRanges method in InputFormatBase). You could also use a batch scanner (see the setBatchScan method), but I am not sure how it handles the order of the keys in a map-reduce task.
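To make the merge-then-split behaviour concrete, here is a minimal plain-Java sketch of the two steps. It does not call the Accumulo API: RowRange, merge, splitAtTablets and the example split points are hypothetical names and values chosen for illustration, and ranges are simplified to half-open row intervals, which is an assumption and not how Accumulo represents tablet extents.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class RangeSplitSketch {

    /** Half-open row interval [start, end); a simplified, hypothetical stand-in for a scan range. */
    public static final class RowRange {
        final String start;
        final String end;

        public RowRange(String start, String end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + start + "," + end + ")";
        }
    }

    /** Step 1: merge ranges that overlap or touch, after sorting them by start row. */
    public static List<RowRange> merge(List<RowRange> ranges) {
        List<RowRange> sorted = new ArrayList<>(ranges);
        sorted.sort(Comparator.comparing((RowRange r) -> r.start));
        List<RowRange> merged = new ArrayList<>();
        for (RowRange r : sorted) {
            if (!merged.isEmpty()) {
                RowRange last = merged.get(merged.size() - 1);
                if (r.start.compareTo(last.end) <= 0) {
                    // Overlapping or adjacent: extend the previous range to the larger end row.
                    String end = last.end.compareTo(r.end) >= 0 ? last.end : r.end;
                    merged.set(merged.size() - 1, new RowRange(last.start, end));
                    continue;
                }
            }
            merged.add(r);
        }
        return merged;
    }

    /** Step 2: cut each merged range at every split point that lies strictly inside it. */
    public static List<RowRange> splitAtTablets(List<RowRange> merged, List<String> sortedSplitPoints) {
        List<RowRange> out = new ArrayList<>();
        for (RowRange r : merged) {
            String cursor = r.start;
            for (String split : sortedSplitPoints) {
                if (split.compareTo(cursor) > 0 && split.compareTo(r.end) < 0) {
                    out.add(new RowRange(cursor, split));
                    cursor = split;
                }
            }
            out.add(new RowRange(cursor, r.end));
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical requested ranges and tablet split points.
        List<RowRange> requested = Arrays.asList(
                new RowRange("a", "f"), new RowRange("d", "k"), new RowRange("p", "r"));
        List<String> tabletSplits = Arrays.asList("e", "m", "q");
        // prints [[a,e), [e,k), [p,q), [q,r)]
        System.out.println(splitAtTablets(merge(requested), tabletSplits));
    }
}
```

In this sketch every resulting sub-range falls inside a single tablet, which is what allows an input split to be tied to one tablet server. Whether Spark then schedules the task on that host is a separate question that the sketch does not address.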
Regards,
Massimiliano Mattetti
Security R&D Engineer

From: Mario Pastorelli <[email protected]>
To: [email protected]
Date: 02/08/2016 02:55 AM
Subject: AccumuloInputFormat and data locality for jobs that don't need keys sorted

I would like to use an Accumulo table as input for a Spark job. To clarify, my job doesn't need the keys sorted; Accumulo is used purely to filter the input data thanks to its index on the keys. The data that I need to process in Spark is still a small portion of the full dataset.

I know that Accumulo provides the AccumuloInputFormat, but in my tests almost no task has data locality when I use this input format, which leads to poor performance. I'm not sure why this happens, but my guess is that the AccumuloInputFormat creates one task per range. I wonder if there is a way to tell the AccumuloInputFormat to split each range into the sub-ranges local to each tablet server, so that each Spark task reads only data from the machine where it is running.

Thanks for the help,
Mario

--
Mario Pastorelli | TERALYTICS
software engineer
Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland
www.teralytics.net
