If you are not already aware of it, another option to consider is setOfflineTableScan [1], which can support much faster reads of data. In my experience this is usually only useful for map-only jobs like the one you are running. With a full map/reduce job, the sort can make a speedup in the map read rate irrelevant.
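For reference, here is a rough sketch of how the offline scan might be wired up with the mapred API from [1]. It assumes the usual pattern of cloning the table and taking the clone offline, since the job fails if the input table is still online. The instance name, ZooKeeper hosts, user, password, table names, and row range are all placeholders, and I have not run this against your setup.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;

import org.apache.accumulo.core.client.ClientConfiguration;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.mapred.AccumuloInputFormat;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.mapred.JobConf;

public class OfflineScanSetup {

  public static JobConf configure() throws Exception {
    // Placeholder connection details (assumptions, replace with your own).
    String user = "spark_user";
    String table = "events";
    String clone = "events_offline_clone";
    PasswordToken token = new PasswordToken("secret");
    ClientConfiguration clientConf = ClientConfiguration.loadDefault()
        .withInstance("myInstance")
        .withZkHosts("zk1:2181,zk2:2181");

    Connector conn = new ZooKeeperInstance(clientConf).getConnector(user, token);

    // Clone the live table (flushing it first) and take the clone offline,
    // so the original table stays available for normal reads and writes.
    conn.tableOperations().clone(table, clone, true,
        new HashMap<String, String>(), new HashSet<String>());
    conn.tableOperations().offline(clone, true); // wait until offline

    JobConf job = new JobConf();
    job.setInputFormat(AccumuloInputFormat.class);
    AccumuloInputFormat.setConnectorInfo(job, user, token);
    AccumuloInputFormat.setZooKeeperInstance(job, clientConf);
    AccumuloInputFormat.setInputTableName(job, clone);
    AccumuloInputFormat.setScanAuthorizations(job, Authorizations.EMPTY);

    // Restrict the input to the rows of interest (placeholder range).
    AccumuloInputFormat.setRanges(job,
        Collections.singleton(new Range("rowA", "rowM")));

    // Read the tablet files directly instead of going through tservers.
    AccumuloInputFormat.setOfflineTableScan(job, true);
    return job;
  }
}
```

The resulting JobConf should be usable from Spark through hadoopRDD with AccumuloInputFormat as the input format class. Note that the user running the job needs read access to the Accumulo directory in HDFS, and the clone should be deleted once the job is done.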
You still may not get locality if tablets have multiple files, because the mapper does a merged read of those files. Offline map reduce in Accumulo attempts to run each mapper at the last location where a tablet compacted some of its files. Even without locality, you still avoid the cost of deserializing, re-serializing, transmitting, and deserializing the data again in the tserver and client.

[1]: http://accumulo.apache.org/1.7/apidocs/org/apache/accumulo/core/client/mapred/InputFormatBase.html#setOfflineTableScan%28org.apache.hadoop.mapred.JobConf,%20boolean%29

On Mon, Aug 1, 2016 at 7:55 PM, Mario Pastorelli <[email protected]> wrote:
> I would like to use an Accumulo table as input for a Spark job. Let me
> clarify that my job doesn't need keys sorted and Accumulo is purely used to
> filter the input data thanks to its index on the keys. The data that I need
> to process in Spark is still a small portion of the full dataset.
> I know that Accumulo provides the AccumuloInputFormat but in my tests almost
> no task has data locality when I use this input format, which leads to poor
> performance. I'm not sure why this happens but my guess is that the
> AccumuloInputFormat creates one task per range.
> I wonder if there is a way to tell the AccumuloInputFormat to split each
> range into the sub-ranges local to each tablet server so that each task in
> Spark will read only data from the same machine where it is running.
>
> Thanks for the help,
> Mario
