I would like to use an Accumulo table as input for a Spark job. To clarify: my job doesn't need the keys sorted, and Accumulo is used purely to filter the input data thanks to its index on the keys. The data that I need to process in Spark is only a small portion of the full dataset.

I know that Accumulo provides the AccumuloInputFormat, but in my tests almost no task has data locality when I use this input format, which leads to poor performance. I'm not sure why this happens, but my guess is that the AccumuloInputFormat creates one task per range. Is there a way to tell the AccumuloInputFormat to split each range into the sub-ranges local to each tablet server, so that each Spark task reads only data from the machine it is running on?
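For reference, here is a minimal sketch of how I'm wiring the input format into Spark. The instance name, ZooKeeper hosts, credentials, table name and range are placeholders, and it assumes the Accumulo 1.6 mapreduce API:

    import org.apache.accumulo.core.client.ClientConfiguration
    import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat
    import org.apache.accumulo.core.client.security.tokens.PasswordToken
    import org.apache.accumulo.core.data.{Key, Range, Value}
    import org.apache.hadoop.io.Text
    import org.apache.hadoop.mapreduce.Job
    import org.apache.spark.{SparkConf, SparkContext}
    import scala.collection.JavaConverters._

    object AccumuloSparkSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("accumulo-input"))

        // Connection details below are placeholders.
        val job = Job.getInstance()
        AccumuloInputFormat.setConnectorInfo(job, "user", new PasswordToken("secret"))
        AccumuloInputFormat.setZooKeeperInstance(job,
          ClientConfiguration.loadDefault()
            .withInstance("myInstance")
            .withZkHosts("zk1:2181"))
        AccumuloInputFormat.setInputTableName(job, "mytable")

        // One coarse range over the keys I care about; my understanding is
        // that the input format derives its input splits from these ranges,
        // so a range spanning several tablets ends up as a single task.
        val ranges = List(new Range(new Text("row_a"), new Text("row_z")))
        AccumuloInputFormat.setRanges(job, ranges.asJava)

        // Build the RDD over (Key, Value) pairs scanned from the table.
        val rdd = sc.newAPIHadoopRDD(
          job.getConfiguration,
          classOf[AccumuloInputFormat],
          classOf[Key],
          classOf[Value])

        println(rdd.count())
        sc.stop()
      }
    }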
Thanks for the help,
Mario

--
Mario Pastorelli
Software Engineer, Teralytics AG
