I would like to use an Accumulo table as input for a Spark job. To clarify: my job doesn't need the keys sorted, and Accumulo is used purely to filter the input data thanks to its index on the keys. The data that I need to process in Spark is only a small portion of the full dataset.

I know that Accumulo provides the AccumuloInputFormat, but in my tests almost no task has data locality when I use this input format, which leads to poor performance. I'm not sure why this happens, but my guess is that the AccumuloInputFormat creates one task per range. Is there a way to tell the AccumuloInputFormat to split each range into the sub-ranges local to each tablet server, so that each Spark task reads only data from the machine it is running on?
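For reference, here is a minimal sketch of how I'm wiring the input format into Spark. The instance name, ZooKeeper hosts, credentials, table name and range are placeholders, and it assumes the Accumulo 1.6 mapreduce API:

    import org.apache.accumulo.core.client.ClientConfiguration
    import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat
    import org.apache.accumulo.core.client.security.tokens.PasswordToken
    import org.apache.accumulo.core.data.{Key, Range, Value}
    import org.apache.hadoop.io.Text
    import org.apache.hadoop.mapreduce.Job
    import org.apache.spark.{SparkConf, SparkContext}
    import scala.collection.JavaConverters._

    object AccumuloSparkSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("accumulo-input"))

        // Connection details below are placeholders.
        val job = Job.getInstance()
        AccumuloInputFormat.setConnectorInfo(job, "user", new PasswordToken("secret"))
        AccumuloInputFormat.setZooKeeperInstance(job,
          ClientConfiguration.loadDefault()
            .withInstance("myInstance")
            .withZkHosts("zk1:2181"))
        AccumuloInputFormat.setInputTableName(job, "mytable")

        // One coarse range over the keys I care about; my understanding is
        // that the input format derives its input splits from these ranges,
        // so a range spanning several tablets ends up as a single task.
        val ranges = List(new Range(new Text("row_a"), new Text("row_z")))
        AccumuloInputFormat.setRanges(job, ranges.asJava)

        // Build the RDD over (Key, Value) pairs scanned from the table.
        val rdd = sc.newAPIHadoopRDD(
          job.getConfiguration,
          classOf[AccumuloInputFormat],
          classOf[Key],
          classOf[Value])

        println(rdd.count())
        sc.stop()
      }
    }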
Thanks for the help,
Mario

--
Mario Pastorelli
Software Engineer, Teralytics AG
