Hi Mario,

If I recall, the AccumuloInputFormat creates one task per range by default. For geospatial indexing (and other non-trivial cases), a query plan can generate a large number of ranges. Eugene Cheipesh wrote in about a year and a half ago when this same issue came up in GeoTrellis:
http://mail-archives.apache.org/mod_mbox/accumulo-user/201502.mbox/%[email protected]%3E

From that discussion it sounded like a JIRA ticket was filed and some work was underway. I remember one of the GeoMesans looking at what Eugene had done and implementing a similar idea in our GeoMesaInputFormat. Since cloud configurations can vary quite a bit, we left some light hooks to allow for tuning. Overall, I think this is something one has to play with for a particular use case and cloud.

Cheers,
Jim

Links in case they help:
https://github.com/locationtech/geomesa/blob/master/geomesa-compute/src/main/scala/org/locationtech/geomesa/compute/spark/GeoMesaSpark.scala
https://github.com/locationtech/geomesa/blob/master/geomesa-jobs/src/main/scala/org/locationtech/geomesa/jobs/mapreduce/GeoMesaInputFormat.scala
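And in case a concrete starting point helps, below is a minimal sketch (untested) of wiring AccumuloInputFormat into Spark so that ranges get split at tablet boundaries. It assumes the Accumulo 1.6/1.7 mapreduce API; the instance name, ZooKeeper hosts, credentials, table name, and row bounds are all placeholders:

    import org.apache.accumulo.core.client.ClientConfiguration
    import org.apache.accumulo.core.client.mapreduce.{AbstractInputFormat, AccumuloInputFormat, InputFormatBase}
    import org.apache.accumulo.core.client.security.tokens.PasswordToken
    import org.apache.accumulo.core.data.{Key, Range, Value}
    import org.apache.accumulo.core.security.Authorizations
    import org.apache.hadoop.mapreduce.Job
    import org.apache.spark.{SparkConf, SparkContext}
    import scala.collection.JavaConverters._

    object AccumuloLocalitySketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("accumulo-locality"))

        // A throwaway Job object just to hold the input format configuration.
        // The static configurators live on the mapreduce base classes.
        val job = Job.getInstance()
        AbstractInputFormat.setConnectorInfo(job, "user", new PasswordToken("secret"))
        AbstractInputFormat.setZooKeeperInstance(job,
          ClientConfiguration.loadDefault()
            .withInstance("myInstance")
            .withZkHosts("zoo1:2181"))
        AbstractInputFormat.setScanAuthorizations(job, new Authorizations())
        InputFormatBase.setInputTableName(job, "mytable")

        // Whatever ranges your query planner produced; one coarse range here.
        InputFormatBase.setRanges(job, Seq(new Range("row_000", "row_fff")).asJava)

        // Enabled by default: overlapping ranges are merged and then split
        // at tablet boundaries, so each input split covers a single tablet.
        InputFormatBase.setAutoAdjustRanges(job, true)

        val rdd = sc.newAPIHadoopRDD(job.getConfiguration,
          classOf[AccumuloInputFormat], classOf[Key], classOf[Value])

        println(rdd.count())
        sc.stop()
      }
    }

With auto-adjusting enabled, each split corresponds to one tablet's slice of a range and reports that tablet server as a preferred location, which is what Spark's scheduler uses for locality.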
On Tue, Aug 2, 2016 at 9:30 AM, Marc Reichman <[email protected]> wrote:

> Hi Mario,
>
> In my experience, the performance/locality help from
> AccumuloInputFormat/AccumuloRowInputFormat tends to be less helpful once
> you add a lot of ranges. In some cases I've found there's an efficiency
> curve you can experiment with, where it's sometimes faster to just throw
> out data locally rather than use many ranges. I've been using Accumulo
> with classic MapReduce, YARN MapReduce, and Spark for a while, and this
> has held true on all of those platforms.
>
> Good luck!
>
> Marc
>
> On Mon, Aug 1, 2016 at 6:55 PM, Mario Pastorelli <
> [email protected]> wrote:
>
>> I would like to use an Accumulo table as input for a Spark job. Let me
>> clarify that my job doesn't need keys sorted; Accumulo is purely used to
>> filter the input data thanks to its index on the keys. The data that I
>> need to process in Spark is still a small portion of the full dataset.
>> I know that Accumulo provides the AccumuloInputFormat, but in my tests
>> almost no task has data locality when I use this input format, which
>> leads to poor performance. I'm not sure why this happens, but my guess
>> is that the AccumuloInputFormat creates one task per range.
>> I wonder if there is a way to tell the AccumuloInputFormat to split
>> each range into the sub-ranges local to each tablet server, so that each
>> task in Spark will read only data from the machine it is running on.
>>
>> Thanks for the help,
>> Mario
>>
>> --
>> Mario Pastorelli | TERALYTICS
>>
>> *software engineer*
>>
>> Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland
>> phone: +41794381682
>> email: [email protected]
>> www.teralytics.net
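For completeness, here is the other end of the trade-off Marc describes: scan one coarse range and throw out false positives locally. This fragment reuses the sc and configured job from the sketch above, and the row-prefix test is a hypothetical stand-in for whatever predicate the precise query ranges would have encoded.

    // Reuses `sc` and the configured `job` from the earlier sketch, with a
    // single coarse range already set instead of many precise ones.
    val raw = sc.newAPIHadoopRDD(job.getConfiguration,
      classOf[AccumuloInputFormat], classOf[Key], classOf[Value])

    // Filter locally instead of pushing thousands of precise ranges into
    // the input format; hypothetical predicate on the row id.
    val filtered = raw.filter { case (key, _) =>
      key.getRow.toString.startsWith("row_2")
    }
    println(filtered.count())

Whether this beats many precise ranges depends on how selective the query is and how the data is laid out, so, as Marc says, it's worth experimenting on your own cluster.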
