Hi Mario,

If I recall, the AccumuloInputFormat creates one task per range by default. For geospatial indexing (and other non-trivial cases), a query plan can generate a large number of ranges. Eugene Cheipesh wrote in about a year and a half ago when this same issue came up in GeoTrellis:
http://mail-archives.apache.org/mod_mbox/accumulo-user/201502.mbox/%[email protected]%3E

From that discussion it sounded like a JIRA ticket was filed and some work was underway. I remember one of the GeoMesans looking at what Eugene had done and implementing a similar idea in our GeoMesaInputFormat. Since cloud configurations can vary quite a bit, we left some light hooks to allow for tuning. Overall, I think this is something one has to play with for a particular use case and cloud.

Cheers,
Jim

Links in case they help:
https://github.com/locationtech/geomesa/blob/master/geomesa-compute/src/main/scala/org/locationtech/geomesa/compute/spark/GeoMesaSpark.scala
https://github.com/locationtech/geomesa/blob/master/geomesa-jobs/src/main/scala/org/locationtech/geomesa/jobs/mapreduce/GeoMesaInputFormat.scala
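And in case a concrete starting point helps, below is a minimal sketch (untested) of wiring AccumuloInputFormat into Spark so that ranges get split at tablet boundaries. It assumes the Accumulo 1.6/1.7 mapreduce API; the instance name, ZooKeeper hosts, credentials, table name, and row bounds are all placeholders:

    import org.apache.accumulo.core.client.ClientConfiguration
    import org.apache.accumulo.core.client.mapreduce.{AbstractInputFormat, AccumuloInputFormat, InputFormatBase}
    import org.apache.accumulo.core.client.security.tokens.PasswordToken
    import org.apache.accumulo.core.data.{Key, Range, Value}
    import org.apache.accumulo.core.security.Authorizations
    import org.apache.hadoop.mapreduce.Job
    import org.apache.spark.{SparkConf, SparkContext}
    import scala.collection.JavaConverters._

    object AccumuloLocalitySketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("accumulo-locality"))

        // A throwaway Job object just to hold the input format configuration.
        // The static configurators live on the mapreduce base classes.
        val job = Job.getInstance()
        AbstractInputFormat.setConnectorInfo(job, "user", new PasswordToken("secret"))
        AbstractInputFormat.setZooKeeperInstance(job,
          ClientConfiguration.loadDefault()
            .withInstance("myInstance")
            .withZkHosts("zoo1:2181"))
        AbstractInputFormat.setScanAuthorizations(job, new Authorizations())
        InputFormatBase.setInputTableName(job, "mytable")

        // Whatever ranges your query planner produced; one coarse range here.
        InputFormatBase.setRanges(job, Seq(new Range("row_000", "row_fff")).asJava)

        // Enabled by default: overlapping ranges are merged and then split
        // at tablet boundaries, so each input split covers a single tablet.
        InputFormatBase.setAutoAdjustRanges(job, true)

        val rdd = sc.newAPIHadoopRDD(job.getConfiguration,
          classOf[AccumuloInputFormat], classOf[Key], classOf[Value])

        println(rdd.count())
        sc.stop()
      }
    }

With auto-adjusting enabled, each split corresponds to one tablet's slice of a range and reports that tablet server as a preferred location, which is what Spark's scheduler uses for locality.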
On Tue, Aug 2, 2016 at 9:30 AM, Marc Reichman <[email protected]> wrote:

> Hi Mario,
>
> In my experience, the performance/locality help from
> AccumuloInputFormat/AccumuloRowInputFormat tends to be less helpful once
> you add a lot of ranges. In some cases I've found there's an efficiency
> curve you can experiment with, where it's sometimes faster to just throw
> out data locally rather than use many ranges. I've been using Accumulo
> with classic MapReduce, YARN MapReduce, and Spark for a while, and this
> has held true on all of those platforms.
>
> Good luck!
>
> Marc
>
> On Mon, Aug 1, 2016 at 6:55 PM, Mario Pastorelli <
> [email protected]> wrote:
>
>> I would like to use an Accumulo table as input for a Spark job. Let me
>> clarify that my job doesn't need keys sorted; Accumulo is purely used to
>> filter the input data thanks to its index on the keys. The data that I
>> need to process in Spark is still a small portion of the full dataset.
>> I know that Accumulo provides the AccumuloInputFormat, but in my tests
>> almost no task has data locality when I use this input format, which
>> leads to poor performance. I'm not sure why this happens, but my guess
>> is that the AccumuloInputFormat creates one task per range.
>> I wonder if there is a way to tell the AccumuloInputFormat to split
>> each range into the sub-ranges local to each tablet server, so that each
>> task in Spark will read only data from the machine it is running on.
>>
>> Thanks for the help,
>> Mario
>>
>> --
>> Mario Pastorelli | TERALYTICS
>>
>> *software engineer*
>>
>> Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland
>> phone: +41794381682
>> email: [email protected]
>> www.teralytics.net
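For completeness, here is the other end of the trade-off Marc describes: scan one coarse range and throw out false positives locally. This fragment reuses the sc and configured job from the sketch above, and the row-prefix test is a hypothetical stand-in for whatever predicate the precise query ranges would have encoded.

    // Reuses `sc` and the configured `job` from the earlier sketch, with a
    // single coarse range already set instead of many precise ones.
    val raw = sc.newAPIHadoopRDD(job.getConfiguration,
      classOf[AccumuloInputFormat], classOf[Key], classOf[Value])

    // Filter locally instead of pushing thousands of precise ranges into
    // the input format; hypothetical predicate on the row id.
    val filtered = raw.filter { case (key, _) =>
      key.getRow.toString.startsWith("row_2")
    }
    println(filtered.count())

Whether this beats many precise ranges depends on how selective the query is and how the data is laid out, so, as Marc says, it's worth experimenting on your own cluster.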
