Re: AccumuloInputFormat and data locality for jobs that don't need keys sorted

Massimilian Mattetti Tue, 02 Aug 2016 00:37:23 -0700

Hi Mario, 

By default, the AccumuloInputFormat merges overlapping ranges and then 
splits them to align with tablet boundaries (see the setAutoAdjustRanges 
method in InputFormatBase). You could also use a batch scanner (see the 
setBatchScan method), but I am not sure how it handles the order of the 
keys in a map-reduce task.
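
To make the "merge, then split at tablet boundaries" idea concrete, here is 
a minimal, self-contained sketch in plain Java. It is not Accumulo's code and 
does not call the Accumulo API: the class RangeSplitSketch, the Interval type, 
and the half-open [start, end) row intervals are all assumptions made for 
illustration (real tablets and ranges have their own inclusive/exclusive 
rules). It only shows why the resulting sub-ranges each fall inside a single 
tablet, which is what allows a task to be scheduled near that tablet.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical model of "merge overlapping ranges, then cut at tablet splits".
final class RangeSplitSketch {

    /** Half-open row interval [start, end); a simplified stand-in for a range. */
    static final class Interval {
        final String start;
        final String end;

        Interval(String start, String end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + start + "," + end + ")";
        }
    }

    /** Merges intervals that overlap or touch; the result is sorted by start. */
    static List<Interval> merge(List<Interval> input) {
        List<Interval> sorted = new ArrayList<>(input);
        sorted.sort(Comparator.comparing((Interval i) -> i.start));
        List<Interval> out = new ArrayList<>();
        for (Interval cur : sorted) {
            if (!out.isEmpty() && out.get(out.size() - 1).end.compareTo(cur.start) >= 0) {
                // Overlaps (or touches) the previous interval: extend it.
                Interval last = out.remove(out.size() - 1);
                String end = last.end.compareTo(cur.end) >= 0 ? last.end : cur.end;
                out.add(new Interval(last.start, end));
            } else {
                out.add(cur);
            }
        }
        return out;
    }

    /** Cuts each interval at every split point that lies strictly inside it. */
    static List<Interval> splitAtTablets(List<Interval> merged, List<String> sortedSplits) {
        List<Interval> out = new ArrayList<>();
        for (Interval iv : merged) {
            String start = iv.start;
            for (String split : sortedSplits) {
                if (split.compareTo(start) > 0 && split.compareTo(iv.end) < 0) {
                    out.add(new Interval(start, split));
                    start = split;
                }
            }
            out.add(new Interval(start, iv.end));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Interval> requested = Arrays.asList(
                new Interval("a", "f"), new Interval("d", "k"), new Interval("p", "r"));
        List<String> tabletSplits = Arrays.asList("e", "h", "q"); // assumed split points
        System.out.println(splitAtTablets(merge(requested), tabletSplits));
    }
}
```

In this toy example the three requested ranges collapse to two merged ranges, 
which are then cut into five sub-ranges, one per tablet touched. With one 
input split per sub-range, each task reads from exactly one tablet.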



Regards,

Massimiliano Mattetti
Security R&D Engineer

From:   Mario Pastorelli <[email protected]>
To:     [email protected]
Date:   02/08/2016 02:55 AM
Subject:        AccumuloInputFormat and data locality for jobs that don't 
need keys sorted



I would like to use an Accumulo table as input for a Spark job. Let me 
clarify that my job doesn't need the keys sorted; Accumulo is used purely 
to filter the input data thanks to its index on the keys. The data that I 
need to process in Spark is still a small portion of the full dataset.
I know that Accumulo provides the AccumuloInputFormat, but in my tests 
almost no task has data locality when I use this input format, which leads 
to poor performance. I'm not sure why this happens, but my guess is that 
the AccumuloInputFormat creates one task per range.
I wonder if there is a way to tell the AccumuloInputFormat to split 
each range into the sub-ranges local to each tablet server, so that each 
task in Spark will read only data from the same machine on which it is 
running.

Thanks for the help,
Mario

-- 
Mario Pastorelli | TERALYTICS
software engineer
Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland 
phone: +41794381682
email: [email protected]
www.teralytics.net
Company registration number: CH-020.3.037.709-7 | Trade register Canton 
Zurich
Board of directors: Georg Polzer, Luciano Franceschina, Mark Schmitz, Yann 
de Vries
