Re: AccumuloInputFormat and spark

Josh Elser Mon, 16 Feb 2015 12:48:58 -0800

Eugene,

First off, thanks so much for writing this up. This is definitely a "hottopic" that comes up for users and appears to have a lot of relevance topeople right now.

I think the first thing that needs to happen is that we "lift"TabletLocator (or some class which serves the purpose that TabletLocatorcurrently fulfills) into the public API. TabletLocator is currentlytreated as "internal implementation" meaning that you

don't have any guarantees on its use.

I think step 1 would be to add a TabletLocator class into the public API(and hide the implementation in a TabletLocatorImpl). We could only dothis for 1.7.0 given our adoption of semver. You are more than welcometo look at this and try to work on a PR.

Feel free to open an issue on JIRA as well (I can make sure it getsassigned to you after you do), and we can work with you to get a gooddesign in place.


- JOsh

Eugene Cheipesh wrote:

Hello,

This is more of a use-case report and a request for comment.

I am using Accumulo as a source for Spark RDDs through
AccumuloInputFormat. My index is based on a z-order space filing curve.
When I decompose a bounding box into index ranges I can end up with a
large number of Ranges, 3k+ is not too unusual. Getting a fast response
from Accumulo is not at all an issue. It would be possible to generate
approximate ranges and use a Filter to refine them on server side but
that only delays the problem.

The ideal scenario is for Spark executors to be co-located with Accumulo
tservers and number of splits per server to be roughly equal to the
number of cores on the machine.

However, AccumuloInputFormat maps each range to a Split and Spark maps
every split to a Task. It is nature of z-order curve that some of these
ranges contain only a few tiles while others contain a pretty big chunk.
Having significantly more splits than cores prevents good concurrency on
fetches. This is a problem that BatchScanner is designed to fix but it’s
not used in AccumuloInputFormat.

I noticed that TabletLocator which is used by AccumuloInputFormat
returns a structure that looks like it breaks down ranges by host and
then by tablet. I hacked together an InputFormat that generates a split
per tablet and a Reader that uses a BatchScanner. The performance for
spark use-case was orders of magnitude better. I end up with about 50
splits for the same dataset. I can’t give exact numbers because I gave
up timing the original source. This seems is a pretty good compromise
since the number of splits can be dynamically controlled to tune the
distribution and granularity of calculation batches.

A drawback is that most modes can not support this operation directly:
client side, offline, and isolated scans require a single range
iterator. So some additional code would be required for juggling them.

What are your thoughts on this use case and its requirements? Is this a
legitimate use of TabletLocator?

It would be nice if AccumuloInputFormat was able to use BatchScanner,
perhaps as an option. Accumulo is designed to crunch through large
number of ranges so I would guess this to be a common issue. I’d be
willing to take a stab at a PR if there is agreement on that.

Thanks,
--
Eugene Cheipesh

Re: AccumuloInputFormat and spark

Reply via email to