Hi,

I am working on revamping the architecture of OpenRefine, an ETL tool,
so that it can execute workflows on datasets that do not fit in RAM.

Spark's RDD API is a great fit for the tool's operations and provides
everything we need: partitioning and lazy evaluation.
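
For context, here is a minimal sketch of the kind of pipeline our
operations map onto. This is the plain Java RDD API; the file name and
transformation are made up for illustration:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class PipelineSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                    .setAppName("pipeline-sketch")
                    .setMaster("local[*]");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                // Load the file split into 8 partitions; nothing is read yet.
                JavaRDD<String> rows = sc.textFile("dataset.csv", 8);
                // Transformations are lazy: this only records the operation.
                JavaRDD<String> trimmed = rows.map(String::trim);
                // An action triggers evaluation, one task per partition.
                System.out.println(trimmed.count());
            }
        }
    }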

However, OpenRefine is a lightweight tool that runs locally, on the
user's machine, and we want to preserve this use case. Running Spark in
local mode (with the master set to local[*]) works, but I have read in
a couple of places that local mode is only intended for development and
testing. This is confirmed by my experience with it so far:
- the overhead added by task serialization and scheduling is
significant even in local mode. This makes sense for testing, since you
want to test serialization as well, but to run Spark in production
locally we would need to bypass serialization, which is not possible as
far as I know (see the snippet after this list);
- some bugs that manifest themselves only in local mode are not getting
much attention (https://issues.apache.org/jira/browse/SPARK-5300), so
it seems dangerous to base a production system on Spark's local mode.
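
To make the serialization point concrete, here is a minimal
demonstration, assuming I read the internals correctly (the Helper
class is made up): even with the master set to local[*], the closure
passed to map() must be serializable, so serialization is clearly not
bypassed locally.

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class LocalModeSerialization {
        // Deliberately not Serializable.
        static class Helper {
            String shout(String s) { return s.toUpperCase(); }
        }

        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                    .setAppName("serialization-demo")
                    .setMaster("local[*]"); // single JVM, no cluster
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                Helper h = new Helper();
                // Fails with "Task not serializable" even though no data
                // ever leaves this JVM: tasks are serialized locally too.
                sc.parallelize(Arrays.asList("a", "b", "c"))
                  .map(h::shout)
                  .collect();
            }
        }
    }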

So we cannot use Spark as the default runner in the tool. Do you know
of any alternative designed for local use? A library that would offer
something similar to the RDD API, but parallelize with threads in the
same JVM rather than across machines in a cluster?

If there is no such thing, it should not be too hard to write a
homegrown implementation, which would basically be Java streams with
partitioning (sketched below). I have looked at Apache Beam's direct
runner, but it is also designed for testing, so it does not fit the
bill for the same reasons.
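
To illustrate, here is a rough sketch of that homegrown idea. All names
are hypothetical (LocalPLD is not an existing class): each partition is
a lazy stream supplier, transformations compose onto the pipeline, and
actions evaluate the partitions on the common fork-join pool.

    import java.util.List;
    import java.util.function.Function;
    import java.util.function.Supplier;
    import java.util.stream.Collectors;
    import java.util.stream.Stream;

    // Hypothetical sketch of a partitioned, lazily evaluated collection.
    class LocalPLD<T> {
        private final List<Supplier<Stream<T>>> partitions;

        LocalPLD(List<Supplier<Stream<T>>> partitions) {
            this.partitions = partitions;
        }

        // Transformation: lazy, composes the function onto each
        // partition's stream pipeline, like RDD.map.
        <U> LocalPLD<U> map(Function<? super T, ? extends U> f) {
            return new LocalPLD<>(partitions.stream()
                    .map(p -> (Supplier<Stream<U>>) () -> p.get().map(f))
                    .collect(Collectors.toList()));
        }

        // Action: forces evaluation, one thread pool task per partition.
        long count() {
            return partitions.parallelStream()
                    .mapToLong(p -> p.get().count())
                    .sum();
        }
    }

Transformations stay lazy and per-partition exactly as in the RDD
model, but there is no task serialization or scheduling involved.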

We plan to offer a Spark-based runner in any case, but I do not think
it can be used as the default runner.

Cheers,
Antonin