Hi,

I am working on revamping the architecture of OpenRefine, an ETL tool, so that it can execute workflows on datasets that do not fit in RAM.
Spark's RDD API is a great fit for the tool's operations and provides everything we need: partitioning and lazy evaluation. However, OpenRefine is a lightweight tool that runs locally, on the user's machine, and we want to preserve this use case. Running Spark in local mode works, but I have read in a couple of places that local mode is only intended for development and testing. This is confirmed by my experience with it so far:

- the overhead added by task serialization and scheduling is significant even in local mode. This makes sense for testing, since you want to test serialization as well, but to run Spark in production locally we would need to bypass serialization, which is not possible as far as I know;

- some bugs that manifest themselves only in local mode do not get much attention (https://issues.apache.org/jira/browse/SPARK-5300), so it seems dangerous to base a production system on it.

So we cannot use Spark as the default runner in the tool. Do you know of any alternative designed for local use? A library that would provide something similar to the RDD API, but parallelizing with threads in the same JVM rather than across machines in a cluster?

If there is no such thing, it should not be too hard to write our own homegrown implementation, which would basically be Java streams with partitioning.

I have looked at Apache Beam's direct runner, but it is also designed for testing, so it does not fit the bill for the same reasons.

We plan to offer a Spark-based runner in any case, but I do not think it can be used as the default runner.

Cheers,
Antonin

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
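P.S. To make the "Java streams with partitioning" idea concrete, here is a rough sketch of what such a homegrown layer could look like. All names here (the `PLL` class and its methods) are hypothetical, not an existing library: the data is split into partitions, transformations like `map` only wrap each partition's stream without evaluating it, and evaluation happens in `collect()` via a parallel stream over the partitions, all in one JVM with no serialization.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical "partitioned lazy list": an RDD-like wrapper over Java
// streams, parallelized with threads in the same JVM (no cluster, no
// task serialization).
public class PLL<T> {
    // One stream supplier per partition; nothing runs until collect().
    private final List<Supplier<Stream<T>>> partitions;

    private PLL(List<Supplier<Stream<T>>> partitions) {
        this.partitions = partitions;
    }

    // Split an in-memory list into roughly equal partitions.
    public static <T> PLL<T> of(List<T> data, int numPartitions) {
        List<Supplier<Stream<T>>> parts = new ArrayList<>();
        int size = data.size();
        int chunk = Math.max(1, (size + numPartitions - 1) / numPartitions);
        for (int i = 0; i < size; i += chunk) {
            List<T> slice = data.subList(i, Math.min(size, i + chunk));
            parts.add(slice::stream);
        }
        return new PLL<>(parts);
    }

    // Lazy transformation: wraps each partition's stream, evaluates nothing.
    public <R> PLL<R> map(Function<T, R> f) {
        List<Supplier<Stream<R>>> mapped = partitions.stream()
            .map(p -> (Supplier<Stream<R>>) () -> p.get().map(f))
            .collect(Collectors.toList());
        return new PLL<>(mapped);
    }

    // Evaluation: partitions are processed by the common fork-join pool;
    // encounter order is preserved by the ordered parallel stream.
    public List<T> collect() {
        return partitions.parallelStream()
            .flatMap(Supplier::get)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        PLL<Integer> pll = PLL.of(List.of(1, 2, 3, 4, 5, 6), 3);
        System.out.println(pll.map(x -> x * 2).collect()); // [2, 4, 6, 8, 10, 12]
    }
}
```

Partitioning beyond what `parallelStream()` gives for free matters once the partitions are backed by file segments rather than an in-memory list, which is what the out-of-RAM use case would need.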