Re: RDD-like API for entirely local workflows?

Antonin Delpeuch (lists) Sat, 04 Jul 2020 09:10:11 -0700

Hi Stephen and Juan,

Thanks to both of you for your replies. You are right, I used the wrong
terminology! Local mode is what fits our needs best (and what I have
been benchmarking so far).


That being said, the problems I mentioned still apply in this context.
There is still a serialization overhead (visible in the web UI), and it
is really noticeable to the user.

For instance, to display the paginated grid in the tool's UI, I need to
run a simple job (filterByRange), and Spark's own overheads account for
about half of the overall execution time.

Intuitively, when running in local mode there should not be any need for
serializing tasks to pass them between threads, so that is what I am
trying to eliminate.
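
As an illustration of that idea, here is a minimal sketch in plain Java
of a partitioned, range-filtered lookup where tasks are handed to worker
threads as ordinary object references, so nothing is serialized. It is
neither Spark's API nor OpenRefine's actual code: the class
LocalRangeGrid and its methods are hypothetical names, and the rows are
assumed to be keyed by consecutive row indices. The range is inclusive
on both ends, as in Spark's filterByRange.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: not Spark's API and not OpenRefine's implementation.
final class LocalRangeGrid {
    // Each partition maps row index -> row content, sorted by index.
    private final List<TreeMap<Long, String>> partitions;

    private LocalRangeGrid(List<TreeMap<Long, String>> partitions) {
        this.partitions = partitions;
    }

    // Assumption: rows get indices 0..n-1 and are split into contiguous chunks.
    static LocalRangeGrid fromRows(List<String> rows, int partitionSize) {
        List<TreeMap<Long, String>> parts = new ArrayList<>();
        for (int start = 0; start < rows.size(); start += partitionSize) {
            TreeMap<Long, String> part = new TreeMap<>();
            int end = Math.min(start + partitionSize, rows.size());
            for (int i = start; i < end; i++) {
                part.put((long) i, rows.get(i));
            }
            parts.add(part);
        }
        return new LocalRangeGrid(parts);
    }

    // Returns the rows whose index lies in [from, to], in index order.
    // Requires from <= to.
    List<String> filterByRange(long from, long to, ExecutorService pool)
            throws Exception {
        List<Future<List<String>>> futures = new ArrayList<>();
        for (TreeMap<Long, String> part : partitions) {
            // Skip partitions whose key range cannot overlap the request.
            if (part.isEmpty() || part.lastKey() < from || part.firstKey() > to) {
                continue;
            }
            // The task is a plain in-memory object shared between threads:
            // no closure or result serialization takes place.
            Callable<List<String>> task =
                () -> new ArrayList<>(part.subMap(from, true, to, true).values());
            futures.add(pool.submit(task));
        }
        List<String> result = new ArrayList<>();
        for (Future<List<String>> future : futures) {
            result.addAll(future.get()); // partitions are visited in order
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            rows.add("row-" + i);
        }
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            LocalRangeGrid grid = LocalRangeGrid.fromRows(rows, 3);
            // Prints [row-2, row-3, row-4, row-5, row-6]
            System.out.println(grid.filterByRange(2, 6, pool));
        } finally {
            pool.shutdown();
        }
    }
}
```

This leaves out everything else Spark provides (lazy evaluation,
spilling to disk, fault tolerance), so it only shows where the
serialization step disappears, not a replacement for the RDD API.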

Regards,
Antonin

On 04/07/2020 17:49, Juan Martín Guillén wrote:
> Hi Antonin.
> 
> It seems you are confusing Standalone with Local mode. They are 2
> different modes.
> 
> From the Spark in Action book: "In local mode, there is only one
> executor in the same client JVM as the driver, but this executor can
> spawn several threads to run tasks. In local mode, Spark uses your
> client process as the single executor in the cluster, and the number of
> threads specified determines how many tasks can be executed in
> parallel."
> 
> I am pretty sure this is the mode your use case is more suited to.
> 
> What you are referring to, I think, is running a Standalone cluster
> locally, which does not make much sense resource-wise and should only
> be considered for testing purposes.
> 
> Running Spark in Local mode is totally fine and supported for
> non-cluster (local) environments.
> 
> Here are the options you can connect your Spark application to:
> https://spark.apache.org/docs/latest/submitting-applications.html#master-urls
> 
> Regards,
> Juan Martín.
> 
> 
> 
> 
> On Saturday, 4 July 2020 12:17:01 ART, Antonin Delpeuch (lists)
> <li...@antonin.delpeuch.eu> wrote:
> 
> 
> Hi,
> 
> I am working on revamping the architecture of OpenRefine, an ETL tool,
> to execute workflows on datasets which do not fit in RAM.
> 
> Spark's RDD API is a great fit for the tool's operations, and provides
> everything we need: partitioning and lazy evaluation.
> 
> However, OpenRefine is a lightweight tool that runs locally, on the
> users' machine, and we want to preserve this use case. Running Spark in
> standalone mode works, but I have read in a couple of places that the
> standalone mode is only intended for development and testing. This is
> confirmed by my experience with it so far:
> - the overhead added by task serialization and scheduling is significant
> even in standalone mode. This makes sense for testing, since you want to
> test serialization as well, but to run Spark in production locally, we
> would need to bypass serialization, which is not possible as far as I know;
> - some bugs that manifest themselves only in local mode are not getting
> a lot of attention (https://issues.apache.org/jira/browse/SPARK-5300) so
> it seems dangerous to base a production system on standalone Spark.
> 
> So, we cannot use Spark as default runner in the tool. Do you know any
> alternative which would be designed for local use? A library which would
> provide something similar to the RDD API, but for parallelization with
> threads in the same JVM, not machines in a cluster?
> 
> If there is no such thing, it should not be too hard to write our
> homegrown implementation, which would basically be Java streams with
> partitioning. I have looked at Apache Beam's direct runner, but it is
> also designed for testing so does not fit our bill for the same reasons.
> 
> We plan to offer a Spark-based runner in any case - but I do not think
> it can be used as the default runner.
> 
> Cheers,
> Antonin
> 
> 
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> <mailto:user-unsubscr...@spark.apache.org>
> 


---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
