[jira] [Created] (SPARK-2389) globally shared SparkContext / shared Spark "application"

Robert Stupp (JIRA) Mon, 07 Jul 2014 07:47:28 -0700

Robert Stupp created SPARK-2389:
-----------------------------------

             Summary: globally shared SparkContext / shared Spark "application"
                 Key: SPARK-2389
                 URL: https://issues.apache.org/jira/browse/SPARK-2389
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
            Reporter: Robert Stupp



The documentation (in Cluster Mode Overview) cites:

bq. Each application gets its own executor processes, which *stay up for the 
duration of the whole application* and run tasks in multiple threads. This has 
the benefit of isolating applications from each other, on both the scheduling 
side (each driver schedules its own tasks) and executor side (tasks from 
different applications run in different JVMs). However, it also means that 
*data cannot be shared* across different Spark applications (instances of 
SparkContext) without writing it to an external storage system.

IMO this is a limitation that should be lifted to support any number of 
--driver-- client processes to share executors and to share (persistent / 
cached) data.

This is especially useful if you have a bunch of frontend servers (dump web app 
servers) that want to use Spark as a _big computing machine_. Most important is 
the fact that Spark is quite good in caching/persisting data in memory / on 
disk thus removing load from backend data stores.

Means: it would be really great to let different --driver-- client JVMs operate 
on the same RDDs and benefit from Spark's caching/persistence.

It would however introduce some administration mechanisms to
* start a shared context
* update the executor configuration (# of worker nodes, # of cpus, etc) on the 
fly
* stop a shared context

Even "conventional" batch MR applications would benefit if ran fequently 
against the same data set.

As an implicit requirement, RDD persistence could get a TTL for its 
materialized state.

With such a feature the overall performance of today's web applications could 
then be increased by adding more web app servers, more spark nodes, more nosql 
nodes etc




--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Created] (SPARK-2389) globally shared SparkContext / shared Spark "application"

Reply via email to