[jira] [Commented] (SPARK-2389) globally shared SparkContext / shared Spark application

2015-01-26 Thread Murat Eken (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291795#comment-14291795
 ] 

Murat Eken commented on SPARK-2389:
---

[~sowen], I think Robert is talking about fault tolerance when he mentions 
scalability. Anyway, as I mentioned in my original comment, Tachyon is not an 
option, at least for us, due to interprocess serialization/deserialization 
costs. Although we haven't tried HDFS, I would be surprised if it performed 
differently.
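
To make the serialization point concrete, here is a minimal Scala sketch (the 
dataset and numbers are illustrative, not from our setup) contrasting the two 
storage paths: data persisted deserialized in executor memory is touched 
directly by every query, whereas serialized storage levels, and a fortiori any 
external store such as Tachyon or HDFS, pay a deserialization cost on every 
access.

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CacheLevels {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cache-levels"))

    // Illustrative dataset; in practice this is the expensive-to-build query data.
    def build() = sc.parallelize(1 to 1000000).map(i => (i % 1024, i.toLong))

    // Fast path: objects stay deserialized in executor memory, so repeated
    // queries touch them with no per-access serialization cost.
    val hot = build().persist(StorageLevel.MEMORY_ONLY)
    hot.count() // materialize the cache

    // Serialized caching (and any external store) deserializes on every access.
    val cold = build().persist(StorageLevel.MEMORY_ONLY_SER)
    cold.count()

    sc.stop()
  }
}
{code}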

 globally shared SparkContext / shared Spark application
 -

 Key: SPARK-2389
 URL: https://issues.apache.org/jira/browse/SPARK-2389
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Robert Stupp

 The documentation (in Cluster Mode Overview) cites:
 bq. Each application gets its own executor processes, which *stay up for the 
 duration of the whole application* and run tasks in multiple threads. This 
 has the benefit of isolating applications from each other, on both the 
 scheduling side (each driver schedules its own tasks) and executor side 
 (tasks from different applications run in different JVMs). However, it also 
 means that *data cannot be shared* across different Spark applications 
 (instances of SparkContext) without writing it to an external storage system.
 IMO this is a limitation that should be lifted to support any number of 
 --driver-- client processes to share executors and to share (persistent / 
 cached) data.
 This is especially useful if you have a bunch of frontend servers (dumb web 
 app servers) that want to use Spark as a _big computing machine_. Most 
 important is the fact that Spark is quite good in caching/persisting data in 
 memory / on disk thus removing load from backend data stores.
 Means: it would be really great to let different --driver-- client JVMs 
 operate on the same RDDs and benefit from Spark's caching/persistence.
 It would however introduce some administration mechanisms to
 * start a shared context
 * update the executor configuration (# of worker nodes, # of cpus, etc) on 
 the fly
 * stop a shared context
 Even conventional batch MR applications would benefit if run frequently 
 against the same data set.
 As an implicit requirement, RDD persistence could get a TTL for its 
 materialized state.
 With such a feature the overall performance of today's web applications could 
 then be increased by adding more web app servers, more spark nodes, more 
 nosql nodes etc
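
As a rough illustration of what the requested feature could look like from a 
client JVM, here is a hypothetical interface sketch. Nothing below exists in 
Spark; the trait, method names, and parameters are invented purely to restate 
the ticket's wish list (attach to a shared context, TTL on persisted RDDs, 
on-the-fly executor administration) as code.

{code:scala}
import scala.concurrent.duration.Duration
import org.apache.spark.rdd.RDD

// Hypothetical sketch only -- none of these types or methods exist in Spark.
trait SharedSparkContext {
  // Any number of client JVMs share one long-lived context: an RDD built and
  // cached by one client is visible to all of them under a registered name.
  def getOrCreateRDD[T](name: String)(build: => RDD[T]): RDD[T]

  // "RDD persistence could get a TTL for its materialized state."
  def persistWithTTL[T](rdd: RDD[T], ttl: Duration): RDD[T]

  // Administration mechanisms from the ticket: resize or stop the shared
  // context on the fly.
  def updateExecutors(numExecutors: Int, coresPerExecutor: Int): Unit
  def stopShared(): Unit
}

object SharedSparkContext {
  // Clients would attach to an existing shared context by name instead of
  // constructing their own SparkContext (and their own executors).
  def attach(master: String, name: String): SharedSparkContext = ???
}
{code}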






[jira] [Commented] (SPARK-2389) globally shared SparkContext / shared Spark application

2015-01-26 Thread Murat Eken (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291724#comment-14291724
 ] 

Murat Eken commented on SPARK-2389:
---

+1. We're using a Spark cluster as a real-time query engine, and unfortunately 
we're running into the same issues Robert mentions. Although Spark provides 
plenty of mechanisms for making the cluster itself fault-tolerant and 
resilient, we need the same resilience for the front layer from which the 
Spark cluster is accessed: multiple Spark client instances, and hence multiple 
SparkContexts, connecting to the same cluster and sharing the same computing 
power.

Performance is crucial for us, hence our choice to cache the data in memory 
and use the executors' full hardware resources. Alternative solutions, such as 
keeping the data in Tachyon or restarting executors for each query, just don't 
give the same performance. We're looking into 
https://github.com/spark-jobserver/spark-jobserver, but that's not a proper 
solution either, as we would still have the jobserver as a single point of 
failure in our setup, which is a problem for us.

I'd appreciate it if a Spark developer could give some information about the 
feasibility of this change request; if this turns out to be difficult or even 
impossible due to the choices made in the architecture, it would be good to 
know that so that we can consider our alternatives.




[jira] [Commented] (SPARK-2389) globally shared SparkContext / shared Spark application

2015-01-26 Thread Murat Eken (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291755#comment-14291755
 ] 

Murat Eken commented on SPARK-2389:
---

Yes [~sowen], it's about HA for the driver. Our approach is to have a single 
app that initializes the cache at start-up (quite expensive) and then serves 
queries on that cached data (very fast).
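
For context, a minimal sketch of that single-app approach (the input path, 
record shape, and query are placeholders, not our actual code): one 
long-running driver builds and caches the data once at start-up, then answers 
every query against the cached RDD. If this one driver JVM dies, the cache and 
the service die with it, which is exactly the HA gap discussed here.

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object QueryEngine {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("query-engine"))

    // Expensive start-up step: build the dataset once and keep it
    // deserialized in executor memory for the lifetime of the app.
    val records = sc.textFile("hdfs:///data/events")   // placeholder path
      .map(_.split('\t'))
      .map(fields => (fields(0), fields(1).toLong))
      .persist(StorageLevel.MEMORY_ONLY)
    records.count()                                    // force materialization

    // Fast path: each incoming query runs against the already-cached RDD.
    // In the real system this would sit behind an RPC/HTTP layer.
    def totalFor(key: String): Long =
      records.filter(_._1 == key).map(_._2).sum().toLong

    println(totalFor("example-key"))
    sc.stop()
  }
}
{code}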

When you mention that N front-ends talking to a process built around one 
long-running Spark app can be done right now, are you referring to something 
like the spark-jobserver (or an alternative) I mentioned? If so, the problem 
with that is still the single point of failure: we're just moving it from the 
driver to the jobserver instance. Or is there something else we've missed?
