Re: Multitenancy in Spark - within/across spark context

2014-10-25 Thread RJ Nowling
Ashwin,

What is your motivation for needing to share RDDs between jobs? Optimizing
for reusing data across jobs?

If so, you may want to look into Tachyon. My understanding is that Tachyon
acts as a caching layer: you can designate data that will be reused in
multiple jobs so it knows to keep that data in memory or on local disk for
faster access. But my knowledge of Tachyon is secondhand, so forgive me if
I have it wrong :)
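
For concreteness, a minimal sketch of what that usage looked like on a
Spark 1.x build with Tachyon configured (untested; the storage level
behavior, config key, and tachyon:// host/paths below are assumptions on
my part, not something verified in this thread):

    import org.apache.spark.storage.StorageLevel

    // In Spark 1.x, OFF_HEAP storage is backed by Tachyon (pointed at
    // via spark.tachyonStore.url), so cached blocks live outside the
    // executor JVMs and can outlive them.
    val shared = sc.textFile("hdfs:///data/events")
      .persist(StorageLevel.OFF_HEAP)

    // Alternatively, write through Tachyon's filesystem so that other
    // Spark apps can read the data back quickly; host and path are
    // placeholders.
    shared.saveAsTextFile("tachyon://tachyon-master:19998/shared/events")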

RJ


Re: Multitenancy in Spark - within/across spark context

2014-10-23 Thread Jianshi Huang
Upvote for the multitenancy requirement.

I'm also building a data analytics platform and there'll be multiple users
running queries and computations simultaneously. One of the pain points is
control of resource size. Users don't really know how many nodes they need,
so they always use as much as possible... The result is a lot of wasted
resources in our Yarn cluster.

A way to 1) allow multiple Spark contexts to share the same resources or 2)
add dynamic resource management for Yarn mode is very much wanted.

Jianshi


-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
GitHub & Blog: http://huangjs.github.com/


Re: Multitenancy in Spark - within/across spark context

2014-10-23 Thread Marcelo Vanzin
You may want to take a look at https://issues.apache.org/jira/browse/SPARK-3174.

-- 
Marcelo




Re: Multitenancy in Spark - within/across spark context

2014-10-23 Thread Evan Chan
Ashwin,

I would say the strategies in general are:

1) Have each user submit a separate Spark app (each with its own
SparkContext) and its own resource settings, and share data through HDFS
or something like Tachyon for speed.

2) Share a single SparkContext amongst multiple users, using the fair
scheduler (see the sketch after this list). This is sort of like having a
Hadoop resource pool. It has some obvious HA/SPOF issues, namely that if
the context dies then every user using it is also dead. Also, sharing RDDs
in cached memory has the same resiliency problem: if any executor dies
then Spark must recompute/rebuild the RDD (it tries to rebuild only the
missing part, but sometimes it must rebuild everything).
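
For strategy 2, a minimal sketch of the plumbing involved (the app name,
pool name, and allocation file path are illustrative, not from any real
deployment):

    import org.apache.spark.{SparkConf, SparkContext}

    // One shared context, with FAIR scheduling between concurrent jobs.
    val conf = new SparkConf()
      .setAppName("shared-context")
      .set("spark.scheduler.mode", "FAIR")
      // Optional: pools with weights/minShare defined in an XML file.
      .set("spark.scheduler.allocation.file",
           "/etc/spark/fairscheduler.xml")
    val sc = new SparkContext(conf)

    // Run each user's jobs on their own thread, tagged with a pool so
    // the fair scheduler can arbitrate between users.
    sc.setLocalProperty("spark.scheduler.pool", "user_a")
    sc.textFile("hdfs:///data/user_a/input").count()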

Job server can help with either 1 or 2, 2 in particular. If you have any
questions about job server, feel free to ask at the spark-jobserver
Google group. I am the maintainer.

-Evan





Multitenancy in Spark - within/across spark context

2014-10-22 Thread Ashwin Shankar
Hi Spark devs/users,
One of the things we are investigating here at Netflix is whether Spark
would suit us for our ETL needs, and one of the requirements is
multitenancy. I did read the official doc
(http://spark.apache.org/docs/latest/job-scheduling.html) and the book,
but I'm still not clear on certain things.

Here are my questions:

1. Sharing a Spark context: How exactly can multiple users share the
cluster using the same Spark context? UserA wants to run AppA, UserB wants
to run AppB. How do they talk to the same context? How exactly are each of
their jobs scheduled and run in the same context? Is preemption supported
in this scenario? How are user names passed on to the Spark context?

2. Different Spark contexts in YARN: assuming I have a YARN cluster with
queues and preemption configured, are there problems if
executors/containers of a Spark app are preempted to allow a high-priority
Spark app to execute? Would the preempted app get stuck, or would it
continue to make progress? How are user names passed on from Spark to YARN
(say I'm using the nested user queues feature in the fair scheduler)?

3. Sharing RDDs in 1 and 2 above?

4. Anything else about user/job isolation?

I know I'm asking a lot of questions. Thanks in advance :) !

-- 
Thanks,
Ashwin
Netflix


Re: Multitenancy in Spark - within/across spark context

2014-10-22 Thread Marcelo Vanzin
Hi Ashwin,

Let me try to answer to the best of my knowledge.

On Wed, Oct 22, 2014 at 11:47 AM, Ashwin Shankar
ashwinshanka...@gmail.com wrote:
 Here are my questions:
 1. Sharing a Spark context: How exactly can multiple users share the
 cluster using the same Spark context?

That's not something you might want to do usually. In general, a
SparkContext maps to a user application, so each user would submit
their own job which would create its own SparkContext.

If you want to go outside of Spark, there are projects that allow you
to manage SparkContext instances outside of applications and
potentially share them, such as
https://github.com/spark-jobserver/spark-jobserver. But be sure you
actually need it - since you haven't really explained the use case,
it's not very clear.

 2. Different Spark contexts in YARN: assuming I have a YARN cluster with
 queues and preemption configured, are there problems if
 executors/containers of a Spark app are preempted to allow a
 high-priority Spark app to execute?

As far as I understand, this will cause executors to be killed, which
means that Spark will start retrying tasks to rebuild the data that
was held by those executors when needed. Yarn mode does have a
configurable upper limit on the number of executor failures, so if
your job keeps getting preempted it will eventually fail (unless you
tweak the settings).

I don't recall whether Yarn has an API to cleanly allow clients to
stop executors when preempted, but even if it does, I don't think
that's supported in Spark at the moment.

 How are user names passed on from Spark to YARN (say I'm
 using the nested user queues feature in the fair scheduler)?

Spark will try to run the job as the requesting user; if you're not
using Kerberos, that means the processes themselves will run as
whatever user runs the Yarn daemons, but the Spark app will be run
inside a UserGroupInformation.doAs() call as the requesting user. So
technically nested queues should work as expected.
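
To illustrate what that doAs pattern looks like (this is plain Hadoop
API, shown only as an illustration of the mechanism, not something you
call through Spark; the user name is made up):

    import java.security.PrivilegedExceptionAction
    import org.apache.hadoop.security.UserGroupInformation

    val ugi = UserGroupInformation.createRemoteUser("ashwin")
    ugi.doAs(new PrivilegedExceptionAction[Unit] {
      override def run(): Unit = {
        // Work performed here is attributed to "ashwin", e.g. for
        // placement into YARN's nested user queues.
      }
    })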

 3. Sharing RDDs in 1 and 2 above?

I'll assume you don't mean actually sharing RDDs in the same context,
but between different SparkContext instances. You might (big might
here) be able to checkpoint an RDD from one context and load it from
another context; that's actually how some HA-like features for Spark
drivers are being addressed.

The job server I mentioned before, which allows different apps to
share the same Spark context, has a feature to share RDDs by name,
also, without having to resort to checkpointing.
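
As a rough sketch of that handoff, using a plain save/reload through
shared storage rather than checkpointing (the context names, path, and
element type below are made up for illustration):

    // In the producing application, with its SparkContext scA:
    val rdd = scA.parallelize(1 to 1000000)
      .map(i => (i.toString, i.toLong))
    rdd.saveAsObjectFile("hdfs:///shared/handoff/events")

    // In a separate application, with its own SparkContext scB:
    val reloaded =
      scB.objectFile[(String, Long)]("hdfs:///shared/handoff/events")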

Hope this helps!

-- 
Marcelo




Re: Multitenancy in Spark - within/across spark context

2014-10-22 Thread Ashwin Shankar
Thanks Marcelo, that was helpful ! I had some follow up questions :

That's not something you might want to do usually. In general, a
 SparkContext maps to a user application

My question was basically this. In this page in the official doc
(http://spark.apache.org/docs/latest/job-scheduling.html), under the
Scheduling within an application section, it talks about multiuser and
fair sharing within an app. How does multiuser within an application work
(how do users connect to an app and run their jobs)? When would I want to
use this?

As far as I understand, this will cause executors to be killed, which
 means that Spark will start retrying tasks to rebuild the data that
 was held by those executors when needed.

I basically wanted to find out if there were any gotchas related to
preemption on Spark. For example, say half of an application's executors
get preempted while doing a reduceByKey: will the application progress
with the remaining resources/fair share?

I'm new to Spark, sorry if I'm asking something very obvious :).

Thanks,
Ashwin


-- 
Thanks,
Ashwin


Re: Multitenancy in Spark - within/across spark context

2014-10-22 Thread Marcelo Vanzin
On Wed, Oct 22, 2014 at 2:17 PM, Ashwin Shankar
ashwinshanka...@gmail.com wrote:
 That's not something you might want to do usually. In general, a
 SparkContext maps to a user application

 My question was basically this. In this page in the official doc, under
 the Scheduling within an application section, it talks about multiuser
 and fair sharing within an app. How does multiuser within an application
 work (how do users connect to an app and run their jobs)? When would I
 want to use this?

I see. The way I read that page is that Spark supports all those
scheduling options, but Spark doesn't give you the means to submit jobs
from different users to a running SparkContext hosted in a different
process. For that, you'll need something like the job server that I
referenced before, or to write your own framework for supporting that.

Personally, I'd use the information on that page when dealing with
concurrent jobs in the same SparkContext, but still restricted to the
same user. I'd avoid trying to create any application where a single
SparkContext is trying to be shared by multiple users in any way.

 As far as I understand, this will cause executors to be killed, which
 means that Spark will start retrying tasks to rebuild the data that
 was held by those executors when needed.

 I basically wanted to find out if there were any gotchas related to
 preemption on Spark. For example, say half of an application's executors
 get preempted while doing a reduceByKey: will the application progress
 with the remaining resources/fair share?

Jobs should still make progress as long as at least one executor is
available. The gotcha would be the one I mentioned, where Spark will
fail your job after x executor failures, which might be a common
occurrence when preemption is enabled. That being said, it's a
configurable option, so you can set x to a very large value and your
job should keep on chugging along.

The options you'd want to take a look at are spark.task.maxFailures
and spark.yarn.max.executor.failures
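
For example, something along these lines (the values are arbitrary; tune
them to how aggressive preemption is on your cluster):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // How many times a single task may fail before the job is failed.
      .set("spark.task.maxFailures", "32")
      // YARN mode: how many executor failures to tolerate before the
      // application is failed.
      .set("spark.yarn.max.executor.failures", "200")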

-- 
Marcelo
