Re: Multiple Spark Applications that use Cassandra, how to share resources/nodes

2016-05-04 Thread Alonso Isidoro Roman
Andy, I think there are some ideas about implementing a pool of Spark contexts,
but for now it is only an idea.


https://github.com/spark-jobserver/spark-jobserver/issues/365


It is possible to share a Spark context between applications, but I have not had
to use that feature myself, sorry about that.

Regards,

Alonso



Alonso Isidoro Roman.

My favorite quotes (today):
"If debugging is the process of removing software bugs, then programming
must be the process of putting them in..."
  - Edsger Dijkstra

"If you pay peanuts you get monkeys"



Re: Multiple Spark Applications that use Cassandra, how to share resources/nodes

2016-05-04 Thread Tobias Eriksson
Hi Andy,
 We have a very simple approach, I think; we do it like this:

  1.  Submit our Spark application to the Spark Master (version 1.6.1)
  2.  Our application creates a Spark Context that we use throughout
  3.  We use a Spray REST server
  4.  Every request that comes in is served by querying Cassandra, doing some 
joins and some processing, and returning JSON as the result on the REST API
  5.  We take advantage of co-locating the Spark Workers with the Cassandra 
Nodes to “boost” performance (in our test lab we have a 4 node cluster); see 
the sketch below
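
To make that concrete, here is a minimal sketch of what such a setup can look 
like, assuming Spark 1.6 with spray-routing and the DataStax 
spark-cassandra-connector on the classpath. It is not our actual code; the 
host, keyspace, table and endpoint names are placeholders.

  import akka.actor.ActorSystem
  import com.datastax.spark.connector._
  import org.apache.spark.{SparkConf, SparkContext}
  import spray.routing.SimpleRoutingApp

  object RestOnSpark extends App with SimpleRoutingApp {

    // One long-lived SparkContext, created once at start-up and reused by every request.
    val sc = new SparkContext(
      new SparkConf()
        .setAppName("cassandra-rest")
        .set("spark.cassandra.connection.host", "cassandra-host-1")) // placeholder host

    implicit val system = ActorSystem("rest-on-spark")

    startServer(interface = "0.0.0.0", port = 8080) {
      path("customers") {
        get {
          complete {
            // Each incoming request runs a small Spark job against the shared context.
            val rows = sc.cassandraTable("my_keyspace", "customers") // placeholder keyspace/table
              .limit(10)
              .collect()
            // A real service would marshal proper JSON; a plain string is enough for a sketch.
            rows.map(_.toMap.toString).mkString("[", ",\n", "]")
          }
        }
      }
    }
  }

The important point is item 2 above: the context is created once and every 
request reuses it, so a request only pays for its own Spark job, not for 
starting a new application.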

Performance-wise we have had some challenges, but that had to do with how the 
data was arranged in Cassandra; after changing to the time-series design 
pattern we improved our performance dramatically, 750 times in our test lab.
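
Just to illustrate what I mean by the time-series design pattern (our real 
schema is different, and the names below are only placeholders): the data is 
partitioned by an entity plus a time bucket and clustered by time, so each 
query reads one small, contiguous slice. A rough sketch using the connector's 
CassandraConnector helper, assuming the keyspace already exists:

  import com.datastax.spark.connector.cql.CassandraConnector
  import org.apache.spark.SparkConf

  object CreateTimeSeriesTable extends App {
    val conf = new SparkConf().set("spark.cassandra.connection.host", "cassandra-host-1") // placeholder

    // One partition per (account, day); rows inside a partition are ordered by event time,
    // so "give me yesterday's events for account X" touches a single partition.
    CassandraConnector(conf).withSessionDo { session =>
      session.execute(
        """CREATE TABLE IF NOT EXISTS my_keyspace.events_by_day (
          |  account_id text,
          |  day        text,
          |  event_time timestamp,
          |  payload    text,
          |  PRIMARY KEY ((account_id, day), event_time)
          |) WITH CLUSTERING ORDER BY (event_time DESC)""".stripMargin)
    }
  }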

But now the problem is that we have more Spark applications running 
concurrently/in parallel, and we are then forced to scale down the number of 
cores that OUR application can use, to ensure that we give way for other 
applications to come in and “play” too. This is not optimal, because if there 
are free resources then I would like to use them.

When it comes to load balancing the REST requests, in my case I will not have 
that many clients, yet I think that I could scale by adding multiple instances 
of my Spark application, although I would obviously suffer from having to share 
the resources (say cores) between the different Spark Workers. Or I would have 
to use dynamic resourcing.
But, as I said when I started out my question here, this is where I struggle: I 
need to get the sharing of resources right.
This is a challenge since I rely on the fact that I HAVE TO co-locate the Spark 
Workers and the Cassandra Nodes, meaning that I cannot have 3 out of 4 nodes, 
because then the Cassandra access will not be efficient, since I use 
repartitionByCassandraReplica()
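
For context, this is roughly how repartitionByCassandraReplica() is used (a 
sketch only, with placeholder keyspace, table and key column): each lookup key 
is shuffled to a Spark partition on a node that owns that key's Cassandra 
replica, and the following join then reads from the local node, which is 
exactly why the Workers have to sit on the Cassandra nodes.

  import com.datastax.spark.connector._
  import org.apache.spark.{SparkConf, SparkContext}

  // The case class field name must match the table's partition key column ("id" as a placeholder).
  case class CustomerKey(id: Int)

  object LocalityJoinSketch {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(
        new SparkConf()
          .setAppName("locality-join")
          .set("spark.cassandra.connection.host", "cassandra-host-1")) // placeholder host

      val keys = sc.parallelize(1 to 1000).map(CustomerKey(_))

      val joined = keys
        // Move each key to a Spark partition on a node that owns that key's Cassandra replica ...
        .repartitionByCassandraReplica("my_keyspace", "customers", 10)
        // ... so this join reads from the local Cassandra node instead of going across the network.
        .joinWithCassandraTable("my_keyspace", "customers")

      joined.take(5).foreach(println)
      sc.stop()
    }
  }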

Satisfying 250 ms requests, well, that depends very much on your use case I 
would say; a boring answer, sorry :-(

Regards
 Tobias





RE: Multiple Spark Applications that use Cassandra, how to share resources/nodes

2016-05-03 Thread Mohammed Guller
You can run multiple Spark applications simultaneously. Just limit the # of 
cores and memory allocated to each application. For example, if each node has 8 
cores and there are 10 nodes and you want to be able to run 4 applications 
simultaneously, limit the # of cores for each application to 20. Similarly, you 
can limit the amount of memory that an application can use on each node.
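
As a sketch (using standard Spark 1.6 properties; the 20 cores mirror the 
example above, and the memory figure is only illustrative), the caps can be set 
on the application's SparkConf, or equivalently passed to spark-submit via 
--total-executor-cores and --executor-memory:

  import org.apache.spark.{SparkConf, SparkContext}

  object CappedApp {
    def main(args: Array[String]): Unit = {
      // Cap this application at 20 cores across the cluster and 2 GB per executor,
      // leaving headroom for the other applications sharing the same cluster.
      val conf = new SparkConf()
        .setAppName("capped-app")
        .set("spark.cores.max", "20")        // total cores this application may use
        .set("spark.executor.memory", "2g")  // memory per executor (per worker node in standalone mode)
      val sc = new SparkContext(conf)
      // ... run jobs ...
      sc.stop()
    }
  }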

You can also use dynamic resource allocation.
Details are here: 
http://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
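
A sketch of the relevant properties (the executor bounds are only illustrative; 
in standalone mode the external shuffle service must also be running on each 
worker):

  import org.apache.spark.SparkConf

  object DynamicAllocationConf {
    // Instead of pinning a fixed share of the cluster, let Spark grow and shrink this
    // application's executor count with load.
    def apply(): SparkConf = new SparkConf()
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")      // external shuffle service required
      .set("spark.dynamicAllocation.minExecutors", "1")  // illustrative lower bound
      .set("spark.dynamicAllocation.maxExecutors", "4")  // illustrative upper bound
  }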

Mohammed
Author: Big Data Analytics with Spark 
<http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>





Re: Multiple Spark Applications that use Cassandra, how to share resources/nodes

2016-05-03 Thread Andy Davidson
Hi Tobias

I am very interested in implementing a REST-based API on top of Spark. My 
REST-based system would make predictions from data provided in the request, 
using models trained in batch. My SLA is 250 ms.

Would you mind sharing how you implemented your rest server?

I am using spark-1.6.1. I have several unit tests that create a Spark context 
with the master set to 'local[4]'. I do not think the unit test framework is 
going to scale. Can each REST server have a pool of Spark contexts?
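
My tests currently do something like the following (a simplified sketch, not 
the actual code), with one lazily created local[4] context shared across tests. 
As far as I understand, Spark 1.6 only allows one active SparkContext per JVM 
unless spark.driver.allowMultipleContexts is set (which is aimed at tests), 
which is why I wonder how a pool of contexts would work:

  import org.apache.spark.{SparkConf, SparkContext}

  object SharedTestSpark {
    // One shared context for the whole test JVM, created on first use.
    lazy val sc: SparkContext = new SparkContext(
      new SparkConf()
        .setMaster("local[4]")
        .setAppName("shared-test-context"))

    def stop(): Unit = sc.stop()
  }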


The system I would like to replace is set up as follows:

Layer of dumb load balancers: l1, l2, l3
Layer of proxy servers:   p1, p2, p3, p4, p5, … Pn
Layer of containers:  c1, c2, c3, … Cn

Where Cn is much larger than Pn


Kind regards

Andy

P.S. There is a talk on 5/5 about Spark 2.0. Hoping there is something in the
near future.
https://www.brighttalk.com/webcast/12891/202021?utm_campaign=google-calendar_content=_source=brighttalk-portal_medium=calendar_term=





Multiple Spark Applications that use Cassandra, how to share resources/nodes

2016-05-03 Thread Tobias Eriksson
Hi
 We are using Spark for a long-running job; in fact, it is a REST server that 
does some joins with some tables in Cassandra and returns the result.
Now we need to have multiple applications running in the same Spark cluster, 
and from what I understand this is not possible, or should I say somewhat 
complicated:

  1.  A Spark application takes all the resources / nodes in the cluster (we 
have 4 nodes, one for each Cassandra Node)
  2.  A Spark application returns its resources when it is done (exits, or the 
context is closed/returned)
  3.  Sharing resources using Mesos only allows scaling down and then scaling 
up by a step-by-step policy, i.e. 2 nodes, 3 nodes, 4 nodes, … increasing as 
the need increases

But if this is true, I cannot have several applications running in parallel, 
is that true?
If I use Mesos, then the whole idea with one Spark Worker per Cassandra Node 
fails, as it talks directly to a node, and that is how it is so efficient.
In this case I need all nodes, not 3 out of 4.

Any mistakes in my thinking?
Any ideas on how to solve this? It should be a common problem, I think.

-Tobias