Re: heterogeneous cluster setup

2014-12-04 Thread Victor Tso-Guillen
To reiterate, it's very important for Spark's workers to have the same
memory available. Think about Spark uniformly chopping up your data and
distributing the work to the nodes. The algorithm is not designed to
consider that a worker has less memory available than some other worker.

On Thu, Dec 4, 2014 at 12:11 AM, rapelly kartheek kartheek.m...@gmail.com
wrote:


 *It's very important for Spark's workers to have the same resources
 available*

 So each worker should have the same amount of memory and the same number
 of cores. Heterogeneity of the cluster in the physical layout of the CPUs
 is understandable, but what about heterogeneity with respect to memory?

 On Thu, Dec 4, 2014 at 12:18 PM, Victor Tso-Guillen v...@paxata.com
 wrote:

 You'll have to decide which resource is more expensive in your heterogeneous
 environment and optimize for the utilization of that. For example, you may
 decide that memory is the only cost factor and you can discount the
 number of cores. Then you could have 8GB on each worker, each with four
 cores. Note that cores in Spark don't necessarily map to cores on the
 machine; it's just a configuration setting for how many simultaneous tasks
 that worker can work on.
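
 A minimal sketch of that setting in standalone mode's conf/spark-env.sh
 (values are illustrative, not from this thread):

   # Advertise 8 task slots even on a 4-core machine; Spark will
   # happily run 8 simultaneous tasks on this worker.
   export SPARK_WORKER_CORES=8
   # Total memory this worker offers to executors.
   export SPARK_WORKER_MEMORY=8g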

 You are right that each executor gets the same amount of resources, and I
 would add the same level of parallelization. Your heterogeneity is in the
 physical layout of your cluster, not in how Spark treats the workers as
 resources. It's very important for Spark's workers to have the same
 resources available because Spark needs to be able to generically divide
 and conquer your data amongst all those workers.

 Hope that helps,
 Victor

 On Wed, Dec 3, 2014 at 10:04 PM, rapelly kartheek 
 kartheek.m...@gmail.com wrote:

 Thank you so much for the valuable reply, Victor. That's a very clear
 solution; I understood it.

 Right now I have nodes with:
 16GB RAM, 4 cores; 8GB RAM, 4 cores; and 8GB RAM, 2 cores. From my
 understanding, the division could be something like this: each executor
 gets 2 cores and 6GB RAM. So, the ones with 16GB RAM and 4 cores can have
 two executors. Please let me know if my understanding is correct.
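
 A sketch of how that layout could be declared per node, assuming the
 standalone SPARK_WORKER_* settings in conf/spark-env.sh
 (SPARK_WORKER_INSTANCES starts several workers on one machine):

   # On a 16GB / 4-core node: two workers, each offering 6GB and 2 cores
   export SPARK_WORKER_INSTANCES=2
   export SPARK_WORKER_MEMORY=6g
   export SPARK_WORKER_CORES=2

   # On the 8GB nodes: one such worker
   # export SPARK_WORKER_INSTANCES=1  (with the same MEMORY and CORES)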

 But I am not able to see any heterogeneity in this setting, as each
 executor gets the same amount of resources. Can you please clarify this
 doubt?

 Regards
 Karthik

 On Wed, Dec 3, 2014 at 11:11 PM, Victor Tso-Guillen v...@paxata.com
 wrote:

 I don't have a great answer for you. For us, we found a common divisor,
 not necessarily a whole number of gigabytes, of the available memory of
 the different hardware and used that as the amount of memory per worker,
 and we scaled the number of cores accordingly so that every core in the
 system has the same amount of memory. The quotient of the available
 memory and the common divisor, hopefully a whole number to reduce waste,
 was the number of workers we spun up. Therefore, if you have 64G, 30G,
 and 15G of available memory on your machines, the divisor could be 15G
 and you'd have 4, 2, and 1 workers per machine, respectively. Every
 worker on all the machines would have the same number of cores, set to
 what you think is a good value.
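
 In standalone-mode terms, that scheme might look like this per machine (a
 sketch with assumed values, using the SPARK_WORKER_* settings in
 conf/spark-env.sh):

   # 64G machine: 4 workers of 15G each
   export SPARK_WORKER_INSTANCES=4
   export SPARK_WORKER_MEMORY=15g
   export SPARK_WORKER_CORES=4    # same core count on every machine

   # 30G machine: SPARK_WORKER_INSTANCES=2; 15G machine:
   # SPARK_WORKER_INSTANCES=1, with the same MEMORY and CORES values.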

 Hope that helps.

 On Wed, Dec 3, 2014 at 7:44 AM, kartheek.m...@gmail.com wrote:

 Hi Victor,

 I want to set up a heterogeneous stand-alone Spark cluster. I have
 hardware with different memory sizes and varied numbers of cores per
 node. I could get all the nodes active in the cluster only when the
 memory per executor is set to the smallest memory size available on any
 node, and the number of cores per executor is likewise set to the
 smallest core count. As of now, I configure one executor per node.
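
 For concreteness, that lowest-common-denominator setup corresponds to a
 conf/spark-env.sh like this on every node, including the larger ones
 (illustrative values, assuming 8GB / 2 cores is the smallest node):

   export SPARK_WORKER_MEMORY=8g   # least memory across all nodes
   export SPARK_WORKER_CORES=2     # least core count across all nodes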

 Can you please suggest some path to set up a stand-alone heterogeneous
 cluster such that I can efficiently use the available hardware?

 Thank you




 _
 Sent from http://apache-spark-user-list.1001560.n3.nabble.com