Re: heterogeneous cluster setup
To reiterate, it's very important for Spark's workers to have the same memory available. Think of Spark as uniformly chopping up your data and distributing the work to the nodes. The algorithm is not designed to account for one worker having less memory available than another.

On Thu, Dec 4, 2014 at 12:11 AM, rapelly kartheek kartheek.m...@gmail.com wrote:

*It's very important for Spark's workers to have the same resources available*

So, each worker should have the same amount of memory and the same number of cores. Heterogeneity in the physical layout of the CPUs is understandable, but what about heterogeneity with respect to memory?

On Thu, Dec 4, 2014 at 12:18 PM, Victor Tso-Guillen v...@paxata.com wrote:

You'll have to decide which resource is more expensive in your heterogeneous environment and optimize for its utilization. For example, you may decide that memory is the only cost factor and discount the number of cores. Then you could have 8GB on each worker, each with four cores. Note that cores in Spark don't necessarily map to physical cores on the machine; a worker's core count is just a configuration setting for how many simultaneous tasks that worker can work on.

You are right that each executor gets the same amount of resources, and I would add the same level of parallelization. Your heterogeneity is in the physical layout of your cluster, not in how Spark treats the workers as resources. It's very important for Spark's workers to have the same resources available, because Spark needs to be able to generically divide and conquer your data amongst all of those workers.

Hope that helps,
Victor

On Wed, Dec 3, 2014 at 10:04 PM, rapelly kartheek kartheek.m...@gmail.com wrote:

Thank you so much for the valuable reply, Victor. That's a very clear solution.

Right now I have nodes with: 16GB RAM, 4 cores; 8GB RAM, 4 cores; 8GB RAM, 2 cores. From my understanding, the division could be something like: each executor gets 2 cores and 6GB RAM.
So, the ones with 16GB RAM and 4 cores can have two executors. Please let me know if my understanding is correct. But I am not able to see any heterogeneity in this setting, since each executor gets the same amount of resources. Can you please clarify this doubt?

Regards,
Karthik

On Wed, Dec 3, 2014 at 11:11 PM, Victor Tso-Guillen v...@paxata.com wrote:

I don't have a great answer for you. For us, we found a common divisor (not necessarily a whole gigabyte) of the available memory of the different hardware, used that as the amount of memory per worker, and scaled the number of cores accordingly so that every core in the system has the same amount of memory. The quotient of the available memory and the common divisor (hopefully a whole number, to reduce waste) was the number of workers we spun up. Therefore, if you have 64G, 30G, and 15G of available memory on your machines, the divisor could be 15G and you'd have 4, 2, and 1 workers per machine, respectively. Every worker on all the machines would have the same number of cores, set to whatever you think is a good value.

Hope that helps.

On Wed, Dec 3, 2014 at 7:44 AM, kartheek.m...@gmail.com wrote:

Hi Victor,

I want to set up a heterogeneous stand-alone Spark cluster. I have hardware with different memory sizes and a varied number of cores per node. I could get all the nodes active in the cluster only when the memory per executor is set to the smallest available memory size across all nodes, and likewise for the number of cores per executor. As of now, I configure one executor per node. Can you please suggest a way to set up a stand-alone heterogeneous cluster so that I can efficiently use the available hardware?

Thank you

_
Sent from http://apache-spark-user-list.1001560.n3.nabble.com
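The common-divisor sizing Victor describes can be sketched as plain arithmetic (this is not a Spark API, just a small illustrative script; the machine names and the 64G/30G/15G memory figures are the hypothetical example from his reply):

```python
def plan_workers(available_gb, divisor_gb):
    """Return (workers, wasted_gb): how many equally sized workers a
    machine can host, and how much memory is left unused."""
    workers = available_gb // divisor_gb
    waste = available_gb - workers * divisor_gb
    return workers, waste

# Hypothetical fleet from the example: available memory in GB per machine.
machines = {"node-a": 64, "node-b": 30, "node-c": 15}
divisor_gb = 15  # chosen per-worker memory (the common divisor)

for name, mem in machines.items():
    workers, waste = plan_workers(mem, divisor_gb)
    print(f"{name}: {workers} worker(s) of {divisor_gb}GB each, {waste}GB unused")
```

With a 15G divisor this yields 4, 2, and 1 workers, matching the reply; note that the 64G machine leaves 4G unused, which is why a divisor that divides every machine's memory evenly reduces waste.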
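On a stand-alone cluster, per-machine worker sizing like the above is typically expressed in each machine's conf/spark-env.sh. A sketch for the 16GB, 4-core node from Karthik's example, assuming the 2-core/6GB-per-executor split he proposes (values are illustrative, not a recommendation):

```shell
# conf/spark-env.sh on the 16GB, 4-core node (illustrative values)
export SPARK_WORKER_INSTANCES=2  # run two workers on this machine
export SPARK_WORKER_CORES=2      # cores offered by each worker
export SPARK_WORKER_MEMORY=6g    # memory offered by each worker
```

The 8GB nodes would each run a single worker with the same SPARK_WORKER_CORES and SPARK_WORKER_MEMORY, so that every worker in the cluster offers identical resources, which is the invariant the thread keeps returning to.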