It is definitely possible to run multiple workers on a single node, with each worker claiming the node's full number of cores (e.g. if you have 8 cores and 2 workers, you'd have 16 cores per node). I don't know whether it's possible with the out-of-the-box scripts, though.
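For reference, the combination being described (multiple workers per node, each still claiming all the cores) would look roughly like this in spark-env.sh on each worker node of a standalone cluster. This is only a minimal sketch with illustrative values for an 8-core box, and I haven't verified how it interacts with the EC2 scripts:

    # spark-env.sh (illustrative values for an 8-core node)
    export SPARK_WORKER_INSTANCES=2   # run two worker daemons on this node
    export SPARK_WORKER_CORES=8       # each worker advertises all 8 cores -> 16 per node
    export SPARK_WORKER_MEMORY=12g    # example value only: split the node's RAM so the
                                      # two workers don't both try to claim all of it

    # then restart the workers (e.g. sbin/stop-all.sh followed by sbin/start-all.sh
    # on the master) for the change to take effect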
It's actually not that difficult: you just run start-slave.sh multiple times on the same node, with different worker IDs. Here is the usage:

    # Usage: start-slave.sh <worker#> <master-spark-URL>

We have custom scripts to do that, though (a rough sketch of the idea is appended below the quoted thread). I'm not sure whether it is possible using the standard start-all.sh script or that EC2 script; probably not. I haven't set up or managed such a cluster myself, so that's about the extent of my knowledge. But I've deployed jobs to that cluster and enjoyed the benefit of double the cores. We had a fair amount of I/O, though, which may be why it helped in our case. I recommend taking a look at the CPU utilization on the nodes while a flow is running before jumping through these hoops.

On Fri, Aug 1, 2014 at 12:05 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> Darin,
>
> I think the number of cores in your cluster is a hard limit on how many concurrent tasks you can execute at one time. If you want more parallelism, I think you just need more cores in your cluster--that is, bigger nodes, or more nodes.
>
> Daniel,
>
> Have you been able to get around this limit?
>
> Nick
>
> On Fri, Aug 1, 2014 at 11:49 AM, Daniel Siegmann <daniel.siegm...@velos.io> wrote:
>
>> Sorry, but I haven't used Spark on EC2 and I'm not sure what the problem could be. Hopefully someone else will be able to help. The only thing I could suggest is to try setting both the worker instances and the number of cores (assuming spark-ec2 has such a parameter).
>>
>> On Thu, Jul 31, 2014 at 3:03 PM, Darin McBeath <ddmcbe...@yahoo.com> wrote:
>>
>>> OK, I set the number of Spark worker instances to 2 (below is my startup command). This essentially had the effect of increasing my number of workers from 3 to 6 (which was good), but it also reduced my number of cores per worker from 8 to 4 (which was not so good). In the end, I would still only be able to process 24 partitions in parallel. I'm starting a stand-alone cluster using the Spark-provided EC2 scripts. I tried setting the SPARK_WORKER_CORES env variable in spark_ec2.py, but this had no effect, so it's not clear whether SPARK_WORKER_CORES can even be set with the EC2 scripts. Anyway, I'm not sure there is anything else I can try, but I at least wanted to document what I did try and the net effect. I'm open to any suggestions/advice.
>>>
>>> ./spark-ec2 -k *key* -i key.pem --hadoop-major-version=2 launch -s 3 -t m3.2xlarge -w 3600 --spot-price=.08 -z us-east-1e --worker-instances=2 *my-cluster*
>>>
>>> ------------------------------
>>> *From:* Daniel Siegmann <daniel.siegm...@velos.io>
>>> *To:* Darin McBeath <ddmcbe...@yahoo.com>
>>> *Cc:* Daniel Siegmann <daniel.siegm...@velos.io>; "user@spark.apache.org" <user@spark.apache.org>
>>> *Sent:* Thursday, July 31, 2014 10:04 AM
>>> *Subject:* Re: Number of partitions and Number of concurrent tasks
>>>
>>> I haven't configured this myself. I'd start with setting SPARK_WORKER_CORES to a higher value, since that's a bit simpler than adding more workers. This defaults to "all available cores" according to the documentation, so I'm not sure if you can actually set it higher. If not, you can get around this by adding more worker instances; I believe simply setting SPARK_WORKER_INSTANCES to 2 would be sufficient.
>>>
>>> I don't think you *have* to set the cores if you have more workers - it will default to 8 cores per worker (in your case).
>>> But maybe 16 cores per node will be too many; you'll have to test. Keep in mind that more workers also means more memory use and such, so you may need to tweak some other settings downward in this case.
>>>
>>> On a side note: I've read that some people found performance was better with more workers that each had less memory, instead of a single worker with tons of memory, because it cut down on garbage collection time. But I can't speak to that myself.
>>>
>>> In any case, if you increase the number of cores available in your cluster (whether per worker, by adding more workers per node, or of course by adding more nodes), you should see more tasks running concurrently. Whether this will actually be *faster* probably depends mainly on whether the CPUs in your nodes were really being fully utilized with the current number of cores.
>>>
>>> On Wed, Jul 30, 2014 at 8:30 PM, Darin McBeath <ddmcbe...@yahoo.com> wrote:
>>>
>>> Thanks.
>>>
>>> So to make sure I understand: since I'm using a 'stand-alone' cluster, I would set SPARK_WORKER_INSTANCES to something like 2 (instead of the default value of 1). Is that correct? But it also sounds like I need to explicitly set a value for SPARK_WORKER_CORES (based on what the documentation states). What would I want that value to be, given my configuration below? Or would I leave that alone?
>>>
>>> ------------------------------
>>> *From:* Daniel Siegmann <daniel.siegm...@velos.io>
>>> *To:* user@spark.apache.org; Darin McBeath <ddmcbe...@yahoo.com>
>>> *Sent:* Wednesday, July 30, 2014 5:58 PM
>>> *Subject:* Re: Number of partitions and Number of concurrent tasks
>>>
>>> This is correct behavior. Each "core" can execute exactly one task at a time, with each task corresponding to a partition. If your cluster only has 24 cores, you can run at most 24 tasks at once.
>>>
>>> You could run multiple workers per node to get more executors. That would give you more cores in the cluster. But however many cores you have, each core will run only one task at a time.
>>>
>>> On Wed, Jul 30, 2014 at 3:56 PM, Darin McBeath <ddmcbe...@yahoo.com> wrote:
>>>
>>> I have a cluster with 3 nodes (each with 8 cores) using Spark 1.0.1.
>>>
>>> I have an RDD<String> which I've repartitioned so it has 100 partitions (hoping to increase the parallelism).
>>>
>>> When I do a transformation (such as a filter) on this RDD, I can't seem to get more than 24 tasks (my total number of cores across the 3 nodes) going at one point in time. By tasks, I mean the number of tasks that appear under the Application UI. I tried explicitly setting spark.default.parallelism to 48 (hoping I would get 48 tasks running concurrently) and verified this in the Application UI for the running application, but this had no effect. Perhaps this is ignored for a 'filter' and the default is the total number of cores available.
>>>
>>> I'm fairly new to Spark, so maybe I'm just missing or misunderstanding something fundamental. Any help would be appreciated.
>>>
>>> Thanks.
>>>
>>> Darin.
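As promised above, here is a rough sketch of what a per-node script for the manual start-slave.sh approach might look like on Spark 1.x. It's a sketch only, not our actual scripts: the master URL is a placeholder, SPARK_HOME is assumed to point at your Spark installation, and the --webui-port values are just there so the two daemons don't fight over the default port. I haven't tested this exact form against the EC2 scripts.

    #!/usr/bin/env bash
    # Sketch: start two worker daemons on this node against the same master.
    # Replace <master-host> with your master's hostname.
    MASTER_URL="spark://<master-host>:7077"

    for i in 1 2; do
      # First argument is the worker instance number, second is the master URL;
      # give each instance its own web UI port (8081, 8082) to avoid a clash.
      "$SPARK_HOME"/sbin/start-slave.sh "$i" "$MASTER_URL" --webui-port $((8080 + i))
    done

You'd have to run something like this on every worker node (e.g. over ssh), so treat it as a starting point rather than a drop-in replacement for start-all.sh.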
--
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io