What is your cluster setup? Are you running a worker on the master node as well?
1. Spark usually assigns a task to the worker that has the data locally
available. Once that worker has enough tasks, I believe it will start
assigning tasks to the other workers as well. You can control this with the
level of parallelism (e.g. spark.default.parallelism, or by repartitioning);
a minimal sketch follows the quoted message below.

2. If you coalesce the data into one partition, then I believe only one
worker will execute that single task.

Thanks
Best Regards

On Fri, Nov 21, 2014 at 9:49 PM, Pat Ferrel <p...@occamsmachete.com> wrote:

> I have a job that searches for input recursively and creates a string of
> pathnames to treat as one input.
>
> The files are part-xxxxx files and they are fairly small. The job seems to
> take a long time to complete considering the size of the total data
> (150 MB), and it only runs on the master machine. The job only does
> rdd.map-type operations.
>
> 1) Why doesn't it use the other workers in the cluster?
> 2) Is there a downside to using a lot of small part files? Should I
> coalesce them into one input file?
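Here is a minimal sketch of both points using the Scala RDD API. The HDFS
paths, the app name, and the figure of 24 partitions are made-up
illustrations, not something from this thread; pick a partition count that
is a small multiple of your cluster's total cores.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartFilesSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("part-files-sketch"))

    // textFile accepts a comma-separated string of paths, like the one
    // built in the original post; these paths are hypothetical.
    val input = sc.textFile(
      "hdfs:///data/job/part-00000,hdfs:///data/job/part-00001")

    // Many tiny part files produce many tiny partitions. Repartitioning
    // to a small multiple of the cluster's total cores (24 is a guess)
    // shuffles the data so that every worker gets tasks -- point 1 above.
    val spread = input.repartition(24)
    val totalChars = spread.map(_.length.toLong).reduce(_ + _)
    println(s"total characters: $totalChars")

    // By contrast, coalescing to one partition leaves a single task,
    // which one worker executes alone -- point 2 above.
    val single = input.coalesce(1)
    println(s"partitions after coalesce(1): ${single.partitions.length}")

    sc.stop()
  }
}
```

Note that repartition() triggers a shuffle, but with only ~150 MB of data
that cost is small, and it lets the subsequent map work run on all workers
instead of just one.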