What is your cluster setup? Are you running a worker on the master node
as well?

1. Spark usually assigns each task to a worker that has the data locally
available. Once one worker has enough tasks, I believe it will start
assigning tasks to the others as well. You can control this with the
level of parallelism (number of partitions); see the first sketch below.

2. If you coalesce the RDD into one partition, then I believe only one
worker will execute that single task; see the second sketch below.
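Roughly what I mean by the parallelism knob, in Scala. This is only a
sketch; the paths, app name, and partition count are made up, not from
your job:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("small-part-files"))

    // textFile takes a comma-separated list of paths (globs work) plus a
    // minimum-partition hint, so even ~150 MB of small part-xxxxx files
    // can be split into enough tasks to keep every worker busy.
    val paths = "hdfs:///data/run1/part-*,hdfs:///data/run2/part-*"
    val rdd = sc.textFile(paths, 48)

    // Or spread the work out after loading:
    val spread = rdd.repartition(48)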
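And the coalesce case, again just an illustrative sketch with made-up
paths:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("coalesce-demo"))
    val rdd = sc.textFile("hdfs:///data/part-*", 48)

    // coalesce(1) collapses the RDD into a single partition, i.e. a single
    // task, so only one executor core runs that stage while the rest of
    // the cluster sits idle.
    val single = rdd.coalesce(1)

    // If one output file is the goal, keep the map parallel and only
    // coalesce right before writing:
    rdd.map(_.trim)        // runs in parallel across the workers
       .coalesce(1)        // single partition just for the write
       .saveAsTextFile("hdfs:///out/combined")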

Thanks
Best Regards

On Fri, Nov 21, 2014 at 9:49 PM, Pat Ferrel <p...@occamsmachete.com> wrote:

> I have a job that searches for input recursively and creates a string of
> pathnames to treat as one input.
>
> The files are part-xxxxx files and they are fairly small. The job seems to
> take a long time to complete considering the size of the total data (150m)
> and only runs on the master machine. The job only does rdd.map type
> operations.
>
> 1) Why doesn’t it use the other workers in the cluster?
> 2) Is there a downside to using a lot of small part files? Should I
> coalesce them into one input file?
