Try setting mapred.max.split.size to your desired split size (in bytes).
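For example, near the top of the Pig script (a sketch; 10 MB is an illustrative value, and this assumes your Pig version passes `set` keys through to the Hadoop job configuration — otherwise you can pass the same property on the command line with `-Dmapred.max.split.size=10485760`):

```
-- Cap input splits at ~10 MB so a 40 MB file is divided into
-- about 4 splits, i.e. about 4 map tasks instead of 1.
set mapred.max.split.size 10485760;
```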
-D

On Tue, Dec 14, 2010 at 8:10 PM, Charles W <[email protected]> wrote:
> Hi,
>
> I have a question about map parallelism in Pig.
>
> I am using Pig to stream a file through a Python script that performs some
> computationally expensive transforms. This process is assigned to a single
> map task that can take a very long time if it happens to execute on one of
> the weaker nodes in the cluster. I am wondering how I can force the map task
> to be spread across a number of nodes.
>
> From reading
> http://pig.apache.org/docs/r0.7.0/cookbook.html#Use+the+PARALLEL+Clause, I
> see that the parallelism of maps is "determined by the input file, one map
> for each HDFS block."
>
> The file I am operating on is 40 MB; the block size is 64 MB, so presumably
> the file is stored in a single HDFS block. The replication factor for the
> file is 3, and the DFS web UI verifies this.
>
> My question is: Is there anything I can do to increase the parallelism of
> the map task? Is it the case that the replication factor being 3 does not
> influence how many map tasks can be performed simultaneously? Should I use a
> smaller HDFS block size?
>
> I am using Hadoop 0.20.2, Pig 0.7.0.
>
> Thanks,
> - Charles
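The arithmetic behind the suggestion above can be sketched roughly like this (an illustrative estimate only; the actual split count also depends on block boundaries, the InputFormat, and min-split settings — and replication does not add map tasks, it only gives the scheduler more choices of where to run each one):

```python
import math

def estimated_map_tasks(file_size_bytes, max_split_size_bytes):
    """Rough estimate: one map task per input split,
    where splits are capped at max_split_size_bytes."""
    return math.ceil(file_size_bytes / max_split_size_bytes)

MB = 1024 * 1024
# Default: a 40 MB file in one 64 MB block -> a single map task.
print(estimated_map_tasks(40 * MB, 64 * MB))  # 1
# With mapred.max.split.size lowered to 10 MB -> about 4 map tasks.
print(estimated_map_tasks(40 * MB, 10 * MB))  # 4
```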
