Try setting mapred.max.split.size to your desired split size (in bytes).
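For example, near the top of the Pig script (a sketch; 10 MB is an illustrative value, and this assumes your Pig version passes `set` keys through to the Hadoop job configuration — otherwise you can pass the same property on the command line with `-Dmapred.max.split.size=10485760`):

```
-- Cap input splits at ~10 MB so a 40 MB file is divided into
-- about 4 splits, i.e. about 4 map tasks instead of 1.
set mapred.max.split.size 10485760;
```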
-D

On Tue, Dec 14, 2010 at 8:10 PM, Charles W <[email protected]> wrote:
> Hi,
>
> I have a question about map parallelism in Pig.
>
> I am using Pig to stream a file through a Python script that performs some
> computationally expensive transforms. This process is assigned to a single
> map task that can take a very long time if it happens to execute on one of
> the weaker nodes in the cluster. I am wondering how I can force the map task
> to be spread across a number of nodes.
>
> From reading
> http://pig.apache.org/docs/r0.7.0/cookbook.html#Use+the+PARALLEL+Clause, I
> see that the parallelism of maps is "determined by the input file, one map
> for each HDFS block."
>
> The file I am operating on is 40 MB; the block size is 64 MB, so presumably
> the file is stored in a single HDFS block. The replication factor for the
> file is 3, and the DFS web UI verifies this.
>
> My question is: Is there anything I can do to increase the parallelism of
> the map task? Is it the case that the replication factor being 3 does not
> influence how many map tasks can be performed simultaneously? Should I use a
> smaller HDFS block size?
>
> I am using Hadoop 0.20.2, Pig 0.7.0.
>
> Thanks,
> - Charles
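The arithmetic behind the suggestion above can be sketched roughly like this (an illustrative estimate only; the actual split count also depends on block boundaries, the InputFormat, and min-split settings — and replication does not add map tasks, it only gives the scheduler more choices of where to run each one):

```python
import math

def estimated_map_tasks(file_size_bytes, max_split_size_bytes):
    """Rough estimate: one map task per input split,
    where splits are capped at max_split_size_bytes."""
    return math.ceil(file_size_bytes / max_split_size_bytes)

MB = 1024 * 1024
# Default: a 40 MB file in one 64 MB block -> a single map task.
print(estimated_map_tasks(40 * MB, 64 * MB))  # 1
# With mapred.max.split.size lowered to 10 MB -> about 4 map tasks.
print(estimated_map_tasks(40 * MB, 10 * MB))  # 4
```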
