What process creates the data in HDFS? You should be able to set the block size there and avoid the copy.
I would test the dfs.block.size on the copy and see if you get the mapper split you want before worrying about optimizing.

David

On Feb 11, 2013, at 2:10 PM, Something Something <[email protected]> wrote:

> David: Your suggestion would add an additional step of copying data from
> one place to another. Not bad, but not ideal. Is there no way to avoid
> copying of data?
>
> BTW, we have tried changing the following options to no avail :(
>
> set pig.splitCombination false;
>
> & a few other 'dfs' options given below:
>
> mapreduce.min.split.size
> mapreduce.max.split.size
>
> Thanks.
>
> On Mon, Feb 11, 2013 at 10:29 AM, David LaBarbera <[email protected]> wrote:
>
>> You could store your data in smaller block sizes. Do something like
>>
>> hadoop fs HADOOP_OPTS="-Ddfs.block.size=1048576 -Dfs.local.block.size=1048576" -cp /org-input /small-block-input
>>
>> You might only need one of those parameters. You can verify the block size with
>>
>> hadoop fsck /small-block-input
>>
>> In your pig script, you'll probably need to set
>>
>> pig.maxCombinedSplitSize
>>
>> to something around the block size.
>>
>> David
>>
>> On Feb 11, 2013, at 1:24 PM, Something Something <[email protected]> wrote:
>>
>>> Sorry.. Moving 'hbase' mailing list to BCC 'cause this is not related to
>>> HBase. Adding 'hadoop' user group.
>>>
>>> On Mon, Feb 11, 2013 at 10:22 AM, Something Something <[email protected]> wrote:
>>>
>>>> Hello,
>>>>
>>>> We are running into performance issues with Pig/Hadoop because our input
>>>> files are small. Everything goes to only 1 mapper. To get around this, we
>>>> are trying to use our own loader like this:
>>>>
>>>> 1) Extend PigStorage:
>>>>
>>>> public class SmallFileStorage extends PigStorage {
>>>>
>>>>     public SmallFileStorage(String delimiter) {
>>>>         super(delimiter);
>>>>     }
>>>>
>>>>     @Override
>>>>     public InputFormat getInputFormat() {
>>>>         return new NLineInputFormat();
>>>>     }
>>>> }
>>>>
>>>> 2) Add a command line argument to the Pig command as follows:
>>>>
>>>> -Dmapreduce.input.lineinputformat.linespermap=500000
>>>>
>>>> 3) Use SmallFileStorage in the Pig script as follows:
>>>>
>>>> USING com.xxx.yyy.SmallFileStorage ('\t')
>>>>
>>>> But this doesn't seem to work. We still see that everything is going to
>>>> one mapper. Before we spend any more time on this, I am wondering if this
>>>> is a good approach – OR – if there's a better approach? Please let me
>>>> know. Thanks.
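
For reference, a filled-out version of the loader from the original question at the bottom of the thread. This is a minimal sketch with the imports spelled out, assuming the new-API NLineInputFormat (org.apache.hadoop.mapreduce.lib.input) and Pig's builtin PigStorage; the package and class names are the poster's own. Whether it actually changes the mapper count also depends on mapreduce.input.lineinputformat.linespermap reaching the job configuration (step 2 above) and on Pig's split-combination settings mentioned earlier in the thread.

import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.pig.builtin.PigStorage;

/**
 * Sketch of the loader described in the thread: identical to PigStorage
 * except that it hands the job an NLineInputFormat, so each input split
 * covers a fixed number of lines instead of one split per small file.
 */
public class SmallFileStorage extends PigStorage {

    public SmallFileStorage(String delimiter) {
        super(delimiter);
    }

    @Override
    @SuppressWarnings("rawtypes")
    public InputFormat getInputFormat() {
        // NLineInputFormat reads mapreduce.input.lineinputformat.linespermap
        // (500000 in step 2 of the original message) from the job configuration.
        return new NLineInputFormat();
    }
}

The only behavioral change versus plain PigStorage is the InputFormat; delimiter handling and tuple construction still come from PigStorage itself.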

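David's smaller-block-size suggestion can also be expressed with the HDFS API rather than "hadoop fs -cp", which makes the block size explicit per file. A rough sketch, assuming a flat input directory; the paths and the 1 MB block size mirror the command quoted above and are illustrative only:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

/**
 * Illustrative sketch: re-copy the input with a smaller HDFS block size
 * so each block becomes its own input split. Paths and the 1 MB block
 * size come from the command quoted in the thread.
 */
public class SmallBlockCopy {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path src = new Path("/org-input");
        Path dst = new Path("/small-block-input");
        long blockSize = 1048576L;                      // 1 MB blocks
        short replication = fs.getDefaultReplication();
        int bufferSize = 4096;

        fs.mkdirs(dst);
        for (FileStatus file : fs.listStatus(src)) {
            if (file.isDir()) {
                continue; // flat copy only; recurse if the input is nested
            }
            Path out = new Path(dst, file.getPath().getName());
            FSDataInputStream in = fs.open(file.getPath());
            // create(path, overwrite, bufferSize, replication, blockSize)
            FSDataOutputStream os = fs.create(out, true, bufferSize, replication, blockSize);
            IOUtils.copyBytes(in, os, bufferSize, true); // closes both streams
        }
    }
}

As in the thread, the result can be checked with "hadoop fsck /small-block-input", and pig.maxCombinedSplitSize should stay near the new block size so Pig does not recombine the splits into one mapper.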