You could store your data with a smaller block size. Do something like:

HADOOP_OPTS="-Ddfs.block.size=1048576 -Dfs.local.block.size=1048576" hadoop fs -cp /org-input /small-block-input

You might only need one of those parameters. You can verify the block size with:

hadoop fsck /small-block-input
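In case the HADOOP_OPTS route doesn't take effect, a minimal alternative sketch is to pass the same overrides as generic -D options, which hadoop fs accepts, and to ask fsck for per-file block details. The 1 MB value and the paths are the same placeholders as above; the extra fsck flags are a suggestion, not part of the original advice:

# same copy, with the overrides passed as generic options instead of JVM options
hadoop fs -D dfs.block.size=1048576 -D fs.local.block.size=1048576 -cp /org-input /small-block-input

# -files -blocks prints each file's block count and sizes, so you can
# confirm the copy really used the smaller block size
hadoop fsck /small-block-input -files -blocks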
In your pig script, you'll probably need to set pig.maxCombinedSplitSize to something around the block size (a concrete example follows below the quoted message).

David

On Feb 11, 2013, at 1:24 PM, Something Something <[email protected]> wrote:

> Sorry.. Moving 'hbase' mailing list to BCC 'cause this is not related to
> HBase. Adding 'hadoop' user group.
>
> On Mon, Feb 11, 2013 at 10:22 AM, Something Something <
> [email protected]> wrote:
>
>> Hello,
>>
>> We are running into performance issues with Pig/Hadoop because our input
>> files are small. Everything goes to only 1 Mapper. To get around this, we
>> are trying to use our own Loader like this:
>>
>> 1) Extend PigStorage:
>>
>> public class SmallFileStorage extends PigStorage {
>>
>>     public SmallFileStorage(String delimiter) {
>>         super(delimiter);
>>     }
>>
>>     @Override
>>     public InputFormat getInputFormat() {
>>         return new NLineInputFormat();
>>     }
>> }
>>
>> 2) Add command line argument to the Pig command as follows:
>>
>> -Dmapreduce.input.lineinputformat.linespermap=500000
>>
>> 3) Use SmallFileStorage in the Pig script as follows:
>>
>> USING com.xxx.yyy.SmallFileStorage ('\t')
>>
>> But this doesn't seem to work. We still see that everything is going to
>> one mapper. Before we spend any more time on this, I am wondering if this
>> is a good approach – OR – if there's a better approach? Please let me
>> know. Thanks.
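To make the pig.maxCombinedSplitSize suggestion above concrete, here is a minimal sketch. The property is Pig's standard cap on split combining; the 1 MB value simply matches the block size used above, and myscript.pig is a placeholder name:

# cap Pig's split combining at roughly the new block size, so the small-block
# files aren't merged back into a single split (and a single mapper)
PIG_OPTS="-Dpig.maxCombinedSplitSize=1048576" pig myscript.pig

The same property can also go at the top of the script itself with Pig's set command, e.g. set pig.maxCombinedSplitSize 1048576;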
