Lei,

This is a limitation of gzip: .gz files are not splittable.
I had come across this link that provided a working solution, but after reviewing it I found it easier to just switch my codec to bzip2 (.bz2):

http://niels.basjes.nl/splittable-gzip

Ryan

On Mon, Aug 18, 2014 at 4:42 PM, leiwang...@gmail.com <leiwang...@gmail.com> wrote:

> Is there any other way to split the input gz files in MapReduce instead
> of changing the codec?
>
> leiwang...@gmail.com
>
> From: leiwang...@gmail.com
> Date: 2014-08-18 23:15
> To: user
> Subject: Re: Re: pig.maxCombinedSplitSize not work
>
> Hi Jarek, can you give me some examples of how to do this?
>
> Thanks,
> Lei
>
> leiwang...@gmail.com
>
> From: Jarek Jarcec Cecho
> Date: 2014-08-18 23:01
> To: user
> Subject: Re: pig.maxCombinedSplitSize not work
>
> Hi Lei,
> gzip is a so-called non-splittable file format: Hadoop can't "seek" into
> the middle of the file and start decompressing there; you always have to
> read the file from the beginning, which is an undesirable thing to do on
> a Hadoop cluster. Hence you will get one mapper per non-splittable input
> file.
>
> You might consider uncompressing the files, using a splittable codec
> (such as bzip2), or using a binary container format (Avro, Parquet,
> SequenceFile).
>
> Jarcec
>
> On Aug 18, 2014, at 7:49 AM, leiwang...@gmail.com wrote:
>
> > I have an input directory which has 7 files:
> > 804M bid10.gz
> > 814M bid11.gz
> > 808M bid2.gz
> > 812M bid4.gz
> > 803M bid5.gz
> > 818M bid8.gz
> > 823M bid9.gz
> >
> > In my Pig script I set the combined split size to 128M:
> >
> > SET pig.maxCombinedSplitSize 134217728;
> >
> > But there are only 7 mappers (one file per mapper).
> > Any insight on this?
> >
> > Thanks,
> > Lei
> >
> > leiwang...@gmail.com

--
Ryan Prociuk | Engineering Distributed Data
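A minimal Pig Latin sketch of the fix discussed in this thread (recompress the inputs to bzip2 so each file is splittable, then let pig.maxCombinedSplitSize pack the resulting splits). The /data/bids_bz2 path and the single-column schema are hypothetical illustrations, not taken from the thread:

    -- 128 MB per combined split, as in Lei's original script.
    SET pig.maxCombinedSplitSize 134217728;

    -- PigStorage picks the bzip2 codec from the .bz2 extension; because
    -- bzip2 is splittable, each ~800M file yields several input splits.
    bids = LOAD '/data/bids_bz2/bid*.bz2' USING PigStorage('\t')
           AS (line:chararray);

    -- Trivial job, just to observe the mapper count.
    grp = GROUP bids ALL;
    cnt = FOREACH grp GENERATE COUNT(bids);
    DUMP cnt;

With splittable input, the mapper count is driven by the 128 MB cap rather than by the number of input files.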