Lei,

This is a limitation of gzip: .gz files are not splittable.

I had come across the link below, which provides a working solution, but
after reviewing it I found it easier to just switch my codec to bzip2 (.bz2).

http://niels.basjes.nl/splittable-gzip
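
With bzip2 input, the original setting behaves as expected, since Pig can
split each file into pieces and then combine those pieces up to the cap.
Roughly what my load looked like (the path is a placeholder):

-- bzip2 is splittable, so each large file is split into block-sized
-- pieces, which Pig then combines into up to 128 MB per mapper
SET pig.maxCombinedSplitSize 134217728;
bids = LOAD '/data/bids-bz2' USING TextLoader();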

Ryan


On Mon, Aug 18, 2014 at 4:42 PM, leiwang...@gmail.com <leiwang...@gmail.com>
wrote:

>
> Is there any other way to split the input gz files in MapReduce instead
> of changing the codec?
>
>
>
> leiwang...@gmail.com
>
> From: leiwang...@gmail.com
> Date: 2014-08-18 23:15
> To: user
> Subject: Re: Re: pig.maxCombinedSplitSize not work
>
> Hi Jarek, can you give me some examples of how to do this?
>
> Thanks,
> Lei
>
>
> leiwang...@gmail.com
>
> From: Jarek Jarcec Cecho
> Date: 2014-08-18 23:01
> To: user
> Subject: Re: pig.maxCombinedSplitSize not work
> Hi Lei,
> gzip is a so-called non-splittable file format - Hadoop can't "seek" into
> the middle of the file and start decompressing there; you always have to
> start reading the file from the beginning, which is an undesirable thing to
> do on a Hadoop cluster. Hence you get one mapper per non-splittable input
> file.
>
> You might consider uncompressing the files, using a splittable codec (such
> as bzip2), or using a binary container format (Avro, Parquet, SequenceFile).
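>
> For example, a one-time Pig job can rewrite the gzip files with bzip2
> compression. A rough sketch with placeholder paths (PigStorage compresses
> the output because the path ends in .bz2):
>
> -- TextLoader treats each line as opaque text, so the record format
> -- does not matter for a pure recompression pass
> raw = LOAD '/data/bids' USING TextLoader();
> -- the .bz2 suffix on the output path makes PigStorage write bzip2
> STORE raw INTO '/data/bids.bz2' USING PigStorage();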
>
> Jarcec
>
> On Aug 18, 2014, at 7:49 AM, leiwang...@gmail.com wrote:
>
> >
> > I have an input directory which has 7 files:
> > 804M bid10.gz
> > 814M bid11.gz
> > 808M bid2.gz
> > 812M bid4.gz
> > 803M bid5.gz
> > 818M bid8.gz
> > 823M bid9.gz
> >
> > In my Pig script I set the combined split size to 128 MB:
> >
> > SET pig.maxCombinedSplitSize 134217728;
> >
> > But there are only 7 mappers (one file per mapper).
> > Any insight on this?
> >
> > Thanks,
> > Lei
> >
> >
> >
> > leiwang...@gmail.com
>
>


-- 
Ryan Prociuk | Engineering Distributed Data
