Splittable bzip is supported by Hadoop
(https://issues.apache.org/jira/browse/HADOOP-4012, since version 0.21).
I have already opened a Jira ticket for handling concatenated bzip/gzip
files: https://issues.apache.org/jira/browse/PIG-4533.
It seems:
1. if bzip files were left to Hadoop to process, we would be fine
2. (if 1 is true) the documentation needs to be improved (delete the
part about "handling-compression", or mark it obsolete)
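
As for the earlier question about how to tell whether a .gz or .bz2 file is
concatenated, here is a minimal Python sketch. It is not Pig or Hadoop code,
and the helper names count_gzip_members and count_bzip2_streams are made up
for illustration. It assumes the whole file fits in memory and simply counts
how many back-to-back compressed streams the data contains; a count above 1
means the file is concatenated.

```python
import bz2
import gzip
import zlib


def count_gzip_members(data: bytes) -> int:
    """Count the gzip members (back-to-back streams) in a byte string."""
    members = 0
    while data:
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
        decomp.decompress(data)
        if not decomp.eof:
            raise ValueError("truncated gzip stream")
        members += 1
        # Whatever follows the end of this member is the next member (if any).
        data = decomp.unused_data
    return members


def count_bzip2_streams(data: bytes) -> int:
    """Count the bzip2 streams (back-to-back) in a byte string."""
    streams = 0
    while data:
        decomp = bz2.BZ2Decompressor()
        decomp.decompress(data)
        if not decomp.eof:
            raise ValueError("truncated bzip2 stream")
        streams += 1
        data = decomp.unused_data
    return streams


if __name__ == "__main__":
    # Simulate `cat a.gz b.gz > joined.gz` in memory.
    joined_gz = gzip.compress(b"a\n") + gzip.compress(b"b\n")
    joined_bz2 = bz2.compress(b"a\n") + bz2.compress(b"b\n")
    print("gzip members:", count_gzip_members(joined_gz))
    print("bzip2 streams:", count_bzip2_streams(joined_bz2))
```

To check a real file, pass open(path, "rb").read() to the matching helper.
This only tells you how the file is laid out; it says nothing about how a
given Pig or Hadoop version will read it.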

On Mon, May 18, 2015 at 9:51 PM, Daniel Dai <da...@hortonworks.com> wrote:

> I am not entirely sure, but it seems Hadoop did not support splittable
> bzip initially, so Pig implemented its own.
>
>
> On 5/18/15, 12:43 AM, "Tomas Hudik" <xhu...@gmail.com> wrote:
>
> >thank you Daniel.
> >
> >Follow-up question: is there any reason why bzip is processed by Pig but
> >gzip is processed by Hadoop?
> >
> >thanks, Tomas
> >
> >On Mon, May 18, 2015 at 8:35 AM, Daniel Dai <da...@hortonworks.com> wrote:
> >
> >> Gzip decompression happens on the Hadoop side (TextInputFormat); if
> >> Hadoop fixed concatenated gzip, Pig should be fixed as well. Bzip,
> >> however, is processed by Pig code, which does not support concatenation.
> >>
> >> It seems we need to update the documentation.
> >>
> >> Daniel
> >>
> >> On 5/5/15, 3:51 AM, "Tomas Hudik" <xhu...@gmail.com> wrote:
> >>
> >> >Hi,
> >> >I read a section:
> >> >https://pig.apache.org/docs/r0.11.1/func.html#handling-compression
> >> >
> >> >according to which any concatenated bzip/gzip files will produce
> >> >strange results.
> >> >I did a test: I concatenated some files and processed them. However,
> >> >all the results were identical to the ones produced from the
> >> >non-concatenated files. Why? They should be different...
> >> >
> >> >Then I saw: https://issues.apache.org/jira/i#browse/HADOOP-6835
> >> >
> >> >My questions:
> >> >1. Is
> >> >https://pig.apache.org/docs/r0.11.1/func.html#handling-compression
> >> >still correct, i.e. will concatenation produce wrong results? Is this
> >> >true for all concatenated files, or does it happen only occasionally?
> >> >2. Is there any way to find out whether a tar.gz or tar.bz2 file is
> >> >concatenated?
> >>
> >>
>
>
