Re: concatenated gzip/bzip in Pig 0.11 and higher

Daniel Dai Mon, 18 May 2015 12:53:14 -0700

I am not entirely sure, but it seems Hadoop did not support splittable bzip
initially, so Pig implemented its own.



On 5/18/15, 12:43 AM, "Tomas Hudik" <xhu...@gmail.com> wrote:

>thank you Daniel.
>
>Follow-up question: is there any reason why bzip is processed by Pig but
>gzip is processed in Hadoop?
>
>thanks, Tomas
>
>On Mon, May 18, 2015 at 8:35 AM, Daniel Dai <da...@hortonworks.com> wrote:
>
>> The decompression of gzip happens on the Hadoop side (TextInputFormat), so
>> if Hadoop fixed concatenated gzip, Pig should be fixed as well. Bzip,
>> however, is processed by Pig code, which does not support concatenation.
>>
>> It seems we need to update the documentation.
>>
>> Daniel
>>
>> On 5/5/15, 3:51 AM, "Tomas Hudik" <xhu...@gmail.com> wrote:
>>
>> >Hi,
>> >I read a section:
>> >https://pig.apache.org/docs/r0.11.1/func.html#handling-compression
>> >
>> >according to which any concatenated bzip/gzip files will produce strange
>> >results.
>> >I did a test - concatenated some files and processed them. However, all
>> >the results were identical to the ones that were produced on
>> >non-concatenated files. Why? They should be different...
>> >
>> >Then I saw: https://issues.apache.org/jira/i#browse/HADOOP-6835
>> >
>> >My questions:
>> >1. Is https://pig.apache.org/docs/r0.11.1/func.html#handling-compression
>> >still correct, and will concatenation produce wrong results? Is this
>> >true for any concatenated files, or does it happen only some of the time?
>> >2. Is there any way to find out whether a tar.gz or tar.bz2 is
>> >concatenated?
>>
>>
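
On question 2 above: one way to check whether a .gz or .bz2 file is
concatenated is to count the compressed streams it contains. The following is
a minimal Python sketch, not something from Pig or Hadoop. The function names
are hypothetical, and it assumes the whole file fits in memory and that there
is no padding or other non-compressed data after the last stream.

```python
import bz2
import zlib


def count_gzip_members(data: bytes) -> int:
    """Count the gzip members (concatenated streams) in a byte string."""
    members = 0
    while data:
        # wbits=31 tells zlib to expect a gzip header and trailer.
        decomp = zlib.decompressobj(wbits=31)
        decomp.decompress(data)
        if not decomp.eof:
            raise ValueError("truncated gzip stream")
        members += 1
        # Bytes left after the end of this member start the next one, if any.
        data = decomp.unused_data
    return members


def count_bzip2_streams(data: bytes) -> int:
    """Count the bzip2 streams in a byte string."""
    streams = 0
    while data:
        decomp = bz2.BZ2Decompressor()
        decomp.decompress(data)
        if not decomp.eof:
            raise ValueError("truncated bzip2 stream")
        streams += 1
        data = decomp.unused_data
    return streams
```

A result greater than 1 means the file is a concatenation. For example,
count_gzip_members(open("input.gz", "rb").read()) on a hypothetical file
input.gz built with "cat a.gz b.gz > input.gz" should return 2.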
