I am not entirely sure, but it seems Hadoop did not support splittable bzip initially, so Pig implemented its own.
On 5/18/15, 12:43 AM, "Tomas Hudik" <xhu...@gmail.com> wrote:

>thank you Daniel.
>
>follow up question: is there any reason why bzip is processed by Pig but
>gzip is processed in Hadoop?
>
>thanks, Tomas
>
>On Mon, May 18, 2015 at 8:35 AM, Daniel Dai <da...@hortonworks.com> wrote:
>
>> The decompression of gzip is on the Hadoop side (TextInputFormat), so if
>> Hadoop fixed concatenated gzip, Pig should be fixed as well. Bzip, however,
>> is processed by Pig code, which does not support concatenation.
>>
>> It seems we need to update the documentation.
>>
>> Daniel
>>
>> On 5/5/15, 3:51 AM, "Tomas Hudik" <xhu...@gmail.com> wrote:
>>
>> >Hi,
>> >I read a section:
>> >https://pig.apache.org/docs/r0.11.1/func.html#handling-compression
>> >
>> >according to which any concatenated bzip/gzip files will produce strange
>> >results.
>> >I did a test - concatenated some files and processed them. However, all
>> >the results were identical to the ones that were produced on
>> >non-concatenated files. Why? They should be different...
>> >
>> >Then I saw: https://issues.apache.org/jira/i#browse/HADOOP-6835
>> >
>> >My questions:
>> >1. is https://pig.apache.org/docs/r0.11.1/func.html#handling-compression
>> >still correct and will concatenation produce wrong results? Is this true
>> >for any concatenated files, or might it happen only once in a while?
>> >2. is there any way to find out whether a tar.gz or tar.bz2 is
>> >concatenated?
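
On question 2 in the quoted message (how to tell whether a .gz or .bz2 file is concatenated): nobody in the thread answered it, so the following is only a rough sketch and not something Pig or Hadoop provides. It assumes the file fits in memory and uses only the Python standard library. The function names count_gzip_members and count_bzip2_streams are made up for illustration. The idea is to decompress one stream at a time and check whether any bytes are left over after the end-of-stream marker; a count above 1 means the file consists of several concatenated streams.

```python
import bz2
import gzip
import zlib


def count_gzip_members(data: bytes) -> int:
    """Count the gzip members (concatenated streams) in `data`.

    Hypothetical helper for illustration. Raises zlib.error on invalid
    data (including trailing non-gzip bytes) and ValueError on truncation.
    """
    count = 0
    while data:
        # wbits=31 tells zlib to expect a gzip header and trailer.
        decomp = zlib.decompressobj(wbits=31)
        decomp.decompress(data)
        if not decomp.eof:
            raise ValueError("truncated gzip stream")
        count += 1
        # Bytes after the end of this member; empty if it was the last one.
        data = decomp.unused_data
    return count


def count_bzip2_streams(data: bytes) -> int:
    """Count the bzip2 streams in `data`.

    Hypothetical helper for illustration. Raises OSError on invalid
    data and ValueError on truncation.
    """
    count = 0
    while data:
        # A BZ2Decompressor handles exactly one stream, then sets eof.
        decomp = bz2.BZ2Decompressor()
        decomp.decompress(data)
        if not decomp.eof:
            raise ValueError("truncated bzip2 stream")
        count += 1
        data = decomp.unused_data
    return count


if __name__ == "__main__":
    first = gzip.compress(b"first part\n")
    second = gzip.compress(b"second part\n")
    print("single gzip file:", count_gzip_members(first))
    print("concatenated gzip file:", count_gzip_members(first + second))
```

For a real file, read it in binary mode and pass the bytes, e.g. count_gzip_members(open(path, "rb").read()), where path is whatever file you want to check. Note that some tools (parallel compressors, for example) may write multi-stream files on their own, so a count above 1 does not necessarily mean someone ran cat on several archives.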