Splittable bzip is supported by Hadoop (https://issues.apache.org/jira/browse/HADOOP-4012, since version 0.21). I have already opened a Jira ticket for handling concatenated bzip/gzip files: https://issues.apache.org/jira/browse/PIG-4533.

It seems that:
1. if bzip files were left for Hadoop to process, we would be fine
2. (if 1 is true) the documentation needs to be improved (delete the part about "handling-compression", or mark it obsolete)
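On the earlier question in this thread about how to tell whether a .gz or .bz2 file is concatenated: one possible check is to count the independent compressed streams in the file. Below is a minimal Python sketch, assuming the whole file fits in memory. The function names are my own (hypothetical), and this only inspects the file itself; it says nothing about how Pig or Hadoop will read it.

```python
import bz2
import gzip
import zlib


def count_gzip_members(data: bytes) -> int:
    """Count the gzip members (independent streams) in data."""
    members = 0
    while data:
        # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
        d = zlib.decompressobj(zlib.MAX_WBITS | 16)
        d.decompress(data)
        if not d.eof:
            raise ValueError("truncated gzip stream")
        members += 1
        # Bytes left over after the end of this member, if any.
        data = d.unused_data
    return members


def count_bzip2_streams(data: bytes) -> int:
    """Count the bzip2 streams in data."""
    streams = 0
    while data:
        d = bz2.BZ2Decompressor()
        d.decompress(data)
        if not d.eof:
            raise ValueError("truncated bzip2 stream")
        streams += 1
        # Bytes left over after the end of this stream, if any.
        data = d.unused_data
    return streams


if __name__ == "__main__":
    # Build a concatenated gzip file in memory, as `cat a.gz b.gz` would.
    both = gzip.compress(b"first\n") + gzip.compress(b"second\n")
    print(count_gzip_members(both))
```

A count above 1 means the file is a concatenation. For a real file, read it with open(path, "rb").read() and pass the bytes in. Trailing bytes that are not a valid stream (padding, for example) will raise an error rather than be counted.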
On Mon, May 18, 2015 at 9:51 PM, Daniel Dai <da...@hortonworks.com> wrote:

> I am not very sure, but it seems Hadoop did not support splittable bzip
> initially, so Pig implemented its own.
>
>
> On 5/18/15, 12:43 AM, "Tomas Hudik" <xhu...@gmail.com> wrote:
>
> >thank you Daniel.
> >
> >follow up question: is there any reason why bzip is processed by pig but
> >gzip is processed in Hadoop?
> >
> >thanks, Tomas
> >
> >On Mon, May 18, 2015 at 8:35 AM, Daniel Dai <da...@hortonworks.com> wrote:
> >
> >> The decompression of gzip is on the Hadoop side (TextInputFormat); if Hadoop
> >> fixed concatenated gzip, Pig should be fixed as well. Bzip, however, is
> >> processed by Pig code, which does not support concatenation.
> >>
> >> It seems we need to update the documentation.
> >>
> >> Daniel
> >>
> >> On 5/5/15, 3:51 AM, "Tomas Hudik" <xhu...@gmail.com> wrote:
> >>
> >> >Hi,
> >> >I read a section:
> >> >https://pig.apache.org/docs/r0.11.1/func.html#handling-compression
> >> >
> >> >according to which any concatenated bzip/gzip files will produce strange
> >> >results.
> >> >I did a test - concatenated some files and processed them. However, all
> >> >the results were identical to the ones produced on non-concatenated
> >> >files. Why? They should be different...
> >> >
> >> >Then I saw: https://issues.apache.org/jira/i#browse/HADOOP-6835
> >> >
> >> >My questions:
> >> >1. is https://pig.apache.org/docs/r0.11.1/func.html#handling-compression
> >> >still correct and concatenation will produce wrong results? Is this true
> >> >for any concatenated files or might it happen only once in a while?
> >> >2. is there any way to find out whether a tar.gz or tar.bz2 is
> >> >concatenated?