I have been working my way through Pig recently with a lot of help from the folks in #hadoop-pig on Freenode.
The problem I am having is with reading any gzipped files from anywhere (either locally or from S3); this is the case with Pig in local mode. I am using Pig 0.6 on an Amazon EMR (Elastic MapReduce) instance. I have checked my core-site.xml, and it has the following entry for compression codecs:

<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
</property>

GzipCodec is listed there, so I don't know why the files won't decode properly. As a test, I am trying the following:

--
Y = LOAD 's3://$bucket/$path/log.*.gz' AS (line:chararray);
foo = LIMIT Y 5;
dump foo;
(?ks?F?6?)

Y = LOAD 'file:///home/hadoop/logs/test.log.gz' AS (line:chararray);
foo = LIMIT Y 5;
dump foo;
(?ks?F?6?)
--

Both yield the same result: what looks like the raw compressed bytes rather than the decompressed lines.

What I am actually trying to parse is compressed JSON. Up to this point Dmitriy has helped me, and the JSON loads and the scripts run perfectly as long as the logs are not compressed. Since the logs are compressed, my hands are tied. Any suggestions to get me moving in the right direction?

Thanks.

-e

--
Eric Lubow
e: [email protected]
w: eric.lubow.org
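P.S. For context, here is roughly what the working uncompressed pipeline looks like. The jar path and loader class name below are only placeholders standing in for the actual JSON loader I am using, not the real names:

-- Sketch of the uncompressed run that works (placeholder jar path and loader class)
REGISTER /home/hadoop/lib/json-loader.jar;
logs = LOAD 'file:///home/hadoop/logs/test.log' USING com.example.pig.JsonLineLoader();
first5 = LIMIT logs 5;
dump first5;

The only change for the compressed case is pointing the LOAD at the .gz files, and that is exactly where it falls over.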
