I have been working my way through Pig recently with a lot of help from the folks in #hadoop-pig on Freenode.
The problem I am having is with reading any gzipped files from anywhere (either locally or from S3); this is the case with Pig in local mode. I am using Pig 0.6 on an Amazon EMR (Elastic MapReduce) instance. I have checked my core-site.xml, and it has the following entry for compression codecs:

<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
</property>

GzipCodec is listed there, so I don't know why the files won't decode properly. As a test, I am trying the following:

--
Y = LOAD 's3://$bucket/$path/log.*.gz' AS (line:chararray);
foo = LIMIT Y 5;
dump foo;
(?ks?F?6?)

Y = LOAD 'file:///home/hadoop/logs/test.log.gz' AS (line:chararray);
foo = LIMIT Y 5;
dump foo;
(?ks?F?6?)
--

Both yield the same result: what looks like the raw compressed bytes rather than the decompressed lines.

What I am actually trying to parse is compressed JSON. Up to this point Dmitriy has helped me, and the JSON loads and the scripts run perfectly as long as the logs are not compressed. Since the logs are compressed, my hands are tied. Any suggestions to get me moving in the right direction?

Thanks.

-e

--
Eric Lubow
e: [email protected]
w: eric.lubow.org
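P.S. For context, here is roughly what the working uncompressed pipeline looks like. The jar path and loader class name below are only placeholders standing in for the actual JSON loader I am using, not the real names:

-- Sketch of the uncompressed run that works (placeholder jar path and loader class)
REGISTER /home/hadoop/lib/json-loader.jar;
logs = LOAD 'file:///home/hadoop/logs/test.log' USING com.example.pig.JsonLineLoader();
first5 = LIMIT logs 5;
dump first5;

The only change for the compressed case is pointing the LOAD at the .gz files, and that is exactly where it falls over.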
