Here's what I just tried. I gzipped a file:

    cat foo.tsv | gzip > foo.tsv.gz

uploaded it to my HDFS (hdfs://master:8020):

    hadoop fs -put foo.tsv.gz /tmp

and then loaded and dumped it with Pig:

    grunt> data = LOAD 'hdfs://master/tmp/foo.tsv.gz';
    grunt> DUMP data;
    (98384,559)
    (98385,587)
    (98386,573)
    (98387,587)
    (98388,589)
    (98389,584)
    (98390,572)
    (98391,567)

Looks great. I'm inclined to blame it on your version: I'm using pig-0.8 and hadoop 0.20.2.

--jacob
@thedatachef

On Tue, 2011-02-22 at 08:21 -0500, Eric Lubow wrote:
> I apologize for the double mailing:
>
> grunt> Y = LOAD 'hdfs:///mnt/test.log.gz' AS (line:chararray);
> grunt> foo = LIMIT Y 5;
> grunt> dump foo
> (<0\Mtest.log?]?o?H??}?)
>
> It didn't work out of HDFS.
>
> -e
>
> On Tue, Feb 22, 2011 at 08:18, Eric Lubow <eric.lu...@gmail.com> wrote:
>
> > I'm not sure what you mean by testing it directly out of a normal HDFS.
> > I have added it to HDFS with 'hadoop fs -copyFromLocal', but then I
> > can't access it via Pig using file:///. Am I doing something wrong, or
> > are you asking me to try something else?
> >
> > -e
> >
> > On Mon, Feb 21, 2011 at 21:36, Dmitriy Ryaboy <dvrya...@gmail.com> wrote:
> >
> >> He's on 0.6, so the interface is different. And for him even PigStorage
> >> doesn't decompress...
> >>
> >> It occurs to me the problem may be with the underlying fs. Eric, what
> >> happens when you try reading out of a normal HDFS (you can just run a
> >> pseudo-distributed cluster locally to test)?
> >>
> >> D
> >>
> >> On Mon, Feb 21, 2011 at 6:28 PM, Charles Gonçalves
> >> <charles...@gmail.com> wrote:
> >>
> >>> I'm not sure if it is the same problem.
> >>>
> >>> I did a custom loader and I got a problem reading compressed files too.
> >>> So I noticed that in PigStorage the function getInputFormat was:
> >>>
> >>> public InputFormat getInputFormat() throws IOException {
> >>>     if (loadLocation.endsWith(".bz2") || loadLocation.endsWith(".bz")) {
> >>>         return new Bzip2TextInputFormat();
> >>>     } else {
> >>>         return new PigTextInputFormat();
> >>>     }
> >>> }
> >>>
> >>> while in my custom loader it was:
> >>>
> >>> public InputFormat getInputFormat() {
> >>>     return new TextInputFormat();
> >>> }
> >>>
> >>> I just copied the code from PigStorage and everything worked.
> >>>
> >>> On Mon, Feb 21, 2011 at 8:46 PM, Eric Lubow <eric.lu...@gmail.com> wrote:
> >>>
> >>> > I have been working my way through Pig recently with a lot of help
> >>> > from the folks in #hadoop-pig on Freenode.
> >>> >
> >>> > The problem I am having is with reading any gzip'd files from
> >>> > anywhere (either locally or from S3). This is the case with Pig in
> >>> > local mode. I am using Pig 0.6 on an Amazon EMR (Elastic MapReduce)
> >>> > instance. I have checked my core-site.xml and I have the following
> >>> > entry for compression codecs:
> >>> >
> >>> > <property>
> >>> >   <name>io.compression.codecs</name>
> >>> >   <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
> >>> > </property>
> >>> >
> >>> > Gzip is listed there, so I don't know why it won't decode properly.
> >>> > I am trying the following as a test:
> >>> >
> >>> > Y = LOAD 's3://$bucket/$path/log.*.gz' AS (line:chararray);
> >>> > foo = LIMIT Y 5;
> >>> > dump foo
> >>> > (?ks?F?6?)
> >>> >
> >>> > Y = LOAD 'file:///home/hadoop/logs/test.log.gz' AS (line:chararray);
> >>> > foo = LIMIT Y 5;
> >>> > dump foo
> >>> > (?ks?F?6?)
> >>> >
> >>> > Both yield the same results.
> >>> > What I am actually trying to parse is compressed JSON. Up to this
> >>> > point Dmitriy has helped me, and the JSON loads and the scripts run
> >>> > perfectly as long as the logs are not compressed. Since the logs are
> >>> > compressed, my hands are tied. Any suggestions to get me moving in
> >>> > the right direction? Thanks.
> >>> >
> >>> > -e
> >>> > --
> >>> > Eric Lubow
> >>> > e: eric.lu...@gmail.com
> >>> > w: eric.lubow.org
> >>>
> >>> --
> >>> Charles Ferreira Gonçalves
> >>> http://homepages.dcc.ufmg.br/~charles/
> >>> UFMG - ICEx - Dcc
> >>> Cel.: 55 31 87741485
> >>> Tel.: 55 31 34741485
> >>> Lab.: 55 31 34095840
>
> --
> Eric Lubow
> e: eric.lu...@gmail.com
> w: eric.lubow.org
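[Editor's note] The garbage in the DUMP output above, like `(?ks?F?6?)`, is what raw, undecompressed gzip bytes look like when printed as text: every gzip stream starts with the magic bytes 0x1f 0x8b. A loader that picks a plain TextInputFormat (as in the broken custom loader) hands those raw bytes straight through. This is a minimal, self-contained sketch using only `java.util.zip` (file names are illustrative, not from the thread) showing both the raw view and the correctly decompressed view of the same file:

```java
import java.io.*;
import java.util.zip.*;

public class GzipRoundTrip {
    public static void main(String[] args) throws IOException {
        // Write a tiny gzip file (a stand-in for something like test.log.gz).
        File f = File.createTempFile("test", ".log.gz");
        f.deleteOnExit();
        try (Writer w = new OutputStreamWriter(
                new GZIPOutputStream(new FileOutputStream(f)), "UTF-8")) {
            w.write("hello world\n");
        }

        // A loader that skips decompression sees the raw stream, which
        // begins with the gzip magic bytes 0x1f 0x8b -- the source of the
        // (?ks?F?6?)-style garbage when dumped as chararray.
        byte[] magic = new byte[2];
        try (InputStream in = new FileInputStream(f)) {
            if (in.read(magic) != 2) throw new IOException("short read");
        }
        System.out.printf("magic: %02x %02x%n", magic[0] & 0xff, magic[1] & 0xff);

        // Wrapping the stream in GZIPInputStream recovers the original text,
        // which is what a gzip-aware input format does for the loader.
        try (BufferedReader r = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new FileInputStream(f)), "UTF-8"))) {
            System.out.println("decoded: " + r.readLine());
        }
    }
}
```

Running it prints `magic: 1f 8b` followed by `decoded: hello world`, which matches Charles's diagnosis: the fix is choosing an input format that decompresses, not changing the codec configuration.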