Re: problems with .gz

William Oberman Sat, 08 Jun 2013 05:01:34 -0700

They are all *.gz, I confirmed that first :-)

On Saturday, June 8, 2013, Niels Basjes wrote:


> What are the exact filenames you used?
> The decompression of input files is based on the filename extension.
>
> Niels
> On Jun 7, 2013 11:11 PM, "William Oberman" <[email protected]> wrote:
>
> > I'm using pig 0.11.2.
> >
> > I had been processing ASCII files of json with schema: (key:chararray,
> > columns:bag {column:tuple (timeUUID:chararray, value:chararray,
> > timestamp:long)})
> > For what it's worth, this is cassandra data, at a fairly low level.
> >
> > But, this was getting big, so I compressed it all with gzip (my "ETL"
> > process is already chunking the data into 1GB parts, making the .gz files
> > ~100MB).
> >
> > As a sanity check, I decided to do a quick check of pre/post, and the
> > numbers aren't matching.  Then I've done a lot of messing around trying to
> > figure out why and I'm getting more and more puzzled.
> >
> > My "quick check" was to get an overall count.  It looked like (assuming A
> > is a LOAD given the schema above):
> > -------
> > allGrp = GROUP A ALL;
> > aCount = FOREACH allGrp GENERATE group, COUNT(A);
> > DUMP aCount;
> > -------
> >
> > Basically the original data returned a number GREATER than the compressed
> > data number (not by a lot, but still...).
> >
> > Then I uncompressed all of the compressed files, and did a size check of
> > original vs. uncompressed.  They were the same.  Then I "quick checked" the
> > uncompressed, and the count of that was == original!  So, the way in which
> > pig processes the gzip'ed data is actually somehow different.
> >
> > Then I tried to see if there are nulls floating around, so I loaded "orig"
> > and "comp" and tried to catch the "missing keys" with outer joins:
> > -----------
> > joined = JOIN orig BY key LEFT OUTER, comp BY key;
> > filtered = FILTER joined BY (comp::key is null);
> > -----------
> > And filtered was empty!  I then tried the reverse (which makes no sense I
> > know, as this was the smaller set), and filtered is still empty!
> >
> > All of these loads are through a custom UDF that extends LoadFunc.  But,
> > there isn't much to that UDF (and it's been in use for many months now).
> > Basically, the "raw" data is JSON (from cassandra's sstable2json program).
> > And I parse the json and turn it into the pig structure of the schema
> > noted above.
> >
> > Does anything make sense here?
> >
> > Thanks!
> >
> > will
> >
>
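
A sketch of one further check that follows from the join result quoted above: if the outer joins find no missing keys in either direction while the totals still differ, that would suggest some keys occur a different number of times in each load. Assuming "orig" and "comp" are the two relations loaded as described (the aliases origGrp, compGrp, origN, compN, both and diff are made up for illustration), the per-key counts could be compared roughly like this:

```pig
-- Count how many rows each key has in the original load and in the gzip'ed load.
origGrp = GROUP orig BY key;
origN   = FOREACH origGrp GENERATE group AS key, COUNT(orig) AS n;
compGrp = GROUP comp BY key;
compN   = FOREACH compGrp GENERATE group AS key, COUNT(comp) AS n;

-- The outer joins were empty both ways, so every key should exist on both
-- sides and an inner join is enough to line the counts up.
both = JOIN origN BY key, compN BY key;

-- Keep only the keys whose row counts disagree between the two loads.
diff = FILTER both BY origN::n != compN::n;
DUMP diff;
```

If diff comes back non-empty, the keys it lists would be the place to look for rows that were split, duplicated, or dropped on the compressed side. If it is empty too, the difference is more likely in how the overall count was taken than in the data itself.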


-- 
Will Oberman
Civic Science, Inc.
6101 Penn Avenue, Fifth Floor
Pittsburgh, PA 15206
(M) 412-480-7835
(E) [email protected]
