What are the exact filenames you used? The decompression of input files is based on the filename extension.
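One way to cross-check the counts outside of Pig is a small script that applies the same rule: decompress only when the name ends in .gz. This is a minimal sketch in Python, assuming one record per line; the helper names open_by_extension and count_lines are hypothetical and not part of Pig or Hadoop.

```python
import gzip
from pathlib import Path


def open_by_extension(path):
    """Open a text file, decompressing only if the name ends in .gz.

    Hypothetical helper: it mimics extension-based codec selection and
    does not inspect the file contents.
    """
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")


def count_lines(path):
    """Count lines in a file (assumption: one record per line)."""
    with open_by_extension(path) as handle:
        return sum(1 for _ in handle)
```

Running count_lines on an original part and on its .gz counterpart should give the same number. If it does, the files themselves agree and the difference is in how they are being loaded.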

Niels

On Jun 7, 2013 11:11 PM, "William Oberman" <[email protected]> wrote:

> I'm using pig 0.11.2.
>
> I had been processing ASCII files of json with schema: (key:chararray,
> columns:bag {column:tuple (timeUUID:chararray, value:chararray,
> timestamp:long)})
> For what it's worth, this is cassandra data, at a fairly low level.
>
> But, this was getting big, so I compressed it all with gzip (my "ETL"
> process is already chunking the data into 1GB parts, making the .gz files
> ~100MB).
>
> As a sanity check, I decided to do a quick check of pre/post, and the
> numbers aren't matching. Then I've done a lot of messing around trying to
> figure out why and I'm getting more and more puzzled.
>
> My "quick check" was to get an overall count. It looked like (assuming A
> is a LOAD given the schema above):
> -------
> allGrp = GROUP A ALL;
> aCount = FOREACH allGrp GENERATE group, COUNT(A);
> DUMP aCount;
> -------
>
> Basically the original data returned a number GREATER than the compressed
> data number (not by a lot, but still...).
>
> Then I uncompressed all of the compressed files, and did a size check of
> original vs. uncompressed. They were the same. Then I "quick checked" the
> uncompressed, and the count of that was == original! So, the way in which
> pig processes the gzip'ed data is actually somehow different.
>
> Then I tried to see if there are nulls floating around, so I loaded "orig"
> and "comp" and tried to catch the "missing keys" with outer joins:
> -----------
> joined = JOIN orig by key LEFT OUTER, comp BY key;
> filtered = FILTER joined BY (comp::key is null);
> -----------
> And filtered was empty! I then tried the reverse (which makes no sense I
> know, as this was the smaller set), and filtered is still empty!
>
> All of these loads are through a custom UDF that extends LoadFunc. But,
> there isn't much to that UDF (and it's been in use for many months now).
> Basically, the "raw" data is JSON (from cassandra's sstable2json program).
> And I parse the json and turn it into the pig structure of the schema
> noted above.
>
> Does anything make sense here?
>
> Thanks!
>
> will
