They are all *.gz, I confirmed that first :-)

On Saturday, June 8, 2013, Niels Basjes wrote:

> What are the exact filenames you used?
> The decompression of input files is based on the filename extension.
>
> Niels
> On Jun 7, 2013 11:11 PM, "William Oberman" <[email protected]> wrote:
>
> > I'm using pig 0.11.2.
> >
> > I had been processing ASCII files of json with schema: (key:chararray,
> > columns:bag {column:tuple (timeUUID:chararray, value:chararray,
> > timestamp:long)})
> > For what it's worth, this is cassandra data, at a fairly low level.
> >
> > But, this was getting big, so I compressed it all with gzip (my "ETL"
> > process is already chunking the data into 1GB parts, making the .gz files
> > ~100MB).
> >
> > As a sanity check, I decided to do a quick check of pre/post, and the
> > numbers aren't matching. Then I've done a lot of messing around trying to
> > figure out why and I'm getting more and more puzzled.
> >
> > My "quick check" was to get an overall count. It looked like (assuming A
> > is a LOAD given the schema above):
> > -------
> > allGrp = GROUP A ALL;
> > aCount = FOREACH allGrp GENERATE group, COUNT(A);
> > DUMP aCount;
> > -------
> >
> > Basically the original data returned a number GREATER than the compressed
> > data number (not by a lot, but still...).
> >
> > Then I uncompressed all of the compressed files, and did a size check of
> > original vs. uncompressed. They were the same. Then I "quick checked" the
> > uncompressed, and the count of that was == original! So, the way in which
> > pig processes the gzip'ed data is actually somehow different.
> >
> > Then I tried to see if there are nulls floating around, so I loaded "orig"
> > and "comp" and tried to catch the "missing keys" with outer joins:
> > -----------
> > joined = JOIN orig by key LEFT OUTER, comp BY key;
> > filtered = FILTER joined BY (comp::key is null);
> > -----------
> > And filtered was empty! I then tried the reverse (which makes no sense I
> > know, as this was the smaller set), and filtered is still empty!
> >
> > All of these loads are through a custom UDF that extends LoadFunc. But,
> > there isn't much to that UDF (and it's been in use for many months now).
> > Basically, the "raw" data is JSON (from cassandra's sstable2json program).
> > And I parse the json and turn it into the pig structure of the schema
> > noted above.
> >
> > Does anything make sense here?
> >
> > Thanks!
> >
> > will
> >
>

--
Will Oberman
Civic Science, Inc.
6101 Penn Avenue, Fifth Floor
Pittsburgh, PA 15206
(M) 412-480-7835
(E) [email protected]
