I still don't fully understand (and am still debugging), but I have a "problem file" and a theory.
The file has a "corrupt line" that is a huge block of null characters
followed by a "\n" (other lines are json followed by "\n"). I'm thinking
that's a problem with my cassandra -> s3 process, but it's out of scope for
this thread....

I wrote scripts to examine the file directly, and if I stop counting at the
weird line, I get the "gz" count. If I count all lines (i.e. don't stop at
the corrupt line) I get the "uncompressed" count.

I don't know how to debug hadoop/pig quite as well, though I'm trying now.
But my working theory is that some combination of pig/hadoop aborts
processing the gz stream on a null character, but keeps chugging on a
non-gz stream.

Does that sound familiar?

will


On Sat, Jun 8, 2013 at 8:00 AM, William Oberman <[email protected]> wrote:

> They are all *.gz, I confirmed that first :-)
>
>
> On Saturday, June 8, 2013, Niels Basjes wrote:
>
>> What are the exact filenames you used?
>> The decompression of input files is based on the filename extension.
>>
>> Niels
>>
>> On Jun 7, 2013 11:11 PM, "William Oberman" <[email protected]> wrote:
>>
>> > I'm using pig 0.11.2.
>> >
>> > I had been processing ASCII files of json with this schema:
>> > (key:chararray, columns:bag {column:tuple (timeUUID:chararray,
>> > value:chararray, timestamp:long)})
>> > For what it's worth, this is cassandra data, at a fairly low level.
>> >
>> > But this was getting big, so I compressed it all with gzip (my "ETL"
>> > process is already chunking the data into 1GB parts, making the .gz
>> > files ~100MB).
>> >
>> > As a sanity check, I decided to do a quick check of pre/post, and the
>> > numbers aren't matching. Since then I've done a lot of messing around
>> > trying to figure out why, and I'm getting more and more puzzled.
>> >
>> > My "quick check" was to get an overall count. It looked like this
>> > (assuming A is a LOAD with the schema above):
>> > -------
>> > allGrp = GROUP A ALL;
>> > aCount = FOREACH allGrp GENERATE group, COUNT(A);
>> > DUMP aCount;
>> > -------
>> >
>> > Basically, the original data returned a number GREATER than the
>> > compressed data's number (not by a lot, but still...).
>> >
>> > Then I uncompressed all of the compressed files and did a size check of
>> > original vs. uncompressed. They were the same. Then I "quick checked"
>> > the uncompressed files, and that count was == the original! So, the way
>> > in which pig processes the gzip'ed data is actually somehow different.
>> >
>> > Then I tried to see if there are nulls floating around, so I loaded
>> > "orig" and "comp" and tried to catch the "missing keys" with outer joins:
>> > -----------
>> > joined = JOIN orig BY key LEFT OUTER, comp BY key;
>> > filtered = FILTER joined BY (comp::key is null);
>> > -----------
>> > And filtered was empty! I then tried the reverse (which makes no sense,
>> > I know, as this was the smaller set), and filtered was still empty!
>> >
>> > All of these loads go through a custom UDF that extends LoadFunc. But
>> > there isn't much to that UDF (and it's been in use for many months now).
>> > Basically, the "raw" data is JSON (from cassandra's sstable2json
>> > program), and I parse the json and turn it into the pig structure of the
>> > schema noted above.
>> >
>> > Does anything make sense here?
>> >
>> > Thanks!
>> >
>> > will
>> >
>>
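(For reference, a minimal sketch of the direct-count check described at the
top of this mail, assuming Python 3; the part filenames are placeholders,
not the real names from the S3 export, and this is not the exact script I
ran.)

-------
import gzip

# Placeholder filenames -- the real files are the ~1GB parts from the
# cassandra -> s3 export, gzip'ed to ~100MB each.
GZ_PATH = "part-00000.gz"
PLAIN_PATH = "part-00000"

def count_lines(fileobj, stop_at_null_block=False):
    """Count newline-terminated lines; optionally stop at the first line
    that is nothing but null characters (the suspected corrupt line)."""
    count = 0
    for line in fileobj:
        body = line.rstrip(b"\n")
        if stop_at_null_block and body and all(b == 0 for b in body):
            break
        count += 1
    return count

# Count the gzip'ed copy, stopping at the null block (matches my "gz" count),
# then count every line of the uncompressed copy (matches "uncompressed").
with gzip.open(GZ_PATH, "rb") as f:
    print("gz, stop at null block:", count_lines(f, stop_at_null_block=True))

with open(PLAIN_PATH, "rb") as f:
    print("plain, count everything:", count_lines(f))
-------

If the first number stops short of the second at the same line, that's the
mismatch I'm describing above.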
