Thanks for sending me this! Glad you found your issue. And though my mysterious bug stays mysterious, better that than a real problem with Pig's gzip handling.
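For anyone who finds this thread in the archives: the behavior William describes below is the LoadFunc.getNext() contract, where returning null tells Pig the input for the split is exhausted. Here is a minimal sketch of that pattern and of the fix (throwing instead of returning null on a bad record). This is a made-up loader, not William's actual UDF, and the JSON parsing is elided:

-------
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Hypothetical line-oriented JSON loader (NOT William's UDF).
public class JsonLineLoader extends LoadFunc {
    private RecordReader reader;

    @Override
    public void setLocation(String location, Job job) throws IOException {
        FileInputFormat.setInputPaths(job, location);
    }

    @Override
    public InputFormat getInputFormat() {
        return new TextInputFormat();
    }

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) {
        this.reader = reader;
    }

    @Override
    public Tuple getNext() throws IOException {
        try {
            if (!reader.nextKeyValue()) {
                return null;  // null means "no more records" -- the only place it belongs
            }
            String line = ((Text) reader.getCurrentValue()).toString();
            if (!looksLikeJson(line)) {
                // The bug in the thread: returning null here makes Pig think the
                // split is finished and silently drop the rest of the file.
                // Throw (or count and skip) instead.
                throw new IOException("unparseable record: " + line);
            }
            // Real parsing into (key, {(timeUUID, value, timestamp)}) elided.
            return TupleFactory.getInstance().newTuple(line);
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }

    private boolean looksLikeJson(String line) {
        return !line.isEmpty() && line.charAt(0) == '{';
    }
}
-------

Throwing an IOException makes the corrupt record fail loudly (counting and skipping it with a warning is the other common choice); either way, null should only ever mean "no more input", otherwise the rest of the file is silently dropped, which is exactly the truncated COUNT(*) seen below.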
2013/6/12 William Oberman <[email protected]>

> I know what's going on, and it's kind of dumb on my part, but I'll post anyway to help someone else who might be puzzled. To review, I had data that looked like this (and yes, it's corrupt, but it happens sometimes):
> "json\njson\n...json\n\0\0\0...\0\0\0\0middle_of_json\njson\n...json\n"
>
> I.e., a huge block of null characters in the middle of \n-separated json. Usually the last character before the null block was a \n, but the first character after the null block was in the middle of a json string.
>
> My custom UDF returned Pig-friendly data structures given JSON. The "dumb" thing was that I returned null on a bad parse, instead of throwing an IOException. For Pig, returning null is a signal to stop loading data (I should have paid closer attention to the javadoc).
>
> That's why my uncompressed count > compressed count: it's the difference between splits vs. no splits (since gz doesn't allow splitting).
>
> In the uncompressed case, blocks before AND after the nulls were OK and contributed data to my COUNT(*).
>
> In the compressed case, only data before the nulls contributed data to my COUNT(*).
>
> will
>
>
> On Tue, Jun 11, 2013 at 8:46 AM, Jonathan Coveney <[email protected]> wrote:
>
> > William,
> >
> > It would be really awesome if you could furnish a file that replicates this issue that we can attach to a bug in jira. A long time ago I had a very weird issue with some gzip files and never got to the bottom of it... I'm wondering if this could be it!
> >
> >
> > 2013/6/10 Niels Basjes <[email protected]>
> >
> > > Bzip2 is only splittable in newer versions of hadoop.
> > > On Jun 10, 2013 10:28 PM, "Alan Crosswell" <[email protected]> wrote:
> > >
> > > > Ignore what I said and see https://forums.aws.amazon.com/thread.jspa?threadID=51232
> > > >
> > > > bzip2 was documented somewhere as being splittable, but this appears to not actually be implemented, at least in AWS S3.
> > > > /a
> > > >
> > > >
> > > > On Mon, Jun 10, 2013 at 12:41 PM, Alan Crosswell <[email protected]> wrote:
> > > >
> > > > > Suggest that if you have a choice, you use bzip2 compression instead of gzip, as bzip2 is block-based and Pig can split reading a large bzipped file across multiple mappers, while gzip can't be split that way.
> > > > >
> > > > >
> > > > > On Mon, Jun 10, 2013 at 12:06 PM, William Oberman <[email protected]> wrote:
> > > > >
> > > > > > I still don't fully understand (and am still debugging), but I have a "problem file" and a theory.
> > > > > >
> > > > > > The file has a "corrupt line" that is a huge block of null characters followed by a "\n" (other lines are json followed by "\n"). I'm thinking that's a problem with my cassandra -> s3 process, but that's out of scope for this thread.... I wrote scripts to examine the file directly, and if I stop counting at the weird line, I get the "gz" count. If I count all lines (i.e. don't fail at the corrupt line) I get the "uncompressed" count.
> > > > > >
> > > > > > I don't know how to debug hadoop/pig quite as well, though I'm trying now. But my working theory is that some combination of pig/hadoop aborts processing the gz stream on a null character, but keeps chugging on a non-gz stream. Does that sound familiar?
> > > > > >
> > > > > > will
> > > > > >
> > > > > >
> > > > > > On Sat, Jun 8, 2013 at 8:00 AM, William Oberman <[email protected]> wrote:
> > > > > >
> > > > > > > They are all *.gz, I confirmed that first :-)
> > > > > > >
> > > > > > >
> > > > > > > On Saturday, June 8, 2013, Niels Basjes wrote:
> > > > > > >
> > > > > > > > What are the exact filenames you used? The decompression of input files is based on the filename extension.
> > > > > > > >
> > > > > > > > Niels
> > > > > > > > On Jun 7, 2013 11:11 PM, "William Oberman" <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > I'm using pig 0.11.2.
> > > > > > > > >
> > > > > > > > > I had been processing ASCII files of json with schema: (key:chararray, columns:bag {column:tuple (timeUUID:chararray, value:chararray, timestamp:long)})
> > > > > > > > > For what it's worth, this is cassandra data, at a fairly low level.
> > > > > > > > >
> > > > > > > > > But this was getting big, so I compressed it all with gzip (my "ETL" process is already chunking the data into 1GB parts, making the .gz files ~100MB).
> > > > > > > > >
> > > > > > > > > As a sanity check, I decided to do a quick check of pre/post, and the numbers aren't matching. Then I've done a lot of messing around trying to figure out why, and I'm getting more and more puzzled.
> > > > > > > > >
> > > > > > > > > My "quick check" was to get an overall count. It looked like (assuming A is a LOAD given the schema above):
> > > > > > > > > -------
> > > > > > > > > allGrp = GROUP A ALL;
> > > > > > > > > aCount = FOREACH allGrp GENERATE group, COUNT(A);
> > > > > > > > > DUMP aCount;
> > > > > > > > > -------
> > > > > > > > >
> > > > > > > > > Basically the original data returned a number GREATER than the compressed data number (not by a lot, but still...).
> > > > > > > > >
> > > > > > > > > Then I uncompressed all of the compressed files, and did a size check of original vs. uncompressed. They were the same. Then I "quick checked" the uncompressed, and the count of that was == original! So, the way in which pig processes the gzip'ed data is actually somehow different.
> > > > > > > > >
> > > > > > > > > Then I tried to see if there are nulls floating around, so I loaded "orig" and "comp" and tried to catch the "missing keys" with outer joins:
> > > > > > > > > -----------
> > > > > > > > > joined = JOIN orig BY key LEFT OUTER, comp BY key;
> > > > > > > > > filtered = FILTER joined BY (comp::key is null);
> > > > > > > > > -----------
> > > > > > > > > And filtered was empty! I then tried the reverse (which makes no sense, I know, as this was the smaller set), and filtered is still empty!
> > > > > > > > >
> > > > > > > > > All of these loads are through a custom UDF that extends LoadFunc. But there isn't much to that UDF (and it's been in use for many months now). Basically, the "raw" data is JSON (from cassandra's sstable2json program). And I parse the json and turn it into the pig structure of the schema noted above.
> > > > > > > > >
> > > > > > > > > Does anything make sense here?
> > > > > > > > >
> > > > > > > > > Thanks!
> > > > > > > > >
> > > > > > > > > will
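PS: the inspection scripts William mentions above aren't included in the thread. For anyone who wants to reproduce the check, here is a rough standalone sketch (my own guess at it, not William's script, assuming \n-separated records and gzip detection by file extension):

-------
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

// Rough sketch: count '\n'-separated records in a plain or gzipped file and
// report where the first record containing NUL bytes appears.
public class NulBlockCheck {
    public static void main(String[] args) throws IOException {
        String path = args[0];
        InputStream raw = new FileInputStream(path);
        InputStream in = path.endsWith(".gz") ? new GZIPInputStream(raw) : raw;
        long total = 0;
        long firstNulRecord = -1;
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = r.readLine()) != null) {
                total++;
                if (firstNulRecord < 0 && line.indexOf('\0') >= 0) {
                    firstNulRecord = total;  // first record touched by the NUL block
                }
            }
        }
        System.out.println("records: " + total
                + (firstNulRecord < 0 ? ", no NUL bytes seen"
                                      : ", first NUL record at line " + firstNulRecord));
    }
}
-------

Running it on both the plain and the gzipped copy should print the same totals (which confirms the gz file itself decodes fully), and comparing the total record count with the position of the first NUL-containing record shows how much data a load that stops at the corrupt line would silently drop.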
