Thanks for sending me this! Glad you found your issue. And though my
mysterious bug stays mysterious, better that than a real problem in Pig's
gzip handling.


2013/6/12 William Oberman <[email protected]>

> I know what's going on, and it's kind of dumb on my part, but I'll post
> anyway to help someone else who might be puzzled.  To review, I had data
> that looked like this (and yes, it's corrupt, but it happens sometimes):
> "json\njson\n...json\n\0\0\0...\0\0\0\0middle_of_json\njson\n...json\n"
>
> I.e., a huge block of null characters in the middle of \n-separated json.
>  Usually the last character before the null block was a \n, but the first
> character after the null block fell in the middle of a json string.
>
> My custom UDF returned pig-friendly data structures given JSON.  The "dumb"
> thing was that I returned null on a bad parse, instead of throwing an
> IOException.  For Pig, a LoadFunc returning null from getNext() signals
> that there is no more data to load (I should have paid closer attention to
> the javadoc).
>
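> For anyone fixing the same mistake: in the LoadFunc, throw instead of
> returning null when a record fails to parse.  A minimal sketch of the
> relevant getNext() (assumes "reader" is the RecordReader saved in
> prepareToRead(), and parseJsonToTuple() stands in for whatever parser
> you use):
> -------
> @Override
> public Tuple getNext() throws IOException {
>     try {
>         if (!reader.nextKeyValue()) {
>             return null;  // null correctly means "no more records"
>         }
>         String line = reader.getCurrentValue().toString();
>         // hypothetical helper, assumed to throw unchecked on bad input
>         return parseJsonToTuple(line);
>     } catch (InterruptedException e) {
>         throw new IOException(e);
>     } catch (RuntimeException parseError) {
>         // A bad record must NOT come back as null: Pig reads null as
>         // end-of-data for the split and silently skips the rest of it.
>         throw new IOException("Unparseable record", parseError);
>     }
> }
> -------
>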
> That explains why my uncompressed count > compressed count: it comes down
> to splits vs. no splits (since gz doesn't allow splitting).
>
> In the uncompressed case, blocks before AND AFTER the nulls were ok and
> contributed data to my COUNT(*).
>
> In the compressed case, only data before the nulls contributed data to my
> COUNT(*).
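>
> For anyone curious where the split behavior comes from, this is roughly
> the check in Hadoop's text input format (a paraphrase, not the exact
> source): gzip's codec doesn't implement the splittable interface, while
> bzip2's does in newer Hadoop releases.
> -------
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.compress.CompressionCodec;
> import org.apache.hadoop.io.compress.CompressionCodecFactory;
> import org.apache.hadoop.io.compress.SplittableCompressionCodec;
> import org.apache.hadoop.mapreduce.JobContext;
>
> // Paraphrase of TextInputFormat.isSplitable():
> protected boolean isSplitable(JobContext context, Path file) {
>     CompressionCodec codec = new CompressionCodecFactory(
>         context.getConfiguration()).getCodec(file);
>     if (codec == null) {
>         return true;  // uncompressed text: split freely
>     }
>     // GzipCodec never implements this; BZip2Codec does in newer releases.
>     return codec instanceof SplittableCompressionCodec;
> }
> -------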
>
> will
>
>
> On Tue, Jun 11, 2013 at 8:46 AM, Jonathan Coveney
> <[email protected]> wrote:
>
> > William,
> >
> > It would be really awesome if you could furnish a file that replicates
> > this issue that we can attach to a bug in JIRA. A long time ago I had a
> > very weird issue with some gzip files and never got to the bottom of
> > it... I'm wondering if this could be it!
> >
> >
> > 2013/6/10 Niels Basjes <[email protected]>
> >
> > > Bzip2 is only splittable in newer versions of hadoop.
> > > On Jun 10, 2013 10:28 PM, "Alan Crosswell" <[email protected]> wrote:
> > >
> > > > Ignore what I said and see
> > > > https://forums.aws.amazon.com/thread.jspa?threadID=51232
> > > >
> > > > bzip2 was documented somewhere as being splittable, but this appears
> > > > to not actually be implemented, at least in AWS S3.
> > > > /a
> > > >
> > > >
> > > > On Mon, Jun 10, 2013 at 12:41 PM, Alan Crosswell <[email protected]>
> > > > wrote:
> > > >
> > > > > Suggest that if you have a choice, you use bzip2 compression
> > > > > instead of gzip, as bzip2 is block-based and Pig can split reading
> > > > > a large bzipped file across multiple mappers, while gzip can't be
> > > > > split that way.
> > > > >
> > > > >
> > > > > On Mon, Jun 10, 2013 at 12:06 PM, William Oberman
> > > > > <[email protected]> wrote:
> > > > >
> > > > >> I still don't fully understand (and am still debugging), but I
> > > > >> have a "problem file" and a theory.
> > > > >>
> > > > >> The file has a "corrupt line" that is a huge block of null
> > > > >> characters followed by a "\n" (other lines are json followed by
> > > > >> "\n").  I'm thinking that's a problem with my cassandra -> s3
> > > > >> process, but that is out of scope for this thread...  I wrote
> > > > >> scripts to examine the file directly, and if I stop counting at
> > > > >> the weird line, I get the "gz" count.  If I count all lines (e.g.
> > > > >> don't fail at the corrupt line) I get the "uncompressed" count.
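> > > > >>
> > > > >> The examination amounted to something like this sketch (a
> > > > >> byte-level count; names here are made up for illustration):
> > > > >> -------
> > > > >> import java.io.BufferedInputStream;
> > > > >> import java.io.FileInputStream;
> > > > >> import java.io.IOException;
> > > > >>
> > > > >> public class NullScan {
> > > > >>     public static void main(String[] args) throws IOException {
> > > > >>         long lines = 0, nuls = 0;
> > > > >>         try (BufferedInputStream in = new BufferedInputStream(
> > > > >>                 new FileInputStream(args[0]))) {
> > > > >>             int b;
> > > > >>             while ((b = in.read()) != -1) {
> > > > >>                 if (b == '\n') lines++;   // one record per line
> > > > >>                 else if (b == 0) nuls++;  // corrupt-block bytes
> > > > >>             }
> > > > >>         }
> > > > >>         System.out.println(lines + " lines, " + nuls + " NUL bytes");
> > > > >>     }
> > > > >> }
> > > > >> -------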
> > > > >>
> > > > >> I don't know how to debug hadoop/pig quite as well, though I'm
> > > > >> trying now.  But, my working theory is that some combination of
> > > > >> pig/hadoop aborts processing the gz stream on a null character,
> > > > >> but keeps chugging on a non-gz stream.  Does that sound familiar?
> > > > >>
> > > > >> will
> > > > >>
> > > > >>
> > > > >> > On Sat, Jun 8, 2013 at 8:00 AM, William Oberman
> > > > >> > <[email protected]> wrote:
> > > > >>
> > > > >> > They are all *.gz, I confirmed that first :-)
> > > > >> >
> > > > >> >
> > > > >> > On Saturday, June 8, 2013, Niels Basjes wrote:
> > > > >> >
> > > > >> >> What are the exact filenames you used?
> > > > >> >> The decompression of input files is based on the filename
> > > > >> >> extension.
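> > > > >> >>
> > > > >> >> You can check which codec (if any) Hadoop will pick for a given
> > > > >> >> file name with a little test like this (a sketch; the factory
> > > > >> >> keys purely off the suffix):
> > > > >> >> -------
> > > > >> >> import org.apache.hadoop.conf.Configuration;
> > > > >> >> import org.apache.hadoop.fs.Path;
> > > > >> >> import org.apache.hadoop.io.compress.CompressionCodec;
> > > > >> >> import org.apache.hadoop.io.compress.CompressionCodecFactory;
> > > > >> >>
> > > > >> >> public class CodecCheck {
> > > > >> >>     public static void main(String[] args) {
> > > > >> >>         CompressionCodecFactory factory =
> > > > >> >>             new CompressionCodecFactory(new Configuration());
> > > > >> >>         // Chosen from the file name suffix (.gz, .bz2, ...);
> > > > >> >>         // null means the file is read as plain text.
> > > > >> >>         CompressionCodec codec =
> > > > >> >>             factory.getCodec(new Path(args[0]));
> > > > >> >>         System.out.println(codec == null
> > > > >> >>             ? "no codec (plain text)"
> > > > >> >>             : codec.getClass().getName());
> > > > >> >>     }
> > > > >> >> }
> > > > >> >> -------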
> > > > >> >>
> > > > >> >> Niels
> > > > >> >> On Jun 7, 2013 11:11 PM, "William Oberman"
> > > > >> >> <[email protected]> wrote:
> > > > >> >>
> > > > >> >> > I'm using pig 0.11.2.
> > > > >> >> >
> > > > >> >> > I had been processing ASCII files of json with schema:
> > > > >> >> > (key:chararray, columns:bag {column:tuple (timeUUID:chararray,
> > > > >> >> > value:chararray, timestamp:long)})
> > > > >> >> > For what it's worth, this is cassandra data, at a fairly low
> > > > >> >> > level.
> > > > >> >> >
> > > > >> >> > But, this was getting big, so I compressed it all with gzip
> > > > >> >> > (my "ETL" process is already chunking the data into 1GB parts,
> > > > >> >> > making the .gz files ~100MB).
> > > > >> >> >
> > > > >> >> > As a sanity check, I decided to do a quick check of pre/post,
> > > > >> >> > and the numbers aren't matching.  Then I've done a lot of
> > > > >> >> > messing around trying to figure out why, and I'm getting more
> > > > >> >> > and more puzzled.
> > > > >> >> >
> > > > >> >> > My "quick check" was to get an overall count.  It looked like
> > > > >> (assuming
> > > > >> >> A
> > > > >> >> > is a LOAD given the schema above):
> > > > >> >> > -------
> > > > >> >> > allGrp = GROUP A ALL;
> > > > >> >> > aCount = FOREACH allGrp GENERATE group, COUNT(A);
> > > > >> >> > DUMP aCount;
> > > > >> >> > -------
> > > > >> >> >
> > > > >> >> > Basically the original data returned a number GREATER than
> > > > >> >> > the compressed data number (not by a lot, but still...).
> > > > >> >> >
> > > > >> >> > Then I uncompressed all of the compressed files, and did a
> > > > >> >> > size check of original vs. uncompressed.  They were the same.
> > > > >> >> > Then I "quick checked" the uncompressed, and the count of that
> > > > >> >> > was == original!  So, the way in which pig processes the
> > > > >> >> > gzip'ed data is actually somehow different.
> > > > >> >> >
> > > > >> >> > Then I tried to see if there are nulls floating around, so I
> > > > >> >> > loaded "orig" and "comp" and tried to catch the "missing keys"
> > > > >> >> > with outer joins:
> > > > >> >> > -----------
> > > > >> >> > joined = JOIN orig BY key LEFT OUTER, comp BY key;
> > > > >> >> > filtered = FILTER joined BY (comp::key is null);
> > > > >> >> > -----------
> > > > >> >> > And filtered was empty!  I then tried the reverse (which
> > > > >> >> > makes no sense, I know, as this was the smaller set), and
> > > > >> >> > filtered is still empty!
> > > > >> >> >
> > > > >> >> > All of these loads are through a custom UDF that extends
> > > > >> >> > LoadFunc.  But, there isn't much to that UDF (and it's been in
> > > > >> >> > use for many months now).  Basically, the "raw" data is JSON
> > > > >> >> > (from cassandra's sstable2json program).  And I parse the json
> > > > >> >> > and turn it into the pig structure of the schema noted above.
> > > > >> >> >
> > > > >> >> > Does anything make sense here?
> > > > >> >> >
> > > > >> >> > Thanks!
> > > > >> >> >
> > > > >> >> > will