Dunno, I'm guessing it would work, since each file is a different mapper.
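For what it's worth, I'd expect a single load over a mixed bucket to look
something like this (a rough sketch: the bucket path and CustomJsonLoader are
made-up stand-ins for your own paths and LoadFunc, and it assumes the loader
sits on an InputFormat that handles compression, like TextInputFormat, which
picks the codec per file from the extension):

-------
-- placeholder path/loader; the codec is chosen per file, so *.gz and *.bz2 can coexist
A = LOAD 's3://my-bucket/backups/*' USING CustomJsonLoader()
    AS (key:chararray, columns:bag {column:tuple
        (timeUUID:chararray, value:chararray, timestamp:long)});
-------
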
On Mon, Jun 10, 2013 at 1:12 PM, William Oberman <[email protected]> wrote:

> I'm using gzip as I had a huge S3 bucket of uncompressed files, and
> s3distcp only supported {gz, lzo, snappy}.
>
> I haven't ever done this, but can I mix/match files? My backup processes
> add files to these buckets, so I could upload new files as *.bz2. But then
> I'd have some files as *.gz, and others as *.bz2.
>
> will
>
>
> On Mon, Jun 10, 2013 at 12:41 PM, Alan Crosswell <[email protected]> wrote:
>
> > Suggest that if you have a choice, you use bzip2 compression instead of
> > gzip, as bzip2 is block-based and Pig can split reading a large bzipped
> > file across multiple mappers, while gzip can't be split that way.
> >
> >
> > On Mon, Jun 10, 2013 at 12:06 PM, William Oberman <[email protected]> wrote:
> >
> > > I still don't fully understand (and am still debugging), but I have a
> > > "problem file" and a theory.
> > >
> > > The file has a "corrupt line": a huge block of null characters followed
> > > by a "\n" (other lines are json followed by "\n"). I'm thinking that's
> > > a problem with my cassandra -> s3 process, but it is out of scope for
> > > this thread.... I wrote scripts to examine the file directly, and if I
> > > stop counting at the weird line, I get the "gz" count. If I count all
> > > lines (e.g. don't fail at the corrupt line) I get the "uncompressed"
> > > count.
> > >
> > > I don't know how to debug hadoop/pig quite as well, though I'm trying
> > > now. But my working theory is that some combination of pig/hadoop
> > > aborts processing the gz stream on a null character, but keeps chugging
> > > on a non-gz stream. Does that sound familiar?
> > >
> > > will
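If it helps the debugging: one quick way to count the non-json lines in the
suspect file, run against the uncompressed copy so a gz abort can't hide
anything (the path is a placeholder, and it assumes every good line starts
with '{'):

-------
-- placeholder path; use the uncompressed copy of the problem file
raw = LOAD 's3://my-bucket/orig/part-0042' USING TextLoader() AS (line:chararray);
-- keep only lines that do NOT look like json objects (e.g. the block of nulls)
junk = FILTER raw BY NOT (line MATCHES '\\{.*');
junkGrp = GROUP junk ALL;
junkCount = FOREACH junkGrp GENERATE COUNT(junk);
DUMP junkCount;
-------
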
> > > On Sat, Jun 8, 2013 at 8:00 AM, William Oberman <[email protected]> wrote:
> > >
> > > > They are all *.gz, I confirmed that first :-)
> > > >
> > > >
> > > > On Saturday, June 8, 2013, Niels Basjes wrote:
> > > >
> > > >> What are the exact filenames you used?
> > > >> The decompression of input files is based on the filename extension.
> > > >>
> > > >> Niels
> > > >>
> > > >> On Jun 7, 2013 11:11 PM, "William Oberman" <[email protected]> wrote:
> > > >>
> > > >> > I'm using pig 0.11.2.
> > > >> >
> > > >> > I had been processing ASCII files of json with schema:
> > > >> > (key:chararray, columns:bag {column:tuple (timeUUID:chararray,
> > > >> > value:chararray, timestamp:long)})
> > > >> > For what it's worth, this is cassandra data, at a fairly low level.
> > > >> >
> > > >> > But this was getting big, so I compressed it all with gzip (my
> > > >> > "ETL" process is already chunking the data into 1GB parts, making
> > > >> > the .gz files ~100MB).
> > > >> >
> > > >> > As a sanity check, I decided to do a quick check of pre/post, and
> > > >> > the numbers aren't matching. Since then I've done a lot of messing
> > > >> > around trying to figure out why, and I'm getting more and more
> > > >> > puzzled.
> > > >> >
> > > >> > My "quick check" was to get an overall count. It looked like this
> > > >> > (assuming A is a LOAD given the schema above):
> > > >> > -------
> > > >> > allGrp = GROUP A ALL;
> > > >> > aCount = FOREACH allGrp GENERATE group, COUNT(A);
> > > >> > DUMP aCount;
> > > >> > -------
> > > >> >
> > > >> > Basically, the original data returned a number GREATER than the
> > > >> > compressed data number (not by a lot, but still...).
> > > >> >
> > > >> > Then I uncompressed all of the compressed files, and did a size
> > > >> > check of original vs. uncompressed. They were the same. Then I
> > > >> > "quick checked" the uncompressed, and the count of that was ==
> > > >> > original! So, the way in which pig processes the gzip'ed data is
> > > >> > actually somehow different.
> > > >> >
> > > >> > Then I tried to see if there are nulls floating around, so I loaded
> > > >> > "orig" and "comp" and tried to catch the "missing keys" with outer
> > > >> > joins:
> > > >> > -----------
> > > >> > joined = JOIN orig BY key LEFT OUTER, comp BY key;
> > > >> > filtered = FILTER joined BY (comp::key is null);
> > > >> > -----------
> > > >> > And filtered was empty! I then tried the reverse (which makes no
> > > >> > sense, I know, as this was the smaller set), and filtered is still
> > > >> > empty!
> > > >> >
> > > >> > All of these loads are through a custom UDF that extends LoadFunc.
> > > >> > But there isn't much to that UDF (and it's been in use for many
> > > >> > months now). Basically, the "raw" data is JSON (from cassandra's
> > > >> > sstable2json program). And I parse the json and turn it into the
> > > >> > pig structure of the schema noted above.
> > > >> >
> > > >> > Does anything make sense here?
> > > >> >
> > > >> > Thanks!
> > > >> >
> > > >> > will
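
One more debugging idea: tag each record with its source file, so you can see
exactly which .gz part is dropping records. A rough sketch with placeholder
paths, reading the raw lines with PigStorage purely for counting ('-tagsource'
is available in pig 0.11 and prepends the input file name; tabs inside a json
line don't matter here since we only count records):

-------
-- placeholder path; '-tagsource' adds the source file name as the first field
lines = LOAD 's3://my-bucket/comp/*.gz' USING PigStorage('\t', '-tagsource')
    AS (file:chararray, line:chararray);
byFile = GROUP lines BY file;
perFile = FOREACH byFile GENERATE group AS file, COUNT(lines) AS n;
DUMP perFile;
-------

Running the same script over the uncompressed copies and diffing the two
outputs should point straight at the bad part.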
