Dunno, I'm guessing it would work, since each file is a different mapper.
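For what it's worth, I'd expect a single load over a mixed bucket to look
something like this (a rough sketch: the bucket path and CustomJsonLoader are
made-up stand-ins for your own paths and LoadFunc, and it assumes the loader
sits on an InputFormat that handles compression, like TextInputFormat, which
picks the codec per file from the extension):

-------
-- placeholder path/loader; the codec is chosen per file, so *.gz and *.bz2 can coexist
A = LOAD 's3://my-bucket/backups/*' USING CustomJsonLoader()
    AS (key:chararray, columns:bag {column:tuple
        (timeUUID:chararray, value:chararray, timestamp:long)});
-------
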
On Mon, Jun 10, 2013 at 1:12 PM, William Oberman <[email protected]> wrote:

> I'm using gzip as I had a huge S3 bucket of uncompressed files, and
> s3distcp only supported {gz, lzo, snappy}.
>
> I haven't ever done this, but can I mix/match files? My backup processes
> add files to these buckets, so I could upload new files as *.bz2. But then
> I'd have some files as *.gz, and others as *.bz2.
>
> will
>
>
> On Mon, Jun 10, 2013 at 12:41 PM, Alan Crosswell <[email protected]> wrote:
>
> > Suggest that if you have a choice, you use bzip2 compression instead of
> > gzip, as bzip2 is block-based and Pig can split reading a large bzipped
> > file across multiple mappers, while gzip can't be split that way.
> >
> >
> > On Mon, Jun 10, 2013 at 12:06 PM, William Oberman <[email protected]> wrote:
> >
> > > I still don't fully understand (and am still debugging), but I have a
> > > "problem file" and a theory.
> > >
> > > The file has a "corrupt line": a huge block of null characters followed
> > > by a "\n" (other lines are json followed by "\n"). I'm thinking that's
> > > a problem with my cassandra -> s3 process, but it is out of scope for
> > > this thread.... I wrote scripts to examine the file directly, and if I
> > > stop counting at the weird line, I get the "gz" count. If I count all
> > > lines (e.g. don't fail at the corrupt line) I get the "uncompressed"
> > > count.
> > >
> > > I don't know how to debug hadoop/pig quite as well, though I'm trying
> > > now. But my working theory is that some combination of pig/hadoop
> > > aborts processing the gz stream on a null character, but keeps chugging
> > > on a non-gz stream. Does that sound familiar?
> > >
> > > will
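If it helps the debugging: one quick way to count the non-json lines in the
suspect file, run against the uncompressed copy so a gz abort can't hide
anything (the path is a placeholder, and it assumes every good line starts
with '{'):

-------
-- placeholder path; use the uncompressed copy of the problem file
raw = LOAD 's3://my-bucket/orig/part-0042' USING TextLoader() AS (line:chararray);
-- keep only lines that do NOT look like json objects (e.g. the block of nulls)
junk = FILTER raw BY NOT (line MATCHES '\\{.*');
junkGrp = GROUP junk ALL;
junkCount = FOREACH junkGrp GENERATE COUNT(junk);
DUMP junkCount;
-------
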
> > > On Sat, Jun 8, 2013 at 8:00 AM, William Oberman <[email protected]> wrote:
> > >
> > > > They are all *.gz, I confirmed that first :-)
> > > >
> > > >
> > > > On Saturday, June 8, 2013, Niels Basjes wrote:
> > > >
> > > >> What are the exact filenames you used?
> > > >> The decompression of input files is based on the filename extension.
> > > >>
> > > >> Niels
> > > >>
> > > >> On Jun 7, 2013 11:11 PM, "William Oberman" <[email protected]> wrote:
> > > >>
> > > >> > I'm using pig 0.11.2.
> > > >> >
> > > >> > I had been processing ASCII files of json with schema:
> > > >> > (key:chararray, columns:bag {column:tuple (timeUUID:chararray,
> > > >> > value:chararray, timestamp:long)})
> > > >> > For what it's worth, this is cassandra data, at a fairly low level.
> > > >> >
> > > >> > But this was getting big, so I compressed it all with gzip (my
> > > >> > "ETL" process is already chunking the data into 1GB parts, making
> > > >> > the .gz files ~100MB).
> > > >> >
> > > >> > As a sanity check, I decided to do a quick check of pre/post, and
> > > >> > the numbers aren't matching. Since then I've done a lot of messing
> > > >> > around trying to figure out why, and I'm getting more and more
> > > >> > puzzled.
> > > >> >
> > > >> > My "quick check" was to get an overall count. It looked like this
> > > >> > (assuming A is a LOAD given the schema above):
> > > >> > -------
> > > >> > allGrp = GROUP A ALL;
> > > >> > aCount = FOREACH allGrp GENERATE group, COUNT(A);
> > > >> > DUMP aCount;
> > > >> > -------
> > > >> >
> > > >> > Basically, the original data returned a number GREATER than the
> > > >> > compressed data number (not by a lot, but still...).
> > > >> >
> > > >> > Then I uncompressed all of the compressed files, and did a size
> > > >> > check of original vs. uncompressed. They were the same. Then I
> > > >> > "quick checked" the uncompressed, and the count of that was ==
> > > >> > original! So, the way in which pig processes the gzip'ed data is
> > > >> > actually somehow different.
> > > >> >
> > > >> > Then I tried to see if there are nulls floating around, so I loaded
> > > >> > "orig" and "comp" and tried to catch the "missing keys" with outer
> > > >> > joins:
> > > >> > -----------
> > > >> > joined = JOIN orig BY key LEFT OUTER, comp BY key;
> > > >> > filtered = FILTER joined BY (comp::key is null);
> > > >> > -----------
> > > >> > And filtered was empty! I then tried the reverse (which makes no
> > > >> > sense, I know, as this was the smaller set), and filtered is still
> > > >> > empty!
> > > >> >
> > > >> > All of these loads are through a custom UDF that extends LoadFunc.
> > > >> > But there isn't much to that UDF (and it's been in use for many
> > > >> > months now). Basically, the "raw" data is JSON (from cassandra's
> > > >> > sstable2json program). And I parse the json and turn it into the
> > > >> > pig structure of the schema noted above.
> > > >> >
> > > >> > Does anything make sense here?
> > > >> >
> > > >> > Thanks!
> > > >> >
> > > >> > will
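
One more debugging idea: tag each record with its source file, so you can see
exactly which .gz part is dropping records. A rough sketch with placeholder
paths, reading the raw lines with PigStorage purely for counting ('-tagsource'
is available in pig 0.11 and prepends the input file name; tabs inside a json
line don't matter here since we only count records):

-------
-- placeholder path; '-tagsource' adds the source file name as the first field
lines = LOAD 's3://my-bucket/comp/*.gz' USING PigStorage('\t', '-tagsource')
    AS (file:chararray, line:chararray);
byFile = GROUP lines BY file;
perFile = FOREACH byFile GENERATE group AS file, COUNT(lines) AS n;
DUMP perFile;
-------

Running the same script over the uncompressed copies and diffing the two
outputs should point straight at the bad part.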
