I still don't fully understand (and am still debugging), but I have a "problem file" and a theory.
The file has a "corrupt line" that is a huge block of null characters
followed by a "\n" (other lines are json followed by "\n"). I'm thinking
that's a problem with my cassandra -> s3 process, but it's out of scope for
this thread....

I wrote scripts to examine the file directly, and if I stop counting at the
weird line, I get the "gz" count. If I count all lines (i.e. don't stop at
the corrupt line) I get the "uncompressed" count.

I don't know how to debug hadoop/pig quite as well, though I'm trying now.
But my working theory is that some combination of pig/hadoop aborts
processing the gz stream on a null character, but keeps chugging on a
non-gz stream.

Does that sound familiar?

will


On Sat, Jun 8, 2013 at 8:00 AM, William Oberman <[email protected]> wrote:

> They are all *.gz, I confirmed that first :-)
>
>
> On Saturday, June 8, 2013, Niels Basjes wrote:
>
>> What are the exact filenames you used?
>> The decompression of input files is based on the filename extension.
>>
>> Niels
>>
>> On Jun 7, 2013 11:11 PM, "William Oberman" <[email protected]> wrote:
>>
>> > I'm using pig 0.11.2.
>> >
>> > I had been processing ASCII files of json with this schema:
>> > (key:chararray, columns:bag {column:tuple (timeUUID:chararray,
>> > value:chararray, timestamp:long)})
>> > For what it's worth, this is cassandra data, at a fairly low level.
>> >
>> > But this was getting big, so I compressed it all with gzip (my "ETL"
>> > process is already chunking the data into 1GB parts, making the .gz
>> > files ~100MB).
>> >
>> > As a sanity check, I decided to do a quick check of pre/post, and the
>> > numbers aren't matching. Since then I've done a lot of messing around
>> > trying to figure out why, and I'm getting more and more puzzled.
>> >
>> > My "quick check" was to get an overall count. It looked like this
>> > (assuming A is a LOAD with the schema above):
>> > -------
>> > allGrp = GROUP A ALL;
>> > aCount = FOREACH allGrp GENERATE group, COUNT(A);
>> > DUMP aCount;
>> > -------
>> >
>> > Basically, the original data returned a number GREATER than the
>> > compressed data's number (not by a lot, but still...).
>> >
>> > Then I uncompressed all of the compressed files and did a size check of
>> > original vs. uncompressed. They were the same. Then I "quick checked"
>> > the uncompressed files, and that count was == the original! So, the way
>> > in which pig processes the gzip'ed data is actually somehow different.
>> >
>> > Then I tried to see if there are nulls floating around, so I loaded
>> > "orig" and "comp" and tried to catch the "missing keys" with outer joins:
>> > -----------
>> > joined = JOIN orig BY key LEFT OUTER, comp BY key;
>> > filtered = FILTER joined BY (comp::key is null);
>> > -----------
>> > And filtered was empty! I then tried the reverse (which makes no sense,
>> > I know, as this was the smaller set), and filtered was still empty!
>> >
>> > All of these loads go through a custom UDF that extends LoadFunc. But
>> > there isn't much to that UDF (and it's been in use for many months now).
>> > Basically, the "raw" data is JSON (from cassandra's sstable2json
>> > program), and I parse the json and turn it into the pig structure of the
>> > schema noted above.
>> >
>> > Does anything make sense here?
>> >
>> > Thanks!
>> >
>> > will
>> >
>>
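(For reference, a minimal sketch of the direct-count check described at the
top of this mail, assuming Python 3; the part filenames are placeholders,
not the real names from the S3 export, and this is not the exact script I
ran.)

-------
import gzip

# Placeholder filenames -- the real files are the ~1GB parts from the
# cassandra -> s3 export, gzip'ed to ~100MB each.
GZ_PATH = "part-00000.gz"
PLAIN_PATH = "part-00000"

def count_lines(fileobj, stop_at_null_block=False):
    """Count newline-terminated lines; optionally stop at the first line
    that is nothing but null characters (the suspected corrupt line)."""
    count = 0
    for line in fileobj:
        body = line.rstrip(b"\n")
        if stop_at_null_block and body and all(b == 0 for b in body):
            break
        count += 1
    return count

# Count the gzip'ed copy, stopping at the null block (matches my "gz" count),
# then count every line of the uncompressed copy (matches "uncompressed").
with gzip.open(GZ_PATH, "rb") as f:
    print("gz, stop at null block:", count_lines(f, stop_at_null_block=True))

with open(PLAIN_PATH, "rb") as f:
    print("plain, count everything:", count_lines(f))
-------

If the first number stops short of the second at the same line, that's the
mismatch I'm describing above.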
