Hi everyone. I'm having a problem loading log files based on parameter input 
and was wondering whether someone would be able to provide some guidance. The 
logs in question are Omniture logs, stored in subdirectories based on year, 
month, and day (e.g. /year=2013/month=02/day=14). For any day, multiple logs 
could exist, each hundreds of MB.

I have a Pig script which currently processes logs for an entire month, with 
the month and the year specified as script parameters (e.g. 
/year=$year/month=$month/day=*). It works fine and we're quite happy with it. 
That said, we want to switch to weekly processing of logs, which means the 
previous LOAD path glob won't work (a week can span a month or year boundary). 
To solve this, I have a Python UDF which takes a start date and returns the 
necessary glob for a week's worth of logs, e.g.:

                >>> log_path_regex(2013, 1, 28)
                '{year=2013/month=01/day=28,year=2013/month=01/day=29,year=2013/month=01/day=30,year=2013/month=01/day=31,year=2013/month=02/day=01,year=2013/month=02/day=02,year=2013/month=02/day=03}'
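
For context, a minimal sketch of a function with this behaviour is below. It is not the actual UDF, only an illustration that reproduces the output above; the `days` parameter is an assumption added for the example.

```python
from datetime import date, timedelta


def log_path_regex(year, month, day, days=7):
    """Return a brace glob covering `days` consecutive days from the start date.

    `days` is a hypothetical extra parameter; the default gives one week.
    """
    start = date(year, month, day)
    parts = []
    for offset in range(days):
        # date arithmetic handles month and year rollover for us
        d = start + timedelta(days=offset)
        parts.append("year=%04d/month=%02d/day=%02d" % (d.year, d.month, d.day))
    return "{" + ",".join(parts) + "}"
```

Because the dates come from `datetime`, a week starting late in December rolls into January of the following year without any special casing.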

This glob will then be inserted in the appropriate path:

                > %declare omniture_log_path 's3://foo/bar/$week_path/*.tsv.gz';
                > data = LOAD '$omniture_log_path' USING OmnitureTextLoader(); 
// See http://github.com/msukmanowsky/OmnitureTextLoader

Unfortunately, I can't for the life of me figure out how to populate $week_path 
based on $year, $month and $day script parameters. I tried using %declare, but 
Grunt complains and says it's logging errors to a file that never gets created:

> %declare week_path util.log_path_regex(year, month, day);
2013-02-14 16:54:02,648 [main] INFO  org.apache.pig.Main - Apache Pig version 0.10.1 (r1426677) compiled Dec 28 2012, 16:46:13
2013-02-14 16:54:02,648 [main] INFO  org.apache.pig.Main - Logging error messages to: /tmp/pig_1360878842643.log
% ls  /tmp/pig_1360878842643.log
ls: cannot access /tmp/pig_1360878842643.log: No such file or directory

The same error results if I prefix the parameters with dollar signs or surround 
prefixed parameters with quotes.

If I try to use define (which I believe only works for static Java functions), 
I get the following:

                > define week_path util.log_path_regex(year, month, day);
                2013-02-14 17:00:42,392 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <file script.pig, line 11, column 37>  mismatched input 'year' expecting RIGHT_PAREN

As with %declare, I get the same error if I prefix the parameters with dollar 
signs or surround prefixed parameters with quotes.

I've searched around and haven't come up with a solution. I'm possibly 
searching for the wrong thing. Invoking a shell command may work, but would be 
difficult as it would complicate our script deploy and may not be feasible 
given we're retrieving logs from S3 and not a mounted directory.
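
A rough sketch of what that shell-level route could look like, assuming a small wrapper computes the glob first and hands it to Pig through the standard -param flag. The function name and the default script name are illustrative only, and this is untested against a real deploy:

```python
def build_pig_command(week_path, script="script.pig"):
    """Build an argv list that passes week_path to Pig as a script parameter.

    Hypothetical helper: `week_path` is the brace glob for the week and
    `script` is the Pig script to run.
    """
    # An argv list (rather than one shell string) keeps the shell from
    # brace-expanding the {a,b,...} glob before Pig sees it.
    return ["pig", "-param", "week_path=" + week_path, "-f", script]


if __name__ == "__main__":
    example_glob = "{year=2013/month=01/day=28,year=2013/month=01/day=29}"
    print(" ".join(build_pig_command(example_glob)))
```

The resulting list could be handed to `subprocess.check_call`, which is where the extra deploy step would come in.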

It's also likely there's a nice Pig-friendly way to restrict LOAD other than 
using globs. That said, I'd still have to use my UDF which seems to be the root 
of the issue.

Do I need to convert my UDF to a static Java method? Or will I run into the 
same issue? (I hesitate to do this on the off-chance it will work. It's an 
8-line Python function, readily deployable and much more maintainable by others 
than the equivalent Java code would be.)

Any ideas?

Cheers,
Ian.
