Hi everyone. I'm having a problem loading log files based on parameter input
and was wondering whether someone would be able to provide some guidance. The
logs in question are Omniture logs, stored in subdirectories based on year,
month, and day (e.g. /year=2013/month=02/day=14). For any day, multiple logs
could exist, each hundreds of MB.
I have a Pig script which currently processes logs for an entire month, with
the month and the year specified as script parameters (e.g.
/year=$year/month=$month/day=*). It works fine and we're quite happy with it.
That said, we want to switch to weekly processing of logs, which means the
previous LOAD path glob won't work (a week can span two months or even two
years). To solve this, I have a Python UDF which takes a start date and returns
the glob needed to match a week's worth of logs, e.g.:
>>> log_path_regex(2013, 1, 28)
'{year=2013/month=01/day=28,year=2013/month=01/day=29,year=2013/month=01/day=30,year=2013/month=01/day=31,year=2013/month=02/day=01,year=2013/month=02/day=02,year=2013/month=02/day=03}'
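For reference, the function behaves roughly like the sketch below. This is a simplified reconstruction that assumes the week is always seven consecutive days starting at the given date; only the name and the output format come from the example above, and any Pig-specific decoration (such as an output schema) is left out.

```python
from datetime import date, timedelta


def log_path_regex(year, month, day):
    """Return a brace glob covering 7 consecutive days from the start date."""
    start = date(year, month, day)
    # date arithmetic handles month and year boundaries for us
    days = [start + timedelta(days=offset) for offset in range(7)]
    parts = ["year=%04d/month=%02d/day=%02d" % (d.year, d.month, d.day)
             for d in days]
    return "{" + ",".join(parts) + "}"
```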
This glob will then be inserted in the appropriate path:
> %declare omniture_log_path 's3://foo/bar/$week_path/*.tsv.gz';
> data = LOAD '$omniture_log_path' USING OmnitureTextLoader();
// See http://github.com/msukmanowsky/OmnitureTextLoader
Unfortunately, I can't for the life of me figure out how to populate $week_path
based on the $year, $month and $day script parameters. I tried using %declare,
but Grunt complains: it says it's logging error messages to a file, but the
file is never created:
> %declare week_path util.log_path_regex(year, month, day);
2013-02-14 16:54:02,648 [main] INFO org.apache.pig.Main - Apache Pig version
0.10.1 (r1426677) compiled Dec 28 2012, 16:46:13
2013-02-14 16:54:02,648 [main] INFO org.apache.pig.Main - Logging error
messages to: /tmp/pig_1360878842643.log
% ls /tmp/pig_1360878842643.log
ls: cannot access /tmp/pig_1360878842643.log: No such file or directory
The same error results if I prefix the parameters with dollar signs or surround
prefixed parameters with quotes.
If I try to use define (which I believe only works for static Java functions),
I get the following:
> define week_path util.log_path_regex(year, month, day);
2013-02-14 17:00:42,392 [main] ERROR
org.apache.pig.tools.grunt.Grunt - ERROR 1200: <file script.pig, line 11,
column 37> mismatched input 'year' expecting RIGHT_PAREN
As with %declare, I get the same error if I prefix the parameters with dollar
signs or surround prefixed parameters with quotes.
I've searched around and haven't come up with a solution. I'm possibly
searching for the wrong thing. Invoking a shell command may work, but would be
difficult as it would complicate our script deploy and may not be feasible
given we're retrieving logs from S3 and not a mounted directory.
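To make the shell-command idea concrete, here is a minimal sketch of what I have in mind: a small Python wrapper that computes the glob outside of Pig and hands it over as a parameter on the command line. The helper names (week_glob, build_pig_command) and the script name script.pig are placeholders, and I'm assuming the glob survives Pig's parameter substitution unchanged, which I haven't verified.

```python
from datetime import date, timedelta


def week_glob(year, month, day):
    """Brace glob for 7 consecutive days (same output format as the UDF)."""
    start = date(year, month, day)
    days = [start + timedelta(days=offset) for offset in range(7)]
    return "{" + ",".join(
        "year=%04d/month=%02d/day=%02d" % (d.year, d.month, d.day)
        for d in days) + "}"


def build_pig_command(year, month, day, script="script.pig"):
    """Build the argument list for running Pig with week_path pre-computed."""
    # Passed as a list (no shell), so the braces and commas need no quoting.
    return ["pig", "-param", "week_path=" + week_glob(year, month, day),
            "-f", script]
```

The resulting list could be passed to subprocess.check_call, and the script would then only need to reference $week_path. It does add a wrapper to deploy, which is the complication I'd like to avoid.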
It's also likely there's a nice Pig-friendly way to restrict LOAD other than
using globs. That said, I'd still have to use my UDF which seems to be the root
of the issue.
Do I need to convert my UDF to a static Java method? Or will I run into the
same issue? (I hesitate to do this merely on the off-chance that it will work.
It's an 8-line Python function, readily deployable and much more maintainable
by others than the equivalent Java code would be.)
Any ideas?
Cheers,
Ian.