Thanks for this, Cheolsoo. I started work on a LoadFunc assuming it was an easy win, although I hate that I have to do this. It's just text concatenation after all. Moving the logic of our log path structure to a Java class or an external package is wrong from a maintenance standpoint.

How familiar are you (or anyone) with creating a custom LoadFunc? The documentation I've found is sparse. Is there a method I can override that lets me accept or reject input on a file-by-file basis? Our Omniture logs have a date stamp in the filename, and it would be more maintainable to reject a file based on its basename rather than its path. We're more likely to change our paths than our filenames, so the code would have a better chance of standing the test of time.

Cheers,
Ian.

-----Original Message-----
From: Cheolsoo Park [mailto:[email protected]]
Sent: February-15-13 4:53 PM
To: [email protected]
Subject: Re: Restricting loading of log files based on parameter input

Hi Ian,

1) Pre-processor statements are just text substitution, so you can't call a Python (or Java) function inside %declare.

2) Regarding DEFINE statements, there are two problems with using them with scripting UDFs:
- You can't pass constructor parameters to a scripting UDF.
- You can't use a scripting UDF as a Load/StoreFunc.

Given these constraints, I think writing a Java LoadFunc is the best option. I would write a subclass of OmnitureTextLoader that takes constructor parameters. For example,

    class MyOmnitureTextLoader extends OmnitureTextLoader {
        private String year;
        private String month;

        public MyOmnitureTextLoader() { ... }

        public MyOmnitureTextLoader(String year, String month) { ... }

        @Override
        public void setLocation(String location, Job job) throws IOException {
            // Compute week path with year and month and replace location with that.
        }
    }

Then, you can do something like this in Pig:

    DEFINE WEEK_PATH_LOADER MyOmnitureTextLoader('$year', '$month');
    A = LOAD 'replace_me_with_week_path' USING WEEK_PATH_LOADER;

Hope this is helpful.

Thanks,
Cheolsoo

On Thu, Feb 14, 2013 at 2:16 PM, Stevens, Ian <[email protected]> wrote:

> Hi everyone. I'm having a problem loading log files based on parameter input and was wondering whether someone would be able to provide some guidance.
> The logs in question are Omniture logs, stored in subdirectories based on year, month, and day (e.g. /year=2013/month=02/day=14). For any day, multiple logs could exist, each hundreds of MB.
>
> I have a Pig script which currently processes logs for an entire month, with the month and the year specified as script parameters (e.g. /year=$year/month=$month/day=*). It works fine and we're quite happy with it. That said, we want to switch to weekly processing of logs, which means the previous LOAD path glob won't work (weeks can span months as well as years). To solve this, I have a Python UDF which takes a start date and spits out the necessary glob for a week's worth of logs, e.g.:
>
> >>> log_path_regex(2013, 1, 28)
> '{year=2013/month=01/day=28,year=2013/month=01/day=29,year=2013/month=01/day=30,year=2013/month=01/day=31,year=2013/month=02/day=01,year=2013/month=02/day=02,year=2013/month=02/day=03}'
>
> This glob will then be inserted in the appropriate path:
>
> > %declare omniture_log_path 's3://foo/bar/$week_path/*.tsv.gz';
>
> data = LOAD '$omniture_log_path' USING OmnitureTextLoader(); -- See http://github.com/msukmanowsky/OmnitureTextLoader
>
> Unfortunately, I can't for the life of me figure out how to populate $week_path based on the $year, $month and $day script parameters. I tried using %declare, but grunt complains: it says it's logging error messages but never does:
>
> > %declare week_path util.log_path_regex(year, month, day);
> 2013-02-14 16:54:02,648 [main] INFO org.apache.pig.Main - Apache Pig version 0.10.1 (r1426677) compiled Dec 28 2012, 16:46:13
> 2013-02-14 16:54:02,648 [main] INFO org.apache.pig.Main - Logging error messages to: /tmp/pig_1360878842643.log
> % ls /tmp/pig_1360878842643.log
> ls: cannot access /tmp/pig_1360878842643.log: No such file or directory
>
> The same error results if I prefix the parameters with dollar signs or surround prefixed parameters with quotes.
>
> If I try to use define (which I believe only works for static Java functions), I get the following:
>
> > define week_path util.log_path_regex(year, month, day);
> 2013-02-14 17:00:42,392 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <file script.pig, line 11, column 37> mismatched input 'year' expecting RIGHT_PAREN
>
> As with %declare, I get the same error if I prefix the parameters with dollar signs or surround prefixed parameters with quotes.
>
> I've searched around and haven't come up with a solution. I'm possibly searching for the wrong thing. Invoking a shell command may work, but would be difficult as it would complicate our script deploy and may not be feasible given we're retrieving logs from S3 and not a mounted directory.
>
> It's also likely there's a nice Pig-friendly way to restrict LOAD other than using globs. That said, I'd still have to use my UDF which seems to be the root of the issue.
>
> Do I need to convert my UDF to a static Java method? Or will I run into the same issue? (I hesitate to do this on the off-chance it will work. It's an 8-line Python function, readily deployable and much more maintainable by others than the equivalent Java code would be.)
>
> Any ideas?
>
> Cheers,
> Ian.
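
For reference, the week glob described in the original message could be produced by something like the sketch below. This is only a guess at what such a function might look like, not the actual UDF from the thread: the name week_path_glob and the days parameter are hypothetical, and the year=/month=/day= layout is taken from the sample output quoted above.

```python
from datetime import date, timedelta


def week_path_glob(year, month, day, days=7):
    """Return a brace glob covering `days` consecutive days from the start date.

    Hypothetical helper: the directory layout (year=YYYY/month=MM/day=DD)
    is assumed from the sample output in the thread.
    """
    start = date(year, month, day)
    parts = []
    for offset in range(days):
        # date arithmetic handles month and year boundaries for us
        d = start + timedelta(days=offset)
        parts.append("year=%04d/month=%02d/day=%02d" % (d.year, d.month, d.day))
    return "{" + ",".join(parts) + "}"


if __name__ == "__main__":
    print(week_path_glob(2013, 1, 28))
```

Because parameter substitution is plain text replacement, one option is to compute this string outside the script and hand it over with Pig's -param option (for example -param week_path=...), so that $week_path is already filled in by the time the LOAD statement is parsed.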
