Thanks for this, Cheolsoo. I started work on a LoadFunc assuming it would be an
easy win, although I hate that I have to do this. It's just text concatenation,
after all. Moving the logic of our log path structure into a Java class or an
external package is wrong from a maintenance standpoint.

How familiar are you (or anyone) with creating a custom LoadFunc? The
documentation I've found is sparse. Is there a method I can override that
decides, on a file-by-file basis, whether a file gets read? Our Omniture logs
have a date stamp in the filename, and it would be more maintainable to reject
a file based on its basename rather than its path. We're more likely to change
our paths than our filenames, so the code would have a better chance of
standing the test of time.
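
A minimal sketch of the basename check described above, in plain Java with no
Pig or Hadoop dependency. The class name BasenameDateFilter, the yyyy-MM-dd
date stamp format, and the example filenames are all assumptions for
illustration; the thread does not say what the real Omniture filenames look
like. It also assumes Java 8 or later for java.time.

```java
import java.time.DateTimeException;
import java.time.LocalDate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: accepts a file only if the date stamp in its
// basename falls inside a window of consecutive days.
final class BasenameDateFilter {
    // ASSUMPTION: the basename contains a yyyy-MM-dd date stamp.
    private static final Pattern DATE_STAMP =
        Pattern.compile("(\\d{4})-(\\d{2})-(\\d{2})");

    private final LocalDate start;        // first accepted day (inclusive)
    private final LocalDate endExclusive; // first rejected day after the window

    BasenameDateFilter(LocalDate start, int days) {
        this.start = start;
        this.endExclusive = start.plusDays(days);
    }

    boolean accept(String path) {
        // Look only at the basename, so directory layout changes don't matter.
        String basename = path.substring(path.lastIndexOf('/') + 1);
        Matcher m = DATE_STAMP.matcher(basename);
        if (!m.find()) {
            return false; // no recognisable date stamp: reject
        }
        LocalDate stamp;
        try {
            stamp = LocalDate.of(Integer.parseInt(m.group(1)),
                                 Integer.parseInt(m.group(2)),
                                 Integer.parseInt(m.group(3)));
        } catch (DateTimeException e) {
            return false; // digits matched but are not a real calendar date
        }
        return !stamp.isBefore(start) && stamp.isBefore(endExclusive);
    }
}
```

In a real loader, a check like this would most likely be wired in through
Hadoop's org.apache.hadoop.fs.PathFilter interface; whether that fits
OmnitureTextLoader's input format is something to verify rather than assume.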

Cheers,
Ian.

-----Original Message-----
From: Cheolsoo Park [mailto:[email protected]] 
Sent: February-15-13 4:53 PM
To: [email protected]
Subject: Re: Restricting loading of log files based on parameter input

Hi Ian,

1) Pre-processor statements are just text substitution, so you can't call a 
Python (or Java) function inside %declare.

2) Regarding DEFINE statements, there are two problems with using them for
scripting UDFs:
- You can't pass constructor parameters to a scripting UDF.
- You can't use a scripting UDF as a Load/StoreFunc.

Given these constraints, I think writing a Java LoadFunc is the best option.
I would write a subclass of OmnitureTextLoader that takes constructor
parameters. For example:

import java.io.IOException;
import org.apache.hadoop.mapreduce.Job;

public class MyOmnitureTextLoader extends OmnitureTextLoader {

  private String year;
  private String month;

  public MyOmnitureTextLoader() { ... }
  public MyOmnitureTextLoader(String year, String month) { ... }

  @Override
  public void setLocation(String location, Job job) throws IOException {
    // Compute the week path from year and month, substitute it into
    // location, then pass the result on to super.setLocation().
  }
}

Then you can do something like this in Pig (constructor arguments must be
quoted string literals):

DEFINE WEEK_PATH_LOADER MyOmnitureTextLoader('$year', '$month');

A = LOAD 'replace_me_with_week_path' USING WEEK_PATH_LOADER;
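
The "compute week path" step in the setLocation sketch above could look
roughly like the following. This is a sketch, not part of OmnitureTextLoader:
the class name WeekPath and the method weekGlob are hypothetical, it assumes
Java 8 or later for java.time, and it takes a start day as well, since a week
cannot be derived from year and month alone. The year=/month=/day= layout is
the one from the original question.

```java
import java.time.LocalDate;
import java.util.Locale;
import java.util.StringJoiner;

// Hypothetical helper: builds the {a,b,c} glob for a run of consecutive days.
final class WeekPath {
    private WeekPath() {}

    static String weekGlob(int year, int month, int day, int days) {
        LocalDate start = LocalDate.of(year, month, day);
        StringJoiner glob = new StringJoiner(",", "{", "}");
        for (int i = 0; i < days; i++) {
            // plusDays rolls over month and year boundaries for us.
            LocalDate d = start.plusDays(i);
            glob.add(String.format(Locale.ROOT,
                "year=%04d/month=%02d/day=%02d",
                d.getYear(), d.getMonthValue(), d.getDayOfMonth()));
        }
        return glob.toString();
    }
}
```

Calling WeekPath.weekGlob(2013, 1, 28, 7) is meant to produce the same glob as
the log_path_regex(2013, 1, 28) example quoted below. The loader would then
substitute that string into the location before passing it on.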

Hope this is helpful.

Thanks,
Cheolsoo




On Thu, Feb 14, 2013 at 2:16 PM, Stevens, Ian
<[email protected]> wrote:

> Hi everyone. I'm having a problem loading log files based on parameter
> input and was wondering whether someone could provide some guidance.
> The logs in question are Omniture logs, stored in subdirectories based
> on year, month, and day (e.g. /year=2013/month=02/day=14). For any day,
> multiple logs could exist, each hundreds of MB.
>
> I have a Pig script which currently processes logs for an entire
> month, with the month and the year specified as script parameters (e.g.
> /year=$year/month=$month/day=*). It works fine and we're quite happy
> with it. That said, we want to switch to weekly processing of logs,
> which means the previous LOAD path glob won't work (a week can span
> month boundaries as well as year boundaries). To solve this, I have a
> Python UDF which takes a start date and spits out the necessary glob
> for a week's worth of logs, e.g.:
>
>                 >>> log_path_regex(2013, 1, 28)
>
> '{year=2013/month=01/day=28,year=2013/month=01/day=29,year=2013/month=01/day=30,year=2013/month=01/day=31,year=2013/month=02/day=01,year=2013/month=02/day=02,year=2013/month=02/day=03}'
>
> This glob will then be inserted in the appropriate path:
>
>                 > %declare omniture_log_path 
> 's3://foo/bar/$week_path/*.tsv.gz';
>                 > data = LOAD '$omniture_log_path' USING 
> OmnitureTextLoader(); // See 
> http://github.com/msukmanowsky/OmnitureTextLoader
>
> Unfortunately, I can't for the life of me figure out how to populate
> $week_path based on the $year, $month and $day script parameters. I tried
> using %declare, but Grunt complains and says it's logging errors, yet the
> log file never appears:
>
> > %declare week_path util.log_path_regex(year, month, day);
> 2013-02-14 16:54:02,648 [main] INFO  org.apache.pig.Main - Apache Pig 
> version 0.10.1 (r1426677) compiled Dec 28 2012, 16:46:13
>
> 2013-02-14 16:54:02,648 [main] INFO  org.apache.pig.Main - Logging
> error messages to: /tmp/pig_1360878842643.log
> % ls /tmp/pig_1360878842643.log
> ls: cannot access /tmp/pig_1360878842643.log: No such file or
> directory
>
> The same error results if I prefix the parameters with dollar signs or 
> surround prefixed parameters with quotes.
>
> If I try to use define (which I believe only works for static Java 
> functions), I get the following:
>
>                 > define week_path util.log_path_regex(year, month, day);
>                 2013-02-14 17:00:42,392 [main] ERROR 
> org.apache.pig.tools.grunt.Grunt - ERROR 1200: <file script.pig, line 
> 11, column 37>  mismatched input 'year' expecting RIGHT_PAREN
>
> As with %declare, I get the same error if I prefix the parameters with 
> dollar signs or surround prefixed parameters with quotes.
>
> I've searched around and haven't come up with a solution. I'm possibly 
> searching for the wrong thing. Invoking a shell command may work, but 
> would be difficult as it would complicate our script deploy and may 
> not be feasible given we're retrieving logs from S3 and not a mounted 
> directory.
>
> It's also likely there's a nice Pig-friendly way to restrict LOAD 
> other than using globs. That said, I'd still have to use my UDF which 
> seems to be the root of the issue.
>
> Do I need to convert my UDF to a static Java method? Or will I run 
> into the same issue? (I hesitate to do this on the off-chance it will 
> work. It's an 8-line Python function, readily deployable and much more 
> maintainable by others than the equivalent Java code would be.)
>
> Any ideas?
>
> Cheers,
> Ian.
>
