Hi Ian,
Sorry for the late reply.
> Is there a method I can override which considers reading on a
> file-by-file basis? Our Omniture logs have a date stamp in the filename,
> and it would be more maintainable to reject a file based on its basename
> rather than its path. We're more likely to change our paths than change the
> filenames, so this would mean the code has a better chance of standing the
> test of time.
The location parameter in setLocation(String location, Job job) is just a
path glob, so you can replace a placeholder in it with a filename-based
pattern. For example, if you have the following in your Pig script,
A = LOAD '/foo/replace_me_with_regex' USING MyLoadFunc('2013', '1', '28');
You can do something like this:
@Override
public void setLocation(String location, Job job) throws IOException {
    // year, month and day are fields set from the constructor arguments.
    String regex = log_path_regex(year, month, day);
    // String.replace returns a new string, so the result must be kept.
    location = location.replace("replace_me_with_regex", regex);
    FileInputFormat.setInputPaths(job, location);
}

// This is a Java version of your function that returns a filename pattern.
private String log_path_regex(String y, String m, String d) {
    // compute regex and return it
}
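
A rough sketch of what the body of that helper might look like, assuming
Java 8's java.time is available and that the goal is the same
directory-style glob the Python function produces (its output is quoted
further down in this thread). The class name WeekGlob and the method name
logPathRegex are made up for illustration, and it takes ints rather than the
String arguments used above. A filename-based variant would depend on the
actual Omniture filename format, which is not shown here.

```java
import java.time.LocalDate;
import java.util.StringJoiner;

// Hypothetical helper; class and method names are illustrative only.
final class WeekGlob {
    private WeekGlob() {}

    // Returns a brace glob covering the seven days starting at the given date.
    static String logPathRegex(int year, int month, int day) {
        LocalDate start = LocalDate.of(year, month, day);
        StringJoiner glob = new StringJoiner(",", "{", "}");
        for (int i = 0; i < 7; i++) {
            // plusDays rolls over month and year boundaries automatically.
            LocalDate d = start.plusDays(i);
            glob.add(String.format("year=%04d/month=%02d/day=%02d",
                    d.getYear(), d.getMonthValue(), d.getDayOfMonth()));
        }
        return glob.toString();
    }
}
```

Under these assumptions, WeekGlob.logPathRegex(2013, 1, 28) gives the same
glob as the Python example, and a start date such as 2012-12-30 runs into
January 2013 without any special handling.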
Thanks,
Cheolsoo
On Tue, Feb 19, 2013 at 1:55 PM, Stevens, Ian
<[email protected]> wrote:
> Thanks for this Cheolsoo. I started work on a LoadFunc assuming it was an
> easy win, although I hate that I have to do this. It's just text
> concatenation after all. Moving the logic of our log path structure to a
> Java class or an external package is wrong from a maintenance standpoint.
>
> How familiar are you (or anyone) with creating a custom LoadFunc? The
> documentation I've found is sparse. Is there a method I can override which
> considers reading on a file-by-file basis? Our Omniture logs have a date
> stamp in the filename, and it would be more maintainable to reject a file
> based on its basename rather than its path. We're more likely to change our
> paths than change the filenames, so this would mean the code has a better
> chance of standing the test of time.
>
> Cheers,
> Ian.
>
> -----Original Message-----
> From: Cheolsoo Park [mailto:[email protected]]
> Sent: February-15-13 4:53 PM
> To: [email protected]
> Subject: Re: Restricting loading of log files based on parameter input
>
> Hi Ian,
>
> 1) Pre-processor statements are just text substitution, so you can't call
> a Python (or Java) function inside %declare.
>
> 2) Regarding DEFINE statements, there are two problems with using them
> for scripting UDFs:
> - You can't pass constructor parameters to a scripting UDF.
> - You can't use a scripting UDF as a Load/StoreFunc.
>
> Given these constraints, I think writing a Java LoadFunc seems to be the
> best option. I would write a sub-class of OmnitureTextLoader in such a way
> that it can take constructor parameters. For example,
>
> class MyOmnitureTextLoader extends OmnitureTextLoader {
>
>     private String year;
>     private String month;
>
>     public MyOmnitureTextLoader() { ... }
>     public MyOmnitureTextLoader(String year, String month) { ... }
>
>     @Override
>     public void setLocation(String location, Job job) {
>         // Compute the week path from year and month, and replace
>         // location with it.
>     }
> }
>
> Then you can do something like this in Pig (constructor arguments are
> passed as quoted strings):
>
> DEFINE WEEK_PATH_LOADER MyOmnitureTextLoader('$year', '$month');
>
> A = LOAD 'replace_me_with_week_path' USING WEEK_PATH_LOADER;
>
> Hope this is helpful.
>
> Thanks,
> Cheolsoo
>
>
>
>
> On Thu, Feb 14, 2013 at 2:16 PM, Stevens, Ian
> <[email protected]> wrote:
>
> > Hi everyone. I'm having a problem loading log files based on parameter
> > input and was wondering whether someone would be able to provide some
> > guidance. The logs in question are Omniture logs, stored in
> > subdirectories based on year, month, and day (e.g.
> > /year=2013/month=02/day=14). For any day, multiple logs could exist,
> > each hundreds of MB.
> >
> > I have a Pig script which currently processes logs for an entire
> > month, with the month and the year specified as script parameters (e.g.
> > /year=$year/month=$month/day=*). It works fine and we're quite happy
> > with it. That said, we want to switch to weekly processing of logs,
> > which means the previous LOAD path glob won't work (weeks can wrap
> > months as well as years). To solve this, I have a Python UDF which
> > takes a start date and spits out the necessary glob for a week's worth
> > of logs, e.g.:
> >
> > >>> log_path_regex(2013, 1, 28)
> > '{year=2013/month=01/day=28,year=2013/month=01/day=29,year=2013/month=01/day=30,year=2013/month=01/day=31,year=2013/month=02/day=01,year=2013/month=02/day=02,year=2013/month=02/day=03}'
> >
> > This glob will then be inserted in the appropriate path:
> >
> > > %declare omniture_log_path 's3://foo/bar/$week_path/*.tsv.gz';
> > > data = LOAD '$omniture_log_path' USING OmnitureTextLoader();
> > > // See http://github.com/msukmanowsky/OmnitureTextLoader
> >
> > Unfortunately, I can't for the life of me figure out how to populate
> > $week_path based on $year, $month and $day script parameters. I tried
> > using %declare, but grunt complains and says it's logging, yet never does:
> >
> > > %declare week_path util.log_path_regex(year, month, day);
> > 2013-02-14 16:54:02,648 [main] INFO org.apache.pig.Main - Apache Pig
> > version 0.10.1 (r1426677) compiled Dec 28 2012, 16:46:13
> >
> > 2013-02-14 16:54:02,648 [main] INFO org.apache.pig.Main - Logging
> > error messages to: /tmp/pig_1360878842643.log
> > % ls /tmp/pig_1360878842643.log
> > ls: cannot access /tmp/pig_1360878842643.log: No such file or
> > directory
> >
> > The same error results if I prefix the parameters with dollar signs or
> > surround prefixed parameters with quotes.
> >
> > If I try to use define (which I believe only works for static Java
> > functions), I get the following:
> >
> > > define week_path util.log_path_regex(year, month, day);
> > 2013-02-14 17:00:42,392 [main] ERROR
> > org.apache.pig.tools.grunt.Grunt - ERROR 1200: <file script.pig, line
> > 11, column 37> mismatched input 'year' expecting RIGHT_PAREN
> >
> > As with %declare, I get the same error if I prefix the parameters with
> > dollar signs or surround prefixed parameters with quotes.
> >
> > I've searched around and haven't come up with a solution. I'm possibly
> > searching for the wrong thing. Invoking a shell command may work, but
> > would be difficult as it would complicate our script deploy and may
> > not be feasible given we're retrieving logs from S3 and not a mounted
> > directory.
> >
> > It's also likely there's a nice Pig-friendly way to restrict LOAD
> > other than using globs. That said, I'd still have to use my UDF which
> > seems to be the root of the issue.
> >
> > Do I need to convert my UDF to a static Java method? Or will I run
> > into the same issue? (I hesitate to do this on the off-chance it will
> > work. It's an 8-line Python function, readily deployable and much more
> > maintainable by others than the equivalent Java code would be.)
> >
> > Any ideas?
> >
> > Cheers,
> > Ian.
> >
>