You don't really have to mess with the JobConf -- you can just have your UDF
initialized with the prefix file location.
So, your UDF would have:

private String prefixPath;
// needed by Pig
public MyUDF() {}

// use this constructor
public MyUDF(String path) {
  this.prefixPath = path;
}

// in exec(), check whether the prefix file has been loaded; if not, load it
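
For what it's worth, the lazy-load check described in the comment above could look roughly like the sketch below. It is kept free of Pig classes so it compiles on its own. PrefixMatcher, loadIfNeeded and matches are made-up names, and it assumes the prefix file holds one prefix per line and is readable through java.nio. A file that lives on HDFS would have to be opened through Hadoop's FileSystem API instead.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Pig: holds the lazy-load logic that the
// UDF's exec() method would delegate to.
class PrefixMatcher {
  private final String prefixPath;
  private List<String> prefixes; // stays null until the first lookup

  PrefixMatcher(String prefixPath) {
    this.prefixPath = prefixPath;
  }

  // Reads the prefix file once, one prefix per line, skipping blank lines.
  // Assumption: the path is readable through java.nio (not an HDFS path).
  private void loadIfNeeded() throws IOException {
    if (prefixes != null) {
      return; // already loaded by an earlier call
    }
    List<String> loaded = new ArrayList<>();
    for (String line : Files.readAllLines(Paths.get(prefixPath), StandardCharsets.UTF_8)) {
      String trimmed = line.trim();
      if (!trimmed.isEmpty()) {
        loaded.add(trimmed);
      }
    }
    prefixes = loaded;
  }

  // Returns true if name starts with any of the loaded prefixes.
  boolean matches(String name) throws IOException {
    if (name == null) {
      return false;
    }
    loadIfNeeded();
    for (String prefix : prefixes) {
      if (name.startsWith(prefix)) {
        return true;
      }
    }
    return false;
  }
}
```

In the UDF itself, the constructor that takes the path would create one of these, and exec() would pass the input field to matches(). The linear scan should be fine for a small prefix list. For a large one, a trie or a sorted structure might be worth a look.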

Then in Pig you would say:

DEFINE MyUDFInstance org.myorg.MyUDF("/this/is/my/prefix/file");

-- load data...

processed_data = foreach data generate MyUDFInstance(some_field);


On Tue, Nov 2, 2010 at 11:27 AM, Joe Ciaramitaro wrote:

> Thanks for the quick response.. I have some follow ups though :)  --
>
> Not quite as bad (computationally expensive) as a regular expression, just
> something that would allow me to check String.startsWith... but same basic
> idea
>
> Prefixes is small enough to fit into memory, but it's not clear to me how
> to make that happen.
>
> I see that the UDF has access to the JobConf, so I could pass in a
> configuration that resolves to the hdfs path of the prefixes file.  The Pig
> UDF manual shows how to receive the configurations, but I'm not sure how to
> SET them on the client side.
>
> -Joe
>
> On Nov 2, 2010, at 1:27 PM, Alan Gates wrote:
>
> > Basically you want to join on a regular expression, correct?
> > Unfortunately MapReduce (and thus Pig) is spectacularly bad at
> > non-equijoins.  Is 'prefixes' small enough to fit in memory?  If so, you
> > could write a UDF that loaded it into memory and did the comparison.  This
> > way the join would be done in the map phase.
> >
> > Alan.
> >
> > On Nov 2, 2010, at 10:19 AM, Joe Ciaramitaro wrote:
> >
> >> Hi all,
> >>
> >> I have 2 data files.  One which contains a number of records, and one
> >> which contains a number of prefixes.
> >>
> >> A = load 'data' AS (id, name);
> >> B = load 'prefixes' AS (prefix);
> >>
> >> I'd like to pull records in A whose name begins with prefix
> >>
> >> The prefixes are of varying lengths
> >>
> >> I've been scouring the documentation, but haven't figured out what the
> >> best approach could be.
> >>
> >> Thanks for any help,
> >>
> >> Joe
> >
>
>
