Thanks for the quick response. I have some follow-ups, though :)

Not quite as bad (computationally expensive) as a regular expression; I just 
need something that lets me check String.startsWith. Same basic idea, though.

The prefixes file is small enough to fit into memory, but it's not clear to me 
how to load it from inside a UDF.
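
To make the check concrete, here is a rough sketch of the in-memory part only, 
in plain Java with no Pig classes. PrefixMatcher and its methods are names I 
made up for illustration, and it assumes the prefixes have already been read 
into a list of strings, one prefix per entry. How the UDF gets hold of the 
file is still the open question.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Pig: keeps the prefixes in memory and
// answers "does this name start with any of them?".
class PrefixMatcher {
    private final List<String> prefixes = new ArrayList<>();

    // Assumes one prefix per entry; null and blank entries are skipped.
    PrefixMatcher(Iterable<String> lines) {
        for (String line : lines) {
            if (line == null) {
                continue;
            }
            String prefix = line.trim();
            if (!prefix.isEmpty()) {
                prefixes.add(prefix);
            }
        }
    }

    // Linear scan over all prefixes; returns true on the first match.
    boolean matches(String name) {
        if (name == null) {
            return false;
        }
        for (String prefix : prefixes) {
            if (name.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}
```

The linear scan costs one startsWith call per prefix for every record, which 
should be fine for a small list. A trie would probably be the next step if the 
list grows.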

I see that the UDF has access to the JobConf, so I could pass in a 
configuration property that resolves to the HDFS path of the prefixes file.  
The Pig UDF manual shows how to read configuration properties, but I'm not 
sure how to SET them on the client side.

-Joe

On Nov 2, 2010, at 1:27 PM, Alan Gates wrote:

> Basically you want to join on a regular expression, correct?  Unfortunately 
> Map Reduce (and thus Pig) is spectacularly bad at non-equijoins.  Is 
> 'prefixes' small enough to fit in memory?  If so, you could write a UDF that 
> loaded it into memory and did the comparison.  This way the join would be 
> done in the map phase.
> 
> Alan.
> 
> On Nov 2, 2010, at 10:19 AM, Joe Ciaramitaro wrote:
> 
>> Hi all,
>> 
>> I have 2 data files.  One which contains a number of records, and one which 
>> contains a number of prefixes.
>> 
>> A = load 'data' AS (id, name);
>> B = load 'prefixes' AS (prefix);
>> 
>> I'd like to pull records in A whose name begins with prefix
>> 
>> The prefixes are of varying lengths
>> 
>> I've been scouring the documentation, but haven't figured out what the best 
>> approach could be.
>> 
>> Thanks for any help,
>> 
>> Joe
> 
