Basically you want to join on a regular expression, correct? Unfortunately MapReduce (and thus Pig) is spectacularly bad at non-equijoins. Is 'prefixes' small enough to fit in memory? If so, you could write a UDF that loads it into memory and does the comparison. That way the join would be done in the map phase.
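
To make the idea concrete, here is a rough sketch of the in-memory comparison such a UDF could do, assuming 'prefixes' has one prefix per line and fits in memory. The class name PrefixMatcher and its methods are made up for illustration, and the Pig wiring is left out: in practice this logic would probably sit inside a filter UDF (for example a class extending org.apache.pig.FilterFunc), with the prefixes file made available to every map task.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: holds all prefixes in memory and answers
// "does this name begin with any of them?".
final class PrefixMatcher {
    private final Set<String> prefixes = new HashSet<>();
    // Distinct prefix lengths in ascending order, since the prefixes vary in length.
    private final TreeSet<Integer> lengths = new TreeSet<>();

    PrefixMatcher(Collection<String> source) {
        for (String p : source) {
            if (p != null && !p.isEmpty()) {
                prefixes.add(p);
                lengths.add(p.length());
            }
        }
    }

    // Assumed input layout: one prefix per line; blank lines are ignored.
    static PrefixMatcher fromReader(Reader source) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        }
        return new PrefixMatcher(lines);
    }

    // One hash lookup per distinct prefix length, not one per prefix.
    boolean matches(String name) {
        if (name == null) {
            return false;
        }
        for (int len : lengths) {
            if (len > name.length()) {
                break; // remaining prefixes are longer than the name
            }
            if (prefixes.contains(name.substring(0, len))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        PrefixMatcher m = PrefixMatcher.fromReader(new StringReader("ab\nxyz\n"));
        System.out.println(m.matches("abcdef"));
        System.out.println(m.matches("xy"));
    }
}
```

The matcher would be built once per task and then called for every record of A, so no reduce-side join is needed. If the prefix list is too big for memory, this approach does not apply.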

Alan.

On Nov 2, 2010, at 10:19 AM, Joe Ciaramitaro wrote:

Hi all,

I have 2 data files. One which contains a number of records, and one which contains a number of prefixes.

A = load 'data' AS (id, name);
B = load 'prefixes' AS (prefix);

I'd like to pull the records in A whose name begins with any of the prefixes in B.

The prefixes are of varying lengths.

I've been scouring the documentation, but haven't figured out what the best approach could be.

Thanks for any help,

Joe
