Reading around a bit more, it looks like the best way to do this is to:
1. Copy the smaller dataset, B, to the distributed cache.
2. In the UDF args, tell the UDF how to parse B and what field from
the smaller dataset to use as "regex" (specify delimiter and index #)
3. Initialize the smaller dataset within the UDF as an instance
variable of some sort (definitely do not read it within exec()), because
Pig will instantiate a UDF instance per mapper, whereas exec() will get
called for each row/tuple of dataset A.
4. Pass to the UDF the relation to be matched (dataset A), the location
of the file representing dataset B, the delimiter for each row of
dataset B, and the index number of the field that contains the regex in B.
5. Return bag (and schema).

Use the UDF as: joinedAndmatched = FOREACH A GENERATE
matchAndJoin(filePath, delimiter, index);
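
To make steps 2 and 3 concrete, here is a rough sketch of the lookup part only, in plain Java with no Pig classes. RegexLookup, fromLines and categoriesFor are made-up names for illustration, and the category index is an extra parameter I am assuming. In a real UDF the lines would come from the file in the distributed cache, and the lookup would be built once (in the constructor, or lazily on the first exec() call) and kept in an instance variable, so that each regex is compiled once rather than once per tuple of A.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper (not a Pig class): holds dataset B as precompiled patterns.
class RegexLookup {
    private final List<Pattern> patterns = new ArrayList<>();
    private final List<String> categories = new ArrayList<>();

    // Build once per UDF instance from the rows of B.
    // Assumes the delimiter never appears inside a regex.
    static RegexLookup fromLines(List<String> lines, String delimiter,
                                 int regexIndex, int categoryIndex) {
        RegexLookup lookup = new RegexLookup();
        for (String line : lines) {
            // Quote the delimiter so it is treated literally by split().
            String[] fields = line.split(Pattern.quote(delimiter));
            if (fields.length <= Math.max(regexIndex, categoryIndex)) {
                continue; // skip rows that do not have both fields
            }
            lookup.patterns.add(Pattern.compile(fields[regexIndex].trim()));
            lookup.categories.add(fields[categoryIndex].trim());
        }
        return lookup;
    }

    // Called per row of A: returns the category of every regex that
    // matches the whole "action" string (empty list if none match).
    List<String> categoriesFor(String action) {
        List<String> matched = new ArrayList<>();
        for (int i = 0; i < patterns.size(); i++) {
            if (patterns.get(i).matcher(action).matches()) {
                matched.add(categories.get(i));
            }
        }
        return matched;
    }
}
```

Inside exec() you would then call categoriesFor(action) on the prebuilt instance and turn each returned category into a tuple in the output bag. This still tests every regex against every action, so it only removes the per-row parsing and compilation cost, not the matching itself.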

Suggestions/comments?





On Sun, May 11, 2014 at 2:18 PM, Xuri Nagarin <[email protected]> wrote:
> Hi,
>
> Let's say I have a large data set, A, that is like:
>
> user, verb, action, location
>
> Example:
> joe, said, I had a nice day, Tokyo
> jane, paid, two dollars for a nice cup of coffee, Melbourne
> jack, watched, an interesting movie, New York
> jamie, said, I am interested in hiking, Austin
>
> Another smaller data set, B, has a list of regex to match the "action"
> and each regex has some other attribute associated with it, say,
> category of action.
> Example:
> .*interest.*, explore
> .*bank.*, account
> .*tax.*, account
> .*play.*, sports
>
> What I want is that if "action" matches "regex" then join sets A
> and B such that I end up with tuple (user, verb, category of action,
> location).
>
> Right now, I have done this using a Java UDF where each A::action gets
> evaluated against each B::regex for a match. If there is a match, it
> returns the desired tuple.
>
> However, performance is slow. I am wondering if there is a better
> strategy to do what I think is essentially a lookup table. I have seen
> threads where replicated join has been recommended but obviously a
> simple "join" isn't going to work for regex matching.
>
> Any recommendations?
>
> Thanks,
>
> Xuri
