Re: Find variants of a term in relation A from a field in relation B

Zach Bailey Thu, 02 Dec 2010 11:02:06 -0800

 I would recommend writing a UDF based on the Aho Corasick algorithm [1] which 
is seeded with relation a and then applied to relation b. This should let you 
do a single-pass search on each "row" in B, reducing your algorithmic 
complexity from O(n^2) to O(n).



I'd love to know if anyone else has a better way of doing this, because it's a 
problem I'm also working on :)

-Zach



[1] http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm

On Thursday, December 2, 2010 at 1:53 PM, Arun A K wrote:

> Hello
> 
> I have this problem to solve using Pig.
> 
> *Input*
> 1. Relation A which has only one field of type chararray. Sample of A
> follows:
> *abc*
> *xyz gh*
> *zzz yy*
> *red*
> 
> Approximate numbers of rows in A = 10,000
> 
> 2. Relation B which has only one field of type chararray. Sample of B
> follows:
> *red car*
> *red ferrari*
> *abc*
> *abcd*
> *xyz ghis*
> 
> Approximate numbers of rows in B = 1 billion
> 
> *Problem to be solved* I need to find all case-insensitive variants of each
> term in relation A existing in relation B. For example: Term 'red' from A
> would have variants 'red car' and 'red ferrari' in B.
> 
> I was able to get variants of one term in A from B using matches operator
> i.e. matches '.*red.*' How to go about creating a complete solution for this
> problem? Should I use a UDF or go for native Map Reduce? Am a bit confused
> on how to proceed on this. I would really appreciate any help on this.
> 
> Thanks much.
> 
> Regards
> Arun A K
> 
> 
> 
>

Re: Find variants of a term in relation A from a field in relation B

Reply via email to