Thanks Zach for introducing me to Aho Corasick algorithm. Let me see if a UDF can help here.
Regards Arun A K Graduate Student Department of Computer Science Indiana University, Bloomington On Thu, Dec 2, 2010 at 11:01 AM, Zach Bailey <[email protected]>wrote: > > I would recommend writing a UDF based on the Aho Corasick algorithm [1] > which is seeded with relation a and then applied to relation b. This should > let you do a single-pass search on each "row" in B, reducing your > algorithmic complexity from O(n^2) to O(n). > > > I'd love to know if anyone else has a better way of doing this, because > it's a problem I'm also working on :) > > -Zach > > > > [1] > http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm > > On Thursday, December 2, 2010 at 1:53 PM, Arun A K wrote: > > > Hello > > > > I have this problem to solve using Pig. > > > > *Input* > > 1. Relation A which has only one field of type chararray. Sample of A > > follows: > > *abc* > > *xyz gh* > > *zzz yy* > > *red* > > > > Approximate numbers of rows in A = 10,000 > > > > 2. Relation B which has only one field of type chararray. Sample of B > > follows: > > *red car* > > *red ferrari* > > *abc* > > *abcd* > > *xyz ghis* > > > > Approximate numbers of rows in B = 1 billion > > > > *Problem to be solved* I need to find all case-insensitive variants of > each > > term in relation A existing in relation B. For example: Term 'red' from A > > would have variants 'red car' and 'red ferrari' in B. > > > > I was able to get variants of one term in A from B using matches operator > > i.e. matches '.*red.*' How to go about creating a complete solution for > this > > problem? Should I use a UDF or go for native Map Reduce? Am a bit > confused > > on how to proceed on this. I would really appreciate any help on this. > > > > Thanks much. > > > > Regards > > Arun A K > > > > > > > > > > > >
