Thanks much Daniel for the response. The solution looks good though I wonder it might miss terms in A which contain spaces because STRSPLIT would break each term in B before doing the equi join.
Regards Arun A K Graduate Student Department of Computer Science Indiana University, Bloomington On Thu, Dec 2, 2010 at 11:17 AM, Daniel Dai <[email protected]> wrote: > Can you convert it into a equal join problem? That's the case mapreduce can > handle efficiently. Not sure if it address your problem but provide a sample > script. > > a = load 'A' as (a0:chararray); > b = foreach a generate LOWER(a0) as b0; > > c = load 'B' as (c0:chararray); > d = foreach c generate LOWER(c0) as d0; > e = foreach d generate d0, flatten(STRSPLIT(d0)) as e0; > > f = join b by b0, e by e0; > g = foreach f generate d0, b0; > dump g; > > Daniel > > > Arun A K wrote: > >> Hello >> >> I have this problem to solve using Pig. >> >> *Input* >> 1. Relation A which has only one field of type chararray. Sample of A >> follows: >> *abc* >> *xyz gh* >> *zzz yy* >> *red* >> >> Approximate numbers of rows in A = 10,000 >> >> 2. Relation B which has only one field of type chararray. Sample of B >> follows: >> *red car* >> *red ferrari* >> *abc* >> *abcd* >> *xyz ghis* >> >> Approximate numbers of rows in B = 1 billion >> >> *Problem to be solved* I need to find all case-insensitive variants of >> each >> term in relation A existing in relation B. For example: Term 'red' from A >> would have variants 'red car' and 'red ferrari' in B. >> >> I was able to get variants of one term in A from B using matches operator >> i.e. matches '.*red.*' How to go about creating a complete solution for >> this >> problem? Should I use a UDF or go for native Map Reduce? Am a bit confused >> on how to proceed on this. I would really appreciate any help on this. >> >> Thanks much. >> >> Regards >> Arun A K >> >> > >
