You could try doing GROUP ALL on the contents of M, which would produce a single bag containing every record, and then joining that with data using a surrogate constant key. I suspect CROSS would also work in place of the join. Then you'd have a tuple like this to work with:
(a, b, M:bag)

I'm not sure if things would blow up if M is too large to fit into memory in your UDF though.

On Wed, Apr 20, 2011 at 6:27 AM, Mark Laczin <[email protected]> wrote:
> I'm trying to do something like this:
> (if 'data' is a set of tuples loaded from a file containing fields a, b and
> c)
> (if 'M' is another set of tuples loaded from a file)
>
> data = FOREACH data GENERATE *, someUDF(a, b, M);
>
> What I'm looking for is to generate (in this case, a string) based on a and
> b, using the contents of M inside the UDF.
>
> The UDF looks like this, in pseudocode:
>
> foreach element x in M {
>   if a matches x or b matches x {
>     return "something"
>   }
> }
> return "something else"
>
> Is this possible? I keep getting errors related to "Scalars can only be
> used with projections" and the like.
> The thing holding me back from using filters is that I won't know what's in
> M until it's read, and since (in this case) they'll be regular expressions,
> I'd need to be able to join/group with regex matching which I don't think
> Pig can do.
>
> -Mark
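
For what it's worth, here is a rough sketch of the matching logic from the pseudocode above, written as a plain Python function of the kind that could be adapted into a Pig UDF. Everything named here is an assumption for illustration: the function name label_record, the aliases M_all, M_bag and joined in the comments, and the idea that the bag reaches the function as a list of one-field tuples. It also uses an unanchored regex search, which may not be what "matches" is meant to mean in the original question.

```python
import re


def label_record(a, b, patterns):
    """Return "something" if a or b matches any regex in patterns.

    Assumption: patterns is the whole of M collapsed into one bag,
    seen here as a list of one-field tuples, e.g. [("^foo",), ("bar$",)].

    Hypothetical Pig side (aliases are made up for this sketch):
        M_all  = GROUP M ALL;
        M_bag  = FOREACH M_all GENERATE M AS patterns;
        joined = CROSS data, M_bag;
        out    = FOREACH joined GENERATE a, b, c, someUDF(a, b, patterns);
    """
    for row in patterns or []:
        pattern = row[0]
        if pattern is None:
            # Skip empty records rather than failing on them.
            continue
        regex = re.compile(pattern)
        for value in (a, b):
            # Unanchored search; anchor the pattern with ^...$ if a
            # whole-string match is what you need.
            if value is not None and regex.search(value):
                return "something"
    return "something else"
```

Since M_bag holds exactly one tuple, the CROSS should just attach the same bag to every record of data. The memory caveat still applies: the full bag is handed to the function for each record.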
