You could try doing a GROUP ALL on M, which would produce a single bag
containing every record, and then joining that bag with data using a
surrogate constant key. I suspect CROSS would also work in place of the
join. Either way, you'd end up with a tuple like this to work with:

(a, b, M:bag)

I'm not sure if things would blow up if M is too large to fit into
memory in your UDF though.
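
Here is a rough, untested sketch of the GROUP ALL plus CROSS variant. The
file names, the aliases M_all, M_bag and joined, and the field name
patterns are all made up for illustration. someUDF is your own UDF, which
I'm assuming is already registered and takes a bag as its third argument.

```pig
-- Hypothetical inputs: adjust the paths and schemas to match your files.
data = LOAD 'data.txt' AS (a:chararray, b:chararray, c:chararray);
M    = LOAD 'patterns.txt' AS (pattern:chararray);

-- Collapse M into a single tuple: (all, {bag of every record in M}).
M_all = GROUP M ALL;

-- Keep only the bag, renamed so it can't be confused with the alias M.
M_bag = FOREACH M_all GENERATE M AS patterns;

-- M_bag has exactly one row, so CROSS attaches the bag to every data row.
joined = CROSS data, M_bag;

-- Each row now looks like (a, b, c, patterns:bag); hand the bag to the UDF.
result = FOREACH joined GENERATE a, b, c, someUDF(a, b, patterns);
```

Renaming the bag is meant to keep the last FOREACH from treating M as a
reference to the relation, which I'd guess is where the scalar error
comes from.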


On Wed, Apr 20, 2011 at 6:27 AM, Mark Laczin <[email protected]> wrote:
> I'm trying to do something like this:
> (if 'data' is a set of tuples loaded from a file containing fields a, b and
> c)
> (if 'M' is another set of tuples loaded from a file)
>
> data = FOREACH data GENERATE *, someUDF(a, b, M);
>
> What I'm looking for is to generate (in this case, a string) based on a and
> b, using the contents of M inside the UDF.
>
> The UDF looks like this, in pseudocode:
>
> foreach element x in M {
>  if a matches x or b matches x {
>    return "something"
>  }
> }
> return "something else"
>
> Is this possible?  I keep getting errors related to "Scalars can only be
> used with projections" and the like.
> The thing holding me back from using filters is that I won't know what's in
> M until it's read, and since (in this case) they'll be regular expressions,
> I'd need to be able to join/group with regex matching which I don't think
> Pig can do.
>
> -Mark
>
