You can write an EvalFunc UDF that depends on a sort, and there are
several in piggybank that do so. COR (the correlate UDF) is one such
example. You call these UDFs on a relation after ordering it.
For example:
    answers = foreach (group data by key)
    {
        sorted = order data by value;
        generate my_udf(sorted.field1, sorted.field2);
    }
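
To make that a bit more concrete, here is a rough sketch of the kind of
logic a sort-dependent UDF might contain. I've written it in Python, since
Pig can run Python UDFs through Jython. The name consecutive_deltas and the
(key, value) tuple layout are made up for illustration, and I've left out
the output schema declaration a real Jython UDF would need so that the
snippet runs as plain Python.

```python
def consecutive_deltas(sorted_bag):
    """Return the gap between each value and the value before it.

    Hypothetical example: assumes sorted_bag is a list of (key, value)
    tuples that is ALREADY ordered by value, which is what the nested
    'order data by value' is meant to provide. The result is only
    meaningful if that ordering holds.
    """
    deltas = []
    previous = None
    for _key, value in sorted_bag:
        if previous is not None:
            # Each output element is a 1-tuple, mirroring a bag of tuples.
            deltas.append((value - previous,))
        previous = value
    return deltas


if __name__ == "__main__":
    bag = [("a", 1), ("a", 4), ("a", 9)]
    print(consecutive_deltas(bag))  # [(3,), (5,)]
```

The point is just that the function never sorts anything itself; it relies
entirely on the order of the bag it is handed.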
If I remember correctly, you can in fact also do this:
    sorted = order data by field;
    answer = foreach sorted generate my_udf(sorted.field, sorted.other_field);
Although strictly speaking, Pig doesn't guarantee that a sort is
maintained outside of the nested {} block.
I can't help with the JOIN; I don't know about that. But check out Pig's
Bloom filter:
http://pig.apache.org/docs/r0.10.0/api/org/apache/pig/builtin/Bloom.html
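
In case it helps to see why a Bloom filter is relevant to a join, here is a
small Python sketch of the general idea: build a filter from the keys of
the small side, then drop rows from the large side whose keys definitely
aren't in it. This is not Pig's Bloom class or its API. TinyBloom,
prefilter, the bit count, and the hash scheme are all assumptions of mine
for illustration.

```python
import hashlib


class TinyBloom:
    """Toy Bloom filter (illustrative only, not Pig's implementation)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, key):
        # Derive num_hashes positions by salting the key with an index.
        for i in range(self.num_hashes):
            data = ("%d:%s" % (i, key)).encode("utf-8")
            yield int(hashlib.sha256(data).hexdigest(), 16) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


def prefilter(large_rows, small_keys):
    """Keep only rows whose first field might appear in small_keys."""
    bloom = TinyBloom()
    for key in small_keys:
        bloom.add(key)
    return [row for row in large_rows if bloom.might_contain(row[0])]


if __name__ == "__main__":
    rows = [("a", 1), ("b", 2), ("c", 3)]
    print(prefilter(rows, ["a", "c"]))
```

A Bloom filter never drops a row that really matches, but it can let a few
non-matching rows through, so the real join still has to run afterwards.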
Russell Jurney twitter.com/rjurney
On Oct 5, 2012, at 11:46 AM, Brian Stempin wrote:
> Hi,
> I'm fairly new to writing UDFs and Pig in general. I want to be able to
> write a UDF that can take advantage of MapReduce's sorting of data.
> Specifically, I'm trying to conceive how I'd write a UDF to do a specialized
> join or a pivot. In both cases, sorting would be useful. EvalFunc seems to
> give no guarantees about ordering of tuples that are passed in.
>
> Is there any way to do such things as a UDF?
>
> TIA for your help,
> Brian Stempin
> Machine Learning Engineer
> ColdLight Solutions, LLC