The script you propose will work, but unless your data is small it will be very slow, since CROSS produces every pairing of rows from A and B before the filter runs. A quick search of the web turned up one paper with an algorithm for parallel non-equijoins that at first glance might work in your case.

Alan.

On Jan 26, 2011, at 5:15 PM, Jonathan Coveney wrote:

Also, it'd be worth thinking about this both for the case where the mins and maxes are arbitrary and for the case where the ranges don't overlap. That is to say, there is only one thing for a given value.

2011/1/26 Jonathan Coveney <[email protected]>

A is (val:int)
B is (thing:chararray, min:int, max:int)

Basically, what I want is C = (val, thing), where val is between min and max for that thing. In SQL the syntax for this would not be hard; in Pig, the naive solution I have is:

cro = CROSS A,B;
fil = FILTER cro BY val >= min AND val <= max;
C = FOREACH fil GENERATE val,thing;

I am wondering what the most efficient way of doing this sort of operation is. I imagine some sort of indexing could speed things up, but I'm not sure. This is important enough that I'd be willing to do some legwork.
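
For the simpler case where the ranges in B do not overlap (so each val matches at most one thing), the indexing idea might look roughly like the sketch below. It is plain Python rather than Pig Latin, meant only to illustrate the lookup logic: sort B by min once, then binary-search it for each val instead of comparing against every row. The function names (build_index, lookup, range_join) are hypothetical, and the sketch assumes B is small enough to hold in memory on each task.

```python
from bisect import bisect_right


def build_index(ranges):
    """Sort (thing, lo, hi) tuples by lower bound.

    Assumption: the ranges do not overlap, so at most one can contain a value.
    """
    ordered = sorted(ranges, key=lambda r: r[1])
    starts = [lo for _, lo, _ in ordered]
    return starts, ordered


def lookup(index, val):
    """Return the thing whose inclusive [lo, hi] range contains val, else None."""
    starts, ordered = index
    # Position of the last range that starts at or before val.
    i = bisect_right(starts, val) - 1
    if i < 0:
        return None
    thing, _lo, hi = ordered[i]
    return thing if val <= hi else None


def range_join(values, ranges):
    """Emit (val, thing) pairs, mirroring val >= min AND val <= max."""
    index = build_index(ranges)
    result = []
    for val in values:
        thing = lookup(index, val)
        if thing is not None:
            result.append((val, thing))
    return result
```

Each lookup costs O(log |B|) rather than O(|B|), so the total work is about |A| log |B| comparisons instead of the |A| x |B| rows that CROSS generates. The same logic could in principle sit inside a UDF that loads B once per task, though that part is not shown here and would need to be checked against Pig itself.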

As always, thanks for your help.

