I love the TupleFuture idea. I wonder how hard it would be to do if you could defer evaluation until serialization (maybe it serializes instantly?)
2011/11/9 Dmitriy Ryaboy <[email protected]> > Could extend Tuple and return a TupleFuture object. > I am guessing it'll just get evaluated immediately by the next operator and > not actually gain you anything. > It'd be neat to be able to do that sort of thing though. > D > > On Wed, Nov 9, 2011 at 12:22 PM, Mridul Muralidharan > <[email protected]>wrote: > > > > > A simple solution would be to tag each tuple with a random number (such > > that each number has multiple url's associated with it - but not too > large > > a number of urls), and simply group based on this field. > > In the reducer, you get a bag of url's for each random number : at which > > point, you can use multiple threads to fetch content and associate their > > responses with the appropriate input tuple. > > > > > > You only need to ensure that : > > a) Too many tuples dont get associated with a single random number (to > the > > extent that it causes spills to disk). > > > > b) Too few tuples dont get associated over all random numbers you use - > > else it degenerates to current case. > > > > c) You seed the random number sensible, in order not to hit problems with > > having your tasks being non-repeatable. > > > > Regards, > > Mridul > > > > > > On Wednesday 09 November 2011 07:04 PM, Daan Gerits wrote: > > > >> Hello, > >> > >> First of all, great job creating pig, really a magnificent piece of > >> software. > >> > >> I do have a few questions about UDFs. I have a dataset with a list of > >> url's I want to fetch. Since an EvalFunc can only process one tuple at a > >> time and the asynchronous abilities of the UDF are deprecated, I can > only > >> fetch one url at a time. The problem is that fetching this one url > takes a > >> reasonable amount of time (1 to 5 seconds, there is a delay built in) so > >> that really slows down the processing. I already converted the UDF into > an > >> Accumulator but that only seems to get fired after a group by. If would > be > >> nice to have some kind of Queue UDF which will queue the tuples until a > >> certain amount is reached and than flushes the queue. That way I can add > >> tuples to an internal list and on flush start multiple threads to go > >> through the list of Tuples. > >> > >> This is a workaround though, since the best solution would be to > >> reintroduce the asynchronous UDF's (in which case I can schedule the > >> threads as the tuples come in) > >> > >> Any idea's on this? I already saw someone trying almost the same thing, > >> but didn't get a definite answer from that one. > >> > >> An other option is to increase the number of reducer slots on the > >> cluster, but I'm afraid that would mean we overload the nodes in case > of a > >> heavy reduce phase. > >> > >> Best Regards, > >> > >> Daan > >> > > > > >
