Oh, this is much better than the custom loader hack I mentioned to batch up input tuples.
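
To make that concrete, here is a rough sketch of Mridul's suggestion as a Java UDF applied to the grouped bag. Everything in it is illustrative rather than tested - the class name ParallelFetch, the pool size, the fetch helper, and the Pig Latin in the header comment are my own assumptions; only EvalFunc, DataBag, and the tuple/bag factories are actual Pig API.

package example.udf; // hypothetical package

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

/*
 * Intended usage from Pig Latin (all names illustrative):
 *
 *   DEFINE ParallelFetch example.udf.ParallelFetch();
 *   tagged  = FOREACH urls GENERATE url, (int)(RANDOM() * 100) AS bucket;
 *   grouped = GROUP tagged BY bucket;
 *   pages   = FOREACH grouped GENERATE FLATTEN(ParallelFetch(tagged.url));
 *
 * Note Mridul's point (c) below: RANDOM() makes the tagging non-repeatable
 * across task re-runs; a deterministic hash of the URL modulo the bucket
 * count avoids that.
 */
public class ParallelFetch extends EvalFunc<DataBag> {

    private static final int FETCH_THREADS = 10; // illustrative pool size

    @Override
    public DataBag exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        DataBag urls = (DataBag) input.get(0); // the bag projected by the FOREACH

        ExecutorService pool = Executors.newFixedThreadPool(FETCH_THREADS);
        try {
            // Submit one fetch task per URL so the 1-5 second per-request
            // latency overlaps instead of being paid serially.
            List<Future<Tuple>> futures = new ArrayList<Future<Tuple>>();
            for (Tuple t : urls) {
                final String url = (String) t.get(0);
                futures.add(pool.submit(new Callable<Tuple>() {
                    public Tuple call() throws Exception {
                        return TupleFactory.getInstance()
                                .newTuple(Arrays.asList(url, fetch(url)));
                    }
                }));
            }
            // Collect the (url, content) pairs back into an output bag.
            DataBag out = BagFactory.getInstance().newDefaultBag();
            for (Future<Tuple> f : futures) {
                try {
                    out.add(f.get());
                } catch (Exception e) {
                    throw new IOException("URL fetch failed", e);
                }
            }
            return out;
        } finally {
            pool.shutdown();
        }
    }

    // Naive fetch; a real version would set connect/read timeouts, retries,
    // and politeness delays.
    private static String fetch(String url) throws IOException {
        InputStream in = new URL(url).openStream();
        try {
            StringBuilder sb = new StringBuilder();
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                sb.append(new String(buf, 0, n, "UTF-8"));
            }
            return sb.toString();
        } finally {
            in.close();
        }
    }
}

The thread count trades off against Mridul's point (a) below: the more URLs you pack into one bucket, the more the pool helps, but the closer the bag gets to spilling to disk.
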
On Wed, Nov 9, 2011 at 12:22 PM, Mridul Muralidharan <[email protected]> wrote:

> A simple solution would be to tag each tuple with a random number (such
> that each number has multiple URLs associated with it - but not too large
> a number of URLs), and simply group on this field. In the reducer you get
> a bag of URLs for each random number, at which point you can use multiple
> threads to fetch the content and associate each response with the
> appropriate input tuple.
>
> You only need to ensure that:
>
> a) Too many tuples don't get associated with a single random number (to
> the extent that it causes spills to disk).
>
> b) Too few tuples don't get associated across all the random numbers you
> use - else it degenerates to the current case.
>
> c) You seed the random number sensibly, so that your tasks don't become
> non-repeatable.
>
> Regards,
> Mridul
>
> On Wednesday 09 November 2011 07:04 PM, Daan Gerits wrote:
>> Hello,
>>
>> First of all, great job creating Pig - really a magnificent piece of
>> software.
>>
>> I do have a few questions about UDFs. I have a dataset with a list of
>> URLs I want to fetch. Since an EvalFunc can only process one tuple at a
>> time and the asynchronous abilities of the UDF are deprecated, I can
>> only fetch one URL at a time. The problem is that fetching a single URL
>> takes a noticeable amount of time (1 to 5 seconds; there is a delay
>> built in), so that really slows down the processing. I already converted
>> the UDF into an Accumulator, but that only seems to get fired after a
>> GROUP BY. It would be nice to have some kind of queue UDF which queues
>> the tuples until a certain count is reached and then flushes the queue.
>> That way I can add tuples to an internal list and, on flush, start
>> multiple threads to go through the list of tuples.
>>
>> This is a workaround, though, since the best solution would be to
>> reintroduce the asynchronous UDFs (in which case I could schedule the
>> threads as the tuples come in).
>>
>> Any ideas on this? I already saw someone trying almost the same thing,
>> but that thread didn't get a definite answer.
>>
>> Another option is to increase the number of reducer slots on the
>> cluster, but I'm afraid that would mean overloading the nodes in case of
>> a heavy reduce phase.
>>
>> Best Regards,
>>
>> Daan
