Do you use PARALLEL in the GROUP?

________________________________________
From: Daan Gerits [[email protected]]
Sent: Wednesday, November 09, 2011 3:34 PM
To: [email protected]
Subject: Multithreaded UDF

Hello,

First of all, great job creating Pig, really a magnificent piece of software. I do have a few questions about UDFs.

I have a dataset with a list of URLs I want to fetch. Since an EvalFunc can only process one tuple at a time and the asynchronous abilities of the UDF are deprecated, I can only fetch one URL at a time. The problem is that fetching a single URL takes a fair amount of time (1 to 5 seconds, since there is a built-in delay), which really slows down the processing. I already converted the UDF into an Accumulator, but that only seems to get fired after a GROUP BY.

It would be nice to have some kind of Queue UDF that queues the tuples until a certain number is reached and then flushes the queue. That way I can add tuples to an internal list and, on flush, start multiple threads to go through the list of tuples. This is a workaround though, since the best solution would be to reintroduce the asynchronous UDFs (in which case I could schedule the threads as the tuples come in).

Any ideas on this? I already saw someone trying almost the same thing, but they didn't get a definite answer.

Another option is to increase the number of reducer slots on the cluster, but I'm afraid that would overload the nodes during a heavy reduce phase.

Best Regards,
Daan
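
To make the queue idea above concrete, here is a minimal sketch in plain Java. It deliberately does not use any Pig classes: BatchingQueue, its add/flush methods, and the batch size and thread count parameters are all hypothetical names chosen for illustration, and the worker function stands in for the slow URL fetch. Inside a real UDF the results of a batch would still have to be returned somehow (for example as a bag), and the last partial batch would need an end-of-input hook; this sketch does not address either.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/**
 * Hypothetical helper (not part of Pig): buffers inputs until batchSize
 * is reached, then processes the whole batch on a fixed thread pool.
 */
final class BatchingQueue<I, O> {
    private final int batchSize;
    private final int threads;
    private final Function<I, O> worker;          // e.g. the slow URL fetch
    private final List<I> pending = new ArrayList<>();

    BatchingQueue(int batchSize, int threads, Function<I, O> worker) {
        this.batchSize = batchSize;
        this.threads = threads;
        this.worker = worker;
    }

    /** Queues one input. Returns the batch results once the queue is full, else an empty list. */
    List<O> add(I input) {
        pending.add(input);
        if (pending.size() < batchSize) {
            return new ArrayList<>();
        }
        return flush();
    }

    /** Processes everything queued so far in parallel; results keep the input order. */
    List<O> flush() {
        List<O> results = new ArrayList<>();
        if (pending.isEmpty()) {
            return results;
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<O>> tasks = new ArrayList<>();
            for (I item : pending) {
                tasks.add(() -> worker.apply(item));
            }
            // invokeAll blocks until every task is done and returns
            // the futures in the same order as the submitted tasks.
            for (Future<O> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
            pending.clear();
        }
        return results;
    }
}
```

With a batch size of, say, 20 and 10 threads, the per-tuple delay of 1 to 5 seconds would be paid roughly twice per batch instead of 20 times, at the cost of holding up to 20 tuples in memory. Creating a pool per flush keeps the sketch short; a long-lived pool would avoid the repeated thread start-up.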
