Multi-threading of UDFs is not deprecated; it just isn't explicitly supported. However, it should work. The internal MonitoredUDF uses multiple threads.
Do you need to output records conditionally, or modify the contents of the record based on the results of this HTTP call? If not, then you can place records in a queue as they go through and have a pool of worker threads doing the HTTP calls in the background. You can then use the finish() call to make sure your queue is empty and all your worker threads have finished.

The problem, if you need to modify or remove records, is that finish() doesn't let you return data. So even though you could return a bag full of records you had finished for each record that came in (with some bags being empty, which a subsequent flatten could then remove), you would lose the last few records because you wouldn't get a chance to return them. As suggested in a previous mail, streaming will do what you want in this case.

Alan.

On Nov 9, 2011, at 5:34 AM, Daan Gerits wrote:

> Hello,
>
> First of all, great job creating Pig, really a magnificent piece of software.
>
> I do have a few questions about UDFs. I have a dataset with a list of URLs I
> want to fetch. Since an EvalFunc can only process one tuple at a time and the
> asynchronous abilities of the UDF are deprecated, I can only fetch one URL at
> a time. The problem is that fetching this one URL takes a reasonable amount
> of time (1 to 5 seconds, there is a delay built in), so that really slows down
> the processing. I already converted the UDF into an Accumulator, but that only
> seems to get fired after a GROUP BY. It would be nice to have some kind of
> Queue UDF which would queue the tuples until a certain amount is reached and
> then flush the queue. That way I can add tuples to an internal list and on
> flush start multiple threads to go through the list of tuples.
>
> This is a workaround though, since the best solution would be to reintroduce
> the asynchronous UDFs (in which case I can schedule the threads as the
> tuples come in).
>
> Any ideas on this?
> I already saw someone trying almost the same thing, but
> didn't get a definite answer from that one.
>
> Another option is to increase the number of reducer slots on the cluster,
> but I'm afraid that would mean we overload the nodes in case of a heavy
> reduce phase.
>
> Best Regards,
>
> Daan
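The queue-plus-worker-pool pattern Alan describes can be sketched in plain Java. This is only an illustration, not Pig API code: the class and method names (AsyncFetchPool, submit, drain) are hypothetical, the Pig classes (EvalFunc, Tuple) are omitted so the sketch stays self-contained, and the actual HTTP fetch is replaced by a placeholder. In a real UDF, exec() would call submit() for each record and finish() would call drain().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: records are handed to a background pool instead of
// blocking on the slow HTTP call; a finish()-style method drains the pool.
class AsyncFetchPool {
    private final ExecutorService workers = Executors.newFixedThreadPool(10);
    private final ConcurrentLinkedQueue<String> results =
            new ConcurrentLinkedQueue<>();

    // Called once per record (in a real UDF, from exec(Tuple)).
    void submit(final String url) {
        workers.execute(() -> {
            // Placeholder for the real HTTP fetch (1 to 5 seconds each).
            results.add("fetched:" + url);
        });
    }

    // Called from the UDF's finish(): wait until every queued fetch is done.
    List<String> drain() {
        workers.shutdown();
        try {
            workers.awaitTermination(5, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new ArrayList<>(results);
    }
}
```

Note that, as Alan points out, this only works when the UDF does not need to emit the fetched data per record: finish() cannot return tuples, so anything still in the pool when the last record arrives has nowhere to go, which is why streaming is the better fit for that case.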
