Hello,

First of all, great job creating Pig; it really is a magnificent piece of software.

I do have a few questions about UDFs. I have a dataset with a list of URLs I 
want to fetch. Since an EvalFunc can only process one tuple at a time and the 
asynchronous abilities of UDFs are deprecated, I can only fetch one URL at a 
time. The problem is that fetching a single URL takes a considerable amount of 
time (1 to 5 seconds, since there is a delay built in), which really slows down 
the processing. I already converted the UDF into an Accumulator, but that only 
seems to get fired after a GROUP BY. It would be nice to have some kind of queue 
UDF that queues tuples until a certain number is reached and then flushes the 
queue. That way I can add tuples to an internal list and, on flush, start 
multiple threads to work through the list of tuples.
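
The queue-and-flush idea could be sketched roughly as follows. This is plain 
Java, not a Pig UDF: the class name BatchingQueue, its constructor parameters, 
and the worker function are all hypothetical and only illustrate the buffering 
and the parallel flush. How this would hook into Pig's UDF lifecycle (in 
particular, flushing the last partial batch at the end of the input) is left 
open.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical helper (not part of Pig): buffers inputs and processes each full batch in parallel. */
class BatchingQueue<I, O> {
    private final int batchSize;            // flush once this many items are queued
    private final int threads;              // worker threads used per flush
    private final Function<I, O> worker;    // e.g. "fetch this URL"; assumed to be thread-safe
    private final List<I> pending = new ArrayList<>();

    BatchingQueue(int batchSize, int threads, Function<I, O> worker) {
        this.batchSize = batchSize;
        this.threads = threads;
        this.worker = worker;
    }

    /** Queues one item; returns the batch results if this item filled the batch, else an empty list. */
    List<O> add(I item) {
        pending.add(item);
        return pending.size() >= batchSize ? flush() : new ArrayList<>();
    }

    /** Processes everything queued so far on a thread pool; results are in input order. */
    List<O> flush() {
        List<O> results = new ArrayList<>();
        if (pending.isEmpty()) {
            return results;
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<O>> tasks = new ArrayList<>();
            for (I item : pending) {
                tasks.add(() -> worker.apply(item));
            }
            // invokeAll blocks until every task is done and keeps the task order.
            for (Future<O> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            pending.clear();
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while flushing", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("worker failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

With a batch size of 10, the first nine calls to add() return nothing and the 
tenth returns all ten results at once, so the caller has to cope with results 
arriving late and must call flush() once more after the last input.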

This is a workaround though; the best solution would be to reintroduce 
asynchronous UDFs (in which case I could schedule the threads as the tuples 
come in).

Any ideas on this? I already saw someone attempting almost the same thing, but 
that thread didn't get a definite answer.

Another option is to increase the number of reducer slots on the cluster, but 
I'm afraid that would overload the nodes during a heavy reduce phase.

Best Regards,

Daan
