Hi Marek,

yes, I have:

SET default_parallel 50;

at the top of my script.

The idea is to use the UDF as follows:

customers       = LOAD 'hdfs://node1.c.foundation.local/data/customers.csv'
                    USING PigStorage(',')
                    AS (id:chararray, name:chararray, url:chararray);

fetchResults    = FOREACH customers
                    GENERATE id, name, url, fetchHttp(url);

ending up with the following data structure:
(id, name, url, {timestamp, content, fetchDuration})

I am currently not using GROUP yet, since I would like to find a solution 
without first having to group everything. My workaround would be to group on a 
field which I know is unique; that way I won't lose the structure of the 
relation.
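A sketch of that workaround in Pig Latin (same aliases as above; this assumes fetchHttp implements the Accumulator interface):

```
-- group on the unique id so the Accumulator path fires; since each
-- bag then holds exactly one tuple, FLATTEN restores the original rows
byId         = GROUP customers BY id;
fetchResults = FOREACH byId
                 GENERATE FLATTEN(customers), fetchHttp(customers.url);
```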

Thanks for the quick reply,

Daan

On 09 Nov 2011, at 15:12, Marek Miglinski wrote:

> Do you use PARALLEL in the GROUP?
> ________________________________________
> From: Daan Gerits [[email protected]]
> Sent: Wednesday, November 09, 2011 3:34 PM
> To: [email protected]
> Subject: Multithreaded UDF
> 
> Hello,
> 
> First of all, great job creating Pig, really a magnificent piece of software.
> 
> I do have a few questions about UDFs. I have a dataset with a list of URLs I 
> want to fetch. Since an EvalFunc can only process one tuple at a time and the 
> asynchronous abilities of UDFs are deprecated, I can only fetch one URL at a 
> time. The problem is that fetching a single URL takes a fair amount of time 
> (1 to 5 seconds, since there is a built-in delay), which really slows down 
> the processing. I already converted the UDF into an Accumulator, but that 
> only seems to fire after a GROUP BY. It would be nice to have some kind of 
> queue UDF which queues the tuples until a certain count is reached and 
> then flushes the queue. That way I could add tuples to an internal list and, 
> on flush, start multiple threads to work through the list of tuples.
> 
> This is a workaround, though; the best solution would be to reintroduce 
> asynchronous UDFs (in which case I could schedule the threads as the 
> tuples come in).
> 
> Any ideas on this? I already saw someone trying almost the same thing, but 
> that thread didn't reach a definite answer.
> 
> Another option is to increase the number of reducer slots on the cluster, 
> but I'm afraid that could overload the nodes during a heavy reduce phase.
> 
> Best Regards,
> 
> Daan
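
The queue-and-flush idea described in the quoted message can be sketched in plain Java. This is only a sketch of the batching pattern, not a real Pig UDF: fetch() is a hypothetical stand-in for the slow HTTP call, and add()/flush() play the roles that exec()/accumulate() and an end-of-batch hook would play in the actual UDF.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: queue incoming URLs until a batch is full, then fetch the
// whole batch concurrently, one thread per queued URL.
public class BatchFetcher {
    private final int batchSize;
    private final List<String> queue = new ArrayList<>();
    private final List<String> results = new ArrayList<>();

    public BatchFetcher(int batchSize) {
        this.batchSize = batchSize;
    }

    // Called once per incoming tuple, like exec()/accumulate() would be.
    public void add(String url) throws Exception {
        queue.add(url);
        if (queue.size() >= batchSize) {
            flush();
        }
    }

    // Fetch every queued URL on its own thread, collect the results in
    // input order, then clear the queue for the next batch.
    public void flush() throws Exception {
        if (queue.isEmpty()) {
            return;
        }
        ExecutorService pool = Executors.newFixedThreadPool(queue.size());
        List<Future<String>> futures = new ArrayList<>();
        for (String url : queue) {
            futures.add(pool.submit(() -> fetch(url)));
        }
        for (Future<String> f : futures) {
            results.add(f.get());
        }
        pool.shutdown();
        queue.clear();
    }

    public List<String> results() {
        return results;
    }

    // Hypothetical stand-in for the real HTTP fetch (1-5 s in the UDF).
    private static String fetch(String url) {
        return "content-of-" + url;
    }

    public static void main(String[] args) throws Exception {
        BatchFetcher fetcher = new BatchFetcher(3);
        for (String u : new String[] {"a", "b", "c", "d"}) {
            fetcher.add(u);
        }
        fetcher.flush(); // drain the final partial batch
        System.out.println(fetcher.results());
    }
}
```

With a batch size of 3, the first three URLs are fetched concurrently as one batch and the fourth on the final explicit flush, so the overall wall-clock time is roughly two fetch delays instead of four.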
