I expect you are talking about the 1-5 second delay I talked about. What I
actually meant was that the code within the exec function of the UDF is taking
1 to 5 seconds for each invocation. That's something I cannot change since the
fetch method is actually doing a lot more than only fetching something. I
cannot push the additional logic the fetching is invoking higher since that
would break the algorithm.
On 09 Nov 2011, at 16:05, Marek Miglinski wrote:
> Something is wrong with your calculations UDF, think of something, because I
> had experience when I needed to calculate efficiency of data sent/downloaded
> by user, the logic there was too complex and despite that the speed was ~
> 0.02s per user which had ~ 500 transactions each, so in overall ~ 0.00004s
> per tx.
>
> Example of the code:
> userGroup = GROUP recordTx BY user PARALLEL 100;
> userFlattened = FOREACH userGroup {
> generated = Merge(recordTx);
> GENERATE FLATTEN(generated);
> }
>
>
> Sincerely,
> Marek M.
> ________________________________________
> From: Daan Gerits [[email protected]]
> Sent: Wednesday, November 09, 2011 4:19 PM
> To: [email protected]
> Subject: Re: Multithreaded UDF
>
> Hi Marek,
>
> yes, I have:
>
> SET default_parallel 50;
>
> at the top of my script.
>
> The idea to use the udf is as follows:
>
> customers = LOAD 'hdfs://node1.c.foundation.local/data/customers.csv'
> USING PigStorage(',')
> AS (id:chararray, name:chararray, url:chararray);
>
> fetchResults = FOREACH customers
> GENERATE id, name, url, fetchHttp(url);
>
> ending up with the following data structure:
> (id, name, url, {timestamp, content, fetchDuration})
>
> I am currently not yet using the group since I would like to find a solution
> without first having to group everything. The workaround for me would be to
> group everything on a field which I know is unique, that way I won't loose
> the structure of the relation.
>
> Thanks for the quick reply,
>
> Daan
>
> On 09 Nov 2011, at 15:12, Marek Miglinski wrote:
>
>> Do you use parallels in the GROUP?
>> ________________________________________
>> From: Daan Gerits [[email protected]]
>> Sent: Wednesday, November 09, 2011 3:34 PM
>> To: [email protected]
>> Subject: Multithreaded UDF
>>
>> Hello,
>>
>> First of all, great job creating pig, really a magnificent piece of software.
>>
>> I do have a few questions about UDFs. I have a dataset with a list of url's
>> I want to fetch. Since an EvalFunc can only process one tuple at a time and
>> the asynchronous abilities of the UDF are deprecated, I can only fetch one
>> url at a time. The problem is that fetching this one url takes a reasonable
>> amount of time (1 to 5 seconds, there is a delay built in) so that really
>> slows down the processing. I already converted the UDF into an Accumulator
>> but that only seems to get fired after a group by. If would be nice to have
>> some kind of Queue UDF which will queue the tuples until a certain amount is
>> reached and than flushes the queue. That way I can add tuples to an internal
>> list and on flush start multiple threads to go through the list of Tuples.
>>
>> This is a workaround though, since the best solution would be to reintroduce
>> the asynchronous UDF's (in which case I can schedule the threads as the
>> tuples come in)
>>
>> Any idea's on this? I already saw someone trying almost the same thing, but
>> didn't get a definite answer from that one.
>>
>> An other option is to increase the number of reducer slots on the cluster,
>> but I'm afraid that would mean we overload the nodes in case of a heavy
>> reduce phase.
>>
>> Best Regards,
>>
>> Daan
>