Alan, do you mean overriding finalize(), or does EvalFunc have a finish()
that I didn't realize existed?

Daan: I think it might be worth stepping back and explaining a bit more
about what you want to do. Right now, you want to fetch URLs and then
"do work." Is fetching URLs the only IO that the EvalFuncs have to do? How
long does fetching a URL typically take? What sort of work do you want to
do based on the URL? And lastly, what specifically is happening that makes
you think multithreading would be good? Is there blocking IO?

2011/11/9 Mridul Muralidharan <[email protected]>

>
> A simple solution would be to tag each tuple with a random number (such
> that each number has multiple URLs associated with it, but not too large
> a number of URLs), and simply group based on this field.
> In the reducer, you get a bag of URLs for each random number, at which
> point you can use multiple threads to fetch content and associate the
> responses with the appropriate input tuples.
>
>
> You only need to ensure that:
> a) Not too many tuples get associated with a single random number (to the
> extent that it causes spills to disk).
>
> b) Not too few tuples get associated with each random number you use,
> else it degenerates to the current case.
>
> c) You seed the random number generator sensibly, so that you don't hit
> problems with your tasks being non-repeatable.
>
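
A minimal sketch of the scheme above, in plain Python rather than as an actual Pig UDF. The names `assign_buckets`, `fetch_bucket`, and `fake_fetch` are hypothetical, and `fake_fetch` only stands in for the real, slow HTTP call. The grouping step would be done by Pig itself; `assign_buckets` just illustrates the seeded tagging, and `fetch_bucket` illustrates what the reducer might do with one bag of URLs.

```python
import random
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def assign_buckets(urls, num_buckets, seed=42):
    """Tag each URL with a pseudo-random bucket id drawn from a fixed seed."""
    rng = random.Random(seed)  # fixed seed: the same input order gives the same tags
    buckets = defaultdict(list)
    for url in urls:
        buckets[rng.randrange(num_buckets)].append(url)
    return buckets


def fake_fetch(url):
    """Hypothetical stand-in for the real, slow HTTP fetch."""
    return "content of " + url


def fetch_bucket(urls, fetch=fake_fetch, max_threads=8):
    """Fetch one bucket's URLs concurrently, pairing each response with its URL."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # pool.map() yields results in input order, so zip() pairs them correctly
        return list(zip(urls, pool.map(fetch, urls)))
```

The choice of `num_buckets` is where points a) and b) meet: more buckets means smaller bags (less risk of spilling), fewer buckets means more URLs per bag to spread across threads.
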
> Regards,
> Mridul
>
>
> On Wednesday 09 November 2011 07:04 PM, Daan Gerits wrote:
>
>> Hello,
>>
>> First of all, great job creating Pig, really a magnificent piece of
>> software.
>>
>> I do have a few questions about UDFs. I have a dataset with a list of
>> URLs I want to fetch. Since an EvalFunc can only process one tuple at a
>> time and the asynchronous abilities of the UDF are deprecated, I can only
>> fetch one URL at a time. The problem is that fetching this one URL takes a
>> reasonable amount of time (1 to 5 seconds, there is a delay built in), so
>> that really slows down the processing. I already converted the UDF into an
>> Accumulator, but that only seems to get fired after a group by. It would be
>> nice to have some kind of Queue UDF which queues the tuples until a
>> certain amount is reached and then flushes the queue. That way I can add
>> tuples to an internal list and, on flush, start multiple threads to go
>> through the list of tuples.
>>
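
For reference, the queue-and-flush idea described above could look roughly like this. It is only a sketch in plain Python, not a Pig UDF; `BatchingFetcher` and its parameters are hypothetical names, and the `fetch` callable stands in for the real URL fetch.

```python
from concurrent.futures import ThreadPoolExecutor


class BatchingFetcher:
    """Queue URLs and fetch them with several threads once a batch is full."""

    def __init__(self, fetch, batch_size=10, max_threads=4):
        self.fetch = fetch              # callable doing the slow per-URL work
        self.batch_size = batch_size    # flush threshold
        self.max_threads = max_threads
        self.pending = []

    def add(self, url):
        """Queue one URL; return (url, response) pairs if this filled a batch."""
        self.pending.append(url)
        if len(self.pending) >= self.batch_size:
            return self.flush()
        return []

    def flush(self):
        """Fetch everything queued so far and empty the queue."""
        batch, self.pending = self.pending, []
        if not batch:
            return []
        with ThreadPoolExecutor(max_workers=self.max_threads) as pool:
            # pool.map() keeps input order, so each response stays with its URL
            return list(zip(batch, pool.map(self.fetch, batch)))
```

Note that the last partial batch only comes out on an explicit `flush()`, so a UDF built this way would need some end-of-input hook to call it, and its outputs would no longer line up one-to-one with calls on input tuples.
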
>> This is a workaround though, since the best solution would be to
>> reintroduce the asynchronous UDFs (in which case I can schedule the
>> threads as the tuples come in).
>>
>> Any ideas on this? I already saw someone trying almost the same thing,
>> but they didn't get a definite answer.
>>
>> Another option is to increase the number of reducer slots on the
>> cluster, but I'm afraid that would mean we overload the nodes in case of a
>> heavy reduce phase.
>>
>> Best Regards,
>>
>> Daan
>>
>
>
