I love the TupleFuture idea. I wonder how hard it would be to do if you
could defer evaluation until serialization (maybe it serializes instantly?)

2011/11/9 Dmitriy Ryaboy <[email protected]>

> Could extend Tuple and return a TupleFuture object.
> I am guessing it'll just get evaluated immediately by the next operator and
> not actually gain you anything.
> It'd be neat to be able to do that sort of thing though.
> D
>
> On Wed, Nov 9, 2011 at 12:22 PM, Mridul Muralidharan
> <[email protected]>wrote:
>
> >
> > A simple solution would be to tag each tuple with a random number (such
> > that each number has multiple url's associated with it - but not too
> large
> > a number of urls), and simply group based on this field.
> > In the reducer, you get a bag of url's for each random number : at which
> > point, you can use multiple threads to fetch content and associate their
> > responses with the appropriate input tuple.
> >
> >
> > You only need to ensure that :
> > a) Too many tuples dont get associated with a single random number (to
> the
> > extent that it causes spills to disk).
> >
> > b) Too few tuples dont get associated over all random numbers you use -
> > else it degenerates to current case.
> >
> > c) You seed the random number sensible, in order not to hit problems with
> > having your tasks being non-repeatable.
> >
> > Regards,
> > Mridul
> >
> >
> > On Wednesday 09 November 2011 07:04 PM, Daan Gerits wrote:
> >
> >> Hello,
> >>
> >> First of all, great job creating pig, really a magnificent piece of
> >> software.
> >>
> >> I do have a few questions about UDFs. I have a dataset with a list of
> >> url's I want to fetch. Since an EvalFunc can only process one tuple at a
> >> time and the asynchronous abilities of the UDF are deprecated, I can
> only
> >> fetch one url at a time. The problem is that fetching this one url
> takes a
> >> reasonable amount of time (1 to 5 seconds, there is a delay built in) so
> >> that really slows down the processing. I already converted the UDF into
> an
> >> Accumulator but that only seems to get fired after a group by. If would
> be
> >> nice to have some kind of Queue UDF which will queue the tuples until a
> >> certain amount is reached and than flushes the queue. That way I can add
> >> tuples to an internal list and on flush start multiple threads to go
> >> through the list of Tuples.
> >>
> >> This is a workaround though, since the best solution would be to
> >> reintroduce the asynchronous UDF's (in which case I can schedule the
> >> threads as the tuples come in)
> >>
> >> Any idea's on this? I already saw someone trying almost the same thing,
> >> but didn't get a definite answer from that one.
> >>
> >> An other option is to increase the number of reducer slots on the
> >> cluster, but I'm afraid that would mean we overload the nodes in case
> of a
> >> heavy reduce phase.
> >>
> >> Best Regards,
> >>
> >> Daan
> >>
> >
> >
>

Reply via email to