Thanks, Dmitriy, for your thoughts and pointers.
I'm now re-implementing my tfidf test case using a flow
of tuples, rather than clinging to bags, and I'll construct
the Accumulator interface support.
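
For illustration, here is a minimal sketch of the accumulate-style pattern, where a UDF receives a bag in chunks and keeps only running state. It is written in Python as a stand-in and is not Pig's Java API: the class TermCountAccumulator, the helper batches, and the sample terms are all hypothetical, and the method names only loosely mirror the accumulate / getValue / cleanup steps of Pig's Accumulator interface.

```python
from collections import Counter


class TermCountAccumulator:
    """Hypothetical stand-in for an accumulator-style UDF (not Pig's API)."""

    def __init__(self):
        # Running state: one counter entry per distinct term seen so far.
        self.counts = Counter()

    def accumulate(self, batch):
        # Called once per chunk of tuples; the chunk itself is not retained.
        for (term,) in batch:
            self.counts[term] += 1

    def get_value(self):
        # Return the result once all chunks for the current key were seen.
        return dict(self.counts)

    def cleanup(self):
        # Reset state before the next key's chunks arrive.
        self.counts.clear()


def batches(tuples, size):
    """Yield successive chunks, mimicking a bag delivered piecemeal."""
    for i in range(0, len(tuples), size):
        yield tuples[i:i + size]


acc = TermCountAccumulator()
terms = [("pig",), ("bag",), ("pig",), ("map",), ("pig",)]
for chunk in batches(terms, 2):
    acc.accumulate(chunk)
result = acc.get_value()
acc.cleanup()
```

In this sketch the memory held between calls grows with the number of distinct terms rather than with the size of the bag, so it still does a single scan but never needs the whole bag on the heap at once.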

I'm just neurotic about communication costs,
and once a bunch of data is in my grasp on a node,
in my address space, on my heap, I want to do as
much as I can in one scan. But, I'll change my ways,
you'll see :-)

Happy New Year to you too, and to the rest of the list,
of course.

Andreas


On Sat, Jan 1, 2011 at 12:04 PM, Dmitriy Ryaboy <[email protected]> wrote:

> Andreas,
>
> A map usually implies random access (if you do not need random access to
> the keys, it is likely a different data structure would do). This also implies
> that an on-disk Map would incur extremely high IO cost. Bags are already
> spillable, and even though they are a whole lot more sequential in nature,
> hitting the point where they start spilling usually means increasing job
> run time by orders of magnitude (if they finish at all).  It is possible that
> you can get this to work (I see a cs.stanford in the cc list :)), but it
> seems like the effort would be better spent in making your algorithm not
> require having the whole map in memory, or reducing the size of the map. If
> the efficiency you refer to is efficiency of the job, I doubt you will get
> to it by means of a spillable map.
>
> I am not sure why you think passing in an entire bag is more efficient than
> the accumulator interface -- in most cases, using the accumulator
> implementation speeds up the job. There are some measurements at the very
> bottom of this page: http://wiki.apache.org/pig/PigAccumulatorSpec -- and I
> don't believe those bags were spilling to disk in either implementation,
> this was just processing time.
>
> -Dmitriy
> P.S. Happy New Year!
>
> On Sat, Jan 1, 2011 at 9:03 AM, Andreas Paepcke <[email protected]> wrote:
>
> > Newbie issue.
> >
> > I find myself wanting a spillable hashmap facility
> > within my UDFs. Maybe I'm still not thinking hadoopy
> > enough. But hashmaps are often convenient as
> > temporary tools when operating over bags that
> > are passed into a UDF.
> >
> > Yet if the bag sizes passed into the UDF are not
> > known to be bounded, heap exhaustion is a danger.
> > A spillable hashmap sounds like the most intuitive
> > solution. I've seen this topic popping up on the Web,
> > but I have not found either an implementation, or a
> > strong argument against such a facility in principle.
> >
> > I understand that one can often write algorithms
> > that simply stream tuples into a UDF, rather than
> > passing in entire bags. But for efficiency it seems like
> > bagging can be a good idea.
> >
> > Any pointers to an implementation, or to a counter-argument?
> >
> > Thanks,
> >
> > Andreas
> >
>
