Newbie issue.

I find myself wanting a spillable hashmap facility
within my UDFs. Maybe I'm still not thinking hadoopy
enough. But hashmaps are often convenient as
temporary tools when operating over bags that
are passed into a UDF.

Yet if the bag sizes passed into the UDF are not
known to be bounded, heap exhaustion is a danger.
A spillable hashmap sounds like the most intuitive
solution. I've seen this topic pop up on the Web,
but I have found neither an implementation nor a
strong argument against such a facility in principle.
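
For concreteness, here is a rough sketch of the kind of
facility meant, reduced to the simplest case of counting
keys. Everything in it is an assumption made up for
illustration: the class name SpillableCounter, the
entry-count threshold, and the temp-file format. It is
not an existing Pig or Hadoop class. It keeps a bounded
number of distinct keys on the heap, writes them out as a
sorted run when the bound is exceeded, and merges the
runs at the end.

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;
import java.util.function.BiConsumer;

// Hypothetical sketch, not a Pig or Hadoop class.
public class SpillableCounter implements Closeable {
    private final int maxInMemory;                      // distinct keys kept on the heap
    private final TreeMap<String, Long> memory = new TreeMap<>();
    private final List<Path> runs = new ArrayList<>();  // sorted spill files

    public SpillableCounter(int maxInMemory) { this.maxInMemory = maxInMemory; }

    public void increment(String key) {
        memory.merge(key, 1L, Long::sum);
        if (memory.size() > maxInMemory) spill();
    }

    // Write the in-memory entries, already in key order, to a new temp file.
    private void spill() {
        try {
            Path run = Files.createTempFile("spill", ".run");
            runs.add(run);
            try (DataOutputStream out = new DataOutputStream(
                    new BufferedOutputStream(Files.newOutputStream(run)))) {
                for (Map.Entry<String, Long> e : memory.entrySet()) {
                    out.writeUTF(e.getKey());   // writeUTF limits a key to 65535 encoded bytes
                    out.writeLong(e.getValue());
                }
            }
            memory.clear();
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }

    // Visit every key once, in key order, with its total count.
    public void forEachTotal(BiConsumer<String, Long> consumer) {
        if (runs.isEmpty()) { memory.forEach(consumer); return; }
        if (!memory.isEmpty()) spill();         // so that every source is a sorted run
        try {
            PriorityQueue<Cursor> heap =
                    new PriorityQueue<>(Comparator.comparing((Cursor c) -> c.key));
            for (Path run : runs) {
                Cursor c = new Cursor(run);
                if (c.advance()) heap.add(c);
            }
            while (!heap.isEmpty()) {
                String key = heap.peek().key;
                long total = 0;
                // Drain every run whose current entry carries this key.
                while (!heap.isEmpty() && heap.peek().key.equals(key)) {
                    Cursor c = heap.poll();
                    total += c.count;
                    if (c.advance()) heap.add(c);
                }
                consumer.accept(key, total);
            }
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }

    // Delete the spill files.
    @Override
    public void close() {
        try {
            for (Path run : runs) Files.deleteIfExists(run);
            runs.clear();
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }

    // Sequential reader over one sorted run.
    private static final class Cursor {
        final DataInputStream in;
        String key;
        long count;

        Cursor(Path run) throws IOException {
            in = new DataInputStream(new BufferedInputStream(Files.newInputStream(run)));
        }

        // Load the next entry; close the stream and return false at end of file.
        boolean advance() throws IOException {
            try {
                key = in.readUTF();
                count = in.readLong();
                return true;
            } catch (EOFException e) {
                in.close();
                return false;
            }
        }
    }
}
```

Note that this only covers accumulate-then-scan use.
Random get() calls on spilled keys would need an on-disk
index, and a real version would presumably track bytes
of heap used rather than the number of entries.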

I understand that one can often write algorithms
that simply stream tuples into a UDF, rather than
passing in entire bags. But for efficiency it seems like
bagging can be a good idea.

Any pointers to an implementation, or to an argument
against the idea?

Thanks,

Andreas
