Newbie question: I find myself wanting a spillable hashmap facility within my UDFs. Maybe I'm still not thinking hadoopy enough, but hashmaps are often convenient as temporary tools when operating over the bags that are passed into a UDF.
Yet if the sizes of the bags passed into the UDF are not known to be bounded, heap exhaustion is a danger. A spillable hashmap sounds like the most intuitive solution. I've seen this topic pop up on the Web, but I have found neither an implementation nor a strong argument against such a facility in principle.

I understand that one can often write algorithms that simply stream tuples into a UDF, rather than passing in entire bags. But for efficiency, bagging seems like it can be a good idea. Any pointers to an implementation, or to an argument against the idea?

Thanks,
Andreas
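
P.S. To make the idea concrete, here is a rough plain-Java sketch of what I mean by "spillable". It is only an illustration, not an existing library: the class name SpillableMap, the entry-count threshold maxInMemory, and the use of Java serialization to local temp files are all assumptions made up for this example. Once the in-memory map grows past the threshold, it is written out as one "run" and cleared; lookups check memory first and then the runs, newest first.

```java
import java.io.*;
import java.util.*;

/** Hypothetical sketch: a map that spills to local temp files once it holds too many entries. */
class SpillableMap<K extends Serializable, V extends Serializable> {
    private final int maxInMemory;                        // assumed threshold, counted in entries
    private HashMap<K, V> memory = new HashMap<>();       // entries not yet spilled
    private final List<File> spills = new ArrayList<>();  // spilled runs, oldest first

    SpillableMap(int maxInMemory) {
        this.maxInMemory = maxInMemory;
    }

    void put(K key, V value) throws IOException {
        memory.put(key, value);
        if (memory.size() > maxInMemory) {
            spill();
        }
    }

    V get(K key) throws IOException {
        if (memory.containsKey(key)) {
            return memory.get(key);
        }
        // Scan runs newest first so that a later write shadows an earlier one.
        for (int i = spills.size() - 1; i >= 0; i--) {
            HashMap<K, V> run = load(spills.get(i));
            if (run.containsKey(key)) {
                return run.get(key);
            }
        }
        return null;
    }

    int spillCount() {
        return spills.size();
    }

    // Write the current in-memory map to a temp file and start a fresh one.
    private void spill() throws IOException {
        File f = File.createTempFile("spillable-map-", ".bin");
        f.deleteOnExit();
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(new FileOutputStream(f)))) {
            out.writeObject(memory);
        }
        spills.add(f);
        memory = new HashMap<>();
    }

    // Read one spilled run back; it holds at most maxInMemory + 1 entries.
    @SuppressWarnings("unchecked")
    private HashMap<K, V> load(File f) throws IOException {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(new FileInputStream(f)))) {
            return (HashMap<K, V>) in.readObject();
        } catch (ClassNotFoundException e) {
            throw new IOException(e);
        }
    }
}
```

A lookup that misses memory has to deserialize whole runs, so this keeps the heap bounded (roughly two runs' worth of entries at a time) at the cost of slow reads. A real facility would presumably need an on-disk index or sorted runs, which is part of why I am asking whether something like this already exists.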
