Andreas, A map usually implies random access (if you do not need random access to the keys, it is likely a different data structure would do). This also implies that an on-disk Map would incur extremely high IO cost. Bags are already spillable, and even though they are a whole lot more sequential in nature, hitting the point where they start spilling usually means increasing job run time by orders of magnitude (if they finish at all). It is possible that you can get this to work (I see a cs.stanford in the cc list :)), but it seems like the effort would be better spent in making your algorithm not require having the whole map in memory, or reducing the size of the map. If the efficiency you refer to is efficiency of the job, I doubt you will get to it by means of a spillable map.
I am not sure why you think passing in an entire bag is more efficient than the accumulator interface -- in most cases, using the accumulator implementation speeds up the job. There are some measurements at the very bottom of this page: http://wiki.apache.org/pig/PigAccumulatorSpec -- and I don't believe those bags were spilling to disk in either implementation, this was just processing time. -Dmitriy P.S. Happy New Year! On Sat, Jan 1, 2011 at 9:03 AM, Andreas Paepcke <[email protected]> wrote: > Newbie issue. > > I find myself wanting a spillable hashmap facility > within my UDFs. Maybe I'm still not thinking hadoopy > enough. But hashmaps are often convenient as > temporary tools when operating over bags that > are passed into a UDF. > > Yet if the bag sizes passed into the UDF are not > known to be bounded, heap exhaustion is a danger. > A spillable hashmap sounds like the most intuitive > solution. I've seen this topic popping up on the Web, > but I have not found either an implementation, or a > strong argument against such a facility in principle. > > I understand that one can often write algorithms > that simply stream tuples into a UDF, rather than > passing in entire bags. But for efficiency it seems like > bagging can be a good idea. > > Any pointers to an implementation or counter indication > argument? > > Thanks, > > Andreas >
