Thanks, Dmitriy, for your thoughts and pointers. I'm now re-implementing my tfidf test case as a flow of tuples, rather than clinging to bags, and I'll add support for the Accumulator interface.
I'm just neurotic about communication costs, and once a bunch of data is in my grasp on a node, in my address space, on my heap, I want to do as much as I can in one scan. But I'll change my ways, you'll see :-)

Happy New Year to you too, and to the rest of the list, of course.

Andreas

On Sat, Jan 1, 2011 at 12:04 PM, Dmitriy Ryaboy <[email protected]> wrote:

> Andreas,
>
> A map usually implies random access (if you do not need random access to
> the keys, it is likely a different data structure would do). This also
> implies that an on-disk Map would incur extremely high IO cost. Bags are
> already spillable, and even though they are a whole lot more sequential in
> nature, hitting the point where they start spilling usually means
> increasing job run time by orders of magnitude (if they finish at all). It
> is possible that you can get this to work (I see a cs.stanford in the cc
> list :)), but it seems like the effort would be better spent in making your
> algorithm not require having the whole map in memory, or reducing the size
> of the map. If the efficiency you refer to is efficiency of the job, I
> doubt you will get to it by means of a spillable map.
>
> I am not sure why you think passing in an entire bag is more efficient than
> the accumulator interface -- in most cases, using the accumulator
> implementation speeds up the job. There are some measurements at the very
> bottom of this page: http://wiki.apache.org/pig/PigAccumulatorSpec -- and I
> don't believe those bags were spilling to disk in either implementation,
> this was just processing time.
>
> -Dmitriy
> P.S. Happy New Year!
>
> On Sat, Jan 1, 2011 at 9:03 AM, Andreas Paepcke <[email protected]> wrote:
>
> > Newbie issue.
> >
> > I find myself wanting a spillable hashmap facility
> > within my UDFs. Maybe I'm still not thinking hadoopy
> > enough. But hashmaps are often convenient as
> > temporary tools when operating over bags that
> > are passed into a UDF.
> >
> > Yet if the bag sizes passed into the UDF are not
> > known to be bounded, heap exhaustion is a danger.
> > A spillable hashmap sounds like the most intuitive
> > solution. I've seen this topic popping up on the Web,
> > but I have not found either an implementation, or a
> > strong argument against such a facility in principle.
> >
> > I understand that one can often write algorithms
> > that simply stream tuples into a UDF, rather than
> > passing in entire bags. But for efficiency it seems like
> > bagging can be a good idea.
> >
> > Any pointers to an implementation or counter indication
> > argument?
> >
> > Thanks,
> >
> > Andreas
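
For readers following the thread, here is a rough, language-neutral sketch (in Python, not Pig's Java API) of the difference discussed above between handing a UDF a whole bag and feeding it tuples in batches. The class TermCountAccumulator and the helpers batches and count_whole_bag are hypothetical names invented for illustration; the three methods only imitate the accumulate / getValue / cleanup shape of Pig's Accumulator interface and are not its actual signatures. It assumes each tuple carries a single term.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List, Tuple

TermTuple = Tuple[str]


class TermCountAccumulator:
    """Hypothetical stand-in for an accumulator-style UDF (not Pig's API)."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def accumulate(self, batch: Iterable[TermTuple]) -> None:
        # Called once per batch of tuples belonging to the current key.
        for (term,) in batch:
            self._counts[term] += 1

    def get_value(self) -> Dict[str, int]:
        # Called after the last batch for the current key.
        return dict(self._counts)

    def cleanup(self) -> None:
        # Reset state before tuples for the next key arrive.
        self._counts.clear()


def batches(tuples: List[TermTuple], size: int) -> Iterator[List[TermTuple]]:
    # Simulates the framework handing over the input a few tuples at a time.
    for start in range(0, len(tuples), size):
        yield tuples[start:start + size]


def count_whole_bag(bag: List[TermTuple]) -> Dict[str, int]:
    # The "whole bag" style: every tuple must be available at once.
    return dict(Counter(term for (term,) in bag))


if __name__ == "__main__":
    bag = [("pig",), ("hadoop",), ("pig",), ("bag",), ("pig",)]
    acc = TermCountAccumulator()
    for batch in batches(bag, 2):
        acc.accumulate(batch)
    print(acc.get_value() == count_whole_bag(bag))  # prints True
    acc.cleanup()
```

In the batched version only one batch of tuples plus the running counts is held at a time. The counts map itself still grows with the number of distinct terms, so this does not remove the need to keep that map small.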
