Hi,
I’m trying to better understand how algebraic UDFs do or do not help with bag 
spilling.  I have an algebraic UDF and I can see the algebraic part being 
invoked.  However, I am getting bad performance and seeing lots of spilling.  I 
see the spilling both in heap dumps and in the final Pig counters, e.g.:
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 7
Total records proactively spilled: 258472

That 258K number is out of 400K original input records.  As I understand these 
numbers, 7 bags were spilled with a total of 258K tuples inside them.  So it 
seems Pig is not invoking the intermediate aggregation step, and is instead 
spilling large bags of un-aggregated singleton tuples to disk.

I know the stated purpose of Algebraic is to do map-side aggregation, to avoid 
the network cost of shuffling so many records.  But can it be made to invoke 
‘intermediate’ aggregation more proactively on the map side, so that bags never 
grow so large?  I see, for example, that Accumulator has 
‘pig.accumulative.batchsize’, but I haven’t found an equivalent knob for 
Algebraic.
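For anyone less familiar with the contract I’m describing: Pig’s Algebraic interface has getInitial(), getIntermed(), and getFinal() returning the class names of the three EvalFuncs, and the combiner is what applies the intermediate step map-side.  Below is a self-contained sketch (plain Java, no Pig dependency; class and method names are mine, not Pig’s API) of that three-stage dataflow for a SUM, just to show why frequent intermediate aggregation keeps memory bounded: each batch collapses to one partial instead of a bag of singletons.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch of the Algebraic dataflow (Initial -> Intermed -> Final)
// for a SUM, independent of the Pig runtime.  This does not compile against
// Pig; it only models why running the intermediate step per map-side batch
// bounds memory: downstream sees one partial per batch, not one tuple per
// input record.
public class AlgebraicSumSketch {
    // Initial: turn each input tuple into a partial value.
    static long initial(long value) {
        return value;
    }

    // Intermed: fold a batch of partials into a single partial.
    static long intermed(List<Long> partials) {
        return partials.stream().mapToLong(Long::longValue).sum();
    }

    // Final: combine the surviving partials on the reduce side.
    static long fin(List<Long> partials) {
        return intermed(partials);
    }

    public static void main(String[] args) {
        // Six input records arriving in two map-side batches.
        List<Long> batch1 = Arrays.asList(initial(1L), initial(2L), initial(3L));
        List<Long> batch2 = Arrays.asList(initial(4L), initial(5L), initial(6L));

        // With intermediate aggregation, only two partials reach the reducer.
        List<Long> partials = Arrays.asList(intermed(batch1), intermed(batch2));
        System.out.println(fin(partials)); // prints 21
    }
}
```

My spilling numbers suggest the opposite of this picture: the raw singletons are accumulating into one huge bag before any intermed() call collapses them.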

FYI, part of the reason for all the memory usage is that I am computing 
algebraics over tens of columns, and have a few such UDFs chained together.  
Pig manages to run them all in the same wave of maps, which is good in 
principle but does cause memory pressure.

Thanks!
Adam

