[ https://issues.apache.org/jira/browse/PIG-807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-807:
-------------------------------

    Fix Version/s: 0.6.0
         Assignee: Ying He

> PERFORMANCE: Provide a way for UDFs to use read-once bags (backed by the Hadoop values iterator)
> ------------------------------------------------------------------------------------------------
>
>                 Key: PIG-807
>                 URL: https://issues.apache.org/jira/browse/PIG-807
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.2.1
>            Reporter: Pradeep Kamath
>            Assignee: Ying He
>             Fix For: 0.6.0
>
>
> Currently all bags resulting from a group or cogroup are materialized as
> bags containing all of the contents. The issue with this is that if a
> particular key has many corresponding values, all of these values get
> stuffed into a bag, which may run out of memory and hence spill, causing a
> slowdown in performance and sometimes memory exceptions.
>
> In many cases, the UDFs which use the bags coming out of a group or cogroup
> only need to iterate over the bag in a unidirectional, read-once manner.
> This can be implemented by having the bag implement its iterator by simply
> iterating over the underlying Hadoop iterator provided in the reduce. This
> kind of bag is also needed in http://issues.apache.org/jira/browse/PIG-802,
> so the code can be reused for this issue too.
>
> The other part of this issue is to have some way for the UDFs to
> communicate to Pig that any input bags they need are "read once" bags. This
> can be achieved by having an interface, say "UsesReadOnceBags", which
> serves as a tag to indicate the intent to Pig. Pig can then rewire its
> execution plan to use ReadOnceBags if feasible.
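
The two parts of the proposal in the description above can be sketched in plain Java roughly as follows. This is only an illustration under stated assumptions, not Pig code: "UsesReadOnceBags" is the interface name suggested in the issue, while ReadOnceBag<T>, CountValues, ReadOnceBagDemo and canUseReadOnceBags are hypothetical names made up for this sketch, and a plain java.util.Iterator stands in for the values iterator that Hadoop hands to the reduce. Pig's real DataBag and EvalFunc types are deliberately left out.

```java
import java.util.Arrays;
import java.util.Iterator;

/** Marker interface named in the issue; it has no methods and only signals intent. */
interface UsesReadOnceBags {
}

/**
 * Hypothetical read-once bag: a thin wrapper over an iterator such as the
 * values iterator available in the reduce. Nothing is copied or buffered.
 */
final class ReadOnceBag<T> implements Iterable<T> {
    private final Iterator<T> source;
    private boolean consumed = false;

    ReadOnceBag(Iterator<T> source) {
        this.source = source;
    }

    /** Hands out the underlying iterator exactly once; a second call fails fast. */
    @Override
    public Iterator<T> iterator() {
        if (consumed) {
            throw new IllegalStateException("read-once bag has already been iterated");
        }
        consumed = true;
        return source;
    }
}

/** Hypothetical UDF that only needs a single forward pass over its input bag. */
final class CountValues implements UsesReadOnceBags {
    long exec(Iterable<?> bag) {
        long n = 0;
        for (Object ignored : bag) {
            n++;
        }
        return n;
    }
}

final class ReadOnceBagDemo {
    /** Stand-in for the planner check: does this UDF opt in to read-once bags? */
    static boolean canUseReadOnceBags(Object udf) {
        return udf instanceof UsesReadOnceBags;
    }

    public static void main(String[] args) {
        CountValues udf = new CountValues();
        // Assumption: this list iterator plays the role of the reduce-side values iterator.
        Iterator<String> reduceValues = Arrays.asList("a", "b", "c").iterator();
        if (canUseReadOnceBags(udf)) {
            System.out.println(udf.exec(new ReadOnceBag<>(reduceValues)));
        }
    }
}
```

Failing on a second call to iterator() is a design choice of this sketch, not something the issue specifies; it makes an accidental second pass over the bag visible instead of silently yielding no values.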