So, as explained earlier, the reason you are running out of memory is that all the records get loaded into memory whenever you do non-algebraic things to the results of a grouping.
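To make that concrete for readers of the archive, here is a minimal sketch of the filter-before-group rewrite discussed below, using the field names from the script quoted later in the thread (the rewrite itself is illustrative, not tested):

```pig
-- Filter first, then group: only EMPLOYER records reach the reducer,
-- and SUM over the grouped bag is algebraic, so the combiner can run
-- instead of materializing whole bags in reducer memory.
rawItems  = LOAD 'in' AS (item1, item2, item3, type:chararray, count:long);
employers = FILTER rawItems BY type == 'EMPLOYER';
grouped   = GROUP employers BY (item1, item2, item3, type);
counts    = FOREACH grouped GENERATE
                FLATTEN(group) AS (item1, item2, item3, type),
                SUM(employers.count) AS count;
```

With the FILTER nested inside the FOREACH, Pig must materialize the entire bag for each group before it can apply the filter; moved ahead of the GROUP, the same logic becomes a plain algebraic aggregate.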
Can you come up with ways to achieve what you need without having to have
the raw records at the reducer? One way has already been suggested; it's
reasonably straightforward to figure out the solution to your question
given the advice already provided.

D

On Thu, Mar 22, 2012 at 9:06 AM, Rohini U <[email protected]> wrote:
> Has a Jira been filed for this? I can send the example I am trying, if
> that helps.
>
> Thanks,
> Rohini
>
> On Wed, Mar 21, 2012 at 11:41 PM, Prashant Kommireddi
> <[email protected]> wrote:
> > Sure, I can do that. Isn't this something that should be done already?
> > Or does it not work if the filter is working on a field that is part
> > of the group?
> >
> > On Wed, Mar 21, 2012 at 11:02 PM, Dmitriy Ryaboy <[email protected]>
> > wrote:
> > > Prashant, mind filing a jira with this example? Technically, this is
> > > something we could do automatically.
> > >
> > > On Wed, Mar 21, 2012 at 5:02 PM, Prashant Kommireddi
> > > <[email protected]> wrote:
> > > > Please pull your FILTER out of the GROUP BY and do it earlier:
> > > > http://pig.apache.org/docs/r0.9.1/perf.html#filter
> > > >
> > > > In this case, you could use a FILTER followed by a bincond to
> > > > introduce a new field "employerOrLocation", then do a GROUP BY and
> > > > include the new field in the GROUP BY clause.
> > > >
> > > > Thanks,
> > > > Prashant
> > > >
> > > > On Wed, Mar 21, 2012 at 4:45 PM, Rohini U <[email protected]> wrote:
> > > > > My input data size is 9GB and I am using 20 machines.
> > > > >
> > > > > My grouping criteria has two cases, so I want 1) counts by the
> > > > > criteria I have grouped on and 2) counts of the two individual
> > > > > cases in each of my groups.
> > > > >
> > > > > So my script in detail is:
> > > > >
> > > > > counts = FOREACH grouped {
> > > > >     selectedFields1 = FILTER rawItems BY type == 'EMPLOYER';
> > > > >     selectedFields2 = FILTER rawItems BY type == 'LOCATION';
> > > > >     GENERATE
> > > > >         FLATTEN(group) as (item1, item2, item3, type),
> > > > >         SUM(selectedFields1.count) as selectFields1Count,
> > > > >         SUM(selectedFields2.count) as selectFields2Count,
> > > > >         COUNT(rawItems) as groupCriteriaCount;
> > > > > }
> > > > >
> > > > > Is there a way to do this?
> > > > >
> > > > > On Wed, Mar 21, 2012 at 4:29 PM, Dmitriy Ryaboy
> > > > > <[email protected]> wrote:
> > > > > > You are not doing grouping followed by counting. You are doing
> > > > > > grouping followed by filtering followed by counting.
> > > > > > Try filtering before grouping.
> > > > > >
> > > > > > D
> > > > > >
> > > > > > On Wed, Mar 21, 2012 at 12:34 PM, Rohini U <[email protected]>
> > > > > > wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > I have a pig script which does a simple GROUPing followed by
> > > > > > > counting, and I get this error. My data is certainly not that
> > > > > > > big for it to cause this out of memory error. Is there a
> > > > > > > chance that this is because of some bug? Did anyone come
> > > > > > > across this kind of error before?
> > > > > > >
> > > > > > > I am using pig 0.9.1 with hadoop 0.20.205
> > > > > > >
> > > > > > > My script:
> > > > > > >
> > > > > > > rawItems = LOAD 'in' as (item1, item2, item3, type, count);
> > > > > > >
> > > > > > > grouped = GROUP rawItems BY (item1, item2, item3, type);
> > > > > > >
> > > > > > > counts = FOREACH grouped {
> > > > > > >     selectedFields = FILTER rawItems BY type == 'EMPLOYER';
> > > > > > >     GENERATE
> > > > > > >         FLATTEN(group) as (item1, item2, item3, type),
> > > > > > >         SUM(selectedFields.count) as count;
> > > > > > > }
> > > > > > >
> > > > > > > Stack Trace:
> > > > > > >
> > > > > > > 2012-03-21 19:19:59,346 FATAL org.apache.hadoop.mapred.Child (main): Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded
> > > > > > >   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:387)
> > > > > > >   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
> > > > > > >   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFilter.getNext(POFilter.java:95)
> > > > > > >   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:406)
> > > > > > >   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:570)
> > > > > > >   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.PORelationToExprProject.getNext(PORelationToExprProject.java:107)
> > > > > > >   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:570)
> > > > > > >   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:248)
> > > > > > >   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:316)
> > > > > > >   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.processInput(POUserFunc.java:159)
> > > > > > >   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:184)
> > > > > > >   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:293)
> > > > > > >   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:453)
> > > > > > >   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:324)
> > > > > > >   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.processInput(POUserFunc.java:159)
> > > > > > >   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:184)
> > > > > > >   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:281)
> > > > > > >   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:324)
> > > > > > >   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:332)
> > > > > > >   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:284)
> > > > > > >   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:459)
> > > > > > >   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.processOnePackageOutput(PigGenericMapReduce.java:427)
> > > > > > >   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:407)
> > > > > > >   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:261)
> > > > > > >   at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
> > > > > > >   at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:662)
> > > > > > >   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:425)
> > > > > > >   at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
> > > > > > >   at java.security.AccessController.doPrivileged(Native Method)
> > > > > > >   at javax.security.auth.Subject.doAs(Subject.java:396)
> > > > > > >   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
> > > > > > >   at org.apache.hadoop.mapred.Child.main(Child.java:249)
> > > > > > >
> > > > > > > Thanks
> > > > > > > -Rohini
> > > > >
> > > > > --
> > > > > Regards
> > > > > -Rohini
> > > > >
> > > > > --
> > > > > **
> > > > > People of accomplishment rarely sat back & let things happen to
> > > > > them. They went out & happened to things - Leonardo Da Vinci
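Putting the suggestions in this thread together, a sketch of the bincond approach for the two-case script might look like the following (the derived columns and output names are illustrative, not from the thread, and the script is untested):

```pig
-- Keep only the two relevant types, turn each case into a numeric
-- column with binconds, then GROUP once. All three aggregates are
-- algebraic, so only partial sums reach the reducer -- no raw bags.
rawItems = LOAD 'in' AS (item1, item2, item3, type:chararray, count:long);
relevant = FILTER rawItems BY type == 'EMPLOYER' OR type == 'LOCATION';
flagged  = FOREACH relevant GENERATE
               item1, item2, item3,
               (type == 'EMPLOYER' ? count : 0L) AS employerCount,
               (type == 'LOCATION' ? count : 0L) AS locationCount;
grouped  = GROUP flagged BY (item1, item2, item3);
counts   = FOREACH grouped GENERATE
               FLATTEN(group) AS (item1, item2, item3),
               SUM(flagged.employerCount) AS selectFields1Count,
               SUM(flagged.locationCount) AS selectFields2Count,
               COUNT(flagged) AS groupCriteriaCount;
```

This variant drops `type` from the group key, since the two cases become columns rather than rows within each group; if the per-type breakdown is still wanted as rows, Prashant's suggestion of including a derived "employerOrLocation" field in the GROUP BY key works the same way.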
