Rohini, it's fine if you just reply with the stacktrace here; I can add it to JIRA.

Thanks,
Prashant
On Thu, Mar 22, 2012 at 7:10 PM, Prashant Kommireddi <[email protected]> wrote:

Rohini,

Here is the JIRA: https://issues.apache.org/jira/browse/PIG-2610

Can you please post the stacktrace as a comment on it?

Thanks,
Prashant


On Thu, Mar 22, 2012 at 2:37 PM, Jonathan Coveney <[email protected]> wrote:

Rohini,

In the meantime, something like the following should work:

raw = LOAD 'input' USING MyCustomLoader();

searches = FOREACH raw GENERATE
    day, searchType,
    FLATTEN(impBag) AS (adType, clickCount);

searches_2 = FOREACH searches GENERATE
    *,
    (adType == 'type1' ? clickCount : 0) AS type1_clickCount,
    (adType == 'type2' ? clickCount : 0) AS type2_clickCount;

groupedSearches = GROUP searches_2 BY (day, searchType) PARALLEL 50;

counts = FOREACH groupedSearches GENERATE
    FLATTEN(group) AS (day, searchType),
    COUNT(searches_2) AS numSearches,
    SUM(searches_2.clickCount) AS clickCountPerSearchType,
    SUM(searches_2.type1_clickCount) AS type1ClickCount,
    SUM(searches_2.type2_clickCount) AS type2ClickCount;


On Thu, Mar 22, 2012, Rohini U <[email protected]> wrote:

Thanks Prashant. I am using Pig 0.9.1 and Hadoop 0.20.205.

Thanks,
Rohini


On Thu, Mar 22, 2012 at 1:27 PM, Prashant Kommireddi <[email protected]> wrote:

This makes more sense; the grouping and the filter are on different columns. I will open a JIRA soon.

What version of Pig and Hadoop are you using?

Thanks,
Prashant


On Thu, Mar 22, 2012 at 1:12 PM, Rohini U <[email protected]> wrote:

Hi Prashant,

Here is my script in full:

raw = LOAD 'input' USING MyCustomLoader();

searches = FOREACH raw GENERATE
    day, searchType,
    FLATTEN(impBag) AS (adType, clickCount);

groupedSearches = GROUP searches BY (day, searchType) PARALLEL 50;

counts = FOREACH groupedSearches {
    type1 = FILTER searches BY adType == 'type1';
    type2 = FILTER searches BY adType == 'type2';
    GENERATE
        FLATTEN(group) AS (day, searchType),
        COUNT(searches) AS numSearches,
        SUM(searches.clickCount) AS clickCountPerSearchType,
        SUM(type1.clickCount) AS type1ClickCount,
        SUM(type2.clickCount) AS type2ClickCount;
};

As you can see, I am counting searches by day and search type in clickCountPerSearchType, and for each of those I also need the counts broken down by ad type.

Thanks for your help!
Rohini


On Thu, Mar 22, 2012 at 12:44 PM, Prashant Kommireddi <[email protected]> wrote:

Hi Rohini,

From your query it looks like you are already grouping by TYPE, so I am not sure why you would want the SUM of, say, the "EMPLOYER" type in "LOCATION" and vice versa. Your output is already broken down by TYPE.

Thanks,
Prashant


On Thu, Mar 22, 2012 at 9:03 AM, Rohini U <[email protected]> wrote:

Thanks for the suggestion Prashant. However, that will not work in my case.

If I filter before the group and include the new field in the group as you suggested, I get the individual counts broken down by the selected-field criteria. However, I also want the totals without taking the selected fields into account. That is why I took the approach I described in my earlier emails.

Thanks,
Rohini


On Wed, Mar 21, 2012 at 5:02 PM, Prashant Kommireddi <[email protected]> wrote:

Please pull your FILTER out of the GROUP BY and do it earlier:
http://pig.apache.org/docs/r0.9.1/perf.html#filter

In this case, you could use a FILTER followed by a bincond to introduce a new field "employerOrLocation", then do a GROUP BY and include the new field in the GROUP BY clause.
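Roughly like this (a sketch only, untested, using the field names from your script further down; the "employerOrLocation" alias and output names are illustrative):

-- Filter early, outside the nested FOREACH, as the perf docs suggest.
filtered = FILTER rawItems BY (type == 'EMPLOYER') OR (type == 'LOCATION');

-- Use a bincond to derive the new grouping field.
tagged = FOREACH filtered GENERATE
    item1, item2, item3,
    ((type == 'EMPLOYER') ? 'EMPLOYER' : 'LOCATION') AS employerOrLocation,
    count;

-- Include the derived field in the GROUP BY key.
grouped2 = GROUP tagged BY (item1, item2, item3, employerOrLocation);
perTypeCounts = FOREACH grouped2 GENERATE
    FLATTEN(group) AS (item1, item2, item3, employerOrLocation),
    SUM(tagged.count) AS typeCount;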
Thanks,
Prashant


On Wed, Mar 21, 2012 at 4:45 PM, Rohini U <[email protected]> wrote:

My input data size is 9GB and I am using 20 machines.

My grouping criteria has two cases, so I want 1) counts by the criteria I have grouped on, and 2) counts of the two individual cases within each group.

So my script in detail is:

counts = FOREACH grouped {
    selectedFields1 = FILTER rawItems BY type == 'EMPLOYER';
    selectedFields2 = FILTER rawItems BY type == 'LOCATION';
    GENERATE
        FLATTEN(group) AS (item1, item2, item3, type),
        SUM(selectedFields1.count) AS selectFields1Count,
        SUM(selectedFields2.count) AS selectFields2Count,
        COUNT(rawItems) AS groupCriteriaCount;
};

Is there a way to do this?


On Wed, Mar 21, 2012 at 4:29 PM, Dmitriy Ryaboy <[email protected]> wrote:

You are not doing grouping followed by counting; you are doing grouping followed by filtering followed by counting. Try filtering before grouping.
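For the script below, that would look something like this (a sketch, assuming only the EMPLOYER rows are needed for this count):

-- Hoist the FILTER above the GROUP so the nested FOREACH only aggregates.
selectedFields = FILTER rawItems BY type == 'EMPLOYER';
groupedSelected = GROUP selectedFields BY (item1, item2, item3, type);
employerCounts = FOREACH groupedSelected GENERATE
    FLATTEN(group) AS (item1, item2, item3, type),
    SUM(selectedFields.count) AS count;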
D


On Wed, Mar 21, 2012 at 12:34 PM, Rohini U <[email protected]> wrote:

Hi,

I have a Pig script which does a simple GROUP followed by counting, and I get this error. My data is certainly not big enough to cause an out-of-memory error. Is there a chance this is because of a bug? Has anyone come across this kind of error before?

I am using Pig 0.9.1 with Hadoop 0.20.205.

My script:

rawItems = LOAD 'in' AS (item1, item2, item3, type, count);

grouped = GROUP rawItems BY (item1, item2, item3, type);

counts = FOREACH grouped {
    selectedFields = FILTER rawItems BY type == 'EMPLOYER';
    GENERATE
        FLATTEN(group) AS (item1, item2, item3, type),
        SUM(selectedFields.count) AS count;
};

Stack trace:

2012-03-21 19:19:59,346 FATAL org.apache.hadoop.mapred.Child (main): Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:387)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFilter.getNext(POFilter.java:95)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:406)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:570)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.PORelationToExprProject.getNext(PORelationToExprProject.java:107)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:570)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:248)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:316)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.processInput(POUserFunc.java:159)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:184)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:293)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:453)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:324)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.processInput(POUserFunc.java:159)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:184)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:281)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:324)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:332)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:284)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:459)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.processOnePackageOutput(PigGenericMapReduce.java:427)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:407)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:261)
    at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:662)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:425)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)

Thanks,
-Rohini

--
Regards
-Rohini
People of accomplishment rarely sat back & let things happen to them. They went out & happened to things - Leonardo Da Vinci
