Thanks Prashant. I am using Pig 0.9.1 and Hadoop 0.20.205.

Thanks,
Rohini
On Thu, Mar 22, 2012 at 1:27 PM, Prashant Kommireddi <[email protected]> wrote:

> This makes more sense; the grouping and the filter are on different
> columns. I will open a JIRA soon.
>
> What version of Pig and Hadoop are you using?
>
> Thanks,
> Prashant
>
> On Thu, Mar 22, 2012 at 1:12 PM, Rohini U <[email protected]> wrote:
>
> > Hi Prashant,
> >
> > Here is my script in full:
> >
> > raw = LOAD 'input' USING MyCustomLoader();
> >
> > searches = FOREACH raw GENERATE
> >     day, searchType,
> >     FLATTEN(impBag) AS (adType, clickCount);
> >
> > groupedSearches = GROUP searches BY (day, searchType) PARALLEL 50;
> >
> > counts = FOREACH groupedSearches {
> >     type1 = FILTER searches BY adType == 'type1';
> >     type2 = FILTER searches BY adType == 'type2';
> >     GENERATE
> >         FLATTEN(group) AS (day, searchType),
> >         COUNT(searches) AS numSearches,
> >         SUM(searches.clickCount) AS clickCountPerSearchType,
> >         SUM(type1.clickCount) AS type1ClickCount,
> >         SUM(type2.clickCount) AS type2ClickCount;
> > };
> >
> > As you can see above, I am counting searches by day and search type in
> > clickCountPerSearchType, and for each of those I need the counts broken
> > down by ad type.
> >
> > Thanks for your help!
> > Rohini
> >
> > On Thu, Mar 22, 2012 at 12:44 PM, Prashant Kommireddi <[email protected]> wrote:
> >
> > > Hi Rohini,
> > >
> > > From your query it looks like you are already grouping by TYPE, so I am
> > > not sure why you would want the SUM of, say, the "EMPLOYER" type in
> > > "LOCATION" and vice versa. Your output is already broken down by TYPE.
> > >
> > > Thanks,
> > > Prashant
> > >
> > > On Thu, Mar 22, 2012 at 9:03 AM, Rohini U <[email protected]> wrote:
> > >
> > > > Thanks for the suggestion, Prashant. However, that will not work in
> > > > my case.
> > > >
> > > > If I filter before the group and include the new field in the group
> > > > as you suggested, I get the individual counts broken down by the
> > > > select-field criteria. However, I also want the totals without taking
> > > > the select fields into account. That is why I took the approach I
> > > > described in my earlier emails.
> > > >
> > > > Thanks,
> > > > Rohini
> > > >
> > > > On Wed, Mar 21, 2012 at 5:02 PM, Prashant Kommireddi <[email protected]> wrote:
> > > >
> > > > > Please pull your FILTER out of the GROUP BY and do it earlier:
> > > > > http://pig.apache.org/docs/r0.9.1/perf.html#filter
> > > > >
> > > > > In this case, you could use a FILTER followed by a bincond to
> > > > > introduce a new field "employerOrLocation", then do a GROUP BY and
> > > > > include the new field in the GROUP BY clause.
> > > > >
> > > > > Thanks,
> > > > > Prashant
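A sketch of how the bincond suggestion can also satisfy the totals
requirement raised in Rohini's reply above: rather than adding the new field
to the GROUP BY, project one conditional column per ad type before grouping
and SUM those columns. This rewrite is not from the thread; field names
follow Rohini's day/searchType script, clickCount is assumed to be numeric,
and 0L is an assumed default for non-matching rows.

    raw = LOAD 'input' USING MyCustomLoader();

    -- Derive one conditional column per ad type before grouping
    -- (bincond: condition ? value_if_true : value_if_false).
    flagged = FOREACH raw GENERATE
        day, searchType,
        FLATTEN(impBag) AS (adType, clickCount),
        (adType == 'type1' ? clickCount : 0L) AS type1Click,
        (adType == 'type2' ? clickCount : 0L) AS type2Click;

    groupedSearches = GROUP flagged BY (day, searchType) PARALLEL 50;

    -- No nested FILTER: every aggregate is a plain algebraic COUNT/SUM,
    -- so Pig can use the combiner instead of buffering each group's bag
    -- in the reducer.
    counts = FOREACH groupedSearches GENERATE
        FLATTEN(group) AS (day, searchType),
        COUNT(flagged) AS numSearches,
        SUM(flagged.clickCount) AS clickCountPerSearchType,
        SUM(flagged.type1Click) AS type1ClickCount,
        SUM(flagged.type2Click) AS type2ClickCount;

This keeps the overall totals intact because the grouping key is unchanged;
only the summed columns encode the per-type breakdown.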
> > > > > On Wed, Mar 21, 2012 at 4:45 PM, Rohini U <[email protected]> wrote:
> > > > >
> > > > > > My input data size is 9GB and I am using 20 machines.
> > > > > >
> > > > > > My grouping criteria has two cases, so I want 1) counts by the
> > > > > > criteria I have grouped on and 2) counts of the two individual
> > > > > > cases within each group.
> > > > > >
> > > > > > So my script in detail is:
> > > > > >
> > > > > > counts = FOREACH grouped {
> > > > > >     selectedFields1 = FILTER rawItems BY type == 'EMPLOYER';
> > > > > >     selectedFields2 = FILTER rawItems BY type == 'LOCATION';
> > > > > >     GENERATE
> > > > > >         FLATTEN(group) AS (item1, item2, item3, type),
> > > > > >         SUM(selectedFields1.count) AS selectFields1Count,
> > > > > >         SUM(selectedFields2.count) AS selectFields2Count,
> > > > > >         COUNT(rawItems) AS groupCriteriaCount;
> > > > > > };
> > > > > >
> > > > > > Is there a way to do this?
> > > > > >
> > > > > > On Wed, Mar 21, 2012 at 4:29 PM, Dmitriy Ryaboy <[email protected]> wrote:
> > > > > >
> > > > > > > You are not doing grouping followed by counting; you are doing
> > > > > > > grouping followed by filtering followed by counting.
> > > > > > > Try filtering before grouping.
> > > > > > >
> > > > > > > D
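Applied to the single-filter script in the original message (quoted below),
that advice amounts to moving the FILTER ahead of the GROUP, so records that
do not match never reach the shuffle at all. A minimal sketch, with the
schema taken from the LOAD line below and count assumed to be a long:

    rawItems = LOAD 'in' AS (item1, item2, item3, type, count:long);

    -- Filter first: non-EMPLOYER records are dropped map-side.
    employerItems = FILTER rawItems BY type == 'EMPLOYER';

    grouped = GROUP employerItems BY (item1, item2, item3, type);

    counts = FOREACH grouped GENERATE
        FLATTEN(group) AS (item1, item2, item3, type),
        SUM(employerItems.count) AS count;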
> > > > > > > On Wed, Mar 21, 2012 at 12:34 PM, Rohini U <[email protected]> wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > I have a Pig script which does a simple GROUPing followed by
> > > > > > > > counting, and I get this error. My data is certainly not big
> > > > > > > > enough to cause this out-of-memory error. Is there a chance
> > > > > > > > that this is because of some bug? Did anyone come across this
> > > > > > > > kind of error before?
> > > > > > > >
> > > > > > > > I am using Pig 0.9.1 with Hadoop 0.20.205.
> > > > > > > >
> > > > > > > > My script:
> > > > > > > >
> > > > > > > > rawItems = LOAD 'in' AS (item1, item2, item3, type, count);
> > > > > > > >
> > > > > > > > grouped = GROUP rawItems BY (item1, item2, item3, type);
> > > > > > > >
> > > > > > > > counts = FOREACH grouped {
> > > > > > > >     selectedFields = FILTER rawItems BY type == 'EMPLOYER';
> > > > > > > >     GENERATE
> > > > > > > >         FLATTEN(group) AS (item1, item2, item3, type),
> > > > > > > >         SUM(selectedFields.count) AS count;
> > > > > > > > };
> > > > > > > >
> > > > > > > > Stack trace:
> > > > > > > >
> > > > > > > > 2012-03-21 19:19:59,346 FATAL org.apache.hadoop.mapred.Child (main): Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded
> > > > > > > >   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:387)
> > > > > > > >   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
> > > > > > > >   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFilter.getNext(POFilter.java:95)
> > > > > > > >   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:406)
> > > > > > > >   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:570)
> > > > > > > >   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.PORelationToExprProject.getNext(PORelationToExprProject.java:107)
> > > > > > > >   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:570)
> > > > > > > >   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:248)
> > > > > > > >   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:316)
> > > > > > > >   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.processInput(POUserFunc.java:159)
> > > > > > > >   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:184)
> > > > > > > >   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:293)
> > > > > > > >   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:453)
> > > > > > > >   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:324)
> > > > > > > >   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.processInput(POUserFunc.java:159)
> > > > > > > >   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:184)
> > > > > > > >   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:281)
> > > > > > > >   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:324)
> > > > > > > >   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:332)
> > > > > > > >   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:284)
> > > > > > > >   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:459)
> > > > > > > >   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.processOnePackageOutput(PigGenericMapReduce.java:427)
> > > > > > > >   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:407)
> > > > > > > >   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:261)
> > > > > > > >   at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
> > > > > > > >   at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:662)
> > > > > > > >   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:425)
> > > > > > > >   at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
> > > > > > > >   at java.security.AccessController.doPrivileged(Native Method)
> > > > > > > >   at javax.security.auth.Subject.doAs(Subject.java:396)
> > > > > > > >   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
> > > > > > > >   at org.apache.hadoop.mapred.Child.main(Child.java:249)
> > > > > > > >
> > > > > > > > Thanks
> > > > > > > > -Rohini

--
Regards
-Rohini

**
People of accomplishment rarely sat back & let things happen to them. They
went out & happened to things - Leonardo Da Vinci
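Read bottom-up, the trace is consistent with the diagnosis in the thread:
POFilter.getNext and the POProject calls are running inside
POForEach.processPlan in the reduce pipeline, that is, while iterating each
group's bag, which is exactly the work the nested FILTER adds. Restructuring
the script as discussed above removes that per-group bag work. If more
headroom is still needed, settings along these lines are sometimes used; the
values here are illustrative only, and both properties should be verified
against the versions in use (Pig 0.9.1 / Hadoop 0.20.205):

    -- Illustrative only: let Pig spill cached bags to disk sooner, and
    -- give each task JVM a larger heap (Hadoop 0.20 property name).
    -- These can also be passed on the command line, e.g.
    --   pig -Dpig.cachedbag.memusage=0.1 script.pig
    SET pig.cachedbag.memusage 0.1;
    SET mapred.child.java.opts '-Xmx1024m';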
