Rohini, it's fine if you just reply with the stacktrace here. I can
add it to JIRA.

Thanks,
Prashant

On Thu, Mar 22, 2012 at 7:10 PM, Prashant Kommireddi <[email protected]>wrote:

> Rohini,
>
> Here is the JIRA. https://issues.apache.org/jira/browse/PIG-2610
>
> Can you please post the stacktrace as a comment to it?
>
> Thanks,
> Prashant
>
>
> On Thu, Mar 22, 2012 at 2:37 PM, Jonathan Coveney <[email protected]>wrote:
>
>> Rohini,
>>
>> In the meantime, something like the following should work:
>>
>> raw = LOAD 'input' USING MyCustomLoader();
>>
>> searches = FOREACH raw GENERATE
>>               day, searchType,
>>               FLATTEN(impBag) AS (adType, clickCount)
>>           ;
>>
>> searches_2 = FOREACH searches GENERATE *,
>>               (adType == 'type1' ? clickCount : 0) AS type1_clickCount,
>>               (adType == 'type2' ? clickCount : 0) AS type2_clickCount;
>>
>> groupedSearches = GROUP searches_2 BY (day, searchType) PARALLEL 50;
>> counts = FOREACH groupedSearches {
>>                GENERATE
>>                    FLATTEN(group) AS (day, searchType),
>>                    COUNT(searches_2) AS numSearches,
>>                    SUM(searches_2.clickCount) AS clickCountPerSearchType,
>>                    SUM(searches_2.type1_clickCount) AS type1ClickCount,
>>                    SUM(searches_2.type2_clickCount) AS type2ClickCount;
>>        };
>>
>> 2012/3/22 Rohini U <[email protected]>
>>
>> > Thanks, Prashant.
>> > I am using Pig 0.9.1 and Hadoop 0.20.205.
>> >
>> > Thanks,
>> > Rohini
>> >
>> > On Thu, Mar 22, 2012 at 1:27 PM, Prashant Kommireddi <
>> [email protected]
>> > >wrote:
>> >
>> > > This makes more sense; the grouping and filtering are on different
>> > > columns. I will open a JIRA soon.
>> > >
>> > > What version of Pig and Hadoop are you using?
>> > >
>> > > Thanks,
>> > > Prashant
>> > >
>> > > On Thu, Mar 22, 2012 at 1:12 PM, Rohini U <[email protected]> wrote:
>> > >
>> > > > Hi Prashant,
>> > > >
>> > > > Here is my script in full.
>> > > >
>> > > >
>> > > > raw = LOAD 'input' using MyCustomLoader();
>> > > >
>> > > > searches = FOREACH raw GENERATE
>> > > >                day, searchType,
>> > > >                FLATTEN(impBag) AS (adType, clickCount)
>> > > >            ;
>> > > >
>> > > > groupedSearches = GROUP searches BY (day, searchType) PARALLEL 50;
>> > > > counts = FOREACH groupedSearches{
>> > > >                type1 = FILTER searches BY adType == 'type1';
>> > > >                type2 = FILTER searches BY adType == 'type2';
>> > > >                GENERATE
>> > > >                    FLATTEN(group) AS (day, searchType),
>> > > >                    COUNT(searches) numSearches,
>> > > >                    SUM(clickCount) AS clickCountPerSearchType,
>> > > >                    SUM(type1.clickCount) AS type1ClickCount,
>> > > >                    SUM(type2.clickCount) AS type2ClickCount;
>> > > >        }
>> > > > ;
>> > > >
>> > > > As you can see above, I am computing counts by day and search type
>> > > > in clickCountPerSearchType, and for each of those I need the counts
>> > > > broken down by ad type.
>> > > >
>> > > > Thanks for your help!
>> > > > Thanks,
>> > > > Rohini
>> > > >
>> > > >
>> > > > On Thu, Mar 22, 2012 at 12:44 PM, Prashant Kommireddi
>> > > > <[email protected]>wrote:
>> > > >
>> > > > > Hi Rohini,
>> > > > >
>> > > > > From your query it looks like you are already grouping it by
>> TYPE, so
>> > > not
>> > > > > sure why you would want the SUM of, say "EMPLOYER" type in
>> "LOCATION"
>> > > and
>> > > > > vice-versa. Your output is already broken down by TYPE.
>> > > > >
>> > > > > Thanks,
>> > > > > Prashant
>> > > > >
>> > > > > On Thu, Mar 22, 2012 at 9:03 AM, Rohini U <[email protected]>
>> > wrote:
>> > > > >
>> > > > > > Thanks for the suggestion, Prashant. However, that will not
>> > > > > > work in my case.
>> > > > > >
>> > > > > > If I filter before the group and include the new field in the
>> > > > > > group as you suggested, I get the individual counts broken down
>> > > > > > by the select field criteria. However, I also want the totals
>> > > > > > without taking the select fields into account. That is why I
>> > > > > > took the approach I described in my earlier emails.
>> > > > > >
>> > > > > > Thanks
>> > > > > > Rohini
>> > > > > >
>> > > > > > On Wed, Mar 21, 2012 at 5:02 PM, Prashant Kommireddi <
>> > > > > [email protected]
>> > > > > > >wrote:
>> > > > > >
>> > > > > > > Please pull your FILTER out of GROUP BY and do it earlier
>> > > > > > > http://pig.apache.org/docs/r0.9.1/perf.html#filter
>> > > > > > >
>> > > > > > > In this case, you could use a FILTER followed by a bincond to
>> > > > > introduce a
>> > > > > > > new field "employerOrLocation", then do a group by and include
>> > the
>> > > > new
>> > > > > > > field in the GROUP BY clause.
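>> > > > > > >
A hypothetical sketch of the rewrite Prashant describes (not from the thread verbatim; it assumes the rawItems schema (item1, item2, item3, type, count) that appears later in the thread, and the alias names are illustrative):

```pig
-- Sketch only: filter first, tag each row via a bincond,
-- then include the new field in the GROUP BY key.
filtered = FILTER rawItems BY (type == 'EMPLOYER') OR (type == 'LOCATION');
tagged   = FOREACH filtered GENERATE item1, item2, item3,
               ((type == 'EMPLOYER') ? 'EMPLOYER' : 'LOCATION') AS employerOrLocation,
               count;
grouped  = GROUP tagged BY (item1, item2, item3, employerOrLocation);
counts   = FOREACH grouped GENERATE
               FLATTEN(group) AS (item1, item2, item3, employerOrLocation),
               SUM(tagged.count) AS typeCount;
```

This keeps the per-type breakdown in the group key itself, so no nested FILTER is needed inside the FOREACH.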
>> > > > > > >
>> > > > > > > Thanks,
>> > > > > > > Prashant
>> > > > > > >
>> > > > > > > On Wed, Mar 21, 2012 at 4:45 PM, Rohini U <[email protected]
>> >
>> > > > wrote:
>> > > > > > >
>> > > > > > > > My input data size is 9GB and I am using 20 machines.
>> > > > > > > >
>> > > > > > > > My grouping criteria has two cases, so I want 1) counts by
>> > > > > > > > the criteria I have grouped on and 2) counts of the two
>> > > > > > > > individual cases within each group.
>> > > > > > > > So my script in detail is:
>> > > > > > > >
>> > > > > > > > counts = FOREACH grouped {
>> > > > > > > >                    selectedFields1 = FILTER rawItems BY type == 'EMPLOYER';
>> > > > > > > >                    selectedFields2 = FILTER rawItems BY type == 'LOCATION';
>> > > > > > > >                    GENERATE
>> > > > > > > >                            FLATTEN(group) AS (item1, item2, item3, type),
>> > > > > > > >                            SUM(selectedFields1.count) AS selectFields1Count,
>> > > > > > > >                            SUM(selectedFields2.count) AS selectFields2Count,
>> > > > > > > >                            COUNT(rawItems) AS groupCriteriaCount;
>> > > > > > > >              };
>> > > > > > > >
>> > > > > > > > Is there a way to do this?
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > On Wed, Mar 21, 2012 at 4:29 PM, Dmitriy Ryaboy <
>> > > > [email protected]>
>> > > > > > > > wrote:
>> > > > > > > >
>> > > > > > > > > you are not doing grouping followed by counting. You are
>> > doing
>> > > > > > grouping
>> > > > > > > > > followed by filtering followed by counting.
>> > > > > > > > > Try filtering before grouping.
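>> > > > > > > > >
A rough sketch of that reordering (an assumption-laden outline built from the rawItems script quoted in this thread, not a tested rewrite of it):

```pig
-- Filter before grouping, so no nested FILTER is needed inside the FOREACH.
selected = FILTER rawItems BY type == 'EMPLOYER';
grouped  = GROUP selected BY (item1, item2, item3, type);
counts   = FOREACH grouped GENERATE
               FLATTEN(group) AS (item1, item2, item3, type),
               SUM(selected.count) AS count;
```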
>> > > > > > > > >
>> > > > > > > > > D
>> > > > > > > > >
>> > > > > > > > > On Wed, Mar 21, 2012 at 12:34 PM, Rohini U <
>> > [email protected]
>> > > >
>> > > > > > wrote:
>> > > > > > > > >
>> > > > > > > > > > Hi,
>> > > > > > > > > >
>> > > > > > > > > > I have a Pig script which does a simple GROUP followed
>> > > > > > > > > > by counting, and I get this error. My data is certainly
>> > > > > > > > > > not big enough to cause this out-of-memory error. Is
>> > > > > > > > > > there a chance this is because of some bug? Did anyone
>> > > > > > > > > > come across this kind of error before?
>> > > > > > > > > >
>> > > > > > > > > > I am using Pig 0.9.1 with Hadoop 0.20.205.
>> > > > > > > > > >
>> > > > > > > > > > My script:
>> > > > > > > > > > rawItems = LOAD 'in' AS (item1, item2, item3, type, count);
>> > > > > > > > > >
>> > > > > > > > > > grouped = GROUP rawItems BY (item1, item2, item3, type);
>> > > > > > > > > >
>> > > > > > > > > > counts = FOREACH grouped {
>> > > > > > > > > >                    selectedFields = FILTER rawItems BY type == 'EMPLOYER';
>> > > > > > > > > >                    GENERATE
>> > > > > > > > > >                            FLATTEN(group) AS (item1, item2, item3, type),
>> > > > > > > > > >                            SUM(selectedFields.count) AS count;
>> > > > > > > > > >              };
>> > > > > > > > > >
>> > > > > > > > > > Stack Trace:
>> > > > > > > > > >
>> > > > > > > > > > 2012-03-21 19:19:59,346 FATAL org.apache.hadoop.mapred.Child (main): Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded
>> > > > > > > > > >        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:387)
>> > > > > > > > > >        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
>> > > > > > > > > >        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFilter.getNext(POFilter.java:95)
>> > > > > > > > > >        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:406)
>> > > > > > > > > >        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:570)
>> > > > > > > > > >        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.PORelationToExprProject.getNext(PORelationToExprProject.java:107)
>> > > > > > > > > >        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:570)
>> > > > > > > > > >        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:248)
>> > > > > > > > > >        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:316)
>> > > > > > > > > >        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.processInput(POUserFunc.java:159)
>> > > > > > > > > >        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:184)
>> > > > > > > > > >        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:293)
>> > > > > > > > > >        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:453)
>> > > > > > > > > >        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:324)
>> > > > > > > > > >        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.processInput(POUserFunc.java:159)
>> > > > > > > > > >        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:184)
>> > > > > > > > > >        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:281)
>> > > > > > > > > >        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:324)
>> > > > > > > > > >        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:332)
>> > > > > > > > > >        at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:284)
>> > > > > > > > > >        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:459)
>> > > > > > > > > >        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.processOnePackageOutput(PigGenericMapReduce.java:427)
>> > > > > > > > > >        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:407)
>> > > > > > > > > >        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:261)
>> > > > > > > > > >        at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
>> > > > > > > > > >        at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:662)
>> > > > > > > > > >        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:425)
>> > > > > > > > > >        at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>> > > > > > > > > >        at java.security.AccessController.doPrivileged(Native Method)
>> > > > > > > > > >        at javax.security.auth.Subject.doAs(Subject.java:396)
>> > > > > > > > > >        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>> > > > > > > > > >        at org.apache.hadoop.mapred.Child.main(Child.java:249)
>> > > > > > > > > >
>> > > > > > > > > > Thanks
>> > > > > > > > > > -Rohini
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > --
>> > > > > > > > Regards
>> > > > > > > > -Rohini
>> > > > > > > >
>> > > > > > > > --
>> > > > > > > > **
>> > > > > > > > People of accomplishment rarely sat back & let things
>> happen to
>> > > > them.
>> > > > > > > They
>> > > > > > > > went out & happened to things - Leonardo Da Vinci
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>> >
>> >
>> > --
>> > Regards
>> > -Rohini
>> >
>>
>
>