That's great. Thanks so much!
-Marco

On Wed, Oct 12, 2011 at 3:36 PM, Norbert Burger <[email protected]>wrote:

> Adding FLATTEN to your "grouped-by-multiple-cols" relation
> (iq_per_region_per_gender) will make it much easier to join and visualize.
>  Once your join keys are flat string literals ("gender"), then it's just a
> straightforward JOIN/FOREACH.
>
> Here's a fragment that seems to do what you need:
>
> A = LOAD 'test.txt' USING PigStorage(',') AS
> (name:chararray,region:chararray,gender:chararray,iq:int);
> per_gender = FOREACH (GROUP A by gender) GENERATE group, SUM(A.iq) as
> gentot:long;
> per_gender_region = FOREACH (GROUP A by (gender, region)) GENERATE
> FLATTEN(group), SUM(A.iq) AS genregtot:long;
> joined = JOIN per_gender by $0, per_gender_region by $0;
> final = FOREACH joined GENERATE per_gender::group, region,
> (float)genregtot/(float)gentot;
>
> Which yields the following:
>
> (Male,Here,0.89285713)
> (Male,There,0.10714286)
> (Female,Here,0.13793103)
> (Female,There,0.86206895)
>
> Norbert
>
> On Wed, Oct 12, 2011 at 5:38 AM, Marco Cadetg <[email protected]> wrote:
>
> > Yes but I'm still not able to compute the percentage. I've joined the
> bags
> > as below.
> >
> > A = LOAD '/data/marco/foo.csv' USING PigStorage(',') AS (name:cha
> > rarray, region:chararray, gender:chararray, iq:int);
> > iq_per_region_per_gender = GROUP A BY (region, gender);
> > total_iq_per_gender = GROUP A BY (gender);
> >
> > describe iq_per_region_per_gender
> > iq_per_region_per_gender: {group: (region: chararray,gender:
> chararray),A:
> > {(name: chararray,region: chararray,gender: chararray,iq: int)}}
> >
> > describe total_iq_per_gender;
> > total_iq_per_gender: {group: chararray,A: {(name: chararray,region:
> > chararray,gender: chararray,iq: int)}}
> >
> > total = JOIN iq_per_region_per_gender BY group.gender,
> total_iq_per_gender
> > BY $0;
> >
> > describe total
> > total: {iq_per_region_per_gender::group: (region: chararray,gender:
> > chararray),iq_per_region_per_gender::A: {(name: chararray,region:
> > chararray,gender: chararray,iq: int)},total_iq_per_gender::group:
> > chararray,total_iq_per_gender::A: {(name: chararray,region:
> > chararray,gender: chararray,iq: int)}}
> >
> > -- Now I would like to use the 'joined' data.
> > -- providing me sth like this:
> > -- Male, Here, 0.2
> > -- Female, Here, 0,8
> > -- Male, There, 1
> > -- Female, There, 0
> > -- But I'm not sure how my FOREACH GENERATE needs to look like.
> >
> >
> > On Wed, Oct 12, 2011 at 10:34 AM, Dmitriy Ryaboy <[email protected]>
> > wrote:
> >
> > > Sure, just join your total counts with your partials on gender.
> > >
> > > D
> > >
> > > On Tue, Oct 11, 2011 at 11:58 PM, Marco Cadetg <[email protected]>
> wrote:
> > >
> > > > D'oh I just see that unfortunately my example was a bit over
> > simplified.
> > > > The
> > > > total needs to be grouped by another field like below.
> > > >
> > > > A = LOAD 'student' USING PigStorage() AS (name:chararray,
> > > region:chararray,
> > > > gender:charrarray, iq:int);
> > > > DUMP A;
> > > > (Eva, There, Female,500)
> > > > (John, There, Male, 10)
> > > > (Alf, There, Male, 10)
> > > > (ET, There, Male, 10)
> > > > (Mary, Here, Female, 80)
> > > > (Bill, Here, Male, 100)
> > > > (Joe, Here, Male, 150)
> > > >
> > > > total_iq_per_region = GROUP A BY (region, gender);
> > > >
> > > > total_iq_per_region_per_gender = FOREACH total_iq_per_region
> > > > {
> > > >  GENERATE FLATTEN(group),
> > > >  SUM(A.iq) AS iq_per_region_per_gender;
> > > > }
> > > >
> > > > total_iq_per_gender = GROUP A BY (gender);
> > > >
> > > > total_iq_per_gender = FOREACH A
> > > > {
> > > >  GENERATE FLATTEN(group),
> > > >  SUM(A.iq) AS iq_per_gender;
> > > > }
> > > >
> > > > Now I guess I could use JOIN to combine both bags(?) by gender but
> > > somehow
> > > > I
> > > > don't get it.
> > > >
> > > > Thanks
> > > > -Marco
> > > >
> > > > On Tue, Oct 11, 2011 at 6:02 PM, Marco Cadetg <[email protected]>
> > wrote:
> > > >
> > > > > Thanks a lot, Shawn! Looks like I need to learn some basics ;)
> > > > > -Marco
> > > > >
> > > > > On Tue, Oct 11, 2011 at 5:39 PM, Xiaomeng Wan <[email protected]>
> > > > wrote:
> > > > >
> > > > >> total_iq = foreach (group A by all) generate SUM(A.iq) as total;
> > > > >>
> > > > >> total_iq_per_region = FOREACH total_iq_per_region
> > > > >> {
> > > > >>  GENERATE FLATTEN(group),
> > > > >>  SUM(A.iq)/total_iq.total AS iq_per_region;
> > > > >> }
> > > > >>
> > > > >> Shawn
> > > > >>
> > > > >>
> > > > >> On Tue, Oct 11, 2011 at 9:20 AM, Marco Cadetg <[email protected]>
> > > wrote:
> > > > >> > Hi there,
> > > > >> >
> > > > >> > I would need to do something like this:
> > > > >> >
> > > > >> > A = LOAD 'student' USING PigStorage() AS (name:chararray,
> > > > >> region:chararry,
> > > > >> > iq:int);
> > > > >> > DUMP A;
> > > > >> > (John, There, 10)
> > > > >> > (Alf, There, 10)
> > > > >> > (ET, There, 10)
> > > > >> > (Mary, Here, 80)
> > > > >> > (Bill, Here, 100)
> > > > >> > (Joe, Here, 150)
> > > > >> >
> > > > >> > total_iq_per_region = GROUP A BY (region);
> > > > >> >
> > > > >> > total_iq_per_region = FOREACH total_iq_per_region
> > > > >> > {
> > > > >> >  GENERATE FLATTEN(group),
> > > > >> >  SUM(A.iq) AS iq_per_region;
> > > > >> > }
> > > > >> >
> > > > >> > total_iq = FOREACH A
> > > > >> > {
> > > > >> >  GENERATE SUM(iq) AS total_iq:
> > > > >> > }
> > > > >> >
> > > > >> > Now I would like to retrieve the percentage of the region e.g.
> > > > >> iq_per_reqion
> > > > >> > / total_iq and store the result. How can I achieve that? I hope
> my
> > > > >> example
> > > > >> > is not too confusing.
> > > > >> >
> > > > >> > Cheers
> > > > >> > -Marco
> > > > >> >
> > > > >>
> > > > >
> > > > >
> > > >
> > >
> >
>

Reply via email to