Adding FLATTEN to your "grouped-by-multiple-cols" relation
(iq_per_region_per_gender) will make it much easier to join and visualize.
Once your join keys are flat string literals ("gender"), then it's just a
straightforward JOIN/FOREACH.
Here's a fragment that seems to do what you need:
A = LOAD 'test.txt' USING PigStorage(',') AS
(name:chararray,region:chararray,gender:chararray,iq:int);
per_gender = FOREACH (GROUP A by gender) GENERATE group, SUM(A.iq) as
gentot:long;
per_gender_region = FOREACH (GROUP A by (gender, region)) GENERATE
FLATTEN(group), SUM(A.iq) AS genregtot:long;
joined = JOIN per_gender by $0, per_gender_region by $0;
final = FOREACH joined GENERATE per_gender::group, region,
(float)genregtot/(float)gentot;
Which yields the following:
(Male,Here,0.89285713)
(Male,There,0.10714286)
(Female,Here,0.13793103)
(Female,There,0.86206895)
Norbert
On Wed, Oct 12, 2011 at 5:38 AM, Marco Cadetg <[email protected]> wrote:
> Yes but I'm still not able to compute the percentage. I've joined the bags
> as below.
>
> A = LOAD '/data/marco/foo.csv' USING PigStorage(',') AS (name:cha
> rarray, region:chararray, gender:chararray, iq:int);
> iq_per_region_per_gender = GROUP A BY (region, gender);
> total_iq_per_gender = GROUP A BY (gender);
>
> describe iq_per_region_per_gender
> iq_per_region_per_gender: {group: (region: chararray,gender: chararray),A:
> {(name: chararray,region: chararray,gender: chararray,iq: int)}}
>
> describe total_iq_per_gender;
> total_iq_per_gender: {group: chararray,A: {(name: chararray,region:
> chararray,gender: chararray,iq: int)}}
>
> total = JOIN iq_per_region_per_gender BY group.gender, total_iq_per_gender
> BY $0;
>
> describe total
> total: {iq_per_region_per_gender::group: (region: chararray,gender:
> chararray),iq_per_region_per_gender::A: {(name: chararray,region:
> chararray,gender: chararray,iq: int)},total_iq_per_gender::group:
> chararray,total_iq_per_gender::A: {(name: chararray,region:
> chararray,gender: chararray,iq: int)}}
>
> -- Now I would like to use the 'joined' data.
> -- providing me sth like this:
> -- Male, Here, 0.2
> -- Female, Here, 0,8
> -- Male, There, 1
> -- Female, There, 0
> -- But I'm not sure how my FOREACH GENERATE needs to look like.
>
>
> On Wed, Oct 12, 2011 at 10:34 AM, Dmitriy Ryaboy <[email protected]>
> wrote:
>
> > Sure, just join your total counts with your partials on gender.
> >
> > D
> >
> > On Tue, Oct 11, 2011 at 11:58 PM, Marco Cadetg <[email protected]> wrote:
> >
> > > D'oh I just see that unfortunately my example was a bit over
> simplified.
> > > The
> > > total needs to be grouped by another field like below.
> > >
> > > A = LOAD 'student' USING PigStorage() AS (name:chararray,
> > region:chararray,
> > > gender:charrarray, iq:int);
> > > DUMP A;
> > > (Eva, There, Female,500)
> > > (John, There, Male, 10)
> > > (Alf, There, Male, 10)
> > > (ET, There, Male, 10)
> > > (Mary, Here, Female, 80)
> > > (Bill, Here, Male, 100)
> > > (Joe, Here, Male, 150)
> > >
> > > total_iq_per_region = GROUP A BY (region, gender);
> > >
> > > total_iq_per_region_per_gender = FOREACH total_iq_per_region
> > > {
> > > GENERATE FLATTEN(group),
> > > SUM(A.iq) AS iq_per_region_per_gender;
> > > }
> > >
> > > total_iq_per_gender = GROUP A BY (gender);
> > >
> > > total_iq_per_gender = FOREACH A
> > > {
> > > GENERATE FLATTEN(group),
> > > SUM(A.iq) AS iq_per_gender;
> > > }
> > >
> > > Now I guess I could use JOIN to combine both bags(?) by gender but
> > somehow
> > > I
> > > don't get it.
> > >
> > > Thanks
> > > -Marco
> > >
> > > On Tue, Oct 11, 2011 at 6:02 PM, Marco Cadetg <[email protected]>
> wrote:
> > >
> > > > Thanks a lot, Shawn! Looks like I need to learn some basics ;)
> > > > -Marco
> > > >
> > > > On Tue, Oct 11, 2011 at 5:39 PM, Xiaomeng Wan <[email protected]>
> > > wrote:
> > > >
> > > >> total_iq = foreach (group A by all) generate SUM(A.iq) as total;
> > > >>
> > > >> total_iq_per_region = FOREACH total_iq_per_region
> > > >> {
> > > >> GENERATE FLATTEN(group),
> > > >> SUM(A.iq)/total_iq.total AS iq_per_region;
> > > >> }
> > > >>
> > > >> Shawn
> > > >>
> > > >>
> > > >> On Tue, Oct 11, 2011 at 9:20 AM, Marco Cadetg <[email protected]>
> > wrote:
> > > >> > Hi there,
> > > >> >
> > > >> > I would need to do something like this:
> > > >> >
> > > >> > A = LOAD 'student' USING PigStorage() AS (name:chararray,
> > > >> region:chararry,
> > > >> > iq:int);
> > > >> > DUMP A;
> > > >> > (John, There, 10)
> > > >> > (Alf, There, 10)
> > > >> > (ET, There, 10)
> > > >> > (Mary, Here, 80)
> > > >> > (Bill, Here, 100)
> > > >> > (Joe, Here, 150)
> > > >> >
> > > >> > total_iq_per_region = GROUP A BY (region);
> > > >> >
> > > >> > total_iq_per_region = FOREACH total_iq_per_region
> > > >> > {
> > > >> > GENERATE FLATTEN(group),
> > > >> > SUM(A.iq) AS iq_per_region;
> > > >> > }
> > > >> >
> > > >> > total_iq = FOREACH A
> > > >> > {
> > > >> > GENERATE SUM(iq) AS total_iq:
> > > >> > }
> > > >> >
> > > >> > Now I would like to retrieve the percentage of the region e.g.
> > > >> iq_per_reqion
> > > >> > / total_iq and store the result. How can I achieve that? I hope my
> > > >> example
> > > >> > is not too confusing.
> > > >> >
> > > >> > Cheers
> > > >> > -Marco
> > > >> >
> > > >>
> > > >
> > > >
> > >
> >
>