Thanks Gianmarco! I see why it makes sense now. I guess when I see multiple levels of nesting, I should flatten for ease of processing.
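For task #2, the same flattening idea could be applied to the events bag as well. This is a minimal, untested sketch that assumes the schema declared in the LOAD statement of the original message. The relation and alias names (flat, by_page, page_counts, n_events, n_visits, n_visitors) are made up for illustration. The sample rows show an extra level of parentheses around each event, so a further FLATTEN may be needed if the data really nests one tuple inside another.

```pig
-- Sketch only: relation and alias names below are hypothetical.
data = load 'visitevent' using PigStorage() AS (visit:tuple(visitorid, visitid, browser), events:bag{event:tuple(pagename, pagevar)});

-- FLATTEN(visit) promotes the tuple's fields to top-level columns;
-- FLATTEN(events) emits one output row per element of the bag.
flat = foreach data generate FLATTEN(visit), FLATTEN(events);
describe flat;

by_page = group flat by pagename;

page_counts = foreach by_page {
    vr  = flat.visitorid;
    dvr = distinct vr;
    vi  = flat.visitid;
    dvi = distinct vi;
    -- Assumption: count(visitid) is read here as distinct visits per page.
    generate group as pagename, COUNT(flat) as n_events, COUNT(dvi) as n_visits, COUNT(dvr) as n_visitors;
};
dump page_counts;
```

Running describe flat first should confirm whether the short names (pagename, visitorid, visitid) resolve unambiguously before grouping.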

-----Original Message-----
From: Gianmarco De Francisci Morales [mailto:[email protected]]
Sent: Monday, April 23, 2012 1:10 PM
To: [email protected]
Subject: Re: Problem with dereferencing and alias

Hi,
the fact is that visit is a nested tuple inside the tuples that make your original relation.
If you describe the data2 relation it should get clear:

WITH FLATTEN
grunt> describe data2
data2: {visit::visitorid: bytearray,visit::visitid: bytearray,visit::browser: bytearray}

WITHOUT FLATTEN
grunt> data2 = foreach data generate visit;
grunt> describe data2
data2: {visit: (visitorid: bytearray,visitid: bytearray,browser: bytearray)}

If you don't want to flatten (for whichever reason), you need to modify your script like this:

data3 = group data2 by visit.browser;

But then you have a double nesting which I find cumbersome to work with.

grunt> describe data3
data3: {group: bytearray,data2: {(visit: (visitorid: bytearray,visitid: bytearray,browser: bytearray))}}

Now you have data3 which is a bag with a nested bag data2 with a nested tuple which contains a 3 element tuple.
That's why flattening comes handy in this case.

I hope it helps.

Cheers,
--
Gianmarco

On Mon, Apr 23, 2012 at 21:05, Mustafi, Priyo <[email protected]> wrote:

> Hi All,
> I am pretty new to pig and am having some issues with dereferencing. My
> data in simplified form looks like below
>
> data = load 'visitevent' using PigStorage() AS (visit:tuple(visitorid,
> visitid, browser), events:bag{event:tuple(pagename, pagevar)});
>
> cat visitevent (note there is tab in between the visit and the events)
> (vr1,vi1,ff) {((pagea,eb1)),((pageb,eb3))}
> (vr1,vi2,ff) {((pageb,eb2))}
> (vr2,vi3,ff) {((pageb,eb4))}
> (vr3,vi4,ie) {((pagec,eb3)),((pagea,eb5))}
>
>
> My task is the following
> 1) Generate count(visitid) and count(distinct visitorid) by browser
> 2) Generate count(events), count(visitid) and count(distinct visitorid)
> by pagename
>
>
> I have issues with the first task. I tried the below after flattening
> visit and it worked.
>
> data = load 'c:/shared/visitevent' using PigStorage() AS
> (visit:tuple(visitorid, visitid, browser), events:bag{event:tuple(pagename,
> pagevar)});
> data2 = foreach data generate FLATTEN(visit);
> data3 = group data2 by browser;
> dc = foreach data3 {d1 = data2.visitorid; d2 = distinct d1; generate
> group, COUNT(d2), COUNT(d1);};
> describe dc;
> dump dc;
>
>
> I don't understand why I would need to flatten visit. I tried the below
> without flattening and whatever I try it doesn't work. Not sure why.
>
> data = load 'c:/shared/visitevent' using PigStorage() AS
> (visit:tuple(visitorid, visitid, browser), events:bag{event:tuple(pagename,
> pagevar)});
> data2 = foreach data generate visit;
> data3 = group data2 by browser;
> # describe data3 produces below
> # data3: {group: bytearray,data2: {visit: (visitorid:
> bytearray,visitid: bytearray,browser: bytearray)}}
> # none of the below work as somehow it doesn't find the alias. Why?
> dc = foreach data3 {d1 = data2.visitorid; d2 = distinct d1; generate
> group, COUNT(d2), COUNT(d1);};
> dc = foreach data3 {d1 = visit.visitorid; d2 = distinct d1; generate
> group, COUNT(d2), COUNT(d1);};
>
> What am I doing wrong? Since my task #2 is going to group by pagename
> which is in a bag->tuple, do I have to flatten that one twice to get this
> working? Are there any documentation on dereferencing complex and nested
> structures? Any help appreciated.
>
> Thanks
> Priyo
