Hi All,
I am pretty new to Pig and am having some issues with dereferencing. My data, in 
simplified form, looks like this:

data = load 'visitevent' using PigStorage() AS (visit:tuple(visitorid, visitid, 
browser), events:bag{event:tuple(pagename, pagevar)});

cat visitevent   (note there is a tab between the visit and the events)
(vr1,vi1,ff)    {((pagea,eb1)),((pageb,eb3))}
(vr1,vi2,ff)    {((pageb,eb2))}
(vr2,vi3,ff)    {((pageb,eb4))}
(vr3,vi4,ie)    {((pagec,eb3)),((pagea,eb5))}


My task is the following:
1)  Generate count(visitid) and count(distinct visitorid) by browser
2)  Generate count(events), count(visitid) and count(distinct visitorid) by pagename


I am having trouble with the first task. The following, which flattens visit 
first, works:

data = load 'c:/shared/visitevent' using PigStorage() AS
    (visit:tuple(visitorid, visitid, browser),
     events:bag{event:tuple(pagename, pagevar)});
-- FLATTEN promotes the fields of visit to top-level columns of data2
data2 = foreach data generate FLATTEN(visit);
data3 = group data2 by browser;
dc = foreach data3 {
    d1 = data2.visitorid;
    d2 = distinct d1;
    generate group, COUNT(d2), COUNT(d1);
};
describe dc;
dump dc;


I don't understand why I need to flatten visit. Without flattening, nothing I 
try works, and I am not sure why:

data = load 'c:/shared/visitevent' using PigStorage() AS
    (visit:tuple(visitorid, visitid, browser),
     events:bag{event:tuple(pagename, pagevar)});
data2 = foreach data generate visit;
data3 = group data2 by browser;
-- describe data3 produces the following:
--   data3: {group: bytearray,data2: {visit: (visitorid: bytearray,visitid: bytearray,browser: bytearray)}}
-- Neither of the attempts below works; somehow the alias is not found. Why?
dc = foreach data3 {d1 = data2.visitorid; d2 = distinct d1; generate group, COUNT(d2), COUNT(d1);};
dc = foreach data3 {d1 = visit.visitorid; d2 = distinct d1; generate group, COUNT(d2), COUNT(d1);};
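
For comparison, here is a minimal, untested sketch of a non-flattened form. It 
assumes a Pig release that accepts a nested foreach inside the foreach block, 
and it groups the original data relation by visit.browser instead of going 
through data2. Neither assumption comes from the attempts above.

```pig
-- Sketch only: assumes the Pig release in use supports a nested foreach.
data = load 'c:/shared/visitevent' using PigStorage() AS
    (visit:tuple(visitorid, visitid, browser),
     events:bag{event:tuple(pagename, pagevar)});
-- Dereference the tuple field directly in the group key.
data3 = group data by visit.browser;
dc = foreach data3 {
    -- Inside the block, data is a bag of rows; reach into each row's visit tuple.
    d1 = foreach data generate visit.visitorid;
    d2 = distinct d1;
    generate group, COUNT(d2), COUNT(d1);
};
dump dc;
```

One possible reading of the describe output is that the bag's only column is 
visit, so a projection such as data2.visitorid has nothing to match and the 
tuple field has to be reached row by row.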

What am I doing wrong?  Since task #2 groups by pagename, which sits in a tuple 
inside a bag, do I have to flatten that one twice to get it working? Is there 
any documentation on dereferencing complex and nested structures?  Any help is 
appreciated.
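
A rough, untested sketch of what the flattening for task #2 could look like. It 
assumes the events bag loads with the declared schema, so that a single FLATTEN 
exposes pagename and pagevar, and it assumes count(visitid) means distinct 
visits per page. The aliases flat, bypage and pc are made up for illustration.

```pig
-- Sketch only: the aliases flat, bypage and pc are hypothetical.
data = load 'visitevent' using PigStorage() AS
    (visit:tuple(visitorid, visitid, browser),
     events:bag{event:tuple(pagename, pagevar)});
-- One output row per event, carrying the visit fields along.
flat = foreach data generate FLATTEN(visit), FLATTEN(events);
bypage = group flat by pagename;
pc = foreach bypage {
    vi  = flat.visitid;
    dvi = distinct vi;      -- assumption: count each visit once per page
    vr  = flat.visitorid;
    dvr = distinct vr;
    -- COUNT(flat) is the number of event rows for the page
    generate group, COUNT(flat), COUNT(dvi), COUNT(dvr);
};
dump pc;
```

Whether the doubly parenthesized rows in the sample file need a second FLATTEN 
is not something this sketch settles.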
        
Thanks 
Priyo


