Isn't the query parser getting confused because you've chosen the same column name mappings for the LHS and RHS of your join?

What happens if you change this:

AjD = JOIN ACCT BY (year,month), DIM BY (year,month) USING 'replicated';

to:

AjD = JOIN ACCT BY ($0,$1), DIM BY ($0,$1) USING 'replicated';
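
Something along these lines (just a sketch, untested -- the LOADs are the same as in your script, only the join keys change to positions):

-- Sketch only: same inputs as in your mail, join keyed by position instead of by name
DIM  = LOAD 'DIM'  USING PigStorage(',') AS (year:int, month:int, days:int);
ACCT = LOAD 'ACCT' USING PigStorage(',') AS (year:int, month:int, day:int, account:chararray, metric1:double, metric2:double);
-- $0 and $1 are the first two fields (year, month) of each relation
AjD  = JOIN ACCT BY ($0,$1), DIM BY ($0,$1) USING 'replicated';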

Norbert

On Fri, Jul 29, 2011 at 12:47 PM, rob parker <[email protected]> wrote:
> Trying to join two sets and generate a set from the join and I am getting an
> error.
>
> $ hadoop fs -cat DIM/*
> 2011,01,31
> 2011,02,28
> 2011,03,31
> 2011,04,30
> 2011,05,31
> 2011,06,30
> 2011,07,31
> 2011,08,31
> 2011,09,30
> 2011,10,31
> 2011,11,30
> 2011,12,31
>
> $ hadoop fs -cat ACCT/*
> 2011,7,26,key1,23.25,2470.0
> 2011,7,26,key2,10.416666666666668,232274.08333333334
> 2011,7,26,key3,82.83333333333333,541377.25
> 2011,7,26,key4,78.5,492823.33333333326
> 2011,7,26,key5,110.83333333333334,729811.9166666667
> 2011,7,26,key6,102.16666666666666,675941.25
> 2011,7,26,key7,118.91666666666666,770896.75
>
> grunt> DIM = LOAD 'DIM' USING PigStorage(',') AS (year:int, month:int,
> days:int);
> grunt> ACCT = LOAD 'ACCT' USING PigStorage(',') AS (year:int, month:int,
> day:int, account:chararray, metric1:double, metric2:double);
> grunt> AjD = JOIN ACCT BY (year,month), DIM BY (year,month) USING
> 'replicated';
> grunt> dump AjD;
> ...
> (2011,7,26,key1,23.25,2470.0,2011,7,31)
> (2011,7,26,key2,10.416666666666668,232274.08333333334,2011,7,31)
> (2011,7,26,key3,82.83333333333333,541377.25,2011,7,31)
> (2011,7,26,key4,78.5,492823.33333333326,2011,7,31)
> (2011,7,26,key5,110.83333333333334,729811.9166666667,2011,7,31)
> (2011,7,26,key6,102.16666666666666,675941.25,2011,7,31)
> (2011,7,26,key7,118.91666666666666,770896.75,2011,7,31)
> grunt> describe AjD;
> AjD: {ACCT::year: int,ACCT::month: int,ACCT::day: int,ACCT::account: chararray,
> ACCT::metric1: double,ACCT::metric2: double,DIM::year: int,DIM::month: int,
> DIM::days: int}
>
> grunt> FINAL = FOREACH AjD
> >> GENERATE ACCT.year, ACCT.month, ACCT.account, (ACCT.metric2 / DIM.days);
> grunt> dump FINAL;
> ...
> ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator
> for alias FINAL. Backend error : Scalar has more than one row in the output.
> 1st : (2011,7,26,key1,23.25,2470.0), 2nd
> :(2011,7,26,key2,10.416666666666668,232274.08333333334)
>
> However if I store it and reload it to shed the "join" schema:
>
> grunt> STORE AjD INTO 'AjD' using PigStorage(',');
> grunt> AjD2 = LOAD 'AjD' USING PigStorage(',') AS (year:int, month:int,
> day:int, account:chararray, metric1:double, metric2:double, year2:int,
> month2:int, days:int);
>
> grunt> FINAL = FOREACH AjD2
> >> GENERATE year, month, account, (metric2 / days);
> grunt> dump FINAL;
> ...
> (2011,7,key1,79.6774193548387)
> (2011,7,key2,7492.712365591398)
> (2011,7,key3,17463.782258064515)
> (2011,7,key4,15897.526881720427)
> (2011,7,key5,23542.319892473122)
> (2011,7,key6,21804.5564516129)
> (2011,7,key7,24867.637096774193)
>
> What am I missing to make this work without storing and reloading?
>
> Thanks,
> Rob
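
(For what it's worth: if I remember right, ACCT.metric2 inside the FOREACH is read as a scalar reference to the whole ACCT relation rather than to the joined column, which would explain the "more than one row" error. Going by the describe output quoted above, the joined fields are named with the :: prefix, so a FOREACH like the following might also be worth trying -- just a sketch, untested:)

-- Uses the disambiguated names from "describe AjD"; positions like $5 / $8 would be an alternative
FINAL = FOREACH AjD
        GENERATE ACCT::year, ACCT::month, ACCT::account, (ACCT::metric2 / DIM::days);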
