This is a tricky thing to do in Pig, for a couple of reasons. First of all, Pig does not guarantee any specific ordering within a Bag. That is a problem here, because it means that transposing A twice might not give you A back.
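
To make the ordering point concrete, here is a rough sketch of one way around it: keep an explicit row and column index on every cell, so the layout never depends on bag order. Everything specific in it is an assumption on my part, not something from your script: the file name 'matrix.tsv', the (i, j, v) layout, and the relation names are all made up for illustration, and I have not run it against your data.

```pig
-- Hypothetical input: one matrix cell per line, tab-separated, as
-- (row index, column index, value). File name and schema are assumptions.
cells = LOAD 'matrix.tsv' AS (i:int, j:int, v:chararray);

-- With explicit indices, transposing is just swapping the two indices.
transposed = FOREACH cells GENERATE j AS ti, i AS tj, v;

-- Optional: collect each transposed row into a bag for display.
-- The nested ORDER sorts the bag by column index, but later operators
-- are not required to preserve that order, so keep the indices around.
by_row = GROUP transposed BY ti;
rows = FOREACH by_row {
    sorted = ORDER transposed BY tj;
    GENERATE group AS ti, sorted.v AS vals;
};

DUMP rows;
```

Since the indices travel with each value, applying the same swap to `transposed` gives back the original cells, whatever order Pig happens to return them in.
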
Secondly, bags are the only spillable data structure in Pig. Tuples are not, so anything you store in a tuple has to fit in memory, which puts a hard limit on how big your matrix can get. Altogether, this means that Bags aren't a great data structure for representing matrices. You could use a Tuple of Tuples, but that will have serious memory issues. There are ways to make a bag work, but it'd be tricky... I suppose it depends on the problem you want to solve.

2012/1/19 David Langer <[email protected]>

>
> Greetings All!
>
> Hopefully this isn't too annoying of a newbie question.
>
> I'd like to transpose the columns in a relation into a relation consisting
> of rows of bags (i.e., something akin to matrix transposition). As an
> example:
>
> 1 A 1A
> 2 B 2B
> 3 C 3C
>
> Transposes to:
>
> {1, 2, 3}
> {A, B, C}
> {3, C, 3C}
>
> The Pig code I came up with is along the lines of:
>
> Bag1 = FOREACH SomeData GENERATE Col1;
> Bag1 = GROUP Bag1 ALL;
>
> Bag2 = FOREACH SomeData GENERATE Col2;
> Bag2 = GROUP Bag2 ALL;
>
> Bag3 = FOREACH SomeData GENERATE Col3;
> Bag3 = GROUP Bag3 ALL;
>
> Bags = UNION Bag1, Bag2, Bag3;
>
> The above Pig code works, just wondering if this is the best way without
> using a UDF.
>
> Thanx,
>
> Dave
