It's tricky to propose a good way to do this in Pig, for a couple of reasons.

First of all, Pig does not guarantee any specific ordering of a Bag. You
can see how this is an issue, as it means that transposing A twice
might not yield A again.

Secondly, bags are the only spillable data structure. Tuples are not, so a
tuple has to fit entirely in memory. This means that if you build the matrix
out of tuples, you're going to have a hard limit on how big it can get.

Altogether this means that Bags aren't a great data structure to represent
matrices. You could do a Tuple of Tuples, but that will have serious memory
issues. There are ways to make a bag work, but it'd be tricky... I suppose
it depends on the problem you want to solve.
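
One way to make a bag work, as a rough sketch only: store the matrix as
one (row, col, value) record per cell, so the position travels with the value
instead of depending on bag order. Transposing is then just swapping the two
indices. The file name 'cells.tsv' and all field and alias names below are
assumptions for illustration, not anything from your script.

```pig
-- Hypothetical input 'cells.tsv': one matrix cell per line as row, col, value.
cells = LOAD 'cells.tsv' USING PigStorage('\t')
        AS (row:int, col:int, val:chararray);

-- Transpose by swapping the indices; no bag ordering is involved.
swapped = FOREACH cells GENERATE col AS t_row, row AS t_col, val;

-- Optionally collect each transposed row into a bag. Every element keeps its
-- column index, so the order in which the bag is iterated does not matter.
grouped = GROUP swapped BY t_row;
t_rows  = FOREACH grouped GENERATE group AS t_row,
                                   swapped.(t_col, val) AS entries;

DUMP t_rows;
```

Swapping the indices a second time gives back the original cells, and since
the per-row data lives in bags it can still spill. Whether this layout suits
you depends on what you need to do with the matrix afterwards.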

2012/1/19 David Langer <[email protected]>

>
> Greetings All!
>
> Hopefully this isn't too annoying of a newbie question.
>
> I'd like to transpose the columns in a relation into a relation consisting
> of rows of bags (i.e., something akin to matrix transposition). As an
> example:
>
> 1 A 1A
> 2 B 2B
> 3 C 3C
>
> Transposes to:
>
> {1, 2, 3}
> {A, B, C}
> {1A, 2B, 3C}
>
> The Pig code I came up with is along the lines of:
>
> Bag1 = FOREACH SomeData GENERATE Col1;
> Bag1 = GROUP Bag1 ALL;
>
> Bag2 = FOREACH SomeData GENERATE Col2;
> Bag2 = GROUP Bag2 ALL;
>
> Bag3 = FOREACH SomeData GENERATE Col3;
> Bag3 = GROUP Bag3 ALL;
>
> Bags = UNION Bag1, Bag2, Bag3;
>
> The above Pig code works, just wondering if this is the best way without
> using a UDF.
>
> Thanx,
>
> Dave
