Collect_set() is built into hive. If you want a version that does not de-duplicate look here. https://github.com/edwardcapriolo/hive-collect
Caution both of these functions can produce out of memory if the results are later then a mapper can store in memory. On Thu, Nov 1, 2012 at 2:27 PM, Ratner, Alan S (IS) <alan.rat...@ngc.com> wrote: > Sorry to ask what is probably a very naïve Hive question but here goes: > > > > I have a table as follows: > > Col1 Col2 > > K1 V1 > > K1 V1 > > K2 V1 > > K3 V1 > > K1 V2 > > K1 V3 > > K2 V2 > > > > Now I have managed to SELECT Col1,COUNT(DISTINCT Col2) FROM … BY COL1; to > obtain > > K1 3 > > K2 2 > > K3 1 > > > > But what I want is a concatenated list of all the distinct Col2 values for > each Col1 key. > > K1 V1 V2 V3 > > K2 V1 V2 > > K3 V1 > > > > This is something absolutely trivial in MR but I cannot seem to find > anything in Hive that will do this for me. Do I have to write a UDF to > accomplish this? > > > > > > Alan > >