Thanks Eli, that helps, and it is exactly what I was doing. I wrote a UDF that takes two parameters: the first is a bag of tuples containing the distinct values (ordered ascending), and the second is the original data set. It is working, but now I am trying to figure out how to return a schema in which the created columns are named after the distinct values.

City
A
B
C
A
C
C

I want to convert it into

A B C
1 0 0
0 1 0
0 0 1
1 0 0
0 0 1
0 0 1

How can the UDF return a schema containing the names of the cities? Is it possible? I should be able to generate A rather than generate $0.

Thanks,
Austin

On Tue, Feb 21, 2012 at 10:23 AM, Eli Finkelshteyn <[email protected]> wrote:

> Interesting problem. What I'm thinking is why not do two steps. First,
> read in the data, group on the column you care about. Then generate on it
> so you get just the distinct values for that column left. This would be
> something like:
>
> CITIES_GROUPED = GROUP INITIAL BY city;
> CITIES = FOREACH CITIES_GROUPED GENERATE group AS city;
>
>
> Once you have that, convert it to a tuple, and then just write a quick UDF
> that goes through the ORIGINAL data set and takes in the row value for the
> column you care about along with the distinct values tuple you just created
> as parameters and returns a tuple of 0s and one 1 where the one is in the
> position in the distinct values tuple that matches the row value for that
> row for the column you care about. You could write that UDF in Java,
> Python, or one of the other supported UDF languages, depending on your
> requirements.
>
> For inputting, you could do it either through a simple bash script (your
> use case is simple enough, I think), or you could go ahead and embed the
> Pig script in Java, Python, or one of the other languages that's supported
> for that functionality, so it's easy to expand if you later need to. I'm
> personally partial to Python and have had great results embedding in that.
> Just make sure you're on Pig 0.9.1+.
>
> Hopefully that helps,
> Eli
>
>
> On 2/20/12 6:56 AM, Prashant Kommireddi wrote:
>
>> This should work if the values are only A, B, C.
>>
>> M = load 'input' as (city:chararray);
>>
>> N = foreach M generate (city == 'A' ? 1 : 0) as A, (city == 'B' ? 1 : 0) as B,
>> (city == 'C' ? 1 : 0) as C;
>>
>> However, if city values vary it might be a good option to do it by
>> embedding Pig in Java.
>> http://pig.apache.org/docs/r0.9.1/cont.html#embed-java
>>
>> Thanks,
>> Prashant
>>
>> On Mon, Feb 20, 2012 at 3:16 AM, Austin Chungath <[email protected]>
>> wrote:
>>
>>> Consider this scenario:
>>>
>>> I have a column named City and it takes 3 possible values: A, B, C
>>>
>>> City
>>> A
>>> B
>>> C
>>> A
>>> C
>>> C
>>>
>>> I want to convert it into
>>>
>>> A B C
>>> 1 0 0
>>> 0 1 0
>>> 0 0 1
>>> 1 0 0
>>> 0 0 1
>>> 0 0 1
>>>
>>> I am trying to write a pig script that will take two parameters, one
>>> parameter is the data and then the column name, in this case 'City'. The
>>> script should then identify distinct values that it will take and then
>>> create that many columns and populate it with 1 or 0 depending on which
>>> one is true.
>>> Please let me know if you have got any ideas on how to approach this
>>> problem.
>>>
>>> Thanks,
>>> Austin
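
For reference, the idea discussed in this thread can be sketched in plain Python. This is only an illustration, not the UDF that was actually written: the helper names `one_hot` and `build_generate_statement` are hypothetical, and the second helper only builds the text of a Pig statement (as an embedding script might do once the distinct values are known) without calling any Pig API. It also assumes that every distinct value is usable as a Pig column alias.

```python
def one_hot(value, distinct_values):
    """Return a tuple of 0s with a single 1 at the position of `value`.

    `distinct_values` is the ordered list of distinct column values.
    """
    return tuple(1 if value == d else 0 for d in distinct_values)


def build_generate_statement(relation, column, distinct_values, out="N"):
    """Build the text of a Pig FOREACH statement with one named 0/1
    column per distinct value (hypothetical helper, string only).

    Assumes each distinct value is a valid Pig alias, e.g. A, B, C.
    """
    columns = ", ".join(
        "(%s == '%s' ? 1 : 0) AS %s" % (column, v, v) for v in distinct_values
    )
    return "%s = FOREACH %s GENERATE %s;" % (out, relation, columns)


# Example data from the thread.
cities = ["A", "B", "C"]
rows = ["A", "B", "C", "A", "C", "C"]

encoded = [one_hot(r, cities) for r in rows]
statement = build_generate_statement("M", "city", cities)
```

Here `encoded` holds one 0/1 tuple per input row, and `statement` is a generated line in the same shape as Prashant's hand-written one, so the column names come from the data instead of being typed in by hand.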
