Re: Pig script to convert Categorical variables

Austin Chungath Mon, 20 Feb 2012 22:27:49 -0800

Thanks Eli,
That helps, and it is exactly what I was doing. I wrote a UDF that takes two
parameters: the first is a bag of tuples containing the distinct values
(sorted ascending) and the second is the original data set. It is working,
but now I am trying to figure out how to return a schema whose column names
are the distinct values.


City
A
B
C
A
C
C

I want to convert it into

A    B    C
1    0    0
0    1    0
0    0    1
1    0    0
0    0    1
0    0    1

How can the UDF return a schema containing the names of the cities? Is it
possible?
I should be able to write GENERATE A rather than GENERATE $0.
Thanks,
Austin
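
A minimal sketch of the UDF logic described above, in plain Python and outside
Pig, for illustration only. The function names one_hot and schema_string are
hypothetical and are not part of Pig or of the UDF discussed in this thread.
It assumes the distinct values are already known and are valid Pig field
names (for example, they start with a letter). How the schema string is then
handed to Pig depends on the UDF language and is not shown here.

```python
def one_hot(value, distinct_values):
    """Return a tuple with 1 at the position of `value` and 0 elsewhere."""
    return tuple(1 if value == d else 0 for d in distinct_values)


def schema_string(distinct_values, type_name="int"):
    """Build a schema string such as 'A:int,B:int,C:int' from the values."""
    return ",".join("%s:%s" % (d, type_name) for d in distinct_values)


# The City column from the example in this message.
cities = ["A", "B", "C", "A", "C", "C"]

# Distinct values, sorted ascending, as in the bag passed to the UDF.
distinct = sorted(set(cities))

# One output row per input row.
rows = [one_hot(c, distinct) for c in cities]
```

The point of the sketch is that both the row values and the column names are
derived from the same ordered list of distinct values, so the positions in
each tuple line up with the names in the schema.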

On Tue, Feb 21, 2012 at 10:23 AM, Eli Finkelshteyn <[email protected]> wrote:

> Interesting problem. What I'm thinking is why not do two steps. First,
> read in the data, group on the column you care about. Then generate on it
> so you get just the distinct values for that column left. This would be
> something like:
>
> CITIES_GROUPED = GROUP INITIAL BY city;
> CITIES = FOREACH CITIES_GROUPED GENERATE group AS city;
>
>
> Once you have that, convert it to a tuple, and then just write a quick udf
> that goes through the ORIGINAL data set and takes in the row value for the
> column you care about along with the distinct values tuple you just created
> as parameters and returns a tuple of 0s and one 1 where the one is in the
> position in the distinct values tuple that matches the row value for that
> row for the column you care about. You could write that udf in Java,
> Python, or one of the other supported udf languages, depending on your
> requirements.
>
> For inputting, you could do it either through a simple bash script (your
> use case is simple enough, I think), or you could go ahead and embed the
> PIG script in Java, Python, or one of the other languages that's supported
> for that functionality, so it's easy to expand if you later need to. I'm
> personally partial to Python and have had great results embedding in that.
> Just make sure you're on Pig 0.9.1+.
>
> Hopefully that helps,
> Eli
>
>
> On 2/20/12 6:56 AM, Prashant Kommireddi wrote:
>
>> This should work if the values are only A,B,C.
>>
>> M = load 'input' as (city:chararray);
>>
>> N = foreach M generate (city == 'A' ? 1 : 0) as A, (city == 'B' ? 1 : 0) as B,
>> (city == 'C' ? 1 : 0) as C;
>>
>> However, if city values vary it might be a good option to do it by
>> embedding Pig in Java.
>> http://pig.apache.org/docs/r0.9.1/cont.html#embed-java
>>
>> Thanks,
>> Prashant
>>
>> On Mon, Feb 20, 2012 at 3:16 AM, Austin Chungath <[email protected]> wrote:
>>
>>> Consider this scenario:
>>>
>>> I have a column named City and it takes 3 possible values: A,B,C
>>>
>>> City
>>> A
>>> B
>>> C
>>> A
>>> C
>>> C
>>>
>>> I want to convert it into
>>>
>>> A             B            C
>>> 1              0            0
>>> 0              1            0
>>> 0              0            1
>>> 1              0            0
>>> 0              0            1
>>> 0              0            1
>>>
>>> I am trying to write a pig script that will take two parameters, one
>>> parameter is the data and then the column name, in this case 'City'. The
>>> script should then identify distinct values that it will take and then
>>> create that many columns and populate them with 1 or 0 depending on which
>>> one is true.
>>> Please let me know if you have got any ideas on how to approach this
>>> problem.
>>>
>>> Thanks,
>>> Austin
>>>
>>>
>
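
For the embedding route suggested in the thread, one possible approach is to
build the FOREACH statement as a string once the distinct values are known,
so that each output column gets its name from the data. The sketch below is
plain Python string handling only. The helper name build_one_hot_statement is
hypothetical, the aliases M and N follow Prashant's example, and it assumes
the values contain no quotes and are valid Pig field names. Passing the
resulting string to Pig from the host language is not shown.

```python
def build_one_hot_statement(out_alias, in_alias, column, distinct_values):
    """Return a Pig Latin FOREACH statement with one 0/1 column per value."""
    fields = ", ".join(
        "(%s == '%s' ? 1 : 0) AS %s" % (column, v, v) for v in distinct_values
    )
    return "%s = FOREACH %s GENERATE %s;" % (out_alias, in_alias, fields)


# Distinct values would normally come from a first Pig step (GROUP or DISTINCT).
statement = build_one_hot_statement("N", "M", "city", ["A", "B", "C"])
print(statement)
```

Because the column names are written into the script text itself, later
statements can refer to A, B, and C by name instead of by position.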
