I am looking for some input on how to vectorize my data. 

> From: [email protected]
> To: [email protected]
> Subject: Mahout for clustering
> Date: Mon, 2 Dec 2013 16:22:03 -0800
> 
> Hi All,
>
> We are using Apache Pig for building our data pipeline. We have data in the
> following format:
>
> userid, age, items {code 1, code 2, ….}, few other features...
>
> Each item has a unique alphanumeric code. I would like to use Mahout for
> clustering it. Based on my reading so far, I see the following options:
>
> 1. Map each alphanumeric item code to a numeric code -- AAAAA1 -> 0,
>    AAAAA2 -> 1, AAAAA3 -> 2, etc. Then run the clustering algorithm on the
>    reformatted data and map the results back onto the real item codes.
> 2. Represent the item codes as a 1 x M matrix in which each column
>    represents an item (1 if a given user has viewed that item, 0
>    otherwise). This matrix will have millions of columns, so each user
>    will have an id, an age, and this matrix. Not sure if this will work…
>
> We also want to do frequent pattern mining etc. on the same data. Any
> thoughts on data representation and clustering would be great.
> 
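
To make the two options in the quoted message concrete, here is a minimal sketch in plain Python, not Mahout or Pig code. The sample users, ages, and function names are made up for illustration. It builds the code-to-integer dictionary from option 1, then stores each user's 1 x M row from option 2 sparsely, keeping only the column indices that are 1, so the millions of zero columns are never materialized.

```python
from typing import Dict, List, Tuple

# Hypothetical sample records: (userid, age, item codes viewed).
User = Tuple[str, int, List[str]]


def build_item_index(users: List[User]) -> Dict[str, int]:
    """Option 1: assign each distinct item code the next free integer id."""
    index: Dict[str, int] = {}
    for _userid, _age, items in users:
        for code in items:
            if code not in index:
                index[code] = len(index)
    return index


def to_sparse_row(items: List[str], index: Dict[str, int]) -> List[int]:
    """Option 2, stored sparsely: sorted column indices whose value is 1."""
    return sorted({index[code] for code in items})


users: List[User] = [
    ("u1", 34, ["AAAAA1", "AAAAA2"]),
    ("u2", 27, ["AAAAA2", "AAAAA3"]),
]

index = build_item_index(users)

# One record per user: age plus the sparse binary item row.
rows = {uid: (age, to_sparse_row(items, index)) for uid, age, items in users}

# Reverse dictionary for mapping results back onto the real item codes.
reverse = {col: code for code, col in index.items()}
```

Seen this way the two options are not really alternatives: the dictionary from option 1 supplies the column numbers for the matrix in option 2. Mahout's sparse vector types (for example RandomAccessSparseVector) hold the same index/value information, and the per-user item lists are already in the transaction form that frequent pattern mining expects.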
