I am looking for some input on how to vectorize my data.
> From: [email protected]
> To: [email protected]
> Subject: Mahout for clustering
> Date: Mon, 2 Dec 2013 16:22:03 -0800
>
> Hi All,
>
> We are using Apache Pig to build our data pipeline. Our data looks like
> this:
>
> userid, age, items {code 1, code 2, ...}, few other features...
>
> Each item has a unique alphanumeric code. I would like to use Mahout to
> cluster this data. Based on my reading so far, I see the following options:
>
> 1. Map each alphanumeric item code to a numeric code -- AAAAA1 -> 0,
>    AAAAA2 -> 1, AAAAA3 -> 2, etc. Then run the clustering algorithm on the
>    reformatted data and map the results back onto the real item codes.
> 2. Represent the item codes as a 1 x M matrix in which each column
>    represents an item (1 if a given user has viewed that item, 0
>    otherwise). This matrix will have millions of columns. So each user
>    will have an id, an age, and this matrix. Not sure if this will work...
>
> We also want to do frequent pattern mining etc. on the same data. Any
> thoughts on data representation and clustering would be great.
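
To make the two options concrete, here is a minimal sketch in plain Python (not Mahout's API) of how they can be combined: option 1's code-to-number mapping is used only as a column index, and option 2's 1 x M row is stored sparsely as the list of columns that are 1, so the millions of zero columns never have to be materialized. The sample data and the names build_item_index and to_sparse_rows are hypothetical, chosen just for illustration.

```python
def build_item_index(user_items):
    """Assign each distinct item code a column index, in first-seen order."""
    index = {}
    for items in user_items.values():
        for code in items:
            if code not in index:
                index[code] = len(index)
    return index


def to_sparse_rows(user_items, index):
    """Return {userid: sorted column indices of the items that user viewed}."""
    return {
        user: sorted({index[code] for code in items})
        for user, items in user_items.items()
    }


# Hypothetical sample data: userid -> item codes viewed by that user.
user_items = {
    "u1": ["AAAAA1", "AAAAA2"],
    "u2": ["AAAAA2", "AAAAA3"],
}

item_index = build_item_index(user_items)
rows = to_sparse_rows(user_items, item_index)

# Reverse mapping, used to translate results back to the real item codes.
reverse_index = {col: code for code, col in item_index.items()}
```

Note that the numeric codes only identify columns here. Feeding them to a distance-based clustering algorithm as ordinary numeric values would imply that item 0 is "closer" to item 1 than to item 2, which is probably not intended for arbitrary item codes.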
