I am currently working on a dataset of product views and sales covering about 
10^7 users and 6000 items for my master's thesis in CS. My goal is to build 
product clusters from it. As expected, the item (row) vectors are VERY sparse. 
My current approach is to implement PCA using the SVDSolver class, and I then 
plan to build the product clusters with Mahout's Dirichlet clustering 
implementation. Since I am not sure this will lead to the desired results at 
all, I am open to any kind of advice (further preprocessing, other clustering 
algorithms, …). 

Oh, and I forgot to mention how the views and sales are used to build the 
product vectors. So far I have implemented three variants: binary vectors, 
count vectors that sum up views and sales (e.g. 1 view = 1 count, 1 sale = 10 
counts), and ordinary vectors (view => 1, sale => 5).
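
To make the three encodings concrete, here is a minimal sketch in plain Python. 
It is not my actual Mahout code: the function name, the event format 
(user, item, kind) and the scheme names are made up for illustration, and for 
the ordinary vectors it assumes that a sale overrides a view, so each entry 
holds the highest level seen for that user.

```python
from collections import defaultdict

SALE = "sale"  # any other kind is treated as a view (illustrative assumption)


def build_item_vectors(events, scheme):
    """Build sparse item (row) vectors as {item: {user: value}}.

    events: iterable of (user_id, item_id, kind) tuples (hypothetical format).
    scheme: "binary", "count" or "ordinary".
    """
    vectors = defaultdict(dict)
    for user, item, kind in events:
        row = vectors[item]
        if scheme == "binary":
            # 1 if the user touched the item at all
            row[user] = 1
        elif scheme == "count":
            # accumulate: 1 view = 1 count, 1 sale = 10 counts
            row[user] = row.get(user, 0) + (10 if kind == SALE else 1)
        elif scheme == "ordinary":
            # assumption: keep the highest level seen (view => 1, sale => 5)
            row[user] = max(row.get(user, 0), 5 if kind == SALE else 1)
        else:
            raise ValueError("unknown scheme: %r" % (scheme,))
    return dict(vectors)


events = [
    ("u1", "A", "view"),
    ("u1", "A", "view"),
    ("u1", "A", "sale"),
    ("u2", "A", "view"),
    ("u2", "B", "sale"),
]

for scheme in ("binary", "count", "ordinary"):
    print(scheme, build_item_vectors(events, scheme))
```

Only the non-zero entries are stored, which matters given how sparse the item 
vectors are.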

Many thanks in advance
Dominik
