I am currently working on a dataset containing product views and sales for about 10^7 users and 6000 items for my master's thesis in CS. My goal is to build product clusters from this data. As expected, the item (row) vectors are VERY sparse. My current approach is to implement PCA using the SVDSolver class, and I am then planning to build the product clusters using Mahout's Dirichlet clustering implementation. Since I am not sure whether this will lead to the desired results at all, I am open to any kind of advice (further preprocessing, other clustering algorithms, …).
Oh, and I forgot to mention how the views and sales are used to build the product vectors. So far I have implemented binary vectors, vectors counting the number of views and sales (e.g. 1 view = 1 count, 1 sale = 10 counts), and ordinary vectors (view => 1, sale => 5).

Many thanks in advance,
Dominik
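
To make the three weighting schemes concrete, here is a minimal sketch in plain Python (not the actual Mahout code). The event log, the function name `build_item_vectors` and the scheme names are made up for illustration. It also assumes that the "view => 1, sale => 5" scheme keeps the highest level a user reached for an item rather than summing events; that reading is an assumption.

```python
from collections import defaultdict

# Hypothetical event log: (user_id, item_id, event_type). All data is made up.
EVENTS = [
    ("u1", "itemA", "view"),
    ("u1", "itemA", "view"),
    ("u1", "itemA", "sale"),
    ("u2", "itemA", "view"),
    ("u2", "itemB", "view"),
    ("u3", "itemB", "sale"),
]

# Weights taken from the description: 1 view = 1 count, 1 sale = 10 counts.
COUNT_WEIGHTS = {"view": 1, "sale": 10}
# Levels taken from the description: view => 1, sale => 5.
LEVEL_WEIGHTS = {"view": 1, "sale": 5}


def build_item_vectors(events, scheme):
    """Return sparse item (row) vectors as {item_id: {user_id: value}}.

    scheme is one of:
      "binary" - 1 if the user interacted with the item at all
      "count"  - weighted sum over all view and sale events
      "level"  - highest level reached (assumed reading of view => 1, sale => 5)
    """
    vectors = defaultdict(dict)
    for user, item, event in events:
        row = vectors[item]
        if scheme == "binary":
            row[user] = 1
        elif scheme == "count":
            row[user] = row.get(user, 0) + COUNT_WEIGHTS[event]
        elif scheme == "level":
            row[user] = max(row.get(user, 0), LEVEL_WEIGHTS[event])
        else:
            raise ValueError("unknown scheme: " + scheme)
    return dict(vectors)


if __name__ == "__main__":
    for scheme in ("binary", "count", "level"):
        print(scheme, build_item_vectors(EVENTS, scheme))
```

Only non-zero entries are stored, so each item vector stays as sparse as the underlying data; users who never touched an item simply do not appear in its row.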
