You mentioned a matrix decomposition technique. Should I run the SVD job instead of RowSimilarityJob? I found this page, which describes the SVD job, and it seems like what I should try. However, I notice the SVD job does not take a similarity class as input. Would the SVD job return a DRM with similarity vectors? Also, I am not sure how to determine the decomposition rank. In the book example above, would the rank be 600?
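On the rank question, one common heuristic (not specific to Mahout) is to pick the smallest rank whose top singular values capture most of the squared singular-value mass; the rank can never exceed min(#rows, #columns), so 600 is only an upper bound here, and in practice a much smaller value is used. A minimal numpy sketch, with a made-up toy matrix and an illustrative `choose_rank` helper that is not part of any Mahout API:

```python
import numpy as np

def choose_rank(A, energy=0.99):
    """Smallest k whose top-k singular values capture `energy`
    of the total squared singular-value mass (illustrative heuristic)."""
    s = np.linalg.svd(A, compute_uv=False)
    cum = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(cum, energy) + 1)

# Toy 20x10 "ratings" matrix with known singular values 10, 5, 2,
# so its true rank is 3.  Squared energies: 100, 25, 4 out of 129.
A = np.zeros((20, 10))
A[0, 0], A[1, 1], A[2, 2] = 10.0, 5.0, 2.0

print(choose_rank(A, energy=0.99))  # 3: all three needed to reach 99%
print(choose_rank(A, energy=0.90))  # 2: top two already hold ~96.9%
```

With real data the cutoff is a judgment call; plotting the singular values and looking for the "elbow" is the usual way to pick it.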
https://mahout.apache.org/users/dim-reduction/dimensional-reduction.html

I see your point on using other information (i.e., browsing history) to "boost" correlation. This is something I will try after my demo deadline (or if I cannot find a way to solve the DRM sparsity problem).

BTW, I took the Solr/Mahout combo approach you described in your book. It works very well for the cases where a Mahout similarity vector is present.

Thanks for your help. Much appreciated.

Edith

On Tue, Jul 22, 2014 at 9:12 AM, Ted Dunning <[email protected]> wrote:

> Having such sparse data is going to make it very difficult to do anything
> at all. For instance, if you have only one non-zero in a row, there is no
> cooccurrence to analyze and that row should be deleted. With only two
> non-zeros, you have to be very careful about drawing any inferences.
>
> The other aspect of sparsity is that you only have 600 books. That may
> mean that you would be better served by using a matrix decomposition
> technique.
>
> One question I have is whether you have other actions besides purchase that
> indicate engagement with the books. Can you record which users browse a
> certain book? How about whether they have read the reviews?
>
> On Tue, Jul 22, 2014 at 8:59 AM, Edith Au <[email protected]> wrote:
>
> > Hi
> >
> > My RowSimilarityJob returns a DRM with some rows missing. The input file
> > is very sparse. There are about 600 columns but only 1-6 would have a
> > value (for each row). The output file has some rows missing. The missing
> > rows are the ones with only 1-2 values filled. Not all rows with 1 or 2
> > values are missing, just some of them. And the missing rows are not
> > always the same for each RowSimilarityJob execution.
> >
> > What I would like to achieve is to find the relative strength between
> > rows. For example, if there are 600 books, and user1 and user2 like only
> > one book (the same book), then there should be a correlation between
> > these two users.
> >
> > But my RowSimilarityJob output file seems to skip some of the users with
> > sparse preferences. I am running the job locally with 4 options: input,
> > output, SIMILARITY_LOGLIKELIHOOD, and temp dir. What would be the right
> > approach to pick up similarity between users with sparse preferences?
> >
> > Thanks!
> >
> > Edith
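[Editor's note] The example in the quoted question (two users who each like exactly one book, the same one) can be checked directly against the log-likelihood ratio that SIMILARITY_LOGLIKELIHOOD is based on. Below is a small pure-Python sketch of Dunning's LLR for a 2x2 cooccurrence table, patterned after Mahout's `LogLikelihood.logLikelihoodRatio`; the counts are illustrative:

```python
import math

def xlogx(x):
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    """xlogx(sum) - sum of xlogx, as in Mahout's LogLikelihood helper."""
    return xlogx(sum(counts)) - sum(xlogx(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 cooccurrence table.

    k11: books both users liked; k12/k21: books only one liked;
    k22: books neither liked.
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    if row + col < mat:  # guard against tiny negative fp results
        return 0.0
    return 2.0 * (row + col - mat)

# Two users who each liked exactly one of 600 books -- the SAME one:
print(round(llr(1, 0, 0, 599), 2))   # ~14.79: a clear signal
# Two users who each liked one book, but DIFFERENT ones:
print(round(llr(0, 1, 1, 598), 4))   # ~0.0033: essentially noise
```

So the single shared book does carry a strong LLR score in principle; whether a given row survives into the output also depends on the job's thresholding and sampling, which this sketch does not model.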
