BTW, what is the min sparsity for a DRM?
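To make the minimum-sparsity point concrete: a row with a single non-zero yields no cooccurring pairs at all, so it contributes nothing a cooccurrence analysis can use. A small pure-Python sketch (function and item names are illustrative, not Mahout API):

```python
from itertools import combinations

def cooccurrence_pairs(rows):
    """Count column pairs that co-occur within each row of a sparse
    matrix, given as a list of non-zero column ids per row."""
    counts = {}
    for items in rows:
        # A row with fewer than 2 non-zeros produces no pairs here,
        # which is why such rows can simply be deleted.
        for a, b in combinations(sorted(set(items)), 2):
            counts[(a, b)] = counts.get((a, b), 0) + 1
    return counts

rows = [
    ["book1"],                   # single non-zero: no cooccurrence
    ["book1", "book2"],          # one pair
    ["book1", "book2", "book3"], # three pairs
]
pairs = cooccurrence_pairs(rows)
```

With the sample rows above, only the second and third rows contribute pairs, and ("book1", "book2") is counted twice.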
On Tue, Jul 22, 2014 at 11:19 AM, Edith Au <[email protected]> wrote:

> You mentioned a matrix decomposition technique. Should I run the SVD job
> instead of RowSimilarityJob? I found this page describing the SVD job, and
> it seems like that's what I should try. However, I notice the SVD job does
> not need a similarity class as input. Would the SVD job return a DRM with
> similarity vectors? Also, I am not sure how to determine the decomposition
> rank. In the book example above, would the rank be 600?
>
> https://mahout.apache.org/users/dim-reduction/dimensional-reduction.html
>
> I see your point on using other information (e.g. browsing history) to
> "boost" correlation. This is something I will try after my demo deadline
> (or if I cannot find a way to solve the DRM sparsity problem). BTW, I
> took the Solr/Mahout combo approach you described in your book. It works
> very well for the cases where a Mahout similarity vector is present.
>
> Thanks for your help. Much appreciated,
> Edith
>
> On Tue, Jul 22, 2014 at 9:12 AM, Ted Dunning <[email protected]> wrote:
>
>> Having such sparse data is going to make it very difficult to do anything
>> at all. For instance, if you have only one non-zero in a row, there is no
>> cooccurrence to analyze, and that row should be deleted. With only two
>> non-zeros, you have to be very careful about drawing any inferences.
>>
>> The other aspect of sparsity is that you only have 600 books. That may
>> mean that you would be better served by using a matrix decomposition
>> technique.
>>
>> One question I have is whether you have other actions besides purchase
>> that indicate engagement with the books. Can you record which users
>> browse a certain book? How about whether they have read the reviews?
>>
>> On Tue, Jul 22, 2014 at 8:59 AM, Edith Au <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> My RowSimilarityJob returns a DRM with some rows missing. The input
>>> file is very sparse.
>>> There are about 600 columns, but only 1-6 of them have a value in any
>>> given row. The output file has some rows missing. The missing rows are
>>> the ones with only 1-2 values filled. Not all rows with 1 or 2 values
>>> are missing, just some of them. And the missing rows are not always
>>> the same for each RowSimilarityJob execution.
>>>
>>> What I would like to achieve is to find the relative strength between
>>> rows. For example, if there are 600 books and user1 and user2 like
>>> only one book (the same book), then there should be a correlation
>>> between these two users.
>>>
>>> But my RowSimilarityJob output file seems to skip some of the users
>>> with sparse preferences. I am running the job locally with 4 options:
>>> input, output, SIMILARITY_LOGLIKELIHOOD, and temp dir. What would be
>>> the right approach to pick up similarity between users with sparse
>>> preferences?
>>>
>>> Thanks!
>>>
>>> Edith
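The SIMILARITY_LOGLIKELIHOOD option discussed above scores a 2x2 cooccurrence table with the log-likelihood ratio (G^2) statistic. A self-contained sketch, with the caveat that how the four counts are formed from a DRM here (treating the 600 books as the event universe, so two users sharing their single liked book gives k11 = 1, k22 = 599) is an illustrative assumption, not the exact bookkeeping RowSimilarityJob performs:

```python
import math

def xlogx(x):
    # x * ln(x), with the 0 * ln(0) = 0 convention
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    # Unnormalized Shannon entropy of a set of counts
    return xlogx(sum(counts)) - sum(xlogx(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 cooccurrence table:
    k11 = both events together, k12/k21 = one without the other,
    k22 = neither event."""
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    mat_entropy = entropy(k11, k12, k21, k22)
    return 2.0 * (row_entropy + col_entropy - mat_entropy)

# Two users whose single liked book (out of 600) is the same book:
weak = llr(1, 0, 0, 599)
# The same perfect overlap backed by ten shared books scores far
# higher: more evidence supports a much stronger inference.
strong = llr(10, 0, 0, 590)
```

The score for the single-shared-book case is positive but small relative to the ten-book case, which illustrates the warning above: with only one or two non-zeros per row, any inference rests on very little evidence. As a sanity check, a table consistent with independence (e.g. `llr(10, 10, 10, 10)`) scores zero.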
