The minimum sparsity in a DRM is 0 non-zero elements in a row. That can't be what you were asking, however. Can you expand on the question?
On Tue, Jul 22, 2014 at 11:39 AM, Edith Au <[email protected]> wrote:

> BTW, what is the min sparsity for a DRM?
>
> On Tue, Jul 22, 2014 at 11:19 AM, Edith Au <[email protected]> wrote:
>
> > You mentioned a matrix decomposition technique. Should I run the SVD job
> > instead of RowSimilarityJob? I found this page describes the SVD job, and
> > it seems like that's what I should try. However, I notice the SVD job
> > does not need a similarity class as input. Would the SVD job return a DRM
> > with similarity vectors? Also, I am not sure how to determine the
> > decomposition rank. In the book example above, would the rank be 600?
> >
> > https://mahout.apache.org/users/dim-reduction/dimensional-reduction.html
> >
> > I see your point on using other information (i.e., browsing history) to
> > "boost" correlation. This is something I will try after my demo deadline
> > (or if I cannot find a way to solve the DRM sparsity problem). BTW, I
> > took the Solr/Mahout combo approach you described in your book. It works
> > very well for the cases where a Mahout similarity vector is present.
> >
> > Thanks for your help. Much appreciated,
> > Edith
> >
> > On Tue, Jul 22, 2014 at 9:12 AM, Ted Dunning <[email protected]> wrote:
> >
> > > Having such sparse data is going to make it very difficult to do
> > > anything at all. For instance, if you have only one non-zero in a row,
> > > there is no cooccurrence to analyze, and that row should be deleted.
> > > With only two non-zeros, you have to be very careful about drawing any
> > > inferences.
> > >
> > > The other aspect of sparsity is that you only have 600 books. That may
> > > mean that you would be better served by using a matrix decomposition
> > > technique.
> > >
> > > One question I have is whether you have other actions besides purchase
> > > that indicate engagement with the books. Can you record which users
> > > browse a certain book? How about whether they have read the reviews?
> > >
> > > On Tue, Jul 22, 2014 at 8:59 AM, Edith Au <[email protected]> wrote:
> > >
> > > > Hi
> > > >
> > > > My RowSimilarityJob returns a DRM with some rows missing. The input
> > > > file is very sparse: there are about 600 columns, but only 1-6 have a
> > > > value in each row. The output file has some rows missing. The missing
> > > > rows are the ones with only 1-2 values filled. Not all rows with 1 or
> > > > 2 values are missing, just some of them, and the missing rows are not
> > > > always the same for each RowSimilarityJob execution.
> > > >
> > > > What I would like to achieve is to find the relative strength between
> > > > rows. For example, if there are 600 books and user1 and user2 like
> > > > only one book (the same book), then there should be a correlation
> > > > between these two users.
> > > >
> > > > But my RowSimilarityJob output file seems to skip some of the users
> > > > with sparse preferences. I am running the job locally with 4 options:
> > > > input, output, SIMILARITY_LOGLIKELIHOOD, and temp dir. What would be
> > > > the right approach to pick up similarity between users with sparse
> > > > preferences?
> > > >
> > > > Thanks!
> > > >
> > > > Edith
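[Editor's sketch, not part of the original thread.] The SIMILARITY_LOGLIKELIHOOD option used in the job above is based on the G^2 log-likelihood ratio over a 2x2 cooccurrence contingency table. The toy function below (all names illustrative, not Mahout code) shows the statistic directly; a table that matches independence scores 0, while strong cooccurrence scores high:

```python
from math import log

def llr(k11, k12, k21, k22):
    """G^2 log-likelihood ratio for a 2x2 contingency table.

    k11: times both events occurred together
    k12: times event A occurred without B
    k21: times event B occurred without A
    k22: times neither occurred
    """
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)  # row marginals
    cols = (k11 + k21, k12 + k22)  # column marginals
    total = 0.0
    for k, r, c in ((k11, rows[0], cols[0]),
                    (k12, rows[0], cols[1]),
                    (k21, rows[1], cols[0]),
                    (k22, rows[1], cols[1])):
        if k > 0:  # empty cells contribute nothing
            total += k * log(k * n / (r * c))
    return 2.0 * total

print(round(llr(10, 0, 0, 10), 2))  # perfect cooccurrence -> 27.73
print(llr(5, 5, 5, 5))              # exact independence   -> 0.0
```

This also makes Ted's first point concrete: a row with a single non-zero produces no cooccurrence pair at all, so there is no table to score and RowSimilarityJob has nothing to emit for it.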

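[Editor's sketch, not part of the original thread.] The matrix-decomposition idea Ted raises can be illustrated with NumPy's SVD on a toy user-by-book matrix, standing in for Mahout's Hadoop SVD job. The rank k here is a hypothetical tuning choice; it should be much smaller than the number of books (so not 600, which is the full dimensionality):

```python
import numpy as np

# Toy 4-user x 5-book preference matrix (1 = purchased); values are made up.
A = np.array([[1, 1, 0, 0, 0],
              [1, 1, 0, 0, 0],
              [0, 0, 1, 1, 1],
              [0, 0, 1, 1, 0]], dtype=float)

# Thin SVD: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                        # decomposition rank (illustrative choice)
users_k = U[:, :k] * s[:k]   # rank-k user factors

def cos(a, b):
    """Cosine similarity between two factor vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Users 0 and 1 have identical purchases, so their reduced-space
# representations are identical and their cosine similarity is 1.0.
print(round(cos(users_k[0], users_k[1]), 3))  # 1.0
```

Because similarity is computed in the dense reduced space, even users with only one or two purchases get a non-degenerate representation, which is the appeal of this route for very sparse rows.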