I meant to ask what the minimum percentage of non-zero elements in a DRM row is for RowSimilarityJob to generate a similarity vector. I probably should have asked for the maximum sparsity.
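To make the question concrete, here is a toy sketch (plain Python, not Mahout code; the names and the set representation are mine) of what "cooccurrence between two user rows" means when each row is reduced to the set of its non-zero column indices:

```python
# Toy illustration (not Mahout code): each user row of the DRM is
# represented by the set of column indices holding a non-zero value.

user1 = {42}            # likes exactly one of the 600 books
user2 = {42}            # likes the same single book
user3 = {7, 99, 250}    # likes three other books

def cooccurrences(row_a, row_b):
    """Columns where both rows have a non-zero value."""
    return row_a & row_b

print(len(cooccurrences(user1, user2)))  # 1 shared book
print(len(cooccurrences(user1, user3)))  # 0 shared books
```

So even a row with a single non-zero can co-occur with another row on that one column; the question is whether that single overlap is enough evidence for the job to emit a similarity vector.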
What about using SVD for matrix decomposition? Would the SVD job return a DRM with similarity vectors? Any good sites/links to start researching SVD would be greatly appreciated! Thanks!

On Tue, Jul 22, 2014 at 1:05 PM, Ted Dunning <[email protected]> wrote:

> The minimum sparsity in a DRM is 0 non-zero elements in a row.
>
> That can't be what you were asking, however. Can you expand the question?
>
> On Tue, Jul 22, 2014 at 11:39 AM, Edith Au <[email protected]> wrote:
>
> > BTW, what is the min sparsity for a DRM?
> >
> > On Tue, Jul 22, 2014 at 11:19 AM, Edith Au <[email protected]> wrote:
> >
> > > You mentioned a matrix decomposition technique. Should I run the SVD
> > > job instead of RowSimilarityJob? I found this page describing the SVD
> > > job, and it seems like that's what I should try. However, I notice the
> > > SVD job does not need a similarity class as input. Would the SVD job
> > > return a DRM with similarity vectors? Also, I am not sure how to
> > > determine the decomposition rank. In the book example above, would the
> > > rank be 600?
> > >
> > > https://mahout.apache.org/users/dim-reduction/dimensional-reduction.html
> > >
> > > I see your point on using other information (i.e. browsing history) to
> > > "boost" correlation. This is something I will try after my demo
> > > deadline (or if I cannot find a way to solve the DRM sparsity
> > > problem). BTW, I took the Solr/Mahout combo approach you described in
> > > your book. It works very well for the cases where a Mahout similarity
> > > vector is present.
> > >
> > > Thanks for your help. Much appreciated.
> > > Edith
> > >
> > > On Tue, Jul 22, 2014 at 9:12 AM, Ted Dunning <[email protected]> wrote:
> > >
> > >> Having such sparse data is going to make it very difficult to do
> > >> anything at all. For instance, if you have only one non-zero in a
> > >> row, there is no cooccurrence to analyze, and that row should be
> > >> deleted. With only two non-zeros, you have to be very careful about
> > >> drawing any inferences.
> > >>
> > >> The other aspect of sparsity is that you have only 600 books. That
> > >> may mean that you would be better served by using a matrix
> > >> decomposition technique.
> > >>
> > >> One question I have is whether you have other actions besides
> > >> purchase that indicate engagement with the books. Can you record
> > >> which users browse a certain book? How about whether they have read
> > >> the reviews?
> > >>
> > >> On Tue, Jul 22, 2014 at 8:59 AM, Edith Au <[email protected]> wrote:
> > >>
> > >> > Hi,
> > >> >
> > >> > My RowSimilarityJob returns a DRM with some rows missing. The input
> > >> > file is very sparse: there are about 600 columns, but only 1-6
> > >> > would have a value (for each row). The output file has some rows
> > >> > missing. The missing rows are the ones with only 1-2 values filled.
> > >> > Not all rows with 1 or 2 values are missing, just some of them. And
> > >> > the missing rows are not always the same for each RowSimilarityJob
> > >> > execution.
> > >> >
> > >> > What I would like to achieve is to find the relative strength
> > >> > between rows. For example, if there are 600 books, and user1 and
> > >> > user2 like only one book (the same book), then there should be a
> > >> > correlation between these two users.
> > >> >
> > >> > But my RowSimilarityJob output file seems to skip some of the users
> > >> > with sparse preferences. I am running the job locally with 4
> > >> > options: input, output, SIMILARITY_LOGLIKELIHOOD, and temp dir.
> > >> > What would be the right approach to pick up similarity between
> > >> > users with sparse preferences?
> > >> >
> > >> > Thanks!
> > >> >
> > >> > Edith
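For what it's worth, SIMILARITY_LOGLIKELIHOOD is based on the log-likelihood ratio test. A standalone Python sketch of that statistic (my own transcription of the entropy formulation used in Mahout's LogLikelihood class, not the actual implementation) suggests that two users sharing a single book out of 600 still get a clearly positive score, while two single-book users with different books score near zero:

```python
from math import log

def x_log_x(x):
    return x * log(x) if x > 0 else 0.0

def entropy(*counts):
    # Unnormalized entropy of the counts: sum(x * log(total / x)).
    total = sum(counts)
    return x_log_x(total) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table.

    k11: books both users liked      k12: books only user A liked
    k21: books only user B liked     k22: books neither user liked
    """
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    mat_entropy = entropy(k11, k12, k21, k22)
    if row_entropy + col_entropy < mat_entropy:
        return 0.0  # guard against small negative rounding error
    return 2.0 * (row_entropy + col_entropy - mat_entropy)

# Two users who each liked only one book, the same one, out of 600:
print(llr(1, 0, 0, 599))   # clearly positive (about 14.8)
# Two users who each liked exactly one book, but different ones:
print(llr(0, 1, 1, 598))   # near zero
```

So the statistic itself does not explain the missing rows; if rows with one or two non-zeros vanish nondeterministically, the cause would have to be elsewhere in the job's processing.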
