I think the default is 1 preference per user but, as Ted said, a single preference produces no cooccurrences to count. Not sure why the default isn't 2.
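
Concretely, assuming the default in question is the --minPrefsPerUser option on the itemsimilarity driver (an assumption; check the help output for your version), raising it to 2 would look something like this, with placeholder paths:

    # Hypothetical paths. --minPrefsPerUser 2 drops users with a single
    # preference, which contribute no cooccurrences.
    mahout itemsimilarity \
      --input /data/prefs.csv \
      --output /data/indicators \
      --similarityClassname SIMILARITY_LOGLIKELIHOOD \
      --booleanData true \
      --minPrefsPerUser 2 \
      --tempDir /tmp/mahout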

A DRM can have all-zero rows, so there is no minimum for the data structure; the minimum you are setting is for the algorithm, in this case rowsimilarity.

The Solr approach (sorry, I haven't read this whole thread) uses an indicator matrix, which here will be 600x600 and is only as good as the data you have. You then use the current user's history as the Solr query against the indexed indicator matrix. If the current user has good history data, the recs may be OK.

I've done some work on using other actions, like views + purchases, and treating them all the same, but for the data we used the quality of recs did not improve much at all. Data varies, though. There is another technique, call it cross-cooccurrence, where you use purchases to find the important views and then treat those views the same as purchases. This may give you much more data to work with, but it requires writing a lot more code. We are working on a version of RSJ and itemsimilarity that does this for you, but it's not quite ready yet.

I think the other method Ted is talking about is ALS-WR, which is a latent factor method that may help. The CLI is recommendfactorized.
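
Since the CLI came up: a minimal sketch of that ALS-WR pipeline, assuming the 0.9-era parallelALS and recommendfactorized drivers; paths and hyperparameters are made up, so check the exact option names against bin/mahout for your version.

    # Factorize the ratings matrix with ALS-WR (parameters are illustrative).
    mahout parallelALS \
      --input /data/ratings.csv \
      --output /data/als \
      --numFeatures 20 \
      --numIterations 10 \
      --lambda 0.065 \
      --tempDir /tmp/als

    # Score top-N recommendations from the learned factors
    # (U = user features, M = item features, written by parallelALS).
    mahout recommendfactorized \
      --input /data/ratings.csv \
      --userFeatures /data/als/U \
      --itemFeatures /data/als/M \
      --numRecommendations 10 \
      --maxRating 5 \
      --output /data/recs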

On Jul 22, 2014, at 5:17 PM, Edith Au <[email protected]> wrote:

I meant to ask what the minimum percentage of non-zero elements in a DRM row is for RowSimilarityJob to generate a similarity vector. I probably should have asked for the maximum sparsity.

What about using SVD for matrix decomposition? Would the SVD job return a DRM with similarity vectors? Any good sites/links to start researching SVD would be greatly appreciated!

Thanks!

On Tue, Jul 22, 2014 at 1:05 PM, Ted Dunning <[email protected]> wrote:

> The minimum sparsity in a DRM is 0 non-zero elements in a row.
>
> That can't be what you were asking, however. Can you expand the question?
>
> On Tue, Jul 22, 2014 at 11:39 AM, Edith Au <[email protected]> wrote:
>
>> BTW, what is the min sparsity for a DRM?
>>
>> On Tue, Jul 22, 2014 at 11:19 AM, Edith Au <[email protected]> wrote:
>>
>>> You mentioned a matrix decomposition technique. Should I run the SVD job
>>> instead of RowSimilarityJob? I found this page describing the SVD job,
>>> and it seems like that's what I should try. However, I notice the SVD
>>> job does not need a similarity class as input. Would the SVD job return
>>> a DRM with similarity vectors? Also, I am not sure how to determine the
>>> decomposition rank. In the book example above, would the rank be 600?
>>>
>>> https://mahout.apache.org/users/dim-reduction/dimensional-reduction.html
>>>
>>> I see your point on using other information (i.e. browsing history) to
>>> "boost" correlation. This is something I will try after my demo deadline
>>> (or if I cannot find a way to solve the DRM sparsity problem). BTW, I
>>> took the Solr/Mahout combo approach you described in your book. It works
>>> very well for the cases where a Mahout similarity vector is present.
>>>
>>> Thanks for your help. Much appreciated,
>>> Edith
>>>
>>> On Tue, Jul 22, 2014 at 9:12 AM, Ted Dunning <[email protected]> wrote:
>>>
>>>> Having such sparse data is going to make it very difficult to do
>>>> anything at all. For instance, if you have only one non-zero in a row,
>>>> there is no cooccurrence to analyze and that row should be deleted.
>>>> With only two non-zeros, you have to be very careful about drawing any
>>>> inferences.
>>>>
>>>> The other aspect of sparsity is that you only have 600 books. That may
>>>> mean that you would be better served by using a matrix decomposition
>>>> technique.
>>>>
>>>> One question I have is whether you have other actions besides purchase
>>>> that indicate engagement with the books. Can you record which users
>>>> browse a certain book? How about whether they have read the reviews?
>>>>
>>>> On Tue, Jul 22, 2014 at 8:59 AM, Edith Au <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> My RowSimilarityJob returns a DRM with some rows missing. The input
>>>>> file is very sparse: there are about 600 columns, but only 1-6 of them
>>>>> have a value in each row. The output file has some rows missing. The
>>>>> missing rows are the ones with only 1-2 values filled. Not all rows
>>>>> with 1 or 2 values are missing, just some of them, and the missing
>>>>> rows are not always the same for each RowSimilarityJob execution.
>>>>>
>>>>> What I would like to achieve is to find the relative strength between
>>>>> rows. For example, if there are 600 books and user1 and user2 like
>>>>> only one book (the same book), then there should be a correlation
>>>>> between these 2 users.
>>>>>
>>>>> But my RowSimilarityJob output file seems to skip some of the users
>>>>> with sparse preferences. I am running the job locally with 4 options:
>>>>> input, output, SIMILARITY_LOGLIKELIHOOD, and temp dir. What would be
>>>>> the right approach to pick up similarity between users with sparse
>>>>> preferences?
>>>>>
>>>>> Thanks!
>>>>>
>>>>> Edith
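
For reference, the four-option run described above would look roughly like this with the 0.9 rowsimilarity driver; paths are placeholders.

    # Input is a user x item DRM; LLR is the similarity measure.
    # Note rowsimilarity also has --maxSimilaritiesPerRow (default 100),
    # which caps how many entries each output row keeps.
    mahout rowsimilarity \
      --input /data/user-item-drm \
      --output /data/row-similarities \
      --similarityClassname SIMILARITY_LOGLIKELIHOOD \
      --tempDir /tmp/rsj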
