Having such sparse data is going to make it very difficult to do anything at all. For instance, if you have only one non-zero in a row, there is no cooccurrence to analyze and that row should be deleted. With only two non-zeros, you have to be very careful about drawing any inferences.
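One concrete way to apply that cutoff is to filter the input before handing it to RowSimilarityJob. A minimal sketch follows; the SparseRowFilter class name and the "rowId<TAB>col:val,col:val,..." text format are just assumptions for illustration, so adjust the parsing to whatever your input actually looks like:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;

// Drop rows with fewer than minNonZeros entries before running
// RowSimilarityJob. Assumes a hypothetical text format of
// "rowId<TAB>col:val,col:val,..." -- one row of the matrix per line.
public class SparseRowFilter {
    public static void main(String[] args) throws IOException {
        int minNonZeros = 2;  // a single non-zero carries no cooccurrence signal
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]));
             PrintWriter out = new PrintWriter(args[1])) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                if (parts.length < 2) {
                    continue;  // malformed line, skip
                }
                int nonZeros = parts[1].split(",").length;
                if (nonZeros >= minNonZeros) {
                    out.println(line);
                }
            }
        }
    }
}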
The other aspect of sparsity is that you only have 600 books. That may mean that you would be better served by using a matrix decomposition technique.

One question I have is whether you have other actions besides purchase that indicate engagement with the books. Can you record which users browse a certain book? How about whether they have read the reviews?

On Tue, Jul 22, 2014 at 8:59 AM, Edith Au <[email protected]> wrote:

> Hi
>
> My RowSimilarityJob returns a DRM with some rows missing. The input file
> is very sparse. There are about 600 columns but only 1-6 would have a
> value (for each row). The output file has some rows missing. The missing
> rows are the ones with only 1-2 values filled. Not all rows with 1 or 2
> values are missing, just some of them. And the missing rows are not
> always the same for each RowSimilarityJob execution.
>
> What I would like to achieve is to find the relative strength between
> rows. For example, if there are 600 books and user1 and user2 like only
> one book (the same book), then there should be a correlation between
> these two users.
>
> But my RowSimilarityJob output file seems to skip some of the users with
> sparse preferences. I am running the job locally with 4 options: input,
> output, SIMILARITY_LOGLIKELIHOOD, and temp dir. What would be the right
> approach to pick up similarity between users with sparse preferences?
>
> Thanks!
>
> Edith
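For concreteness, here is a standalone sketch of the G^2 log-likelihood ratio that SIMILARITY_LOGLIKELIHOOD is based on, applied to the one-shared-book case from the question. It mirrors the entropy formulation used in Mahout's org.apache.mahout.math.stats.LogLikelihood; the LlrDemo class and the worked numbers are mine, for illustration only:

// Standalone G^2 log-likelihood ratio, so the single-shared-book case
// can be checked by hand.
public class LlrDemo {

    static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Unnormalized entropy term as used in the G^2 statistic.
    static double entropy(long... counts) {
        long sum = 0;
        double result = 0.0;
        for (long x : counts) {
            result += xLogX(x);
            sum += x;
        }
        return xLogX(sum) - result;
    }

    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double colEntropy = entropy(k11 + k21, k12 + k22);
        double matEntropy = entropy(k11, k12, k21, k22);
        return 2.0 * Math.max(0.0, rowEntropy + colEntropy - matEntropy);
    }

    public static void main(String[] args) {
        // 600 books; user1 and user2 each like exactly one book, the same one.
        // k11 = books both like, k12/k21 = books only one likes, k22 = neither.
        System.out.println(logLikelihoodRatio(1, 0, 0, 599));  // ~14.8
    }
}

The score (~14.8) looks large, but it rests entirely on a single event, which is exactly why inferences from rows with only one or two non-zeros are so fragile.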
