As Sean mentioned, you would then be computing similarities between features (columns), not between users. If you want to find similar users, I suggest running k-means with some fixed number of clusters. It's not feasible to compute all pairs of similarities between 1bn items (that is about 5 x 10^17 pairs), so k-means with fixed k is more suitable here.
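To make the distinction concrete, here is a minimal sketch in plain Python. It is not the Spark API: the helper names `column_cosine_similarities` and `kmeans` and the toy data are made up for illustration. It shows that column similarities on an m x n matrix give an n x n result about features, while k-means over the rows gives one cluster label per user.

```python
import math
import random


def column_cosine_similarities(rows):
    """Cosine similarity between every pair of columns (features).

    For an m x n input this returns an n x n matrix, regardless of m.
    """
    n = len(rows[0])
    norms = [math.sqrt(sum(r[j] ** 2 for r in rows)) for j in range(n)]
    sims = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(r[i] * r[j] for r in rows)
            denom = norms[i] * norms[j]
            sims[i][j] = dot / denom if denom else 0.0
    return sims


def kmeans(rows, k, iterations=20, seed=0):
    """Plain Lloyd's k-means over rows (users); returns one cluster id per row."""
    rng = random.Random(seed)
    centers = [list(r) for r in rng.sample(rows, k)]
    assignment = [0] * len(rows)
    for _ in range(iterations):
        # Assign each row to its nearest center (squared Euclidean distance).
        for idx, r in enumerate(rows):
            assignment[idx] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(r, centers[c])),
            )
        # Move each center to the mean of the rows assigned to it.
        for c in range(k):
            members = [r for r, a in zip(rows, assignment) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Toy data (assumed for illustration): 6 users (rows) x 3 features (columns).
users = [
    [1.0, 1.0, 0.0],
    [2.0, 2.0, 0.0],
    [1.5, 1.5, 0.1],
    [0.0, 0.0, 5.0],
    [0.1, 0.0, 4.0],
    [0.0, 0.2, 6.0],
]

feature_sims = column_cosine_similarities(users)  # 3 x 3: feature vs. feature
clusters = kmeans(users, k=2)                     # 6 labels: one per user
```

Users that land in the same cluster are candidates for "similar users". The work per iteration grows with m * k rather than m^2, which is why a fixed k stays tractable when m is in the billions.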
Best,
Reza

On Wed, Dec 10, 2014 at 10:39 AM, Sean Owen <so...@cloudera.com> wrote:

> Well, you're computing similarity of your features then. Whether it is
> meaningful depends a bit on the nature of your features and more on
> the similarity algorithm.
>
> On Wed, Dec 10, 2014 at 2:53 PM, Jaonary Rabarisoa <jaon...@gmail.com>
> wrote:
> > Dear all,
> >
> > I'm trying to understand what is the correct use case of ColumnSimilarity
> > implemented in RowMatrix.
> >
> > As far as I know, this function computes the similarity of a column of a
> > given matrix. The DIMSUM paper says that it's efficient for large m (rows)
> > and small n (columns). In this case the output will be a n by n matrix.
> >
> > Now, suppose I want to compute similarity of several users, say m =
> > billions. Each users is described by a high dimensional feature vector, say
> > n = 10000. In my dataset, one row represent one user. So in that case
> > computing the similarity my matrix is not the same as computing the
> > similarity of all users. Then, what does it mean computing the similarity of
> > the columns of my matrix in this case ?
> >
> > Best regards,
> >
> > Jao