I have a version that works well on the Netflix data, and I am now validating it on internal datasets. This code will work on matrix factors and on sparse matrices that have rows = 100 * columns. If the number of columns is much smaller than the number of rows, then the column-based flow works well. Basically, we need both flows.
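
To make the row-based flow concrete, here is a minimal brute-force sketch in plain Python. It is not Spark or MLlib code, and the names `cosine` and `row_similarities` are made up for illustration. It assumes dense vectors small enough to fit in memory, keeps an id with each row, and emits (id1, id2, sim) triples at or above a threshold. Something along these lines could serve as the exact baseline that an optimized version is checked against.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def row_similarities(rows, threshold=0.0):
    """Brute-force all-pairs row similarity.

    rows: list of (row_id, vector) pairs (hypothetical input layout).
    Returns (id_a, id_b, sim) for every unordered pair with sim >= threshold.
    """
    results = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(rows, 2):
        sim = cosine(vec_a, vec_b)
        if sim >= threshold:
            results.append((id_a, id_b, sim))
    return results


if __name__ == "__main__":
    # Toy data only, not taken from any real dataset.
    rows = [
        ("id1", [1.0, 0.0]),
        ("id2", [1.0, 0.0]),
        ("id3", [0.0, 1.0]),
    ]
    for id_a, id_b, sim in row_similarities(rows, threshold=0.5):
        print(id_a, id_b, sim)
```

This is O(n^2) in the number of rows, so it is only practical on a sample, but it gives exact values to compare approximate results against.
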
I have not thought about random sampling yet, but LSH will work well. The metric is the key here, so every optimization needs to be validated against the raw flow.

On Apr 6, 2015 10:15 AM, "Reza Zadeh" <r...@databricks.com> wrote:

> Right now dimsum is meant to be used for tall and skinny matrices, and so
> columnSimilarities() returns similar columns, not rows. We are working on
> adding an efficient row similarity as well, tracked by this JIRA:
> https://issues.apache.org/jira/browse/SPARK-4823
> Reza
>
> On Mon, Apr 6, 2015 at 6:08 AM, James <alcaid1...@gmail.com> wrote:
>
>> The example below illustrates how to use the DIMSUM algorithm to
>> calculate the similarity between each two rows and output row pairs with
>> cosine similarity that is not less than a threshold.
>>
>> https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/CosineSimilarity.scala
>>
>> But what if I hope to hold an Id of each row, which means the input file
>> is:
>>
>> id1 vector1
>> id2 vector2
>> id3 vector3
>> ...
>>
>> And we hope to output
>>
>> id1 id2 sim(id1, id2)
>> id1 id3 sim(id1, id3)
>> ...
>>
>> Alcaid