I have a version that works well for Netflix data, but I am now validating
it on internal datasets. This code will work on matrix factors and on sparse
matrices that have rows = 100 * columns. If there are far fewer columns than
rows, the column-based flow works well. Basically, we need both flows.

I have not thought about random sampling yet, but LSH will work well. The
metric is the key here, so every optimization needs to be validated against
the raw flow.
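
To make "raw flow" concrete, here is a minimal sketch of a brute-force
baseline, assuming small dense vectors that fit in memory. It is plain Python
rather than Spark, and the names (cosine, raw_row_similarities, rows) are
hypothetical, not part of MLlib. It also carries a row id through to the
output, which is the format asked for in the original question.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two equal-length dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def raw_row_similarities(rows, threshold=0.0):
    """Brute-force baseline: compare every pair of rows, keep pairs with sim >= threshold.

    `rows` is a list of (row_id, vector) tuples (a hypothetical layout).
    Cost is quadratic in the number of rows, so this is only a reference
    to validate faster, approximate flows against.
    """
    out = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(rows, 2):
        sim = cosine(vec_a, vec_b)
        if sim >= threshold:
            out.append((id_a, id_b, sim))
    return out

# Toy input: id1 and id2 are identical, id3 is orthogonal to both.
rows = [
    ("id1", [1.0, 0.0, 1.0]),
    ("id2", [1.0, 0.0, 1.0]),
    ("id3", [0.0, 1.0, 0.0]),
]
pairs = raw_row_similarities(rows, threshold=0.5)
for id_a, id_b, sim in pairs:
    print(id_a, id_b, sim)
```

Any sampled or LSH-based flow can then be checked by comparing its output
pairs and scores against this baseline on a small slice of the data.
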
On Apr 6, 2015 10:15 AM, "Reza Zadeh" <r...@databricks.com> wrote:

> Right now DIMSUM is meant to be used for tall and skinny matrices, and so
> columnSimilarities() returns similar columns, not rows. We are working on
> adding an efficient row similarity as well, tracked by this JIRA:
> https://issues.apache.org/jira/browse/SPARK-4823
> Reza
>
> On Mon, Apr 6, 2015 at 6:08 AM, James <alcaid1...@gmail.com> wrote:
>
>> The example below illustrates how to use the DIMSUM algorithm to
>> calculate the similarity between each pair of rows and output row pairs
>> with cosine similarity that is not less than a threshold.
>>
>>
>> https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/CosineSimilarity.scala
>>
>>
>> But what if I want to keep an ID for each row, which means the input file
>> is:
>>
>> id1 vector1
>> id2 vector2
>> id3 vector3
>> ...
>>
>> And we want to output:
>>
>> id1 id2 sim(id1, id2)
>> id1 id3 sim(id1, id3)
>> ...
>>
>>
>> Alcaid
>>
>
>
