Re: Document Similarity -Spark Mllib

2016-12-15 Thread Liang-Chi Hsieh
OK. I went and checked the DIMSUM implementation in Spark MLlib. The probability that a column is sampled is determined by math.sqrt(10 * math.log(nCol) / threshold) / colMagnitude. The most influential parameter is colMagnitude. If in your dataset the colMagnitude for most columns is very low, then it looks like the sampling probability will be close to 1 for them, so raising the threshold will not reduce the computation much.
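To make that concrete, here is a small sketch of that sampling rule (the variable names and surrounding plumbing are my own, not the actual MLlib source):

// Rough sketch of the DIMSUM per-column sampling probability.
val threshold = 0.5                               // value passed to columnSimilarities
val nCol = 1616160                                // number of columns, from this thread
val gamma = 10 * math.log(nCol) / threshold       // oversampling parameter
def sampleProb(colMagnitude: Double): Double =
  math.min(1.0, math.sqrt(gamma) / colMagnitude)
// When colMagnitude is small, sqrt(gamma) / colMagnitude >= 1, so every entry of that
// column is kept and a larger threshold no longer cuts down the work.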

Re: Document Similarity -Spark Mllib

2016-12-13 Thread Liang-Chi Hsieh
Hi Satyajit, Have you tried using a higher threshold for columnSimilarities to lower the computation cost? BTW, can you also comment out most of the other code and just run columnSimilarities, doing only a simple computation such as counting the entries of the returned CoordinateMatrix? So we can make sure where the bottleneck actually is.
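Something along these lines should be enough to isolate the cost, assuming mat is the existing RowMatrix from your code:

// Minimal check: run only columnSimilarities and force it with a cheap action.
val exact = mat.columnSimilarities(0.5)   // or a higher threshold, e.g. 0.8
val nEntries = exact.entries.count()      // counting forces the similarity computation
println(s"number of similarity entries: $nEntries")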

Re: Document Similarity -Spark Mllib

2016-12-13 Thread satyajit vegesna
Hi Liang, The problem is that when I take a huge data set, I get a matrix of size 1616160 * 1616160. PFB code,

val exact = mat.columnSimilarities(0.5)
val exactEntries = exact.entries.map { case MatrixEntry(i, j, u) => ((i, j), u) }
case class output(label1: Long, label2: Long, score: Double)
val fin
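For readers of the archive, a hedged guess at how such a pipeline typically continues (everything below is my reconstruction, not the original code; exact and the output case class come from the snippet above):

import org.apache.spark.mllib.linalg.distributed.MatrixEntry

val scored = exact.entries.map { case MatrixEntry(i, j, u) => output(i, j, u) }
// exact.entries holds only the computed, sparse similarity pairs, so nothing close to
// a dense 1616160 x 1616160 matrix is ever materialized.
scored.filter(_.score > 0.8).take(20).foreach(println)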

Re: Document Similarity -Spark Mllib

2016-12-10 Thread Liang-Chi Hsieh
Hi Satyajit, I am not sure why you think DIMSUM cannot be applied to your use case, or whether you have tried it and encountered some problems. Although in the paper[1] the authors mention that they concentrate on the regime where the number of rows is very large and the number of columns is not too large, I think it is still worth trying for your case.
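For anyone following the thread, the basic DIMSUM entry point in MLlib looks roughly like this (a minimal sketch with toy data; sc is an existing SparkContext):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Each row is a feature vector; columnSimilarities compares the columns,
// so lay the matrix out so that the items you want to compare are columns.
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 0.0, 2.0),
  Vectors.dense(0.0, 3.0, 1.0),
  Vectors.dense(4.0, 1.0, 0.0)))
val mat = new RowMatrix(rows)
// Approximate cosine similarities; a higher threshold samples more aggressively.
val approx = mat.columnSimilarities(0.5)
approx.entries.collect().foreach(println)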