OK, I went to check the DIMSUM implementation in Spark MLlib. The probability
that a column is sampled is decided by math.sqrt(10 * math.log(nCol) / threshold)
/ colMagnitude. The most influential parameter is colMagnitude. If in your
dataset the colMagnitude for most columns is very low, then it looks like the
sampling probability is capped at 1 for most columns, so the sampling prunes
very little.
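For concreteness, a back-of-the-envelope sketch of that probability in
spark-shell style (the nCol, threshold, and magnitude values below are made up
for illustration):

val nCol = 1616160L                         // number of columns
val threshold = 0.5                         // argument passed to columnSimilarities
val gamma = 10 * math.log(nCol) / threshold // DIMSUM oversampling parameter

// Probability that an entry of a given column is kept, capped at 1.
def sampleProb(colMagnitude: Double): Double =
  math.min(1.0, math.sqrt(gamma) / colMagnitude)

sampleProb(0.5)    // low-magnitude column: probability capped at 1, nothing pruned
sampleProb(100.0)  // high-magnitude column: aggressively sampled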
Hi Satyajit,
Have you tried adjusting to a higher threshold for columnSimilarities to lower
the computation cost?
BTW, can you also comment out most of the other code and just run
columnSimilarities, then do a simple computation like counting the entries of
the returned CoordinateMatrix? So we can make sure the cost really comes from
columnSimilarities.
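Something like this rough sketch would do both, assuming `mat` is your RowMatrix
and 0.8 is just an illustrative higher threshold:

val approx = mat.columnSimilarities(0.8)   // higher threshold => more aggressive sampling
val numEntries = approx.entries.count()    // cheap action on the CoordinateMatrix entries
println("similar column pairs kept: " + numEntries)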
Hi Liang,
The problem is that when I take a huge data set, I get a matrix of size
1616160 * 1616160.
PFB code,
val exact = mat.columnSimilarities(0.5)
val exactEntries = exact.entries.map { case MatrixEntry(i, j, u) => ((i, j), u) }
case class output(label1:Long,label2:Long,score:Double)
val fin
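For reference, the pipeline looks roughly like the following self-contained
sketch, runnable in spark-shell (the toy `rows` data and the Output case class
are placeholders, not the real data set):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, RowMatrix}

case class Output(label1: Long, label2: Long, score: Double)

// Placeholder feature vectors; the real data set has far more rows and columns.
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 0.0, 2.0),
  Vectors.dense(0.0, 3.0, 1.0),
  Vectors.dense(4.0, 1.0, 0.0)))

val mat = new RowMatrix(rows)
val exact = mat.columnSimilarities(0.5)   // a non-zero threshold enables DIMSUM sampling
val results = exact.entries.map { case MatrixEntry(i, j, u) => Output(i, j, u) }
results.take(5).foreach(println)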
Hi Satyajit,
I am not sure why you think DIMSUM cannot apply to your use case, or whether
you've tried it but encountered some problems.
In the paper [1] the authors mention that they concentrate on the regime where
the number of rows is very large and the number of columns is not too large.
But I