Hi Folks!
I am trying to implement a Spark job that calculates the similarity between the
products in my database, using only their names and descriptions.
I would like to use TF-IDF to represent my text data and cosine similarity to
calculate all similarities.
My goal is to have, once the job completes, a list of similar products for each
product.
For example:
Prod1 = ((Prod2, 0.98), (Prod3, 0.88))
Prod2 = ((Prod1, 0.98), (Prod4, 0.53))
Prod3 = ((Prod1, 0.88))
Prod4 = ((Prod2, 0.53))
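To make the target shape concrete, here is a minimal plain-Scala sketch (no Spark) of how a set of pair scores could be folded into per-product lists. The object name SimilarityLists, the helper toLists, and the scores themselves are hypothetical and only illustrate the structure I am after; it assumes each unordered pair appears once.

```scala
object SimilarityLists {
  // Illustrative scores, one entry per unordered pair (hypothetical values).
  val pairs: Seq[((String, String), Double)] = Seq(
    (("Prod1", "Prod2"), 0.98),
    (("Prod1", "Prod3"), 0.88),
    (("Prod2", "Prod4"), 0.53)
  )

  // Emit every pair in both directions, group by the first product,
  // and sort each product's list by descending score.
  def toLists(ps: Seq[((String, String), Double)]): Map[String, Seq[(String, Double)]] =
    ps.flatMap { case ((a, b), s) => Seq((a, (b, s)), (b, (a, s))) }
      .groupBy(_._1)
      .map { case (k, vs) => (k, vs.map(_._2).sortWith(_._2 > _._2)) }
}
```

Calling SimilarityLists.toLists(SimilarityLists.pairs) gives one list per product; on an RDD the same idea would presumably be a flatMap followed by a group-by-key step.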
However, I am new to Spark and I am having trouble understanding what the
cosine similarity computation returns!
My code:
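import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, RowMatrix}
import org.apache.spark.rdd.RDD

// One document per line, tokenized on single spaces.
val documents: RDD[Seq[String]] = sc.textFile(filename).map(_.split(" ").toSeq)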
val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(documents)
tf.cache()
val idf = new IDF(minDocFreq = 2).fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)
val mat = new RowMatrix(tfidf)
// Compute similar columns perfectly, with brute force.
val exact = mat.columnSimilarities()
// Compute similar columns with estimation using DIMSUM
val approx = mat.columnSimilarities(0.1)
val exactEntries = exact.entries.map { case MatrixEntry(i, j, u) => ((i, j), u) }
val approxEntries = approx.entries.map { case MatrixEntry(i, j, v) => ((i, j), v) }
The file contains one product per row: its name followed by its description.
The result I got:
approxEntries.first()
res18: ((Long, Long), Double) = ((1638,966248),0.632455532033676)
How can I figure out which rows (products) this result refers to?
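A note on reading the pair above: the RowMatrix documentation describes columnSimilarities() as comparing the columns of the matrix. If each row of tfidf is one product and each column one hashed term, then (1638, 966248) would be a pair of term indices rather than a pair of product rows. The plain-Scala sketch below (no Spark) shows the difference between comparing columns and comparing rows of the same matrix. The object name ColumnVsRow, the 3x4 toy matrix, and the helpers cosine and pairwise are all hypothetical.

```scala
object ColumnVsRow {
  // Toy documents x terms matrix: 3 rows (products), 4 columns (terms).
  val m: Seq[Seq[Double]] = Seq(
    Seq(1.0, 0.0, 2.0, 0.0),
    Seq(1.0, 0.0, 2.0, 0.0),
    Seq(0.0, 3.0, 0.0, 1.0)
  )

  // Cosine similarity of two equal-length vectors; 0.0 if either is all zeros.
  def cosine(a: Seq[Double], b: Seq[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val norms = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    if (norms == 0.0) 0.0 else dot / norms
  }

  // All index pairs (i, j) with i < j, keeping only non-zero scores.
  def pairwise(vs: Seq[Seq[Double]]): Seq[((Int, Int), Double)] =
    for {
      i <- vs.indices
      j <- vs.indices
      if i < j
      s = cosine(vs(i), vs(j))
      if s != 0.0
    } yield ((i, j), s)

  val columnPairs = pairwise(m.transpose) // indices identify terms (columns)
  val rowPairs    = pairwise(m)           // indices identify products (rows)
}
```

Under that reading, product-to-product scores would come from comparing rows instead, with a row-index-to-product mapping kept alongside the vectors so each index can be traced back to a product.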
Thanks in advance! =]