Hi all,

I'm trying to write a Spark application that will detect similar items (in
this case products) based on their descriptions. I've got an ML pipeline
that transforms the product data to TF-IDF representation, using the
following components.

   - *RegexTokenizer* - strips out non-word characters, results in a list
   of tokens
   - *StopWordsRemover* - removes common "stop words", such as "the",
   "and", etc.
   - *HashingTF* - assigns a numeric "hash" to each token and calculates
   the term frequency
   - *IDF* - computes the inverse document frequency

After this pipeline evaluates, I'm left with a SparseVector that represents
the inverse document frequency of tokens for each product. As a next step,
I'd like to be able to compare each vector to one another, to detect
similarities.

Does anybody know of a straightforward way to do this in Spark? I tried
creating a UDF (that used the Breeze linear algebra methods internally);
however, that did not scale well.

Thanks,
Kevin

Reply via email to