Hi all, I'm trying to write a Spark application that will detect similar items (in this case products) based on their descriptions. I've got an ML pipeline that transforms the product data to TF-IDF representation, using the following components.
- *RegexTokenizer* - strips out non-word characters, results in a list of tokens - *StopWordsRemover* - removes common "stop words", such as "the", "and", etc. - *HashingTF* - assigns a numeric "hash" to each token and calculates the term frequency - *IDF* - computes the inverse document frequency After this pipeline evaluates, I'm left with a SparseVector that represents the inverse document frequency of tokens for each product. As a next step, I'd like to be able to compare each vector to one another, to detect similarities. Does anybody know of a straightforward way to do this in Spark? I tried creating a UDF (that used the Breeze linear algebra methods internally); however, that did not scale well. Thanks, Kevin