Related question: is there anything that does scalable matrix
multiplication on Spark?  For example, we have that long list of vectors
and want to construct the similarity matrix:  v * T(v).  In R it would be: v
%*% t(v)

On Mon, Sep 19, 2016 at 3:49 PM, Kevin Mellott <>

> Hi all,
> I'm trying to write a Spark application that will detect similar items (in
> this case products) based on their descriptions. I've got an ML pipeline
> that transforms the product data to TF-IDF representation, using the
> following components.
>    - *RegexTokenizer* - strips out non-word characters, results in a list
>    of tokens
>    - *StopWordsRemover* - removes common "stop words", such as "the",
>    "and", etc.
>    - *HashingTF* - assigns a numeric "hash" to each token and calculates
>    the term frequency
>    - *IDF* - computes the inverse document frequency
> After this pipeline evaluates, I'm left with a SparseVector that
> represents the inverse document frequency of tokens for each product. As a
> next step, I'd like to be able to compare each vector to one another, to
> detect similarities.
> Does anybody know of a straightforward way to do this in Spark? I tried
> creating a UDF (that used the Breeze linear algebra methods internally);
> however, that did not scale well.
> Thanks,
> Kevin

Reply via email to