How many products do you have? How large are your vectors?

SVD / LSA may help by reducing the dimensionality of the vectors first. But
if you have many products, computing all-pairs similarity by brute force
will not scale, since the number of comparisons grows quadratically with the
number of products. In that case you may want to investigate
locality-sensitive hashing (LSH).
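
To make the LSH suggestion concrete, here is a minimal sketch of
random-hyperplane hashing for cosine similarity. It is plain Python, not
Spark API, and it is only meant to show the idea at toy scale. Everything in
it is an assumption for illustration: the function names, the
{index: weight} dict standing in for a SparseVector, and the parameters
num_tables, planes_per_table and threshold.

```python
import random
from collections import defaultdict
from itertools import combinations
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {index: weight} dicts."""
    dot = sum(w * b.get(i, 0.0) for i, w in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def make_planes(num_planes, dim, seed):
    """Draw random hyperplane normals with Gaussian components."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_planes)]


def signature(vec, planes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        sum(w * plane[i] for i, w in vec.items()) >= 0.0 for plane in planes
    )


def candidate_pairs(vectors, dim, num_tables=8, planes_per_table=4, seed=0):
    """Return id pairs that land in the same bucket in at least one table."""
    pairs = set()
    for t in range(num_tables):
        planes = make_planes(planes_per_table, dim, seed + t)
        buckets = defaultdict(list)
        for key, vec in vectors.items():
            buckets[signature(vec, planes)].append(key)
        # Only items sharing a bucket are compared, not every pair.
        for ids in buckets.values():
            pairs.update(combinations(sorted(ids), 2))
    return pairs


def similar_items(vectors, dim, threshold=0.8, **lsh_args):
    """Score the candidate pairs exactly and keep those above the threshold."""
    result = {}
    for a, b in candidate_pairs(vectors, dim, **lsh_args):
        score = cosine(vectors[a], vectors[b])
        if score >= threshold:
            result[(a, b)] = score
    return result
```

Vectors pointing in similar directions tend to fall on the same side of each
random hyperplane, so they tend to share buckets, and the exact cosine is
computed only within a bucket. More planes per table gives smaller buckets
and fewer false candidates; more tables reduces the chance of missing a
truly similar pair. The result is approximate: some similar pairs can be
missed.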

On Mon, 19 Sep 2016 at 22:49, Kevin Mellott wrote:

> Hi all,
> I'm trying to write a Spark application that will detect similar items (in
> this case products) based on their descriptions. I've got an ML pipeline
> that transforms the product data to TF-IDF representation, using the
> following components.
>    - *RegexTokenizer* - strips out non-word characters, results in a list
>    of tokens
>    - *StopWordsRemover* - removes common "stop words", such as "the",
>    "and", etc.
>    - *HashingTF* - assigns a numeric "hash" to each token and calculates
>    the term frequency
>    - *IDF* - computes the inverse document frequency
> After this pipeline evaluates, I'm left with a SparseVector that holds the
> TF-IDF weights of the tokens for each product. As a
> next step, I'd like to be able to compare each vector to one another, to
> detect similarities.
> Does anybody know of a straightforward way to do this in Spark? I tried
> creating a UDF (that used the Breeze linear algebra methods internally);
> however, that did not scale well.
> Thanks,
> Kevin
