There are several ways I can compute the cosine similarity between a Spark ML
vector and each ML vector in a Spark DataFrame column, and then sort for the
highest results.  However, I can't come up with a method that is faster than
replacing the `/data/` in a Spark ML Word2Vec model and then using
`.findSynonyms()`.  The problem is that the Word2Vec model is held entirely in
the driver, which can cause memory issues if the data set I want to compare
against gets too big.

1. Is there a more efficient method than the ones I have shown below?
2. Could the data for the Word2Vec model be distributed across the cluster?
3. Could the `.findSynonyms()` Scala code be modified to make a Spark SQL
function that can operate efficiently over a whole Spark DataFrame?

Methods I have tried:

#1 RDD function:

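    # vecIn = vector of same dimensions as 'vectors' column
    # `df` stands in for the DataFrame that holds the 'vectors' column
    import numpy as np

    def cosSim(row, vecIn):
        v, w = row.vectors.toArray(), vecIn.toArray()
        # cosine similarity = dot(v, w) / (|v| * |w|)
        return (float(np.dot(v, w) / (np.sqrt(np.dot(v, v)) * np.sqrt(np.dot(w, w)))),)

    df.rdd.map(lambda row: cosSim(row, vecIn))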
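
For checking any of these methods against a small sample, here is a minimal plain-Python sketch of the operation itself: score every vector against the query by cosine similarity, then keep the best `k`. It does not use Spark. The names `cosine_similarity`, `top_k_similar`, and `sample_rows` are made up for illustration, and returning 0.0 for a zero vector is just a convention chosen here.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumed convention for zero vectors
    return dot / (norm_a * norm_b)


def top_k_similar(rows, vec_in, k):
    """rows: iterable of (key, vector) pairs. Returns the k best (key, score) pairs."""
    scored = ((key, cosine_similarity(vec, vec_in)) for key, vec in rows)
    # nlargest keeps only k candidates in memory instead of sorting everything
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Hypothetical sample data: (key, vector)
sample_rows = [
    ("a", [1.0, 0.0]),
    ("b", [0.0, 1.0]),
    ("c", [1.0, 1.0]),
]
print(top_k_similar(sample_rows, [1.0, 0.0], 2))
```

With the sample above, `"a"` ranks first and `"c"` second, which gives a quick sanity check for whatever the distributed version returns on the same rows.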

#2  `.toIndexedRowMatrix().columnSimilarities()`, then filter the results:


        IndexedRowMatrix( row: (row.vectors.toArray())))

#3 replace Word2Vec model `/data/` with my own, then load 'revised' model and 
use `.findSynonyms()`:


    from pyspark.ml.feature import Word2VecModel

    new_Word2Vec_model = Word2VecModel.load("exiting_Word2Vec_model")

    ## vecIn = vector of same dimensions as 'vector' column in DataFrame saved
    ## over Word2Vec model /data/
    new_Word2Vec_model.findSynonyms(vecIn, 20).show()
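
On the question of distributing the work: a top-k ranking can in principle be split into a per-partition top-k followed by a merge of the partial results, so no single node has to hold every vector. The sketch below shows only that local-then-merge shape in plain Python, with hypothetical pre-scored partitions. It is not how `.findSynonyms()` is implemented, and the names `local_top_k`, `merge_top_k`, and `partitions` are invented for the example.

```python
import heapq


def local_top_k(partition, k):
    """Top k (key, score) pairs within one partition."""
    return heapq.nlargest(k, partition, key=lambda pair: pair[1])


def merge_top_k(left, right, k):
    """Combine two partial results into a single top-k list."""
    return heapq.nlargest(k, left + right, key=lambda pair: pair[1])


# Hypothetical partitions of already-scored rows: (key, cosine similarity)
partitions = [
    [("a", 0.91), ("b", 0.12), ("c", 0.55)],
    [("d", 0.97), ("e", 0.40)],
    [("f", 0.73)],
]

k = 2
partials = [local_top_k(p, k) for p in partitions]  # would run on the executors
result = []
for partial in partials:  # the merge only ever sees k rows per partition
    result = merge_top_k(result, partial, k)
print(result)
```

In PySpark the same shape might be expressed with `mapPartitions` plus a reduce, or with `RDD.top(num, key=...)`; whether that beats `.findSynonyms()` on real data would need to be measured.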

Clay Stevens
