I'm assuming that you are computing the cosine similarity and that you have two 
vectors containing <term, tfidf> pairs. The two vectors can have different 
sizes because they only contain the terms that have tfidf != 0.
If you want to compute the cosine similarity between the two lists, you only 
have to consider the terms that appear in **both vectors** when computing the 
dot product: if a term doesn't appear in one of the two, its product is 0, so 
it does not contribute to the final cosine similarity score. (The norm of each 
vector is still computed over all of its own terms.)
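
To make this concrete, here is a minimal sketch in Python (not the Lucene or 
Solr API). It assumes each document's vector is already available as a dict 
mapping term to tfidf; the function name cosine_similarity and the sample 
weights are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors stored as {term: tfidf} dicts."""
    # Iterate over the smaller dict and look terms up in the larger one.
    if len(a) > len(b):
        a, b = b, a
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    # Each norm is computed over all the terms of its own vector.
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_a * norm_b)


# Hypothetical tfidf weights; the two vectors have different sizes.
doc1 = {"solr": 1.0, "search": 2.0, "query": 0.5}
doc2 = {"search": 2.0, "lucene": 1.0}
score = cosine_similarity(doc1, doc2)
```

No common vocabulary has to be collected first: a term missing from one dict 
is treated as an implicit 0, which gives the same result as padding both 
vectors to the same dimension.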

(Really old) Example: 
https://github.com/diegoceccarelli/dexter/blob/fb4bbcb27a13da2665f3c19d6c75bfc4f5778440/dexter-core/src/main/java/it/cnr/isti/hpc/dexter/lucene/LuceneHelper.java#L386


From: solr-user@lucene.apache.org At: 01/06/18 17:24:07
To: solr-user@lucene.apache.org
Subject: Re: Personalized search parameters

Don't we need vectors of the same size to calculate the cosine similarity? 
Maybe I missed something, but following that example it looks like I have to
manually recreate the sparse vectors, because the term vector of a document
should (I may be wrong) contain only the terms that appear in that document.
Am I wrong?

Given that, I assumed (and that example goes in that direction) that we have
to manually create the sparse vector by first collecting all the terms and
then calculating the tf-idf value for each term in each document.
That's what I did, and I obtained vectors of the same dimension for each
document. I was just wondering if there is a better-optimized way to obtain
those sparse vectors.


--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html

