Document similarity

David Starina Sun, 14 Feb 2016 05:15:07 -0800

Hi,

I need to build a system to determine the N (e.g. 10) most similar documents to
a given document. I have some (theoretical) knowledge of Mahout algorithms,
but not enough to build the system. Can you give me some suggestions?


At first I was researching Latent Semantic Analysis for the task, but since
Mahout doesn't support it, I started researching some other options. I got
a hint that instead of LSA, you can use LDA (Latent Dirichlet Allocation)
in Mahout to achieve similar or even better results.

However ... and this is where I got confused ... LDA is a clustering
algorithm. What I need is not to cluster the documents into N
clusters - I need a matrix (similar to TF-IDF) from which I can
calculate some sort of distance between any two documents, so that I can
find the N most similar documents for any given document.

How do I achieve that? My idea was (still mostly theoretical, since I have
some problems with running the LDA algorithm) to extract some number of
topics with LDA, not to cluster the documents with the help of these
topics, but to get a matrix with documents as one dimension and topics as
the other dimension. I was guessing I could then use this matrix as an
input to the row-similarity algorithm.
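
To make the idea concrete, here is a minimal Python sketch of that last step
as I imagine it. It does not use Mahout at all: the doc_topics values are
made up (in practice they would be the per-document topic weights produced by
LDA), the names cosine and most_similar are only illustrative, and cosine
similarity is just one possible choice of measure.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def most_similar(doc_topics, query, n):
    """Return up to n (doc_index, score) pairs most similar to row `query`."""
    scores = [
        (cosine(doc_topics[query], row), i)
        for i, row in enumerate(doc_topics)
        if i != query  # skip the query document itself
    ]
    # Highest score first; break ties by document index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [(i, score) for score, i in scores[:n]]

# Hypothetical documents-by-topics matrix: one row per document,
# one column per topic (made-up numbers, not real LDA output).
doc_topics = [
    [0.80, 0.10, 0.10],
    [0.70, 0.20, 0.10],
    [0.10, 0.80, 0.10],
    [0.05, 0.15, 0.80],
]

top = most_similar(doc_topics, query=0, n=2)
for index, score in top:
    print(index, round(score, 3))
```

For a real corpus the all-pairs comparison would of course have to be done
by something distributed rather than a loop like this.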

Is this the correct concept? Or am I missing something?

And, since LDA is not supported on Spark/Samsara, how could I achieve
similar results on Spark?


Thanks in advance,
David
