David,
LDA or LSI can work quite nicely for similarity (YMMV of course depending on 
the characterization of your documents).
For LDA you essentially take the dot product of the element-wise square roots 
of the two documents' topic vectors. If you search for Hellinger or 
Bhattacharyya distance, that will lead you to a good similarity or distance 
measure.
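As a rough sketch of that idea in plain Python: the document-topic matrix below is made up for illustration, and the helper names (bhattacharyya_coefficient, hellinger_distance, most_similar) are my own, not from Mahout, Spark, or Gensim. It assumes each row is a topic distribution that sums to 1.

```python
import math

def bhattacharyya_coefficient(p, q):
    """Dot product of the element-wise square roots of two distributions."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))

def hellinger_distance(p, q):
    """Hellinger distance: 0 for identical distributions, 1 for disjoint ones."""
    # Clamp to 1.0 so floating-point rounding cannot make the argument negative.
    bc = min(1.0, bhattacharyya_coefficient(p, q))
    return math.sqrt(1.0 - bc)

def most_similar(doc_topics, query_index, n=10):
    """Return the indices of the n rows closest to row `query_index`."""
    query = doc_topics[query_index]
    scored = [
        (hellinger_distance(query, row), i)
        for i, row in enumerate(doc_topics)
        if i != query_index  # skip the query document itself
    ]
    scored.sort()  # smallest distance first
    return [i for _, i in scored[:n]]

# Hypothetical document-topic matrix: 4 documents, 3 topics, rows sum to 1.
doc_topics = [
    [0.70, 0.20, 0.10],
    [0.65, 0.25, 0.10],
    [0.10, 0.10, 0.80],
    [0.05, 0.15, 0.80],
]

nearest = most_similar(doc_topics, 0, n=2)
```

This is a brute-force scan over all rows, which is fine for a small corpus; for a large one you would probably want an index or a distributed row-similarity job instead.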
As I recall, Spark does provide an LDA implementation. Gensim provides an API 
for doing LDA similarity out of the box. Vowpal Wabbit is also worth looking 
at, particularly for a large dataset.
Hope this is useful.
Cheers

> On Feb 14, 2016, at 8:14 AM, David Starina <david.star...@gmail.com> wrote:
> 
> Hi,
> 
> I need to build a system to determine the N (e.g. 10) most similar documents to
> a given document. I have some (theoretical) knowledge of Mahout algorithms,
> but not enough to build the system. Can you give me some suggestions?
> 
> At first I was researching Latent Semantic Analysis for the task, but since
> Mahout doesn't support it, I started researching some other options. I got
> a hint that instead of LSA, you can use LDA (Latent Dirichlet allocation)
> in Mahout to achieve similar and even better results.
> 
> However ... and this is where I got confused ... LDA is a clustering
> algorithm. However, what I need is not to cluster the documents into N
> clusters - I need to get a matrix (similar to TF-IDF) from which I can
> calculate some sort of a distance for any two documents to get N most
> similar documents for any given document.
> 
> How do I achieve that? My idea was (still mostly theoretical, since I have
> some problems with running the LDA algorithm) to extract some number of
> topics with LDA, not to cluster the documents with the help of these
> topics, but to get a matrix with documents as one dimension and topics as
> the other dimension. I was guessing I could then use this matrix as an
> input to a row-similarity algorithm.
> 
> Is this the correct concept? Or am I missing something?
> 
> And, since LDA is not supported on Spark/Samsara, how could I achieve
> similar results on Spark?
> 
> 
> Thanks in advance,
> David
