Did you want textual similarity? Or semantic similarity?
The actual semantics of a message can be opaque from the content, but clear from the usage.

On Sun, Feb 14, 2016 at 5:29 AM, Charles Earl <charlesce...@me.com> wrote:

> David,
> LDA or LSI can work quite nicely for similarity (YMMV of course, depending
> on the characterization of your documents).
> You basically use the dot product of the square roots of the vectors for
> LDA -- if you do a search for Hellinger or Bhattacharyya distance, that
> will lead you to a good similarity or distance measure.
> As I recall, Spark does provide an LDA implementation. Gensim provides an
> API for doing LDA similarity out of the box. Vowpal Wabbit is also worth
> looking at, particularly for a large dataset.
> Hope this is useful.
> Cheers
>
> > On Feb 14, 2016, at 8:14 AM, David Starina <david.star...@gmail.com> wrote:
> >
> > Hi,
> >
> > I need to build a system to determine the N (e.g. 10) most similar
> > documents to a given document. I have some (theoretical) knowledge of
> > Mahout algorithms, but not enough to build the system. Can you give me
> > some suggestions?
> >
> > At first I was researching Latent Semantic Analysis for the task, but
> > since Mahout doesn't support it, I started researching some other
> > options. I got a hint that instead of LSA, you can use LDA (Latent
> > Dirichlet allocation) in Mahout to achieve similar and even better
> > results.
> >
> > However ... and this is where I got confused ... LDA is a clustering
> > algorithm. What I need is not to cluster the documents into N
> > clusters - I need to get a matrix (similar to TF-IDF) from which I can
> > calculate some sort of a distance for any two documents, to get the N
> > most similar documents for any given document.
> >
> > How do I achieve that? My idea was (still mostly theoretical, since I
> > have some problems with running the LDA algorithm) to extract some
> > number of topics with LDA, but not to cluster the documents with the
> > help of these topics. Instead I would get the matrix with documents as
> > one dimension and topics as the other dimension. I was guessing I could
> > then use this matrix as an input to the row-similarity algorithm.
> >
> > Is this the correct concept? Or am I missing something?
> >
> > And, since LDA is not supported on Spark/Samsara, how could I achieve
> > similar results on Spark?
> >
> > Thanks in advance,
> > David
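
For what it's worth, here is a rough sketch of the "dot product of the square roots" idea from Charles's reply, applied to a document-by-topic matrix like the one David describes. It assumes each document has already been reduced to a topic distribution (non-negative weights summing to 1) by some LDA run. The function names, the `doc_topics` dictionary and the numbers in it are made up for illustration and do not come from Mahout, Spark or Gensim.

```python
import math


def bhattacharyya_coefficient(p, q):
    """Dot product of the element-wise square roots of two distributions."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))


def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions, in [0, 1]."""
    # Clamp to guard against the coefficient drifting just above 1.0
    # through floating-point rounding.
    bc = min(1.0, bhattacharyya_coefficient(p, q))
    return math.sqrt(1.0 - bc)


def most_similar(doc_topics, query_id, n=10):
    """Return up to n (doc_id, distance) pairs closest to query_id."""
    query = doc_topics[query_id]
    scored = [
        (hellinger_distance(query, vec), doc_id)
        for doc_id, vec in doc_topics.items()
        if doc_id != query_id
    ]
    scored.sort()  # smallest distance first
    return [(doc_id, dist) for dist, doc_id in scored[:n]]


# Hypothetical topic distributions (one row per document, three topics).
doc_topics = {
    "doc_a": [0.7, 0.2, 0.1],
    "doc_b": [0.6, 0.3, 0.1],
    "doc_c": [0.1, 0.1, 0.8],
}

print(most_similar(doc_topics, "doc_a", n=2))
```

This compares the query against every other document, which is fine for a small corpus; for a large one you would probably want the row-similarity job or an approximate nearest-neighbour index instead.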