Charles, thank you, I will check that out. Ted, I am looking for semantic similarity. Unfortunately, I do not have any data on the usage of the documents (if by usage you mean user behavior).
On Sun, Feb 14, 2016 at 4:04 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> Did you want textual similarity?
>
> Or semantic similarity?
>
> The actual semantics of a message can be opaque from the content, but clear
> from the usage.
>
> On Sun, Feb 14, 2016 at 5:29 AM, Charles Earl <charlesce...@me.com> wrote:
>
> > David,
> > LDA or LSI can work quite nicely for similarity (YMMV of course depending
> > on the characterization of your documents).
> > You basically use the dot product of the square roots of the vectors for
> > LDA -- if you do a search for Hellinger or Bhattacharyya distance that
> > will lead you to a good similarity or distance measure.
> > As I recall, Spark does provide an LDA implementation. Gensim provides an
> > API for doing LDA similarity out of the box. Vowpal Wabbit is also worth
> > looking at, particularly for a large dataset.
> > Hope this is useful.
> > Cheers
> >
> > Sent from my iPhone
> >
> > > On Feb 14, 2016, at 8:14 AM, David Starina <david.star...@gmail.com>
> > > wrote:
> > >
> > > Hi,
> > >
> > > I need to build a system to determine the N (e.g. 10) most similar
> > > documents to a given document. I have some (theoretical) knowledge of
> > > Mahout algorithms, but not enough to build the system. Can you give me
> > > some suggestions?
> > >
> > > At first I was researching Latent Semantic Analysis for the task, but
> > > since Mahout doesn't support it, I started researching some other
> > > options. I got a hint that instead of LSA, you can use LDA (Latent
> > > Dirichlet Allocation) in Mahout to achieve similar and even better
> > > results.
> > >
> > > However ... and this is where I got confused ... LDA is a clustering
> > > algorithm. However, what I need is not to cluster the documents into N
> > > clusters - I need to get a matrix (similar to TF-IDF) from which I can
> > > calculate some sort of a distance for any two documents to get the N
> > > most similar documents for any given document.
> > >
> > > How do I achieve that? My idea was (still mostly theoretical, since I
> > > have some problems with running the LDA algorithm) to extract some
> > > number of topics with LDA, not to cluster the documents with the help
> > > of these topics, but to get a matrix with documents as one dimension
> > > and topics as the other dimension. I was guessing I could then use this
> > > matrix as an input to a row-similarity algorithm.
> > >
> > > Is this the correct concept? Or am I missing something?
> > >
> > > And, since LDA is not supported on Spark/Samsara, how could I achieve
> > > similar results on Spark?
> > >
> > > Thanks in advance,
> > > David
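
For reference, here is a minimal sketch of the idea discussed in this thread: take a document-topic matrix (one row per document, one column per topic, each row summing to 1) and rank documents by Hellinger distance, which is built from the dot product of the square roots of two rows. It assumes LDA has already been run elsewhere; the matrix values and the function names hellinger and top_n_similar are made up for illustration and do not come from Mahout, Spark or Gensim.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions."""
    # Bhattacharyya coefficient: dot product of the element-wise square roots.
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))
    # Clamp at zero to guard against tiny floating-point overshoot above 1.
    return math.sqrt(max(0.0, 1.0 - bc))


def top_n_similar(doc_topics, query_index, n=10):
    """Return indices of the n documents closest to doc_topics[query_index]."""
    query = doc_topics[query_index]
    scored = [
        (hellinger(query, row), i)
        for i, row in enumerate(doc_topics)
        if i != query_index  # skip the query document itself
    ]
    scored.sort()  # smallest distance first
    return [i for _, i in scored[:n]]


# Hypothetical document-topic matrix: 4 documents, 3 topics.
doc_topics = [
    [0.80, 0.10, 0.10],
    [0.70, 0.20, 0.10],
    [0.10, 0.10, 0.80],
    [0.05, 0.90, 0.05],
]

print(top_n_similar(doc_topics, query_index=0, n=2))  # prints [1, 2]
```

This brute-force scan compares the query against every other document, which is fine for a small corpus. For a large one, a distributed row-similarity job or an approximate nearest-neighbour index would likely be the better fit.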