Re: Document similarity

Ted Dunning Sun, 14 Feb 2016 07:06:02 -0800

Did you want textual similarity?

Or semantic similarity?


The actual semantics of a message can be opaque from the content, but clear
from the usage.



On Sun, Feb 14, 2016 at 5:29 AM, Charles Earl <charlesce...@me.com> wrote:

> David,
> LDA or LSI can work quite nicely for similarity (YMMV of course depending
> on the characterization of your documents).
> For LDA, you basically use the dot product of the element-wise square
> roots of the topic vectors -- if you search for Hellinger or Bhattacharyya
> distance, that will lead you to a good similarity or distance measure.
> As I recall, Spark does provide an LDA implementation. Gensim provides an
> API for doing LDA similarity out of the box. Vowpal Wabbit is also worth
> looking at, particularly for a large dataset.
> Hope this is useful.
> Cheers
>
> Sent from my iPhone
>
> > On Feb 14, 2016, at 8:14 AM, David Starina <david.star...@gmail.com> wrote:
> >
> > Hi,
> >
> > I need to build a system to determine the N (e.g. 10) most similar
> > documents to a given document. I have some (theoretical) knowledge of
> > Mahout algorithms, but not enough to build the system. Can you give me
> > some suggestions?
> >
> > At first I was researching Latent Semantic Analysis for the task, but
> > since Mahout doesn't support it, I started researching some other
> > options. I got a hint that instead of LSA, you can use LDA (Latent
> > Dirichlet allocation) in Mahout to achieve similar or even better
> > results.
> >
> > However ... and this is where I got confused ... LDA is a clustering
> > algorithm. However, what I need is not to cluster the documents into N
> > clusters - I need to get a matrix (similar to TF-IDF) from which I can
> > calculate some sort of a distance for any two documents to get N most
> > similar documents for any given document.
> >
> > How do I achieve that? My idea was (still mostly theoretical, since I
> > have some problems with running the LDA algorithm) to extract some
> > number of topics with LDA. I would not cluster the documents with the
> > help of these topics, but instead get a matrix with documents as one
> > dimension and topics as the other. I was guessing I could then use this
> > matrix as an input to a row-similarity algorithm.
> >
> > Is this the correct concept? Or am I missing something?
> >
> > And, since LDA is not supported on Spark/Samsara, how could I achieve
> > similar results on Spark?
> >
> >
> > Thanks in advance,
> > David
>
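
For illustration, the approach discussed above (a documents-by-topics matrix,
compared row by row using the dot product of square roots) could be sketched
roughly as follows. This is only a minimal sketch in plain Python, not Mahout,
Spark, or Gensim code. The function names and the tiny doc_topics matrix are
hypothetical, and it assumes each row is a topic distribution (non-negative,
summing to 1) already produced by some LDA implementation.

```python
import math

def bhattacharyya_coefficient(p, q):
    """Dot product of the element-wise square roots of two distributions."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))

def hellinger_distance(p, q):
    """Hellinger distance, sqrt(1 - BC), in [0, 1] for valid distributions."""
    bc = min(1.0, bhattacharyya_coefficient(p, q))  # guard against rounding
    return math.sqrt(1.0 - bc)

def top_n_similar(doc_topics, query_index, n=10):
    """Return (doc_index, similarity) for the n rows most similar to the query row."""
    query = doc_topics[query_index]
    scored = [
        (i, bhattacharyya_coefficient(query, row))
        for i, row in enumerate(doc_topics)
        if i != query_index  # skip the query document itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]

# Hypothetical documents-by-topics matrix: 4 documents, 3 topics.
doc_topics = [
    [0.70, 0.20, 0.10],
    [0.65, 0.25, 0.10],
    [0.10, 0.10, 0.80],
    [0.05, 0.15, 0.80],
]

for index, similarity in top_n_similar(doc_topics, 0, n=2):
    print(index, round(similarity, 3))
```

This brute-force version compares the query against every other row, which is
fine for small collections. A large corpus would presumably need a distributed
row-similarity job or an approximate nearest-neighbour index instead.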
