Re: Document similarity

David Starina Sun, 14 Feb 2016 11:05:51 -0800

Charles, thank you, I will check that out.

Ted, I am looking for semantic similarity. Unfortunately, I do not have any
data on the usage of the documents (if by usage you mean user behavior).
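
The measure Charles mentions in the quoted message below (the dot product of
the element-wise square roots of two LDA topic vectors) is the Bhattacharyya
coefficient, and the Hellinger distance follows directly from it. A minimal
sketch in plain Python, assuming each document is already represented as a
topic-probability vector that sums to 1. The function names and the sample
vectors are made up for illustration and come from no particular library:

```python
import math

def bhattacharyya_coefficient(p, q):
    """Dot product of the element-wise square roots of p and q.

    For two probability vectors the result lies in [0, 1]; 1 means identical.
    """
    if len(p) != len(q):
        raise ValueError("topic vectors must have the same length")
    return sum(math.sqrt(a) * math.sqrt(b) for a, b in zip(p, q))

def hellinger_distance(p, q):
    """Hellinger distance in [0, 1]; 0 means identical distributions."""
    bc = bhattacharyya_coefficient(p, q)
    # Clamp at 0 to guard against tiny negative values from rounding.
    return math.sqrt(max(0.0, 1.0 - bc))

# Hypothetical topic mixtures over three topics (illustrative values only).
doc_a = [0.7, 0.2, 0.1]
doc_b = [0.6, 0.3, 0.1]
doc_c = [0.05, 0.05, 0.9]

print(hellinger_distance(doc_a, doc_b))  # small: similar topic mixtures
print(hellinger_distance(doc_a, doc_c))  # larger: different dominant topic
```

The coefficient can be used directly as a similarity score, or the distance
can be used wherever a metric is needed.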


On Sun, Feb 14, 2016 at 4:04 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> Did you want textual similarity?
>
> Or semantic similarity?
>
> The actual semantics of a message can be opaque from the content, but clear
> from the usage.
>
>
>
> On Sun, Feb 14, 2016 at 5:29 AM, Charles Earl <charlesce...@me.com> wrote:
>
> > David,
> > LDA or LSI can work quite nicely for similarity (YMMV of course depending
> > on the characterization of your documents).
> > You basically use the dot product of the square roots of the vectors for
> > LDA -- if you do a search for Hellinger or Bhattacharyya distance that
> > will lead you to a good similarity or distance measure.
> > As I recall, Spark does provide an LDA implementation. Gensim provides an
> > API for doing LDA similarity out of the box. Vowpal Wabbit is also worth
> > looking at, particularly for a large dataset.
> > Hope this is useful.
> > Cheers
> >
> > Sent from my iPhone
> >
> > > On Feb 14, 2016, at 8:14 AM, David Starina <david.star...@gmail.com>
> > wrote:
> > >
> > > Hi,
> > >
> > > I need to build a system to determine the N (e.g. 10) most similar
> > > documents to a given document. I have some (theoretical) knowledge of
> > > Mahout algorithms, but not enough to build the system. Can you give me
> > > some suggestions?
> > >
> > > At first I was researching Latent Semantic Analysis for the task, but
> > > since Mahout doesn't support it, I started researching some other
> > > options. I got a hint that instead of LSA, you can use LDA (Latent
> > > Dirichlet Allocation) in Mahout to achieve similar and even better
> > > results.
> > >
> > > However ... and this is where I got confused ... LDA is a clustering
> > > algorithm. However, what I need is not to cluster the documents into N
> > > clusters - I need to get a matrix (similar to TF-IDF) from which I can
> > > calculate some sort of a distance for any two documents to get N most
> > > similar documents for any given document.
> > >
> > > How do I achieve that? My idea was (still mostly theoretical, since I
> > > have some problems with running the LDA algorithm) to extract some
> > > number of topics with LDA, not in order to cluster the documents with
> > > the help of these topics, but to get a matrix with documents as one
> > > dimension and topics as the other. I was guessing I could then use this
> > > matrix as an input to a row-similarity algorithm.
> > >
> > > Is this the correct concept? Or am I missing something?
> > >
> > > And, since LDA is not supported on Spark/Samsara, how could I achieve
> > > similar results on Spark?
> > >
> > >
> > > Thanks in advance,
> > > David
> >
>
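
The idea in the original question (a documents-by-topics matrix fed into a
row-similarity step) can be sketched as follows. This is plain Python, not
Mahout or Spark code. The matrix values, the `top_n_similar` helper, and the
choice of cosine similarity are all assumptions made for illustration; in
practice the rows would come from whichever LDA implementation is used, and
a different row-similarity measure could be swapped in.

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_n_similar(doc_topic, query_index, n=10):
    """Return the n rows most similar to the query row.

    doc_topic is a list of rows, one per document, one column per topic.
    The result is a list of (row_index, similarity), best match first.
    """
    query = doc_topic[query_index]
    scored = (
        (i, cosine(query, row))
        for i, row in enumerate(doc_topic)
        if i != query_index  # skip the query document itself
    )
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])

# Hypothetical document-topic matrix: 4 documents, 3 topics.
doc_topic = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.2, 0.1, 0.7],
]

neighbours = top_n_similar(doc_topic, query_index=0, n=2)
for index, score in neighbours:
    print(index, round(score, 3))
```

This brute-force scan compares the query against every row, which is fine for
a small corpus; a large collection would need a distributed or approximate
nearest-neighbour step instead.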
