The method I described calculates similarity on the fly, but it requires new 
docs to go through feature extraction before similarity can be queried. 
Feature extraction takes little time compared to training LDA.
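
A rough sketch of that flow (purely illustrative: it assumes gensim 4.x for 
word2vec, stands in for the KNN/search engine with a brute-force cosine KNN 
from scikit-learn, omits the NER field, and uses corpus_texts / new_doc_text 
as placeholders for your data):

    import numpy as np
    from gensim.models import Word2Vec
    from sklearn.neighbors import NearestNeighbors

    # One-time feature extraction over the existing corpus.
    corpus_tokens = [d.lower().split() for d in corpus_texts]
    w2v = Word2Vec(sentences=corpus_tokens, vector_size=100, min_count=1).wv

    def doc_vector(tokens):
        # Average the word2vec vectors of the doc's tokens.
        vecs = [w2v[t] for t in tokens if t in w2v]
        return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

    index_vectors = np.vstack([doc_vector(t) for t in corpus_tokens])
    knn = NearestNeighbors(n_neighbors=10, metric="cosine").fit(index_vectors)

    # A new doc only needs feature extraction (fast) before it can be queried.
    query = doc_vector(new_doc_text.lower().split()).reshape(1, -1)
    distances, neighbors = knn.kneighbors(query)  # 10 most similar existing docs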

Another method that gets at semantic similarity uses adaptive skip-grams for 
text features: http://arxiv.org/abs/1502.07257. I haven’t tried it, but a 
friend saw a presentation about using this method to create features for a 
search engine, and it compared favorably with word2vec.

If you want to use LDA, note that it is an unsupervised categorization method. 
To use it, the cluster descriptors (vectors of important terms) can be 
compared to the analyzed incoming document using a KNN/search engine. That 
gives you a list of the closest clusters, but it doesn’t really give you 
documents, which I think is your goal. LDA should be re-run periodically to 
generate new clusters. Do you want cluster membership, or a list of similar docs?
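
For illustration, a minimal sketch of that comparison (assuming a gensim 
LdaModel, and standing in for the KNN/search engine with a brute-force cosine 
between the incoming doc and each topic's term vector; corpus_texts and 
new_doc_text are placeholders for your data):

    import numpy as np
    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    docs = [d.lower().split() for d in corpus_texts]
    dictionary = Dictionary(docs)
    corpus = [dictionary.doc2bow(d) for d in docs]
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=20)

    # Cluster descriptors: one term-weight vector per topic.
    descriptors = lda.get_topics()                # (num_topics, vocab_size)

    # Analyze the incoming document into the same term space.
    doc_vec = np.zeros(len(dictionary))
    for term_id, count in dictionary.doc2bow(new_doc_text.lower().split()):
        doc_vec[term_id] = count

    # Cosine against each descriptor gives the closest clusters, not docs.
    sims = descriptors @ doc_vec / (
        np.linalg.norm(descriptors, axis=1) * (np.linalg.norm(doc_vec) + 1e-12))
    closest_clusters = np.argsort(-sims)[:5]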

On Feb 23, 2016, at 1:01 PM, David Starina <david.star...@gmail.com> wrote:

Guys, one more question ... Are there any incremental methods for doing this?
I don't want to run the whole job again every time a new document is added. In
the case of LDA ... I guess the best way is to infer the topics of the new
document using the topics from the previous LDA run ... and then recalculate
the topics with the new documents every once in a while?
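
(A minimal sketch of that incremental pattern, assuming a gensim LdaModel 
named lda, its Dictionary named dictionary, and a bag-of-words corpus named 
corpus already built over the existing documents:)

    from gensim.models import LdaModel

    # Cheap: infer topics for the new document against the existing model.
    new_bow = dictionary.doc2bow(new_doc_text.lower().split())
    new_topics = lda.get_document_topics(new_bow)  # [(topic_id, probability), ...]

    # Optionally fold new documents into the model online...
    lda.update([new_bow])

    # ...and periodically retrain from scratch on the full, updated corpus.
    lda = LdaModel(corpus=corpus + [new_bow], id2word=dictionary, num_topics=20)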

On Sun, Feb 14, 2016 at 10:02 PM, Pat Ferrel <p...@occamsmachete.com> wrote:

> Something we are working on for purely content-based similarity is using a
> KNN engine (search engine), but creating the features from word2vec and an
> NER (Named Entity Recognizer).
> 
> Putting the generated features into fields of a doc can really help with
> similarity, because w2v and NER create semantic features. You can also try
> n-grams or skip-grams. These features are not very helpful for search, but
> for similarity they work well.
> 
> The query to the KNN engine is itself a document, with each field mapped to
> the corresponding field of the index. The result is the k nearest neighbors
> to the query doc.
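> 
> With a search engine like Elasticsearch, for example, that can be expressed
> as a more_like_this query built from an artificial document, field for field
> (a sketch only, assuming the 8.x Python client; the index and field names
> are made up):
> 
>     from elasticsearch import Elasticsearch
> 
>     es = Elasticsearch("http://localhost:9200")
> 
>     # Index each doc with its generated features in separate fields.
>     es.index(index="docs", id="1", document={
>         "text": "raw tokens ...",
>         "entities": "NER output ...",
>     })
> 
>     # The query is itself a document; each field is matched against the
>     # corresponding index field. The hits are the k nearest neighbors.
>     hits = es.search(index="docs", size=10, query={
>         "more_like_this": {
>             "fields": ["text", "entities"],
>             "like": [{"doc": {"text": "new doc tokens ...",
>                               "entities": "new doc entities ..."}}],
>             "min_term_freq": 1,
>             "min_doc_freq": 1,
>         }
>     })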
> 
> 
>> On Feb 14, 2016, at 11:05 AM, David Starina <david.star...@gmail.com> wrote:
>> 
>> Charles, thank you, I will check that out.
>> 
>> Ted, I am looking for semantic similarity. Unfortunately, I do not have any
>> data on the usage of the documents (if by usage you mean user behavior).
>> 
>> On Sun, Feb 14, 2016 at 4:04 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>> 
>>> Did you want textual similarity?
>>> 
>>> Or semantic similarity?
>>> 
>>> The actual semantics of a message can be opaque from the content, but
>>> clear from the usage.
>>> 
>>> 
>>> 
>>> On Sun, Feb 14, 2016 at 5:29 AM, Charles Earl <charlesce...@me.com> wrote:
>>> 
>>>> David,
>>>> LDA or LSI can work quite nicely for similarity (YMMV of course, depending
>>>> on the characterization of your documents).
>>>> You basically use the dot product of the square roots of the vectors for
>>>> LDA -- if you do a search for Hellinger or Bhattacharyya distance, that
>>>> will lead you to a good similarity or distance measure.
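>>>> For instance (a small illustrative sketch; p and q are two LDA topic
>>>> distributions, i.e. vectors that each sum to 1):
>>>> 
>>>>     import numpy as np
>>>> 
>>>>     def bhattacharyya_sim(p, q):
>>>>         # Dot product of the square roots of the two distributions.
>>>>         return float(np.sum(np.sqrt(p) * np.sqrt(q)))
>>>> 
>>>>     def hellinger_dist(p, q):
>>>>         # 0 for identical distributions, 1 for non-overlapping ones.
>>>>         return float(np.sqrt(max(0.0, 1.0 - bhattacharyya_sim(p, q))))
>>>> 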
>>>> As I recall, Spark does provide an LDA implementation. Gensim provides an
>>>> API for doing LDA similarity out of the box. Vowpal Wabbit is also worth
>>>> looking at, particularly for a large dataset.
>>>> Hope this is useful.
>>>> Cheers
>>>> 
>>>> Sent from my iPhone
>>>> 
>>>>> On Feb 14, 2016, at 8:14 AM, David Starina <david.star...@gmail.com> wrote:
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> I need to build a system to determine the N (e.g. 10) most similar
>>>>> documents to a given document. I have some (theoretical) knowledge of
>>>>> Mahout algorithms, but not enough to build the system. Can you give me
>>>>> some suggestions?
>>>>> 
>>>>> At first I was researching Latent Semantic Analysis for the task, but
>>>>> since Mahout doesn't support it, I started researching some other options.
>>>>> I got a hint that instead of LSA, you can use LDA (Latent Dirichlet
>>>>> Allocation) in Mahout to achieve similar or even better results.
>>>>> 
>>>>> However ... and this is where I got confused ... LDA is a clustering
>>>>> algorithm. What I need is not to cluster the documents into N clusters - I
>>>>> need a matrix (similar to TF-IDF) from which I can calculate some sort of
>>>>> distance between any two documents, to get the N most similar documents
>>>>> for any given document.
>>>>> 
>>>>> How do I achieve that? My idea (still mostly theoretical, since I have
>>>>> some problems with running the LDA algorithm) was to extract some number
>>>>> of topics with LDA, but not to cluster the documents with the help of
>>>>> these topics - rather, to get a matrix with documents as one dimension and
>>>>> topics as the other. I was guessing I could then use this matrix as an
>>>>> input to a row-similarity algorithm.
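>>>>> 
>>>>> Something like this, maybe (just a rough sketch -- the Dirichlet draw
>>>>> below is only a stand-in for the documents-by-topics matrix an LDA run
>>>>> would produce):
>>>>> 
>>>>>     import numpy as np
>>>>> 
>>>>>     # Stand-in doc-topic matrix: 100 docs, 20 topics, rows sum to 1.
>>>>>     doc_topics = np.random.dirichlet(np.ones(20), size=100)
>>>>> 
>>>>>     # Cosine row similarity (any row-similarity measure could go here).
>>>>>     norms = doc_topics / np.linalg.norm(doc_topics, axis=1, keepdims=True)
>>>>>     sim = norms @ norms.T
>>>>> 
>>>>>     i = 0                                   # any document index
>>>>>     top_10 = np.argsort(-sim[i])[1:11]      # its 10 most similar documents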
>>>>> 
>>>>> Is this the correct concept? Or am I missing something?
>>>>> 
>>>>> And, since LDA is not supported on Spark/Samsara, how could I achieve
>>>>> similar results on Spark?
>>>>> 
>>>>> 
>>>>> Thanks in advance,
>>>>> David
>>>> 
>>> 
> 
> 
