Guys, one more question ... Are there any incremental methods to do this? I don't want to rerun the whole job every time a new document is added. In the case of LDA ... I guess the best way is to infer topics for each new document using the model from the previous LDA run, and then recalculate the topics from scratch with the new documents every once in a while?
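To make the incremental idea concrete, here is a minimal, library-free sketch of the "fold-in" step: keep the topic-word distributions from the previous full LDA run fixed, and run a few EM iterations to infer a topic mixture for just the new document. The `phi` matrix and word ids below are toy values for illustration only (in practice they would come from the trained model and its dictionary):

```python
# Sketch of "folding in" a new document against fixed topics from a
# previous LDA run. phi is the (hypothetical) topic-word matrix; only the
# new document's topic mixture theta is inferred, topics stay fixed.

def fold_in(phi, doc_counts, iters=50):
    """Infer a topic distribution for a new doc via EM, topics held fixed."""
    k = len(phi)
    theta = [1.0 / k] * k                      # start from a uniform mixture
    for _ in range(iters):
        new_theta = [0.0] * k
        total = 0
        for w, count in doc_counts.items():
            # responsibility of each topic for word w under current theta
            weights = [theta[t] * phi[t][w] for t in range(k)]
            s = sum(weights)
            if s == 0:
                continue
            for t in range(k):
                new_theta[t] += count * weights[t] / s
            total += count
        theta = [v / total for v in new_theta]
    return theta

# Toy topics: topic 0 favours words 0-1, topic 1 favours words 2-3.
phi = [
    [0.45, 0.45, 0.05, 0.05],
    [0.05, 0.05, 0.45, 0.45],
]
new_doc = {0: 3, 1: 2, 2: 1}                   # word id -> count in new doc
theta = fold_in(phi, new_doc)                  # theta[0] should dominate
```

The periodic full retrain then just replaces `phi` (and re-folds nothing), which matches the "recalculate every once in a while" plan.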
On Sun, Feb 14, 2016 at 10:02 PM, Pat Ferrel <p...@occamsmachete.com> wrote:

> Something we are working on for purely content-based similarity is using a
> KNN engine (search engine) but creating features from word2vec and an NER
> (Named Entity Recognizer).
>
> Putting the generated features into fields of a doc can really help with
> similarity, because w2v and NER create semantic features. You can also try
> n-grams or skip-grams. These features are not very helpful for search, but
> for similarity they work well.
>
> The query to the KNN engine is a document, each field mapped to the
> corresponding field of the index. The result is the k nearest neighbors to
> the query doc.
>
> > On Feb 14, 2016, at 11:05 AM, David Starina <david.star...@gmail.com> wrote:
> >
> > Charles, thank you, I will check that out.
> >
> > Ted, I am looking for semantic similarity. Unfortunately, I do not have
> > any data on the usage of the documents (if by usage you mean user
> > behavior).
> >
> > On Sun, Feb 14, 2016 at 4:04 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> >
> >> Did you want textual similarity?
> >>
> >> Or semantic similarity?
> >>
> >> The actual semantics of a message can be opaque from the content, but
> >> clear from the usage.
> >>
> >> On Sun, Feb 14, 2016 at 5:29 AM, Charles Earl <charlesce...@me.com> wrote:
> >>
> >>> David,
> >>> LDA or LSI can work quite nicely for similarity (YMMV, of course,
> >>> depending on the characterization of your documents).
> >>> You basically use the dot product of the square roots of the vectors
> >>> for LDA -- if you do a search for Hellinger or Bhattacharyya distance,
> >>> that will lead you to a good similarity or distance measure.
> >>> As I recall, Spark does provide an LDA implementation. Gensim provides
> >>> an API for doing LDA similarity out of the box. Vowpal Wabbit is also
> >>> worth looking at, particularly for a large dataset.
> >>> Hope this is useful.
> >>> Cheers
> >>>
> >>> Sent from my iPhone
> >>>
> >>>> On Feb 14, 2016, at 8:14 AM, David Starina <david.star...@gmail.com> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> I need to build a system to determine the N (e.g. 10) most similar
> >>>> documents to a given document. I have some (theoretical) knowledge of
> >>>> Mahout algorithms, but not enough to build the system. Can you give
> >>>> me some suggestions?
> >>>>
> >>>> At first I was researching Latent Semantic Analysis for the task, but
> >>>> since Mahout doesn't support it, I started researching some other
> >>>> options. I got a hint that instead of LSA, you can use LDA (Latent
> >>>> Dirichlet Allocation) in Mahout to achieve similar and even better
> >>>> results.
> >>>>
> >>>> However ... and this is where I got confused ... LDA is a clustering
> >>>> algorithm. What I need is not to cluster the documents into N
> >>>> clusters - I need to get a matrix (similar to TF-IDF) from which I
> >>>> can calculate some sort of distance for any two documents, to get the
> >>>> N most similar documents for any given document.
> >>>>
> >>>> How do I achieve that? My idea was (still mostly theoretical, since I
> >>>> have some problems with running the LDA algorithm) to extract some
> >>>> number of topics with LDA - but not to cluster the documents with
> >>>> these topics; rather, to get a matrix with documents as one dimension
> >>>> and topics as the other. I was guessing I could then use this matrix
> >>>> as an input to a row-similarity algorithm.
> >>>>
> >>>> Is this the correct concept? Or am I missing something?
> >>>>
> >>>> And, since LDA is not supported on Spark/Samsara, how could I achieve
> >>>> similar results on Spark?
> >>>>
> >>>> Thanks in advance,
> >>>> David
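Charles's "dot product of the square roots" suggestion translates directly into code: that dot product is the Bhattacharyya coefficient, and the Hellinger distance follows from it. A minimal, library-free sketch over hypothetical document-topic rows (the `doc_topics` values below are made up; in practice each row would be a document's LDA topic distribution):

```python
import math

def hellinger(p, q):
    """Hellinger distance between two topic distributions.
    bc is the Bhattacharyya coefficient: the dot product of square roots."""
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    return math.sqrt(max(0.0, 1.0 - bc))      # clamp guards float round-off

def top_n_similar(doc_topics, query_id, n=10):
    """Return the n doc ids closest to the query doc by Hellinger distance."""
    query = doc_topics[query_id]
    ranked = sorted(
        (hellinger(query, vec), doc_id)
        for doc_id, vec in doc_topics.items()
        if doc_id != query_id
    )
    return [doc_id for _, doc_id in ranked[:n]]

# Hypothetical doc -> topic-distribution rows (the doc/topic matrix that
# the LDA step in the thread would produce).
doc_topics = {
    "a": [0.9, 0.1],
    "b": [0.8, 0.2],
    "c": [0.1, 0.9],
}
```

With rows like these, `top_n_similar` plays the role of the row-similarity step discussed above: "a" ranks "b" (similar mixture) ahead of "c".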