You might also look at this paper by Wallach http://maroo.cs.umass.edu/pub/web/getpdf.php?id=1101
Sent from my iPhone

> On Mar 11, 2016, at 8:11 AM, David Starina <david.star...@gmail.com> wrote:
>
> Well, there is also an online method of LDA in Spark ... Pat, is there any
> documentation on the method you described?
>
>> On Wed, Feb 24, 2016 at 6:10 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
>>
>> The method I described calculates similarity on the fly but requires new
>> docs to go through feature extraction before similarity can be queried.
>> The length of time to do feature extraction is short compared to training
>> LDA.
>>
>> Another method that gets at semantic similarity uses adaptive skip-grams
>> for text features: http://arxiv.org/abs/1502.07257 I haven't tried this,
>> but a friend saw a presentation about using this method to create features
>> for a search engine, which showed a favorable comparison with word2vec.
>>
>> If you want to use LDA, note that it is an unsupervised categorization
>> method. To use it, the cluster descriptors (a vector of important terms)
>> can be compared to the analyzed incoming document using a KNN/search
>> engine. This will give you a list of the closest clusters, but it doesn't
>> really give you documents, which I think is your goal. LDA should be
>> re-run periodically to generate new clusters. Do you want to know cluster
>> inclusion, or to get a list of similar docs?
>>
>> On Feb 23, 2016, at 1:01 PM, David Starina <david.star...@gmail.com> wrote:
>>
>> Guys, one more question ... Are there some incremental methods to do this?
>> I don't want to run the whole job again once a new document is added. In
>> the case of LDA ... I guess the best way is to calculate the topics on the
>> new document using the topics from the previous LDA run ... and then every
>> once in a while recalculate the topics with the new documents?
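The incremental approach David describes (infer topics for a new document against the topics from a previous LDA run, and only retrain periodically) can be sketched without any particular library. This is a minimal, hypothetical sketch: `phi` stands in for a trained topic-word matrix, `infer_topic_mixture` is an illustrative name, and the few fixed-point iterations are a simplified stand-in for proper LDA inference (toolkits such as gensim expose this directly, e.g. as document-topic inference on a trained model):

```python
import numpy as np

def infer_topic_mixture(phi, word_counts, n_iter=50):
    """Estimate a new document's topic mixture against fixed topics.

    phi:         (n_topics, n_words) topic-word probabilities from a
                 previous LDA run (each row sums to 1).
    word_counts: (n_words,) term counts of the incoming document.
    Returns a (n_topics,) mixture that sums to 1.
    """
    n_topics = phi.shape[0]
    theta = np.full(n_topics, 1.0 / n_topics)  # start from a uniform mixture
    for _ in range(n_iter):
        # responsibility of each topic for each word, given current theta
        r = theta[:, None] * phi                     # (n_topics, n_words)
        r /= r.sum(axis=0, keepdims=True) + 1e-12
        # re-estimate theta from expected topic counts
        theta = (r * word_counts).sum(axis=1)
        theta /= theta.sum()
    return theta

# Two toy topics over a 4-term vocabulary.
phi = np.array([
    [0.4, 0.4, 0.1, 0.1],   # topic 0 favors terms 0 and 1
    [0.1, 0.1, 0.4, 0.4],   # topic 1 favors terms 2 and 3
])
new_doc = np.array([5, 3, 0, 1])  # term counts of the incoming document
theta = infer_topic_mixture(phi, new_doc)
print(theta.round(3))  # weight concentrates on topic 0
```

Only the (cheap) inference step runs per new document; the expensive training job is re-run on whatever schedule keeps the topics fresh, as Pat suggests.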
>>
>> On Sun, Feb 14, 2016 at 10:02 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
>>
>>> Something we are working on for purely content-based similarity is using
>>> a KNN engine (search engine) but creating features from word2vec and an
>>> NER (Named Entity Recognizer).
>>>
>>> Putting the generated features into fields of a doc can really help with
>>> similarity, because w2v and NER create semantic features. You can also
>>> try n-grams or skip-grams. These features are not very helpful for search,
>>> but for similarity they work well.
>>>
>>> The query to the KNN engine is a document, each field mapped to the
>>> corresponding field of the index. The result is the k nearest neighbors
>>> to the query doc.
>>>
>>>> On Feb 14, 2016, at 11:05 AM, David Starina <david.star...@gmail.com> wrote:
>>>>
>>>> Charles, thank you, I will check that out.
>>>>
>>>> Ted, I am looking for semantic similarity. Unfortunately, I do not have
>>>> any data on the usage of the documents (if by usage you mean user
>>>> behavior).
>>>>
>>>> On Sun, Feb 14, 2016 at 4:04 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>>>>
>>>>> Did you want textual similarity?
>>>>>
>>>>> Or semantic similarity?
>>>>>
>>>>> The actual semantics of a message can be opaque from the content, but
>>>>> clear from the usage.
>>>>>
>>>>> On Sun, Feb 14, 2016 at 5:29 AM, Charles Earl <charlesce...@me.com> wrote:
>>>>>
>>>>>> David,
>>>>>> LDA or LSI can work quite nicely for similarity (YMMV, of course,
>>>>>> depending on the characteristics of your documents).
>>>>>> You basically use the dot product of the square roots of the vectors
>>>>>> for LDA -- if you do a search for Hellinger or Bhattacharyya distance,
>>>>>> that will lead you to a good similarity or distance measure.
>>>>>> As I recall, Spark does provide an LDA implementation. Gensim provides
>>>>>> an API for doing LDA similarity out of the box.
>>>>>> Vowpal Wabbit is also worth looking at, particularly for a large
>>>>>> dataset.
>>>>>> Hope this is useful.
>>>>>> Cheers
>>>>>>
>>>>>> Sent from my iPhone
>>>>>>
>>>>>>> On Feb 14, 2016, at 8:14 AM, David Starina <david.star...@gmail.com> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I need to build a system to determine the N (e.g. 10) most similar
>>>>>>> documents to a given document. I have some (theoretical) knowledge of
>>>>>>> Mahout algorithms, but not enough to build the system. Can you give
>>>>>>> me some suggestions?
>>>>>>>
>>>>>>> At first I was researching Latent Semantic Analysis for the task, but
>>>>>>> since Mahout doesn't support it, I started researching some other
>>>>>>> options. I got a hint that instead of LSA, you can use LDA (Latent
>>>>>>> Dirichlet Allocation) in Mahout to achieve similar and even better
>>>>>>> results.
>>>>>>>
>>>>>>> However ... and this is where I got confused ... LDA is a clustering
>>>>>>> algorithm. What I need is not to cluster the documents into N
>>>>>>> clusters -- I need to get a matrix (similar to TF-IDF) from which I
>>>>>>> can calculate some sort of distance for any two documents, to get the
>>>>>>> N most similar documents for any given document.
>>>>>>>
>>>>>>> How do I achieve that? My idea (still mostly theoretical, since I
>>>>>>> have some problems with running the LDA algorithm) was to extract
>>>>>>> some number of topics with LDA, but not to cluster the documents with
>>>>>>> the help of these topics -- instead, to get a matrix with documents
>>>>>>> as one dimension and topics as the other dimension. I was guessing I
>>>>>>> could then use this matrix as an input to a row-similarity algorithm.
>>>>>>>
>>>>>>> Is this the correct concept? Or am I missing something?
>>>>>>>
>>>>>>> And, since LDA is not supported on Spark/Samsara, how could I achieve
>>>>>>> similar results on Spark?
>>>>>>>
>>>>>>> Thanks in advance,
>>>>>>> David
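Charles's "dot product of the square roots of the vectors" is the Bhattacharyya coefficient, which underlies the Hellinger distance he points to. A minimal numpy sketch of using it to rank rows of a document-topic matrix (toy data; `hellinger_similarity` and `top_n_similar` are hypothetical helper names, not anything from Mahout, Spark, or gensim):

```python
import numpy as np

def hellinger_similarity(p, q):
    """Bhattacharyya coefficient of two topic distributions:
    dot product of their element-wise square roots.
    Equals 1.0 for identical distributions, 0.0 for disjoint ones."""
    return float(np.sqrt(p) @ np.sqrt(q))

def top_n_similar(doc_topics, query_idx, n=2):
    """Rank all other rows of the document-topic matrix against one row."""
    sims = [(i, hellinger_similarity(doc_topics[query_idx], row))
            for i, row in enumerate(doc_topics) if i != query_idx]
    return sorted(sims, key=lambda t: t[1], reverse=True)[:n]

# Toy document-topic matrix (each row sums to 1), i.e. LDA-style output.
doc_topics = np.array([
    [0.90, 0.05, 0.05],
    [0.80, 0.10, 0.10],
    [0.10, 0.10, 0.80],
    [0.05, 0.90, 0.05],
])
print(top_n_similar(doc_topics, 0))  # doc 1 ranks first: same dominant topic
```

The corresponding Hellinger *distance* is `sqrt(1 - hellinger_similarity(p, q))`; either form gives the same ranking, which is all the top-N query needs.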