Re: Document similarity

2016-02-14 Thread Pat Ferrel
Something we are working on for purely content-based similarity is using a KNN engine (search engine) but creating features from word2vec and an NER (Named Entity Recognizer). Putting the generated features into fields of a doc can really help with similarity because w2v and NER create

Re: Document similarity

2016-02-14 Thread David Starina
Charles, thank you, I will check that out. Ted, I am looking for semantic similarity. Unfortunately, I do not have any data on the usage of the documents (if by usage you mean user behavior). On Sun, Feb 14, 2016 at 4:04 PM, Ted Dunning wrote: > Did you want textual

Document similarity

2016-02-14 Thread David Starina
Hi, I need to build a system to determine the N (e.g. 10) most similar documents to a given document. I have some (theoretical) knowledge of Mahout algorithms, but not enough to build the system. Can you give me some suggestions? At first I was researching Latent Semantic Analysis for the task, but
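
For reference, the task described above (return the N documents most similar to a given one) can be sketched with a plain TF-IDF and cosine-similarity baseline. This is only an illustration under stated assumptions: it is not Mahout code and it is not Latent Semantic Analysis, and the function names (`tokenize`, `tfidf_vectors`, `cosine`, `most_similar`) and the smoothed IDF formula are choices made for this sketch, not anything taken from the thread.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase alphanumeric runs only.
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    # Build one sparse TF-IDF vector (dict: term -> weight) per document.
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF (an assumption of this sketch): log((1+N)/(1+df)) + 1.
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(a, b):
    # Cosine similarity between two sparse vectors; 0.0 if either is empty.
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(docs, query_index, n=10):
    # Return up to n (doc_index, score) pairs, most similar first,
    # excluding the query document itself.
    vectors = tfidf_vectors(docs)
    query = vectors[query_index]
    scored = [
        (cosine(query, vec), i)
        for i, vec in enumerate(vectors)
        if i != query_index
    ]
    # Sort by descending score, breaking ties by document index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(i, score) for score, i in scored[:n]]
```

Calling `most_similar(docs, 0, n=10)` on a list of document strings returns index and score pairs for the nearest neighbours of the first document. A baseline like this captures only textual overlap; semantic similarity would need something beyond raw term weights.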

Re: Document similarity

2016-02-14 Thread Ted Dunning
Did you want textual similarity? Or semantic similarity? The actual semantics of a message can be opaque from the content, but clear from the usage. On Sun, Feb 14, 2016 at 5:29 AM, Charles Earl wrote: > David, > LDA or LSI can work quite nicely for similarity (YMMV of