Re: LSA Implementation
Lance,

It does cover European languages, but has pretty much nothing on Asian languages (CJK).

- Eswar

On Nov 28, 2007 1:51 AM, Norskog, Lance [EMAIL PROTECTED] wrote:
> WordNet itself is English-only. There are various ontology projects for it.
> http://www.globalwordnet.org/ is a separate world-language database project;
> I found it at the bottom of the WordNet Wikipedia page. Thanks for starting
> me on the search!
>
> Lance
RE: LSA Implementation
WordNet itself is English-only. There are various ontology projects for it. http://www.globalwordnet.org/ is a separate world-language database project; I found it at the bottom of the WordNet Wikipedia page. Thanks for starting me on the search!

Lance

-----Original Message-----
From: Eswar K
Sent: Monday, November 26, 2007 6:50 PM
To: solr-user@lucene.apache.org
Subject: Re: LSA Implementation

> The languages also include CJK :) among others.
>
> - Eswar
Re: LSA Implementation
Using WordNet may require some kind of disambiguation approach, otherwise you can end up with a lot of synonyms. I would also look into how much coverage there is for non-English languages. If you have the resources, you may be better off developing or finding your own synonym/concept list based on your genres. You may also look into other approaches for assigning concepts offline and adding them to the document.

-Grant

On Nov 27, 2007, at 3:21 PM, Norskog, Lance wrote:
> WordNet itself is English-only. There are various ontology projects for it.
> http://www.globalwordnet.org/ is a separate world-language database project.

--
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
Re: LSA Implementation
LSA (http://en.wikipedia.org/wiki/Latent_semantic_indexing) is patented, so it is not likely to happen unless the authors donate the patent to the ASF.

-Grant

On Nov 26, 2007, at 8:23 AM, Eswar K wrote:
> All,
>
> Is there any plan to implement Latent Semantic Analysis as part of Solr
> anytime in the near future?
>
> Regards,
> Eswar
Re: LSA Implementation
Interesting. Patents are valid for 20 years, so it expires next year? :)

pLSA does not seem to have been patented; at least it is not mentioned in http://en.wikipedia.org/wiki/Probabilistic_latent_semantic_analysis

On Nov 26, 2007 6:58 AM, Grant Ingersoll [EMAIL PROTECTED] wrote:
> LSA (http://en.wikipedia.org/wiki/Latent_semantic_indexing) is patented,
> so it is not likely to happen unless the authors donate the patent to the ASF.
Re: LSA Implementation
I was just searching for info on LSA and came across the Semantic Indexing project under a GNU license...which of course is still under development, and in C++ though.

- Eswar

On Nov 26, 2007 9:56 PM, Jack [EMAIL PROTECTED] wrote:
> Interesting. Patents are valid for 20 years so it expires next year? :)
> PLSA does not seem to have been patented.
Re: LSA Implementation
On Nov 26, 2007 6:58 AM, Grant Ingersoll [EMAIL PROTECTED] wrote:
> LSA (http://en.wikipedia.org/wiki/Latent_semantic_indexing) is patented,
> so it is not likely to happen unless the authors donate the patent to the ASF.

There are many ways to catch a bird... LSA reduces to SVD on the term-frequency matrix. I have had limited success using JAMA's SVD, which is public domain. It's pure Java; for something serious you'd want to wrap the hard bits in MKL/Accelerate.

A more interesting Solr-related question is where a very heavy process like SVD would operate. You'd want to run the 'training' half of it separately from indexing or querying; it'd almost be like an optimize. Is there any hook right now to give Solr a command like <updateModels/> and map it to a class in the solrconfig? The 'classify' half of the SVD can happen at query or index time, very quickly; I imagine that could even be a custom field type.
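A minimal sketch of the training/classify split described above, using numpy's SVD in place of JAMA's. The matrix, the documents, and the `fold_in` helper are all invented for illustration, not Solr or Lucene APIs: "training" factors the term-document matrix once (the expensive step), and "classifying" folds a new term vector into the existing concept space cheaply.

```python
import numpy as np

# Columns are documents; rows are term frequencies
# (terms: cat, dog, pet, stock, bond, market). Docs 0-1 are about
# animals, docs 2-3 about finance. Toy data for illustration only.
A = np.array([
    [2, 1, 0, 0],   # cat
    [1, 2, 0, 0],   # dog
    [1, 1, 0, 0],   # pet
    [0, 0, 3, 1],   # stock
    [0, 0, 1, 2],   # bond
    [0, 0, 1, 1],   # market
], dtype=float)

# "Training": the heavy one-off factorization, truncated to k concepts.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk = U[:, :k], s[:k]

def fold_in(term_vec):
    """'Classify': cheaply project a term-frequency vector into concept space."""
    return (term_vec @ Uk) / sk

def cos(a, b):
    """Cosine similarity in the reduced concept space."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# A query for just "pet" lands near the animal documents in concept space,
# even though it shares no comparison with them beyond the latent concepts.
q = fold_in(np.array([0, 0, 1, 0, 0, 0], dtype=float))
animal_doc = fold_in(A[:, 0])
finance_doc = fold_in(A[:, 2])
assert cos(q, animal_doc) > cos(q, finance_doc)
```

The fold-in step is just a matrix-vector product against the stored factors, which is why the 'classify' half can plausibly run at query or index time while the SVD itself runs offline.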
Re: LSA Implementation
LDA (Latent Dirichlet Allocation) is a similar technique that extends pLSI. You can find some implementations in C++ and Java on the Web.

Grant Ingersoll wrote:
> Interesting. I am not a lawyer, but my understanding has always been that
> this is not something we could do. The question has come up from time to
> time on the Lucene mailing list:
> http://www.gossamer-threads.com/lists/engine?list=lucenedo=search_resultssearch_forum=forum_3search_string=Latent+Semanticsearch_type=AND
> That being said, there may be other approaches that do similar things that
> aren't covered by a patent, I don't know. Is there something specific you
> want to do, or are you just going by the promise of better results using
> LSI? I suppose if someone said they had a patch for Lucene/Solr that
> implemented it, we could ask on legal-discuss for advice.
>
> -Grant

--
Renaud Delbru,
E.C.S., M.Sc. Student,
Semantic Information Systems and Language Engineering Group (SmILE),
Digital Enterprise Research Institute,
National University of Ireland, Galway.
http://smile.deri.ie/
Re: LSA Implementation
: A more interesting solr related question is where a very heavy process like
: SVD would operate. You'd want to run the 'training' half of it separate from
: indexing or querying. It'd almost be like an optimize. Is there any hook
: right now to give Solr a command like <updateModels/> and map it to the
: class in the solrconfig?

The EventListener plugin type lets you register arbitrary java code to be run after a commit or an optimize (before a new searcher is opened) ... this is the same hook mechanism that is used to trigger snapshots on masters and do explicit warming on slaves.

There was talk about creating a request handler that could be used to trigger arbitrary events and execute all of the EventListeners (so you could create a new updateModels event type, independent of commit and optimize) but no one has ever submitted a patch...

http://issues.apache.org/jira/browse/SOLR-371

-Hoss
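As a sketch of the hook Hoss describes: listeners are registered in solrconfig.xml against the `postCommit`/`postOptimize` events (those element and event names are Solr's own conventions; the listener class below is hypothetical, standing in for custom model-update code).

```xml
<!-- Hypothetical listener: rebuild the SVD model after each optimize.
     com.example.UpdateModelsListener is an invented class name; it would
     implement Solr's EventListener plugin interface. -->
<listener event="postOptimize" class="com.example.UpdateModelsListener"/>
```

This runs inside the commit/optimize lifecycle; a dedicated `updateModels`-style event, decoupled from commits, is exactly what the SOLR-371 discussion above was about.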
Re: LSA Implementation
We essentially are looking at having an implementation for doing search which can return documents having conceptually similar words, without those documents necessarily having the original word searched for.

- Eswar

On Nov 27, 2007 12:06 AM, Grant Ingersoll [EMAIL PROTECTED] wrote:
> Interesting. I am not a lawyer, but my understanding has always been that
> this is not something we could do. Is there something specific you want to
> do, or are you just going by the promise of better results using LSI?
Re: LSA Implementation
On Nov 26, 2007, at 6:06 PM, Eswar K wrote:
> We essentially are looking at having an implementation for doing search
> which can return documents having conceptually similar words without
> necessarily having the original word searched for.

Very challenging. Say someone searches for "LSA" and hits an archived version of the mail you sent to this list. "LSA" is a reasonably discriminating term. But so is "Eswar". If you knew that the original term was "LSA", then you might look for documents near it in term-vector space. But if you don't know the original term, only the content of the document, how do you know whether you should look for docs near "lsa" or "eswar"?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Re: LSA Implementation
In addition to recording which keywords a document contains, the method examines the document collection as a whole, to see which other documents contain some of those same words. The algo should consider documents that have many words in common to be semantically close, and ones with few words in common to be semantically distant. This simple method correlates surprisingly well with how a human being, looking at content, might classify a document collection. Although the algorithm doesn't understand anything about what the words *mean*, the patterns it notices can make it seem astonishingly intelligent.

When you search such an index, the search engine looks at similarity values it has calculated for every content word, and returns the documents that it thinks best fit the query. Because two documents may be semantically very close even if they do not share a particular keyword, this algo will often return relevant documents that don't contain the keyword at all, where a plain keyword search would fail without an exact match.

- Eswar

On Nov 27, 2007 7:51 AM, Marvin Humphrey [EMAIL PROTECTED] wrote:
> Very challenging. Say someone searches for "LSA" and hits an archived
> version of the mail you sent to this list.
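The "many words in common means semantically close" step described above can be sketched as plain bag-of-words cosine similarity, before any latent-semantic reduction is applied. The documents below are invented for illustration; only the Python standard library is used.

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words term-frequency vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

docs = {
    "d1": "the cat sat on the mat",
    "d2": "a cat and a dog sat on a mat",
    "d3": "stock prices fell sharply today",
}
vecs = {name: Counter(text.split()) for name, text in docs.items()}

# d1 and d2 share several words, so they score much higher than d1 vs d3,
# which share none.
assert cosine(vecs["d1"], vecs["d2"]) > cosine(vecs["d1"], vecs["d3"])
```

LSA's contribution is precisely to go beyond this: after dimensionality reduction, d1 and a document about "kittens" could still score as close even with zero shared words, which raw cosine over keywords cannot do.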
RE: LSA Implementation
The WordNet project at Princeton (USA) is a large database of synonyms. If you're only working in English, this might be useful instead of running your own analyses.

http://en.wikipedia.org/wiki/WordNet
http://wordnet.princeton.edu/

Lance

-----Original Message-----
From: Eswar K
Sent: Monday, November 26, 2007 6:34 PM
To: solr-user@lucene.apache.org
Subject: Re: LSA Implementation

> In addition to recording which keywords a document contains, the method
> examines the document collection as a whole, to see which other documents
> contain some of those same words.
Re: LSA Implementation
The languages also include CJK :) among others.

- Eswar

On Nov 27, 2007 8:16 AM, Norskog, Lance [EMAIL PROTECTED] wrote:
> The WordNet project at Princeton (USA) is a large database of synonyms.
> If you're only working in English this might be useful instead of running
> your own analyses.
>
> http://en.wikipedia.org/wiki/WordNet
> http://wordnet.princeton.edu/
Re: LSA Implementation
On Nov 26, 2007, at 6:34 PM, Eswar K wrote:
> Although the algorithm doesn't understand anything about what the words
> *mean*, the patterns it notices can make it seem astonishingly intelligent.

Perhaps I should have been less curt. I've read a few papers on LSA, so I'm familiar, at least in passing, with everything you describe above. It would be entertaining to write an implementation, and I've considered it... but it's a low priority while the patent's in force.

A full term-vector space calculation is... expensive :) ... so LSA performs dimensionality reduction. Tuning the algorithm for a threshold effect, not just against n words in common but against a rough approximation of n words in common, is presumably non-trivial. If you can either find or write open source software that pulls off such "astonishingly intelligent" matches despite the many challenges, kudos. I'd love to see it.

Cheers,

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/