Re: Term extraction
On Sep 21, 2007, at 3:37 AM, Pieter Berkel wrote:
> Thanks for the response guys:
> Grant: I had a brief look at LingPipe, it looks quite interesting but I'm concerned that the licensing may prevent me from using it in my project.

Does the OpenNLP license look good for you? It's LGPL. It doesn't have all the features of LingPipe, but it works pretty well.

https://sourceforge.net/projects/opennlp/
Re: Term extraction
Thanks for the response guys:

Grant: I had a brief look at LingPipe, it looks quite interesting but I'm concerned that the licensing may prevent me from using it in my project.

Michael: I have used the Yahoo API in the past but, due to its generic nature, I wasn't entirely happy with the results in my test cases.

Yonik: This is the approach I had in mind; will it still work if I put the SynonymFilter after the word-delimiter filter in the schema config? Ideally I want to strip out the underscore char before it gets indexed. Is that possible by using a PatternReplaceFilterFactory after the SynonymFilter?

Cheers,
Pieter

On 21/09/2007, Yonik Seeley [EMAIL PROTECTED] wrote:
> On 9/19/07, Pieter Berkel [EMAIL PROTECTED] wrote:
> > However, I'd like to be able to analyze documents more intelligently to recognize phrase keywords such as "open source", "Microsoft Office", "Bill Gates" rather than splitting each word into separate tokens (the field is never used in search queries, so matching is not an issue). I've been looking at SynonymFilterFactory as a possible solution to this problem but haven't been able to work out the specifics of how to configure it for phrase mappings.
>
> SynonymFilter works out-of-the-box with multi-token synonyms...
>
> Microsoft Office => microsoft_office
> Bill Gates, William Gates => bill_gates
>
> Just don't use a word-delimiter filter if you use underscore to join words.
>
> -Yonik
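For reference, the analyzer chain being discussed might look roughly like this in schema.xml. This is an illustrative sketch, not tested config: the field type name is made up, and the exact filter attributes should be checked against the Solr version in use.

```xml
<fieldType name="keyword_phrases" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- map multi-word phrases to single underscore-joined tokens -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="false"/>
    <!-- strip the joining underscore before indexing
         (the step Pieter asks about above) -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="_" replacement="" replace="all"/>
  </analyzer>
</fieldType>
```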
Re: Term extraction
On 9/21/07, Pieter Berkel [EMAIL PROTECTED] wrote:
> Yonik: This is the approach I had in mind; will it still work if I put the SynonymFilter after the word-delimiter filter in the schema config?

SynonymFilter doesn't currently have the capability to handle multiple tokens at the same position in the input. You could simply remove the WordDelimiterFilter unless you need it.

> Ideally I want to strip out the underscore char before it gets indexed

Why's that? You could just define your synonyms that way initially:

Bill Gates, William Gates => billgates

-Yonik
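A synonyms.txt along the lines Yonik suggests could map phrases straight to their final single-token form, making a later underscore-stripping filter unnecessary. The entries below are illustrative, drawn from the examples in this thread:

```
# synonyms.txt: collapse multi-word phrases directly to single tokens
Microsoft Office => microsoftoffice
Bill Gates, William Gates => billgates
open source => opensource
```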
Re: Term extraction
Not sure if this is in the same league or not, but Yahoo offers a term extraction web service:

http://developer.yahoo.com/search/content/V1/termExtraction.html

On 9/20/07, Grant Ingersoll [EMAIL PROTECTED] wrote:
> You might investigate some tools like Alias-i's LingPipe or do some searches for phrase recognition software, etc.
>
> -Grant
>
> On Sep 19, 2007, at 9:58 PM, Pieter Berkel wrote:
> > I'm currently looking at methods of term extraction and automatic keyword generation from indexed documents. I've been experimenting with MoreLikeThis and the values returned by the mlt.interestingTerms parameter, and so far this approach has worked well. However, I'd like to be able to analyze documents more intelligently to recognize phrase keywords such as "open source", "Microsoft Office", "Bill Gates" rather than splitting each word into separate tokens (the field is never used in search queries, so matching is not an issue). I've been looking at SynonymFilterFactory as a possible solution to this problem but haven't been able to work out the specifics of how to configure it for phrase mappings. Has anybody else dealt with this problem before, or is anyone able to offer any insights into how to achieve the desired results?
> >
> > Thanks in advance,
> > Pieter
>
> --
> Grant Ingersoll
> http://lucene.grantingersoll.com
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ

--
Michael Kimsal
http://webdevradio.com
Re: Term extraction
On 9/19/07, Pieter Berkel [EMAIL PROTECTED] wrote:
> However, I'd like to be able to analyze documents more intelligently to recognize phrase keywords such as "open source", "Microsoft Office", "Bill Gates" rather than splitting each word into separate tokens (the field is never used in search queries, so matching is not an issue). I've been looking at SynonymFilterFactory as a possible solution to this problem but haven't been able to work out the specifics of how to configure it for phrase mappings.

SynonymFilter works out-of-the-box with multi-token synonyms...

Microsoft Office => microsoft_office
Bill Gates, William Gates => bill_gates

Just don't use a word-delimiter filter if you use underscore to join words.

-Yonik
Term extraction
I'm currently looking at methods of term extraction and automatic keyword generation from indexed documents. I've been experimenting with MoreLikeThis and the values returned by the mlt.interestingTerms parameter, and so far this approach has worked well.

However, I'd like to be able to analyze documents more intelligently to recognize phrase keywords such as "open source", "Microsoft Office", "Bill Gates" rather than splitting each word into separate tokens (the field is never used in search queries, so matching is not an issue). I've been looking at SynonymFilterFactory as a possible solution to this problem but haven't been able to work out the specifics of how to configure it for phrase mappings.

Has anybody else dealt with this problem before, or is anyone able to offer any insights into how to achieve the desired results?

Thanks in advance,
Pieter
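For context, a MoreLikeThis request of the kind described might look something like the following. The handler path, field name, and host are assumptions for illustration; mlt.interestingTerms accepts none, list, or details:

```
http://localhost:8983/solr/select?q=id:1234&mlt=true&mlt.fl=body
    &mlt.interestingTerms=details&mlt.mintf=1&mlt.mindf=1
```

With interestingTerms=details, the response includes the extracted terms together with their relative scores, which is the output being used for keyword generation here.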
Re: Term extraction
On Sep 19, 2007, at 9:58 PM, Pieter Berkel wrote:
> I'm currently looking at methods of term extraction and automatic keyword generation from indexed documents.

We do it manually (not in Solr, but we put the results in Solr). We do it the usual way: chunk (into n-grams, named entities, noun phrases) and count (tf, df). It works well enough. There is a bevy of literature on the topic if you want to get smart, but be warned: smart and fast are likely not very good friends. A lot depends on the provenance of your data. Is it clean text that uses a lot of domain-specific terms? Is it webtext?
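Brian's chunk-and-count recipe can be sketched in a few lines of Python. This is a toy illustration of the idea, not his actual pipeline: chunk text into word n-grams, count term frequency per document and document frequency across the corpus, then rank candidates so that chunks frequent in one document but rare across the corpus come out on top.

```python
import re
from collections import Counter

def ngrams(text, n):
    """Yield word n-grams from text, lowercased."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def keyword_candidates(docs, n=2, top=5):
    """Rank n-gram chunks per document by tf/df."""
    df = Counter()                    # document frequency per chunk
    per_doc_tf = []
    for doc in docs:
        tf = Counter(ngrams(doc, n))  # term frequency within this doc
        per_doc_tf.append(tf)
        df.update(tf.keys())          # each chunk counted once per doc
    results = []
    for tf in per_doc_tf:
        scored = sorted(tf, key=lambda g: tf[g] / df[g], reverse=True)
        results.append(scored[:top])
    return results

docs = [
    "open source search with open source tools",
    "microsoft office and open source alternatives",
]
print(keyword_candidates(docs, n=2, top=2))
```

Real systems would add the named-entity and noun-phrase chunkers Brian mentions and a smarter weighting than raw tf/df, but the counting skeleton is the same.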
Re: Term extraction
Thanks Brian, I think the smart approaches you refer to might be outside the scope of my current project. The documents I am indexing already have manually-generated keyword data; moving forward I'd like to have these keywords automatically generated, selected from a pre-defined list of keywords (i.e. the simple approach). The data is fairly clean and domain-specific, so I don't expect there will be more than several hundred of these phrase terms to deal with, which is why I was exploring the SynonymFilterFactory option.

Pieter

On 20/09/2007, Brian Whitman [EMAIL PROTECTED] wrote:
> On Sep 19, 2007, at 9:58 PM, Pieter Berkel wrote:
> > I'm currently looking at methods of term extraction and automatic keyword generation from indexed documents.
>
> We do it manually (not in Solr, but we put the results in Solr). We do it the usual way: chunk (into n-grams, named entities, noun phrases) and count (tf, df). It works well enough. There is a bevy of literature on the topic if you want to get smart, but be warned: smart and fast are likely not very good friends. A lot depends on the provenance of your data. Is it clean text that uses a lot of domain-specific terms? Is it webtext?