Re: Term extraction

2007-09-22 Thread Brian Whitman


On Sep 21, 2007, at 3:37 AM, Pieter Berkel wrote:

 Thanks for the response, guys:

 Grant: I had a brief look at LingPipe, it looks quite interesting but I'm
 concerned that the licensing may prevent me from using it in my project.

Does the opennlp license look good for you? It's LGPL. Not all the features
of LingPipe, but it works pretty well: https://sourceforge.net/projects/opennlp/





Re: Term extraction

2007-09-21 Thread Pieter Berkel
Thanks for the response, guys:

Grant: I had a brief look at LingPipe, it looks quite interesting but I'm
concerned that the licensing may prevent me from using it in my project.

Michael: I have used the Yahoo API in the past but, due to its generic
nature, I wasn't entirely happy with the results in my test cases.

Yonik: This is the approach I had in mind; will it still work if I put the
SynonymFilter after the word-delimiter filter in the schema config? Ideally
I want to strip out the underscore char before it gets indexed; is that
possible by using a PatternReplaceFilterFactory after the SynonymFilter?
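
Something like this is the chain I have in mind (untested sketch; the
pattern/replacement attributes are my reading of PatternReplaceFilterFactory,
so treat them as a guess):

  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="false"/>
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="_" replacement="" replace="all"/>
  </analyzer>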

Cheers,
Pieter



On 21/09/2007, Yonik Seeley [EMAIL PROTECTED] wrote:

 On 9/19/07, Pieter Berkel [EMAIL PROTECTED] wrote:
  However, I'd like to be able to
  analyze documents more intelligently to recognize phrase keywords such as
  "open source", "Microsoft Office", "Bill Gates" rather than splitting each
  word into separate tokens (the field is never used in search queries so
  matching is not an issue).  I've been looking at SynonymFilterFactory as a
  possible solution to this problem but haven't been able to work out the
  specifics of how to configure it for phrase mappings.

 SynonymFilter works out-of-the-box with multi-token synonyms...

 Microsoft Office => microsoft_office
 Bill Gates, William Gates => bill_gates

 Just don't use a word-delimiter filter if you use underscore to join
 words.

 -Yonik



Re: Term extraction

2007-09-21 Thread Yonik Seeley
On 9/21/07, Pieter Berkel [EMAIL PROTECTED] wrote:
 Yonik: This is the approach I had in mind, will it still work if I put the
 SynonymFilter after the word-delimiter filter in the schema config?

SynonymFilter doesn't currently have the capability to handle multiple
tokens at the same position in the input.  You could simply remove the
WordDelimiterFilter unless you need it.

 Ideally
 I want to strip out the underscore char before it gets indexed

Why's that?

You could just define your synonyms like that initially:
Bill Gates, William Gates => billgates

-Yonik


Re: Term extraction

2007-09-20 Thread Michael Kimsal
Not sure if this is in the same league or not, but Yahoo offers a term
extraction web service:

http://developer.yahoo.com/search/content/V1/termExtraction.html
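
If I remember the parameters correctly, it's a simple POST with appid and
context as the required fields (endpoint written from memory, so double-check
it against the docs):

  curl -d 'appid=YOUR_APP_ID' \
       -d 'context=Bill Gates founded Microsoft and works on Microsoft Office' \
       http://search.yahooapis.com/ContentAnalysisService/V1/termExtraction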



On 9/20/07, Grant Ingersoll [EMAIL PROTECTED] wrote:

 You might investigate some tools like Alias-i's LingPipe or do some
 searches for phrase recognition software, etc.

 -Grant

 On Sep 19, 2007, at 9:58 PM, Pieter Berkel wrote:

  I'm currently looking at methods of term extraction and automatic keyword
  generation from indexed documents.  I've been experimenting with
  MoreLikeThis and values returned by the mlt.interestingTerms parameter and
  so far this approach has worked well.  However, I'd like to be able to
  analyze documents more intelligently to recognize phrase keywords such as
  "open source", "Microsoft Office", "Bill Gates" rather than splitting each
  word into separate tokens (the field is never used in search queries so
  matching is not an issue).  I've been looking at SynonymFilterFactory as a
  possible solution to this problem but haven't been able to work out the
  specifics of how to configure it for phrase mappings.

  Has anybody else dealt with this problem before, or is anyone able to
  offer any insights into achieving the desired results?

  Thanks in advance,
  Pieter

 --
 Grant Ingersoll
 http://lucene.grantingersoll.com

 Lucene Helpful Hints:
 http://wiki.apache.org/lucene-java/BasicsOfPerformance
 http://wiki.apache.org/lucene-java/LuceneFAQ





-- 
Michael Kimsal
http://webdevradio.com


Re: Term extraction

2007-09-20 Thread Yonik Seeley
On 9/19/07, Pieter Berkel [EMAIL PROTECTED] wrote:
 However, I'd like to be able to
 analyze documents more intelligently to recognize phrase keywords such as
 "open source", "Microsoft Office", "Bill Gates" rather than splitting each
 word into separate tokens (the field is never used in search queries so
 matching is not an issue).  I've been looking at SynonymFilterFactory as a
 possible solution to this problem but haven't been able to work out the
 specifics of how to configure it for phrase mappings.

SynonymFilter works out-of-the-box with multi-token synonyms...

Microsoft Office => microsoft_office
Bill Gates, William Gates => bill_gates

Just don't use a word-delimiter filter if you use underscore to join words.

-Yonik
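
Concretely, it's just a synonyms file plus a field type that applies it at
index time -- rough sketch, untested, and the names are arbitrary:

  # synonyms.txt
  Microsoft Office => microsoft_office
  Bill Gates, William Gates => bill_gates

  <!-- schema.xml -->
  <fieldType name="keywords" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
              ignoreCase="true" expand="false"/>
    </analyzer>
  </fieldType>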


Term extraction

2007-09-19 Thread Pieter Berkel
I'm currently looking at methods of term extraction and automatic keyword
generation from indexed documents.  I've been experimenting with
MoreLikeThis and values returned by the mlt.interestingTerms parameter and
so far this approach has worked well.  However, I'd like to be able to
analyze documents more intelligently to recognize phrase keywords such as
"open source", "Microsoft Office", "Bill Gates" rather than splitting each
word into separate tokens (the field is never used in search queries so
matching is not an issue).  I've been looking at SynonymFilterFactory as a
possible solution to this problem but haven't been able to work out the
specifics of how to configure it for phrase mappings.
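
For reference, the kind of request I've been testing with (handler path and
field name are from my local setup, so yours may differ):

  http://localhost:8983/solr/mlt?q=id:1234&mlt.fl=body&mlt.mintf=1&mlt.interestingTerms=details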

Has anybody else dealt with this problem before, or is anyone able to offer
any insights into achieving the desired results?

Thanks in advance,
Pieter


Re: Term extraction

2007-09-19 Thread Brian Whitman

On Sep 19, 2007, at 9:58 PM, Pieter Berkel wrote:

 I'm currently looking at methods of term extraction and automatic keyword
 generation from indexed documents.


We do it manually (not in solr, but we put the results in solr.)  We do it
the usual way - chunk (into n-grams, named entities & noun phrases) and
count (tf & df).  It works well enough.  There is a bevy of literature on
the topic if you want to get "smart" -- but be warned, "smart" and "fast"
are likely not very good friends.
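
If it helps make that concrete, here's a toy version of the counting step
(illustrative Java, not our actual pipeline; the thresholds and names are
made up):

  import java.util.*;

  public class ChunkAndCount {
      public static void main(String[] args) {
          List<String> docs = Arrays.asList(
              "open source search with open source tools",
              "microsoft office and open source software");
          Map<String, Integer> tf = new HashMap<>();  // corpus-wide term frequency
          Map<String, Integer> df = new HashMap<>();  // document frequency
          for (String doc : docs) {
              String[] w = doc.toLowerCase().split("\\W+");
              Set<String> seen = new HashSet<>();     // grams seen in this doc
              for (int n = 2; n <= 3; n++) {          // bigram and trigram chunks
                  for (int i = 0; i + n <= w.length; i++) {
                      String gram = String.join(" ", Arrays.copyOfRange(w, i, i + n));
                      tf.merge(gram, 1, Integer::sum);
                      if (seen.add(gram)) df.merge(gram, 1, Integer::sum);
                  }
              }
          }
          // crude keep rule: phrases that recur across the corpus
          tf.forEach((g, c) -> {
              if (c > 1) System.out.println(g + "  tf=" + c + "  df=" + df.get(g));
          });
      }
  }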


A lot depends on the provenance of your data -- is it clean text that uses
a lot of domain-specific terms?  Is it webtext?




Re: Term extraction

2007-09-19 Thread Pieter Berkel
Thanks Brian, I think the "smart" approaches you refer to might be outside
the scope of my current project.  The documents I am indexing already have
manually-generated keyword data; moving forward, I'd like to have these
keywords automatically generated, selected from a pre-defined list of
keywords (i.e. the simple approach).

The data is fairly clean and domain-specific so I don't expect there will be
more than several hundred of these phrase terms to deal with, which is why I
was exploring the SynonymFilterFactory option.
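
For example, entries along these lines (made-up samples, just to show the
kind of mapping I have in mind):

  # synonyms.txt: map phrase variants onto one canonical keyword
  open source, open-source => open_source
  Microsoft Office, MS Office => microsoft_office
  Bill Gates, William Gates => bill_gates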

Pieter



On 20/09/2007, Brian Whitman [EMAIL PROTECTED] wrote:

 On Sep 19, 2007, at 9:58 PM, Pieter Berkel wrote:

  I'm currently looking at methods of term extraction and automatic keyword
  generation from indexed documents.

 We do it manually (not in solr, but we put the results in solr.)  We do it
 the usual way - chunk (into n-grams, named entities & noun phrases) and
 count (tf & df).  It works well enough.  There is a bevy of literature on
 the topic if you want to get "smart" -- but be warned, "smart" and "fast"
 are likely not very good friends.

 A lot depends on the provenance of your data -- is it clean text that uses
 a lot of domain-specific terms?  Is it webtext?