Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change 
notification.

The "OpenNLP" page has been changed by LanceXNorskog:
http://wiki.apache.org/solr/OpenNLP?action=diff&rev1=2&rev2=3

+ <!> [[Solr4.0]] <<TableOfContents(3)>>
- <!> [[Solr4.0]]
- <<TableOfContents(3)>>
  
- <!> This page discusses uncommitted code and design.  See 
[[https://issues.apache.org/jira/browse/LUCENE-2899|LUCENE-2899]] for the main 
JIRA issue tracking this development. The issue is packaged as a Solr contrib, 
but will be split between Lucene and Solr. There is some design work needed 
before this can be committed.
+ <!> This page discusses uncommitted code and design.  See 
[[https://issues.apache.org/jira/browse/LUCENE-2899|LUCENE-2899]] for the main 
JIRA issue tracking this development. The issue is packaged as a Solr contrib, 
but is split between Lucene and Solr.
  
- NLP is a large field of inquiry. Unless you are familiar with it you may find 
this patch confusing.
+ NLP is a large field of inquiry. Unless you are familiar with it you may find 
this patch confusing. The [[http://opennlp.apache.org/|Apache OpenNLP project]] 
is the best place to learn what this package can do.
  
  == Introduction ==
- 
  OpenNLP is a toolkit for Natural Language Processing (NLP). It is an Apache 
top-level project located [[http://opennlp.apache.org/|here]]. It includes 
implementations of many popular NLP algorithms. This project integrates some of 
its features into Lucene and Solr. This first effort incorporates Analyzer 
chain tools for sentence detection, tokenization, Parts-of-Speech tagging 
(nouns, verbs, interjections, etc.), Chunking (noun phrases, verb phrases) and 
Named Entity Recognition.  See the OpenNLP project page for information on the 
implementations.  Here are some use cases:
  
  === Indexing interesting words ===
@@ -18, +16 @@

  Chunking lets you create N-Grams only within noun and verb phrases.
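
To illustrate the idea, here is a minimal Python sketch of building N-Grams that never cross a phrase boundary. It is not code from the patch: the function name `ngrams_within_chunks` and the list-of-lists input are hypothetical, and it assumes a chunker has already grouped the tokens into phrases.

```python
from typing import List, Sequence, Tuple


def ngrams_within_chunks(
    chunks: Sequence[Sequence[str]], n: int = 2
) -> List[Tuple[str, ...]]:
    """Build n-grams inside each chunk, never spanning two chunks.

    `chunks` is assumed to be the chunker output: one token list per
    noun phrase or verb phrase (a hypothetical input shape).
    """
    result: List[Tuple[str, ...]] = []
    for chunk in chunks:
        # Slide a window of size n over this chunk only.
        for i in range(len(chunk) - n + 1):
            result.append(tuple(chunk[i:i + n]))
    return result


phrases = [["the", "quick", "fox"], ["jumped"], ["the", "lazy", "dog"]]
bigrams = ngrams_within_chunks(phrases, n=2)
```

With this input, no bigram pairs "fox" with "jumped", because those words belong to different phrases.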
  
  === Named Entity Recognition ===
- Named Entity Recognition identifies names, dates, places, currency and other 
types of data within free text. This is profoundly useful in searching. Or, you 
can create autosuggest entries with icons for 'Name', 'Place', etc.
+ Named Entity Recognition identifies names, dates, places, currency and other 
types of data within free text. This is profoundly useful in searching. Or, you 
can create facets or autosuggest entries with icons for 'Name', 'Place', etc.
  
  == Analyzer tools ==
- 
  The OpenNLP Tokenizer behavior is similar to the WhitespaceTokenizer but is 
smart about inter-word punctuation. The term stream looks very much like the 
way you parse words and punctuation while reading. The OpenNLP taggers assign 
payloads to terms. There are tools to filter the term stream according to the 
payload values, and to remove the payloads.
  
  === solr.OpenNLPTokenizerFactory ===
- 
  Tokenizes text into sentences or words.
  
  This Tokenizer uses the OpenNLP Sentence Detector and/or Tokenizer classes. 
When used together, the Tokenizer receives sentences and can do a better job. 
The arguments give the file names of the statistical models:
@@ -40, +36 @@

        </analyzer>
      </fieldType>
  }}}
- 
  === solr.OpenNLPFilterFactory ===
- 
- Tags words using one or more technologies: Parts-of-Speech, Chunking, and 
Named Entity Recognition. 
+ Tags words using one or more technologies: Parts-of-Speech, Chunking, and 
Named Entity Recognition.
  
  {{{
      <fieldType name="text_opennlp_pos" class="solr.TextField" 
positionIncrementGap="100">
@@ -51, +45 @@

          <tokenizer class="solr.OpenNLPTokenizerFactory"
            tokenizerModel="opennlp/en-token.bin"
          />
-         <filter class="solr.OpenNLPFilterFactory" 
+         <filter class="solr.OpenNLPFilterFactory"
            posTaggerModel="opennlp/en-pos-maxent.bin"
-         />       
+         />
        </analyzer>
      </fieldType>
  }}}
- 
  This example assigns parts of speech tags based on a model derived with the 
[[http://opennlp.apache.org/documentation/1.5.2-incubating/apidocs/opennlp-maxent/index.html|OpenNLP
 Maximum Entropy]] implementation. See 
[[http://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html#tools.postagger.tagging|OpenNLP
 Tagging]] for more information. The tags are from the 
[[http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html|Penn
 Treebank]] tagset.
  
  === solr.FilterPayloadsFilterFactory ===
- 
  Filter terms for certain payload values. In this example, retain only terms 
which have been marked 'nouns' and 'verbs' with the 
[[http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html|Penn
 Treebank]] tagset.
  
  {{{
          <filter class="solr.FilterPayloadsFilterFactory" keepPayloads="true"
            payloadList="NN,NNS,NNP,NNPS,VB,VBD,VBG,VBN,VBP,VBZ,FW"/>
  }}}
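
As a rough illustration of the effect on the term stream, the following Python sketch retains only the terms whose payload appears in the list. It is not the patch's implementation: the `Term` tuple and `filter_payloads` function are hypothetical names, the real filter operates on Lucene token streams, and the `keepPayloads` attribute is deliberately not modeled here.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class Term(NamedTuple):
    text: str
    payload: Optional[str]  # assumed to hold a Penn Treebank tag, e.g. "NN"


def filter_payloads(terms: Iterable[Term], payload_list: str) -> Iterator[Term]:
    """Yield only the terms whose payload is in the comma-separated list."""
    wanted = {tag.strip() for tag in payload_list.split(",")}
    for term in terms:
        if term.payload in wanted:
            yield term


tagged = [Term("The", "DT"), Term("dog", "NN"), Term("barked", "VBD")]
kept = list(filter_payloads(tagged, "NN,NNS,NNP,NNPS,VB,VBD,VBG,VBN,VBP,VBZ,FW"))
```

Here the determiner "The" is dropped, while the noun and the verb pass through with their payloads intact.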
- 
  === solr.StripPayloadsFilterFactory ===
- 
  Remove payloads from terms.
  
  {{{
          <filter class="solr.StripPayloadsFilterFactory"/>
  }}}
+ == Full Example ==
+ This "Noun-Verb Filter" field type assigns parts of speech, retains only 
nouns and verbs, and removes the payloads. Free-text search sites (for example, 
newspaper and magazine articles) may benefit from this.
  
- == Full Example ==
- 
- This "Noun-Verb Filter" field type assigns parts of speech, retains only 
nouns and verbs, and removes the payloads. Free-text search sites (for example, 
newspaper and magazine articles) may benefit from this.
  {{{
      <fieldType name="text_opennlp_nvf" class="solr.TextField" 
positionIncrementGap="100">
        <analyzer>
@@ -94, +83 @@

        </analyzer>
      </fieldType>
  }}}
- 
- This example should work well with most English-language free text. 
+ This example should work well with most English-language free text.
  
  == Installation ==
- 
  See the patch for more information. In short, you have to download the 
statistical models from SourceForge to make OpenNLP work, because the models do 
not have an Apache-compatible license.
  
