Dear Wiki user, You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.
The "OpenNLP" page has been changed by LanceXNorskog: http://wiki.apache.org/solr/OpenNLP?action=diff&rev1=3&rev2=4

This example should work well with most English-language free text.

== Installation ==
For English-language testing, until SOLR-2899 is committed:
 * pull the latest trunk or the 4.0 branch
 * apply the patch
 * run 'ant compile'
 * cd solr/contrib/opennlp/src/test-files/training
 * run 'bin/trainall.sh'
  * this creates the binary model files, which will be included in the distribution once the patch is committed.
Now go to trunk-dir/solr and run 'ant test-contrib'. This compiles the OpenNLP Lucene and Solr code against the OpenNLP libraries and runs the tests with the small model files.

== Deployment to Solr ==
A Solr core requires schema types for the OpenNLP Tokenizer & Filter, and also requires model files. The distribution includes a schema.xml in solr/contrib/opennlp/src/test-files/opennlp/solr/conf/ which demonstrates the OpenNLP-based analyzers. It does not contain other text types (to avoid falling out of date with the full text suite). Copy the text types from this file into your test collection's schema.xml, and download "real" models for testing. You may also have to add the OpenNLP lib directory to your solr/lib or solr/cores/collection/lib directory.

Now download these model files to solr/contrib/opennlp/src/test-files/opennlp/solr/conf/opennlp/:
 * [http://opennlp.sourceforge.net/models-1.5/]
 * The English-language models start with 'en'. The 'maxent' models are preferred over the 'perceptron' models.

Solr should now start without any Exceptions. At that point, go to the schema analysis page, pick the 'text_opennlp_pos' field type, and post a sentence or two to the analyzer. You should see the text tokenized, with payloads.
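As a rough sketch, a field type wired to OpenNLP might look like the following. The factory class names and model attributes here follow the SOLR-2899 patch as of this writing and may change before commit; the model file names ('en-sent.bin', 'en-token.bin', 'en-pos-maxent.bin') are the English models from the sourceforge page above. Check the schema.xml bundled in the contrib test-files for the exact, current syntax.

```xml
<!-- Sketch of an OpenNLP-backed field type; names follow the SOLR-2899
     patch and may differ in the version you apply. -->
<fieldType name="text_opennlp_pos" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Sentence detection + tokenization via OpenNLP statistical models,
         loaded relative to the core's conf/ directory. -->
    <tokenizer class="solr.OpenNLPTokenizerFactory"
               sentenceModel="opennlp/en-sent.bin"
               tokenizerModel="opennlp/en-token.bin"/>
    <!-- Part-of-speech tagging; the POS tags are attached as payloads. -->
    <filter class="solr.OpenNLPFilterFactory"
            posTaggerModel="opennlp/en-pos-maxent.bin"/>
  </analyzer>
</fieldType>
```

With a field type like this in place, the schema analysis page's output for 'text_opennlp_pos' is where you should see the per-token payloads.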
Unfortunately, the analysis page shows the payloads as bytes instead of text. If you would like them shown as text, go vote for SOLR-3493.

== Licensing ==
The OpenNLP library is Apache-licensed. The 'jwnl' library is 'BSD-like'.

Model licensing:
 * The contrib directory includes some small training data and scripts to generate model files. These are supplied only for running "unit" tests against the complete Solr/Lucene/OpenNLP code assemblies. They are not useful for exploring OpenNLP's features or for production deployment. In solr/contrib/opennlp/src/test-files/training, run 'bin/trainall.sh' to populate solr/contrib/opennlp/src/test-files/opennlp/solr/conf/opennlp with the test models. The schema.xml in that conf/ directory uses those models.
 * The models available from SourceForge are created from licensed training data. I have not seen a formal description of their license status, but they are not "safe" for Apache. If you want production-quality models for commercial use, you will need to make other arrangements.