Dear Wiki user, You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.
The "OpenNLP" page has been changed by LanceXNorskog: http://wiki.apache.org/solr/OpenNLP?action=diff&rev1=3&rev2=4

This example should work well with most English-language free text.

== Installation ==
For English-language testing, until SOLR-2899 is committed:
 * pull the latest trunk or the 4.0 branch
 * apply the patch
 * run 'ant compile'
 * cd solr/contrib/opennlp/src/test-files/training
 * run 'bin/trainall.sh'
  * this creates the binary model files, which will be included in the distribution once the patch is committed.
Now go to trunk-dir/solr and run 'ant test-contrib'. This compiles the OpenNLP Lucene and Solr code against the OpenNLP libraries and runs the tests with the small model files.

== Deployment to Solr ==
A Solr core requires schema types for the OpenNLP Tokenizer & Filter, and also requires model files. The distribution includes a schema.xml in solr/contrib/opennlp/src/test-files/opennlp/solr/conf/ which demonstrates the OpenNLP-based analyzers. It does not contain other text types (to avoid falling out of date with the full text suite). Copy the text types from this file into your test collection's schema.xml, and download "real" models for testing. You may also have to add the OpenNLP lib directory to your solr/lib or solr/cores/collection/lib directory.

Now download these model files to solr/contrib/opennlp/src/test-files/opennlp/solr/conf/opennlp/:
 * [http://opennlp.sourceforge.net/models-1.5/]
 * The English-language models start with 'en'. The 'maxent' models are preferred over the 'perceptron' models.

Solr should now start without any Exceptions. At that point, go to the schema analysis page, pick the 'text_opennlp_pos' field type, and post a sentence or two to the analyzer. You should see the text tokenized, with payloads.
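As a rough sketch, a field type wired to OpenNLP might look like the following. The factory class names and model attributes here follow the SOLR-2899 patch as of this writing and may change before commit; the model file names ('en-sent.bin', 'en-token.bin', 'en-pos-maxent.bin') are the English models from the sourceforge page above. Check the schema.xml bundled in the contrib test-files for the exact, current syntax.

```xml
<!-- Sketch of an OpenNLP-backed field type; names follow the SOLR-2899
     patch and may differ in the version you apply. -->
<fieldType name="text_opennlp_pos" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Sentence detection + tokenization via OpenNLP statistical models,
         loaded relative to the core's conf/ directory. -->
    <tokenizer class="solr.OpenNLPTokenizerFactory"
               sentenceModel="opennlp/en-sent.bin"
               tokenizerModel="opennlp/en-token.bin"/>
    <!-- Part-of-speech tagging; the POS tags are attached as payloads. -->
    <filter class="solr.OpenNLPFilterFactory"
            posTaggerModel="opennlp/en-pos-maxent.bin"/>
  </analyzer>
</fieldType>
```

With a field type like this in place, the schema analysis page's output for 'text_opennlp_pos' is where you should see the per-token payloads.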
Unfortunately, the analysis page shows the payloads as bytes instead of text. If you would like them shown as text, go vote for SOLR-3493.

== Licensing ==
The OpenNLP library is Apache-licensed. The 'jwnl' library is 'BSD-like'.

Model licensing:
 * The contrib directory includes some small training data and scripts to generate model files. These are supplied only for running "unit" tests against the complete Solr/Lucene/OpenNLP code assemblies. They are not useful for exploring OpenNLP's features or for production deployment. In solr/contrib/opennlp/src/test-files/training, run 'bin/trainall.sh' to populate solr/contrib/opennlp/src/test-files/opennlp/solr/conf/opennlp with the test models. The schema.xml in that conf/ directory uses those models.
 * The models available from SourceForge are created from licensed training data. I have not seen a formal description of their license status, but they are not "safe" for Apache. If you want production-quality models for commercial use, you will need to make other arrangements.