Hi Lance,

there is already a Jira issue for this, to which you can attach your patch [1].

I thought about this some time ago: it could be done by using the OpenNLP UIMA integration on top of the lucene-analysis-uima tokenizers [2]. To filter out tokens with certain PoS tags, one could use the UIMATypeAwareAnnotationsTokenizer [3] together with the TypeTokenFilter [4].

That approach would reuse existing Lucene pieces; however, a plain integration would also be good, to avoid unnecessary layers where they are not needed.

My 2 cents,
Tommaso

[1] : https://issues.apache.org/jira/browse/LUCENE-2899
[2] : http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/analysis/uima/
[3] : http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/analysis/uima/src/java/org/apache/lucene/analysis/uima/UIMATypeAwareAnnotationsTokenizer
[4] : http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/analysis/common/src/java/org/apache/lucene/analysis/core/TypeTokenFilter.java

2012/5/31 Lance Norskog <[email protected]>

> I'm creating a patch to integrate OpenNLP into the Lucene/Solr
> project. The SentenceDetector, Tokenizer, POS tagger, Chunker, and NER
> tools are included. The SentenceDetector and Tokenizer are a Lucene
> Tokenizer, and a Lucene TokenFilter takes this stream and runs
> POS/Chunking/NER on it, saving the tags as upper-case payloads. The
> patch includes a couple of handy combinations. For example, make a
> more focused search index by only indexing the nouns & verbs.
>
> Do you have any hints on how to package it? The documentation should
> include how to download and install the models.
>
> --
> Lance Norskog
> [email protected]
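
To make the "only index the nouns & verbs" idea concrete, here is a minimal sketch of type-based token filtering in plain Java. It deliberately does not use the Lucene, UIMA, or OpenNLP APIs: the class `PosTypeFilterSketch`, the `TaggedToken` holder, and the `keepTypes` and `terms` methods are hypothetical names made up for illustration, and the Penn Treebank-style tags (DT, NN, NNS, VBD) are an assumption about what the PoS tagger would emit. It only shows the principle of keeping tokens whose type is in a whitelist.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only: not the Lucene TypeTokenFilter API.
public class PosTypeFilterSketch {

    /** A token term plus its type; the type is assumed to carry a PoS tag. */
    public static final class TaggedToken {
        final String term;
        final String type;

        public TaggedToken(String term, String type) {
            this.term = term;
            this.type = type;
        }
    }

    /** Keeps only the tokens whose type is in the whitelist, preserving order. */
    public static List<TaggedToken> keepTypes(List<TaggedToken> tokens, Set<String> whitelist) {
        List<TaggedToken> kept = new ArrayList<>();
        for (TaggedToken token : tokens) {
            if (whitelist.contains(token.type)) {
                kept.add(token);
            }
        }
        return kept;
    }

    /** Extracts the terms, e.g. to inspect what would end up in the index. */
    public static List<String> terms(List<TaggedToken> tokens) {
        List<String> result = new ArrayList<>();
        for (TaggedToken token : tokens) {
            result.add(token.term);
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed tagger output for "The dog chased cats".
        List<TaggedToken> tagged = Arrays.asList(
                new TaggedToken("The", "DT"),
                new TaggedToken("dog", "NN"),
                new TaggedToken("chased", "VBD"),
                new TaggedToken("cats", "NNS"));

        // Whitelist of noun and verb tags (assumed tag names).
        Set<String> nounsAndVerbs = new HashSet<>(Arrays.asList("NN", "NNS", "VB", "VBD"));

        System.out.println(terms(keepTypes(tagged, nounsAndVerbs))); // prints [dog, chased, cats]
    }
}
```

In a real analysis chain the tokenizer would set the token type and a type-based filter would do the whitelisting; the sketch only shows which tokens survive.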
