Hi Lance,

there is already a Jira issue for that, to which you may attach your patch [1].
I thought about that some time ago; it could be done by using the OpenNLP UIMA
integration on top of the lucene-analysis-uima tokenizers [2].
For filtering out some PoS-tagged tokens one could use
the UIMATypeAwareAnnotationsTokenizer [3] with the TypeTokenFilter [4].
That would use only existing Lucene pieces; however, a plain
integration would also be good, to avoid unnecessary layers where they are not needed.
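
To make the type-based filtering idea concrete, here is a minimal plain-Java
sketch. It deliberately does not use the Lucene or UIMA APIs: TaggedToken,
keepTypes and the tag names are hypothetical and only illustrate the principle.
In the Lucene setup, as far as I understand, the tokenizer would expose the
annotation type on each token and the TypeTokenFilter would do the filtering.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: none of these names exist in Lucene or UIMA.
public class PosTypeFilterSketch {

    // A token paired with a type string, e.g. a PoS tag (hypothetical class).
    public static final class TaggedToken {
        public final String term;
        public final String type;

        public TaggedToken(String term, String type) {
            this.term = term;
            this.type = type;
        }
    }

    // Keep only the terms whose type is in the allowed set, preserving order.
    public static List<String> keepTypes(List<TaggedToken> tokens, Set<String> allowedTypes) {
        List<String> kept = new ArrayList<>();
        for (TaggedToken token : tokens) {
            if (allowedTypes.contains(token.type)) {
                kept.add(token.term);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hand-tagged sample input; a real pipeline would get tags from a PoS tagger.
        List<TaggedToken> tokens = Arrays.asList(
            new TaggedToken("the", "DT"),
            new TaggedToken("dog", "NN"),
            new TaggedToken("barks", "VBZ"),
            new TaggedToken("loudly", "RB"));
        Set<String> nounsAndVerbs = new HashSet<>(Arrays.asList("NN", "VBZ"));
        System.out.println(keepTypes(tokens, nounsAndVerbs));
    }
}
```

Running main prints [dog, barks], i.e. only the tokens tagged as noun or verb
survive, which is the "index only nouns & verbs" case from your mail.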
My 2 cents,
Tommaso


[1] : https://issues.apache.org/jira/browse/LUCENE-2899
[2] : http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/analysis/uima/
[3] : http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/analysis/uima/src/java/org/apache/lucene/analysis/uima/UIMATypeAwareAnnotationsTokenizer
[4] : http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/analysis/common/src/java/org/apache/lucene/analysis/core/TypeTokenFilter.java

2012/5/31 Lance Norskog <[email protected]>

> I'm creating a patch to integrate OpenNLP into the Lucene/Solr
> project. The SentenceDetector, Tokenizer, POS tagger, Chunker, and NER
> tools are included. The SentenceDetector and Tokenizer are a Lucene
> Tokenizer, and a Lucene TokenFilter takes this stream and runs
> POS/Chunking/NER on it, saving the tags as upper-case payloads. The
> patch includes a couple of handy combinations. For example, make a
> more focused search index by only indexing the nouns & verbs.
>
> Do you have any hints on how to package it? The documentation should
> include how to download and install the models.
>
> --
> Lance Norskog
> [email protected]
>
