text_opennlp has the right behavior.
text_opennlp_pos does what you describe.
I'll look some more.

On 06/09/2013 04:38 PM, Patrick Mi wrote:
Hi Lance,

I updated the source from branch_4x and applied the latest patch,
LUCENE-2899-x.patch, uploaded on 6 June, but I still have the same problem.


Regards,
Patrick

-----Original Message-----
From: Lance Norskog [mailto:goks...@gmail.com]
Sent: Thursday, 6 June 2013 5:16 p.m.
To: solr-user@lucene.apache.org
Subject: Re: OPENNLP problems

Patrick-
I found the problem with multiple documents. The problem was that the
API for the life cycle of a Tokenizer changed, and I only noticed part
of the change. You can now upload multiple documents in one post, and
the OpenNLPTokenizer will process each document.

You're right, the example on the wiki is wrong. By default, the
FilterPayloadsFilter removes the given payloads; it needs
keepPayloads="true" to retain them.
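
For reference, here is a sketch of how the field type from the original
report might look with that attribute added. The field type name, model
paths, and tag list are copied from Patrick's message quoted below; the
factory class names are the ones used in that message, so this assumes a
build with the LUCENE-2899 patch applied.

```xml
<!-- Intended to keep nouns and verbs for faceting, per the original report. -->
<fieldType name="text_opennlp_nvf" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.OpenNLPTokenizerFactory" tokenizerModel="opennlp/en-token.bin"/>
    <filter class="solr.OpenNLPFilterFactory" posTaggerModel="opennlp/en-pos-maxent.bin"/>
    <!-- keepPayloads="true" retains the listed tags; the default removes them. -->
    <filter class="solr.FilterPayloadsFilterFactory" keepPayloads="true"
            payloadList="NN,NNS,NNP,NNPS,VB,VBD,VBG,VBN,VBP,VBZ,FW"/>
    <filter class="solr.StripPayloadsFilterFactory"/>
  </analyzer>
</fieldType>
```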

The fixed patch is up as LUCENE-2899-x.patch. Again, thanks for trying it.

Lance

https://issues.apache.org/jira/browse/LUCENE-2899

On 05/28/2013 10:08 PM, Patrick Mi wrote:
Hi there,

I checked out branch_4x and applied the latest patch,
LUCENE-2899-current.patch, but I ran into two problems.

First problem: I followed the wiki page instructions and set up a field
with this type, aiming to keep only nouns and verbs and to facet on the field:
==
<fieldType name="text_opennlp_nvf" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.OpenNLPTokenizerFactory" tokenizerModel="opennlp/en-token.bin"/>
    <filter class="solr.OpenNLPFilterFactory" posTaggerModel="opennlp/en-pos-maxent.bin"/>
    <filter class="solr.FilterPayloadsFilterFactory"
            payloadList="NN,NNS,NNP,NNPS,VB,VBD,VBG,VBN,VBP,VBZ,FW"/>
    <filter class="solr.StripPayloadsFilterFactory"/>
  </analyzer>
</fieldType>
==

I struggled to get that working until I added the extra parameter
keepPayloads="true", as below:
    <filter class="solr.FilterPayloadsFilterFactory" keepPayloads="true"
            payloadList="NN,NNS,NNP,NNPS,VB,VBD,VBG,VBN,VBP,VBZ,FW"/>

Question: am I doing the right thing? Is this a mistake on the wiki?

Second problem:

I posted the documents to Solr one XML file at a time, and the result was
what I expected.

<add>
<doc>
    <field name="id">1</field>
    <field name="text_opennlp_nvf">check in the hotel</field>
</doc>
</add>

However, if I put multiple documents into the same XML file and post it in
one go, only the first document gets processed (only 'check' and 'hotel'
show up in the facet result).
<add>
<doc>
    <field name="id">1</field>
    <field name="text_opennlp_nvf">check in the hotel</field>
</doc>
<doc>
    <field name="id">2</field>
    <field name="text_opennlp_nvf">removes the payloads</field>
</doc>
<doc>
    <field name="id">3</field>
    <field name="text_opennlp_nvf">retains only nouns and verbs </field>
</doc>
</add>

I see the same problem when I update the data using CSV upload.

Is that a bug or something I did wrong?

Thanks in advance!

Regards,
Patrick



