Weird Facet and KeywordTokenizerFactory Issue

Ravi Kiran Tue, 06 Oct 2009 12:55:02 -0700

Hello All,
I am getting some ghost facets in Solr 1.4. Can anybody kindly help me
understand why I get them and how to eliminate them? My schema.xml snippet
is given at the end. I am indexing named entities extracted via OpenNLP
into Solr. My understanding of KeywordTokenizerFactory is that it treats
the entire field value as a single token, am I right? For example, "New
York" will be indexed as 'New York' and will not be split, right? However,
I see them split up in the facets, as shown below, when running the query
"http://localhost:8080/solr-admin/topicscore/select/?facet=true&facet.limit=-1".
But when I search with the standard handler (qt=standard&q=keyword:"New"),
I don't find any doc that has just "New". After digging in a bit, I found
that if several keywords share a common starting word, that word is pulled
out as another facet, as in the following. Any help is greatly appreciated.
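
To make the expectation concrete, here is a minimal Python sketch of what I assume a keyword tokenizer followed by a trim filter does: each field value becomes exactly one token. This is only a model of my understanding, not Solr's actual code; the names keyword_analyze and term_counts and the sample values are made up for illustration.

```python
from collections import Counter

def keyword_analyze(value):
    """Model of keyword tokenizer + trim: the whole value is a single token."""
    token = value.strip()
    return [token] if token else []

def term_counts(values):
    """Count the tokens produced for a list of field values."""
    return Counter(tok for v in values for tok in keyword_analyze(v))

# Hypothetical sample values, not taken from the real index.
values = ["New York", " New York ", "New York City", "New Jersey"]
counts = term_counts(values)
print(counts)
```

Under this model a bare "New" term can never appear, which is why the facets in the result look wrong to me.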


Result
------------
<int name="New">47</int>    --------> Ghost
<int name="New Hampshire">7</int>
<int name="New Jersey">16</int>
<int name="New Orleans">10</int>
<int name="New York">147</int>
<int name="New York City">23</int>
<int name="New York Giants">8</int>
<int name="New York Islanders">5</int>
<int name="New York Mercantile Exchange">6</int>
<int name="New York Mets">8</int>
<int name="New York Stock Exchange">10</int>
<int name="New York Times">8</int>
<int name="New York University">5</int>
<int name="New Zealand">7</int>

<int name="Energy">7</int>    --------------> Ghost
<int name="Energy Department">5</int>
<int name="Energy Information Administration">5</int>


<int name="Federal">7</int>  --------------> Ghost
<int name="Federal Deposit Insurance Corp.">6</int>
<int name="Federal Reserve">26</int>
<int name="Federal Reserve Chairman">6</int>

<int name="North">27</int>
<int name="North Carolina">8</int>
<int name="North Dakota">7</int>
<int name="North Korea">12</int>
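
The pattern above (a facet value that is exactly the first word of other facet values) can be checked mechanically. A minimal Python sketch, assuming the facet counts have already been parsed into a dict; the function name suspect_ghost_facets is hypothetical and the dict holds a subset of the counts listed above.

```python
def suspect_ghost_facets(facet_counts):
    """Return facet values that are exactly the first word of another value."""
    values = set(facet_counts)
    # First words of all multi-word facet values.
    first_words = {v.split()[0] for v in values if len(v.split()) > 1}
    return sorted(values & first_words)

# Subset of the facet counts from the result above.
facets = {
    "New": 47, "New Hampshire": 7, "New York": 147, "New York City": 23,
    "Energy": 7, "Energy Department": 5,
    "Federal": 7, "Federal Reserve": 26,
    "North": 27, "North Carolina": 8,
}
print(suspect_ghost_facets(facets))
```

This only flags candidates. It also picks up "North", which follows the same pattern as the entries marked Ghost, and it cannot tell a ghost from a legitimate single-word entity.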

Schema.xml
-----------------

    <fieldType name="keywordText" class="solr.TextField"
               sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.TrimFilterFactory" />
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt,entity-stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="false" />
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.TrimFilterFactory" />
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt,entity-stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="false" />
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

    <field name="person" type="keywordText" indexed="true" stored="true"
           multiValued="true" termVectors="false" termPositions="false"
           termOffsets="false"/>
    <field name="organization" type="keywordText" indexed="true" stored="true"
           multiValued="true" termVectors="false" termPositions="false"
           termOffsets="false"/>
    <field name="location" type="keywordText" indexed="true" stored="true"
           multiValued="true" termVectors="false" termPositions="false"
           termOffsets="false"/>
    <field name="keyword" type="keywordText" indexed="true" stored="true"
           multiValued="true" termVectors="false" termPositions="false"
           termOffsets="false"/>
