Cannot get like exact searching to work

Aaron Zeckoski Wed, 10 Feb 2010 06:05:39 -0800

I am using Solr 1.3 with an embedded server accessed through SolrJ.
I would like to set up my searches so that exact matches are the first
results returned, followed by near matches, and finally token-based
matches.
For example, say I have a summary field in the schema which is created
using copyField from a bunch of other fields:
"My item title, keyword, other, stuff"


I want this search to match the item above first and foremost:
1) "My item title*"

Then this one:
2) "my item*"

and finally this one should also work:
3) "my title"

I tried creating a field to hold exact match data (summaryExact) which
actually works if I paste in the precise text but stops working as
soon as I add any wildcard to it. In other words I get no matches for
"My item title*" but I get 1 match for "My item title". I also tried
this:
(summary:"my item" || summaryExact:"my item*"^3)
but that results in 0 matches as well.

I could not quite figure out which tokenizer to use when I don't want
any tokens created and just want to trim and lowercase the string, so
let me know if you have ideas on this. Basically, I want something
similar to a DB "like" match, case-insensitive and probably trimmed as
well. I don't really want the field to be tokenized though.
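
A minimal sketch of a field type matching that description is shown
below. It assumes solr.KeywordTokenizerFactory (which emits the whole
field value as a single token) followed by the trim and lowercase
filters already used elsewhere in the attached schema. The type name
"exactLower" is hypothetical and is not part of the attached schema.

```xml
<!-- Hypothetical field type: the whole value is kept as one token,
     then trimmed and lowercased. Not part of the attached schema. -->
<fieldType name="exactLower" class="solr.TextField" omitNorms="true">
  <analyzer>
    <!-- KeywordTokenizer does not split the input into words -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With a type like this, a prefix search would probably need to be sent
already lowercased and with its spaces escaped rather than quoted (for
example summaryExact:my\ item*), since wildcard terms are generally
not run through the analyzer. Treat that as an assumption to verify
against the query parser in use.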

I am attaching my schema in case that helps.
I have spent a few days reading through the SOLR documentation and
forums and trying various things to get this to work but I just end up
making the matching worse when I make changes. I appreciate any
pointers, links, or ideas.
Thanks!
-AZ


--
Aaron Zeckoski (azeckoski (at) vt.edu)
Senior Research Engineer - CARET - University of Cambridge
https://twitter.com/azeckoski - http://www.linkedin.com/in/azeckoski
http://aaronz-sakai.blogspot.com/ - http://tinyurl.com/azprofile

<?xml version="1.0" encoding="UTF-8" ?>

<!--  
 This is the Solr schema file. This file should be named "schema.xml" and
 should be in the conf directory under the solr home
 (i.e. ./solr/conf/schema.xml by default) 
 or located where the classloader for the Solr webapp can find it.

 For more information, on how to customize this file, please see
 http://wiki.apache.org/solr/SchemaXml
-->

<!-- Steeple Portal project schema - Aaron Zeckoski (aa...@caret.cam.ac.uk) -->
<schema name="steeple" version="1.1">
  <!-- this is a unified schema of multiple types since the searches need to be combined,
    not completely sure if this is required
-->

  <types>
    <!-- 
        omitNorms - If you have tokenized fields of variable size and you want the field length to 
        affect the relevance score, then you do not want to omit norms.  Omitting norms is good for 
        fields where length is of no importance (e.g. gender="Male" vs. gender="Female").  Omitting 
        norms saves you heap/RAM, one byte per doc per field without norms, I believe. 

        positionIncrementGap - Used for multivalued fields
        With a position increment gap of 0, a phrase query of "doe bob" would  
        be a match.  But often it is undesirable for that kind of match across  
        different field values.  A position increment gap controls the virtual  
        space between the last token of one field instance and the first token  
        of the next instance.  With a gap of 100, this prevents phrase queries  
        (even with a modest slop factor) from matching across instances. 

        Comma delimited splitter (maybe for keywords if they are delimited)
        <analyzer class="org.apache.lucene.analysis.PatternTokenizerFactory" pattern=", *" />
    -->

    <!-- The identifier should always be extremely simple so there are no filters on it -->
    <fieldType name="identifier" class="solr.StrField" sortMissingLast="true" omitNorms="true" compressed="false" indexed="true" stored="true" />
    <!-- special field for exact text matches, no processing -->
    <fieldType name="exact" class="solr.TextField" compressed="false" indexed="true" stored="true" />
    <!-- name indicates names, titles, and summaries, 
        these are not tokenized but are flattened (html and special chars) to make searches easier -->
    <fieldType name="name" class="solr.StrField" sortMissingLast="true" omitNorms="true" compressed="false" indexed="true" stored="true">
      <analyzer type="index">
        <tokenizer class="solr.HTMLStripStandardTokenizerFactory"/>
        <!-- splits things up <filter class="solr.StandardFilterFactory"/> -->
        <filter class="solr.ISOLatin1AccentFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.TrimFilterFactory" />
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
    <fieldType name="year" class="solr.SortableIntField" sortMissingLast="true" omitNorms="true" compressed="false" indexed="true" stored="true" />
    <fieldType name="keywords" class="solr.TextField" positionIncrementGap="10" omitNorms="true">
      <analyzer>
        <tokenizer class="solr.LowerCaseTokenizerFactory"/>
      </analyzer>
    </fieldType>

    <!-- standard field types below -->
    <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true" />
    <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true" omitNorms="true" />
    <fieldType name="integer" class="solr.IntField" omitNorms="true" /><!-- not sortable -->
    <fieldType name="sint" class="solr.SortableIntField" sortMissingLast="true" omitNorms="true" />
    <fieldType name="slong" class="solr.SortableLongField" sortMissingLast="true" omitNorms="true" />

    <fieldType name="text" class="solr.TextField" positionIncrementGap="10">
      <analyzer type="index">
        <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <!-- Case insensitive stop word removal.
             enablePositionIncrements=true ensures that a 'gap' is left to
             allow for accurate phrase queries.
        -->
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.LowerCaseTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

    <!-- since fields of this type are by default not stored or indexed, any data added to 
         them will be ignored outright 
     --> 
    <fieldType name="ignored" stored="false" indexed="false" class="solr.StrField" /> 

  </types>

  <fields>
    <field name="id" type="identifier" required="true" />
    <field name="baseId" type="identifier" required="true" />
    <field name="sourceId" type="identifier" required="true" />
    <!-- type should be "channel, category, mediaitem, etc."-->
    <field name="type" type="identifier" required="true" />
    <field name="title" type="name" required="true" />
    <field name="timestamp" type="slong" required="true" stored="true" />
    <!-- OPTIONAL -->
    <field name="description" type="name" />
    <field name="publishDate" type="identifier" />
    <field name="publishDateISO8601" type="identifier" />
    <field name="publishDateCode" type="slong" stored="true" />
    <field name="ownerUserId" type="name" />
    <field name="copyright" type="name" />
    <field name="license" type="name" />
    <field name="author" type="name" />
    <field name="itemURL" type="identifier" />
    <field name="rssURL" type="identifier" />
    <field name="thumbnailURL" type="identifier" />
    <field name="imageURL" type="identifier" />
    <field name="mcMedium" type="name" />
    <field name="mimeType" type="name" multiValued="true" />
    <field name="mediaContent" type="name" multiValued="true" />
    <field name="category" type="identifier" multiValued="true" />
    <field name="keyword" type="identifier" multiValued="true" />
    <!-- flags -->
    <field name="private" type="boolean" stored="true" />
    <field name="external" type="boolean" stored="true" />
    <field name="readonly" type="boolean" stored="true" />
    <field name="hidden" type="boolean" stored="true" />
    <!-- hierarchy -->
    <field name="parents" type="identifier" multiValued="true" />
    <field name="children" type="identifier" multiValued="true" />
    <!-- summary is merged content for search -->
    <field name="summary" type="text" indexed="true" stored="true" multiValued="true" />
    <field name="summaryExact" type="exact" indexed="true" stored="true" multiValued="true" />

    <!-- Valid attributes for fields:
     name: mandatory - the name for the field
     type: mandatory - the name of a previously defined type from the <types> section
     indexed: true if this field should be indexed (searchable or sortable)
     stored: true if this field should be retrievable
     compressed: [false] if this field should be stored using gzip compression
       (this will only apply if the field type is compressible; among
       the standard field types, only TextField and StrField are)
     multiValued: true if this field may contain multiple values per document
     omitNorms: (expert) set to true to omit the norms associated with
       this field (this disables length normalization and index-time
       boosting for the field, and saves some memory).  Only full-text
       fields or fields that need an index-time boost need norms.
     termVectors: [false] set to true to store the term vector for a given field.
       When using MoreLikeThis, fields used for similarity should be stored for 
       best performance.
    -->

    <!-- copyField commands copy one field to another at the time a document
        is added to the index.  It's used either to index the same field differently,
        or to add multiple fields to the same field for easier/faster searching.  -->
    <copyField source="title"        dest="summary" />
    <copyField source="description"  dest="summary" />
    <copyField source="keyword"      dest="summary" />
    <copyField source="author"       dest="summary" />
    <copyField source="type"         dest="summary" />
    <copyField source="baseId"       dest="summary" />
    <copyField source="mcMedium"     dest="summary" />
    <copyField source="ownerUserId"  dest="summary" />
    <copyField source="publishDate"  dest="summary" />

    <copyField source="title"        dest="summaryExact" />

    <!-- uncomment the following to ignore any fields that don't already match an existing 
        field name or dynamic field, rather than reporting them as an error. 
        alternately, change the type="ignored" to some other type e.g. "text" if you want 
        unknown fields indexed and/or stored by default
--> 
    <!-- we are not indexing or storing fields which are not part of the schema -AZ -->
    <dynamicField name="*" type="ignored" multiValued="true" />
   
  </fields>

  <!-- Field to use to determine and enforce document uniqueness. -->
  <uniqueKey>id</uniqueKey>

  <!-- field for the QueryParser to use when an explicit fieldname is absent -->
  <defaultSearchField>summary</defaultSearchField>

  <!-- SolrQueryParser configuration: defaultOperator="AND|OR" -->
  <solrQueryParser defaultOperator="OR"/>

</schema>
