I am using Solr 1.3; my server is embedded and accessed through SolrJ. I would like to set up my searches so that exact matches are returned first, followed by near matches, and finally token-based matches. For example, suppose I have a summary field in the schema that is populated using copyField from a bunch of other fields: "My item title, keyword, other, stuff"
I want this search to match the item above first and foremost:
1) "My item title*"
Then this one:
2) "my item*"
And finally this one should also work:
3) "my title"

I tried creating a field to hold exact-match data (summaryExact). It works if I paste in the precise text but stops working as soon as I add a wildcard. In other words, I get no matches for "My item title*" but 1 match for "My item title".

I also tried this:
(summary:"my item" || summaryExact:"my item*"^3)
but that results in 0 matches as well.

I could not figure out which tokenizer to use if I don't want any tokens created and just want to trim and lowercase the string, so let me know if you have ideas on this. Basically, I want something similar to DB "like" matching, case-insensitive and probably trimmed as well. I don't really want the field to be tokenized, though.

I am attaching my schema in case that helps. I have spent a few days reading through the Solr documentation and forums and trying various things to get this to work, but I just end up making the matching worse when I make changes.

I appreciate any pointers, links, or ideas.
Thanks!
-AZ

--
Aaron Zeckoski (azeckoski (at) vt.edu)
Senior Research Engineer - CARET - University of Cambridge
https://twitter.com/azeckoski - http://www.linkedin.com/in/azeckoski
http://aaronz-sakai.blogspot.com/ - http://tinyurl.com/azprofile
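For reference, one candidate for the "no tokens, just trim and lowercase" field described above is the keyword tokenizer, which emits the whole field value as a single token. The fragment below is a sketch under assumptions rather than a tested fix: the type name exactLower is made up for illustration, and it has not been run against this index.

```xml
<!-- Hypothetical type name; sketch only, not tested against this schema -->
<fieldType name="exactLower" class="solr.TextField" omitNorms="true">
  <analyzer>
    <!-- emits the entire input as one token (no splitting) -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- lowercases that token so matching ignores case -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- strips leading and trailing whitespace -->
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>
```

With a type like this on summaryExact, a prefix search would presumably need to be written without quotes, with the space escaped and the text already lowercased (for example summaryExact:my\ item*), since as far as I understand wildcard terms are not passed through the analyzer and a * inside a quoted phrase is not treated as a wildcard.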
<?xml version="1.0" encoding="UTF-8" ?>
<!--
  This is the Solr schema file. This file should be named "schema.xml" and
  should be in the conf directory under the solr home
  (i.e. ./solr/conf/schema.xml by default)
  or located where the classloader for the Solr webapp can find it.

  For more information, on how to customize this file, please see
  http://wiki.apache.org/solr/SchemaXml
-->
<!-- Steeple Portal project schema - Aaron Zeckoski (aa...@caret.cam.ac.uk) -->
<schema name="steeple" version="1.1">
  <!-- this is a unified schema of multiple types since the searches need to be combined,
       not completely sure if this is required -->
  <types>
    <!--
      omitNorms - If you have tokenized fields of variable size and you want the field length
      to affect the relevance score, then you do not want to omit norms.
      Omitting norms is good for fields where length is of no importance
      (e.g. gender="Male" vs. gender="Female").
      Omitting norms saves you heap/RAM, one byte per doc per field without norms, I believe.

      positionIncrementGap - Used for multivalued fields
      With a position increment gap of 0, a phrase query of "doe bob" would be a match.
      But often it is undesirable for that kind of match across different field values.
      A position increment gap controls the virtual space between the last token of one
      field instance and the first token of the next instance.
      With a gap of 100, this prevents phrase queries (even with a modest slop factor)
      from matching across instances.

      Comma delimited splitter (maybe for keywords if they are delimited)
      <analyzer class="org.apache.lucene.analysis.PatternTokenizerFactory" pattern=", *" />
    -->

    <!-- The identifier should always be extremely simple so there are no filters on it -->
    <fieldType name="identifier" class="solr.StrField" sortMissingLast="true"
        omitNorms="true" compressed="false" indexed="true" stored="true" />

    <!-- special field for exact text matches, no processing -->
    <fieldType name="exact" class="solr.TextField" compressed="false"
        indexed="true" stored="true" />

    <!-- name indicates names, titles, and summaries, these are not tokenized
         but are flattened (html and special chars) to make searches easier -->
    <fieldType name="name" class="solr.StrField" sortMissingLast="true"
        omitNorms="true" compressed="false" indexed="true" stored="true">
      <analyzer type="index">
        <tokenizer class="solr.HTMLStripStandardTokenizerFactory"/>
        <!-- splits things up <filter class="solr.StandardFilterFactory"/> -->
        <filter class="solr.ISOLatin1AccentFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.TrimFilterFactory" />
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <fieldType name="year" class="solr.SortableIntField" sortMissingLast="true"
        omitNorms="true" compressed="false" indexed="true" stored="true" />

    <fieldtype name="keywords" class="solr.TextField" positionIncrementGap="10" omitNorms="true">
      <analyzer>
        <tokenizer class="solr.LowerCaseTokenizerFactory"/>
      </analyzer>
    </fieldtype>

    <!-- standard field types below -->
    <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true" />
    <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true" omitNorms="true" />
    <fieldType name="integer" class="solr.IntField" omitNorms="true" /><!-- not sortable -->
    <fieldType name="sint" class="solr.SortableIntField" sortMissingLast="true" omitNorms="true" />
    <fieldType name="slong" class="solr.SortableLongField" sortMissingLast="true" omitNorms="true" />

    <fieldType name="text" class="solr.TextField" positionIncrementGap="10">
      <analyzer type="index">
        <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <!-- Case insensitive stop word removal.
             enablePositionIncrements=true ensures that a 'gap' is left to
             allow for accurate phrase queries. -->
        <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true" />
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.LowerCaseTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="0" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

    <!-- since fields of this type are by default not stored or indexed,
         any data added to them will be ignored outright -->
    <fieldtype name="ignored" stored="false" indexed="false" class="solr.StrField" />
  </types>

  <fields>
    <field name="id" type="identifier" required="true" />
    <field name="baseId" type="identifier" required="true" />
    <field name="sourceId" type="identifier" required="true" />
    <!-- type should be "channel, category, mediaitem, etc."-->
    <field name="type" type="identifier" required="true" />
    <field name="title" type="name" required="true" />
    <field name="timestamp" type="slong" required="true" stored="true" />

    <!-- OPTIONAL -->
    <field name="description" type="name" />
    <field name="publishDate" type="identifier" />
    <field name="publishDateISO8601" type="identifier" />
    <field name="publishDateCode" type="slong" stored="true" />
    <field name="ownerUserId" type="name" />
    <field name="copyright" type="name" />
    <field name="license" type="name" />
    <field name="author" type="name" />
    <field name="itemURL" type="identifier" />
    <field name="rssURL" type="identifier" />
    <field name="thumbnailURL" type="identifier" />
    <field name="imageURL" type="identifier" />
    <field name="mcMedium" type="name" />
    <field name="mimeType" type="name" multiValued="true" />
    <field name="mediaContent" type="name" multiValued="true" />
    <field name="category" type="identifier" multiValued="true" />
    <field name="keyword" type="identifier" multiValued="true" />

    <!-- flags -->
    <field name="private" type="boolean" stored="true" />
    <field name="external" type="boolean" stored="true" />
    <field name="readonly" type="boolean" stored="true" />
    <field name="hidden" type="boolean" stored="true" />

    <!-- hierarchy -->
    <field name="parents" type="identifier" multiValued="true" />
    <field name="children" type="identifier" multiValued="true" />

    <!-- summary is merged content for search -->
    <field name="summary" type="text" indexed="true" stored="true" multiValued="true" />
    <field name="summaryExact" type="exact" indexed="true" stored="true" multiValued="true" />

    <!-- Valid attributes for fields:
      name: mandatory - the name for the field
      type: mandatory - the name of a previously defined type from the <types> section
      indexed: true if this field should be indexed (searchable or sortable)
      stored: true if this field should be retrievable
      compressed: [false] if this field should be stored using gzip compression
        (this will only apply if the field type is compressable; among
        the standard field types, only TextField and StrField are)
      multiValued: true if this field may contain multiple values per document
      omitNorms: (expert) set to true to omit the norms associated with
        this field (this disables length normalization and index-time
        boosting for the field, and saves some memory). Only full-text
        fields or fields that need an index-time boost need norms.
      termVectors: [false] set to true to store the term vector for a given field.
        When using MoreLikeThis, fields used for similarity should be stored for
        best performance.
    -->

    <!-- copyField commands copy one field to another at the time a document
         is added to the index. It's used either to index the same field differently,
         or to add multiple fields to the same field for easier/faster searching. -->
    <copyField source="title" dest="summary" />
    <copyField source="description" dest="summary" />
    <copyField source="keyword" dest="summary" />
    <copyField source="author" dest="summary" />
    <copyField source="type" dest="summary" />
    <copyField source="baseId" dest="summary" />
    <copyField source="mcMedium" dest="summary" />
    <copyField source="ownerUserId" dest="summary" />
    <copyField source="publishDate" dest="summary" />
    <copyField source="title" dest="summaryExact" />

    <!-- uncomment the following to ignore any fields that don't already match an existing
         field name or dynamic field, rather than reporting them as an error.
         alternately, change the type="ignored" to some other type e.g. "text" if you want
         unknown fields indexed and/or stored by default -->
    <!-- we are not indexing or storing fields which are not part of the schema -AZ -->
    <dynamicField name="*" type="ignored" multiValued="true" />
  </fields>

  <!-- Field to use to determine and enforce document uniqueness. -->
  <uniqueKey>id</uniqueKey>

  <!-- field for the QueryParser to use when an explicit fieldname is absent -->
  <defaultSearchField>summary</defaultSearchField>

  <!-- SolrQueryParser configuration: defaultOperator="AND|OR" -->
  <solrQueryParser defaultOperator="OR"/>
</schema>