On May 19, 2009, at 5:50 AM, Justin wrote:

I have a solr index which contains research data from the human genome
project.

Each document contains about 60 facets, including one general composite field that contains all the facet data. The general facet is anywhere from
100KB to 7MB.

One facet is called Gene.Symbol and, you guessed it, it contains only the
gene symbol. There is only one Symbol per gene (for smarty pantses out
there, the aliases are contained in another facet).

When I do a search for anything in the big general facet, I find what I'm looking for. But if I do a search in the Gene.Symbol facet, it does not
find anything.

I realize it's probably finding the string repeated elsewhere in the
document, but how do I get it to find it in the Gene.Symbol facet?

I'd look at the analysis tool in Solr admin and compare putting in various gene names. It seems a bit odd that you are applying Porter stemming to gene names.

You are likely getting matches due to the WordDelimiterFilter and other manipulations in the BFDText. In the Symbol field you aren't doing nearly as much to the tokens, so I doubt there is an "abc" gene in there.

You could try doing a prefix query. You could also try creating n-grams during indexing or other mechanisms for allowing matches within a string.
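
As a rough sketch of those two options: a prefix query would be something like q=Gene.Symbol:abc* (keep the term lowercase, since prefix terms are generally not run through the analyzer and your index side lowercases everything). That only helps for symbols that start with the string. To also match inside a symbol such as CABC1, an n-gram field along these lines might work. The names symbol_ngram and Gene.Symbol.ngram and the gram sizes are assumptions for illustration, not something taken from your schema.

```xml
<!-- Hypothetical field type: index every substring of the symbol
     between 2 and 15 characters long; gram sizes are a guess. -->
<fieldType name="symbol_ngram" class="solr.TextField" positionIncrementGap="0">
  <analyzer type="index">
    <!-- Keep the whole symbol as a single token before n-gramming -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <!-- No n-grams at query time: the query string is matched as one term -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field, filled from Gene.Symbol via copyField -->
<field name="Gene.Symbol.ngram" type="symbol_ngram" indexed="true" stored="false"/>
<copyField source="Gene.Symbol" dest="Gene.Symbol.ngram"/>
```

After reindexing, q=Gene.Symbol.ngram:abc should then be able to match ABCC2 and CABC1 alike. Query strings shorter than minGramSize or longer than maxGramSize would not match, so pick the sizes to fit your symbols, and check the result in the analysis tool.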



so a search for

http://localhost:8983/solr/core0/select?indent=on&version=2.2&q=Gene.Symbol:abc

returns nothing, but a search for

http://localhost:8983/solr/core0/select?indent=on&version=2.2&q=abc

returns
ABCC2
ABCC8
ABCD1
ABCG1
ABCA1
...
CABC1
...
ABCD3
ABCC5
ABCC9
ABCG2
ABCB11
ABCC3
ABCF1
ABCC1
ABCF2
ABCB9



Schema.xml:

<fieldType name="symbol" class="solr.TextField" positionIncrementGap="0">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>
...
<!-- yes, taken directly from the example -->

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

...
<field name="Gene.Symbol" type="symbol" indexed="true"
       stored="true" required="true" multiValued="false" omitNorms="false"/>
<field name="BFDText" type="text" indexed="true"
       stored="false" multiValued="true" omitNorms="true"/>
...
<defaultSearchField>BFDText</defaultSearchField>
<solrQueryParser defaultOperator="AND"/>
<copyField source="*" dest="BFDText"/>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
http://www.lucidimagination.com/search
