On May 19, 2009, at 5:50 AM, Justin wrote:
I have a Solr index which contains research data from the Human Genome
Project.
Each document contains about 60 facets, including one general composite
field that contains all the facet data. The general facet is anywhere
from 100KB to 7MB.
One facet is called Gene.Symbol and, you guessed it, it contains only
the gene symbol. There is only one Symbol per gene (for smarty pantses
out there, the aliases are contained in another facet).
When I do a search for anything in the big general facet, I find what
I'm looking for. But if I do a search in the Gene.Symbol facet, it does
not find anything.
I realize it's probably finding the string repeated elsewhere in the
document, but how do I get it to find it in the Gene.Symbol facet?
I'd look at the analysis tool in Solr admin and compare putting in
various gene names. It seems a bit odd that you are applying Porter
stemming to gene names.
You are likely getting matches due to the WordDelimiterFilter and
other manipulations applied to the BFDText field. In the Symbol field
you aren't doing nearly as much to the tokens, so I doubt there is an
"abc" gene in there.
You could try doing a prefix query. You could also try creating
n-grams during indexing or other mechanisms for allowing matches
within a string.
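
To make the n-gram suggestion concrete, here is a rough sketch of what
such a field type might look like. The names "symbol_ngram" and
"Gene.Symbol.ngram" and the gram sizes are assumptions made up for
illustration, not anything from the posted schema, and it assumes your
Solr version ships solr.NGramFilterFactory:

```xml
<!-- Hypothetical field type; name and gram sizes are assumptions -->
<fieldType name="symbol_ngram" class="solr.TextField" positionIncrementGap="0">
  <analyzer type="index">
    <!-- Keep the whole symbol as a single token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Index every substring of 2 to 10 characters of the symbol,
         so "abcc2" also yields the gram "abc" -->
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="10"/>
  </analyzer>
  <analyzer type="query">
    <!-- No n-grams at query time: the lowercased query term is
         matched as-is against the indexed grams -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical companion field, filled from Gene.Symbol -->
<field name="Gene.Symbol.ngram" type="symbol_ngram" indexed="true" stored="false"/>
<copyField source="Gene.Symbol" dest="Gene.Symbol.ngram"/>
```

With something like this in place, q=Gene.Symbol.ngram:abc should match
ABCC2 as well as CABC1, since "abc" is indexed as a gram of both. Query
terms longer than maxGramSize would not match. By comparison, the
prefix query against the existing field would be q=Gene.Symbol:abc*,
which only matches symbols that start with "abc". Prefix queries are
typically not run through the analyzer, so keep the term lowercase to
match the lowercased index.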
So a search for
http://localhost:8983/solr/core0/select?indent=on&version=2.2&q=Gene.Symbol:abc
returns nothing, but a search for
http://localhost:8983/solr/core0/select?indent=on&version=2.2&q=abc
returns
ABCC2
ABCC8
ABCD1
ABCG1
ABCA1
...
CABC1
...
ABCD3
ABCC5
ABCC9
ABCG2
ABCB11
ABCC3
ABCF1
ABCC1
ABCF2
ABCB9
Schema.xml:
<fieldType name="symbol" class="solr.TextField" positionIncrementGap="0">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>
...
<!-- yes, taken directly from the example -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
...
<field name="Gene.Symbol" type="symbol" indexed="true"
       stored="true" required="true" multiValued="false" omitNorms="false"/>
<field name="BFDText" type="text" indexed="true"
       stored="false" multiValued="true" omitNorms="true"/>
...
<defaultSearchField>BFDText</defaultSearchField>
<solrQueryParser defaultOperator="AND"/>
<copyField source="*" dest="BFDText"/>
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/