Hello,

I am using Solr to index and search documents in Russian. I have successfully 
set up the RussianAnalyzer, but found that it eliminates some tokens, such as 
numbers. I am therefore indexing my text fields in two ways: first with a 
fairly literal version of the text, using something similar to textTight in 
the example config:

    <fieldtype name="text_literal" class="solr.TextField" 
positionIncrementGap="100" >
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" 
ignoreCase="true" expand="false"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" 
generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldtype>
 
And I index my fields a second time using the RussianAnalyzer to cover Russian 
stemming and stop words:
    <fieldtype name="text_ru_RU" class="solr.TextField"  >
      <analyzer class="org.apache.lucene.analysis.ru.RussianAnalyzer"/>
    </fieldtype>

I then specify my field names:
   <dynamicField name="*_ru_RU"   type="text_ru_RU" indexed="true" 
stored="false"/>
   <dynamicField name="*_literal" type="text_literal" indexed="true" 
stored="false"/>

And use the copyField feature to index them twice:
   <copyField source="title_ru_RU"         dest="title_literal"    />
   <copyField source="location_ru_RU"   dest="location_literal" />
   <copyField source="body_ru_RU"       dest="body_literal"     />

I then specify my own DisMaxRequestHandler in solrconfig.xml:
  <requestHandler name="dismax_ru_RU" class="solr.DisMaxRequestHandler" >
    <lst name="defaults">
     <float name="tie">0.01</float>
     <str name="qf">
        title_literal^1.5 title_ru_RU^1.3 body_literal^1.0 body_ru_RU^0.8 
location_literal^0.5 location_ru_RU^0.4  </str>
     <str name="pf">
        title_literal^1.5 title_ru_RU^1.3 body_literal^1.0 body_ru_RU^0.8 
location_literal^0.5 location_ru_RU^0.4  </str>
     <str name="mm">
        100%
     </str>
     <int name="ps">100</int>
    </lst>
  </requestHandler>
 
Because I am searching through classified ads, date sorting is more important 
to me than relevance, so I sort by date first and then by score. I expect the 
system to return all matches from today's ads sorted by relevance, followed by 
matches from yesterday's ads sorted by relevance, and so on. I would also like 
the search to return only ads where every single term of the query is found 
across my three fields (title, body, location). I can't seem to get this to work.
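
For reference, the request I send looks roughly like this (the host, port, and 
the date field name "date" here are placeholders rather than my exact values):

    http://localhost:8983/solr/select?qt=dismax_ru_RU
        &q=1970 Ташкент
        &sort=date desc,score desc
        &debugQuery=on
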
When I do a search for ‘1970’, it works fine and returns 2 ads containing 
1970. If I search for ‘Ташкент’ I get 3 results, including one matched via 
Russian stemming (Ташкента). But when I search for ‘1970 Ташкент’, it seems to 
ignore 1970 and gives me the same results as searching for ‘Ташкент’ alone. I 
turned on the debug output, and 1970 does indeed seem to be ignored in the 
matching:

<lst name="debug">
 <str name="rawquerystring">"1970 Ташкент"</str>
 <str name="querystring">"1970 Ташкент"</str>
 <str name="parsedquery">+DisjunctionMaxQuery((body_ru_RU:ташкент^0.8 | 
body_literal:"1970 ташкент" | title_ru_RU:ташкент^1.3 | location_literal:"1970 
ташкент"^0.5 | location_ru_RU:ташкент^0.4 | title_literal:"1970 
ташкент"^1.5)~0.01) DisjunctionMaxQuery((body_ru_RU:ташкент^0.8 | 
body_literal:"1970 ташкент"~100 | title_ru_RU:ташкент^1.3 | 
location_literal:"1970 ташкент"~100^0.5 | location_ru_RU:ташкент^0.4 | 
title_literal:"1970 ташкент"~100^1.5)~0.01)</str>
 <str name="parsedquery_toString">+(body_ru_RU:ташкент^0.8 | body_literal:"1970 
ташкент" | title_ru_RU:ташкент^1.3 | location_literal:"1970 ташкент"^0.5 | 
location_ru_RU:ташкент^0.4 | title_literal:"1970 ташкент"^1.5)~0.01 
(body_ru_RU:ташкент^0.8 | body_literal:"1970 ташкент"~100 | 
title_ru_RU:ташкент^1.3 | location_literal:"1970 ташкент"~100^0.5 | 
location_ru_RU:ташкент^0.4 | title_literal:"1970 ташкент"~100^1.5)~0.01</str>
 <lst name="explain">
  <str name="id=€#26;੥,internal_docid=4">
0.7263521 = (MATCH) sum of:
  0.36317605 = (MATCH) max plus 0.01 times others of:
    0.36317605 = (MATCH) weight(location_ru_RU:ташкент^0.4 in 4), product of:
      0.08076847 = queryWeight(location_ru_RU:ташкент^0.4), product of:
        0.4 = boost
        4.4965076 = idf(docFreq=2)
        0.044906225 = queryNorm
      4.4965076 = (MATCH) fieldWeight(location_ru_RU:ташкент in 4), product of:
        1.0 = tf(termFreq(location_ru_RU:ташкент)=1)
        4.4965076 = idf(docFreq=2)
        1.0 = fieldNorm(field=location_ru_RU, doc=4)
  0.36317605 = (MATCH) max plus 0.01 times others of:
    0.36317605 = (MATCH) weight(location_ru_RU:ташкент^0.4 in 4), product of:
      0.08076847 = queryWeight(location_ru_RU:ташкент^0.4), product of:
        0.4 = boost
        4.4965076 = idf(docFreq=2)
        0.044906225 = queryNorm
      4.4965076 = (MATCH) fieldWeight(location_ru_RU:ташкент in 4), product of:
        1.0 = tf(termFreq(location_ru_RU:ташкент)=1)
        4.4965076 = idf(docFreq=2)
        1.0 = fieldNorm(field=location_ru_RU, doc=4)
</str>
  <str name="id=€#26;ી,internal_docid=9">
0.7263521 = (MATCH) sum of:
  0.36317605 = (MATCH) max plus 0.01 times others of:
    0.36317605 = (MATCH) weight(location_ru_RU:ташкент^0.4 in 9), product of:
      0.08076847 = queryWeight(location_ru_RU:ташкент^0.4), product of:
        0.4 = boost
        4.4965076 = idf(docFreq=2)
        0.044906225 = queryNorm
      4.4965076 = (MATCH) fieldWeight(location_ru_RU:ташкент in 9), product of:
        1.0 = tf(termFreq(location_ru_RU:ташкент)=1)
        4.4965076 = idf(docFreq=2)
        1.0 = fieldNorm(field=location_ru_RU, doc=9)
  0.36317605 = (MATCH) max plus 0.01 times others of:
    0.36317605 = (MATCH) weight(location_ru_RU:ташкент^0.4 in 9), product of:
      0.08076847 = queryWeight(location_ru_RU:ташкент^0.4), product of:
        0.4 = boost
        4.4965076 = idf(docFreq=2)
        0.044906225 = queryNorm
      4.4965076 = (MATCH) fieldWeight(location_ru_RU:ташкент in 9), product of:
        1.0 = tf(termFreq(location_ru_RU:ташкент)=1)
        4.4965076 = idf(docFreq=2)
        1.0 = fieldNorm(field=location_ru_RU, doc=9)
</str>
  <str name="id=€#26;੕,internal_docid=2">
0.43162674 = (MATCH) sum of:
  0.21581337 = (MATCH) max plus 0.01 times others of:
    0.21581337 = (MATCH) weight(body_ru_RU:ташкент^0.8 in 2), product of:
      0.17610328 = queryWeight(body_ru_RU:ташкент^0.8), product of:
        0.8 = boost
        4.901973 = idf(docFreq=1)
        0.044906225 = queryNorm
      1.2254932 = (MATCH) fieldWeight(body_ru_RU:ташкент in 2), product of:
        1.0 = tf(termFreq(body_ru_RU:ташкент)=1)
        4.901973 = idf(docFreq=1)
        0.25 = fieldNorm(field=body_ru_RU, doc=2)
  0.21581337 = (MATCH) max plus 0.01 times others of:
    0.21581337 = (MATCH) weight(body_ru_RU:ташкент^0.8 in 2), product of:
      0.17610328 = queryWeight(body_ru_RU:ташкент^0.8), product of:
        0.8 = boost
        4.901973 = idf(docFreq=1)
        0.044906225 = queryNorm
      1.2254932 = (MATCH) fieldWeight(body_ru_RU:ташкент in 2), product of:
        1.0 = tf(termFreq(body_ru_RU:ташкент)=1)
        4.901973 = idf(docFreq=1)
        0.25 = fieldNorm(field=body_ru_RU, doc=2)
</str>
 </lst>
</lst>

Apologies for the verbosity; can anyone help me achieve my goal?

Thanks
Stephanie

