On 11/30/06, Stephanie Belton <[EMAIL PROTECTED]> wrote:
I am using Solr to index and search documents in Russian. I have successfully set up the RussianAnalyzer but found that it eliminates some tokens such as numbers.
You can get better control (and avoid having numbers removed) by using TokenFilters instead of analyzers. You might be able to use the Snowball (Porter-style) stemmer for Russian (though I don't know how it compares to the one you are using):

  <filter class="solr.SnowballPorterFilterFactory" language="Russian" />

Here is a portion of the code from RussianAnalyzer.java:

  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = new RussianLetterTokenizer(reader, charset);
    result = new RussianLowerCaseFilter(result, charset);
    result = new StopFilter(result, stopSet);
    result = new RussianStemFilter(result, charset);
    return result;
  }

You could easily create FilterFactories for these Russian-specific filters, and then gain the ability to use them just like the other factories included in Solr. It's probably the RussianLetterTokenizer that is throwing away numbers. Assuming Russian uses normal whitespace, you might be able to use the WhitespaceTokenizer instead.
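For example, a schema.xml field type along these lines would tokenize on whitespace (so numeric tokens like 1970 survive) while still lowercasing, removing stopwords, and stemming. This is only a sketch: the type name "text_ru" and the stopword file "stopwords_ru.txt" are made-up placeholders, while the factory classes are the standard ones shipped with Solr:

```xml
<!-- schema.xml: analysis chain that keeps numbers but still stems Russian -->
<fieldtype name="text_ru" class="solr.TextField">
  <analyzer>
    <!-- whitespace tokenization keeps numeric tokens like 1970 -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- stopwords_ru.txt is a hypothetical stopword list you would supply -->
    <filter class="solr.StopFilterFactory" words="stopwords_ru.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Russian"/>
  </analyzer>
</fieldtype>
```

Fields using this type would then stem Ташкента to match Ташкент while leaving 1970 intact as a token.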
I would also like the search to return only ads where every single query term was found across my 3 fields (title, body, location). I can't seem to get this to work. When I do a search for '1970', it works fine and returns 2 ads containing 1970. If I search for 'Ташкент' I get 3 results, including one matched via Russian stemming (Ташкента). But when I do a search for '1970 Ташкента' it seems to ignore 1970 and gives me the same results as searching only for 'Ташкент'. I turned on the debug info, and 1970 seems to be ignored in the matching:
You are including the Russian-stemmed fields in the dismax query, and the analysis of those fields discards numbers, hence 1970 is ignored, right? Either querying only the literal fields, or fixing the stemmed fields' analysis so it doesn't discard numbers, may help (or at least get you further along). -Yonik
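The first option — querying only the literal fields — can be sketched as a dismax handler configuration. The handler name and boosts below are illustrative (the boosts mirror the ones visible in the debug output); the mm parameter is what enforces "every single term must match":

```xml
<!-- solrconfig.xml: dismax handler searching only the literal fields,
     so numeric tokens like 1970 are not dropped by stemming analysis -->
<requestHandler name="dismax" class="solr.DisMaxRequestHandler">
  <lst name="defaults">
    <str name="qf">title_literal^1.5 body_literal location_literal^0.5</str>
    <!-- require all query terms to match somewhere in the qf fields -->
    <str name="mm">100%</str>
  </lst>
</requestHandler>
```

The trade-off is that you lose stemmed matches (Ташкента no longer matches Ташкент) until the stemmed fields' analysis is fixed to keep numbers.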
<lst name="debug">
  <str name="rawquerystring">"1970 Ташкент"</str>
  <str name="querystring">"1970 Ташкент"</str>
  <str name="parsedquery">
    +DisjunctionMaxQuery((body_ru_RU:ташкент^0.8 | body_literal:"1970 ташкент"
        | title_ru_RU:ташкент^1.3 | location_literal:"1970 ташкент"^0.5
        | location_ru_RU:ташкент^0.4 | title_literal:"1970 ташкент"^1.5)~0.01)
    DisjunctionMaxQuery((body_ru_RU:ташкент^0.8 | body_literal:"1970 ташкент"~100
        | title_ru_RU:ташкент^1.3 | location_literal:"1970 ташкент"~100^0.5
        | location_ru_RU:ташкент^0.4 | title_literal:"1970 ташкент"~100^1.5)~0.01)
  </str>
  <str name="parsedquery_toString">
    +(body_ru_RU:ташкент^0.8 | body_literal:"1970 ташкент" | title_ru_RU:ташкент^1.3
        | location_literal:"1970 ташкент"^0.5 | location_ru_RU:ташкент^0.4
        | title_literal:"1970 ташкент"^1.5)~0.01
    (body_ru_RU:ташкент^0.8 | body_literal:"1970 ташкент"~100 | title_ru_RU:ташкент^1.3
        | location_literal:"1970 ташкент"~100^0.5 | location_ru_RU:ташкент^0.4
        | title_literal:"1970 ташкент"~100^1.5)~0.01
  </str>
  <lst name="explain">
    <str name="id=€#26;,internal_docid=4">
      0.7263521 = (MATCH) sum of:
        0.36317605 = (MATCH) max plus 0.01 times others of:
          0.36317605 = (MATCH) weight(location_ru_RU:ташкент^0.4 in 4), product of:
            0.08076847 = queryWeight(location_ru_RU:ташкент^0.4), product of:
              0.4 = boost
              4.4965076 = idf(docFreq=2)
              0.044906225 = queryNorm
            4.4965076 = (MATCH) fieldWeight(location_ru_RU:ташкент in 4), product of:
              1.0 = tf(termFreq(location_ru_RU:ташкент)=1)
              4.4965076 = idf(docFreq=2)
              1.0 = fieldNorm(field=location_ru_RU, doc=4)
        0.36317605 = (MATCH) max plus 0.01 times others of:
          0.36317605 = (MATCH) weight(location_ru_RU:ташкент^0.4 in 4), product of:
            0.08076847 = queryWeight(location_ru_RU:ташкент^0.4), product of:
              0.4 = boost
              4.4965076 = idf(docFreq=2)
              0.044906225 = queryNorm
            4.4965076 = (MATCH) fieldWeight(location_ru_RU:ташкент in 4), product of:
              1.0 = tf(termFreq(location_ru_RU:ташкент)=1)
              4.4965076 = idf(docFreq=2)
              1.0 = fieldNorm(field=location_ru_RU, doc=4)
    </str>
    <str name="id=€#26;ી,internal_docid=9">
      0.7263521 = (MATCH) sum of:
        0.36317605 = (MATCH) max plus 0.01 times others of:
          0.36317605 = (MATCH) weight(location_ru_RU:ташкент^0.4 in 9), product of:
            0.08076847 = queryWeight(location_ru_RU:ташкент^0.4), product of:
              0.4 = boost
              4.4965076 = idf(docFreq=2)
              0.044906225 = queryNorm
            4.4965076 = (MATCH) fieldWeight(location_ru_RU:ташкент in 9), product of:
              1.0 = tf(termFreq(location_ru_RU:ташкент)=1)
              4.4965076 = idf(docFreq=2)
              1.0 = fieldNorm(field=location_ru_RU, doc=9)
        0.36317605 = (MATCH) max plus 0.01 times others of:
          0.36317605 = (MATCH) weight(location_ru_RU:ташкент^0.4 in 9), product of:
            0.08076847 = queryWeight(location_ru_RU:ташкент^0.4), product of:
              0.4 = boost
              4.4965076 = idf(docFreq=2)
              0.044906225 = queryNorm
            4.4965076 = (MATCH) fieldWeight(location_ru_RU:ташкент in 9), product of:
              1.0 = tf(termFreq(location_ru_RU:ташкент)=1)
              4.4965076 = idf(docFreq=2)
              1.0 = fieldNorm(field=location_ru_RU, doc=9)
    </str>
    <str name="id=€#26;,internal_docid=2">
      0.43162674 = (MATCH) sum of:
        0.21581337 = (MATCH) max plus 0.01 times others of:
          0.21581337 = (MATCH) weight(body_ru_RU:ташкент^0.8 in 2), product of:
            0.17610328 = queryWeight(body_ru_RU:ташкент^0.8), product of:
              0.8 = boost
              4.901973 = idf(docFreq=1)
              0.044906225 = queryNorm
            1.2254932 = (MATCH) fieldWeight(body_ru_RU:ташкент in 2), product of:
              1.0 = tf(termFreq(body_ru_RU:ташкент)=1)
              4.901973 = idf(docFreq=1)
              0.25 = fieldNorm(field=body_ru_RU, doc=2)
        0.21581337 = (MATCH) max plus 0.01 times others of:
          0.21581337 = (MATCH) weight(body_ru_RU:ташкент^0.8 in 2), product of:
            0.17610328 = queryWeight(body_ru_RU:ташкент^0.8), product of:
              0.8 = boost
              4.901973 = idf(docFreq=1)
              0.044906225 = queryNorm
            1.2254932 = (MATCH) fieldWeight(body_ru_RU:ташкент in 2), product of:
              1.0 = tf(termFreq(body_ru_RU:ташкент)=1)
              4.901973 = idf(docFreq=1)
              0.25 = fieldNorm(field=body_ru_RU, doc=2)
    </str>
  </lst>
</lst>