On 11/30/06, Stephanie Belton <[EMAIL PROTECTED]> wrote:
I am using Solr to index and search documents in Russian. I have successfully set up the RussianAnalyzer but found that it eliminates some tokens such as numbers.
You can get better control (and avoid having numbers removed) by using TokenFilters instead of analyzers. You might be able to use the Snowball (Porter-style) stemmer for Russian (though I don't know how it compares to the one you are using):

  <filter class="solr.SnowballPorterFilterFactory" language="Russian" />

Here is a portion of the code from RussianAnalyzer.java:

  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = new RussianLetterTokenizer(reader, charset);
    result = new RussianLowerCaseFilter(result, charset);
    result = new StopFilter(result, stopSet);
    result = new RussianStemFilter(result, charset);
    return result;
  }

You could easily create FilterFactories for these Russian-specific filters, and then gain the ability to use them just like the other factories included in Solr. It's probably the RussianLetterTokenizer that is throwing away numbers. Assuming Russian uses normal whitespace, you might be able to use the WhitespaceTokenizer instead.
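For example, a schema.xml field type along these lines would tokenize on whitespace (so numeric tokens like 1970 survive) while still lowercasing, removing stopwords, and stemming. This is only a sketch: the type name "text_ru" and the stopword file "stopwords_ru.txt" are made-up placeholders, while the factory classes are the standard ones shipped with Solr:

```xml
<!-- schema.xml: analysis chain that keeps numbers but still stems Russian -->
<fieldtype name="text_ru" class="solr.TextField">
  <analyzer>
    <!-- whitespace tokenization keeps numeric tokens like 1970 -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- stopwords_ru.txt is a hypothetical stopword list you would supply -->
    <filter class="solr.StopFilterFactory" words="stopwords_ru.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Russian"/>
  </analyzer>
</fieldtype>
```

Fields using this type would then stem Ташкента to match Ташкент while leaving 1970 intact as a token.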
I would also like the search to return only ads where every single query term was found across my 3 fields (title, body, location). I can't seem to get this to work. When I do a search for '1970', it works fine and returns 2 ads containing 1970. If I search for 'Ташкент' I get 3 results, including one matched via Russian stemming (Ташкента). But when I do a search for '1970 Ташкента' it seems to ignore 1970 and gives me the same results as searching only for 'Ташкент'. I turned on the debug info, and 1970 seems to be ignored in the matching:
You are including the Russian-stemmed fields in the dismax query, and the analysis of those fields discards numbers, hence 1970 is ignored, right? Either querying only the literal fields, or fixing the stemmed fields' analysis so it doesn't discard numbers, may help (or at least get you further along). -Yonik
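The first option — querying only the literal fields — can be sketched as a dismax handler configuration. The handler name and boosts below are illustrative (the boosts mirror the ones visible in the debug output); the mm parameter is what enforces "every single term must match":

```xml
<!-- solrconfig.xml: dismax handler searching only the literal fields,
     so numeric tokens like 1970 are not dropped by stemming analysis -->
<requestHandler name="dismax" class="solr.DisMaxRequestHandler">
  <lst name="defaults">
    <str name="qf">title_literal^1.5 body_literal location_literal^0.5</str>
    <!-- require all query terms to match somewhere in the qf fields -->
    <str name="mm">100%</str>
  </lst>
</requestHandler>
```

The trade-off is that you lose stemmed matches (Ташкента no longer matches Ташкент) until the stemmed fields' analysis is fixed to keep numbers.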
<lst name="debug">
  <str name="rawquerystring">"1970 Ташкент"</str>
  <str name="querystring">"1970 Ташкент"</str>
  <str name="parsedquery">
    +DisjunctionMaxQuery((body_ru_RU:ташкент^0.8 | body_literal:"1970 ташкент"
        | title_ru_RU:ташкент^1.3 | location_literal:"1970 ташкент"^0.5
        | location_ru_RU:ташкент^0.4 | title_literal:"1970 ташкент"^1.5)~0.01)
    DisjunctionMaxQuery((body_ru_RU:ташкент^0.8 | body_literal:"1970 ташкент"~100
        | title_ru_RU:ташкент^1.3 | location_literal:"1970 ташкент"~100^0.5
        | location_ru_RU:ташкент^0.4 | title_literal:"1970 ташкент"~100^1.5)~0.01)
  </str>
  <str name="parsedquery_toString">
    +(body_ru_RU:ташкент^0.8 | body_literal:"1970 ташкент" | title_ru_RU:ташкент^1.3
        | location_literal:"1970 ташкент"^0.5 | location_ru_RU:ташкент^0.4
        | title_literal:"1970 ташкент"^1.5)~0.01
    (body_ru_RU:ташкент^0.8 | body_literal:"1970 ташкент"~100 | title_ru_RU:ташкент^1.3
        | location_literal:"1970 ташкент"~100^0.5 | location_ru_RU:ташкент^0.4
        | title_literal:"1970 ташкент"~100^1.5)~0.01
  </str>
  <lst name="explain">
    <str name="id=€#26;,internal_docid=4">
      0.7263521 = (MATCH) sum of:
        0.36317605 = (MATCH) max plus 0.01 times others of:
          0.36317605 = (MATCH) weight(location_ru_RU:ташкент^0.4 in 4), product of:
            0.08076847 = queryWeight(location_ru_RU:ташкент^0.4), product of:
              0.4 = boost
              4.4965076 = idf(docFreq=2)
              0.044906225 = queryNorm
            4.4965076 = (MATCH) fieldWeight(location_ru_RU:ташкент in 4), product of:
              1.0 = tf(termFreq(location_ru_RU:ташкент)=1)
              4.4965076 = idf(docFreq=2)
              1.0 = fieldNorm(field=location_ru_RU, doc=4)
        0.36317605 = (MATCH) max plus 0.01 times others of:
          0.36317605 = (MATCH) weight(location_ru_RU:ташкент^0.4 in 4), product of:
            0.08076847 = queryWeight(location_ru_RU:ташкент^0.4), product of:
              0.4 = boost
              4.4965076 = idf(docFreq=2)
              0.044906225 = queryNorm
            4.4965076 = (MATCH) fieldWeight(location_ru_RU:ташкент in 4), product of:
              1.0 = tf(termFreq(location_ru_RU:ташкент)=1)
              4.4965076 = idf(docFreq=2)
              1.0 = fieldNorm(field=location_ru_RU, doc=4)
    </str>
    <str name="id=€#26;ી,internal_docid=9">
      0.7263521 = (MATCH) sum of:
        0.36317605 = (MATCH) max plus 0.01 times others of:
          0.36317605 = (MATCH) weight(location_ru_RU:ташкент^0.4 in 9), product of:
            0.08076847 = queryWeight(location_ru_RU:ташкент^0.4), product of:
              0.4 = boost
              4.4965076 = idf(docFreq=2)
              0.044906225 = queryNorm
            4.4965076 = (MATCH) fieldWeight(location_ru_RU:ташкент in 9), product of:
              1.0 = tf(termFreq(location_ru_RU:ташкент)=1)
              4.4965076 = idf(docFreq=2)
              1.0 = fieldNorm(field=location_ru_RU, doc=9)
        0.36317605 = (MATCH) max plus 0.01 times others of:
          0.36317605 = (MATCH) weight(location_ru_RU:ташкент^0.4 in 9), product of:
            0.08076847 = queryWeight(location_ru_RU:ташкент^0.4), product of:
              0.4 = boost
              4.4965076 = idf(docFreq=2)
              0.044906225 = queryNorm
            4.4965076 = (MATCH) fieldWeight(location_ru_RU:ташкент in 9), product of:
              1.0 = tf(termFreq(location_ru_RU:ташкент)=1)
              4.4965076 = idf(docFreq=2)
              1.0 = fieldNorm(field=location_ru_RU, doc=9)
    </str>
    <str name="id=€#26;,internal_docid=2">
      0.43162674 = (MATCH) sum of:
        0.21581337 = (MATCH) max plus 0.01 times others of:
          0.21581337 = (MATCH) weight(body_ru_RU:ташкент^0.8 in 2), product of:
            0.17610328 = queryWeight(body_ru_RU:ташкент^0.8), product of:
              0.8 = boost
              4.901973 = idf(docFreq=1)
              0.044906225 = queryNorm
            1.2254932 = (MATCH) fieldWeight(body_ru_RU:ташкент in 2), product of:
              1.0 = tf(termFreq(body_ru_RU:ташкент)=1)
              4.901973 = idf(docFreq=1)
              0.25 = fieldNorm(field=body_ru_RU, doc=2)
        0.21581337 = (MATCH) max plus 0.01 times others of:
          0.21581337 = (MATCH) weight(body_ru_RU:ташкент^0.8 in 2), product of:
            0.17610328 = queryWeight(body_ru_RU:ташкент^0.8), product of:
              0.8 = boost
              4.901973 = idf(docFreq=1)
              0.044906225 = queryNorm
            1.2254932 = (MATCH) fieldWeight(body_ru_RU:ташкент in 2), product of:
              1.0 = tf(termFreq(body_ru_RU:ташкент)=1)
              4.901973 = idf(docFreq=1)
              0.25 = fieldNorm(field=body_ru_RU, doc=2)
    </str>
  </lst>
</lst>