Same score for different length matches
Hey,

we have multiple documents that match the query in question ("name:hubwagen"). The thing is, some of the documents only contain the query term as part of a longer word, while others match 100% in the "name" field:

Hochhubwagen 5.9861565
Hubwagen 5.9861565

The debug output looks like this (for the first and 5th match):

namhubwagnamehubwag name:Hubwagen name:Hubwagen name:hubwag name:hubwag

5.9861565 = (MATCH) weight(name:hubwag in 8093) [DefaultSimilarity], result of:
  5.9861565 = fieldWeight in 8093, product of:
    1.0 = tf(freq=1.0), with freq of:
      1.0 = termFreq=1.0
    5.9861565 = idf(docFreq=109, maxDocs=16101)
    1.0 = fieldNorm(doc=8093)

5.9861565 = (MATCH) weight(name:hubwag in 9537) [DefaultSimilarity], result of:
  5.9861565 = fieldWeight in 9537, product of:
    1.0 = tf(freq=1.0), with freq of:
      1.0 = termFreq=1.0
    5.9861565 = idf(docFreq=109, maxDocs=16101)
    1.0 = fieldNorm(doc=9537)

Now, I am fairly certain that at one point it worked in a way that a higher match length would rank higher. As far as I can read in the SolrRelevancyFAQ, the correct term is "lengthNorm". However, I am missing a preference for the full match. Usually, the debug output helps me identify mistakes, but in this case it only tells me that the scores are perfectly equal, down to the lowest level.
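The numbers in the explain output above can be reproduced by hand, which helps to see where a length preference would have to come from. The following is only a sketch, assuming the classic TF-IDF formulas commonly documented for Lucene's DefaultSimilarity (tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1))); the function names are made up for illustration.

```python
import math

def classic_idf(doc_freq, max_docs):
    # Assumed classic formula: 1 + ln(maxDocs / (docFreq + 1))
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def field_weight(term_freq, doc_freq, max_docs, field_norm):
    # fieldWeight = tf * idf * fieldNorm, as listed in the explain output
    tf = math.sqrt(term_freq)
    return tf * classic_idf(doc_freq, max_docs) * field_norm

# Values taken from the explain output for documents 8093 and 9537
score = field_weight(term_freq=1.0, doc_freq=109, max_docs=16101, field_norm=1.0)
print(score)  # about 5.986
```

In this model, fieldNorm is the only factor that depends on the field length. Both documents report fieldNorm = 1.0, so tf and idf alone decide the score and the result is identical for both.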
Suggester needed for returning suggestions when term is not start of field value
Hey, I'm playing around with the suggester component, and it works perfectly as described: Suggestions for 'logitech mouse' include 'logitech mouse g500' and 'logitech mouse gaming'. However, when the words in the record supplying the suggester do not follow each other as in the search terms, nothing is returned. Suggestions for 'logitech mouse' do not include 'logitech g500 mouse'. Is there a suggester implementation that can suggest records that way? Best wishes.
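To make the difference between the two behaviours concrete, here is a small sketch in plain Python. It is not Solr code and the function names are hypothetical: the first function models a suggester that matches the typed text against the start of the whole record, the second one matches each typed word against the start of any word in the record, regardless of order. Solr's AnalyzingInfixLookupFactory is documented as suggesting on prefix matches to any token in the indexed text, which sounds close to the second behaviour; whether it fits this use case is an assumption that would need testing.

```python
def prefix_match(query, candidate):
    # The whole query must be a prefix of the whole record
    return candidate.lower().startswith(query.lower())

def all_tokens_match(query, candidate):
    # Every query word must be a prefix of some word in the record
    candidate_tokens = candidate.lower().split()
    return all(
        any(token.startswith(word) for token in candidate_tokens)
        for word in query.lower().split()
    )

records = ["logitech mouse g500", "logitech mouse gaming", "logitech g500 mouse"]
query = "logitech mouse"

prefix_hits = [r for r in records if prefix_match(query, r)]
token_hits = [r for r in records if all_tokens_match(query, r)]
print(prefix_hits)  # the first two records only
print(token_hits)   # all three records
```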
Re: Questions regarding autosuggest (Solr 5.2.1)
God damn. Thank you. *ashamed*

On 30.06.2015 00:21, Erick Erickson wrote:

Try not putting it in double quotes?

Best,
Erick

On Mon, Jun 29, 2015 at 12:22 PM, Thomas Michael Engelke thomas.enge...@posteo.de wrote:
[...]
Questions regarding autosuggest (Solr 5.2.1)
A friend and I are trying to develop some software using Solr in the background, and with that come a lot of changes. We're used to older versions (4.3 and below). We especially have problems with the autosuggest feature.

This is the field definition (schema.xml) for our autosuggest field:

<field name="autosuggest" type="autosuggest" indexed="true" stored="true" required="false" multiValued="true" />
...
<copyField source="name" dest="autosuggest" />
...
<fieldType name="autosuggest" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="0" splitOnNumerics="1"
            generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="0"
            catenateAll="0" preserveOriginal="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"
            enablePositionIncrements="true" format="snowball"/>
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="dictionary.txt"
            minWordSize="5" minSubwordSize="3" maxSubwordSize="30" onlyLongestMatch="false"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2" protected="protwords.txt"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="30"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="0" splitOnNumerics="1"
            generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="0"
            catenateAll="0" preserveOriginal="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"
            enablePositionIncrements="true" format="snowball"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

Afterwards, we defined an autosuggest component to use this field, like this (solrconfig.xml):

<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">mySuggester</str>
    <str name="lookupImpl">FuzzyLookupFactory</str>
    <str name="storeDir">suggester_fuzzy_dir</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">suggest</str>
    <str name="suggestAnalyzerFieldType">autosuggest</str>
    <str name="buildOnStartup">false</str>
    <str name="buildOnCommit">false</str>
  </lst>
</searchComponent>

And we added a request handler to test out the functionality:

<requestHandler name="/suggesthandler" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="suggest">true</str>
    <str name="suggest.count">10</str>
    <str name="suggest.dictionary">mySuggester</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>

However, when we try to start the core that has this configuration, a long exception occurs, telling us this:

Error in configuration: autosuggest is not defined in the schema

Now, that seems to be wrong. Any idea how to fix that?
Problem with german hyphenated words not being found
Hey,

in German, you can string most nouns together by using hyphens, like this:

Industrie = industry
Anhänger = trailer
Industrie-Anhänger = trailer for industrial use

Here [1], you can see me querying Industrieanhänger from the name field (name:Industrieanhänger), to make sure the index actually contains the word. Our data is structured so that products are listed without the hyphen. Now, customers can come around and use the hyphenated version as a search term (e.g. industrie-anhänger), and of course we want them to find what they are looking for.

I've set it up so that the WordDelimiterFilterFactory uses catenateWords=1, so that these words are catenated. An analysis of Industrieanhänger as index and industrie-anhänger as query can be seen here [2]. You can see that both word parts are found.

However, querying for industrie-anhänger does not yield results, only when the hyphen is removed, as you can see here [3]. I'm not sure how to proceed from here, as the results of the analysis have so far always lined up with what I could see when querying.

Here's the schema definition for text, the field type for the name field:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1" splitOnNumerics="1"
            generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="0"
            catenateAll="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="dictionary.txt"
            minWordSize="5" minSubwordSize="3" maxSubwordSize="30" onlyLongestMatch="false"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"
            enablePositionIncrements="true" format="snowball"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1" splitOnNumerics="1"
            generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="0"
            catenateAll="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="dictionary.txt"
            minWordSize="5" minSubwordSize="3" maxSubwordSize="30" onlyLongestMatch="false"/> -->
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"
            enablePositionIncrements="true" format="snowball"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

I've also thought it might be a problem with URL encoding not encoding the hyphen, but replacing it with %2D didn't change the outcome (and was probably wrong anyway).

Any help is greatly appreciated.

Links:
[1] http://imgur.com/2oEC5vz
[2] http://i.imgur.com/H0AhEsF.png
[3] http://imgur.com/dzmMe7t
Re: Problem with german hyphenated words not being found
Thank you for your input. Here's how the query looks with debugQuery=true:

rawquerystring: name:industrie-anhänger,
querystring: name:industrie-anhänger,
parsedquery: MultiPhraseQuery(name:(industrie-anhang industri) (anhang industrieanhang)),
parsedquery_toString: name:(industrie-anhang industri) (anhang industrieanhang),

It looks like there are some rules applied, expressed by the parentheses. What's the correct interpretation of that? The default operator is OR, yet this looks like the terms inside the parentheses are grouped using AND.

On 11.06.2015 12:40, Upayavira wrote:

The next thing to do is add debugQuery=true to your URL (or enable it in the query pane of the admin UI). Then look for the parsed query info.

On the standard text_en field, which includes an English stop word filter, I ran a query on Jack and Jill's House which showed this output:

rawquerystring: text_en:(Jack and Jill's House),
querystring: text_en:(Jack and Jill's House),
parsedquery: text_en:jack text_en:jill text_en:hous,
parsedquery_toString: text_en:jack text_en:jill text_en:hous,

You can see that the parsed query is formed *after* analysis, so you can see exactly what is being queried for.

Also, as a corollary to this, you can use the schema browser (or faceting for that matter) to view what terms are being indexed, to see if they should match.

HTH
Upayavira

On 11.06.2015 12:00, Upayavira wrote:

Have you used the analysis tab in the admin UI? You can type in sentences for both index and query time and see how they would be analysed by various fields/field types.

Once you have got index time and query time to result in the same tokens at the end of the analysis chain, you should start seeing matches in your queries.
Upayavira

On Thu, Jun 11, 2015, at 10:26 AM, Thomas Michael Engelke wrote:
[...]
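On the question of how to read the parenthesized groups in the parsed query: in a Lucene MultiPhraseQuery, each group stands for one phrase position, the terms inside a group are alternatives (OR) for that position, and the positions have to follow each other as in a phrase. The Python sketch below only models that rule; it is not Lucene code, and the function name is made up. The document-side token positions are an assumption (namely that the compound word filter puts the subwords at the same position as the original token) and should be checked on the analysis screen.

```python
def multi_phrase_match(doc_positions, groups):
    """Return True if the groups match at consecutive token positions.

    doc_positions: list of sets, one set of terms per token position.
    groups: list of sets, one set of alternative terms per phrase position.
    """
    span = len(groups)
    for start in range(len(doc_positions) - span + 1):
        # Each group needs at least one of its terms at its position
        if all(doc_positions[start + i] & groups[i] for i in range(span)):
            return True
    return False

# Parsed query from the debug output: two positions with alternatives each
query_groups = [{"industrie-anhang", "industri"}, {"anhang", "industrieanhang"}]

# ASSUMPTION: indexing "Industrieanhänger" yields all tokens at one position
compound_doc = [{"industrieanhang", "industri", "anhang"}]
# Hypothetical document with the two parts as separate, consecutive words
two_word_doc = [{"industri"}, {"anhang"}]

hyphen_vs_compound = multi_phrase_match(compound_doc, query_groups)
hyphen_vs_two_words = multi_phrase_match(two_word_doc, query_groups)
plain_vs_compound = multi_phrase_match(compound_doc, [{"industrieanhang"}])
print(hyphen_vs_compound, hyphen_vs_two_words, plain_vs_compound)
```

Under these assumptions the hyphenated query needs two consecutive positions, while the indexed compound occupies only one, which would be consistent with the query matching only after the hyphen is removed.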
Solr: Elevate with complex query specifying field names
I have Solr as the backend to an ECommerce solution where the fields can be configured to be searchable, which generates a schema.xml and loads it into Solr. We also allow configuring a Solr search weight per field to affect queries, so my queries usually look something like this:

spellcheck=true&fl=entity_id,score&hl.snippets=1&start=0&q=ean:test+name:test^10.00+persartnr:test^5.00+persartnr_direct:test+short_description:test&spellcheck.q=test&spellcheck.build=true=true&hl.simple.pre=<span+class%3Dhighlight>&hl.simple.post=</span>&json.nl=map&hl.fl=name,short_description&wt=json&spellcheck.collate=true&hl=true&rows=1000

Now, I want to add query elevation to the mix. I got it to work pretty flawlessly; however, I'm not sure how to get it to work with my queries, as they regularly state field names and especially boosts.

This works and gets elevated when queried as q=test:

<elevate>
  <query text="test">
    <doc id="14153" />
  </query>
</elevate>

However, when queried as q=name:test^10.00, the document is not elevated. Is there a way around that? Can I specify the naked query somehow for the elevation component?
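For readers who want to see how such a per-field weighted query might be put together, here is a minimal sketch in Python. The field names and weights are taken from the example query above; the function name and the reduced parameter set are assumptions for illustration only, and the sketch does not address the elevation question itself.

```python
from urllib.parse import urlencode, parse_qs

def build_boosted_query(term, field_boosts):
    # One clause per searchable field; a boost suffix only where a weight is set
    parts = []
    for field, boost in field_boosts.items():
        clause = f"{field}:{term}"
        if boost != 1.0:
            clause += f"^{boost:.2f}"
        parts.append(clause)
    return " ".join(parts)

# Weights as they appear in the example query (1.0 means "no boost")
field_boosts = {
    "ean": 1.0,
    "name": 10.0,
    "persartnr": 5.0,
    "persartnr_direct": 1.0,
    "short_description": 1.0,
}

q = build_boosted_query("test", field_boosts)
params = {"q": q, "fl": "entity_id,score", "start": 0, "rows": 1000, "wt": "json"}
query_string = urlencode(params)  # takes care of separators and escaping
print(q)
print(query_string)
```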
Re: Suggester not suggesting anything using DictionaryCompoundWordTokenFilterFactory
I think I found the problem. The definition of the suggester component has a field option which references the field that the suggester uses to generate suggestions. Changing this to the field using the DictionaryCompoundWordTokenFilterFactory also suggests word parts.

On 11.11.2014 08:52, Thomas Michael Engelke wrote:
[...]
How to suggest from multiple fields?
Like in this article (http://www.andornot.com/blog/post/Advanced-autocomplete-with-Solr-Ngrams-and-Twitters-typeaheadjs.aspx), I am using multiple fields to generate different options for an autosuggest functionality:

- First, the whole field (top priority)
- Then, the whole field as EdgeNGrams from the left side (normal priority)
- Lastly, single words or word parts (compound words) as EdgeNGrams

However, I was not very successful in supplying a single requestHandler (/suggest) with data from multiple suggesters. I have also not been able to find any sample of how this might be done correctly.

Is there a sample that I can read, or documentation of how this might be done? The referenced article was doing it, yet only marginally described the technical implementation.
Re: Best practice: Autosuggest/autocomplete vs. real search
The dedicated autosuggest field is not used by a suggester component; instead, we just query it directly (/select). I'm trying to read my way into how the suggesters work, and toying around with some configurations (for instance from here: http://www.andornot.com/blog/post/Advanced-autocomplete-with-Solr-Ngrams-and-Twitters-typeaheadjs.aspx). Compared to how you can analyze search results through the Solr backend, the analysis of suggester results seems to be sorely lacking.

On 10.11.2014 14:37, Michael Sokolov wrote:

The goal is to ensure that suggestions from autocomplete are actually terms in the main index, so that the suggestions will actually result in matches. You've considered expanding the main index by adding the suggestion n-grams to it, but it would probably be better to alter your suggester so that it produces only tokens that are in the main index. I think this is basically how all the Suggester implementations are designed to work already; are you using one of those, or are you using the TermsComponent, or something else?

-Mike

On 11/10/14 2:54 AM, Thomas Michael Engelke wrote:
[...]
Suggester not suggesting anything using DictionaryCompoundWordTokenFilterFactory
I'm toying around with the suggester component, as described here: http://www.andornot.com/blog/post/Advanced-autocomplete-with-Solr-Ngrams-and-Twitters-typeaheadjs.aspx

So I made 4 fields:

<field name="text_suggest" type="text_suggest" indexed="true" stored="true" multiValued="true" />
<copyField source="name" dest="text_suggest" />
<field name="text_suggest_edge" type="text_suggest_edge" indexed="true" stored="true" multiValued="true" />
<copyField source="name" dest="text_suggest_edge" />
<field name="text_suggest_ngram" type="text_suggest_ngram" indexed="true" stored="true" multiValued="true" />
<copyField source="name" dest="text_suggest_ngram" />
<field name="text_suggest_dictionary_ngram" type="text_suggest_dictionary_ngram" indexed="true" stored="true" multiValued="true" />
<copyField source="name" dest="text_suggest_dictionary_ngram" />

with the corresponding definitions:

<fieldType name="text_suggest" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>

<fieldType name="text_suggest_edge" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50" side="front" />
  </analyzer>
</fieldType>

<fieldType name="text_suggest_ngram" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50" side="front" />
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text_suggest_dictionary_ngram" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="dictionary.txt"
            minWordSize="5" minSubwordSize="3" maxSubwordSize="30" onlyLongestMatch="false"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50" side="front" />
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>

I'm calling the suggester component this way:

http://address:8983/solr/core/suggest?qf=text_suggest^6.0%20test_suggest_edge^3.0%20text_suggest_ngram^1.0%20text_suggest_dictionary_ngram^0.2&q=wa

This seems to work fine:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
  </lst>
  <lst name="spellcheck">
    <lst name="suggestions">
      <lst name="wa">
        <int name="numFound">5</int>
        <int name="startOffset">0</int>
        <int name="endOffset">2</int>
        <arr name="suggestion">
          <str>wandelement aus gitter</str>
          <str>wandelement aus stahlblech</str>
          <str>wandelement</str>
          <str>wandhalter für prospekte</str>
          <str>wandascher, h 300 × b 230 × t 60 mm</str>
        </arr>
      </lst>
      <str name="collation">(wandelement aus gitter)</str>
    </lst>
  </lst>
</response>

However, I added the fourth field so I could get low-boosted suggestions using the aforementioned DictionaryCompoundWordTokenFilterFactory. A sample analysis for the field (type) text_suggest_dictionary_ngram for the word Geländewagen:

g ge gel gelä gelän geländ gelände geländew geländewa geländewag geländewage geländewagen
g ge gel gelä gelän geländ gelände
w wa wag wage wagen

As we can see, the DictionaryCompoundWordTokenFilterFactory extracts the word wagen and EdgeNGrams it.

However, I cannot get results from these NGrams. Trying wag as the search term for the suggester, there are no results. Yet an analysis of Geländewagen (as field value index) and wag (as field value query) shows a match.

I had the thought that it might be because the underlying component of the suggester is a spellchecker, and a spellchecker wouldn't correct wag to wagen because there was an NGram that spelled wag, and so the word was spelled correctly already. So I tried without the EdgeNGrams, but the result stays the same.
Best practice: Autosuggest/autocomplete vs. real search
We're using Solr as a backend for an ECommerce site/system. The Solr index stores products with selected attributes, as well as a dedicated field for autocomplete suggestions (done via AJAX request when typing in the search box without pressing return).

The autosuggest field is supplied by copyField directives from certain select product attribute fields (description and/or name mostly). It uses EdgeNGramFilterFactory to complete words not yet typed completely, and it works quite well.

However, we have come across an issue with a disconnect between the autosuggest results and the results of a normal search, that is, a query over the full fields of the product. Let's say there are products that are called motor.

- When autosuggesting, typing mot autosuggests all products with motor, because the EdgeNGram created m, mo, mot, moto and motor, respectively, and it matches.
- When searching for mot, however (i.e. pressing enter when seeing the autosuggestions), it doesn't find any products. The autosuggest field is not part of the real search, and no product attribute contains mot as a word.

One obvious solution would be to incorporate the autosuggest field into the real search. However, this adds many tokens to the index that aren't really part of the products indexed and makes for strange search results, for example when an NGram is also a word, but the record itself contains the search term only as part of a word.

Are there clever solutions to this problem?
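The disconnect described in the two bullet points above can be illustrated with a few lines of Python. This is only a model of what an edge n-gram filter produces, not Solr code; the function name and the gram sizes (1 to 30) are assumptions chosen to reproduce the m, mo, mot, moto, motor example.

```python
def edge_ngrams(token, min_gram, max_gram):
    # Prefixes of the token, from min_gram up to max_gram characters
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]

# Terms in the autosuggest field versus terms in the regular search fields
autosuggest_terms = set(edge_ngrams("motor", 1, 30))
search_terms = {"motor"}

typed = "mot"
grams = edge_ngrams("motor", 1, 30)
in_autosuggest = typed in autosuggest_terms
in_search = typed in search_terms
print(grams)           # ['m', 'mo', 'mot', 'moto', 'motor']
print(in_autosuggest)  # True: the suggestion appears
print(in_search)       # False: the real search finds nothing
```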
Autosuggest using EdgeNGrams with strange highlighting
We've moved from an asterisk based autosuggest functionality (searchterm*) to a version using a special field called autosuggest, filled via copyField directives. The field definition: fieldType name=autosuggest class=solr.TextField positionIncrementGap=100 analyzer type=index tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.StopFilterFactory words=stopwords.txt ignoreCase=true enablePositionIncrements=true format=snowball/ filter class=solr.DictionaryCompoundWordTokenFilterFactory dictionary=dictionary.txt minWordSize=5 minSubwordSize=3 maxSubwordSize=30 onlyLongestMatch=false/ filter class=solr.GermanNormalizationFilterFactory/ filter class=solr.SnowballPorterFilterFactory language=German2 protected=protwords.txt/ filter class=solr.EdgeNGramFilterFactory minGramSize=2 maxGramSize=15 side=front/ filter class=solr.RemoveDuplicatesTokenFilterFactory/ /analyzer analyzer type=query tokenizer class=solr.StandardTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.StopFilterFactory words=stopwords.txt ignoreCase=true enablePositionIncrements=true format=snowball/ filter class=solr.DictionaryCompoundWordTokenFilterFactory dictionary=dictionary.txt minWordSize=5 minSubwordSize=3 maxSubwordSize=30 onlyLongestMatch=false/ filter class=solr.GermanNormalizationFilterFactory/ filter class=solr.SnowballPorterFilterFactory language=German2 protected=protwords.txt/ filter class=solr.RemoveDuplicatesTokenFilterFactory/ /analyzer /fieldType It works like a charm. Now, we've had highlighting from Solr before, using these parameters: hl=truehl.simple.pre=span+class%3Dhighlighthl.snippets=1hl.simple.post=/spanspellcheck=truehl.fl=description Now, we've seen something strange. This is just an example, the problem is with more than this record. In this example, the autosuggest field contains: 2CV4 Spot, Dekorsatz, für 2CV. 
However, the highlighting branch for this autosuggest field in the record looks like this:

<lst name="highlighting">
  <lst name="34725">
    <arr name="short_description">
      <str>2CV4 Spot, Dekorsatz, für <em>2CV</em>.</str>
    </arr>
  </lst>
  ...

Although the EdgeNGramFilterFactory generated the n-grams for 2CV4 (2, 2C, 2CV, 2CV4), that term is not highlighted. Shouldn't it be? It's not a question of the number of highlights: records containing multiple occurrences of 2CV get highlighted multiple times with no problems. It seems that words which contain the search term only as a part, and which match only through the EdgeNGrams, are not highlighted. As we rely exclusively on the highlighting from Solr, this leads to records being found but having no highlight at all.
Re: Weird Problem (possible bug?) with german stemming and wildcard search
Thank you very much, this information is worth its weight in gold. So far, we've used the asterisk method because it seemed logical and straightforward. We will slowly migrate to a version using EdgeNGramFilterFactory. Thanks a bunch.

On 07.10.2014 14:42, Alexandre Rafalovitch wrote:

On 7 October 2014 08:25, Thomas Michael Engelke thomas.enge...@posteo.de wrote:

So the culprit is the asterisk at the end. As far as we can read from the docs, an asterisk is just 0 or more characters, which means that the literal word in front of the asterisk should match the query.

Not quite: http://wiki.apache.org/solr/MultitermQueryAnalysis [1] It's actually quite complicated and even depends on the exact version of Solr you are using. In fact, out of all the analyzers you showed above, I think only LowerCase will be present on the chain. Look for the (multi) marker at: http://www.solr-start.com/info/analyzers/ [2] for more details.

On a higher level, I would suggest getting away from *-based expansion and looking at EdgeNGrams instead. You can see an example of autocomplete at http://www.solr-start.com/javadoc/solr-lucene/index.html [3] and the matching configuration at: https://github.com/arafalov/Solr-Javadoc/blob/master/JavadocIndex/JavadocCollection/conf/schema.xml#L24 [4] Or a dedicated Suggester module, though information on that is a bit harder to find.

Regards, Alex.
Personal: http://www.outerthoughts.com/ [5] and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ [6] and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853 [7]

Links:
--
[1] http://wiki.apache.org/solr/MultitermQueryAnalysis
[2] http://www.solr-start.com/info/analyzers/
[3] http://www.solr-start.com/javadoc/solr-lucene/index.html
[4] https://github.com/arafalov/Solr-Javadoc/blob/master/JavadocIndex/JavadocCollection/conf/schema.xml#L24
[5] http://www.outerthoughts.com/
[6] http://www.solr-start.com/
[7] https://www.linkedin.com/groups?gid=6713853
Weird Problem (possible bug?) with german stemming and wildcard search
I have a problem with a stemmed german field. The field definition: field name=description type=text_splitting indexed=true stored=true required=false multiValued=false/ ... fieldType name=text_splitting class=solr.TextField positionIncrementGap=100 autoGeneratePhraseQueries=true analyzer type=index filter class=solr.StopFilterFactory ignoreCase=true words=stopwords.txt/ filter class=solr.WordDelimiterFilterFactory generateWordParts=1 generateNumberParts=1 catenateWords=1 catenateNumbers=1 catenateAll=0 splitOnCaseChange=1/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.KeywordMarkerFilterFactory protected=protwords.txt/ filter class=solr.PorterStemFilterFactory/ /analyzer analyzer type=query tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.SynonymFilterFactory synonyms=synonyms.txt ignoreCase=true expand=true/ filter class=solr.StopFilterFactory ignoreCase=true words=stopwords.txt/ filter class=solr.WordDelimiterFilterFactory generateWordParts=1 generateNumberParts=1 catenateWords=0 catenateNumbers=0 catenateAll=0 splitOnCaseChange=1/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.KeywordMarkerFilterFactory protected=protwords.txt/ filter class=solr.PorterStemFilterFactory/ /analyzer /fieldType When we search for a word from an autosuggest kind of component, we always add an asterisk to a word, so when somebody enters something like Radbremszylinder and waits for some milliseconds, the autosuggest list is filled with the results of searching for Radbremszylinder*. This seemed to work quite well. Today we got a bug report from a customer for that exact word. 
So I ran an analysis for the word as Field value (index) and Field value (query), and it looked like this:

Field value (index)          Field value (query)
ST    Radbremszylinder       WT    Radbremszylinder*
SF    Radbremszylinder       SF    Radbremszylinder*
WDF   Radbremszylinder       SF    Radbremszylinder*
LCF   radbremszylinder       WDF   Radbremszylinder
SKMF  radbremszylinder       LCF   radbremszylinder
PSF   radbremszylind         SKMF  radbremszylinder

As you can see, the end result looks very much alike. However, records containing that word in their description field aren't reported as results. Strangely enough, records containing Radbremszylindern (plural) are reported as results. Removing the asterisk from the end reports all records with Radbremszylinder, just as we would expect.

So the culprit is the asterisk at the end. As far as we can read from the docs, an asterisk is just 0 or more characters, which means that the literal word in front of the asterisk should match the query.

Searching further, we tried some variations, and it seems that searching for Radbremszylind* works. All records with any variation (Radbremszylinder, Radbremszylindern) are reported. So maybe there's a weird interaction with stemming? Any ideas?
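The observations above are consistent with the wildcard term being lowercased but not stemmed before it is compared with the stemmed terms in the index. The following Python sketch models only that idea. toy_stem is a hypothetical stand-in that strips a trailing "er" and nothing else; it is not the real stemmer, and the assumption that the plural form stays unstemmed is taken from the behaviour described, not from Solr itself.

```python
def toy_stem(term):
    """Hypothetical stand-in for the stemmer: strips a trailing 'er' only."""
    return term[:-2] if term.endswith("er") else term


def index_terms(words):
    # Simplified index-time chain: lowercase, then stem.
    return {toy_stem(w.lower()) for w in words}


def wildcard_search(query, terms):
    # Assumption: a trailing-asterisk query is lowercased but NOT stemmed,
    # then matched as a literal prefix against the indexed terms.
    prefix = query.rstrip("*").lower()
    return sorted(t for t in terms if t.startswith(prefix))


terms = index_terms(["Radbremszylinder", "Radbremszylindern"])
print(sorted(terms))
print(wildcard_search("Radbremszylinder*", terms))  # misses the stemmed singular
print(wildcard_search("Radbremszylind*", terms))    # finds both forms
```

In this model the singular is stored as radbremszylind, which is shorter than the unstemmed prefix radbremszylinder, so the prefix can never match it.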
RE: Solr Spellcheck suggestions only return from /select handler when returning search results
Hi James, hi list, I can confirm the existence of data that's within 1 Levenshtein step from ichtscheiben: { responseHeader: { status: 0, QTime: 0, params: { fl: name,spell, indent: true, q: name:Sichtscheiben, _: 1410423419758, wt: json, rows: 50 } }, response: { numFound: 6, start: 0, docs: [ { name: Sichtscheiben, spell: Sichtscheiben }, { name: Sichtscheiben, spell: Sichtscheiben }, { name: Sichtscheiben, spell: Sichtscheiben }, { name: Sichtscheiben, spell: Sichtscheiben }, { name: Sichtscheiben, spell: Sichtscheiben }, { name: Sichtscheiben, spell: Sichtscheiben } ] } } Multiple records exist that should match. The note for alternativeTermCount is appreciated. I've tried another term: Transport. I get suggestions when I use Transpor and Transpo, even Transpotr, but ransport doesn't yield any suggestions. Maybe it's a question of the beginning of a word and has not really anything to do with stemming. Am 10.09.2014 15:19 schrieb Dyer, James: Thomas, It looks like you've set things up correctly in that while the user is searching against a stemmed field (name), spellcheck is checking against a lightly-analyzed copy of it (spell). This is the right way to do it as spellcheck against stemmed forms is usually undesirable. But as you've experienced, you will sometimes get results (due to stemming) and also suggestions (because the spellechecker is looking at unstemmed forms). If you do not want spellcheck to return anything when you get results, you can set spellcheck.maxResultsForSuggest=0. Now keeping in mind we're comparing unstemmed forms, can you verify you indeed have something in your index that is within 2 edits of ichtscheiben ? My guess is you probably don't, which would be why you do not get spelling results in that case. Also, even if you do have something within 2 edits, if ichtscheiben occurs in your index, by default it won't try to correct it at all (even if the query returns nothing, maybe because of filters or other required terms on the query). 
In this case you need to set spellcheck.alternativeTermCount to a non-zero value (try maybe 5). See http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.alternativeTermCount [1] and following sections. James Dyer Ingram Content Group (615) 213-4311 -Original Message- From: Thomas Michael Engelke [mailto:thomas.enge...@posteo.de] Sent: Wednesday, September 10, 2014 5:00 AM To: Solr user Subject: Solr Spellcheck suggestions only return from /select handler when returning search results Hi, I'm experimenting with the Spellcheck component and have therefor used the example configuration for spell checking to try things out. My solrconfig.xml looks like this: searchComponent name=spellcheck class=solr.SpellCheckComponent str name=queryAnalyzerFieldTypespell/str !-- Multiple Spell Checkers can be declared and used by this component -- !-- a spellchecker built from a field of the main index -- lst name=spellchecker str name=namedefault/str str name=fieldspell/str str name=classnamesolr.DirectSolrSpellChecker/str !-- the spellcheck distance measure used, the default is the internal levenshtein -- str name=distanceMeasureinternal/str !-- uncomment this to require suggestions to occur in 1% of the documents float name=thresholdTokenFrequency.01/float -- /lst !-- a spellchecker that can break or combine words. See /spell handler below for usage -- lst name=spellchecker str name=namewordbreak/str str name=classnamesolr.WordBreakSolrSpellChecker/str str name=fieldspell/str str name=combineWordstrue/str str name=breakWordstrue/str int name=maxChanges10/int /lst /searchComponent And I've added the spellcheck component to my /select request handler: requestHandler name=/select class=solr.SearchHandler ... 
arr name=last-components strspellcheck/str /arr /requestHandler I have built up the spellchecker source in the schema.xml from the name field: field name=spell type=spell indexed=true stored=true required=false multiValued=false/ copyField source=name dest=spell maxChars=3 / ... fieldType name=spell class=solr.TextField positionIncrementGap=100 analyzer type=index tokenizer class=solr.StandardTokenizerFactory/ /analyzer analyzer type=query tokenizer class=solr.StandardTokenizerFactory/ /analyzer /fieldType As I'm querying the /select request handler, I should get spellcheck suggestions with my results. However, I rarely get a suggestion. Examples: query: Sichtscheibe, spellcheck suggestion: Sichtscheiben (works) query: Sichtscheib, spellcheck suggestion: Sichtscheiben (works) query: ichtscheiben, no spellcheck suggestions As far as I can identify, I only get suggestions when I get real search results. I get results for the first 2 examples, because the german StemFilterFactory translates Sichtscheibe and Sichtscheiben into Sichtscheib, so
Solr Spellcheck suggestions only return from /select handler when returning search results
Hi,

I'm experimenting with the Spellcheck component and have therefore used the example configuration for spell checking to try things out. My solrconfig.xml looks like this:

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">spell</str>
  <!-- Multiple Spell Checkers can be declared and used by this component -->
  <!-- a spellchecker built from a field of the main index -->
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">spell</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <!-- the spellcheck distance measure used, the default is the internal levenshtein -->
    <str name="distanceMeasure">internal</str>
    <!-- uncomment this to require suggestions to occur in 1% of the documents
      <float name="thresholdTokenFrequency">.01</float>
    -->
  </lst>
  <!-- a spellchecker that can break or combine words. See /spell handler below for usage -->
  <lst name="spellchecker">
    <str name="name">wordbreak</str>
    <str name="classname">solr.WordBreakSolrSpellChecker</str>
    <str name="field">spell</str>
    <str name="combineWords">true</str>
    <str name="breakWords">true</str>
    <int name="maxChanges">10</int>
  </lst>
</searchComponent>

And I've added the spellcheck component to my /select request handler:

<requestHandler name="/select" class="solr.SearchHandler">
  ...
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>

I have built up the spellchecker source in the schema.xml from the name field:

<field name="spell" type="spell" indexed="true" stored="true" required="false" multiValued="false"/>
<copyField source="name" dest="spell" maxChars="3" />
...
<fieldType name="spell" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>

As I'm querying the /select request handler, I should get spellcheck suggestions with my results. However, I rarely get a suggestion. Examples:

query: Sichtscheibe, spellcheck suggestion: Sichtscheiben (works)
query: Sichtscheib, spellcheck suggestion: Sichtscheiben (works)
query: ichtscheiben, no spellcheck suggestions

As far as I can identify, I only get suggestions when I get real search results. I get results for the first 2 examples, because the German StemFilterFactory translates Sichtscheibe and Sichtscheiben into Sichtscheib, so there are matches found. However, the third query should result in a suggestion, as the Levenshtein distance is smaller than in the second example.

Suggestions, improvements, corrections?
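For reference, the edit distances in the three examples can be checked with a small Python sketch. The second helper measures how many leading characters a query shares with the indexed word. DirectSolrSpellChecker has a minPrefix setting that requires such a shared prefix; that this setting is active here with a value of 1 is an assumption, not something visible in the configuration shown.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


def shared_prefix_len(a, b):
    """Number of leading characters that a and b have in common."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n


target = "sichtscheiben"
for query in ["sichtscheibe", "sichtscheib", "ichtscheiben"]:
    # query, edit distance to the indexed word, length of the shared prefix
    print(query, levenshtein(query, target), shared_prefix_len(query, target))
```

The third query is only one edit away, but it shares no leading character with the indexed word. If a shared prefix is required, that would explain the missing suggestion independently of stemming or result counts.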
Solr spellcheck returns more than 1 word for a 1 word spellcheck
I'm in the process of incorporating Solr spellchecking in our product. For that, I've created a new field:

<field name="spell" type="spell" indexed="true" stored="true" required="false" multiValued="false"/>
<copyField source="name" dest="spell" maxChars="3" />

And in the fieldType definitions:

<fieldType name="spell" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

Then I feed the names of products into the corresponding core. They can have a lot of words (examples): door lock rear left Door brake, door in front + rear fitting. However, the names get pretty long, and in the source data they have been truncated. This sometimes leaves parts of words at the end:

The water pump can evacuate some coo

I have created a spellcheck component, feeding off the `spell` field defined earlier. Now for the problem. Sometimes, when I look up a slightly misspelled word, I get results I do not expect. Example request:

http://solr.url:8983/solr/en/spell?q=coole

This is (part of) the response:

<str name="word">cooler</str><int name="freq">21</int>
<str name="word">coo le</str><int name="freq">2</int>
<str name="word">cable</str><int name="freq">334</int>
<str name="word">co o le</str><int name="freq">4</int>
[...]

Now, as you can see, the misspelled `coole` should have been `cooler`, and it's the first suggestion. However, the second and fourth suggestions baffle me. After a bit of research, I found them to be multiple words joined together. As I described above, `coo` was a part of a name that was truncated. I found `co` the same way, and the source data contains a small number of `o` characters on their own (product number names).

Now, my question is: Why is Solr suggesting `multiple words` pasted together for a spellcheck of a single word? Is there a way to prevent Solr from pasting together word parts to forge suggestions?
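One plausible source of such suggestions is a word-break spellchecker (for example solr.WordBreakSolrSpellChecker with breakWords enabled), which tries to split the input into pieces that each exist in the index. Whether the /spell handler in this setup uses one is an assumption. The Python sketch below only imitates the idea; break_suggestions and the dictionary contents, including the fragment le, are hypothetical.

```python
def break_suggestions(word, dictionary, max_parts=3):
    """Split word into consecutive pieces that all occur in dictionary."""
    results = []

    def split(rest, parts):
        if not rest:
            # Only report genuine splits, not the unbroken word itself.
            if len(parts) > 1:
                results.append(" ".join(parts))
            return
        if len(parts) == max_parts:
            return
        for i in range(1, len(rest) + 1):
            head = rest[:i]
            if head in dictionary:
                split(rest[i:], parts + [head])

    split(word, [])
    return results


# Hypothetical term dictionary built from the spell field, including
# truncated fragments (coo, co) and a stray single letter (o).
dictionary = {"cooler", "cable", "coo", "co", "o", "le"}
print(break_suggestions("coole", dictionary))
```

With truncated fragments in the dictionary, a misspelled word can be covered entirely by fragments, which yields multi-word suggestions of the kind seen in the response.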
Re: Ranking based on match position in field
Hi, thanks for the link. I've upgraded from the used 4.7 to the recent 4.9 version. I've tried to use the new feature with this query in the admin interface using edismax: description:Kühler^~1^5 However, the result seems to stay the same: lst name=debug str name=rawquerystringdescription:Kühler~1^5/str str name=querystringdescription:Kühler~1^5/str str name=parsedquery(+description:kühler~1^5.0)/no_coord/str str name=parsedquery_toString+description:kühler~1^5.0/str lst name=explain str name=17411 2.334934 = (MATCH) weight(description:kühler^5.0 in 4080) [DefaultSimilarity], result of: 2.334934 = score(doc=4080,freq=1.0 = termFreq=1.0 ), product of: 0.9994 = queryWeight, product of: 5.0 = boost 6.226491 = idf(docFreq=64, maxDocs=12099) 0.03212082 = queryNorm 2.3349342 = fieldWeight in 4080, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 6.226491 = idf(docFreq=64, maxDocs=12099) 0.375 = fieldNorm(doc=4080) /str str name=19085 2.334934 = (MATCH) weight(description:kühler^5.0 in 5754) [DefaultSimilarity], result of: 2.334934 = score(doc=5754,freq=1.0 = termFreq=1.0 ), product of: 0.9994 = queryWeight, product of: 5.0 = boost 6.226491 = idf(docFreq=64, maxDocs=12099) 0.03212082 = queryNorm 2.3349342 = fieldWeight in 5754, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 6.226491 = idf(docFreq=64, maxDocs=12099) 0.375 = fieldNorm(doc=5754) /str Am I using this feature wrong? Am 30.07.2014 14:48 schrieb Ahmet Arslan: Hi, Please see : https://issues.apache.org/jira/browse/SOLR-3925 [1] Ahmet On Wednesday, July 30, 2014 2:39 PM, Thomas Michael Engelke thomas.enge...@posteo.de wrote: Hi, an example. 
We have 2 records with this data in the same field (description): 1: Lufthutze vor Kühler Bj 62-65, DS 2: Kühler HY im Austausch, Altteilpfand 250 Euro A search with the parameters 'description:Kühler' does provide this debug: 2.3234584 = (MATCH) weight(description:kühler in 4053) [DefaultSimilarity], result of: 2.3234584 = fieldWeight in 4053, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 6.195889 = idf(docFreq=69, maxDocs=12637) 0.375 = fieldNorm(doc=4053) /str str name=16946 2.3234584 = (MATCH) weight(description:kühler in 5729) [DefaultSimilarity], result of: 2.3234584 = fieldWeight in 5729, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 6.195889 = idf(docFreq=69, maxDocs=12637) 0.375 = fieldNorm(doc=5729) As you can see, both get the exact same score. However, we would like to rank the second document higher, on the basis that the search term occurs further to the left of the field. Is there a component/setting that can do that? Links: -- [1] https://issues.apache.org/jira/browse/SOLR-3925
Ranking based on match position in field
Hi,

an example. We have 2 records with this data in the same field (description):

1: Lufthutze vor Kühler Bj 62-65, DS
2: Kühler HY im Austausch, Altteilpfand 250 Euro

A search with the parameters 'description:Kühler' produces this debug output:

2.3234584 = (MATCH) weight(description:kühler in 4053) [DefaultSimilarity], result of:
  2.3234584 = fieldWeight in 4053, product of:
    1.0 = tf(freq=1.0), with freq of:
      1.0 = termFreq=1.0
    6.195889 = idf(docFreq=69, maxDocs=12637)
    0.375 = fieldNorm(doc=4053)
</str>
<str name="16946">
2.3234584 = (MATCH) weight(description:kühler in 5729) [DefaultSimilarity], result of:
  2.3234584 = fieldWeight in 5729, product of:
    1.0 = tf(freq=1.0), with freq of:
      1.0 = termFreq=1.0
    6.195889 = idf(docFreq=69, maxDocs=12637)
    0.375 = fieldNorm(doc=5729)

As you can see, both get the exact same score. However, we would like to rank the second document higher, on the basis that the search term occurs further to the left in the field. Is there a component/setting that can do that?
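The numbers in the debug output can be reproduced with the classic TF-IDF formulas that DefaultSimilarity uses (tf as the square root of the term frequency, idf as 1 + ln(maxDocs / (docFreq + 1))). The Python sketch below is only a recalculation of the values shown, with fieldNorm taken as given; the function names are made up for illustration.

```python
import math


def tf(freq):
    # Term frequency factor: square root of the raw frequency.
    return math.sqrt(freq)


def idf(doc_freq, max_docs):
    # Inverse document frequency as printed in the explain output.
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def field_weight(freq, doc_freq, max_docs, field_norm):
    # fieldWeight = tf * idf * fieldNorm; match position is not an input.
    return tf(freq) * idf(doc_freq, max_docs) * field_norm


# Both documents have the same inputs, so they get the same score.
print(field_weight(1.0, 69, 12637, 0.375))
```

None of the three factors depends on where in the field the term occurs, so two documents with the same frequency and the same field norm tie by construction.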
Re: Not finding part of fulltext field when word ends in dot
That was a complicated answer, but ultimately the right one. Thank you very much. 2014-01-30 Jack Krupansky j...@basetechnology.com: The word delimiter filter will turn 26KA into two tokens, as if you had written 26 KA without the quotes. The autoGeneratePhraseQueries option will cause the multiple terms to be treated as if they actually were enclosed within quotes, otherwise they will be treated as separate and unquoted terms. If you do enclose 26KA in quotes in your query then autoGeneratePhraseQueries is not relevant. Ah... maybe the problem is that you have preserveOriginal=true in your query analyzer. Do you have your default query operator set to AND? If so, it would treat 26KA as 26 AND KA AND 26KA, which requires that 26KA (without the trailing dot) to be in the index. It seems counter-intuitive, but the attributes of the index and query word delimiter filters need to be slightly asymmetric. -- Jack Krupansky -Original Message- From: Thomas Michael Engelke Sent: Thursday, January 30, 2014 2:16 AM To: solr-user@lucene.apache.org Subject: Re: Not finding part of fulltext field when word ends in dot I'm not sure I got my problem across. If I understand the snippet of documentation right, autoGeneratePhraseQueries only affects queries that result in multiple tokens, which mine does not. The version also is 3.6.0.1, and we're not planning on upgrading to any 4.x version. 2014-01-29 Jack Krupansky j...@basetechnology.com You might want to add autoGeneratePhraseQueries=true to your field type, but I don't think that would cause a break when going from 3.6 to 4.x. The default for that attribute changed in Solr 3.5. What release was your data indexed using? There may have been some subtle word delimiter filter changes between 3.x and 4.x. Read: http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201202.mbox/% 3CC0551C512C863540BC59694A118452AA0764A434@ITS-EMBX-03. 
adsroot.itcs.umich.edu%3E -Original Message- From: Thomas Michael Engelke Sent: Wednesday, January 29, 2014 11:16 AM To: solr-user@lucene.apache.org Subject: Re: Not finding part of fulltext field when word ends in dot The fieldType definition is a tad on the longer side: fieldType name=text class=solr.TextField positionIncrementGap=100 analyzer type=index tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.WordDelimiterFilterFactory catenateWords=1 catenateNumbers=1 generateNumberParts=1 splitOnCaseChange=1 generateWordParts=1 catenateAll=0 preserveOriginal=1 splitOnNumerics=0 / filter class=solr.LowerCaseFilterFactory/ filter class=solr.SynonymFilterFactory synonyms=german/synonyms.txt ignoreCase=true expand=true/ filter class=solr.DictionaryCompoundWordTokenFilterFactory dictionary=german/german-common-nouns.txt minWordSize=5 minSubwordSize=4 maxSubwordSize=15 onlyLongestMatch=true / filter class=solr.StopFilterFactory words=german/stopwords.txt ignoreCase=true enablePositionIncrements=true/ filter class=solr.SnowballPorterFilterFactory language=German2 protected=german/protwords.txt/ filter class=solr.RemoveDuplicatesTokenFilterFactory/ /analyzer analyzer type=query tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.WordDelimiterFilterFactory catenateWords=0 catenateNumbers=0 generateWordParts=1 splitOnCaseChange=1 generateNumberParts=1 catenateAll=0 preserveOriginal=1 splitOnNumerics=0 / filter class=solr.LowerCaseFilterFactory/ filter class=solr.StopFilterFactory words=german/stopwords.txt ignoreCase=true enablePositionIncrements=true
Not finding part of fulltext field when word ends in dot
Hello everybody,

we have a legacy Solr installation in version 3.6.0.1. One of the indices defines a field named content as a fulltext field where a product description will reside. One of the records indexed contains the following data (excerpt):

z. B. in der Serie 26KA.

I had the problem that searching for the value 26KA didn't find anything. Using the analyzer of the administrative interface, with the full text on one hand and 26KA as the query string on the other, I can see how the search string is transformed by the filter factories in use. The WordDelimiterFilterFactory transforms the 26KA. into 26KA, which is displayed like this (excerpt):

73   74   75     76
in   der  Serie  26KA.
                 26KA

It seems that it stripped the 26KA. of the dot. Using the option to highlight matches, an analysis search of 26KA shows that the lower of the two entries matches (after reaching the LowerCaseFilterFactory). However, querying the index using the query interface doesn't show any matches. I discovered that adding an asterisk to the search seems to work, as does adding the dot.

I am puzzled by this, as I thought that the second added entry was the word actually indexed. I've tried looking up the definition of the administrative interface, but the documentation only specifies this for the latest version, where the display is different and (at least in the sample) doesn't show such duplication.

Can anybody shed some light onto this?
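The two stacked entries at position 76 are what one would expect if the filter keeps the original token and also emits a version with the trailing punctuation removed, both at the same position. The Python sketch below is a very rough model of that display only. toy_word_delimiter is hypothetical and far simpler than the real WordDelimiterFilterFactory, and it does not explain why the live query fails.

```python
import string


def toy_word_delimiter(token, preserve_original=True):
    """Rough model: strip surrounding punctuation, optionally keep the original."""
    stripped = token.strip(string.punctuation)
    tokens = [token] if preserve_original else []
    if stripped and stripped not in tokens:
        tokens.append(stripped)
    return tokens


def analyze(text):
    # Whitespace tokenizer, toy word delimiter, lowercase.
    # Tokens derived from the same word share one position (one inner list).
    return [[t.lower() for t in toy_word_delimiter(word)] for word in text.split()]


positions = analyze("in der Serie 26KA.")
print(positions)
```

In this model both 26ka. and 26ka occupy the last position, matching the duplication seen in the analysis screen.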
Re: Not finding part of fulltext field when word ends in dot
The fieldType definition is a tad on the longer side: fieldType name=text class=solr.TextField positionIncrementGap=100 analyzer type=index tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.WordDelimiterFilterFactory catenateWords=1 catenateNumbers=1 generateNumberParts=1 splitOnCaseChange=1 generateWordParts=1 catenateAll=0 preserveOriginal=1 splitOnNumerics=0 / filter class=solr.LowerCaseFilterFactory/ filter class=solr.SynonymFilterFactory synonyms=german/synonyms.txt ignoreCase=true expand=true/ filter class=solr.DictionaryCompoundWordTokenFilterFactory dictionary=german/german-common-nouns.txt minWordSize=5 minSubwordSize=4 maxSubwordSize=15 onlyLongestMatch=true / filter class=solr.StopFilterFactory words=german/stopwords.txt ignoreCase=true enablePositionIncrements=true/ filter class=solr.SnowballPorterFilterFactory language=German2 protected=german/protwords.txt/ filter class=solr.RemoveDuplicatesTokenFilterFactory/ /analyzer analyzer type=query tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.WordDelimiterFilterFactory catenateWords=0 catenateNumbers=0 generateWordParts=1 splitOnCaseChange=1 generateNumberParts=1 catenateAll=0 preserveOriginal=1 splitOnNumerics=0 / filter class=solr.LowerCaseFilterFactory/ filter class=solr.StopFilterFactory words=german/stopwords.txt ignoreCase=true enablePositionIncrements=true/ filter class=solr.SnowballPorterFilterFactory language=German2 protected=german/protwords.txt/ filter class=solr.RemoveDuplicatesTokenFilterFactory/ /analyzer /fieldType Thank you for taking a look. 2014-01-29 Jack Krupansky j...@basetechnology.com What field type and analyzer/tokenizer are you using? -- Jack Krupansky -Original Message- From: Thomas Michael Engelke Sent: Wednesday, January 29, 2014 10:45 AM To: solr-user@lucene.apache.org Subject: Not finding part of fulltext field when word ends in dot Hello everybody, we have a legacy solr installation in version 3.6.0.1. 
One of the indices defines a field named content as a fulltext field where a product description will reside. One of the records indexed contains the following data (excerpt):

z. B. in der Serie 26KA.

I had the problem that searching for the value 26KA didn't find anything. Using the analyzer of the administrative interface, with the full text on one hand and 26KA as the query string on the other, I can see how the search string is transformed by the filter factories in use. The WordDelimiterFilterFactory transforms the 26KA. into 26KA, which is displayed like this (excerpt):

73   74   75     76
in   der  Serie  26KA.
                 26KA

It seems that it stripped the 26KA. of the dot. Using the option to highlight matches, an analysis search of 26KA shows that the lower of the two entries matches (after reaching the LowerCaseFilterFactory). However, querying the index using the query interface doesn't show any matches. I discovered that adding an asterisk to the search seems to work, as does adding the dot.

I am puzzled by this, as I thought that the second added entry was the word actually indexed. I've tried looking up the definition of the administrative interface, but the documentation only specifies this for the latest version, where the display is different and (at least in the sample) doesn't show such duplication.

Can anybody shed some light onto this?
Re: Not finding part of fulltext field when word ends in dot
I'm not sure I got my problem across. If I understand the snippet of documentation right, autoGeneratePhraseQueries only affects queries that result in multiple tokens, which mine does not. The version also is 3.6.0.1, and we're not planning on upgrading to any 4.x version. 2014-01-29 Jack Krupansky j...@basetechnology.com You might want to add autoGeneratePhraseQueries=true to your field type, but I don't think that would cause a break when going from 3.6 to 4.x. The default for that attribute changed in Solr 3.5. What release was your data indexed using? There may have been some subtle word delimiter filter changes between 3.x and 4.x. Read: http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201202.mbox/% 3CC0551C512C863540BC59694A118452AA0764A434@ITS-EMBX-03. adsroot.itcs.umich.edu%3E -Original Message- From: Thomas Michael Engelke Sent: Wednesday, January 29, 2014 11:16 AM To: solr-user@lucene.apache.org Subject: Re: Not finding part of fulltext field when word ends in dot The fieldType definition is a tad on the longer side: fieldType name=text class=solr.TextField positionIncrementGap=100 analyzer type=index tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.WordDelimiterFilterFactory catenateWords=1 catenateNumbers=1 generateNumberParts=1 splitOnCaseChange=1 generateWordParts=1 catenateAll=0 preserveOriginal=1 splitOnNumerics=0 / filter class=solr.LowerCaseFilterFactory/ filter class=solr.SynonymFilterFactory synonyms=german/synonyms.txt ignoreCase=true expand=true/ filter class=solr.DictionaryCompoundWordTokenFilterFactory dictionary=german/german-common-nouns.txt minWordSize=5 minSubwordSize=4 maxSubwordSize=15 onlyLongestMatch=true / filter class=solr.StopFilterFactory words=german/stopwords.txt ignoreCase=true enablePositionIncrements=true/ filter class=solr.SnowballPorterFilterFactory language=German2 protected=german/protwords.txt/ filter class=solr.RemoveDuplicatesTokenFilterFactory/ /analyzer analyzer type=query tokenizer 
class=solr.WhitespaceTokenizerFactory/ filter class=solr.WordDelimiterFilterFactory catenateWords=0 catenateNumbers=0 generateWordParts=1 splitOnCaseChange=1 generateNumberParts=1 catenateAll=0 preserveOriginal=1 splitOnNumerics=0 / filter class=solr.LowerCaseFilterFactory/ filter class=solr.StopFilterFactory words=german/stopwords.txt ignoreCase=true enablePositionIncrements=true/ filter class=solr.SnowballPorterFilterFactory language=German2 protected=german/protwords.txt/ filter class=solr.RemoveDuplicatesTokenFilterFactory/ /analyzer /fieldType Thank you for taking a look. 2014-01-29 Jack Krupansky j...@basetechnology.com What field type and analyzer/tokenizer are you using? -- Jack Krupansky -Original Message- From: Thomas Michael Engelke Sent: Wednesday, January 29, 2014 10:45 AM To: solr-user@lucene.apache.org Subject: Not finding part of fulltext field when word ends in dot Hello everybody, we have a legacy solr installation in version 3.6.0.1. One of the indices defines a field named content as a fulltext field where a product description will reside. One of the records indexed contains the following data (excerpt): z. B. in der Serie 26KA. I had the problem that searching the value 26KA didn't find anything. Using the analyzer of the adminstrative interface and using the full text on one hand and 26KA as the query string, I can see how the search string is transformed by the used filter factories