Same score for different length matches

2017-06-30 Thread Thomas Michael Engelke
 Hey,

we have multiple documents that match the query in question
("name:hubwagen"). Thing is, some of the documents only contain the
query term as part of a longer word, while others match the "name"
field 100%:


 
 Hochhubwagen
 5.9861565
 
 Hubwagen
 5.9861565


The debug looks like this (for the first and 5th match):


 
 namhubwagnamehubwag
 
 
 name:Hubwagen
 name:Hubwagen
 name:hubwag
 name:hubwag
 
 
5.9861565 = (MATCH) weight(name:hubwag in 8093) [DefaultSimilarity],
result of:
 5.9861565 = fieldWeight in 8093, product of:
 1.0 = tf(freq=1.0), with freq of:
 1.0 = termFreq=1.0
 5.9861565 = idf(docFreq=109, maxDocs=16101)
 1.0 = fieldNorm(doc=8093)

 
5.9861565 = (MATCH) weight(name:hubwag in 9537) [DefaultSimilarity],
result of:
 5.9861565 = fieldWeight in 9537, product of:
 1.0 = tf(freq=1.0), with freq of:
 1.0 = termFreq=1.0
 5.9861565 = idf(docFreq=109, maxDocs=16101)
 1.0 = fieldNorm(doc=9537)


Now, I am decently certain that at one point in time it worked in a way
that a higher match length would rank higher. As far as I can read in
the SolrRelevancyFAQ, the correct term is "lengthNorm". However, I am
missing a preference for the full match.

Usually, the debug helps me identify mistakes, but in this case, the
debug only tells me that the scores are perfectly equal, down to the
lowest level. 
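
For readers following the explain output above, here is a minimal sketch that recomputes the printed score. It assumes the classic TF-IDF formulas usually associated with DefaultSimilarity (tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1))); the function names are illustrative and not part of Solr.

```python
import math


def classic_tf(freq: float) -> float:
    # Term frequency factor: square root of the raw term frequency.
    return math.sqrt(freq)


def classic_idf(doc_freq: int, max_docs: int) -> float:
    # Inverse document frequency as assumed for DefaultSimilarity.
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def field_weight(freq: float, doc_freq: int, max_docs: int, field_norm: float) -> float:
    # fieldWeight is the product of tf, idf and fieldNorm, as in the explain output.
    return classic_tf(freq) * classic_idf(doc_freq, max_docs) * field_norm


# Values copied from the two explain blocks (docs 8093 and 9537).
score_8093 = field_weight(1.0, 109, 16101, 1.0)
score_9537 = field_weight(1.0, 109, 16101, 1.0)
print(round(score_8093, 4), score_8093 == score_9537)
```

Every input is identical for both documents, including fieldNorm=1.0, so the scores cannot differ. Under the same assumption, lengthNorm is derived from the number of tokens in the field rather than from the number of characters, so two single-token names would get the same norm (as would a field with norms omitted).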

Suggester needed for returning suggestions when term is not start of field value

2015-08-07 Thread Thomas Michael Engelke
 Hey,

I'm playing around with the suggester component, and it works perfectly
as described: Suggestions for 'logitech mouse' include 'logitech mouse
g500' and 'logitech mouse gaming'.

However, when the words in the record supplying the suggester do not
follow each other as in the search terms, nothing is returned.
Suggestions for 'logitech mouse' do not include 'logitech g500 mouse'.

Is there a suggester implementation that can suggest records that way?

Best wishes. 
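
The behaviour described above can be sketched as the difference between matching a prefix of the whole field value and matching per token. This is a plain Python illustration with hypothetical helper names, not Solr code; Solr does ship a lookup implementation named AnalyzingInfixLookupFactory that matches on tokens inside the text, but whether it fits this setup is an assumption to verify.

```python
def prefix_match(query: str, candidate: str) -> bool:
    # The whole candidate must start with the whole query string.
    return candidate.lower().startswith(query.lower())


def token_match(query: str, candidate: str) -> bool:
    # Every complete query token must appear somewhere in the candidate;
    # the last query token only needs to be a prefix of some candidate token.
    cand_tokens = candidate.lower().split()
    query_tokens = query.lower().split()
    if not query_tokens:
        return False
    *complete, last = query_tokens
    return (all(tok in cand_tokens for tok in complete)
            and any(c.startswith(last) for c in cand_tokens))


records = ["logitech mouse g500", "logitech mouse gaming", "logitech g500 mouse"]
by_prefix = [r for r in records if prefix_match("logitech mouse", r)]
by_token = [r for r in records if token_match("logitech mouse", r)]
print(by_prefix)
print(by_token)
```

Only the token-based variant returns 'logitech g500 mouse', because it ignores the order of the words in the record.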

Re: Questions regarding autosuggest (Solr 5.2.1)

2015-06-30 Thread Thomas Michael Engelke
 God damn. Thank you.

*ashamed*

On 30.06.2015 at 00:21, Erick Erickson wrote: 

 Try not putting it in double quotes?
 
 Best,
 Erick
 
 On Mon, Jun 29, 2015 at 12:22 PM, Thomas Michael Engelke
 thomas.enge...@posteo.de wrote:
 
 

Questions regarding autosuggest (Solr 5.2.1)

2015-06-29 Thread Thomas Michael Engelke
 

 A friend and I are trying to develop some software using Solr in the
background, and that comes with a lot of changes. We're used to older
versions (4.3 and below). We especially have problems with the
autosuggest feature.

This is the field definition (schema.xml) for our autosuggest field:

<field name="autosuggest" type="autosuggest" indexed="true"
 stored="true" required="false" multiValued="true" />
...
<copyField source="name" dest="autosuggest" />
...
<fieldType name="autosuggest" class="solr.TextField"
 positionIncrementGap="100">
 <analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="0"
   splitOnNumerics="1" generateWordParts="1" generateNumberParts="1"
   catenateWords="1" catenateNumbers="0" catenateAll="0"
   preserveOriginal="0"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.StopFilterFactory" words="stopwords.txt"
   ignoreCase="true" enablePositionIncrements="true" format="snowball"/>
  <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
   dictionary="dictionary.txt" minWordSize="5" minSubwordSize="3"
   maxSubwordSize="30" onlyLongestMatch="false"/>
  <filter class="solr.GermanNormalizationFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="German2"
   protected="protwords.txt"/>
  <filter class="solr.EdgeNGramFilterFactory" minGramSize="2"
   maxGramSize="30"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
 </analyzer>
 <analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="0"
   splitOnNumerics="1" generateWordParts="1" generateNumberParts="1"
   catenateWords="1" catenateNumbers="0" catenateAll="0"
   preserveOriginal="0"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.StopFilterFactory" words="stopwords.txt"
   ignoreCase="true" enablePositionIncrements="true" format="snowball"/>
  <filter class="solr.GermanNormalizationFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="German2"
   protected="protwords.txt"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
 </analyzer>
</fieldType>

Afterwards, we defined an autosuggest component to use this field, like
this (solrconfig.xml):

searchComponent name=suggest class=solr.SuggestComponent
 lst name=suggester
 str name=namemySuggester/str
 str name=lookupImplFuzzyLookupFactory/str
 str name=storeDirsuggester_fuzzy_dir/str
 str name=dictionaryImplDocumentDictionaryFactory/str
 str name=fieldsuggest/str
 str name=suggestAnalyzerFieldTypeautosuggest/str
 str name=buildOnStartupfalse/str
 str name=buildOnCommitfalse/str
 /lst
/searchComponent

And add a requesthandler to test out the functionality:

<requestHandler name="/suggesthandler" class="solr.SearchHandler"
 startup="lazy">
 <lst name="defaults">
  <str name="suggest">true</str>
  <str name="suggest.count">10</str>
  <str name="suggest.dictionary">mySuggester</str>
 </lst>
 <arr name="components">
  <str>suggest</str>
 </arr>
</requestHandler>

However, trying to start the core that has this configuration, a long
exception occurs, telling us this:

Error in configuration: autosuggest is not defined in the schema

Now, that seems to be wrong. Any idea how to fix that? 

Problem with german hyphenated words not being found

2015-06-11 Thread Thomas Michael Engelke
 Hey,

In German, you can string most nouns together using hyphens, like
this:

Industrie = industry
Anhänger = trailer

Industrie-Anhänger = trailer for industrial use

Here [1], you can see me querying Industrieanhänger from the name
field (name:Industrieanhänger), to make sure the index actually contains
the word. Our data is structured so that products are listed without
the hyphen.

Now, customers may use the hyphenated version as a search term (e.g.
industrie-anhänger), and of course we want them to find what they are
looking for. I've set it up so that the WordDelimiterFilterFactory uses
catenateWords=1, so that these word parts are concatenated. An analysis
of Industrieanhänger as index and industrie-anhänger as query can be
seen here [2].

You can see that both word parts are found. However, querying for
industrie-anhänger does not yield results, only when the hyphen is
removed, as you can see here [3]. I'm not sure how to proceed from here,
as the results of the analysis have so far always lined up with what I
could see when querying. Here's the schema definition for text, the
field type for the name field:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100"
 autoGeneratePhraseQueries="true">
 <analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1"
   splitOnNumerics="1" generateWordParts="1" generateNumberParts="1"
   catenateWords="1" catenateNumbers="0" catenateAll="0"
   preserveOriginal="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
   dictionary="dictionary.txt" minWordSize="5" minSubwordSize="3"
   maxSubwordSize="30" onlyLongestMatch="false"/>
  <filter class="solr.StopFilterFactory" words="stopwords.txt"
   ignoreCase="true" enablePositionIncrements="true" format="snowball"/>
  <filter class="solr.GermanNormalizationFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="German2"
   protected="protwords.txt"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
 </analyzer>
 <analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1"
   splitOnNumerics="1" generateWordParts="1" generateNumberParts="1"
   catenateWords="1" catenateNumbers="0" catenateAll="0"
   preserveOriginal="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
   dictionary="dictionary.txt" minWordSize="5" minSubwordSize="3"
   maxSubwordSize="30" onlyLongestMatch="false"/> -->
  <filter class="solr.StopFilterFactory" words="stopwords.txt"
   ignoreCase="true" enablePositionIncrements="true" format="snowball"/>
  <filter class="solr.GermanNormalizationFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="German2"
   protected="protwords.txt"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
 </analyzer>
</fieldType>

I've also thought it might be a problem with URL encoding not encoding
the hyphen, but replacing it with %2D didn't change the outcome (and was
probably wrong anyway).

Any help is greatly appreciated. 

Links:
--
[1] http://imgur.com/2oEC5vz
[2] http://i.imgur.com/H0AhEsF.png
[3] http://imgur.com/dzmMe7t


Re: Problem with german hyphenated words not being found

2015-06-11 Thread Thomas Michael Engelke
 Thank you for your input. Here's how the query looks with
debugQuery=true:

rawquerystring: name:industrie-anhänger,
 querystring: name:industrie-anhänger,
 parsedquery: MultiPhraseQuery(name:(industrie-anhang industri)
(anhang industrieanhang)),
 parsedquery_toString: name:(industrie-anhang industri) (anhang
industrieanhang),

 It looks like there are some rules applied, expressed by the braces.
What's the correct interpretation of that? The default operator is OR,
yet this looks like the terms inside the braces group using AND.
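
One way to read the parsed query: in a MultiPhraseQuery, each parenthesised group lists alternative terms for one token position (any one of them may match), and the groups must be found at consecutive positions, like a phrase. The sketch below simulates that rule in plain Python. The index-side positions are an assumption made for illustration (all terms produced from the unhyphenated Industrieanhänger stacked on one position), and the function name is hypothetical.

```python
from typing import Dict, List, Set


def multi_phrase_match(query_positions: List[Set[str]],
                       doc_positions: Dict[int, Set[str]]) -> bool:
    # A document matches if, for some start position, every query position i
    # shares at least one term with the document terms at start + i.
    for start in doc_positions:
        if all(query_positions[i] & doc_positions.get(start + i, set())
               for i in range(len(query_positions))):
            return True
    return False


# Parsed query from the debug output: two consecutive positions.
query = [{"industrie-anhang", "industri"}, {"anhang", "industrieanhang"}]

# Assumed index-side layout: all terms share a single position.
doc_stacked = {0: {"industrieanhang", "industri", "anhang"}}
# Hypothetical layout in which the two word parts occupy consecutive positions.
doc_split = {0: {"industri"}, 1: {"anhang"}}

print(multi_phrase_match(query, doc_stacked))
print(multi_phrase_match(query, doc_split))
```

Under these assumptions the stacked layout cannot satisfy the second position, which would explain why the hyphenated query finds nothing even though the analysis screen shows the individual terms lining up.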

On 11.06.2015 at 12:40, Upayavira wrote: 

 The next thing to do is add debugQuery=true to your URL (or enable it in
 the query pane of the admin UI). Then look for the parsed query info.
 
 On the standard text_en field which includes an English stop word
 filter, I ran a query on Jack and Jill's House which showed
 this output:
 
 rawquerystring: text_en:(Jack and Jill's House), querystring:
 text_en:(Jack and Jill's House), parsedquery: text_en:jack
 text_en:jill text_en:hous, parsedquery_toString: text_en:jack
 text_en:jill text_en:hous,
 
 You can see that the parsed query is formed *after* analysis, so you can
 see exactly what is being queried for.
 
 Also, as a corollary to this, you can use the schema browser (or
 faceting for that matter) to view what terms are being indexed, to see
 if they should match.
 
 HTH
 
 Upayavira
 
 On 11.06.2015 at 12:00, Upayavira wrote:
 Have you used the analysis tab in the admin UI? You can type in

sentences for both index and query time and see how they would be
analysed by various fields/field types.

Once you have got index time and query time to result in the same tokens
at the end of the analysis chain, you should start seeing matches in
your queries.

Upayavira

On Thu, Jun 11, 2015, at 10:26 AM, Thomas Michael Engelke wrote:


Solr: Elevate with complex query specifying field names

2015-05-31 Thread Thomas Michael Engelke
 

I have Solr as the backend to an ECommerce solution where the fields can
be configured to be searchable, which generates a schema.xml and loads
it into Solr. 

We also allow configuring a Solr search weight per field to affect
queries, so my queries usually look something like this: 

spellcheck=true&fl=entity_id,score&hl.snippets=1&start=0&q=ean:test+name:test^10.00+persartnr:test^5.00+persartnr_direct:test+short_description:test&spellcheck.q=test&spellcheck.build=true=true&hl.simple.pre=<span+class%3Dhighlight>&hl.simple.post=</span>&json.nl=map&hl.fl=name,short_description&wt=json&spellcheck.collate=true&hl=true&rows=1000

Now, I want to add query elevation to my mix. I got it to work pretty
flawlessly, however, I'm not sure how to get it to work with my queries
as they specifically state field names and especially boosts on a
regular basis. 

This works and gets elevated when queried as q=test: 

<elevate>
 <query text="test">
  <doc id="14153" />
 </query>
</elevate>

However, when queried as q=name:test^10.00, this elevation does not
work/doesn't elevate. 

Is there a way around that? Can I specify the naked query somehow for
the elevation component? 
 

Re: Suggester not suggesting anything using DictionaryCompoundWordTokenFilterFactory

2014-11-11 Thread Thomas Michael Engelke
 I think I found the problem. The definition of the suggester component
has a "field" option which references the field that the suggester uses
to generate suggestions. Changing this to the field using the
DictionaryCompoundWordTokenFilterFactory makes it suggest word parts as
well.

On 11.11.2014 at 08:52, Thomas Michael Engelke wrote: 

 I'm toying around with the suggester component, like described here: 
 http://www.andornot.com/blog/post/Advanced-autocomplete-with-Solr-Ngrams-and-Twitters-typeaheadjs.aspx
  [1]
 
 So I made 4 fields:
 
 field name=text_suggest type=text_suggest indexed=true stored=true 
 multiValued=true /
 copyField source=name dest=text_suggest /
 field name=text_suggest_edge type=text_suggest_edge indexed=true 
 stored=true multiValued=true /
 copyField source=name dest=text_suggest_edge /
 field name=text_suggest_ngram type=text_suggest_ngram indexed=true 
 stored=true multiValued=true /
 copyField source=name dest=text_suggest_ngram /
 field name=text_suggest_dictionary_ngram 
 type=text_suggest_dictionary_ngram indexed=true stored=true 
 multiValued=true /
 copyField source=name dest=text_suggest_dictionary_ngram /
 
 with the corresponding definitions:
 
 fieldType name=text_suggest class=solr.TextField
 analyzer
 tokenizer class=solr.KeywordTokenizerFactory /
 filter class=solr.LowerCaseFilterFactory /
 /analyzer
 /fieldType
 fieldType name=text_suggest_edge class=solr.TextField
 analyzer
 tokenizer class=solr.KeywordTokenizerFactory /
 filter class=solr.LowerCaseFilterFactory /
 filter class=solr.EdgeNGramFilterFactory minGramSize=1 maxGramSize=50 
 side=front /
 /analyzer
 /fieldType
 fieldType name=text_suggest_ngram class=solr.TextField
 analyzer
 tokenizer class=solr.StandardTokenizerFactory/
 filter class=solr.LowerCaseFilterFactory /
 filter class=solr.EdgeNGramFilterFactory minGramSize=1 maxGramSize=50 
 side=front /
 filter class=solr.RemoveDuplicatesTokenFilterFactory/
 /analyzer
 /fieldType
 fieldType name=text_suggest_dictionary_ngram class=solr.TextField
 analyzer type=index
 tokenizer class=solr.StandardTokenizerFactory/
 filter class=solr.LowerCaseFilterFactory /
 filter class=solr.DictionaryCompoundWordTokenFilterFactory 
 dictionary=dictionary.txt minWordSize=5 minSubwordSize=3 
 maxSubwordSize=30 onlyLongestMatch=false/
 filter class=solr.EdgeNGramFilterFactory minGramSize=1 maxGramSize=50 
 side=front /
 filter class=solr.RemoveDuplicatesTokenFilterFactory/
 /analyzer
 analyzer type=query
 tokenizer class=solr.StandardTokenizerFactory/
 filter class=solr.LowerCaseFilterFactory /
 /analyzer
 /fieldType
 
 I'm calling the suggester component this way:
 
 http://address:8983/solr/core/suggest?qf=text_suggest^6.0%20test_suggest_edge^3.0%20text_suggest_ngram^1.0%20text_suggest_dictionary_ngram^0.2q=wa
 
 This seems to work fine:
 
 response
 lst name=responseHeader
 int name=status0/int
 int name=QTime0/int
 /lst
 lst name=spellcheck
 lst name=suggestions
 lst name=wa
 int name=numFound5/int
 int name=startOffset0/int
 int name=endOffset2/int
 arr name=suggestion
 strwandelement aus gitter/str
 strwandelement aus stahlblech/str
 strwandelement/str
 strwandhalter für prospekte/str
 strwandascher, h 300 × b 230 × t 60 mm/str
 /arr
 /lst
 str name=collation(wandelement aus gitter)/str
 /lst
 /lst
 /response
 
 However, I added the fourth field so I could get low-boosted suggestions 
 using the afformentioned DictionaryCompoundWordTokenFilterFactory. A sample 
 analysis for the field(type) text_suggest_dictionary_ngram for the word 
 Geländewagen:
 
 g
 ge
 gel
 gelä
 gelän
 geländ
 gelände
 geländew
 geländewa
 geländewag
 geländewage
 geländewagen
 g
 ge
 gel
 gelä
 gelän
 geländ
 gelände
 w
 wa
 wag
 wage
 wagen
 
 As we can see, the DictionaryCompoundWordTokenFilterFactory extracts the word 
 wagen and EdgeNGrams it. However, I cannot get results from these NGrams. 
 Trying wag as the search term for the suggester, there are no results.
 
 However, doing an analysis of Geländewagen (as field value index) and wag 
 (as field value query), analysis shows a match.
 
 I had the thought that it might be because the underlying component of the 
 suggester is a spellchecker, and a spellchecker wouldn't correct wag to 
 wagen because there was an NGram that spelled wag, and so the word was 
 spelled correctly already. So I tried without the EdgeNGrams, but the result 
 stays the same.
 

Links:
--
[1]
http://www.andornot.com/blog/post/Advanced-autocomplete-with-Solr-Ngrams-and-Twitters-typeaheadjs.aspx

How to suggest from multiple fields?

2014-11-11 Thread Thomas Michael Engelke
Like in this article 
(http://www.andornot.com/blog/post/Advanced-autocomplete-with-Solr-Ngrams-and-Twitters-typeaheadjs.aspx), 
I am using multiple fields to generate different options for an 
autosuggest functionality:


- First, the whole field (top priority)
- Then, the whole field as EdgeNGrams from the left side (normal
  priority)
- Lastly, single words or word parts (compound words) as EdgeNGrams

However, I was not very successful in supplying a single requestHandler 
(/suggest) with data from multiple suggesters. I have also not been 
able to find any sample of how this might be done correctly.


Is there a sample that I can read, or a documentation of how this might 
be done? The referenced article was doing it, yet only marginally 
described the technical implementation.
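
As one possible direction, the prioritisation described above could also be done on the client side by querying each suggestion source separately and merging the results. This is only a sketch under that assumption, not a Solr feature; the function name, the weights and the sample suggestion lists are made up for illustration.

```python
from typing import Dict, List, Tuple


def merge_suggestions(sources: List[Tuple[float, List[str]]]) -> List[str]:
    # Keep each suggestion once, ranked by the highest source weight that
    # produced it, then by its rank inside that source.
    best: Dict[str, Tuple[float, int]] = {}
    for weight, suggestions in sources:
        for rank, text in enumerate(suggestions):
            key = (-weight, rank)
            if text not in best or key < best[text]:
                best[text] = key
    return sorted(best, key=lambda text: best[text])


# Hypothetical results from three sources: whole field, edge n-grams of the
# whole field, and edge n-grams of single words.
sources = [
    (6.0, ["wandelement"]),
    (3.0, ["wandelement aus gitter", "wandelement"]),
    (1.0, ["wandhalter für prospekte", "wandelement aus gitter"]),
]
merged = merge_suggestions(sources)
print(merged)
```

The design choice here is that a suggestion found by a higher-priority source keeps that priority even if lower-priority sources return it again.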


Re: Best practice: Autosuggest/autocomplete vs. real search

2014-11-10 Thread Thomas Michael Engelke
 The dedicated autosuggest field is not used by a suggester component;
instead, we just query it directly (/select). I'm trying to read my way
into how the suggesters work, and toying around with some configurations
(for instance from here:
http://www.andornot.com/blog/post/Advanced-autocomplete-with-Solr-Ngrams-and-Twitters-typeaheadjs.aspx).

Compared to how you can analyze search results through the Solr backend,
the analysis of suggester results seems to be sorely lacking.

On 10.11.2014 at 14:37, Michael Sokolov wrote: 

 The goal is to ensure that suggestions from autocomplete are actually terms 
 in the main index, so that the suggestions will actually result in matches. 
 You've considered expanding the main index by adding the suggestion n-grams 
 to it, but it would probably be better to alter your suggester so that it 
 produces only tokens that are in the main index. I think this is basically 
 how all the Suggester implementations are designed to work already; are you 
 using one of those, or are you using the TermsComponent, or something else?
 
 -Mike
 
 On 11/10/14 2:54 AM, Thomas Michael Engelke wrote:
 
 

Suggester not suggesting anything using DictionaryCompoundWordTokenFilterFactory

2014-11-10 Thread Thomas Michael Engelke
I'm toying around with the suggester component, like described here: 
http://www.andornot.com/blog/post/Advanced-autocomplete-with-Solr-Ngrams-and-Twitters-typeaheadjs.aspx


So I made 4 fields:

 <field name="text_suggest" type="text_suggest" indexed="true"
  stored="true" multiValued="true" />
 <copyField source="name" dest="text_suggest" />
 <field name="text_suggest_edge" type="text_suggest_edge" indexed="true"
  stored="true" multiValued="true" />
 <copyField source="name" dest="text_suggest_edge" />
 <field name="text_suggest_ngram" type="text_suggest_ngram"
  indexed="true" stored="true" multiValued="true" />
 <copyField source="name" dest="text_suggest_ngram" />
 <field name="text_suggest_dictionary_ngram"
  type="text_suggest_dictionary_ngram" indexed="true" stored="true"
  multiValued="true" />
 <copyField source="name" dest="text_suggest_dictionary_ngram" />

with the corresponding definitions:

 <fieldType name="text_suggest" class="solr.TextField">
  <analyzer>
   <tokenizer class="solr.KeywordTokenizerFactory" />
   <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
 </fieldType>
 <fieldType name="text_suggest_edge" class="solr.TextField">
  <analyzer>
   <tokenizer class="solr.KeywordTokenizerFactory" />
   <filter class="solr.LowerCaseFilterFactory" />
   <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
    maxGramSize="50" side="front" />
  </analyzer>
 </fieldType>
 <fieldType name="text_suggest_ngram" class="solr.TextField">
  <analyzer>
   <tokenizer class="solr.StandardTokenizerFactory"/>
   <filter class="solr.LowerCaseFilterFactory" />
   <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
    maxGramSize="50" side="front" />
   <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
 </fieldType>
 <fieldType name="text_suggest_dictionary_ngram" class="solr.TextField">
  <analyzer type="index">
   <tokenizer class="solr.StandardTokenizerFactory"/>
   <filter class="solr.LowerCaseFilterFactory" />
   <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
    dictionary="dictionary.txt" minWordSize="5" minSubwordSize="3"
    maxSubwordSize="30" onlyLongestMatch="false"/>
   <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
    maxGramSize="50" side="front" />
   <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
   <tokenizer class="solr.StandardTokenizerFactory"/>
   <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
 </fieldType>

I'm calling the suggester component this way:

http://address:8983/solr/core/suggest?qf=text_suggest^6.0%20test_suggest_edge^3.0%20text_suggest_ngram^1.0%20text_suggest_dictionary_ngram^0.2&q=wa

This seems to work fine:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
  </lst>
  <lst name="spellcheck">
    <lst name="suggestions">
      <lst name="wa">
        <int name="numFound">5</int>
        <int name="startOffset">0</int>
        <int name="endOffset">2</int>
        <arr name="suggestion">
          <str>wandelement aus gitter</str>
          <str>wandelement aus stahlblech</str>
          <str>wandelement</str>
          <str>wandhalter für prospekte</str>
          <str>wandascher, h 300 × b 230 × t 60 mm</str>
        </arr>
      </lst>
      <str name="collation">(wandelement aus gitter)</str>
    </lst>
  </lst>
</response>

However, I added the fourth field so I could get low-boosted suggestions 
using the aforementioned DictionaryCompoundWordTokenFilterFactory. A 
sample analysis for the field (type) text_suggest_dictionary_ngram for 
the word Geländewagen:


g
ge
gel
gelä
gelän
geländ
gelände
geländew
geländewa
geländewag
geländewage
geländewagen
g
ge
gel
gelä
gelän
geländ
gelände
w
wa
wag
wage
wagen

As we can see, the DictionaryCompoundWordTokenFilterFactory extracts the 
word wagen and EdgeNGrams it. However, I cannot get results from these 
NGrams. Trying wag as the search term for the suggester, there are no 
results.
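
The token list above can be reproduced with a simplified model of the index-time chain: dictionary-based decompounding followed by front edge n-grams. This is a sketch only; the two dictionary entries are an assumption (the real dictionary.txt is not shown), the function names are illustrative, and the real filters handle more cases than this.

```python
from typing import List, Set


def decompound(word: str, dictionary: Set[str], min_word_size: int = 5,
               min_subword_size: int = 3, max_subword_size: int = 30) -> List[str]:
    # Keep the original token and add every dictionary word found inside it.
    tokens = [word]
    if len(word) < min_word_size:
        return tokens
    for start in range(len(word) - min_subword_size + 1):
        for length in range(min_subword_size, max_subword_size + 1):
            if start + length > len(word):
                break
            part = word[start:start + length]
            if part in dictionary:
                tokens.append(part)
    return tokens


def edge_ngrams(token: str, min_gram: int = 1, max_gram: int = 50) -> List[str]:
    # Front edge n-grams: every prefix between min_gram and max_gram characters.
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]


dictionary = {"gelände", "wagen"}  # assumed dictionary entries
tokens = decompound("geländewagen", dictionary)
grams = [gram for token in tokens for gram in edge_ngrams(token)]
print(tokens)
print(len(grams), "wag" in grams)
```

With these assumptions the chain yields the same 24 grams as the analysis listing, including wag, which is consistent with the analysis screen reporting a match for that term.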


However, doing an analysis of Geländewagen (as field value index) and 
wag (as field value query), analysis shows a match.


I had the thought that it might be because the underlying component of 
the suggester is a spellchecker, and a spellchecker wouldn't correct 
wag to wagen because there was an NGram that spelled wag, and so 
the word was spelled correctly already. So I tried without the 
EdgeNGrams, but the result stays the same.


Best practice: Autosuggest/autocomplete vs. real search

2014-11-09 Thread Thomas Michael Engelke
 

 We're using Solr as a backend for an ECommerce site/system. The Solr
index stores products with selected attributes, as well as a dedicated
field for autocomplete suggestions (Done via AJAX request when typing in
the search box without pressing return).

The autosuggest field is supplied by copyField directives from certain
select product attribute fields (description and/or name mostly). It
uses EdgeNGramFilterFactory to complete words not yet typed completely,
and it works quite well.

However, we come across an issue with a disconnect between the
autosuggest results and results of a normal search, that is, a query
over the full fields of the product. Let's say there are products that
are called motor.

- When autosuggesting, typing mot autosuggests all products with
motor, because the EdgeNGram created m, mo, mot, moto and
motor, respectively, and it matches.
- When searching for mot, however (i.e. pressing enter when seeing the
autosuggestions), it doesn't find any products. The autosuggest field is
not part of the real search, and no product attribute contains mot
as a word.
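
The disconnect in the two points above can be sketched with two tiny term sets, one for the autosuggest field and one for the regular search field. The names and the single product term are illustrative assumptions, not the actual index contents.

```python
from typing import List


def edge_ngrams(token: str, min_gram: int = 1, max_gram: int = 50) -> List[str]:
    # Front edge n-grams: every prefix between min_gram and max_gram characters.
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]


# Terms indexed for the autosuggest field versus the regular search field.
autosuggest_terms = set(edge_ngrams("motor"))
search_terms = {"motor"}

typed = "mot"
print(typed in autosuggest_terms)  # the autosuggest field matches the prefix
print(typed in search_terms)       # the regular search field does not

# One way to keep both in sync: search for a complete term that the
# suggestion points to instead of the typed prefix.
completions = sorted(term for term in search_terms if term.startswith(typed))
print(completions)
```

Submitting one of the completed terms rather than the raw input means every suggestion shown is guaranteed to have hits in the real search.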

One obvious solution would be to incorporate the autosuggest field
into the real search. However, this adds many tokens to the index that
aren't really part of the products indexed and makes for strange search
results, for example when an NGram is also a word, but the record itself
contains the search term only as part of a word.

Are there clever solutions to this problem? 

Autosuggest using EdgeNGrams with strange highlighting

2014-11-07 Thread Thomas Michael Engelke
We've moved from an asterisk based autosuggest functionality 
(searchterm*) to a version using a special field called autosuggest, 
filled via copyField directives. The field definition:


<fieldType name="autosuggest" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true" enablePositionIncrements="true" format="snowball"/>
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="dictionary.txt" minWordSize="5" minSubwordSize="3" maxSubwordSize="30" onlyLongestMatch="false"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2" protected="protwords.txt"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true" enablePositionIncrements="true" format="snowball"/>
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="dictionary.txt" minWordSize="5" minSubwordSize="3" maxSubwordSize="30" onlyLongestMatch="false"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

It works like a charm. Now, we've had highlighting from Solr before, 
using these parameters:


hl=true&hl.simple.pre=<span+class%3Dhighlight>&hl.snippets=1&hl.simple.post=</span>&spellcheck=true&hl.fl=description

Now, we've seen something strange. This is just an example, the problem 
is with more than this record. In this example, the autosuggest field 
contains:


2CV4 Spot, Dekorsatz, für 2CV.

However, the highlighting branch for this autosuggest field in the 
record looks like this:


<lst name="highlighting">
  <lst name="34725">
    <arr name="short_description">
      <str>2CV4 Spot, Dekorsatz, für <em>2CV</em>.</str>
    </arr>
  </lst>
  ...

Although the EdgeNGramFilterFactory generated the NGrams for "2CV4"
(2C, 2CV, 2CV4, given minGramSize=2), the term itself is not highlighted.
Shouldn't it be? It's not a question of the number of highlights: records
containing multiple occurrences of "2CV" get highlighted multiple times
with no problems.


It seems that words only containing parts of the search term which match 
the EdgeNGrams are not highlighted. As we're using highlighting from 
Solr exclusively, this leads to records being found, but having no 
highlight at all.


Re: Weird Problem (possible bug?) with german stemming and wildcard search

2014-10-15 Thread Thomas Michael Engelke

Thank you very much,

this information is worth its weight in gold. So far, we've used the
asterisk method because it seemed logical and straightforward. We will
slowly migrate to a version using EdgeNGramFilterFactory.


Thanks a bunch.

On 07.10.2014 14:42, Alexandre Rafalovitch wrote:


On 7 October 2014 08:25, Thomas Michael Engelke
thomas.enge...@posteo.de wrote:

So the culprit is the asterisk at the end. As far as we can read from 
the docs, an asterisk is just 0 or more characters, which means that 
the literal word in front of the asterisk should match the query.


Not quite: http://wiki.apache.org/solr/MultitermQueryAnalysis [1]

It's actually quite complicated and even depends on the exact version of
Solr you are using. In fact, out of all the analyzers you showed
above, I think only LowerCase will be present on the chain. Look for
the (multi) marker at: http://www.solr-start.com/info/analyzers/ [2] for
more details.

On a higher level, I would suggest getting away from *-based expansion
and looking at EdgeNGrams instead. You can see an example of
autocomplete at
http://www.solr-start.com/javadoc/solr-lucene/index.html [3] and the
matching configuration at:
https://github.com/arafalov/Solr-Javadoc/blob/master/JavadocIndex/JavadocCollection/conf/schema.xml#L24 
[4]


Or a dedicated Suggester module, though information on that is a bit
harder to find.

Regards,
Alex.

Personal: http://www.outerthoughts.com/ [5] and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ [6] and 
@solrstart
Solr popularizers community: 
https://www.linkedin.com/groups?gid=6713853 [7]



Links:
--
[1] http://wiki.apache.org/solr/MultitermQueryAnalysis
[2] http://www.solr-start.com/info/analyzers/
[3] http://www.solr-start.com/javadoc/solr-lucene/index.html
[4] 
https://github.com/arafalov/Solr-Javadoc/blob/master/JavadocIndex/JavadocCollection/conf/schema.xml#L24

[5] http://www.outerthoughts.com/
[6] http://www.solr-start.com/
[7] https://www.linkedin.com/groups?gid=6713853


Weird Problem (possible bug?) with german stemming and wildcard search

2014-10-07 Thread Thomas Michael Engelke

I have a problem with a stemmed german field. The field definition:

<field name="description" type="text_splitting" indexed="true" stored="true" required="false" multiValued="false"/>

...
<fieldType name="text_splitting" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

When we search for a word from an autosuggest kind of component, we 
always add an asterisk to a word, so when somebody enters something like 
Radbremszylinder and waits for some milliseconds, the autosuggest list 
is filled with the results of searching for Radbremszylinder*. This 
seemed to work quite well. Today we got a bug report from a customer for 
that exact word.


So I made an analysis for the word as Field value (index) and Field 
value (query), and it looked like this:


Index                      Query
ST   Radbremszylinder      WT   Radbremszylinder*
SF   Radbremszylinder      SF   Radbremszylinder*
WDF  Radbremszylinder      SF   Radbremszylinder*
LCF  radbremszylinder      WDF  Radbremszylinder
SKMF radbremszylinder      LCF  radbremszylinder
PSF  radbremszylind        SKMF radbremszylinder

As you can see, the end result looks very much alike. However, records 
containing that word in their description field aren't reported as 
results. Strangely enough, records containing Radbremszylindern 
(plural) are reported as results. Removing the asterisk from the end 
reports all records with Radbremszylinder, just as we would expect. So 
the culprit is the asterisk at the end. As far as we can read from the 
docs, an asterisk is just 0 or more characters, which means that the 
literal word in front of the asterisk should match the query.


Searching further we tried some variations, and it seems that searching 
for Radbremszylind* works. All records with any variation 
(Radbremszylinder, Radbremszylindern) are reported. So maybe there's 
a weird interaction with stemming?


Any ideas?
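
One way to read the behaviour described above is that the wildcard query is compared against the stemmed index terms without being stemmed itself. The Python sketch below models only that idea. The stemmed form radbremszylind is taken from the analysis output; that the plural stays unstemmed is an assumption that happens to fit the observed results, and prefix_search is a made-up helper, not a Solr API.

```python
# Index terms after index-time analysis, mapped to the word they came from.
# "radbremszylind" is the stemmed form shown in the analysis output;
# the unstemmed plural is an assumption consistent with the observed hits.
indexed_terms = {
    "radbremszylind": "Radbremszylinder",
    "radbremszylindern": "Radbremszylindern",
}


def prefix_search(query):
    """Match a trailing-asterisk query against the raw index terms.

    Only lowercasing is applied to the query; it is not stemmed.
    """
    prefix = query.rstrip("*").lower()
    return sorted(
        source for term, source in indexed_terms.items()
        if term.startswith(prefix)
    )


print(prefix_search("Radbremszylinder*"))  # ['Radbremszylindern']
print(prefix_search("Radbremszylind*"))    # ['Radbremszylinder', 'Radbremszylindern']
```

Under this model the singular is missed because its index term is shorter than the unstemmed prefix, which is the same pattern as in the report.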


RE: Solr Spellcheck suggestions only return from /select handler when returning search results

2014-09-11 Thread Thomas Michael Engelke
 Hi James, hi list,

I can confirm the existence of data that's within
1 Levenshtein step from ichtscheiben:

{
  "responseHeader": {
    "status": 0,
    "QTime": 0,
    "params": {
      "fl": "name,spell",
      "indent": "true",
      "q": "name:Sichtscheiben",
      "_": "1410423419758",
      "wt": "json",
      "rows": "50"
    }
  },
  "response": {
    "numFound": 6,
    "start": 0,
    "docs": [
      { "name": "Sichtscheiben", "spell": "Sichtscheiben" },
      { "name": "Sichtscheiben", "spell": "Sichtscheiben" },
      { "name": "Sichtscheiben", "spell": "Sichtscheiben" },
      { "name": "Sichtscheiben", "spell": "Sichtscheiben" },
      { "name": "Sichtscheiben", "spell": "Sichtscheiben" },
      { "name": "Sichtscheiben", "spell": "Sichtscheiben" }
    ]
  }
}

Multiple records exist that should match.

The note for alternativeTermCount is appreciated.

I've tried another term: Transport. I get suggestions when I use
Transpor and Transpo, even Transpotr, but ransport doesn't yield any
suggestions. Maybe it's a question of the beginning of a word and has
not really anything to do with stemming.
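
One possible explanation for the "beginning of a word" pattern, offered as an assumption rather than a confirmed diagnosis: to my knowledge DirectSolrSpellChecker has a minPrefix setting (default 1) that requires candidates to share their first character(s) with the query term. The Python sketch below is a simplified model of that rule, not Lucene's implementation; the helper names and the one-word dictionaries are made up for illustration.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def suggest(term, dictionary, max_edits=2, min_prefix=1):
    """Keep words within max_edits that share the first min_prefix characters."""
    prefix = term[:min_prefix]
    return [
        word for word in dictionary
        if word.startswith(prefix) and 0 < levenshtein(term, word) <= max_edits
    ]


print(suggest("Transpotr", ["Transport"]))               # ['Transport']
print(suggest("ransport", ["Transport"]))                # []
print(suggest("ransport", ["Transport"], min_prefix=0))  # ['Transport']
print(levenshtein("ichtscheiben", "Sichtscheiben"))      # 1
```

If this is indeed the cause, the edit distance is not the limiting factor: a missing first letter is only one edit away, but the candidate is filtered out before distances are compared.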

On 10.09.2014 15:19, Dyer, James wrote:

 Thomas,

 It looks like you've set things up correctly in that while the user is
 searching against a stemmed field (name), spellcheck is checking
 against a lightly-analyzed copy of it (spell). This is the right way to
 do it, as spellcheck against stemmed forms is usually undesirable.

 But as you've experienced, you will sometimes get results (due to
 stemming) and also suggestions (because the spellchecker is looking at
 unstemmed forms). If you do not want spellcheck to return anything when
 you get results, you can set spellcheck.maxResultsForSuggest=0.

 Now, keeping in mind we're comparing unstemmed forms, can you verify you
 indeed have something in your index that is within 2 edits of
 ichtscheiben? My guess is you probably don't, which would be why you
 do not get spelling results in that case.

 Also, even if you do have something within 2 edits, if ichtscheiben
 occurs in your index, by default it won't try to correct it at all
 (even if the query returns nothing, maybe because of filters or other
 required terms on the query). In this case you need to set
 spellcheck.alternativeTermCount to a non-zero value (try maybe 5).

 See
 http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.alternativeTermCount
 [1] and following sections.

 James Dyer
 Ingram Content Group
 (615) 213-4311
 

Solr Spellcheck suggestions only return from /select handler when returning search results

2014-09-10 Thread Thomas Michael Engelke
 Hi,

I'm experimenting with the Spellcheck component and have therefore
used the example configuration for spell checking to try things out. My
solrconfig.xml looks like this:

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">spell</str>
  <!-- Multiple Spell Checkers can be declared and used by this component -->
  <!-- a spellchecker built from a field of the main index -->
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">spell</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <!-- the spellcheck distance measure used, the default is the internal levenshtein -->
    <str name="distanceMeasure">internal</str>
    <!-- uncomment this to require suggestions to occur in 1% of the documents
      <float name="thresholdTokenFrequency">.01</float>
    -->
  </lst>
  <!-- a spellchecker that can break or combine words. See /spell handler below for usage -->
  <lst name="spellchecker">
    <str name="name">wordbreak</str>
    <str name="classname">solr.WordBreakSolrSpellChecker</str>
    <str name="field">spell</str>
    <str name="combineWords">true</str>
    <str name="breakWords">true</str>
    <int name="maxChanges">10</int>
  </lst>
</searchComponent>

And I've added the spellcheck component to my /select request handler:

<requestHandler name="/select" class="solr.SearchHandler">
  ...
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>

I have built up the spellchecker source in the schema.xml from the name
field:

<field name="spell" type="spell" indexed="true" stored="true" required="false" multiValued="false"/>
<copyField source="name" dest="spell" maxChars="3" />
...
<fieldType name="spell" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>

As I'm querying the /select request handler, I should get spellcheck
suggestions with my results. However, I rarely get a suggestion. Examples:

query: Sichtscheibe, spellcheck suggestion: Sichtscheiben (works)
query: Sichtscheib, spellcheck suggestion: Sichtscheiben (works)
query: ichtscheiben, no spellcheck suggestions

As far as I can identify, I only get suggestions when I get real search
results. I get results for the first 2 examples, because the German
StemFilterFactory translates Sichtscheibe and Sichtscheiben into
Sichtscheib, so there are matches found. However, the third query
should result in a suggestion, as the Levenshtein distance is less than
in the second example.

Suggestions, improvements, corrections?

 

Solr spellcheck returns more than 1 word for a 1 word spellcheck

2014-09-01 Thread Thomas Michael Engelke
 I'm in the process of incorporating Solr spellchecking in our product.
For that, I've created a new field:

<field name="spell" type="spell" indexed="true" stored="true" required="false" multiValued="false"/>
<copyField source="name" dest="spell" maxChars="3" />

And in the fieldType definitions:

<fieldType name="spell" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

Then I feed the names of products into the corresponding core. They can
have a lot of words (examples):

 door lock rear left
 Door brake, door in front + rear fitting.

However, the names get pretty long, and in the source data, they have
been truncated. This sometimes leaves parts of words at the end:

 The water pump can evacuate some coo

I have created a spellcheck component, feeding off the `spell` field
defined earlier. Now for the problem.

Sometimes, when I look up a slightly misspelled word, I get results I do
not expect. Example request:

 http://solr.url:8983/solr/en/spell?q=coole

This is (part of) the response:

 <str name="word">cooler</str><int name="freq">21</int>
 <str name="word">coo le</str><int name="freq">2</int>
 <str name="word">cable</str><int name="freq">334</int>
 <str name="word">co o le</str><int name="freq">4</int>
 [...]

Now, as you can see, the misspelled `coole` should have been `cooler`,
and it's the first suggestion. However, the second and fourth suggestions
baffle me. After a bit of research, I found these to be multiple words
clunked together. As I described above, `coo` was a part of a name that
was truncated. I found `co` the same way, and the source data contains a
small number of `o` characters on their own (product number names).

Now, my question is: Why is Solr suggesting multiple words pasted
together for a spellcheck for a single word? Is there a way to prevent
Solr from pasting together word parts to forge suggestions?
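
For illustration only, here is how a word-breaking spellchecker could arrive at such suggestions. This Python sketch assumes the /spell handler includes a word-break checker and that `le` also exists as a term in the spell field; neither is confirmed by the message above, and word_breaks is a made-up helper, not Solr code.

```python
def word_breaks(term, dictionary, max_parts=3):
    """Return every way to split term into 2..max_parts dictionary words."""
    results = []

    def recurse(rest, parts):
        if not rest:
            # Only report real splits, not the unsplit word itself.
            if len(parts) > 1:
                results.append(" ".join(parts))
            return
        if len(parts) == max_parts:
            return
        for i in range(1, len(rest) + 1):
            head = rest[:i]
            if head in dictionary:
                recurse(rest[i:], parts + [head])

    recurse(term, [])
    return results


# "coo", "co" and "o" are fragments reported in the source data;
# "le" is a hypothetical extra fragment needed to complete the split.
dictionary = {"coo", "co", "o", "le", "cooler", "cable"}

print(word_breaks("coole", dictionary))  # ['co o le', 'coo le']
```

In this model the multi-word suggestions exist only because truncated fragments were indexed as terms of their own, so cleaning those fragments out of the spell field would remove them as well.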
 

Re: Ranking based on match position in field

2014-07-31 Thread Thomas Michael Engelke
 Hi,

thanks for the link. I've upgraded from the 4.7 version we were using to
the recent 4.9 version. I've tried to use the new feature with this query
in the admin interface using edismax:

description:Kühler^~1^5

However, the result seems to stay the same:

<lst name="debug">
  <str name="rawquerystring">description:Kühler~1^5</str>
  <str name="querystring">description:Kühler~1^5</str>
  <str name="parsedquery">(+description:kühler~1^5.0)/no_coord</str>
  <str name="parsedquery_toString">+description:kühler~1^5.0</str>
  <lst name="explain">
    <str name="17411">
2.334934 = (MATCH) weight(description:kühler^5.0 in 4080) [DefaultSimilarity], result of:
  2.334934 = score(doc=4080,freq=1.0 = termFreq=1.0
), product of:
    0.9994 = queryWeight, product of:
      5.0 = boost
      6.226491 = idf(docFreq=64, maxDocs=12099)
      0.03212082 = queryNorm
    2.3349342 = fieldWeight in 4080, product of:
      1.0 = tf(freq=1.0), with freq of:
        1.0 = termFreq=1.0
      6.226491 = idf(docFreq=64, maxDocs=12099)
      0.375 = fieldNorm(doc=4080)
</str>
    <str name="19085">
2.334934 = (MATCH) weight(description:kühler^5.0 in 5754) [DefaultSimilarity], result of:
  2.334934 = score(doc=5754,freq=1.0 = termFreq=1.0
), product of:
    0.9994 = queryWeight, product of:
      5.0 = boost
      6.226491 = idf(docFreq=64, maxDocs=12099)
      0.03212082 = queryNorm
    2.3349342 = fieldWeight in 5754, product of:
      1.0 = tf(freq=1.0), with freq of:
        1.0 = termFreq=1.0
      6.226491 = idf(docFreq=64, maxDocs=12099)
      0.375 = fieldNorm(doc=5754)
</str>

Am I using this feature wrong?

On 30.07.2014 14:48, Ahmet Arslan wrote:

 Hi,

 Please see: https://issues.apache.org/jira/browse/SOLR-3925 [1]

 Ahmet
 




Links:
--
[1] https://issues.apache.org/jira/browse/SOLR-3925


Ranking based on match position in field

2014-07-30 Thread Thomas Michael Engelke
 Hi,

an example. We have 2 records with this data in the same field
(description):

1: Lufthutze vor Kühler Bj 62-65, DS
2: Kühler HY im Austausch, Altteilpfand 250 Euro

A search with the parameters 'description:Kühler' does provide this debug:

2.3234584 = (MATCH) weight(description:kühler in 4053) [DefaultSimilarity], result of:
  2.3234584 = fieldWeight in 4053, product of:
    1.0 = tf(freq=1.0), with freq of:
      1.0 = termFreq=1.0
    6.195889 = idf(docFreq=69, maxDocs=12637)
    0.375 = fieldNorm(doc=4053)
</str>
<str name="16946">
2.3234584 = (MATCH) weight(description:kühler in 5729) [DefaultSimilarity], result of:
  2.3234584 = fieldWeight in 5729, product of:
    1.0 = tf(freq=1.0), with freq of:
      1.0 = termFreq=1.0
    6.195889 = idf(docFreq=69, maxDocs=12637)
    0.375 = fieldNorm(doc=5729)

As you can see, both get the exact same score. However, we would like to
rank the second document higher, on the basis that the search term occurs
further to the left of the field.

Is there a component/setting that can do that? 
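
The numbers in the debug output can be reproduced by hand. The Python sketch below assumes the classic Lucene TF-IDF formulas as I understand them (tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1)), fieldWeight = tf * idf * fieldNorm); the function names are made up, and Lucene itself works in single precision, so the last digits can differ.

```python
import math


def tf(freq):
    """Term frequency factor: square root of the raw frequency."""
    return math.sqrt(freq)


def idf(doc_freq, max_docs):
    """Inverse document frequency as printed in the explain output."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def field_weight(freq, doc_freq, max_docs, field_norm):
    """Product of the three factors listed under fieldWeight."""
    return tf(freq) * idf(doc_freq, max_docs) * field_norm


# Values copied from the explain output for both documents.
score = field_weight(freq=1.0, doc_freq=69, max_docs=12637, field_norm=0.375)
print(round(score, 4))  # 2.3235
```

None of the three factors depends on where in the field the term occurs, which is why both records end up with the same score as long as their term frequency and field norm agree.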
 

Re: Not finding part of fulltext field when word ends in dot

2014-02-03 Thread Thomas Michael Engelke
That was a complicated answer, but ultimately the right one. Thank you very
much.


2014-01-30 Jack Krupansky j...@basetechnology.com:

 The word delimiter filter will turn "26KA" into two tokens, as if you had
 written "26 KA" without the quotes. The autoGeneratePhraseQueries option
 will cause the multiple terms to be treated as if they actually were
 enclosed within quotes; otherwise they will be treated as separate and
 unquoted terms. If you do enclose "26KA" in quotes in your query, then
 autoGeneratePhraseQueries is not relevant.

 Ah... maybe the problem is that you have preserveOriginal=true in your
 query analyzer. Do you have your default query operator set to AND? If
 so, it would treat "26KA" as "26 AND KA AND 26KA", which requires
 "26KA" (without the trailing dot) to be in the index.

 It seems counter-intuitive, but the attributes of the index and query word
 delimiter filters need to be slightly asymmetric.


 -- Jack Krupansky


Not finding part of fulltext field when word ends in dot

2014-01-29 Thread Thomas Michael Engelke
Hello everybody,

we have a legacy solr installation in version 3.6.0.1. One of the indices
defines a field named content as a fulltext field where a product
description will reside. One of the records indexed contains the following
data (excerpt):

z. B. in der Serie 26KA.

I had the problem that searching for the value "26KA" didn't find anything.
Using the analyzer of the administrative interface, with the full text
on one hand and "26KA" as the query string, I can see how the search string
is transformed by the filter factories in use. The WordDelimiterFilterFactory
transforms the "26KA." into "26KA", which is displayed like this (excerpt):

73   74    75      76
in   der   Serie   26KA.
                   26KA

It seems that it stripped the 26KA. of the dot. Using the option to
highlight matches, an analysis search of 26KA shows the lower of the two
entries matches (after reaching the LowerCaseFilterFactory). However,
querying the index using the query interface doesn't show any matches.

I discovered that adding an asterisk to the search seems to work, as does
adding the dot. I am puzzled by this, as I thought that the second added
entry was the word actually indexed. I've tried looking up the definition
of the administrative interface, but the documentation only specifies this
for the latest version, where the display is different and (at least in the
sample) doesn't show such duplication.

Can anybody shed some light onto this?


Re: Not finding part of fulltext field when word ends in dot

2014-01-29 Thread Thomas Michael Engelke
The fieldType definition is a tad on the longer side:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>

    <filter class="solr.WordDelimiterFilterFactory"
            catenateWords="1"
            catenateNumbers="1"
            generateNumberParts="1"
            splitOnCaseChange="1"
            generateWordParts="1"
            catenateAll="0"
            preserveOriginal="1"
            splitOnNumerics="0"/>

    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="german/synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
            dictionary="german/german-common-nouns.txt"
            minWordSize="5"
            minSubwordSize="4"
            maxSubwordSize="15"
            onlyLongestMatch="true"/>

    <filter class="solr.StopFilterFactory" words="german/stopwords.txt" ignoreCase="true" enablePositionIncrements="true"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2" protected="german/protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>

    <filter class="solr.WordDelimiterFilterFactory"
            catenateWords="0"
            catenateNumbers="0"
            generateWordParts="1"
            splitOnCaseChange="1"
            generateNumberParts="1"
            catenateAll="0"
            preserveOriginal="1"
            splitOnNumerics="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="german/stopwords.txt" ignoreCase="true" enablePositionIncrements="true"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2" protected="german/protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>


Thank you for taking a look.


2014-01-29 Jack Krupansky j...@basetechnology.com

 What field type and analyzer/tokenizer are you using?

 -- Jack Krupansky




Re: Not finding part of fulltext field when word ends in dot

2014-01-29 Thread Thomas Michael Engelke
I'm not sure I got my problem across. If I understand the snippet of
documentation right, autoGeneratePhraseQueries only affects queries that
result in multiple tokens, which mine does not. The version also is
3.6.0.1, and we're not planning on upgrading to any 4.x version.


2014-01-29 Jack Krupansky j...@basetechnology.com

 You might want to add autoGeneratePhraseQueries=true to your field
 type, but I don't think that would cause a break when going from 3.6 to
 4.x. The default for that attribute changed in Solr 3.5. What release was
 your data indexed using? There may have been some subtle word delimiter
 filter changes between 3.x and 4.x.

 Read:
 http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201202.mbox/%3CC0551C512C863540BC59694A118452AA0764A434@ITS-EMBX-03.adsroot.itcs.umich.edu%3E


