RE: Solr search engine configuration
Thanks, will look into all that :-) -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
RE: Solr search engine configuration
Hi,

In that case you need the KeywordRepeat and RemoveDuplicates filters as well; I'd suggest reading their Javadocs. With the docs and the analysis GUI, you can probably figure out their respective places in the filter chain yourself.

Relying on IDF is not really a finely controlled boosting mechanism, but it should work more or less. We use payloads everywhere for finely controlled scoring, but that involves a lot of code.

Cheers,
Markus

-Original message-
> From: PeterKerk
> Sent: Tuesday 13th March 2018 21:35
> To: solr-user@lucene.apache.org
> Subject: RE: Solr search engine configuration
>
> Cool, will do some more digging around in the analysis GUI first.
>
> One last thing then on this comment of yours:
> "Does the decompounder support emitting the compound word as well? If so,
> enable it. It should help scoring compounds higher via IDF as they are less
> common."
>
> So I checked the Javadoc:
> https://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/compound/DictionaryCompoundWordTokenFilterFactory.html
> To be sure I also checked the Javadoc for the alternative:
> https://lucene.apache.org/core/6_5_0/analyzers-common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilterFactory.html,
> but nothing there on emitting either.
>
> Where can I see whether DictionaryCompoundWordTokenFilterFactory supports
> emitting the compound word and how to enable it?
>
> Thanks again! :-)
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
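To make the IDF remark concrete, here is a minimal sketch. It assumes the classic Lucene TF-IDF similarity (DefaultSimilarity, which is what the debug output in this thread reports), where idf = 1 + ln(maxDocs / (docFreq + 1)). The counts are the ones from that debug output (docFreq=3 and docFreq=1 out of maxDocs=513); the function name is made up for illustration.

```python
import math


def classic_idf(doc_freq: int, max_docs: int) -> float:
    """idf as computed by Lucene's classic TF-IDF similarity (assumed here)."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


# A term found in 3 of 513 documents versus a term found in only 1.
common = classic_idf(doc_freq=3, max_docs=513)
rare = classic_idf(doc_freq=1, max_docs=513)
print(round(common, 4), round(rare, 4))
```

The rarer term gets the larger idf, which is why an indexed compound such as "dierenzaak" would tend to outscore the more common "dier". The gap is modest and depends entirely on the corpus, hence "not finely controlled".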
RE: Solr search engine configuration
Cool, will do some more digging around in the analysis GUI first.

One last thing then on this comment of yours:
"Does the decompounder support emitting the compound word as well? If so, enable it. It should help scoring compounds higher via IDF as they are less common."

So I checked the Javadoc:
https://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/compound/DictionaryCompoundWordTokenFilterFactory.html
To be sure I also checked the Javadoc for the alternative:
https://lucene.apache.org/core/6_5_0/analyzers-common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilterFactory.html,
but nothing there on emitting either.

Where can I see whether DictionaryCompoundWordTokenFilterFactory supports emitting the compound word and how to enable it?

Thanks again! :-)

--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
RE: Solr search engine configuration
Inline, cheers.

-Original message-
> From: PeterKerk
> Sent: Tuesday 13th March 2018 18:53
> To: solr-user@lucene.apache.org
> Subject: RE: Solr search engine configuration
>
> You must stay in the Javadoc section, there the examples are good, or the
> reference guide:
> https://lucene.apache.org/core/6_5_0/analyzers-common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilterFactory.html
> https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html#filter-descriptions
>
> PVK COMMENT 1:
> This seems to be for Solr 6.5+? I'm using 4.3.1. An upgrade is not on the
> radar soon. Will using DictionaryCompoundWordTokenFilterFactory as I'm doing
> now severely degrade my result quality as opposed to
> HyphenationCompoundWordTokenFilterFactory?

Just change the version number in the URL; most filters are already quite old:
https://lucene.apache.org/core/4_3_1/analyzers-common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilterFactory.html

Dictionary vs. Hyphenation: using Dictionary won't severely degrade results, and it can be easier to use if you need to add words. I prefer the hyphenator though, but it can bite. Stick to Dictionary, you are fine. But both (iirc) suffer from the same problems with overlapping words, or with subwords that do not entirely make up the full compound (minus genitives or plural forms). This is a real issue.

> Almost, zaken -> zaak is already KP output, no need to input what the
> stemmer will do for you.
>
> PVK COMMENT 2:
> How do you know zaken -> zaak is already KP output? Is there a list
> somewhere?

I know because I've seen KP's output a million times by now. You should really use Solr's analysis GUI; it shows what each filter emits, which is really helpful.

> PVK COMMENT 3:
> I now have:
>
> positionIncrementGap="100">
> generateWordParts="1"
> generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> catenateAll="0" splitOnCaseChange="1"/>
> dictionary="compounds_nl.txt"
> minWordSize="5" minSubwordSize="2" maxSubwordSize="15"
> onlyLongestMatch="true"/>
> dictionary="stemdict_nl.txt"/>
> protected="protwords_nl.txt"/>
>
> generateWordParts="1"
> generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> catenateAll="0" splitOnCaseChange="1"/>
> dictionary="compounds_nl.txt"
> minWordSize="5" minSubwordSize="2" maxSubwordSize="15"
> onlyLongestMatch="true"/>
> dictionary="stemdict_nl.txt"/>
> protected="protwords_nl.txt"/>

Please increase minWordSize and minSubwordSize. There are no compounds with that few characters. minSubwordSize should be at least 4, or you will get a lot of crazy output due to the problems stated above.

> I tested in admin UI (and yes, I restart Solr and reindex every time I make
> a change):
>
> http://localhost:8983/solr/tt-search-global/select?q=title_search_global%3A(dieren+zaak)&fl=id%2Ctitle&wt=xml&indent=true
> returns:
> "hi there dieren zaak something else"
> "hi there dier something else"
>
> http://localhost:8983/solr/tt-search-global/select?q=title_search_global%3A(dierenzaak)&fl=id%2Ctitle&wt=xml&indent=true&defType=edismax&qf=title_search_global&stopwords=true&lowercaseOperators=true
> returns
> "hi there dierenzaak something else"
>
> So I added "dieren" to compounds_nl.txt
>
> Now on "title_search_global:(dieren zaak)" it returns:
>
> hi there dieren zaak something else
> 115_3699638
>
> hi there dier something else
> 115_3699637
>
> hi there dierenzaak something else
> 115_3699639
>
> So it's starting to look good! :-)
>
> What I want to know, how can I have Solr consider "dierenzaak" to be of
> higher importance than just "dier" in the above results?

Does the decompounder support emitting the compound word as well? If so, enable it. It should help score compounds higher via IDF, as they are less common.

> Also I'm still not 100% sure what my addition of "dieren" to
> compounds_nl.txt actually does...I assume
> DictionaryCompoundWordTokenFilterFactory just looks for that exact string
> and if it finds it, considers that a separate word? Correct?

Just check in the analysis GUI, it will answer all these questions.

> Thanks again!
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
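The advice on minSubwordSize can be illustrated with a toy model of dictionary-based decompounding. This is a simplified sketch, not Lucene's code: the function name and the four-word dictionary are invented for illustration, and it assumes the filter keeps the original token and adds every dictionary match found inside it as an extra token. The analysis GUI remains the authority on what the real filter emits.

```python
def decompound(token, dictionary, min_word_size=5, min_subword_size=2,
               max_subword_size=15, only_longest_match=False):
    """Return the original token followed by the dictionary subwords found in it."""
    tokens = [token]
    if len(token) < min_word_size:
        return tokens  # short words are passed through untouched
    lowered = token.lower()
    for start in range(len(lowered) - min_subword_size + 1):
        longest = None
        for size in range(min_subword_size, max_subword_size + 1):
            if start + size > len(lowered):
                break
            part = lowered[start:start + size]
            if part in dictionary:
                if only_longest_match:
                    # keep only the longest match beginning at this offset
                    if longest is None or len(part) > len(longest):
                        longest = part
                else:
                    tokens.append(part)
        if longest is not None:
            tokens.append(longest)
    return tokens


# Hypothetical dictionary: two real parts plus two short Dutch words.
words = {"dieren", "zaak", "er", "en"}
print(decompound("dierenzaak", words, min_subword_size=2, only_longest_match=True))
print(decompound("dierenzaak", words, min_subword_size=4, only_longest_match=True))
```

With a minimum of 2 the short words "er" and "en" are picked out of the middle of "dieren", which is the kind of noise a larger minSubwordSize avoids. Note that in this model onlyLongestMatch applies per start offset, so it does not suppress such overlapping matches.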
RE: Solr search engine configuration
You must stay in the Javadoc section, there the examples are good, or the reference guide: https://lucene.apache.org/core/6_5_0/analyzers-common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilterFactory.html https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html#filter-descriptions PVK COMMENT 1: This seems to be for Solr 6.5+? I'm using 4.3.1. An upgrade is not on the radar soon. Will using DictionaryCompoundWordTokenFilterFactory as I'm doing now severely degrade my result quality as opposed to HyphenationCompoundWordTokenFilterFactory? Almost, zaken -> zaak is already KP output, no need to input what the stemmer will do for you. PVK COMMENT 2: How do you know zaken -> zaak is already KP output? Is there a list somewhere? PVK COMMENT 3: I now have: I tested in admin UI (and yes, I restart Solr and reindex every time I make a change): http://localhost:8983/solr/tt-search-global/select?q=title_search_global%3A(dieren+zaak)&fl=id%2Ctitle&wt=xml&indent=true returns: "hi there dieren zaak something else" "hi there dier something else" http://localhost:8983/solr/tt-search-global/select?q=title_search_global%3A(dierenzaak)&fl=id%2Ctitle&wt=xml&indent=true&defType=edismax&qf=title_search_global&stopwords=true&lowercaseOperators=true returns "hi there dierenzaak something else" So I added "dieren" to compounds_nl.txt Now on "title_search_global:(dieren zaak)" it returns: hi there dieren zaak something else 115_3699638 hi there dier something else 115_3699637 hi there dierenzaak something else 115_3699639 So it's starting to look good! :-) What I want to know, how can I have Solr consider "dierenzaak" to be of higher importance than just "dier" in the above results? Also I'm still not 100% sure what my addition of "dieren" to compounds_nl.txt actually does...I assume DictionaryCompoundWordTokenFilterFactory just looks for that exact string and if it finds it, considers that a separate word? Correct? Thanks again! 
Re: Solr search engine configuration
On 3/13/2018 7:24 AM, PeterKerk wrote:
> PVK COMMENT:
> But without a StopFilter, won't stopwords be included in searches? I thought
> that for example Google excluded these words in their algorithms?

I just did a Google search for "to be or not to be". It worked flawlessly. If Google were using stopwords, that search would have returned nothing. The four words in that search are among the most frequent words found in English prose. This is a typical stopword list for English:

a an and are as at be but by for if in into is it no not of on or such that the their then there these they this to was will with

To explain why the frequent responders on this list recommend not using stopwords, and why the biggest search engine on the planet doesn't use them, you need a small history lesson -- you have to know why stopword filters were invented in the first place.

A search engine works by creating an inverted index. For a typical full-text index, this means there is a big list of words, and for each of those words there is a list that identifies the document, field name, and text offset where that word is found. Without a stopword filter, the biggest entry in an index for English is probably "the" ... in a corpus of a few million documents, "the" might appear *billions* of times. So the list is BIG. And when the search has to deal with a big entry in the inverted index, it's slower than normal.

Back in the annals of history (80s, 90s, etc.) servers didn't have nearly as much memory and CPU resources as they do now. Eliminating these giant entries in the index made a HUGE difference in search performance. A search that might take several seconds with the stopwords included could be sped up to less than one second without them.

Even back then, the people who built stopword filters KNEW that they were impacting search results. The reason they implemented them anyway was to greatly improve search *performance*. They knew that a search for "to be or not to be" or "the who" or any number of other similar searches wouldn't work properly. But the vast majority of searches were not really affected by the stopword removal, and users got their results really fast.

Today, with modern hardware, search engines are much less bothered by having enormous entries in the inverted index. When stopwords are NOT removed, you can get more accurate search results. Yes, the index is substantially bigger. But modern hardware is easy to load up with a lot of disk space, memory, and CPU capacity, and search with stopwords is fast enough.

Thanks,
Shawn
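The index structure described above can be sketched in a few lines. This is a toy model under stated assumptions: whitespace tokenization, a single field, and two made-up documents. The function names are hypothetical and nothing here reflects how Lucene actually stores postings.

```python
from collections import defaultdict


def build_inverted_index(docs, stopwords=frozenset()):
    """Map each term to a postings list of (doc_id, position) pairs."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            if term not in stopwords:
                index[term].append((doc_id, position))
    return index


def phrase_match(index, phrase):
    """Return ids of documents that contain the words of `phrase` consecutively."""
    terms = phrase.lower().split()
    if not terms or any(term not in index for term in terms):
        return set()
    hits = set()
    for doc_id, pos in index[terms[0]]:
        if all((doc_id, pos + i) in index[term] for i, term in enumerate(terms)):
            hits.add(doc_id)
    return hits


docs = {
    1: "to be or not to be that is the question",
    2: "the who live at leeds",
}
stop = frozenset({"to", "be", "or", "not", "the", "at", "is", "that"})

full_index = build_inverted_index(docs)
stopped_index = build_inverted_index(docs, stop)

print(phrase_match(full_index, "to be or not to be"))
print(phrase_match(stopped_index, "to be or not to be"))
print(len(full_index["the"]), len(stopped_index.get("the", [])))
```

With stopwords removed the index is smaller, but the phrase can no longer be found at all, which is the trade-off between index size and result quality that the message describes.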
RE: Solr search engine configuration
-Original message-
> From: PeterKerk
> Sent: Tuesday 13th March 2018 14:24
> To: solr-user@lucene.apache.org
> Subject: RE: Solr search engine configuration
>
> Markus,
>
> Thanks again. Ok, 1 by 1:
>
> StemmerOverride wants \t separated fields, that is probably the cause of the
> AIooBE you get. Regarding schema definitions, each factory JavaDoc [1] has a
> proper example listed. I recommend putting a decompounder before a stemmer,
> and have an accent (or ICU) folder as one of the last filters.
>
> PVK COMMENT:
> Looking for Decompounders and found a few links, btw a lot of the pages
> these are linked to don't work.
>
> https://earlydance.org/news/9189-apachesolr-issues-german-and-other-germanic-languages
> http://lucene.apache.org/core/2_4_0/api/org/apache/lucene/analysis/compound/DictionaryCompoundWordTokenFilter.html
> https://wiki.apache.org/solr/LanguageAnalysis#Decompounding
> https://wiki.apache.org/solr/DictionaryCompoundWordTokenFilterFactory

You must stay in the Javadoc section, there the examples are good, or the reference guide:
https://lucene.apache.org/core/6_5_0/analyzers-common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilterFactory.html
https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html#filter-descriptions

> my stemdict_nl.txt now contains (words separated by a single tab):
> aachen	aach
> aachener	aachener
> aalmoezen	aalmoes
> beveel	bevool
> dierenzaken	dierenzaak
>
> The problem before was indeed like @Shawn indicates that I had words in
> there with a space like so:
> dieren zaken dierenzaak
>
> About the diff, it looks like KP output, it has the same issues with whether
> or not a word needs double or single vowels in the root. It also shows
> issues with strong verbs/nouns (beveel/bevool). Having this list seems like
> having KP configured so you should drop it, and only list exceptions to KP
> rules in the dict file. This is not easy, so i recommend to stay in to your
> domain's vocabulary.
>
> PVK COMMENT:
> That's what I now did above right?

Almost. zaken -> zaak is already KP output, so there is no need to input what the stemmer will do for you.

> Also, unless you have a very specific need for it, drop the StopFilter.
> Nobody in these days should want a StopFilter unless they can justify it. We
> use them too, but only for very specific reasons, but never for text search.
> You might also want to have a WordDelimiterFilter as your first filter, look
> it up, you probably want to have it.
>
> PVK COMMENT:
> But without a Stopfilter, wont stopwords be included in searches? I though
> that for example Google excluded these words in their algorithms?

Yes, stopwords are good! Keep them in the index! And I am glad Google doesn't just strip stopwords.

> This is what I have now:
>
> positionIncrementGap="100">
> generateWordParts="1"
> generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> catenateAll="0" splitOnCaseChange="1"/>
> dictionary="compounds_nl.txt"
> minWordSize="5" minSubwordSize="2" maxSubwordSize="15"
> onlyLongestMatch="true"/>
> dictionary="stemdict_nl.txt"/>
>
> generateWordParts="1"
> generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> catenateAll="0" splitOnCaseChange="1"/>
> dictionary="compounds_nl.txt"
> minWordSize="5" minSubwordSize="2" maxSubwordSize="15"
> onlyLongestMatch="true"/>
> dictionary="stemdict_nl.txt"/>

That looks fine, but you have now omitted the stemmer (Snowball). Put it after StemmerOverrideFilter, and before ASCIIFolding.

> Now for both this query
> http://localhost:8983/solr/tt-search-global/select?q=title_search_global%3A(dieren+zaak)&fl=id%2Ctitle&wt=xml&indent=true&defType=edismax&stopwords=true&lowercaseOp
RE: Solr search engine configuration
Markus, Thanks again. Ok, 1 by 1: StemmerOverride wants \t separated fields, that is probably the cause of the AIooBE you get. Regarding schema definitions, each factory JavaDoc [1] has a proper example listed. I recommend putting a decompounder before a stemmer, and have an accent (or ICU) folder as one of the last filters. PVK COMMENT: Looking for Decompounders and found a few links, btw a lot of the pages these are linked to don't work. https://earlydance.org/news/9189-apachesolr-issues-german-and-other-germanic-languages http://lucene.apache.org/core/2_4_0/api/org/apache/lucene/analysis/compound/DictionaryCompoundWordTokenFilter.html https://wiki.apache.org/solr/LanguageAnalysis#Decompounding https://wiki.apache.org/solr/DictionaryCompoundWordTokenFilterFactory my stemdict_nl.txt now contains (words separated by a single tab): aachen aach aacheneraachener aalmoezen aalmoes beveel bevool dierenzaken dierenzaak The problem before was indeed like @Shawn indicates that I had words in there with a space like so: dieren zakendierenzaak About the diff, it looks like KP output, it has the same issues with whether or not a word needs double or single vowels in the root. It also shows issues with strong verbs/nouns (beveel/bevool). Having this list seems like having KP configured so you should drop it, and only list exceptions to KP rules in the dict file. This is not easy, so i recommend to stay in to your domain's vocabulary. PVK COMMENT: That's what I now did above right? Also, unless you have a very specific need for it, drop the StopFilter. Nobody in these days should want a StopFilter unless they can justify it. We use them too, but only for very specific reasons, but never for text search. You might also want to have a WordDelimiterFilter as your first filter, look it up, you probably want to have it. PVK COMMENT: But without a Stopfilter, wont stopwords be included in searches? I though that for example Google excluded these words in their algorithms? 
This is what I have now: Now for both this query http://localhost:8983/solr/tt-search-global/select?q=title_search_global%3A(dieren+zaak)&fl=id%2Ctitle&wt=xml&indent=true&defType=edismax&stopwords=true&lowercaseOperators=true and this one: http://localhost:8983/solr/tt-search-global/select?q=title_search_global%3A(dieren+zaak)&fl=id%2Ctitle&wt=xml&indent=true&defType=edismax&qf=title_search_global&stopwords=true&lowercaseOperators=true This result is found: "Hi there dieren zaak something else" And these are NOT: "Hi there dier something else" "Hi there dierenzaak something else" "Hi there dierzaak something else" What else do you recommend I try? -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Solr search engine configuration
On 3/12/2018 4:15 PM, PeterKerk wrote:
> I trimmed stemdict_nl.txt for testing to just this:
>
> aachen aach
> aachener aachener

According to the example here:
https://github.com/apache/lucene-solr/blob/master/solr/core/src/test-files/solr/collection1/conf/stemdict.txt

The lines need to be tab separated. I'm betting that you're running into this bug, which is still unresolved:
https://issues.apache.org/jira/browse/LUCENE-4545

The source file you have referenced uses spaces. If those are still in your file, it isn't going to work.

It appears that the way the code is written (and is STILL written even in master, which will one day be version 8.0), the separator must be a SINGLE tab. I have confirmed that multiple tabs or any number of spaces won't work properly.

I will see what I can do about getting the bug fixed, but for now you're going to have to fix all the separators in your dictionary file.

Thanks,
Shawn
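A small script can find and repair the offending separators before Solr ever loads the file. This is a minimal sketch with hypothetical function names. It assumes each entry is exactly "word<TAB>stem" as described above, and that blank lines and lines starting with "#" can be ignored, which should be checked against the Solr version in use.

```python
def check_stemdict(lines):
    """Return (line_number, line) pairs that are not exactly 'word<TAB>stem'."""
    bad = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line.strip() or line.startswith("#"):
            continue  # blank lines and comments: assumed to be ignored
        fields = line.split("\t")
        if len(fields) != 2 or not all(f and f == f.strip() for f in fields):
            bad.append((number, line))
    return bad


def normalize_stemdict(lines):
    """Rewrite two-column lines separated by any whitespace to use a single tab."""
    fixed = []
    for line in lines:
        parts = line.split()
        # Only touch lines that clearly have two columns; leave the rest alone.
        fixed.append("\t".join(parts) if len(parts) == 2 else line.rstrip("\r\n"))
    return fixed


sample = ["aachen aach\n", "aachener\taachener\n", "aalmoezen\t\taalmoes\n"]
print(check_stemdict(sample))
print(check_stemdict(normalize_stemdict(sample)))
```

Entries whose source side contains a space (three or more columns after splitting) are deliberately left untouched by normalize_stemdict, since it is not clear how they should be split; check_stemdict will keep reporting them if they lack a tab.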
RE: Solr search engine configuration
Hello Peter,

StemmerOverride wants \t separated fields; that is probably the cause of the AIooBE you get. Regarding schema definitions, each factory Javadoc [1] has a proper example listed. I recommend putting a decompounder before a stemmer, and having an accent (or ICU) folder as one of the last filters.

About the diff: it looks like KP output, and it has the same issues with whether a word needs double or single vowels in the root. It also shows issues with strong verbs/nouns (beveel/bevool). Having this list is like having KP configured, so you should drop it and only list exceptions to KP rules in the dict file. This is not easy, so I recommend sticking to your domain's vocabulary.

Also, unless you have a very specific need for it, drop the StopFilter. Nobody these days should want a StopFilter unless they can justify it. We use them too, but only for very specific reasons, never for text search. You might also want to have a WordDelimiterFilter as your first filter. Look it up, you probably want to have it.

Markus

[1] https://lucene.apache.org/core/7_1_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilterFactory.html

-Original message-
> From: PeterKerk
> Sent: Monday 12th March 2018 23:16
> To: solr-user@lucene.apache.org
> Subject: RE: Solr search engine configuration
>
> @Erick: thank you for clarifying!
>
> @Markus:
> I feel like I'm not (or at least should not be :-)) the first person to run
> into these challenges.
> > "You can solve this by adding manual rules to StemmerOverrideFilter, but due > to the compound nature of words, you would need to add it for all the mills" > > After Googling I found this: > https://stackoverflow.com/questions/22451774/word-does-not-get-analysed-properly-using-stemmeroverridefilterfactory-and-snowb > and added http://snowball.tartarus.org/algorithms/kraaij_pohlmann/diffs.txt > as stemdict_nl.txt > > My new fieldType definition now is: > > positionIncrementGap="100"> > > > words="stopwords_nl.txt"/> > > dictionary="stemdict_nl.txt"/> > protected="protwords_nl.txt"> > > > > words="stopwords_nl.txt"/> > > dictionary="stemdict_nl.txt"/> > protected="protwords_nl.txt"> > > > > I trimmed stemdict_nl.txt for testing to just this: > > aachen aach > aachener aachener > > But on full-import it throws a http 500 error: > Caused by: java.lang.ArrayIndexOutOfBoundsException: 1 at > org.apache.lucene.analysis.miscellaneous.StemmerOverrideFilterFactory.inform(StemmerOverrideFilterFactory.java:66) > > Is my stemdict_nl.txt format incorrect? > > And do you have examples of the HyphenationCompoundWordTokenFilter or > AccentFoldingFilter I can't find any. > > I use Solr 4.3.1 btw, not sure if that matters. > > > > > -- > Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html >
RE: Solr search engine configuration
@Erick: thank you for clarifying! @Markus: I feel like I'm not (or at least should not be :-)) the first person to run into these challenges. "You can solve this by adding manual rules to StemmerOverrideFilter, but due to the compound nature of words, you would need to add it for all the mills" After Googling I found this: https://stackoverflow.com/questions/22451774/word-does-not-get-analysed-properly-using-stemmeroverridefilterfactory-and-snowb and added http://snowball.tartarus.org/algorithms/kraaij_pohlmann/diffs.txt as stemdict_nl.txt My new fieldType definition now is: I trimmed stemdict_nl.txt for testing to just this: aachenaach aachener aachener But on full-import it throws a http 500 error: Caused by: java.lang.ArrayIndexOutOfBoundsException: 1 at org.apache.lucene.analysis.miscellaneous.StemmerOverrideFilterFactory.inform(StemmerOverrideFilterFactory.java:66) Is my stemdict_nl.txt format incorrect? And do you have examples of the HyphenationCompoundWordTokenFilter or AccentFoldingFilter I can't find any. I use Solr 4.3.1 btw, not sure if that matters. -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Solr search engine configuration
Peter:

bq: I don't have a requestHandler named "/select".

Right, that was just an example of a request handler. Your "/scoresearch" handler _does_ have edismax as its default "defType", so assuming you're using that one it makes no difference at all whether you specify &defType=edismax on the URL or not. You'd see a difference if you specified "&defType=any_parser_other_than_edismax" though ;)

As for the rest, I'll leave you in the much more capable hands of Markus since he has, you know, real knowledge in this area rather than my generalities.

Best,
Erick

On Mon, Mar 12, 2018 at 1:33 AM, Markus Jelsma wrote:
> Hi,
>
> Glad to hear you removed the gramming, but Kraaij-Pohlmann isn't going to
> solve all problems either, for example molens => molen, but molen => mool,
> and many more like that. You can solve this by adding manual rules to
> StemmerOverrideFilter, but due to the compound nature of words, you would
> need to add it for all the mills.
>
> Regarding the compounds, Dutch is (more or less) just another Germanic
> language and uses compounds just like German, Swedish etc. To deal with that
> you can try the vanilla HyphenationCompoundWordTokenFilter (or something like
> that). Be sure not to set minWordLength too low, or you'll get plenty of bad
> results. The major drawback of this token filter is that it emits overlapping
> terms, and may not always work with compounds of which the head is a plural,
> just like dierenzaak, or scholierenkorting.
>
> Also add an AccentFoldingFilter, or ICUNormalizer to get rid of accents, or
> you may have trouble finding a café.
>
> Regards,
> Markus
>
> -Original message-
>> From: PeterKerk
>> Sent: Sunday 11th March 2018 23:55
>> To: solr-user@lucene.apache.org
>> Subject: Re: Solr search engine configuration
>>
>> Sorry for this lengthy post, but I wanted to be complete.
>> >> The only occurence of edismax in solrconfig.xml is this one: >> >> > default="true"> >> >> >> edismax >> explicit >> 10 >> >> double_score >> false >> *:* >> >> >> >> I don't have a requestHandler named "/select". >> >> >> Also, removing the gramming definitely helped! :-) >> >> I tried to simplify my setup first and then expand, so what I have now is >> this: >> >> >> > positionIncrementGap="100"> >> >> >> > words="stopwords_nl.txt"/> >> >> > protected="protwords_nl.txt"> >> >> >> >> >> >> > words="stopwords_nl.txt"/> >> >> > protected="protwords_nl.txt"> >> >> >> >> >> >> > stored="true"/> >> >> In my database I have these 4 values for "title" that populate >> "title_search_global" >> >> "Hi there dier something else" >> "Hi there dieren zaak something else" >> "Hi there dierenzaak something else" >> "Hi there dierzaak something else" >> >> ps. "dier" is singular of plural "dieren". >> >> Using this query: >> http://localhost:8983/solr/search-global/select?q=title_search_global%3A(dieren+zaak)&fq=(lang%3A%22nl%22+OR+lang%3A%22all%22)&fl=id%2Ctitle&wt=xml&indent=true&defType=edismax&qf=title_search_global&stopwords=true&lowercaseOperators=true&debug=true >> >> These results are found: >> "Hi there dier something else" >> "Hi there dieren zaak something else" >> >> And these are NOT: >> "Hi there dierenzaak something else" >> "Hi there dierzaak something else" >> >> I'd expect it should be fairly easy (although I don't know how) to also >> include result "dierenzaak", by compounding the 2 query values. And yes you >> are correct: in Dutch "dieren zaak" would mean the same as "dierenzaak". Not >> sure what logic would also include "dierzaak" >> >> Regarding your question: yes, I do consider "dieren zaak soemthingelse" an >> exact match o
RE: Solr search engine configuration
Hi,

Glad to hear you removed the gramming, but Kraaij-Pohlmann isn't going to solve all problems either, for example molens => molen, but molen => mool, and many more like that. You can solve this by adding manual rules to StemmerOverrideFilter, but due to the compound nature of words, you would need to add it for all the mills.

Regarding the compounds, Dutch is (more or less) just another Germanic language and uses compounds just like German, Swedish etc. To deal with that you can try the vanilla HyphenationCompoundWordTokenFilter (or something like that). Be sure not to set minWordLength too low, or you'll get plenty of bad results. The major drawback of this token filter is that it emits overlapping terms, and it may not always work with compounds of which the head is a plural, such as dierenzaak or scholierenkorting.

Also add an AccentFoldingFilter or ICUNormalizer to get rid of accents, or you may have trouble finding a café.

Regards,
Markus

-Original message-
> From: PeterKerk
> Sent: Sunday 11th March 2018 23:55
> To: solr-user@lucene.apache.org
> Subject: Re: Solr search engine configuration
>
> Sorry for this lengthy post, but I wanted to be complete.
>
> The only occurrence of edismax in solrconfig.xml is this one:
>
> default="true">
> edismax
> explicit
> 10
> double_score
> false
> *:*
>
> I don't have a requestHandler named "/select".
>
> Also, removing the gramming definitely helped! :-)
>
> I tried to simplify my setup first and then expand, so what I have now is
> this:
>
> positionIncrementGap="100">
> words="stopwords_nl.txt"/>
> protected="protwords_nl.txt">
>
> words="stopwords_nl.txt"/>
> protected="protwords_nl.txt">
>
> stored="true"/>
>
> In my database I have these 4 values for "title" that populate
> "title_search_global"
>
> "Hi there dier something else"
> "Hi there dieren zaak something else"
> "Hi there dierenzaak something else"
> "Hi there dierzaak something else"
>
> ps.
"dier" is singular of plural "dieren". > > Using this query: > http://localhost:8983/solr/search-global/select?q=title_search_global%3A(dieren+zaak)&fq=(lang%3A%22nl%22+OR+lang%3A%22all%22)&fl=id%2Ctitle&wt=xml&indent=true&defType=edismax&qf=title_search_global&stopwords=true&lowercaseOperators=true&debug=true > > These results are found: > "Hi there dier something else" > "Hi there dieren zaak something else" > > And these are NOT: > "Hi there dierenzaak something else" > "Hi there dierzaak something else" > > I'd expect it should be fairly easy (although I don't know how) to also > include result "dierenzaak", by compounding the 2 query values. And yes you > are correct: in Dutch "dieren zaak" would mean the same as "dierenzaak". Not > sure what logic would also include "dierzaak" > > Regarding your question: yes, I do consider "dieren zaak soemthingelse" an > exact match of "dieren zaak" > So I also checked the usage of pf parameters with edismax (based on these > links: > https://lucene.apache.org/solr/guide/6_6/the-extended-dismax-query-parser.html, > http://blog.thedigitalgroup.com/vijaym/understanding-phrasequery-and-slop-in-solr/) > And also for dismax: > https://lucene.apache.org/solr/guide/6_6/the-dismax-query-parser.html#TheDisMaxQueryParser-Theqs_QueryPhraseSlop_Parameter > > But I can't find any examples how to actually use these parameters? > > > The search results, including debug info is here: > > > > > 0 > 7 > > title_search_global:(dieren zaak) > edismax > true > true > title_search_global > id,title > (lang:"nl" OR lang:"all") > xml > true >
Re: Solr search engine configuration
Sorry for this lengthy post, but I wanted to be complete.

The only occurrence of edismax in solrconfig.xml is this one:

edismax explicit 10 double_score false *:*

I don't have a requestHandler named "/select".

Also, removing the gramming definitely helped! :-)

I tried to simplify my setup first and then expand, so what I have now is this:

In my database I have these 4 values for "title" that populate "title_search_global":

"Hi there dier something else"
"Hi there dieren zaak something else"
"Hi there dierenzaak something else"
"Hi there dierzaak something else"

ps. "dier" is the singular of the plural "dieren".

Using this query:
http://localhost:8983/solr/search-global/select?q=title_search_global%3A(dieren+zaak)&fq=(lang%3A%22nl%22+OR+lang%3A%22all%22)&fl=id%2Ctitle&wt=xml&indent=true&defType=edismax&qf=title_search_global&stopwords=true&lowercaseOperators=true&debug=true

These results are found:
"Hi there dier something else"
"Hi there dieren zaak something else"

And these are NOT:
"Hi there dierenzaak something else"
"Hi there dierzaak something else"

I'd expect it should be fairly easy (although I don't know how) to also include the result "dierenzaak", by compounding the 2 query values. And yes, you are correct: in Dutch "dieren zaak" would mean the same as "dierenzaak". Not sure what logic would also include "dierzaak".

Regarding your question: yes, I do consider "dieren zaak something else" an exact match of "dieren zaak".

So I also checked the usage of the pf parameters with edismax (based on these links:
https://lucene.apache.org/solr/guide/6_6/the-extended-dismax-query-parser.html,
http://blog.thedigitalgroup.com/vijaym/understanding-phrasequery-and-slop-in-solr/)
And also for dismax:
https://lucene.apache.org/solr/guide/6_6/the-dismax-query-parser.html#TheDisMaxQueryParser-Theqs_QueryPhraseSlop_Parameter

But I can't find any examples of how to actually use these parameters.
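For what it's worth, the phrase parameters are passed as ordinary request parameters next to qf. The sketch below only builds such a URL. The helper name, the boost of 10 and the slop of 1 are arbitrary illustrations rather than recommendations, and it assumes the user query is sent as plain words (not with the field:(...) syntax) so that edismax can try the whole query as a phrase against the pf fields.

```python
from urllib.parse import urlencode


def build_edismax_url(base_url, user_query, field, phrase_boost=10, phrase_slop=1):
    """Build a select URL that also scores the query as a phrase via pf/ps."""
    params = {
        "q": user_query,
        "defType": "edismax",
        "qf": field,                      # field(s) searched for the individual terms
        "pf": f"{field}^{phrase_boost}",  # field(s) where the whole query is tried as a phrase
        "ps": phrase_slop,                # slop allowed for the pf phrase
        "fl": "id,title,score",
        "wt": "xml",
        "indent": "true",
    }
    return f"{base_url}?{urlencode(params)}"


url = build_edismax_url(
    "http://localhost:8983/solr/search-global/select",
    "dieren zaak",
    "title_search_global",
)
print(url)
```

Documents that contain "dieren zaak" as an adjacent pair should then receive an extra score contribution on top of the per-term matches; adding debug=true to the request shows whether the phrase clause was actually generated.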
The search results, including debug info, are here:

0 7
title_search_global:(dieren zaak) edismax true true title_search_global id,title (lang:"nl" OR lang:"all") xml true true

dieren zaak 115_3699638
dier 115_3699637

title_search_global:(dieren zaak)
title_search_global:(dieren zaak)
(+(title_search_global:dier title_search_global:zaak))/no_coord
+(title_search_global:dier title_search_global:zaak)

5.489122 = (MATCH) sum of:
  2.4387078 = (MATCH) weight(title_search_global:dier in 51) [DefaultSimilarity], result of:
    2.4387078 = score(doc=51,freq=1.0 = termFreq=1.0 ), product of:
      0.66654336 = queryWeight, product of:
        5.8539815 = idf(docFreq=3, maxDocs=513)
        0.113861546 = queryNorm
      3.6587384 = fieldWeight in 51, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        5.8539815 = idf(docFreq=3, maxDocs=513)
        0.625 = fieldNorm(doc=51)
  3.050414 = (MATCH) weight(title_search_global:zaak in 51) [DefaultSimilarity], result of:
    3.050414 = score(doc=51,freq=1.0 = termFreq=1.0 ), product of:
      0.7454662 = queryWeight, product of:
        6.5471287 = idf(docFreq=1, maxDocs=513)
        0.113861546 = queryNorm
      4.091955 = fieldWeight in 51, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        6.5471287 = idf(docFreq=1, maxDocs=513)
        0.625 = fieldNorm(doc=51)

1.9509662 = (MATCH) product of:
  3.9019325 = (MATCH) sum of:
    3.9019325 = (MATCH) weight(title_search_global:dier in 50) [DefaultSimilarity], result of:
      3.9019325 = score(doc=50,freq=1.0 = termFreq=1.0 ), product of:
        0.66654336 = queryWeight, product of:
          5.8539815 = idf(docFreq=3, maxDocs=513)
          0.113861546 = queryNorm
        5.8539815 = fieldWeight in 50, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          5.8539815 = idf(docFreq=3, maxDocs=513)
          1.0 = fieldNorm(doc=50)
  0.5 = coord(1/2)

0.9754831 = (MATCH) product of:
  1.9509662 = (MATCH) sum of:
    1.9509662 = (MATCH) weight(title_search_global:dier in 132) [DefaultSimilarity], result of:
      1.9509662 = score(doc=132,freq=1.0 = termFreq=1.0 ), product of:
        0.66654336 = queryWeight, product of:
          5.8539815 = idf(docFreq=3, maxDocs=513)
          0.113861546 = queryNorm
        2.9269907 = fieldWeight in 132, product of:
          1.0 = t
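
The numbers in the explain output above can be reproduced by hand. The following is a minimal sketch, assuming the classic Lucene TF-IDF formulas that DefaultSimilarity is known for (idf = 1 + ln(maxDocs / (docFreq + 1)), tf = sqrt(termFreq)). The function names are made up for illustration; the constants are copied from the debug output.

```python
import math

def classic_idf(doc_freq: int, max_docs: int) -> float:
    """Inverse document frequency: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def term_score(doc_freq, max_docs, term_freq, field_norm, query_norm):
    """Score of one term in one document: queryWeight * fieldWeight."""
    idf = classic_idf(doc_freq, max_docs)
    tf = math.sqrt(term_freq)
    query_weight = idf * query_norm        # "queryWeight" in the explain output
    field_weight = tf * idf * field_norm   # "fieldWeight" in the explain output
    return query_weight * field_weight

QUERY_NORM = 0.113861546  # copied from the explain output
MAX_DOCS = 513            # copied from the explain output

# Doc 51 matches both "dier" (docFreq=3) and "zaak" (docFreq=1), fieldNorm=0.625.
dier_51 = term_score(3, MAX_DOCS, 1.0, 0.625, QUERY_NORM)
zaak_51 = term_score(1, MAX_DOCS, 1.0, 0.625, QUERY_NORM)
doc_51 = dier_51 + zaak_51

# Doc 50 matches only "dier" (fieldNorm=1.0), so coord(1/2) halves its score.
doc_50 = term_score(3, MAX_DOCS, 1.0, 1.0, QUERY_NORM) * 0.5

print(f"idf(dier)={classic_idf(3, MAX_DOCS):.4f} idf(zaak)={classic_idf(1, MAX_DOCS):.4f}")
print(f"doc 51: {doc_51:.4f}  doc 50: {doc_50:.4f}")
```

The rarer term "zaak" gets the higher idf, which is why the document containing both terms ranks first, and the coord factor explains why a document matching only one of the two query terms scores half of its raw term score.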
Re: Solr search engine configuration
bq: I tried the query with and without the &defType=edismax parameter but I'm getting the EXACT same results. Does that mean some configuration error?

Well, not an error at all. This line:

ExtendedDismaxQParser

means you're using edismax. If that happens both with and without &defType, your request handler in solrconfig.xml has this defined as a default. Look for an entry like:

edismax

So any search you send to Solr like http://blah blah/solr/collection/select? will use edismax if no defType overrides it on the URL.

---

Let's talk about what "exact match" means ;)

Exact match "dieren zaak". Does "exact match" here mean it would or would not be an exact match on "dieren zaak somethingelse"?

If you do NOT consider the above an "exact match", the usual trick is to use a copyField directive to a field that uses KeywordTokenizerFactory, (probably) followed by LowerCaseFilterFactory etc. KeywordTokenizerFactory takes the entire input field as a _single_ token, which you can then transform in various ways: folding accents, lowercasing and the like, if desired.

If you DO consider the above an "exact match", take a look at the pf, pf2 and pf3 parameters in edismax. They're all about forming phrases, bigrams and trigrams respectively for this form of "exact match".

Exact match "dierenzaak". This one is tricky. There's little OOB that understands that "dieren zaak" is equivalent to "dierenzaak". I know that in German there's prior art on "decompounding" filters; I don't know about Dutch. Further, given my total lack of understanding of the rules of either language, I don't know if it does "compounding" too, i.e. understanding that "dieren zaak" is equivalent to "dierenzaak". Can't help much there.

For a start I'd get rid of the gramming until I'd explored other alternatives. Gramming is generally a good thing for pre-and-post wildcards, i.e. matching *some*. Since you're concerned with relevance, I suspect that gramming will make your task harder.
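
For illustration, here is a minimal sketch of how the pf, pf2, pf3 and ps parameters mentioned above might be passed on a request. The host, core and field name are taken from the query URL earlier in the thread. Apart from the 1000, which comes from the stated requirements, the boost values and the slop of 0 are assumptions chosen only for the example.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Taken from the query URL posted in this thread.
SOLR_SELECT = "http://localhost:8983/solr/search-global/select"

def build_phrase_boost_url(user_query: str) -> str:
    """Build an edismax request that boosts documents containing the query as a phrase."""
    params = {
        "q": user_query,                    # plain user input, no field prefix
        "defType": "edismax",
        "qf": "title_search_global",        # fields searched term by term
        "pf": "title_search_global^1000",   # boost when the whole query matches as a phrase
        "pf2": "title_search_global^500",   # boost per matching word pair (assumed value)
        "pf3": "title_search_global^250",   # boost per matching word triple (assumed value)
        "ps": "0",                          # phrase slop for pf: 0 means adjacent, in order
        "debug": "query",                   # show how Solr parsed the request
    }
    return SOLR_SELECT + "?" + urlencode(params)

url = build_phrase_boost_url("dieren zaak")
decoded = parse_qs(urlsplit(url).query)  # round-trip to check the encoding
print(url)
print(decoded["pf"], decoded["pf2"], decoded["pf3"])
```

Note that the phrase fields only add to the score of documents already matched through q and qf; they do not make additional documents match. Checking the parsed query with debug=query should show whether the phrase clauses were actually added.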
And if you haven't discovered the admin UI/analysis page, I recommend you spend some time with it (hint: un-check the "verbose" checkbox). As you play with various combinations of tokenizers and filters, it'll give you a much better understanding of what their effects are.

If only human language followed strict rules ;)

Professor: "In English, two negatives are allowed and mean a positive, but two positives don't mean a negative."
Bored voice from the back: "Yeah, right".

Erick

On Sun, Mar 11, 2018 at 5:19 AM, PeterKerk wrote:
> Thanks! That provides me with some more insight, I altered the search query
> to "dieren zaak" to see how queries consisting of more than 1 word are
> handled.
> I see that words are tokenized into groups of 3, I think because of my
> NGramFilterFactory with minGramSize of 3.
>
> (title_search_global:(dieren zaak) OR description_search_global:(dieren zaak))
>
> (title_search_global:(dieren zaak) OR description_search_global:(dieren zaak))
>
> (+(((title_search_global:die title_search_global:ier
> title_search_global:ere title_search_global:ren title_search_global:dier
> title_search_global:iere title_search_global:eren title_search_global:diere
> title_search_global:ieren title_search_global:dieren)
> (title_search_global:zaa title_search_global:aak title_search_global:zaak))
> (((description_search_global:dier description_search_global:diere
> description_search_global:dieren)/no_coord)
> description_search_global:zaak)))/no_coord
>
> +(((title_search_global:die title_search_global:ier
> title_search_global:ere title_search_global:ren title_search_global:dier
> title_search_global:iere title_search_global:eren title_search_global:diere
> title_search_global:ieren title_search_global:dieren)
> (title_search_global:zaa title_search_global:aak title_search_global:zaak))
> ((description_search_global:dier description_search_global:diere
> description_search_global:dieren) description_search_global:zaak))
>
> ExtendedDismaxQParser
>
> (lang:"nl" OR lang:"all")
>
> lang:nl lang:all
>
> I tried the query with and without the &defType=edismax parameter but I'm
> getting the EXACT same results. Does that mean some configuration error?
>
> I'm not sure how to progress from here. Can you see if your presumption that
> I'm mixing two different parsers is correct? My schema.xml is here:
> http://www.telefonievergelijken.nl/schema.xml
>
> Related: do you know of the existence of any sample schema.xml config that
> would be usable for a search engine? Seems like something so obvious to
> float around out there. I feel that would go a long way.
>
> Not sure if it matters but my requirements are:
>
> Exact match "di
Re: Solr search engine configuration
Thanks! That provides me with some more insight. I altered the search query to "dieren zaak" to see how queries consisting of more than 1 word are handled.
I see that words are tokenized into groups of 3 or more characters, I think because of my NGramFilterFactory with a minGramSize of 3.

(title_search_global:(dieren zaak) OR description_search_global:(dieren zaak))

(title_search_global:(dieren zaak) OR description_search_global:(dieren zaak))

(+(((title_search_global:die title_search_global:ier title_search_global:ere title_search_global:ren title_search_global:dier title_search_global:iere title_search_global:eren title_search_global:diere title_search_global:ieren title_search_global:dieren) (title_search_global:zaa title_search_global:aak title_search_global:zaak)) (((description_search_global:dier description_search_global:diere description_search_global:dieren)/no_coord) description_search_global:zaak)))/no_coord

+(((title_search_global:die title_search_global:ier title_search_global:ere title_search_global:ren title_search_global:dier title_search_global:iere title_search_global:eren title_search_global:diere title_search_global:ieren title_search_global:dieren) (title_search_global:zaa title_search_global:aak title_search_global:zaak)) ((description_search_global:dier description_search_global:diere description_search_global:dieren) description_search_global:zaak))

ExtendedDismaxQParser

(lang:"nl" OR lang:"all")

lang:nl lang:all

I tried the query with and without the &defType=edismax parameter but I'm getting the EXACT same results. Does that mean some configuration error?

I'm not sure how to progress from here. Can you see if your presumption that I'm mixing two different parsers is correct? My schema.xml is here:
http://www.telefonievergelijken.nl/schema.xml

Related: do you know of any sample schema.xml config that would be usable for a search engine? It seems like something so obvious that it should be floating around out there. I feel that would go a long way.

Not sure if it matters, but my requirements are:

Exact match "dieren zaak" boost result with 1000
Exact match "dierenzaak" boost result with 900
Exact match "dieren" or "zaak" boost result with 600
Partial match "huisdierenzaak" or "huisdieren zaak" boost result with 500
Stem match "dier" boost result with 100
Stem partial match "huisdier" boost result with 70
Other partial matches "die" boost result with 10
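
The n-gram expansion visible in the parsed query above can be sketched in a few lines. This is only an approximation of what NGramFilterFactory does, assuming a minGramSize of 3 (as stated) and a maxGramSize of 6 (an assumption, picked because the full term "dieren" shows up in the parsed query). It also shows why words such as "manieren" and "versieren", mentioned in the original question as undesired results, match a search for "dieren".

```python
def ngrams(word: str, min_size: int = 3, max_size: int = 6) -> list[str]:
    """All character n-grams of the word, shortest first, left to right."""
    return [
        word[start:start + size]
        for size in range(min_size, max_size + 1)
        for start in range(len(word) - size + 1)
    ]

def shared_grams(query_term: str, indexed_term: str) -> list[str]:
    """Grams that the query term and an indexed term have in common."""
    return sorted(set(ngrams(query_term)) & set(ngrams(indexed_term)))

print(ngrams("dieren"))
print(ngrams("zaak"))
for candidate in ("manieren", "versieren"):
    # Any shared gram is enough for the document to match the query.
    print(candidate, shared_grams("dieren", candidate))
```

Because every gram becomes an optional clause of the query, a single shared gram such as "ier" is enough to produce a hit, which is consistent with the advice earlier in the thread that gramming makes relevance tuning harder.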
Re: Solr search engine configuration
You're mixing two different parsers, I think. To use edismax, either specify defType=edismax on your query or set it up as the default for, say, the "/select" handler in solrconfig.xml. The "qf" parameter is only relevant if you _are_ using edismax.

If you want to use edismax, your query could look something like:

q=dieren&defType=edismax&qf=title_search_global title_exactmatch^1000 description_search_global description_exactmatch^100

On the other hand, if you don't want to use edismax, your query would have to look something like:

q=title_search_global:dieren title_exactmatch:dieren^1000 description_search_global:dieren description_exactmatch:dieren^100

This is guessing a bit, but if you add &debug=query to your URL, you'll see the parsed results of the query, which can be very useful in figuring out exactly what Solr thinks the query is.

Best,
Erick

On Sat, Mar 10, 2018 at 2:06 PM, PeterKerk wrote:
> Since Google onsite search will be end of life April 1 2018, I'm trying to
> setup my own onsite search engine that indexes my site's content and makes
> it searchable.
>
> My data config successfully loads data from my database (products,
> companies, blogs) into the fields.
>
> I then try to search in both the title and the description fields with
> weights. Now for example when users search on "dieren" (this means "animals"
> in Dutch):
>
> &q=(title_search_global:(dieren) OR description_search_global:(dieren))&qf=title_search_global+title_exactmatch^1000+description_search_global+description_exactmatch^100
>
> I get results with "dieren", "huisdieren", but I also get undesired results
> with "manieren" and "versieren".
>
> What I want is to find text using the following logic (all case
> insensitive):
>
> Exact match "dieren" boost result with 1000
> Partial match "huisdieren" boost result with 500
> Stem match "dier" boost result with 100
> Stem partial match "huisdier" boost result with 70
> Other partial matches "die" boost result with 10
>
> My current schema.xml is here: http://www.telefonievergelijken.nl/schema.xml
> I tried the solr admin tool for tokenization, but I can't figure out how to
> get to the above logic.
> I also Googled for an example Solr schema.xml configuration for building
> your own search engines and I'm really surprised there's nothing out there.
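
The two query styles contrasted in the reply above can be sketched as follows. This is an illustrative helper, not part of Solr: the function names are hypothetical, and the field names and boosts are the ones used in this thread. With edismax the user's term goes into q and the fields go into qf; without it, every field has to be spelled out inside q.

```python
from urllib.parse import urlencode

# Field names and boosts as used in this thread; None means "no explicit boost".
FIELD_BOOSTS = {
    "title_search_global": None,
    "title_exactmatch": 1000,
    "description_search_global": None,
    "description_exactmatch": 100,
}

def edismax_params(term: str) -> dict[str, str]:
    """edismax style: bare term in q, fields and boosts in qf."""
    qf = " ".join(
        f"{field}^{boost}" if boost else field
        for field, boost in FIELD_BOOSTS.items()
    )
    return {"q": term, "defType": "edismax", "qf": qf, "debug": "query"}

def fielded_params(term: str) -> dict[str, str]:
    """Non-edismax style: one field:term clause per field inside q, no qf."""
    clauses = " ".join(
        f"{field}:{term}^{boost}" if boost else f"{field}:{term}"
        for field, boost in FIELD_BOOSTS.items()
    )
    return {"q": clauses, "debug": "query"}

for params in (edismax_params("dieren"), fielded_params("dieren")):
    print(urlencode(params))  # query string to append after ".../select?"
```

Mixing the two styles, i.e. field prefixes inside q together with a qf parameter, is what the original query did; keeping them apart and inspecting the debug=query output makes it easier to see which parser handled the request.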