RE: Solr search engine configuration

2018-03-13 Thread PeterKerk
Thanks, will look into all that :-)



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


RE: Solr search engine configuration

2018-03-13 Thread Markus Jelsma
Hi - In that case you need the KeywordRepeat and RemoveDuplicates filters as
well; I'd suggest reading their Javadocs. With the docs and the analysis GUI,
you can probably figure out their respective places in the filter chain
yourself.
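
A rough sketch of what those two filters do together, in plain Python rather
than Lucene code. The Token type, the toy_stem table and the analyze helper are
hypothetical and only for illustration: KeywordRepeat emits every token twice
at the same position (one copy flagged as a keyword so the stemmer skips it),
and RemoveDuplicates drops the second copy when stemming did not change it.

```python
from typing import Callable, List, NamedTuple, Tuple


class Token(NamedTuple):
    term: str
    position: int
    keyword: bool  # True means "do not stem this copy"


def keyword_repeat(tokens: List[Token]) -> List[Token]:
    """Emit each token twice at the same position: keyword copy, then normal copy."""
    out = []
    for tok in tokens:
        out.append(Token(tok.term, tok.position, True))
        out.append(Token(tok.term, tok.position, False))
    return out


def stem_filter(tokens: List[Token], stem: Callable[[str], str]) -> List[Token]:
    """Stem only the copies that are not flagged as keywords."""
    return [t if t.keyword else Token(stem(t.term), t.position, False) for t in tokens]


def remove_duplicates(tokens: List[Token]) -> List[Token]:
    """Drop a token when the same term was already seen at the same position."""
    seen = set()
    out = []
    for tok in tokens:
        key = (tok.term, tok.position)
        if key not in seen:
            seen.add(key)
            out.append(tok)
    return out


def toy_stem(term: str) -> str:
    # Hypothetical stand-in for a real stemmer such as Kraaij-Pohlmann.
    return {"dieren": "dier", "zaken": "zaak"}.get(term, term)


def analyze(words: List[str]) -> List[Tuple[str, int]]:
    """Run the three steps in order and return (term, position) pairs."""
    tokens = [Token(w, i, False) for i, w in enumerate(words)]
    tokens = remove_duplicates(stem_filter(keyword_repeat(tokens), toy_stem))
    return [(t.term, t.position) for t in tokens]


print(analyze(["dieren", "zaak"]))
```

In this sketch "dieren" ends up indexed both as "dieren" and as "dier" at the
same position, while "zaak" appears only once because its stemmed copy is
identical to the original.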

Relying on IDF is not really a fine-grained boosting mechanism, but it
should work more or less. We use payloads everywhere for fine-grained
scoring, but that involves a lot of code.

Cheers,
Markus

-Original message-
> From:PeterKerk 
> Sent: Tuesday 13th March 2018 21:35
> To: solr-user@lucene.apache.org
> Subject: RE: Solr search engine configuration
> 
> Cool, will do some more digging around in the analysis GUI first.
> 
> One last thing then on this comment of yours:
> "Does the decompounder support emitting the compound word as well? If so,
> enable it. It should help scoring compounds higher via IDF as they are less
> common."
> 
> So I checked the Javadoc:
> https://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/compound/DictionaryCompoundWordTokenFilterFactory.html
> To be sure I also checked the Javadoc for the alternative
> :https://lucene.apache.org/core/6_5_0/analyzers-common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilterFactory.html,
> but nothing there on emitting either.
> 
> Where can I see whether DictionaryCompoundWordTokenFilterFactory supports
> emitting the compound word and how to enable it?
> 
> Thanks again! :-)
> 
> 
> 
> 


RE: Solr search engine configuration

2018-03-13 Thread PeterKerk
Cool, will do some more digging around in the analysis GUI first.

One last thing then on this comment of yours:
"Does the decompounder support emitting the compound word as well? If so,
enable it. It should help scoring compounds higher via IDF as they are less
common."

So I checked the Javadoc:
https://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/compound/DictionaryCompoundWordTokenFilterFactory.html
To be sure I also checked the Javadoc for the alternative
:https://lucene.apache.org/core/6_5_0/analyzers-common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilterFactory.html,
but nothing there on emitting either.

Where can I see whether DictionaryCompoundWordTokenFilterFactory supports
emitting the compound word and how to enable it?

Thanks again! :-)





RE: Solr search engine configuration

2018-03-13 Thread Markus Jelsma
Inline, cheers.

-Original message-
> From:PeterKerk 
> Sent: Tuesday 13th March 2018 18:53
> To: solr-user@lucene.apache.org
> Subject: RE: Solr search engine configuration
> 
> Stick to the Javadoc section, where the examples are good, or the
> reference guide:
> https://lucene.apache.org/core/6_5_0/analyzers-common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilterFactory.html
> https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html#filter-descriptions
> 
> PVK COMMENT 1: 
>   This seems to be for Solr 6.5+? I'm using 4.3.1. An upgrade is not on the
> radar soon. Will using DictionaryCompoundWordTokenFilterFactory as I'm doing
> now severely degrade my result quality as opposed to
> HyphenationCompoundWordTokenFilterFactory?

Just change the version number in the URL; most filters are already quite old:
https://lucene.apache.org/core/4_3_1/analyzers-common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilterFactory.html

Dictionary vs. Hyphenation: using Dictionary won't severely degrade results, and
it can be easier to use if you need to add words. I prefer the Hyphenator
though, but it can bite. Stick to Dictionary, you will be fine. But both (iirc)
suffer from the same problems with overlapping words, or with subwords that do
not entirely make up the full compound (minus genitives or plural forms). This
is a real issue.

> 
> 
> Almost, zaken -> zaak is already KP output, no need to input what the
> stemmer will do for you. 
> 
> PVK COMMENT 2: 
>   How do you know zaken -> zaak is already KP output? Is there a list
> somewhere?

I know because I've seen KP's output a million times by now. You should really
use Solr's analysis GUI; it shows what each filter emits and is really helpful.

>   
> PVK COMMENT 3: 
> I now have:
> 
>positionIncrementGap="100">
>   
>   
> 
>   
>generateWordParts="1"
> generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> catenateAll="0" splitOnCaseChange="1"/>
>  
>   
>dictionary="compounds_nl.txt"
>  minWordSize="5" minSubwordSize="2" maxSubwordSize="15"
> onlyLongestMatch="true"/>
>   
>dictionary="stemdict_nl.txt"/>
>   
>protected="protwords_nl.txt"/>
>   
>   
>   
>   
>   
>   
>   
> 
>   
>generateWordParts="1"
> generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> catenateAll="0" splitOnCaseChange="1"/>
>  
>   
>dictionary="compounds_nl.txt"
>  minWordSize="5" minSubwordSize="2" maxSubwordSize="15"
> onlyLongestMatch="true"/>
>
> dictionary="stemdict_nl.txt"/>
> 
> protected="protwords_nl.txt"/>
>
>   
>   
>  
>   
> 

Please increase minWordSize and minSubwordSize. There are no compounds with
that few characters. minSubwordSize should be at least 4, or you will get a lot
of crazy output due to the problems stated above.
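
To see why a small minSubwordSize hurts, here is a simplified brute-force
sketch in plain Python. It is not Lucene's actual code; the parameter names
merely mirror the filter attributes, and the small dictionary (including the
short entries "er" and "en") is an assumption for illustration.

```python
def decompound(word, dictionary, min_word_size=5, min_subword_size=2,
               max_subword_size=15, only_longest_match=False):
    """Return dictionary entries found as substrings of `word`, scanning left to right."""
    if len(word) < min_word_size:
        return []  # too short to be treated as a compound
    parts = []
    for start in range(len(word) - min_subword_size + 1):
        longest = None
        for size in range(min_subword_size, max_subword_size + 1):
            if start + size > len(word):
                break
            candidate = word[start:start + size]
            if candidate in dictionary:
                if only_longest_match:
                    # keep only the longest hit for this start offset
                    if longest is None or len(candidate) > len(longest):
                        longest = candidate
                else:
                    parts.append(candidate)
        if only_longest_match and longest is not None:
            parts.append(longest)
    return parts


# Hypothetical dictionary; short entries like "er" and "en" are what cause noise.
dictionary = {"dier", "dieren", "zaak", "er", "en"}

noisy = decompound("dierenzaak", dictionary, min_subword_size=2)
cleaner = decompound("dierenzaak", dictionary, min_subword_size=4)
longest_only = decompound("dierenzaak", dictionary, min_subword_size=4,
                          only_longest_match=True)
print(noisy, cleaner, longest_only)
```

With a minimum of 2 the fragments "er" and "en" are emitted as subwords and
will match unrelated documents; with a minimum of 4 only "dier", "dieren" and
"zaak" remain.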

> 
> I tested in admin UI (and yes, I restart Solr and reindex every time I make
> a change):
>   
> http://localhost:8983/solr/tt-search-global/select?q=title_search_global%3A(dieren+zaak)&fl=id%2Ctitle&wt=xml&indent=true
> returns:
> "hi there dieren zaak something else"
> "hi there dier something else"
> 
> http://localhost:8983/solr/tt-search-global/select?q=title_search_global%3A(dierenzaak)&fl=id%2Ctitle&wt=xml&indent=true&defType=edismax&qf=title_search_global&stopwords=true&lowercaseOperators=true
> returns
> "hi there dierenzaak something else"
> 
> So I added "dieren" to compounds_nl.txt
> 
> Now on "title_search_global:(dieren zaak)" it returns:
> 
> hi there dieren zaak something else
> 115_3699638
> 
> 
> hi there dier something else
> 115_3699637
> 
> 
> hi there dierenzaak something else
> 115_3699639
> 
> 
> So it's starting to look good! :-)
> 
> What I want to know, how can I have Solr consider "dierenzaak" to be of
> higher importance than just "dier" in the above results?

Does the decompounder support emitting the compound word as well? If so, enable 
it. It should help scoring compounds higher via IDF as they are less common.
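
A small numeric sketch of that IDF effect, assuming the classic TF-IDF
similarity (1 + ln(numDocs / (docFreq + 1))), which as far as I know is the
default in Lucene 4.x. The document counts are made up for illustration.

```python
import math


def classic_idf(doc_freq: int, num_docs: int) -> float:
    """Classic Lucene-style inverse document frequency."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


num_docs = 1000            # hypothetical index size
df_dier = 200              # hypothetical: "dier" occurs in many documents
df_dierenzaak = 5          # hypothetical: the full compound is rare

idf_dier = classic_idf(df_dier, num_docs)
idf_dierenzaak = classic_idf(df_dierenzaak, num_docs)
print(round(idf_dier, 2), round(idf_dierenzaak, 2))
```

The rarer term gets the larger weight, so a document that matches the indexed
compound itself scores higher than one that only matches the common subword.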

> 
> Also I'm still not 100% sure what my addition of "dieren" to
> compounds_nl.txt actually does...I assume
> DictionaryCompoundWordTokenFilterFactory just looks for that exact string
> and if it finds it, considers that a separate word? Correct?

Just check in analysis GUI, it will answer all these questions.

> 
> Thanks again!
> 
> 
> 
> 


RE: Solr search engine configuration

2018-03-13 Thread PeterKerk
Stick to the Javadoc section, where the examples are good, or the
reference guide:
https://lucene.apache.org/core/6_5_0/analyzers-common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilterFactory.html
https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html#filter-descriptions

PVK COMMENT 1: 
This seems to be for Solr 6.5+? I'm using 4.3.1. An upgrade is not on the
radar soon. Will using DictionaryCompoundWordTokenFilterFactory as I'm doing
now severely degrade my result quality as opposed to
HyphenationCompoundWordTokenFilterFactory?


Almost, zaken -> zaak is already KP output, no need to input what the
stemmer will do for you. 

PVK COMMENT 2: 
How do you know zaken -> zaak is already KP output? Is there a list
somewhere?

PVK COMMENT 3: 
I now have:


  




   



  






  
  




   


 
 

 
 


 
  


I tested in admin UI (and yes, I restart Solr and reindex every time I make
a change):  

http://localhost:8983/solr/tt-search-global/select?q=title_search_global%3A(dieren+zaak)&fl=id%2Ctitle&wt=xml&indent=true
returns:
"hi there dieren zaak something else"
"hi there dier something else"

http://localhost:8983/solr/tt-search-global/select?q=title_search_global%3A(dierenzaak)&fl=id%2Ctitle&wt=xml&indent=true&defType=edismax&qf=title_search_global&stopwords=true&lowercaseOperators=true
returns
"hi there dierenzaak something else"

So I added "dieren" to compounds_nl.txt

Now on "title_search_global:(dieren zaak)" it returns:

hi there dieren zaak something else
115_3699638


hi there dier something else
115_3699637


hi there dierenzaak something else
115_3699639


So it's starting to look good! :-)

What I want to know, how can I have Solr consider "dierenzaak" to be of
higher importance than just "dier" in the above results?

Also I'm still not 100% sure what my addition of "dieren" to
compounds_nl.txt actually does...I assume
DictionaryCompoundWordTokenFilterFactory just looks for that exact string
and if it finds it, considers that a separate word? Correct?

Thanks again!





Re: Solr search engine configuration

2018-03-13 Thread Shawn Heisey

On 3/13/2018 7:24 AM, PeterKerk wrote:

PVK COMMENT:
But without a StopFilter, won't stopwords be included in searches? I thought
that for example Google excluded these words in their algorithms?


I just did a Google search for "to be or not to be".  It worked flawlessly.

If Google were using stopwords, that search would have returned 
nothing.  The four words in that search are among the most frequent 
words found in English prose.  This is a typical stopword list for English:


a
an
and
are
as
at
be
but
by
for
if
in
into
is
it
no
not
of
on
or
such
that
the
their
then
there
these
they
this
to
was
will
with

To explain why the frequent responders on this list recommend not using 
stopwords, and why the biggest search engine on the planet doesn't use 
them, you need a small history lesson -- you have to know why stopword 
filters were invented in the first place.


A search engine works by creating an inverted index. This means for a 
typical full-text index that there is a big list of words, and for each 
of those words, there is a list that identifies the document, field 
name, and text offset of where that word is found.  Without a stopword 
filter, the biggest entry in an index for English is probably "the" ... 
in a corpus of a few million documents, "the" might appear *billions* of 
times.  So the list is BIG.  And when the search has to deal with a big 
entry in the inverted index, it's slower than normal.
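
To make that concrete, here is a toy inverted index in plain Python. It is only
a sketch of the data structure, not how Lucene stores it, and the three sample
documents are made up.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a list of (doc_id, word_offset) postings."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for offset, term in enumerate(text.lower().split()):
            index[term].append((doc_id, offset))
    return index


docs = {
    1: "the cat sat on the mat",
    2: "the who played the hits",
    3: "to be or not to be",
}

index = build_inverted_index(docs)

# Size of each term's posting list, biggest first.
sizes = sorted(((len(p), term) for term, p in index.items()), reverse=True)
print(sizes[:3])
print(index["the"])
```

Even in three tiny documents "the" has the longest posting list; scale that up
to millions of documents and you get the giant entries that stopword filters
were invented to avoid.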


Back in the annals of history (80s, 90s, etc) servers didn't have nearly 
as much memory and CPU resources as they do now.  Eliminating these 
giant entries in the index made a HUGE difference in search 
performance.  A search that might take several seconds with the 
stopwords included could be sped up to less than one second without them.


Even back then, the people who built stopword filters KNEW that they 
were impacting search results.  The reason they implemented them anyway 
was to greatly improve search *performance*.  They knew that a search 
for "to be or not to be" or "the who" or any number of other similar 
searches wouldn't work properly.  But the vast majority of searches were 
not really affected by the stopword removal, and users got their results 
really fast.


Today, with modern hardware, search engines are much less bothered by 
having enormous entries in the inverted index.  When stopwords are NOT 
removed, you can get more accurate search results.  Yes, the index is 
substantially bigger.  But modern hardware is easy to load up with a lot 
of disk space, memory, and CPU capacity, and search with stopwords is 
fast enough.


Thanks,
Shawn



RE: Solr search engine configuration

2018-03-13 Thread Markus Jelsma

 
-Original message-
> From:PeterKerk 
> Sent: Tuesday 13th March 2018 14:24
> To: solr-user@lucene.apache.org
> Subject: RE: Solr search engine configuration
> 
> Markus, 
> 
> Thanks again. Ok, 1 by 1:
> 
> StemmerOverride wants \t separated fields, that is probably the cause of the
> AIooBE you get. Regarding schema definitions, each factory JavaDoc [1] has a
> proper example listed. I recommend putting a decompounder before a stemmer,
> and have an accent (or ICU) folder as one of the last filters. 
> 
> PVK COMMENT:
> Looking for Decompounders and found a few links, btw a lot of the pages
> these are linked to don't work.
> 
> https://earlydance.org/news/9189-apachesolr-issues-german-and-other-germanic-languages
> 
> http://lucene.apache.org/core/2_4_0/api/org/apache/lucene/analysis/compound/DictionaryCompoundWordTokenFilter.html
>   https://wiki.apache.org/solr/LanguageAnalysis#Decompounding
>   
> https://wiki.apache.org/solr/DictionaryCompoundWordTokenFilterFactory

Stick to the Javadoc section, where the examples are good, or the 
reference guide:
https://lucene.apache.org/core/6_5_0/analyzers-common/org/apache/lucene/analysis/compound/HyphenationCompoundWordTokenFilterFactory.html
https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html#filter-descriptions

>   
> my stemdict_nl.txt now contains (words separated by a single tab):
> aachen	aach
> aachener	aachener
> aalmoezen	aalmoes
> beveel	bevool
> dierenzaken	dierenzaak
> 
> The problem before was indeed like @Shawn indicates that I had words in
> there with a space like so:
> dieren zaken  dierenzaak
> 
> 
>   
> About the diff, it looks like KP output, it has the same issues with whether
> or not a word needs double or single vowels in the root. It also shows
> issues with strong verbs/nouns (beveel/bevool). Having this list seems like
> having KP configured so you should drop it, and only list exceptions to KP
> rules in the dict file. This is not easy, so I recommend sticking to your
> domain's vocabulary. 
> 
> PVK COMMENT:
> That's what I now did above right?

Almost, zaken -> zaak is already KP output, no need to input what the stemmer 
will do for you.

> 
> 
> Also, unless you have a very specific need for it, drop the StopFilter.
> Nobody these days should want a StopFilter unless they can justify it. We
> use them too, but only for very specific reasons, and never for text search.
> You might also want to have a WordDelimiterFilter as your first filter, look
> it up, you probably want to have it. 
> 
> PVK COMMENT:
> But without a StopFilter, won't stopwords be included in searches? I thought
> that for example Google excluded these words in their algorithms?
> 

Yes, stopwords are good! Keep them in the index! And I am glad Google doesn't
just strip stopwords.

> 
> 
> This is what I have now:
> 
>positionIncrementGap="100">
>   
>   
> 
>   
>generateWordParts="1"
> generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> catenateAll="0" splitOnCaseChange="1"/>
>  
>   
>dictionary="compounds_nl.txt"
>  minWordSize="5" minSubwordSize="2" maxSubwordSize="15"
> onlyLongestMatch="true"/>
>   
>dictionary="stemdict_nl.txt"/>
>   
>   
>   
>   
>   
>   
>   
> 
>   
>generateWordParts="1"
> generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> catenateAll="0" splitOnCaseChange="1"/>
>  
>   
>dictionary="compounds_nl.txt"
>  minWordSize="5" minSubwordSize="2" maxSubwordSize="15"
> onlyLongestMatch="true"/>
>
> dictionary="stemdict_nl.txt"/>
> 
>   
>   
>  
>   
> 

That looks fine, but you have now omitted the stemmer (Snowball). Put it after 
StemmerOverrideFilter, and before ASCIIFolding.

> 
>   
> Now for both this query
> http://localhost:8983/solr/tt-search-global/select?q=title_search_global%3A(dieren+zaak)&fl=id%2Ctitle&wt=xml&indent=true&defType=edismax&stopwords=true&lowercaseOp

RE: Solr search engine configuration

2018-03-13 Thread PeterKerk
Markus, 

Thanks again. Ok, 1 by 1:

StemmerOverride wants \t separated fields, that is probably the cause of the
AIooBE you get. Regarding schema definitions, each factory JavaDoc [1] has a
proper example listed. I recommend putting a decompounder before a stemmer,
and have an accent (or ICU) folder as one of the last filters. 

PVK COMMENT:
Looking for Decompounders and found a few links, btw a lot of the pages
these are linked to don't work.

https://earlydance.org/news/9189-apachesolr-issues-german-and-other-germanic-languages

http://lucene.apache.org/core/2_4_0/api/org/apache/lucene/analysis/compound/DictionaryCompoundWordTokenFilter.html
https://wiki.apache.org/solr/LanguageAnalysis#Decompounding

https://wiki.apache.org/solr/DictionaryCompoundWordTokenFilterFactory

my stemdict_nl.txt now contains (words separated by a single tab):
aachen  aach
aachener	aachener
aalmoezen   aalmoes
beveel  bevool
dierenzaken dierenzaak

The problem before was indeed like @Shawn indicates that I had words in
there with a space like so:
dieren zaken	dierenzaak



About the diff, it looks like KP output, it has the same issues with whether
or not a word needs double or single vowels in the root. It also shows
issues with strong verbs/nouns (beveel/bevool). Having this list seems like
having KP configured so you should drop it, and only list exceptions to KP
rules in the dict file. This is not easy, so I recommend sticking to your
domain's vocabulary. 

PVK COMMENT:
That's what I now did above right?


Also, unless you have a very specific need for it, drop the StopFilter.
Nobody these days should want a StopFilter unless they can justify it. We
use them too, but only for very specific reasons, and never for text search.
You might also want to have a WordDelimiterFilter as your first filter, look
it up, you probably want to have it. 

PVK COMMENT:
But without a StopFilter, won't stopwords be included in searches? I thought
that for example Google excluded these words in their algorithms?




This is what I have now:


  




   



  




  
  




   


 
 



 
  



Now for both this query
http://localhost:8983/solr/tt-search-global/select?q=title_search_global%3A(dieren+zaak)&fl=id%2Ctitle&wt=xml&indent=true&defType=edismax&stopwords=true&lowercaseOperators=true

and this one:
http://localhost:8983/solr/tt-search-global/select?q=title_search_global%3A(dieren+zaak)&fl=id%2Ctitle&wt=xml&indent=true&defType=edismax&qf=title_search_global&stopwords=true&lowercaseOperators=true

This result is found: 
"Hi there dieren zaak something else" 

And these are NOT: 
"Hi there dier something else" 
"Hi there dierenzaak something else" 
"Hi there dierzaak something else"  

What else do you recommend I try?






Re: Solr search engine configuration

2018-03-12 Thread Shawn Heisey
On 3/12/2018 4:15 PM, PeterKerk wrote:
> I trimmed stemdict_nl.txt for testing to just this:
>
> aachen    aach
> aachener  aachener

According to the example here:

https://github.com/apache/lucene-solr/blob/master/solr/core/src/test-files/solr/collection1/conf/stemdict.txt

The lines need to be tab separated.

I'm betting that you're running into this bug, which is still unresolved:

https://issues.apache.org/jira/browse/LUCENE-4545

The source file you have referenced uses spaces.  If those are still in
your file, it isn't going to work.  It appears that the way the code is
written (and is STILL written even in master, which will one day be
version 8.0), the separator must be a SINGLE tab.  I have confirmed that
multiple tabs or any number of spaces isn't going to work properly.

I will see what I can do about getting the bug fixed, but for now you're
going to have to fix all the separators in your dictionary file.
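
If it helps, a quick way to find the offending lines before Solr chokes on
them. This is just a sketch in plain Python; find_bad_lines is a hypothetical
helper, and the rule it checks (exactly one tab with a non-empty word on each
side, blank lines ignored) is my reading of the behaviour described above.

```python
def find_bad_lines(lines):
    """Return (line_number, line) pairs that are not exactly 'word<TAB>stem'."""
    bad = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # assumption: blank lines are harmless
        fields = line.split("\t")
        if len(fields) != 2 or not fields[0] or not fields[1]:
            bad.append((number, line))
    return bad


sample = [
    "aachen\taach\n",          # OK: single tab
    "aachener  aachener\n",    # spaces instead of a tab
    "molen\t\tmolen\n",        # two tabs
    "\n",
]
print(find_bad_lines(sample))
```

To check a real file, pass it the open file object, for example
find_bad_lines(open("stemdict_nl.txt", encoding="utf-8")).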

Thanks,
Shawn



RE: Solr search engine configuration

2018-03-12 Thread Markus Jelsma
Hello Peter,

StemmerOverride wants \t separated fields, that is probably the cause of the 
AIooBE you get. Regarding schema definitions, each factory JavaDoc [1] has a 
proper example listed. I recommend putting a decompounder before a stemmer, and 
have an accent (or ICU) folder as one of the last filters.

About the diff, it looks like KP output, it has the same issues with whether or 
not a word needs double or single vowels in the root. It also shows issues with 
strong verbs/nouns (beveel/bevool). Having this list seems like having KP 
configured so you should drop it, and only list exceptions to KP rules in the 
dict file. This is not easy, so I recommend sticking to your domain's 
vocabulary.

Also, unless you have a very specific need for it, drop the StopFilter. Nobody 
these days should want a StopFilter unless they can justify it. We use them 
too, but only for very specific reasons, and never for text search. You might 
also want to have a WordDelimiterFilter as your first filter, look it up, you 
probably want to have it.

Markus

[1] 
https://lucene.apache.org/core/7_1_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilterFactory.html

 
 
-Original message-
> From:PeterKerk 
> Sent: Monday 12th March 2018 23:16
> To: solr-user@lucene.apache.org
> Subject: RE: Solr search engine configuration
> 
> @Erick: thank you for clarifying!
> 
> @Markus:
> I feel like I'm not (or at least should not be :-)) the first person to run
> into these challenges.
> 
> "You can solve this by adding manual rules to StemmerOverrideFilter, but due
> to the compound nature of words, you would need to add it for all the mills"
> 
> After Googling I found this:
> https://stackoverflow.com/questions/22451774/word-does-not-get-analysed-properly-using-stemmeroverridefilterfactory-and-snowb
> and added http://snowball.tartarus.org/algorithms/kraaij_pohlmann/diffs.txt
> as stemdict_nl.txt
> 
> My new fieldType definition now is:
> 
>  positionIncrementGap="100">
>   
>    
>  words="stopwords_nl.txt"/> 
>  
>  dictionary="stemdict_nl.txt"/> 
>  protected="protwords_nl.txt"> 
>   
>   
>    
>  words="stopwords_nl.txt"/> 
>  
>  dictionary="stemdict_nl.txt"/>
>  protected="protwords_nl.txt"> 
>   
>  
> 
> I trimmed stemdict_nl.txt for testing to just this:
> 
> aachen    aach
> aachener  aachener
> 
> But on full-import it throws a http 500 error:
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 1 at
> org.apache.lucene.analysis.miscellaneous.StemmerOverrideFilterFactory.inform(StemmerOverrideFilterFactory.java:66)
> 
> Is my stemdict_nl.txt format incorrect?
> 
> And do you have examples of the HyphenationCompoundWordTokenFilter or
> AccentFoldingFilter I can't find any.
> 
> I use Solr 4.3.1 btw, not sure if that matters.
> 
> 
> 
> 
> 


RE: Solr search engine configuration

2018-03-12 Thread PeterKerk
@Erick: thank you for clarifying!

@Markus:
I feel like I'm not (or at least should not be :-)) the first person to run
into these challenges.

"You can solve this by adding manual rules to StemmerOverrideFilter, but due
to the compound nature of words, you would need to add it for all the mills"

After Googling I found this:
https://stackoverflow.com/questions/22451774/word-does-not-get-analysed-properly-using-stemmeroverridefilterfactory-and-snowb
and added http://snowball.tartarus.org/algorithms/kraaij_pohlmann/diffs.txt
as stemdict_nl.txt

My new fieldType definition now is:


  
   
   
   
  
  
  
  
   
   
 

  
  


I trimmed stemdict_nl.txt for testing to just this:

aachen    aach
aachener  aachener

But on full-import it throws a http 500 error:
Caused by: java.lang.ArrayIndexOutOfBoundsException: 1  at
org.apache.lucene.analysis.miscellaneous.StemmerOverrideFilterFactory.inform(StemmerOverrideFilterFactory.java:66)

Is my stemdict_nl.txt format incorrect?

And do you have examples of the HyphenationCompoundWordTokenFilter or
AccentFoldingFilter I can't find any.

I use Solr 4.3.1 btw, not sure if that matters.






Re: Solr search engine configuration

2018-03-12 Thread Erick Erickson
Peter:

bq: I don't have a requestHandler named "/select".

Right, that was just an example of a request handler, your
"/scoresearch" handler _does_ have edismax as your default "defType"
so assuming you're using that one it makes no difference at all
whether you specify &defType=edismax on the URL or not. You'd see a
difference if you specified "&defType=any_parser_other_than_dismax"
though ;)

As for the rest, I'll leave you in the much more capable hands of
Markus since he has, you know, real knowledge in this area rather than
my generalities

Best,
Erick

On Mon, Mar 12, 2018 at 1:33 AM, Markus Jelsma
 wrote:
> Hi,
>
> Glad to hear you removed the gramming, but Kraaij-Pohlmann isn't going to 
> solve all problems either, for example molens => molen, but molen => mool, 
> and many more like that. You can solve this by adding manual rules to 
> StemmerOverrideFilter, but due to the compound nature of words, you would 
> need to add it for all the mills.
>
> Regarding the compounds, Dutch is (more or less) just another Germanic 
> language and uses compounds just like German, Swedish etc. To deal with that 
> you can try the vanilla HyphenationCompoundWordTokenFilter (or something like 
> that). Be sure not to set minWordSize too low, or you'll get plenty of bad 
> results. The major drawback of this token filter is that it emits overlapping 
> terms, and may not always work with compounds of which the head is a plural, 
> just like dierenzaak, or scholierenkorting.
>
> Also add an AccentFoldingFilter, or ICUNormalizer to get rid of accents, or 
> you may have trouble finding a café.
>
> Regards,
> Markus
>
> -Original message-
>> From:PeterKerk 
>> Sent: Sunday 11th March 2018 23:55
>> To: solr-user@lucene.apache.org
>> Subject: Re: Solr search engine configuration
>>
>> Sorry for this lengthy post, but I wanted to be complete.
>>
>> The only occurrence of edismax in solrconfig.xml is this one:
>>
>>   > default="true">
>>
>>   
>> edismax
>> explicit
>> 10
>>
>> double_score
>> false
>> *:*
>>   
>>   
>>
>> I don't have a requestHandler named "/select".
>>
>>
>> Also, removing the gramming definitely helped! :-)
>>
>> I tried to simplify my setup first and then expand, so what I have now is
>> this:
>>
>>
>>   > positionIncrementGap="100">
>>   
>>   
>>   > words="stopwords_nl.txt"/>
>>   
>>   > protected="protwords_nl.txt">
>>
>>
>>   
>>   
>>   
>>   > words="stopwords_nl.txt"/>
>>   
>>   > protected="protwords_nl.txt">
>>
>>
>>   
>> 
>>
>>   > stored="true"/>
>>
>> In my database I have these 4 values for "title" that populate
>> "title_search_global"
>>
>> "Hi there dier something else"
>> "Hi there dieren zaak something else"
>> "Hi there dierenzaak something else"
>> "Hi there dierzaak something else"
>>
>> ps. "dier" is singular of plural "dieren".
>>
>> Using this query:
>> http://localhost:8983/solr/search-global/select?q=title_search_global%3A(dieren+zaak)&fq=(lang%3A%22nl%22+OR+lang%3A%22all%22)&fl=id%2Ctitle&wt=xml&indent=true&defType=edismax&qf=title_search_global&stopwords=true&lowercaseOperators=true&debug=true
>>
>> These results are found:
>> "Hi there dier something else"
>> "Hi there dieren zaak something else"
>>
>> And these are NOT:
>> "Hi there dierenzaak something else"
>> "Hi there dierzaak something else"
>>
>> I'd expect it should be fairly easy (although I don't know how) to also
>> include result "dierenzaak", by compounding the 2 query values. And yes you
>> are correct: in Dutch "dieren zaak" would mean the same as "dierenzaak". Not
>> sure what logic would also include "dierzaak"
>>
>> Regarding your question: yes, I do consider "dieren zaak something else" an
>> exact match o

RE: Solr search engine configuration

2018-03-12 Thread Markus Jelsma
Hi,

Glad to hear you removed the gramming, but Kraaij-Pohlmann isn't going to solve 
all problems either, for example molens => molen, but molen => mool, and many 
more like that. You can solve this by adding manual rules to 
StemmerOverrideFilter, but due to the compound nature of words, you would need 
to add it for all the mills.
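
A tiny sketch of the override idea in plain Python: look the word up in the
override dictionary first and only fall back to the stemmer when there is no
entry. The toy_kp_stem table is a hypothetical stand-in for Kraaij-Pohlmann;
only the molens/molen mappings come from this thread, and the "windmolen" entry
is made up to show why every compound needs its own rule.

```python
def toy_kp_stem(term: str) -> str:
    # Hypothetical stand-in for the real stemmer; "windmolen" is an invented example.
    table = {"molens": "molen", "molen": "mool", "windmolen": "windmool"}
    return table.get(term, term)


def stem_with_overrides(term: str, overrides: dict, stemmer=toy_kp_stem) -> str:
    """Use the manual override when present, otherwise run the stemmer."""
    if term in overrides:
        return overrides[term]
    return stemmer(term)


overrides = {"molen": "molen"}  # one manual rule, as in a stemdict file

for word in ["molens", "molen", "windmolen"]:
    print(word, "->", stem_with_overrides(word, overrides))
```

The single rule fixes "molen", but "windmolen" still goes through the stemmer
and comes out wrong until it gets a rule of its own.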

Regarding the compounds, Dutch is (more or less) just another Germanic language 
and uses compounds just like German, Swedish etc. To deal with that you can try 
the vanilla HyphenationCompoundWordTokenFilter (or something like that). Be 
sure not to set minWordSize too low, or you'll get plenty of bad results. The 
major drawback of this token filter is that it emits overlapping terms, and may 
not always work with compounds of which the head is a plural, just like 
dierenzaak, or scholierenkorting.

Also add an AccentFoldingFilter, or ICUNormalizer to get rid of accents, or you 
may have trouble finding a café.

Regards,
Markus
  
-Original message-
> From:PeterKerk 
> Sent: Sunday 11th March 2018 23:55
> To: solr-user@lucene.apache.org
> Subject: Re: Solr search engine configuration
> 
> Sorry for this lengthy post, but I wanted to be complete.
> 
> The only occurrence of edismax in solrconfig.xml is this one:
> 
>default="true">
>  
>   
> edismax
> explicit
> 10
>
> double_score
> false
> *:*
>   
>   
>   
> I don't have a requestHandler named "/select".
> 
> 
> Also, removing the gramming definitely helped! :-)
> 
> I tried to simplify my setup first and then expand, so what I have now is
> this:
> 
>   
>positionIncrementGap="100">
>   
>  
>words="stopwords_nl.txt"/> 
>
>protected="protwords_nl.txt">
>   
>   
>   
>   
>  
>words="stopwords_nl.txt"/> 
>
>protected="protwords_nl.txt">
>   
>   
>   
> 
> 
>stored="true"/>
>   
> In my database I have these 4 values for "title" that populate
> "title_search_global" 
>   
> "Hi there dier something else"
> "Hi there dieren zaak something else"
> "Hi there dierenzaak something else"
> "Hi there dierzaak something else"
> 
> ps. "dier" is singular of plural "dieren". 
> 
> Using this query:
> http://localhost:8983/solr/search-global/select?q=title_search_global%3A(dieren+zaak)&fq=(lang%3A%22nl%22+OR+lang%3A%22all%22)&fl=id%2Ctitle&wt=xml&indent=true&defType=edismax&qf=title_search_global&stopwords=true&lowercaseOperators=true&debug=true
> 
> These results are found:
> "Hi there dier something else"
> "Hi there dieren zaak something else"
> 
> And these are NOT:
> "Hi there dierenzaak something else"
> "Hi there dierzaak something else"
> 
> I'd expect it should be fairly easy (although I don't know how) to also
> include result "dierenzaak", by compounding the 2 query values. And yes you
> are correct: in Dutch "dieren zaak" would mean the same as "dierenzaak". Not
> sure what logic would also include "dierzaak"
> 
> Regarding your question: yes, I do consider "dieren zaak something else" an
> exact match of "dieren zaak"
> So I also checked the usage of pf parameters with edismax (based on these
> links:
> https://lucene.apache.org/solr/guide/6_6/the-extended-dismax-query-parser.html,
> http://blog.thedigitalgroup.com/vijaym/understanding-phrasequery-and-slop-in-solr/)
> And also for dismax:
> https://lucene.apache.org/solr/guide/6_6/the-dismax-query-parser.html#TheDisMaxQueryParser-Theqs_QueryPhraseSlop_Parameter
> 
> But I can't find any examples how to actually use these parameters? 
> 
> 
> The search results, including debug info, are here:
> 
> 
> 
> 
> 0
> 7
> 
> title_search_global:(dieren zaak)
> edismax
> true
> true
> title_search_global
> id,title
> (lang:"nl" OR lang:"all")
> xml
> true
>   

Re: Solr search engine configuration

2018-03-11 Thread PeterKerk
Sorry for this lengthy post, but I wanted to be complete.

The only occurrence of edismax in solrconfig.xml is this one:


   

  edismax
  explicit
  10
 
  double_score
  false
  *:*



I don't have a requestHandler named "/select".


Also, removing the gramming definitely helped! :-)

I tried to simplify my setup first and then expand, so what I have now is
this:



  
   
   
 



  
  
   
   
 



  




In my database I have these 4 values for "title" that populate
"title_search_global"   

"Hi there dier something else"
"Hi there dieren zaak something else"
"Hi there dierenzaak something else"
"Hi there dierzaak something else"

ps. "dier" is singular of plural "dieren". 

Using this query:
http://localhost:8983/solr/search-global/select?q=title_search_global%3A(dieren+zaak)&fq=(lang%3A%22nl%22+OR+lang%3A%22all%22)&fl=id%2Ctitle&wt=xml&indent=true&defType=edismax&qf=title_search_global&stopwords=true&lowercaseOperators=true&debug=true

These results are found:
"Hi there dier something else"
"Hi there dieren zaak something else"

And these are NOT:
"Hi there dierenzaak something else"
"Hi there dierzaak something else"

I'd expect it should be fairly easy (although I don't know how) to also
include result "dierenzaak", by compounding the 2 query values. And yes you
are correct: in Dutch "dieren zaak" would mean the same as "dierenzaak". Not
sure what logic would also include "dierzaak"

Regarding your question: yes, I do consider "dieren zaak somethingelse" an
exact match of "dieren zaak".
So I also checked the usage of pf parameters with edismax (based on these
links:
https://lucene.apache.org/solr/guide/6_6/the-extended-dismax-query-parser.html,
http://blog.thedigitalgroup.com/vijaym/understanding-phrasequery-and-slop-in-solr/)
And also for dismax:
https://lucene.apache.org/solr/guide/6_6/the-dismax-query-parser.html#TheDisMaxQueryParser-Theqs_QueryPhraseSlop_Parameter

But I can't find any examples of how to actually use these parameters.


The search results, including debug info, are here:




0
7

title_search_global:(dieren zaak)
edismax
true
true
title_search_global
id,title
(lang:"nl" OR lang:"all")
xml
true
true




dieren zaak
115_3699638


dier
115_3699637



title_search_global:(dieren zaak)
title_search_global:(dieren zaak)

(+(title_search_global:dier title_search_global:zaak))/no_coord


+(title_search_global:dier title_search_global:zaak)



5.489122 = (MATCH) sum of:
  2.4387078 = (MATCH) weight(title_search_global:dier in 51) [DefaultSimilarity], result of:
    2.4387078 = score(doc=51,freq=1.0 = termFreq=1.0 ), product of:
      0.66654336 = queryWeight, product of:
        5.8539815 = idf(docFreq=3, maxDocs=513)
        0.113861546 = queryNorm
      3.6587384 = fieldWeight in 51, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        5.8539815 = idf(docFreq=3, maxDocs=513)
        0.625 = fieldNorm(doc=51)
  3.050414 = (MATCH) weight(title_search_global:zaak in 51) [DefaultSimilarity], result of:
    3.050414 = score(doc=51,freq=1.0 = termFreq=1.0 ), product of:
      0.7454662 = queryWeight, product of:
        6.5471287 = idf(docFreq=1, maxDocs=513)
        0.113861546 = queryNorm
      4.091955 = fieldWeight in 51, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        6.5471287 = idf(docFreq=1, maxDocs=513)
        0.625 = fieldNorm(doc=51)
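The numbers in the first explanation can be reproduced by hand. What follows is a minimal sketch, assuming the classic Lucene TF-IDF formulas behind DefaultSimilarity (tf = sqrt(freq), idf = 1 + ln(maxDocs / (docFreq + 1))). The function name term_score is made up for illustration, and Lucene computes in 32-bit floats, so the last digits can differ slightly:

```python
import math

def term_score(freq, doc_freq, max_docs, query_norm, field_norm):
    """Score of one term under classic TF-IDF, mirroring the explain tree."""
    tf = math.sqrt(freq)
    idf = 1.0 + math.log(max_docs / (doc_freq + 1))
    query_weight = idf * query_norm       # "queryWeight" in the explain output
    field_weight = tf * idf * field_norm  # "fieldWeight" in the explain output
    return query_weight * field_weight

QUERY_NORM = 0.113861546  # copied from the debug output above

dier = term_score(1.0, doc_freq=3, max_docs=513,
                  query_norm=QUERY_NORM, field_norm=0.625)
zaak = term_score(1.0, doc_freq=1, max_docs=513,
                  query_norm=QUERY_NORM, field_norm=0.625)
total = dier + zaak  # the top-level "sum of" for doc 51

print(round(dier, 4), round(zaak, 4), round(total, 4))
```

"zaak" contributes more than "dier" only because it occurs in fewer documents (docFreq=1 versus docFreq=3), which gives it the higher idf.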


1.9509662 = (MATCH) product of:
  3.9019325 = (MATCH) sum of:
    3.9019325 = (MATCH) weight(title_search_global:dier in 50) [DefaultSimilarity], result of:
      3.9019325 = score(doc=50,freq=1.0 = termFreq=1.0 ), product of:
        0.66654336 = queryWeight, product of:
          5.8539815 = idf(docFreq=3, maxDocs=513)
          0.113861546 = queryNorm
        5.8539815 = fieldWeight in 50, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          5.8539815 = idf(docFreq=3, maxDocs=513)
          1.0 = fieldNorm(doc=50)
  0.5 = coord(1/2)


0.9754831 = (MATCH) product of:
  1.9509662 = (MATCH) sum of:
    1.9509662 = (MATCH) weight(title_search_global:dier in 132) [DefaultSimilarity], result of:
      1.9509662 = score(doc=132,freq=1.0 = termFreq=1.0 ), product of:
        0.66654336 = queryWeight, product of:
          5.8539815 = idf(docFreq=3, maxDocs=513)
          0.113861546 = queryNorm
        2.9269907 = fieldWeight in 132, product of:
          1.0 = t

Re: Solr search engine configuration

2018-03-11 Thread Erick Erickson
bq: I tried the query with and without the &defType=edismax parameter but I'm
getting the EXACT same results. Does that mean some configuration error?

Well, not an error at all, this line:
 ExtendedDismaxQParser

Means you're using edismax. If that happens both with and without
&defType, that means that your request handler in solrconfig.xml has
this defined as a default. Look for the entry like:

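<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
  </lst>
</requestHandler>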

So any search you send to Solr like
http://blah blah/solr/collection/select?

will use edismax if no defType overrides it on the URL.

---
Let's talk about what "exact match" means ;)


Exact match "dieren zaak". Does "Exact match" here mean it would or
would not be an exact match on "dieren zaak somethingelse"?

If you do NOT consider the above "exact match", the usual trick is to
use a copyField directive to a field that uses KeywordTokenizerFactory
(probably) followed by LowerCaseFilterFactory etc.
KeywordTokenizerFactory takes the entire input field as a _single_
token, then you can transform it various ways, things like folding
accents, lowercasing and the like if desired.

If you DO consider the above "exact match", take a look at the pf, pf2
and pf3 parameters in edismax. They're all about forming phrases,
bigrams and trigrams respectively for this form of "exact match".
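As a rough illustration of how those parameters go on a request, here is a small sketch that only builds the URL. The host and core come from the query earlier in this thread; the boost values and the ps setting are hypothetical placeholders to tune, not recommendations:

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8983/solr/search-global/select"  # from the thread

params = {
    "q": "dieren zaak",
    "defType": "edismax",
    "qf": "title_search_global",
    "pf": "title_search_global^1000",  # whole query as one phrase (boost is hypothetical)
    "pf2": "title_search_global^500",  # each pair of adjacent words (bigrams)
    "pf3": "title_search_global^750",  # each triple of adjacent words (trigrams)
    "ps": "0",                         # phrase slop for pf; 0 = words must be adjacent
    "debug": "query",                  # show how Solr parsed the query
}

url = BASE_URL + "?" + urlencode(params)
print(url)
```

Note that pf, pf2 and pf3 only boost documents that already match via q/qf; they do not add new matches on their own.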

Exact match "dierenzaak". This one is tricky. There's little OOB that
understands that "dieren zaak" is equivalent to "dierenzaak". I know
that in German there's prior art on "decompounding" filters, I don't
know about Dutch. Further, given my total lack of understanding the
rules of either language I don't know if it does "compounding" too,
i.e. understanding that "dieren zaak" is equivalent to "dierenzaak".
Can't help much there.
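To make the decompounding idea concrete, here is a toy sketch in the spirit of a dictionary-based compound word filter. It is not Lucene's implementation; the function name, the tiny word list and the min_subword setting are all assumptions for illustration:

```python
def decompound(token, dictionary, min_subword=2):
    """Return the token itself plus every dictionary word found inside it."""
    parts = [token]
    for start in range(len(token)):
        for end in range(start + min_subword, len(token) + 1):
            piece = token[start:end]
            if piece in dictionary and piece != token:
                parts.append(piece)
    return parts

DUTCH_WORDS = {"dier", "dieren", "zaak"}  # toy dictionary, hypothetical

print(decompound("dierenzaak", DUTCH_WORDS))
print(decompound("dierzaak", DUTCH_WORDS))
```

If something like this ran at index time, a document containing "dierenzaak" would also carry the tokens "dieren" and "zaak", so a query for "dieren zaak" could match it without the query side having to do any compounding.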

For a start I'd get rid of the gramming until I'd explored other
alternatives. Gramming is generally a good thing for pre-and-post
wildcards, i.e. matching *some*. Since you're concerned with
relevance, I suspect that gramming will make your task harder.
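A quick way to see why gramming hurts relevance here is a plain-Python sketch of character n-grams. It assumes minGramSize=3 as in your schema; the upper bound of 15 is a guess on my part, and this is an illustration rather than the actual NGramFilterFactory code:

```python
def ngrams(word, min_size=3, max_size=15):
    """All character n-grams of a word, shortest first."""
    grams = []
    for n in range(min_size, min(max_size, len(word)) + 1):
        for i in range(len(word) - n + 1):
            grams.append(word[i:i + n])
    return grams

query_grams = ngrams("dieren")
manieren_grams = ngrams("manieren")
versieren_grams = ngrams("versieren")

# Grams of the query that also occur in the undesired words.
shared_manieren = [g for g in query_grams if g in manieren_grams]
shared_versieren = [g for g in query_grams if g in versieren_grams]

print(query_grams)
print(shared_manieren)
print(shared_versieren)
```

Every shared gram is a term match as far as Solr is concerned, which would explain why "manieren" and "versieren" show up for a search on "dieren".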

And if you haven't discovered the admin UI/analysis page, I recommend
you spend some time with it (hint, un-check the "verbose" checkbox).
As you play with various combinations of tokenizers and filters it'll
give you a much better understanding of what the effects of various
combinations are.

If only human language followed strict rules ;)

Professor:"In English, two negatives are
allowed and mean a positive, but two positives don't mean a negative."
Bored voice from the back: "Yeah, right".

Erick

On Sun, Mar 11, 2018 at 5:19 AM, PeterKerk  wrote:
> Thanks! That provides me with some more insight, I altered the search query
> to "dieren zaak" to see how queries consisting of more than 1 word are
> handled.
> I see that words are tokenized into groups of 3, I think because of my
> NGramFilterFactory with minGramSize of 3.
>
> 
> 
> (title_search_global:(dieren zaak) OR 
> description_search_global:(dieren
> zaak))
> 
> 
> (title_search_global:(dieren zaak) OR 
> description_search_global:(dieren
> zaak))
> 
> 
> (+(((title_search_global:die title_search_global:ier
> title_search_global:ere title_search_global:ren title_search_global:dier
> title_search_global:iere title_search_global:eren title_search_global:diere
> title_search_global:ieren title_search_global:dieren)
> (title_search_global:zaa title_search_global:aak title_search_global:zaak))
> (((description_search_global:dier description_search_global:diere
> description_search_global:dieren)/no_coord)
> description_search_global:zaak)))/no_coord
> 
> 
> +(((title_search_global:die title_search_global:ier 
> title_search_global:ere
> title_search_global:ren title_search_global:dier title_search_global:iere
> title_search_global:eren title_search_global:diere title_search_global:ieren
> title_search_global:dieren) (title_search_global:zaa title_search_global:aak
> title_search_global:zaak)) ((description_search_global:dier
> description_search_global:diere description_search_global:dieren)
> description_search_global:zaak))
> 
> ExtendedDismaxQParser
> 
> 
> 
> 
> 
> (lang:"nl" OR lang:"all")
> 
> 
> lang:nl lang:all
> 
> 
>
>
> I tried the query with and without the &defType=edismax parameter but I'm
> getting the EXACT same results. Does that mean some configuration error?
>
> I'm not sure how to progress from here. Can you see if your presumption that
> I'm mixing two different parsers is correct? My schema.xml is here:
> http://www.telefonievergelijken.nl/schema.xml
>
>
> Related: do you know of the existence of any sample schema.xml config that
> would be usable for a search engine? Seems like something so obvious to
> float around out there. I feel that would go a long way.
>
>
>
> Not sure if it matters but my requirements are:
>
> Exact match "di

Re: Solr search engine configuration

2018-03-11 Thread PeterKerk
Thanks! That provides me with some more insight, I altered the search query
to "dieren zaak" to see how queries consisting of more than 1 word are
handled.
I see that words are tokenized into groups of 3, I think because of my
NGramFilterFactory with minGramSize of 3.



(title_search_global:(dieren zaak) OR description_search_global:(dieren
zaak))


(title_search_global:(dieren zaak) OR description_search_global:(dieren
zaak))


(+(((title_search_global:die title_search_global:ier
title_search_global:ere title_search_global:ren title_search_global:dier
title_search_global:iere title_search_global:eren title_search_global:diere
title_search_global:ieren title_search_global:dieren)
(title_search_global:zaa title_search_global:aak title_search_global:zaak))
(((description_search_global:dier description_search_global:diere
description_search_global:dieren)/no_coord)
description_search_global:zaak)))/no_coord


+(((title_search_global:die title_search_global:ier 
title_search_global:ere
title_search_global:ren title_search_global:dier title_search_global:iere
title_search_global:eren title_search_global:diere title_search_global:ieren
title_search_global:dieren) (title_search_global:zaa title_search_global:aak
title_search_global:zaak)) ((description_search_global:dier
description_search_global:diere description_search_global:dieren)
description_search_global:zaak))

ExtendedDismaxQParser





(lang:"nl" OR lang:"all")


lang:nl lang:all




I tried the query with and without the &defType=edismax parameter but I'm
getting the EXACT same results. Does that mean some configuration error?

I'm not sure how to progress from here. Can you see if your presumption that
I'm mixing two different parsers is correct? My schema.xml is here:
http://www.telefonievergelijken.nl/schema.xml


Related: do you know of the existence of any sample schema.xml config that
would be usable for a search engine? Seems like something so obvious to
float around out there. I feel that would go a long way.



Not sure if it matters but my requirements are:

Exact match "dieren zaak" boost result with 1000 
Exact match "dierenzaak" boost result with 900 
Exact match "dieren" or "zaak" boost result with 600 

Partial match "huisdierenzaak" or "huisdieren zaak" boost result with 500 
Stem match "dier" boost result with 100 
Stem partial match "huisdier" boost result with 70 
Other partial matches "die" boost result with 10 




--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Solr search engine configuration

2018-03-10 Thread Erick Erickson
You're mixing two different parsers I think.

To use edismax, either specify defType=edismax on your query or set it up
as the default for, say, the "/select" handler in solrconfig.xml. The "qf"
parameter is only relevant if you _are_ using edismax. If you want to use
edismax, your query could look something like
q=dieren&defType=edismax&qf=title_search_global
title_exactmatch^1000 description_search_global
description_exactmatch^100

On the other hand, if you don't want to use edismax, your query would
have to look something like:
q=title_search_global:dieren title_exactmatch:dieren^1000
description_search_global:dieren description_exactmatch:dieren^100
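The two forms can be put side by side with a small sketch that only builds the parameters. The field names and boosts are the ones from the examples above (assuming the search field is title_search_global, as in the original question); the helper names are made up for illustration:

```python
from urllib.parse import urlencode

# Field -> boost (None = no explicit boost), taken from the examples above.
FIELDS = {
    "title_search_global": None,
    "title_exactmatch": 1000,
    "description_search_global": None,
    "description_exactmatch": 100,
}

def boosted(name, boost):
    """Append ^boost to a field or clause when a boost is given."""
    return name if boost is None else f"{name}^{boost}"

def edismax_params(term):
    """Bare term in q; edismax spreads it over the qf fields."""
    qf = " ".join(boosted(f, b) for f, b in FIELDS.items())
    return {"q": term, "defType": "edismax", "qf": qf}

def lucene_params(term):
    """Standard parser: every field and boost is spelled out in q itself."""
    q = " ".join(boosted(f"{f}:{term}", b) for f, b in FIELDS.items())
    return {"q": q}

print(urlencode(edismax_params("dieren")))
print(urlencode(lucene_params("dieren")))
```

Either way, pick one form per request: a q that already names its fields, combined with a qf list, is the mix that causes the confusion.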

This is guessing a bit, but if you add &debug=query to your URL,
you'll see the parsed results of the query, which can be very useful in
figuring out exactly what Solr thinks the query is.

Best,
Erick

On Sat, Mar 10, 2018 at 2:06 PM, PeterKerk  wrote:
> Since Google onsite search will be end of life April 1 2018, I'm trying to
> set up my own onsite search engine that indexes my site's content and makes
> it searchable.
>
> My data config successfully loads data from my database (products,
> companies, blogs) into the fields.
>
> I then try to search in both the title and the description fields with
> weights. Now for example when users search on "dieren" (this means "animals"
> in Dutch):
>
> &q=(title_search_global:(dieren) OR
> description_search_global:(dieren))&qf=title_search_global+title_exactmatch^1000+description_search_global+description_exactmatch^100
>
> I get results with "dieren", "huisdieren", but I also get undesired results
> with "manieren" and "versieren".
>
> What I want is to find text using the following logic (all case
> insensitive):
>
>
> Exact match "dieren" boost result with 1000
> Partial match "huisdieren" boost result with 500
> Stem match "dier" boost result with 100
> Stem partial match "huisdier" boost result with 70
> Other partial matches "die" boost result with 10
>
> My current schema.xml is here: http://www.telefonievergelijken.nl/schema.xml
> I tried the solr admin tool for tokenization, but I can't figure out how to
> get to the above logic.
> I also Googled for an example Solr schema.xml configuration for building
> your own search engines and I'm really surprised there's nothing out there.
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html