RE: Solr limit in words search

Scott Wed, 17 Nov 2021 20:36:27 -0800

Thanks Shawn. Not sure if you saw, but I re-sent the message without HTML 
formatting and it came through fine. I'll include it here again, along with my 
preliminary conclusion: I was missing the Flatten filter in my index-time 
analysis chain. Here are the schema details and the debug output you requested:


<field name="subject" type="partial_text_general"/>

  <fieldType name="partial_text_general" class="solr.TextField"
             positionIncrementGap="100" multiValued="true">
        <analyzer type="index">
                <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                <filter class="solr.WordDelimiterGraphFilterFactory"
                        generateWordParts="1" generateNumberParts="0"
                        splitOnCaseChange="1" catenateWords="1"
                        catenateNumbers="1" preserveOriginal="1"
                        splitOnNumerics="0"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.EnglishPossessiveFilterFactory"/>
                <filter class="solr.KeywordMarkerFilterFactory"
                        protected="protwords.txt"/>
                <filter class="solr.EnglishMinimalStemFilterFactory"/>
                <filter class="solr.EdgeNGramFilterFactory" minGramSize="2"
                        maxGramSize="45"/>
        </analyzer>
        <analyzer type="query">
                <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                <filter class="solr.SynonymFilterFactory"
                        synonyms="synonyms.txt" ignoreCase="true"
                        expand="true"/>
                <filter class="solr.WordDelimiterGraphFilterFactory"
                        generateWordParts="1" generateNumberParts="0"
                        splitOnCaseChange="1" catenateWords="1"
                        catenateNumbers="1" splitOnNumerics="0"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.EnglishPossessiveFilterFactory"/>
                <filter class="solr.KeywordMarkerFilterFactory"
                        protected="protwords.txt"/>
                <filter class="solr.EnglishMinimalStemFilterFactory"/>
        </analyzer>
  </fieldType>
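
For reference, the preliminary fix could look roughly like the sketch below. 
It assumes the "Flatten filter" means solr.FlattenGraphFilterFactory, which the 
Solr Reference Guide says to include in index-time analyzers that use a 
graph-aware filter such as WordDelimiterGraphFilterFactory. Placing it directly 
after the word delimiter filter follows the Reference Guide example and is an 
assumption here. The rest of the index chain is unchanged, and existing 
documents would need to be reindexed. Whether this alone resolves the failing 
phrase query has not been confirmed.

```xml
<!-- Sketch only: index analyzer of partial_text_general with a flatten step. -->
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterGraphFilterFactory"
          generateWordParts="1" generateNumberParts="0"
          splitOnCaseChange="1" catenateWords="1"
          catenateNumbers="1" preserveOriginal="1"
          splitOnNumerics="0"/>
  <!-- Assumed addition: flattens the token graph produced above,
       since the index cannot store a graph. Placement is an assumption. -->
  <filter class="solr.FlattenGraphFilterFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.EnglishPossessiveFilterFactory"/>
  <filter class="solr.KeywordMarkerFilterFactory"
          protected="protwords.txt"/>
  <filter class="solr.EnglishMinimalStemFilterFactory"/>
  <filter class="solr.EdgeNGramFilterFactory" minGramSize="2"
          maxGramSize="45"/>
</analyzer>
```

The query analyzer is left as it is in this sketch, because the flatten step is 
documented for index-time use only.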

Original query between quotes, no matches:

  <str name="rawquerystring">subject:"cobrancas e\-mail marketing"</str>
  <str name="querystring">subject:"cobrancas e\-mail marketing"</str>
  <str name="parsedquery">SpanNearQuery(spanNear([subject:cobranca, 
spanOr([subject:email, spanNear([subject:e, subject:mail], 0, true)]), 
subject:marketing], 0, true))</str>
  <str name="parsedquery_toString">spanNear([subject:cobranca, 
spanOr([subject:email, spanNear([subject:e, subject:mail], 0, true)]), 
subject:marketing], 0, true)</str>
  <str name="QParser">LuceneQParser</str>


Original query without 'marketing' between quotes, matches:

  <str name="rawquerystring">subject:"cobrancas e\-mail"</str>
  <str name="querystring">subject:"cobrancas e\-mail"</str>
  <str name="parsedquery">SpanNearQuery(spanNear([subject:cobranca, 
spanOr([subject:email, spanNear([subject:e, subject:mail], 0, true)])], 0, 
true))</str>
  <str name="parsedquery_toString">spanNear([subject:cobranca, 
spanOr([subject:email, spanNear([subject:e, subject:mail], 0, true)])], 0, 
true)</str>
  <str name="QParser">LuceneQParser</str>
  
  
<lst name="explain">
    <lst name="240/ec3d223a54dcca5394c70000a63ae627/flavio">
      <bool name="match">true</bool>
      <float name="value">27.416113</float>
      <str name="description">weight(spanNear([subject:cobranca, 
spanOr([subject:email, spanNear([subject:e, subject:mail], 0, true)])], 0, 
true) in 748821) [SchemaSimilarity], result of:</str>
      <arr name="details">
        <lst>
          <bool name="match">true</bool>
          <float name="value">27.416113</float>
          <str name="description">score(freq=1.0), computed as boost * idf * tf 
from:</str>
          <arr name="details">
            <lst>
              <bool name="match">true</bool>
              <float name="value">2.2</float>
              <str name="description">boost</str>
            </lst>
            <lst>
              <bool name="match">true</bool>
              <float name="value">19.544073</float>
              <str name="description">idf, sum of:</str>
              <arr name="details">
                <lst>
                  <bool name="match">true</bool>
                  <float name="value">9.65364</float>
                  <str name="description">idf, computed as log(1 + (N - n + 
0.5) / (n + 0.5)) from:</str>
                  <arr name="details">
                    <lst>
                      <bool name="match">true</bool>
                      <long name="value">1906</long>
                      <str name="description">n, number of documents containing 
term</str>
                    </lst>
                    <lst>
                      <bool name="match">true</bool>
                      <long name="value">29700198</long>
                      <str name="description">N, total number of documents with 
field</str>
                    </lst>
                  </arr>
                </lst>
                <lst>
                  <bool name="match">true</bool>
                  <float name="value">4.8891644</float>
                  <str name="description">idf, computed as log(1 + (N - n + 
0.5) / (n + 0.5)) from:</str>
                  <arr name="details">
                    <lst>
                      <bool name="match">true</bool>
                      <long name="value">223574</long>
                      <str name="description">n, number of documents containing 
term</str>
                    </lst>
                    <lst>
                      <bool name="match">true</bool>
                      <long name="value">29700198</long>
                      <str name="description">N, total number of documents with 
field</str>
                    </lst>
                  </arr>
                </lst>
                <lst>
                  <bool name="match">true</bool>
                  <float name="value">5.0012693</float>
                  <str name="description">idf, computed as log(1 + (N - n + 
0.5) / (n + 0.5)) from:</str>
                  <arr name="details">
                    <lst>
                      <bool name="match">true</bool>
                      <long name="value">199864</long>
                      <str name="description">n, number of documents containing 
term</str>
                    </lst>
                    <lst>
                      <bool name="match">true</bool>
                      <long name="value">29700198</long>
                      <str name="description">N, total number of documents with 
field</str>
                    </lst>
                  </arr>
                </lst>
              </arr>
            </lst>
            <lst>
              <bool name="match">true</bool>
              <float name="value">0.63762903</float>
              <str name="description">tf, computed as freq / (freq + k1 * (1 - 
b + b * dl / avgdl)) from:</str>
              <arr name="details">
                <lst>
                  <bool name="match">true</bool>
                  <float name="value">1.0</float>
                  <str name="description">phraseFreq=1.0</str>
                </lst>
                <lst>
                  <bool name="match">true</bool>
                  <float name="value">1.2</float>
                  <str name="description">k1, term saturation parameter</str>
                </lst>
                <lst>
                  <bool name="match">true</bool>
                  <float name="value">0.75</float>
                  <str name="description">b, length normalization 
parameter</str>
                </lst>
                <lst>
                  <bool name="match">true</bool>
                  <float name="value">12.0</float>
                  <str name="description">dl, length of field</str>
                </lst>
                <lst>
                  <bool name="match">true</bool>
                  <float name="value">40.25195</float>
                  <str name="description">avgdl, average length of field</str>
                </lst>
              </arr>
            </lst>
          </arr>
        </lst>
      </arr>
    </lst>
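
As a cross-check of the explain output above, the reported score can be 
reproduced by hand. The formulas are the ones quoted in the description 
strings, with log taken as the natural logarithm, and every number comes from 
the output itself. Only the first of the three idf terms is expanded; the 
other two are used as reported.

```latex
\begin{align*}
\mathrm{idf}_1 &= \ln\!\left(1 + \frac{29700198 - 1906 + 0.5}{1906 + 0.5}\right)
                \approx 9.65364 \\
\mathrm{idf}   &= 9.65364 + 4.8891644 + 5.0012693 \approx 19.544073 \\
\mathrm{tf}    &= \frac{1.0}{1.0 + 1.2\left(1 - 0.75 + 0.75 \cdot \frac{12.0}{40.25195}\right)}
                \approx 0.637629 \\
\mathrm{score} &= 2.2 \times 19.544073 \times 0.637629 \approx 27.4161
\end{align*}
```

This agrees with the reported value of 27.416113 to the precision shown.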


Original query between (), matches, but it also matches unwanted documents 
such as 'marketing plans', etc.:

<str name="rawquerystring">subject:(cobrancas e\-mail marketing)</str>
  <str name="querystring">subject:(cobrancas e\-mail marketing)</str>
  <str name="parsedquery">subject:cobranca (subject:email (+subject:e 
+subject:mail)) subject:marketing</str>
  <str name="parsedquery_toString">subject:cobranca (subject:email (+subject:e 
+subject:mail)) subject:marketing</str>
  <str name="QParser">LuceneQParser</str>
  
   <lst name="explain">
    <lst name="240/ec3d223a54dcca5394c70000a63ae627/flavio">
      <bool name="match">true</bool>
      <float name="value">29.841742</float>
      <str name="description">sum of:</str>
      <arr name="details">
        <lst>
          <bool name="match">true</bool>
          <float name="value">13.541982</float>
          <str name="description">weight(subject:cobranca in 748821) 
[SchemaSimilarity], result of:</str>
          <arr name="details">
            <lst>
              <bool name="match">true</bool>
              <float name="value">13.541982</float>
              <str name="description">score(freq=1.0), computed as boost * idf 
* tf from:</str>
              <arr name="details">
                <lst>
                  <bool name="match">true</bool>
                  <float name="value">2.2</float>
                  <str name="description">boost</str>
                </lst>
                <lst>
                  <bool name="match">true</bool>
                  <float name="value">9.653647</float>
                  <str name="description">idf, computed as log(1 + (N - n + 
0.5) / (n + 0.5)) from:</str>
                  <arr name="details">
                    <lst>
                      <bool name="match">true</bool>
                      <long name="value">1906</long>
                      <str name="description">n, number of documents containing 
term</str>
                    </lst>
                    <lst>
                      <bool name="match">true</bool>
                      <long name="value">29700419</long>
                      <str name="description">N, total number of documents with 
field</str>
                    </lst>
                  </arr>
                </lst>
                <lst>
                  <bool name="match">true</bool>
                  <float name="value">0.63762903</float>
                  <str name="description">tf, computed as freq / (freq + k1 * 
(1 - b + b * dl / avgdl)) from:</str>
                  <arr name="details">
                    <lst>
                      <bool name="match">true</bool>
                      <float name="value">1.0</float>
                      <str name="description">freq, occurrences of term within 
document</str>
                    </lst>
                    <lst>
                      <bool name="match">true</bool>
                      <float name="value">1.2</float>
                      <str name="description">k1, term saturation 
parameter</str>
                    </lst>
                    <lst>
                      <bool name="match">true</bool>
                      <float name="value">0.75</float>
                      <str name="description">b, length normalization 
parameter</str>
                    </lst>
                    <lst>
                      <bool name="match">true</bool>
                      <float name="value">12.0</float>
                      <str name="description">dl, length of field</str>
                    </lst>
                    <lst>
                      <bool name="match">true</bool>
                      <float name="value">40.25195</float>
                      <str name="description">avgdl, average length of 
field</str>
                    </lst>
                  </arr>
                </lst>
              </arr>
            </lst>
          </arr>
        </lst>

Original query without 'marketing' between (), matches:

<str name="rawquerystring">subject:(cobrancas e\-mail)</str>
  <str name="querystring">subject:(cobrancas e\-mail)</str>
  <str name="parsedquery">subject:cobranca (subject:email (+subject:e 
+subject:mail))</str>
  <str name="parsedquery_toString">subject:cobranca (subject:email (+subject:e 
+subject:mail))</str>
  <str name="QParser">LuceneQParser</str>
  
  <lst name="explain">
    <lst name="6784/b8e1ce324e2f0e5ed9830100a9bf5bd4/flavio">
      <bool name="match">true</bool>
      <float name="value">21.906368</float>
      <str name="description">sum of:</str>
      <arr name="details">
        <lst>
          <bool name="match">true</bool>
          <float name="value">14.547027</float>
          <str name="description">weight(subject:cobranca in 765897) 
[SchemaSimilarity], result of:</str>
          <arr name="details">
            <lst>
              <bool name="match">true</bool>
              <float name="value">14.547027</float>
              <str name="description">score(freq=1.0), computed as boost * idf 
* tf from:</str>
              <arr name="details">
                <lst>
                  <bool name="match">true</bool>
                  <float name="value">2.2</float>
                  <str name="description">boost</str>
                </lst>
                <lst>
                  <bool name="match">true</bool>
                  <float name="value">9.631984</float>
                  <str name="description">idf, computed as log(1 + (N - n + 
0.5) / (n + 0.5)) from:</str>
                  <arr name="details">
                    <lst>
                      <bool name="match">true</bool>
                      <long name="value">1983</long>
                      <str name="description">n, number of documents containing 
term</str>
                    </lst>
                    <lst>
                      <bool name="match">true</bool>
                      <long name="value">30237753</long>
                      <str name="description">N, total number of documents with 
field</str>
                    </lst>
                  </arr>
                </lst>
                <lst>
                  <bool name="match">true</bool>
                  <float name="value">0.6864925</float>
                  <str name="description">tf, computed as freq / (freq + k1 * 
(1 - b + b * dl / avgdl)) from:</str>
                  <arr name="details">
                    <lst>
                      <bool name="match">true</bool>
                      <float name="value">1.0</float>
                      <str name="description">freq, occurrences of term within 
document</str>
                    </lst>
                    <lst>
                      <bool name="match">true</bool>
                      <float name="value">1.2</float>
                      <str name="description">k1, term saturation 
parameter</str>
                    </lst>
                    <lst>
                      <bool name="match">true</bool>
                      <float name="value">0.75</float>
                      <str name="description">b, length normalization 
parameter</str>
                    </lst>
                    <lst>
                      <bool name="match">true</bool>
                      <float name="value">7.0</float>
                      <str name="description">dl, length of field</str>
                    </lst>
                    <lst>
                      <bool name="match">true</bool>
                      <float name="value">40.209316</float>
                      <str name="description">avgdl, average length of 
field</str>
                    </lst>
                  </arr>
                </lst>
              </arr>
            </lst>
          </arr>
        </lst>

Thank you for your assistance.

-----Original Message-----
From: Shawn Heisey <[email protected]> 
Sent: Wednesday, November 17, 2021 6:50 PM
To: [email protected]
Subject: Re: Solr limit in words search

On 11/17/21 9:00 AM, Scott Q. wrote:
> I am facing a weird issue, possibly caused by my config.
>
> I have indexed a document which has a field called subject, subject is 
> defined as:

<snip> -- the definition you included is blank in the email that I got. I do 
not know why.  If it was an email attachment, the mailing list eats almost all 
attachments that get sent.

> I have a document with subject field: cobrancas E-mail marketing em 
> dezembro, 2020 - referente ao uso de novembro
>
> If I search for subject:"cobrancas e-mail" then it finds the document, 
> but if I search for subject:"cobrancas e-mail marketing" I have no 
> match.
>
> Why would this happen ?

There could be a lot of reasons.  My best guess at the moment is that you have 
stemming configured on the analysis chain and the phrase search
(quotes) is making that NOT happen on the query analysis.  The analysis tab in 
the admin UI unfortunately cannot show you what happens with a phrase query.  
Ordinarily I would suggest using that to see what happens, but in this case we 
can't do that.

Can you share your schema file? It is usually named managed-schema (with no 
extension) or schema.xml, depending on solrconfig.xml.

Also, if you add a "debugQuery=true" parameter to the query request, you can 
see how Solr ultimately analyzes and parses the query.  I would like to see the 
full response with debug enabled, both on the search that succeeds and the one 
that fails.  And if you can do another search for subject:(cobrancas e-mail 
marketing), replacing the quotes with parentheses, I would like to see the 
debug output from that as well.

What version of Solr, and was it installed from the binary release download?

Thanks,
Shawn
