Thanks, Shawn. Not sure if you saw, but I resent the message without HTML
formatting and it came through fine. I'll put it here again, along with my
preliminary conclusion: I was missing the flatten graph filter in my index
analyzer. Here are the schema details and debug output you requested:
<field name="subject" type="partial_text_general"/>
<fieldType name="partial_text_general" class="solr.TextField"
           positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateWordParts="1" generateNumberParts="0" splitOnCaseChange="1"
            catenateWords="1" catenateNumbers="1" preserveOriginal="1" splitOnNumerics="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2"
            maxGramSize="45"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateWordParts="1" generateNumberParts="0" splitOnCaseChange="1"
            catenateWords="1" catenateNumbers="1" splitOnNumerics="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
  </analyzer>
</fieldType>
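For reference, here is a minimal sketch of how the index analyzer above might look with the flatten filter added. Placing solr.FlattenGraphFilterFactory directly after WordDelimiterGraphFilterFactory is an assumption, based on the usual Solr guidance that a graph filter used at index time should be followed by a flatten graph filter. The remaining filters are copied unchanged from the schema above. This is untested here, and existing documents would need to be reindexed for the change to take effect:

```xml
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterGraphFilterFactory"
          generateWordParts="1" generateNumberParts="0" splitOnCaseChange="1"
          catenateWords="1" catenateNumbers="1" preserveOriginal="1" splitOnNumerics="0"/>
  <!-- Assumed fix: flatten the token graph so the indexer receives a linear token stream -->
  <filter class="solr.FlattenGraphFilterFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.EnglishPossessiveFilterFactory"/>
  <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
  <filter class="solr.EnglishMinimalStemFilterFactory"/>
  <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="45"/>
</analyzer>
```

The query analyzer is left as it is in this sketch, since the flatten filter is only needed on the index side.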
Original query in quotes (no matches):
<str name="rawquerystring">subject:"cobrancas e\-mail marketing"</str>
<str name="querystring">subject:"cobrancas e\-mail marketing"</str>
<str name="parsedquery">SpanNearQuery(spanNear([subject:cobranca,
spanOr([subject:email, spanNear([subject:e, subject:mail], 0, true)]),
subject:marketing], 0, true))</str>
<str name="parsedquery_toString">spanNear([subject:cobranca,
spanOr([subject:email, spanNear([subject:e, subject:mail], 0, true)]),
subject:marketing], 0, true)</str>
<str name="QParser">LuceneQParser</str>
Original query without 'marketing', in quotes (matches):
<str name="rawquerystring">subject:"cobrancas e\-mail"</str>
<str name="querystring">subject:"cobrancas e\-mail"</str>
<str name="parsedquery">SpanNearQuery(spanNear([subject:cobranca,
spanOr([subject:email, spanNear([subject:e, subject:mail], 0, true)])], 0,
true))</str>
<str name="parsedquery_toString">spanNear([subject:cobranca,
spanOr([subject:email, spanNear([subject:e, subject:mail], 0, true)])], 0,
true)</str>
<str name="QParser">LuceneQParser</str>
<lst name="explain">
<lst name="240/ec3d223a54dcca5394c70000a63ae627/flavio">
<bool name="match">true</bool>
<float name="value">27.416113</float>
<str name="description">weight(spanNear([subject:cobranca,
spanOr([subject:email, spanNear([subject:e, subject:mail], 0, true)])], 0,
true) in 748821) [SchemaSimilarity], result of:</str>
<arr name="details">
<lst>
<bool name="match">true</bool>
<float name="value">27.416113</float>
<str name="description">score(freq=1.0), computed as boost * idf * tf
from:</str>
<arr name="details">
<lst>
<bool name="match">true</bool>
<float name="value">2.2</float>
<str name="description">boost</str>
</lst>
<lst>
<bool name="match">true</bool>
<float name="value">19.544073</float>
<str name="description">idf, sum of:</str>
<arr name="details">
<lst>
<bool name="match">true</bool>
<float name="value">9.65364</float>
<str name="description">idf, computed as log(1 + (N - n +
0.5) / (n + 0.5)) from:</str>
<arr name="details">
<lst>
<bool name="match">true</bool>
<long name="value">1906</long>
<str name="description">n, number of documents containing
term</str>
</lst>
<lst>
<bool name="match">true</bool>
<long name="value">29700198</long>
<str name="description">N, total number of documents with
field</str>
</lst>
</arr>
</lst>
<lst>
<bool name="match">true</bool>
<float name="value">4.8891644</float>
<str name="description">idf, computed as log(1 + (N - n +
0.5) / (n + 0.5)) from:</str>
<arr name="details">
<lst>
<bool name="match">true</bool>
<long name="value">223574</long>
<str name="description">n, number of documents containing
term</str>
</lst>
<lst>
<bool name="match">true</bool>
<long name="value">29700198</long>
<str name="description">N, total number of documents with
field</str>
</lst>
</arr>
</lst>
<lst>
<bool name="match">true</bool>
<float name="value">5.0012693</float>
<str name="description">idf, computed as log(1 + (N - n +
0.5) / (n + 0.5)) from:</str>
<arr name="details">
<lst>
<bool name="match">true</bool>
<long name="value">199864</long>
<str name="description">n, number of documents containing
term</str>
</lst>
<lst>
<bool name="match">true</bool>
<long name="value">29700198</long>
<str name="description">N, total number of documents with
field</str>
</lst>
</arr>
</lst>
</arr>
</lst>
<lst>
<bool name="match">true</bool>
<float name="value">0.63762903</float>
<str name="description">tf, computed as freq / (freq + k1 * (1 -
b + b * dl / avgdl)) from:</str>
<arr name="details">
<lst>
<bool name="match">true</bool>
<float name="value">1.0</float>
<str name="description">phraseFreq=1.0</str>
</lst>
<lst>
<bool name="match">true</bool>
<float name="value">1.2</float>
<str name="description">k1, term saturation parameter</str>
</lst>
<lst>
<bool name="match">true</bool>
<float name="value">0.75</float>
<str name="description">b, length normalization
parameter</str>
</lst>
<lst>
<bool name="match">true</bool>
<float name="value">12.0</float>
<str name="description">dl, length of field</str>
</lst>
<lst>
<bool name="match">true</bool>
<float name="value">40.25195</float>
<str name="description">avgdl, average length of field</str>
</lst>
</arr>
</lst>
</arr>
</lst>
</arr>
</lst>
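As a sanity check on the explain output above, the reported values can be plugged back into the formulas it prints. This is only a worked check using the numbers shown; the rounding to a few decimals is approximate:

```latex
\begin{align*}
\mathrm{idf} &= 9.65364 + 4.8891644 + 5.0012693 \approx 19.54407 \\
\mathrm{tf} &= \frac{\mathit{freq}}{\mathit{freq} + k_1\,(1 - b + b \cdot dl / avgdl)}
  = \frac{1}{1 + 1.2\,(1 - 0.75 + 0.75 \cdot 12 / 40.25195)} \approx 0.63763 \\
\mathrm{score} &= \mathrm{boost} \cdot \mathrm{idf} \cdot \mathrm{tf}
  = 2.2 \cdot 19.54407 \cdot 0.63763 \approx 27.416
\end{align*}
```

The result agrees with the reported score of 27.416113, so the explain tree for the matching phrase query is internally consistent.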
Original query in parentheses instead of quotes: it matches, but it also
matches unwanted documents such as 'marketing plans', etc.
<str name="rawquerystring">subject:(cobrancas e\-mail marketing)</str>
<str name="querystring">subject:(cobrancas e\-mail marketing)</str>
<str name="parsedquery">subject:cobranca (subject:email (+subject:e
+subject:mail)) subject:marketing</str>
<str name="parsedquery_toString">subject:cobranca (subject:email (+subject:e
+subject:mail)) subject:marketing</str>
<str name="QParser">LuceneQParser</str>
<lst name="explain">
<lst name="240/ec3d223a54dcca5394c70000a63ae627/flavio">
<bool name="match">true</bool>
<float name="value">29.841742</float>
<str name="description">sum of:</str>
<arr name="details">
<lst>
<bool name="match">true</bool>
<float name="value">13.541982</float>
<str name="description">weight(subject:cobranca in 748821)
[SchemaSimilarity], result of:</str>
<arr name="details">
<lst>
<bool name="match">true</bool>
<float name="value">13.541982</float>
<str name="description">score(freq=1.0), computed as boost * idf
* tf from:</str>
<arr name="details">
<lst>
<bool name="match">true</bool>
<float name="value">2.2</float>
<str name="description">boost</str>
</lst>
<lst>
<bool name="match">true</bool>
<float name="value">9.653647</float>
<str name="description">idf, computed as log(1 + (N - n +
0.5) / (n + 0.5)) from:</str>
<arr name="details">
<lst>
<bool name="match">true</bool>
<long name="value">1906</long>
<str name="description">n, number of documents containing
term</str>
</lst>
<lst>
<bool name="match">true</bool>
<long name="value">29700419</long>
<str name="description">N, total number of documents with
field</str>
</lst>
</arr>
</lst>
<lst>
<bool name="match">true</bool>
<float name="value">0.63762903</float>
<str name="description">tf, computed as freq / (freq + k1 *
(1 - b + b * dl / avgdl)) from:</str>
<arr name="details">
<lst>
<bool name="match">true</bool>
<float name="value">1.0</float>
<str name="description">freq, occurrences of term within
document</str>
</lst>
<lst>
<bool name="match">true</bool>
<float name="value">1.2</float>
<str name="description">k1, term saturation
parameter</str>
</lst>
<lst>
<bool name="match">true</bool>
<float name="value">0.75</float>
<str name="description">b, length normalization
parameter</str>
</lst>
<lst>
<bool name="match">true</bool>
<float name="value">12.0</float>
<str name="description">dl, length of field</str>
</lst>
<lst>
<bool name="match">true</bool>
<float name="value">40.25195</float>
<str name="description">avgdl, average length of
field</str>
</lst>
</arr>
</lst>
</arr>
</lst>
</arr>
</lst>
Original query without 'marketing', in parentheses (matches):
<str name="rawquerystring">subject:(cobrancas e\-mail)</str>
<str name="querystring">subject:(cobrancas e\-mail)</str>
<str name="parsedquery">subject:cobranca (subject:email (+subject:e
+subject:mail))</str>
<str name="parsedquery_toString">subject:cobranca (subject:email (+subject:e
+subject:mail))</str>
<str name="QParser">LuceneQParser</str>
<lst name="explain">
<lst name="6784/b8e1ce324e2f0e5ed9830100a9bf5bd4/flavio">
<bool name="match">true</bool>
<float name="value">21.906368</float>
<str name="description">sum of:</str>
<arr name="details">
<lst>
<bool name="match">true</bool>
<float name="value">14.547027</float>
<str name="description">weight(subject:cobranca in 765897)
[SchemaSimilarity], result of:</str>
<arr name="details">
<lst>
<bool name="match">true</bool>
<float name="value">14.547027</float>
<str name="description">score(freq=1.0), computed as boost * idf
* tf from:</str>
<arr name="details">
<lst>
<bool name="match">true</bool>
<float name="value">2.2</float>
<str name="description">boost</str>
</lst>
<lst>
<bool name="match">true</bool>
<float name="value">9.631984</float>
<str name="description">idf, computed as log(1 + (N - n +
0.5) / (n + 0.5)) from:</str>
<arr name="details">
<lst>
<bool name="match">true</bool>
<long name="value">1983</long>
<str name="description">n, number of documents containing
term</str>
</lst>
<lst>
<bool name="match">true</bool>
<long name="value">30237753</long>
<str name="description">N, total number of documents with
field</str>
</lst>
</arr>
</lst>
<lst>
<bool name="match">true</bool>
<float name="value">0.6864925</float>
<str name="description">tf, computed as freq / (freq + k1 *
(1 - b + b * dl / avgdl)) from:</str>
<arr name="details">
<lst>
<bool name="match">true</bool>
<float name="value">1.0</float>
<str name="description">freq, occurrences of term within
document</str>
</lst>
<lst>
<bool name="match">true</bool>
<float name="value">1.2</float>
<str name="description">k1, term saturation
parameter</str>
</lst>
<lst>
<bool name="match">true</bool>
<float name="value">0.75</float>
<str name="description">b, length normalization
parameter</str>
</lst>
<lst>
<bool name="match">true</bool>
<float name="value">7.0</float>
<str name="description">dl, length of field</str>
</lst>
<lst>
<bool name="match">true</bool>
<float name="value">40.209316</float>
<str name="description">avgdl, average length of
field</str>
</lst>
</arr>
</lst>
</arr>
</lst>
</arr>
</lst>
Thank you for your assistance.
-----Original Message-----
From: Shawn Heisey <[email protected]>
Sent: Wednesday, November 17, 2021 6:50 PM
To: [email protected]
Subject: Re: Solr limit in words search
On 11/17/21 9:00 AM, Scott Q. wrote:
> I am facing a weird issue, possibly caused by my config.
>
> I have indexed a document which has a field called subject, subject is
> defined as:
<snip> -- the definition you included is blank in the email that I got. I do
not know why. If it was an email attachment, the mailing list eats almost all
attachments that get sent.
> I have a document with subject field: cobrancas E-mail marketing em
> dezembro, 2020 - referente ao uso de novembro
>
> If I search for subject:"cobrancas e-mail" then it finds the document,
> but if I search for subject:"cobrancas e-mail marketing" I have no
> match.
>
> Why would this happen ?
There could be a lot of reasons. My best guess at the moment is that you have
stemming configured on the analysis chain and the phrase search
(quotes) is making that NOT happen on the query analysis. The analysis tab in
the admin UI unfortunately cannot show you what happens with a phrase query.
Ordinarily I would suggest using that to see what happens, but in this case we
can't do that.
Can you share your schema file? It is usually named managed-schema (with no
extension) or schema.xml, depending on solrconfig.xml.
Also, if you add a "debugQuery=true" parameter to the query request, you can
see how Solr ultimately analyzes and parses the query. I would like to see the
full response with debug enabled, both on the search that succeeds and the one
that fails. And if you can do another search for subject:(cobrancas e-mail
marketing), replacing the quotes with parentheses, I would like to see the
debug output from that as well.
What version of Solr, and was it installed from the binary release download?
Thanks,
Shawn