Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

Erick Erickson Wed, 06 Nov 2019 07:38:54 -0800

I don’t see the attachments, maybe I deleted old e-mails or some such. The 
Apache server is fairly aggressive about stripping attachments though, so it’s 
also possible they didn’t make it through.
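
Since the attachments are not visible, one way to regenerate the parsed-query output is to issue the request directly against Solr with the debug parameter discussed in the quoted thread below. The following is only a sketch: the host, port and collection name (`localhost:8983`, `mycollection`) are placeholders that do not come from this thread, and the helper name `build_debug_url` is made up for illustration.

```python
from urllib.parse import urlencode


def build_debug_url(base_url, collection, query, debug="query"):
    """Build a Solr /select URL that asks for the parsed query only.

    base_url and collection are hypothetical placeholders; adjust them
    to the actual Solr installation.
    """
    params = {
        "q": query,      # the raw user query string
        "debug": debug,  # "query" skips the relevance explanations
        "wt": "json",    # response format
    }
    return f"{base_url}/{collection}/select?{urlencode(params)}"


url = build_debug_url(
    "http://localhost:8983/solr",        # assumed host and port
    "mycollection",                      # assumed collection name
    "lymphoid and a non-lymphoid cell",
)
print(url)
```

Pasting the resulting URL into a browser (or fetching it with any HTTP client) and comparing the parsed query section for the two search terms should show whether the extra 'a' changes the parsed query.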


> On Nov 6, 2019, at 9:28 AM, Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
> 
> Thanks Erick.
> 
>> First, your index and query analysis chains are considerably different; this
>> can easily be a source of problems. In particular, using two different 
>> tokenizers is a huge red flag. I _strongly_ recommend against this unless 
>> you’re totally sure you understand the consequences. Additionally, your use 
>> of the length filter is suspicious, especially since your problem statement 
>> is about the addition of a single letter term and the min length allowed on 
>> that filter is 2. That said, it’s reasonable to suppose that the ’a’ is 
>> filtered out in both cases, but maybe you’ve found something odd about the 
>> interactions.
> I will investigate the min length and post the results later.
> 
>> Second, I have no idea what this will do. Are the equal signs typos? Used by 
>> custom code?
> This is the URL in my application, not Solr params. That's the query string.
> 
>> What does “species=“ do? That’s not Solr syntax, so it’s likely that all the 
>> params with an equal-sign are totally ignored unless it’s just a typo.
> This is part of the application. Species will be used later on in Solr to 
> filter the results. That's not Solr. Those are my app params.
> 
>> Third, the easiest way to see what’s happening under the covers is to add 
>> “&debug=true” to the query and look at the parsed query. Ignore all the 
>> relevance calculations for the nonce, or specify “&debug=query” to skip that 
>> part. 
> The two JSON files I sent were produced with debugQuery=on, and the explain
> tag is present.
> I will try searching the way you mentioned.
> 
> Thanks for your input.
> 
> Guilherme
> 
>> On 6 Nov 2019, at 14:14, Erick Erickson <erickerick...@gmail.com> wrote:
>> 
>> First, your index and query analysis chains are considerably different; this
>> can easily be a source of problems. In particular, using two different 
>> tokenizers is a huge red flag. I _strongly_ recommend against this unless 
>> you’re totally sure you understand the consequences. Additionally, your use 
>> of the length filter is suspicious, especially since your problem statement 
>> is about the addition of a single letter term and the min length allowed on 
>> that filter is 2. That said, it’s reasonable to suppose that the ’a’ is 
>> filtered out in both cases, but maybe you’ve found something odd about the 
>> interactions.
>> 
>> Second, I have no idea what this will do. Are the equal signs typos? Used by 
>> custom code?
>> 
>>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>> 
>> What does “species=“ do? That’s not Solr syntax, so it’s likely that all the 
>> params with an equal-sign are totally ignored unless it’s just a typo.
>> 
>> Third, the easiest way to see what’s happening under the covers is to add 
>> “&debug=true” to the query and look at the parsed query. Ignore all the 
>> relevance calculations for the nonce, or specify “&debug=query” to skip that 
>> part. 
>> 
>> 90%+ of the time, the question “why didn’t this query do what I expect” is
>> answered by looking at the “&debug=query” output and the analysis page in
>> the admin UI. NOTE: for the analysis page be sure to look at _both_ the
>> query and index output. Also, and this is both very important and confusing,
>> the analysis page _assumes_ that what you put in the text boxes has made it
>> through the query parser intact and is analyzed by the field selected.
>> Consider the search "q=field:word1 word2". Now you type “word1 word2” into
>> the analysis text box and it looks like what you expect. That’s misleading
>> because the query is _parsed_ as "field:word1 default_search_field:word2".
>> This is where “&debug=query” helps.
>> 
>> Best,
>> Erick
>> 
>>> On Nov 6, 2019, at 2:36 AM, Paras Lehana <paras.leh...@indiamart.com> wrote:
>>> 
>>> Hi Walter,
>>> 
>>>> The solr.StopFilter removes all tokens that are stopwords. Those words will
>>>> not be in the index, so they can never match a query.
>>> 
>>> 
>>> I think the OP's concern is different results when adding a stopword. I
>>> think he's using the filter factory correctly - the query chain includes
>>> the filter as well so it should remove "a" while querying.
>>> 
>>> *@Guilherme*, please post the results for both queries, the document in the
>>> results that you are concerned about, and the full output of the analysis
>>> screen (for both query and index).
>>> 
>>> On Tue, 5 Nov 2019 at 21:38, Walter Underwood <wun...@wunderwood.org> wrote:
>>> 
>>>> No.
>>>> 
>>>> The solr.StopFilter removes all tokens that are stopwords. Those words
>>>> will not be in the index, so they can never match a query.
>>>> 
>>>> 1. Remove the lines with solr.StopFilter from every analysis chain in
>>>> schema.xml.
>>>> 2. Reload the collection, restart Solr, or whatever to read the new config.
>>>> 3. Reindex all of the documents.
>>>> 
>>>> When indexed with the new analysis chain, the stopwords will not be
>>>> removed and they will be searchable.
>>>> 
>>>> wunder
>>>> Walter Underwood
>>>> wun...@wunderwood.org
>>>> http://observer.wunderwood.org/  (my blog)
>>>> 
>>>>> On Nov 5, 2019, at 8:56 AM, Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
>>>>> 
>>>>> Ok. I am kind of lost now.
>>>>> If I open up the console > analysis and perform it, that's the final result.
>>>>> <Screenshot 2019-11-05 at 14.54.16.png>
>>>>> 
>>>>> Your suggestion is: get rid of the <filter stopword.txt> in the schema.xml
>>>>> and during index phase replaceAll("in stopwords.txt"," ") then add to solr.
>>>>> Is that correct?
>>>>> 
>>>>> Thanks David
>>>>> 
>>>>>> On 5 Nov 2019, at 14:48, David Hastings <hastings.recurs...@gmail.com> wrote:
>>>>>> 
>>>>>> no,
>>>>>>            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>>>>> 
>>>>>> is still using stopwords and should be removed, in my opinion of course;
>>>>>> your use case may be different, but I generally axe any reference
>>>>>> to them at all
>>>>>> 
>>>>>> On Tue, Nov 5, 2019 at 9:47 AM Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
>>>>>> 
>>>>>>> Thanks.
>>>>>>> Haven't I done this here ?
>>>>>>> <fieldType name="text_field" class="solr.TextField" positionIncrementGap="100" omitNorms="false">
>>>>>>>        <analyzer type="index">
>>>>>>>            <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>>>>            <filter class="solr.ClassicFilterFactory"/>
>>>>>>>            <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>>>>>            <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>>>>>>        </analyzer>
>>>>>>> 
>>>>>>> 
>>>>>>>> On 5 Nov 2019, at 14:15, David Hastings <hastings.recurs...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>> The first thing you should do is remove any reference to stop words and
>>>>>>>> never use them, then re-index your data and try it again.
>>>>>>>> 
>>>>>>>> On Tue, Nov 5, 2019 at 9:14 AM Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
>>>>>>>> 
>>>>>>>>> Hi,
>>>>>>>>> 
>>>>>>>>> I am performing a search to match a name (text_field), however this term
>>>>>>>>> contains 'and' and 'a' and it doesn't return any records. If I remove 'a'
>>>>>>>>> then it works.
>>>>>>>>> e.g.
>>>>>>>>> Search term: lymphoid and a non-lymphoid cell
>>>>>>>>> doesn't work:
>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>>>>>> 
>>>>>>>>> Search term: lymphoid and non-lymphoid cell
>>>>>>>>> works:
>>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>>>>>>>> interested in the first result
>>>>>>>>> 
>>>>>>>>> schema.xml
>>>>>>>>> <field name="name" type="text_field" indexed="true" stored="true" omitNorms="false" required="true" multiValued="false"/>
>>>>>>>>> 
>>>>>>>>>        <analyzer type="query">
>>>>>>>>>            <tokenizer class="solr.PatternTokenizerFactory" pattern="[^a-zA-Z0-9/._:]"/>
>>>>>>>>>            <filter class="solr.PatternReplaceFilterFactory" pattern="^[/._:]+" replacement=""/>
>>>>>>>>>            <filter class="solr.PatternReplaceFilterFactory" pattern="[/._:]+$" replacement=""/>
>>>>>>>>>            <filter class="solr.PatternReplaceFilterFactory" pattern="[_]" replacement=" "/>
>>>>>>>>>            <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>>>>>>>            <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>>>>>>>>        </analyzer>
>>>>>>>>> 
>>>>>>>>>    <fieldType name="text_field" class="solr.TextField" positionIncrementGap="100" omitNorms="false">
>>>>>>>>>        <analyzer type="index">
>>>>>>>>>            <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>>>>>>            <filter class="solr.ClassicFilterFactory"/>
>>>>>>>>>            <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>>>>>>>            <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>>>>>>>>        </analyzer>
>>>>>>>>>        <analyzer type="query">
>>>>>>>>>            <tokenizer class="solr.PatternTokenizerFactory" pattern="[^a-zA-Z0-9/._:]"/>
>>>>>>>>>            <filter class="solr.PatternReplaceFilterFactory" pattern="^[/._:]+" replacement=""/>
>>>>>>>>>            <filter class="solr.PatternReplaceFilterFactory" pattern="[/._:]+$" replacement=""/>
>>>>>>>>>            <filter class="solr.PatternReplaceFilterFactory" pattern="[_]" replacement=" "/>
>>>>>>>>>            <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>>>>>>>            <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>>>>>>>>        </analyzer>
>>>>>>>>>    </fieldType>
>>>>>>>>> 
>>>>>>>>> stopwords.txt
>>>>>>>>> #Standard english stop words taken from Lucene's StopAnalyzer
>>>>>>>>> a
>>>>>>>>> b
>>>>>>>>> c
>>>>>>>>> ....
>>>>>>>>> an
>>>>>>>>> and
>>>>>>>>> are
>>>>>>>>> 
>>>>>>>>> Running Solr 6.6.2.
>>>>>>>>> 
>>>>>>>>> Is there anything I could do to prevent this ?
>>>>>>>>> 
>>>>>>>>> Thanks
>>>>>>>>> Guilherme
>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> -- 
>>> -- 
>>> Regards,
>>> 
>>> *Paras Lehana* [65871]
>>> Development Engineer, Auto-Suggest,
>>> IndiaMART Intermesh Ltd.
>>> 
>>> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
>>> Noida, UP, IN - 201303
>>> 
>>> Mob.: +91-9560911996
>>> Work: 01203916600 | Extn:  *8173*
>>> 
>> 
>
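
As a rough aid for reasoning about the query-time chain quoted above, the sketch below emulates its filters in plain Python. It is an approximation under stated assumptions, not Solr's implementation: token positions and offsets are not modelled, the stopword set contains only the entries visible in the quoted stopwords.txt excerpt (the rest of the file is elided there), and the function name `query_analyzer` is made up for illustration.

```python
import re

# Only the stopwords visible in the quoted excerpt; the real file is longer.
STOPWORDS = {"a", "b", "c", "an", "and", "are"}


def query_analyzer(text, min_len=2, max_len=20, stopwords=STOPWORDS):
    """Approximate the quoted query analyzer; positions are not modelled."""
    # PatternTokenizerFactory: split on any character outside the allowed
    # set and drop the empty pieces.
    tokens = [t for t in re.split(r"[^a-zA-Z0-9/._:]", text) if t]
    # Three PatternReplaceFilterFactory steps, in the quoted order.
    tokens = [re.sub(r"^[/._:]+", "", t) for t in tokens]
    tokens = [re.sub(r"[/._:]+$", "", t) for t in tokens]
    tokens = [re.sub(r"[_]", " ", t) for t in tokens]
    # LengthFilterFactory min="2" max="20": single-letter tokens are dropped.
    tokens = [t for t in tokens if min_len <= len(t) <= max_len]
    # LowerCaseFilterFactory.
    tokens = [t.lower() for t in tokens]
    # StopFilterFactory with ignoreCase="true" (tokens are already lowercase).
    return [t for t in tokens if t not in stopwords]


with_a = query_analyzer("lymphoid and a non-lymphoid cell")
without_a = query_analyzer("lymphoid and non-lymphoid cell")
print(with_a)     # ['lymphoid', 'non', 'lymphoid', 'cell']
print(without_a)  # ['lymphoid', 'non', 'lymphoid', 'cell']
```

In this simplified model both search terms reduce to the same token list, so any difference in results would have to come from something the sketch ignores, such as position gaps left by removed tokens or how the query parser splits the input before analysis. That is consistent with the advice in the thread to compare the parsed query output for the two searches.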
