Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

Paras Lehana Wed, 06 Nov 2019 21:24:28 -0800

Hi Guilherme.

I am sending the analysis result and the JSON result as requested.



Thanks for the effort. Luckily, I can see your attachments (low quality
though).

From the analysis screen, the analysis is working as expected. One reason
I can think of for query="lymphoid and *a* non-lymphoid cell" not matching a
document containing "Lymphoid and a non-Lymphoid cell" is that the stopword
"a" is still present after analysis on either the query side or the index
side. Did you change your index-time analysis after indexing?
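
To make that check concrete, here is a rough Python sketch of the two
analyzer chains from your schema.xml. It is only an approximation, not what
Solr runs: a simple regex stands in for StandardTokenizer, ClassicFilter is
skipped (assumed to be a no-op for this input), and STOPWORDS holds only the
entries visible in the posted stopwords.txt excerpt. The function names are
made up for illustration.

```python
import re

# Assumption: only the stopwords visible in the posted stopwords.txt excerpt.
STOPWORDS = {"a", "b", "c", "an", "and", "are"}


def common_filters(tokens):
    """LengthFilter(min=2, max=20) -> LowerCaseFilter -> StopFilter."""
    tokens = [t for t in tokens if 2 <= len(t) <= 20]
    tokens = [t.lower() for t in tokens]
    return [t for t in tokens if t not in STOPWORDS]


def index_terms(text):
    # Rough stand-in for StandardTokenizer: runs of ASCII letters and digits.
    return common_filters(re.findall(r"[A-Za-z0-9]+", text))


def query_terms(text):
    # PatternTokenizer: split on any character outside the allowed set.
    tokens = [t for t in re.split(r"[^a-zA-Z0-9/._:]", text) if t]
    # The three PatternReplaceFilter steps from the query analyzer.
    tokens = [re.sub(r"^[/._:]+", "", t) for t in tokens]
    tokens = [re.sub(r"[/._:]+$", "", t) for t in tokens]
    tokens = [t.replace("_", " ") for t in tokens]
    return common_filters(tokens)


doc = "Immunoregulatory interactions between a Lymphoid and a non-Lymphoid cell"
query = "lymphoid and a non-lymphoid cell"

print(index_terms(doc))
# ['immunoregulatory', 'interactions', 'between', 'lymphoid', 'non', 'lymphoid', 'cell']
print(query_terms(query))
# ['lymphoid', 'non', 'lymphoid', 'cell']
```

If both sides really reduce to terms like these, every query term is present
on the index side, so the mismatch would more likely come from a stale index
or from how the query is parsed than from the chains themselves.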

Do two things:

   1. Post the analysis screen for index=*"Immunoregulatory
   interactions between a Lymphoid and a non-Lymphoid cell"* and
   query=*"lymphoid and a non-lymphoid cell"*. Try hosting the image and
   providing the link here.
   2. Give the same JSON output as you have sent but this time with
   *"echoParams=all"*. Also, post the exact Solr query URL.
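
For the second point, a minimal sketch of how such a request URL could be
built in Python. The host, port and core name ("mycore") are placeholders I
made up, not values from this thread. echoParams, debug and wt are standard
Solr request parameters.

```python
from urllib.parse import urlencode

# Assumption: host, port and core name are placeholders, not from this thread.
SOLR_SELECT = "http://localhost:8983/solr/mycore/select"


def debug_url(q):
    """Build a select URL that echoes all params and returns the parsed query."""
    params = {
        "q": q,
        "echoParams": "all",  # echo every parameter, including handler defaults
        "debug": "query",     # parsed-query debug output only, no scoring explain
        "wt": "json",
    }
    return SOLR_SELECT + "?" + urlencode(params)


print(debug_url("lymphoid and a non-lymphoid cell"))
```

Posting the URL your application actually sends to Solr, rather than the
application URL, lets everyone see which parameters reach the query parser.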



On Wed, 6 Nov 2019 at 21:07, Erick Erickson <erickerick...@gmail.com> wrote:

> I don’t see the attachments, maybe I deleted old e-mails or some such. The
> Apache server is fairly aggressive about stripping attachments though, so
> it’s also possible they didn’t make it through.
>
> > On Nov 6, 2019, at 9:28 AM, Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
> >
> > Thanks Erick.
> >
> >> First, your index and analysis chains are considerably different, this
> can easily be a source of problems. In particular, using two different
> tokenizers is a huge red flag. I _strongly_ recommend against this unless
> you’re totally sure you understand the consequences. Additionally, your use
> of the length filter is suspicious, especially since your problem statement
> is about the addition of a single letter term and the min length allowed on
> that filter is 2. That said, it’s reasonable to suppose that the ’a’ is
> filtered out in both cases, but maybe you’ve found something odd about the
> interactions.
> > I will investigate the min length and post the results later.
> >
> >> Second, I have no idea what this will do. Are the equal signs typos?
> Used by custom code?
> > This is the URL in my application, not Solr params. That's the query string.
> >
> >> What does “species=“ do? That’s not Solr syntax, so it’s likely that
> all the params with an equal-sign are totally ignored unless it’s just a
> typo.
> > This is part of the application. Species will be used later on in Solr
> to filter the results. That's not Solr; those are my app params.
> >
> >> Third, the easiest way to see what’s happening under the covers is to
> add “&debug=true” to the query and look at the parsed query. Ignore all the
> relevance calculations for the nonce, or specify “&debug=query” to skip
> that part.
> > The two JSON files I've sent were generated with debugQuery=on, and the
> explain tag is present.
> > I will try searching the way you mentioned.
> >
> > Thanks for your input
> >
> > Guilherme
> >
> >> On 6 Nov 2019, at 14:14, Erick Erickson <erickerick...@gmail.com>
> wrote:
> >>
> >> Fwd to another server
> >>
> >> First, your index and analysis chains are considerably different, this
> can easily be a source of problems. In particular, using two different
> tokenizers is a huge red flag. I _strongly_ recommend against this unless
> you’re totally sure you understand the consequences. Additionally, your use
> of the length filter is suspicious, especially since your problem statement
> is about the addition of a single letter term and the min length allowed on
> that filter is 2. That said, it’s reasonable to suppose that the ’a’ is
> filtered out in both cases, but maybe you’ve found something odd about the
> interactions.
> >>
> >> Second, I have no idea what this will do. Are the equal signs typos?
> Used by custom code?
> >>
> >>>>
> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
> >>
> >> What does “species=“ do? That’s not Solr syntax, so it’s likely that
> all the params with an equal-sign are totally ignored unless it’s just a
> typo.
> >>
> >> Third, the easiest way to see what’s happening under the covers is to
> add “&debug=true” to the query and look at the parsed query. Ignore all the
> relevance calculations for the nonce, or specify “&debug=query” to skip
> that part.
> >>
> >> 90% + of the time, the question “why didn’t this query do what I
> expect” is answered by looking at the “&debug=query” output and the
> analysis page in the admin UI. NOTE: for the analysis page be sure to look
> at _both_ the query and index output. Also, and very important about the
> analysis page (and this is confusing) is that this _assumes_ that what you
> put in the text boxes have made it through the query parser intact and is
> analyzed by the field selected. Consider the search "q=field:word1 word2".
> Now you type “word1 word2” into the analysis text box and it looks like
> what you expect. That’s misleading because the query is _parsed_ as
> "field:word1 default_search_field:word2”. This is where “&debug=query”
> helps.
> >>
> >> Best,
> >> Erick
> >>
> >>> On Nov 6, 2019, at 2:36 AM, Paras Lehana <paras.leh...@indiamart.com>
> wrote:
> >>>
> >>> Hi Walter,
> >>>
> >>> The solr.StopFilter removes all tokens that are stopwords. Those words
> will
> >>>> not be in the index, so they can never match a query.
> >>>
> >>>
> >>> I think the OP's concern is different results when adding a stopword. I
> >>> think he's using the filter factory correctly - the query chain
> includes
> >>> the filter as well so it should remove "a" while querying.
> >>>
> >>> *@Guilherme*, please post the results for both queries, the document in
> >>> the results you are concerned about, and the full analysis screen output
> >>> (for both query and index).
> >>>
> >>> On Tue, 5 Nov 2019 at 21:38, Walter Underwood <wun...@wunderwood.org>
> wrote:
> >>>
> >>>> No.
> >>>>
> >>>> The solr.StopFilter removes all tokens that are stopwords. Those words
> >>>> will not be in the index, so they can never match a query.
> >>>>
> >>>> 1. Remove the lines with solr.StopFilter from every analysis chain in
> >>>> schema.xml.
> >>>> 2. Reload the collection, restart Solr, or whatever to read the new
> config.
> >>>> 3. Reindex all of the documents.
> >>>>
> >>>> When indexed with the new analysis chain, the stopwords will not be
> >>>> removed and they will be searchable.
> >>>>
> >>>> wunder
> >>>> Walter Underwood
> >>>> wun...@wunderwood.org
> >>>> http://observer.wunderwood.org/  (my blog)
> >>>>
> >>>>> On Nov 5, 2019, at 8:56 AM, Guilherme Viteri <gvit...@ebi.ac.uk>
> wrote:
> >>>>>
> >>>>> OK. I am kind of lost now.
> >>>>> If I open up the console > analysis and perform it, that's the final
> >>>> result.
> >>>>> <Screenshot 2019-11-05 at 14.54.16.png>
> >>>>>
> >>>>> Your suggestion is: get rid of the <filter stopword.txt> in the
> >>>> schema.xml and during index phase replaceAll("in stopwords.txt"," ")
> then
> >>>> add to solr. Is that correct ?
> >>>>>
> >>>>> Thanks David
> >>>>>
> >>>>>> On 5 Nov 2019, at 14:48, David Hastings <
> hastings.recurs...@gmail.com
> >>>> <mailto:hastings.recurs...@gmail.com>> wrote:
> >>>>>>
> >>>>>> Fwd to another server
> >>>>>>
> >>>>>> no,
> >>>>>>            <filter class="solr.StopFilterFactory" ignoreCase="true"
> >>>>>> words="stopwords.txt"/>
> >>>>>>
> >>>>>> is still using stopwords and should be removed, in my opinion of
> course,
> >>>>>> based on your use case may be different, but i generally axe any
> >>>> reference
> >>>>>> to them at all
> >>>>>>
> >>>>>> On Tue, Nov 5, 2019 at 9:47 AM Guilherme Viteri <gvit...@ebi.ac.uk
> >>>> <mailto:gvit...@ebi.ac.uk>> wrote:
> >>>>>>
> >>>>>>> Thanks.
> >>>>>>> Haven't I done this here ?
> >>>>>>> <fieldType name="text_field" class="solr.TextField"
> >>>>>>> positionIncrementGap="100" omitNorms="false" >
> >>>>>>>        <analyzer type="index">
> >>>>>>>            <tokenizer class="solr.StandardTokenizerFactory"/>
> >>>>>>>            <filter class="solr.ClassicFilterFactory"/>
> >>>>>>>            <filter class="solr.LengthFilterFactory" min="2"
> >>>> max="20"/>
> >>>>>>>            <filter class="solr.LowerCaseFilterFactory"/>
> >>>>>>>            <filter class="solr.StopFilterFactory" ignoreCase="true"
> >>>>>>> words="stopwords.txt"/>
> >>>>>>>        </analyzer>
> >>>>>>>
> >>>>>>>
> >>>>>>>> On 5 Nov 2019, at 14:15, David Hastings <
> hastings.recurs...@gmail.com
> >>>> <mailto:hastings.recurs...@gmail.com>>
> >>>>>>> wrote:
> >>>>>>>>
> >>>>>>>> Fwd to another server
> >>>>>>>>
> >>>>>>>> The first thing you should do is remove any reference to stop
> words
> >>>> and
> >>>>>>>> never use them, then re-index your data and try it again.
> >>>>>>>>
> >>>>>>>> On Tue, Nov 5, 2019 at 9:14 AM Guilherme Viteri <
> gvit...@ebi.ac.uk
> >>>> <mailto:gvit...@ebi.ac.uk>>
> >>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> I am performing a search to match a name (text_field), however
> this
> >>>> term
> >>>>>>>>> contains 'and' and 'a' and it doesn't return any records. If i
> remove
> >>>>>>> 'a'
> >>>>>>>>> then it works.
> >>>>>>>>> e.g
> >>>>>>>>> Search Term: lymphoid and a non-lymphoid cell
> >>>>>>>>> doesn't work:
> >>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
> >>>>>>>>>
> >>>>>>>>> Search term: lymphoid and non-lymphoid cell
> >>>>>>>>> works:
> >>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
> >>>>>>>>> interested in the first result
> >>>>>>>>>
> >>>>>>>>> schema.xml
> >>>>>>>>> <field name="name"                          type="text_field"
> >>>>>>>>> indexed="true"  stored="true"   omitNorms="false"
>  required="true"
> >>>>>>>>> multiValued="false"/>
> >>>>>>>>>
> >>>>>>>>>        <analyzer type="query">
> >>>>>>>>>            <tokenizer class="solr.PatternTokenizerFactory"
> >>>>>>>>> pattern="[^a-zA-Z0-9/._:]"/>
> >>>>>>>>>            <filter class="solr.PatternReplaceFilterFactory"
> >>>>>>>>> pattern="^[/._:]+" replacement=""/>
> >>>>>>>>>            <filter class="solr.PatternReplaceFilterFactory"
> >>>>>>>>> pattern="[/._:]+$" replacement=""/>
> >>>>>>>>>            <filter class="solr.PatternReplaceFilterFactory"
> >>>>>>>>> pattern="[_]" replacement=" "/>
> >>>>>>>>>            <filter class="solr.LengthFilterFactory" min="2"
> >>>>>>> max="20"/>
> >>>>>>>>>            <filter class="solr.LowerCaseFilterFactory"/>
> >>>>>>>>>            <filter class="solr.StopFilterFactory"
> >>>> ignoreCase="true"
> >>>>>>>>> words="stopwords.txt"/>
> >>>>>>>>>        </analyzer>
> >>>>>>>>>
> >>>>>>>>>    <fieldType name="text_field" class="solr.TextField"
> >>>>>>>>> positionIncrementGap="100" omitNorms="false" >
> >>>>>>>>>        <analyzer type="index">
> >>>>>>>>>            <tokenizer class="solr.StandardTokenizerFactory"/>
> >>>>>>>>>            <filter class="solr.ClassicFilterFactory"/>
> >>>>>>>>>            <filter class="solr.LengthFilterFactory" min="2"
> >>>>>>> max="20"/>
> >>>>>>>>>            <filter class="solr.LowerCaseFilterFactory"/>
> >>>>>>>>>            <filter class="solr.StopFilterFactory"
> >>>> ignoreCase="true"
> >>>>>>>>> words="stopwords.txt"/>
> >>>>>>>>>        </analyzer>
> >>>>>>>>>        <analyzer type="query">
> >>>>>>>>>            <tokenizer class="solr.PatternTokenizerFactory"
> >>>>>>>>> pattern="[^a-zA-Z0-9/._:]"/>
> >>>>>>>>>            <filter class="solr.PatternReplaceFilterFactory"
> >>>>>>>>> pattern="^[/._:]+" replacement=""/>
> >>>>>>>>>            <filter class="solr.PatternReplaceFilterFactory"
> >>>>>>>>> pattern="[/._:]+$" replacement=""/>
> >>>>>>>>>            <filter class="solr.PatternReplaceFilterFactory"
> >>>>>>>>> pattern="[_]" replacement=" "/>
> >>>>>>>>>            <filter class="solr.LengthFilterFactory" min="2"
> >>>>>>> max="20"/>
> >>>>>>>>>            <filter class="solr.LowerCaseFilterFactory"/>
> >>>>>>>>>            <filter class="solr.StopFilterFactory"
> >>>> ignoreCase="true"
> >>>>>>>>> words="stopwords.txt"/>
> >>>>>>>>>        </analyzer>
> >>>>>>>>>    </fieldType>
> >>>>>>>>>
> >>>>>>>>> stopwords.txt
> >>>>>>>>> #Standard english stop words taken from Lucene's StopAnalyzer
> >>>>>>>>> a
> >>>>>>>>> b
> >>>>>>>>> c
> >>>>>>>>> ....
> >>>>>>>>> an
> >>>>>>>>> and
> >>>>>>>>> are
> >>>>>>>>>
> >>>>>>>>> Running SolR 6.6.2.
> >>>>>>>>>
> >>>>>>>>> Is there anything I could do to prevent this ?
> >>>>>>>>>
> >>>>>>>>> Thanks
> >>>>>>>>> Guilherme
> >>>>>>>
> >>>>>>>
> >>>>>
> >>>>
> >>>>
> >>>
> >>> --
> >>> --
> >>> Regards,
> >>>
> >>> *Paras Lehana* [65871]
> >>> Development Engineer, Auto-Suggest,
> >>> IndiaMART Intermesh Ltd.
> >>>
> >>> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
> >>> Noida, UP, IN - 201303
> >>>
> >>> Mob.: +91-9560911996
> >>> Work: 01203916600 | Extn:  *8173*
> >>>
> >>> --
> >>> IMPORTANT:
> >>> NEVER share your IndiaMART OTP/ Password with anyone.
> >>
> >
>
>

-- 
-- 
Regards,

*Paras Lehana* [65871]
Development Engineer, Auto-Suggest,
IndiaMART Intermesh Ltd.

8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
Noida, UP, IN - 201303

Mob.: +91-9560911996
Work: 01203916600 | Extn:  *8173*

-- 
IMPORTANT: 
NEVER share your IndiaMART OTP/ Password with anyone.
