Hi Guilherme,

> I am sending the analysis result and the JSON result as requested.

Thanks for the effort. Luckily, I can see your attachments (low quality though).

From the analysis screen, the analysis is working as expected. One reason I can initially think of for query="lymphoid and *a* non-lymphoid cell" not matching the document containing "Lymphoid and a non-Lymphoid cell" is that the stopword "a" is probably still present after analysis in either the query or the index. Did you tweak your index-time analysis after indexing?

Do two things:

1. Post the analysis screen for index=*"Immunoregulatory interactions between a Lymphoid and a non-Lymphoid cell"* and query=*"lymphoid and a non-lymphoid cell"*. Try hosting the image and providing the link here.
2. Give the same JSON output as you have sent, but this time with *"echoParams=all"*. Also, post the exact Solr query URL.

On Wed, 6 Nov 2019 at 21:07, Erick Erickson <erickerick...@gmail.com> wrote:

> I don’t see the attachments, maybe I deleted old e-mails or some such. The Apache server is fairly aggressive about stripping attachments though, so it’s also possible they didn’t make it through.
>
> > On Nov 6, 2019, at 9:28 AM, Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
> >
> > Thanks Erick.
> >
> >> First, your index and analysis chains are considerably different, this can easily be a source of problems. In particular, using two different tokenizers is a huge red flag. I _strongly_ recommend against this unless you’re totally sure you understand the consequences. Additionally, your use of the length filter is suspicious, especially since your problem statement is about the addition of a single letter term and the min length allowed on that filter is 2. That said, it’s reasonable to suppose that the ’a’ is filtered out in both cases, but maybe you’ve found something odd about the interactions.
> > I will investigate the min length and post the results later.
> >
> >> Second, I have no idea what this will do. Are the equal signs typos? Used by custom code?
> > This is the URL in my application, not Solr params. That's the query string.
> >
> >> What does “species=“ do? That’s not Solr syntax, so it’s likely that all the params with an equal-sign are totally ignored unless it’s just a typo.
> > This is part of the application. Species will be used later on in Solr to filter the results. That's not Solr; those are my app's params.
> >
> >> Third, the easiest way to see what’s happening under the covers is to add “&debug=true” to the query and look at the parsed query. Ignore all the relevance calculations for the nonce, or specify “&debug=query” to skip that part.
> > The two JSON files I've sent were produced with debugQuery=on, and the explain tag is present.
> > I will try searching the way you mentioned.
> >
> > Thanks for your input.
> >
> > Guilherme
> >
> >> On 6 Nov 2019, at 14:14, Erick Erickson <erickerick...@gmail.com> wrote:
> >>
> >> Fwd to another server
> >>
> >> First, your index and analysis chains are considerably different, this can easily be a source of problems. In particular, using two different tokenizers is a huge red flag. I _strongly_ recommend against this unless you’re totally sure you understand the consequences. Additionally, your use of the length filter is suspicious, especially since your problem statement is about the addition of a single letter term and the min length allowed on that filter is 2. That said, it’s reasonable to suppose that the ’a’ is filtered out in both cases, but maybe you’ve found something odd about the interactions.
> >>
> >> Second, I have no idea what this will do. Are the equal signs typos? Used by custom code?
> >>
> >>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
> >>
> >> What does “species=“ do? That’s not Solr syntax, so it’s likely that all the params with an equal-sign are totally ignored unless it’s just a typo.
> >>
> >> Third, the easiest way to see what’s happening under the covers is to add “&debug=true” to the query and look at the parsed query. Ignore all the relevance calculations for the nonce, or specify “&debug=query” to skip that part.
> >>
> >> 90% + of the time, the question “why didn’t this query do what I expect” is answered by looking at the “&debug=query” output and the analysis page in the admin UI. NOTE: for the analysis page be sure to look at _both_ the query and index output. Also, and very important about the analysis page (and this is confusing) is that this _assumes_ that what you put in the text boxes has made it through the query parser intact and is analyzed by the field selected. Consider the search "q=field:word1 word2". Now you type “word1 word2” into the analysis text box and it looks like what you expect. That’s misleading because the query is _parsed_ as "field:word1 default_search_field:word2”. This is where “&debug=query” helps.
> >>
> >> Best,
> >> Erick
> >>
> >>> On Nov 6, 2019, at 2:36 AM, Paras Lehana <paras.leh...@indiamart.com> wrote:
> >>>
> >>> Hi Walter,
> >>>
> >>>> The solr.StopFilter removes all tokens that are stopwords. Those words will not be in the index, so they can never match a query.
> >>>
> >>> I think the OP's concern is different results when adding a stopword. I think he's using the filter factory correctly - the query chain includes the filter as well so it should remove "a" while querying.
> >>>
> >>> *@Guilherme*, please post the results for both queries and the document in the result you are concerned about, and post the full result of the analysis screen (for both query and index).
> >>>
> >>> On Tue, 5 Nov 2019 at 21:38, Walter Underwood <wun...@wunderwood.org> wrote:
> >>>
> >>>> No.
> >>>>
> >>>> The solr.StopFilter removes all tokens that are stopwords. Those words will not be in the index, so they can never match a query.
> >>>>
> >>>> 1. Remove the lines with solr.StopFilter from every analysis chain in schema.xml.
> >>>> 2. Reload the collection, restart Solr, or whatever to read the new config.
> >>>> 3. Reindex all of the documents.
> >>>>
> >>>> When indexed with the new analysis chain, the stopwords will not be removed and they will be searchable.
> >>>>
> >>>> wunder
> >>>> Walter Underwood
> >>>> wun...@wunderwood.org
> >>>> http://observer.wunderwood.org/ (my blog)
> >>>>
> >>>>> On Nov 5, 2019, at 8:56 AM, Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
> >>>>>
> >>>>> Ok. I am kind of lost now.
> >>>>> If I open up the console > analysis and perform it, that's the final result.
> >>>>> <Screenshot 2019-11-05 at 14.54.16.png>
> >>>>>
> >>>>> Your suggestion is: get rid of the <filter stopword.txt> in the schema.xml and during index phase replaceAll("in stopwords.txt"," ") then add to solr. Is that correct ?
> >>>>>
> >>>>> Thanks David
> >>>>>
> >>>>>> On 5 Nov 2019, at 14:48, David Hastings <hastings.recurs...@gmail.com> wrote:
> >>>>>>
> >>>>>> Fwd to another server
> >>>>>>
> >>>>>> no,
> >>>>>> <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
> >>>>>>
> >>>>>> is still using stopwords and should be removed, in my opinion of course, based on your use case may be different, but i generally axe any reference to them at all
> >>>>>>
> >>>>>> On Tue, Nov 5, 2019 at 9:47 AM Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
> >>>>>>
> >>>>>>> Thanks.
> >>>>>>> Haven't I done this here ?
> >>>>>>> <fieldType name="text_field" class="solr.TextField" positionIncrementGap="100" omitNorms="false">
> >>>>>>>   <analyzer type="index">
> >>>>>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
> >>>>>>>     <filter class="solr.ClassicFilterFactory"/>
> >>>>>>>     <filter class="solr.LengthFilterFactory" min="2" max="20"/>
> >>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
> >>>>>>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
> >>>>>>>   </analyzer>
> >>>>>>>
> >>>>>>>> On 5 Nov 2019, at 14:15, David Hastings <hastings.recurs...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> Fwd to another server
> >>>>>>>>
> >>>>>>>> The first thing you should do is remove any reference to stop words and never use them, then re-index your data and try it again.
> >>>>>>>>
> >>>>>>>> On Tue, Nov 5, 2019 at 9:14 AM Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
> >>>>>>>>
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> I am performing a search to match a name (text_field), however this term contains 'and' and 'a' and it doesn't return any records. If i remove 'a' then it works.
> >>>>>>>>> e.g
> >>>>>>>>> Search Term: lymphoid and a non-lymphoid cell
> >>>>>>>>> doesn't work:
> >>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
> >>>>>>>>>
> >>>>>>>>> Search term: lymphoid and non-lymphoid cell
> >>>>>>>>> works:
> >>>>>>>>> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
> >>>>>>>>> interested in the first result
> >>>>>>>>>
> >>>>>>>>> schema.xml
> >>>>>>>>> <field name="name" type="text_field" indexed="true" stored="true" omitNorms="false" required="true" multiValued="false"/>
> >>>>>>>>>
> >>>>>>>>> <analyzer type="query">
> >>>>>>>>>   <tokenizer class="solr.PatternTokenizerFactory" pattern="[^a-zA-Z0-9/._:]"/>
> >>>>>>>>>   <filter class="solr.PatternReplaceFilterFactory" pattern="^[/._:]+" replacement=""/>
> >>>>>>>>>   <filter class="solr.PatternReplaceFilterFactory" pattern="[/._:]+$" replacement=""/>
> >>>>>>>>>   <filter class="solr.PatternReplaceFilterFactory" pattern="[_]" replacement=" "/>
> >>>>>>>>>   <filter class="solr.LengthFilterFactory" min="2" max="20"/>
> >>>>>>>>>   <filter class="solr.LowerCaseFilterFactory"/>
> >>>>>>>>>   <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
> >>>>>>>>> </analyzer>
> >>>>>>>>>
> >>>>>>>>> <fieldType name="text_field" class="solr.TextField" positionIncrementGap="100" omitNorms="false">
> >>>>>>>>>   <analyzer type="index">
> >>>>>>>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
> >>>>>>>>>     <filter class="solr.ClassicFilterFactory"/>
> >>>>>>>>>     <filter class="solr.LengthFilterFactory" min="2" max="20"/>
> >>>>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
> >>>>>>>>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
> >>>>>>>>>   </analyzer>
> >>>>>>>>>   <analyzer type="query">
> >>>>>>>>>     <tokenizer class="solr.PatternTokenizerFactory" pattern="[^a-zA-Z0-9/._:]"/>
> >>>>>>>>>     <filter class="solr.PatternReplaceFilterFactory" pattern="^[/._:]+" replacement=""/>
> >>>>>>>>>     <filter class="solr.PatternReplaceFilterFactory" pattern="[/._:]+$" replacement=""/>
> >>>>>>>>>     <filter class="solr.PatternReplaceFilterFactory" pattern="[_]" replacement=" "/>
> >>>>>>>>>     <filter class="solr.LengthFilterFactory" min="2" max="20"/>
> >>>>>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
> >>>>>>>>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
> >>>>>>>>>   </analyzer>
> >>>>>>>>> </fieldType>
> >>>>>>>>>
> >>>>>>>>> stopwords.txt
> >>>>>>>>> #Standard english stop words taken from Lucene's StopAnalyzer
> >>>>>>>>> a
> >>>>>>>>> b
> >>>>>>>>> c
> >>>>>>>>> ....
> >>>>>>>>> an
> >>>>>>>>> and
> >>>>>>>>> are
> >>>>>>>>>
> >>>>>>>>> Running SolR 6.6.2.
> >>>>>>>>>
> >>>>>>>>> Is there anything I could do to prevent this ?
> >>>>>>>>>
> >>>>>>>>> Thanks
> >>>>>>>>> Guilherme
> >>>
> >>> --
> >>> Regards,
> >>>
> >>> *Paras Lehana* [65871]
> >>> Development Engineer, Auto-Suggest,
> >>> IndiaMART Intermesh Ltd.
> >>>
> >>> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
> >>> Noida, UP, IN - 201303
> >>>
> >>> Mob.: +91-9560911996
> >>> Work: 01203916600 | Extn: *8173*
> >>>
> >>> --
> >>> IMPORTANT:
> >>> NEVER share your IndiaMART OTP/ Password with anyone.

--
Regards,

*Paras Lehana* [65871]
Development Engineer, Auto-Suggest,
IndiaMART Intermesh Ltd.
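
P.S. In case it helps with the request earlier in this mail for "echoParams=all" and the exact Solr query URL, here is a minimal Python sketch of how such a URL could be assembled. The host, the core name "mycore" and the "/select" handler are assumptions on my part (I don't know your setup), and the helper name build_debug_url is made up for illustration. "debug=query" is the parameter Erick suggested; "wt=json" is only there to get JSON back.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_debug_url(base_url: str, core: str, query: str) -> str:
    """Return a Solr select URL that echoes all params and shows the parsed query."""
    params = {
        "q": query,
        "echoParams": "all",  # echo request params and handler defaults in the response header
        "debug": "query",     # parsed query only, skipping the relevance explain section
        "wt": "json",
    }
    # urlencode escapes spaces as '+', the same style as the URLs in this thread
    return f"{base_url.rstrip('/')}/{core}/select?{urlencode(params)}"


# Hypothetical host and core name; replace them with your own.
url = build_debug_url(
    "http://localhost:8983/solr", "mycore", "lymphoid and a non-lymphoid cell"
)
print(url)

# Decode the query string again to confirm the parameters survived encoding.
parsed = parse_qs(urlsplit(url).query)
```

Running the same helper with and without the "a" in the query gives two URLs whose "parsedquery" sections can be compared side by side.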