Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

Paras Lehana Sun, 17 Nov 2019 21:39:16 -0800

Hi Guilherme,

Have you tried reindexing the documents and comparing the results? No issues
if you cannot do that - let's try something else. I was going through the
whole mail and your files. You had said:


> As soon as I add dbId or stId (regardless the boost, 1.0 or 100.0), then I
> don't get anything (which make sense).


Why did you think that not getting anything when you add dbId made sense?
Asking because I may be missing something here.

Also, what is the purpose of so many qf's? Going through your documents and
config files, I found that your dbId's are strings of numbers, and I don't
think you want to find your query terms in dbId, right?
Do you want to boost the score by the values in dbId?

Your qf of dbId^100 multiplies the score of any match of the q terms in
dbId by 100. Since your terms don't match the values in dbId for any
document, the score produced by this field is 0, and 100x or 1x of 0 is
still 0.
I still need to see how these per-field scores get added up in the edismax
parser, but do reevaluate the usage of these qfs. Same goes for the other qf
boosts. :)
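
To make the arithmetic above concrete, here is a minimal Python sketch of how
a disjunction-max query combines per-field scores: the best boosted field
score, plus tie times the sum of the others. The function name dismax_score,
the raw scores and the tie value are illustrative assumptions, not values
taken from the solrconfig.xml in this thread. The sketch models only the
scoring arithmetic; it does not model mm or per-field analysis, which also
decide whether a document matches at all.

```python
def dismax_score(field_scores, boosts, tie=0.0):
    """Combine per-field scores in disjunction-max style.

    field_scores: raw score of the query terms in each field (0.0 = no match).
    boosts: per-field multipliers, as in qf="name^1 dbId^100".
    tie: weight given to the non-best fields (0.0 = pure max).
    """
    boosted = [boosts.get(field, 1.0) * score
               for field, score in field_scores.items()]
    if not boosted:
        return 0.0
    best = max(boosted)
    return best + tie * (sum(boosted) - best)


# Hypothetical document: the terms match in name but never in dbId.
raw = {"name": 1.5, "dbId": 0.0}

# Boosting a field that scores 0 leaves the combined score unchanged.
low = dismax_score(raw, {"name": 1.0, "dbId": 1.0})
high = dismax_score(raw, {"name": 1.0, "dbId": 100.0})
```

Under these assumptions, low and high come out equal, which is the "100x of 0
is still 0" point: a boost on a field that never matches cannot change the
ranking by itself.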


On Fri, 15 Nov 2019 at 12:23, Guilherme Viteri <gvit...@ebi.ac.uk> wrote:

> Hi Paras
> No worries.
> No I didn’t find anything. This is annoying now...
> Yes! They do contain dbId. Absolutely all my docs contain dbId, and it is
> actually my key, if you check the schema.xml again.
>
> Cheers
> Guilherme
>
> On 15 Nov 2019, at 05:37, Paras Lehana <paras.leh...@indiamart.com> wrote:
>
> 
> Hey Guilherme,
>
> I was a bit busy for the past few days and couldn't read your mail. So,
> did you find anything? Anyway, as I had expected, the culprit is
> definitely among the qfs. Do the documents in question contain dbId? I
> suggest you cross-check the fields in your documents against those
> affecting the result in qf.
>
> On Tue, 12 Nov 2019 at 16:14, Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
>
>> What I can't understand is:
>> if I search for the exact term "Immunoregulatory interactions between a
>> Lymphoid *and a* non-Lymphoid cell", it doesn't work, but if I search for
>> "Immunoregulatory interactions between a Lymphoid *and* non-Lymphoid
>> cell", then it works.
>>
>> On 11 Nov 2019, at 12:24, Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
>>
>> Thanks
>>
>>> Removing stopwords is another story. I'm curious to find the reason
>>> assuming that you keep on using stopwords. In some cases, stopwords are
>>> really necessary.
>>
>> Yes. It has always made sense the way we've been using them.
>>
>>> If q.alt is giving you responses, it's confirmed that your stopwords
>>> filter is working as expected. The problem definitely lies in the
>>> configuration of edismax.
>>
>> I see.
>>
>>> *Let me explain again:* In your solrconfig.xml, look at your /search
>>
>> Ok, using q now, removed all qf, performed the search and I got 23
>> results, and the one I really want, on the top.
>> As soon as I add dbId or stId (regardless the boost, 1.0 or 100.0), then
>> I don't get anything (which make sense). However if I query name_exact, I
>> get the 23 results again, and unfortunately if I query stId^1.0
>> name_exact^10.0 I still don't get any results.
>>
>> In summary
>> - without qf - 23 results
>> - dbId - 0 results
>> - name_exact - 16 results
>> - name - 23 results
>> - dbId^1.0 name_exact^10.0 - 0 results
>> - 0 results if any other field (stId, dbId (key)) is added on top of the
>> name fields (name_exact, etc.).
>>
>> Definitely lost here! :-/
>>
>>
>> On 11 Nov 2019, at 07:59, Paras Lehana <paras.leh...@indiamart.com> wrote:
>>
>> Hi
>>
>>> So I don't think removing it completely is the way to go from the
>>> scenario we have
>>
>> Removing stopwords is another story. I'm curious to find the reason
>> assuming that you keep on using stopwords. In some cases, stopwords are
>> really necessary.
>>
>>> Quite a considerable increase
>>
>> If q.alt is giving you responses, it's confirmed that your stopwords
>> filter is working as expected. The problem definitely lies in the
>> configuration of edismax.
>>
>>> I am sorry but I didn't understand what do you want me to do exactly with
>>> the lst (??) and qf and bf.
>>
>> What combinations did you try? I was referring to the field-level boosting
>> you have applied in edismax config.
>>
>> *Let me explain again:* In your solrconfig.xml, look at your /search
>> request handler. There are many qf and some bq boosts. I want you to
>> remove all of these, check response again (with q now) and keep on adding
>> them again (one by one) while looking for when the numFound drastically
>> changes.
>>
>> On Fri, 8 Nov 2019 at 23:47, David Hastings <hastings.recurs...@gmail.com>
>> wrote:
>>
>> I use 3 word shingles with stopwords for my MLT ML trainer that worked
>> pretty well for such a solution, but for a full index the size became
>> prohibitive
>>
>> On Fri, Nov 8, 2019 at 12:13 PM Walter Underwood <wun...@wunderwood.org>
>> wrote:
>>
>> If we had IDF for phrases, they would be super effective. The 2X weight is
>> a hack that mostly works.
>>
>> Infoseek had phrase IDF and it was a killer algorithm for relevance.
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>>
>> On Nov 8, 2019, at 11:08 AM, David Hastings <hastings.recurs...@gmail.com>
>> wrote:
>>
>>
>> the pf and qf fields are REALLY nice for this
>>
>> On Fri, Nov 8, 2019 at 12:02 PM Walter Underwood <wun...@wunderwood.org>
>> wrote:
>>
>> I always enable phrase searching in edismax for exactly this reason.
>>
>> Something like:
>>
>>     <str name="qf">title^8 keywords^4 text</str>
>>     <str name="pf">title^16 keywords^8 text^2</str>
>>
>> To deal with concepts in queries, a classifier and/or named entity
>> extractor can be helpful. If you have a list of concepts (“controlled
>> vocabulary”) that includes “Lamin A”, and that shows up in a query, that
>> term can be queried against the field matching that vocabulary.
>>
>> This is how LinkedIn separates people, companies, and places, for example.
>>
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>>
>> On Nov 8, 2019, at 10:48 AM, Erick Erickson <erickerick...@gmail.com>
>> wrote:
>>
>>
>> Look at the “mm” parameter, try setting it to 100%. Although that’s not
>> entirely likely to do what you want either since virtually every doc will
>> have “a” in it. But at least you’d get docs that have both terms.
>>
>> You may also be able to search for things like “Lamin A” _only as a
>> phrase_ and have some luck. But this is a gnarly problem in general. Some
>> people have been able to substitute synonyms and/or shingles to make this
>> work at the expense of a larger index.
>>
>>
>> This is a generic problem with context. “Lamin A” is really a “concept”,
>> not just two words that happen to be near each other. Searching as a phrase
>> is an OOB-but-naive way to try to make it more likely that the ranked
>> results refer to the _concept_ of “Lamin A”. The assumption here is “if
>> these two words appear next to each other, they’re more likely to be what I
>> want”. I say “naive” because “Lamins: A new approach to...” would _also_ be
>> found for a naive phrase search. (I have no idea whether such a title makes
>> sense or not, but you figured that out already)...
>>
>> To do this well you’d have to dive in to NLP/Machine learning.
>>
>> I truly wish we could have the DWIM search algorithm (Do What I Mean)….
>>
>>
>> On Nov 8, 2019, at 11:29 AM, Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
>>
>>
>> Hi Walter and Paras
>>
>> I indexed it removing all the references to StopWordFilter and I went
>> from 121 results to nearly 20K, as the search term q="Lymphoid and a
>> non-Lymphoid cell" is matching entities such as "IFT A" or "Lamin A". So I
>> don't think removing it completely is the way to go from the scenario we
>> have, but I appreciate the suggestion…
>>
>> Yes the response is using fl=*
>> I am trying some combinations at the moment, but no success yet.
>>
>> defType=edismax
>> q.alt=Lymphoid and a non-Lymphoid cell
>> Number of results=1599
>> Quite a considerable increase, even though reasonably meaningful results.
>>
>> I am sorry but I didn't understand what do you want me to do exactly
>> with the lst (??) and qf and bf.
>>
>> Thanks everyone for their input
>>
>>
>> On 8 Nov 2019, at 06:45, Paras Lehana <paras.leh...@indiamart.com> wrote:
>>
>>
>> Hi Guilherme
>>
>>> By accident, I ended up querying the using the default handler
>>> (/select) and it worked.
>>
>> You've just found the culprit. Thanks for giving the material I
>> requested. Your analysis chain is working as expected. I don't see any
>> issue in either StopWordFilter or your boosts. I also use a boost of 50
>> when boosting contextual suggestions (boosting "gold iphone" on a page of
>> iphone) but I take Walter's suggestion and would try to optimize my
>> weights. I agree that this 50 thing was not researched much by us either
>> (we never faced performance or relevance issues).
>>
>> See the major difference between the two handlers - edismax. I'm pretty
>> sure that your problem lies in the parsing of queries (you can confirm that
>> from the parsedquery key in debug of both JSON responses). I hope you have
>> provided the response with fl=*. Replace q with q.alt in your /search
>> handler query and I think you should start getting responses. That's
>> because q.alt uses the standard parser. If you want to keep using edisMax, I
>> suggest you test the responses after removing some combination of lst (qf,
>> bf) and find what's preventing the documents from coming up. I'm out of
>> office today - would have certainly tried analyzing the field values of the
>> document in the /select request and comparing them with qf/bq in
>> solrconfig.xml /search. Do this for me and you'd certainly find something.
>>
>>
>> On Thu, 7 Nov 2019 at 21:00, Walter Underwood <wun...@wunderwood.org> wrote:
>>
>> I normally use a weight of 8 for the most important field, like title.
>> Other fields might get a 4 or 2.
>>
>> I add a “pf” field with the weights doubled, so that phrase matches
>> have a higher weight.
>>
>> The weight of 8 comes from experience at Infoseek and Inktomi, two
>> early web search engines. With different relevance algorithms and totally
>> different evaluation and tuning systems, they settled on weights of 8 and
>> 7.5 for HTML titles. With the two radically different systems getting
>> the same number, I decided that was a property of the documents, not of the
>> search engines.
>>
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>>
>>
>> On Nov 7, 2019, at 9:03 AM, Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
>>
>>
>> Hi Wunder,
>>
>> My indexer takes quite a few hours to run. I am shortening it
>> to run faster, but I also need to make sure it gives what we are expecting.
>> This implementation's been there for >4y, and massively used.
>>
>>> In your edismax handlers, weights of 20, 50, and 100 are extremely
>>> high. I don’t think I’ve ever used a weight higher than 16 in a dozen
>>> years of configuring Solr.
>>
>> I've inherited that implementation and I am really keen to adapt
>> it. What would you recommend?
>>
>>
>> Cheers
>> Guilherme
>>
>> On 7 Nov 2019, at 14:43, Walter Underwood <wun...@wunderwood.org> wrote:
>>
>>
>> Thanks for posting the files. Looking at schema.xml, I see that you
>> still are using StopFilterFactory. The first advice we gave you was to
>> remove that.
>>
>> Remove StopFilterFactory everywhere and reindex.
>>
>> You will continue to have problems matching stopwords until you do that.
>>
>> In your edismax handlers, weights of 20, 50, and 100 are extremely
>> high. I don’t think I’ve ever used a weight higher than 16 in a dozen years
>> of configuring Solr.
>>
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>>
>>
>> On Nov 7, 2019, at 6:56 AM, Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
>>
>>
>> Hi Paras, everyone
>>
>> Thank you again for your input and suggestions. I am sorry to hear
>> you had trouble with the attachments. I will host them somewhere and share
>> the links.
>>
>> I don't tweak my index. I get the data from the graph database,
>> create the documents as they are and save them to Solr.
>>
>> So, I am sending the new analysis screen querying the way you
>> suggested. Also the results with params and the Solr query URL.
>>
>> During the process of querying what you asked I found something
>> really weird (at least for me). By accident, I ended up querying the using
>> the default handler (/select) and it worked. Then if I use the one I must
>> use, sadly it doesn't work. I am posting both results and I will also
>> post the handlers.
>>
>> Here is the link with all the files mentioned before
>>
>> https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0
>>
>> If the link doesn't work www dot dropbox dot com slash sh slash
>> fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a ? dl equals 0
>>
>>
>> Thanks
>>
>> On 7 Nov 2019, at 05:23, Paras Lehana <paras.leh...@indiamart.com> wrote:
>>
>>
>> Hi Guilherme.
>>
>>> I am sending the analysis result and the json result as requested.
>>
>> Thanks for the effort. Luckily, I can see your attachments (low quality
>> though).
>>
>> From the analysis screen, the analysis is working as expected. One of the
>> reasons for query="lymphoid and *a* non-lymphoid cell" not matching a
>> document containing "Lymphoid and a non-Lymphoid cell" I can initially
>> think of is: the stopword "a" is probably present post-analysis in either
>> the query or the index. Did you tweak your index-time analysis after
>> indexing?
>>
>> Do two things:
>>
>> 1. Post the analysis screen for index=*"Immunoregulatory
>> interactions between a Lymphoid and a non-Lymphoid cell"* and
>> query=*"lymphoid and a non-lymphoid cell"*. Try hosting the image and
>> providing the link here.
>> 2. Give the same JSON output as you have sent but this time with
>> *"echoParams=all"*. Also, post the exact Solr query URL.
>>
>>
>>
>> On Wed, 6 Nov 2019 at 21:07, Erick Erickson <erickerick...@gmail.com> wrote:
>>
>>
>> I don’t see the attachments, maybe I deleted old e-mails or some such. The
>> Apache server is fairly aggressive about stripping attachments though, so
>> it’s also possible they didn’t make it through.
>>
>> On Nov 6, 2019, at 9:28 AM, Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
>>
>>
>> Thanks Erick.
>>
>>> First, your index and analysis chains are considerably different, this
>>> can easily be a source of problems. In particular, using two different
>>> tokenizers is a huge red flag. I _strongly_ recommend against this unless
>>> you’re totally sure you understand the consequences. Additionally, your use
>>> of the length filter is suspicious, especially since your problem statement
>>> is about the addition of a single letter term and the min length allowed on
>>> that filter is 2. That said, it’s reasonable to suppose that the ’a’ is
>>> filtered out in both cases, but maybe you’ve found something odd about the
>>> interactions.
>>
>> I will investigate the min length and post the results later.
>>
>>> Second, I have no idea what this will do. Are the equal signs typos?
>>> Used by custom code?
>>
>> This is the URL in my application, not Solr params. That's the query string.
>>
>>> What does “species=“ do? That’s not Solr syntax, so it’s likely that
>>> all the params with an equal-sign are totally ignored unless it’s just a
>>> typo.
>>
>> This is part of the application. Species will be used later on in Solr
>> to filter the results. That's not Solr. Those are my app params.
>>
>>> Third, the easiest way to see what’s happening under the covers is to
>>> add “&debug=true” to the query and look at the parsed query. Ignore all the
>>> relevance calculations for the nonce, or specify “&debug=query” to skip
>>> that part.
>>
>> The two JSON files I've sent were produced with debugQuery=on, and the
>> explain tag is present.
>> I will try searching the way you mentioned.
>>
>> Thanks for your input
>>
>> Guilherme
>>
>> On 6 Nov 2019, at 14:14, Erick Erickson <erickerick...@gmail.com> wrote:
>>
>>
>> Fwd to another server
>>
>> First, your index and analysis chains are considerably different, this
>> can easily be a source of problems. In particular, using two different
>> tokenizers is a huge red flag. I _strongly_ recommend against this unless
>> you’re totally sure you understand the consequences. Additionally, your use
>> of the length filter is suspicious, especially since your problem statement
>> is about the addition of a single letter term and the min length allowed on
>> that filter is 2. That said, it’s reasonable to suppose that the ’a’ is
>> filtered out in both cases, but maybe you’ve found something odd about the
>> interactions.
>>
>> Second, I have no idea what this will do. Are the equal signs typos?
>> Used by custom code?
>>
>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>
>> What does “species=“ do? That’s not Solr syntax, so it’s likely that
>> all the params with an equal-sign are totally ignored unless it’s just a
>> typo.
>>
>> Third, the easiest way to see what’s happening under the covers is to
>> add “&debug=true” to the query and look at the parsed query. Ignore all the
>> relevance calculations for the nonce, or specify “&debug=query” to skip
>> that part.
>>
>> 90% + of the time, the question “why didn’t this query do what I
>> expect” is answered by looking at the “&debug=query” output and the
>> analysis page in the admin UI. NOTE: for the analysis page be sure to look
>> at _both_ the query and index output. Also, and very important about the
>> analysis page (and this is confusing) is that this _assumes_ that what you
>> put in the text boxes have made it through the query parser intact and is
>> analyzed by the field selected. Consider the search "q=field:word1 word2".
>> Now you type “word1 word2” into the analysis text box and it looks like
>> what you expect. That’s misleading because the query is _parsed_ as
>> "field:word1 default_search_field:word2". This is where “&debug=query”
>> helps.
>>
>>
>> Best,
>> Erick
>>
>> On Nov 6, 2019, at 2:36 AM, Paras Lehana <paras.leh...@indiamart.com> wrote:
>>
>>
>> Hi Walter,
>>
>>> The solr.StopFilter removes all tokens that are stopwords. Those words
>>> will not be in the index, so they can never match a query.
>>
>> I think the OP's concern is different results when adding a stopword. I
>> think he's using the filter factory correctly - the query chain includes
>> the filter as well so it should remove "a" while querying.
>>
>> *@Guilherme*, please post results for both the query, the document in
>> result you are concerned about and post full result of analysis screen (for
>> both query and index).
>>
>> On Tue, 5 Nov 2019 at 21:38, Walter Underwood <wun...@wunderwood.org> wrote:
>>
>>
>> No.
>>
>> The solr.StopFilter removes all tokens that are stopwords. Those words
>> will not be in the index, so they can never match a query.
>>
>> 1. Remove the lines with solr.StopFilter from every analysis chain in
>> schema.xml.
>> 2. Reload the collection, restart Solr, or whatever to read the new
>> config.
>> 3. Reindex all of the documents.
>>
>> When indexed with the new analysis chain, the stopwords will not be
>> removed and they will be searchable.
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>>
>>
>> On Nov 5, 2019, at 8:56 AM, Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
>>
>>
>> Ok. I am kind of lost now.
>> If I open up the console > analysis and perform it, that's the final
>> result.
>>
>> <Screenshot 2019-11-05 at 14.54.16.png>
>>
>> Your suggestion is: get rid of the <filter stopword.txt> in the
>> schema.xml and during index phase replaceAll("in stopwords.txt"," ") then
>> add to solr. Is that correct ?
>>
>>
>> Thanks David
>>
>> On 5 Nov 2019, at 14:48, David Hastings <hastings.recurs...@gmail.com>
>> wrote:
>>
>>
>> Fwd to another server
>>
>> no,
>>   <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>
>> is still using stopwords and should be removed, in my opinion of course,
>> based on your use case may be different, but i generally axe any reference
>> to them at all
>>
>> On Tue, Nov 5, 2019 at 9:47 AM Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
>>
>>
>> Thanks.
>> Haven't I done this here ?
>> <fieldType name="text_field" class="solr.TextField"
>> positionIncrementGap="100" omitNorms="false" >
>> <analyzer type="index">
>>   <tokenizer class="solr.StandardTokenizerFactory"/>
>>   <filter class="solr.ClassicFilterFactory"/>
>>   <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>   <filter class="solr.LowerCaseFilterFactory"/>
>>   <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>> </analyzer>
>>
>>
>> On 5 Nov 2019, at 14:15, David Hastings <hastings.recurs...@gmail.com>
>> wrote:
>>
>> Fwd to another server
>>
>> The first thing you should do is remove any reference to stop words and
>> never use them, then re-index your data and try it again.
>>
>>
>> On Tue, Nov 5, 2019 at 9:14 AM Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
>>
>>
>> Hi,
>>
>> I am performing a search to match a name (text_field), however this term
>> contains 'and' and 'a' and it doesn't return any records. If I remove 'a'
>> then it works.
>> e.g.
>> Search term: lymphoid and a non-lymphoid cell
>> doesn't work:
>>
>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>
>> Search term: lymphoid and non-lymphoid cell
>> works:
>>
>> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>
>> interested in the first result
>>
>> schema.xml
>> <field name="name" type="text_field" indexed="true" stored="true"
>> omitNorms="false" required="true" multiValued="false"/>
>>
>> <analyzer type="query">
>>   <tokenizer class="solr.PatternTokenizerFactory"
>> pattern="[^a-zA-Z0-9/._:]"/>
>>   <filter class="solr.PatternReplaceFilterFactory"
>> pattern="^[/._:]+" replacement=""/>
>>   <filter class="solr.PatternReplaceFilterFactory"
>> pattern="[/._:]+$" replacement=""/>
>>   <filter class="solr.PatternReplaceFilterFactory"
>> pattern="[_]" replacement=" "/>
>>   <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>   <filter class="solr.LowerCaseFilterFactory"/>
>>   <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>> </analyzer>
>>
>> <fieldType name="text_field" class="solr.TextField"
>> positionIncrementGap="100" omitNorms="false" >
>> <analyzer type="index">
>>   <tokenizer class="solr.StandardTokenizerFactory"/>
>>   <filter class="solr.ClassicFilterFactory"/>
>>   <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>   <filter class="solr.LowerCaseFilterFactory"/>
>>   <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>> </analyzer>
>> <analyzer type="query">
>>   <tokenizer class="solr.PatternTokenizerFactory"
>> pattern="[^a-zA-Z0-9/._:]"/>
>>   <filter class="solr.PatternReplaceFilterFactory"
>> pattern="^[/._:]+" replacement=""/>
>>   <filter class="solr.PatternReplaceFilterFactory"
>> pattern="[/._:]+$" replacement=""/>
>>   <filter class="solr.PatternReplaceFilterFactory"
>> pattern="[_]" replacement=" "/>
>>   <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>   <filter class="solr.LowerCaseFilterFactory"/>
>>   <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>> </analyzer>
>> </fieldType>
>>
>> stopwords.txt
>> #Standard english stop words taken from Lucene's StopAnalyzer
>>
>> a
>> b
>> c
>> ....
>> an
>> and
>> are
>>
>> Running SolR 6.6.2.
>>
>> Is there anything I could do to prevent this ?
>>
>> Thanks
>> Guilherme
>>
>>
>>
>>
>>
>>
>>
>> --
>> --
>> Regards,
>>
>> *Paras Lehana* [65871]
>> Development Engineer, Auto-Suggest,
>> IndiaMART Intermesh Ltd.
>>
>> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
>> Noida, UP, IN - 201303
>>
>> Mob.: +91-9560911996
>> Work: 01203916600 | Extn:  *8173*
>>
>> --
>> IMPORTANT:
>> NEVER share your IndiaMART OTP/ Password with anyone.
>>
>

-- 
-- 
Regards,

*Paras Lehana* [65871]
Development Engineer, Auto-Suggest,
IndiaMART Intermesh Ltd.

8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
Noida, UP, IN - 201303

Mob.: +91-9560911996
Work: 01203916600 | Extn:  *8173*

-- 
IMPORTANT: 
NEVER share your IndiaMART OTP/ Password with anyone.
