Hi,

> Have you tried reindexing the documents and compare the results? No issues
> if you cannot do that - let's try something else. I was going through the
> whole mail and your files. You had said:
Yes, but since it hasn't worked as expected, I kept things as you suggested.

> As soon as I add dbId or stId (regardless of the boost, 1.0 or 100.0), then I
>> don't get anything (which makes sense).
> 
> Why did you think that not getting anything when you add dbId made sense?
> Asking because I may be missing something here.
I am searching for text, and querying that text against an ID field wouldn't 
make sense.
(I will come back to this soon.)
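
To make that concrete, the kind of request I mean looks roughly like this (the handler, query and boost value are only illustrative, taken from the earlier mails):

/search?defType=edismax&q=Immunoregulatory interactions between a Lymphoid and a non-Lymphoid cell&qf=dbId^100.0

Since dbId only holds numeric keys, none of the query terms can ever match it, so with dbId alone in qf nothing comes back.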

Ok, I've been adding and removing fields in the qf and I could isolate half of 
the problem. First, I have a field type called keyword_field; I added the 
StopWords filter to it and it worked. Second, the problem comes back when I add 
the fields whose type is id (<fieldType name="id" class="solr.StrField" />).

Do you think I should also add the stopwords filter to the fieldType id?
(I tried it and it worked, but I am not sure this is conceptually correct; an 
id should remain intact, from my understanding.)

Thanks
Guilherme

> On 18 Nov 2019, at 05:37, Paras Lehana <paras.leh...@indiamart.com> wrote:
> 
> Hi Guilherme,
> 
> Have you tried reindexing the documents and compare the results? No issues
> if you cannot do that - let's try something else. I was going through the
> whole mail and your files. You had said:
> 
> As soon as I add dbId or stId (regardless of the boost, 1.0 or 100.0), then I
>> don't get anything (which makes sense).
> 
> 
> Why did you think that not getting anything when you add dbId made sense?
> Asking because I may be missing something here.
> 
> Also, what is the purpose of so many qf's? Going through your documents and
> config files, I found that your dbId's are strings of numbers and I don't
> think you want to find your query terms in dbId, right?
> Do you want to boost the score by the values in dbId?
> 
> Your qf of dbId^100 boosts documents containing terms in q by 100x. Since
> your terms don't match with the values in dbId for any document, the score
> produced by this scoring is 0. 100x or 1x of 0 is still 0.
> I still need to see how this scoring gets added up in edismax parser but do
> reevaluate the usage of these qfs. Same goes for other qf boosts. :)
> 
> 
> On Fri, 15 Nov 2019 at 12:23, Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
> 
>> Hi Paras
>> No worries.
>> No I didn’t find anything. This is annoying now...
>> Yes! They do contain dbId. Absolutely all my docs contains dbId and it is
>> actually my key, if you check again the schema.xml
>> 
>> Cheers
>> Guilherme
>> 
>> On 15 Nov 2019, at 05:37, Paras Lehana <paras.leh...@indiamart.com> wrote:
>> 
>> 
>> Hey Guilherme,
>> 
>> I was a bit busy for the past few days and couldn't read your mail. So,
>> did you find anything? Anyways, as I had expected, the culprit is
>> definitely among the qfs. Do the documents in concern contain dbId? I
>> suggest you to cross check the fields in your document with those impacting
>> the result in qf.
>> 
>> On Tue, 12 Nov 2019 at 16:14, Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
>> 
>>> What I can't understand is:
>>> I search for the exact term "Immunoregulatory interactions between a
>>> Lymphoid *and a* non-Lymphoid cell" and get nothing, but if I search
>>> "Immunoregulatory interactions between a Lymphoid *and* non-Lymphoid
>>> cell" then it works
>>> 
>>> On 11 Nov 2019, at 12:24, Guilherme Viteri <gvit...@ebi.ac.uk> wrote:
>>> 
>>> Thanks
>>> 
>>> Removing stopwords is another story. I'm curious to find the reason
>>> assuming that you keep on using stopwords. In some cases, stopwords are
>>> really necessary.
>>> 
>>> Yes. It always makes sense the way we've been using it.
>>> 
>>> If q.alt is giving you responses, it's confirmed that your stopwords filter
>>> is working as expected. The problem definitely lies in the configuration of
>>> edismax.
>>> 
>>> I see.
>>> 
>>> *Let me explain again:* In your solrconfig.xml, look at your /search
>>> 
>>> Ok, using q now, removed all qf, performed the search and I got 23
>>> results, and the one I really want, on the top.
>>> As soon as I add dbId or stId (regardless of the boost, 1.0 or 100.0), then
>>> I don't get anything (which makes sense). However, if I query name_exact, I
>>> get the 23 results again, and unfortunately if I query stId^1.0
>>> name_exact^10.0 I still don't get any results.
>>> 
>>> In summary
>>> - without qf - 23 results
>>> - dbId - 0 results
>>> - name_exact - 16 results
>>> - name - 23 results
>>> - dbId^1.0 name_exact^10.0 - 0 results
>>> - 0 results if any other, stId, dbId (key) is added on top of the
>>> name(name_exact, etc).
>>> 
>>> Definitely lost here! :-/
>>> 
>>> 
>>> On 11 Nov 2019, at 07:59, Paras Lehana <paras.leh...@indiamart.com>
>>> wrote:
>>> 
>>> Hi
>>> 
>>> So I don't think removing it completely is the way to go from the scenario
>>> we have
>>> 
>>> 
>>> 
>>> Removing stopwords is another story. I'm curious to find the reason
>>> assuming that you keep on using stopwords. In some cases, stopwords are
>>> really necessary.
>>> 
>>> 
>>> Quite a considerable increase
>>> 
>>> 
>>> If q.alt is giving you responses, it's confirmed that your stopwords filter
>>> is working as expected. The problem definitely lies in the configuration of
>>> edismax.
>>> 
>>> 
>>> 
>>> I am sorry but I didn't understand what you want me to do exactly with
>>> the lst (??) and qf and bf.
>>> 
>>> 
>>> 
>>> What combinations did you try? I was referring to the field-level boosting
>>> you have applied in edismax config.
>>> 
>>> *Let me explain again:* In your solrconfig.xml, look at your /search
>>> request handler. There are many qf and some bq boosts. I want you to
>>> remove
>>> all of these, check response again (with q now) and keep on adding them
>>> again (one by one) while looking for when the numFound drastically
>>> changes.
>>> 
>>> On Fri, 8 Nov 2019 at 23:47, David Hastings <hastings.recurs...@gmail.com
>>>> 
>>> wrote:
>>> 
>>> I use 3 word shingles with stopwords for my MLT ML trainer that worked
>>> pretty well for such a solution, but for a full index the size became
>>> prohibitive
>>> 
>>> On Fri, Nov 8, 2019 at 12:13 PM Walter Underwood <wun...@wunderwood.org>
>>> wrote:
>>> 
>>> If we had IDF for phrases, they would be super effective. The 2X weight is
>>> a hack that mostly works.
>>> 
>>> Infoseek had phrase IDF and it was a killer algorithm for relevance.
>>> 
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>> 
>>> On Nov 8, 2019, at 11:08 AM, David Hastings <
>>> 
>>> hastings.recurs...@gmail.com> wrote:
>>> 
>>> 
>>> the pf and qf fields are REALLY nice for this
>>> 
>>> On Fri, Nov 8, 2019 at 12:02 PM Walter Underwood <
>>> 
>>> wun...@wunderwood.org>
>>> 
>>> wrote:
>>> 
>>> I always enable phrase searching in edismax for exactly this reason.
>>> 
>>> Something like:
>>> 
>>>    <str name="qf">title^8 keywords^4 text</str>
>>>    <str name="pf">title^16 keywords^8 text^2</str>
>>> 
>>> To deal with concepts in queries, a classifier and/or named entity
>>> extractor can be helpful. If you have a list of concepts (“controlled
>>> vocabulary”) that includes “Lamin A”, and that shows up in a query,
>>> 
>>> that
>>> 
>>> term can be queried against the field matching that vocabulary.
>>> 
>>> This is how LinkedIn separates people, companies, and places, for
>>> 
>>> example.
>>> 
>>> 
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>> 
>>> On Nov 8, 2019, at 10:48 AM, Erick Erickson <erickerick...@gmail.com
>>> 
>>> 
>>> wrote:
>>> 
>>> 
>>> Look at the “mm” parameter, try setting it to 100%. Although that’s not
>>> entirely likely to do what you want either since virtually every doc will
>>> have “a” in it. But at least you’d get docs that have both terms.
>>> 
>>> 
>>> you may also be able to search for things like “Lamin A” _only as a
>>> phrase_ and have some luck. But this is a gnarly problem in general. Some
>>> people have been able to substitute synonyms and/or shingles to make this
>>> work at the expense of a larger index.
>>> 
>>> 
>>> This is a generic problem with context. “Lamin A” is really a “concept”,
>>> not just two words that happen to be near each other. Searching as a phrase
>>> is an OOB-but-naive way to try to make it more likely that the ranked
>>> results refer to the _concept_ of “Lamin A”. The assumption here is “if
>>> these two words appear next to each other, they’re more likely to be what I
>>> want”. I say “naive” because “Lamins: A new approach to...” would _also_ be
>>> found for a naive phrase search. (I have no idea whether such a title makes
>>> sense or not, but you figured that out already)...
>>> 
>>> 
>>> To do this well you’d have to dive in to NLP/Machine learning.
>>> 
>>> I truly wish we could have the DWIM search algorithm (Do What I
>>> 
>>> Mean)….
>>> 
>>> 
>>> On Nov 8, 2019, at 11:29 AM, Guilherme Viteri <gvit...@ebi.ac.uk>
>>> 
>>> wrote:
>>> 
>>> 
>>> Hi Walter and Paras
>>> 
>>> I indexed it removing all the references to StopWordFilter and I went
>>> from 121 results to near 20K, as the search term q="Lymphoid and a
>>> non-Lymphoid cell" is matching entities such as "IFT A" or "Lamin A". So I
>>> don't think removing it completely is the way to go for the scenario we
>>> have, but I appreciate the suggestion…
>>> 
>>> 
>>> Yes the response is using fl=*
>>> I am trying some combinations at the moment, but yet no success.
>>> 
>>> defType=edismax
>>> q.alt=Lymphoid and a non-Lymphoid cell
>>> Number of results=1599
>>> Quite a considerable increase, even though reasonable meaningful
>>> 
>>> results.
>>> 
>>> 
>>> I am sorry but I didn't understand what you want me to do exactly
>>> with the lst (??) and qf and bf.
>>> 
>>> 
>>> Thanks everyone with their inputs
>>> 
>>> 
>>> On 8 Nov 2019, at 06:45, Paras Lehana <paras.leh...@indiamart.com>
>>> 
>>> wrote:
>>> 
>>> 
>>> Hi Guilherme
>>> 
>>> By accident, I ended up querying the using the default handler
>>> 
>>> (/select) and it worked.
>>> 
>>> 
>>> You've just found the culprit. Thanks for giving the material I
>>> requested. Your analysis chain is working as expected. I don't see any
>>> issue in either StopWordFilter or your boosts. I also use a boost of 50
>>> when boosting contextual suggestions (boosting "gold iphone" on a page of
>>> iphone) but I take Walter's suggestion and would try to optimize my
>>> weights. I agree that this 50 thing was not researched much by us as
>>> well (we never faced performance or relevance issues).
>>> 
>>> 
>>> See the major difference in both the handlers - edismax. I'm pretty
>>> sure that your problem lies in the parsing of queries (you can confirm that
>>> from parsedquery key in debug of both JSON responses). I hope you have
>>> provided the response with fl=*. Replace q with q.alt in your /search
>>> handler query and I think you should start getting responses. That's
>>> because q.alt uses standard parser. If you want to keep using edisMax, I
>>> suggest you to test the responses removing some combination of lst (qf, bf)
>>> and find what's restricting the documents to come up. I'm out of office
>>> today - would have certainly tried analyzing the field values of the
>>> document in /select request and compare it with qf/bq in solrconfig.xml
>>> /search. Do this for me and you'd certainly find something.
>>> 
>>> On Thu, 7 Nov 2019 at 21:00, Walter Underwood <
>>> 
>>> wun...@wunderwood.org
>>> 
>>> <mailto:wun...@wunderwood.org>> wrote:
>>> 
>>> I normally use a weight of 8 for the most important field, like title.
>>> Other fields might get a 4 or 2.
>>> 
>>> I add a “pf” field with the weights doubled, so that phrase matches
>>> have a higher weight.
>>> 
>>> 
>>> The weight of 8 comes from experience at Infoseek and Inktomi, two
>>> early web search engines. With different relevance algorithms and totally
>>> different evaluation and tuning systems, they settled on weights of 8 and
>>> 7.5 for HTML titles. With the two radically different systems getting
>>> the same number, I decided that was a property of the documents, not of the
>>> search engines.
>>> 
>>> 
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
>>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>
>>> 
>>> (my blog)
>>> 
>>> 
>>> On Nov 7, 2019, at 9:03 AM, Guilherme Viteri <gvit...@ebi.ac.uk
>>> 
>>> <mailto:gvit...@ebi.ac.uk>> wrote:
>>> 
>>> 
>>> Hi Wunder,
>>> 
>>> My indexer takes quite a few hours to run. I am shortening it
>>> to run faster, but I also need to make sure it gives what we are expecting.
>>> This implementation's been there for >4y, and is massively used.
>>> 
>>> 
>>> In your edismax handlers, weights of 20, 50, and 100 are extremely
>>> high. I don’t think I’ve ever used a weight higher than 16 in a dozen years
>>> of configuring Solr.
>>> 
>>> I've inherited that implementation and I am really keen to adjust
>>> it properly; what would you recommend?
>>> 
>>> 
>>> Cheers
>>> Guilherme
>>> 
>>> On 7 Nov 2019, at 14:43, Walter Underwood <wun...@wunderwood.org
>>> 
>>> <mailto:wun...@wunderwood.org>> wrote:
>>> 
>>> 
>>> Thanks for posting the files. Looking at schema.xml, I see that you
>>> still are using StopFilterFactory. The first advice we gave you was to
>>> remove that.
>>> 
>>> Remove StopFilterFactory everywhere and reindex.
>>> 
>>> You will continue to have problems matching stopwords until you do that.
>>> 
>>> 
>>> In your edismax handlers, weights of 20, 50, and 100 are extremely
>>> high. I don’t think I’ve ever used a weight higher than 16 in a dozen years
>>> of configuring Solr.
>>> 
>>> 
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
>>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/
>>> 
>>> 
>>> (my blog)
>>> 
>>> 
>>> On Nov 7, 2019, at 6:56 AM, Guilherme Viteri <gvit...@ebi.ac.uk
>>> 
>>> <mailto:gvit...@ebi.ac.uk>> wrote:
>>> 
>>> 
>>> Hi Paras, everyone
>>> 
>>> Thank you again for your inputs and suggestions. I'm sorry to hear
>>> you had trouble with the attachments; I will host them somewhere and share
>>> the links.
>>> 
>>> I don't tweak my index; I get the data from the graph database,
>>> create the documents as they are and save them to Solr.
>>> 
>>> So, I am sending the new analysis screen, querying the way you
>>> suggested. Also the results with params and the Solr query URL.
>>> 
>>> During the process of querying what you asked I found something
>>> really weird (at least for me). By accident, I ended up querying using
>>> the default handler (/select) and it worked. Then if I use the one I must
>>> use, it sadly doesn't work. I am posting both results and I will also
>>> post the handlers as well.
>>> 
>>> 
>>> Here is the link with all the files mentioned before
>>> 
>>> https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0
>>> 
>>> If the link doesn't work: www dot dropbox dot com slash sh slash
>>> fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a ? dl equals 0
>>> 
>>> 
>>> Thanks
>>> 
>>> On 7 Nov 2019, at 05:23, Paras Lehana <
>>> 
>>> paras.leh...@indiamart.com
>>> 
>>> <mailto:paras.leh...@indiamart.com>> wrote:
>>> 
>>> 
>>> Hi Guilherme.
>>> 
>>> I am sending they analysis result and the json result as
>>> 
>>> requested.
>>> 
>>> 
>>> 
>>> Thanks for the effort. Luckily, I can see your attachments (low quality
>>> though).
>>> 
>>> From the analysis screen, the analysis is working as expected. One of the
>>> reasons for query="lymphoid and *a* non-lymphoid cell" not matching the
>>> document containing "Lymphoid and a non-Lymphoid cell" I can initially
>>> think of is: the stopword "a" is probably present post-analysis in either
>>> the query or the index. Did you tweak your index time analysis after
>>> indexing?
>>> 
>>> 
>>> Do two things:
>>> 
>>> 1. Post the analysis screen for index=*"Immunoregulatory
>>> interactions between a Lymphoid and a non-Lymphoid cell"* and query=*"lymphoid
>>> and a non-lymphoid cell"*. Try hosting the image and providing the link
>>> here.
>>> 2. Give the same JSON output as you have sent, but this time with
>>> *"echoParams=all"*. Also, post the exact Solr query URL.
>>> 
>>> 
>>> 
>>> On Wed, 6 Nov 2019 at 21:07, Erick Erickson <
>>> 
>>> erickerick...@gmail.com <mailto:erickerick...@gmail.com>> wrote:
>>> 
>>> 
>>> I don’t see the attachments, maybe I deleted old e-mails or some
>>> such. The Apache server is fairly aggressive about stripping attachments
>>> though, so it’s also possible they didn’t make it through.
>>> 
>>> On Nov 6, 2019, at 9:28 AM, Guilherme Viteri <
>>> 
>>> gvit...@ebi.ac.uk
>>> 
>>> <mailto:gvit...@ebi.ac.uk>> wrote:
>>> 
>>> 
>>> Thanks Erick.
>>> 
>>> First, your index and analysis chains are considerably different, this
>>> can easily be a source of problems. In particular, using two different
>>> tokenizers is a huge red flag. I _strongly_ recommend against this unless
>>> you’re totally sure you understand the consequences. Additionally, your use
>>> of the length filter is suspicious, especially since your problem statement
>>> is about the addition of a single letter term and the min length allowed on
>>> that filter is 2. That said, it’s reasonable to suppose that the ’a’ is
>>> filtered out in both cases, but maybe you’ve found something odd about the
>>> interactions.
>>> 
>>> I will investigate the min length and post the results later.
>>> 
>>> Second, I have no idea what this will do. Are the equal
>>> 
>>> signs
>>> 
>>> typos?
>>> 
>>> Used by custom code?
>>> 
>>> This is the URL in my application, not Solr params. That's the
>>> query string.
>>> 
>>> 
>>> What does “species=“ do? That’s not Solr syntax, so it’s
>>> 
>>> likely
>>> 
>>> that
>>> 
>>> all the params with an equal-sign are totally ignored unless
>>> 
>>> it’s
>>> 
>>> just a
>>> 
>>> typo.
>>> 
>>> This is part of the application. Species will be used later on
>>> in Solr to filter the results. That's not Solr; those are my app params.
>>> 
>>> 
>>> Third, the easiest way to see what’s happening under the
>>> 
>>> covers
>>> 
>>> is to
>>> 
>>> add “&debug=true” to the query and look at the parsed query.
>>> 
>>> Ignore all the
>>> 
>>> relevance calculations for the nonce, or specify
>>> 
>>> “&debug=query”
>>> 
>>> to skip
>>> 
>>> that part.
>>> 
>>> The two JSON files I've sent have debugQuery=on and the explain tag
>>> is present.
>>> 
>>> I will try the searching the way you mentioned.
>>> 
>>> Thank for your inputs
>>> 
>>> Guilherme
>>> 
>>> On 6 Nov 2019, at 14:14, Erick Erickson <
>>> 
>>> erickerick...@gmail.com <mailto:erickerick...@gmail.com>>
>>> 
>>> wrote:
>>> 
>>> 
>>> Fwd to another server
>>> 
>>> First, your index and analysis chains are considerably
>>> 
>>> different, this
>>> 
>>> can easily be a source of problems. In particular, using two
>>> 
>>> different
>>> 
>>> tokenizers is a huge red flag. I _strongly_ recommend against
>>> 
>>> this unless
>>> 
>>> you’re totally sure you understand the consequences.
>>> 
>>> Additionally, your use
>>> 
>>> of the length filter is suspicious, especially since your
>>> 
>>> problem
>>> 
>>> statement
>>> 
>>> is about the addition of a single letter term and the min
>>> 
>>> length
>>> 
>>> allowed on
>>> 
>>> that filter is 2. That said, it’s reasonable to suppose that
>>> 
>>> the
>>> 
>>> ’a’ is
>>> 
>>> filtered out in both cases, but maybe you’ve found something
>>> 
>>> odd
>>> 
>>> about the
>>> 
>>> interactions.
>>> 
>>> 
>>> Second, I have no idea what this will do. Are the equal
>>> 
>>> signs
>>> 
>>> typos?
>>> 
>>> Used by custom code?
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>> 
>>> 
>>> 
>>> What does “species=“ do? That’s not Solr syntax, so it’s
>>> 
>>> likely
>>> 
>>> that
>>> 
>>> all the params with an equal-sign are totally ignored unless
>>> 
>>> it’s
>>> 
>>> just a
>>> 
>>> typo.
>>> 
>>> 
>>> Third, the easiest way to see what’s happening under the
>>> 
>>> covers
>>> 
>>> is to
>>> 
>>> add “&debug=true” to the query and look at the parsed query.
>>> 
>>> Ignore all the
>>> 
>>> relevance calculations for the nonce, or specify
>>> 
>>> “&debug=query”
>>> 
>>> to skip
>>> 
>>> that part.
>>> 
>>> 
>>> 90% + of the time, the question “why didn’t this query do what I
>>> expect” is answered by looking at the “&debug=query” output and the
>>> analysis page in the admin UI. NOTE: for the analysis page be sure to look
>>> at _both_ the query and index output. Also, and very important about the
>>> analysis page (and this is confusing) is that this _assumes_ that what you
>>> put in the text boxes have made it through the query parser intact and is
>>> analyzed by the field selected. Consider the search "q=field:word1 word2".
>>> Now you type “word1 word2” into the analysis text box and it looks like
>>> what you expect. That’s misleading because the query is _parsed_ as
>>> "field:word1 default_search_field:word2”. This is where “&debug=query”
>>> helps.
>>> 
>>> 
>>> Best,
>>> Erick
>>> 
>>> On Nov 6, 2019, at 2:36 AM, Paras Lehana <
>>> 
>>> paras.leh...@indiamart.com <mailto:paras.leh...@indiamart.com>>
>>> 
>>> wrote:
>>> 
>>> 
>>> Hi Walter,
>>> 
>>> The solr.StopFilter removes all tokens that are stopwords.
>>> 
>>> Those words
>>> 
>>> will
>>> 
>>> not be in the index, so they can never match a query.
>>> 
>>> 
>>> 
>>> I think the OP's concern is different results when adding a
>>> 
>>> stopword. I
>>> 
>>> think he's using the filter factory correctly - the query
>>> 
>>> chain
>>> 
>>> includes
>>> 
>>> the filter as well so it should remove "a" while querying.
>>> 
>>> *@Guilherme*, please post results for both the query, the
>>> 
>>> document in
>>> 
>>> result you are concerned about and post full result of
>>> 
>>> analysis screen
>>> 
>>> (for
>>> 
>>> both query and index).
>>> 
>>> On Tue, 5 Nov 2019 at 21:38, Walter Underwood <
>>> 
>>> wun...@wunderwood.org <mailto:wun...@wunderwood.org>>
>>> 
>>> wrote:
>>> 
>>> 
>>> No.
>>> 
>>> The solr.StopFilter removes all tokens that are stopwords. Those words
>>> will not be in the index, so they can never match a query.
>>> 
>>> 1. Remove the lines with solr.StopFilter from every analysis chain in
>>> schema.xml.
>>> 2. Reload the collection, restart Solr, or whatever to read the new
>>> config.
>>> 3. Reindex all of the documents.
>>> 
>>> When indexed with the new analysis chain, the stopwords will not be
>>> removed and they will be searchable.
>>> 
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
>>> http://observer.wunderwood.org/ <
>>> 
>>> http://observer.wunderwood.org/>  (my blog)
>>> 
>>> 
>>> On Nov 5, 2019, at 8:56 AM, Guilherme Viteri <
>>> 
>>> gvit...@ebi.ac.uk <mailto:gvit...@ebi.ac.uk>>
>>> 
>>> wrote:
>>> 
>>> 
>>> Ok. I am kinda lost now.
>>> If I open up the console > analysis and perform it, that's the final
>>> result.
>>> 
>>> <Screenshot 2019-11-05 at 14.54.16.png>
>>> 
>>> Your suggestion is: get rid of the <filter stopword.txt> in the
>>> schema.xml and, during the index phase, replaceAll("in stopwords.txt"," ")
>>> then add to Solr. Is that correct?
>>> 
>>> 
>>> Thanks David
>>> 
>>> On 5 Nov 2019, at 14:48, David Hastings <
>>> 
>>> hastings.recurs...@gmail.com <mailto:
>>> 
>>> hastings.recurs...@gmail.com
>>> 
>>> 
>>> <mailto:hastings.recurs...@gmail.com <mailto:
>>> 
>>> hastings.recurs...@gmail.com>>> wrote:
>>> 
>>> 
>>> Fwd to another server
>>> 
>>> no,
>>>   <filter class="solr.StopFilterFactory" ignoreCase="true"
>>> words="stopwords.txt"/>
>>> is still using stopwords and should be removed, in my opinion of course;
>>> based on your use case it may be different, but I generally axe any
>>> reference to them at all
>>> 
>>> On Tue, Nov 5, 2019 at 9:47 AM Guilherme Viteri <
>>> 
>>> gvit...@ebi.ac.uk <mailto:gvit...@ebi.ac.uk>
>>> 
>>> <mailto:gvit...@ebi.ac.uk <mailto:gvit...@ebi.ac.uk>>>
>>> 
>>> wrote:
>>> 
>>> 
>>> Thanks.
>>> Haven't I done this here ?
>>> <fieldType name="text_field" class="solr.TextField"
>>> positionIncrementGap="100" omitNorms="false" >
>>> <analyzer type="index">
>>>   <tokenizer class="solr.StandardTokenizerFactory"/>
>>>   <filter class="solr.ClassicFilterFactory"/>
>>>   <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>   <filter class="solr.LowerCaseFilterFactory"/>
>>>   <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>> </analyzer>
>>> 
>>> 
>>> On 5 Nov 2019, at 14:15, David Hastings <
>>> 
>>> hastings.recurs...@gmail.com <mailto:
>>> 
>>> hastings.recurs...@gmail.com
>>> 
>>> 
>>> <mailto:hastings.recurs...@gmail.com <mailto:
>>> 
>>> hastings.recurs...@gmail.com>>>
>>> 
>>> wrote:
>>> 
>>> 
>>> Fwd to another server
>>> 
>>> The first thing you should do is remove any reference to stop words and
>>> never use them, then re-index your data and try it again.
>>> 
>>> 
>>> On Tue, Nov 5, 2019 at 9:14 AM Guilherme Viteri <
>>> 
>>> gvit...@ebi.ac.uk <mailto:gvit...@ebi.ac.uk>
>>> 
>>> <mailto:gvit...@ebi.ac.uk <mailto:gvit...@ebi.ac.uk>>>
>>> 
>>> wrote:
>>> 
>>> 
>>> Hi,
>>> 
>>> I am performing a search to match a name (text_field), however this term
>>> contains 'and' and 'a' and it doesn't return any records. If I remove 'a'
>>> then it works.
>>> e.g
>>> Search Term: lymphoid and a non-lymphoid cell
>>> doesn't work:
>>> 
>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>> 
>>> Search term: lymphoid and non-lymphoid cell
>>> works:
>>> 
>>> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell&species=Homo+sapiens&species=Entries+without+species&cluster=true
>>> 
>>> interested in the first result
>>> 
>>> schema.xml
>>> <field name="name" type="text_field" indexed="true" stored="true"
>>> omitNorms="false" required="true" multiValued="false"/>
>>> 
>>> <analyzer type="query">
>>>   <tokenizer class="solr.PatternTokenizerFactory" pattern="[^a-zA-Z0-9/._:]"/>
>>>   <filter class="solr.PatternReplaceFilterFactory" pattern="^[/._:]+" replacement=""/>
>>>   <filter class="solr.PatternReplaceFilterFactory" pattern="[/._:]+$" replacement=""/>
>>>   <filter class="solr.PatternReplaceFilterFactory" pattern="[_]" replacement=" "/>
>>>   <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>   <filter class="solr.LowerCaseFilterFactory"/>
>>>   <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>> </analyzer>
>>> 
>>> <fieldType name="text_field" class="solr.TextField"
>>> positionIncrementGap="100" omitNorms="false" >
>>> <analyzer type="index">
>>>   <tokenizer class="solr.StandardTokenizerFactory"/>
>>>   <filter class="solr.ClassicFilterFactory"/>
>>>   <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>   <filter class="solr.LowerCaseFilterFactory"/>
>>>   <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>> </analyzer>
>>> <analyzer type="query">
>>>   <tokenizer class="solr.PatternTokenizerFactory" pattern="[^a-zA-Z0-9/._:]"/>
>>>   <filter class="solr.PatternReplaceFilterFactory" pattern="^[/._:]+" replacement=""/>
>>>   <filter class="solr.PatternReplaceFilterFactory" pattern="[/._:]+$" replacement=""/>
>>>   <filter class="solr.PatternReplaceFilterFactory" pattern="[_]" replacement=" "/>
>>>   <filter class="solr.LengthFilterFactory" min="2" max="20"/>
>>>   <filter class="solr.LowerCaseFilterFactory"/>
>>>   <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>>> </analyzer>
>>> </fieldType>
>>> 
>>> stopwords.txt
>>> #Standard english stop words taken from Lucene's
>>> 
>>> StopAnalyzer
>>> 
>>> a
>>> b
>>> c
>>> ....
>>> an
>>> and
>>> are
>>> 
>>> Running SolR 6.6.2.
>>> 
>>> Is there anything I could do to prevent this ?
>>> 
>>> Thanks
>>> Guilherme
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> --
>>> --
>>> Regards,
>>> 
>>> *Paras Lehana* [65871]
>>> Development Engineer, Auto-Suggest,
>>> IndiaMART Intermesh Ltd.
>>> 
>>> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
>>> Noida, UP, IN - 201303
>>> 
>>> Mob.: +91-9560911996
>>> Work: 01203916600 | Extn:  *8173*
>>> 
>>> --
>>> IMPORTANT:
>>> NEVER share your IndiaMART OTP/ Password with anyone.
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> --
>>> --
>>> Regards,
>>> 
>>> *Paras Lehana* [65871]
>>> Development Engineer, Auto-Suggest,
>>> IndiaMART Intermesh Ltd.
>>> 
>>> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
>>> Noida, UP, IN - 201303
>>> 
>>> Mob.: +91-9560911996
>>> Work: 01203916600 | Extn:  *8173*
>>> 
>>> --
>>> IMPORTANT:
>>> NEVER share your IndiaMART OTP/ Password with anyone.
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> --
>>> --
>>> Regards,
>>> 
>>> Paras Lehana [65871]
>>> Development Engineer, Auto-Suggest,
>>> IndiaMART Intermesh Ltd.
>>> 
>>> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
>>> Noida, UP, IN - 201303
>>> 
>>> Mob.: +91-9560911996 <tel:+91-9560911996>
>>> Work: 01203916600 | Extn:  8173
>>> 
>>> IMPORTANT:
>>> NEVER share your IndiaMART OTP/ Password with anyone.
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> --
>>> --
>>> Regards,
>>> 
>>> *Paras Lehana* [65871]
>>> Development Engineer, Auto-Suggest,
>>> IndiaMART Intermesh Ltd.
>>> 
>>> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
>>> Noida, UP, IN - 201303
>>> 
>>> Mob.: +91-9560911996
>>> Work: 01203916600 | Extn:  *8173*
>>> 
>>> --
>>> IMPORTANT:
>>> NEVER share your IndiaMART OTP/ Password with anyone.
>>> 
>>> 
>>> 
>>> 
>>> 
>> 
>> --
>> --
>> Regards,
>> 
>> *Paras Lehana* [65871]
>> Development Engineer, Auto-Suggest,
>> IndiaMART Intermesh Ltd.
>> 
>> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
>> Noida, UP, IN - 201303
>> 
>> Mob.: +91-9560911996
>> Work: 01203916600 | Extn:  *8173*
>> 
>> IMPORTANT:
>> NEVER share your IndiaMART OTP/ Password with anyone.
>> 
>> 
> 
> -- 
> -- 
> Regards,
> 
> *Paras Lehana* [65871]
> Development Engineer, Auto-Suggest,
> IndiaMART Intermesh Ltd.
> 
> 8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
> Noida, UP, IN - 201303
> 
> Mob.: +91-9560911996
> Work: 01203916600 | Extn:  *8173*
> 
> -- 
> IMPORTANT: 
> NEVER share your IndiaMART OTP/ Password with anyone.
