Re: Phrase query no hits when stopwords and FlattenGraphFilterFactory used

2020-11-11 Thread Edward Turner
Many thanks, Walter, that's useful information. And yes, if we are able to
keep stopwords, then we will. We have been exploring their removal because
we've noticed it leads to a sizable drop in index size (5% in some of our
tests), which has had the knock-on effect of better performance. (Also,
unfortunately, we do not have the luxury of using super big
machines/storage -- so it's always a balancing act for us.)

Best,
Edd

Edward Turner


On Tue, 10 Nov 2020 at 16:22, Walter Underwood 
wrote:

> By far the simplest solution is to leave stopwords in the index. That also
> improves
> relevance, because it becomes possible to search for “vitamin a” or “to be
> or not to be”.
>
> Stopword removal was a performance and disk space hack from the 1960s. It
> is no
> longer needed. We were keeping stopwords in the index at Infoseek, back in
> 1996.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Nov 10, 2020, at 1:16 AM, Edward Turner  wrote:
> >
> > Hi all,
> >
> > Okay, I've been doing more research about this problem and from what I
> > understand, phrase queries + stopwords are known to have some
> difficulties
> > working together in some circumstances.
> >
> > E.g.,
> >
> https://stackoverflow.com/questions/56802656/stopwords-and-phrase-queries-solr?rq=1
> > https://issues.apache.org/jira/browse/SOLR-6468
> >
> > I was thinking about workarounds, but each solution I've attempted
> doesn't
> > quite work.
> >
> > Therefore, maybe one possible solution is to take a step back and
> > preprocess index/query data going to Solr, something like:
> >
> > String wordsForSolr = removeStopWordsFrom("This is pretend index or query data");
> > // wordsForSolr = "pretend index query data"
> >
> > Off the top of my head, this will bypass position issues.
> >
> > I will give this a go, but was wondering whether this is something others
> > have done?
> >
> > Best wishes,
> > Edd
> >
> > 
> > Edward Turner
> >
> >
> > On Fri, 6 Nov 2020 at 13:58, Edward Turner  wrote:
> >
> >> Hi all,
> >>
> >> We are experiencing some unexpected behaviour for phrase queries which
> we
> >> believe might be related to the FlattenGraphFilterFactory and stopwords.
> >>
> >> Brief description: when performing a phrase query
> >> "Molecular cloning and evolution of the" => we get expected hits
> >> "Molecular cloning and evolution of the genes" => we get no hits
> >> (unexpected behaviour)
> >>
> >> I think it's worthwhile adding the analyzers we use to help you see what
> >> we're doing:
> >>  Analyzers 
> >> <fieldType name="..." class="solr.TextField"
> >>            sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
> >>   <analyzer type="index">
> >>     <tokenizer class="solr.PatternTokenizerFactory" pattern="[- /()]+" />
> >>     <filter class="solr.StopFilterFactory" ignoreCase="true" />
> >>     <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="false" />
> >>     <filter class="solr.LowerCaseFilterFactory" />
> >>     <filter class="solr.WordDelimiterGraphFilterFactory"
> >>             generateNumberParts="1" splitOnCaseChange="0" preserveOriginal="0"
> >>             splitOnNumerics="0" stemEnglishPossessive="1" generateWordParts="1"
> >>             catenateNumbers="0" catenateWords="1" catenateAll="1" />
> >>     <filter class="solr.FlattenGraphFilterFactory" />
> >>   </analyzer>
> >>   <analyzer type="query">
> >>     <tokenizer class="solr.PatternTokenizerFactory" pattern="[- /()]+" />
> >>     <filter class="solr.StopFilterFactory" ignoreCase="true" />
> >>     <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="false" />
> >>     <filter class="solr.LowerCaseFilterFactory" />
> >>     <filter class="solr.WordDelimiterGraphFilterFactory"
> >>             generateNumberParts="1" splitOnCaseChange="0" preserveOriginal="0"
> >>             splitOnNumerics="0" stemEnglishPossessive="1" generateWordParts="1"
> >>             catenateNumbers="0" catenateWords="0" catenateAll="0" />
> >>   </analyzer>
> >> </fieldType>
> >>  End of Analyzers 
> >>
> >>  Stopwords 
> >> We use the following stopwords:
> >> a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no,
> not,
> >> of, on, or, such, that, the, their, then, there, these, they, this, to,
> >> was, will, with, which
> >>  End of Stopwords 

Re: Phrase query no hits when stopwords and FlattenGraphFilterFactory used

2020-11-10 Thread Walter Underwood
By far the simplest solution is to leave stopwords in the index. That also 
improves
relevance, because it becomes possible to search for “vitamin a” or “to be or 
not to be”.

Stopword removal was a performance and disk space hack from the 1960s. It is no
longer needed. We were keeping stopwords in the index at Infoseek, back in 1996.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Nov 10, 2020, at 1:16 AM, Edward Turner  wrote:
> 
> Hi all,
> 
> Okay, I've been doing more research about this problem and from what I
> understand, phrase queries + stopwords are known to have some difficulties
> working together in some circumstances.
> 
> E.g.,
> https://stackoverflow.com/questions/56802656/stopwords-and-phrase-queries-solr?rq=1
> https://issues.apache.org/jira/browse/SOLR-6468
> 
> I was thinking about workarounds, but each solution I've attempted doesn't
> quite work.
> 
> Therefore, maybe one possible solution is to take a step back and
> preprocess index/query data going to Solr, something like:
> 
> String wordsForSolr = removeStopWordsFrom("This is pretend index or query data");
> // wordsForSolr = "pretend index query data"
> 
> Off the top of my head, this will bypass position issues.
> 
> I will give this a go, but was wondering whether this is something others
> have done?
> 
> Best wishes,
> Edd
> 
> 
> Edward Turner
> 
> 
> On Fri, 6 Nov 2020 at 13:58, Edward Turner  wrote:
> 
>> Hi all,
>> 
>> We are experiencing some unexpected behaviour for phrase queries which we
>> believe might be related to the FlattenGraphFilterFactory and stopwords.
>> 
>> Brief description: when performing a phrase query
>> "Molecular cloning and evolution of the" => we get expected hits
>> "Molecular cloning and evolution of the genes" => we get no hits
>> (unexpected behaviour)
>> 
>> I think it's worthwhile adding the analyzers we use to help you see what
>> we're doing:
>>  Analyzers 
>> <fieldType name="..." class="solr.TextField"
>>            sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
>>   <analyzer type="index">
>>     <tokenizer class="solr.PatternTokenizerFactory" pattern="[- /()]+" />
>>     <filter class="solr.StopFilterFactory" ignoreCase="true" />
>>     <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="false" />
>>     <filter class="solr.LowerCaseFilterFactory" />
>>     <filter class="solr.WordDelimiterGraphFilterFactory"
>>             generateNumberParts="1" splitOnCaseChange="0" preserveOriginal="0"
>>             splitOnNumerics="0" stemEnglishPossessive="1" generateWordParts="1"
>>             catenateNumbers="0" catenateWords="1" catenateAll="1" />
>>     <filter class="solr.FlattenGraphFilterFactory" />
>>   </analyzer>
>>   <analyzer type="query">
>>     <tokenizer class="solr.PatternTokenizerFactory" pattern="[- /()]+" />
>>     <filter class="solr.StopFilterFactory" ignoreCase="true" />
>>     <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="false" />
>>     <filter class="solr.LowerCaseFilterFactory" />
>>     <filter class="solr.WordDelimiterGraphFilterFactory"
>>             generateNumberParts="1" splitOnCaseChange="0" preserveOriginal="0"
>>             splitOnNumerics="0" stemEnglishPossessive="1" generateWordParts="1"
>>             catenateNumbers="0" catenateWords="0" catenateAll="0" />
>>   </analyzer>
>> </fieldType>
>>  End of Analyzers 
>> 
>>  Stopwords 
>> We use the following stopwords:
>> a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not,
>> of, on, or, such, that, the, their, then, there, these, they, this, to,
>> was, will, with, which
>>  End of Stopwords 
>> 
>>  Analysis Admin page output ---
>> ... And to see what's going on when we're indexing/querying, I created a
>> gist with an image of the (non-verbose) output of the analysis admin page
> >> for index data/query "Molecular cloning and evolution of the genes":
>> 
>> https://gist.github.com/eddturner/81dbf409703aad402e9009b13d42e43c#file-analysis-admin-png
>> 
>> Hopefully this link works, and you can see that the resulting terms and
>> positions are identical until the FlattenGraphFilterFactory step in the
>> "index" phase.
>> 
>> Final stage of index analysis:
>> (1)molecular (2)cloning (3) (4)evolution (5) (6)genes
>> 
>> Final stage of query analysis:
>> (1)molecular (2)cloning (3) (4)evolution (5) (6) (7)genes
>> 
>> The empty positions are because of stopwords (presumably)
>>  End of Analysis Admin page output ---
>> 
>> Main question:
>> Could someone explain why the FlattenGraphFilterFactory changes the
> >> position of the "genes" token? From what we see, this happens after a
> >> "the" (but we've not checked exhaustively, and continue to test).
>> 
>> Perhaps, we are doing something wrong in our analysis setup?
>> 
>> Any help would be much appreciated -- getting phrase queries to work is an
>> important use-case of ours.
>> 
>> Kind regards and thank you in advance,
>> Edd
>> 
>> Edward Turner
>> 



Re: Phrase query no hits when stopwords and FlattenGraphFilterFactory used

2020-11-10 Thread Edward Turner
Hi all,

Okay, I've been doing more research about this problem and from what I
understand, phrase queries + stopwords are known to have some difficulties
working together in some circumstances.

E.g.,
https://stackoverflow.com/questions/56802656/stopwords-and-phrase-queries-solr?rq=1
https://issues.apache.org/jira/browse/SOLR-6468

I was thinking about workarounds, but each solution I've attempted doesn't
quite work.

Therefore, maybe one possible solution is to take a step back and
preprocess index/query data going to Solr, something like:

String wordsForSolr = removeStopWordsFrom("This is pretend index or query data");
// wordsForSolr = "pretend index query data"

Off the top of my head, this will bypass position issues.
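For what it's worth, a minimal sketch of such a preprocessing step might look like the following (the StopWordPreprocessor class is hypothetical, and the plain whitespace split here will not match the schema's pattern tokenizer exactly):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

public class StopWordPreprocessor {

    // The same stopword list as in the schema; it must be kept in sync manually.
    private static final List<String> STOP_WORDS = Arrays.asList(
            "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
            "in", "into", "is", "it", "no", "not", "of", "on", "or", "such",
            "that", "the", "their", "then", "there", "these", "they", "this",
            "to", "was", "will", "with", "which");

    // Strip stopwords (case-insensitively) from text before it is sent to Solr.
    public static String removeStopWordsFrom(String text) {
        return Arrays.stream(text.split("\\s+"))
                .filter(word -> !STOP_WORDS.contains(word.toLowerCase(Locale.ROOT)))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // prints: pretend index query data
        System.out.println(removeStopWordsFrom("This is pretend index or query data"));
    }
}
```

One downside of this approach is that the stopword list now lives in two places (the client and the schema) and has to be kept in sync.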

I will give this a go, but was wondering whether this is something others
have done?

Best wishes,
Edd


Edward Turner


On Fri, 6 Nov 2020 at 13:58, Edward Turner  wrote:

> Hi all,
>
> We are experiencing some unexpected behaviour for phrase queries which we
> believe might be related to the FlattenGraphFilterFactory and stopwords.
>
> Brief description: when performing a phrase query
> "Molecular cloning and evolution of the" => we get expected hits
> "Molecular cloning and evolution of the genes" => we get no hits
> (unexpected behaviour)
>
> I think it's worthwhile adding the analyzers we use to help you see what
> we're doing:
>  Analyzers 
> <fieldType name="..." class="solr.TextField"
>            sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.PatternTokenizerFactory" pattern="[- /()]+" />
>     <filter class="solr.StopFilterFactory" ignoreCase="true" />
>     <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="false" />
>     <filter class="solr.LowerCaseFilterFactory" />
>     <filter class="solr.WordDelimiterGraphFilterFactory"
>             generateNumberParts="1" splitOnCaseChange="0" preserveOriginal="0"
>             splitOnNumerics="0" stemEnglishPossessive="1" generateWordParts="1"
>             catenateNumbers="0" catenateWords="1" catenateAll="1" />
>     <filter class="solr.FlattenGraphFilterFactory" />
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.PatternTokenizerFactory" pattern="[- /()]+" />
>     <filter class="solr.StopFilterFactory" ignoreCase="true" />
>     <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="false" />
>     <filter class="solr.LowerCaseFilterFactory" />
>     <filter class="solr.WordDelimiterGraphFilterFactory"
>             generateNumberParts="1" splitOnCaseChange="0" preserveOriginal="0"
>             splitOnNumerics="0" stemEnglishPossessive="1" generateWordParts="1"
>             catenateNumbers="0" catenateWords="0" catenateAll="0" />
>   </analyzer>
> </fieldType>
>  End of Analyzers 
>
>  Stopwords 
> We use the following stopwords:
> a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not,
> of, on, or, such, that, the, their, then, there, these, they, this, to,
> was, will, with, which
>  End of Stopwords 
>
>  Analysis Admin page output ---
> ... And to see what's going on when we're indexing/querying, I created a
> gist with an image of the (non-verbose) output of the analysis admin page
> for index data/query "Molecular cloning and evolution of the genes":
>
> https://gist.github.com/eddturner/81dbf409703aad402e9009b13d42e43c#file-analysis-admin-png
>
> Hopefully this link works, and you can see that the resulting terms and
> positions are identical until the FlattenGraphFilterFactory step in the
> "index" phase.
>
> Final stage of index analysis:
> (1)molecular (2)cloning (3) (4)evolution (5) (6)genes
>
> Final stage of query analysis:
> (1)molecular (2)cloning (3) (4)evolution (5) (6) (7)genes
>
> The empty positions are because of stopwords (presumably)
>  End of Analysis Admin page output ---
>
> Main question:
> Could someone explain why the FlattenGraphFilterFactory changes the
> position of the "genes" token? From what we see, this happens after a
> "the" (but we've not checked exhaustively, and continue to test).
>
> Perhaps, we are doing something wrong in our analysis setup?
>
> Any help would be much appreciated -- getting phrase queries to work is an
> important use-case of ours.
>
> Kind regards and thank you in advance,
> Edd
> 
> Edward Turner
>


Phrase query no hits when stopwords and FlattenGraphFilterFactory used

2020-11-06 Thread Edward Turner
Hi all,

We are experiencing some unexpected behaviour for phrase queries which we
believe might be related to the FlattenGraphFilterFactory and stopwords.

Brief description: when performing a phrase query
"Molecular cloning and evolution of the" => we get expected hits
"Molecular cloning and evolution of the genes" => we get no hits
(unexpected behaviour)

I think it's worthwhile adding the analyzers we use to help you see what
we're doing:
 Analyzers 
<fieldType name="..." class="solr.TextField"
           sortMissingLast="true" omitNorms="true" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[- /()]+" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" />
    <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="false" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateNumberParts="1" splitOnCaseChange="0" preserveOriginal="0"
            splitOnNumerics="0" stemEnglishPossessive="1" generateWordParts="1"
            catenateNumbers="0" catenateWords="1" catenateAll="1" />
    <filter class="solr.FlattenGraphFilterFactory" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[- /()]+" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" />
    <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="false" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.WordDelimiterGraphFilterFactory"
            generateNumberParts="1" splitOnCaseChange="0" preserveOriginal="0"
            splitOnNumerics="0" stemEnglishPossessive="1" generateWordParts="1"
            catenateNumbers="0" catenateWords="0" catenateAll="0" />
  </analyzer>
</fieldType>
 End of Analyzers ----

 Stopwords 
We use the following stopwords:
a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not,
of, on, or, such, that, the, their, then, there, these, they, this, to,
was, will, with, which
---- End of Stopwords 

 Analysis Admin page output ---
... And to see what's going on when we're indexing/querying, I created a
gist with an image of the (non-verbose) output of the analysis admin page
for index data/query "Molecular cloning and evolution of the genes":
https://gist.github.com/eddturner/81dbf409703aad402e9009b13d42e43c#file-analysis-admin-png

Hopefully this link works, and you can see that the resulting terms and
positions are identical until the FlattenGraphFilterFactory step in the
"index" phase.

Final stage of index analysis:
(1)molecular (2)cloning (3) (4)evolution (5) (6)genes

Final stage of query analysis:
(1)molecular (2)cloning (3) (4)evolution (5) (6) (7)genes

The empty positions are because of stopwords (presumably)
 End of Analysis Admin page output ---

Main question:
Could someone explain why the FlattenGraphFilterFactory changes the
position of the "genes" token? From what we see, this happens after a
"the" (but we've not checked exhaustively, and continue to test).

Perhaps, we are doing something wrong in our analysis setup?

Any help would be much appreciated -- getting phrase queries to work is an
important use-case of ours.

Kind regards and thank you in advance,
Edd

Edward Turner


Re: Avoiding single digit and single character ONLY query by putting them in stopwords list

2020-10-27 Thread Mark Robinson
Thanks!

Mark

On Tue, Oct 27, 2020 at 11:56 AM Dave  wrote:

> Agreed. Just a JavaScript check on the input box would work fine for 99%
> of cases, unless something automated is running them, in which case just
> do a server-side redirect back to the form.
>
> > On Oct 27, 2020, at 11:54 AM, Mark Robinson 
> wrote:
> >
> > Hi Konstantinos,
> >
> > Thanks for the reply.
> > I too feel the same. Wanted to find what others also in the Solr world
> > thought about it.
> >
> > Thanks!
> > Mark.
> >
> >> On Tue, Oct 27, 2020 at 11:45 AM Konstantinos Koukouvis <
> >> konstantinos.koukou...@mecenat.com> wrote:
> >>
> >> Oh hi Mark!
> >>
> >> Why would you wanna do such a thing on the Solr end? IMHO it would be
> >> much cleaner and easier to do it on the client side
> >>
> >> Regards,
> >> Konstantinos
> >>
> >>
>  On 27 Oct 2020, at 16:42, Mark Robinson 
> wrote:
> >>>
> >>> Hello,
> >>>
> >>> I want to block queries having only a digit like "1" or "2" ,... or
> >>> just a letter like "a" or "b" ...
> >>>
> >>> Is it a good idea to block them ... ie just single digits 0 - 9 and  a
> -
> >> z
> >>> by putting them as a stop word? The problem with this I can anticipate
> >> is a
> >>> query like "1 inch screw" can have the important information "1"
> stripped
> >>> out if I tokenize it.
> >>>
> >>> So what would be a good way to avoid  single digit only and single
> letter
> >>> only queries, from the Solr end?
> >>> Or should I not do this at the Solr end at all?
> >>>
> >>> Could someone please share your thoughts?
> >>>
> >>> Thanks!
> >>> Mark
> >>
> >> ==
> >> Konstantinos Koukouvis
> >> konstantinos.koukou...@mecenat.com
> >>
> >> Using Golang and Solr? Try this: https://github.com/mecenat/solr
> >>
> >>
> >>
> >>
> >>
> >>
>


Re: Avoiding single digit and single character ONLY query by putting them in stopwords list

2020-10-27 Thread Dave
Agreed. Just a JavaScript check on the input box would work fine for 99% of
cases, unless something automated is running them, in which case just do a
server-side redirect back to the form.

> On Oct 27, 2020, at 11:54 AM, Mark Robinson  wrote:
> 
> Hi Konstantinos,
> 
> Thanks for the reply.
> I too feel the same. Wanted to find what others also in the Solr world
> thought about it.
> 
> Thanks!
> Mark.
> 
>> On Tue, Oct 27, 2020 at 11:45 AM Konstantinos Koukouvis <
>> konstantinos.koukou...@mecenat.com> wrote:
>> 
>> Oh hi Mark!
>> 
>> Why would you wanna do such a thing on the Solr end? IMHO it would be much
>> cleaner and easier to do it on the client side
>> 
>> Regards,
>> Konstantinos
>> 
>> 
 On 27 Oct 2020, at 16:42, Mark Robinson  wrote:
>>> 
>>> Hello,
>>> 
>>> I want to block queries having only a digit like "1" or "2" ,... or
>>> just a letter like "a" or "b" ...
>>> 
>>> Is it a good idea to block them ... ie just single digits 0 - 9 and  a -
>> z
>>> by putting them as a stop word? The problem with this I can anticipate
>> is a
>>> query like "1 inch screw" can have the important information "1" stripped
>>> out if I tokenize it.
>>> 
>>> So what would be a good way to avoid  single digit only and single letter
>>> only queries, from the Solr end?
>>> Or should I not do this at the Solr end at all?
>>> 
>>> Could someone please share your thoughts?
>>> 
>>> Thanks!
>>> Mark
>> 
>> ==
>> Konstantinos Koukouvis
>> konstantinos.koukou...@mecenat.com
>> 
>> Using Golang and Solr? Try this: https://github.com/mecenat/solr
>> 
>> 
>> 
>> 
>> 
>> 


Re: Avoiding single digit and single character ONLY query by putting them in stopwords list

2020-10-27 Thread Mark Robinson
Hi Konstantinos,

Thanks for the reply.
I too feel the same. Wanted to find what others also in the Solr world
thought about it.

Thanks!
Mark.

On Tue, Oct 27, 2020 at 11:45 AM Konstantinos Koukouvis <
konstantinos.koukou...@mecenat.com> wrote:

> Oh hi Mark!
>
> Why would you wanna do such a thing on the Solr end? IMHO it would be much
> cleaner and easier to do it on the client side
>
> Regards,
> Konstantinos
>
>
> > On 27 Oct 2020, at 16:42, Mark Robinson  wrote:
> >
> > Hello,
> >
> > I want to block queries having only a digit like "1" or "2" ,... or
> > just a letter like "a" or "b" ...
> >
> > Is it a good idea to block them ... ie just single digits 0 - 9 and  a -
> z
> > by putting them as a stop word? The problem with this I can anticipate
> is a
> > query like "1 inch screw" can have the important information "1" stripped
> > out if I tokenize it.
> >
> > So what would be a good way to avoid  single digit only and single letter
> > only queries, from the Solr end?
> > Or should I not do this at the Solr end at all?
> >
> > Could someone please share your thoughts?
> >
> > Thanks!
> > Mark
>
> ==
> Konstantinos Koukouvis
> konstantinos.koukou...@mecenat.com
>
> Using Golang and Solr? Try this: https://github.com/mecenat/solr
>
>
>
>
>
>


Re: Avoiding single digit and single character ONLY query by putting them in stopwords list

2020-10-27 Thread Konstantinos Koukouvis
Oh hi Mark!

Why would you wanna do such a thing on the Solr end? IMHO it would be much
cleaner and easier to do it on the client side

Regards,
Konstantinos


> On 27 Oct 2020, at 16:42, Mark Robinson  wrote:
> 
> Hello,
> 
> I want to block queries having only a digit like "1" or "2" ,... or
> just a letter like "a" or "b" ...
> 
> > Is it a good idea to block them ... i.e. just single digits 0 - 9 and a - z
> by putting them as a stop word? The problem with this I can anticipate is a
> query like "1 inch screw" can have the important information "1" stripped
> out if I tokenize it.
> 
> So what would be a good way to avoid  single digit only and single letter
> only queries, from the Solr end?
> Or should I not do this at the Solr end at all?
> 
> Could someone please share your thoughts?
> 
> Thanks!
> Mark

==
Konstantinos Koukouvis
konstantinos.koukou...@mecenat.com

Using Golang and Solr? Try this: https://github.com/mecenat/solr







Avoiding single digit and single character ONLY query by putting them in stopwords list

2020-10-27 Thread Mark Robinson
Hello,

I want to block queries having only a digit like "1" or "2" ,... or
just a letter like "a" or "b" ...

Is it a good idea to block them ... i.e. just single digits 0 - 9 and a - z
by putting them as a stop word? The problem with this I can anticipate is a
query like "1 inch screw" can have the important information "1" stripped
out if I tokenize it.

So what would be a good way to avoid  single digit only and single letter
only queries, from the Solr end?
Or should I not do this at the Solr end at all?

Could someone please share your thoughts?

Thanks!
Mark
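A simple server-side guard, as an alternative to stopwords, could reject such queries before they ever reach Solr. A sketch (the class and method names here are made up for illustration):

```java
public class QueryGuard {

    // True when the trimmed query is exactly one ASCII digit or letter,
    // e.g. "1" or "a"; a query like "1 inch screw" passes through untouched.
    public static boolean isSingleCharQuery(String query) {
        String trimmed = query == null ? "" : query.trim();
        return trimmed.matches("[0-9A-Za-z]");
    }

    public static void main(String[] args) {
        System.out.println(isSingleCharQuery("1"));            // true
        System.out.println(isSingleCharQuery("1 inch screw")); // false
    }
}
```

Unlike a stopword entry, this check cannot strip the "1" out of "1 inch screw", because it only fires when the whole query is a single character.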


RE: advice on whether to use stopwords for use case

2020-10-01 Thread Markus Jelsma
Well, when not splitting on whitespace, you can use the CharFilter for regex
replacements [1] to clear the entire search string if a banned word is found
anywhere in the string:

.*(cigarette|tobacco).*

[1] 
https://lucene.apache.org/solr/guide/6_6/charfilterfactories.html#CharFilterFactories-solr.PatternReplaceCharFilterFactory
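A sketch of how that could look in a schema (the field type name and the rest of the analysis chain are placeholders, not taken from any real schema):

```xml
<fieldType name="text_nocigs" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- If a banned word occurs anywhere, replace the entire input with
         nothing, so the field produces no tokens and nothing can match. -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern=".*(cigarette|tobacco).*" replacement=""/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```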
 
-Original message-
> From:Walter Underwood 
> Sent: Thursday 1st October 2020 18:20
> To: solr-user@lucene.apache.org
> Subject: Re: advice on whether to use stopwords for use case
> 
> I can’t think of an easy way to do this in Solr.
> 
> Do a bunch of string searches on the query on the client side. If any of them 
> match, 
> make a “no hits” result page.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
> > On Sep 30, 2020, at 11:56 PM, Derek Poh  wrote:
> > 
> > Yes, the requirement (for now) is not to return any results. I think they
> > may change the requirements, pending their return from the holidays.
> > 
> >> If so, then check for those words in the query before sending it to Solr.
> > That is what I think so too.
> > 
> > Thinking further, if we use stopwords for this, results will still be
> > returned when the search keywords contain more than just the
> > stopwords.
> > 
> > On 1/10/2020 2:57 am, Walter Underwood wrote:
> >> I’m not clear on the requirements. It sounds like the query “cigar” or 
> >> “cuban cigar”
> >> should return zero results. Is that right?
> >> 
> >> If so, then check for those words in the query before sending it to Solr.
> >> 
> >> But the stopwords approach seems like the requirement is different. Could 
> >> you give
> >> some examples?
> >> 
> >> wunder
> >> Walter Underwood
> >> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
> >> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>  (my 
> >> blog)
> >> 
> >>> On Sep 30, 2020, at 11:53 AM, Alexandre Rafalovitch  
> >>> <mailto:arafa...@gmail.com> wrote:
> >>> 
> >>> You may also want to look at something like: 
> >>> https://docs.querqy.org/index.html <https://docs.querqy.org/index.html>
> >>> 
> >>> ApacheCon had (is having..) a presentation on it that seemed quite
> >>> relevant to your needs. The videos should be live in a week or so.
> >>> 
> >>> Regards,
> >>>   Alex.
> >>> 
> >>> On Tue, 29 Sep 2020 at 22:56, Alexandre Rafalovitch  
> >>> <mailto:arafa...@gmail.com> wrote:
> >>>> I am not sure why you think stop words are your first choice. Maybe I
> >>>> misunderstand the question. I read it as that you need to exclude
> >>>> completely a set of documents that include specific keywords when
> >>>> called from specific module.
> >>>> 
> >>>> If I wanted to differentiate the searches from specific module, I
> >>>> would give that module a different end-point (Request Query Handler),
> >>>> instead of /select. So, /nocigs or whatever.
> >>>> 
> >>>> Then, in that end-point, you could do all sorts of extra things, such
> >>>> as setting appends or even invariants parameters, which would include
> >>>> filter query to exclude any documents matching specific keywords. I
> >>>> assume it is ok to return documents that are matching for other
> >>>> reasons.
> >>>> 
> >>>> Ideally, you would mark the cigs documents during indexing with a
> >>>> binary or enumeration flag and then during search you just need to
> >>>> check against that flag. In that case, you could copyField  your text
> >>>> and run it against something like
> >>>> https://lucene.apache.org/solr/guide/8_6/filter-descriptions.html#keep-word-filter
> >>>>  
> >>>> <https://lucene.apache.org/solr/guide/8_6/filter-descriptions.html#keep-word-filter>
> >>>> combined with Shingles for multiwords. Or similar. And just transform
> >>>> it as index-only so that the result is basically a yes/no flag.
> >>>> Similar thing could be done with UpdateRequestProcessor pipeline if
> >>>> you want to end up with a true boolean flag. The idea is the same,
> >>>> just to have an index-only flag that you force lock into for any
> >>>> request from specific module.

Re: advice on whether to use stopwords for use case

2020-10-01 Thread Walter Underwood
I can’t think of an easy way to do this in Solr.

Do a bunch of string searches on the query on the client side. If any of them 
match, 
make a “no hits” result page.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Sep 30, 2020, at 11:56 PM, Derek Poh  wrote:
> 
> Yes, the requirement (for now) is not to return any results. I think they
> may change the requirements, pending their return from the holidays.
> 
>> If so, then check for those words in the query before sending it to Solr.
> That is what I think so too.
> 
> Thinking further, if we use stopwords for this, results will still be
> returned when the search keywords contain more than just the
> stopwords.
> 
> On 1/10/2020 2:57 am, Walter Underwood wrote:
>> I’m not clear on the requirements. It sounds like the query “cigar” or 
>> “cuban cigar”
>> should return zero results. Is that right?
>> 
>> If so, then check for those words in the query before sending it to Solr.
>> 
>> But the stopwords approach seems like the requirement is different. Could 
>> you give
>> some examples?
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>  (my blog)
>> 
>>> On Sep 30, 2020, at 11:53 AM, Alexandre Rafalovitch  
>>> <mailto:arafa...@gmail.com> wrote:
>>> 
>>> You may also want to look at something like: 
>>> https://docs.querqy.org/index.html <https://docs.querqy.org/index.html>
>>> 
>>> ApacheCon had (is having..) a presentation on it that seemed quite
>>> relevant to your needs. The videos should be live in a week or so.
>>> 
>>> Regards,
>>>   Alex.
>>> 
>>> On Tue, 29 Sep 2020 at 22:56, Alexandre Rafalovitch  
>>> <mailto:arafa...@gmail.com> wrote:
>>>> I am not sure why you think stop words are your first choice. Maybe I
>>>> misunderstand the question. I read it as that you need to exclude
>>>> completely a set of documents that include specific keywords when
>>>> called from specific module.
>>>> 
>>>> If I wanted to differentiate the searches from specific module, I
>>>> would give that module a different end-point (Request Query Handler),
>>>> instead of /select. So, /nocigs or whatever.
>>>> 
>>>> Then, in that end-point, you could do all sorts of extra things, such
>>>> as setting appends or even invariants parameters, which would include
>>>> filter query to exclude any documents matching specific keywords. I
>>>> assume it is ok to return documents that are matching for other
>>>> reasons.
>>>> 
>>>> Ideally, you would mark the cigs documents during indexing with a
>>>> binary or enumeration flag and then during search you just need to
>>>> check against that flag. In that case, you could copyField  your text
>>>> and run it against something like
>>>> https://lucene.apache.org/solr/guide/8_6/filter-descriptions.html#keep-word-filter
>>>>  
>>>> <https://lucene.apache.org/solr/guide/8_6/filter-descriptions.html#keep-word-filter>
>>>> combined with Shingles for multiwords. Or similar. And just transform
>>>> it as index-only so that the result is basically a yes/no flag.
>>>> Similar thing could be done with UpdateRequestProcessor pipeline if
>>>> you want to end up with a true boolean flag. The idea is the same,
>>>> just to have an index-only flag that you force lock into for any
>>>> request from specific module.
>>>> 
>>>> Or even with something like ElevationSearchComponent. Same idea.
>>>> 
>>>> Hope this helps.
>>>> 
>>>> Regards,
>>>>   Alex.
>>>> 
>>>> On Tue, 29 Sep 2020 at 22:28, Derek Poh  
>>>> <mailto:d...@globalsources.com> wrote:
>>>>> Hi
>>>>> 
>>>>> I have read in the mailings list that we should try to avoid using stop
>>>>> words.
>>>>> 
>>>>> I have a use case where I would like to know if there is other
>>>>> alternative solutions beside using stop words.
>>>>> 
>>>>> There is business requirement to return zero result when the search is
>>>>> cigarette related words and the search is coming from a particular
> module on our site. It does not apply to all searches from our site.

Re: advice on whether to use stopwords for use case

2020-10-01 Thread Derek Poh
Yes, the requirement (for now) is not to return any results. I think
they may change the requirements, pending their return from the holidays.



If so, then check for those words in the query before sending it to Solr.

That is what I think so too.

Thinking further, if we use stopwords for this, results will still be
returned when the search keywords contain more than just the
stopwords.


On 1/10/2020 2:57 am, Walter Underwood wrote:

I’m not clear on the requirements. It sounds like the query “cigar” or “cuban 
cigar”
should return zero results. Is that right?

If so, then check for those words in the query before sending it to Solr.

But the stopwords approach seems like the requirement is different. Could you 
give
some examples?

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


On Sep 30, 2020, at 11:53 AM, Alexandre Rafalovitch  wrote:

You may also want to look at something like: https://docs.querqy.org/index.html

ApacheCon had (is having..) a presentation on it that seemed quite
relevant to your needs. The videos should be live in a week or so.

Regards,
   Alex.

On Tue, 29 Sep 2020 at 22:56, Alexandre Rafalovitch  wrote:

I am not sure why you think stop words are your first choice. Maybe I
misunderstand the question. I read it as that you need to exclude
completely a set of documents that include specific keywords when
called from specific module.

If I wanted to differentiate the searches from specific module, I
would give that module a different end-point (Request Query Handler),
instead of /select. So, /nocigs or whatever.

Then, in that end-point, you could do all sorts of extra things, such
as setting appends or even invariants parameters, which would include
filter query to exclude any documents matching specific keywords. I
assume it is ok to return documents that are matching for other
reasons.

Ideally, you would mark the cigs documents during indexing with a
binary or enumeration flag and then during search you just need to
check against that flag. In that case, you could copyField  your text
and run it against something like
https://lucene.apache.org/solr/guide/8_6/filter-descriptions.html#keep-word-filter
combined with Shingles for multiwords. Or similar. And just transform
it as index-only so that the result is basically a yes/no flag.
Similar thing could be done with UpdateRequestProcessor pipeline if
you want to end up with a true boolean flag. The idea is the same,
just to have an index-only flag that you force lock into for any
request from specific module.

Or even with something like ElevationSearchComponent. Same idea.

Hope this helps.

Regards,
   Alex.

On Tue, 29 Sep 2020 at 22:28, Derek Poh  wrote:

Hi

I have read in the mailings list that we should try to avoid using stop
words.

I have a use case where I would like to know if there is other
alternative solutions beside using stop words.

There is business requirement to return zero result when the search is
cigarette related words and the search is coming from a particular
module on our site. It does not apply to all searches from our site.
There is a list of these cigarette related words. This list contains
single word, multiple words (Electronic cigar), multiple words with
punctuation (e-cigarette case).
I am planning to copy a different set of search fields, that will
include the stopword filter in the index and query stage, for this
module to use.

For this use case, other than using stop words to handle it, is there
any alternative solution?

Derek

--
CONFIDENTIALITY NOTICE

This e-mail (including any attachments) may contain confidential and/or 
privileged information. If you are not the intended recipient or have received 
this e-mail in error, please inform the sender immediately and delete this 
e-mail (including any attachments) from your computer, and you must not use, 
disclose to anyone else or copy this e-mail (including any attachments), 
whether in whole or in part.

This e-mail and any reply to it may be monitored for security, legal, 
regulatory compliance and/or other appropriate reasons.






Re: advice on whether to use stopwords for use case

2020-10-01 Thread Derek Poh

Hi Alex

The business requirement (for now) is not to return any results when the 
search keywords are cigarette-related. The business user team will 
provide the list of cigarette-related keywords.


I will digest, explore and research your suggestions. Thank you.

On 30/9/2020 10:56 am, Alexandre Rafalovitch wrote:

I am not sure why you think stop words are your first choice. Maybe I
misunderstand the question. I read it as that you need to exclude
completely a set of documents that include specific keywords when
called from specific module.

If I wanted to differentiate the searches from specific module, I
would give that module a different end-point (Request Query Handler),
instead of /select. So, /nocigs or whatever.

Then, in that end-point, you could do all sorts of extra things, such
as setting appends or even invariants parameters, which would include
filter query to exclude any documents matching specific keywords. I
assume it is ok to return documents that are matching for other
reasons.

Ideally, you would mark the cigs documents during indexing with a
binary or enumeration flag and then during search you just need to
check against that flag. In that case, you could copyField  your text
and run it against something like
https://lucene.apache.org/solr/guide/8_6/filter-descriptions.html#keep-word-filter
combined with Shingles for multiwords. Or similar. And just transform
it as index-only so that the result is basically a yes/no flag.
Similar thing could be done with UpdateRequestProcessor pipeline if
you want to end up with a true boolean flag. The idea is the same,
just to have an index-only flag that you force lock into for any
request from specific module.

Or even with something like ElevationSearchComponent. Same idea.

Hope this helps.

Regards,
Alex.

On Tue, 29 Sep 2020 at 22:28, Derek Poh  wrote:

Hi

I have read in the mailings list that we should try to avoid using stop
words.

I have a use case where I would like to know if there is other
alternative solutions beside using stop words.

There is business requirement to return zero result when the search is
cigarette related words and the search is coming from a particular
module on our site. It does not apply to all searches from our site.
There is a list of these cigarette related words. This list contains
single word, multiple words (Electronic cigar), multiple words with
punctuation (e-cigarette case).
I am planning to copy a different set of search fields, that will
include the stopword filter in the index and query stage, for this
module to use.

For this use case, other than using stop words to handle it, is there
any alternative solution?

Derek





Re: advice on whether to use stopwords for use case

2020-09-30 Thread Walter Underwood
I’m not clear on the requirements. It sounds like the query “cigar” or “cuban cigar” should return zero results. Is that right?

If so, then check for those words in the query before sending it to Solr.
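A minimal sketch of that client-side check, assuming a phrase-level blocklist (the phrases and normalization rules below are illustrative; the real list would come from the business team):

```python
# Screen queries against a cigarette-related blocklist before querying Solr.
import re

BLOCKED_PHRASES = ["cigar", "cuban cigar", "electronic cigar", "e-cigarette case"]

def normalize(text):
    # Lowercase and collapse punctuation (keeping hyphens) so
    # "E-Cigarette Case" matches "e-cigarette case".
    return re.sub(r"[^a-z0-9\s-]", " ", text.lower()).split()

def is_blocked(query):
    tokens = normalize(query)
    for phrase in BLOCKED_PHRASES:
        p = normalize(phrase)
        n = len(p)
        # Match the phrase against every n-token window of the query,
        # so "cigarette" alone does not trigger the "cigar" entry.
        if any(tokens[i:i + n] == p for i in range(len(tokens) - n + 1)):
            return True
    return False
```

The module would return an empty result set whenever `is_blocked` fires, without Solr being involved at all.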

But the stopwords approach seems like the requirement is different. Could you give some examples?

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Sep 30, 2020, at 11:53 AM, Alexandre Rafalovitch  
> wrote:
> 
> You may also want to look at something like: 
> https://docs.querqy.org/index.html
> 
> ApacheCon had (is having..) a presentation on it that seemed quite
> relevant to your needs. The videos should be live in a week or so.
> 
> Regards,
>   Alex.
> 
> On Tue, 29 Sep 2020 at 22:56, Alexandre Rafalovitch  
> wrote:
>> 
>> I am not sure why you think stop words are your first choice. Maybe I
>> misunderstand the question. I read it as that you need to exclude
>> completely a set of documents that include specific keywords when
>> called from specific module.
>> 
>> If I wanted to differentiate the searches from specific module, I
>> would give that module a different end-point (Request Query Handler),
>> instead of /select. So, /nocigs or whatever.
>> 
>> Then, in that end-point, you could do all sorts of extra things, such
>> as setting appends or even invariants parameters, which would include
>> filter query to exclude any documents matching specific keywords. I
>> assume it is ok to return documents that are matching for other
>> reasons.
>> 
>> Ideally, you would mark the cigs documents during indexing with a
>> binary or enumeration flag and then during search you just need to
>> check against that flag. In that case, you could copyField  your text
>> and run it against something like
>> https://lucene.apache.org/solr/guide/8_6/filter-descriptions.html#keep-word-filter
>> combined with Shingles for multiwords. Or similar. And just transform
>> it as index-only so that the result is basically a yes/no flag.
>> Similar thing could be done with UpdateRequestProcessor pipeline if
>> you want to end up with a true boolean flag. The idea is the same,
>> just to have an index-only flag that you force lock into for any
>> request from specific module.
>> 
>> Or even with something like ElevationSearchComponent. Same idea.
>> 
>> Hope this helps.
>> 
>> Regards,
>>   Alex.
>> 
>> On Tue, 29 Sep 2020 at 22:28, Derek Poh  wrote:
>>> 
>>> Hi
>>> 
>>> I have read in the mailings list that we should try to avoid using stop
>>> words.
>>> 
>>> I have a use case where I would like to know if there is other
>>> alternative solutions beside using stop words.
>>> 
>>> There is business requirement to return zero result when the search is
>>> cigarette related words and the search is coming from a particular
>>> module on our site. It does not apply to all searches from our site.
>>> There is a list of these cigarette related words. This list contains
>>> single word, multiple words (Electronic cigar), multiple words with
>>> punctuation (e-cigarette case).
>>> I am planning to copy a different set of search fields, that will
>>> include the stopword filter in the index and query stage, for this
>>> module to use.
>>> 
>>> For this use case, other than using stop words to handle it, is there
>>> any alternative solution?
>>> 
>>> Derek
>>> 



Re: advice on whether to use stopwords for use case

2020-09-30 Thread Alexandre Rafalovitch
You may also want to look at something like: https://docs.querqy.org/index.html

ApacheCon had (is having..) a presentation on it that seemed quite
relevant to your needs. The videos should be live in a week or so.

Regards,
   Alex.

On Tue, 29 Sep 2020 at 22:56, Alexandre Rafalovitch  wrote:
>
> I am not sure why you think stop words are your first choice. Maybe I
> misunderstand the question. I read it as that you need to exclude
> completely a set of documents that include specific keywords when
> called from specific module.
>
> If I wanted to differentiate the searches from specific module, I
> would give that module a different end-point (Request Query Handler),
> instead of /select. So, /nocigs or whatever.
>
> Then, in that end-point, you could do all sorts of extra things, such
> as setting appends or even invariants parameters, which would include
> filter query to exclude any documents matching specific keywords. I
> assume it is ok to return documents that are matching for other
> reasons.
>
> Ideally, you would mark the cigs documents during indexing with a
> binary or enumeration flag and then during search you just need to
> check against that flag. In that case, you could copyField  your text
> and run it against something like
> https://lucene.apache.org/solr/guide/8_6/filter-descriptions.html#keep-word-filter
> combined with Shingles for multiwords. Or similar. And just transform
> it as index-only so that the result is basically a yes/no flag.
> Similar thing could be done with UpdateRequestProcessor pipeline if
> you want to end up with a true boolean flag. The idea is the same,
> just to have an index-only flag that you force lock into for any
> request from specific module.
>
> Or even with something like ElevationSearchComponent. Same idea.
>
> Hope this helps.
>
> Regards,
>Alex.
>
> On Tue, 29 Sep 2020 at 22:28, Derek Poh  wrote:
> >
> > Hi
> >
> > I have read in the mailings list that we should try to avoid using stop
> > words.
> >
> > I have a use case where I would like to know if there is other
> > alternative solutions beside using stop words.
> >
> > There is business requirement to return zero result when the search is
> > cigarette related words and the search is coming from a particular
> > module on our site. It does not apply to all searches from our site.
> > There is a list of these cigarette related words. This list contains
> > single word, multiple words (Electronic cigar), multiple words with
> > punctuation (e-cigarette case).
> > I am planning to copy a different set of search fields, that will
> > include the stopword filter in the index and query stage, for this
> > module to use.
> >
> > For this use case, other than using stop words to handle it, is there
> > any alternative solution?
> >
> > Derek
> >


Re: advice on whether to use stopwords for use case

2020-09-29 Thread Alexandre Rafalovitch
I am not sure why you think stop words are your first choice. Maybe I
misunderstand the question. I read it as saying that you need to completely
exclude a set of documents that include specific keywords when
called from a specific module.

If I wanted to differentiate the searches from a specific module, I
would give that module a different end-point (request handler)
instead of /select. So, /nocigs or whatever.

Then, in that end-point, you could do all sorts of extra things, such
as setting appends or even invariants parameters, which would include
a filter query to exclude any documents matching the specific keywords. I
assume it is OK to return documents that match for other
reasons.
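A minimal sketch of such an end-point in solrconfig.xml — the handler name, the flag field, and the fq value are illustrative, not from the original message:

```xml
<!-- Sketch only: /nocigs, is_cigs, and the fq are assumed names. -->
<requestHandler name="/nocigs" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
  </lst>
  <!-- "appends" params are added to every request sent to this handler;
       use "invariants" instead if clients must not be able to override. -->
  <lst name="appends">
    <str name="fq">-is_cigs:true</str>
  </lst>
</requestHandler>
```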

Ideally, you would mark the cigs documents during indexing with a
binary or enumeration flag, and then during search you just need to
check against that flag. In that case, you could copyField your text
and run it against something like
https://lucene.apache.org/solr/guide/8_6/filter-descriptions.html#keep-word-filter
combined with shingles for multi-word terms. Or similar. And just make
it index-only so that the result is basically a yes/no flag.
A similar thing could be done with an UpdateRequestProcessor pipeline if
you want to end up with a true boolean flag. The idea is the same:
an index-only flag that you force for any
request from the specific module.
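A sketch of what that copyField + keep-word + shingles setup could look like in the schema — all field/type names and the keep-words file are assumptions:

```xml
<!-- Sketch only: an index-only field whose analysis keeps ONLY the
     cigarette-related terms, so a non-empty field acts as a yes/no flag. -->
<fieldType name="cigs_flag" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Shingles keep multi-word phrases like "electronic cigar" as one token -->
    <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true"/>
    <filter class="solr.KeepWordFilterFactory" words="cig_keepwords.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
<field name="cigs_terms" type="cigs_flag" indexed="true" stored="false"/>
<copyField source="title" dest="cigs_terms"/>
```

The module-specific handler could then append fq=-cigs_terms:* to exclude any document whose flag field is non-empty.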

Or even with something like ElevationSearchComponent. Same idea.

Hope this helps.

Regards,
   Alex.

On Tue, 29 Sep 2020 at 22:28, Derek Poh  wrote:
>
> Hi
>
> I have read in the mailings list that we should try to avoid using stop
> words.
>
> I have a use case where I would like to know if there is other
> alternative solutions beside using stop words.
>
> There is business requirement to return zero result when the search is
> cigarette related words and the search is coming from a particular
> module on our site. It does not apply to all searches from our site.
> There is a list of these cigarette related words. This list contains
> single word, multiple words (Electronic cigar), multiple words with
> punctuation (e-cigarette case).
> I am planning to copy a different set of search fields, that will
> include the stopword filter in the index and query stage, for this
> module to use.
>
> For this use case, other than using stop words to handle it, is there
> any alternative solution?
>
> Derek
>


advice on whether to use stopwords for use case

2020-09-29 Thread Derek Poh

Hi

I have read on the mailing list that we should try to avoid using stop 
words.


I have a use case where I would like to know whether there are 
alternative solutions besides using stop words.


There is a business requirement to return zero results when the search 
terms are cigarette-related and the search comes from a particular 
module on our site. It does not apply to all searches from our site.
There is a list of these cigarette-related words. The list contains 
single words, multi-word phrases (electronic cigar), and multi-word 
phrases with punctuation (e-cigarette case).
I am planning to copy to a different set of search fields, which will 
include the stopword filter at both the index and query stages, for this 
module to use.


For this use case, other than using stop words to handle it, is there 
any alternative solution?


Derek


Re: Constant score and stopwords strange behaviour

2020-06-25 Thread Paras Lehana
Hi,

You can also change the TF multiplication factor in the TF-IDF snippet in
the source code to 1. I know there would be a better way to handle
stopwords now that you have constant scoring, but I wanted to mention
the method by which we got rid of TF.
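If the goal is just constant per-term scoring, a less invasive alternative to patching the source may be swapping the field's similarity in the schema — assuming a Solr version that ships solr.BooleanSimilarityFactory:

```xml
<!-- Sketch: every matching term contributes a constant score
     (times query boost); no tf, idf, or length normalization. -->
<fieldType name="text_flat" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <similarity class="solr.BooleanSimilarityFactory"/>
</fieldType>
```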

On Thu, 25 Jun 2020 at 03:02, dbourassa  wrote:

> Hi,
>
> I'm working on a Solr core where we don't want to use TF-IDF (BM25).
> We rank documents with boost based on popularity, exact match, phrase
> match,
> etc.
>
> To bypass TF-IDF, we use constant score like this "q=harry^=0.5
> potter^=0.5"
> (score is always 1 before boost)
> We have just noticed a strange behaviour with this method.
> With "q=a cat", the stopword 'a' is automatically removed by the query
> analyzer.
> But with "q=a^0.5 cat^0.5", the stopword 'a' is not removed.
>
> We also tried something like "q=(a AND cat)^=1" but the problem still.
>
> Someone have an idea or a better solution to bypass TF-IDF ?
>
> relevant info in solrconfig :
> ...
> edismax
> 590%
> true
> ...
>
> relevant info in schema :
> 
> ...
>  words="stopwords_querytime_custom.txt"/>
> ...
>
>
> Thanks
>
>
>
> --
> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


-- 
Regards,

*Paras Lehana* [65871]
Development Engineer, *Auto-Suggest*,
IndiaMART InterMESH Ltd,

11th Floor, Tower 2, Assotech Business Cresterra,
Plot No. 22, Sector 135, Noida, Uttar Pradesh, India 201305

Mob.: +91-9560911996
Work: 0120-4056700 | Extn:
*1196*



Constant score and stopwords strange behaviour

2020-06-24 Thread dbourassa
Hi,

I'm working on a Solr core where we don't want to use TF-IDF (BM25).
We rank documents with boost based on popularity, exact match, phrase match,
etc.

To bypass TF-IDF, we use constant score like this "q=harry^=0.5 potter^=0.5"
(score is always 1 before boost)
We have just noticed a strange behaviour with this method.
With "q=a cat", the stopword 'a' is automatically removed by the query
analyzer.
But with "q=a^0.5 cat^0.5", the stopword 'a' is not removed. 

We also tried something like "q=(a AND cat)^=1" but the problem persists.
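One possible workaround (a sketch, not tested against a live Solr) is to run the stopword removal client-side before attaching the per-term constant-score boosts, since the ^= syntax appears to bypass the query-time stopword filter. The stopword list here is illustrative; the real one would mirror stopwords_querytime_custom.txt:

```python
# Drop stopwords before building a constant-score edismax query string.
STOPWORDS = {"a", "an", "the", "of", "to"}

def constant_score_query(raw, score=0.5):
    terms = [t for t in raw.lower().split() if t not in STOPWORDS]
    # "cat^=0.5" pins the term's score to 0.5 regardless of tf-idf.
    return " ".join(f"{t}^={score}" for t in terms)

print(constant_score_query("a cat"))  # cat^=0.5
```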

Someone have an idea or a better solution to bypass TF-IDF ?

relevant info in solrconfig :
...
edismax
590%
true
...

relevant info in schema :

...
<filter class="solr.StopFilterFactory" words="stopwords_querytime_custom.txt"/>
...


Thanks



--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Dynamic Stopwords

2020-05-15 Thread Tim Casey
What I have done for this in the past is to calculate the expected value of
a symbol within a universe, then calculate the difference between that
expected value and the actual value at the time you see the symbol. Take the
difference and use the most surprising symbols, in rank order from most
surprising to least surprising, dropping lower-frequency/unique values.
This was a fairly length-independent way to get to interesting tokens.
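A minimal sketch of that expected-vs-actual "surprise" ranking — the log-ratio scoring and thresholds are illustrative choices, not from the original message:

```python
# Rank a document's terms by how over-represented they are relative to
# their expected frequency from the corpus, dropping low-frequency terms.
from collections import Counter
import math

def surprising_terms(doc_tokens, corpus_counts, corpus_total, min_count=2, top=5):
    doc = Counter(doc_tokens)
    total = sum(doc.values())
    scored = []
    for term, count in doc.items():
        if count < min_count:
            continue  # drop lower-frequency/unique values
        # Expected occurrences of this term in a document of this length.
        expected = corpus_counts.get(term, 0) / corpus_total * total
        # Log-ratio of actual to expected (add-one smoothed); higher = more surprising.
        scored.append((math.log((count + 1) / (expected + 1)), term))
    return [t for _, t in sorted(scored, reverse=True)[:top]]
```

Common terms ("stop words" included) score near or below zero because they occur about as often as expected, so they fall to the bottom without any stopword list.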

Most calculations around stop words are very difficult to maintain and
handle. You can have 7 English stop words easily. Then you go to a larger
set, say 30-ish, then another larger set, say 150. The problem is that as you
remove stop words, you remove some meaning. You will see an example of
this when you want to know the difference between 'a noun' and 'the noun'.
  Now that we have covered English and chosen the optimal set of stop words
for a particular set of text, a new language comes around. Eventually the
stop words become a contributing factor of error. The other reason not to
use stop words is that a corpus is usually a form of golden egg. You might be
able to reindex it, but the cost is usually not free. It is generally
better to have an honest index and allow the post-analysis to change. That
way you can change it 10 times a day and no one will care.

If you are interested in a word cloud I would suspect people have done a
reasonable job around this using a solr index already.

tim

On Fri, May 15, 2020 at 1:48 PM A Adel  wrote:

> Yes, significant terms have been calculated but they have the anomaly or
> relative shift nature rather than the high frequency, as suggested also by
> the blog post. So, it looks that adding a preprocessing step upstream in an
> additional field makes more sense in this case. The text is intrinsically
> not straightforward to parse (short free text) using mainstream NLP though.
>
> A.
>
> On Fri, May 15, 2020, 8:43 PM Walter Underwood 
> wrote:
>
> > Right. I might use NLP to pull out noun phrases and entities. Entities
> are
> > essential noun phrases with proper nouns.
> >
> > Put those in a separate field and build the word cloud from that.
> >
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org
> > http://observer.wunderwood.org/  (my blog)
> >
> > > On May 15, 2020, at 11:39 AM, Doug Turnbull <
> > dturnb...@opensourceconnections.com> wrote:
> > >
> > > You may want something more like "significant terms" - terms
> > statistically
> > > significant in a document. Possibly not just based on doc freq
> > >
> > > https://saumitra.me/blog/solr-significant-terms/
> > >
> > > On Fri, May 15, 2020 at 2:16 PM A Adel  wrote:
> > >
> > >> Hi Walter,
> > >>
> > >> Thank you for your explanation, I understand the point and agree with
> > you.
> > >> However, the use case at hand is building a word cloud based on
> faceting
> > >> the multilingual text field (very simple) which in case of not using
> > stop
> > >> words returns many generic terms, articles, etc. If stop words filter
> is
> > >> not used, is there any other/better technique to be used instead to
> > build a
> > >> meaningful word cloud?
> > >>
> > >>
> > >> On Fri, May 15, 2020, 5:20 PM Walter Underwood  >
> > >> wrote:
> > >>
> > >>> Just don’t use stop words. That will give much better relevance and
> > works
> > >>> for all languages.
> > >>>
> > >>> Stop words are an obsolete hack from the days of search engines
> running
> > >>> on 16 bit CPUs. They save space by throwing away important
> information.
> > >>>
> > >>> The classic example is “to be or not to be”, which is made up
> entirely
> > of
> > >>> stop words. Remove them and it is impossible to search for that
> phrase.
> > >>>
> > >>> wunder
> > >>> Walter Underwood
> > >>> wun...@wunderwood.org
> > >>> http://observer.wunderwood.org/  (my blog)
> > >>>
> >  On May 14, 2020, at 10:47 PM, A Adel  wrote:
> > 
> >  Hi - Is there a way to configure stop words to be dynamic for each
> > >>> document
> >  based on the language detected of a multilingual text field?
> Combining
> > >>> all
> >  languages stop words in one set is a possibility however it
> introduces
> >  false positives for some language combinations, such as German and
> > >>> English.
> >  Thanks, A.
> > >>>
> > >>>
> > >>
> > >
> > >
> > > --
> > > *Doug Turnbull **| CTO* | OpenSource Connections
> > > , LLC | 240.476.9983
> > > Author: Relevant Search ; Contributor:
> *AI
> > > Powered Search *
> > > This e-mail and all contents, including attachments, is considered to
> be
> > > Company Confidential unless explicitly stated otherwise, regardless
> > > of whether attachments are marked as such.
> >
> >
>


Re: Dynamic Stopwords

2020-05-15 Thread A Adel
Yes, significant terms have been calculated, but they have an anomaly or
relative-shift nature rather than high frequency, as suggested also by
the blog post. So it looks like adding a preprocessing step upstream in an
additional field makes more sense in this case. The text is intrinsically
not straightforward to parse (short free text) using mainstream NLP, though.

A.

On Fri, May 15, 2020, 8:43 PM Walter Underwood 
wrote:

> Right. I might use NLP to pull out noun phrases and entities. Entities are
> essential noun phrases with proper nouns.
>
> Put those in a separate field and build the word cloud from that.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On May 15, 2020, at 11:39 AM, Doug Turnbull <
> dturnb...@opensourceconnections.com> wrote:
> >
> > You may want something more like "significant terms" - terms
> statistically
> > significant in a document. Possibly not just based on doc freq
> >
> > https://saumitra.me/blog/solr-significant-terms/
> >
> > On Fri, May 15, 2020 at 2:16 PM A Adel  wrote:
> >
> >> Hi Walter,
> >>
> >> Thank you for your explanation, I understand the point and agree with
> you.
> >> However, the use case at hand is building a word cloud based on faceting
> >> the multilingual text field (very simple) which in case of not using
> stop
> >> words returns many generic terms, articles, etc. If stop words filter is
> >> not used, is there any other/better technique to be used instead to
> build a
> >> meaningful word cloud?
> >>
> >>
> >> On Fri, May 15, 2020, 5:20 PM Walter Underwood 
> >> wrote:
> >>
> >>> Just don’t use stop words. That will give much better relevance and
> works
> >>> for all languages.
> >>>
> >>> Stop words are an obsolete hack from the days of search engines running
> >>> on 16 bit CPUs. They save space by throwing away important information.
> >>>
> >>> The classic example is “to be or not to be”, which is made up entirely
> of
> >>> stop words. Remove them and it is impossible to search for that phrase.
> >>>
> >>> wunder
> >>> Walter Underwood
> >>> wun...@wunderwood.org
> >>> http://observer.wunderwood.org/  (my blog)
> >>>
>  On May 14, 2020, at 10:47 PM, A Adel  wrote:
> 
>  Hi - Is there a way to configure stop words to be dynamic for each
> >>> document
>  based on the language detected of a multilingual text field? Combining
> >>> all
>  languages stop words in one set is a possibility however it introduces
>  false positives for some language combinations, such as German and
> >>> English.
>  Thanks, A.
> >>>
> >>>
> >>
> >
> >
> > --
> > *Doug Turnbull **| CTO* | OpenSource Connections
> > , LLC | 240.476.9983
> > Author: Relevant Search ; Contributor: *AI
> > Powered Search *
>
>


Re: Dynamic Stopwords

2020-05-15 Thread Tim Casey
You do not need stop words to do what you need to do. For one thing, stop
words require segmentation on a phrase-by-phrase basis in some cases.
That is, especially in places like Europe, there is a lot of mixed
language. (Your mileage may vary :)

In order to do what you want, you really need to look at the statistical
value of all of the symbols in the universe of consideration.  Find the
relevant terms, throw out common terms and anything with a frequency below
5.  This is also language independent, and slang independent and fairly
medium independent.  If you need a more refined space, you can build the
symbol space from bigrams.
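A minimal sketch of the frequency-floor idea with bigrams as the symbol space — the threshold of 5 comes from the message above; tokenization is assumed to have happened already:

```python
# Count bigrams and keep only those at or above the frequency floor.
from collections import Counter

def frequent_bigrams(tokens, min_freq=5):
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {bg: n for bg, n in bigrams.items() if n >= min_freq}
```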

If I ever write a book the title is going to be "The The".  I hope it has
multi-lingual translations.  Although, at this point, it is a very short
book :/

tim

On Fri, May 15, 2020 at 11:43 AM Walter Underwood 
wrote:

> Right. I might use NLP to pull out noun phrases and entities. Entities are
> essential noun phrases with proper nouns.
>
> Put those in a separate field and build the word cloud from that.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On May 15, 2020, at 11:39 AM, Doug Turnbull <
> dturnb...@opensourceconnections.com> wrote:
> >
> > You may want something more like "significant terms" - terms
> statistically
> > significant in a document. Possibly not just based on doc freq
> >
> > https://saumitra.me/blog/solr-significant-terms/
> >
> > On Fri, May 15, 2020 at 2:16 PM A Adel  wrote:
> >
> >> Hi Walter,
> >>
> >> Thank you for your explanation, I understand the point and agree with
> you.
> >> However, the use case at hand is building a word cloud based on faceting
> >> the multilingual text field (very simple) which in case of not using
> stop
> >> words returns many generic terms, articles, etc. If stop words filter is
> >> not used, is there any other/better technique to be used instead to
> build a
> >> meaningful word cloud?
> >>
> >>
> >> On Fri, May 15, 2020, 5:20 PM Walter Underwood 
> >> wrote:
> >>
> >>> Just don’t use stop words. That will give much better relevance and
> works
> >>> for all languages.
> >>>
> >>> Stop words are an obsolete hack from the days of search engines running
> >>> on 16 bit CPUs. They save space by throwing away important information.
> >>>
> >>> The classic example is “to be or not to be”, which is made up entirely
> of
> >>> stop words. Remove them and it is impossible to search for that phrase.
> >>>
> >>> wunder
> >>> Walter Underwood
> >>> wun...@wunderwood.org
> >>> http://observer.wunderwood.org/  (my blog)
> >>>
>  On May 14, 2020, at 10:47 PM, A Adel  wrote:
> 
>  Hi - Is there a way to configure stop words to be dynamic for each
> >>> document
>  based on the language detected of a multilingual text field? Combining
> >>> all
>  languages stop words in one set is a possibility however it introduces
>  false positives for some language combinations, such as German and
> >>> English.
>  Thanks, A.
> >>>
> >>>
> >>
> >
> >
> > --
> > *Doug Turnbull **| CTO* | OpenSource Connections
> > , LLC | 240.476.9983
> > Author: Relevant Search ; Contributor: *AI
> > Powered Search *
>
>


Re: Dynamic Stopwords

2020-05-15 Thread Walter Underwood
Right. I might use NLP to pull out noun phrases and entities. Entities are 
essentially noun phrases with proper nouns.

Put those in a separate field and build the word cloud from that.
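One possible wiring for the Solr side of this, assuming the analysis-extras contrib with its OpenNLP update processor is available — model, field, and chain names below are illustrative:

```xml
<!-- Sketch: extract named entities into a dedicated "entities" field at
     index time; requires the analysis-extras contrib and OpenNLP models. -->
<updateRequestProcessorChain name="extract-entities">
  <processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
    <str name="modelFile">en-ner-person.bin</str>
    <!-- a field type whose analyzer uses the OpenNLP tokenizer -->
    <str name="analyzerFieldType">text_opennlp</str>
    <str name="source">body</str>
    <str name="dest">entities</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

The word cloud would then be built by faceting on the entities field rather than on the raw multilingual text.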

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On May 15, 2020, at 11:39 AM, Doug Turnbull 
>  wrote:
> 
> You may want something more like "significant terms" - terms statistically
> significant in a document. Possibly not just based on doc freq
> 
> https://saumitra.me/blog/solr-significant-terms/
> 
> On Fri, May 15, 2020 at 2:16 PM A Adel  wrote:
> 
>> Hi Walter,
>> 
>> Thank you for your explanation, I understand the point and agree with you.
>> However, the use case at hand is building a word cloud based on faceting
>> the multilingual text field (very simple) which in case of not using stop
>> words returns many generic terms, articles, etc. If stop words filter is
>> not used, is there any other/better technique to be used instead to build a
>> meaningful word cloud?
>> 
>> 
>> On Fri, May 15, 2020, 5:20 PM Walter Underwood 
>> wrote:
>> 
>>> Just don’t use stop words. That will give much better relevance and works
>>> for all languages.
>>> 
>>> Stop words are an obsolete hack from the days of search engines running
>>> on 16 bit CPUs. They save space by throwing away important information.
>>> 
>>> The classic example is “to be or not to be”, which is made up entirely of
>>> stop words. Remove them and it is impossible to search for that phrase.
>>> 
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>> 
 On May 14, 2020, at 10:47 PM, A Adel  wrote:
 
 Hi - Is there a way to configure stop words to be dynamic for each
>>> document
 based on the language detected of a multilingual text field? Combining
>>> all
 languages stop words in one set is a possibility however it introduces
 false positives for some language combinations, such as German and
>>> English.
 Thanks, A.
>>> 
>>> 
>> 
> 
> 
> -- 
> *Doug Turnbull **| CTO* | OpenSource Connections
> , LLC | 240.476.9983
> Author: Relevant Search ; Contributor: *AI
> Powered Search *
> This e-mail and all contents, including attachments, is considered to be
> Company Confidential unless explicitly stated otherwise, regardless
> of whether attachments are marked as such.



Re: Dynamic Stopwords

2020-05-15 Thread Doug Turnbull
You may want something more like "significant terms" - terms statistically
significant in a document. Possibly not just based on doc freq

https://saumitra.me/blog/solr-significant-terms/
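The "significant terms" idea can be sketched in a few lines. This is a toy illustration, not Solr's (or Elasticsearch's) actual scoring, and the corpus statistics below are invented — the point is that common glue words score low without any stopword list:

```python
import math
from collections import Counter

def significant_terms(doc_tokens, doc_freq, total_docs, top=3):
    """Rank a document's terms by how much more frequent they are in the
    document than in the background corpus (doc_freq: term -> docs containing it)."""
    tf = Counter(doc_tokens)
    def score(term):
        fg = tf[term] / len(doc_tokens)          # foreground rate in this doc
        bg = doc_freq.get(term, 1) / total_docs  # background rate in the corpus
        return fg * math.log(fg / bg)            # crude signed "lift"; common words go negative
    return sorted(tf, key=score, reverse=True)[:top]

# Invented background stats: "the"/"a" are everywhere, "netflix"/"comedy" are rare.
print(significant_terms("netflix comedy the the the a".split(),
                        {"the": 900_000, "a": 800_000, "netflix": 500, "comedy": 3_000},
                        1_000_000, top=2))  # ['netflix', 'comedy']
```

The word cloud would then be built from the top-scoring terms per document instead of raw facet counts.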

On Fri, May 15, 2020 at 2:16 PM A Adel  wrote:

> Hi Walter,
>
> Thank you for your explanation, I understand the point and agree with you.
> However, the use case at hand is building a word cloud based on faceting
> the multilingual text field (very simple) which in case of not using stop
> words returns many generic terms, articles, etc. If stop words filter is
> not used, is there any other/better technique to be used instead to build a
> meaningful word cloud?
>
>
> On Fri, May 15, 2020, 5:20 PM Walter Underwood 
> wrote:
>
> > Just don’t use stop words. That will give much better relevance and works
> > for all languages.
> >
> > Stop words are an obsolete hack from the days of search engines running
> > on 16 bit CPUs. They save space by throwing away important information.
> >
> > The classic example is “to be or not to be”, which is made up entirely of
> > stop words. Remove them and it is impossible to search for that phrase.
> >
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org
> > http://observer.wunderwood.org/  (my blog)
> >
> > > On May 14, 2020, at 10:47 PM, A Adel  wrote:
> > >
> > > Hi - Is there a way to configure stop words to be dynamic for each
> > document
> > > based on the language detected of a multilingual text field? Combining
> > all
> > > languages stop words in one set is a possibility however it introduces
> > > false positives for some language combinations, such as German and
> > English.
> > > Thanks, A.
> >
> >
>


-- 
*Doug Turnbull **| CTO* | OpenSource Connections
, LLC | 240.476.9983
Author: Relevant Search ; Contributor: *AI
Powered Search *
This e-mail and all contents, including attachments, is considered to be
Company Confidential unless explicitly stated otherwise, regardless
of whether attachments are marked as such.


Re: Dynamic Stopwords

2020-05-15 Thread A Adel
Hi Walter,

Thank you for your explanation; I understand the point and agree with you.
However, the use case at hand is building a word cloud by faceting on the
multilingual text field (very simple), which, without stop words, returns
many generic terms, articles, etc. If the stop words filter is not used, is
there another, better technique for building a meaningful word cloud?


On Fri, May 15, 2020, 5:20 PM Walter Underwood 
wrote:

> Just don’t use stop words. That will give much better relevance and works
> for all languages.
>
> Stop words are an obsolete hack from the days of search engines running
> on 16 bit CPUs. They save space by throwing away important information.
>
> The classic example is “to be or not to be”, which is made up entirely of
> stop words. Remove them and it is impossible to search for that phrase.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On May 14, 2020, at 10:47 PM, A Adel  wrote:
> >
> > Hi - Is there a way to configure stop words to be dynamic for each
> document
> > based on the language detected of a multilingual text field? Combining
> all
> > languages stop words in one set is a possibility however it introduces
> > false positives for some language combinations, such as German and
> English.
> > Thanks, A.
>
>


Re: Dynamic Stopwords

2020-05-15 Thread Walter Underwood
Just don’t use stop words. That will give much better relevance and works
for all languages.

Stop words are an obsolete hack from the days of search engines running 
on 16 bit CPUs. They save space by throwing away important information.

The classic example is “to be or not to be”, which is made up entirely of
stop words. Remove them and it is impossible to search for that phrase.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On May 14, 2020, at 10:47 PM, A Adel  wrote:
> 
> Hi - Is there a way to configure stop words to be dynamic for each document
> based on the language detected of a multilingual text field? Combining all
> languages stop words in one set is a possibility however it introduces
> false positives for some language combinations, such as German and English.
> Thanks, A.



Dynamic Stopwords

2020-05-14 Thread A Adel
Hi - Is there a way to configure stop words to be dynamic for each document
based on the detected language of a multilingual text field? Combining all
languages' stop words into one set is a possibility; however, it introduces
false positives for some language combinations, such as German and English.
Thanks, A.


Re: Stopwords impact on search

2020-04-26 Thread Steven White
Thanks Walter.  Much appreciated.

To the Solr dev team: it would be of great help if Walter's IDF summary
were made part of the stop-filter documentation:
https://lucene.apache.org/solr/guide/8_5/filter-descriptions.html#stop-filter

Steve

On Fri, Apr 24, 2020 at 8:49 PM Walter Underwood 
wrote:

> IDF and stopword removal are different approaches to the same thing.
>
> Removing stopwords is a binary decision on how important common words
> are for search. It says some words are completely useless.
>
> IDF is a proportional measure on how important common words are for search.
>
> Instead of removing a list of words that are assumed to be common and less
> useful, let the engine actually measure how common the words are and factor
> that into the relevance.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Apr 24, 2020, at 5:39 PM, Steven White  wrote:
> >
> > Hi everyone,
> >
> > I get it why and when if stopwords are note indexed is a bad idea and can
> > give you 0 or incomplete results.  But what about the quality of search
> > result when stopwords are indexed vs. not indexed?
> >
> > 1) Stopwords are removed and I do word search, not phrase for "solr and
> > lucene are so cool".
> > 2) Stopwords are not removed and I do word search, not phrase for "solr
> and
> > lucene are so cool".
> >
> > Now if "and", "are" and "or" are stopwords, will the search quality and
> > ranking for #1 be better then #2?  What about if I turn the above into a
> > phrase search?
> >
> > Thanks
> >
> > Steve
> >
> >
> > On Fri, Apr 24, 2020 at 10:53 AM Walter Underwood  >
> > wrote:
> >
> >> I’m astonished that the default still has that. It was a bad idea in
> Solr
> >> 1.3, when
> >> it bit my ass.
> >>
> >> We help people with this about once a month and the advice is always the
> >> same.
> >> Imagine all the poor people who never ask about it and run with that
> >> default!
> >>
> >> wunder
> >> Walter Underwood
> >> wun...@wunderwood.org
> >> http://observer.wunderwood.org/  (my blog)
> >>
> >>> On Apr 24, 2020, at 7:34 AM, Erick Erickson 
> >> wrote:
> >>>
> >>> +1 to removing stopword filters.
> >>>
> >>>> On Apr 24, 2020, at 10:28 AM, Jan Høydahl 
> >> wrote:
> >>>>
> >>>> I tend to agree. Should we simply remove the stopword filters from the
> >> default configsets shipping with Solr?
> >>>>
> >>>> Jan
> >>>>
> >>>>> 24. apr. 2020 kl. 14:44 skrev David Hastings <
> >> hastings.recurs...@gmail.com>:
> >>>>>
> >>>>> you should never use the stopword filter unless you have a very
> >> specific
> >>>>> purpose
> >>>>>
> >>>>> On Fri, Apr 24, 2020 at 8:33 AM Steven White 
> >> wrote:
> >>>>>
> >>>>>> Hi everyone,
> >>>>>>
> >>>>>> What is, if any, the impact of stopwords in to my search ranking
> >> quality?
> >>>>>> Will my ranking improve is I do not index stopwords?
> >>>>>>
> >>>>>> I'm trying to figure out if I should use the stopword filter or not.
> >>>>>>
> >>>>>> Thanks in advanced.
> >>>>>>
> >>>>>> Steve
> >>>>>>
> >>>>
> >>>
> >>
> >>
>
>


Re: Stopwords impact on search

2020-04-24 Thread Walter Underwood
IDF and stopword removal are different approaches to the same thing.

Removing stopwords is a binary decision on how important common words
are for search. It says some words are completely useless.

IDF is a proportional measure on how important common words are for search.

Instead of removing a list of words that are assumed to be common and less
useful, let the engine actually measure how common the words are and factor
that into the relevance.
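The IDF argument can be made concrete with a toy calculation. This uses a simplified log(N/df) formula (Lucene's actual similarity formulas differ slightly), and the document counts are invented:

```python
import math

def idf(total_docs, docs_with_term):
    # Simplified inverse document frequency: common terms get a near-zero
    # weight automatically, so no binary stopword decision is needed.
    return math.log(total_docs / (1 + docs_with_term))

N = 1_000_000
print(idf(N, 990_000))  # a "the"-like term in ~99% of docs -> ~0.01
print(idf(N, 120))      # a rare term                       -> ~9.02
```

The engine measures how common each word actually is in your corpus, rather than trusting a fixed list.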

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Apr 24, 2020, at 5:39 PM, Steven White  wrote:
> 
> Hi everyone,
> 
> I get it why and when if stopwords are note indexed is a bad idea and can
> give you 0 or incomplete results.  But what about the quality of search
> result when stopwords are indexed vs. not indexed?
> 
> 1) Stopwords are removed and I do word search, not phrase for "solr and
> lucene are so cool".
> 2) Stopwords are not removed and I do word search, not phrase for "solr and
> lucene are so cool".
> 
> Now if "and", "are" and "or" are stopwords, will the search quality and
> ranking for #1 be better then #2?  What about if I turn the above into a
> phrase search?
> 
> Thanks
> 
> Steve
> 
> 
> On Fri, Apr 24, 2020 at 10:53 AM Walter Underwood 
> wrote:
> 
>> I’m astonished that the default still has that. It was a bad idea in Solr
>> 1.3, when
>> it bit my ass.
>> 
>> We help people with this about once a month and the advice is always the
>> same.
>> Imagine all the poor people who never ask about it and run with that
>> default!
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>>> On Apr 24, 2020, at 7:34 AM, Erick Erickson 
>> wrote:
>>> 
>>> +1 to removing stopword filters.
>>> 
>>>> On Apr 24, 2020, at 10:28 AM, Jan Høydahl 
>> wrote:
>>>> 
>>>> I tend to agree. Should we simply remove the stopword filters from the
>> default configsets shipping with Solr?
>>>> 
>>>> Jan
>>>> 
>>>>> 24. apr. 2020 kl. 14:44 skrev David Hastings <
>> hastings.recurs...@gmail.com>:
>>>>> 
>>>>> you should never use the stopword filter unless you have a very
>> specific
>>>>> purpose
>>>>> 
>>>>> On Fri, Apr 24, 2020 at 8:33 AM Steven White 
>> wrote:
>>>>> 
>>>>>> Hi everyone,
>>>>>> 
>>>>>> What is, if any, the impact of stopwords in to my search ranking
>> quality?
>>>>>> Will my ranking improve is I do not index stopwords?
>>>>>> 
>>>>>> I'm trying to figure out if I should use the stopword filter or not.
>>>>>> 
>>>>>> Thanks in advanced.
>>>>>> 
>>>>>> Steve
>>>>>> 
>>>> 
>>> 
>> 
>> 



Re: Stopwords impact on search

2020-04-24 Thread Steven White
Hi everyone,

I get why and when not indexing stopwords is a bad idea and can
give you zero or incomplete results.  But what about the quality of search
results when stopwords are indexed vs. not indexed?

1) Stopwords are removed and I do word search, not phrase for "solr and
lucene are so cool".
2) Stopwords are not removed and I do word search, not phrase for "solr and
lucene are so cool".

Now if "and", "are" and "or" are stopwords, will the search quality and
ranking for #1 be better then #2?  What about if I turn the above into a
phrase search?

Thanks

Steve


On Fri, Apr 24, 2020 at 10:53 AM Walter Underwood 
wrote:

> I’m astonished that the default still has that. It was a bad idea in Solr
> 1.3, when
> it bit my ass.
>
> We help people with this about once a month and the advice is always the
> same.
> Imagine all the poor people who never ask about it and run with that
> default!
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Apr 24, 2020, at 7:34 AM, Erick Erickson 
> wrote:
> >
> > +1 to removing stopword filters.
> >
> >> On Apr 24, 2020, at 10:28 AM, Jan Høydahl 
> wrote:
> >>
> >> I tend to agree. Should we simply remove the stopword filters from the
> default configsets shipping with Solr?
> >>
> >> Jan
> >>
> >>> 24. apr. 2020 kl. 14:44 skrev David Hastings <
> hastings.recurs...@gmail.com>:
> >>>
> >>> you should never use the stopword filter unless you have a very
> specific
> >>> purpose
> >>>
> >>> On Fri, Apr 24, 2020 at 8:33 AM Steven White 
> wrote:
> >>>
> >>>> Hi everyone,
> >>>>
> >>>> What is, if any, the impact of stopwords in to my search ranking
> quality?
> >>>> Will my ranking improve is I do not index stopwords?
> >>>>
> >>>> I'm trying to figure out if I should use the stopword filter or not.
> >>>>
> >>>> Thanks in advanced.
> >>>>
> >>>> Steve
> >>>>
> >>
> >
>
>


Re: Stopwords impact on search

2020-04-24 Thread Walter Underwood
I’m astonished that the default still has that. It was a bad idea in Solr 1.3, 
when
it bit my ass.

We help people with this about once a month and the advice is always the same.
Imagine all the poor people who never ask about it and run with that default!

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Apr 24, 2020, at 7:34 AM, Erick Erickson  wrote:
> 
> +1 to removing stopword filters.
> 
>> On Apr 24, 2020, at 10:28 AM, Jan Høydahl  wrote:
>> 
>> I tend to agree. Should we simply remove the stopword filters from the 
>> default configsets shipping with Solr?
>> 
>> Jan
>> 
>>> 24. apr. 2020 kl. 14:44 skrev David Hastings :
>>> 
>>> you should never use the stopword filter unless you have a very specific
>>> purpose
>>> 
>>> On Fri, Apr 24, 2020 at 8:33 AM Steven White  wrote:
>>> 
>>>> Hi everyone,
>>>> 
>>>> What is, if any, the impact of stopwords in to my search ranking quality?
>>>> Will my ranking improve is I do not index stopwords?
>>>> 
>>>> I'm trying to figure out if I should use the stopword filter or not.
>>>> 
>>>> Thanks in advanced.
>>>> 
>>>> Steve
>>>> 
>> 
> 



Re: Stopwords impact on search

2020-04-24 Thread Jan Høydahl
Turns out there is already a JIRA for this SOLR-10992 
<https://issues.apache.org/jira/browse/SOLR-10992>
where both you and I commented already :) But it’s 3 years old...

> 24. apr. 2020 kl. 16:34 skrev Erick Erickson :
> 
> +1 to removing stopword filters.
> 
>> On Apr 24, 2020, at 10:28 AM, Jan Høydahl  wrote:
>> 
>> I tend to agree. Should we simply remove the stopword filters from the 
>> default configsets shipping with Solr?
>> 
>> Jan
>> 
>>> 24. apr. 2020 kl. 14:44 skrev David Hastings :
>>> 
>>> you should never use the stopword filter unless you have a very specific
>>> purpose
>>> 
>>> On Fri, Apr 24, 2020 at 8:33 AM Steven White  wrote:
>>> 
>>>> Hi everyone,
>>>> 
>>>> What is, if any, the impact of stopwords in to my search ranking quality?
>>>> Will my ranking improve is I do not index stopwords?
>>>> 
>>>> I'm trying to figure out if I should use the stopword filter or not.
>>>> 
>>>> Thanks in advanced.
>>>> 
>>>> Steve
>>>> 
>> 
> 



Re: Stopwords impact on search

2020-04-24 Thread Rohan Kasat
So should we use the stopwords filter as part of the query analyzer, to
avoid highlighting these stop words?

Regards,
Rohan

On Fri, Apr 24, 2020 at 7:45 AM Walter Underwood 
wrote:

> Agreed. Here is an article from 13 years ago when I accidentally turned on
> stopword removal at Netflix. It caused bad problems.
>
> https://observer.wunderwood.org/2007/05/31/do-all-stopword-queries-matter/
>
> Infoseek was not removing stopwords when I joined them in 1996. Since then,
> I’ve always left stopwords in the index. Removing stop words is a desperate
> speed/hack hack from the days of 16-bit machines.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Apr 24, 2020, at 5:44 AM, David Hastings <
> hastings.recurs...@gmail.com> wrote:
> >
> > you should never use the stopword filter unless you have a very specific
> > purpose
> >
> > On Fri, Apr 24, 2020 at 8:33 AM Steven White 
> wrote:
> >
> >> Hi everyone,
> >>
> >> What is, if any, the impact of stopwords in to my search ranking
> quality?
> >> Will my ranking improve is I do not index stopwords?
> >>
> >> I'm trying to figure out if I should use the stopword filter or not.
> >>
> >> Thanks in advanced.
> >>
> >> Steve
> >>
>
> --

*Regards,*
*Rohan Kasat*


Re: Stopwords impact on search

2020-04-24 Thread Walter Underwood
Agreed. Here is an article from 13 years ago when I accidentally turned on 
stopword removal at Netflix. It caused bad problems.

https://observer.wunderwood.org/2007/05/31/do-all-stopword-queries-matter/

Infoseek was not removing stopwords when I joined them in 1996. Since then,
I’ve always left stopwords in the index. Removing stop words is a desperate
speed/space hack from the days of 16-bit machines.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Apr 24, 2020, at 5:44 AM, David Hastings  
> wrote:
> 
> you should never use the stopword filter unless you have a very specific
> purpose
> 
> On Fri, Apr 24, 2020 at 8:33 AM Steven White  wrote:
> 
>> Hi everyone,
>> 
>> What is, if any, the impact of stopwords in to my search ranking quality?
>> Will my ranking improve is I do not index stopwords?
>> 
>> I'm trying to figure out if I should use the stopword filter or not.
>> 
>> Thanks in advanced.
>> 
>> Steve
>> 



Re: Stopwords impact on search

2020-04-24 Thread Erick Erickson
+1 to removing stopword filters.

> On Apr 24, 2020, at 10:28 AM, Jan Høydahl  wrote:
> 
> I tend to agree. Should we simply remove the stopword filters from the 
> default configsets shipping with Solr?
> 
> Jan
> 
>> 24. apr. 2020 kl. 14:44 skrev David Hastings :
>> 
>> you should never use the stopword filter unless you have a very specific
>> purpose
>> 
>> On Fri, Apr 24, 2020 at 8:33 AM Steven White  wrote:
>> 
>>> Hi everyone,
>>> 
>>> What is, if any, the impact of stopwords in to my search ranking quality?
>>> Will my ranking improve is I do not index stopwords?
>>> 
>>> I'm trying to figure out if I should use the stopword filter or not.
>>> 
>>> Thanks in advanced.
>>> 
>>> Steve
>>> 
> 



Re: Stopwords impact on search

2020-04-24 Thread Jan Høydahl
I tend to agree. Should we simply remove the stopword filters from the default 
configsets shipping with Solr?

Jan

> 24. apr. 2020 kl. 14:44 skrev David Hastings :
> 
> you should never use the stopword filter unless you have a very specific
> purpose
> 
> On Fri, Apr 24, 2020 at 8:33 AM Steven White  wrote:
> 
>> Hi everyone,
>> 
>> What is, if any, the impact of stopwords in to my search ranking quality?
>> Will my ranking improve is I do not index stopwords?
>> 
>> I'm trying to figure out if I should use the stopword filter or not.
>> 
>> Thanks in advanced.
>> 
>> Steve
>> 



Re: Stopwords impact on search

2020-04-24 Thread David Hastings
you should never use the stopword filter unless you have a very specific
purpose

On Fri, Apr 24, 2020 at 8:33 AM Steven White  wrote:

> Hi everyone,
>
> What is, if any, the impact of stopwords in to my search ranking quality?
> Will my ranking improve is I do not index stopwords?
>
> I'm trying to figure out if I should use the stopword filter or not.
>
> Thanks in advanced.
>
> Steve
>


Stopwords impact on search

2020-04-24 Thread Steven White
Hi everyone,

What impact, if any, do stopwords have on my search ranking quality?
Will my ranking improve if I do not index stopwords?

I'm trying to figure out if I should use the stopword filter or not.

Thanks in advance.

Steve


Re: handling stopwords for special scenarios

2020-04-09 Thread Walter Underwood
Agreed, leave the stopwords alone. I ran into this same problem
thirteen years ago at Netflix. Even before that, I wasn’t removing 
stopwords, but I accidentally left them in the Solr 1.3 config.

https://observer.wunderwood.org/2007/05/31/do-all-stopword-queries-matter/

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Apr 9, 2020, at 7:34 AM, Erick Erickson  wrote:
> 
> 1> why use stopwords at all? They’re largely a holdover from the
> bad old days when memory was limited. I usually recommend
> people just start by not using stopwords at all.
> 
> 2> assuming <1> doesn’t work for you, why doesn’t it look feasible
>  to remove here from the stopword list? True, you have to re-index.
> 
> But what you’re asking for is not possible. Stopwords are simply gone
> from the index without a trace, there’s absolutely no way to reconstruct
> one.
> 
> I’d also argue that this is an N+1 situation. Sure, you’ll solve the “here”
> problem by removing it from the stopword list, but then you’ll have
> the same problem with “there”…
> 
> Best,
> Erick
> 
>> On Apr 9, 2020, at 9:10 AM, rashi gandhi  wrote:
>> 
>> Hi All,
>> 
>> We are using stopword filter factory at both index and search time, to omit
>> the stopwords.
>> 
>> However, for a one particular case, we are getting "here" as a search query
>> and "here" is one the words in title/name representing our client.
>> We are returning zero results as "here" is one of the English
>> language stopwords which is getting omitted while indexing and searching
>> both.
>> 
>> One solution could be that I remove the "here" from list of stopwords,
>> however does not look feasible.
>> 
>> Is there any way where we can handle this kind of cases, where
>> stopwrods are meant to be actual search term?
>> 
>> Any leads would be appreciated.
> 



Re: handling stopwords for special scenarios

2020-04-09 Thread Erick Erickson
1> why use stopwords at all? They’re largely a holdover from the
 bad old days when memory was limited. I usually recommend
 people just start by not using stopwords at all.

2> assuming <1> doesn’t work for you, why doesn’t it look feasible
  to remove here from the stopword list? True, you have to re-index.

But what you’re asking for is not possible. Stopwords are simply gone
from the index without a trace, there’s absolutely no way to reconstruct
one.

I’d also argue that this is an N+1 situation. Sure, you’ll solve the “here”
problem by removing it from the stopword list, but then you’ll have
the same problem with “there”…

Best,
Erick

> On Apr 9, 2020, at 9:10 AM, rashi gandhi  wrote:
> 
> Hi All,
> 
> We are using stopword filter factory at both index and search time, to omit
> the stopwords.
> 
> However, for a one particular case, we are getting "here" as a search query
> and "here" is one the words in title/name representing our client.
> We are returning zero results as "here" is one of the English
> language stopwords which is getting omitted while indexing and searching
> both.
> 
> One solution could be that I remove the "here" from list of stopwords,
> however does not look feasible.
> 
> Is there any way where we can handle this kind of cases, where
> stopwrods are meant to be actual search term?
> 
> Any leads would be appreciated.



handling stopwords for special scenarios

2020-04-09 Thread rashi gandhi
Hi All,

We are using stopword filter factory at both index and search time, to omit
the stopwords.

However, for one particular case, we are getting "here" as a search query,
and "here" is one of the words in the title/name representing our client.
We are returning zero results because "here" is an English-language
stopword, which is omitted during both indexing and searching.

One solution could be that I remove the "here" from list of stopwords,
however does not look feasible.

Is there any way to handle cases like this, where
stopwords are meant to be the actual search term?

Any leads would be appreciated.


Re: Weird issues when using synonyms and stopwords together

2020-03-20 Thread Walter Underwood
Do not remove stopwords.

Stopword removal was a hack invented for 16-bit machines and multi-megabyte 
disks.
That hack is not needed now.

tf.idf addresses the same problem as stopwords with a much better algorithm.
Removing stopwords is an on/off decision for a guess at common words.
tf.idf is a proportional weighting of common words based on the statistics of
your documents.

Do not remove stopwords.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Mar 20, 2020, at 7:52 AM, Vikas Kumar  wrote:
> 
> I have a field title in my solr schema:
> 
>  required="true" stored="true" />
> 
> text_en is defined as follows:
> 
> positionIncrementGap="100" docValues="false" multiValued="false">
>
>
> words="stopwords_en.txt" />
>
> preserveOriginal="true" />
>
>
>
>
> synonyms="synonyms_en.txt" ignoreCase="true" expand="true" />
>     words="stopwords_en.txt" />
>
>
>
>
> 
> I'm encountering strange behaviour when using multi-word synonyms which
> contain stopwords.
> 
> If the stopwords appear in the middle, it works fine. For example, if I
> have the following in my synonyms file (where i is a stopword):
> 
> iphone, apple i phone
> 
> And if I query: /select?q=iphone&df=title&defType=edismax
> 
> The parsed query is: +DisjunctionMaxQuery(+title:appl +title:phone)
> title:iphon
> 
> Same for query: /select?q=apple i phone&df=title&defType=edismax
> 
> But if stopwords appear at the start or end, then behaviour is
> unpredictable.
> 
> In most of the cases, the entire synonym is dropped. For example, if I
> change my synonyms file to:
> 
> iphone, i phone
> 
> and do the same query again (with iphone), I get:
> 
> +DisjunctionMaxQuery(((title:iphon)))
> 
> I was expecting iphon and phone (as i would be dropped) in my dismax query.
> 
> In some cases, behaviour is even more weird.
> 
> For example, if my synonyms file is:
> 
> between two ferns,netflix comedy,zach galifianakis show,netflix 2019 best
> 
> and I have ferns and best as my stopwords. If I do the following query:
> 
> /select?q=netflix comedy&df=title&defType=edismax
> 
> I get this:
> 
> +DisjunctionMaxQuery+title:between +title:two +title:galifianaki
> +title:show) (+title:netflix +title:2019 +title:comedi
> 
> which is kind of a very weird combinations.
> 
> I'm not able to understand this behaviour and have not found anything
> related to this in documentation or internet. Maybe I'm missing something.
> Any help/pointers is highly appreciated.
> 
> Solr version: 8.4.1



Weird issues when using synonyms and stopwords together

2020-03-20 Thread Vikas Kumar
I have a field title in my solr schema:



text_en is defined as follows:


















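(The list archiver stripped the XML tags from this message. Judging by the attribute fragments that survive in the quoted copy above — words="stopwords_en.txt", preserveOriginal="true", synonyms="synonyms_en.txt" — and by the stemmed terms like "iphon"/"comedi" in the parsed queries, the definitions were presumably along these lines; the tokenizer and filter class names below are assumptions, not the original:)

```xml
<field name="title" type="text_en" indexed="true" required="true" stored="true"/>

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100"
           docValues="false" multiValued="false">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_en.txt"/>
    <filter class="solr.WordDelimiterGraphFilterFactory" preserveOriginal="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.FlattenGraphFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms_en.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_en.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```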
I'm encountering strange behaviour when using multi-word synonyms which
contain stopwords.

If the stopwords appear in the middle, it works fine. For example, if I
have the following in my synonyms file (where i is a stopword):

iphone, apple i phone

And if I query: /select?q=iphone&df=title&defType=edismax

The parsed query is: +DisjunctionMaxQuery(+title:appl +title:phone)
title:iphon

Same for query: /select?q=apple i phone&df=title&defType=edismax

But if stopwords appear at the start or end, then behaviour is
unpredictable.

In most of the cases, the entire synonym is dropped. For example, if I
change my synonyms file to:

iphone, i phone

and do the same query again (with iphone), I get:

+DisjunctionMaxQuery(((title:iphon)))

I was expecting iphon and phone (as i would be dropped) in my dismax query.

In some cases, the behaviour is even weirder.

For example, if my synonyms file is:

between two ferns,netflix comedy,zach galifianakis show,netflix 2019 best

and I have ferns and best as my stopwords. If I do the following query:

/select?q=netflix comedy&df=title&defType=edismax

I get this:

+DisjunctionMaxQuery+title:between +title:two +title:galifianaki
+title:show) (+title:netflix +title:2019 +title:comedi

which is a very weird combination.

I'm not able to understand this behaviour and have not found anything
related to it in the documentation or on the internet. Maybe I'm missing
something. Any help or pointers would be highly appreciated.

Solr version: 8.4.1


Re: Re-creating deleted Managed Stopwords lists results in error

2020-02-17 Thread Walter Underwood
Make phrases into single tokens at indexing and query time. Let the engine do
the rest of the work.

For example, “subunits of the army” can become “subunitsofthearmy” or 
“subunits_of_the_army”.
We used patterns to choose phrases, so “word word”, “word glue word”, or “word 
glue glue word”
could become phrases.
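The glue-word patterns described above can be sketched as follows. The glue set and the maximum run length are illustrative assumptions; the actual Infoseek patterns are not given in the thread:

```python
# Glue words: common words allowed *inside* a phrase token (illustrative set).
GLUE = {"a", "an", "the", "of", "in", "to", "or", "not"}

def phrase_tokens(tokens, max_glue=2):
    """Merge 'word glue word' and 'word glue glue word' runs into single
    underscore-joined tokens; all other tokens pass through unchanged."""
    out, i = [], 0
    while i < len(tokens):
        j = i + 1
        glue = []
        while j < len(tokens) and tokens[j] in GLUE and len(glue) < max_glue:
            glue.append(tokens[j])
            j += 1
        if glue and j < len(tokens) and tokens[j] not in GLUE:
            out.append("_".join([tokens[i], *glue, tokens[j]]))
            i = j + 1
        else:
            out.append(tokens[i])
            i += 1
    return out

print(phrase_tokens("subunits of the army".split()))  # ['subunits_of_the_army']
```

Applying the same transform at index and query time means the phrase matches as a single term, with its own (phrase-level) IDF, and without storing positions.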

Nutch did something like this, but used it for filtering down the candidates 
for matching,
then used regular Lucene scoring for ranking.

The Infoseek Ultra index used these phrase terms but did not store positions.

The idea came from early DNA search engines.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Feb 17, 2020, at 10:53 AM, David Hastings  
> wrote:
> 
> interesting, i cant seem to find anything on Phrase IDF, dont suppose you
> have a link or two i could look at by chance?
> 
> On Mon, Feb 17, 2020 at 1:48 PM Walter Underwood 
> wrote:
> 
>> At Infoseek, we used “glue words” to build phrase tokens. It was really
>> effective.
>> Phrase IDF is powerful stuff.
>> 
>> Luckily for you, the patent on that has expired. :-)
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>>> On Feb 17, 2020, at 10:46 AM, David Hastings <
>> hastings.recurs...@gmail.com> wrote:
>>> 
>>> i use stop words for building shingles into "interesting phrases" for my
>>> machine teacher/students, so i wouldnt say theres no reason, however my
>> use
>>> case is very specific.  Otherwise yeah, theyre gone for all practical
>>> reasons/search scenarios.
>>> 
>>> On Mon, Feb 17, 2020 at 1:41 PM Walter Underwood 
>>> wrote:
>>> 
>>>> Why are you using stopwords? I would need a really, really good reason
>> to
>>>> use those.
>>>> 
>>>> Stopwords are an obsolete technique from 16-bit processors. I’ve never
>>>> used them and
>>>> I’ve been a search engineer since 1997.
>>>> 
>>>> wunder
>>>> Walter Underwood
>>>> wun...@wunderwood.org
>>>> http://observer.wunderwood.org/  (my blog)
>>>> 
>>>>> On Feb 17, 2020, at 7:31 AM, Thomas Corthals 
>>>> wrote:
>>>>> 
>>>>> Hi
>>>>> 
>>>>> I've run into an issue with creating a Managed Stopwords list that has
>>>> the
>>>>> same name as a previously deleted list. Going through the same flow
>> with
>>>>> Managed Synonyms doesn't result in this unexpected behaviour. Am I
>>>> missing
>>>>> something or did I discover a bug in Solr?
>>>>> 
>>>>> On a newly started solr with the techproducts core:
>>>>> 
>>>>> curl -X PUT -H 'Content-type:application/json' --data-binary
>>>>> 
>> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
>>>>> 
>>>> 
>> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
>>>>> curl -X DELETE
>>>>> 
>>>> 
>> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
>>>>> curl
>>>> http://localhost:8983/solr/admin/cores?action=RELOAD\&core=techproducts
>>>>> curl -X PUT -H 'Content-type:application/json' --data-binary
>>>>> 
>> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
>>>>> 
>>>> 
>> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
>>>>> 
>>>>> The second PUT request results in a status 500 with error
>>>>> msg "java.util.LinkedHashMap cannot be cast to java.util.List".
>>>>> 
>>>>> Similar requests for synonyms work fine, no matter how many times I
>>>> repeat
>>>>> the CREATE/DELETE/RELOAD cycle:
>>>>> 
>>>>> curl -X PUT -H 'Content-type:application/json' --data-binary
>>>>> 
>>>> 
>> '{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}'
>>>>> 
>> http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
>>>>> curl -X DELETE
>>>>> 
>> http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
>>>>> curl
>>>> http://localhost:8983/solr/admin/cores?action=RELOAD\&core=techproducts

Re: Re-creating deleted Managed Stopwords lists results in error

2020-02-17 Thread David Hastings
Interesting, I can't seem to find anything on phrase IDF. Don't suppose you
have a link or two I could look at, by chance?

On Mon, Feb 17, 2020 at 1:48 PM Walter Underwood 
wrote:

> At Infoseek, we used “glue words” to build phrase tokens. It was really
> effective.
> Phrase IDF is powerful stuff.
>
> Luckily for you, the patent on that has expired. :-)
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Feb 17, 2020, at 10:46 AM, David Hastings <
> hastings.recurs...@gmail.com> wrote:
> >
> > i use stop words for building shingles into "interesting phrases" for my
> > machine teacher/students, so i wouldnt say theres no reason, however my
> use
> > case is very specific.  Otherwise yeah, theyre gone for all practical
> > reasons/search scenarios.
> >
> > On Mon, Feb 17, 2020 at 1:41 PM Walter Underwood 
> > wrote:
> >
> >> Why are you using stopwords? I would need a really, really good reason
> to
> >> use those.
> >>
> >> Stopwords are an obsolete technique from 16-bit processors. I’ve never
> >> used them and
> >> I’ve been a search engineer since 1997.
> >>
> >> wunder
> >> Walter Underwood
> >> wun...@wunderwood.org
> >> http://observer.wunderwood.org/  (my blog)
> >>
> >>> On Feb 17, 2020, at 7:31 AM, Thomas Corthals 
> >> wrote:
> >>>
> >>> Hi
> >>>
> >>> I've run into an issue with creating a Managed Stopwords list that has
> >> the
> >>> same name as a previously deleted list. Going through the same flow
> with
> >>> Managed Synonyms doesn't result in this unexpected behaviour. Am I
> >> missing
> >>> something or did I discover a bug in Solr?
> >>>
> >>> On a newly started solr with the techproducts core:
> >>>
> >>> curl -X PUT -H 'Content-type:application/json' --data-binary
> >>>
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> >>>
> >>
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> >>> curl -X DELETE
> >>>
> >>
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> >>> curl
> >> http://localhost:8983/solr/admin/cores?action=RELOAD\&core=techproducts
> >>> curl -X PUT -H 'Content-type:application/json' --data-binary
> >>>
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> >>>
> >>
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> >>>
> >>> The second PUT request results in a status 500 with error
> >>> msg "java.util.LinkedHashMap cannot be cast to java.util.List".
> >>>
> >>> Similar requests for synonyms work fine, no matter how many times I
> >> repeat
> >>> the CREATE/DELETE/RELOAD cycle:
> >>>
> >>> curl -X PUT -H 'Content-type:application/json' --data-binary
> >>>
> >>
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}'
> >>>
> http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
> >>> curl -X DELETE
> >>>
> http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
> >>> curl
> >> http://localhost:8983/solr/admin/cores?action=RELOAD\&core=techproducts
> >>> curl -X PUT -H 'Content-type:application/json' --data-binary
> >>>
> >>
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}'
> >>>
> http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
> >>>
> >>> Reloading after creating the Stopwords list but not after deleting it
> >> works
> >>> without error too on a fresh techproducts core (you'll have to remove
> the
> >>> directory from disk and create the core again after running the
> previous
> >>> commands).
> >>>
> >>> curl -X PUT -H 'Content-type:application/json' --data-binary
> >>>
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> >>>
> >>
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> >>> curl
> >> http://localhost:8983/solr/admin/cores?action=RELOAD\&core=techproducts

Re: Re-creating deleted Managed Stopwords lists results in error

2020-02-17 Thread Walter Underwood
At Infoseek, we used “glue words” to build phrase tokens. It was really 
effective.
Phrase IDF is powerful stuff.

Luckily for you, the patent on that has expired. :-)

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Feb 17, 2020, at 10:46 AM, David Hastings  
> wrote:
> 
> i use stop words for building shingles into "interesting phrases" for my
> machine teacher/students, so i wouldnt say theres no reason, however my use
> case is very specific.  Otherwise yeah, theyre gone for all practical
> reasons/search scenarios.
> 
> On Mon, Feb 17, 2020 at 1:41 PM Walter Underwood 
> wrote:
> 
>> Why are you using stopwords? I would need a really, really good reason to
>> use those.
>> 
>> Stopwords are an obsolete technique from 16-bit processors. I’ve never
>> used them and
>> I’ve been a search engineer since 1997.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>>> On Feb 17, 2020, at 7:31 AM, Thomas Corthals 
>> wrote:
>>> 
>>> Hi
>>> 
>>> I've run into an issue with creating a Managed Stopwords list that has
>> the
>>> same name as a previously deleted list. Going through the same flow with
>>> Managed Synonyms doesn't result in this unexpected behaviour. Am I
>> missing
>>> something or did I discover a bug in Solr?
>>> 
>>> On a newly started solr with the techproducts core:
>>> 
>>> curl -X PUT -H 'Content-type:application/json' --data-binary
>>> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
>>> 
>> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
>>> curl -X DELETE
>>> 
>> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
>>> curl
>> http://localhost:8983/solr/admin/cores?action=RELOAD\&core=techproducts
>>> curl -X PUT -H 'Content-type:application/json' --data-binary
>>> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
>>> 
>> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
>>> 
>>> The second PUT request results in a status 500 with error
>>> msg "java.util.LinkedHashMap cannot be cast to java.util.List".
>>> 
>>> Similar requests for synonyms work fine, no matter how many times I
>> repeat
>>> the CREATE/DELETE/RELOAD cycle:
>>> 
>>> curl -X PUT -H 'Content-type:application/json' --data-binary
>>> 
>> '{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}'
>>> http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
>>> curl -X DELETE
>>> http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
>>> curl
>> http://localhost:8983/solr/admin/cores?action=RELOAD\&core=techproducts
>>> curl -X PUT -H 'Content-type:application/json' --data-binary
>>> 
>> '{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}'
>>> http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
>>> 
>>> Reloading after creating the Stopwords list but not after deleting it
>> works
>>> without error too on a fresh techproducts core (you'll have to remove the
>>> directory from disk and create the core again after running the previous
>>> commands).
>>> 
>>> curl -X PUT -H 'Content-type:application/json' --data-binary
>>> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
>>> 
>> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
>>> curl
>> http://localhost:8983/solr/admin/cores?action=RELOAD\&core=techproducts
>>> curl -X DELETE
>>> 
>> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
>>> curl -X PUT -H 'Content-type:application/json' --data-binary
>>> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
>>> 
>> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
>>> 
>>> And even curiouser, when doing a CREATE/DELETE for Stopwords, then a
>>> CREATE/DELETE for Synonyms, and only then a RELOAD of the core, the cycle
>>> can be completed twice. (Again, on a freshly created techproducts core.)
> > Only the third attempt to create a list results in an error.

Re: Re-creating deleted Managed Stopwords lists results in error

2020-02-17 Thread David Hastings
I use stop words for building shingles into "interesting phrases" for my
machine teacher/students, so I wouldn't say there's no reason; however, my use
case is very specific. Otherwise, yeah, they're gone for all practical
reasons/search scenarios.

On Mon, Feb 17, 2020 at 1:41 PM Walter Underwood 
wrote:

> Why are you using stopwords? I would need a really, really good reason to
> use those.
>
> Stopwords are an obsolete technique from 16-bit processors. I’ve never
> used them and
> I’ve been a search engineer since 1997.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Feb 17, 2020, at 7:31 AM, Thomas Corthals 
> wrote:
> >
> > Hi
> >
> > I've run into an issue with creating a Managed Stopwords list that has
> the
> > same name as a previously deleted list. Going through the same flow with
> > Managed Synonyms doesn't result in this unexpected behaviour. Am I
> missing
> > something or did I discover a bug in Solr?
> >
> > On a newly started solr with the techproducts core:
> >
> > curl -X PUT -H 'Content-type:application/json' --data-binary
> > '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> >
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> > curl -X DELETE
> >
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> > curl
> http://localhost:8983/solr/admin/cores?action=RELOAD\&core=techproducts
> > curl -X PUT -H 'Content-type:application/json' --data-binary
> > '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> >
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> >
> > The second PUT request results in a status 500 with error
> > msg "java.util.LinkedHashMap cannot be cast to java.util.List".
> >
> > Similar requests for synonyms work fine, no matter how many times I
> repeat
> > the CREATE/DELETE/RELOAD cycle:
> >
> > curl -X PUT -H 'Content-type:application/json' --data-binary
> >
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}'
> > http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
> > curl -X DELETE
> > http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
> > curl
> http://localhost:8983/solr/admin/cores?action=RELOAD\&core=techproducts
> > curl -X PUT -H 'Content-type:application/json' --data-binary
> >
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}'
> > http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
> >
> > Reloading after creating the Stopwords list but not after deleting it
> works
> > without error too on a fresh techproducts core (you'll have to remove the
> > directory from disk and create the core again after running the previous
> > commands).
> >
> > curl -X PUT -H 'Content-type:application/json' --data-binary
> > '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> >
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> > curl
> http://localhost:8983/solr/admin/cores?action=RELOAD\&core=techproducts
> > curl -X DELETE
> >
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> > curl -X PUT -H 'Content-type:application/json' --data-binary
> > '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> >
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> >
> > And even curiouser, when doing a CREATE/DELETE for Stopwords, then a
> > CREATE/DELETE for Synonyms, and only then a RELOAD of the core, the cycle
> > can be completed twice. (Again, on a freshly created techproducts core.)
> > Only the third attempt to create a list results in an error. Synonyms can
> > still be created and deleted repeatedly after this.
> >
> > curl -X PUT -H 'Content-type:application/json' --data-binary
> > '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> >
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> > curl -X DELETE
> >
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> > curl -X PUT -H 'Content-type:application/json' --data-binary
> >
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}'
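
The "glue words" idea mentioned above — keeping stopwords so that multi-word units like "vitamin a" survive as phrase tokens — can be illustrated with plain word n-gram shingling. This is only a toy sketch of the concept, not Lucene's actual ShingleFilter; the function name and parameters are made up for illustration:

```python
def shingles(text, min_n=2, max_n=3):
    """Build word n-gram shingles over the raw token stream.
    Stopwords are deliberately kept, so they act as 'glue' inside phrases."""
    tokens = text.lower().split()
    out = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

# A query made almost entirely of stopwords still yields distinctive shingles,
# which is what makes phrase IDF on them useful.
phrases = shingles("to be or not to be")
```

Each shingle then gets its own document frequency, so a rare phrase like "or not to" scores very differently from its common component words.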

Re: Re-creating deleted Managed Stopwords lists results in error

2020-02-17 Thread Walter Underwood
Why are you using stopwords? I would need a really, really good reason to use 
those.

Stopwords are an obsolete technique from 16-bit processors. I’ve never used 
them and
I’ve been a search engineer since 1997.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Feb 17, 2020, at 7:31 AM, Thomas Corthals  wrote:
> 
> Hi
> 
> I've run into an issue with creating a Managed Stopwords list that has the
> same name as a previously deleted list. Going through the same flow with
> Managed Synonyms doesn't result in this unexpected behaviour. Am I missing
> something or did I discover a bug in Solr?
> 
> On a newly started solr with the techproducts core:
> 
> curl -X PUT -H 'Content-type:application/json' --data-binary
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> curl -X DELETE
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> curl http://localhost:8983/solr/admin/cores?action=RELOAD\&core=techproducts
> curl -X PUT -H 'Content-type:application/json' --data-binary
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> 
> The second PUT request results in a status 500 with error
> msg "java.util.LinkedHashMap cannot be cast to java.util.List".
> 
> Similar requests for synonyms work fine, no matter how many times I repeat
> the CREATE/DELETE/RELOAD cycle:
> 
> curl -X PUT -H 'Content-type:application/json' --data-binary
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}'
> http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
> curl -X DELETE
> http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
> curl http://localhost:8983/solr/admin/cores?action=RELOAD\&core=techproducts
> curl -X PUT -H 'Content-type:application/json' --data-binary
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}'
> http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
> 
> Reloading after creating the Stopwords list but not after deleting it works
> without error too on a fresh techproducts core (you'll have to remove the
> directory from disk and create the core again after running the previous
> commands).
> 
> curl -X PUT -H 'Content-type:application/json' --data-binary
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> curl http://localhost:8983/solr/admin/cores?action=RELOAD\&core=techproducts
> curl -X DELETE
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> curl -X PUT -H 'Content-type:application/json' --data-binary
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> 
> And even curiouser, when doing a CREATE/DELETE for Stopwords, then a
> CREATE/DELETE for Synonyms, and only then a RELOAD of the core, the cycle
> can be completed twice. (Again, on a freshly created techproducts core.)
> Only the third attempt to create a list results in an error. Synonyms can
> still be created and deleted repeatedly after this.
> 
> curl -X PUT -H 'Content-type:application/json' --data-binary
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> curl -X DELETE
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> curl -X PUT -H 'Content-type:application/json' --data-binary
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}'
> http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
> curl -X DELETE
> http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
> curl http://localhost:8983/solr/admin/cores?action=RELOAD\&core=techproducts
> curl -X PUT -H 'Content-type:application/json' --data-binary
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> curl -X DELETE
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> curl -X PUT -H 'Content-type:application/json' --data-binary
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}'

Re-creating deleted Managed Stopwords lists results in error

2020-02-17 Thread Thomas Corthals
Hi

I've run into an issue with creating a Managed Stopwords list that has the
same name as a previously deleted list. Going through the same flow with
Managed Synonyms doesn't result in this unexpected behaviour. Am I missing
something or did I discover a bug in Solr?

On a newly started solr with the techproducts core:

curl -X PUT -H 'Content-type:application/json' --data-binary
'{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
curl -X DELETE
http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
curl http://localhost:8983/solr/admin/cores?action=RELOAD\&core=techproducts
curl -X PUT -H 'Content-type:application/json' --data-binary
'{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist

The second PUT request results in a status 500 with error
msg "java.util.LinkedHashMap cannot be cast to java.util.List".

Similar requests for synonyms work fine, no matter how many times I repeat
the CREATE/DELETE/RELOAD cycle:

curl -X PUT -H 'Content-type:application/json' --data-binary
'{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}'
http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
curl -X DELETE
http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
curl http://localhost:8983/solr/admin/cores?action=RELOAD\&core=techproducts
curl -X PUT -H 'Content-type:application/json' --data-binary
'{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}'
http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap

Reloading after creating the Stopwords list but not after deleting it works
without error too on a fresh techproducts core (you'll have to remove the
directory from disk and create the core again after running the previous
commands).

curl -X PUT -H 'Content-type:application/json' --data-binary
'{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
curl http://localhost:8983/solr/admin/cores?action=RELOAD\&core=techproducts
curl -X DELETE
http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
curl -X PUT -H 'Content-type:application/json' --data-binary
'{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist

And even curiouser, when doing a CREATE/DELETE for Stopwords, then a
CREATE/DELETE for Synonyms, and only then a RELOAD of the core, the cycle
can be completed twice. (Again, on a freshly created techproducts core.)
Only the third attempt to create a list results in an error. Synonyms can
still be created and deleted repeatedly after this.

curl -X PUT -H 'Content-type:application/json' --data-binary
'{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
curl -X DELETE
http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
curl -X PUT -H 'Content-type:application/json' --data-binary
'{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}'
http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
curl -X DELETE
http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
curl http://localhost:8983/solr/admin/cores?action=RELOAD\&core=techproducts
curl -X PUT -H 'Content-type:application/json' --data-binary
'{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
curl -X DELETE
http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
curl -X PUT -H 'Content-type:application/json' --data-binary
'{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}'
http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
curl -X DELETE
http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
curl http://localhost:8983/solr/admin/cores?action=RELOAD\&core=techproducts
curl -X PUT -H 'Content-type:application/json' --data-binary
'{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist

The same successes/errors occur when running each cycle against a different
core if the cores share the same configset.

Any ideas on what might be going wrong?
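
The reproduction above can be expressed compactly as the sequence of HTTP calls involved. The sketch below only builds the requests (no network), using the endpoint paths and the "testlist" name from the curl commands in this message; the helper names are invented for illustration:

```python
import json
from urllib.parse import urlencode

SOLR = "http://localhost:8983/solr"
CORE = "techproducts"

def stopwords_create(name):
    # PUT with a ManagedWordSetResource class creates a managed stopwords list
    body = {"class": "org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}
    return ("PUT", f"{SOLR}/{CORE}/schema/analysis/stopwords/{name}", json.dumps(body))

def stopwords_delete(name):
    return ("DELETE", f"{SOLR}/{CORE}/schema/analysis/stopwords/{name}", None)

def core_reload():
    # the CoreAdmin RELOAD action takes the core name as the 'core' parameter
    qs = urlencode({"action": "RELOAD", "core": CORE})
    return ("GET", f"{SOLR}/admin/cores?{qs}", None)

cycle = [
    stopwords_create("testlist"),
    stopwords_delete("testlist"),
    core_reload(),
    stopwords_create("testlist"),  # per the report, this second PUT returns 500
]
```

Laying the cycle out this way makes it easy to permute the steps (reload before vs. after delete) exactly as the message does when narrowing down the bug.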


Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

2019-11-20 Thread Guilherme Viteri
Hi,

Alright, after trying and trying, I have managed to isolate the fields that are 
causing the search to fail.
Now, all the fields that are "" are
breaking up my search.

I changed the id-StrField to 







And finally it now works; however, I am just scared this is incorrect or bad
practice, as I am dealing with IDs and they shouldn't really be parsed at all.

What is your opinion ?

Thanks
Guilherme

> On 18 Nov 2019, at 15:42, Guilherme Viteri  wrote:
> 
> Hi,
> 
>> Have you tried reindexing the documents and compare the results? No issues
>> if you cannot do that - let's try something else. I was going through the
>> whole mail and your files. You had said:
> Yes, but since it hasn't worked as suggested, I kept as you suggested.
> 
>> As soon as I add dbId or stId (regardless the boost, 1.0 or 100.0), then I
>>> don't get anything (which make sense).
>> 
>> Why did you think that not getting anything when you add dbId made sense?
>> Asking because I may be missing something here.
> I am searching for a text and I was searching on an ID field, which wouldn't 
> make sense.
> (I will come back to this soon.)
> 
> Ok, I've been adding and removing fields in the qf and I could isolate half 
> of the problem. First, I have one type of field called keyword_field and I 
> added the StopWords filter for this field and It worked. Second,
> when I add the fields that are id ( />
> 
> Do you think I should also the stopwords filter for the fieldtype id ?
> (I tried, and it worked, but I am not sure if this is conceptually correct, 
> id, should remain intact from my understand)
> 
> Thanks
> Guilherme
> 
>> On 18 Nov 2019, at 05:37, Paras Lehana  wrote:
>> 
>> Hi Guilherme,
>> 
>> Have you tried reindexing the documents and compare the results? No issues
>> if you cannot do that - let's try something else. I was going through the
>> whole mail and your files. You had said:
>> 
>> As soon as I add dbId or stId (regardless the boost, 1.0 or 100.0), then I
>>> don't get anything (which make sense).
>> 
>> 
>> Why did you think that not getting anything when you add dbId made sense?
>> Asking because I may be missing something here.
>> 
>> Also, what is the purpose of so many qf's? Going through your documents and
>> config files, I found that your dbId's are string of numbers and I don't
>> think you want to find your query terms in dbId, right?
>> Do you want to boost the score by the values in dbId?
>> 
>> Your qf of dbId^100 boosts documents containing terms in q by 100x. Since
>> your terms don't match with the values in dbId for any document, the score
>> produced by this scoring is 0. 100x or 1x of 0 is still 0.
>> I still need to see how this scoring gets added up in edismax parser but do
>> reevaluate the usage of these qfs. Same goes for other qf boosts. :)
>> 
>> 
>> On Fri, 15 Nov 2019 at 12:23, Guilherme Viteri  wrote:
>> 
>>> Hi Paras
>>> No worries.
>>> No I didn’t find anything. This is annoying now...
>>> Yes! They do contain dbId. Absolutely all my docs contains dbId and it is
>>> actually my key, if you check again the schema.xml
>>> 
>>> Cheers
>>> Guilherme
>>> 
>>> On 15 Nov 2019, at 05:37, Paras Lehana  wrote:
>>> 
>>> 
>>> Hey Guilherme,
>>> 
>>> I was a bit busy for the past few days and couldn't read your mail. So,
>>> did you find anything? Anyways, as I had expected, the culprit is
>>> definitely among the qfs. Do the documents in concern contain dbId? I
>>> suggest you to cross check the fields in your document with those impacting
>>> the result in qf.
>>> 
>>> On Tue, 12 Nov 2019 at 16:14, Guilherme Viteri  wrote:
>>> 
>>>> What I can't understand is:
>>>> I search for the exact term - "Immunoregulatory interactions between a
>>>> Lymphoid *and a* non-Lymphoid cell" and If i search "I search for the
>>>> exact term - Immunoregulatory interactions between a Lymphoid *and 
>>>> *non-Lymphoid
>>>> cell" then it works
>>>> 
>>>> On 11 Nov 2019, at 12:24, Guilherme Viteri  wrote:
>>>> 
>>>> Thanks
>>>> 
>>>> Removing stopwords is another story.
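
The puzzle in this thread — "Lymphoid and non-Lymphoid" matches while "Lymphoid and a non-Lymphoid" does not — comes down to token positions. A toy model (not Lucene's actual analysis chain or FlattenGraphFilter, just the position arithmetic) shows how it hinges on whether the stopword gaps are preserved on both the index and query side:

```python
STOPWORDS = {"a", "an", "and", "the", "is", "between"}

def analyze(text, keep_gaps=True):
    """Drop stopwords; keep_gaps=True preserves original token positions
    (position increments), keep_gaps=False collapses them."""
    out = []
    for pos, tok in enumerate(text.lower().split()):
        if tok not in STOPWORDS:
            out.append((tok, pos if keep_gaps else len(out)))
    return out

def phrase_match(indexed, query):
    """Exact phrase match: every query token must sit at the same
    relative position in the indexed token stream."""
    starts = [p for t, p in indexed if query and t == query[0][0]]
    for s in starts:
        offset = s - query[0][1]
        if all((t, p + offset) in indexed for t, p in query):
            return True
    return False

doc = "Immunoregulatory interactions between a Lymphoid and a non-Lymphoid cell"
indexed = analyze(doc, keep_gaps=True)

# Query positions keep the "and a" gap -> they line up with the index -> match.
hit = phrase_match(indexed, analyze("Lymphoid and a non-Lymphoid cell", keep_gaps=True))
# Query positions collapsed -> a two-stopword gap breaks the alignment -> no match.
miss = phrase_match(indexed, analyze("Lymphoid and a non-Lymphoid cell", keep_gaps=False))
```

In the toy model the phrase fails exactly when index and query analysis disagree about position increments, which is the class of mismatch the linked SOLR-6468 discussion describes.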

Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

2019-11-18 Thread Guilherme Viteri
Hi,

> Have you tried reindexing the documents and compare the results? No issues
> if you cannot do that - let's try something else. I was going through the
> whole mail and your files. You had said:
Yes, but since it hasn't worked, I kept it as you suggested.

> As soon as I add dbId or stId (regardless the boost, 1.0 or 100.0), then I
>> don't get anything (which make sense).
> 
> Why did you think that not getting anything when you add dbId made sense?
> Asking because I may be missing something here.
I am searching for a text and I was searching on an ID field, which wouldn't 
make sense.
(I will come back to this soon.)

Ok, I've been adding and removing fields in the qf and I could isolate half of 
the problem. First, I have one type of field called keyword_field and I added 
the StopWords filter for this field and it worked. Second,
when I add the fields that are id (

Do you think I should also apply the stopwords filter for the fieldtype id?
(I tried, and it worked, but I am not sure if this is conceptually correct; an
id should remain intact, from my understanding.)

Thanks
Guilherme

> On 18 Nov 2019, at 05:37, Paras Lehana  wrote:
> 
> Hi Guilherme,
> 
> Have you tried reindexing the documents and compare the results? No issues
> if you cannot do that - let's try something else. I was going through the
> whole mail and your files. You had said:
> 
> As soon as I add dbId or stId (regardless the boost, 1.0 or 100.0), then I
>> don't get anything (which make sense).
> 
> 
> Why did you think that not getting anything when you add dbId made sense?
> Asking because I may be missing something here.
> 
> Also, what is the purpose of so many qf's? Going through your documents and
> config files, I found that your dbId's are string of numbers and I don't
> think you want to find your query terms in dbId, right?
> Do you want to boost the score by the values in dbId?
> 
> Your qf of dbId^100 boosts documents containing terms in q by 100x. Since
> your terms don't match with the values in dbId for any document, the score
> produced by this scoring is 0. 100x or 1x of 0 is still 0.
> I still need to see how this scoring gets added up in edismax parser but do
> reevaluate the usage of these qfs. Same goes for other qf boosts. :)
> 
> 
> On Fri, 15 Nov 2019 at 12:23, Guilherme Viteri  wrote:
> 
>> Hi Paras
>> No worries.
>> No I didn’t find anything. This is annoying now...
>> Yes! They do contain dbId. Absolutely all my docs contains dbId and it is
>> actually my key, if you check again the schema.xml
>> 
>> Cheers
>> Guilherme
>> 
>> On 15 Nov 2019, at 05:37, Paras Lehana  wrote:
>> 
>> 
>> Hey Guilherme,
>> 
>> I was a bit busy for the past few days and couldn't read your mail. So,
>> did you find anything? Anyways, as I had expected, the culprit is
>> definitely among the qfs. Do the documents in concern contain dbId? I
>> suggest you to cross check the fields in your document with those impacting
>> the result in qf.
>> 
>> On Tue, 12 Nov 2019 at 16:14, Guilherme Viteri  wrote:
>> 
>>> What I can't understand is:
>>> I search for the exact term - "Immunoregulatory interactions between a
>>> Lymphoid *and a* non-Lymphoid cell" and If i search "I search for the
>>> exact term - Immunoregulatory interactions between a Lymphoid *and 
>>> *non-Lymphoid
>>> cell" then it works
>>> 
>>> On 11 Nov 2019, at 12:24, Guilherme Viteri  wrote:
>>> 
>>> Thanks
>>> 
>>> Removing stopwords is another story. I'm curious to find the reason
>>> assuming that you keep on using stopwords. In some cases, stopwords are
>>> really necessary.
>>> 
>>> Yes. It always make sense the way we've been using.
>>> 
>>> If q.alt is giving you responses, it's confirmed that your stopwords
>>> filter
>>> is working as expected. The problem definitely lies in the configuration
>>> of
>>> edismax.
>>> 
>>> I see.
>>> 
>>> *Let me explain again:* In your solrconfig.xml, look at your /search
>>> 
>>> Ok, using q now, removed all qf, performed the search and I got 23
>>> results, and the one I really want, on the top.
>>> As soon as I add dbId or stId (regardless the boost, 1.0 or 100.0), then
>>> I don't get anything (which make sense). However if I query name_exact, I
>>> get the 23 results again, and unfortunately if I query stId^1.0
>>> name_exact^10.0 I still don't get any results.
>>> 
>>> In summary
>>> - without qf - 23 results
>>> - dbId - 0 results

Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

2019-11-17 Thread Paras Lehana
Hi Guilherme,

Have you tried reindexing the documents and compare the results? No issues
if you cannot do that - let's try something else. I was going through the
whole mail and your files. You had said:

As soon as I add dbId or stId (regardless the boost, 1.0 or 100.0), then I
> don't get anything (which make sense).


Why did you think that not getting anything when you add dbId made sense?
Asking because I may be missing something here.

Also, what is the purpose of so many qf's? Going through your documents and
config files, I found that your dbIds are strings of numbers, and I don't
think you want to find your query terms in dbId, right?
Do you want to boost the score by the values in dbId?

Your qf of dbId^100 boosts documents containing terms in q by 100x. Since
your terms don't match with the values in dbId for any document, the score
produced by this scoring is 0. 100x or 1x of 0 is still 0.
I still need to see how this scoring gets added up in edismax parser but do
reevaluate the usage of these qfs. Same goes for other qf boosts. :)
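[Editor's note: the scoring point above can be illustrated with a minimal sketch. This is an assumption-level toy, not Solr's actual Lucene scoring — it uses plain term counts as the per-field score and DisMax's max-over-fields rule, ignoring tie-breakers and mm — but it shows why boosting a non-matching field contributes nothing: 100x of 0 is still 0.]

```python
def dismax_score(doc, query_terms, qf):
    """qf maps field name -> boost; the per-field score here is just the
    count of query terms found in the field text (a toy TF stand-in)."""
    scores = []
    for field, boost in qf.items():
        tokens = doc.get(field, "").lower().split()
        tf = sum(tokens.count(t.lower()) for t in query_terms)
        scores.append(boost * tf)
    return max(scores) if scores else 0.0

doc = {"dbId": "12345", "name": "Immunoregulatory interactions"}
terms = ["immunoregulatory", "interactions"]
print(dismax_score(doc, terms, {"dbId": 100.0}))               # dbId never matches text: 0.0
print(dismax_score(doc, terms, {"dbId": 100.0, "name": 1.0}))  # name still matches: 2.0
```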


On Fri, 15 Nov 2019 at 12:23, Guilherme Viteri  wrote:

> Hi Paras
> No worries.
> No I didn’t find anything. This is annoying now...
> Yes! They do contain dbId. Absolutely all my docs contains dbId and it is
> actually my key, if you check again the schema.xml
>
> Cheers
> Guilherme
>
> On 15 Nov 2019, at 05:37, Paras Lehana  wrote:
>
> 
> Hey Guilherme,
>
> I was a bit busy for the past few days and couldn't read your mail. So,
> did you find anything? Anyways, as I had expected, the culprit is
> definitely among the qfs. Do the documents in concern contain dbId? I
> suggest you to cross check the fields in your document with those impacting
> the result in qf.
>
> On Tue, 12 Nov 2019 at 16:14, Guilherme Viteri  wrote:
>
>> What I can't understand is:
>> I search for the exact term - "Immunoregulatory interactions between a
>> Lymphoid *and a* non-Lymphoid cell" and If i search "I search for the
>> exact term - Immunoregulatory interactions between a Lymphoid *and 
>> *non-Lymphoid
>> cell" then it works
>>
>> On 11 Nov 2019, at 12:24, Guilherme Viteri  wrote:
>>
>> Thanks
>>
>> Removing stopwords is another story. I'm curious to find the reason
>> assuming that you keep on using stopwords. In some cases, stopwords are
>> really necessary.
>>
>> Yes. It always make sense the way we've been using.
>>
>> If q.alt is giving you responses, it's confirmed that your stopwords
>> filter
>> is working as expected. The problem definitely lies in the configuration
>> of
>> edismax.
>>
>> I see.
>>
>> *Let me explain again:* In your solrconfig.xml, look at your /search
>>
>> Ok, using q now, removed all qf, performed the search and I got 23
>> results, and the one I really want, on the top.
>> As soon as I add dbId or stId (regardless the boost, 1.0 or 100.0), then
>> I don't get anything (which make sense). However if I query name_exact, I
>> get the 23 results again, and unfortunately if I query stId^1.0
>> name_exact^10.0 I still don't get any results.
>>
>> In summary
>> - without qf - 23 results
>> - dbId - 0 results
>> - name_exact - 16 results
>> - name - 23 results
>> - dbId^1.0
>>  name_exact^10.0 - 0 results
>> - 0 results if any other, stId, dbId (key) is added on top of the
>> name(name_exact, etc).
>>
>> Definitely lost here! :-/
>>
>>
>> On 11 Nov 2019, at 07:59, Paras Lehana 
>> wrote:
>>
>> Hi
>>
>> So I don't think removing it completely is the way to go from the scenario
>>
>> we have
>>
>>
>>
>> Removing stopwords is another story. I'm curious to find the reason
>> assuming that you keep on using stopwords. In some cases, stopwords are
>> really necessary.
>>
>>
>> Quite a considerable increase
>>
>>
>> If q.alt is giving you responses, it's confirmed that your stopwords
>> filter
>> is working as expected. The problem definitely lies in the configuration
>> of
>> edismax.
>>
>>
>>
>> I am sorry but I didn't understand what do you want me to do exactly with
>> the lst (??) and qf and bf.
>>
>>
>>
>> What combinations did you try? I was referring to the field-level boosting
>> you have applied in edismax config.
>>
>> *Let me explain again:* In your solrconfig.xml, look at your /search
>> request handler. There are many qf and some bq boosts. I want you to
>> remove
>> all of these, check response again (with q now) and keep on adding them
>> again (one by one) while looking for when the numFound drastically changes.

Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

2019-11-14 Thread Guilherme Viteri
Hi Paras
No worries.
No I didn’t find anything. This is annoying now...
Yes! They do contain dbId. Absolutely all my docs contain dbId and it is 
actually my key, if you check the schema.xml again.

Cheers
Guilherme 

> On 15 Nov 2019, at 05:37, Paras Lehana  wrote:
> 
> 
> Hey Guilherme,
> 
> I was a bit busy for the past few days and couldn't read your mail. So, did 
> you find anything? Anyways, as I had expected, the culprit is definitely 
> among the qfs. Do the documents in concern contain dbId? I suggest you to 
> cross check the fields in your document with those impacting the result in 
> qf. 
> 
>> On Tue, 12 Nov 2019 at 16:14, Guilherme Viteri  wrote:
>> What I can't understand is:
>> I search for the exact term - "Immunoregulatory interactions between a 
>> Lymphoid and a non-Lymphoid cell" and If i search "I search for the exact 
>> term - Immunoregulatory interactions between a Lymphoid and non-Lymphoid 
>> cell" then it works 
>> 
>>> On 11 Nov 2019, at 12:24, Guilherme Viteri  wrote:
>>> 
>>> Thanks
>>>> Removing stopwords is another story. I'm curious to find the reason
>>>> assuming that you keep on using stopwords. In some cases, stopwords are
>>>> really necessary.
>>> Yes. It always make sense the way we've been using.
>>> 
>>>> If q.alt is giving you responses, it's confirmed that your stopwords filter
>>>> is working as expected. The problem definitely lies in the configuration of
>>>> edismax.
>>> I see.
>>> 
>>>> *Let me explain again:* In your solrconfig.xml, look at your /search
>>> Ok, using q now, removed all qf, performed the search and I got 23 results, 
>>> and the one I really want, on the top.
>>> As soon as I add dbId or stId (regardless the boost, 1.0 or 100.0), then I 
>>> don't get anything (which make sense). However if I query name_exact, I get 
>>> the 23 results again, and unfortunately if I query stId^1.0 name_exact^10.0 
>>> I still don't get any results.
>>> 
>>> In summary
>>> - without qf - 23 results
>>> - dbId - 0 results
>>> - name_exact - 16 results
>>> - name - 23 results
>>> - dbId^1.0
>>>  name_exact^10.0 - 0 results
>>> - 0 results if any other, stId, dbId (key) is added on top of the 
>>> name(name_exact, etc).
>>> 
>>> Definitely lost here! :-/
>>> 
>>> 
>>>> On 11 Nov 2019, at 07:59, Paras Lehana  wrote:
>>>> 
>>>> Hi
>>>> 
>>>> So I don't think removing it completely is the way to go from the scenario
>>>>> we have
>>>> 
>>>> 
>>>> Removing stopwords is another story. I'm curious to find the reason
>>>> assuming that you keep on using stopwords. In some cases, stopwords are
>>>> really necessary.
>>>> 
>>>> 
>>>> Quite a considerable increase
>>>> 
>>>> 
>>>> If q.alt is giving you responses, it's confirmed that your stopwords filter
>>>> is working as expected. The problem definitely lies in the configuration of
>>>> edismax.
>>>> 
>>>> 
>>>> 
>>>>> I am sorry but I didn't understand what do you want me to do exactly with
>>>>> the lst (??) and qf and bf.
>>>> 
>>>> 
>>>> What combinations did you try? I was referring to the field-level boosting
>>>> you have applied in edismax config.
>>>> 
>>>> *Let me explain again:* In your solrconfig.xml, look at your /search
>>>> request handler. There are many qf and some bq boosts. I want you to remove
>>>> all of these, check response again (with q now) and keep on adding them
>>>> again (one by one) while looking for when the numFound drastically changes.
>>>> 
>>>> On Fri, 8 Nov 2019 at 23:47, David Hastings 
>>>> wrote:
>>>> 
>>>>> I use 3 word shingles with stopwords for my MLT ML trainer that worked
>>>>> pretty well for such a solution, but for a full index the size became
>>>>> prohibitive
>>>>> 
>>>>> On Fri, Nov 8, 2019 at 12:13 PM Walter Underwood 
>>>>> wrote:
>>>>> 
>>>>>> If we had IDF for phrases, they would be super effective. The 2X weight
>>>>> is
>>>>>> a hack that mostly works.
>>>>>> 
>>>>>> Infoseek had phrase IDF and it was a killer algorithm for relevance.

Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

2019-11-14 Thread Paras Lehana
Hey Guilherme,

I was a bit busy for the past few days and couldn't read your mail. So, did
you find anything? Anyways, as I had expected, the culprit is definitely
among the qfs. Do the documents in question contain dbId? I suggest you
cross-check the fields in your document with those impacting the result in
qf.

On Tue, 12 Nov 2019 at 16:14, Guilherme Viteri  wrote:

> What I can't understand is:
> I search for the exact term - "Immunoregulatory interactions between a
> Lymphoid *and a* non-Lymphoid cell" and If i search "I search for the
> exact term - Immunoregulatory interactions between a Lymphoid *and 
> *non-Lymphoid
> cell" then it works
>
> On 11 Nov 2019, at 12:24, Guilherme Viteri  wrote:
>
> Thanks
>
> Removing stopwords is another story. I'm curious to find the reason
> assuming that you keep on using stopwords. In some cases, stopwords are
> really necessary.
>
> Yes. It always makes sense the way we've been using it.
>
> If q.alt is giving you responses, it's confirmed that your stopwords filter
> is working as expected. The problem definitely lies in the configuration of
> edismax.
>
> I see.
>
> *Let me explain again:* In your solrconfig.xml, look at your /search
>
> Ok, using q now, removed all qf, performed the search and I got 23
> results, and the one I really want, on the top.
> As soon as I add dbId or stId (regardless the boost, 1.0 or 100.0), then I
> don't get anything (which make sense). However if I query name_exact, I get
> the 23 results again, and unfortunately if I query stId^1.0 name_exact^10.0
> I still don't get any results.
>
> In summary
> - without qf - 23 results
> - dbId - 0 results
> - name_exact - 16 results
> - name - 23 results
> - dbId^1.0
>  name_exact^10.0 - 0 results
> - 0 results if any other, stId, dbId (key) is added on top of the
> name(name_exact, etc).
>
> Definitely lost here! :-/
>
>
> On 11 Nov 2019, at 07:59, Paras Lehana  wrote:
>
> Hi
>
> So I don't think removing it completely is the way to go from the scenario
>
> we have
>
>
>
> Removing stopwords is another story. I'm curious to find the reason
> assuming that you keep on using stopwords. In some cases, stopwords are
> really necessary.
>
>
> Quite a considerable increase
>
>
> If q.alt is giving you responses, it's confirmed that your stopwords filter
> is working as expected. The problem definitely lies in the configuration of
> edismax.
>
>
>
> I am sorry but I didn't understand what do you want me to do exactly with
> the lst (??) and qf and bf.
>
>
>
> What combinations did you try? I was referring to the field-level boosting
> you have applied in edismax config.
>
> *Let me explain again:* In your solrconfig.xml, look at your /search
> request handler. There are many qf and some bq boosts. I want you to remove
> all of these, check response again (with q now) and keep on adding them
> again (one by one) while looking for when the numFound drastically changes.
>
> On Fri, 8 Nov 2019 at 23:47, David Hastings 
> wrote:
>
> I use 3 word shingles with stopwords for my MLT ML trainer that worked
> pretty well for such a solution, but for a full index the size became
> prohibitive
>
> On Fri, Nov 8, 2019 at 12:13 PM Walter Underwood 
> wrote:
>
> If we had IDF for phrases, they would be super effective. The 2X weight
>
> is
>
> a hack that mostly works.
>
> Infoseek had phrase IDF and it was a killer algorithm for relevance.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> On Nov 8, 2019, at 11:08 AM, David Hastings <
>
> hastings.recurs...@gmail.com> wrote:
>
>
> the pf and qf fields are REALLY nice for this
>
> On Fri, Nov 8, 2019 at 12:02 PM Walter Underwood <
>
> wun...@wunderwood.org>
>
> wrote:
>
> I always enable phrase searching in edismax for exactly this reason.
>
> Something like:
>
> title^16 keywords^8 text^2
>
> To deal with concepts in queries, a classifier and/or named entity
> extractor can be helpful. If you have a list of concepts (“controlled
> vocabulary”) that includes “Lamin A”, and that shows up in a query,
>
> that
>
> term can be queried against the field matching that vocabulary.
>
> This is how LinkedIn separates people, companies, and places, for
>
> example.
>
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> On Nov 8, 2019, at 10:48 AM, Erick Erickson 
>
> wrote:
>
>
> Look at the “mm” parameter, try setting it to 100%. Although that's
>
> not
>
> entirely likely to do

Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

2019-11-12 Thread Guilherme Viteri
What I can't understand is:
If I search for the exact term "Immunoregulatory interactions between a Lymphoid 
and a non-Lymphoid cell" I get no hits, but if I search "Immunoregulatory 
interactions between a Lymphoid and non-Lymphoid cell" (dropping the second "a") 
then it works.

> On 11 Nov 2019, at 12:24, Guilherme Viteri  wrote:
> 
> Thanks
>> Removing stopwords is another story. I'm curious to find the reason
>> assuming that you keep on using stopwords. In some cases, stopwords are
>> really necessary.
> Yes. It always makes sense the way we've been using it.
> 
>> If q.alt is giving you responses, it's confirmed that your stopwords filter
>> is working as expected. The problem definitely lies in the configuration of
>> edismax.
> I see.
> 
>> *Let me explain again:* In your solrconfig.xml, look at your /search
> Ok, using q now, removed all qf, performed the search and I got 23 results, 
> and the one I really want, on the top.
> As soon as I add dbId or stId (regardless the boost, 1.0 or 100.0), then I 
> don't get anything (which make sense). However if I query name_exact, I get 
> the 23 results again, and unfortunately if I query stId^1.0 name_exact^10.0 I 
> still don't get any results.
> 
> In summary
> - without qf - 23 results
> - dbId - 0 results
> - name_exact - 16 results
> - name - 23 results
> - dbId^1.0
>  name_exact^10.0 - 0 results
> - 0 results if any other, stId, dbId (key) is added on top of the 
> name(name_exact, etc).
> 
> Definitely lost here! :-/
> 
> 
>> On 11 Nov 2019, at 07:59, Paras Lehana  wrote:
>> 
>> Hi
>> 
>> So I don't think removing it completely is the way to go from the scenario
>>> we have
>> 
>> 
>> Removing stopwords is another story. I'm curious to find the reason
>> assuming that you keep on using stopwords. In some cases, stopwords are
>> really necessary.
>> 
>> 
>> Quite a considerable increase
>> 
>> 
>> If q.alt is giving you responses, it's confirmed that your stopwords filter
>> is working as expected. The problem definitely lies in the configuration of
>> edismax.
>> 
>> 
>> 
>>> I am sorry but I didn't understand what do you want me to do exactly with
>>> the lst (??) and qf and bf.
>> 
>> 
>> What combinations did you try? I was referring to the field-level boosting
>> you have applied in edismax config.
>> 
>> *Let me explain again:* In your solrconfig.xml, look at your /search
>> request handler. There are many qf and some bq boosts. I want you to remove
>> all of these, check response again (with q now) and keep on adding them
>> again (one by one) while looking for when the numFound drastically changes.
>> 
>> On Fri, 8 Nov 2019 at 23:47, David Hastings 
>> wrote:
>> 
>>> I use 3 word shingles with stopwords for my MLT ML trainer that worked
>>> pretty well for such a solution, but for a full index the size became
>>> prohibitive
>>> 
>>> On Fri, Nov 8, 2019 at 12:13 PM Walter Underwood 
>>> wrote:
>>> 
>>>> If we had IDF for phrases, they would be super effective. The 2X weight
>>> is
>>>> a hack that mostly works.
>>>> 
>>>> Infoseek had phrase IDF and it was a killer algorithm for relevance.
>>>> 
>>>> wunder
>>>> Walter Underwood
>>>> wun...@wunderwood.org
>>>> http://observer.wunderwood.org/  (my blog)
>>>> 
>>>>> On Nov 8, 2019, at 11:08 AM, David Hastings <
>>>> hastings.recurs...@gmail.com> wrote:
>>>>> 
>>>>> the pf and qf fields are REALLY nice for this
>>>>> 
>>>>> On Fri, Nov 8, 2019 at 12:02 PM Walter Underwood <
>>> wun...@wunderwood.org>
>>>>> wrote:
>>>>> 
>>>>>> I always enable phrase searching in edismax for exactly this reason.
>>>>>> 
>>>>>> Something like:
>>>>>> 
>>>>>> title^16 keywords^8 text^2
>>>>>> 
>>>>>> To deal with concepts in queries, a classifier and/or named entity
>>>>>> extractor can be helpful. If you have a list of concepts (“controlled
>>>>>> vocabulary”) that includes “Lamin A”, and that shows up in a query,
>>> that
>>>>>> term can be queried against the field matching that vocabulary.
>>>>>> 
>>>>>> This is how LinkedIn separates people, companies, and places, for
>>>> example.

Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

2019-11-11 Thread Guilherme Viteri
Thanks
> Removing stopwords is another story. I'm curious to find the reason
> assuming that you keep on using stopwords. In some cases, stopwords are
> really necessary.
Yes. It always makes sense the way we've been using it.

> If q.alt is giving you responses, it's confirmed that your stopwords filter
> is working as expected. The problem definitely lies in the configuration of
> edismax.
I see.

> *Let me explain again:* In your solrconfig.xml, look at your /search
Ok, using q now: I removed all qf, performed the search, and got 23 results, with 
the one I really want at the top.
As soon as I add dbId or stId (regardless of the boost, 1.0 or 100.0), I 
don't get anything (which makes sense). However, if I query name_exact, I get the 
23 results again, and unfortunately if I query stId^1.0 name_exact^10.0 I still 
don't get any results.

In summary
- without qf - 23 results
- dbId - 0 results
- name_exact - 16 results
- name - 23 results
- dbId^1.0
  name_exact^10.0 - 0 results
- 0 results if any other, stId, dbId (key) is added on top of the 
name(name_exact, etc).

Definitely lost here! :-/


> On 11 Nov 2019, at 07:59, Paras Lehana  wrote:
> 
> Hi
> 
> So I don't think removing it completely is the way to go from the scenario
>> we have
> 
> 
> Removing stopwords is another story. I'm curious to find the reason
> assuming that you keep on using stopwords. In some cases, stopwords are
> really necessary.
> 
> 
> Quite a considerable increase
> 
> 
> If q.alt is giving you responses, it's confirmed that your stopwords filter
> is working as expected. The problem definitely lies in the configuration of
> edismax.
> 
> 
> 
>> I am sorry but I didn't understand what do you want me to do exactly with
>> the lst (??) and qf and bf.
> 
> 
> What combinations did you try? I was referring to the field-level boosting
> you have applied in edismax config.
> 
> *Let me explain again:* In your solrconfig.xml, look at your /search
> request handler. There are many qf and some bq boosts. I want you to remove
> all of these, check response again (with q now) and keep on adding them
> again (one by one) while looking for when the numFound drastically changes.
> 
> On Fri, 8 Nov 2019 at 23:47, David Hastings 
> wrote:
> 
>> I use 3 word shingles with stopwords for my MLT ML trainer that worked
>> pretty well for such a solution, but for a full index the size became
>> prohibitive
>> 
>> On Fri, Nov 8, 2019 at 12:13 PM Walter Underwood 
>> wrote:
>> 
>>> If we had IDF for phrases, they would be super effective. The 2X weight
>> is
>>> a hack that mostly works.
>>> 
>>> Infoseek had phrase IDF and it was a killer algorithm for relevance.
>>> 
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>> 
>>>> On Nov 8, 2019, at 11:08 AM, David Hastings <
>>> hastings.recurs...@gmail.com> wrote:
>>>> 
>>>> the pf and qf fields are REALLY nice for this
>>>> 
>>>> On Fri, Nov 8, 2019 at 12:02 PM Walter Underwood <
>> wun...@wunderwood.org>
>>>> wrote:
>>>> 
>>>>> I always enable phrase searching in edismax for exactly this reason.
>>>>> 
>>>>> Something like:
>>>>> 
>>>>>  title^16 keywords^8 text^2
>>>>> 
>>>>> To deal with concepts in queries, a classifier and/or named entity
>>>>> extractor can be helpful. If you have a list of concepts (“controlled
>>>>> vocabulary”) that includes “Lamin A”, and that shows up in a query,
>> that
>>>>> term can be queried against the field matching that vocabulary.
>>>>> 
>>>>> This is how LinkedIn separates people, companies, and places, for
>>> example.
>>>>> 
>>>>> wunder
>>>>> Walter Underwood
>>>>> wun...@wunderwood.org
>>>>> http://observer.wunderwood.org/  (my blog)
>>>>> 
>>>>>> On Nov 8, 2019, at 10:48 AM, Erick Erickson 
>>>>> wrote:
>>>>>> 
>>>>>> Look at the “mm” parameter, try setting it to 100%. Although that's
>> not
>>>>> entirely likely to do what you want either since virtually every doc
>>> will
>>>>> have “a” in it. But at least you’d get docs that have both terms.
>>>>>> 
>>>>>> you may also be able to search for things like “Lamin A” _only as a
>>>>> phrase_ and have some luck. But

Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

2019-11-10 Thread Paras Lehana
Hi

So I don't think removing it completely is the way to go from the scenario
> we have


Removing stopwords is another story. I'm curious to find the reason
assuming that you keep on using stopwords. In some cases, stopwords are
really necessary.


Quite a considerable increase


If q.alt is giving you responses, it's confirmed that your stopwords filter
is working as expected. The problem definitely lies in the configuration of
edismax.



> I am sorry but I didn't understand what do you want me to do exactly with
> the lst (??) and qf and bf.


What combinations did you try? I was referring to the field-level boosting
you have applied in edismax config.

*Let me explain again:* In your solrconfig.xml, look at your /search
request handler. There are many qf and some bq boosts. I want you to remove
all of these, check response again (with q now) and keep on adding them
again (one by one) while looking for when the numFound drastically changes.
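[Editor's note: that add-one-qf-at-a-time procedure can be sketched as a small loop. The field names and boosts below are taken from this thread; the actual Solr request is only indicated in a comment, since the host, core, and handler depend on your setup.]

```python
fields = ["name^1.0", "name_exact^10.0", "stId^1.0", "dbId^100.0"]

def qf_steps(fields):
    """Yield the cumulative qf string for each debugging step."""
    active = []
    for f in fields:
        active.append(f)
        yield " ".join(active)

for qf in qf_steps(fields):
    # e.g. requests.get(solr_url, params={"defType": "edismax",
    #                                     "q": query, "qf": qf})
    # then record numFound for this qf combination
    print(qf)
```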

On Fri, 8 Nov 2019 at 23:47, David Hastings 
wrote:

> I use 3 word shingles with stopwords for my MLT ML trainer that worked
> pretty well for such a solution, but for a full index the size became
> prohibitive
>
> On Fri, Nov 8, 2019 at 12:13 PM Walter Underwood 
> wrote:
>
> > If we had IDF for phrases, they would be super effective. The 2X weight
> is
> > a hack that mostly works.
> >
> > Infoseek had phrase IDF and it was a killer algorithm for relevance.
> >
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org
> > http://observer.wunderwood.org/  (my blog)
> >
> > > On Nov 8, 2019, at 11:08 AM, David Hastings <
> > hastings.recurs...@gmail.com> wrote:
> > >
> > > the pf and qf fields are REALLY nice for this
> > >
> > > On Fri, Nov 8, 2019 at 12:02 PM Walter Underwood <
> wun...@wunderwood.org>
> > > wrote:
> > >
> > >> I always enable phrase searching in edismax for exactly this reason.
> > >>
> > >> Something like:
> > >>
> > >>   title^16 keywords^8 text^2
> > >>
> > >> To deal with concepts in queries, a classifier and/or named entity
> > >> extractor can be helpful. If you have a list of concepts (“controlled
> > >> vocabulary”) that includes “Lamin A”, and that shows up in a query,
> that
> > >> term can be queried against the field matching that vocabulary.
> > >>
> > >> This is how LinkedIn separates people, companies, and places, for
> > example.
> > >>
> > >> wunder
> > >> Walter Underwood
> > >> wun...@wunderwood.org
> > >> http://observer.wunderwood.org/  (my blog)
> > >>
> > >>> On Nov 8, 2019, at 10:48 AM, Erick Erickson  >
> > >> wrote:
> > >>>
> > >>> Look at the “mm” parameter, try setting it to 100%. Although that's
> not
> > >> entirely likely to do what you want either since virtually every doc
> > will
> > >> have “a” in it. But at least you’d get docs that have both terms.
> > >>>
> > >>> you may also be able to search for things like “Lamin A” _only as a
> > >> phrase_ and have some luck. But this is a gnarly problem in general.
> > Some
> > >> people have been able to substitute synonyms and/or shingles to make
> > this
> > >> work at the expense of a larger index.
> > >>>
> > >>> This is a generic problem with context. “Lamin A” is really a
> > “concept”,
> > >> not just two words that happen to be near each other. Searching as a
> > phrase
> > >> is an OOB-but-naive way to try to make it more likely that the ranked
> > >> results refer to the _concept_ of “Lamin A”. The assumption here is
> “if
> > >> these two words appear next to each other, they’re more likely to be
> > what I
> > >> want”. I say “naive” because “Lamins: A new approach to...” would
> > _also_ be
> > >> found for a naive phrase search. (I have no idea whether such a title
> > makes
> > >> sense or not, but you figured that out already)...
> > >>>
> > >>> To do this well you’d have to dive in to NLP/Machine learning.
> > >>>
> > >>> I truly wish we could have the DWIM search algorithm (Do What I
> Mean)….
> > >>>
> > >>>> On Nov 8, 2019, at 11:29 AM, Guilherme Viteri 
> > >> wrote:
> > >>>>
> > >>>> HI Walter and Paras
> > >>>>
> > >>>> I indexed it removing all the references to StopWordFilter and I went
> > >> from 121 results to near 20K as the search term q="Lymphoid and a
> > >> non-Lymphoid cell" is matching entities such as "IFT A" or "Lamin A".

Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

2019-11-08 Thread David Hastings
I use 3 word shingles with stopwords for my MLT ML trainer that worked
pretty well for such a solution, but for a full index the size became
prohibitive
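[Editor's note: for readers unfamiliar with shingles, this toy sketch shows roughly what 3-word shingling produces (a simplification of Lucene's ShingleFilterFactory output, without its position and separator handling) — every run of three adjacent tokens becomes one indexed term, which is also why the index grows so much.]

```python
def shingles(tokens, n=3):
    """Return every run of n adjacent tokens joined into one term."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(shingles("lymphoid and a non-lymphoid cell".split()))
# → ['lymphoid and a', 'and a non-lymphoid', 'a non-lymphoid cell']
```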

On Fri, Nov 8, 2019 at 12:13 PM Walter Underwood 
wrote:

> If we had IDF for phrases, they would be super effective. The 2X weight is
> a hack that mostly works.
>
> Infoseek had phrase IDF and it was a killer algorithm for relevance.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
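[Editor's note: phrase IDF, as mentioned above, can be illustrated with a toy calculation — this is not anything Solr ships, and the corpus plus the naive substring document-frequency test are purely illustrative. The idea is to treat the whole phrase as a single term and compute idf = log(N / df).]

```python
import math

def phrase_idf(corpus, phrase):
    """idf = log(N / df), with df counted by a naive substring test."""
    df = sum(1 for doc in corpus if phrase in doc)
    return math.log(len(corpus) / df) if df else float("inf")

corpus = [
    "lamin a is a nuclear protein",
    "a new approach to lamins",
    "lamin a mutations cause disease",
]
print(round(phrase_idf(corpus, "lamin a"), 3))  # df=2 of N=3 -> log(1.5)
```

A rare phrase like "lamin a" gets a meaningful weight this way, while a phrase made of two common stopwords would score near zero.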
>
> > On Nov 8, 2019, at 11:08 AM, David Hastings <
> hastings.recurs...@gmail.com> wrote:
> >
> > the pf and qf fields are REALLY nice for this
> >
> > On Fri, Nov 8, 2019 at 12:02 PM Walter Underwood 
> > wrote:
> >
> >> I always enable phrase searching in edismax for exactly this reason.
> >>
> >> Something like:
> >>
> >>   title^16 keywords^8 text^2
> >>
> >> To deal with concepts in queries, a classifier and/or named entity
> >> extractor can be helpful. If you have a list of concepts (“controlled
> >> vocabulary”) that includes “Lamin A”, and that shows up in a query, that
> >> term can be queried against the field matching that vocabulary.
> >>
> >> This is how LinkedIn separates people, companies, and places, for
> example.
> >>
> >> wunder
> >> Walter Underwood
> >> wun...@wunderwood.org
> >> http://observer.wunderwood.org/  (my blog)
> >>
> >>> On Nov 8, 2019, at 10:48 AM, Erick Erickson 
> >> wrote:
> >>>
> >>> Look at the “mm” parameter, try setting it to 100%. Although that's not
> >> entirely likely to do what you want either since virtually every doc
> will
> >> have “a” in it. But at least you’d get docs that have both terms.
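[Editor's note: for reference, setting mm on an edismax request looks like this. A sketch only — the qf fields are taken from this thread, and how you send the request to your Solr host is up to you.]

```python
from urllib.parse import urlencode

# edismax request where mm=100% requires every surviving query clause to match
params = {
    "defType": "edismax",
    "q": "Lymphoid and a non-Lymphoid cell",
    "qf": "name^1.0 name_exact^10.0",
    "mm": "100%",
}
query_string = urlencode(params)
print(query_string)
```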
> >>>
> >>> you may also be able to search for things like “Lamin A” _only as a
> >> phrase_ and have some luck. But this is a gnarly problem in general.
> Some
> >> people have been able to substitute synonyms and/or shingles to make
> this
> >> work at the expense of a larger index.
> >>>
> >>> This is a generic problem with context. “Lamin A” is really a
> “concept”,
> >> not just two words that happen to be near each other. Searching as a
> phrase
> >> is an OOB-but-naive way to try to make it more likely that the ranked
> >> results refer to the _concept_ of “Lamin A”. The assumption here is “if
> >> these two words appear next to each other, they’re more likely to be
> what I
> >> want”. I say “naive” because “Lamins: A new approach to...” would
> _also_ be
> >> found for a naive phrase search. (I have no idea whether such a title
> makes
> >> sense or not, but you figured that out already)...
> >>>
> >>> To do this well you’d have to dive in to NLP/Machine learning.
> >>>
> >>> I truly wish we could have the DWIM search algorithm (Do What I Mean)….
> >>>
> >>>> On Nov 8, 2019, at 11:29 AM, Guilherme Viteri 
> >> wrote:
> >>>>
> >>>> HI Walter and Paras
> >>>>
> >>>> I indexed it removing all the references to StopWordFilter and I went
> >> from 121 results to near 20K as the search term q="Lymphoid and a
> >> non-Lymphoid cell" is matching entities such as "IFT A" or  "Lamin A".
> So I
> >> don't think removing it completely is the way to go from the scenario we
> >> have, but I appreciate the suggestion…
> >>>>
> >>>> Yes the response is using fl=*
> >>>> I am trying some combinations at the moment, but yet no success.
> >>>>
> >>>> defType=edismax
> >>>> q.alt=Lymphoid and a non-Lymphoid cell
> >>>> Number of results=1599
> >>>> Quite a considerable increase, even though reasonable meaningful
> >> results.
> >>>>
> >>>> I am sorry but I didn't understand what do you want me to do exactly
> >> with the lst (??) and qf and bf.
> >>>>
> >>>> Thanks everyone with their inputs
> >>>>
> >>>>
> >>>>> On 8 Nov 2019, at 06:45, Paras Lehana 
> >> wrote:
> >>>>>
> >>>>> Hi Guilherme
> >>>>>
> >>>>> By accident, I ended up querying using the default handler (/select) and it worked.
> >>>>>
> >>>>> You've just found the culprit. Thanks for 

Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

2019-11-08 Thread Walter Underwood
...JSON responses). I hope you have
>> provided the response with fl=*. Replace q with q.alt in your /search
>> handler query and I think you should start getting responses. That's
>> because q.alt uses standard parser. If you want to keep using edisMax, I
>> suggest you to test the responses removing some combination of lst (qf, bf)
>> and find what's restricting the documents to come up. I'm out of office
>> today - would have certainly tried analyzing the field values of the
>> document in /select request and compare it with qf/bq in solrconfig.xml
>> /search. Do this for me and you'd certainly find something.
>>>>> 
>>>>> On Thu, 7 Nov 2019 at 21:00, Walter Underwood > <mailto:wun...@wunderwood.org>> wrote:
>>>>> I normally use a weight of 8 for the most important field, like title.
>> Other fields might get a 4 or 2.
>>>>> 
>>>>> I add a “pf” field with the weights doubled, so that phrase matches
>> have a higher weight.
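[Editor's note: the qf-to-pf "doubled weights" convention described above can be applied mechanically. The helper below is hypothetical, not a Solr API — it just doubles each boost in a qf string, and assumes every clause carries an explicit ^boost.]

```python
def doubled_pf(qf):
    """Build a pf string from a qf string by doubling each boost.
    Assumes every clause has the form field^boost."""
    out = []
    for clause in qf.split():
        field, _, boost = clause.partition("^")
        out.append(f"{field}^{float(boost) * 2:g}")
    return " ".join(out)

print(doubled_pf("title^8 keywords^4 text^2"))
# → title^16 keywords^8 text^4
```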
>>>>> 
>>>>> The weight of 8 comes from experience at Infoseek and Inktomi, two
>> early web search engines. With different relevance algorithms and totally
>> different evaluation and tuning systems, they settled on weights of 8 and
>> 7.5 for HTML titles. With the the two radically different system getting
>> the same number, I decided that was a property of the documents, not of the
>> search engines.
>>>>> 
>>>>> wunder
>>>>> Walter Underwood
>>>>> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
>>>>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>
>> (my blog)
>>>>> 
>>>>>> On Nov 7, 2019, at 9:03 AM, Guilherme Viteri > <mailto:gvit...@ebi.ac.uk>> wrote:
>>>>>> 
>>>>>> Hi Wunder,
>>>>>> 
>>>>>> My indexer takes quite a few hours to run; I am shortening it
>> to run faster, but I also need to make sure it gives what we are expecting.
>> This implementation's been there for >4y, and massively used.
>>>>>> 
>>>>>>> In your edismax handlers, weights of 20, 50, and 100 are extremely
>> high. I don’t think I’ve ever used a weight higher than 16 in a dozen years
>> of configuring Solr.
>>>>>> I've inherited that implementation and I am really keen to improve
>> it; what would you recommend?
>>>>>> 
>>>>>> Cheers
>>>>>> Guilherme
>>>>>> 
>>>>>>> On 7 Nov 2019, at 14:43, Walter Underwood > <mailto:wun...@wunderwood.org>> wrote:
>>>>>>> 
>>>>>>> Thanks for posting the files. Looking at schema.xml, I see that you
>> still are using StopFilterFactory. The first advice we gave you was to
>> remove that.
>>>>>>> 
>>>>>>> Remove StopFilterFactory everywhere and reindex.
>>>>>>> 
>>>>>>> You will continue to have problems matching stopwords until you do
>> that.
>>>>>>> 
>>>>>>> In your edismax handlers, weights of 20, 50, and 100 are extremely
>> high. I don’t think I’ve ever used a weight higher than 16 in a dozen years
>> of configuring Solr.
>>>>>>> 
>>>>>>> wunder
>>>>>>> Walter Underwood
>>>>>>> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
>>>>>>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>
>> (my blog)
>>>>>>> 
>>>>>>>> On Nov 7, 2019, at 6:56 AM, Guilherme Viteri > <mailto:gvit...@ebi.ac.uk>> wrote:
>>>>>>>> 
>>>>>>>> Hi Paras, everyone
>>>>>>>> 
>>>>>>>> Thank you again for your inputs and suggestions. I'm sorry to hear
>> you had trouble with the attachments; I will host them somewhere and share the
>> links.
>>>>>>>> I don't tweak my index, I get the data from the graph database,
>> create a document as they are and save to solr.
>>>>>>>> 
>>>>>>>> So, I am sending the new analysis screen querying the way you
>> suggested. Also the results with params and solr query url.
>>>>>>>> 
>>>>>>>> During the process of querying what you asked I found something
>> really weird (at least for me). By accident, I ended up querying using the default handler (/select) and it worked.

Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

2019-11-08 Thread David Hastings
:00, Walter Underwood  <mailto:wun...@wunderwood.org>> wrote:
> >>> I normally use a weight of 8 for the most important field, like title.
> Other fields might get a 4 or 2.
> >>>
> >>> I add a “pf” field with the weights doubled, so that phrase matches
> have a higher weight.
> >>>
> >>> The weight of 8 comes from experience at Infoseek and Inktomi, two
> early web search engines. With different relevance algorithms and totally
> different evaluation and tuning systems, they settled on weights of 8 and
> 7.5 for HTML titles. With the two radically different systems getting
> the same number, I decided that was a property of the documents, not of the
> search engines.
> >>>
> >>> wunder
> >>> Walter Underwood
> >>> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
> >>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>
> (my blog)
> >>>
> >>>> On Nov 7, 2019, at 9:03 AM, Guilherme Viteri  <mailto:gvit...@ebi.ac.uk>> wrote:
> >>>>
> >>>> Hi Wunder,
> >>>>
> >>>> My indexer takes quite a few hours to execute; I am shortening it
> to run faster, but I also need to make sure it gives what we are expecting.
> This implementation's been there for >4y, and massively used.
> >>>>
> >>>>> In your edismax handlers, weights of 20, 50, and 100 are extremely
> high. I don’t think I’ve ever used a weight higher than 16 in a dozen years
> of configuring Solr.
> >>>> I've inherited that implementation and I am really keen to improve
> it; what would you recommend?
> >>>>
> >>>> Cheers
> >>>> Guilherme
> >>>>
> >>>>> On 7 Nov 2019, at 14:43, Walter Underwood  <mailto:wun...@wunderwood.org>> wrote:
> >>>>>
> >>>>> Thanks for posting the files. Looking at schema.xml, I see that you
> still are using StopFilterFactory. The first advice we gave you was to
> remove that.
> >>>>>
> >>>>> Remove StopFilterFactory everywhere and reindex.
> >>>>>
> >>>>> You will continue to have problems matching stopwords until you do
> that.
> >>>>>
> >>>>> In your edismax handlers, weights of 20, 50, and 100 are extremely
> high. I don’t think I’ve ever used a weight higher than 16 in a dozen years
> of configuring Solr.
> >>>>>
> >>>>> wunder
> >>>>> Walter Underwood
> >>>>> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
> >>>>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>
> (my blog)
> >>>>>
> >>>>>> On Nov 7, 2019, at 6:56 AM, Guilherme Viteri  <mailto:gvit...@ebi.ac.uk>> wrote:
> >>>>>>
> >>>>>> Hi Paras, everyone
> >>>>>>
> >>>>>> Thank you again for your inputs and suggestions. I'm sorry to hear
> you had trouble with the attachments; I will host them somewhere and share the
> links.
> >>>>>> I don't tweak my index, I get the data from the graph database,
> create a document as they are and save to solr.
> >>>>>>
> >>>>>> So, I am sending the new analysis screen querying the way you
> suggested. Also the results with params and solr query url.
> >>>>>>
> >>>>>> During the process of querying what you asked I found something
> really weird (at least for me). By accident, I ended up querying using
> the default handler (/select) and it worked. Then if I use the one I must
> use, it sadly doesn't work. I am posting both results and I will also
> post the handlers as well.
> >>>>>>
> >>>>>> Here is the link with all the files mentioned before
> >>>>>>
> https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0<
> https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0>
> <https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0
> <https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0
> >>
> >>>>>> If the link doesn't work www dot dropbox dot com slash sh slash
> fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a ? dl equals 0
> >>>>>>
> >>>>>> Thanks
> >>>>>>
> >>>>>>> On 7 Nov 2019, at 05:23

Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

2019-11-08 Thread Walter Underwood
Other fields might get a 4 or 2.
>>> 
>>> I add a “pf” field with the weights doubled, so that phrase matches have a 
>>> higher weight.
>>> 
>>> The weight of 8 comes from experience at Infoseek and Inktomi, two early 
>>> web search engines. With different relevance algorithms and totally 
>>> different evaluation and tuning systems, they settled on weights of 8 and 
>>> 7.5 for HTML titles. With the two radically different systems getting 
>>> the same number, I decided that was a property of the documents, not of the 
>>> search engines.
>>> 
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
>>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>  (my blog)
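[Editor's note] Walter's weighting scheme above can be sketched as eDisMax parameters. This is a minimal illustration only; the field names are hypothetical, not taken from this thread:

```python
# Sketch of the suggested weighting: qf gives the most important field a
# weight of 8 and lesser fields 4 or 2, while pf repeats the same fields
# with the weights doubled so phrase matches score higher.
# Field names (title, keywords, body) are made up for the example.
params = {
    "defType": "edismax",
    "qf": "title^8 keywords^4 body^2",
    "pf": "title^16 keywords^8 body^4",  # qf weights doubled for phrase matches
}
query_string = "&".join(f"{k}={v}" for k, v in params.items())
print(query_string)
```

The point of doubling in pf is that a document matching the terms as a phrase in the title outranks one that merely contains the terms scattered across fields.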
>>> 
>>>> On Nov 7, 2019, at 9:03 AM, Guilherme Viteri >>> <mailto:gvit...@ebi.ac.uk>> wrote:
>>>> 
>>>> Hi Wunder,
>>>> 
>>>> My indexer takes quite a few hours to execute; I am shortening it to 
>>>> run faster, but I also need to make sure it gives what we are expecting. 
>>>> This implementation's been there for >4y, and massively used.
>>>> 
>>>>> In your edismax handlers, weights of 20, 50, and 100 are extremely high. 
>>>>> I don’t think I’ve ever used a weight higher than 16 in a dozen years of 
>>>>> configuring Solr.
>>>> I've inherited that implementation and I am really keen to improve it; 
>>>> what would you recommend?
>>>> 
>>>> Cheers
>>>> Guilherme
>>>> 
>>>>> On 7 Nov 2019, at 14:43, Walter Underwood >>>> <mailto:wun...@wunderwood.org>> wrote:
>>>>> 
>>>>> Thanks for posting the files. Looking at schema.xml, I see that you still 
>>>>> are using StopFilterFactory. The first advice we gave you was to remove 
>>>>> that.
>>>>> 
>>>>> Remove StopFilterFactory everywhere and reindex.
>>>>> 
>>>>> You will continue to have problems matching stopwords until you do that.
>>>>> 
>>>>> In your edismax handlers, weights of 20, 50, and 100 are extremely high. 
>>>>> I don’t think I’ve ever used a weight higher than 16 in a dozen years of 
>>>>> configuring Solr.
>>>>> 
>>>>> wunder
>>>>> Walter Underwood
>>>>> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
>>>>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>  (my 
>>>>> blog)
>>>>> 
>>>>>> On Nov 7, 2019, at 6:56 AM, Guilherme Viteri >>>>> <mailto:gvit...@ebi.ac.uk>> wrote:
>>>>>> 
>>>>>> Hi Paras, everyone
>>>>>> 
>>>>>> Thank you again for your inputs and suggestions. I'm sorry to hear you had 
>>>>>> trouble with the attachments; I will host them somewhere and share the 
>>>>>> links. 
>>>>>> I don't tweak my index, I get the data from the graph database, create a 
>>>>>> document as they are and save to solr.
>>>>>> 
>>>>>> So, I am sending the new analysis screen querying the way you suggested. 
>>>>>> Also the results with params and solr query url.
>>>>>> 
>>>>>> During the process of querying what you asked I found something really 
>>>>>> weird (at least for me). By accident, I ended up querying using the 
>>>>>> default handler (/select) and it worked. Then if I use the one I must 
>>>>>> use, it sadly doesn't work. I am posting both results and I will also 
>>>>>> post the handlers as well.
>>>>>> 
>>>>>> Here is the link with all the files mentioned before
>>>>>> https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0<https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0>
>>>>>>  
>>>>>> <https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0<https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0>>
>>>>>> If the link doesn't work www dot dropbox dot com slash sh slash 
>>>>>> fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a ? dl equals 0
>>>>>> 
>>>>>> Thanks
>>>>>> 
>>>>>>> On 7 Nov 201

Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

2019-11-08 Thread Walter Underwood
But when you change it to AND, a single misspelling means zero results. That is 
usually not helpful.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)
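[Editor's note] Walter's point can be illustrated with a toy boolean-retrieval sketch (the documents, terms, and misspelling below are hypothetical, not from the thread): with AND every query term must be present, so a single misspelled term empties the result set, while OR still matches on the remaining terms.

```python
# Toy boolean retrieval over sets of terms; illustrates q.op=AND vs q.op=OR.
docs = {
    1: {"lymphoid", "and", "a", "non", "lymphoid", "cell"},
    2: {"lamin", "a"},
}

def matches(doc_terms, query_terms, op):
    if op == "AND":
        return query_terms <= doc_terms   # all query terms required
    return bool(query_terms & doc_terms)  # any query term suffices

good = {"lymphoid", "cell"}
typo = {"lymphoyd", "cell"}               # one misspelled term

print(matches(docs[1], good, "AND"))  # True
print(matches(docs[1], typo, "AND"))  # False: AND plus a typo gives zero hits
print(matches(docs[1], typo, "OR"))   # True: "cell" still matches
```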

> On Nov 8, 2019, at 10:43 AM, David Hastings  
> wrote:
> 
> is your default operator OR?
> change it to AND
> 
> 
> On Fri, Nov 8, 2019 at 11:30 AM Guilherme Viteri  wrote:
> 
>> Hi Walter and Paras
>> 
>> I indexed it removing all the references to StopWordFilter and I went from
>> 121 results to near 20K as the search term q="Lymphoid and a non-Lymphoid
>> cell" is matching entities such as "IFT A" or  "Lamin A". So I don't think
>> removing it completely is the way to go from the scenario we have, but I
>> appreciate the suggestion...
>> 
>> Yes the response is using fl=*
>> I am trying some combinations at the moment, but no success yet.
>> 
>> defType=edismax
>> q.alt=Lymphoid and a non-Lymphoid cell
>> Number of results=1599
>> Quite a considerable increase, even though the results are reasonably meaningful.
>> 
>> I am sorry but I didn't understand what exactly you want me to do with
>> the lst (??) and qf and bf.
>> 
>> Thanks everyone for your inputs
>> 
>> 
>>> On 8 Nov 2019, at 06:45, Paras Lehana 
>> wrote:
>>> 
>>> Hi Guilherme
>>> 
>>> By accident, I ended up querying using the default handler (/select)
>> and it worked.
>>> 
>>> You've just found the culprit. Thanks for giving the material I
>> requested. Your analysis chain is working as expected. I don't see any
>> issue in either StopWordFilter or your boosts. I also use a boost of 50
>> when boosting contextual suggestions (boosting "gold iphone" on a page of
>> iphone) but I take Walter's suggestion and would try to optimize my
>> weights. I agree that this 50 thing was not researched much about by us as
>> well (we never faced performance or relevance issues).
>>> 
>>> See the major difference in both the handlers - edismax. I'm pretty sure
>> that your problem lies in the parsing of queries (you can confirm that from
>> parsedquery key in debug of both JSON responses). I hope you have provided
>> the response with fl=*. Replace q with q.alt in your /search handler query
>> and I think you should start getting responses. That's because q.alt uses
>> standard parser. If you want to keep using edisMax, I suggest you test
>> the responses removing some combination of lst (qf, bf) and find what's
>> restricting the documents to come up. I'm out of office today - would have
>> certainly tried analyzing the field values of the document in /select
>> request and compare it with qf/bq in solrconfig.xml /search. Do this for me
>> and you'd certainly find something.
>>> 
>>> On Thu, 7 Nov 2019 at 21:00, Walter Underwood > <mailto:wun...@wunderwood.org>> wrote:
>>> I normally use a weight of 8 for the most important field, like title.
>> Other fields might get a 4 or 2.
>>> 
>>> I add a “pf” field with the weights doubled, so that phrase matches have
>> a higher weight.
>>> 
>>> The weight of 8 comes from experience at Infoseek and Inktomi, two early
>> web search engines. With different relevance algorithms and totally
>> different evaluation and tuning systems, they settled on weights of 8 and
>> 7.5 for HTML titles. With the two radically different systems getting
>> the same number, I decided that was a property of the documents, not of the
>> search engines.
>>> 
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
>>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>  (my
>> blog)
>>> 
>>>> On Nov 7, 2019, at 9:03 AM, Guilherme Viteri > <mailto:gvit...@ebi.ac.uk>> wrote:
>>>> 
>>>> Hi Wunder,
>>>> 
>>>> My indexer takes quite a few hours to execute; I am shortening it to
>> run faster, but I also need to make sure it gives what we are expecting.
>> This implementation's been there for >4y, and massively used.
>>>> 
>>>>> In your edismax handlers, weights of 20, 50, and 100 are extremely
>> high. I don’t think I’ve ever used a weight higher than 16 in a dozen years
>> of configuring Solr.
>>>> I've inherited that implementation and I am really keen to improve it;
>> what would you recommend?
>>>> 
>>>> Cheers
>>>> Guilherme
>>>>

Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

2019-11-08 Thread Erick Erickson
>> My indexer takes quite a few hours to execute; I am shortening it to run 
>>> faster, but I also need to make sure it gives what we are expecting. This 
>>> implementation's been there for >4y, and massively used.
>>> 
>>>> In your edismax handlers, weights of 20, 50, and 100 are extremely high. I 
>>>> don’t think I’ve ever used a weight higher than 16 in a dozen years of 
>>>> configuring Solr.
>>> I've inherited that implementation and I am really keen to improve it; 
>>> what would you recommend?
>>> 
>>> Cheers
>>> Guilherme
>>> 
>>>> On 7 Nov 2019, at 14:43, Walter Underwood >>> <mailto:wun...@wunderwood.org>> wrote:
>>>> 
>>>> Thanks for posting the files. Looking at schema.xml, I see that you still 
>>>> are using StopFilterFactory. The first advice we gave you was to remove 
>>>> that.
>>>> 
>>>> Remove StopFilterFactory everywhere and reindex.
>>>> 
>>>> You will continue to have problems matching stopwords until you do that.
>>>> 
>>>> In your edismax handlers, weights of 20, 50, and 100 are extremely high. I 
>>>> don’t think I’ve ever used a weight higher than 16 in a dozen years of 
>>>> configuring Solr.
>>>> 
>>>> wunder
>>>> Walter Underwood
>>>> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
>>>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>  (my 
>>>> blog)
>>>> 
>>>>> On Nov 7, 2019, at 6:56 AM, Guilherme Viteri >>>> <mailto:gvit...@ebi.ac.uk>> wrote:
>>>>> 
>>>>> Hi Paras, everyone
>>>>> 
>>>>> Thank you again for your inputs and suggestions. I'm sorry to hear you had 
>>>>> trouble with the attachments; I will host them somewhere and share the 
>>>>> links. 
>>>>> I don't tweak my index, I get the data from the graph database, create a 
>>>>> document as they are and save to solr.
>>>>> 
>>>>> So, I am sending the new analysis screen querying the way you suggested. 
>>>>> Also the results with params and solr query url.
>>>>> 
>>>>> During the process of querying what you asked I found something really 
>>>>> weird (at least for me). By accident, I ended up querying using the 
>>>>> default handler (/select) and it worked. Then if I use the one I must 
>>>>> use, it sadly doesn't work. I am posting both results and I will also 
>>>>> post the handlers as well.
>>>>> 
>>>>> Here is the link with all the files mentioned before
>>>>> https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0<https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0>
>>>>>  
>>>>> <https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0<https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0>>
>>>>> If the link doesn't work www dot dropbox dot com slash sh slash 
>>>>> fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a ? dl equals 0
>>>>> 
>>>>> Thanks
>>>>> 
>>>>>> On 7 Nov 2019, at 05:23, Paras Lehana >>>>> <mailto:paras.leh...@indiamart.com>> wrote:
>>>>>> 
>>>>>> Hi Guilherme.
>>>>>> 
>>>>>> I am sending they analysis result and the json result as requested.
>>>>>> 
>>>>>> 
>>>>>> Thanks for the effort. Luckily, I can see your attachments (low quality
>>>>>> though).
>>>>>> 
>>>>>> From the analysis screen, the analysis is working as expected. One of the
>>>>>> reasons for query="lymphoid and *a* non-lymphoid cell" not matching
>>>>>> document containing "Lymphoid and a non-Lymphoid cell" I can initially
>>>>>> think of is: the stopword "a" is probably present in post-analysis either
>>>>>> of query or index. Did you tweak your index time analysis after indexing?
>>>>>> 
>>>>>> Do two things:
>>>>>> 
>>>>>> 1. Post the analysis screen for and index=*"Immunoregulatory
>>>>>> interactions between a Lymphoid and a non-Lymphoid cell"* and
>>>>>

Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

2019-11-08 Thread Guilherme Viteri
<solrQueryParser defaultOperator="OR"/>

<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="q.op">OR</str>
    <str name="echoParams">explicit</str>
    <str name="defType">edismax</str>
    <str name="q.alt">*:*</str>
    <str name="df">name</str>
    ...
  </lst>
</requestHandler>

> On 8 Nov 2019, at 16:43, David Hastings  wrote:
> 
> is your default operator OR?
> change it to AND
> 
> 
> On Fri, Nov 8, 2019 at 11:30 AM Guilherme Viteri  wrote:
> 
>> Hi Walter and Paras
>> 
>> I indexed it removing all the references to StopWordFilter and I went from
>> 121 results to near 20K as the search term q="Lymphoid and a non-Lymphoid
>> cell" is matching entities such as "IFT A" or  "Lamin A". So I don't think
>> removing it completely is the way to go from the scenario we have, but I
>> appreciate the suggestion...
>> 
>> Yes the response is using fl=*
>> I am trying some combinations at the moment, but no success yet.
>> 
>> defType=edismax
>> q.alt=Lymphoid and a non-Lymphoid cell
>> Number of results=1599
>> Quite a considerable increase, even though the results are reasonably meaningful.
>> 
>> I am sorry but I didn't understand what exactly you want me to do with
>> the lst (??) and qf and bf.
>> 
>> Thanks everyone for your inputs
>> 
>> 
>>> On 8 Nov 2019, at 06:45, Paras Lehana 
>> wrote:
>>> 
>>> Hi Guilherme
>>> 
>>> By accident, I ended up querying using the default handler (/select)
>> and it worked.
>>> 
>>> You've just found the culprit. Thanks for giving the material I
>> requested. Your analysis chain is working as expected. I don't see any
>> issue in either StopWordFilter or your boosts. I also use a boost of 50
>> when boosting contextual suggestions (boosting "gold iphone" on a page of
>> iphone) but I take Walter's suggestion and would try to optimize my
>> weights. I agree that this 50 thing was not researched much about by us as
>> well (we never faced performance or relevance issues).
>>> 
>>> See the major difference in both the handlers - edismax. I'm pretty sure
>> that your problem lies in the parsing of queries (you can confirm that from
>> parsedquery key in debug of both JSON responses). I hope you have provided
>> the response with fl=*. Replace q with q.alt in your /search handler query
>> and I think you should start getting responses. That's because q.alt uses
>> standard parser. If you want to keep using edisMax, I suggest you test
>> the responses removing some combination of lst (qf, bf) and find what's
>> restricting the documents to come up. I'm out of office today - would have
>> certainly tried analyzing the field values of the document in /select
>> request and compare it with qf/bq in solrconfig.xml /search. Do this for me
>> and you'd certainly find something.
>>> 
>>> On Thu, 7 Nov 2019 at 21:00, Walter Underwood > <mailto:wun...@wunderwood.org>> wrote:
>>> I normally use a weight of 8 for the most important field, like title.
>> Other fields might get a 4 or 2.
>>> 
>>> I add a “pf” field with the weights doubled, so that phrase matches have
>> a higher weight.
>>> 
>>> The weight of 8 comes from experience at Infoseek and Inktomi, two early
>> web search engines. With different relevance algorithms and totally
>> different evaluation and tuning systems, they settled on weights of 8 and
>> 7.5 for HTML titles. With the two radically different systems getting
>> the same number, I decided that was a property of the documents, not of the
>> search engines.
>>> 
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
>>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>  (my
>> blog)
>>> 
>>>> On Nov 7, 2019, at 9:03 AM, Guilherme Viteri > <mailto:gvit...@ebi.ac.uk>> wrote:
>>>> 
>>>> Hi Wunder,
>>>> 
>>>> My indexer takes quite a few hours to execute; I am shortening it to
>> run faster, but I also need to make sure it gives what we are expecting.
>> This implementation's been there for >4y, and massively used.
>>>> 
>>>>> In your edismax handlers, weights of 20, 50, and 100 are extremely
>> high. I don’t think I’ve ever used a weight higher than 16 in a dozen years
>> of configuring Solr.
>>>> I've inherited that implementation and I am really keen to improve it;
>> what would you recommend?
>>>> 
>>>> Cheers
>>>> Guilherme
>>>> 
>>>>> On 7 Nov 2019, a

Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

2019-11-08 Thread David Hastings
is your default operator OR?
change it to AND


On Fri, Nov 8, 2019 at 11:30 AM Guilherme Viteri  wrote:

> Hi Walter and Paras
>
> I indexed it removing all the references to StopWordFilter and I went from
> 121 results to near 20K as the search term q="Lymphoid and a non-Lymphoid
> cell" is matching entities such as "IFT A" or  "Lamin A". So I don't think
> removing it completely is the way to go from the scenario we have, but I
> appreciate the suggestion...
>
> Yes the response is using fl=*
> I am trying some combinations at the moment, but no success yet.
>
> defType=edismax
> q.alt=Lymphoid and a non-Lymphoid cell
> Number of results=1599
> Quite a considerable increase, even though the results are reasonably meaningful.
>
> I am sorry but I didn't understand what exactly you want me to do with
> the lst (??) and qf and bf.
>
> Thanks everyone for your inputs
>
>
> > On 8 Nov 2019, at 06:45, Paras Lehana 
> wrote:
> >
> > Hi Guilherme
> >
> > By accident, I ended up querying using the default handler (/select)
> and it worked.
> >
> > You've just found the culprit. Thanks for giving the material I
> requested. Your analysis chain is working as expected. I don't see any
> issue in either StopWordFilter or your boosts. I also use a boost of 50
> when boosting contextual suggestions (boosting "gold iphone" on a page of
> iphone) but I take Walter's suggestion and would try to optimize my
> weights. I agree that this 50 thing was not researched much about by us as
> well (we never faced performance or relevance issues).
> >
> > See the major difference in both the handlers - edismax. I'm pretty sure
> that your problem lies in the parsing of queries (you can confirm that from
> parsedquery key in debug of both JSON responses). I hope you have provided
> the response with fl=*. Replace q with q.alt in your /search handler query
> and I think you should start getting responses. That's because q.alt uses
> standard parser. If you want to keep using edisMax, I suggest you test
> the responses removing some combination of lst (qf, bf) and find what's
> restricting the documents to come up. I'm out of office today - would have
> certainly tried analyzing the field values of the document in /select
> request and compare it with qf/bq in solrconfig.xml /search. Do this for me
> and you'd certainly find something.
> >
> > On Thu, 7 Nov 2019 at 21:00, Walter Underwood  <mailto:wun...@wunderwood.org>> wrote:
> > I normally use a weight of 8 for the most important field, like title.
> Other fields might get a 4 or 2.
> >
> > I add a “pf” field with the weights doubled, so that phrase matches have
> a higher weight.
> >
> > The weight of 8 comes from experience at Infoseek and Inktomi, two early
> web search engines. With different relevance algorithms and totally
> different evaluation and tuning systems, they settled on weights of 8 and
> 7.5 for HTML titles. With the two radically different systems getting
> the same number, I decided that was a property of the documents, not of the
> search engines.
> >
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org <mailto:wun...@wunderwood.org>
> > http://observer.wunderwood.org/ <http://observer.wunderwood.org/>  (my
> blog)
> >
> >> On Nov 7, 2019, at 9:03 AM, Guilherme Viteri  <mailto:gvit...@ebi.ac.uk>> wrote:
> >>
> >> Hi Wunder,
> >>
> >> My indexer takes quite a few hours to execute; I am shortening it to
> run faster, but I also need to make sure it gives what we are expecting.
> This implementation's been there for >4y, and massively used.
> >>
> >>> In your edismax handlers, weights of 20, 50, and 100 are extremely
> high. I don’t think I’ve ever used a weight higher than 16 in a dozen years
> of configuring Solr.
> >> I've inherited that implementation and I am really keen to improve it;
> what would you recommend?
> >>
> >> Cheers
> >> Guilherme
> >>
> >>> On 7 Nov 2019, at 14:43, Walter Underwood  <mailto:wun...@wunderwood.org>> wrote:
> >>>
> >>> Thanks for posting the files. Looking at schema.xml, I see that you
> still are using StopFilterFactory. The first advice we gave you was to
> remove that.
> >>>
> >>> Remove StopFilterFactory everywhere and reindex.
> >>>
> >>> You will continue to have problems matching stopwords until you do
> that.
> >>>
> >>> In your edismax handlers, weights of 20, 50, and 100 are extremely
> high. I don’t think I’v

Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

2019-11-08 Thread Guilherme Viteri
Hi Walter and Paras

I reindexed after removing all references to StopWordFilter, and I went from 121 
results to nearly 20K, as the search term q="Lymphoid and a non-Lymphoid cell" now 
matches entities such as "IFT A" or "Lamin A". So I don't think removing it 
completely is the way to go for our scenario, but I appreciate the 
suggestion...

Yes the response is using fl=*
I am trying some combinations at the moment, but no success yet.

defType=edismax
q.alt=Lymphoid and a non-Lymphoid cell
Number of results=1599
Quite a considerable increase, even though the results are reasonably meaningful. 

I am sorry but I didn't understand what exactly you want me to do with the 
lst (??) and qf and bf.

Thanks everyone for your inputs
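[Editor's note] The jump from 121 to ~20K hits is consistent with the stopword "a" becoming a searchable term once the stop filter is removed. A simplified sketch of the effect (the stopword list and tokenizer here are stand-ins, not the actual schema):

```python
# Simplified analysis chain: lowercase, split on whitespace and hyphens, and
# optionally drop stopwords. STOPWORDS is a stand-in for stopwords.txt.
STOPWORDS = {"a", "an", "and", "the", "of"}

def analyze(text, remove_stopwords):
    tokens = text.lower().replace("-", " ").split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens

query = "Lymphoid and a non-Lymphoid cell"
print(analyze(query, True))   # ['lymphoid', 'non', 'lymphoid', 'cell']
print(analyze(query, False))  # 'and' and 'a' survive; 'a' can now match "Lamin A"
```

With the filter removed, the lone token "a" overlaps entity names like "Lamin A", which is exactly the kind of spurious match described above.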


> On 8 Nov 2019, at 06:45, Paras Lehana  wrote:
> 
> Hi Guilherme
> 
> By accident, I ended up querying using the default handler (/select) and 
> it worked. 
> 
> You've just found the culprit. Thanks for giving the material I requested. 
> Your analysis chain is working as expected. I don't see any issue in either 
> StopWordFilter or your boosts. I also use a boost of 50 when boosting 
> contextual suggestions (boosting "gold iphone" on a page of iphone) but I 
> take Walter's suggestion and would try to optimize my weights. I agree that 
> this 50 thing was not researched much about by us as well (we never faced 
> performance or relevance issues).  
> 
> See the major difference in both the handlers - edismax. I'm pretty sure that 
> your problem lies in the parsing of queries (you can confirm that from 
> parsedquery key in debug of both JSON responses). I hope you have provided 
> the response with fl=*. Replace q with q.alt in your /search handler query 
> and I think you should start getting responses. That's because q.alt uses 
> standard parser. If you want to keep using edisMax, I suggest you test the 
> responses removing some combination of lst (qf, bf) and find what's 
> restricting the documents to come up. I'm out of office today - would have 
> certainly tried analyzing the field values of the document in /select request 
> and compare it with qf/bq in solrconfig.xml /search. Do this for me and you'd 
> certainly find something.  
> 
> On Thu, 7 Nov 2019 at 21:00, Walter Underwood  <mailto:wun...@wunderwood.org>> wrote:
> I normally use a weight of 8 for the most important field, like title. Other 
> fields might get a 4 or 2.
> 
> I add a “pf” field with the weights doubled, so that phrase matches have a 
> higher weight.
> 
> The weight of 8 comes from experience at Infoseek and Inktomi, two early web 
> search engines. With different relevance algorithms and totally different 
> evaluation and tuning systems, they settled on weights of 8 and 7.5 for HTML 
> titles. With the two radically different systems getting the same number, 
> I decided that was a property of the documents, not of the search engines.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>  (my blog)
> 
>> On Nov 7, 2019, at 9:03 AM, Guilherme Viteri > <mailto:gvit...@ebi.ac.uk>> wrote:
>> 
>> Hi Wunder,
>> 
>> My indexer takes quite a few hours to execute; I am shortening it to run 
>> faster, but I also need to make sure it gives what we are expecting. This 
>> implementation's been there for >4y, and massively used.
>> 
>>> In your edismax handlers, weights of 20, 50, and 100 are extremely high. I 
>>> don’t think I’ve ever used a weight higher than 16 in a dozen years of 
>>> configuring Solr.
>> I've inherited that implementation and I am really keen to improve it; what 
>> would you recommend?
>> 
>> Cheers
>> Guilherme
>> 
>>> On 7 Nov 2019, at 14:43, Walter Underwood >> <mailto:wun...@wunderwood.org>> wrote:
>>> 
>>> Thanks for posting the files. Looking at schema.xml, I see that you still 
>>> are using StopFilterFactory. The first advice we gave you was to remove 
>>> that.
>>> 
>>> Remove StopFilterFactory everywhere and reindex.
>>> 
>>> You will continue to have problems matching stopwords until you do that.
>>> 
>>> In your edismax handlers, weights of 20, 50, and 100 are extremely high. I 
>>> don’t think I’ve ever used a weight higher than 16 in a dozen years of 
>>> configuring Solr.
>>> 
>>> wunder
>>> Walter Underwood
>>> wun...@wunderwood.org <mailto:wun...@wunderwood.org>
>>> http://observer.wunderwood.org/ <http://observer.wunderwood.org/>  (my bl

Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

2019-11-07 Thread Paras Lehana
Hi Guilherme

By accident, I ended up querying using the default handler (/select)
> and it worked.


You've just found the culprit. Thanks for giving the material I requested.
Your analysis chain is working as expected. I don't see any issue in either
StopWordFilter or your boosts. I also use a boost of 50 when boosting
contextual suggestions (boosting "gold iphone" on a page of iphone) but I
take Walter's suggestion and will try to optimize my weights. I agree that
we never researched this value of 50 much either (we never faced
performance or relevance issues).

Note the major difference between the two handlers: edismax. I'm pretty sure
your problem lies in the parsing of queries (you can confirm that from the
parsedquery key in the debug section of both JSON responses). I hope you have
provided the response with fl=*. Replace q with q.alt in your /search handler
query and I think you should start getting responses. That's because q.alt
uses the standard parser. If you want to keep using edismax, I suggest you
test the responses while removing some combination of the lst entries (qf, bf)
to find what's restricting the documents from coming up. I'm out of office
today - otherwise I would have tried analyzing the field values of the
document in the /select request and comparing them with qf/bq in the
solrconfig.xml /search handler. Do this for me and you'll certainly find
something.
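A minimal sketch of the q.alt experiment suggested above (host, core, and handler names are placeholders, and debug=query is added to expose the parsed query):

```text
# Through the custom /search handler, parsed by edismax:
http://localhost:8983/solr/mycore/search?q=lymphoid+and+a+non-lymphoid+cell&debug=query

# Same terms via q.alt, parsed by the standard (lucene) parser:
http://localhost:8983/solr/mycore/search?q.alt=lymphoid+and+a+non-lymphoid+cell&debug=query
```

Comparing the parsedquery key of the two responses should show where the edismax parse (with qf/bf applied) diverges from the standard parse.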

On Thu, 7 Nov 2019 at 21:00, Walter Underwood  wrote:

> I normally use a weight of 8 for the most important field, like title.
> Other fields might get a 4 or 2.
>
> I add a “pf” field with the weights doubled, so that phrase matches have a
> higher weight.
>
> The weight of 8 comes from experience at Infoseek and Inktomi, two early
> web search engines. With different relevance algorithms and totally
> different evaluation and tuning systems, they settled on weights of 8 and
> 7.5 for HTML titles. With the two radically different systems getting
> the same number, I decided that was a property of the documents, not of the
> search engines.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> On Nov 7, 2019, at 9:03 AM, Guilherme Viteri  wrote:
>
> Hi Wunder,
>
> My indexer takes quite a few hours to be executed I am shortening it to
> run faster, but I also need to make sure it gives what we are expecting.
> This implementation's been there for >4y, and massively used.
>
> In your edismax handlers, weights of 20, 50, and 100 are extremely high. I
> don’t think I’ve ever used a weight higher than 16 in a dozen years of
> configuring Solr.
>
> I've inherited that implementation and I am really keen to adequate it,
> what would you recommend ?
>
> Cheers
> Guilherme
>
> On 7 Nov 2019, at 14:43, Walter Underwood  wrote:
>
> Thanks for posting the files. Looking at schema.xml, I see that you still
> are using StopFilterFactory. The first advice we gave you was to remove
> that.
>
> Remove StopFilterFactory everywhere and reindex.
>
> You will continue to have problems matching stopwords until you do that.
>
> In your edismax handlers, weights of 20, 50, and 100 are extremely high. I
> don’t think I’ve ever used a weight higher than 16 in a dozen years of
> configuring Solr.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> On Nov 7, 2019, at 6:56 AM, Guilherme Viteri  wrote:
>
> Hi Paras, everyone
>
> Thank you again for your inputs and suggestions. I'm sorry to hear you had
> trouble with the attachments; I will host the files somewhere and share the
> links. I don't tweak my index; I get the data from the graph database,
> create documents as they are, and save them to Solr.
>
> So, I am sending the new analysis screen querying the way you suggested.
> Also the results with params and solr query url.
>
> While running the queries you asked for, I found something really
> weird (at least for me). By accident, I ended up querying using the
> default handler (/select) and it worked. If I use the one I must use,
> it sadly doesn't work. I am posting both results and I will also post the
> handlers as well.
>
> Here is the link with all the files mentioned before
> https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0
> <https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0
> >
> If the link doesn't work www dot dropbox dot com slash sh slash
> fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a ? dl equals 0
>
> Thanks
>
> On 7 Nov 2019, at 05:23, Paras Lehana  wrote:
>
> Hi Guilherme.
>
> I am sending the analysis result and the json result as requested.
>
>
> Thanks for the effort. Luckily, I can see your attachments (low quality
> though).

Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

2019-11-07 Thread Walter Underwood
I normally use a weight of 8 for the most important field, like title. Other 
fields might get a 4 or 2.

I add a “pf” field with the weights doubled, so that phrase matches have a 
higher weight.

The weight of 8 comes from experience at Infoseek and Inktomi, two early web 
search engines. With different relevance algorithms and totally different 
evaluation and tuning systems, they settled on weights of 8 and 7.5 for HTML 
titles. With the two radically different systems getting the same number, I 
decided that was a property of the documents, not of the search engines.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)
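Expressed as a hypothetical solrconfig.xml fragment following this advice (the field names and handler name are illustrative, not from this thread), the scheme might look like:

```xml
<requestHandler name="/search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- modest weights: 8 for the most important field, 4 or 2 for others -->
    <str name="qf">title^8 keywords^4 body^2</str>
    <!-- same fields with doubled weights so phrase matches rank higher -->
    <str name="pf">title^16 keywords^8 body^4</str>
  </lst>
</requestHandler>
```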

> On Nov 7, 2019, at 9:03 AM, Guilherme Viteri  wrote:
> 
> Hi Wunder,
> 
> My indexer takes quite a few hours to be executed I am shortening it to run 
> faster, but I also need to make sure it gives what we are expecting. This 
> implementation's been there for >4y, and massively used.
> 
>> In your edismax handlers, weights of 20, 50, and 100 are extremely high. I 
>> don’t think I’ve ever used a weight higher than 16 in a dozen years of 
>> configuring Solr.
> I've inherited that implementation and I am really keen to adequate it, what 
> would you recommend ?
> 
> Cheers
> Guilherme
> 
>> On 7 Nov 2019, at 14:43, Walter Underwood  wrote:
>> 
>> Thanks for posting the files. Looking at schema.xml, I see that you still 
>> are using StopFilterFactory. The first advice we gave you was to remove that.
>> 
>> Remove StopFilterFactory everywhere and reindex.
>> 
>> You will continue to have problems matching stopwords until you do that.
>> 
>> In your edismax handlers, weights of 20, 50, and 100 are extremely high. I 
>> don’t think I’ve ever used a weight higher than 16 in a dozen years of 
>> configuring Solr.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>>> On Nov 7, 2019, at 6:56 AM, Guilherme Viteri  wrote:
>>> 
>>> Hi Paras, everyone
>>> 
>>> Thank you again for your inputs and suggestions. I sorry to hear you had 
>>> trouble with the attachments I will host it somewhere and share the links. 
>>> I don't tweak my index, I get the data from the graph database, create a 
>>> document as they are and save to solr.
>>> 
>>> So, I am sending the new analysis screen querying the way you suggested. 
>>> Also the results with params and solr query url.
>>> 
>>> During the process of querying what you asked I found something really 
>>> weird (at least for me). By accident, I ended up querying the using the 
>>> default handler (/select) and it worked. Then If I use the one I must use, 
>>> then sadly doesn't work. I am posting both results and I will also post the 
>>> handlers as well.
>>> 
>>> Here is the link with all the files mentioned before
>>> https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0 
>>> <https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0>
>>> If the link doesn't work www dot dropbox dot com slash sh slash 
>>> fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a ? dl equals 0
>>> 
>>> Thanks
>>> 
>>>> On 7 Nov 2019, at 05:23, Paras Lehana  wrote:
>>>> 
>>>> Hi Guilherme.
>>>> 
>>>> I am sending they analysis result and the json result as requested.
>>>> 
>>>> 
>>>> Thanks for the effort. Luckily, I can see your attachments (low quality
>>>> though).
>>>> 
>>>> From the analysis screen, the analysis is working as expected. One of the
>>>> reasons for query="lymphoid and *a* non-lymphoid cell" not matching
>>>> document containing "Lymphoid and a non-Lymphoid cell" I can initially
>>>> think of is: the stopword "a" is probably present in post-analysis either
>>>> of query or index. Did you tweak your index time analysis after indexing?
>>>> 
>>>> Do two things:
>>>> 
>>>> 1. Post the analysis screen for index=*"Immunoregulatory
>>>> interactions between a Lymphoid and a non-Lymphoid cell"* and
>>>> query=*"lymphoid and a non-lymphoid cell"*. Try hosting the image and
>>>> providing the link here.
>>>> 2. Give the same JSON output as you have sent but this time with
>>>> *"echoParams=all"*. Also, post the exact Solr query url.
>>>> 
>>

Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

2019-11-07 Thread Guilherme Viteri
Hi Wunder,

My indexer takes quite a few hours to run. I am shortening it to run faster, 
but I also need to make sure it gives what we are expecting. This 
implementation has been there for >4 years and is massively used.

> In your edismax handlers, weights of 20, 50, and 100 are extremely high. I 
> don’t think I’ve ever used a weight higher than 16 in a dozen years of 
> configuring Solr.
I've inherited that implementation and I am really keen to improve it; what 
would you recommend?

Cheers
Guilherme

> On 7 Nov 2019, at 14:43, Walter Underwood  wrote:
> 
> Thanks for posting the files. Looking at schema.xml, I see that you still are 
> using StopFilterFactory. The first advice we gave you was to remove that.
> 
> Remove StopFilterFactory everywhere and reindex.
> 
> You will continue to have problems matching stopwords until you do that.
> 
> In your edismax handlers, weights of 20, 50, and 100 are extremely high. I 
> don’t think I’ve ever used a weight higher than 16 in a dozen years of 
> configuring Solr.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
>> On Nov 7, 2019, at 6:56 AM, Guilherme Viteri  wrote:
>> 
>> Hi Paras, everyone
>> 
>> Thank you again for your inputs and suggestions. I sorry to hear you had 
>> trouble with the attachments I will host it somewhere and share the links. 
>> I don't tweak my index, I get the data from the graph database, create a 
>> document as they are and save to solr.
>> 
>> So, I am sending the new analysis screen querying the way you suggested. 
>> Also the results with params and solr query url.
>> 
>> During the process of querying what you asked I found something really weird 
>> (at least for me). By accident, I ended up querying the using the default 
>> handler (/select) and it worked. Then If I use the one I must use, then 
>> sadly doesn't work. I am posting both results and I will also post the 
>> handlers as well.
>> 
>> Here is the link with all the files mentioned before
>> https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0 
>> <https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0>
>> If the link doesn't work www dot dropbox dot com slash sh slash 
>> fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a ? dl equals 0
>> 
>> Thanks
>> 
>>> On 7 Nov 2019, at 05:23, Paras Lehana  wrote:
>>> 
>>> Hi Guilherme.
>>> 
>>> I am sending they analysis result and the json result as requested.
>>> 
>>> 
>>> Thanks for the effort. Luckily, I can see your attachments (low quality
>>> though).
>>> 
>>> From the analysis screen, the analysis is working as expected. One of the
>>> reasons for query="lymphoid and *a* non-lymphoid cell" not matching
>>> document containing "Lymphoid and a non-Lymphoid cell" I can initially
>>> think of is: the stopword "a" is probably present in post-analysis either
>>> of query or index. Did you tweak your index time analysis after indexing?
>>> 
>>> Do two things:
>>> 
>>> 1. Post the analysis screen for and index=*"Immunoregulatory
>>> interactions between a Lymphoid and a non-Lymphoid cell"* and
>>> "query=*"lymphoid
>>> and a non-lymphoid cell"*. Try hosting the image and providing the link
>>> here.
>>> 2. Give the same JSON output as you have sent but this time with
>>> *"echoParams=all"*. Also, post the exact Solr query url.
>>> 
>>> 
>>> 
>>> On Wed, 6 Nov 2019 at 21:07, Erick Erickson  wrote:
>>> 
>>>> I don’t see the attachments, maybe I deleted old e-mails or some such. The
>>>> Apache server is fairly aggressive about stripping attachments though, so
>>>> it’s also possible they didn’t make it through.
>>>> 
>>>>> On Nov 6, 2019, at 9:28 AM, Guilherme Viteri  wrote:
>>>>> 
>>>>> Thanks Erick.
>>>>> 
>>>>>> First, your index and analysis chains are considerably different, this
>>>> can easily be a source of problems. In particular, using two different
>>>> tokenizers is a huge red flag. I _strongly_ recommend against this unless
>>>> you’re totally sure you understand the consequences. Additionally, your use
>>>> of the length filter is suspicious, especially since your problem statement
>>>> is about the addition of a single letter term and the min length allowed on
>>>> tha

Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

2019-11-07 Thread David Hastings
Ha, funny enough I still use qf/pf boosts starting at 100 and going down;
it gives me room to add boosting to more fields without making them equal.
Maybe excessive, but I haven't noticed a performance issue.

On Thu, Nov 7, 2019 at 9:44 AM Walter Underwood 
wrote:

> Thanks for posting the files. Looking at schema.xml, I see that you still
> are using StopFilterFactory. The first advice we gave you was to remove
> that.
>
> Remove StopFilterFactory everywhere and reindex.
>
> You will continue to have problems matching stopwords until you do that.
>
> In your edismax handlers, weights of 20, 50, and 100 are extremely high. I
> don’t think I’ve ever used a weight higher than 16 in a dozen years of
> configuring Solr.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Nov 7, 2019, at 6:56 AM, Guilherme Viteri  wrote:
> >
> > Hi Paras, everyone
> >
> > Thank you again for your inputs and suggestions. I sorry to hear you had
> trouble with the attachments I will host it somewhere and share the links.
> > I don't tweak my index, I get the data from the graph database, create a
> document as they are and save to solr.
> >
> > So, I am sending the new analysis screen querying the way you suggested.
> Also the results with params and solr query url.
> >
> > During the process of querying what you asked I found something really
> weird (at least for me). By accident, I ended up querying the using the
> default handler (/select) and it worked. Then If I use the one I must use,
> then sadly doesn't work. I am posting both results and I will also post the
> handlers as well.
> >
> > Here is the link with all the files mentioned before
> >
> https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0
> <https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0
> >
> > If the link doesn't work www dot dropbox dot com slash sh slash
> fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a ? dl equals 0
> >
> > Thanks
> >
> >> On 7 Nov 2019, at 05:23, Paras Lehana 
> wrote:
> >>
> >> Hi Guilherme.
> >>
> >> I am sending they analysis result and the json result as requested.
> >>
> >>
> >> Thanks for the effort. Luckily, I can see your attachments (low quality
> >> though).
> >>
> >> From the analysis screen, the analysis is working as expected. One of
> the
> >> reasons for query="lymphoid and *a* non-lymphoid cell" not matching
> >> document containing "Lymphoid and a non-Lymphoid cell" I can initially
> >> think of is: the stopword "a" is probably present in post-analysis
> either
> >> of query or index. Did you tweak your index time analysis after
> indexing?
> >>
> >> Do two things:
> >>
> >>  1. Post the analysis screen for and index=*"Immunoregulatory
> >>  interactions between a Lymphoid and a non-Lymphoid cell"* and
> >> "query=*"lymphoid
> >>  and a non-lymphoid cell"*. Try hosting the image and providing the link
> >>  here.
> >>  2. Give the same JSON output as you have sent but this time with
> >>  *"echoParams=all"*. Also, post the exact Solr query url.
> >>
> >>
> >>
> >> On Wed, 6 Nov 2019 at 21:07, Erick Erickson 
> wrote:
> >>
> >>> I don’t see the attachments, maybe I deleted old e-mails or some such.
> The
> >>> Apache server is fairly aggressive about stripping attachments though,
> so
> >>> it’s also possible they didn’t make it through.
> >>>
> >>>> On Nov 6, 2019, at 9:28 AM, Guilherme Viteri 
> wrote:
> >>>>
> >>>> Thanks Erick.
> >>>>
> >>>>> First, your index and analysis chains are considerably different,
> this
> >>> can easily be a source of problems. In particular, using two different
> >>> tokenizers is a huge red flag. I _strongly_ recommend against this
> unless
> >>> you’re totally sure you understand the consequences. Additionally,
> your use
> >>> of the length filter is suspicious, especially since your problem
> statement
> >>> is about the addition of a single letter term and the min length
> allowed on
> >>> that filter is 2. That said, it’s reasonable to suppose that the ’a’ is
> >>> filtered out in both cases, but maybe you’ve found something odd about
> the
> >>> interactions.
> >>>> I will investigate the min length and post the resu

Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

2019-11-07 Thread Walter Underwood
Thanks for posting the files. Looking at schema.xml, I see that you still are 
using StopFilterFactory. The first advice we gave you was to remove that.

Remove StopFilterFactory everywhere and reindex.

You will continue to have problems matching stopwords until you do that.

In your edismax handlers, weights of 20, 50, and 100 are extremely high. I 
don’t think I’ve ever used a weight higher than 16 in a dozen years of 
configuring Solr.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)
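As a sketch of what removing the stop filter looks like, assuming a generic text field type (the type name, tokenizer, and remaining filters here are illustrative, not the thread's actual schema.xml):

```xml
<!-- Before: the analyzer contained a line like
     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
     After: that line is simply deleted from every analyzer in schema.xml. -->
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

After editing, reload the config and reindex all documents so that stopwords like "and" and "a" are actually present in the index and can match queries.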

> On Nov 7, 2019, at 6:56 AM, Guilherme Viteri  wrote:
> 
> Hi Paras, everyone
> 
> Thank you again for your inputs and suggestions. I sorry to hear you had 
> trouble with the attachments I will host it somewhere and share the links. 
> I don't tweak my index, I get the data from the graph database, create a 
> document as they are and save to solr.
> 
> So, I am sending the new analysis screen querying the way you suggested. Also 
> the results with params and solr query url.
> 
> During the process of querying what you asked I found something really weird 
> (at least for me). By accident, I ended up querying the using the default 
> handler (/select) and it worked. Then If I use the one I must use, then sadly 
> doesn't work. I am posting both results and I will also post the handlers as 
> well.
> 
> Here is the link with all the files mentioned before
> https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0 
> <https://www.dropbox.com/sh/fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a?dl=0>
> If the link doesn't work www dot dropbox dot com slash sh slash 
> fymfm1q94zum1lx/AADwU1c9EUf2A4d7FtzSKR54a ? dl equals 0
> 
> Thanks
> 
>> On 7 Nov 2019, at 05:23, Paras Lehana  wrote:
>> 
>> Hi Guilherme.
>> 
>> I am sending they analysis result and the json result as requested.
>> 
>> 
>> Thanks for the effort. Luckily, I can see your attachments (low quality
>> though).
>> 
>> From the analysis screen, the analysis is working as expected. One of the
>> reasons for query="lymphoid and *a* non-lymphoid cell" not matching
>> document containing "Lymphoid and a non-Lymphoid cell" I can initially
>> think of is: the stopword "a" is probably present in post-analysis either
>> of query or index. Did you tweak your index time analysis after indexing?
>> 
>> Do two things:
>> 
>>  1. Post the analysis screen for and index=*"Immunoregulatory
>>  interactions between a Lymphoid and a non-Lymphoid cell"* and
>> "query=*"lymphoid
>>  and a non-lymphoid cell"*. Try hosting the image and providing the link
>>  here.
>>  2. Give the same JSON output as you have sent but this time with
>>  *"echoParams=all"*. Also, post the exact Solr query url.
>> 
>> 
>> 
>> On Wed, 6 Nov 2019 at 21:07, Erick Erickson  wrote:
>> 
>>> I don’t see the attachments, maybe I deleted old e-mails or some such. The
>>> Apache server is fairly aggressive about stripping attachments though, so
>>> it’s also possible they didn’t make it through.
>>> 
>>>> On Nov 6, 2019, at 9:28 AM, Guilherme Viteri  wrote:
>>>> 
>>>> Thanks Erick.
>>>> 
>>>>> First, your index and analysis chains are considerably different, this
>>> can easily be a source of problems. In particular, using two different
>>> tokenizers is a huge red flag. I _strongly_ recommend against this unless
>>> you’re totally sure you understand the consequences. Additionally, your use
>>> of the length filter is suspicious, especially since your problem statement
>>> is about the addition of a single letter term and the min length allowed on
>>> that filter is 2. That said, it’s reasonable to suppose that the ’a’ is
>>> filtered out in both cases, but maybe you’ve found something odd about the
>>> interactions.
>>>> I will investigate the min length and post the results later.
>>>> 
>>>>> Second, I have no idea what this will do. Are the equal signs typos?
>>> Used by custom code?
>>>> This is the URL in my application, not Solr params. That's the query string.
>>>> 
>>>>> What does “species=“ do? That’s not Solr syntax, so it’s likely that
>>> all the params with an equal-sign are totally ignored unless it’s just a
>>> typo.
>>>> This is part of the application. Species will be used later on in solr
>>> to filter out the result. That's not solr. That my app params.
>>>> 
>>>>> Third, the easiest way to see what

Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

2019-11-07 Thread Guilherme Viteri
ur use
>> of the length filter is suspicious, especially since your problem statement
>> is about the addition of a single letter term and the min length allowed on
>> that filter is 2. That said, it’s reasonable to suppose that the ’a’ is
>> filtered out in both cases, but maybe you’ve found something odd about the
>> interactions.
>>>> 
>>>> Second, I have no idea what this will do. Are the equal signs typos?
>> Used by custom code?
>>>> 
>>>>>> 
>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell=Homo+sapiens=Entries+without+species=true
>>>> 
>>>> What does “species=“ do? That’s not Solr syntax, so it’s likely that
>> all the params with an equal-sign are totally ignored unless it’s just a
>> typo.
>>>> 
>>>> Third, the easiest way to see what’s happening under the covers is to
>> add “&debug=true” to the query and look at the parsed query. Ignore all the
>> relevance calculations for the nonce, or specify “&debug=query” to skip
>> that part.
>>>> 
>>>> 90% + of the time, the question “why didn’t this query do what I
>> expect” is answered by looking at the “&debug=query” output and the
>> analysis page in the admin UI. NOTE: for the analysis page be sure to look
>> at _both_ the query and index output. Also, and very important about the
>> analysis page (and this is confusing) is that this _assumes_ that what you
>> put in the text boxes have made it through the query parser intact and is
>> analyzed by the field selected. Consider the search "q=field:word1 word2".
>> Now you type “word1 word2” into the analysis text box and it looks like
>> what you expect. That’s misleading because the query is _parsed_ as
>> "field:word1 default_search_field:word2”. This is where “&debug=query”
>> helps.
>>>> 
>>>> Best,
>>>> Erick
>>>> 
>>>>> On Nov 6, 2019, at 2:36 AM, Paras Lehana 
>> wrote:
>>>>> 
>>>>> Hi Walter,
>>>>> 
>>>>> The solr.StopFilter removes all tokens that are stopwords. Those words
>> will
>>>>>> not be in the index, so they can never match a query.
>>>>> 
>>>>> 
>>>>> I think the OP's concern is different results when adding a stopword. I
>>>>> think he's using the filter factory correctly - the query chain
>> includes
>>>>> the filter as well so it should remove "a" while querying.
>>>>> 
>>>>> *@Guilherme*, please post results for both the query, the document in
>>>>> result you are concerned about and post full result of analysis screen
>> (for
>>>>> both query and index).
>>>>> 
>>>>> On Tue, 5 Nov 2019 at 21:38, Walter Underwood 
>> wrote:
>>>>> 
>>>>>> No.
>>>>>> 
>>>>>> The solr.StopFilter removes all tokens that are stopwords. Those words
>>>>>> will not be in the index, so they can never match a query.
>>>>>> 
>>>>>> 1. Remove the lines with solr.StopFilter from every analysis chain in
>>>>>> schema.xml.
>>>>>> 2. Reload the collection, restart Solr, or whatever to read the new
>> config.
>>>>>> 3. Reindex all of the documents.
>>>>>> 
>>>>>> When indexed with the new analysis chain, the stopwords will not be
>>>>>> removed and they will be searchable.
>>>>>> 
>>>>>> wunder
>>>>>> Walter Underwood
>>>>>> wun...@wunderwood.org
>>>>>> http://observer.wunderwood.org/  (my blog)
>>>>>> 
>>>>>>> On Nov 5, 2019, at 8:56 AM, Guilherme Viteri 
>> wrote:
>>>>>>> 
>>>>>>> Ok. I am kind of lost now.
>>>>>>> If I open up the console > analysis and perform it, that's the final
>>>>>> result.
>>>>>>> 
>>>>>>> 
>>>>>>> Your suggestion is: get rid of the StopFilterFactory entry in the
>>>>>> schema.xml and during index phase replaceAll("in stopwords.txt"," ")
>> then
>>>>>> add to solr. Is that correct ?
>>>>>>> 
>>>>>>> Thanks David
>>>>>>> 
>>>>>>>> On 5 Nov 2019, at 14:48, David Hastings <
>> hastings.recurs...@gmail.com
>>>>

Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

2019-11-06 Thread Paras Lehana
 this _assumes_ that what you
> put in the text boxes have made it through the query parser intact and is
> analyzed by the field selected. Consider the search "q=field:word1 word2".
> Now you type “word1 word2” into the analysis text box and it looks like
> what you expect. That’s misleading because the query is _parsed_ as
> "field:word1 default_search_field:word2”. This is where “&debug=query”
> helps.
> >>
> >> Best,
> >> Erick
> >>
> >>> On Nov 6, 2019, at 2:36 AM, Paras Lehana 
> wrote:
> >>>
> >>> Hi Walter,
> >>>
> >>> The solr.StopFilter removes all tokens that are stopwords. Those words
> will
> >>>> not be in the index, so they can never match a query.
> >>>
> >>>
> >>> I think the OP's concern is different results when adding a stopword. I
> >>> think he's using the filter factory correctly - the query chain
> includes
> >>> the filter as well so it should remove "a" while querying.
> >>>
> >>> *@Guilherme*, please post results for both the query, the document in
> >>> result you are concerned about and post full result of analysis screen
> (for
> >>> both query and index).
> >>>
> >>> On Tue, 5 Nov 2019 at 21:38, Walter Underwood 
> wrote:
> >>>
> >>>> No.
> >>>>
> >>>> The solr.StopFilter removes all tokens that are stopwords. Those words
> >>>> will not be in the index, so they can never match a query.
> >>>>
> >>>> 1. Remove the lines with solr.StopFilter from every analysis chain in
> >>>> schema.xml.
> >>>> 2. Reload the collection, restart Solr, or whatever to read the new
> config.
> >>>> 3. Reindex all of the documents.
> >>>>
> >>>> When indexed with the new analysis chain, the stopwords will not be
> >>>> removed and they will be searchable.
> >>>>
> >>>> wunder
> >>>> Walter Underwood
> >>>> wun...@wunderwood.org
> >>>> http://observer.wunderwood.org/  (my blog)
> >>>>
> >>>>> On Nov 5, 2019, at 8:56 AM, Guilherme Viteri 
> wrote:
> >>>>>
> >>>>> Ok. I am kind of lost now.
> >>>>> If I open up the console > analysis and perform it, that's the final
> >>>> result.
> >>>>> 
> >>>>>
> >>>> Your suggestion is: get rid of the StopFilterFactory entry in the
> >>>> schema.xml and during index phase replaceAll("in stopwords.txt"," ")
> then
> >>>> add to solr. Is that correct ?
> >>>>>
> >>>>> Thanks David
> >>>>>
> >>>>>> On 5 Nov 2019, at 14:48, David Hastings <
> hastings.recurs...@gmail.com
> >>>> <mailto:hastings.recurs...@gmail.com>> wrote:
> >>>>>>
> >>>>>> Fwd to another server
> >>>>>>
> >>>>>> no,
> >>>>>> <filter class="solr.StopFilterFactory" … words="stopwords.txt"/>
> >>>>>>
> >>>>>> is still using stopwords and should be removed, in my opinion of
> course,
> >>>>>> based on your use case may be different, but i generally axe any
> >>>> reference
> >>>>>> to them at all
> >>>>>>
> >>>>>> On Tue, Nov 5, 2019 at 9:47 AM Guilherme Viteri  >>>> <mailto:gvit...@ebi.ac.uk>> wrote:
> >>>>>>
> >>>>>>> Thanks.
> >>>>>>> Haven't I done this here ?
> >>>>>>> <fieldType name="…" class="…"
> >>>>>>> positionIncrementGap="100" omitNorms="false" >
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> <filter class="solr.LengthFilterFactory" min="2" max="20"/>
> >>>>>>>
> >>>>>>> <filter class="solr.StopFilterFactory" … words="stopwords.txt"/>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>> On 5 Nov 2019, at 14:15, David Hastings <
> hastings.recurs...@gmail.com
> >>>> <mailto:hastings.recurs...@gmail.com>>
> >>>&

Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

2019-11-06 Thread Erick Erickson
I don’t see the attachments, maybe I deleted old e-mails or some such. The 
Apache server is fairly aggressive about stripping attachments though, so it’s 
also possible they didn’t make it through.

> On Nov 6, 2019, at 9:28 AM, Guilherme Viteri  wrote:
> 
> Thanks Erick.
> 
>> First, your index and analysis chains are considerably different, this can 
>> easily be a source of problems. In particular, using two different 
>> tokenizers is a huge red flag. I _strongly_ recommend against this unless 
>> you’re totally sure you understand the consequences. Additionally, your use 
>> of the length filter is suspicious, especially since your problem statement 
>> is about the addition of a single letter term and the min length allowed on 
>> that filter is 2. That said, it’s reasonable to suppose that the ’a’ is 
>> filtered out in both cases, but maybe you’ve found something odd about the 
>> interactions.
> I will investigate the min length and post the results later.
> 
>> Second, I have no idea what this will do. Are the equal signs typos? Used by 
>> custom code?
> This is the URL in my application, not Solr params. That's the query string.
> 
>> What does “species=“ do? That’s not Solr syntax, so it’s likely that all the 
>> params with an equal-sign are totally ignored unless it’s just a typo.
> This is part of the application. Species will be used later on in Solr to 
> filter the results. That's not Solr; those are my app params.
> 
>> Third, the easiest way to see what’s happening under the covers is to add 
>> “&debug=true” to the query and look at the parsed query. Ignore all the 
>> relevance calculations for the nonce, or specify “&debug=query” to skip that 
>> part. 
> The two JSON files I've sent were produced with debugQuery=on, and the
> explain tag is present.
> I will try searching the way you mentioned.
> 
> Thank for your inputs
> 
> Guilherme
> 
>> On 6 Nov 2019, at 14:14, Erick Erickson  wrote:
>> 
>> Fwd to another server
>> 
>> First, your index and analysis chains are considerably different, this can 
>> easily be a source of problems. In particular, using two different 
>> tokenizers is a huge red flag. I _strongly_ recommend against this unless 
>> you’re totally sure you understand the consequences. Additionally, your use 
>> of the length filter is suspicious, especially since your problem statement 
>> is about the addition of a single letter term and the min length allowed on 
>> that filter is 2. That said, it’s reasonable to suppose that the ’a’ is 
>> filtered out in both cases, but maybe you’ve found something odd about the 
>> interactions.
>> 
>> Second, I have no idea what this will do. Are the equal signs typos? Used by 
>> custom code?
>> 
>>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell=Homo+sapiens=Entries+without+species=true
>> 
>> What does “species=“ do? That’s not Solr syntax, so it’s likely that all the 
>> params with an equal-sign are totally ignored unless it’s just a typo.
>> 
>> Third, the easiest way to see what’s happening under the covers is to add 
>> “&debug=true” to the query and look at the parsed query. Ignore all the 
>> relevance calculations for the nonce, or specify “&debug=query” to skip that 
>> part. 
>> 
>> 90% + of the time, the question “why didn’t this query do what I expect” is 
>> answered by looking at the “&debug=query” output and the analysis page in 
>> the admin UI. NOTE: for the analysis page be sure to look at _both_ the 
>> query and index output. Also, and very important about the analysis page 
>> (and this is confusing) is that this _assumes_ that what you put in the text 
>> boxes have made it through the query parser intact and is analyzed by the 
>> field selected. Consider the search "q=field:word1 word2". Now you type 
>> “word1 word2” into the analysis text box and it looks like what you expect. 
>> That’s misleading because the query is _parsed_ as "field:word1 
>> default_search_field:word2”. This is where “=query” helps.
>> 
>> Best,
>> Erick
>> 
>>> On Nov 6, 2019, at 2:36 AM, Paras Lehana  wrote:
>>> 
>>> Hi Walter,
>>> 
>>> The solr.StopFilter removes all tokens that are stopwords. Those words will
>>>> not be in the index, so they can never match a query.
>>> 
>>> 
>>> I think the OP's concern is different results when adding a stopword. I
>>> think he's using the filter factory correctly - the query chain includes
>>> the filter as well so it should remove "a" while querying.
>>>

Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

2019-11-06 Thread Guilherme Viteri
Thanks Erick.

> First, your index and analysis chains are considerably different, this can 
> easily be a source of problems. In particular, using two different tokenizers 
> is a huge red flag. I _strongly_ recommend against this unless you’re totally 
> sure you understand the consequences. Additionally, your use of the length 
> filter is suspicious, especially since your problem statement is about the 
> addition of a single letter term and the min length allowed on that filter is 
> 2. That said, it’s reasonable to suppose that the ’a’ is filtered out in both 
> cases, but maybe you’ve found something odd about the interactions.
I will investigate the min length and post the results later.

> Second, I have no idea what this will do. Are the equal signs typos? Used by 
> custom code?
This is the URL in my application, not Solr params. That's the query string.

> What does “species=“ do? That’s not Solr syntax, so it’s likely that all the 
> params with an equal-sign are totally ignored unless it’s just a typo.
This is part of the application. Species will be used later on in Solr to 
filter out the results. That's not Solr syntax; those are my app's params.

> Third, the easiest way to see what’s happening under the covers is to add 
> “debug=true” to the query and look at the parsed query. Ignore all the 
> relevance calculations for the nonce, or specify “debug=query” to skip that 
> part. 
The two JSON files I've sent were produced with debugQuery=on and the explain 
tag is present.
I will try searching the way you mentioned.

Thanks for your inputs

Guilherme

> On 6 Nov 2019, at 14:14, Erick Erickson  wrote:
> 
> Fwd to another server
> 
> First, your index and analysis chains are considerably different, this can 
> easily be a source of problems. In particular, using two different tokenizers 
> is a huge red flag. I _strongly_ recommend against this unless you’re totally 
> sure you understand the consequences. Additionally, your use of the length 
> filter is suspicious, especially since your problem statement is about the 
> addition of a single letter term and the min length allowed on that filter is 
> 2. That said, it’s reasonable to suppose that the ’a’ is filtered out in both 
> cases, but maybe you’ve found something odd about the interactions.
> 
> Second, I have no idea what this will do. Are the equal signs typos? Used by 
> custom code?
> 
>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell=Homo+sapiens=Entries+without+species=true
> 
> What does “species=“ do? That’s not Solr syntax, so it’s likely that all the 
> params with an equal-sign are totally ignored unless it’s just a typo.
> 
> Third, the easiest way to see what’s happening under the covers is to add 
> “debug=true” to the query and look at the parsed query. Ignore all the 
> relevance calculations for the nonce, or specify “debug=query” to skip that 
> part. 
> 
> 90%+ of the time, the question “why didn’t this query do what I expect” is 
> answered by looking at the “debug=query” output and the analysis page in the 
> admin UI. NOTE: for the analysis page be sure to look at _both_ the query and 
> index output. Also, and very important about the analysis page (and this is 
> confusing) is that this _assumes_ that what you put in the text boxes has 
> made it through the query parser intact and is analyzed by the field 
> selected. Consider the search "q=field:word1 word2". Now you type “word1 
> word2” into the analysis text box and it looks like what you expect. That’s 
> misleading because the query is _parsed_ as "field:word1 
> default_search_field:word2”. This is where “debug=query” helps.
> 
> Best,
> Erick
> 
>> On Nov 6, 2019, at 2:36 AM, Paras Lehana  wrote:
>> 
>> Hi Walter,
>> 
>> The solr.StopFilter removes all tokens that are stopwords. Those words will
>>> not be in the index, so they can never match a query.
>> 
>> 
>> I think the OP's concern is different results when adding a stopword. I
>> think he's using the filter factory correctly - the query chain includes
>> the filter as well so it should remove "a" while querying.
>> 
>> *@Guilherme*, please post results for both the query, the document in
>> result you are concerned about and post full result of analysis screen (for
>> both query and index).
>> 
>> On Tue, 5 Nov 2019 at 21:38, Walter Underwood  wrote:
>> 
>>> No.
>>> 
>>> The solr.StopFilter removes all tokens that are stopwords. Those words
>>> will not be in the index, so they can never match a query.
>>> 
>>> 1. Remove the lines with solr.StopFilter from every analysis chain in
>>> schema.xml.
>>> 2. Reload the collection, restart Solr, or whatever to read the new config.

Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

2019-11-06 Thread Erick Erickson
First, your index and analysis chains are considerably different, this can 
easily be a source of problems. In particular, using two different tokenizers 
is a huge red flag. I _strongly_ recommend against this unless you’re totally 
sure you understand the consequences. Additionally, your use of the length 
filter is suspicious, especially since your problem statement is about the 
addition of a single letter term and the min length allowed on that filter is 
2. That said, it’s reasonable to suppose that the ’a’ is filtered out in both 
cases, but maybe you’ve found something odd about the interactions.
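
To make this point concrete, here is a rough Python simulation of the two filters involved. It is not Lucene itself; the min/max lengths come from the remark above and the schema fragments quoted elsewhere in the thread, and the stopword set is a subset of the stopwords.txt posted below:

```python
# Rough simulation of the two token filters that can both drop "a":
# a length filter (min=2, max=20) and a stop filter (stopwords.txt).
STOPWORDS = {"a", "an", "and", "are"}  # subset of the thread's stopwords.txt

def length_filter(tokens, min_len=2, max_len=20):
    # Drops tokens outside [min_len, max_len], like solr.LengthFilterFactory.
    return [t for t in tokens if min_len <= len(t) <= max_len]

def stop_filter(tokens, stopwords=STOPWORDS):
    # Drops stopword tokens, like solr.StopFilterFactory.
    return [t for t in tokens if t.lower() not in stopwords]

query = "lymphoid and a non-lymphoid cell".split()
print(stop_filter(length_filter(query)))
# -> ['lymphoid', 'non-lymphoid', 'cell']
# 'a' is removed by the length filter before the stop filter even sees it.
```

Either filter alone would drop 'a', which is why it is hard to tell from results alone which one is responsible.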

Second, I have no idea what this will do. Are the equal signs typos? Used by 
custom code?

>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell=Homo+sapiens=Entries+without+species=true

What does “species=“ do? That’s not Solr syntax, so it’s likely that all the 
params with an equal-sign are totally ignored unless it’s just a typo.

Third, the easiest way to see what’s happening under the covers is to add 
“debug=true” to the query and look at the parsed query. Ignore all the 
relevance calculations for the nonce, or specify “debug=query” to skip that 
part. 

90%+ of the time, the question “why didn’t this query do what I expect” is 
answered by looking at the “debug=query” output and the analysis page in the 
admin UI. NOTE: for the analysis page be sure to look at _both_ the query and 
index output. Also, and very important about the analysis page (and this is 
confusing) is that this _assumes_ that what you put in the text boxes has made 
it through the query parser intact and is analyzed by the field selected. 
Consider the search "q=field:word1 word2". Now you type “word1 word2” into the 
analysis text box and it looks like what you expect. That’s misleading because 
the query is _parsed_ as "field:word1 default_search_field:word2”. This is 
where “debug=query” helps.
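
As a concrete (hypothetical) example of such a request, here is a small stdlib sketch that builds the URL; the host, port, and collection name are made up, but the debug parameter itself is standard Solr:

```python
from urllib.parse import urlencode

# Hypothetical host/collection. debug=query returns the parsed query in the
# response's "debug" section without the per-document relevance explanations.
params = {
    "q": '"lymphoid and a non-lymphoid cell"',
    "debug": "query",
    "wt": "json",
}
url = "http://localhost:8983/solr/mycollection/select?" + urlencode(params)
print(url)
# Inspect "parsedquery" under "debug" in the JSON response.
```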

Best,
Erick

> On Nov 6, 2019, at 2:36 AM, Paras Lehana  wrote:
> 
> Hi Walter,
> 
> The solr.StopFilter removes all tokens that are stopwords. Those words will
>> not be in the index, so they can never match a query.
> 
> 
> I think the OP's concern is different results when adding a stopword. I
> think he's using the filter factory correctly - the query chain includes
> the filter as well so it should remove "a" while querying.
> 
> *@Guilherme*, please post results for both the query, the document in
> result you are concerned about and post full result of analysis screen (for
> both query and index).
> 
> On Tue, 5 Nov 2019 at 21:38, Walter Underwood  wrote:
> 
>> No.
>> 
>> The solr.StopFilter removes all tokens that are stopwords. Those words
>> will not be in the index, so they can never match a query.
>> 
>> 1. Remove the lines with solr.StopFilter from every analysis chain in
>> schema.xml.
>> 2. Reload the collection, restart Solr, or whatever to read the new config.
>> 3. Reindex all of the documents.
>> 
>> When indexed with the new analysis chain, the stopwords will not be
>> removed and they will be searchable.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>>> On Nov 5, 2019, at 8:56 AM, Guilherme Viteri  wrote:
>>> 
>>> OK. I am kind of lost now.
>>> If I open up the console > analysis and perform it, that's the final
>> result.
>>> 
>>> 
>>> Your suggestion is: get rid of the  in the
>> schema.xml and during index phase replaceAll("in stopwords.txt"," ") then
>> add to solr. Is that correct ?
>>> 
>>> Thanks David
>>> 
>>>> On 5 Nov 2019, at 14:48, David Hastings > <mailto:hastings.recurs...@gmail.com>> wrote:
>>>> 
>>>> Fwd to another server
>>>> 
>>>> no,
>>>>  >>> words="stopwords.txt"/>
>>>> 
>>>> is still using stopwords and should be removed, in my opinion of course,
>>>> based on your use case may be different, but i generally axe any
>> reference
>>>> to them at all
>>>> 
>>>> On Tue, Nov 5, 2019 at 9:47 AM Guilherme Viteri > <mailto:gvit...@ebi.ac.uk>> wrote:
>>>> 
>>>>> Thanks.
>>>>> Haven't I done this here ?
>>>>> >>>> positionIncrementGap="100" omitNorms="false" >
>>>>>  
>>>>>  
>>>>>  
>>>>>  > max="20"/>
>>>>>  
>>>>>  >>>> words=&qu

Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

2019-11-05 Thread Paras Lehana
Hi Walter,

The solr.StopFilter removes all tokens that are stopwords. Those words will
> not be in the index, so they can never match a query.


I think the OP's concern is different results when adding a stopword. I
think he's using the filter factory correctly - the query chain includes
the filter as well so it should remove "a" while querying.

 *@Guilherme*, please post results for both the query, the document in
result you are concerned about and post full result of analysis screen (for
both query and index).

On Tue, 5 Nov 2019 at 21:38, Walter Underwood  wrote:

> No.
>
> The solr.StopFilter removes all tokens that are stopwords. Those words
> will not be in the index, so they can never match a query.
>
> 1. Remove the lines with solr.StopFilter from every analysis chain in
> schema.xml.
> 2. Reload the collection, restart Solr, or whatever to read the new config.
> 3. Reindex all of the documents.
>
> When indexed with the new analysis chain, the stopwords will not be
> removed and they will be searchable.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Nov 5, 2019, at 8:56 AM, Guilherme Viteri  wrote:
> >
> > OK. I am kind of lost now.
> > If I open up the console > analysis and perform it, that's the final
> result.
> >  
> >
> > Your suggestion is: get rid of the  in the
> schema.xml and during index phase replaceAll("in stopwords.txt"," ") then
> add to solr. Is that correct ?
> >
> > Thanks David
> >
> >> On 5 Nov 2019, at 14:48, David Hastings  <mailto:hastings.recurs...@gmail.com>> wrote:
> >>
> >> Fwd to another server
> >>
> >> no,
> >>>> words="stopwords.txt"/>
> >>
> >> is still using stopwords and should be removed, in my opinion of course,
> >> based on your use case may be different, but i generally axe any
> reference
> >> to them at all
> >>
> >> On Tue, Nov 5, 2019 at 9:47 AM Guilherme Viteri  <mailto:gvit...@ebi.ac.uk>> wrote:
> >>
> >>> Thanks.
> >>> Haven't I done this here ?
> >>>   >>> positionIncrementGap="100" omitNorms="false" >
> >>>   
> >>>   
> >>>   
> >>>max="20"/>
> >>>   
> >>>>>> words="stopwords.txt"/>
> >>>   
> >>>
> >>>
> >>>> On 5 Nov 2019, at 14:15, David Hastings  <mailto:hastings.recurs...@gmail.com>>
> >>> wrote:
> >>>>
> >>>> Fwd to another server
> >>>>
> >>>> The first thing you should do is remove any reference to stop words
> and
> >>>> never use them, then re-index your data and try it again.
> >>>>
> >>>> On Tue, Nov 5, 2019 at 9:14 AM Guilherme Viteri  <mailto:gvit...@ebi.ac.uk>>
> >>> wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> I am performing a search to match a name (text_field), however this
> term
> >>>>> contains 'and' and 'a' and it doesn't return any records. If i remove
> >>> 'a'
> >>>>> then it works.
> >>>>> e.g
> >>>>> Search Term: lymphoid and a non-lymphoid cell
> >>>>> doesn't work:
> >>>>>
> >>>
> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell=Homo+sapiens=Entries+without+species=true
> <
> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell=Homo+sapiens=Entries+without+species=true
> >
> >>>>> <
> >>>>>
> >>>
> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell=Homo+sapiens=Entries+without+species=true
> >>>>>>
> >>>>>
> >>>>> Search term: lymphoid and non-lymphoid cell
> >>>>> works:
> >>>>>
> >>>
> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell=Homo+sapiens=Entries+without+species=true
> >>>>> <
> >>>>>
> >>>
> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell=Homo+sapiens=Entries+without+species=true
> >>>>>>
> >>>>> interested in the first result
> >>>>>
> >>>>> schema.xml

Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

2019-11-05 Thread Walter Underwood
No.

The solr.StopFilter removes all tokens that are stopwords. Those words will not 
be in the index, so they can never match a query.

1. Remove the lines with solr.StopFilter from every analysis chain in 
schema.xml.
2. Reload the collection, restart Solr, or whatever to read the new config.
3. Reindex all of the documents.

When indexed with the new analysis chain, the stopwords will not be removed and 
they will be searchable.
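
Concretely, after step 1 a field type might look roughly like this. This is only a sketch: the fieldType name, tokenizer, and remaining filter are placeholders, not the poster's actual schema; the point is simply that no solr.StopFilterFactory line remains in either analyzer chain:

```xml
<fieldType name="text_no_stop" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- no StopFilterFactory here -->
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- and none here either -->
  </analyzer>
</fieldType>
```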

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Nov 5, 2019, at 8:56 AM, Guilherme Viteri  wrote:
> 
> OK. I am kind of lost now.
> If I open up the console > analysis and perform it, that's the final result.
>  
> 
> Your suggestion is: get rid of the  in the schema.xml 
> and during index phase replaceAll("in stopwords.txt"," ") then add to solr. 
> Is that correct ?
> 
> Thanks David
> 
>> On 5 Nov 2019, at 14:48, David Hastings > <mailto:hastings.recurs...@gmail.com>> wrote:
>> 
>> Fwd to another server
>> 
>> no,
>>   > words="stopwords.txt"/>
>> 
>> is still using stopwords and should be removed, in my opinion of course,
>> based on your use case may be different, but i generally axe any reference
>> to them at all
>> 
>> On Tue, Nov 5, 2019 at 9:47 AM Guilherme Viteri > <mailto:gvit...@ebi.ac.uk>> wrote:
>> 
>>> Thanks.
>>> Haven't I done this here ?
>>>  >> positionIncrementGap="100" omitNorms="false" >
>>>   
>>>   
>>>   
>>>   
>>>   
>>>   >> words="stopwords.txt"/>
>>>   
>>> 
>>> 
>>>> On 5 Nov 2019, at 14:15, David Hastings >>> <mailto:hastings.recurs...@gmail.com>>
>>> wrote:
>>>> 
>>>> Fwd to another server
>>>> 
>>>> The first thing you should do is remove any reference to stop words and
>>>> never use them, then re-index your data and try it again.
>>>> 
>>>> On Tue, Nov 5, 2019 at 9:14 AM Guilherme Viteri >>> <mailto:gvit...@ebi.ac.uk>>
>>> wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> I am performing a search to match a name (text_field), however this term
>>>>> contains 'and' and 'a' and it doesn't return any records. If i remove
>>> 'a'
>>>>> then it works.
>>>>> e.g
>>>>> Search Term: lymphoid and a non-lymphoid cell
>>>>> doesn't work:
>>>>> 
>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell=Homo+sapiens=Entries+without+species=true
>>>  
>>> <https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell=Homo+sapiens=Entries+without+species=true>
>>>>> <
>>>>> 
>>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell=Homo+sapiens=Entries+without+species=true
>>>>>> 
>>>>> 
>>>>> Search term: lymphoid and non-lymphoid cell
>>>>> works:
>>>>> 
>>> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell=Homo+sapiens=Entries+without+species=true
>>>>> <
>>>>> 
>>> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell=Homo+sapiens=Entries+without+species=true
>>>>>> 
>>>>> interested in the first result
>>>>> 
>>>>> schema.xml
>>>>> >>>> indexed="true"  stored="true"   omitNorms="false"   required="true"
>>>>> multiValued="false"/>
>>>>> 
>>>>>   
>>>>>   >>>> pattern="[^a-zA-Z0-9/._:]"/>
>>>>>   >>>> pattern="^[/._:]+" replacement=""/>
>>>>>   >>>> pattern="[/._:]+$" replacement=""/>
>>>>>   >>>> pattern="[_]" replacement=" "/>
>>>>>   >> max="20"/>
>>>>>   
>>>>>   >>>> words="stopwords.txt"/>
>>>>>   
>>>>> 
>>>>>   >>>> positionIncrementGap="100" omitNorms="false" >
>>>>>   
>>>>>   
>>>>>   
>>>>>   >> max="20"/>
>>>>>   
>>>>>   >>>> words="stopwords.txt"/>
>>>>>   
>>>>>   
>>>>>   >>>> pattern="[^a-zA-Z0-9/._:]"/>
>>>>>   >>>> pattern="^[/._:]+" replacement=""/>
>>>>>   >>>> pattern="[/._:]+$" replacement=""/>
>>>>>   >>>> pattern="[_]" replacement=" "/>
>>>>>   >> max="20"/>
>>>>>   
>>>>>   >>>> words="stopwords.txt"/>
>>>>>   
>>>>>   
>>>>> 
>>>>> stopwords.txt
>>>>> #Standard english stop words taken from Lucene's StopAnalyzer
>>>>> a
>>>>> b
>>>>> c
>>>>> 
>>>>> an
>>>>> and
>>>>> are
>>>>> 
>>>>> Running SolR 6.6.2.
>>>>> 
>>>>> Is there anything I could do to prevent this ?
>>>>> 
>>>>> Thanks
>>>>> Guilherme
>>> 
>>> 
> 



Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

2019-11-05 Thread Guilherme Viteri
OK. I am kind of lost now.
If I open up the console > analysis and perform it, that's the final result.
 

Your suggestion is: get rid of the stopword filter in the schema.xml and 
during the index phase replaceAll("in stopwords.txt", " "), then add to Solr. 
Is that correct?

Thanks David

> On 5 Nov 2019, at 14:48, David Hastings  wrote:
> 
> Fwd to another server
> 
> no,
>words="stopwords.txt"/>
> 
> is still using stopwords and should be removed, in my opinion of course,
> based on your use case may be different, but i generally axe any reference
> to them at all
> 
> On Tue, Nov 5, 2019 at 9:47 AM Guilherme Viteri  wrote:
> 
>> Thanks.
>> Haven't I done this here ?
>>  > positionIncrementGap="100" omitNorms="false" >
>>   
>>   
>>   
>>   
>>   
>>   > words="stopwords.txt"/>
>>   
>> 
>> 
>>> On 5 Nov 2019, at 14:15, David Hastings 
>> wrote:
>>> 
>>> Fwd to another server
>>> 
>>> The first thing you should do is remove any reference to stop words and
>>> never use them, then re-index your data and try it again.
>>> 
>>> On Tue, Nov 5, 2019 at 9:14 AM Guilherme Viteri 
>> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> I am performing a search to match a name (text_field), however this term
>>>> contains 'and' and 'a' and it doesn't return any records. If i remove
>> 'a'
>>>> then it works.
>>>> e.g
>>>> Search Term: lymphoid and a non-lymphoid cell
>>>> doesn't work:
>>>> 
>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell=Homo+sapiens=Entries+without+species=true
>>>> <
>>>> 
>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell=Homo+sapiens=Entries+without+species=true
>>>>> 
>>>> 
>>>> Search term: lymphoid and non-lymphoid cell
>>>> works:
>>>> 
>> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell=Homo+sapiens=Entries+without+species=true
>>>> <
>>>> 
>> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell=Homo+sapiens=Entries+without+species=true
>>>>> 
>>>> interested in the first result
>>>> 
>>>> schema.xml
>>>> >>> indexed="true"  stored="true"   omitNorms="false"   required="true"
>>>> multiValued="false"/>
>>>> 
>>>>   
>>>>   >>> pattern="[^a-zA-Z0-9/._:]"/>
>>>>   >>> pattern="^[/._:]+" replacement=""/>
>>>>   >>> pattern="[/._:]+$" replacement=""/>
>>>>   >>> pattern="[_]" replacement=" "/>
>>>>   > max="20"/>
>>>>   
>>>>   >>> words="stopwords.txt"/>
>>>>   
>>>> 
>>>>   >>> positionIncrementGap="100" omitNorms="false" >
>>>>   
>>>>   
>>>>   
>>>>   > max="20"/>
>>>>   
>>>>   >>> words="stopwords.txt"/>
>>>>   
>>>>   
>>>>   >>> pattern="[^a-zA-Z0-9/._:]"/>
>>>>   >>> pattern="^[/._:]+" replacement=""/>
>>>>   >>> pattern="[/._:]+$" replacement=""/>
>>>>   >>> pattern="[_]" replacement=" "/>
>>>>   > max="20"/>
>>>>   
>>>>   >>> words="stopwords.txt"/>
>>>>   
>>>>   
>>>> 
>>>> stopwords.txt
>>>> #Standard english stop words taken from Lucene's StopAnalyzer
>>>> a
>>>> b
>>>> c
>>>> 
>>>> an
>>>> and
>>>> are
>>>> 
>>>> Running SolR 6.6.2.
>>>> 
>>>> Is there anything I could do to prevent this ?
>>>> 
>>>> Thanks
>>>> Guilherme
>> 
>> 



Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

2019-11-05 Thread David Hastings
no,
   

is still using stopwords and should be removed. That is my opinion, of course;
your use case may be different, but I generally axe any reference to them at
all.

On Tue, Nov 5, 2019 at 9:47 AM Guilherme Viteri  wrote:

> Thanks.
> Haven't I done this here ?
>positionIncrementGap="100" omitNorms="false" >
>
>
>
>
>
> words="stopwords.txt"/>
>
>
>
> > On 5 Nov 2019, at 14:15, David Hastings 
> wrote:
> >
> > Fwd to another server
> >
> > The first thing you should do is remove any reference to stop words and
> > never use them, then re-index your data and try it again.
> >
> > On Tue, Nov 5, 2019 at 9:14 AM Guilherme Viteri 
> wrote:
> >
> >> Hi,
> >>
> >> I am performing a search to match a name (text_field), however this term
> >> contains 'and' and 'a' and it doesn't return any records. If i remove
> 'a'
> >> then it works.
> >> e.g
> >> Search Term: lymphoid and a non-lymphoid cell
> >> doesn't work:
> >>
> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell=Homo+sapiens=Entries+without+species=true
> >> <
> >>
> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell=Homo+sapiens=Entries+without+species=true
> >>>
> >>
> >> Search term: lymphoid and non-lymphoid cell
> >> works:
> >>
> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell=Homo+sapiens=Entries+without+species=true
> >> <
> >>
> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell=Homo+sapiens=Entries+without+species=true
> >>>
> >> interested in the first result
> >>
> >> schema.xml
> >>  >> indexed="true"  stored="true"   omitNorms="false"   required="true"
> >> multiValued="false"/>
> >>
> >>
> >> >> pattern="[^a-zA-Z0-9/._:]"/>
> >> >> pattern="^[/._:]+" replacement=""/>
> >> >> pattern="[/._:]+$" replacement=""/>
> >> >> pattern="[_]" replacement=" "/>
> >> max="20"/>
> >>
> >> >> words="stopwords.txt"/>
> >>
> >>
> >> >> positionIncrementGap="100" omitNorms="false" >
> >>
> >>
> >>
> >> max="20"/>
> >>
> >> >> words="stopwords.txt"/>
> >>
> >>
> >> >> pattern="[^a-zA-Z0-9/._:]"/>
> >> >> pattern="^[/._:]+" replacement=""/>
> >> >> pattern="[/._:]+$" replacement=""/>
> >> >> pattern="[_]" replacement=" "/>
> >> max="20"/>
> >>
> >> >> words="stopwords.txt"/>
> >>
> >>
> >>
> >> stopwords.txt
> >> #Standard english stop words taken from Lucene's StopAnalyzer
> >> a
> >> b
> >> c
> >> 
> >> an
> >> and
> >> are
> >>
> >> Running SolR 6.6.2.
> >>
> >> Is there anything I could do to prevent this ?
> >>
> >> Thanks
> >> Guilherme
>
>


Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

2019-11-05 Thread Guilherme Viteri
Thanks.
Haven't I done this here?
  
   
   
   
   
   
   
   


> On 5 Nov 2019, at 14:15, David Hastings  wrote:
> 
> Fwd to another server
> 
> The first thing you should do is remove any reference to stop words and
> never use them, then re-index your data and try it again.
> 
> On Tue, Nov 5, 2019 at 9:14 AM Guilherme Viteri  wrote:
> 
>> Hi,
>> 
>> I am performing a search to match a name (text_field), however this term
>> contains 'and' and 'a' and it doesn't return any records. If i remove 'a'
>> then it works.
>> e.g
>> Search Term: lymphoid and a non-lymphoid cell
>> doesn't work:
>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell=Homo+sapiens=Entries+without+species=true
>> <
>> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell=Homo+sapiens=Entries+without+species=true
>>> 
>> 
>> Search term: lymphoid and non-lymphoid cell
>> works:
>> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell=Homo+sapiens=Entries+without+species=true
>> <
>> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell=Homo+sapiens=Entries+without+species=true
>>> 
>> interested in the first result
>> 
>> schema.xml
>> > indexed="true"  stored="true"   omitNorms="false"   required="true"
>> multiValued="false"/>
>> 
>>
>>> pattern="[^a-zA-Z0-9/._:]"/>
>>> pattern="^[/._:]+" replacement=""/>
>>> pattern="[/._:]+$" replacement=""/>
>>> pattern="[_]" replacement=" "/>
>>
>>
>>> words="stopwords.txt"/>
>>
>> 
>>> positionIncrementGap="100" omitNorms="false" >
>>
>>
>>
>>
>>
>>> words="stopwords.txt"/>
>>
>>
>>> pattern="[^a-zA-Z0-9/._:]"/>
>>> pattern="^[/._:]+" replacement=""/>
>>> pattern="[/._:]+$" replacement=""/>
>>> pattern="[_]" replacement=" "/>
>>
>>
>>> words="stopwords.txt"/>
>>
>>
>> 
>> stopwords.txt
>> #Standard english stop words taken from Lucene's StopAnalyzer
>> a
>> b
>> c
>> 
>> an
>> and
>> are
>> 
>> Running SolR 6.6.2.
>> 
>> Is there anything I could do to prevent this ?
>> 
>> Thanks
>> Guilherme



Re: When search term has two stopwords ('and' and 'a') together, it doesn't work

2019-11-05 Thread David Hastings
The first thing you should do is remove any reference to stop words and
never use them, then re-index your data and try it again.

On Tue, Nov 5, 2019 at 9:14 AM Guilherme Viteri  wrote:

> Hi,
>
> I am performing a search to match a name (text_field), however this term
> contains 'and' and 'a' and it doesn't return any records. If i remove 'a'
> then it works.
> e.g
> Search Term: lymphoid and a non-lymphoid cell
> doesn't work:
> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell=Homo+sapiens=Entries+without+species=true
> <
> https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell=Homo+sapiens=Entries+without+species=true
> >
>
> Search term: lymphoid and non-lymphoid cell
> works:
> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell=Homo+sapiens=Entries+without+species=true
> <
> https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell=Homo+sapiens=Entries+without+species=true
> >
> interested in the first result
>
> schema.xml
>indexed="true"  stored="true"   omitNorms="false"   required="true"
>  multiValued="false"/>
>
> 
>  pattern="[^a-zA-Z0-9/._:]"/>
>  pattern="^[/._:]+" replacement=""/>
>  pattern="[/._:]+$" replacement=""/>
>  pattern="[_]" replacement=" "/>
> 
> 
>  words="stopwords.txt"/>
> 
>
>  positionIncrementGap="100" omitNorms="false" >
> 
> 
> 
> 
> 
>  words="stopwords.txt"/>
> 
> 
>  pattern="[^a-zA-Z0-9/._:]"/>
>  pattern="^[/._:]+" replacement=""/>
>  pattern="[/._:]+$" replacement=""/>
>  pattern="[_]" replacement=" "/>
> 
> 
>  words="stopwords.txt"/>
> 
> 
>
> stopwords.txt
> #Standard english stop words taken from Lucene's StopAnalyzer
> a
> b
> c
> 
> an
> and
> are
>
> Running SolR 6.6.2.
>
> Is there anything I could do to prevent this ?
>
> Thanks
> Guilherme


When search term has two stopwords ('and' and 'a') together, it doesn't work

2019-11-05 Thread Guilherme Viteri
Hi,

I am performing a search to match a name (text_field); however, this term 
contains 'and' and 'a' and it doesn't return any records. If I remove 'a' then 
it works.
e.g.
Search Term: lymphoid and a non-lymphoid cell
doesn't work: 
https://dev.reactome.org/content/query?q=lymphoid+and+a+non-lymphoid+cell=Homo+sapiens=Entries+without+species=true
 


Search term: lymphoid and non-lymphoid cell
works: 
https://dev.reactome.org/content/query?q=lymphoid+and+non-lymphoid+cell=Homo+sapiens=Entries+without+species=true
 

interested in the first result

schema.xml
stopwords.txt
#Standard english stop words taken from Lucene's StopAnalyzer
a
b
c

an
and
are

Running Solr 6.6.2.

Is there anything I could do to prevent this?

Thanks 
Guilherme

Re: Identify stopwords using TF-IDF

2019-06-22 Thread Walter Underwood
I haven’t removed stopwords since 1996, when I joined Infoseek. What is your 
special case where you must remove them?

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jun 22, 2019, at 9:51 PM, akash jayaweera  
> wrote:
> 
> Hello Walter,
> 
> Thank you for the reply.
> But for some of my use-cases I need to identify stopwords, so I need a better
> way to identify domain-specific stopwords. I used TF-IDF to identify
> stopwords, but it has the issue I mentioned above.
> 
> Regards,
> *Akash Jayaweera.*
> 
> 
> E akash.jayawe...@gmail.com 
> M + 94 77 2472635 <+94%2077%20247%202635>
> 
> 
> On Sun, Jun 23, 2019 at 10:13 AM Walter Underwood 
> wrote:
> 
>> Don’t remove stopwords. That was a useful hack when we were running search
>> engines on 16-bit machines. These days, it causes more problems than it
>> solves.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>>> On Jun 22, 2019, at 8:14 PM, akash jayaweera 
>> wrote:
>>> 
>>> Hello All,
>>> I'm trying to identify stopwords for a non-English corpus using TF-IDF
>>> score. I calculated the score for each unique term in the corpus. But my
>>> question is how can I select stopwords using the score.
>>> For example if we have a corpus of football, term "football" get the
>> lowest
>>> TF-IDF score. But for my requirement I don't want to identify "football"
>> as
>>> a stopword.
>>> How can I clearly Identify stopword. Is there any other simple method to
>>> identify stopwords than TF-IDF score.
>>> 
>>> Regards,
>>> *Akash Jayaweera.*
>>> 
>>> 
>>> E akash.jayawe...@gmail.com 
>>> M + 94 77 2472635 <+94%2077%20247%202635>
>> 
>> 



Re: Identify stopwords using TF-IDF

2019-06-22 Thread akash jayaweera
Hello Walter,

Thank you for the reply.
But for some of my use-cases I need to identify stopwords, so I need a better
way to identify domain-specific stopwords. I used TF-IDF to identify
stopwords, but it has the issue I mentioned above.

Regards,
*Akash Jayaweera.*


E akash.jayawe...@gmail.com 
M + 94 77 2472635 <+94%2077%20247%202635>


On Sun, Jun 23, 2019 at 10:13 AM Walter Underwood 
wrote:

> Don’t remove stopwords. That was a useful hack when we were running search
> engines on 16-bit machines. These days, it causes more problems than it
> solves.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Jun 22, 2019, at 8:14 PM, akash jayaweera 
> wrote:
> >
> > Hello All,
> > I'm trying to identify stopwords for a non-English corpus using TF-IDF
> > score. I calculated the score for each unique term in the corpus. But my
> > question is how can I select stopwords using the score.
> > For example if we have a corpus of football, term "football" get the
> lowest
> > TF-IDF score. But for my requirement I don't want to identify "football"
> as
> > a stopword.
> > How can I clearly Identify stopword. Is there any other simple method to
> > identify stopwords than TF-IDF score.
> >
> > Regards,
> > *Akash Jayaweera.*
> >
> >
> > E akash.jayawe...@gmail.com 
> > M + 94 77 2472635
>
>


Re: Identify stopwords using TF-IDF

2019-06-22 Thread Walter Underwood
Don’t remove stopwords. That was a useful hack when we were running search 
engines on 16-bit machines. These days, it causes more problems than it solves.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jun 22, 2019, at 8:14 PM, akash jayaweera  
> wrote:
> 
> Hello All,
> I'm trying to identify stopwords for a non-English corpus using TF-IDF
> score. I calculated the score for each unique term in the corpus. But my
> question is how can I select stopwords using the score.
> For example if we have a corpus of football, term "football" get the lowest
> TF-IDF score. But for my requirement I don't want to identify "football" as
> a stopword.
> How can I clearly Identify stopword. Is there any other simple method to
> identify stopwords than TF-IDF score.
> 
> Regards,
> *Akash Jayaweera.*
> 
> 
> E akash.jayawe...@gmail.com 
> M + 94 77 2472635



Identify stopwords using TF-IDF

2019-06-22 Thread akash jayaweera
Hello All,
I'm trying to identify stopwords for a non-English corpus using TF-IDF
scores. I calculated the score for each unique term in the corpus, but my
question is how to select stopwords using the score.
For example, in a corpus about football, the term "football" gets the lowest
TF-IDF score, but for my requirement I don't want to identify "football" as
a stopword.
How can I clearly identify stopwords? Is there any other simple method to
identify stopwords than the TF-IDF score?

Regards,
*Akash Jayaweera.*


E akash.jayawe...@gmail.com 
M + 94 77 2472635
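A minimal sketch of the approach discussed in this thread: score every unique term by mean TF times IDF and take the lowest-scoring terms as stopword candidates. The tiny corpus and the threshold are made up for illustration; note how "match", a domain term appearing in every document, scores exactly like "the" — which is the problem Akash describes.

```python
import math
from collections import Counter

def stopword_candidates(docs, top_n=3):
    """Rank unique terms by mean TF * IDF across the corpus;
    the lowest-scoring terms are stopword candidates."""
    n_docs = len(docs)
    df = Counter()   # document frequency per term
    tf = Counter()   # total term frequency across the corpus
    for doc in docs:
        terms = doc.lower().split()
        tf.update(terms)
        df.update(set(terms))
    scores = {}
    for term in df:
        idf = math.log(n_docs / df[term])            # plain IDF
        scores[term] = (tf[term] / n_docs) * idf     # mean TF * IDF
    # terms appearing in every document get IDF = 0, hence score 0
    return sorted(scores, key=scores.get)[:top_n]

docs = [
    "the team won the match",
    "the keeper saved the penalty in the match",
    "the fans left before the match ended",
]
print(stopword_candidates(docs, top_n=2))
```

Both "the" and "match" come out with score 0, so a raw TF-IDF cutoff cannot separate a genuine stopword from a ubiquitous domain term; comparing document frequencies against a general-language reference corpus is one common way around that.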


Re: StopWords behavior with phrases

2019-05-21 Thread Jan Høydahl
Well perhaps you don't need to remove stopwords at all? :)
Or a middle ground is to NOT remove stopwords in your 'index' analyzer; then
you have the flexibility of removing them on the query side. Thus if you pass
stopwords=false on your call, perhaps that works?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
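Jan's middle ground can be exercised per request with the edismax `stopwords` parameter. A sketch of how such a request could be assembled (the host, collection name, and `qf` fields are assumptions taken from the thread; nothing is actually sent — the snippet only builds and prints the URL):

```python
from urllib.parse import urlencode

params = {
    "q": '"market and cloud" OR (market and cloud)',
    "q.op": "AND",
    "defType": "edismax",
    "qf": "search_field content",
    # stopwords=false tells edismax to ignore the query-side
    # StopFilterFactory, so "and" stays in the phrase and no
    # position gap (?) is left in the parsed query
    "stopwords": "false",
    "debugQuery": "true",
}

url = "http://localhost:8983/solr/mycollection/select?" + urlencode(params)
print(url)
```

With `debugQuery=true` the `parsedquery_toString` entry in the response shows whether the gap is gone.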

> 21. mai 2019 kl. 09:53 skrev Ashish Bisht :
> 
> Hi,
> 
> We make query to solr as below
> 
> > *q="market and cloud" OR (market and cloud)&q.op=AND&defType=edismax*
> 
> Our intent to look for results with both phrase match and AND query together
> where solr itself takes care of relevancy.
> 
> But due to presence of stopword in phrase query a gap is left which gives
> different results as against a keyword "market cloud".
> 
> "parsedquery_toString":"+(+(content:\"market ? cloud\" |
> search_field:\"market ? cloud\"))",
> 
> There are suggestions that for phrase queries one should create a separate
> field with no stopwords, but then we'll not be able to achieve both phrase
> and AND in a single request.
> 
> Is there any way the ? can be removed from the phrase, or is there any
> suggestion for our requirement?
> 
> Please suggest
> 
> Regards
> Ashish
> 
> 
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html



StopWords behavior with phrases

2019-05-21 Thread Ashish Bisht
Hi,

We make a query to Solr as below:

*q="market and cloud" OR (market and cloud)&q.op=AND&defType=edismax*

Our intent is to look for results with both a phrase match and an AND query
together, where Solr itself takes care of relevancy.

But due to the presence of a stopword in the phrase query, a gap is left,
which gives different results than the keyword query "market cloud".

"parsedquery_toString":"+(+(content:\"market ? cloud\" |
search_field:\"market ? cloud\"))",

There are suggestions that for phrase queries one should create a separate
field with no stopwords, but then we'll not be able to achieve both phrase
and AND in a single request.

Is there any way the ? can be removed from the phrase, or is there any
suggestion for our requirement?

Please suggest

Regards
Ashish





--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: How to use stopwords, synonyms along with fuzzy match in a SOLR

2019-05-09 Thread Erick Erickson
Ah, I didn’t read thoroughly enough. The problem is stopwords don’t really 
count for fuzzy searching. By specifying “junk~” you’re not really searching 
for “junk” or variants. You’re telling Solr “find any term that is a fuzzy 
match” to “junk”. Under the covers, a search is being made for “(jank OR jack 
OR …)” for however many terms are within the edit distance specified for “junk”.

So Solr is behaving as expected. Imagine if it worked as you expect and 
stopwords were removed before applying the fuzzy logic. Then the complaint 
would be “Hey, I know I have words in my corpus ('jack' in this case) that 
should match the fuzzy term 'junk~’ but I don’t get any results back”.

Notice that no document with straight “junk” in the text will be returned 
absent other matching fuzzy terms.

Best,
Erick
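Erick's expansion can be checked by hand with a plain Levenshtein sketch (illustrative only — Lucene actually builds a Levenshtein automaton rather than running this dynamic program, and the default maximum edit distance for `term~` is 2):

```python
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# "jack" is within edit distance 2 of "junk" (u->a, n->c), so a
# document containing "jack" matches the query junk~ even when
# "junk" itself was stopped out of the index.
print(edit_distance("junk", "jack"))   # -> 2
print(edit_distance("junk", "junky"))  # -> 1
```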

> On May 9, 2019, at 11:17 AM, bbarani  wrote:
> 
> 
>
>
>
> ignoreCase="true"/>
>
>
>
>
> ignoreCase="true"/>
>
>



Re: How to use stopwords, synonyms along with fuzzy match in a SOLR

2019-05-09 Thread bbarani
Thanks for your reply Erick.

I created a simple field type as below for testing and added 'junk' to the
stopwords, but it doesn't seem to be honored when using fuzzy search.

Btw, I am using qf along with edismax and pass the value in q (sample query
below).

/solr/collection1/select?qf=title_autoComplete=false=productName=edismax=junk~=true=100%25=defaultMarketingSequence%20asc=1


[field type XML stripped by the mail archive]
 Headphone *Jack* Adapter Cable




rawquerystring: junk~
querystring: junk~
parsedquery: (+DisjunctionMaxQuery((title_autoComplete:junk~2)))/no_coord
parsedquery_toString: +(title_autoComplete:junk~2)


1.5424817 = sum of:
  1.5424817 = weight(title_autoComplete:jack in 190) [SchemaSimilarity], result of:
    1.5424817 = score(doc=190, freq=1.0 = termFreq=1.0), product of:
      0.5 = boost
      3.0849633 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
        37.0 = docFreq
        819.0 = docCount
      1.0 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1) from:
        1.0 = termFreq=1.0
        1.2 = parameter k1
        0.0 = parameter b (norms omitted for field)





--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: How to use stopwords, synonyms along with fuzzy match in a SOLR

2019-05-08 Thread Erick Erickson
Well, I’d start by adding debug=true, that’ll show you the parsed query as well 
as why certain documents scored the way they did. But do note that q=junk~ will 
search against the default text field (the ”df” parameter in the request 
handler definition in solrconfig.xml). Is that what you’re expecting?

Or, I suppose, it’s searching against the fields defined if you’re using 
(e)dismax as your query parser. But the debug output (parsed query part) will 
show what the actual search is.

You should also look at the admin/analysis page. For instance, the way you have 
the field defined at index time, it’ll break on whitespace. But “junk.” won’t 
be found because your stopword doesn’t contain the period.

Plus, your EdgeNGramFilterFactory is pretty strange. A min gram size of 1 means 
you’re searching for single characters.

So what I’d do is back off the definition and build it up bit by bit to see 
if/when you have this problem. But if stopwords are working correctly at index 
time, the “junk” will not be _in_ the index, therefore it’ll be impossible to 
find fuzzy search or not. So you’re making some assumptions that aren’t true, 
and the analysis process combined with looking at the parsed query should show 
you quite a lot.

Best,
Erick
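Erick's point about minGramSize=1 is easy to see by generating the edge n-grams by hand (a plain sketch of what EdgeNGramFilterFactory emits, not Lucene's implementation; the term and gram sizes are illustrative):

```python
def edge_ngrams(term, min_gram, max_gram):
    """Leading-edge n-grams, as an edge n-gram filter would emit them."""
    return [term[:n] for n in range(min_gram, min(max_gram, len(term)) + 1)]

# With minGramSize=1 the single letter "j" is indexed as a term,
# so a one-character query prefix already matches this document.
print(edge_ngrams("jack", 1, 10))   # -> ['j', 'ja', 'jac', 'jack']
```

That is why a min gram size of 1 usually produces far too many matches: every term sharing a first letter with the query becomes a candidate.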

> On May 8, 2019, at 4:43 PM, bbarani  wrote:
> 
> Hi,
> Is there a way to use stopwords and fuzzy match in a SOLR query?
> 
> The below query matches 'jack' too and I added 'junk' to the stopwords (in
> query) to avoid returning results but looks like its not honoring the
> stopwords when using the fuzzy search. 
> 
> solr/collection1/select?app-qf=title_autoComplete=false=*=true=-1=marketingSequence%20asc=productId=true=on=categoryFilter=defaultMarketingSequence%20asc=junk~
> 
> 
>
>
> ignoreCase="true"/>
>
>
>
>
> synonyms="synonyms.txt"/>
> catenateNumbers="0" generateNumberParts="0" generateWordParts="0"
> preserveOriginal="1" catenateAll="0" catenateWords="1"/>
> minGramSize="1"/>
>
>
> ignoreCase="true"/>
>
>
>
>
> synonyms="synonyms.txt"/>
> catenateNumbers="0" generateNumberParts="0" generateWordParts="0"
> preserveOriginal="1" catenateAll="0" catenateWords="1"/>
>
>
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html



How to use stopwords, synonyms along with fuzzy match in a SOLR

2019-05-08 Thread bbarani
Hi,
Is there a way to use stopwords and fuzzy match in a SOLR query?

The below query matches 'jack' too and I added 'junk' to the stopwords (in
query) to avoid returning results, but it looks like it's not honoring the
stopwords when using the fuzzy search.

solr/collection1/select?app-qf=title_autoComplete=false=*=true=-1=marketingSequence%20asc=productId=true=on=categoryFilter=defaultMarketingSequence%20asc=junk~

[field type XML stripped by the mail archive]

--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Stopwords param of edismax parser not working

2019-03-29 Thread Branham, Jeremy (Experis)
Hi Ashish –
Are you using v7.3?
If so, I think this is the spot in code that should be executing:
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/7.3.0/solr/core/src/java/org/apache/solr/search/ExtendedDismaxQParser.java#L310

 Haven’t dug into the logic, but I tested on my server [v7.7.0], and the debug 
output doesn’t show whether or not the stopword filter was removed.
I don’t know your use-case, but maybe you could use the field analysis tool in 
Solr Admin to get more insight.
 
Jeremy Branham
jb...@allstate.com

On 3/28/19, 4:47 AM, "Ashish Bisht"  wrote:

Hi,

We are trying  to remove stopwords from analysis using edismax parser
parameter.The documentation says

*stopwords
A Boolean parameter indicating if the StopFilterFactory configured in the
query analyzer should be respected when parsing the query. If this is set to
false, then the StopFilterFactory in the query analyzer is ignored.*


https://lucene.apache.org/solr/guide/7_3/the-extended-dismax-query-parser.html


But seems like its not working.


http://Box-1:8983/solr/SalesCentralDev_4/select?q=internet of
things&rows=0&defType=edismax&qf=search_field
content&*stopwords=false*&debugQuery=true


"parsedquery":"+(DisjunctionMaxQuery((content:internet |
search_field:internet)) DisjunctionMaxQuery((content:thing |
search_field:thing)))",
  *  "parsedquery_toString":"+((content:internet | search_field:internet)
(content:thing | search_field:thing))",*


Are we missing something here?



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html




Re: Stopwords param of edismax parser not working

2019-03-28 Thread Erick Erickson
and to say anything about your particular situation we need to see the field 
definitions from the schema for the field you expect stopwords to be removed 
from, and the stopwords file for those fields.

But Walter’s comment is germane. Stopwords lead to a number of incongruities 
and are best just left in.

Best,
Erick

> On Mar 28, 2019, at 8:05 AM, Walter Underwood  wrote:
> 
> Why are you removing stopwords? That hack made sense in the 1950s, but I 
> haven’t removed stopwords for the last twenty years.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
>> On Mar 28, 2019, at 2:47 AM, Ashish Bisht  wrote:
>> 
>> Hi,
>> 
>> We are trying  to remove stopwords from analysis using edismax parser
>> parameter.The documentation says
>> 
>> *stopwords
>> A Boolean parameter indicating if the StopFilterFactory configured in the
>> query analyzer should be respected when parsing the query. If this is set to
>> false, then the StopFilterFactory in the query analyzer is ignored.*
>> 
>> https://lucene.apache.org/solr/guide/7_3/the-extended-dismax-query-parser.html
>> 
>> 
>> But seems like its not working.
>> 
>> http://Box-1:8983/solr/SalesCentralDev_4/select?q=internet of
>> things&rows=0&defType=edismax&qf=search_field
>> content&*stopwords=false*&debugQuery=true
>> 
>> 
>> "parsedquery":"+(DisjunctionMaxQuery((content:internet |
>> search_field:internet)) DisjunctionMaxQuery((content:thing |
>> search_field:thing)))",
>> *  "parsedquery_toString":"+((content:internet | search_field:internet)
>> (content:thing | search_field:thing))",*
>> 
>> 
>> Are we missing something here?
>> 
>> 
>> 
>> --
>> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
> 



Re: Stopwords param of edismax parser not working

2019-03-28 Thread Walter Underwood
Why are you removing stopwords? That hack made sense in the 1950s, but I 
haven’t removed stopwords for the last twenty years.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Mar 28, 2019, at 2:47 AM, Ashish Bisht  wrote:
> 
> Hi,
> 
> We are trying  to remove stopwords from analysis using edismax parser
> parameter.The documentation says
> 
> *stopwords
> A Boolean parameter indicating if the StopFilterFactory configured in the
> query analyzer should be respected when parsing the query. If this is set to
> false, then the StopFilterFactory in the query analyzer is ignored.*
> 
> https://lucene.apache.org/solr/guide/7_3/the-extended-dismax-query-parser.html
> 
> 
> But seems like its not working.
> 
> http://Box-1:8983/solr/SalesCentralDev_4/select?q=internet of
> things&rows=0&defType=edismax&qf=search_field
> content&*stopwords=false*&debugQuery=true
> 
> 
> "parsedquery":"+(DisjunctionMaxQuery((content:internet |
> search_field:internet)) DisjunctionMaxQuery((content:thing |
> search_field:thing)))",
>  *  "parsedquery_toString":"+((content:internet | search_field:internet)
> (content:thing | search_field:thing))",*
> 
> 
> Are we missing something here?
> 
> 
> 
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html



Stopwords param of edismax parser not working

2019-03-28 Thread Ashish Bisht
Hi,

We are trying to remove stopwords from analysis using an edismax parser
parameter. The documentation says:

*stopwords
A Boolean parameter indicating if the StopFilterFactory configured in the
query analyzer should be respected when parsing the query. If this is set to
false, then the StopFilterFactory in the query analyzer is ignored.*

https://lucene.apache.org/solr/guide/7_3/the-extended-dismax-query-parser.html


But it seems like it's not working.

http://Box-1:8983/solr/SalesCentralDev_4/select?q=internet of
things&rows=0&defType=edismax&qf=search_field
content&*stopwords=false*&debugQuery=true


"parsedquery":"+(DisjunctionMaxQuery((content:internet |
search_field:internet)) DisjunctionMaxQuery((content:thing |
search_field:thing)))",
  *  "parsedquery_toString":"+((content:internet | search_field:internet)
(content:thing | search_field:thing))",*


Are we missing something here?



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
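The behavior the documentation describes presupposes a StopFilterFactory in the *query* analyzer — that is the filter the stopwords parameter toggles. A sketch of such a field type (the type name and stopword file are illustrative, not taken from the thread; stopwords are deliberately kept at index time):

```xml
<!-- Sketch: stopwords kept in the index, removed only at query time.
     The edismax stopwords=false parameter disables this query-side
     StopFilterFactory; names below are illustrative. -->
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

If the field's query analyzer has no StopFilterFactory at all, the parameter has nothing to disable and the parsed query will look identical either way.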


Re: Can I use configsets with custom stopwords per collection?

2018-12-05 Thread O. Klein
Ok. So with these suggestions, I found
https://lucene.apache.org/solr/guide/6_6/configuring-solrconfig-xml.html#Configuringsolrconfig.xml-ImplicitCoreProperties
So to test this I tried to use it in DIH, as DIH has a similar issue with
configsets: every collection needs its own DIH.properties.

[the DIH config snippet was stripped by the mail archive]

However, using ${solr.core.name} does not work. Substituting it with the
literal core name does work.

Am I missing something?



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Can I use configsets with custom stopwords per collection?

2018-12-04 Thread Erick Erickson
Substitution variables are whatever you want. The file looks like:
${my.var.here:default_if_not_specified}

then set it when you start Solr with:
java ... -Dmy.var.here=whatever ...

Best,
Erick
On Tue, Dec 4, 2018 at 2:43 AM O. Klein  wrote:
>
> Yeah, I'm not copying files. I want all collections to use 1 schema.
>
> So I wonder, do managed stopwords work with configsets and store stopwords
> per collection?
>
> Also, what would be the substitution variable for collection name? Is there
> a list somewhere?
>
> Thanks!
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
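Erick's per-collection substitution could be wired into a shared configset roughly like this (a sketch under assumptions: the property name stop.list, its default, and the file naming scheme are all made up; the property is set at startup, e.g. bin/solr start -Dstop.list=products):

```xml
<!-- Sketch: one shared schema, per-collection stopword file selected by a
     custom system property with a fallback default. -->
<filter class="solr.StopFilterFactory"
        ignoreCase="true"
        words="stopwords_${stop.list:default}.txt"/>
```

Each node (or collection, via its startup properties) then resolves a different stopwords file from the same configset, with stopwords_default.txt used when the property is unset.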

