Re: Re: Re: Using Synonym Graph Filter with StandardTokenizer does not tokenize the query string if it has multi-word synonym

2020-03-16 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
I don't think you can synonym-ize both the multi-token phrase and each 
individual token in the multi-token phrase at the same time. But anyone else 
feel free to chime in! 

Best,
Audrey Lorberfeld

On 3/16/20, 12:40 PM, "atin janki"  wrote:

I aim to achieve an expansion like -

Synonym(soap powder) + Synonym(soap) + Synonym (powder)


which is not happening because of the Synonym expansion is being done at
the moment.

At the moment, using  Synonym Graph Filter with StandardTokenizer  and sow
= false , expands as -

 Synonym(soap powder)

because "soap powder" is a multi-word synonym present in the synonym file.

Using sow = true in the above setting will give -

Synonym(soap) + Synonym (powder)



Best Regards,
Atin Janki


On Mon, Mar 16, 2020 at 5:27 PM Audrey Lorberfeld -
audrey.lorberf...@ibm.com  wrote:

> To confirm, you want a synonym like "soap powder" to map onto synonyms
> like "hand soap," "hygiene products," etc? As in, more of a cognitive
> synonym mapping where you feed synonyms that only apply to the multi-token
> phrase as a whole?
>
> On 3/16/20, 12:17 PM, "atin janki"  wrote:
>
> Using sow=true, does split the word on whitespaces but it will not
> look for
> synonyms of "soap powder" anymore, rather it expands separate synonyms
> for
> "soap" and "powder".
>
>
>
> Best Regards,
> Atin Janki
>
>
> On Mon, Mar 16, 2020 at 4:59 PM Audrey Lorberfeld -
> audrey.lorberf...@ibm.com  wrote:
>
> > Have you set sow=true in your search handler? I know that we have it
> set
> > to false (sow = split on whitespace) because we WANT multi-token
> synonyms
> > retained as multiple tokens.
> >
> > On 3/16/20, 10:49 AM, "atin janki"  wrote:
> >
> > Hello everyone,
> >
> > I am using solr 8.3.
> >
> > After I included Synonym Graph Filter in my managed-schema file,
> I
> > have noticed that if the query string contains a multi-word
> synonym,
> > it considers that multi-word synonym as a single term and does
> not
> > break it, further suppressing the default search behaviour.
> >
> > I am using StandardTokenizer.
> >
> > Below is a snippet from managed-schema file -
> >
> > >
> > > *   > positionIncrementGap="100" multiValued="true">*
> > > **
> > > *  *
> > > *   words="stopwords.txt"
> > ignoreCase="true"/>*
> > > *  *
> > > **
> > > **
> > > *  *
> > > *   words="stopwords.txt"
> > ignoreCase="true"/>*
> > > *   expand="true"
> > ignoreCase="true" synonyms="synonyms.txt"/>*
> > > *  *
> > > ***  *
> >
> >
> > Here "*soap powder*" is the search *query* which is also a
> multi-word
> > synonym in the synonym file as-
> >
> > > s(104254535,1,'soap powder',n,1,1).
> > > s(104254535,2,'built-soap powder',n,1,0).
> > > s(104254535,3,'washing powder',n,1,0).
> >
> >
> > I am sharing some screenshots for understanding the problem-
> >
> > *without* Synonym Graph Filter => 2 docs returned  (screenshot 
at
> > below mentioned URL) -
> >
> >
> >
> 
https://urldefense.proofpoint.com/v2/url?u=https-3A__ibb.co_zQXx7mV=DwIBaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=20lvJFDIjFQqyiTdHseNNeSlDRT2YSznQPoQnxGJQfM=QUaaR69psn7pqa3DtaC7MrTMFstQrQHgeuY0qeQTc0k=
> >
> > *with* Synonym Graph Filter => 2 docs expected, only 1 returned
> > (screenshot at below mentioned URL) -
> >
> >
> >
> 
https://urldefense.proofpoint.com/v2/url?u=https-3A__ibb.co_tp04Rzw=DwIBaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=20lvJFDIjFQqyiTdHseNNeSlDRT2YSznQPoQnxGJQfM=pLPVuD71W1IhokvFuu4F672lX8Nk07b0X9pCVETRjks=
> >
> >
> > Has anyone experienced this before? If yes, is there any
> workaround ?
> > Or is it an expected behaviour?
> >
> > Regards,
> > Atin Janki
> >
> >
> >
>
>
>




Re: Re: Using Synonym Graph Filter with StandardTokenizer does not tokenize the query string if it has multi-word synonym

2020-03-16 Thread atin janki
I aim to achieve an expansion like -

Synonym(soap powder) + Synonym(soap) + Synonym (powder)


which is not happening because of the Synonym expansion is being done at
the moment.

At the moment, using  Synonym Graph Filter with StandardTokenizer  and sow
= false , expands as -

 Synonym(soap powder)

because "soap powder" is a multi-word synonym present in the synonym file.

Using sow = true in the above setting will give -

Synonym(soap) + Synonym (powder)



Best Regards,
Atin Janki


On Mon, Mar 16, 2020 at 5:27 PM Audrey Lorberfeld -
audrey.lorberf...@ibm.com  wrote:

> To confirm, you want a synonym like "soap powder" to map onto synonyms
> like "hand soap," "hygiene products," etc? As in, more of a cognitive
> synonym mapping where you feed synonyms that only apply to the multi-token
> phrase as a whole?
>
> On 3/16/20, 12:17 PM, "atin janki"  wrote:
>
> Using sow=true, does split the word on whitespaces but it will not
> look for
> synonyms of "soap powder" anymore, rather it expands separate synonyms
> for
> "soap" and "powder".
>
>
>
> Best Regards,
> Atin Janki
>
>
> On Mon, Mar 16, 2020 at 4:59 PM Audrey Lorberfeld -
> audrey.lorberf...@ibm.com  wrote:
>
> > Have you set sow=true in your search handler? I know that we have it
> set
> > to false (sow = split on whitespace) because we WANT multi-token
> synonyms
> > retained as multiple tokens.
> >
> > On 3/16/20, 10:49 AM, "atin janki"  wrote:
> >
> > Hello everyone,
> >
> > I am using solr 8.3.
> >
> > After I included Synonym Graph Filter in my managed-schema file,
> I
> > have noticed that if the query string contains a multi-word
> synonym,
> > it considers that multi-word synonym as a single term and does
> not
> > break it, further suppressing the default search behaviour.
> >
> > I am using StandardTokenizer.
> >
> > Below is a snippet from managed-schema file -
> >
> > >
> > > *   > positionIncrementGap="100" multiValued="true">*
> > > **
> > > *  *
> > > *   words="stopwords.txt"
> > ignoreCase="true"/>*
> > > *  *
> > > **
> > > **
> > > *  *
> > > *   words="stopwords.txt"
> > ignoreCase="true"/>*
> > > *   expand="true"
> > ignoreCase="true" synonyms="synonyms.txt"/>*
> > > *  *
> > > ***  *
> >
> >
> > Here "*soap powder*" is the search *query* which is also a
> multi-word
> > synonym in the synonym file as-
> >
> > > s(104254535,1,'soap powder',n,1,1).
> > > s(104254535,2,'built-soap powder',n,1,0).
> > > s(104254535,3,'washing powder',n,1,0).
> >
> >
> > I am sharing some screenshots for understanding the problem-
> >
> > *without* Synonym Graph Filter => 2 docs returned  (screenshot at
> > below mentioned URL) -
> >
> >
> >
> https://urldefense.proofpoint.com/v2/url?u=https-3A__ibb.co_zQXx7mV=DwIBaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=20lvJFDIjFQqyiTdHseNNeSlDRT2YSznQPoQnxGJQfM=QUaaR69psn7pqa3DtaC7MrTMFstQrQHgeuY0qeQTc0k=
> >
> > *with* Synonym Graph Filter => 2 docs expected, only 1 returned
> > (screenshot at below mentioned URL) -
> >
> >
> >
> https://urldefense.proofpoint.com/v2/url?u=https-3A__ibb.co_tp04Rzw=DwIBaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=20lvJFDIjFQqyiTdHseNNeSlDRT2YSznQPoQnxGJQfM=pLPVuD71W1IhokvFuu4F672lX8Nk07b0X9pCVETRjks=
> >
> >
> > Has anyone experienced this before? If yes, is there any
> workaround ?
> > Or is it an expected behaviour?
> >
> > Regards,
> > Atin Janki
> >
> >
> >
>
>
>


Re: Re: Using Synonym Graph Filter with StandardTokenizer does not tokenize the query string if it has multi-word synonym

2020-03-16 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
To confirm, you want a synonym like "soap powder" to map onto synonyms like 
"hand soap," "hygiene products," etc? As in, more of a cognitive synonym 
mapping where you feed synonyms that only apply to the multi-token phrase as a 
whole?

On 3/16/20, 12:17 PM, "atin janki"  wrote:

Using sow=true, does split the word on whitespaces but it will not look for
synonyms of "soap powder" anymore, rather it expands separate synonyms for
"soap" and "powder".



Best Regards,
Atin Janki


On Mon, Mar 16, 2020 at 4:59 PM Audrey Lorberfeld -
audrey.lorberf...@ibm.com  wrote:

> Have you set sow=true in your search handler? I know that we have it set
> to false (sow = split on whitespace) because we WANT multi-token synonyms
> retained as multiple tokens.
>
> On 3/16/20, 10:49 AM, "atin janki"  wrote:
>
> Hello everyone,
>
> I am using solr 8.3.
>
> After I included Synonym Graph Filter in my managed-schema file, I
> have noticed that if the query string contains a multi-word synonym,
> it considers that multi-word synonym as a single term and does not
> break it, further suppressing the default search behaviour.
>
> I am using StandardTokenizer.
>
> Below is a snippet from managed-schema file -
>
> >
> > *   positionIncrementGap="100" multiValued="true">*
> > **
> > *  *
> > *   ignoreCase="true"/>*
> > *  *
> > **
> > **
> > *  *
> > *   ignoreCase="true"/>*
> > *   ignoreCase="true" synonyms="synonyms.txt"/>*
> > *  *
> > ***  *
>
>
> Here "*soap powder*" is the search *query* which is also a multi-word
> synonym in the synonym file as-
>
> > s(104254535,1,'soap powder',n,1,1).
> > s(104254535,2,'built-soap powder',n,1,0).
> > s(104254535,3,'washing powder',n,1,0).
>
>
> I am sharing some screenshots for understanding the problem-
>
> *without* Synonym Graph Filter => 2 docs returned  (screenshot at
> below mentioned URL) -
>
>
> 
https://urldefense.proofpoint.com/v2/url?u=https-3A__ibb.co_zQXx7mV=DwIBaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=20lvJFDIjFQqyiTdHseNNeSlDRT2YSznQPoQnxGJQfM=QUaaR69psn7pqa3DtaC7MrTMFstQrQHgeuY0qeQTc0k=
>
> *with* Synonym Graph Filter => 2 docs expected, only 1 returned
> (screenshot at below mentioned URL) -
>
>
> 
https://urldefense.proofpoint.com/v2/url?u=https-3A__ibb.co_tp04Rzw=DwIBaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=20lvJFDIjFQqyiTdHseNNeSlDRT2YSznQPoQnxGJQfM=pLPVuD71W1IhokvFuu4F672lX8Nk07b0X9pCVETRjks=
>
>
> Has anyone experienced this before? If yes, is there any workaround ?
> Or is it an expected behaviour?
>
> Regards,
> Atin Janki
>
>
>




Re: Re: Re: Re: Re: Query Autocomplete Evaluation

2020-02-28 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Paras,

Thank you! This is all very helpful __ I'm going to read through your answer a 
couple more times and follow up if I have any more questions!

Best,
Audrey

On 2/28/20, 8:08 AM, "Paras Lehana"  wrote:

Hey Audrey,

Users often skip results and go straight to vanilla search even though
> their query is displayed in the top of the suggestions list


Yes, we do track this in another metric. This behaviour is more
prevalent for shorter terms like "tea" and "bag". But, anyways, we measure
MRR for quantifying how high are we able to show suggestions to the users.
Since we include only the terms selection via Auto-Suggest in the universe
for calculation, the searches where user skip Auto-Suggest won't be
counted. I think we can safely exclude these if you're using MRR to measure
how well you order your result set. Still, if you want to include those,
you can always compare the search term with the last result set and include
them in MRR - you're actually right that users maybe skipping the lower
positions even if the intended suggestion is available. Our MRR stands at
68% and 75% of all of the suggestions are selected from position #1 or #2.


So acceptance rate = # of suggestions taken / total queries issued?


Yes. The total queries issues should ideally be those where Auto-Suggest
was selected or could have been selected i.e. we exclude voice searches. We
try to include as much as those searches which were made via typing in the
search bar. But that's how we have fine-tuned our tracking over months.
You're right about the general formula - searches via Auto-Suggest divided
by total Searches.


And Selection to Display = # of suggestions taken (this would only be 1, if
> the not-taken suggestions are given 0s) / total suggestions displayed? If
> the above is true, wouldn't Selection to Display be binary? I.e. it's
> either 1/# of suggestions displayed (assuming this is a constant) or 0?


Yup. Please note that this is calculated per session of Auto-Suggest. Let
the formula be S/D. We will take D (Display) as 1 and not 3 when a user
query for "bag" (b, ba, bag). If the S (Selection) was made in the last
display, it is 1 also. If a user selects "bag" after writing "ba", we don't
say that S=0, D=1 for "b" and S=1, D=1 for "ba". For this, we already track
APL (Average Prefix Length). S/D is calculated per search and thus, here
S=1, D=1 for search "bag". Thus, for a single search, S/D can be either 0
or 1 - you're right, it's binary!

Hope this helps. Loved your questions! :)

On Thu, 27 Feb 2020 at 22:21, Audrey Lorberfeld - audrey.lorberf...@ibm.com
 wrote:

> Paras,
>
> Thank you for this response! Yes, you are being clear __
>
> Regarding the assumptions you make for MRR, do you have any research
> papers to confirm that these user behaviors have been observed? I only ask
> because this paper 
https://urldefense.proofpoint.com/v2/url?u=http-3A__yichang-2Dcs.com_yahoo_sigir14-5FSearchAssist.pdf=DwIFaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=itCtsKdh-LT8eUwdVvqBc96lR_64mPtVw7t52WMtBLs=JrGARO4xkzWbtv7_b-H5da6Ki6PemYL5NQ253y0Y7Qs=
 
> talks about how users often skip results and go straight to vanilla search
> even though their query is displayed in the top of the suggestions list
> (section 3.2 "QAC User Behavior Analysis"), among other behaviors that go
> against general IR intuition. This is only one paper, of course, but it
> seems that user research of QAC is hard to come by otherwise.
>
> So acceptance rate = # of suggestions taken / total queries issued ?
> And Selection to Display = # of suggestions taken (this would only be 1,
> if the not-taken suggestions are given 0s) / total suggestions displayed ?
>
> If the above is true, wouldn't Selection to Display be binary? I.e. it's
> either 1/# of suggestions displayed (assuming this is a constant) or 0?
>
> Best,
> Audrey
>
>
    > 
> From: Paras Lehana 
> Sent: Thursday, February 27, 2020 2:58:25 AM
> To: solr-user@lucene.apache.org
> Subject: [EXTERNAL] Re: Re: Re: Query Autocomplete Evaluation
>
> Hi Audrey,
>
> For MRR, we assume that if a suggestion is selected, it's relevant. It's
> also assumed that the user will always click the highest relevant
> suggestion. Thus, we calculate position selection for each selection. If
> still, I'm not understanding your question correctly, feel free to contact
> me personally (hangouts?).
 

Re: Re: Re: Re: Query Autocomplete Evaluation

2020-02-28 Thread Paras Lehana
Hey Audrey,

Users often skip results and go straight to vanilla search even though
> their query is displayed in the top of the suggestions list


Yes, we do track this in another metric. This behaviour is more
prevalent for shorter terms like "tea" and "bag". But, anyways, we measure
MRR for quantifying how high are we able to show suggestions to the users.
Since we include only the terms selection via Auto-Suggest in the universe
for calculation, the searches where user skip Auto-Suggest won't be
counted. I think we can safely exclude these if you're using MRR to measure
how well you order your result set. Still, if you want to include those,
you can always compare the search term with the last result set and include
them in MRR - you're actually right that users maybe skipping the lower
positions even if the intended suggestion is available. Our MRR stands at
68% and 75% of all of the suggestions are selected from position #1 or #2.


So acceptance rate = # of suggestions taken / total queries issued?


Yes. The total queries issues should ideally be those where Auto-Suggest
was selected or could have been selected i.e. we exclude voice searches. We
try to include as much as those searches which were made via typing in the
search bar. But that's how we have fine-tuned our tracking over months.
You're right about the general formula - searches via Auto-Suggest divided
by total Searches.


And Selection to Display = # of suggestions taken (this would only be 1, if
> the not-taken suggestions are given 0s) / total suggestions displayed? If
> the above is true, wouldn't Selection to Display be binary? I.e. it's
> either 1/# of suggestions displayed (assuming this is a constant) or 0?


Yup. Please note that this is calculated per session of Auto-Suggest. Let
the formula be S/D. We will take D (Display) as 1 and not 3 when a user
query for "bag" (b, ba, bag). If the S (Selection) was made in the last
display, it is 1 also. If a user selects "bag" after writing "ba", we don't
say that S=0, D=1 for "b" and S=1, D=1 for "ba". For this, we already track
APL (Average Prefix Length). S/D is calculated per search and thus, here
S=1, D=1 for search "bag". Thus, for a single search, S/D can be either 0
or 1 - you're right, it's binary!

Hope this helps. Loved your questions! :)

On Thu, 27 Feb 2020 at 22:21, Audrey Lorberfeld - audrey.lorberf...@ibm.com
 wrote:

> Paras,
>
> Thank you for this response! Yes, you are being clear __
>
> Regarding the assumptions you make for MRR, do you have any research
> papers to confirm that these user behaviors have been observed? I only ask
> because this paper http://yichang-cs.com/yahoo/sigir14_SearchAssist.pdf
> talks about how users often skip results and go straight to vanilla search
> even though their query is displayed in the top of the suggestions list
> (section 3.2 "QAC User Behavior Analysis"), among other behaviors that go
> against general IR intuition. This is only one paper, of course, but it
> seems that user research of QAC is hard to come by otherwise.
>
> So acceptance rate = # of suggestions taken / total queries issued ?
> And Selection to Display = # of suggestions taken (this would only be 1,
> if the not-taken suggestions are given 0s) / total suggestions displayed ?
>
> If the above is true, wouldn't Selection to Display be binary? I.e. it's
> either 1/# of suggestions displayed (assuming this is a constant) or 0?
>
> Best,
> Audrey
>
>
> 
> From: Paras Lehana 
> Sent: Thursday, February 27, 2020 2:58:25 AM
> To: solr-user@lucene.apache.org
> Subject: [EXTERNAL] Re: Re: Re: Query Autocomplete Evaluation
>
> Hi Audrey,
>
> For MRR, we assume that if a suggestion is selected, it's relevant. It's
> also assumed that the user will always click the highest relevant
> suggestion. Thus, we calculate position selection for each selection. If
> still, I'm not understanding your question correctly, feel free to contact
> me personally (hangouts?).
>
> And @Paras, the third and fourth evaluation metrics you listed in your
> > first reply seem the same to me. What is the difference between the two?
>
>
> I was expecting you to ask this - I should have explained a bit more.
> Acceptance Rate is the searches through Auto-Suggest for all Searches.
> Whereas, value for Selection to Display is 1 if the Selection is made given
> the suggestions were displayed otherwise 0. Here, the cases where results
> are displayed is the universal set. Acceptance Rate is counted 0 even for
> those searches where Selection was not made because there were no results
> while S/D will not count this - it only counts cases where the result was
> displayed.
>
> Hope I'm clear.

Re: Re: Re: Re: Query Autocomplete Evaluation

2020-02-27 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Paras,

Thank you for this response! Yes, you are being clear __

Regarding the assumptions you make for MRR, do you have any research papers to 
confirm that these user behaviors have been observed? I only ask because this 
paper http://yichang-cs.com/yahoo/sigir14_SearchAssist.pdf talks about how 
users often skip results and go straight to vanilla search even though their 
query is displayed in the top of the suggestions list (section 3.2 "QAC User 
Behavior Analysis"), among other behaviors that go against general IR 
intuition. This is only one paper, of course, but it seems that user research 
of QAC is hard to come by otherwise.

So acceptance rate = # of suggestions taken / total queries issued ?
And Selection to Display = # of suggestions taken (this would only be 1, if the 
not-taken suggestions are given 0s) / total suggestions displayed ?

If the above is true, wouldn't Selection to Display be binary? I.e. it's either 
1/# of suggestions displayed (assuming this is a constant) or 0?

Best,
Audrey



From: Paras Lehana 
Sent: Thursday, February 27, 2020 2:58:25 AM
To: solr-user@lucene.apache.org
Subject: [EXTERNAL] Re: Re: Re: Query Autocomplete Evaluation

Hi Audrey,

For MRR, we assume that if a suggestion is selected, it's relevant. It's
also assumed that the user will always click the highest relevant
suggestion. Thus, we calculate position selection for each selection. If
still, I'm not understanding your question correctly, feel free to contact
me personally (hangouts?).

And @Paras, the third and fourth evaluation metrics you listed in your
> first reply seem the same to me. What is the difference between the two?


I was expecting you to ask this - I should have explained a bit more.
Acceptance Rate is the searches through Auto-Suggest for all Searches.
Whereas, value for Selection to Display is 1 if the Selection is made given
the suggestions were displayed otherwise 0. Here, the cases where results
are displayed is the universal set. Acceptance Rate is counted 0 even for
those searches where Selection was not made because there were no results
while S/D will not count this - it only counts cases where the result was
displayed.

Hope I'm clear. :)

On Tue, 25 Feb 2020 at 21:10, Audrey Lorberfeld - audrey.lorberf...@ibm.com
 wrote:

> This article
> https://urldefense.proofpoint.com/v2/url?u=http-3A__wwwconference.org_proceedings_www2011_proceedings_p107.pdf=DwIFaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=KMeOCffgJOgN3RoE0ht8jssgdO3AbyNYqRmXlQ6xWRo=fVp2mKYimlchSj0RMKpd595S7C2nGxK2G3CQSkrycg4=
>   also
> indicates that MRR needs binary relevance labels, p. 114: "To this end, we
> selected a random sample of 198 (query, context) pairs from the set of
> 7,311 pairs, and manually tagged each of them as related (i.e., the query
> is related to the context; 60% of the pairs) and unrelated (40% of the
> pairs)."
>
> On 2/25/20, 10:25 AM, "Audrey Lorberfeld - audrey.lorberf...@ibm.com" <
> audrey.lorberf...@ibm.com> wrote:
>
> Thank you, Walter & Paras!
>
> So, from the MRR equation, I was under the impression the suggestions
> all needed a binary label (0,1) indicating relevance.* But it's great to
> know that you guys use proxies for relevance, such as clicks.
>
> *The reason I think MRR has to have binary relevance labels is this
> Wikipedia article:
> https://urldefense.proofpoint.com/v2/url?u=https-3A__en.wikipedia.org_wiki_Mean-5Freciprocal-5Frank=DwIGaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=1f2LPzuBvibQd8m-8_HuNVYFm0JvCGyPDul6r4ATsLk=Sn7KV-BcFDTrmc1PfRVeSpB9Ysh3UrVIQKcB3G5zstw=
> , where it states below the formula that rank_i = "refers to the rank
> position of the first relevant document for the i-th query." If the
> suggestions are not labeled as relevant (0) or not relevant (1), then how
> do you compute the rank of the first RELEVANT document?
>
> I'll check out these readings asap, thank you!
>
> And @Paras, the third and fourth evaluation metrics you listed in your
> first reply seem the same to me. What is the difference between the two?
>
> Best,
> Audrey
>
> On 2/25/20, 1:11 AM, "Walter Underwood"  wrote:
>
> Here is a blog article with a worked example for MRR based on
> customer clicks.
>
>
> https://urldefense.proofpoint.com/v2/url?u=https-3A__observer.wunderwood.org_2016_09_12_measuring-2Dsearch-2Drelevance-2Dwith-2Dmrr_=DwIFaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=e9a1kzjKu6l-P1g5agvpe-jQZfCF6bT4x6CeYDrUkgE=GzNrf4l_FjMqOkSx2B4_sCIGoJv2QYPbPqWplHGE3PI=
>
> At my place of work, we compare the CTR and MRR of queries using
> suggestions to those that do not use suggestions. Solr autosuggest based on
> lexicon of book ti

Re: Re: Re: Query Autocomplete Evaluation

2020-02-26 Thread Paras Lehana
Hi Audrey,

For MRR, we assume that if a suggestion is selected, it's relevant. It's
also assumed that the user will always click the highest relevant
suggestion. Thus, we calculate position selection for each selection. If
still, I'm not understanding your question correctly, feel free to contact
me personally (hangouts?).

And @Paras, the third and fourth evaluation metrics you listed in your
> first reply seem the same to me. What is the difference between the two?


I was expecting you to ask this - I should have explained a bit more.
Acceptance Rate is the searches through Auto-Suggest for all Searches.
Whereas, value for Selection to Display is 1 if the Selection is made given
the suggestions were displayed otherwise 0. Here, the cases where results
are displayed is the universal set. Acceptance Rate is counted 0 even for
those searches where Selection was not made because there were no results
while S/D will not count this - it only counts cases where the result was
displayed.

Hope I'm clear. :)

On Tue, 25 Feb 2020 at 21:10, Audrey Lorberfeld - audrey.lorberf...@ibm.com
 wrote:

> This article
> http://wwwconference.org/proceedings/www2011/proceedings/p107.pdf also
> indicates that MRR needs binary relevance labels, p. 114: "To this end, we
> selected a random sample of 198 (query, context) pairs from the set of
> 7,311 pairs, and manually tagged each of them as related (i.e., the query
> is related to the context; 60% of the pairs) and unrelated (40% of the
> pairs)."
>
> On 2/25/20, 10:25 AM, "Audrey Lorberfeld - audrey.lorberf...@ibm.com" <
> audrey.lorberf...@ibm.com> wrote:
>
> Thank you, Walter & Paras!
>
> So, from the MRR equation, I was under the impression the suggestions
> all needed a binary label (0,1) indicating relevance.* But it's great to
> know that you guys use proxies for relevance, such as clicks.
>
> *The reason I think MRR has to have binary relevance labels is this
> Wikipedia article:
> https://urldefense.proofpoint.com/v2/url?u=https-3A__en.wikipedia.org_wiki_Mean-5Freciprocal-5Frank=DwIGaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=1f2LPzuBvibQd8m-8_HuNVYFm0JvCGyPDul6r4ATsLk=Sn7KV-BcFDTrmc1PfRVeSpB9Ysh3UrVIQKcB3G5zstw=
> , where it states below the formula that rank_i = "refers to the rank
> position of the first relevant document for the i-th query." If the
> suggestions are not labeled as relevant (0) or not relevant (1), then how
> do you compute the rank of the first RELEVANT document?
>
> I'll check out these readings asap, thank you!
>
> And @Paras, the third and fourth evaluation metrics you listed in your
> first reply seem the same to me. What is the difference between the two?
>
> Best,
> Audrey
>
> On 2/25/20, 1:11 AM, "Walter Underwood"  wrote:
>
> Here is a blog article with a worked example for MRR based on
> customer clicks.
>
>
> https://urldefense.proofpoint.com/v2/url?u=https-3A__observer.wunderwood.org_2016_09_12_measuring-2Dsearch-2Drelevance-2Dwith-2Dmrr_=DwIFaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=e9a1kzjKu6l-P1g5agvpe-jQZfCF6bT4x6CeYDrUkgE=GzNrf4l_FjMqOkSx2B4_sCIGoJv2QYPbPqWplHGE3PI=
>
> At my place of work, we compare the CTR and MRR of queries using
> suggestions to those that do not use suggestions. Solr autosuggest based on
> lexicon of book titles is highly effective for us.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
>
> https://urldefense.proofpoint.com/v2/url?u=http-3A__observer.wunderwood.org_=DwIFaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=e9a1kzjKu6l-P1g5agvpe-jQZfCF6bT4x6CeYDrUkgE=L4yZqRG0pWGPpZ8U7S-feoiWSTrz_zBEq0FANYqncuE=
>  (my blog)
>
> > On Feb 24, 2020, at 9:52 PM, Paras Lehana <
> paras.leh...@indiamart.com> wrote:
> >
> > Hey Audrey,
> >
> > I assume MRR is about the ranking of the intended suggestion.
> For this, no
> > human judgement is required. We track position selection - the
> position
> > (1-10) of the selected suggestion. For example, this is our
> recent numbers:
> >
> > Position 1 Selected (B3) 107,699
> > Position 2 Selected (B4) 58,736
> > Position 3 Selected (B5) 23,507
> > Position 4 Selected (B6) 12,250
> > Position 5 Selected (B7) 7,980
> > Position 6 Selected (B8) 5,653
> > Position 7 Selected (B9) 4,193
> > Position 8 Selected (B10) 3,511
> > Position 9 Selected (B11) 2,997
> > Position 10 Selected (B12) 2,428
> > *Total Selections (B13)* *228,954*
> > MRR = (B3+B4/2+B5/3+B6/4+B7/5+B8/6+B9/7+B10/8+B11/9+B12/10)/B13
> = 66.45%
> >
> > Refer here for MRR calculation keeping Auto-Suggest in
> perspective:
> >
> 

Re: Re: Re: Query Autocomplete Evaluation

2020-02-25 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
This article http://wwwconference.org/proceedings/www2011/proceedings/p107.pdf 
also indicates that MRR needs binary relevance labels, p. 114: "To this end, we 
selected a random sample of 198 (query, context) pairs from the set of 7,311 
pairs, and manually tagged each of them as related (i.e., the query is related 
to the context; 60% of the pairs) and unrelated (40% of the pairs)."

On 2/25/20, 10:25 AM, "Audrey Lorberfeld - audrey.lorberf...@ibm.com" 
 wrote:

Thank you, Walter & Paras! 

So, from the MRR equation, I was under the impression the suggestions all 
needed a binary label (0,1) indicating relevance.* But it's great to know that 
you guys use proxies for relevance, such as clicks.

*The reason I think MRR has to have binary relevance labels is this 
Wikipedia article: 
https://urldefense.proofpoint.com/v2/url?u=https-3A__en.wikipedia.org_wiki_Mean-5Freciprocal-5Frank=DwIGaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=1f2LPzuBvibQd8m-8_HuNVYFm0JvCGyPDul6r4ATsLk=Sn7KV-BcFDTrmc1PfRVeSpB9Ysh3UrVIQKcB3G5zstw=
 , where it states below the formula that rank_i = "refers to the rank position 
of the first relevant document for the i-th query." If the suggestions are not 
labeled as relevant (0) or not relevant (1), then how do you compute the rank 
of the first RELEVANT document? 

I'll check out these readings asap, thank you!

And @Paras, the third and fourth evaluation metrics you listed in your 
first reply seem the same to me. What is the difference between the two?

Best,
Audrey

On 2/25/20, 1:11 AM, "Walter Underwood"  wrote:

Here is a blog article with a worked example for MRR based on customer 
clicks.


https://urldefense.proofpoint.com/v2/url?u=https-3A__observer.wunderwood.org_2016_09_12_measuring-2Dsearch-2Drelevance-2Dwith-2Dmrr_=DwIFaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=e9a1kzjKu6l-P1g5agvpe-jQZfCF6bT4x6CeYDrUkgE=GzNrf4l_FjMqOkSx2B4_sCIGoJv2QYPbPqWplHGE3PI=
 

At my place of work, we compare the CTR and MRR of queries using 
suggestions to those that do not use suggestions. Solr autosuggest based on 
lexicon of book titles is highly effective for us.

wunder
Walter Underwood
wun...@wunderwood.org

https://urldefense.proofpoint.com/v2/url?u=http-3A__observer.wunderwood.org_=DwIFaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=e9a1kzjKu6l-P1g5agvpe-jQZfCF6bT4x6CeYDrUkgE=L4yZqRG0pWGPpZ8U7S-feoiWSTrz_zBEq0FANYqncuE=
   (my blog)

> On Feb 24, 2020, at 9:52 PM, Paras Lehana 
 wrote:
> 
> Hey Audrey,
> 
> I assume MRR is about the ranking of the intended suggestion. For 
this, no
> human judgement is required. We track position selection - the 
position
> (1-10) of the selected suggestion. For example, this is our recent 
numbers:
> 
> Position 1 Selected (B3) 107,699
> Position 2 Selected (B4) 58,736
> Position 3 Selected (B5) 23,507
> Position 4 Selected (B6) 12,250
> Position 5 Selected (B7) 7,980
> Position 6 Selected (B8) 5,653
> Position 7 Selected (B9) 4,193
> Position 8 Selected (B10) 3,511
> Position 9 Selected (B11) 2,997
> Position 10 Selected (B12) 2,428
> *Total Selections (B13)* *228,954*
> MRR = (B3+B4/2+B5/3+B6/4+B7/5+B8/6+B9/7+B10/8+B11/9+B12/10)/B13 = 
66.45%
> 
> Refer here for MRR calculation keeping Auto-Suggest in perspective:
> 
https://urldefense.proofpoint.com/v2/url?u=https-3A__medium.com_-40dtunkelang_evaluating-2Dsearch-2Dmeasuring-2Dsearcher-2Dbehavior-2D5f8347619eb0=DwIFaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=e9a1kzjKu6l-P1g5agvpe-jQZfCF6bT4x6CeYDrUkgE=WFv9xHoFHlnQmBgqIoHPi3moIiyttgAZJzWRxFLjyfk=
 
> 
> "In practice, this is inverted to obtain the reciprocal rank, e.g., 
if the
> searcher clicks on the 4th result, the reciprocal rank is 0.25. The 
average
> of these reciprocal ranks is called the mean reciprocal rank (MRR)."
> 
> nDCG may require human intervention. Please let me know in case I 
have not
> understood your question properly. :)
> 
> 
> 
> On Mon, 24 Feb 2020 at 20:49, Audrey Lorberfeld - 
audrey.lorberf...@ibm.com
>  wrote:
> 
>> Hi Paras,
>> 
>> This is SO helpful, thank you. Quick question about your MRR metric 
-- do
>> you have binary human judgements for your suggestions? If no, how do 
you
>> label suggestions successful or not?
>> 
>> Best,
>> Audrey
>> 
>> On 2/24/20, 2:27 AM, "Paras Lehana"  
wrote:
>> 
>>Hi Audrey,
>> 
>>I work for Auto-Suggest at IndiaMART. Although we don't use the
 

Re: Re: Query Autocomplete Evaluation

2020-02-25 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Thank you, Walter & Paras! 

So, from the MRR equation, I was under the impression the suggestions all 
needed a binary label (0,1) indicating relevance.* But it's great to know that 
you guys use proxies for relevance, such as clicks.

*The reason I think MRR has to have binary relevance labels is this Wikipedia 
article: https://en.wikipedia.org/wiki/Mean_reciprocal_rank, where it states 
below the formula that rank_i = "refers to the rank position of the first 
relevant document for the i-th query." If the suggestions are not labeled as 
relevant (0) or not relevant (1), then how do you compute the rank of the first 
RELEVANT document? 

I'll check out these readings asap, thank you!

And @Paras, the third and fourth evaluation metrics you listed in your first 
reply seem the same to me. What is the difference between the two?

Best,
Audrey

On 2/25/20, 1:11 AM, "Walter Underwood"  wrote:

Here is a blog article with a worked example for MRR based on customer 
clicks.


https://urldefense.proofpoint.com/v2/url?u=https-3A__observer.wunderwood.org_2016_09_12_measuring-2Dsearch-2Drelevance-2Dwith-2Dmrr_=DwIFaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=e9a1kzjKu6l-P1g5agvpe-jQZfCF6bT4x6CeYDrUkgE=GzNrf4l_FjMqOkSx2B4_sCIGoJv2QYPbPqWplHGE3PI=
 

At my place of work, we compare the CTR and MRR of queries using 
suggestions to those that do not use suggestions. Solr autosuggest based on 
lexicon of book titles is highly effective for us.

wunder
Walter Underwood
wun...@wunderwood.org

https://urldefense.proofpoint.com/v2/url?u=http-3A__observer.wunderwood.org_=DwIFaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=e9a1kzjKu6l-P1g5agvpe-jQZfCF6bT4x6CeYDrUkgE=L4yZqRG0pWGPpZ8U7S-feoiWSTrz_zBEq0FANYqncuE=
   (my blog)

> On Feb 24, 2020, at 9:52 PM, Paras Lehana  
wrote:
> 
> Hey Audrey,
> 
> I assume MRR is about the ranking of the intended suggestion. For this, no
> human judgement is required. We track position selection - the position
> (1-10) of the selected suggestion. For example, this is our recent 
numbers:
> 
> Position 1 Selected (B3) 107,699
> Position 2 Selected (B4) 58,736
> Position 3 Selected (B5) 23,507
> Position 4 Selected (B6) 12,250
> Position 5 Selected (B7) 7,980
> Position 6 Selected (B8) 5,653
> Position 7 Selected (B9) 4,193
> Position 8 Selected (B10) 3,511
> Position 9 Selected (B11) 2,997
> Position 10 Selected (B12) 2,428
> *Total Selections (B13)* *228,954*
> MRR = (B3+B4/2+B5/3+B6/4+B7/5+B8/6+B9/7+B10/8+B11/9+B12/10)/B13 = 66.45%
> 
> Refer here for MRR calculation keeping Auto-Suggest in perspective:
> 
https://urldefense.proofpoint.com/v2/url?u=https-3A__medium.com_-40dtunkelang_evaluating-2Dsearch-2Dmeasuring-2Dsearcher-2Dbehavior-2D5f8347619eb0=DwIFaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=e9a1kzjKu6l-P1g5agvpe-jQZfCF6bT4x6CeYDrUkgE=WFv9xHoFHlnQmBgqIoHPi3moIiyttgAZJzWRxFLjyfk=
 
> 
> "In practice, this is inverted to obtain the reciprocal rank, e.g., if the
> searcher clicks on the 4th result, the reciprocal rank is 0.25. The 
average
> of these reciprocal ranks is called the mean reciprocal rank (MRR)."
> 
> nDCG may require human intervention. Please let me know in case I have not
> understood your question properly. :)
> 
> 
> 
> On Mon, 24 Feb 2020 at 20:49, Audrey Lorberfeld - 
audrey.lorberf...@ibm.com
>  wrote:
> 
>> Hi Paras,
>> 
>> This is SO helpful, thank you. Quick question about your MRR metric -- do
>> you have binary human judgements for your suggestions? If no, how do you
>> label suggestions successful or not?
>> 
>> Best,
>> Audrey
>> 
>> On 2/24/20, 2:27 AM, "Paras Lehana"  wrote:
>> 
>>Hi Audrey,
>> 
>>I work for Auto-Suggest at IndiaMART. Although we don't use the
>> Suggester
>>component, I think you need evaluation metrics for Auto-Suggest as a
>>business product and not specifically for Solr Suggester which is the
>>backend. We use edismax parser with EdgeNGrams Tokenization.
>> 
>>Every week, as the property owner, I report around 500 metrics. I 
would
>>like to mention a few of those:
>> 
>>   1. MRR (Mean Reciprocal Rate): How high the user selection was
>> among the
>>   returned result. Ranges from 0 to 1, the higher the better.
>>   2. APL (Average Prefix Length): Prefix is the query by user. Lesser
>> the
>>   better. This reports how less an average user has to type for
>> getting the
>>   intended suggestion.
>>   3. Acceptance Rate or Selection: How many of the total searches are
>>   being served from Auto-Suggest. We are around 50%.
>>   4. Selection to Display Ratio: Did you make the user to click 

Re: Re: Query Autocomplete Evaluation

2020-02-24 Thread Paras Lehana
Hey Audrey,

I assume MRR is about the ranking of the intended suggestion. For this, no
human judgement is required. We track position selection - the position
(1-10) of the selected suggestion. For example, this is our recent numbers:

Position 1 Selected (B3) 107,699
Position 2 Selected (B4) 58,736
Position 3 Selected (B5) 23,507
Position 4 Selected (B6) 12,250
Position 5 Selected (B7) 7,980
Position 6 Selected (B8) 5,653
Position 7 Selected (B9) 4,193
Position 8 Selected (B10) 3,511
Position 9 Selected (B11) 2,997
Position 10 Selected (B12) 2,428
*Total Selections (B13)* *228,954*
MRR = (B3+B4/2+B5/3+B6/4+B7/5+B8/6+B9/7+B10/8+B11/9+B12/10)/B13 = 66.45%

Refer here for MRR calculation keeping Auto-Suggest in perspective:
https://medium.com/@dtunkelang/evaluating-search-measuring-searcher-behavior-5f8347619eb0

"In practice, this is inverted to obtain the reciprocal rank, e.g., if the
searcher clicks on the 4th result, the reciprocal rank is 0.25. The average
of these reciprocal ranks is called the mean reciprocal rank (MRR)."

nDCG may require human intervention. Please let me know in case I have not
understood your question properly. :)



On Mon, 24 Feb 2020 at 20:49, Audrey Lorberfeld - audrey.lorberf...@ibm.com
 wrote:

> Hi Paras,
>
> This is SO helpful, thank you. Quick question about your MRR metric -- do
> you have binary human judgements for your suggestions? If no, how do you
> label suggestions successful or not?
>
> Best,
> Audrey
>
> On 2/24/20, 2:27 AM, "Paras Lehana"  wrote:
>
> Hi Audrey,
>
> I work for Auto-Suggest at IndiaMART. Although we don't use the
> Suggester
> component, I think you need evaluation metrics for Auto-Suggest as a
> business product and not specifically for Solr Suggester which is the
> backend. We use edismax parser with EdgeNGrams Tokenization.
>
> Every week, as the property owner, I report around 500 metrics. I would
> like to mention a few of those:
>
>1. MRR (Mean Reciprocal Rate): How high the user selection was
> among the
>returned result. Ranges from 0 to 1, the higher the better.
>2. APL (Average Prefix Length): Prefix is the query by user. Lesser
> the
>better. This reports how less an average user has to type for
> getting the
>intended suggestion.
>3. Acceptance Rate or Selection: How many of the total searches are
>being served from Auto-Suggest. We are around 50%.
>4. Selection to Display Ratio: Did you make the user to click any
> of the
>suggestions if they are displayed?
>5. Response Time: How fast are you serving your average query.
>
>
> The Selection and Response Time are our main KPIs. We track a lot about
> Auto-Suggest usage on our platform which becomes apparent if you
> observe
> the URL after clicking a suggestion on dir.indiamart.com. However, not
> everything would benefit you. Do let me know for any related query or
> explanation. Hope this helps. :)
>
> On Fri, 14 Feb 2020 at 21:23, Audrey Lorberfeld -
> audrey.lorberf...@ibm.com
>  wrote:
>
> > Hi all,
> >
> > How do you all evaluate the success of your query autocomplete (i.e.
> > suggester) component if you use it?
> >
> > We cannot use MRR for various reasons (I can go into them if you're
> > interested), so we're thinking of using nDCG since we already use
> that for
> > relevance eval of our system as a whole. I am also interested in the
> metric
> > "success at top-k," but I can't find any research papers that
> explicitly
> > define "success" -- I am assuming it's a suggestion (or suggestions)
> > labeled "relevant," but maybe it could also simply be the suggestion
> that
> > receives a click from the user?
> >
> > Would love to hear from the hive mind!
> >
> > Best,
> > Audrey
> >
> > --
> >
> >
> >
>
> --
> --
> Regards,
>
> *Paras Lehana* [65871]
> Development Engineer, *Auto-Suggest*,
> IndiaMART InterMESH Ltd,
>
> 11th Floor, Tower 2, Assotech Business Cresterra,
> Plot No. 22, Sector 135, Noida, Uttar Pradesh, India 201305
>
> Mob.: +91-9560911996
> Work: 0120-4056700 | Extn:
> *11096*
>
> --
> *
> *
>
>  <
> https://urldefense.proofpoint.com/v2/url?u=https-3A__www.facebook.com_IndiaMART_videos_578196442936091_=DwIBaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=CTfu2EkiAFh-Ra4cn3EL2GdkKLBhD754dBAoRYpr2uc=kwWlK4TbSM6iPH6DBIrwg3QCeHrY-83N5hm2HtQQsjc=
> >
>
>
>

-- 
-- 
Regards,

*Paras Lehana* [65871]
Development Engineer, *Auto-Suggest*,
IndiaMART InterMESH Ltd,

11th Floor, Tower 2, Assotech Business Cresterra,
Plot No. 22, Sector 135, Noida, Uttar Pradesh, India 201305

Mob.: +91-9560911996
Work: 0120-4056700 | Extn:
*11096*

-- 
*
*

 


Re: Re: Query Autocomplete Evaluation

2020-02-24 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Hi Paras,

This is SO helpful, thank you. Quick question about your MRR metric -- do you 
have binary human judgements for your suggestions? If no, how do you label 
suggestions successful or not?

Best,
Audrey

On 2/24/20, 2:27 AM, "Paras Lehana"  wrote:

Hi Audrey,

I work for Auto-Suggest at IndiaMART. Although we don't use the Suggester
component, I think you need evaluation metrics for Auto-Suggest as a
business product and not specifically for Solr Suggester which is the
backend. We use edismax parser with EdgeNGrams Tokenization.

Every week, as the property owner, I report around 500 metrics. I would
like to mention a few of those:

   1. MRR (Mean Reciprocal Rate): How high the user selection was among the
   returned result. Ranges from 0 to 1, the higher the better.
   2. APL (Average Prefix Length): Prefix is the query by user. Lesser the
   better. This reports how less an average user has to type for getting the
   intended suggestion.
   3. Acceptance Rate or Selection: How many of the total searches are
   being served from Auto-Suggest. We are around 50%.
   4. Selection to Display Ratio: Did you make the user to click any of the
   suggestions if they are displayed?
   5. Response Time: How fast are you serving your average query.


The Selection and Response Time are our main KPIs. We track a lot about
Auto-Suggest usage on our platform which becomes apparent if you observe
the URL after clicking a suggestion on dir.indiamart.com. However, not
everything would benefit you. Do let me know for any related query or
explanation. Hope this helps. :)

On Fri, 14 Feb 2020 at 21:23, Audrey Lorberfeld - audrey.lorberf...@ibm.com
 wrote:

> Hi all,
>
> How do you all evaluate the success of your query autocomplete (i.e.
> suggester) component if you use it?
>
> We cannot use MRR for various reasons (I can go into them if you're
> interested), so we're thinking of using nDCG since we already use that for
> relevance eval of our system as a whole. I am also interested in the 
metric
> "success at top-k," but I can't find any research papers that explicitly
> define "success" -- I am assuming it's a suggestion (or suggestions)
> labeled "relevant," but maybe it could also simply be the suggestion that
> receives a click from the user?
>
> Would love to hear from the hive mind!
>
> Best,
> Audrey
>
> --
>
>
>

-- 
-- 
Regards,

*Paras Lehana* [65871]
Development Engineer, *Auto-Suggest*,
IndiaMART InterMESH Ltd,

11th Floor, Tower 2, Assotech Business Cresterra,
Plot No. 22, Sector 135, Noida, Uttar Pradesh, India 201305

Mob.: +91-9560911996
Work: 0120-4056700 | Extn:
*11096*

-- 
*
*

 





Re: Re-creating deleted Managed Stopwords lists results in error

2020-02-17 Thread Walter Underwood
Make phrases into single tokens at indexing and query time. Let the engine do
the rest of the work.

For example, “subunits of the army” can become “subunitsofthearmy” or 
“subunits_of_the_army”.
We used patterns to choose phrases, so “word word”, “word glue word”, or “word 
glue glue word”
could become phrases.

Nutch did something like this, but used it for filtering down the candidates 
for matching,
then used regular Lucene scoring for ranking.

The Infoseek Ultra index used these phrase terms but did not store positions.

The idea came from early DNA search engines.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Feb 17, 2020, at 10:53 AM, David Hastings  
> wrote:
> 
> interesting, i cant seem to find anything on Phrase IDF, dont suppose you
> have a link or two i could look at by chance?
> 
> On Mon, Feb 17, 2020 at 1:48 PM Walter Underwood 
> wrote:
> 
>> At Infoseek, we used “glue words” to build phrase tokens. It was really
>> effective.
>> Phrase IDF is powerful stuff.
>> 
>> Luckily for you, the patent on that has expired. :-)
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>>> On Feb 17, 2020, at 10:46 AM, David Hastings <
>> hastings.recurs...@gmail.com> wrote:
>>> 
>>> i use stop words for building shingles into "interesting phrases" for my
>>> machine teacher/students, so i wouldnt say theres no reason, however my
>> use
>>> case is very specific.  Otherwise yeah, theyre gone for all practical
>>> reasons/search scenarios.
>>> 
>>> On Mon, Feb 17, 2020 at 1:41 PM Walter Underwood 
>>> wrote:
>>> 
 Why are you using stopwords? I would need a really, really good reason
>> to
 use those.
 
 Stopwords are an obsolete technique from 16-bit processors. I’ve never
 used them and
 I’ve been a search engineer since 1997.
 
 wunder
 Walter Underwood
 wun...@wunderwood.org
 http://observer.wunderwood.org/  (my blog)
 
> On Feb 17, 2020, at 7:31 AM, Thomas Corthals 
 wrote:
> 
> Hi
> 
> I've run into an issue with creating a Managed Stopwords list that has
 the
> same name as a previously deleted list. Going through the same flow
>> with
> Managed Synonyms doesn't result in this unexpected behaviour. Am I
 missing
> something or did I discover a bug in Solr?
> 
> On a newly started solr with the techproducts core:
> 
> curl -X PUT -H 'Content-type:application/json' --data-binary
> 
>> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> 
 
>> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> curl -X DELETE
> 
 
>> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> curl
 http://localhost:8983/solr/admin/cores?action=RELOAD\=techproducts
> curl -X PUT -H 'Content-type:application/json' --data-binary
> 
>> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> 
 
>> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> 
> The second PUT request results in a status 500 with error
> msg "java.util.LinkedHashMap cannot be cast to java.util.List".
> 
> Similar requests for synonyms work fine, no matter how many times I
 repeat
> the CREATE/DELETE/RELOAD cycle:
> 
> curl -X PUT -H 'Content-type:application/json' --data-binary
> 
 
>> '{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}'
> 
>> http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
> curl -X DELETE
> 
>> http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
> curl
 http://localhost:8983/solr/admin/cores?action=RELOAD\=techproducts
> curl -X PUT -H 'Content-type:application/json' --data-binary
> 
 
>> '{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}'
> 
>> http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
> 
> Reloading after creating the Stopwords list but not after deleting it
 works
> without error too on a fresh techproducts core (you'll have to remove
>> the
> directory from disk and create the core again after running the
>> previous
> commands).
> 
> curl -X PUT -H 'Content-type:application/json' --data-binary
> 
>> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> 
 
>> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> curl
 http://localhost:8983/solr/admin/cores?action=RELOAD\=techproducts
> curl -X DELETE
> 
 
>> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> curl -X PUT -H 'Content-type:application/json' --data-binary
> 
>> 

Re: Re-creating deleted Managed Stopwords lists results in error

2020-02-17 Thread David Hastings
interesting, i cant seem to find anything on Phrase IDF, dont suppose you
have a link or two i could look at by chance?

On Mon, Feb 17, 2020 at 1:48 PM Walter Underwood 
wrote:

> At Infoseek, we used “glue words” to build phrase tokens. It was really
> effective.
> Phrase IDF is powerful stuff.
>
> Luckily for you, the patent on that has expired. :-)
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Feb 17, 2020, at 10:46 AM, David Hastings <
> hastings.recurs...@gmail.com> wrote:
> >
> > i use stop words for building shingles into "interesting phrases" for my
> > machine teacher/students, so i wouldnt say theres no reason, however my
> use
> > case is very specific.  Otherwise yeah, theyre gone for all practical
> > reasons/search scenarios.
> >
> > On Mon, Feb 17, 2020 at 1:41 PM Walter Underwood 
> > wrote:
> >
> >> Why are you using stopwords? I would need a really, really good reason
> to
> >> use those.
> >>
> >> Stopwords are an obsolete technique from 16-bit processors. I’ve never
> >> used them and
> >> I’ve been a search engineer since 1997.
> >>
> >> wunder
> >> Walter Underwood
> >> wun...@wunderwood.org
> >> http://observer.wunderwood.org/  (my blog)
> >>
> >>> On Feb 17, 2020, at 7:31 AM, Thomas Corthals 
> >> wrote:
> >>>
> >>> Hi
> >>>
> >>> I've run into an issue with creating a Managed Stopwords list that has
> >> the
> >>> same name as a previously deleted list. Going through the same flow
> with
> >>> Managed Synonyms doesn't result in this unexpected behaviour. Am I
> >> missing
> >>> something or did I discover a bug in Solr?
> >>>
> >>> On a newly started solr with the techproducts core:
> >>>
> >>> curl -X PUT -H 'Content-type:application/json' --data-binary
> >>>
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> >>>
> >>
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> >>> curl -X DELETE
> >>>
> >>
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> >>> curl
> >> http://localhost:8983/solr/admin/cores?action=RELOAD\=techproducts
> >>> curl -X PUT -H 'Content-type:application/json' --data-binary
> >>>
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> >>>
> >>
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> >>>
> >>> The second PUT request results in a status 500 with error
> >>> msg "java.util.LinkedHashMap cannot be cast to java.util.List".
> >>>
> >>> Similar requests for synonyms work fine, no matter how many times I
> >> repeat
> >>> the CREATE/DELETE/RELOAD cycle:
> >>>
> >>> curl -X PUT -H 'Content-type:application/json' --data-binary
> >>>
> >>
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}'
> >>>
> http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
> >>> curl -X DELETE
> >>>
> http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
> >>> curl
> >> http://localhost:8983/solr/admin/cores?action=RELOAD\=techproducts
> >>> curl -X PUT -H 'Content-type:application/json' --data-binary
> >>>
> >>
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}'
> >>>
> http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
> >>>
> >>> Reloading after creating the Stopwords list but not after deleting it
> >> works
> >>> without error too on a fresh techproducts core (you'll have to remove
> the
> >>> directory from disk and create the core again after running the
> previous
> >>> commands).
> >>>
> >>> curl -X PUT -H 'Content-type:application/json' --data-binary
> >>>
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> >>>
> >>
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> >>> curl
> >> http://localhost:8983/solr/admin/cores?action=RELOAD\=techproducts
> >>> curl -X DELETE
> >>>
> >>
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> >>> curl -X PUT -H 'Content-type:application/json' --data-binary
> >>>
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> >>>
> >>
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> >>>
> >>> And even curiouser, when doing a CREATE/DELETE for Stopwords, then a
> >>> CREATE/DELETE for Synonyms, and only then a RELOAD of the core, the
> cycle
> >>> can be completed twice. (Again, on a freshly created techproducts
> core.)
> >>> Only the third attempt to create a list results in an error. Synonyms
> can
> >>> still be created and deleted repeatedly after this.
> >>>
> >>> curl -X PUT -H 'Content-type:application/json' --data-binary
> >>>
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> >>>
> >>
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> >>> curl -X DELETE
> >>>
> >>
> 

Re: Re-creating deleted Managed Stopwords lists results in error

2020-02-17 Thread Walter Underwood
At Infoseek, we used “glue words” to build phrase tokens. It was really 
effective.
Phrase IDF is powerful stuff.

Luckily for you, the patent on that has expired. :-)

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Feb 17, 2020, at 10:46 AM, David Hastings  
> wrote:
> 
> i use stop words for building shingles into "interesting phrases" for my
> machine teacher/students, so i wouldnt say theres no reason, however my use
> case is very specific.  Otherwise yeah, theyre gone for all practical
> reasons/search scenarios.
> 
> On Mon, Feb 17, 2020 at 1:41 PM Walter Underwood 
> wrote:
> 
>> Why are you using stopwords? I would need a really, really good reason to
>> use those.
>> 
>> Stopwords are an obsolete technique from 16-bit processors. I’ve never
>> used them and
>> I’ve been a search engineer since 1997.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>>> On Feb 17, 2020, at 7:31 AM, Thomas Corthals 
>> wrote:
>>> 
>>> Hi
>>> 
>>> I've run into an issue with creating a Managed Stopwords list that has
>> the
>>> same name as a previously deleted list. Going through the same flow with
>>> Managed Synonyms doesn't result in this unexpected behaviour. Am I
>> missing
>>> something or did I discover a bug in Solr?
>>> 
>>> On a newly started solr with the techproducts core:
>>> 
>>> curl -X PUT -H 'Content-type:application/json' --data-binary
>>> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
>>> 
>> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
>>> curl -X DELETE
>>> 
>> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
>>> curl
>> http://localhost:8983/solr/admin/cores?action=RELOAD\=techproducts
>>> curl -X PUT -H 'Content-type:application/json' --data-binary
>>> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
>>> 
>> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
>>> 
>>> The second PUT request results in a status 500 with error
>>> msg "java.util.LinkedHashMap cannot be cast to java.util.List".
>>> 
>>> Similar requests for synonyms work fine, no matter how many times I
>> repeat
>>> the CREATE/DELETE/RELOAD cycle:
>>> 
>>> curl -X PUT -H 'Content-type:application/json' --data-binary
>>> 
>> '{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}'
>>> http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
>>> curl -X DELETE
>>> http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
>>> curl
>> http://localhost:8983/solr/admin/cores?action=RELOAD\=techproducts
>>> curl -X PUT -H 'Content-type:application/json' --data-binary
>>> 
>> '{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}'
>>> http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
>>> 
>>> Reloading after creating the Stopwords list but not after deleting it
>> works
>>> without error too on a fresh techproducts core (you'll have to remove the
>>> directory from disk and create the core again after running the previous
>>> commands).
>>> 
>>> curl -X PUT -H 'Content-type:application/json' --data-binary
>>> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
>>> 
>> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
>>> curl
>> http://localhost:8983/solr/admin/cores?action=RELOAD\=techproducts
>>> curl -X DELETE
>>> 
>> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
>>> curl -X PUT -H 'Content-type:application/json' --data-binary
>>> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
>>> 
>> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
>>> 
>>> And even curiouser, when doing a CREATE/DELETE for Stopwords, then a
>>> CREATE/DELETE for Synonyms, and only then a RELOAD of the core, the cycle
>>> can be completed twice. (Again, on a freshly created techproducts core.)
>>> Only the third attempt to create a list results in an error. Synonyms can
>>> still be created and deleted repeatedly after this.
>>> 
>>> curl -X PUT -H 'Content-type:application/json' --data-binary
>>> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
>>> 
>> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
>>> curl -X DELETE
>>> 
>> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
>>> curl -X PUT -H 'Content-type:application/json' --data-binary
>>> 
>> '{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}'
>>> http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
>>> curl -X DELETE
>>> http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
>>> curl
>> http://localhost:8983/solr/admin/cores?action=RELOAD\=techproducts

Re: Re-creating deleted Managed Stopwords lists results in error

2020-02-17 Thread David Hastings
i use stop words for building shingles into "interesting phrases" for my
machine teacher/students, so i wouldnt say theres no reason, however my use
case is very specific.  Otherwise yeah, theyre gone for all practical
reasons/search scenarios.

On Mon, Feb 17, 2020 at 1:41 PM Walter Underwood 
wrote:

> Why are you using stopwords? I would need a really, really good reason to
> use those.
>
> Stopwords are an obsolete technique from 16-bit processors. I’ve never
> used them and
> I’ve been a search engineer since 1997.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Feb 17, 2020, at 7:31 AM, Thomas Corthals 
> wrote:
> >
> > Hi
> >
> > I've run into an issue with creating a Managed Stopwords list that has
> the
> > same name as a previously deleted list. Going through the same flow with
> > Managed Synonyms doesn't result in this unexpected behaviour. Am I
> missing
> > something or did I discover a bug in Solr?
> >
> > On a newly started solr with the techproducts core:
> >
> > curl -X PUT -H 'Content-type:application/json' --data-binary
> > '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> >
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> > curl -X DELETE
> >
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> > curl
> http://localhost:8983/solr/admin/cores?action=RELOAD\=techproducts
> > curl -X PUT -H 'Content-type:application/json' --data-binary
> > '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> >
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> >
> > The second PUT request results in a status 500 with error
> > msg "java.util.LinkedHashMap cannot be cast to java.util.List".
> >
> > Similar requests for synonyms work fine, no matter how many times I
> repeat
> > the CREATE/DELETE/RELOAD cycle:
> >
> > curl -X PUT -H 'Content-type:application/json' --data-binary
> >
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}'
> > http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
> > curl -X DELETE
> > http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
> > curl
> http://localhost:8983/solr/admin/cores?action=RELOAD\=techproducts
> > curl -X PUT -H 'Content-type:application/json' --data-binary
> >
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}'
> > http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
> >
> > Reloading after creating the Stopwords list but not after deleting it
> works
> > without error too on a fresh techproducts core (you'll have to remove the
> > directory from disk and create the core again after running the previous
> > commands).
> >
> > curl -X PUT -H 'Content-type:application/json' --data-binary
> > '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> >
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> > curl
> http://localhost:8983/solr/admin/cores?action=RELOAD\=techproducts
> > curl -X DELETE
> >
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> > curl -X PUT -H 'Content-type:application/json' --data-binary
> > '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> >
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> >
> > And even curiouser, when doing a CREATE/DELETE for Stopwords, then a
> > CREATE/DELETE for Synonyms, and only then a RELOAD of the core, the cycle
> > can be completed twice. (Again, on a freshly created techproducts core.)
> > Only the third attempt to create a list results in an error. Synonyms can
> > still be created and deleted repeatedly after this.
> >
> > curl -X PUT -H 'Content-type:application/json' --data-binary
> > '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> >
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> > curl -X DELETE
> >
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> > curl -X PUT -H 'Content-type:application/json' --data-binary
> >
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}'
> > http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
> > curl -X DELETE
> > http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
> > curl
> http://localhost:8983/solr/admin/cores?action=RELOAD\=techproducts
> > curl -X PUT -H 'Content-type:application/json' --data-binary
> > '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> >
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> > curl -X DELETE
> >
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> > curl -X PUT -H 'Content-type:application/json' --data-binary
> >

Re: Re-creating deleted Managed Stopwords lists results in error

2020-02-17 Thread Walter Underwood
Why are you using stopwords? I would need a really, really good reason to use 
those.

Stopwords are an obsolete technique from 16-bit processors. I’ve never used 
them and
I’ve been a search engineer since 1997.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Feb 17, 2020, at 7:31 AM, Thomas Corthals  wrote:
> 
> Hi
> 
> I've run into an issue with creating a Managed Stopwords list that has the
> same name as a previously deleted list. Going through the same flow with
> Managed Synonyms doesn't result in this unexpected behaviour. Am I missing
> something or did I discover a bug in Solr?
> 
> On a newly started solr with the techproducts core:
> 
> curl -X PUT -H 'Content-type:application/json' --data-binary
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> curl -X DELETE
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> curl http://localhost:8983/solr/admin/cores?action=RELOAD\=techproducts
> curl -X PUT -H 'Content-type:application/json' --data-binary
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> 
> The second PUT request results in a status 500 with error
> msg "java.util.LinkedHashMap cannot be cast to java.util.List".
> 
> Similar requests for synonyms work fine, no matter how many times I repeat
> the CREATE/DELETE/RELOAD cycle:
> 
> curl -X PUT -H 'Content-type:application/json' --data-binary
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}'
> http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
> curl -X DELETE
> http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
> curl http://localhost:8983/solr/admin/cores?action=RELOAD\=techproducts
> curl -X PUT -H 'Content-type:application/json' --data-binary
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}'
> http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
> 
> Reloading after creating the Stopwords list but not after deleting it works
> without error too on a fresh techproducts core (you'll have to remove the
> directory from disk and create the core again after running the previous
> commands).
> 
> curl -X PUT -H 'Content-type:application/json' --data-binary
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> curl http://localhost:8983/solr/admin/cores?action=RELOAD\=techproducts
> curl -X DELETE
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> curl -X PUT -H 'Content-type:application/json' --data-binary
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> 
> And even curiouser, when doing a CREATE/DELETE for Stopwords, then a
> CREATE/DELETE for Synonyms, and only then a RELOAD of the core, the cycle
> can be completed twice. (Again, on a freshly created techproducts core.)
> Only the third attempt to create a list results in an error. Synonyms can
> still be created and deleted repeatedly after this.
> 
> curl -X PUT -H 'Content-type:application/json' --data-binary
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> curl -X DELETE
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> curl -X PUT -H 'Content-type:application/json' --data-binary
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}'
> http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
> curl -X DELETE
> http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
> curl http://localhost:8983/solr/admin/cores?action=RELOAD\=techproducts
> curl -X PUT -H 'Content-type:application/json' --data-binary
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> curl -X DELETE
> http://localhost:8983/solr/techproducts/schema/analysis/stopwords/testlist
> curl -X PUT -H 'Content-type:application/json' --data-binary
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedSynonymGraphFilterFactory$SynonymManager"}'
> http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
> curl -X DELETE
> http://localhost:8983/solr/techproducts/schema/analysis/synonyms/testmap
> curl http://localhost:8983/solr/admin/cores?action=RELOAD\=techproducts
> curl -X PUT -H 'Content-type:application/json' --data-binary
> '{"class":"org.apache.solr.rest.schema.analysis.ManagedWordSetResource"}'
> 

Re: Re: Re: Anyone have experience with Query Auto-Suggestor?

2020-01-31 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Hi all, reviving this thread.

For those of you who use an external file for your suggestions, how do you 
decide from your query logs what suggestions to include? Just starting out with 
some exploratory analysis of clicks, dwell times, etc., and would love to hear 
from the community any advise.

Thanks!

Best,
Audrey

On 1/23/20, 2:26 PM, "Erik Hatcher"  wrote:

It's a great idea.   And then index that file into a separate lean 
collection of just the suggestions, along with the weight as another field on 
those documents, to use for ranking them at query time with standard /select 
queries.  (this separate suggest collection would also have appropriate 
tokenization to match the partial words as the user types, like ngramming)

Erik


> On Jan 20, 2020, at 11:54 AM, Audrey Lorberfeld - 
audrey.lorberf...@ibm.com  wrote:
> 
> David, 
> 
> Thank you, that is useful. So, would you recommend using a (clean) field 
over an external dictionary file? We have lots of "top queries" and measure 
their nDCG. A thought was to programmatically generate an external file where 
the weight per query term (or phrase) == its nDCG. Bad idea?
> 
> Best,
> Audrey
> 
> On 1/20/20, 11:51 AM, "David Hastings"  
wrote:
> 
>Ive used this quite a bit, my biggest piece of advice is to choose a 
field
>that you know is clean, with well defined terms/words, you dont want an
>autocomplete that has a massive dictionary, also it will make the
>start/reload times pretty slow
> 
>On Mon, Jan 20, 2020 at 11:47 AM Audrey Lorberfeld -
>audrey.lorberf...@ibm.com  wrote:
> 
>> Hi All,
>> 
>> We plan to incorporate a query autocomplete functionality into our search
>> engine (like this: 
https://urldefense.proofpoint.com/v2/url?u=https-3A__lucene.apache.org_solr_guide_8-5F1_suggester.html=DwIBaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=L8V-izaMW_v4j-1zvfiXSqm6aAoaRtk-VJXA6okBs_U=vnE9KGyF3jky9fSi22XUJEEbKLM1CA7mWAKrl2qhKC0=
 
>> ). And I was wondering if anyone has personal experience with this
>> component and would like to share? Basically, we are just looking for 
some
>> best practices from more experienced Solr admins so that we have a 
starting
>> place to launch this in our beta.
>> 
>> Thank you!
>> 
>> Best,
>> Audrey
>> 
> 
> 





Re: Re: Anyone have experience with Query Auto-Suggestor?

2020-01-26 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Oh, great! Thank you, this is helpful!

On 1/24/20, 6:43 PM, "Walter Underwood"  wrote:

Click-based weights are vulnerable to spamming. Some of us fondly remember 
when
Google was showing Microsoft as the first hit for “evil empire” thanks to a 
click attack.

For our ecommerce search, we use the actual titles of books weighted by 
order volume.
Decorated titles are reduced to a base title, so “Managerial Accounting: 
Student Value Edition”
becomes just “Managerial Accounting”. Showing all the variations is the job 
of the 
real results page.

wunder
Walter Underwood
wun...@wunderwood.org

https://urldefense.proofpoint.com/v2/url?u=http-3A__observer.wunderwood.org_=DwIFaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=3oEhRJWEHDoz3HXt87Y_FXxPTUZg1zSA5r4P6urviug=87IOY_vKNONtR2r2IkW-NnZ4Rn3wI-OIO6RSdqdOMfU=
   (my blog)

> On Jan 24, 2020, at 7:07 AM, Lucky Sharma  wrote:
> 
> Hi Audrey,
> As suggested by Erik, you can index the data into a seperate collection 
and
> You can instead of adding weights inthe document you can also use
> LTR(Learning to Rank) with in Solr to rerank on the documents.
> And also to increase more relevance with in the Autosuggestion and making
> positional context of the user in case of Multi token keywords you can 
also
> bigrams/trigrams to generate edge n-grams.
> 
> 
> 
> Regards,
> Lucky Sharma
> 
> On Fri, 24 Jan, 2020, 8:28 pm Lucky Sharma,  wrote:
> 
>> Hi Audrey,
>> As suggested by Erik, you can index the data into a seperate collection
>> and You can instead of adding weights inthe document you can also use LTR
>> with in Solr to rerank on the features.
>> 
>> Regards,
>> Lucky Sharma
>> 
>> On Fri, 24 Jan, 2020, 8:01 pm Audrey Lorberfeld -
>> audrey.lorberf...@ibm.com,  wrote:
>> 
>>> Erik,
>>> 
>>> Thank you! Yes, that's exactly how we were thinking of architecting it.
>>> And our ML engineer suggested something else for the suggestion weights,
>>> actually -- to build a model that would programmatically update the 
weights
>>> based on those suggestions' live clicks @ position k, etc. Pretty cool
>>> idea...
>>> 
>>> 
>>> 
>>> On 1/23/20, 2:26 PM, "Erik Hatcher"  wrote:
>>> 
>>>It's a great idea.   And then index that file into a separate lean
>>> collection of just the suggestions, along with the weight as another 
field
>>> on those documents, to use for ranking them at query time with standard
>>> /select queries.  (this separate suggest collection would also have
>>> appropriate tokenization to match the partial words as the user types, 
like
>>> ngramming)
>>> 
>>>Erik
>>> 
>>> 
 On Jan 20, 2020, at 11:54 AM, Audrey Lorberfeld -
>>> audrey.lorberf...@ibm.com  wrote:
 
 David,
 
 Thank you, that is useful. So, would you recommend using a (clean)
>>> field over an external dictionary file? We have lots of "top queries" 
and
>>> measure their nDCG. A thought was to programmatically generate an 
external
>>> file where the weight per query term (or phrase) == its nDCG. Bad idea?
 
 Best,
 Audrey
 
 On 1/20/20, 11:51 AM, "David Hastings" <
>>> hastings.recurs...@gmail.com> wrote:
 
   Ive used this quite a bit, my biggest piece of advice is to
>>> choose a field
   that you know is clean, with well defined terms/words, you dont
>>> want an
   autocomplete that has a massive dictionary, also it will make the
   start/reload times pretty slow
 
   On Mon, Jan 20, 2020 at 11:47 AM Audrey Lorberfeld -
   audrey.lorberf...@ibm.com  wrote:
 
> Hi All,
> 
> We plan to incorporate a query autocomplete functionality into our
>>> search
> engine (like this:
>>> 
https://urldefense.proofpoint.com/v2/url?u=https-3A__lucene.apache.org_solr_guide_8-5F1_suggester.html=DwIBaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=L8V-izaMW_v4j-1zvfiXSqm6aAoaRtk-VJXA6okBs_U=vnE9KGyF3jky9fSi22XUJEEbKLM1CA7mWAKrl2qhKC0=
> ). And I was wondering if anyone has personal experience with this
> component and would like to share? Basically, we are just looking
>>> for some
> best practices from more experienced Solr admins so that we have a
>>> starting
> place to launch this in our beta.
> 
> Thank you!
> 
> Best,
> Audrey
> 
 
 
>>> 
>>> 
>>> 
>>> 





Re: Re: Re: Re: Anyone have experience with Query Auto-Suggestor?

2020-01-24 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
David,

True! But we are hoping that these are purely seen as suggestions and that 
people, if they know exactly what they are wanting to type/looking for, will 
simply ignore the dropdown options.

On 1/24/20, 10:03 AM, "David Hastings"  wrote:

This is a really cool idea!  My only concern is that the edge case
searches, where a user knows exactly what they want to find, would be
autocomplete into something that happens to be more "successful" rather
than what they were looking for.  for example, i want to know the legal
implications of jay z's 99 problems.   most of the autocompletes i imagine
would be for the lyrics for the song, or links to the video or jay z
himself, when what im looking for is a line by line analysis of the song
itself and how it relates to the fourth amendment:

https://urldefense.proofpoint.com/v2/url?u=http-3A__pdf.textfiles.com_academics_lj56-2D2-5Fmason-5Farticle.pdf=DwIFaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=CPAGySYcW7hCqtFtjaThX2vIAhcKEMHHhYpqtqHkx-Q=XEyh7ewstUTlEuyKcYHaTU1vHMYA2-Db_nIYnl89yw4=
 

But in general this is a really clever idea, especially in the retail
arena.  However i suspect your use case is more in research, and after
years of dealing with lawyers and librarians, they tend to not like having
their searches intercepted, they know what they're looking for and they
tend to get mad if you assume they dont :)

On Fri, Jan 24, 2020 at 9:59 AM Lucky Sharma  wrote:

> Hi Audrey,
> As suggested by Erik, you can index the data into a seperate collection 
and
> You can instead of adding weights inthe document you can also use LTR with
> in Solr to rerank on the features.
>
> Regards,
> Lucky Sharma
>
> On Fri, 24 Jan, 2020, 8:01 pm Audrey Lorberfeld -
> audrey.lorberf...@ibm.com,
>  wrote:
>
> > Erik,
> >
> > Thank you! Yes, that's exactly how we were thinking of architecting it.
> > And our ML engineer suggested something else for the suggestion weights,
> > actually -- to build a model that would programmatically update the
> weights
> > based on those suggestions' live clicks @ position k, etc. Pretty cool
> > idea...
> >
> >
> >
> > On 1/23/20, 2:26 PM, "Erik Hatcher"  wrote:
> >
> > It's a great idea.   And then index that file into a separate lean
> > collection of just the suggestions, along with the weight as another
> field
> > on those documents, to use for ranking them at query time with standard
> > /select queries.  (this separate suggest collection would also have
> > appropriate tokenization to match the partial words as the user types,
> like
> > ngramming)
> >
> > Erik
> >
> >
> > > On Jan 20, 2020, at 11:54 AM, Audrey Lorberfeld -
> > audrey.lorberf...@ibm.com  wrote:
> > >
> > > David,
> > >
> > > Thank you, that is useful. So, would you recommend using a (clean)
> > field over an external dictionary file? We have lots of "top queries" 
and
> > measure their nDCG. A thought was to programmatically generate an
> external
> > file where the weight per query term (or phrase) == its nDCG. Bad idea?
> > >
> > > Best,
> > > Audrey
> > >
> > > On 1/20/20, 11:51 AM, "David Hastings" <
> hastings.recurs...@gmail.com>
> > wrote:
> > >
> > >Ive used this quite a bit, my biggest piece of advice is to
> > choose a field
> > >that you know is clean, with well defined terms/words, you dont
> > want an
> > >autocomplete that has a massive dictionary, also it will make
> the
> > >start/reload times pretty slow
> > >
> > >On Mon, Jan 20, 2020 at 11:47 AM Audrey Lorberfeld -
> > >audrey.lorberf...@ibm.com  wrote:
> > >
> > >> Hi All,
> > >>
> > >> We plan to incorporate a query autocomplete functionality into 
our
> > search
> > >> engine (like this:
> >
> 
https://urldefense.proofpoint.com/v2/url?u=https-3A__lucene.apache.org_solr_guide_8-5F1_suggester.html=DwIBaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=L8V-izaMW_v4j-1zvfiXSqm6aAoaRtk-VJXA6okBs_U=vnE9KGyF3jky9fSi22XUJEEbKLM1CA7mWAKrl2qhKC0=
> > >> ). And I was wondering if anyone has personal experience with 
this
> > >> component and would like to share? Basically, we are just looking
> > for some
> > >> best practices from more experienced Solr admins so that we have 
a
> > starting
> > >> place to launch this in our beta.
> > >>
> > >> Thank you!
> > >>
> > >> Best,
> > >> Audrey
> > >>
> > >
> > >
> >
> >
> >
> >
>




Re: Re: Re: Anyone have experience with Query Auto-Suggestor?

2020-01-24 Thread Lucky Sharma
Hi Audrey,
As suggested by Erik, you can index the data into a seperate collection and
You can instead of adding weights inthe document you can also use
LTR(Learning to Rank) with in Solr to rerank on the documents.
And also to increase more relevance with in the Autosuggestion and making
positional context of the user in case of Multi token keywords you can also
bigrams/trigrams to generate edge n-grams.



Regards,
Lucky Sharma

On Fri, 24 Jan, 2020, 8:28 pm Lucky Sharma,  wrote:

> Hi Audrey,
> As suggested by Erik, you can index the data into a seperate collection
> and You can instead of adding weights inthe document you can also use LTR
> with in Solr to rerank on the features.
>
> Regards,
> Lucky Sharma
>
> On Fri, 24 Jan, 2020, 8:01 pm Audrey Lorberfeld -
> audrey.lorberf...@ibm.com,  wrote:
>
>> Erik,
>>
>> Thank you! Yes, that's exactly how we were thinking of architecting it.
>> And our ML engineer suggested something else for the suggestion weights,
>> actually -- to build a model that would programmatically update the weights
>> based on those suggestions' live clicks @ position k, etc. Pretty cool
>> idea...
>>
>>
>>
>> On 1/23/20, 2:26 PM, "Erik Hatcher"  wrote:
>>
>> It's a great idea.   And then index that file into a separate lean
>> collection of just the suggestions, along with the weight as another field
>> on those documents, to use for ranking them at query time with standard
>> /select queries.  (this separate suggest collection would also have
>> appropriate tokenization to match the partial words as the user types, like
>> ngramming)
>>
>> Erik
>>
>>
>> > On Jan 20, 2020, at 11:54 AM, Audrey Lorberfeld -
>> audrey.lorberf...@ibm.com  wrote:
>> >
>> > David,
>> >
>> > Thank you, that is useful. So, would you recommend using a (clean)
>> field over an external dictionary file? We have lots of "top queries" and
>> measure their nDCG. A thought was to programmatically generate an external
>> file where the weight per query term (or phrase) == its nDCG. Bad idea?
>> >
>> > Best,
>> > Audrey
>> >
>> > On 1/20/20, 11:51 AM, "David Hastings" <
>> hastings.recurs...@gmail.com> wrote:
>> >
>> >Ive used this quite a bit, my biggest piece of advice is to
>> choose a field
>> >that you know is clean, with well defined terms/words, you dont
>> want an
>> >autocomplete that has a massive dictionary, also it will make the
>> >start/reload times pretty slow
>> >
>> >On Mon, Jan 20, 2020 at 11:47 AM Audrey Lorberfeld -
>> >audrey.lorberf...@ibm.com  wrote:
>> >
>> >> Hi All,
>> >>
>> >> We plan to incorporate a query autocomplete functionality into our
>> search
>> >> engine (like this:
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__lucene.apache.org_solr_guide_8-5F1_suggester.html=DwIBaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=L8V-izaMW_v4j-1zvfiXSqm6aAoaRtk-VJXA6okBs_U=vnE9KGyF3jky9fSi22XUJEEbKLM1CA7mWAKrl2qhKC0=
>> >> ). And I was wondering if anyone has personal experience with this
>> >> component and would like to share? Basically, we are just looking
>> for some
>> >> best practices from more experienced Solr admins so that we have a
>> starting
>> >> place to launch this in our beta.
>> >>
>> >> Thank you!
>> >>
>> >> Best,
>> >> Audrey
>> >>
>> >
>> >
>>
>>
>>
>>


Re: Re: Re: Anyone have experience with Query Auto-Suggestor?

2020-01-24 Thread David Hastings
This is a really cool idea!  My only concern is that the edge case
searches, where a user knows exactly what they want to find, would be
autocomplete into something that happens to be more "successful" rather
than what they were looking for.  for example, i want to know the legal
implications of jay z's 99 problems.   most of the autocompletes i imagine
would be for the lyrics for the song, or links to the video or jay z
himself, when what im looking for is a line by line analysis of the song
itself and how it relates to the fourth amendment:
http://pdf.textfiles.com/academics/lj56-2_mason_article.pdf

But in general this is a really clever idea, especially in the retail
arena.  However i suspect your use case is more in research, and after
years of dealing with lawyers and librarians, they tend to not like having
their searches intercepted, they know what they're looking for and they
tend to get mad if you assume they dont :)

On Fri, Jan 24, 2020 at 9:59 AM Lucky Sharma  wrote:

> Hi Audrey,
> As suggested by Erik, you can index the data into a seperate collection and
> You can instead of adding weights inthe document you can also use LTR with
> in Solr to rerank on the features.
>
> Regards,
> Lucky Sharma
>
> On Fri, 24 Jan, 2020, 8:01 pm Audrey Lorberfeld -
> audrey.lorberf...@ibm.com,
>  wrote:
>
> > Erik,
> >
> > Thank you! Yes, that's exactly how we were thinking of architecting it.
> > And our ML engineer suggested something else for the suggestion weights,
> > actually -- to build a model that would programmatically update the
> weights
> > based on those suggestions' live clicks @ position k, etc. Pretty cool
> > idea...
> >
> >
> >
> > On 1/23/20, 2:26 PM, "Erik Hatcher"  wrote:
> >
> > It's a great idea.   And then index that file into a separate lean
> > collection of just the suggestions, along with the weight as another
> field
> > on those documents, to use for ranking them at query time with standard
> > /select queries.  (this separate suggest collection would also have
> > appropriate tokenization to match the partial words as the user types,
> like
> > ngramming)
> >
> > Erik
> >
> >
> > > On Jan 20, 2020, at 11:54 AM, Audrey Lorberfeld -
> > audrey.lorberf...@ibm.com  wrote:
> > >
> > > David,
> > >
> > > Thank you, that is useful. So, would you recommend using a (clean)
> > field over an external dictionary file? We have lots of "top queries" and
> > measure their nDCG. A thought was to programmatically generate an
> external
> > file where the weight per query term (or phrase) == its nDCG. Bad idea?
> > >
> > > Best,
> > > Audrey
> > >
> > > On 1/20/20, 11:51 AM, "David Hastings" <
> hastings.recurs...@gmail.com>
> > wrote:
> > >
> > >Ive used this quite a bit, my biggest piece of advice is to
> > choose a field
> > >that you know is clean, with well defined terms/words, you dont
> > want an
> > >autocomplete that has a massive dictionary, also it will make
> the
> > >start/reload times pretty slow
> > >
> > >On Mon, Jan 20, 2020 at 11:47 AM Audrey Lorberfeld -
> > >audrey.lorberf...@ibm.com  wrote:
> > >
> > >> Hi All,
> > >>
> > >> We plan to incorporate a query autocomplete functionality into our
> > search
> > >> engine (like this:
> >
> https://urldefense.proofpoint.com/v2/url?u=https-3A__lucene.apache.org_solr_guide_8-5F1_suggester.html=DwIBaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=L8V-izaMW_v4j-1zvfiXSqm6aAoaRtk-VJXA6okBs_U=vnE9KGyF3jky9fSi22XUJEEbKLM1CA7mWAKrl2qhKC0=
> > >> ). And I was wondering if anyone has personal experience with this
> > >> component and would like to share? Basically, we are just looking
> > for some
> > >> best practices from more experienced Solr admins so that we have a
> > starting
> > >> place to launch this in our beta.
> > >>
> > >> Thank you!
> > >>
> > >> Best,
> > >> Audrey
> > >>
> > >
> > >
> >
> >
> >
> >
>


Re: Re: Re: Anyone have experience with Query Auto-Suggestor?

2020-01-24 Thread Lucky Sharma
Hi Audrey,
As suggested by Erik, you can index the data into a seperate collection and
You can instead of adding weights inthe document you can also use LTR with
in Solr to rerank on the features.

Regards,
Lucky Sharma

On Fri, 24 Jan, 2020, 8:01 pm Audrey Lorberfeld - audrey.lorberf...@ibm.com,
 wrote:

> Erik,
>
> Thank you! Yes, that's exactly how we were thinking of architecting it.
> And our ML engineer suggested something else for the suggestion weights,
> actually -- to build a model that would programmatically update the weights
> based on those suggestions' live clicks @ position k, etc. Pretty cool
> idea...
>
>
>
> On 1/23/20, 2:26 PM, "Erik Hatcher"  wrote:
>
> It's a great idea.   And then index that file into a separate lean
> collection of just the suggestions, along with the weight as another field
> on those documents, to use for ranking them at query time with standard
> /select queries.  (this separate suggest collection would also have
> appropriate tokenization to match the partial words as the user types, like
> ngramming)
>
> Erik
>
>
> > On Jan 20, 2020, at 11:54 AM, Audrey Lorberfeld -
> audrey.lorberf...@ibm.com  wrote:
> >
> > David,
> >
> > Thank you, that is useful. So, would you recommend using a (clean)
> field over an external dictionary file? We have lots of "top queries" and
> measure their nDCG. A thought was to programmatically generate an external
> file where the weight per query term (or phrase) == its nDCG. Bad idea?
> >
> > Best,
> > Audrey
> >
> > On 1/20/20, 11:51 AM, "David Hastings" 
> wrote:
> >
> >Ive used this quite a bit, my biggest piece of advice is to
> choose a field
> >that you know is clean, with well defined terms/words, you dont
> want an
> >autocomplete that has a massive dictionary, also it will make the
> >start/reload times pretty slow
> >
> >On Mon, Jan 20, 2020 at 11:47 AM Audrey Lorberfeld -
> >audrey.lorberf...@ibm.com  wrote:
> >
> >> Hi All,
> >>
> >> We plan to incorporate a query autocomplete functionality into our
> search
> >> engine (like this:
> https://urldefense.proofpoint.com/v2/url?u=https-3A__lucene.apache.org_solr_guide_8-5F1_suggester.html=DwIBaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=L8V-izaMW_v4j-1zvfiXSqm6aAoaRtk-VJXA6okBs_U=vnE9KGyF3jky9fSi22XUJEEbKLM1CA7mWAKrl2qhKC0=
> >> ). And I was wondering if anyone has personal experience with this
> >> component and would like to share? Basically, we are just looking
> for some
> >> best practices from more experienced Solr admins so that we have a
> starting
> >> place to launch this in our beta.
> >>
> >> Thank you!
> >>
> >> Best,
> >> Audrey
> >>
> >
> >
>
>
>
>


Re: Re: Re: Anyone have experience with Query Auto-Suggestor?

2020-01-24 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Hi Alessandro,

I'm so happy there is someone who's done extensive work with QAC here! 

Right now, we measure nDCG via a Dynamic Bayesian Network. To break it down, 
we: 
- use a DBN model to generate a "score" for each query_url pair. 
- We then plug that score into a mathematical formula we found in a research 
paper (happy to share the paper if you're interested) for assigning labels 0-4. 
- We then cross-reference the scored & labeled query_url pairs with 1k of our 
system's top queries and 1k of our system's random queries. 
- We use that dataset as our ground truth. 
- We then query the system in real time each day for those 2k queries, label 
them, and compare those labels with our ground truth to get our system's nDCG. 

I hope that makes sense! Lots of steps __

Due to computational overhead reasons, we are pretty committed to using an 
external file & a separate Solr core for our suggestions. We are also planning 
to use the Suggester to add a little human nudge towards "successful" queries. 
I'm not sure whether that's what the Suggester is really meant to do, but we 
are not using it as a naïve prefix-matcher, but more of a query-suggestion 
tool. So, if we know that the query "blue pages" is less successful than the 
query "bluepages" (assuming we can identify the user's intent with this query), 
we will not show suggestions that match "blue pages," instead we will show 
suggestions that match "bluepages." Sort of like a query rewrite, except with 
fuzzy prefix matching, not the introduction of synonyms/expansions.

What we are concerned with currently is how to define a "successful" query. We 
have things like abandonment rate, dwell time, etc., but if you have any advice 
on more ways to identify successful queries, that'd be great. We want to stay 
away from defining success as "popularity," since that will just create a 
closed language system where people only query popular queries, and those 
queries stay popular only because people are querying them (assuming people 
click on the suggestions, of course).

Let me know your thoughts!

On 1/23/20, 10:45 AM, "Alessandro Benedetti"  wrote:

I have been working extensively on query autocompletion, these blogs should
be helpful to you:


https://urldefense.proofpoint.com/v2/url?u=https-3A__sease.io_2015_07_solr-2Dyou-2Dcomplete-2Dme.html=DwIFaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=0lExcWXK-kGTAfpnv-kU_LGminLzJjJKv6hYBFQG7iI=c149I_QBokd35FBMGaUxoBPMViUXAdZtVnkSKTINndE=
 

https://urldefense.proofpoint.com/v2/url?u=https-3A__sease.io_2018_06_apache-2Dlucene-2Dblendedinfixsuggester-2Dhow-2Dit-2Dworks-2Dbugs-2Dand-2Dimprovements.html=DwIFaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=0lExcWXK-kGTAfpnv-kU_LGminLzJjJKv6hYBFQG7iI=m8s2XvI7tR1t9bNaA4SI-w90MdbLZTYxc0mBMz8RMSw=
 

You idea of using search quality evaluation to drive the autocompletion is
interesting.
How do you currently calculate the NDCG for a query? What's your golden
truth?
Using that approach you will autocomplete favouring query completion that
your search engine is able to process better, not necessarily closer to the
user intent, still it could work.

We should differentiate here between the suggester dictionary (where the
suggestions come from, in your case it could be your extracted data) and
the kind of suggestion (that in your case could be the free text suggester
lookup)

Cheers
--
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
www.sease.io


On Mon, 20 Jan 2020 at 17:02, David Hastings 
wrote:

> Not a bad idea at all, however ive never used an external file before, 
just
> a field in the index, so not an area im familiar with
>
> On Mon, Jan 20, 2020 at 11:55 AM Audrey Lorberfeld -
> audrey.lorberf...@ibm.com  wrote:
>
> > David,
> >
> > Thank you, that is useful. So, would you recommend using a (clean) field
> > over an external dictionary file? We have lots of "top queries" and
> measure
> > their nDCG. A thought was to programmatically generate an external file
> > where the weight per query term (or phrase) == its nDCG. Bad idea?
> >
> > Best,
> > Audrey
> >
> > On 1/20/20, 11:51 AM, "David Hastings" 
> > wrote:
> >
> > Ive used this quite a bit, my biggest piece of advice is to choose a
> > field
> > that you know is clean, with well defined terms/words, you dont want
> an
> > autocomplete that has a massive dictionary, also it will make the
> > start/reload times pretty slow
> >
> > On Mon, Jan 20, 2020 at 11:47 AM Audrey Lorberfeld -
> > audrey.lorberf...@ibm.com  wrote:
> >
> > > Hi All,
> > >
> > > We plan to incorporate a query autocomplete functionality into our
> > search
> 

Re: Re: Re: Anyone have experience with Query Auto-Suggestor?

2020-01-24 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Erik,

Thank you! Yes, that's exactly how we were thinking of architecting it. And our 
ML engineer suggested something else for the suggestion weights, actually -- to 
build a model that would programmatically update the weights based on those 
suggestions' live clicks @ position k, etc. Pretty cool idea... 



On 1/23/20, 2:26 PM, "Erik Hatcher"  wrote:

It's a great idea.   And then index that file into a separate lean 
collection of just the suggestions, along with the weight as another field on 
those documents, to use for ranking them at query time with standard /select 
queries.  (this separate suggest collection would also have appropriate 
tokenization to match the partial words as the user types, like ngramming)

Erik


> On Jan 20, 2020, at 11:54 AM, Audrey Lorberfeld - 
audrey.lorberf...@ibm.com  wrote:
> 
> David, 
> 
> Thank you, that is useful. So, would you recommend using a (clean) field 
over an external dictionary file? We have lots of "top queries" and measure 
their nDCG. A thought was to programmatically generate an external file where 
the weight per query term (or phrase) == its nDCG. Bad idea?
> 
> Best,
> Audrey
> 
> On 1/20/20, 11:51 AM, "David Hastings"  
wrote:
> 
>Ive used this quite a bit, my biggest piece of advice is to choose a 
field
>that you know is clean, with well defined terms/words, you dont want an
>autocomplete that has a massive dictionary, also it will make the
>start/reload times pretty slow
> 
>On Mon, Jan 20, 2020 at 11:47 AM Audrey Lorberfeld -
>audrey.lorberf...@ibm.com  wrote:
> 
>> Hi All,
>> 
>> We plan to incorporate a query autocomplete functionality into our search
>> engine (like this: 
https://urldefense.proofpoint.com/v2/url?u=https-3A__lucene.apache.org_solr_guide_8-5F1_suggester.html=DwIBaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=L8V-izaMW_v4j-1zvfiXSqm6aAoaRtk-VJXA6okBs_U=vnE9KGyF3jky9fSi22XUJEEbKLM1CA7mWAKrl2qhKC0=
 
>> ). And I was wondering if anyone has personal experience with this
>> component and would like to share? Basically, we are just looking for 
some
>> best practices from more experienced Solr admins so that we have a 
starting
>> place to launch this in our beta.
>> 
>> Thank you!
>> 
>> Best,
>> Audrey
>> 
> 
> 





Re: Re: Anyone have experience with Query Auto-Suggestor?

2020-01-23 Thread Erik Hatcher
It's a great idea.   And then index that file into a separate lean collection 
of just the suggestions, along with the weight as another field on those 
documents, to use for ranking them at query time with standard /select queries. 
 (this separate suggest collection would also have appropriate tokenization to 
match the partial words as the user types, like ngramming)

Erik


> On Jan 20, 2020, at 11:54 AM, Audrey Lorberfeld - audrey.lorberf...@ibm.com 
>  wrote:
> 
> David, 
> 
> Thank you, that is useful. So, would you recommend using a (clean) field over 
> an external dictionary file? We have lots of "top queries" and measure their 
> nDCG. A thought was to programmatically generate an external file where the 
> weight per query term (or phrase) == its nDCG. Bad idea?
> 
> Best,
> Audrey
> 
> On 1/20/20, 11:51 AM, "David Hastings"  wrote:
> 
>Ive used this quite a bit, my biggest piece of advice is to choose a field
>that you know is clean, with well defined terms/words, you dont want an
>autocomplete that has a massive dictionary, also it will make the
>start/reload times pretty slow
> 
>On Mon, Jan 20, 2020 at 11:47 AM Audrey Lorberfeld -
>audrey.lorberf...@ibm.com  wrote:
> 
>> Hi All,
>> 
>> We plan to incorporate a query autocomplete functionality into our search
>> engine (like this: 
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__lucene.apache.org_solr_guide_8-5F1_suggester.html=DwIBaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=L8V-izaMW_v4j-1zvfiXSqm6aAoaRtk-VJXA6okBs_U=vnE9KGyF3jky9fSi22XUJEEbKLM1CA7mWAKrl2qhKC0=
>>  
>> ). And I was wondering if anyone has personal experience with this
>> component and would like to share? Basically, we are just looking for some
>> best practices from more experienced Solr admins so that we have a starting
>> place to launch this in our beta.
>> 
>> Thank you!
>> 
>> Best,
>> Audrey
>> 
> 
> 



Re: Re: Anyone have experience with Query Auto-Suggestor?

2020-01-23 Thread Alessandro Benedetti
I have been working extensively on query autocompletion, these blogs should
be helpful to you:

https://sease.io/2015/07/solr-you-complete-me.html
https://sease.io/2018/06/apache-lucene-blendedinfixsuggester-how-it-works-bugs-and-improvements.html

You idea of using search quality evaluation to drive the autocompletion is
interesting.
How do you currently calculate the NDCG for a query? What's your golden
truth?
Using that approach you will autocomplete favouring query completion that
your search engine is able to process better, not necessarily closer to the
user intent, still it could work.

We should differentiate here between the suggester dictionary (where the
suggestions come from, in your case it could be your extracted data) and
the kind of suggestion (that in your case could be the free text suggester
lookup)

Cheers
--
Alessandro Benedetti
Search Consultant, R Software Engineer, Director
www.sease.io


On Mon, 20 Jan 2020 at 17:02, David Hastings 
wrote:

> Not a bad idea at all, however ive never used an external file before, just
> a field in the index, so not an area im familiar with
>
> On Mon, Jan 20, 2020 at 11:55 AM Audrey Lorberfeld -
> audrey.lorberf...@ibm.com  wrote:
>
> > David,
> >
> > Thank you, that is useful. So, would you recommend using a (clean) field
> > over an external dictionary file? We have lots of "top queries" and
> measure
> > their nDCG. A thought was to programmatically generate an external file
> > where the weight per query term (or phrase) == its nDCG. Bad idea?
> >
> > Best,
> > Audrey
> >
> > On 1/20/20, 11:51 AM, "David Hastings" 
> > wrote:
> >
> > Ive used this quite a bit, my biggest piece of advice is to choose a
> > field
> > that you know is clean, with well defined terms/words, you dont want
> an
> > autocomplete that has a massive dictionary, also it will make the
> > start/reload times pretty slow
> >
> > On Mon, Jan 20, 2020 at 11:47 AM Audrey Lorberfeld -
> > audrey.lorberf...@ibm.com  wrote:
> >
> > > Hi All,
> > >
> > > We plan to incorporate a query autocomplete functionality into our
> > search
> > > engine (like this:
> >
> https://urldefense.proofpoint.com/v2/url?u=https-3A__lucene.apache.org_solr_guide_8-5F1_suggester.html=DwIBaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=L8V-izaMW_v4j-1zvfiXSqm6aAoaRtk-VJXA6okBs_U=vnE9KGyF3jky9fSi22XUJEEbKLM1CA7mWAKrl2qhKC0=
> > > ). And I was wondering if anyone has personal experience with this
> > > component and would like to share? Basically, we are just looking
> > for some
> > > best practices from more experienced Solr admins so that we have a
> > starting
> > > place to launch this in our beta.
> > >
> > > Thank you!
> > >
> > > Best,
> > > Audrey
> > >
> >
> >
> >
>


Re: Re: Re: Re: Handling overlapping synonyms

2020-01-20 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Hm, I'm not sure what you mean, but I am pretty new to Solr. Apologies!

On 1/20/20, 12:01 PM, "fiedzia"  wrote:

>From my understanding, if you want regional sales manager to be indexed as
both director of sales and area manager, you  
>would have to type:
>
>Regional sales manager -> director of sales, area manager

that works for searching, but because everything is in the same position, 
searching for "director of sales" highlights whole "regional sales manager".

while it should be indexed as: (numbers inidicate token positions

1   2   3
regional sales manager

1
area manager
 2 director of sales


I guess I'll need to override SynonymGraphFilter to achieve that



--
Sent from: 
https://urldefense.proofpoint.com/v2/url?u=https-3A__lucene.472066.n3.nabble.com_Solr-2DUser-2Df472068.html=DwICAg=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=tDOfGxVxBgFG1YZDv8WICuXs07jdb2IIpoJ0j3Fu7nc=yT0_rHgmEbHTvjxL9Vw9TN3d0TeqHg6avTkuseDWDw8=
 




Re: Re: Anyone have experience with Query Auto-Suggestor?

2020-01-20 Thread David Hastings
Not a bad idea at all, however ive never used an external file before, just
a field in the index, so not an area im familiar with

On Mon, Jan 20, 2020 at 11:55 AM Audrey Lorberfeld -
audrey.lorberf...@ibm.com  wrote:

> David,
>
> Thank you, that is useful. So, would you recommend using a (clean) field
> over an external dictionary file? We have lots of "top queries" and measure
> their nDCG. A thought was to programmatically generate an external file
> where the weight per query term (or phrase) == its nDCG. Bad idea?
>
> Best,
> Audrey
>
> On 1/20/20, 11:51 AM, "David Hastings" 
> wrote:
>
> Ive used this quite a bit, my biggest piece of advice is to choose a
> field
> that you know is clean, with well defined terms/words, you dont want an
> autocomplete that has a massive dictionary, also it will make the
> start/reload times pretty slow
>
> On Mon, Jan 20, 2020 at 11:47 AM Audrey Lorberfeld -
> audrey.lorberf...@ibm.com  wrote:
>
> > Hi All,
> >
> > We plan to incorporate a query autocomplete functionality into our
> search
> > engine (like this:
> https://urldefense.proofpoint.com/v2/url?u=https-3A__lucene.apache.org_solr_guide_8-5F1_suggester.html=DwIBaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=L8V-izaMW_v4j-1zvfiXSqm6aAoaRtk-VJXA6okBs_U=vnE9KGyF3jky9fSi22XUJEEbKLM1CA7mWAKrl2qhKC0=
> > ). And I was wondering if anyone has personal experience with this
> > component and would like to share? Basically, we are just looking
> for some
> > best practices from more experienced Solr admins so that we have a
> starting
> > place to launch this in our beta.
> >
> > Thank you!
> >
> > Best,
> > Audrey
> >
>
>
>


Re: Re: Re: Handling overlapping synonyms

2020-01-20 Thread fiedzia
>From my understanding, if you want regional sales manager to be indexed as
both director of sales and area manager, you  
>would have to type:
>
>Regional sales manager -> director of sales, area manager

that works for searching, but because everything is in the same position, 
searching for "director of sales" highlights whole "regional sales manager".

while it should be indexed as: (numbers inidicate token positions

1   2   3
regional sales manager

1
area manager
 2 director of sales


I guess I'll need to override SynonymGraphFilter to achieve that



--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Re: Anyone have experience with Query Auto-Suggestor?

2020-01-20 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
David, 

Thank you, that is useful. So, would you recommend using a (clean) field over 
an external dictionary file? We have lots of "top queries" and measure their 
nDCG. A thought was to programmatically generate an external file where the 
weight per query term (or phrase) == its nDCG. Bad idea?

Best,
Audrey

On 1/20/20, 11:51 AM, "David Hastings"  wrote:

Ive used this quite a bit, my biggest piece of advice is to choose a field
that you know is clean, with well defined terms/words, you dont want an
autocomplete that has a massive dictionary, also it will make the
start/reload times pretty slow

On Mon, Jan 20, 2020 at 11:47 AM Audrey Lorberfeld -
audrey.lorberf...@ibm.com  wrote:

> Hi All,
>
> We plan to incorporate a query autocomplete functionality into our search
> engine (like this: 
https://urldefense.proofpoint.com/v2/url?u=https-3A__lucene.apache.org_solr_guide_8-5F1_suggester.html=DwIBaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=L8V-izaMW_v4j-1zvfiXSqm6aAoaRtk-VJXA6okBs_U=vnE9KGyF3jky9fSi22XUJEEbKLM1CA7mWAKrl2qhKC0=
 
> ). And I was wondering if anyone has personal experience with this
> component and would like to share? Basically, we are just looking for some
> best practices from more experienced Solr admins so that we have a 
starting
> place to launch this in our beta.
>
> Thank you!
>
> Best,
> Audrey
>




Re: Re: Re: Handling overlapping synonyms

2020-01-20 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
From my understanding, if you want regional sales manager to be indexed as both 
director of sales and area manager, you would have to type: 

Regional sales manager -> director of sales, area manager

I do not believe you can chain synonyms.

Re: bigrams/trigrams, I was more interested in you wanting to manually create 
them by inserting a "_" between the tokens. There is a bigram / trigram 
capability OOTB with Solr, so is there a reason you're manually coding these 
into your index instead of just using the OOTB function?

On 1/20/20, 6:58 AM, "fiedzia"  wrote:

> what is the reasoning behind adding the bigrams and trigrams manually like
that? Maybe if we knew the end goal, we could figure out a different
strategy. Happy that at least the matching is working now! 

I have large amount of synonyms and keep adding new ones, some of them
partially overlap. Its the nature of a language that adding keywords to a
phrase creates distinctive meaning. Another example:


sales manager -> director of sales
regional sales manager -> area manager

I'd expect "regional sales manager" to be indexed as both.

regional sales manager
^^ -> director of sales
^^ -> area manager

so that searching for any of those terms matches and highlights relevant
part.
However when SynonymGraphFilter finds one synonym it will ignore the other.



--
Sent from: 
https://urldefense.proofpoint.com/v2/url?u=https-3A__lucene.472066.n3.nabble.com_Solr-2DUser-2Df472068.html=DwICAg=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=JUEk2QAGcPS4Pi_y6d3EWDmtYMVjg2Sg-4ZwC-90VqE=tgepeqV5fWmuUgtTc767hv_1czuJnhM9O9LmWVgpDdM=
 




Re: Re: Handling overlapping synonyms

2020-01-20 Thread fiedzia
> what is the reasoning behind adding the bigrams and trigrams manually like
that? Maybe if we knew the end goal, we could figure out a different
strategy. Happy that at least the matching is working now! 

I have large amount of synonyms and keep adding new ones, some of them
partially overlap. Its the nature of a language that adding keywords to a
phrase creates distinctive meaning. Another example:


sales manager -> director of sales
regional sales manager -> area manager

I'd expect "regional sales manager" to be indexed as both.

regional sales manager
^^ -> director of sales
^^ -> area manager

so that searching for any of those terms matches and highlights relevant
part.
However when SynonymGraphFilter finds one synonym it will ignore the other.



--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Re: Handling overlapping synonyms

2020-01-17 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Hmm  what is the reasoning behind adding the bigrams and trigrams manually 
like that? Maybe if we knew the end goal, we could figure out a different 
strategy. Happy that at least the matching is working now!

On 1/17/20, 10:28 AM, "fiedzia"  wrote:

> Doing it the other way (new york city -> new_york_city, new_york) makes
more
sense,

Just checked it, that way does the matching as expected, but highlighting is
wrong
("new york: query matches "new york city" as it should, but also highlights
all of it)



--
Sent from: 
https://urldefense.proofpoint.com/v2/url?u=https-3A__lucene.472066.n3.nabble.com_Solr-2DUser-2Df472068.html=DwICAg=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=sxUM_HkySPw_KqJdqMGkjWQyUQ6W7K44Nid7p7wcBJ4=rJFkuEpTxkPp6EtyRstEE3PWCY-CSAmtjOFJ9ge67uU=
 




Re: Re: Re: POS Tagger

2019-10-25 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Oh I see I see 

-- 
Audrey Lorberfeld
Data Scientist, w3 Search
IBM
audrey.lorberf...@ibm.com
 

On 10/25/19, 12:21 PM, "David Hastings"  wrote:

oh i see what you mean, sorry, i explained it incorrectly.
 those sentences are what would be in the index, and a general search for
'rush limbaugh' would come back with results where he is an entity higher
than if it was two words in a sentence

On Fri, Oct 25, 2019 at 12:12 PM David Hastings <
hastings.recurs...@gmail.com> wrote:

> nope, i boost the fields already tagged at query time against teh query
>
> On Fri, Oct 25, 2019 at 12:11 PM Audrey Lorberfeld -
> audrey.lorberf...@ibm.com  wrote:
>
>> So then you do run your POS tagger at query-time, Dave?
>>
>> --
>> Audrey Lorberfeld
>> Data Scientist, w3 Search
>> IBM
>> audrey.lorberf...@ibm.com
>>
>>
>> On 10/25/19, 12:06 PM, "David Hastings" 
>> wrote:
>>
>> I use them for query boosting, so if someone searches for:
>>
>> i dont want to rush limbaugh out the door
>> vs
>> i talked to rush limbaugh through the door
>>
>> my documents where 'rush limbaugh' is a known entity (noun) and a
>> person
>> (look at the sentence, its obviously a person and the nlp finds that)
>> have
>> 'rush limbaugh' stored in a field, which is boosted on queries.  this
>> makes
>> sure results from the second query with him as a person will be
>> boosted
>> above those from the first query
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Fri, Oct 25, 2019 at 11:57 AM Nicolas Paris <
>> nicolas.pa...@riseup.net>
>> wrote:
>>
>> > Also we are using stanford POS tagger for french. The processing
>> time is
>> > mitigated by the spark-corenlp package which distribute the process
>> over
>> > multiple node.
>> >
>> > Also I am interesting in the way you use POS information within 
solr
>> > queries, or solr fields.
>> >
>> > Thanks,
>> > On Fri, Oct 25, 2019 at 10:42:43AM -0400, David Hastings wrote:
>> > > ah, yeah its not the fastest but it proved to be the best for my
>> > purposes,
>> > > I use it to pre-process data before indexing, to apply more
>> metadata to
>> > the
>> > > documents in a separate field(s)
>> > >
>> > > On Fri, Oct 25, 2019 at 10:40 AM Audrey Lorberfeld -
>> > > audrey.lorberf...@ibm.com  wrote:
>> > >
>> > > > No, I meant for part-of-speech tagging __ But that's
>> interesting that
>> > you
>> > > > use StanfordNLP. I've read that it's very slow, so we are
>> concerned
>> > that it
>> > > > might not work for us at query-time. Do you use it at
>> query-time, or
>> > just
>> > > > index-time?
>> > > >
>> > > > --
>> > > > Audrey Lorberfeld
>> > > > Data Scientist, w3 Search
>> > > > IBM
>> > > > audrey.lorberf...@ibm.com
>> > > >
>> > > >
>> > > > On 10/25/19, 10:30 AM, "David Hastings" <
>> hastings.recurs...@gmail.com
>> > >
>> > > > wrote:
>> > > >
>> > > > Do you mean for entity extraction?
>> > > > I make a LOT of use from the stanford nlp project, and get
>> out the
>> > > > entities
>> > > > and use them for different purposes in solr
>> > > > -Dave
>> > > >
>> > > > On Fri, Oct 25, 2019 at 10:16 AM Audrey Lorberfeld -
>> > > > audrey.lorberf...@ibm.com 
>> wrote:
>> > > >
>> > > > > Hi All,
>> > > > >
>> > > > > Does anyone use a POS tagger with their Solr instance
>> other than
>> > > > > OpenNLP’s? We are considering OpenNLP, SpaCy, and Watson.
>> > > > >
>> > > > > Thanks!
>> > > > >
>> > > > > --
>> > > > > Audrey Lorberfeld
>> > > > > Data Scientist, w3 Search
>> > > > > IBM
>> > > > > audrey.lorberf...@ibm.com
>> > > > >
>> > > > >
>> > > >
>> > > >
>> > > >
>> >
>> > --
>> > nicolas
>> >
>>
>>
>>




Re: Re: Re: POS Tagger

2019-10-25 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
How can a field itself be tagged with a part of speech?

-- 
Audrey Lorberfeld
Data Scientist, w3 Search
IBM
audrey.lorberf...@ibm.com
 

On 10/25/19, 12:12 PM, "David Hastings"  wrote:

nope, i boost the fields already tagged at query time against teh query

On Fri, Oct 25, 2019 at 12:11 PM Audrey Lorberfeld -
audrey.lorberf...@ibm.com  wrote:

> So then you do run your POS tagger at query-time, Dave?
>
> --
> Audrey Lorberfeld
> Data Scientist, w3 Search
> IBM
> audrey.lorberf...@ibm.com
>
>
> On 10/25/19, 12:06 PM, "David Hastings" 
> wrote:
>
> I use them for query boosting, so if someone searches for:
>
> i dont want to rush limbaugh out the door
> vs
> i talked to rush limbaugh through the door
>
> my documents where 'rush limbaugh' is a known entity (noun) and a
> person
> (look at the sentence, its obviously a person and the nlp finds that)
> have
> 'rush limbaugh' stored in a field, which is boosted on queries.  this
> makes
> sure results from the second query with him as a person will be 
boosted
> above those from the first query
>
>
>
>
>
>
>
>
>
>
>
>
> On Fri, Oct 25, 2019 at 11:57 AM Nicolas Paris <
> nicolas.pa...@riseup.net>
> wrote:
>
> > Also we are using stanford POS tagger for french. The processing
> time is
> > mitigated by the spark-corenlp package which distribute the process
> over
> > multiple node.
> >
> > Also I am interesting in the way you use POS information within solr
> > queries, or solr fields.
> >
> > Thanks,
> > On Fri, Oct 25, 2019 at 10:42:43AM -0400, David Hastings wrote:
> > > ah, yeah its not the fastest but it proved to be the best for my
> > purposes,
> > > I use it to pre-process data before indexing, to apply more
> metadata to
> > the
> > > documents in a separate field(s)
> > >
> > > On Fri, Oct 25, 2019 at 10:40 AM Audrey Lorberfeld -
> > > audrey.lorberf...@ibm.com  wrote:
> > >
> > > > No, I meant for part-of-speech tagging __ But that's interesting
> that
> > you
> > > > use StanfordNLP. I've read that it's very slow, so we are
> concerned
> > that it
> > > > might not work for us at query-time. Do you use it at
> query-time, or
> > just
> > > > index-time?
> > > >
> > > > --
> > > > Audrey Lorberfeld
> > > > Data Scientist, w3 Search
> > > > IBM
> > > > audrey.lorberf...@ibm.com
> > > >
> > > >
> > > > On 10/25/19, 10:30 AM, "David Hastings" <
> hastings.recurs...@gmail.com
> > >
> > > > wrote:
> > > >
> > > > Do you mean for entity extraction?
> > > > I make a LOT of use from the stanford nlp project, and get
> out the
> > > > entities
> > > > and use them for different purposes in solr
> > > > -Dave
> > > >
> > > > On Fri, Oct 25, 2019 at 10:16 AM Audrey Lorberfeld -
> > > > audrey.lorberf...@ibm.com  wrote:
> > > >
> > > > > Hi All,
> > > > >
> > > > > Does anyone use a POS tagger with their Solr instance
> other than
> > > > > OpenNLP’s? We are considering OpenNLP, SpaCy, and Watson.
> > > > >
> > > > > Thanks!
> > > > >
> > > > > --
> > > > > Audrey Lorberfeld
> > > > > Data Scientist, w3 Search
> > > > > IBM
> > > > > audrey.lorberf...@ibm.com
> > > > >
> > > > >
> > > >
> > > >
> > > >
> >
> > --
> > nicolas
> >
>
>
>




Re: Re: POS Tagger

2019-10-25 Thread David Hastings
oh i see what you mean, sorry, i explained it incorrectly.
 those sentences are what would be in the index, and a general search for
'rush limbaugh' would come back with results where he is an entity higher
than if it was two words in a sentence

On Fri, Oct 25, 2019 at 12:12 PM David Hastings <
hastings.recurs...@gmail.com> wrote:

> nope, i boost the fields already tagged at query time against teh query
>
> On Fri, Oct 25, 2019 at 12:11 PM Audrey Lorberfeld -
> audrey.lorberf...@ibm.com  wrote:
>
>> So then you do run your POS tagger at query-time, Dave?
>>
>> --
>> Audrey Lorberfeld
>> Data Scientist, w3 Search
>> IBM
>> audrey.lorberf...@ibm.com
>>
>>
>> On 10/25/19, 12:06 PM, "David Hastings" 
>> wrote:
>>
>> I use them for query boosting, so if someone searches for:
>>
>> i dont want to rush limbaugh out the door
>> vs
>> i talked to rush limbaugh through the door
>>
>> my documents where 'rush limbaugh' is a known entity (noun) and a
>> person
>> (look at the sentence, its obviously a person and the nlp finds that)
>> have
>> 'rush limbaugh' stored in a field, which is boosted on queries.  this
>> makes
>> sure results from the second query with him as a person will be
>> boosted
>> above those from the first query
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Fri, Oct 25, 2019 at 11:57 AM Nicolas Paris <
>> nicolas.pa...@riseup.net>
>> wrote:
>>
>> > Also we are using stanford POS tagger for french. The processing
>> time is
>> > mitigated by the spark-corenlp package which distribute the process
>> over
>> > multiple node.
>> >
>> > Also I am interesting in the way you use POS information within solr
>> > queries, or solr fields.
>> >
>> > Thanks,
>> > On Fri, Oct 25, 2019 at 10:42:43AM -0400, David Hastings wrote:
>> > > ah, yeah its not the fastest but it proved to be the best for my
>> > purposes,
>> > > I use it to pre-process data before indexing, to apply more
>> metadata to
>> > the
>> > > documents in a separate field(s)
>> > >
>> > > On Fri, Oct 25, 2019 at 10:40 AM Audrey Lorberfeld -
>> > > audrey.lorberf...@ibm.com  wrote:
>> > >
>> > > > No, I meant for part-of-speech tagging __ But that's
>> interesting that
>> > you
>> > > > use StanfordNLP. I've read that it's very slow, so we are
>> concerned
>> > that it
>> > > > might not work for us at query-time. Do you use it at
>> query-time, or
>> > just
>> > > > index-time?
>> > > >
>> > > > --
>> > > > Audrey Lorberfeld
>> > > > Data Scientist, w3 Search
>> > > > IBM
>> > > > audrey.lorberf...@ibm.com
>> > > >
>> > > >
>> > > > On 10/25/19, 10:30 AM, "David Hastings" <
>> hastings.recurs...@gmail.com
>> > >
>> > > > wrote:
>> > > >
>> > > > Do you mean for entity extraction?
>> > > > I make a LOT of use from the stanford nlp project, and get
>> out the
>> > > > entities
>> > > > and use them for different purposes in solr
>> > > > -Dave
>> > > >
>> > > > On Fri, Oct 25, 2019 at 10:16 AM Audrey Lorberfeld -
>> > > > audrey.lorberf...@ibm.com 
>> wrote:
>> > > >
>> > > > > Hi All,
>> > > > >
>> > > > > Does anyone use a POS tagger with their Solr instance
>> other than
>> > > > > OpenNLP’s? We are considering OpenNLP, SpaCy, and Watson.
>> > > > >
>> > > > > Thanks!
>> > > > >
>> > > > > --
>> > > > > Audrey Lorberfeld
>> > > > > Data Scientist, w3 Search
>> > > > > IBM
>> > > > > audrey.lorberf...@ibm.com
>> > > > >
>> > > > >
>> > > >
>> > > >
>> > > >
>> >
>> > --
>> > nicolas
>> >
>>
>>
>>


Re: Re: POS Tagger

2019-10-25 Thread David Hastings
nope, i boost the fields already tagged at query time against teh query

On Fri, Oct 25, 2019 at 12:11 PM Audrey Lorberfeld -
audrey.lorberf...@ibm.com  wrote:

> So then you do run your POS tagger at query-time, Dave?
>
> --
> Audrey Lorberfeld
> Data Scientist, w3 Search
> IBM
> audrey.lorberf...@ibm.com
>
>
> On 10/25/19, 12:06 PM, "David Hastings" 
> wrote:
>
> I use them for query boosting, so if someone searches for:
>
> i dont want to rush limbaugh out the door
> vs
> i talked to rush limbaugh through the door
>
> my documents where 'rush limbaugh' is a known entity (noun) and a
> person
> (look at the sentence, its obviously a person and the nlp finds that)
> have
> 'rush limbaugh' stored in a field, which is boosted on queries.  this
> makes
> sure results from the second query with him as a person will be boosted
> above those from the first query
>
>
>
>
>
>
>
>
>
>
>
>
> On Fri, Oct 25, 2019 at 11:57 AM Nicolas Paris <
> nicolas.pa...@riseup.net>
> wrote:
>
> > Also we are using stanford POS tagger for french. The processing
> time is
> > mitigated by the spark-corenlp package which distribute the process
> over
> > multiple node.
> >
> > Also I am interesting in the way you use POS information within solr
> > queries, or solr fields.
> >
> > Thanks,
> > On Fri, Oct 25, 2019 at 10:42:43AM -0400, David Hastings wrote:
> > > ah, yeah its not the fastest but it proved to be the best for my
> > purposes,
> > > I use it to pre-process data before indexing, to apply more
> metadata to
> > the
> > > documents in a separate field(s)
> > >
> > > On Fri, Oct 25, 2019 at 10:40 AM Audrey Lorberfeld -
> > > audrey.lorberf...@ibm.com  wrote:
> > >
> > > > No, I meant for part-of-speech tagging __ But that's interesting
> that
> > you
> > > > use StanfordNLP. I've read that it's very slow, so we are
> concerned
> > that it
> > > > might not work for us at query-time. Do you use it at
> query-time, or
> > just
> > > > index-time?
> > > >
> > > > --
> > > > Audrey Lorberfeld
> > > > Data Scientist, w3 Search
> > > > IBM
> > > > audrey.lorberf...@ibm.com
> > > >
> > > >
> > > > On 10/25/19, 10:30 AM, "David Hastings" <
> hastings.recurs...@gmail.com
> > >
> > > > wrote:
> > > >
> > > > Do you mean for entity extraction?
> > > > I make a LOT of use from the stanford nlp project, and get
> out the
> > > > entities
> > > > and use them for different purposes in solr
> > > > -Dave
> > > >
> > > > On Fri, Oct 25, 2019 at 10:16 AM Audrey Lorberfeld -
> > > > audrey.lorberf...@ibm.com  wrote:
> > > >
> > > > > Hi All,
> > > > >
> > > > > Does anyone use a POS tagger with their Solr instance
> other than
> > > > > OpenNLP’s? We are considering OpenNLP, SpaCy, and Watson.
> > > > >
> > > > > Thanks!
> > > > >
> > > > > --
> > > > > Audrey Lorberfeld
> > > > > Data Scientist, w3 Search
> > > > > IBM
> > > > > audrey.lorberf...@ibm.com
> > > > >
> > > > >
> > > >
> > > >
> > > >
> >
> > --
> > nicolas
> >
>
>
>


Re: Re: POS Tagger

2019-10-25 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
So then you do run your POS tagger at query-time, Dave?

-- 
Audrey Lorberfeld
Data Scientist, w3 Search
IBM
audrey.lorberf...@ibm.com
 

On 10/25/19, 12:06 PM, "David Hastings"  wrote:

I use them for query boosting, so if someone searches for:

i dont want to rush limbaugh out the door
vs
i talked to rush limbaugh through the door

my documents where 'rush limbaugh' is a known entity (noun) and a person
(look at the sentence, its obviously a person and the nlp finds that) have
'rush limbaugh' stored in a field, which is boosted on queries.  this makes
sure results from the second query with him as a person will be boosted
above those from the first query












On Fri, Oct 25, 2019 at 11:57 AM Nicolas Paris 
wrote:

> Also we are using stanford POS tagger for french. The processing time is
> mitigated by the spark-corenlp package which distribute the process over
> multiple node.
>
> Also I am interesting in the way you use POS information within solr
> queries, or solr fields.
>
> Thanks,
> On Fri, Oct 25, 2019 at 10:42:43AM -0400, David Hastings wrote:
> > ah, yeah its not the fastest but it proved to be the best for my
> purposes,
> > I use it to pre-process data before indexing, to apply more metadata to
> the
> > documents in a separate field(s)
> >
> > On Fri, Oct 25, 2019 at 10:40 AM Audrey Lorberfeld -
> > audrey.lorberf...@ibm.com  wrote:
> >
> > > No, I meant for part-of-speech tagging __ But that's interesting that
> you
> > > use StanfordNLP. I've read that it's very slow, so we are concerned
> that it
> > > might not work for us at query-time. Do you use it at query-time, or
> just
> > > index-time?
> > >
> > > --
> > > Audrey Lorberfeld
> > > Data Scientist, w3 Search
> > > IBM
> > > audrey.lorberf...@ibm.com
> > >
> > >
> > > On 10/25/19, 10:30 AM, "David Hastings"  >
> > > wrote:
> > >
> > > Do you mean for entity extraction?
> > > I make a LOT of use from the stanford nlp project, and get out the
> > > entities
> > > and use them for different purposes in solr
> > > -Dave
> > >
> > > On Fri, Oct 25, 2019 at 10:16 AM Audrey Lorberfeld -
> > > audrey.lorberf...@ibm.com  wrote:
> > >
> > > > Hi All,
> > > >
> > > > Does anyone use a POS tagger with their Solr instance other than
> > > > OpenNLP’s? We are considering OpenNLP, SpaCy, and Watson.
> > > >
> > > > Thanks!
> > > >
> > > > --
> > > > Audrey Lorberfeld
> > > > Data Scientist, w3 Search
> > > > IBM
> > > > audrey.lorberf...@ibm.com
> > > >
> > > >
> > >
> > >
> > >
>
> --
> nicolas
>




Re: Re: POS Tagger

2019-10-25 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Nicolas,

Do you use the POS tagger at query time, or just at index time? 

We are thinking of using it to filter the tokens we will eventually perform ML 
on. Basically, we have a bunch of acronyms in our corpus. However, many 
departments use the same acronyms but expand those acronyms to different 
things. Eventually, we are thinking of using ML on our index to determine which 
expansion is meant by a particular query according to the context we find in 
certain documents. However, since we don't want to run ML on all tokens in a 
query, and since we think that acronyms are usually the nouns in a multi-token 
query, we want to only feed nouns to the ML model (TBD).

Does that make sense? So, we'd want both an index-side POS tagger (could be 
slow), and also a query-side POS tagger (must be fast).

-- 
Audrey Lorberfeld
Data Scientist, w3 Search
IBM
audrey.lorberf...@ibm.com
 

On 10/25/19, 11:57 AM, "Nicolas Paris"  wrote:

Also we are using stanford POS tagger for french. The processing time is
mitigated by the spark-corenlp package which distribute the process over
multiple node.

Also I am interesting in the way you use POS information within solr
queries, or solr fields. 

Thanks,
On Fri, Oct 25, 2019 at 10:42:43AM -0400, David Hastings wrote:
> ah, yeah its not the fastest but it proved to be the best for my purposes,
> I use it to pre-process data before indexing, to apply more metadata to 
the
> documents in a separate field(s)
> 
> On Fri, Oct 25, 2019 at 10:40 AM Audrey Lorberfeld -
> audrey.lorberf...@ibm.com  wrote:
> 
> > No, I meant for part-of-speech tagging __ But that's interesting that 
you
> > use StanfordNLP. I've read that it's very slow, so we are concerned 
that it
> > might not work for us at query-time. Do you use it at query-time, or 
just
> > index-time?
> >
> > --
> > Audrey Lorberfeld
> > Data Scientist, w3 Search
> > IBM
> > audrey.lorberf...@ibm.com
> >
> >
> > On 10/25/19, 10:30 AM, "David Hastings" 
> > wrote:
> >
> > Do you mean for entity extraction?
> > I make a LOT of use from the stanford nlp project, and get out the
> > entities
> > and use them for different purposes in solr
> > -Dave
> >
> > On Fri, Oct 25, 2019 at 10:16 AM Audrey Lorberfeld -
> > audrey.lorberf...@ibm.com  wrote:
> >
> > > Hi All,
> > >
> > > Does anyone use a POS tagger with their Solr instance other than
> > > OpenNLP’s? We are considering OpenNLP, SpaCy, and Watson.
> > >
> > > Thanks!
> > >
> > > --
> > > Audrey Lorberfeld
> > > Data Scientist, w3 Search
> > > IBM
> > > audrey.lorberf...@ibm.com
> > >
> > >
> >
> >
> >

-- 
nicolas




Re: Re: POS Tagger

2019-10-25 Thread David Hastings
ah, yeah its not the fastest but it proved to be the best for my purposes,
I use it to pre-process data before indexing, to apply more metadata to the
documents in a separate field(s)

On Fri, Oct 25, 2019 at 10:40 AM Audrey Lorberfeld -
audrey.lorberf...@ibm.com  wrote:

> No, I meant for part-of-speech tagging __ But that's interesting that you
> use StanfordNLP. I've read that it's very slow, so we are concerned that it
> might not work for us at query-time. Do you use it at query-time, or just
> index-time?
>
> --
> Audrey Lorberfeld
> Data Scientist, w3 Search
> IBM
> audrey.lorberf...@ibm.com
>
>
> On 10/25/19, 10:30 AM, "David Hastings" 
> wrote:
>
> Do you mean for entity extraction?
> I make a LOT of use from the stanford nlp project, and get out the
> entities
> and use them for different purposes in solr
> -Dave
>
> On Fri, Oct 25, 2019 at 10:16 AM Audrey Lorberfeld -
> audrey.lorberf...@ibm.com  wrote:
>
> > Hi All,
> >
> > Does anyone use a POS tagger with their Solr instance other than
> > OpenNLP’s? We are considering OpenNLP, SpaCy, and Watson.
> >
> > Thanks!
> >
> > --
> > Audrey Lorberfeld
> > Data Scientist, w3 Search
> > IBM
> > audrey.lorberf...@ibm.com
> >
> >
>
>
>


Re: Re: POS Tagger

2019-10-25 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
No, I meant for part-of-speech tagging __ But that's interesting that you use 
StanfordNLP. I've read that it's very slow, so we are concerned that it might 
not work for us at query-time. Do you use it at query-time, or just index-time?

-- 
Audrey Lorberfeld
Data Scientist, w3 Search
IBM
audrey.lorberf...@ibm.com
 

On 10/25/19, 10:30 AM, "David Hastings"  wrote:

Do you mean for entity extraction?
I make a LOT of use from the stanford nlp project, and get out the entities
and use them for different purposes in solr
-Dave

On Fri, Oct 25, 2019 at 10:16 AM Audrey Lorberfeld -
audrey.lorberf...@ibm.com  wrote:

> Hi All,
>
> Does anyone use a POS tagger with their Solr instance other than
> OpenNLP’s? We are considering OpenNLP, SpaCy, and Watson.
>
> Thanks!
>
> --
> Audrey Lorberfeld
> Data Scientist, w3 Search
> IBM
> audrey.lorberf...@ibm.com
>
>




Re: Re: using the df parameter to set a default to search all fields

2019-10-22 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Eek, Shawn, you're right -- I'm sorry, all! I meant to say the the QF (!) 
parameter. And pasted the wrong thing too ☹ This is what ours looks like with 
the qf parameter (and the edismax parser)

  
  title_en^1.5 description_en^0.5 content_en^0.5 headings_en^1.3 
keywords_en^1.5 url^0.5
  

-- 
Audrey Lorberfeld
Data Scientist, w3 Search
IBM
audrey.lorberf...@ibm.com
 

On 10/22/19, 1:50 PM, "Shawn Heisey"  wrote:

On 10/22/2019 11:42 AM, Audrey Lorberfeld - audrey.lorberf...@ibm.com wrote:
> I think you actually can search over all fields, but not in the df 
parameter. We have a big list of fields we want to search over. So, we just put 
a dummy one in the df param field, and then we use the fl parameter. With the 
edismax parser, this works. It looks something like this:

The fl parameter means "field list" and controls which fields are 
included in the search results.  It does not control which fields are 
searched.

Thanks,
Shawn




Re: Re: using the df parameter to set a default to search all fields

2019-10-22 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
I think you actually can search over all fields, but not in the df parameter. 
We have a big list of fields we want to search over. So, we just put a dummy 
one in the df param field, and then we use the fl parameter. With the edismax 
parser, this works. It looks something like this: 



edismax
1.0
explicit
30
content_en

update_date, display_url, url, id, uid, scopes, source_id, 
json_payload, language, snippet, [elevated],
title:title_en, description:description_en, score




-- 
Audrey Lorberfeld
Data Scientist, w3 Search
IBM
audrey.lorberf...@ibm.com
 

On 10/22/19, 1:01 PM, "Shawn Heisey"  wrote:

On 10/22/2019 10:26 AM, rhys J wrote:
> How do I make Solr search on all fields in a document?

Solr does not have a way to ask for all fields on a search.  If you use 
the edismax query parser, you can specify multiple fields with the qf 
parameter, but there is nothing you can put in that parameter as a 
shortcut for "all fields."  Using qf with multiple fields is the 
cleanest way to do this.

> I read the documentation about the df field, and added the following to my
> solrconfig.xml:
> 
>   
>explicit
>10
>   _text_
>  

The df parameter just means "default field".  It can only search one field.

> in my managed-schema file i have the following:
> 
>stored="true"/>
> 
> I have deleted the documents, and re-indexed the csv file.
> 
> When I do a search in the api for: _text_:amy - which should return 2
> documents, I get nothing.

Just having a field named _text_ doesn't make anything happen, unless 
your indexing specifically adds documents with that field defined. 
There is nothing special about _text_.  Other field names that start and 
end with an underscore, like _version_ or _root_, are special ... but 
_text_ is not.

Probably what you are looking for here is to set up one or more 
copyField definitions in your schema, which are configured to copy one 
or more of your other fields to _text_ so it can be searched as a 
catchall field.  I find it useful to name that field "catchall" rather 
than something like _text_ which seems like a special field name, but isn't.

Thanks,
Shawn




Re: Re: Query on autoGeneratePhraseQueries

2019-10-16 Thread Shubham Goswami
Hi Rohan/Audrey

I have implemented the sow=false property with eDismax Query parser but
still it does not has any effect
on the query as it is still parsing as separate terms instead of phrased
one.

On Tue, Oct 15, 2019 at 8:25 PM Rohan Kasat  wrote:

> Also check ,
> pf , pf2 , pf3
> ps , ps2, ps3 parameters for phrase searches.
>
> Regards,
> Rohan K
>
> On Tue, Oct 15, 2019 at 6:41 AM Audrey Lorberfeld -
> audrey.lorberf...@ibm.com  wrote:
>
> > I'm not sure how your config file is setup, but I know that the way we do
> > multi-token synonyms is to have the sow (split on whitespace) parameter
> set
> > to False while using the edismax parser. I'm not sure if this would work
> > with PhraseQueries , but it might be worth a try!
> >
> > In our config file we do something like this:
> >
> > 
> > 
> > edismax
> > 1.0
> > explicit
> > 100
> > content_en
> > w3json_en
> > false
> > 
> >  
> >
> > You can read a bit about the parameter here:
> >
> https://opensourceconnections.com/blog/2018/02/20/edismax-and-multiterm-synonyms-oddities/
> >
> > Best,
> > Audrey
> >
> > --
> > Audrey Lorberfeld
> > Data Scientist, w3 Search
> > IBM
> > audrey.lorberf...@ibm.com
> >
> >
> > On 10/15/19, 5:50 AM, "Shubham Goswami" 
> > wrote:
> >
> > Hi kshitij
> >
> > Thanks for the reply!
> > I tried to debug it and found that raw query(black company) has
> parsed
> > as
> > two separate queries
> > black and company and returning the results based on black query
> > instead of
> > this it should have
> > got parsed as a single phrase query like("black company") because i
> am
> > using
> > autoGeneratedPhraseQuery.
> > Do you have any idea about this please correct me if i am wrong.
> >
> > Thanks
> > Shubham
> >
> > On Tue, Oct 15, 2019 at 1:58 PM kshitij tyagi <
> > kshitij.shopcl...@gmail.com>
> > wrote:
> >
> > > Hi,
> > >
> > > Try debugging your solr query and understand how it gets parsed.
> Try
> > using
> > > "debug=true" for the same
> > >
> > > On Tue, Oct 15, 2019 at 12:58 PM Shubham Goswami <
> > > shubham.gosw...@hotwax.co>
> > > wrote:
> > >
> > > > *Hi all,*
> > > >
> > > > I am a beginner to solr framework and I am trying to implement
> > > > *autoGeneratePhraseQueries* property in a fieldtype of
> > > type=text_general, i
> > > > kept the property value as true and restarted the solr server but
> > still
> > > it
> > > > is not taking my two words query like(Black company) as a phrase
> > without
> > > > double quotes and returning the results only for Black.
> > > >
> > > >  Can somebody please help me to understand what am i
> > missing ?
> > > > Following is my Schema.xml file code and i am using solr 7.5
> > version.
> > > >  > > > positionIncrementGap="100" multiValued="true"
> > > > autoGeneratePhraseQueries="true">
> > > > 
> > > >   =
> > > >words="stopwords.txt"
> > > > ignoreCase="true"/>
> > > >   
> > > > 
> > > > 
> > > >   
> > > >words="stopwords.txt"
> > > > ignoreCase="true"/>
> > > >expand="true"
> > > > ignoreCase="true" synonyms="synonyms.txt"/>
> > > >   
> > > > 
> > > >   
> > > >
> > > >
> > > > --
> > > > *Thanks & Regards*
> > > > Shubham Goswami
> > > > Enterprise Software Engineer
> > > > *HotWax Systems*
> > > > *Enterprise open source experts*
> > > > cell: +91-7803886288
> > > > office: 0731-409-3684
> > > >
> >
> https://urldefense.proofpoint.com/v2/url?u=http-3A__www.hotwaxsystems.com=DwIBaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=Zi9beGF58BzJUNUdCkeW0pwliKwq9vdTSh0V_lR0734=FhSkJBcmYw_bfHgq1enzuYQeOZwKHzlP9h4VwTZSL5E=
> > > >
> > >
> >
> >
> > --
> > *Thanks & Regards*
> > Shubham Goswami
> > Enterprise Software Engineer
> > *HotWax Systems*
> > *Enterprise open source experts*
> > cell: +91-7803886288
> > office: 0731-409-3684
> >
> >
> https://urldefense.proofpoint.com/v2/url?u=http-3A__www.hotwaxsystems.com=DwIBaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=Zi9beGF58BzJUNUdCkeW0pwliKwq9vdTSh0V_lR0734=FhSkJBcmYw_bfHgq1enzuYQeOZwKHzlP9h4VwTZSL5E=
> >
> >
> > --
>
> *Regards,Rohan Kasat*
>


-- 
*Thanks & Regards*
Shubham Goswami
Enterprise Software Engineer
*HotWax Systems*
*Enterprise open source experts*
cell: +91-7803886288
office: 0731-409-3684
http://www.hotwaxsystems.com


Re: Re: Query on autoGeneratePhraseQueries

2019-10-15 Thread Rohan Kasat
Also check ,
pf , pf2 , pf3
ps , ps2, ps3 parameters for phrase searches.

Regards,
Rohan K

On Tue, Oct 15, 2019 at 6:41 AM Audrey Lorberfeld -
audrey.lorberf...@ibm.com  wrote:

> I'm not sure how your config file is setup, but I know that the way we do
> multi-token synonyms is to have the sow (split on whitespace) parameter set
> to False while using the edismax parser. I'm not sure if this would work
> with PhraseQueries , but it might be worth a try!
>
> In our config file we do something like this:
>
> 
> 
> edismax
> 1.0
> explicit
> 100
> content_en
> w3json_en
> false
> 
>  
>
> You can read a bit about the parameter here:
> https://opensourceconnections.com/blog/2018/02/20/edismax-and-multiterm-synonyms-oddities/
>
> Best,
> Audrey
>
> --
> Audrey Lorberfeld
> Data Scientist, w3 Search
> IBM
> audrey.lorberf...@ibm.com
>
>
> On 10/15/19, 5:50 AM, "Shubham Goswami" 
> wrote:
>
> Hi kshitij
>
> Thanks for the reply!
> I tried to debug it and found that raw query(black company) has parsed
> as
> two separate queries
> black and company and returning the results based on black query
> instead of
> this it should have
> got parsed as a single phrase query like("black company") because i am
> using
> autoGeneratedPhraseQuery.
> Do you have any idea about this please correct me if i am wrong.
>
> Thanks
> Shubham
>
> On Tue, Oct 15, 2019 at 1:58 PM kshitij tyagi <
> kshitij.shopcl...@gmail.com>
> wrote:
>
> > Hi,
> >
> > Try debugging your solr query and understand how it gets parsed. Try
> using
> > "debug=true" for the same
> >
> > On Tue, Oct 15, 2019 at 12:58 PM Shubham Goswami <
> > shubham.gosw...@hotwax.co>
> > wrote:
> >
> > > *Hi all,*
> > >
> > > I am a beginner to solr framework and I am trying to implement
> > > *autoGeneratePhraseQueries* property in a fieldtype of
> > type=text_general, i
> > > kept the property value as true and restarted the solr server but
> still
> > it
> > > is not taking my two words query like(Black company) as a phrase
> without
> > > double quotes and returning the results only for Black.
> > >
> > >  Can somebody please help me to understand what am i
> missing ?
> > > Following is my Schema.xml file code and i am using solr 7.5
> version.
> > >  > > positionIncrementGap="100" multiValued="true"
> > > autoGeneratePhraseQueries="true">
> > > 
> > >   =
> > >> > ignoreCase="true"/>
> > >   
> > > 
> > > 
> > >   
> > >> > ignoreCase="true"/>
> > >> > ignoreCase="true" synonyms="synonyms.txt"/>
> > >   
> > > 
> > >   
> > >
> > >
> > > --
> > > *Thanks & Regards*
> > > Shubham Goswami
> > > Enterprise Software Engineer
> > > *HotWax Systems*
> > > *Enterprise open source experts*
> > > cell: +91-7803886288
> > > office: 0731-409-3684
> > >
> https://urldefense.proofpoint.com/v2/url?u=http-3A__www.hotwaxsystems.com=DwIBaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=Zi9beGF58BzJUNUdCkeW0pwliKwq9vdTSh0V_lR0734=FhSkJBcmYw_bfHgq1enzuYQeOZwKHzlP9h4VwTZSL5E=
> > >
> >
>
>
> --
> *Thanks & Regards*
> Shubham Goswami
> Enterprise Software Engineer
> *HotWax Systems*
> *Enterprise open source experts*
> cell: +91-7803886288
> office: 0731-409-3684
>
> https://urldefense.proofpoint.com/v2/url?u=http-3A__www.hotwaxsystems.com=DwIBaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=Zi9beGF58BzJUNUdCkeW0pwliKwq9vdTSh0V_lR0734=FhSkJBcmYw_bfHgq1enzuYQeOZwKHzlP9h4VwTZSL5E=
>
>
> --

*Regards,Rohan Kasat*


Re: Re: Query on autoGeneratePhraseQueries

2019-10-15 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
I'm not sure how your config file is setup, but I know that the way we do 
multi-token synonyms is to have the sow (split on whitespace) parameter set to 
False while using the edismax parser. I'm not sure if this would work with 
PhraseQueries , but it might be worth a try! 

In our config file we do something like this: 



edismax
1.0
explicit
100
content_en
w3json_en
false

 

You can read a bit about the parameter here: 
https://opensourceconnections.com/blog/2018/02/20/edismax-and-multiterm-synonyms-oddities/
 

Best,
Audrey

-- 
Audrey Lorberfeld
Data Scientist, w3 Search
IBM
audrey.lorberf...@ibm.com
 

On 10/15/19, 5:50 AM, "Shubham Goswami"  wrote:

Hi kshitij

Thanks for the reply!
I tried to debug it and found that raw query(black company) has parsed as
two separate queries
black and company and returning the results based on black query instead of
this it should have
got parsed as a single phrase query like("black company") because i am using
autoGeneratedPhraseQuery.
Do you have any idea about this please correct me if i am wrong.

Thanks
Shubham

On Tue, Oct 15, 2019 at 1:58 PM kshitij tyagi 
wrote:

> Hi,
>
> Try debugging your solr query and understand how it gets parsed. Try using
> "debug=true" for the same
>
> On Tue, Oct 15, 2019 at 12:58 PM Shubham Goswami <
> shubham.gosw...@hotwax.co>
> wrote:
>
> > *Hi all,*
> >
> > I am a beginner to solr framework and I am trying to implement
> > *autoGeneratePhraseQueries* property in a fieldtype of
> type=text_general, i
> > kept the property value as true and restarted the solr server but still
> it
> > is not taking my two words query like(Black company) as a phrase without
> > double quotes and returning the results only for Black.
> >
> >  Can somebody please help me to understand what am i missing ?
> > Following is my Schema.xml file code and i am using solr 7.5 version.
> >  > positionIncrementGap="100" multiValued="true"
> > autoGeneratePhraseQueries="true">
> > 
> >   =
> >> ignoreCase="true"/>
> >   
> > 
> > 
> >   
> >> ignoreCase="true"/>
> >> ignoreCase="true" synonyms="synonyms.txt"/>
> >   
> > 
> >   
> >
> >
> > --
> > *Thanks & Regards*
> > Shubham Goswami
> > Enterprise Software Engineer
> > *HotWax Systems*
> > *Enterprise open source experts*
> > cell: +91-7803886288
> > office: 0731-409-3684
> > 
https://urldefense.proofpoint.com/v2/url?u=http-3A__www.hotwaxsystems.com=DwIBaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=Zi9beGF58BzJUNUdCkeW0pwliKwq9vdTSh0V_lR0734=FhSkJBcmYw_bfHgq1enzuYQeOZwKHzlP9h4VwTZSL5E=
 
> >
>


-- 
*Thanks & Regards*
Shubham Goswami
Enterprise Software Engineer
*HotWax Systems*
*Enterprise open source experts*
cell: +91-7803886288
office: 0731-409-3684

https://urldefense.proofpoint.com/v2/url?u=http-3A__www.hotwaxsystems.com=DwIBaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=Zi9beGF58BzJUNUdCkeW0pwliKwq9vdTSh0V_lR0734=FhSkJBcmYw_bfHgq1enzuYQeOZwKHzlP9h4VwTZSL5E=
 




Re: Re: Re: Re: Re: Protecting Tokens from Any Analysis

2019-10-09 Thread David Hastings
yup.  youre going to find solr is WAY more efficient than you think when it
comes to complex queries.

On Wed, Oct 9, 2019 at 3:17 PM Audrey Lorberfeld - audrey.lorberf...@ibm.com
 wrote:

> True...I guess another rub here is that we're using the edismax parser, so
> all of our queries are inherently OR queries. So for a query like  'the ibm
> way', the search engine would have to:
>
> 1) retrieve a document list for:
>  -->  "ibm" (this list is probably 80% of the documents)
>  -->  "the" (this list is 100%  of the english documents)
>  -- >"way"
> 2) apply edismax parser
>  --> foreach term
>  -->  -->  foreach document  in term
>  -->  -->  -->  score it
>
> So, it seems like it would take a toll on our system but maybe that's
> incorrect! (For reference, our corpus is ~5MM documents, multi-language,
> and we get ~80k-100k queries/day)
>
> Are you using edismax?
>
> --
> Audrey Lorberfeld
> Data Scientist, w3 Search
> IBM
> audrey.lorberf...@ibm.com
>
>
> On 10/9/19, 3:11 PM, "David Hastings" 
> wrote:
>
> if you have anything close to a decent server you wont notice it all.
> im
> at about 21 million documents, index varies between 450gb to 800gb
> depending on merges, and about 60k searches a day and stay sub second
> non
> stop, and this is on a single core/non cloud environment
>
> On Wed, Oct 9, 2019 at 2:55 PM Audrey Lorberfeld -
> audrey.lorberf...@ibm.com
>  wrote:
>
> > Also, in terms of computational cost, it would seem that including
> most
> > terms/not having a stop ilst would take a toll on the system. For
> instance,
> > right now we have "ibm" as a stop word because it appears everywhere
> in our
> > corpus. If we did not include it in the stop words file, we would
> have to
> > retrieve every single document in our corpus and rank them. That's a
> high
> > computational cost, no?
> >
> > --
> > Audrey Lorberfeld
> > Data Scientist, w3 Search
> > IBM
> > audrey.lorberf...@ibm.com
> >
> >
> > On 10/9/19, 2:31 PM, "Audrey Lorberfeld - audrey.lorberf...@ibm.com"
> <
> > audrey.lorberf...@ibm.com> wrote:
> >
> > Wow, thank you so much, everyone. This is all incredibly helpful
> > insight.
> >
> > So, would it be fair to say that the majority of you all do NOT
> use
> > stop words?
> >
> > --
> > Audrey Lorberfeld
> > Data Scientist, w3 Search
> > IBM
> > audrey.lorberf...@ibm.com
> >
> >
> > On 10/9/19, 11:14 AM, "David Hastings" <
> hastings.recurs...@gmail.com>
> > wrote:
> >
> > However, with all that said, stopwords CAN be useful in some
> > situations.  I
> > combine stopwords with the shingle factory to create
> "interesting
> > phrases"
> > (not really) that i use in "my more like this" needs.  for
> example,
> > europe for vacation
> > europe on vacation
> > will create the shingle
> > europe_vacation
> > which i can then use to relate other documents that would be
> much
> > more similar in such regard, rather than just using the
> > "interesting words"
> > europe, vacation
> >
> > with stop words, the shingles would be
> > europe_for
> > for_vacation
> > and
> > europe_on
> > on_vacation
> >
> > just something to keep in mind,  theres a lot of creative
> ways to
> > use
> > stopwords depending on your needs.  i use the above for a
> VERY
> > basic ML
> > teacher and it works way better than using stopwords,
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > On Wed, Oct 9, 2019 at 10:51 AM Erick Erickson <
> > erickerick...@gmail.com>
> > wrote:
> >
> > > The theory behind stopwords is that they are “safe” to
> remove
> > when
> > > calculating relevance, so we can squeeze every last bit of
> > usefulness out
> > > of very constrained hardware (think 64K of memory. Yes
> > kilobytes). We’ve
> > > come a long way since then and the necessity of removing
> > stopwords from the
> > > indexed tokens to conserve RAM and disk is much less
> relevant
> > than it used
> > > to be in “the bad old days” when the idea of stopwords was
> > invented.
> > >
> > > I’m not quite so confident as Alex that there is “no
> benefit”,
> > but I’ll
> > > totally agree that you should remove stopwords only
> _after_ you
> > have some
> > > evidence that removing them is A Good Thing in your
> situation.
> > >
> > > And removing stopwords leads to some interesting corner
> cases.
> > Consider a
> > 

Re: Re: Re: Re: Re: Protecting Tokens from Any Analysis

2019-10-09 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
True...I guess another rub here is that we're using the edismax parser, so all 
of our queries are inherently OR queries. So for a query like  'the ibm way', 
the search engine would have to: 

1) retrieve a document list for:
 -->  "ibm" (this list is probably 80% of the documents)
 -->  "the" (this list is 100%  of the english documents)
 -- >"way"
2) apply edismax parser
 --> foreach term
 -->  -->  foreach document  in term
 -->  -->  -->  score it

So, it seems like it would take a toll on our system but maybe that's 
incorrect! (For reference, our corpus is ~5MM documents, multi-language, and we 
get ~80k-100k queries/day)

Are you using edismax?

-- 
Audrey Lorberfeld
Data Scientist, w3 Search
IBM
audrey.lorberf...@ibm.com
 

On 10/9/19, 3:11 PM, "David Hastings"  wrote:

if you have anything close to a decent server you wont notice it all.  im
at about 21 million documents, index varies between 450gb to 800gb
depending on merges, and about 60k searches a day and stay sub second non
stop, and this is on a single core/non cloud environment

On Wed, Oct 9, 2019 at 2:55 PM Audrey Lorberfeld - audrey.lorberf...@ibm.com
 wrote:

> Also, in terms of computational cost, it would seem that including most
> terms/not having a stop ilst would take a toll on the system. For 
instance,
> right now we have "ibm" as a stop word because it appears everywhere in 
our
> corpus. If we did not include it in the stop words file, we would have to
> retrieve every single document in our corpus and rank them. That's a high
> computational cost, no?
>
> --
> Audrey Lorberfeld
> Data Scientist, w3 Search
> IBM
> audrey.lorberf...@ibm.com
>
>
> On 10/9/19, 2:31 PM, "Audrey Lorberfeld - audrey.lorberf...@ibm.com" <
> audrey.lorberf...@ibm.com> wrote:
>
> Wow, thank you so much, everyone. This is all incredibly helpful
> insight.
>
> So, would it be fair to say that the majority of you all do NOT use
> stop words?
>
> --
> Audrey Lorberfeld
> Data Scientist, w3 Search
> IBM
> audrey.lorberf...@ibm.com
>
>
> On 10/9/19, 11:14 AM, "David Hastings" 
> wrote:
>
> However, with all that said, stopwords CAN be useful in some
> situations.  I
> combine stopwords with the shingle factory to create "interesting
> phrases"
> (not really) that i use in "my more like this" needs.  for 
example,
> europe for vacation
> europe on vacation
> will create the shingle
> europe_vacation
> which i can then use to relate other documents that would be much
> more similar in such regard, rather than just using the
> "interesting words"
> europe, vacation
>
> with stop words, the shingles would be
> europe_for
> for_vacation
> and
> europe_on
> on_vacation
>
> just something to keep in mind,  theres a lot of creative ways to
> use
> stopwords depending on your needs.  i use the above for a VERY
> basic ML
> teacher and it works way better than using stopwords,
>
>
>
>
>
>
>
>
>
>
>
>
>
> On Wed, Oct 9, 2019 at 10:51 AM Erick Erickson <
> erickerick...@gmail.com>
> wrote:
>
> > The theory behind stopwords is that they are “safe” to remove
> when
> > calculating relevance, so we can squeeze every last bit of
> usefulness out
> > of very constrained hardware (think 64K of memory. Yes
> kilobytes). We’ve
> > come a long way since then and the necessity of removing
> stopwords from the
> > indexed tokens to conserve RAM and disk is much less relevant
> than it used
> > to be in “the bad old days” when the idea of stopwords was
> invented.
> >
> > I’m not quite so confident as Alex that there is “no benefit”,
> but I’ll
> > totally agree that you should remove stopwords only _after_ you
> have some
> > evidence that removing them is A Good Thing in your situation.
> >
> > And removing stopwords leads to some interesting corner cases.
> Consider a
> > search for “to be or not to be” if they’re all stopwords.
> >
> > Best,
> > Erick
> >
> > > On Oct 9, 2019, at 9:38 AM, Audrey Lorberfeld -
> > audrey.lorberf...@ibm.com  wrote:
> > >
> > > Hey Alex,
> > >
> > > Thank you!
> > >
> > > Re: stopwords being a thing of the past due to the
> affordability of
> > hardware...can you expand? 

Re: Re: Re: Re: Protecting Tokens from Any Analysis

2019-10-09 Thread David Hastings
if you have anything close to a decent server you wont notice it all.  im
at about 21 million documents, index varies between 450gb to 800gb
depending on merges, and about 60k searches a day and stay sub second non
stop, and this is on a single core/non cloud environment

On Wed, Oct 9, 2019 at 2:55 PM Audrey Lorberfeld - audrey.lorberf...@ibm.com
 wrote:

> Also, in terms of computational cost, it would seem that including most
> terms/not having a stop ilst would take a toll on the system. For instance,
> right now we have "ibm" as a stop word because it appears everywhere in our
> corpus. If we did not include it in the stop words file, we would have to
> retrieve every single document in our corpus and rank them. That's a high
> computational cost, no?
>
> --
> Audrey Lorberfeld
> Data Scientist, w3 Search
> IBM
> audrey.lorberf...@ibm.com
>
>
> On 10/9/19, 2:31 PM, "Audrey Lorberfeld - audrey.lorberf...@ibm.com" <
> audrey.lorberf...@ibm.com> wrote:
>
> Wow, thank you so much, everyone. This is all incredibly helpful
> insight.
>
> So, would it be fair to say that the majority of you all do NOT use
> stop words?
>
> --
> Audrey Lorberfeld
> Data Scientist, w3 Search
> IBM
> audrey.lorberf...@ibm.com
>
>
> On 10/9/19, 11:14 AM, "David Hastings" 
> wrote:
>
> However, with all that said, stopwords CAN be useful in some
> situations.  I
> combine stopwords with the shingle factory to create "interesting
> phrases"
> (not really) that i use in "my more like this" needs.  for example,
> europe for vacation
> europe on vacation
> will create the shingle
> europe_vacation
> which i can then use to relate other documents that would be much
> more similar in such regard, rather than just using the
> "interesting words"
> europe, vacation
>
> with stop words, the shingles would be
> europe_for
> for_vacation
> and
> europe_on
> on_vacation
>
> just something to keep in mind,  theres a lot of creative ways to
> use
> stopwords depending on your needs.  i use the above for a VERY
> basic ML
> teacher and it works way better than using stopwords,
>
>
>
>
>
>
>
>
>
>
>
>
>
> On Wed, Oct 9, 2019 at 10:51 AM Erick Erickson <
> erickerick...@gmail.com>
> wrote:
>
> > The theory behind stopwords is that they are “safe” to remove
> when
> > calculating relevance, so we can squeeze every last bit of
> usefulness out
> > of very constrained hardware (think 64K of memory. Yes
> kilobytes). We’ve
> > come a long way since then and the necessity of removing
> stopwords from the
> > indexed tokens to conserve RAM and disk is much less relevant
> than it used
> > to be in “the bad old days” when the idea of stopwords was
> invented.
> >
> > I’m not quite so confident as Alex that there is “no benefit”,
> but I’ll
> > totally agree that you should remove stopwords only _after_ you
> have some
> > evidence that removing them is A Good Thing in your situation.
> >
> > And removing stopwords leads to some interesting corner cases.
> Consider a
> > search for “to be or not to be” if they’re all stopwords.
> >
> > Best,
> > Erick
> >
> > > On Oct 9, 2019, at 9:38 AM, Audrey Lorberfeld -
> > audrey.lorberf...@ibm.com  wrote:
> > >
> > > Hey Alex,
> > >
> > > Thank you!
> > >
> > > Re: stopwords being a thing of the past due to the
> affordability of
> > hardware...can you expand? I'm not sure I understand.
> > >
> > > --
> > > Audrey Lorberfeld
> > > Data Scientist, w3 Search
> > > IBM
> > > audrey.lorberf...@ibm.com
> > >
> > >
> > > On 10/8/19, 1:01 PM, "David Hastings" <
> hastings.recurs...@gmail.com>
> > wrote:
> > >
> > >Another thing to add to the above,
> > >>
> > >> IT:ibm. In this case, we would want to maintain the colon and
> the
> > >> capitalization (otherwise “it” would be taken out as a
> stopword).
> > >>
> > >stopwords are a thing of the past at this point.  there is
> no benefit
> > to
> > >using them now with hardware being so cheap.
> > >
> > >On Tue, Oct 8, 2019 at 12:43 PM Alexandre Rafalovitch <
> > arafa...@gmail.com>
> > >wrote:
> > >
> > >> If you don't want it to be touched by a tokenizer, how would
> the
> > >> protection step know that the sequence of characters you want
> to
> > >> protect is "IT:ibm" and not "this is an IT:ibm term I want to
> > >> protect"?
> > >>
> > >> What it sounds to me is that you may want to:
> > >> 1) copyField 

Re: Re: Re: Protecting Tokens from Any Analysis

2019-10-09 Thread David Hastings
only in my more like this tools, but they have a very specific purpose,
otherwise no

On Wed, Oct 9, 2019 at 2:31 PM Audrey Lorberfeld - audrey.lorberf...@ibm.com
 wrote:

> Wow, thank you so much, everyone. This is all incredibly helpful insight.
>
> So, would it be fair to say that the majority of you all do NOT use stop
> words?
>
> --
> Audrey Lorberfeld
> Data Scientist, w3 Search
> IBM
> audrey.lorberf...@ibm.com
>
>
> On 10/9/19, 11:14 AM, "David Hastings" 
> wrote:
>
> However, with all that said, stopwords CAN be useful in some
> situations.  I
> combine stopwords with the shingle factory to create "interesting
> phrases"
> (not really) that i use in "my more like this" needs.  for example,
> europe for vacation
> europe on vacation
> will create the shingle
> europe_vacation
> which i can then use to relate other documents that would be much
> more similar in such regard, rather than just using the "interesting
> words"
> europe, vacation
>
> with stop words, the shingles would be
> europe_for
> for_vacation
> and
> europe_on
> on_vacation
>
> just something to keep in mind,  theres a lot of creative ways to use
> stopwords depending on your needs.  i use the above for a VERY basic ML
> teacher and it works way better than using stopwords,
>
>
>
>
>
>
>
>
>
>
>
>
>
> On Wed, Oct 9, 2019 at 10:51 AM Erick Erickson <
> erickerick...@gmail.com>
> wrote:
>
> > The theory behind stopwords is that they are “safe” to remove when
> > calculating relevance, so we can squeeze every last bit of
> usefulness out
> > of very constrained hardware (think 64K of memory. Yes kilobytes).
> We’ve
> > come a long way since then and the necessity of removing stopwords
> from the
> > indexed tokens to conserve RAM and disk is much less relevant than
> it used
> > to be in “the bad old days” when the idea of stopwords was invented.
> >
> > I’m not quite so confident as Alex that there is “no benefit”, but
> I’ll
> > totally agree that you should remove stopwords only _after_ you have
> some
> > evidence that removing them is A Good Thing in your situation.
> >
> > And removing stopwords leads to some interesting corner cases.
> Consider a
> > search for “to be or not to be” if they’re all stopwords.
> >
> > Best,
> > Erick
> >
> > > On Oct 9, 2019, at 9:38 AM, Audrey Lorberfeld -
> > audrey.lorberf...@ibm.com  wrote:
> > >
> > > Hey Alex,
> > >
> > > Thank you!
> > >
> > > Re: stopwords being a thing of the past due to the affordability of
> > hardware...can you expand? I'm not sure I understand.
> > >
> > > --
> > > Audrey Lorberfeld
> > > Data Scientist, w3 Search
> > > IBM
> > > audrey.lorberf...@ibm.com
> > >
> > >
> > > On 10/8/19, 1:01 PM, "David Hastings" <
> hastings.recurs...@gmail.com>
> > wrote:
> > >
> > >Another thing to add to the above,
> > >>
> > >> IT:ibm. In this case, we would want to maintain the colon and the
> > >> capitalization (otherwise “it” would be taken out as a stopword).
> > >>
> > >stopwords are a thing of the past at this point.  there is no
> benefit
> > to
> > >using them now with hardware being so cheap.
> > >
> > >On Tue, Oct 8, 2019 at 12:43 PM Alexandre Rafalovitch <
> > arafa...@gmail.com>
> > >wrote:
> > >
> > >> If you don't want it to be touched by a tokenizer, how would the
> > >> protection step know that the sequence of characters you want to
> > >> protect is "IT:ibm" and not "this is an IT:ibm term I want to
> > >> protect"?
> > >>
> > >> What it sounds to me is that you may want to:
> > >> 1) copyField to a second field
> > >> 2) Apply a much lighter (whitespace?) tokenizer to that second
> field
> > >> 3) Run the results through something like KeepWordFilterFactory
> > >> 4) Search both fields with a boost on the second, higher-signal
> field
> > >>
> > >> The other option is to run CharacterFilter,
> > >> (PatternReplaceCharFilterFactory) which is pre-tokenizer to map
> known
> > >> complex acronyms to non-tokenizable substitutions. E.g. "IT:ibm ->
> > >> term365". As long as it is done on both indexing and query, they
> will
> > >> still match. You may have to have a bunch of them or write some
> sort
> > >> of lookup map.
> > >>
> > >> Regards,
> > >>   Alex.
> > >>
> > >> On Tue, 8 Oct 2019 at 12:10, Audrey Lorberfeld -
> > >> audrey.lorberf...@ibm.com  wrote:
> > >>>
> > >>> Hi All,
> > >>>
> > >>> This is likely a rudimentary question, but I can’t seem to find a
> > >> straight-forward answer on forums or the documentation…is there a
> way to
> > >> protect tokens from ANY analysis? I know things like the
> > >> KeywordMarkerFilterFactory protect 

Re: Re: Re: Re: Protecting Tokens from Any Analysis

2019-10-09 Thread David Hastings
oh and by 'non stop' i mean close enough for me :)

On Wed, Oct 9, 2019 at 2:59 PM David Hastings 
wrote:

> if you have anything close to a decent server you wont notice it all.  im
> at about 21 million documents, index varies between 450gb to 800gb
> depending on merges, and about 60k searches a day and stay sub second non
> stop, and this is on a single core/non cloud environment
>
> On Wed, Oct 9, 2019 at 2:55 PM Audrey Lorberfeld -
> audrey.lorberf...@ibm.com  wrote:
>
>> Also, in terms of computational cost, it would seem that including most
>> terms/not having a stop ilst would take a toll on the system. For instance,
>> right now we have "ibm" as a stop word because it appears everywhere in our
>> corpus. If we did not include it in the stop words file, we would have to
>> retrieve every single document in our corpus and rank them. That's a high
>> computational cost, no?
>>
>> --
>> Audrey Lorberfeld
>> Data Scientist, w3 Search
>> IBM
>> audrey.lorberf...@ibm.com
>>
>>
>> On 10/9/19, 2:31 PM, "Audrey Lorberfeld - audrey.lorberf...@ibm.com" <
>> audrey.lorberf...@ibm.com> wrote:
>>
>> Wow, thank you so much, everyone. This is all incredibly helpful
>> insight.
>>
>> So, would it be fair to say that the majority of you all do NOT use
>> stop words?
>>
>> --
>> Audrey Lorberfeld
>> Data Scientist, w3 Search
>> IBM
>> audrey.lorberf...@ibm.com
>>
>>
>> On 10/9/19, 11:14 AM, "David Hastings" 
>> wrote:
>>
>> However, with all that said, stopwords CAN be useful in some
>> situations.  I
>> combine stopwords with the shingle factory to create "interesting
>> phrases"
>> (not really) that i use in "my more like this" needs.  for
>> example,
>> europe for vacation
>> europe on vacation
>> will create the shingle
>> europe_vacation
>> which i can then use to relate other documents that would be much
>> more similar in such regard, rather than just using the
>> "interesting words"
>> europe, vacation
>>
>> with stop words, the shingles would be
>> europe_for
>> for_vacation
>> and
>> europe_on
>> on_vacation
>>
>> just something to keep in mind,  theres a lot of creative ways to
>> use
>> stopwords depending on your needs.  i use the above for a VERY
>> basic ML
>> teacher and it works way better than using stopwords,
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Wed, Oct 9, 2019 at 10:51 AM Erick Erickson <
>> erickerick...@gmail.com>
>> wrote:
>>
>> > The theory behind stopwords is that they are “safe” to remove
>> when
>> > calculating relevance, so we can squeeze every last bit of
>> usefulness out
>> > of very constrained hardware (think 64K of memory. Yes
>> kilobytes). We’ve
>> > come a long way since then and the necessity of removing
>> stopwords from the
>> > indexed tokens to conserve RAM and disk is much less relevant
>> than it used
>> > to be in “the bad old days” when the idea of stopwords was
>> invented.
>> >
>> > I’m not quite so confident as Alex that there is “no benefit”,
>> but I’ll
>> > totally agree that you should remove stopwords only _after_ you
>> have some
>> > evidence that removing them is A Good Thing in your situation.
>> >
>> > And removing stopwords leads to some interesting corner cases.
>> Consider a
>> > search for “to be or not to be” if they’re all stopwords.
>> >
>> > Best,
>> > Erick
>> >
>> > > On Oct 9, 2019, at 9:38 AM, Audrey Lorberfeld -
>> > audrey.lorberf...@ibm.com  wrote:
>> > >
>> > > Hey Alex,
>> > >
>> > > Thank you!
>> > >
>> > > Re: stopwords being a thing of the past due to the
>> affordability of
>> > hardware...can you expand? I'm not sure I understand.
>> > >
>> > > --
>> > > Audrey Lorberfeld
>> > > Data Scientist, w3 Search
>> > > IBM
>> > > audrey.lorberf...@ibm.com
>> > >
>> > >
>> > > On 10/8/19, 1:01 PM, "David Hastings" <
>> hastings.recurs...@gmail.com>
>> > wrote:
>> > >
>> > >Another thing to add to the above,
>> > >>
>> > >> IT:ibm. In this case, we would want to maintain the colon
>> and the
>> > >> capitalization (otherwise “it” would be taken out as a
>> stopword).
>> > >>
>> > >stopwords are a thing of the past at this point.  there is
>> no benefit
>> > to
>> > >using them now with hardware being so cheap.
>> > >
>> > >On Tue, Oct 8, 2019 at 12:43 PM Alexandre Rafalovitch <
>> > arafa...@gmail.com>
>> > >wrote:
>> > >
>> > >> If you don't want it to be touched by a tokenizer, how would
>> the
>> > >> 

Re: Re: Re: Re: Protecting Tokens from Any Analysis

2019-10-09 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Also, in terms of computational cost, it would seem that including most 
terms/not having a stop ilst would take a toll on the system. For instance, 
right now we have "ibm" as a stop word because it appears everywhere in our 
corpus. If we did not include it in the stop words file, we would have to 
retrieve every single document in our corpus and rank them. That's a high 
computational cost, no?

-- 
Audrey Lorberfeld
Data Scientist, w3 Search
IBM
audrey.lorberf...@ibm.com
 

On 10/9/19, 2:31 PM, "Audrey Lorberfeld - audrey.lorberf...@ibm.com" 
 wrote:

Wow, thank you so much, everyone. This is all incredibly helpful insight.

So, would it be fair to say that the majority of you all do NOT use stop 
words?

-- 
Audrey Lorberfeld
Data Scientist, w3 Search
IBM
audrey.lorberf...@ibm.com
 

On 10/9/19, 11:14 AM, "David Hastings"  wrote:

However, with all that said, stopwords CAN be useful in some 
situations.  I
combine stopwords with the shingle factory to create "interesting 
phrases"
(not really) that i use in "my more like this" needs.  for example,
europe for vacation
europe on vacation
will create the shingle
europe_vacation
which i can then use to relate other documents that would be much
more similar in such regard, rather than just using the "interesting 
words"
europe, vacation

with stop words, the shingles would be
europe_for
for_vacation
and
europe_on
on_vacation

just something to keep in mind,  theres a lot of creative ways to use
stopwords depending on your needs.  i use the above for a VERY basic ML
teacher and it works way better than using stopwords,













On Wed, Oct 9, 2019 at 10:51 AM Erick Erickson 
wrote:

> The theory behind stopwords is that they are “safe” to remove when
> calculating relevance, so we can squeeze every last bit of usefulness 
out
> of very constrained hardware (think 64K of memory. Yes kilobytes). 
We’ve
> come a long way since then and the necessity of removing stopwords 
from the
> indexed tokens to conserve RAM and disk is much less relevant than it 
used
> to be in “the bad old days” when the idea of stopwords was invented.
>
> I’m not quite so confident as Alex that there is “no benefit”, but 
I’ll
> totally agree that you should remove stopwords only _after_ you have 
some
> evidence that removing them is A Good Thing in your situation.
>
> And removing stopwords leads to some interesting corner cases. 
Consider a
> search for “to be or not to be” if they’re all stopwords.
>
> Best,
> Erick
>
> > On Oct 9, 2019, at 9:38 AM, Audrey Lorberfeld -
> audrey.lorberf...@ibm.com  wrote:
> >
> > Hey Alex,
> >
> > Thank you!
> >
> > Re: stopwords being a thing of the past due to the affordability of
> hardware...can you expand? I'm not sure I understand.
> >
> > --
> > Audrey Lorberfeld
> > Data Scientist, w3 Search
> > IBM
> > audrey.lorberf...@ibm.com
> >
> >
> > On 10/8/19, 1:01 PM, "David Hastings" 
> wrote:
> >
> >Another thing to add to the above,
> >>
> >> IT:ibm. In this case, we would want to maintain the colon and the
> >> capitalization (otherwise “it” would be taken out as a stopword).
> >>
> >stopwords are a thing of the past at this point.  there is no 
benefit
> to
> >using them now with hardware being so cheap.
> >
> >On Tue, Oct 8, 2019 at 12:43 PM Alexandre Rafalovitch <
> arafa...@gmail.com>
> >wrote:
> >
> >> If you don't want it to be touched by a tokenizer, how would the
> >> protection step know that the sequence of characters you want to
> >> protect is "IT:ibm" and not "this is an IT:ibm term I want to
> >> protect"?
> >>
> >> What it sounds to me is that you may want to:
> >> 1) copyField to a second field
> >> 2) Apply a much lighter (whitespace?) tokenizer to that second 
field
> >> 3) Run the results through something like KeepWordFilterFactory
> >> 4) Search both fields with a boost on the second, higher-signal 
field
> >>
> >> The other option is to run CharacterFilter,
> >> (PatternReplaceCharFilterFactory) which is pre-tokenizer to map 
known
> >> complex acronyms to non-tokenizable substitutions. E.g. "IT:ibm ->
> >> term365". As long as it is done on 

Re: Re: Re: Protecting Tokens from Any Analysis

2019-10-09 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Wow, thank you so much, everyone. This is all incredibly helpful insight.

So, would it be fair to say that the majority of you all do NOT use stop words?

-- 
Audrey Lorberfeld
Data Scientist, w3 Search
IBM
audrey.lorberf...@ibm.com
 

On 10/9/19, 11:14 AM, "David Hastings"  wrote:

However, with all that said, stopwords CAN be useful in some situations.  I
combine stopwords with the shingle factory to create "interesting phrases"
(not really) that i use in "my more like this" needs.  for example,
europe for vacation
europe on vacation
will create the shingle
europe_vacation
which i can then use to relate other documents that would be much
more similar in such regard, rather than just using the "interesting words"
europe, vacation

with stop words, the shingles would be
europe_for
for_vacation
and
europe_on
on_vacation

just something to keep in mind,  theres a lot of creative ways to use
stopwords depending on your needs.  i use the above for a VERY basic ML
teacher and it works way better than using stopwords,













On Wed, Oct 9, 2019 at 10:51 AM Erick Erickson 
wrote:

> The theory behind stopwords is that they are “safe” to remove when
> calculating relevance, so we can squeeze every last bit of usefulness out
> of very constrained hardware (think 64K of memory. Yes kilobytes). We’ve
> come a long way since then and the necessity of removing stopwords from 
the
> indexed tokens to conserve RAM and disk is much less relevant than it used
> to be in “the bad old days” when the idea of stopwords was invented.
>
> I’m not quite so confident as Alex that there is “no benefit”, but I’ll
> totally agree that you should remove stopwords only _after_ you have some
> evidence that removing them is A Good Thing in your situation.
>
> And removing stopwords leads to some interesting corner cases. Consider a
> search for “to be or not to be” if they’re all stopwords.
>
> Best,
> Erick
>
> > On Oct 9, 2019, at 9:38 AM, Audrey Lorberfeld -
> audrey.lorberf...@ibm.com  wrote:
> >
> > Hey Alex,
> >
> > Thank you!
> >
> > Re: stopwords being a thing of the past due to the affordability of
> hardware...can you expand? I'm not sure I understand.
> >
> > --
> > Audrey Lorberfeld
> > Data Scientist, w3 Search
> > IBM
> > audrey.lorberf...@ibm.com
> >
> >
> > On 10/8/19, 1:01 PM, "David Hastings" 
> wrote:
> >
> >Another thing to add to the above,
> >>
> >> IT:ibm. In this case, we would want to maintain the colon and the
> >> capitalization (otherwise “it” would be taken out as a stopword).
> >>
> >stopwords are a thing of the past at this point.  there is no benefit
> to
> >using them now with hardware being so cheap.
> >
> >On Tue, Oct 8, 2019 at 12:43 PM Alexandre Rafalovitch <
> arafa...@gmail.com>
> >wrote:
> >
> >> If you don't want it to be touched by a tokenizer, how would the
> >> protection step know that the sequence of characters you want to
> >> protect is "IT:ibm" and not "this is an IT:ibm term I want to
> >> protect"?
> >>
> >> What it sounds to me is that you may want to:
> >> 1) copyField to a second field
> >> 2) Apply a much lighter (whitespace?) tokenizer to that second field
> >> 3) Run the results through something like KeepWordFilterFactory
> >> 4) Search both fields with a boost on the second, higher-signal field
> >>
> >> The other option is to run CharacterFilter,
> >> (PatternReplaceCharFilterFactory) which is pre-tokenizer to map known
> >> complex acronyms to non-tokenizable substitutions. E.g. "IT:ibm ->
> >> term365". As long as it is done on both indexing and query, they will
> >> still match. You may have to have a bunch of them or write some sort
> >> of lookup map.
> >>
> >> Regards,
> >>   Alex.
> >>
> >> On Tue, 8 Oct 2019 at 12:10, Audrey Lorberfeld -
> >> audrey.lorberf...@ibm.com  wrote:
> >>>
> >>> Hi All,
> >>>
> >>> This is likely a rudimentary question, but I can’t seem to find a
> >> straight-forward answer on forums or the documentation…is there a way 
to
> >> protect tokens from ANY analysis? I know things like the
> >> KeywordMarkerFilterFactory protect tokens from stemming, but we have
> some
> >> terms we don’t even want our tokenizer to touch. Mostly, these are
> >> IBM-specific acronyms, such as IT:ibm. In this case, we would want to
> >> maintain the colon and the capitalization (otherwise “it” would be 
taken
> >> out as a stopword).
> >>>
> >>> Any advice is appreciated!
> >>>
> >>> Thank you,
> >>> Audrey
> >>>
> >>> --
> >>> 

Re: Re: Protecting Tokens from Any Analysis

2019-10-09 Thread David Hastings
However, with all that said, stopwords CAN be useful in some situations.  I
combine stopwords with the shingle factory to create "interesting phrases"
(not really) that i use in "my more like this" needs.  for example,
europe for vacation
europe on vacation
will create the shingle
europe_vacation
which i can then use to relate other documents that would be much
more similar in such regard, rather than just using the "interesting words"
europe, vacation

with stop words, the shingles would be
europe_for
for_vacation
and
europe_on
on_vacation

just something to keep in mind,  theres a lot of creative ways to use
stopwords depending on your needs.  i use the above for a VERY basic ML
teacher and it works way better than using stopwords,













On Wed, Oct 9, 2019 at 10:51 AM Erick Erickson 
wrote:

> The theory behind stopwords is that they are “safe” to remove when
> calculating relevance, so we can squeeze every last bit of usefulness out
> of very constrained hardware (think 64K of memory. Yes kilobytes). We’ve
> come a long way since then and the necessity of removing stopwords from the
> indexed tokens to conserve RAM and disk is much less relevant than it used
> to be in “the bad old days” when the idea of stopwords was invented.
>
> I’m not quite so confident as Alex that there is “no benefit”, but I’ll
> totally agree that you should remove stopwords only _after_ you have some
> evidence that removing them is A Good Thing in your situation.
>
> And removing stopwords leads to some interesting corner cases. Consider a
> search for “to be or not to be” if they’re all stopwords.
>
> Best,
> Erick
>
> > On Oct 9, 2019, at 9:38 AM, Audrey Lorberfeld -
> audrey.lorberf...@ibm.com  wrote:
> >
> > Hey Alex,
> >
> > Thank you!
> >
> > Re: stopwords being a thing of the past due to the affordability of
> hardware...can you expand? I'm not sure I understand.
> >
> > --
> > Audrey Lorberfeld
> > Data Scientist, w3 Search
> > IBM
> > audrey.lorberf...@ibm.com
> >
> >
> > On 10/8/19, 1:01 PM, "David Hastings" 
> wrote:
> >
> >Another thing to add to the above,
> >>
> >> IT:ibm. In this case, we would want to maintain the colon and the
> >> capitalization (otherwise “it” would be taken out as a stopword).
> >>
> >stopwords are a thing of the past at this point.  there is no benefit
> to
> >using them now with hardware being so cheap.
> >
> >On Tue, Oct 8, 2019 at 12:43 PM Alexandre Rafalovitch <
> arafa...@gmail.com>
> >wrote:
> >
> >> If you don't want it to be touched by a tokenizer, how would the
> >> protection step know that the sequence of characters you want to
> >> protect is "IT:ibm" and not "this is an IT:ibm term I want to
> >> protect"?
> >>
> >> What it sounds to me is that you may want to:
> >> 1) copyField to a second field
> >> 2) Apply a much lighter (whitespace?) tokenizer to that second field
> >> 3) Run the results through something like KeepWordFilterFactory
> >> 4) Search both fields with a boost on the second, higher-signal field
> >>
> >> The other option is to run CharacterFilter,
> >> (PatternReplaceCharFilterFactory) which is pre-tokenizer to map known
> >> complex acronyms to non-tokenizable substitutions. E.g. "IT:ibm ->
> >> term365". As long as it is done on both indexing and query, they will
> >> still match. You may have to have a bunch of them or write some sort
> >> of lookup map.
> >>
> >> Regards,
> >>   Alex.
> >>
> >> On Tue, 8 Oct 2019 at 12:10, Audrey Lorberfeld -
> >> audrey.lorberf...@ibm.com  wrote:
> >>>
> >>> Hi All,
> >>>
> >>> This is likely a rudimentary question, but I can’t seem to find a
> >> straight-forward answer on forums or the documentation…is there a way to
> >> protect tokens from ANY analysis? I know things like the
> >> KeywordMarkerFilterFactory protect tokens from stemming, but we have
> some
> >> terms we don’t even want our tokenizer to touch. Mostly, these are
> >> IBM-specific acronyms, such as IT:ibm. In this case, we would want to
> >> maintain the colon and the capitalization (otherwise “it” would be taken
> >> out as a stopword).
> >>>
> >>> Any advice is appreciated!
> >>>
> >>> Thank you,
> >>> Audrey
> >>>
> >>> --
> >>> Audrey Lorberfeld
> >>> Data Scientist, w3 Search
> >>> IBM
> >>> audrey.lorberf...@ibm.com
> >>>
> >>
> >
> >
>
>


Re: Re: Protecting Tokens from Any Analysis

2019-10-09 Thread David Hastings
another add on, as the previous two were pretty much spot on:

https://www.google.com/search?rlz=1C5CHFA_enUS814US819=ACYBGNTi2tQTQH6TycDKwRNEn9g2km9awg%3A1570632176627=8PGdXa7tJeem_QaatJ_oAg=drive+in=drive+in_l=psy-ab.3..0l10.35669.36730..37042...0.4..1.434.1152.4j3j4-1..01..gws-wiz...0i71j35i39j0i273j0i67j0i131j0i273i70i249.agjl1cqAyog=0ahUKEwiupdfntI_lAhVnU98KHRraBy0Q4dUDCAs=5

vs

https://www.google.com/search?rlz=1C5CHFA_enUS814US819=ACYBGNRFNjzWADDR7awohPfgg8qGXqOlmg%3A1570632182338=9vGdXZ2VFKW8ggeuw73IDQ=drive+on=drive+on_l=psy-ab.3..0l10.35301.37396..37917...0.4..0.83.590.82..01..gws-wiz...0i71j35i39j0i273j0i131j0i67j0i3.34FIDQtvfOE=0ahUKEwid6LPqtI_lAhUlnuAKHa5hD9kQ4dUDCAs=5


On Wed, Oct 9, 2019 at 10:41 AM Alexandre Rafalovitch 
wrote:

> Stopwords (it was discussed on mailing list several times I recall):
> The ideas is that it used to be part of the tricks to make the index
> as small as possible to allow faster search. Stopwords being the most
> common words
> This days, disk space is not an issue most of the time and there have
> been many optimizations to make stopwords less relevant. Plus, like
> you said, sometimes the stopword management actively gets in the way.
> Here is an interesting - if old - article about it too:
>
> https://library.stanford.edu/blogs/digital-library-blog/2011/12/stopwords-searchworks-be-or-not-be
>
> Regards,
>Alex.
>
> On Wed, 9 Oct 2019 at 09:39, Audrey Lorberfeld -
> audrey.lorberf...@ibm.com  wrote:
> >
> > Hey Alex,
> >
> > Thank you!
> >
> > Re: stopwords being a thing of the past due to the affordability of
> hardware...can you expand? I'm not sure I understand.
> >
> > --
> > Audrey Lorberfeld
> > Data Scientist, w3 Search
> > IBM
> > audrey.lorberf...@ibm.com
> >
> >
> > On 10/8/19, 1:01 PM, "David Hastings" 
> wrote:
> >
> > Another thing to add to the above,
> > >
> > > IT:ibm. In this case, we would want to maintain the colon and the
> > > capitalization (otherwise “it” would be taken out as a stopword).
> > >
> > stopwords are a thing of the past at this point.  there is no
> benefit to
> > using them now with hardware being so cheap.
> >
> > On Tue, Oct 8, 2019 at 12:43 PM Alexandre Rafalovitch <
> arafa...@gmail.com>
> > wrote:
> >
> > > If you don't want it to be touched by a tokenizer, how would the
> > > protection step know that the sequence of characters you want to
> > > protect is "IT:ibm" and not "this is an IT:ibm term I want to
> > > protect"?
> > >
> > > What it sounds to me is that you may want to:
> > > 1) copyField to a second field
> > > 2) Apply a much lighter (whitespace?) tokenizer to that second
> field
> > > 3) Run the results through something like KeepWordFilterFactory
> > > 4) Search both fields with a boost on the second, higher-signal
> field
> > >
> > > The other option is to run CharacterFilter,
> > > (PatternReplaceCharFilterFactory) which is pre-tokenizer to map
> known
> > > complex acronyms to non-tokenizable substitutions. E.g. "IT:ibm ->
> > > term365". As long as it is done on both indexing and query, they
> will
> > > still match. You may have to have a bunch of them or write some
> sort
> > > of lookup map.
> > >
> > > Regards,
> > >Alex.
> > >
> > > On Tue, 8 Oct 2019 at 12:10, Audrey Lorberfeld -
> > > audrey.lorberf...@ibm.com  wrote:
> > > >
> > > > Hi All,
> > > >
> > > > This is likely a rudimentary question, but I can’t seem to find a
> > > straight-forward answer on forums or the documentation…is there a
> way to
> > > protect tokens from ANY analysis? I know things like the
> > > KeywordMarkerFilterFactory protect tokens from stemming, but we
> have some
> > > terms we don’t even want our tokenizer to touch. Mostly, these are
> > > IBM-specific acronyms, such as IT:ibm. In this case, we would want
> to
> > > maintain the colon and the capitalization (otherwise “it” would be
> taken
> > > out as a stopword).
> > > >
> > > > Any advice is appreciated!
> > > >
> > > > Thank you,
> > > > Audrey
> > > >
> > > > --
> > > > Audrey Lorberfeld
> > > > Data Scientist, w3 Search
> > > > IBM
> > > > audrey.lorberf...@ibm.com
> > > >
> > >
> >
> >
>


Re: Re: Protecting Tokens from Any Analysis

2019-10-09 Thread Erick Erickson
The theory behind stopwords is that they are “safe” to remove when calculating 
relevance, so we can squeeze every last bit of usefulness out of very 
constrained hardware (think 64K of memory. Yes kilobytes). We’ve come a long 
way since then and the necessity of removing stopwords from the indexed tokens 
to conserve RAM and disk is much less relevant than it used to be in “the bad 
old days” when the idea of stopwords was invented.

I’m not quite so confident as Alex that there is “no benefit”, but I’ll totally 
agree that you should remove stopwords only _after_ you have some evidence that 
removing them is A Good Thing in your situation.

And removing stopwords leads to some interesting corner cases. Consider a 
search for “to be or not to be” if they’re all stopwords.

Best,
Erick

> On Oct 9, 2019, at 9:38 AM, Audrey Lorberfeld - audrey.lorberf...@ibm.com 
>  wrote:
> 
> Hey Alex,
> 
> Thank you!
> 
> Re: stopwords being a thing of the past due to the affordability of 
> hardware...can you expand? I'm not sure I understand.
> 
> -- 
> Audrey Lorberfeld
> Data Scientist, w3 Search
> IBM
> audrey.lorberf...@ibm.com
> 
> 
> On 10/8/19, 1:01 PM, "David Hastings"  wrote:
> 
>Another thing to add to the above,
>> 
>> IT:ibm. In this case, we would want to maintain the colon and the
>> capitalization (otherwise “it” would be taken out as a stopword).
>> 
>stopwords are a thing of the past at this point.  there is no benefit to
>using them now with hardware being so cheap.
> 
>On Tue, Oct 8, 2019 at 12:43 PM Alexandre Rafalovitch 
>wrote:
> 
>> If you don't want it to be touched by a tokenizer, how would the
>> protection step know that the sequence of characters you want to
>> protect is "IT:ibm" and not "this is an IT:ibm term I want to
>> protect"?
>> 
>> What it sounds to me is that you may want to:
>> 1) copyField to a second field
>> 2) Apply a much lighter (whitespace?) tokenizer to that second field
>> 3) Run the results through something like KeepWordFilterFactory
>> 4) Search both fields with a boost on the second, higher-signal field
>> 
>> The other option is to run CharacterFilter,
>> (PatternReplaceCharFilterFactory) which is pre-tokenizer to map known
>> complex acronyms to non-tokenizable substitutions. E.g. "IT:ibm ->
>> term365". As long as it is done on both indexing and query, they will
>> still match. You may have to have a bunch of them or write some sort
>> of lookup map.
>> 
>> Regards,
>>   Alex.
>> 
>> On Tue, 8 Oct 2019 at 12:10, Audrey Lorberfeld -
>> audrey.lorberf...@ibm.com  wrote:
>>> 
>>> Hi All,
>>> 
>>> This is likely a rudimentary question, but I can’t seem to find a
>> straight-forward answer on forums or the documentation…is there a way to
>> protect tokens from ANY analysis? I know things like the
>> KeywordMarkerFilterFactory protect tokens from stemming, but we have some
>> terms we don’t even want our tokenizer to touch. Mostly, these are
>> IBM-specific acronyms, such as IT:ibm. In this case, we would want to
>> maintain the colon and the capitalization (otherwise “it” would be taken
>> out as a stopword).
>>> 
>>> Any advice is appreciated!
>>> 
>>> Thank you,
>>> Audrey
>>> 
>>> --
>>> Audrey Lorberfeld
>>> Data Scientist, w3 Search
>>> IBM
>>> audrey.lorberf...@ibm.com
>>> 
>> 
> 
> 



Re: Re: Protecting Tokens from Any Analysis

2019-10-09 Thread Alexandre Rafalovitch
Stopwords (it was discussed on mailing list several times I recall):
The ideas is that it used to be part of the tricks to make the index
as small as possible to allow faster search. Stopwords being the most
common words
This days, disk space is not an issue most of the time and there have
been many optimizations to make stopwords less relevant. Plus, like
you said, sometimes the stopword management actively gets in the way.
Here is an interesting - if old - article about it too:
https://library.stanford.edu/blogs/digital-library-blog/2011/12/stopwords-searchworks-be-or-not-be

Regards,
   Alex.

On Wed, 9 Oct 2019 at 09:39, Audrey Lorberfeld -
audrey.lorberf...@ibm.com  wrote:
>
> Hey Alex,
>
> Thank you!
>
> Re: stopwords being a thing of the past due to the affordability of 
> hardware...can you expand? I'm not sure I understand.
>
> --
> Audrey Lorberfeld
> Data Scientist, w3 Search
> IBM
> audrey.lorberf...@ibm.com
>
>
> On 10/8/19, 1:01 PM, "David Hastings"  wrote:
>
> Another thing to add to the above,
> >
> > IT:ibm. In this case, we would want to maintain the colon and the
> > capitalization (otherwise “it” would be taken out as a stopword).
> >
> stopwords are a thing of the past at this point.  there is no benefit to
> using them now with hardware being so cheap.
>
> On Tue, Oct 8, 2019 at 12:43 PM Alexandre Rafalovitch 
> wrote:
>
> > If you don't want it to be touched by a tokenizer, how would the
> > protection step know that the sequence of characters you want to
> > protect is "IT:ibm" and not "this is an IT:ibm term I want to
> > protect"?
> >
> > What it sounds to me is that you may want to:
> > 1) copyField to a second field
> > 2) Apply a much lighter (whitespace?) tokenizer to that second field
> > 3) Run the results through something like KeepWordFilterFactory
> > 4) Search both fields with a boost on the second, higher-signal field
> >
> > The other option is to run CharacterFilter,
> > (PatternReplaceCharFilterFactory) which is pre-tokenizer to map known
> > complex acronyms to non-tokenizable substitutions. E.g. "IT:ibm ->
> > term365". As long as it is done on both indexing and query, they will
> > still match. You may have to have a bunch of them or write some sort
> > of lookup map.
> >
> > Regards,
> >Alex.
> >
> > On Tue, 8 Oct 2019 at 12:10, Audrey Lorberfeld -
> > audrey.lorberf...@ibm.com  wrote:
> > >
> > > Hi All,
> > >
> > > This is likely a rudimentary question, but I can’t seem to find a
> > straight-forward answer on forums or the documentation…is there a way to
> > protect tokens from ANY analysis? I know things like the
> > KeywordMarkerFilterFactory protect tokens from stemming, but we have 
> some
> > terms we don’t even want our tokenizer to touch. Mostly, these are
> > IBM-specific acronyms, such as IT:ibm. In this case, we would want to
> > maintain the colon and the capitalization (otherwise “it” would be taken
> > out as a stopword).
> > >
> > > Any advice is appreciated!
> > >
> > > Thank you,
> > > Audrey
> > >
> > > --
> > > Audrey Lorberfeld
> > > Data Scientist, w3 Search
> > > IBM
> > > audrey.lorberf...@ibm.com
> > >
> >
>
>


Re: Re: Protecting Tokens from Any Analysis

2019-10-09 Thread Walter Underwood
Stopwords were used when we were running search engines on 16-bit computers 
with 50 Megabyte disks, like the PDP-11. They avoided storing and processing 
long posting lists.

Think of removing stopwords as a binary weighting on frequent terms, either on 
or off (not in the index). With idf, we have a proportional weighting for 
frequent terms. That gives better results than binary weighting.

Removing stopwords makes some searches impossible. The classic example is “to 
be or not to be”, which is 100% stopwords. This is a real-world problem. When I 
was building search for Netflix a dozen years ago, I hit several movie or TV 
titles which were all stopwords. I wrote about them in this blog post.

https://observer.wunderwood.org/2007/05/31/do-all-stopword-queries-matter/

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Oct 9, 2019, at 6:38 AM, Audrey Lorberfeld - audrey.lorberf...@ibm.com 
>  wrote:
> 
> Hey Alex,
> 
> Thank you!
> 
> Re: stopwords being a thing of the past due to the affordability of 
> hardware...can you expand? I'm not sure I understand.
> 
> -- 
> Audrey Lorberfeld
> Data Scientist, w3 Search
> IBM
> audrey.lorberf...@ibm.com
> 
> 
> On 10/8/19, 1:01 PM, "David Hastings"  wrote:
> 
>Another thing to add to the above,
>> 
>> IT:ibm. In this case, we would want to maintain the colon and the
>> capitalization (otherwise “it” would be taken out as a stopword).
>> 
>stopwords are a thing of the past at this point.  there is no benefit to
>using them now with hardware being so cheap.
> 
>On Tue, Oct 8, 2019 at 12:43 PM Alexandre Rafalovitch 
>wrote:
> 
>> If you don't want it to be touched by a tokenizer, how would the
>> protection step know that the sequence of characters you want to
>> protect is "IT:ibm" and not "this is an IT:ibm term I want to
>> protect"?
>> 
>> What it sounds to me is that you may want to:
>> 1) copyField to a second field
>> 2) Apply a much lighter (whitespace?) tokenizer to that second field
>> 3) Run the results through something like KeepWordFilterFactory
>> 4) Search both fields with a boost on the second, higher-signal field
>> 
>> The other option is to run CharacterFilter,
>> (PatternReplaceCharFilterFactory) which is pre-tokenizer to map known
>> complex acronyms to non-tokenizable substitutions. E.g. "IT:ibm ->
>> term365". As long as it is done on both indexing and query, they will
>> still match. You may have to have a bunch of them or write some sort
>> of lookup map.
>> 
>> Regards,
>>   Alex.
>> 
>> On Tue, 8 Oct 2019 at 12:10, Audrey Lorberfeld -
>> audrey.lorberf...@ibm.com  wrote:
>>> 
>>> Hi All,
>>> 
>>> This is likely a rudimentary question, but I can’t seem to find a
>> straight-forward answer on forums or the documentation…is there a way to
>> protect tokens from ANY analysis? I know things like the
>> KeywordMarkerFilterFactory protect tokens from stemming, but we have some
>> terms we don’t even want our tokenizer to touch. Mostly, these are
>> IBM-specific acronyms, such as IT:ibm. In this case, we would want to
>> maintain the colon and the capitalization (otherwise “it” would be taken
>> out as a stopword).
>>> 
>>> Any advice is appreciated!
>>> 
>>> Thank you,
>>> Audrey
>>> 
>>> --
>>> Audrey Lorberfeld
>>> Data Scientist, w3 Search
>>> IBM
>>> audrey.lorberf...@ibm.com
>>> 
>> 
> 
> 



Re: Re: Protecting Tokens from Any Analysis

2019-10-09 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Hey Alex,

Thank you!

Re: stopwords being a thing of the past due to the affordability of 
hardware...can you expand? I'm not sure I understand.

-- 
Audrey Lorberfeld
Data Scientist, w3 Search
IBM
audrey.lorberf...@ibm.com
 

On 10/8/19, 1:01 PM, "David Hastings"  wrote:

Another thing to add to the above,
>
> IT:ibm. In this case, we would want to maintain the colon and the
> capitalization (otherwise “it” would be taken out as a stopword).
>
stopwords are a thing of the past at this point.  there is no benefit to
using them now with hardware being so cheap.

On Tue, Oct 8, 2019 at 12:43 PM Alexandre Rafalovitch 
wrote:

> If you don't want it to be touched by a tokenizer, how would the
> protection step know that the sequence of characters you want to
> protect is "IT:ibm" and not "this is an IT:ibm term I want to
> protect"?
>
> What it sounds to me is that you may want to:
> 1) copyField to a second field
> 2) Apply a much lighter (whitespace?) tokenizer to that second field
> 3) Run the results through something like KeepWordFilterFactory
> 4) Search both fields with a boost on the second, higher-signal field
>
> The other option is to run CharacterFilter,
> (PatternReplaceCharFilterFactory) which is pre-tokenizer to map known
> complex acronyms to non-tokenizable substitutions. E.g. "IT:ibm ->
> term365". As long as it is done on both indexing and query, they will
> still match. You may have to have a bunch of them or write some sort
> of lookup map.
>
> Regards,
>Alex.
>
> On Tue, 8 Oct 2019 at 12:10, Audrey Lorberfeld -
> audrey.lorberf...@ibm.com  wrote:
> >
> > Hi All,
> >
> > This is likely a rudimentary question, but I can’t seem to find a
> straight-forward answer on forums or the documentation…is there a way to
> protect tokens from ANY analysis? I know things like the
> KeywordMarkerFilterFactory protect tokens from stemming, but we have some
> terms we don’t even want our tokenizer to touch. Mostly, these are
> IBM-specific acronyms, such as IT:ibm. In this case, we would want to
> maintain the colon and the capitalization (otherwise “it” would be taken
> out as a stopword).
> >
> > Any advice is appreciated!
> >
> > Thank you,
> > Audrey
> >
> > --
> > Audrey Lorberfeld
> > Data Scientist, w3 Search
> > IBM
> > audrey.lorberf...@ibm.com
> >
>




Re: Re: Need urgent help with Solr spatial search using SpatialRecursivePrefixTreeFieldType

2019-10-01 Thread David Smiley
Do you know how URLs are structured?  They include name=value pairs
separated by ampersands.  This takes precedence over the contents of any
particular name or value.  Consequently looking at your parenthesis doesn't
make sense since the open-close span ampersands and thus go to different
filter queries.  I think you can completely remove those parenthesis in
fact.  Also try a tool like Postman to compose your queries rather than
direct URL manipulation.

=adminLatLon
=80
= {!geofilt pt=33.0198431,-96.6988856} OR {!geofilt
pt=50.2171726,8.265894}

Notice the leading space after 'fq'.  This is a syntax parsing gotcha that
has to do with how embedded queries are parsed, which is what you need to
do as you need to compose two with an operator.  It'd be kinda awkard to
fix that gotcha in Solr.  There are other techniques too, but this is the
most succinct.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Tue, Oct 1, 2019 at 7:34 AM anushka gupta <
anushka_gu...@external.mckinsey.com> wrote:

> Thanks,
>
> Could you please help me in combining two geofilt fqs as the following
> gives
> error, it treats ")" as part of the d parameter and gives error that
> 'd=80)'
> is not a valid param:
>
>
> ({!geofilt}=adminLatLon=33.0198431,-96.6988856=80)+OR+({!geofilt}=adminLatLon=50.2171726,8.265894=80)
>
>
>
> --
> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Re: Re: Need urgent help with Solr spatial search using SpatialRecursivePrefixTreeFieldType

2019-10-01 Thread anushka gupta
Thanks, 

Could you please help me in combining two geofilt fqs as the following gives
error, it treats ")" as part of the d parameter and gives error that 'd=80)'
is not a valid param:

({!geofilt}=adminLatLon=33.0198431,-96.6988856=80)+OR+({!geofilt}=adminLatLon=50.2171726,8.265894=80)



--
Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Re: Need urgent help with Solr spatial search using SpatialRecursivePrefixTreeFieldType

2019-09-30 Thread David Smiley
"sort" is a regular request parameter.  In your non-working query, you
specified it as a local-param inside geofilt which isn't where it belongs.
If you want to sort from two points then you need to make up your mind on
how to combine the distances into some greater aggregate function (e.g.
min/max/sum).

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Mon, Sep 30, 2019 at 10:22 AM Anushka Gupta <
anushka_gu...@external.mckinsey.com> wrote:

> Hi,
>
>
>
> I want to be able to filter on different cities and also sort the results
> based on geoproximity. But sorting doesn’t work:
>
>
>
>
> admin_directory_search_geolocation?q=david=({!geofilt+sfield=adminLatLon+pt=33.0198431,-96.6988856+d=80+sort=min(geodist(33.0198431,-96.6988856))})+OR+({!geofilt+sfield=adminLatLon+pt=50.2171726,8.265894+d=80+sort=min(geodist(50.2171726,8.265894))})
>
>
>
> Sorting works fine if I add ‘&’ in geofilt condition like :
> q=david={!geofilt=adminLatLon=33.0198431,-96.6988856=80=geodist(33.0198431,-96.6988856)}
>
>
>
> But when I combine the two FQs then sorting doesn’t work.
>
>
>
> Please help.
>
>
>
>
>
> Best regards,
>
> Anushka gupta
>
>
>
>
>
>
>
> *From:* David Smiley 
> *Sent:* Friday, September 13, 2019 10:29 PM
> *To:* Anushka Gupta 
> *Subject:* [EXT]Re: Need urgent help with Solr spatial search using
> SpatialRecursivePrefixTreeFieldType
>
>
>
> Hello,
>
>
>
> Please don't email me directly for public help.  CC is okay if you send it
> to solr-user@lucene.apache.org so that the Solr community can benefit
> from my answer or might even answer it.
>
>
> ~ David Smiley
>
> Apache Lucene/Solr Search Developer
>
> http://www.linkedin.com/in/davidwsmiley
> 
>
>
>
>
>
> On Wed, Sep 11, 2019 at 11:27 AM Anushka Gupta <
> anushka_gu...@external.mckinsey.com> wrote:
>
> Hello David,
>
>
>
> I read a lot of articles of yours regarding Solr spatial search using
> SpatialRecursivePrefixTreeFieldType. But unfortunately it doesn’t work for
> me when I combine filter query with my keyword search.
>
>
>
> Solr Version used : Solr 7.1.0
>
>
>
> I have declared fields as :
>
>
>
>  class="solr.SpatialRecursivePrefixTreeFieldType" geo="true"
> maxDistErr="0.001"
>
> distErrPct="0.025"
> distanceUnits="kilometers"/>
>
>
>
>  stored="true"  multiValued="true" />
>
>
>
>
>
> Field values are populated like :
>
> adminLatLon: [50.2171726,8.265894]
>
>
>
> Query is :
>
>
> /solr/ac3_persons/admin_directory_search_location?q=Idstein=Idstein={!geofilt%20cache=false%20cost=100}=adminLatLon=50.2171726,8.265894=500=recip(geodist(),2,200,20)=true
>
>
>
> My request handler is :
>
> admin_directory_search_location
>
>
>
> I get results if I do :
>
> /solr/ac3_persons/admin_directory_search_location?q=*:*
> =Idstein={!geofilt%20cache=false%20cost=100}=adminLatLon=50.2171726,8.265894=500=recip(geodist(),2,200,20)=true
>
>
>
> But I do not get results when I add any keyword in q.
>
>
>
> I am stuck in this issue since last many days. Could you please help with
> the same.
>
>
>
>
>
> Thanks,
>
> Anushka Gupta
>
>
>
> ++
> This email is confidential and may be privileged. If you have received it
> in error, please notify us immediately and then delete it. Please do not
> copy it, disclose its contents or use it for any purpose.
> ++
>
> ++
> This email is confidential and may be privileged. If you have received it
> in error, please notify us immediately and then delete it. Please do not
> copy it, disclose its contents or use it for any purpose.
> ++
>


Re: Re: Need urgent help with Solr spatial search using SpatialRecursivePrefixTreeFieldType

2019-09-30 Thread Tim Casey
https://stackoverflow.com/questions/48348312/solr-7-how-to-do-full-text-search-w-geo-spatial-search


On Mon, Sep 30, 2019 at 10:31 AM Anushka Gupta <
anushka_gu...@external.mckinsey.com> wrote:

> Hi,
>
> I want to be able to filter on different cities and also sort the results
> based on geoproximity. But sorting doesn’t work:
>
>
> admin_directory_search_geolocation?q=david=({!geofilt+sfield=adminLatLon+pt=33.0198431,-96.6988856+d=80+sort=min(geodist(33.0198431,-96.6988856))})+OR+({!geofilt+sfield=adminLatLon+pt=50.2171726,8.265894+d=80+sort=min(geodist(50.2171726,8.265894))})
>
> Sorting works fine if I add ‘&’ in geofilt condition like :
> q=david={!geofilt=adminLatLon=33.0198431,-96.6988856=80=geodist(33.0198431,-96.6988856)}
>
> But when I combine the two FQs then sorting doesn’t work.
>
> Please help.
>
>
> Best regards,
> Anushka gupta
>
>
>
> From: David Smiley 
> Sent: Friday, September 13, 2019 10:29 PM
> To: Anushka Gupta 
> Subject: [EXT]Re: Need urgent help with Solr spatial search using
> SpatialRecursivePrefixTreeFieldType
>
> Hello,
>
> Please don't email me directly for public help.  CC is okay if you send it
> to solr-user@lucene.apache.org so
> that the Solr community can benefit from my answer or might even answer it.
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley<
> https://urldefense.proofpoint.com/v2/url?u=http-3A__www.linkedin.com_in_davidwsmiley=DwMFaQ=yIH1_-b1hO27QV_BdDph9suDL0Jq0WcgndLmIuQXoms=0egJOuVVdmY5VQTw_S3m4bVez1r-U8nqqi6RYBxO6tTbzryrDHrFoJROJ8r-TqNc=ulu2-5V3TDOnVNfRRQusod6-FoJcdeAWu5gGB3owryU=Hv2uYeXnut3oi1ijHp14BJ09QIZzhEI-onwzhnQYB8I=
> >
>
>
> On Wed, Sep 11, 2019 at 11:27 AM Anushka Gupta <
> anushka_gu...@external.mckinsey.com anushka_gu...@external.mckinsey.com>> wrote:
> Hello David,
>
> I read a lot of articles of yours regarding Solr spatial search using
> SpatialRecursivePrefixTreeFieldType. But unfortunately it doesn’t work for
> me when I combine filter query with my keyword search.
>
> Solr Version used : Solr 7.1.0
>
> I have declared fields as :
>
>  class="solr.SpatialRecursivePrefixTreeFieldType" geo="true"
> maxDistErr="0.001"
> distErrPct="0.025"
> distanceUnits="kilometers"/>
>
>  stored="true"  multiValued="true" />
>
>
> Field values are populated like :
> adminLatLon: [50.2171726,8.265894]
>
> Query is :
>
> /solr/ac3_persons/admin_directory_search_location?q=Idstein=Idstein={!geofilt%20cache=false%20cost=100}=adminLatLon=50.2171726,8.265894=500=recip(geodist(),2,200,20)=true
>
> My request handler is :
> admin_directory_search_location
>
> I get results if I do :
>
> /solr/ac3_persons/admin_directory_search_location?q=*:*=Idstein={!geofilt%20cache=false%20cost=100}=adminLatLon=50.2171726,8.265894=500=recip(geodist(),2,200,20)=true
>
> But I do not get results when I add any keyword in q.
>
> I am stuck in this issue since last many days. Could you please help with
> the same.
>
>
> Thanks,
> Anushka Gupta
>
> ++
> This email is confidential and may be privileged. If you have received it
> in error, please notify us immediately and then delete it. Please do not
> copy it, disclose its contents or use it for any purpose.
> ++
>
> ++
> This email is confidential and may be privileged. If you have received it
> in error, please notify us immediately and then delete it.  Please do not
> copy it, disclose its contents or use it for any purpose.
> ++
>


RE: Re: Need urgent help with Solr spatial search using SpatialRecursivePrefixTreeFieldType

2019-09-30 Thread Anushka Gupta
Hi,

I want to be able to filter on different cities and also sort the results based 
on geoproximity. But sorting doesn’t work:

admin_directory_search_geolocation?q=david=({!geofilt+sfield=adminLatLon+pt=33.0198431,-96.6988856+d=80+sort=min(geodist(33.0198431,-96.6988856))})+OR+({!geofilt+sfield=adminLatLon+pt=50.2171726,8.265894+d=80+sort=min(geodist(50.2171726,8.265894))})

Sorting works fine if I add ‘&’ in geofilt condition like : 
q=david={!geofilt=adminLatLon=33.0198431,-96.6988856=80=geodist(33.0198431,-96.6988856)}

But when I combine the two FQs then sorting doesn’t work.

Please help.


Best regards,
Anushka gupta



From: David Smiley 
Sent: Friday, September 13, 2019 10:29 PM
To: Anushka Gupta 
Subject: [EXT]Re: Need urgent help with Solr spatial search using 
SpatialRecursivePrefixTreeFieldType

Hello,

Please don't email me directly for public help.  CC is okay if you send it to 
solr-user@lucene.apache.org so that the 
Solr community can benefit from my answer or might even answer it.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Wed, Sep 11, 2019 at 11:27 AM Anushka Gupta 
mailto:anushka_gu...@external.mckinsey.com>>
 wrote:
Hello David,

I read a lot of articles of yours regarding Solr spatial search using 
SpatialRecursivePrefixTreeFieldType. But unfortunately it doesn’t work for me 
when I combine filter query with my keyword search.

Solr Version used : Solr 7.1.0

I have declared fields as :






Field values are populated like :
adminLatLon: [50.2171726,8.265894]

Query is :
/solr/ac3_persons/admin_directory_search_location?q=Idstein=Idstein={!geofilt%20cache=false%20cost=100}=adminLatLon=50.2171726,8.265894=500=recip(geodist(),2,200,20)=true

My request handler is :
admin_directory_search_location

I get results if I do :
/solr/ac3_persons/admin_directory_search_location?q=*:*=Idstein={!geofilt%20cache=false%20cost=100}=adminLatLon=50.2171726,8.265894=500=recip(geodist(),2,200,20)=true

But I do not get results when I add any keyword in q.

I am stuck in this issue since last many days. Could you please help with the 
same.


Thanks,
Anushka Gupta

++
This email is confidential and may be privileged. If you have received it
in error, please notify us immediately and then delete it. Please do not
copy it, disclose its contents or use it for any purpose.
++

++
This email is confidential and may be privileged. If you have received it
in error, please notify us immediately and then delete it.  Please do not
copy it, disclose its contents or use it for any purpose.
++


Re: Re: SolR: How to sort (or boost) by Availability dates

2019-09-24 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Yay!

-- 
Audrey Lorberfeld
Data Scientist, w3 Search
IBM
audrey.lorberf...@ibm.com
 

On 9/24/19, 10:15 AM, "digi_business"  wrote:

Hi all, reading your suggestions i've juste come out of the darkness!

Just for explaining, my problem is that i want to show all my items (not
only the "availables"), but having the availables coming first, still
mantaining my custom sorting by "ranking" desc.
i then used this BoostQuery
bq=(Avail_From: [* TO NOW] AND Avail_To: [NOW TO *])^10
and discovered that for activating it i must declare defType=edismax before
all
then i discovered about the default SolR "score" sorting, and explicitating
it like this did the magic
sort=score desc, Ranking desc

thanks all for the help, and I really hope this could help someone else in
the future



--
Sent from: 
https://urldefense.proofpoint.com/v2/url?u=https-3A__lucene.472066.n3.nabble.com_Solr-2DUser-2Df472068.html=DwICAg=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=6psFmUJgDOuMWnRpn2-SLAU20C7GLXGSvkdyOVcxe08=dOnO6vl6A2vleGcMXwtwAmYybldRdW7Cp3aerxgPeAo=
 




Re: Re: Re: Multi-lingual Search & Accent Marks

2019-09-04 Thread Walter Underwood
On Sep 3, 2019, at 1:13 PM, Audrey Lorberfeld - audrey.lorberf...@ibm.com 
 wrote:
> 
> The main issue we are anticipating with the above strategy surrounds scoring. 
> Since we will be increasing the frequency of accented terms, we might bias 
> our page ranker...

You will not be increasing the frequency of the accented terms. Those 
frequencies will stay the same. You’ll be adding new unaccented terms. The new 
terms will probably have higher frequencies than the accented terms. If so, the 
accented terms should be preferred for accented queries. You might or might not 
want that behavior.

doc1: glück
doc1 terms: glück, gluck, glueck

doc2: glueck
doc2 terms: glueck

df for glück: 1
df for gluck: 1
df for glueck: 2

The df for the term “glück” is the same whether you expand or not.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



Re: Re: Re: Re: Multi-lingual Search & Accent Marks

2019-09-04 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Thanks, Alex! We'll look into this.

-- 
Audrey Lorberfeld
Data Scientist, w3 Search
IBM
audrey.lorberf...@ibm.com
 

On 9/3/19, 4:27 PM, "Alexandre Rafalovitch"  wrote:

What about combining:
1) KeywordRepeatFilterFactory
2) An existing folding filter (need to check it ignores Keyword marked word)
3) RemoveDuplicatesTokenFilterFactory

That may give what you are after without custom coding.

Regards,
   Alex.

On Tue, 3 Sep 2019 at 16:14, Audrey Lorberfeld -
audrey.lorberf...@ibm.com  wrote:
>
> Toke,
>
> Thank you! That makes a lot of sense.
>
> In other news -- we just had a meeting where we decided to try out a 
hybrid strategy. I'd love to know what you & everyone else thinks...
>
> - Since we are concerned with the overhead created by "double-fielding" 
all tokens per language (because I'm not sure how we'd work the logic into Solr 
to only double-field when an accent is present), we are going to try to do 
something along the lines of synonym-expansion:
> - We are going to build a custom plugin that detects diacritics 
-- upon detection, the plugin would expand the token to both its original form 
and its ascii-folded term (a la Toke's approach).
> - However, since we are doing it in a way that mimics synonym 
expansion, we are going to keep both terms in a single field
>
> The main issue we are anticipating with the above strategy surrounds 
scoring. Since we will be increasing the frequency of accented terms, we might 
bias our page ranker...
>
> Has anyone done anything similar (and/or does anyone think this idea is 
totally the wrong way to go?)
>
> Best,
> Audrey
>
> --
> Audrey Lorberfeld
> Data Scientist, w3 Search
> IBM
> audrey.lorberf...@ibm.com
>
>
> On 9/3/19, 2:58 PM, "Toke Eskildsen"  wrote:
>
> Audrey Lorberfeld - audrey.lorberf...@ibm.com 
 wrote:
> > Do you find that searching over both the original title field and 
the normalized title
> > field increases the time it takes for your search engine to 
retrieve results?
>
> It is not something we have measured as that index is fast enough 
(which in this context means that we're practically always waiting for the 
result from an external service that is issued in parallel with the call to our 
Solr server).
>
> Technically it's not different from searching across other fields 
defined in the eDismax setup, so I guess it boils down to "how many fields can 
you afford to search across?", where our organization's default answer is "as 
many as we need to get quality matches. Make it work Toke, chop chop". On a 
more serious note, it is not something I would worry about unless we're talking 
some special high-performance setup with a budget for tuning: Matching terms 
and joining filters is core Solr (Lucene really) functionality. Plain query & 
filter-matching time tend to be dwarfed by aggregations (grouping, faceting, 
stats).
>
> - Toke Eskildsen
>
>




Re: Re: Re: Multi-lingual Search & Accent Marks

2019-09-03 Thread Alexandre Rafalovitch
What about combining:
1) KeywordRepeatFilterFactory
2) An existing folding filter (need to check it ignores Keyword marked word)
3) RemoveDuplicatesTokenFilterFactory

That may give what you are after without custom coding.

Regards,
   Alex.

On Tue, 3 Sep 2019 at 16:14, Audrey Lorberfeld -
audrey.lorberf...@ibm.com  wrote:
>
> Toke,
>
> Thank you! That makes a lot of sense.
>
> In other news -- we just had a meeting where we decided to try out a hybrid 
> strategy. I'd love to know what you & everyone else thinks...
>
> - Since we are concerned with the overhead created by "double-fielding" all 
> tokens per language (because I'm not sure how we'd work the logic into Solr 
> to only double-field when an accent is present), we are going to try to do 
> something along the lines of synonym-expansion:
> - We are going to build a custom plugin that detects diacritics -- 
> upon detection, the plugin would expand the token to both its original form 
> and its ascii-folded term (a la Toke's approach).
> - However, since we are doing it in a way that mimics synonym 
> expansion, we are going to keep both terms in a single field
>
> The main issue we are anticipating with the above strategy surrounds scoring. 
> Since we will be increasing the frequency of accented terms, we might bias 
> our page ranker...
>
> Has anyone done anything similar (and/or does anyone think this idea is 
> totally the wrong way to go?)
>
> Best,
> Audrey
>
> --
> Audrey Lorberfeld
> Data Scientist, w3 Search
> IBM
> audrey.lorberf...@ibm.com
>
>
> On 9/3/19, 2:58 PM, "Toke Eskildsen"  wrote:
>
> Audrey Lorberfeld - audrey.lorberf...@ibm.com  
> wrote:
> > Do you find that searching over both the original title field and the 
> normalized title
> > field increases the time it takes for your search engine to retrieve 
> results?
>
> It is not something we have measured as that index is fast enough (which 
> in this context means that we're practically always waiting for the result 
> from an external service that is issued in parallel with the call to our Solr 
> server).
>
> Technically it's not different from searching across other fields defined 
> in the eDismax setup, so I guess it boils down to "how many fields can you 
> afford to search across?", where our organization's default answer is "as 
> many as we need to get quality matches. Make it work Toke, chop chop". On a 
> more serious note, it is not something I would worry about unless we're 
> talking some special high-performance setup with a budget for tuning: 
> Matching terms and joining filters is core Solr (Lucene really) 
> functionality. Plain query & filter-matching time tend to be dwarfed by 
> aggregations (grouping, faceting, stats).
>
> - Toke Eskildsen
>
>


Re: Re: Re: Multi-lingual Search & Accent Marks

2019-09-03 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Toke,

Thank you! That makes a lot of sense.

In other news -- we just had a meeting where we decided to try out a hybrid 
strategy. I'd love to know what you & everyone else thinks...

- Since we are concerned with the overhead created by "double-fielding" all 
tokens per language (because I'm not sure how we'd work the logic into Solr to 
only double-field when an accent is present), we are going to try to do 
something along the lines of synonym-expansion:
- We are going to build a custom plugin that detects diacritics -- upon 
detection, the plugin would expand the token to both its original form and its 
ascii-folded term (a la Toke's approach).
- However, since we are doing it in a way that mimics synonym 
expansion, we are going to keep both terms in a single field

The main issue we are anticipating with the above strategy surrounds scoring. 
Since we will be increasing the frequency of accented terms, we might bias our 
page ranker...

Has anyone done anything similar (and/or does anyone think this idea is totally 
the wrong way to go?)

Best,
Audrey

-- 
Audrey Lorberfeld
Data Scientist, w3 Search
IBM
audrey.lorberf...@ibm.com
 

On 9/3/19, 2:58 PM, "Toke Eskildsen"  wrote:

Audrey Lorberfeld - audrey.lorberf...@ibm.com  
wrote:
> Do you find that searching over both the original title field and the 
normalized title
> field increases the time it takes for your search engine to retrieve 
results?

It is not something we have measured as that index is fast enough (which in 
this context means that we're practically always waiting for the result from an 
external service that is issued in parallel with the call to our Solr server).

Technically it's not different from searching across other fields defined 
in the eDismax setup, so I guess it boils down to "how many fields can you 
afford to search across?", where our organization's default answer is "as many 
as we need to get quality matches. Make it work Toke, chop chop". On a more 
serious note, it is not something I would worry about unless we're talking some 
special high-performance setup with a budget for tuning: Matching terms and 
joining filters is core Solr (Lucene really) functionality. Plain query & 
filter-matching time tend to be dwarfed by aggregations (grouping, faceting, 
stats).

- Toke Eskildsen




Re: Re: Multi-lingual Search & Accent Marks

2019-09-03 Thread Toke Eskildsen
Audrey Lorberfeld - audrey.lorberf...@ibm.com  wrote:
> Do you find that searching over both the original title field and the 
> normalized title
> field increases the time it takes for your search engine to retrieve results?

It is not something we have measured as that index is fast enough (which in 
this context means that we're practically always waiting for the result from an 
external service that is issued in parallel with the call to our Solr server).

Technically it's not different from searching across other fields defined in 
the eDismax setup, so I guess it boils down to "how many fields can you afford 
to search across?", where our organization's default answer is "as many as we 
need to get quality matches. Make it work Toke, chop chop". On a more serious 
note, it is not something I would worry about unless we're talking some special 
high-performance setup with a budget for tuning: Matching terms and joining 
filters is core Solr (Lucene really) functionality. Plain query & 
filter-matching time tend to be dwarfed by aggregations (grouping, faceting, 
stats).

- Toke Eskildsen


Re: Re: Multi-lingual Search & Accent Marks

2019-09-03 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Toke,

Do you find that searching over both the original title field and the 
normalized title field increases the time it takes for your search engine to 
retrieve results?

-- 
Audrey Lorberfeld
Data Scientist, w3 Search
Digital Workplace Engineering
CIO, Finance and Operations
IBM
audrey.lorberf...@ibm.com
 

On 8/31/19, 3:01 PM, "Toke Eskildsen"  wrote:

Audrey Lorberfeld - audrey.lorberf...@ibm.com  
wrote:
> Just wanting to test the waters here – for those of you with search 
engines
> that index multiple languages, do you use ASCII-folding in your schema?

Our primary search engine is for Danish users, with sources being 
bibliographic records with titles and other meta data in many different 
languages. We normalise to Danish, meaning that most ligatures are removed, but 
also that letters such as Swedish ö becomes Danish ø. The rules for 
normalisation are dictated by Danish library practice and was implemented by a 
resident librarian.

Whenever we do this normalisation, we index two versions in our index: A 
very lightly normalised (lowercased) field and a heavily normalised field: If a 
record has a title "Köket" (kitchen in Swedish), we store title_orig:köket and 
title_norm:køket. edismax is used to ensure that both fields are searched per 
default (plus an explicit field alias "title" are set to point to both 
title_orig and title_norm for qualified searches) and that matches in 
title_orig has more weight for relevance calculation.

> We are onboarding Spanish documents into our index right now and keep
> going back and forth on whether we should preserve accent marks.

Going with what we do, my answer would be: Yes, do preserve and also remove 
:-). You could even have 3 or more levels of normalisation, depending on how 
much time you have for polishing.

- Toke Eskildsen




Re: Re: Multi-lingual Search & Accent Marks

2019-09-03 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Languages are the best. Thank you all so much!

-- 
Audrey Lorberfeld
Data Scientist, w3 Search
Digital Workplace Engineering
CIO, Finance and Operations
IBM
audrey.lorberf...@ibm.com
 

On 8/30/19, 4:09 PM, "Walter Underwood"  wrote:

The right transliteration for accents is language-dependent. In English, a 
diaeresis can be stripped because it is only used to mark neighboring vowels as 
independently pronounced. In German, the “typewriter umlaut” adds an “e”.

English: coöperate -> cooperate
German: Glück -> Glueck

Some stemmers will handle the typewriter umlauts for you. The InXight 
stemmers used to do that.

The English diaeresis is a fussy usage, but it does occur in text. For 
years, MS Word corrected “naive” to “naïve”. There may even be a curse 
associated with its usage.


https://urldefense.proofpoint.com/v2/url?u=https-3A__www.newyorker.com_culture_culture-2Ddesk_the-2Dcurse-2Dof-2Dthe-2Ddiaeresis=DwIFaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=bs1f1lhhzL5yetMSZKj0sDIC1dRXVKWJ6QfOnl6YGgo=cpRGRPUJXHCR3A-NyxcjzAqt-N1HevrBCjLJAW60KDU=
 

In German, there are corner cases where just stripping the umlaut changes 
one word into another, like schön/schon.

Isn’t language fun?

wunder
Walter Underwood
wun...@wunderwood.org

https://urldefense.proofpoint.com/v2/url?u=http-3A__observer.wunderwood.org_=DwIFaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=bs1f1lhhzL5yetMSZKj0sDIC1dRXVKWJ6QfOnl6YGgo=JKCjwue0SDlu5UZ5sllEI__txfMvrugOL51CIAPV1H8=
   (my blog)

> On Aug 30, 2019, at 12:48 PM, Erick Erickson  
wrote:
> 
> It Depends (tm). In this case on how sophisticated/precise your users 
are. If your users are exclusively extremely conversant in the language and are 
expected to have keyboards that allow easy access to all the accents… then I 
might leave them in. In some cases removing them can change the meaning of a 
word.
> 
> That said, most installations I’ve seen remove them. They’re still 
present in any returned stored field so the doc looks good. And then you bypass 
all the nonsense about perhaps ingesting a doc that “somehow” had accents 
removed and/or people not putting accents in their search and the like.
> 
> MappingCFF works..
> 
>> On Aug 30, 2019, at 1:54 PM, Audrey Lorberfeld - 
audrey.lorberf...@ibm.com  wrote:
>> 
>> Aita,
>> 
>> Thanks for that insight! 
>> 
>> As the conversation has progressed, we are now leaning towards not 
having the ASCII-folding filter in our pipelines in order to keep marks like 
umlauts and tildas. Instead, we might add acute and grave accents to a file 
pointed at by the MappingCharFilterFactory to simply strip those more common 
accent marks...
>> 
>> Any other opinions are welcome!
>> 
>> -- 
>> Audrey Lorberfeld
>> Data Scientist, w3 Search
>> Digital Workplace Engineering
>> CIO, Finance and Operations
>> IBM
>> audrey.lorberf...@ibm.com
>> 
>> 
>> On 8/30/19, 10:27 AM, "Atita Arora"  wrote:
>> 
>>   We work on german index, we neutralize accents before index i.e. 
umlauts to
>>   'ae', 'ue'.. Etc and similar what we do at the query time too for an
>>   appropriate match.
>> 
>>   On Fri, Aug 30, 2019, 4:22 PM Audrey Lorberfeld - 
audrey.lorberf...@ibm.com
>>wrote:
>> 
>>> Hi All,
>>> 
>>> Just wanting to test the waters here – for those of you with search
>>> engines that index multiple languages, do you use ASCII-folding in your
>>> schema? We are onboarding Spanish documents into our index right now and
>>> keep going back and forth on whether we should preserve accent marks. 
From
>>> our query logs, it seems people generally do not include accents when
>>> searching, but you never know…
>>> 
>>> Thank you in advance for sharing your experiences!
>>> 
>>> --
>>> Audrey Lorberfeld
>>> Data Scientist, w3 Search
>>> Digital Workplace Engineering
>>> CIO, Finance and Operations
>>> IBM
>>> audrey.lorberf...@ibm.com
>>> 
>>> 
>> 
>> 
> 





Re: Re: Re: Multi-lingual Search & Accent Marks

2019-09-03 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Thank you, Erick!

-- 
Audrey Lorberfeld
Data Scientist, w3 Search
Digital Workplace Engineering
CIO, Finance and Operations
IBM
audrey.lorberf...@ibm.com
 

On 8/30/19, 3:49 PM, "Erick Erickson"  wrote:

It Depends (tm). In this case on how sophisticated/precise your users are. 
If your users are exclusively extremely conversant in the language and are 
expected to have keyboards that allow easy access to all the accents… then I 
might leave them in. In some cases removing them can change the meaning of a 
word.

That said, most installations I’ve seen remove them. They’re still present 
in any returned stored field so the doc looks good. And then you bypass all the 
nonsense about perhaps ingesting a doc that “somehow” had accents removed 
and/or people not putting accents in their search and the like.

MappingCFF works..

> On Aug 30, 2019, at 1:54 PM, Audrey Lorberfeld - 
audrey.lorberf...@ibm.com  wrote:
> 
> Aita,
> 
> Thanks for that insight! 
> 
> As the conversation has progressed, we are now leaning towards not having 
the ASCII-folding filter in our pipelines in order to keep marks like umlauts 
and tildas. Instead, we might add acute and grave accents to a file pointed at 
by the MappingCharFilterFactory to simply strip those more common accent 
marks...
> 
> Any other opinions are welcome!
> 
> -- 
> Audrey Lorberfeld
> Data Scientist, w3 Search
> Digital Workplace Engineering
> CIO, Finance and Operations
> IBM
> audrey.lorberf...@ibm.com
> 
> 
> On 8/30/19, 10:27 AM, "Atita Arora"  wrote:
> 
>We work on german index, we neutralize accents before index i.e. 
umlauts to
>'ae', 'ue'.. Etc and similar what we do at the query time too for an
>appropriate match.
> 
>On Fri, Aug 30, 2019, 4:22 PM Audrey Lorberfeld - 
audrey.lorberf...@ibm.com
> wrote:
> 
>> Hi All,
>> 
>> Just wanting to test the waters here – for those of you with search
>> engines that index multiple languages, do you use ASCII-folding in your
>> schema? We are onboarding Spanish documents into our index right now and
>> keep going back and forth on whether we should preserve accent marks. 
From
>> our query logs, it seems people generally do not include accents when
>> searching, but you never know…
>> 
>> Thank you in advance for sharing your experiences!
>> 
>> --
>> Audrey Lorberfeld
>> Data Scientist, w3 Search
>> Digital Workplace Engineering
>> CIO, Finance and Operations
>> IBM
>> audrey.lorberf...@ibm.com
>> 
>> 
> 
> 





Re: Re: Multi-lingual Search & Accent Marks

2019-08-30 Thread Erick Erickson
It Depends (tm). In this case on how sophisticated/precise your users are. If 
your users are exclusively extremely conversant in the language and are 
expected to have keyboards that allow easy access to all the accents… then I 
might leave them in. In some cases removing them can change the meaning of a 
word.

That said, most installations I’ve seen remove them. They’re still present in 
any returned stored field so the doc looks good. And then you bypass all the 
nonsense about perhaps ingesting a doc that “somehow” had accents removed 
and/or people not putting accents in their search and the like.

MappingCFF works..

> On Aug 30, 2019, at 1:54 PM, Audrey Lorberfeld - audrey.lorberf...@ibm.com 
>  wrote:
> 
> Aita,
> 
> Thanks for that insight! 
> 
> As the conversation has progressed, we are now leaning towards not having the 
> ASCII-folding filter in our pipelines in order to keep marks like umlauts and 
> tildas. Instead, we might add acute and grave accents to a file pointed at by 
> the MappingCharFilterFactory to simply strip those more common accent marks...
> 
> Any other opinions are welcome!
> 
> -- 
> Audrey Lorberfeld
> Data Scientist, w3 Search
> Digital Workplace Engineering
> CIO, Finance and Operations
> IBM
> audrey.lorberf...@ibm.com
> 
> 
> On 8/30/19, 10:27 AM, "Atita Arora"  wrote:
> 
>We work on german index, we neutralize accents before index i.e. umlauts to
>'ae', 'ue'.. Etc and similar what we do at the query time too for an
>appropriate match.
> 
>On Fri, Aug 30, 2019, 4:22 PM Audrey Lorberfeld - audrey.lorberf...@ibm.com
> wrote:
> 
>> Hi All,
>> 
>> Just wanting to test the waters here – for those of you with search
>> engines that index multiple languages, do you use ASCII-folding in your
>> schema? We are onboarding Spanish documents into our index right now and
>> keep going back and forth on whether we should preserve accent marks. From
>> our query logs, it seems people generally do not include accents when
>> searching, but you never know…
>> 
>> Thank you in advance for sharing your experiences!
>> 
>> --
>> Audrey Lorberfeld
>> Data Scientist, w3 Search
>> Digital Workplace Engineering
>> CIO, Finance and Operations
>> IBM
>> audrey.lorberf...@ibm.com
>> 
>> 
> 
> 



Re: Re: Multi-lingual Search & Accent Marks

2019-08-30 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Aita,

Thanks for that insight! 

As the conversation has progressed, we are now leaning towards not having the 
ASCII-folding filter in our pipelines in order to keep marks like umlauts and 
tildas. Instead, we might add acute and grave accents to a file pointed at by 
the MappingCharFilterFactory to simply strip those more common accent marks...

Any other opinions are welcome!

-- 
Audrey Lorberfeld
Data Scientist, w3 Search
Digital Workplace Engineering
CIO, Finance and Operations
IBM
audrey.lorberf...@ibm.com
 

On 8/30/19, 10:27 AM, "Atita Arora"  wrote:

We work on german index, we neutralize accents before index i.e. umlauts to
'ae', 'ue'.. Etc and similar what we do at the query time too for an
appropriate match.

On Fri, Aug 30, 2019, 4:22 PM Audrey Lorberfeld - audrey.lorberf...@ibm.com
 wrote:

> Hi All,
>
> Just wanting to test the waters here – for those of you with search
> engines that index multiple languages, do you use ASCII-folding in your
> schema? We are onboarding Spanish documents into our index right now and
> keep going back and forth on whether we should preserve accent marks. From
> our query logs, it seems people generally do not include accents when
> searching, but you never know…
>
> Thank you in advance for sharing your experiences!
>
> --
> Audrey Lorberfeld
> Data Scientist, w3 Search
> Digital Workplace Engineering
> CIO, Finance and Operations
> IBM
> audrey.lorberf...@ibm.com
>
>




Re: Re: Multi-language Spellcheck

2019-08-29 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Thanks, everyone!
-- 
Audrey Lorberfeld
Data Scientist, w3 Search
Digital Workplace Engineering
CIO, Finance and Operations
IBM
audrey.lorberf...@ibm.com
 

On 8/29/19, 11:28 AM, "Atita Arora"  wrote:

I would agree with the suggestion, I remember something similar presented
by someone at Berlin Buzzwords 19.

On Thu, Aug 29, 2019, 5:03 PM Jörn Franke  wrote:

> It could be sensible to have one spellchecker / language (as different
> endpoint or as a queryparameter at runtime). Alternatively, depending on
> your use case you could get away with a generic fieldtype that does not do
> anything language specific, but I doubt.
>
> > Am 29.08.2019 um 16:20 schrieb Audrey Lorberfeld -
> audrey.lorberf...@ibm.com :
> >
> > Hi All,
> >
> > We are starting up an internal search engine that has to work for many
> different languages. We are starting with a POC of Spanish and English
> documents, and we are using the DirectSolrSpellChecker.
> >
> > From reading others' threads online, I know that we have to have
> multiple spellcheckers to do this (1 for each language). However, would
> someone be able to clarify what should go in the "queryAnalyzerFieldType"
> tag? It seems that the tag can only take a single field. So, does that 
mean
> that I have to have a copy field that collates all tokens from all
> languages? Image of code attached for reference & sample code of
> English-only spellchecker below:
> >
> > 
> >
> >   ???  
> >
> >
> >default
> >minimal_en
> >solr.DirectSolrSpellChecker -->
> >internal
> >0.5
> >2
> >1
> >5
> >4
> >0.05
> >
> > ...
> >
> > Thank you!
> >
> > --
> > Audrey Lorberfeld
> > Data Scientist, w3 Search
> > Digital Workplace Engineering
> > CIO, Finance and Operations
> > IBM
> > audrey.lorberf...@ibm.com
> >
> >
> > On 8/29/19, 10:12 AM, "Joe Obernberger" 
> wrote:
> >
> >Thank you Erick.  I'm upgrading from 7.6.0 and as far as I can tell
> the
> >schema and configuration (solrconfig.xml) isn't different (apart from
> >the version).  Right now, I'm at a loss.  I still have the 7.6.0
> cluster
> >running and the query works OK there.
> >
> >Sure seems like I'm missing a field called 'features', but it's not
> >defined in the prior schema either.  Thanks again!
> >
> >-Joe
> >
> >>On 8/28/2019 6:19 PM, Erick Erickson wrote:
> >> What it says ;)
> >>
> >> My guess is that your configuration mentions the field “features” in,
> perhaps carrot.snippet or carrot.title.
> >>
> >> But it’s a guess.
> >>
> >> Best,
> >> Erick
> >>
> >>> On Aug 28, 2019, at 5:18 PM, Joe Obernberger <
> joseph.obernber...@gmail.com> wrote:
> >>>
> >>> Hi All - trying to use clustering with SolrCloud 8.2, but getting this
> error:
> >>>
> >>> "msg":"Error from server at null: org.apache.solr.search.SyntaxError:
> Query Field 'features' is not a valid field name",
> >>>
> >>> The URL, I'm using is:
> >>>
> 
https://urldefense.proofpoint.com/v2/url?u=http-3A__solrServer-3A9100_solr_DOCS_select-3Fq-3D-2A-253A-2A-26qt-3D_clustering-26clustering-3Dtrue-26clustering.collection-3Dtrue=DwIDaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=O_wgAdeSZrC8W73ggxLnVdbVDMeiJ2jnRnzz9zriMWE=Xv6mGAm4OoATTBbEz5m-J0bRyPaUXaVpvWT_f74PIJ4=
>  <
> 
https://urldefense.proofpoint.com/v2/url?u=http-3A__cronus-3A9100_solr_UNCLASS-5F2018-5F5-5F19-5F184_select-3Fq-3D-2A-253A-2A-26qt-3D_clustering-26clustering-3Dtrue-26clustering.collection-3Dtrue=DwIDaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=O_wgAdeSZrC8W73ggxLnVdbVDMeiJ2jnRnzz9zriMWE=Erwr9WXMf9Vk16cIkTMlhUQrEzKfHYinrWrM40fF1KQ=
> >
> >>>
> >>> Thanks for any ideas!
> >>>
> >>> Complete response:
> >>> {
> >>>  "responseHeader":{
> >>>"zkConnected":true,
> >>>"status":400,
> >>>"QTime":38,
> >>>"params":{
> >>>  "q":"*:*",
> >>>  "qt":"/clustering",
> >>>  "clustering":"true",
> >>>  "clustering.collection":"true"}},
> >>>  "error":{
> >>>"metadata":[
> >>>  "error-class","org.apache.solr.common.SolrException",
> >>>  "root-error-class","org.apache.solr.common.SolrException",
> >>>
> 
"error-class","org.apache.solr.client.solrj.impl.BaseHttpSolrClient$RemoteSolrException",
> >>>
> 
"root-error-class","org.apache.solr.client.solrj.impl.BaseHttpSolrClient$RemoteSolrException"],
> >>>"msg":"Error from server at null:
> 

Re: Re: Solr edismax parser with multi-word synonyms

2019-07-18 Thread Sunil Srinivasan
Hi Erick, 
Is there anyway I can get it to match documents containing at least one of the 
words of the original query? i.e. 'frozen' or 'dinner' or both. (But not 
partial matches of the synonyms)
Thanks,Sunil


-Original Message-
From: Erick Erickson 
To: solr-user 
Sent: Thu, Jul 18, 2019 04:42 AM
Subject: Re: Solr edismax parser with multi-word synonyms


This is not a phrase query, rather it’s requiring either pair of words
to appear in the title.

You’ve told it that “frozen dinner” and “microwave foods” are synonyms. 
So it’s looking for both the words “microwave” and “foods” in the title field, 
or “frozen” and “dinner” in the title field.

You’d see the same thing with single-word synonyms, albeit a little less
confusingly.


Best,
Erick


> On Jul 18, 2019, at 1:01 AM, kshitij tyagi  
> wrote:
> 
> Hi sunil,
> 
> 1. as you have added "microwave food" in synonym as a multiword synonym to
> "frozen dinner", edismax parsers finds your synonym in the file and is
> considering your query as a Phrase query.
> 
> This is the reason you are seeing parsed query as  +(((+title:microwave
> +title:food) (+title:frozen +title:dinner))), frozen dinner is considered
> as a phrase here.
> 
> If you want partial match on your query then you can add frozen dinner,
> microwave food, microwave, food to your synonym file and you will see the
> parsed query as:
> "+(((+title:microwave +title:food) title:miccrowave title:food
> (+title:frozen +title:dinner)))"
> Another option is to write your own custom query parser and use it as a
> plugin.
> 
> Hope this helps!!
> 
> kshitij
> 
> 
> On Thu, Jul 18, 2019 at 9:14 AM Sunil Srinivasan  wrote:
> 
>> 
>> I have enabled the SynonymGraphFilter in my field configuration in order
>> to support multi-word synonyms (I am using Solr 7.6). Here is my field
>> configuration:
>> 
>>    
>>      
>>    
>> 
>>    
>>      
>>      > synonyms="synonyms.txt"/>
>>    
>> 
>> 
>> 
>> 
>> And this is my synonyms.txt file:
>> frozen dinner,microwave food
>> 
>> Scenario 1: blue shirt (query with no synonyms)
>> 
>> Here is my first Solr query:
>> 
>> http://localhost:8983/solr/base/search?q=blue+shirt=title=edismax=on
>> 
>> And this is the parsed query I see in the debug output:
>> +((title:blue) (title:shirt))
>> 
>> Scenario 2: frozen dinner (query with synonyms)
>> 
>> Now, here is my second Solr query:
>> 
>> http://localhost:8983/solr/base/search?q=frozen+dinner=title=edismax=on
>> 
>> And this is the parsed query I see in the debug output:
>> +(((+title:microwave +title:food) (+title:frozen +title:dinner)))
>> 
>> I am wondering why the first query looks for documents containing at least
>> one of the two query tokens, whereas the second query looks for documents
>> with both of the query tokens? I would understand if it looked for both the
>> tokens of the synonyms (i.e. both microwave and food) to avoid the
>> sausagization problem. But I would like to get partial matches on the
>> original query at least (i.e. it should also match documents containing
>> just the token 'dinner').
>> 
>> Would any one know why the behavior is different across queries with and
>> without synonyms? And how could I work around this if I wanted partial
>> matches on queries that also have synonyms?
>> 
>> Ideally, I would like the parsed query in the second case to be:
>> +(((+title:microwave +title:food) (title:frozen title:dinner)))
>> 
>> I'd appreciate any help with this. Thanks!
>> 


Re: Re: Query takes a long time Solr 6.1.0

2019-06-07 Thread David Hastings
There isnt anything wrong aside from your query is poorly thought out.

On Fri, Jun 7, 2019 at 11:04 AM vishal patel 
wrote:

> Any one is looking my issue??
>
> Get Outlook for Android
>
> 
> From: vishal patel
> Sent: Thursday, June 6, 2019 5:15:15 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Query takes a long time Solr 6.1.0
>
> Thanks for your reply.
>
> > How much index data is on one server with 256GB of memory?  What is the
> > max heap size on the Solr instance?  Is there only one Solr instance?
>
> One server(256GB RAM) has two below Solr instance and other application
> also
> 1) shards1 (80GB heap ,790GB Storage, 449GB Indexed data)
> 2) replica of shard2 (80GB heap, 895GB Storage, 337GB Indexed data)
>
> The second server(256GB RAM and 1 TB storage) has two below Solr instance
> and other application also
> 1) shards2 (80GB heap, 790GB Storage, 338GB Indexed data)
> 2) replica of shard1 (80GB heap, 895GB Storage, 448GB Indexed data)
>
> Both server memory and disk usage:
> https://drive.google.com/drive/folders/11GoZy8C0i-qUGH-ranPD8PCoPWCxeS-5
>
> Note: Average 40GB heap used normally in each Solr instance. when replica
> gets down at that time disk IO are high and also GC pause time above 15
> seconds. We can not identify the exact issue of replica recovery OR down
> from logs. due to the GC pause? OR due to disk IO high? OR due to
> time-consuming query? OR due to heavy indexing?
>
> Regards,
> Vishal
> 
> From: Shawn Heisey 
> Sent: Wednesday, June 5, 2019 7:10 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Query takes a long time Solr 6.1.0
>
> On 6/5/2019 7:08 AM, vishal patel wrote:
> > I have attached RAR file but not attached properly. Again attached txt
> file.
> >
> > For 2 shards and 2 replicas, we have 2 servers and each has 256 GB ram
> > and 1 TB storage. One shard and another shard replica in one server.
>
> You got lucky.  Even text files usually don't make it to the list --
> yours did this time.  Use a file sharing website in the future.
>
> That is a massive query.  The primary reason that Lucene defaults to a
> maxBooleanClauses value of 1024, which you are definitely exceeding
> here, is that queries with that many clauses tend to be slow and consume
> massive levels of resources.  It might not be possible to improve the
> query speed very much here if you cannot reduce the size of the query.
>
> Your query doesn't look like it is simple enough to replace with the
> terms query parser, which has better performance than a boolean query
> with thousands of "OR" clauses.
>
> How much index data is on one server with 256GB of memory?  What is the
> max heap size on the Solr instance?  Is there only one Solr instance?
>
> The screenshot mentioned here will most likely relay all the info I am
> looking for.  Be sure the sort is correct:
>
>
> https://wiki.apache.org/solr/SolrPerformanceProblems#Asking_for_help_on_a_memory.2Fperformance_issue
>
> You will not be able to successfully attach the screenshot to a message.
>   That will require a file sharing website.
>
> Thanks,
> Shawn
>


Re : Re: Solr 6.6 and OpenJDK11

2019-04-05 Thread e_briere
There is a lack of consensus about Java 11 support. We have been recommended to 
stick to Java 8 even on Solr 7.X. Is the page bellow the 'official' position?

Eric.

Le 05/04/19 03:23, Jan Høydahl   a écrit : 
> 
> Solr7 is the first Solr version that has been proved to work with JDK9+
> So you better stick with Java8. Solr 7/8 will work with JDK11, and Solr 9 
> will likely require it.
> Much more details to be found here: 
> https://wiki.apache.org/solr/SolrJavaVersions
> 
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> 
> > 5. apr. 2019 kl. 05:46 skrev solrnoobie :
> > 
> > So we are having some production issues with solr 6.6 with OpenJDK 11. There
> > are a lot of heap errors (ours was set to 10gig on a 16 gig instance) and we
> > never encountered this until we upgraded from Oracle JDK 8 to OpenJDK 11.
> > 
> > So is it advisable to keep it at openjdk 11 or should we downgrade to
> > OpenJDK 8?
> > 
> > 
> > 
> > --
> > Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
> 
> 



Re: Re: solr _route_ key now working

2019-03-27 Thread Jay Potharaju
I was reading the debug info incorrectly it is working as expected
...thanks for the help.
Thanks
Jay Potharaju



On Tue, Mar 26, 2019 at 10:58 PM Jay Potharaju 
wrote:

> Edwin, I tried escaping the special characters but it does not seems to
> work. I am using 7.7
> Thanks Jeremy for the example.
> id:123:456!789
> I do see that the data for the same key is co-located in the same shard by
> running. I can see that all the data is co-located in the same shard when
> querying the shard.
> fq=fieldB:456=shard1.
>
> Any suggestions why that would not be working when using _route_ to query
> the documents.
>
> Thanks
> Jay Potharaju
>
>
>
> On Tue, Mar 26, 2019 at 5:58 AM Branham, Jeremy (Experis) <
> jb...@allstate.com> wrote:
>
>> Jay –
>> I’m not familiar with the document ID format you mention [having a “:” in
>> the prefix], but it looks similar to the composite ID routing I’m using.
>> Document Id format: “a/1!id”
>>
>> Then I can use a _route_ value of “a/1!” when querying.
>>
>> Example Doc IDs:
>> a/1!768456
>> a/1!563575
>> b/1!456234
>> b/1!245698
>>
>> The document ID prefix “x/1!” tells Solr to spread the documents over ½
>> of the available shards. When querying with the same value for _route_ it
>> will retrieve documents only from those shards.
>>
>> Jeremy Branham
>> jb...@allstate.com
>>
>> On 3/25/19, 9:13 PM, "Zheng Lin Edwin Yeo"  wrote:
>>
>> Hi,
>>
>> Sorry, didn't see that you have an exclamation mark in your query as
>> well.
>> You will need to escape the exclamation mark as well.
>> So you can try it with the query _route_=“123\:456\!”
>>
>> You can refer to the message in the link on which special characters
>> requires escaping.
>>
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__stackoverflow.com_questions_21914956_which-2Dspecial-2Dcharacters-2Dneed-2Descaping-2Din-2Da-2Dsolr-2Dquery=DwIFaQ=gtIjdLs6LnStUpy9cTOW9w=0SwsmPELGv6GC1_5JSQ9T7ZPMLljrIkbF_2jBCrKXI0=81cWucTr4zf8Cn2FliZ2fYFfqIb_g605mWVAxLxuQCc=30JCckpa6ctmrBupqeGhxJ7pPIcicy7VcIoeTEw_vpQ=
>>
>> By the way, which Solr version are you using?
>>
>> Regards,
>> Edwin
>>
>> On Tue, 26 Mar 2019 at 01:12, Jay Potharaju 
>> wrote:
>>
>> > That did not work . Any other suggestions
>> > My id is 123:456!678
>> > Tried running query as _route_=“123\:456!” But didn’t give expected
>> > results
>> > Thanks
>> > Jay
>> >
>> > > On Mar 24, 2019, at 8:30 PM, Zheng Lin Edwin Yeo <
>> edwinye...@gmail.com>
>> > wrote:
>> > >
>> > > Hi,
>> > >
>> > > The character ":" is a special character, so it requires escaping
>> during
>> > > the search.
>> > > You can try to search with query _route_="a\:b!".
>> > >
>> > > Regards,
>> > > Edwin
>> > >
>> > >> On Mon, 25 Mar 2019 at 07:59, Jay Potharaju <
>> jspothar...@gmail.com>
>> > wrote:
>> > >>
>> > >> Hi,
>> > >> My document id has a format of a:b!c, when I query
>> _route_="a:b!" it
>> > does
>> > >> not return any values. Any suggestions?
>> > >>
>> > >> Thanks
>> > >> Jay Potharaju
>> > >>
>> >
>>
>>
>>


Re: Re: solr _route_ key now working

2019-03-26 Thread Jay Potharaju
Edwin, I tried escaping the special characters but it does not seems to
work. I am using 7.7
Thanks Jeremy for the example.
id:123:456!789
I do see that the data for the same key is co-located in the same shard by
running. I can see that all the data is co-located in the same shard when
querying the shard.
fq=fieldB:456=shard1.

Any suggestions why that would not be working when using _route_ to query
the documents.

Thanks
Jay Potharaju



On Tue, Mar 26, 2019 at 5:58 AM Branham, Jeremy (Experis) <
jb...@allstate.com> wrote:

> Jay –
> I’m not familiar with the document ID format you mention [having a “:” in
> the prefix], but it looks similar to the composite ID routing I’m using.
> Document Id format: “a/1!id”
>
> Then I can use a _route_ value of “a/1!” when querying.
>
> Example Doc IDs:
> a/1!768456
> a/1!563575
> b/1!456234
> b/1!245698
>
> The document ID prefix “x/1!” tells Solr to spread the documents over ½ of
> the available shards. When querying with the same value for _route_ it will
> retrieve documents only from those shards.
>
> Jeremy Branham
> jb...@allstate.com
>
> On 3/25/19, 9:13 PM, "Zheng Lin Edwin Yeo"  wrote:
>
> Hi,
>
> Sorry, didn't see that you have an exclamation mark in your query as
> well.
> You will need to escape the exclamation mark as well.
> So you can try it with the query _route_=“123\:456\!”
>
> You can refer to the message in the link on which special characters
> requires escaping.
>
> https://urldefense.proofpoint.com/v2/url?u=https-3A__stackoverflow.com_questions_21914956_which-2Dspecial-2Dcharacters-2Dneed-2Descaping-2Din-2Da-2Dsolr-2Dquery=DwIFaQ=gtIjdLs6LnStUpy9cTOW9w=0SwsmPELGv6GC1_5JSQ9T7ZPMLljrIkbF_2jBCrKXI0=81cWucTr4zf8Cn2FliZ2fYFfqIb_g605mWVAxLxuQCc=30JCckpa6ctmrBupqeGhxJ7pPIcicy7VcIoeTEw_vpQ=
>
> By the way, which Solr version are you using?
>
> Regards,
> Edwin
>
> On Tue, 26 Mar 2019 at 01:12, Jay Potharaju 
> wrote:
>
> > That did not work . Any other suggestions
> > My id is 123:456!678
> > Tried running query as _route_=“123\:456!” But didn’t give expected
> > results
> > Thanks
> > Jay
> >
> > > On Mar 24, 2019, at 8:30 PM, Zheng Lin Edwin Yeo <
> edwinye...@gmail.com>
> > wrote:
> > >
> > > Hi,
> > >
> > > The character ":" is a special character, so it requires escaping
> during
> > > the search.
> > > You can try to search with query _route_="a\:b!".
> > >
> > > Regards,
> > > Edwin
> > >
> > >> On Mon, 25 Mar 2019 at 07:59, Jay Potharaju <
> jspothar...@gmail.com>
> > wrote:
> > >>
> > >> Hi,
> > >> My document id has a format of a:b!c, when I query _route_="a:b!"
> it
> > does
> > >> not return any values. Any suggestions?
> > >>
> > >> Thanks
> > >> Jay Potharaju
> > >>
> >
>
>
>


Re: Re: solr _route_ key now working

2019-03-26 Thread Branham, Jeremy (Experis)
Jay –
I’m not familiar with the document ID format you mention [having a “:” in the 
prefix], but it looks similar to the composite ID routing I’m using.
Document Id format: “a/1!id”

Then I can use a _route_ value of “a/1!” when querying.

Example Doc IDs:
a/1!768456
a/1!563575
b/1!456234
b/1!245698

The document ID prefix “x/1!” tells Solr to spread the documents over ½ of the 
available shards. When querying with the same value for _route_ it will 
retrieve documents only from those shards.
 
Jeremy Branham
jb...@allstate.com

On 3/25/19, 9:13 PM, "Zheng Lin Edwin Yeo"  wrote:

Hi,

Sorry, didn't see that you have an exclamation mark in your query as well.
You will need to escape the exclamation mark as well.
So you can try it with the query _route_=“123\:456\!”

You can refer to the message in the link on which special characters
requires escaping.

https://urldefense.proofpoint.com/v2/url?u=https-3A__stackoverflow.com_questions_21914956_which-2Dspecial-2Dcharacters-2Dneed-2Descaping-2Din-2Da-2Dsolr-2Dquery=DwIFaQ=gtIjdLs6LnStUpy9cTOW9w=0SwsmPELGv6GC1_5JSQ9T7ZPMLljrIkbF_2jBCrKXI0=81cWucTr4zf8Cn2FliZ2fYFfqIb_g605mWVAxLxuQCc=30JCckpa6ctmrBupqeGhxJ7pPIcicy7VcIoeTEw_vpQ=

By the way, which Solr version are you using?

Regards,
Edwin

On Tue, 26 Mar 2019 at 01:12, Jay Potharaju  wrote:

> That did not work . Any other suggestions
> My id is 123:456!678
> Tried running query as _route_=“123\:456!” But didn’t give expected
> results
> Thanks
> Jay
>
> > On Mar 24, 2019, at 8:30 PM, Zheng Lin Edwin Yeo 
> wrote:
> >
> > Hi,
> >
> > The character ":" is a special character, so it requires escaping during
> > the search.
> > You can try to search with query _route_="a\:b!".
> >
> > Regards,
> > Edwin
> >
> >> On Mon, 25 Mar 2019 at 07:59, Jay Potharaju 
> wrote:
> >>
> >> Hi,
> >> My document id has a format of a:b!c, when I query _route_="a:b!" it
> does
> >> not return any values. Any suggestions?
> >>
> >> Thanks
> >> Jay Potharaju
> >>
>




Re: Re: Re: obfuscated password error

2019-03-20 Thread Branham, Jeremy (Experis)
Hard to see in email, particularly because my email server strips urls, but a 
few thinigs I would suggest –

Be sure there aren’t any spaces after your line continuation characters ‘\’. 
This has bit me before.
Check the running processes JVM args and compare `ps –ef | grep solr`
Also, I’d recommend changes be made only in the solr.in.sh, and leave 
‘./bin/solr’ original.

 
Jeremy Branham
jb...@allstate.com


On 3/20/19, 10:24 AM, "Satya Marivada"  wrote:

Sending again, with highlighted text in yellow.

So I got a chance to do a diff of the environments solr-6.3.0 folder within
contents.

solr-6.3.0/bin/solr file has the difference highlighted in yellow. Any idea
of what is going on in that if else in solr file?

*The working configuration file contents are 
(https://urldefense.proofpoint.com/v2/url?u=http-3A__ssl.properties=DwIFaQ=gtIjdLs6LnStUpy9cTOW9w=0SwsmPELGv6GC1_5JSQ9T7ZPMLljrIkbF_2jBCrKXI0=nIFuSrMfKCWUmJGtJXgZ_y91GZw9SK5EBljlXsjJgMk=2Rbg_Jc8K1tqOJBPdQt4lsSC0Y3rbEdiug2q577ZoLU=
 below has the
keystore path and password repeated):*

SOLR_SSL_OPTS=""

if [ -n "$SOLR_SSL_KEY_STORE" ]; then

  SOLR_JETTY_CONFIG+=("--module=https")

  SOLR_URL_SCHEME=https

  SOLR_SSL_OPTS=" -Dsolr.jetty.keystore=$SOLR_SSL_KEY_STORE \

-Dsolr.jetty.keystore.password=$SOLR_SSL_KEY_STORE_PASSWORD \

-Dsolr.jetty.truststore=$SOLR_SSL_TRUST_STORE \

-Dsolr.jetty.truststore.password=$SOLR_SSL_TRUST_STORE_PASSWORD \

-Dsolr.jetty.ssl.needClientAuth=$SOLR_SSL_NEED_CLIENT_AUTH \

-Dsolr.jetty.ssl.wantClientAuth=$SOLR_SSL_WANT_CLIENT_AUTH"

  if [ -n "$SOLR_SSL_CLIENT_KEY_STORE" ]; then

SOLR_SSL_OPTS+=" -Djavax.net.ssl.keyStore=$SOLR_SSL_CLIENT_KEY_STORE \

  -Djavax.net.ssl.keyStorePassword=$SOLR_SSL_CLIENT_KEY_STORE_PASSWORD \

  -Djavax.net.ssl.trustStore=$SOLR_SSL_CLIENT_TRUST_STORE \


-Djavax.net.ssl.trustStorePassword=$SOLR_SSL_CLIENT_TRUST_STORE_PASSWORD"
 else
SOLR_SSL_OPTS+="

-Dcom.sun.management.jmxremote.ssl.config.file=/sanfs/mnt/vol01/solr/solr-6.3.0/server/etc/https://urldefense.proofpoint.com/v2/url?u=http-3A__ssl.properties=DwIFaQ=gtIjdLs6LnStUpy9cTOW9w=0SwsmPELGv6GC1_5JSQ9T7ZPMLljrIkbF_2jBCrKXI0=nIFuSrMfKCWUmJGtJXgZ_y91GZw9SK5EBljlXsjJgMk=2Rbg_Jc8K1tqOJBPdQt4lsSC0Y3rbEdiug2q577ZoLU=;
  fi

else

  SOLR_JETTY_CONFIG+=("--module=http")

Fi


*Not working one (basically overriding again and is causing the incorrect
password):*



SOLR_SSL_OPTS=""

if [ -n "$SOLR_SSL_KEY_STORE" ]; then

  SOLR_JETTY_CONFIG+=("--module=https")

  SOLR_URL_SCHEME=https

  SOLR_SSL_OPTS=" -Dsolr.jetty.keystore=$SOLR_SSL_KEY_STORE \

-Dsolr.jetty.keystore.password=$SOLR_SSL_KEY_STORE_PASSWORD \

-Dsolr.jetty.truststore=$SOLR_SSL_TRUST_STORE \

-Dsolr.jetty.truststore.password=$SOLR_SSL_TRUST_STORE_PASSWORD \

-Dsolr.jetty.ssl.needClientAuth=$SOLR_SSL_NEED_CLIENT_AUTH \

-Dsolr.jetty.ssl.wantClientAuth=$SOLR_SSL_WANT_CLIENT_AUTH"

  if [ -n "$SOLR_SSL_CLIENT_KEY_STORE" ]; then

SOLR_SSL_OPTS+=" -Djavax.net.ssl.keyStore=$SOLR_SSL_CLIENT_KEY_STORE \

  -Djavax.net.ssl.keyStorePassword=$SOLR_SSL_CLIENT_KEY_STORE_PASSWORD \

  -Djavax.net.ssl.trustStore=$SOLR_SSL_CLIENT_TRUST_STORE \


-Djavax.net.ssl.trustStorePassword=$SOLR_SSL_CLIENT_TRUST_STORE_PASSWORD"

  else

SOLR_SSL_OPTS+=" -Djavax.net.ssl.keyStore=$SOLR_SSL_KEY_STORE \

  -Djavax.net.ssl.keyStorePassword=$SOLR_SSL_KEY_STORE_PASSWORD \

  -Djavax.net.ssl.trustStore=$SOLR_SSL_TRUST_STORE \

  -Djavax.net.ssl.trustStorePassword=$SOLR_SSL_TRUST_STORE_PASSWORD"

  fi



On Wed, Mar 20, 2019 at 10:45 AM Satya Marivada 
wrote:

> So I got a chance to do a diff of the environments solr-6.3.0 folder
> within contents.
>
> solr-6.3.0/bin/solr file has the difference highlighted in yellow. Any
> idea of what is going on in that if else in solr file?
>
> *The working configuration file contents are 
(https://urldefense.proofpoint.com/v2/url?u=http-3A__ssl.properties=DwIFaQ=gtIjdLs6LnStUpy9cTOW9w=0SwsmPELGv6GC1_5JSQ9T7ZPMLljrIkbF_2jBCrKXI0=nIFuSrMfKCWUmJGtJXgZ_y91GZw9SK5EBljlXsjJgMk=2Rbg_Jc8K1tqOJBPdQt4lsSC0Y3rbEdiug2q577ZoLU=
 below has the
> keystore path and password repeated):*
>
> SOLR_SSL_OPTS=""
>
> if [ -n "$SOLR_SSL_KEY_STORE" ]; then
>
>   SOLR_JETTY_CONFIG+=("--module=https")
>
>   SOLR_URL_SCHEME=https
>
>   SOLR_SSL_OPTS=" -Dsolr.jetty.keystore=$SOLR_SSL_KEY_STORE \
>
> -Dsolr.jetty.keystore.password=$SOLR_SSL_KEY_STORE_PASSWORD \
>
> -Dsolr.jetty.truststore=$SOLR_SSL_TRUST_STORE \
 

Re: Re: obfuscated password error

2019-03-20 Thread Satya Marivada
Sending again, with highlighted text in yellow.

So I got a chance to do a diff of the environments solr-6.3.0 folder within
contents.

solr-6.3.0/bin/solr file has the difference highlighted in yellow. Any idea
of what is going on in that if else in solr file?

*The working configuration file contents are (ssl.properties below has the
keystore path and password repeated):*

SOLR_SSL_OPTS=""

if [ -n "$SOLR_SSL_KEY_STORE" ]; then

  SOLR_JETTY_CONFIG+=("--module=https")

  SOLR_URL_SCHEME=https

  SOLR_SSL_OPTS=" -Dsolr.jetty.keystore=$SOLR_SSL_KEY_STORE \

-Dsolr.jetty.keystore.password=$SOLR_SSL_KEY_STORE_PASSWORD \

-Dsolr.jetty.truststore=$SOLR_SSL_TRUST_STORE \

-Dsolr.jetty.truststore.password=$SOLR_SSL_TRUST_STORE_PASSWORD \

-Dsolr.jetty.ssl.needClientAuth=$SOLR_SSL_NEED_CLIENT_AUTH \

-Dsolr.jetty.ssl.wantClientAuth=$SOLR_SSL_WANT_CLIENT_AUTH"

  if [ -n "$SOLR_SSL_CLIENT_KEY_STORE" ]; then

SOLR_SSL_OPTS+=" -Djavax.net.ssl.keyStore=$SOLR_SSL_CLIENT_KEY_STORE \

  -Djavax.net.ssl.keyStorePassword=$SOLR_SSL_CLIENT_KEY_STORE_PASSWORD \

  -Djavax.net.ssl.trustStore=$SOLR_SSL_CLIENT_TRUST_STORE \


-Djavax.net.ssl.trustStorePassword=$SOLR_SSL_CLIENT_TRUST_STORE_PASSWORD"
 else
SOLR_SSL_OPTS+="
-Dcom.sun.management.jmxremote.ssl.config.file=/sanfs/mnt/vol01/solr/solr-6.3.0/server/etc/ssl.properties"
  fi

else

  SOLR_JETTY_CONFIG+=("--module=http")

Fi


*Not working one (basically overriding again and is causing the incorrect
password):*



SOLR_SSL_OPTS=""

if [ -n "$SOLR_SSL_KEY_STORE" ]; then

  SOLR_JETTY_CONFIG+=("--module=https")

  SOLR_URL_SCHEME=https

  SOLR_SSL_OPTS=" -Dsolr.jetty.keystore=$SOLR_SSL_KEY_STORE \

-Dsolr.jetty.keystore.password=$SOLR_SSL_KEY_STORE_PASSWORD \

-Dsolr.jetty.truststore=$SOLR_SSL_TRUST_STORE \

-Dsolr.jetty.truststore.password=$SOLR_SSL_TRUST_STORE_PASSWORD \

-Dsolr.jetty.ssl.needClientAuth=$SOLR_SSL_NEED_CLIENT_AUTH \

-Dsolr.jetty.ssl.wantClientAuth=$SOLR_SSL_WANT_CLIENT_AUTH"

  if [ -n "$SOLR_SSL_CLIENT_KEY_STORE" ]; then

SOLR_SSL_OPTS+=" -Djavax.net.ssl.keyStore=$SOLR_SSL_CLIENT_KEY_STORE \

  -Djavax.net.ssl.keyStorePassword=$SOLR_SSL_CLIENT_KEY_STORE_PASSWORD \

  -Djavax.net.ssl.trustStore=$SOLR_SSL_CLIENT_TRUST_STORE \


-Djavax.net.ssl.trustStorePassword=$SOLR_SSL_CLIENT_TRUST_STORE_PASSWORD"

  else

SOLR_SSL_OPTS+=" -Djavax.net.ssl.keyStore=$SOLR_SSL_KEY_STORE \

  -Djavax.net.ssl.keyStorePassword=$SOLR_SSL_KEY_STORE_PASSWORD \

  -Djavax.net.ssl.trustStore=$SOLR_SSL_TRUST_STORE \

  -Djavax.net.ssl.trustStorePassword=$SOLR_SSL_TRUST_STORE_PASSWORD"

  fi



On Wed, Mar 20, 2019 at 10:45 AM Satya Marivada 
wrote:

> So I got a chance to do a diff of the environments solr-6.3.0 folder
> within contents.
>
> solr-6.3.0/bin/solr file has the difference highlighted in yellow. Any
> idea of what is going on in that if else in solr file?
>
> *The working configuration file contents are (ssl.properties below has the
> keystore path and password repeated):*
>
> SOLR_SSL_OPTS=""
>
> if [ -n "$SOLR_SSL_KEY_STORE" ]; then
>
>   SOLR_JETTY_CONFIG+=("--module=https")
>
>   SOLR_URL_SCHEME=https
>
>   SOLR_SSL_OPTS=" -Dsolr.jetty.keystore=$SOLR_SSL_KEY_STORE \
>
> -Dsolr.jetty.keystore.password=$SOLR_SSL_KEY_STORE_PASSWORD \
>
> -Dsolr.jetty.truststore=$SOLR_SSL_TRUST_STORE \
>
> -Dsolr.jetty.truststore.password=$SOLR_SSL_TRUST_STORE_PASSWORD \
>
> -Dsolr.jetty.ssl.needClientAuth=$SOLR_SSL_NEED_CLIENT_AUTH \
>
> -Dsolr.jetty.ssl.wantClientAuth=$SOLR_SSL_WANT_CLIENT_AUTH"
>
>   if [ -n "$SOLR_SSL_CLIENT_KEY_STORE" ]; then
>
> SOLR_SSL_OPTS+=" -Djavax.net.ssl.keyStore=$SOLR_SSL_CLIENT_KEY_STORE \
>
>   -Djavax.net.ssl.keyStorePassword=$SOLR_SSL_CLIENT_KEY_STORE_PASSWORD
> \
>
>   -Djavax.net.ssl.trustStore=$SOLR_SSL_CLIENT_TRUST_STORE \
>
>
> -Djavax.net.ssl.trustStorePassword=$SOLR_SSL_CLIENT_TRUST_STORE_PASSWORD"
>
>   else
>
> SOLR_SSL_OPTS+="
> -Dcom.sun.management.jmxremote.ssl.config.file=/sanfs/mnt/vol01/solr/solr-6.3.0/server/etc/ssl.properties"
>
>   fi
>
> else
>
>   SOLR_JETTY_CONFIG+=("--module=http")
>
> Fi
>
>
> *Not working one (basically overriding again and is causing the incorrect
> password):*
>
>
>
> SOLR_SSL_OPTS=""
>
> if [ -n "$SOLR_SSL_KEY_STORE" ]; then
>
>   SOLR_JETTY_CONFIG+=("--module=https")
>
>   SOLR_URL_SCHEME=https
>
>   SOLR_SSL_OPTS=" -Dsolr.jetty.keystore=$SOLR_SSL_KEY_STORE \
>
> -Dsolr.jetty.keystore.password=$SOLR_SSL_KEY_STORE_PASSWORD \
>
> -Dsolr.jetty.truststore=$SOLR_SSL_TRUST_STORE \
>
> -Dsolr.jetty.truststore.password=$SOLR_SSL_TRUST_STORE_PASSWORD \
>
> -Dsolr.jetty.ssl.needClientAuth=$SOLR_SSL_NEED_CLIENT_AUTH \
>
> -Dsolr.jetty.ssl.wantClientAuth=$SOLR_SSL_WANT_CLIENT_AUTH"
>
>   if [ -n "$SOLR_SSL_CLIENT_KEY_STORE" ]; then
>
> SOLR_SSL_OPTS+=" -Djavax.net.ssl.keyStore=$SOLR_SSL_CLIENT_KEY_STORE \
>
>   

Re: Re: obfuscated password error

2019-03-20 Thread Satya Marivada
So I got a chance to do a diff of the environments solr-6.3.0 folder within
contents.

solr-6.3.0/bin/solr file has the difference highlighted in yellow. Any idea
of what is going on in that if else in solr file?

*The working configuration file contents are (ssl.properties below has the
keystore path and password repeated):*

SOLR_SSL_OPTS=""

if [ -n "$SOLR_SSL_KEY_STORE" ]; then

  SOLR_JETTY_CONFIG+=("--module=https")

  SOLR_URL_SCHEME=https

  SOLR_SSL_OPTS=" -Dsolr.jetty.keystore=$SOLR_SSL_KEY_STORE \

-Dsolr.jetty.keystore.password=$SOLR_SSL_KEY_STORE_PASSWORD \

-Dsolr.jetty.truststore=$SOLR_SSL_TRUST_STORE \

-Dsolr.jetty.truststore.password=$SOLR_SSL_TRUST_STORE_PASSWORD \

-Dsolr.jetty.ssl.needClientAuth=$SOLR_SSL_NEED_CLIENT_AUTH \

-Dsolr.jetty.ssl.wantClientAuth=$SOLR_SSL_WANT_CLIENT_AUTH"

  if [ -n "$SOLR_SSL_CLIENT_KEY_STORE" ]; then

SOLR_SSL_OPTS+=" -Djavax.net.ssl.keyStore=$SOLR_SSL_CLIENT_KEY_STORE \

  -Djavax.net.ssl.keyStorePassword=$SOLR_SSL_CLIENT_KEY_STORE_PASSWORD \

  -Djavax.net.ssl.trustStore=$SOLR_SSL_CLIENT_TRUST_STORE \


-Djavax.net.ssl.trustStorePassword=$SOLR_SSL_CLIENT_TRUST_STORE_PASSWORD"

  else

SOLR_SSL_OPTS+="
-Dcom.sun.management.jmxremote.ssl.config.file=/sanfs/mnt/vol01/solr/solr-6.3.0/server/etc/ssl.properties"

  fi

else

  SOLR_JETTY_CONFIG+=("--module=http")

Fi


*Not working one (basically overriding again and is causing the incorrect
password):*



SOLR_SSL_OPTS=""

if [ -n "$SOLR_SSL_KEY_STORE" ]; then

  SOLR_JETTY_CONFIG+=("--module=https")

  SOLR_URL_SCHEME=https

  SOLR_SSL_OPTS=" -Dsolr.jetty.keystore=$SOLR_SSL_KEY_STORE \

-Dsolr.jetty.keystore.password=$SOLR_SSL_KEY_STORE_PASSWORD \

-Dsolr.jetty.truststore=$SOLR_SSL_TRUST_STORE \

-Dsolr.jetty.truststore.password=$SOLR_SSL_TRUST_STORE_PASSWORD \

-Dsolr.jetty.ssl.needClientAuth=$SOLR_SSL_NEED_CLIENT_AUTH \

-Dsolr.jetty.ssl.wantClientAuth=$SOLR_SSL_WANT_CLIENT_AUTH"

  if [ -n "$SOLR_SSL_CLIENT_KEY_STORE" ]; then

SOLR_SSL_OPTS+=" -Djavax.net.ssl.keyStore=$SOLR_SSL_CLIENT_KEY_STORE \

  -Djavax.net.ssl.keyStorePassword=$SOLR_SSL_CLIENT_KEY_STORE_PASSWORD \

  -Djavax.net.ssl.trustStore=$SOLR_SSL_CLIENT_TRUST_STORE \


-Djavax.net.ssl.trustStorePassword=$SOLR_SSL_CLIENT_TRUST_STORE_PASSWORD"

  else

SOLR_SSL_OPTS+=" -Djavax.net.ssl.keyStore=$SOLR_SSL_KEY_STORE \

  -Djavax.net.ssl.keyStorePassword=$SOLR_SSL_KEY_STORE_PASSWORD \

  -Djavax.net.ssl.trustStore=$SOLR_SSL_TRUST_STORE \

  -Djavax.net.ssl.trustStorePassword=$SOLR_SSL_TRUST_STORE_PASSWORD"

  fi

On Tue, Mar 19, 2019 at 10:10 AM Satya Marivada 
wrote:

> Hi Jeremy,
>
> Thanks for the points. Yes, agreed that there is some conflicting property
> somewhere that is not letting it work. So I basically restored solr-6.3.0
> directory from another environment and replace the host name appropriately
> for this environment. And I used the original keystore that has been
> generated for this environment and it worked fine. So basically the
> keystore is good as well except that there is some conflicting property
> which is not letting it do deobfuscation right.
>
> Thanks,
> Satya
>
> On Mon, Mar 18, 2019 at 2:32 PM Branham, Jeremy (Experis) <
> jb...@allstate.com> wrote:
>
>> I’m not sure if you are sharing the trust/keystores, so I may be off-base
>> here…
>>
>> Some thoughts –
>> - Verify your VM arguments, to be sure there aren’t conflicting SSL
>> properties.
>> - Verify the environment is targeting the correct version of Java
>> - Verify the trust/key stores exist where they are expected, and you can
>> list the contents with the keytool
>> - Verify the correct CA certs are trusted
>>
>>
>> Jeremy Branham
>> jb...@allstate.com
>>
>> On 3/18/19, 1:08 PM, "Satya Marivada"  wrote:
>>
>> Any suggestions please.
>>
>> Thanks,
>> Satya
>>
>> On Mon, Mar 18, 2019 at 11:12 AM Satya Marivada <
>> satya.chaita...@gmail.com>
>> wrote:
>>
>> > Hi All,
>> >
>> > Using solr-6.3.0, to obfuscate the password, have used jetty util to
>> > generate obfuscated password
>> >
>> >
>> > java -cp jetty-util-9.3.8.v20160314.jar
>> > org.eclipse.jetty.util.security.Password mypassword
>> >
>> >
>> > The output has been used in
>> https://urldefense.proofpoint.com/v2/url?u=http-3A__solr.in.sh=DwIBaQ=gtIjdLs6LnStUpy9cTOW9w=0SwsmPELGv6GC1_5JSQ9T7ZPMLljrIkbF_2jBCrKXI0=Ix7ZcyM45ms93i2fWx4SNPgiLA7TGHVDOjCklcxbvLs=YtmCJK2U90u6mqx-FOmBS5nqy03luM2J-Zc_LhImnG0=
>> as below
>> >
>> >
>> >
>> >
>> SOLR_SSL_KEY_STORE=/sanfs/mnt/vol01/solr/solr-6.3.0/server/etc/solr-ssl.keystore.jks
>> >
>> >
>> SOLR_SSL_KEY_STORE_PASSWORD="OBF:1bcd1l161lts1ltu1uum1uvk1lq41lq61k221b9t"
>> >
>> >
>> >
>> SOLR_SSL_TRUST_STORE=/sanfs/mnt/vol01/solr/solr-6.3.0/server/etc/solr-ssl.keystore.jks
>> >
>> >
>> >
>> SOLR_SSL_TRUST_STORE_PASSWORD="OBF:1bcd1l161lts1ltu1uum1uvk1lq41lq61k221b9t"
>> >
>>

Re: Re: obfuscated password error

2019-03-19 Thread Satya Marivada
Hi Jeremy,

Thanks for the points. Yes, agreed that there is some conflicting property
somewhere that is not letting it work. So I basically restored solr-6.3.0
directory from another environment and replace the host name appropriately
for this environment. And I used the original keystore that has been
generated for this environment and it worked fine. So basically the
keystore is good as well except that there is some conflicting property
which is not letting it do deobfuscation right.

Thanks,
Satya

On Mon, Mar 18, 2019 at 2:32 PM Branham, Jeremy (Experis) <
jb...@allstate.com> wrote:

> I’m not sure if you are sharing the trust/keystores, so I may be off-base
> here…
>
> Some thoughts –
> - Verify your VM arguments, to be sure there aren’t conflicting SSL
> properties.
> - Verify the environment is targeting the correct version of Java
> - Verify the trust/key stores exist where they are expected, and you can
> list the contents with the keytool
> - Verify the correct CA certs are trusted
>
>
> Jeremy Branham
> jb...@allstate.com
>
> On 3/18/19, 1:08 PM, "Satya Marivada"  wrote:
>
> Any suggestions please.
>
> Thanks,
> Satya
>
> On Mon, Mar 18, 2019 at 11:12 AM Satya Marivada <
> satya.chaita...@gmail.com>
> wrote:
>
> > Hi All,
> >
> > Using solr-6.3.0, to obfuscate the password, have used jetty util to
> > generate obfuscated password
> >
> >
> > java -cp jetty-util-9.3.8.v20160314.jar
> > org.eclipse.jetty.util.security.Password mypassword
> >
> >
> > The output has been used in
> https://urldefense.proofpoint.com/v2/url?u=http-3A__solr.in.sh=DwIBaQ=gtIjdLs6LnStUpy9cTOW9w=0SwsmPELGv6GC1_5JSQ9T7ZPMLljrIkbF_2jBCrKXI0=Ix7ZcyM45ms93i2fWx4SNPgiLA7TGHVDOjCklcxbvLs=YtmCJK2U90u6mqx-FOmBS5nqy03luM2J-Zc_LhImnG0=
> as below
> >
> >
> >
> >
> SOLR_SSL_KEY_STORE=/sanfs/mnt/vol01/solr/solr-6.3.0/server/etc/solr-ssl.keystore.jks
> >
> >
> SOLR_SSL_KEY_STORE_PASSWORD="OBF:1bcd1l161lts1ltu1uum1uvk1lq41lq61k221b9t"
> >
> >
> >
> SOLR_SSL_TRUST_STORE=/sanfs/mnt/vol01/solr/solr-6.3.0/server/etc/solr-ssl.keystore.jks
> >
> >
> >
> SOLR_SSL_TRUST_STORE_PASSWORD="OBF:1bcd1l161lts1ltu1uum1uvk1lq41lq61k221b9t"
> >
> > Solr does not start fine with below exception, any suggestions? If I
> use
> > the plain text password, it works fine. One more thing is that the
> same
> > setup with obfuscated password works in other environments except
> one which
> > got this exception. Recently system level patches are applied, just
> saying
> > though dont think that could have impact,
> >
> > Caused by: java.net.SocketException:
> > java.security.NoSuchAlgorithmException: Error constructing
> implementation
> > (algorithm: Default, provider: SunJSSE, class:
> sun.security.ssl.SSLContextIm
> > pl$DefaultSSLContext)
> > at
> > javax.net.ssl.DefaultSSLSocketFactory.throwException(
> https://urldefense.proofpoint.com/v2/url?u=http-3A__SSLSocketFactory.java=DwIBaQ=gtIjdLs6LnStUpy9cTOW9w=0SwsmPELGv6GC1_5JSQ9T7ZPMLljrIkbF_2jBCrKXI0=Ix7ZcyM45ms93i2fWx4SNPgiLA7TGHVDOjCklcxbvLs=dud5QRNkwTMDiH04sCjNs1U9_5t8wBMxJNiyQRdjXRk=:248
> )
> > at
> > javax.net.ssl.DefaultSSLSocketFactory.createSocket(
> https://urldefense.proofpoint.com/v2/url?u=http-3A__SSLSocketFactory.java=DwIBaQ=gtIjdLs6LnStUpy9cTOW9w=0SwsmPELGv6GC1_5JSQ9T7ZPMLljrIkbF_2jBCrKXI0=Ix7ZcyM45ms93i2fWx4SNPgiLA7TGHVDOjCklcxbvLs=dud5QRNkwTMDiH04sCjNs1U9_5t8wBMxJNiyQRdjXRk=:255
> )
> > at
> > org.apache.http.conn.ssl.SSLSocketFactory.createSocket(
> https://urldefense.proofpoint.com/v2/url?u=http-3A__SSLSocketFactory.java=DwIBaQ=gtIjdLs6LnStUpy9cTOW9w=0SwsmPELGv6GC1_5JSQ9T7ZPMLljrIkbF_2jBCrKXI0=Ix7ZcyM45ms93i2fWx4SNPgiLA7TGHVDOjCklcxbvLs=dud5QRNkwTMDiH04sCjNs1U9_5t8wBMxJNiyQRdjXRk=:513
> )
> > at
> > org.apache.http.conn.ssl.SSLSocketFactory.createSocket(
> https://urldefense.proofpoint.com/v2/url?u=http-3A__SSLSocketFactory.java=DwIBaQ=gtIjdLs6LnStUpy9cTOW9w=0SwsmPELGv6GC1_5JSQ9T7ZPMLljrIkbF_2jBCrKXI0=Ix7ZcyM45ms93i2fWx4SNPgiLA7TGHVDOjCklcxbvLs=dud5QRNkwTMDiH04sCjNs1U9_5t8wBMxJNiyQRdjXRk=:383
> )
> > at
> >
> org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(
> https://urldefense.proofpoint.com/v2/url?u=http-3A__DefaultClientConnectionOperator.java=DwIBaQ=gtIjdLs6LnStUpy9cTOW9w=0SwsmPELGv6GC1_5JSQ9T7ZPMLljrIkbF_2jBCrKXI0=Ix7ZcyM45ms93i2fWx4SNPgiLA7TGHVDOjCklcxbvLs=EATR9hBi7P9kYpCcJ8maLn81bHA72GhhvwWQY0V9EQw=:165
> )
> > at
> > org.apache.http.impl.conn.ManagedClientConnectionImpl.open(
> https://urldefense.proofpoint.com/v2/url?u=http-3A__ManagedClientConnectionImpl.java=DwIBaQ=gtIjdLs6LnStUpy9cTOW9w=0SwsmPELGv6GC1_5JSQ9T7ZPMLljrIkbF_2jBCrKXI0=Ix7ZcyM45ms93i2fWx4SNPgiLA7TGHVDOjCklcxbvLs=yuCHQjzNKMtl0uWKiDWB01ChPkiY1tCaPX8n8lhdR-s=:304
> )
> > at
> > 

Re: Re: obfuscated password error

2019-03-19 Thread Satya Marivada
It has been generated with plain password. Same in other environments too,
but it works in other environments.

Thanks,
Satya

On Mon, Mar 18, 2019, 10:42 PM Zheng Lin Edwin Yeo 
wrote:

> Hi,
>
> Did you generate your keystore with the obfuscated password or the plain
> text password?
>
> Regards,
> Edwin
>
> On Tue, 19 Mar 2019 at 02:32, Branham, Jeremy (Experis) <
> jb...@allstate.com>
> wrote:
>
> > I’m not sure if you are sharing the trust/keystores, so I may be off-base
> > here…
> >
> > Some thoughts –
> > - Verify your VM arguments, to be sure there aren’t conflicting SSL
> > properties.
> > - Verify the environment is targeting the correct version of Java
> > - Verify the trust/key stores exist where they are expected, and you can
> > list the contents with the keytool
> > - Verify the correct CA certs are trusted
> >
> >
> > Jeremy Branham
> > jb...@allstate.com
> >
> > On 3/18/19, 1:08 PM, "Satya Marivada"  wrote:
> >
> > Any suggestions please.
> >
> > Thanks,
> > Satya
> >
> > On Mon, Mar 18, 2019 at 11:12 AM Satya Marivada <
> > satya.chaita...@gmail.com>
> > wrote:
> >
> > > Hi All,
> > >
> > > Using solr-6.3.0, to obfuscate the password, have used jetty util
> to
> > > generate obfuscated password
> > >
> > >
> > > java -cp jetty-util-9.3.8.v20160314.jar
> > > org.eclipse.jetty.util.security.Password mypassword
> > >
> > >
> > > The output has been used in
> >
> https://urldefense.proofpoint.com/v2/url?u=http-3A__solr.in.sh=DwIBaQ=gtIjdLs6LnStUpy9cTOW9w=0SwsmPELGv6GC1_5JSQ9T7ZPMLljrIkbF_2jBCrKXI0=Ix7ZcyM45ms93i2fWx4SNPgiLA7TGHVDOjCklcxbvLs=YtmCJK2U90u6mqx-FOmBS5nqy03luM2J-Zc_LhImnG0=
> > as below
> > >
> > >
> > >
> > >
> >
> SOLR_SSL_KEY_STORE=/sanfs/mnt/vol01/solr/solr-6.3.0/server/etc/solr-ssl.keystore.jks
> > >
> > >
> >
> SOLR_SSL_KEY_STORE_PASSWORD="OBF:1bcd1l161lts1ltu1uum1uvk1lq41lq61k221b9t"
> > >
> > >
> > >
> >
> SOLR_SSL_TRUST_STORE=/sanfs/mnt/vol01/solr/solr-6.3.0/server/etc/solr-ssl.keystore.jks
> > >
> > >
> > >
> >
> SOLR_SSL_TRUST_STORE_PASSWORD="OBF:1bcd1l161lts1ltu1uum1uvk1lq41lq61k221b9t"
> > >
> > > Solr does not start fine with below exception, any suggestions? If
> I
> > use
> > > the plain text password, it works fine. One more thing is that the
> > same
> > > setup with obfuscated password works in other environments except
> > one which
> > > got this exception. Recently system level patches are applied, just
> > saying
> > > though dont think that could have impact,
> > >
> > > Caused by: java.net.SocketException:
> > > java.security.NoSuchAlgorithmException: Error constructing
> > implementation
> > > (algorithm: Default, provider: SunJSSE, class:
> > sun.security.ssl.SSLContextIm
> > > pl$DefaultSSLContext)
> > > at
> > > javax.net.ssl.DefaultSSLSocketFactory.throwException(
> >
> https://urldefense.proofpoint.com/v2/url?u=http-3A__SSLSocketFactory.java=DwIBaQ=gtIjdLs6LnStUpy9cTOW9w=0SwsmPELGv6GC1_5JSQ9T7ZPMLljrIkbF_2jBCrKXI0=Ix7ZcyM45ms93i2fWx4SNPgiLA7TGHVDOjCklcxbvLs=dud5QRNkwTMDiH04sCjNs1U9_5t8wBMxJNiyQRdjXRk=:248
> > )
> > > at
> > > javax.net.ssl.DefaultSSLSocketFactory.createSocket(
> >
> https://urldefense.proofpoint.com/v2/url?u=http-3A__SSLSocketFactory.java=DwIBaQ=gtIjdLs6LnStUpy9cTOW9w=0SwsmPELGv6GC1_5JSQ9T7ZPMLljrIkbF_2jBCrKXI0=Ix7ZcyM45ms93i2fWx4SNPgiLA7TGHVDOjCklcxbvLs=dud5QRNkwTMDiH04sCjNs1U9_5t8wBMxJNiyQRdjXRk=:255
> > )
> > > at
> > > org.apache.http.conn.ssl.SSLSocketFactory.createSocket(
> >
> https://urldefense.proofpoint.com/v2/url?u=http-3A__SSLSocketFactory.java=DwIBaQ=gtIjdLs6LnStUpy9cTOW9w=0SwsmPELGv6GC1_5JSQ9T7ZPMLljrIkbF_2jBCrKXI0=Ix7ZcyM45ms93i2fWx4SNPgiLA7TGHVDOjCklcxbvLs=dud5QRNkwTMDiH04sCjNs1U9_5t8wBMxJNiyQRdjXRk=:513
> > )
> > > at
> > > org.apache.http.conn.ssl.SSLSocketFactory.createSocket(
> >
> https://urldefense.proofpoint.com/v2/url?u=http-3A__SSLSocketFactory.java=DwIBaQ=gtIjdLs6LnStUpy9cTOW9w=0SwsmPELGv6GC1_5JSQ9T7ZPMLljrIkbF_2jBCrKXI0=Ix7ZcyM45ms93i2fWx4SNPgiLA7TGHVDOjCklcxbvLs=dud5QRNkwTMDiH04sCjNs1U9_5t8wBMxJNiyQRdjXRk=:383
> > )
> > > at
> > >
> > org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(
> >
> https://urldefense.proofpoint.com/v2/url?u=http-3A__DefaultClientConnectionOperator.java=DwIBaQ=gtIjdLs6LnStUpy9cTOW9w=0SwsmPELGv6GC1_5JSQ9T7ZPMLljrIkbF_2jBCrKXI0=Ix7ZcyM45ms93i2fWx4SNPgiLA7TGHVDOjCklcxbvLs=EATR9hBi7P9kYpCcJ8maLn81bHA72GhhvwWQY0V9EQw=:165
> > )
> > > at
> > > org.apache.http.impl.conn.ManagedClientConnectionImpl.open(
> >
> https://urldefense.proofpoint.com/v2/url?u=http-3A__ManagedClientConnectionImpl.java=DwIBaQ=gtIjdLs6LnStUpy9cTOW9w=0SwsmPELGv6GC1_5JSQ9T7ZPMLljrIkbF_2jBCrKXI0=Ix7ZcyM45ms93i2fWx4SNPgiLA7TGHVDOjCklcxbvLs=yuCHQjzNKMtl0uWKiDWB01ChPkiY1tCaPX8n8lhdR-s=:304
> > )
> > > at
> > > 

Re: Re: obfuscated password error

2019-03-18 Thread Zheng Lin Edwin Yeo
Hi,

Did you generate your keystore with the obfuscated password or the plain
text password?

Regards,
Edwin

On Tue, 19 Mar 2019 at 02:32, Branham, Jeremy (Experis) 
wrote:

> I’m not sure if you are sharing the trust/keystores, so I may be off-base
> here…
>
> Some thoughts –
> - Verify your VM arguments, to be sure there aren’t conflicting SSL
> properties.
> - Verify the environment is targeting the correct version of Java
> - Verify the trust/key stores exist where they are expected, and you can
> list the contents with the keytool
> - Verify the correct CA certs are trusted
>
>
> Jeremy Branham
> jb...@allstate.com
>
> On 3/18/19, 1:08 PM, "Satya Marivada"  wrote:
>
> Any suggestions please.
>
> Thanks,
> Satya
>
> On Mon, Mar 18, 2019 at 11:12 AM Satya Marivada <
> satya.chaita...@gmail.com>
> wrote:
>
> > Hi All,
> >
> > Using solr-6.3.0, to obfuscate the password, have used jetty util to
> > generate obfuscated password
> >
> >
> > java -cp jetty-util-9.3.8.v20160314.jar
> > org.eclipse.jetty.util.security.Password mypassword
> >
> >
> > The output has been used in
> https://urldefense.proofpoint.com/v2/url?u=http-3A__solr.in.sh=DwIBaQ=gtIjdLs6LnStUpy9cTOW9w=0SwsmPELGv6GC1_5JSQ9T7ZPMLljrIkbF_2jBCrKXI0=Ix7ZcyM45ms93i2fWx4SNPgiLA7TGHVDOjCklcxbvLs=YtmCJK2U90u6mqx-FOmBS5nqy03luM2J-Zc_LhImnG0=
> as below
> >
> >
> >
> >
> SOLR_SSL_KEY_STORE=/sanfs/mnt/vol01/solr/solr-6.3.0/server/etc/solr-ssl.keystore.jks
> >
> >
> SOLR_SSL_KEY_STORE_PASSWORD="OBF:1bcd1l161lts1ltu1uum1uvk1lq41lq61k221b9t"
> >
> >
> >
> SOLR_SSL_TRUST_STORE=/sanfs/mnt/vol01/solr/solr-6.3.0/server/etc/solr-ssl.keystore.jks
> >
> >
> >
> SOLR_SSL_TRUST_STORE_PASSWORD="OBF:1bcd1l161lts1ltu1uum1uvk1lq41lq61k221b9t"
> >
> > Solr does not start fine with below exception, any suggestions? If I
> use
> > the plain text password, it works fine. One more thing is that the
> same
> > setup with obfuscated password works in other environments except
> one which
> > got this exception. Recently system level patches are applied, just
> saying
> > though dont think that could have impact,
> >
> > Caused by: java.net.SocketException:
> > java.security.NoSuchAlgorithmException: Error constructing
> implementation
> > (algorithm: Default, provider: SunJSSE, class:
> sun.security.ssl.SSLContextIm
> > pl$DefaultSSLContext)
> > at
> > javax.net.ssl.DefaultSSLSocketFactory.throwException(
> https://urldefense.proofpoint.com/v2/url?u=http-3A__SSLSocketFactory.java=DwIBaQ=gtIjdLs6LnStUpy9cTOW9w=0SwsmPELGv6GC1_5JSQ9T7ZPMLljrIkbF_2jBCrKXI0=Ix7ZcyM45ms93i2fWx4SNPgiLA7TGHVDOjCklcxbvLs=dud5QRNkwTMDiH04sCjNs1U9_5t8wBMxJNiyQRdjXRk=:248
> )
> > at
> > javax.net.ssl.DefaultSSLSocketFactory.createSocket(
> https://urldefense.proofpoint.com/v2/url?u=http-3A__SSLSocketFactory.java=DwIBaQ=gtIjdLs6LnStUpy9cTOW9w=0SwsmPELGv6GC1_5JSQ9T7ZPMLljrIkbF_2jBCrKXI0=Ix7ZcyM45ms93i2fWx4SNPgiLA7TGHVDOjCklcxbvLs=dud5QRNkwTMDiH04sCjNs1U9_5t8wBMxJNiyQRdjXRk=:255
> )
> > at
> > org.apache.http.conn.ssl.SSLSocketFactory.createSocket(
> https://urldefense.proofpoint.com/v2/url?u=http-3A__SSLSocketFactory.java=DwIBaQ=gtIjdLs6LnStUpy9cTOW9w=0SwsmPELGv6GC1_5JSQ9T7ZPMLljrIkbF_2jBCrKXI0=Ix7ZcyM45ms93i2fWx4SNPgiLA7TGHVDOjCklcxbvLs=dud5QRNkwTMDiH04sCjNs1U9_5t8wBMxJNiyQRdjXRk=:513
> )
> > at
> > org.apache.http.conn.ssl.SSLSocketFactory.createSocket(
> https://urldefense.proofpoint.com/v2/url?u=http-3A__SSLSocketFactory.java=DwIBaQ=gtIjdLs6LnStUpy9cTOW9w=0SwsmPELGv6GC1_5JSQ9T7ZPMLljrIkbF_2jBCrKXI0=Ix7ZcyM45ms93i2fWx4SNPgiLA7TGHVDOjCklcxbvLs=dud5QRNkwTMDiH04sCjNs1U9_5t8wBMxJNiyQRdjXRk=:383
> )
> > at
> >
> org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(
> https://urldefense.proofpoint.com/v2/url?u=http-3A__DefaultClientConnectionOperator.java=DwIBaQ=gtIjdLs6LnStUpy9cTOW9w=0SwsmPELGv6GC1_5JSQ9T7ZPMLljrIkbF_2jBCrKXI0=Ix7ZcyM45ms93i2fWx4SNPgiLA7TGHVDOjCklcxbvLs=EATR9hBi7P9kYpCcJ8maLn81bHA72GhhvwWQY0V9EQw=:165
> )
> > at
> > org.apache.http.impl.conn.ManagedClientConnectionImpl.open(
> https://urldefense.proofpoint.com/v2/url?u=http-3A__ManagedClientConnectionImpl.java=DwIBaQ=gtIjdLs6LnStUpy9cTOW9w=0SwsmPELGv6GC1_5JSQ9T7ZPMLljrIkbF_2jBCrKXI0=Ix7ZcyM45ms93i2fWx4SNPgiLA7TGHVDOjCklcxbvLs=yuCHQjzNKMtl0uWKiDWB01ChPkiY1tCaPX8n8lhdR-s=:304
> )
> > at
> > org.apache.http.impl.client.DefaultRequestDirector.tryConnect(
> https://urldefense.proofpoint.com/v2/url?u=http-3A__DefaultRequestDirector.java=DwIBaQ=gtIjdLs6LnStUpy9cTOW9w=0SwsmPELGv6GC1_5JSQ9T7ZPMLljrIkbF_2jBCrKXI0=Ix7ZcyM45ms93i2fWx4SNPgiLA7TGHVDOjCklcxbvLs=BuInFyYyCadGREvZzUoJMKX-9SWG7lzHzdO-A3x3rGA=:611
> )
> > at
> > org.apache.http.impl.client.DefaultRequestDirector.execute(
> 

Re: Re: obfuscated password error

2019-03-18 Thread Branham, Jeremy (Experis)
I’m not sure if you are sharing the trust/keystores, so I may be off-base here…

Some thoughts –
- Verify your VM arguments, to be sure there aren’t conflicting SSL properties.
- Verify the environment is targeting the correct version of Java
- Verify the trust/key stores exist where they are expected, and you can list 
the contents with the keytool
- Verify the correct CA certs are trusted

 
Jeremy Branham
jb...@allstate.com

On 3/18/19, 1:08 PM, "Satya Marivada"  wrote:

Any suggestions please.

Thanks,
Satya

On Mon, Mar 18, 2019 at 11:12 AM Satya Marivada 
wrote:

> Hi All,
>
> Using solr-6.3.0, to obfuscate the password, have used jetty util to
> generate obfuscated password
>
>
> java -cp jetty-util-9.3.8.v20160314.jar
> org.eclipse.jetty.util.security.Password mypassword
>
>
> The output has been used in 
https://urldefense.proofpoint.com/v2/url?u=http-3A__solr.in.sh=DwIBaQ=gtIjdLs6LnStUpy9cTOW9w=0SwsmPELGv6GC1_5JSQ9T7ZPMLljrIkbF_2jBCrKXI0=Ix7ZcyM45ms93i2fWx4SNPgiLA7TGHVDOjCklcxbvLs=YtmCJK2U90u6mqx-FOmBS5nqy03luM2J-Zc_LhImnG0=
 as below
>
>
>
> 
SOLR_SSL_KEY_STORE=/sanfs/mnt/vol01/solr/solr-6.3.0/server/etc/solr-ssl.keystore.jks
>
> SOLR_SSL_KEY_STORE_PASSWORD="OBF:1bcd1l161lts1ltu1uum1uvk1lq41lq61k221b9t"
>
>
> 
SOLR_SSL_TRUST_STORE=/sanfs/mnt/vol01/solr/solr-6.3.0/server/etc/solr-ssl.keystore.jks
>
>
> 
SOLR_SSL_TRUST_STORE_PASSWORD="OBF:1bcd1l161lts1ltu1uum1uvk1lq41lq61k221b9t"
>
> Solr does not start fine with below exception, any suggestions? If I use
> the plain text password, it works fine. One more thing is that the same
> setup with obfuscated password works in other environments except one 
which
> got this exception. Recently system level patches are applied, just saying
> though dont think that could have impact,
>
> Caused by: java.net.SocketException:
> java.security.NoSuchAlgorithmException: Error constructing implementation
> (algorithm: Default, provider: SunJSSE, class: 
sun.security.ssl.SSLContextIm
> pl$DefaultSSLContext)
> at
> 
javax.net.ssl.DefaultSSLSocketFactory.throwException(https://urldefense.proofpoint.com/v2/url?u=http-3A__SSLSocketFactory.java=DwIBaQ=gtIjdLs6LnStUpy9cTOW9w=0SwsmPELGv6GC1_5JSQ9T7ZPMLljrIkbF_2jBCrKXI0=Ix7ZcyM45ms93i2fWx4SNPgiLA7TGHVDOjCklcxbvLs=dud5QRNkwTMDiH04sCjNs1U9_5t8wBMxJNiyQRdjXRk=:248)
> at
> 
javax.net.ssl.DefaultSSLSocketFactory.createSocket(https://urldefense.proofpoint.com/v2/url?u=http-3A__SSLSocketFactory.java=DwIBaQ=gtIjdLs6LnStUpy9cTOW9w=0SwsmPELGv6GC1_5JSQ9T7ZPMLljrIkbF_2jBCrKXI0=Ix7ZcyM45ms93i2fWx4SNPgiLA7TGHVDOjCklcxbvLs=dud5QRNkwTMDiH04sCjNs1U9_5t8wBMxJNiyQRdjXRk=:255)
> at
> 
org.apache.http.conn.ssl.SSLSocketFactory.createSocket(https://urldefense.proofpoint.com/v2/url?u=http-3A__SSLSocketFactory.java=DwIBaQ=gtIjdLs6LnStUpy9cTOW9w=0SwsmPELGv6GC1_5JSQ9T7ZPMLljrIkbF_2jBCrKXI0=Ix7ZcyM45ms93i2fWx4SNPgiLA7TGHVDOjCklcxbvLs=dud5QRNkwTMDiH04sCjNs1U9_5t8wBMxJNiyQRdjXRk=:513)
> at
> 
org.apache.http.conn.ssl.SSLSocketFactory.createSocket(https://urldefense.proofpoint.com/v2/url?u=http-3A__SSLSocketFactory.java=DwIBaQ=gtIjdLs6LnStUpy9cTOW9w=0SwsmPELGv6GC1_5JSQ9T7ZPMLljrIkbF_2jBCrKXI0=Ix7ZcyM45ms93i2fWx4SNPgiLA7TGHVDOjCklcxbvLs=dud5QRNkwTMDiH04sCjNs1U9_5t8wBMxJNiyQRdjXRk=:383)
> at
> 
org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(https://urldefense.proofpoint.com/v2/url?u=http-3A__DefaultClientConnectionOperator.java=DwIBaQ=gtIjdLs6LnStUpy9cTOW9w=0SwsmPELGv6GC1_5JSQ9T7ZPMLljrIkbF_2jBCrKXI0=Ix7ZcyM45ms93i2fWx4SNPgiLA7TGHVDOjCklcxbvLs=EATR9hBi7P9kYpCcJ8maLn81bHA72GhhvwWQY0V9EQw=:165)
> at
> 
org.apache.http.impl.conn.ManagedClientConnectionImpl.open(https://urldefense.proofpoint.com/v2/url?u=http-3A__ManagedClientConnectionImpl.java=DwIBaQ=gtIjdLs6LnStUpy9cTOW9w=0SwsmPELGv6GC1_5JSQ9T7ZPMLljrIkbF_2jBCrKXI0=Ix7ZcyM45ms93i2fWx4SNPgiLA7TGHVDOjCklcxbvLs=yuCHQjzNKMtl0uWKiDWB01ChPkiY1tCaPX8n8lhdR-s=:304)
> at
> 
org.apache.http.impl.client.DefaultRequestDirector.tryConnect(https://urldefense.proofpoint.com/v2/url?u=http-3A__DefaultRequestDirector.java=DwIBaQ=gtIjdLs6LnStUpy9cTOW9w=0SwsmPELGv6GC1_5JSQ9T7ZPMLljrIkbF_2jBCrKXI0=Ix7ZcyM45ms93i2fWx4SNPgiLA7TGHVDOjCklcxbvLs=BuInFyYyCadGREvZzUoJMKX-9SWG7lzHzdO-A3x3rGA=:611)
> at
> 
org.apache.http.impl.client.DefaultRequestDirector.execute(https://urldefense.proofpoint.com/v2/url?u=http-3A__DefaultRequestDirector.java=DwIBaQ=gtIjdLs6LnStUpy9cTOW9w=0SwsmPELGv6GC1_5JSQ9T7ZPMLljrIkbF_2jBCrKXI0=Ix7ZcyM45ms93i2fWx4SNPgiLA7TGHVDOjCklcxbvLs=BuInFyYyCadGREvZzUoJMKX-9SWG7lzHzdO-A3x3rGA=:446)
> at
> 

Re: Re: Garbage Collection Metrics

2019-03-18 Thread Branham, Jeremy (Experis)
I get these metrics by pushing the JMX data into Graphite, then use the 
non-negative derivative function on the GC ‘time’ metric.
It essentially shows the amount of change on a counter, at the specific time it 
occurred. 
 
Jeremy Branham
jb...@allstate.com

On 3/18/19, 12:06 PM, "Jeff Courtade"  wrote:

The only way I found to track GC times was by truning on GC logging and the
writing cronjob data collection script and graphing it in zabbix

On Mon, Mar 18, 2019 at 12:34 PM Erick Erickson 
wrote:

> Attachments are pretty aggressively stripped by the apache mail server, so
> it didn’t come through.
>
> That said, I’m not sure how much use just the last GC time is. What do you
> want it for? This
> sounds a bit like an XY problem.
>
> Best,
> Erick
>
> > On Mar 17, 2019, at 2:43 PM, Karthik K G  wrote:
> >
> > Hi Team,
> >
> > I was looking for Old GC duration time metrics, but all I could find was
> the API for this "/solr/admin/metrics?wt=json=jvm=gc.G1-
> Old-Generation", but I am not sure if this is for
> 'gc_g1_gen_o_lastgc_duration'. I tried to hookup the IP to the jconsole 
and
> was looking for the metrics, but all I could see was the collection time
> but not last GC duration as attached in the screenshot. Can you please 
help
> here with finding the correct metrics. I strongly believe we are not
> capturing this information. Please correct me if I am wrong.
> >
> > Thanks & Regards,
> > Karthik
>
>




Re: Re: Authorization fails but api still renders

2019-03-15 Thread Branham, Jeremy (Experis)
// Adding the dev DL, as this may be a bug

Solr v7.7.0

I’m expecting the 401 on all the servers in all 3 clusters using the security 
configuration.
For example, when I access the core or collection APIs without authentication, 
it should return a 401.

On one of the servers, in one of the clusters, the authorization is completely 
ignored. The http response is 200 and the API returns results.
The other server in this cluster works properly, returning a 401 when the 
protected API is accessed without authentication.

Interesting notes –
- If I use the IP or FQDN to access the server, authorization works properly 
and a 401 is returned. It’s only when I use the short hostname to access the 
server, that the authorization is bypassed.
- On the broken server, a 401 is returned correctly when the ‘autoscaling 
suggestions’ api is accessed. This api uses a different resource path, which 
may be a clue to why the others fail.
  https://solr:8443/api/cluster/autoscaling/suggestions

Here is the security.json with sensitive data changed/removed –

{
"authentication":{
   "blockUnknown": false,
   "class":"solr.BasicAuthPlugin",
   "credentials":{
 "admin":"--REDACTED--",
 "reader":"--REDACTED--",
 "writer":"--REDACTED--"
   },
   "realm":"solr"
},
"authorization":{
   "class":"solr.RuleBasedAuthorizationPlugin",
   "permissions":[
 {"name":"security-edit", "role":"admin"},
 {"name":"security-read", "role":"admin"},
 {"name":"schema-edit", "role":"admin"},
 {"name":"config-edit", "role":"admin"},
 {"name":"core-admin-edit", "role":"admin"},
 {"name":"collection-admin-edit", "role":"admin"},
 {"name":"autoscaling-read", "role":"admin"},
 {"name":"autoscaling-write", "role":"admin"},
 {"name":"autoscaling-history-read", "role":"admin"},
 {"name":"read","role":"*"},
 {"name":"schema-read","role":"*"},
 {"name":"config-read","role":"*"},
 {"name":"collection-admin-read", "role":"*"},
 {"name":"core-admin-read","role":"*"},
 {"name":"update", "role":"write"},
 {"collection":null, "path":"/admin/info/system", "role":"admin"}
   ],
   "user-role":{
 "admin": "admin",
 "reader": "read",
 "writer": "write"
   }
}}


 
Jeremy Branham
jb...@allstate.com

On 3/14/19, 10:06 PM, "Zheng Lin Edwin Yeo"  wrote:

Hi,

Can't really catch your question. Are you facing the error 401 on all the
clusters or just one of them?

Also, which Solr version are you using?

Regards,
Edwin

On Fri, 15 Mar 2019 at 05:15, Branham, Jeremy (Experis) 
wrote:

> I’ve discovered the authorization works properly if I use the FQDN to
> access the Solr node, but the short hostname completely circumvents it.
> They are all internal server clusters, so I’m using self-signed
> certificates [the same exact certificate] on each. The SAN portion of the
> cert contains the IP, short, and FQDN of each server.
>
> I also diff’d the two servers Solr installation directories, and confirmed
> they are identical.
> They are using the same exact versions of Java and zookeeper, with the
> same chroot configuration. [different zk clusters]
>
>
> Jeremy Branham
> jb...@allstate.com
>
> On 3/14/19, 10:44 AM, "Branham, Jeremy (Experis)" 
> wrote:
>
> I’m using Basic Auth on 3 different clusters.
> On 2 of the clusters, authorization works fine. A 401 is returned when
> I try to access the core/collection apis.
>
> On the 3rd cluster I can see the authorization failed, but the api
> results are still returned.
>
> Solr.log
> 2019-03-14 09:25:47.680 INFO  (qtp1546693040-152) [   ]
> o.a.s.s.RuleBasedAuthorizationPlugin request has come without principal.
> failed permission {
>   "name":"core-admin-read",
>   "role":"*"}
>
>
> I’m using different zookeeper clusters for each solr cluster, but
> using the same security.json contents.
> I’ve tried refreshing the ZK node, and bringing the whole Solr cluster
> down and back up.
>
> Is there some sort of caching that could be happening?
>
> I wrote an installation script that I’ve used to setup each cluster,
> so I’m thinking I’ll wipe it out and re-run.
> But before I do this, I thought I’d ask the community for input. Maybe
> a bug?
>
>
> Jeremy Branham
> jb...@allstate.com
> Allstate Insurance Company | UCV Technology Services | Information
> Services Group
>
>
>
>




Antwort: Re: Re: High CPU usage with Solr 7.7.0

2019-03-01 Thread Lukas Weiss
)
org.eclipse.jetty.util.thread.QueuedThreadPool$2.run​(QueuedThreadPool.java:683)
java.lang.Thread.run​(Thread.java:748)
75.6621ms
20.ms

ShutdownMonitor (12)
java.net.PlainSocketImpl.socketAccept​(Native Method)
java.net.AbstractPlainSocketImpl.accept​(AbstractPlainSocketImpl.java:409)
java.net.ServerSocket.implAccept​(ServerSocket.java:545)
java.net.ServerSocket.accept​(ServerSocket.java:513)
org.eclipse.jetty.server.ShutdownMonitor$ShutdownMonitorRunnable.run​(ShutdownMonitor.java:335)
java.lang.Thread.run​(Thread.java:748)
0.3767ms
0.ms
Signal Dispatcher (5)
0.0362ms
0.ms

Finalizer (3)
java.lang.ref.ReferenceQueue$Lock@448b0df5

java.lang.Object.wait​(Native Method)
java.lang.ref.ReferenceQueue.remove​(ReferenceQueue.java:144)
java.lang.ref.ReferenceQueue.remove​(ReferenceQueue.java:165)
java.lang.ref.Finalizer$FinalizerThread.run​(Finalizer.java:216)
8.2488ms
0.ms

Reference Handler (2)
java.lang.ref.Reference$Lock@19ced464

java.lang.Object.wait​(Native Method)
java.lang.Object.wait​(Object.java:502)
java.lang.ref.Reference.tryHandlePending​(Reference.java:191)
java.lang.ref.Reference$ReferenceHandler.run​(Reference.java:153)



Von:"Tomás Fernández Löbbe" 
An: solr-user@lucene.apache.org, 
Datum:  27.02.2019 19:34
Betreff:    Re: Re: High CPU usage with Solr 7.7.0



Maybe a thread dump would be useful if you still have some instance 
running
on 7.7

On Wed, Feb 27, 2019 at 7:28 AM Lukas Weiss 
wrote:

> I can confirm this. Downgrading to 7.6.0 solved the issue.
> Thanks for the hint.
>
>
>
> Von:"Joe Obernberger" 
> An: solr-user@lucene.apache.org, "Lukas Weiss"
> ,
> Datum:  27.02.2019 15:59
> Betreff:Re: High CPU usage with Solr 7.7.0
>
>
>
> Just to add to this.  We upgraded to 7.7.0 and saw very large CPU usage
> on multi core boxes - sustained in the 1200% range.  We then switched to
> 7.6.0 (no other configuration changes) and the problem went away.
>
> We have a 40 node cluster and all 40 nodes had high CPU usage with 3
> indexes stored on HDFS.
>
> -Joe
>
> On 2/27/2019 5:04 AM, Lukas Weiss wrote:
> > Hello,
> >
> > we recently updated our Solr server from 6.6.5 to 7.7.0. Since then, 
we
> > have problems with the server's CPU usage.
> > We have two Solr cores configured, but even if we clear all indexes 
and
> do
> > not start the index process, we see 100 CPU usage for both cores.
> >
> > Here's what our top says:
> >
> > root@solr:~ # top
> > top - 09:25:24 up 17:40,  1 user,  load average: 2,28, 2,56, 2,68
> > Threads:  74 total,   3 running,  71 sleeping,   0 stopped,   0 zombie
> > %Cpu0  :100,0 us,  0,0 sy,  0,0 ni,  0,0 id,  0,0 wa,  0,0 hi,  0,0 
si,
> > 0,0 st
> > %Cpu1  :100,0 us,  0,0 sy,  0,0 ni,  0,0 id,  0,0 wa,  0,0 hi,  0,0 
si,
> > 0,0 st
> > %Cpu2  : 11,3 us,  1,0 sy,  0,0 ni, 86,7 id,  0,7 wa,  0,0 hi,  0,3 
si,
> > 0,0 st
> > %Cpu3  :  3,0 us,  3,0 sy,  0,0 ni, 93,7 id,  0,3 wa,  0,0 hi,  0,0 
si,
> > 0,0 st
> > KiB Mem :  8388608 total,  7859168 free,   496744 used,32696
> > buff/cache
> > KiB Swap:  2097152 total,  2097152 free,0 used.  7859168 avail
> Mem
> >
> >
> >PID USER  PR  NIVIRTRESSHR S %CPU %MEM TIME+
> COMMAND
> >P
> > 10209 solr  20   0 6138468 452520  25740 R 99,9  5,4  29:43.45 
java
> > -server -Xms1024m -Xmx1024m -XX:NewRatio=3 -XX:SurvivorRatio=4
> > -XX:TargetSurvivorRatio=90 -XX:MaxTenuringThreshold=8
> > -XX:+UseConcMarkSweepGC -XX:ConcGCThreads=4 + 24
> > 10214 solr  20   0 6138468 452520  25740 R 99,9  5,4  28:42.91 
java
> > -server -Xms1024m -Xmx1024m -XX:NewRatio=3 -XX:SurvivorRatio=4
> > -XX:TargetSurvivorRatio=90 -XX:MaxTenuringThreshold=8
> > -XX:+UseConcMarkSweepGC -XX:ConcGCThreads=4 + 25
> >
> > The solr server is installed on a Debian Stretch 9.8 (64bit) on Linux
> LXC
> > dedicated Container.
> >
> > Some more server info:
> >
> > root@solr:~ # java -version
> > openjdk version "1.8.0_181"
> > OpenJDK Runtime Environment (build 1.8.0_181-8u181-b13-2~deb9u1-b13)
> > OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)
> >
> > root@solr:~ # free -m
> >totalusedfree  shared  buff/cache
> > available
> > Mem:   8192 4847675 701  31 
7675
> > Swap:  2048   02048
> >
> > We also found something strange if we do an strace of the main 
process,
> we
> > get lots of ongoing connection timeouts:
> >
> > root@solr:~ # strace -F -p 4136
> > strace: Process 4136 attached with 48 threads
> > strace: [ Process PID=11089 ru

Re: Re: High CPU usage with Solr 7.7.0

2019-02-27 Thread Tomás Fernández Löbbe
Maybe a thread dump would be useful if you still have some instance running
on 7.7

On Wed, Feb 27, 2019 at 7:28 AM Lukas Weiss 
wrote:

> I can confirm this. Downgrading to 7.6.0 solved the issue.
> Thanks for the hint.
>
>
>
> Von:"Joe Obernberger" 
> An: solr-user@lucene.apache.org, "Lukas Weiss"
> ,
> Datum:  27.02.2019 15:59
> Betreff:Re: High CPU usage with Solr 7.7.0
>
>
>
> Just to add to this.  We upgraded to 7.7.0 and saw very large CPU usage
> on multi core boxes - sustained in the 1200% range.  We then switched to
> 7.6.0 (no other configuration changes) and the problem went away.
>
> We have a 40 node cluster and all 40 nodes had high CPU usage with 3
> indexes stored on HDFS.
>
> -Joe
>
> On 2/27/2019 5:04 AM, Lukas Weiss wrote:
> > Hello,
> >
> > we recently updated our Solr server from 6.6.5 to 7.7.0. Since then, we
> > have problems with the server's CPU usage.
> > We have two Solr cores configured, but even if we clear all indexes and
> do
> > not start the index process, we see 100 CPU usage for both cores.
> >
> > Here's what our top says:
> >
> > root@solr:~ # top
> > top - 09:25:24 up 17:40,  1 user,  load average: 2,28, 2,56, 2,68
> > Threads:  74 total,   3 running,  71 sleeping,   0 stopped,   0 zombie
> > %Cpu0  :100,0 us,  0,0 sy,  0,0 ni,  0,0 id,  0,0 wa,  0,0 hi,  0,0 si,
> > 0,0 st
> > %Cpu1  :100,0 us,  0,0 sy,  0,0 ni,  0,0 id,  0,0 wa,  0,0 hi,  0,0 si,
> > 0,0 st
> > %Cpu2  : 11,3 us,  1,0 sy,  0,0 ni, 86,7 id,  0,7 wa,  0,0 hi,  0,3 si,
> > 0,0 st
> > %Cpu3  :  3,0 us,  3,0 sy,  0,0 ni, 93,7 id,  0,3 wa,  0,0 hi,  0,0 si,
> > 0,0 st
> > KiB Mem :  8388608 total,  7859168 free,   496744 used,32696
> > buff/cache
> > KiB Swap:  2097152 total,  2097152 free,0 used.  7859168 avail
> Mem
> >
> >
> >PID USER  PR  NIVIRTRESSHR S %CPU %MEM TIME+
> COMMAND
> >P
> > 10209 solr  20   0 6138468 452520  25740 R 99,9  5,4  29:43.45 java
> > -server -Xms1024m -Xmx1024m -XX:NewRatio=3 -XX:SurvivorRatio=4
> > -XX:TargetSurvivorRatio=90 -XX:MaxTenuringThreshold=8
> > -XX:+UseConcMarkSweepGC -XX:ConcGCThreads=4 + 24
> > 10214 solr  20   0 6138468 452520  25740 R 99,9  5,4  28:42.91 java
> > -server -Xms1024m -Xmx1024m -XX:NewRatio=3 -XX:SurvivorRatio=4
> > -XX:TargetSurvivorRatio=90 -XX:MaxTenuringThreshold=8
> > -XX:+UseConcMarkSweepGC -XX:ConcGCThreads=4 + 25
> >
> > The solr server is installed on a Debian Stretch 9.8 (64bit) on Linux
> LXC
> > dedicated Container.
> >
> > Some more server info:
> >
> > root@solr:~ # java -version
> > openjdk version "1.8.0_181"
> > OpenJDK Runtime Environment (build 1.8.0_181-8u181-b13-2~deb9u1-b13)
> > OpenJDK 64-Bit Server VM (build 25.181-b13, mixed mode)
> >
> > root@solr:~ # free -m
> >totalusedfree  shared  buff/cache
> > available
> > Mem:   8192 4847675 701  31 7675
> > Swap:  2048   02048
> >
> > We also found something strange if we do an strace of the main process,
> we
> > get lots of ongoing connection timeouts:
> >
> > root@solr:~ # strace -F -p 4136
> > strace: Process 4136 attached with 48 threads
> > strace: [ Process PID=11089 runs in x32 mode. ]
> > [pid  4937] epoll_wait(139,  
> > [pid  4936] restart_syscall(<... resuming interrupted futex ...>
> > 
> > [pid  4909] restart_syscall(<... resuming interrupted futex ...>
> > 
> > [pid  4618] epoll_wait(136,  
> > [pid  4576] futex(0x7ff61ce66474, FUTEX_WAIT_PRIVATE, 1, NULL
>  > ...>
> > [pid  4279] futex(0x7ff61ce62b34, FUTEX_WAIT_PRIVATE, 2203, NULL
> > 
> > [pid  4244] restart_syscall(<... resuming interrupted futex ...>
> > 
> > [pid  4227] futex(0x7ff56c71ae14, FUTEX_WAIT_PRIVATE, 2237, NULL
> > 
> > [pid  4243] restart_syscall(<... resuming interrupted futex ...>
> > 
> > [pid  4228] futex(0x7ff5608331a4, FUTEX_WAIT_PRIVATE, 2237, NULL
> > 
> > [pid  4208] futex(0x7ff61ce63e54, FUTEX_WAIT_PRIVATE, 5, NULL
>  > ...>
> > [pid  4205] restart_syscall(<... resuming interrupted futex ...>
> > 
> > [pid  4204] restart_syscall(<... resuming interrupted futex ...>
> > 
> > [pid  4196] restart_syscall(<... resuming interrupted futex ...>
> > 
> > [pid  4195] restart_syscall(<... resuming interrupted futex ...>
> > 
> > [pid  4194] restart_syscall(<... resuming interrupted futex ...>
> > 
> > [pid  4193] restart_syscall(<... resuming interrupted futex ...>
> > 
> > [pid  4187] restart_syscall(<... resuming interrupted restart_syscall
> ...>
> > 
> > [pid  4180] restart_syscall(<... resuming interrupted futex ...>
> > 
> > [pid  4179] restart_syscall(<... resuming interrupted futex ...>
> > 
> > [pid  4177] restart_syscall(<... resuming interrupted futex ...>
> > 
> > [pid  4174] accept(133,  
> > [pid  4173] restart_syscall(<... resuming interrupted futex ...>
> > 
> > [pid  4172] restart_syscall(<... resuming interrupted futex ...>
> > 
> > [pid  4171] restart_syscall(<... resuming interrupted restart_syscall
> ...>

RE: Re: Suppress stack trace in error response

2019-02-22 Thread Markus Jelsma
Hello,

Solr's error responses respect the configured response writer settings, so you 
could probably remove the  element and the stuff it contains 
using XSLT. It is not too fancy, but it should work.

Regards,
Markus
 
-Original message-
> From:Branham, Jeremy (Experis) 
> Sent: Friday 22nd February 2019 16:53
> To: solr-user@lucene.apache.org
> Subject: Re:  Re: Suppress stack trace in error response
> 
> Thanks Edwin – You’re right, I could explain that a bit more.
> My security team has run a scan against the SOLR servers and identified a few 
> things they want suppressed, one being the stack trace in an error message.
> 
> For example –
> 
> 
> 500
> 1
> 
> `
> 
> 
> 
> For input string: "`"
> 
> java.lang.NumberFormatException: For input string: "`" at 
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) 
> at …
> 
> 
> I’ve got a long-term solution involving middleware changes, but I’m not sure 
> there is a quick fix for this.
> 
>  
> Jeremy Branham
> jb...@allstate.com
> 
> On 2/21/19, 9:53 PM, "Zheng Lin Edwin Yeo"  wrote:
> 
> Hi,
> 
> There's too little information provided in your questions.
> You can explain more on the issue or the exception that you are facing.
> 
> Regards,
> Edwin
> 
> On Thu, 21 Feb 2019 at 23:45, Branham, Jeremy (Experis) 
> 
> wrote:
> 
> > When Solr throws an exception, like when a client sends a badly formed
> > query string, is there a way to suppress the stack trace in the error
> > response?
> >
> >
> >
> > Jeremy Branham
> > jb...@allstate.com<mailto:jb...@allstate.com>
> > Allstate Insurance Company | UCV Technology Services | Information
> > Services Group
> >
> >
> 
> 
> 


Re: Re: Suppress stack trace in error response

2019-02-22 Thread Branham, Jeremy (Experis)
Thanks Edwin – You’re right, I could explain that a bit more.
My security team has run a scan against the SOLR servers and identified a few 
things they want suppressed, one being the stack trace in an error message.

For example –


500
1

`



For input string: "`"

java.lang.NumberFormatException: For input string: "`" at 
java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) 
at …


I’ve got a long-term solution involving middleware changes, but I’m not sure 
there is a quick fix for this.

 
Jeremy Branham
jb...@allstate.com

On 2/21/19, 9:53 PM, "Zheng Lin Edwin Yeo"  wrote:

Hi,

There's too little information provided in your questions.
You can explain more on the issue or the exception that you are facing.

Regards,
Edwin

On Thu, 21 Feb 2019 at 23:45, Branham, Jeremy (Experis) 
wrote:

> When Solr throws an exception, like when a client sends a badly formed
> query string, is there a way to suppress the stack trace in the error
> response?
>
>
>
> Jeremy Branham
> jb...@allstate.com
> Allstate Insurance Company | UCV Technology Services | Information
> Services Group
>
>




Re: Re: Re: Suppress stack trace in error response

2019-02-22 Thread Branham, Jeremy (Experis)
BTW – Congratulations on joining the PMC!

 
Jeremy Branham
jb...@allstate.com

On 2/22/19, 9:46 AM, "Branham, Jeremy (Experis)"  wrote:

Thanks Jason –
That’s what I was thinking too. It would require some development.

 
Jeremy Branham
jb...@allstate.com

On 2/22/19, 8:50 AM, "Jason Gerlowski"  wrote:

Hi Jeremy,

Unfortunately Solr doesn't offer anything like what you're looking
for, at least that I know of.  There's no sort of global "quiet" or
"suppressStack" option that you can pass on a request to _not_ get the
stacktrace information back.  There might be individual APIs which
offer something like this, but I've never run into them, so I doubt
it.

Best,

Jason

On Thu, Feb 21, 2019 at 10:53 PM Zheng Lin Edwin Yeo
 wrote:
>
> Hi,
>
> There's too little information provided in your questions.
> You can explain more on the issue or the exception that you are 
facing.
>
> Regards,
> Edwin
>
> On Thu, 21 Feb 2019 at 23:45, Branham, Jeremy (Experis) 

> wrote:
>
> > When Solr throws an exception, like when a client sends a badly 
formed
> > query string, is there a way to suppress the stack trace in the 
error
> > response?
> >
> >
> >
> > Jeremy Branham
> > jb...@allstate.com
> > Allstate Insurance Company | UCV Technology Services | Information
> > Services Group
> >
> >






Re: Re: Suppress stack trace in error response

2019-02-22 Thread Branham, Jeremy (Experis)
Thanks Jason –
That’s what I was thinking too. It would require some development.

 
Jeremy Branham
jb...@allstate.com

On 2/22/19, 8:50 AM, "Jason Gerlowski"  wrote:

Hi Jeremy,

Unfortunately Solr doesn't offer anything like what you're looking
for, at least that I know of.  There's no sort of global "quiet" or
"suppressStack" option that you can pass on a request to _not_ get the
stacktrace information back.  There might be individual APIs which
offer something like this, but I've never run into them, so I doubt
it.

Best,

Jason

On Thu, Feb 21, 2019 at 10:53 PM Zheng Lin Edwin Yeo
 wrote:
>
> Hi,
>
> There's too little information provided in your questions.
> You can explain more on the issue or the exception that you are facing.
>
> Regards,
> Edwin
>
> On Thu, 21 Feb 2019 at 23:45, Branham, Jeremy (Experis) 

> wrote:
>
> > When Solr throws an exception, like when a client sends a badly formed
> > query string, is there a way to suppress the stack trace in the error
> > response?
> >
> >
> >
> > Jeremy Branham
> > jb...@allstate.com
> > Allstate Insurance Company | UCV Technology Services | Information
> > Services Group
> >
> >




Re: Re-read from CloudSolrStream

2019-02-20 Thread Joel Bernstein
It sounds like you just need to catch the exception?


Joel Bernstein
http://joelsolr.blogspot.com/


On Mon, Feb 18, 2019 at 3:14 AM SOLR4189  wrote:

> Hi all,
>
> Let's say I have a next code:
>
> http://joelsolr.blogspot.com/2015/04/the-streaming-api-solrjio-basics.html
> <
> http://joelsolr.blogspot.com/2015/04/the-streaming-api-solrjio-basics.html>
>
>
> public class StreamingClient {
>
>public static void main(String args[]) throws IOException {
>   String zkHost = args[0];
>   String collection = args[1];
>
>   Map props = new HashMap();
>   props.put("q", "*:*");
>   props.put("qt", "/export");
>   props.put("sort", "fieldA asc");
>   props.put("fl", "fieldA,fieldB,fieldC");
>
>   CloudSolrStream cstream = new CloudSolrStream(zkHost,
> collection,
> props);
>   try {
>
> cstream.open();
> while(true) {
>
>   Tuple tuple = cstream.read();
>   if(tuple.EOF) {
>  break;
>   }
>
>   String fieldA = tuple.getString("fieldA");
>   String fieldB = tuple.getString("fieldB");
>   String fieldC = tuple.getString("fieldC");
>   System.out.println(fieldA + ", " + fieldB + ", " + fieldC);
> }
>
>   } finally {
>cstream.close();
>   }
>}
> }
>
> What can I do if I get exception in the line *Tuple tuple =
> cstream.read();*? How can I re-read the same tuple, i.e. to continue from
> exception moment ?
>
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


RE: Re: Delayed/waiting requests

2019-02-19 Thread Gael Jourdan-Weil
Quick update just in case someone comes on this thread someday: we did lower 
the autowarm but it didn't have effect on the performance issues we are seeing.

We are still investigating...

Regards,
Gaël


De : Gael Jourdan-Weil
Envoyé : mardi 15 janvier 2019 18:33
À : solr-user@lucene.apache.org
Objet : RE: Re: Delayed/waiting requests


@Erick:


We will try to lower the autowarm and run some tests to compare.

If I get your point, having a big cache might cause more troubles than help if 
the cache hit ratio is not high enough because the cache is constantly 
evicting/inserting entries?



@Jeremy:


Index size: ~20G and ~14M documents

Server memory available: 256G from which ~30G used and ~100G system cache

Server CPU count: 32, ~10% usage

JVM memory settings: -Xms12G -Xmx12G


We have 3 servers and 3 clusters of 3 Solr instances.

That is each server hosts 1 Solr instance for each cluster.

And, indeed, each cluster only has 1 shard with replication factor 3.


Among all these Solr instances, the pauses are observed on only one single 
cluster but on every server at different times (sometimes on all servers at the 
same time but I would say it's very rare).

We do observe the traffic is evenly balanced across the 3 servers, around 30-40 
queries per second sent to each server.



Regards,

Gaël



De : Branham, Jeremy (Experis) 
Envoyé : mardi 15 janvier 2019 17:59:56
À : solr-user@lucene.apache.org
Objet : Re: Re: Delayed/waiting requests

Hi Gael –

Could you share this information?
Size of the index
Server memory available
Server CPU count
JVM memory settings

You mentioned a cloud configuration of 3 replicas.
Does that mean you have 1 shard with a replication factor of 3?
Do the pauses occur on all 3 servers?
Is the traffic evenly balanced across those servers?


Jeremy Branham
jb...@allstate.com


On 1/15/19, 9:50 AM, "Erick Erickson"  wrote:

Well, it was a nice theory anyway.

"Other collections with the same settings"
doesn't really mean much unless those other collections are very similar,
especially in terms of numbers of docs.

You should only see a new searcher opening when you do a
hard-commit-with-opensearcher-true or soft commit.

So what happens when you just try lowering the autowarm
count? I'm assuming you're free to test in some non-prod
system.

Focusing on the hit ratio is something of a red herring. Remember
that each entry in your filterCache is roughly maxDoc/8 + a little
overhead, the increase in GC pressure has to be balanced
against getting the hits from the cache.

Now, all that said if there's no correlation, then you need to put
a profiler on the system when you see this kind of thing and
find out where the hotspots are, otherwise it's guesswork and
I'm out of ideas.

Best,
Erick

On Tue, Jan 15, 2019 at 12:06 AM Gael Jourdan-Weil
 wrote:
>
> Hi Erick,
>
>
> Thank you for your detailed answer, I better understand autowarming.
>
>
> We have an autowarming time of ~10s for filterCache (queryResultCache is 
not used at all, ratio = 0.02).
>
> We increased the size of the filterCache from 6k to 12k (and autowarming 
size set to same values) to have a better ratio which is _only_ around 
0.85/0.90.
>
>
> The thing I don't understand is I should see "Opening new searcher" in 
the logs everytime a new searcher is opened and thus an autowarming happens, 
right?
>
> But I don't see "Opening new searcher" very often, and I don't see it 
being correlated with the response time peaks.
>
>
> Also, I didn't mention it earlier but, we have other SolrCloud clusters 
with similar settings and load (~10s filterCache autowarming, 10k entries) and 
we don't observe the same behavior.
>
>
> Regards,
>
> 
> De : Erick Erickson 
> Envoyé : lundi 14 janvier 2019 17:44:38
> À : solr-user
> Objet : Re: Delayed/waiting requests
>
> Gael:
>
> bq. Nevertheless, our filterCache is set to autowarm 12k entries which
> is also the maxSize
>
> That is far, far, far too many. Let's assume you actually have 12K
> entries in the filterCache.
> Every time you open a new searcher, 12K queries are executed _before_
> the searcher
> accepts any new requests. While being able to re-use a filterCache
> entry is useful, one of
> the primary purposes is to pre-load index data from disk into memory
> which can be
> the event that takes the most time.
>
> The queryResultCache has a similar function. I often find that this
> cache doesn't have a
> very high hit ratio, but again

***UNCHECKED*** Re: Re: solr 7.0: What causes the segment to flush

2019-02-18 Thread DIMA

Buongiorno,




Vedi allegato e di confermare.

Password: 1234567




Grazie




DIMA





From: khi...@gmail.com
Sent: Tue, 17 Oct 2017 15:40:50 +
To: solr-user@lucene.apache.org
Subject: Re: solr 7.0: What causes the segment to flush
 

I take my yesterdays comment back. I assumed that the file being written
is a segment, however after letting solr run for the night. I see that the
segment is flushed at the expected size:1945MB (so that file which i
observed was still open for writing).
Now, I have two other questions:-

1. Is there a way to not write to disk continuously and only write the file
when segment is flushed?

2. With 6.5: i had ramBufferSizeMB=20G and limiting the threadCount to 12
(since LUCENE-6659
,
there is no configuration for indexing thread count, so I did a local
workaround to limit the number of threads in code); I had very good write
throughput. But with 7.0, I am getting comparable throughput only at
indexing threadcount > 50. What could be wrong ?


Thanks @Erick, I checked the commit settings, both soft and hard commits
are off.




On Tue, Oct 17, 2017 at 3:47 AM, Amrit Sarkar 
wrote:

> >
> > In 7.0, i am finding that the file is written to disk very early on
> > and it is being updated every second or so. Had something changed in 7.0
> > which is causing it?  I tried something similar with solr 6.5 and i was
> > able to get almost a GB size files on disk.
>
>
> Interesting observation, Nawab, with ramBufferSizeMB=20G, you are getting
> 20GB segments on 6.5 or less? a GB?
>
> Amrit Sarkar
> Search Engineer
> Lucidworks, Inc.
> 415-589-9269
> www.lucidworks.com
> Twitter http://twitter.com/lucidworks
> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>
> On Tue, Oct 17, 2017 at 12:48 PM, Nawab Zada Asad Iqbal 
> wrote:
>
> > Hi,
> >
> > I have  tuned  (or tried to tune) my settings to only flush the segment
> > when it has reached its maximum size. At the moment,I am using my
> > application with only a couple of threads (i have limited to one thread
> for
> > analyzing this scenario) and my ramBufferSizeMB=2 (i.e. ~20GB). With
> > this, I assumed that my file sizes on the disk will be at in the order of
> > GB; and no segments will be flushed until the segments in memory size 
>is
> > 2GB. In 7.0, i am finding that the file is written to disk very early on
> > and it is being updated every second or so. Had something changed in 7.0
> > which is causing it?  I tried something similar with solr 6.5 and i was
> > able to get almost a GB size files on disk.
> >
> > How can I control it to not write to disk until the segment has reached
> its
> > maximum permitted size (1945 MB?) ? My write traffic is new 
>only (i.e.,
> > it doesnt delete any document) , however I also found following
> infostream
> > logs, which incorrectly say delete=true:
> >
> > Oct 16, 2017 10:18:29 PM INFO  (qtp761960786-887) [   x:filesearch]
> > o.a.s.c.S.Request [filesearch]  webapp=/solr path=/update
> > params={commit=false} status=0 QTime=21
> > Oct 16, 2017 10:18:29 PM INFO  (qtp761960786-889) [   x:filesearch]
> > o.a.s.u.LoggingInfoStream [DW][qtp761960786-889]: anyChanges?
> > numDocsInRam=4434 deletes=true hasTickets:false
> pendingChangesInFullFlush:
> > false
> > Oct 16, 2017 10:18:29 PM INFO  (qtp761960786-889) [   x:filesearch]
> > o.a.s.u.LoggingInfoStream [IW][qtp761960786-889]: nrtIsCurrent:
> infoVersion
> > matches: false; DW changes: true; BD changes: false
> > Oct 16, 2017 10:18:29 PM INFO  (qtp761960786-889) [   x:filesearch]
> > o.a.s.c.S.Request [filesearch]  webapp=/solr path=/admin/luke
> > params={show=index=0=json} status=0 QTime=0
> >
> >
> >
> > Thanks
> > Nawab
> >
>


<>


  1   2   3   4   5   >