Strange Synonym Graph Filter Bug in Admin UI

2020-05-26 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Hi All,

We are coming across a strange bug in the Analysis section of the Admin UI. For 
our non-English schema components, instead of the Synonym Graph Filter (SGF) 
showing in the UI, it's showing something called a "List Based Token Stream" 
(LBTS) in its place. We found an old issue that documented this bug, but it 
doesn't seem to have been resolved: 
https://issues.apache.org/jira/browse/SOLR-10366. Has anyone else come across 
this and/or found a solution?

Thanks!

Best,
Audrey



RE: Indexing Korean

2020-05-04 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Oh wow, I had no idea this existed. Thank you so much!

Best,
Audrey

On 5/1/20, 12:58 PM, "Markus Jelsma"  wrote:

Hello,

Although it is not mentioned in Solr's language analysis page in the 
manual, Lucene has had support for Korean for quite a while now.
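A minimal field type wired to the Nori analyzer looks roughly like this (an untested sketch; recent Solr releases ship a similar text_ko example type):

```xml
<fieldType name="text_ko" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Nori morphological tokenizer; decompoundMode controls compound splitting -->
    <tokenizer class="solr.KoreanTokenizerFactory" decompoundMode="discard"/>
    <!-- drop particles/endings by part of speech -->
    <filter class="solr.KoreanPartOfSpeechStopFilterFactory"/>
    <!-- fold Hanja to their Hangul readings -->
    <filter class="solr.KoreanReadingFormFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```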


https://lucene.apache.org/core/8_5_0/analyzers-nori/index.html

Regards,
Markus



-Original message-
> From:Audrey Lorberfeld - audrey.lorberf...@ibm.com 

> Sent: Friday 1st May 2020 17:34
> To: solr-user@lucene.apache.org
> Subject: Indexing Korean
> 
>  Hi All,
> 
> My team would like to index Korean, but it looks like Solr OOTB does not 
have explicit support for Korean. If any of you have schema pipelines you could 
share for your Korean documents, I would love to see them! I'm assuming I would 
just use some combination of the OOTB CJK factories.
> 
> Best,
> Audrey
> 
> 



RE: Possible issue with Stemming and nouns ended with suffix 'ion'

2020-05-01 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Unless I'm misunderstanding the bug in question, there is no bug. What you are 
observing is simply how things get stemmed...

Best,
Audrey

On 4/30/20, 6:37 PM, "Jhonny Lopez"  wrote:

Yes, sounds like worth it.

Thanks guys!

-Original Message-
From: Mike Drob 
Sent: Thursday, April 30, 2020 5:30 p.m.
To: solr-user@lucene.apache.org
Subject: Re: Possible issue with Stemming and nouns ended with suffix 'ion'

This email has been sent from a source external to Publicis Groupe. Please 
use caution when clicking links or opening attachments.
Cet email a été envoyé depuis une source externe à Publicis Groupe. 
Veuillez faire preuve de prudence lorsque vous cliquez sur des liens ou lorsque 
vous ouvrez des pièces jointes.



Is this worth filing a bug/suggestion to the folks over at snowballstem.org?

On Thu, Apr 30, 2020 at 4:08 PM Audrey Lorberfeld - 
audrey.lorberf...@ibm.com  wrote:

> I agree with Erick. I think that's just how the cookie crumbles when
> stemming. If you have some time on your hands, you can integrate
> OpenNLP with your Solr instance and start using the lemmas of tokens
> instead of the stems. In this case, I believe if you were to lemmatize
> both "identify" and "identification," they would both condense to 
"identify."
>
> Best,
> Audrey
>
> On 4/30/20, 3:54 PM, "Erick Erickson"  wrote:
>
> They are being stemmed to two different tokens, “identif” and
> “identifi”. Stemming is algorithmic and imperfect and in this case
> you’re getting bitten by that algorithm. It looks like you’re using
> PorterStemFilter, if you want you can look up the exact algorithm, but
> I don’t think it’s a bug, just one of those little joys of English...
>
> To get a clearer picture of exactly what’s being searched, try
> adding &debug=query to your query, in particular looking at the parsed
> query that’s returned. That’ll tell you a bunch. In this particular
> case I don’t think it’ll tell you anything more, but for future…
>
> Best,
> Erick
>
> Oh, and un-checking the ‘verbose’ box on the analysis page removes
> a lot of distraction, the detailed information is often TMI ;)
>
> > On Apr 30, 2020, at 2:51 PM, Jhonny Lopez <
> jhonny.lo...@publicismedia.com> wrote:
> >
> > Sure, rewriting the message with links for images:
> >
> >
> > We’re facing an issue with stemming in solr. Most of the cases
> > are working correctly, for example, if we search for bidding, solr
> > brings results for bidding, bid, bids, etc. However, with nouns ending
> > in the ‘ion’ suffix, stemming is not working. Even when the analyzer
> > seems to stem the word correctly, the results do not reflect that. One
> > example: if I search ‘identifying’, this is the output:
> >
> > Analyzer (image link):
> >
> 
https://1drv.ms/u/s!AlRTlFq8tQbShd4-Cp40Cmc0QioS0A?e=1f3GJp
> >
> > A clip of results:
> > "haschildren_b":false,
> >"isbucket_text_s":"0",
> >"sectionbody_t":"\n\n\nIn order to identify 1st price
> auctions, leverage the proprietary tools available or manually pull a
> log file report to understand the trends and gauge auction spread
> overtime to assess the impact of variable auction 
dynamics.\n\n\n\n\n\n\n",
> >"parsedupdatedby_s":"sitecorecarvaini",
> >"sectionbody_t_en":"\n\n\nIn order to identify 1st price
> auctions, leverage the proprietary tools available or manually pull a
> log file report to understand the trends and gauge auction spread
> overtime to assess the impact of variable auction 
dynamics.\n\n\n\n\n\n\n",
> >"hide_section_b":false
> >
> >
> > As you can see, it has used the stemming correctly and brings
> results for other words based in the root, in this case “Identify”.
> >
> > However, if I search for “Identification”, this is the output:
> >
> > Analyzer (imagelink):
> >
> 
https://1drv.ms/u/s!AlRTlFq8tQbSh

Indexing Korean

2020-05-01 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
 Hi All,

My team would like to index Korean, but it looks like Solr OOTB does not have 
explicit support for Korean. If any of you have schema pipelines you could 
share for your Korean documents, I would love to see them! I'm assuming I would 
just use some combination of the OOTB CJK factories.

Best,
Audrey



RE: Solr fields mapping

2020-04-30 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Sam,

Ah, okay, I see. Hm, I wonder if you could hack "debug mode" to show you how 
they're interacting with the field. I'll keep thinking ... 

Best,
Audrey

On 4/30/20, 3:20 PM, "sambasivarao giddaluri"  
wrote:

Hi Audrey,

Yes, I am aware of copyField, but it does not fit my use case: when returning
output we have to show each field with its own value, and with copyField the
values are combined, so we lose the field-to-value relationship.

regards
sam

On Wed, Apr 29, 2020 at 9:53 AM Audrey Lorberfeld -
    audrey.lorberf...@ibm.com  wrote:

> Hi, Sam!
>
> Have you tried creating a copyField?
> 
https://builds.apache.org/view/L/view/Lucene/job/Solr-reference-guide-8.x/javadoc/copying-fields.html
>
> Best,
> Audrey
>
> On 4/28/20, 1:07 PM, "sambasivarao giddaluri" <
> sambasiva.giddal...@gmail.com> wrote:
>
> Hi All,
> Is there a way we can map fields in a single field?
> Ex: scheme has below fields
> createdBy.userName
> createdBy.name
> createdBy.email
>
> If I have to retrieve these fields, I need to pass all three in the *fl*
> parameter. Instead, is there a way I can have a map or an object of these
> fields under createdBy, so that in fl I pass only createdBy and get all
> three as output?
>
> Regards
> sam
>
>
>




RE: Possible issue with Stemming and nouns ended with suffix 'ion'

2020-04-30 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
I agree with Erick. I think that's just how the cookie crumbles when stemming. 
If you have some time on your hands, you can integrate OpenNLP with your Solr 
instance and start using the lemmas of tokens instead of the stems. In this 
case, I believe if you were to lemmatize both "identify" and "identification," 
they would both condense to "identify."
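If you go that route, the analysis chain would look something like this (an untested sketch; the .bin files are the stock OpenNLP English models, which you'd have to drop into the configset yourself):

```xml
<fieldType name="text_lemma" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- the OpenNLP tokenizer requires sentence-detector and tokenizer models -->
    <tokenizer class="solr.OpenNLPTokenizerFactory"
               sentenceModel="en-sent.bin" tokenizerModel="en-token.bin"/>
    <!-- part-of-speech tags are needed downstream by the lemmatizer -->
    <filter class="solr.OpenNLPPOSFilterFactory" posTaggerModel="en-pos-maxent.bin"/>
    <filter class="solr.OpenNLPLemmatizerFilterFactory" lemmatizerModel="en-lemmatizer.bin"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```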

Best,
Audrey

On 4/30/20, 3:54 PM, "Erick Erickson"  wrote:

They are being stemmed to two different tokens, “identif” and “identifi”. 
Stemming is algorithmic and imperfect and in this case you’re getting bitten by 
that algorithm. It looks like you’re using PorterStemFilter, if you want you 
can look up the exact algorithm, but I don’t think it’s a bug, just one of 
those little joys of English...

To get a clearer picture of exactly what’s being searched, try adding 
&debug=query to your query, in particular looking at the parsed query that’s 
returned. That’ll tell you a bunch. In this particular case I don’t think it’ll 
tell you anything more, but for future…
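For example (core name hypothetical):

```
http://localhost:8983/solr/mycore/select?q=identification&debug=query
```

The parsedquery section of the debug output shows the exact analyzed terms being searched.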

Best,
Erick

Oh, and un-checking the ‘verbose’ box on the analysis page removes a lot of 
distraction, the detailed information is often TMI ;)

> On Apr 30, 2020, at 2:51 PM, Jhonny Lopez 
 wrote:
> 
> Sure, rewriting the message with links for images:
> 
> 
> We’re facing an issue with stemming in solr. Most of the cases are 
working correctly, for example, if we search for bidding, solr brings results 
for bidding, bid, bids, etc. However, with nouns ending in the ‘ion’ suffix, 
stemming is not working. Even when the analyzer seems to stem the word 
correctly, the results do not reflect that. One example: if I search 
‘identifying’, this is the output:
> 
> Analyzer (image link):
> 
https://1drv.ms/u/s!AlRTlFq8tQbShd4-Cp40Cmc0QioS0A?e=1f3GJp
> 
> A clip of results:
> "haschildren_b":false,
>"isbucket_text_s":"0",
>"sectionbody_t":"\n\n\nIn order to identify 1st price auctions, 
leverage the proprietary tools available or manually pull a log file report to 
understand the trends and gauge auction spread overtime to assess the impact of 
variable auction dynamics.\n\n\n\n\n\n\n",
>"parsedupdatedby_s":"sitecorecarvaini",
>"sectionbody_t_en":"\n\n\nIn order to identify 1st price auctions, 
leverage the proprietary tools available or manually pull a log file report to 
understand the trends and gauge auction spread overtime to assess the impact of 
variable auction dynamics.\n\n\n\n\n\n\n",
>"hide_section_b":false
> 
> 
> As you can see, it has used the stemming correctly and brings results for 
other words based in the root, in this case “Identify”.
> 
> However, if I search for “Identification”, this is the output:
> 
> Analyzer (imagelink):
> 
https://1drv.ms/u/s!AlRTlFq8tQbShd49RpiQObzMgSjVhA
> 
> 
> Even with proper stemming, solr is only bringing results for the word 
identification (or identifications) but nothing else.
> 
> The queries are over the same field that has the Porter Stemming Filter 
applied for both, query and index. This behavior is consistent with other ‘ion’ 
ended nouns: representation, modification, etc.
> 
> Solr version: 8.1. Does anyone know why this is happening? Is it a bug?
> 
> Thanks.
> 
> 
> 
> 
> 
> -Original Message-
> 
> From: Erick Erickson 
> 
> Sent: Thursday, April 30, 2020 1:47 p.m.
> 
> To: solr-user@lucene.apache.org
> 
> Subject: Re: Possible issue with Stemming and nouns ended with suffix 
'ion'
> 
> 
> 
> 
> The mail server is pretty aggressive about stripping links, so we can’t 
see the images.
> 
> 
> 
> Could you put them somewhere and paste a link?
> 
> 
> 
> Best,
> 
> Erick
> 
> 
> 
>> On Apr 30, 2020, at 2:40 PM, Jhonny Lopez 
 wrote:
> 
>> 
> 
>> We’re facing an issue with stemming in solr. Most of the cases are 
working correctly, for example, if we search for bidding, solr brings 

Re: Solr fields mapping

2020-04-29 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Hi, Sam!

Have you tried creating a copyField? 
https://builds.apache.org/view/L/view/Lucene/job/Solr-reference-guide-8.x/javadoc/copying-fields.html
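For reference, a minimal setup looks something like this (the field and type names are only illustrative):

```xml
<field name="createdBy.email" type="string" indexed="true" stored="true"/>
<field name="createdBy_all" type="text_general" indexed="true" stored="true" multiValued="true"/>
<copyField source="createdBy.*" dest="createdBy_all"/>
```

Then requesting createdBy_all in fl returns all the copied values in one multiValued field.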

Best,
Audrey

On 4/28/20, 1:07 PM, "sambasivarao giddaluri"  
wrote:

Hi All,
Is there a way we can map fields in a single field?
Ex: scheme has below fields
createdBy.userName
createdBy.name
createdBy.email

If I have to retrieve these fields, I need to pass all three in the *fl*
parameter. Instead, is there a way I can have a map or an object of these
fields under createdBy, so that in fl I pass only createdBy and get all
three as output?

Regards
sam




Japanese text handling in Solr

2020-03-31 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Hi All,

We are adding Japanese to our index, and I would love to know if any of you 
have a synonyms file you use for Japanese?

Thank you!

Best,
Audrey Lorberfeld


Re: Re: Re: Using Synonym Graph Filter with StandardTokenizer does not tokenize the query string if it has multi-word synonym

2020-03-16 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
I don't think you can synonym-ize both the multi-token phrase and each 
individual token in the multi-token phrase at the same time. But anyone else 
feel free to chime in! 

Best,
Audrey Lorberfeld

On 3/16/20, 12:40 PM, "atin janki"  wrote:

I aim to achieve an expansion like -

Synonym(soap powder) + Synonym(soap) + Synonym (powder)


which is not happening because of the way synonym expansion is being done at
the moment.

At the moment, using  Synonym Graph Filter with StandardTokenizer  and sow
= false , expands as -

 Synonym(soap powder)

because "soap powder" is a multi-word synonym present in the synonym file.

Using sow = true in the above setting will give -

Synonym(soap) + Synonym (powder)



Best Regards,
Atin Janki


On Mon, Mar 16, 2020 at 5:27 PM Audrey Lorberfeld -
    audrey.lorberf...@ibm.com  wrote:

> To confirm, you want a synonym like "soap powder" to map onto synonyms
> like "hand soap," "hygiene products," etc? As in, more of a cognitive
> synonym mapping where you feed synonyms that only apply to the multi-token
> phrase as a whole?
>
> On 3/16/20, 12:17 PM, "atin janki"  wrote:
>
> Using sow=true does split the query on whitespace, but it will not
> look for synonyms of "soap powder" anymore; rather, it expands
> separate synonyms for "soap" and "powder".
>
>
    >
> Best Regards,
> Atin Janki
>
>
> On Mon, Mar 16, 2020 at 4:59 PM Audrey Lorberfeld -
> audrey.lorberf...@ibm.com  wrote:
>
> > Have you set sow=true in your search handler? I know that we have it
> set
> > to false (sow = split on whitespace) because we WANT multi-token
> synonyms
> > retained as multiple tokens.
> >
> > On 3/16/20, 10:49 AM, "atin janki"  wrote:
> >
> > Hello everyone,
> >
> > I am using solr 8.3.
> >
> > After I included Synonym Graph Filter in my managed-schema file,
> I
> > have noticed that if the query string contains a multi-word
> synonym,
> > it considers that multi-word synonym as a single term and does
> not
> > break it, further suppressing the default search behaviour.
> >
> > I am using StandardTokenizer.
> >
> > Below is a snippet from managed-schema file -
> >
> > >
> > > <fieldType … class="solr.TextField" positionIncrementGap="100" multiValued="true">
> > >   <analyzer type="index">
> > >     <tokenizer class="solr.StandardTokenizerFactory"/>
> > >     <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
> > >   </analyzer>
> > >   <analyzer type="query">
> > >     <tokenizer class="solr.StandardTokenizerFactory"/>
> > >     <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
> > >     <filter class="solr.SynonymGraphFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
> > >   </analyzer>
> > > </fieldType>
> >
> >
> > Here "*soap powder*" is the search *query* which is also a
> multi-word
> > synonym in the synonym file as-
> >
> > > s(104254535,1,'soap powder',n,1,1).
> > > s(104254535,2,'built-soap powder',n,1,0).
> > > s(104254535,3,'washing powder',n,1,0).
> >
> >
> > I am sharing some screenshots for understanding the problem-
> >
> > *without* Synonym Graph Filter => 2 docs returned  (screenshot 
at
> > below mentioned URL) -
> >
> >
> >
> 
https://ibb.co/zQXx7mV
> >
> > *with* Synonym Graph Filter => 2 docs expected, only 1 returned
> > (screenshot at below mentioned URL) -
> >
> >
> >
> 
https://ibb.co/tp04Rzw
> >
> >
> > Has anyone experienced this before? If yes, is there any
> workaround ?
> > Or is it an expected behaviour?
> >
> > Regards,
> > Atin Janki
> >
> >
> >
>
>
>




Re: Re: Using Synonym Graph Filter with StandardTokenizer does not tokenize the query string if it has multi-word synonym

2020-03-16 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
To confirm, you want a synonym like "soap powder" to map onto synonyms like 
"hand soap," "hygiene products," etc? As in, more of a cognitive synonym 
mapping where you feed synonyms that only apply to the multi-token phrase as a 
whole?

On 3/16/20, 12:17 PM, "atin janki"  wrote:

Using sow=true does split the query on whitespace, but it will not look for
synonyms of "soap powder" anymore; rather, it expands separate synonyms for
"soap" and "powder".



Best Regards,
Atin Janki


On Mon, Mar 16, 2020 at 4:59 PM Audrey Lorberfeld -
audrey.lorberf...@ibm.com  wrote:

> Have you set sow=true in your search handler? I know that we have it set
> to false (sow = split on whitespace) because we WANT multi-token synonyms
> retained as multiple tokens.
>
> On 3/16/20, 10:49 AM, "atin janki"  wrote:
>
> Hello everyone,
>
> I am using solr 8.3.
>
> After I included Synonym Graph Filter in my managed-schema file, I
> have noticed that if the query string contains a multi-word synonym,
> it considers that multi-word synonym as a single term and does not
> break it, further suppressing the default search behaviour.
>
> I am using StandardTokenizer.
>
> Below is a snippet from managed-schema file -
>
> >
> > <fieldType … class="solr.TextField" positionIncrementGap="100" multiValued="true">
> >   <analyzer type="index">
> >     <tokenizer class="solr.StandardTokenizerFactory"/>
> >     <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
> >   </analyzer>
> >   <analyzer type="query">
> >     <tokenizer class="solr.StandardTokenizerFactory"/>
> >     <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
> >     <filter class="solr.SynonymGraphFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
> >   </analyzer>
> > </fieldType>
>
>
> Here "*soap powder*" is the search *query* which is also a multi-word
> synonym in the synonym file as-
>
> > s(104254535,1,'soap powder',n,1,1).
> > s(104254535,2,'built-soap powder',n,1,0).
> > s(104254535,3,'washing powder',n,1,0).
>
>
> I am sharing some screenshots for understanding the problem-
>
> *without* Synonym Graph Filter => 2 docs returned  (screenshot at
> below mentioned URL) -
>
>
> 
https://ibb.co/zQXx7mV
>
> *with* Synonym Graph Filter => 2 docs expected, only 1 returned
> (screenshot at below mentioned URL) -
>
>
> 
https://ibb.co/tp04Rzw
>
>
> Has anyone experienced this before? If yes, is there any workaround ?
> Or is it an expected behaviour?
>
> Regards,
> Atin Janki
>
>
>




Re: Using Synonym Graph Filter with StandardTokenizer does not tokenize the query string if it has multi-word synonym

2020-03-16 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Have you set sow=true in your search handler? I know that we have it set to 
false (sow = split on whitespace) because we WANT multi-token synonyms retained 
as multiple tokens. 
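Concretely, sow is just a query parameter on the dismax/edismax parsers, so you can compare both behaviours directly, e.g. (collection name hypothetical):

```
http://localhost:8983/solr/mycoll/select?defType=edismax&q=soap%20powder&sow=true
http://localhost:8983/solr/mycoll/select?defType=edismax&q=soap%20powder&sow=false
```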

On 3/16/20, 10:49 AM, "atin janki"  wrote:

Hello everyone,

I am using solr 8.3.

After I included Synonym Graph Filter in my managed-schema file, I
have noticed that if the query string contains a multi-word synonym,
it considers that multi-word synonym as a single term and does not
break it, further suppressing the default search behaviour.

I am using StandardTokenizer.

Below is a snippet from managed-schema file -

>
> <fieldType … class="solr.TextField" positionIncrementGap="100" multiValued="true">
>   <analyzer type="index">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
>     <filter class="solr.SynonymGraphFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
>   </analyzer>
> </fieldType>


Here "*soap powder*" is the search *query* which is also a multi-word
synonym in the synonym file as-

> s(104254535,1,'soap powder',n,1,1).
> s(104254535,2,'built-soap powder',n,1,0).
> s(104254535,3,'washing powder',n,1,0).


I am sharing some screenshots for understanding the problem-

*without* Synonym Graph Filter => 2 docs returned  (screenshot at
below mentioned URL) -


https://ibb.co/zQXx7mV

*with* Synonym Graph Filter => 2 docs expected, only 1 returned
(screenshot at below mentioned URL) -


https://ibb.co/tp04Rzw


Has anyone experienced this before? If yes, is there any workaround ?
Or is it an expected behaviour?

Regards,
Atin Janki




Re: configuring suggester with api

2020-03-12 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Hi Manoj,

In the handler, I think you are missing the suggest.dictionary parameter, which 
should be set to the name of your suggestion component. In this case, I believe 
it should be set to "titleSuggester."

In this sample URL from the documentation, they have a suggest.dictionary 
field, and our suggester (which is working) also has this field: 

http://localhost:8983/solr/techproducts/suggest?suggest=true&suggest.build=true&suggest.dictionary=mySuggester&suggest.q=c&suggest.cfq=memory
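So something like this against the Config API might fix it (an untested sketch of your own handler definition with the dictionary added to the defaults):

```
curl -X POST -H 'Content-type:application/json' --data-binary '{
 "update-requesthandler": {
  "name": "/suggest",
  "startup": "lazy",
  "class": "solr.SearchHandler",
  "defaults": {
   "suggest": true,
   "suggest.count": 10,
   "suggest.dictionary": "titleSuggester"
  },
  "components": ["titleSuggester"]
 }
}' http://localhost:8983/solr/jn_core/config
```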

On 3/12/20, 6:33 AM, "Manoj Sonawane"  wrote:

Hello,
I am trying to learn the Solr Config API and am having a problem creating a
suggester. Any pointers will be appreciated.

suggester has been setup with

curl -X POST -H 'Content-type:application/json' --data-binary '{
 "add-searchcomponent": {
  "class": "solr.SuggestComponent",
  "name": "titleSuggester",
  "lookupImpl": "FuzzyLookupFactory",
  "dictionaryImpl": "DocumentDictionaryFactory",
  "field": "title_facet",
  "suggestAnalyzerFieldType": "string",
  "buildOnStartup": true
 }
}' 
http://localhost:8983/solr/jn_core/config


and corresponding handler


echo
echo "setting up suggest handler"
curl -X POST -H 'Content-type:application/json' --data-binary '{
 "update-requesthandler": {
  "name": "/suggest",
  "startup": "lazy",
  "class": "solr.SearchHandler",
  "defaults": {
   "suggest": true,
   "suggest.count": 10
  },
  "components": ["titleSuggester"]
 }
}' 
http://localhost:8983/solr/jn_core/config


it seems to be configured



http://localhost:8983/solr/jn_core/config/searchComponent?componentName=titleSuggester

{
  "responseHeader":{
"status":0,
"QTime":1},
  "config":{"searchComponent":{"titleSuggester":{
"class":"solr.SuggestComponent",
"name":"titleSuggester",
"lookupImpl":"FuzzyLookupFactory",
"dictionaryImpl":"DocumentDictionaryFactory",
"field":"title_facet",
"suggestAnalyzerFieldType":"string",
"buildOnStartup":true}}}}

but solr returns "No suggester named titleSuggester was configured"

http://localhost:8983/solr/jn_core/suggest?suggest=true&suggest.build=true&suggest.q=elec&suggest.dictionary=titleSuggester

  "responseHeader":{

"status":400,
"QTime":0},
  "error":{
"metadata":[
  "error-class","org.apache.solr.common.SolrException",
  "root-error-class","org.apache.solr.common.SolrException"],
"msg":"No suggester named titleSuggester was configured",
"code":400}}




exactMatchFirst Solr Suggestion Component

2020-03-04 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
 Hi All,

Would anyone be able to help me debug my suggestion component? Right now, our 
config looks like this: 


  
<searchComponent name="…" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">mySuggester</str>
    <str name="lookupImpl">FuzzyLookupFactory</str>
    <str name="dictionaryImpl">FileDictionaryFactory</str>
    <str name="sourceLocation">./conf/queries_list_with_weights.txt</str>
    <str name="fieldDelimiter">,</str>
    <str name="storeDir">conf</str>
    <str name="suggestAnalyzerFieldType">keywords_w3_en</str>
    <str name="buildOnStartup">false</str>
  </lst>
</searchComponent>


We like the idea of the FuzzyLookupFactory because of how it interacts with 
misspelled prefixes. However, we are finding that the exactMatchFirst 
parameter, which is supposed to be set to true by default in the code, is NOT 
showing exact match prefixes first. I think this is because of the weights we 
have with each term. However, the documentation specifically states that 
exactMatchFirst is meant to ignore weights 
(https://builds.apache.org/view/L/view/Lucene/job/Solr-reference-guide-8.x/javadoc/suggester.html#fuzzylookupfactory).
 

For the prefix "box" this is what our suggestions list looks like. You can see 
that "bond" is above other results I would expect to be above it, such as 
"box@ibm," etc.:

{
  "responseHeader":{
"status":0,
"QTime":112},
  "command":"build",
  "suggest":{"mySuggester":{
  "box":{
"numFound":8,
"suggestions":[{
"term":"box",
"weight":1799,
"payload":""},
  {
"term":"bond",
"weight":805,
"payload":""},
  {
"term":"box@ibm",
"weight":202,
"payload":""},
  {
"term":"box at ibm",
"weight":54,
"payload":""},
  {
"term":"books",
"weight":45,
"payload":""},
  {
"term":"box drive",
"weight":34,
"payload":""},
  {
"term":"books 24x7",
"weight":31,
"payload":""},
  {
"term":"box sync",
"weight":31,
"payload":""}]}}}}
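For context, the source file we point the FileDictionaryFactory at is just term/weight lines split on our "," fieldDelimiter, so the entries above come from lines like:

```
box,1799
bond,805
box@ibm,202
```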

Any help is greatly appreciated!

Best,
Audrey



Re: Re: Re: Re: Re: Query Autocomplete Evaluation

2020-02-28 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Paras,

Thank you! This is all very helpful :) I'm going to read through your answer a 
couple more times and follow up if I have any more questions!

Best,
Audrey

On 2/28/20, 8:08 AM, "Paras Lehana"  wrote:

Hey Audrey,

Users often skip results and go straight to vanilla search even though
> their query is displayed in the top of the suggestions list


Yes, we do track this in another metric. This behaviour is more
prevalent for shorter terms like "tea" and "bag". But, anyways, we measure
MRR for quantifying how high are we able to show suggestions to the users.
Since we include only the terms selection via Auto-Suggest in the universe
for calculation, the searches where users skip Auto-Suggest won't be
counted. I think we can safely exclude these if you're using MRR to measure
how well you order your result set. Still, if you want to include those,
you can always compare the search term with the last result set and include
them in MRR - you're actually right that users may be skipping the lower
positions even if the intended suggestion is available. Our MRR stands at
68% and 75% of all of the suggestions are selected from position #1 or #2.
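(To be explicit about the metric: by MRR I mean the standard mean reciprocal rank over the set Q of Auto-Suggest searches,

```latex
\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}
```

where rank_i is the position of the selected suggestion in the i-th search.)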


So acceptance rate = # of suggestions taken / total queries issued?


Yes. The total queries issued should ideally be those where Auto-Suggest
was selected or could have been selected, i.e. we exclude voice searches. We
try to include as much as those searches which were made via typing in the
search bar. But that's how we have fine-tuned our tracking over months.
You're right about the general formula - searches via Auto-Suggest divided
by total Searches.


And Selection to Display = # of suggestions taken (this would only be 1, if
> the not-taken suggestions are given 0s) / total suggestions displayed? If
> the above is true, wouldn't Selection to Display be binary? I.e. it's
> either 1/# of suggestions displayed (assuming this is a constant) or 0?


Yup. Please note that this is calculated per session of Auto-Suggest. Let
the formula be S/D. We will take D (Display) as 1 and not 3 when a user
queries "bag" (b, ba, bag). If the S (Selection) was made in the last
display, it is 1 also. If a user selects "bag" after writing "ba", we don't
say that S=0, D=1 for "b" and S=1, D=1 for "ba". For this, we already track
APL (Average Prefix Length). S/D is calculated per search and thus, here
S=1, D=1 for search "bag". Thus, for a single search, S/D can be either 0
or 1 - you're right, it's binary!

    Hope this helps. Loved your questions! :)

On Thu, 27 Feb 2020 at 22:21, Audrey Lorberfeld - audrey.lorberf...@ibm.com
 wrote:

> Paras,
>
> Thank you for this response! Yes, you are being clear :)
>
> Regarding the assumptions you make for MRR, do you have any research
> papers to confirm that these user behaviors have been observed? I only ask
> because this paper 
http://yichang-cs.com/yahoo/sigir14_SearchAssist.pdf
> talks about how users often skip results and go straight to vanilla search
> even though their query is displayed in the top of the suggestions list
> (section 3.2 "QAC User Behavior Analysis"), among other behaviors that go
> against general IR intuition. This is only one paper, of course, but it
> seems that user research of QAC is hard to come by otherwise.
>
> So acceptance rate = # of suggestions taken / total queries issued ?
> And Selection to Display = # of suggestions taken (this would only be 1,
> if the not-taken suggestions are given 0s) / total suggestions displayed ?
>
> If the above is true, wouldn't Selection to Display be binary? I.e. it's
> either 1/# of suggestions displayed (assuming this is a constant) or 0?
>
> Best,
> Audrey
>
>
> 
> From: Paras Lehana 
> Sent: Thursday, February 27, 2020 2:58:25 AM
> To: solr-user@lucene.apache.org
> Subject: [EXTERNAL] Re: Re: Re: Query Autocomplete Evaluation
>
> Hi Audrey,
>
> For MRR, we assume that if a suggestion is selected, it's relevant. It's
> also assumed that the user will always click the highest relevant
> suggestion. Thus, we calculate position selection for each selection. If
> still, I'm not understanding your question correctly, feel free to contact
> me personally (hangouts?).
 

Re: Re: Re: Re: Query Autocomplete Evaluation

2020-02-27 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Paras,

Thank you for this response! Yes, you are being clear :)

Regarding the assumptions you make for MRR, do you have any research papers to 
confirm that these user behaviors have been observed? I only ask because this 
paper http://yichang-cs.com/yahoo/sigir14_SearchAssist.pdf talks about how 
users often skip results and go straight to vanilla search even though their 
query is displayed in the top of the suggestions list (section 3.2 "QAC User 
Behavior Analysis"), among other behaviors that go against general IR 
intuition. This is only one paper, of course, but it seems that user research 
of QAC is hard to come by otherwise.

So acceptance rate = # of suggestions taken / total queries issued ?
And Selection to Display = # of suggestions taken (this would only be 1, if the 
not-taken suggestions are given 0s) / total suggestions displayed ?

If the above is true, wouldn't Selection to Display be binary? I.e. it's either 
1/# of suggestions displayed (assuming this is a constant) or 0?

Best,
Audrey



From: Paras Lehana 
Sent: Thursday, February 27, 2020 2:58:25 AM
To: solr-user@lucene.apache.org
Subject: [EXTERNAL] Re: Re: Re: Query Autocomplete Evaluation

Hi Audrey,

For MRR, we assume that if a suggestion is selected, it's relevant. It's
also assumed that the user will always click the highest relevant
suggestion. Thus, we calculate position selection for each selection. If
still, I'm not understanding your question correctly, feel free to contact
me personally (hangouts?).

And @Paras, the third and fourth evaluation metrics you listed in your
> first reply seem the same to me. What is the difference between the two?


I was expecting you to ask this - I should have explained a bit more.
Acceptance Rate is the share of all searches that come through Auto-Suggest.
Selection to Display, by contrast, scores 1 if a selection is made given that
suggestions were displayed, and 0 otherwise; here the universal set is only
the cases where suggestions are displayed. Acceptance Rate counts a 0 even for
searches where no selection was made because there were no results, while S/D
skips those - it only counts cases where results were displayed.

Hope I'm clear. :)
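The two ratios discussed above can be sketched concretely. The log records below are hypothetical; the point is that acceptance rate divides by all searches issued, while selection-to-display divides only by searches where suggestions were shown (each impression scoring a binary 0 or 1, so the aggregate is a mean of binary outcomes):

```python
# Sketch with hypothetical log records: each search records whether
# suggestions were displayed and whether one was selected.
searches = [
    {"displayed": True,  "selected": True},
    {"displayed": True,  "selected": False},
    {"displayed": False, "selected": False},  # no suggestions shown
    {"displayed": True,  "selected": True},
]

# Acceptance rate: selections over ALL searches issued.
acceptance_rate = sum(s["selected"] for s in searches) / len(searches)

# Selection to display: selections over only the searches where
# suggestions were actually displayed; the per-impression value is
# binary, so the reported ratio is the mean of 0/1 outcomes.
displayed = [s for s in searches if s["displayed"]]
selection_to_display = sum(s["selected"] for s in displayed) / len(displayed)

print(acceptance_rate)       # 0.5 (2 selections / 4 searches)
print(selection_to_display)  # ~0.667 (2 selections / 3 impressions)
```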

On Tue, 25 Feb 2020 at 21:10, Audrey Lorberfeld - audrey.lorberf...@ibm.com
 wrote:

> This article
> http://wwwconference.org/proceedings/www2011/proceedings/p107.pdf
>   also
> indicates that MRR needs binary relevance labels, p. 114: "To this end, we
> selected a random sample of 198 (query, context) pairs from the set of
> 7,311 pairs, and manually tagged each of them as related (i.e., the query
> is related to the context; 60% of the pairs) and unrelated (40% of the
> pairs)."
>
> On 2/25/20, 10:25 AM, "Audrey Lorberfeld - audrey.lorberf...@ibm.com" <
> audrey.lorberf...@ibm.com> wrote:
>
> Thank you, Walter & Paras!
>
> So, from the MRR equation, I was under the impression the suggestions
> all needed a binary label (0,1) indicating relevance.* But it's great to
> know that you guys use proxies for relevance, such as clicks.
>
> *The reason I think MRR has to have binary relevance labels is this
> Wikipedia article:
> https://en.wikipedia.org/wiki/Mean_reciprocal_rank
> , where it states below the formula that rank_i = "refers to the rank
> position of the first relevant document for the i-th query." If the
> suggestions are not labeled as relevant (0) or not relevant (1), then how
> do you compute the rank of the first RELEVANT document?
>
> I'll check out these readings asap, thank you!
>
> And @Paras, the third and fourth evaluation metrics you listed in your
> first reply seem the same to me. What is the difference between the two?
>
> Best,
> Audrey
>
> On 2/25/20, 1:11 AM, "Walter Underwood"  wrote:
>
> Here is a blog article with a worked example for MRR based on
> customer clicks.
>
>
> https://observer.wunderwood.org/2016/09/12/measuring-search-relevance-with-mrr/
>
> At my place of work, we compare the CTR and MRR of queries using
> suggestions to those that do not use suggestions. Solr autosuggest based on
> lexicon of book titles is highly effective for us.

Re: Re: Re: Query Autocomplete Evaluation

2020-02-25 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
This article http://wwwconference.org/proceedings/www2011/proceedings/p107.pdf 
also indicates that MRR needs binary relevance labels, p. 114: "To this end, we 
selected a random sample of 198 (query, context) pairs from the set of 7,311 
pairs, and manually tagged each of them as related (i.e., the query is related 
to the context; 60% of the pairs) and unrelated (40% of the pairs)."

On 2/25/20, 10:25 AM, "Audrey Lorberfeld - audrey.lorberf...@ibm.com" 
 wrote:

Thank you, Walter & Paras! 

So, from the MRR equation, I was under the impression the suggestions all 
needed a binary label (0,1) indicating relevance.* But it's great to know that 
you guys use proxies for relevance, such as clicks.

*The reason I think MRR has to have binary relevance labels is this 
Wikipedia article: 
https://en.wikipedia.org/wiki/Mean_reciprocal_rank, where it states below the formula that rank_i = "refers to the rank position 
of the first relevant document for the i-th query." If the suggestions are not 
labeled as relevant (0) or not relevant (1), then how do you compute the rank 
of the first RELEVANT document? 

I'll check out these readings asap, thank you!

And @Paras, the third and fourth evaluation metrics you listed in your 
first reply seem the same to me. What is the difference between the two?

Best,
Audrey

On 2/25/20, 1:11 AM, "Walter Underwood"  wrote:

Here is a blog article with a worked example for MRR based on customer 
clicks.


https://observer.wunderwood.org/2016/09/12/measuring-search-relevance-with-mrr/

At my place of work, we compare the CTR and MRR of queries using 
suggestions to those that do not use suggestions. Solr autosuggest based on 
lexicon of book titles is highly effective for us.

wunder
Walter Underwood
wun...@wunderwood.org

http://observer.wunderwood.org/
   (my blog)

> On Feb 24, 2020, at 9:52 PM, Paras Lehana 
 wrote:
> 
> Hey Audrey,
> 
> I assume MRR is about the ranking of the intended suggestion. For 
this, no
> human judgement is required. We track position selection - the 
position
> (1-10) of the selected suggestion. For example, this is our recent 
numbers:
> 
> Position 1 Selected (B3) 107,699
> Position 2 Selected (B4) 58,736
> Position 3 Selected (B5) 23,507
> Position 4 Selected (B6) 12,250
> Position 5 Selected (B7) 7,980
> Position 6 Selected (B8) 5,653
> Position 7 Selected (B9) 4,193
> Position 8 Selected (B10) 3,511
> Position 9 Selected (B11) 2,997
> Position 10 Selected (B12) 2,428
> *Total Selections (B13)* *228,954*
> MRR = (B3+B4/2+B5/3+B6/4+B7/5+B8/6+B9/7+B10/8+B11/9+B12/10)/B13 = 
66.45%
> 
> Refer here for MRR calculation keeping Auto-Suggest in perspective:
> 
> https://medium.com/@dtunkelang/evaluating-search-measuring-searcher-behavior-5f8347619eb0
> 
> "In practice, this is inverted to obtain the reciprocal rank, e.g., 
if the
> searcher clicks on the 4th result, the reciprocal rank is 0.25. The 
average
> of these reciprocal ranks is called the mean reciprocal rank (MRR)."
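Paras's MRR arithmetic quoted above can be reproduced in a few lines of Python, using the position-selection counts given in the thread; it lands on ~0.664, i.e. the quoted ~66.45% figure up to rounding:

```python
# Position-selection counts from the thread (position -> selections).
position_counts = {
    1: 107_699, 2: 58_736, 3: 23_507, 4: 12_250, 5: 7_980,
    6: 5_653, 7: 4_193, 8: 3_511, 9: 2_997, 10: 2_428,
}

total_selections = sum(position_counts.values())  # 228,954

# MRR: each selection at position p contributes a reciprocal rank of 1/p;
# average the reciprocal ranks over all selections.
mrr = sum(count / pos for pos, count in position_counts.items()) / total_selections

print(f"{mrr:.4f}")  # 0.6644, i.e. the ~66.45% reported in the thread
```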
> 
> nDCG may require human intervention. Please let me know in case I 
have not
    > understood your question properly. :)
> 
> 
> 
> On Mon, 24 Feb 2020 at 20:49, Audrey Lorberfeld - 
audrey.lorberf...@ibm.com
>  wrote:
> 
>> Hi Paras,
>> 
>> This is SO helpful, thank you. Quick question about your MRR metric 
-- do
>> you have binary human judgements for your suggestions? If no, how do 
you
>> label suggestions successful or not?
>> 
>> Best,

Re: Re: Query Autocomplete Evaluation

2020-02-25 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Thank you, Walter & Paras! 

So, from the MRR equation, I was under the impression the suggestions all 
needed a binary label (0,1) indicating relevance.* But it's great to know that 
you guys use proxies for relevance, such as clicks.

*The reason I think MRR has to have binary relevance labels is this Wikipedia 
article: https://en.wikipedia.org/wiki/Mean_reciprocal_rank, where it states 
below the formula that rank_i = "refers to the rank position of the first 
relevant document for the i-th query." If the suggestions are not labeled as 
relevant (0) or not relevant (1), then how do you compute the rank of the first 
RELEVANT document? 

I'll check out these readings asap, thank you!

And @Paras, the third and fourth evaluation metrics you listed in your first 
reply seem the same to me. What is the difference between the two?

Best,
Audrey

On 2/25/20, 1:11 AM, "Walter Underwood"  wrote:

Here is a blog article with a worked example for MRR based on customer 
clicks.


https://observer.wunderwood.org/2016/09/12/measuring-search-relevance-with-mrr/

At my place of work, we compare the CTR and MRR of queries using 
suggestions to those that do not use suggestions. Solr autosuggest based on 
lexicon of book titles is highly effective for us.

wunder
Walter Underwood
wun...@wunderwood.org

http://observer.wunderwood.org/
   (my blog)

> On Feb 24, 2020, at 9:52 PM, Paras Lehana  
wrote:
> 
> Hey Audrey,
> 
> I assume MRR is about the ranking of the intended suggestion. For this, no
> human judgement is required. We track position selection - the position
> (1-10) of the selected suggestion. For example, this is our recent 
numbers:
> 
> Position 1 Selected (B3) 107,699
> Position 2 Selected (B4) 58,736
> Position 3 Selected (B5) 23,507
> Position 4 Selected (B6) 12,250
> Position 5 Selected (B7) 7,980
> Position 6 Selected (B8) 5,653
> Position 7 Selected (B9) 4,193
> Position 8 Selected (B10) 3,511
> Position 9 Selected (B11) 2,997
> Position 10 Selected (B12) 2,428
> *Total Selections (B13)* *228,954*
> MRR = (B3+B4/2+B5/3+B6/4+B7/5+B8/6+B9/7+B10/8+B11/9+B12/10)/B13 = 66.45%
> 
> Refer here for MRR calculation keeping Auto-Suggest in perspective:
> 
> https://medium.com/@dtunkelang/evaluating-search-measuring-searcher-behavior-5f8347619eb0
> 
> "In practice, this is inverted to obtain the reciprocal rank, e.g., if the
> searcher clicks on the 4th result, the reciprocal rank is 0.25. The 
average
> of these reciprocal ranks is called the mean reciprocal rank (MRR)."
> 
> nDCG may require human intervention. Please let me know in case I have not
    > understood your question properly. :)
> 
> 
> 
> On Mon, 24 Feb 2020 at 20:49, Audrey Lorberfeld - 
audrey.lorberf...@ibm.com
>  wrote:
> 
>> Hi Paras,
>> 
>> This is SO helpful, thank you. Quick question about your MRR metric -- do
>> you have binary human judgements for your suggestions? If no, how do you
>> label suggestions successful or not?
>> 
>> Best,
>> Audrey
>> 
>> On 2/24/20, 2:27 AM, "Paras Lehana"  wrote:
>> 
>>Hi Audrey,
>> 
>>I work for Auto-Suggest at IndiaMART. Although we don't use the
>> Suggester
>>component, I think you need evaluation metrics for Auto-Suggest as a
>>business product and not specifically for Solr Suggester which is the
>>backend. We use edismax parser with EdgeNGrams Tokenization.
>> 
>>Every week, as the property owner, I report around 500 metrics. I 
would
>>like to mention a few of those:
>> 
>>   1. MRR (Mean Reciprocal Rank): How high the user selection was
>> among the
>>   returned result. Ranges from 0 to 1, the higher the better.
>>   2. APL (Average Prefix Length): Prefix is the query by user. Lesser
>> the
>>   better. This reports how little an average user has to type to get the intended suggestion.

Re: Re: Query Autocomplete Evaluation

2020-02-24 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Hi Paras,

This is SO helpful, thank you. Quick question about your MRR metric -- do you 
have binary human judgements for your suggestions? If no, how do you label 
suggestions successful or not?

Best,
Audrey

On 2/24/20, 2:27 AM, "Paras Lehana"  wrote:

Hi Audrey,

I work for Auto-Suggest at IndiaMART. Although we don't use the Suggester
component, I think you need evaluation metrics for Auto-Suggest as a
business product and not specifically for Solr Suggester which is the
backend. We use edismax parser with EdgeNGrams Tokenization.

Every week, as the property owner, I report around 500 metrics. I would
like to mention a few of those:

   1. MRR (Mean Reciprocal Rank): How high the user's selection ranks among the
   returned results. Ranges from 0 to 1, the higher the better.
   2. APL (Average Prefix Length): Prefix is the query typed by the user. Lesser the
   better: this reports how little an average user has to type to get the
   intended suggestion.
   3. Acceptance Rate or Selection: How many of the total searches are
   being served from Auto-Suggest. We are around 50%.
   4. Selection to Display Ratio: Did you make the user to click any of the
   suggestions if they are displayed?
   5. Response Time: How fast are you serving your average query.


The Selection and Response Time are our main KPIs. We track a lot about
Auto-Suggest usage on our platform which becomes apparent if you observe
the URL after clicking a suggestion on dir.indiamart.com. However, not
everything would benefit you. Do let me know for any related query or
explanation. Hope this helps. :)
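As context for the EdgeNGrams tokenization mentioned above: edge n-grams index every leading prefix of a term, so the partial query a user has typed so far can match full indexed terms. A rough Python sketch of the token stream such a filter produces (an illustration only, not Lucene's actual EdgeNGramTokenFilter):

```python
def edge_ngrams(term: str, min_gram: int = 1, max_gram: int = 10) -> list[str]:
    """Emit the leading prefixes of a term, mimicking what an edge
    n-gram filter indexes so that partial input matches a full term."""
    upper = min(max_gram, len(term))
    return [term[:n] for n in range(min_gram, upper + 1)]

# At index time "account" is stored as all of its prefixes, so the
# partial query "acc" matches it directly.
print(edge_ngrams("account", min_gram=2))
# ['ac', 'acc', 'acco', 'accou', 'accoun', 'account']
```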

On Fri, 14 Feb 2020 at 21:23, Audrey Lorberfeld - audrey.lorberf...@ibm.com
 wrote:

> Hi all,
>
> How do you all evaluate the success of your query autocomplete (i.e.
> suggester) component if you use it?
>
> We cannot use MRR for various reasons (I can go into them if you're
> interested), so we're thinking of using nDCG since we already use that for
> relevance eval of our system as a whole. I am also interested in the 
metric
> "success at top-k," but I can't find any research papers that explicitly
> define "success" -- I am assuming it's a suggestion (or suggestions)
> labeled "relevant," but maybe it could also simply be the suggestion that
> receives a click from the user?
>
> Would love to hear from the hive mind!
>
> Best,
> Audrey
>
> --
>
>
>

-- 
-- 
Regards,

*Paras Lehana* [65871]
Development Engineer, *Auto-Suggest*,
IndiaMART InterMESH Ltd,

11th Floor, Tower 2, Assotech Business Cresterra,
Plot No. 22, Sector 135, Noida, Uttar Pradesh, India 201305

Mob.: +91-9560911996
Work: 0120-4056700 | Extn:
*11096*





Query Autocomplete Evaluation

2020-02-14 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Hi all,

How do you all evaluate the success of your query autocomplete (i.e. suggester) 
component if you use it? 

We cannot use MRR for various reasons (I can go into them if you're 
interested), so we're thinking of using nDCG since we already use that for 
relevance eval of our system as a whole. I am also interested in the metric 
"success at top-k," but I can't find any research papers that explicitly define 
"success" -- I am assuming it's a suggestion (or suggestions) labeled 
"relevant," but maybe it could also simply be the suggestion that receives a 
click from the user?

Would love to hear from the hive mind!

Best,
Audrey

-- 
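On the "success at top-k" question above: one common, behavior-based reading treats a session as a success when the suggestion the user ends up clicking (a proxy for relevance, since explicit labels are scarce) appears within the top k suggestions shown. A sketch with hypothetical session data:

```python
# Hypothetical sessions: rank of the clicked suggestion (None = no click).
clicked_ranks = [1, 3, None, 7, 2, None, 4]

def success_at_k(ranks, k):
    """Fraction of sessions where the clicked suggestion appeared within
    the top-k displayed suggestions; the click stands in for a
    'relevant' judgment."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)

print(success_at_k(clicked_ranks, 3))  # 3 of 7 sessions clicked in top 3
print(success_at_k(clicked_ranks, 5))  # 4 of 7 sessions clicked in top 5
```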




Re: Re: Re: Anyone have experience with Query Auto-Suggestor?

2020-01-31 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Hi all, reviving this thread.

For those of you who use an external file for your suggestions, how do you 
decide from your query logs what suggestions to include? Just starting out with 
some exploratory analysis of clicks, dwell times, etc., and would love to hear 
from the community any advise.

Thanks!

Best,
Audrey

On 1/23/20, 2:26 PM, "Erik Hatcher"  wrote:

It's a great idea.   And then index that file into a separate lean 
collection of just the suggestions, along with the weight as another field on 
those documents, to use for ranking them at query time with standard /select 
queries.  (this separate suggest collection would also have appropriate 
tokenization to match the partial words as the user types, like ngramming)

Erik


> On Jan 20, 2020, at 11:54 AM, Audrey Lorberfeld - 
audrey.lorberf...@ibm.com  wrote:
> 
> David, 
> 
> Thank you, that is useful. So, would you recommend using a (clean) field 
over an external dictionary file? We have lots of "top queries" and measure 
their nDCG. A thought was to programmatically generate an external file where 
the weight per query term (or phrase) == its nDCG. Bad idea?
> 
> Best,
> Audrey
> 
> On 1/20/20, 11:51 AM, "David Hastings"  
wrote:
> 
>Ive used this quite a bit, my biggest piece of advice is to choose a 
field
>that you know is clean, with well defined terms/words, you dont want an
>autocomplete that has a massive dictionary, also it will make the
>start/reload times pretty slow
    > 
>    On Mon, Jan 20, 2020 at 11:47 AM Audrey Lorberfeld -
>audrey.lorberf...@ibm.com  wrote:
> 
>> Hi All,
>> 
>> We plan to incorporate a query autocomplete functionality into our search
>> engine (like this: 
>> https://lucene.apache.org/solr/guide/8_1/suggester.html
>> ). And I was wondering if anyone has personal experience with this
>> component and would like to share? Basically, we are just looking for 
some
>> best practices from more experienced Solr admins so that we have a 
starting
>> place to launch this in our beta.
>> 
>> Thank you!
>> 
>> Best,
>> Audrey
>> 
> 
> 
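For the external-file approach discussed in this thread: Solr's FileDictionaryFactory reads suggestions one per line with an optional tab-separated weight, so a weighted dictionary can be generated straight from scored query logs. A sketch, with hypothetical query/nDCG pairs standing in for real log-derived scores (using nDCG as the weight is the idea floated in the thread, not an established convention):

```python
# Hypothetical (query, nDCG) pairs derived from query-log evaluation.
scored_queries = [
    ("bluepages", 0.92),
    ("managerial accounting", 0.85),
    ("vacation policy", 0.71),
]

# FileDictionaryFactory expects "suggestion<TAB>weight" lines; the
# weight drives the ranking of returned suggestions.
with open("suggestions.txt", "w") as f:
    for query, ndcg in scored_queries:
        # Scale the 0-1 nDCG to an integer weight for readability.
        f.write(f"{query}\t{round(ndcg * 1000)}\n")
```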





Re: Re: Anyone have experience with Query Auto-Suggestor?

2020-01-26 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Oh, great! Thank you, this is helpful!

On 1/24/20, 6:43 PM, "Walter Underwood"  wrote:

Click-based weights are vulnerable to spamming. Some of us fondly remember 
when
Google was showing Microsoft as the first hit for “evil empire” thanks to a 
click attack.

For our ecommerce search, we use the actual titles of books weighted by 
order volume.
Decorated titles are reduced to a base title, so “Managerial Accounting: 
Student Value Edition”
becomes just “Managerial Accounting”. Showing all the variations is the job 
of the 
real results page.

wunder
Walter Underwood
wun...@wunderwood.org

http://observer.wunderwood.org/
   (my blog)

> On Jan 24, 2020, at 7:07 AM, Lucky Sharma  wrote:
> 
> Hi Audrey,
> As suggested by Erik, you can index the data into a seperate collection 
and
> You can instead of adding weights inthe document you can also use
> LTR(Learning to Rank) with in Solr to rerank on the documents.
> And also to increase more relevance with in the Autosuggestion and making
> positional context of the user in case of Multi token keywords you can 
also
> bigrams/trigrams to generate edge n-grams.
> 
> 
> 
> Regards,
> Lucky Sharma
> 
> On Fri, 24 Jan, 2020, 8:28 pm Lucky Sharma,  wrote:
> 
>> Hi Audrey,
>> As suggested by Erik, you can index the data into a seperate collection
>> and You can instead of adding weights inthe document you can also use LTR
>> with in Solr to rerank on the features.
>> 
>> Regards,
>> Lucky Sharma
>> 
>> On Fri, 24 Jan, 2020, 8:01 pm Audrey Lorberfeld -
>> audrey.lorberf...@ibm.com,  wrote:
>> 
>>> Erik,
>>> 
>>> Thank you! Yes, that's exactly how we were thinking of architecting it.
>>> And our ML engineer suggested something else for the suggestion weights,
>>> actually -- to build a model that would programmatically update the 
weights
>>> based on those suggestions' live clicks @ position k, etc. Pretty cool
>>> idea...
>>> 
>>> 
>>> 
>>> On 1/23/20, 2:26 PM, "Erik Hatcher"  wrote:
>>> 
>>>It's a great idea.   And then index that file into a separate lean
>>> collection of just the suggestions, along with the weight as another 
field
>>> on those documents, to use for ranking them at query time with standard
    >>> /select queries.  (this separate suggest collection would also have
>>> appropriate tokenization to match the partial words as the user types, 
like
>>> ngramming)
>>> 
>>>Erik
>>> 
>>> 
>>>> On Jan 20, 2020, at 11:54 AM, Audrey Lorberfeld -
>>> audrey.lorberf...@ibm.com  wrote:
>>>> 
>>>> David,
>>>> 
>>>> Thank you, that is useful. So, would you recommend using a (clean)
>>> field over an external dictionary file? We have lots of "top queries" 
and
>>> measure their nDCG. A thought was to programmatically generate an 
external
>>> file where the weight per query term (or phrase) == its nDCG. Bad idea?
>>>> 
>>>> Best,
>>>> Audrey
    >>>> 
>>>> On 1/20/20, 11:51 AM, "David Hastings" <
>>> hastings.recurs...@gmail.com> wrote:
>>>> 
>>>>   Ive used this quite a bit, my biggest piece of advice is to
>>> choose a field
>>>>   that you know is clean, with well defined terms/words, you dont
>>> want an
>>>>   autocomplete that has a massive dictionary, also it will make the
>>>>   start/reload times pretty slow
>>>> 
>>>>   On Mon, Jan 20, 2020 at 11:47 AM Audrey Lorberfeld -
>>>>   audrey.lorberf...@ibm.com  wrote:
>>>> 
>>>>> Hi All,
>>>>> 
>>>>> We plan to incorporate a query autocomplete functionality into our
>>> search
>>>>> engine (like this:
>>> 
>>>>> https://lucene.apache.org/solr/guide/8_1/suggester.html
>>>>> ). And I was wondering if anyone has personal experience with this
>>>>> component and would like to share? Basically, we are just looking
>>> for some
>>>>> best practices from more experienced Solr admins so that we have a
>>> starting
>>>>> place to launch this in our beta.
>>>>> 
>>>>> Thank you!
>>>>> 
>>>>> Best,
>>>>> Audrey
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> 





Re: Re: Re: Re: Anyone have experience with Query Auto-Suggestor?

2020-01-24 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
David,

True! But we are hoping that these are purely seen as suggestions and that 
people, if they know exactly what they are wanting to type/looking for, will 
simply ignore the dropdown options.

On 1/24/20, 10:03 AM, "David Hastings"  wrote:

This is a really cool idea!  My only concern is that the edge case
searches, where a user knows exactly what they want to find, would be
autocompleted into something that happens to be more "successful" rather
than what they were looking for.  for example, i want to know the legal
implications of jay z's 99 problems.   most of the autocompletes i imagine
would be for the lyrics for the song, or links to the video or jay z
himself, when what im looking for is a line by line analysis of the song
itself and how it relates to the fourth amendment:

http://pdf.textfiles.com/academics/lj56-2_mason_article.pdf

But in general this is a really clever idea, especially in the retail
arena.  However i suspect your use case is more in research, and after
years of dealing with lawyers and librarians, they tend to not like having
their searches intercepted, they know what they're looking for and they
tend to get mad if you assume they dont :)

On Fri, Jan 24, 2020 at 9:59 AM Lucky Sharma  wrote:

> Hi Audrey,
> As suggested by Erik, you can index the data into a seperate collection 
and
> You can instead of adding weights inthe document you can also use LTR with
> in Solr to rerank on the features.
>
> Regards,
> Lucky Sharma
    >
> On Fri, 24 Jan, 2020, 8:01 pm Audrey Lorberfeld -
> audrey.lorberf...@ibm.com,
>  wrote:
>
> > Erik,
> >
> > Thank you! Yes, that's exactly how we were thinking of architecting it.
> > And our ML engineer suggested something else for the suggestion weights,
> > actually -- to build a model that would programmatically update the
> weights
> > based on those suggestions' live clicks @ position k, etc. Pretty cool
> > idea...
> >
> >
> >
> > On 1/23/20, 2:26 PM, "Erik Hatcher"  wrote:
> >
> > It's a great idea.   And then index that file into a separate lean
> > collection of just the suggestions, along with the weight as another
> field
> > on those documents, to use for ranking them at query time with standard
> > /select queries.  (this separate suggest collection would also have
> > appropriate tokenization to match the partial words as the user types,
> like
> > ngramming)
> >
> > Erik
> >
> >
> > > On Jan 20, 2020, at 11:54 AM, Audrey Lorberfeld -
> > audrey.lorberf...@ibm.com  wrote:
> > >
> > > David,
> > >
> > > Thank you, that is useful. So, would you recommend using a (clean)
> > field over an external dictionary file? We have lots of "top queries" 
and
> > measure their nDCG. A thought was to programmatically generate an
> external
> > file where the weight per query term (or phrase) == its nDCG. Bad idea?
> > >
> > > Best,
> > > Audrey
> > >
> > > On 1/20/20, 11:51 AM, "David Hastings" <
> hastings.recurs...@gmail.com>
> > wrote:
> > >
> > >Ive used this quite a bit, my biggest piece of advice is to
> > choose a field
> > >that you know is clean, with well defined terms/words, you dont
> > want an
> > >autocomplete that has a massive dictionary, also it will make
> the
> > >start/reload times pretty slow
> > >
> > >On Mon, Jan 20, 2020 at 11:47 AM Audrey Lorberfeld -
> > >audrey.lorberf...@ibm.com  wrote:
> > >
> > >> Hi All,
> > >>
> > >> We plan to incorporate a query autocomplete functionality into 
our
> > search
> > >> engine (like this:
> >
> 
> > >> https://lucene.apache.org/solr/guide/8_1/suggester.html
> > >> ). And I was wondering if anyone has personal experience with 
this
> > >> component and would like to share? Basically, we are just looking
> > for some
> > >> best practices from more experienced Solr admins so that we have 
a
> > starting
> > >> place to launch this in our beta.
> > >>
> > >> Thank you!
> > >>
> > >> Best,
> > >> Audrey
> > >>
> > >
> > >
> >
> >
> >
> >
>




Re: Re: Re: Anyone have experience with Query Auto-Suggestor?

2020-01-24 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Hi Alessandro,

I'm so happy there is someone who's done extensive work with QAC here! 

Right now, we measure nDCG via a Dynamic Bayesian Network. To break it down, 
we: 
- use a DBN model to generate a "score" for each query_url pair. 
- We then plug that score into a mathematical formula we found in a research 
paper (happy to share the paper if you're interested) for assigning labels 0-4. 
- We then cross-reference the scored & labeled query_url pairs with 1k of our 
system's top queries and 1k of our system's random queries. 
- We use that dataset as our ground truth. 
- We then query the system in real time each day for those 2k queries, label 
them, and compare those labels with our ground truth to get our system's nDCG. 

I hope that makes sense! Lots of steps :)
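Once the 0-4 labels from the pipeline above exist, the nDCG computation itself is standard. A compact sketch, using the common exponential gain (2^rel - 1) with a log2 rank discount; the label list is hypothetical:

```python
import math

def dcg(labels):
    """Discounted cumulative gain: exponential gain (2^rel - 1),
    discounted by log2(rank + 2) for 0-based ranks."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(labels))

def ndcg(labels):
    """nDCG: DCG of the observed ranking divided by the DCG of the
    ideal ranking (labels sorted descending)."""
    ideal = dcg(sorted(labels, reverse=True))
    return dcg(labels) / ideal if ideal > 0 else 0.0

# Hypothetical 0-4 relevance labels for one query's top five results.
print(round(ndcg([3, 2, 0, 4, 1]), 4))  # ~0.7373
```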

Due to computational overhead reasons, we are pretty committed to using an 
external file & a separate Solr core for our suggestions. We are also planning 
to use the Suggester to add a little human nudge towards "successful" queries. 
I'm not sure whether that's what the Suggester is really meant to do, but we 
are not using it as a naïve prefix-matcher, but more of a query-suggestion 
tool. So, if we know that the query "blue pages" is less successful than the 
query "bluepages" (assuming we can identify the user's intent with this query), 
we will not show suggestions that match "blue pages," instead we will show 
suggestions that match "bluepages." Sort of like a query rewrite, except with 
fuzzy prefix matching, not the introduction of synonyms/expansions.

What we are concerned with currently is how to define a "successful" query. We 
have things like abandonment rate, dwell time, etc., but if you have any advice 
on more ways to identify successful queries, that'd be great. We want to stay 
away from defining success as "popularity," since that will just create a 
closed language system where people only query popular queries, and those 
queries stay popular only because people are querying them (assuming people 
click on the suggestions, of course).

Let me know your thoughts!

On 1/23/20, 10:45 AM, "Alessandro Benedetti"  wrote:

I have been working extensively on query autocompletion, these blogs should
be helpful to you:


https://sease.io/2015/07/solr-you-complete-me.html

https://sease.io/2018/06/apache-lucene-blendedinfixsuggester-how-it-works-bugs-and-improvements.html

You idea of using search quality evaluation to drive the autocompletion is
interesting.
How do you currently calculate the NDCG for a query? What's your golden
truth?
Using that approach you will autocomplete favouring query completion that
your search engine is able to process better, not necessarily closer to the
user intent, still it could work.

We should differentiate here between the suggester dictionary (where the
suggestions come from, in your case it could be your extracted data) and
the kind of suggestion (that in your case could be the free text suggester
lookup)

Cheers
--
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
www.sease.io


On Mon, 20 Jan 2020 at 17:02, David Hastings 
wrote:

> Not a bad idea at all, however ive never used an external file before, 
just
> a field in the index, so not an area im familiar with
>
> On Mon, Jan 20, 2020 at 11:55 AM Audrey Lorberfeld -
> audrey.lorberf...@ibm.com  wrote:
>
> > David,
> >
> > Thank you, that is useful. So, would you recommend using a (clean) field
> > over an external dictionary file? We have lots of "top queries" and
> measure
> > their nDCG. A thought was to programmatically generate an external file
> > where the weight per query term (or phrase) == its nDCG. Bad idea?
> >
> > Best,
> > Audrey
> >
> > On 1/20/20, 11:51 AM, "David Hastings" 
> > wrote:
> >
> > Ive used this quite a bit, my biggest piece of advice is to choose a
> > field
> > that you know is clean, with well defined terms/words, you dont want
> an
> > autocomplete that has a massive dictionary, also it will make the
> > start/reload times pretty slow
>

Re: Re: Re: Anyone have experience with Query Auto-Suggestor?

2020-01-24 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Erik,

Thank you! Yes, that's exactly how we were thinking of architecting it. And our 
ML engineer suggested something else for the suggestion weights, actually -- to 
build a model that would programmatically update the weights based on those 
suggestions' live clicks @ position k, etc. Pretty cool idea... 
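For anyone curious, the file-generation step could be sketched like this (illustrative data and names, not our production code; Solr's file-based suggester dictionary, FileDictionaryFactory, reads tab-separated "suggestion<TAB>weight" lines):

```python
# Hypothetical sketch: turn per-query nDCG scores into a tab-separated
# dictionary file for Solr's FileDictionaryFactory ("term<TAB>weight" lines).
def write_suggest_dictionary(ndcg_by_query: dict, path: str) -> None:
    """Write each query and its nDCG-derived weight, best queries first."""
    with open(path, "w", encoding="utf-8") as f:
        # Order is cosmetic -- Solr ranks suggestions by the weight column.
        for query, ndcg in sorted(ndcg_by_query.items(), key=lambda kv: -kv[1]):
            f.write(f"{query}\t{ndcg}\n")

# Illustrative data only; real weights would come from the evaluation pipeline.
ndcg_by_query = {
    "benefits enrollment": 0.82,
    "vpn setup": 0.91,
    "travel policy": 0.67,
}
write_suggest_dictionary(ndcg_by_query, "suggest_dictionary.txt")
```

The file can then be referenced from the suggester's `sourceLocation` in solrconfig.xml.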



On 1/23/20, 2:26 PM, "Erik Hatcher"  wrote:

It's a great idea.   And then index that file into a separate lean 
collection of just the suggestions, along with the weight as another field on 
those documents, to use for ranking them at query time with standard /select 
queries.  (this separate suggest collection would also have appropriate 
tokenization to match the partial words as the user types, like ngramming)
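A sketch of what such a suggest collection's field type could look like, assuming edge n-grams at index time for the type-ahead matching (type and parameter values are illustrative, not a prescription):

```xml
<fieldType name="text_suggest" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- index prefixes so partial words match as the user types -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="20"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```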

Erik


> On Jan 20, 2020, at 11:54 AM, Audrey Lorberfeld - 
audrey.lorberf...@ibm.com  wrote:
> 
> David, 
> 
> Thank you, that is useful. So, would you recommend using a (clean) field 
over an external dictionary file? We have lots of "top queries" and measure 
their nDCG. A thought was to programmatically generate an external file where 
the weight per query term (or phrase) == its nDCG. Bad idea?
> 
> Best,
> Audrey
> 
> On 1/20/20, 11:51 AM, "David Hastings"  
wrote:
> 
>Ive used this quite a bit, my biggest piece of advice is to choose a 
field
>that you know is clean, with well defined terms/words, you dont want an
>autocomplete that has a massive dictionary, also it will make the
>start/reload times pretty slow
    > 
>    On Mon, Jan 20, 2020 at 11:47 AM Audrey Lorberfeld -
>audrey.lorberf...@ibm.com  wrote:
> 
>> Hi All,
>> 
>> We plan to incorporate a query autocomplete functionality into our search
>> engine (like this: 
https://urldefense.proofpoint.com/v2/url?u=https-3A__lucene.apache.org_solr_guide_8-5F1_suggester.html=DwIBaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=L8V-izaMW_v4j-1zvfiXSqm6aAoaRtk-VJXA6okBs_U=vnE9KGyF3jky9fSi22XUJEEbKLM1CA7mWAKrl2qhKC0=
 
>> ). And I was wondering if anyone has personal experience with this
>> component and would like to share? Basically, we are just looking for 
some
>> best practices from more experienced Solr admins so that we have a 
starting
>> place to launch this in our beta.
>> 
>> Thank you!
>> 
>> Best,
>> Audrey
>> 
> 
> 





Re: Re: Re: Re: Handling overlapping synonyms

2020-01-20 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Hm, I'm not sure what you mean, but I am pretty new to Solr. Apologies!

On 1/20/20, 12:01 PM, "fiedzia"  wrote:

>From my understanding, if you want regional sales manager to be indexed as
>both director of sales and area manager, you
>would have to type:
>
>Regional sales manager -> director of sales, area manager

that works for searching, but because everything is in the same position, 
searching for "director of sales" highlights whole "regional sales manager".

while it should be indexed as (numbers indicate token positions):

1        2     3
regional sales manager

1
area manager

         2
         director of sales


I guess I'll need to override SynonymGraphFilter to achieve that



--
Sent from: 
https://urldefense.proofpoint.com/v2/url?u=https-3A__lucene.472066.n3.nabble.com_Solr-2DUser-2Df472068.html=DwICAg=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=tDOfGxVxBgFG1YZDv8WICuXs07jdb2IIpoJ0j3Fu7nc=yT0_rHgmEbHTvjxL9Vw9TN3d0TeqHg6avTkuseDWDw8=
 




Re: Re: Anyone have experience with Query Auto-Suggestor?

2020-01-20 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
David, 

Thank you, that is useful. So, would you recommend using a (clean) field over 
an external dictionary file? We have lots of "top queries" and measure their 
nDCG. A thought was to programmatically generate an external file where the 
weight per query term (or phrase) == its nDCG. Bad idea?

Best,
Audrey

On 1/20/20, 11:51 AM, "David Hastings"  wrote:

Ive used this quite a bit, my biggest piece of advice is to choose a field
that you know is clean, with well defined terms/words, you dont want an
autocomplete that has a massive dictionary, also it will make the
start/reload times pretty slow

On Mon, Jan 20, 2020 at 11:47 AM Audrey Lorberfeld -
    audrey.lorberf...@ibm.com  wrote:

> Hi All,
>
> We plan to incorporate a query autocomplete functionality into our search
> engine (like this: 
https://urldefense.proofpoint.com/v2/url?u=https-3A__lucene.apache.org_solr_guide_8-5F1_suggester.html=DwIBaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=L8V-izaMW_v4j-1zvfiXSqm6aAoaRtk-VJXA6okBs_U=vnE9KGyF3jky9fSi22XUJEEbKLM1CA7mWAKrl2qhKC0=
 
> ). And I was wondering if anyone has personal experience with this
> component and would like to share? Basically, we are just looking for some
> best practices from more experienced Solr admins so that we have a 
starting
> place to launch this in our beta.
>
> Thank you!
>
> Best,
> Audrey
>




Anyone have experience with Query Auto-Suggestor?

2020-01-20 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Hi All,

We plan to incorporate a query autocomplete functionality into our search 
engine (like this: https://lucene.apache.org/solr/guide/8_1/suggester.html
). And I was wondering if anyone has personal experience with this component 
and would like to share? Basically, we are just looking for some best practices 
from more experienced Solr admins so that we have a starting place to launch 
this in our beta.

Thank you!

Best,
Audrey


Re: Re: Re: Handling overlapping synonyms

2020-01-20 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
From my understanding, if you want regional sales manager to be indexed as both 
director of sales and area manager, you would have to type: 

Regional sales manager -> director of sales, area manager

I do not believe you can chain synonyms.
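For reference, in Solr's synonyms-file syntax that rule would be written with `=>` (one rule per line; rules are applied independently and not chained across lines):

```text
sales manager => director of sales
regional sales manager => director of sales, area manager
```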

Re: bigrams/trigrams, I was more interested in you wanting to manually create 
them by inserting a "_" between the tokens. There is a bigram / trigram 
capability OOTB with Solr, so is there a reason you're manually coding these 
into your index instead of just using the OOTB function?

On 1/20/20, 6:58 AM, "fiedzia"  wrote:

> what is the reasoning behind adding the bigrams and trigrams manually like
that? Maybe if we knew the end goal, we could figure out a different
strategy. Happy that at least the matching is working now! 

I have a large amount of synonyms and keep adding new ones, some of them
partially overlap. It's the nature of language that adding keywords to a
phrase creates a distinct meaning. Another example:


sales manager -> director of sales
regional sales manager -> area manager

I'd expect "regional sales manager" to be indexed as both.

regional sales manager
^^ -> director of sales
^^ -> area manager

so that searching for any of those terms matches and highlights relevant
part.
However when SynonymGraphFilter finds one synonym it will ignore the other.



--
Sent from: 
https://urldefense.proofpoint.com/v2/url?u=https-3A__lucene.472066.n3.nabble.com_Solr-2DUser-2Df472068.html=DwICAg=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=JUEk2QAGcPS4Pi_y6d3EWDmtYMVjg2Sg-4ZwC-90VqE=tgepeqV5fWmuUgtTc767hv_1czuJnhM9O9LmWVgpDdM=
 




Re: Re: Handling overlapping synonyms

2020-01-17 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Hmm, what is the reasoning behind adding the bigrams and trigrams manually 
like that? Maybe if we knew the end goal, we could figure out a different 
strategy. Happy that at least the matching is working now!

On 1/17/20, 10:28 AM, "fiedzia"  wrote:

> Doing it the other way (new york city -> new_york_city, new_york) makes
more
sense,

Just checked it, that way does the matching as expected, but highlighting is
wrong
("new york: query matches "new york city" as it should, but also highlights
all of it)



--
Sent from: 
https://urldefense.proofpoint.com/v2/url?u=https-3A__lucene.472066.n3.nabble.com_Solr-2DUser-2Df472068.html=DwICAg=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=sxUM_HkySPw_KqJdqMGkjWQyUQ6W7K44Nid7p7wcBJ4=rJFkuEpTxkPp6EtyRstEE3PWCY-CSAmtjOFJ9ge67uU=
 




Re: Handling overlapping synonyms

2020-01-17 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
If you instead write "new york => new_york, new_york_city" it should work 
(https://doc.lucidworks.com/fusion/3.1/Collections/Synonyms-Files.html)
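A sketch of the analyzer wiring for this, assuming SynonymGraphFilter (note that at index time a graph filter must be followed by FlattenGraphFilter; file and type names are illustrative):

```xml
<!-- synonyms.txt would contain, e.g.:
     new york => new_york, new_york_city
-->
<fieldType name="text_syn" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.FlattenGraphFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>
```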

On 1/17/20, 6:29 AM, "fiedzia"  wrote:

Having synonyms defined for

new york  -> new_york
new york city -> new_york_city

I'd like the phrase
new york city
to be indexed as both, but SynonymGraphFilter picks only one. Is there a way
around that?

-- 
Maciej Dziardziel
fied...@gmail.com



--
Sent from: 
https://urldefense.proofpoint.com/v2/url?u=https-3A__lucene.472066.n3.nabble.com_Solr-2DUser-2Df472068.html=DwICAg=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=ogoT0t33fiW87_QMoUn_sWWs_DWHiunR_gq1iXkMR8I=3mtCduryNf-zp79DbcKRtn2hSOWWtgbmYX4idUg1VB0=
 




Re: Ref Guide - Precision & Recall of Analyzers

2019-11-06 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
I would also love to know what filter to use to ignore capitalized acronyms... 
which one can do this OOTB?

-- 
Audrey Lorberfeld
Data Scientist, w3 Search
IBM
audrey.lorberf...@ibm.com
 

On 11/6/19, 3:54 AM, "Paras Lehana"  wrote:

Hi Community,

In Ref Guide 8.3's *Understanding Analyzers, Tokenizers, and Filters*


section, the text talks about precision and recall depending on how you use
analyzers during query and index time:

For indexing, you often want to simplify, or normalize, words. For example,
> setting all letters to lowercase, eliminating punctuation and accents,
> mapping words to their stems, and so on. Doing so can *increase recall 
*because,
> for example, "ram", "Ram" and "RAM" would all match a query for "ram". To 
*increase
> query-time precision*, a filter could be employed to narrow the matches
> by, for example, *ignoring all-cap acronyms* if you’re interested in male
> sheep, but not Random Access Memory.


In first case (about Recall), is it assumed that "ram" should match to all
three? *[Q1] *Because, to increase recall, we have to decrease false
negatives (documents not retrieved but are relevant). In other case (if the
three are not intended to match the query), precision is actually decreased
here (false positives are increased).

This makes sense for the second case, where precision should increase as we
are decreasing false positives (documents marked relevant wrongly).

However, the text talks about the method of "employing a filter that
ignores all-cap acronyms". How are we supposed to do that on query time?
*[Q2]* Weren't we supposed to remove filter (LCF) during the index time?


-- 
-- 
Regards,

*Paras Lehana* [65871]
Development Engineer, Auto-Suggest,
IndiaMART Intermesh Ltd.

8th Floor, Tower A, Advant-Navis Business Park, Sector 142,
Noida, UP, IN - 201303

Mob.: +91-9560911996
Work: 01203916600 | Extn:  *8173*





Re: Re: Re: POS Tagger

2019-10-25 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Oh I see I see 

-- 
Audrey Lorberfeld
Data Scientist, w3 Search
IBM
audrey.lorberf...@ibm.com
 

On 10/25/19, 12:21 PM, "David Hastings"  wrote:

oh i see what you mean, sorry, i explained it incorrectly.
 those sentences are what would be in the index, and a general search for
'rush limbaugh' would come back with results where he is an entity higher
than if it was two words in a sentence

On Fri, Oct 25, 2019 at 12:12 PM David Hastings <
hastings.recurs...@gmail.com> wrote:

> nope, i boost the fields already tagged at query time against the query
>
> On Fri, Oct 25, 2019 at 12:11 PM Audrey Lorberfeld -
> audrey.lorberf...@ibm.com  wrote:
>
>> So then you do run your POS tagger at query-time, Dave?
>>
>> --
>> Audrey Lorberfeld
>> Data Scientist, w3 Search
>> IBM
>> audrey.lorberf...@ibm.com
>>
>>
>> On 10/25/19, 12:06 PM, "David Hastings" 
>> wrote:
>>
>> I use them for query boosting, so if someone searches for:
>>
>> i dont want to rush limbaugh out the door
>> vs
>> i talked to rush limbaugh through the door
>>
>> my documents where 'rush limbaugh' is a known entity (noun) and a
>> person
>> (look at the sentence, its obviously a person and the nlp finds that)
>> have
>> 'rush limbaugh' stored in a field, which is boosted on queries.  this
>> makes
>> sure results from the second query with him as a person will be
>> boosted
>> above those from the first query
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Fri, Oct 25, 2019 at 11:57 AM Nicolas Paris <
>> nicolas.pa...@riseup.net>
>> wrote:
>>
>> > Also we are using stanford POS tagger for french. The processing
>> time is
>> > mitigated by the spark-corenlp package which distribute the process
>> over
>> > multiple node.
>> >
>> > Also I am interesting in the way you use POS information within 
solr
>> > queries, or solr fields.
>> >
>> > Thanks,
>> > On Fri, Oct 25, 2019 at 10:42:43AM -0400, David Hastings wrote:
>> > > ah, yeah its not the fastest but it proved to be the best for my
>> > purposes,
>> > > I use it to pre-process data before indexing, to apply more
>> metadata to
>> > the
>> > > documents in a separate field(s)
>> > >
>> > > On Fri, Oct 25, 2019 at 10:40 AM Audrey Lorberfeld -
>> > > audrey.lorberf...@ibm.com  wrote:
>> > >
>> > > > No, I meant for part-of-speech tagging __ But that's
>> interesting that
>> > you
>> > > > use StanfordNLP. I've read that it's very slow, so we are
>> concerned
>> > that it
>> > > > might not work for us at query-time. Do you use it at
>> query-time, or
>> > just
>> > > > index-time?
>> > > >
>> > > > --
>> > > > Audrey Lorberfeld
>> > > > Data Scientist, w3 Search
>> > > > IBM
>> > > > audrey.lorberf...@ibm.com
>> > > >
>> > > >
>> > > > On 10/25/19, 10:30 AM, "David Hastings" <
>> hastings.recurs...@gmail.com
>> > >
>> > > > wrote:
>> > > >
>> > > > Do you mean for entity extraction?
>> > > > I make a LOT of use from the stanford nlp project, and get
>> out the
>> > > > entities
>> > > > and use them for different purposes in solr
>> > > > -Dave
>> > > >
>> > > > On Fri, Oct 25, 2019 at 10:16 AM Audrey Lorberfeld -
>> > > > audrey.lorberf...@ibm.com 
>> wrote:
>> > > >
>> > > > > Hi All,
>> > > > >
>> > > > > Does anyone use a POS tagger with their Solr instance
>> other than
>> > > > > OpenNLP’s? We are considering OpenNLP, SpaCy, and Watson.
>> > > > >
>> > > > > Thanks!
>> > > > >
>> > > > > --
>> > > > > Audrey Lorberfeld
>> > > > > Data Scientist, w3 Search
>> > > > > IBM
>> > > > > audrey.lorberf...@ibm.com
>> > > > >
>> > > > >
>> > > >
>> > > >
>> > > >
>> >
>> > --
>> > nicolas
>> >
>>
>>
>>




Re: Re: Re: POS Tagger

2019-10-25 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
How can a field itself be tagged with a part of speech?

-- 
Audrey Lorberfeld
Data Scientist, w3 Search
IBM
audrey.lorberf...@ibm.com
 

On 10/25/19, 12:12 PM, "David Hastings"  wrote:

nope, i boost the fields already tagged at query time against the query

On Fri, Oct 25, 2019 at 12:11 PM Audrey Lorberfeld -
audrey.lorberf...@ibm.com  wrote:

> So then you do run your POS tagger at query-time, Dave?
>
> --
> Audrey Lorberfeld
> Data Scientist, w3 Search
> IBM
> audrey.lorberf...@ibm.com
>
>
> On 10/25/19, 12:06 PM, "David Hastings" 
> wrote:
>
> I use them for query boosting, so if someone searches for:
>
> i dont want to rush limbaugh out the door
> vs
> i talked to rush limbaugh through the door
>
> my documents where 'rush limbaugh' is a known entity (noun) and a
> person
> (look at the sentence, its obviously a person and the nlp finds that)
> have
> 'rush limbaugh' stored in a field, which is boosted on queries.  this
> makes
> sure results from the second query with him as a person will be 
boosted
> above those from the first query
>
>
>
>
>
>
>
>
>
>
>
>
> On Fri, Oct 25, 2019 at 11:57 AM Nicolas Paris <
> nicolas.pa...@riseup.net>
> wrote:
>
> > Also we are using stanford POS tagger for french. The processing
> time is
> > mitigated by the spark-corenlp package which distribute the process
> over
> > multiple node.
> >
> > Also I am interesting in the way you use POS information within solr
> > queries, or solr fields.
> >
> > Thanks,
> > On Fri, Oct 25, 2019 at 10:42:43AM -0400, David Hastings wrote:
> > > ah, yeah its not the fastest but it proved to be the best for my
> > purposes,
> > > I use it to pre-process data before indexing, to apply more
> metadata to
> > the
> > > documents in a separate field(s)
> > >
> > > On Fri, Oct 25, 2019 at 10:40 AM Audrey Lorberfeld -
> > > audrey.lorberf...@ibm.com  wrote:
> > >
> > > > No, I meant for part-of-speech tagging __ But that's interesting
> that
> > you
> > > > use StanfordNLP. I've read that it's very slow, so we are
> concerned
> > that it
> > > > might not work for us at query-time. Do you use it at
> query-time, or
> > just
> > > > index-time?
> > > >
> > > > --
> > > > Audrey Lorberfeld
> > > > Data Scientist, w3 Search
> > > > IBM
> > > > audrey.lorberf...@ibm.com
> > > >
> > > >
> > > > On 10/25/19, 10:30 AM, "David Hastings" <
> hastings.recurs...@gmail.com
> > >
> > > > wrote:
> > > >
> > > > Do you mean for entity extraction?
> > > > I make a LOT of use from the stanford nlp project, and get
> out the
> > > > entities
> > > > and use them for different purposes in solr
> > > > -Dave
> > > >
> > > > On Fri, Oct 25, 2019 at 10:16 AM Audrey Lorberfeld -
> > > > audrey.lorberf...@ibm.com  wrote:
> > > >
> > > > > Hi All,
> > > > >
> > > > > Does anyone use a POS tagger with their Solr instance
> other than
> > > > > OpenNLP’s? We are considering OpenNLP, SpaCy, and Watson.
> > > > >
> > > > > Thanks!
> > > > >
> > > > > --
> > > > > Audrey Lorberfeld
> > > > > Data Scientist, w3 Search
> > > > > IBM
> > > > > audrey.lorberf...@ibm.com
> > > > >
> > > > >
> > > >
> > > >
> > > >
> >
> > --
> > nicolas
> >
>
>
>




Re: Re: POS Tagger

2019-10-25 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
So then you do run your POS tagger at query-time, Dave?

-- 
Audrey Lorberfeld
Data Scientist, w3 Search
IBM
audrey.lorberf...@ibm.com
 

On 10/25/19, 12:06 PM, "David Hastings"  wrote:

I use them for query boosting, so if someone searches for:

i dont want to rush limbaugh out the door
vs
i talked to rush limbaugh through the door

my documents where 'rush limbaugh' is a known entity (noun) and a person
(look at the sentence, its obviously a person and the nlp finds that) have
'rush limbaugh' stored in a field, which is boosted on queries.  this makes
sure results from the second query with him as a person will be boosted
above those from the first query
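The boost David describes could be expressed, for example, as an edismax qf entry weighting a hypothetical entity field (field names and weights here are illustrative assumptions, not his actual config):

```xml
<str name="qf">content^1.0 person_entities^5.0</str>
```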












On Fri, Oct 25, 2019 at 11:57 AM Nicolas Paris 
wrote:

> Also we are using stanford POS tagger for french. The processing time is
> mitigated by the spark-corenlp package which distribute the process over
> multiple node.
>
> Also I am interesting in the way you use POS information within solr
> queries, or solr fields.
>
> Thanks,
> On Fri, Oct 25, 2019 at 10:42:43AM -0400, David Hastings wrote:
> > ah, yeah its not the fastest but it proved to be the best for my
> purposes,
> > I use it to pre-process data before indexing, to apply more metadata to
> the
> > documents in a separate field(s)
> >
    > > On Fri, Oct 25, 2019 at 10:40 AM Audrey Lorberfeld -
> > audrey.lorberf...@ibm.com  wrote:
> >
> > > No, I meant for part-of-speech tagging __ But that's interesting that
> you
> > > use StanfordNLP. I've read that it's very slow, so we are concerned
> that it
> > > might not work for us at query-time. Do you use it at query-time, or
> just
> > > index-time?
> > >
> > > --
> > > Audrey Lorberfeld
> > > Data Scientist, w3 Search
> > > IBM
> > > audrey.lorberf...@ibm.com
> > >
> > >
> > > On 10/25/19, 10:30 AM, "David Hastings"  >
> > > wrote:
> > >
> > > Do you mean for entity extraction?
    > > > I make a LOT of use from the stanford nlp project, and get out the
> > > entities
> > > and use them for different purposes in solr
> > > -Dave
> > >
> > > On Fri, Oct 25, 2019 at 10:16 AM Audrey Lorberfeld -
> > > audrey.lorberf...@ibm.com  wrote:
> > >
> > > > Hi All,
> > > >
> > > > Does anyone use a POS tagger with their Solr instance other than
> > > > OpenNLP’s? We are considering OpenNLP, SpaCy, and Watson.
> > > >
> > > > Thanks!
> > > >
> > > > --
> > > > Audrey Lorberfeld
> > > > Data Scientist, w3 Search
> > > > IBM
> > > > audrey.lorberf...@ibm.com
> > > >
> > > >
> > >
> > >
> > >
>
> --
> nicolas
>




Re: Re: POS Tagger

2019-10-25 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Nicolas,

Do you use the POS tagger at query time, or just at index time? 

We are thinking of using it to filter the tokens we will eventually perform ML 
on. Basically, we have a bunch of acronyms in our corpus. However, many 
departments use the same acronyms but expand those acronyms to different 
things. Eventually, we are thinking of using ML on our index to determine which 
expansion is meant by a particular query according to the context we find in 
certain documents. However, since we don't want to run ML on all tokens in a 
query, and since we think that acronyms are usually the nouns in a multi-token 
query, we want to only feed nouns to the ML model (TBD).

Does that make sense? So, we'd want both an index-side POS tagger (could be 
slow), and also a query-side POS tagger (must be fast).

-- 
Audrey Lorberfeld
Data Scientist, w3 Search
IBM
audrey.lorberf...@ibm.com
 

On 10/25/19, 11:57 AM, "Nicolas Paris"  wrote:

Also we are using stanford POS tagger for french. The processing time is
mitigated by the spark-corenlp package which distribute the process over
multiple node.

Also I am interesting in the way you use POS information within solr
queries, or solr fields. 

Thanks,
On Fri, Oct 25, 2019 at 10:42:43AM -0400, David Hastings wrote:
> ah, yeah its not the fastest but it proved to be the best for my purposes,
> I use it to pre-process data before indexing, to apply more metadata to 
the
> documents in a separate field(s)
> 
> On Fri, Oct 25, 2019 at 10:40 AM Audrey Lorberfeld -
> audrey.lorberf...@ibm.com  wrote:
> 
> > No, I meant for part-of-speech tagging __ But that's interesting that 
you
> > use StanfordNLP. I've read that it's very slow, so we are concerned 
that it
> > might not work for us at query-time. Do you use it at query-time, or 
just
> > index-time?
> >
> > --
> > Audrey Lorberfeld
> > Data Scientist, w3 Search
> > IBM
> > audrey.lorberf...@ibm.com
> >
> >
> > On 10/25/19, 10:30 AM, "David Hastings" 
> > wrote:
> >
> > Do you mean for entity extraction?
> > I make a LOT of use from the stanford nlp project, and get out the
    > > entities
> > and use them for different purposes in solr
> > -Dave
> >
> > On Fri, Oct 25, 2019 at 10:16 AM Audrey Lorberfeld -
> > audrey.lorberf...@ibm.com  wrote:
> >
> > > Hi All,
> > >
> > > Does anyone use a POS tagger with their Solr instance other than
> > > OpenNLP’s? We are considering OpenNLP, SpaCy, and Watson.
> > >
> > > Thanks!
> > >
> > > --
> > > Audrey Lorberfeld
> > > Data Scientist, w3 Search
> > > IBM
> > > audrey.lorberf...@ibm.com
> > >
> > >
> >
> >
> >

-- 
nicolas




Re: Re: POS Tagger

2019-10-25 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
No, I meant for part-of-speech tagging. But that's interesting that you use 
StanfordNLP. I've read that it's very slow, so we are concerned that it might 
not work for us at query-time. Do you use it at query-time, or just index-time?

-- 
Audrey Lorberfeld
Data Scientist, w3 Search
IBM
audrey.lorberf...@ibm.com
 

On 10/25/19, 10:30 AM, "David Hastings"  wrote:

Do you mean for entity extraction?
I make a LOT of use from the stanford nlp project, and get out the entities
and use them for different purposes in solr
-Dave

On Fri, Oct 25, 2019 at 10:16 AM Audrey Lorberfeld -
audrey.lorberf...@ibm.com  wrote:

> Hi All,
>
> Does anyone use a POS tagger with their Solr instance other than
> OpenNLP’s? We are considering OpenNLP, SpaCy, and Watson.
>
> Thanks!
>
> --
> Audrey Lorberfeld
> Data Scientist, w3 Search
> IBM
> audrey.lorberf...@ibm.com
>
>




POS Tagger

2019-10-25 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Hi All,

Does anyone use a POS tagger with their Solr instance other than OpenNLP’s? We 
are considering OpenNLP, SpaCy, and Watson.

Thanks!

--
Audrey Lorberfeld
Data Scientist, w3 Search
IBM
audrey.lorberf...@ibm.com



Re: Re: using the df parameter to set a default to search all fields

2019-10-22 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Eek, Shawn, you're right -- I'm sorry, all! I meant to say the QF (!) 
parameter. And pasted the wrong thing too ☹ This is what ours looks like with 
the qf parameter (and the edismax parser):

  <str name="qf">title_en^1.5 description_en^0.5 content_en^0.5 headings_en^1.3
  keywords_en^1.5 url^0.5</str>

-- 
Audrey Lorberfeld
Data Scientist, w3 Search
IBM
audrey.lorberf...@ibm.com
 

On 10/22/19, 1:50 PM, "Shawn Heisey"  wrote:

On 10/22/2019 11:42 AM, Audrey Lorberfeld - audrey.lorberf...@ibm.com wrote:
> I think you actually can search over all fields, but not in the df 
parameter. We have a big list of fields we want to search over. So, we just put 
a dummy one in the df param field, and then we use the fl parameter. With the 
edismax parser, this works. It looks something like this:

The fl parameter means "field list" and controls which fields are 
included in the search results.  It does not control which fields are 
searched.

Thanks,
Shawn




Re: Re: using the df parameter to set a default to search all fields

2019-10-22 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
I think you actually can search over all fields, but not in the df parameter. 
We have a big list of fields we want to search over. So, we just put a dummy 
one in the df param field, and then we use the fl parameter. With the edismax 
parser, this works. It looks something like this: 

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <float name="tie">1.0</float>
    <str name="echoParams">explicit</str>
    <int name="rows">30</int>
    <str name="df">content_en</str>
    <str name="fl">update_date, display_url, url, id, uid, scopes, source_id,
      json_payload, language, snippet, [elevated],
      title:title_en, description:description_en, score</str>
  </lst>
</requestHandler>

-- 
Audrey Lorberfeld
Data Scientist, w3 Search
IBM
audrey.lorberf...@ibm.com
 

On 10/22/19, 1:01 PM, "Shawn Heisey"  wrote:

On 10/22/2019 10:26 AM, rhys J wrote:
> How do I make Solr search on all fields in a document?

Solr does not have a way to ask for all fields on a search.  If you use 
the edismax query parser, you can specify multiple fields with the qf 
parameter, but there is nothing you can put in that parameter as a 
shortcut for "all fields."  Using qf with multiple fields is the 
cleanest way to do this.

> I read the documentation about the df field, and added the following to my
> solrconfig.xml:
> 
>   
>explicit
>10
>   _text_
>  

The df parameter just means "default field".  It can only search one field.

> in my managed-schema file i have the following:
> 
>stored="true"/>
> 
> I have deleted the documents, and re-indexed the csv file.
> 
> When I do a search in the api for: _text_:amy - which should return 2
> documents, I get nothing.

Just having a field named _text_ doesn't make anything happen, unless 
your indexing specifically adds documents with that field defined. 
There is nothing special about _text_.  Other field names that start and 
end with an underscore, like _version_ or _root_, are special ... but 
_text_ is not.

Probably what you are looking for here is to set up one or more 
copyField definitions in your schema, which are configured to copy one 
or more of your other fields to _text_ so it can be searched as a 
catchall field.  I find it useful to name that field "catchall" rather 
than something like _text_ which seems like a special field name, but isn't.

Thanks,
Shawn
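A sketch of the copyField setup Shawn describes (field and type names are illustrative):

```xml
<field name="catchall" type="text_general" indexed="true" stored="false"
       multiValued="true"/>
<copyField source="title" dest="catchall"/>
<copyField source="description" dest="catchall"/>
<!-- then search it via df=catchall, or include it in qf -->
```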




Re: Re: Query on autoGeneratePhraseQueries

2019-10-15 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
I'm not sure how your config file is setup, but I know that the way we do 
multi-token synonyms is to have the sow (split on whitespace) parameter set to 
False while using the edismax parser. I'm not sure if this would work with 
PhraseQueries, but it might be worth a try!

In our config file we do something like this: 



edismax
1.0
explicit
100
content_en
w3json_en
false

 

You can read a bit about the parameter here: 
https://opensourceconnections.com/blog/2018/02/20/edismax-and-multiterm-synonyms-oddities/
 

Best,
Audrey

-- 
Audrey Lorberfeld
Data Scientist, w3 Search
IBM
audrey.lorberf...@ibm.com
 

On 10/15/19, 5:50 AM, "Shubham Goswami"  wrote:

Hi kshitij

Thanks for the reply!
    I tried to debug it and found that the raw query (black company) is parsed
    as two separate queries, "black" and "company", and results are returned
    based on the "black" query alone. Instead, it should have been parsed as a
    single phrase query ("black company"), because I am using
    autoGeneratePhraseQueries.
    Do you have any idea about this? Please correct me if I am wrong.

Thanks
Shubham

On Tue, Oct 15, 2019 at 1:58 PM kshitij tyagi 
wrote:

> Hi,
>
> Try debugging your solr query and understand how it gets parsed. Try using
> "debug=true" for the same
>
> On Tue, Oct 15, 2019 at 12:58 PM Shubham Goswami <
> shubham.gosw...@hotwax.co>
> wrote:
>
> > *Hi all,*
> >
> > I am a beginner to solr framework and I am trying to implement
> > *autoGeneratePhraseQueries* property in a fieldtype of
> type=text_general, i
> > kept the property value as true and restarted the solr server, but still
> > it is not taking my two-word query (like "Black company") as a phrase
> > without double quotes and is returning results only for "Black".
> >
> >  Can somebody please help me to understand what am i missing ?
> > Following is my Schema.xml file code and i am using solr 7.5 version.
> > <fieldType name="text_general" class="solr.TextField"
> > positionIncrementGap="100" multiValued="true"
> > autoGeneratePhraseQueries="true">
> >   <analyzer type="index">
> >     <tokenizer class="solr.StandardTokenizerFactory"/>
> >     <filter class="solr.StopFilterFactory" words="stopwords.txt"
> > ignoreCase="true"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >   </analyzer>
> >   <analyzer type="query">
> >     <tokenizer class="solr.StandardTokenizerFactory"/>
> >     <filter class="solr.StopFilterFactory" words="stopwords.txt"
> > ignoreCase="true"/>
> >     <filter class="solr.SynonymGraphFilterFactory"
> > ignoreCase="true" synonyms="synonyms.txt"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >   </analyzer>
> > </fieldType>
> >
> > --
> > *Thanks & Regards*
> > Shubham Goswami
> > Enterprise Software Engineer
> > *HotWax Systems*
> > *Enterprise open source experts*
> > cell: +91-7803886288
> > office: 0731-409-3684
> > 
https://urldefense.proofpoint.com/v2/url?u=http-3A__www.hotwaxsystems.com=DwIBaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=Zi9beGF58BzJUNUdCkeW0pwliKwq9vdTSh0V_lR0734=FhSkJBcmYw_bfHgq1enzuYQeOZwKHzlP9h4VwTZSL5E=
 
> >
>


-- 
*Thanks & Regards*
Shubham Goswami
Enterprise Software Engineer
*HotWax Systems*
*Enterprise open source experts*
cell: +91-7803886288
office: 0731-409-3684

https://urldefense.proofpoint.com/v2/url?u=http-3A__www.hotwaxsystems.com=DwIBaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=Zi9beGF58BzJUNUdCkeW0pwliKwq9vdTSh0V_lR0734=FhSkJBcmYw_bfHgq1enzuYQeOZwKHzlP9h4VwTZSL5E=
 




Re: Re: Re: Re: Re: Protecting Tokens from Any Analysis

2019-10-09 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
True...I guess another rub here is that we're using the edismax parser, so all 
of our queries are inherently OR queries. So for a query like  'the ibm way', 
the search engine would have to: 

1) retrieve a document list for:
 -->  "ibm" (this list is probably 80% of the documents)
 -->  "the" (this list is 100%  of the english documents)
 -- >"way"
2) apply edismax parser
 --> foreach term
 -->  -->  foreach document  in term
 -->  -->  -->  score it

So, it seems like it would take a toll on our system but maybe that's 
incorrect! (For reference, our corpus is ~5MM documents, multi-language, and we 
get ~80k-100k queries/day)
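The foreach sketch above amounts to a disjunctive (OR) scorer: union the 
posting lists of every query term and accumulate a score per matching 
document. A toy version in Python (the postings and score values are 
hypothetical; real edismax uses BM25 weights and takes the max across 
fields plus a tiebreaker, not a plain sum):

```python
def or_query_scores(postings, query_terms):
    """Toy disjunction scorer: union the posting lists of all query
    terms and sum the per-term scores for each matching document."""
    scores = {}
    for term in query_terms:
        for doc, score in postings.get(term, {}).items():
            scores[doc] = scores.get(doc, 0.0) + score
    # Rank by descending accumulated score
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Hypothetical postings: term -> {doc_id: that term's score in the doc}
postings = {
    "the": {"d1": 0.1, "d2": 0.1, "d3": 0.1},
    "ibm": {"d1": 1.2, "d3": 0.8},
    "way": {"d3": 0.9},
}
print(or_query_scores(postings, ["the", "ibm", "way"]))
# d3 ranks first: it is the only document matching all three terms
```

Even in this toy form it shows why a ubiquitous term like "the" inflates 
the candidate set without changing the ranking much.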

Are you using edismax?

-- 
Audrey Lorberfeld
Data Scientist, w3 Search
IBM
audrey.lorberf...@ibm.com
 

On 10/9/19, 3:11 PM, "David Hastings"  wrote:

if you have anything close to a decent server you won't notice it at all.  I'm
at about 21 million documents, the index varies between 450gb and 800gb
depending on merges, with about 60k searches a day, and stay sub-second non
stop, and this is on a single core/non-cloud environment

    On Wed, Oct 9, 2019 at 2:55 PM Audrey Lorberfeld - audrey.lorberf...@ibm.com
 wrote:

> Also, in terms of computational cost, it would seem that including most
> terms/not having a stop list would take a toll on the system. For 
instance,
> right now we have "ibm" as a stop word because it appears everywhere in 
our
> corpus. If we did not include it in the stop words file, we would have to
> retrieve every single document in our corpus and rank them. That's a high
> computational cost, no?
>
> --
> Audrey Lorberfeld
> Data Scientist, w3 Search
> IBM
> audrey.lorberf...@ibm.com
>
>
> On 10/9/19, 2:31 PM, "Audrey Lorberfeld - audrey.lorberf...@ibm.com" <
> audrey.lorberf...@ibm.com> wrote:
>
> Wow, thank you so much, everyone. This is all incredibly helpful
> insight.
>
> So, would it be fair to say that the majority of you all do NOT use
> stop words?
>
> --
> Audrey Lorberfeld
> Data Scientist, w3 Search
> IBM
> audrey.lorberf...@ibm.com
>
>
> On 10/9/19, 11:14 AM, "David Hastings" 
> wrote:
>
> However, with all that said, stopwords CAN be useful in some
> situations.  I
> combine stopwords with the shingle factory to create "interesting
> phrases"
> (not really) that i use in "my more like this" needs.  for 
example,
> europe for vacation
> europe on vacation
> will create the shingle
> europe_vacation
> which i can then use to relate other documents that would be much
> more similar in such regard, rather than just using the
> "interesting words"
> europe, vacation
>
> with stop words, the shingles would be
> europe_for
> for_vacation
> and
> europe_on
> on_vacation
>
> just something to keep in mind,  theres a lot of creative ways to
> use
> stopwords depending on your needs.  i use the above for a VERY
> basic ML
> teacher and it works way better than using stopwords,
>
>
>
>
>
>
>
>
>
>
>
>
>
> On Wed, Oct 9, 2019 at 10:51 AM Erick Erickson <
> erickerick...@gmail.com>
> wrote:
>
> > The theory behind stopwords is that they are “safe” to remove
> when
> > calculating relevance, so we can squeeze every last bit of
> usefulness out
> > of very constrained hardware (think 64K of memory. Yes
> kilobytes). We’ve
> > come a long way since then and the necessity of removing
> stopwords from the
> > indexed tokens to conserve RAM and disk is much less relevant
> than it used
> > to be in “the bad old days” when the idea of stopwords was
> invented.
> >
> > I’m not quite so confident as Alex that there is “no benefit”,
> but I’ll
> > totally agree that you should remove stopwords only _after_ you
> have some
    > > evidence that removing them is A Good Thing in your situation.
> >
> > And removing stopwords leads to some interesting corner cases.
> Consider a
> > search for “to be or n

Re: Re: Re: Re: Protecting Tokens from Any Analysis

2019-10-09 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Also, in terms of computational cost, it would seem that including most 
terms/not having a stop list would take a toll on the system. For instance, 
right now we have "ibm" as a stop word because it appears everywhere in our 
corpus. If we did not include it in the stop words file, we would have to 
retrieve every single document in our corpus and rank them. That's a high 
computational cost, no?

-- 
Audrey Lorberfeld
Data Scientist, w3 Search
IBM
audrey.lorberf...@ibm.com
 

On 10/9/19, 2:31 PM, "Audrey Lorberfeld - audrey.lorberf...@ibm.com" 
 wrote:

Wow, thank you so much, everyone. This is all incredibly helpful insight.

So, would it be fair to say that the majority of you all do NOT use stop 
words?

-- 
Audrey Lorberfeld
Data Scientist, w3 Search
IBM
audrey.lorberf...@ibm.com
 

On 10/9/19, 11:14 AM, "David Hastings"  wrote:

However, with all that said, stopwords CAN be useful in some 
situations.  I
combine stopwords with the shingle factory to create "interesting 
phrases"
(not really) that i use in "my more like this" needs.  for example,
europe for vacation
europe on vacation
will create the shingle
europe_vacation
which i can then use to relate other documents that would be much
more similar in such regard, rather than just using the "interesting 
words"
europe, vacation

with stop words, the shingles would be
europe_for
for_vacation
and
europe_on
on_vacation

just something to keep in mind,  theres a lot of creative ways to use
stopwords depending on your needs.  i use the above for a VERY basic ML
teacher and it works way better than using stopwords,













On Wed, Oct 9, 2019 at 10:51 AM Erick Erickson 
wrote:

> The theory behind stopwords is that they are “safe” to remove when
> calculating relevance, so we can squeeze every last bit of usefulness 
out
> of very constrained hardware (think 64K of memory. Yes kilobytes). 
We’ve
> come a long way since then and the necessity of removing stopwords 
from the
> indexed tokens to conserve RAM and disk is much less relevant than it 
used
> to be in “the bad old days” when the idea of stopwords was invented.
>
> I’m not quite so confident as Alex that there is “no benefit”, but 
I’ll
> totally agree that you should remove stopwords only _after_ you have 
some
> evidence that removing them is A Good Thing in your situation.
>
> And removing stopwords leads to some interesting corner cases. 
Consider a
> search for “to be or not to be” if they’re all stopwords.
>
> Best,
    > Erick
>
> > On Oct 9, 2019, at 9:38 AM, Audrey Lorberfeld -
> audrey.lorberf...@ibm.com  wrote:
> >
> > Hey Alex,
> >
> > Thank you!
> >
> > Re: stopwords being a thing of the past due to the affordability of
> hardware...can you expand? I'm not sure I understand.
> >
> > --
> > Audrey Lorberfeld
> > Data Scientist, w3 Search
> > IBM
> > audrey.lorberf...@ibm.com
> >
> >
> > On 10/8/19, 1:01 PM, "David Hastings" 
> wrote:
> >
> >Another thing to add to the above,
> >>
> >> IT:ibm. In this case, we would want to maintain the colon and the
> >> capitalization (otherwise “it” would be taken out as a stopword).
> >>
> >stopwords are a thing of the past at this point.  there is no 
benefit
> to
> >using them now with hardware being so cheap.
> >
> >On Tue, Oct 8, 2019 at 12:43 PM Alexandre Rafalovitch <
> arafa...@gmail.com>
> >wrote:
> >
> >> If you don't want it to be touched by a tokenizer, how would the
> >> protection step know that the sequence of characters you want to
> >> protect is "IT:ibm" and not "this is an IT:ibm term I want to
> >> protect"?
> >>
> >> What it sounds to me is that you may want to:
> >> 1) copyField to a second field
> >> 2) Apply a much lighter (whitespace?) tokenizer to that second 
field
> >> 3) 

Re: Re: Re: Protecting Tokens from Any Analysis

2019-10-09 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Wow, thank you so much, everyone. This is all incredibly helpful insight.

So, would it be fair to say that the majority of you all do NOT use stop words?

-- 
Audrey Lorberfeld
Data Scientist, w3 Search
IBM
audrey.lorberf...@ibm.com
 

On 10/9/19, 11:14 AM, "David Hastings"  wrote:

However, with all that said, stopwords CAN be useful in some situations.  I
combine stopwords with the shingle factory to create "interesting phrases"
(not really) that i use in "my more like this" needs.  for example,
europe for vacation
europe on vacation
will create the shingle
europe_vacation
which i can then use to relate other documents that would be much
more similar in such regard, rather than just using the "interesting words"
europe, vacation

with stop words, the shingles would be
europe_for
for_vacation
and
europe_on
on_vacation

just something to keep in mind,  theres a lot of creative ways to use
stopwords depending on your needs.  i use the above for a VERY basic ML
teacher and it works way better than using stopwords,
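A minimal sketch of the stopword-then-shingle idea above (plain Python, not 
Solr's ShingleFilterFactory; a real analyzer chain would also have to deal 
with the position gaps left by removed stopwords):

```python
def stopword_shingles(text, stopwords, sep="_"):
    """Drop stopwords, then join each adjacent pair of surviving
    tokens into a bigram shingle such as 'europe_vacation'."""
    tokens = [t for t in text.lower().split() if t not in stopwords]
    return [sep.join(pair) for pair in zip(tokens, tokens[1:])]

stop = {"for", "on", "the", "a"}
print(stopword_shingles("europe for vacation", stop))  # ['europe_vacation']
print(stopword_shingles("europe on vacation", stop))   # ['europe_vacation']
```

Both phrasings collapse to the same shingle, which is exactly what makes 
the trick useful for more-like-this similarity.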













On Wed, Oct 9, 2019 at 10:51 AM Erick Erickson 
wrote:

> The theory behind stopwords is that they are “safe” to remove when
> calculating relevance, so we can squeeze every last bit of usefulness out
> of very constrained hardware (think 64K of memory. Yes kilobytes). We’ve
> come a long way since then and the necessity of removing stopwords from 
the
> indexed tokens to conserve RAM and disk is much less relevant than it used
> to be in “the bad old days” when the idea of stopwords was invented.
>
> I’m not quite so confident as Alex that there is “no benefit”, but I’ll
> totally agree that you should remove stopwords only _after_ you have some
> evidence that removing them is A Good Thing in your situation.
>
> And removing stopwords leads to some interesting corner cases. Consider a
> search for “to be or not to be” if they’re all stopwords.
>
    > Best,
> Erick
>
> > On Oct 9, 2019, at 9:38 AM, Audrey Lorberfeld -
> audrey.lorberf...@ibm.com  wrote:
> >
> > Hey Alex,
> >
> > Thank you!
> >
> > Re: stopwords being a thing of the past due to the affordability of
> hardware...can you expand? I'm not sure I understand.
> >
> > --
> > Audrey Lorberfeld
> > Data Scientist, w3 Search
> > IBM
> > audrey.lorberf...@ibm.com
> >
> >
> > On 10/8/19, 1:01 PM, "David Hastings" 
> wrote:
> >
> >Another thing to add to the above,
> >>
> >> IT:ibm. In this case, we would want to maintain the colon and the
> >> capitalization (otherwise “it” would be taken out as a stopword).
> >>
> >stopwords are a thing of the past at this point.  there is no benefit
> to
> >using them now with hardware being so cheap.
> >
> >On Tue, Oct 8, 2019 at 12:43 PM Alexandre Rafalovitch <
> arafa...@gmail.com>
> >wrote:
> >
> >> If you don't want it to be touched by a tokenizer, how would the
> >> protection step know that the sequence of characters you want to
> >> protect is "IT:ibm" and not "this is an IT:ibm term I want to
> >> protect"?
> >>
> >> What it sounds to me is that you may want to:
> >> 1) copyField to a second field
> >> 2) Apply a much lighter (whitespace?) tokenizer to that second field
> >> 3) Run the results through something like KeepWordFilterFactory
> >> 4) Search both fields with a boost on the second, higher-signal field
> >>
> >> The other option is to run CharacterFilter,
    > >> (PatternReplaceCharFilterFactory) which is pre-tokenizer to map known
> >> complex acronyms to non-tokenizable substitutions. E.g. "IT:ibm ->
> >> term365". As long as it is done on both indexing and query, they will
> >> still match. You may have to have a bunch of them or write some sort
> >> of lookup map.
> >>
> >> Regards,
> >>   Alex.
> >>
> >> On Tue, 8 Oct 2019 at 12:10, Audrey Lorberfeld -
> >> audrey.lorberf...@ibm.com  wrote:
> >>>
> >>> Hi All,
> >>>
> >>> This is likely a rudimentary question, but I can’t seem to find a
> 

Re: Re: Protecting Tokens from Any Analysis

2019-10-09 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Hey Alex,

Thank you!

Re: stopwords being a thing of the past due to the affordability of 
hardware...can you expand? I'm not sure I understand.

-- 
Audrey Lorberfeld
Data Scientist, w3 Search
IBM
audrey.lorberf...@ibm.com
 

On 10/8/19, 1:01 PM, "David Hastings"  wrote:

Another thing to add to the above,
>
> IT:ibm. In this case, we would want to maintain the colon and the
> capitalization (otherwise “it” would be taken out as a stopword).
>
stopwords are a thing of the past at this point.  there is no benefit to
using them now with hardware being so cheap.

On Tue, Oct 8, 2019 at 12:43 PM Alexandre Rafalovitch 
wrote:

> If you don't want it to be touched by a tokenizer, how would the
> protection step know that the sequence of characters you want to
> protect is "IT:ibm" and not "this is an IT:ibm term I want to
> protect"?
>
> What it sounds to me is that you may want to:
> 1) copyField to a second field
> 2) Apply a much lighter (whitespace?) tokenizer to that second field
> 3) Run the results through something like KeepWordFilterFactory
> 4) Search both fields with a boost on the second, higher-signal field
>
> The other option is to run CharacterFilter,
> (PatternReplaceCharFilterFactory) which is pre-tokenizer to map known
> complex acronyms to non-tokenizable substitutions. E.g. "IT:ibm ->
> term365". As long as it is done on both indexing and query, they will
> still match. You may have to have a bunch of them or write some sort
    > of lookup map.
>
    > Regards,
>Alex.
>
> On Tue, 8 Oct 2019 at 12:10, Audrey Lorberfeld -
> audrey.lorberf...@ibm.com  wrote:
> >
> > Hi All,
> >
> > This is likely a rudimentary question, but I can’t seem to find a
> straight-forward answer on forums or the documentation…is there a way to
> protect tokens from ANY analysis? I know things like the
> KeywordMarkerFilterFactory protect tokens from stemming, but we have some
> terms we don’t even want our tokenizer to touch. Mostly, these are
> IBM-specific acronyms, such as IT:ibm. In this case, we would want to
> maintain the colon and the capitalization (otherwise “it” would be taken
> out as a stopword).
> >
> > Any advice is appreciated!
> >
> > Thank you,
> > Audrey
> >
> > --
> > Audrey Lorberfeld
> > Data Scientist, w3 Search
> > IBM
> > audrey.lorberf...@ibm.com
> >
>




Protecting Tokens from Any Analysis

2019-10-08 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Hi All,

This is likely a rudimentary question, but I can’t seem to find a 
straight-forward answer on forums or the documentation…is there a way to 
protect tokens from ANY analysis? I know things like the 
KeywordMarkerFilterFactory protect tokens from stemming, but we have some terms 
we don’t even want our tokenizer to touch. Mostly, these are IBM-specific 
acronyms, such as IT:ibm. In this case, we would want to maintain the colon and 
the capitalization (otherwise “it” would be taken out as a stopword).

Any advice is appreciated!

Thank you,
Audrey

--
Audrey Lorberfeld
Data Scientist, w3 Search
IBM
audrey.lorberf...@ibm.com



Re: Re: SolR: How to sort (or boost) by Availability dates

2019-09-24 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Yay!

-- 
Audrey Lorberfeld
Data Scientist, w3 Search
IBM
audrey.lorberf...@ibm.com
 

On 9/24/19, 10:15 AM, "digi_business"  wrote:

Hi all, reading your suggestions I've just come out of the darkness!

To explain: my problem is that I want to show all my items (not
only the available ones), but with the available ones coming first, still
maintaining my custom sorting by "ranking" desc.
I then used this boost query
bq=(Avail_From: [* TO NOW] AND Avail_To: [NOW TO *])^10
and discovered that to activate it I must declare defType=edismax first.
Then I discovered the default Solr "score" sort, and making it explicit
like this did the magic:
sort=score desc, Ranking desc

thanks all for the help, and I really hope this could help someone else in
the future



--
Sent from: 
https://lucene.472066.n3.nabble.com/Solr-User-f472068.html




Re: SolR: How to sort (or boost) by Availability dates

2019-09-24 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Hi Federico,

I am not sure exactly what syntax would get you the functionality that you're 
looking for, but I'd recommend writing a boost function. That's what we're 
doing right now for boosting more recent results in our search engine. You'd 
somehow have to work with date math and possibly make a copy field to store the 
answer of the mathematical expression that would incorporate the NOW part...

Our boost function is "recip(div(ms(NOW,date_discount),262800),1,24,24)". It 
goes in the "bf" parameter when using the edismax parser. Our function 
translates to "max boost set to 1 for new docs, down to .4 after 3 years." We 
came up with the time frame of the boost after creating a histogram of our 
corpus's "update_date" field values (copied to the "date_discount" field) and 
finding that monthly binning gave us the most normal distribution (as opposed 
to weekly or yearly). 
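In Solr function-query terms, recip(x, m, a, b) computes a/(m*x + b), so 
with a = b = 24 and x the document age in months the curve starts at 1.0 
and falls to 0.4 at 36 months, matching the description above. A quick 
sketch of just that math (the months interpretation is assumed from the 
".4 after 3 years" claim, not derived from the 262800 divisor):

```python
def recency_boost(age_months, a=24.0, b=24.0, m=1.0):
    """Solr recip(x, m, a, b) = a / (m*x + b), here with x in months."""
    return a / (m * age_months + b)

print(recency_boost(0))    # 1.0 for a brand-new document
print(recency_boost(36))   # 0.4 after three years
```

Tuning then reduces to picking a and b, which is what the grid search 
described below optimizes.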

We came up with this solution after lots of surfing Solr forums and reading a lot 
about date math 
(https://builds.apache.org/view/L/view/Lucene/job/Solr-reference-guide-8.x/javadoc/working-with-dates.html#date-math)
 and boost functions 
(https://builds.apache.org/view/L/view/Lucene/job/Solr-reference-guide-8.x/javadoc/the-dismax-query-parser.html#bf-boost-functions-parameter).
 

Currently, we are running a grid search optimized for nDCG that runs ~1x/week 
to give us the optimal a,b constants to sub out for the 24s in the function 
above. We plan to change this to a linear model in the future to cut down on 
the time it takes to run.

Hopefully this gives you a nice starting place!

Best,
Audrey

-- 
Audrey Lorberfeld
Data Scientist, w3 Search
IBM
audrey.lorberf...@ibm.com
 

On 9/24/19, 5:47 AM, "digi_business"  wrote:

i'm facing a big problem in my SolR DB.
My objects have a datetime field "Available_From" and a datetime field
"Available_To". We also have a "Ranking" field for the sorting that we
usually use desc.
I can search correctly with direct queries (eg. give me all the items that
are available at the moment) but when i do a regular search by other
criteria i cannot find a way to show the items that result "available NOW"
in the first places in the specific search results, usually sorted by
"Ranking" field.
How can i do this? Am I forced to write some java classes (the nearest thing
i've found is there

https://medium.com/@devchaitu18/sorting-based-on-a-custom-function-in-solr-c94ddae99a12 )
or is there a way to do with standard SolR queries?
Will boosting work? If yes, how can i boost by the 2 "Available_From" and
"Available_To" fields verified at the same time, and then show the other
results sorted by "Ranking" desc ?
Thanks in advance to everyone!



--
Sent from: 
https://lucene.472066.n3.nabble.com/Solr-User-f472068.html




Re: Re: Re: Re: Multi-lingual Search & Accent Marks

2019-09-04 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Thanks, Alex! We'll look into this.

-- 
Audrey Lorberfeld
Data Scientist, w3 Search
IBM
audrey.lorberf...@ibm.com
 

On 9/3/19, 4:27 PM, "Alexandre Rafalovitch"  wrote:

What about combining:
1) KeywordRepeatFilterFactory
2) An existing folding filter (need to check it ignores Keyword marked word)
3) RemoveDuplicatesTokenFilterFactory

That may give what you are after without custom coding.
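A sketch of that idea as a Solr analyzer (names illustrative, not from the 
thread). One caveat on step 2: stemmers honor the keyword attribute set by 
KeywordRepeatFilterFactory, but ASCIIFoldingFilter does not, so for folding 
specifically the simpler route is its own preserveOriginal option, which 
already emits both the original and the folded token:

```xml
<!-- Illustrative sketch: index "köket" as both "köket" and "koket"
     at the same position, in a single field. -->
<fieldType name="text_fold_both" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
```

RemoveDuplicates at the end collapses plain-ASCII tokens, where the folded 
copy is identical to the original.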

Regards,
   Alex.

On Tue, 3 Sep 2019 at 16:14, Audrey Lorberfeld -
audrey.lorberf...@ibm.com  wrote:
>
> Toke,
>
> Thank you! That makes a lot of sense.
>
> In other news -- we just had a meeting where we decided to try out a 
hybrid strategy. I'd love to know what you & everyone else thinks...
>
> - Since we are concerned with the overhead created by "double-fielding" 
all tokens per language (because I'm not sure how we'd work the logic into Solr 
to only double-field when an accent is present), we are going to try to do 
something along the lines of synonym-expansion:
> - We are going to build a custom plugin that detects diacritics 
-- upon detection, the plugin would expand the token to both its original form 
and its ascii-folded term (a la Toke's approach).
> - However, since we are doing it in a way that mimics synonym 
expansion, we are going to keep both terms in a single field
>
> The main issue we are anticipating with the above strategy surrounds 
scoring. Since we will be increasing the frequency of accented terms, we might 
bias our page ranker...
>
> Has anyone done anything similar (and/or does anyone think this idea is 
totally the wrong way to go?)
>
> Best,
> Audrey
>
> --
> Audrey Lorberfeld
> Data Scientist, w3 Search
> IBM
> audrey.lorberf...@ibm.com
    >
>
    > On 9/3/19, 2:58 PM, "Toke Eskildsen"  wrote:
>
> Audrey Lorberfeld - audrey.lorberf...@ibm.com 
 wrote:
> > Do you find that searching over both the original title field and 
the normalized title
> > field increases the time it takes for your search engine to 
retrieve results?
>
> It is not something we have measured as that index is fast enough 
(which in this context means that we're practically always waiting for the 
result from an external service that is issued in parallel with the call to our 
Solr server).
>
> Technically it's not different from searching across other fields 
defined in the eDismax setup, so I guess it boils down to "how many fields can 
you afford to search across?", where our organization's default answer is "as 
many as we need to get quality matches. Make it work Toke, chop chop". On a 
more serious note, it is not something I would worry about unless we're talking 
some special high-performance setup with a budget for tuning: Matching terms 
and joining filters is core Solr (Lucene really) functionality. Plain query & 
filter-matching time tend to be dwarfed by aggregations (grouping, faceting, 
stats).
>
> - Toke Eskildsen
>
>




Re: Re: Re: Multi-lingual Search & Accent Marks

2019-09-03 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Toke,

Thank you! That makes a lot of sense.

In other news -- we just had a meeting where we decided to try out a hybrid 
strategy. I'd love to know what you & everyone else thinks...

- Since we are concerned with the overhead created by "double-fielding" all 
tokens per language (because I'm not sure how we'd work the logic into Solr to 
only double-field when an accent is present), we are going to try to do 
something along the lines of synonym-expansion:
- We are going to build a custom plugin that detects diacritics -- upon 
detection, the plugin would expand the token to both its original form and its 
ascii-folded term (a la Toke's approach).
- However, since we are doing it in a way that mimics synonym 
expansion, we are going to keep both terms in a single field

The main issue we are anticipating with the above strategy surrounds scoring. 
Since we will be increasing the frequency of accented terms, we might bias our 
page ranker...

Has anyone done anything similar (and/or does anyone think this idea is totally 
the wrong way to go?)

Best,
Audrey

-- 
Audrey Lorberfeld
Data Scientist, w3 Search
IBM
audrey.lorberf...@ibm.com
 

On 9/3/19, 2:58 PM, "Toke Eskildsen"  wrote:

Audrey Lorberfeld - audrey.lorberf...@ibm.com  
wrote:
> Do you find that searching over both the original title field and the 
normalized title
> field increases the time it takes for your search engine to retrieve 
results?

It is not something we have measured as that index is fast enough (which in 
this context means that we're practically always waiting for the result from an 
external service that is issued in parallel with the call to our Solr server).

Technically it's not different from searching across other fields defined 
in the eDismax setup, so I guess it boils down to "how many fields can you 
afford to search across?", where our organization's default answer is "as many 
as we need to get quality matches. Make it work Toke, chop chop". On a more 
serious note, it is not something I would worry about unless we're talking some 
special high-performance setup with a budget for tuning: Matching terms and 
joining filters is core Solr (Lucene really) functionality. Plain query & 
filter-matching time tend to be dwarfed by aggregations (grouping, faceting, 
stats).

- Toke Eskildsen




Re: Re: Multi-lingual Search & Accent Marks

2019-09-03 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Toke,

Do you find that searching over both the original title field and the 
normalized title field increases the time it takes for your search engine to 
retrieve results?

-- 
Audrey Lorberfeld
Data Scientist, w3 Search
Digital Workplace Engineering
CIO, Finance and Operations
IBM
audrey.lorberf...@ibm.com
 

On 8/31/19, 3:01 PM, "Toke Eskildsen"  wrote:

    Audrey Lorberfeld - audrey.lorberf...@ibm.com  
wrote:
> Just wanting to test the waters here – for those of you with search 
engines
> that index multiple languages, do you use ASCII-folding in your schema?

Our primary search engine is for Danish users, with sources being 
bibliographic records with titles and other meta data in many different 
languages. We normalise to Danish, meaning that most ligatures are removed, but 
also that letters such as Swedish ö becomes Danish ø. The rules for 
normalisation are dictated by Danish library practice and was implemented by a 
resident librarian.

Whenever we do this normalisation, we index two versions in our index: A 
very lightly normalised (lowercased) field and a heavily normalised field: If a 
record has a title "Köket" (kitchen in Swedish), we store title_orig:köket and 
title_norm:køket. edismax is used to ensure that both fields are searched per 
default (plus an explicit field alias "title" are set to point to both 
title_orig and title_norm for qualified searches) and that matches in 
title_orig has more weight for relevance calculation.

> We are onboarding Spanish documents into our index right now and keep
> going back and forth on whether we should preserve accent marks.

Going with what we do, my answer would be: Yes, do preserve and also remove 
:-). You could even have 3 or more levels of normalisation, depending on how 
much time you have for polishing.

- Toke Eskildsen




Re: Re: Multi-lingual Search & Accent Marks

2019-09-03 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Languages are the best. Thank you all so much!

-- 
Audrey Lorberfeld
Data Scientist, w3 Search
Digital Workplace Engineering
CIO, Finance and Operations
IBM
audrey.lorberf...@ibm.com
 

On 8/30/19, 4:09 PM, "Walter Underwood"  wrote:

The right transliteration for accents is language-dependent. In English, a 
diaeresis can be stripped because it is only used to mark neighboring vowels as 
independently pronounced. In German, the “typewriter umlaut” adds an “e”.

English: coöperate -> cooperate
German: Glück -> Glueck

Some stemmers will handle the typewriter umlauts for you. The InXight 
stemmers used to do that.

The English diaeresis is a fussy usage, but it does occur in text. For 
years, MS Word corrected “naive” to “naïve”. There may even be a curse 
associated with its usage.


https://www.newyorker.com/culture/culture-desk/the-curse-of-the-diaeresis

In German, there are corner cases where just stripping the umlaut changes 
one word into another, like schön/schon.

Isn’t language fun?

wunder
Walter Underwood
wun...@wunderwood.org

http://observer.wunderwood.org/  (my blog)

> On Aug 30, 2019, at 12:48 PM, Erick Erickson  
wrote:
> 
> It Depends (tm). In this case on how sophisticated/precise your users 
are. If your users are exclusively extremely conversant in the language and are 
expected to have keyboards that allow easy access to all the accents… then I 
might leave them in. In some cases removing them can change the meaning of a 
word.
> 
> That said, most installations I’ve seen remove them. They’re still 
present in any returned stored field so the doc looks good. And then you bypass 
all the nonsense about perhaps ingesting a doc that “somehow” had accents 
removed and/or people not putting accents in their search and the like.
> 
> MappingCFF works..
> 
    >> On Aug 30, 2019, at 1:54 PM, Audrey Lorberfeld - 
audrey.lorberf...@ibm.com  wrote:
>> 
>> Aita,
>> 
>> Thanks for that insight! 
>> 
>> As the conversation has progressed, we are now leaning towards not 
having the ASCII-folding filter in our pipelines in order to keep marks like 
umlauts and tildas. Instead, we might add acute and grave accents to a file 
pointed at by the MappingCharFilterFactory to simply strip those more common 
accent marks...
>> 
>> Any other opinions are welcome!
>> 
>> -- 
>> Audrey Lorberfeld
>> Data Scientist, w3 Search
>> Digital Workplace Engineering
>> CIO, Finance and Operations
>> IBM
>> audrey.lorberf...@ibm.com
>> 
>> 
>> On 8/30/19, 10:27 AM, "Atita Arora"  wrote:
>> 
>>   We work on german index, we neutralize accents before index i.e. 
umlauts to
>>   'ae', 'ue'.. Etc and similar what we do at the query time too for an
>>   appropriate match.
>> 
>>   On Fri, Aug 30, 2019, 4:22 PM Audrey Lorberfeld - 
audrey.lorberf...@ibm.com
>>wrote:
>> 
>>> Hi All,
>>> 
>>> Just wanting to test the waters here – for those of you with search
>>> engines that index multiple languages, do you use ASCII-folding in your
>>> schema? We are onboarding Spanish documents into our index right now and
>>> keep going back and forth on whether we should preserve accent marks. 
From
>>> our query logs, it seems people generally do not include accents when
>>> searching, but you never know…
>>> 
>>> Thank you in advance for sharing your experiences!
>>> 
>>> --
>>> Audrey Lorberfeld
>>> Data Scientist, w3 Search
>>> Digital Workplace Engineering
>>> CIO, Finance and Operations
>>> IBM
>>> audrey.lorberf...@ibm.com
>>> 
>>> 
>> 
>> 
> 





Re: Re: Re: Multi-lingual Search & Accent Marks

2019-09-03 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Thank you, Erick!

-- 
Audrey Lorberfeld
Data Scientist, w3 Search
Digital Workplace Engineering
CIO, Finance and Operations
IBM
audrey.lorberf...@ibm.com
 

On 8/30/19, 3:49 PM, "Erick Erickson"  wrote:

It Depends (tm). In this case on how sophisticated/precise your users are. 
If your users are exclusively extremely conversant in the language and are 
expected to have keyboards that allow easy access to all the accents… then I 
might leave them in. In some cases removing them can change the meaning of a 
word.

That said, most installations I’ve seen remove them. They’re still present 
in any returned stored field so the doc looks good. And then you bypass all the 
nonsense about perhaps ingesting a doc that “somehow” had accents removed 
and/or people not putting accents in their search and the like.

MappingCFF works..

> On Aug 30, 2019, at 1:54 PM, Audrey Lorberfeld - 
audrey.lorberf...@ibm.com  wrote:
> 
> Aita,
> 
> Thanks for that insight! 
> 
> As the conversation has progressed, we are now leaning towards not having 
the ASCII-folding filter in our pipelines in order to keep marks like umlauts 
and tildas. Instead, we might add acute and grave accents to a file pointed at 
by the MappingCharFilterFactory to simply strip those more common accent 
marks...
> 
> Any other opinions are welcome!
> 
> -- 
> Audrey Lorberfeld
> Data Scientist, w3 Search
> Digital Workplace Engineering
> CIO, Finance and Operations
> IBM
> audrey.lorberf...@ibm.com
> 
> 
> On 8/30/19, 10:27 AM, "Atita Arora"  wrote:
> 
>We work on a German index; we neutralize accents before indexing, i.e. 
umlauts to
>'ae', 'ue', etc., and we do the same at query time for an
>appropriate match.
> 
>On Fri, Aug 30, 2019, 4:22 PM Audrey Lorberfeld - 
audrey.lorberf...@ibm.com
> wrote:
> 
>> Hi All,
>> 
>> Just wanting to test the waters here – for those of you with search
>> engines that index multiple languages, do you use ASCII-folding in your
>> schema? We are onboarding Spanish documents into our index right now and
>> keep going back and forth on whether we should preserve accent marks. 
From
>> our query logs, it seems people generally do not include accents when
>> searching, but you never know…
>> 
>> Thank you in advance for sharing your experiences!
>> 
>> --
>> Audrey Lorberfeld
>> Data Scientist, w3 Search
>> Digital Workplace Engineering
>> CIO, Finance and Operations
>> IBM
>> audrey.lorberf...@ibm.com
>> 
>> 
> 
> 





Re: Re: Multi-lingual Search & Accent Marks

2019-08-30 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Atita,

Thanks for that insight! 

As the conversation has progressed, we are now leaning towards not having the 
ASCII-folding filter in our pipelines in order to keep marks like umlauts and 
tildes. Instead, we might add acute and grave accents to a file pointed at by
the MappingCharFilterFactory to simply strip those more common accent marks...

Any other opinions are welcome!

-- 
Audrey Lorberfeld
Data Scientist, w3 Search
Digital Workplace Engineering
CIO, Finance and Operations
IBM
audrey.lorberf...@ibm.com
 

On 8/30/19, 10:27 AM, "Atita Arora"  wrote:

We work on a German index; we neutralize accents before indexing, i.e. umlauts
to 'ae', 'ue', etc., and we do the same at query time for an
appropriate match.
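That neutralization can be expressed as a MappingCharFilterFactory mapping file referenced from both the index and query analyzers. A sketch of what such a file might contain (the file name and exact entries are illustrative, not Atita's actual configuration):

```
# mapping-german.txt (hypothetical): expand umlauts rather than strip them,
# so "über" and "ueber" index and query identically.
"ä" => "ae"
"ö" => "oe"
"ü" => "ue"
"Ä" => "Ae"
"Ö" => "Oe"
"Ü" => "Ue"
"ß" => "ss"
```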

On Fri, Aug 30, 2019, 4:22 PM Audrey Lorberfeld - audrey.lorberf...@ibm.com
 wrote:

> Hi All,
>
> Just wanting to test the waters here – for those of you with search
> engines that index multiple languages, do you use ASCII-folding in your
> schema? We are onboarding Spanish documents into our index right now and
> keep going back and forth on whether we should preserve accent marks. From
> our query logs, it seems people generally do not include accents when
> searching, but you never know…
>
> Thank you in advance for sharing your experiences!
>
> --
> Audrey Lorberfeld
> Data Scientist, w3 Search
> Digital Workplace Engineering
> CIO, Finance and Operations
> IBM
> audrey.lorberf...@ibm.com
>
>




Multi-lingual Search & Accent Marks

2019-08-30 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Hi All,

Just wanting to test the waters here – for those of you with search engines 
that index multiple languages, do you use ASCII-folding in your schema? We are 
onboarding Spanish documents into our index right now and keep going back and 
forth on whether we should preserve accent marks. From our query logs, it seems 
people generally do not include accents when searching, but you never know…

Thank you in advance for sharing your experiences!

--
Audrey Lorberfeld
Data Scientist, w3 Search
Digital Workplace Engineering
CIO, Finance and Operations
IBM
audrey.lorberf...@ibm.com



Re: Re: Multi-language Spellcheck

2019-08-29 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Thanks, everyone!
-- 
Audrey Lorberfeld
Data Scientist, w3 Search
Digital Workplace Engineering
CIO, Finance and Operations
IBM
audrey.lorberf...@ibm.com
 

On 8/29/19, 11:28 AM, "Atita Arora"  wrote:

I would agree with the suggestion; I remember something similar presented
by someone at Berlin Buzzwords 2019.

On Thu, Aug 29, 2019, 5:03 PM Jörn Franke  wrote:

> It could be sensible to have one spellchecker per language (as a different
> endpoint or via a query parameter at runtime). Alternatively, depending on
> your use case you could get away with a generic fieldtype that does not do
> anything language-specific, but I doubt it.
>
> > Am 29.08.2019 um 16:20 schrieb Audrey Lorberfeld -
> audrey.lorberf...@ibm.com :
> >
> > Hi All,
> >
> > We are starting up an internal search engine that has to work for many
> different languages. We are starting with a POC of Spanish and English
> documents, and we are using the DirectSolrSpellChecker.
> >
> > From reading others' threads online, I know that we have to have
> multiple spellcheckers to do this (1 for each language). However, would
> someone be able to clarify what should go in the "queryAnalyzerFieldType"
> tag? It seems that the tag can only take a single field. So, does that 
mean
> that I have to have a copy field that collates all tokens from all
> languages? Image of code attached for reference & sample code of
> English-only spellchecker below:
> >
> > <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
> >
> >   <str name="queryAnalyzerFieldType">???</str>
> >
> >   <lst name="spellchecker">
> >     <str name="name">default</str>
> >     <str name="field">minimal_en</str>
> >     <str name="classname">solr.DirectSolrSpellChecker</str>
> >     <str name="distanceMeasure">internal</str>
> >     <float name="accuracy">0.5</float>
> >     <int name="maxEdits">2</int>
> >     <int name="minPrefix">1</int>
> >     <int name="maxInspections">5</int>
> >     <int name="minQueryLength">4</int>
> >     <float name="maxQueryFrequency">0.05</float>
> >   </lst>
> > ...
> >
> > Thank you!
> >
> > --
> > Audrey Lorberfeld
> > Data Scientist, w3 Search
> > Digital Workplace Engineering
> > CIO, Finance and Operations
> > IBM
> > audrey.lorberf...@ibm.com
> >
> >
> > On 8/29/19, 10:12 AM, "Joe Obernberger" 
> wrote:
> >
> >Thank you Erick.  I'm upgrading from 7.6.0 and as far as I can tell
> the
> >schema and configuration (solrconfig.xml) isn't different (apart from
> >the version).  Right now, I'm at a loss.  I still have the 7.6.0
> cluster
> >running and the query works OK there.
> >
> >Sure seems like I'm missing a field called 'features', but it's not
> >defined in the prior schema either.  Thanks again!
> >
> >-Joe
> >
> >>On 8/28/2019 6:19 PM, Erick Erickson wrote:
> >> What it says ;)
> >>
> >> My guess is that your configuration mentions the field “features” in,
> perhaps carrot.snippet or carrot.title.
> >>
> >> But it’s a guess.
> >>
> >> Best,
> >> Erick
> >>
> >>> On Aug 28, 2019, at 5:18 PM, Joe Obernberger <
> joseph.obernber...@gmail.com> wrote:
> >>>
> >>> Hi All - trying to use clustering with SolrCloud 8.2, but getting this
> error:
> >>>
> >>> "msg":"Error from server at null: org.apache.solr.search.SyntaxError:
> Query Field 'features' is not a valid field name",
> >>>
> >>> The URL, I'm using is:
> >>>
> 
https://urldefense.proofpoint.com/v2/url?u=http-3A__solrServer-3A9100_solr_DOCS_select-3Fq-3D-2A-253A-2A-26qt-3D_clustering-26clustering-3Dtrue-26clustering.collection-3Dtrue=DwIDaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=O_wgAdeSZrC8W73ggxLnVdbVDMeiJ2jnRnzz9zriMWE=Xv6mGAm4OoATTBbEz5m-J0bRyPaUXaVpvWT_f74PIJ4=
>  <
> 
https://urldefense.proofpoint.com/v2/url?u=http-3A__cronus-3A9100_solr_UNCLASS-5F2018-5F5-5F19-5F184_select-3Fq-3D-2A-253A-2A-26qt-3D_clustering-26clustering-3Dtrue-26clustering.collection-3Dtrue=DwIDaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=O_wgAdeSZrC8W73ggxLnVdbVDMeiJ2jnRnzz9zriMWE=Erwr9WXMf9Vk16cIkTMlhUQrEzKfHYinrWrM40fF1KQ=
> >
> >>>
> >>> Thanks for any ideas!
> >>>
> >>> Complete response:
> >>> {
> >>>  "responseHeader":{
> >>>"zkConnected":true,

Multi-language Spellcheck

2019-08-29 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Hi All,

We are starting up an internal search engine that has to work for many 
different languages. We are starting with a POC of Spanish and English 
documents, and we are using the DirectSolrSpellChecker. 

From reading others' threads online, I know that we have to have multiple 
spellcheckers to do this (1 for each language). However, would someone be able 
to clarify what should go in the "queryAnalyzerFieldType" tag? It seems that 
the tag can only take a single field. So, does that mean that I have to have a 
copy field that collates all tokens from all languages? Image of code attached 
for reference & sample code of English-only spellchecker below: 



<searchComponent name="spellcheck" class="solr.SpellCheckComponent">

  <str name="queryAnalyzerFieldType">???</str>

  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">minimal_en</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <str name="distanceMeasure">internal</str>
    <float name="accuracy">0.5</float>
    <int name="maxEdits">2</int>
    <int name="minPrefix">1</int>
    <int name="maxInspections">5</int>
    <int name="minQueryLength">4</int>
    <float name="maxQueryFrequency">0.05</float>
  </lst>
...
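On the multi-language part of the question: a single SpellCheckComponent can hold several named dictionaries, one per language, selected per request. A rough sketch, where the Spanish field name "minimal_es" and the dictionary names are assumptions:

```xml
<!-- Sketch: two language-specific dictionaries in one component.
     "minimal_es" and the dictionary names are hypothetical. -->
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">english</str>
    <str name="field">minimal_en</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
  </lst>
  <lst name="spellchecker">
    <str name="name">spanish</str>
    <str name="field">minimal_es</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
  </lst>
</searchComponent>
```

A request would then pick a dictionary with, e.g., spellcheck=true&spellcheck.dictionary=spanish. As I understand it, queryAnalyzerFieldType only controls how the incoming query string is tokenized before lookup, so a single language-neutral field type is the usual compromise there.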

Thank you!

-- 
Audrey Lorberfeld
Data Scientist, w3 Search
Digital Workplace Engineering
CIO, Finance and Operations
IBM
audrey.lorberf...@ibm.com
 

On 8/29/19, 10:12 AM, "Joe Obernberger"  wrote:

Thank you Erick.  I'm upgrading from 7.6.0 and as far as I can tell the 
schema and configuration (solrconfig.xml) isn't different (apart from 
the version).  Right now, I'm at a loss.  I still have the 7.6.0 cluster 
running and the query works OK there.

Sure seems like I'm missing a field called 'features', but it's not 
defined in the prior schema either.  Thanks again!

-Joe
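For context, the carrot.title / carrot.snippet mappings Erick mentions below live in the /clustering request handler defaults in solrconfig.xml. If I recall the stock techproducts example correctly, it references the example schema's "name" and "features" fields, which is exactly the kind of mismatch that produces this error; a sketch:

```xml
<!-- Sketch of the stock-style handler; "name"/"features" come from the
     techproducts example schema. Point carrot.title / carrot.snippet at
     fields your own schema actually defines. -->
<requestHandler name="/clustering" class="solr.SearchHandler">
  <lst name="defaults">
    <bool name="clustering">true</bool>
    <bool name="clustering.results">true</bool>
    <str name="carrot.title">name</str>       <!-- must exist in your schema -->
    <str name="carrot.snippet">features</str> <!-- likely culprit here -->
  </lst>
  <arr name="last-components">
    <str>clustering</str>
  </arr>
</requestHandler>
```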

On 8/28/2019 6:19 PM, Erick Erickson wrote:
> What it says ;)
>
> My guess is that your configuration mentions the field “features” in, 
perhaps carrot.snippet or carrot.title.
>
> But it’s a guess.
>
> Best,
> Erick
>
>> On Aug 28, 2019, at 5:18 PM, Joe Obernberger 
 wrote:
>>
>> Hi All - trying to use clustering with SolrCloud 8.2, but getting this 
error:
>>
>> "msg":"Error from server at null: org.apache.solr.search.SyntaxError: 
Query Field 'features' is not a valid field name",
>>
>> The URL, I'm using is:
>> 
https://urldefense.proofpoint.com/v2/url?u=http-3A__solrServer-3A9100_solr_DOCS_select-3Fq-3D-2A-253A-2A-26qt-3D_clustering-26clustering-3Dtrue-26clustering.collection-3Dtrue=DwIDaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=O_wgAdeSZrC8W73ggxLnVdbVDMeiJ2jnRnzz9zriMWE=Xv6mGAm4OoATTBbEz5m-J0bRyPaUXaVpvWT_f74PIJ4=
   

>>
>> Thanks for any ideas!
>>
>> Complete response:
>> {
>>   "responseHeader":{
>> "zkConnected":true,
>> "status":400,
>> "QTime":38,
>> "params":{
>>   "q":"*:*",
>>   "qt":"/clustering",
>>   "clustering":"true",
>>   "clustering.collection":"true"}},
>>   "error":{
>> "metadata":[
>>   "error-class","org.apache.solr.common.SolrException",
>>   "root-error-class","org.apache.solr.common.SolrException",
>>   
"error-class","org.apache.solr.client.solrj.impl.BaseHttpSolrClient$RemoteSolrException",
>>   
"root-error-class","org.apache.solr.client.solrj.impl.BaseHttpSolrClient$RemoteSolrException"],
>> "msg":"Error from server at null: 
org.apache.solr.search.SyntaxError: Query Field 'features' is not a valid field 
name",
>> "code":400}}
>>
>>
>> -Joe
>>
>
> ---
> This email has been checked for viruses by AVG.
> 
https://urldefense.proofpoint.com/v2/url?u=https-3A__www.avg.com=DwIDaQ=jf_iaSHvJObTbx-siA1ZOg=_8ViuZIeSRdQjONA8yHWPZIBlhj291HU3JpNIx5a55M=O_wgAdeSZrC8W73ggxLnVdbVDMeiJ2jnRnzz9zriMWE=yqhSyt_b52qGudiP49O1SnlGvlyZCbiNd-fp-ziS-uo=
 
>