Re: Is it possible to have different Stop words depending on the value of a field?

2019-12-04 Thread Emir Arnautović
Hi,
I’ve spent quite a lot of time working on a similar issue, but I have not
thought about it much since (at the time it was Solr 1.3), so some newer
features might push me in a different direction. Here is what I remember: You
cannot rely on users entering a standardised address format, even within one
country. Users will use both abbreviations and full names. If you need to
support Japan - good luck. India is a similar story. You might want to
preprocess the input and do some entity extraction and parsing at both index
time and query time. Solr scoring is not good enough for addresses - it is
good for giving you candidates, but after that you need to apply a custom
scoring function on either the Solr side or the client side. If you have the
ability to use a full-blown geocoder, use it at both index and query time -
you can even store multiple geocoding results with scores and use those
scores to calculate the final score. The good thing is that Solr has many
extension points and I’ve used almost all of them, but unfortunately those
were proprietary plugins and I was not able to persuade the client to open
source them.

Good Luck,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/
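
The "candidates from Solr, then a custom score" idea from the message above
can be sketched on the client side roughly as follows. This is only an
illustration, not the proprietary plugins mentioned: the weights, the function
names and the rule "tokens containing digits matter more" are all assumptions
made for the example.

```python
import re

# Hypothetical weights: tokens containing digits (house numbers, postal codes)
# count more than plain words such as "main" or "street".
NUMERIC_WEIGHT = 3.0
WORD_WEIGHT = 1.0


def tokenize(address: str) -> list[str]:
    """Lowercase and keep runs of ASCII letters/digits (a simplification)."""
    return re.findall(r"[a-z0-9]+", address.lower())


def token_weight(token: str) -> float:
    """Give numeric-looking tokens a higher weight than plain words."""
    return NUMERIC_WEIGHT if any(ch.isdigit() for ch in token) else WORD_WEIGHT


def rescore(query: str, candidate: str) -> float:
    """Weighted share of query tokens that also occur in the candidate (0..1)."""
    query_tokens = set(tokenize(query))
    candidate_tokens = set(tokenize(candidate))
    total = sum(token_weight(t) for t in query_tokens)
    if total == 0:
        return 0.0
    matched = sum(token_weight(t) for t in query_tokens & candidate_tokens)
    return matched / total


def rerank(query: str, candidates: list[str]) -> list[tuple[float, str]]:
    """Sort candidates (e.g. the top rows returned by Solr) by the custom score."""
    scored = [(rescore(query, c), c) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

With the query "123 main street CA 90201 US", a candidate that shares the house
number and postal code ranks above one that only shares generic words. A real
scorer would likely also use parsed address components or geocoder scores.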




Re: Is it possible to have different Stop words depending on the value of a field?

2019-12-03 Thread Paras Lehana
Hi Yeikel,

I want to stress three things:

   1. If you know the probable words which can be written in different ways
   (like street), you can use Synonyms.

   2. Longer queries can have different mm values. The mm parameter supports
   different values for different numbers of query words. We generally require
   a 100% match for up to 2 words, words - 1 for more than 2 words, and 70%
   for more than 7 words.

   3. The number of matching documents should not heavily impact your response
   time. You can always use the rows parameter to decrease the result set. Is
   your issue regarding the ranking of documents or the number of documents?
   Please give examples of the results that you don't want returned for a
   query.
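
The rule in point 2 can, as far as I can tell, be written as the conditional
mm value "2<-1 7<70%". The sketch below is a small Python approximation of how
such a spec is evaluated according to the documented semantics (percentages
rounded down, negative values meaning "this many may be missing"). It is not
Solr's own code, and the function names are made up for the example.

```python
def _apply(value: str, clause_count: int) -> int:
    """Apply a single mm value (integer or percentage, possibly negative)."""
    if value.endswith("%"):
        percent = int(value[:-1])
        # The share computed from the percentage is rounded down.
        share = clause_count * abs(percent) // 100
        # Negative percentage: that share of clauses may be missing.
        result = clause_count - share if percent < 0 else share
    else:
        number = int(value)
        # Negative integer: that many clauses may be missing.
        result = clause_count + number if number < 0 else number
    # Never require fewer than 0 or more than all clauses.
    return max(0, min(clause_count, result))


def required_matches(mm_spec: str, clause_count: int) -> int:
    """Number of optional clauses that must match for the given mm spec."""
    result = clause_count  # below the first threshold, everything is required
    for part in mm_spec.split():
        if "<" not in part:
            return _apply(part, clause_count)
        upper_bound, value = part.split("<", 1)
        if clause_count <= int(upper_bound):
            return result
        result = _apply(value, clause_count)
    return result


MM_SPEC = "2<-1 7<70%"  # assumed encoding of the rule described above
for words in (2, 3, 6, 7, 8, 10):
    print(words, "words ->", required_matches(MM_SPEC, words), "must match")
```

For the six-token address from the original question this requires five
matching tokens, which still allows a record without "street" to be found.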



RE: Is it possible to have different Stop words depending on the value of a field?

2019-12-02 Thread email
Thank you for jumping in @hastings.recurs...@gmail.com

I have an index with raw addresses in a non-standardized format such as "123
main street" or "main street 123", and I am looking to search this index and
pull the closest addresses for another raw input with a similarly
unpredictable format. Ideally, I am trying to reduce the number of results as
much as possible because of time constraints.

At the moment, I am launching a dismax query with the mm (minimum should
match) parameter set to a value I am comfortable with (say 50%, for example).

For an address such as "123 main street CA 90201 US", if I execute a query
such as "return addresses that match 50% of the tokens" (dismax, with mm set
to 50%), I will potentially get records with "US Street 123" or "main street
CA", which is not something that I am looking for. I understand that I could
increase the mm parameter and set it to, say, "100%", but again, I am not sure
if the token "street" should be considered when calculating the mm parameter,
as I could miss a record such as "123 main CA 90201 US".

For longer addresses, the relevance of "main" or "street" is much lower than
that of keywords such as the apartment number or the city.

I am not sure if this is the right way to search for unstructured addresses,
so we are open to suggestions.

Thank you
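
To make the trade-off described above concrete, here is a rough Python
simulation of the mm check on plain tokens. It ignores the real analysis chain
and scoring, so treat it as an approximation; the field name "address" comes
from the example document in this thread, while the helper names are invented
for the sketch.

```python
import re
from urllib.parse import urlencode


def tokenize(text: str) -> set[str]:
    """Lowercase and keep runs of ASCII letters/digits (a simplification)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def passes_mm(query: str, candidate: str, mm_percent: int) -> bool:
    """True if the candidate contains at least mm_percent of the query tokens."""
    query_tokens = tokenize(query)
    needed = len(query_tokens) * mm_percent // 100  # rounded down
    hits = len(query_tokens & tokenize(candidate))
    return hits >= needed


QUERY = "123 main street CA 90201 US"

# Request parameters for the dismax query described above.
params = urlencode({"defType": "dismax", "q": QUERY, "qf": "address", "mm": "50%"})
print(params)

for candidate in ("US Street 123", "main street CA", "123 main CA 90201 US"):
    print(candidate, "| mm=50%:", passes_mm(QUERY, candidate, 50),
          "| mm=100%:", passes_mm(QUERY, candidate, 100))
```

At 50% all three candidates pass, including the two unwanted ones; at 100% the
wanted record "123 main CA 90201 US" is lost because "street" is missing.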





Re: Is it possible to have different Stop words depending on the value of a field?

2019-12-02 Thread Dave
I’ll add to that since I’m up. Stopwords are in a practical sense useless and
serve no purpose. It’s an old way to save index size that’s not needed any
more. You’d need very specific use cases to want to use them. Maybe you do, but
generally you never do unless it’s for training a machine or something a bit
more on the experimental side. If you can explain why you think you need stop
words, that would be helpful in perhaps guiding you to an alternative.



RE: Is it possible to have different Stop words depending on the value of a field?

2019-12-02 Thread email
That makes sense, thank you for the clarification!

@wun...@wunderwood.org If you can, please build on your explanation, as it
sounds relevant.




Re: Is it possible to have different Stop words depending on the value of a field?

2019-12-02 Thread Dave
That clarifies it, yes. You need new fields. In this case something like
Address_us
Address_uk
Then index and search them accordingly, with different stopword files used in
different field types, hence the copy field from “address” into as many new
fields as needed.
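
One way to set this up, assuming a managed schema and Solr's Schema API, is to
post one field type, one field and one copy field per country. The sketch
below only builds the JSON payloads; the stopword file names, the type names
and the lowercase field names are placeholders chosen for the example, and the
same definitions could equally be written by hand in the schema XML.

```python
import json

# Hypothetical stopword files, one per country.
COUNTRY_STOPWORDS = {"us": "stopwords_us.txt", "uk": "stopwords_uk.txt"}


def schema_commands(country: str, stopword_file: str,
                    source_field: str = "address") -> dict:
    """Build Schema API commands for one per-country address field."""
    type_name = f"text_address_{country}"
    field_name = f"address_{country}"
    return {
        # Field type whose analyzer uses the country-specific stopword file.
        "add-field-type": {
            "name": type_name,
            "class": "solr.TextField",
            "analyzer": {
                "tokenizer": {"class": "solr.StandardTokenizerFactory"},
                "filters": [
                    {"class": "solr.LowerCaseFilterFactory"},
                    {"class": "solr.StopFilterFactory",
                     "words": stopword_file, "ignoreCase": "true"},
                ],
            },
        },
        # The per-country field itself.
        "add-field": {"name": field_name, "type": type_name,
                      "indexed": True, "stored": False},
        # Copy the raw address into the per-country field.
        "add-copy-field": {"source": source_field, "dest": field_name},
    }


for country, stopword_file in COUNTRY_STOPWORDS.items():
    # Each payload would be POSTed to <collection>/schema.
    print(json.dumps(schema_commands(country, stopword_file), indent=2))
```

Note that a copy field copies every document's address into every per-country
field, so the query still has to target the field that matches the country of
the address being searched.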



RE: Is it possible to have different Stop words depending on the value of a field?

2019-12-02 Thread email
To clarify, a document would look like this:

{
  "address": "123 main Street",
  "country": "US"
}

What I'd like to do when I configure my index is to apply a different set of
stop words to the address field depending on the value of the country. For
example, something like this:

If (country == US) -> File1
Else If (country == UK) -> File2

Etc..

Hopefully, that clarifies.
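
If this selection were done outside Solr, at the ETL step, the If/Else above
could look roughly like the following Python sketch. The stop word sets stand
in for File1 and File2 and are invented for the example, as is the output
field name "address_clean". Other replies in this thread argue against using
stop words at all, so this only illustrates the mechanism.

```python
# Hypothetical per-country stop word sets standing in for File1, File2, ...
STOP_WORDS = {
    "US": {"street", "st", "avenue", "ave"},
    "UK": {"road", "rd", "lane"},
}


def strip_stop_words(doc: dict) -> dict:
    """Return a copy of doc with a stop-word-free 'address_clean' field.

    Tokens are split on whitespace only; countries without a configured
    stop word set keep their address unchanged.
    """
    stop_words = STOP_WORDS.get(doc.get("country", ""), set())
    kept = [t for t in doc["address"].split() if t.lower() not in stop_words]
    return {**doc, "address_clean": " ".join(kept)}


print(strip_stop_words({"address": "123 main Street", "country": "US"}))
```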





Re: Is it possible to have different Stop words depending on the value of a field?

2019-12-02 Thread Walter Underwood
The best approach is to not use stop words at all. That gives better relevance 
with less configuration, so it is a total win.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)




Re: Is it possible to have different Stop words depending on the value of a field?

2019-12-02 Thread Jörn Franke
You can have different fields by country. I am not sure about your stop words,
but if they do not occur in the other languages then you do not have a
problem.
On the other hand: if you need more than stop words (e.g. lemmatizing, a
specialized way of tokenization, etc.) then you need a different field per
language. You don’t describe your full use case, but if you have different
fields for different languages then your client application needs to handle
this (not difficult, but you have to be aware of it).
Not sure if you need to search a given address in all languages or if you use
the language of the user etc.
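
The client-side handling mentioned above could be as small as a lookup from
country to query field. A minimal Python sketch, assuming per-country fields
named address_us and address_uk plus a generic "address" fallback (all of
these names, and the country filter, are assumptions for the example):

```python
from urllib.parse import urlencode

# Hypothetical mapping from country code to the field analysed for it.
FIELD_BY_COUNTRY = {"US": "address_us", "UK": "address_uk"}
DEFAULT_FIELD = "address"  # fallback when no country-specific field exists


def build_query_params(address: str, country: str) -> dict:
    """Pick the query field for the country and build the request parameters."""
    code = country.upper()
    return {
        "defType": "dismax",
        "q": address,
        "qf": FIELD_BY_COUNTRY.get(code, DEFAULT_FIELD),
        "fq": f"country:{code}",  # restrict results to the same country
    }


print(urlencode(build_query_params("123 main street", "us")))
```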



Is it possible to have different Stop words depending on the value of a field?

2019-12-02 Thread yeikel valdes
Hi,


I have an index that stores addresses from different countries.


As every country has different stop words, I was wondering if it is possible to 
apply a different set of stop words depending on the value of a field. 


Or do I need different indexes/do it at the ETL step to accomplish this?