Re: Is it possible to have different Stop words depending on the value of a field?
Hi,

I spent quite a lot of time working on a similar issue, but I have not thought about it much since (at the time it was Solr 1.3), so newer features might push me in a different direction. Here is what I remember:

- You cannot rely on users entering a standardised address format, even within one country. Users will use both abbreviations and full names.
- If you need to support Japan, good luck. India is a similar story.
- You might want to preprocess input and do some entity extraction and parsing at both index time and query time.
- Solr scoring is not good enough for addresses. It is good for giving you candidates, but after that you need to apply a custom scoring function on either the Solr side or the client side.
- If you are able to use a full-blown geocoder, use it at both index and query time. You can even store multiple geocoding results with scores and use those scores to calculate the final score.

The good thing is that Solr has many extension points, and I have used almost all of them. Unfortunately, those were proprietary plugins and I was not able to persuade the client to open source them.

Good luck,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/
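The client-side re-scoring idea in this reply could be sketched roughly as follows. This is only an illustration under assumptions: the weights, the `token_type` heuristic, and the function names are hypothetical and are not the proprietary plugins mentioned above. Solr is assumed to have already returned a list of candidate address strings.

```python
from typing import List, Tuple

# Hypothetical weights: numbers and postcodes count more than plain words.
WEIGHTS = {"number": 3.0, "postcode": 3.0, "word": 1.0}


def token_type(token: str) -> str:
    """Very rough classification (an assumption, not a real address parser)."""
    if token.isdigit():
        return "postcode" if len(token) >= 5 else "number"
    return "word"


def tokenize(text: str) -> List[str]:
    """Lowercase and split on whitespace."""
    return text.lower().split()


def rescore(query: str, candidates: List[str]) -> List[Tuple[float, str]]:
    """Re-rank candidates by the weighted share of query tokens they contain."""
    q_tokens = tokenize(query)
    total = sum(WEIGHTS[token_type(t)] for t in q_tokens)
    scored = []
    for cand in candidates:
        c_tokens = set(tokenize(cand))
        hit = sum(WEIGHTS[token_type(t)] for t in q_tokens if t in c_tokens)
        scored.append((hit / total if total else 0.0, cand))
    # Highest score first; ties keep the order Solr returned them in.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


ranked = rescore("123 main street CA 90201 US",
                 ["main street CA", "123 main CA 90201 US"])
```

With these assumed weights, a candidate that shares the house number and postcode outranks one that only shares the street words, which is the behaviour the original poster asks for later in the thread.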
Re: Is it possible to have different Stop words depending on the value of a field?
Hi Yeikel,

I want to stress three things:

1. If you know the probable words that can be written in different ways (like street), you can use synonyms.

2. Longer queries can have different mm values. The mm parameter supports different values for different query lengths. We generally require a 100% match for up to 2 words, words-1 for more than 2 words, and 70% for more than 7 words.

3. The returned numDocs should not heavily impact your response time. You can always use the rows parameter to shrink the result set.

Is your issue with the ranking of documents or with the number of documents? Please give examples of the results that you don't want returned for a query.
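The length-dependent mm rule from point 2 can be written with Solr's conditional mm syntax. The sketch below is an assumption-laden illustration: the spec string is my reading of the rule described in this message, `required_matches` is a hypothetical helper that only mirrors that spec for sanity checks, and the `address` field name is taken from the example document elsewhere in the thread.

```python
from urllib.parse import urlencode

# Conditional mm spec for the rule described above:
#   1-2 clauses -> all required; 3-7 -> all but one; more than 7 -> 70%.
MM_SPEC = "2<-1 7<70%"


def required_matches(num_clauses: int) -> int:
    """Hypothetical helper mirroring MM_SPEC, to sanity-check the thresholds."""
    if num_clauses <= 2:
        return num_clauses
    if num_clauses <= 7:
        return num_clauses - 1
    return (num_clauses * 70) // 100  # percentages are rounded down


def build_params(query: str, rows: int = 10) -> dict:
    """Request parameters for a dismax query; `rows` caps the result set."""
    return {"defType": "dismax", "q": query, "qf": "address",
            "mm": MM_SPEC, "rows": rows}


query_string = urlencode(build_params("123 main street CA 90201 US"))
```

The encoded `query_string` would be appended to the collection's select handler URL; the handler path and collection name depend on your setup and are not shown here.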
RE: Is it possible to have different Stop words depending on the value of a field?
Thank you for jumping in @hastings.recurs...@gmail.com

I have an index with raw addresses in a non-standardized format such as "123 main street" or "main street 123", and I am looking to search this index and pull the closest addresses for another raw input with a similarly unpredictable format. Ideally, I am trying to reduce the number of results as much as possible because of time constraints.

At the moment, I am launching a dismax query with the mm (minimum should match) parameter set to a value I am comfortable with (say 50%).

For an address such as "123 main street CA 90201 US", if I execute a query such as "return addresses that match 50% of the tokens" (dismax, with mm set to 50%), I will potentially get records with "US Street 123" or "main street CA", which is not what I am looking for. I understand that I could increase the mm parameter and set it to, say, 100%, but then I am not sure whether the token "street" should be considered when calculating mm, as I could miss a record such as "123 main CA 90201 US".

For longer addresses, the relevance of "main" or "street" is much lower than that of keywords such as the apartment number or the city.

I am not sure if this is the right way to search for unstructured addresses, so we are open to suggestions.

Thank you
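The mm trade-off described in this message can be worked through on the three example records. This is a simplified sketch: it counts whitespace-separated, lowercased tokens and ignores whatever analysis chain the real field uses, so it only approximates how dismax counts matching clauses.

```python
def tokens(text: str) -> set:
    """Lowercased, whitespace-separated tokens (a stand-in for real analysis)."""
    return set(text.lower().split())


def matched(query: str, record: str) -> int:
    """Number of distinct query tokens that also occur in the record."""
    return len(tokens(query) & tokens(record))


query = "123 main street CA 90201 US"
total = len(tokens(query))          # 6 query tokens
needed_50 = total * 50 // 100       # mm=50%  -> 3 tokens
needed_100 = total                  # mm=100% -> 6 tokens

records = ["US Street 123", "main street CA", "123 main CA 90201 US"]
report = {r: (matched(query, r),
              matched(query, r) >= needed_50,
              matched(query, r) >= needed_100) for r in records}
```

Under these assumptions both unwanted records reach exactly 3 of 6 tokens and pass at 50%, while the wanted record "123 main CA 90201 US" reaches 5 of 6 and is rejected at 100% only because "street" is missing.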
Re: Is it possible to have different Stop words depending on the value of a field?
I’ll add to that since I’m up. Stopwords are, in a practical sense, useless and serve no purpose. They are an old way to save index size that is not needed any more. You’d need a very specific use case to want them. Maybe you do, but generally you never do unless it’s for training a machine or something a bit more on the experimental side. If you can explain *why* you think you need stop words, that would help in guiding you to an alternative.
RE: Is it possible to have different Stop words depending on the value of a field?
That makes sense, thank you for the clarification!

@wun...@wunderwood.org If you can, please build on your explanation, as it sounds relevant.
Re: Is it possible to have different Stop words depending on the value of a field?
It clarifies, yes. You need new fields, in this case something like Address_us and Address_uk. Index and search them accordingly, with different stopword files used in different field types, hence the copy field from “address” into as many new fields as needed.
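One way to set up the per-country fields described here is through Solr's Schema API. The sketch below only builds the JSON request body. The lowercase field names, the field type names, the stopword file names, and the choice of tokenizer and lowercase filter are all assumptions for illustration; the body would be POSTed to the schema endpoint of your own collection.

```python
import json

COUNTRIES = ["us", "uk"]  # assumed country codes


def schema_commands(countries):
    """Build a Schema API body: one field type and one field per country."""
    field_types, fields = [], []
    for c in countries:
        field_types.append({
            "name": f"text_address_{c}",          # hypothetical type name
            "class": "solr.TextField",
            "analyzer": {
                "tokenizer": {"class": "solr.StandardTokenizerFactory"},
                "filters": [
                    {"class": "solr.LowerCaseFilterFactory"},
                    {"class": "solr.StopFilterFactory",
                     "words": f"stopwords_{c}.txt",  # hypothetical file name
                     "ignoreCase": "true"},
                ],
            },
        })
        fields.append({"name": f"address_{c}", "type": f"text_address_{c}",
                       "indexed": True, "stored": False})
    return {
        "add-field-type": field_types,
        "add-field": fields,
        # Copy the raw address into every per-country field.
        "add-copy-field": {"source": "address",
                           "dest": [f"address_{c}" for c in countries]},
    }


body = json.dumps(schema_commands(COUNTRIES), indent=2)
```

Note that a copy field is unconditional: every document's address ends up in every per-country field, so the country-specific behaviour comes from which field the query targets, not from which field the document was copied into.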
RE: Is it possible to have different Stop words depending on the value of a field?
To clarify, a document would look like this:

{
  address: "123 main Street",
  country: "US"
}

What I'd like to do when I configure my index is to apply a different set of stop words to the address field depending on the value of the country. For example, something like this:

If (country == US) -> File1
Else If (country == UK) -> File2

Etc.

Hopefully, that clarifies.
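If the conditional has to depend on the country value of each document, one option is to do the routing at the ETL step before indexing. The sketch below is a minimal illustration, assuming per-country fields named `address_<country>` exist in the schema with their own stopword files; `route_address` is a hypothetical helper, not a Solr feature.

```python
def route_address(doc: dict) -> dict:
    """Copy `address` into a per-country field, e.g. address_us for "US"."""
    out = dict(doc)  # leave the caller's document untouched
    country = doc.get("country", "").strip().lower()
    if country and "address" in doc:
        out[f"address_{country}"] = doc["address"]
    return out


indexed = route_address({"address": "123 main Street", "country": "US"})
```

With this approach each document populates only the field whose analysis matches its country, which mirrors the "If (country == US) -> File1" rule above.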
Re: Is it possible to have different Stop words depending on the value of a field?
The best approach is to not use stop words at all. That gives better relevance with less configuration, so it is a total win.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
Re: Is it possible to have different Stop words depending on the value of a field?
You can have different fields by country. I am not sure about your stop words, but if they do not occur in the other languages then you do not have a problem.

On the other hand, if you need more than stop words (e.g. lemmatizing, a specialized way of tokenizing, etc.) then you need a different field per language. You don’t describe your full use case, but if you have different fields for different languages then your client application needs to handle this (not difficult, but you have to be aware of it).

I am not sure whether you need to search a given address in all languages or whether you use the language of the user, etc.
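The client-side handling mentioned here could look roughly like the following. It is a sketch under assumptions: the per-country field names and the supported country list are hypothetical, and the fallback of searching all fields when the country is unknown is one possible design, not something the thread prescribes.

```python
SUPPORTED = ("us", "uk")  # assumed per-country fields: address_us, address_uk


def query_fields(country=None) -> str:
    """Pick the qf value: one field if the country is known, else all fields."""
    if country and country.lower() in SUPPORTED:
        return f"address_{country.lower()}"
    return " ".join(f"address_{c}" for c in SUPPORTED)


def build_query(text: str, country=None) -> dict:
    """dismax parameters targeting the field(s) chosen for this request."""
    return {"defType": "dismax", "q": text, "qf": query_fields(country)}


params = build_query("123 main street", country="US")
```

Searching a single field keeps the stopword handling consistent with how the document was indexed; searching all fields trades that precision for coverage when the country of the input is not known.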
Is it possible to have different Stop words depending on the value of a field?
Hi,

I have an index that stores addresses from different countries.

As every country has different stop words, I was wondering if it is possible to apply a different set of stop words depending on the value of a field.

Or do I need different indexes, or to do it at the ETL step, to accomplish this?