Auto-Suggest within Tier Architecture
Hello, looking to see how others have accomplished this goal. We have a 3-tier architecture, and Solr sits deep in Tier 3 (T3), far from the end user. How do you make auto-suggest calls from the browser through the tiers down to Solr in T3? We essentially created hops down each tier, but I'm looking to know what other approaches people have built. Did you put your Solr in T1? I assume not, since that would put it at risk. Thanks!

Brett Moyer

* This e-mail may contain confidential or privileged information. If you are not the intended recipient, please notify the sender immediately and then delete it. TIAA *
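One common pattern for the tier-hopping Brett describes is a narrow pass-through endpoint at each tier that whitelists suggest parameters and forwards the rest inward, so Solr itself is never exposed. A minimal sketch of the forwarding logic (the hostname and parameter whitelist are assumptions, not from the thread):

```python
from urllib.parse import urlencode

# Hypothetical next-tier endpoint; each tier exposes only this narrow path inward.
NEXT_TIER = "http://t2-suggest.internal:8080/suggest"
# Whitelist of parameters the outer tier is allowed to pass through (assumed names).
ALLOWED = {"q", "suggest.count"}

def build_forward_url(params: dict) -> str:
    """Drop anything not whitelisted, then build the URL for the next tier down."""
    safe = {k: v for k, v in params.items() if k in ALLOWED}
    return NEXT_TIER + "?" + urlencode(safe)
```

The browser only ever talks to the T1 endpoint; each tier repeats this hop, so a compromised outer tier can issue nothing but constrained suggest calls.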
RE: Odd Edge Case for SpellCheck
This is a great help, thank you!

Brett Moyer

-----Original Message-----
From: Erick Erickson
Sent: Monday, November 25, 2019 4:12 PM
To: solr-user@lucene.apache.org
Subject: Re: Odd Edge Case for SpellCheck

If you're using direct spell checking, it looks for the _indexed_ term. So this means you get stemmed corrections if you're stemming, etc. Usually you should use a copyField to a field with minimal analysis and use that field for spellchecking.

Another way to think about it is that if you use the admin/analysis page for terms in a field, the terms in the dictionary are what's at the end of the indexed side of the page.

Best,
Erick

> On Nov 25, 2019, at 4:02 PM, Moyer, Brett wrote:
>
> Yes we are stemming. Ahh, so we shouldn't stem the words to be spellchecked?
>
> Brett Moyer
>
> -----Original Message-----
> From: Jörn Franke
> Sent: Friday, November 22, 2019 8:34 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Odd Edge Case for SpellCheck
>
> Stemming involved?
>
>> Am 22.11.2019 um 14:23 schrieb Moyer, Brett:
>>
>> Hello, we have spellcheck running, using the index as the dictionary. An odd use case came up today; I wanted to get your thoughts and see if what we determined is correct. Use case: a user sends a query for q=brokerage, spellcheck fires, and returns "brokerage". Looking at the output, I see that Solr must have pulled the stemmed root "brokage", and then spellcheck said "hey, I need to fix that." Is that correct? There's no issue, it's just an unexpected outcome. Thanks!
>>
>> "q":"brokerage",
>> "spellcheck":{
>>   "suggestions":[
>>     {"name":"brokage","type":"str","value":{
>>       "numFound":1,
>>       "startOffset":0,
>>       "endOffset":9,
>>       "suggestion":["brokerage"]}}],
>>   "collations":[
>>     {"name":"collation","type":"str","value":"brokerage"}]}
>>
>> Brett Moyer
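Erick's copyField suggestion can be sketched as a schema fragment like the following; the field and type names are illustrative, not taken from the thread:

```xml
<!-- Minimal-analysis field for the spellcheck dictionary (names are assumptions) -->
<fieldType name="spell_text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="spell" type="spell_text" indexed="true" stored="false" multiValued="true"/>
<copyField source="body" dest="spell"/>
```

Because the `spell` field is not stemmed, the dictionary holds surface forms like "brokerage" rather than roots like "brokage", avoiding the edge case in this thread.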
RE: Odd Edge Case for SpellCheck
Yes we are stemming. Ahh, so we shouldn't stem the words to be spellchecked?

Brett Moyer

-----Original Message-----
From: Jörn Franke
Sent: Friday, November 22, 2019 8:34 AM
To: solr-user@lucene.apache.org
Subject: Re: Odd Edge Case for SpellCheck

Stemming involved?

> Am 22.11.2019 um 14:23 schrieb Moyer, Brett:
>
> Hello, we have spellcheck running, using the index as the dictionary. An odd use case came up today; I wanted to get your thoughts and see if what we determined is correct. Use case: a user sends a query for q=brokerage, spellcheck fires, and returns "brokerage". Looking at the output, I see that Solr must have pulled the stemmed root "brokage", and then spellcheck said "hey, I need to fix that." Is that correct? There's no issue, it's just an unexpected outcome. Thanks!
>
> "q":"brokerage",
> "spellcheck":{
>   "suggestions":[
>     {"name":"brokage","type":"str","value":{
>       "numFound":1,
>       "startOffset":0,
>       "endOffset":9,
>       "suggestion":["brokerage"]}}],
>   "collations":[
>     {"name":"collation","type":"str","value":"brokerage"}]}
>
> Brett Moyer
Odd Edge Case for SpellCheck
Hello, we have spellcheck running, using the index as the dictionary. An odd use case came up today; I wanted to get your thoughts and see if what we determined is correct. Use case: a user sends a query for q=brokerage, spellcheck fires, and returns "brokerage". Looking at the output, I see that Solr must have pulled the stemmed root "brokage", and then spellcheck said "hey, I need to fix that." Is that correct? There's no issue, it's just an unexpected outcome. Thanks!

"q":"brokerage",
"spellcheck":{
  "suggestions":[
    {"name":"brokage","type":"str","value":{
      "numFound":1,
      "startOffset":0,
      "endOffset":9,
      "suggestion":["brokerage"]}}],
  "collations":[
    {"name":"collation","type":"str","value":"brokerage"}]}

Brett Moyer
RE: Facet Advice
Hello Shawn, thanks for the reply. The results that come back are correct, but are we implementing the query correctly to filter by a selected facet? When I say wrong, it's more about the design/use of facets in the query. Is it proper to do fq=Tags:Retirement? Is using a multiValued field correct for facets? Why do you say the queries above are not facets? Here is an excerpt from our JSON:

"facet_counts": {
  "facet_queries": {},
  "facet_fields": {
    "Tags": [
      "Retirement", 1260,
      "Locations & People", 1149,
      "Advice and Tools", 1015,
      "Careers", 156,
      "Annuities", 101,
      "Performance",

Brett Moyer
Manager, Sr. Technical Lead | TFS Technology Public Production Support Digital Search & Discovery
8625 Andrew Carnegie Blvd | 4th floor
Charlotte, NC 28263
Tel: 704.988.4508 Fax: 704.988.4907
bmo...@tiaa.org

-----Original Message-----
From: Shawn Heisey
Sent: Tuesday, October 15, 2019 5:40 AM
To: solr-user@lucene.apache.org
Subject: Re: Facet Advice

On 10/14/2019 3:25 PM, Moyer, Brett wrote:
> Hello, looking for some advice; I suspect we are doing facets all wrong. We host financial information and recently "tagged" our pages with appropriate facets. We have built a flat design. Are we going at it the wrong way?
>
> In Solr we have a "Tags" field, and based on some magic we tagged each page on the site with a number of the example facets below. We have the UI team sending queries of the form 1) q=get a loan&fq=Tags:Retirement, 2) q=get a loan&fq=Tags:Retirement AND Tags:Move Money. This restricts the result set, hopefully guiding the user to their desired result. Something about it doesn't seem right. Is this right with a flat, single-level pattern like ours? Should each doc have multiple fields mapping to different values? Any help is appreciated. Thanks!
>
> Example facets:
> Brokerage
> Retirement
> Open an Account
> Move Money
> Estate Planning

The queries you mentioned above do not have facets, only the q and fq parameters. You also have not mentioned what in the results is wrong to you.

If you restrict the query to only a certain value in the tag field, then facets will only count documents that match the full query -- users will not be able to see the count of documents that do NOT match the query, unless you use tagging/excluding with your filters. This is part of the functionality called multi-select faceting:

http://yonik.com/multi-select-faceting/

Because your message doesn't say what in the results is wrong, we can only guess about how to help you. I do not know if the above information will be helpful or not.

Thanks,
Shawn
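The tagging/excluding Shawn mentions uses Solr's local-params syntax: tag the filter, then exclude that tag when computing the facet so unselected values keep their full counts. A sketch of building such a request (the tag name `tagFQ` is an arbitrary choice, not from the thread):

```python
def multi_select_params(query: str, selected_tag: str) -> dict:
    """Build Solr params for multi-select faceting on the Tags field."""
    return {
        "q": query,
        # Tag the filter so the facet computation can refer to it...
        "fq": "{!tag=tagFQ}Tags:%s" % selected_tag,
        "facet": "true",
        # ...and exclude it here, so counts for the other Tags values survive.
        "facet.field": "{!ex=tagFQ}Tags",
    }
```

With this, selecting "Retirement" still returns counts for "Careers", "Annuities", etc., which is the behavior a typical faceted UI expects.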
Facet Advice
Hello, looking for some advice; I suspect we are doing facets all wrong. We host financial information and recently "tagged" our pages with appropriate facets. We have built a flat design. Are we going at it the wrong way?

In Solr we have a "Tags" field, and based on some magic we tagged each page on the site with a number of the example facets below. We have the UI team sending queries of the form 1) q=get a loan&fq=Tags:Retirement, 2) q=get a loan&fq=Tags:Retirement AND Tags:Move Money. This restricts the result set, hopefully guiding the user to their desired result. Something about it doesn't seem right. Is this right with a flat, single-level pattern like ours? Should each doc have multiple fields mapping to different values? Any help is appreciated. Thanks!

Example facets:
Brokerage
Retirement
Open an Account
Move Money
Estate Planning
Etc.

Brett
RE: Indexed Data Size
Turns out this is due to a job that indexes logs. We were able to clear some with another job. We are working through the value of these indexed logs. Thanks for all your help!

Brett Moyer

-----Original Message-----
From: Shawn Heisey
Sent: Friday, August 9, 2019 2:25 PM
To: solr-user@lucene.apache.org
Subject: Re: Indexed Data Size

On 8/9/2019 12:17 PM, Moyer, Brett wrote:
> The biggest is /data/solr/system_logs_shard1_replica_n1/data/index, with files of the extensions I stated previously. Each is 5 GB and there are a few hundred, dated over the last 3 months. I don't understand why there are so many files for such small indexes. Not sure how to clean them up.

Can you get a screenshot of the core overview for that particular core? Solr should correctly calculate the size on the overview based on what files are actually in the index directory.

Thanks,
Shawn
RE: Indexed Data Size
Correct, our indexes are small document-wise, but for some reason we have a year's worth of files in the data/solr folders. There are no index.* directories. The biggest is /data/solr/system_logs_shard1_replica_n1/data/index, with files of the extensions I stated previously. Each is 5 GB and there are a few hundred, dated over the last 3 months. I don't understand why there are so many files for such small indexes. Not sure how to clean them up.

-----Original Message-----
From: Shawn Heisey
Sent: Friday, August 9, 2019 9:11 AM
To: solr-user@lucene.apache.org
Subject: Re: Indexed Data Size

On 8/9/2019 6:12 AM, Moyer, Brett wrote:
> Thanks! We update each index nightly; we don't clear, but bring in new docs and deltas, and delete expired/404 pages. All our data are basically webpages, so none are very large. Some PDFs, but again not too large. We are running Solr 7.5; hopefully you can access the links.

Solr is saying that the entire size of the index directory is 95 MB for one of those indexes and the other is 30 MB. Those sound to me like very small indexes, not very large as you indicated.

You were saying that the large files were in data/index, and did not mention anything about index.* directories. If you do have a bunch of index.* directories in the "Data" directory mentioned on the core overview page, you can safely delete all of the index and/or index.* directories under that directory EXCEPT the one that is indicated as the "Index" directory. If you delete that one, you're deleting the actual live index ... and since you're not on Windows, the OS will let you delete it without complaining.

The directory locations are cut off on both screenshots, so I can't confirm anything there. The larger core has about 2000 deleted docs and the smaller one has 40. Doing an optimize will not save much disk space or take very long.

Thanks,
Shawn
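Shawn's rule ("delete every index/index.* directory except the live one") can be sketched as a small helper. This is an illustrative script, not a Solr tool; it assumes the common layout where `data/index.properties`, when present, records the live directory in a simple `index=<dirname>` line, and plain `index` is live otherwise:

```python
import os

def stale_index_dirs(data_dir: str) -> list:
    """Return index/index.* directories under data_dir that are NOT the live index."""
    live = "index"  # default when no index.properties exists
    props = os.path.join(data_dir, "index.properties")
    if os.path.exists(props):
        with open(props) as f:
            for line in f:
                if line.startswith("index="):
                    live = line.strip().split("=", 1)[1]
    return sorted(
        d for d in os.listdir(data_dir)
        if (d == "index" or d.startswith("index."))
        and d != live
        and os.path.isdir(os.path.join(data_dir, d))
    )
```

Review the list before deleting anything; removing the live directory destroys the index, as Shawn warns.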
RE: Indexed Data Size
Thanks! We update each index nightly; we don't clear, but bring in new docs and deltas, and delete expired/404 pages. All our data are basically webpages, so none are very large. Some PDFs, but again not too large. We are running Solr 7.5; hopefully you can access the links.

https://www.dropbox.com/s/lzd6hkoikhagujs/CoreOne.png?dl=0
https://www.dropbox.com/s/ae6rayb38q39u9c/CoreTwo.png?dl=0

Brett

-----Original Message-----
From: Erick Erickson
Sent: Thursday, August 8, 2019 5:49 PM
To: solr-user@lucene.apache.org
Subject: Re: Indexed Data Size

On the surface, this makes no sense at all, so there's something I don't understand here ;).

How often do you update your index? Having files from a long time ago is perfectly reasonable if you're not updating regularly. But your statement that some of these are huge for just a 50K-document index is odd unless they're _huge_ documents.

I wouldn't optimize unless you're on Solr 7.5+, as that'll create a single segment; see:
https://lucidworks.com/post/segment-merging-deleted-documents-optimize-may-bad/
and
https://lucidworks.com/post/solr-and-optimizing-your-index-take-ii/

The extensions you mentioned are perfectly reasonable. Each segment is made up of multiple files; .fdt, for instance, contains stored data. See:
https://lucene.apache.org/core/6_6_0/core/org/apache/lucene/codecs/lucene62/package-summary.html

Can you give us a long listing of one of your index directories?

Best,
Erick

> On Aug 8, 2019, at 5:17 PM, Moyer, Brett wrote:
>
> In our data/solr/<core>/data/index on the filesystem, we have files that go back 1 year. I don't understand why, and I doubt they are in use: files with extensions like .fdx, .cfe, .doc, .pos, .tip, .dvm, etc. Some of these are very large and are running us out of server space. Our search indexes themselves are not large; in total we might have 50k documents. How can I reduce this /data/solr space? Is this what the Solr optimize command is for? Thanks!
>
> Brett
RE: modify query response plugin
Highlight? What about using the Highlighter? https://lucene.apache.org/solr/guide/6_6/highlighting.html

Brett Moyer

-----Original Message-----
From: Maria Muslea
Sent: Thursday, August 8, 2019 1:28 PM
To: solr-user@lucene.apache.org
Subject: Re: modify query response plugin

Thank you for your response. I believe that the Tagger is used for NER, which is different from what I am trying to do. It is also available only with Solr 7, and I would need this to work with version 6.5.0.

I am trying to manipulate the data that I already have in the response, and I can't find a good example of a plugin that does something similar, so I can see how to access the response and construct a new one. Your help is greatly appreciated.

Thank you,
Maria

On Tue, Aug 6, 2019 at 3:19 PM Erik Hatcher wrote:
> I think you're looking for the Solr Tagger, described here:
> https://lucidworks.com/post/solr-tagger-improving-relevancy/
>
>> On Aug 6, 2019, at 16:04, Maria Muslea wrote:
>>
>> Hi,
>>
>> I am trying to implement a plugin that will modify my query response. For example, I would like to execute a query that will return something like:
>>
>> {...
>> "description":"flights at LAX",
>> "highlight":"airport;11;3"
>> ...}
>>
>> This is information that I have in my document, so I can return it. Now, I would like the plugin to intercept the result, do some processing on it, and return something like:
>>
>> {...
>> "description":"flights at LAX",
>> "highlight":{
>>   "concept":"airport",
>>   "description":"flights at LAX"}
>> ...}
>>
>> I looked at some RequestHandler implementations, but I can't find any sample code that would help me with this. Would this type of plugin be handled by a RequestHandler? Could you maybe point me to a sample plugin that does something similar?
>>
>> I would really appreciate your help.
>>
>> Thank you,
>> Maria
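An alternative to a server-side plugin is reshaping the response in the client after Solr returns it. A sketch using the field names from Maria's example, and assuming (as the `"airport;11;3"` value suggests) that `highlight` packs a concept plus two numbers in a semicolon-separated string; the meaning of the numbers is a guess:

```python
def expand_highlight(doc: dict) -> dict:
    """Turn a packed 'concept;start;length' highlight string into a structured object."""
    concept, start, length = doc["highlight"].split(";")
    out = dict(doc)  # leave the original response dict untouched
    out["highlight"] = {
        "concept": concept,
        "start": int(start),      # assumed: offset into the description
        "length": int(length),    # assumed: length of the highlighted span
        "description": doc["description"],
    }
    return out
```

This avoids writing and deploying a custom SearchComponent, at the cost of doing the work on every consuming client.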
Indexed Data Size
In our data/solr/<core>/data/index on the filesystem, we have files that go back 1 year. I don't understand why, and I doubt they are in use: files with extensions like .fdx, .cfe, .doc, .pos, .tip, .dvm, etc. Some of these are very large and are running us out of server space. Our search indexes themselves are not large; in total we might have 50k documents. How can I reduce this /data/solr space? Is this what the Solr optimize command is for? Thanks!

Brett
Solr spellcheck Collation JSON
Hello, it looks like a more recent Solr release changed the collation output format. Does anyone know of a way to correct it, or whether a future release will address it? Because of this change we had to make the app teams rewrite their code. It made us look bad, because from their perspective we can't control our code and introduced a bug. Thanks!

Solr 7.4:

"spellcheck": {
  "suggestions": [
    "acount", {
      "numFound": 1,
      "startOffset": 0,
      "endOffset": 6,
      "suggestion": ["account"]
    }
  ],
  "collations": [
    "collation",    <-- this is the bad line
    "account"
  ]

Previous Solr versions:

"spellcheck": {
  "suggestions": [
    "acount", {
      "numFound": 1,
      "startOffset": 0,
      "endOffset": 6,
      "suggestion": ["account"]
    }
  ],
  "collations": [
    "collation":"account"    <-- correct format
  ]

Brett Moyer
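Until the app teams can rely on one format, a client can normalize both shapes of the `collations` list: the flat name/value pairs shown for 7.4 and the map-style entries of earlier versions. A defensive sketch (assumes well-formed input in one of those two shapes):

```python
def collations_as_dict(collations: list) -> dict:
    """Normalize Solr 'collations' to {name: value}, accepting both formats:
    flat pairs:  ["collation", "account"]
    map entries: [{"collation": "account"}]"""
    out = {}
    i = 0
    while i < len(collations):
        item = collations[i]
        if isinstance(item, dict):
            out.update(item)          # older map-style entry
            i += 1
        else:
            out[item] = collations[i + 1]  # flat name followed by value
            i += 2
    return out
```

Note that Solr's `json.nl` request parameter also controls how NamedLists such as this are serialized, which may let you restore the older shape without client changes.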
RE: IRA or IRA the Person
Wow, thank you Trey, great information! We are a Fusion client; it works well for us, and we are leveraging Signals Boosting. We were thinking omitNorms might be of help here (turning that off, actually). The PERSON document always ranks #1 because it's a tiny document with very short fields. I'll take a closer look at what you sent. Thank you!

Brett Moyer

-----Original Message-----
From: Trey Grainger [mailto:solrt...@gmail.com]
Sent: Monday, April 01, 2019 1:15 PM
To: solr-user@lucene.apache.org
Subject: Re: IRA or IRA the Person

Hi Brett,

There are a couple of angles you can take here. If you are only concerned about this specific term or a small number of other known terms like "IRA" and want to spot-fix it, you can use something like the Query Elevation component in Solr (https://lucene.apache.org/solr/guide/7_7/the-query-elevation-component.html) to explicitly include or exclude documents.

Otherwise, if you are looking for a more data-driven approach, you can leverage the aggregate click-streams of your users across all of the searches on your platform to boost documents higher that are more popular for any given search. We do this in our commercial product (Lucidworks Fusion) through our Signals Boosting feature, but you could implement something similar yourself with some work, as the general architecture is fairly well documented here:
https://doc.lucidworks.com/fusion-ai/4.2/user-guide/signals/index.html

If you do not have long-lived content OR you do not have sufficient signals history, you could alternatively use something like Solr's Semantic Knowledge Graph to automatically find the term vectors most related to your terms within your content. In that case, if the "individual retirement account" meaning is more common across your documents, you'd probably end up with terms more related to that meaning, which could be used to apply data-driven boosts toward that concept (instead of the person, in this case).

I gave a presentation at Activate ("the Search & AI Conference") last year on some of the more data-driven approaches to parsing and understanding the meaning of terms within queries, including disambiguation (similar to what you're doing here) and some additional approaches combining query-log mining, the Semantic Knowledge Graph, and the Solr Text Tagger. If you start handling these use cases in a more systematic, data-driven way, you might want to check out some of the techniques mentioned there:

Video: https://www.youtube.com/watch?v=4fMZnunTRF8
Slides: https://www.slideshare.net/treygrainger/how-to-build-a-semantic-search-system

All the best,

Trey Grainger
Chief Algorithms Officer @ Lucidworks

On Mon, Apr 1, 2019 at 11:45 AM Moyer, Brett wrote:
> Hello,
>
> Looking for ideas on how to determine intent and drive results to a person result or an article result. We are a financial institution; we have IRAs (Individual Retirement Accounts), and we also have a page about an advisor, Ira Black.
>
> Our users are in the bad habit of using only single terms for search. A very common search term is "ira". The PERSON page ranks higher than the article on IRAs. With essentially no information from the user, what are some ways we can detect intent and rank differently? Thanks!
>
> Brett Moyer
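Trey's first suggestion, the Query Elevation component, is configured through an `elevate.xml` file. A sketch for the "ira" case; the document ids here are hypothetical:

```xml
<!-- elevate.xml sketch: ids are placeholders, not from the thread -->
<elevate>
  <query text="ira">
    <doc id="article-about-iras"/>
    <doc id="advisor-ira-black" exclude="true"/>
  </query>
</elevate>
```

This pins the IRA article to the top for the exact query "ira" and suppresses the advisor page, at the cost of maintaining the mapping by hand for each spot-fixed term.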
IRA or IRA the Person
Hello, looking for ideas on how to determine intent and drive results to a person result or an article result. We are a financial institution; we have IRAs (Individual Retirement Accounts), and we also have a page about an advisor, Ira Black.

Our users are in the bad habit of using only single terms for search. A very common search term is "ira". The PERSON page ranks higher than the article on IRAs. With essentially no information from the user, what are some ways we can detect intent and rank differently? Thanks!

Brett Moyer
RE: FieldTypes and LowerCase
Ok, I think I'm getting it. At index/query time the analyzers fire and "do stuff". Ex: "the sheep jumped over the MOON" could be tokenized on spaces, lowercased, etc., and that is what gets stored in the inverted index (something you can't really see directly). The stored value in Solr is the string above in its original form. When you search for "sheep", the document comes back because the inverted index has the text in analyzed form, separated into words on spaces, right? Further, if I searched for "moon" (lowercase), it would be found because the analyzer also puts the lowercase form in the inverted index, right? I'm getting closer, I think.

Ok, so if I want to physically lowercase the URL and store it that way, I need to do it before it gets to the index, as you stated. Ok, got it. Thanks!

Brett Moyer

-----Original Message-----
From: Shawn Heisey [mailto:apa...@elyograg.org]
Sent: Thursday, March 14, 2019 10:57 AM
To: solr-user@lucene.apache.org
Subject: Re: FieldTypes and LowerCase

On 3/14/2019 8:49 AM, Moyer, Brett wrote:
> Thanks Shawn. "Analysis only happens to indexed data": that being the case, when the data gets indexed, wouldn't the analyzer kick off and lowercase the URL? The analyzer I have defined is not set for index or query, so as I understand it, it will fire during both events. If that is the case, I still don't get why the lowercasing doesn't fire when the data is being indexed.

It does happen for both index and query. It sounds like you are assuming that when index analysis happens, what you get back in search results will be affected by that analysis. What you get back in search results is stored data -- that is never affected by analysis. What gets affected by analysis is indexed data -- the data that is searched by queries, not the data that comes back in search results.

Thanks,
Shawn
RE: FieldTypes and LowerCase
Thanks Shawn. "Analysis only happens to indexed data": that being the case, when the data gets indexed, wouldn't the analyzer kick off and lowercase the URL? The analyzer I have defined is not set for index or query, so as I understand it, it will fire during both events. If that is the case, I still don't get why the lowercasing doesn't fire when the data is being indexed.

Brett Moyer

-----Original Message-----
From: Shawn Heisey [mailto:apa...@elyograg.org]
Sent: Thursday, March 14, 2019 10:44 AM
To: solr-user@lucene.apache.org
Subject: Re: FieldTypes and LowerCase

On 3/14/2019 7:47 AM, Moyer, Brett wrote:
> I'm using the below fieldType/field, but when I index my documents, the URL is not being lowercased. Any ideas? Do I have the below wrong?
>
> Example: http://connect.rightprospectus.com/RSVP/TADF
> Expect: http://connect.rightprospectus.com/rsvp/tadf
>
> [fieldType and field definitions mangled by the mailing-list archive; only the attributes omitNorms="true" and stored="true" survive]

Analysis only happens to indexed data. The data that you get back from Solr (stored data) is *always* EXACTLY what Solr indexes, before analysis. You'll need to lowercase the data before it reaches analysis. This is how it is designed to work ... that will not be changing.

If you were to configure an update processor chain that did the lowercasing, that would affect stored data as well as indexed data.

Thanks,
Shawn
FieldTypes and LowerCase
I'm using the below fieldType/field, but when I index my documents, the URL is not being lowercased. Any ideas? Do I have the below wrong?

Example: http://connect.rightprospectus.com/RSVP/TADF
Expect: http://connect.rightprospectus.com/rsvp/tadf

Brett Moyer
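The schema snippet was stripped by the mailing-list archive; only `omitNorms="true"` and `stored="true"` survive. A hedged reconstruction of the kind of definition the thread implies (names and exact attributes are guesses), which lowercases the *indexed* terms but, as Shawn explains, never the stored value:

```xml
<!-- Reconstruction, not the original schema: a keyword-tokenized lowercase type -->
<fieldType name="lowercase" class="solr.TextField" positionIncrementGap="100" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="url" type="lowercase" indexed="true" stored="true"/>
```

With this type, `url:http://connect.rightprospectus.com/rsvp/tadf` matches at query time, but the stored field still returns the original mixed-case URL.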
RE: URL Case Sensitive/Insensitive
https://www.nuveen.com/mutual-funds/nuveen-high-yield-municipal-bond-fund
https://www.nuveen.com/mutual-funds/Nuveen-High-Yield-Municipal-Bond-Fund

Is there any issue if we just lowercase all URLs? I can't think of an issue it would cause, but that's why I'm asking the gurus!

Brett Moyer

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Tuesday, December 11, 2018 12:41 PM
To: solr-user
Subject: Re: URL Case Sensitive/Insensitive

What do you mean by "URL case"? No, I'm not being snarky. The value returned in a doc is very different from the value searched. The stored data is the original input, without going through any filters.

If you mean the value _returned_ by Solr from a stored field, then the case is exactly whatever was input originally. To get a consistent case, I'd change it on the client side before sending it to Solr, or use, say, a ScriptUpdateProcessor to change it on the way into Solr.

If you're talking about _searching_ the URL, you need to put the appropriate filters in your analysis chain. Most distributions have a "lowercase" type that is a KeywordTokenizer plus LowerCaseFilter. That still treats the searchable text as a single token, so, for instance, you wouldn't be able to search for url:com except with pre-and-post wildcards, which is not a good pattern. If you want to search sub-parts of a URL, use one of the text-based types to break it up into tokens. Even in this case, though, the returned data is still in the original case, since it's the stored data that's returned.

Best,
Erick

On Tue, Dec 11, 2018 at 8:38 AM Moyer, Brett wrote:
>
> Hello, I'm new to Solr; I've been using it for a few months. A recent question came up from our business partners about URL casing. Previously their URLs were upper case; they made a change, and now all are lower case. Both pages/URLs are still accessible, so there are duplicates in Solr. They are requesting that all URLs be evaluated as lowercase. What is the best practice on URL case? Is there a negative to making everything lowercase? I know I can drop the index and re-crawl to fix it, but long term, how should URL case be treated? Thanks!
>
> Brett Moyer
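Erick's first option, normalizing on the client side before sending documents to Solr, can be sketched as below (the field name `url` is an assumption). One real caveat for Brett's "any issue?" question: URL paths can be case-sensitive on some web servers, so lowercasing the whole URL is only safe if, as here, both casings are known to resolve:

```python
def normalize_doc(doc: dict, url_field: str = "url") -> dict:
    """Lowercase the URL field before indexing, so stored and indexed values agree."""
    out = dict(doc)
    out[url_field] = out[url_field].lower()
    return out
```

Because this runs before Solr ever sees the document, both the stored value and the indexed terms are lowercase, which also collapses the upper/lower duplicate pages into one document if the URL is the unique key.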
URL Case Sensitive/Insensitive
Hello, I'm new to Solr; I've been using it for a few months. A recent question came up from our business partners about URL casing. Previously their URLs were upper case; they made a change, and now all are lower case. Both pages/URLs are still accessible, so there are duplicates in Solr. They are requesting that all URLs be evaluated as lowercase. What is the best practice on URL case? Is there a negative to making everything lowercase? I know I can drop the index and re-crawl to fix it, but long term, how should URL case be treated? Thanks!

Brett Moyer