eDisMax and Boolean operator case-sensitivity
Hi, I'm using the eDisMax query parser and need to support the Boolean operators AND and OR. From testing, these do *not* appear to be case sensitive: with mm set to 0, oscar AND wilde returns the same results as oscar and wilde (15 hits), while oscar foo wilde returns the same results as oscar wilde (2000 hits). Is it possible to configure eDisMax to do case-sensitive parsing, so that AND is an operator but and is just another term? Thanks, Tom
Re: eDisMax and Boolean operator case-sensitivity
Include another query parameter: lowercaseOperators=false

http://wiki.apache.org/solr/ExtendedDisMax#lowercaseOperators

Thanks, Shawn
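As a rough illustration of the suggestion above, here is a minimal Python sketch that builds an eDisMax request with lowercaseOperators=false and mm=0. The host, port, and /solr/select path are assumptions for a local test instance, and edismax_url is a hypothetical helper name, not part of Solr.

```python
from urllib.parse import urlencode

# Assumed location of a local Solr instance; adjust host, port, and path.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def edismax_url(user_query: str) -> str:
    """Build a select URL in which only uppercase AND/OR act as operators."""
    params = {
        "defType": "edismax",           # use the eDisMax query parser
        "q": user_query,                # the query exactly as the user typed it
        "mm": "0",                      # same mm setting as in the question
        "lowercaseOperators": "false",  # treat lowercase and/or as plain terms
    }
    return SOLR_SELECT_URL + "?" + urlencode(params)


# With the flag set, these two requests should no longer be equivalent.
print(edismax_url("oscar AND wilde"))
print(edismax_url("oscar and wilde"))
```

Comparing the hit counts of the two requests is a quick way to check that the parameter is being picked up.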
Re: eDisMax and Boolean operator case-sensitivity
Oh, good grief - I was just reading that page, how did I miss that? *derp* Thanks Shawn!!! Tom
Re: why does * affect case sensitivity of query results
Actually, look at the referenced JIRA https://issues.apache.org/jira/browse/SOLR-2438 and you'll see the behavior changed in 3.6.

Best, Erick

On Mon, Apr 29, 2013 at 9:36 AM, geeky2 gee...@hotmail.com wrote: here is the jira link: https://issues.apache.org/jira/browse/SOLR-219
Re: why does * affect case sensitivity of query results
hello erick, thank you for the info - yes - i did notice ;) one more reason for us to upgrade from 3.5. thx mark
why does * affect case sensitivity of query results
hello,

environment: solr 3.5

problem statement: when query has * appended, it turns case sensitive.

assumption: query should NOT be case sensitive

actual value in database at time of index: 4387828BULK

here is a snapshot of what works and does not work.

what works:

itemModelNoExactMatchStr:4387828bULk (and any variation of upper and lower case letters for *bulk*)
itemModelNoExactMatchStr:4387828bu*
itemModelNoExactMatchStr:4387828bul*
itemModelNoExactMatchStr:4387828bulk*

what does NOT work:

itemModelNoExactMatchStr:4387828BU*
itemModelNoExactMatchStr:4387828BUL*
itemModelNoExactMatchStr:4387828BULK*

below are the specifics of my field and fieldType

<field name="itemModelNoExactMatchStr" type="text_exact" indexed="true" stored="true"/>

<fieldType name="text_exact" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

thx mark

-- View this message in context: http://lucene.472066.n3.nabble.com/why-does-affect-case-sensitivity-of-query-results-tp4059801.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: why does * affect case sensitivity of query results
http://wiki.apache.org/solr/MultitermQueryAnalysis

Sorry, not for your version of Solr.

Regards, Alex.
Re: why does * affect case sensitivity of query results
was looking in Smiley's book on page 129 and 130. from the book:

"No text analysis is performed on the search word containing the wildcard, not even lowercasing. So if you want to find a word starting with Sma, then sma* is required instead of Sma*, assuming the index side of the field's type includes lowercasing. This shortcoming is tracked on SOLR-219. Moreover, if the field that you want to use the wildcard query on is stemmed in the analysis, then smashing* would not find the original text Smashing because the stemming process transforms this to smash. Consequently, don't stem."

thx mark
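The behavior described in the book passage can be mimicked with a small in-memory toy. This is only a sketch of the mechanism, not Lucene or Solr code: index_term and prefix_matches are hypothetical helpers that stand in for the index-time analysis chain (keyword tokenizer, lowercase, trim) and for a prefix query whose text is compared exactly as typed.

```python
def index_term(value: str) -> str:
    """Stand-in for the index-time chain: keep the whole value, lowercase, trim."""
    return value.lower().strip()


def prefix_matches(indexed_terms, raw_prefix: str):
    """Stand-in for an unanalyzed prefix query: no lowercasing of the prefix."""
    return [term for term in indexed_terms if term.startswith(raw_prefix)]


# The stored value from the original post, as it would look after indexing.
terms = [index_term("4387828BULK")]

print(prefix_matches(terms, "4387828bu"))  # lowercase prefix finds the term
print(prefix_matches(terms, "4387828BU"))  # uppercase prefix finds nothing
```

The indexed term is all lowercase, so a prefix typed in upper case can never be a prefix of it unless something lowercases the query side as well.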
Re: why does * affect case sensitivity of query results
here is the jira link: https://issues.apache.org/jira/browse/SOLR-219
Re: Solr Case-sensitivity issue with search field name
Hi Shawn, Thanks for your reply. So you mean the field name can't be case insensitive when specified in a query? I'm gonna stop doing research on this issue if this is confirmed... Thanks, Hyrax
Re: Solr Case-sensitivity issue with search field name
Hi wunder, Great advice! As a matter of fact, I chose to use upper case because of the documents I indexed, but it is really a pain in the ass typing the field names all in upper case. I thought there would probably be a way to make field names case-insensitive. I was wrong, wasn't I? Thanks, Hyrax
Solr Case-sensitivity issue with search field name
Hi guys, I'm using Solr 4.0 and I recently noticed an issue that bothers me a lot: if you define a field in your schema named 'HOST', then in the query you have to specify this field as 'HOST', while if you use 'host' it throws an 'undefined field' error. I have done some googling, but I only found a jira ticket which says this issue had been fixed: https://issues.apache.org/jira/browse/SOLR-873 I know I can use copyField to accomplish this, but I'm wondering if there is a way to apply this change to all the fields on the fly, not one by one ... Many many thanks in advance! Thanks, Hyrax -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Case-sensitivity-issue-with-search-field-name-tp4043800.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr Case-sensitivity issue with search field name
It appears that the issue you have linked is specific to the dataimport handler (importing from a database or another structured data source), not searching. I've always read that fields in a Solr schema are case sensitive. My own recommendation is that you pick a standard, either all uppercase or all lowercase, and that you stick with it. I prefer all lowercase myself.

Thanks, Shawn
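Since field names are matched case-sensitively, one possible workaround is to normalize them in the application before the query reaches Solr. The following is only a sketch under assumptions: SCHEMA_FIELDS is a hypothetical hand-maintained list of the field names in your schema, canonicalize_fields is a made-up helper, and the regular expression does not understand quoted phrases or escaped colons.

```python
import re

# Hypothetical list of field names exactly as they are spelled in the schema.
SCHEMA_FIELDS = ["HOST", "Person_Name", "itemModelNoExactMatchStr"]

# Map the lowercased spelling of each field to its canonical spelling.
_CANONICAL = {name.lower(): name for name in SCHEMA_FIELDS}

# A run of word characters followed by a colon, not preceded by a word
# character or another colon, is treated as a field prefix.
_FIELD_PREFIX = re.compile(r"(?<![\w:])(\w+):")


def canonicalize_fields(query: str) -> str:
    """Rewrite field prefixes to the schema's spelling, ignoring case."""
    def fix(match):
        typed = match.group(1)
        # Unknown prefixes are left exactly as typed.
        return _CANONICAL.get(typed.lower(), typed) + ":"
    return _FIELD_PREFIX.sub(fix, query)


print(canonicalize_fields("host:example.com AND person_name:kris*"))
```

This keeps the schema untouched and avoids one copyField per field, at the cost of maintaining the field list on the client side.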
Re: Solr Case-sensitivity issue with search field name
Lower case is safer than upper case. For Unicode, uppercasing is a lossy conversion. There are sets of different lower case characters that convert to the same upper case character. When you convert back to lower case, you don't know which one it was originally. Always use lower case for text. That avoids some really subtle bugs.

wunder
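The point about lossy uppercasing can be checked with Python's built-in Unicode case mapping. This is a small sketch; survives_round_trip is a hypothetical helper, and the sample characters (the two Greek lowercase sigmas, the long s, and the German sharp s) were picked for illustration.

```python
def survives_round_trip(text: str) -> bool:
    """Return True if uppercasing and then lowercasing gives the text back."""
    return text.upper().lower() == text


# Two different lowercase sigmas share one uppercase form.
print("σ".upper() == "ς".upper())

# The long s uppercases to a plain S, and the sharp s becomes two letters.
print("ſ".upper(), "ß".upper())

# Report which samples are changed by an upper/lower round trip.
for sample in ["σ", "ς", "ſ", "ß", "host"]:
    print(repr(sample), survives_round_trip(sample))
```

Plain ASCII names such as host come back unchanged, which is why the problem tends to show up only once non-ASCII text is involved.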
Re: Text field case sensitivity problem
I'm not familiar with the CharFilters; I'll look into those now. Is solr.LowerCaseFilterFactory not handling wildcards the expected result, or is this a bug?

On Wed, Jun 15, 2011 at 4:34 PM, Mike Sokolov soko...@ifactory.com wrote:

I wonder whether CharFilters are applied to wildcard terms? I suspect they might be. If that's the case, you could use the MappingCharFilter to perform lowercasing (and strip diacritics too if you want that). -Mike

On 06/15/2011 10:12 AM, Jamie Johnson wrote:

So simply lowercasing the query works but can get complex. The query that I'm executing may have things like ranges, which require some words to be upper case (i.e. TO). I think this would be much better solved on Solr's end; is there a JIRA about this?

On Tue, Jun 14, 2011 at 5:33 PM, Mike Sokolov soko...@ifactory.com wrote:

oops, please s/Highlight/Wildcard/

On 06/14/2011 05:31 PM, Mike Sokolov wrote:

Wildcard queries aren't analyzed, I think? I'm not completely sure what the best workaround is here: perhaps simply lowercasing the query terms yourself in the application. Also - I hope someone more knowledgeable will say that the new HighlightQuery in trunk doesn't have this restriction, but I'm not sure about that. -Mike

On 06/14/2011 05:13 PM, Jamie Johnson wrote:

Also of interest to me is that this returns results: http://localhost:8983/solr/select?defType=lucene&q=Person_Name:Kristine

On Tue, Jun 14, 2011 at 5:08 PM, Jamie Johnson jej2...@gmail.com wrote:

I am using the following for my text field:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <!-- Case insensitive stop word removal.
         add enablePositionIncrements=true in both the index and query
         analyzers to leave a 'gap' for more accurate phrase queries.
    -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

I have a field defined as

<field name="Person_Name" type="text" stored="true" indexed="true"/>

When I go to the following URL I get results:

http://localhost:8983/solr/select?defType=lucene&q=Person_Name:kris*

but if I do

http://localhost:8983/solr/select?defType=lucene&q=Person_Name:Kris*

I get nothing. I thought the LowerCaseFilterFactory would have handled lowercasing both the query and what is being indexed; am I missing something?
Re: Text field case sensitivity problem
I think my answer is here: "On wildcard and fuzzy searches, no text analysis is performed on the search word." Taken from http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#Analyzers
Re: Text field case sensitivity problem
Yes, after posting that response, I read some more and came to the same conclusion... there seems to be some interest on the dev list in building a capability to specify an analysis chain for use with wildcard and related queries, but it doesn't exist now. -Mike
Re: Text field case sensitivity problem
Jamie - there is a JIRA about this, at least one: https://issues.apache.org/jira/browse/SOLR-218 Erik
Re: Text field case sensitivity problem
Yes, and this too: https://issues.apache.org/jira/browse/SOLR-219
Re: Text field case sensitivity problem
So simply lowercasing the query works but can get complex. The query that I'm executing may have things like ranges, which require some words to be upper case (i.e. TO). I think this would be much better solved on Solr's end; is there a JIRA about this?
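One way to keep the application-side workaround from touching keywords such as TO is to lowercase only the whitespace-delimited tokens that actually contain a wildcard. The following is a rough sketch, not a full query parser: lowercase_wildcard_terms is a hypothetical helper, it treats * and ? as the wildcard characters, and it ignores quoted phrases and backslash escapes.

```python
import re

# Any whitespace-delimited token that contains * or ? somewhere inside it.
_WILDCARD_TOKEN = re.compile(r"\S*[*?]\S*")


def lowercase_wildcard_terms(query: str) -> str:
    """Lowercase the term part of wildcard tokens; leave everything else alone."""
    def fix(match):
        token = match.group(0)
        # Split off an optional field prefix so the field name keeps its case.
        field, sep, term = token.rpartition(":")
        return field + sep + term.lower()
    return _WILDCARD_TOKEN.sub(fix, query)


print(lowercase_wildcard_terms("Person_Name:Kris* AND age:[10 TO 20]"))
```

Operators and range keywords contain no wildcard character, so they pass through unchanged, and the field name is preserved because only the part after the last colon is lowercased.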
Re: Text field case sensitivity problem
I wonder whether CharFilters are applied to wildcard terms? I suspect they might be. If that's the case, you could use the MappingCharFilter to perform lowercasing (and strip diacritics too if you want that) -Mike On 06/15/2011 10:12 AM, Jamie Johnson wrote: So simply lower casing the works but can get complex. The query that I'm executing may have things like ranges which require some words to be upper case (i.e. TO). I think this would be much better solved on Solrs end, is there a JIRA about this? On Tue, Jun 14, 2011 at 5:33 PM, Mike Sokolov soko...@ifactory.com mailto:soko...@ifactory.com wrote: opps, please s/Highlight/Wildcard/ On 06/14/2011 05:31 PM, Mike Sokolov wrote: Wildcard queries aren't analyzed, I think? I'm not completely sure what the best workaround is here: perhaps simply lowercasing the query terms yourself in the application. Also - I hope someone more knowledgeable will say that the new HighlightQuery in trunk doesn't have this restriction, but I'm not sure about that. -Mike On 06/14/2011 05:13 PM, Jamie Johnson wrote: Also of interest to me is this returns results http://localhost:8983/solr/select?defType=luceneq=Person_Name:Kristine http://localhost:8983/solr/select?defType=luceneq=Person_Name:Kristine On Tue, Jun 14, 2011 at 5:08 PM, Jamie Johnsonjej2...@gmail.com mailto:jej2...@gmail.com wrote: I am using the following for my text field: fieldType name=text class=solr.TextField positionIncrementGap=100 autoGeneratePhraseQueries=true analyzer type=index tokenizer class=solr.WhitespaceTokenizerFactory/ !-- in this example, we will only use synonyms at query time filter class=solr.SynonymFilterFactory synonyms=index_synonyms.txt ignoreCase=true expand=false/ -- !-- Case insensitive stop word removal. add enablePositionIncrements=true in both the index and query analyzers to leave a 'gap' for more accurate phrase queries. 
-- filter class=solr.StopFilterFactory ignoreCase=true words=stopwords.txt enablePositionIncrements=true / filter class=solr.WordDelimiterFilterFactory generateWordParts=1 generateNumberParts=1 catenateWords=1 catenateNumbers=1 catenateAll=0 splitOnCaseChange=1/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.KeywordMarkerFilterFactory protected=protwords.txt/ filter class=solr.PorterStemFilterFactory/ /analyzer analyzer type=query tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.SynonymFilterFactory synonyms=synonyms.txt ignoreCase=true expand=true/ filter class=solr.StopFilterFactory ignoreCase=true words=stopwords.txt enablePositionIncrements=true / filter class=solr.WordDelimiterFilterFactory generateWordParts=1 generateNumberParts=1 catenateWords=0 catenateNumbers=0 catenateAll=0 splitOnCaseChange=1/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.KeywordMarkerFilterFactory protected=protwords.txt/ filter class=solr.PorterStemFilterFactory/ /analyzer /fieldType I have a field defined as field name=Person_Name type=text stored=true indexed=true / when I execute a go to the following url I get results http://localhost:8983/solr/select?defType=luceneq=Person_Name:kris* http://localhost:8983/solr/select?defType=luceneq=Person_Name:kris* but if I do http://localhost:8983/solr/select?defType=luceneq=Person_Name:Kris* http://localhost:8983/solr/select?defType=luceneq=Person_Name:Kris* I get nothing. I thought the LowerCaseFilterFactory would have handled lowercasing both the query and what is being indexed, am I missing something?
Text field case sensitivity problem
I am using the following for my text field:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <!-- Case insensitive stop word removal. add enablePositionIncrements=true in both the index and query analyzers to leave a 'gap' for more accurate phrase queries. -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

I have a field defined as

<field name="Person_Name" type="text" stored="true" indexed="true"/>

When I go to the following URL I get results:

http://localhost:8983/solr/select?defType=lucene&q=Person_Name:kris*

but if I do

http://localhost:8983/solr/select?defType=lucene&q=Person_Name:Kris*

I get nothing. I thought the LowerCaseFilterFactory would have handled lowercasing both the query and what is being indexed. Am I missing something?
Re: Text field case sensitivity problem
Also of interest to me is this returns results http://localhost:8983/solr/select?defType=luceneq=Person_Name:Kristine On Tue, Jun 14, 2011 at 5:08 PM, Jamie Johnson jej2...@gmail.com wrote: I am using the following for my text field: fieldType name=text class=solr.TextField positionIncrementGap=100 autoGeneratePhraseQueries=true analyzer type=index tokenizer class=solr.WhitespaceTokenizerFactory/ !-- in this example, we will only use synonyms at query time filter class=solr.SynonymFilterFactory synonyms=index_synonyms.txt ignoreCase=true expand=false/ -- !-- Case insensitive stop word removal. add enablePositionIncrements=true in both the index and query analyzers to leave a 'gap' for more accurate phrase queries. -- filter class=solr.StopFilterFactory ignoreCase=true words=stopwords.txt enablePositionIncrements=true / filter class=solr.WordDelimiterFilterFactory generateWordParts=1 generateNumberParts=1 catenateWords=1 catenateNumbers=1 catenateAll=0 splitOnCaseChange=1/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.KeywordMarkerFilterFactory protected=protwords.txt/ filter class=solr.PorterStemFilterFactory/ /analyzer analyzer type=query tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.SynonymFilterFactory synonyms=synonyms.txt ignoreCase=true expand=true/ filter class=solr.StopFilterFactory ignoreCase=true words=stopwords.txt enablePositionIncrements=true / filter class=solr.WordDelimiterFilterFactory generateWordParts=1 generateNumberParts=1 catenateWords=0 catenateNumbers=0 catenateAll=0 splitOnCaseChange=1/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.KeywordMarkerFilterFactory protected=protwords.txt/ filter class=solr.PorterStemFilterFactory/ /analyzer /fieldType I have a field defined as field name=Person_Name type=text stored=true indexed=true / when I execute a go to the following url I get results http://localhost:8983/solr/select?defType=luceneq=Person_Name:kris* but if I do 
http://localhost:8983/solr/select?defType=lucene&q=Person_Name:Kris* I get nothing. I thought the LowerCaseFilterFactory would have handled lowercasing both the query and what is being indexed. Am I missing something?
Re: Text field case sensitivity problem
Wildcard queries aren't analyzed, I think? I'm not completely sure what the best workaround is here: perhaps simply lowercasing the query terms yourself in the application. Also - I hope someone more knowledgeable will say that the new HighlightQuery in trunk doesn't have this restriction, but I'm not sure about that. -Mike On 06/14/2011 05:13 PM, Jamie Johnson wrote: Also of interest to me is this returns results http://localhost:8983/solr/select?defType=luceneq=Person_Name:Kristine On Tue, Jun 14, 2011 at 5:08 PM, Jamie Johnsonjej2...@gmail.com wrote: I am using the following for my text field: fieldType name=text class=solr.TextField positionIncrementGap=100 autoGeneratePhraseQueries=true analyzer type=index tokenizer class=solr.WhitespaceTokenizerFactory/ !-- in this example, we will only use synonyms at query time filter class=solr.SynonymFilterFactory synonyms=index_synonyms.txt ignoreCase=true expand=false/ -- !-- Case insensitive stop word removal. add enablePositionIncrements=true in both the index and query analyzers to leave a 'gap' for more accurate phrase queries. 
-- filter class=solr.StopFilterFactory ignoreCase=true words=stopwords.txt enablePositionIncrements=true / filter class=solr.WordDelimiterFilterFactory generateWordParts=1 generateNumberParts=1 catenateWords=1 catenateNumbers=1 catenateAll=0 splitOnCaseChange=1/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.KeywordMarkerFilterFactory protected=protwords.txt/ filter class=solr.PorterStemFilterFactory/ /analyzer analyzer type=query tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.SynonymFilterFactory synonyms=synonyms.txt ignoreCase=true expand=true/ filter class=solr.StopFilterFactory ignoreCase=true words=stopwords.txt enablePositionIncrements=true / filter class=solr.WordDelimiterFilterFactory generateWordParts=1 generateNumberParts=1 catenateWords=0 catenateNumbers=0 catenateAll=0 splitOnCaseChange=1/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.KeywordMarkerFilterFactory protected=protwords.txt/ filter class=solr.PorterStemFilterFactory/ /analyzer /fieldType I have a field defined as field name=Person_Name type=text stored=true indexed=true / when I execute a go to the following url I get results http://localhost:8983/solr/select?defType=luceneq=Person_Name:kris* but if I do http://localhost:8983/solr/select?defType=luceneq=Person_Name:Kris* I get nothing. I thought the LowerCaseFilterFactory would have handled lowercasing both the query and what is being indexed, am I missing something?
RE: Text field case sensitivity problem
Unfortunately, wild card search terms don't get processed by the analyzers. One suggestion that's fairly common is to make sure you lower case your wild card search terms yourself before issuing the query. Bob Sandiford | Lead Software Engineer | SirsiDynix P: 800.288.8020 X6943 | bob.sandif...@sirsidynix.com www.sirsidynix.com -Original Message- From: Jamie Johnson [mailto:jej2...@gmail.com] Sent: Tuesday, June 14, 2011 5:13 PM To: solr-user@lucene.apache.org Subject: Re: Text field case sensitivity problem Also of interest to me is this returns results http://localhost:8983/solr/select?defType=luceneq=Person_Name:Kristine On Tue, Jun 14, 2011 at 5:08 PM, Jamie Johnson jej2...@gmail.com wrote: I am using the following for my text field: fieldType name=text class=solr.TextField positionIncrementGap=100 autoGeneratePhraseQueries=true analyzer type=index tokenizer class=solr.WhitespaceTokenizerFactory/ !-- in this example, we will only use synonyms at query time filter class=solr.SynonymFilterFactory synonyms=index_synonyms.txt ignoreCase=true expand=false/ -- !-- Case insensitive stop word removal. add enablePositionIncrements=true in both the index and query analyzers to leave a 'gap' for more accurate phrase queries. 
-- filter class=solr.StopFilterFactory ignoreCase=true words=stopwords.txt enablePositionIncrements=true / filter class=solr.WordDelimiterFilterFactory generateWordParts=1 generateNumberParts=1 catenateWords=1 catenateNumbers=1 catenateAll=0 splitOnCaseChange=1/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.KeywordMarkerFilterFactory protected=protwords.txt/ filter class=solr.PorterStemFilterFactory/ /analyzer analyzer type=query tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.SynonymFilterFactory synonyms=synonyms.txt ignoreCase=true expand=true/ filter class=solr.StopFilterFactory ignoreCase=true words=stopwords.txt enablePositionIncrements=true / filter class=solr.WordDelimiterFilterFactory generateWordParts=1 generateNumberParts=1 catenateWords=0 catenateNumbers=0 catenateAll=0 splitOnCaseChange=1/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.KeywordMarkerFilterFactory protected=protwords.txt/ filter class=solr.PorterStemFilterFactory/ /analyzer /fieldType I have a field defined as field name=Person_Name type=text stored=true indexed=true / when I execute a go to the following url I get results http://localhost:8983/solr/select?defType=luceneq=Person_Name:kris* but if I do http://localhost:8983/solr/select?defType=luceneq=Person_Name:Kris* I get nothing. I thought the LowerCaseFilterFactory would have handled lowercasing both the query and what is being indexed, am I missing something?
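The client-side workaround suggested in this thread (lowercase the wildcard terms yourself, while leaving keywords such as TO upper case) could be sketched roughly as below. This is only an illustration in Python, not anything Solr ships: the function name and the keyword list are assumptions, and the simple whitespace split does not handle quoted phrases or escaped characters.

```python
# Query-syntax keywords that must stay upper case (assumed list for this sketch).
KEYWORDS = {"AND", "OR", "NOT", "TO"}


def lowercase_wildcard_terms(query: str) -> str:
    """Lowercase only the whitespace-separated tokens that contain * or ?.

    Field names (the part before the last ':') and all other tokens,
    including range keywords such as TO, are left untouched.
    """
    out = []
    for token in query.split(" "):
        is_wildcard = "*" in token or "?" in token
        if token in KEYWORDS or not is_wildcard:
            out.append(token)
            continue
        # Split "Person_Name:Kris*" into field, ":" and term; a token
        # without a field prefix yields an empty field and separator.
        field, sep, term = token.rpartition(":")
        out.append(field + sep + term.lower())
    return " ".join(out)


print(lowercase_wildcard_terms("date:[2010 TO 2011] AND Person_Name:Kris*"))
```

Applied to the queries above, Person_Name:Kris* would be sent as Person_Name:kris*, while the range and the AND operator pass through unchanged.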
Re: Text field case sensitivity problem
opps, please s/Highlight/Wildcard/ On 06/14/2011 05:31 PM, Mike Sokolov wrote: Wildcard queries aren't analyzed, I think? I'm not completely sure what the best workaround is here: perhaps simply lowercasing the query terms yourself in the application. Also - I hope someone more knowledgeable will say that the new HighlightQuery in trunk doesn't have this restriction, but I'm not sure about that. -Mike On 06/14/2011 05:13 PM, Jamie Johnson wrote: Also of interest to me is this returns results http://localhost:8983/solr/select?defType=luceneq=Person_Name:Kristine On Tue, Jun 14, 2011 at 5:08 PM, Jamie Johnsonjej2...@gmail.com wrote: I am using the following for my text field: fieldType name=text class=solr.TextField positionIncrementGap=100 autoGeneratePhraseQueries=true analyzer type=index tokenizer class=solr.WhitespaceTokenizerFactory/ !-- in this example, we will only use synonyms at query time filter class=solr.SynonymFilterFactory synonyms=index_synonyms.txt ignoreCase=true expand=false/ -- !-- Case insensitive stop word removal. add enablePositionIncrements=true in both the index and query analyzers to leave a 'gap' for more accurate phrase queries. 
-- filter class=solr.StopFilterFactory ignoreCase=true words=stopwords.txt enablePositionIncrements=true / filter class=solr.WordDelimiterFilterFactory generateWordParts=1 generateNumberParts=1 catenateWords=1 catenateNumbers=1 catenateAll=0 splitOnCaseChange=1/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.KeywordMarkerFilterFactory protected=protwords.txt/ filter class=solr.PorterStemFilterFactory/ /analyzer analyzer type=query tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.SynonymFilterFactory synonyms=synonyms.txt ignoreCase=true expand=true/ filter class=solr.StopFilterFactory ignoreCase=true words=stopwords.txt enablePositionIncrements=true / filter class=solr.WordDelimiterFilterFactory generateWordParts=1 generateNumberParts=1 catenateWords=0 catenateNumbers=0 catenateAll=0 splitOnCaseChange=1/ filter class=solr.LowerCaseFilterFactory/ filter class=solr.KeywordMarkerFilterFactory protected=protwords.txt/ filter class=solr.PorterStemFilterFactory/ /analyzer /fieldType I have a field defined as field name=Person_Name type=text stored=true indexed=true / when I execute a go to the following url I get results http://localhost:8983/solr/select?defType=luceneq=Person_Name:kris* but if I do http://localhost:8983/solr/select?defType=luceneq=Person_Name:Kris* I get nothing. I thought the LowerCaseFilterFactory would have handled lowercasing both the query and what is being indexed, am I missing something?
How to ignore whitespace/ case sensitivity with dedupe
Hi all, I've followed the instructions at http://wiki.apache.org/solr/Deduplication and got the basic dedupe field working. However, it doesn't seem to normalize case or whitespace differences, even though I've defined the type of the fields used for dedupe, as well as the signature field, as follows in schema.xml:

<fieldType autoGeneratePhraseQueries="true" class="solr.TextField" name="text_ws_lower" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<field name="name" type="text_ws_lower"/>
<field name="signatureField" type="text_ws_lower"/>

and in solrconfig.xml:

<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <bool name="overwriteDupes">false</bool>
    <str name="signatureField">signatureField</str>
    <str name="fields">name</str>
    <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

I know a possible solution is to lowercase the name field and remove its whitespace before submitting documents to Solr, but are there any other alternatives, so that when the names JOHN SMITH and jOhn SMITh are given, the documents end up with the same value in signatureField? Thanks heaps. Cheers, tinman -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-ignore-whitespace-case-sensitivity-with-dedupe-tp2997624p2997624.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: How to ignore whitespace/ case sensitivity with dedupe
(11/05/29 8:47), tinman wrote: Hi all, I've followed the instructions at this link http://wiki.apache.org/solr/Deduplication and got the basic dedupe field working. However, it doesn't seem to recognize case differences or white space differences even thought I've defined the type of the fields to be used for dedupe as well as the signature field as followings in schema.xml fieldType autoGeneratePhraseQueries=true class=solr.TextField name=text_ws_lower positionIncrementGap=100 analyzer type=index tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ /analyzer analyzer type=query tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ /analyzer /fieldType field name=name type=text_ws_lower/ field name=signatureField type=text_ws_lower/ and in the solrconfig.xmlupdateRequestProcessorChain name=dedupe processor class=org.apache.solr.update.processor.SignatureUpdateProcessorFactory bool name=enabledtrue/bool bool name=overwriteDupesfalse/bool str name=signatureFieldsignatureField/str str name=fieldsname/str str name=signatureClassorg.apache.solr.update.processor.Lookup3Signature/str /processor processor class=solr.LogUpdateProcessorFactory / processor class=solr.RunUpdateProcessorFactory / /updateRequestProcessorChain I know a possible solution is to lowercase and remove white spaces for the field name before submiting documents to solr, but is there any other alternatives so that when the following data is given Name: JOHN SMITH and jOhn SMITh the documents have the same outcome in signatureField? I can't believe this. Those signatures should be different. Are you sure you see same signatures in signatureField (it should be stored=true in order to see the result of signature)? Or did you just see those duplicate documents were registered and not checked signatureField by yourself? If latter, it is feature. 
Because you set overwriteDupes=false, which means the duplication check works on the uniqueKey field. koji -- http://www.rondhuit.com/en/
Re: How to ignore whitespace/ case sensitivity with dedupe
By default, stored = true and indexed = true. In any case, this is an example of the output from the Solr search console:

<result name="response" numFound="2" start="0">
  <doc>
    <str name="id">1234</str>
    <str name="name">JOHN SMITH </str>
    <str name="signatureField">5430fbe9e6374611</str>
  </doc>
  <doc>
    <str name="id">1233</str>
    <str name="name"> john SMITh</str>
    <str name="signatureField">49867a7835ff6741</str>
  </doc>
</result>

As you can see, the two signature fields are different. I want overwriteDupes=false because I want to use field collapsing to remove duplicates at query time. Thanks, tinman -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-ignore-whitespace-case-sensitivity-with-dedupe-tp2997624p2997738.html Sent from the Solr - User mailing list archive at Nabble.com.
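The differing signatures above suggest that the signature is computed from the raw field value rather than from the analyzed tokens, which is probably why the LowerCaseFilterFactory on the field type has no effect here. The pre-submission normalization mentioned in the original question could look roughly like the following Python sketch. The name_normalized field is hypothetical (it would have to exist in the schema and be listed under fields in the dedupe processor), and the helper names are made up for illustration.

```python
def normalize_for_signature(value: str) -> str:
    """Remove all whitespace and lowercase, so case and spacing variants collapse."""
    return "".join(value.split()).lower()


def add_normalized_name(doc: dict) -> dict:
    """Return a copy of the document with a hypothetical name_normalized field.

    The original name is left untouched so it can still be stored and
    displayed as submitted.
    """
    enriched = dict(doc)
    enriched["name_normalized"] = normalize_for_signature(doc["name"])
    return enriched


docs = [
    {"id": "1234", "name": "JOHN SMITH "},
    {"id": "1233", "name": " john SMITh"},
]
for doc in docs:
    print(add_normalized_name(doc))
```

With a signature built from name_normalized instead of name, both example documents would feed the same input string to the signature class.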
DataImportHandler - case sensitivity of column names
I ran into a problem with Oracle converting column names to upper case. As a result, the SolrInputDocument is created with field names in upper case, and a "Document [null] missing required field: id" exception is thrown (although the ID field is defined). I do not specify field elements explicitly. I know that I can rewrite all my queries in the "select id as id, body as body from document" format, but is there any other workaround for this? A case-insensitive option or something? Here's my data-config:

<dataConfig>
  <dataSource convertType="true" driver="oracle.jdbc.driver.OracleDriver" password="oracle" url="jdbc:oracle:thin:@localhost:1521:xe" user="SYSTEM"/>
  <document name="items">
    <entity name="root" pk="id" preImportDeleteQuery="db:db1" query="select id, body from document" transformer="TemplateTransformer">
      <entity name="nested1" query="select category from document_category where doc_id='${root.id}'"/>
      <entity name="nested2" query="select tag from document_tag where doc_id='${root.id}'"/>
      <field column="db" template="db1"/>
    </entity>
  </document>
</dataConfig>

Alexey
Re: DataImportHandler - case sensitivity of column names
On Mon, Feb 8, 2010 at 3:59 PM, Alexey Serba ase...@gmail.com wrote: I encountered the problem with Oracle converting column names to upper case. As a result SolrInputDocument is created with field names in upper case and Document [null] missing required field: id exception is thrown ( although ID field is defined ). I do not specify field elements explicitly. I know that I can rewrite all my queries to select id as id, body as body from document format, but is there any other workaround for this? case insensitive option or something? Here's my data-config:

<dataConfig>
  <dataSource convertType="true" driver="oracle.jdbc.driver.OracleDriver" password="oracle" url="jdbc:oracle:thin:@localhost:1521:xe" user="SYSTEM"/>
  <document name="items">
    <entity name="root" pk="id" preImportDeleteQuery="db:db1" query="select id, body from document" transformer="TemplateTransformer">
      <entity name="nested1" query="select category from document_category where doc_id='${root.id}'"/>
      <entity name="nested2" query="select tag from document_tag where doc_id='${root.id}'"/>
      <field column="db" template="db1"/>
    </entity>
  </document>
</dataConfig>

Fields are imported in a case-insensitive manner as long as they are not specified explicitly. In this case, however, the problem is that ${root.id} is case sensitive. There is no way right now to resolve variables in a case-insensitive manner. -- Regards, Shalin Shekhar Mangar.
documentation deficiency : case sensitivity of boolean operators
I couldn't find this anywhere in Solr's docs / FAQ. I finally found a reference in the Lucene docs: http://lucene.apache.org/java/2_4_0/queryparsersyntax.html This should really be added somewhere. I'm not sure where, but I thought this was worth bringing up to the list, as it really confused the hell out of me :)
Re: documentation deficiency : case sensitivity of boolean operators
: Subject: documentation deficiency : case sensitivity of boolean operators
:
: I couldn't find this anywhere on solr's docs / faq

If you have suggestions on places to add it, feel free to update the wiki. (Most of the documentation is deliberately agnostic to the specifics of the query parser syntax, instead relying on links to point you to the same reference URL you found ... so I can't actually think of anywhere in the Solr docs that mentions the AND/OR/NOT syntax where it would make sense to clarify this.) -Hoss
Re: documentation deficiency : case sensitivity of boolean operators
That's already linked from http://wiki.apache.org/solr/SolrQuerySyntax -Yonik http://www.lucidimagination.com On Tue, Sep 15, 2009 at 5:38 PM, Jonathan Vanasco jvana...@2xlp.com wrote: I couldn't find this anywhere on solr's docs / faq i finally found a reference on lucene http://lucene.apache.org/java/2_4_0/queryparsersyntax.html this should really be added somewhere. i'm not sure where, but I thought this was worth bringing up to the list -- as it really confused the hell out of me :)
solr field types and case sensitivity
Can I change the query analyzer for a specific request to Solr? I.e., I want to add an option on my site to choose whether a given search request is case-sensitive or not, but I can't find a good solution. I think creating duplicates (index-only fields with different analyzer configurations) for each field is a bad idea. Maybe someone knows a good solution for this problem? -- View this message in context: http://www.nabble.com/solr-field-types-and-case-sensitivity-tp14395912p14395912.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: solr field types and case sensitivity
Dryganets Sergey wrote: can I change query analyzer for concrete request to solr? ie: I want add option on my site use case-sensitive search or not for this search request, but can't find any good solution ... I think that create duplicates (index only fields with different analyzers configuration) for each field it's bad idea ... yes, you would index a field twice - once with a LowerCaseFilter and once without. That is a good solution. ryan
Re: solr field types and case sensitivity
ryantxu wrote: yes, you would index a field twice - once with a LowerCaseFilter and once without. That is a good solution.

Hm... So I should create n*n indexes, where n is the number of search options ... Can I copy fields automatically? For example, I have a field named name and a set of fields with prefixes or suffixes, so can I use a regexp to copy the field? Or maybe I can describe a copy-field policy for a fieldType (for me this solution would be better, as it takes less effort to add a new search option). -- View this message in context: http://www.nabble.com/solr-field-types-and-case-sensitivity-tp14395912p14411420.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)
DISCLAIMER: This is from a Lucene-centric viewpoint. That said, this may be useful For your line number, page number etc perspective, it is possible to index special guaranteed-to-not-match tokens then use the termdocs/termenum data, along with SpanQueries to figure this out at search time. For instance, coincident with the last term in each line, index the token $. Coincident with the last token of every paragraph index the token #. If you get the offsets of the matching terms, you can quite quickly simply count the number of line and paragraph tokens using TermDocs/TermEnums and correlate hits to lines and paragraphs. The trick is to index your special tokens with an increment of 0 (see SynonymAnalyzer in Lucene In Action for more on this). Another possibility is to add a special field with each document with the offsets of each end-of-sentence and end-of-paragraph offsets (stored, not indexed). Again, given the offsets, you can read in this field and figure out what line/ paragraph your hits are in. How suitable either of these is depends on a lot of characteristics of your particular problem space. I'm not sure either of them is suitable for very high volume applications. Also, I'm approaching this from an in-the-guts-of-lucene perspective, so don't even *think* of asking me how to really make this work in SOLR G. Best Erick On Nov 11, 2007 12:44 AM, David Neubert [EMAIL PROTECTED] wrote: Ryan (and others who need something to put them so sleep :) ) Wow -- the light-bulb finally went off -- the Analzyer admin page is very cool -- I just was not at all thinking the SOLR/Lucene way. I need to rethink my whole approach now that I understand (from reviewing the schema.xml closer and playing with the Analyser) how compatible index and query policies can be applied automatically on a field by field basis by SOLR at both index and query time. 
I still may have a stumper here, but I need to give it some thought, and may return again with another question: The problem is that my text is book text (fairly large) that ooks very much like one would expect: book chapter parasen.../sensen/sen/para parasen.../sensen/sen/para parasen.../sensen.../sen/para /chapter /book The search results need to return exact sentences or paragraphs with their exact page:line numbers (which is available in the embedded markup in the text). There were previous responses by others, suggesting I look into payloads, but I did not fully understand that -- I may have to re-read those e-mails now that I am getting a clearer picture of SOLR/Lucene. However, the reason I resorted to indexing each paragraph as a single document, and then redundantly indexing each sentence as a single document, is because I was planning on pre-parsing the text myself (outside of SOLR) -- and feeding separate doc elements to the add because in that way I could produce the page:line reference in the pre-parsing (again outside of SOLR) and feed it in as explict field in the doc elements of the add requests. Therefore at query time, I will have the exact page:line corresponding to the start of the paragraph or sentence. But I am beginning to suspect, I was planning to do a lot of work that SOLR can do for me. I will continue to study this and respond when I am a bit clearer, but the closer I could get to just submitting the books a chapter at a time -- and letting SOLR do the work, the better (cause I have all the books in well formed xml at chapter levels). However, I don't see yet how I could get par/sen granular search result hits, along with their exact page:line coordinates unless I approach it by explicitly indexing the pars and sens as single documents, not chapters hits, and also return the entire text of the sen or par, and highlight the keywords within (for the search result hit). 
Once a search result hit is selected, it would then act as expected and position into the chapter, at the selected reference, highlight again the key words, but this time in the context of an entire chapter (the whole document to the user's mind). Even with my new understanding you (and others) have given me, which I can use to certainly improve my approach -- it still seems to me that because multi-valued fields concatenate text -- even if you use the positionGapIncrment feature to prohibit unwanted phrase matches, how do you produce a well definied search result hit, bounded by the exact sen or par, unless you index them as single documents? Should I still read up on the payload discussion? Dave - Original Message From: Ryan McKinley [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Saturday, November 10, 2007 5:00:43 PM Subject: Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity) David Neubert wrote: Ryan, Thanks for your response. I infer from your response that you can have a different analyzer for each field yes
Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)
Erik, Probably because of my newness to SOLR/Lucene, I see now what you/Yonik meant by case field, but I am not clear about your wording per-book setting attached at index time - would you mind ellaborating on that, so I am clear? Dave - Original Message From: Erik Hatcher [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Sunday, November 11, 2007 5:21:45 AM Subject: Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity) Solr query syntax is documented here: http://wiki.apache.org/solr/ SolrQuerySyntax What Yonik is referring to is creating your own case field with the per-book setting attached at index time. Erik On Nov 11, 2007, at 12:55 AM, David Neubert wrote: Yonik (or anyone else) Do you know where on-line documentation on the +case: syntax is located? I can't seem to find it. Dave - Original Message From: Yonik Seeley [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Saturday, November 10, 2007 4:56:40 PM Subject: Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity) On Nov 10, 2007 4:24 PM, David Neubert [EMAIL PROTECTED] wrote: So if I am hitting multiple fields (in the same search request) that invoke different Analyzers -- am I at a dead end, and have to result to consequetive multiple queries instead Solr handles that for you automatically. The app that I am replacing (and trying to enhance) has the ability to search multiple books at once with sen/par and case sensitivity settings individually selectable per book You could easily select case sensitivity or not *per query* across all books. You should step back and see what the requirements actually are (i.e. the reasons why one needs to be able to select case sensitive/insensitive on a book level... it doesn't make sense to me at first blush). It could be done on a per-book level in solr with a more complex query structure though... 
(+case:sensitive +(normal relevancy query on the case sensitive fields goes here)) OR (+case:insensitive +(normal relevancy query on the case insensitive fields goes here)) -Yonik
Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)
Erik - thanks, I am considering this approach, verses explicit redundant indexing -- and am also considering Lucene -- problem is, I am one week into both technologies (though have years in the search space) -- wish I could go to Hong Kong -- any discounts available anywhere :) Dave - Original Message From: Erick Erickson [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Monday, November 12, 2007 2:11:14 PM Subject: Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity) DISCLAIMER: This is from a Lucene-centric viewpoint. That said, this may be useful For your line number, page number etc perspective, it is possible to index special guaranteed-to-not-match tokens then use the termdocs/termenum data, along with SpanQueries to figure this out at search time. For instance, coincident with the last term in each line, index the token $. Coincident with the last token of every paragraph index the token #. If you get the offsets of the matching terms, you can quite quickly simply count the number of line and paragraph tokens using TermDocs/TermEnums and correlate hits to lines and paragraphs. The trick is to index your special tokens with an increment of 0 (see SynonymAnalyzer in Lucene In Action for more on this). Another possibility is to add a special field with each document with the offsets of each end-of-sentence and end-of-paragraph offsets (stored, not indexed). Again, given the offsets, you can read in this field and figure out what line/ paragraph your hits are in. How suitable either of these is depends on a lot of characteristics of your particular problem space. I'm not sure either of them is suitable for very high volume applications. Also, I'm approaching this from an in-the-guts-of-lucene perspective, so don't even *think* of asking me how to really make this work in SOLR G. 
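A rough illustration of the marker-token idea described above, in plain Python rather than the Lucene API: it only simulates token positions, and the names index_positions and locate are made up for this sketch. The end-of-line and end-of-paragraph markers share the position of the last real token (the "increment of 0" trick), so counting markers before a hit gives its line and paragraph.

```python
from bisect import bisect_left


def index_positions(paragraphs):
    """Simulate indexing: paragraphs is a list of paragraphs, each a list of lines.

    Returns (postings, line_ends, para_ends), where line_ends / para_ends hold
    the positions that would carry the '$' and '#' marker tokens.
    """
    postings = {}    # term -> list of token positions
    line_ends = []   # positions of '$' (same position as last token of each line)
    para_ends = []   # positions of '#' (same position as last token of each paragraph)
    pos = -1
    for para in paragraphs:
        for line in para:
            for tok in line.lower().split():
                pos += 1
                postings.setdefault(tok, []).append(pos)
            line_ends.append(pos)
        para_ends.append(pos)
    return postings, line_ends, para_ends


def locate(position, line_ends, para_ends):
    """Return 1-based (paragraph, line) for a token position.

    The line number is counted across the whole document, not per paragraph.
    """
    line_no = bisect_left(line_ends, position) + 1
    para_no = bisect_left(para_ends, position) + 1
    return para_no, line_no


book = [
    ["It was the best of times", "it was the worst of times"],
    ["Call me Ishmael"],
]
postings, line_ends, para_ends = index_positions(book)
hits = [locate(p, line_ends, para_ends) for p in postings["times"]]
```

In a real index the marker positions would come from TermDocs/TermEnum data as Erick describes; here they are just Python lists, so this says nothing about performance at volume.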
Best Erick On Nov 11, 2007 12:44 AM, David Neubert [EMAIL PROTECTED] wrote: Ryan (and others who need something to put them so sleep :) ) Wow -- the light-bulb finally went off -- the Analzyer admin page is very cool -- I just was not at all thinking the SOLR/Lucene way. I need to rethink my whole approach now that I understand (from reviewing the schema.xml closer and playing with the Analyser) how compatible index and query policies can be applied automatically on a field by field basis by SOLR at both index and query time. I still may have a stumper here, but I need to give it some thought, and may return again with another question: The problem is that my text is book text (fairly large) that ooks very much like one would expect: book chapter parasen.../sensen/sen/para parasen.../sensen/sen/para parasen.../sensen.../sen/para /chapter /book The search results need to return exact sentences or paragraphs with their exact page:line numbers (which is available in the embedded markup in the text). There were previous responses by others, suggesting I look into payloads, but I did not fully understand that -- I may have to re-read those e-mails now that I am getting a clearer picture of SOLR/Lucene. However, the reason I resorted to indexing each paragraph as a single document, and then redundantly indexing each sentence as a single document, is because I was planning on pre-parsing the text myself (outside of SOLR) -- and feeding separate doc elements to the add because in that way I could produce the page:line reference in the pre-parsing (again outside of SOLR) and feed it in as explict field in the doc elements of the add requests. Therefore at query time, I will have the exact page:line corresponding to the start of the paragraph or sentence. But I am beginning to suspect, I was planning to do a lot of work that SOLR can do for me. 
I will continue to study this and respond when I am a bit clearer, but the closer I could get to just submitting the books a chapter at a time -- and letting SOLR do the work, the better (cause I have all the books in well formed xml at chapter levels). However, I don't see yet how I could get par/sen granular search result hits, along with their exact page:line coordinates unless I approach it by explicitly indexing the pars and sens as single documents, not chapters hits, and also return the entire text of the sen or par, and highlight the keywords within (for the search result hit). Once a search result hit is selected, it would then act as expected and position into the chapter, at the selected reference, highlight again the key words, but this time in the context of an entire chapter (the whole document to the user's mind). Even with my new understanding you (and others) have given me, which I can use to certainly improve my approach -- it still seems to me that because multi-valued fields concatenate text -- even if you use the positionGapIncrment feature to prohibit unwanted phrase matches, how do you produce
Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)
On Nov 12, 2007 2:20 PM, David Neubert [EMAIL PROTECTED] wrote: Erik - thanks, I am considering this approach, verses explicit redundant indexing -- and am also considering Lucene - There's not a well defined solution in either IMO. - problem is, I am one week into both technologies (though have years in the search space) -- wish I could go to Hong Kong -- any discounts available anywhere :) Unfortunately the OS Summit has been canceled. -Yonik
Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)
: - problem is, I am one week into both technologies (though have years in the search space) -- wish I could : go to Hong Kong -- any discounts available anywhere :) : : Unfortunately the OS Summit has been canceled. Or rescheduled to 2008 ... depending on whether you are a half-empty / half-full kind of person. And let's not forget Atlanta ... starting today and all... http://us.apachecon.com/us2007/ -Hoss
Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)
Solr query syntax is documented here: http://wiki.apache.org/solr/ SolrQuerySyntax What Yonik is referring to is creating your own case field with the per-book setting attached at index time. Erik On Nov 11, 2007, at 12:55 AM, David Neubert wrote: Yonik (or anyone else) Do you know where on-line documentation on the +case: syntax is located? I can't seem to find it. Dave - Original Message From: Yonik Seeley [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Saturday, November 10, 2007 4:56:40 PM Subject: Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity) On Nov 10, 2007 4:24 PM, David Neubert [EMAIL PROTECTED] wrote: So if I am hitting multiple fields (in the same search request) that invoke different Analyzers -- am I at a dead end, and have to result to consequetive multiple queries instead Solr handles that for you automatically. The app that I am replacing (and trying to enhance) has the ability to search multiple books at once with sen/par and case sensitivity settings individually selectable per book You could easily select case sensitivity or not *per query* across all books. You should step back and see what the requirements actually are (i.e. the reasons why one needs to be able to select case sensitive/insensitive on a book level... it doesn't make sense to me at first blush). It could be done on a per-book level in solr with a more complex query structure though... (+case:sensitive +(normal relevancy query on the case sensitive fields goes here)) OR (+case:insensitive +(normal relevancy query on the case insensitive fields goes here)) -Yonik __ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com
Redundant indexing * 4 only solution (for par/sen and case sensitivity)
Hi all, Using SOLR, I believe I have to index the same content 4 times (not desirable) into 2 indexes -- and I don't know how you can practically do multiple indexes in SOLR (if indeed there is no better solution than 4 indexing runs into two indexes). My need is case-sensitive and case-insensitive searches over well-formed XML content (books), performing exact searches at the paragraph and sentence levels -- no errors over approximate boundaries -- the source content has exact par/sen tags. I have already proven a pretty nice solution for par/sen indexing twice into the same index in SOLR. I have added a tags field, and put correlative XML tags (comma delimited) into this field (one of which is either a para or sen flag) which flags the document (partial) as a paragraph or sentence. Thus all paragraphs of the book are indexed as single documents (with their sentences combined and concatenated) and then all sentences in the book are indexed again as single documents. Both go into the same SOLR index. I just add an AND tags:para or tags:sen to my search and everything works fine. The obvious downside to this approach is the 2X indexing, but it does execute quite nicely on a single index using SOLR. This obviously doesn't scale nicely, but will probably do for quite a while. I thought I could live with that. But then I moved on to case-sensitive and case-insensitive searches, and my research so far is pointing to one index for each case. So now I have: (1) 4X in content indexing (2) 2X in actual SOLR/Lucene indices (3) I don't know how to practically do multiple indices using SOLR? If there is a better way of attacking this problem, I would appreciate recommendations!!! Also, I don't know how to do multiple indices in SOLR -- I have heard it might be available in 1.3.0.? If this is my only recourse, please advise me where really good documentation is available on building 1.3.0.
I am not admin savvy, but I did succeed in getting SOLR up myself and navigating through it with the help of this forum. But I have heard that building 1.3.0 (as opposed to downloading and installing it, like in 1.2.0) is a whole different experience and much more complex. Thanks Dave
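For readers trying to picture the "index every paragraph and every sentence as its own document" approach from this message, here is a minimal sketch of the pre-parsing step, assuming para and sen elements as described. The ref attribute holding the page:line value, the id scheme, and the function name docs_from_chapter are all assumptions made for illustration; the actual markup of the books is not shown in the thread.

```python
import xml.etree.ElementTree as ET


def docs_from_chapter(xml_text, book_id):
    """Turn one chapter into paragraph-level and sentence-level documents.

    Each document gets a 'tags' value of 'para' or 'sen', so a search can be
    restricted with tags:para or tags:sen as described in the message.
    """
    root = ET.fromstring(xml_text)
    docs = []
    for p_no, para in enumerate(root.iter("para"), start=1):
        # 'ref' is a hypothetical attribute carrying the page:line reference
        sentences = [(s.get("ref"), "".join(s.itertext()).strip())
                     for s in para.iter("sen")]
        if not sentences:
            continue
        docs.append({
            "id": f"{book_id}-p{p_no}",
            "tags": "para",
            "ref": sentences[0][0],  # paragraph starts where its first sentence starts
            "text": " ".join(text for _, text in sentences),
        })
        for s_no, (ref, text) in enumerate(sentences, start=1):
            docs.append({
                "id": f"{book_id}-p{p_no}-s{s_no}",
                "tags": "sen",
                "ref": ref,
                "text": text,
            })
    return docs


chapter = (
    '<chapter><para>'
    '<sen ref="12:3">Call me Ishmael.</sen>'
    '<sen ref="12:4">Some years ago.</sen>'
    '</para></chapter>'
)
docs = docs_from_chapter(chapter, "moby")
```

Each dictionary would then become one doc element in an add request. This doubles the indexed text, which is the 2X cost the message already notes.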
Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)
Ryan, Thanks for your response. I infer from your response that you can have a different analyzer for each field -- I guess I should have figured that out -- but because I had not thought of that, I concluded that I needed multiple indices (sorry, I am still very new to Solr/Lucene). Does such an approach make querying difficult under the following condition? The app that I am replacing (and trying to enhance) has the ability to search multiple books at once with sen/par and case sensitivity settings individually selectable per book (e.g. default search modes per book). So with a single query request (just the query word(s)), you can search one book by par, with case, another by sen w/o case, etc. -- all settable as user defaults. I need to try to figure out how to match that in Solr/Lucene -- I believe that the Analyzer approach you suggested requires the use of the same Analyzer at query time that was used during indexing. So if I am hitting multiple fields (in the same search request) that invoke different Analyzers -- am I at a dead end, and have to resort to consecutive multiple queries instead (and sort merge results afterwards?) Or am I just over complicating this? Dave - Original Message From: Ryan McKinley [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Saturday, November 10, 2007 2:18:00 PM Subject: Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity) So now I have: (1) 4X in content indexing (2) 2X in actual SOLR/Lucene indices (3) I don't know how to practically do multiple indices using SOLR? If there is a better way of attacking this problem, I would appreciate recommendations!!! I don't quite follow your current approach, but it sounds like you just need some copyFields to index the same content with multiple analyzers.
for example, say you have fields:
<field name="content" type="string" indexed="true" stored="true"/>
<field name="content_sentence" type="sentence" indexed="true" stored="false"/>
<field name="content_paragraph" type="paragraph" indexed="true" stored="false"/>
<field name="content_text" type="text" indexed="true" stored="false"/>
and copy fields:
<copyField source="content" dest="content_sentence"/>
<copyField source="content" dest="content_paragraph"/>
<copyField source="content" dest="content_text"/>
The 4X indexing cost? If you *need* to index the content 4 different ways, you don't have any way around that - do you? But is it really a big deal? How often does it need to index? How big is the data? I'm not quite following your need for multiple solr indices, but in 1.3 it is possible. ryan
Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)
On Nov 10, 2007 4:24 PM, David Neubert [EMAIL PROTECTED] wrote: So if I am hitting multiple fields (in the same search request) that invoke different Analyzers -- am I at a dead end, and have to result to consequetive multiple queries instead Solr handles that for you automatically. The app that I am replacing (and trying to enhance) has the ability to search multiple books at once with sen/par and case sensitivity settings individually selectable per book You could easily select case sensitivity or not *per query* across all books. You should step back and see what the requirements actually are (i.e. the reasons why one needs to be able to select case sensitive/insensitive on a book level... it doesn't make sense to me at first blush). It could be done on a per-book level in solr with a more complex query structure though... (+case:sensitive +(normal relevancy query on the case sensitive fields goes here)) OR (+case:insensitive +(normal relevancy query on the case insensitive fields goes here)) -Yonik
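To make the query structure in this message concrete, here is a small sketch that assembles it as a string. The case field and its sensitive/insensitive values come from the message; the field names text_cs and text_ci and the function name are invented for the example, and the user's query is inserted without any escaping, which a real application would need to handle.

```python
def per_book_case_query(user_query, sensitive_field="text_cs",
                        insensitive_field="text_ci"):
    """Build one query that searches each book in the mode stored in its 'case' field.

    Books indexed with case:sensitive are matched only through the
    case-preserving field; books with case:insensitive only through the
    lowercased field. Field names are hypothetical.
    """
    sensitive = f"(+case:sensitive +({sensitive_field}:({user_query})))"
    insensitive = f"(+case:insensitive +({insensitive_field}:({user_query})))"
    return f"{sensitive} OR {insensitive}"


query = per_book_case_query("Oscar Wilde")
```

Each document would carry its book's setting in the case field at index time, so a single request covers books with different modes.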
Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)
David Neubert wrote: Ryan, Thanks for your response. I infer from your response that you can have a different analyzer for each field
yes! each field can have its own indexing strategy.
I believe that the Analyzer approach you suggested requires the use of the same Analyzer at query time that was used during indexing.
it does not require the *same* Analyzer - it just requires one that generates compatible tokens. That is, you may want the indexing to split the input into sentences, but the query time analyzer keeps the input as a single token. check the example schema.xml file -- the 'text' field type applies synonyms at index time, but does not at query time. re searching across multiple fields, don't worry, lucene handles this well. You may want to do that explicitly or with the dismax handler. I'd suggest you play around with indexing some data. check the analysis.jsp in the admin section. It is a great tool to help figure out what analyzers do at index vs query time. ryan
Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)
Ryan (and others who need something to put them to sleep :) ) Wow -- the light-bulb finally went off -- the Analyzer admin page is very cool -- I just was not at all thinking the SOLR/Lucene way. I need to rethink my whole approach now that I understand (from reviewing the schema.xml closer and playing with the Analyzer) how compatible index and query policies can be applied automatically on a field by field basis by SOLR at both index and query time. I still may have a stumper here, but I need to give it some thought, and may return again with another question: The problem is that my text is book text (fairly large) that looks very much like one would expect:
<book>
<chapter>
<para><sen>...</sen><sen></sen></para>
<para><sen>...</sen><sen></sen></para>
<para><sen>...</sen><sen>...</sen></para>
</chapter>
</book>
The search results need to return exact sentences or paragraphs with their exact page:line numbers (which are available in the embedded markup in the text). There were previous responses by others, suggesting I look into payloads, but I did not fully understand that -- I may have to re-read those e-mails now that I am getting a clearer picture of SOLR/Lucene. However, the reason I resorted to indexing each paragraph as a single document, and then redundantly indexing each sentence as a single document, is because I was planning on pre-parsing the text myself (outside of SOLR) -- and feeding separate doc elements to the add because in that way I could produce the page:line reference in the pre-parsing (again outside of SOLR) and feed it in as an explicit field in the doc elements of the add requests. Therefore at query time, I will have the exact page:line corresponding to the start of the paragraph or sentence. But I am beginning to suspect I was planning to do a lot of work that SOLR can do for me.
I will continue to study this and respond when I am a bit clearer, but the closer I could get to just submitting the books a chapter at a time -- and letting SOLR do the work, the better (cause I have all the books in well formed xml at chapter levels). However, I don't see yet how I could get par/sen granular search result hits, along with their exact page:line coordinates unless I approach it by explicitly indexing the pars and sens as single documents, not chapters hits, and also return the entire text of the sen or par, and highlight the keywords within (for the search result hit). Once a search result hit is selected, it would then act as expected and position into the chapter, at the selected reference, highlight again the key words, but this time in the context of an entire chapter (the whole document to the user's mind). Even with my new understanding you (and others) have given me, which I can use to certainly improve my approach -- it still seems to me that because multi-valued fields concatenate text -- even if you use the positionGapIncrment feature to prohibit unwanted phrase matches, how do you produce a well definied search result hit, bounded by the exact sen or par, unless you index them as single documents? Should I still read up on the payload discussion? Dave - Original Message From: Ryan McKinley [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Saturday, November 10, 2007 5:00:43 PM Subject: Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity) David Neubert wrote: Ryan, Thanks for your response. I infer from your response that you can have a different analyzer for each field yes! each field can have its own indexing strategy. I believe that the Analyzer approach you suggested requires the use of the same Analzyer at query time that was used during indexing. it does not require the *same* Analyzer - it just requires one that generates compatiable tokens. 
That is, you may want the indexing to split the input into sentences, but the query time analyzer keeps the input as a single token. check the example schema.xml file -- the 'text' field type applies synonyms at index time, but does at query time. re searching acrross multiple fields, don't worry, lucene handles this well. You may want to do that explicitly or with the dismax handler. I'd suggest you play around with indexing some data. check the analysis.jsp in the admin section. It is a great tool to help figure out what analyzers do at index vs query time. ryan __ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com
Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity)
Yonik (or anyone else) Do you know where on-line documentation on the +case: syntax is located? I can't seem to find it. Dave - Original Message From: Yonik Seeley [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Saturday, November 10, 2007 4:56:40 PM Subject: Re: Redundant indexing * 4 only solution (for par/sen and case sensitivity) On Nov 10, 2007 4:24 PM, David Neubert [EMAIL PROTECTED] wrote: So if I am hitting multiple fields (in the same search request) that invoke different Analyzers -- am I at a dead end, and have to result to consequetive multiple queries instead Solr handles that for you automatically. The app that I am replacing (and trying to enhance) has the ability to search multiple books at once with sen/par and case sensitivity settings individually selectable per book You could easily select case sensitivity or not *per query* across all books. You should step back and see what the requirements actually are (i.e. the reasons why one needs to be able to select case sensitive/insensitive on a book level... it doesn't make sense to me at first blush). It could be done on a per-book level in solr with a more complex query structure though... (+case:sensitive +(normal relevancy query on the case sensitive fields goes here)) OR (+case:insensitive +(normal relevancy query on the case insensitive fields goes here)) -Yonik __ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com
TextField case sensitivity
I ran into a problem when searching on a TextField. When I pass q=William or q=WILLiam, solr is able to find records whose default search field value is William, however if I pass q=WilliAm, solr does not return anything. I searched the archive; Yonik mentioned the LowerCaseFilterFactory doesn't work for wildcards because the QueryParser does not invoke analysis for partial words, which makes sense. But in my case, it's a whole word. Does anyone know why it's not working? Below is my schema info. Thanks Xuesong
<fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldtype>
Re: TextField case sensitivity
On 6/7/07, Xuesong Luo [EMAIL PROTECTED] wrote: I run a problem when searching on a TextField. When I pass q=William or q=WILLiam, solr is able to find records whose default search field value is William, however if I pass q=WilliAm, solr did not return any thing. Sounds like WordDelimiterFilter is still being used for your fieldType. After you changed the fieldType for text, did you restart Solr and re-index your collection? -Yonik I searched on the archive, Yonik mentioned the lowercasefilterfactory doesn't work for wildcard because the QueryParser does not invoke analysis for partial word, that makes sense. But in my case, it's a whole word. Anyone knows why it's not working? Below is my schema info. Thanks Xuesong fieldtype name=text class=solr.TextField positionIncrementGap=100 analyzer type=index tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ /analyzer analyzer type=query tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ /analyzer /fieldtype
Re: TextField case sensitivity
have you taken a look the output from the admin/analysis? http://localhost:8983/solr/admin/analysis.jsp?highlight=on This lets you see what tokens are generated for index/query. From your description, I'm suspicious that the generated tokens are actually: willi am Also, if you want the same analyzer for indexing and query, just define one: analyzer tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ /analyzer Xuesong Luo wrote: I run a problem when searching on a TextField. When I pass q=William or q=WILLiam, solr is able to find records whose default search field value is William, however if I pass q=WilliAm, solr did not return any thing. I searched on the archive, Yonik mentioned the lowercasefilterfactory doesn't work for wildcard because the QueryParser does not invoke analysis for partial word, that makes sense. But in my case, it's a whole word. Anyone knows why it's not working? Below is my schema info. Thanks Xuesong fieldtype name=text class=solr.TextField positionIncrementGap=100 analyzer type=index tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ /analyzer analyzer type=query tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ /analyzer /fieldtype
RE: TextField case sensitivity
I have WordDelimiterFilter defined in the schema, I didn't include it in my original email because I thought it doesn't matter. It seems it matters. Looks like WilliAm is treated as two words. That's why it didn't find a match. Thanks Xuesong -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley Sent: Thursday, June 07, 2007 11:25 AM To: solr-user@lucene.apache.org Subject: Re: TextField case sensitivity On 6/7/07, Xuesong Luo [EMAIL PROTECTED] wrote: I run a problem when searching on a TextField. When I pass q=William or q=WILLiam, solr is able to find records whose default search field value is William, however if I pass q=WilliAm, solr did not return any thing. Sounds like WordDelimiterFilter is still being used for your fieldType. After you changed the fieldType for text, did you restart Solr and re-index your collection? -Yonik I searched on the archive, Yonik mentioned the lowercasefilterfactory doesn't work for wildcard because the QueryParser does not invoke analysis for partial word, that makes sense. But in my case, it's a whole word. Anyone knows why it's not working? Below is my schema info. Thanks Xuesong fieldtype name=text class=solr.TextField positionIncrementGap=100 analyzer type=index tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ /analyzer analyzer type=query tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ /analyzer /fieldtype
RE: TextField case sensitivity
Ryan, you are right, that's the problem. WilliAM is treated as two words by the WordDelimiterFilterFactory. Thanks Xuesong -Original Message- From: Ryan McKinley [mailto:[EMAIL PROTECTED] Sent: Thursday, June 07, 2007 11:30 AM To: solr-user@lucene.apache.org Subject: Re: TextField case sensitivity have you taken a look the output from the admin/analysis? http://localhost:8983/solr/admin/analysis.jsp?highlight=on This lets you see what tokens are generated for index/query. From your description, I'm suspicious that the generated tokens are actually: willi am Also, if you want the same analyzer for indexing and query, just define one: analyzer tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ /analyzer Xuesong Luo wrote: I run a problem when searching on a TextField. When I pass q=William or q=WILLiam, solr is able to find records whose default search field value is William, however if I pass q=WilliAm, solr did not return any thing. I searched on the archive, Yonik mentioned the lowercasefilterfactory doesn't work for wildcard because the QueryParser does not invoke analysis for partial word, that makes sense. But in my case, it's a whole word. Anyone knows why it's not working? Below is my schema info. Thanks Xuesong fieldtype name=text class=solr.TextField positionIncrementGap=100 analyzer type=index tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ /analyzer analyzer type=query tokenizer class=solr.WhitespaceTokenizerFactory/ filter class=solr.LowerCaseFilterFactory/ /analyzer /fieldtype
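The behaviour diagnosed in this thread can be reproduced with a simplified stand-in for the analysis chain. This is not the WordDelimiterFilter itself: it only models splitting at a lowercase-to-uppercase transition followed by lowercasing, which is enough to show why WILLiam matches and WilliAm does not. The function name and the split_on_case_change flag are made up for the sketch.

```python
import re

# Zero-width boundary between a lowercase letter and a following uppercase letter.
_LOWER_TO_UPPER = re.compile(r"(?<=[a-z])(?=[A-Z])")


def analyze(text, split_on_case_change=True):
    """Whitespace-tokenize, optionally split on lower->upper case changes, then lowercase."""
    tokens = []
    for word in text.split():
        parts = _LOWER_TO_UPPER.split(word) if split_on_case_change else [word]
        tokens.extend(part.lower() for part in parts if part)
    return tokens


indexed = analyze("William")       # what the stored value becomes
for query in ("William", "WILLiam", "WilliAm"):
    print(query, "->", analyze(query), "match:", analyze(query) == indexed)
```

With split_on_case_change=False the third query analyzes to a single token again, which is roughly the effect of turning the splitting off, as the patch mentioned later in this thread is meant to allow.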
Re: TextField case sensitivity
On 7-Jun-07, at 1:04 PM, Xuesong Luo wrote: Ryan, you are right, that's the problem. WilliAM is treated as two words by the WordDelimiterFilterFactory. I have found this behaviour a little too aggressive for my needs, so I added an option to disable it. Patch is here: http://issues.apache.org/jira/browse/SOLR-257 I'll probably commit it in a day or so, at which point it will be part of the Solr nightly build. -Mike
Re: case sensitivity
On 4/26/07, Michael Kimsal [EMAIL PROTECTED] wrote: We're (and by 'we' I mean my esteemed colleague!) working on patching a few of these items to be in the solrconf.xml file and should likely have some patches submitted next week. It's being done on 'company time' and I'm not sure about the exact policy/procedure for this sort of thing here (or indeed, if there is one at all). That's fine, as long as your company has agreed to contribute back the patch (under the Apache license). Apache enjoys a lot of business support (being business friendly) and a *lot* of contributions is done on company time. Anything really big would probably need a CLA, but patches only require clicking the grant license to ASF button in JIRA. -Yonik
Re: case sensitivity
Can you point me to the process for submitting these small patches? I'm looking at the jira site but don't see much of anything there outlining a process for submitting patches. Sorry to be so basic about this, but I'm trying to follow correct procedures on both sides of the aisle, so to speak. On 4/27/07, Yonik Seeley [EMAIL PROTECTED] wrote: On 4/26/07, Michael Kimsal [EMAIL PROTECTED] wrote: We're (and by 'we' I mean my esteemed colleague!) working on patching a few of these items to be in the solrconf.xml file and should likely have some patches submitted next week. It's being done on 'company time' and I'm not sure about the exact policy/procedure for this sort of thing here (or indeed, if there is one at all). That's fine, as long as your company has agreed to contribute back the patch (under the Apache license). Apache enjoys a lot of business support (being business friendly) and a *lot* of contributions is done on company time. Anything really big would probably need a CLA, but patches only require clicking the grant license to ASF button in JIRA. -Yonik -- Michael Kimsal http://webdevradio.com
Re: case sensitivity
Once the code/patch in the issue is put/committed to SVN, it means it will be in the next release. You get your patch committed faster if it's clear, well written and explained, if it comes with a unit test if it's a code change, and so on. Otis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Simpy -- http://www.simpy.com/ - Tag - Search - Share - Original Message From: Michael Kimsal [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Friday, April 27, 2007 1:47:06 PM Subject: Re: case sensitivity What's the procedure then for something to get included in the next release? Thanks again all! On 4/27/07, Michael Kimsal [EMAIL PROTECTED] wrote: So I just create my own 'issue' first? OK. Thanks. On 4/27/07, Ryan McKinley [EMAIL PROTECTED] wrote: Michael Kimsal wrote: Can you point me to the process for submitting these small patches? I'm looking at the jira site but don't see much of anything there outlining a process for submitting patches. Sorry to be so basic about this, but I'm trying to follow correct procedures on both sides of the aisle, so to speak. Check: http://wiki.apache.org/solr/HowToContribute Essentially you will create a new issue on JIRA, then upload a svn diff to that issue. holler if you have any troubles ryan -- Michael Kimsal http://webdevradio.com -- Michael Kimsal http://webdevradio.com
Re: case sensitivity
On 4/26/07, Erik Hatcher [EMAIL PROTECTED] wrote: I think we should open up as many of the switches as we can to QueryParser, allowing users to tinker with them if they want, setting the defaults to the most common reasonable settings we can agree upon. I think we should also try and handle what we can automatically too. Always lowercasing or not isn't elegant, as the right thing to do depends on the field. I always had it in my head that the QueryParser should figure it out. Actually, for good performance, the fieldType should figure it out just once. The presence of a LowerCaseFilter could be one signal to lowercase prefix strings, or one could actually run a test token through analysis and test if it comes out lowercased. Numeric fields are a sticking point... prefix queries and wildcard queries aren't even possible there. Of course, even stemming is problematic with wildcard queries. -Yonik
Re: case sensitivity
On 4/26/07, Michael Kimsal [EMAIL PROTECTED] wrote:
> My colleague, after some digging, found in SolrQueryParser (around line 62) setLowercaseExpandedTerms(false); The default for Lucene is true. Was this intentional? Or an oversight?

Way back before Solr was open-sourced, when Chris was the only user, I thought he needed case-sensitive prefix and wildcard queries (hence I set it to false). I think I may have been mistaken about that need, but by that time I didn't know whether anyone depended on it, so I never changed it back.

A default of false is actually more powerful too. You can do prefix queries on fields that have a LowerCaseFilter in their analyzer, and also on fields that don't. If it's set to true, you can't reliably do prefix queries on fields that don't have a LowerCaseFilter.

-Yonik
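To make the "false is more powerful" argument concrete, here is a small sketch in plain Python. It does not use Lucene; `prefix_query` and its `lowercase_expanded` flag are hypothetical names that only imitate the effect of setLowercaseExpandedTerms on a prefix string, and the two term sets are made-up sample data.

```python
from typing import Iterable, List


def prefix_query(indexed_terms: Iterable[str], prefix: str,
                 lowercase_expanded: bool) -> List[str]:
    """Return the indexed terms that start with the prefix.

    The flag imitates setLowercaseExpandedTerms: when it is on, the prefix
    is lowercased before it is compared with the indexed terms.
    """
    if lowercase_expanded:
        prefix = prefix.lower()
    return sorted(term for term in indexed_terms if term.startswith(prefix))


# Terms as they would sit in the index for two kinds of field.
lowercased_field = {"fox", "foxtrot"}   # index analyzer lowercases
verbatim_field = {"Fox", "Foxtrot"}     # no lowercasing at index time

# Flag off: the caller picks the case, so both kinds of field are reachable.
off_lowercased = prefix_query(lowercased_field, "fox", False)
off_verbatim = prefix_query(verbatim_field, "Fox", False)

# Flag on: the lowercased field still works, the verbatim field cannot match.
on_lowercased = prefix_query(lowercased_field, "Fox", True)
on_verbatim = prefix_query(verbatim_field, "Fox", True)
```

The cost of leaving the flag off is that callers must lowercase prefixes themselves for fields whose index analyzer lowercases.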
Re: case sensitivity
In our experience, setting a LowerCaseFilter in the query analyzer did not work; we had to call setLowercaseExpandedTerms(true) to get wildcard queries to be case-insensitive. Here's our analyzer definition from our solr schema:

<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>

If calling setLowercaseExpandedTerms(true) is *not* in fact necessary for case-insensitive wildcard queries, could you please provide an example of a solr schema that can achieve this?

Thanks!
- mps
Re: case sensitivity
On 4/27/07, Michael Pelz Sherman [EMAIL PROTECTED] wrote:
> In our experience, setting a LowercaseFilter in the query did not work; we had to call setLowercaseExpandedTerms(true) to get wildcard queries to be case-insensitive.

Correct, because in that case the QueryParser does not invoke analysis (because it's a partial word, not a whole word).

> If calling setLowercaseExpandedTerms(true) is *not* in fact necessary for case-insensitive wildcard queries, could you please provide an example of a solr schema that can achieve this?

I didn't say that :-) I'm saying setLowercaseExpandedTerms(true) is not sufficient for wildcard queries in general. If the term is indexed as Windows95, then a prefix query of Windows* won't find anything when setLowercaseExpandedTerms(true) is in effect.

-Yonik
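The point that whole words go through analysis while wildcard terms do not can be sketched with a toy parser. This is plain Python under simplifying assumptions (a single field, trailing-asterisk prefixes only); `analyze`, `search`, and the `lowercase_expanded` argument are hypothetical names, not the Lucene QueryParser API.

```python
from typing import List, Set


def analyze(term: str) -> str:
    """Stand-in for a query analyzer whose chain ends in a lowercasing filter."""
    return term.lower()


def search(index: Set[str], query: str, lowercase_expanded: bool = False) -> List[str]:
    """Toy parser: whole terms are analyzed, trailing-* terms bypass analysis."""
    if query.endswith("*"):
        prefix = query[:-1]              # partial word: the analyzer is never called
        if lowercase_expanded:
            prefix = prefix.lower()      # the only case handling a wildcard term gets
        return sorted(term for term in index if term.startswith(prefix))
    term = analyze(query)                # whole word: normal query-time analysis
    return [term] if term in index else []


lowercased_index = {"windows95"}   # field whose index analyzer lowercases
verbatim_index = {"Windows95"}     # field indexed without lowercasing

whole_word = search(lowercased_index, "Windows95")            # analyzed, so it matches
wildcard_plain = search(lowercased_index, "Windows*")         # not analyzed, no match
wildcard_flag = search(lowercased_index, "Windows*", True)    # flag rescues this case
verbatim_flag = search(verbatim_index, "Windows*", True)      # flag breaks this case
```

The last two calls mirror the two halves of the reply: the flag is what makes the wildcard query work against a lowercased field, and it is also what stops Windows* from finding a term indexed as Windows95.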
case sensitivity
I've looked through the mailing lists and can't find much of anything regarding case sensitivity. It seems SOLR is case sensitive by default - I'm using the default settings with a very basic schema - just text fields.

Is there any way to tell the query parser to be case insensitive during a query? Or do I have to reindex all my data again with lowercase values?

-- Michael Kimsal http://webdevradio.com
Re: case sensitivity
On Apr 26, 2007, at 5:43 PM, Michael Kimsal wrote:
> I've looked through the mailing lists and can't find much of anything regarding case sensitivity. It seems SOLR is case sensitive by default - I'm using the default settings with a very basic schema - just text fields.

It all depends on the analysis you have set up for the fields. If you're indexing string-type fields in the default example schema, there is effectively no analysis, so searches must be exact matches, case and all.

> Is there any way to tell the query parser to be case insensitive during a query? Or do I have to reindex all my data again with lowercase values?

Terms are indexed in a case-sensitive manner, so if you need case insensitivity you need to lowercase on the way in and on querying.

Erik
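As a rough illustration of "lowercase on the way in and on querying", here is a tiny inverted index in plain Python. It is a sketch only: `build_index`, `lookup`, and the sample documents are hypothetical, and real Solr analysis does far more than split on whitespace.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set


def build_index(docs: Dict[int, str], normalize: Callable[[str], str]) -> Dict[str, Set[int]]:
    """Map each normalized token to the ids of the documents containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[normalize(token)].add(doc_id)
    return index


def lookup(index: Dict[str, Set[int]], term: str, normalize: Callable[[str], str]) -> List[int]:
    """Normalize the query term the same way, then do an exact term match."""
    return sorted(index.get(normalize(term), set()))


def identity(token: str) -> str:
    """No analysis at all, like a string-type field."""
    return token


docs = {1: "Quick brown Fox", 2: "fox hunting"}   # made-up sample documents

exact = build_index(docs, identity)      # terms kept exactly as written
folded = build_index(docs, str.lower)    # lowercased on the way in

exact_hits = lookup(exact, "fox", identity)          # only the exact-case document
folded_hits = lookup(folded, "Fox", str.lower)       # lowercased on both sides
one_sided_hits = lookup(folded, "Fox", identity)     # index lowercased, query not
```

The third lookup finds nothing, which is why lowercasing has to happen on both the indexing and the querying side.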
Re: case sensitivity
I was just writing a followup. I'm using the default text field type:

<fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldtype>

That looks to me like it's got LowerCaseFilterFactory in both the query analyzer and the index analyzer. I'm still digging into this, but are there any other things to look for that anyone can point me to? (Thanks Erik!)

-- Michael Kimsal http://webdevradio.com
Re: case sensitivity
type:changelog AND ( ( (listing:Fox) or (listing:Fox*) or (listing:*Fox) ) )
and
type:changelog AND ( ( (listing:fox) or (listing:fox*) or (listing:*fox) ) )

Is this to do with the wildcards?

Actually, I've just answered my own question.

type:changelog AND ( ( (listing:fox) ) )
and
type:changelog AND ( ( (listing:Fox) ) )

give the same results. But adding in the "or listing:fox* or listing:*fox" clauses is always case-sensitive.

However, http://wiki.apache.org/lucene-java/LuceneFAQ#head-133cf44dd3dff3680c96c1316a663e881eeac35a seems to say that wildcard searches are not case-sensitive. Unless someone can point out a way around this, it seems I'll need to manually reindex and lowercase everything on the way in, then reformat my search queries to be lowercase as well.

-- Michael Kimsal http://webdevradio.com
Re: case sensitivity
My colleague, after some digging, found in SolrQueryParser (around line 62) setLowercaseExpandedTerms(false); The default for Lucene is true. Was this intentional? Or an oversight? Perhaps it's not related to my problem, but it seems that it might be. Thanks in advance!

-- Michael Kimsal http://webdevradio.com
Re: case sensitivity
On Apr 26, 2007, at 6:03 PM, Michael Kimsal wrote:
> My colleague, after some digging, found in SolrQueryParser (around line 62) setLowercaseExpandedTerms(false); The default for Lucene is true. Was this intentional? Or an oversight?

I was just about to respond that this is likely the issue with your non-totally-lowercased wildcard terms. I don't consider it an oversight; rather, this whole business of analysis and wildcards is something that varies from project to project in how it should be handled. If you have, for example, a string field and want to do prefix queries on it (trailing asterisk), you wouldn't want the term to be lowercased.

I think we should open up as many of the switches as we can to QueryParser, allowing users to tinker with them if they want, setting the defaults to the most common reasonable settings we can agree upon.

Erik
Re: case sensitivity
We're (and by 'we' I mean my esteemed colleague!) working on patching a few of these items so that they can be set in the solrconfig.xml file, and we should likely have some patches submitted next week. It's being done on 'company time' and I'm not sure about the exact policy/procedure for this sort of thing here (or indeed, if there is one at all).

-- Michael Kimsal http://webdevradio.com
Case sensitivity on hostnames and email addresses
I've run into some unexpected case sensitivity on searches, at least unexpected by me. If you index a text field containing this sentence:

A sentence containing CamelCase words by [EMAIL PROTECTED] is found at StudlyCaps.org

The document will be found by searching for camelcase but not for [EMAIL PROTECTED] or studlycaps.org. This happens with the Standard or the DisMax query handler.

A bit of a problem for me, because I'm indexing a bunch of business magazines, and domain names are frequently capitalized, often in CamelCase. Is this maybe a bug? Or a WAD?

-- Wade Leftwich Ithaca, NY