Re: Accent Characters
Vicente,

Are you using CommonsHttpSolrServer or HttpSolrServer? If the latter, then you are probably hitting this:

https://issues.apache.org/jira/browse/SOLR-3375

The remedy is to use CommonsHttpSolrServer.

--
Sami Siren

On Thu, May 31, 2012 at 7:52 AM, Vicente Couto <couto.vice...@gmail.com> wrote:
> If I try to run queries by using solrJ, it does some sort of encoding.
> [...]
> So, how can I tell solrJ to not encode the queries?
Re: Accent Characters
Hello, guys.

Now it's working. Thank you both, Jack and Sami. I fixed my issue by just using server.query(query, METHOD.POST) in solrJ, and yes, I was using HttpSolrServer. I have to move on to CommonsHttpSolrServer.

Thank you very much.

--
View this message in context: http://lucene.472066.n3.nabble.com/Accent-Characters-tp3985931p3987046.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Accent Characters
Hello, Jack.

Yeah, I'm screwed up. Well, the documents are indexed with the accents. I started with a new, clean Solr 3.6 configuration, with as few changes as possible. I'm running two cores, one for English and another one for French.

Here is where I am now: if I run queries through solrJ, it does some sort of encoding. For example, I can see in the logs that if I run a query looking for pré, I get:

    INFO: [coreFR] webapp=/solr path=/select params={fl=*,score&q=content:pré&hl.fl=content&hl.maxAnalyzedChars=10&hl=true} hits=0 status=0 QTime=0

And I can't see any results. Explicitly encoding the query as UTF-8 does not work either. But if I simply put the HTTP call into the browser address bar, for example, it works perfectly!

So, how can I tell solrJ to not encode the queries?

Thank you

--
View this message in context: http://lucene.472066.n3.nabble.com/Accent-Characters-tp3985931p3986970.html
Re: Accent Characters
This might be related:

https://issues.apache.org/jira/browse/SOLR-443

It suggests setting an HTTP header:

    Content-Type: application/x-www-form-urlencoded; charset=UTF-8

-- Jack Krupansky

-----Original Message-----
From: Vicente Couto
Sent: Thursday, May 31, 2012 12:52 AM
To: solr-user@lucene.apache.org
Subject: Re: Accent Characters

> If I try to run queries by using solrJ, it does some sort of encoding.
> [...]
> So, how can I tell solrJ to not encode the queries?
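To make the header suggestion concrete, here is a hedged Python sketch that builds (but does not send) a POST request whose body is form-encoded as UTF-8 and whose Content-Type says so. The URL is assembled from the host, core, and path that appear elsewhere in this thread, and build_select_request is a hypothetical helper, not part of solrJ or Solr.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_select_request(base_url: str, query: str) -> Request:
    """Build a POST request carrying the query as UTF-8 form data."""
    # urlencode percent-encodes the UTF-8 bytes of each value,
    # so 'é' becomes %C3%A9 in the request body.
    body = urlencode({"q": query}, encoding="utf-8").encode("ascii")
    headers = {
        "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
    }
    return Request(base_url, data=body, headers=headers, method="POST")


if __name__ == "__main__":
    # Assumed URL: localhost:8983, core coreFR, handler /select.
    req = build_select_request(
        "http://localhost:8983/solr/coreFR/select", "content:pré"
    )
    print(req.get_method(), req.data)
```

Stating the charset in the header removes the guesswork on the receiving side, which is the point of the suggestion above.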
Re: Accent Characters
Hi, Jack.

First of all, thank you for your help. I tried again and realized that my problem is not really with Solr.

I ran this query against Solr after starting it with the command java -jar start.jar:

http://localhost:8983/solr/coreFR/spell?q=content:pr%C3%A9senta&spellcheck=true&spellcheck.collate=true&rows=0&spellcheck.count=10

It gives me this result:

    <?xml version="1.0" encoding="UTF-8" ?>
    <response>
      <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">31</int>
      </lst>
      <result name="response" numFound="0" start="0" />
      <lst name="spellcheck">
        <lst name="suggestions">
          <lst name="présenta">
            <int name="numFound">10</int>
            <int name="startOffset">8</int>
            <int name="endOffset">16</int>
            <arr name="suggestion">
              <str>présente</str>
              <str>présent</str>
              <str>présenté</str>
              <str>présents</str>
              <str>présentant</str>
              <str>présentera</str>
              <str>présentait</str>
              <str>présentes</str>
              <str>présenter</str>
              <str>présentée</str>
            </arr>
          </lst>
          <str name="collation">content:présente</str>
        </lst>
      </lst>
    </response>

Then I ran exactly the same query after deploying solr.war in Tomcat 7. Here is my result:

    <?xml version="1.0" encoding="UTF-8" ?>
    <response>
      <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">16</int>
      </lst>
      <result name="response" numFound="0" start="0" />
      <lst name="spellcheck">
        <lst name="suggestions">
          <lst name="présenta">
            <int name="numFound">10</int>
            <int name="startOffset">8</int>
            <int name="endOffset">16</int>
            <arr name="suggestion">
              <str>present</str>
              <str>prbsent</str>
              <str>presentant</str>
              <str>presentait</str>
              <str>puisent</str>
              <str>pasent</str>
              <str>pensent</str>
              <str>posent</str>
              <str>dresent</str>
              <str>resenti</str>
            </arr>
          </lst>
          <str name="collation">content:present</str>
        </lst>
      </lst>
    </response>

As my application is running under Tomcat, it means that I have some issue with Tomcat. The weird thing is that I already googled for a fix and found that we have to set a parameter in Tomcat's server.xml config file:

    <Connector port="5443" protocol="HTTP/1.1" connectionTimeout="2" redirectPort="8443" URIEncoding="UTF-8" />

But it's not working, as you can see. I'm feeling a little stupid because it doesn't look like a big problem. Surely people around the world are running accented queries against Solr under Tomcat without trouble!

Thank you

--
View this message in context: http://lucene.472066.n3.nabble.com/Accent-Characters-tp3985931p3986423.html
Re: Accent Characters
The query seems fine, as far as the URL being UTF-8 is concerned. It seems that the documents are not being passed to Solr with UTF-8 encoding. The document is not part of the URL; it is HTTP POST data. Try an explicit curl command to add a document and see if it is indexed with the accents.

-- Jack Krupansky

-----Original Message-----
From: couto.vicente
Sent: Monday, May 28, 2012 9:58 AM
To: solr-user@lucene.apache.org
Subject: Re: Accent Characters

> And I did run exactly the same query after deploy solr.war in tomcat 7.
> [...]
> But it's not working as you can see.
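Since the document travels as HTTP POST data, the bytes of that body are what must be UTF-8. Below is a small Python sketch that builds such a body for an add command; the field names id and content come from this thread, while build_add_payload and the sample values are assumptions for illustration. Nothing is sent over the network.

```python
from xml.sax.saxutils import escape


def build_add_payload(doc_id: str, content: str) -> bytes:
    """Return a Solr XML <add> message encoded as UTF-8 bytes."""
    xml = (
        "<add><doc>"
        f'<field name="id">{escape(doc_id)}</field>'
        f'<field name="content">{escape(content)}</field>'
        "</doc></add>"
    )
    # Encoding explicitly avoids depending on the platform default charset.
    return xml.encode("utf-8")


if __name__ == "__main__":
    payload = build_add_payload("doc-1", "présenta")
    # 'é' should appear as the two bytes C3 A9 in the body.
    print(payload)
```

Inspecting the bytes before posting them makes it easy to tell a client-side encoding problem from a server-side one.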
Re: Accent Characters
I tried your scenario with the Solr 3.6 example and it seemed to work fine and suggested an accented term for me.

Some possibilities:

1) Your term had an editing distance that was too high relative to any accented correction. Check your term and count how many characters must be changed to match an accented term. Case changes count as well. In the case of a 4-character word, the maximum editing distance allowed (by default) is 2. Maybe you simply need to override the default for accuracy; e.g., spellcheck.accuracy=0.35, compared to the default of 0.5.

2) Did you get some other suggestion when you expected the accented term? If so, increase the spellcheck.count request parameter from 1 to 10 to see other suggestions.

3) You have some other schema/solrconfig changes that you haven't told us about. Try to reproduce your issue against a fresh copy of the Solr 3.6 example, and then see how your actual configuration (that fails) is different from the example.

Here's my test query and the spellcheck result:

http://localhost:8983/solr/spell?q=x%20Cafe%20y&spellcheck=true&spellcheck.collate=true&spellcheck.build=true&spellcheck.count=10

    <lst name="spellcheck">
      <lst name="suggestions">
        <lst name="Cafe">
          <int name="numFound">2</int>
          <int name="startOffset">2</int>
          <int name="endOffset">6</int>
          <arr name="suggestion">
            <str>café</str>
            <str>cofe</str>
          </arr>
        </lst>
        <str name="collation">x café y</str>
      </lst>
    </lst>

And here was my test doc:

    curl "http://localhost:8983/solr/update?commit=true" -H "Content-Type: text/xml" --data-binary '<add><doc><field name="id">doc-c1</field><field name="content">Internet café - Café au lait - Viennese coffee house - Maid café cofe</field></doc></add>'

Here is a test query that returns zero suggestions, because the editing distance is greater than two (capital C, unaccented character, and extra character at end):

http://localhost:8983/solr/spell?q=x%20Cafex%20y&spellcheck=true&spellcheck.collate=true&spellcheck.build=true

But, by overriding the default accuracy of 0.5 and dropping it to 0.35, I can get the expected suggestion:

http://localhost:8983/solr/spell?q=x%20Cafex%20y&spellcheck=true&spellcheck.collate=true&spellcheck.build=true&spellcheck.accuracy=0.35

-- Jack Krupansky

-----Original Message-----
From: couto.vicente
Sent: Thursday, May 24, 2012 10:28 AM
To: solr-user@lucene.apache.org
Subject: Accent Characters

> But if I try the spell checking or even a facet query, it looks like Solr
> is ignoring the words with accents.
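The edit-distance reasoning in this reply can be checked by hand with a short Python sketch. The Levenshtein routine below is the textbook dynamic-programming version; the similarity formula (1 minus distance divided by the longer length) is an assumption about how the accuracy threshold relates to edit distance, chosen because it agrees with the 0.5 and 0.35 figures quoted above, and is not taken from Solr's source.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


if __name__ == "__main__":
    for term in ("Cafe", "Cafex"):
        print(term, edit_distance(term, "café"), similarity(term, "café"))
```

Under that assumption, "Cafe" sits right at the default 0.5 threshold (distance 2 over 4 characters), while "Cafex" (distance 3 over 5 characters) falls below it but stays above 0.35, consistent with the two test queries.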
Accent Characters
Hello All.

I'm a newbie in Solr. I have seen this subject come up a lot, but none of the answers was satisfactory, or (probably) I don't know how to properly set up the Solr environment.

I indexed documents in Solr with a French content field. I used the field type text_fr that comes with the Solr schema.xml file:

    <field name="content" type="text_fr" indexed="true" stored="true" />

My spellchecker is almost the same as the one that comes with solrconfig.xml:

    <lst name="spellchecker">
      <str name="name">default</str>
      <str name="field">content</str>
      <str name="spellcheckIndexDir">spellchecker</str>
    </lst>

When I run any search query, with or without accented words, I get good results. But if I try spell checking or even a facet query, it looks like Solr is ignoring the words with accents.

I Googled it a lot but could not find any satisfactory fix. Can anyone help me?

Thank you!

--
View this message in context: http://lucene.472066.n3.nabble.com/Accent-Characters-tp3985931.html
Re: Filters: acute accent characters replaced with their english counterpart
Message written on 2008-11-10 at 11:14 by joeMcElroy:

> I need a custom filter to be added to a field which will replace special
> foreign characters with their english counterpart.
>
> for example
> ø = o
> Grave À È Ì Ò Ù à è ì ò ù = A E I O U a e i o u
> Circumflex Â Ê Î Ô Û â ê î ô û = A E I O U a e i o u
>
> is this possible?

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-4ebf7aea23b3d6d34a1f8314f9de17334a3e2fac

I wish such a filter existed for Latin2...

--
We read Knuth so you don't have to. - Tim Peters

Jarek Zgoda, R&D, Redefine
[EMAIL PROTECTED]
RE: Filters: acute accent characters replaced with their english counterpart
Hi Jarek,

On 11/10/2008 at 6:08 AM, Jarek Zgoda wrote:
> Message written on 2008-11-10 at 11:14 by joeMcElroy:
>> I need a custom filter to be added to a field which will replace special
>> foreign characters with their english counterpart.
>> [...]
>> is this possible?
>
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-4ebf7aea23b3d6d34a1f8314f9de17334a3e2fac
>
> I wish such filter exist for Latin2...

The following Lucene patch hasn't been committed yet, and there is no Solr Factory counterpart yet, but ASCIIFoldingFilter folds all accented letters to their (accent-stripped, if necessary) ASCII equivalents:

https://issues.apache.org/jira/browse/LUCENE-1390

Steve
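For readers who want to see what this kind of folding does to the characters in the question, here is a rough Python approximation. It is not how ASCIIFoldingFilter is implemented: it relies on Unicode decomposition to drop combining marks and adds a tiny hand-written table (EXTRA_FOLDS, a made-up name) for letters such as ø that do not decompose.

```python
import unicodedata

# Letters with no canonical decomposition need an explicit mapping.
EXTRA_FOLDS = {"ø": "o", "Ø": "O"}


def fold_accents(text: str) -> str:
    """Strip accents: decompose, drop combining marks, map the leftovers."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return "".join(EXTRA_FOLDS.get(ch, ch) for ch in stripped)


if __name__ == "__main__":
    print(fold_accents("À È Ì Ò Ù à è ì ò ù"))
    print(fold_accents("Â Ê Î Ô Û â ê î ô û"))
    print(fold_accents("ø"))
```

Folding has to be applied the same way at index time and at query time, otherwise folded query terms will not match unfolded indexed terms.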
Re: Filters: acute accent characters replaced with their english counterpart
joe,

This hasn't been committed yet, but SOLR-822 may be your answer.

https://issues.apache.org/jira/browse/SOLR-822

Koji

joeMcElroy wrote:
> I need a custom filter to be added to a field which will replace special
> foreign characters with their english counterpart.
>
> for example
> ø = o
> Grave À È Ì Ò Ù à è ì ò ù = A E I O U a e i o u
> Circumflex Â Ê Î Ô Û â ê î ô û = A E I O U a e i o u
>
> is this possible?
>
> joe
Filters: acute accent characters replaced with their english counterpart
I need a custom filter to be added to a field which will replace special foreign characters with their English counterparts.

For example:

ø = o
Grave: À È Ì Ò Ù à è ì ò ù = A E I O U a e i o u
Circumflex: Â Ê Î Ô Û â ê î ô û = A E I O U a e i o u

Is this possible?

joe

--
View this message in context: http://www.nabble.com/Filters%3A-acute-accent-characters-replaced-with-their-english-counterpart-tp20416888p20416888.html