Re: Accent Characters

2012-05-31 Thread Sami Siren
Vicente,

Are you using CommonsHttpSolrServer or HttpSolrServer? If the latter
then you are probably hitting this:
https://issues.apache.org/jira/browse/SOLR-3375

The remedy is to use CommonsHttpSolrServer.

--
 Sami Siren



Re: Accent Characters

2012-05-31 Thread Vicente Couto
Hello, guys.

Now it's working. Thank you both Jack and Sami.
I fixed my issue by just using server.query(query, METHOD.POST) in solrJ and
yes, I was using HttpSolrServer. I have to move on to CommonsHttpSolrServer.

Thank you very much.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Accent-Characters-tp3985931p3987046.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Accent Characters

2012-05-30 Thread Vicente Couto
Hello, Jack.

Yeah, I'm screwed up.

Well, the documents are indexed with the accents.
I started a new clean solr 3.6 configuration, with as few changes as
possible; I'm running two cores, one for English and another one for French.
Here is where I am now: if I run queries through SolrJ, it applies some
sort of encoding. For example, I can see in the logs that when I run a
query looking for pré, I get

INFO: [coreFR] webapp=/solr path=/select
params={fl=*,score&q=content:pré&hl.fl=content&hl.maxAnalyzedChars=10&hl=true}
hits=0 status=0 QTime=0

And I can't see any results. If I try encoding the query as UTF-8, it
doesn't work.
But if I simply put the HTTP calls into the browser address bar, for example,
it works perfectly!
So, how can I tell SolrJ not to encode the queries?

Thank you

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Accent-Characters-tp3985931p3986970.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Accent Characters

2012-05-30 Thread Jack Krupansky

This might be related:

https://issues.apache.org/jira/browse/SOLR-443

It suggests setting an HTTP header: Content-Type: 
application/x-www-form-urlencoded; charset=UTF-8


-- Jack Krupansky
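
As a rough illustration of what that header implies, the sketch below builds (but does not send) a form-encoded POST whose body is percent-encoded as UTF-8. It uses only the Python standard library rather than SolrJ, and the URL and the helper name build_select_request are assumptions made for the example.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint, modelled on the core name used in this thread.
SELECT_URL = "http://localhost:8983/solr/coreFR/select"


def build_select_request(query, url=SELECT_URL):
    """Build (without sending) a POST request whose body is UTF-8 form data."""
    # urlencode percent-encodes each value using the given character encoding;
    # the result is pure ASCII, so it can be turned into bytes safely.
    body = urlencode({"q": query, "fl": "*,score"}, encoding="utf-8").encode("ascii")
    headers = {
        "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
    }
    return Request(url, data=body, headers=headers, method="POST")


req = build_select_request("content:pré")
print(req.get_method())
print(req.data)
```

Because the parameters travel in the request body with an explicit charset, the server does not have to guess how the accented characters were encoded.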




Re: Accent Characters

2012-05-28 Thread couto.vicente
Hi, Jack.
First of all thank you for your help.
Well, I tried again and then realized that my problem is not really with Solr.
I ran this query against Solr after starting it with the command java
-jar start.jar:
http://localhost:8983/solr/coreFR/spell?q=content:pr%C3%A9senta&spellcheck=true&spellcheck.collate=true&rows=0&spellcheck.count=10

It gives me the result:
<?xml version="1.0" encoding="UTF-8" ?>
<response>
 <lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">31</int>
 </lst>
 <result name="response" numFound="0" start="0" />
 <lst name="spellcheck">
  <lst name="suggestions">
   <lst name="présenta">
    <int name="numFound">10</int>
    <int name="startOffset">8</int>
    <int name="endOffset">16</int>
    <arr name="suggestion">
     <str>présente</str>
     <str>présent</str>
     <str>présenté</str>
     <str>présents</str>
     <str>présentant</str>
     <str>présentera</str>
     <str>présentait</str>
     <str>présentes</str>
     <str>présenter</str>
     <str>présentée</str>
    </arr>
   </lst>
   <str name="collation">content:présente</str>
  </lst>
 </lst>
</response>

And I ran exactly the same query after deploying solr.war in Tomcat 7. Here
is my result:
<?xml version="1.0" encoding="UTF-8" ?>
<response>
 <lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">16</int>
 </lst>
 <result name="response" numFound="0" start="0" />
 <lst name="spellcheck">
  <lst name="suggestions">
   <lst name="présenta">
    <int name="numFound">10</int>
    <int name="startOffset">8</int>
    <int name="endOffset">16</int>
    <arr name="suggestion">
     <str>present</str>
     <str>prbsent</str>
     <str>presentant</str>
     <str>presentait</str>
     <str>puisent</str>
     <str>pasent</str>
     <str>pensent</str>
     <str>posent</str>
     <str>dresent</str>
     <str>resenti</str>
    </arr>
   </lst>
   <str name="collation">content:present</str>
  </lst>
 </lst>
</response>

As my application runs under Tomcat, it means that I have some issue
with Tomcat. The weird thing is that I already googled for a fix and found
that we have to set a parameter in Tomcat's server.xml config file:

<Connector port="5443" protocol="HTTP/1.1"
   connectionTimeout="2"
   redirectPort="8443"
   URIEncoding="UTF-8" />

But it's not working, as you can see.
I'm feeling a little stupid because it doesn't look like a big problem.
Surely people around the world are running Solr under Tomcat with accented
queries that work properly!

Thank you

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Accent-Characters-tp3985931p3986423.html
Sent from the Solr - User mailing list archive at Nabble.com.
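
For context on the URIEncoding attribute: it tells Tomcat which character set to use when decoding percent-escapes in the request URI, and Tomcat 7 falls back to ISO-8859-1 when the attribute is absent on the connector that serves the request. The Python sketch below (standard library only; the helper name decode_term is made up for the example) shows how the same escaped term decodes under each character set. It illustrates the setting and is not a diagnosis of the results above, which echo the term intact.

```python
from urllib.parse import unquote

# The query term exactly as it is escaped in the URL quoted in this message.
ENCODED_TERM = "pr%C3%A9senta"


def decode_term(encoded, charset):
    """Percent-decode a URL component using the given character set."""
    return unquote(encoded, encoding=charset)


as_utf8 = decode_term(ENCODED_TERM, "utf-8")
as_latin1 = decode_term(ENCODED_TERM, "iso-8859-1")

print(as_utf8)    # présenta
print(as_latin1)  # prÃ©senta
```

Under ISO-8859-1 the two bytes of the UTF-8 "é" become two separate characters, so the decoded term no longer matches anything indexed with the real accent.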


Re: Accent Characters

2012-05-28 Thread Jack Krupansky
The query seems fine - as far as the URL being UTF-8. It seems that the 
documents are not being passed to Solr with UTF-8 encoding. The document is 
not part of the URL. It is HTTP POST data.


Try an explicit curl command to add a document and see if it is indexed with 
the accents.


-- Jack Krupansky
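
A minimal sketch of the same idea without curl, assuming a hypothetical update URL, document id and content: it builds (but does not send) an XML add request whose body is explicitly UTF-8 and whose Content-Type names that charset. Only the Python standard library is used, and build_add_request is a name invented for the example.

```python
import xml.etree.ElementTree as ET
from urllib.request import Request

# Hypothetical update endpoint for the French core mentioned in this thread.
UPDATE_URL = "http://localhost:8983/solr/coreFR/update?commit=true"


def build_add_request(doc_id, content, url=UPDATE_URL):
    """Build (without sending) an XML <add> request encoded as UTF-8."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    ET.SubElement(doc, "field", name="id").text = doc_id
    ET.SubElement(doc, "field", name="content").text = content
    # Serialising with encoding="utf-8" returns bytes, with accented
    # characters written as UTF-8 byte sequences.
    body = ET.tostring(add, encoding="utf-8")
    headers = {"Content-Type": "text/xml; charset=utf-8"}
    return Request(url, data=body, headers=headers, method="POST")


req = build_add_request("doc-fr-1", "Il présente le café")
print(req.data.decode("utf-8"))
```

If a document sent this way is indexed with its accents, the indexing client that dropped them is the more likely culprit than the query URL.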




Re: Accent Characters

2012-05-25 Thread Jack Krupansky
I tried your scenario with the Solr 3.6 example and it seemed to work fine 
and suggested an accented term for me.


Some possibilities:

1) Your term had an editing distance that was too high relative to any 
accented correction. Check your term and count how many characters must be 
changed to match an accented term. Case changes count as well. In the case 
of a 4-character word, the maximum editing distance allowed (by default) is 
2. Maybe you simply need to override the default for accuracy; e.g., 
spellcheck.accuracy=0.35, compared to the default of 0.5.
2) Did you get some other suggestion when you expected the accented term? 
If so, increase the spellcheck.count request parameter from 1 to 10 to see 
other suggestions.
3) You have some other schema/solrconfig changes that you haven't told us 
about.


Try to reproduce your issue against a fresh copy of Solr 3.6 example, and 
then see how your actual configuration (that fails) is different from the 
example.


Here's my test query and the spellcheck result:

http://localhost:8983/solr/spell?q=x%20Cafe%20y&spellcheck=true&spellcheck.collate=true&spellcheck.build=true&spellcheck.count=10

<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="Cafe">
      <int name="numFound">2</int>
      <int name="startOffset">2</int>
      <int name="endOffset">6</int>
      <arr name="suggestion">
        <str>café</str>
        <str>cofe</str>
      </arr>
    </lst>
    <str name="collation">x café y</str>
  </lst>
</lst>

And here was my test doc:

curl "http://localhost:8983/solr/update?commit=true" \
  -H "Content-Type: text/xml" --data-binary \
  '<add><doc><field name="id">doc-c1</field><field name="content">Internet café - Café au lait - Viennese coffee house - Maid café cofe</field></doc></add>'


Here is a test query that returns zero suggestions, because the editing 
distance is greater than two (Capital C, unaccented character, and extra 
character at end):


http://localhost:8983/solr/spell?q=x%20Cafex%20y&spellcheck=true&spellcheck.collate=true&spellcheck.build=true

But, by overriding the default accuracy of 0.5 and dropping it to 0.35, I 
can get the expected suggestion:


http://localhost:8983/solr/spell?q=x%20Cafex%20y&spellcheck=true&spellcheck.collate=true&spellcheck.build=true&spellcheck.accuracy=0.35

-- Jack Krupansky
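
To make the character counting above concrete, here is a small Python sketch of plain Levenshtein distance applied to the two test terms. It is a textbook implementation written for this example, not Solr's spellchecker scoring, and it does not show how spellcheck.accuracy is derived from the distance.

```python
def edit_distance(a, b):
    """Return the Levenshtein distance between strings a and b."""
    # previous[j] is the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


print(edit_distance("Cafe", "café"))   # 2
print(edit_distance("Cafex", "café"))  # 3
```

The comparison is case-sensitive and treats "e" and "é" as different characters, which is why the capital C and the missing accent each cost one edit.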




Accent Characters

2012-05-24 Thread couto.vicente
Hello All.
I'm a newbie in Solr and I have seen this subject come up a lot, but none of
the answers was satisfactory, or (probably) I don't know how to properly set
up the Solr environment.
I indexed documents in Solr with a French content field. I used the field
type text_fr that comes with the Solr schema.xml file.

<field name="content" type="text_fr" indexed="true" stored="true" />

My spellchecker is almost the same as the one that comes with solrconfig.xml:

<lst name="spellchecker">
  <str name="name">default</str>
  <str name="field">content</str>
  <str name="spellcheckIndexDir">spellchecker</str>
</lst>

When I run any search query, with or without accented words, I get good
results.
But if I try spell checking or even a facet query, it looks like Solr is
ignoring the words with accents.
I googled it a lot but could not find a satisfactory fix.

Can anyone help me?

Thank you!


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Accent-Characters-tp3985931.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Filters: acute accent characters replaced with their english counterpart

2008-11-10 Thread Jarek Zgoda

Message written on 2008-11-10 at 11:14 by joeMcElroy:

 I need a custom filter to be added to a field which will replace special
 foreign characters with their English counterpart.

 for example ø = o
 Grave À È Ì Ò Ù à è ì ò ù = A E I O U a e i o u
 Circumflex Â Ê Î Ô Û â ê î ô û  = A E I O U a e i o u

 is this possible?


http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-4ebf7aea23b3d6d34a1f8314f9de17334a3e2fac

I wish such a filter existed for Latin2...

--
We read Knuth so you don't have to. - Tim Peters

Jarek Zgoda, R&D, Redefine
[EMAIL PROTECTED]



RE: Filters: acute accent characters replaced with their english counterpart

2008-11-10 Thread Steven A Rowe
Hi Jarek,

On 11/10/2008 at 6:08 AM, Jarek Zgoda wrote:
 Message written on 2008-11-10 at 11:14 by joeMcElroy:
  I need a custom filter to be added to a field which will replace
  special foreign characters with their English counterpart.
  
  for example ø = o
  Grave À È Ì Ò Ù à è ì ò ù = A E I O U a e i o u
  Circumflex Â Ê Î Ô Û â ê î ô û  = A E I O U a e i o u
  
  is this possible?
 
 http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-4ebf7aea23b3d6d34a1f8314f9de17334a3e2fac
 
 I wish such a filter existed for Latin2...

The following Lucene patch hasn't been committed yet, and there is no Solr 
Factory counterpart yet, but: ASCIIFoldingFilter folds all accented letters to 
their (accent-stripped, if necessary) ASCII equivalents:

https://issues.apache.org/jira/browse/LUCENE-1390

Steve
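
As a rough picture of what accent folding does, the Python sketch below strips diacritics using Unicode decomposition. It is an illustration written for this example and not the ASCIIFoldingFilter code; the function name fold_accents and the small EXTRA_FOLDS table are assumptions, and the table is far from complete.

```python
import unicodedata

# Letters with no canonical decomposition need an explicit mapping.
# This table is illustrative only.
EXTRA_FOLDS = {"ø": "o", "Ø": "O"}


def fold_accents(text):
    """Strip diacritics from letters, leaving their base characters."""
    replaced = "".join(EXTRA_FOLDS.get(ch, ch) for ch in text)
    # NFD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", replaced)
    # Drop the combining marks (category Mn) left behind by decomposition.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(fold_accents("À È Ì Ò Ù à è ì ò ù"))  # A E I O U a e i o u
print(fold_accents("Â Ê Î Ô Û â ê î ô û"))  # A E I O U a e i o u
print(fold_accents("ø"))                    # o
```

Characters such as ø carry their stroke as part of the letter rather than as a combining mark, which is why decomposition alone does not cover the first example in the question.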


Re: Filters: acute accent characters replaced with their english counterpart

2008-11-10 Thread Koji Sekiguchi

joe,

This hasn't been committed yet, but SOLR-822 may be your answer.

https://issues.apache.org/jira/browse/SOLR-822

Koji





Filters: acute accent characters replaced with their english counterpart

2008-11-10 Thread joeMcElroy

I need a custom filter to be added to a field which will replace special
foreign characters with their English counterpart.

for example ø = o
Grave À È Ì Ò Ù à è ì ò ù = A E I O U a e i o u 
Circumflex Â Ê Î Ô Û â ê î ô û  = A E I O U a e i o u

is this possible?

joe
-- 
View this message in context: 
http://www.nabble.com/Filters%3A-acute-accent-characters-replaced-with-their-english-counterpart-tp20416888p20416888.html
Sent from the Solr - User mailing list archive at Nabble.com.