Re: Searching for words with accented characters.

2010-08-28 Thread Lance Norskog
This was a 2-year-old question :)

Have you made sure that UTF-8 character encoding is set in all phases
of your project? Servlet container, XML input header, etc? Character
encodings are hell to debug on Windows, so I would suggest checking it
on Linux or a Mac.
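To illustrate what an encoding mismatch between phases looks like, here is a minimal Python sketch (not Solr code, and the variable names are only illustrative). It assumes one phase writes UTF-8 bytes and a later phase wrongly reads them as Latin-1:

```python
# A word with a precomposed e-grave (U+00E8), as a catalog record might store it.
text = "sorci\u00e8re"

# Phase 1 writes the text out as UTF-8 bytes.
utf8_bytes = text.encode("utf-8")

# Phase 2 wrongly assumes Latin-1 when reading those bytes back.
misread = utf8_bytes.decode("latin-1")

# Reading with the matching encoding restores the original text.
roundtrip = utf8_bytes.decode("utf-8")

print(repr(misread))
print(roundtrip == text)
```

If the misread form is what ends up in the index, a correctly encoded query term will not match it, which is why every phase needs to agree on UTF-8.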

Since this is a one-character fumble, a spell checker could help the
user find the actual word.

There is a new character mapper tool that might not have this problem.

You can save the input text and the mapped text in different fields.
(It would be very useful to have the mapper save the original word as
a synonym.)

On Fri, Aug 27, 2010 at 9:45 AM, Muneeb Ali muneeba...@hotmail.com wrote:

 Hey Robert,

 Just wondering if you ever got to solve this problem?
 We are facing a similar issue with our catalog search :(

 look forward to hearing from you.

 -Thanks,

 Muneeb
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Searching-for-words-with-accented-characters-tp486325p1375019.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Lance Norskog
goks...@gmail.com


Re: Searching for words with accented characters.

2010-08-27 Thread Muneeb Ali

Hey Robert,

Just wondering if you ever got to solve this problem?
We are facing a similar issue with our catalog search :(

look forward to hearing from you.

-Thanks,

Muneeb


Searching for words with accented characters.

2008-06-11 Thread Robert Haschart
We are using Solr as the search engine for our public access library 
catalog.  In testing I did a search for a French movie that I know is in 
the catalog, named Kirikou et la sorcière, and nothing was returned.  
If I search for just the word Kirikou, several results are returned, 
and the problem becomes apparent.  The records contain Kirikou et la 
sorcie?re, where the accent is a Unicode combining character following 
the e. 
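The difference between the two forms can be shown with a short Python sketch (an illustration only, using the standard unicodedata module rather than anything from Solr). It assumes the record stores e followed by U+0300 COMBINING GRAVE ACCENT while the query uses the precomposed U+00E8:

```python
import unicodedata

# "sorciere" with a grave accent, written two ways: precomposed U+00E8
# versus plain e followed by COMBINING GRAVE ACCENT (U+0300).
precomposed = "sorci\u00e8re"
decomposed = "sorcie\u0300re"

# The two strings render alike but are different code point sequences.
same_before = precomposed == decomposed

# NFC normalization composes e + U+0300 into the single code point U+00E8.
same_after = unicodedata.normalize("NFC", decomposed) == precomposed

print(len(precomposed), len(decomposed), same_before, same_after)
```

The decomposed form is one code point longer, which is consistent with the 14,23 offsets (nine characters) reported for that token in the analysis output later in this message.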

After some research into Unicode normalization, I found and installed a 
Unicode normalization filter that is set to convert letters followed by 
combining codes into the precomposed form.  I also installed a 
solr.ISOLatin1AccentFilterFactory that will then convert these 
precomposed forms into the Latin equivalent without any accent.   The 
following is the fieldType definition taken from the schema.xml file:


  <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="schema.UnicodeNormalizationFilterFactory"/>
      <filter class="solr.ISOLatin1AccentFilterFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="schema.UnicodeNormalizationFilterFactory"/>
      <filter class="solr.ISOLatin1AccentFilterFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
  </fieldType>
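The intended effect of the first two filters in the chain above (compose, then drop accents) can be approximated in a few lines of Python. This is only a sketch under assumptions: the function names are made up for illustration, and the accent stripping here uses NFD decomposition plus removal of combining marks, which is a different technique from whatever the two Solr filter classes do internally.

```python
import unicodedata


def compose(token):
    """Fold base letter + combining mark into precomposed form (NFC)."""
    return unicodedata.normalize("NFC", token)


def strip_accents(token):
    """Drop accents by decomposing (NFD) and removing combining marks."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fold(token):
    """Apply both steps in the same order as the filter chain."""
    return strip_accents(compose(token))


# Decomposed, precomposed and unaccented spellings all fold to one term.
for spelling in ("sorcie\u0300re", "sorci\u00e8re", "sorciere"):
    print(fold(spelling))
```

If the real filters behave this way, all three spellings should reach the later filters as the same token, so comparing this expectation with the analysis.jsp output is one way to spot the stage where the tokens diverge.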

So it seems like this should work. 
However, searching again for Kirikou et la sorcière or sorcière or 
sorcie?re or just sorciere doesn't return the document in question.


I've tried looking at the results from solr/admin/analysis.jsp, entering 
text from the record as the Field value (Index) and entering 
sorciere as the Field value (Query), and I get the following results, which 
seem to indicate that there should be a match between the stemmed entry 
sorcier in the record and the stemmed word sorcier from the query.


So clearly I am either doing something wrong or misinterpreting the 
analyzers, but I am at a loss as to how to figure out what is wrong.  
Any suggestions?



   org.apache.solr.analysis.WhitespaceTokenizerFactory {}

term position   term text        term type   source start,end
1               Kirikou          word        0,7
2               et               word        8,10
3               la               word        11,13
4               sorcie?re        word        14,23
5               France           word        25,31
6               3                word        32,33
7               Cinema           word        34,40
8               /                word        41,42
9               RTBF             word        43,47
10              (Te?le?vision    word        48,61
11              belge).          word        62,69
12              Grand            word        72,77
13              Prix             word        78,82
14              du               word        83,85
15              festival         word        86,94
16              d'Annecy         word        95,103
17              1999             word        104,108
18              France           word        110,116
19              French           word        117,123
20              VHS              word        124,127
21              VIDEO            word        129,134
22              .VHS10969        word        135,144
23              1                word        147,148
24              vide?ocassette   word        149,163
25              (1h10            word        164,169
26              min.)            word        170,175
27              (VHS)            word        176,181
28              Ocelot,          word        183,190
29              Michel           word        191,197



   schema.UnicodeNormalizationFilterFactory {}

term position   term text                   term type   source start,end
1               (Kirikou,0,7)               word        0,7
2               (et,8,10)                   word        8,10
3               (la,11,13)                  word        11,13
4               (sorcière,14,23)            word        14,23
5               (France,25,31)              word        25,31
6               (3,32,33)                   word        32,33
7               (Cinema,34,40)              word        34,40
8               (/,41,42)                   word        41,42
9               (RTBF,43,47)                word        43,47
10              ((Télévision,48,61)         word        48,61
11              (belge).,62,69)             word        62,69
12              (Grand,72,77)               word        72,77
13              (Prix,78,82)                word        78,82
14              (du,83,85)                  word        83,85
15              (festival,86,94)            word        86,94
16              (d'Annecy,95,103)           word        95,103
17              (1999,104,108)              word        104,108
18              (France,110,116)            word        110,116
19              (French,117,123)            word        117,123
20              (VHS,124,127)               word        124,127
21              (VIDEO,129,134)             word        129,134
22              (.VHS10969,135,144)         word        135,144
23              (1,147,148)                 word        147,148
24              (vidéocassette,149,163)     word        149,163
25              ((1h10,164,169)             word        164,169
26              (min.),170,175)             word        170,175
27              ((VHS),176,181)             word        176,181
28              (Ocelot,,183,190)           word        183,190
29              (Michel,191,197)            word        191,197



   org.apache.solr.analysis.ISOLatin1AccentFilterFactory {}

term position 	1 	2 

Re: Searching for words with accented characters.

2008-06-11 Thread solrtom
 83,85   86,94   86,94   86,94   95,103  95,103  95,103  95,103 
 104,108   104,108 104,108 110,116 110,116 
 110,116 117,123 117,123 
 117,123   124,127 124,127 124,127 129,134 
 129,134 129,134 135,144 
 135,144   135,144 135,144 147,148 147,148 
 147,148 149,163 149,163 
 149,163   164,169 164,169 164,169 164,169 
 164,169 170,175 170,175 
 170,175   176,181 176,181 176,181 183,190 
 183,190 183,190 191,197 
 191,197   191,197
 0,7   8,10   11,13   14,23   25,31   32,33   34,40   41,42   43,47   48,61 
 62,69 72,77   78,82   83,85   86,94   95,103  95,103  104,108 
 110,116 
 117,123   124,127 129,134 135,144 147,148 
 149,163 164,169 170,175 
 176,181   183,190 191,197
 
 
   Query Analyzer
 
 
 org.apache.solr.analysis.WhitespaceTokenizerFactory {}
 
 term position 1
 term text sorciere
 term type word
 source start,end  0,8
 
 
 schema.UnicodeNormalizationFilterFactory {}
 
 term position 1
 term text (sorciere,0,8)
 term type word
 source start,end  0,8
 
 
 org.apache.solr.analysis.ISOLatin1AccentFilterFactory {}
 
 term position 1
 term text (sorciere,0,8)
 term type word
 source start,end  0,8
 
 
 org.apache.solr.analysis.SynonymFilterFactory {expand=true,
 ignoreCase=true, synonyms=synonyms.txt}
 
 term position 1
 term text (sorciere,0,8)
 term type word
 source start,end  0,8
 
 
 org.apache.solr.analysis.StopFilterFactory {words=stopwords.txt,
 ignoreCase=true}
 
 term position 1
 term text (sorciere,0,8)
 term type word
 source start,end  0,8
 
 
 org.apache.solr.analysis.WordDelimiterFilterFactory
 {catenateWords=0, catenateNumbers=0, catenateAll=0,
 generateNumberParts=1, generateWordParts=1}
 
 term position     1          2      3
 term text         sorciere   0      8
 term type         word       word   word
 source start,end  0,8        0,8    0,8
 
 
 org.apache.solr.analysis.LowerCaseFilterFactory {}
 
 term position     1          2      3
 term text         sorciere   0      8
 term type         word       word   word
 source start,end  0,8        0,8    0,8
 
 
 org.apache.solr.analysis.EnglishPorterFilterFactory
 {protected=protwords.txt}
 
 term position     1         2      3
 term text         sorcier   0      8
 term type         word      word   word
 source start,end  0,8       0,8    0,8
 
 
 org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory {}
 
 term position     1         2      3
 term text         sorcier   0      8
 term type         word      word   word
 source start,end  0,8       0,8    0,8
 
 
 
 
 

-- 
View this message in context: 
http://www.nabble.com/Searching-for-words-with-accented-characters.-tp17782723p17789006.html
Sent from the Solr - User mailing list archive at Nabble.com.