I am indexing Arabic documents that contain Arabic diacritics and dotless (old Arabic) characters. I am running Apache Tomcat, and I am using my own modified version of the AraMorph analyzer as the Arabic analyzer. In the development environment I managed to normalize the diacritics and dotless characters (the same concept as solr.ArabicNormalizationFilterFactory), and I can verify that the analyzer works correctly there: I get the correct stem for Arabic words. The input text file used for testing is UTF-8 encoded.
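For reference, the normalization step I implemented is conceptually like the following minimal sketch (plain JDK, not the actual AraMorph code; the class and method names here are just for illustration). It strips the Arabic diacritic (tashkeel) code points, which is the same idea the standard filter factory applies:

```java
public class ArabicNormalizer {

    // Strip Arabic diacritics: the tashkeel block U+064B..U+065F
    // (fathatan, dammatan, kasratan, fatha, damma, kasra, shadda, sukun, ...)
    // plus U+0670 (superscript alef).
    public static String stripDiacritics(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean diacritic = (c >= '\u064B' && c <= '\u065F') || c == '\u0670';
            if (!diacritic) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "حِباًَ" normalizes to the bare letters "حبا"
        System.out.println(stripDiacritics("حِباًَ"));
    }
}
```

In the development environment this kind of normalization runs before stemming, so diacritics never reach the tokenizer.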
When I build the AraMorph jar and place it under Solr's lib directory, however, words get split at the diacritics and dotless characters. I made sure that server.xml contains URIEncoding="UTF-8". I also made sure that the text sent to Solr via SolrJ is UTF-8 encoded, for example:

    solr.addBean(new Doc("4", new String("حِباًَ".getBytes("UTF8"))));

but nothing is working. I also tried the Analysis page in the Solr admin UI, for both indexing and querying, and both show that an Arabic word is split wherever a diacritic or dotless character occurs. Do you have any idea what the problem might be?

Schema snippet:

    <fieldType name="text" class="solr.TextField">
      <analyzer type="index" class="gpl.pierrick.brihaye.aramorph.lucene.ArabicNormalizeStemmer"/>
      <analyzer type="query" class="gpl.pierrick.brihaye.aramorph.lucene.ArabicNormalizeStemmer"/>
    </fieldType>

I also added the following parameter to the JVM: -Dfile.encoding=UTF-8

Thanks,
engy
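P.S. One thing I checked with a small self-contained JDK snippet (no Solr involved; the class name is just for illustration): Java Strings are already Unicode, so the getBytes("UTF8") round-trip above is a no-op at best, and it actually corrupts the text if the platform default charset is not UTF-8, since the no-argument new String(bytes) decodes with the default charset:

```java
import java.nio.charset.StandardCharsets;

public class EncodingCheck {

    // Encoding to UTF-8 bytes and decoding them as UTF-8 is lossless.
    public static String roundTripUtf8(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    // Decoding UTF-8 bytes with a different charset (here ISO-8859-1,
    // standing in for a non-UTF-8 platform default) produces mojibake.
    public static String decodeWithWrongCharset(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String word = "حبا";
        System.out.println(word.equals(roundTripUtf8(word)));          // true
        System.out.println(word.equals(decodeWithWrongCharset(word))); // false
    }
}
```

So passing the String straight to addBean, without the getBytes conversion, should be safe as far as SolrJ is concerned, which makes me suspect the problem is elsewhere.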