On 7 September 2012 06:24, kiran chitturi <chitturikira...@gmail.com> wrote: [...]
> When i index a text field which has arabic and English like this tweet
> “@anaga3an: هو سعد الحريري بيعمل ايه غير تحديد الدوجلاس ويختار الكرافته ؟؟”
> #gcc #ksa #lebanon #syria #kuwait #egypt #سوريا
> with field_type as 'text_ar' and when i try to see the same field again in
> solr, it is shown as below.
> RT @AhmedWagih: لو معملناش ØØ§Ø¬Ø© Ù�ÙŠ الزيادة
> السكانية Ù�ÙŠ مصر، هنتØÙˆÙ„ لدولة Ù�قيرة
> كثيÙ�Ø© السكان زي بنجلادش #Egypt #EgyEconomy
>
> both of the lines do not mean the same, but i have just placed them here as
> an example. This was the problem i am facing.
> [...]

The encoding of your input text is being mangled at some point.
Presuming that your original encoding is UTF-8, I would look at
how you are indexing into Solr, and the encoding settings on the
Java container. Solr itself handles UTF-8 perfectly fine, as do
most Java containers if configured properly, so my first suspicion
would be the indexing code.

As it looks like you are pulling from MySQL using DIH, check that
the database character set is UTF-8, and that the connection uses
UTF-8.

Regards,
Gora
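
To illustrate the kind of mangling described above, here is a minimal Python sketch. It assumes the garbling comes from UTF-8 bytes being decoded with a single-byte encoding somewhere between the database and Solr. That is consistent with the "Ø" and "Ù" runs in the quoted output, but it is not confirmed for this setup. The helper names `mangle` and `unmangle` are hypothetical and exist only for this example.

```python
# Sketch only: reproduces the symptom, assuming UTF-8 bytes are being
# decoded as Latin-1 at some point in the indexing chain.

def mangle(text: str, wrong_encoding: str = "latin-1") -> str:
    """Encode as UTF-8, then (wrongly) decode the bytes as `wrong_encoding`."""
    return text.encode("utf-8").decode(wrong_encoding)


def unmangle(garbled: str, wrong_encoding: str = "latin-1") -> str:
    """Reverse the mistake: recover the original bytes and decode as UTF-8."""
    return garbled.encode(wrong_encoding).decode("utf-8")


original = "مصر"
garbled = mangle(original)

print(repr(garbled))      # each Arabic letter has become two Latin-1 characters
print(unmangle(garbled))  # round-trips back to the original text
```

Latin-1 is used here because it maps every byte to a character, so the round trip is lossless. Windows-1252 leaves a few bytes (such as 0x81) undefined, which may be why some characters in the quoted output appear as replacement marks. Once bytes have been dropped that way, the text cannot be repaired afterwards, so the fix has to be made at the source: the database character set and the connection encoding.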