On 7 September 2012 06:24, kiran chitturi <chitturikira...@gmail.com> wrote:
[...]

> When i index a text field which has arabic and English like this tweet
> “@anaga3an: هو سعد الحريري بيعمل ايه غير تحديد الدوجلاس ويختار الكرافته ؟؟”
> #gcc #ksa #lebanon #syria #kuwait #egypt #سوريا
> with field_type as 'text_ar' and when i try to see the same field again in
> solr, it is shown as below.
> RT @AhmedWagih: لو معملناش حاجة �ي الزيادة
> السكانية �ي مصر، هنتحول لدولة �قيرة
> كثي�ة السكان زي بنجلادش #Egypt #EgyEconomy
>
> both of the lines do not mean the same, but i have just placed them here as
> an example. This was the problem i am facing.
>
[...]

The encoding of your input text is being mangled at some point.
Presuming that your original encoding is UTF-8, I would look at
how you are indexing into Solr, and the encoding settings on the
Java container. Solr itself handles UTF-8 perfectly fine, as do
most Java containers if configured properly, so my first suspicion
would be the indexing code.
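
To illustrate the kind of mangling I mean, here is a minimal Python
sketch. It assumes the text starts out as UTF-8 and that some step
decodes it with a different charset; cp1252 is only an example, not
necessarily what your setup uses, and the helper names are made up
for this illustration. In your sample the damaged character looks
like the same letter each time, which would fit a single byte being
lost in such a conversion, but that is a guess on my part.

```python
REPLACEMENT_CHAR = "\ufffd"  # U+FFFD, shown as a question-mark glyph


def decode_with(data: bytes, charset: str) -> str:
    """Decode bytes, substituting U+FFFD for anything undecodable."""
    return data.decode(charset, errors="replace")


def looks_mangled(text: str) -> bool:
    """Return True if the text contains the Unicode replacement character."""
    return REPLACEMENT_CHAR in text


original = "في مصر"                      # arbitrary Arabic sample
utf8_bytes = original.encode("utf-8")

good = decode_with(utf8_bytes, "utf-8")   # matching charset: text survives
bad = decode_with(utf8_bytes, "cp1252")   # wrong charset: text is damaged

print(good == original)    # the round trip is lossless
print(looks_mangled(bad))  # the wrong decode introduced U+FFFD
```

Once U+FFFD has been stored, the original bytes are gone, so the fix
has to happen upstream and the data has to be reindexed.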

As it looks like you are pulling from MySQL using DIH (the
DataImportHandler), check that the database character set is
UTF-8, and that the JDBC connection uses UTF-8 as well.
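
For the connection side, a rough Python sketch of a sanity check on
the JDBC URL follows. It assumes the MySQL Connector/J driver and its
characterEncoding connection property; the host, port, and database
name are hypothetical, and the helper is only an illustration, not
part of Solr or DIH.

```python
from urllib.parse import urlparse, parse_qs


def jdbc_url_uses_utf8(jdbc_url: str) -> bool:
    """Check whether a JDBC URL sets characterEncoding to UTF-8."""
    if not jdbc_url.startswith("jdbc:"):
        raise ValueError("not a JDBC URL")
    # Drop the leading "jdbc:" so urlparse sees an ordinary URL.
    query = urlparse(jdbc_url[len("jdbc:"):]).query
    params = parse_qs(query)
    # parse_qs returns a list per key; take the last value given.
    encoding = params.get("characterEncoding", [""])[-1].lower()
    return encoding in ("utf-8", "utf8")


# Hypothetical URLs, for illustration only.
with_utf8 = "jdbc:mysql://localhost:3306/tweets?useUnicode=true&characterEncoding=UTF-8"
without = "jdbc:mysql://localhost:3306/tweets"

print(jdbc_url_uses_utf8(with_utf8))
print(jdbc_url_uses_utf8(without))
```

If the URL lives in an XML attribute of the DIH configuration, the
"&" between parameters has to be written as "&amp;".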

Regards,
Gora
