-----Original Message-----
From: Jukka Zitting [mailto:[EMAIL PROTECTED]
Sent: 2008/02/17 01:38 AM
To: [email protected]
Subject: Re: Lucene Analyzer

Hi,

2008/2/15 Hamid Reza Sahlolbey <[EMAIL PROTECTED]>:
> First I used StandardAnalyzer, but when I looked at the workspace index files I
> saw that it doesn't index Persian text, so I changed to
> SimpleAnalyzer. Now it seems to index Persian text correctly, but searches don't
> find it (note that the query is the same for MS Word and PDF files).

Could there be some character encoding confusion somewhere? You may want to check that the Unicode character stream produced by the text extractor looks valid.

BR,
Jukka Zitting

Hi Jukka,

Yes, you are right. Last night I found that PDFBox extracts my text incorrectly. We could not see the problem at first because there are two sets of Persian characters in the Unicode character map. The extracted characters should be below \u06dc (what browsers understand as UTF-8 Persian characters), but PDFBox returns characters above \uFB50. I don't understand why PDFBox does not return the standard characters that are common on the web.

Is there any way to tell PDFBox what I want returned and get the correct result? (I mean changing something like the glyph list in PDFBox to get the result I want.) Does anybody know about this?

Thanks in advance,
Hamid
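
One possible workaround for the mismatch described above, offered as a minimal sketch and not as a PDFBox setting: the characters above U+FB50 belong to the Arabic Presentation Forms blocks, and Unicode compatibility normalization (NFKC) maps them back to the base letters in the U+0600 range. The Java sketch below assumes the extracted text is already available as a String and applies java.text.Normalizer before the text is passed to the indexer. The class and method names are hypothetical, and where to hook this into the extraction or indexing pipeline is left open.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of PDFBox, Lucene, or Jackrabbit.
class PresentationFormNormalizer {

    // True if the code point lies in Arabic Presentation Forms-A (U+FB50..U+FDFF)
    // or Arabic Presentation Forms-B (U+FE70..U+FEFF).
    static boolean isArabicPresentationForm(int codePoint) {
        return (codePoint >= 0xFB50 && codePoint <= 0xFDFF)
            || (codePoint >= 0xFE70 && codePoint <= 0xFEFF);
    }

    // Diagnostic: reports whether extracted text contains presentation forms.
    static boolean containsPresentationForms(String text) {
        return text.codePoints()
                   .anyMatch(PresentationFormNormalizer::isArabicPresentationForm);
    }

    // NFKC replaces each presentation form with its compatibility
    // decomposition, i.e. the base letter(s) from the standard Arabic block.
    static String normalize(String extracted) {
        return Normalizer.normalize(extracted, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Sample input (an assumption for illustration): PEH isolated form
        // followed by ALEF isolated form, as an extractor might return them.
        String extracted = "\uFB56\uFE8D";
        String normalized = normalize(extracted);

        System.out.println(containsPresentationForms(extracted));
        System.out.println(containsPresentationForms(normalized));
    }
}
```

Note that NFKC also folds other compatibility characters (ligatures, full-width forms and so on), so the same normalization would need to be applied to the query text as well for searches to match. It does not address any other extraction issue, such as character order.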
