RE: Lucene Analyzerr

Hamid Reza Sahlolbey Sat, 16 Feb 2008 21:35:43 -0800

-----Original Message-----
From: Jukka Zitting [mailto:[EMAIL PROTECTED] 
Sent: 2008/02/17 01:38 AM
To: [email protected]
Subject: Re: Lucene Analyzerr


Hi,

2008/2/15 Hamid Reza Sahlolbey <[EMAIL PROTECTED]>:
> First I used StandardAnalyzer, but when I looked at the workspace index
> files I realized that it doesn't index Persian text, so I changed to
> SimpleAnalyzer. Now it seems to index Persian text correctly, but searches
> don't find it (note that the query is the same for MS Word and PDF files).

Could there be some character encoding confusion somewhere? You may
want to check that the Unicode character stream produced by the text
extractor looks valid.

BR,

Jukka Zitting

Hi Jukka,
Yes, you are right. Last night I found that pdfbox extracts my text
incorrectly. We could not see the problem at first because there are two sets
of Persian characters in the Unicode character map. The text should use code
points below \u06dc (what browsers understand as UTF-8 Persian characters),
but pdfbox extracts characters above \uFB50. I don't understand why pdfbox
does not return the standard characters that are common on the web. Is there
any way to define what pdfbox should return so that I get the correct result
(I mean changing something like the glyphlist in pdfbox)? Does anybody know
about this?

Thanks in advance,
Hamid
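
One possible workaround, independent of pdfbox itself, is to fold the
extracted text back to the base Arabic block before it reaches the analyzer.
The sketch below is only an illustration under assumptions: the class and
method names (PresentationFormFolder, fold, containsPresentationForms) are
hypothetical and not part of pdfbox or Lucene, and it assumes the extracted
characters come from the Arabic Presentation Forms blocks (U+FB50..U+FDFF and
U+FE70..U+FEFC). It uses java.text.Normalizer (Java 6 and later) with NFKC,
which maps those compatibility forms to their base letters.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of pdfbox or Lucene.
class PresentationFormFolder {

    /** True if the code point is in Arabic Presentation Forms-A or -B. */
    static boolean isArabicPresentationForm(int cp) {
        // Upper bound U+FEFC excludes U+FEFF (byte order mark).
        return (cp >= 0xFB50 && cp <= 0xFDFF) || (cp >= 0xFE70 && cp <= 0xFEFC);
    }

    /** Checks whether extracted text contains any presentation-form characters. */
    static boolean containsPresentationForms(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isArabicPresentationForm(cp)) {
                return true;
            }
            i += Character.charCount(cp);
        }
        return false;
    }

    /** Folds compatibility characters (including presentation forms) to base letters. */
    static String fold(String extracted) {
        return Normalizer.normalize(extracted, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // U+FB56 = PEH isolated form, U+FE8E = ALEF final form (sample input only).
        String extracted = "\uFB56\uFE8E";
        String folded = fold(extracted);
        System.out.println("had presentation forms: " + containsPresentationForms(extracted));
        for (int i = 0; i < folded.length(); i++) {
            System.out.printf("U+%04X%n", (int) folded.charAt(i));
        }
    }
}
```

Note that NFKC also rewrites other compatibility characters (Latin ligatures,
full-width forms, and so on), and it does not repair text that was extracted
in visual rather than logical order, so this may or may not be enough for a
given PDF.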
