Thanks for replying. The problem was indeed in the way I was 
converting the extracted data from a string to a byte array.
By the way, is there a way in the Tika API to obtain the extracted data as 
an InputStream, and not only as a string from the ContentHandler object?

Thanks 

 


Best Regards. 

Denis Voloshin 
Software engineer 
Phone: +972-2-649-1162 
Mobile: +972-54-642-2269 


 



From:   Jukka Zitting <[email protected]>
To:     [email protected]
Date:   06/23/2011 01:30 AM
Subject:        Re: non-West European languages support



Hi,

On Wed, Jun 22, 2011 at 1:37 PM, Denis Voloshin <[email protected]> wrote:
> I'd like to verify whether Tika doesn't support non-West European languages
> or I'm just missing something in my client code.

Tika uses Unicode internally and should be able to handle pretty much
all languages in the world with few problems.

The output example you attached (with plenty of "?" characters)
suggests that your default output encoding (see [1]) is not able to
represent all these characters and simply falls back to the default
"?" replacement character.

[1] http://java.sun.com/javase/technologies/core/basic/intl/faq.jsp#default-encoding
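
To illustrate the "?" fallback described above, here is a minimal sketch that
uses only the JDK and no Tika classes. The class name EncodingDemo and the
helper roundTrip are hypothetical names chosen for this example, and the
sample text is an arbitrary Cyrillic word. It assumes the symptom comes from
encoding a String with a charset that cannot represent its characters; passing
an explicit charset such as UTF-8 avoids depending on the platform default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {

    // Hypothetical helper: encode the text with the given charset and decode
    // it back, to show which characters survive the conversion.
    static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Cyrillic sample written with Unicode escapes, so the result does
        // not depend on the encoding of the source file.
        String text = "\u041f\u0440\u0438\u0432\u0435\u0442";

        // US-ASCII cannot represent Cyrillic letters, so every character is
        // replaced by the charset's replacement byte '?'.
        System.out.println(roundTrip(text, StandardCharsets.US_ASCII)); // prints "??????"

        // UTF-8 can represent every Unicode character, so nothing is lost.
        System.out.println(text.equals(roundTrip(text, StandardCharsets.UTF_8))); // prints "true"
    }
}
```

The same reasoning applies to any String-to-byte[] conversion in client code:
calling getBytes() without an argument uses the platform default encoding,
which may not cover the characters in the extracted text.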


BR,

Jukka Zitting
