Thanks for replying. The problem was indeed in the way I was converting the extracted data from a string to a byte array. Btw, is there a way in the Tika API to obtain the extracted data as an InputStream, and not only as a string from the ContentHandler object?
Thanks

Best Regards.

Denis Voloshin
Software engineer
Phone: +972-2-649-1162
Mobile: +972-54-642-2269

From: Jukka Zitting <[email protected]>
To: [email protected]
Date: 06/23/2011 01:30 AM
Subject: Re: non-West European languages support

Hi,

On Wed, Jun 22, 2011 at 1:37 PM, Denis Voloshin <[email protected]> wrote:
> I'd like to verify either Tika doesn't support non-West European languages
> or I'm just missing something in my client code.

Tika uses Unicode internally and should be able to handle pretty much all languages in the world with few problems. The output example you attached (with plenty of "?" characters) suggests that your default output encoding (see [1]) is not able to represent all these characters and simply falls back to the default "?" replacement character.

[1] http://java.sun.com/javase/technologies/core/basic/intl/faq.jsp#default-encoding

BR,

Jukka Zitting
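
For anyone who hits the same "?" output, here is a minimal sketch of the string-to-bytes step discussed above. It uses only the JDK, not Tika. The class name ExtractedTextEncoding, its methods, and the sample text are hypothetical; the sample string stands in for whatever text the ContentHandler returned. It assumes UTF-8 is an acceptable target encoding, and it uses US-ASCII only to reproduce the lossy behaviour of a charset that cannot represent the characters.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExtractedTextEncoding {

    // Encode with an explicit charset rather than the platform default,
    // so the result does not depend on the machine's locale settings.
    public static byte[] toUtf8Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Wrap the UTF-8 bytes so the text can be consumed as a stream.
    public static ByteArrayInputStream toUtf8Stream(String text) {
        return new ByteArrayInputStream(toUtf8Bytes(text));
    }

    // Encode and decode with the same charset. Characters the charset
    // cannot represent are replaced by its replacement byte ('?').
    public static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the extracted text (a Cyrillic word),
        // written with escapes so the source file encoding does not matter.
        String extracted = "\u041f\u0440\u0438\u0432\u0435\u0442";

        // Lossy: US-ASCII has no mapping for Cyrillic letters.
        String lossy = roundTrip(extracted, StandardCharsets.US_ASCII);
        System.out.println("ASCII round trip preserved text: " + lossy.equals(extracted));

        // Lossless: UTF-8 can represent every Unicode character.
        String safe = roundTrip(extracted, StandardCharsets.UTF_8);
        System.out.println("UTF-8 round trip preserved text: " + safe.equals(extracted));

        System.out.println("UTF-8 byte count: " + toUtf8Bytes(extracted).length);
    }
}
```

Calling getBytes() with no argument uses the platform default encoding, which is the setting the FAQ entry [1] describes. Passing the charset explicitly on both the encoding and the decoding side avoids depending on it.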
