Hi,

On Thu, Jun 23, 2011 at 11:59 AM, Denis Voloshin <[email protected]> wrote:
> Thanks for relaying, the problem was indeed around the way I was converting
> the extracted data as string to byte array.
> Btw, is there a way in the Tika API to obtain extracted data as an InputStream
> and not only as a string from a ContentHandler object?
Not as an InputStream (extracted text is character data, so producing bytes
would require choosing an encoding), but you can use the parseToString() and
parse() methods of the org.apache.tika.Tika facade to get a String or a
java.io.Reader for reading the extracted text.

Alternatively, if you want to output the extracted text to a Writer or an
OutputStream, you can use the WriteOutContentHandler class for that. To
explicitly specify the output encoding you want, use a
java.io.OutputStreamWriter wrapper around your output stream.

BR,

Jukka Zitting
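
For reference, a minimal sketch of the encoding step described above, using only
the JDK so it runs without Tika on the classpath. The class name
ExtractedTextEncoding, the helper methods encode() and asInputStream(), and the
sample string are all hypothetical; in real use the String would come from the
Tika facade, or the OutputStreamWriter would be handed to a content handler that
writes to a Writer.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika.
class ExtractedTextEncoding {

    // Writes the text through an OutputStreamWriter so that the byte
    // encoding is chosen explicitly instead of using the platform default.
    static byte[] encode(String text, Charset charset) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } // closing the writer flushes any buffered characters into 'out'
        return out.toByteArray();
    }

    // Exposes already-extracted text as an InputStream in a known encoding.
    static InputStream asInputStream(String text, Charset charset) throws IOException {
        return new ByteArrayInputStream(encode(text, charset));
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for text obtained from a parser (assumption for illustration).
        String extracted = "Gr\u00fc\u00dfe from the extracted document";

        byte[] utf8 = encode(extracted, StandardCharsets.UTF_8);
        byte[] latin1 = encode(extracted, StandardCharsets.ISO_8859_1);

        // The same characters produce different byte sequences per encoding.
        System.out.println("UTF-8 bytes: " + utf8.length);
        System.out.println("ISO-8859-1 bytes: " + latin1.length);
    }
}
```

Whoever reads the resulting stream has to decode it with the same charset that
was passed to encode(); a mismatch between the two is a common cause of
corrupted characters when converting a String to a byte array.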
