Dear list,

I am trying to convert documents to plaintext using Tika. For this, I do:

    StringWriter writer = new StringWriter();
    ContentHandler handler = new BodyContentHandler(writer);
    Parser parser = new AutoDetectParser();
    Metadata metadata = new Metadata();
    parser.parse(dataStream, handler, metadata, new ParseContext());
    String plaintext = writer.toString();

This works like a charm, and I also get some metadata (like metadata.get("Content-Type")) for free.

The only issue I have run into is the encoding of documents. If I pass in a file in, for example, cp437 encoding, the plaintext seems to be in this encoding, too; it is not in UTF-8 encoding. I am therefore wondering where and how the encoding detection must be done.

The "how" in this question is probably answered like this (please correct me if I am wrong!):

    CharsetDetector detector = new CharsetDetector();
    detector.setText(new BufferedInputStream(stream));
    String encoding = detector.detect().getName();

but I don't know how to tell the parser/handler about the encoding. I have tried adding the encoding to the input metadata via

    metadata.add(Metadata.CONTENT_ENCODING, encoding);

but without success.

Many thanks for any feedback,
Kaspar
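
For reference, here is a minimal sketch of the general principle behind the question, using only the Java standard library and not Tika. A Java String has no byte encoding of its own. The charset matters in two places: when the input bytes are decoded into characters, and when the resulting text is written back out as bytes. The class and method names (EncodingSketch, decode, toUtf8) are hypothetical and chosen for illustration. The charset name passed in stands in for whatever a detector reports. The demo uses "ISO-8859-1" because every JVM is required to support it; cp437 is usually available as "IBM437", but that depends on the runtime, so treat it as an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika: shows where a charset is applied.
final class EncodingSketch {

    // Decode raw bytes into a String using the named source charset
    // (for example, the name reported by a charset detector).
    static String decode(InputStream in, String charsetName) throws IOException {
        Charset charset = Charset.forName(charsetName);
        StringWriter out = new StringWriter();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
        return out.toString();
    }

    // Encode already-decoded text as UTF-8 bytes, e.g. before writing a file.
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // "café" in ISO-8859-1: the accented letter is the single byte 0xE9.
        byte[] latin1 = {'c', 'a', 'f', (byte) 0xE9};
        String text = decode(new ByteArrayInputStream(latin1), "ISO-8859-1");
        byte[] utf8 = toUtf8(text);
        // In UTF-8 the accented letter takes two bytes (0xC3 0xA9).
        System.out.println(text.length() + " chars, " + utf8.length + " UTF-8 bytes");
    }
}
```

If the text looks wrong after extraction, one thing worth checking under this model is whether the bytes were decoded with the wrong charset on the way in, or whether the String was written out (to a file or console) with a platform default charset rather than UTF-8.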