BodyContentHandler and encoding

Kaspar Fischer Mon, 15 Feb 2010 07:34:49 -0800

Dear list,

I am trying to convert documents to plaintext using Tika. For this, I do:


  StringWriter writer = new StringWriter();
  ContentHandler handler = new BodyContentHandler(writer);
  Parser parser = new AutoDetectParser();
  Metadata metadata = new Metadata();
  parser.parse(dataStream, handler, metadata, new ParseContext());
  String plaintext = writer.toString();

This works like a charm and I also get some metadata (like 
metadata.get("Content-Type")) for free.

The only issue I have run into is the encoding of documents. If I pass in a 
file encoded in, for example, cp437, the plaintext still seems to be in that 
encoding rather than in UTF-8. I am therefore wondering where and how the 
encoding detection must be done.

The "how" in this question is probably answered like this (please correct me if 
I am wrong!):

  CharsetDetector detector = new CharsetDetector();
  detector.setText(new BufferedInputStream(stream));
  String encoding = detector.detect().getName();

but I don't know how to tell the parser/handler about the encoding.

I have tried adding the encoding to the input metadata via

  metadata.add(Metadata.CONTENT_ENCODING, encoding);

but without success.
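
For plain-text input, one possible workaround is to decode the bytes 
directly once the charset name is known, bypassing the parser for that step. 
The sketch below uses only the JDK. The class name EncodingSketch and the 
method toUtf8 are hypothetical, and the charset name is assumed to come from 
a detector such as the one above. It does not show how Tika itself consumes 
an encoding hint, and it does not apply to binary formats.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika.
class EncodingSketch {

    // Decode the raw bytes with the detected charset, then re-encode as UTF-8.
    // Charset.forName throws if the name is malformed or not supported by the JVM.
    static byte[] toUtf8(byte[] raw, String detectedCharsetName) {
        Charset source = Charset.forName(detectedCharsetName);
        String text = new String(raw, source);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // A single byte 0xE9 is "é" in ISO-8859-1; in UTF-8 it takes two bytes.
        byte[] latin1 = {(byte) 0xE9};
        byte[] utf8 = toUtf8(latin1, "ISO-8859-1");
        System.out.println(utf8.length);
    }
}
```

A name such as "IBM437" (cp437) could be passed the same way, provided the 
JVM in use ships that charset; Charset.isSupported can check this beforehand.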

Many thanks for any feedback,
Kaspar
