Hi,

On Fri, Oct 1, 2010 at 12:19 PM, Jan Høydahl / Cominvent
<[email protected]> wrote:
> My question is whether Tika HTML has a "wysiwyg html output" mode, where the
> focus is to produce good looking html?

Not really. The main purpose of the XHTML output from a parser is to
produce a semantically meaningful representation of the input
document, or at least to capture the plain text content of the
document. Sometimes this goal is at odds with the often complex
requirements of a visual rendering of a document.
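
To make "semantically meaningful" a bit more concrete, here is a minimal
sketch that uses only the JDK's SAX and TrAX classes. The emitDocument()
method is a hypothetical stand-in for a parser (it is not Tika code): it
reports a heading and a paragraph as structural SAX events with no fonts,
positions or other layout details. The class and helper names are made up
for illustration, and the XHTML namespace is left out to keep things short.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

class SemanticXhtmlSketch {

    // Hypothetical stand-in for a parser: it emits structure only
    // (a heading and a paragraph), never visual layout.
    static void emitDocument(ContentHandler handler) throws SAXException {
        handler.startDocument();
        start(handler, "html");
        start(handler, "body");
        element(handler, "h1", "Title");
        element(handler, "p", "First paragraph.");
        end(handler, "body");
        end(handler, "html");
        handler.endDocument();
    }

    // Assumption: no namespace is used here, to keep the sketch short.
    static void start(ContentHandler handler, String name) throws SAXException {
        handler.startElement("", name, name, new AttributesImpl());
    }

    static void end(ContentHandler handler, String name) throws SAXException {
        handler.endElement("", name, name);
    }

    static void element(ContentHandler handler, String name, String text)
            throws SAXException {
        start(handler, name);
        char[] chars = text.toCharArray();
        handler.characters(chars, 0, chars.length);
        end(handler, name);
    }

    // Serializes the SAX events to a string with an identity transformer.
    static String toXhtml() throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler serializer = factory.newTransformerHandler();
        serializer.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
        serializer.getTransformer()
                .setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        serializer.setResult(new StreamResult(out));
        emitDocument(serializer);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toXhtml());
    }
}
```

The serialized result carries only the heading and paragraph structure, so
any visual styling would have to be added by whatever consumes the output.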

For now you'll probably get the best results by storing the original
document along with the text extracted by Tika, and using some other
tool to produce the viewable HTML rendering.

That said, there's quite a bit we can do to improve the readability of
the XHTML output even within the bounds of semantically structured
text. Ideally I'd like Tika's XHTML output to work a bit like
http://lab.arc90.com/experiments/readability/. In other words, our
goal should not be a pixel-perfect rendition of the input document.

BR,

Jukka Zitting
