Hi,

On Fri, Oct 1, 2010 at 12:19 PM, Jan Høydahl / Cominvent <[email protected]> wrote:
> My question is whether Tika HTML has a "wysiwyg html output" mode, where the
> focus is to produce good looking html?

Not really. The main purpose of the XHTML output from a parser is to produce a semantically meaningful representation of the input document, or at least to capture the plain text content of the document. Sometimes this goal is at odds with the often complex requirements of a visual rendering of a document.

For now you'll probably get the best results by storing the original document along with the text extracted by Tika, and using some other tool to produce the viewable HTML rendering.

That said, there's quite a bit we can do to improve the readability of the XHTML output even within the bounds of semantically structured text. Ideally I'd like Tika's XHTML output to work a bit like http://lab.arc90.com/experiments/readability/. In other words, our goal should not be a pixel-perfect rendition of the input document.

BR,

Jukka Zitting
