Thanks. My customer already uses Stellent filters, which provide a "high fidelity" output mode; that output is then stored on the (FAST) document for preview. ISYS also has this feature for some formats.
Would it make sense to add framework support for extracting a HiFi version wherever the underlying extractor supports it?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 1 Oct. 2010, at 14:11, Jukka Zitting wrote:

> Hi,
>
> On Fri, Oct 1, 2010 at 12:19 PM, Jan Høydahl / Cominvent
> <[email protected]> wrote:
>> My question is whether Tika HTML has a "wysiwyg html output" mode, where the
>> focus
>> is to produce good looking html?
>
> Not really. The main purpose of the XHTML output from a parser is to
> produce a semantically meaningful representation of the input
> document, or at least to capture the plain text content of the
> document. Sometimes this goal is at odds with the often complex
> requirements of a visual rendering of a document.
>
> For now you'll probably get best results by storing the original
> document along with the text extracted by Tika, and using some other
> tool to produce the viewable HTML rendering.
>
> That said, there's quite a bit we can do to improve the readability of
> the XHTML output even within the bounds of semantically structured
> text. Ideally I'd like Tika's XHTML output to work a bit like
> http://lab.arc90.com/experiments/readability/. In other words, our
> goal should not be pixel-perfect rendition of the input document.
>
> BR,
>
> Jukka Zitting
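
To make the idea in this thread concrete, here is a rough sketch of how "HiFi where the extractor supports it" could sit next to the "store the original plus the extracted text" approach. It is only an illustration: every name in it (`HighFidelityExtractor`, `StoredDocument`, `ingest`, the two toy extractors) is hypothetical and is not part of the Tika, Stellent, ISYS or FAST APIs.

```python
import html
from dataclasses import dataclass
from typing import Optional, Protocol, runtime_checkable


@runtime_checkable
class HighFidelityExtractor(Protocol):
    """Hypothetical optional capability: an extractor that can also render a preview."""

    def extract_hifi_html(self, data: bytes) -> str: ...


@dataclass
class StoredDocument:
    original: bytes                     # always kept, so another tool can render it later
    text: str                           # plain text for indexing
    preview_html: Optional[str] = None  # only set when the extractor offers a HiFi mode


def ingest(data: bytes, extractor) -> StoredDocument:
    """Extract text, and add a preview only if the extractor supports one."""
    text = extractor.extract_text(data)
    preview = None
    if isinstance(extractor, HighFidelityExtractor):
        preview = extractor.extract_hifi_html(data)
    return StoredDocument(original=data, text=text, preview_html=preview)


class PlainExtractor:
    """Toy extractor (assumption): treats the input as UTF-8 text."""

    def extract_text(self, data: bytes) -> str:
        return data.decode("utf-8")


class PreviewExtractor(PlainExtractor):
    """Toy extractor (assumption) that also offers a trivial 'HiFi' rendering."""

    def extract_hifi_html(self, data: bytes) -> str:
        return "<pre>" + html.escape(data.decode("utf-8")) + "</pre>"
```

With this shape, callers never need to know which backend is in use: `ingest(data, PlainExtractor())` yields a document whose `preview_html` is `None`, while `ingest(data, PreviewExtractor())` fills it in. In both cases the original bytes are retained, which matches the fallback of rendering with a separate tool.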
