Hi,

On 10/10/07, Sami Siren <[EMAIL PROTECTED]> wrote:
> Does this mean Tika users need to implement "parser" (ContentHandler)
> that can handle events fired by Tika Parser. One for each format? Or do
> we plan to normalize events somehow?

The main rationale for outputting XML is to be able to express things
like "this is a heading", "this is a link", etc. so that for example a
search engine can put more weight on those parts of the content.

My preference would be to use XHTML Basic as the XML format that the
parsers will output. XHTML is widely known and supported, and is more
than expressive enough for our needs.

> Or is Tika going to provide those handlers for simple tasks like
> extracting title + content.

I would at least have utility adapters that convert the SAX events to a
character stream and further to a single string.

BR,

Jukka Zitting
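
P.S. To make the utility adapter idea above more concrete, here is a minimal sketch of a handler that forwards SAX character events to a Writer, plus a helper that collects the result into a single string. The names TextWriterHandler, SaxTextDemo and toText are hypothetical and not part of any Tika API; the stock JDK SAX parser stands in for a parser that emits XHTML events.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical adapter (not a Tika class): writes SAX character
// events to a character stream and ignores all element events.
class TextWriterHandler extends DefaultHandler {
    private final Writer out;

    TextWriterHandler(Writer out) {
        this.out = out;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        try {
            out.write(ch, start, length);
        } catch (IOException e) {
            // SAX callbacks may only throw SAXException, so wrap the I/O failure.
            throw new SAXException(e);
        }
    }
}

class SaxTextDemo {
    // Assumption: the JDK SAX parser plays the role of a parser that
    // fires XHTML events. The text is collected in a StringWriter.
    static String toText(String xhtml) throws Exception {
        StringWriter buffer = new StringWriter();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), new TextWriterHandler(buffer));
        return buffer.toString();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><h1>Title</h1><p>Some text.</p></body></html>";
        System.out.println(toText(xhtml));
    }
}
```

This sketch drops element boundaries entirely, so text from adjacent elements runs together. A fuller adapter would probably override endElement to emit a separator after block-level elements such as h1 and p.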
