Michael Wechner wrote:
I would suggest using BodyContentHandler instead of
WriteOutContentHandler. You can use it just like
WriteOutContentHandler, but it only outputs the contents of the
<body/> section. See the --text option in TikaCLI or the ParsingReader
class for good examples.
Yes, I have seen the BodyContentHandler, but it means I have to
explicitly concatenate the title (and the other metadata), which is
not that much effort,
I am now using the BodyContentHandler and aggregating the rest of the
metadata (title, keywords, description, etc.), and it works well.
But as pointed out below, I think the WriteOutContentHandler is
misleading, and its behaviour should either be changed or the class
deprecated (with a note that one should use the BodyContentHandler instead).
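For anyone following along, the aggregation step described above could look roughly like the sketch below. It uses only the JDK so it runs on its own. The class name IndexTextBuilder, the aggregate method, and the metadata keys "title", "keywords" and "description" are made up for illustration and are not part of Tika. In a real setup the body text would presumably come from BodyContentHandler.toString() after parsing, and the values from Metadata.get(name).

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class IndexTextBuilder {

    /**
     * Concatenates the selected metadata values (in the order given) and the
     * body text into one string, one item per line. Missing or blank metadata
     * values are skipped. Hypothetical helper, not a Tika API.
     */
    public static String aggregate(String bodyText, Map<String, String> metadata, String... keys) {
        StringBuilder out = new StringBuilder();
        for (String key : keys) {
            String value = metadata.get(key);
            if (value != null && !value.trim().isEmpty()) {
                out.append(value.trim()).append('\n');
            }
        }
        if (bodyText != null) {
            out.append(bodyText.trim());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumption: with Tika these values would be read from the Metadata
        // object and from BodyContentHandler.toString(); hard-coded here.
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("title", "Hello");
        metadata.put("keywords", "tika, parsing");

        String text = aggregate("Body text.", metadata, "title", "keywords", "description");
        System.out.println(text);
    }
}
```

The "description" key is absent from the map in this example, so it is simply skipped rather than producing an empty line.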
Cheers
Michael
but as said I think it defeats the purpose of the
WriteOutContentHandler ;-)
Thanks for your explanations
Michael
BR,
Jukka Zitting