Re: XHTML Bean and corresponding content handler

Michael Wechner Sat, 08 Aug 2009 12:56:18 -0700

Jukka Zitting wrote:

Hi,


On Fri, Aug 7, 2009 at 2:53 PM, Michael Wechner <michael.wech...@wyona.com> wrote:

I did some more debugging and it really seems to me that the
XHTMLContentHandler does not add meta content to the head of
the XHTML and hence when using the WriteOutContentHandler one does not
receive this meta content, but one has to make
sure to retrieve the meta content separately in order to make a "full text"
index.

Is this a feature or a bug or do I misunderstand something?


It's a feature. The title is included in the <head/> section just to
make the resulting XHTML validate

OK, thanks for pointing this out. I think it would be good to add a note somewhere within the code about this, or does this already exist and I just missed it?

 and so far we haven't had people
needing more metadata in there. The Parser interface is designed to
return document metadata in the Metadata object and the structured
text content through the given ContentHandler. Is there a good use
case for why the metadata should be exposed also in the <head/>
section of the XHTML stream?

As I mentioned below, when using the WriteOutContentHandler one wouldn't have to extract the metadata explicitly.

Also it seems to me that the WriteOutContentHandler concatenates title and
body which means the last word of the title and the first word of the body
are "merged" and hence are probably not indexed correctly at some later
stage.
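
The merging described above can be reproduced without Tika. The following is a minimal sketch that assumes only the JDK's SAX classes; `WriteOutSketch` and `allText` are hypothetical names, and the handler is a stand-in that appends every character event, not Tika's actual WriteOutContentHandler.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in: appends every SAX character event to one buffer.
final class WriteOutSketch {

    /** Returns all character data of the document, in event order. */
    static String allText(String xhtml) {
        StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // No separator is written between elements, so text from
                // <title> and <body> ends up adjacent in the buffer.
                out.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
        return out.toString();
    }
}
```

For an input whose title is "Annual Report" and whose body starts with "Revenue", with no whitespace between the elements, the buffer contains "ReportRevenue" as a single token, which is the indexing problem in question.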


I would suggest using BodyContentHandler instead of
WriteOutContentHandler. You can use it just like
WriteOutContentHandler, but it only outputs the contents of the
<body/> section. See the --text option in TikaCLI or the ParsingReader
class for good examples.
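
To illustrate the idea of emitting only the <body/> contents, here is a minimal sketch built on the JDK's SAX classes alone. It is not Tika's BodyContentHandler implementation; `BodyOnlySketch` and `bodyText` are hypothetical names, and matching on the unprefixed element name "body" is an assumption about the input.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in: collects character events only while inside <body>.
final class BodyOnlySketch {

    /** Returns the character data found inside the <body> element. */
    static String bodyText(String xhtml) {
        StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private boolean inBody = false;

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if ("body".equals(qName)) {
                    inBody = true;
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("body".equals(qName)) {
                    inBody = false;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Text in <head>, including the title, is skipped.
                if (inBody) {
                    out.append(ch, start, length);
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
        return out.toString();
    }
}
```

With this approach the title never reaches the output, so it has to come from the separately returned metadata if it should be indexed.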

Yes, I have seen the BodyContentHandler, but it means I have to explicitly concatenate the title (and the other metadata), which is not that much effort, but as I said, I think it defeats the purpose of the WriteOutContentHandler ;-)
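
The explicit concatenation mentioned here could look roughly like the sketch below. It assumes the metadata values have already been copied into a plain `Map` (a stand-in for Tika's Metadata object); `FullTextSketch`, `fullText`, and the newline separator are hypothetical choices for illustration.

```java
import java.util.Map;

// Hypothetical helper: joins metadata values and body text for full-text indexing.
final class FullTextSketch {

    /**
     * Puts each non-empty metadata value on its own line, followed by the
     * body text, so that no two fields share a word boundary.
     */
    static String fullText(Map<String, String> metadata, String bodyText) {
        StringBuilder sb = new StringBuilder();
        for (String value : metadata.values()) {
            if (value != null && !value.isEmpty()) {
                sb.append(value).append('\n');
            }
        }
        return sb.append(bodyText).toString();
    }
}
```

The separator keeps the last word of one field from running into the first word of the next, which is the same concern raised for title and body.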


Thanks for your explanations

Michael

BR,

Jukka Zitting
