Jukka Zitting wrote:
Hi,
On Fri, Aug 7, 2009 at 2:53 PM, Michael Wechner <michael.wech...@wyona.com> wrote:
I did some more debugging and it really seems to me that the
XHTMLContentHandler does not add meta content to the head of
the XHTML. Hence, when using the WriteOutContentHandler, one does not
receive this meta content and has to retrieve it separately
in order to build a "full text" index.
Is this a feature or a bug, or am I misunderstanding something?
It's a feature. The title is included in the <head/> section just to
make the resulting XHTML validate
OK, thanks for pointing this out. I think it would be good to add a note
about this somewhere in the code. Or does one already exist and I
just missed it?
and so far we haven't had people
needing more metadata in there. The Parser interface is designed to
return document metadata in the Metadata object and the structured
text content through the given ContentHandler. Is there a good use
case for why the metadata should be exposed also in the <head/>
section of the XHTML stream?
As I mention below, when using the WriteOutContentHandler one wouldn't
have to extract the metadata explicitly.
Also, it seems to me that the WriteOutContentHandler concatenates title and
body, which means the last word of the title and the first word of the body
are "merged" and hence are probably not indexed correctly at some later
stage.
I would suggest using BodyContentHandler instead of
WriteOutContentHandler. You can use it just like
WriteOutContentHandler, but it only outputs the contents of the
<body/> section. See the --text option in TikaCLI or the ParsingReader
class for good examples.
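
To illustrate the difference discussed above, here is a minimal sketch that uses only the JDK's SAX classes, not Tika itself. The class name BodyOnlySketch and its two handlers are hypothetical stand-ins: AllText appends every character event (the behaviour that merges title and body), while BodyText only collects text inside <body/>, which is the idea behind BodyContentHandler. It is not Tika's actual implementation.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration; plain JDK SAX, no Tika classes involved.
class BodyOnlySketch {

    /** Appends every character event, regardless of the enclosing element. */
    static class AllText extends DefaultHandler {
        final StringBuilder out = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            out.append(ch, start, length);
        }
    }

    /** Appends character events only while inside the body element. */
    static class BodyText extends DefaultHandler {
        final StringBuilder out = new StringBuilder();
        private int depth = 0; // greater than 0 while inside <body>

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            if (depth > 0 || "body".equals(qName)) {
                depth++;
            }
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if (depth > 0) {
                depth--;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (depth > 0) {
                out.append(ch, start, length);
            }
        }
    }

    private static void parse(String xhtml, DefaultHandler handler) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XHTML", e);
        }
    }

    static String allText(String xhtml) {
        AllText handler = new AllText();
        parse(xhtml, handler);
        return handler.out.toString();
    }

    static String bodyText(String xhtml) {
        BodyText handler = new BodyText();
        parse(xhtml, handler);
        return handler.out.toString();
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><title>Annual Report</title></head>"
                + "<body><p>Revenue grew.</p></body></html>";
        System.out.println(allText(xhtml));  // prints "Annual ReportRevenue grew."
        System.out.println(bodyText(xhtml)); // prints "Revenue grew."
    }
}
```

In the first output the title and the first body word run together as "ReportRevenue", which is the indexing problem described above. The body-only handler avoids it, at the cost of dropping the title.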
Yes, I have seen the BodyContentHandler, but it means I have to
explicitly concatenate the title (and the other metadata) myself. That is
not much effort, but as I said, I think it defeats the purpose of the
WriteOutContentHandler ;-)
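
For what it's worth, the explicit concatenation could look roughly like the sketch below. It is plain Java with no Tika dependency: the Map stands in for the values one would read from Tika's Metadata object, and the class and method names (FullTextSketch, fullText) are made up for illustration. Joining with a newline keeps the last word of the title and the first word of the body as separate tokens.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper; the Map is a stand-in for values read from a Metadata object.
class FullTextSketch {

    /** Joins non-empty metadata values and the body text, one per line. */
    static String fullText(Map<String, String> metadata, String body) {
        StringJoiner joined = new StringJoiner("\n");
        for (String value : metadata.values()) {
            if (value != null && !value.trim().isEmpty()) {
                joined.add(value.trim());
            }
        }
        joined.add(body);
        return joined.toString();
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("title", "Annual Report");
        metadata.put("author", "Jane Doe");
        // Prints the title, the author and the body on three separate lines.
        System.out.println(fullText(metadata, "Revenue grew."));
    }
}
```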
Thanks for your explanations
Michael
BR,
Jukka Zitting