Hi,

On Fri, Aug 7, 2009 at 2:53 PM, Michael Wechner<michael.wech...@wyona.com> wrote:
> I did some more debugging and it really seems to me that the
> XHTMLContentHandler does not add meta content to the head of
> the XHTML and hence when using the WriteOutContentHandler one does not
> receive this meta content, but one has to make
> sure to retrieve the meta content separately in order to make a "full text"
> index.
>
> Is this a feature or a bug or do I misunderstand something?

It's a feature. The title is included in the <head/> section just to
make the resulting XHTML validate and so far we haven't had people
needing more metadata in there.

The Parser interface is designed to return document metadata in the
Metadata object and the structured text content through the given
ContentHandler. Is there a good use case for why the metadata should
be exposed also in the <head/> section of the XHTML stream?

> Also it seems to me that the WriteOutContentHandler concatenates title and
> body which means the last word of the title and the first word of the body
> are "merged" and hence are probably not indexed correctly at some later
> stage.

I would suggest using BodyContentHandler instead of
WriteOutContentHandler. You can use it just like
WriteOutContentHandler, but it only outputs the contents of the
<body/> section. See the --text option in TikaCLI or the ParsingReader
class for good examples.

BR,

Jukka Zitting
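
P.S. To make the difference concrete, here is a minimal sketch that uses only the JDK's SAX classes, not Tika itself. The names BodyTextDemo, AllTextHandler and BodyOnlyHandler and the sample document are made up for illustration, and this is not how Tika's handlers are actually implemented. It only shows why writing out every character event glues the title to the body, and how restricting output to <body/> avoids that.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Illustration only: hypothetical handlers built on plain JDK SAX, not Tika classes.
class BodyTextDemo {
    // Hypothetical XHTML-like event source: a title in <head/>, text in <body/>.
    static final String SAMPLE =
        "<html><head><title>Annual Report</title></head>"
        + "<body><p>Revenue grew.</p></body></html>";

    /** Appends every character event, wherever it occurs in the document. */
    static class AllTextHandler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Appends character events only while inside the body element. */
    static class BodyOnlyHandler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();
        private int bodyDepth = 0; // 0 means we are outside <body/>

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (bodyDepth > 0 || "body".equals(qName)) {
                bodyDepth++;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (bodyDepth > 0) {
                bodyDepth--;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (bodyDepth > 0) {
                text.append(ch, start, length);
            }
        }
    }

    // Runs the JDK SAX parser over the string and feeds events to the handler.
    private static void run(String xhtml, DefaultHandler handler) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    static String allText(String xhtml) {
        AllTextHandler handler = new AllTextHandler();
        run(xhtml, handler);
        return handler.text.toString();
    }

    static String bodyText(String xhtml) {
        BodyOnlyHandler handler = new BodyOnlyHandler();
        run(xhtml, handler);
        return handler.text.toString();
    }

    public static void main(String[] args) {
        System.out.println(allText(SAMPLE));  // title and body run together
        System.out.println(bodyText(SAMPLE)); // body text only
    }
}
```

With the sample above, the first handler yields "Annual ReportRevenue grew.", where "Report" and "Revenue" are merged into one token. The second yields only "Revenue grew.". In Tika the analogous setup would be to pass a BodyContentHandler to the parser and read the title from the Metadata object.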