Re: XHTML Bean and corresponding content handler

Michael Wechner Thu, 06 Aug 2009 16:02:45 -0700

Jukka Zitting wrote:

Hi,


On Tue, Aug 4, 2009 at 9:30 AM, Michael
Wechner <michael.wech...@wyona.com> wrote:

String XHTMLBean.getHead().getMeta(XHTMLBean.DESCRIPTION)
String XHTMLBean.getHead().getTitle()


These you can get from the Metadata object.

OK, I think I finally understood this, although I find it a bit "confusing" that one seems to set /html/head/title with


metadata.set(metadata.TITLE, "some title");

and /html/head/meta with, for example,

metadata.set(metadata.KEYWORDS, "some keywords");

The title does seem to be added when using startDocument(), but, for example, <meta name="keywords" content="..."/> does not seem to be added.


Maybe I am still misunderstanding something, though.

String[] XHTMLBean.getBody().getParagraphs();


This is a bit troublesome as not all parsers produce paragraphs of
content. For example the Excel parser produces XHTML tables.

ok

You can either get just the plain character stream using tools like
BodyContentHandler, or the full XHTML output as SAX events (which you
can serialize to a byte stream if you want). I'm not sure if there's
any reasonable intermediate content abstraction.
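
To illustrate the second option (SAX events serialized to a byte stream), here is a minimal sketch that uses only the JDK's javax.xml.transform classes. SaxToBytes, serializerTo, emitSample and sample are hypothetical names, not Tika classes, and the events are emitted by hand as a stand-in for whatever a parser would produce.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical helper for illustration only; not part of Tika.
class SaxToBytes {

    // Returns a ContentHandler that writes every SAX event it receives to 'out' as XML.
    static TransformerHandler serializerTo(OutputStream out) throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler(); // identity transform
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
        handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        handler.setResult(new StreamResult(out));
        return handler;
    }

    // Stand-in for a parser: emits a tiny document as SAX events by hand.
    static void emitSample(ContentHandler h) throws SAXException {
        AttributesImpl none = new AttributesImpl();
        char[] text = "Hello".toCharArray();
        h.startDocument();
        h.startElement("", "html", "html", none);
        h.startElement("", "body", "body", none);
        h.startElement("", "p", "p", none);
        h.characters(text, 0, text.length);
        h.endElement("", "p", "p");
        h.endElement("", "body", "body");
        h.endElement("", "html", "html");
        h.endDocument();
    }

    // Serializes the sample events and returns the bytes decoded as UTF-8 text.
    static String sample() {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            emitSample(serializerTo(out));
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (Exception e) {
            throw new IllegalStateException("serialization failed", e);
        }
    }
}
```

In real use, the handler returned by serializerTo would presumably be passed to the parser in place of the hand-written emitSample call.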

The reason I am looking for this is that various search engines seem to build the result excerpt from the following sources, in this order:


- <meta name="description" ...
- first paragraph within body tag
- ???
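
The first two entries of that order could be collected with a small SAX content handler, roughly as sketched below. ExcerptHandler, excerpt and excerptOf are hypothetical names and not part of Tika; the sketch uses only the JDK's SAX classes and assumes well-formed XHTML input without a DOCTYPE.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration only; not part of Tika.
class ExcerptHandler extends DefaultHandler {
    private String description;   // value of <meta name="description" content="..."/>
    private final StringBuilder firstParagraph = new StringBuilder();
    private boolean inBody;
    private boolean inParagraph;
    private boolean paragraphDone;

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        String name = local.isEmpty() ? qName : local;
        if ("meta".equals(name) && "description".equalsIgnoreCase(atts.getValue("name"))) {
            description = atts.getValue("content");
        } else if ("body".equals(name)) {
            inBody = true;
        } else if ("p".equals(name) && inBody && !paragraphDone) {
            inParagraph = true;
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        String name = local.isEmpty() ? qName : local;
        if ("p".equals(name) && inParagraph) {
            inParagraph = false;
            if (firstParagraph.toString().trim().isEmpty()) {
                firstParagraph.setLength(0);   // blank paragraph: keep looking
            } else {
                paragraphDone = true;          // ignore all later paragraphs
            }
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inParagraph) {
            firstParagraph.append(ch, start, length);
        }
    }

    // Meta description if present, else the first non-blank body paragraph, else null.
    String excerpt() {
        if (description != null && !description.trim().isEmpty()) {
            return description.trim();
        }
        String p = firstParagraph.toString().trim();
        return p.isEmpty() ? null : p;
    }

    // Convenience wrapper: parses an XHTML string and returns its excerpt.
    static String excerptOf(String xhtml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            ExcerptHandler handler = new ExcerptHandler();
            factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.excerpt();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XHTML", e);
        }
    }
}
```

For output that contains only tables, as mentioned for the Excel parser, excerpt() returns null, so a caller would still need some further fallback for the "???" case.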

Thanks

Michael

BR,

Jukka Zitting
