Hi,

On Fri, Sep 25, 2009 at 8:21 PM, Ken Krugler <kkrugler_li...@transpac.com> wrote:
> 1. The handler's startElement() never gets called with the <base> tag. I'm
> assuming this is because <base> isn't part of the SAFE_ELEMENTS set.
>
> But without the base tag, you can't correctly resolve relative URLs in
> anchor tags.
>
> Seems like <base> should be part of the SAFE_ELEMENTS set.

Instead of passing the <base> element to the client (and thus requiring
the client to keep track of it to correctly resolve local links), I'd
rather use it inside Tika to automatically turn local links into absolute
URLs in the parse output. Such a mechanism could also leverage out-of-band
information like a base URL given in the RESOURCE_NAME_KEY metadata entry.

> How was this set of tags derived?

The purpose of the SAFE_ELEMENTS set is to pass out only those tags that
may be useful in inferring the semantic structure of the incoming HTML
document, and to ensure that the output conforms to XHTML 1.0 Strict.

Note that in the future we may even start doing something like limited
JavaScript and CSS processing to "render" the incoming page, to better
decide what outgoing XHTML structure best matches the content seen by
someone reading the page in a browser. This will help general-purpose
Tika-based web crawlers avoid SEO tricks like "display: none;" or
JavaScript DOM reorganization.

So as a general rule, clients should not assume a one-to-one mapping
between the HTML input tags and the XHTML output tags in Tika.

> 2. The handler's characters() method gets called with the following text
>
> Untitled
> \n\n
> link1
> \n
> link2
> \n\n
> \n
> \n
>
> The first six calls make sense to me.
>
> The last two calls (with a single \n) happen just before endElement("body")
> is called, and this is unexpected.
>
> From the offset in the buffer, passed to characters(), these are the returns
> _after_ the </body> tag. If I put any number of returns in between the
> </body> and </html>, they all get passed to characters() before the
> endElement("body") call. This seems like a bug.
>
> Has anybody else noticed this?

No, but you're right in that it seems like a bug.

BR,

Jukka Zitting
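
PS. A minimal sketch of the link resolution idea from point 1, for illustration only. It is not Tika code: the class and method names are hypothetical, and it assumes the <base> href and the resource name (e.g. the RESOURCE_NAME_KEY value) have already been extracted as strings. It relies only on java.net.URI.resolve() from the JDK.

```java
import java.net.URI;

// Hypothetical helper, not part of Tika: turns a local link into an
// absolute URL using whatever base information is available.
final class BaseUrlLinks {

    // baseHref:     value of <base href="..."> if the document had one, else null
    // fallbackBase: out-of-band base URL (assumed to come from the resource
    //               name metadata), or null if unknown
    // href:         the link as it appears in the anchor tag
    static String resolve(String baseHref, String fallbackBase, String href) {
        // Prefer the in-document <base>, then the out-of-band URL.
        String base = (baseHref != null) ? baseHref : fallbackBase;
        if (base == null) {
            return href; // nothing to resolve against, leave the link as-is
        }
        // URI.create throws IllegalArgumentException on malformed input;
        // real code would need to decide how to handle that case.
        return URI.create(base).resolve(href).toString();
    }

    public static void main(String[] args) {
        // Relative link resolved against a <base> href.
        System.out.println(resolve("http://example.com/docs/", null, "link1.html"));
        // No <base>, so the out-of-band URL is used instead.
        System.out.println(resolve(null, "http://example.com/docs/index.html", "../link2.html"));
    }
}
```

With something along these lines done inside the parser, the client would only ever see absolute URLs and would not need the <base> element at all. Links that are already absolute pass through unchanged.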