Hi Jukka,

After sending the email below, while on my morning jog, I realized that you were right about the issue of "tag fidelity".

It makes total sense for Tika to manipulate tags to provide a better (more consistent/accurate) representation of the document.

In my use case, I depended on getting back exactly the tag data that was in the original document.

Since these two are in opposition, it's valid and appropriate for Tika to significantly restrict the set of returned tags to those that arguably help with the abstract representation.

Where it gets fuzzy is with things like <b> tags - e.g. if you pass through an <h3>, why not <b>, since the two are often used interchangeably? Or should Tika look for the case of:

<b>some text</b>

and remap it to use <h3> tags?
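
For example, that kind of remapping could live in a simple decorating
ContentHandler on the client side - just a sketch, assuming Tika's
ContentHandlerDecorator is available, and with the <b>-to-<h3> policy
being purely illustrative:

import org.apache.tika.sax.ContentHandlerDecorator;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

// Remap <b> events to <h3> before they reach the downstream handler.
public class BoldToHeadingHandler extends ContentHandlerDecorator {

    public BoldToHeadingHandler(ContentHandler downstream) {
        super(downstream);
    }

    private String remap(String localName) {
        return "b".equals(localName) ? "h3" : localName;
    }

    public void startElement(String uri, String localName, String name,
            Attributes atts) throws SAXException {
        String mapped = remap(localName);
        super.startElement(uri, mapped, mapped, atts);
    }

    public void endElement(String uri, String localName, String name)
            throws SAXException {
        String mapped = remap(localName);
        super.endElement(uri, mapped, mapped);
    }
}

Of course, a decorator like this only helps if <b> survives the
element filtering in the first place - which is exactly why I want to
be able to override HtmlParser (see below).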

In any case, I'm going to file an issue with a patch for AutoDetectParser to have an alternative constructor that takes a Map<Class, Parser>, so I can explicitly override HtmlParser with my own version to handle cases like <b>.
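
Something along these lines - just a sketch of the idea, with the
class name made up, assuming the 3-argument Parser.parse() signature;
the real patch would apply the same substitution to each of
AutoDetectParser's default parsers:

import java.io.IOException;
import java.io.InputStream;
import java.util.Map;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.Parser;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

// Wrap a default parser, and swap it out when its class appears in
// the caller-supplied override map.
public class OverridingParser implements Parser {

    private final Parser delegate;

    public OverridingParser(Parser defaultParser,
            Map<Class<? extends Parser>, Parser> overrides) {
        Parser override = overrides.get(defaultParser.getClass());
        this.delegate = (override != null) ? override : defaultParser;
    }

    public void parse(InputStream stream, ContentHandler handler,
            Metadata metadata)
            throws IOException, SAXException, TikaException {
        delegate.parse(stream, handler, metadata);
    }
}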

-- Ken


On Fri, Sep 25, 2009 at 8:21 PM, Ken Krugler
<kkrugler_li...@transpac.com> wrote:
1. The handler's startElement() never gets called with the <base> tag. I'm
assuming this is because <base> isn't part of the SAFE_ELEMENTS set.

But without the base tag, you can't correctly resolve relative URLs in
anchor tags.

Seems like <base> should be part of the SAFE_ELEMENTS set.

Instead of passing the <base> element to the client (and thus
requiring it to keep track of it to correctly resolve local links),
I'd rather use it inside Tika to automatically turn local links to
absolute URLs in the parse output. Such a mechanism could also
leverage out-of-band information like a base URL given in the
RESOURCE_NAME_KEY metadata entry.
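
Roughly, the resolution step itself could be as simple as this sketch
built on java.net.URL (the class name here is just illustrative):

import java.net.MalformedURLException;
import java.net.URL;

// Combine a base URL (from a <base> tag, or from out-of-band
// metadata like RESOURCE_NAME_KEY) with a possibly-relative link.
public final class LinkResolver {

    public static String resolve(String baseUrl, String href)
            throws MalformedURLException {
        // java.net.URL applies standard relative-reference resolution.
        return new URL(new URL(baseUrl), href).toExternalForm();
    }
}

For example, resolve("http://example.com/docs/page.html",
"../img/a.png") yields "http://example.com/img/a.png".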

Makes sense. I'll file an issue and create a patch.

How was this set of tags derived?

The purpose of the SAFE_ELEMENTS set is to only pass out tags that may
be useful in inferring the semantic structure of the incoming HTML
document and to ensure that the output conforms to XHTML 1.0 Strict.

Note that in the future we may even start doing something like limited
JavaScript and CSS processing to "render" the incoming page to better
decide what outgoing XHTML structure best matches the content seen by
someone reading the page on a browser. This will help general purpose
Tika-based web crawlers to avoid SEO tricks like "display: none;" or
JavaScript DOM reorganization. So, as a general rule, clients should
not assume a one-to-one mapping between the HTML tags going in and
the XHTML tags coming out of Tika.

Recently I had to derive URLs from a web page where the information I needed was in an attribute of a <span> element.

It sounds like the direction Tika is going would mean I couldn't use it for this use case.
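
For concreteness, the handler side of that use case looks roughly
like the sketch below - the "data-href" attribute name is made up,
and the whole thing only works if <span> (with its attributes) gets
passed through to the handler at all:

import java.util.ArrayList;
import java.util.List;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Collect a URL-bearing attribute from <span> elements.
public class SpanUrlHandler extends DefaultHandler {

    private final List<String> urls = new ArrayList<String>();

    public void startElement(String uri, String localName, String name,
            Attributes atts) {
        if ("span".equals(localName)) {
            String value = atts.getValue("data-href");
            if (value != null) {
                urls.add(value);
            }
        }
    }

    public List<String> getUrls() {
        return urls;
    }
}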

I understand why the output might not match the input, which is fine. But what's the benefit of stripping arbitrary tags?

And given the heavy use of CSS/Ajax, how is it possible to decide what might or might not be useful for determining "semantic structure"? E.g. "ignore footers" implies layout, and layout is often done using CSS.

2. The handler's characters() method gets called with the following text:

Untitled
\n\n
link1
\n
link2
\n\n
\n
\n

The first six calls make sense to me.

The last two calls (with a single \n) happen just before endElement("body")
is called, and this is unexpected.

From the offset into the buffer passed to characters(), these are
the returns _after_ the </body> tag. If I put any number of returns
between the </body> and </html>, they all get passed to characters()
before the endElement("body") call. This seems like a bug.

Has anybody else noticed this?

No, but you're right that it seems like a bug.

OK, I'll file a Jira issue.

Thanks,

-- Ken

--------------------------
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-210-6378
