Hi all,

I was taking another look at TIKA-379 ("Html elements and attributes not available in XHTML representation").

In a comment on that issue, Jukka said:

The reason for the default HTML mapping rules in Tika are to simplify and normalize the input documents so that client applications could easily process all sorts of input (HTML or not) without needing type- or source-specific heuristics. The basic idea has been that clients should directly use the underlying parser libraries when it needs custom processing of specific content types.

It feels to me like the issue of elements is a bit different than attributes. When processing the response, having a well-constrained set of (XHTML-valid) elements would definitely make it easier for clients.

But I don't see how restricting valid XHTML _attributes_ helps much. When processing the result, you typically care about the structure of the DOM, not about optional attributes.

Anybody care to weigh in on this?

My specific issue has to do with lang and rel attributes, which are very useful during crawling.

I know that the HtmlMapper support (with some improvements) could address my needs, but if there's a way to propagate safe attributes through to everybody, that seems like a superior solution.

Thanks,

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g



