Hi all,

I was taking another look at TIKA-379, which is the issue of "Html
elements and attributes not available in XHTML representation".

In a comment on that issue, Jukka said:

> The reason for the default HTML mapping rules in Tika are to
> simplify and normalize the input documents so that client
> applications could easily process all sorts of input (HTML or not)
> without needing type- or source-specific heuristics. The basic idea
> has been that clients should directly use the underlying parser
> libraries when it needs custom processing of specific content types.

It feels to me like the issue of elements is a bit different from
that of attributes. When processing the response, having a
well-constrained set of (XHTML-valid) elements would definitely make
it easier for clients.

But I don't see how restricting valid XHTML _attributes_ helps much.
During processing of the result, you typically care about the
structure of the DOM, not about optional attributes.

Anybody care to weigh in on this?

My specific issue has to do with lang and rel attributes, which are
very useful during crawling.
I know that the HtmlMapper support (with some improvements) could
address my needs, but if there's a way to propagate safe attributes
through to everybody, that seems like a superior solution.
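
To make that concrete, here is a rough sketch of what a shared
whitelist of safe attributes might look like. This is not Tika code:
the class SafeAttributeFilter, its method names, and the particular
element/attribute pairs are all made up for illustration, and a real
change would presumably live in or near the HTML mapping code.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical per-element whitelist of attributes that are considered
 * safe to pass through to the XHTML output. Not part of Tika.
 */
final class SafeAttributeFilter {

    // Lower-cased element name -> lower-cased attribute names allowed on it.
    private final Map<String, Set<String>> safe = new HashMap<String, Set<String>>();

    /** Registers attributes that may be propagated for the given element. */
    SafeAttributeFilter allow(String element, String... attributes) {
        String key = element.toLowerCase(Locale.ROOT);
        Set<String> names = safe.get(key);
        if (names == null) {
            names = new HashSet<String>();
            safe.put(key, names);
        }
        for (String attribute : attributes) {
            names.add(attribute.toLowerCase(Locale.ROOT));
        }
        return this;
    }

    /**
     * Returns the normalized (lower-cased) attribute name if it is
     * whitelisted for the element, or null if it should be dropped.
     */
    String mapSafeAttribute(String element, String attribute) {
        Set<String> names = safe.get(element.toLowerCase(Locale.ROOT));
        String name = attribute.toLowerCase(Locale.ROOT);
        return (names != null && names.contains(name)) ? name : null;
    }

    /** Example defaults (an assumption): just the attributes I need for crawling. */
    static SafeAttributeFilter defaults() {
        return new SafeAttributeFilter()
            .allow("html", "lang")
            .allow("a", "href", "rel")
            .allow("link", "href", "rel");
    }
}
```

With something like this, defaults().mapSafeAttribute("a", "rel")
would keep the attribute, while anything not listed (say, "onclick"
on an anchor) would map to null and be dropped, so clients still see
a constrained, predictable set of attributes.
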
Thanks,
-- Ken
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g