On 2010-05-11 02:56, Ken Krugler wrote:
> Hi all,
>
> I was taking another look at TIKA-379, which is the issue of "Html
> elements and attributes not available in XHTML representation"
>
> In a comment on that issue, Jukka said:
>
>> The reason for the default HTML mapping rules in Tika are to simplify
>> and normalize the input documents so that client applications could
>> easily process all sorts of input (HTML or not) without needing type-
>> or source-specific heuristics. The basic idea has been that clients
>> should directly use the underlying parser libraries when it needs
>> custom processing of specific content types.
>
> It feels to me like the issue of elements is a bit different than
> attributes. When processing the response, having a well-constrained set
> of (XHTML-valid) elements would definitely make it easier for clients.
>
> But I don't see how restricting valid XHTML _attributes_ helps much.
> During processing of the result, you care about the structure of the
> DOM, not typically optional attributes.
>
> Anybody care to weigh in on this?
>
> My specific issue has to do with lang and rel attributes, which are very
> useful during crawling.

Hi,

In my opinion this has to do with the level of knowledge that you expect
from the clients of this API, and the extent of a meaningful schema
mapping that you can perform by default. If you pass through all valid
attributes unchanged, then clients need to be aware of "lang" and "rel"
and their meaning, which poses a question: what if some other format
uses "language" and "function" instead? Your client would then have to
handle all such variants of the same (semantically speaking) data.

It's a natural expectation that such details should be handled by the
library, and that the library should know that for this particular
format "language" is semantically equivalent to the better-known "lang"
attribute... Such a 1:1 mapping is often impossible, but in many useful
cases it can be done. I think this should be a configurable component in
Tika. E.g. in many Nutch plugins we map format-specific attributes to a
"standard set" of attributes that other Nutch plugins can rely upon.
This is currently hardcoded in the plugin implementations.

>
> I know that the HtmlMapper support (with some improvements) could
> address my needs, but if there's a way to propagate safe attributes
> through to everybody, that seems like a superior solution.

+1 for a component that knows how to map common format-specific
attributes to abstract attributes, e.g. Dublin Core, HTML, Office, etc.
The classes in o.a.nutch.metadata may be helpful.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
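
P.S. To make the idea of a configurable mapping component a bit more
concrete, here is a minimal sketch. It is only an illustration under my
own assumptions: the class name AttributeMapper, its methods, and the
"application/x-example" type are all hypothetical, and none of this is
existing Tika or Nutch API. It keeps a per-format table of rules and
passes through only those attributes that have a rule, renamed to the
standard name.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical sketch (not part of Tika or Nutch): maps format-specific
 * attribute names onto a shared "standard" set of names.
 */
class AttributeMapper {
    // MIME type -> (format-specific attribute name -> standard attribute name)
    private final Map<String, Map<String, String>> rules = new HashMap<>();

    /** Registers one rule, e.g. ("application/x-example", "language", "lang"). */
    AttributeMapper addRule(String mimeType, String sourceName, String standardName) {
        rules.computeIfAbsent(mimeType, k -> new HashMap<>()).put(sourceName, standardName);
        return this;
    }

    /**
     * Returns only the attributes that have a rule for the given format,
     * renamed to their standard names. Unmapped attributes are dropped.
     */
    Map<String, String> map(String mimeType, Map<String, String> attributes) {
        Map<String, String> formatRules =
                rules.getOrDefault(mimeType, Collections.<String, String>emptyMap());
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : attributes.entrySet()) {
            String standardName = formatRules.get(entry.getKey());
            if (standardName != null) {
                result.put(standardName, entry.getValue());
            }
        }
        return result;
    }
}
```

With rules such as ("text/html", "lang", "lang") and
("application/x-example", "language", "lang"), a client would only ever
look up "lang", whatever the source format called it. Dropping unmapped
attributes is just one possible default; a real component might instead
make pass-through configurable.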