On 2010-05-11 15:22, Ken Krugler wrote:
>> If you pass through all valid attributes unchanged, then clients need to
>> be aware of "lang" and "rel" and their meaning, which poses a question:
>> what if some other format uses "language" and "function" instead? your
>> client then would have to handle all such variants of the same
>> (semantically speaking) data. It's a natural expectation that such
>> details should be handled by the library, and the library should know
>> that for this particular format "language" is semantically equivalent to
>> a better-known "lang" attribute...
>
> If it's valid XHTML, and validates with (say) the XHTML 1.0 Strict DTD,
> then I don't think you would have this case of getting back a language
> (versus lang) attribute.

No, of course not - but XHTML is not the original data that we have; we
generate it ourselves, and we have a choice of either dropping offending
attributes or converting them to something acceptable under XHTML.

> Or are you talking about ways to make it easier for parsers to return
> conformant attributes?

Yes.

>> +1 for a component that knows how to map common format-specific
>> attributes to abstract attributes e.g. Dublin Core, HTML, Office, etc.
>> The classes in o.a.nutch.metadata may be helpful.
>
> So if I understand this correctly, it's not a concern about passing
> through valid XHTML attributes, but rather their value to clients -
> specifically in the context of normalizing the meaning for a variety of
> input formats.

Right: passing translated attributes when we can (according to a mapping),
and passing the original attributes in a non-offending way when we can't
translate them.

>
> I think the initial idea was to use the metadata map to return these in
> a generic way, which works for document-wide things...but most of what's
> interesting to me, at least, is on a per-element basis.
>
> If we said that XHTML 1.0 Strict specified allowable attributes, would
> this address your concern about clients needing to handle multiple
> attribute names?

Can't we put any attributes that we want if they are under a different
namespace, and still be XHTML conformant?

You are right that top-level maps may not cut it - e.g. when parsing
bilingual corpora (like europarl) every other line should get a different
<p lang="">.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
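
P.S. To make the "translate when we can, pass through in a non-offending way when we can't" idea concrete, here is a minimal sketch. It is not code from any existing library: the format name `exampleformat`, the mapping table, the short list of accepted attribute names and the `orig:` prefix are all assumptions made up for illustration. Whether prefixed attributes would still validate against a strict DTD is exactly the open question above, so the sketch does not try to answer it.

```python
# Hypothetical mapping from (source format, attribute name) to the
# better-known XHTML attribute name. Entries are illustrative only.
CANONICAL = {
    ("exampleformat", "language"): "lang",
    ("exampleformat", "function"): "rel",
}

# Small illustrative subset of attribute names passed through unchanged;
# a real list would come from the XHTML DTD in use.
KNOWN_XHTML = {"lang", "rel", "id", "class", "title", "href"}

# Hypothetical prefix standing in for "a different namespace".
ORIG_PREFIX = "orig:"


def normalize_attributes(source_format, attrs):
    """Return a new dict with attributes normalized for XHTML output."""
    result = {}
    for name, value in attrs.items():
        canonical = CANONICAL.get((source_format, name))
        if canonical is not None:
            # Translated according to the per-format mapping.
            result[canonical] = value
        elif name in KNOWN_XHTML:
            # Already an accepted attribute name: keep as-is.
            result[name] = value
        else:
            # No translation known: keep the original under the prefix.
            result[ORIG_PREFIX + name] = value
    return result


# Per-element use, e.g. alternating languages in a bilingual corpus.
for lang in ("en", "pl"):
    print(normalize_attributes("exampleformat", {"language": lang}))
```

With this approach a client only ever looks for `lang` and `rel`, regardless of what the input format called them, and nothing is silently dropped.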