On 6/14/2011 2:32 AM, Tab Atkins Jr. wrote:
On Mon, Jun 13, 2011 at 2:29 AM, Brett Zamir <[email protected]> wrote:
Thanks, that's helpful. Still would be nice to have item-* though...
Well, your idea for custom item-* attributes is just a way to more
concisely embed triples of non-visible data.  You already have a
mechanism for embedding non-visible triples (<meta> or <link>), so the
new method needs some decent benefits to justify the duplication of
functionality.
HTML could have been created without attributes too--but if a feature is going to be used frequently enough, concision is a big selling point (as is non-redundant styleability).
Additionally, while we recognize that non-visible data is sometimes
necessary to embed, we'd like to discourage its use as much as
possible (in general, non-visible data rots much faster).  One way to
do that is to make the syntax slightly cumbersome or ugly - when you
really need it, you can use it, but your aesthetic sense will keep it
from being the first tool you reach for.  So, making it easier or
prettier to embed non-visible triples is actually something we'd like
to avoid if we can.

People who are going to go to the trouble of adding semantics which do nothing for visual rendering are probably going to have some idea of what they are doing. And if there is an adequately convenient method, they will have the chance to learn from experience about the right balance.

And is my idea really encouraging "hidden" meta-data?

Even in my own example of using water damage:

<span itemprop="damage" item-agent="water">
    So blurry....
</span>

...this is allowing some extensibility (by allowing an indefinite number of attributes), but conceptually it is not so different from:

<span itemprop="water-damage">
    So blurry....
</span>

...which no one is calling "hidden".

My suggestion is actually /helping/ avoid hidden meta tags not directly associated with an element encapsulating visible text.

Note, though, that Microdata or RDFa may not be quite appropriate for
this kind of thing.  You're not marking up data triples for later
extraction as independent data - you're doing in-band annotations of
the document itself.  As such, a different mechanism may be more
appropriate, such as your original design of using a custom markup
language in XML, or using custom attributes in HTML.  There's no
particular reason for these sorts of things to be readable by
arbitrary robots; it's sufficient to design for ones that know exactly
what they're reading and looking for.
With the likes of Google offering Microdata-aware searches, I think it makes
a whole lot of sense to allow rich documents such as TEI ones to enter as
regular document citizens of the web, whereby the limited resources of such
specialized semantic communities can leverage the general purpose and
better-supported services such as Google's Microdata tool, while also having
their documents editable within the likes of WYSIWYG HTML text editors, and
stored on sites such as discussion forums or wikis where only HTML may be
allowed and supported.

I think such a focus would also enable the TEI community to benefit from
reusing search-engine-recognized schemas where available, as well as helping
the web community build new schemas for the unique needs of encoding
academic texts.
I haven't yet looked into TEI's metadata scheme, but is the TEI
metadata actually something that needs to be known to search engines?
The one example you've presented in your emails, annotating that some
parts of a transcription were water-damaged (and thus presumably
possibly inaccurate?), isn't something useful for search engines, but
only for humans looking at the document as a whole.
It could be useful to a search engine. If I remembered that some text was water-damaged, I could specify that I only wanted to look for water-damaged text (with the TEI itemtype).

But I used the water damage example to show something very minute and concrete. I could have given examples of more frequent use cases, such as finding a particular component of a structured bibliography or finding all quotations attributed to a particular author.

Search engines could of course be employed not only for searching the whole web, but for searching a particular site.

If most of the other metadata is similar, then the only reason to use
Microdata is to potentially make it easier to read/embed data via
Microdata-aware WYSIWYG editors (are there any?).  Or, possibly, to
use Microdata-extraction tools.
My point about editors was that relative to TEI XML, TEI in HTML could be put into editors. Relative to other approaches like using data-*, it would not be a particular advantage, outside of the fact that data-* is meant only to be used by the specific site, not for republishing by others. For example, if a publisher of a TEI Bible encoded a ton of semantics, using data-* to do so would let the document be previewable in a text editor or shared on a wiki, but it would not be using a recognized mechanism for semantics.
  Is it useful to, for example, extract
all the water-damaged text from a document, minus the context in which
it appeared?
It could be. Scholars might be interested in many different aspects of a document:

* Finding all of the unique closings of a letter writer.
* Using the semantics as hooks for transformations, such as finding all of the letters whose openers begin after a certain date.
* Finding quotations attributed to a particular person.

Many other possibilities exist, given the rich semantic detail of TEI (as one can see by browsing http://www.tei-c.org/release/doc/tei-p5-doc/en/html/REF-ELEMENTS.html and http://www.tei-c.org/release/doc/tei-p5-doc/en/html/REF-ATTS.html). XQuery meets this need rather well for XML and is familiar to that community (and is just starting to become available to JavaScript via XQIB), though jQuery/querySelectorAll could meet the need very well in HTML.

Of course, search engines might not be offering users the ability to make open-ended XQueries overnight, but some targeting across the web would still be very powerful.
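
As a rough illustration of the kind of targeted extraction described above, here is a minimal sketch using only Python's standard html.parser. It assumes the item-* attributes from my proposal (which are not part of any specification), and the names AttrTextCollector, find_text, and SAMPLE are made up for this example; it is not how any search engine or Microdata tool actually works.

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class AttrTextCollector(HTMLParser):
    """Collect the text of each element carrying all of the wanted attributes.

    Matches nested inside another match are folded into the outer match's
    text rather than reported separately.
    """

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted   # attribute name -> required value
        self.depth = 0         # 0 means "not inside a matching element"
        self.buffer = []
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif self.wanted.items() <= dict(attrs).items():
            self.depth = 1
            self.buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            # Normalize whitespace in the collected text.
            self.results.append(" ".join("".join(self.buffer).split()))

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


def find_text(html, wanted):
    """Return the text of every element whose attributes include `wanted`."""
    collector = AttrTextCollector(wanted)
    collector.feed(html)
    collector.close()
    return collector.results


# Hypothetical markup in the style of the item-* proposal.
SAMPLE = """
<p>Transcription:
  <span itemprop="damage" item-agent="water">So blurry....</span>
  <span itemprop="damage" item-agent="fire">Charred <em>edge</em></span>
  <span itemprop="note">Legible</span>
</p>
"""

water_damaged = find_text(SAMPLE, {"itemprop": "damage", "item-agent": "water"})
all_damaged = find_text(SAMPLE, {"itemprop": "damage"})
```

The same function would work unchanged against data-* attributes; the difference I care about is whether the attribute names are a recognized, shared vocabulary, not whether they can be queried.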

Otherwise, one might as well just use data-* attributes to mark up
triples directly on the subjects.  That would give you most of the
benefits with much less verbosity and more direct linkages between
data and metadata.  It would also be somewhat easier to style with
CSS:

<span data-tei-damage="water">
    Some water damaged words
</span>

span[data-tei-damage=water] {
  ...
}
Yes, thanks for that, but I'd really like to avoid the additional redundancy here of needing to add both CSS and <meta/> tags (which would be necessary in order to have the extra information be recognized as universally semantic, rather than application-specific, markup--for the sake of the benefits of search engine discoverability, for one).

Best wishes,
Brett
