On 4/27/2011 9:06 PM, Benjamin Hawkes-Lewis wrote:
On Wed, Apr 27, 2011 at 3:54 AM, Brett Zamir wrote:
Thanks for the references. While this may be relevant for the likes of blogs
and other documents whose requirements for semantic density are limited
enough to allow such reshaping for practical effect and whose content is
reshapeable by the content creator (as opposed to republishing of already
completed books), for more semantically dense content, such as the types of
classical documents marked up by TEI, it is simply not possible to expose
text for each bit of semantic information or to generate new text to meet
that need. And of course, even with microformats/microdata as it is now, the
semantic content itself is not necessarily exposed just because text is
visible on the page.
The issue of discoverability is I think more related to how it will be
consumed or may be consumed. And even if some pieces of information are less
discoverable, it does not mean that they have no value. For such rich
documents, a lot of attention is being paid to these texts since they are
themselves considered important enough to be worth the time.
If the Declaration of Independence of the United States were marked up with
hidden information about prior emendations, their likely reasons, etc., or
about suspected authors of particular passages, or the United Nations
Declaration of Human Rights were marked up to indicate which countries have
expressed reservations (qualifications) about which rights, while a browsing
application or query tool ought to be able (optionally) to expose this hidden
information, there is no automatic need for the markup to be polluted with
extra (hidden) (and especially URI-based or other non-textual) tags when an
attribute would suffice.
For things that are truly important, there may be a great deal of care put
into building up many layers which are meant to be peeled away, and it is
worth allowing some of that information (particularly the non-textual
information, e.g., the conditions of authorship, publisher, etc.),
especially that which the original publication did not expose, to be still
selectively revealed to queries and deeper browsing.
If a site like Wikisource (the online library sister project of Wikipedia)
would be able to offer such officially sanctioned semantic attributes,
classic texts could become enhanced in this way over time, with the wiki
exposing the hidden semantic information, which indeed may not be as
important as the visible text, but with queries by interested users, any
problems in encoding could be discovered just as well.
Your email challenges the principle of visible data on four different grounds:
1. You note even proponents of visible data do not always show their data.
But the microformats community only endorse hidden metadata for annotating
human-friendly visible data (e.g. "mercredi prochain") with a machine-readable
equivalent (e.g. an ISO 8601 formatted date). They do not endorse hidden
metadata without visible equivalents against which it can be cross-checked.
2. You imply editorial effort can offset the error-proneness of hidden
metadata. But the same extraordinary editorial effort would yield even greater
accuracy if it went towards creating visible data rather than hidden metadata.
3. You claim tool-assisted queries by end-users against the hidden metadata
will reveal errors at the same rate as visible data. But this is doubtful, in
so far as many queries will obfuscate context whereas simply reading through the
text encourages serendipitous error discovery. For example, I could issue a
query asking what proportion of the Declaration of Independence is suspected to
be authored by John Adams. A percentage answer would not reveal the odd
misattributed passage. By contrast, if I'm a scholar of the Declaration and am
reading through the text and I happen to see a suspiciously Jeffersonian
passage visibly attributed to John Adams, I'm much more likely to notice the
error.
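
To make the contrast in point 3 concrete, here is a minimal sketch in Python.
Everything in it is hypothetical: the Passage type, the SAMPLE data (placeholder
text and invented attributions, not real scholarship), and both function names
are assumptions made only for illustration. The aggregate query returns a single
number, while the reading view puts each attribution next to the text it
describes.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    text: str
    suspected_author: str  # hypothetical hidden metadata for this passage


# Placeholder passages with invented attributions (illustration only).
SAMPLE = [
    Passage("placeholder passage number one", "Thomas Jefferson"),
    Passage("placeholder passage number two", "Thomas Jefferson"),
    Passage("placeholder passage number three", "John Adams"),
    Passage("placeholder passage number four", "Thomas Jefferson"),
]


def share_by_author(passages, author):
    """Aggregate query: fraction of words attributed to `author`."""
    total = sum(len(p.text.split()) for p in passages)
    matched = sum(
        len(p.text.split()) for p in passages if p.suspected_author == author
    )
    return matched / total if total else 0.0


def annotated_view(passages):
    """Reading view: each passage shown with its attribution made visible."""
    return [f"[{p.suspected_author}] {p.text}" for p in passages]


if __name__ == "__main__":
    # The percentage gives no hint about which passage carries the attribution.
    print(f"{share_by_author(SAMPLE, 'John Adams'):.0%}")
    # The annotated view lets a reader spot a suspicious attribution in context.
    for line in annotated_view(SAMPLE):
        print(line)
```

If the third passage were misattributed, the percentage would still look
plausible, whereas a reader of the annotated view would see the label beside
the passage itself.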
Of course a visible attribution is helpful, but one cannot possibly
visibly represent all information one might wish to add, especially if
one does not wish to clutter the view hopelessly. Meta-data can be
available to searching, and if search engines don't wish to take
advantage of it, at least individual document queries can do so.
4. You assert that it is not viable to make multiple layers of rich data
visible in a single view. I'd make the counterargument that on the web, unlike
in print, it is economical to dynamically construct different views and filters
of a document and its various visible data streams on the client, on the
server, or on some combination of the two. The HTML5
specification itself is a great example of this. The source text is kept in a
repository that stores changes to the text, along with date and rationale.
Multiple views of this source text are then generated server-side: the source
text is carved up into multiple draft specs