Re: [CODE4LIB] Formats and its identifiers

2009-05-14 Thread Mike Taylor
Rob is correct on all points.

Namespace URIs can, in some cases, be overloaded to function as schema
identifiers.  But they absolutely can't be used blindly in this way
for arbitrary formats -- there are all kinds of potential gotchas.
That being so, I think it is wiser and more explicit _always_ to
define a separate identifier for a format.

 _/|____
/o ) \/  Mike Taylor    m...@indexdata.com    http://www.miketaylor.org.uk
)_v__/\  ... currently trading under the name Gently for reasons which it
 would be otiose, for the moment, to rehearse -- Douglas Adams,
 Dirk Gently


Rob Sanderson writes:
  On Mon, 2009-05-11 at 14:53 +0100, Jakob Voss wrote:
  
A format should be described with a schema (XML Schema, OWL etc.) or at 
least a standard. Mostly this schema already has a namespace or similar 
identifier that can be used for the whole format.

This is unfortunately not the case.
   
   It is mostly the case - but people like to misinterpret schemas and 
   tailor them to their needs.
  
  You're advocating an approach that mostly works, as opposed to one
  that works in all cases?
  
  
For instance MODS Version 3 (currently 3.0, 3.1, 3.2, 3.4) has the XML 
Namespace http://www.loc.gov/mods/v3 so this is the best identifier to 
identify MODS. 

And this is a perfect example of why this is not the case. 
The same mods schema (let alone namespace) defines TWO formats, mods and
modsCollection.
  
   That's your interpretation. According to the schema, the MODS format 
   *is* either a single mods-element or a modsCollection-element. 
  
  According to the __schema__ yes.  Not according to the namespace. The
  namespace is a collection of names only and says precisely nothing about
  structure.
  
  And, yes, given no definition of format, my interpretation is that the
  mods schema defines two formats, as it defines two top level elements
  with different contents (eg one may contain the other).  This is
  typically how people would define format in this context, I would say.  
  
  This is, of course, tangential to the fact that you cannot use the __XML
  Namespace__ as an identifier for the format, no matter how you define
  it.
  
  
   That's 
   exactly what you can refer to with the namespace identifier
   http://www.loc.gov/mods/v3.
  
  No, that's a collection of elements, not a schema.
  
  
   If you need to identify the specific element 'mods' of the format only, 
   then you need another identifier.
  
  Correct. I'm glad you agree with me.
  
  Given that namespaces do not specify anything to do with structure, you
  thus need a new identifier for EVERY element in a namespace as they
  could be used as the top level tag of ANY schema.
  
  There isn't a widely accepted identifier system for schemas, only schema
  locations.  There are also many methods for defining schemas
  (schematron, relax-ng, DTDs, xml schema) which can all define exactly
  the same format.
  
  
   But if the MODS specification defines that you can refer to any element
   with a URI fragment identifier, then the right identifier would be
   http://www.loc.gov/mods/v3#mods
  
  That would be an identifier for the *element*.
  
   The namespace http://www.loc.gov/mods/v3 of the top level element 'mods' 
   does not identify the top level element but the MODS *format* (in any of 
   the versions 3.0-3.4) itself. This format *includes* the top level 
   element 'mods'.
  
  No, it identifies a collection of names.  These names are structured
  according to a schema, which is what we need an identifier for. Beyond
  that, we may also need identifiers for which structure we mean within
  the schema (eg mods vs modsCollection)
  
  
  Rob


Re: [CODE4LIB] Formats and its identifiers

2009-05-11 Thread Ross Singer
On Mon, May 11, 2009 at 9:53 AM, Jakob Voss jakob.v...@gbv.de wrote:

 That's your interpretation. According to the schema, the MODS format *is*
 either a single mods-element or a modsCollection-element. That's exactly
 what you can refer to with the namespace identifier
 http://www.loc.gov/mods/v3.

Agreed.  The same is true, of course, of MARC and, by extension,
MARCXML.  Part of the format is that it can be one record or
multiple.  I don't think this is a particularly strong argument against
using the namespace as an identifier.

 The namespace http://www.loc.gov/mods/v3 of the top level element 'mods'
 does not identify the top level element but the MODS *format* (in any of the
 versions 3.0-3.4) itself. This format *includes* the top level element
 'mods'.

I'm not really sure of the changes between MODS v.3.0-3.3 -- are they
basically backwards and forwards compatible?

I imagine there are a lot of cases where the client doesn't care what
point release of MODS the thing is serialized as, just that it's MODS
and that it can find generally what it's looking for in that
structure, right?

-Ross.


Re: [CODE4LIB] Formats and its identifiers

2009-05-11 Thread Rob Sanderson
On Mon, 2009-05-11 at 14:53 +0100, Jakob Voss wrote:

  A format should be described with a schema (XML Schema, OWL etc.) or at 
  least a standard. Mostly this schema already has a namespace or similar 
  identifier that can be used for the whole format.
  
  This is unfortunately not the case.
 
 It is mostly the case - but people like to misinterpret schemas and 
 tailor them to their needs.

You're advocating an approach that mostly works, as opposed to one
that works in all cases?


  For instance MODS Version 3 (currently 3.0, 3.1, 3.2, 3.4) has the XML 
  Namespace http://www.loc.gov/mods/v3 so this is the best identifier to 
  identify MODS. 
  
  And this is a perfect example of why this is not the case. 
  The same mods schema (let alone namespace) defines TWO formats, mods and
  modsCollection.

 That's your interpretation. According to the schema, the MODS format 
 *is* either a single mods-element or a modsCollection-element. 

According to the __schema__ yes.  Not according to the namespace. The
namespace is a collection of names only and says precisely nothing about
structure.

And, yes, given no definition of format, my interpretation is that the
mods schema defines two formats, as it defines two top level elements
with different contents (eg one may contain the other).  This is
typically how people would define format in this context, I would say.  

This is, of course, tangential to the fact that you cannot use the __XML
Namespace__ as an identifier for the format, no matter how you define
it.
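
A minimal Python sketch of the point, using only the standard library:
two documents with different top-level elements report the same
namespace, so the namespace alone cannot tell them apart. The sample
documents and the helper names are made up for illustration.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"

# Two made-up sample documents: a single record and a collection.
single = (
    f'<mods xmlns="{MODS_NS}">'
    "<titleInfo><title>A</title></titleInfo></mods>"
)
collection = (
    f'<modsCollection xmlns="{MODS_NS}">'
    "<mods><titleInfo><title>A</title></titleInfo></mods>"
    "</modsCollection>"
)

def split_tag(tag):
    """Split ElementTree's '{namespace}local' tag into (namespace, local)."""
    if tag.startswith("{"):
        ns, local = tag[1:].split("}", 1)
        return ns, local
    return "", tag

def root_info(xml_text):
    """Return (namespace, local name) of the document's top-level element."""
    return split_tag(ET.fromstring(xml_text).tag)

single_ns, single_root = root_info(single)
coll_ns, coll_root = root_info(collection)

# Same namespace, different top-level structure.
print(single_ns == coll_ns, single_root, coll_root)
```

A client that keys only on the namespace would treat both documents as
the same thing, even though one wraps the other.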


 That's 
 exactly what you can refer to with the namespace identifier
 http://www.loc.gov/mods/v3.

No, that's a collection of elements, not a schema.


 If you need to identify the specific element 'mods' of the format only, 
 then you need another identifier.

Correct. I'm glad you agree with me.

Given that namespaces do not specify anything to do with structure, you
thus need a new identifier for EVERY element in a namespace as they
could be used as the top level tag of ANY schema.

There isn't a widely accepted identifier system for schemas, only schema
locations.  There are also many methods for defining schemas
(schematron, relax-ng, DTDs, xml schema) which can all define exactly
the same format.
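
To illustrate the schema-location point, here is a hedged sketch that
reads the xsi:schemaLocation hint from a document. The helper name
schema_locations and the sample documents are assumptions; the .xsd
URLs stand in as example locations only, and the example.org one is
invented.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"

def schema_locations(xml_text):
    """Return {namespace: location} from the root's xsi:schemaLocation, if any."""
    root = ET.fromstring(xml_text)
    # The attribute value is a whitespace-separated list of namespace/location pairs.
    tokens = root.get(f"{{{XSI_NS}}}schemaLocation", "").split()
    return dict(zip(tokens[0::2], tokens[1::2]))

def sample(location=None):
    """Build a made-up MODS document, optionally with a schema location hint."""
    hint = ""
    if location:
        hint = f' xmlns:xsi="{XSI_NS}" xsi:schemaLocation="{MODS_NS} {location}"'
    return f'<mods xmlns="{MODS_NS}"{hint}/>'

doc_a = sample("http://www.loc.gov/standards/mods/v3/mods-3-3.xsd")
doc_b = sample("http://example.org/mirror/mods.xsd")  # invented mirror location
doc_c = sample()  # no hint at all

for doc in (doc_a, doc_b, doc_c):
    print(schema_locations(doc))
```

All three documents share one namespace, yet the location differs or
is absent, which is why a location makes a poor identifier.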


 But if the MODS specification defines that you can refer to any element
 with a URI fragment identifier, then the right identifier would be
 http://www.loc.gov/mods/v3#mods

That would be an identifier for the *element*.

 The namespace http://www.loc.gov/mods/v3 of the top level element 'mods' 
 does not identify the top level element but the MODS *format* (in any of 
 the versions 3.0-3.4) itself. This format *includes* the top level 
 element 'mods'.

No, it identifies a collection of names.  These names are structured
according to a schema, which is what we need an identifier for. Beyond
that, we may also need identifiers for which structure we mean within
the schema (eg mods vs modsCollection)
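
One way to picture separate identifiers is a lookup keyed on namespace
plus top-level element. The sketch below assumes that approach; the
urn:example identifiers and the function name identify_format are
hypothetical and not registered anywhere. It keys only on the root
element, so it is a simplification and does not check structure.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"

# Hypothetical format identifiers, invented for this example.
FORMAT_IDS = {
    (MODS_NS, "mods"): "urn:example:format:mods-record",
    (MODS_NS, "modsCollection"): "urn:example:format:mods-collection",
}

def identify_format(xml_text):
    """Return the hypothetical format identifier for a document, or None."""
    tag = ET.fromstring(xml_text).tag
    if tag.startswith("{"):
        ns, _, local = tag[1:].partition("}")
    else:
        ns, local = "", tag
    return FORMAT_IDS.get((ns, local))

record = f'<mods xmlns="{MODS_NS}"/>'
batch = f'<modsCollection xmlns="{MODS_NS}"><mods/></modsCollection>'
other = f'<titleInfo xmlns="{MODS_NS}"/>'

print(identify_format(record))
print(identify_format(batch))
print(identify_format(other))  # same namespace, but no format registered
```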


Rob


Re: [CODE4LIB] Formats and its identifiers

2009-05-11 Thread Karen Coyle

Ross Singer wrote:

Agreed.  The same is true, of course, of MARC and, by extension,
MARCXML.  Part of the format is that it can be one record or
multiple.  I don't think this is a particularly strong argument against
using the namespace as an identifier.
  



Actually, the MARC format (not MARCXML) is very much a single-record 
format. There is a standard for tape headers but no wrapper for MARC 
(Z39.2) records, since the MARC format doesn't have a way to do that. 
Having worked for way too long with MARC, I had a lot of trouble with 
the collection concept in MARCXML and MODS, and am still not sure I 
see the utility of it beyond what a file of records provides. I'm 
assuming its main purpose is to provide valid XML when you have a file 
with more than one bibliographic record. However, it seems that the 
collection and the records within the collection are part and parcel of 
the same schema, making the things we think of as records subordinate 
to the collection, even if it is a collection of one.
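
As a rough illustration of why a plain file of records needs no
wrapper: each Z39.2 / ISO 2709 record states its own length in the
first five characters of the leader and ends with a record terminator
(0x1D), so a reader can walk a concatenated file record by record. The
sketch below assumes that layout; fake_record builds placeholder
records that mimic only the length prefix and terminator, not valid
MARC.

```python
RECORD_TERMINATOR = b"\x1d"

def split_marc_records(data):
    """Yield each record from a byte string of concatenated records."""
    pos = 0
    while pos < len(data):
        # Leader positions 00-04 hold the total record length as ASCII digits.
        length = int(data[pos:pos + 5].decode("ascii"))
        if length < 5:
            raise ValueError(f"implausible record length {length} at offset {pos}")
        yield data[pos:pos + length]
        pos += length

def fake_record(payload):
    """Build a placeholder record: length prefix, filler, payload, terminator."""
    body = b" " * 19 + payload + RECORD_TERMINATOR  # filler stands in for the rest of the leader
    return f"{5 + len(body):05d}".encode("ascii") + body

batch = fake_record(b"first") + fake_record(b"second")
records = list(split_marc_records(batch))
print(len(records))
```

Nothing in the file says how many records follow or where the set
ends, which is the contrast with an XML collection element.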


kc

--
---
Karen Coyle / Digital Library Consultant
kco...@kcoyle.net http://www.kcoyle.net
ph.: 510-540-7596   skype: kcoylenet
fx.: 510-848-3913
mo.: 510-435-8234