Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All

2009-05-14 Thread Mike Taylor
Alexander Johannesen writes:
  Anyway, I'm suspecting I don't see what the problem seems to be. To
  create the best identifier for things seems a bit of a strange
  notion to me, but is this based on that there is only (or rather,
  that you're trying to create) one identifier for any one thing?

Yes, this is exactly it.  RDF thinks that each concept should have
exactly one identifier; Topic Maps says it's fine to have multiple
identifiers.  That seems to be 99% of the conceptual difference
between them.

My position: it seems obvious that one is the CORRECT number of
identifiers for a thing to have.  But since we live in a formal
world, the Topic Maps approach may be more practical.

In other words, I might end up _advocating_ Topic Maps, but don't
expect me to _like_ it :-)

 _/|____
/o ) \/  Mike Taylor    m...@indexdata.com    http://www.miketaylor.org.uk
)_v__/\  I think it's too consistently wrong not to be fixable --
 Phil Baldwin.


Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All

2009-05-14 Thread Rob Sanderson
RDF is fine with one 'thing' having multiple identifiers, it just hands
the problem up a level to the application to deal with.

For example, the owl:sameAs predicate is used to express that the
subject and object are the same 'thing'.  Then the application can infer
that if a owl:sameAs b, and a x y, then b x y.
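
As a rough illustration of the inference described above, here is a minimal Python sketch. It assumes triples are plain (subject, predicate, object) tuples; the function name infer_same_as is made up for this example, and it applies the rule in a single step on subjects only, without the transitive closure a real OWL reasoner would compute.

```python
SAME_AS = "owl:sameAs"


def infer_same_as(triples):
    """Return the input triples plus those implied by owl:sameAs on subjects."""
    triples = set(triples)

    # owl:sameAs is symmetric, so record each pair in both directions.
    aliases = {}
    for s, p, o in triples:
        if p == SAME_AS:
            aliases.setdefault(s, set()).add(o)
            aliases.setdefault(o, set()).add(s)

    # For every ordinary statement "a x y", add "b x y" for each alias b of a.
    inferred = set(triples)
    for s, p, o in triples:
        if p == SAME_AS:
            continue
        for alias in aliases.get(s, ()):
            inferred.add((alias, p, o))
    return inferred


facts = {("a", SAME_AS, "b"), ("a", "x", "y")}
result = infer_same_as(facts)
print(("b", "x", "y") in result)
```

The point of the sketch is only that the merging happens in the application, not in the data model itself.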

Rob

On Thu, 2009-05-14 at 13:00 +0100, Mike Taylor wrote:
 Alexander Johannesen writes:
   Anyway, I'm suspecting I don't see what the problem seems to be. To
   create the best identifier for things seems a bit of a strange
   notion to me, but is this based on that there is only (or rather,
   that you're trying to create) one identifier for any one thing?
 
 Yes, this is exactly it.  RDF thinks that each concept should have
 exactly one identifier; Topic Maps says it's fine to have multiple
 identifiers.  That seems to be 99% of the conceptual difference
 between them.
 
 My position: it seems obvious that one is the CORRECT number of
 identifiers for a thing to have.  But since we live in a formal
 world, the Topic Maps approach may be more practical.
 
 In other words, I might end up _advocating_ Topic Maps, but don't
 expect me to _like_ it :-)
 
  _/|_  ___
 /o ) \/  Mike Taylor    m...@indexdata.com    http://www.miketaylor.org.uk
 )_v__/\  I think it's too consistently wrong not to be fixable --
Phil Baldwin.


Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All

2009-05-14 Thread Alexander Johannesen
On Thu, May 14, 2009 at 17:35, Rob Sanderson azar...@liverpool.ac.uk wrote:
 For example, the owl:sameAs predicate is used to express that the
 subject and object are the same 'thing'.  Then the application can infer
 that if a owl:sameAs b, and a x y, then b x y.

Yes, but there's a snag: RDF works only at the URI resource level
(there are no added semantics for the type of a URI resource), so if
someone asserts owl:sameAs between an identifier of a thing and a
locator of a thing (a locator being the resource itself as opposed to
an identifier; for example, are you talking about Sun Corp
(http://sun.com/) or are you talking about their website
(http://sun.com/)?) you can get a nasty case of integrity rot, and I've
not seen any proposals to address this issue (the RDF world
essentially assumes modeling from the viewpoint of everything being
true).

I guess Mike doesn't like RDF *nor* Topic Maps now. :)


Regards,

Alex
-- 
---
 Project Wrangler, SOA, Information Alchemist, UX, RESTafarian, Topic Maps
-- http://shelter.nu/blog/ 


Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All

2009-05-14 Thread Alexander Johannesen
On Thu, May 14, 2009 at 17:45, Rob Sanderson azar...@liverpool.ac.uk wrote:
 I'll quote Mike (and most common approaches to the problem):
        Don't Do That Then.
 :)

Oh, for sure. :) But these are very subtle things that are hard to
understand, and certainly the long-term implications, so people *will*
do this, and they *will* put rot into the SemWeb chains people create.
It's unavoidable, but I know lots are trying to work out some kind of
solution. Unfortunately, this one is being routed to software
frameworks rather than the RDF core itself. Oh well.


Regards,

Alex
-- 
---
 Project Wrangler, SOA, Information Alchemist, UX, RESTafarian, Topic Maps
-- http://shelter.nu/blog/ 


Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All

2009-05-14 Thread Eric Lease Morgan
[ /me is creating an email filter/rule against the Code4Lib mailing
list to automatically delete messages whose subject lines contain "One
Data Format Identifier" because he has acquired carpal tunnel syndrome
after pressing the delete key so often. ]


--
Earache Least Moron


Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All

2009-05-12 Thread Jakob Voss

Ross Singer wrote:


<?xml version="1.0" encoding="UTF-8"?>
<formats xmlns="http://unapi.info/">
  <format name="foaf" uri="http://xmlns.com/foaf/0.1/"/>
</formats>


I generally agree with this, but what about formats that aren't XML or
RDF based?  How do I also say that you can grab my text/x-vcard?  Or
my application/marc record?  There is still lots of data I want that
doesn't necessarily have these characteristics.


In my blog posting I included a way to specify mime types (such as
text/x-vcard or application/marc) as URIs. According to RFC 2220 the
application/marc type refers to the harmonized USMARC/CANMARC
specification, whatever this is, so the mime type can be used as a format
identifier. For vCard there is an RDF namespace and a (not very nice)
XML namespace:


http://www.w3.org/2001/vcard-rdf/3.0#
vcard-temp (see http://xmpp.org/registrar/namespaces.html)

If you want to identify a defined format, there is almost always an 
identifier you can reuse - if not, ask the creator of the format. The 
problem is not in identifiers or the complexity of formats but in people 
that create and use formats that are not well defined.



What about XML formats that have no namespace?  JSON objects that
conform to a defined structure?  Protocol Buffers?


If something does not conform to a defined structure then it is no 
format at all but data garbage (yes, we have a lot of this in library 
systems but that's no excuse). To refer to XML or JSON in general there 
are mime types. If you want to identify something more specific there 
must be a definition of it or you are lost anyway.



And, while I didn't really want to wade into these waters, what about
formats that are really only used to carry other formats, where it's
the *other* format that really matters (METS, Atom, OpenURL XML,
etc.)?


A container format with restricted carried format is a subset of the 
container format. If you cannot handle the whole but only a subset then 
you should only ask for the subset. There are three possibilities:


1. implicitly define the container format and choose the carried
format. This is what SRU does - you ask for the record format but you
always get the SRU response format as container with embedded record format.


2. implicitly define the carried format and choose the container format

3. define a new format as combination of container and carried format


unAPI should be revised and specified more strictly to become an RFC anyway.
Yes, this requires a laborious and lengthy submission and review process but
there is no such thing as a free lunch.


Yeah, I have no problem with this (same with Jangle).  The argument
could be made, however, is there a cowpath yet to be paved?


That depends on whether you want to be taken seriously outside the library
community and target the web as a whole or not.


Cheers,
Jakob

--
Jakob Voß jakob.v...@gbv.de, skype: nichtich
Verbundzentrale des GBV (VZG) / Common Library Network
Platz der Goettinger Sieben 1, 37073 Göttingen, Germany
+49 (0)551 39-10242, http://www.gbv.de


Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All

2009-05-12 Thread Ross Singer
On Tue, May 12, 2009 at 6:21 AM, Jakob Voss jakob.v...@gbv.de wrote:
 Ross Singer wrote:

 <?xml version="1.0" encoding="UTF-8"?>
 <formats xmlns="http://unapi.info/">
   <format name="foaf" uri="http://xmlns.com/foaf/0.1/"/>
 </formats>

 I generally agree with this, but what about formats that aren't XML or
 RDF based?  How do I also say that you can grab my text/x-vcard?  Or
 my application/marc record?  There is still lots of data I want that
 doesn't necessarily have these characteristics.

 In my blog posting I included a way to specify mime types (such as
 text/x-vcard or application/marc) as URIs. According to RFC 2220 the
 application/marc type refers to the harmonized USMARC/CANMARC
 specification whatever this is - so the mime type can be used as format
 identifier. For vCard there is an RDF namespace and a (not very nice) XML
 namespace:

 http://www.w3.org/2001/vcard-rdf/3.0#
 vcard-temp (see http://xmpp.org/registrar/namespaces.html)


This is vCard as RDF, not vCard the format (which is text based).  It
would be the equivalent of saying "here's an hCard, it's the same
thing, right?", although the reason I may be requesting a vCard in its
native format is that I have a vCard parser or an application that
consumes them (Exchange, for example).


 That depends whether you want to be taken serious outside the library
 community and target at the web as a whole or not.


My point is that there's a step before that, possibly, where the
theory behind unAPI, Jangle, whatever, is tested to even see if it's
going in the right direction before writing it up formally as an RFC.

I don't think the lack of adoption of unAPI has anything to do with
the prose of its specification document.  The RFC format is useful
for later adopters, but people that, say, jumped on the Atom
syndication format as a good idea didn't need an RFC first; they
developed a spec, /then/ wrote the standard once they had an idea of
how it needed to work.

-Ross.


Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All

2009-05-12 Thread Jonathan Rochkind

Ross Singer wrote:

My point is that there's a step before that, possibly, where the
theory behind unAPI, Jangle, whatever, is tested to even see if it's
going in the right direction before writing it up formally as an RFC.

I don't think the lack of adoption of unAPI has anything to do with
the prose of its specification document.  The RFC format is useful
for later adopters, but people that, say, jumped on the Atom
syndication format as a good idea didn't need an RFC first; they
developed a spec, /then/ wrote the standard once they had an idea of
how it needed to work.
  


I think this is a really important point, for us to get used to. Good 
formal standards are built _from_ best practices tested through 
experience.  Too often we try to do it the other way around, and wind up
spending an awful lot of time on the details of standards that turn out
not to solve the problem we wanted to solve as well as it could have
been solved.


Jonathan


Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All

2009-05-11 Thread Jakob Voss

Hi,

I summarized my thoughts about identifiers for data formats in a blog 
posting: http://jakoblog.de/2009/05/10/who-identifies-the-identifiers/


In short it’s not a technology issue but a commitment issue and the 
problem of identifying the right identifiers for data formats can be 
reduced to two fundamental rules of thumb:


1. reuse: don’t create new identifiers for things that already have one.

2. document: if you have to create an identifier, describe its referent
as openly, clearly, and in as much detail as possible to make it reusable.


A format should be described with a schema (XML Schema, OWL etc.) or at 
least a standard. Mostly this schema already has a namespace or similar 
identifier that can be used for the whole format.


For instance MODS Version 3 (currently 3.0, 3.1, 3.2, 3.3) has the XML
Namespace http://www.loc.gov/mods/v3 so this is the best identifier to
identify MODS. If you need to identify a specific version then you
should *first* look if such identifiers already exist, *second* push the
publisher (LOC) to assign official URIs for MODS versions, if these do
not already exist, or *third* create and document specific URIs and make
sure that everyone knows about these identifiers. At the moment there are:


MODS Version 3     http://www.loc.gov/mods/v3
MODS Version 3.0   info:srw/schema/1/mods-v3.0
MODS Version 3.1   info:srw/schema/1/mods-v3.1
MODS Version 3.2   info:srw/schema/1/mods-v3.2
                   info:ofi/fmt:xml:xsd:mods
MODS Version 3.3   info:srw/schema/1/mods-v3.3

The SRU Schemas registry links the info:srw/schema/1/mods-v3*
identifiers to their XML Schemas, which is very little documentation, but
it links to http://www.loc.gov/mods/v3 at least in some way.
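
To make the "several identifiers, one format" situation concrete, here is a small Python sketch of a purely descriptive lookup table. The identifiers are the ones listed above; the table layout, the function name same_format, and the reading that info:ofi/fmt:xml:xsd:mods belongs to version 3.2 (because it sits under that row) are assumptions made for this illustration, not part of any registry.

```python
# Identifier -> (format, version), copied from the list above.
# A version of None stands for "MODS 3, any minor version".
KNOWN_IDENTIFIERS = {
    "http://www.loc.gov/mods/v3": ("mods", None),
    "info:srw/schema/1/mods-v3.0": ("mods", "3.0"),
    "info:srw/schema/1/mods-v3.1": ("mods", "3.1"),
    "info:srw/schema/1/mods-v3.2": ("mods", "3.2"),
    "info:ofi/fmt:xml:xsd:mods": ("mods", "3.2"),  # assumed to mean 3.2
    "info:srw/schema/1/mods-v3.3": ("mods", "3.3"),
}


def same_format(id_a, id_b, ignore_version=False):
    """Tell whether two identifiers are known to name the same format."""
    if id_a == id_b:
        return True
    a = KNOWN_IDENTIFIERS.get(id_a)
    b = KNOWN_IDENTIFIERS.get(id_b)
    if a is None or b is None:
        return False  # unknown identifiers are never assumed to be equal
    if ignore_version:
        return a[0] == b[0]
    return a == b


print(same_format("info:srw/schema/1/mods-v3.2", "info:ofi/fmt:xml:xsd:mods"))
print(same_format("http://www.loc.gov/mods/v3", "info:srw/schema/1/mods-v3.0",
                  ignore_version=True))
```

A client that only cares about "MODS 3 in general" would pass ignore_version=True; one that needs a specific version would not.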


Ross wrote:


First, and most importantly, how do we reconcile these different
identifiers for the same thing?  Can we come up with some agreement on
which ones we should really use?


Use the one that is documented best.


Secondly, and this gets to the reason why any of this was brought up
in the first place, how can we coordinate these identifiers more
effectively and efficiently to reuse among various specs and
protocols, but not:



1) be tied to a particular community
2) require some laborious and lengthy submission and review process to
just say "hey, here's my FOAF available via UnAPI"


The identifier for FOAF is http://xmlns.com/foaf/0.1/. Forget about 
identifiers that are not URIs. OAI-PMH at least includes a mechanism to 
map metadataPrefixes to official URIs but this mechanism is not always 
used. If unAPI lacks a way to map a local name to a global URI, we
should fix unAPI so that it tells us:


<?xml version="1.0" encoding="UTF-8"?>
<formats xmlns="http://unapi.info/">
  <format name="foaf" uri="http://xmlns.com/foaf/0.1/"/>
</formats>
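
A client could read such a response along these lines. This is only a sketch in Python using the standard library: the uri attribute and the http://unapi.info/ namespace are taken from the proposal above and are not something existing unAPI servers are known to send, and the function name format_uris is invented for the example.

```python
import xml.etree.ElementTree as ET

UNAPI_NS = "http://unapi.info/"  # namespace as used in the proposal above


def format_uris(response_bytes):
    """Map each local format name to the global URI given in the response."""
    root = ET.fromstring(response_bytes)
    return {
        fmt.get("name"): fmt.get("uri")
        for fmt in root.iter("{%s}format" % UNAPI_NS)
    }


EXAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<formats xmlns="http://unapi.info/">
  <format name="foaf" uri="http://xmlns.com/foaf/0.1/"/>
</formats>"""

mapping = format_uris(EXAMPLE)
print(mapping)
```

With that mapping in hand, the local name "foaf" is just a convenience; the URI is what gets compared across services.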

unAPI should be revised and specified more strictly to become an RFC
anyway. Yes, this requires a laborious and lengthy submission and review 
process but there is no such thing as a free lunch.



3) be so lax that it throws all hope of authority out the window


Reuse existing authorities and document better to create authority.


I would expect the various communities to still maintain their own
registries of approved data formats (well, OpenURL and SRU, anyway
-- it's not as appropriate to UnAPI or Jangle).


There should be a distinction between descriptive registries that only 
list identifiers and formats that are defined elsewhere and 
authoritative registries that define new identifiers and formats. The 
number of authoritatively defined identifiers should be small for a
given API because an identifier is better defined by the creator
of the format than by a user of the format. If the creator does not
support usable identifiers then it is better to talk to them than to
create something in parallel.


Greetings,
Jakob

--
Jakob Voß jakob.v...@gbv.de, skype: nichtich
Verbundzentrale des GBV (VZG) / Common Library Network
Platz der Goettinger Sieben 1, 37073 Göttingen, Germany
+49 (0)551 39-10242, http://www.gbv.de


Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All

2009-05-11 Thread Rob Sanderson
On Mon, 2009-05-11 at 11:31 +0100, Jakob Voss wrote
 A format should be described with a schema (XML Schema, OWL etc.) or at 
 least a standard. Mostly this schema already has a namespace or similar 
 identifier that can be used for the whole format.

This is unfortunately not the case.


 For instance MODS Version 3 (currently 3.0, 3.1, 3.2, 3.3) has the XML
 Namespace http://www.loc.gov/mods/v3 so this is the best identifier to 
 identify MODS. 

And this is a perfect example of why this is not the case.

The same mods schema (let alone namespace) defines TWO formats, mods and
modsCollection.


To quote from the schema:

<!--
*  An instance of this schema is

 (1) a single MODS record:
 -->
<xsd:element name="mods" type="modsType"/>
<!--
or

(2) a collection of MODS records:
 -->
<xsd:element name="modsCollection">
  <xsd:complexType>
    <xsd:sequence>
      <xsd:element ref="mods" maxOccurs="unbounded"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>
<!--
*  End of instance definition
-->

So you're using the same identifier to identify two different things at
the same time.
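
A short Python sketch of the consequence: two documents with the same namespace but different root elements. The helper name root_signature and the two tiny sample documents are made up for illustration; only the namespace and the element names mods and modsCollection come from the schema quoted above.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"


def root_signature(xml_bytes):
    """Return (namespace, local name) of a document's root element."""
    tag = ET.fromstring(xml_bytes).tag
    if tag.startswith("{"):
        namespace, local = tag[1:].split("}", 1)
        return namespace, local
    return None, tag  # root element is in no namespace


single = b'<mods xmlns="http://www.loc.gov/mods/v3"/>'
collection = (b'<modsCollection xmlns="http://www.loc.gov/mods/v3">'
              b'<mods/></modsCollection>')

sig_single = root_signature(single)
sig_collection = root_signature(collection)

# Same namespace, yet a parser expecting one record cannot treat them alike.
print(sig_single[0] == sig_collection[0])
print(sig_single == sig_collection)
```

The namespace alone cannot tell a consumer which of the two shapes it is about to receive; the pair of namespace and root element gets closer, but that pair is not an identifier anyone has published.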

We discussed this a lot during the development of SRU and there simply
isn't an existing identifier for an XML 'format'.

Also consider the following more hypothetical, but perfectly feasible
situations:

* One namespace is used to define two _totally_ separate sets of
elements.  There's no reason why this can't be done.

* One namespace defines so many elements that it's meaningless to call
it a format at all.  Even though the top level tag might be the same,
the contents are so varied that you're unable to realistically process
it.


Rob


Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All

2009-05-11 Thread Alexander Johannesen
On Mon, May 11, 2009 at 16:04, Rob Sanderson azar...@liverpool.ac.uk wrote:
 * One namespace is used to define two _totally_ separate sets of
 elements.  There's no reason why this can't be done.

As opposed to all the reasons for not doing it. :) This is crap design
of a higher magnitude, and the designers should be either a) whipped
in public and thrown out in shame, or b) repent and made to fix the
problem. Even I would opt for the latter, but such a simple task not
being done seems to suggest that perhaps the former needs to be put in
place.

 * One namespace defines so many elements that it's meaningless to call
 it a format at all.  Even though the top level tag might be the same,
 the contents are so varied that you're unable to realistically process
 it.

Yeah, don't use MODS in general; it's a hack. It's even crazier still
that many versions have the same namespace. What were they thinking?!

Anyway, even if the namespace is botched, you can still (if I'll dare
go by the Topic Maps moniker) have multiple namespaces for the same
subject (the format in question), and simply publish and use your own
and let the TM mechanics handle the ambiguity for you. If enough
people do this, and perhaps even use your unofficial identifiers,
maybe LOC will see the errors of their ways and repent.


Regards,

Alex
-- 
---
 Project Wrangler, SOA, Information Alchemist, UX, RESTafarian, Topic Maps
-- http://shelter.nu/blog/ 


Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All

2009-05-11 Thread Rob Sanderson
On Mon, 2009-05-11 at 12:02 +0100, Alexander Johannesen wrote:
 On Mon, May 11, 2009 at 16:04, Rob Sanderson azar...@liverpool.ac.uk wrote:
  * One namespace is used to define two _totally_ separate sets of
  elements.  There's no reason why this can't be done.
 
 As opposed to all the reasons for not doing it. :) This is crap design
 of a higher magnitude, and the designers should be either a) whipped
 in public and thrown out in shame, or b) repent and made to fix the
 problem. Even I would opt for the latter, but such a simple task not
 being done seems to suggest that perhaps the former needs to be put in
 place.

I totally agree that it's an awful design choice. However it's a
demonstration that XML namespaces _do not identify format_.  And hence,
we need another identifier which is not the namespace of the top level
element.

  * One namespace defines so many elements that it's meaningless to call
  it a format at all.  Even though the top level tag might be the same,
  the contents are so varied that you're unable to realistically process
  it.
 
 Yeah, don't use MODS in general; it's a hack. It's even crazier still
 that many versions have the same namespace. What were they thinking?!

Or TEI for that matter. However I wouldn't call either of them a 'hack'
and there are many people who do want to use both of these schemas.

Therefore, again, we need another identifier.
Q.E.D.

Rob


Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All

2009-05-11 Thread Jonathan Rochkind

Alexander Johannesen wrote:


Yeah, don't use MODS in general; it's a hack. It's even crazier still
that many versions have the same namespace. What were they thinking?!
  


Um, MODS is awfully useful for a bunch of reasons. I'm not going to stop 
using it because they've used namespaces in a way you don't approve of.


In the real world, we use things when they solve the problem in front of 
us in as easy a way as possible, bonus when they are actually standards 
used by a few other people (like MODS is).   If you have the luxury to 
avoid using things that you don't believe are theoretically sound (and 
inter-operating with anyone who does use those things), good on you, I 
guess.


Jonathan


Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All

2009-05-11 Thread Ross Singer
On Mon, May 11, 2009 at 6:31 AM, Jakob Voss jakob.v...@gbv.de wrote:
 2) require some laborious and lengthy submission and review process to
 just say "hey, here's my FOAF available via UnAPI"

 The identifier for FOAF is http://xmlns.com/foaf/0.1/. Forget about
 identifiers that are not URIs. OAI-PMH at least includes a mechanism to map
 metadataPrefixes to official URIs but this mechanism is not always used. If
 unAPI lacks a way to map a local name to a global URI, we should better fix
 unAPI to tell us:

 <?xml version="1.0" encoding="UTF-8"?>
 <formats xmlns="http://unapi.info/">
   <format name="foaf" uri="http://xmlns.com/foaf/0.1/"/>
 </formats>

I generally agree with this, but what about formats that aren't XML or
RDF based?  How do I also say that you can grab my text/x-vcard?  Or
my application/marc record?  There is still lots of data I want that
doesn't necessarily have these characteristics.

What about XML formats that have no namespace?  JSON objects that
conform to a defined structure?  Protocol Buffers?

And, while I didn't really want to wade into these waters, what about
formats that are really only used to carry other formats, where it's
the *other* format that really matters (METS, Atom, OpenURL XML,
etc.)?

 unAPI should be revised and specified more strictly to become an RFC anyway.
 Yes, this requires a laborious and lengthy submission and review process but
 there is no such thing as a free lunch.


Yeah, I have no problem with this (same with Jangle).  The argument
could be made, however, is there a cowpath yet to be paved?

-Ross.


Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All

2009-05-11 Thread Alexander Johannesen
On Mon, May 11, 2009 at 19:34, Jonathan Rochkind rochk...@jhu.edu wrote:
 In the real world, we use things when they solve the problem in front of us
 in as easy a way as possible

And somehow you're suggesting that I don't live in the real world? :)
Good try, but as far as I've experienced, people in the library world
live quite a distance away from the real one.


Alex
-- 
---
 Project Wrangler, SOA, Information Alchemist, UX, RESTafarian, Topic Maps
-- http://shelter.nu/blog/ 


Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All

2009-05-08 Thread Jonathan Rochkind
I don't understand from your description how Topic Maps solve the
"identifying multiple versions of a standard" problem. Which was the
original question, right?  Or have I gotten confused? I didn't think the
original question was even about topic vocabularies, but about how to
best provide an identifier for (eg) Marc 2.1 and another for Marc 2.2,
while still allowing machines to ignore versions if they like and just
request and/or identify generic "marc".  And you said that Topic Maps
had a solution to this?

I am genuinely curious -- not necessarily because I'm ever going to use
Topic Maps (sorry!), but because if they have a well-thought-out, tested
solution to this, it could serve as a model in other contexts.


Jonathan

Alexander Johannesen wrote:

On Wed, May 6, 2009 at 18:44, Mike Taylor m...@indexdata.com wrote:
  

Can't you just tell us?



Sorry, but surely you must be tired of me banging on this gong by now?
It's not that I don't want to seem helpful, but I've been writing a
bit on this here already and don't want to be marked as spam for Topic
Maps.

In the Topic Maps world our global identificators are called PSI, for
Published Subject Indicators. There's a few subtleties within this,
but they are not so different from any other identificator you'll find
elsewhere (RDF, library world, etc.) except of course they are
*always* URIs. Now, the thing here is that they should *always* be
published somewhere, whether as a part of a list or somewhere. The
next thing is that they always should resolve to something (although
the standard doesn't require this, however I'd say you're doing it wrong
if you couldn't do this, even if it sometimes is an evil necessity).

This last part is really the important bit, where any PSI will act as
1) a global identificator, and 2) resolve to a human text explaining
what it represents. Systems can just use it while at the same time
people can choose the right ones for their uses.

And, yes, the identificators can be done any way you slice them. Some
might think that, e.g., a PSI set for all dates is crazy as you need to
produce identificators for all dates (or times), and that would be
just way too much to deal with, but again, that's not an identification
problem, that's a resolver problem. If I can browse to a PSI and get
the text that this is 3rd of June, 1971, using the whatsnot calendar
style, then that's safe for me to use for my birthday. Let's pretend
the PSI is http://iso.org/datetime/03061971. By releasing a URI
template computers can work with this automatically, no frills.

Now a bit more technical; any topic (which is a Topic Map
representation of any subject, where subject is defined as anything
you can ever hope to think of) can have more than one PSI, because I
might use the PSI http://someother.org/time/date/3/6/1971 for my date.
If my application only understands the former set of PSIs, I can't
merge and find similar cross-semantics (which really is the core of
the problem this thread has been talking about). But simply attach the
second PSI to the same Topic, and you do. In fact, both parties will
understand perfectly what you're talking about.

More complex is that the definitions of PSI sets don't have to
happen on the subject level, i.e. the Topic called Alex to which I
tried to attach my birthday. It can be moved to a meta model level,
where you say the Topic for Time and dates has the PSIs for both
organisations, and all Topics just use one or the other; we're
shifting the explicitness of identification up a notch.

Having multiple PSIs might seem a bit unordered, but it's based on the
notion of organic growth, just like the web. People will gravitate
towards using PSIs from the most trusted sources (or most accurate or
most whatever), shifting identification schemes around. This is a good
thing (organic growth) at the price of multiple identifiers, but if
the library world started creating PSIs, I betcha humanity and the
library world both could be saved in one fell swoop! (That's another
gong I like to bang)

I'm kinda anticipating Jonathan saying this is all so complex now. :)
But it's not really; your application only has to have complexity in
the small meta model you set up, *not* for every single Topic you've
got in your map. And they're mergable and shareable, and as such can
be merged and fixed (or cleaned or sobered or made less complex) for
all your various needs also.

Anyway, that's the basics. Let me know if you want me to bang on. :)
For me, the problem libraries face isn't really the mechanisms of
this (because this is solvable, and I guess you just have to trust
that the Topic Maps community have been doing this for the last 10
years or so already :), but how you're going to fit existing
resources into FRBR and RDA, but that's a separate discussion.


Regards,

Alex
  


Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All

2009-05-08 Thread Alexander Johannesen
On Sat, May 9, 2009 at 00:32, Jonathan Rochkind rochk...@jhu.edu wrote:
 I don't understand from your description how Topic Maps solve the
 identifying multiple versions of a standard problem.

It's the mechanism of having multiple identifiers for Topics, so, in pseudocode:

Topic MARC21
  psi info:ofi/fmt:xml:xsd:MARC21
  psi http://loc.org/stuff/marc21
  property #mime-type whatever for the binary

Topic MARC 1.1
  is_a MARC
  psi info:srw/schema/1/marcxml-v1.1
  psi http://loc.org/stuff/marcxml-v1.1
  property #mime-type whatever 1.1

Topic MARC 1.2
  is_a MARC
  psi info:srw/schema/1/marcxml-v1.2
  psi http://bingo.com/psi/marcxml
  property #mime-type whatever 1.2

Or, if MARC 1.2 is backwards compatible with 1.1:

Topic MARC 1.2
  is_a MARC 1.1
  psi info:srw/schema/1/marcxml-v1.2

Or, if I make my own unofficial version:

Topic MARC 2.0
  is_a MARC 1.2
  psi http://alex.com/psi/marc-2.0

This is enough to cobble together what is and isn't compatible in
types of formats, so if your application is Topic Maps aware, this
should be trivial (including what format to ignore or react to). The
point is that you don't need *one* identifier for things; Topics are
proxies for knowledge, and part of the notion of knowledge is what
identifies that knowledge. Multiple PSIs help us leverage both rigid
and fuzzy systems.
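
For readers who want to see the compatibility check spelled out, here is a rough Python sketch of the pseudocode above. It is not a Topic Maps engine: the Topic class, find_topic and accepts are hypothetical names, the PSIs are the made-up ones from the pseudocode, and it uses the backwards-compatible variant in which each version is_a the previous one.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Topic:
    name: str
    psis: set = field(default_factory=set)
    is_a: Optional["Topic"] = None  # parent topic, if any


def find_topic(topics, psi):
    """Return the topic carrying this PSI, or None if nobody claims it."""
    for topic in topics:
        if psi in topic.psis:
            return topic
    return None


def accepts(topics, wanted_psi, offered_psi):
    """True if the offered format is the wanted one or descends from it."""
    wanted = find_topic(topics, wanted_psi)
    current = find_topic(topics, offered_psi)
    while current is not None:
        if current is wanted:
            return True
        current = current.is_a  # walk up the is_a chain
    return False


marc = Topic("MARC", {"info:ofi/fmt:xml:xsd:MARC21",
                      "http://loc.org/stuff/marc21"})
marc11 = Topic("MARC 1.1", {"info:srw/schema/1/marcxml-v1.1",
                            "http://loc.org/stuff/marcxml-v1.1"}, marc)
marc12 = Topic("MARC 1.2", {"info:srw/schema/1/marcxml-v1.2",
                            "http://bingo.com/psi/marcxml"}, marc11)
marc20 = Topic("MARC 2.0", {"http://alex.com/psi/marc-2.0"}, marc12)
topics = [marc, marc11, marc12, marc20]

# A client asking for generic MARC is happy with the unofficial 2.0.
print(accepts(topics, "http://loc.org/stuff/marc21",
              "http://alex.com/psi/marc-2.0"))
```

Either PSI of a topic works as the lookup key, which is the whole point: the client does not need to know which identifier the other side prefers.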

As to the identifiers themselves (as in, the formatting), is that important?

Anyway, I'm suspecting I don't see what the problem seems to be. To
create the best identifier for things seems a bit of a strange
notion to me, but is this based on that there is only (or rather, that
you're trying to create) one identifier for any one thing?


Alex
-- 
---
 Project Wrangler, SOA, Information Alchemist, UX, RESTafarian, Topic Maps
-- http://shelter.nu/blog/ 


Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All

2009-05-07 Thread Alexander Johannesen
On Wed, May 6, 2009 at 18:44, Mike Taylor m...@indexdata.com wrote:
 Can't you just tell us?

Sorry, but surely you must be tired of me banging on this gong by now?
It's not that I don't want to seem helpful, but I've been writing a
bit on this here already and don't want to be marked as spam for Topic
Maps.

In the Topic Maps world our global identificators are called PSI, for
Published Subject Indicators. There's a few subtleties within this,
but they are not so different from any other identificator you'll find
elsewhere (RDF, library world, etc.) except of course they are
*always* URIs. Now, the thing here is that they should *always* be
published somewhere, whether as a part of a list or somewhere. The
next thing is that they always should resolve to something (although
the standard doesn't require this, however I'd say you're doing it wrong
if you couldn't do this, even if it sometimes is an evil necessity).

This last part is really the important bit, where any PSI will act as
1) a global identificator, and 2) resolve to a human text explaining
what it represents. Systems can just use it while at the same time
people can choose the right ones for their uses.

And, yes, the identificators can be done any way you slice them. Some
might think that, e.g., a PSI set for all dates is crazy as you need to
produce identificators for all dates (or times), and that would be
just way too much to deal with, but again, that's not an identification
problem, that's a resolver problem. If I can browse to a PSI and get
the text that this is 3rd of June, 1971, using the whatsnot calendar
style, then that's safe for me to use for my birthday. Let's pretend
the PSI is http://iso.org/datetime/03061971. By releasing a URI
template computers can work with this automatically, no frills.
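
A minimal Python sketch of what working with such a template could look like. The template string is inferred from the pretend PSI above (day, month, year as DDMMYYYY); iso.org publishes no such PSI set, and the function names date_psi and parse_date_psi are invented for this example.

```python
import re
from datetime import date

# Hypothetical URI template matching the pretend PSI above.
TEMPLATE = "http://iso.org/datetime/{day:02d}{month:02d}{year:04d}"
PATTERN = re.compile(r"http://iso\.org/datetime/(\d{2})(\d{2})(\d{4})")


def date_psi(d):
    """Build the PSI for a date by filling in the template."""
    return TEMPLATE.format(day=d.day, month=d.month, year=d.year)


def parse_date_psi(psi):
    """Recover the date from a PSI, or None if it does not fit the template."""
    match = PATTERN.fullmatch(psi)
    if match is None:
        return None
    day, month, year = (int(part) for part in match.groups())
    return date(year, month, day)


birthday = date(1971, 6, 3)
psi = date_psi(birthday)
print(psi)
print(parse_date_psi(psi) == birthday)
```

Nobody has to mint the identifiers one by one; the template generates them and the resolver only has to answer for the ones actually requested.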

Now a bit more technical; any topic (which is a Topic Map
representation of any subject, where subject is defined as anything
you can ever hope to think of) can have more than one PSI, because I
might use the PSI http://someother.org/time/date/3/6/1971 for my date.
If my application only understands the former set of PSIs, I can't
merge and find similar cross-semantics (which really is the core of
the problem this thread has been talking about). But simply attach the
second PSI to the same Topic, and you can. In fact, both parties will
understand perfectly what you're talking about.
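
To make the merging idea concrete, here is a small sketch under stated
assumptions: a topic is reduced to a set of names plus a set of PSIs,
and two topics are treated as the same subject whenever they share at
least one PSI. The Topic class and merge_topics function are
hypothetical illustrations, not the API of any Topic Maps engine.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    names: set = field(default_factory=set)
    psis: set = field(default_factory=set)


def merge_topics(topics):
    """Merge topics that share at least one PSI.

    The result is a list of topics whose PSI sets are pairwise disjoint.
    """
    merged = []
    for topic in topics:
        current = Topic(set(topic.names), set(topic.psis))
        untouched = []
        for existing in merged:
            if existing.psis & current.psis:
                # Shared PSI: same subject, so fold the two together.
                current.names |= existing.names
                current.psis |= existing.psis
            else:
                untouched.append(existing)
        merged = untouched + [current]
    return merged
```

A topic that carries both date PSIs acts as the bridge: once it is in
the map, topics that know only one of the two schemes end up merged
into the same subject.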

More complex is that the definitions of PSI sets don't have to
happen on the subject level, i.e. the Topic called Alex to which I
tried to attach my birthday. It can be moved to a meta model level,
where you say the Topic for Time and dates has the PSIs of both
organisations, and all Topics just use one or the other; we're
shifting the explicitness of identification up a notch.

Having multiple PSIs might seem a bit unordered, but it's based on the
notion of organic growth, just like the web. People will gravitate
towards using PSIs from the most trusted sources (or most accurate or
most whatever), shifting identification schemes around. This is a good
thing (organic growth) at the price of multiple identifiers, but if
the library world started creating PSIs, I betcha humanity and the
library world both could be saved in one fell swoop! (That's another
gong I like to bang)

I'm kinda anticipating Jonathan saying this is all so complex now. :)
But it's not really; your application only has to have complexity in
the small meta model you set up, *not* for every single Topic you've
got in your map. And they're mergeable and shareable, and as such can
be merged and fixed (or cleaned or sobered or made less complex) for
all your various needs also.

Anyway, that's the basics. Let me know if you want me to bang on. :)
For me, the problem the library world faces isn't really the mechanics
of this (because this is solvable, and I guess you just have to trust
that the Topic Maps community has been doing this for the last 10
years or so already :), but how you're going to fit existing
resources into FRBR and RDA, though that's a separate discussion.


Regards,

Alex
-- 
---
 Project Wrangler, SOA, Information Alchemist, UX, RESTafarian, Topic Maps
-- http://shelter.nu/blog/ 


Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All

2009-05-06 Thread Mike Taylor
Alexander Johannesen writes:
  With Topic Maps it's been solved years and years ago, and it's the
  part of it that the RDF world didn't think of until recently (and
  applied their kludges). I'm not going to bang my gong on this, just
  urge you to read up on PSIs.

Can't you just tell us?

 _/|____
/o ) \/  Mike Taylor    m...@indexdata.com    http://www.miketaylor.org.uk
)_v__/\  It takes a certain kind of bad writer to write badly sincerely
 -- Richard Sherbaniuk.


Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All

2009-05-03 Thread Jonathan Rochkind
The new URI may be unavoidable to resolve the present situation, especially 
realizing that current attempted solutions do not deal with versioning 
successfully, as Jenn Riley notes through experience. 

What is the current state of the art for dealing with versioning in URIs, with 
having URIs that specify a particular version of the thing-identified, but also 
allow you to easily tell that any of those URIs represents the thing at some 
version, when you don't care about what version in particular? 

Sure, conceptually and theoretically you could use ANY arbitrary URIs to refer 
to a specific version. http://something.org/mods refers to mods 3.0, and 
http://else.org/mods refers to 3.1, and http://foo.com/bar refers to mods 3.2.  
And then I guess you could theoretically have RDF that asserts the 
same-thing-different-version relationship between them?  I think?  I'm no RDF 
expert, is why I ask. 

But even if that's conceptually possible, it wouldn't be a good idea. Too 
confusing to humans (and being un-confusing to humans is part of what we do to 
try and encourage consistency and consensus in use); also too much trouble to 
discover that two URIs represent different versions of the same thing when you 
don't really care about version, you've got to actually follow the RDF 
spiderweb. We've got to build URIs that work for the fantasy where all systems 
really DO understand RDF (and for the present few that do), AND that still work 
for the majority of present day cases where systems don't. 

http://something.info/mods/3.0?

http://something.info/mods#3.0   ?

Naturally, either of those could give you RDF representations of the OTHER 
existing URIs that represent that particular version of MODS. 

Could http://something.info/mods then give you RDF representations of the other 
existing URIs that represent MODS regardless of version?
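
As a rough sketch of why a predictable shape helps systems that don't
speak RDF: with either of the two URI patterns floated above, plain
string handling is enough to tell "same format, some version" from
"different format". The something.info URIs are the made-up examples
from this message, and the function names are hypothetical.

```python
import re
from urllib.parse import urldefrag

# A version is one or more dot-separated integers, e.g. "3" or "3.2".
_VERSION = re.compile(r"^\d+(\.\d+)*$")


def split_version(uri):
    """Return (unversioned_uri, version) for '.../mods/3.0' or '.../mods#3.0'.

    The version is None when the URI carries no recognisable version.
    """
    base, fragment = urldefrag(uri)
    if fragment:
        if _VERSION.match(fragment):
            return base, fragment
        return uri, None
    head, _, tail = base.rpartition("/")
    if head and _VERSION.match(tail):
        return head, tail
    return uri, None


def same_format(uri_a, uri_b):
    """True when both URIs name the same thing, ignoring version."""
    return split_version(uri_a)[0] == split_version(uri_b)[0]
```

With arbitrary, unrelated URIs per version (the something.org versus
else.org case above) no such shortcut exists, and following RDF links
is the only way to find out.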

Are other people in linked data and URIs in general doing anything that makes 
sense in these areas?

Jonathan

From: Code for Libraries [code4...@listserv.nd.edu] On Behalf Of Ross Singer 
[rossfsin...@gmail.com]
Sent: Friday, May 01, 2009 9:16 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them 
All

I agree that most software probably won't do it.  But the data will be
there and free and relatively easy to integrate if one wanted to.

In a lot of ways, Jonathan, it's got Umlaut written all over it.

Now to get to Jonathan's point -- yes, I think the primary goal still
needs to be working towards bringing use of identifiers for a given
thing to a single variant.  However, we would obviously have to know
what the options are in order to figure out what that one is -- while
we're doing that, why not enter the different options into the
registry and document them in some way (such as, who uses this
variant?).  Voila, we have a crosswalk.

Of course, the downside is that we technically also have a new URI
for this resource (since the skos:Concept would need to have a URI),
but we could probably hand wave that away as the id for the registry
concept, not the data format.

So -- we seem to have some agreement here?

-Ross.

On Fri, May 1, 2009 at 5:53 PM, Jonathan Rochkind rochk...@jhu.edu wrote:
 From my perspective, all we're talking about is using the same URI to refer
 to the same format(s) across the library community standards this community
 generally can control.

 That will make things much easier for developers, especially but not only
 when building software that interacts with more than one of these standards
 (as client or server).

 Now, once you've done that, you've ALSO set the stage for that kind of RDF
 scenario, among other RDF scenarios. I agree with Mike that that particular
 scenario is unlikely, but once you set the stage for RDF experimentation
 like that, if folks are interested in experimenting (and many in our
 community are), maybe something more attractively useful will come out of
 it.

 Or maybe not. Either way, you've made things easier and more inter-operable
 just by using the same set of URIs across multiple standards to refer to the
 same thing. So, yeah, I'd still focus on that, rather than any kind of
 'cross walk', RDF or not. It's the actual use case in front of us, in which
 the benefit will definitely be worth the effort (if the effort is kept
 manageable by avoiding trying to solve the entire universe of problems at
 once).

 Jonathan

 Mike Taylor wrote:

 So what are we talking about here?  A situation where an SRU server
 receives a request for response records to be delivered in a
 particular format, it doesn't recognise the format URI, so it goes and
 looks it up in an RDF database and discovers that it's equivalent to a
 URI that it does know?  Hmm ... it's crazy, but it might just work.

 I bet no-one does it, though.


Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All

2009-05-03 Thread Alexander Johannesen
With Topic Maps it's been solved years and years ago, and it's the
part of it that the RDF world didn't think of until recently (and
applied their kludges). I'm not going to bang my gong on this, just
urge you to read up on PSIs.

Alex
-- 
---
 Project Wrangler, SOA, Information Alchemist, UX, RESTafarian, Topic Maps
-- http://shelter.nu/blog/ 


Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All

2009-05-02 Thread Riley, Jenn
One thing I note in the current SRU list is that versioning might be an issue. 
MODS 3.0, 3.1, 3.2, and 3.3 all have different identifiers (naturally) but the 
same short name. I've run into this issue with OAI-PMH, where there isn't a 
formal registry of metadata formats but general conventions that most folks 
follow. The issue there is that from the OAI-PMH metadataPrefix (which I think 
is the counterpart of the SRU short name) you don't know which version of the format 
is being used. For minor release versions in practice this is more of an 
annoyance than a big problem, but I suspect for major release versions it could 
be a bigger issue. In the OpenURL list, mods is limited to *only* MODS 3.2. 
So when harmonizing these it might be useful to have a convention for dealing 
with version numbers within a format.

Jenn



Jenn Riley
Metadata Librarian
Digital Library Program
Indiana University - Bloomington
Wells Library W501
(812) 856-5759
www.dlib.indiana.edu

Inquiring Librarian blog: www.inquiringlibrarian.blogspot.com




Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All

2009-05-01 Thread Mike Taylor
Ray Denenberg, Library of Congress writes:
  Thanks, Ross. For SRU, this is an opportune time to reconcile these
  differences.  Opportune, because we are approaching standardization
  of SRU/CQL within OASIS, and there will be a number of areas that
  need to change.

Agreed.  Looking at the situation as it stands, it really does seem
insane that we've ended up with these three or four different URIs
describing each of the data formats; and if we with our library
background can't get this right, what hope does the rest of the world
have?  Because OpenURL 1.0 seems to have been more widely implemented
than SRU (though much less so than OpenURL 0.1), I think it would be
less painful to change SRU to use OpenURL's data-format URIs than
vice versa; good implementations will of course recognise both old and
new URIs.

  Some observations.
  
  1. the 'ofi' namespace of 'info' has the advantage that the name,
  'ofi', isn't necessarily tied to a community or application (I
  suppose one could claim that the acronym 'ofi' means "openURL
  something starting with 'f' for Identifiers" but it doesn't say
  so anywhere that I can find.)  However, the namespace itself (if
  not the name) is tied to OpenURL: "Namespace of Registry
  Identifiers used by the NISO OpenURL Framework Registry".  That
  seems like a simple problem to fix.  (Changing that title would not
  cause any technical problems.)
  
  2. In contrast, with the srw namespace, the actual name is
  srw. So at least in name, it is tied to an application.

Agreed -- another reason to prefer the OpenURL standard's URIs.

  3. On the other side, the srw namespace has the distinct advantage
  of built-in extensibility.  For the URI:
  info:srw/schema/1/onix-v2.0, the 1 is an authority.  There are
  (currently) 15 such authorities, they are listed in the (second)
  table at http://www.loc.gov/standards/sru/resources/infoURI.html
  
  Authority 1 is the SRU maintenance agency, and the objects
  registered under that authority are, more-or-less, public. But
  objects can be defined under the other authorities with no
  registration process required.
  
  4.  ofi does not offer this sort of extensibility.

But SRU's has always been a clumsy extensibility mechanism -- the
assignment of integer identifiers for sub-namespaces has the distinct
whiff of an OID hangover.  In these enlightened days, we use our
domains for namespace partitioning, as with HTTP URLs.

I'd like to see the info:ofi URI specification extended to allow this
kind of thing:
info:ofi/ext:miketaylor.org.uk:whateverTheHeckIWantToPutHere
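
To compare the two partitioning schemes side by side, here is a small
sketch. The first parser follows the info:srw/schema/<authority>/<name>
shape Ray describes; the second follows the info:ofi/ext:<domain>:<rest>
shape, which is only the proposal in this message and not part of any
existing specification. Both function names are made up.

```python
def parse_srw_schema_uri(uri):
    """Split 'info:srw/schema/1/onix-v2.0' into (authority, name)."""
    prefix = "info:srw/schema/"
    if not uri.startswith(prefix):
        raise ValueError(f"not an info:srw schema URI: {uri}")
    authority, sep, name = uri[len(prefix):].partition("/")
    if not sep or not authority.isdigit() or not name:
        raise ValueError(f"malformed info:srw schema URI: {uri}")
    # The authority is an integer that has to be looked up in a table.
    return int(authority), name


def parse_ofi_ext_uri(uri):
    """Split the PROPOSED 'info:ofi/ext:<domain>:<local>' form."""
    prefix = "info:ofi/ext:"
    if not uri.startswith(prefix):
        raise ValueError(f"not an info:ofi/ext URI: {uri}")
    domain, sep, local = uri[len(prefix):].partition(":")
    if not sep or not domain or not local:
        raise ValueError(f"malformed info:ofi/ext URI: {uri}")
    # The domain name is self-describing; no central table is needed.
    return domain, local
```

The difference shows up in what you get back: an integer that means
nothing without the authority table, versus a domain name whose owner
is already known.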

  So, if we were going to unify these two systems (and I can't speak
  for the SRU community and commit to doing so yet) the extensibility
  offered by the srw approach would be an absolute requirement.  If
  it could somehow be built in to ofi, then I would not be opposed to
  migrating the srw identifiers.  Another approach would be to
  register an entirely new 'info:' URI namespace and migrating all of
  these identifiers to the new namespace.

Oh, gosh, no, introducing yet ANOTHER set of identifiers is really not
the answer! :-)

 _/|____
/o ) \/  Mike Taylor    m...@indexdata.com    http://www.miketaylor.org.uk
)_v__/\  Conclusion: is left to the reader (see Table 2).
 Acknowledgements: I wrote this paper for money -- A. A. Chastel,
 _A critical analysis of the explanation of red-shifts by a new
 field_, AA 53, 67 (1976)


Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All

2009-05-01 Thread Mike Taylor
Jonathan Rochkind writes:
  Crosswalk is exactly the wrong answer for this. Two very small
  overlapping communities of mostly library developers can surely agree
  on using the same identifiers, and then we make things easier for
  US.  We don't need to solve the entire universe of problems. Solve
  the simple problem in front of you in the simplest way that could
  possibly work and still leave room for future expansion and
  improvement. From that, we learn how to solve the big problems,
  when we're ready. Overreach and try to solve the huge problem
  including every possible use case, many of which don't apply to you
  but SOMEDAY MIGHT... and you end up with the kind of
  over-abstracted over-engineered
  too-complicated-to-actually-catch-on solutions that... we in the
  library community normally end up with.

I strongly, STRONGLY agree with this.  It's exactly what I was about
to write myself, in response to Peter's message, until I saw that
Jonathan had saved me the trouble :-)  Let's solve the problem that's
in front of us right now: bring SRU into harmony with OpenURL in this
respect, and the very act of doing so will lend extra legitimacy to
the agreed-on identifiers, which will then be more strongly positioned
as The Right Identifiers for other initiatives to use.

 _/|____
/o ) \/  Mike Taylor    m...@indexdata.com    http://www.miketaylor.org.uk
)_v__/\  You cannot really appreciate Dilbert unless you've read it in
 the original Klingon. -- Klingon Programming Mantra


Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All

2009-05-01 Thread Peter Noerr
I am pleased to disagree to various levels of 'strongly' (if we can agree on a 
definition for it :-).

Ross earlier gave a sample of a 'crosswalk' for my MARC problem. What he 
supplied

-snip
We could have something like:
<http://purl.org/DataFormat/marcxml>
  . skos:prefLabel "MARC21 XML" .
  . skos:notation "info:srw/schema/1/marcxml-v1.1" .
  . skos:notation "info:ofi/fmt:xml:xsd:MARC21" .
  . skos:notation "http://www.loc.gov/MARC21/slim" .
  . skos:broader <http://purl.org/DataFormat/marc> .
  . skos:description "..." .

Or maybe those skos:notations should be owl:sameAs -- anyway, that's not really 
the point.  The point is that all of these various identifiers would be valid, 
but we'd have a real way of knowing what they actually mean.  Maybe this is 
what you mean by a crosswalk.
--end

Is exactly what I meant by a crosswalk. Basically a translating dictionary 
which allows any entity (system or person) to relate the various identifiers.
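
For what it's worth, the "translating dictionary" can be sketched in a
few lines without any RDF tooling. The data is Ross's sample entry
above; the structure and function names are assumptions made for
illustration, not a proposed registry format.

```python
# One registry entry per format, keyed by the registry's own URI.
CROSSWALK = {
    "http://purl.org/DataFormat/marcxml": {
        "prefLabel": "MARC21 XML",
        "notations": [
            "info:srw/schema/1/marcxml-v1.1",
            "info:ofi/fmt:xml:xsd:MARC21",
            "http://www.loc.gov/MARC21/slim",
        ],
    },
}


def build_index(crosswalk):
    """Map every known identifier (and the registry URI) to the registry URI."""
    index = {}
    for concept, entry in crosswalk.items():
        index[concept] = concept
        for notation in entry["notations"]:
            index[notation] = concept
    return index


def same_format(index, uri_a, uri_b):
    """True when both identifiers are registered for the same format."""
    concept_a = index.get(uri_a)
    return concept_a is not None and concept_a == index.get(uri_b)
```

A system that receives an identifier it doesn't recognise looks it up
in the index and, if it lands on a concept it does support, carries on
as if it had been given its own preferred identifier.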

I would love to see a single unified set of identifiers; my life as a wrangler 
of record semantics would be so much easier. But I don't see it happening. 

That does not mean we should not try. Even a unification in our space (and if 
not in the library/information space, then where? as Mike said) reduces the 
larger problem. However I don't believe it is a scalable solution (which may 
not matter: if all of a group of users agree, then why not leave them to it) as, 
at any time, one group/organisation/person/system could introduce a new scheme, 
and a world view which relies on unified semantics would no longer be viable.

Which means until global unification on an object (better a (large) set of 
objects) is achieved it will be necessary to have the translating dictionary 
and systems which know how to use it. Unification reduces Ray's list of 15 
alternative URIs to 14 or 13 or whatever. As long as that number is greater 
than 1, translation will be necessary. (I will leave aside discussions of massive 
record bloat, continual system re-writes, the politics of whose view prevails, 
the unhelpfulness of compromises for joint solutions, and so on.)

Peter



Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All

2009-05-01 Thread Ross Singer
Ideally, though, if we have some buy in and extend this outside our
communities, future identifiers *should* have fewer variations, since
people can find the appropriate URI for the format and use that.

I readily admit that this is wishful thinking, but so be it.  I do
think that modeling it as SKOS/RDF at least would make it attractive
to the Linked Data/Semweb crowd who are likely the sorts of people
that would be interested in seeing URIs, anyway.

I mean, the worst that can happen is that nobody cares, right?

-Ross.


Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All

2009-05-01 Thread Peter Noerr
I agree with Ross wholeheartedly. Particularly in the use of an RDF based 
mechanism to describe, and then have systems act on, the semantics of these 
uniquely identified objects. Semantics (as in Web) has been exercising my 
thoughts recently and the problems we have here are writ large over all the SW 
people are trying to achieve. Perhaps we can help...

Peter 


Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All

2009-05-01 Thread Jonathan Rochkind
From my perspective, all we're talking about is using the same URI to 
refer to the same format(s) across the library community standards this 
community generally can control.


That will make things much easier for developers, especially but not 
only when building software that interacts with more than one of these 
standards (as client or server).


Now, once you've done that, you've ALSO set the stage for that kind of 
RDF scenario, among other RDF scenarios. I agree with Mike that that 
particular scenario is unlikely, but once you set the stage for RDF 
experimentation like that, if folks are interested in experimenting (and 
many in our community are), maybe something more attractively useful 
will come out of it.


Or maybe not. Either way, you've made things easier and more 
inter-operable just by using the same set of URIs across multiple 
standards to refer to the same thing. So, yeah, I'd still focus on that, 
rather than any kind of 'cross walk', RDF or not. It's the actual use 
case in front of us, in which the benefit will definitely be worth the 
effort (if the effort is kept manageable by avoiding trying to solve the 
entire universe of problems at once).


Jonathan

Mike Taylor wrote:

So what are we talking about here?  A situation where an SRU server
receives a request for response records to be delivered in a
particular format, it doesn't recognise the format URI, so it goes and
looks it up in an RDF database and discovers that it's equivalent to a
URI that it does know?  Hmm ... it's crazy, but it might just work.

I bet no-one does it, though.

 _/|____
/o ) \/  Mike Taylor    m...@indexdata.com    http://www.miketaylor.org.uk
)_v__/\  Someday, I'll show you around monster-free Tokyo -- dialogue
 from Gamera: Guardian of the Universe





Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All

2009-05-01 Thread Mike Taylor
So what are we talking about here?  A situation where an SRU server
receives a request for response records to be delivered in a
particular format, it doesn't recognise the format URI, so it goes and
looks it up in an RDF database and discovers that it's equivalent to a
URI that it does know?  Hmm ... it's crazy, but it might just work.

I bet no-one does it, though.
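
The lookup described above can be sketched in a few lines. This is only an illustration, not anything an SRU server is known to do: the equivalence table is hypothetical, it is hard-coded rather than fetched from an RDF store, and it treats the three MARCXML identifiers from Ross's sample as interchangeable. The names EQUIVALENT_FORMATS and resolve_format are invented for the example.

```python
# Hypothetical equivalence table: each set lists URIs asserted to name
# the same format (the owl:sameAs-style reading of Ross's sample).
EQUIVALENT_FORMATS = [
    {
        "info:srw/schema/1/marcxml-v1.1",
        "info:ofi/fmt:xml:xsd:MARC21",
        "http://www.loc.gov/MARC21/slim",
    },
]


def resolve_format(requested, supported):
    """Return a supported URI equivalent to `requested`, or None."""
    supported = set(supported)
    if requested in supported:
        return requested
    for group in EQUIVALENT_FORMATS:
        if requested in group:
            # sorted() keeps the choice deterministic if several match.
            matches = sorted(group & supported)
            if matches:
                return matches[0]
    return None


# A server that only knows the SRW identifier receives the OpenURL one.
known = {"info:srw/schema/1/marcxml-v1.1"}
print(resolve_format("info:ofi/fmt:xml:xsd:MARC21", known))
```

A request for an identifier that appears in no equivalence set simply returns None, so the server can fall back to its usual "unknown schema" response.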

 _/|____
/o ) \/  Mike Taylor    m...@indexdata.com    http://www.miketaylor.org.uk
)_v__/\  Someday, I'll show you around monster-free Tokyo -- dialogue
 from Gamera: Guardian of the Universe




Peter Noerr writes:
  I agree with Ross wholeheartedly. Particularly in the use of an RDF based 
  mechanism to describe, and then have systems act on, the semantics of these 
  uniquely identified objects. Semantics (as in Web) has been exercising my 
  thoughts recently and the problems we have here are writ large over all the 
  SW people are trying to achieve. Perhaps we can help...
  
  Peter 
  
   -Original Message-
   From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
   Ross Singer
   Sent: Friday, May 01, 2009 13:40
   To: CODE4LIB@LISTSERV.ND.EDU
   Subject: Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule
   Them All
   
   Ideally, though, if we have some buy in and extend this outside our
   communities, future identifiers *should* have fewer variations, since
   people can find the appropriate URI for the format and use that.
   
   I readily admit that this is wishful thinking, but so be it.  I do
   think that modeling it as SKOS/RDF at least would make it attractive
   to the Linked Data/Semweb crowd who are likely the sorts of people
   that would be interested in seeing URIs, anyway.
   
   I mean, the worst that can happen is that nobody cares, right?
   
   -Ross.
   
   On Fri, May 1, 2009 at 3:41 PM, Peter Noerr pno...@museglobal.com wrote:
I am pleased to disagree to various levels of 'strongly' (if we can agree
   on a definition for it :-).
   
Ross earlier gave a sample of a 'crosswalk' for my MARC problem. What he
   supplied
   
-snip
We could have something like:
<http://purl.org/DataFormat/marcxml>
 . skos:prefLabel "MARC21 XML" .
 . skos:notation "info:srw/schema/1/marcxml-v1.1" .
 . skos:notation "info:ofi/fmt:xml:xsd:MARC21" .
 . skos:notation "http://www.loc.gov/MARC21/slim" .
 . skos:broader <http://purl.org/DataFormat/marc> .
 . skos:description "..." .
   
Or maybe those skos:notations should be owl:sameAs -- anyway, that's not
   really the point.  The point is that all of these various identifiers would
   be valid, but we'd have a real way of knowing what they actually mean.
    Maybe this is what you mean by a crosswalk.
--end
   
Is exactly what I meant by a crosswalk. Basically a translating
   dictionary which allows any entity (system or person) to relate the various
   identifiers.
   
I would love to see a single unified set of identifiers, my life as a
   wrangler of record semantics would be so much easier. But I don't see it
   happening.
   
That does not mean we should not try. Even a unification in our space
   (and if not in the library/information space, then where? as Mike said)
   reduces the larger problem. However I don't believe it is a scalable
   solution (which may not matter if all of a group of users agree, then why
   not leave them to it) as, at any time one group/organisation/person/system
   could introduce a new scheme, and a world view which relies on unified
   semantics would no longer be viable.
   
Which means until global unification on an object (better a (large) set
   of objects) is achieved it will be necessary to have the translating
   dictionary and systems which know how to use it. Unification reduces Ray's
   list of 15 alternative uris to 14 or 13 or whatever. As long as that number
   is > 1 translation will be necessary. (I will leave aside discussions of
   massive record bloat, continual system re-writes, the politics of whose
   view prevails, the unhelpfulness of compromises for joint solutions, and so
   on.)
   
Peter
   
-Original Message-
From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
Mike Taylor
Sent: Friday, May 01, 2009 02:36
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule
Them All
   
Jonathan Rochkind writes:
  Crosswalk is exactly the wrong answer for this. Two very small
  overlapping communities of most library developers can surely agree
  on using the same identifiers, and then we make things easier for
  US.  We don't need to solve the entire universe of problems. Solve
  the simple problem in front of you in the simplest way that could
  possibly work and still leave room for future expansion

Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All

2009-05-01 Thread Ross Singer
I agree that most software probably won't do it.  But the data will be
there and free and relatively easy to integrate if one wanted to.

In a lot of ways, Jonathan, it's got Umlaut written all over it.

Now to get to Jonathan's point -- yes, I think the primary goal still
needs to be working towards bringing use of identifiers for a given
thing to a single variant.  However, we would obviously have to know
what the options are in order to figure out what that one is -- while
we're doing that, why not enter the different options into the
registry and document them in some way (such as, who uses this
variant?).  Voila, we have a crosswalk.

Of course, the downside is that we technically also have a new URI
for this resource (since the skos:Concept would need to have a URI),
but we could probably hand wave that away as the id for the registry
concept, not the data format.
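
As a rough sketch of what such a registry record might hold, assuming nothing about how the registry would really be stored: each entry has one preferred URI, a label, and the variant identifiers with a note on who uses them. The FormatEntry class, the build_crosswalk helper, and the "used by" notes are all made up for illustration; the purl.org URI is the placeholder from the earlier sample.

```python
from dataclasses import dataclass, field


@dataclass
class FormatEntry:
    """One registry concept: a preferred URI plus variants seen in the wild."""
    preferred: str
    label: str
    # variant URI -> who is known to use it (illustrative free text)
    variants: dict = field(default_factory=dict)


def build_crosswalk(entries):
    """Map every known identifier (preferred or variant) to its preferred URI."""
    index = {}
    for entry in entries:
        index[entry.preferred] = entry.preferred
        for variant in entry.variants:
            index[variant] = entry.preferred
    return index


marcxml = FormatEntry(
    preferred="http://purl.org/DataFormat/marcxml",  # placeholder URI
    label="MARC21 XML",
    variants={
        "info:srw/schema/1/marcxml-v1.1": "SRU",
        "info:ofi/fmt:xml:xsd:MARC21": "OpenURL",
    },
)
crosswalk = build_crosswalk([marcxml])
print(crosswalk["info:ofi/fmt:xml:xsd:MARC21"])
```

Documenting the variants and deriving the crosswalk are the same act here: the reverse index falls out of the records that say who uses what.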

So -- we seem to have some agreement here?

-Ross.

On Fri, May 1, 2009 at 5:53 PM, Jonathan Rochkind <rochk...@jhu.edu> wrote:
 From my perspective, all we're talking about is using the same URI to refer
 to the same format(s) across the library community standards this community
 generally can control.

 That will make things much easier for developers, especially but not only
 when building software that interacts with more than one of these standards
 (as client or server).

 Now, once you've done that, you've ALSO set the stage for that kind of RDF
 scenario, among other RDF scenarios. I agree with Mike that that particular
 scenario is unlikely, but once you set the stage for RDF experimentation
 like that, if folks are interested in experimenting (and many in our
 community are), maybe something more attractively useful will come out of
 it.

 Or maybe not. Either way, you've made things easier and more inter-operable
 just by using the same set of URIs across multiple standards to refer to the
 same thing. So, yeah, I'd still focus on that, rather than any kind of
 'cross walk', RDF or not. It's the actual use case in front of us, in which
 the benefit will definitely be worth the effort (if the effort is kept
 manageable by avoiding trying to solve the entire universe of problems at
 once).

 Jonathan


[CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All

2009-04-30 Thread Ross Singer
Hello everybody.  I apologize for the crossposting, but this is an
area that could (potentially) affect every one of these groups.  I
realize that not everybody will be able to respond to all lists,
but...

First of all, some back story (Code4Lib subscribers can probably skip ahead):

Jangle [1] requires URIs to explicitly declare the format of the data
it is transporting (binary marc, marcxml, vcard, DLF
simpleAvailability, MODS, EAD, etc.).  In the past, it has used its
own URI structure for this (http://jangle.org/vocab/formats#...) but
this has always been with the intention of moving out of
jangle.org into a more generic space so it could be used by other
initiatives.

This same concept came up in UnAPI [2] (I think this thread:
http://old.onebiglibrary.net/yale/cipolo/gcs-pcs-list/2006-March/thread.html#682
discusses it a bit - there is a reference there that it maybe had come
up before) although was rejected ultimately in favor of an (optional)
approach more in line with how OAI-PMH disambiguates metadata formats.
That being said, this page used to try to set a sort of convention
around the UnAPI formats:
http://unapi.stikipad.com/unapi/show/existing+formats
But it's now just a squatter page.

Jakob Voss pointed out that SRU has a schema registry and that it
would make sense to coordinate with this rather than mint new URIs for
things that have already been defined there:
http://www.loc.gov/standards/sru/resources/schemas.html

This, of course, made a lot of sense.  It also made me realize that
OpenURL *also* has a registry of metadata formats:
http://alcme.oclc.org/openurl/servlet/OAIHandler?verb=ListRecords&metadataPrefix=oai_dc&set=Core:Metadata+Formats
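
That registry link is an ordinary OAI-PMH ListRecords request. A minimal sketch of assembling such a query string with Python's standard library follows; the base URL and parameter values are taken from the link above, and nothing here checks whether the service still answers. Note that urlencode percent-encodes the colon in the set name, which is an equivalent spelling of the same value.

```python
from urllib.parse import urlencode

# Base URL and parameters copied from the registry link in this message.
BASE = "http://alcme.oclc.org/openurl/servlet/OAIHandler"
params = {
    "verb": "ListRecords",           # OAI-PMH verb
    "metadataPrefix": "oai_dc",      # ask for Dublin Core records
    "set": "Core:Metadata Formats",  # the set holding the format entries
}

url = BASE + "?" + urlencode(params)
print(url)
```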

The problem here is that OpenURL and SRW are using different info URIs
to describe the same things:

info:srw/schema/1/marcxml-v1.1

info:ofi/fmt:xml:xsd:MARC21

or

info:srw/schema/1/onix-v2.0

info:ofi/fmt:xml:xsd:onix

The latter technically isn't the same thing since the OpenURL one
claims it's an identifier for ONIX 2.1, but if I wasn't sending this
email now, eventually SRU would have registered
info:srw/schema/1/onix-v2.1

There are several other examples, as well (MODS, ISO20775, etc.) and
it's not a stretch to envision more in the future.

So there are a couple of questions here.

First, and most importantly, how do we reconcile these different
identifiers for the same thing?  Can we come up with some agreement on
which ones we should really use?

Secondly, and this gets to the reason why any of this was brought up
in the first place, how can we coordinate these identifiers more
effectively and efficiently to reuse among various specs and
protocols, but not:
1) be tied to a particular community
2) require some laborious and lengthy submission and review process to
just say "hey, here's my FOAF available via UnAPI"
3) be so lax that it throws all hope of authority out the window
?

I would expect the various communities to still maintain their own
registries of approved data formats (well, OpenURL and SRU, anyway
-- it's not as appropriate to UnAPI or Jangle).

Does something like this interest any of you?  Is there value in such
an initiative?

Thanks,
-Ross.


Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All

2009-04-30 Thread Ray Denenberg, Library of Congress
Thanks, Ross. For SRU, this is an opportune time to reconcile these 
differences.  Opportune, because we are approaching standardization of 
SRU/CQL within OASIS, and there will be a number of areas that need to 
change.


Some observations.

1. the 'ofi' namespace of 'info' has the advantage that the name, "ofi", 
isn't necessarily tied to a community or application (I suppose one could 
claim that the acronym "ofi" means "openURL something starting with 'f' 
for Identifiers" but it doesn't say so anywhere that I can find.)  However, 
the namespace itself (if not the name) is tied to OpenURL by its title: 
"Namespace of Registry Identifiers used by the NISO OpenURL Framework 
Registry".  That seems like a simple problem to fix.  (Changing that title 
would not cause any technical problems.)


2. In contrast, with the srw namespace, the actual name is "srw". So at 
least in name, it is tied to an application.


3. On the other side, the srw namespace has the distinct advantage of 
built-in extensibility.  For the URI info:srw/schema/1/onix-v2.0, the "1" 
is an authority.  There are (currently) 15 such authorities; they are 
listed in the (second) table at 
http://www.loc.gov/standards/sru/resources/infoURI.html


Authority 1  is the SRU maintenance agency, and the objects registered 
under that authority are, more-or-less, public. But objects can be defined 
under the other authorities with no registration process required.
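
To make the role of the authority concrete, here is a small sketch that splits an srw identifier into its parts. The three-part layout (object type, authority, name) is inferred from the single example URI above and is an assumption, not a statement of the registered syntax; SrwIdentifier and parse_srw_uri are names invented for the example.

```python
from typing import NamedTuple


class SrwIdentifier(NamedTuple):
    object_type: str  # e.g. "schema"
    authority: str    # e.g. "1", the SRU maintenance agency
    name: str         # e.g. "onix-v2.0"


def parse_srw_uri(uri):
    """Split info:srw/<type>/<authority>/<name> into its parts (assumed layout)."""
    prefix = "info:srw/"
    if not uri.startswith(prefix):
        raise ValueError("not an info:srw identifier: " + uri)
    parts = uri[len(prefix):].split("/", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError("expected type/authority/name in: " + uri)
    return SrwIdentifier(*parts)


ident = parse_srw_uri("info:srw/schema/1/onix-v2.0")
print(ident.authority)
```

Under this reading, an identifier minted under another authority differs only in the middle segment, which is what lets new objects be defined without a central registration step.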


4.  ofi does not offer this sort of extensibility.


So, if we were going to unify these two systems (and I can't speak for the 
SRU community and commit to doing so yet) the extensibility offered by the 
srw approach would be an absolute requirement.   If it could somehow be 
built in to ofi,  then I would not be opposed to migrating the srw 
identifiers.   Another approach would be to register an entirely new 
'info:' URI namespace and migrate all of these identifiers to the new 
namespace.


--Ray


- Original Message - 
From: Ross Singer rossfsin...@gmail.com

To: z...@listserv.loc.gov
Sent: Thursday, April 30, 2009 2:59 PM
Subject: One Data Format Identifier (and Registry) to Rule Them All




Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All

2009-04-30 Thread Peter Noerr
Some further observations. So far this threadling has mentioned only trying to 
unify two different sets of identifiers. However there are a much larger number 
of them out there (and even larger numbers of schemas and other 
standard-things-that-everyone-should-use-so-we-all-know-what-we-are-talking-about)
 and the problem exists for any of these things (identifiers, etc.) where there 
are more than one of them. So really unifying two sets of identifiers, while 
very useful, is not actually going to solve much.

Is there any broader methodology we could adopt which potentially allows 
multiple unifications or (my favourite) cross-walks? (Complete unification 
requires that everybody agrees and sticks to it, and human history is sort 
of not on that track...) And who (people and organizations) would undertake this?

Ross' point that a lightweight approach is necessary for any sort of adoption 
is well taken, but this is a problem (which plagues all we do in federated 
search) which cannot just be solved by another registry. Somebody/organisation has to look at 
the identifiers or whatever and decide that two of them are identical or, 
worse, only partially overlap and hence scope has to be defined. In a syntax 
that all understand of course. Already in this thread we have the sub/super 
case question from Karen (in a post on the openurl (or Z39.88 <sigh> - 
identifiers!) listserv). And the various identifiers for MARC (below) could 
easily be for MARC-XML, MARC21-ISO2709, MARCUK-ISO2709. Now explain in words of 
one (computer understandable) syllable what the differences are. 

I'm not trying to make problems. There are problems and this is only a small 
subset of them, and they confound us every day. I would love to adopt standard 
definitions for these things, but which Standard? Because anyone can produce 
any identifier they like, we have decided that the unification of them has to 
be kept internal where we at least have control of the unifications, even if 
they change pretty frequently.

Peter


Dr Peter Noerr
CTO, MuseGlobal, Inc.

+1 415 896 6873 (office)
+1 415 793 6547 (mobile)
www.museglobal.com


 -Original Message-
 From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
 Ross Singer
 Sent: Thursday, April 30, 2009 12:00
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them
 All
 

Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All

2009-04-30 Thread Jonathan Rochkind
Crosswalk is exactly the wrong answer for this. Two very small overlapping 
communities of most library developers can surely agree on using the same 
identifiers, and then we make things easier for US.  We don't need to solve the 
entire universe of problems. Solve the simple problem in front of you in the 
simplest way that could possibly work and still leave room for future expansion 
and improvement. From that, we learn how to solve the big problems, when we're 
ready. Overreach and try to solve the huge problem including every possible use 
case, many of which don't apply to you but SOMEDAY MIGHT... and you end up with 
the kind of over-abstracted over-engineered 
too-complicated-to-actually-catch-on solutions that... we in the library 
community normally end up with. 

From: Code for Libraries [code4...@listserv.nd.edu] On Behalf Of Peter Noerr 
[pno...@museglobal.com]
Sent: Thursday, April 30, 2009 6:37 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them 
All



 -Original Message-
 From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
 Ross Singer
 Sent: Thursday, April 30, 2009 12:00
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them
 All

 Hello everybody.  I apologize for the crossposting, but this is an
 area that could (potentially) affect every one of these groups.  I
 realize that not everybody will be able to respond to all lists,
 but...

 First of all, some back story (Code4Lib subscribers can probably skip
 ahead):

 Jangle [1] requires URIs to explicitly declare the format of the data
 it is transporting (binary marc, marcxml, vcard, DLF
 simpleAvailability, MODS, EAD, etc.).  In the past, it has used its
 own URI structure for this (http://jangle.org/vocab/formats#...) but
 this has always been with the intention of moving out of
 jangle.org into a more generic space so it could be used by other
 initiatives.

 This same concept came up in UnAPI [2] (I think this thread:
 http://old.onebiglibrary.net/yale/cipolo/gcs-pcs-list/2006-March/thread.html#682
 discusses it a bit - there is a reference there that it maybe had come
 up before) although was rejected ultimately in favor of an (optional)
 approach more in line with how OAI-PMH disambiguates metadata formats.
 That being said, this page used to try to set a sort of convention
 around the UnAPI formats:
 http://unapi.stikipad.com/unapi/show/existing+formats
 But it's now just a squatter page.

 Jakob Voss pointed out that SRU has a schema registry and that it
 would make sense to coordinate with this rather than mint new URIs for
 things that have already been defined there:
 http://www.loc.gov/standards/sru/resources/schemas.html

 This, of course, made a lot