Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All

2009-05-14 Thread Rob Sanderson
I'll quote Mike (and most common approaches to the problem):  

Don't Do That Then.
:)

Rob

On Thu, 2009-05-14 at 13:18 +0100, Alexander Johannesen wrote:
> On Thu, May 14, 2009 at 17:35, Rob Sanderson  wrote:
> > For example, the owl:sameAs predicate is used to express that the
> > subject and object are the same 'thing'.  Then the application can infer
> > that if a owl:sameAs b, and a x y, then b x y.
> 
> Yes, but there's a snag: as RDF works only on the URI resource level
> (no added semantics to the typification of the URI resource), if
> someone does an owl:sameAs between an identifier of a thing and a
> locator of a thing (a locator being the resource itself as opposed to
> being an identifier; for example, are you talking about Sun Corp
> (http://sun.com/) or are you talking about their website
> (http://sun.com/)) you can get a nasty case of integrity rot, and I've
> not seen any proposals to address this issue (the RDF world is
> essentially assuming modeling from the viewpoint of everything being
> true).
> 
> I guess Mike doesn't like RDF *nor* Topic Maps now. :)


Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All

2009-05-14 Thread Rob Sanderson
RDF is fine with one 'thing' having multiple identifiers, it just hands
the problem up a level to the application to deal with.

For example, the owl:sameAs predicate is used to express that the
subject and object are the same 'thing'.  Then the application can infer
that if a owl:sameAs b, and a x y, then b x y.
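
A minimal sketch of that inference, in plain Python rather than any particular RDF library. The triples and the names a, b, x and y are just the placeholders from the sentence above, and the function name is hypothetical. It only rewrites the subject position, which is all the example needs:

```python
SAME_AS = "owl:sameAs"

def expand_same_as(triples):
    """Return the triples plus those implied by owl:sameAs (subject position only)."""
    triples = set(triples)
    # Treat sameAs as symmetric: record aliases in both directions.
    aliases = {}
    for s, p, o in triples:
        if p == SAME_AS:
            aliases.setdefault(s, set()).add(o)
            aliases.setdefault(o, set()).add(s)
    inferred = set(triples)
    changed = True
    # Repeat until no new triple appears, so chains of aliases are followed.
    while changed:
        changed = False
        for s, p, o in list(inferred):
            if p == SAME_AS:
                continue
            for alias in aliases.get(s, ()):
                new = (alias, p, o)
                if new not in inferred:
                    inferred.add(new)
                    changed = True
    return inferred

facts = {("a", SAME_AS, "b"), ("a", "x", "y")}
result = expand_same_as(facts)
```

Here `result` contains ("b", "x", "y") alongside the two original statements. Nothing in the data model forces this step; it is work the application chooses to do.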

Rob

On Thu, 2009-05-14 at 13:00 +0100, Mike Taylor wrote:
> Alexander Johannesen writes:
>  > Anyway, I'm suspecting I don't see what the problem seems to be. To
>  > create "the best identifier" for things seems a bit of a strange
>  > notion to me, but is this based on that there is only (or rather,
>  > that you're trying to create) one identifier for any one thing?
> 
> Yes, this is exactly it.  RDF thinks that each concept should have
> exactly one identifier; Topic Maps says it's fine to have multiple
> identifiers.  That seems to be 99% of the conceptual difference
> between them.
> 
> My position: it seems obvious that one is the CORRECT number of
> identifiers for a thing to have.  But since we live in a formal
> world, the Topic Maps approach may be more practical.
> 
> In other words, I might end up _advocating_ Topic Maps, but don't
> expect me to _like_ it :-)
> 
>  _/|_  ___
> /o ) \/  Mike Taylor    http://www.miketaylor.org.uk
> )_v__/\  "I think it's too consistently wrong not to be fixable" --
>          Phil Baldwin.


Re: [CODE4LIB] Formats and its identifiers

2009-05-11 Thread Rob Sanderson
On Mon, 2009-05-11 at 14:53 +0100, Jakob Voss wrote:

> >> A format should be described with a schema (XML Schema, OWL etc.) or at 
> >> least a standard. Mostly this schema already has a namespace or similar 
> >> identifier that can be used for the whole format.
> > 
> > This is unfortunately not the case.
> 
> It is mostly the case - but people like to misinterpret schemas and 
> tailor them to their needs.

You're advocating an approach that "mostly" works, as opposed to one
that works in all cases?


> >> For instance MODS Version 3 (currently 3.0, 3.1, 3.2, 3.4) has the XML 
> >> Namespace http://www.loc.gov/mods/v3 so this is the best identifier to 
> >> identify MODS. 
> > 
> > And this is a perfect example of why this is not the case. 
> > The same mods schema (let alone namespace) defines TWO formats, mods and
> > modsCollection.

> That's your interpretation. According to the schema, the MODS format 
> *is* either a single mods-element or a modsCollection-element. 

According to the __schema__ yes.  Not according to the namespace. The
namespace is a collection of names only and says precisely nothing about
structure.

And, yes, given no definition of "format", my interpretation is that the
mods schema defines two formats, as it defines two top level elements
with different contents (eg one may contain the other).  This is
typically how people would define format in this context, I would say.  

This is, of course, tangential to the fact that you cannot use the __XML
Namespace__ as an identifier for the format, no matter how you define
it.
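
To make that concrete, here is a small sketch using Python's standard xml.etree.ElementTree. The two documents are tiny made-up instances, not taken from the MODS distribution, and `root_identity` is a hypothetical helper name. Both report the same namespace, yet they have different top level elements:

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"

# Two illustrative documents sharing one namespace.
single = f'<mods xmlns="{MODS_NS}"><titleInfo><title>A</title></titleInfo></mods>'
collection = f'<modsCollection xmlns="{MODS_NS}"><mods/><mods/></modsCollection>'

def root_identity(xml_text):
    """Return (namespace, local name) of the document's top level element."""
    tag = ET.fromstring(xml_text).tag  # ElementTree spells tags '{namespace}local'
    if tag.startswith("{"):
        namespace, local = tag[1:].split("}", 1)
        return namespace, local
    return None, tag

ns_a, name_a = root_identity(single)
ns_b, name_b = root_identity(collection)
```

`ns_a` and `ns_b` are identical, while `name_a` is "mods" and `name_b` is "modsCollection". A processor that dispatches on the namespace alone cannot tell which of the two it has been handed.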


> That's 
> exactly what you can refer to with the namespace identifier 
> http://www.loc.gov/mods/v3.

No, that's a collection of elements, not a schema.


> If you need to identify the specific element 'mods' of the format only, 
> then you need another identifier.

Correct. I'm glad you agree with me.

Given that namespaces do not specify anything to do with structure, you
thus need a new identifier for EVERY element in a namespace as they
could be used as the top level tag of ANY schema.

There isn't a widely accepted identifier system for schemas, only schema
locations.  There are also many methods for defining schemas
(schematron, relax-ng, DTDs, xml schema) which can all define exactly
the same "format".


> But if the MODS specification defines that you can refer to any element 
> with a URI fragment identifier, then the right identifier would be 
> http://www.loc.gov/mods/v3#mods

That would be an identifier for the *element*.

> The namespace http://www.loc.gov/mods/v3 of the top level element 'mods' 
> does not identify the top level element but the MODS *format* (in any of 
> the versions 3.0-3.4) itself. This format *includes* the top level 
> element 'mods'.

No, it identifies a collection of names.  These names are structured
according to a schema, which is what we need an identifier for. Beyond
that, we may also need identifiers for which structure we mean within
the schema (e.g. mods vs modsCollection).


Rob


Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All

2009-05-11 Thread Rob Sanderson
On Mon, 2009-05-11 at 12:02 +0100, Alexander Johannesen wrote:
> On Mon, May 11, 2009 at 16:04, Rob Sanderson  wrote:
> > * One namespace is used to define two _totally_ separate sets of
> > elements.  There's no reason why this can't be done.
> 
> As opposed to all the reasons for not doing it. :) This is crap design
> of a higher magnitude, and the designers should be either a) whipped
> in public and thrown out in shame, or b) repent and made to fix the
> problem. Even I would opt for the latter, but such a simple task not
> being done seems to suggest that perhaps the former needs to be put in
> place.

I totally agree that it's an awful design choice. However it's a
demonstration that XML namespaces _do not identify format_.  And hence,
we need another identifier which is not the namespace of the top level
element.

> > * One namespace defines so many elements that it's meaningless to call
> > it a format at all.  Even though the top level tag might be the same,
> > the contents are so varied that you're unable to realistically process
> > it.
> 
> Yeah, don't use MODS in general; it's a hack. It's even crazier still
> that many versions have the same namespace. What were they thinking?!

Or TEI for that matter. However I wouldn't call either of them a 'hack'
and there are many people who do want to use both of these schemas.

Therefore, again, we need another identifier.
Q.E.D.

Rob


Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All

2009-05-11 Thread Rob Sanderson
On Mon, 2009-05-11 at 11:31 +0100, Jakob Voss wrote:
> A format should be described with a schema (XML Schema, OWL etc.) or at 
> least a standard. Mostly this schema already has a namespace or similar 
> identifier that can be used for the whole format.

This is unfortunately not the case.


> For instance MODS Version 3 (currently 3.0, 3.1, 3.2, 3.4) has the XML 
> Namespace http://www.loc.gov/mods/v3 so this is the best identifier to 
> identify MODS. 

And this is a perfect example of why this is not the case.

The same mods schema (let alone namespace) defines TWO formats, mods and
modsCollection.


To quote from the schema:

An instance of this schema is

 (1) a single MODS record:

Re: [CODE4LIB] RDA in RDF, was: Something completely different

2009-04-07 Thread Rob Sanderson
See also the thread, 'RDA: A Standard Nobody Will Notice'.

http://www.mail-archive.com/code4lib@listserv.nd.edu/msg04422.html

A standard nobody will notice ... for good reason. 

Rob

On Tue, 2009-04-07 at 18:24 +0100, Eric Lease Morgan wrote:
> On Apr 7, 2009, at 1:15 PM, Karen Coyle wrote:
> 
> > Absolutely. The catalogers are still creating a textual document, not
> > data. At best you can mark up the text, as we do with the MARC  
> > record...
> 
> 
> Listen...  What you hear from over here is the sound of a very heavy  
> sigh coming from a computer type who really wants to help improve the  
> way library data is used in a networked environment, but they can't  
> convince their own to modify the way they encode information.
> 


Re: [CODE4LIB] registering info: uris?

2009-04-02 Thread Rob Sanderson
On Thu, 2009-04-02 at 18:11 +0100, Erik Hetzner wrote:
> At Thu, 2 Apr 2009 13:47:50 +0100,
> Mike Taylor wrote:
> > 
> > Erik Hetzner writes:
> >  > Without external knowledge that info:doi/10./xxx is a URI, I can
> >  > only guess.
> > 
> > Yes, that is true.  The point is that by specifying that the rft_id
> > has to be a URI, you can then use other kinds of URI without needing
> > to broaden the specification.
> 
> Thanks for the clarification. Honestly I was also responding to Rob
> Sanderson’s message (bad practice, surely) where he described URIs as
> ‘self-describing’, which seemed to me unclear. URIs are only
> self-describing insofar as they describe what type of URI they are.

All I meant by that was that the info:doi/ URI is more informative as to
what the identifier actually is than just the doi by itself, which could
be any string.  Equally, if I saw an SRW info URI like:

info:srw/cql-context-set/2/relevance-1.0

that's more informative than some ad-hoc URI for the same thing.
Without the external knowledge that info:doi/xxx is a DOI and
info:srw/cql-context-set/2/ is a cql context set administered by the
owner with identifier '2' (which happens to be me), then they're still
just opaque strings.

I could have said that http://srw.cheshire3.org/contextSets/rel/ was the
identifier for it (SRU doesn't care) but that's the location for the
retrieval documentation for the context set, not a collection of
abstract access points.

If srw.cheshire3.org were to go away, then people could still happily use
the info URI with the continued knowledge that it shouldn't resolve to
anything.

With the potential dissolution of DLF, this has real implications, as
DLF have an info URI namespace.  If they'd registered a bunch of URIs
with diglib.org instead, which will go away, then people would have
trouble using them.  Notably when someone else grabs the domain and
starts using the URIs for something else.
Now if DLF were to disband AND reform, then they can happily go back to
using info:dlf/ URIs even if they have a brand new domain.


> I think that all of us in this discussion like URIs. I can’t speak
> for, say, Andrew, but, tentatively, I think that I prefer
>  to plain 10.111/xxx. I would just prefer
> 

info URIs, In My Opinion, are ideally suited for long term identifiers
of non-information resources.  But http URIs are definitely better than
something which isn't a URI at all.

Rob


Re: [CODE4LIB] registering info: uris?

2009-04-01 Thread Rob Sanderson
On Wed, 2009-04-01 at 14:28 +0100, Houghton,Andrew wrote:
>  The DOI is the identifier and it inherently doesn't
> tie itself to any resolution mechanism.  So creating an info URI
> for it is meaningless, it's just another alias for the DOI.  I 
> can create an HTTP resolution mechanism for DOI's by doing:
> 
> http://resolve.example.org/?doi=10./j.1475-4983.2007.00728.x
> http://resolve.example.org/?uri=info:doi/10./j.1475-4983.2007.00728.x


Note that in the first you have to EXPLICITLY name the identifier
encoding as doi=... Note also that this is your *resolution service*
doing this, not the identifier.

In the second example, the identifier is self-describing.  It says I am
a DOI (info:doi/) and my value is 10.111/bla-bla-bla.

So, I disagree that the two are of identical value.  The string doi does
not identify itself outside of your resolution service, whereas the info
URI doi does.
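
As a rough sketch of what "self-describing" buys you, the following Python splits an info URI into its namespace and its value without consulting any resolution service. The function names are hypothetical, and the DOI value "10.1000/example" is a made-up placeholder rather than a real identifier:

```python
def parse_info_uri(uri):
    """Split an info URI into (namespace, identifier); raise ValueError otherwise."""
    scheme, sep, rest = uri.partition(":")
    if not sep or scheme.lower() != "info":
        raise ValueError(f"not an info URI: {uri!r}")
    namespace, sep, identifier = rest.partition("/")
    if not sep or not namespace or not identifier:
        raise ValueError(f"malformed info URI: {uri!r}")
    return namespace, identifier

def describe(value):
    """Say what kind of identifier a string claims to be, if it says so itself."""
    try:
        namespace, identifier = parse_info_uri(value)
    except ValueError:
        return "opaque string"
    return f"{namespace} identifier {identifier}"

as_uri = describe("info:doi/10.1000/example")
as_bare = describe("10.1000/example")
```

`as_uri` names the identifier as a doi, whereas `as_bare` can only be reported as an opaque string. Knowing what the "doi" namespace actually means still takes external knowledge, of course.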

Rob


Re: [CODE4LIB] registering info: uris?

2009-04-01 Thread Rob Sanderson
On Wed, 2009-04-01 at 14:17 +0100, Mike Taylor wrote:
> Ed Summers writes:
>  > Assuming a world where you cannot de-reference this DOI what is it
>  > good for?
> 
> It wouldn't be good for much if you couldn't dereference it at all.
> The point is that (I argue) the identifier shouldn't tie itself to a
> particular dereferencing mechanism (such as dx.doi.org, or amazon.com)
> but should be dereferenced by software that knows what's the most
> appropriate dereferencing mechanism _for you_ in your situation, with
> your subscriptions, at particular distances from specific libraries,
> etc.

Heh, that sounds like a good idea. Maybe we could call it an OpenURL?

And that distinction about having a dereferencing mechanism sounds okay,
but let's call it a ... service. Then we could define an architecture
for that sort of thing rather than a Resource oriented one.  We could
call it a Service Oriented Architecture.

Oh, wait... 

Rob


Re: [CODE4LIB] registering info: uris?

2009-03-30 Thread Rob Sanderson
On Mon, 2009-03-30 at 16:08 +0100, Ross Singer wrote:
> There should be no issue with having both, mainly because like I
> mentioned earlier, nobody cares about info:uris.

s/nobody cares/the web doesn't care/

'The Web' isn't the only use case.  There are plenty of reasons for
having non-dereferenceable identifiers, for example for things which do
not have a web representation, or which have so many web representations
that favouring one over another would be a waste of time: abstract
concepts, for example.

> I guess the way I look at it is:
> 1.  The web is not going to wait for info:uris
> 2.  The web is not going to use info:uris anyway, even after we've
> exhausted all of the corner cases and come up with the perfect URI
> model for a given domain, *because there's nothing the web can do with
> them anyway*.

Working As Intended.

If you want an identifier that *explicitly* cannot be dereferenced, then
info URIs are a good choice.  If you want one that can be dereferenced
to some representation of the identified object, then HTTP is the only
choice.

Rob


Re: [CODE4LIB] BISAC Subject Headings Lookup or Crosswalk

2009-01-21 Thread Rob Sanderson
And if you could get access to the catalogue, you could then train a
classifier (maybe Bayes?) to predict BISAC given the other types of
headings (or other data) in the records.
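
A very rough sketch of what that could look like: a naive Bayes classifier with add-one smoothing, written in plain Python. The training records and the labels "COMPUTERS" and "COOKING" are invented stand-ins, not real BISAC codes or catalogue data, and a real attempt would need far more records and proper evaluation:

```python
import math
from collections import Counter, defaultdict

def train(records):
    """records: iterable of (list_of_headings, bisac_label) pairs."""
    class_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocabulary = set()
    for headings, bisac in records:
        class_counts[bisac] += 1
        for heading in headings:
            feature_counts[bisac][heading] += 1
            vocabulary.add(heading)
    return class_counts, feature_counts, vocabulary

def predict(model, headings):
    """Return the label with the highest log-probability for these headings."""
    class_counts, feature_counts, vocabulary = model
    total = sum(class_counts.values())
    best, best_score = None, -math.inf
    for bisac, count in class_counts.items():
        score = math.log(count / total)  # log prior
        denom = sum(feature_counts[bisac].values()) + len(vocabulary)
        for heading in headings:
            # Add-one smoothing so unseen headings do not zero out a class.
            score += math.log((feature_counts[bisac][heading] + 1) / denom)
        if score > best_score:
            best, best_score = bisac, score
    return best

# Invented training data: existing subject headings paired with a stand-in label.
training = [
    (["Python (Computer program language)", "Computer programming"], "COMPUTERS"),
    (["Computer programming", "Software engineering"], "COMPUTERS"),
    (["Cooking", "Baking"], "COOKING"),
    (["Cooking, French"], "COOKING"),
]
model = train(training)
guess = predict(model, ["Computer programming"])
```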

Rob

On Wed, 2009-01-21 at 12:21 -0500, Andrew Nagy wrote:
> I saw a great presentation by Jesse Haro from Phoenix Public on their Endeca
> catalog.  They had their catalogers go back and recatalog the entire
> collection with BISAC headings.  You might want to see if you can get in
> touch with him to see if he has any information for you.
> 
> http://mlamasslib.blogspot.com/2008/05/endeca-developments-in-opac-world.html
> 
> Andrew
> 
> On Wed, Jan 21, 2009 at 12:12 PM, Ryan Eby  wrote:
> 
> > I was wondering if anyone knows of a good BISAC Subject Headings
> > source for looking up a recommended BISAC based on ISBN, LCSH, etc.
> > I've found some pages on oclc.org saying they were starting work on
> > crosswalks and possibly including them in WorldCat but I haven't seen
> > any returned in any WorldCat api calls yet. I've also read that ONIX
> > records often have a BISAC code, is there a good source that might
> > cover many publishers?
> >
> > http://www.bisg.org/standards/bisac_subject/index.html
> >
> > http://www.oclc.org/dewey/updates/numbers/
> >
> > eby
> >


Re: [CODE4LIB] RDA - a standard that nobody will notice?

2008-12-17 Thread Rob Sanderson
My first question would be:  Why?

Why invent a new element for title (etc.) rather than using Dublin Core?
Wouldn't it have been easier to do this building from SWAP?
http://www.ukoln.ac.uk/repositories/digirep/index/Eprints_Application_Profile

And my second question would be: Really?

251 elements!! Man... At least they're not just numbers, but ... do you
expect anyone to actually use it?

Rob


Re: [CODE4LIB] Open Source Institutional Repository Software?

2008-08-22 Thread Rob Sanderson
To throw in my 2c.

> Eric Lease Morgan wrote:
> > On Aug 21, 2008, at 4:34 PM, Jonathan Rochkind wrote:
> >> If you can figure out what the difference between an 'institutional 
> >> repository' and a 'digital library' is, let me know.
> > I think an institutional repository is a type of digital library.

I think the set of "institutional repository" is a subset of the set of
"digital library".  The defining feature being that IRs are designed to
be updated relatively frequently, by more than one or two people, and
typically non-technical members of an institution.  This happens via a
user UI, rather than via an admin UI.  The contents of the IR are
research output, whereas a DL can hold anything.

Rob


[CODE4LIB] ORE software libraries from Foresite

2008-06-09 Thread Rob Sanderson
Apologies for cross-posting...

The Foresite [1] project is pleased to announce the initial code of two
software libraries for constructing, parsing, manipulating and
serialising OAI-ORE [2] Resource Maps.  These libraries are being
written in Java and Python, and can be used generically to provide
advanced functionality to OAI-ORE aware applications, and are compliant
with the latest release (0.9) of the specification.  The software is
open source, released under a BSD licence, and is available from a
Google Code repository:

http://code.google.com/p/foresite-toolkit/

You will find that the implementations are not absolutely complete yet,
and are lacking good documentation for this early release, but we will
be continuing to develop this software throughout the project and hope
that it will be of use to the community immediately and beyond the end
of the project.

Both libraries support parsing and serialising in: ATOM, RDF/XML, N3,
N-Triples, Turtle and RDFa

Foresite is a JISC [3] funded project which aims to produce a
demonstrator and test of the OAI-ORE standard by creating Resource Maps
of journals and their contents held in JSTOR [4], and delivering them as
ATOM documents via the SWORD [5] interface to DSpace [6].  DSpace will
ingest these resource maps, and convert them into repository items which
reference content which continues to reside in JSTOR.  The Python
library is being used to generate the resource maps from JSTOR and the
Java library is being used to provide all the ingest, transformation and
dissemination support required in DSpace.

Please feel free to download and play with the source code, and let us
have your feedback via the Google group:

[EMAIL PROTECTED]

All the best,

Richard Jones & Rob Sanderson

[1] Foresite project page: http://foresite.cheshire3.org/
[2] OAI-ORE specification: http://www.openarchives.org/ore/0.9/toc
[3] Joint Information Systems Committee (JISC): http://www.jisc.ac.uk/
[4] JSTOR: http://www.jstor.org/
[5] Simple Web Service Offering Repository Deposit (SWORD):
http://www.ukoln.ac.uk/repositories/digirep/index/SWORD
[6] DSpace: http://www.dspace.org/


Re: [CODE4LIB] Latest OpenLibrary.org release

2008-05-09 Thread Rob Sanderson
On Thu, 2008-05-08 at 11:41 -0400, Godmar Back wrote:
> On Thu, May 8, 2008 at 11:25 AM, Dr R. Sanderson
> <[EMAIL PROTECTED]> wrote:
> >
> >  Like what?  The current API seems to be concerned with search.  Search
> >  is what SRU does well.  If it was concerned with harvest, I (and I'm
> >  sure many others) would have instead suggested OAI-PMH.
> >
> No, the API presented does not support search.

Well, it only doesn't support search because of the way that the API has
been described without using the word 'search'!

To quote the documentation in the API:

--
Infogami provides an API to query the database for objects matching
particular criteria
...
To find objects matching a particular query, send a GET request to
http://openlibrary.org/api/things with query as parameter. In this
documentation we use curl as a simple command line query client; any
software that supports http GET can be used.
...
The API supports querying for objects based of string matching.
--

And so on.

There's a query, which can have its results sorted, be limited in terms
of the number of results returned, and have the beginning of that result
list start at an offset.

Sounds a lot like a search?
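
Stripped down to a sketch, those four features are just this. It is an in-memory Python illustration with made-up records and a hypothetical function name, not the Open Library implementation or its actual parameters:

```python
def query_things(things, criteria, sort=None, limit=None, offset=0):
    """Return the things whose fields equal all criteria, sorted and paged."""
    # Match: keep only records where every criterion is satisfied.
    matches = [t for t in things
               if all(t.get(k) == v for k, v in criteria.items())]
    # Sort: order the result list by one field, if requested.
    if sort is not None:
        matches.sort(key=lambda t: t.get(sort, ""))
    # Limit and offset: return one window of the result list.
    end = None if limit is None else offset + limit
    return matches[offset:end]

# Made-up records for illustration only.
things = [
    {"type": "edition", "title": "C"},
    {"type": "edition", "title": "A"},
    {"type": "author", "title": "Z"},
    {"type": "edition", "title": "B"},
]
page = query_things(things, {"type": "edition"}, sort="title", limit=2, offset=1)
titles = [t["title"] for t in page]
```

Match, sort, limit, offset: whatever the documentation calls it, that is the shape of a search request.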

Rob


[CODE4LIB] OAI-ORE European Open Meeting, April 4 2008

2008-01-25 Thread Rob Sanderson
Apologies for cross-posting

A meeting will be held on April 4, 2008 at the University of
Southampton, in conjunction with Open Repositories 2008, to roll out the
beta release of the OAI-ORE specifications. This meeting is the European
follow-on to a meeting that will be held in the USA on March 3, 2008 at
Johns Hopkins University.

The OAI-ORE specifications describe a data model to identify and
describe aggregations of web resources, and they introduce
machine-readable formats to describe these aggregations based on ATOM
and RDF/XML. The current, alpha version of the OAI-ORE specifications is
at http://www.openarchives.org/ore/0.1/ .

Additional details for the OAI-ORE European Open Meeting are available at:

- The full press release for this event:

http://www.openarchives.org/ore/documents/EUKickoffPressrelease.pdf

- The registration site for the event:

http://regonline.com/eu-oai-ore

Note that registration is required and space is limited.


Carl Lagoze and Herbert Van de Sompel


[CODE4LIB] ORE meeting announcement

2007-11-02 Thread Rob Sanderson
(And apologies for incorrect cross posting to code4libcon!)

Apologies for cross-posting

A meeting will be held on March 3, 2008 at Johns Hopkins University to
roll out the first beta release of the OAI-ORE specifications.  These
specifications describe a data model to identify and describe
aggregations of web resources, and the encoding of the data model in the
XML-based Atom syndication format.

Additional details are available at:

- The full press release for this event:

http://www.openarchives.org/ore/documents/ore-hopkins-press-release.pdf

- The registration site for the meeting:

http://www.regonline.com/oai-ore

Note that registration is required and space is limited.

Carl Lagoze and Herbert Van de Sompel