Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All

2009-05-14 Thread Rob Sanderson
RDF is fine with one 'thing' having multiple identifiers; it just hands
the problem up a level to the application to deal with.

For example, the owl:sameAs predicate is used to express that the
subject and object are the same 'thing'.  Then the application can infer
that if a owl:sameAs b, and a x y, then b x y.
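
As a minimal sketch of that inference (plain Python tuples standing in for triples, no RDF library involved; the function name and the toy identifiers are made up for illustration), an application could group identifiers linked by owl:sameAs and copy statements across each group:

```python
from itertools import product

SAME_AS = "owl:sameAs"

def infer_same_as(triples):
    """Return the input triples plus those implied by owl:sameAs substitution."""
    triples = set(triples)
    parent = {}  # tiny union-find over identifiers linked by owl:sameAs

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for s, p, o in triples:
        if p == SAME_AS:
            parent[find(s)] = find(o)

    # Collect every identifier into its equivalence class.
    classes = {}
    for node in list(parent):
        classes.setdefault(find(node), set()).add(node)

    def aliases(x):
        # Identifiers never mentioned in a sameAs statement stand alone.
        return classes[find(x)] if x in parent else {x}

    inferred = set(triples)
    for s, p, o in triples:
        if p == SAME_AS:
            continue
        # Substitute every alias in both subject and object position.
        for s2, o2 in product(aliases(s), aliases(o)):
            inferred.add((s2, p, o2))
    return inferred

facts = {("a", SAME_AS, "b"), ("a", "x", "y")}
result = infer_same_as(facts)
print(("b", "x", "y") in result)
```

The point is only that the merging happens in the application, after the data has been loaded, rather than in the identifiers themselves.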

Rob

On Thu, 2009-05-14 at 13:00 +0100, Mike Taylor wrote:
 Alexander Johannesen writes:
   Anyway, I'm suspecting I don't see what the problem seems to be. To
   create the best identifier for things seems a bit of a strange
   notion to me, but is this based on that there is only (or rather,
   that you're trying to create) one identifier for any one thing?
 
 Yes, this is exactly it.  RDF thinks that each concept should have
 exactly one identifier; Topic Maps says it's fine to have multiple
 identifiers.  That seems to be 99% of the conceptual difference
 between them.
 
 My position: it seems obvious that one is the CORRECT number of
 identifiers for a thing to have.  But since we live in a formal
 world, the Topic Maps approach may be more practical.
 
 In other words, I might end up _advocating_ Topic Maps, but don't
 expect me to _like_ it :-)
 
  _/|_  ___
 /o ) \/  Mike Taylor  <m...@indexdata.com>  http://www.miketaylor.org.uk
 )_v__/\  "I think it's too consistently wrong not to be fixable" --
          Phil Baldwin.


Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All

2009-05-11 Thread Rob Sanderson
On Mon, 2009-05-11 at 11:31 +0100, Jakob Voss wrote:
 A format should be described with a schema (XML Schema, OWL etc.) or at 
 least a standard. Mostly this schema already has a namespace or similar 
 identifier that can be used for the whole format.

This is unfortunately not the case.


 For instance MODS Version 3 (currently 3.0, 3.1, 3.2, 3.4) has the XML 
 Namespace http://www.loc.gov/mods/v3 so this is the best identifier to 
 identify MODS. 

And this is a perfect example of why this is not the case.

The same mods schema (let alone namespace) defines TWO formats, mods and
modsCollection.


To quote from the schema:

<!--
*  An instance of this schema is

 (1) a single MODS record:
-->
<xsd:element name="mods" type="modsType"/>
<!--
or

(2) a collection of MODS records:
-->
<xsd:element name="modsCollection">
  <xsd:complexType>
    <xsd:sequence>
      <xsd:element ref="mods" maxOccurs="unbounded"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>
<!--

*  End of instance definition
-->

So you're using the same identifier to identify two different things at
the same time.
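
To make that concrete, here is a small sketch using Python's standard xml.etree.ElementTree. The two documents are cut-down examples invented for illustration, not valid MODS records; the helper name root_name is likewise made up.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"

# Two toy documents: one single record, one collection of records.
single = f'<mods xmlns="{MODS_NS}"><titleInfo><title>A</title></titleInfo></mods>'
collection = f'<modsCollection xmlns="{MODS_NS}"><mods/><mods/></modsCollection>'

def root_name(xml_text):
    """Return (namespace, local name) of the document's top-level element."""
    tag = ET.fromstring(xml_text).tag  # ElementTree spells this "{namespace}local"
    if tag.startswith("{"):
        namespace, local = tag[1:].split("}", 1)
        return namespace, local
    return None, tag

ns_a, name_a = root_name(single)
ns_b, name_b = root_name(collection)
print(ns_a == ns_b, name_a, name_b)
```

Both documents report the same namespace, so a client that dispatches on the namespace alone cannot tell whether it has been handed one record or many.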

We discussed this a lot during the development of SRU and there simply
isn't an existing identifier for an XML 'format'.

Also consider the following more hypothetical, but perfectly feasible
situations:

* One namespace is used to define two _totally_ separate sets of
elements.  There's no reason why this can't be done.

* One namespace defines so many elements that it's meaningless to call
it a format at all.  Even though the top level tag might be the same,
the contents are so varied that you're unable to realistically process
it.


Rob


Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All

2009-05-11 Thread Rob Sanderson
On Mon, 2009-05-11 at 12:02 +0100, Alexander Johannesen wrote:
 On Mon, May 11, 2009 at 16:04, Rob Sanderson <azar...@liverpool.ac.uk> wrote:
  * One namespace is used to define two _totally_ separate sets of
  elements.  There's no reason why this can't be done.
 
 As opposed to all the reasons for not doing it. :) This is crap design
 of a higher magnitude, and the designers should be either a) whipped
 in public and thrown out in shame, or b) repent and made to fix the
 problem. Even I would opt for the latter, but such a simple task not
 being done seems to suggest that perhaps the former needs to be put in
 place.

I totally agree that it's an awful design choice. However it's a
demonstration that XML namespaces _do not identify format_.  And hence,
we need another identifier which is not the namespace of the top level
element.

  * One namespace defines so many elements that it's meaningless to call
  it a format at all.  Even though the top level tag might be the same,
  the contents are so varied that you're unable to realistically process
  it.
 
 Yeah, don't use MODS in general; it's a hack. It's even crazier still
 that many versions have the same namespace. What were they thinking?!

Or TEI for that matter. However I wouldn't call either of them a 'hack'
and there are many people who do want to use both of these schemas.

Therefore, again, we need another identifier.
Q.E.D.

Rob


Re: [CODE4LIB] Formats and its identifiers

2009-05-11 Thread Rob Sanderson
On Mon, 2009-05-11 at 14:53 +0100, Jakob Voss wrote:

  A format should be described with a schema (XML Schema, OWL etc.) or at 
  least a standard. Mostly this schema already has a namespace or similar 
  identifier that can be used for the whole format.
  
  This is unfortunately not the case.
 
 It is mostly the case - but people like to misinterpret schemas and 
 tailor them to their needs.

You're advocating an approach that mostly works, as opposed to one
that works in all cases?


  For instance MODS Version 3 (currently 3.0, 3.1, 3.2, 3.4) has the XML 
  Namespace http://www.loc.gov/mods/v3 so this is the best identifier to 
  identify MODS. 
  
  And this is a perfect example of why this is not the case. 
  The same mods schema (let alone namespace) defines TWO formats, mods and
  modsCollection.

 That's your interpretation. According to the schema, the MODS format 
 *is* either a single mods-element or a modsCollection-element. 

According to the __schema__ yes.  Not according to the namespace. The
namespace is a collection of names only and says precisely nothing about
structure.

And, yes, given no definition of format, my interpretation is that the
mods schema defines two formats, as it defines two top level elements
with different contents (eg one may contain the other).  This is
typically how people would define format in this context, I would say.  

This is, of course, tangential to the fact that you cannot use the __XML
Namespace__ as an identifier for the format, no matter how you define
it.


 That's 
 exactly what you can refer to with the namespace identifier 
 http://www.loc.gov/mods/v3.

No, that's a collection of elements, not a schema.


 If you need to identify the specific element 'mods' of the format only, 
 then you need another identifier.

Correct. I'm glad you agree with me.

Given that namespaces do not specify anything to do with structure, you
thus need a new identifier for EVERY element in a namespace as they
could be used as the top level tag of ANY schema.

There isn't a widely accepted identifier system for schemas, only schema
locations.  There are also many methods for defining schemas
(schematron, relax-ng, DTDs, xml schema) which can all define exactly
the same format.


 But if the MODS specification defines that you can refer to any element 
 with a URI fragment identifier, then the right identifier would be 
 http://www.loc.gov/mods/v3#mods

That would be an identifier for the *element*.

 The namespace http://www.loc.gov/mods/v3 of the top level element 'mods' 
 does not identify the top level element but the MODS *format* (in any of 
 the versions 3.0-3.4) itself. This format *includes* the top level 
 element 'mods'.

No, it identifies a collection of names.  These names are structured
according to a schema, which is what we need an identifier for. Beyond
that, we may also need identifiers for which structure we mean within
the schema (e.g. mods vs modsCollection).
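
Purely as a sketch of what such an identifier would have to carry (the FormatId class and the urn:example: schema identifier are hypothetical; no agreed identifier system for schemas is being claimed here), a format could be keyed on the schema plus the top-level structure rather than on the namespace:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FormatId:
    schema: str        # hypothetical schema identifier, not a real registry entry
    root_element: str  # which top-level structure within that schema is meant

MODS_SCHEMA = "urn:example:schema:mods-3.3"  # placeholder, invented for this sketch

single = FormatId(MODS_SCHEMA, "mods")
collection = FormatId(MODS_SCHEMA, "modsCollection")

# Because the pair is hashable, it can key a dispatch table directly.
handlers = {
    single: "process one record",
    collection: "process many records",
}
print(single == collection, len(handlers))
```

The two values share a schema but compare unequal, which is exactly the distinction the namespace alone cannot express.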


Rob


Re: [CODE4LIB] RDA in RDF, was: Something completely different

2009-04-07 Thread Rob Sanderson
See also the thread, 'RDA: A Standard Nobody Will Notice'.

http://www.mail-archive.com/code4lib@listserv.nd.edu/msg04422.html

A standard nobody will notice ... for good reason. 

Rob

On Tue, 2009-04-07 at 18:24 +0100, Eric Lease Morgan wrote:
 On Apr 7, 2009, at 1:15 PM, Karen Coyle wrote:
 
  Absolutely. The catalogers are still creating a textual document, not
  data. At best you can mark up the text, as we do with the MARC  
  record...
 
 
 Listen...  What you hear from over here is the sound of a very heavy  
 sigh coming from a computer type who really wants to help improve the  
 way library data is used in a networked environment, but they can't  
 convince their own to modify the way they encode information.
 


Re: [CODE4LIB] registering info: uris?

2009-04-01 Thread Rob Sanderson
On Wed, 2009-04-01 at 14:17 +0100, Mike Taylor wrote:
 Ed Summers writes:
   Assuming a world where you cannot de-reference this DOI what is it
   good for?
 
 It wouldn't be good for much if you couldn't dereference it at all.
 The point is that (I argue) the identifier shouldn't tie itself to a
 particular dereferencing mechanism (such as dx.doi.org, or amazon.com)
 but should be dereferenced by software that knows what's the most
 appropriate dereferencing mechanism _for you_ in your situation, with
 your subscriptions, at particular distances from specific libraries,
 etc.

Heh, that sounds like a good idea. Maybe we could call it an OpenURL?

And that distinction about having a dereferencing mechanism sounds okay,
but let's call it a ... service. Then we could define an architecture
for that sort of thing rather than a Resource oriented one.  We could
call it a Service Oriented Architecture.

Oh, wait... 

Rob


Re: [CODE4LIB] registering info: uris?

2009-03-30 Thread Rob Sanderson
On Mon, 2009-03-30 at 16:08 +0100, Ross Singer wrote:
 There should be no issue with having both, mainly because like I
 mentioned earlier, nobody cares about info:uris.

s/nobody cares/the web doesn't care/

'The Web' isn't the only use case.  There are plenty of reasons for
having non-dereferenceable identifiers, for example for things which do
not have a web representation, or which have so many web representations
that favouring one over another would be a waste of time.  Abstract
concepts are one example.

 I guess the way I look at it is:
 1.  The web is not going to wait for info:uris
 2.  The web is not going to use info:uris anyway, even after we've
 exhausted all of the corner cases and come up with the perfect URI
 model for a given domain, *because there's nothing the web can do with
 them anyway*.

Working As Intended.

If you want an identifier that *explicitly* cannot be dereferenced, then
info URIs are a good choice.  If you want one that can be dereferenced
to some representation of the identified object, then HTTP is the only
choice.
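
A rough sketch of how software might act on that distinction, using only the standard library. The function name is made up, treating https alongside http is an assumption of this sketch, and the info: URI is just an example value.

```python
from urllib.parse import urlparse

# Assumption for this sketch: only these schemes promise a representation.
DEREFERENCEABLE_SCHEMES = {"http", "https"}

def is_dereferenceable(uri):
    """True if the URI's scheme is one we expect to be able to fetch."""
    return urlparse(uri).scheme.lower() in DEREFERENCEABLE_SCHEMES

examples = [
    "info:lccn/2002022641",        # example info URI: identifies, never fetched
    "http://www.loc.gov/mods/v3",  # HTTP URI: may be dereferenced
]
for uri in examples:
    print(uri, is_dereferenceable(uri))
```

The check is on the scheme alone, so an info URI is never handed to an HTTP client by mistake.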

Rob


Re: [CODE4LIB] BISAC Subject Headings Lookup or Crosswalk

2009-01-21 Thread Rob Sanderson
And if you could get access to the catalogue, you could then train a
classifier (maybe Bayes?) to predict BISAC given the other types of
headings (or other data) in the records.
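
Something along these lines, as a very rough sketch: a multinomial naive Bayes with add-one smoothing, written in plain Python. The training records and the category labels are invented placeholders, not real catalogue data or real BISAC codes.

```python
import math
from collections import Counter, defaultdict

def train(records):
    """records: iterable of (list of heading tokens, category label)."""
    class_counts = Counter()
    token_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in records:
        class_counts[label] += 1
        token_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, token_counts, vocab

def predict(model, tokens):
    """Return the label with the highest smoothed log-probability."""
    class_counts, token_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)  # class prior
        denom = sum(token_counts[label].values()) + len(vocab)
        for tok in tokens:
            # Add-one smoothing so unseen tokens do not zero out a class.
            score += math.log((token_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training set: tokens taken from other headings, placeholder labels.
training = [
    (["computer", "programming", "python"], "COMPUTERS"),
    (["computer", "networks"], "COMPUTERS"),
    (["gardening", "roses"], "GARDENING"),
    (["gardening", "vegetables", "soil"], "GARDENING"),
]
model = train(training)
print(predict(model, ["python", "programming"]))
print(predict(model, ["roses", "soil"]))
```

With a real catalogue the features would come from LCSH, Dewey or whatever else the records carry, and you would want to hold some records back to measure how often the prediction is right.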

Rob

On Wed, 2009-01-21 at 12:21 -0500, Andrew Nagy wrote:
 I saw a great presentation by Jesse Haro from Phoenix Public on their Endeca
 catalog.  They had their catalogers go back and recatalog the entire
 collection with BISAC headings.  You might want to see if you can get in
 touch with him to see if he has any information for you.
 
 http://mlamasslib.blogspot.com/2008/05/endeca-developments-in-opac-world.html
 
 Andrew
 
 On Wed, Jan 21, 2009 at 12:12 PM, Ryan Eby <ryan...@gmail.com> wrote:
 
  I was wondering if anyone knows of a good BISAC Subject Headings
  source for looking up a recommended BISAC based on ISBN, LCSH, etc.
  I've found some pages on oclc.org saying they were starting work on
  crosswalks and possibly including them in WorldCat but I haven't seen
  any returned in any WorldCat API calls yet. I've also read that ONIX
  records often have a BISAC code; is there a good source that might
  cover many publishers?
 
  http://www.bisg.org/standards/bisac_subject/index.html
 
  http://www.oclc.org/dewey/updates/numbers/
 
  eby
 


Re: [CODE4LIB] RDA - a standard that nobody will notice?

2008-12-17 Thread Rob Sanderson
My first question would be:  Why?

Why invent a new element for title (etc.) rather than using Dublin Core?
Wouldn't it have been easier to do this building from SWAP?
http://www.ukoln.ac.uk/repositories/digirep/index/Eprints_Application_Profile

And my second question would be: Really?

251 elements!! Man... At least they're not just numbers, but ... do you
expect anyone to actually use it?

Rob


Re: [CODE4LIB] Open Source Institutional Repository Software?

2008-08-22 Thread Rob Sanderson
To throw in my 2c.

 Eric Lease Morgan wrote:
  On Aug 21, 2008, at 4:34 PM, Jonathan Rochkind wrote:
  If you can figure out what the difference between an 'institutional 
  repository' and a 'digital library' is, let me know.
  I think an institutional repository is a type of digital library.

I think the set of institutional repositories is a subset of the set of
digital libraries.  The defining feature is that IRs are designed to
be updated relatively frequently, by more than one or two people, who are
typically non-technical members of an institution.  This happens via a
user-facing UI rather than an admin UI.  The contents of an IR are
research output, whereas a DL can hold anything.

Rob


[CODE4LIB] ORE software libraries from Foresite

2008-06-09 Thread Rob Sanderson
Apologies for cross-posting...

The Foresite [1] project is pleased to announce the initial code of two
software libraries for constructing, parsing, manipulating and
serialising OAI-ORE [2] Resource Maps.  These libraries are being
written in Java and Python, and can be used generically to provide
advanced functionality to OAI-ORE aware applications, and are compliant
with the latest release (0.9) of the specification.  The software is
open source, released under a BSD licence, and is available from a
Google Code repository:

http://code.google.com/p/foresite-toolkit/

You will find that the implementations are not absolutely complete yet,
and are lacking good documentation for this early release, but we will
be continuing to develop this software throughout the project and hope
that it will be of use to the community immediately and beyond the end
of the project.

Both libraries support parsing and serialising in: ATOM, RDF/XML, N3,
N-Triples, Turtle and RDFa
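
For readers unfamiliar with what ends up in such a serialisation, here is a hand-rolled sketch of the core relationships in N-Triples. It does not use the Foresite libraries or their API. It assumes the ORE terms namespace http://www.openarchives.org/ore/terms/ with its describes and aggregates terms, uses example.org URIs as placeholders, and leaves out the metadata a complete Resource Map would also carry.

```python
ORE = "http://www.openarchives.org/ore/terms/"  # assumed ORE vocabulary namespace

def resource_map_ntriples(rem_uri, aggregation_uri, aggregated_uris):
    """Serialise the core Resource Map relationships as N-Triples text."""
    # The Resource Map describes exactly one Aggregation...
    lines = [f"<{rem_uri}> <{ORE}describes> <{aggregation_uri}> ."]
    # ...and the Aggregation aggregates each of its resources.
    for uri in aggregated_uris:
        lines.append(f"<{aggregation_uri}> <{ORE}aggregates> <{uri}> .")
    return "\n".join(lines) + "\n"

# Placeholder URIs for a journal issue with two articles.
doc = resource_map_ntriples(
    "http://example.org/rem/issue1",
    "http://example.org/aggregation/issue1",
    ["http://example.org/article/1", "http://example.org/article/2"],
)
print(doc)
```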

Foresite is a JISC [3] funded project which aims to produce a
demonstrator and test of the OAI-ORE standard by creating Resource Maps
of journals and their contents held in JSTOR [4], and delivering them as
ATOM documents via the SWORD [5] interface to DSpace [6].  DSpace will
ingest these resource maps, and convert them into repository items which
reference content which continues to reside in JSTOR.  The Python
library is being used to generate the resource maps from JSTOR and the
Java library is being used to provide all the ingest, transformation and
dissemination support required in DSpace.

Please feel free to download and play with the source code, and let us
have your feedback via the Google group:

[EMAIL PROTECTED]

All the best,

Richard Jones & Rob Sanderson

[1] Foresite project page: http://foresite.cheshire3.org/
[2] OAI-ORE specification: http://www.openarchives.org/ore/0.9/toc
[3] Joint Information Systems Committee (JISC): http://www.jisc.ac.uk/
[4] JSTOR: http://www.jstor.org/
[5] Simple Web Service Offering Repository Deposit (SWORD):
http://www.ukoln.ac.uk/repositories/digirep/index/SWORD
[6] DSpace: http://www.dspace.org/


Re: [CODE4LIB] Latest OpenLibrary.org release

2008-05-09 Thread Rob Sanderson
On Thu, 2008-05-08 at 11:41 -0400, Godmar Back wrote:
 On Thu, May 8, 2008 at 11:25 AM, Dr R. Sanderson
 [EMAIL PROTECTED] wrote:
 
   Like what?  The current API seems to be concerned with search.  Search
   is what SRU does well.  If it was concerned with harvest, I (and I'm
   sure many others) would have instead suggested OAI-PMH.
 
 No, the API presented does not support search.

Well, it only "doesn't support search" because the API has been
described without using the word 'search'!

To quote the API documentation:

--
Infogami provides an API to query the database for objects matching
particular criteria
...
To find objects matching a particular query, send a GET request to
http://openlibrary.org/api/things with query as parameter. In this
documentation we use curl as a simple command line query client; any
software that supports http GET can be used.
...
The API supports querying for objects based on string matching.
-

And so on.

There's a query, which can have its results sorted, be limited in terms
of the number of results returned, and have the beginning of that result
list start at an offset.

Sounds a lot like a search?
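
To spell it out, here is an in-memory sketch of those four ingredients (match, sort, limit, offset). It is not the Infogami implementation; the function, its parameter names and the toy records are all invented for illustration.

```python
def search(objects, criteria, sort=None, limit=None, offset=0):
    """Return objects matching all criteria, optionally sorted and paged."""
    matches = [
        obj for obj in objects
        if all(obj.get(key) == value for key, value in criteria.items())
    ]
    if sort is not None:
        # Assumes every matching object has the sort key.
        matches.sort(key=lambda obj: obj[sort])
    end = None if limit is None else offset + limit
    return matches[offset:end]

# Toy records standing in for database objects.
things = [
    {"type": "edition", "title": "Delta"},
    {"type": "edition", "title": "Alpha"},
    {"type": "author", "title": "Someone"},
    {"type": "edition", "title": "Charlie"},
    {"type": "edition", "title": "Bravo"},
]
page = search(things, {"type": "edition"}, sort="title", limit=2, offset=1)
print([t["title"] for t in page])
```

Whatever the endpoint is called, a client using it this way is paging through a result set, which is the same job SRU's searchRetrieve does.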

Rob


[CODE4LIB] OAI-ORE European Open Meeting, April 4 2008

2008-01-25 Thread Rob Sanderson
Apologies for cross-posting

A meeting will be held on April 4, 2008 at the University of
Southampton, in conjunction with Open Repositories 2008, to roll-out the
beta release of the OAI-ORE specifications. This meeting is the European
follow-on to a meeting that will be held in the USA on March 3, 2008 at
Johns Hopkins University.

The OAI-ORE specifications describe a data model to identify and
describe aggregations of web resources, and they introduce
machine-readable formats to describe these aggregations based on ATOM
and RDF/XML. The current, alpha version of the OAI-ORE specifications is
at http://www.openarchives.org/ore/0.1/ .

Additional details for the OAI-ORE European Open Meeting are available at:

- The full press release for this event:

http://www.openarchives.org/ore/documents/EUKickoffPressrelease.pdf

- The registration site for the event:

http://regonline.com/eu-oai-ore

Note that registration is required and space is limited.


Carl Lagoze and Herbert Van de Sompel