Re: [CODE4LIB] RFC 5005 ATOM extension and OAI

2007-10-22 Thread Jakob Voss

Ed Summers wrote:


Thanks for posting this Jakob. I was just reading RFC 5005 on the
train yesterday (literally) and the parallels between it and OAI-PMH
struck me as well. It's not quite clear to me how deleted records
would be handled with an atom archive feed. But I guess one could
assume if the identifier is no longer present it has been deleted.
But that would require pulling the entire archive... I'm not really
sure how much deletes are really used in OAI-PMH repositories anyhow.


OAI-PMH 1.1 was not clear enough on deletions but in 2.0 the
specification contains an example. I think the missing support for
deletions in data providers has to do with the missing explicit support
in service providers and vice versa (a chicken-and-egg problem).


Stuart Weibel has written [1] about the subject of blog archiving in
the past. And I remember hearing Jon Udell and Dan Chudnov talk about
it [2]. Who knows what technorati, bloglines and googlereader are
doing in this area. I guess the reality is that blogs are on the web
and as such will be archived by InternetArchive [3]. But perhaps that
doesn't really fit quite right? That's my feeling.


Thanks. BlogML was new to me - sounds interesting but looks very shaggy
and over-engineered - you do not even get the spec in HTML but have to
download an archive that contains tons of nasty .NET files and an XML
schema instead of a textual description with examples and discussion. I
copied the XML schema here: http://www.gbv.de/wikis/cls/BlogML. I think
extending ATOM is the better way.


I think your general point is correct. Libraries need to be
integrating themselves into the web these days rather than expecting
the web to integrate into them.


I doubt that archiving weblogs is that complicated [1]! You need a
harvester (partly implemented in many feed readers), an archive (you
could start with just saving validated ATOM files), an index (Solr?) and
a reader (also already implemented in many feed readers). I bet you
don't need more than a medium-sized project with one or two developers
and one or two years to create sustainable tools for basic weblog
archiving. Such a project could be done by any larger library or archive
that is able to get funding. It's not a lack of resources, it's a lack
of vision.
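
To illustrate how small the harvester part could be, here is a minimal
sketch in Perl (untested; it assumes the blog exposes RFC 5005 archived
feeds and simply walks the rel="prev-archive" links, saving each page -
file names and error handling are made up for illustration):

#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(get);
use XML::LibXML;

my $url  = shift @ARGV or die "Usage: $0 FEED-URL\n";
my $page = 0;

while ($url) {
    my $xml = get($url) or die "failed to fetch $url\n";

    # archive each feed page as served; validation could be added here
    open my $fh, '>', sprintf('archive-%03d.xml', $page++) or die $!;
    print $fh $xml;
    close $fh;

    # follow the link to the next older archive page (RFC 5005), if any
    my $doc = XML::LibXML->load_xml(string => $xml);
    my $xpc = XML::LibXML::XPathContext->new($doc);
    $xpc->registerNs(atom => 'http://www.w3.org/2005/Atom');
    ($url) = map { $_->value }
        $xpc->findnodes('/atom:feed/atom:link[@rel="prev-archive"]/@href');
}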


Oh, and would it be alright to add your blog to
http://planet.code4lib.org -- we need more of an international
presence on there IMHO.


The subfeed http://jakoblog.de/category/en/feed/atom/ contains all
English language postings which are probably of higher relevance.

Jakob

[1] Ok, real long-term preservation *is* complicated but if you only
archive well-formed XML that conforms to a given schema (ATOM, HTML) you
should be in a good position for the next decades.

--
Jakob Voß [EMAIL PROTECTED], skype: nichtich
Verbundzentrale des GBV (VZG) / Common Library Network
Platz der Goettinger Sieben 1, 37073 Göttingen, Germany
+49 (0)551 39-10242, http://www.gbv.de


Re: [CODE4LIB] RFC 5005 ATOM extension and OAI

2007-10-23 Thread Jakob Voss

Hi Ed,

You wrote:


I completely agree.  When developing software it's really important to
focus on the cleanest/clearest solution, rather than getting bogged
down in edge cases and the comments from nay sayers. I hope that my
response didn't come across that way.


:-)


A couple follow on questions for you:

In your vision for this software are you expecting that content
providers would have to implement RFC 5005 for your archiving system
to work?


Probably yes - at least for older entries. New posts can also be
collected with the default feeds. Instead of working out exceptions and
special solutions for how to get blog archives with other methods, you
should provide RFC 5005 plugins for common blog software like WordPress
and advertise their use ("We are sorry - the blog that you asked to
archive does not support RFC 5005, so we can only archive new postings.
Please ask its provider to implement archived feeds so we can archive
the postings before {TIMESTAMP}. More information and plugins for RFC
5005 can be found {HERE}. Thank you!").


Are you considering archiving media files associated with a blog entry
(images, sound, video, etc?).


Well, it depends. There are hundreds of ways to associate media files
- I doubt that you can easily archive YouTube and SlideShare widgets
etc., but images included with <img src="..."/> should be doable. However
I prefer iterative development - once basic archiving works, you can
start to think about media files. By the way, I would put more value on
the comments, which are also additional content and non-trivial to archive.

To begin with, a WordPress plugin is surely the right step. Up to now
RFC 5005 is so new that no one has implemented it yet, although it's not
complicated.

Greetings,
Jakob

--
Jakob Voß [EMAIL PROTECTED], skype: nichtich
Verbundzentrale des GBV (VZG) / Common Library Network
Platz der Goettinger Sieben 1, 37073 Göttingen, Germany
+49 (0)551 39-10242, http://www.gbv.de


Re: [CODE4LIB] RFC 5005 ATOM extension and OAI

2007-10-25 Thread Jakob Voss

Hi Clay,


I completely agree with everything you just wrote, especially about
Atom + APP being more than just a technology for blogs.  APP is a
great lightweight alternative to WebDAV, and promising for all sorts
of data transfer.  The fact that it has developer groundswell is a
huge plus.  During my Princeton days Kevin Clarke and I briefly
talked about what a METS + APP metadata editing application could
do.  (I can't remember the answer, but I bet it would be snazzy.)


On the one hand you are right: Atom + APP is becoming popular and the
standards are good, so digital libraries should get into it. On the
other hand I was just reminded of the ECDL 2006 paper "Repository
Replication Using NNTP and SMTP": you can use almost any protocol (HTTP,
OAI, ATOM APP, WebDAV, NNTP...) for most of digital libraries' use cases
- but the best standard without appropriate tools and support is pretty
worthless.


I came to this realization out of frustration that most OAI toolkits
(at the time, ca. 2005) didn't support that functionality well -- or
at all.  I don't know if that's still the case.  However, the need to
delete records is a reality for most projects, and OAI has somewhat
awkwardly made us rethink how to delete a record in repositories
and the like, both on the service and data provider end.   You almost
have to build your entire system around handling deleted records
just for OAI exposure.   In reality it seems like you just end up
masquerading or re-representing its outward visibility on our local
systems, which gets onerous.

I guess the difference is that the growing number of Atom developers
are heeding the requirement for deletions, whereas the few existing
OAI toolkit developers have deemed that functionality as optional.


Most repositories do not even track deletions so they cannot syndicate
them. If OAI deletions were mandatory, maybe OAI-PMH would not have been
used that much? OAI did a good job in promoting and documenting OAI-PMH
but deletions were always treated as an orphan - I would not blame the
standard but the lacking implementations.

Also, ATOM and RFC 5005 are not much better than other solutions - but
they are much more likely to get implemented in weblog and other software
than OAI, which is not that well known outside the library world.

Greetings,
Jakob

P.S: Maybe we would all be happy with Z39.50 if we had had those
wonderful Index Data tools right from the beginning - instead there were
only closed specifications and various closed-source partial
implementations. A standard without easy-to-use open source
implementations is condemned to be violated and to die.


--
Jakob Voß [EMAIL PROTECTED], skype: nichtich
Verbundzentrale des GBV (VZG) / Common Library Network
Platz der Goettinger Sieben 1, 37073 Göttingen, Germany
+49 (0)551 39-10242, http://www.gbv.de


Re: [CODE4LIB] RFC 5005 ATOM extension and OAI

2007-10-25 Thread Jakob Voss

Peter wrote:


Also, re: blog mirroring, I highly recommend the current discussions
floating around the blogosphere regarding distributed source control (Git,
Mercurial, etc.).  It's a fundamental paradigm shift from centralized
control to distributed control that points the way toward the future of
libraries as they (we) become less and less the gatekeepers for the
"stuff" be it digital or physical and more and more the facilitators of
the bidirectional replication that assures ubiquitous access and
long-term preservation.  The library becomes (actually it has already
happened) simply a node on a network of trust and should act accordingly.

See the thoroughly entertaining/thought-provoking Google tech talk by
Linus Torvalds on Git:  http://www.youtube.com/watch?v=4XpnKHJAok8


Thanks for pointing to this interesting discussion. This goes even
further than the current paradigm shift from the old model
(author - publisher - distributor - reader) to a world of
user-generated content and collaboration! I would be glad if we finally got
to model and archive weblogs and wikis - modelling and archiving the
whole process of content copying, changing, remixing and
republication is far beyond libraries' capabilities!

Greetings,
Jakob

--
Jakob Voß [EMAIL PROTECTED], skype: nichtich
Verbundzentrale des GBV (VZG) / Common Library Network
Platz der Goettinger Sieben 1, 37073 Göttingen, Germany
+49 (0)551 39-10242, http://www.gbv.de


[CODE4LIB] RDA - a standard that nobody will notice?

2008-12-17 Thread Jakob Voss

Hi,

As you may have already noticed, the Resource Description and Access (RDA)
cataloguing instructions will be published in 2009. You can submit final
comments on the full draft until February 2nd:


http://www.collectionscanada.gc.ca/jsc/rda.html
http://www.collectionscanada.gc.ca/jsc/rdafulldraft.html

Although there are several details you can argue about (and despite the
question whether detailed cataloguing rules have a future at all when
people do cataloguing in LibraryThing, BibSonomy etc. without rules) I
think that RDA is a step in the right direction. But there are some
serious problems with the publication of RDA that should be of interest
to you:



1.) The standard is scattered across a set of PDF files instead of clean
web-based HTML (compare with the W3C recommendations). You cannot easily
browse and search RDA with your browser and a public search engine of
your choice. You cannot link to a specific paragraph to cite RDA in a
weblog posting etc. This shows me that the authors are still bound to the
physical world of dusty books instead of the digital age.



2.) RDA is not going to be published freely available on the web at all!
See http://www.collectionscanada.gc.ca/jsc/rdafaq.html#7 Another reason
why you won't be able to refer to specific sections of RDA. Defining a
standard without putting it on Open Access (ideally under a specific
CC license) is a retrogressive practice and a good strategy to make people
ignore, misinterpret and violate it (you could also argue ethically
that it's a shame for any librarian not to put their publications under
Open Access, but the argument of quality should be enough).



3.) There are no official URIs for the elements of RDA. It looks like
there has been no progress compared to FRBR (IFLA failed to publish an
official RDF encoding of FRBR, so several people created their own
vocabularies). To encode bibliographic data on the Semantic Web you need
URIs for classes and properties. I don't expect RDA to get published as
a full ontology but at least you could determine the basic concepts and
elements and provide common URIs that people can build on. There are
several attempts to create ontologies for bibliographic data but most of
them come from outside the professional library community. Without a
connection to the Semantic Web RDA will be irrelevant outside the
library world. With official URIs people can build on RDA and create a
common ontology from it. Deirdre Kiorgaard did a good job in collecting
elements [1] and Eversberg provides a database to start with [2].



What do you think about my concerns? We should try to get the JSC to
make RDA Open Access, prepared for use on the Web and even prepared for
the Semantic Web. This should not be too difficult - the main work is
convincing people (ok, it may be difficult to convince people ;-). I'd
be glad if you sent your comments to the Joint Steering Committee for
Development of RDA by February 2nd:


http://www.collectionscanada.gc.ca/jsc/rdadraftcomments.html

It would be a pity if RDA were an irrelevant anachronism from the
beginning just because it is not published the way standards need to be
published on the Web.



Greetings
Jakob Voss

[1] http://www.collectionscanada.gc.ca/jsc/docs/5rda-elementanalysisrev.pdf

[2] A helpful tool for structured temporary access to RDA is provided by 
Bernhard Eversberg at http://www.biblio.tu-bs.de/db/wtr/detail.php - 
this is what should be provided officially!


--
Jakob Voß jakob.v...@gbv.de, skype: nichtich
Verbundzentrale des GBV (VZG) / Common Library Network
Platz der Goettinger Sieben 1, 37073 Göttingen, Germany
+49 (0)551 39-10242, http://www.gbv.de


Re: [CODE4LIB] PICAplus to MODS conversion

2009-01-21 Thread Jakob Voss

Joachim Neubert wrote:


Has anybody built a conversion from PICA+ to MODS and is willing to
share the code?


First, there is not one PICA+ but different variants, just as there are
different MARC variants. Second, most PICA+-to-anything conversions
that I know of use the FCV conversion language, which was developed by PICA
and is neither documented nor standardized - so you won't be able to do
anything with it unless you run your own PICA system. At GBV we export
MODS via MARC and the XSLT provided by the LOC; the chain is


GBV-PICA+ --(FCV)--> MARC ---> MARCXML --(XSLT)--> MODS

As you can imagine, the metadata quality does not get better with each
step, as always when you convert between metadata formats. But it's
better than having to write n*n individual conversions.
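
The last step of that chain takes only a few lines; a sketch in Perl
(assuming the MARCXML-to-MODS stylesheet provided by the LOC has been
downloaded locally - the file names are made up):

use strict;
use warnings;
use XML::LibXML;
use XML::LibXSLT;

# transform a MARCXML record into MODS with the LOC stylesheet
my $xslt   = XML::LibXSLT->new();
my $style  = $xslt->parse_stylesheet_file('MARC21slim2MODS3.xsl');
my $source = XML::LibXML->load_xml(location => 'record-marcxml.xml');

my $result = $style->transform($source);
print $style->output_as_bytes($result);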


Greetings,
Jakob

--
Jakob Voß jakob.v...@gbv.de, skype: nichtich
Verbundzentrale des GBV (VZG) / Common Library Network
Platz der Goettinger Sieben 1, 37073 Göttingen, Germany
+49 (0)551 39-10242, http://www.gbv.de


Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All

2009-05-11 Thread Jakob Voss

Hi,

I summarized my thoughts about identifiers for data formats in a blog 
posting: http://jakoblog.de/2009/05/10/who-identifies-the-identifiers/


In short it’s not a technology issue but a commitment issue and the 
problem of identifying the right identifiers for data formats can be 
reduced to two fundamental rules of thumb:


1. reuse: don’t create new identifiers for things that already have one.

2. document: if you have to create an identifier, describe its referent
as openly, clearly, and in as much detail as possible to make it reusable.


A format should be described with a schema (XML Schema, OWL etc.) or at 
least a standard. Mostly this schema already has a namespace or similar 
identifier that can be used for the whole format.


For instance MODS Version 3 (currently 3.0, 3.1, 3.2, 3.3) has the XML
Namespace http://www.loc.gov/mods/v3 so this is the best identifier to
identify MODS. If you need to identify a specific version then you
should *first* look if such identifiers already exist, *second* push the
publisher (LOC) to assign official URIs for MODS versions, if these do
not already exist, or *third* create and document specific URIs and make
sure that everyone knows about these identifiers. At the moment there are:


MODS Version 3     http://www.loc.gov/mods/v3
MODS Version 3.0   info:srw/schema/1/mods-v3.0
MODS Version 3.1   info:srw/schema/1/mods-v3.1
MODS Version 3.2   info:srw/schema/1/mods-v3.2
                   info:ofi/fmt:xml:xsd:mods
MODS Version 3.3   info:srw/schema/1/mods-v3.3

The SRU Schemas registry links the info:srw/schema/1/mods-v3*
identifiers to their XML Schemas, which is very little documentation, but
at least it links to http://www.loc.gov/mods/v3 in some way.


Ross wrote:


First, and most importantly, how do we reconcile these different
identifiers for the same thing?  Can we come up with some agreement on
which ones we should really use?


Use the one that is documented best.


Secondly, and this gets to the reason why any of this was brought up
in the first place, how can we coordinate these identifiers more
effectively and efficiently to reuse among various specs and
protocols, but not:



1) be tied to a particular community
2) require some laborious and lengthy submission and review process to
just say hey, here's my FOAF available via UnAPI


The identifier for FOAF is http://xmlns.com/foaf/0.1/. Forget about
identifiers that are not URIs. OAI-PMH at least includes a mechanism to
map metadataPrefixes to official URIs but this mechanism is not always
used. If unAPI lacks a way to map a local name to a global URI, we had
better fix unAPI so it can tell us:


<?xml version="1.0" encoding="UTF-8"?>
<formats xmlns="http://unapi.info/">
  <format name="foaf" uri="http://xmlns.com/foaf/0.1/"/>
</formats>

unAPI should be revised and specified more strictly to become an RFC
anyway. Yes, this requires a laborious and lengthy submission and review
process but there is no such thing as a free lunch.



3) be so lax that it throws all hope of authority out the window


Reuse existing authorities and document better to create authority.


I would expect the various communities to still maintain their own
registries of approved data formats (well, OpenURL and SRU, anyway
-- it's not as appropriate to UnAPI or Jangle).


There should be a distinction between descriptive registries that only
list identifiers and formats that are defined elsewhere, and
authoritative registries that define new identifiers and formats. The
number of authoritatively defined identifiers should be small for a
given API because the identifier is better defined by the creator
of the format than by a user of the format. If the creator does not
provide usable identifiers then better talk to them instead of creating
something in parallel.


Greetings,
Jakob

--
Jakob Voß jakob.v...@gbv.de, skype: nichtich
Verbundzentrale des GBV (VZG) / Common Library Network
Platz der Goettinger Sieben 1, 37073 Göttingen, Germany
+49 (0)551 39-10242, http://www.gbv.de


[CODE4LIB] Formats and its identifiers

2009-05-11 Thread Jakob Voss

Hi Rob,

You wrote:

A format should be described with a schema (XML Schema, OWL etc.) or at 
least a standard. Mostly this schema already has a namespace or similar 
identifier that can be used for the whole format.


This is unfortunately not the case.


It is mostly the case - but people like to misinterpret schemas and 
tailor them to their needs.


For instance MODS Version 3 (currently 3.0, 3.1, 3.2, 3.3) has the XML
Namespace http://www.loc.gov/mods/v3 so this is the best identifier to
identify MODS.


And this is a perfect example of why this is not the case.

The same mods schema (let alone namespace) defines TWO formats, mods and
modsCollection.


That's your interpretation. According to the schema, the MODS format
*is* either a single mods element or a modsCollection element. That's
exactly what you can refer to with the namespace identifier
http://www.loc.gov/mods/v3.


If you need to identify only the specific element 'mods' of the format,
then you need another identifier. Up to now there is no standard way to
create an identifier for a specific element in an XML format, see

http://www.w3.org/TR/webarch/#xml-fragids

But if the MODS specification defines that you can refer to any element
with a URI fragment identifier, then the right identifier would be
http://www.loc.gov/mods/v3#mods


You wrote:

 I totally agree that it's an awful design choice. However it's a
 demonstration that XML namespaces _do not identify format_.  And
 hence, we need another identifier which is not the namespace of
 the top level element.

The namespace http://www.loc.gov/mods/v3 of the top level element 'mods'
does not identify the top level element but the MODS *format* itself (in
any of the versions 3.0-3.3). This format *includes* the top level
element 'mods'.



Also consider the following more hypothetical, but perfectly feasible
situations:

* One namespace is used to define two _totally_ separate sets of
elements.  There's no reason why this can't be done.


Ok, let A and B be two formats with two totally separate sets of elements
(and rules for how to use them). If you put them into one namespace, then
you get a new format C that is the union of A and B.



* One namespace defines so many elements that it's meaningless to call
it a format at all.  Even though the top level tag might be the same,
the contents are so varied that you're unable to realistically process
it.


Sad but true: The word format in the context of library applications 
does not make sense anyway in most cases. Technically a format is just a 
set of possible instances, defined as a formal language or with any 
other type of specification. The problem of library formats is that many 
people refer to them without providing a proper specification.


Coming back to the mods example: If the SRU Schema registry lists
info:srw/schema/1/mods-v3.3 as the identifier for "MODS Schema Version
3.3" with a pointer to the XML Schema
http://www.loc.gov/standards/mods/v3/mods-3-3.xsd, then *any* XML
document that validates against this schema must be considered to be a
MODS 3.3 document - either with 'mods' or with 'modsCollection' as root
element.
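
What "validates against this schema" means in practice can be sketched in
a few lines of Perl (a sketch only; the record file name is made up):

use strict;
use warnings;
use XML::LibXML;

# validate a record against the MODS 3.3 XML Schema referenced above
my $schema = XML::LibXML::Schema->new(
    location => 'http://www.loc.gov/standards/mods/v3/mods-3-3.xsd'
);
my $doc = XML::LibXML->load_xml(location => 'record.xml');

eval { $schema->validate($doc) };
print $@
    ? "not MODS 3.3: $@"
    : "valid MODS 3.3 (root element: " . $doc->documentElement->localname . ")\n";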


Greetings
Jakob

--
Jakob Voß jakob.v...@gbv.de, skype: nichtich
Verbundzentrale des GBV (VZG) / Common Library Network
Platz der Goettinger Sieben 1, 37073 Göttingen, Germany
+49 (0)551 39-10242, http://www.gbv.de


Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All

2009-05-12 Thread Jakob Voss

Ross Singer wrote:


<?xml version="1.0" encoding="UTF-8"?>
<formats xmlns="http://unapi.info/">
  <format name="foaf" uri="http://xmlns.com/foaf/0.1/"/>
</formats>


I generally agree with this, but what about formats that aren't XML or
RDF based?  How do I also say that you can grab my text/x-vcard?  Or
my application/marc record?  There is still lots of data I want that
doesn't necessarily have these characteristics.


In my blog posting I included a way to specify mime types (such as
text/x-vcard or application/marc) as URIs. According to RFC 2220 the
application/marc type refers to "the harmonized USMARC/CANMARC
specification" - whatever this is - so the mime type can be used as a
format identifier. For vCard there is an RDF namespace and a (not very
nice) XML namespace:


http://www.w3.org/2001/vcard-rdf/3.0#
vcard-temp (see http://xmpp.org/registrar/namespaces.html)

If you want to identify a defined format, there is almost always an 
identifier you can reuse - if not, ask the creator of the format. The 
problem is not in identifiers or the complexity of formats but in people 
that create and use formats that are not well defined.



What about XML formats that have no namespace?  JSON objects that
conform to a defined structure?  Protocol Buffers?


If something does not conform to a defined structure then it is no 
format at all but data garbage (yes, we have a lot of this in library 
systems but that's no excuse). To refer to XML or JSON in general there 
are mime types. If you want to identify something more specific there 
must be a definition of it or you are lost anyway.



And, while I didn't really want to wade into these waters, what about
formats that are really only used to carry other formats, where it's
the *other* format that really matters (METS, Atom, OpenURL XML,
etc.)?


A container format with restricted carried format is a subset of the 
container format. If you cannot handle the whole but only a subset then 
you should only ask for the subset. There are three possibilities:


1. implicitly define the container format and choose the carried
format. This is what SRU does - you ask for the record format but you
always get the SRU response format as a container with the embedded record
format.


2. implicitly define the carried format and choose the container format

3. define a new format as a combination of container and carried format


unAPI should be revised and specified more strictly to become an RFC anyway.
Yes, this requires a laborious and lengthy submission and review process but
there is no such thing as a free lunch.


Yeah, I have no problem with this (same with Jangle).  The argument
could be made, however, is there a cowpath yet to be paved?


That depends on whether you want to be taken seriously outside the library
community and target the web as a whole or not.


Cheers,
Jakob

--
Jakob Voß jakob.v...@gbv.de, skype: nichtich
Verbundzentrale des GBV (VZG) / Common Library Network
Platz der Goettinger Sieben 1, 37073 Göttingen, Germany
+49 (0)551 39-10242, http://www.gbv.de


[CODE4LIB] Availability Information API

2009-10-23 Thread Jakob Voss

Hi,

I just wanted to announce that I finished a reference implementation of
the Document Availability Information API (DAIA) as a CPAN module at
http://search.cpan.org/perldoc?DAIA. More information about DAIA can be
found in the specification at http://purl.org/NET/DAIA and at
http://www.gbv.de/wikis/cls/DAIA. The basic structure is:


[Document] -- 1-to-n --> [Item]
[Item]     -- 1-to-n --> [Service]  (which is either [Available] or [Unavailable])
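
Purely as an illustration of this structure (this is *not* the interface
of the CPAN module, just a hypothetical availability response written
down as a nested data structure and serialized to JSON):

use strict;
use warnings;
use JSON::PP;

# one document with one item that can be borrowed and one item on loan
my $response = {
    document => [ {
        id   => 'urn:isbn:9783161484100',   # made-up identifier
        item => [
            { available   => [ { service => 'loan' } ] },
            { unavailable => [ { service  => 'loan',
                                 expected => '2009-11-15' } ] },
        ],
    } ],
};

print JSON::PP->new->pretty->encode($response);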


We created DAIA for German library networks as an interchange format and
API to encode information about the current availability of a specific
document (or any copy of it) in a given library. There are numerous APIs
for several tasks in library systems (SRU/SRW, Z39.50, OpenSearch,
OAI-PMH, Atom, unAPI etc.) but no standard way to just query whether a
copy of a given document - for instance a book - is available in a library,
in which department, whether you can loan it or only use it in the library
(or even read it online), or, if it is not available, how long it will
probably take until it is available again. Obviously such an API would be
helpful not only to connect different library systems but to create mashups
and services ("Show me on a map where a given book is currently held and
available", "Send me a tweet if a given book in my library is available
again" etc.). DAIA was created to fill this gap. In the context of the ILS
Discovery Interface Task Force and its official recommendation
(http://diglib.org/architectures/ilsdi/) DAIA corresponds to the
GetAvailability method (section 6.3.1).


At the moment the format and API are pretty stable, so the main work is
to create server and client components for several ILS systems. Every
library has its own special rules and schemas - Jonathan Rochkind
already wrote about the problems of implementing DAIA because of ILS
complexity:
http://bibwild.wordpress.com/2009/09/02/daia-and-ils-complexity/ . We
cannot erase this complexity by magic (unless we refactor and clean the
ILS) but at least we can try to map it to a common data model, which DAIA
provides. With the DAIA Perl package you can concentrate on writing the
wrapper without dealing with DAIA parsing and serialization issues. Why
should everyone write their own routines to scrape, for instance, the HTML
OPAC output to parse availability status? One mapping to DAIA should fit
most needs, so others can build upon it. A public DAIA converter/validator
is available at http://ws.gbv.de/daia/validator


Extensions to DAIA can be discussed in the Code4Lib Wiki at
http://wiki.code4lib.org/index.php/DAIA_extensions but I'd prefer not to
start with the extensions but with basic services. If you have more cool
ideas for client applications, just let me know!


Cheers,
Jakob

P.S: Yes, there are some other awkward attempts to encode availability
(Z39.50 Holdings, ISO 20775 Holdings, NCIP, SLNP...) but I found all of
them underdefined and not publicly documented or openly usable based
on Web standards.


--
Jakob Voß jakob.v...@gbv.de, skype: nichtich
Verbundzentrale des GBV (VZG) / Common Library Network
Platz der Goettinger Sieben 1, 37073 Göttingen, Germany
+49 (0)551 39-10242, http://www.gbv.de


[CODE4LIB] DAIA project hosted at SourceForge

2010-04-26 Thread Jakob Voss

Hi,

I'd just like to announce that we moved our DAIA related resources [1] 
to SourceForge:


http://daia.sourceforge.net/
https://sourceforge.net/projects/daia/
http://daia.svn.sourceforge.net/viewvc/daia/

DAIA is a data model to express information about the availability of 
documents and their particular copies for library-related services 
(loan, local presentation, open access online, interlibrary loan...). 
Its specification (http://purl.org/NET/DAIA) includes serialization in 
XML, JSON and RDF - use the format of your choice and get the other 
serializations for free [2]


If you are happy with stuff like NCIP, SLNP, SIP2, Z39.50 Holdings, ISO
20775 Holdings then I wonder why. But if you would like to decouple and open
up your library system for better innovation then you might want to have
a look. Additional services like "Tweet me if the book is back" should
be easy to implement by third parties based on a common, well-defined
availability API (/end-of-advertisement)


The current SVN repository contains the existing Perl implementation
(which you can also get via CPAN), an implementation in PHP and an XSLT
client for DAIA/XML. The VuFind project also contains a DAIA driver as a
client component.


Feedback is very welcome, feel free to join. The project mailing list is 
 https://lists.sourceforge.net/mailman/listinfo/daia-devel (don't be 
confused that the first mail is in German, we also know English).


So how do you get availability information out of your library system?

Cheers
Jakob

[1] see my mail from October 2009:
http://www.mail-archive.com/code4lib@listserv.nd.edu/msg06164.html

[2] ok, the RDF serialization is still work-in-progress - see 
http://purl.org/ontology/daia for the current draft


--
Jakob Voß jakob.v...@gbv.de, skype: nichtich
Verbundzentrale des GBV (VZG) / Common Library Network
Platz der Goettinger Sieben 1, 37073 Göttingen, Germany
+49 (0)551 39-10242, http://www.gbv.de


Re: [CODE4LIB] Twitter annotations and library software

2010-04-27 Thread Jakob Voss

Hi Tim,

you wrote:


Unless someone can come up with a perfect pre-cooked format—one that
not only covers what we need but is also super easy and
space-efficient (we have only 1/2k to use!)—Why don't we just decide
on:

'simplebib' : {

}

and start filling in fields. I don't think it makes sense to
externalize the information under another URL, at least in the first
instance. That at least doubles the calls involved, and makes whatever
you build dependent on lots of external services that may or may not
work.


Oh yeah, let's create just another ad-hoc metadata format, because
obviously there are not enough different formats out there already!


To be honest: I admire your multitude of good ideas and efforts but this 
is one of the rare counterexamples. If you want to put bibliographic 
metadata into twitter annotations (good idea) you first need to clarify 
the basic purpose of embedding this information. I see two of them:


I. Identification: To identify other tweets and resources that refer to 
the same publication


II. Description: To nicely show which publication someone refers to.


The purpose of identification can be served by the following means:

a). standard identifiers
b). standard identifiers
c). standard identifiers

Examples of standard identifiers include ISBN, OCLC Number, ASIN, 
LibraryThing Work-ID, well-defined bibliographic hash keys [*] etc.



The purpose of description can best be served by a format that can
easily be displayed for human beings. You can either use a simple
string or a well-known format. A string can be displayed, but people will
put all kinds of different citation formats in there. Right now there are
only two established metadata formats that aim at creating a citation:


a) BibTeX
b) The input format of the Citation Style Language (CSL)

I bet that CSL is the easier way to go. See http://citationstyles.org/ 
for details and examples.



Cheers
Jakob

[*] See http://www.gbv.de/wikis/cls/Bibliographic_Hash_Key for a 
description of the mapping mechanism that is also used in BibSonomy to 
match BibTeX records.


--
Jakob Voß jakob.v...@gbv.de, skype: nichtich
Verbundzentrale des GBV (VZG) / Common Library Network
Platz der Goettinger Sieben 1, 37073 Göttingen, Germany
+49 (0)551 39-10242, http://www.gbv.de


Re: [CODE4LIB] Twitter annotations and library software

2010-04-28 Thread Jakob Voss

Hi

it's funny how quickly you vote against BibTeX, but at least it is a
format that is frequently used in the wild to create citations. If you
call BibTeX undocumented and garbage, then what do you call MARC, which is
far more difficult to make use of?


My assumption was that there is a specific use case for bibliographic 
data in twitter annotations:


I. Identify a publication => this can *only* be done seriously with
identifiers like ISBN, DOI, OCLCNum, LCCN etc.


II. Deliver a citation => use a citation-oriented format (BibTeX, CSL, RIS)

I was not voting explicitly for BibTeX but at least there is a large 
community that can make use of it. I strongly favour CSL 
(http://citationstyles.org/) because:


- there is a JavaScript CSL-Processor. JavaScript is kind of a 
punishment but it is the natural environment for the Web 2.0 Mashup 
crowd that is going to implement applications that use Twitter annotations


- there are dozens of CSL citation styles so you can display a citation 
in any way you want


As Ross pointed out, RIS would be an option too, but I miss easy open
source tools that create citations from RIS data.


Any other relevant format that I know (Bibont, MODS, MARC etc.) does not
aim at identification or citation in the first place but tries to model
the full variety of bibliographic metadata. If your use case is


III. Provide semantic properties and connections of a publication

then you should look at the Bibliographic Ontology. But III does *not*
just subsume use case II - it is a different story that is not being
told by normal people but only by metadata experts, semantic web gurus,
library system developers etc. (I would count myself among these groups).
If you want such complex data then you should use systems other than
Twitter for data exchange anyway.


A list of CSL metadata fields can be found at

http://citationstyles.org/downloads/specification.html#appendices

and the JavaScript-Processor (which is also used in Zotero) provides 
more information for developers: http://groups.google.com/group/citeproc-js


Cheers
Jakob

P.S: An example of a CSL record from the JavaScript client:

{
  "title": "True Crime Radio and Listener Disenchantment with Network Broadcasting, 1935-1946",
  "author": [ {
    "family": "Razlogova",
    "given": "Elena"
  } ],
  "container-title": "American Quarterly",
  "volume": "58",
  "page": "137-158",
  "issued": { "date-parts": [ [2006, 3] ] },
  "type": "article-journal"
}


--
Jakob Voß jakob.v...@gbv.de, skype: nichtich
Verbundzentrale des GBV (VZG) / Common Library Network
Platz der Goettinger Sieben 1, 37073 Göttingen, Germany
+49 (0)551 39-10242, http://www.gbv.de


Re: [CODE4LIB] Twitter annotations and library software

2010-04-28 Thread Jakob Voss

Ed Summers wrote:


II. Description: To nicely show which publication someone refers to.


I think this is right. I wonder, would you consider a potential use
case for Description to also provide machine readable data for a
resource when a standard identifier is not known?


There are lookup services to get a standard identifier when only some
bibliographic data is known - mainly OpenURL. I have not investigated
whether you can easily map the CSL format to OpenURL or if you need to also
embed the OpenURL as a twitter annotation. However all lookup services
that I know are either crappy or proprietary or both. This is not a
technical issue but just based on a lack of data (hopefully this will get
better with more linked open data). Given enough open bibliographic data
anyone can create a lookup service where you throw in some title, author
and other stuff and get back an identified record. I think there also are
some services called "library catalogs" for this purpose.


Anyway, this is nothing that can be solved with a bibliographic data
format alone. Either you have a standard identifier or you don't. If
you don't, you must rely on third-party services that run independently
of your bibliographic data.



It would be interesting to explore what identifiers + csl (and other
options) would look like in a twitter annotation if you had time to
mock something up in a wiki somewhere :-)


I summarized my findings on CSL at

http://wiki.code4lib.org/index.php/Citation_Style_Language

and included some ideas of CSL and other data in twitter annotations. 
Feel free to modify!


Cheers
Jakob

--
Jakob Voß jakob.v...@gbv.de, skype: nichtich
Verbundzentrale des GBV (VZG) / Common Library Network
Platz der Goettinger Sieben 1, 37073 Göttingen, Germany
+49 (0)551 39-10242, http://www.gbv.de


Re: [CODE4LIB] Twitter annotations and library software

2010-04-28 Thread Jakob Voss

Jonathan Rochkind wrote:


Jakob Voss wrote:
I. Identify a publication => this can *only* be done seriously with
identifiers like ISBN, DOI, OCLCNum, LCCN etc.
  
Ah, but for better or for worse, that's not the world we live in. We 
have LOTS of publications that either lack such identifiers altogether, 
or where information about identifiers is not available. (Mostly the 
former). That we need to identify. This is an actual use case, you can't 
just dismiss it by saying it can't be done!


Call me pedantic, but if you do not have an identifier then there is no
hope of identifying the publication by means of metadata. You only
*describe* it with metadata and use additional heuristics (mostly search
engines) to hopefully identify the publication based on the description.


But these additional heuristics are not part of the metadata, while a
well-defined identifier implies a standard for how the identifier has
been created and how it can be looked up.


The last hope, if there is no identifier, is to create one. For instance
our library system creates internal record numbers (such as OCLC
numbers) which can be reused. You can also define an algorithm that
creates a hash as identifier, like the bibkey I mentioned. But as long as
there is no identifier there is no identification independent of a
bibliographic database that already contains the record to search in.
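
Just to illustrate the idea - this is *not* the actual BibSonomy bibkey
algorithm (see the wiki link in my earlier mail), only a hypothetical
sketch of how such a hash could be computed from normalized fields:

use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# illustrative only: normalize a few fields and hash them
sub simple_bibkey {
    my ($title, $author, $year) = @_;
    my $norm = lc join '|', $title, $author, $year;
    $norm =~ s/[^a-z0-9|]//g;    # drop punctuation and whitespace
    return md5_hex($norm);
}

print simple_bibkey('True Crime Radio and Listener Disenchantment',
    'Razlogova, Elena', 2006), "\n";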


Jakob

--
Jakob Voß jakob.v...@gbv.de, skype: nichtich
Verbundzentrale des GBV (VZG) / Common Library Network
Platz der Goettinger Sieben 1, 37073 Göttingen, Germany
+49 (0)551 39-10242, http://www.gbv.de


Re: [CODE4LIB] Twitter annotations and library software

2010-04-29 Thread Jakob Voss

Jonathan Rochkind wrote:

Call me pedantic, but if you do not have an identifier then there is no
hope of identifying the publication by means of metadata. You only
*describe* it with metadata and use additional heuristics (mostly
search engines) to hopefully identify the publication based on the
description.
  
But the entire OpenURL infrastructure DOES this, and does it without 
using search engines. It's a real world use case that has a solution in 
production! So, yeah, I call you pedantic for wanting to pretend the use 
case and the real world solution doesn't exist. :)


As you said, OpenURL is an *infrastructure*. It only makes sense if you
have resolvers that map an OpenURL to a unique publication. These
resolvers do the identification while OpenURL only describes (as long as
you do not put a unique publication identifier into an OpenURL). In
contrast, an identifier can be used to compare and search publications
without any additional infrastructure.


You can call it description rather than identification if you like,
that is a question of terminology. But it's description that is meant to
uniquely identify a particular publication, and that a whole bunch of
software in use every day successfully uses to identify a particular
publication.


It's not just terminology if you can either just compare two strings for
equality (identification) or you need an infrastructure with knowledge
bases and specific software (to make use of a description).


OpenURL is of no use if you separate it from the existing infrastructure,
which is mainly held by companies. No sane person will try to build an
open alternative infrastructure because OpenURL is a crappy
library standard like MARC etc. This rant on OpenURL summarizes it well:


http://cavlec.yarinareth.net/2006/10/13/i-hate-library-standards/

The OpenURL specification is a 119-page PDF - that alone is a reason to
run away as fast as you can.


If a twitter annotation setup wants to be able to identify publications 
that don't have standard identifiers, then you don't want to ignore this 
use case and how actually in production software currently deals with 
it. You can perhaps find a better way to deal with it -- I'm certainly 
not arguing for OpenURL as the be all end all, I rather hate OpenURL 
actually.  But dismissing it as impossible is indeed pedantic, since 
it's being done!


If a twitter annotation setup wants to get adopted then it should not be
built on a crappy, complex library standard like OpenURL.


 It IS a hacky and error-prone solution, to be sure.   But it's the
 best solution we've got, because it's simply a fact that we have many
 publications we want to identify that lack standard identifiers.

Ok, back to being serious: bibliographic Twitter annotations should be
designed in a way that libraries (or whoever provides those knowledge
bases aka OpenURL resolvers) can use them to look up a publication by its
metadata. So there should be a transformation


Twitter annotation => OpenURL

If you choose CSL as bibliographic input format you can hopefully create 
a CSL style that does not produce a citation but an OpenURL - Voilà!


I must admit that this solution is based on the untested assumption that
the CSL record format contains all information needed for OpenURL, which
may not be the case. A good point to start from is the function
createContextObject in


https://www.zotero.org/svn/extension/trunk/chrome/content/zotero/xpcom/ingester.js

which is used by Zotero to create OpenURLs.
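
A rough sketch of what such a transformation could look like in Perl
(only a few hand-picked fields; the KEV keys follow the OpenURL journal
format, everything else is a simplifying assumption):

use strict;
use warnings;
use URI::Escape qw(uri_escape_utf8);

# map a minimal CSL-like record to an OpenURL ContextObject in KEV form
my %record = (
    'container-title' => 'American Quarterly',
    'title'           => 'True Crime Radio and Listener Disenchantment',
    'volume'          => '58',
    'page'            => '137-158',
);

my %kev = (
    'rft_val_fmt' => 'info:ofi/fmt:kev:mtx:journal',
    'rft.jtitle'  => $record{'container-title'},
    'rft.atitle'  => $record{'title'},
    'rft.volume'  => $record{'volume'},
    'rft.pages'   => $record{'page'},
);

my $openurl = 'ctx_ver=Z39.88-2004&' . join '&',
    map { $_ . '=' . uri_escape_utf8($kev{$_}) } sort keys %kev;

print "$openurl\n";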

Cheers
Jakob

--
Jakob Voß jakob.v...@gbv.de, skype: nichtich
Verbundzentrale des GBV (VZG) / Common Library Network
Platz der Goettinger Sieben 1, 37073 Göttingen, Germany
+49 (0)551 39-10242, http://www.gbv.de


Re: [CODE4LIB] Twitter annotations and library software

2010-04-29 Thread Jakob Voss

Dear Tim,

you wrote:

So this is my recommended framework for proceeding. Tim, I'm afraid
you'll actually have to do the hard work yourself.


No, I don't. Because the work isn't fundamentally that hard. A
complex standard might be, but I never for a moment considered
anything like that. We have *512 bytes*, and it needs to be usable by
anyone. Library technology is usually fatally over-engineered, but
this is a case where that approach isn't even possible.


Jonathan gave a very good summary - you just have to pick what your main
focus of embedding bibliographic data is.



A) I favour using the CSL record format, which I summarized at

http://wiki.code4lib.org/index.php/Citation_Style_Language

because I had in mind that people want to have a nice-looking citation
of the publication that someone tweeted about. The drawback is that CSL
is less widely adopted and will not always fit in 512 bytes.



B) If your main focus is to link tweets about the same publication (and
other stuff about this publication) then you must embed identifiers.
LibraryThing is mainly based on two identifiers:


1) ISBN to identify editions
2) LT work IDs to identify works

I wonder why LT work IDs have not been picked up more widely, although you
thankfully provide a full mapping to ISBN at
http://www.librarything.com/feeds/thingISBN.xml.gz but never mind. I
thought that some LT records also contain other identifiers such as OCLC
number, LOC number etc. but maybe I am wrong. The best way to specify
identifiers is to use a URI (all relevant identifiers that I know of have
a URI form). For ISBN it is


urn:isbn:{ISBN13}

For LT Work-ID you can use the URL with your .com top level domain:

http://www.librarything.com/work/{LTWORKID}

That would fit for tweets about books with an ISBN and for tweets about
a work, which should cover 99.9% of tweets from LT about single
publications anyway.



C) If your focus is to let people search for a publication in libraries
and to copy bibliographic data into reference management software,
then COinS is the way to go. COinS is based on OpenURL, which I and others
ranted about because it is a crappy library standard like MARC. But
unlike other metadata formats COinS usually fits in less than 512 bytes.
Furthermore you may have to deal with it for LibraryThing for Libraries
anyway.



Although I strongly favour CSL as a practising library scientist and
developer, I must admit that for LibraryThing the best way is to embed
identifiers (ISBN and LT Work-ID) and maybe COinS. As long as
LibraryThing does not open up to more complex publications like
preprints of proceedings articles in series etc. but mainly deals with
books and works, this will make LibraryThing users happy.


Then, three years from now, we can all conference-tweet about a CIL 
talk, about all the cool ways libraries are using Twitter, and how 
it's such a shame that the annotations standard wasn't designed with 
libraries in mind.


How about a bet instead of voting? In three years will there be:

a) No relevant Twitter annotations anyway
b) Twitter annotations but not used much for bibliographic data
c) A rich variety of incompatible bibliographic annotation standards
d) Semantic Web will have solved every problem anyway
..

Cheers
Jakob

--
Jakob Voß jakob.v...@gbv.de, skype: nichtich
Verbundzentrale des GBV (VZG) / Common Library Network
Platz der Goettinger Sieben 1, 37073 Göttingen, Germany
+49 (0)551 39-10242, http://www.gbv.de


Re: [CODE4LIB] it's cool to hate on OpenURL (was: Twitter annotations...)

2010-04-29 Thread Jakob Voss

Eric Hellman wrote:


What I hope for is that OpenURL 1.0 eventually takes a place
alongside SGML as a too-complex standard that directly paves the way
for a universally adopted foundational technology like XML. What I
fear is that it takes a place alongside MARC as an anachronistic
standard that paralyzes an entire industry.


But all the flaws of XML can be traced back to SGML, which is why we now
use JSON despite all of its limitations. May brother Ted Nelson
enlighten all of us - he not only hates XML [1] and similar formats but
also proposed an alternative way to structure information even before
the invention of hierarchical file systems and operating systems [2]. In
his vision of Xanadu every piece of published information had a unique
ID that was reused every time the publication was referenced - which
would solve our problem.


SCNR
Jakob

[1] http://ted.hyperland.com/XMLisEvil.html
[2] Ted Nelson: A File Structure for the Complex, The Changing and the 
Indeterminate, ACM 20th National Conference, pages 84-100, 1965. 
Presented in 1964.


--
Jakob Voß jakob.v...@gbv.de, skype: nichtich
Verbundzentrale des GBV (VZG) / Common Library Network
Platz der Goettinger Sieben 1, 37073 Göttingen, Germany
+49 (0)551 39-10242, http://www.gbv.de


[CODE4LIB] Unicode persistence (was: it's cool to hate on OpenURL)

2010-04-30 Thread Jakob Voss

Eric Hellman wrote:


May I just add here that of all the things we've talked about in
these threads, perhaps the only thing that will still be in use a
hundred years from now will be Unicode. إن شاء الله


Stuart Yeates wrote:

 Sadly, yes, I agree with you on this.

 Do you have any idea how demotivating that is for those of us
 maintaining collections with works containing characters that don't
 qualify for inclusion?

May I just add here that Unicode is evolving too and you can help to
get missing characters included. One of the next updates will even
include hundreds of icons such as a slice of pizza, a kissing couple,
and Mount Fuji (see this zipped PDF: http://is.gd/bABl9 and
http://en.wikipedia.org/wiki/Emoji).


I also bet that Unicode will still be there a hundred years from now (and
probably URIs), while things like XML and RDF may be little used by then.
But I fear that Unicode may be used in a different way, just like
words in natural language change their meanings over the centuries.


And that's why we need libraries (phew, at least one positive claim
about these institutions we all are bound to ;-)


Jakob

--
Jakob Voß jakob.v...@gbv.de, skype: nichtich
Verbundzentrale des GBV (VZG) / Common Library Network
Platz der Goettinger Sieben 1, 37073 Göttingen, Germany
+49 (0)551 39-10242, http://www.gbv.de


Re: [CODE4LIB] it's cool to hate on OpenURL

2010-04-30 Thread Jakob Voss

Stuart Yeates wrote:

A great deal of heat has been vented in this thread, and at least a 
little light.


I'd like to invite everyone to contribute to the wikipedia page at
http://en.wikipedia.org/wiki/OpenURL in the hopes that it evolves into a
better overview of the protocol, the ecosystem and their place on the web.


[Hint: the best heading for a rant on Wikipedia is 'criticisms' but you'll
still need to reference the key points. Links into this thread count as
references, if you can't find anything else.]


Good point - but writing Wikipedia articles is more work than discussing 
on mailing lists ;-) Instead of improving the OpenURL article I started 
to add to the more relevant[1] COinS article:


http://en.wikipedia.org/wiki/COinS

Maybe some of you (Eric Hellman, Richard Cameron, Daniel Chudnov, Ross
Singer, Herbert Van de Sompel ...) could fix the history section, which I
tried to reconstruct from historic sources [2] on the Internet without
violating the Wikipedia NPOV, which is hard if you write about things you
were involved in.


Am I right that neither OpenURL nor COinS strictly defines a metadata
model with a set of entities/attributes/fields/you-name-it and their
definitions? Apparently all ContextObject metadata formats are based on
non-normative implementation guidelines only?


Cheers
Jakob

[1] My bet: What will remain from OpenURL will be a link server base 
URL that you attach a COinS to


[2] about five years ago, so it's historic in terms of the Internet ;-) By
the way, does anyone have a copy of

http://dbk.ch.umist.ac.uk/wiki/index.php?title=Metadata_in_HTML ?

--
Jakob Voß jakob.v...@gbv.de, skype: nichtich
Verbundzentrale des GBV (VZG) / Common Library Network
Platz der Goettinger Sieben 1, 37073 Göttingen, Germany
+49 (0)551 39-10242, http://www.gbv.de


Re: [CODE4LIB] Handling non-Unicode characters (was: Unicode persistence)

2010-05-03 Thread Jakob Voss

Hi Stuart,

These have been included because they are in widespread use in a current 
written culture. The problems I personally have are down to characters 
used by a single publisher in a handful of books more than a hundred 
years ago. Such characters are explicitly excluded from Unicode.


In the early period of the standardisation of the Māori language there
were several competing ideas of what to use as a character set. One of
those included a 'wh' ligature as a character. Several works were
printed using this ligature. This ligature does not qualify for
inclusion in Unicode.


That is a matter of discussion. If you do not call it a 'ligature', the
chances are higher that it gets included.



To see how we handle the text, see:

http://www.nzetc.org/tm/scholarly/tei-Auc1911NgaM-t1-body-d4.html

The underlying representation is TEI/XML, which has a mechanism to
handle such glyphs. The things I'm still unhappy with are:

* getting reasonable results when users cut-n-paste the text/image HTML
combination to some other application
* some browsers still like line-breaking on images in the middle of words


That's interesting and reminds me of the treatment of mathematical
formulas in journal titles, which mostly end up as ugly images.


In Unicode you are allowed to assign private-use characters:

http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Private_use_characters

The U+200D ZERO WIDTH JOINER could also be used but most browsers will 
not support it - you need a font that supports your character anyway.


http://blogs.msdn.com/michkap/archive/2006/02/15/532394.aspx
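
To illustrate what using the Private Use Area looks like in practice,
here is a hypothetical Perl sketch (the assignment of U+E000 to the 'wh'
ligature is an arbitrary local convention, not a standard one):

use strict;
use warnings;
binmode STDOUT, ':encoding(UTF-8)';

# local convention: U+E000 (Private Use Area) stands for the 'wh' ligature
my $WH_LIGATURE = "\x{E000}";

my $text = "${WH_LIGATURE}akapapa";    # stored form with the private character

# fall back to a plain 'wh' digraph for systems without the private glyph
(my $export = $text) =~ s/$WH_LIGATURE/wh/g;
print "$export\n";                      # prints "whakapapa"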

In summary: Unicode is just a subset of all characters which have been
used for written communication, and whether a character gets included
depends not only on objective properties but on lobbying and other
circumstances. The deeper you dig, the nastier Unicode gets - like all
complex formats and standards.


Cheers
Jakob

P.S: Michael Kaplan's  blog also contains a funny article about emoji: 
http://blogs.msdn.com/michkap/archive/2010/04/27/10002948.aspx


--
Jakob Voß jakob.v...@gbv.de, skype: nichtich
Verbundzentrale des GBV (VZG) / Common Library Network
Platz der Goettinger Sieben 1, 37073 Göttingen, Germany
+49 (0)551 39-10242, http://www.gbv.de


Re: [CODE4LIB] Open Source Federated Library Search

2010-05-06 Thread Jakob Voss

David Kane wrote:


Anyone got any suggestions?

I am liking LibraryFind at the moment, but am not sure if anyone is using
it.  Has anyone else got experience with this or any other federated search
programs?


How about YaCy (see http://yacy.net/oai.html)? Or did you mean
metasearch instead of real federated search?


Jakob

--
Jakob Voß jakob.v...@gbv.de, skype: nichtich
Verbundzentrale des GBV (VZG) / Common Library Network
Platz der Goettinger Sieben 1, 37073 Göttingen, Germany
+49 (0)551 39-10242, http://www.gbv.de


Re: [CODE4LIB] Twitter annotations and library software

2010-06-15 Thread Jakob Voss

On 07.06.2010 16:15, Jay Luker wrote:

Hi all,

I found this thread rather interesting and figured I'd try and revive
the convo since apparently some things have been happening in the
twitter annotation space in the past month. I just read on techcrunch
that testing of the annotation features will commence next week [1].
Also it appears that an initial schema for a book type has been
defined [2].


 [1] http://techcrunch.com/2010/06/02/twitter-annotations-testing/
 [2] http://apiwiki.twitter.com/Annotations-Overview#RecommendedTypes


Have any code4libbers gotten involved in this beyond just opining on list?


I don't think so - the discussion slipped into general data modelling
questions. For the specific, limited use case of twitter annotations I
bet the recommended format from [2] will be fine (title is implied as a
common attribute, url is optional):


{"book": {
  "title": "...",
  "author": "...",
  "isbn": "...",
  "year": "...",
  "url": "..."
}}

I only miss an article type with a doi field for non-books.
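
Whether such a record fits into the 512-byte limit mentioned earlier in
this thread is easy to check; a quick sketch (all field values are made
up for illustration):

use strict;
use warnings;
use JSON::PP;
use Encode qw(encode_utf8);

# build the recommended "book" annotation and check its encoded size
my $annotation = { book => {
    title  => 'True Crime Radio and Listener Disenchantment',
    author => 'Elena Razlogova',
    isbn   => '9780000000000',    # made-up ISBN
    year   => '2006',
} };

my $json  = JSON::PP->new->canonical->encode($annotation);
my $bytes = length encode_utf8($json);

printf "%d bytes - %s\n", $bytes,
    $bytes <= 512 ? 'fits into a Twitter annotation' : 'too large';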

Cheers,
Jakob


--
Jakob Voß jakob.v...@gbv.de, skype: nichtich
Verbundzentrale des GBV (VZG) / Common Library Network
Platz der Goettinger Sieben 1, 37073 Göttingen, Germany
+49 (0)551 39-10242, http://www.gbv.de


[CODE4LIB] Bibliography of Code4Lib journal

2010-10-22 Thread Jakob Voss
Hi,


Recently I had the vision of a kind of DBLP for the field of library and
information science. There are some commercial bibliographic databases
like Library and Information Science Abstracts (LISA), INFODATA, some
public repositories like E-LIS, some open bibliographic databases like
DABI (German), and a rich variety of bibliographic data that is only
available embedded on web pages. The latter includes for instance the
encyclopedia of library and information science [1] and the Code4Lib
journal [2].


[1] http://www.informaworld.com/smpp/title~content=t713172967
[2] http://journal.code4lib.org/issues


During some procrastination I wrote a screen scraper to collect
bibliographic data for all Code4Lib journal articles. It looks like
there are 100 right now, including book reviews and editorials. You can
find the scraper script [3] and the resulting bibliography [4] in a git
repository [5]


[3]
http://github.com/nichtich/dblis/blob/master/scripts/code4libjournal.pl
[4]
http://github.com/nichtich/dblis/blob/master/data/code4libjournal/code4libjournal.bib
[5] http://github.com/nichtich/dblis


Feel free to fork and add more scrapers or raw bibliographic data. I
doubt that one social cataloging system fits all needs, and linked data
is more the method than the goal, so just aggregating related
bibliographic data in any format I can get is an easy way to start.
Of course you can also just grab the Code4Lib Journal bibliography and
forget about the vision.


I also tried to import the data into Mendeley and created a Code4Lib
Journal group [6]. Unfortunately Mendeley's BibTeX import does not
fully support some fields like month, day, and abstract - you can
vote and comment on a feature request [7].
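
Just to illustrate which fields are affected, a bibliography entry
roughly looks like the following (a made-up entry, not an actual
article from the journal):

@article{example,
  author   = {Jane Doe},
  title    = {Some Article Title},
  journal  = {Code4Lib Journal},
  number   = {10},
  year     = {2010},
  month    = {jun},
  day      = {22},
  abstract = {One or two sentences summarizing the article.},
  url      = {http://journal.code4lib.org/articles/...}
}

The month, day, and abstract fields are what currently gets lost or
mangled in Mendeley's import.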


[6] http://www.mendeley.com/library/group/607411/
[7]
http://feedback.mendeley.com/forums/4941-mendeley-feedback/suggestions/128222-improve-bibtex-import


Cheers
Jakob


-- 
Verbundzentrale des GBV (VZG)
Digitale Bibliothek - Jakob Voß
Platz der Goettinger Sieben 1
37073 Goettingen - Germany
+49 (0)551 39-10242
http://www.gbv.de
jakob.v...@gbv.de


[CODE4LIB] XML Schema vs Library APIs (OAI-PMH/SRU/unAPI)

2011-02-24 Thread Jakob Voss

Hi,

We are developing a general API management tool to provide different 
APIs (unAPI, SRU, OAI-PMH...) with different record formats (MARC, MODS, 
DC...) for our databases. We have now stumbled upon some confusion 
regarding XML formats. The basic question is: what is a format and how 
do you refer to it?


I came to the conclusion that at least SRU schema identifiers are 
useless. In addition you can extract XML namespace URIs from XML 
Schemas, so all you need to identify a format is a link to its XML Schema.
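
As a rough sketch of what I mean (hypothetical code, using nothing but 
the Python standard library; the MODS schema URL is just an example):

# read an XML Schema and print its target namespace,
# which can then serve as the identifier of the format
from urllib.request import urlopen
import xml.etree.ElementTree as ET

schema_url = "http://www.loc.gov/standards/mods/v3/mods-3-4.xsd"
root = ET.parse(urlopen(schema_url)).getroot()   # the xsd:schema element
print(root.get("targetNamespace"))               # http://www.loc.gov/mods/v3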


I wrote a more detailed blog posting about this at
http://jakoblog.de/2011/02/24/xml-schema-vs-library-apis-oai-pmhsruunapi/

Does any of you rely on SRU schema identifiers when consuming SRU?
I think at least for XML-based formats we should only use the XML Schema 
as the authoritative reference. Sure, there are different applications of 
variants of one schema, but then it makes no sense to use global 
identifiers in addition to local names.


Jakob

--
Jakob Voß jakob.v...@gbv.de, skype: nichtich
Verbundzentrale des GBV (VZG) / Common Library Network
Platz der Goettinger Sieben 1, 37073 Göttingen, Germany
+49 (0)551 39-10242, http://www.gbv.de


Re: [CODE4LIB] XML Schema vs Library APIs (OAI-PMH/SRU/unAPI)

2011-02-25 Thread Jakob Voss

Hi Rob,

 This is just a rehash of a previous discussion on this list, between
 us:

 http://www.mail-archive.com/code4lib@listserv.nd.edu/msg05309.html

 So I guess I'm wasting my time ;)

Thanks, I added a link to the previous discussion. You wrote:


Referring to your blog post, you can say how the four inter-relate:

Schema Identifier uniquely identifies the format.
Schema Location is a non-unique description of the format.
Schema Name is a short, human readable, non-unique name for the format
and Namespace is a non-unique namespace used by the format.


These definitions can help to clarify things, but they are of little 
practical value. The practical question is how to refer to a particular 
format. If you have to manually look at each particular server and 
collection to find out what format is *actually* served, then names and 
identifiers are of little help to code against. Both schema identifiers 
and schema names only help you to guess a format. A precise format needs 
an authoritative reference that you can validate against. If there 
exists an official XML Schema, then this and only this schema defines the 
format (or the commonly agreed-upon subset that you can work with 
without manually adapting to each single data source).
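
Just to make concrete what "validate against" means, a minimal sketch 
with lxml (file names are placeholders; assume a local copy of the 
official schema and a record retrieved from some server):

# check whether a record actually conforms to the XML Schema
# that is claimed to define its format
from lxml import etree

schema = etree.XMLSchema(etree.parse("mods-3-4.xsd"))  # authoritative schema
record = etree.parse("record.xml")                     # record to check

if schema.validate(record):
    print("record conforms to the format defined by the schema")
else:
    print(schema.error_log)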


 A single schema may contain multiple namespaces, and there isn't a
 unique identifier for a schema.  For example, any simple Dublin Core
 based syntax must have at least two Namespaces, Dublin Core and the
 wrapper element. SchemaLocation is not unique as there can be many
 copies of the same schema.  A single schema may define multiple root
 elements, such as MODS does with both item and collection level
 elements.

A unique identifier for a schema is helpful because you do not need to 
actually look up a schema that you already know by its identifier. But 
it's not a must. If there is no single root namespace, you just should 
not use a namespace to point to a particular format.


Ok, enough :-)

Jakob

--
Jakob Voß jakob.v...@gbv.de, skype: nichtich
Verbundzentrale des GBV (VZG) / Common Library Network
Platz der Goettinger Sieben 1, 37073 Göttingen, Germany
+49 (0)551 39-10242, http://www.gbv.de


[CODE4LIB] Mapping vocabularies (was: LCSH and Linked Data)

2011-04-08 Thread Jakob Voss

Hi,

Any transformation of a controlled vocabulary, either in format (MARC to 
RDF) or in coverage (e.g. from LCSH to DDC, MeSH, GND, etc.) has to 
decide whether


(a) there is a one-to-one (or one-to-zero) mapping between all concepts
(b) you need n-to-m or even more complex mappings

Mapping name authority files in VIAF was an instance of (a) because we more or 
less agree that a person is always the same person.


Mapping authority data in MARC from different institutions, however, looks 
like an instance of (b). Not only are concepts like "England" more fuzzy 
than people, they are also used in different contexts for different 
purposes, depending on the cataloging rules and their specific 
interpretation. It does not help to argue about MARC fields because there 
just is no easy one-to-one mapping between, for instance:


- The Kingdom of England (927–1707)
- The area of the Kingdom of England (927–1707)
- The country England as it exists today
- The area of England including the Principality of Sealand
- The area of England excluding the Principality of Sealand
- The whole island of Great Britain
- The island of Great Britain together with Ireland
- The island of Great Britain together with Northern Ireland
- The Kingdom of Great Britain (1707 to 1801)
- The United Kingdom of Great Britain and Ireland (1801 to 1922)
- etc.

I gave a talk about the fruitless attempt to express reality in terms of 
the Semantic Web at Wikimania 2007 (starting with slide 12):

http://www.slideshare.net/NCurse/jakob-voss-wikipedia2007

Instead of discussing how to map terms and concepts the right way, you 
should think about how to express fuzzy and complex mappings. The SKOS 
mapping vocabulary provides some relations for this purpose. I can also 
recommend the DC2010 paper "Establishing a Multi-Thesauri-Scenario based 
on SKOS and Cross-Concordances" by Mayr, Zapilko, and Sure:

http://dcpapers.dublincore.org/ojs/pubs/article/viewArticle/1031
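
For example, a fuzzy mapping between two vocabularies could be expressed 
with the SKOS mapping relations like this (Turtle syntax, concept URIs 
made up for illustration):

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix a: <http://example.org/vocabularyA/> .
@prefix b: <http://example.org/vocabularyB/> .

# the two "England" concepts overlap, but are not exactly the same
a:England skos:closeMatch b:England .

# the historical kingdom maps to the broader, fuzzier heading
a:KingdomOfEngland skos:broadMatch b:England .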

If you do not want to bother with complex mappings but prefer 
one-to-one, you should not talk about differences like "England as 
corporate body" versus "England as place" versus "England as 
nationality", etc.


Sure, you can put all these meanings into a broad and fuzzy term 
"England", but then stop complaining about semantic differences and use 
the term as an unqualified subject heading with no specific meaning for 
anything that is related to any of the many ideas that anyone can call 
"England". This is the way that full text retrieval works.


You just can't have both simple mappings and precise terms.

Jakob

--
Jakob Voß jakob.v...@gbv.de, skype: nichtich
Verbundzentrale des GBV (VZG) / Common Library Network
Platz der Goettinger Sieben 1, 37073 Göttingen, Germany
+49 (0)551 39-10242, http://www.gbv.de


[CODE4LIB] Classification of loan types

2012-04-23 Thread Jakob Voss
Hi,

is there an established classification of loan types, preferably as an
ontology with URIs for each type? I am not sure about the English
terminology with terms such as "loan", "hold" and "recall". In
particular I am looking for a simple (!) list of types for current
relationships between a patron and a library item, such as the
following:

1. loan: the patron has got the library item
2. reservation: the patron will get the library item as soon as it is
available again
3. ordered: the patron will get the library item, which is currently being
made available, for instance by bringing it from the closed stacks to some
pickup location

Maybe that's all and I only need to define my own URIs for these cases.
But there may also be relevant cases such as an item being allocated to
some location for the patron, waiting to be picked up - and digital
library items may be even more complicated.
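
If I end up defining my own URIs, it could be as simple as a small SKOS
concept scheme like the following sketch (all URIs hypothetical):

@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix lt: <http://example.org/loan-types/> .

lt:scheme a skos:ConceptScheme .

lt:loan a skos:Concept ;
  skos:inScheme lt:scheme ;
  skos:prefLabel "loan"@en ;
  skos:definition "the patron has got the library item"@en .

lt:reservation a skos:Concept ;
  skos:inScheme lt:scheme ;
  skos:prefLabel "reservation"@en .

lt:ordered a skos:Concept ;
  skos:inScheme lt:scheme ;
  skos:prefLabel "ordered"@en .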

So do you know any such specification or do I have to define my own
standard?

Jakob




-- 
Verbundzentrale des GBV (VZG)
Digitale Bibliothek - Jakob Voß
Platz der Goettinger Sieben 1
37073 Goettingen - Germany
+49 (0)551 39-10242
http://www.gbv.de
jakob.v...@gbv.de


Re: [CODE4LIB] First draft of patron account API

2012-05-29 Thread Jakob Voss

Hi,

I added a recent changes list at http://gbv.github.com/paia/ to make it 
easier to follow modifications to the PAIA specification.


P Williams wrote:


I'm very interested in this problem space.  Good to see that someone is
taking the initiative to try to solve the problem.  I guess I'll have to
learn some German :)


Ach, das ist nicht nötig :-) (Oh, that's not necessary.)

You'd better correct my English by proofreading the current PAIA spec instead.


You mention VuFind ILS drivers.  You might also be interested in the
connectors from the XC NCIP toolkit [http://xcncip2toolkit.googlecode.com]
and LAI Connector from Equinox's FulfILLment [
http://www.fulfillment-ill.org/].


Thanks, I started a page with related work, open for contributions:

https://github.com/gbv/paia/wiki/Related-work


I think OAuth is a good starting place when you talk about authentication.
This would address some of the issues of trust with applications that want
to access your library related information and how to securely grant access
to these client applications.  With an OAuth model the server (ILS) doesn't
have to know about the client application before the first request in order
to establish trust.  The trust is established by the user just in time.

With library systems username and password are usually barcode and pin.
The pin is usually a four digit number which is substantially easier to
break with brute force than a true password (alpha-numeric + case +
punctuation).  I think that unfortunately PAIA has the potential to make
this type of attack easier.  Any thought to hardening library systems
against brute force authentication attempts?


You are right, but library systems should not have allowed weak 
passwords in the first place. I added a section on security considerations:


http://gbv.github.com/paia/paia-5c2005c.html#security-considerations

I think the best way is to enable PAIA only for patron accounts with 
strong passwords.


 What did you mean by decoupling of authorization and access?

One reason to decouple authorization (PAIA auth) and access (PAIA core) 
was to be free to use different authorization methods in addition to 
username and password. You could also support access tokens bound to 
specific clients which can access multiple patron accounts or access 
tokens with read-only access to a patron account. With username/password 
one would only have all or nothing.
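
As a rough illustration of the decoupling (only a sketch - server URL,
endpoint paths and field names are simplified placeholders and may not
match the current draft exactly):

# a client obtains an access token once (PAIA auth) and then uses
# only the token to read the patron account (PAIA core)
import requests

base = "https://example.org/paia"   # hypothetical PAIA server

# step 1: authorization - exchange credentials for a token
auth = requests.post(base + "/auth/login",
                     data={"username": "alice", "password": "secret"}).json()
token = auth["access_token"]

# step 2: access - further requests only carry the token, so a
# read-only or client-bound token could be used just as well
items = requests.get(base + "/core/alice/items",
                     headers={"Authorization": "Bearer " + token}).json()
print(items)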



What are your major complaints with NCIP?


1. One of the most important pieces of information (the circulation 
status) is not a defined vocabulary but free text.

2. NCIP is rarely implemented in full, so you never know what you get.
3. Identifiers are not URIs.
4. I don't know of any library that allows open access to their NCIP 
interface by patrons and third-parties.


But I've seen worse library APIs than NCIP ;-)


I can see this being useful with authenticating for use of licensed
databases, to determine eligibility for ILL services, or to verify a valid
user for reciprocal borrowing in person within a consortia.  It might also
be useful for a service like Library Elf.


Interesting case. You could think of a database as a document which can 
be requested - so the patron sends a PAIA requestItems and the 
resulting document state is 3 (held) or 4 (provided) on success and 
5 (rejected) on failure.
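
A sketch of how such an exchange might look (the exact field names 
follow my reading of the current draft and may still change):

POST /core/alice/request          (with the patron's access token)
{"doc": [{"item": "http://example.org/database/xyz"}]}

response on success:
{"doc": [{"item": "http://example.org/database/xyz", "status": 4}]}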


Jakob

--
Jakob Voß jakob.v...@gbv.de, skype: nichtich
Verbundzentrale des GBV (VZG) / Common Library Network
Platz der Goettinger Sieben 1, 37073 Göttingen, Germany
+49 (0)551 39-10242, http://www.gbv.de