Re: [CODE4LIB] RFC 5005 ATOM extension and OAI
Ed Summers wrote: Thanks for posting this Jakob. I was just reading RFC 5005 on the train yesterday (literally) and the parallels between it and OAI-PMH struck me as well. It's not quite clear to me how deleted records would be handled with an Atom archive feed. But I guess one could assume that if the identifier is no longer present it has been deleted. That would require pulling the entire archive, though... I'm not really sure how much deletes are actually used in OAI-PMH repositories anyhow.

OAI-PMH 1.1 was not clear enough on deletions, but in 2.0 the specification contains an example. I think the missing support for deletions in data providers has to do with the missing explicit support in service providers, and vice versa (a hen-and-egg problem).

Stuart Weibel has written [1] about the subject of blog archiving in the past. And I remember hearing Jon Udell and Dan Chudnov talk about it [2]. Who knows what Technorati, Bloglines and Google Reader are doing in this area. I guess the reality is that blogs are on the web and as such will be archived by the Internet Archive [3]. But perhaps that doesn't really fit quite right? That's my feeling. Thanks.

BlogML was new to me - it sounds interesting but looks very shaggy and over-engineered: you do not even get the spec in HTML but have to download an archive that contains tons of nasty .NET files and an XML schema instead of a textual description with examples and discussion. I copied the XML schema here: http://www.gbv.de/wikis/cls/BlogML. I think extending Atom is the better way.

I think your general point is correct. Libraries need to be integrating themselves into the web these days rather than expecting the web to integrate into them.

I doubt that archiving weblogs is that complicated [1]! You need a harvester (partly implemented in many feed readers), an archive (you could start with just saving validated Atom files), an index (Solr?) and a reader (also already implemented in many feed readers).
I bet you don't need more than a medium-size project with one or two developers and one or two years to create sustainable tools for basic weblog archiving. Such a project could be done by any larger library or archive that is able to get funding. It's not a lack of resources, it's a lack of vision.

Oh, and would it be alright to add your blog to http://planet.code4lib.org -- we need more of an international presence on there IMHO.

The subfeed http://jakoblog.de/category/en/feed/atom/ contains all English-language postings, which are probably of higher relevance.

Jakob

[1] Ok, real long-term preservation *is* complicated, but if you only archive well-formed XML that conforms to a given schema (Atom, HTML) you should be in a good position for the next decades.

-- Jakob Voß [EMAIL PROTECTED], skype: nichtich Verbundzentrale des GBV (VZG) / Common Library Network Platz der Goettinger Sieben 1, 37073 Göttingen, Germany +49 (0)551 39-10242, http://www.gbv.de
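The harvester part described above is genuinely small: follow the RFC 5005 "prev-archive" links from the current feed back through the archive documents and collect entries. A minimal sketch in Python (feed URLs and entry ids are invented; a real harvester would fetch each URL over HTTP instead of reading from a dict):

```python
# Sketch of an RFC 5005 harvester walking "prev-archive" links.
# The feed documents are inlined here; a real harvester would fetch
# each URL over HTTP. Feed URLs and entry ids are made up.
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

FEEDS = {
    "http://example.org/feed": """<feed xmlns="http://www.w3.org/2005/Atom">
        <link rel="prev-archive" href="http://example.org/feed/archive/2"/>
        <entry><id>urn:post:3</id></entry>
      </feed>""",
    "http://example.org/feed/archive/2": """<feed xmlns="http://www.w3.org/2005/Atom"
          xmlns:fh="http://purl.org/syndication/history/1.0">
        <fh:archive/>
        <link rel="prev-archive" href="http://example.org/feed/archive/1"/>
        <entry><id>urn:post:2</id></entry>
      </feed>""",
    "http://example.org/feed/archive/1": """<feed xmlns="http://www.w3.org/2005/Atom"
          xmlns:fh="http://purl.org/syndication/history/1.0">
        <fh:archive/>
        <entry><id>urn:post:1</id></entry>
      </feed>""",
}

def harvest(url):
    """Collect entry ids from a feed and all its archived predecessors."""
    ids = []
    while url:
        feed = ET.fromstring(FEEDS[url])
        ids += [e.findtext(ATOM + "id") for e in feed.findall(ATOM + "entry")]
        # RFC 5005 section 4: the prev-archive link points to the
        # next-older archive document; the chain ends when it is absent.
        prev = feed.find(ATOM + "link[@rel='prev-archive']")
        url = prev.get("href") if prev is not None else None
    return ids

print(harvest("http://example.org/feed"))  # ['urn:post:3', 'urn:post:2', 'urn:post:1']
```

Everything else (validation, storage, indexing) wraps around this loop.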
Re: [CODE4LIB] RFC 5005 ATOM extension and OAI
Hi Ed,

You wrote: I completely agree. When developing software it's really important to focus on the cleanest/clearest solution, rather than getting bogged down in edge cases and the comments from naysayers. I hope that my response didn't come across that way. :-) A couple follow-on questions for you: In your vision for this software are you expecting that content providers would have to implement RFC 5005 for your archiving system to work?

Probably yes - at least for older entries. New posts can also be collected with the default feeds. Instead of working out exceptions and special solutions for getting blog archives by other methods, you should provide RFC 5005 plugins for common blog software like WordPress and advertise their use ("We are sorry - the blog that you asked to archive does not support RFC 5005, so we can only archive new postings. Please ask its provider to implement archived feeds so we can archive the postings before {TIMESTAMP}. More information and plugins for RFC 5005 can be found {HERE}. Thank you!").

Are you considering archiving media files associated with a blog entry (images, sound, video, etc.)?

Well, it depends. There are hundreds of ways to associate media files - I doubt that you can easily archive YouTube and SlideShare widgets etc., but images included with <img src="..."/> should be doable. However, I prefer iterative development - once basic archiving works, you can start to think about media files. By the way, I would place more value on the comments - which are also additional and non-trivial to archive. To begin with, a WordPress plugin is surely the right step. Up to now RFC 5005 is so new that no one has implemented it yet, although it's not complicated.

Greetings, Jakob

-- Jakob Voß [EMAIL PROTECTED], skype: nichtich Verbundzentrale des GBV (VZG) / Common Library Network Platz der Goettinger Sieben 1, 37073 Göttingen, Germany +49 (0)551 39-10242, http://www.gbv.de
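For concreteness, this is roughly the markup such a plugin would have to emit, per RFC 5005 section 4 (the URLs are invented):

```xml
<!-- in the blog's current feed document -->
<link rel="current" href="http://example.org/feed"/>
<link rel="prev-archive" href="http://example.org/feed/archive/2007-09"/>

<!-- in each archive document: the fh:archive marker plus links
     to the neighbouring archive documents -->
<fh:archive xmlns:fh="http://purl.org/syndication/history/1.0"/>
<link rel="current" href="http://example.org/feed"/>
<link rel="prev-archive" href="http://example.org/feed/archive/2007-08"/>
<link rel="next-archive" href="http://example.org/feed/archive/2007-10"/>
```

An archiving service only needs the "prev-archive" chain; "next-archive" and "current" let a reader move forward again.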
Re: [CODE4LIB] RFC 5005 ATOM extension and OAI
Hi Clay,

I completely agree with everything you just wrote, especially about Atom + APP being more than just a technology for blogs. APP is a great lightweight alternative to WebDAV, and promising for all sorts of data transfer. The fact that it has developer groundswell is a huge plus. During my Princeton days Kevin Clarke and I briefly talked about what a METS + APP metadata editing application could do. (I can't remember the answer, but I bet it would be snazzy.)

On the one side you are right: Atom + APP is becoming popular and the standards are good, so digital libraries should get into it. On the other side I was just reminded of the ECDL 2006 paper "Repository Replication Using NNTP and SMTP": you can use almost any protocol (HTTP, OAI, Atom/APP, WebDAV, NNTP...) for most of digital libraries' use cases - but the best standard without appropriate tools and support is pretty worthless.

I came to this realization out of frustration that most OAI toolkits (at the time, ca. 2005) didn't support that functionality well -- or at all. I don't know if that's still the case. However, the need to delete records is a reality for most projects, and OAI has somewhat awkwardly made us rethink how to delete a record in repositories and the like, both on the service and data provider end. You almost have to build your entire system around handling deleted records just for OAI exposure. In reality it seems like you just end up masquerading or re-representing its outward visibility on our local systems, which gets onerous. I guess the difference is that the growing number of Atom developers are heeding the requirement for deletions, whereas the few existing OAI toolkit developers have deemed that functionality optional.

Most repositories do not even track deletions, so they cannot syndicate them. If support for deletions had been mandatory, maybe OAI-PMH would not have been used that much?
OAI did a good job in promoting and documenting OAI-PMH, but deletions were always treated as an orphan - I would not blame the standard but the lacking implementations. Atom and RFC 5005 are not much better than other solutions - but they are much more likely to get implemented in weblog and other software than OAI-PMH, which is not that well known outside the library world.

Greetings, Jakob

P.S: Maybe we would all be happy with Z39.50 if we had had those wonderful Index Data tools right from the beginning - instead there were only closed specifications and different closed-source partial implementations. A standard without easy-to-use open source implementations is condemned to be violated and die.

-- Jakob Voß [EMAIL PROTECTED], skype: nichtich Verbundzentrale des GBV (VZG) / Common Library Network Platz der Goettinger Sieben 1, 37073 Göttingen, Germany +49 (0)551 39-10242, http://www.gbv.de
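For comparison, this is all it takes to spot deletions in an OAI-PMH 2.0 response: the record header carries status="deleted". A minimal sketch (the sample response is invented):

```python
# OAI-PMH 2.0 signals deletions via status="deleted" on the record
# header. Sample ListRecords response is invented for illustration.
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
    </record>
    <record>
      <header status="deleted"><identifier>oai:example.org:2</identifier></header>
    </record>
  </ListRecords>
</OAI-PMH>"""

def deleted_identifiers(xml):
    """Return identifiers of all records marked as deleted."""
    root = ET.fromstring(xml)
    return [h.findtext(OAI + "identifier")
            for h in root.iter(OAI + "header")
            if h.get("status") == "deleted"]

print(deleted_identifiers(RESPONSE))  # ['oai:example.org:2']
```

The harvesting side is trivial; the hard part, as argued above, is that most repositories never track the deletion in the first place.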
Re: [CODE4LIB] RFC 5005 ATOM extension and OAI
Peter wrote: Also, re: blog mirroring, I highly recommend the current discussions floating around the blogosphere regarding distributed source control (Git, Mercurial, etc.). It's a fundamental paradigm shift from centralized control to distributed control that points the way toward the future of libraries as they (we) become less and less the gatekeepers for the stuff, be it digital or physical, and more and more the facilitators of the bidirectional replication that assures ubiquitous access and long-term preservation. The library becomes (actually it has already happened) simply a node on a network of trust and should act accordingly. See the thoroughly entertaining/thought-provoking Google tech talk by Linus Torvalds on Git: http://www.youtube.com/watch?v=4XpnKHJAok8

Thanks for pointing to this interesting discussion. This goes even further than the current paradigm shift from the old model (author - publisher - distributor - reader) to a world of user-generated content and collaboration! I would be glad if we finally managed to model and archive weblogs and wikis - modelling and archiving the whole process of content copying, changing, remixing and republication is far beyond libraries' capabilities!

Greetings, Jakob

-- Jakob Voß [EMAIL PROTECTED], skype: nichtich Verbundzentrale des GBV (VZG) / Common Library Network Platz der Goettinger Sieben 1, 37073 Göttingen, Germany +49 (0)551 39-10242, http://www.gbv.de
[CODE4LIB] RDA - a standard that nobody will notice?
Hi,

As you may have already noticed, the Resource Description and Access (RDA) cataloguing instructions will be published in 2009. You can submit final comments on the full draft until February 2nd: http://www.collectionscanada.gc.ca/jsc/rda.html http://www.collectionscanada.gc.ca/jsc/rdafulldraft.html

Although there are several details you can argue about (and despite the question whether detailed cataloguing rules have a future at all when people do cataloguing in LibraryThing, BibSonomy etc. without rules), I think that RDA is a step in the right direction. But there are some serious problems with the publication of RDA that should be of interest to you:

1.) The standard is scattered over a set of PDF files instead of clean web-based HTML (compare with the W3C recommendations). You cannot easily browse and search RDA with your browser and a public search engine of your choice. You cannot link to a specific paragraph to cite RDA in a weblog posting etc. This shows me that the authors are still bound to the physical world of dusty books instead of the digital age.

2.) RDA is not going to be published freely on the web at all! See http://www.collectionscanada.gc.ca/jsc/rdafaq.html#7 - another reason why you won't be able to refer to specific sections of RDA. Defining a standard without putting it on Open Access (ideally under a specific CC license) is retrogressive practice and a good strategy to make people ignore, misinterpret and violate it (you could also argue ethically that it's a shame for any librarian not to put their publications on Open Access, but the argument of quality should be enough).

3.) There are no official URIs for the elements of RDA. It looks like there has been no progress compared to FRBR (IFLA failed to publish an official RDF encoding of FRBR, so several people created their own vocabularies). To encode bibliographic data on the Semantic Web you need URIs for classes and properties.
I don't expect RDA to get published as a full ontology, but at least you could determine the basic concepts and elements and provide common URIs that people can build on. There are several attempts to create ontologies for bibliographic data, but most of them come from outside the professional library community. Without a connection to the Semantic Web, RDA will be irrelevant outside the library world. With official URIs people can build on RDA and create a common ontology out of it. Deirdre Kiorgaard did a good job in collecting elements [1] and Eversberg provides a database to start with [2].

What do you think about my concerns? We should try to get the JSC to make RDA Open Access, prepared for use on the Web and even prepared for the Semantic Web. This should not be too difficult - the main work is convincing people (ok, it may be difficult to convince people ;-). I'd be glad if you sent your comments to the Joint Steering Committee for Development of RDA by February 2nd: http://www.collectionscanada.gc.ca/jsc/rdadraftcomments.html

It would be a pity if RDA were an irrelevant anachronism from the beginning just because it is not published the way standards need to be published on the Web.

Greetings Jakob Voss

[1] http://www.collectionscanada.gc.ca/jsc/docs/5rda-elementanalysisrev.pdf
[2] A helpful tool for structured temporary access to RDA is provided by Bernhard Eversberg at http://www.biblio.tu-bs.de/db/wtr/detail.php - this is what should be provided officially!

-- Jakob Voß jakob.v...@gbv.de, skype: nichtich Verbundzentrale des GBV (VZG) / Common Library Network Platz der Goettinger Sieben 1, 37073 Göttingen, Germany +49 (0)551 39-10242, http://www.gbv.de
Re: [CODE4LIB] PICAplus to MODS conversion
Joachim Neubert wrote: Has anybody built a conversion from PICA+ to MODS and is willing to share the code?

First, there is not one PICA+ but different variants, just as there are different MARC variants. Second, most PICA+-to-anything conversions that I know use the FCV conversion language that was developed by PICA and is neither documented nor standardized - so you won't be able to do anything with it unless you run your own PICA system.

At GBV we export MODS via MARC and the XSLT provided by the LOC; the chain is:

GBV PICA+ --(FCV)--> MARC --> MARCXML --(XSLT)--> MODS

As you can imagine, the metadata quality does not get better with each step, as always when you convert between metadata formats. But it's better than having to write n*n individual conversions.

Greetings, Jakob

-- Jakob Voß jakob.v...@gbv.de, skype: nichtich Verbundzentrale des GBV (VZG) / Common Library Network Platz der Goettinger Sieben 1, 37073 Göttingen, Germany +49 (0)551 39-10242, http://www.gbv.de
Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All
Hi,

I summarized my thoughts about identifiers for data formats in a blog posting: http://jakoblog.de/2009/05/10/who-identifies-the-identifiers/

In short it's not a technology issue but a commitment issue, and the problem of identifying the right identifiers for data formats can be reduced to two fundamental rules of thumb:

1. Reuse: don't create new identifiers for things that already have one.
2. Document: if you have to create an identifier, describe its referent as openly, clearly, and in as much detail as possible to make it reusable.

A format should be described with a schema (XML Schema, OWL etc.) or at least a standard. Mostly this schema already has a namespace or similar identifier that can be used for the whole format. For instance MODS Version 3 (currently 3.0, 3.1, 3.2, 3.3) has the XML namespace http://www.loc.gov/mods/v3, so this is the best identifier to identify MODS. If you need to identify a specific version then you should *first* look whether such identifiers already exist, *second* push the publisher (LOC) to assign official URIs for MODS versions if these do not already exist, or *third* create and document specific URIs and make sure that everyone knows about these identifiers. At the moment there are:

MODS Version 3    http://www.loc.gov/mods/v3
MODS Version 3.0  info:srw/schema/1/mods-v3.0
MODS Version 3.1  info:srw/schema/1/mods-v3.1
MODS Version 3.2  info:srw/schema/1/mods-v3.2 and info:ofi/fmt:xml:xsd:mods
MODS Version 3.3  info:srw/schema/1/mods-v3.3

The SRU Schemas registry links the info:srw/schema/1/mods-v3* identifiers to their XML Schemas, which is very little documentation, but at least it links to http://www.loc.gov/mods/v3 in some way.

Ross wrote: First, and most importantly, how do we reconcile these different identifiers for the same thing? Can we come up with some agreement on which ones we should really use?

Use the one that is documented best.
Secondly, and this gets to the reason why any of this was brought up in the first place, how can we coordinate these identifiers more effectively and efficiently to reuse among various specs and protocols, but not: 1) be tied to a particular community 2) require some laborious and lengthy submission and review process to just say "hey, here's my FOAF available via unAPI"

The identifier for FOAF is http://xmlns.com/foaf/0.1/. Forget about identifiers that are not URIs. OAI-PMH at least includes a mechanism to map metadataPrefixes to official URIs, but this mechanism is not always used. If unAPI lacks a way to map a local name to a global URI, we had better fix unAPI to tell us:

<?xml version="1.0" encoding="UTF-8"?>
<formats xmlns="http://unapi.info/">
  <format name="foaf" uri="http://xmlns.com/foaf/0.1/"/>
</formats>

unAPI should be revised and specified more strictly to become an RFC anyway. Yes, this requires a laborious and lengthy submission and review process, but there is no such thing as a free lunch.

3) be so lax that it throws all hope of authority out the window

Reuse existing authorities and document better to create authority.

I would expect the various communities to still maintain their own registries of approved data formats (well, OpenURL and SRU, anyway -- it's not as appropriate to unAPI or Jangle).

There should be a distinction between descriptive registries that only list identifiers and formats that are defined elsewhere, and authoritative registries that define new identifiers and formats. The number of authoritatively defined identifiers should be small for a given API, because an identifier had better be defined by the creator of the format instead of by a user of the format. If the creator does not provide usable identifiers, then better talk to them instead of creating something in parallel.
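A client-side sketch of consuming such a formats document, assuming the proposed uri attribute (which is a suggested extension, not part of unAPI as currently specified):

```python
# Parse a hypothetical unAPI formats document that carries the
# proposed uri attribute and build a name-to-URI map. The uri
# attribute is the suggested extension, not current unAPI.
import xml.etree.ElementTree as ET

UNAPI = "{http://unapi.info/}"

FORMATS = """<?xml version="1.0" encoding="UTF-8"?>
<formats xmlns="http://unapi.info/">
  <format name="foaf" uri="http://xmlns.com/foaf/0.1/"/>
  <format name="mods" uri="http://www.loc.gov/mods/v3"/>
</formats>"""

def format_uris(xml):
    """Map local unAPI format names to global format URIs."""
    # encode first: ElementTree rejects str input that carries
    # an XML encoding declaration
    root = ET.fromstring(xml.encode("utf-8"))
    return {f.get("name"): f.get("uri") for f in root.findall(UNAPI + "format")}

print(format_uris(FORMATS))
```

With such a mapping, a harvester no longer has to guess what a local name like "foaf" means on a particular server.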
Greetings, Jakob -- Jakob Voß jakob.v...@gbv.de, skype: nichtich Verbundzentrale des GBV (VZG) / Common Library Network Platz der Goettinger Sieben 1, 37073 Göttingen, Germany +49 (0)551 39-10242, http://www.gbv.de
[CODE4LIB] Formats and their identifiers
Hi Rob,

You wrote: "A format should be described with a schema (XML Schema, OWL etc.) or at least a standard. Mostly this schema already has a namespace or similar identifier that can be used for the whole format." This is unfortunately not the case.

It is mostly the case - but people like to misinterpret schemas and tailor them to their needs.

"For instance MODS Version 3 (currently 3.0, 3.1, 3.2, 3.3) has the XML namespace http://www.loc.gov/mods/v3, so this is the best identifier to identify MODS." And this is a perfect example of why this is not the case. The same mods schema (let alone namespace) defines TWO formats, mods and modsCollection.

That's your interpretation. According to the schema, the MODS format *is* either a single mods element or a modsCollection element. That's exactly what you can refer to with the namespace identifier http://www.loc.gov/mods/v3. If you need to identify only the specific element 'mods' of the format, then you need another identifier. Up to now there is no default way to create an identifier for a specific element in an XML format, see http://www.w3.org/TR/webarch/#xml-fragids - but if the MODS specification defined that you can refer to any element with a URI fragment identifier, then the right identifier would be http://www.loc.gov/mods/v3#mods

You wrote: I totally agree that it's an awful design choice. However it's a demonstration that XML namespaces _do not identify formats_. And hence we need another identifier, which is not the namespace of the top-level element.

The namespace http://www.loc.gov/mods/v3 of the top-level element 'mods' does not identify the top-level element but the MODS *format* (in any of the versions 3.0-3.3) itself. This format *includes* the top-level element 'mods'.

Also consider the following more hypothetical, but perfectly feasible situations: * One namespace is used to define two _totally_ separate sets of elements. There's no reason why this can't be done.
Ok, let A and B be two formats with two totally separate sets of elements (and rules how to use them). If you put them into one namespace, then you get a new format C that is the union of A and B.

* One namespace defines so many elements that it's meaningless to call it a format at all. Even though the top-level tag might be the same, the contents are so varied that you're unable to realistically process it.

Sad but true: the word "format" does not make sense in the context of library applications in most cases anyway. Technically a format is just a set of possible instances, defined as a formal language or with any other type of specification. The problem with library formats is that many people refer to them without providing a proper specification. Coming back to the mods example: if the SRU Schema registry lists info:srw/schema/1/mods-v3.3 as the identifier for "MODS Schema Version 3.3" with a pointer to the XML Schema http://www.loc.gov/standards/mods/v3/mods-3-3.xsd, then *any* XML document that validates against this schema must be considered to be a MODS 3.3 document - either with 'mods' or with 'modsCollection' as root element.

Greetings Jakob

-- Jakob Voß jakob.v...@gbv.de, skype: nichtich Verbundzentrale des GBV (VZG) / Common Library Network Platz der Goettinger Sieben 1, 37073 Göttingen, Germany +49 (0)551 39-10242, http://www.gbv.de
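The namespace argument suggests a simple detection rule: to decide whether a document is MODS at all, check the root element's namespace rather than its local name. A sketch:

```python
# Detect MODS by the root element's namespace: both <mods> and
# <modsCollection> roots live in the same MODS v3 namespace.
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"

def inspect(xml):
    """Return (is_mods, local_name_of_root) for an XML document."""
    root = ET.fromstring(xml)
    # ElementTree tags look like "{namespace}localname"
    ns, _, local = root.tag[1:].partition("}")
    return ns == MODS_NS, local

print(inspect('<mods xmlns="http://www.loc.gov/mods/v3"/>'))            # (True, 'mods')
print(inspect('<modsCollection xmlns="http://www.loc.gov/mods/v3"/>'))  # (True, 'modsCollection')
```

If an application can only handle one of the two root elements, that is a restriction of the application, which is exactly where a more specific identifier (like a fragment identifier) would be needed.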
Re: [CODE4LIB] One Data Format Identifier (and Registry) to Rule Them All
Ross Singer wrote:

<?xml version="1.0" encoding="UTF-8"?>
<formats xmlns="http://unapi.info/">
  <format name="foaf" uri="http://xmlns.com/foaf/0.1/"/>
</formats>

I generally agree with this, but what about formats that aren't XML or RDF based? How do I also say that you can grab my text/x-vcard? Or my application/marc record? There is still lots of data I want that doesn't necessarily have these characteristics.

In my blog posting I included a way to specify MIME types (such as text/x-vcard or application/marc) as URIs. According to RFC 2220 the application/marc type refers to "the harmonized USMARC/CANMARC specification" - whatever this is - so the MIME type can be used as format identifier. For vCard there is an RDF namespace and a (not very nice) XML namespace: http://www.w3.org/2001/vcard-rdf/3.0# and vcard-temp (see http://xmpp.org/registrar/namespaces.html). If you want to identify a defined format, there is almost always an identifier you can reuse - if not, ask the creator of the format. The problem is not identifiers or the complexity of formats, but people that create and use formats that are not well defined.

What about XML formats that have no namespace? JSON objects that conform to a defined structure? Protocol Buffers?

If something does not conform to a defined structure, then it is no format at all but data garbage (yes, we have a lot of this in library systems, but that's no excuse). To refer to XML or JSON in general there are MIME types. If you want to identify something more specific, there must be a definition of it or you are lost anyway.

And, while I didn't really want to wade into these waters, what about formats that are really only used to carry other formats, where it's the *other* format that really matters (METS, Atom, OpenURL XML, etc.)?

A container format with a restricted carried format is a subset of the container format. If you cannot handle the whole but only a subset, then you should only ask for the subset. There are three possibilities:

1. Implicitly define the container format and choose the carried format. This is what SRU does - you ask for the record format, but you always get the SRU response format as container with the record format embedded.
2. Implicitly define the carried format and choose the container format.
3. Define a new format as a combination of container and carried format.

"unAPI should be revised and specified more strictly to become an RFC anyway. Yes, this requires a laborious and lengthy submission and review process but there is no such thing as a free lunch." Yeah, I have no problem with this (same with Jangle). The argument could be made, however: is there a cowpath yet to be paved?

That depends on whether you want to be taken seriously outside the library community and target the web as a whole, or not.

Cheers, Jakob

-- Jakob Voß jakob.v...@gbv.de, skype: nichtich Verbundzentrale des GBV (VZG) / Common Library Network Platz der Goettinger Sieben 1, 37073 Göttingen, Germany +49 (0)551 39-10242, http://www.gbv.de
[CODE4LIB] Availability Information API
Hi,

I just wanted to announce that I finished a reference implementation of the Document Availability Information API (DAIA) as a CPAN module at http://search.cpan.org/perldoc?DAIA. More information about DAIA can be found in the specification at http://purl.org/NET/DAIA and at http://www.gbv.de/wikis/cls/DAIA. The basic structure is:

[Document] --1-to-n--> [Item] --1-to-n--> [Service] (which is either [Available] or [Unavailable])

We created DAIA for German library networks as an interchange format and API to encode information about the current availability of a specific document (or any copy of it) in a given library. There are numerous APIs for several tasks in library systems (SRU/SRW, Z39.50, OpenSearch, OAI-PMH, Atom, unAPI etc.) but no standard way to just query whether a copy of a given document - for instance a book - is available in a library, in which department, whether you can loan it or only use it in the library (or even read it online), or, if it is not available, how long it will probably take until it is available again. Obviously such an API would be helpful not only to connect different library systems but to create mashups and services ("Show me on a map where a given book is currently held and available", "Send me a tweet if a given book in my library is available again" etc.). DAIA was created to fill this gap.

In the context of the ILS Discovery Interface Task Force and its official recommendation (http://diglib.org/architectures/ilsdi/), DAIA corresponds to the GetAvailability method (section 6.3.1). At the moment the format and API are pretty stable, so the main work is to create server and client components for several ILS software packages. Every library has its own special rules and schemas - Jonathan Rochkind already wrote about the problems of implementing DAIA because of ILS complexity: http://bibwild.wordpress.com/2009/09/02/daia-and-ils-complexity/ .
We cannot erase this complexity by magic (unless we refactor and clean up the ILS), but at least we can try to map it to a common data model, which DAIA provides. With the DAIA Perl package you can concentrate on writing the wrapper without dealing with DAIA parsing and serialization issues. Why should everyone write their own routines to scrape, for instance, the HTML OPAC output to parse availability status? One mapping to DAIA should fit most needs, so others can build upon it.

A public DAIA converter/validator is available at http://ws.gbv.de/daia/validator

Extensions to DAIA can be discussed in the Code4Lib Wiki at http://wiki.code4lib.org/index.php/DAIA_extensions but I'd prefer not to start with the extensions but with basic services. If you have more cool ideas for client applications, just let me know!

Cheers, Jakob

P.S: Yes, there are some other awkward attempts to encode availability (Z39.50 Holdings, ISO 20775 Holdings, NCIP, SLNP...), but I found all of them underdefined, not publicly documented, or not openly usable based on Web standards.

-- Jakob Voß jakob.v...@gbv.de, skype: nichtich Verbundzentrale des GBV (VZG) / Common Library Network Platz der Goettinger Sieben 1, 37073 Göttingen, Germany +49 (0)551 39-10242, http://www.gbv.de
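To make the Document -> Item -> Service structure concrete, here is an invented availability response as plain JSON. The field names follow my reading of the DAIA specification (http://purl.org/NET/DAIA) and should be checked against it; all ids and values are made up:

```python
# Invented DAIA-style availability response, built as a plain Python
# dict and serialized to JSON. Field names follow my reading of the
# DAIA spec (http://purl.org/NET/DAIA); verify against the spec
# before relying on them.
import json

response = {
    "document": [{
        "id": "urn:isbn:0-345-24223-8",            # the document asked about
        "item": [{                                  # one particular copy
            "department": {"content": "Main reading room"},
            "available": [
                {"service": "presentation"},        # usable in the library now
                {"service": "loan", "delay": "PT2H"},  # loanable, with delay
            ],
            "unavailable": [
                {"service": "interloan", "expected": "2010-01-15"},
            ],
        }],
    }],
}

print(json.dumps(response, indent=2))
```

A "tweet me when it's back" service would just poll such a response and fire when the wanted service moves from unavailable to available.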
[CODE4LIB] DAIA project hosted at SourceForge
Hi,

I'd just like to announce that we moved our DAIA-related resources [1] to SourceForge:

http://daia.sourceforge.net/
https://sourceforge.net/projects/daia/
http://daia.svn.sourceforge.net/viewvc/daia/

DAIA is a data model to express information about the availability of documents and their particular copies for library-related services (loan, local presentation, open access online, interlibrary loan...). Its specification (http://purl.org/NET/DAIA) includes serializations in XML, JSON and RDF - use the format of your choice and get the other serializations for free [2].

If you are happy with stuff like NCIP, SLNP, SIP2, Z39.50 Holdings, or ISO 20775 Holdings, then I wonder why. But if you'd like to decouple and open up your library system for better innovation, then you might want to have a look. Additional services like "tweet me if the book is back" should be easy to implement by third parties based on a common, well-defined availability API (/end-of-advertisement).

The current SVN repository contains the existing Perl implementation (which you can also get via CPAN), an implementation in PHP, and an XSLT client for DAIA/XML. The VuFind project also contains a DAIA driver as client component. Feedback is very welcome; feel free to join. The project mailing list is https://lists.sourceforge.net/mailman/listinfo/daia-devel (don't be confused that the first mail is in German, we also know English).

So how do you get availability information out of your library system?

Cheers Jakob

[1] see my mail from October 2009: http://www.mail-archive.com/code4lib@listserv.nd.edu/msg06164.html
[2] ok, the RDF serialization is still work in progress - see http://purl.org/ontology/daia for the current draft

-- Jakob Voß jakob.v...@gbv.de, skype: nichtich Verbundzentrale des GBV (VZG) / Common Library Network Platz der Goettinger Sieben 1, 37073 Göttingen, Germany +49 (0)551 39-10242, http://www.gbv.de
Re: [CODE4LIB] Twitter annotations and library software
Hi Tim,

you wrote: Unless someone can come up with a perfect pre-cooked format - one that not only covers what we need but is also super easy and space-efficient (we have only 1/2k to use!) - why don't we just decide on 'simplebib' : { } and start filling in fields? I don't think it makes sense to externalize the information under another URL, at least in the first instance. That at least doubles the calls involved, and makes whatever you build dependent on lots of external services that may or may not work.

Oh yeah, let's create just another ad-hoc metadata format, because obviously there are not enough different formats around! To be honest: I admire your multitude of good ideas and efforts, but this is one of the rare counterexamples. If you want to put bibliographic metadata into Twitter annotations (a good idea), you first need to clarify the basic purpose of embedding this information. I see two of them:

I. Identification: to identify other tweets and resources that refer to the same publication.
II. Description: to nicely show which publication someone refers to.

The purpose of identification can be served by the following means: a) standard identifiers, b) standard identifiers, c) standard identifiers. Examples of standard identifiers include ISBN, OCLC number, ASIN, LibraryThing work ID, well-defined bibliographic hash keys [*] etc.

The purpose of description can best be served by a format that can easily be displayed for human beings. You can either use a simple string or a well-known format. A string can be displayed, but people will put all different citation formats in there. Right now there are only two established metadata formats that aim at creating a citation:

a) BibTeX
b) The input format of the Citation Style Language (CSL)

I bet that CSL is the easier way to go. See http://citationstyles.org/ for details and examples.
Cheers Jakob [*] See http://www.gbv.de/wikis/cls/Bibliographic_Hash_Key for a description of the mapping mechanism that is also used in BibSonomy to match BibTeX records. -- Jakob Voß jakob.v...@gbv.de, skype: nichtich Verbundzentrale des GBV (VZG) / Common Library Network Platz der Goettinger Sieben 1, 37073 Göttingen, Germany +49 (0)551 39-10242, http://www.gbv.de
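To illustrate the idea of bibliographic hash keys mentioned in [*]: normalize a few core fields and hash them, so that trivially different records collide on purpose. This is NOT the actual BibSonomy algorithm (see the wiki page for the real normalization rules), only the principle:

```python
# Toy bibliographic hash key: lowercase, strip punctuation, hash.
# Records that differ only in case or punctuation get the same key.
# NOT the real BibSonomy normalization, just an illustration.
import hashlib
import re

def bib_hash(title, author, year):
    def norm(s):
        # keep only lowercase letters and digits
        return re.sub(r"[^a-z0-9]", "", s.lower())
    key = "|".join([norm(title), norm(author), str(year)])
    return hashlib.md5(key.encode("utf-8")).hexdigest()

a = bib_hash("On the Origin of Species", "Darwin, Charles", 1859)
b = bib_hash("On the origin of species.", "Darwin, Charles", 1859)
print(a == b)  # True: case and punctuation differences collapse
```

Two tweets annotated with the same key can then be matched as referring to the same publication without any lookup service.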
Re: [CODE4LIB] Twitter annotations and library software
Hi,

it's funny how quickly you vote against BibTeX, but at least it is a format that is frequently used in the wild to create citations. If you call BibTeX undocumented garbage, then what do you call MARC, which is far more difficult to make use of?

My assumption was that there is a specific use case for bibliographic data in Twitter annotations:

I. Identify a publication => this can *only* be done seriously with identifiers like ISBN, DOI, OCLC number, LCCN etc.
II. Deliver a citation => use a citation-oriented format (BibTeX, CSL, RIS)

I was not voting explicitly for BibTeX, but at least there is a large community that can make use of it. I strongly favour CSL (http://citationstyles.org/) because:

- there is a JavaScript CSL processor. JavaScript is kind of a punishment, but it is the natural environment for the Web 2.0 mashup crowd that is going to implement applications that use Twitter annotations
- there are dozens of CSL citation styles, so you can display a citation in any way you want

As Ross pointed out, RIS would be an option too, but I miss easy open source tools that use RIS data to create citations. Any other relevant format that I know (Bibont, MODS, MARC etc.) does not aim at identification or citation in the first place but tries to model the full variety of bibliographic metadata. If your use case is

III. Provide semantic properties and connections of a publication

then you should look at the Bibliographic Ontology. But III does *not* just subsume use case II - it is a different story, one told not by normal people but only by metadata experts, Semantic Web gurus, library system developers etc. (I would count myself among these groups). If you want such complex data, then you should use systems other than Twitter for data exchange anyway.
A list of CSL metadata fields can be found at http://citationstyles.org/downloads/specification.html#appendices and the JavaScript processor (which is also used in Zotero) provides more information for developers: http://groups.google.com/group/citeproc-js Cheers Jakob P.S.: An example of a CSL record from the JavaScript client: { "title": "True Crime Radio and Listener Disenchantment with Network Broadcasting, 1935-1946", "author": [ { "family": "Razlogova", "given": "Elena" } ], "container-title": "American Quarterly", "volume": 58, "page": "137-158", "issued": { "date-parts": [ [2006, 3] ] }, "type": "article-journal" } -- Jakob Voß jakob.v...@gbv.de, skype: nichtich Verbundzentrale des GBV (VZG) / Common Library Network Platz der Goettinger Sieben 1, 37073 Göttingen, Germany +49 (0)551 39-10242, http://www.gbv.de
Re: [CODE4LIB] Twitter annotations and library software
Ed Summers wrote: II. Description: To nicely show which publication someone refers to. I think this is right. I wonder, would you consider a potential use case for Description to also provide machine readable data for a resource when a standard identifier is not known? There are lookup services to get a standard identifier when only some bibliographic data is known - mainly OpenURL. I have not investigated whether you can easily map the CSL format to OpenURL or whether you need to also embed the OpenURL as a twitter annotation. However, all lookup services that I know are either crappy or proprietary or both. This is not a technical issue but simply a lack of data (hopefully to get better with more linked open data). Given enough open bibliographic data, anyone can create a lookup service where you throw in some title, author and other such data and get back an identified record. I think there also are some services called library catalogs for this purpose. Anyway, this is nothing that can be solved with a bibliographic data format alone. Either you have a standard identifier or you don't. If you don't, you must rely on third-party services that run independently of your bibliographic data. It would be interesting to explore what identifiers + csl (and other options) would look like in a twitter annotation if you had time to mock something up in a wiki somewhere :-) I summarized my findings on CSL at http://wiki.code4lib.org/index.php/Citation_Style_Language and included some ideas of CSL and other data in twitter annotations. Feel free to modify! Cheers Jakob -- Jakob Voß jakob.v...@gbv.de, skype: nichtich Verbundzentrale des GBV (VZG) / Common Library Network Platz der Goettinger Sieben 1, 37073 Göttingen, Germany +49 (0)551 39-10242, http://www.gbv.de
Re: [CODE4LIB] Twitter annotations and library software
Jonathan Rochkind wrote: Jakob Voss wrote: I. Identify publication = this can *only* be done seriously with identifiers like ISBN, DOI, OCLCNum, LCCN etc. Ah, but for better or for worse, that's not the world we live in. We have LOTS of publications that either lack such identifiers altogether, or where information about identifiers is not available. (Mostly the former). That we need to identify. This is an actual use case, you can't just dismiss it by saying it can't be done! Call me pedantic, but if you do not have an identifier then there is no hope of identifying the publication by means of metadata. You can only *describe* it with metadata and use additional heuristics (mostly search engines) to hopefully identify the publication based on the description. But these additional heuristics are not part of the metadata, while a well-defined identifier implies a standard of how the identifier was created and how it can be looked up. The last hope, if there is no identifier, is to create one. For instance our library system creates internal record numbers (such as OCLC numbers) which can be reused. You can also define an algorithm that creates a hash as identifier, like the bibkey I mentioned. But as long as there is no identifier, there is no identification independent of a bibliographic database that already contains the record to search in. Jakob -- Jakob Voß jakob.v...@gbv.de, skype: nichtich Verbundzentrale des GBV (VZG) / Common Library Network Platz der Goettinger Sieben 1, 37073 Göttingen, Germany +49 (0)551 39-10242, http://www.gbv.de
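The hash-as-identifier idea from the message above can be sketched in a few lines. This is a minimal Python illustration with a normalization rule I made up for the example; it is not the actual Bibliographic Hash Key algorithm from the wiki page linked earlier in the thread:

```python
import hashlib
import re

def bibkey(title, authors, year):
    """Derive a hash identifier from normalized core metadata.
    Records that normalize identically get the same key."""
    def norm(s):
        # Lowercase and collapse all non-word characters to single spaces
        return re.sub(r'\W+', ' ', s.lower()).strip()
    parts = [norm(title),
             ','.join(sorted(norm(a) for a in authors)),
             str(year)]
    return hashlib.md5('|'.join(parts).encode('utf-8')).hexdigest()

# The same publication described slightly differently yields the same key:
k1 = bibkey("On the Origin of Species", ["Charles Darwin"], 1859)
k2 = bibkey("On the origin of species!", ["charles darwin"], 1859)
assert k1 == k2
```

The point is exactly the one made above: such a key only identifies a record relative to the normalization rules baked into the algorithm, not absolutely.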
Re: [CODE4LIB] Twitter annotations and library software
Jonathan Rochkind wrote: Call me pedantic but if you do not have an identifier then there is no hope to identify the publication by means of metadata. You only *describe* it with metadata and use additional heuristics (mostly search engines) to hopefully identify the publication based on the description. But the entire OpenURL infrastructure DOES this, and does it without using search engines. It's a real world use case that has a solution in production! So, yeah, I call you pedantic for wanting to pretend the use case and the real world solution don't exist. :) As you said, OpenURL is an *infrastructure*. It only makes sense if you have resolvers that map an OpenURL to a unique publication. These resolvers do the identification, while OpenURL only describes (as long as you do not put a unique publication identifier into an OpenURL). In contrast, an identifier can be used to compare and search publications without any additional infrastructure. You can call it description rather than identification if you like, that is a question of terminology. But it's description that is meant to uniquely identify a particular publication, and that a whole bunch of software in use every day successfully uses to identify a particular publication. It's not just terminology whether you can simply compare two strings for equality (identification) or you need an infrastructure with knowledge bases and specific software (to make use of a description). OpenURL is of no use if you separate it from the existing infrastructure, which is mainly held by companies. No sane person will try to build an open alternative infrastructure because OpenURL is a crappy library standard like MARC etc. This rant on OpenURL summarizes it well: http://cavlec.yarinareth.net/2006/10/13/i-hate-library-standards/ The OpenURL specification is a 119 page PDF - that alone is a reason to run away as fast as you can. 
If a twitter annotation setup wants to be able to identify publications that don't have standard identifiers, then you don't want to ignore this use case and how actual in-production software currently deals with it. You can perhaps find a better way to deal with it -- I'm certainly not arguing for OpenURL as the be-all end-all, I rather hate OpenURL actually. But dismissing it as impossible is indeed pedantic, since it's being done! If a twitter annotation setup wants to get adopted then it should not be built on a crappy, complex library standard like OpenURL. It IS a hacky and error-prone solution, to be sure. But it's the best solution we've got, because it's simply a fact that we have many publications we want to identify that lack standard identifiers. Ok, back to serious: Bibliographic Twitter annotations should be designed in a way that libraries (or whoever provides those knowledge bases aka OpenURL resolvers) can use them to look up a publication by its metadata. So there should be a transformation Twitter annotation → OpenURL. If you choose CSL as the bibliographic input format you can hopefully create a CSL style that does not produce a citation but an OpenURL - Voilà! I must admit that this solution rests on the unverified assumption that the CSL record format contains all information needed for OpenURL, which may not be the case. A good point to start from is the function createContextObject in https://www.zotero.org/svn/extension/trunk/chrome/content/zotero/xpcom/ingester.js which is used by Zotero to create OpenURLs. Cheers Jakob -- Jakob Voß jakob.v...@gbv.de, skype: nichtich Verbundzentrale des GBV (VZG) / Common Library Network Platz der Goettinger Sieben 1, 37073 Göttingen, Germany +49 (0)551 39-10242, http://www.gbv.de
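The CSL-to-OpenURL transformation proposed above could look roughly like this. The CSL field names follow the record shown earlier in this thread; the OpenURL keys (`rft.atitle`, `rft.jtitle`, ...) are standard KEV keys for the journal article format, but the mapping itself and the resolver base URL are my own guesses, not a vetted crosswalk:

```python
from urllib.parse import urlencode

# Hypothetical mapping from CSL-JSON fields to OpenURL 1.0 KEV keys
CSL_TO_KEV = {
    'title': 'rft.atitle',
    'container-title': 'rft.jtitle',
    'volume': 'rft.volume',
    'page': 'rft.pages',
}

def csl_to_openurl(record, resolver='http://resolver.example.org/'):
    """Build an OpenURL 1.0 KEV query from a CSL-like record dict."""
    kev = {'url_ver': 'Z39.88-2004',
           'rft_val_fmt': 'info:ofi/fmt:kev:mtx:journal'}
    for csl_key, kev_key in CSL_TO_KEV.items():
        if csl_key in record:
            kev[kev_key] = str(record[csl_key])
    if record.get('author'):
        first = record['author'][0]
        kev['rft.aulast'] = first.get('family', '')
        kev['rft.aufirst'] = first.get('given', '')
    return resolver + '?' + urlencode(kev)

url = csl_to_openurl({
    'title': 'True Crime Radio and Listener Disenchantment',
    'container-title': 'American Quarterly',
    'volume': 58, 'page': '137-158',
    'author': [{'family': 'Razlogova', 'given': 'Elena'}],
})
```

As the message says, whether every OpenURL field can be filled from CSL data this way is an open question.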
Re: [CODE4LIB] Twitter annotations and library software
Dear Tim, you wrote: So this is my recommended framework for proceeding. Tim, I'm afraid you'll actually have to do the hard work yourself. No, I don't. Because the work isn't fundamentally that hard. A complex standard might be, but I never for a moment considered anything like that. We have *512 bytes*, and it needs to be usable by anyone. Library technology is usually fatally over-engineered, but this is a case where that approach isn't even possible. Jonathan gave a very good summary - you just have to pick what your main focus of embedding bibliographic data is. A) I favour using the CSL record format, which I summarized at http://wiki.code4lib.org/index.php/Citation_Style_Language because I had in mind that people want to have a nice looking citation of the publication that someone tweeted about. The drawback is that CSL is less adopted and will not always fit in 512 bytes. B) If your main focus is to link tweets about the same publication (and other stuff about this publication) then you must embed identifiers. LibraryThing is mainly based on two identifiers: 1) ISBN to identify editions 2) LT work ids to identify works. I wonder why LT work ids have not caught on more, although you thankfully provide a full mapping to ISBN at http://www.librarything.com/feeds/thingISBN.xml.gz but nevermind. I thought that some LT records also contain other identifiers such as OCLC number, LOC number etc. but maybe I am wrong. The best way to specify identifiers is to use a URI (all relevant identifiers that I know have a URI form). For ISBN it is urn:isbn:{ISBN13}. For the LT work id you can use the URL with your .com top level domain: http://www.librarything.com/work/{LTWORKID} That would fit for tweets about books with an ISBN and for tweets about a work, which will cover 99.9% of tweets from LT about single publications anyway. 
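The two identifier forms just mentioned can be generated trivially. A small sketch: the LibraryThing work URL pattern is taken from the message itself, and `urn:isbn:` is the URN namespace registered for ISBNs in RFC 3187:

```python
def isbn_uri(isbn13):
    """Canonical URN form of an ISBN (strip hyphens and spaces)."""
    return 'urn:isbn:' + isbn13.replace('-', '').replace(' ', '')

def lt_work_uri(work_id):
    """URL form of a LibraryThing work id, per the pattern above."""
    return 'http://www.librarything.com/work/%s' % work_id

assert isbn_uri('978-3-16-148410-0') == 'urn:isbn:9783161484100'
assert lt_work_uri(1060) == 'http://www.librarything.com/work/1060'
```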
C) If your focus is to let people search for a publication in libraries and to copy bibliographic data into reference management software, then COinS is the way to go. COinS is based on OpenURL, which I and others ranted about because it is a crappy library standard like MARC. But unlike other metadata formats, COinS usually fits in less than 512 bytes. Furthermore you may have to deal with it for LibraryThing for Libraries anyway. Although I strongly favour CSL as a practising library scientist and developer, I must admit that for LibraryThing the best way is to embed identifiers (ISBN and LT work id) and maybe COinS. As long as LibraryThing does not open up to more complex publications like preprints of proceedings articles in series etc. but mainly deals with books and works, this will make LibraryThing users happy. Then, three years from now, we can all conference-tweet about a CIL talk, about all the cool ways libraries are using Twitter, and how it's such a shame that the annotations standard wasn't designed with libraries in mind. How about a bet instead of voting? In three years will there be: a) No relevant Twitter annotations anyway b) Twitter annotations, but not used much for bibliographic data c) A rich variety of incompatible bibliographic annotation standards d) Semantic Web will have solved every problem anyway ... Cheers Jakob -- Jakob Voß jakob.v...@gbv.de, skype: nichtich Verbundzentrale des GBV (VZG) / Common Library Network Platz der Goettinger Sieben 1, 37073 Göttingen, Germany +49 (0)551 39-10242, http://www.gbv.de
Re: [CODE4LIB] it's cool to hate on OpenURL (was: Twitter annotations...)
Eric Hellman wrote: What I hope for is that OpenURL 1.0 eventually takes a place alongside SGML as a too-complex standard that directly paves the way for a universally adopted foundational technology like XML. What I fear is that it takes a place alongside MARC as an anachronistic standard that paralyzes an entire industry. But all the flaws of XML can be traced back to SGML, which is why we now use JSON despite all of its limitations. May brother Ted Nelson enlighten all of us - he not only hates XML [1] and similar formats but also proposed an alternative way to structure information even before the invention of hierarchical file systems and operating systems [2]. In his vision of Xanadu, every piece of published information had a unique ID that was reused every time the publication was referenced - which would solve our problem. SCNR Jakob [1] http://ted.hyperland.com/XMLisEvil.html [2] Ted Nelson: A File Structure for the Complex, The Changing and the Indeterminate, ACM 20th National Conference, pages 84-100, 1965. Presented in 1964. -- Jakob Voß jakob.v...@gbv.de, skype: nichtich Verbundzentrale des GBV (VZG) / Common Library Network Platz der Goettinger Sieben 1, 37073 Göttingen, Germany +49 (0)551 39-10242, http://www.gbv.de
[CODE4LIB] Unicode persistence (was: it's cool to hate on OpenURL)
Eric Hellman wrote: May I just add here that of all the things we've talked about in these threads, perhaps the only thing that will still be in use a hundred years from now will be Unicode. إن شاء الله (God willing) Stuart Yeates wrote: Sadly, yes, I agree with you on this. Do you have any idea how demotivating that is for those of us maintaining collections with works containing characters that don't qualify for inclusion? May I just add that Unicode is evolving too, and you can help to get missing characters included. One of the next updates will even include hundreds of icons such as a slice of pizza, a kissing couple, and Mount Fuji (see this zipped PDF: http://is.gd/bABl9 and http://en.wikipedia.org/wiki/Emoji). I also bet that Unicode will still be there a hundred years from now (and probably URIs), while things like XML and RDF may be little used by then. But I fear that Unicode may be used in a different way, just as words in natural language change their meanings over the centuries. And that's why we need libraries (phew, at least one positive claim about these institutions we are all bound to ;-) Jakob -- Jakob Voß jakob.v...@gbv.de, skype: nichtich Verbundzentrale des GBV (VZG) / Common Library Network Platz der Goettinger Sieben 1, 37073 Göttingen, Germany +49 (0)551 39-10242, http://www.gbv.de
Re: [CODE4LIB] it's cool to hate on OpenURL
Stuart Yeates wrote: A great deal of heat has been vented in this thread, and at least a little light. I'd like to invite everyone to contribute to the wikipedia page at http://en.wikipedia.org/wiki/OpenURL in the hope that it evolves into a better overview of the protocol, the ecosystem and their place on the web. [Hint: the best heading for a rant on Wikipedia is 'Criticisms', but you'll still need to reference the key points. Links into this thread count as references, if you can't find anything else.] Good point - but writing Wikipedia articles is more work than discussing on mailing lists ;-) Instead of improving the OpenURL article I started to add to the more relevant[1] COinS article: http://en.wikipedia.org/wiki/COinS Maybe some of you (Eric Hellman, Richard Cameron, Daniel Chudnov, Ross Singer, Herbert Van de Sompel ...) could fix the history section, which I tried to reconstruct from historic sources[2] on the Internet without violating the Wikipedia NPOV, which is hard if you write about things you were involved in. Am I right that neither OpenURL nor COinS strictly defines a metadata model with a set of entities/attributes/fields/you-name-it and their definitions? Apparently all ContextObject metadata formats are based on non-normative implementation guidelines only? Cheers Jakob [1] My bet: what will remain of OpenURL will be a link server base URL that you attach a COinS to [2] about five years ago, so it's historic in terms of the internet ;-) By the way, does anyone have a copy of http://dbk.ch.umist.ac.uk/wiki/index.php?title=Metadata_in_HTML ? -- Jakob Voß jakob.v...@gbv.de, skype: nichtich Verbundzentrale des GBV (VZG) / Common Library Network Platz der Goettinger Sieben 1, 37073 Göttingen, Germany +49 (0)551 39-10242, http://www.gbv.de
Re: [CODE4LIB] Handling non-Unicode characters (was: Unicode persistence)
Hi Stuart, These have been included because they are in widespread use in a current written culture. The problems I personally have are down to characters used by a single publisher in a handful of books more than a hundred years ago. Such characters are explicitly excluded from Unicode. In the early period of the standardisation of the Māori language there were several competing ideas of what to use as a character set. One of those included a 'wh' ligature as a character. Several works were printed using this ligature. This ligature does not qualify for inclusion in Unicode. That is a matter of discussion. If you do not call it a 'ligature', the chances of getting it included are higher. To see how we handle the text, see: http://www.nzetc.org/tm/scholarly/tei-Auc1911NgaM-t1-body-d4.html The underlying representation is TEI/XML, which has a mechanism to handle such glyphs. The things I'm still unhappy with are: * getting reasonable results when users cut-n-paste the text/image HTML combination to some other application * some browsers still like line-breaking on images in the middle of words That's interesting and reminds me of the treatment of mathematical formulae in journal titles, which mostly end up as ugly images. In Unicode you are allowed to assign private use characters: http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Private_use_characters The U+200D ZERO WIDTH JOINER could also be used, but most browsers will not support it - and you need a font that supports your character anyway: http://blogs.msdn.com/michkap/archive/2006/02/15/532394.aspx In summary: Unicode is just a subset of all characters that have ever been used for written communication, and whether a character gets included depends not only on objective properties but on lobbying and other circumstances. The deeper you dig, the nastier Unicode gets - as with all complex formats and standards. 
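As a tiny illustration of the private-use-area approach, a PUA codepoint could stand in for the 'wh' ligature until (or unless) it ever gets encoded. The choice of U+E000 here is arbitrary; any PUA codepoint works, and rendering still depends on a font that maps it:

```python
# U+E000 is the first codepoint of the BMP Private Use Area
# (U+E000..U+F8FF); here it stands in for the 'wh' ligature.
WH_LIGATURE = '\ue000'

def encode_wh(text):
    """Replace the digraph 'wh' with the private-use ligature character."""
    return text.replace('wh', WH_LIGATURE)

def decode_wh(text):
    """Map the private-use character back to the plain digraph."""
    return text.replace(WH_LIGATURE, 'wh')

s = encode_wh('whakapapa')
assert len(s) == len('whakapapa') - 1   # two chars collapsed into one
assert decode_wh(s) == 'whakapapa'
```

The usual caveat applies: PUA codepoints mean nothing outside your own system, so round-tripping as shown is essential for interchange.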
Cheers Jakob P.S: Michael Kaplan's blog also contains a funny article about emoji: http://blogs.msdn.com/michkap/archive/2010/04/27/10002948.aspx -- Jakob Voß jakob.v...@gbv.de, skype: nichtich Verbundzentrale des GBV (VZG) / Common Library Network Platz der Goettinger Sieben 1, 37073 Göttingen, Germany +49 (0)551 39-10242, http://www.gbv.de
Re: [CODE4LIB] Open Source Federated Library Search
David Kane wrote: Anyone got any suggestions? I am liking LibraryFind at the moment, but am not sure if anyone is using it. Has anyone else got experience with this or any other federated search programs? How about YaCy (see http://yacy.net/oai.html). Or did you mean metasearch instead of real federated search? Jakob -- Jakob Voß jakob.v...@gbv.de, skype: nichtich Verbundzentrale des GBV (VZG) / Common Library Network Platz der Goettinger Sieben 1, 37073 Göttingen, Germany +49 (0)551 39-10242, http://www.gbv.de
Re: [CODE4LIB] Twitter annotations and library software
On 07.06.2010 16:15, Jay Luker wrote: Hi all, I found this thread rather interesting and figured I'd try and revive the convo since apparently some things have been happening in the twitter annotation space in the past month. I just read on techcrunch that testing of the annotation features will commence next week [1]. Also it appears that an initial schema for a book type has been defined [2]. [1] http://techcrunch.com/2010/06/02/twitter-annotations-testing/ [2] http://apiwiki.twitter.com/Annotations-Overview#RecommendedTypes Have any code4libbers gotten involved in this beyond just opining on list? I don't think so - the discussion drifted into general data modelling questions. For the specific, limited use case of twitter annotations I bet the recommended format from [2] will be fine (title is implied as a common attribute, url is optional): {"book": { "title": ..., "author": ..., "isbn": ..., "year": ..., "url": ... }} I only miss an article type with a doi field for non-books. Cheers, Jakob -- Jakob Voß jakob.v...@gbv.de, skype: nichtich Verbundzentrale des GBV (VZG) / Common Library Network Platz der Goettinger Sieben 1, 37073 Göttingen, Germany +49 (0)551 39-10242, http://www.gbv.de
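Whether such a book annotation actually fits Twitter's 512-byte limit is easy to check. A sketch with made-up field values (the ISBN is the well-known example ISBN, the URL is a placeholder):

```python
import json

annotation = {'book': {
    'title': 'An Example Book',
    'author': 'A. Author',
    'isbn': '9783161484100',        # the canonical example ISBN
    'year': 2010,
    'url': 'http://example.org/book',  # placeholder URL
}}

# Compact separators keep the serialization as small as possible
payload = json.dumps(annotation, separators=(',', ':')).encode('utf-8')
assert len(payload) <= 512, 'annotation exceeds the 512-byte limit'
```

A typical book record like this stays well under the limit; a full CSL record with several authors and a long title is where the 512-byte ceiling starts to bite, as noted earlier in the thread.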
[CODE4LIB] Bibliography of Code4Lib journal
Hi, Recently I had the vision of a kind of DBLP for the field of library and information science. There are some commercial bibliographic databases like Library and Information Science Abstracts (LISA) and INFODATA, some public repositories like E-LIS, some open bibliographic databases like DABI (German), and a rich variety of bibliographic data that is only available embedded in web pages. The latter includes for instance the encyclopedia of library and information science [1] and the Code4Lib Journal [2]. [1] http://www.informaworld.com/smpp/title~content=t713172967 [2] http://journal.code4lib.org/issues During some procrastination I wrote a screen scraper to collect bibliographic data for all Code4Lib Journal articles. It looks like there are 100 right now, including book reviews and editorials. You can find the scraper script [3] and the resulting bibliography [4] in a git repository [5]. [3] http://github.com/nichtich/dblis/blob/master/scripts/code4libjournal.pl [4] http://github.com/nichtich/dblis/blob/master/data/code4libjournal/code4libjournal.bib [5] http://github.com/nichtich/dblis Feel free to fork and add more scrapers or raw bibliographic data. I doubt that one social cataloging system fits all needs, and linked data is more the method than the goal, so just aggregating related bibliographic data in whatever format I can get is an easy way to start. Of course you can also just grab the Code4Lib Journal bibliography and forget about the vision. I also tried to import the data into Mendeley and created a Code4Lib Journal group [6]. Unfortunately the BibTeX import of Mendeley does not fully support some fields like month, day, and abstract - you can vote and comment on a feature request [7]. 
[6] http://www.mendeley.com/library/group/607411/ [7] http://feedback.mendeley.com/forums/4941-mendeley-feedback/suggestions/128222-improve-bibtex-import Cheers Jakob -- Verbundzentrale des GBV (VZG) Digitale Bibliothek - Jakob Voß Platz der Goettinger Sieben 1 37073 Goettingen - Germany +49 (0)551 39-10242 http://www.gbv.de jakob.v...@gbv.de
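The output step of a scraper like the one described above boils down to serializing scraped fields as BibTeX. A minimal Python sketch, not the actual Perl script from the repository; the field names and the citation key scheme are my own choices:

```python
def to_bibtex(key, entry_type, fields):
    """Serialize a dict of field name -> value as one BibTeX entry."""
    lines = ['@%s{%s,' % (entry_type, key)]
    for name, value in sorted(fields.items()):
        lines.append('  %s = {%s},' % (name, value))
    lines.append('}')
    return '\n'.join(lines)

# Hypothetical scraped record (names and values invented):
entry = to_bibtex('smith2010code4lib', 'article', {
    'title': 'A Hypothetical Article',
    'author': 'Jane Smith',
    'journal': 'Code4Lib Journal',
    'year': '2010',
})
```

Real-world BibTeX output also needs escaping of special characters (`{`, `}`, `%`, `&` ...), which this sketch omits.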
[CODE4LIB] XML Schema vs Library APIs (OAI-PMH/SRU/unAPI)
Hi, We are developing a general API management tool to provide different APIs (unAPI, SRU, OAI-PMH...) with different record formats (MARC, MODS, DC...) to our databases. We now stumbled upon some confusion regarding XML formats. The basic question is: what is a format and how do you refer to it? I came to the conclusion that at least SRU schema identifiers are useless. In addition you can extract XML namespace URIs from XML Schemas, so all you need to identify a format is a link to its XML Schema. I wrote a more detailed blog posting about this at http://jakoblog.de/2011/02/24/xml-schema-vs-library-apis-oai-pmhsruunapi/ Do any of you rely on SRU schema identifiers when consuming SRU? I think at least for XML-based formats we should only use the XML Schema as the authoritative reference. Sure, there are different applications of variants of one schema, but then it makes no sense to use global identifiers in addition to local names. Jakob -- Jakob Voß jakob.v...@gbv.de, skype: nichtich Verbundzentrale des GBV (VZG) / Common Library Network Platz der Goettinger Sieben 1, 37073 Göttingen, Germany +49 (0)551 39-10242, http://www.gbv.de
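Extracting the namespace URI from an XML Schema, as suggested above, is a one-liner with the standard library. A sketch; real schemas may of course import further schemas with additional namespaces, which is exactly the complication discussed in the follow-up:

```python
import xml.etree.ElementTree as ET

def target_namespace(xsd_string):
    """Read the targetNamespace attribute from an XML Schema document."""
    root = ET.fromstring(xsd_string)
    return root.get('targetNamespace')

# A stripped-down schema header as an example (MARCXML's real namespace):
xsd = '''<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://www.loc.gov/MARC21/slim"
           elementFormDefault="qualified"/>'''
assert target_namespace(xsd) == 'http://www.loc.gov/MARC21/slim'
```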
Re: [CODE4LIB] XML Schema vs Library APIs (OAI-PMH/SRU/unAPI)
Hi Rob, This is just a rehash of a previous discussion on this list, between us: http://www.mail-archive.com/code4lib@listserv.nd.edu/msg05309.html So I guess I'm wasting my time ;) Thanks, I added a link to the previous discussion. You wrote: Referring to your blog post, you can say how the four inter-relate: Schema Identifier uniquely identifies the format. Schema Location is a non-unique description of the format. Schema Name is a short, human readable, non-unique name for the format and Namespace is a non-unique namespace used by the format. These definitions can help to clarify things, but they are of little practical value. The practical question is how to refer to a particular format. If you have to manually look at each particular server and collection to find out what format is *actually* served, then names and identifiers are of little help to code against. Both schema identifiers and schema names only help you to guess a format. A precise format needs an authoritative reference that you can validate against. If there exists an official XML Schema, this and only this schema defines the format (or the commonly agreed upon subset that you can work with without manually adopting each single data source). A single schema may contain multiple namespaces, and there isn't a unique identifier for a schema. For example, any simple Dublin Core based syntax must have at least two Namespaces, Dublin Core and the wrapper element. SchemaLocation is not unique as there can be many copies of the same schema. A single schema may define multiple root elements, such as MODS does with both item and collection level elements. A unique identifier for a schema is helpful because you do not need to actually look up a schema that you already know by its identifier. But it's not a must. If there is no single root namespace, you just should not use a namespace to point to a particular format. 
Ok, enough :-) Jakob -- Jakob Voß jakob.v...@gbv.de, skype: nichtich Verbundzentrale des GBV (VZG) / Common Library Network Platz der Goettinger Sieben 1, 37073 Göttingen, Germany +49 (0)551 39-10242, http://www.gbv.de
[CODE4LIB] Mapping vocabularies (was: LCSH and Linked Data)
Hi, Any transformation of a controlled vocabulary, either in format (MARC to RDF) or in coverage (e.g. from LCSH to DDC, MeSH, GND, etc.) has to decide whether (a) there is a one-to-one (or one-to-zero) mapping between all concepts, or (b) you need n-to-m or even more complex mappings. Mapping name authority files in VIAF was a case of (a) because we more or less agree that a person is always the same person. But it looks like mapping authority data in MARC from different institutions is an instance of (b). Not only are concepts like England fuzzier than people, but they are also used in different contexts for different purposes, depending on the cataloging rules and their specific interpretation. It does not help to argue about MARC fields because there just is no easy one-to-one mapping between, for instance: - The Kingdom of England (927–1707) - The area of the Kingdom of England (927–1707) - The country England as today - The area of England including the Principality of Sealand - The area of England excluding the Principality of Sealand - The whole island of Great Britain - The island of Great Britain including Ireland - The island of Great Britain including Northern Ireland - The Kingdom of Great Britain (1707 to 1801) - The United Kingdom of Great Britain and Ireland (1801 to 1922) - etc. I gave a talk about the fruitless attempt to put reality into Semantic Web terms at Wikimania 2007 (starting with slide 12): http://www.slideshare.net/NCurse/jakob-voss-wikipedia2007 Instead of discussing how to map terms and concepts the right way, you should think about how to express fuzzy and complex mappings. The SKOS mapping vocabulary provides some relations for this purpose. 
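An n-to-m mapping expressed with the SKOS mapping relations might be modelled like this. A sketch only: the concept URIs are invented, the SKOS relation names (`skos:closeMatch`, `skos:broadMatch`) are real properties from the SKOS vocabulary, but the particular mappings chosen here are illustrative, not asserted:

```python
# Triples of (source concept, SKOS mapping relation, target concept).
# URIs below are invented for illustration.
mappings = [
    ('http://example.org/lcsh/England', 'skos:closeMatch',
     'http://example.org/gnd/England'),
    ('http://example.org/lcsh/England', 'skos:broadMatch',
     'http://example.org/ddc/GreatBritain'),
    # n-to-m: one source concept may map to several targets, and
    # several source concepts may map to one target.
]

def targets(source, relation):
    """All targets a source concept maps to via a given relation."""
    return [t for s, r, t in mappings if s == source and r == relation]
```

The point of the data structure is that nothing forces a concept to have exactly one counterpart, which is what case (b) above requires.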
I can also recommend the DC2010 paper Establishing a Multi-Thesauri-Scenario based on SKOS and Cross-Concordances by Mayr, Zapilko, and Sure: http://dcpapers.dublincore.org/ojs/pubs/article/viewArticle/1031 If you do not want to bother with complex mappings but prefer one-to-one, you should not talk about differences like England as a corporate body, England as a place, England as a nationality etc. Sure, you can put all these meanings into one broad and fuzzy term England, but then stop complaining about semantic differences and use the term as an unqualified subject heading with no specific meaning, for anything that is related to any of the many ideas that anyone can call England. This is the way that full text retrieval works. You just can't have both simple mappings and precise terms. Jakob -- Jakob Voß jakob.v...@gbv.de, skype: nichtich Verbundzentrale des GBV (VZG) / Common Library Network Platz der Goettinger Sieben 1, 37073 Göttingen, Germany +49 (0)551 39-10242, http://www.gbv.de
[CODE4LIB] Classification of loan types
Hi, is there an established classification of loan types, preferably as an ontology with URIs for each type? I am not sure about the English terminology, with terms such as loan, hold and recall. In particular I am looking for a simple (!) list of types for current relationships between a patron and a library item, such as the following: 1. loan: the patron has got the library item 2. reservation: the patron will get the library item as soon as it is available again 3. ordered: the patron will get the library item, which is currently being made available, for instance by bringing it from the closed stacks to some location. Maybe that's all and I only need to define my own URIs for these cases. But there may also be relevant cases such as: the item is allocated to some location for the patron, waiting to be picked up; and digital library items may be even more complicated. So do you know of any such specification, or do I have to define my own standard? Jakob -- Verbundzentrale des GBV (VZG) Digitale Bibliothek - Jakob Voß Platz der Goettinger Sieben 1 37073 Goettingen - Germany +49 (0)551 39-10242 http://www.gbv.de jakob.v...@gbv.de
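The relationship types listed in the message above can be modelled as a small enumeration. A Python sketch with invented names and numbers, exactly the kind of ad-hoc standard the message is trying to avoid by asking for an established one:

```python
from enum import Enum

class LoanState(Enum):
    """Current relationship between a patron and a library item.
    Names and numeric values are invented for illustration."""
    LOAN = 1         # the patron has the item
    RESERVATION = 2  # the patron will get the item once it is available again
    ORDERED = 3      # the item is currently being made available
    PROVIDED = 4     # the item is allocated and waiting for pickup

assert LoanState.RESERVATION.value == 2
assert LoanState(3) is LoanState.ORDERED
```

To turn this into the requested ontology, each member would additionally need a URI, e.g. under some namespace the spec owner controls.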
Re: [CODE4LIB] First draft of patron account API
Hi, I added a recent changes list at http://gbv.github.com/paia/ to make it easier to follow modifications to the PAIA specification. P Williams wrote: I'm very interested in this problem space. Good to see that someone is taking the initiative to try to solve the problem. I guess I'll have to learn some German :) Ach, das ist nicht nötig (oh, that's not necessary) :-) You'd better correct my English by proofreading the current PAIA spec. You mention VuFind ILS drivers. You might also be interested in the connectors from the XC NCIP toolkit [http://xcncip2toolkit.googlecode.com] and the LAI Connector from Equinox's FulfILLment [http://www.fulfillment-ill.org/]. Thanks, I started a page with related work, open for contributions: https://github.com/gbv/paia/wiki/Related-work I think OAuth is a good starting place when you talk about authentication. This would address some of the issues of trust with applications that want to access your library related information and how to securely grant access to these client applications. With an OAuth model the server (ILS) doesn't have to know about the client application before the first request in order to establish trust. The trust is established by the user just in time. With library systems, username and password are usually barcode and PIN. The PIN is usually a four digit number, which is substantially easier to break with brute force than a true password (alpha-numeric + case + punctuation). I think that unfortunately PAIA has the potential to make this type of attack easier. Any thought to hardening library systems against brute force authentication attempts? You are right, but library systems should not have allowed weak passwords in the first place. I added a section on security considerations: http://gbv.github.com/paia/paia-5c2005c.html#security-considerations I think the best way is to enable PAIA only for patron accounts with strong passwords. What did you mean by decoupling of authorization and access? 
One reason to decouple authorization (PAIA auth) and access (PAIA core) was to be free to use different authorization methods in addition to username and password. You could also support access tokens bound to specific clients which can access multiple patron accounts or access tokens with read-only access to a patron account. With username/password one would only have all or nothing. What are your major complaints with NCIP? 1. One of the most important bits of information (circulation status) is not defined but free text. 2. NCIP is rarely implemented in total, so you never know what you get 3. Identifiers are not URIs. 4. I don't know of any library that allows open access to their NCIP interface by patrons and third-parties. But I've seen worse library APIs than NCIP ;-) I can see this being useful with authenticating for use of licensed databases, to determine eligibility for ILL services, or to verify a valid user for reciprocal borrowing in person within a consortia. It might also be useful for a service like Library Elf. Interesting case. You could think of a database as a document which can be requested - so the patron sends a PAIA requestItems and the resulting document state is 3 (held) or 4 (provided) on success and 5 (rejected) on failure. Jakob -- Jakob Voß jakob.v...@gbv.de, skype: nichtich Verbundzentrale des GBV (VZG) / Common Library Network Platz der Goettinger Sieben 1, 37073 Göttingen, Germany +49 (0)551 39-10242, http://www.gbv.de