Re: [CODE4LIB] Corrections to Worldcat/Hathi/Google
Thanks for this pointer, Owen. It's a nice illustration of the fact that what users actually want (well, I know I did back when I actually worked in large information services departments!) is something more like an intranet where the content I find is weighted towards me, the audience - e.g. the intranet already knows I'm a 2nd year medical student and that one of my registered preferred languages is Mandarin, or it knows that I'm a rare books cataloguer and I want to see what nine out of ten other cataloguers recorded for this obscure and confusing title.

However, this stuff is quite intense for linked data, isn't it? I understand that it would involve lots of quads, named graphs or whatever...

In a parallel world, I'm currently writing up recommendations for aggregating ONIX for Books records. ONIX data can come from multiple sources who potentially assert different things about a given book (i.e. something with an ISBN, to keep it simple). This is why *every single ONIX data element* can have optional attributes of @datestamp, @sourcename and @sourcetype [e.g. publisher, retailer, data aggregator... library?], and the ONIX message as a whole is set up with header and product record segments that each include some info about the sender/recipient/data record in question.

How people in the book supply chain are implementing these is a distinct issue, but could these capabilities have some relevance to what you're discussing?

Do you have any other pointers to intranet-like catalogues? In the museum space, there is of course this: http://www.researchspace.org/

Cheers,

Michael

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Owen Stephens
Sent: 28 August 2012 21:37
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Corrections to Worldcat/Hathi/Google

The JISC-funded CLOCK project did some thinking around cataloguing processes and tracking changes to statements and/or records - e.g. http://clock.blogs.lincoln.ac.uk/2012/05/23/its-a-model-and-its-looking-good/ Not solutions of course, but hopefully of interest.

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 28 Aug 2012, at 19:43, Simon Spero sesunc...@gmail.com wrote: [...]
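[A minimal sketch of how the per-element source attributes described above could drive aggregation, in plain Python rather than ONIX tooling. Only the @sourcename/@sourcetype/@datestamp attributes come from the message; the element name, titles, and the source-type preference order are invented for illustration.]

    from datetime import date

    # Hypothetical assertions about one ISBN, each carrying the ONIX-style
    # source attributes (@sourcename, @sourcetype, @datestamp).
    assertions = [
        {"element": "TitleText", "value": "Kort åfhandling om angmachiner",
         "sourcename": "Aggregator X", "sourcetype": "data aggregator",
         "datestamp": date(2011, 3, 1)},
        {"element": "TitleText", "value": "Kort afhandling om ångmachiner",
         "sourcename": "Publisher Y", "sourcetype": "publisher",
         "datestamp": date(2012, 6, 15)},
    ]

    # Assumed preference order: publishers over retailers over aggregators.
    SOURCE_RANK = {"publisher": 0, "retailer": 1, "data aggregator": 2}

    def preferred(assertions):
        """Pick one value per element: best source type first, newest datestamp second."""
        best = {}
        for a in assertions:
            key = a["element"]
            rank = (SOURCE_RANK.get(a["sourcetype"], 99), -a["datestamp"].toordinal())
            if key not in best or rank < best[key][0]:
                best[key] = (rank, a)
        return {k: v[1] for k, v in best.items()}

    print(preferred(assertions)["TitleText"]["value"])

[The point of the sketch is only that once every element carries its own source and datestamp, "which assertion wins" becomes a policy decision an aggregator can actually implement.]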
Re: [CODE4LIB] Corrections to Worldcat/Hathi/Google
On 29/08/12 19:46, Michael Hopwood wrote:

Thanks for this pointer, Owen. It's a nice illustration of the fact that what users actually want (well, I know I did back when I actually worked in large information services departments!) is something more like an intranet where the content I find is weighted towards me, the audience - e.g. the intranet already knows I'm a 2nd year medical student and that one of my registered preferred languages is Mandarin, or it knows that I'm a rare books cataloguer and I want to see what nine out of ten other cataloguers recorded for this obscure and confusing title.

Yet another re-invention of content negotiation, AKA RFC 2295. These attempts fail because 99% of data publishers care in the first instance about the single use before them, and in the second instance the precedent has already been set. The exceptions, of course, are legally mandated multi-lingual bureaucracies (the Canadian government for en/fr; EU organs for various languages, etc.) and on-the-wire formatting (for which it works very well).

However, this stuff is quite intense for linked data, isn't it? I understand that it would involve lots of quads, named graphs or whatever... In a parallel world, I'm currently writing up recommendations for aggregating ONIX for Books records. ONIX data can come from multiple sources who potentially assert different things about a given book (i.e. something with an ISBN, to keep it simple). This is why *every single ONIX data element* can have optional attributes of @datestamp, @sourcename and @sourcetype [e.g. publisher, retailer, data aggregator... library?], and the ONIX message as a whole is set up with header and product record segments that each include some info about the sender/recipient/data record in question.

Do you have any stats for how many ONIX data elements in the wild actually use these attributes in non-trivial ways? I've never seen any.

cheers
stuart

--
Stuart Yeates
Library Technology Services http://www.victoria.ac.nz/library/
Re: [CODE4LIB] Corrections to Worldcat/Hathi/Google
Hi,

On 08/27/2012 04:36 PM, Karen Coyle wrote:

I also assumed that Ed wasn't suggesting that we literally use github as our platform, but I do want to remind folks how far we are from having people friendly versioning software -- at least, none that I have seen has felt intuitive. The features of git are great, and people have built interfaces to it, but as Galen's question brings forth, the very *idea* of versioning doesn't exist in library data processing, even though having central-system based versions of MARC records (with a single time line) is at least conceptually simple.

What's interesting, however, is that at least a couple parts of the concept of distributed version control, viewed broadly, have been used in traditional library cataloging.

For example, RLIN had a concept of a cluster of MARC records for the same title, with each library having their own record in the cluster. I don't know if RLIN kept track of previous versions of a library's record in a cluster as it got edited, but it means that there was the concept of a spatial distribution of record versions, if not a temporal one. I've never used RLIN myself, but I'd be curious to know if it provided any tools to readily compare records in the same cluster and if there were any mechanisms (formal or informal) for a library to grab improvements from another library's record and apply it to their own.

As another example, the MARC cataloging source field has long been used, particularly in central utilities, to record institution-level attribution for changes to a MARC record. I think that's mostly been used by catalogers to help decide which version of a record to start from when copy cataloging, but I suppose it's possible that some catalogers were also looking at the list of modifying agencies (library A touched this record and is particularly good at subject analysis, so I'll grab their 650s).

Regards,

Galen
--
Galen Charlton
Director of Support and Implementation
Equinox Software, Inc. / The Open Source Experts
email: g...@esilibrary.com
direct: +1 770-709-5581
cell: +1 404-984-4366
skype: gmcharlt
web: http://www.esilibrary.com/
Supporting Koha and Evergreen: http://koha-community.org http://evergreen-ils.org
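[On the cataloging source field: a small sketch with pymarc that tallies who originates and who modifies records, reading the standard MARC 21 field 040 subfields $a (original cataloging agency) and $d (modifying agency). The input filename is a placeholder.]

    from collections import Counter
    from pymarc import MARCReader

    originators = Counter()
    modifiers = Counter()

    # "records.mrc" is a placeholder for a file of binary MARC records.
    with open("records.mrc", "rb") as fh:
        for record in MARCReader(fh):
            if record is None:        # skip records pymarc could not parse
                continue
            for field in record.get_fields("040"):
                for agency in field.get_subfields("a"):   # original cataloging agency
                    originators[agency] += 1
                for agency in field.get_subfields("d"):   # agencies that modified the record
                    modifiers[agency] += 1

    print("most frequent original cataloging agencies:", originators.most_common(5))
    print("most frequent modifying agencies:", modifiers.most_common(5))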
Re: [CODE4LIB] Corrections to Worldcat/Hathi/Google
An interesting reference is this:

High, W. M. (1990). Editing Changes to Monographic Cataloging Records in the OCLC Database: An Analysis of the Practice in Five University Libraries. PhD thesis, University of North Carolina at Chapel Hill.

It's in UMI (and Heavy Trussed).

Simon

On Aug 28, 2012, at 12:05 PM, Galen Charlton wrote: [...]
Re: [CODE4LIB] Corrections to Worldcat/Hathi/Google
On Aug 28, 2012, at 12:05 PM, Galen Charlton wrote: [...]

I seem to recall seeing a presentation a couple of years ago from someone in the intelligence community, where they'd keep all of their intelligence, but they stored RDF quads so they could track the source. They'd then assign a confidence level to each source, so they could get an overall level of confidence on their inferences.

... it'd get a bit messier if you have to do some sort of analysis of which sources are good for what type of information, but it might be a start. Unfortunately, I'm not having luck finding the reference again. It's possible that it was in the context of provenance, but I'm getting bogged down in too many articles about people storing provenance information using RDF triples (without actually tracking the provenance of the triple itself).

-Joe

ps. I just realized this discussion's been on CODE4LIB, and not NGC4LIB ... would it make sense to move it over there?
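[A minimal rdflib sketch of the quad idea Joe describes: one named graph per source, with a confidence score attached to the graph itself. The URIs, titles, property names and scores are all invented for illustration; only the pattern (statement + source + confidence) comes from the message.]

    from rdflib import ConjunctiveGraph, Literal, Namespace, URIRef

    EX = Namespace("http://example.org/")
    cg = ConjunctiveGraph()

    # One named graph per source of assertions.
    columbia = cg.get_context(URIRef("http://example.org/source/columbia"))
    columbia.add((EX.book1, EX.title, Literal("Kort åfhandling om angmachiner")))

    libris = cg.get_context(URIRef("http://example.org/source/libris"))
    libris.add((EX.book1, EX.title, Literal("Kort afhandling om ångmachiner")))

    # Confidence is asserted about the source graphs themselves, in a third graph.
    meta = cg.get_context(URIRef("http://example.org/meta"))
    meta.add((URIRef("http://example.org/source/columbia"), EX.confidence, Literal(0.6)))
    meta.add((URIRef("http://example.org/source/libris"), EX.confidence, Literal(0.9)))

    # Every assertion stays attached to the graph (and hence the source) it came from.
    for ctx in cg.contexts():
        for _, _, title in ctx.triples((EX.book1, EX.title, None)):
            print(title, "asserted by", ctx.identifier)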
Re: [CODE4LIB] Corrections to Worldcat/Hathi/Google
On Aug 28, 2012, at 2:17 PM, Joe Hourcle wrote:

I seem to recall seeing a presentation a couple of years ago from someone in the intelligence community, where they'd keep all of their intelligence, but they stored RDF quads so they could track the source. They'd then assign a confidence level to each source, so they could get an overall level of confidence on their inferences. […] It's possible that it was in the context of provenance, but I'm getting bogged down in too many articles about people storing provenance information using RDF triples (without actually tracking the provenance of the triple itself)

Provenance is of great importance in the IC and related sectors. A good overview of the nature of evidential reasoning is David A. Schum (1994; 2001). Evidential Foundations of Probabilistic Reasoning. Wiley & Sons, 1994; Northwestern University Press, 2001 [paperback edition].

There are usually papers on provenance and associated semantics at the GMU Semantic Technology for Intelligence, Defense, and Security conference (STIDS). This year's conference is 23-26 October 2012; see http://stids.c4i.gmu.edu/ for more details.

Simon
Re: [CODE4LIB] Corrections to Worldcat/Hathi/Google
The JISC-funded CLOCK project did some thinking around cataloguing processes and tracking changes to statements and/or records - e.g. http://clock.blogs.lincoln.ac.uk/2012/05/23/its-a-model-and-its-looking-good/

Not solutions of course, but hopefully of interest.

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 28 Aug 2012, at 19:43, Simon Spero sesunc...@gmail.com wrote: [...]
Re: [CODE4LIB] Corrections to Worldcat/Hathi/Google
A week ago, I wrote:

What I have done is just to search (worldcat.org and hathitrust.org) for some common Swedish words, and I don't have to do this for long before some very obvious (to a native speaker) spelling mistakes appear.

I've reported 38 errors to Hathitrust, and got feedback that they are now corrected. The originating libraries for these 38 records were:

18 University of California
6 University of Wisconsin
5 University of Michigan
3 Harvard University
2 Columbia University
2 New York Public Library
1 Cornell University
1 Princeton University

Maybe Google scanned a lot of Swedish books at the University of California, or their error rate is higher. I didn't compile any statistics on the correct records that I looked at. Four of the records had a title where an a-ring (å) was erroneously written as an a-dot. But these four records came from four different libraries. I didn't see any patterns.

--
Lars Aronsson (l...@aronsson.se)
Project Runeberg - free Nordic literature - http://runeberg.org/
Re: [CODE4LIB] Corrections to Worldcat/Hathi/Google
Actually, Ed, this would not only make for a good blog post (please, so it doesn't get lost in email space), but I would love to see a discussion of what kind of revision control would work:

1) for libraries (git is gawdawful nerdy)
2) for linked data

kc

p.s. the Ramsay book is now showing on Open Library, and the subtitle is correct... perhaps because the record is from the LC MARC service :-)
http://openlibrary.org/works/OL16528530W/Reading_machines

On 8/26/12 6:32 PM, Ed Summers wrote: [...]

--
Karen Coyle
kco...@kcoyle.net http://kcoyle.net
ph: 1-510-540-7596
m: 1-510-435-8234
skype: kcoylenet
Re: [CODE4LIB] Corrections to Worldcat/Hathi/Google
Hi,

On 08/27/2012 08:49 AM, Karen Coyle wrote:

Actually, Ed, this would not only make for a good blog post (please, so it doesn't get lost in email space), but I would love to see a discussion of what kind of revision control would work: 1) for libraries (git is gawdawful nerdy) 2) for linked data

Speaking of revision control, does anyone have or know of a sizable dataset of bibliographic metadata that includes change history? For example, I know that some ILSs can retain previous versions of bibliographic records as they get edited. Such a dataset would be useful in figuring out good ways to calculate differences between versions of a record, and perhaps more to the point, express those differences in a way that's more useful to maintainers of the metadata.

Regards,

Galen
--
Galen Charlton
Director of Support and Implementation
Equinox Software, Inc. / The Open Source Experts
email: g...@esilibrary.com
direct: +1 770-709-5581
cell: +1 404-984-4366
skype: gmcharlt
web: http://www.esilibrary.com/
Supporting Koha and Evergreen: http://koha-community.org http://evergreen-ils.org
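[On calculating differences between versions of a record: a rough, library-agnostic sketch that treats each version as a list of (tag, field content) pairs. How those pairs are extracted from MARC or marc-in-json is left out, and the example values are only illustrative (they echo the Lilliehöök record discussed later in the thread); a real comparison would need to be smarter about subfield order and field matching.]

    from collections import Counter

    def field_diff(old, new):
        """Return (added, removed) field occurrences between two record versions.

        Each version is a list of (tag, content) pairs; using Counters keeps
        repeated fields (e.g. multiple 650s) distinct by occurrence count.
        """
        old_c, new_c = Counter(old), Counter(new)
        added = list((new_c - old_c).elements())
        removed = list((old_c - new_c).elements())
        return added, removed

    v1 = [("100", "1  $a Lilliehoehoe, Karl Bertil"),
          ("245", "10 $a Kort afhandling om angmachiner")]
    v2 = [("100", "1  $a Lilliehöök, Carl Bertil, $d 1809-1890"),
          ("245", "10 $a Kort afhandling om ångmachiner")]

    added, removed = field_diff(v1, v2)
    print("added:", added)
    print("removed:", removed)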
Re: [CODE4LIB] Corrections to Worldcat/Hathi/Google
On Mon, Aug 27, 2012 at 10:36 AM, Ross Singer rossfsin...@gmail.com wrote:

For MARC data, while I don't know of any examples of this, it seems like something like CouchDB [2] and marc-in-json [3] would be a fantastic way to make something like this available.

Great idea...and there are 4 years of transactions for LC record create/update/deletes up at Internet Archive: http://archive.org/details/marc_loc_updates

//Ed
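[A sketch of the CouchDB idea using its plain HTTP document API and the marc-in-json layout. The database name, document id (based on the OCLC number discussed elsewhere in the thread), local CouchDB URL, and the "corrected" title are all assumptions; note also that CouchDB's older revisions are not kept permanently by default, so it tracks revisions rather than archiving them.]

    import requests

    COUCH = "http://localhost:5984"      # assumed local CouchDB
    DB = "bib"                           # hypothetical database name

    # marc-in-json: a leader plus an ordered list of fields.
    record = {
        "leader": "00000cam a2200000 a 4500",
        "fields": [
            {"001": "ocn681652093"},
            {"245": {"ind1": "1", "ind2": "0",
                     "subfields": [{"a": "Kort afhandling om angmachiner"}]}},
        ],
    }

    requests.put(f"{COUCH}/{DB}")                      # create the database if it doesn't exist
    r = requests.put(f"{COUCH}/{DB}/ocn681652093", json=record)
    print(r.json())                                    # includes the document's first _rev

    # A correction becomes a new revision: fetch, change, and PUT back with the _rev intact.
    doc = requests.get(f"{COUCH}/{DB}/ocn681652093").json()
    doc["fields"][1]["245"]["subfields"][0]["a"] = "Kort afhandling om ångmachiner"  # illustrative edit
    requests.put(f"{COUCH}/{DB}/ocn681652093", json=doc)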
Re: [CODE4LIB] Corrections to Worldcat/Hathi/Google
On Mon, Aug 27, 2012 at 8:49 AM, Karen Coyle li...@kcoyle.net wrote:

Actually, Ed, this would not only make for a good blog post (please, so it doesn't get lost in email space), but I would love to see a discussion of what kind of revision control would work: 1) for libraries (git is gawdawful nerdy) 2) for linked data

I think you know as well as I do that linked data is gawdawful nerdy too :-)

p.s. the Ramsay book is now showing on Open Library, and the subtitle is correct... perhaps because the record is from the LC MARC service :-) http://openlibrary.org/works/OL16528530W/Reading_machines

perhaps being the operative word. Being able to concretely answer these provenance questions is important. Actually, I'm not sure it was ever incorrect at OpenLibrary. At least I don't think I used it as an example in my Genealogy of a Typo post.

//Ed
Re: [CODE4LIB] Corrections to Worldcat/Hathi/Google
On Mon, Aug 27, 2012 at 1:33 PM, Corey A Harper corey.har...@nyu.edu wrote:

I think there's a useful distinction here. Ed can correct me if I'm wrong, but I suspect he was not actually suggesting that Git itself be the user-interface to a github-for-data type service, but rather that such a service can be built *on top* of an infrastructure component like GitHub.

Yes, I wasn't saying that we could just plonk our data into Github, and pat ourselves on the back for a good day's work :-) I guess I was stating the obvious: technologies like Git have made once-hard problems like decentralized version control much, much easier...and there might be some giants' shoulders to stand on.

//Ed
Re: [CODE4LIB] Corrections to Worldcat/Hathi/Google
Ed, Corey -

I also assumed that Ed wasn't suggesting that we literally use github as our platform, but I do want to remind folks how far we are from having people-friendly versioning software -- at least, none that I have seen has felt intuitive. The features of git are great, and people have built interfaces to it, but as Galen's question brings forth, the very *idea* of versioning doesn't exist in library data processing, even though having central-system based versions of MARC records (with a single time line) is at least conceptually simple.

Therefore it seems to me that first we have to define what a version would be, both in terms of data but also in terms of the mind set and work flow of the cataloging process. How will people *understand* versions in the context of their work? What do they need in order to evaluate different versions?

And that leads to my second question: what is a version in LD space? Triples are just triples - you can add them or delete them, but I don't know of a way that you can version them, since each has an independent existence in triple space. So, are we talking about named graphs?

I think this should be a high-priority activity around the new bibliographic framework planning because, as we have seen with MARC, the idea of versioning needs to be part of the very design or it won't happen.

kc

On 8/27/12 11:20 AM, Ed Summers wrote:

On Mon, Aug 27, 2012 at 1:33 PM, Corey A Harper corey.har...@nyu.edu wrote: I think there's a useful distinction here. Ed can correct me if I'm wrong, but I suspect he was not actually suggesting that Git itself be the user-interface to a github-for-data type service, but rather that such a service can be built *on top* of an infrastructure component like GitHub.

Yes, I wasn't saying that we could just plonk our data into Github, and pat ourselves on the back for a good day's work :-) I guess I was stating the obvious: technologies like Git have made once-hard problems like decentralized version control much, much easier...and there might be some giants' shoulders to stand on.

//Ed

--
Karen Coyle
kco...@kcoyle.net http://kcoyle.net
ph: 1-510-540-7596
m: 1-510-435-8234
skype: kcoylenet
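[One way to read "version" for linked data, sketched with the W3C PROV vocabulary mentioned elsewhere in the thread: give each saved state of a description its own URI (these could equally be named-graph URIs) and relate them with prov:wasRevisionOf. The record and agent URIs and the timestamp are invented; the PROV terms themselves are real PROV-O properties.]

    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import XSD

    PROV = Namespace("http://www.w3.org/ns/prov#")

    g = Graph()
    v1 = URIRef("http://example.org/record/681652093/version/1")   # invented URIs
    v2 = URIRef("http://example.org/record/681652093/version/2")

    # Version 2 is a revision of version 1, made by some agent at some time.
    g.add((v2, PROV.wasRevisionOf, v1))
    g.add((v2, PROV.wasAttributedTo, URIRef("http://example.org/org/some-library")))
    g.add((v2, PROV.generatedAtTime,
           Literal("2012-08-28T12:00:00Z", datatype=XSD.dateTime)))

    print(g.serialize(format="turtle"))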
Re: [CODE4LIB] Corrections to Worldcat/Hathi/Google
These have to be named graphs, or at least collections of triples which can be processed through workflows as a single unit.

In terms of LD, the version needs to be defined in terms of:

(a) synchronisation with the non-bibliographic real world (i.e. Dataset Z version X was released at time Y)
(b) correction/augmentation of other datasets (i.e. Dataset F version G contains triples augmenting Dataset H versions A, B, C and D)
(c) mapping between datasets (i.e. Dataset I contains triples mapping between Dataset J version K and Dataset L version M (and vice versa))

Note that a 'Dataset' here could be a bibliographic dataset (records of works, etc), a classification dataset (a version of the Dewey Decimal Scheme, a version of the Māori Subject Headings, a version of the Dublin Core Scheme, etc), a dataset of real-world entities to do authority control against (a dbpedia dump, an organisational structure in an institution, etc), or some arbitrary mapping between some arbitrary combination of these.

Most of these are going to be managed and generated using current systems, with processes that involve periodic dumps (or drops) of data (the dbpedia drops of wikipedia data are a good model here). git makes little sense for this kind of data. github is most likely to be useful for smaller niche collaborative collections (probably no more than a million triples) mapping between the larger collections, and for scripts for integrating the collections into a sane whole.

cheers
stuart

On 28/08/12 08:36, Karen Coyle wrote: [...]

--
Stuart Yeates
Library Technology Services http://www.victoria.ac.nz/library/
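[A rough data model for the three relations in (a)-(c) above, as plain Python. All of the names, versions and dates are placeholders; the only point is that release, augmentation and mapping are statements about dataset-version pairs rather than about individual triples.]

    from dataclasses import dataclass, field
    from datetime import date

    @dataclass(frozen=True)
    class DatasetVersion:
        dataset: str              # e.g. a bibliographic set, a classification scheme, a dbpedia dump
        version: str
        released: date            # (a) synchronisation with the non-bibliographic real world

    @dataclass
    class Augmentation:           # (b) one dataset-version adds triples about others
        source: DatasetVersion
        augments: list[DatasetVersion] = field(default_factory=list)

    @dataclass
    class Mapping:                # (c) one dataset-version maps between two others
        source: DatasetVersion
        from_ds: DatasetVersion
        to_ds: DatasetVersion

    msh = DatasetVersion("Māori Subject Headings", "2012-08", date(2012, 8, 1))
    dbp = DatasetVersion("dbpedia", "3.8", date(2012, 8, 1))
    bib = DatasetVersion("some bibliographic dataset", "2012-07", date(2012, 7, 1))

    corrections = Augmentation(
        source=DatasetVersion("Swedish title corrections", "1", date(2012, 8, 28)),
        augments=[bib])
    links = Mapping(
        source=DatasetVersion("msh-dbpedia links", "1", date(2012, 8, 20)),
        from_ds=msh, to_ds=dbp)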
Re: [CODE4LIB] Corrections to Worldcat/Hathi/Google
I agree entirely that these would need to be a collection of triples with its own set of attributes/metadata describing the collection. Basically a record with triples as the data elements.

But I see a bigger problem with the direction this thread has taken so far. The use of versions has been conditioned by the use of something like Github as the underlying versioning platform. But Github (and all software versioning systems) are based on temporal versions, where each version is, in some way, an evolved unit of the same underlying thing - a program or whatever. So the versions are really temporally, linearly related to each other as well as related in terms of added or improved or fixed functionality. Yes, the codebase (the underlying thing) can fork or split in a number of ways, but they are all versions of the same thing, progressing through time.

In the existing bibliographic case we have many records which purport to be about the same thing, but contain different data values for the same elements. And these are the versions we have to deal with, and eventually reconcile. They are not descendants of the same original; they are independent entities, whether they are recorded as singular MARC records or collections of LD triples.

I would suggest that at all levels, from the triplet or key/value field pair to the triple collection or fielded record, what we have are alternates, not versions. Thus the alternates exist at the triple level, and also at the collection level (the normal bibliographic unit record we are familiar with). And those alternates could then be allowed versions, which are the attempts to, in some way, improve the quality (your definition of what this is is as good as mine) over time. And with a closed group of alternates (of a single bib unit) these versioned alternates would (in a perfect world) iterate to a common descendant which had the same agreed, authorized set of triples. Of course this would only be the authorized form for those organizations which recognized the arrangement.

But allowing alternates and their versions does allow for a method of tracking the original problem of three organizations each copying each other endlessly to correct their data. In this model it would be an alternate/version spiral of states, rather than a flat circle of each changing version with no history, and no idea of which was master. (Try re-reading Stuart's (a), (b), (c) below with the idea of alternates as well as versions of the Datasets. I think it would become clearer as to what was happening.) There is still no master, but at least the state changes can be properly tracked and checked by software (and/or humans) so the endless cycle can be addressed - probably by an outside (human) decision about the correct form of a triple to use for this bib entity.

Or this may all prove to be an unnecessary complication.

Peter

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of stuart yeates
Sent: Monday, August 27, 2012 3:42 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Corrections to Worldcat/Hathi/Google

These have to be named graphs, or at least collections of triples which can be processed through workflows as a single unit.

In terms of LD, the version needs to be defined in terms of:

(a) synchronisation with the non-bibliographic real world (i.e. Dataset Z version X was released at time Y)
(b) correction/augmentation of other datasets (i.e. Dataset F version G contains triples augmenting Dataset H versions A, B, C and D)
(c) mapping between datasets (i.e. Dataset I contains triples mapping between Dataset J version K and Dataset L version M (and vice versa))

[...]
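[A small sketch of the alternates-plus-versions model Peter proposes, in plain Python: one bibliographic entity, one alternate per organisation, and a temporal version history inside each alternate. Identifiers, organisation names and field values are illustrative only.]

    from dataclasses import dataclass, field

    @dataclass
    class Alternate:
        """One organisation's independent description of a bibliographic entity."""
        organisation: str
        versions: list[dict] = field(default_factory=list)   # temporal versions, oldest first

        def revise(self, data: dict) -> None:
            self.versions.append(data)

        @property
        def current(self) -> dict:
            return self.versions[-1]

    @dataclass
    class BibEntity:
        """A bibliographic unit with alternates keyed by organisation."""
        identifier: str
        alternates: dict[str, Alternate] = field(default_factory=dict)

        def alternate_for(self, org: str) -> Alternate:
            return self.alternates.setdefault(org, Alternate(org))

    book = BibEntity("oclc:681652093")
    book.alternate_for("Columbia").revise({"245": "Kort åfhandling om angmachiner"})
    book.alternate_for("Libris").revise({"245": "Kort afhandling om ångmachiner"})
    # A later correction is a new version of Columbia's alternate, not a new alternate:
    book.alternate_for("Columbia").revise({"245": "Kort afhandling om ångmachiner"})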
Re: [CODE4LIB] Corrections to Worldcat/Hathi/Google
On 28/08/12 12:07, Peter Noerr wrote:

They are not descendants of the same original; they are independent entities, whether they are recorded as singular MARC records or collections of LD triples.

That depends on which end of the stick one grasps. Conceptually these are descendants of the abstract work in question; textually these are independent (or likely to be). In practice it doesn't matter: since git/svn/etc are all textual in nature, they're not good at handling these.

The reconciliation is likely to be a good candidate for temporal versioning. It's interesting to ponder which of the many datasets is going to prove to be the hub for reconciliation. My money is on librarything, because their merge-ist approach to cataloguing means they have lots and lots of different versions of the work information to match against. See for example: https://www.librarything.com/work/683408/editions/11795335

Wikipedia / dbpedia have redirects which tend in the same direction, but only for titles and not ISBNs.

cheers
stuart

--
Stuart Yeates
Library Technology Services http://www.victoria.ac.nz/library/
Re: [CODE4LIB] Corrections to Worldcat/Hathi/Google
Thanks for sharing this bit of detective work. I noticed something similar fairly recently myself [1], but didn't discover as plausible a scenario for what had happened as you did. I imagine others have noticed this network effect before as well.

On Tue, Aug 21, 2012 at 11:42 AM, Lars Aronsson l...@aronsson.se wrote:

And sure enough, there it is, http://clio.cul.columbia.edu:7018/vwebv/holdingsInfo?bibId=1439352 But will my error report to Worldcat find its way back to CLIO? Or if I report the error to Columbia University, will the correction propagate to Google, Hathi and Worldcat? (Columbia asks me for a student ID when I want to give feedback, so that removes this option for me.)

I realize this probably will sound flippant (or overly grandiose), but innovating solutions to this problem, where there isn't necessarily one metadata master that everyone is slaved to, seems to be one of the more important and interesting problems that our sector faces. When Columbia University can become the source of a bibliographic record for Google Books, HathiTrust and OpenLibrary, etc., how does this change the hub-and-spoke workflows (with OCLC as the hub) that we are more familiar with?

I think this topic is what's at the heart of the discussions about a github-for-data [2,3], since decentralized version control systems [4] allow for the evolution of more organic, push/pull, multi-master workflows...and platforms like Github make them socially feasible, easy and fun. I also think Linked Library Data, where bibliographic descriptions are REST-enabled Web resources identified with URLs, and patterns such as webhooks [5] that make it easy to trigger update events, could be part of an answer. Feed technologies like Atom, RSS and the work being done on ResourceSync also seem like important technologies for us to use to allow people to poll for changes [6]. And being able to say where you have obtained data from, possibly using something like the W3C Provenance vocabulary [7], also seems like an important part of the puzzle. I'm sure there are other (and perhaps better) creative analogies or tools that could help solve this problem.

I think you're probably right that we are starting to see the errors more now that more library data is becoming part of the visible Web via projects like GoogleBooks, HathiTrust, OpenLibrary and other enterprising libraries that design their catalogs to be crawlable and indexable by search engines. But I think it's more fun to think about (and hack on) what grassroots things we could be doing to help these new bibliographic data workflows to grow and flourish than to get piled under by the errors, and a sense of futility...

Or it might make for a good article or dissertation topic :-)

//Ed

[1] http://inkdroid.org/journal/2011/12/25/genealogy-of-a-typo/
[2] http://www.informationdiet.com/blog/read/we-need-a-github-for-data
[3] http://sunlightlabs.com/blog/2010/we-dont-need-a-github-for-data/
[4] http://en.wikipedia.org/wiki/Distributed_revision_control
[5] https://help.github.com/articles/post-receive-hooks
[6] http://www.niso.org/workrooms/resourcesync/
[7] http://www.w3.org/TR/prov-primer/
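[On the webhook pattern [5]: a minimal Flask sketch of an endpoint that another catalogue could POST to when one of its records changes, in the same spirit as GitHub's post-receive hooks. The URL path and the payload fields are invented for illustration, not any existing service's API.]

    from flask import Flask, request

    app = Flask(__name__)

    @app.route("/bib-hook", methods=["POST"])
    def bib_hook():
        event = request.get_json(force=True)
        # Hypothetical payload: which record changed, where, and when.
        record_uri = event.get("record")
        source = event.get("source")
        changed_at = event.get("changed_at")
        print(f"{source} says {record_uri} changed at {changed_at}; re-fetch and re-index it")
        return "", 204

    if __name__ == "__main__":
        app.run(port=5000)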
Re: [CODE4LIB] Corrections to Worldcat/Hathi/Google
On August 22, ya'aQov wrote:

Lars, so the wrong spelling of that Swedish author is based on your browsing it, not on an automated procedure, or reference to an online thesaurus. Given your Swedish resources, is there any quality control mechanism you can suggest?

This author's name was just one that I stumbled upon. My problem is that I stumble on bad spelling of titles and authors' names far too often, like 5% of all cases. I guess this is because they are now exposed, when Google and Hathi Trust pull all data together. Earlier these errors have been isolated in various library catalogs that had no users who speak the language (in this case: Swedish). In each library catalog these Swedish entries make up a tiny minority. But now with the aggregation in Hathi Trust and Worldcat, we can start to see patterns and fix the errors.

For Swedish books, the national catalog at libris.kb.se is most often correct. It's also available as open data under CC0 (Creative Commons zero), http://www.kb.se/libris/teknisk-information/Oppen-data/Open-Data/ and the Libris authority file is part of VIAF. You'd need one good reference per language.

But I think you can do a good analysis even without knowing much about a language. If you find that records for books in Czech often contain č (c-hacek), but almost never ĉ (c-circumflex), you can look at the records that contain the unusual letter and see if they are errors or perhaps intentional uses of Esperanto words.

--
Lars Aronsson (l...@aronsson.se)
Project Runeberg - free Nordic literature - http://runeberg.org/
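[A sketch of the unusual-letter check Lars describes, in plain Python: count letter frequencies across all titles coded for one language and flag titles containing letters that are rare overall. The threshold and the toy title list are placeholders; in practice the input would be every title coded as Czech, Swedish, etc.]

    from collections import Counter

    def flag_unusual(titles, rare_below=0.001):
        """Flag titles containing letters that are rare across the whole set."""
        counts = Counter(ch for t in titles for ch in t.lower() if ch.isalpha())
        total = sum(counts.values())
        rare = {ch for ch, n in counts.items() if n / total < rare_below}
        return [(t, sorted(set(t.lower()) & rare))
                for t in titles if set(t.lower()) & rare]

    # Toy input: mostly c-hacek titles, one c-circumflex outlier.
    titles = ["Čapek a jeho doba", "Povídky", "Česká čítanka"] * 40 + ["Ĉeská kniha"]
    for title, letters in flag_unusual(titles):
        print(title, "contains unusual letters:", letters)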
Re: [CODE4LIB] Corrections to Worldcat/Hathi/Google
Lars,

With regards to HathiTrust and Google: HathiTrust receives bib records from partner libraries. Google also receives bib records from partner libraries. HathiTrust has never received bib records from Google - sometimes we provide corrected records to them, but the flow goes in one direction only.

For HathiTrust-related issues, you can report cataloging errors to us at feedb...@issues.hathitrust.org, and we can work with the partner library to get them fixed.

Thanks,

Angelina Zaytsev
Project Librarian
HathiTrust
http://www.hathitrust.org/

On Mon, Aug 20, 2012 at 9:42 PM, ya'aQov yaaq...@gmail.com wrote:

Lars, Thank you. Which online thesaurus are you matching these names with (when you declare them mistakes)? If an online procedure, please be specific; how can VIAF benefit, if at all, from Project Runeberg (http://runeberg.org/)?

Ya'aqov Ziso

--
Angelina Zaytsev
In your thirst for knowledge, be sure not to drown in all the information. ~Anthony J. D'Angelo
There are as many beautiful ideals as there are different shapes of noses or different characters. ~Stendhal
Education can give you a skill, but a liberal education can give you dignity. ~Ellen Key
Re: [CODE4LIB] Corrections to Worldcat/Hathi/Google
On 2012-08-20 22:38, Roy Tennant wrote:

Any errors in WorldCat can be reported to bibcha...@oclc.org. We take record quality seriously, but as you can imagine when you take in records from thousands of sources around the world, this is a constant struggle. We have our own quality control efforts, but individuals reporting problems are also an important strategy.

The web page for a bibliographic record, when I look at it, in the top-left menu, has a report feedback option, but the form that pops up doesn't indicate whether the referring page URL will be included in the report. It should be, of course.

What I have done is just to search (worldcat.org and hathitrust.org) for some common Swedish words, and I don't have to do this for long before some very obvious (to a native speaker) spelling mistakes appear. One example is this, http://www.worldcat.org/oclc/681652093

This record has a link to Hathi Trust, where the display of the title page immediately shows that the A-ring is on the wrong A. If both Hathi Trust and Google receive catalog records from partner libraries, the error must originate from Columbia University, where this book was scanned by Google in August 2009, http://books.google.se/books?id=-EBGYAAJ

And sure enough, there it is, http://clio.cul.columbia.edu:7018/vwebv/holdingsInfo?bibId=1439352

But will my error report to Worldcat find its way back to CLIO? Or if I report the error to Columbia University, will the correction propagate to Google, Hathi and Worldcat? (Columbia asks me for a student ID when I want to give feedback, so that removes this option for me.)

Searching Worldcat for this title without any diacritics yields two results: this book from 1858 and another one (with the A-ring on the right A) from 1855, http://www.worldcat.org/search?q=kort+afhandling+om+angmachiner

The 1855 book has the author's name wrong, Karl Bertil Lilliehoehoe instead of Carl Bertil Lilliehöök, who lived 1809-1890, http://www.worldcat.org/oclc/249342164 I understand that oe is a poor man's transcription of the umlaut ö, but where did the -k go? That Worldcat record doesn't indicate its source. It's apparently not Hathi Trust.

I don't know how you work, but I think it would be quite easy to extract all titles for a given language, and run them through a spell checker, to find an indication of where there could be mistakes. For example, afhandling means treatise or dissertation, and brings up 32,000 hits in a Worldcat search (the modern spelling avhandling yields 140,000), but åfhandling is not in any dictionary. (It's obvious that Google Books never tried this.)

--
Lars Aronsson (l...@aronsson.se)
Project Runeberg - free Nordic literature - http://runeberg.org/
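[And the spell-checker idea from the last paragraph, sketched with a toy reference vocabulary: titles whose words are missing from the word list get flagged for review. The tiny word set below is a placeholder; a real run would load a full Swedish word list, e.g. built from the Libris open data or a hunspell dictionary.]

    import re

    # Toy reference vocabulary; a real run would load a full word list for the language.
    known = {"kort", "afhandling", "avhandling", "om", "ångmachiner"}

    def suspicious_words(title):
        """Return words in a title that are not in the reference vocabulary."""
        words = re.findall(r"[^\W\d_]+", title.lower())
        return [w for w in words if w not in known]

    for title in ["Kort afhandling om ångmachiner", "Kort åfhandling om angmachiner"]:
        odd = suspicious_words(title)
        if odd:
            print(title, "->", odd)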
Re: [CODE4LIB] Corrections to Worldcat/Hathi/Google
Lars, so the wrong spelling of that Swedish author is based on your browsing it, not on an automated procedure, or reference to an online thesaurus. Given your Swedish resources, is there any quality control mechanism you can suggest?

On Tue, Aug 21, 2012 at 10:42 AM, Lars Aronsson l...@aronsson.se wrote: [...]

--
ya'aQov ziso | yaaq...@gmail.com | 856 217 3456
Re: [CODE4LIB] Corrections to Worldcat/Hathi/Google
Any errors in WorldCat can be reported to bibcha...@oclc.org. We take record quality seriously, but as you can imagine, when you take in records from thousands of sources around the world, this is a constant struggle. We have our own quality control efforts, but individuals reporting problems are also an important strategy.

Roy

On Mon, Aug 20, 2012 at 1:01 PM, Lars Aronsson l...@aronsson.se wrote:

This big mess of broken bibliographic records that started out as Google Book Search, and then spread to Hathi Trust, is now also spreading to OCLC Worldcat. I find the same kind of misspellings and misconceptions in all three places. So far, the VIAF author names seem to be more correct, but maybe they will soon import all the misspellings from Worldcat as well?

What is the way to stop this downward trend? Where can one post corrections, so that they go back to the source, and don't reappear next time? How can I know who copies from whom? Does Hathi Trust copy everything from Google? And does Worldcat discover new (misspelled) authors and titles in Hathi Trust, and import them rather than reporting the errors?

--
Lars Aronsson (l...@aronsson.se)
Project Runeberg - free Nordic literature - http://runeberg.org/
Re: [CODE4LIB] Corrections to Worldcat/Hathi/Google
Lars,

Thank you. Which online thesaurus are you matching these names with (when you declare them mistakes)? If an online procedure, please be specific; how can VIAF benefit, if at all, from Project Runeberg (http://runeberg.org/)?

Ya'aqov Ziso