Re: [CODE4LIB] Corrections to Worldcat/Hathi/Google

2012-08-29 Thread Michael Hopwood
Thanks for this pointer Owen.

It's a nice illustration of the fact that what users actually want (well, I 
know I did back when I actually worked in large information services 
departments!) is something more like an intranet, where the content I find is 
weighted towards me, the audience: e.g. the intranet already knows I'm a 2nd 
year medical student and that one of my registered preferred languages is 
Mandarin, or it knows that I'm a rare books cataloguer and I want to see what 
nine out of ten other cataloguers recorded for this obscure and confusing 
title.

However, this stuff is quite intense for linked data, isn't it? I understand 
that it would involve lots of quads, named graphs or whatever...

In a parallel world, I'm currently writing up recommendations for aggregating 
ONIX for Books records. ONIX data can come from multiple sources that 
potentially assert different things about a given book (i.e. something with 
an ISBN, to keep it simple).

This is why *every single ONIX data element* can have the optional attributes

@datestamp
@sourcename
@sourcetype [e.g. publisher, retailer, data aggregator... library?]

...and the ONIX message as a whole is set up with header and product record 
 segments that each include some info about the sender/recipient/data record in 
question.
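
To make that concrete, here is a tiny sketch (illustrative only: the element 
name, values and source-type code are examples, not a validated ONIX 3.0 
fragment) of what those attributes look like on a single data element:

# Illustrative only: one ONIX-style element carrying the optional source
# attributes; element name and code value are examples, not real message data.
import xml.etree.ElementTree as ET

title = ET.Element(
    "TitleText",
    attrib={
        "datestamp": "20120829",               # when this value was asserted
        "sourcename": "Example Publisher Ltd", # hypothetical sender
        "sourcetype": "01",                    # sender's role, code value shown is illustrative
    },
)
title.text = "Kort afhandling om ångmachiner"

print(ET.tostring(title, encoding="unicode"))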

How people in the book supply chain are implementing these is a distinct issue, 
but could these capabilities have some relevance to what you're discussing?

Do you have any other pointers to intranet-like catalogues?

In the museum space, there is of course this: http://www.researchspace.org/

Cheers,

Michael 

-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Owen 
Stephens
Sent: 28 August 2012 21:37
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Corrections to Worldcat/Hathi/Google

The JISC funded CLOCK project did some thinking around cataloguing processes 
and tracking changes to statements and/or records - e.g. 
http://clock.blogs.lincoln.ac.uk/2012/05/23/its-a-model-and-its-looking-good/

Not solutions of course, but hopefully of interest

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 28 Aug 2012, at 19:43, Simon Spero sesunc...@gmail.com wrote:

 On Aug 28, 2012, at 2:17 PM, Joe Hourcle wrote:
 
 I seem to recall seeing a presentation a couple of years ago from someone in 
 the intelligence community, where they'd keep all of their intelligence, but 
 they stored RDF quads so they could track the source.
 
 They'd then assign a confidence level to each source, so they could get an 
 overall level of confidence on their inferences.
 [...]
 It's possible that it was in the context of provenance, but I'm getting 
 bogged down in too many articles about people storing provenance information 
 using RDF-triples (without actually tracking the provenance of the triple 
 itself)
 
 Provenance is of great importance in the IC and related sectors.   
 
 A good overview of the nature of evidential reasoning is David A. Schum 
 (1994; 2001), Evidential Foundations of Probabilistic Reasoning. Wiley & Sons, 
 1994; Northwestern University Press, 2001 [paperback edition].
 
 There are usually papers on provenance and associated semantics at the GMU 
 Semantic Technology for Intelligence, Defense, and Security (STIDS).  This 
 year's conference is 23-26 October 2012; see http://stids.c4i.gmu.edu/ for 
 more details. 
 
 Simon


Re: [CODE4LIB] Corrections to Worldcat/Hathi/Google

2012-08-29 Thread stuart yeates

On 29/08/12 19:46, Michael Hopwood wrote:

Thanks for this pointer Owen.

It's a nice illustration of the fact that what users actually want (well, I know I did back when I actually 
worked in large information services departments!) is something more like an intranet where the content I 
find is weighted towards me, the audience e.g. the intranet knows I'm a 2nd year medical student 
and one of my registered preferred languages is Mandarin already, or it knows that I'm a rare 
books cataloguer and I want to see what nine out of ten other cataloguers recorded for this 
obscure and confusing title.


Yet another re-invention of content negotiation, AKA  RFC 2295.

These attempts fail because 99% of data publishers care, in the first 
instance, only about the single use in front of them, and because, in the 
second instance, the precedent has already been set.


The exception, of course, is legally mandated multi-lingual 
bureaucracies (Canadian government for en/fr; EU organs for various 
languages, etc.) and on-the-wire formatting (for which it works very well).



However, this stuff is quite intense for linked data, isn't it? I understand 
that it would involve lots of quads, named graphs or whatever...

In a parallel world, I'm currently writing up recommendations for aggregating ONIX for 
Books records. ONIX data can come from multiple sources who potentially assert different 
things about a given book (i.e. something with an ISBN to keep it simple).

This is why *every single ONIX data element* can have option attributes of

@datestamp
@sourcename
@sourcetype [e.g. publisher, retailer, data aggregator... library?]

...and the ONIX message as a whole is set up with header and product record 
 segments that each include some info about the sender/recipient/data record in question.


Do you have any stats for how many ONIX records in the wild 
actually carry these attributes in non-trivial ways? I've never seen any.


cheers
stuart
--
Stuart Yeates
Library Technology Services http://www.victoria.ac.nz/library/


Re: [CODE4LIB] Corrections to Worldcat/Hathi/Google

2012-08-28 Thread Galen Charlton

Hi,

On 08/27/2012 04:36 PM, Karen Coyle wrote:

I also assumed that Ed wasn't suggesting that we literally use github as
our platform, but I do want to remind folks how far we are from having
people friendly versioning software -- at least, none that I have seen
has felt intuitive. The features of git are great, and people have
built interfaces to it, but as Galen's question brings forth, the very
*idea* of versioning doesn't exist in library data processing, even
though having central-system based versions of MARC records (with a
single time line) is at least conceptually simple.


What's interesting, however, is that at least a couple parts of the 
concept of distributed version control, viewed broadly, have been used 
in traditional library cataloging.


For example, RLIN had a concept of a cluster of MARC records for the 
same title, with each library having their own record in the cluster.  I 
don't know if RLIN kept track of previous versions of a library's record 
in a cluster as it got edited, but it means that there was the concept 
of a spatial distribution of record versions if not a temporal one. 
I've never used RLIN myself, but I'd be curious to know if it provided 
any tools to readily compare records in the same cluster and if there 
were any mechanisms (formal or informal) for a library to grab 
improvements from another library's record and apply it to their own.


As another example, the MARC cataloging source field has long been used, 
particularly in central utilities, to record institution-level 
attribution for changes to a MARC record.  I think that's mostly been 
used by catalogers to help decide which version of a record to start 
from when copy cataloging, but I suppose it's possible that some 
catalogers were also looking at the list of modifying agencies (library 
A touched this record and is particularly good at subject analysis, so 
I'll grab their 650s).
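
For the curious, a quick pymarc sketch (the file name is hypothetical) that 
pulls the cataloging source field (040) out of a batch of records to see who 
originally catalogued and who later modified each one:

# Sketch: list the 040 (cataloging source) agencies per record --
# $a/$c = original/transcribing agency, $d = modifying agencies (repeatable).
from pymarc import MARCReader

with open("records.mrc", "rb") as fh:
    for record in MARCReader(fh):
        if record is None:          # skip records pymarc couldn't parse
            continue
        for f040 in record.get_fields("040"):
            original = f040.get_subfields("a", "c")
            modifiers = f040.get_subfields("d")
            print("original/transcribing:", original, "| modified by:", modifiers)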


Regards,

Galen
--
Galen Charlton
Director of Support and Implementation
Equinox Software, Inc. / The Open Source Experts
email:  g...@esilibrary.com
direct: +1 770-709-5581
cell:   +1 404-984-4366
skype:  gmcharlt
web:http://www.esilibrary.com/
Supporting Koha and Evergreen: http://koha-community.org  
http://evergreen-ils.org


Re: [CODE4LIB] Corrections to Worldcat/Hathi/Google

2012-08-28 Thread Simon Spero
An interesting reference is this: 

 High, W. M. (1990). Editing Changes to Monographic Cataloging Records in the 
OCLC Database: An Analysis of the Practice in Five University Libraries. PhD 
thesis, University of North Carolina at Chapel Hill.

It's in UMI (and Heavy Trussed). 

Simon 

On Aug 28, 2012, at 12:05 PM, Galen Charlton wrote:

 Hi,
 
 On 08/27/2012 04:36 PM, Karen Coyle wrote:
 I also assumed that Ed wasn't suggesting that we literally use github as
 our platform, but I do want to remind folks how far we are from having
 people friendly versioning software -- at least, none that I have seen
 has felt intuitive. The features of git are great, and people have
 built interfaces to it, but as Galen's question brings forth, the very
 *idea* of versioning doesn't exist in library data processing, even
 though having central-system based versions of MARC records (with a
 single time line) is at least conceptually simple.
 
 What's interesting, however, is that at least a couple parts of the concept 
 of distributed version control, viewed broadly, have been used in traditional 
 library cataloging.
 
 For example, RLIN had a concept of a cluster of MARC records for the same 
 title, with each library having their own record in the cluster.  I don't 
 know if RLIN kept track of previous versions of a library's record in a 
 cluster as it got edited, but it means that there was the concept of a 
 spatial distribution of record versions if not a temporal one. I've never 
 used RLIN myself, but I'd be curious to know if it provided any tools to 
 readily compare records in the same cluster and if there were any mechanisms 
 (formal or informal) for a library to grab improvements from another 
 library's record and apply it to their own.
 
 As another example, the MARC cataloging source field has long been used, 
 particularly in central utilities, to record institution-level attribution 
 for changes to a MARC record.  I think that's mostly been used by catalogers 
 to help decide which version of a record to start from when copy cataloging, 
 but I suppose it's possible that some catalogers were also looking at the 
 list of modifying agencies (library A touched this record and is 
 particularly good at subject analysis, so I'll grab their 650s).
 
 Regards,
 
 Galen
 -- 
 Galen Charlton
 Director of Support and Implementation
 Equinox Software, Inc. / The Open Source Experts
 email:  g...@esilibrary.com
 direct: +1 770-709-5581
 cell:   +1 404-984-4366
 skype:  gmcharlt
 web:http://www.esilibrary.com/
 Supporting Koha and Evergreen: http://koha-community.org  
 http://evergreen-ils.org


Re: [CODE4LIB] Corrections to Worldcat/Hathi/Google

2012-08-28 Thread Joe Hourcle
On Aug 28, 2012, at 12:05 PM, Galen Charlton wrote:

 Hi,
 
 On 08/27/2012 04:36 PM, Karen Coyle wrote:
 I also assumed that Ed wasn't suggesting that we literally use github as
 our platform, but I do want to remind folks how far we are from having
 people friendly versioning software -- at least, none that I have seen
 has felt intuitive. The features of git are great, and people have
 built interfaces to it, but as Galen's question brings forth, the very
 *idea* of versioning doesn't exist in library data processing, even
 though having central-system based versions of MARC records (with a
 single time line) is at least conceptually simple.
 
 What's interesting, however, is that at least a couple parts of the concept 
 of distributed version control, viewed broadly, have been used in traditional 
 library cataloging.
 
 For example, RLIN had a concept of a cluster of MARC records for the same 
 title, with each library having their own record in the cluster.  I don't 
 know if RLIN kept track of previous versions of a library's record in a 
 cluster as it got edited, but it means that there was the concept of a 
 spatial distribution of record versions if not a temporal one. I've never 
 used RLIN myself, but I'd be curious to know if it provided any tools to 
 readily compare records in the same cluster and if there were any mechanisms 
 (formal or informal) for a library to grab improvements from another 
 library's record and apply it to their own.
 
 As another example, the MARC cataloging source field has long been used, 
 particularly in central utilities, to record institution-level attribution 
 for changes to a MARC record.  I think that's mostly been used by catalogers 
 to help decide which version of a record to start from when copy cataloging, 
 but I suppose it's possible that some catalogers were also looking at the 
 list of modifying agencies (library A touched this record and is 
 particularly good at subject analysis, so I'll grab their 650s).

I seem to recall seeing a presentation a couple of years ago from someone in 
the intelligence community, where they'd keep all of their intelligence, but 
they stored RDF quads so they could track the source.

They'd then assign a confidence level to each source, so they could get an 
overall level of confidence on their inferences.

... it'd get a bit messier if you have to do some sort of analysis of which 
sources are good for what type of information, but it might be a start.

Unfortunately, I'm not having luck finding the reference again.

It's possible that it was in the context of provenance, but I'm getting bogged 
down in too many articles about people storing provenance information using 
RDF-triples (without actually tracking the provenance of the triple itself)
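
For what it's worth, the basic mechanics can be sketched with named graphs, 
one graph per source, plus a per-source confidence table (the URIs and scores 
below are made up for illustration):

# Sketch: store each assertion as a quad whose fourth element names the
# source, then weight competing statements by per-source confidence.
from rdflib import Dataset, Namespace, Literal, URIRef

EX = Namespace("http://example.org/")
ds = Dataset()

# Each source's statements live in their own named graph.
columbia = ds.graph(URIRef("http://example.org/source/columbia"))
columbia.add((EX.book1, EX.title, Literal("Kort afhandling om ångmachiner")))

worldcat = ds.graph(URIRef("http://example.org/source/worldcat"))
worldcat.add((EX.book1, EX.title, Literal("Kort afhandling om angmachiner")))

# Hypothetical confidence levels assigned to each source.
confidence = {
    URIRef("http://example.org/source/columbia"): 0.8,
    URIRef("http://example.org/source/worldcat"): 0.5,
}

# Rank competing title statements by the confidence of the graph they came from.
for s, p, o, g in ds.quads((EX.book1, EX.title, None, None)):
    graph_id = g.identifier if hasattr(g, "identifier") else g
    print(o, confidence.get(graph_id, 0.0))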

-Joe

ps.  I just realized this discussion's been on CODE4LIB, and not NGC4LIB ... 
would it make sense to move it over there?


Re: [CODE4LIB] Corrections to Worldcat/Hathi/Google

2012-08-28 Thread Simon Spero
On Aug 28, 2012, at 2:17 PM, Joe Hourcle wrote:

 I seem to recall seeing a presentation a couple of years ago from someone in 
 the intelligence community, where they'd keep all of their intelligence, but 
 they stored RDF quads so they could track the source.
 
 They'd then assign a confidence level to each source, so they could get an 
 overall level of confidence on their inferences.
 […]
 It's possible that it was in the context of provenance, but I'm getting 
 bogged down in too many articles about people storing provenance information 
 using RDF-triples (without actually tracking the provenance of the triple 
 itself)

Provenance is of great importance in the IC and related sectors.   

A good overview of the nature of evidential reasoning is David A. Schum 
(1994; 2001), Evidential Foundations of Probabilistic Reasoning. Wiley & Sons, 
1994; Northwestern University Press, 2001 [paperback edition].

There are usually papers on provenance and associated semantics at the GMU 
Semantic Technology for Intelligence, Defense, and Security (STIDS).  This 
year's conference is 23-26 October 2012; see http://stids.c4i.gmu.edu/ for 
more details. 

Simon

Re: [CODE4LIB] Corrections to Worldcat/Hathi/Google

2012-08-28 Thread Owen Stephens
The JISC funded CLOCK project did some thinking around cataloguing processes 
and tracking changes to statements and/or records - e.g. 
http://clock.blogs.lincoln.ac.uk/2012/05/23/its-a-model-and-its-looking-good/

Not solutions of course, but hopefully of interest

Owen

Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Telephone: 0121 288 6936

On 28 Aug 2012, at 19:43, Simon Spero sesunc...@gmail.com wrote:

 On Aug 28, 2012, at 2:17 PM, Joe Hourcle wrote:
 
 I seem to recall seeing a presentation a couple of years ago from someone in 
 the intelligence community, where they'd keep all of their intelligence, but 
 they stored RDF quads so they could track the source.
 
 They'd then assign a confidence level to each source, so they could get an 
 overall level of confidence on their inferences.
 […]
 It's possible that it was in the context of provenance, but I'm getting 
 bogged down in too many articles about people storing provenance information 
 using RDF-triples (without actually tracking the provenance of the triple 
 itself)
 
 Provenance is of great importance in the IC and related sectors.   
 
 A good overview of the nature of evidential reasoning is David A. Schum 
 (1994; 2001), Evidential Foundations of Probabilistic Reasoning. Wiley & Sons, 
 1994; Northwestern University Press, 2001 [paperback edition].
 
 There are usually papers on provenance and associated semantics at the GMU 
 Semantic Technology for Intelligence, Defense, and Security (STIDS).  This 
 year's conference is 23-26 October 2012; see http://stids.c4i.gmu.edu/ for 
 more details. 
 
 Simon


Re: [CODE4LIB] Corrections to Worldcat/Hathi/Google

2012-08-28 Thread Lars Aronsson

A week ago, I wrote:

What I have done is just to search (worldcat.org and
hathitrust.org) for some common Swedish words, and
I don't have to do this for long before some very
obvious (to a native speaker) spelling mistakes appear.


I've reported 38 errors to Hathitrust, and got feedback
that they are now corrected. The originating libraries
for these 38 records were:

  18  University of California
   6  University of Wisconsin
   5  University of Michigan
   3  Harvard University
   2  Columbia University
   2  New York Public Library
   1  Cornell University
   1  Princeton University

Maybe Google scanned a lot of Swedish books at the University
of California, or their error rate is higher. I didn't compile
any statistics on the correct records that I looked at.

Four of the records had a title where an a-ring (å) was
erroneously written as an a-dot. But these four records came
from four different libraries. I didn't see any patterns.


--
  Lars Aronsson (l...@aronsson.se)
  Project Runeberg - free Nordic literature - http://runeberg.org/


Re: [CODE4LIB] Corrections to Worldcat/Hathi/Google

2012-08-27 Thread Karen Coyle
Actually, Ed, this would not only make for a good blog post (please, so 
it doesn't get lost in email space), but I would love to see a 
discussion of what kind of revision control would work:


1) for libraries (git is gawdawful nerdy)
2) for linked data

kc
p.s. the Ramsay book is now showing on Open Library, and the subtitle is 
correct... perhaps because the record is from the LC MARC service :-)

http://openlibrary.org/works/OL16528530W/Reading_machines

On 8/26/12 6:32 PM, Ed Summers wrote:

Thanks for sharing this bit of detective work. I noticed something
similar fairly recently myself [1], but didn't discover as plausible
of a scenario for what had happened as you did. I imagine others have
noticed this network effect before as well.

On Tue, Aug 21, 2012 at 11:42 AM, Lars Aronsson l...@aronsson.se wrote:

And sure enough, there it is,
http://clio.cul.columbia.edu:7018/vwebv/holdingsInfo?bibId=1439352
But will my error report to Worldcat find its way back
to CLIO? Or if I report the error to Columbia University,
will the correction propagate to Google, Hathi and Worldcat?
(Columbia asks me for a student ID when I want to give
feedback, so that removes this option for me.)

I realize this probably will sound flippant (or overly grandiose), but
innovating solutions to this problem, where there isn't necessarily
one metadata master that everyone is slaved to seems to be one of the
more important and interesting problems that our sector faces.

When Columbia University can become the source of a bibliographic
record for Google Books, HathiTrust and OpenLibrary, etc how does this
change the hub and spoke workflows (with OCLC as the hub) that we are
more familiar with? I think this topic is what's at the heart of the
discussions about a github-for-data [2,3], since decentralized
version control systems [4] allow for the evolution of more organic,
push/pull, multimaster workflows...and platforms like Github make them
socially feasible, easy and fun.

I also think Linked Library Data, where bibliographic descriptions are
REST enabled Web resources identified with URLs, and patterns such as
webhooks [5] make it easy to trigger update events could be part of an
answer. Feed technologies like Atom, RSS and the work being done on
ResourceSync also seem important technologies for us to use to allow
people to poll for changes [6]. And being able to say where you have
obtained data from, possibly using something like the W3C Provenance
vocabulary [7] also seems like an important part of the puzzle.

I'm sure there are other (and perhaps better) creative analogies or
tools that could help solve this problem. I think you're probably
right that we are starting to see the errors more now that more
library data is becoming part of the visible Web via projects like
GoogleBooks, HathiTrust, OpenLibrary and other enterprising libraries
that design their catalogs to be crawlable and indexable by search
engines.

But I think it's more fun to think about (and hack on) what grassroots
things we could be doing to help these new bibliographic data
workflows to grow and flourish than to get piled under by the errors,
and a sense of futility...

Or it might make for a good article or dissertation topic :-)

//Ed

[1] http://inkdroid.org/journal/2011/12/25/genealogy-of-a-typo/
[2] http://www.informationdiet.com/blog/read/we-need-a-github-for-data
[3] http://sunlightlabs.com/blog/2010/we-dont-need-a-github-for-data/
[4] http://en.wikipedia.org/wiki/Distributed_revision_control
[5] https://help.github.com/articles/post-receive-hooks
[6] http://www.niso.org/workrooms/resourcesync/
[7] http://www.w3.org/TR/prov-primer/


--
Karen Coyle
kco...@kcoyle.net http://kcoyle.net
ph: 1-510-540-7596
m: 1-510-435-8234
skype: kcoylenet


Re: [CODE4LIB] Corrections to Worldcat/Hathi/Google

2012-08-27 Thread Galen Charlton

Hi,

On 08/27/2012 08:49 AM, Karen Coyle wrote:

Actually, Ed, this would not only make for a good blog post (please, so
it doesn't get lost in email space), but I would love to see a
discussion of what kind of revision control would work:

1) for libraries (git is gawdawful nerdy)
2) for linked data


Speaking of revision control, does anyone have or know of a sizable dataset 
of bibliographic metadata that includes change history?  For example, I 
know that some ILSs can retain previous versions of bibliographic 
records as they get edited.


Such a dataset would be useful in figuring out good ways to calculate 
differences between versions of a record, and perhaps more to the point, 
express those in a way that's more useful to maintainers of the metadata.
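
As a rough illustration of that kind of difference calculation, a sketch 
assuming pymarc and two locally saved versions of the same record (the file 
names are hypothetical):

# Sketch: diff two versions of a MARC record by comparing the string form
# of each field, reporting what was removed and what was added.
from pymarc import MARCReader

def field_diff(old_record, new_record):
    """Return (removed, added) field strings between two pymarc Records."""
    old_fields = {str(f) for f in old_record.get_fields()}
    new_fields = {str(f) for f in new_record.get_fields()}
    return sorted(old_fields - new_fields), sorted(new_fields - old_fields)

# Hypothetical files, each holding one version of the same record.
with open("record_v1.mrc", "rb") as f1, open("record_v2.mrc", "rb") as f2:
    old = next(iter(MARCReader(f1)))
    new = next(iter(MARCReader(f2)))

removed, added = field_diff(old, new)
for line in removed:
    print("-", line)
for line in added:
    print("+", line)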


Regards,

Galen
--
Galen Charlton
Director of Support and Implementation
Equinox Software, Inc. / The Open Source Experts
email:  g...@esilibrary.com
direct: +1 770-709-5581
cell:   +1 404-984-4366
skype:  gmcharlt
web:http://www.esilibrary.com/
Supporting Koha and Evergreen: http://koha-community.org  
http://evergreen-ils.org


Re: [CODE4LIB] Corrections to Worldcat/Hathi/Google

2012-08-27 Thread Ed Summers
On Mon, Aug 27, 2012 at 10:36 AM, Ross Singer rossfsin...@gmail.com wrote:
 For MARC data, while I don't know of any examples of this, it seems like 
 something like CouchDB [2] and marc-in-json [3] would be a fantastic way to 
 make something like this available.

Great idea...and there are 4 years of transactions for LC record
create/update/deletes up at Internet Archive:

http://archive.org/details/marc_loc_updates
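
For anyone who wants to play with that idea, a rough sketch (assuming pymarc, 
the requests library, and a CouchDB reachable locally without auth; the 
database name and the use of the 001 as document id are just for illustration):

# Sketch: load MARC records, convert each to the marc-in-json structure with
# pymarc, and store it as a document in CouchDB. Updating an existing
# document would additionally require sending its current _rev.
import requests
from pymarc import MARCReader

DB = "http://localhost:5984/marc_records"  # hypothetical CouchDB database
requests.put(DB)  # create the database if it doesn't already exist

with open("records.mrc", "rb") as fh:
    for record in MARCReader(fh):
        if record is None:
            continue
        doc = record.as_dict()              # marc-in-json style dict
        fields_001 = record.get_fields("001")
        if not fields_001:
            continue
        doc_id = fields_001[0].value()      # use the control number as the doc id
        requests.put(f"{DB}/{doc_id}", json=doc)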

//Ed


Re: [CODE4LIB] Corrections to Worldcat/Hathi/Google

2012-08-27 Thread Ed Summers
On Mon, Aug 27, 2012 at 8:49 AM, Karen Coyle li...@kcoyle.net wrote:
 Actually, Ed, this would not only make for a good blog post (please, so it
 doesn't get lost in email space), but I would love to see a discussion of
 what kind of revision control would work:

 1) for libraries (git is gawdawful nerdy)
 2) for linked data

I think you know as well as me that linked data is gawdawful nerdy too :-)

 p.s. the Ramsay book is now showing on Open Library, and the subtitle is
 correct... perhaps because the record is from the LC MARC service :-)
 http://openlibrary.org/works/OL16528530W/Reading_machines

"perhaps" being the operative word. Being able to concretely answer
these provenance questions is important. Actually, I'm not sure it was
ever incorrect at OpenLibrary. At least I don't think I used it as an
example in my Genealogy of a Typo post.

//Ed


Re: [CODE4LIB] Corrections to Worldcat/Hathi/Google

2012-08-27 Thread Ed Summers
On Mon, Aug 27, 2012 at 1:33 PM, Corey A Harper corey.har...@nyu.edu wrote:
 I think there's a useful distinction here. Ed can correct me if I'm
 wrong, but I suspect he was not actually suggesting that Git itself be
 the user-interface to a github-for-data type service, but rather that
 such a service can be built *on top* of an infrastructure component
 like GitHub.

Yes, I wasn't saying that we could just plonk our data into Github,
and pat ourselves on the back for a good day's work :-) I guess I was
stating the obvious: technologies like Git have made once-hard
problems like decentralized version control much, much easier...and
there might be some giants' shoulders to stand on.

//Ed


Re: [CODE4LIB] Corrections to Worldcat/Hathi/Google

2012-08-27 Thread Karen Coyle

Ed, Corey -

I also assumed that Ed wasn't suggesting that we literally use github as 
our platform, but I do want to remind folks how far we are from having 
people friendly versioning software -- at least, none that I have seen 
has felt intuitive. The features of git are great, and people have 
built interfaces to it, but as Galen's question brings forth, the very 
*idea* of versioning doesn't exist in library data processing, even 
though having central-system based versions of MARC records (with a 
single time line) is at least conceptually simple.


Therefore it seems to me that first we have to define what a version 
would be, both in terms of data but also in terms of the mind set and 
work flow of the cataloging process. How will people *understand* 
versions in the context of their work? What do they need in order to 
evaluate different versions? And that leads to my second question: what 
is a version in LD space? Triples are just triples - you can add them or 
delete them but I don't know of a way that you can version them, since 
each has an independent T-space existence. So, are we talking about 
named graphs?


I think this should be a high priority activity around the new 
bibliographic framework planning because, as we have seen with MARC, 
the idea of versioning needs to be part of the very design or it won't 
happen.


kc

On 8/27/12 11:20 AM, Ed Summers wrote:

On Mon, Aug 27, 2012 at 1:33 PM, Corey A Harper corey.har...@nyu.edu wrote:

I think there's a useful distinction here. Ed can correct me if I'm
wrong, but I suspect he was not actually suggesting that Git itself be
the user-interface to a github-for-data type service, but rather that
such a service can be built *on top* of an infrastructure component
like GitHub.

Yes, I wasn't saying that we could just plonk our data into Github,
and pat ourselves on the back for a good days work :-) I guess I was
stating the obvious: technologies like Git have made once hard
problems like decentralized version control much, much easier...and
there might be some giants shoulders to stand on.

//Ed


--
Karen Coyle
kco...@kcoyle.net http://kcoyle.net
ph: 1-510-540-7596
m: 1-510-435-8234
skype: kcoylenet


Re: [CODE4LIB] Corrections to Worldcat/Hathi/Google

2012-08-27 Thread stuart yeates
These have to be named graphs, or at least collections of triples which 
can be processed through workflows as a single unit.


In terms of LD, the version needs to be defined in terms of:

(a) synchronisation with the non-bibliographic real world (i.e. Dataset 
Z version X was released at time Y)


(b) correction/augmentation of other datasets (i.e Dataset F version G 
contains triples augmenting Dataset H versions A, B, C and D)


(c) mapping between datasets (i.e. Dataset I contains triples mapping 
between Dataset J version K and Dataset L version M (and vice versa))


Note that a 'Dataset' here could be a bibliographic dataset (records of 
works, etc), a classification dataset (a version of the Dewey Decimal 
Scheme, a version of the Māori Subject Headings, a version of Dublin 
Core Scheme, etc), a dataset of real-world entities to do authority 
control against (a dbpedia dump, an organisational structure in an 
institution, etc), or some arbitrary mapping between some arbitrary 
combination of these.
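
A very rough sketch of how those relationships could travel as data 
themselves (the predicates below are invented purely for illustration, not a 
proposed vocabulary):

# Sketch: express the version relationships (a)-(c) above as triples with
# rdflib, using made-up predicates and dataset URIs.
from rdflib import Graph, Namespace, Literal, URIRef
from rdflib.namespace import XSD

EX = Namespace("http://example.org/versioning#")
g = Graph()

dataset_f_g = URIRef("http://example.org/dataset/F/version/G")
dataset_h_a = URIRef("http://example.org/dataset/H/version/A")

# (a) release time of a dataset version
g.add((dataset_f_g, EX.releasedAt, Literal("2012-08-28T00:00:00", datatype=XSD.dateTime)))
# (b) one dataset version augmenting another
g.add((dataset_f_g, EX.augments, dataset_h_a))
# (c) a mapping dataset relating two others
mapping = URIRef("http://example.org/dataset/I")
g.add((mapping, EX.mapsBetween, URIRef("http://example.org/dataset/J/version/K")))
g.add((mapping, EX.mapsBetween, URIRef("http://example.org/dataset/L/version/M")))

print(g.serialize(format="turtle"))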


Most of these are going to be managed and generated using current 
systems with processes that involve periodic dumps (or drops) of data 
(the dbpedia drops of wikipedia data are a good model here). git makes 
little sense for this kind of data.


github is most likely to be useful for smaller niche collaborative 
collections (probably no more than a million triples) mapping between 
the larger collections, and scripts for integrating the collections into 
a sane whole.


cheers
stuart

On 28/08/12 08:36, Karen Coyle wrote:

Ed, Corey -

I also assumed that Ed wasn't suggesting that we literally use github as
our platform, but I do want to remind folks how far we are from having
people friendly versioning software -- at least, none that I have seen
has felt intuitive. The features of git are great, and people have
built interfaces to it, but as Galen's question brings forth, the very
*idea* of versioning doesn't exist in library data processing, even
though having central-system based versions of MARC records (with a
single time line) is at least conceptually simple.

Therefore it seems to me that first we have to define what a version
would be, both in terms of data but also in terms of the mind set and
work flow of the cataloging process. How will people *understand*
versions in the context of their work? What do they need in order to
evaluate different versions? And that leads to my second question: what
is a version in LD space? Triples are just triples - you can add them or
delete them but I don't know of a way that you can version them, since
each has an independent T-space existence. So, are we talking about
named graphs?

I think this should be a high priority activity around the new
bibliographic framework planning because, as we have seen with MARC,
the idea of versioning needs to be part of the very design or it won't
happen.

kc

On 8/27/12 11:20 AM, Ed Summers wrote:

On Mon, Aug 27, 2012 at 1:33 PM, Corey A Harper corey.har...@nyu.edu
wrote:

I think there's a useful distinction here. Ed can correct me if I'm
wrong, but I suspect he was not actually suggesting that Git itself be
the user-interface to a github-for-data type service, but rather that
such a service can be built *on top* of an infrastructure component
like GitHub.

Yes, I wasn't saying that we could just plonk our data into Github,
and pat ourselves on the back for a good days work :-) I guess I was
stating the obvious: technologies like Git have made once hard
problems like decentralized version control much, much easier...and
there might be some giants shoulders to stand on.

//Ed





--
Stuart Yeates
Library Technology Services http://www.victoria.ac.nz/library/


Re: [CODE4LIB] Corrections to Worldcat/Hathi/Google

2012-08-27 Thread Peter Noerr
I agree entirely that these would need to be a collection of triples with its 
own set of attributes/metadata describing the collection. Basically a record 
with triples as the data elements.

But I see a bigger problem with the direction this thread has taken so far. The 
use of versions has been conditioned by the use of something like Github as the 
underlying versioning platform. But Github (and all software versioning 
systems) are based on temporal versions, where each version is, in some way, an 
evolved unit of the same underlying thing - a program or whatever. So the 
versions are really temporally linearly related to each other as well as 
related in terms of added or improved or fixed functionality. Yes, the codebase 
(the underlying thing) can fork or split in a number of ways, but they are 
all versions of the same thing, progressing through time.

In the existing bibliographic case we have many records which purport to be 
about the same thing, but contain different data values for the same elements. 
And these are the versions we have to deal with, and eventually 
reconcile. They are not descendants of the same original, they are independent 
entities, whether they are recorded as singular MARC records or collections of 
LD triples. I would suggest that at all levels, from the triplet or key/value 
field pair to the triple collection or fielded record, what we have are 
alternates, not versions. 
 
Thus the alternates exist at the triple level, and also at the collection 
level (the normal bibliographic unit record we are familiar with). And those 
alternates could then be allowed versions which are the attempts to, in some 
way, improve the quality (your definition of what this is is as good as mine) 
over time. And with a closed group of alternates (of a single bib unit) these 
versioned alternates would (in a perfect world) iterate to a common descendant 
which had the same agreed, authorized set of triples. Of course this would only 
be the authorized form for those organizations which recognized the 
arrangement. 
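
A toy sketch of that structure, just to make the alternate/version 
distinction concrete (all names invented):

# Sketch: a bibliographic unit holds several alternates (one per asserting
# source), and each alternate carries its own linear version history.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Version:
    timestamp: str      # when this state was recorded
    data: dict          # e.g. a marc-in-json structure or a bag of triples

@dataclass
class Alternate:
    source: str                                    # who asserts this description
    versions: List[Version] = field(default_factory=list)

    def current(self) -> Version:
        return self.versions[-1]

@dataclass
class BibUnit:
    identifier: str                                # e.g. an ISBN
    alternates: List[Alternate] = field(default_factory=list)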

But, allowing alternates and their versions does allow for a method of tracking 
the original problem of three organizations each copying each other endlessly 
to correct their data. In this model it would be an alternate/version spiral 
of states, rather than a flat circle of each changing version with no history, 
and no idea of which was master. (Try re-reading Stuart's (a), (b), (c) below 
with the idea of alternates as well as versions (of the Datasets). I think it 
would become clearer as to what was happening.) There is still no master, but 
at least the state changes can be properly tracked and checked by software 
(and/or humans) so the endless cycle can be addressed - probably by an outside 
(human) decision about the correct form of a triple to use for this bib 
entity.

Or this may all prove to be an unnecessary complication.

Peter


 -Original Message-
 From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of 
 stuart yeates
 Sent: Monday, August 27, 2012 3:42 PM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] Corrections to Worldcat/Hathi/Google
 
 These have to be named graphs, or at least collections of triples which can 
 be processed through
 workflows as a single unit.
 
 In terms of LD there version needs to be defined in terms of:
 
 (a) synchronisation with the non-bibliographic real world (i.e. Dataset Z 
 version X was released at
 time Y)
 
 (b) correction/augmentation of other datasets (i.e Dataset F version G 
 contains triples augmenting
 Dataset H versions A, B, C and D)
 
 (c) mapping between datasets (i.e. Dataset I contains triples mapping between 
 Dataset J version K and
 Dataset L version M (and visa-versa))
 
 Note that a 'Dataset' here could be a bibliographic dataset (records of 
 works, etc), a classification
 dataset (a version of the Dewey Decimal Scheme, a version of the Māori 
 Subject Headings, a version of
 Dublin Core Scheme, etc), a dataset of real-world entities to do authority 
 control against (a dbpedia
 dump, an organisational structure in an institution, etc), or some arbitrary 
 mapping between some
 arbitrary combination of these.
 
 Most of these are going to be managed and generated using current systems 
 with processes that involve
 periodic dumps (or drops) of data (the dbpedia drops of wikipedia data are a 
 good model here). git
 makes little sense for this kind of data.
 
 github is most likely to be useful for smaller niche collaborative 
 collections (probably no more than
 a million triples) mapping between the larger collections, and scripts for 
 integrating the collections
 into a sane whole.
 
 cheers
 stuart
 
 On 28/08/12 08:36, Karen Coyle wrote:
  Ed, Corey -
 
  I also assumed that Ed wasn't suggesting that we literally use github
  as our platform, but I do want to remind folks how far we are from
  having people friendly versioning software -- at least, none that I

Re: [CODE4LIB] Corrections to Worldcat/Hathi/Google

2012-08-27 Thread stuart yeates

On 28/08/12 12:07, Peter Noerr wrote:

 They are not descendents of the same original, they are independent entities, 
whether they are recorded as singular MARC records or collections of LD triples.


That depends on which end of the stick one grasps.

Conceptually these are descendants of the abstract work in question; 
textually these are independent (or likely to be).


In practice it doesn't matter: since git/svn/etc are all textual in 
nature, they're not good at handling these.


The reconciliation is likely to be a good candidate for temporal 
versioning.


It's interesting to ponder which of the many datasets is going to prove 
to be the hub for reconciliation. My money is on librarything, because 
their merge-ist approach to cataloguing means they have lots and lots of 
different versions of the work information to match against. See for 
example:  https://www.librarything.com/work/683408/editions/11795335 
Wikipedia / dbpedia have redirects which tend in the same direction, but 
only for titles and not ISBNs.


cheers
stuart
--
Stuart Yeates
Library Technology Services http://www.victoria.ac.nz/library/


Re: [CODE4LIB] Corrections to Worldcat/Hathi/Google

2012-08-26 Thread Ed Summers
Thanks for sharing this bit of detective work. I noticed something
similar fairly recently myself [1], but didn't discover as plausible
of a scenario for what had happened as you did. I imagine others have
noticed this network effect before as well.

On Tue, Aug 21, 2012 at 11:42 AM, Lars Aronsson l...@aronsson.se wrote:
 And sure enough, there it is,
 http://clio.cul.columbia.edu:7018/vwebv/holdingsInfo?bibId=1439352
 But will my error report to Worldcat find its way back
 to CLIO? Or if I report the error to Columbia University,
 will the correction propagate to Google, Hathi and Worldcat?
 (Columbia asks me for a student ID when I want to give
 feedback, so that removes this option for me.)

I realize this probably will sound flippant (or overly grandiose), but
innovating solutions to this problem, where there isn't necessarily
one metadata master that everyone is slaved to, seems to be one of the
more important and interesting problems that our sector faces.

When Columbia University can become the source of a bibliographic
record for Google Books, HathiTrust and OpenLibrary, etc how does this
change the hub and spoke workflows (with OCLC as the hub) that we are
more familiar with? I think this topic is what's at the heart of the
discussions about a github-for-data [2,3], since decentralized
version control systems [4] allow for the evolution of more organic,
push/pull, multimaster workflows...and platforms like Github make them
socially feasible, easy and fun.

I also think Linked Library Data, where bibliographic descriptions are
REST enabled Web resources identified with URLs, and patterns such as
webhooks [5] make it easy to trigger update events could be part of an
answer. Feed technologies like Atom, RSS and the work being done on
ResourceSync also seem important technologies for us to use to allow
people to poll for changes [6]. And being able to say where you have
obtained data from, possibly using something like the W3C Provenance
vocabulary [7] also seems like an important part of the puzzle.

I'm sure there are other (and perhaps better) creative analogies or
tools that could help solve this problem. I think you're probably
right that we are starting to see the errors more now that more
library data is becoming part of the visible Web via projects like
GoogleBooks, HathiTrust, OpenLibrary and other enterprising libraries
that design their catalogs to be crawlable and indexable by search
engines.

But I think it's more fun to think about (and hack on) what grassroots
things we could be doing to help these new bibliographic data
workflows to grow and flourish than to get piled under by the errors,
and a sense of futility...

Or it might make for a good article or dissertation topic :-)

//Ed

[1] http://inkdroid.org/journal/2011/12/25/genealogy-of-a-typo/
[2] http://www.informationdiet.com/blog/read/we-need-a-github-for-data
[3] http://sunlightlabs.com/blog/2010/we-dont-need-a-github-for-data/
[4] http://en.wikipedia.org/wiki/Distributed_revision_control
[5] https://help.github.com/articles/post-receive-hooks
[6] http://www.niso.org/workrooms/resourcesync/
[7] http://www.w3.org/TR/prov-primer/


Re: [CODE4LIB] Corrections to Worldcat/Hathi/Google

2012-08-25 Thread Lars Aronsson

On August 22, ya'aQov wrote:

Lars, so the wrong spelling of that Swedish author is based on your
browsing it, not on an automated procedure, or reference to an online
thesaurus. Given your Swedish resources, is there any quality control
mechanism you can suggest?


This author's name was just one that I stumbled upon.
My problem is that I stumble on bad spelling of titles
and authors' names far too often, like 5% of all cases.
I guess this is because they are now exposed, when
Google and Hathi Trust pull all data together. Earlier
these errors have been isolated in various library
catalogs that had no users who speak the language
(in this case: Swedish). In each library catalog these
Swedish entries make up a tiny minority. But now with
the aggregation in Hathi Trust and Worldcat, we can
start to see patterns and fix the errors.

For Swedish books, the national catalog at libris.kb.se
is most often correct. It's also available as open data
under cc0 (Creative Commons zero)
http://www.kb.se/libris/teknisk-information/Oppen-data/Open-Data/
and the Libris authority file is part of VIAF.

You'd need one good reference per language.

But I think you can do a good analysis even without
knowing much about a language. If you find that records
for books in Czech often contain č (c-hacek), but almost
never ĉ (c-circumflex), you can look at the records that
contain the unusual letter and see if they are errors
or perhaps intentional uses of Esperanto words.
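
A simple sketch of that kind of check (plain Python; the input file of 
titles, one per line, is hypothetical):

# Sketch: count character frequencies across a set of titles and flag titles
# containing characters that are rare for that language, for manual review.
from collections import Counter

def rare_char_titles(titles, threshold=0.001):
    """Yield (title, rare_chars) for titles containing unusually rare letters."""
    counts = Counter(ch for t in titles for ch in t if ch.isalpha())
    total = sum(counts.values()) or 1
    rare = {ch for ch, n in counts.items() if n / total < threshold}
    for t in titles:
        hits = sorted(set(t) & rare)
        if hits:
            yield t, hits

# Hypothetical usage with a plain-text list of Czech titles.
with open("czech_titles.txt", encoding="utf-8") as fh:
    titles = [line.strip() for line in fh if line.strip()]

for title, chars in rare_char_titles(titles):
    print(chars, title)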


--
  Lars Aronsson (l...@aronsson.se)
  Project Runeberg - free Nordic literature - http://runeberg.org/


Re: [CODE4LIB] Corrections to Worldcat/Hathi/Google

2012-08-21 Thread Angelina Z
Lars,

With regards to HathiTrust and Google, HathiTrust receives bib records from
partner libraries. Google also receives bib records from partner libraries.
HathiTrust has never received bib records from Google - sometimes we
provide corrected records to them, but the flow is only one-sided. For
HathiTrust-related issues, you can report cataloging errors to us at
feedb...@issues.hathitrust.org, and we can work with the partner library to
get them fixed.

Thanks,
Angelina Zaytsev
Project Librarian
HathiTrust
http://www.hathitrust.org/


On Mon, Aug 20, 2012 at 9:42 PM, ya'aQov yaaq...@gmail.com wrote:

 Lars,

 Thank you. Which online thesaurus are you matching these names with (when
 you declare them mistakes)? If an online procedure, please be specific; how
 can VIAF benefit, if at all, from Project Runeberg?
 http://runeberg.org/

 Ya'aqov Ziso




-- 
Angelina Zaytsev


In your thirst for knowledge, be sure not to drown in all the
information.  ~Anthony J. D'Angelo

There are as many beautiful ideals as there are different shapes of nose or
different characters. ~Stendhal

Education can give you a skill, but a liberal education can give you
dignity. ~Ellen Key


Re: [CODE4LIB] Corrections to Worldcat/Hathi/Google

2012-08-21 Thread Lars Aronsson

On 2012-08-20 22:38, Roy Tennant wrote:

Any errors in WorldCat can be reported to bibcha...@oclc.org. We take
record quality seriously, but as you can imagine when you take in
records from thousands of sources around the world, this is a constant
struggle. We have our own quality control efforts, but individuals
reporting problems are also an important strategy.


The web page for a bibliographic record, when I look at it,
in the top-left menu, has a report feedback option, but
the form that pops up doesn't indicate whether the
referring page URL will be included in the report.
It should be, of course.

What I have done is just to search (worldcat.org and
hathitrust.org) for some common Swedish words, and
I don't have to do this for long before some very
obvious (to a native speaker) spelling mistakes appear.
One example is this,
http://www.worldcat.org/oclc/681652093
This record has a link to Hathi Trust, where the display
of the title page immediately shows that the A-ring is
on the wrong A.

If both Hathi Trust and Google receive catalog records
from partner libraries, the error must originate from
Columbia University where this book was scanned by Google
in August 2009,
http://books.google.se/books?id=-EBGYAAJ

And sure enough, there it is,
http://clio.cul.columbia.edu:7018/vwebv/holdingsInfo?bibId=1439352
But will my error report to Worldcat find its way back
to CLIO? Or if I report the error to Columbia University,
will the correction propagate to Google, Hathi and Worldcat?
(Columbia asks me for a student ID when I want to give
feedback, so that removes this option for me.)

Searching Worldcat for this title without any diacritics
yields two results: This book from 1858 and another one
(with the A-ring on the right A) from 1855,
http://www.worldcat.org/search?q=kort+afhandling+om+angmachiner

The 1855 book has the author's name wrong,
Karl Bertil Lilliehoehoe instead of
Carl Bertil Lilliehöök, who lived 1809-1890,
http://www.worldcat.org/oclc/249342164
I understand that oe is a poor man's transcription
of the umlaut ö, but where did the -k go?
That Worldcat record doesn't indicate its source.
It's apparently not Hathi Trust.

I don't know how you work, but I think it would be
quite easy to extract all titles for a given language,
and run them through a spell checker, to find an
indication of where there could be mistakes. For
example, afhandling means treatise or dissertation,
and brings up 32,000 hits in a Worldcat search (the
modern spelling avhandling yields 140,000), but
åfhandling is not in any dictionary. (It's obvious
that Google Books never tried this.)
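
Something like the following would be enough to get started (a sketch 
assuming pyenchant with a Swedish dictionary installed, and a hypothetical 
file of extracted titles, one per line):

# Sketch: flag title words the spell checker doesn't recognize as candidates
# for manual review (proper names will also be flagged, so this is only a filter).
import enchant

dictionary = enchant.Dict("sv_SE")

with open("swedish_titles.txt", encoding="utf-8") as fh:
    for title in fh:
        suspect = [w for w in title.split() if w.isalpha() and not dictionary.check(w)]
        if suspect:
            print(suspect, "<-", title.strip())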


--
  Lars Aronsson (l...@aronsson.se)
  Project Runeberg - free Nordic literature - http://runeberg.org/


Re: [CODE4LIB] Corrections to Worldcat/Hathi/Google

2012-08-21 Thread ya'aQov
Lars, so the wrong spelling of that Swedish author is based on your
browsing it, not on an automated procedure, or reference to an online
thesaurus. Given your Swedish resources, is there any quality control
mechanism you can suggest?

On Tue, Aug 21, 2012 at 10:42 AM, Lars Aronsson l...@aronsson.se wrote:

 On 2012-08-20 22:38, Roy Tennant wrote:

 Any errors in WorldCat can be reported to bibcha...@oclc.org. We take
 record quality seriously, but as you can imagine when you take in
 records from thousands of sources around the world, this is a constant
 struggle. We have our own quality control efforts, but individuals
 reporting problems are also an important strategy.


 The web page for a bibliographic record, when I look at it,
 in the top-left menu, has a report feedback option, but
 the form that pops up doesn't indicate whether the
 referring page URL will be included in the report.
 It should be, of course.

 What I have done is just to search (worldcat.org and
 hathitrust.org) for some common Swedish words, and
 I don't have to do this for long before some very
 obvious (to a native speaker) spelling mistakes appear.
 One example is this,
 http://www.worldcat.org/oclc/681652093
 This record has a link to Hathi Trust, where the display
 of the title page immediatly shows that the A-ring is
 on the wrong A.

 If both Hathi Trust and Google receive catalog records
 from partner libraries, the error must originate from
 Columbia University where this book was scanned by Google
 in August 2009,
 http://books.google.se/books?id=-EBGYAAJ

 And sure enough, there it is,
 http://clio.cul.columbia.edu:7018/vwebv/holdingsInfo?bibId=1439352
 But will my error report to Worldcat find its way back
 to CLIO? Or if I report the error to Columbia University,
 will the correction propagate to Google, Hathi and Worldcat?
 (Columbia asks me for a student ID when I want to give
 feedback, so that removes this option for me.)

 Searching Worldcat for this title without any diacritics
 yields two results: This book from 1858 and another one
 (with the A-ring on the right A) from 1855,
 http://www.worldcat.org/search?q=kort+afhandling+om+angmachiner

 The 1855 book has the author's name wrong,
 Karl Bertil Lilliehoehoe instead of
 Carl Bertil Lilliehöök, who lived 1809-1890,
 http://www.worldcat.org/oclc/249342164
 I understand that oe is a poor man's transcription
 of the umlaut ö, but where did the -k go?
 That Worldcat record doesn't indicate its source.
 It's apparently not Hathi Trust.

 I don't know how you work, but I think it would be
 quite easy to extract all titles for a given language,
 and run them through a spell checker, to find an
 indication of where there could be mistakes. For
 example, afhandling means treatise or dissertation,
 and brings up 32,000 hits in a Worldcat search (the
 modern spelling avhandling yields 140,000), but
 åfhandling is not in any dictionary. (It's obvious
 that Google Books never tried this.)


 --
   Lars Aronsson (l...@aronsson.se)
   Project Runeberg - free Nordic literature - http://runeberg.org/




-- 

  ya'aQov ziso | yaaq...@gmail.com | 856 217 3456


Re: [CODE4LIB] Corrections to Worldcat/Hathi/Google

2012-08-20 Thread Roy Tennant
Any errors in WorldCat can be reported to bibcha...@oclc.org. We take
record quality seriously, but as you can imagine when you take in
records from thousands of sources around the world, this is a constant
struggle. We have our own quality control efforts, but individuals
reporting problems are also an important strategy.
Roy

On Mon, Aug 20, 2012 at 1:01 PM, Lars Aronsson l...@aronsson.se wrote:
 This big mess of broken bibliographic records that started out
 as Google Book Search, and then spread to Hathi Trust, is now
 also spreading to OCLC Worldcat. I find the same kind of
 misspellings and misconceptions in all three places. So far,
 the VIAF author names seem to be more correct, but maybe they
 will soon import all the misspellings from Worldcat as well?

 What is the way to stop this downward trend? Where can one post
 corrections, so that they go back to the source, and don't
 reappear next time? How can I know who copies from whom?

 Does Hathi Trust copy everything from Google? And does
 Worldcat discover new (misspelled) authors and titles in
 Hathi Trust, and import them rather than reporting the errors?


 --
   Lars Aronsson (l...@aronsson.se)
   Project Runeberg - free Nordic literature - http://runeberg.org/


Re: [CODE4LIB] Corrections to Worldcat/Hathi/Google

2012-08-20 Thread ya'aQov
Lars,

Thank you. Which online thesaurus are you matching these names with (when
you declare them mistakes)? If an online procedure, please be specific; how
can VIAF benefit, if at all, from Project Runeberg? http://runeberg.org/

Ya'aqov Ziso