Re: [CODE4LIB] wikipedia/author disambiguation
I'm not sure about other dialects of English, but in New Zealand English, a negation there has no impact on the semantics or subtleties of that sentence. cheers stuart On 01/06/11 07:18, Ed Summers wrote: a bit of a Freudian slip there I suppose :-) s/could/couldn't/ //Ed On Tue, May 31, 2011 at 3:17 PM, Ed Summers e...@pobox.com wrote: On Tue, May 31, 2011 at 12:48 PM, Thomas Berger t...@gymel.com wrote: Currently about 150.000 articles on wikipedia.de carry the associated PND number, many of them also LoC-NA and VIAF numbers: Makes me wonder if we could use inter-wiki links to automatically update some of the en.wikipedia articles based on the viaf links in de.wikipedia. Could hurt to see how many there are I suppose. //Ed -- Stuart Yeates Library Technology Services http://www.victoria.ac.nz/library/
Re: [CODE4LIB] wikipedia/author disambiguation
Neat! Just tried the human-displayed links off the Immanuel Kant wikipedia page (http://en.wikipedia.org/wiki/Immanuel_Kant), created by the 'Authority Control' template that Daniel or someone else added. VIAF one works great, taking me to the human readable VIAF page. PND one seems to work too, taking me to the authority page in the Deutsche National Bibliothek. The LCCN one does not work. Tries to take me to: http://errol.oclc.org/laf/n79021614.html Which results in an HTTP 500 error from the OCLC server. Since this template apparently generates a URL to an OCLC service (rather than LC? I guess maybe LC itself doesn't have the right permalinks?), I think that OCLC probably ought to fix this. If the template is not creating the right URL, I guess you've got to work with wikipedia to fix it. Or fix your end to accept those URLs properly. Jonathan On 5/25/2011 12:47 PM, Ed Summers wrote: Hey Daniel, It looks like you used the worldcat template [1]: {{worldcat id|id=lccn-n79-21614|VIAF=82088490}} which doesn't actually do anything with the VIAF parameter. Instead (or as well) you'll want to use the Authority control template: {{Authority control|PND=118559796|LCCN=n/79/21614|VIAF=82088490}} After I did that and the crawl ran again it showed up at linkypedia [3]. Thanks for giving it a try! //Ed [1] http://en.wikipedia.org/wiki/Template:Worldcat_id [2] http://en.wikipedia.org/wiki/Template:Authority_control [3] http://linkypedia.info/websites/23/pages/ On Wed, May 25, 2011 at 9:17 AM, Lovins, Danieldaniel.lov...@yale.edu wrote: That's really cool, Ed. I just added the viaf # for Immanuel Kant. Took just a few seconds. I'll subscribe to the linkypedia rss feed and watch for notification. 
Daniel -Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Ed Summers Sent: Tuesday, May 24, 2011 4:59 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] wikipedia/author disambiguation Big +1 for promoting the use of the Authority Control Wikipedia template.I know i'm being a bit of a broken record, but you can watch as people add these by looking at or subscribing to: http://linkypedia.inkdroid.org/websites/23/pages/ Also, re: Jonathan's good advice to check out Wikipedia Miner [1] I just ran across Duke [2] today, which looks like it could help guide record linking a bit. Duke is a fast and flexible deduplication (or entity resolution, or record linkage) engine written in Java on top of Lucene. At the moment (2011-04-07) it can process 1,000,000 records in 11 minutes on a standard laptop in a single thread. Haven't tried it yet, so YMMV, etc. //Ed [1] http://wikipedia-miner.sourceforge.net/ [2] http://code.google.com/p/duke/
Re: [CODE4LIB] wikipedia/author disambiguation
On Tue, May 31, 2011 at 11:55 AM, Jonathan Rochkind rochk...@jhu.edu wrote: The LCCN one does not work. Tries to take me to: http://errol.oclc.org/laf/n79021614.html Which results in an HTTP 500 error from the OCLC server. Since this template apparently generates a URL to an OCLC service (rather than LC? I guess maybe LC itself doesn't have the right permalinks?), I think that OCLC probably ought to fix this. If the template is not creating the right URL, I guess you've got to work with wikipedia to fix it. Or fix your end to accept those URLs properly. As far as I know there aren't any permalinks for name authority records at loc.gov that use the LCCN. I've heard informally from some folks at OCLC that they plan to redirect these links to a URL at loc.gov if/when the name authority records are available from there. But I have no idea when that will happen unfortunately. //Ed
Re: [CODE4LIB] wikipedia/author disambiguation
This seems pretty fixable on OCLC's part, if they want to... The errol repository still works, see: http://errol.oclc.org/laf/n79021614.MarcXML (generated from) http://alcme.oclc.org/lcnaf/servlet/OAIHandler?verb=GetRecord&metadataPrefix=MarcXML&identifier=n79021614 So it's just a case of the rewrites doing the right thing for the .html redirects into OAICat or whatever it is. -Ross. On Tue, May 31, 2011 at 11:55 AM, Jonathan Rochkind rochk...@jhu.edu wrote: Neat! Just tried the human-displayed links off the Immanuel Kant wikipedia page (http://en.wikipedia.org/wiki/Immanuel_Kant), created by the 'Authority Control' template that Daniel or someone else added. VIAF one works great, taking me to the human readable VIAF page. PND one seems to work too, taking me to the authority page in the Deutsche National Bibliothek. The LCCN one does not work. Tries to take me to: http://errol.oclc.org/laf/n79021614.html Which results in an HTTP 500 error from the OCLC server. Since this template apparently generates a URL to an OCLC service (rather than LC? I guess maybe LC itself doesn't have the right permalinks?), I think that OCLC probably ought to fix this. If the template is not creating the right URL, I guess you've got to work with wikipedia to fix it. Or fix your end to accept those URLs properly. Jonathan On 5/25/2011 12:47 PM, Ed Summers wrote: Hey Daniel, It looks like you used the worldcat template [1]: {{worldcat id|id=lccn-n79-21614|VIAF=82088490}} which doesn't actually do anything with the VIAF parameter. Instead (or as well) you'll want to use the Authority control template: {{Authority control|PND=118559796|LCCN=n/79/21614|VIAF=82088490}} After I did that and the crawl ran again it showed up at linkypedia [3]. Thanks for giving it a try! 
//Ed [1] http://en.wikipedia.org/wiki/Template:Worldcat_id [2] http://en.wikipedia.org/wiki/Template:Authority_control [3] http://linkypedia.info/websites/23/pages/ On Wed, May 25, 2011 at 9:17 AM, Lovins, Danieldaniel.lov...@yale.edu wrote: That's really cool, Ed. I just added the viaf # for Immanuel Kant. Took just a few seconds. I'll subscribe to the linkypedia rss feed and watch for notification. Daniel -Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Ed Summers Sent: Tuesday, May 24, 2011 4:59 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] wikipedia/author disambiguation Big +1 for promoting the use of the Authority Control Wikipedia template.I know i'm being a bit of a broken record, but you can watch as people add these by looking at or subscribing to: http://linkypedia.inkdroid.org/websites/23/pages/ Also, re: Jonathan's good advice to check out Wikipedia Miner [1] I just ran across Duke [2] today, which looks like it could help guide record linking a bit. Duke is a fast and flexible deduplication (or entity resolution, or record linkage) engine written in Java on top of Lucene. At the moment (2011-04-07) it can process 1,000,000 records in 11 minutes on a standard laptop in a single thread. Haven't tried it yet, so YMMV, etc. //Ed [1] http://wikipedia-miner.sourceforge.net/ [2] http://code.google.com/p/duke/
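An OAI-PMH GetRecord request like the one Ross points to takes three query parameters (verb, metadataPrefix, identifier) joined with `&`. A minimal sketch of building such a URL follows; the base URL and the MarcXML prefix are taken from his message, while the function name is made up for illustration, and nothing here says the service still answers.

```python
from urllib.parse import urlencode

# Base URL as quoted in the thread; whether it still resolves is not guaranteed.
OAI_BASE = "http://alcme.oclc.org/lcnaf/servlet/OAIHandler"


def oai_get_record_url(identifier, metadata_prefix="MarcXML", base=OAI_BASE):
    """Build an OAI-PMH GetRecord request URL for one authority record.

    oai_get_record_url is a hypothetical helper name, not part of any library.
    """
    params = [
        ("verb", "GetRecord"),
        ("metadataPrefix", metadata_prefix),
        ("identifier", identifier),
    ]
    # urlencode joins the pairs with "&" and percent-encodes the values.
    return base + "?" + urlencode(params)


if __name__ == "__main__":
    print(oai_get_record_url("n79021614"))
```

Building the query string with urlencode rather than by hand avoids dropping a separator when the URL is pasted around.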
Re: [CODE4LIB] wikipedia/author disambiguation
On 31.05.2011 17:55, Jonathan Rochkind wrote: VIAF one works great, taking me to the human readable VIAF page. PND one seems to work too, taking me to the authority page in the Deutsche National Bibliothek. The LCCN one does not work. Tries to take me to: http://errol.oclc.org/laf/n79021614.html Which results in an HTTP 500 error from the OCLC server. Since this template apparently generates a URL to an OCLC service (rather than LC? I guess maybe LC itself doesn't have the right permalinks?), I think that OCLC probably ought to fix this. If the template is not creating the right URL, I guess you've got to work with wikipedia to fix it. Or fix your end to accept those URLs properly.
IIRC these links are the same ones VIAF employs to link to a representation of the NAF records, and they have been broken for about 6 weeks now.
To my knowledge the {{Authority Control}} metadata in the English Wikipedia was inspired by a similar effort in the German Wikipedia, which has recorded authority numbers for persons since 2005. They started with PND numbers (Personennormdatei, the collaborative authority file for German and Austrian libraries), backed by an agreement with the German National Library (Deutsche Nationalbibliothek, DNB) to provide mutual links from authority records to Wikipedia and vice versa. Currently about 150.000 articles on wikipedia.de carry the associated PND number, many of them also LoC-NA and VIAF numbers: http://de.wikipedia.org/wiki/Vorlage:NORMDATENCOUNT
The links from portal.d-nb.de to wikipedia.de are implemented neither by 856-like manifest URLs in the authority records nor by some kind of Wikipedia number stored as an additional identification number. Rather, wikipedia.de publishes on a daily basis a trivial concordance table relating extracted PND numbers to the corresponding Wikipedia lemma. The DNB portal in turn incorporates this table and generates the respective links on the fly whenever an affected authority record is displayed.
Some biographical dictionaries, regional bibliographies, classical OPACs and historical projects picked up this mechanism and published their own tables of this kind, all using the PND identification number as the common system of reference. This (as such a low-tech approach to the semantic web) was coined PND-BEACON: http://de.wikipedia.org/wiki/Wikipedia:PND/BEACON (English version: http://meta.wikimedia.org/wiki/BEACON ) CKAN data package: http://ckan.net/package/pndbeacon
Publishing such beacon files presupposes that your data already carries more-than-local identification numbers. With this precondition met, the gain is twofold:
- publishing a beacon file may direct visitors from the incorporators of the file to your catalogue
- the existing authority numbers in your catalogue enable you to relate (via their beacon files) to other web resources, thus rounding out the data you present.
Thomas Berger
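To make the concordance-table idea concrete, here is a rough sketch of how a consumer might load such a table and generate links on the fly. The line layout used here ("PND number|lemma", with "#" lines skipped) is a simplification assumed for illustration, not the actual BEACON file format, and the function names are hypothetical. The PND number for Kant is the one quoted elsewhere in this thread.

```python
from urllib.parse import quote

# Hypothetical concordance lines: "<PND number>|<Wikipedia lemma>".
# Real BEACON files have their own header and field conventions.
SAMPLE = """\
# assumed header line, ignored by this sketch
118559796|Immanuel Kant
"""


def load_concordance(text):
    """Read the simplified table into a dict mapping PND number -> lemma."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        pnd, _, lemma = line.partition("|")
        if lemma:
            table[pnd.strip()] = lemma.strip()
    return table


def wikipedia_link(pnd, table, base="http://de.wikipedia.org/wiki/"):
    """Return a link for an authority record, or None if the table has no entry."""
    lemma = table.get(pnd)
    if lemma is None:
        return None
    return base + quote(lemma.replace(" ", "_"))


if __name__ == "__main__":
    table = load_concordance(SAMPLE)
    print(wikipedia_link("118559796", table))
```

The point of the design is that the authority records themselves never store the link; only the shared identifier is needed, and the table can be refreshed daily without touching the records.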
Re: [CODE4LIB] wikipedia/author disambiguation
a bit of a Freudian slip there I suppose :-) s/could/couldn't/ //Ed On Tue, May 31, 2011 at 3:17 PM, Ed Summers e...@pobox.com wrote: On Tue, May 31, 2011 at 12:48 PM, Thomas Berger t...@gymel.com wrote: Currently about 150.000 articles on wikipedia.de carry the associated PND number, many of them also LoC-NA and VIAF numbers: Makes me wonder if we could use inter-wiki links to automatically update some of the en.wikipedia articles based on the viaf links in de.wikipedia. Could hurt to see how many there are I suppose. //Ed
Re: [CODE4LIB] wikipedia/author disambiguation
On Tue, May 31, 2011 at 12:48 PM, Thomas Berger t...@gymel.com wrote: Currently about 150.000 articles on wikipedia.de carry the associated PND number, many of them also LoC-NA and VIAF numbers: Makes me wonder if we could use inter-wiki links to automatically update some of the en.wikipedia articles based on the viaf links in de.wikipedia. Could hurt to see how many there are I suppose. //Ed
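Ed's inter-wiki idea could be sketched roughly as below. It assumes three inputs have already been gathered by some other means (none of which is shown in the thread): a map of de.wikipedia titles to VIAF numbers, a map of de titles to their en.wikipedia inter-wiki targets, and the set of en titles that already carry an authority template. All names and the second sample entry are made up; the Kant VIAF number is the one quoted in this thread.

```python
def propagate_viaf(de_viaf, de_to_en, en_already_tagged):
    """Suggest VIAF numbers for en.wikipedia titles via de -> en inter-wiki links.

    de_viaf:           dict, de.wikipedia title -> VIAF number
    de_to_en:          dict, de.wikipedia title -> en.wikipedia title
    en_already_tagged: set of en.wikipedia titles that already have a VIAF number
    """
    suggestions = {}
    for de_title, viaf in de_viaf.items():
        en_title = de_to_en.get(de_title)
        if en_title is None:
            continue  # no inter-wiki link, nothing to propagate
        if en_title in en_already_tagged:
            continue  # leave existing data alone
        suggestions[en_title] = viaf
    return suggestions


if __name__ == "__main__":
    de_viaf = {"Immanuel Kant": "82088490", "Beispielperson": "0"}  # second entry is a placeholder
    de_to_en = {"Immanuel Kant": "Immanuel Kant"}
    found = propagate_viaf(de_viaf, de_to_en, en_already_tagged=set())
    print(len(found), "candidate articles")
```

Counting the suggestions is the "see how many there are" step; a human review pass before any automatic edit would still seem prudent.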
Re: [CODE4LIB] wikipedia/author disambiguation
Some reflection suggests that for this authority stuff to really take off we have to make it _really_ easy for wikipedians to do. As a model of how this could be done (and as a template to steal^H^H^H^H^Hreuse code from) I suggest HotCat: https://secure.wikimedia.org/wikipedia/en/wiki/HotCat HotCat adds a little GUI to the bottom of pages to make it really fast and easy to add pages to wikipedia categories. In my complete ignorance of how both javascript and VIAF work, it seems it should be possible to rewrite it to look up VIAF (with the query defaulting to the current page title) and present a list of possible matches for the user to pick one. cheers stuart On 26/05/11 11:16, Ed Summers wrote: The user profile pages that reference the website should eventually (1 or 2 days) turn up under the Users tab, e.g. http://linkypedia.inkdroid.org/websites/23/users/ I don't see you there yet though :-) //Ed On Wed, May 25, 2011 at 5:03 PM, Karen Coyle li...@kcoyle.net wrote: Hi, Ed. Do you pick up user pages or just wikipedia entry pages? (I added mine to my user page, just for fun.) kc -- Stuart Yeates Library Technology Services http://www.victoria.ac.nz/library/
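The "query defaulting to the current page title" step might look something like the sketch below. It is written in Python rather than the JavaScript a HotCat-style gadget would need, and it only builds the lookup URL without fetching anything. The endpoint URL and the `query` parameter name are assumptions that would need checking against the VIAF documentation, and both function names are hypothetical.

```python
import re
from urllib.parse import urlencode

# Assumed suggestion endpoint; verify against the VIAF documentation before use.
SUGGEST_ENDPOINT = "http://viaf.org/viaf/AutoSuggest"


def default_query(page_title):
    """Turn a Wikipedia page title into a plain name to search for."""
    title = page_title.replace("_", " ")
    # Drop a trailing disambiguator such as " (designer)".
    title = re.sub(r"\s*\([^)]*\)\s*$", "", title)
    return title.strip()


def lookup_url(page_title, endpoint=SUGGEST_ENDPOINT):
    """Build the lookup URL whose results a gadget would show as a pick list."""
    return endpoint + "?" + urlencode({"query": default_query(page_title)})


if __name__ == "__main__":
    print(lookup_url("Immanuel_Kant"))
```

Stripping the parenthetical matters because Wikipedia's own disambiguation suffix is exactly the part an authority file will not contain.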
Re: [CODE4LIB] wikipedia/author disambiguation
The lccn links from the template have been giving a java exception for the last few days at least: does the template or the server need fixing? Graham On 05/25/11 17:47, Ed Summers wrote: Hey Daniel, It looks like you used the worldcat template [1]: {{worldcat id|id=lccn-n79-21614|VIAF=82088490}} which doesn't actually do anything with the VIAF parameter. Instead (or as well) you'll want to use the Authority control template: {{Authority control|PND=118559796|LCCN=n/79/21614|VIAF=82088490}} After I did that and the crawl ran again it showed up at linkypedia [3]. Thanks for giving it a try! //Ed [1] http://en.wikipedia.org/wiki/Template:Worldcat_id [2] http://en.wikipedia.org/wiki/Template:Authority_control [3] http://linkypedia.info/websites/23/pages/ On Wed, May 25, 2011 at 9:17 AM, Lovins, Daniel daniel.lov...@yale.edu wrote: That's really cool, Ed. I just added the viaf # for Immanuel Kant. Took just a few seconds. I'll subscribe to the linkypedia rss feed and watch for notification. Daniel -Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Ed Summers Sent: Tuesday, May 24, 2011 4:59 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] wikipedia/author disambiguation Big +1 for promoting the use of the Authority Control Wikipedia template.I know i'm being a bit of a broken record, but you can watch as people add these by looking at or subscribing to: http://linkypedia.inkdroid.org/websites/23/pages/ Also, re: Jonathan's good advice to check out Wikipedia Miner [1] I just ran across Duke [2] today, which looks like it could help guide record linking a bit. Duke is a fast and flexible deduplication (or entity resolution, or record linkage) engine written in Java on top of Lucene. At the moment (2011-04-07) it can process 1,000,000 records in 11 minutes on a standard laptop in a single thread. Haven't tried it yet, so YMMV, etc. //Ed [1] http://wikipedia-miner.sourceforge.net/ [2] http://code.google.com/p/duke/
Re: [CODE4LIB] wikipedia/author disambiguation
It's the server unfortunately. I think OCLC is trying to figure out what to do with errol ... there's a thread on the wc-devnet-l if you are interested: http://listserv.oclc.org/scripts/wa.exe?A2=ind1105dL=wc-devnet-lT=0F=PX=4D30895CB90D4C912FP=73 //Ed On Thu, May 26, 2011 at 5:15 PM, Graham Seaman gra...@theseamans.net wrote: The lccn links from the template have been giving a java exception for the last few days at least: does the template or the server need fixing?
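For anyone wanting to check the template side independently of the server, the mapping from the template's slash form (LCCN=n/79/21614) to the identifier in the errol URL (n79021614) can be sketched as follows. The zero-padding rule is inferred from the two LCCNs quoted in this thread, not read from the template source, so treat it as an assumption; the function names are made up.

```python
def lccn_template_to_id(value):
    """Convert 'prefix/year/serial' (as used in the template) to a compact LCCN.

    Assumption: the serial part is left-padded with zeros to six digits.
    """
    parts = value.strip().split("/")
    if len(parts) != 3:
        raise ValueError("expected prefix/year/serial, got %r" % value)
    prefix, year, serial = parts
    if not serial.isdigit() or len(serial) > 6:
        raise ValueError("unexpected serial part in %r" % value)
    return prefix + year + serial.zfill(6)


def errol_html_url(value):
    """Build the errol .html link of the form quoted in the thread."""
    return "http://errol.oclc.org/laf/" + lccn_template_to_id(value) + ".html"


if __name__ == "__main__":
    print(errol_html_url("n/79/21614"))
```

If the URL built this way matches what the template emits, the template is doing its part and the HTTP 500 is on the server side, which is consistent with Ed's answer.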
Re: [CODE4LIB] wikipedia/author disambiguation
That's really cool, Ed. I just added the viaf # for Immanuel Kant. Took just a few seconds. I'll subscribe to the linkypedia rss feed and watch for notification. Daniel -Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Ed Summers Sent: Tuesday, May 24, 2011 4:59 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] wikipedia/author disambiguation Big +1 for promoting the use of the Authority Control Wikipedia template. I know I'm being a bit of a broken record, but you can watch as people add these by looking at or subscribing to: http://linkypedia.inkdroid.org/websites/23/pages/ Also, re: Jonathan's good advice to check out Wikipedia Miner [1] I just ran across Duke [2] today, which looks like it could help guide record linking a bit. Duke is a fast and flexible deduplication (or entity resolution, or record linkage) engine written in Java on top of Lucene. At the moment (2011-04-07) it can process 1,000,000 records in 11 minutes on a standard laptop in a single thread. Haven't tried it yet, so YMMV, etc. //Ed [1] http://wikipedia-miner.sourceforge.net/ [2] http://code.google.com/p/duke/
Re: [CODE4LIB] wikipedia/author disambiguation
Oops. Thanks Ed! -Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Ed Summers Sent: Wednesday, May 25, 2011 12:47 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] wikipedia/author disambiguation Hey Daniel, It looks like you used the worldcat template [1]: {{worldcat id|id=lccn-n79-21614|VIAF=82088490}} which doesn't actually do anything with the VIAF parameter. Instead (or as well) you'll want to use the Authority control template: {{Authority control|PND=118559796|LCCN=n/79/21614|VIAF=82088490}} After I did that and the crawl ran again it showed up at linkypedia [3]. Thanks for giving it a try! //Ed [1] http://en.wikipedia.org/wiki/Template:Worldcat_id [2] http://en.wikipedia.org/wiki/Template:Authority_control [3] http://linkypedia.info/websites/23/pages/ On Wed, May 25, 2011 at 9:17 AM, Lovins, Daniel daniel.lov...@yale.edu wrote: That's really cool, Ed. I just added the viaf # for Immanuel Kant. Took just a few seconds. I'll subscribe to the linkypedia rss feed and watch for notification. Daniel -Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Ed Summers Sent: Tuesday, May 24, 2011 4:59 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] wikipedia/author disambiguation Big +1 for promoting the use of the Authority Control Wikipedia template.I know i'm being a bit of a broken record, but you can watch as people add these by looking at or subscribing to: http://linkypedia.inkdroid.org/websites/23/pages/ Also, re: Jonathan's good advice to check out Wikipedia Miner [1] I just ran across Duke [2] today, which looks like it could help guide record linking a bit. Duke is a fast and flexible deduplication (or entity resolution, or record linkage) engine written in Java on top of Lucene. At the moment (2011-04-07) it can process 1,000,000 records in 11 minutes on a standard laptop in a single thread. Haven't tried it yet, so YMMV, etc. 
//Ed [1] http://wikipedia-miner.sourceforge.net/ [2] http://code.google.com/p/duke/
Re: [CODE4LIB] wikipedia/author disambiguation
Hi, Ed. Do you pick up user pages or just wikipedia entry pages? (I added mine to my user page, just for fun.) kc Quoting Ed Summers e...@pobox.com: Hey Daniel, It looks like you used the worldcat template [1]: {{worldcat id|id=lccn-n79-21614|VIAF=82088490}} which doesn't actually do anything with the VIAF parameter. Instead (or as well) you'll want to use the Authority control template: {{Authority control|PND=118559796|LCCN=n/79/21614|VIAF=82088490}} After I did that and the crawl ran again it showed up at linkypedia [3]. Thanks for giving it a try! //Ed [1] http://en.wikipedia.org/wiki/Template:Worldcat_id [2] http://en.wikipedia.org/wiki/Template:Authority_control [3] http://linkypedia.info/websites/23/pages/ On Wed, May 25, 2011 at 9:17 AM, Lovins, Daniel daniel.lov...@yale.edu wrote: That's really cool, Ed. I just added the viaf # for Immanuel Kant. Took just a few seconds. I'll subscribe to the linkypedia rss feed and watch for notification. Daniel -Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Ed Summers Sent: Tuesday, May 24, 2011 4:59 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] wikipedia/author disambiguation Big +1 for promoting the use of the Authority Control Wikipedia template.I know i'm being a bit of a broken record, but you can watch as people add these by looking at or subscribing to: http://linkypedia.inkdroid.org/websites/23/pages/ Also, re: Jonathan's good advice to check out Wikipedia Miner [1] I just ran across Duke [2] today, which looks like it could help guide record linking a bit. Duke is a fast and flexible deduplication (or entity resolution, or record linkage) engine written in Java on top of Lucene. At the moment (2011-04-07) it can process 1,000,000 records in 11 minutes on a standard laptop in a single thread. Haven't tried it yet, so YMMV, etc. 
//Ed [1] http://wikipedia-miner.sourceforge.net/ [2] http://code.google.com/p/duke/ -- Karen Coyle kco...@kcoyle.net http://kcoyle.net ph: 1-510-540-7596 m: 1-510-435-8234 skype: kcoylenet
Re: [CODE4LIB] wikipedia/author disambiguation
The user profile pages that reference the website should eventually (1 or 2 days) turn up under the Users tab, e.g. http://linkypedia.inkdroid.org/websites/23/users/ I don't see you there yet though :-) //Ed On Wed, May 25, 2011 at 5:03 PM, Karen Coyle li...@kcoyle.net wrote: Hi, Ed. Do you pick up user pages or just wikipedia entry pages? (I added mine to my user page, just for fun.) kc
Re: [CODE4LIB] wikipedia/author disambiguation
> Are there any guidelines us wikipedians should be using to increase the likelihood of matches? I'm thinking in particular of the representation of uncertain dates.
The first step in the process is identifying entries for people. The presence of a Persondata block in the article is very helpful. An unambiguous way to encode titles would help. Apparently there are markup rules for titles, but they are much abused by editors using them to emphasize text in articles and are unreliable to the point of being ignored. Dates are not a big deal. Wikipedia records are full of dates and we cope with fuzziness in dates reasonably well.
> Are there perhaps 'ground truth' URLs such as links into identities / VIAF which we can use? If so, what is the exact form of those URLs?
There is a template for entering WorldCat Identities URLs. I don't believe there is one yet for VIAF. We've been working for a couple of years now on getting permission to put Identities and VIAF links into Wikipedia records. As it happens, several of us in Research are meeting again on Thursday to discuss this. Apparently there has been some sort of movement on that topic. Any help you can provide would be appreciated!
Ralph
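As an illustration of why a Persondata block helps with the "which pages are people" step, here is a small sketch that pulls the block out of article wikitext. This is not OCLC's extraction code; the sample wikitext is hand-written for the example, the function name is hypothetical, and the naive splitting on "|" would break on values that contain links or nested templates.

```python
import re

# Hand-written sample for illustration; real articles carry more fields.
SAMPLE_WIKITEXT = """\
'''Immanuel Kant''' was a German philosopher.
{{Persondata
| NAME              = Kant, Immanuel
| DATE OF BIRTH     = 22 April 1724
| DATE OF DEATH     = 12 February 1804
}}
"""


def parse_persondata(wikitext):
    """Return the Persondata fields as a dict, or None if the block is absent."""
    match = re.search(r"\{\{\s*Persondata(.*?)\}\}", wikitext,
                      re.DOTALL | re.IGNORECASE)
    if match is None:
        return None  # no block, so no direct evidence the page is about a person
    fields = {}
    for part in match.group(1).split("|"):
        key, sep, value = part.partition("=")
        if sep:
            fields[key.strip().upper()] = value.strip()
    return fields


if __name__ == "__main__":
    print(parse_persondata(SAMPLE_WIKITEXT))
```

A page that returns a dict here is a candidate person record, and the date fields are the kind of data Ralph mentions using for matching.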
Re: [CODE4LIB] wikipedia/author disambiguation
Ralph, all, Regarding the Wikipedia template for VIAF, haven't tried it myself, but I believe the following syntax works: {{Authority control|PND=119408643|LCCN=n/79/113947|VIAF=59263727}} Based on description on this Wikipedia page: http://en.wikipedia.org/wiki/Template:Authority_control / Daniel -Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Ralph LeVan Sent: Tuesday, May 24, 2011 10:23 AM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] wikipedia/author disambiguation Are there any guidelines us wikipedians should be using to increase the likelihood of matches? I'm thinking in particular of the representation of uncertain dates. The first step in the process is identifying entries for people. The presence of a Persondata block in the article is very helpful. An unambiguous way to encode titles would help. Apparently there are markup rules for titles, but they are much abused by editors using them to emphasize text in articles and are unreliable to the point of being ignored. Dates is not a big deal. Wikipedia records are full of dates and we cope with fuzziness in dates reasonably well. Are there perhaps 'ground truth' URLs such as links into identities / VIAF which we can use? If so, what is the exact form of those URLs? There is a template for entering WorldCat Identities URLs. I don't believe there is one yet for VIAF. We've been working for a couple of years now on getting permission to put Identities and VIAF links into Wikipedia records. As it happens, several of us in Research are meeting again on Thursday to discuss this. Apparently there has been some sort of movement on that topic. Any help you can provide would be appreciated! Ralph
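For tools that want to read those identifiers back out of an article, a rough sketch of parsing the template call Daniel quotes is shown below. It only handles the simple flat form seen in this thread (no nested templates inside the parameters), and the function name is made up.

```python
import re


def parse_authority_control(wikitext):
    """Extract the parameters of a flat {{Authority control|...}} call as a dict.

    Returns an empty dict when the template is not present.
    """
    match = re.search(r"\{\{\s*Authority control\s*\|([^{}]*)\}\}",
                      wikitext, re.IGNORECASE)
    if match is None:
        return {}
    ids = {}
    for part in match.group(1).split("|"):
        key, sep, value = part.partition("=")
        if sep and value.strip():
            ids[key.strip().upper()] = value.strip()
    return ids


if __name__ == "__main__":
    text = "{{Authority control|PND=119408643|LCCN=n/79/113947|VIAF=59263727}}"
    print(parse_authority_control(text))
```

Because the match is on the template name, a page that only uses a different template with a VIAF parameter yields nothing here.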
Re: [CODE4LIB] wikipedia/author disambiguation
Very cool! Thanks! Ralph On Tue, May 24, 2011 at 10:36 AM, Lovins, Daniel daniel.lov...@yale.eduwrote: Ralph, all, Regarding the Wikipedia template for VIAF, haven't tried it myself, but I believe the following syntax works: {{Authority control|PND=119408643|LCCN=n/79/113947|VIAF=59263727}} Based on description on this Wikipedia page: http://en.wikipedia.org/wiki/Template:Authority_control / Daniel -Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Ralph LeVan Sent: Tuesday, May 24, 2011 10:23 AM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] wikipedia/author disambiguation Are there any guidelines us wikipedians should be using to increase the likelihood of matches? I'm thinking in particular of the representation of uncertain dates. The first step in the process is identifying entries for people. The presence of a Persondata block in the article is very helpful. An unambiguous way to encode titles would help. Apparently there are markup rules for titles, but they are much abused by editors using them to emphasize text in articles and are unreliable to the point of being ignored. Dates is not a big deal. Wikipedia records are full of dates and we cope with fuzziness in dates reasonably well. Are there perhaps 'ground truth' URLs such as links into identities / VIAF which we can use? If so, what is the exact form of those URLs? There is a template for entering WorldCat Identities URLs. I don't believe there is one yet for VIAF. We've been working for a couple of years now on getting permission to put Identities and VIAF links into Wikipedia records. As it happens, several of us in Research are meeting again on Thursday to discuss this. Apparently there has been some sort of movement on that topic. Any help you can provide would be appreciated! Ralph
Re: [CODE4LIB] wikipedia/author disambiguation
Daniel, the template's very good indeed. Note, though, that VIAF already includes LCCN and Deutsche Nationalbibliothek. It suffices for such name needs in WP. That reinforces Ralph's call for linking Identities and VIAF into Wikipedia records.
Ya'aqov
On Tue, May 24, 2011 at 9:36 AM, Lovins, Daniel daniel.lov...@yale.edu wrote: Ralph, all, Regarding the Wikipedia template for VIAF, haven't tried it myself, but I believe the following syntax works: {{Authority control|PND=119408643|LCCN=n/79/113947|VIAF=59263727}} Based on description on this Wikipedia page: http://en.wikipedia.org/wiki/Template:Authority_control / Daniel -Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Ralph LeVan Sent: Tuesday, May 24, 2011 10:23 AM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] wikipedia/author disambiguation Are there any guidelines us wikipedians should be using to increase the likelihood of matches? I'm thinking in particular of the representation of uncertain dates. The first step in the process is identifying entries for people. The presence of a Persondata block in the article is very helpful. An unambiguous way to encode titles would help. Apparently there are markup rules for titles, but they are much abused by editors using them to emphasize text in articles and are unreliable to the point of being ignored. Dates is not a big deal. Wikipedia records are full of dates and we cope with fuzziness in dates reasonably well. Are there perhaps 'ground truth' URLs such as links into identities / VIAF which we can use? If so, what is the exact form of those URLs? There is a template for entering WorldCat Identities URLs. I don't believe there is one yet for VIAF. We've been working for a couple of years now on getting permission to put Identities and VIAF links into Wikipedia records. As it happens, several of us in Research are meeting again on Thursday to discuss this. Apparently there has been some sort of movement on that topic. Any help you can provide would be appreciated! Ralph
-- ya'aqov ZISO | yaaq...@gmail.com | 856 217 3456
Re: [CODE4LIB] wikipedia/author disambiguation
On 25/05/11 02:36, Lovins, Daniel wrote: Ralph, all, Regarding the Wikipedia template for VIAF, haven't tried it myself, but I believe the following syntax works: {{Authority control|PND=119408643|LCCN=n/79/113947|VIAF=59263727}} Based on description on this Wikipedia page: http://en.wikipedia.org/wiki/Template:Authority_control Excellent! The page history indicates that it's an import from the German language wikipedia, thus the good support for German Persons, Corporations and Subjects. I suspect that the template would get more widely used if it were discussed at https://secure.wikimedia.org/wikipedia/en/wiki/Wikipedia:GLAM_getting_started (recently renamed from https://secure.wikimedia.org/wikipedia/en/wiki/Wikipedia:Advice_for_the_cultural_sector ) which is a canonical starting-point for libraries people. The structure of the template is interesting. Due to the way wikipedia works, the barriers to adding something to a page (in this case, a template) are much lower than adding a new page. So adding a new authority to the page should be relatively easy, providing it works in the same way as the other authorities already there. cheers stuart -- Stuart Yeates Library Technology Services http://www.victoria.ac.nz/library/
Re: [CODE4LIB] wikipedia/author disambiguation
Big +1 for promoting the use of the Authority Control Wikipedia template. I know I'm being a bit of a broken record, but you can watch as people add these by looking at or subscribing to: http://linkypedia.inkdroid.org/websites/23/pages/ Also, re: Jonathan's good advice to check out Wikipedia Miner [1] I just ran across Duke [2] today, which looks like it could help guide record linking a bit. Duke is a fast and flexible deduplication (or entity resolution, or record linkage) engine written in Java on top of Lucene. At the moment (2011-04-07) it can process 1,000,000 records in 11 minutes on a standard laptop in a single thread. Haven't tried it yet, so YMMV, etc. //Ed [1] http://wikipedia-miner.sourceforge.net/ [2] http://code.google.com/p/duke/
Re: [CODE4LIB] wikipedia/author disambiguation
I think you misunderstood that, Ya'aqov. What we do is make local authority records out of the Wikipedia records that we've identified as names. So the adding of dates and stuff is to the local authority record of that Wikipedia record. We then use our usual VIAF matching technology between those Wikipedia authority records and the other authority records in VIAF. Wikipedia records that end up in a VIAF cluster get kept and the others get dropped as not matching anything we have in VIAF. I hope that helps!
Ralph
-Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Ya'aqov Ziso Sent: Sunday, May 22, 2011 5:15 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] wikipedia/author disambiguation Thanks Karen, but you don't indicate yet, how you solve disambiguation? You indicate how you use WP as a resource for adding dates and subjects when they are missing. You don't indicate when/how you are resolving ambiguities with WP data. Again, please use Morris William as an example, Ya'aqov
Once a year OCLC downloads Wikipedia and then we extract as much information from it as we can. This generally involves reading through their current information for templates, etc. Then we try to figure out which pages are people. Within the people pages we look for birth dates, death dates, work titles, ISBNs, oclc numbers, worldcat identity links, LCCNs ... anything that we have in VIAF for matching purposes. Then we build marc-ish records for each extracted person. After that the records go through the normal VIAF matching processes. The process gets changed and tweaked each year.
Re: [CODE4LIB] wikipedia/author disambiguation
Oh yes, your clarification helps, Ralph. So WP data ends up in a cluster (more than one entity) for a certain string that applies to more than one person/heading (therefore it is ambiguous). What processes is VIAF running to dis-ambiguate THAT heading? Ya'aqov -- ya'aqov ZISO | yaaq...@gmail.com | 856 217 3456
Re: [CODE4LIB] wikipedia/author disambiguation
We look for common work titles and common dates primarily. I believe there are also sometimes actual links to other authority records. Ralph
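The two signals Ralph names (common dates and common work titles) can be illustrated with a small Python sketch. This is not VIAF's matching code, which the thread does not show; the `PersonRecord` type, the scoring rule, and the idea that a date conflict vetoes a match are all assumptions made for the example. The second authority record's birth year is invented purely to show a conflict.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class PersonRecord:
    # Hypothetical minimal record; real authority records carry far more.
    name: str
    birth_year: Optional[int] = None
    death_year: Optional[int] = None
    titles: Set[str] = field(default_factory=set)


def normalize(title):
    """Lowercase and collapse whitespace so trivially different titles compare equal."""
    return " ".join(title.lower().split())


def match_score(a, b):
    """Assumed rule: conflicting dates veto the match; agreeing dates and shared titles add 1 each."""
    score = 0
    for x, y in ((a.birth_year, b.birth_year), (a.death_year, b.death_year)):
        if x is not None and y is not None:
            if x != y:
                return 0  # known dates disagree: treat as different people
            score += 1
    shared = {normalize(t) for t in a.titles} & {normalize(t) for t in b.titles}
    return score + len(shared)


wiki = PersonRecord("William Morris", 1834, 1896, {"News from Nowhere"})
same = PersonRecord("Morris, William", 1834, 1896,
                    {"News from nowhere", "The Earthly Paradise"})
# The birth year 1900 below is made up for illustration only.
other = PersonRecord("Morris, William", 1900, None, {"Introduction to Fly Fishing"})

print(match_score(wiki, same), match_score(wiki, other))
```

Treating a date conflict as a hard veto is a deliberately blunt choice; uncertain or approximate dates, which come up later in this thread, would need a softer comparison.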
Re: [CODE4LIB] wikipedia/author disambiguation
On 24/05/11 01:04, LeVan,Ralph wrote: What we do is make local authority records out of the Wikipedia records that we've identified as names. So the adding dates and stuff is to the local authority record of that Wikipedia record. We then use our usual VIAF matching technology between those Wikipedia authority records and the other authority records in VIAF. Wikipedia records that end up in a VIAF cluster get kept and the others get dropped as not matching anything we have in VIAF. Very interesting. Are there any guidelines us wikipedians should be using to increase the likelihood of matches? I'm thinking in particular of the representation of uncertain dates. Are there perhaps 'ground truth' URLs such as links into identities / VIAF which we can use? If so, what is the exact form of those URLs? cheers stuart -- Stuart Yeates Library Technology Services http://www.victoria.ac.nz/library/
Re: [CODE4LIB] wikipedia/author disambiguation
On 21/05/11 16:29, Ya'aqov Ziso wrote: - Ludvig van Beethoven doesn't need much disambiguation. Are you sure? http://toolserver.org/~dispenser/cgi-bin/rdcheck.py?page=Ludwig_van_Beethoven cheers stuart -- Stuart Yeates Library Technology Services http://www.victoria.ac.nz/library/
Re: [CODE4LIB] wikipedia/author disambiguation
Ode to your cheers: as a personal name, with birth and death dates, in NAF, is not an ambiguous heading. toolserver.org is all yours. Joy to Stuart. -- ya'aqov ZISO | yaaq...@gmail.com | 856 217 3456
Re: [CODE4LIB] wikipedia/author disambiguation
Thanks Karen, but you haven't yet indicated how you solve disambiguation. You indicate how you use WP as a resource for adding dates and subjects when they are missing. You don't indicate when/how you are resolving ambiguities with WP data. Again, please use Morris, William as an example. Ya'aqov *Once a year OCLC downloads Wikipedia and then we extract as much information from it as we can. This generally involves reading through their current information for templates, etc. Then we try to figure out which pages are people. Within the people pages we look for birth dates, death dates, work titles, ISBNs, OCLC numbers, WorldCat Identities links, LCCNs ... anything that we have in VIAF for matching purposes. Then we build MARC-ish records for each extracted person. After that the records go through the normal VIAF matching processes. The process gets changed and tweaked each year.*
Re: [CODE4LIB] wikipedia/author disambiguation
Thanks Karen. Looks like the OL code uses birth date + name, which is just what I was thinking of doing. Although you say it's meant to run against a Wikipedia dump, it looks like it should actually work against the Wikipedia API too, with small changes. But I need this in PHP, so I'll have to pick through it and convert from Perl and Python before I can test it properly. What was the JavaScript code you referenced below? Your link is to jQuery, rather than the script you mentioned in the text. Graham
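As a rough illustration of the "name + birth date against the Wikipedia API" idea, the Python sketch below only builds a MediaWiki full-text search URL (`action=query`, `list=search`). It is not the Open Library code, and tossing the birth year into the search terms is an assumption made for the example; the returned candidates would still have to be checked against the dates and titles from the catalogue record.

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def search_url(name, birth_year=None, limit=5):
    """Build a MediaWiki search URL for a person, optionally adding the birth year as a search term."""
    terms = name if birth_year is None else f"{name} {birth_year}"
    params = {
        "action": "query",
        "list": "search",
        "srsearch": terms,
        "srlimit": limit,
        "format": "json",
    }
    return API + "?" + urlencode(params)


print(search_url("William Morris", 1834))
```

Fetching the URL and reading `query.search` from the JSON response is left out so that the sketch runs without network access.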
Re: [CODE4LIB] wikipedia/author disambiguation
Hi Karen, Thanks for the code. As far as I can see, though, it doesn't actually solve my disambiguation problem: since identity_info.php just takes a name as input, it can't guess which of the people with this name is meant other than by using the most commonly referenced one, which in the OCLC data actually seems to often be an amalgam of several people with the name. For example, http://worldcat.org/identities/viaf-DNB|100804799 is William Morris, the 18th century African-American engineer whose most widely held works include News from Nowhere, Introduction to Fly Fishing, and Ancient Slavery Disapproved of by God, i.e. an amalgamation of the various most famous people known by this name. I guess this is just a hard problem overall. Graham
Re: [CODE4LIB] wikipedia/author disambiguation
Karen, I'll have to get Ralph LeVan or Thom Hickey to comment on how we were able to create the Wikipedia links in Identities. I don't know the details, just that the data is there. Karen
Re: [CODE4LIB] wikipedia/author disambiguation
Karen,
- Identities in WorldCat are based on literary warrant, i.e., names for people who authored/edited something or were the subject of someone else's literary work. Personal names in Wikipedia are not entered according to literary warrant. Nor is their form vetted according to NAF.
- Ludvig van Beethoven doesn't need much disambiguation. Nor does Mark Twain.
- So yes, Karen/Ralph/Tom: how exactly is Wikipedia used for disambiguation? Are you certain it's used for THAT purpose? If yes, can you send us a proper example? Morris, William seems a good example.
Ya'aqov -- ya'aqov ZISO | yaaq...@gmail.com | 856 217 3456
Re: [CODE4LIB] wikipedia/author disambiguation
Graham, I'd advocate using WorldCat Identities to get to the appropriate url for dbpedia. Each Identity record has a wikipedia element in it that you could use to link to either Wikipedia or dbpedia. If you want to see an example of this in action you can check out the Author Info demo I did for code4lib 2010 here - http://www.librarywebchic.net/mashups/author_info/info_about_this_author.php?OCLCNum=32939031 The code for this demo is available for download at - http://www.worldcat.org/devnet/code/devnetDemos/trunk/ You'll want the author_info folder and identity_info.php Karen Karen A. Coombs Product Manager OCLC Developer Network coom...@oclc.org On Thu, May 19, 2011 at 4:40 AM, graham gra...@theseamans.net wrote: I need to be able to take author data from a catalogue record and use it to look up the author on Wikipedia on the fly. So I may have birth date and possibly year of death in addition to (one spelling of) the name, the title of one book the author wrote etc. I know there are various efforts in progress that will improve the current situation, but as things stand at the moment what is the best* way to do this? 1. query wikipedia for as much as possible, parse and select the best fitting result 2. go via dbpedia/freebase and work back from there 3. use VIAF and/or OCLC services 4. Other? (I have no experience of 2-4 yet :-( Thanks Graham * 'best' being constrained by: - need to do this in real-time - need to avoid dependence on services which may be taken away or charged for - being able to justify to librarians as reasonably accurate :-)
Re: [CODE4LIB] wikipedia/author disambiguation
In addition to the approaches you note, it might be worth investigating this tool that came up in a thread just a few days ago on this list: http://wikipedia-miner.sourceforge.net/ I think nobody's done enough with this yet to be sure what will work best; I think you're going to have to experiment and let us know. VIAF/OCLC services are presumably using some sort of statistical analysis/text mining approaches under the hood; wikipedia-miner is using such approaches but giving you the code in open source too, if you're curious exactly what they're doing. I suspect statistical approaches like the ones wikipedia-miner uses are likely to be more productive than pure parsing approaches that consider only one record at a time in isolation. But writing your own statistical analysis algorithms is probably more work than you want, especially when wikipedia-miner and/or VIAF/OCLC services already exist. If you don't do statistical analysis of the corpus, and do end up actually trying to search wikipedia directly, then I suspect dbpedia is a lot more convenient an endpoint than trying to screen-scrape HTML wikipedia. That's pretty much what dbpedia is for. But these are all just my guesses, not informed by any work I've done. Jonathan
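To make the dbpedia suggestion a little more concrete, here is a hedged Python sketch that builds a SPARQL query for a person by English label and birth year. The property and class names (`dbo:Person`, `dbo:birthDate`, `rdfs:label`), the endpoint URL, and the `format` parameter are assumptions about how dbpedia is set up and should be checked against its current documentation; the function names are made up for the example.

```python
from urllib.parse import urlencode

# Assumed public endpoint; verify before relying on it.
ENDPOINT = "http://dbpedia.org/sparql"


def person_query(name, birth_year):
    """Return a SPARQL query for people with this exact English label born in the given year."""
    # Escape backslashes and double quotes so the name is a valid SPARQL string literal.
    safe = name.replace("\\", "\\\\").replace('"', '\\"')
    return f'''PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?person WHERE {{
  ?person a dbo:Person ;
          rdfs:label "{safe}"@en ;
          dbo:birthDate ?birth .
  FILTER (STRSTARTS(STR(?birth), "{int(birth_year):04d}"))
}}
LIMIT 10'''


def query_url(name, birth_year):
    """Build a GET URL for the query; the JSON results format is requested by assumption."""
    params = {"query": person_query(name, birth_year),
              "format": "application/sparql-results+json"}
    return ENDPOINT + "?" + urlencode(params)


print(person_query("Mark Twain", 1835))
```

An exact label match is brittle, since library headings and Wikipedia titles often differ in form, so in practice this would probably be one of several lookups tried in turn.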
Re: [CODE4LIB] wikipedia/author disambiguation
This sounds like a great way to translate from library forms to wikipedia name forms. But for on-the-fly use I wonder if it wouldn't be more efficient to eliminate the middle man. Karen, can you say a little about what it took to link library names to WP? Was it a one-step, two-step, etc.? There is a script that I've seen used, although it doesn't seem to be production ready: https://ajax.googleapis.com/ajax/libs/jquery/1.4.4/jquery.min.js One interesting note from the OL experience of linking to WP: generally you need to re-reverse the names to get a match: from Twain, Mark to Mark Twain. But for some names that isn't the case: Mao, Tse-Tung. Edward Betts used Wikipedia to determine which names do not get re-reversed. The OL code for its wikipedia lookup is at: https://github.com/openlibrary/openlibrary/tree/master/openlibrary/catalog/wikipedia It, however, runs against dumps rather than an API. kc -- Karen Coyle kco...@kcoyle.net http://kcoyle.net ph: 1-510-540-7596 m: 1-510-435-8234 skype: kcoylenet
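The name re-reversal Karen Coyle describes (Twain, Mark to Mark Twain, but not for names like Mao, Tse-Tung) might look roughly like the Python sketch below. It is not the Open Library code: the `display_form` function and the hand-supplied `keep_order` exception set are hypothetical, whereas the thread says the real exceptions were derived from Wikipedia itself. The sketch also assumes the first two comma-separated parts are surname and forename, and that anything after them (such as dates) can be dropped.

```python
def display_form(heading, keep_order=frozenset()):
    """Convert 'Surname, Forename[, dates]' to a natural-order name.

    Headings listed in keep_order (as 'Surname, Forename') keep their
    surname-first order and only lose the comma.
    """
    parts = [p.strip() for p in heading.split(",")]
    if len(parts) < 2 or not parts[1]:
        return heading.strip()  # nothing to reverse
    surname, forename = parts[0], parts[1]
    if f"{surname}, {forename}" in keep_order:
        return f"{surname} {forename}"
    return f"{forename} {surname}"


# Hypothetical exception list, supplied by hand for the example.
exceptions = {"Mao, Tse-Tung"}

print(display_form("Twain, Mark, 1835-1910", exceptions))
print(display_form("Mao, Tse-Tung", exceptions))
```

Headings whose second part is not a forename (a title or a date, for instance) would be mangled by this, so a production version would need to inspect the parts rather than trust their position.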
Re: [CODE4LIB] wikipedia/author disambiguation
Curious what script you've used that isn't production ready -- I don't think you meant to post in the URL for the JQuery library?