Re: [CODE4LIB] wikipedia/author disambiguation

2011-06-01 Thread stuart yeates
I'm not sure about other dialects of English, but in New Zealand 
English, a negation there has no impact on the semantics or subtleties 
of that sentence.


cheers
stuart

On 01/06/11 07:18, Ed Summers wrote:

a bit of a Freudian slip there I suppose :-)

 s/could/couldn't/

//Ed

On Tue, May 31, 2011 at 3:17 PM, Ed Summers e...@pobox.com wrote:

On Tue, May 31, 2011 at 12:48 PM, Thomas Berger t...@gymel.com wrote:

Currently about 150,000 articles on wikipedia.de carry the associated
PND number, many of them also LoC-NA and VIAF numbers:


Makes me wonder if we could use inter-wiki links to automatically
update some of the en.wikipedia articles based on the viaf links in
de.wikipedia. Could hurt to see how many there are I suppose.

//Ed






--
Stuart Yeates
Library Technology Services http://www.victoria.ac.nz/library/


Re: [CODE4LIB] wikipedia/author disambiguation

2011-05-31 Thread Jonathan Rochkind

Neat!

Just tried the human-displayed links off the Immanuel Kant wikipedia 
page (http://en.wikipedia.org/wiki/Immanuel_Kant), created by the 
'Authority Control' template that Daniel or someone else added.


VIAF one works great, taking me to the human readable VIAF page.

PND one seems to work too, taking me to the authority page in the 
Deutsche National Bibliothek.


The LCCN one does not work. Tries to take me to: 
http://errol.oclc.org/laf/n79021614.html


Which results in an HTTP 500 error from the OCLC server.

Since this template apparently generates a URL to an OCLC service 
(rather than LC? I guess maybe LC itself doesn't have the right 
permalinks?), I think that OCLC probably ought to fix this. If the 
template is not creating the right URL, I guess you've got to work with 
wikipedia to fix it. Or fix your end to accept those URLs properly.
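
For anyone who wants to check these links in bulk, here is a minimal sketch of how the template's slash-separated LCCN appears to map onto the errol URL above. The helper names are hypothetical, and the rule of zero-padding the serial to six digits is an assumption inferred from the one example in this thread (n/79/21614 becoming n79021614), not taken from the template source.

```python
# Base of the errol link quoted above.
ERROL_BASE = "http://errol.oclc.org/laf/"


def normalize_lccn(template_value: str) -> str:
    """Turn 'n/79/21614' into 'n79021614'.

    Assumed rule: prefix + two-digit year + serial left-padded to 6 digits.
    """
    prefix, year, serial = template_value.strip().split("/")
    return f"{prefix}{year}{serial.zfill(6)}"


def errol_url(template_value: str, suffix: str = ".html") -> str:
    """Build the link the template seems to generate for an LCCN value."""
    return f"{ERROL_BASE}{normalize_lccn(template_value)}{suffix}"


if __name__ == "__main__":
    print(errol_url("n/79/21614"))
```

If the padding rule turns out to differ for other LCCN forms, only normalize_lccn would need to change.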


Jonathan

On 5/25/2011 12:47 PM, Ed Summers wrote:

Hey Daniel,

It looks like you used the worldcat template [1]:

 {{worldcat id|id=lccn-n79-21614|VIAF=82088490}}

which doesn't actually do anything with the VIAF parameter. Instead
(or as well) you'll want to use the Authority control template [2]:

 {{Authority control|PND=118559796|LCCN=n/79/21614|VIAF=82088490}}

After I did that and the crawl ran again it showed up at linkypedia
[3]. Thanks for giving it a try!

//Ed

[1] http://en.wikipedia.org/wiki/Template:Worldcat_id
[2] http://en.wikipedia.org/wiki/Template:Authority_control
[3] http://linkypedia.info/websites/23/pages/



On Wed, May 25, 2011 at 9:17 AM, Lovins, Daniel daniel.lov...@yale.edu wrote:

That's really cool, Ed. I just added the viaf # for Immanuel Kant. Took just a 
few seconds. I'll subscribe to the linkypedia rss feed and watch for 
notification.

Daniel

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Ed 
Summers
Sent: Tuesday, May 24, 2011 4:59 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] wikipedia/author disambiguation

Big +1 for promoting the use of the Authority Control Wikipedia
template. I know I'm being a bit of a broken record, but you can watch
as people add these by looking at or subscribing to:

http://linkypedia.inkdroid.org/websites/23/pages/

Also, re: Jonathan's good advice to check out Wikipedia Miner [1] I
just ran across Duke [2] today, which looks like it could help guide
record linking a bit.


Duke is a fast and flexible deduplication (or entity resolution, or
record linkage) engine written in Java on top of Lucene. At the moment
(2011-04-07) it can process 1,000,000 records in 11 minutes on a
standard laptop in a single thread.


Haven't tried it yet, so YMMV, etc.

//Ed

[1] http://wikipedia-miner.sourceforge.net/
[2] http://code.google.com/p/duke/



Re: [CODE4LIB] wikipedia/author disambiguation

2011-05-31 Thread Ed Summers
On Tue, May 31, 2011 at 11:55 AM, Jonathan Rochkind rochk...@jhu.edu wrote:
 The LCCN one does not work. Tries to take me to:
 http://errol.oclc.org/laf/n79021614.html

 Which results in an HTTP 500 error from the OCLC server.

 Since this template apparently generates a URL to an OCLC service (rather
 than LC? I guess maybe LC itself doesn't have the right permalinks?), I
 think that OCLC probably ought to fix this. If the template is not creating
 the right URL, I guess you've got to work with wikipedia to fix it. Or fix
 your end to accept those URLs properly.

As far as I know there aren't any permalinks for name authority
records at loc.gov that use the LCCN. I've heard informally from some
folks at OCLC that they plan to redirect these links to a URL at
loc.gov if/when the name authority records are available from there.
But I have no idea when that will happen unfortunately.

//Ed


Re: [CODE4LIB] wikipedia/author disambiguation

2011-05-31 Thread Ross Singer
This seems pretty fixable on OCLC's part, if they want to...

The errol repository still works, see:
http://errol.oclc.org/laf/n79021614.MarcXML
(generated from)
http://alcme.oclc.org/lcnaf/servlet/OAIHandler?verb=GetRecord&metadataPrefix=MarcXML&identifier=n79021614

So it's just a case of the rewrites doing the right thing for the
.html redirects into OAICat or whatever it is.
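
A rough sketch of the mapping such a rewrite would need, assuming the OAI handler takes the standard OAI-PMH GetRecord parameters (verb, metadataPrefix, identifier). The function names are made up for illustration, and this says nothing about how OCLC's rewrite rules are actually configured.

```python
from urllib.parse import urlencode

# OAI handler base as given in the message above.
OAI_BASE = "http://alcme.oclc.org/lcnaf/servlet/OAIHandler"


def oai_getrecord_url(lccn: str, metadata_prefix: str = "MarcXML") -> str:
    """Build an OAI-PMH GetRecord request URL for one LCCN."""
    params = [
        ("verb", "GetRecord"),
        ("metadataPrefix", metadata_prefix),
        ("identifier", lccn),
    ]
    return f"{OAI_BASE}?{urlencode(params)}"


def errol_to_oai(errol_url: str) -> str:
    """Map an errol URL such as '.../laf/n79021614.html' to its OAI request."""
    filename = errol_url.rsplit("/", 1)[-1]
    lccn, _, _extension = filename.partition(".")
    return oai_getrecord_url(lccn)


if __name__ == "__main__":
    print(errol_to_oai("http://errol.oclc.org/laf/n79021614.html"))
```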

-Ross.



Re: [CODE4LIB] wikipedia/author disambiguation

2011-05-31 Thread Thomas Berger


On 31.05.2011 17:55, Jonathan Rochkind wrote:

 VIAF one works great, taking me to the human readable VIAF page.
 
 PND one seems to work too, taking me to the authority page in the Deutsche
 National Bibliothek.
 
 The LCCN one does not work. Tries to take me to:
 http://errol.oclc.org/laf/n79021614.html
 
 Which results in an HTTP 500 error from the OCLC server.
 
 Since this template apparently generates a URL to an OCLC service (rather than
 LC? I guess maybe LC itself doesn't have the right permalinks?), I think that
 OCLC probably ought to fix this. If the template is not creating the right
 URL, I guess you've got to work with wikipedia to fix it. Or fix your end to
 accept those URLs properly.

These links, IIRC, are the same ones VIAF uses to link to a
representation of the NAF records, and they have been broken for about
six weeks now.

To my knowledge the {{Authority Control}} metadata in the English Wikipedia
was inspired by a similar effort in the German Wikipedia, which has
recorded authority numbers for persons since 2005. They started with PND
numbers (Personennormdatei, the collaborative authority file for German
and Austrian libraries), backed by an agreement with the German
National Library (Deutsche Nationalbibliothek, DNB) to provide mutual
links from authority records to Wikipedia and vice versa.

Currently about 150,000 articles on wikipedia.de carry the associated
PND number, many of them also LoC-NA and VIAF numbers:

http://de.wikipedia.org/wiki/Vorlage:NORMDATENCOUNT

The links from portal.d-nb.de to wikipedia.de are implemented neither
as 856-like manifest URLs in the authority records nor as some kind of
Wikipedia number serving as an additional identification number. Rather,
wikipedia.de publishes, on a daily basis, a trivial concordance table
relating extracted PND numbers to the corresponding Wikipedia lemma.
The DNB portal in turn incorporates this table and generates the
respective links on the fly whenever an affected authority record is
displayed.

Some biographical dictionaries, regional bibliographies, classical
OPACs and historical projects picked up this mechanism and published
their own tables of this kind, all using the PND identification number
as a common system of reference. This low-tech approach to the
semantic web was named PND-BEACON:

 http://de.wikipedia.org/wiki/Wikipedia:PND/BEACON
(English version:  http://meta.wikimedia.org/wiki/BEACON )

CKAN data package:  http://ckan.net/package/pndbeacon

Publishing such beacon files presupposes that your data already
carries more-than-local identification numbers. With this
precondition met, the gain is twofold:

- publishing a beacon file may direct visitors from the incorporators
  of the file to your catalogue

- the existing authority numbers in your catalogue enable you to
  relate (via their beacon files) to other web resources, thus
  rounding out the data you present.
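
To make the concordance-table idea concrete, here is a minimal sketch of how a portal might turn such a table into links on the fly. The two-column "PND|lemma" layout, the sample row, and the function names are assumptions for illustration only; the real BEACON format is described at the pages linked above and differs in detail.

```python
from urllib.parse import quote

# Hypothetical concordance table: PND number | Wikipedia lemma.
SAMPLE_TABLE = """\
# assumed layout, one mapping per line
118559796|Immanuel Kant
"""


def parse_concordance(text: str) -> dict:
    """Read 'PND|lemma' lines into a dict, skipping comments and blanks."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        pnd, _, lemma = line.partition("|")
        table[pnd.strip()] = lemma.strip()
    return table


def wikipedia_link(pnd: str, table: dict,
                   base: str = "http://de.wikipedia.org/wiki/"):
    """Return the article URL for a PND number, or None if it is not listed."""
    lemma = table.get(pnd)
    if lemma is None:
        return None
    return base + quote(lemma.replace(" ", "_"))


if __name__ == "__main__":
    table = parse_concordance(SAMPLE_TABLE)
    print(wikipedia_link("118559796", table))
```

The point of the design is that the authority records themselves never change: the consuming site only reloads the table and looks up the identifier at display time.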

Thomas Berger


Re: [CODE4LIB] wikipedia/author disambiguation

2011-05-31 Thread Ed Summers
a bit of a Freudian slip there I suppose :-)

s/could/couldn't/

//Ed

On Tue, May 31, 2011 at 3:17 PM, Ed Summers e...@pobox.com wrote:
 On Tue, May 31, 2011 at 12:48 PM, Thomas Berger t...@gymel.com wrote:
 Currently about 150,000 articles on wikipedia.de carry the associated
 PND number, many of them also LoC-NA and VIAF numbers:

 Makes me wonder if we could use inter-wiki links to automatically
 update some of the en.wikipedia articles based on the viaf links in
 de.wikipedia. Could hurt to see how many there are I suppose.

 //Ed



Re: [CODE4LIB] wikipedia/author disambiguation

2011-05-31 Thread Ed Summers
On Tue, May 31, 2011 at 12:48 PM, Thomas Berger t...@gymel.com wrote:
 Currently about 150,000 articles on wikipedia.de carry the associated
 PND number, many of them also LoC-NA and VIAF numbers:

Makes me wonder if we could use inter-wiki links to automatically
update some of the en.wikipedia articles based on the viaf links in
de.wikipedia. Could hurt to see how many there are I suppose.
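
A back-of-the-envelope sketch of that join, assuming we already have three extracts in hand: VIAF numbers keyed by de.wikipedia title, the de-to-en inter-wiki links, and the set of en.wikipedia titles that already carry a VIAF number. The function name and the sample data are hypothetical; getting the extracts out of the dumps is the real work and is not shown.

```python
def candidate_updates(de_viaf: dict, de_to_en: dict, en_has_viaf: set) -> dict:
    """Return {en_title: viaf} for en articles that could gain a VIAF number."""
    candidates = {}
    for de_title, viaf in de_viaf.items():
        en_title = de_to_en.get(de_title)
        # Skip articles without an English counterpart or already tagged.
        if en_title and en_title not in en_has_viaf:
            candidates[en_title] = viaf
    return candidates


if __name__ == "__main__":
    de_viaf = {"Immanuel Kant": "82088490"}
    de_to_en = {"Immanuel Kant": "Immanuel Kant"}
    print(len(candidate_updates(de_viaf, de_to_en, set())))
```

Counting the result would answer the "how many are there" question before anyone edits a page.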

//Ed


Re: [CODE4LIB] wikipedia/author disambiguation

2011-05-26 Thread stuart yeates
Some reflection suggests that for this authority stuff to really take
off we have to make it _really_ easy for wikipedians to do.


As a model of how this could be done (and as a template to 
steal^H^H^H^H^Hreuse code from) I suggest HotCat: 
https://secure.wikimedia.org/wikipedia/en/wiki/HotCat


HotCat adds a little GUI to the bottom of pages to make it really fast
and easy to add pages to wikipedia categories. In my complete ignorance
of how both JavaScript and VIAF work, it seems it should be possible to
rewrite it to look up VIAF (with the query defaulting to the current
page title) and present a list of possible matches for the user to
choose from.
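
The workflow could be sketched roughly as below (in Python rather than the JavaScript a real gadget would need). The search step is passed in as a callable because the actual VIAF lookup interface is not described in this thread; fake_search, the function names, and the rule of stripping a trailing parenthetical disambiguator from the title are all assumptions.

```python
import re


def default_query(page_title: str) -> str:
    """Derive a default search string from the current page title.

    Assumption: underscores become spaces and a trailing '(...)'
    disambiguator is dropped.
    """
    title = page_title.replace("_", " ")
    return re.sub(r"\s*\([^)]*\)$", "", title).strip()


def suggest_matches(page_title: str, search, limit: int = 5):
    """Run the (injected) search and return the query plus the top matches."""
    query = default_query(page_title)
    return query, list(search(query))[:limit]


def fake_search(query: str):
    """Stand-in for a real VIAF lookup; returns (viaf_id, heading) pairs."""
    data = {"Immanuel Kant": [("82088490", "Kant, Immanuel")]}
    return data.get(query, [])


if __name__ == "__main__":
    print(suggest_matches("Immanuel_Kant", fake_search))
```

The user would then pick one entry from the returned list, and only that choice would be written into the article.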


cheers
stuart


On 26/05/11 11:16, Ed Summers wrote:

The user profile pages that reference the website should eventually (1
or 2 days) turn up under the Users tab, e.g.

 http://linkypedia.inkdroid.org/websites/23/users/

I don't see you there yet though :-)

//Ed

On Wed, May 25, 2011 at 5:03 PM, Karen Coyle li...@kcoyle.net wrote:

Hi, Ed. Do you pick up user pages or just wikipedia entry pages? (I added
mine to my user page, just for fun.)

kc





--
Stuart Yeates
Library Technology Services http://www.victoria.ac.nz/library/


Re: [CODE4LIB] wikipedia/author disambiguation

2011-05-26 Thread Graham Seaman
The LCCN links from the template have been giving a Java exception for
the last few days at least: does the template or the server need fixing?

Graham



Re: [CODE4LIB] wikipedia/author disambiguation

2011-05-26 Thread Ed Summers
It's the server unfortunately. I think OCLC is trying to figure out
what to do with errol ... there's a thread on the wc-devnet-l if you
are interested:


http://listserv.oclc.org/scripts/wa.exe?A2=ind1105dL=wc-devnet-lT=0F=PX=4D30895CB90D4C912FP=73

//Ed

On Thu, May 26, 2011 at 5:15 PM, Graham Seaman gra...@theseamans.net wrote:
 The lccn links from the template have been giving a java exception for
 the last few days at least: does the template or the server need fixing?


Re: [CODE4LIB] wikipedia/author disambiguation

2011-05-25 Thread Lovins, Daniel
That's really cool, Ed. I just added the viaf # for Immanuel Kant. Took just a 
few seconds. I'll subscribe to the linkypedia rss feed and watch for 
notification.

Daniel

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Ed 
Summers
Sent: Tuesday, May 24, 2011 4:59 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] wikipedia/author disambiguation

Big +1 for promoting the use of the Authority Control Wikipedia
template. I know I'm being a bit of a broken record, but you can watch
as people add these by looking at or subscribing to:

http://linkypedia.inkdroid.org/websites/23/pages/

Also, re: Jonathan's good advice to check out Wikipedia Miner [1] I
just ran across Duke [2] today, which looks like it could help guide
record linking a bit.


Duke is a fast and flexible deduplication (or entity resolution, or
record linkage) engine written in Java on top of Lucene. At the moment
(2011-04-07) it can process 1,000,000 records in 11 minutes on a
standard laptop in a single thread.


Haven't tried it yet, so YMMV, etc.

//Ed

[1] http://wikipedia-miner.sourceforge.net/
[2] http://code.google.com/p/duke/


Re: [CODE4LIB] wikipedia/author disambiguation

2011-05-25 Thread Lovins, Daniel
Oops.

Thanks Ed!

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Ed 
Summers
Sent: Wednesday, May 25, 2011 12:47 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] wikipedia/author disambiguation

Hey Daniel,

It looks like you used the worldcat template [1]:

{{worldcat id|id=lccn-n79-21614|VIAF=82088490}}

which doesn't actually do anything with the VIAF parameter. Instead
(or as well) you'll want to use the Authority control template:

{{Authority control|PND=118559796|LCCN=n/79/21614|VIAF=82088490}}

After I did that and the crawl ran again it showed up at linkypedia
[3]. Thanks for giving it a try!

//Ed

[1] http://en.wikipedia.org/wiki/Template:Worldcat_id
[2] http://en.wikipedia.org/wiki/Template:Authority_control
[3] http://linkypedia.info/websites/23/pages/






Re: [CODE4LIB] wikipedia/author disambiguation

2011-05-25 Thread Karen Coyle
Hi, Ed. Do you pick up user pages or just wikipedia entry pages? (I  
added mine to my user page, just for fun.)


kc

Quoting Ed Summers e...@pobox.com:


Hey Daniel,

It looks like you used the worldcat template [1]:

{{worldcat id|id=lccn-n79-21614|VIAF=82088490}}

which doesn't actually do anything with the VIAF parameter. Instead
(or as well) you'll want to use the Authority control template:

{{Authority control|PND=118559796|LCCN=n/79/21614|VIAF=82088490}}

After I did that and the crawl ran again it showed up at linkypedia
[3]. Thanks for giving it a try!

//Ed

[1] http://en.wikipedia.org/wiki/Template:Worldcat_id
[2] http://en.wikipedia.org/wiki/Template:Authority_control
[3] http://linkypedia.info/websites/23/pages/




--
Karen Coyle
kco...@kcoyle.net http://kcoyle.net
ph: 1-510-540-7596
m: 1-510-435-8234
skype: kcoylenet


Re: [CODE4LIB] wikipedia/author disambiguation

2011-05-25 Thread Ed Summers
The user profile pages that reference the website should eventually (1
or 2 days) turn up under the Users tab, e.g.

http://linkypedia.inkdroid.org/websites/23/users/

I don't see you there yet though :-)

//Ed

On Wed, May 25, 2011 at 5:03 PM, Karen Coyle li...@kcoyle.net wrote:
 Hi, Ed. Do you pick up user pages or just wikipedia entry pages? (I added
 mine to my user page, just for fun.)

 kc


Re: [CODE4LIB] wikipedia/author disambiguation

2011-05-24 Thread Ralph LeVan

 Are there any guidelines us wikipedians should be using to increase the
 likelihood of matches? I'm thinking in particular of the representation of
 uncertain dates.


The first step in the process is identifying entries for people.  The
presence of a Persondata block in the article is very helpful.
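
As an illustration of why that block helps, here is a minimal sketch of pulling the fields out of a Persondata template in article wikitext. It is not OCLC's extraction code; the regular expression is a simplification that assumes no nested templates or pipes inside field values, and the sample article text is made up.

```python
import re

# Matches {{Persondata ...}} up to the first closing braces (no nesting).
PERSONDATA = re.compile(r"\{\{\s*Persondata\s*(\|.*?)\}\}",
                        re.DOTALL | re.IGNORECASE)


def extract_persondata(wikitext: str):
    """Return the Persondata fields as a dict, or None if there is no block."""
    match = PERSONDATA.search(wikitext)
    if match is None:
        return None
    fields = {}
    for part in match.group(1).split("|")[1:]:
        key, sep, value = part.partition("=")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


SAMPLE = """Some article text.
{{Persondata
| NAME = Kant, Immanuel
| DATE OF BIRTH = 22 April 1724
| DATE OF DEATH = 12 February 1804
}}
"""

if __name__ == "__main__":
    print(extract_persondata(SAMPLE))
```

A page where this returns None is not necessarily a non-person, only one that needs some other signal.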

An unambiguous way to encode titles would help.  Apparently there are markup
rules for titles, but they are much abused by editors using them to
emphasize text in articles and are unreliable to the point of being ignored.

Dates are not a big deal.  Wikipedia records are full of dates and we cope
with fuzziness in dates reasonably well.


 Are there perhaps 'ground truth' URLs such as links into identities / VIAF
 which we can use? If so, what is the exact form of those URLs?


There is a template for entering WorldCat Identities URLs.  I don't believe
there is one yet for VIAF.

We've been working for a couple of years now on getting permission to put
Identities and VIAF links into Wikipedia records.  As it happens, several of
us in Research are meeting again on Thursday to discuss this.  Apparently
there has been some sort of movement on that topic.  Any help you can
provide would be appreciated!

Ralph


Re: [CODE4LIB] wikipedia/author disambiguation

2011-05-24 Thread Lovins, Daniel
Ralph, all, 

Regarding the Wikipedia template for VIAF, haven't tried it myself, but I 
believe the following syntax works: {{Authority 
control|PND=119408643|LCCN=n/79/113947|VIAF=59263727}}

Based on description on this Wikipedia page: 
http://en.wikipedia.org/wiki/Template:Authority_control
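
For generating these in bulk, a small sketch that assembles the template call from the three identifiers. The function names are hypothetical, and the conversion from a compact LCCN to the slash form is an assumption that only covers the two-digit-year pattern seen in this thread.

```python
import re


def to_template_lccn(lccn: str) -> str:
    """Turn 'n79113947' into 'n/79/113947' (two-digit-year form only)."""
    match = re.fullmatch(r"([a-z]+)(\d{2})(\d{6})", lccn)
    if match is None:
        raise ValueError(f"unsupported LCCN form: {lccn!r}")
    prefix, year, serial = match.groups()
    # int() drops the zero padding, e.g. '021614' -> 21614.
    return f"{prefix}/{year}/{int(serial)}"


def authority_control(pnd=None, lccn=None, viaf=None) -> str:
    """Assemble the wikitext, leaving out identifiers that are not known."""
    fields = [("PND", pnd), ("LCCN", lccn), ("VIAF", viaf)]
    body = "|".join(f"{key}={value}" for key, value in fields if value)
    return "{{Authority control|" + body + "}}"


if __name__ == "__main__":
    print(authority_control(pnd="119408643",
                            lccn=to_template_lccn("n79113947"),
                            viaf="59263727"))
```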

/ Daniel

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Ralph 
LeVan
Sent: Tuesday, May 24, 2011 10:23 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] wikipedia/author disambiguation


 Are there any guidelines us wikipedians should be using to increase the
 likelihood of matches? I'm thinking in particular of the representation of
 uncertain dates.


The first step in the process is identifying entries for people.  The
presence of a Persondata block in the article is very helpful.

An unambiguous way to encode titles would help.  Apparently there are markup
rules for titles, but they are much abused by editors using them to
emphasize text in articles and are unreliable to the point of being ignored.

Dates is not a big deal.  Wikipedia records are full of dates and we cope
with fuzziness in dates reasonably well.


 Are there perhaps 'ground truth' URLs such as links into identities / VIAF
 which we can use? If so, what is the exact form of those URLs?


There is a template for entering WorldCat Identities URLs.  I don't believe
there is one yet for VIAF.

We've been working for a couple of years now on getting permission to put
Identities and VIAF links into Wikipedia records.  As it happens, several of
us in Research are meeting again on Thursday to discuss this.  Apparently
there has been some sort of movement on that topic.  Any help you can
provide would be appreciated!

Ralph


Re: [CODE4LIB] wikipedia/author disambiguation

2011-05-24 Thread Ralph LeVan
Very cool!  Thanks!

Ralph

On Tue, May 24, 2011 at 10:36 AM, Lovins, Daniel daniel.lov...@yale.edu wrote:

 Ralph, all,

 Regarding the Wikipedia template for VIAF, haven't tried it myself, but I
 believe the following syntax works: {{Authority
 control|PND=119408643|LCCN=n/79/113947|VIAF=59263727}}

 Based on description on this Wikipedia page:
 http://en.wikipedia.org/wiki/Template:Authority_control

 / Daniel



Re: [CODE4LIB] wikipedia/author disambiguation

2011-05-24 Thread Ya'aqov Ziso
Daniel, the template's very good indeed.
Note, though, that VIAF already includes the LCCN and Deutsche
Nationalbibliothek identifiers, so VIAF alone suffices for such name needs
in WP. That reinforces Ralph's call for linking Identities and VIAF into
Wikipedia records.

Ya'aqov

On Tue, May 24, 2011 at 9:36 AM, Lovins, Daniel daniel.lov...@yale.edu wrote:

 Ralph, all,

 Regarding the Wikipedia template for VIAF, haven't tried it myself, but I
 believe the following syntax works: {{Authority
 control|PND=119408643|LCCN=n/79/113947|VIAF=59263727}}

 Based on description on this Wikipedia page:
 http://en.wikipedia.org/wiki/Template:Authority_control

 / Daniel


-- 
ya'aqovZISO | yaaq...@gmail.com | 856 217 3456


Re: [CODE4LIB] wikipedia/author disambiguation

2011-05-24 Thread stuart yeates

On 25/05/11 02:36, Lovins, Daniel wrote:

Ralph, all,

Regarding the Wikipedia template for VIAF, haven't tried it myself, but I 
believe the following syntax works: {{Authority 
control|PND=119408643|LCCN=n/79/113947|VIAF=59263727}}

Based on description on this Wikipedia page: 
http://en.wikipedia.org/wiki/Template:Authority_control


Excellent!

The page history indicates that it's an import from the German language 
wikipedia, thus the good support for German Persons, Corporations and 
Subjects.


I suspect that the template would get more widely used if it were 
discussed at 
https://secure.wikimedia.org/wikipedia/en/wiki/Wikipedia:GLAM_getting_started 
(recently renamed from 
https://secure.wikimedia.org/wikipedia/en/wiki/Wikipedia:Advice_for_the_cultural_sector 
) which is a canonical starting point for library people.


The structure of the template is interesting.

Due to the way wikipedia works, the barriers to adding something to a 
page (in this case, a template) are much lower than adding a new page. 
So adding a new authority to the page should be relatively easy, 
providing it works in the same way as the other authorities already there.


cheers
stuart
--
Stuart Yeates
Library Technology Services http://www.victoria.ac.nz/library/


Re: [CODE4LIB] wikipedia/author disambiguation

2011-05-24 Thread Ed Summers
Big +1 for promoting the use of the Authority Control Wikipedia
template. I know I'm being a bit of a broken record, but you can watch
as people add these by looking at or subscribing to:

http://linkypedia.inkdroid.org/websites/23/pages/

Also, re: Jonathan's good advice to check out Wikipedia Miner [1] I
just ran across Duke [2] today, which looks like it could help guide
record linking a bit.


Duke is a fast and flexible deduplication (or entity resolution, or
record linkage) engine written in Java on top of Lucene. At the moment
(2011-04-07) it can process 1,000,000 records in 11 minutes on a
standard laptop in a single thread.


Haven't tried it yet, so YMMV, etc.

//Ed

[1] http://wikipedia-miner.sourceforge.net/
[2] http://code.google.com/p/duke/


Re: [CODE4LIB] wikipedia/author disambiguation

2011-05-23 Thread LeVan,Ralph
I think you misunderstood that, Ya'aqov.

What we do is make local authority records out of the Wikipedia records that 
we've identified as names.  So the dates and other data are added to the local 
authority record for that Wikipedia record.  We then use our usual VIAF matching 
technology between those Wikipedia authority records and the other authority 
records in VIAF.  Wikipedia records that end up in a VIAF cluster get kept and 
the others get dropped as not matching anything we have in VIAF.

I hope that helps!

Ralph

 -Original Message-
 From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
 Ya'aqov Ziso
 Sent: Sunday, May 22, 2011 5:15 PM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: Re: [CODE4LIB] wikipedia/author disambiguation
 
 Thanks Karen, but you don't indicate yet, how you solve disambiguation?
 
 You indicate how you use WP as a resource for adding dates and subjects
 when
 they are missing.
 You don't indicate when/how you are resolving ambiguities with WP data.
 
 Again, please use Morris William as an example,
 *Ya'aqov*
 
 
 
 
  *Once a year OCLC downloads Wikipedia and then we extract as much
  information from it as we can. This generally involves reading through
  their current information for templates, etc. Then we try to figure
  out which pages are people. Within the people pages we look for birth
  dates, death dates, work titles, ISBNs, oclc numbers, worldcat
  identity links, LCCNs ... anything that we have in VIAF for matching
  purposes. Then we build marc-ish records for each of the extracted
  person. After that the records go through the normal VIAF matching
  processes.
 
  The process gets changed and tweaked each year.*


Re: [CODE4LIB] wikipedia/author disambiguation

2011-05-23 Thread Ya'aqov Ziso
Oh yes, your clarification helps, Ralph.

So WP data ends up in a cluster (more than one entity) for a certain
string that applies to more than one person/heading (therefore it is
ambiguous). What processes is VIAF running to dis-ambiguate THAT heading?
Ya'aqov


On Mon, May 23, 2011 at 8:04 AM, LeVan,Ralph le...@oclc.org wrote:

 I think you misunderstood that Ya'aqov.

 What we do is make local authority records out of the Wikipedia records
 that we've identified as names.  So the adding dates and stuff is to the
 local authority record of that Wikipedia record.  We then use our usual VIAF
 matching technology between those Wikipedia authority records and the other
 authority records in VIAF.  Wikipedia records that end up in a VIAF cluster
 get kept and the others get dropped as not matching anything we have in
 VIAF.

 I hope that helps!

 Ralph

  -Original Message-
  From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
  Ya'aqov Ziso
  Sent: Sunday, May 22, 2011 5:15 PM
  To: CODE4LIB@LISTSERV.ND.EDU
  Subject: Re: [CODE4LIB] wikipedia/author disambiguation
 
  Thanks Karen, but you don't indicate yet, how you solve disambiguation?
 
  You indicate how you use WP as a resource for adding dates and subjects
  when
  they are missing.
  You don't indicate when/how you are resolving ambiguities with WP data.
 
  Again, please use Morris William as an example,
  *Ya'aqov*
 
 
 
 
   *Once a year OCLC downloads Wikipedia and then we extract as much
   information from it as we can. This generally involves reading through
   their current information for templates, etc. Then we try to figure
   out which pages are people. Within the people pages we look for birth
   dates, death dates, work titles, ISBNs, oclc numbers, worldcat
   identity links, LCCNs ... anything that we have in VIAF for matching
   purposes. Then we build marc-ish records for each of the extracted
   person. After that the records go through the normal VIAF matching
   processes.
  
   The process gets changed and tweaked each year.*




-- 
ya'aqovZISO | yaaq...@gmail.com | 856 217 3456


Re: [CODE4LIB] wikipedia/author disambiguation

2011-05-23 Thread Ralph LeVan
We look for common work titles and common dates primarily.  I believe there
are also sometimes actual links to other authority records.
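
To make "common work titles and common dates" concrete, here is a rough 
sketch in Python. It is not VIAF's matching code, which is not shown in 
this thread; the record layout (dicts with birth, death and titles keys), 
the function name match_score and the equal weighting are all assumptions 
for illustration, and the sample records are made up for the example.

```python
def match_score(a, b):
    """Count pieces of shared evidence between two person records."""
    score = 0
    # A birth or death year counts only when both records have it and agree.
    for key in ("birth", "death"):
        if a.get(key) and a.get(key) == b.get(key):
            score += 1
    # Each work title in common (ignoring case and outer spaces) counts once.
    titles_a = {t.casefold().strip() for t in a.get("titles", [])}
    titles_b = {t.casefold().strip() for t in b.get("titles", [])}
    score += len(titles_a & titles_b)
    return score


# Illustrative records only.
wiki = {"birth": 1834, "death": 1896, "titles": ["News from Nowhere"]}
poet = {"birth": 1834, "death": 1896,
        "titles": ["News from nowhere", "The Earthly Paradise"]}
angler = {"titles": ["Introduction to Fly Fishing"]}

best = max([poet, angler], key=lambda rec: match_score(wiki, rec))
```

A record with no evidence in common scores zero, which would correspond to 
the "dropped as not matching" case described earlier in the thread.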

Ralph

On Mon, May 23, 2011 at 12:00 PM, Ya'aqov Ziso yaaq...@gmail.com wrote:

 Oh yes, your clarification helps, Ralph.

 So WP data ends up in a cluster (more than one entity) for a certain
 string that applies to more than one person/heading (therefore it is
 ambiguous). What processes is VIAF running to dis-ambiguate THAT heading?
 Ya'aqov


 On Mon, May 23, 2011 at 8:04 AM, LeVan,Ralph le...@oclc.org wrote:

  I think you misunderstood that Ya'aqov.
 
  What we do is make local authority records out of the Wikipedia records
  that we've identified as names.  So the adding dates and stuff is to the
  local authority record of that Wikipedia record.  We then use our usual
 VIAF
  matching technology between those Wikipedia authority records and the
 other
  authority records in VIAF.  Wikipedia records that end up in a VIAF
 cluster
  get kept and the others get dropped as not matching anything we have in
  VIAF.
 
  I hope that helps!
 
  Ralph
 
   -Original Message-
   From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf
 Of
   Ya'aqov Ziso
   Sent: Sunday, May 22, 2011 5:15 PM
   To: CODE4LIB@LISTSERV.ND.EDU
   Subject: Re: [CODE4LIB] wikipedia/author disambiguation
  
   Thanks Karen, but you don't indicate yet, how you solve disambiguation?
  
   You indicate how you use WP as a resource for adding dates and subjects
   when
   they are missing.
   You don't indicate when/how you are resolving ambiguities with WP data.
  
   Again, please use Morris William as an example,
   *Ya'aqov*
  
  
  
  
    *Once a year OCLC downloads Wikipedia and then we extract as much
    information from it as we can. This generally involves reading through
    their current information for templates, etc. Then we try to figure
    out which pages are people. Within the people pages we look for birth
    dates, death dates, work titles, ISBNs, oclc numbers, worldcat
    identity links, LCCNs ... anything that we have in VIAF for matching
    purposes. Then we build marc-ish records for each of the extracted
    person. After that the records go through the normal VIAF matching
    processes.

    The process gets changed and tweaked each year.*
 



 --
 ya'aqovZISO | yaaq...@gmail.com | 856 217 3456



Re: [CODE4LIB] wikipedia/author disambiguation

2011-05-23 Thread stuart yeates

On 24/05/11 01:04, LeVan,Ralph wrote:

What we do is make local authority records out of the Wikipedia records that 
we've identified as names.  So the adding dates and stuff is to the local 
authority record of that Wikipedia record.  We then use our usual VIAF matching 
technology between those Wikipedia authority records and the other authority 
records in VIAF.  Wikipedia records that end up in a VIAF cluster get kept and 
the others get dropped as not matching anything we have in VIAF.


Very interesting.

Are there any guidelines we Wikipedians should be using to increase the 
likelihood of matches? I'm thinking in particular of the representation 
of uncertain dates.


Are there perhaps 'ground truth' URLs such as links into identities / 
VIAF which we can use? If so, what is the exact form of those URLs?


cheers
stuart
--
Stuart Yeates
Library Technology Services http://www.victoria.ac.nz/library/


Re: [CODE4LIB] wikipedia/author disambiguation

2011-05-22 Thread stuart yeates

On 21/05/11 16:29, Ya'aqov Ziso wrote:


- Ludwig van Beethoven doesn't need much disambiguation.


Are you sure?

http://toolserver.org/~dispenser/cgi-bin/rdcheck.py?page=Ludwig_van_Beethoven

cheers
stuart
--
Stuart Yeates
Library Technology Services http://www.victoria.ac.nz/library/


Re: [CODE4LIB] wikipedia/author disambiguation

2011-05-22 Thread Ya'aqov Ziso
Ode to your cheers: as a personal name, with birth and death dates, in
NAF, is not an ambiguous heading. toolserver.org is all yours.

Joy to Stuart.

On Sun, May 22, 2011 at 3:23 PM, stuart yeates stuart.yea...@vuw.ac.nz wrote:

 On 21/05/11 16:29, Ya'aqov Ziso wrote:

- Ludwig van Beethoven doesn't need much disambiguation.


 Are you sure?


 http://toolserver.org/~dispenser/cgi-bin/rdcheck.py?page=Ludwig_van_Beethoven

 cheers
 stuart
 --
 Stuart Yeates
 Library Technology Services http://www.victoria.ac.nz/library/




-- 
ya'aqovZISO | yaaq...@gmail.com | 856 217 3456


Re: [CODE4LIB] wikipedia/author disambiguation

2011-05-22 Thread Ya'aqov Ziso
Thanks Karen, but you haven't yet indicated how you solve disambiguation.

You indicate how you use WP as a resource for adding dates and subjects when
they are missing.
You don't indicate when/how you are resolving ambiguities with WP data.

Again, please use Morris, William as an example,
*Ya'aqov*




 *Once a year OCLC downloads Wikipedia and then we extract as much
 information from it as we can. This generally involves reading through
 their current information for templates, etc. Then we try to figure
 out which pages are people. Within the people pages we look for birth
 dates, death dates, work titles, ISBNs, oclc numbers, worldcat
 identity links, LCCNs ... anything that we have in VIAF for matching
 purposes. Then we build marc-ish records for each of the extracted
 person. After that the records go through the normal VIAF matching
 processes.

 The process gets changed and tweaked each year.*
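
As an illustration of the "reading through their current information for 
templates" step, here is a small Python sketch that pulls identifiers out 
of an {{Authority control}} call in page wikitext. It is not OCLC's 
extraction code; the regular expression, the function name authority_ids 
and the choice to upper-case parameter names are assumptions, and nested 
templates inside the call are deliberately not handled.

```python
import re

# Matches one {{Authority control|...}} call with no nested braces inside.
TEMPLATE = re.compile(r"\{\{\s*Authority control\s*\|([^{}]*)\}\}",
                      re.IGNORECASE)


def authority_ids(wikitext):
    """Return a dict of identifier name -> value, or {} if no template."""
    found = TEMPLATE.search(wikitext)
    if not found:
        return {}
    ids = {}
    for field in found.group(1).split("|"):
        name, sep, value = field.partition("=")
        # Skip positional or empty parameters.
        if sep and value.strip():
            ids[name.strip().upper()] = value.strip()
    return ids
```

The resulting LCCN or VIAF values are the kind of hard link that could 
short-circuit name-and-date matching when they are present.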


Re: [CODE4LIB] wikipedia/author disambiguation

2011-05-20 Thread Graham Seaman
Thanks Karen

Looks like the OL code uses birth date + name, which is just what I was
thinking of doing. Although you say it's meant to run against a Wikipedia dump,
it looks like it should actually work with small changes against the
Wikipedia API too. But I need this in PHP, so I'll have to pick through
it and convert from Perl and Python before I can test it properly.
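
A minimal sketch of what a name + birth date lookup against the API might 
start from, written in Python like the OL code rather than PHP. It only 
builds the request URL for the MediaWiki search module (action=query, 
list=search); the function name search_url and the idea of simply 
appending the birth year to the search terms are assumptions, not what 
the OL code does.

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def search_url(name, birth_year=None, limit=5):
    """Build a full-text search URL for a person name and optional year."""
    terms = name if birth_year is None else f"{name} {birth_year}"
    params = {
        "action": "query",
        "list": "search",
        "srsearch": terms,
        "srlimit": limit,
        "format": "json",
    }
    return API + "?" + urlencode(params)


print(search_url("William Morris", 1834))
```

The candidate pages returned would still need to be checked against the 
catalogue record (dates, titles) before one is accepted.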

What was the JavaScript code you referenced below? Your link is to
jQuery, rather than the script you mentioned in the text.

Graham



On 05/19/11 15:39, Karen Coyle wrote:
 This sounds like a great way to translate from library forms to
 wikipedia name forms. But for on-the-fly use I wonder if it wouldn't be
 more efficient to eliminate the middle man. Karen, can you say a
 little about what it took to link library names to WP? Was it a
 one-step, two-step, etc.?
 
 There is a script that I've seen used, although it doesn't seem to be
 production ready:
 
   https://ajax.googleapis.com/ajax/libs/jquery/1.4.4/jquery.min.js
 
 One interesting note from the OL experience of linking to WP: generally
 you need to re-reverse the names to get a match: from Twain, Mark to
 Mark Twain. But for some names that isn't the case: Mao, Tse-Tung.
 Edward Betts used Wikipedia to determine which names do not get
 re-reversed.
 
 The OL code for its wikipedia lookup is at:
  
 https://github.com/openlibrary/openlibrary/tree/master/openlibrary/catalog/wikipedia
 
 
 It, however, runs against dumps rather than an API.
 
 kc
 
 Quoting Karen Coombs librarywebc...@gmail.com:
 
 Graham,

 I'd advocate using WorldCat Identities to get to the appropriate url
 for dbpedia. Each Identity record has a wikipedia element in it that
 you could use to link to either Wikipedia or dbpedia.

 If you want to see an example of this in action you can check out the
 Author Info demo I did for code4lib 2010 here -
 http://www.librarywebchic.net/mashups/author_info/info_about_this_author.php?OCLCNum=32939031


 The code for this demo is available for download at -
 http://www.worldcat.org/devnet/code/devnetDemos/trunk/

 You'll want the author_info folder and identity_info.php

 Karen

 Karen A. Coombs
 Product Manager
 OCLC Developer Network
 coom...@oclc.org


 On Thu, May 19, 2011 at 4:40 AM, graham gra...@theseamans.net wrote:
 I need to be able to take author data from a catalogue record and use it
 to look up the author on Wikipedia on the fly. So I may have birth date
 and possibly year of death in addition to (one spelling of) the name,
 the title of one book the author wrote etc.

 I know there are various efforts in progress that will improve the
 current situation, but as things stand at the moment what is the best*
 way to do this?

 1. query wikipedia for as much as possible, parse and select the best
 fitting result

 2. go via dbpedia/freebase and work back from there

 3. use VIAF and/or OCLC services

 4. Other?

 (I have no experience of 2-4 yet :-(


 Thanks
 Graham
 * 'best' being constrained by:
 - need to do this in real-time
 - need to avoid dependence on services which may be taken away
 or charged for
 - being able to justify to librarians as reasonably accurate :-)


 
 
 


Re: [CODE4LIB] wikipedia/author disambiguation

2011-05-20 Thread Graham Seaman
Hi Karen

Thanks for the code. As far as I can see though it doesn't actually
solve my disambiguation problem -  since identity_info.php just takes a
name as input, it can't guess which of the people with this name is
meant other than by using the most commonly referenced one, which in the
OCLC data actually seems to often be an amalgam of several people with
the name; for example

 http://worldcat.org/identities/viaf-DNB|100804799

is William Morris, the 18th-century African-American engineer whose most
widely held works include News from Nowhere, Introduction to Fly
Fishing, and Ancient Slavery Disapproved of by God - i.e. an amalgamation
of the various most famous people known by this name.

I guess this is just a hard problem overall.

Graham



On 05/19/11 14:56, Karen Coombs wrote:
 Graham,
 
 I'd advocate using WorldCat Identities to get to the appropriate url
 for dbpedia. Each Identity record has a wikipedia element in it that
 you could use to link to either Wikipedia or dbpedia.
 
 If you want to see an example of this in action you can check out the
 Author Info demo I did for code4lib 2010 here -
 http://www.librarywebchic.net/mashups/author_info/info_about_this_author.php?OCLCNum=32939031
 
 The code for this demo is available for download at -
 http://www.worldcat.org/devnet/code/devnetDemos/trunk/
 
 You'll want the author_info folder and identity_info.php
 
 Karen
 
 Karen A. Coombs
 Product Manager
 OCLC Developer Network
 coom...@oclc.org
 
 
 On Thu, May 19, 2011 at 4:40 AM, graham gra...@theseamans.net wrote:
 I need to be able to take author data from a catalogue record and use it
 to look up the author on Wikipedia on the fly. So I may have birth date
 and possibly year of death in addition to (one spelling of) the name,
 the title of one book the author wrote etc.

 I know there are various efforts in progress that will improve the
 current situation, but as things stand at the moment what is the best*
 way to do this?

 1. query wikipedia for as much as possible, parse and select the best
 fitting result

 2. go via dbpedia/freebase and work back from there

 3. use VIAF and/or OCLC services

 4. Other?

 (I have no experience of 2-4 yet :-(


 Thanks
 Graham
 * 'best' being constrained by:
 - need to do this in real-time
 - need to avoid dependence on services which may be taken away
 or charged for
 - being able to justify to librarians as reasonably accurate :-)



Re: [CODE4LIB] wikipedia/author disambiguation

2011-05-20 Thread Karen Coombs
Karen,

I'll have to get Ralph LeVan or Thom Hickey to comment on how we were
able to create the Wikipedia links in Identities. I don't know the
details, just that the data is there.

Karen

On Thu, May 19, 2011 at 9:39 AM, Karen Coyle li...@kcoyle.net wrote:
 This sounds like a great way to translate from library forms to wikipedia
 name forms. But for on-the-fly use I wonder if it wouldn't be more efficient
 to eliminate the middle man. Karen, can you say a little about what it
 took to link library names to WP? Was it a one-step, two-step, etc.?

 There is a script that I've seen used, although it doesn't seem to be
 production ready:

  https://ajax.googleapis.com/ajax/libs/jquery/1.4.4/jquery.min.js

 One interesting note from the OL experience of linking to WP: generally you
 need to re-reverse the names to get a match: from Twain, Mark to Mark
 Twain. But for some names that isn't the case: Mao, Tse-Tung. Edward Betts
 used Wikipedia to determine which names do not get re-reversed.

 The OL code for its wikipedia lookup is at:
  https://github.com/openlibrary/openlibrary/tree/master/openlibrary/catalog/wikipedia

 It, however, runs against dumps rather than an API.

 kc

 Quoting Karen Coombs librarywebc...@gmail.com:

 Graham,

 I'd advocate using WorldCat Identities to get to the appropriate url
 for dbpedia. Each Identity record has a wikipedia element in it that
 you could use to link to either Wikipedia or dbpedia.

 If you want to see an example of this in action you can check out the
 Author Info demo I did for code4lib 2010 here -

 http://www.librarywebchic.net/mashups/author_info/info_about_this_author.php?OCLCNum=32939031

 The code for this demo is available for download at -
 http://www.worldcat.org/devnet/code/devnetDemos/trunk/

 You'll want the author_info folder and identity_info.php

 Karen

 Karen A. Coombs
 Product Manager
 OCLC Developer Network
 coom...@oclc.org


 On Thu, May 19, 2011 at 4:40 AM, graham gra...@theseamans.net wrote:

 I need to be able to take author data from a catalogue record and use it
 to look up the author on Wikipedia on the fly. So I may have birth date
 and possibly year of death in addition to (one spelling of) the name,
 the title of one book the author wrote etc.

 I know there are various efforts in progress that will improve the
 current situation, but as things stand at the moment what is the best*
 way to do this?

 1. query wikipedia for as much as possible, parse and select the best
 fitting result

 2. go via dbpedia/freebase and work back from there

 3. use VIAF and/or OCLC services

 4. Other?

 (I have no experience of 2-4 yet :-(


 Thanks
 Graham
 * 'best' being constrained by:
 - need to do this in real-time
 - need to avoid dependence on services which may be taken away
 or charged for
 - being able to justify to librarians as reasonably accurate :-)





 --
 Karen Coyle
 kco...@kcoyle.net http://kcoyle.net
 ph: 1-510-540-7596
 m: 1-510-435-8234
 skype: kcoylenet



Re: [CODE4LIB] wikipedia/author disambiguation

2011-05-20 Thread Ya'aqov Ziso
Karen,

   - Identities in WorldCat are based on literary warrant, i.e., names for
   people who authored/edited something or were the subject of someone else's
   literary work. Personal names in Wikipedia are not entered according to
   literary warrant. Nor is their form vetted according to NAF.
   - Ludwig van Beethoven doesn't need much disambiguation. Nor does Mark
   Twain.
   - So yes, Karen/Ralph/Thom: how exactly is Wikipedia used for
   disambiguation? Are you certain it's used for THAT purpose? If yes, can you
   send us a proper example? Morris, William seems a good example.

Ya'aqov


On Thu, May 19, 2011 at 3:48 PM, Graham Seaman gra...@theseamans.net wrote:

 Hi Karen

 Thanks for the code. As far as I can see though it doesn't actually
 solve my disambiguation problem -  since identity_info.php just takes a
 name as input, it can't guess which of the people with this name is
 meant other than by using the most commonly referenced one, which in the
 OCLC data actually seems to often be an amalgam of several people with
 the name; for example

  http://worldcat.org/identities/viaf-DNB|100804799

 is William Morris, the 18th century African-American engineer whose most
 widely held works include News from Nowhere, Introduction to Fly
 Fishing, and Ancient Slavery Disapproved of by God - ie an amalgamation
 of the various most famous people known by this name.

 I guess this is just a hard problem overall.

 Graham


 On 05/19/11 14:56, Karen Coombs wrote:
  Graham,
 
  I'd advocate using WorldCat Identities to get to the appropriate url
  for dbpedia. Each Identity record has a wikipedia element in it that
  you could use to link to either Wikipedia or dbpedia.
 
  If you want to see an example of this in action you can check out the
  Author Info demo I did for code4lib 2010 here -
 
 http://www.librarywebchic.net/mashups/author_info/info_about_this_author.php?OCLCNum=32939031
 
  The code for this demo is available for download at -
  http://www.worldcat.org/devnet/code/devnetDemos/trunk/
 
  You'll want the author_info folder and identity_info.php
 
  Karen
 
  Karen A. Coombs
  Product Manager
  OCLC Developer Network
  coom...@oclc.org
 
 
  On Thu, May 19, 2011 at 4:40 AM, graham gra...@theseamans.net wrote:
  I need to be able to take author data from a catalogue record and use it
  to look up the author on Wikipedia on the fly. So I may have birth date
  and possibly year of death in addition to (one spelling of) the name,
  the title of one book the author wrote etc.
 
  I know there are various efforts in progress that will improve the
  current situation, but as things stand at the moment what is the best*
  way to do this?
 
  1. query wikipedia for as much as possible, parse and select the best
  fitting result
 
  2. go via dbpedia/freebase and work back from there
 
  3. use VIAF and/or OCLC services
 
  4. Other?
 
  (I have no experience of 2-4 yet :-(
 
 
  Thanks
  Graham
  * 'best' being constrained by:
  - need to do this in real-time
  - need to avoid dependence on services which may be taken away
  or charged for
  - being able to justify to librarians as reasonably accurate :-)
 


-- 
ya'aqovZISO | yaaq...@gmail.com | 856 217 3456



Re: [CODE4LIB] wikipedia/author disambiguation

2011-05-19 Thread Karen Coombs
Graham,

I'd advocate using WorldCat Identities to get to the appropriate url
for dbpedia. Each Identity record has a wikipedia element in it that
you could use to link to either Wikipedia or dbpedia.

If you want to see an example of this in action you can check out the
Author Info demo I did for code4lib 2010 here -
http://www.librarywebchic.net/mashups/author_info/info_about_this_author.php?OCLCNum=32939031

The code for this demo is available for download at -
http://www.worldcat.org/devnet/code/devnetDemos/trunk/

You'll want the author_info folder and identity_info.php

Karen

Karen A. Coombs
Product Manager
OCLC Developer Network
coom...@oclc.org


On Thu, May 19, 2011 at 4:40 AM, graham gra...@theseamans.net wrote:
 I need to be able to take author data from a catalogue record and use it
 to look up the author on Wikipedia on the fly. So I may have birth date
 and possibly year of death in addition to (one spelling of) the name,
 the title of one book the author wrote etc.

 I know there are various efforts in progress that will improve the
 current situation, but as things stand at the moment what is the best*
 way to do this?

 1. query wikipedia for as much as possible, parse and select the best
 fitting result

 2. go via dbpedia/freebase and work back from there

 3. use VIAF and/or OCLC services

 4. Other?

 (I have no experience of 2-4 yet :-(


 Thanks
 Graham
 * 'best' being constrained by:
 - need to do this in real-time
 - need to avoid dependence on services which may be taken away
 or charged for
 - being able to justify to librarians as reasonably accurate :-)



Re: [CODE4LIB] wikipedia/author disambiguation

2011-05-19 Thread Jonathan Rochkind
In addition to the approaches you note, might be worth investigating 
this tool that came up in a thread just a few days ago on this list:


http://wikipedia-miner.sourceforge.net/


I don't think anybody has done enough with this yet to be sure what will work 
best, so you're probably going to have to experiment and let us know.


VIAF/OCLC services are presumably using some sort of statistical 
analysis/text mining approaches under the hood; wikipedia-miner is using 
such approaches but giving you the code in open source too if you're 
curious exactly what they're doing.  I suspect statistical approaches 
like wikipedia-miner uses are likely to be more productive than pure 
parsing approaches considering only one record at a time in 
isolation.   But writing your own statistics analysis algorithms is 
probably more work than you want, especially when wikipedia-miner and/or 
VIAF/OCLC services already exist.


If you don't do statistical analysis of the corpus, and do end up 
actually trying to search wikipedia directly -- then I suspect dbpedia 
is a lot more convenient endpoint than trying to screen-scrape HTML 
wikipedia. That's pretty much what dbpedia is for.
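
For what it's worth, a dbpedia lookup by name and birth year might look 
something like the Python sketch below. This is a guess at a reasonable 
query, not something tested against the live service: the use of 
rdfs:label, dbo:Person and dbo:birthDate, the exact-label match, the 
endpoint URL and its format parameter are all assumptions, and the helper 
names person_query and query_url are made up.

```python
from urllib.parse import urlencode

ENDPOINT = "https://dbpedia.org/sparql"  # assumed public endpoint


def person_query(name, birth_year):
    """Build a SPARQL query for people with this label and birth year."""
    # Escape backslashes and double quotes so the name is a safe literal.
    safe = name.replace("\\", "\\\\").replace('"', '\\"')
    return "\n".join([
        "PREFIX dbo: <http://dbpedia.org/ontology/>",
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>",
        "SELECT ?person WHERE {",
        "  ?person a dbo:Person ;",
        f'          rdfs:label "{safe}"@en ;',
        "          dbo:birthDate ?born .",
        f'  FILTER (STRSTARTS(STR(?born), "{int(birth_year)}"))',
        "}",
    ])


def query_url(name, birth_year):
    """Wrap the query in a GET URL for the endpoint."""
    return ENDPOINT + "?" + urlencode({
        "query": person_query(name, birth_year),
        "format": "application/sparql-results+json",
    })
```

An exact label match will miss variant spellings, so in practice this 
would probably need a looser name comparison on top.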


But these are all just my guesses, not informed by any work I've done.

Jonathan


On 5/19/2011 5:40 AM, graham wrote:

I need to be able to take author data from a catalogue record and use it
to look up the author on Wikipedia on the fly. So I may have birth date
and possibly year of death in addition to (one spelling of) the name,
the title of one book the author wrote etc.

I know there are various efforts in progress that will improve the
current situation, but as things stand at the moment what is the best*
way to do this?

1. query wikipedia for as much as possible, parse and select the best
fitting result

2. go via dbpedia/freebase and work back from there

3. use VIAF and/or OCLC services

4. Other?

(I have no experience of 2-4 yet :-(


Thanks
Graham
* 'best' being constrained by:
- need to do this in real-time
- need to avoid dependence on services which may be taken away
or charged for
- being able to justify to librarians as reasonably accurate :-)



Re: [CODE4LIB] wikipedia/author disambiguation

2011-05-19 Thread Karen Coyle
This sounds like a great way to translate from library forms to  
wikipedia name forms. But for on-the-fly use I wonder if it wouldn't  
be more efficient to eliminate the middle man. Karen, can you say a  
little about what it took to link library names to WP? Was it a  
one-step, two-step, etc.?


There is a script that I've seen used, although it doesn't seem to be  
production ready:


  https://ajax.googleapis.com/ajax/libs/jquery/1.4.4/jquery.min.js

One interesting note from the OL experience of linking to WP:  
generally you need to re-reverse the names to get a match: from  
Twain, Mark to Mark Twain. But for some names that isn't the case:  
Mao, Tse-Tung. Edward Betts used Wikipedia to determine which names do  
not get re-reversed.
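
The re-reversal step can be sketched in a few lines of Python. This is 
only an illustration, not the OL code: the function name wikipedia_form 
and the family_name_first set are made up, the set stands in for whatever 
list was derived from Wikipedia, and it is an assumption that excepted 
names simply lose the comma. Anything after a second comma (dates, 
titles) is ignored here.

```python
def wikipedia_form(heading, family_name_first=frozenset()):
    """Turn an inverted library heading into a likely Wikipedia name."""
    parts = [p.strip() for p in heading.split(",")]
    if len(parts) < 2 or not parts[1]:
        # Single-element names have nothing to re-reverse.
        return heading.strip()
    surname, forename = parts[0], parts[1]
    if f"{surname}, {forename}" in family_name_first:
        # Keep family name first; only the comma goes.
        return f"{surname} {forename}"
    return f"{forename} {surname}"


exceptions = {"Mao, Tse-Tung"}
print(wikipedia_form("Twain, Mark", exceptions))
print(wikipedia_form("Mao, Tse-Tung", exceptions))
```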


The OL code for its wikipedia lookup is at:
   
https://github.com/openlibrary/openlibrary/tree/master/openlibrary/catalog/wikipedia


It, however, runs against dumps rather than an API.

kc

Quoting Karen Coombs librarywebc...@gmail.com:


Graham,

I'd advocate using WorldCat Identities to get to the appropriate url
for dbpedia. Each Identity record has a wikipedia element in it that
you could use to link to either Wikipedia or dbpedia.

If you want to see an example of this in action you can check out the
Author Info demo I did for code4lib 2010 here -
http://www.librarywebchic.net/mashups/author_info/info_about_this_author.php?OCLCNum=32939031

The code for this demo is available for download at -
http://www.worldcat.org/devnet/code/devnetDemos/trunk/

You'll want the author_info folder and identity_info.php

Karen

Karen A. Coombs
Product Manager
OCLC Developer Network
coom...@oclc.org


On Thu, May 19, 2011 at 4:40 AM, graham gra...@theseamans.net wrote:

I need to be able to take author data from a catalogue record and use it
to look up the author on Wikipedia on the fly. So I may have birth date
and possibly year of death in addition to (one spelling of) the name,
the title of one book the author wrote etc.

I know there are various efforts in progress that will improve the
current situation, but as things stand at the moment what is the best*
way to do this?

1. query wikipedia for as much as possible, parse and select the best
fitting result

2. go via dbpedia/freebase and work back from there

3. use VIAF and/or OCLC services

4. Other?

(I have no experience of 2-4 yet :-(


Thanks
Graham
* 'best' being constrained by:
- need to do this in real-time
- need to avoid dependence on services which may be taken away
or charged for
- being able to justify to librarians as reasonably accurate :-)







--
Karen Coyle
kco...@kcoyle.net http://kcoyle.net
ph: 1-510-540-7596
m: 1-510-435-8234
skype: kcoylenet


Re: [CODE4LIB] wikipedia/author disambiguation

2011-05-19 Thread Jonathan Rochkind
Curious what script you've seen used that isn't production ready -- I don't 
think you meant to paste in the URL for the jQuery library?


On 5/19/2011 10:39 AM, Karen Coyle wrote:
This sounds like a great way to translate from library forms to 
wikipedia name forms. But for on-the-fly use I wonder if it wouldn't 
be more efficient to eliminate the middle man. Karen, can you say a 
little about what it took to link library names to WP? Was it a 
one-step, two-step, etc.?


There is a script that I've seen used, although it doesn't seem to be 
production ready:


  https://ajax.googleapis.com/ajax/libs/jquery/1.4.4/jquery.min.js

One interesting note from the OL experience of linking to WP: 
generally you need to re-reverse the names to get a match: from 
Twain, Mark to Mark Twain. But for some names that isn't the case: 
Mao, Tse-Tung. Edward Betts used Wikipedia to determine which names do 
not get re-reversed.


The OL code for its wikipedia lookup is at:
  
https://github.com/openlibrary/openlibrary/tree/master/openlibrary/catalog/wikipedia


It, however, runs against dumps rather than an API.

kc

Quoting Karen Coombs librarywebc...@gmail.com:


Graham,

I'd advocate using WorldCat Identities to get to the appropriate url
for dbpedia. Each Identity record has a wikipedia element in it that
you could use to link to either Wikipedia or dbpedia.

If you want to see an example of this in action you can check out the
Author Info demo I did for code4lib 2010 here -
http://www.librarywebchic.net/mashups/author_info/info_about_this_author.php?OCLCNum=32939031 



The code for this demo is available for download at -
http://www.worldcat.org/devnet/code/devnetDemos/trunk/

You'll want the author_info folder and identity_info.php

Karen

Karen A. Coombs
Product Manager
OCLC Developer Network
coom...@oclc.org


On Thu, May 19, 2011 at 4:40 AM, graham gra...@theseamans.net wrote:
I need to be able to take author data from a catalogue record and use it
to look up the author on Wikipedia on the fly. So I may have birth date
and possibly year of death in addition to (one spelling of) the name,
the title of one book the author wrote etc.

I know there are various efforts in progress that will improve the
current situation, but as things stand at the moment what is the best*
way to do this?

1. query wikipedia for as much as possible, parse and select the best
fitting result

2. go via dbpedia/freebase and work back from there

3. use VIAF and/or OCLC services

4. Other?

(I have no experience of 2-4 yet :-(


Thanks
Graham
* 'best' being constrained by:
- need to do this in real-time
- need to avoid dependence on services which may be taken away
or charged for
- being able to justify to librarians as reasonably accurate :-)