Re: [CODE4LIB] Has anyone used Mechanical Turk?

2008-12-09 Thread Steve Toub
I used MTurk for a data gathering project recently. Turnaround time is 
great: within a few hours. You can set parameters to only accept workers 
whose rejection rate is 5% or less (a 95% approval rate is the 
default) to weed out the riffraff. Pretty cost effective.


My instructions were complex (check websites A, B, and C; on each, 
look for data elements X, Y, and Z; do the hokey-pokey; and so on), and as 
a result the quality wasn't great. To avoid that, I'd recommend either 
farming out each subtask as its own HIT or aggregating the entire 
task into a single HIT.
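For what it's worth, the worker filtering above is just a qualification requirement attached to the HIT. Here's a rough sketch using the modern boto3 MTurk client (which postdates this thread; the qualification ID below is Amazon's built-in "PercentAssignmentsApproved" system qualification, and the title/reward values are made up):

```python
# Sketch: assemble create_hit() parameters that require workers with a
# lifetime approval rate of at least 95%. The actual posting call (via
# boto3) is shown commented out at the bottom.

APPROVAL_RATE_QUAL = "000000000000000000L0"  # built-in approval-rate qualification

def build_hit_params(title, description, question_xml, reward="0.05",
                     min_approval_pct=95):
    """Assemble keyword arguments for the MTurk create_hit() call,
    admitting only workers at or above min_approval_pct approval."""
    return {
        "Title": title,
        "Description": description,
        "Question": question_xml,
        "Reward": reward,                     # dollars, as a string
        "MaxAssignments": 3,                  # redundancy to catch sloppy work
        "AssignmentDurationInSeconds": 600,
        "LifetimeInSeconds": 86400,
        "QualificationRequirements": [{
            "QualificationTypeId": APPROVAL_RATE_QUAL,
            "Comparator": "GreaterThanOrEqualTo",
            "IntegerValues": [min_approval_pct],
        }],
    }

# To actually post the HIT you would do something like:
# import boto3
# client = boto3.client("mturk", region_name="us-east-1")
# client.create_hit(**build_hit_params("Check page order", "...", question_xml))
```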


   --SET


Aaron Rubinstein wrote:

I'm looking at Amazon's Mechanical Turk, https://www.mturk.com/mturk/welcome
to automate checking the page order of 11,000 scanned books (some were
scanned backwards by a long out-of-business vendor).

Does anyone on the list have experience using Mechanical Turk?  This is a
time sensitive project so I was hoping to figure out some best practices for
attracting Workers, and maybe even a sense of how quickly tasks might get
accomplished. 

Thanks, 

Aaron  

  


Re: [CODE4LIB] oca api?

2008-02-25 Thread Steve Toub
--- Tim Shearer [EMAIL PROTECTED] wrote:

 Hi Folks,

 I'm looking into tapping the texts in the Open Content Alliance.

 A few questions...

 As near as I can tell, they don't expose (perhaps even store?) any common
 unique identifiers (oclc number, issn, isbn, loc number).

I poked around in this world a few months ago in my previous job at California 
Digital Library,
also an OCA partner.

The unique key seems to be a text-string identifier (one that seems to be
completely different from the text-string identifier in Open Library).
Apparently there was talk at the
last partner meeting
about moving to ISBNs:
http://dilettantes.code4lib.org/2007/10/22/tales-from-the-open-content-alliance/

To obtain identifiers in bulk, I think the recommended approach is the OAI-PMH 
interface, which
seems more reliable in recent months:

http://www.archive.org/services/oai.php?verb=Identify

http://www.archive.org/services/oai.php?verb=ListIdentifiers&metadataPrefix=oai_dc&set=collection:cdl

etc.
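If it helps, a ListIdentifiers page can be walked with nothing but the standard library. This sketch, run against a trimmed-down, hypothetical response (not a real capture from archive.org), pulls out the identifiers and the resumption token you'd use to fetch the next page:

```python
# Sketch: extract record identifiers and the resumptionToken from an
# OAI-PMH ListIdentifiers response, using only the standard library.
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def parse_identifiers(xml_text):
    """Return (identifiers, resumption_token) from a ListIdentifiers page."""
    root = ET.fromstring(xml_text)
    ids = [el.text for el in root.iter(OAI_NS + "identifier")]
    token_el = root.find(".//" + OAI_NS + "resumptionToken")
    token = token_el.text if token_el is not None else None
    return ids, token

# Hypothetical, minimal response for illustration:
sample = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header><identifier>oai:archive.org:chemicallecturee00newtrich</identifier></header>
    <resumptionToken>abc123</resumptionToken>
  </ListIdentifiers>
</OAI-PMH>"""

ids, token = parse_identifiers(sample)
# ids[0] -> "oai:archive.org:chemicallecturee00newtrich", token -> "abc123"
```

Loop on the token (issuing verb=ListIdentifiers&resumptionToken=...) until it comes back empty.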


Additional instructions if you want to grab the content files.

From any book's metadata page (e.g., 
http://www.archive.org/details/chemicallecturee00newtrich)
click through on the "Usage Rights: See Terms" link; the rights are shown in a 
pane on the left-hand side.

Once you know the identifier, you can grab the content files, using this syntax:
http://www.archive.org/details/$ID
Like so:
http://www.archive.org/details/chemicallecturee00newtrich

And then sniff the page to find the FTP link:
ftp://ia340915.us.archive.org/2/items/chemicallecturee00newtrich

But I think they prefer HTTP for these rather than FTP, so switch it to:
http://ia340915.us.archive.org/2/items/chemicallecturee00newtrich
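That last switch is mechanical; a trivial sketch:

```python
# Sketch: convert an archive.org item's FTP link (as sniffed from the
# details page) into the preferred HTTP form.
def ftp_to_http(url):
    if url.startswith("ftp://"):
        return "http://" + url[len("ftp://"):]
    return url  # already HTTP (or something else); leave it alone

print(ftp_to_http("ftp://ia340915.us.archive.org/2/items/chemicallecturee00newtrich"))
# -> http://ia340915.us.archive.org/2/items/chemicallecturee00newtrich
```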

Hope this helps!

  --SET


 We're a contributor so I can use curl to grab our records via HTTP (and
 regexp my way to our local catalog identifiers, which they do
 store/expose).

 I've played a bit with the z39.50 interface at indexdata
 (http://www.indexdata.dk/opencontent/), but I'm not confident about the
 content behind it.  I get very limited results, for instance I can't find
 any UNC records and we're fairly new to the game.

 Again, I'm looking for unique identifiers in what I can get back and it's
 slim pickings.

 Anyone cracked this nut?  Got any life lessons for me?

 Thanks!
 Tim

 +++
 Tim Shearer

 Web Development Coordinator
 The University Library
 University of North Carolina at Chapel Hill
 [EMAIL PROTECTED]
 919-962-1288
 +++



Re: [CODE4LIB] library find and bibliographic citation export?

2007-09-27 Thread Steve Toub

A reminder that the data model for OpenURL/COinS does not include every
metadata field: only one author is allowed, there's no abstract, etc.

You may want to consider using unAPI instead of COinS.

DLF Aquifer has a Rails presentation layer and is using unAPI. The unAPI
interface is exposing MODS, which is the native format. I've also asked
for RIS to get exposed as well for EndNote/RefWorks support. I tested
the developer's code this week; the unAPI part is in great shape, but
the Zotero import part still needs a bit of polish before it's final.
An open source release is expected eventually, but I'm not sure of the ETA.
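For anyone who hasn't worked with unAPI: a bare GET with just an id returns a small formats document listing what the server will expose for that record, which is easy to inspect. A sketch against a hypothetical (but spec-shaped) response:

```python
# Sketch: parse a unAPI <formats> response to see which formats a server
# exposes for an identifier. The sample XML is hypothetical but follows
# the response shape in the unAPI spec.
import xml.etree.ElementTree as ET

def available_formats(formats_xml):
    """Map each advertised format name to its MIME type."""
    root = ET.fromstring(formats_xml)
    return {f.get("name"): f.get("type") for f in root.iter("format")}

sample = """<formats id="record-1">
  <format name="mods" type="application/xml"/>
  <format name="ris" type="application/x-research-info-systems"/>
</formats>"""

fmts = available_formats(sample)
# fmts["mods"] -> "application/xml"
```

You'd then fetch the record itself with ?id=record-1&format=mods (or format=ris for EndNote/RefWorks).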

   --SET


Godmar Back wrote:

FWIW, if we really wanted to, we could process COinS even if they show
up via AJAX (at least in Firefox, via the DOMChanged event).

 - Godmar

On 9/27/07, Reese, Terry [EMAIL PROTECTED] wrote:

COinS are included in the output, but because the current pages are loaded via 
AJAX, the data isn't visible to browser plugins like LibX, Zotero, etc.  0.8.3 
will remove nearly all the AJAX -- and when that happens, the COinS data should 
be visible.
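Once the COinS spans are in the delivered markup, any consumer can pull the ContextObject out of the span's title attribute. A Python sketch, outside the browser, against a minimal hypothetical record:

```python
# Sketch: extract OpenURL metadata from COinS markup. COinS embeds a
# KEV ContextObject as the title attribute of <span class="Z3988">;
# the sample HTML below is a made-up, minimal record.
from html.parser import HTMLParser
from urllib.parse import parse_qs

class CoinsParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.contexts = []          # one dict of parsed KEV fields per span

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "span" and "Z3988" in a.get("class", ""):
            self.contexts.append(parse_qs(a.get("title", "")))

sample = ('<span class="Z3988" title="ctx_ver=Z39.88-2004&amp;'
          'rft.genre=article&amp;rft.atitle=Retrieving+Known+Items"></span>')

p = CoinsParser()
p.feed(sample)
# p.contexts[0]["rft.atitle"] -> ["Retrieving Known Items"]
```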

--TR

***
Terry Reese
Cataloger for Networked Resources
Digital Production Unit Head
Oregon State University Libraries
Corvallis, OR  97331
tel: 541-737-6384
email: [EMAIL PROTECTED]
http: http://oregonstate.edu/~reeset
***



From: Code for Libraries on behalf of Karen Coombs
Sent: Thu 9/27/2007 11:31 AM
To: CODE4LIB@listserv.nd.edu
Subject: Re: [CODE4LIB] library find and bibliographic citation export?



I believe that LibraryFind includes COinS but they aren't working quite
right in the current version. If the COinS were working correctly (which
they are supposed to in the next version) then Zotero would read them and
allow you to import results. I don't know of anyone who has added a citation
export feature otherwise though.

Jeremy or Terry, please correct me if I've confused which version has which
COinS behavior.

Karen


On 9/27/07 11:57 AM, Tim Shearer [EMAIL PROTECTED] wrote:


Hi,

I'm interested to know if anyone working with LibraryFind has begun work
to create a tool for bibliographic export to citation management tools
like refworks, etc.

Thanks!
Tim

+++
Tim Shearer

Web Development Coordinator
The University Library
University of North Carolina at Chapel Hill
[EMAIL PROTECTED]
919-962-1288
+++

--
Karen A. Coombs
Head of Libraries' Web Services
University of Houston
114 University Libraries
Houston, TX  77204-2000
Phone: (713) 743-3713
Fax: (713) 743-9811
Email: [EMAIL PROTECTED]





Re: [CODE4LIB] Citation parsing?

2007-07-18 Thread Steve Toub

Godmar Back wrote:

A year or so ago a couple of students looked into this for LibX. There
are a number of systems that people have published about, although
some are not available and none worked very well or were easy to get
to work. The systems also varied in their computational complexity,
with some not suitable for interactive use. Google for "libx citation
sensing", or more generally for "citation extraction" or "automatic record
boundary detection or extraction". (Unfortunately, pubs.dlib.vt.edu
appears to be down at the moment - otherwise, Suresh Menon's report
contains a useful bibliography of work. I'll ping them.)


I've tested ParaTools
http://search.cpan.org/src/MJEWELL/Biblio-Document-Parser-1.10/docs/html/intro.html
but after it choked on most of its own examples, I tried looking elsewhere.

Inera's eXtyles refXpress claims to do this. You can see it in action
at http://www.crossref.org/SimpleTextQuery/. It did better than ParaTools
but still missed a lot of things I thought would have been obvious.
Inera said most of the issues I picked out were a problem with
CrossRef's implementation, but the cost of the product was so great that
I didn't explore further.

There was an interesting paper at JCDL 2007 on an unsupervised way of
doing this that had promising results
http://doi.acm.org/10.1145/1255175.1255219 but I haven't found any of
their code online.


For citations that contain item titles (which is true for a majority,
but definitely not all citation styles) LibX's magic button uses
Scholar as a hidden backend to produce an actionable OpenURL. Combined
with a similarity analysis, this "magic button" functionality
produces a usable OpenURL in (on average) 81% of cases for a set of
400 randomly chosen citations from 4 widely read journals from 4
different areas published in 2006 [1].  With some fixes, we could
probably get this number up to 90%. Obviously, this approach only
works for individual use; Google would object to large-scale batch
uses.


Agreed that a lookup against something like Google Scholar, Web of
Science, or a set of federated search targets may yield better
results. We've discussed it but haven't done any testing.
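Whichever backend does the lookup, the final step Godmar describes, turning extracted fields into an actionable OpenURL, is mechanical. A sketch using the OpenURL 1.0 journal KEV fields, with a hypothetical resolver base URL:

```python
# Sketch: build an OpenURL 1.0 KEV query string from citation fields
# already extracted by some upstream parser. The resolver base URL is
# hypothetical; only rft.atitle is required here.
from urllib.parse import urlencode

def build_openurl(base, atitle, jtitle=None, aulast=None, date=None):
    params = {
        "ctx_ver": "Z39.88-2004",
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",
        "rft.atitle": atitle,
    }
    if jtitle:
        params["rft.jtitle"] = jtitle
    if aulast:
        params["rft.aulast"] = aulast
    if date:
        params["rft.date"] = date
    return base + "?" + urlencode(params)

url = build_openurl("http://resolver.example.edu/openurl",
                    "Retrieving Known Items with LibX",
                    jtitle="The Serials Librarian", date="2007")
```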
   --SET



- Godmar

[1] Annette Bailey and Godmar Back, Retrieving Known Items with LibX.
The Serials Librarian, 2007. To appear.

On 7/17/07, Jonathan Rochkind [EMAIL PROTECTED] wrote:

Does anyone have any decent open source code to parse a citation? I'm
talking about a completely narrative citation like someone might
cut-and-paste from a bibliography or web page. I realize there are a
number of different formats this could be in (not to mention the human
error problems that always occur from human entered free text)--but
thinking about it, I suspect that with some work you could get something
that worked reasonably well (if not perfect). So I'm wondering if anyone
has done this work.

(One of the commercial legal products--I forget if it's Lexis or
West--does this with legal citations, a more limited domain, quite
well.  I'm not sure if any of the commercial bibliographic citation
management software does this?)

The goal, as you can probably guess, is a box that the user can paste a
citation into; make an OpenURL out of it; show the user where to get the
citation.  I'm pretty confident something useful could be created here,
with enough time put into it. But sadly, it's probably more time than
anyone has individually. Unless someone's done it already?

Hopefully,
Jonathan





[CODE4LIB] OpenURL validation services

2007-03-17 Thread Steve Toub

Hi--

Is there any existing code that can validate the descriptive metadata of
an OpenURL ContextObject?

For example,
http://www.openurl.info/registry/docs/mtx/info:ofi/fmt:kev:mtx:journal
states that "auinit1" can have zero or one value and that it must be the
first author's first initial. Is there something into which I can input
an OpenURL to see whether indeed the auinit1 param value is only one
character (either A-Z or a-z) and has no more than one occurrence...
plus all the other constraints for the other parameters in the Matrix on
http://www.openurl.info/registry/docs/mtx/info:ofi/fmt:kev:mtx:journal
?
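To make it concrete, here's a minimal sketch of just the auinit1 check; I'm after something like this, but covering every parameter in the matrix:

```python
# Sketch: validate the auinit1 constraint from the journal format matrix:
# at most one occurrence and, if present, a single letter A-Z or a-z.
import re
from urllib.parse import urlparse, parse_qsl

def validate_auinit1(openurl):
    """Return a list of violation messages (empty list means valid)."""
    qs = urlparse(openurl).query
    values = [v for k, v in parse_qsl(qs) if k == "rft.auinit1"]
    errors = []
    if len(values) > 1:
        errors.append("auinit1 occurs %d times; at most one allowed" % len(values))
    for v in values:
        if not re.fullmatch(r"[A-Za-z]", v):
            errors.append("auinit1 value %r is not a single letter" % v)
    return errors

# validate_auinit1("http://x/?rft.auinit1=Q")       -> []
# validate_auinit1("http://x/?rft.auinit1=Quentin") -> one violation
```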


   --SET