Re: [CODE4LIB] Has anyone used Mechanical Turk?
I used MTurk for a data-gathering project recently. Turnaround time is great: within a few hours. You can set params to accept only Workers whose rejection rate is 5% or less, or the like (a 95% approval rate is the default), to weed out the riffraff. Pretty cost effective. My instructions were complex (check websites A, B, and C, and on each, look for data elements X, Y, and Z and do the hokey-pokey and...) and as a result the quality wasn't great. To avoid that, I'd recommend farming each subtask out into a separate HIT rather than aggregating the entire task in a single HIT.

--SET

Aaron Rubinstein wrote:

I'm looking at Amazon's Mechanical Turk, https://www.mturk.com/mturk/welcome to automate checking the page order of 11,000 scanned books (some were scanned backwards by a long out-of-business vendor). Does anyone on the list have experience using Mechanical Turk? This is a time-sensitive project, so I was hoping to figure out some best practices for attracting Workers, and maybe even a sense of how quickly tasks might get accomplished.

Thanks,
Aaron
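The approval-rate gate described above can be expressed through the MTurk API. A minimal sketch using boto3 (which postdates this thread; the title, description, reward, and question XML are hypothetical placeholders -- only the built-in approval-rate qualification ID and the CreateHIT parameter names come from the MTurk API itself):

```python
# Sketch: parameters for a HIT that only accepts Workers with a >= 95%
# approval rate. The QualificationTypeId below is MTurk's built-in
# "PercentAssignmentsApproved" system qualification.
APPROVAL_RATE_QUAL = "000000000000000000L0"

def build_hit_params(question_xml, reward="0.05", min_approval=95):
    """Assemble keyword arguments for MTurk's CreateHIT operation."""
    return {
        "Title": "Check page order of a scanned book",       # placeholder
        "Description": "Open the linked scan and report whether "
                       "the pages are in the right order.",  # placeholder
        "Reward": reward,
        "MaxAssignments": 3,  # redundancy: three Workers judge each book
        "AssignmentDurationInSeconds": 600,
        "LifetimeInSeconds": 86400,
        "Question": question_xml,
        "QualificationRequirements": [{
            "QualificationTypeId": APPROVAL_RATE_QUAL,
            "Comparator": "GreaterThanOrEqualTo",
            "IntegerValues": [min_approval],
        }],
    }

# With boto3 and AWS credentials configured, posting would look like:
#   import boto3
#   mturk = boto3.client("mturk", region_name="us-east-1")
#   mturk.create_hit(**build_hit_params(question_xml))
```

Splitting each book into its own HIT, as recommended above, then becomes one create_hit call per identifier.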
Re: [CODE4LIB] oca api?
--- Tim Shearer [EMAIL PROTECTED] wrote:

Hi Folks,

I'm looking into tapping the texts in the Open Content Alliance. A few questions... As near as I can tell, they don't expose (perhaps don't even store?) any common unique identifiers (OCLC number, ISSN, ISBN, LC number).

I poked around in this world a few months ago in my previous job at the California Digital Library, also an OCA partner. The unique key seems to be a text string identifier (one that seems to be completely different from the text string identifier in Open Library). Apparently there was talk at the last partner meeting about moving to ISBNs: http://dilettantes.code4lib.org/2007/10/22/tales-from-the-open-content-alliance/

To obtain identifiers in bulk, I think the recommended approach is the OAI-PMH interface, which seems more reliable in recent months:

http://www.archive.org/services/oai.php?verb=Identify
http://www.archive.org/services/oai.php?verb=ListIdentifiers&metadataPrefix=oai_dc&set=collection:cdl

etc.

Additional instructions if you want to grab the content files. From any book's metadata page (e.g., http://www.archive.org/details/chemicallecturee00newtrich) click through on the Usage Rights: See Terms link; the rights are on a pane on the left-hand side.

Once you know the identifier, you can grab the content files using this syntax:

http://www.archive.org/details/$ID

Like so:

http://www.archive.org/details/chemicallecturee00newtrich

And then sniff the page to find the FTP link:

ftp://ia340915.us.archive.org/2/items/chemicallecturee00newtrich

But I think they prefer HTTP for these, not FTP, so switch this to:

http://ia340915.us.archive.org/2/items/chemicallecturee00newtrich

Hope this helps!
--SET

We're a contributor, so I can use curl to grab our records via HTTP (and regexp my way to our local catalog identifiers, which they do store/expose). I've played a bit with the z39.50 interface at Index Data (http://www.indexdata.dk/opencontent/), but I'm not confident about the content behind it.
I get very limited results; for instance, I can't find any UNC records, though we're fairly new to the game. Again, I'm looking for unique identifiers in what I can get back, and it's slim pickings. Anyone cracked this nut? Got any life lessons for me?

Thanks!
Tim

+++ Tim Shearer Web Development Coordinator The University Library University of North Carolina at Chapel Hill [EMAIL PROTECTED] 919-962-1288 +++
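For anyone following the OAI-PMH route above, a minimal parsing sketch (stdlib only). This only handles the response body; the HTTP fetch and the loop that re-requests with &resumptionToken=... until the token is empty are left as assumptions for the reader:

```python
# Sketch: extract record identifiers and the resumption token from an
# OAI-PMH ListIdentifiers response like the archive.org URL above.
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def parse_identifiers(xml_text):
    """Return (identifiers, resumption_token) from a ListIdentifiers reply.

    resumption_token is None when the list is complete.
    """
    root = ET.fromstring(xml_text)
    ids = [el.text for el in root.iter(OAI_NS + "identifier")]
    token_el = root.find(".//" + OAI_NS + "resumptionToken")
    token = token_el.text if token_el is not None else None
    return ids, token
```

A full harvester would fetch the ListIdentifiers URL, call parse_identifiers, and repeat with the returned token until it comes back empty.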
Re: [CODE4LIB] library find and bibliographic citation export?
A reminder that the data model for OpenURL/COinS does not have all metadata fields: only one author is allowed, there's no abstract, etc. You may want to consider using unAPI instead of COinS. DLF Aquifer has a Rails presentation layer and is using unAPI. The unAPI interface exposes MODS, which is the native format. I've also asked for RIS to be exposed as well, for EndNote/RefWorks support. I tested the developer's code this week; the unAPI part is in great shape, but the Zotero import part still needs a bit of polish before it's final. An open source release is expected eventually, but I'm not sure of the ETA.

--SET

Godmar Back wrote:

FWIW, if we really wanted to, we could process COinS even if they show up via AJAX (at least in FF, via the DOMChanged event).

- Godmar

On 9/27/07, Reese, Terry [EMAIL PROTECTED] wrote:

COinS are included in the output, but because the current pages are loaded via AJAX, the data isn't visible to browser plugins like LibX, Zotero, etc. 0.8.3 will remove nearly all the AJAX -- and when that happens, the COinS data should be visible.

--TR

*** Terry Reese Cataloger for Networked Resources Digital Production Unit Head Oregon State University Libraries Corvallis, OR 97331 tel: 541-737-6384 email: [EMAIL PROTECTED] http: http://oregonstate.edu/~reeset ***

From: Code for Libraries on behalf of Karen Coombs
Sent: Thu 9/27/2007 11:31 AM
To: CODE4LIB@listserv.nd.edu
Subject: Re: [CODE4LIB] library find and bibliographic citation export?

I believe that LibraryFind includes COinS, but they aren't working quite right in the current version. If the COinS were working correctly (which they are supposed to in the next version), then Zotero would read them and allow you to import results. I don't know of anyone who has added a citation export feature otherwise, though. Jeremy or Terry, please correct me if I've got my information about which COinS version does what confused.
Karen

On 9/27/07 11:57 AM, Tim Shearer [EMAIL PROTECTED] wrote:

Hi,

I'm interested to know if anyone working with LibraryFind has begun work to create a tool for bibliographic export to citation management tools like RefWorks, etc.

Thanks!
Tim

+++ Tim Shearer Web Development Coordinator The University Library University of North Carolina at Chapel Hill [EMAIL PROTECTED] 919-962-1288 +++

--
Karen A. Coombs
Head of Libraries' Web Services
University of Houston
114 University Libraries
Houston, TX 77204-2000
Phone: (713) 743-3713
Fax: (713) 743-9811
Email: [EMAIL PROTECTED]
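To make the one-author limitation concrete, here is a sketch that generates a COinS span from the KEV journal format (the article values are hypothetical). Note there is simply no key for a second author or for an abstract, which is the point made at the top of this thread:

```python
# Sketch: build a COinS span for a journal article. The KEV journal
# format offers aulast/aufirst slots for the first author only;
# additional authors and an abstract have nowhere to go.
from urllib.parse import urlencode
import html

def coins_span(atitle, aulast, aufirst, jtitle, date):
    kev = urlencode({
        "ctx_ver": "Z39.88-2004",
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",
        "rft.atitle": atitle,
        "rft.aulast": aulast,    # first author only -- the format's limit
        "rft.aufirst": aufirst,
        "rft.jtitle": jtitle,
        "rft.date": date,
    })
    # COinS = an empty span with class Z3988 and the KEV string,
    # HTML-escaped, in the title attribute.
    return '<span class="Z3988" title="%s"></span>' % html.escape(kev)
```

A tool like Zotero scans the page for class="Z3988" spans and decodes the title attribute, which is why the AJAX-loaded results discussed above are invisible to it: the spans never appear in the DOM it inspects at load time.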
Re: [CODE4LIB] Citation parsing?
Godmar Back wrote:

A year or so ago a couple of students looked into this for LibX. There are a number of systems that people have published about, although some are not available, and none worked very well or was easy to get working. The systems also varied in their computational complexity, with some not suitable for interactive use. Google for "libx citation sensing", or more generally for "citation extraction" or "automatic record boundary detection". (Unfortunately, pubs.dlib.vt.edu appears to be down at the moment; otherwise, Suresh Menon's report contains a useful bibliography of this work. I'll ping them.)

I've tested ParaTools http://search.cpan.org/src/MJEWELL/Biblio-Document-Parser-1.10/docs/html/intro.html but after it choked on most of its own examples, I tried looking elsewhere. Inera's eXtyles refXpress claims to do this. You can see it in action at http://www.crossref.org/SimpleTextQuery/. Better than ParaTools, but it still missed a lot of things I thought would have been obvious. Inera said most of the issues I picked out were a problem with CrossRef's implementation, but the cost of the product was so great that I didn't explore further.

There was an interesting paper at JCDL 2007 on an unsupervised way of doing this that had promising results http://doi.acm.org/10.1145/1255175.1255219 but I haven't found any of their code online.

For citations that contain item titles (true for a majority, but definitely not all, citation styles), LibX's magic button uses Scholar as a hidden backend to produce an actionable OpenURL. Combined with a similarity analysis, this magic button functionality produces a usable OpenURL in (on average) 81% of cases for a set of 400 randomly chosen citations from 4 widely read journals in 4 different areas, published in 2006 [1]. With some fixes, we could probably get this number up to 90%. Obviously, this approach only works for individual use; Google would object to large-scale batch uses.
Agreed that a lookup against something like Google Scholar, Web of Science, or a set of federated search targets may yield better results. We've discussed this but haven't done any testing. --SET

- Godmar

[1] Annette Bailey and Godmar Back, "Retrieving Known Items with LibX." The Serials Librarian, 2007. To appear.

On 7/17/07, Jonathan Rochkind [EMAIL PROTECTED] wrote:

Does anyone have any decent open source code to parse a citation? I'm talking about a completely narrative citation like someone might cut-and-paste from a bibliography or web page. I realize there are a number of different formats this could be in (not to mention the human error problems that always occur with human-entered free text), but thinking about it, I suspect that with some work you could get something that worked reasonably well (if not perfectly). So I'm wondering if anyone has done this work. (One of the commercial legal products -- I forget if it's Lexis or West -- does this with legal citations, a more limited domain, quite well. I'm not sure if any of the commercial bibliographic citation management software does this?)

The goal, as you can probably guess, is a box that the user can paste a citation into: make an OpenURL out of it, then show the user where to get the item. I'm pretty confident something useful could be created here, with enough time put into it. But sadly, it's probably more time than anyone has individually. Unless someone's done it already?

Hopefully,
Jonathan
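As a sketch of the paste-a-citation box: once a parser has pulled fields out of the free text, turning them into an OpenURL for a resolver is mechanical; the hard part, as this thread shows, is the parsing itself. Below, naive_parse is a deliberately toy extractor (a real one would need the techniques discussed above) and the resolver base URL is hypothetical:

```python
# Sketch: toy citation-field extraction plus OpenURL 1.0 generation.
import re
from urllib.parse import urlencode

def naive_parse(citation):
    """Toy extraction: pull a 4-digit year and volume(issue): spage-epage."""
    fields = {}
    year = re.search(r"\b(19|20)\d{2}\b", citation)
    if year:
        fields["date"] = year.group(0)
    vip = re.search(r"(\d+)\((\d+)\):\s*(\d+)-(\d+)", citation)
    if vip:
        fields["volume"], fields["issue"] = vip.group(1), vip.group(2)
        fields["spage"], fields["epage"] = vip.group(3), vip.group(4)
    return fields

def to_openurl(fields, resolver="http://resolver.example.edu/openurl"):
    """Build a KEV OpenURL query string from extracted journal fields."""
    kev = {"url_ver": "Z39.88-2004",
           "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal"}
    kev.update(("rft." + k, v) for k, v in fields.items())
    return resolver + "?" + urlencode(kev)
```

Feeding the result to a link resolver then shows the user where to get the item, which is exactly the box described above.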
[CODE4LIB] OpenURL validation services
Hi--

Is there any existing code that can validate the descriptive metadata of an OpenURL ContextObject? For example, http://www.openurl.info/registry/docs/mtx/info:ofi/fmt:kev:mtx:journal states that auinit1 can have zero or one value and that it must be the first author's first initial. Is there something into which I can input an OpenURL to see whether the auinit1 param value is indeed only one character (either A-Z or a-z) and occurs no more than once... plus all the other constraints for the other parameters in the matrix on http://www.openurl.info/registry/docs/mtx/info:ofi/fmt:kev:mtx:journal?

--SET
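The auinit1 constraint above is easy to check in isolation; a minimal sketch follows (a real validator would table-drive a check like this for every key in the format matrix, which this does not attempt):

```python
# Sketch: validate the auinit1 constraint from the KEV journal format:
# zero or one occurrence, and the value must be a single letter A-Z/a-z.
import re
from urllib.parse import urlparse, parse_qs

def validate_auinit1(openurl):
    """Return a list of constraint violations (empty list = valid)."""
    qs = parse_qs(urlparse(openurl).query)
    # Accept either the 1.0-style prefixed key or the bare key.
    values = qs.get("rft.auinit1", []) + qs.get("auinit1", [])
    problems = []
    if len(values) > 1:
        problems.append("auinit1 occurs more than once")
    elif values and not re.fullmatch(r"[A-Za-z]", values[0]):
        problems.append("auinit1 is not a single letter: %r" % values[0])
    return problems
```

Extending this into the general validator asked about would mean one rule entry (occurrence count plus value pattern) per parameter in the registry matrix.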