Re: [CODE4LIB] coverage of google book viewability API
At 06:52 PM 05/08/2008, Tim wrote:

> So, I took a long slow look at ten of the examples from Godmar's file. Nothing I saw disabused me of my opinion: "No preview" pages on Google Book Search are very weak tea.
>
> Are they worthless? Not always. But they usually are. And, unfortunately, you generally need to read the various references pages carefully before you know you were wasting your time.
>
> Some examples:
>
> Risks in Chemical Units (http://books.google.com/books?id=7ctpCAAJ) has one glancing, un-annotated reference in the footnotes of another, apparently different book.
>
> How Trouble Made the Monkey Eat Pepper (http://books.google.com/books?id=wLnGCAAJ) sports three references from other books, two in snippet view and one with no view. Two are bare-bones bibliographic mentions in an index of Canadian children's books and an index of Canadian children's illustrators. The third is another bare-bones mention in a book in Sinhalese.

I don't think anyone's saying there aren't some pretty useless entries in GBS, but these two examples strike me as illustrating only one point: if you go shopping for weak tea you can usually find it.

The post-"here's the code" discussion was focused on the usefulness (or not) of "scanless" GBS records to users of an academic library catalog. A fairly obscure book held by 9 OCLC-participating academic libraries of the unofficial Carnegie class "we have the money so let's buy almost everything" and a 28-page children's story published in 1977 (held by 7 US academics and 11 Canadian academics) are hardly representative of what might be found in most academic library catalogs. While they are perfectly valid illustrations of bad GBS data, they are not valid illustrations of the worthlessness of links to "scanless" GBS records in academic library catalogs.

Note that I'm not looking for valid illustrations (and I'm sure some are out there), because a handful of incredibly intelligent real live academic reference librarian educators down the hall from me have already done extensive playing around with this and have determined there's enough useful data in GBS to offer our users links even when there's no preview or full text available.

Bob Duncan

~!~!~!~!~!~!~!~!~!~!~!~!~
Robert E. Duncan
Systems Librarian
Editor of IT Communications
Lafayette College
Easton, PA 18042
[EMAIL PROTECTED]
http://www.library.lafayette.edu/
Re: [CODE4LIB] marc records sample set
On Fri, May 09, 2008 at 11:58:03AM -0700, Casey Durfee wrote: > On Fri, May 9, 2008 at 11:14 AM, Bess Sadler <[EMAIL PROTECTED]> wrote: > > > > > Casey, you say you're getting indexing times of 1000 records / > > second? That's amazing! I really have to take a closer look at > > MarcThing. Could pymarc really be that much faster than marc4j? Or > > are we comparing apples to oranges since we haven't normalized for > > the kinds of mapping we're doing and the hardware it's running on? > > > > Well, you can't take a closer look at it yet, since I haven't gotten off my > lazy butt and released it. We're still using an older version of the > project in production on LT. I'm going to cut us over to the latest version > this weekend. At this point, being able to say we eat our own dogfood is > the only barrier to release. Looking forward to the release. I'd be interested to see how it compares to the pymarc-indexer branch in FBO/Helios [1]. > The last indexer I wrote (the one used by fac-back-opac) used marc4j and was > around 100-150 a second. Some of the boost was due to better designed code > on my end, but I can't take too much credit. Pymarc is much, much faster. > I never bothered to figure out why. (That wasn't why I switched, though -- > there are some problems with parsing ANSEL with marc4j (*) which I decided > I'd rather be mauled by bears than try and fix -- the performance boost was > just a pleasant surprise). Of course one could use pymarc from java with > Jython. On the small set of documents I'm now indexing (3327) I get 141 rec/sec. This is on my test server, an AMD64 whose processor speed I can't recall. That rate includes pymarc processing (~65%) and the loading of the CSV file into SOLR (~35%). Surely there's some room there for optimization, but it's fast enough for my current purposes. Also, I'm in the camp that would be happy with a ~10,000 record test set. There will always be some edge cases that we'll only solve as they're encountered. 
I need rapid iteration! Gabriel [1] http://fruct.us/trac/fbo/browser/branches/pymarc-indexer/indexer
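For anyone who wants to play with these figures, here is a rough back-of-the-envelope sketch in Python. The record count, the 141 rec/sec rate, and the approximate 65/35 split come from the message above; the function name is made up for illustration, and the projection to larger sets assumes throughput stays constant as the set grows, which nobody in the thread has measured.

```python
def indexing_seconds(records: int, records_per_second: float) -> float:
    """Wall-clock seconds to index `records`, assuming a constant rate."""
    return records / records_per_second


RATE = 141.0         # rec/sec reported for the 3327-record set
PYMARC_SHARE = 0.65  # approximate share of time spent in pymarc processing
SOLR_SHARE = 0.35    # approximate share spent loading the CSV into Solr

total = indexing_seconds(3327, RATE)
print(f"3327 records: {total:.1f}s total, "
      f"{total * PYMARC_SHARE:.1f}s in pymarc, "
      f"{total * SOLR_SHARE:.1f}s loading into Solr")

# Projection to the test-set sizes discussed in the thread (constant-rate assumption).
for size in (10_000, 100_000):
    minutes = indexing_seconds(size, RATE) / 60
    print(f"{size} records: about {minutes:.1f} minutes")
```

At that rate a 10,000 record set would index in a bit over a minute, which fits the "rapid iteration" goal; the 100,000 record figure is only as good as the constant-rate assumption.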
Re: [CODE4LIB] marc records sample set
On Fri, May 9, 2008 at 11:14 AM, Bess Sadler <[EMAIL PROTECTED]> wrote: > > Casey, you say you're getting indexing times of 1000 records / > second? That's amazing! I really have to take a closer look at > MarcThing. Could pymarc really be that much faster than marc4j? Or > are we comparing apples to oranges since we haven't normalized for > the kinds of mapping we're doing and the hardware it's running on? > Well, you can't take a closer look at it yet, since I haven't gotten off my lazy butt and released it. We're still using an older version of the project in production on LT. I'm going to cut us over to the latest version this weekend. At this point, being able to say we eat our own dogfood is the only barrier to release. The last indexer I wrote (the one used by fac-back-opac) used marc4j and was around 100-150 a second. Some of the boost was due to better designed code on my end, but I can't take too much credit. Pymarc is much, much faster. I never bothered to figure out why. (That wasn't why I switched, though -- there are some problems with parsing ANSEL with marc4j (*) which I decided I'd rather be mauled by bears than try and fix -- the performance boost was just a pleasant surprise). Of course one could use pymarc from java with Jython. Undoubtedly we're comparing apples to oranges here. 1000/sec. is about what I can get on my Macbook Pro on some random MARC records I have lying around, with plenty of hand-waving involved. MARCThing does do a fair amount of munging for expanding codes, guessing physical format and what-have-you (but nothing with dates, which is sorely needed), but I think it would be a bad idea to read too much into some anecdotal numbers. --Casey (*) in marc4j's defense, actually due to a bug in Horizon ILS.
Re: [CODE4LIB] marc records sample set
On Fri, May 9, 2008 at 2:23 PM, Joe Hourcle <[EMAIL PROTECTED]> wrote: > OpenLibrary has other datasets that you might be able to use / combine / > whatever to meet your requirements: > > http://openlibrary.org/dev/docs/data This'll get you the other MARC dumps that have been made available to IA through OL: http://www.archive.org/search.php?query=collection%3Aol_data%20marc Lots to work with here. I also wonder if rather than one large test set it wouldn't be good to have smaller test sets which exhibit particular problems or are of a particular type (i.e. music). Jason
Re: [CODE4LIB] marc records sample set
On Fri, 9 May 2008, Bess Sadler wrote: Those of us involved in the Blacklight and VuFind projects are spending lots of time recently thinking about marc records indexing. We're about to start running some performance tests, and we want to create unit tests for our marc to solr indexer, and also people wanting to download and play with the software need to have easy access to a small but representative set of marc records that they can play with. [trimmed] It seems to me that the set that Casey donated to Open Library (http://www.archive.org/details/marc_records_scriblio_net) would be a good place from which to draw records, because although IANAL, this seems to sidestep any legal hurdles. I'd also love to see the ability for the community to contribute test cases. Assuming such a set doesn't exist already (see my question below) this seems like the ideal sort of project for code4lib to host, too. OpenLibrary has other datasets that you might be able to use / combine / whatever to meet your requirements: http://openlibrary.org/dev/docs/data - Joe Hourcle
Re: [CODE4LIB] marc records sample set
On May 9, 2008, at 1:42 PM, Jonathan Rochkind wrote:

> The Blacklight code is not currently using XML or XSLT. It's indexing binary MARC files. I don't know its speed, but I hear it's pretty fast.

Right, I'm talking about the java indexer we're working on, which we're hoping to turn into a plugin contrib module for solr. It processes binary marc files. We're getting times of about 150 records / second, but that's on an unfortunately throttled server and we're munging each record significantly (replacing musical instrument and language codes with their English language equivalents, calculating composition era, etc).

Casey, you say you're getting indexing times of 1000 records / second? That's amazing! I really have to take a closer look at MarcThing. Could pymarc really be that much faster than marc4j? Or are we comparing apples to oranges since we haven't normalized for the kinds of mapping we're doing and the hardware it's running on?

Bess

Elizabeth (Bess) Sadler
Research and Development Librarian
Digital Scholarship Services
Box 400129
Alderman Library
University of Virginia
Charlottesville, VA 22904

[EMAIL PROTECTED]
(434) 243-2305
Re: [CODE4LIB] marc records sample set
The Blacklight code is not currently using XML or XSLT. It's indexing binary MARC files. I don't know its speed, but I hear it's pretty fast.

But for the kind of test set I want, even waiting half an hour is too long. I want a test set where I can make a change to my indexing configuration and then see the results in a few minutes, if not seconds.

This has become apparent in my attempts to get the indexer _working_, where I might not be actually changing the index mapping at all, I'm just changing the indexer configuration until I know it's working. When I get to actually messing with indexer mapping to try out new ideas, I believe it will also be important, however.

Jonathan

Casey Durfee wrote:

I strongly agree that we need something like this. The LoC records that Casey donated are a great resource but far from ideal for this purpose. They're pretty homogeneous. I do think it needs to be bigger than 10,000 though. 100,000 would be a better target. And I would like to see a UNIMARC/DANMARC-based one as well as a MARC21 based one (can one's parser handle DANMARC's "ø" subfield?).

I don't know about Blacklight or VuFind, but using our MarcThing package + Solr we can index up to 1000 records a second. I know using XSLT severely limits how fast you can index (I'll refrain from giving another rant about how wrong it is to use XSL to handle MARC -- the Society for Prevention of Cruelty to Dead Horses has my number as it is.) But I'd still expect you can do a good 50-100 records a second. That's only a half hour to an hour of work to index 100,000 records. You could run it over your lunch break. Seems reasonable to me.

In addition to a wide variety of languages, encodings, formats and so forth, it would definitely need to have records explicitly designed to break things. Blank MARC tags, extraneous subfield markers, non-printing control characters, incorrect length fixed fields etc.
The kind of stuff that should never happen in theory, but happens frequently in real life (I'm looking at you, Horizon's MARC export utility). The legal aspect of this is the difficult part. We (LibraryThing) could easily grab 200 random records from 500 different Z39.50 sources worldwide. Technically, it could be done in a couple of hours. Legally, I don't think it could ever be done, sadly. --Casey On Fri, May 9, 2008 at 9:33 AM, Bess Sadler <[EMAIL PROTECTED]> wrote: Those of us involved in the Blacklight and VuFind projects are spending lots of time recently thinking about marc records indexing. We're about to start running some performance tests, and we want to create unit tests for our marc to solr indexer, and also people wanting to download and play with the software need to have easy access to a small but representative set of marc records that they can play with. According to the combined brainstorming of Jonathan Rochkind and myself, the ideal record set should: 1. contain about 10k records, enough to really see the features, but small enough that you could index it in a few minutes on a typical desktop 2. contain a distribution of kinds of records, e.g., books, CDs, musical scores, DVDs, special collection items, etc. 3. contain a distribution of languages, so we can test unicode handling 4. contain holdings information in addition to bib records 5. contain a distribution of typical errors one might encounter with marc records in the wild It seems to me that the set that Casey donated to Open Library (http://www.archive.org/details/marc_records_scriblio_net) would be a good place from which to draw records, because although IANAL, this seems to sidestep any legal hurdles. I'd also love to see the ability for the community to contribute test cases. Assuming such a set doesn't exist already (see my question below) this seems like the ideal sort of project for code4lib to host, too. Since code4lib is my lazyweb, I'm asking you: 1. 
Does something like this exist already and I just don't know about it? 2. If not, do you have suggestions on how to go about making such a data set? I have some ideas on how to do it bit by bit, and we have a certain small set of records that we're already using for testing, but maybe there's a better method that I don't know about? 3. Are there features missing from the above list that would make this more useful? Thoughts? Comments? Thanks! Bess Elizabeth (Bess) Sadler Research and Development Librarian Digital Scholarship Services Box 400129 Alderman Library University of Virginia Charlottesville, VA 22904 [EMAIL PROTECTED] (434) 243-2305 -- Jonathan Rochkind Digital Services Software Engineer The Sheridan Libraries Johns Hopkins University 410.516.8886 rochkind (at) jhu.edu
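Casey's list of records "explicitly designed to break things" (quoted above) lends itself to a small fixture checker. What follows is only a sketch under stated assumptions: fields are represented as simplified `(tag, data)` pairs with indicators left out, `field_problems` is a hypothetical helper rather than part of pymarc, marc4j, or MarcThing, and the only format facts relied on are that ISO 2709 uses 0x1F as the subfield delimiter and that a MARC21 008 field is 40 characters long.

```python
SUBFIELD_DELIM = "\x1f"  # ISO 2709 subfield delimiter


def field_problems(tag, data):
    """Return a list of problems found in one simplified (tag, data) field."""
    problems = []
    if not tag.strip():
        problems.append("blank tag")
    if tag == "008" and len(data) != 40:
        problems.append("008 is not 40 characters")
    # A delimiter directly followed by another delimiter, or sitting at the
    # end of the data, introduces a subfield with no code and no value.
    if SUBFIELD_DELIM * 2 in data or data.endswith(SUBFIELD_DELIM):
        problems.append("extraneous subfield marker")
    if any(ord(ch) < 0x20 and ch != SUBFIELD_DELIM for ch in data):
        problems.append("non-printing control character")
    return problems


# Hand-made fixtures, one per problem type named in the thread.
fixtures = [
    ("245", "\x1faA perfectly ordinary title"),
    ("   ", "\x1faField with a blank tag"),
    ("245", "\x1faTitle\x1f"),
    ("650", "\x1faBirds\x07"),
    ("008", "080509s2008"),
]
for tag, data in fixtures:
    print(repr(tag), field_problems(tag, data))
```

A checker like this cuts both ways: it can confirm that a deliberately broken fixture really is broken in the intended way, and it can flag real-world records worth adding to the sample set.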
Re: [CODE4LIB] marc records sample set
I think you start with a smaller set, but then when you find idiosyncratic records that were NOT represented in your smaller set, you add representative samples to the sample set. The sample set organically grows.

Certainly at some point you've got to test on a larger set too. But I think there's a lot of value in having a small test set too.

Of course, it is something of a challenge to even come up with a reasonably representative small set. But it doesn't need to be absolutely representative---when you find examples not represented, you add them. It grows.

Jonathan

Kyle Banerjee wrote:

According to the combined brainstorming of Jonathan Rochkind and myself, the ideal record set should:

1. contain about 10k records, enough to really see the features, but small enough that you could index it in a few minutes on a typical desktop...
5. contain a distribution of typical errors one might encounter with marc records in the wild

This is much harder to do than might appear on the surface. 10K is a really small set, and the issue is that unless people know how to create a set that really targets the problem areas, you will inevitably miss important stuff. At the end of the day, it's the screwball stuff you didn't think about that always causes the most problems. I think such data sizes are useful for testing interfaces, but not for determining catalog behavior and setup.

Despite the indexing time, I believe in testing with much larger sets. There are certain very important things that just can't be examined with small sets. For example, one huge problem with catalog data is that the completeness and quality is highly variable. When we were experimenting sometime back, we found that how you normalize the data and how you weight terms as well as documents has an enormous impact on search results and that unless you do some tuning, you will inevitably find a lot of garbage too close to the top with a bunch of good stuff ranked so low it isn't found.

kyle

--
Jonathan Rochkind
Digital Services Software Engineer
The Sheridan Libraries
Johns Hopkins University
410.516.8886
rochkind (at) jhu.edu
Re: [CODE4LIB] marc records sample set
> Sounds like you have some experience of this, Kyle!
> Do you have a list of "the screwball stuff"? Even an offhand one would
> be interesting...

I don't have the list with me, but just to rattle a few things off, some extra short records rank high because so much of a search term matches the whole document. Some records contain fields that have been repeated many times which artificially boosts them. You'll see nonstandard use of fields as well as foreign character sets. There are a number of ways URLs are displayed. Deriving format can be problematic because encoded mat type and what you're providing access to are different. Some records contain lots of added entries, while many important ones are fairly minimalist. There are conversions on the fly, purchased record sets, automatically generated ones, full level, ones that automatically have some subject heading added that contains a common search term. There are a zillion other things, but you get the idea.

> What sorts of normalizations do you do? I'm starting to look for
> standard measures of data quality or data validation/normalization
> routines for MARC.

Index terms only once. This helps deal with repeated terms in repetitive subject headings and added entries. Look at presence of fields to assign additional material types (particularly useful for electronic resources since these typically have paper records -- but don't be fooled by links to TOC and stuff that's not full text). Give special handling for serials.

Keywords need to be weighted differently depending on where they're from (e.g. title worth more than subject). We also assigned a rough "record quality" score based on the presence/absence of fields so that longer more complete records don't become less important simply because a search term matches less of them than a short record. Give a bit more weight to a true full text retrieval. Number of libraries holding the item is considered.

When indexing, 650|a is more important than |x, |y, or |z. Don't treat 650|z the same way as 651|a. Recognize that 650|v is a form, that some common 650|x fields should be treated this way (and neither should just be a regular index term).

The only thing we didn't use that a lot of places put a lot of weight on is date -- this is good for retrieving popular fiction, but you have to be really careful with it in academic collections because it can hide classic stuff that's been around a long time.

I can't remember everything off the top of my head, but there's a lot and it makes a big difference.

kyle
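Three of the ideas above (index each term once, weight by source field, score record completeness) can be sketched in a few lines of Python. This is an illustration, not Kyle's implementation: the numeric weights, the list of "quality" fields, and the dict-of-lists record shape are all assumptions invented for the example. Only the ordering (title over subject, 650|a over |x, |y, |z) comes from the message.

```python
from collections import defaultdict

# Hypothetical weights; only their relative order is taken from the thread.
FIELD_WEIGHTS = {"245a": 3.0, "650a": 2.0, "650x": 1.0, "650y": 1.0, "650z": 1.0}
DEFAULT_WEIGHT = 0.5

# Hypothetical set of fields whose presence suggests a fuller record.
QUALITY_FIELDS = ("100a", "245a", "260c", "300a", "650a")


def term_weights(record):
    """Map each term to its best field weight; repeats add nothing."""
    best = defaultdict(float)
    for field, values in record.items():
        weight = FIELD_WEIGHTS.get(field, DEFAULT_WEIGHT)
        for value in values:
            for term in value.lower().split():
                best[term] = max(best[term], weight)
    return dict(best)


def quality_score(record):
    """Fraction of the chosen fields that are present and non-empty."""
    present = sum(1 for field in QUALITY_FIELDS if record.get(field))
    return present / len(QUALITY_FIELDS)


record = {
    "245a": ["Fishes of Oregon"],
    "650a": ["Salmon", "Salmon", "Salmon"],  # repeated heading
    "650z": ["Oregon"],
}
print(term_weights(record))
print(quality_score(record))
```

Because a term keeps only its best weight, the three "Salmon" headings count once, and "Oregon" is credited at the title weight rather than the lower 650|z weight. Tokenizing with `split()` ignores punctuation and stop words, so a real indexer would need more than this.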
Re: [CODE4LIB] marc records sample set
I agree with Kyle that a big, wide set of records is better for testing purposes. In processing records for Evergreen imports, I've found that there are often just a handful that throw marc4j for a loop. I suppose I should cull those and attach them to bug reports... instead I've taken the path of least resistance and just used yaz-marcdump. (bad Dan!) There are, of course, _lots_ of MARC records available for download from http://www.archive.org/search.php?query=collection%3A%22ol_data%22%20AND%20%28MARC%20records%29 - not just the LoC set. So one could presumably assemble a nice big set of records starting here. Dan >>> On Fri, May 9, 2008 at 12:33 PM, Bess Sadler <[EMAIL PROTECTED]> wrote: > Those of us involved in the Blacklight and VuFind projects are > spending lots of time recently thinking about marc records indexing. > We're about to start running some performance tests, and we want to > create unit tests for our marc to solr indexer, and also people > wanting to download and play with the software need to have easy > access to a small but representative set of marc records that they > can play with. > > According to the combined brainstorming of Jonathan Rochkind and > myself, the ideal record set should: > > 1. contain about 10k records, enough to really see the features, but > small enough that you could index it in a few minutes on a typical > desktop > 2. contain a distribution of kinds of records, e.g., books, CDs, > musical scores, DVDs, special collection items, etc. > 3. contain a distribution of languages, so we can test unicode handling > 4. contain holdings information in addition to bib records > 5. contain a distribution of typical errors one might encounter with > marc records in the wild > > It seems to me that the set that Casey donated to Open Library > (http://www.archive.org/details/marc_records_scriblio_net) would be a > good place from which to draw records, because although IANAL, this > seems to sidestep any legal hurdles. 
I'd also love to see the ability > for the community to contribute test cases. Assuming such a set > doesn't exist already (see my question below) this seems like the > ideal sort of project for code4lib to host, too. > > Since code4lib is my lazyweb, I'm asking you: > > 1. Does something like this exist already and I just don't know about > it? > 2. If not, do you have suggestions on how to go about making such a > data set? I have some ideas on how to do it bit by bit, and we have a > certain small set of records that we're already using for testing, > but maybe there's a better method that I don't know about? > 3. Are there features missing from the above list that would make > this more useful? > > Thoughts? Comments? > > Thanks! > Bess > > > Elizabeth (Bess) Sadler > Research and Development Librarian > Digital Scholarship Services > Box 400129 > Alderman Library > University of Virginia > Charlottesville, VA 22904 > > [EMAIL PROTECTED] > (434) 243- 2305
Re: [CODE4LIB] marc records sample set
I strongly agree that we need something like this. The LoC records that Casey donated are a great resource but far from ideal for this purpose. They're pretty homogeneous. I do think it needs to be bigger than 10,000 though. 100,000 would be a better target. And I would like to see a UNIMARC/DANMARC-based one as well as a MARC21 based one (can one's parser handle DANMARC's "ø" subfield?).

I don't know about Blacklight or VuFind, but using our MarcThing package + Solr we can index up to 1000 records a second. I know using XSLT severely limits how fast you can index (I'll refrain from giving another rant about how wrong it is to use XSL to handle MARC -- the Society for Prevention of Cruelty to Dead Horses has my number as it is.) But I'd still expect you can do a good 50-100 records a second. That's only a half hour to an hour of work to index 100,000 records. You could run it over your lunch break. Seems reasonable to me.

In addition to a wide variety of languages, encodings, formats and so forth, it would definitely need to have records explicitly designed to break things. Blank MARC tags, extraneous subfield markers, non-printing control characters, incorrect length fixed fields etc. The kind of stuff that should never happen in theory, but happens frequently in real life (I'm looking at you, Horizon's MARC export utility).

The legal aspect of this is the difficult part. We (LibraryThing) could easily grab 200 random records from 500 different Z39.50 sources worldwide. Technically, it could be done in a couple of hours. Legally, I don't think it could ever be done, sadly.

--Casey

On Fri, May 9, 2008 at 9:33 AM, Bess Sadler <[EMAIL PROTECTED]> wrote:

> Those of us involved in the Blacklight and VuFind projects are spending lots of time recently thinking about marc records indexing.
> We're about to start running some performance tests, and we want to > create unit tests for our marc to solr indexer, and also people > wanting to download and play with the software need to have easy > access to a small but representative set of marc records that they > can play with. > > According to the combined brainstorming of Jonathan Rochkind and > myself, the ideal record set should: > > 1. contain about 10k records, enough to really see the features, but > small enough that you could index it in a few minutes on a typical > desktop > 2. contain a distribution of kinds of records, e.g., books, CDs, > musical scores, DVDs, special collection items, etc. > 3. contain a distribution of languages, so we can test unicode handling > 4. contain holdings information in addition to bib records > 5. contain a distribution of typical errors one might encounter with > marc records in the wild > > It seems to me that the set that Casey donated to Open Library > (http://www.archive.org/details/marc_records_scriblio_net) would be a > good place from which to draw records, because although IANAL, this > seems to sidestep any legal hurdles. I'd also love to see the ability > for the community to contribute test cases. Assuming such a set > doesn't exist already (see my question below) this seems like the > ideal sort of project for code4lib to host, too. > > Since code4lib is my lazyweb, I'm asking you: > > 1. Does something like this exist already and I just don't know about > it? > 2. If not, do you have suggestions on how to go about making such a > data set? I have some ideas on how to do it bit by bit, and we have a > certain small set of records that we're already using for testing, > but maybe there's a better method that I don't know about? > 3. Are there features missing from the above list that would make > this more useful? > > Thoughts? Comments? > > Thanks! 
> Bess > > > Elizabeth (Bess) Sadler > Research and Development Librarian > Digital Scholarship Services > Box 400129 > Alderman Library > University of Virginia > Charlottesville, VA 22904 > > [EMAIL PROTECTED] > (434) 243-2305 >
Re: [CODE4LIB] marc records sample set
>This is much harder to do than might appear on the surface. 10K is a
>really small set, and the issue is that unless people know how to
>create a set that really targets the problem areas, you will
>inevitably miss important stuff. At the end of the day, it's the
>screwball stuff you didn't think about that always causes the most
>problems. I think such data sizes are useful for testing interfaces,
>but not for determining catalog behavior and setup.

Sounds like you have some experience of this, Kyle! Do you have a list of "the screwball stuff"? Even an offhand one would be interesting...

>Despite the indexing time, I believe in testing with much larger sets.
>There are certain very important things that just can't be examined
>with small sets. For example, one huge problem with catalog data is
>that the completeness and quality is highly variable. When we were
>experimenting sometime back, we found that how you normalize the data
>and how you weight terms as well as documents has an enormous impact
>on search results and that unless you do some tuning, you will
>inevitably find a lot of garbage too close to the top with a bunch of
>good stuff ranked so low it isn't found.

What sorts of normalizations do you do? I'm starting to look for standard measures of data quality or data validation/normalization routines for MARC.

I ask because in my experiments with the FRBR Display Tool, I've found the sorts of variations you describe, and I'd like to experiment with more data validation & normalization.

I'm very new at working with MARC data, so even pointers to standard stuff would be really helpful!

-Jodi

Jodi Schneider
Science Library Specialist
Amherst College
413-542-2076
Re: [CODE4LIB] marc records sample set
Bess Sadler wrote:

> 3. Are there features missing from the above list that would make this more useful?

One of the things that Bill Moen showed at Access a couple of years ago (Edmonton?) was what he and others were calling a "radioactive" MARC record: one that had no "normal" payload but, IIRC, had a 245$a whose value was "245$a", etc.

As I recall, it was used to test processes where you wanted to be sure that a specific field was mapped to a specific index, or was showing in a particular Z39.50 profile.

Walter
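The "radioactive" record idea is easy to sketch. The Python below is a guess at the shape of the technique, not a reconstruction of what Bill Moen actually showed: the record is a plain dict keyed by `(tag, code)`, and `INDEX_MAP` and `index_record` are hypothetical stand-ins for whatever mapping configuration a real indexer uses.

```python
def radioactive_record(tags_and_codes):
    """Build a record in which every subfield's value is its own label."""
    return {(tag, code): f"{tag}${code}" for tag, code in tags_and_codes}


# Hypothetical index mapping of the kind an indexer configuration might declare.
INDEX_MAP = {
    "title": [("245", "a")],
    "author": [("100", "a")],
    "subject": [("650", "a"), ("650", "x")],
}


def index_record(record, index_map):
    """Apply the mapping: each index collects the values of its source subfields."""
    return {
        index: [record[key] for key in keys if key in record]
        for index, keys in index_map.items()
    }


record = radioactive_record([("100", "a"), ("245", "a"), ("650", "a"), ("650", "x")])
for index, values in index_record(record, INDEX_MAP).items():
    print(index, values)
```

Because each value names the subfield it came from, reading the indexed output shows directly which fields feed which index; a stray "100$a" in the title index would be an obvious mapping bug.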
Re: [CODE4LIB] marc records sample set
> According to the combined brainstorming of Jonathan Rochkind and
> myself, the ideal record set should:
>
> 1. contain about 10k records, enough to really see the features, but
> small enough that you could index it in a few minutes on a typical
> desktop...
> 5. contain a distribution of typical errors one might encounter with
> marc records in the wild

This is much harder to do than might appear on the surface. 10K is a really small set, and the issue is that unless people know how to create a set that really targets the problem areas, you will inevitably miss important stuff. At the end of the day, it's the screwball stuff you didn't think about that always causes the most problems. I think such data sizes are useful for testing interfaces, but not for determining catalog behavior and setup.

Despite the indexing time, I believe in testing with much larger sets. There are certain very important things that just can't be examined with small sets. For example, one huge problem with catalog data is that the completeness and quality is highly variable. When we were experimenting sometime back, we found that how you normalize the data and how you weight terms as well as documents has an enormous impact on search results and that unless you do some tuning, you will inevitably find a lot of garbage too close to the top with a bunch of good stuff ranked so low it isn't found.

kyle
[CODE4LIB] iteration. interface ideas? unintended consequences.
>What I dislike here is your abnegation of the responsibility to care
>about the choices students make. If you're not considering the value
>of all resources-including the book-you're not playing the library
>game, the educator game or the Google game. You're just throwing stuff
>on screens because you can.

Tim, this appears to conflate the role of the reference librarian and that of the library technologist. While some of us (myself included) straddle these two roles, this list focuses exclusively on the latter.

I read your repeated commentary on this topic as a demand that if we can't play your game (caring about the choices students make), let's not play any game at all.

I respectfully disagree.

Let's start where we are and iterate. Incremental, iterative improvement is much more powerful than I would ever have suspected, even 10 months ago. So even if it's "throwing stuff on screens because you can" (which doesn't sound generous), let's start. We'll get someplace, I think. But only if we play the game.

Here on code4lib, I'd rather get back to your very useful examples of dead ends [1][2]. Let's ask "Would this be useful in any context?" and "How can we give our users more information without sending them spinning to dead ends?"

Distinguishing between full-text links, partial-text links, and pointers to more information is critical and unlike our 856 fields [3], the GoogleBooks API *does* give us that information. The question is how to present it to users. Tim, I hope you'll draw up an interface suggestion, and send us a link to a screenshot.

Of course, I don't think that GoogleBooks is the be-all and end-all; we need to be pulling information from various sources (and some on this list have systems that already are). If you're concerned about that, why not survey the landscape for other APIs--all the ones you can find--and submit a paper to the Journal, write a blog post, or do up a prototype? (Personally, I'd welcome such a collation from anybody!)

Unintended consequences are also worth further exploration, and I'm not the only one [4] who'd love to see more studies of the consequences of changing catalogs and changing online availability. Anything you can contribute there--with real evidence of before and after--would be most welcome as well.

-Jodi

[1] http://books.google.com/books?id=7ctpCAAJ
[2] http://books.google.com/books?id=wLnGCAAJ
[3] http://roytennant.com/proto/856/analysis.html
[4] http://kcoyle.blogspot.com/2007/03/unintended-consequences.html
"When we make materials available, or when we make them more available than they have been in the past, we aren't just providing more service -- we are actually changing what knowledge will be created. We tend to pay attention to what we are making available, but not to think about how differing availability affects the work of our users."
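As one starting point for the interface question, here is a hedged Python sketch of turning a viewability result into distinct link labels. It assumes each result carries a `preview` value of "full", "partial", or "noview" plus `preview_url` and `info_url` fields, which matches my understanding of the viewability API's responses but should be checked against Google's documentation. The label wording, the `link_for` helper, and the sample entry are invented for illustration; the sample is hand-written, not a captured API response.

```python
# Hypothetical label wording; the three categories mirror the distinction
# between full-text links, partial-text links, and pointers to more information.
LINK_LABELS = {
    "full": "Read the full text at Google Books",
    "partial": "Preview this book at Google Books",
    "noview": "More information at Google Books",
}


def link_for(entry):
    """Return (label, url) for one viewability entry, or None if unusable."""
    preview = entry.get("preview")
    label = LINK_LABELS.get(preview)
    if label is None:
        return None  # unknown or missing viewability: show no link
    # Viewable books link to the preview; "noview" falls back to the info page.
    key = "preview_url" if preview in ("full", "partial") else "info_url"
    url = entry.get(key)
    return (label, url) if url else None


# Hand-written sample entry using one of the "no preview" examples from the thread.
sample = {
    "preview": "noview",
    "info_url": "http://books.google.com/books?id=7ctpCAAJ",
}
print(link_for(sample))
```

Keeping the labels in one table makes it cheap to try different wording, or to suppress "noview" links entirely by deleting that entry, which is exactly the policy question being argued in this thread.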
Re: [CODE4LIB] coverage of google book viewability API
> I agree that showing the user evaluative resources that are not any good
> is not a service to the user. When there are no good evaluative
> resources available, we should not show bad ones to the user.

I think we actually agree on what should happen. We disagree on the theory behind that :)

> [And in most library contexts I am familiar with, that is NOT accurately
> described as "right over there". Perhaps you are familiar with other
> sorts of libraries than I. In a large urban public library system or
> just about ANY academic library, this is just as likely to be: "in the
> library, but you are sitting at home right now, which could be a mile
> away or could be two states away", "in another building [which may be
> miles away]", "checked out to a user, but you can recall it if you
> like", "in this building three floors and 200 meters away", or "not in
> our system at all right now but you can ILL it."

I hear you. This is very situation-dependent, but real. Not all books are easy to hand.

I am depressed by many libraries' willingness to ship their books to holding facilities. While the rest of the information world is getting easier, finding your own library's books is often getting harder. Often these offloaded books are the ones you can't find out about from any other source, so you're screwed both physically and digitally.

That said, a lot of talk about the difficulty of getting a book seems like whining. It's a teenager staring into the fridge and yelling "Mom, is there anything to eat!" Colleges are places of serious intellectual work. College work requires a lot of effort *after* you get the resource. Difficult intellectual work isn't a bug, it's a feature. It's why you go to college. Research itself often teaches you. So expecting students to put up with some effort to get better results is not, as the teenager would say, "the end of the world."
> I certainly agree that making it easier to find a book on the shelves is
> another enhancement we should be looking at. I think it was David Walker
> who had a nice OPAC feature that actually gave you a map, with highlighted
> path, from your computer terminal (if you were sitting in the library,
> which again is probably a _minority_ of our opac use), to the book on
> the shelves. That's awfully cool.

I completely agree. I know David Pattern did something like that. I've been thinking about how to offer that sort of mapping as a commodity service.

> But you started out, to my reading, suggesting that Table of Contents,
> reviews,
> and links to other editions could not possibly be useful, and I still take
> exception to that.

No. TOCs and reviews are usually pretty useful. Other editions are useful if there's data there. And the most important cross-edition link should be in your *catalog*, something almost no library does.

Best,
Tim
[CODE4LIB] marc records sample set
Those of us involved in the Blacklight and VuFind projects have been spending lots of time recently thinking about marc records indexing. We're about to start running some performance tests, and we want to create unit tests for our marc to solr indexer. People wanting to download and play with the software also need easy access to a small but representative set of marc records.

According to the combined brainstorming of Jonathan Rochkind and myself, the ideal record set should:

1. contain about 10k records, enough to really see the features, but small enough that you could index it in a few minutes on a typical desktop
2. contain a distribution of kinds of records, e.g., books, CDs, musical scores, DVDs, special collection items, etc.
3. contain a distribution of languages, so we can test unicode handling
4. contain holdings information in addition to bib records
5. contain a distribution of typical errors one might encounter with marc records in the wild

It seems to me that the set that Casey donated to Open Library (http://www.archive.org/details/marc_records_scriblio_net) would be a good place from which to draw records, because although IANAL, this seems to sidestep any legal hurdles. I'd also love to see the ability for the community to contribute test cases. Assuming such a set doesn't exist already (see my question below) this seems like the ideal sort of project for code4lib to host, too.

Since code4lib is my lazyweb, I'm asking you:

1. Does something like this exist already and I just don't know about it?
2. If not, do you have suggestions on how to go about making such a data set? I have some ideas on how to do it bit by bit, and we have a certain small set of records that we're already using for testing, but maybe there's a better method that I don't know about?
3. Are there features missing from the above list that would make this more useful?

Thoughts? Comments? Thanks!
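One possible starting point for drawing such a set, sketched under stated assumptions: it treats the input as a binary MARC (ISO 2709) stream in which every record ends with the 0x1D record terminator, picks k records by reservoir sampling, and tallies leader position 06 (type of record) so the mix of kinds can be checked. The function names and the fake records in the demo are hypothetical; plain random sampling does not by itself guarantee the distribution in point 2, and a real run would read the bytes from a file.

```python
import random
from collections import Counter

RECORD_TERMINATOR = b"\x1d"  # ISO 2709 end-of-record byte


def iter_records(data):
    """Yield each raw record (terminator included) from a byte string."""
    for chunk in data.split(RECORD_TERMINATOR):
        if chunk.strip():
            yield chunk + RECORD_TERMINATOR


def sample_records(records, k, seed=0):
    """Reservoir-sample k records in one pass, without knowing the total."""
    rng = random.Random(seed)
    reservoir = []
    for i, rec in enumerate(records):
        if i < k:
            reservoir.append(rec)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = rec
    return reservoir


def type_counts(records):
    """Tally leader/06, e.g. b'a' language material, b'c' notated music."""
    return Counter(rec[6:7] for rec in records)


def fake_record(type_code):
    """Hypothetical stand-in: a 24-byte leader-like prefix, no real fields."""
    return b"00000n" + type_code + b"x" * 17 + RECORD_TERMINATOR


data = b"".join(fake_record(t) for t in [b"a"] * 50 + [b"c"] * 30 + [b"j"] * 20)
subset = sample_records(iter_records(data), k=10)
print(len(subset), type_counts(subset))
```

If the tally shows too few scores or recordings, a next step would be to sample each type separately with its own quota rather than sampling the whole file at once.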
Bess Elizabeth (Bess) Sadler Research and Development Librarian Digital Scholarship Services Box 400129 Alderman Library University of Virginia Charlottesville, VA 22904 [EMAIL PROTECTED] (434) 243-2305
Re: [CODE4LIB] coverage of google book viewability API
Isn't the whole point of this to get the user to the book? Knowledge about the book should/will come from research and reading, not bad metadata... or even hastily automated extraneous info... and in fact, I'd say that most MARC metadata is only there to get a user to the book, not to describe it to the user (aside from subjects), but mainly to describe it to the library. That said, I showed an example (http://books.google.com/books?id=kdiYGQAACAAJ) in which a Google Books "no view" gets a user to a "full view", completely electronically! (though, my goodness, in the state it's in now it does require experimentation and two extra mouse clicks) What's more, Google already has a link for every book to Worldcat, which helps the user get to a "full view", completely physically. Ideally, researchers will begin to have more and more opportunities to research texts side-by-side with a physical and an electronic copy. That is, of course, once they have more options to download different formats rather than just the rather large PDFs currently available at Google. Right now, yes, most "no view" options offer far less information to a user who can read a MARC record... but I'd presume that as more texts go online, the more value and richness will be added, even to those "no view" cases. Mark Custer -Original Message- From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of Tim Spalding Sent: Friday, May 09, 2008 11:44 AM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] coverage of google book viewability API > Most of our users will start out in an electronic environment whether we like it or not > (most of us on THIS list like it)---and will decide, based on what they > find there, on their own, without us making the decision for > them---whether to obtain (or attempt to obtain) a copy of the physical > book or not. Whether we like it or not. 
>But if you think the options are between US deciding whether the user should consult a physical book or not---then we're not even playing the same game.

What I dislike here is your abnegation of the responsibility to care about the choices students make. If you're not considering the value of all resources -- including the book -- you're not playing the library game, the educator game or the Google game. You're just throwing stuff on screens because you can.

"Whether you like it or not" you're pointing students in some directions and not others. You're giving these resources different amounts of emphasis in your UI. You're including some and not others -- the others include all other web pages and all other offline resources. You aren't making choices for the user, but you're not stepping back and washing your hands of the responsibility to help the student.

In a physical-book context, the book is one of the resources. It deserves to be weighted and evaluated within this larger set of choices. It's your responsibility to consider it within the mix of options. If the book is excellent and the online resources poor, helping the user means communicating this. So, sometimes the OPAC should basically say "there's nothing good online about this book; but it's on the shelf right over there."

*Certainly in Classics that's still true -- the online world is a very impoverished window into the discipline.
Re: [CODE4LIB] coverage of google book viewability API
What I dislike here is your assumption that you know better than your users what's "good" for them/what they want/what they OUGHT to want/what they need/etc. Providing them with available information in a reasonably accessible way, and then trusting them -- whether they're undergrads, grad students, lecturers, profs, researchers, community users, etc -- to make their own decisions about what to do with that based upon their particular, individual, and personal circumstances, locations, contexts, etc., isn't "abnegating" your responsibility -- it IS your responsibility. Larry Campbell UBC Library Tim Spalding wrote: Most of our users will start out in an electronic environment whether we like it or not (most of us on THIS list like it)---and will decide, based on what they find there, on their own, without us making the decision for them---whether to obtain (or attempt to obtain) a copy of the physical book or not. Whether we like it or not. But if you think the options are between US deciding whether the user should consult a physical book or not---then we're not even playing the same game. What I dislike here is your abnegation of the responsibility to care about the choices students make. If you're not considering the value of all resources—including the book—you're not playing the library game, the educator game or the Google game. You're just throwing stuff on screens because you can. "Whether you like it or not" you're pointing students in some directions and not others. You're giving these resources different amounts of emphasis in your UI. You're including some and not others—the others includes all other web pages and all other offline resources. You aren't making choices for the user, but you're not stepping back and washing your hands of the responsibility to help the student. In a physical-book context, the book is one of the resources. It deserves to weighted and evaluated within this larger set of choices. 
It's your responsibility to consider it within the mix of options. If the book is excellent and the online resources poor, helping the user means communicating this. So, sometimes the OPAC should basically say "there's nothing good online about this book; but it's on the shelf right over there." *Certainly in Classics that's still true—the online world is a very impoverished window into the discipline.
Re: [CODE4LIB] coverage of google book viewability API
I agree that showing the user evaluative resources that are not any good is not a service to the user. When there are no good evaluative resources available, we should not show bad ones to the user.

In either case though, with or without evaluative resources, we tell the user where the book is located on the shelves.

[And in most library contexts I am familiar with, that is NOT accurately described as "right over there". Perhaps you are familiar with other sorts of libraries than I. In a large urban public library system or just about ANY academic library, this is just as likely to be: "in the library, but you are sitting at home right now, which could be a mile away or could be two states away", "in another building [which may be miles away]", "checked out to a user, but you can recall it if you like", "in this building three floors and 200 meters away", or "not in our system at all right now but you can ILL it." So I'm having problems with your continued assertions that the book is "right over there" for a user consulting the OPAC. Not necessarily in a majority of the cases for my users, or the users of most other libraries I am familiar with.]

I certainly agree that making it easier to find a book on the shelves is another enhancement we should be looking at. I think it was David Walker who had a nice OPAC feature that actually gave you a map, with highlighted path, from your computer terminal (if you were sitting in the library, which again is probably a _minority_ of our opac use), to the book on the shelves. That's awfully cool.

And in either case, with or without extra evaluative metadata, with or without an interactive map showing you where the book is---in the end it's the user's choice of whether to obtain the book or not. Our job is, indeed, giving them good and useful and accurate evaluative information to aid in this selection process. If you suggest that Google metadata is NOT good and useful and accurate metadata, that's legitimate.
But you started out, to my reading, suggesting that Table of Contents, reviews, and links to other editions could not possibly be useful, and I still take exception to that. Jonathan Tim Spalding wrote: Most of our users will start out in an electronic environment whether we like it or not (most of us on THIS list like it)---and will decide, based on what they find there, on their own, without us making the decision for them---whether to obtain (or attempt to obtain) a copy of the physical book or not. Whether we like it or not. But if you think the options are between US deciding whether the user should consult a physical book or not---then we're not even playing the same game. What I dislike here is your abnegation of the responsibility to care about the choices students make. If you're not considering the value of all resources—including the book—you're not playing the library game, the educator game or the Google game. You're just throwing stuff on screens because you can. "Whether you like it or not" you're pointing students in some directions and not others. You're giving these resources different amounts of emphasis in your UI. You're including some and not others—the others includes all other web pages and all other offline resources. You aren't making choices for the user, but you're not stepping back and washing your hands of the responsibility to help the student. In a physical-book context, the book is one of the resources. It deserves to weighted and evaluated within this larger set of choices. It's your responsibility to consider it within the mix of options. If the book is excellent and the online resources poor, helping the user means communicating this. So, sometimes the OPAC should basically say "there's nothing good online about this book; but it's on the shelf right over there." *Certainly in Classics that's still true—the online world is a very impoverished window into the discipline. 
-- Jonathan Rochkind Digital Services Software Engineer The Sheridan Libraries Johns Hopkins University 410.516.8886 rochkind (at) jhu.edu
Re: [CODE4LIB] coverage of google book viewability API
> Most of our users will start out in an electronic environment whether we like
> it or not (most of us on THIS list like it)---and will decide, based on what they
> find there, on their own, without us making the decision for
> them---whether to obtain (or attempt to obtain) a copy of the physical
> book or not. Whether we like it or not.

>But if you think the options are between US deciding whether the user should
>consult a physical book or not---then we're not even playing the same game.

What I dislike here is your abnegation of the responsibility to care about the choices students make. If you're not considering the value of all resources -- including the book -- you're not playing the library game, the educator game or the Google game. You're just throwing stuff on screens because you can.

"Whether you like it or not" you're pointing students in some directions and not others. You're giving these resources different amounts of emphasis in your UI. You're including some and not others -- the others include all other web pages and all other offline resources. You aren't making choices for the user, but you're not stepping back and washing your hands of the responsibility to help the student.

In a physical-book context, the book is one of the resources. It deserves to be weighted and evaluated within this larger set of choices. It's your responsibility to consider it within the mix of options. If the book is excellent and the online resources poor, helping the user means communicating this. So, sometimes the OPAC should basically say "there's nothing good online about this book; but it's on the shelf right over there."

*Certainly in Classics that's still true -- the online world is a very impoverished window into the discipline.
Re: [CODE4LIB] coverage of google book viewability API
And indeed, that is exactly how I plan to make use of the Google metadata link, and have been suggesting is the best way to make use of it since I entered this conversation: As part of a set of links to 'additional information' about a resource, with no special prominence given to the Google link.

Other "additional information" links I either have already or plan to have soon include:

Amazon.com
isbndb.com (a good aggregator of online prices for purchase, which has a nice api)
Ulrich's (for serials)
Books in Print
OCLC Worldcat
OCLC Identities (for the author of the resource)

Of course, there is definitely a danger here of information overload. If these "additional information" links are laid out sensibly on the page, they shouldn't interfere with the use of the page by people who want to ignore them entirely. But too _many_ links in the "additional information" section may make people ignore it entirely, where a smaller list limited to the most useful links would get more use. But at the moment, we're not really sure which are the 'most useful' links, so I think I'll err on the side of inclusion, up to a maximum of 6 or 7 links.

Jonathan

Tim Spalding wrote:

If the Google link were part of a much larger set of unstressed links, I'd be more inclined to favor it. Lots of linking is a good thing. But a single no-info Google link from a low-information OPAC page seems to compound the deficiencies of one paradigm with that of another. On the subject of "lazy" students, I do think there is a legitimate distinction between what students will do and what they ought to do. Being pro-Web 2.0 doesn't require us to be information relativists. Certainly there is a lot of ignorant criticism about Wikipedia. Wikipedia is a remarkable resource, and an inspiration to us all. Students will and probably should use it when they're starting out on a topic.
That some students will use it long after that, in the place of better resources online and off, because it's the "path of least resistance" isn't just a fact of life we must all bow before. It is a problem we must confront. Not infrequently the right answer is "get off your butt and read a book." Tim On Fri, May 9, 2008 at 8:44 AM, Custer, Mark <[EMAIL PROTECTED]> wrote: For the most part, I completely agree. That said, it's a very tangled web out there, and on occasion those "no preview" views can still lead a user to a "full view" that's offered elsewhere. Here's just one example: http://books.google.com/books?id=kdiYGQAACAAJ (from there, a user can click on the first link to be taken to another metadata page that has access to a "full view") Unfortunately, there's no indication that either of these links will get you to a full-text digitized copy of the book in question (the links always, of course, appear under the header of "References from web pages", which Google has nicely added), and there's also no way to know that a "no preview" book has any such "references from web pages" until you access the item, but it's something, at least, however unintended. It'd be nice, perhaps, if you could put some sort of standard in the metadata header of the webpage (DC or otherwise) to indicate to a harvester (in this case, a crawler) the specific format of the retrieval. Then these links could be labeled as "digitized copies available elsewhere", rather than simply "references from web pages" (which, of course, is all that they are right now), and could also be added to the API callback. That is, of course, if Google doesn't eventually put up these and other localized resources as well (and I'm sure they'll cover most of these, with the collections that they do have)... but until or if they do, it would go a longer way to fulfilling their mission. 
Mark Custer -Original Message- From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of Tim Spalding Sent: Thursday, May 08, 2008 6:52 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] coverage of google book viewability API So, I took a long slow look at ten of the examples from Godmar's file. Nothing I saw disabused me of my opinion: "No preview" pages on Google Book Search are very weak tea. Are they worthless? Not always. But they usually are. And, unfortunately, you generally need to read the various references pages carefully before you know you were wasting your time. Some examples: Risks in Chemical Units (http://books.google.com/books?id=7ctpCAAJ) has one glancing, un-annotated reference in the footnotes of another, apparently different book. How Trouble Made the Monkey Eat Pepper (http://books.google.com/books?id=wLnGCAAJ) sports three references from other books, two in snippet view and one with no view. Two are bare-bones bibliographic mentions in an index of Canadian children's books and an index of Canadian chidren's illustrators. The third is another bare-bones mention in a book in Sinhalese. If the patron is sitting on a computer (which, given this discussion, they
Re: [CODE4LIB] coverage of google book viewability API
Ah, but in our actual world we _don't_ have two choices, to send the user to the physical book or to an electronic metadata surrogate. We _don't_ get to force the user to look at the book. Most of our users will start out in an electronic environment whether we like it or not (most of us on THIS list like it)---and will decide, based on what they find there, on their own, without us making the decision for them---whether to obtain (or attempt to obtain) a copy of the physical book or not. Whether we like it or not. But again, most of us on THIS list like it, and like the challenge of giving the user useful information in the electronic surrogate environment to make up their own mind about whether to obtain the physical book or not. So we can disagree about whether Google metadata is a valuable aid to the user in selection task or not, sure. But if you think the options are between US deciding whether the user should consult a physical book or not---then we're not even playing the same game. Jonathan Tim Spalding wrote: So, I took a long slow look at ten of the examples from Godmar's file. Nothing I saw disabused me of my opinion: "No preview" pages on Google Book Search are very weak tea. Are they worthless? Not always. But they usually are. And, unfortunately, you generally need to read the various references pages carefully before you know you were wasting your time. Some examples: Risks in Chemical Units (http://books.google.com/books?id=7ctpCAAJ) has one glancing, un-annotated reference in the footnotes of another, apparently different book. How Trouble Made the Monkey Eat Pepper (http://books.google.com/books?id=wLnGCAAJ) sports three references from other books, two in snippet view and one with no view. Two are bare-bones bibliographic mentions in an index of Canadian children's books and an index of Canadian chidren's illustrators. The third is another bare-bones mention in a book in Sinhalese. 
If the patron is sitting on a computer (which, given this discussion, they obviously are), the path of least resistance dictates that a journal article will be used before a book. An excellent example. Let's imagine you were doing reference-desk work and a student were to come up to you with a question about a topic. You have two sources you can send them to—the book itself in all its glory, and another source. The other source is the Croatian-language MySpace page of someone whose boyfriend read a chapter of the book once, five years ago. You're not sure if the blog mentions the book, but it might. That something provides the path of least resistance isn't an argument for something. It depends on where the path goes. -- Jonathan Rochkind Digital Services Software Engineer The Sheridan Libraries Johns Hopkins University 410.516.8886 rochkind (at) jhu.edu
Re: [CODE4LIB] coverage of google book viewability API
If the Google link were part of a much larger set of unstressed links, I'd be more inclined to favor it. Lots of linking is a good thing. But a single no-info Google link from a low-information OPAC page seems to compound the deficiencies of one paradigm with that of another. On the subject of "lazy" students, I do think there is a legitimate distinction between what students will do and what they ought to do. Being pro-Web 2.0 doesn't require us to be information relativists. Certainly there is a lot of ignorant criticism about Wikipedia. Wikipedia is a remarkable resource, and an inspiration to us all. Students will and probably should use it when they're starting out on a topic. That some students will use it long after that, in the place of better resources online and off, because it's the "path of least resistance" isn't just a fact of life we must all bow before. It is a problem we must confront. Not infrequently the right answer is "get off your butt and read a book." Tim On Fri, May 9, 2008 at 8:44 AM, Custer, Mark <[EMAIL PROTECTED]> wrote: > For the most part, I completely agree. That said, it's a very tangled > web out there, and on occasion those "no preview" views can still lead a > user to a "full view" that's offered elsewhere. Here's just one > example: > > http://books.google.com/books?id=kdiYGQAACAAJ > (from there, a user can click on the first link to be taken to another > metadata page that has access to a "full view") > > Unfortunately, there's no indication that either of these links will get > you to a full-text digitized copy of the book in question (the links > always, of course, appear under the header of "References from web > pages", which Google has nicely added), and there's also no way to know > that a "no preview" book has any such "references from web pages" until > you access the item, but it's something, at least, however unintended. 
> > It'd be nice, perhaps, if you could put some sort of standard in the > metadata header of the webpage (DC or otherwise) to indicate to a > harvester (in this case, a crawler) the specific format of the > retrieval. Then these links could be labeled as "digitized copies > available elsewhere", rather than simply "references from web pages" > (which, of course, is all that they are right now), and could also be > added to the API callback. That is, of course, if Google doesn't > eventually put up these and other localized resources as well (and I'm > sure they'll cover most of these, with the collections that they do > have)... but until or if they do, it would go a longer way to > fulfilling their mission. > > Mark Custer > > > > -Original Message- > From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of > Tim Spalding > Sent: Thursday, May 08, 2008 6:52 PM > To: CODE4LIB@LISTSERV.ND.EDU > Subject: Re: [CODE4LIB] coverage of google book viewability API > > So, I took a long slow look at ten of the examples from Godmar's file. > Nothing I saw disabused me of my opinion: "No preview" pages on Google > Book Search are very weak tea. > > Are they worthless? Not always. But they usually are. And, > unfortunately, you generally need to read the various references pages > carefully before you know you were wasting your time. > > Some examples: > > Risks in Chemical Units > (http://books.google.com/books?id=7ctpCAAJ) has one glancing, > un-annotated reference in the footnotes of another, apparently > different book. > > How Trouble Made the Monkey Eat Pepper > (http://books.google.com/books?id=wLnGCAAJ) sports three > references from other books, two in snippet view and one with no view. > Two are bare-bones bibliographic mentions in an index of Canadian > children's books and an index of Canadian chidren's illustrators. The > third is another bare-bones mention in a book in Sinhalese. 
> >> If the patron is sitting on a computer (which, given this > discussion, they obviously are), the >> path of least resistance dictates that a journal article will be used > before a book. > > An excellent example. Let's imagine you were doing reference-desk work > and a student were to come up to you with a question about a topic. > You have two sources you can send them to-the book itself in all its > glory, and another source. The other source is the Croatian-language > MySpace page of someone whose boyfriend read a chapter of the book > once, five years ago. You're not sure if the blog mentions the book, > but it might. > > That something provides the path of least resistance isn't an argument > for something. It depends on where the path goes. > -- Check out my library at http://www.librarything.com/profile/timspalding
[CODE4LIB] Sakai and Emory Reserves Direct
A while ago someone posted to this list feedback on reserves applications. Emory Reserves Direct was one of the open source recommendations. We are migrating from Blackboard to the Sakai open source course management system and would like to add some Copyright components. Is there anyone who has experience with either Emory Reserves Direct or Sakai? We will be having a consultant tweak Sakai so any input would be greatly appreciated. I can be contacted off list at [EMAIL PROTECTED] or via the group if appropriate. Thank you very much.

Marianne

--
Marianne E. Giltrud, MSLS
Acting Access Services Librarian
The Catholic University of America Libraries
Mullen Library
620 Michigan Avenue, NE
Washington, DC 20064
202-319-4453
[EMAIL PROTECTED]
http://libraries.cua.edu
Re: [CODE4LIB] coverage of google book viewability API
For the most part, I completely agree. That said, it's a very tangled web out there, and on occasion those "no preview" views can still lead a user to a "full view" that's offered elsewhere. Here's just one example: http://books.google.com/books?id=kdiYGQAACAAJ (from there, a user can click on the first link to be taken to another metadata page that has access to a "full view") Unfortunately, there's no indication that either of these links will get you to a full-text digitized copy of the book in question (the links always, of course, appear under the header of "References from web pages", which Google has nicely added), and there's also no way to know that a "no preview" book has any such "references from web pages" until you access the item, but it's something, at least, however unintended. It'd be nice, perhaps, if you could put some sort of standard in the metadata header of the webpage (DC or otherwise) to indicate to a harvester (in this case, a crawler) the specific format of the retrieval. Then these links could be labeled as "digitized copies available elsewhere", rather than simply "references from web pages" (which, of course, is all that they are right now), and could also be added to the API callback. That is, of course, if Google doesn't eventually put up these and other localized resources as well (and I'm sure they'll cover most of these, with the collections that they do have)... but until or if they do, it would go a longer way to fulfilling their mission. Mark Custer -Original Message- From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of Tim Spalding Sent: Thursday, May 08, 2008 6:52 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] coverage of google book viewability API So, I took a long slow look at ten of the examples from Godmar's file. Nothing I saw disabused me of my opinion: "No preview" pages on Google Book Search are very weak tea. Are they worthless? Not always. But they usually are. 
And, unfortunately, you generally need to read the various references pages carefully before you know you were wasting your time. Some examples: Risks in Chemical Units (http://books.google.com/books?id=7ctpCAAJ) has one glancing, un-annotated reference in the footnotes of another, apparently different book. How Trouble Made the Monkey Eat Pepper (http://books.google.com/books?id=wLnGCAAJ) sports three references from other books, two in snippet view and one with no view. Two are bare-bones bibliographic mentions in an index of Canadian children's books and an index of Canadian children's illustrators. The third is another bare-bones mention in a book in Sinhalese. > If the patron is sitting on a computer (which, given this discussion, they obviously are), the > path of least resistance dictates that a journal article will be used before a book. An excellent example. Let's imagine you were doing reference-desk work and a student were to come up to you with a question about a topic. You have two sources you can send them to: the book itself in all its glory, and another source. The other source is the Croatian-language MySpace page of someone whose boyfriend read a chapter of the book once, five years ago. You're not sure if the blog mentions the book, but it might. That something provides the path of least resistance isn't an argument for it. It depends on where the path goes.
Re: [CODE4LIB] Latest OpenLibrary.org release
On Thu, 2008-05-08 at 11:41 -0400, Godmar Back wrote: > On Thu, May 8, 2008 at 11:25 AM, Dr R. Sanderson > <[EMAIL PROTECTED]> wrote: > > > > Like what? The current API seems to be concerned with search. Search > > is what SRU does well. If it was concerned with harvest, I (and I'm > > sure many others) would have instead suggested OAI-PMH. > > > No, the API presented does not support search. Well, it only "doesn't support search" because the API has been described without using the word 'search'! To quote the API documentation: -- Infogami provides an API to query the database for objects matching particular criteria ... To find objects matching a particular query, send a GET request to http://openlibrary.org/api/things with query as parameter. In this documentation we use curl as a simple command line query client; any software that supports http GET can be used. ... The API supports querying for objects based of string matching. -- And so on. There's a query whose results can be sorted, limited in number, and started at an offset into the result list. Sounds a lot like a search, doesn't it? Rob
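To make the "query plus sort, limit and offset" shape concrete, here is a minimal Python sketch that only builds the GET URL and sends nothing. The endpoint and the "query" parameter name come from the documentation quoted above. Everything else is an assumption made for illustration: that the query is JSON-encoded, that "sort", "limit" and "offset" are keys inside it, the example criteria, and the helper name build_things_url. Check the actual API documentation before relying on any of it.

```python
import json
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint as quoted from the API documentation.
THINGS_ENDPOINT = "http://openlibrary.org/api/things"


def build_things_url(criteria, sort=None, limit=None, offset=None):
    """Return a GET URL carrying the query in a single 'query' parameter.

    ASSUMPTION: the query is a JSON object, and 'sort', 'limit' and
    'offset' are keys inside that object. This is a hypothetical helper,
    not part of any Open Library client.
    """
    query = dict(criteria)  # copy so the caller's dict is not modified
    if sort is not None:
        query["sort"] = sort
    if limit is not None:
        query["limit"] = limit
    if offset is not None:
        query["offset"] = offset
    # urlencode percent-escapes the JSON text so it is safe in a URL.
    return THINGS_ENDPOINT + "?" + urlencode({"query": json.dumps(query)})


# Hypothetical criteria: the type path and field name are illustrative only.
url = build_things_url(
    {"type": "/type/edition", "title": "Risks in Chemical Units"},
    sort="title",
    limit=10,
    offset=20,
)
print(url)
```

The resulting URL could be fetched with curl or any other HTTP client. Seen this way, the request has the same moving parts as an SRU searchRetrieve request (a query, a maximum number of records, a start position), which is the point being made above.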