Re: [CODE4LIB] coverage of google book viewability API

2008-05-09 Thread Bob Duncan

At 06:52 PM 05/08/2008, Tim wrote:

So, I took a long slow look at ten of the examples from Godmar's file.
Nothing I saw disabused me of my opinion: "No preview" pages on Google
Book Search are very weak tea.

Are they worthless? Not always. But they usually are. And,
unfortunately, you generally need to read the various references pages
carefully before you know you were wasting your time.

Some examples:

Risks in Chemical Units
(http://books.google.com/books?id=7ctpCAAJ) has one glancing,
un-annotated reference in the footnotes of another, apparently
different book.

How Trouble Made the Monkey Eat Pepper
(http://books.google.com/books?id=wLnGCAAJ) sports three
references from other books, two in snippet view and one with no view.
Two are bare-bones bibliographic mentions in an index of Canadian
children's books and an index of Canadian children's illustrators. The
third is another bare-bones mention in a book in Sinhalese.



I don't think anyone's saying there aren't some pretty useless
entries in GBS, but these two examples strike me as illustrating only
one point:  if you go shopping for weak tea you can usually find
it.  The post-"here's the code" discussion was focused on the
usefulness (or not) of "scanless" GBS records to users of an academic
library catalog.  A fairly obscure book held by 9 OCLC-participating
academic libraries of the unofficial Carnegie class "we have the
money so let's buy almost everything" and a 28-page children's story
published in 1977 (held by 7 US academics and 11 Canadian academics)
are hardly representative of what might be found in most academic
library catalogs.  While they are perfectly valid illustrations of
bad GBS data, they are not valid illustrations of the worthlessness
of links to "scanless" GBS records in academic library catalogs.

Note that I'm not looking for valid illustrations (and I'm sure some
are out there), because a handful of incredibly intelligent real live
academic reference librarian educators down the hall from me have
already done extensive playing around with this and have determined
there's enough useful data in GBS to offer our users links even when
there's no preview or full text available.

Bob Duncan


~!~!~!~!~!~!~!~!~!~!~!~!~
Robert E. Duncan
Systems Librarian
Editor of IT Communications
Lafayette College
Easton, PA  18042
[EMAIL PROTECTED]
http://www.library.lafayette.edu/


Re: [CODE4LIB] marc records sample set

2008-05-09 Thread Gabriel Sean Farrell
On Fri, May 09, 2008 at 11:58:03AM -0700, Casey Durfee wrote:
> On Fri, May 9, 2008 at 11:14 AM, Bess Sadler <[EMAIL PROTECTED]> wrote:
>
> >
> > Casey, you say you're getting indexing times of 1000 records /
> > second? That's amazing! I really have to take a closer look at
> > MarcThing. Could pymarc really be that much faster than marc4j? Or
> > are we comparing apples to oranges since we haven't normalized for
> > the kinds of mapping we're doing and the hardware it's running on?
> >
>
> Well, you can't take a closer look at it yet, since I haven't gotten off my
> lazy butt and released it.  We're still using an older version of the
> project in production on LT.  I'm going to cut us over to the latest version
> this weekend.  At this point, being able to say we eat our own dogfood is
> the only barrier to release.

Looking forward to the release.  I'd be interested to see how it
compares to the pymarc-indexer branch in FBO/Helios [1].

> The last indexer I wrote (the one used by fac-back-opac) used marc4j and was
> around 100-150 a second.  Some of the boost was due to better designed code
> on my end, but I can't take too much credit.  Pymarc is much, much faster.
> I never bothered to figure out why.  (That wasn't why I switched, though --
> there are some problems with parsing ANSEL with marc4j (*) which I decided
> I'd rather be mauled by bears than try and fix -- the performance boost was
> just a pleasant surprise). Of course one could use pymarc from java with
> Jython.

On the small set of documents I'm now indexing (3327) I get 141
rec/sec.  This is on my test server, an AMD64 whose processor speed I
can't recall.  That rate includes pymarc processing (~65%) and the
loading of the CSV file into SOLR (~35%).  Surely there's some room
there for optimization, but it's fast enough for my current purposes.
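The pymarc-vs-CSV-load split described above is easy to measure with a small timing harness; this is only a sketch, and the stage labels are placeholders rather than anything from the actual indexer:

```python
import time

def timed(stats, label, fn, *args):
    """Run fn(*args), add its wall-clock time to stats[label], return its result."""
    start = time.perf_counter()
    result = fn(*args)
    stats[label] = stats.get(label, 0.0) + (time.perf_counter() - start)
    return result

def report(stats, n_records):
    """Print each stage's share of total time and overall records/second."""
    total = sum(stats.values())
    for label, secs in sorted(stats.items()):
        print("%-8s %5.1f%%" % (label, 100.0 * secs / total))
    print("overall: %.0f rec/sec" % (n_records / total))
```

Wrapping each stage call, e.g. `timed(stats, "pymarc", parse, raw)` and `timed(stats, "csv", load, path)`, gives the percentage breakdown without pulling in a profiler.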
Also, I'm in the camp that would be happy with a ~10,000 record test
set.  There will always be some edge cases that we'll only solve as
they're encountered.  I need rapid iteration!

Gabriel

[1] http://fruct.us/trac/fbo/browser/branches/pymarc-indexer/indexer


Re: [CODE4LIB] marc records sample set

2008-05-09 Thread Casey Durfee
On Fri, May 9, 2008 at 11:14 AM, Bess Sadler <[EMAIL PROTECTED]> wrote:

>
> Casey, you say you're getting indexing times of 1000 records /
> second? That's amazing! I really have to take a closer look at
> MarcThing. Could pymarc really be that much faster than marc4j? Or
> are we comparing apples to oranges since we haven't normalized for
> the kinds of mapping we're doing and the hardware it's running on?
>

Well, you can't take a closer look at it yet, since I haven't gotten off my
lazy butt and released it.  We're still using an older version of the
project in production on LT.  I'm going to cut us over to the latest version
this weekend.  At this point, being able to say we eat our own dogfood is
the only barrier to release.

The last indexer I wrote (the one used by fac-back-opac) used marc4j and was
around 100-150 a second.  Some of the boost was due to better designed code
on my end, but I can't take too much credit.  Pymarc is much, much faster.
I never bothered to figure out why.  (That wasn't why I switched, though --
there are some problems with parsing ANSEL with marc4j (*) which I decided
I'd rather be mauled by bears than try and fix -- the performance boost was
just a pleasant surprise). Of course one could use pymarc from java with
Jython.

Undoubtedly we're comparing apples to oranges here.  1000/sec. is about what
I can get on my Macbook Pro on some random MARC records I have lying around,
with plenty of hand-waving involved.  MARCThing does do a fair amount of
munging for expanding codes, guessing physical format and what-have-you (but
nothing with dates, which is sorely needed), but I think it would be a bad
idea to read too much into some anecdotal numbers.

--Casey

(*) in marc4j's defense, actually due to a bug in Horizon ILS.
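MARCThing's format-guessing code isn't public, but that sort of munging usually starts from Leader/06 (type of record) and Leader/07 (bibliographic level). A minimal sketch, covering only a handful of the codes MARC21 defines:

```python
# Partial map of MARC Leader/06 codes to broad material types
# (MARC21 defines more codes than are listed here).
TYPE_OF_RECORD = {
    "a": "language material",
    "c": "notated music",
    "e": "cartographic material",
    "g": "projected medium",
    "i": "nonmusical sound recording",
    "j": "musical sound recording",
    "m": "computer file",
}

def guess_format(leader):
    """Rough physical-format guess from a 24-character MARC leader."""
    rec_type, bib_level = leader[6], leader[7]
    if rec_type == "a":
        # Language material: bibliographic level distinguishes serials from books.
        return "serial" if bib_level == "s" else "book"
    return TYPE_OF_RECORD.get(rec_type, "unknown")
```

Real code would also consult the 007 and 008 fields, which is where much of the "fair amount of munging" comes in.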


Re: [CODE4LIB] marc records sample set

2008-05-09 Thread Jason Ronallo
On Fri, May 9, 2008 at 2:23 PM, Joe Hourcle
<[EMAIL PROTECTED]> wrote:
> OpenLibrary has other datasets that you might be able to use / combine /
> whatever to meet your requirements:
>
>   http://openlibrary.org/dev/docs/data

This'll get you the other MARC dumps that have been made available to
IA through OL:
http://www.archive.org/search.php?query=collection%3Aol_data%20marc

Lots to work with here.

I also wonder if rather than one large test set it wouldn't be good to
have smaller test sets which exhibit particular problems or are of a
particular type (i.e. music).

Jason


Re: [CODE4LIB] marc records sample set

2008-05-09 Thread Joe Hourcle

On Fri, 9 May 2008, Bess Sadler wrote:


Those of us involved in the Blacklight and VuFind projects are
spending lots of time recently thinking about marc records indexing.
We're about to start running some performance tests, and we want to
create unit tests for our marc to solr indexer, and also people
wanting to download and play with the software need to have easy
access to a small but representative set of marc records that they
can play with.


[trimmed]


It seems to me that the set that Casey donated to Open Library
(http://www.archive.org/details/marc_records_scriblio_net) would be a
good place from which to draw records, because although IANAL, this
seems to sidestep any legal hurdles. I'd also love to see the ability
for the community to contribute test cases. Assuming such a set
doesn't exist already (see my question below) this seems like the
ideal sort of project for code4lib to host, too.


OpenLibrary has other datasets that you might be able to use / combine /
whatever to meet your requirements:

   http://openlibrary.org/dev/docs/data


-
Joe Hourcle


Re: [CODE4LIB] marc records sample set

2008-05-09 Thread Bess Sadler

On May 9, 2008, at 1:42 PM, Jonathan Rochkind wrote:


The Blacklight code is not currently using XML or XSLT. It's indexing
binary MARC files. I don't know it's speed, but I hear it's pretty
fast.


Right, I'm talking about the java indexer we're working on, which
we're hoping to turn into a plugin contrib module for solr. It
processes binary marc files. We're getting times of about 150
records / second, but that's on an unfortunately throttled server and
we're munging each record significantly (replacing musical instrument
and language codes with their English language equivalents,
calculating composition era, etc).

Casey, you say you're getting indexing times of 1000 records /
second? That's amazing! I really have to take a closer look at
MarcThing. Could pymarc really be that much faster than marc4j? Or
are we comparing apples to oranges since we haven't normalized for
the kinds of mapping we're doing and the hardware it's running on?

Bess


Elizabeth (Bess) Sadler
Research and Development Librarian
Digital Scholarship Services
Box 400129
Alderman Library
University of Virginia
Charlottesville, VA 22904

[EMAIL PROTECTED]
(434) 243-2305


Re: [CODE4LIB] marc records sample set

2008-05-09 Thread Jonathan Rochkind

The Blacklight code is not currently using XML or XSLT. It's indexing
binary MARC files. I don't know it's speed, but I hear it's pretty fast.

But for the kind of test set I want, even waiting half an hour is too
long. I want a test set where I can make a change to my indexing
configuration and then see the results in a few minutes, if not seconds.

This has become apparent in my attempts to get the indexer _working_,
where I might not be actually changing the index mapping at all; I'm
just changing the indexer configuration until I know it's working. When
I get to actually messing with indexer mapping to try out new ideas, I
believe it will also be important, however.

Jonathan

Casey Durfee wrote:

I strongly agree that we need something like this.  The LoC records that
Casey donated are a great resource but far from ideal for this purpose.
They're pretty homogeneous.  I do think it needs to be bigger than 10,000
though.  100,000 would be a better target.  And I would like to see a
UNIMARC/DANMARC-based one as well as a MARC21 based one (can one's parser
handle DANMARC's "ø" subfield?).

I don't know about Blacklight or VuFind, but using our MarcThing package +
Solr we can index up to 1000 records a second.  I know using XSLT severely
limits how fast you can index (I'll refrain from giving another rant about
how wrong it is to use XSL to handle MARC -- the Society for Prevention of
Cruelty to Dead Horses has my number as it is.)  But I'd still expect you
can do a good 50-100 records a second.  That's only a half hour to an hour
of work to index 100,000 records.  You could run it over your lunch break.
Seems reasonable to me.

In addition to a wide variety of languages, encodings, formats and so forth,
it would definitely need to have records explicitly designed to break
things.  Blank MARC tags, extraneous subfield markers, non-printing control
characters, incorrect length fixed fields etc.  The kind of stuff that
should never happen in theory, but happens frequently in real life (I'm
looking at you, Horizon's MARC export utility).

The legal aspect of this is the difficult part.  We (LibraryThing) could
easily grab 200 random records from 500 different Z39.50 sources worldwide.
Technically, it could be done in a couple of hours.  Legally, I don't think
it could ever be done, sadly.

--Casey


On Fri, May 9, 2008 at 9:33 AM, Bess Sadler <[EMAIL PROTECTED]> wrote:



Those of us involved in the Blacklight and VuFind projects are
spending lots of time recently thinking about marc records indexing.
We're about to start running some performance tests, and we want to
create unit tests for our marc to solr indexer, and also people
wanting to download and play with the software need to have easy
access to a small but representative set of marc records that they
can play with.

According to the combined brainstorming of Jonathan Rochkind and
myself, the ideal record set should:

1. contain about 10k records, enough to really see the features, but
small enough that you could index it in a few minutes on a typical
desktop
2. contain a distribution of kinds of records, e.g., books, CDs,
musical scores, DVDs, special collection items, etc.
3. contain a distribution of languages, so we can test unicode handling
4. contain holdings information in addition to bib records
5. contain a distribution of typical errors one might encounter with
marc records in the wild

It seems to me that the set that Casey donated to Open Library
(http://www.archive.org/details/marc_records_scriblio_net) would be a
good place from which to draw records, because although IANAL, this
seems to sidestep any legal hurdles. I'd also love to see the ability
for the community to contribute test cases. Assuming such a set
doesn't exist already (see my question below) this seems like the
ideal sort of project for code4lib to host, too.

Since code4lib is my lazyweb, I'm asking you:

1. Does something like this exist already and I just don't know about
it?
2. If not, do you have suggestions on how to go about making such a
data set? I have some ideas on how to do it bit by bit, and we have a
certain small set of records that we're already using for testing,
but maybe there's a better method that I don't know about?
3. Are there features missing from the above list that would make
this more useful?

Thoughts? Comments?

Thanks!
Bess


Elizabeth (Bess) Sadler
Research and Development Librarian
Digital Scholarship Services
Box 400129
Alderman Library
University of Virginia
Charlottesville, VA 22904

[EMAIL PROTECTED]
(434) 243-2305







--
Jonathan Rochkind
Digital Services Software Engineer
The Sheridan Libraries
Johns Hopkins University
410.516.8886
rochkind (at) jhu.edu


Re: [CODE4LIB] marc records sample set

2008-05-09 Thread Jonathan Rochkind

I think you start with a smaller set, but then when you find
idiosyncratic records that were NOT represented in your smaller set, you
add representative samples to the sample set. The sample set organically
grows.

Certainly at some point you've got to test on a larger set too. But I
think there's a lot of value in having a small test set too. Of course,
it is something of a challenge to even come up with a reasonably
representative small set. But it doesn't need to be absolutely
representative---when you find examples not represented, you add them.
It grows.

Jonathan

Kyle Banerjee wrote:

According to the combined brainstorming of Jonathan Rochkind and
myself, the ideal record set should:

1. contain about 10k records, enough to really see the features, but
small enough that you could index it in a few minutes on a typical
desktop...





5. contain a distribution of typical errors one might encounter with
marc records in the wild



This is much harder to do than might appear on the surface. 10K is a
really small set, and the issue is that unless people know how to
create a set that really targets the problem areas, you will
inevitably miss important stuff. At the end of the day, it's the
screwball stuff you didn't think about that always causes the most
problems. I think such data sizes are useful for testing interfaces,
but not for determining catalog behavior and setup.

Despite the indexing time, I believe in testing with much larger sets.
There are certain very important things that just can't be examined
with small sets. For example, one huge problem with catalog data is
that the completeness and quality is highly variable. When we were
experimenting sometime back, we found that how you normalize the data
and how you weight terms as well as documents has an enormous impact
on search results and that unless you do some tuning, you will
inevitably find a lot of garbage too close to the top with a bunch of
good stuff ranked so low it isn't found.

kyle




--
Jonathan Rochkind
Digital Services Software Engineer
The Sheridan Libraries
Johns Hopkins University
410.516.8886
rochkind (at) jhu.edu


Re: [CODE4LIB] marc records sample set

2008-05-09 Thread Kyle Banerjee
> Sounds like you have some experience of this, Kyle!
> Do you have a list of "the screwball stuff"? Even an offhand one would
> be interesting...

I don't have the list with me, but just to rattle a few things off,
some extra short records rank high because so much of a search term
matches the whole document. Some records contain fields that have been
repeated many times which artificially boosts them. You'll see
nonstandard use of fields as well as foreign character sets. There are
a number of ways URLs are displayed. Deriving format can be
problematic because encoded mat type and what you're providing access
to are different. Some records contain lots of added entries, while
many important ones are fairly minimalist. There are on-the-fly
conversions, purchased record sets, automatically generated records,
full-level records, and records that automatically have some subject
heading added that contains a common search term. There are a zillion
other things, but you get
the idea.

> What sorts of normalizations do you do? I'm starting to look for
> standard measures of data quality or data validation/normalization
> routines for MARC.

Index terms only once. This helps deal with repeated terms in
repetitive subject headings and added entries. Look at presence of
fields to assign additional material types (particularly useful for
electronic resources since these typically have paper records -- but
don't be fooled by links to TOC and stuff that's not full text). Give
special handling for serials. Keywords need to be weighted differently
depending on where they're from (e.g. title worth more than subject).
We also assigned a rough "record quality" score based on the
presence/absence of fields so that longer, more complete records don't become less
important simply because a search term matches less of them than a
short record. Give a bit more weight to a true full text retrieval.
Number of libraries holding the item is considered. When indexing,
650|a is more important than |x, |y, or |z. Don't treat 650|z the same
way as 651|a. Recognize that 650|v is a form, that some common 650|x
fields should be treated this way (and neither should just be a
regular index term). The only thing we didn't use that a lot of places
put a lot of weight on is date -- this is good for retrieving popular
fiction, but you have to be really careful with it in academic
collections because it can hide classic stuff that's been around a
long time. I can't remember everything off the top of my head, but
there's a lot and it makes a big difference.
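A toy rendering of two of the ideas above: index each (term, field) pair only once, and compute a rough completeness score from the presence or absence of fields. The field names and weights here are invented for illustration, not the actual values used:

```python
# Hypothetical per-field weights: title terms count more than subject terms.
FIELD_WEIGHTS = {"title": 4.0, "author": 3.0, "subject": 2.0, "note": 1.0}

# Fields whose presence suggests a fuller, higher-quality record.
QUALITY_FIELDS = ("author", "title", "publisher", "subjects", "notes")

def index_terms(record):
    """record: dict of field name -> text. Index each (term, field) pair once,
    so repeated terms in repetitive headings don't artificially boost a record."""
    seen = set()
    postings = []
    for field, text in record.items():
        weight = FIELD_WEIGHTS.get(field, 1.0)
        for term in text.lower().split():
            if (term, field) not in seen:
                seen.add((term, field))
                postings.append((term, field, weight))
    return postings

def quality_score(record):
    """Rough completeness score: fraction of expected fields present."""
    return sum(1 for f in QUALITY_FIELDS if record.get(f)) / float(len(QUALITY_FIELDS))
```

The quality score can then feed into document-level boosting so that a short record doesn't outrank a complete one merely because the search term is a larger fraction of its text.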

kyle


Re: [CODE4LIB] marc records sample set

2008-05-09 Thread Dan Scott
I agree with Kyle that a big, wide set of records is better for testing 
purposes. In processing records for Evergreen imports, I've found that there 
are often just a handful that throw marc4j for a loop. I suppose I should cull 
those and attach them to bug reports... instead I've taken the path of least 
resistance and just used yaz-marcdump. (bad Dan!)

There are, of course, _lots_ of MARC records available for download from
http://www.archive.org/search.php?query=collection%3A%22ol_data%22%20AND%20%28MARC%20records%29
- not just the LoC set. So one could presumably assemble a nice big set of 
records starting here.

Dan


>>> On Fri, May 9, 2008 at 12:33 PM, Bess Sadler <[EMAIL PROTECTED]> wrote:
> Those of us involved in the Blacklight and VuFind projects are
> spending lots of time recently thinking about marc records indexing.
> We're about to start running some performance tests, and we want to
> create unit tests for our marc to solr indexer, and also people
> wanting to download and play with the software need to have easy
> access to a small but representative set of marc records that they
> can play with.
>
> According to the combined brainstorming of Jonathan Rochkind and
> myself, the ideal record set should:
>
> 1. contain about 10k records, enough to really see the features, but
> small enough that you could index it in a few minutes on a typical
> desktop
> 2. contain a distribution of kinds of records, e.g., books, CDs,
> musical scores, DVDs, special collection items, etc.
> 3. contain a distribution of languages, so we can test unicode handling
> 4. contain holdings information in addition to bib records
> 5. contain a distribution of typical errors one might encounter with
> marc records in the wild
>
> It seems to me that the set that Casey donated to Open Library
> (http://www.archive.org/details/marc_records_scriblio_net) would be a
> good place from which to draw records, because although IANAL, this
> seems to sidestep any legal hurdles. I'd also love to see the ability
> for the community to contribute test cases. Assuming such a set
> doesn't exist already (see my question below) this seems like the
> ideal sort of project for code4lib to host, too.
>
> Since code4lib is my lazyweb, I'm asking you:
>
> 1. Does something like this exist already and I just don't know about
> it?
> 2. If not, do you have suggestions on how to go about making such a
> data set? I have some ideas on how to do it bit by bit, and we have a
> certain small set of records that we're already using for testing,
> but maybe there's a better method that I don't know about?
> 3. Are there features missing from the above list that would make
> this more useful?
>
> Thoughts? Comments?
>
> Thanks!
> Bess
>
>
> Elizabeth (Bess) Sadler
> Research and Development Librarian
> Digital Scholarship Services
> Box 400129
> Alderman Library
> University of Virginia
> Charlottesville, VA 22904
>
> [EMAIL PROTECTED]
> (434) 243-2305


Re: [CODE4LIB] marc records sample set

2008-05-09 Thread Casey Durfee
I strongly agree that we need something like this.  The LoC records that
Casey donated are a great resource but far from ideal for this purpose.
They're pretty homogeneous.  I do think it needs to be bigger than 10,000
though.  100,000 would be a better target.  And I would like to see a
UNIMARC/DANMARC-based one as well as a MARC21 based one (can one's parser
handle DANMARC's "ø" subfield?).

I don't know about Blacklight or VuFind, but using our MarcThing package +
Solr we can index up to 1000 records a second.  I know using XSLT severely
limits how fast you can index (I'll refrain from giving another rant about
how wrong it is to use XSL to handle MARC -- the Society for Prevention of
Cruelty to Dead Horses has my number as it is.)  But I'd still expect you
can do a good 50-100 records a second.  That's only a half hour to an hour
of work to index 100,000 records.  You could run it over your lunch break.
Seems reasonable to me.

In addition to a wide variety of languages, encodings, formats and so forth,
it would definitely need to have records explicitly designed to break
things.  Blank MARC tags, extraneous subfield markers, non-printing control
characters, incorrect length fixed fields etc.  The kind of stuff that
should never happen in theory, but happens frequently in real life (I'm
looking at you, Horizon's MARC export utility).

The legal aspect of this is the difficult part.  We (LibraryThing) could
easily grab 200 random records from 500 different Z39.50 sources worldwide.
Technically, it could be done in a couple of hours.  Legally, I don't think
it could ever be done, sadly.

--Casey


On Fri, May 9, 2008 at 9:33 AM, Bess Sadler <[EMAIL PROTECTED]> wrote:

> Those of us involved in the Blacklight and VuFind projects are
> spending lots of time recently thinking about marc records indexing.
> We're about to start running some performance tests, and we want to
> create unit tests for our marc to solr indexer, and also people
> wanting to download and play with the software need to have easy
> access to a small but representative set of marc records that they
> can play with.
>
> According to the combined brainstorming of Jonathan Rochkind and
> myself, the ideal record set should:
>
> 1. contain about 10k records, enough to really see the features, but
> small enough that you could index it in a few minutes on a typical
> desktop
> 2. contain a distribution of kinds of records, e.g., books, CDs,
> musical scores, DVDs, special collection items, etc.
> 3. contain a distribution of languages, so we can test unicode handling
> 4. contain holdings information in addition to bib records
> 5. contain a distribution of typical errors one might encounter with
> marc records in the wild
>
> It seems to me that the set that Casey donated to Open Library
> (http://www.archive.org/details/marc_records_scriblio_net) would be a
> good place from which to draw records, because although IANAL, this
> seems to sidestep any legal hurdles. I'd also love to see the ability
> for the community to contribute test cases. Assuming such a set
> doesn't exist already (see my question below) this seems like the
> ideal sort of project for code4lib to host, too.
>
> Since code4lib is my lazyweb, I'm asking you:
>
> 1. Does something like this exist already and I just don't know about
> it?
> 2. If not, do you have suggestions on how to go about making such a
> data set? I have some ideas on how to do it bit by bit, and we have a
> certain small set of records that we're already using for testing,
> but maybe there's a better method that I don't know about?
> 3. Are there features missing from the above list that would make
> this more useful?
>
> Thoughts? Comments?
>
> Thanks!
> Bess
>
>
> Elizabeth (Bess) Sadler
> Research and Development Librarian
> Digital Scholarship Services
> Box 400129
> Alderman Library
> University of Virginia
> Charlottesville, VA 22904
>
> [EMAIL PROTECTED]
> (434) 243-2305
>


Re: [CODE4LIB] marc records sample set

2008-05-09 Thread Jodi Schneider
>This is much harder to do than might appear on the surface. 10K is a
>really small set, and the issue is that unless people know how to
>create a set that really targets the problem areas, you will
>inevitably miss important stuff. At the end of the day, it's the
>screwball stuff you didn't think about that always causes the most
>problems. I think such data sizes are useful for testing interfaces,
>but not for determining catalog behavior and setup.

Sounds like you have some experience of this, Kyle!
Do you have a list of "the screwball stuff"? Even an offhand one would
be interesting...

>Despite the indexing time, I believe in testing with much larger sets.
>There are certain very important things that just can't be examined
>with small sets. For example, one huge problem with catalog data is
>that the completeness and quality is highly variable. When we were
>experimenting sometime back, we found that how you normalize the data
>and how you weight terms as well as documents has an enormous impact
>on search results and that unless you do some tuning, you will
>inevitably find a lot of garbage too close to the top with a bunch of
>good stuff ranked so low it isn't found.

What sorts of normalizations do you do? I'm starting to look for
standard measures of data quality or data validation/normalization
routines for MARC.

I ask because in my experiments with the FRBR Display Tool, I've found
sorts of variations you describe, and I'd like to experiment with more
data validation & normalization. I'm very new at working with MARC data,
so even pointers to standard stuff would be really helpful!

-Jodi

Jodi Schneider
Science Library Specialist
Amherst College
413-542-2076


Re: [CODE4LIB] marc records sample set

2008-05-09 Thread Walter Lewis

Bess Sadler wrote:

3. Are there features missing from the above list that would make
this more useful?

One of the things that Bill Moen showed at Access a couple of years ago
(Edmonton?) was what he and others were calling a "radioactive" Marc
record.  One that had no "normal" payload but, IIRC had a 245$a whose
value was "245$a" etc.  As I recall, it was used to test processes where
you wanted to be sure that a specific field was mapped to a specific
index, or was showing in a particular Z39.50 profiles.
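A self-describing record like that is easy to generate; a sketch in which every subfield's value is its own address (tags and subfield codes here are illustrative, not Moen's actual test record):

```python
def radioactive_record(spec):
    """spec: dict of tag -> list of subfield codes.
    Each subfield's value names itself, so after indexing you can see
    exactly which field landed in which index."""
    return {
        tag: {code: "%s$%s" % (tag, code) for code in codes}
        for tag, codes in spec.items()
    }

rec = radioactive_record({"245": ["a", "b"], "650": ["a", "x"]})
# If "245$a" later turns up in the subject index rather than the title
# index, the field-to-index mapping is wrong.
```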

Walter


Re: [CODE4LIB] marc records sample set

2008-05-09 Thread Kyle Banerjee
> According to the combined brainstorming of Jonathan Rochkind and
> myself, the ideal record set should:
>
> 1. contain about 10k records, enough to really see the features, but
> small enough that you could index it in a few minutes on a typical
> desktop...

> 5. contain a distribution of typical errors one might encounter with
> marc records in the wild

This is much harder to do than might appear on the surface. 10K is a
really small set, and the issue is that unless people know how to
create a set that really targets the problem areas, you will
inevitably miss important stuff. At the end of the day, it's the
screwball stuff you didn't think about that always causes the most
problems. I think such data sizes are useful for testing interfaces,
but not for determining catalog behavior and setup.

Despite the indexing time, I believe in testing with much larger sets.
There are certain very important things that just can't be examined
with small sets. For example, one huge problem with catalog data is
that the completeness and quality is highly variable. When we were
experimenting sometime back, we found that how you normalize the data
and how you weight terms as well as documents has an enormous impact
on search results and that unless you do some tuning, you will
inevitably find a lot of garbage too close to the top with a bunch of
good stuff ranked so low it isn't found.

kyle


[CODE4LIB] iteration. interface ideas? unintended consequences.

2008-05-09 Thread Jodi Schneider
>What I dislike here is your abnegation of the responsibility to care
>about the choices students make. If you're not considering the value
>of all resources-including the book-you're not playing the library
>game, the educator game or the Google game. You're just throwing stuff
>on screens because you can.

Tim, this appears to conflate the role of the reference librarian and
that of the library technologist. While some of us (myself included)
straddle these two roles, this list focuses exclusively on the latter.

I read your repeated commentary on this topic as a demand that if we
can't play your game (caring about the choices students make), let's not
play any game at all. I respectfully disagree. Let's start where we are
and iterate. Incremental, iterative improvement is much more powerful
than I would ever have suspected, even 10 months ago. So even if it's
"throwing stuff on screens because you can" (which doesn't sound
generous), let's start. We'll get someplace, I think. But only if we
play the game.

Here on code4lib, I'd rather get back to your very useful examples of
dead ends [1][2]. Let's ask "Would this be useful in any context?" and
"How can we give our users more information without sending them spinning
to dead ends?" Distinguishing between full-text links, partial-text links,
and pointers to more information is critical and unlike our 856 fields
[3], the GoogleBooks API *does* give us that information. The question
is how to present it to users. Tim, I hope you'll draw up an interface
suggestion, and send us a link to a screenshot.

Of course, I don't think that GoogleBooks is the be-all and end-all; we
need to be pulling information from various sources (and some on this
list have systems that already are). If you're concerned about that, why
not survey the landscape for other APIs--all the ones you can find--and
submit a paper to the Journal, write a blog post, or do up a prototype?
(Personally, I'd welcome such a collation from anybody!)

Unintended consequences are also worth further exploration, and I'm not
the only one[4] who'd love to see more studies of the consequences of
changing catalogs and changing online availability. Anything you can
contribute there--with real evidence of before and after--would be most
welcome as well.

-Jodi


[1] http://books.google.com/books?id=7ctpCAAJ
[2] http://books.google.com/books?id=wLnGCAAJ
[3] http://roytennant.com/proto/856/analysis.html
[4] http://kcoyle.blogspot.com/2007/03/unintended-consequences.html
"When we make materials available, or when we make them more available
than they have been in the past, we aren't just providing more service
-- we are actually changing what knowledge will be created. We tend to
pay attention to what we are making available, but not to think about
how differing availability affects the work of our users."


Re: [CODE4LIB] coverage of google book viewability API

2008-05-09 Thread Tim Spalding
> I agree that showing the user evaluative resources that are not any good
> is not a service to the user. When there are no good evaluative
> resources available, we should not show bad ones to the user.

I think we actually agree on what should happen. We disagree on the
theory behind that :)

> [And in most library contexts I am familiar with, that is NOT accurately
> described as "right over there". Perhaps you are familiar with other
> sorts of libraries than I. In a large urban public library system or
> just about ANY academic library, this is just as likely to be: "in the
> library, but you are sitting at home right now, which could be a mile
> away or could be two states away", "in another building [which may be
> miles away]", "checked out to a user, but you can recall it if you
> like", "in this building three floors and 200 meters away", or "not in
> our system at all right now but you can ILL it."

I hear you. This is very situation-dependent, but real. Not all books
are easy to hand. I am depressed by many libraries' willingness to
ship their books to holding facilities. While the rest of the
information world is getting easier, finding your own library's books
is often getting harder. Often these offloaded books are the ones you
can't find out about from any other source, so you're screwed both
physically and digitally.

That said, a lot of talk about the difficulty of getting a book seems
like whining. It's a teenager staring into the fridge and yelling
"Mom, is there anything to eat?" Colleges are places of serious
intellectual work. College work requires a lot of effort *after* you
get the resource. Difficult intellectual work isn't a bug, it's a
feature. It's why you go to college. Research itself often teaches
you. So expecting students to put up with some effort to get better
results is not, as the teenager would say, "the end of the world."

> I certainly agree that making it easier to find a book on the shelves is
> another enhancement we should be looking at. I think it was David Walker
> who had a nice OPAC feature that actually gave you a map, with highlighted
> path, from your computer terminal (if you were sitting in the library,
> which again is probably a _minority_ of our opac use), to the book on
> the shelves. That's awfully cool.

I completely agree. I know David Pattern did something like that. I've
been thinking about how to offer that sort of mapping as a commodity
service.

> But you started out, to my reading, suggesting that Table of Contents,
> reviews, and links to other editions could not possibly be useful, and
> I still take exception to that.

No. TOCs and reviews are usually pretty useful. Other editions are
useful if there's data there. And the most important cross-edition
link should be in your *catalog*, something almost no library does.

Best,
Tim


[CODE4LIB] marc records sample set

2008-05-09 Thread Bess Sadler

Those of us involved in the Blacklight and VuFind projects have been
spending lots of time recently thinking about marc records indexing.
We're about to start running some performance tests, we want to create
unit tests for our marc-to-solr indexer, and people who want to
download and play with the software need easy access to a small but
representative set of marc records.

According to the combined brainstorming of Jonathan Rochkind and
myself, the ideal record set should:

1. contain about 10k records, enough to really see the features, but
small enough that you could index it in a few minutes on a typical
desktop
2. contain a distribution of kinds of records, e.g., books, CDs,
musical scores, DVDs, special collection items, etc.
3. contain a distribution of languages, so we can test unicode handling
4. contain holdings information in addition to bib records
5. contain a distribution of typical errors one might encounter with
marc records in the wild

It seems to me that the set that Casey donated to Open Library
(http://www.archive.org/details/marc_records_scriblio_net) would be a
good place from which to draw records, because although IANAL, this
seems to sidestep any legal hurdles. I'd also love to see the ability
for the community to contribute test cases. Assuming such a set
doesn't exist already (see my question below) this seems like the
ideal sort of project for code4lib to host, too.

Since code4lib is my lazyweb, I'm asking you:

1. Does something like this exist already and I just don't know about
it?
2. If not, do you have suggestions on how to go about making such a
data set? I have some ideas on how to do it bit by bit, and we have a
certain small set of records that we're already using for testing,
but maybe there's a better method that I don't know about?
3. Are there features missing from the above list that would make
this more useful?
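On question 2, one bit-by-bit approach is to stratify an existing donated set (such as the Scriblio one above) by the MARC leader's type-of-record byte, so the sample covers books, scores, recordings, video, and so on, addressing criterion 2. A minimal stdlib sketch, assuming raw MARC21/ISO 2709 input; real code would more likely use a MARC library such as pymarc, and the leader/06 codes referenced in the docstring are standard MARC21:

```python
import random
from collections import defaultdict

RECORD_TERMINATOR = b"\x1d"  # ISO 2709 end-of-record byte

def stratified_sample(raw, per_type=5, seed=42):
    """Split a file's worth of raw MARC21 records and keep up to
    `per_type` records for each leader/06 type-of-record code, so the
    sample covers books ('a'), scores ('c'), video ('g'), sound
    recordings ('j'), etc., rather than whatever happens to come first."""
    records = [r + RECORD_TERMINATOR
               for r in raw.split(RECORD_TERMINATOR) if r]
    by_type = defaultdict(list)
    for rec in records:
        by_type[rec[6:7]].append(rec)  # leader byte 6 = type of record
    rng = random.Random(seed)          # fixed seed: reproducible test set
    sample = []
    for rtype, recs in sorted(by_type.items()):
        sample.extend(rng.sample(recs, min(per_type, len(recs))))
    return sample
```

The same grouping pass could be extended to stratify on language (008/35-37) or to deliberately retain malformed records for criterion 5.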

Thoughts? Comments?

Thanks!
Bess


Elizabeth (Bess) Sadler
Research and Development Librarian
Digital Scholarship Services
Box 400129
Alderman Library
University of Virginia
Charlottesville, VA 22904

[EMAIL PROTECTED]
(434) 243-2305


Re: [CODE4LIB] coverage of google book viewability API

2008-05-09 Thread Custer, Mark
Isn't the whole point of this to get the user to the book?  Knowledge
about the book should/will come from research and reading, not bad
metadata...  or even hastily automated extraneous info...  and in fact,
I'd say that most MARC metadata is only there to get a user to the book,
not to describe it to the user (aside from subjects), but mainly to
describe it to the library.

That said, I showed an example
(http://books.google.com/books?id=kdiYGQAACAAJ) in which a Google Books
"no view" gets a user to a "full view", completely electronically!
(though, my goodness, in the state it's in now it does require
experimentation and two extra mouse clicks)

What's more, Google already has a link for every book to Worldcat, which
helps the user get to a "full view", completely physically.  Ideally,
researchers will begin to have more and more opportunities to research
texts side-by-side with a physical and an electronic copy.  That is, of
course, once they have more options to download different formats rather
than just the rather large PDFs currently available at Google.

Right now, yes, most "no view" options offer far less information to a
user who can read a MARC record...  but I'd presume that as more texts
go online, the more value and richness will be added, even to those "no
view" cases.

Mark Custer


-Original Message-
From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of
Tim Spalding
Sent: Friday, May 09, 2008 11:44 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] coverage of google book viewability API

> Most of our users will start out in an electronic environment whether
> we like it or not (most of us on THIS list like it)---and will decide,
> based on what they find there, on their own, without us making the
> decision for them---whether to obtain (or attempt to obtain) a copy of
> the physical book or not. Whether we like it or not.

> But if you think the options are between US deciding whether the user
> should consult a physical book or not---then we're not even playing the
> same game.

What I dislike here is your abnegation of the responsibility to care
about the choices students make. If you're not considering the value
of all resources-including the book-you're not playing the library
game, the educator game or the Google game. You're just throwing stuff
on screens because you can.

"Whether you like it or not" you're pointing students in some
directions and not others. You're giving these resources different
amounts of emphasis in your UI. You're including some and not
others-the others includes all other web pages and all other offline
resources. You aren't making choices for the user, but you're not
stepping back and washing your hands of the responsibility to help the
student.

In a physical-book context, the book is one of the resources. It
deserves to be weighted and evaluated within this larger set of choices.
It's your responsibility to consider it within the mix of options. If
the book is excellent and the online resources poor, helping the user
means communicating this. So, sometimes the OPAC should basically say
"there's nothing good online about this book; but it's on the shelf
right over there."

*Certainly in Classics that's still true-the online world is a very
impoverished window into the discipline.


Re: [CODE4LIB] coverage of google book viewability API

2008-05-09 Thread Larry Campbell

What I dislike here is your assumption that you know better than your
users what's "good" for them/what they want/what they OUGHT to want/what
they need/etc. Providing them with available information in a reasonably
accessible way, and then trusting them -- whether they're undergrads,
grad students, lecturers, profs, researchers, community users, etc -- to
make their own decisions about what to do with that based upon their
particular, individual, and personal circumstances, locations, contexts,
etc., isn't "abnegating" your responsibility -- it IS your responsibility.

Larry Campbell
UBC Library

Tim Spalding wrote:


Most of our users will start out in an electronic environment whether we like 
it or not
(most of us on THIS list like it)---and will decide,  based on what they
find there, on their own, without us making the decision for
them---whether to obtain (or attempt to obtain) a copy of the physical
book or not. Whether we like it or not.

But if you think the options are between US deciding whether the user should 
consult a physical book or not---then we're not even playing the same game.

What I dislike here is your abnegation of the responsibility to care
about the choices students make. If you're not considering the value
of all resources—including the book—you're not playing the library
game, the educator game or the Google game. You're just throwing stuff
on screens because you can.

"Whether you like it or not" you're pointing students in some
directions and not others. You're giving these resources different
amounts of emphasis in your UI. You're including some and not
others—the others includes all other web pages and all other offline
resources. You aren't making choices for the user, but you're not
stepping back and washing your hands of the responsibility to help the
student.

In a physical-book context, the book is one of the resources. It
deserves to be weighted and evaluated within this larger set of choices.
It's your responsibility to consider it within the mix of options. If
the book is excellent and the online resources poor, helping the user
means communicating this. So, sometimes the OPAC should basically say
"there's nothing good online about this book; but it's on the shelf
right over there."

*Certainly in Classics that's still true—the online world is a very
impoverished window into the discipline.


Re: [CODE4LIB] coverage of google book viewability API

2008-05-09 Thread Jonathan Rochkind

I agree that showing the user evaluative resources that are not any good
is not a service to the user. When there are no good evaluative
resources available, we should not show bad ones to the user.

In either case though, with or without evaluative resources, we tell the
user where the book is located on the shelves.

[And in most library contexts I am familiar with, that is NOT accurately
described as "right over there". Perhaps you are familiar with other
sorts of libraries than I. In a large urban public library system or
just about ANY academic library, this is just as likely to be: "in the
library, but you are sitting at home right now, which could be a mile
away or could be two states away", "in another building [which may be
miles away]", "checked out to a user, but you can recall it if you
like", "in this building three floors and 200 meters away", or "not in
our system at all right now but you can ILL it."  So I'm having problems
with your continued assertions that the book is "right over there" for a
user consulting the OPAC.  Not necessarily in a majority of the cases
for my users, or the users of most other libraries I am familiar with.]

I certainly agree that making it easier to find a book on the shelves is
another enhancement we should be looking at. I think it was David Walker
who had a nice OPAC feature that actually gave you a map, with highlighted
path, from your computer terminal (if you were sitting in the library,
which again is probably a _minority_ of our opac use), to the book on
the shelves. That's awfully cool.

And in either case, with or without extra evaluative metadata, with or
without an interactive map showing you where the book is---in the end
it's the user's choice of whether to obtain the book or not. Our job is,
indeed, giving them good and useful and accurate evaluative information
to aid in this selection process. If you suggest that Google metadata is
NOT good and useful and accurate metadata, that's legitimate.  But you
started out, to my reading, suggesting that Table of Contents, reviews,
and links to other editions could not possibly be useful, and I still
take exception to that.

Jonathan

Tim Spalding wrote:

Most of our users will start out in an electronic environment whether we like 
it or not
(most of us on THIS list like it)---and will decide,  based on what they
find there, on their own, without us making the decision for
them---whether to obtain (or attempt to obtain) a copy of the physical
book or not. Whether we like it or not.

But if you think the options are between US deciding whether the user should 
consult a physical book or not---then we're not even playing the same game.

What I dislike here is your abnegation of the responsibility to care
about the choices students make. If you're not considering the value
of all resources—including the book—you're not playing the library
game, the educator game or the Google game. You're just throwing stuff
on screens because you can.

"Whether you like it or not" you're pointing students in some
directions and not others. You're giving these resources different
amounts of emphasis in your UI. You're including some and not
others—the others includes all other web pages and all other offline
resources. You aren't making choices for the user, but you're not
stepping back and washing your hands of the responsibility to help the
student.

In a physical-book context, the book is one of the resources. It
deserves to be weighted and evaluated within this larger set of choices.
It's your responsibility to consider it within the mix of options. If
the book is excellent and the online resources poor, helping the user
means communicating this. So, sometimes the OPAC should basically say
"there's nothing good online about this book; but it's on the shelf
right over there."

*Certainly in Classics that's still true—the online world is a very
impoverished window into the discipline.




--
Jonathan Rochkind
Digital Services Software Engineer
The Sheridan Libraries
Johns Hopkins University
410.516.8886
rochkind (at) jhu.edu


Re: [CODE4LIB] coverage of google book viewability API

2008-05-09 Thread Tim Spalding
> Most of our users will start out in an electronic environment whether we like 
> it or not
> (most of us on THIS list like it)---and will decide,  based on what they
> find there, on their own, without us making the decision for
> them---whether to obtain (or attempt to obtain) a copy of the physical
> book or not. Whether we like it or not.

>But if you think the options are between US deciding whether the user should 
>consult a physical book or not---then we're not even playing the same game.

What I dislike here is your abnegation of the responsibility to care
about the choices students make. If you're not considering the value
of all resources—including the book—you're not playing the library
game, the educator game or the Google game. You're just throwing stuff
on screens because you can.

"Whether you like it or not" you're pointing students in some
directions and not others. You're giving these resources different
amounts of emphasis in your UI. You're including some and not
others—the others includes all other web pages and all other offline
resources. You aren't making choices for the user, but you're not
stepping back and washing your hands of the responsibility to help the
student.

In a physical-book context, the book is one of the resources. It
deserves to be weighted and evaluated within this larger set of choices.
It's your responsibility to consider it within the mix of options. If
the book is excellent and the online resources poor, helping the user
means communicating this. So, sometimes the OPAC should basically say
"there's nothing good online about this book; but it's on the shelf
right over there."

*Certainly in Classics that's still true—the online world is a very
impoverished window into the discipline.


Re: [CODE4LIB] coverage of google book viewability API

2008-05-09 Thread Jonathan Rochkind

And indeed, that is exactly how I plan to make use of the Google
metadata link, and have been suggesting is the best way to make use of
it since I entered this conversation:  As part of a set of links to
'additional information' about a resource, with no special prominence
given to the Google link.

Other "additional information" links I either have already or plan to
have soon include:
Amazon.com
isbndb.com (a good aggregator of online prices for purchase, which has a
nice api)
Ulrich's (for serials)
Books in Print
OCLC Worldcat
OCLC Identities (for the author of the resource)

Of course, there is definitely a danger here of information overload. If
these "additional information" links are laid out on the page they
shouldn't interfere with the use of the page by people who want to
ignore them entirely. But too _many_ links in the "additional
information" section may make people ignore it entirely, where a smaller
list limited to the most useful links would get more use.  But at the
moment, we're not really sure which are the 'most useful' links, so I
think I'll err on the side of inclusion, up to a maximum of 6 or 7 links.
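The inclusion-with-a-cap policy above can be sketched in a few lines. The source list mirrors the ones named in this message, but the ordering is an illustrative guess, not an actual ranking of usefulness:

```python
# The "additional information" sources named in this thread, plus the
# Google Books link under discussion. Ordering is illustrative only.
ALL_SOURCES = ["Google Books", "Amazon.com", "isbndb.com", "Ulrich's",
               "Books in Print", "OCLC WorldCat", "OCLC Identities"]

def pick_links(applicable, max_links=7):
    """Err on the side of inclusion, but cap the 'additional
    information' box so it stays scannable for people who ignore it."""
    ordered = [s for s in ALL_SOURCES if s in applicable]
    return ordered[:max_links]
```

Once usage data suggests which links are actually the "most useful," the fix is simply to reorder `ALL_SOURCES` or lower `max_links`.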

Jonathan

Tim Spalding wrote:

If the Google link were part of a much larger set of unstressed links,
I'd be more inclined to favor it. Lots of linking is a good thing. But
a single no-info Google link from a low-information OPAC page seems to
compound the deficiencies of one paradigm with those of another.

On the subject of "lazy" students, I do think there is a legitimate
distinction between what students will do and what they ought to do.
Being pro-Web 2.0 doesn't require us to be information relativists.
Certainly there is a lot of ignorant criticism about Wikipedia.
Wikipedia is a remarkable resource, and an inspiration to us all.
Students will and probably should use it when they're starting out on
a topic.

That some students will use it long after that, in the place of better
resources online and off, because it's the "path of least resistance"
isn't just a fact of life we must all bow before. It is a problem we
must confront. Not infrequently the right answer is "get off your butt
and read a book."

Tim

On Fri, May 9, 2008 at 8:44 AM, Custer, Mark <[EMAIL PROTECTED]> wrote:


For the most part, I completely agree.  That said, it's a very tangled
web out there, and on occasion those "no preview" views can still lead a
user to a "full view" that's offered elsewhere.  Here's just one
example:

http://books.google.com/books?id=kdiYGQAACAAJ
(from there, a user can click on the first link to be taken to another
metadata page that has access to a "full view")

Unfortunately, there's no indication that either of these links will get
you to a full-text digitized copy of the book in question (the links
always, of course, appear under the header of "References from web
pages", which Google has nicely added), and there's also no way to know
that a "no preview" book has any such "references from web pages" until
you access the item, but it's something, at least, however unintended.

It'd be nice, perhaps, if you could put some sort of standard in the
metadata header of the webpage (DC or otherwise) to indicate to a
harvester (in this case, a crawler) the specific format of the
retrieval.  Then these links could be labeled as "digitized copies
available elsewhere", rather than simply "references from web pages"
(which, of course, is all that they are right now), and could also be
added to the API callback.  That is, of course, if Google doesn't
eventually put up these and other localized resources as well (and I'm
sure they'll cover most of these, with the collections that they do
have)...  but until or if they do, it would go a longer way to
fulfilling their mission.

Mark Custer



-Original Message-
From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of
Tim Spalding
Sent: Thursday, May 08, 2008 6:52 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] coverage of google book viewability API

So, I took a long slow look at ten of the examples from Godmar's file.
Nothing I saw disabused me of my opinion: "No preview" pages on Google
Book Search are very weak tea.

Are they worthless? Not always. But they usually are. And,
unfortunately, you generally need to read the various references pages
carefully before you know you were wasting your time.

Some examples:

Risks in Chemical Units
(http://books.google.com/books?id=7ctpCAAJ) has one glancing,
un-annotated reference in the footnotes of another, apparently
different book.

How Trouble Made the Monkey Eat Pepper
(http://books.google.com/books?id=wLnGCAAJ) sports three
references from other books, two in snippet view and one with no view.
Two are bare-bones bibliographic mentions in an index of Canadian
children's books and an index of Canadian children's illustrators. The
third is another bare-bones mention in a book in Sinhalese.



> If the patron is sitting on a computer (which, given this discussion,
> they obviously are), the path of least resistance dictates that a
> journal article will be used before a book.
Re: [CODE4LIB] coverage of google book viewability API

2008-05-09 Thread Jonathan Rochkind

Ah, but in our actual world we _don't_ have two choices, to send the
user to the physical book or to an electronic metadata surrogate. We
_don't_ get to force the user to look at the book. Most of our users
will start out in an electronic environment whether we like it or not
(most of us on THIS list like it)---and will decide,  based on what they
find there, on their own, without us making the decision for
them---whether to obtain (or attempt to obtain) a copy of the physical
book or not. Whether we like it or not. But again, most of us on THIS
list like it, and like the challenge of giving the user useful
information in the electronic surrogate environment to make up their own
mind about whether to obtain the physical book or not.

So we can disagree about whether Google metadata is a valuable aid to
the user in selection task or not, sure.

But if you think the options are between US deciding whether the user
should consult a physical book or not---then we're not even playing the
same game.

Jonathan

Tim Spalding wrote:

So, I took a long slow look at ten of the examples from Godmar's file.
Nothing I saw disabused me of my opinion: "No preview" pages on Google
Book Search are very weak tea.

Are they worthless? Not always. But they usually are. And,
unfortunately, you generally need to read the various references pages
carefully before you know you were wasting your time.

Some examples:

Risks in Chemical Units
(http://books.google.com/books?id=7ctpCAAJ) has one glancing,
un-annotated reference in the footnotes of another, apparently
different book.

How Trouble Made the Monkey Eat Pepper
(http://books.google.com/books?id=wLnGCAAJ) sports three
references from other books, two in snippet view and one with no view.
Two are bare-bones bibliographic mentions in an index of Canadian
children's books and an index of Canadian children's illustrators. The
third is another bare-bones mention in a book in Sinhalese.



> If the patron is sitting on a computer (which, given this discussion,
> they obviously are), the path of least resistance dictates that a
> journal article will be used before a book.



An excellent example. Let's imagine you were doing reference-desk work
and a student were to come up to you with a question about a topic.
You have two sources you can send them to—the book itself in all its
glory, and another source. The other source is the Croatian-language
MySpace page of someone whose boyfriend read a chapter of the book
once, five years ago. You're not sure if the blog mentions the book,
but it might.

That something provides the path of least resistance isn't an argument
for something. It depends on where the path goes.




--
Jonathan Rochkind
Digital Services Software Engineer
The Sheridan Libraries
Johns Hopkins University
410.516.8886
rochkind (at) jhu.edu


Re: [CODE4LIB] coverage of google book viewability API

2008-05-09 Thread Tim Spalding
If the Google link were part of a much larger set of unstressed links,
I'd be more inclined to favor it. Lots of linking is a good thing. But
a single no-info Google link from a low-information OPAC page seems to
compound the deficiencies of one paradigm with those of another.

On the subject of "lazy" students, I do think there is a legitimate
distinction between what students will do and what they ought to do.
Being pro-Web 2.0 doesn't require us to be information relativists.
Certainly there is a lot of ignorant criticism about Wikipedia.
Wikipedia is a remarkable resource, and an inspiration to us all.
Students will and probably should use it when they're starting out on
a topic.

That some students will use it long after that, in the place of better
resources online and off, because it's the "path of least resistance"
isn't just a fact of life we must all bow before. It is a problem we
must confront. Not infrequently the right answer is "get off your butt
and read a book."

Tim

On Fri, May 9, 2008 at 8:44 AM, Custer, Mark <[EMAIL PROTECTED]> wrote:
> For the most part, I completely agree.  That said, it's a very tangled
> web out there, and on occasion those "no preview" views can still lead a
> user to a "full view" that's offered elsewhere.  Here's just one
> example:
>
> http://books.google.com/books?id=kdiYGQAACAAJ
> (from there, a user can click on the first link to be taken to another
> metadata page that has access to a "full view")
>
> Unfortunately, there's no indication that either of these links will get
> you to a full-text digitized copy of the book in question (the links
> always, of course, appear under the header of "References from web
> pages", which Google has nicely added), and there's also no way to know
> that a "no preview" book has any such "references from web pages" until
> you access the item, but it's something, at least, however unintended.
>
> It'd be nice, perhaps, if you could put some sort of standard in the
> metadata header of the webpage (DC or otherwise) to indicate to a
> harvester (in this case, a crawler) the specific format of the
> retrieval.  Then these links could be labeled as "digitized copies
> available elsewhere", rather than simply "references from web pages"
> (which, of course, is all that they are right now), and could also be
> added to the API callback.  That is, of course, if Google doesn't
> eventually put up these and other localized resources as well (and I'm
> sure they'll cover most of these, with the collections that they do
> have)...  but until or if they do, it would go a longer way to
> fulfilling their mission.
>
> Mark Custer
>
>
>
> -Original Message-
> From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of
> Tim Spalding
> Sent: Thursday, May 08, 2008 6:52 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] coverage of google book viewability API
>
> So, I took a long slow look at ten of the examples from Godmar's file.
> Nothing I saw disabused me of my opinion: "No preview" pages on Google
> Book Search are very weak tea.
>
> Are they worthless? Not always. But they usually are. And,
> unfortunately, you generally need to read the various references pages
> carefully before you know you were wasting your time.
>
> Some examples:
>
> Risks in Chemical Units
> (http://books.google.com/books?id=7ctpCAAJ) has one glancing,
> un-annotated reference in the footnotes of another, apparently
> different book.
>
> How Trouble Made the Monkey Eat Pepper
> (http://books.google.com/books?id=wLnGCAAJ) sports three
> references from other books, two in snippet view and one with no view.
> Two are bare-bones bibliographic mentions in an index of Canadian
> children's books and an index of Canadian children's illustrators. The
> third is another bare-bones mention in a book in Sinhalese.
>
>> If the patron is sitting on a computer (which, given this discussion,
>> they obviously are), the path of least resistance dictates that a
>> journal article will be used before a book.
>
> An excellent example. Let's imagine you were doing reference-desk work
> and a student were to come up to you with a question about a topic.
> You have two sources you can send them to-the book itself in all its
> glory, and another source. The other source is the Croatian-language
> MySpace page of someone whose boyfriend read a chapter of the book
> once, five years ago. You're not sure if the blog mentions the book,
> but it might.
>
> That something provides the path of least resistance isn't an argument
> for something. It depends on where the path goes.
>



--
Check out my library at http://www.librarything.com/profile/timspalding


[CODE4LIB] Sakai and Emory Reserves Direct

2008-05-09 Thread Marianne Giltrud

A while ago someone posted to this list feedback on reserves
applications.  Emory Reserves Direct was one of the open source
recommendations.  We are migrating from Blackboard to Sakai open
source course management and would like to add some copyright
components.  Does anyone have experience with either Emory Reserves
Direct or Sakai? We will be having a consultant tweak Sakai, so any
input would be greatly appreciated.

I can be contacted off list at [EMAIL PROTECTED] or to the group if
appropriate.

Thank you very much.
Marianne
--

Marianne E. Giltrud, MSLS
Acting Access Services Librarian
The Catholic University of America Libraries
Mullen Library
620 Michigan Avenue, NE
Washington, DC 20064
202-319-4453
[EMAIL PROTECTED]
http://libraries.cua.edu


Re: [CODE4LIB] coverage of google book viewability API

2008-05-09 Thread Custer, Mark
For the most part, I completely agree.  That said, it's a very tangled
web out there, and on occasion those "no preview" views can still lead a
user to a "full view" that's offered elsewhere.  Here's just one
example:

http://books.google.com/books?id=kdiYGQAACAAJ
(from there, a user can click on the first link to be taken to another
metadata page that has access to a "full view")

Unfortunately, there's no indication that either of these links will get
you to a full-text digitized copy of the book in question (the links
always appear, of course, under the header of "References from web
pages", which Google has nicely added). There's also no way to know
that a "no preview" book has any such "references from web pages" until
you access the item. But it's something, at least, however unintended.

It'd be nice, perhaps, if some sort of standard in the metadata header
of the webpage (DC or otherwise) could indicate to a harvester (in this
case, a crawler) the specific format of the resource being retrieved.
Then these links could be labeled "digitized copies available
elsewhere", rather than simply "references from web pages" (which, of
course, is all that they are right now), and could also be added to the
API callback.  That is, of course, unless Google eventually puts up
these and other localized resources as well (and I'm sure they'll cover
most of these, with the collections that they do have)...  but until
and unless they do, it would go a long way toward fulfilling their
mission.
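The viewability lookup this thread keeps circling can be sketched
roughly as below. The request shape (bibkeys, jscmd=viewapi, a JSONP
callback) follows Google's book viewability API as it was publicly
described; the sample response is illustrative, not real data, and the
exact field set is an assumption:

```python
import json
from urllib.parse import urlencode

def viewability_url(bibkeys, callback="handleGBS"):
    """Build a request URL for Google's book viewability API
    (the jscmd=viewapi lookup discussed in this thread)."""
    params = {
        "bibkeys": ",".join(bibkeys),
        "jscmd": "viewapi",
        "callback": callback,
    }
    return "http://books.google.com/books?" + urlencode(params)

# A sample response body shaped like the JSONP the API returns;
# the ISBN, URL, and field values here are made up for illustration.
sample = ('handleGBS({"ISBN:0060731338":{"bib_key":"ISBN:0060731338",'
          '"info_url":"http://books.google.com/books?id=EXAMPLE",'
          '"preview":"noview"}});')

def parse_viewability(jsonp, callback="handleGBS"):
    """Strip the JSONP wrapper and return {bib_key: preview_status}."""
    body = jsonp[len(callback) + 1 : jsonp.rindex(")")]
    data = json.loads(body)
    return {key: rec.get("preview", "noview") for key, rec in data.items()}

print(parse_viewability(sample))  # {'ISBN:0060731338': 'noview'}
```

A catalog could use the returned status ("noview", "partial", "full")
to decide whether a GBS link is worth showing at all, which is exactly
the weak-tea filtering being debated here.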

Mark Custer



-Original Message-
From: Code for Libraries [mailto:[EMAIL PROTECTED] On Behalf Of
Tim Spalding
Sent: Thursday, May 08, 2008 6:52 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] coverage of google book viewability API

So, I took a long slow look at ten of the examples from Godmar's file.
Nothing I saw disabused me of my opinion: "No preview" pages on Google
Book Search are very weak tea.

Are they worthless? Not always. But they usually are. And,
unfortunately, you generally need to read the various references pages
carefully before you know you were wasting your time.

Some examples:

Risks in Chemical Units
(http://books.google.com/books?id=7ctpCAAJ) has one glancing,
un-annotated reference in the footnotes of another, apparently
different book.

How Trouble Made the Monkey Eat Pepper
(http://books.google.com/books?id=wLnGCAAJ) sports three
references from other books, two in snippet view and one with no view.
Two are bare-bones bibliographic mentions in an index of Canadian
children's books and an index of Canadian children's illustrators. The
third is another bare-bones mention in a book in Sinhalese.

>  If the patron is sitting on a computer (which, given this
>  discussion, they obviously are), the path of least resistance
>  dictates that a journal article will be used before a book.

An excellent example. Let's imagine you were doing reference-desk work
and a student were to come up to you with a question about a topic.
You have two sources you can send them to: the book itself in all its
glory, and another source. The other source is the Croatian-language
MySpace page of someone whose boyfriend read a chapter of the book
once, five years ago. You're not sure if the page mentions the book,
but it might.

That something provides the path of least resistance isn't an argument
for it. It depends on where the path goes.


Re: [CODE4LIB] Latest OpenLibrary.org release

2008-05-09 Thread Rob Sanderson
On Thu, 2008-05-08 at 11:41 -0400, Godmar Back wrote:
> On Thu, May 8, 2008 at 11:25 AM, Dr R. Sanderson
> <[EMAIL PROTECTED]> wrote:
> >
> >  Like what?  The current API seems to be concerned with search.  Search
> >  is what SRU does well.  If it was concerned with harvest, I (and I'm
> >  sure many others) would have instead suggested OAI-PMH.
> >
> No, the API presented does not support search.

Well, it only appears not to support search because the API has been
described without using the word 'search'!

To quote the documentation in the API:

--
Infogami provides an API to query the database for objects matching
particular criteria
...
To find objects matching a particular query, send a GET request to
http://openlibrary.org/api/things with query as parameter. In this
documentation we use curl as a simple command line query client; any
software that supports http GET can be used.
...
The API supports querying for objects based on string matching.
--

And so on.

There's a query, which can have its results sorted, be limited in terms
of the number of results returned, and have the beginning of that result
list start at an offset.
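That query, as the quoted documentation describes it, can be sketched
as a GET request with a JSON object in the `query` parameter. The
field names below (type, sort, limit, offset) are assumptions chosen
to illustrate the sort/limit/offset behaviour described above, not a
definitive spec:

```python
import json
from urllib.parse import urlencode

def things_url(query):
    """Build the GET request the quoted Infogami docs describe:
    a JSON query object passed as the `query` parameter to
    http://openlibrary.org/api/things."""
    return ("http://openlibrary.org/api/things?"
            + urlencode({"query": json.dumps(query)}))

# Hypothetical query: the keys "sort", "limit", and "offset" mirror
# the capabilities described in the post; exact names are assumed.
url = things_url({
    "type": "/type/edition",
    "title": "How Trouble Made the Monkey Eat Pepper",
    "sort": "title",    # sort the result list
    "limit": 10,        # cap the number of results returned
    "offset": 0,        # start of the result window
})
print(url)
```

Sorted, limited, offset result lists over matching criteria: the
resemblance to a search protocol like SRU is the whole point.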

Sounds a lot like a search?

Rob