Re: [CODE4LIB] marc records sample set

2008-05-09 Thread Kyle Banerjee
 According to the combined brainstorming of Jonathan Rochkind and
 myself, the ideal record set should:

 1. contain about 10k records, enough to really see the features, but
 small enough that you could index it in a few minutes on a typical
 desktop...

 5. contain a distribution of typical errors one might encounter with
 marc records in the wild

This is much harder to do than might appear on the surface. 10K is a
really small set, and the issue is that unless people know how to
create a set that really targets the problem areas, you will
inevitably miss important stuff. At the end of the day, it's the
screwball stuff you didn't think about that always causes the most
problems. I think such data sizes are useful for testing interfaces,
but not for determining catalog behavior and setup.

Despite the indexing time, I believe in testing with much larger sets.
There are certain very important things that just can't be examined
with small sets. For example, one huge problem with catalog data is
that the completeness and quality is highly variable. When we were
experimenting sometime back, we found that how you normalize the data
and how you weight terms as well as documents has an enormous impact
on search results and that unless you do some tuning, you will
inevitably find a lot of garbage too close to the top with a bunch of
good stuff ranked so low it isn't found.

kyle


Re: [CODE4LIB] marc records sample set

2008-05-09 Thread Walter Lewis

Bess Sadler wrote:

3. Are there features missing from the above list that would make
this more useful?

One of the things that Bill Moen showed at Access a couple of years ago
(Edmonton?) was what he and others were calling a radioactive MARC
record: one that had no normal payload but, IIRC, had a 245$a whose
value was 245$a, and so on. As I recall, it was used to test processes
where you wanted to be sure that a specific field was mapped to a
specific index, or was showing up in a particular Z39.50 profile.
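
A minimal sketch of how one might build that kind of self-describing
test record with pymarc (pymarc 5.x Subfield style; the tag list and
file name are just illustrative, not Moen's actual record):

from pymarc import Record, Field, Subfield

def radioactive_record(tags=("100", "245", "260", "650")):
    """Build a record whose every subfield value names its own tag and
    code, so whatever shows up in an index or Z39.50 response tells you
    exactly which source field it came from."""
    record = Record()
    for tag in tags:
        record.add_field(
            Field(
                tag=tag,
                indicators=[" ", " "],
                subfields=[Subfield(code="a", value=f"{tag}$a")],
            )
        )
    return record

if __name__ == "__main__":
    with open("radioactive.mrc", "wb") as out:
        out.write(radioactive_record().as_marc())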

Walter


Re: [CODE4LIB] marc records sample set

2008-05-09 Thread Jodi Schneider
This is much harder to do than might appear on the surface. 10K is a
really small set, and the issue is that unless people know how to
create a set that really targets the problem areas, you will
inevitably miss important stuff. At the end of the day, it's the
screwball stuff you didn't think about that always causes the most
problems. I think such data sizes are useful for testing interfaces,
but not for determining catalog behavior and setup.

Sounds like you have some experience of this, Kyle!
Do you have a list of the screwball stuff? Even an offhand one would
be interesting...

Despite the indexing time, I believe in testing with much larger sets.
There are certain very important things that just can't be examined
with small sets. For example, one huge problem with catalog data is
that the completeness and quality is highly variable. When we were
experimenting sometime back, we found that how you normalize the data
and how you weight terms as well as documents has an enormous impact
on search results and that unless you do some tuning, you will
inevitably find a lot of garbage too close to the top with a bunch of
good stuff ranked so low it isn't found.

What sorts of normalizations do you do? I'm starting to look for
standard measures of data quality or data validation/normalization
routines for MARC.

I ask because in my experiments with the FRBR Display Tool, I've found
the sorts of variations you describe, and I'd like to experiment with
more data validation and normalization. I'm very new at working with
MARC data, so even pointers to standard stuff would be really helpful!

-Jodi

Jodi Schneider
Science Library Specialist
Amherst College
413-542-2076


Re: [CODE4LIB] marc records sample set

2008-05-09 Thread Dan Scott
I agree with Kyle that a big, wide set of records is better for testing 
purposes. In processing records for Evergreen imports, I've found that there 
are often just a handful that throw marc4j for a loop. I suppose I should cull 
those and attach them to bug reports... instead I've taken the path of least 
resistance and just used yaz-marcdump. (bad Dan!)
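
For anyone who does want to cull the offenders rather than route around
them with yaz-marcdump, a rough sketch with pymarc (error reporting
differs slightly between pymarc versions, hence the two checks) might
look like:

from pymarc import MARCReader

def cull(path, good_path="good.mrc", bad_log="bad.log"):
    """Split a batch of binary MARC into records that parse cleanly and
    records that blow up, so the bad ones can go into bug reports."""
    with open(path, "rb") as fh, \
         open(good_path, "wb") as good, \
         open(bad_log, "w") as log:
        reader = MARCReader(fh)
        while True:
            try:
                record = next(reader)
            except StopIteration:
                break
            except Exception as exc:      # older pymarc raises on bad records
                log.write(f"unparseable record: {exc}\n")
                continue
            if record is None:            # newer pymarc yields None instead
                log.write(f"unparseable record: {reader.current_exception}\n")
                continue
            good.write(record.as_marc())

if __name__ == "__main__":
    cull("batch.mrc")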

There are, of course, _lots_ of MARC records available for download from
http://www.archive.org/search.php?query=collection%3A%22ol_data%22%20AND%20%28MARC%20records%29
- not just the LoC set. So one could presumably assemble a nice big set of 
records starting here.

Dan


 On Fri, May 9, 2008 at 12:33 PM, Bess Sadler [EMAIL PROTECTED] wrote:
 Those of us involved in the Blacklight and VuFind projects are
 spending lots of time recently thinking about marc records indexing.
 We're about to start running some performance tests, and we want to
 create unit tests for our marc to solr indexer, and also people
 wanting to download and play with the software need to have easy
 access to a small but representative set of marc records that they
 can play with.

 According to the combined brainstorming of Jonathan Rochkind and
 myself, the ideal record set should:

 1. contain about 10k records, enough to really see the features, but
 small enough that you could index it in a few minutes on a typical
 desktop
 2. contain a distribution of kinds of records, e.g., books, CDs,
 musical scores, DVDs, special collection items, etc.
 3. contain a distribution of languages, so we can test unicode handling
 4. contain holdings information in addition to bib records
 5. contain a distribution of typical errors one might encounter with
 marc records in the wild

 It seems to me that the set that Casey donated to Open Library
 (http://www.archive.org/details/marc_records_scriblio_net) would be a
 good place from which to draw records, because although IANAL, this
 seems to sidestep any legal hurdles. I'd also love to see the ability
 for the community to contribute test cases. Assuming such a set
 doesn't exist already (see my question below) this seems like the
 ideal sort of project for code4lib to host, too.

 Since code4lib is my lazyweb, I'm asking you:

 1. Does something like this exist already and I just don't know about
 it?
 2. If not, do you have suggestions on how to go about making such a
 data set? I have some ideas on how to do it bit by bit, and we have a
 certain small set of records that we're already using for testing,
 but maybe there's a better method that I don't know about?
 3. Are there features missing from the above list that would make
 this more useful?

 Thoughts? Comments?

 Thanks!
 Bess


 Elizabeth (Bess) Sadler
 Research and Development Librarian
 Digital Scholarship Services
 Box 400129
 Alderman Library
 University of Virginia
 Charlottesville, VA 22904

 [EMAIL PROTECTED]
 (434) 243-2305


Re: [CODE4LIB] marc records sample set

2008-05-09 Thread Kyle Banerjee
 Sounds like you have some experience of this, Kyle!
 Do you have a list of the screwball stuff? Even an offhand one would
 be interesting...

I don't have the list with me, but just to rattle a few things off:
some extra-short records rank high because so much of a search term
matches the whole document. Some records contain fields that have been
repeated many times, which artificially boosts them. You'll see
nonstandard use of fields as well as foreign character sets. There are
a number of ways URLs are displayed. Deriving format can be
problematic because the encoded material type and what you're actually
providing access to are different. Some records contain lots of added
entries, while many important ones are fairly minimalist. There are
on-the-fly conversions, purchased record sets, automatically generated
ones, full-level ones, and ones that automatically have some subject
heading added that contains a common search term. There are a zillion
other things, but you get the idea.

 What sorts of normalizations do you do? I'm starting to look for
 standard measures of data quality or data validation/normalization
 routines for MARC.

Index terms only once. This helps deal with repeated terms in
repetitive subject headings and added entries. Look at the presence of
fields to assign additional material types (particularly useful for
electronic resources since these typically have paper records -- but
don't be fooled by links to TOCs and other stuff that's not full
text). Give special handling to serials. Keywords need to be weighted
differently depending on where they're from (e.g. title worth more
than subject). We also assigned a rough record quality score based on
the presence/absence of fields, so that longer, more complete records
don't become less important simply because a search term matches a
smaller share of them than of a short record. Give a bit more weight
to true full-text retrieval. The number of libraries holding the item
is also considered. When indexing, 650|a is more important than |x,
|y, or |z. Don't treat 650|z the same way as 651|a. Recognize that
650|v is a form, and that some common 650|x fields should be treated
the same way (and neither should just be a regular index term). The
only thing we didn't use that a lot of places put a lot of weight on
is date -- this is good for retrieving popular fiction, but you have
to be really careful with it in academic collections because it can
hide classic stuff that's been around a long time. I can't remember
everything off the top of my head, but there's a lot and it makes a
big difference.
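
None of that reduces to a single switch, but as a rough illustration of
the flavor, here's a sketch (pymarc; the weights, field lists, and
quality score are invented for the example, not anyone's production
tuning) of field-weighted, deduplicated term extraction plus a crude
completeness score:

from pymarc import MARCReader

# Illustrative numbers only -- real tuning would come from testing.
FIELD_WEIGHTS = {"245": 10.0, "100": 6.0, "650": 3.0}  # title > author > subject
QUALITY_FIELDS = ("100", "245", "260", "300", "505", "520", "650")

def index_terms(record):
    """Return {term: weight}, counting each term once per record at the
    highest weight it earns, so repeated headings don't pile up."""
    terms = {}
    for tag, weight in FIELD_WEIGHTS.items():
        for field in record.get_fields(tag):
            # |a carries more signal than the |x/|y/|z subdivisions, so
            # only |a is indexed at full weight in this toy version
            for value in field.get_subfields("a"):
                for token in value.lower().split():
                    terms[token] = max(terms.get(token, 0.0), weight)
    return terms

def quality_score(record):
    """Crude completeness score: fraction of 'expected' fields present."""
    present = sum(1 for tag in QUALITY_FIELDS if record.get_fields(tag))
    return present / len(QUALITY_FIELDS)

if __name__ == "__main__":
    with open("records.mrc", "rb") as fh:
        for record in MARCReader(fh):
            if record is None:
                continue
            print(quality_score(record), len(index_terms(record)))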

kyle


Re: [CODE4LIB] marc records sample set

2008-05-09 Thread Jonathan Rochkind

I think you start with a smaller set, but then when you find
idiosyncratic records that were NOT represented in your smaller set, you
add representative samples to the sample set. The sample set organically
grows.

Certainly at some point you've got to test on a larger set too. But I
think there's a lot of value in having a small test set too. Of course,
it is something of a challenge to even come up with a reasonably
representative small set. But it doesn't need to be absolutely
representative---when you find examples not represented, you add them.
It grows.

Jonathan

Kyle Banerjee wrote:

According to the combined brainstorming of Jonathan Rochkind and
myself, the ideal record set should:

1. contain about 10k records, enough to really see the features, but
small enough that you could index it in a few minutes on a typical
desktop...





5. contain a distribution of typical errors one might encounter with
marc records in the wild



This is much harder to do than might appear on the surface. 10K is a
really small set, and the issue is that unless people know how to
create a set that really targets the problem areas, you will
inevitably miss important stuff. At the end of the day, it's the
screwball stuff you didn't think about that always causes the most
problems. I think such data sizes are useful for testing interfaces,
but not for determining catalog behavior and setup.

Despite the indexing time, I believe in testing with much larger sets.
There are certain very important things that just can't be examined
with small sets. For example, one huge problem with catalog data is
that the completeness and quality is highly variable. When we were
experimenting sometime back, we found that how you normalize the data
and how you weight terms as well as documents has an enormous impact
on search results and that unless you do some tuning, you will
inevitably find a lot of garbage too close to the top with a bunch of
good stuff ranked so low it isn't found.

kyle




--
Jonathan Rochkind
Digital Services Software Engineer
The Sheridan Libraries
Johns Hopkins University
410.516.8886
rochkind (at) jhu.edu


Re: [CODE4LIB] marc records sample set

2008-05-09 Thread Jonathan Rochkind

The Blacklight code is not currently using XML or XSLT. It's indexing
binary MARC files. I don't know its speed, but I hear it's pretty fast.

But for the kind of test set I want, even waiting half an hour is too
long. I want a test set where I can make a change to my indexing
configuration and then see the results in a few minutes, if not seconds.

This has become apparent in my attempts to get the indexer _working_,
where I might not be actually changing the index mapping at all, I'm
just changing the indexer configuration until I know it's working. When
I get to actually messing with indexer mapping to try out new ideas, I
believe it will also be important, however.

Jonathan

Casey Durfee wrote:

I strongly agree that we need something like this.  The LoC records that
Casey donated are a great resource but far from ideal for this purpose.
They're pretty homogeneous.  I do think it needs to be bigger than 10,000
though.  100,000 would be a better target.  And I would like to see a
UNIMARC/DANMARC-based one as well as a MARC21-based one (can one's parser
handle DANMARC's ø subfield?).

I don't know about Blacklight or VuFind, but using our MarcThing package +
Solr we can index up to 1000 records a second.  I know using XSLT severely
limits how fast you can index (I'll refrain from giving another rant about
how wrong it is to use XSL to handle MARC -- the Society for Prevention of
Cruelty to Dead Horses has my number as it is.)  But I'd still expect you
can do a good 50-100 records a second.  That's only a half hour to an hour
of work to index 100,000 records.  You could run it over your lunch break.
Seems reasonable to me.

In addition to a wide variety of languages, encodings, formats and so forth,
it would definitely need to have records explicitly designed to break
things.  Blank MARC tags, extraneous subfield markers, non-printing control
characters, incorrect-length fixed fields, etc.  The kind of stuff that
should never happen in theory, but happens frequently in real life (I'm
looking at you, Horizon's MARC export utility).
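
For the records designed to break things, one cheap approach (sketched
below with pymarc; the specific corruptions are just examples) is to
write a clean record and then vandalize the raw bytes, since a
well-behaved library won't emit bad leaders or stray delimiters on its
own:

from pymarc import Record, Field, Subfield

SUBFIELD_DELIMITER = b"\x1f"   # MARC's unit separator

def clean_record():
    rec = Record()
    rec.add_field(Field(tag="245", indicators=["0", "0"],
                        subfields=[Subfield("a", "A perfectly ordinary title")]))
    return rec.as_marc()

def with_bad_length(raw):
    """Lie about the record length in leader positions 00-04."""
    return b"99999" + raw[5:]

def with_stray_delimiter(raw):
    """Drop an extra subfield delimiter into the middle of the data
    (which also makes the stated record length wrong -- bonus)."""
    mid = len(raw) // 2
    return raw[:mid] + SUBFIELD_DELIMITER + raw[mid:]

if __name__ == "__main__":
    raw = clean_record()
    with open("poison.mrc", "wb") as out:
        out.write(with_bad_length(raw))
        out.write(with_stray_delimiter(raw))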

The legal aspect of this is the difficult part.  We (LibraryThing) could
easily grab 200 random records from 500 different Z39.50 sources worldwide.
Technically, it could be done in a couple of hours.  Legally, I don't think
it could ever be done, sadly.

--Casey


On Fri, May 9, 2008 at 9:33 AM, Bess Sadler [EMAIL PROTECTED] wrote:



Those of us involved in the Blacklight and VuFind projects are
spending lots of time recently thinking about marc records indexing.
We're about to start running some performance tests, and we want to
create unit tests for our marc to solr indexer, and also people
wanting to download and play with the software need to have easy
access to a small but representative set of marc records that they
can play with.

According to the combined brainstorming of Jonathan Rochkind and
myself, the ideal record set should:

1. contain about 10k records, enough to really see the features, but
small enough that you could index it in a few minutes on a typical
desktop
2. contain a distribution of kinds of records, e.g., books, CDs,
musical scores, DVDs, special collection items, etc.
3. contain a distribution of languages, so we can test unicode handling
4. contain holdings information in addition to bib records
5. contain a distribution of typical errors one might encounter with
marc records in the wild

It seems to me that the set that Casey donated to Open Library
(http://www.archive.org/details/marc_records_scriblio_net) would be a
good place from which to draw records, because although IANAL, this
seems to sidestep any legal hurdles. I'd also love to see the ability
for the community to contribute test cases. Assuming such a set
doesn't exist already (see my question below) this seems like the
ideal sort of project for code4lib to host, too.

Since code4lib is my lazyweb, I'm asking you:

1. Does something like this exist already and I just don't know about
it?
2. If not, do you have suggestions on how to go about making such a
data set? I have some ideas on how to do it bit by bit, and we have a
certain small set of records that we're already using for testing,
but maybe there's a better method that I don't know about?
3. Are there features missing from the above list that would make
this more useful?

Thoughts? Comments?

Thanks!
Bess


Elizabeth (Bess) Sadler
Research and Development Librarian
Digital Scholarship Services
Box 400129
Alderman Library
University of Virginia
Charlottesville, VA 22904

[EMAIL PROTECTED]
(434) 243-2305







--
Jonathan Rochkind
Digital Services Software Engineer
The Sheridan Libraries
Johns Hopkins University
410.516.8886
rochkind (at) jhu.edu


Re: [CODE4LIB] marc records sample set

2008-05-09 Thread Bess Sadler

On May 9, 2008, at 1:42 PM, Jonathan Rochkind wrote:


The Blacklight code is not currently using XML or XSLT. It's indexing
binary MARC files. I don't know its speed, but I hear it's pretty
fast.


Right, I'm talking about the java indexer we're working on, which
we're hoping to turn into a plugin contrib module for solr. It
processes binary marc files. We're getting times of about 150
records / second, but that's on an unfortunately throttled server and
we're munging each record significantly (replacing musical instrument
and language codes with their English language equivalents,
calculating composition era, etc).
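
For anyone curious what that munging step looks like, a toy version of
the language-code replacement (the lookup table is truncated to a
handful of codes; a real one would be generated from the full MARC code
list) might be:

from pymarc import MARCReader

# Tiny excerpt of the MARC language code list, for illustration only.
LANGUAGE_NAMES = {
    "eng": "English",
    "fre": "French",
    "ger": "German",
    "spa": "Spanish",
}

def language_name(record):
    """Read the language code from 008 positions 35-37 and translate it."""
    fixed = record["008"]
    if fixed is None or len(fixed.data) < 38:
        return None
    code = fixed.data[35:38].strip()
    return LANGUAGE_NAMES.get(code, code)   # fall back to the raw code

if __name__ == "__main__":
    with open("records.mrc", "rb") as fh:
        for record in MARCReader(fh):
            if record is not None:
                print(language_name(record))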

Casey, you say you're getting indexing times of 1000 records /
second? That's amazing! I really have to take a closer look at
MarcThing. Could pymarc really be that much faster than marc4j? Or
are we comparing apples to oranges since we haven't normalized for
the kinds of mapping we're doing and the hardware it's running on?

Bess


Elizabeth (Bess) Sadler
Research and Development Librarian
Digital Scholarship Services
Box 400129
Alderman Library
University of Virginia
Charlottesville, VA 22904

[EMAIL PROTECTED]
(434) 243-2305


Re: [CODE4LIB] marc records sample set

2008-05-09 Thread Joe Hourcle

On Fri, 9 May 2008, Bess Sadler wrote:


Those of us involved in the Blacklight and VuFind projects are
spending lots of time recently thinking about marc records indexing.
We're about to start running some performance tests, and we want to
create unit tests for our marc to solr indexer, and also people
wanting to download and play with the software need to have easy
access to a small but representative set of marc records that they
can play with.


[trimmed]


It seems to me that the set that Casey donated to Open Library
(http://www.archive.org/details/marc_records_scriblio_net) would be a
good place from which to draw records, because although IANAL, this
seems to sidestep any legal hurdles. I'd also love to see the ability
for the community to contribute test cases. Assuming such a set
doesn't exist already (see my question below) this seems like the
ideal sort of project for code4lib to host, too.


OpenLibrary has other datasets that you might be able to use / combine /
whatever to meet your requirements:

   http://openlibrary.org/dev/docs/data


-
Joe Hourcle


Re: [CODE4LIB] marc records sample set

2008-05-09 Thread Jason Ronallo
On Fri, May 9, 2008 at 2:23 PM, Joe Hourcle
[EMAIL PROTECTED] wrote:
 OpenLibrary has other datasets that you might be able to use / combine /
 whatever to meet your requirements:

   http://openlibrary.org/dev/docs/data

This'll get you the other MARC dumps that have been made available to
IA through OL:
http://www.archive.org/search.php?query=collection%3Aol_data%20marc

Lots to work with here.

I also wonder whether, rather than one large test set, it wouldn't be
good to have smaller test sets which exhibit particular problems or
are of a particular type (e.g. music).

Jason


Re: [CODE4LIB] marc records sample set

2008-05-09 Thread Gabriel Sean Farrell
On Fri, May 09, 2008 at 11:58:03AM -0700, Casey Durfee wrote:
 On Fri, May 9, 2008 at 11:14 AM, Bess Sadler [EMAIL PROTECTED] wrote:

 
  Casey, you say you're getting indexing times of 1000 records /
  second? That's amazing! I really have to take a closer look at
  MarcThing. Could pymarc really be that much faster than marc4j? Or
  are we comparing apples to oranges since we haven't normalized for
  the kinds of mapping we're doing and the hardware it's running on?
 

 Well, you can't take a closer look at it yet, since I haven't gotten off my
 lazy butt and released it.  We're still using an older version of the
 project in production on LT.  I'm going to cut us over to the latest version
 this weekend.  At this point, being able to say we eat our own dogfood is
 the only barrier to release.

Looking forward to the release.  I'd be interested to see how it
compares to the pymarc-indexer branch in FBO/Helios [1].

 The last indexer I wrote (the one used by fac-back-opac) used marc4j and was
 around 100-150 a second.  Some of the boost was due to better designed code
 on my end, but I can't take too much credit.  Pymarc is much, much faster.
 I never bothered to figure out why.  (That wasn't why I switched, though --
 there are some problems with parsing ANSEL with marc4j (*) which I decided
 I'd rather be mauled by bears than try and fix -- the performance boost was
 just a pleasant surprise). Of course one could use pymarc from java with
 Jython.

On the small set of documents I'm now indexing (3327) I get 141
rec/sec.  This is on my test server, an AMD64 whose processor speed I
can't recall.  That rate includes pymarc processing (~65%) and the
loading of the CSV file into SOLR (~35%).  Surely there's some room
there for optimization, but it's fast enough for my current purposes.
Also, I'm in the camp that would be happy with a ~10,000 record test
set.  There will always be some edge cases that we'll only solve as
they're encountered.  I need rapid iteration!
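
For reference, the pymarc-to-CSV half of that pipeline is roughly the
following sketch (the column names are made up for illustration, not
the actual schema; the resulting file then gets POSTed to Solr's CSV
update handler):

import csv
from pymarc import MARCReader

def marc_to_csv(marc_path, csv_path):
    """Flatten a few MARC fields into a CSV for Solr's CSV update handler."""
    with open(marc_path, "rb") as fh, open(csv_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["id", "title", "subjects"])
        for record in MARCReader(fh):
            if record is None:          # skip records pymarc couldn't parse
                continue
            ctrl = record["001"]
            title = record["245"]
            subjects = "; ".join(f.format_field() for f in record.get_fields("650"))
            writer.writerow([
                ctrl.data if ctrl else "",
                (title["a"] or "") if title else "",
                subjects,
            ])

if __name__ == "__main__":
    marc_to_csv("records.mrc", "records.csv")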

Gabriel

[1] http://fruct.us/trac/fbo/browser/branches/pymarc-indexer/indexer