Re: [CODE4LIB] marc records sample set
According to the combined brainstorming of Jonathan Rochkind and myself, the ideal record set should: 1. contain about 10k records, enough to really see the features, but small enough that you could index it in a few minutes on a typical desktop... 5. contain a distribution of typical errors one might encounter with marc records in the wild

This is much harder to do than it might appear on the surface. 10K is a really small set, and the issue is that unless people know how to create a set that really targets the problem areas, you will inevitably miss important stuff. At the end of the day, it's the screwball stuff you didn't think about that always causes the most problems. I think data sets of that size are useful for testing interfaces, but not for determining catalog behavior and setup.

Despite the indexing time, I believe in testing with much larger sets. There are certain very important things that just can't be examined with small sets. For example, one huge problem with catalog data is that completeness and quality are highly variable. When we were experimenting some time back, we found that how you normalize the data and how you weight terms as well as documents has an enormous impact on search results, and that unless you do some tuning, you will inevitably find a lot of garbage too close to the top, with a bunch of good stuff ranked so low it isn't found.

kyle
Re: [CODE4LIB] marc records sample set
Bess Sadler wrote: 3. Are there features missing from the above list that would make this more useful?

One of the things that Bill Moen showed at Access a couple of years ago (Edmonton?) was what he and others were calling a radioactive MARC record: one that had no normal payload but, IIRC, had a 245$a whose value was "245$a", and so on. As I recall, it was used to test processes where you wanted to be sure that a specific field was mapped to a specific index, or was showing up in a particular Z39.50 profile.

Walter
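For concreteness, here is a minimal sketch of building such a radioactive record in Python with pymarc. The tag and subfield choices are just examples, and the flat subfields list assumes the pre-5.x pymarc API (newer releases use Subfield objects instead).

# Build a "radioactive" record whose every value names its own tag and
# subfield, so you can trace exactly which MARC field ended up in which
# index or display element.
from pymarc import Record, Field

record = Record()
for tag, codes in [('100', ['a']), ('245', ['a', 'b']), ('650', ['a', 'x'])]:
    subfields = []
    for code in codes:
        subfields.extend([code, '%s$%s' % (tag, code)])  # e.g. 'a', '245$a'
    record.add_field(Field(tag=tag, indicators=['0', '0'], subfields=subfields))

with open('radioactive.mrc', 'wb') as fh:
    fh.write(record.as_marc())

# After indexing, a search for "245$a" (or "650$x", etc.) shows where that
# particular field was mapped.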
Re: [CODE4LIB] marc records sample set
This is much harder to do than it might appear on the surface. 10K is a really small set, and the issue is that unless people know how to create a set that really targets the problem areas, you will inevitably miss important stuff. At the end of the day, it's the screwball stuff you didn't think about that always causes the most problems. I think data sets of that size are useful for testing interfaces, but not for determining catalog behavior and setup.

Sounds like you have some experience of this, Kyle! Do you have a list of the screwball stuff? Even an offhand one would be interesting...

Despite the indexing time, I believe in testing with much larger sets. There are certain very important things that just can't be examined with small sets. For example, one huge problem with catalog data is that completeness and quality are highly variable. When we were experimenting some time back, we found that how you normalize the data and how you weight terms as well as documents has an enormous impact on search results, and that unless you do some tuning, you will inevitably find a lot of garbage too close to the top, with a bunch of good stuff ranked so low it isn't found.

What sorts of normalizations do you do? I'm starting to look for standard measures of data quality or data validation/normalization routines for MARC. I ask because in my experiments with the FRBR Display Tool, I've found the sorts of variations you describe, and I'd like to experiment with more data validation and normalization. I'm very new at working with MARC data, so even pointers to standard stuff would be really helpful!

-Jodi

Jodi Schneider
Science Library Specialist
Amherst College
413-542-2076
Re: [CODE4LIB] marc records sample set
I agree with Kyle that a big, wide set of records is better for testing purposes. In processing records for Evergreen imports, I've found that there are often just a handful that throw marc4j for a loop. I suppose I should cull those and attach them to bug reports... instead I've taken the path of least resistance and just used yaz-marcdump. (bad Dan!) (A rough sketch of doing that cull with pymarc follows at the end of this message.)

There are, of course, _lots_ of MARC records available for download from http://www.archive.org/search.php?query=collection%3A%22ol_data%22%20AND%20%28MARC%20records%29 - not just the LoC set. So one could presumably assemble a nice big set of records starting here.

Dan

On Fri, May 9, 2008 at 12:33 PM, Bess Sadler [EMAIL PROTECTED] wrote:

Those of us involved in the Blacklight and VuFind projects are spending lots of time recently thinking about marc records indexing. We're about to start running some performance tests, and we want to create unit tests for our marc to solr indexer, and also people wanting to download and play with the software need to have easy access to a small but representative set of marc records that they can play with. According to the combined brainstorming of Jonathan Rochkind and myself, the ideal record set should:

1. contain about 10k records, enough to really see the features, but small enough that you could index it in a few minutes on a typical desktop
2. contain a distribution of kinds of records, e.g., books, CDs, musical scores, DVDs, special collection items, etc.
3. contain a distribution of languages, so we can test unicode handling
4. contain holdings information in addition to bib records
5. contain a distribution of typical errors one might encounter with marc records in the wild

It seems to me that the set that Casey donated to Open Library (http://www.archive.org/details/marc_records_scriblio_net) would be a good place from which to draw records, because although IANAL, this seems to sidestep any legal hurdles. I'd also love to see the ability for the community to contribute test cases. Assuming such a set doesn't exist already (see my question below) this seems like the ideal sort of project for code4lib to host, too. Since code4lib is my lazyweb, I'm asking you:

1. Does something like this exist already and I just don't know about it?
2. If not, do you have suggestions on how to go about making such a data set? I have some ideas on how to do it bit by bit, and we have a certain small set of records that we're already using for testing, but maybe there's a better method that I don't know about?
3. Are there features missing from the above list that would make this more useful?

Thoughts? Comments? Thanks!

Bess

Elizabeth (Bess) Sadler
Research and Development Librarian
Digital Scholarship Services
Box 400129 Alderman Library
University of Virginia
Charlottesville, VA 22904
[EMAIL PROTECTED]
(434) 243-2305
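Since the cull mentioned above comes up a lot, here is a hedged sketch of doing it in Python with pymarc instead of yaz-marcdump. The file names are placeholders, and how pymarc reports a bad record differs by version (recent releases yield None and expose the raw bytes as reader.current_chunk; older ones raise), so treat the error handling as approximate.

# Split a batch into records that parse cleanly and ones that don't, so
# the broken ones can be attached to bug reports rather than discarded.
from pymarc import MARCReader

with open('batch.mrc', 'rb') as fh, \
        open('good.mrc', 'wb') as good, \
        open('bad.mrc', 'wb') as bad:
    reader = MARCReader(fh)
    for record in reader:
        if record is None:
            # Parse failure: keep the raw chunk (if available) for a bug report.
            if getattr(reader, 'current_chunk', None):
                bad.write(reader.current_chunk)
            continue
        good.write(record.as_marc())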
Re: [CODE4LIB] marc records sample set
Sounds like you have some experience of this, Kyle! Do you have a list of the screwball stuff? Even an offhand one would be interesting...

I don't have the list with me, but just to rattle a few things off: some extra-short records rank high because so much of a search term matches the whole document. Some records contain fields that have been repeated many times, which artificially boosts them. You'll see nonstandard use of fields as well as foreign character sets. There are a number of ways URLs are displayed. Deriving format can be problematic because the encoded material type and what you're actually providing access to are different. Some records contain lots of added entries, while many important ones are fairly minimalist. There are records converted on the fly, purchased record sets, automatically generated ones, full-level ones, and ones that automatically have some subject heading added that contains a common search term. There are a zillion other things, but you get the idea.

What sorts of normalizations do you do? I'm starting to look for standard measures of data quality or data validation/normalization routines for MARC.

Index terms only once. This helps deal with repeated terms in repetitive subject headings and added entries. Look at the presence of fields to assign additional material types (particularly useful for electronic resources, since these typically have paper records -- but don't be fooled by links to TOCs and other stuff that's not full text). Give special handling to serials. Keywords need to be weighted differently depending on where they're from (e.g. title worth more than subject). We also assigned a rough record quality score based on the presence/absence of fields, so that longer, more complete records don't become less important simply because a search term matches a smaller proportion of them than of a short record. Give a bit more weight to true full-text retrieval. The number of libraries holding the item is considered. When indexing, 650|a is more important than |x, |y, or |z. Don't treat 650|z the same way as 651|a. Recognize that 650|v is a form and that some common 650|x fields should be treated this way (and neither should just be a regular index term).

The only thing we didn't use that a lot of places put a lot of weight on is date -- this is good for retrieving popular fiction, but you have to be really careful with it in academic collections because it can hide classic stuff that's been around a long time. I can't remember everything off the top of my head, but there's a lot, and it makes a big difference.

kyle
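Purely to make the record-quality and subject-handling ideas concrete, here is a rough sketch in Python with pymarc. The fields checked, the weights, and the way subdivisions are split out are invented for illustration; they are not the actual values described above.

# Assign a crude per-record quality score from field presence, and split
# 650 subject headings so $a, the $x/$y/$z subdivisions, and the $v form
# can be indexed (and weighted) separately.
from pymarc import MARCReader

FIELD_WEIGHTS = {
    '100': 1.0,  # main entry
    '245': 1.0,  # title
    '300': 0.5,  # physical description
    '6': 1.5,    # any 6xx subject field
    '7': 1.0,    # any 7xx added entry
}

def quality_score(record):
    tags = {f.tag for f in record.get_fields()}
    score = 0.0
    for key, weight in FIELD_WEIGHTS.items():
        if len(key) == 1:
            if any(t.startswith(key) for t in tags):
                score += weight
        elif key in tags:
            score += weight
    return score

def topical_terms(record):
    for field in record.get_fields('650'):
        yield 'topic', ' '.join(field.get_subfields('a'))
        yield 'subdivision', ' '.join(field.get_subfields('x', 'y', 'z'))
        yield 'form', ' '.join(field.get_subfields('v'))

with open('batch.mrc', 'rb') as fh:
    for record in MARCReader(fh):
        if record is None:
            continue
        print(quality_score(record), list(topical_terms(record)))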
Re: [CODE4LIB] marc records sample set
I think you start with a smaller set, but then when you find idiosyncratic records that were NOT represented in your smaller set, you add representative samples to the sample set. The sample set organically grows. Certainly at some point you've got to test on a larger set too. But I think there's a lot of value in having a small test set as well. Of course, it is something of a challenge to even come up with a reasonably representative small set. But it doesn't need to be absolutely representative---when you find examples not represented, you add them. It grows.

Jonathan

Kyle Banerjee wrote:

According to the combined brainstorming of Jonathan Rochkind and myself, the ideal record set should: 1. contain about 10k records, enough to really see the features, but small enough that you could index it in a few minutes on a typical desktop... 5. contain a distribution of typical errors one might encounter with marc records in the wild

This is much harder to do than it might appear on the surface. 10K is a really small set, and the issue is that unless people know how to create a set that really targets the problem areas, you will inevitably miss important stuff. At the end of the day, it's the screwball stuff you didn't think about that always causes the most problems. I think data sets of that size are useful for testing interfaces, but not for determining catalog behavior and setup. Despite the indexing time, I believe in testing with much larger sets. There are certain very important things that just can't be examined with small sets. For example, one huge problem with catalog data is that completeness and quality are highly variable. When we were experimenting some time back, we found that how you normalize the data and how you weight terms as well as documents has an enormous impact on search results, and that unless you do some tuning, you will inevitably find a lot of garbage too close to the top, with a bunch of good stuff ranked so low it isn't found.

kyle

--
Jonathan Rochkind
Digital Services Software Engineer
The Sheridan Libraries
Johns Hopkins University
410.516.8886
rochkind (at) jhu.edu
Re: [CODE4LIB] marc records sample set
The Blacklight code is not currently using XML or XSLT. It's indexing binary MARC files. I don't know its speed, but I hear it's pretty fast.

But for the kind of test set I want, even waiting half an hour is too long. I want a test set where I can make a change to my indexing configuration and then see the results in a few minutes, if not seconds. This has become apparent in my attempts to get the indexer _working_, where I might not be actually changing the index mapping at all; I'm just changing the indexer configuration until I know it's working. When I get to actually messing with the indexer mapping to try out new ideas, I believe it will also be important, however.

Jonathan

Casey Durfee wrote:

I strongly agree that we need something like this. The LoC records that Casey donated are a great resource but far from ideal for this purpose. They're pretty homogeneous. I do think it needs to be bigger than 10,000 though. 100,000 would be a better target. And I would like to see a UNIMARC/DANMARC-based one as well as a MARC21-based one (can one's parser handle DANMARC's ø subfield?).

I don't know about Blacklight or VuFind, but using our MarcThing package + Solr we can index up to 1000 records a second. I know using XSLT severely limits how fast you can index (I'll refrain from giving another rant about how wrong it is to use XSL to handle MARC -- the Society for Prevention of Cruelty to Dead Horses has my number as it is). But I'd still expect you can do a good 50-100 records a second. That's only a half hour to an hour of work to index 100,000 records. You could run it over your lunch break. Seems reasonable to me.

In addition to a wide variety of languages, encodings, formats and so forth, it would definitely need to have records explicitly designed to break things: blank MARC tags, extraneous subfield markers, non-printing control characters, incorrect-length fixed fields, etc. The kind of stuff that should never happen in theory, but happens frequently in real life (I'm looking at you, Horizon's MARC export utility).

The legal aspect of this is the difficult part. We (LibraryThing) could easily grab 200 random records from 500 different Z39.50 sources worldwide. Technically, it could be done in a couple of hours. Legally, I don't think it could ever be done, sadly.

--Casey

On Fri, May 9, 2008 at 9:33 AM, Bess Sadler [EMAIL PROTECTED] wrote:

Those of us involved in the Blacklight and VuFind projects are spending lots of time recently thinking about marc records indexing. We're about to start running some performance tests, and we want to create unit tests for our marc to solr indexer, and also people wanting to download and play with the software need to have easy access to a small but representative set of marc records that they can play with. According to the combined brainstorming of Jonathan Rochkind and myself, the ideal record set should:

1. contain about 10k records, enough to really see the features, but small enough that you could index it in a few minutes on a typical desktop
2. contain a distribution of kinds of records, e.g., books, CDs, musical scores, DVDs, special collection items, etc.
3. contain a distribution of languages, so we can test unicode handling
4. contain holdings information in addition to bib records
5. contain a distribution of typical errors one might encounter with marc records in the wild

It seems to me that the set that Casey donated to Open Library (http://www.archive.org/details/marc_records_scriblio_net) would be a good place from which to draw records, because although IANAL, this seems to sidestep any legal hurdles. I'd also love to see the ability for the community to contribute test cases. Assuming such a set doesn't exist already (see my question below) this seems like the ideal sort of project for code4lib to host, too. Since code4lib is my lazyweb, I'm asking you:

1. Does something like this exist already and I just don't know about it?
2. If not, do you have suggestions on how to go about making such a data set? I have some ideas on how to do it bit by bit, and we have a certain small set of records that we're already using for testing, but maybe there's a better method that I don't know about?
3. Are there features missing from the above list that would make this more useful?

Thoughts? Comments? Thanks!

Bess

Elizabeth (Bess) Sadler
Research and Development Librarian
Digital Scholarship Services
Box 400129 Alderman Library
University of Virginia
Charlottesville, VA 22904
[EMAIL PROTECTED]
(434) 243-2305

--
Jonathan Rochkind
Digital Services Software Engineer
The Sheridan Libraries
Johns Hopkins University
410.516.8886
rochkind (at) jhu.edu
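To make the "records explicitly designed to break things" point from Casey's message concrete, here is a hedged sketch that damages a well-formed record at the byte level, since a sane MARC library won't emit bad structure on its own. The specific corruptions are arbitrary examples, and the flat subfields list again assumes the pre-5.x pymarc API.

# Build one clean record, then corrupt its raw bytes: lie about the record
# length in the leader and inject a stray subfield delimiter, producing a
# subfield with no code. Feed the result to an indexer and see whether it
# dies, skips the record, or silently mangles it.
from pymarc import Record, Field

record = Record()
record.add_field(Field(tag='245', indicators=['0', '0'],
                       subfields=['a', 'A perfectly ordinary title']))
raw = bytearray(record.as_marc())

raw[0:5] = b'99999'                     # bogus record length (leader 00-04)
end_of_directory = raw.index(b'\x1e')   # first field terminator ends the directory
raw.insert(end_of_directory + 3, 0x1f)  # extraneous subfield delimiter

with open('broken.mrc', 'wb') as fh:
    fh.write(bytes(raw))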
Re: [CODE4LIB] marc records sample set
On May 9, 2008, at 1:42 PM, Jonathan Rochkind wrote: The Blacklight code is not currently using XML or XSLT. It's indexing binary MARC files. I don't know its speed, but I hear it's pretty fast.

Right, I'm talking about the java indexer we're working on, which we're hoping to turn into a plugin contrib module for solr. It processes binary marc files. We're getting times of about 150 records / second, but that's on an unfortunately throttled server and we're munging each record significantly (replacing musical instrument and language codes with their English-language equivalents, calculating composition era, etc.).

Casey, you say you're getting indexing times of 1000 records / second? That's amazing! I really have to take a closer look at MarcThing. Could pymarc really be that much faster than marc4j? Or are we comparing apples to oranges, since we haven't normalized for the kinds of mapping we're doing and the hardware it's running on?

Bess

Elizabeth (Bess) Sadler
Research and Development Librarian
Digital Scholarship Services
Box 400129 Alderman Library
University of Virginia
Charlottesville, VA 22904
[EMAIL PROTECTED]
(434) 243-2305
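For anyone wondering what that kind of munging looks like, here is a small sketch of one piece of it in Python with pymarc (not the Blacklight java indexer itself): mapping the MARC21 language code in 008/35-37 to a readable name at index time. The lookup table is deliberately tiny; a real one would cover the full MARC code list for languages.

# Replace the three-letter language code from the 008 fixed field with an
# English-language label suitable for faceting or display.
from pymarc import MARCReader

LANGUAGE_NAMES = {
    'eng': 'English',
    'fre': 'French',
    'ger': 'German',
    'spa': 'Spanish',
}

def language_label(record):
    fields_008 = record.get_fields('008')
    if not fields_008 or len(fields_008[0].data) < 38:
        return 'Unknown'
    code = fields_008[0].data[35:38]
    return LANGUAGE_NAMES.get(code, code)

with open('batch.mrc', 'rb') as fh:
    for record in MARCReader(fh):
        if record is None:
            continue
        print(language_label(record))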
Re: [CODE4LIB] marc records sample set
On Fri, 9 May 2008, Bess Sadler wrote: Those of us involved in the Blacklight and VuFind projects are spending lots of time recently thinking about marc records indexing. We're about to start running some performance tests, and we want to create unit tests for our marc to solr indexer, and also people wanting to download and play with the software need to have easy access to a small but representative set of marc records that they can play with. [trimmed] It seems to me that the set that Casey donated to Open Library (http://www.archive.org/details/marc_records_scriblio_net) would be a good place from which to draw records, because although IANAL, this seems to sidestep any legal hurdles. I'd also love to see the ability for the community to contribute test cases. Assuming such a set doesn't exist already (see my question below) this seems like the ideal sort of project for code4lib to host, too.

OpenLibrary has other datasets that you might be able to use / combine / whatever to meet your requirements:

http://openlibrary.org/dev/docs/data

- Joe Hourcle
Re: [CODE4LIB] marc records sample set
On Fri, May 9, 2008 at 2:23 PM, Joe Hourcle [EMAIL PROTECTED] wrote: OpenLibrary has other datasets that you might be able to use / combine / whatever to meet your requirements: http://openlibrary.org/dev/docs/data

This'll get you the other MARC dumps that have been made available to IA through OL: http://www.archive.org/search.php?query=collection%3Aol_data%20marc

Lots to work with here. I also wonder if, rather than one large test set, it wouldn't be good to have smaller test sets which exhibit particular problems or are of a particular type (e.g., music).

Jason
Re: [CODE4LIB] marc records sample set
On Fri, May 09, 2008 at 11:58:03AM -0700, Casey Durfee wrote:

On Fri, May 9, 2008 at 11:14 AM, Bess Sadler [EMAIL PROTECTED] wrote: Casey, you say you're getting indexing times of 1000 records / second? That's amazing! I really have to take a closer look at MarcThing. Could pymarc really be that much faster than marc4j? Or are we comparing apples to oranges since we haven't normalized for the kinds of mapping we're doing and the hardware it's running on?

Well, you can't take a closer look at it yet, since I haven't gotten off my lazy butt and released it. We're still using an older version of the project in production on LT. I'm going to cut us over to the latest version this weekend. At this point, being able to say we eat our own dogfood is the only barrier to release.

Looking forward to the release. I'd be interested to see how it compares to the pymarc-indexer branch in FBO/Helios [1]. The last indexer I wrote (the one used by fac-back-opac) used marc4j and was around 100-150 records a second. Some of the boost was due to better-designed code on my end, but I can't take too much credit. Pymarc is much, much faster. I never bothered to figure out why. (That wasn't why I switched, though -- there are some problems with parsing ANSEL with marc4j (*) which I decided I'd rather be mauled by bears than try to fix -- the performance boost was just a pleasant surprise.) Of course, one could use pymarc from Java with Jython.

On the small set of documents I'm now indexing (3327) I get 141 rec/sec. This is on my test server, an AMD64 whose processor speed I can't recall. That rate includes pymarc processing (~65%) and the loading of the CSV file into Solr (~35%). Surely there's some room there for optimization, but it's fast enough for my current purposes.

Also, I'm in the camp that would be happy with a ~10,000 record test set. There will always be some edge cases that we'll only solve as they're encountered. I need rapid iteration!

Gabriel

[1] http://fruct.us/trac/fbo/browser/branches/pymarc-indexer/indexer
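To illustrate the shape of that pymarc-then-CSV pipeline, here is a rough sketch of the first half. The column names and field choices are illustrative, not the actual pymarc-indexer mapping, and the second half (posting the CSV to Solr) is only mentioned in a comment.

# Flatten a few fields per record into a CSV that Solr can load; in the
# numbers above, this pymarc pass is roughly the ~65% share, and the CSV
# load into Solr the remaining ~35%.
import csv
from pymarc import MARCReader

with open('batch.mrc', 'rb') as marc_fh, \
        open('records.csv', 'w', newline='', encoding='utf-8') as csv_fh:
    writer = csv.writer(csv_fh)
    writer.writerow(['id', 'title', 'subjects'])
    for record in MARCReader(marc_fh):
        if record is None:
            continue
        fields_001 = record.get_fields('001')
        record_id = fields_001[0].value() if fields_001 else ''
        fields_245 = record.get_fields('245')
        title = fields_245[0].value() if fields_245 else ''
        subjects = '|'.join(f.value() for f in record.get_fields('650', '651'))
        writer.writerow([record_id, title, subjects])

# The resulting records.csv can then be posted to Solr's CSV update
# handler (e.g. with curl), which is the loading step mentioned above.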