Re: [CODE4LIB] MARC Magic for file
On Wed, May 23, 2012 at 6:16 PM, Kyle Banerjee wrote:
> I'm not sure whether to laugh or cry that it's a sign of progress that a 40
> year old utility designed to identify file types is now just beginning to
> be able to recognize a format that's been around for almost 50 years...

Laugh :-)

//Ed
Re: [CODE4LIB] MARC Magic for file
On Wed, May 23, 2012 at 12:14 PM, Ford, Kevin wrote:
> I finally had occasion today (read: remembered) to see if the *nix "file"
> command would recognize a MARC record file. I haven't tested extensively,
> but it did identify the file as MARC21 Bibliographic record. It also
> correctly identified a MARC21 Authority Record. I'm running the most
> recent version of Ubuntu (12.04 - precise pangolin).

I'm not sure whether to laugh or cry that it's a sign of progress that a 40 year old utility designed to identify file types is now just beginning to be able to recognize a format that's been around for almost 50 years...

kyle

--
Kyle Banerjee
Digital Services Program Manager
Orbis Cascade Alliance
baner...@orbiscascade.org / 503.999.9787
Re: [CODE4LIB] MARC Magic for file
The magic file format changed between versions; I think the OSX version was not compatible with more up-to-date versions (in the original thread, this caused me some confusion).

Simon

On Wed, May 23, 2012 at 4:34 PM, Ross Singer wrote:
> On May 23, 2012, at 4:22 PM, Kevin Ford wrote:
>
> > Perhaps OpenBSD is using some custom branch of "file", hasn't updated
> > the db, etc.
>
> As Stuart pointed out, some implementations are slow to update the db.
> OSX, for example, also just says "data" (hence my question on the output).
>
> -Ross.
Re: [CODE4LIB] MARC Magic for file
On Wed, May 23, 2012 at 04:34:47PM -0400, Ross Singer wrote:
> On May 23, 2012, at 4:22 PM, Kevin Ford wrote:
>
> > Don't know what to say. Crawling through the source for "file" at [1], the
> > pattern matching code was in place as of Sept 2011. It could be present
> > earlier than Sept 2011, but I stopped hunting for it. The earliest it
> > would have made its way into the magic db would have been April 2011.
> >
> > Perhaps OpenBSD is using some custom branch of "file", hasn't updated the
> > db, etc.
>
> As Stuart pointed out, some implementations are slow to update the db. OSX,
> for example, also just says "data" (hence my question on the output).

adding FreeBSD's magicfile from this commit on a user's $HOME

http://lists.freebsd.org/pipermail/svn-src-vendor/2011-October/000851.html

For my next trick I will try to remember that I need to do that.

./fxk

--
If builders built buildings the way programmers wrote programs, then the first woodpecker to come along would destroy civilization.
Re: [CODE4LIB] MARC Magic for file
On May 23, 2012, at 4:22 PM, Kevin Ford wrote:
> Don't know what to say. Crawling through the source for "file" at [1], the
> pattern matching code was in place as of Sept 2011. It could be present
> earlier than Sept 2011, but I stopped hunting for it. The earliest it would
> have made its way into the magic db would have been April 2011.
>
> Perhaps OpenBSD is using some custom branch of "file", hasn't updated the
> db, etc.

As Stuart pointed out, some implementations are slow to update the db. OSX, for example, also just says "data" (hence my question on the output).

-Ross.
Re: [CODE4LIB] MARC Magic for file
> It failed on a file containing all of LC Classification. I need to
> figure out why.

-- To reply to myself: Having looked at the "file" db pattern source [1], I see that the "file" maintainer introduced a typo into the matching pattern for correctly identifying Classification records. That's why it's failing for Class records.

Over and out,

Kevin

[1] ftp://ftp.astron.com/pub/file/
Re: [CODE4LIB] MARC Magic for file
Don't know what to say. Crawling through the source for "file" at [1], the pattern matching code was in place as of Sept 2011. It could be present earlier than Sept 2011, but I stopped hunting for it. The earliest it would have made its way into the magic db would have been April 2011.

Perhaps OpenBSD is using some custom branch of "file", hasn't updated the db, etc.

Yours,

Kevin

On 05/23/2012 03:36 PM, Francis Kayiwa wrote:
> My OS lists it as `data`
>
> $ file 5_records_utf8.mrc
> 5_records_utf8.mrc: data
> $ uname -a
> OpenBSD orthanc.lib.uic.edu 5.1 GENERIC.MP#256 i386
Re: [CODE4LIB] MARC Magic for file
On 24/05/12 07:14, Ford, Kevin wrote:
> I finally had occasion today (read: remembered) to see if the *nix "file"
> command would recognize a MARC record file. I haven't tested extensively,
> but it did identify the file as MARC21 Bibliographic record. It also
> correctly identified a MARC21 Authority Record. I'm running the most
> recent version of Ubuntu (12.04 - precise pangolin).
>
> I write because the inclusion of a "file" MARC21 specification rule in the
> magic.db stems from a Code4lib exchange that started in March 2011 [1] (it
> ends in April if you want to go crawling for the entire thread).

A couple of warnings about the unix file command:

(a) it only looks at the start of the file. This is great because it works fast on big files. This is dreadful because it can't warn you that everything after the first 10k of a 2GB file is corrupt or a 1k MARC file is pre-pended to a 400GB astronomy data file.

(b) it is not uncommon for a file to match multiple file types. This can cause problems when using file to check whether inputs to a program are actually the type the program is expecting.

(c) some platforms have been notoriously slow to add new definitions; Ubuntu is not such a platform.

cheers
stuart

--
Stuart Yeates
Library Technology Services http://www.victoria.ac.nz/library/
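
Warning (a) suggests a complementary check: walk the whole file record by record instead of trusting the first few bytes. What follows is only a minimal sketch in Python, assuming plain ISO 2709 framing (a 24-byte leader whose first five bytes give the record length, and a 0x1D record terminator). The function name `scan_marc_stream` and the demo record are made up for illustration; they are not part of "file" or of any MARC library.

```python
import io

LEADER_LENGTH = 24
RECORD_TERMINATOR = b"\x1d"


def scan_marc_stream(stream):
    """Return (records_ok, problem); problem is None if the whole stream parsed."""
    offset = 0
    records_ok = 0
    while True:
        leader = stream.read(LEADER_LENGTH)
        if not leader:
            return records_ok, None  # clean end of file
        if len(leader) < LEADER_LENGTH or not leader[:5].isdigit():
            return records_ok, f"offset {offset}: no valid leader"
        record_length = int(leader[:5])
        if record_length <= LEADER_LENGTH:
            return records_ok, f"offset {offset}: implausible record length"
        body = stream.read(record_length - LEADER_LENGTH)
        if len(body) != record_length - LEADER_LENGTH:
            return records_ok, f"offset {offset}: record is truncated"
        if not body.endswith(RECORD_TERMINATOR):
            return records_ok, f"offset {offset}: missing record terminator"
        offset += record_length
        records_ok += 1


# Hypothetical minimal record: leader, empty directory, two terminators.
RECORD = b"00026nam a2200025   4500" + b"\x1e\x1d"

print(scan_marc_stream(io.BytesIO(RECORD * 3)))
print(scan_marc_stream(io.BytesIO(RECORD + b"\x00" * 100)))
```

The second call simulates a good record with unrelated data appended, the case "file" cannot see. This only checks framing; it says nothing about whether the fields inside each record make sense.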
Re: [CODE4LIB] MARC Magic for file
I have recently become unpleasantly acquainted with the world of Marc that is not Marc21, but is ISO 2709.

What'll it do on ISO 2709? I might be able to dig up an example. I wonder if it'll claim it's Marc21 (not), or if it's Marc21 "non-conforming" (no, it's not quite that either. It's ISO-2709 MARC that's not Marc21).

If it just doesn't know anything about it and says 'data', that's just fine, if it knows about Marc21 but not non-Marc21 ISO 2709.

On 5/23/2012 3:48 PM, Ford, Kevin wrote:
> -- I received "MARC21 Bibliography" and "MARC21 Authority" respectively.
> In theory, if Leader 20-23 are not "4500" then "(non-conforming)" should
> be appended to the identification. If requested, the mimetype -
> application/marc - should also be outputted.
Re: [CODE4LIB] MARC Magic for file
> Does it work for bulk files?

-- It passed on a file containing 215 MARC Bibs and on a file containing 2,574 MARC Auth records. Don't know if you consider these "bulk," but there is more than 1 record in each file (caveat: "file" stops after evaluating the first line, so of the 2,574 Auth records, the last 2,573 could be invalid).

It failed on a file containing all of LC Classification. I need to figure out why.

> Kevin, do you have examples of the output?

-- I received "MARC21 Bibliography" and "MARC21 Authority" respectively. In theory, if Leader 20-23 are not "4500" then "(non-conforming)" should be appended to the identification. If requested, the mimetype - application/marc - should also be output.

Rgds,

Kevin

> -----Original Message-----
> From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
> Ross Singer
> Sent: Wednesday, May 23, 2012 3:29 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] MARC Magic for file
>
> Wow, this is pretty cool.
>
> Kevin, do you have examples of the output?
>
> Does it work for bulk files?
>
> I mean, I could just try this on my Ubuntu machine, but it's all the
> way downstairs...
>
> -Ross.
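
The behaviour described above (pick a label from the leader, then append "(non-conforming)" when Leader 20-23 is not "4500") can be sketched in a few lines of Python. This is an illustration only and not the actual rule shipped in the magic db: the function name `describe_leader` is hypothetical, the label strings merely imitate the reported output, and the type codes are the Leader/06 values defined by MARC 21 (z for authority, w for classification).

```python
# Leader/06 "type of record" codes used by MARC 21 bibliographic records.
BIBLIOGRAPHIC_TYPES = set("acdefgijkmoprt")


def describe_leader(leader: bytes) -> str:
    """Illustrative classifier; the labels only imitate the output quoted above."""
    if len(leader) < 24 or not leader[:5].isdigit():
        return "data"
    record_type = chr(leader[6])
    if record_type in BIBLIOGRAPHIC_TYPES:
        label = "MARC21 Bibliographic"
    elif record_type == "z":
        label = "MARC21 Authority"
    elif record_type == "w":
        label = "MARC21 Classification"
    else:
        return "data"
    # Leader 20-23 is fixed at "4500" in MARC 21.
    if leader[20:24] != b"4500":
        label += " (non-conforming)"
    return label


print(describe_leader(b"00026nam a2200025   4500"))
print(describe_leader(b"00026nz  a2200025   4500"))
print(describe_leader(b"00026nam a2200025   450 "))
```

Like "file" itself, this looks only at the first leader, so it says nothing about the remaining records in a multi-record file.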
Re: [CODE4LIB] MARC Magic for file
On Wed, May 23, 2012 at 03:28:56PM -0400, Ross Singer wrote:
> Wow, this is pretty cool.
>
> Kevin, do you have examples of the output?
>
> Does it work for bulk files?
>
> I mean, I could just try this on my Ubuntu machine, but it's all the way
> downstairs...

My OS lists it as `data`

$ cd
$ ls
dev id_rsa.pub laflin marc orthanc ssh updating
$ ftp http://drupal.org/files/issues/5_records_utf8.mrc_.txt
Trying 140.211.166.6...
Requesting http://drupal.org/files/issues/5_records_utf8.mrc_.txt
100% |**| 5965 00:00
5965 bytes received in 0.00 seconds (1.56 MB/s)
$ ls
5_records_utf8.mrc_.txt id_rsa.pub marc    ssh
dev                     laflin     orthanc updating
$ mkdir test
$ mv 5_records_utf8.mrc_.txt test/
$ cd test/
$ mv 5_records_utf8.mrc_.txt 5_records_utf8.mrc
$ ls
5_records_utf8.mrc
$ file 5_records_utf8.mrc
5_records_utf8.mrc: data
$ ls
5_records_utf8.mrc
$ ls -al
total 32
drwxr-xr-x   2 kayiwa kayiwa  512 May 23 14:34 .
drwxr-xr-x  10 kayiwa kayiwa  512 May 23 14:34 ..
-rw-r--r--   1 kayiwa kayiwa 5965 May 23 14:33 5_records_utf8.mrc
$ uname -a
OpenBSD orthanc.lib.uic.edu 5.1 GENERIC.MP#256 i386

./fxk

--
If builders built buildings the way programmers wrote programs, then the first woodpecker to come along would destroy civilization.
Re: [CODE4LIB] MARC Magic for file
Wow, this is pretty cool.

Kevin, do you have examples of the output?

Does it work for bulk files?

I mean, I could just try this on my Ubuntu machine, but it's all the way downstairs...

-Ross.

On May 23, 2012, at 3:14 PM, Ford, Kevin wrote:
> I finally had occasion today (read: remembered) to see if the *nix "file"
> command would recognize a MARC record file. I haven't tested extensively,
> but it did identify the file as MARC21 Bibliographic record. It also
> correctly identified a MARC21 Authority Record. I'm running the most recent
> version of Ubuntu (12.04 - precise pangolin).
>
> I write because the inclusion of a "file" MARC21 specification rule in the
> magic.db stems from a Code4lib exchange that started in March 2011 [1] (it
> ends in April if you want to go crawling for the entire thread).
>
> Rgds,
>
> Kevin
>
> [1]
> https://listserv.nd.edu/cgi-bin/wa?A2=ind1103&L=CODE4LIB&T=0&F=&S=&P=112728
[CODE4LIB] MARC Magic for file
I finally had occasion today (read: remembered) to see if the *nix "file" command would recognize a MARC record file. I haven't tested extensively, but it did identify the file as MARC21 Bibliographic record. It also correctly identified a MARC21 Authority Record. I'm running the most recent version of Ubuntu (12.04 - precise pangolin).

I write because the inclusion of a "file" MARC21 specification rule in the magic.db stems from a Code4lib exchange that started in March 2011 [1] (it ends in April if you want to go crawling for the entire thread).

Rgds,

Kevin

[1] https://listserv.nd.edu/cgi-bin/wa?A2=ind1103&L=CODE4LIB&T=0&F=&S=&P=112728

--
Kevin Ford
Network Development and MARC Standards Office
Library of Congress
Washington, DC
Re: [CODE4LIB] MARC magic for file
http://i.imgur.com/6WtA0.png

(Sorry, it's Friday. Also, blame dchud for the idea.)

-Sean

On 4/6/11 4:53 PM, "Mike Taylor" wrote:
> So while I agree that in the case of HTML we are right to now be
> moving towards more rigorous demands of what to accept (as well, of
> course, as being conservative in what we emit), I don't think we could
> have made the leap from nothing to modern rigour.
>
> -- Mike
Re: [CODE4LIB] MARC magic for file
On 6 April 2011 19:53, Jonathan Rochkind wrote:
> On 4/6/2011 2:43 PM, William Denton wrote:
>>
>> "Validity" does mean something definite ... but Postel's Law is a good
>> guideline, especially with the swamp of bad MARC, old MARC, alternate
>> MARC, that's out there. Valid MARC is valid MARC, but if---for the sake
>> of file and its magic---we can identify technically invalid but still
>> usable MARC, that's good.
>
> Hmm, except in the case of Web Browsers, I think general consensus is
> Postel's law was not helpful. These days, most people seem to think that
> having different browsers be tolerant of invalid data in different ways was
> actually harmful rather than helpful to inter-operability (which is
> theoretically the goal of Postel's law), and that's not what people do
> anymore in web browser land, at least not to the extremes they used to do
> it.

But the idea that browsers should be less permissive in what they accept is a modern one that we now have the luxury of only because adherence to Postel's law in the early days of the Web allowed it to become ubiquitous. Though it's true, as Harvey Thompson has observed, that "it's difficult to retro-fit correctness", Clay Shirky was also very right when he pointed out that "You cannot simultaneously have mass adoption and rigor". If browsers in 1995 had been as pedantic as the browsers of 2011 (rightly) are, we wouldn't even have the Web; or if it existed at all it would just be a nichey thing that a few scientists used to make their publications available to each other.

So while I agree that in the case of HTML we are right to now be moving towards more rigorous demands of what to accept (as well, of course, as being conservative in what we emit), I don't think we could have made the leap from nothing to modern rigour.

-- Mike
Re: [CODE4LIB] MARC magic for file
> Well, the problem is when the original Marc4J author took the spec at its
> word, and actually _acted upon_ the '4' and the '5', changing file semantics
> if they were different, and throwing an exception if it was a non-digit.

At least the author actually used the values rather than checking to see if a 4 or 5 were there. I still don't see what the point of looking for a 0 in an undefined field would be. I'm wondering what kind of nut job would write this into the standard, but that's not the author's problem.

> Do you think he got it wrong? How was he supposed to know he got it wrong,
> he wrote to the spec and took it at its word. Are you SURE there aren't any
> Marc formats other than Marc21 out there that actually do use these bytes
> with their intended meaning, instead of fixing them?

I wouldn't call it wrong -- the spec is a logical point of departure. MARC21 derives from an ISO standard that does not use those character positions and which otherwise requires the same data layout, but the author wouldn't necessarily know that.

Standards have something in common with laws in that how they are used in the real world is as or more important than what is actually defined -- what's written and what's done in practice can be very different. Everyone here who has parsed catalog data or done an ILS migration knows better than to think for a second that fields can be assumed to be used as defined, except for very basic stuff.

> How was the Marc4J author supposed to be sure of that, or even guess it
> might be the case, and know he'd be serving users better by ignoring the
> spec here instead of following it?

There might not have been a good way to know. With data, one thing you always want to do is ask a bunch of people who work with it all the time about anomalies in the wild. Many great works of fiction masquerade as documents which supposedly describe reality.

> Ie: I _thought_ I was writing only for Marc21, but then it turns out I've
> got to accept records from Outer Weirdistan that are a kind of legal Marc
> that actually uses those bytes for their intended meaning

Any such MARC would be noncompliant with the ISO standard from which MARC21 hails. If working from the MARC21 standard and weird records are in question, there would be a greater chance of choking on non-numeric tags, as those are allowed by the ISO standard.

Ignoring that MARC21 would need to be redefined to be able to take on other values, one can safely conclude that such a redefinition could only be written by totally deranged individuals. Values lower than 4 and 5 respectively would limit record length to the point little or no data could be stored, and greater values would be completely nonsensical, as the MARC record length limitation would mean that the extra space allocated by the digits could only contain zeros.

In any case, MARC is a legacy standard from the 60's. The chances of new flavors emerging are dismal at best.

> Again, I realize in the actual environment we've got, this is not a luxury
> we have. But it's a fault, not a benefit, to have lots of software
> everywhere behaving in non-compliant ways and creating invalid (according to
> the spec!) data.

Creating is another matter entirely. Since we can control what we create ourselves, we make things a little better every time we make things conformant. However, we can't control what others do, and being able to read everything is useful, including stuff created using tools/processes that aren't up to scratch.

kyle
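
The arithmetic behind the "4" and "5" argument can be made concrete. A rough sketch in Python, assuming only that Leader 20 gives the number of digits for a field's length, Leader 21 the number of digits for its starting position, and Leader 00-04 caps a record at 99,999 bytes; `entry_map_limits` is a made-up helper name.

```python
MAX_RECORD_LENGTH = 99_999  # leader 00-04 holds a five-digit record length


def entry_map_limits(len_of_length: int, len_of_start: int):
    """Largest field length and starting position expressible with the given digit widths."""
    return 10 ** len_of_length - 1, 10 ** len_of_start - 1


for widths in [(4, 5), (2, 3), (6, 7)]:
    max_field, max_start = entry_map_limits(*widths)
    usable_start = min(max_start, MAX_RECORD_LENGTH)
    print(widths, max_field, max_start, usable_start)
```

With (4, 5) the starting position tops out at exactly the record-length limit; narrower widths leave room for very little data, and wider ones can only add leading zeros, because no position in a record can exceed 99,999.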
Re: [CODE4LIB] MARC magic for file
On 4/6/2011 2:43 PM, William Denton wrote:
> "Validity" does mean something definite ... but Postel's Law is a good
> guideline, especially with the swamp of bad MARC, old MARC, alternate
> MARC, that's out there. Valid MARC is valid MARC, but if---for the sake
> of file and its magic---we can identify technically invalid but still
> usable MARC, that's good.

Hmm, except in the case of Web Browsers, I think general consensus is Postel's law was not helpful. These days, most people seem to think that having different browsers be tolerant of invalid data in different ways was actually harmful rather than helpful to inter-operability (which is theoretically the goal of Postel's law), and that's not what people do anymore in web browser land, at least not to the extremes they used to do it.

So Postel's Law may not be a universal. Although marc data may or may not be analogous to a web browser/html. :)

It doesn't _really_ matter, 'cause we're stuck with the legacy we're stuck with, there's no changing it now. But there are real world negative consequences to it, some of which I've tried to explain in previous messages. (And still don't call it "validity" if it's not please! But yes, sometimes insisting on strict validity is not the appropriate solution).

Also note that assuming that byte 20-21 is "45" even when it's something else is possibly not something Postel would accept as an application of his law -- unless you document your software specifically as working only with Marc21, and not any Marc.

[Postel's Law: "Be conservative in what you send; be liberal in what you accept." http://en.wikipedia.org/wiki/Robustness_principle .

That wiki page also notes the general category of downside in following Postel's law, which is what was encountered with HTML, and which _I've_ encountered with MARC:

"For example, a defective implementation that sends non-conforming messages might be used only with implementations that tolerate those deviations from the specification until, possibly several years later, it is connected with a less tolerant application that rejects its messages. In such a situation, identifying the problem is often difficult, and deploying a solution can be costly."

Yes, identifying the problem and deploying the solution was costly, in my MARC case, although it definitely could have been worse. ]
Re: [CODE4LIB] MARC magic for file
On 6 April 2011, Jonathan Rochkind wrote: I think we computer programmers are really better-served by reserving the notion of "validity" for things specified by formal specifications -- as we normally do, talking about any other data format. And the only formal specifications I can find for Marc21 say that leader bytes 20-23 should be 4500. (Not true of Marc in general just Marc21). "Validity" does mean something definite ... but Postel's Law is a good guideline, especially with the swamp of bad MARC, old MARC, alternate MARC, that's out there. Valid MARC is valid MARC, but if---for the sake of file and its magic---we can identify technically invalid but still usable MARC, that's good. Bill -- William Denton, Toronto : miskatonic.org www.frbr.org openfrbr.org
Re: [CODE4LIB] MARC magic for file
On 4/6/2011 2:02 PM, Kyle Banerjee wrote:

> I'd go so far as to question the value of validating redundant data that theoretically has meaning but which are never supposed to vary. The 4 and the 5 simply repeat what is already known about the structure of the MARC record. Choking on stuff like this is like having a web browser ask you what to do with a page because it lacks a document type declaration.

Well, the problem is when the original Marc4J author took the spec at its word, and actually _acted upon_ the '4' and the '5', changing file semantics if they were different, and throwing an exception if either was a non-digit. This actually happened, I'm not making this up! Took me a while to debug.

So do you think he got it wrong? How was he supposed to know he got it wrong? He wrote to the spec and took it at its word. Are you SURE there aren't any Marc formats other than Marc21 out there that actually do use these bytes with their intended meaning, instead of fixing them? How was the Marc4J author supposed to be sure of that, or even guess it might be the case, and know he'd be serving users better by ignoring the spec here instead of following it? What documents, instead of the actual specifications, should he have been looking at to determine that he ought not to be taking those bytes at their word, but just ignoring them?

To realize that we have so much non-conformant data out there that we're better off ignoring these bytes is something you can really only learn through experience -- and something you can then later realize you're wrong on too. I.e.: I _thought_ I was writing only for Marc21, but then it turns out I've got to accept records from Outer Weirdistan that are a kind of legal Marc that actually uses those bytes for their intended meaning -- better go back and fix my entire software stack, involving various proprietary and open source products from multiple sources, each of which has undocumented behavior when it comes to these bytes. Maybe they follow the spec or maybe they follow Kyle's advice, but they don't tell me. This is a mess.

Maybe this scenario is impossible, maybe there ARE NOT and NEVER HAVE BEEN any Marc variants that actually use leader bytes 20-22 in this way -- how can I determine that? I've just got to guess and hope for the best.

The point of specifications in the first place is inter-operability, so we know that if all software and data conforms to the spec, then all software and data will interact in expected ways. Once we start guessing at which parts of the spec we really ought to be ignoring ...

Again, I realize in the actual environment we've got, this is not a luxury we have. But it's a fault, not a benefit, to have lots of software everywhere behaving in non-compliant ways and creating invalid (according to the spec!) data.
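To make the failure mode described above concrete, here is a minimal Python sketch of the two behaviours: a reader that believes leader/20-22 and a reader that assumes the MARC 21 layout (4, 5, 0) regardless. This is an illustration only, not Marc4J's actual code; the names `entry_widths` and `parse_directory`, and the sample directory and leaders, are hypothetical.

```python
# Hypothetical sketch: how directory parsing changes when a reader
# trusts leader/20-22 instead of assuming the MARC 21 values 4, 5, 0.
MARC21_WIDTHS = (4, 5, 0)  # length-of-field, starting position, implementation-defined


def entry_widths(leader: str, trust_leader: bool) -> tuple:
    """Return the three directory-entry widths described by leader/20-22."""
    if not trust_leader:
        return MARC21_WIDTHS
    try:
        return int(leader[20]), int(leader[21]), int(leader[22])
    except ValueError as exc:
        # A strict reader has nothing sensible to do with a non-digit here.
        raise ValueError(f"non-digit entry map in leader: {leader[20:24]!r}") from exc


def parse_directory(directory: str, widths: tuple) -> list:
    """Split a directory string into (tag, field_length, start) tuples."""
    len_w, start_w, impl_w = widths
    entry_size = 3 + len_w + start_w + impl_w
    entries = []
    for i in range(0, len(directory) - entry_size + 1, entry_size):
        entry = directory[i:i + entry_size]
        tag = entry[:3]
        length = int(entry[3:3 + len_w])
        start = int(entry[3 + len_w:3 + len_w + start_w])
        entries.append((tag, length, start))
    return entries


# Two made-up 12-byte directory entries and a leader whose byte 20 is wrong.
directory = "245001200000" + "650000800012"
good_leader = "01812cas" + " " * 12 + "4500"
odd_leader = "01812cas" + " " * 12 + "3500"

tolerant = parse_directory(directory, entry_widths(odd_leader, trust_leader=False))
strict = parse_directory(directory, entry_widths(odd_leader, trust_leader=True))
```

With the mangled leader, the strict reader slices the same directory into 11-byte entries and so reports different tags, lengths and offsets than the tolerant one, which is the kind of silent misparse that takes a couple of days to track down.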
Re: [CODE4LIB] MARC magic for file
> .. Maybe we have different understandings of "valid".
>
> If leader bytes 20-23 are not "4500", I suggest that is _by definition_ not
> a "valid" Marc21 file. It violates the Marc21 specification.
>
> Now, they may still be _usable_, by software that ignores these bytes
> anyway or works around them. We definitely have a lot of software that does
> that.
>
> Which can end up causing problems that remind me of very analogous problems
> caused by the early days of web browsers that felt like being 'tolerant' of
> bad data. "My html works in every web browser BUT this one, why not? Oh,
> because that's the only one that actually followed the standard, oops."

There is some question as to what value there is in validating fields that have no meaning by definition. What benefit does validating an undefined value have other than to create an opportunity to break things and slow the process down just a little? The entire concept of an invalid entry in an undefined field (e.g. byte 23) is oxymoronic.

I'd go so far as to question the value of validating redundant data that theoretically has meaning but which are never supposed to vary. The 4 and the 5 simply repeat what is already known about the structure of the MARC record. Choking on stuff like this is like having a web browser ask you what to do with a page because it lacks a document type declaration.

Garbage data is the reality, so having parsers stop when they encounter data they don't actually need unnecessarily complicates things. That kind of stuff should generate a warning at worst.

kyle
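As a rough illustration of the "warning at worst" position, the following Python sketch checks leader/20-23 and reports a mismatch through the standard `warnings` module instead of raising. The class name `LeaderWarning` and the function `check_entry_map` are made up for this example and do not come from any MARC library.

```python
import warnings


class LeaderWarning(UserWarning):
    """Hypothetical warning category for non-conforming leader bytes."""


def check_entry_map(leader: str) -> bool:
    """Return True when leader/20-23 is '4500'; warn, never raise, otherwise."""
    entry_map = leader[20:24]
    if entry_map == "4500":
        return True
    warnings.warn(
        f"leader/20-23 is {entry_map!r}, expected '4500'; continuing anyway",
        LeaderWarning,
    )
    return False
```

A caller that does want strictness can still turn the warning into an exception with `warnings.simplefilter("error", LeaderWarning)`, so the choice stays with the application rather than the parser.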
Re: [CODE4LIB] MARC magic for file
> Actually -- I'd disagree because that is a very narrow view of the specification. When validating MARC, I'd take the approach to validate structure (which allows you to then read any MARC format) -- then use a separate process for validating content of fields, which in my opinion, is more open to interpretation based on system usage of the data.

Wait, so is there any formal specification of "validity" that you can look at to determine your definition of "validity", or is it just "well, if I can recover it into useful data, using my own algorithms"?

I think we computer programmers are really better-served by reserving the notion of "validity" for things specified by formal specifications -- as we normally do, talking about any other data format. And the only formal specifications I can find for Marc21 say that leader bytes 20-23 should be 4500. (Not true of Marc in general, just Marc21.)

Now it may very well be (is!) true that the library community with Marc has been in the practice of tolerating "working" Marc that is NOT valid according to any specification. So, sure, we may need to write software to take account of that sordid history. But I think it IS a sordid history -- not having a specification to ensure validity makes it VERY hard to write any new software that recognizes what you expect it to recognize, because what you expect it to recognize isn't formally specified anywhere. It's a problem.

We shouldn't try to hide the problem in our discussions by using the word "valid" to mean something different than we use it for any modern data format. "Valid" only has a meaning when you're talking about valid according to some specific specification.
Re: [CODE4LIB] MARC magic for file
I'm honestly not familiar with magic. I can tell you that in MarcEdit, the way the process works is that there is a very generic function that reads the structure of the data without trusting the information in the leader (since I find this data very unreliable). Then, if users want to apply a set of rules to the validation -- I apply those as a secondary process. If you are looking to validate specific content within a record, then what you want to do in this function may be appropriate -- though you'll find some local systems will consistently fail the process.

--tr

From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] On Behalf Of William Denton [w...@pobox.com]
Sent: Wednesday, April 06, 2011 10:29 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] MARC magic for file

On 6 April 2011, Reese, Terry wrote:

> Actually -- I'd disagree because that is a very narrow view of the
> specification. When validating MARC, I'd take the approach to validate
> structure (which allows you to then read any MARC format) -- then use a
> separate process for validating content of fields, which in my opinion,
> is more open to interpretation based on system usage of the data.

What do you think is the best way to recognize MARC files (up to some level of validity, given all the MARC you've seen and parsed) that could be made to work the way magic is defined?

Bill
--
William Denton, Toronto : miskatonic.org www.frbr.org openfrbr.org
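One way to picture a structure-first read that does not trust the leader is sketched below in Python. It is emphatically not MarcEdit's implementation, only a guess at the general idea: records are found by their terminator byte (0x1D) rather than by the leader's record length, and fields by the field terminator (0x1E). The function names are hypothetical, and the 12-byte directory entry size is an assumption that holds for MARC 21 only.

```python
RECORD_TERMINATOR = b"\x1d"
FIELD_TERMINATOR = b"\x1e"


def split_records(data: bytes) -> list:
    """Split a file into records on the record terminator, ignoring leader/00-04."""
    chunks = data.split(RECORD_TERMINATOR)
    return [chunk + RECORD_TERMINATOR for chunk in chunks if chunk.strip()]


def read_structure(record: bytes) -> dict:
    """Return the leader, directory tags and raw fields of one record."""
    body = record.rstrip(RECORD_TERMINATOR)
    leader = body[:24]
    parts = body[24:].split(FIELD_TERMINATOR)
    directory = parts[0]
    fields = parts[1:]
    if fields and fields[-1] == b"":
        fields.pop()  # drop the empty piece after the final field terminator
    # Assumption: 12-byte directory entries (3-byte tag + 4 + 5), as in MARC 21.
    tags = [
        directory[i:i + 3].decode("ascii", "replace")
        for i in range(0, len(directory) - 11, 12)
    ]
    return {"leader": leader, "tags": tags, "fields": fields}
```

Content rules (for example, which values are acceptable in a given leader position) would then run as a second pass over the returned structure, which keeps structural reading usable even when the leader is unreliable.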
Re: [CODE4LIB] MARC magic for file
On 6 April 2011, Reese, Terry wrote: Actually -- I'd disagree because that is a very narrow view of the specification. When validating MARC, I'd take the approach to validate structure (which allows you to then read any MARC format) -- then use a separate process for validating content of fields, which in my opinion, is more open to interpretation based on system usage of the data. What do you think is the best way to recognize MARC files (up to some level of validity, given all the MARC you've seen and parsed) that could be made to work the way magic is defined? Bill -- William Denton, Toronto : miskatonic.org www.frbr.org openfrbr.org
Re: [CODE4LIB] MARC magic for file
Actually -- I'd disagree because that is a very narrow view of the specification. When validating MARC, I'd take the approach to validate structure (which allows you to then read any MARC format) -- then use a separate process for validating content of fields, which in my opinion, is more open to interpretation based on system usage of the data.

For example, 22 and 23 are undefined values that local systems may very well have a practical need to define and use given that there are only so many values in the leader. This is why I sometimes see additional values in the 09 field (which should be a or blank) to define different character set types, or additional elements added to other fields. If I want to validate the content of those fields, I'd validate it through a different process -- but I separate the process from the validation of the structure -- because the two are not exclusive.

--TR

> -----Original Message-----
> From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
> Sent: Wednesday, April 06, 2011 9:59 AM
> To: Code for Libraries
> Cc: Reese, Terry
> Subject: Re: [CODE4LIB] MARC magic for file
>
> I'm not sure what you mean Terry. Maybe we have different understandings
> of "valid".
>
> If leader bytes 20-23 are not "4500", I suggest that is _by definition_ not a
> "valid" Marc21 file. It violates the Marc21 specification.
>
> Now, they may still be _usable_, by software that ignores these bytes
> anyway or works around them. We definitely have a lot of software that
> does that.
>
> Which can end up causing problems that remind me of very analogous
> problems caused by the early days of web browsers that felt like being
> 'tolerant' of bad data. "My html works in every web browser BUT this one,
> why not? Oh, because that's the only one that actually followed the
> standard, oops."
>
> I actually ran into an example of that problem with this exact issue.
> MOST software just ignores marc leader bytes 20-23, and assumes the
> semantics of "4500"---the only legal semantics for Marc21. But Marc4j
> actually _respected_ them, apparently the author thought that some marc in
> the wild might intentionally set different bytes here (no idea if that's true
> or not). So if the leader bytes 20-23 were "invalid"
> (according to the spec), Marc4j would suddenly decide that the "length of
> field portion" was NOT 4, but actually BELIEVE whatever was in leader byte
> 20, causing the record to be parsed improperly. And I had records like that
> coming out of my ILS (not even a vendor database). That was an unfun
> couple days of debugging to figure out what was going on.
>
> On 4/6/2011 12:52 PM, Reese, Terry wrote:
> > Actually, you can have records that are MARC21 coming out of vendor
> > databases (who sometime embed control characters into the leader) and still
> > be valid. Once you stop looking at just your ILS or OCLC, you probably
> > wouldn't be surprised to know that records start looking very different.
> >
> > --TR
> >
> > Terry Reese, Associate Professor
> > Gray Family Chair
> > for Innovative Library Services
> > 121 Valley Libraries
> > Corvallis, Or 97331
> > tel: 541.737.6384
> >
> >> -----Original Message-----
> >> From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf
> >> Of Jonathan Rochkind
> >> Sent: Wednesday, April 06, 2011 9:44 AM
> >> To: CODE4LIB@LISTSERV.ND.EDU
> >> Subject: Re: [CODE4LIB] MARC magic for file
> >>
> >> Can't you have a legal "MARC" file that does NOT have 4500 in those
> >> leader positions? It's just not legal "Marc21", right? Other marc
> >> formats may specify or even allow flexibility in the things these
> >> bytes specify:
> >>
> >> * Length of the length-of-field portion
> >> * Number of characters in the starting-character-position portion of
> >> a Directory entry
> >> * Number of characters in the implementation-defined portion of a
> >> Directory entry
> >>
> >> Or, um, 23, which I guess is left to the specific Marc
> >> implementation (ie, Marc21 is one such) to use for its own purposes.
> >>
> >> I have no idea how that should inform the 'marc magic'.
> >>
> >> Is mime-type application/marc defined as specifically Marc21, or as
> >> any Marc?
> >>
> >> Jonathan
Re: [CODE4LIB] MARC magic for file
Just as a historical note, this non-standard use of LDR/22 is likely due to OCLC's use of the character as a hexadecimal flag from back in the days when marc records were mostly schlepped around on tapes. They referred to it as the "Transaction type code". When records were sent to oclc for processing, various values of the flag indicated whether a catalog card was to be produced, whether the record was an update, whether the user location symbol was to be set, etc. I'm sure others have used it for their own nefarious purposes as well. Tim Prettyman University of Michigan/LIT On Apr 6, 2011, at 12:28 PM, Ford, Kevin wrote: > Well, this brings us right up against the issue of files that adhere to their > specifications versus forgiving applications. Think of browsers and HTML. > Suffice it to say, MARC applications are quite likely to be forgiving of > leader positions 20-23. In my non-conforming MARC file and in Bill's, the > leader positions 20-21 ("45") seemed constant, but things could fall apart > for positions 22-23. So... > > I present the following (in-line and attached, to preserve tabs) in an > attempt to straddle the two sides of this issue: applications forgiving of > non-conforming files. Should the two characters following 45 (at position > 20) *not* be 00, then the identification will be noted as "non-conforming." > We could classify this as reasonable identification but hardly ironclad > (indeed, simply checking to confirm that part of the first 24 positions match > the specification hardly constitutes a robust identification, but it's > something). > > It will also give you a mimetype too, now. > > Would any like testing it out more fully on their own files? 
> > > # > # MARC 21 Magic (Third cut) > > # Set at position 0 > 0 bytex > > # leader position 20-21 must be 45 >> 20 string 45 > > # leader starts with 5 digits, followed by codes specific to MARC format >>> 0 regex/1 (^[0-9]{5})[acdnp][^bhlnqsu-z] MARC Bibliographic > !:mimeapplication/marc >>> 0 regex/1 (^[0-9]{5})[acdnosx][z] MARC Authority > !:mimeapplication/marc >>> 0 regex/1 (^[0-9]{5})[cdn][uvxy] MARC Holdings > !:mimeapplication/marc >>> 0 regex/1 (^[0-9]{5})[acdn][w]MARC Classification > !:mimeapplication/marc >>> 0 regex/1 (^[0-9]{5})[cdn][q] MARC Community > !:mimeapplication/marc > > # leader position 22-23, should be "00" but is it? >>> 0 regex/1 (^.{21})([^0]{2}) (non-conforming) > !:mimeapplication/marc > > > If this works, I'll see about submitting this copy. Thanks to all your > efforts already. > > Warmly, > > Kevin > > -- > Library of Congress > Network Development and MARC Standards Office > > > > > > > From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Simon Spero > [s...@unc.edu] > Sent: Sunday, April 03, 2011 14:01 > To: CODE4LIB@LISTSERV.ND.EDU > Subject: Re: [CODE4LIB] MARC magic for file > > I am pretty sure that the marc4j standard reader ignores them; the tolerant > reader definitely does. Otherwise JHU might have about two parseable records > based on the mangled leaders that J-Rock gets stuck with :-) > > An analysis of the ~7M LC bib records from the scriblio.net data files (~ > Dec 2006) indicated that leader has less than 8 bits of information in it > (shannon-weaver definition). This excludes the initial length value, which > is redundant given the end of record marker. > > > The LC V'GER adds a pseudo tag 000 to it's HTML view of the MARC leader. > The final characters of the leader are "450". > > Also, I object to the phrase "decent MARC tool". Any tool capable of > dealing with MARC as it exists cannot afford the luxury of decency :-) > > [ HA: "A clear conscience?" > BW: "Yes, Sir Humphrey." 
> HA: "When did you acquire this taste for luxuries?"] > > Simon > > On Fri, Apr 1, 2011 at 5:16 AM, Owen Stephens wrote: > >> "I'm sure any decent MARC tool can deal with them, since decent MARC tools >> are certainly going to be forgiving enough to deal with four characters >> that >> apparently don't even really matter." >> >> You say that, but I'm pretty sure Marc4J throws errors MARC records where >> these characters are incorrect >> >> Owen >> >> On Fri, Apr 1, 2011 at 3:51 AM, William Denton wrote: >> &g
Re: [CODE4LIB] MARC magic for file
I'm not sure what you mean Terry. Maybe we have different understandings of "valid".

If leader bytes 20-23 are not "4500", I suggest that is _by definition_ not a "valid" Marc21 file. It violates the Marc21 specification.

Now, they may still be _usable_, by software that ignores these bytes anyway or works around them. We definitely have a lot of software that does that.

Which can end up causing problems that remind me of very analogous problems caused by the early days of web browsers that felt like being 'tolerant' of bad data. "My html works in every web browser BUT this one, why not? Oh, because that's the only one that actually followed the standard, oops."

I actually ran into an example of that problem with this exact issue. MOST software just ignores marc leader bytes 20-23, and assumes the semantics of "4500"---the only legal semantics for Marc21. But Marc4j actually _respected_ them; apparently the author thought that some marc in the wild might intentionally set different bytes here (no idea if that's true or not). So if the leader bytes 20-23 were "invalid" (according to the spec), Marc4j would suddenly decide that the "length of field portion" was NOT 4, but actually BELIEVE whatever was in leader byte 20, causing the record to be parsed improperly. And I had records like that coming out of my ILS (not even a vendor database). That was an unfun couple days of debugging to figure out what was going on.

On 4/6/2011 12:52 PM, Reese, Terry wrote: Actually, you can have records that are MARC21 coming out of vendor databases (who sometime embed control characters into the leader) and still be valid. Once you stop looking at just your ILS or OCLC, you probably wouldn't be surprised to know that records start looking very different.
--TR Terry Reese, Associate Professor Gray Family Chair for Innovative Library Services 121 Valley Libraries Corvallis, Or 97331 tel: 541.737.6384 -Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Jonathan Rochkind Sent: Wednesday, April 06, 2011 9:44 AM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] MARC magic for file Can't you have a legal "MARC" file that does NOT have 4500 in those leader positions? It's just not legal "Marc21", right? Other marc formats may specify or even allow flexibility in the things these bytes specify: * Length of the length-of-field portion * Number of characters in the starting-character-position portion of a Directory entry * Number of characters in the implementation-defined portion of a Directory entry Or, um, 23, which is I guess is left to the specific Marc implementation (ie, Marc21 is one such) to use for it's own purposes. I have no idea how that should inform the 'marc magic'. Is mime-type application/marc defined as specifically Marc21, or as any Marc? Jonathan On 4/6/2011 12:28 PM, Ford, Kevin wrote: Well, this brings us right up against the issue of files that adhere to their specifications versus forgiving applications. Think of browsers and HTML. Suffice it to say, MARC applications are quite likely to be forgiving of leader positions 20-23. In my non-conforming MARC file and in Bill's, the leader positions 20-21 ("45") seemed constant, but things could fall apart for positions 22-23. So... I present the following (in-line and attached, to preserve tabs) in an attempt to straddle the two sides of this issue: applications forgiving of non- conforming files. Should the two characters following 45 (at position 20) *not* be 00, then the identification will be noted as "non-conforming." 
We could classify this as reasonable identification but hardly ironclad (indeed, simply checking to confirm that part of the first 24 positions match the specification hardly constitutes a robust identification, but it's something). It will also give you a mimetype too, now. Would any like testing it out more fully on their own files? # # MARC 21 Magic (Third cut) # Set at position 0 0 bytex # leader position 20-21 must be 45 20 string 45 # leader starts with 5 digits, followed by codes specific to MARC format 0 regex/1 (^[0-9]{5})[acdnp][^bhlnqsu-z] MARC Bibliographic !:mime application/marc 0 regex/1 (^[0-9]{5})[acdnosx][z] MARC Authority !:mime application/marc 0 regex/1 (^[0-9]{5})[cdn][uvxy] MARC Holdings !:mime application/marc 0 regex/1 (^[0-9]{5})[acdn][w]MARC Classification !:mime application/marc 0 regex/1 (^[0-9]{5})[cdn][q] MARC Community !:mime application/marc # leader position 22-23, should be "00" but is it? 0 regex/1 (^.{21})([^0]{2})
Re: [CODE4LIB] MARC magic for file
Actually, you can have records that are MARC21 coming out of vendor databases (who sometime embed control characters into the leader) and still be valid. Once you stop looking at just your ILS or OCLC, you probably wouldn't be surprised to know that records start looking very different. --TR Terry Reese, Associate Professor Gray Family Chair for Innovative Library Services 121 Valley Libraries Corvallis, Or 97331 tel: 541.737.6384 > -Original Message- > From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of > Jonathan Rochkind > Sent: Wednesday, April 06, 2011 9:44 AM > To: CODE4LIB@LISTSERV.ND.EDU > Subject: Re: [CODE4LIB] MARC magic for file > > Can't you have a legal "MARC" file that does NOT have 4500 in those > leader positions? It's just not legal "Marc21", right? Other marc > formats may specify or even allow flexibility in the things these bytes > specify: > > * Length of the length-of-field portion > * Number of characters in the starting-character-position portion of a > Directory entry > * Number of characters in the implementation-defined portion of a Directory > entry > > Or, um, 23, which is I guess is left to the specific Marc implementation (ie, > Marc21 is one such) to use for it's own purposes. > > I have no idea how that should inform the 'marc magic'. > > Is mime-type application/marc defined as specifically Marc21, or as any > Marc? > > Jonathan > > On 4/6/2011 12:28 PM, Ford, Kevin wrote: > > Well, this brings us right up against the issue of files that adhere to > > their > specifications versus forgiving applications. Think of browsers and HTML. > Suffice it to say, MARC applications are quite likely to be forgiving of > leader > positions 20-23. In my non-conforming MARC file and in Bill's, the leader > positions 20-21 ("45") seemed constant, but things could fall apart for > positions 22-23. So... 
> > > > I present the following (in-line and attached, to preserve tabs) in an > attempt to straddle the two sides of this issue: applications forgiving of > non- > conforming files. Should the two characters following 45 (at position 20) > *not* be 00, then the identification will be noted as "non-conforming." We > could classify this as reasonable identification but hardly ironclad (indeed, > simply checking to confirm that part of the first 24 positions match the > specification hardly constitutes a robust identification, but it's something). > > > > It will also give you a mimetype too, now. > > > > Would any like testing it out more fully on their own files? > > > > > > # > > # MARC 21 Magic (Third cut) > > > > # Set at position 0 > > 0 bytex > > > > # leader position 20-21 must be 45 > >> 20 string 45 > > # leader starts with 5 digits, followed by codes specific to MARC > > format > >>> 0 regex/1 (^[0-9]{5})[acdnp][^bhlnqsu-z] MARC Bibliographic > > !:mime application/marc > >>> 0 regex/1 (^[0-9]{5})[acdnosx][z] MARC Authority > > !:mime application/marc > >>> 0 regex/1 (^[0-9]{5})[cdn][uvxy] MARC Holdings > > !:mime application/marc > >>> 0 regex/1 (^[0-9]{5})[acdn][w]MARC Classification > > !:mime application/marc > >>> 0 regex/1 (^[0-9]{5})[cdn][q] MARC Community > > !:mime application/marc > > > > # leader position 22-23, should be "00" but is it? > >>> 0 regex/1 (^.{21})([^0]{2}) (non-conforming) > > !:mime application/marc > > > > > > If this works, I'll see about submitting this copy. Thanks to all your > > efforts > already. 
> > > > Warmly, > > > > Kevin > > > > -- > > Library of Congress > > Network Development and MARC Standards Office > > > > > > > > > > > > > > From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] On Behalf Of > Simon > > Spero [s...@unc.edu] > > Sent: Sunday, April 03, 2011 14:01 > > To: CODE4LIB@LISTSERV.ND.EDU > > Subject: Re: [CODE4LIB] MARC magic for file > > > > I am pretty sure that the marc4j standard reader ignores them; the > > tolerant reader definitely does. Otherwise JHU might have about two > > parseable records based on the mangled leaders that J-Rock gets stuck > > with :-) > > > > An analysis of the ~7M LC bib records from the scriblio.net
Re: [CODE4LIB] MARC magic for file
Can't you have a legal "MARC" file that does NOT have 4500 in those leader positions? It's just not legal "Marc21", right? Other marc formats may specify or even allow flexibility in the things these bytes specify: * Length of the length-of-field portion * Number of characters in the starting-character-position portion of a Directory entry * Number of characters in the implementation-defined portion of a Directory entry Or, um, 23, which is I guess is left to the specific Marc implementation (ie, Marc21 is one such) to use for it's own purposes. I have no idea how that should inform the 'marc magic'. Is mime-type application/marc defined as specifically Marc21, or as any Marc? Jonathan On 4/6/2011 12:28 PM, Ford, Kevin wrote: Well, this brings us right up against the issue of files that adhere to their specifications versus forgiving applications. Think of browsers and HTML. Suffice it to say, MARC applications are quite likely to be forgiving of leader positions 20-23. In my non-conforming MARC file and in Bill's, the leader positions 20-21 ("45") seemed constant, but things could fall apart for positions 22-23. So... I present the following (in-line and attached, to preserve tabs) in an attempt to straddle the two sides of this issue: applications forgiving of non-conforming files. Should the two characters following 45 (at position 20) *not* be 00, then the identification will be noted as "non-conforming." We could classify this as reasonable identification but hardly ironclad (indeed, simply checking to confirm that part of the first 24 positions match the specification hardly constitutes a robust identification, but it's something). It will also give you a mimetype too, now. Would any like testing it out more fully on their own files? 
# # MARC 21 Magic (Third cut) # Set at position 0 0 bytex # leader position 20-21 must be 45 20 string 45 # leader starts with 5 digits, followed by codes specific to MARC format 0 regex/1 (^[0-9]{5})[acdnp][^bhlnqsu-z] MARC Bibliographic !:mime application/marc 0 regex/1 (^[0-9]{5})[acdnosx][z] MARC Authority !:mime application/marc 0 regex/1 (^[0-9]{5})[cdn][uvxy] MARC Holdings !:mime application/marc 0 regex/1 (^[0-9]{5})[acdn][w]MARC Classification !:mime application/marc 0 regex/1 (^[0-9]{5})[cdn][q] MARC Community !:mime application/marc # leader position 22-23, should be "00" but is it? 0 regex/1 (^.{21})([^0]{2}) (non-conforming) !:mime application/marc If this works, I'll see about submitting this copy. Thanks to all your efforts already. Warmly, Kevin -- Library of Congress Network Development and MARC Standards Office From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Simon Spero [s...@unc.edu] Sent: Sunday, April 03, 2011 14:01 To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] MARC magic for file I am pretty sure that the marc4j standard reader ignores them; the tolerant reader definitely does. Otherwise JHU might have about two parseable records based on the mangled leaders that J-Rock gets stuck with :-) An analysis of the ~7M LC bib records from the scriblio.net data files (~ Dec 2006) indicated that leader has less than 8 bits of information in it (shannon-weaver definition). This excludes the initial length value, which is redundant given the end of record marker. The LC V'GER adds a pseudo tag 000 to it's HTML view of the MARC leader. The final characters of the leader are "450". Also, I object to the phrase "decent MARC tool". Any tool capable of dealing with MARC as it exists cannot afford the luxury of decency :-) [ HA: "A clear conscience?" BW: "Yes, Sir Humphrey." 
HA: "When did you acquire this taste for luxuries?"] Simon On Fri, Apr 1, 2011 at 5:16 AM, Owen Stephens wrote: "I'm sure any decent MARC tool can deal with them, since decent MARC tools are certainly going to be forgiving enough to deal with four characters that apparently don't even really matter." You say that, but I'm pretty sure Marc4J throws errors MARC records where these characters are incorrect Owen On Fri, Apr 1, 2011 at 3:51 AM, William Denton wrote: On 28 March 2011, Ford, Kevin wrote: I couldn't get Simon's MARC 21 Magic file to work. Among other issues, I received "line too long" errors. But, since I've been curious about this for sometime, I figured I'd take a whack at it myself. Try this: This is very nice! Thanks. I tried it on a bunch of MARC files I have, and it recognized almost all of them. A few it didn't, so I had a closer look, and they're invalid. For
Re: [CODE4LIB] MARC magic for file
Well, this brings us right up against the issue of files that adhere to their specifications versus forgiving applications. Think of browsers and HTML. Suffice it to say, MARC applications are quite likely to be forgiving of leader positions 20-23. In my non-conforming MARC file and in Bill's, the leader positions 20-21 ("45") seemed constant, but things could fall apart for positions 22-23. So...

I present the following (in-line and attached, to preserve tabs) in an attempt to straddle the two sides of this issue: applications forgiving of non-conforming files. Should the two characters following 45 (at position 20) *not* be 00, then the identification will be noted as "non-conforming." We could classify this as reasonable identification but hardly ironclad (indeed, simply checking to confirm that part of the first 24 positions match the specification hardly constitutes a robust identification, but it's something).

It will also give you a mimetype too, now.

Would anyone like to test it out more fully on their own files?

#
# MARC 21 Magic (Third cut)

# Set at position 0
0     byte     x

# leader position 20-21 must be 45
>20   string   45

# leader starts with 5 digits, followed by codes specific to MARC format
>>0   regex/1  (^[0-9]{5})[acdnp][^bhlnqsu-z]  MARC Bibliographic
!:mime application/marc
>>0   regex/1  (^[0-9]{5})[acdnosx][z]         MARC Authority
!:mime application/marc
>>0   regex/1  (^[0-9]{5})[cdn][uvxy]          MARC Holdings
!:mime application/marc
>>0   regex/1  (^[0-9]{5})[acdn][w]            MARC Classification
!:mime application/marc
>>0   regex/1  (^[0-9]{5})[cdn][q]             MARC Community
!:mime application/marc

# leader position 22-23, should be "00" but is it?
>>0   regex/1  (^.{21})([^0]{2})               (non-conforming)
!:mime application/marc

If this works, I'll see about submitting this copy. Thanks to all your efforts already.
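For anyone who wants to try the same leader tests without `file`, the rules above can be approximated in a few lines of Python. This is only a sketch under stated assumptions: the function name `identify` and the table `LEADER_RULES` are made up here, the character classes are copied from the magic rules, and the positions 22-23 check compares `leader[22:24]` with "00" directly rather than reproducing the regex.

```python
import re

# Character classes copied from the magic rules: leader/05 (record status)
# followed by leader/06 (type of record), after the 5-digit record length.
LEADER_RULES = [
    ("MARC Bibliographic", re.compile(rb"[0-9]{5}[acdnp][^bhlnqsu-z]")),
    ("MARC Authority", re.compile(rb"[0-9]{5}[acdnosx]z")),
    ("MARC Holdings", re.compile(rb"[0-9]{5}[cdn][uvxy]")),
    ("MARC Classification", re.compile(rb"[0-9]{5}[acdn]w")),
    ("MARC Community", re.compile(rb"[0-9]{5}[cdn]q")),
]


def identify(data: bytes) -> str:
    """Guess the MARC record type from the first 24 bytes, or return 'data'."""
    leader = data[:24]
    # Leader positions 20-21 must be "45", as in the magic rules.
    if len(leader) < 24 or leader[20:22] != b"45":
        return "data"
    for label, pattern in LEADER_RULES:
        if pattern.match(leader):
            # Positions 22-23 should be "00"; flag the record if they are not.
            if leader[22:24] != b"00":
                return label + " (non-conforming)"
            return label
    return "data"
```

Calling `identify(open(path, "rb").read(24))` on a file only inspects the first record's leader, so like the magic rules it is a reasonable identification rather than a validation of the whole file.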
Warmly,
Kevin

--
Library of Congress
Network Development and MARC Standards Office

From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Simon Spero [s...@unc.edu]
Sent: Sunday, April 03, 2011 14:01
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] MARC magic for file

[...]
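The "Third cut" rules can be read as a small decision procedure over the first 24 bytes of a file. The following is only a sketch of that reading in Python, not how file itself evaluates magic entries. The names classify_leader and _RULES are made up for illustration. One assumption to flag: the "non-conforming" check here follows the prose description (positions 22-23 differ from "00") rather than transliterating the last regex.

```python
import re
from typing import Optional

# Leader patterns copied from the "Third cut" magic rules.
# The second character classes are mutually exclusive, so at most one matches.
_RULES = [
    (re.compile(rb"^[0-9]{5}[acdnp][^bhlnqsu-z]"), "MARC Bibliographic"),
    (re.compile(rb"^[0-9]{5}[acdnosx]z"), "MARC Authority"),
    (re.compile(rb"^[0-9]{5}[cdn][uvxy]"), "MARC Holdings"),
    (re.compile(rb"^[0-9]{5}[acdn]w"), "MARC Classification"),
    (re.compile(rb"^[0-9]{5}[cdn]q"), "MARC Community"),
]


def classify_leader(data: bytes) -> Optional[str]:
    """Return a label for the MARC leader at the start of data, or None."""
    leader = data[:24]
    # Leader positions 20-21 must be "45".
    if len(leader) < 24 or leader[20:22] != b"45":
        return None
    for pattern, label in _RULES:
        if pattern.match(leader):
            # Assumption: flag the record when positions 22-23 are not "00".
            if leader[22:24] != b"00":
                label += " (non-conforming)"
            return label
    return None
```

A caller would pass the first bytes of a file, for example classify_leader(open(path, "rb").read(24)), and treat None as "not recognized".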
Re: [CODE4LIB] MARC magic for file
I am pretty sure that the marc4j standard reader ignores them; the tolerant reader definitely does. Otherwise JHU might have about two parseable records based on the mangled leaders that J-Rock gets stuck with :-)

An analysis of the ~7M LC bib records from the scriblio.net data files (~ Dec 2006) indicated that the leader has less than 8 bits of information in it (Shannon-Weaver definition). This excludes the initial length value, which is redundant given the end-of-record marker.

The LC V'GER adds a pseudo tag 000 to its HTML view of the MARC leader. The final characters of the leader are "450".

Also, I object to the phrase "decent MARC tool". Any tool capable of dealing with MARC as it exists cannot afford the luxury of decency :-)

[ HA: "A clear conscience?"
BW: "Yes, Sir Humphrey."
HA: "When did you acquire this taste for luxuries?"]

Simon

On Fri, Apr 1, 2011 at 5:16 AM, Owen Stephens wrote:
> "I'm sure any decent MARC tool can deal with them, since decent MARC tools
> are certainly going to be forgiving enough to deal with four characters
> that apparently don't even really matter."
>
> You say that, but I'm pretty sure Marc4J throws errors on MARC records
> where these characters are incorrect
>
> Owen
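For anyone wanting to try a similar measurement on their own records, here is a hedged sketch of one way an "information in the leader" figure could be estimated. It is not the method used for the scriblio.net analysis, which the message does not describe. The assumption here is that bytes 5-23 of each leader (everything after the length field) are treated as a single symbol and the empirical Shannon entropy is taken over those symbols. The name leader_entropy_bits is made up.

```python
import math
from collections import Counter
from typing import Iterable


def leader_entropy_bits(leaders: Iterable[bytes]) -> float:
    """Empirical Shannon entropy, in bits, of leader bytes 5-23.

    The first five bytes (the record length) are skipped, matching the
    remark that the length is redundant given the end-of-record marker.
    """
    counts = Counter(bytes(leader[5:24]) for leader in leaders)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

If every record carried the same leader tail the result would be zero bits; two equally common tails would give one bit.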
Re: [CODE4LIB] MARC magic for file
"I'm sure any decent MARC tool can deal with them, since decent MARC tools are certainly going to be forgiving enough to deal with four characters that apparently don't even really matter."

You say that, but I'm pretty sure Marc4J throws errors on MARC records where these characters are incorrect.

Owen

On Fri, Apr 1, 2011 at 3:51 AM, William Denton wrote:
> [...]

--
Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com
Re: [CODE4LIB] MARC magic for file
On 28 March 2011, Ford, Kevin wrote:

> I couldn't get Simon's MARC 21 Magic file to work. Among other issues, I
> received "line too long" errors. But, since I've been curious about this
> for some time, I figured I'd take a whack at it myself. Try this:

This is very nice! Thanks. I tried it on a bunch of MARC files I have, and it recognized almost all of them. A few it didn't, so I had a closer look, and they're invalid.

For example, the Internet Archive's Binghamton catalogue dump:

http://ia600307.us.archive.org/6/items/marc_binghamton_univ/

$ file -m marc.magic bgm*mrc
bgm_openlib_final_0-5.mrc: data
bgm_openlib_final_10-15.mrc: MARC Bibliographic
bgm_openlib_final_15-18.mrc: data
bgm_openlib_final_5-10.mrc: MARC Bibliographic

But why? Aha:

$ head -c 25 bgm_openlib_final_*mrc
==> bgm_openlib_final_0-5.mrc <==
01812cas 2200457 45x00
==> bgm_openlib_final_10-15.mrc <==
01008nam 2200289ua 45000
==> bgm_openlib_final_15-18.mrc <==
01614cam00385 45 0
==> bgm_openlib_final_5-10.mrc <==
00887nam 2200265v 45000

As you say, the leader should end with 4500 (as defined at http://www.loc.gov/marc/authority/adleader.html) but two of those files don't. So they're not valid MARC. I'm sure any decent MARC tool can deal with them, since decent MARC tools are certainly going to be forgiving enough to deal with four characters that apparently don't even really matter.

So on the one hand they're usable MARC but file wouldn't say so, and on the other that's a good indication that the files have failed a basic validity test. I wonder if there are similar situations for JPEGs or MP3s.

I think you should definitely submit this for inclusion in the magic file. It would be very useful for us all!

Bill

P.S. I'd never used head -c (to show a fixed number of bytes) before. Always nice to find a new useful option to an old command.

> #
> # MARC 21 Magic (Second cut)
>
> # Set at position 0
> 0 short >0x
>
> # leader ends with 4500
> >20 string 4500
>
> # leader starts with 5 digits, followed by codes specific to MARC format
> >>0 regex/1 (^[0-9]{5})[acdnp][^bhlnqsu-z] MARC Bibliographic
> >>0 regex/1 (^[0-9]{5})[acdnosx][z] MARC Authority
> >>0 regex/1 (^[0-9]{5})[cdn][uvxy] MARC Holdings
> >>0 regex/1 (^[0-9]{5})[acdn][w] MARC Classification
> >>0 regex/1 (^[0-9]{5})[cdn][q] MARC Community

--
William Denton, Toronto : miskatonic.org www.frbr.org openfrbr.org
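The head -c check in the message above can also be scripted, which may be handy when there are many files to look at. This is a minimal sketch, assuming the file starts with a MARC record; read_leader and leader_ends_with_4500 are hypothetical helper names, not part of any MARC library.

```python
def read_leader(path: str) -> bytes:
    """Return the first 24 bytes of a file, the span a MARC leader occupies."""
    with open(path, "rb") as handle:
        return handle.read(24)


def leader_ends_with_4500(leader: bytes) -> bool:
    """True when the leader is 24 bytes long and positions 20-23 are "4500"."""
    return len(leader) == 24 and leader[20:24] == b"4500"
```

Looping over a directory and printing each file name alongside leader_ends_with_4500(read_leader(name)) would give a quick list of the files that fail this basic check.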
Re: [CODE4LIB] MARC magic for file
I couldn't get Simon's MARC 21 Magic file to work. Among other issues, I received "line too long" errors. But, since I've been curious about this for some time, I figured I'd take a whack at it myself. Try this:

#
# MARC 21 Magic (Second cut)

# Set at position 0
0 short >0x

# leader ends with 4500
>20 string 4500

# leader starts with 5 digits, followed by codes specific to MARC format
>>0 regex/1 (^[0-9]{5})[acdnp][^bhlnqsu-z] MARC Bibliographic
>>0 regex/1 (^[0-9]{5})[acdnosx][z] MARC Authority
>>0 regex/1 (^[0-9]{5})[cdn][uvxy] MARC Holdings
>>0 regex/1 (^[0-9]{5})[acdn][w] MARC Classification
>>0 regex/1 (^[0-9]{5})[cdn][q] MARC Community

I've also attached it to this email to preserve the tabs.

In any event, I can confirm it works on MARC Bib, MARC Authority, and MARC Classification files I have bumping around my computer. I've not tested it on MARC Holdings and MARC Community.

Do let us/me know if it works for you (and the community generally). I can see about submitting it for formal inclusion in the magic file.

Warmly,
Kevin

--
Library of Congress
Network Development and MARC Standards Office

From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Simon Spero [s...@unc.edu]
Sent: Thursday, March 24, 2011 12:28
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] MARC magic for file

[...]
Re: [CODE4LIB] MARC magic for file
Some of the problems in your first cut are:

1. Offsets for regex are given in terms of lines. MARC files don't have newlines in them, unless you're Millennium, in which case they can be inserted every 200,000 bytes to keep things interesting.
2. Byte matches match byte values, so "20 byte 4" is looking for the binary value, not the ASCII digit.
3. Sometimes you need to prime the buffer before you can do a regexp match.

Is this good enough?

# MARC 21 Magic (First cut)

# indicator count must be "2"
10 string 2

# leader must end in "4500"
>20 string 4500

# leader must start with five digits, a record status, and a record type
>0 regex ^([0-9]{5})[acdnp][acdefgijkmoprt][abcims] MARC Bibliographic
>0 regex ^([0-9]{5})[acdnp][z] MARC Authority

Simon

On Wed, Mar 23, 2011 at 8:09 PM, William Denton wrote:
> [...]
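Point 2 above is easy to see outside of file. The sketch below only models the distinction described there (a byte test compares against a numeric value, a string test compares against characters); it is not how file is implemented, and byte_test and string_test are made-up names. The sample leader is invented for illustration.

```python
# A made-up but well-formed 24-byte leader ending in "4500".
LEADER = b"01008nam  2200289ua 4500"


def byte_test(data: bytes, offset: int, value: int) -> bool:
    """Compare the numeric value of one byte, as a "byte" test would."""
    return data[offset] == value


def string_test(data: bytes, offset: int, text: bytes) -> bool:
    """Compare a run of characters, as a "string" test would."""
    return data[offset:offset + len(text)] == text
```

With these, byte_test(LEADER, 20, 4) is false because the byte at offset 20 is the ASCII digit "4" (value 0x34), while string_test(LEADER, 20, b"4500") is true. That is why the later cuts use "20 string 4500" rather than four byte tests.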
[CODE4LIB] MARC magic for file
Has anyone figured out the magic necessary for file to recognize MARC files?

If you don't know it, file is a Unix command that tells you what kind of file a file is. For example:

$ file 101015_001.mp3
101015_001.mp3: Audio file with ID3 version 2.3.0, contains: MPEG ADTS, layer III, v1, 192 kbps, 44.1 kHz, Stereo

$ file P126.jpg
P126.jpg: JPEG image data, EXIF standard, comment: "AppleMark"

It's a really useful command. I assume it's on OSX, but I don't know. You can get it for Windows with Cygwin.

The problem is, file doesn't grok MARC:

$ file catalog.01.mrc
catalog.01.mrc: data

I took a stab at getting the magic defined, but it didn't work. I'll include what I used below. You can put it into a magic.txt file, and then use

file -m magic.txt some_file.mrc

to test it. It'll tell you the file is MARC Bibliographic ... but it also thinks that PDFs, JPEGs, and text files are MARC. That's no good.

It'd be great if the MARC magic got into the central magic database so everyone would be able to recognize various MARC file types.

Bill

# --- clip'n'test
# MARC 21 for Bibliographic Data
# http://www.loc.gov/marc/bibliographic/bdleader.html
#
# This doesn't work properly

0 string x
>5 regex [acdnp]
>6 regex [acdefgijkmoprt]
>7 regex [abcims]
>8 regex [\ a]
>9 regex [\ a]
>10 byte x
>11 byte x
>12 string x
>17 regex [\ 12345678uz]
>18 regex [\ aciu]
>19 regex [\ abc] MARC Bibliographic
#>20 byte 4
#>21 byte 5
#>22 byte 0
#>23 byte 0 MARC Bibliographic

# --- end clip'n'test

--
William Denton, Toronto : miskatonic.org www.frbr.org openfrbr.org