Re: [CODE4LIB] MARC Magic for file

2012-05-24 Thread Ed Summers
On Wed, May 23, 2012 at 6:16 PM, Kyle Banerjee
 wrote:
> I'm not sure whether to laugh or cry that it's a sign of progress that a 40
> year old utility designed to identify file types is now just beginning to
> be able to recognize a format that's been around for almost 50 years...

Laugh :-)

//Ed


Re: [CODE4LIB] MARC Magic for file

2012-05-23 Thread Kyle Banerjee
On Wed, May 23, 2012 at 12:14 PM, Ford, Kevin  wrote:

> I finally had occasion today (read: remembered) to see if the *nix "file"
> command would recognize a MARC record file.  I haven't tested extensively,
> but it did identify the file as MARC21 Bibliographic record.  It also
> correctly identified a MARC21 Authority Record.  I'm running the most
> recent version of Ubuntu (12.04 - precise pangolin).
>

I'm not sure whether to laugh or cry that it's a sign of progress that a 40
year old utility designed to identify file types is now just beginning to
be able to recognize a format that's been around for almost 50 years...

kyle
-- 
--
Kyle Banerjee
Digital Services Program Manager
Orbis Cascade Alliance
baner...@orbiscascade.org / 503.999.9787


Re: [CODE4LIB] MARC Magic for file

2012-05-23 Thread Simon Spero
The magic file format changed between versions; I think the
OSX version was not compatible with more up-to-date versions (in the
original thread, this caused me some confusion).

Simon

On Wed, May 23, 2012 at 4:34 PM, Ross Singer  wrote:

> On May 23, 2012, at 4:22 PM, Kevin Ford wrote:
>
> > Don't know what to say.  Crawling through the source for "file" at [1],
> the pattern matching code was in place as of Sept 2011.  It could be present
> earlier than Sept 2011, but I stopped hunting for it.  The earliest it
> would have made its way into the magic db would have been April 2011.
> >
> > Perhaps OpenBSD is using some custom branch of "file", haven't updated
> the db, etc.
>
> As Stuart pointed out, some implementations are slow to update the db.
>  OSX, for example, also just says "data" (hence my question on the output).
>
> -Ross.
> >
> > Yours,
> >
> > Kevin
> >
> >
> >
> > On 05/23/2012 03:36 PM, Francis Kayiwa wrote:
> >> On Wed, May 23, 2012 at 03:28:56PM -0400, Ross Singer wrote:
> >>> Wow, this is pretty cool.
> >>>
> >>> Kevin, do you have examples of the output?
> >>>
> >>> Does it work for bulk files?
> >>>
> >>> I mean, I could just try this on my Ubuntu machine, but it's all the
> way downstairs...
> >>
> >> My OS lists it as `data`
> >>
> >> $ cd
> >> $ ls
> >> devid_rsa.pub laflin marc   orthancssh
> >> updating
> >> $ ftp http://drupal.org/files/issues/5_records_utf8.mrc_.txt
> >> Trying 140.211.166.6...
> >> Requesting http://drupal.org/files/issues/5_records_utf8.mrc_.txt
> >> 100%
> >>
> |**|
> >> 5965   00:00
> >> 5965 bytes received in 0.00 seconds (1.56 MB/s)
> >> $ ls
> >> 5_records_utf8.mrc_.txt  id_rsa.pub   marc
> >> ssh
> >> dev  laflin   orthanc
> >> updating
> >> $ mkdir test
> >> $ mv 5_records_utf8.mrc_.txt test/
> >> $ cd test/
> >> $ mv 5_records_utf8.mrc_.txt 5_records_utf8.mrc
> >> $ ls
> >> 5_records_utf8.mrc
> >> $ file 5_records_utf8.mrc
> >> 5_records_utf8.mrc: data
> >> $ ls
> >> 5_records_utf8.mrc
> >> $ ls -al
> >> total 32
> >> drwxr-xr-x   2 kayiwa  kayiwa   512 May 23 14:34 .
> >> drwxr-xr-x  10 kayiwa  kayiwa   512 May 23 14:34 ..
> >> -rw-r--r--   1 kayiwa  kayiwa  5965 May 23 14:33 5_records_utf8.mrc
> >> $ uname -a
> >> OpenBSD orthanc.lib.uic.edu 5.1 GENERIC.MP#256 i386
> >>
> >> ./fxk
> >>
> >>>
> >>> -Ross.
> >>>
> >>> On May 23, 2012, at 3:14 PM, Ford, Kevin wrote:
> >>>
> >>>> I finally had occasion today (read: remembered) to see if the *nix
> "file" command would recognize a MARC record file.  I haven't tested
> extensively, but it did identify the file as MARC21 Bibliographic record.
> >>>> It also correctly identified a MARC21 Authority Record.  I'm running the
> most recent version of Ubuntu (12.04 - precise pangolin).
> >>>>
> >>>> I write because the inclusion of a "file" MARC21 specification rule
> in the magic.db stems from a Code4lib exchange that started in March 2011
> [1] (it ends in April if you want to go crawling for the entire thread).
> >>>>
> >>>> Rgds,
> >>>>
> >>>> Kevin
> >>>>
> >>>> [1]
> https://listserv.nd.edu/cgi-bin/wa?A2=ind1103&L=CODE4LIB&T=0&F=&S=&P=112728
> >>>>
> >>>> --
> >>>> Kevin Ford
> >>>> Network Development and MARC Standards Office
> >>>> Library of Congress
> >>>> Washington, DC
> >>>
> >>
>


Re: [CODE4LIB] MARC Magic for file

2012-05-23 Thread Francis Kayiwa
On Wed, May 23, 2012 at 04:34:47PM -0400, Ross Singer wrote:
> On May 23, 2012, at 4:22 PM, Kevin Ford wrote:
> 
> > Don't know what to say.  Crawling through the source for "file" at [1], the 
> > pattern matching code was in place as of Sept 2011.  It could be present 
> > earlier than Sept 2011, but I stopped hunting for it.  The earliest it 
> > would have made its way into the magic db would have been April 2011.
> > 
> > Perhaps OpenBSD is using some custom branch of "file", haven't updated the 
> > db, etc.
> 
> As Stuart pointed out, some implementations are slow to update the db.  OSX, 
> for example, also just says "data" (hence my question on the output).


adding FreeBSD's magicfile from this commit on a users $HOME

http://lists.freebsd.org/pipermail/svn-src-vendor/2011-October/000851.html

For my next trick I will try to remember that I need to do that.

./fxk




-- 
If builders built buildings the way programmers wrote programs,
then the first woodpecker to come along would destroy civilization.


Re: [CODE4LIB] MARC Magic for file

2012-05-23 Thread Ross Singer
On May 23, 2012, at 4:22 PM, Kevin Ford wrote:

> Don't know what to say.  Crawling through the source for "file" at [1], the 
> pattern matching code was in place as of Sept 2011.  It could be present 
> earlier than Sept 2011, but I stopped hunting for it.  The earliest it would 
> have made its way into the magic db would have been April 2011.
> 
> Perhaps OpenBSD is using some custom branch of "file", haven't updated the 
> db, etc.

As Stuart pointed out, some implementations are slow to update the db.  OSX, 
for example, also just says "data" (hence my question on the output).

-Ross.
> 
> Yours,
> 
> Kevin
> 
> 
> 
> On 05/23/2012 03:36 PM, Francis Kayiwa wrote:
>> On Wed, May 23, 2012 at 03:28:56PM -0400, Ross Singer wrote:
>>> Wow, this is pretty cool.
>>> 
>>> Kevin, do you have examples of the output?
>>> 
>>> Does it work for bulk files?
>>> 
>>> I mean, I could just try this on my Ubuntu machine, but it's all the way 
>>> downstairs...
>> 
>> My OS lists it as `data`
>> 
>> $ cd
>> $ ls
>> devid_rsa.pub laflin marc   orthancssh
>> updating
>> $ ftp http://drupal.org/files/issues/5_records_utf8.mrc_.txt
>> Trying 140.211.166.6...
>> Requesting http://drupal.org/files/issues/5_records_utf8.mrc_.txt
>> 100%
>> |**|
>> 5965   00:00
>> 5965 bytes received in 0.00 seconds (1.56 MB/s)
>> $ ls
>> 5_records_utf8.mrc_.txt  id_rsa.pub   marc
>> ssh
>> dev  laflin   orthanc
>> updating
>> $ mkdir test
>> $ mv 5_records_utf8.mrc_.txt test/
>> $ cd test/
>> $ mv 5_records_utf8.mrc_.txt 5_records_utf8.mrc
>> $ ls
>> 5_records_utf8.mrc
>> $ file 5_records_utf8.mrc
>> 5_records_utf8.mrc: data
>> $ ls
>> 5_records_utf8.mrc
>> $ ls -al
>> total 32
>> drwxr-xr-x   2 kayiwa  kayiwa   512 May 23 14:34 .
>> drwxr-xr-x  10 kayiwa  kayiwa   512 May 23 14:34 ..
>> -rw-r--r--   1 kayiwa  kayiwa  5965 May 23 14:33 5_records_utf8.mrc
>> $ uname -a
>> OpenBSD orthanc.lib.uic.edu 5.1 GENERIC.MP#256 i386
>> 
>> ./fxk
>> 
>>> 
>>> -Ross.
>>> 
>>> On May 23, 2012, at 3:14 PM, Ford, Kevin wrote:
>>> 
>>>> I finally had occasion today (read: remembered) to see if the *nix "file"
>>>> command would recognize a MARC record file.  I haven't tested extensively,
>>>> but it did identify the file as MARC21 Bibliographic record.  It also
>>>> correctly identified a MARC21 Authority Record.  I'm running the most
>>>> recent version of Ubuntu (12.04 - precise pangolin).
>>>>
>>>> I write because the inclusion of a "file" MARC21 specification rule in the
>>>> magic.db stems from a Code4lib exchange that started in March 2011 [1] (it
>>>> ends in April if you want to go crawling for the entire thread).
>>>>
>>>> Rgds,
>>>>
>>>> Kevin
>>>>
>>>> [1]
>>>> https://listserv.nd.edu/cgi-bin/wa?A2=ind1103&L=CODE4LIB&T=0&F=&S=&P=112728
>>>>
>>>> --
>>>> Kevin Ford
>>>> Network Development and MARC Standards Office
>>>> Library of Congress
>>>> Washington, DC
>>> 
>> 


Re: [CODE4LIB] MARC Magic for file

2012-05-23 Thread Kevin Ford

> It failed on a file containing all of LC Classification.  I need to
> figure out why.
-- To reply to myself: Having looked at the "file" db pattern source 
[1], I see that the "file" maintainer introduced a typo into the 
matching pattern for correctly identifying Classification records. 
That's why it's failing for Class records.


Over and out,

Kevin

[1] ftp://ftp.astron.com/pub/file/


On 05/23/2012 03:48 PM, Ford, Kevin wrote:

Does it work for bulk files?

-- It passed on a file containing 215 MARC Bibs and on a file containing 2,574 MARC Auth records.  
Don't know if you consider these "bulk," but there is more than 1 record in each file 
(caveat: "file" stops after evaluating the first line, so of the 2,574 Auth records, the 
last 2,573 could be invalid).  It failed on a file containing all of LC Classification.  I need to 
figure out why.


Kevin, do you have examples of the output?

-- I received "MARC21 Bibliography" and "MARC21 Authority" respectively.  In theory, if Leader 
20-23 are not "4500" then "(non-conforming)" should be appended to the identification.  If 
requested, the mimetype - application/marc - should also be outputted.

Rgds,

Kevin





-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
Ross Singer
Sent: Wednesday, May 23, 2012 3:29 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] MARC Magic for file

Wow, this is pretty cool.

Kevin, do you have examples of the output?

Does it work for bulk files?

I mean, I could just try this on my Ubuntu machine, but it's all the
way downstairs...

-Ross.

On May 23, 2012, at 3:14 PM, Ford, Kevin wrote:


I finally had occasion today (read: remembered) to see if the *nix
"file" command would recognize a MARC record file.  I haven't tested
extensively, but it did identify the file as MARC21 Bibliographic
record.  It also correctly identified a MARC21 Authority Record.  I'm
running the most recent version of Ubuntu (12.04 - precise pangolin).

I write because the inclusion of a "file" MARC21 specification rule
in the magic.db stems from a Code4lib exchange that started in March
2011 [1] (it ends in April if you want to go crawling for the entire
thread).


Rgds,

Kevin

[1]
https://listserv.nd.edu/cgi-bin/wa?A2=ind1103&L=CODE4LIB&T=0&F=&S=&P=112728

--
Kevin Ford
Network Development and MARC Standards Office Library of Congress
Washington, DC


Re: [CODE4LIB] MARC Magic for file

2012-05-23 Thread Kevin Ford
Don't know what to say.  Crawling through the source for "file" at [1], 
the pattern matching code was in place as of Sept 2011.  It could be 
present earlier than Sept 2011, but I stopped hunting for it.  The 
earliest it would have made its way into the magic db would have been 
April 2011.


Perhaps OpenBSD is using some custom branch of "file", haven't updated 
the db, etc.


Yours,

Kevin



On 05/23/2012 03:36 PM, Francis Kayiwa wrote:

On Wed, May 23, 2012 at 03:28:56PM -0400, Ross Singer wrote:

Wow, this is pretty cool.

Kevin, do you have examples of the output?

Does it work for bulk files?

I mean, I could just try this on my Ubuntu machine, but it's all the way 
downstairs...


My OS lists it as `data`

$ cd
$ ls
devid_rsa.pub laflin marc   orthancssh
updating
$ ftp http://drupal.org/files/issues/5_records_utf8.mrc_.txt
Trying 140.211.166.6...
Requesting http://drupal.org/files/issues/5_records_utf8.mrc_.txt
100%
|**|
5965   00:00
5965 bytes received in 0.00 seconds (1.56 MB/s)
$ ls
5_records_utf8.mrc_.txt  id_rsa.pub   marc
ssh
dev  laflin   orthanc
updating
$ mkdir test
$ mv 5_records_utf8.mrc_.txt test/
$ cd test/
$ mv 5_records_utf8.mrc_.txt 5_records_utf8.mrc
$ ls
5_records_utf8.mrc
$ file 5_records_utf8.mrc
5_records_utf8.mrc: data
$ ls
5_records_utf8.mrc
$ ls -al
total 32
drwxr-xr-x   2 kayiwa  kayiwa   512 May 23 14:34 .
drwxr-xr-x  10 kayiwa  kayiwa   512 May 23 14:34 ..
-rw-r--r--   1 kayiwa  kayiwa  5965 May 23 14:33 5_records_utf8.mrc
$ uname -a
OpenBSD orthanc.lib.uic.edu 5.1 GENERIC.MP#256 i386

./fxk



-Ross.

On May 23, 2012, at 3:14 PM, Ford, Kevin wrote:


I finally had occasion today (read: remembered) to see if the *nix "file" 
command would recognize a MARC record file.  I haven't tested extensively, but it did 
identify the file as MARC21 Bibliographic record.  It also correctly identified a MARC21 
Authority Record.  I'm running the most recent version of Ubuntu (12.04 - precise 
pangolin).

I write because the inclusion of a "file" MARC21 specification rule in the 
magic.db stems from a Code4lib exchange that started in March 2011 [1] (it ends in April 
if you want to go crawling for the entire thread).

Rgds,

Kevin

[1] https://listserv.nd.edu/cgi-bin/wa?A2=ind1103&L=CODE4LIB&T=0&F=&S=&P=112728

--
Kevin Ford
Network Development and MARC Standards Office
Library of Congress
Washington, DC






Re: [CODE4LIB] MARC Magic for file

2012-05-23 Thread stuart yeates

On 24/05/12 07:14, Ford, Kevin wrote:

I finally had occasion today (read: remembered) to see if the *nix "file" 
command would recognize a MARC record file.  I haven't tested extensively, but it did 
identify the file as MARC21 Bibliographic record.  It also correctly identified a MARC21 
Authority Record.  I'm running the most recent version of Ubuntu (12.04 - precise 
pangolin).

I write because the inclusion of a "file" MARC21 specification rule in the 
magic.db stems from a Code4lib exchange that started in March 2011 [1] (it ends in April 
if you want to go crawling for the entire thread).


A couple of warnings about the unix file command

(a) it only looks at the start of the file. This is great because it 
works fast on big files. This is dreadful because it can't warn you that 
everything after the first 10k of a 2GB file is corrupt or a 1k MARC 
file is pre-pended to a 400GB astronomy data file.


(b) it is not uncommon for a file to match multiple file types. This can 
cause problems when using file to check whether inputs to a program are 
actually the type the program is expecting.


(c) some platforms have been notoriously slow to add new definitions, 
ubuntu is not such a platform.
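
Warning (a) can be worked around by walking the whole file instead of sniffing its start. A minimal Python sketch, assuming the file is nothing but concatenated ISO 2709 records; `count_records` is a made-up name, and it checks only the five-digit length field and the record terminator (0x1D), not the contents of each record:

```python
RECORD_TERMINATOR = 0x1D  # ISO 2709 record terminator byte


def count_records(data: bytes) -> int:
    """Walk concatenated ISO 2709 records; raise ValueError at the first bad one."""
    offset = 0
    count = 0
    while offset < len(data):
        # Leader bytes 0-4 hold the record length as five ASCII digits.
        length_field = data[offset:offset + 5]
        if len(length_field) < 5 or not length_field.isdigit():
            raise ValueError(f"record {count + 1}: bad length field at byte {offset}")
        length = int(length_field)
        end = offset + length
        # A record can never be shorter than its 24-byte leader.
        if length < 24 or end > len(data):
            raise ValueError(f"record {count + 1}: length {length} does not fit")
        if data[end - 1] != RECORD_TERMINATOR:
            raise ValueError(f"record {count + 1}: missing record terminator")
        offset = end
        count += 1
    return count
```

Unlike file, this reads every byte, so it is slow on big inputs, but it would notice an astronomy dump glued onto the end of a MARC file.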


cheers
stuart
--
Stuart Yeates
Library Technology Services http://www.victoria.ac.nz/library/


Re: [CODE4LIB] MARC Magic for file

2012-05-23 Thread Jonathan Rochkind
I have recently become unpleasantly acquainted with the world of Marc 
that is not Marc21, but is ISO 2709.


What'll it do on ISO 2709? I might be able to dig up an example. I 
wonder if it'll claim it's Marc21 (not), or if it's Marc21 
"non-conforming" (no, it's not quite that either. It's ISO-2709 MARC 
that's not Marc21).


If it just doesn't know anything about it and says 'data', that's just 
fine, if it knows about Marc21 but not non-Marc21 ISO 2709.


On 5/23/2012 3:48 PM, Ford, Kevin wrote:

Does it work for bulk files?

-- It passed on a file containing 215 MARC Bibs and on a file containing 2,574 MARC Auth records.  
Don't know if you consider these "bulk," but there is more than 1 record in each file 
(caveat: "file" stops after evaluating the first line, so of the 2,574 Auth records, the 
last 2,573 could be invalid).  It failed on a file containing all of LC Classification.  I need to 
figure out why.


Kevin, do you have examples of the output?

-- I received "MARC21 Bibliography" and "MARC21 Authority" respectively.  In theory, if Leader 
20-23 are not "4500" then "(non-conforming)" should be appended to the identification.  If 
requested, the mimetype - application/marc - should also be outputted.

Rgds,

Kevin





-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
Ross Singer
Sent: Wednesday, May 23, 2012 3:29 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] MARC Magic for file

Wow, this is pretty cool.

Kevin, do you have examples of the output?

Does it work for bulk files?

I mean, I could just try this on my Ubuntu machine, but it's all the
way downstairs...

-Ross.

On May 23, 2012, at 3:14 PM, Ford, Kevin wrote:


I finally had occasion today (read: remembered) to see if the *nix
"file" command would recognize a MARC record file.  I haven't tested
extensively, but it did identify the file as MARC21 Bibliographic
record.  It also correctly identified a MARC21 Authority Record.  I'm
running the most recent version of Ubuntu (12.04 - precise pangolin).

I write because the inclusion of a "file" MARC21 specification rule
in the magic.db stems from a Code4lib exchange that started in March
2011 [1] (it ends in April if you want to go crawling for the entire
thread).


Rgds,

Kevin

[1]
https://listserv.nd.edu/cgi-bin/wa?A2=ind1103&L=CODE4LIB&T=0&F=&S=&P=112728

--
Kevin Ford
Network Development and MARC Standards Office Library of Congress
Washington, DC




Re: [CODE4LIB] MARC Magic for file

2012-05-23 Thread Ford, Kevin
> Does it work for bulk files?
-- It passed on a file containing 215 MARC Bibs and on a file containing 2,574 
MARC Auth records.  Don't know if you consider these "bulk," but there is more 
than 1 record in each file (caveat: "file" stops after evaluating the first 
line, so of the 2,574 Auth records, the last 2,573 could be invalid).  It 
failed on a file containing all of LC Classification.  I need to figure out 
why.  

> Kevin, do you have examples of the output?
-- I received "MARC21 Bibliography" and "MARC21 Authority" respectively.  In 
theory, if Leader 20-23 are not "4500" then "(non-conforming)" should be 
appended to the identification.  If requested, the mimetype - application/marc 
- should also be outputted.
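
The check described above can be sketched in Python. This only imitates the described behaviour and is not the actual rule in file's magic database; `describe_leader` and the "MARC21-like record" label are hypothetical:

```python
def describe_leader(leader: bytes) -> str:
    """Label a 24-byte leader roughly the way the thread describes."""
    # Leader bytes 0-4 must be the record length as five ASCII digits.
    if len(leader) < 24 or not leader[0:5].isdigit():
        return "data"
    label = "MARC21-like record"
    # MARC21 fixes leader bytes 20-23 at "4500".
    if leader[20:24] != b"4500":
        label += " (non-conforming)"
    return label


print(describe_leader(b"00714cam a2200205 a 4500"))
print(describe_leader(b"00714cam a2200205 a 5400"))
```

Anything that does not start with five digits falls through to "data", which matches what the older magic databases in this thread report.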

Rgds,

Kevin




> -Original Message-
> From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
> Ross Singer
> Sent: Wednesday, May 23, 2012 3:29 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] MARC Magic for file
> 
> Wow, this is pretty cool.
> 
> Kevin, do you have examples of the output?
> 
> Does it work for bulk files?
> 
> I mean, I could just try this on my Ubuntu machine, but it's all the
> way downstairs...
> 
> -Ross.
> 
> On May 23, 2012, at 3:14 PM, Ford, Kevin wrote:
> 
> > I finally had occasion today (read: remembered) to see if the *nix
> "file" command would recognize a MARC record file.  I haven't tested
> extensively, but it did identify the file as MARC21 Bibliographic
> record.  It also correctly identified a MARC21 Authority Record.  I'm
> running the most recent version of Ubuntu (12.04 - precise pangolin).
> >
> > I write because the inclusion of a "file" MARC21 specification rule
> in the magic.db stems from a Code4lib exchange that started in March
> 2011 [1] (it ends in April if you want to go crawling for the entire
> thread).
> >
> > Rgds,
> >
> > Kevin
> >
> > [1]
> > https://listserv.nd.edu/cgi-bin/wa?A2=ind1103&L=CODE4LIB&T=0&F=&S=&P=112728
> >
> > --
> > Kevin Ford
> > Network Development and MARC Standards Office Library of Congress
> > Washington, DC


Re: [CODE4LIB] MARC Magic for file

2012-05-23 Thread Francis Kayiwa
On Wed, May 23, 2012 at 03:28:56PM -0400, Ross Singer wrote:
> Wow, this is pretty cool.
> 
> Kevin, do you have examples of the output?
> 
> Does it work for bulk files?
> 
> I mean, I could just try this on my Ubuntu machine, but it's all the way 
> downstairs...

My OS lists it as `data`

$ cd
$ ls
dev       id_rsa.pub  laflin    marc      orthanc   ssh       updating
$ ftp http://drupal.org/files/issues/5_records_utf8.mrc_.txt
Trying 140.211.166.6...
Requesting http://drupal.org/files/issues/5_records_utf8.mrc_.txt
100% |**|  5965   00:00
5965 bytes received in 0.00 seconds (1.56 MB/s)
$ ls
5_records_utf8.mrc_.txt  id_rsa.pub               marc                     ssh
dev                      laflin                   orthanc                  updating
$ mkdir test
$ mv 5_records_utf8.mrc_.txt test/
$ cd test/
$ mv 5_records_utf8.mrc_.txt 5_records_utf8.mrc
$ ls
5_records_utf8.mrc
$ file 5_records_utf8.mrc
5_records_utf8.mrc: data
$ ls
5_records_utf8.mrc
$ ls -al
total 32
drwxr-xr-x   2 kayiwa  kayiwa   512 May 23 14:34 .
drwxr-xr-x  10 kayiwa  kayiwa   512 May 23 14:34 ..
-rw-r--r--   1 kayiwa  kayiwa  5965 May 23 14:33 5_records_utf8.mrc
$ uname -a
OpenBSD orthanc.lib.uic.edu 5.1 GENERIC.MP#256 i386

./fxk

> 
> -Ross.
> 
> On May 23, 2012, at 3:14 PM, Ford, Kevin wrote:
> 
> > I finally had occasion today (read: remembered) to see if the *nix "file" 
> > command would recognize a MARC record file.  I haven't tested extensively, 
> > but it did identify the file as MARC21 Bibliographic record.  It also 
> > correctly identified a MARC21 Authority Record.  I'm running the most 
> > recent version of Ubuntu (12.04 - precise pangolin).
> > 
> > I write because the inclusion of a "file" MARC21 specification rule in the 
> > magic.db stems from a Code4lib exchange that started in March 2011 [1] (it 
> > ends in April if you want to go crawling for the entire thread).
> > 
> > Rgds,
> > 
> > Kevin
> > 
> > [1] 
> > https://listserv.nd.edu/cgi-bin/wa?A2=ind1103&L=CODE4LIB&T=0&F=&S=&P=112728
> > 
> > --
> > Kevin Ford
> > Network Development and MARC Standards Office
> > Library of Congress
> > Washington, DC
> 

-- 
If builders built buildings the way programmers wrote programs,
then the first woodpecker to come along would destroy civilization.


Re: [CODE4LIB] MARC Magic for file

2012-05-23 Thread Ross Singer
Wow, this is pretty cool.

Kevin, do you have examples of the output?

Does it work for bulk files?

I mean, I could just try this on my Ubuntu machine, but it's all the way 
downstairs...

-Ross.

On May 23, 2012, at 3:14 PM, Ford, Kevin wrote:

> I finally had occasion today (read: remembered) to see if the *nix "file" 
> command would recognize a MARC record file.  I haven't tested extensively, 
> but it did identify the file as MARC21 Bibliographic record.  It also 
> correctly identified a MARC21 Authority Record.  I'm running the most recent 
> version of Ubuntu (12.04 - precise pangolin).
> 
> I write because the inclusion of a "file" MARC21 specification rule in the 
> magic.db stems from a Code4lib exchange that started in March 2011 [1] (it 
> ends in April if you want to go crawling for the entire thread).
> 
> Rgds,
> 
> Kevin
> 
> [1] 
> https://listserv.nd.edu/cgi-bin/wa?A2=ind1103&L=CODE4LIB&T=0&F=&S=&P=112728
> 
> --
> Kevin Ford
> Network Development and MARC Standards Office
> Library of Congress
> Washington, DC


[CODE4LIB] MARC Magic for file

2012-05-23 Thread Ford, Kevin
I finally had occasion today (read: remembered) to see if the *nix "file" 
command would recognize a MARC record file.  I haven't tested extensively, but 
it did identify the file as MARC21 Bibliographic record.  It also correctly 
identified a MARC21 Authority Record.  I'm running the most recent version of 
Ubuntu (12.04 - precise pangolin).

I write because the inclusion of a "file" MARC21 specification rule in the 
magic.db stems from a Code4lib exchange that started in March 2011 [1] (it ends 
in April if you want to go crawling for the entire thread).

Rgds,

Kevin

[1] https://listserv.nd.edu/cgi-bin/wa?A2=ind1103&L=CODE4LIB&T=0&F=&S=&P=112728

--
Kevin Ford
Network Development and MARC Standards Office
Library of Congress
Washington, DC


Re: [CODE4LIB] MARC magic for file

2011-04-08 Thread Sean Hannan
http://i.imgur.com/6WtA0.png

(Sorry, it's Friday. Also, blame dchud for the idea.)

-Sean


On 4/6/11 4:53 PM, "Mike Taylor"  wrote:

> On 6 April 2011 19:53, Jonathan Rochkind  wrote:
>> On 4/6/2011 2:43 PM, William Denton wrote:
>>> 
>>> "Validity" does mean something definite ... but Postel's Law is a good
>>> guideline, especially with the swamp of bad MARC, old MARC, alternate
>>> MARC, that's out there.  Valid MARC is valid MARC, but if---for the sake
>>> of file and its magic---we can identify technically invalid but still
>>> usable MARC, that's good.
>> 
>> Hmm, except in the case of Web Browsers, I think general consensus is
>> Postel's law was not helpful. These days, most people seem to think that
>> having different browsers be tolerant of invalid data in different ways was
>> actually harmful rather than helpful to inter-operability (which is
>> theoretically the goal of Postel's law), and that's not what people do
>> anymore in web browser land, at least not to the extremes they used to do
>> it.
> 
> But the idea that browsers should be less permissive in what they
> accept is a modern one that we now have the luxury of only because
> adherence to Postel's law in the early days of the Web allowed it to
> become ubiquitous.  Though it's true, as Harvey Thompson has observed
> that "it's difficult to retro-fit correctness", Clay Shirky was also
> very right when he pointed out that "You cannot simultaneously have
> mass adoption and rigor".  If browsers in 1995 had been as pedantic as
> the browsers of 2011 (rightly) are, we wouldn't even have the Web; or
> if it existed at all it would just be a nichey thing that a few
> scientists used to make their publications available to each other.
> 
> So while I agree that in the case of HTML we are right to now be
> moving towards more rigorous demands of what to accept (as well, of
> course, as being conservative in what we emit), I don't think we could
> have made the leap from nothing to modern rigour.
> 
> -- Mike


Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread Mike Taylor
On 6 April 2011 19:53, Jonathan Rochkind  wrote:
> On 4/6/2011 2:43 PM, William Denton wrote:
>>
>> "Validity" does mean something definite ... but Postel's Law is a good
>> guideline, especially with the swamp of bad MARC, old MARC, alternate
>> MARC, that's out there.  Valid MARC is valid MARC, but if---for the sake
>> of file and its magic---we can identify technically invalid but still
>> usable MARC, that's good.
>
> Hmm, except in the case of Web Browsers, I think general consensus is
> Postel's law was not helpful. These days, most people seem to think that
> having different browsers be tolerant of invalid data in different ways was
> actually harmful rather than helpful to inter-operability (which is
> theoretically the goal of Postel's law), and that's not what people do
> anymore in web browser land, at least not to the extremes they used to do
> it.

But the idea that browsers should be less permissive in what they
accept is a modern one that we now have the luxury of only because
adherence to Postel's law in the early days of the Web allowed it to
become ubiquitous.  Though it's true, as Harvey Thompson has observed
that "it's difficult to retro-fit correctness", Clay Shirky was also
very right when he pointed out that "You cannot simultaneously have
mass adoption and rigor".  If browsers in 1995 had been as pedantic as
the browsers of 2011 (rightly) are, we wouldn't even have the Web; or
if it existed at all it would just be a nichey thing that a few
scientists used to make their publications available to each other.

So while I agree that in the case of HTML we are right to now be
moving towards more rigorous demands of what to accept (as well, of
course, as being conservative in what we emit), I don't think we could
have made the leap from nothing to modern rigour.

-- Mike


Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread Kyle Banerjee
> Well, the problem is when the original Marc4J author took the spec at its
> word, and actually _acted upon_ the '4' and the '5', changing file semantics
> if they were different, and throwing an exception if it was a non-digit.
>
At least the author actually used the values rather than checking to see if
a 4 or 5 were there. I still don't see what the point of looking for a 0 in
an undefined field would be. I'm wondering what kind of nut job would write
this into the standard, but that's not the author's problem.


> Do you think he got it wrong?  How was he supposed to know he got it wrong,
> he wrote to the spec and took it at its word. Are you SURE there aren't any
> Marc formats other than Marc21 out there that actually do use these bytes
> with their intended meaning, instead of fixing them?


I wouldn't call it wrong -- the spec is a logical point of departure. MARC21
derives from an ISO standard that does not use those character positions and
which otherwise requires the same data layout, but the author wouldn't
necessarily know that.

Standards have something in common with laws in that how they are used in
the real world is as or more important than what is actually defined --
what's written and what's done in practice can be very different.

Everyone here who has parsed catalog data or done an ILS migration
knows better than to think for a second that fields can be assumed to
be used as defined, except for very basic stuff.


> How was the Marc4J author supposed to be sure of that, or even guess it
> might be the case, and know he'd be serving users better by ignoring the
> spec here instead of following it?


There might not have been a good way to know. With data, one thing you
always want to do is ask a bunch of people who work with it all the time
about anomalies in the wild. Many great works of fiction masquerade as
documents which supposedly describe reality.


> Ie: I _thought_ I was writing only for Marc21, but then it turns out I've
> got to accept records from Outer Weirdistan that are a kind of legal Marc
> that actually uses those bytes for their intended meaning


Any such MARC would be noncompliant with the ISO standard from which
MARC21 hails. If working from the MARC21 standard and weird records are in
question, there would be a greater chance of choking on nonnumeric tags as
those are allowed by the ISO standard.

Ignoring that MARC21 would need to be redefined to be able to take on other
values, one can safely conclude that such a redefinition could only be
written by totally deranged individuals. Values lower than 4 and 5
respectively would limit record length to the point little or no data could
be stored, and greater values would be completely nonsensical as the MARC
record length limitation would mean that the extra space allocated by the
digits could only contain zeros.
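
The arithmetic behind that argument can be spelled out. A rough Python sketch, assuming only that every numeric field is zero-padded decimal and that the record length lives in the five digits of leader bytes 0-4; `largest` and `limits` are names invented for illustration:

```python
def largest(digits: int) -> int:
    """Largest number that fits in a zero-padded field of this many digits."""
    return 10 ** digits - 1


# Leader bytes 0-4 hold the record length in five digits.
MAX_RECORD_LENGTH = largest(5)


def limits(len_digits: int, start_digits: int) -> dict:
    """Limits implied by a given pair of leader byte 20 / byte 21 values."""
    return {
        "max_field_length": largest(len_digits),
        # A field cannot start beyond the end of the longest possible record.
        "max_start_position": min(largest(start_digits), MAX_RECORD_LENGTH),
        # Start-position digits past the fifth can never be anything but zero.
        "wasted_start_digits": max(0, start_digits - 5),
    }


print(limits(4, 5))  # the MARC21 values
print(limits(3, 4))  # smaller values shrink what a record can hold
print(limits(4, 7))  # larger start positions only add leading zeros
```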

In any case, MARC is a legacy standard from the 60's. The chances of new
flavors emerging are dismal at best.


> Again, I realize in the actual environment we've got, this is not a luxury
> we have. But it's a fault, not a benefit, to have lots of software
> everywhere behaving in non-compliant ways and creating invalid (according to
> the spec!) data.
>
Creating is another matter entirely. Since we can control what we create
ourselves, we make things a little better every time we make things
conformant. However, we can't control what others do and being able to read
everything is useful, including stuff created using tools/processes that
aren't up to scratch.

kyle


Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread Jonathan Rochkind

On 4/6/2011 2:43 PM, William Denton wrote:


"Validity" does mean something definite ... but Postel's Law is a good
guideline, especially with the swamp of bad MARC, old MARC, alternate
MARC, that's out there.  Valid MARC is valid MARC, but if---for the sake
of file and its magic---we can identify technically invalid but still
usable MARC, that's good.


Hmm, except in the case of Web Browsers, I think general consensus is 
Postel's law was not helpful. These days, most people seem to think that 
having different browsers be tolerant of invalid data in different ways 
was actually harmful rather than helpful to inter-operability (which is 
theoretically the goal of Postel's law), and that's not what people do 
anymore in web browser land, at least not to the extremes they used to 
do it.


So Postel's Law may not be universal.  Although marc data may or may
not be analogous to a web browser/html. :)  It doesn't _really_ matter,
cause we're stuck with the legacy we're stuck with, there's no changing 
it now. But there are real world negative consequences to it, some of 
which I've tried to explain in previous messages. (And still don't call 
it "validity" if it's not please! But yes, sometimes insisting on strict 
validity is not the appropriate solution).


Also note that assuming that byte 20-21 is "45" even when it's something 
else is possibly not something Postel would accept as an application of 
his law -- unless you document your software specifically as working 
only with Marc21, and not any Marc.


[Postel's Law: "Be conservative in what you send; be liberal in what you 
accept." http://en.wikipedia.org/wiki/Robustness_principle  .  That wiki 
page also notes the general category of downside in following Postel's 
law, which is what was encountered with HTML, and which _I've_ 
encountered with MARC:  "For example, a defective implementation that 
sends non-conforming messages might be used only with implementations 
that tolerate those deviations from the specification until, possibly 
several years later, it is connected with a less tolerant application 
that rejects its messages. In such a situation, identifying the problem 
is often difficult, and deploying a solution can be costly. "


Yes, identifying the problem and deploying the solution was costly, in 
my MARC case, although it definitely could have been worse. ]


Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread William Denton

On 6 April 2011, Jonathan Rochkind wrote:

I think we computer programmers are really better-served by reserving the 
notion of "validity" for things specified by formal specifications -- as we 
normally do, talking about any other data format.   And the only formal 
specifications I can find for Marc21 say that leader bytes 20-23 should be 
4500. (Not true of Marc in general just Marc21).


"Validity" does mean something definite ... but Postel's Law is a good 
guideline, especially with the swamp of bad MARC, old MARC, alternate 
MARC, that's out there.  Valid MARC is valid MARC, but if---for the sake 
of file and its magic---we can identify technically invalid but still 
usable MARC, that's good.


Bill
--
William Denton, Toronto : miskatonic.org www.frbr.org openfrbr.org


Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread Jonathan Rochkind

On 4/6/2011 2:02 PM, Kyle Banerjee wrote:

I'd go so far as to question the value of validating redundant data that
theoretically has meaning but which are never supposed to vary. The 4 and
the 5 simply repeat what is already known about the structure of the MARC
record. Choking on stuff like this is like having a web browser ask you what
to do with a page because it lacks a document type declaration.


Well, the problem is when the original Marc4J author took the spec at 
its word, and actually _acted upon_ the '4' and the '5', changing file
semantics if they were different, and throwing an exception if it was a 
non-digit.


This actually happened, I'm not making this up!  Took me a while to debug.
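
A rough sketch of the two behaviours, for anyone who wants to reproduce the symptom. This is not Marc4J's code; it is a minimal Python illustration that assumes a single-record byte string, ignores leader byte 22 (the implementation-defined portion), and uses made-up names (build_sample, parse_directory).

```python
FT, RT = b"\x1e", b"\x1d"  # field terminator, record terminator

def build_sample(entry_map: str = "4500") -> bytes:
    """Build a one-field record whose leader/20-23 is `entry_map`."""
    directory = b"245" + b"0006" + b"00000" + FT   # tag, length, start
    body = b"Hello" + FT + RT
    base = 24 + len(directory)
    total = base + len(body)
    leader = f"{total:05d}nam  22{base:05d}   {entry_map}".encode("ascii")
    return leader + directory + body

def parse_directory(record: bytes, trust_leader: bool = True):
    """Return (tag, length, start) tuples from the directory."""
    leader = record[:24].decode("ascii")
    base = int(leader[12:17])
    if trust_leader:
        # Believe the leader: a non-digit here raises ValueError.
        len_width, start_width = int(leader[20]), int(leader[21])
    else:
        # Assume MARC21 and ignore whatever the leader says.
        len_width, start_width = 4, 5
    size = 3 + len_width + start_width
    directory = record[24:base - 1].decode("ascii")
    entries = []
    for i in range(0, len(directory) - size + 1, size):
        entry = directory[i:i + size]
        entries.append((entry[:3],
                        int(entry[3:3 + len_width]),
                        int(entry[3 + len_width:])))
    return entries
```

With a clean leader both modes agree. With "3500" in leader/20-23 the trusting parser silently returns a field of length 0 starting at offset 60000, and with a blank in byte 20 it raises, which is roughly the shape of the failure described here.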

So do you think he got it wrong?  How was he supposed to know he got it 
wrong, he wrote to the spec and took it at its word. Are you SURE there
aren't any Marc formats other than Marc21 out there that actually do use 
these bytes with their intended meaning, instead of fixing them? How was 
the Marc4J author supposed to be sure of that, or even guess it might be 
the case, and know he'd be serving users better by ignoring the spec 
here instead of following it?  What documents instead of the actual 
specifications should he have been looking at to determine that he ought 
not to be taking those bytes at their words, but just ignoring them?


To realize that we have so much non-conformant data out there that we're 
better off ignoring these bytes, is something you can really only learn 
through experience -- and something you can then later realize you're 
wrong on too:


Ie: I _thought_ I was writing only for Marc21, but then it turns out 
I've got to accept records from Outer Weirdistan that are a kind of 
legal Marc that actually uses those bytes for their intended meaning -- 
better go back and fix my entire software stack, involving various 
proprietary and open source products from multiple sources, each of 
which has undocumented behavior when it comes to these bytes, maybe they 
follow the spec or maybe they follow Kyle's advice, but they don't tell
me.  This is a mess.


Maybe this scenario is impossible, maybe there ARE NOT and NEVER HAVE BEEN
any Marc variants that actually use leader bytes 20-22 in this way --
how can I determine that?  I've just got to guess and hope for the 
best.  The point of specifications in the first place is for 
inter-operability, so we know that if all software and data conforms to 
the spec, then all software and data will interact in expected ways.  
Once we start guessing at which parts of the spec we really ought to be 
ignoring


Again, I realize in the actual environment we've got, this is not a 
luxury we have. But it's a fault, not a benefit, to have lots of 
software everywhere behaving in non-compliant ways and creating invalid 
(according to the spec!) data.


Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread Kyle Banerjee
..  Maybe we have different understandings of "valid".
>
> If leader bytes 20-23 are not "4500", I suggest that is _by definition_ not
> a "valid" Marc21 file. It violates the Marc21 specification.
>
> Now, they may still be _usable_, by software that ignores these bytes
> anyway or works around them. We definitely have a lot of software that does
> that.
>
> Which can end up causing problems that remind me of very analogous problems
> caused by the early days of web browsers that felt like being 'tolerant' of
> bad data. "My html works in every web browser BUT this one, why not? Oh,
> because that's the only one that actually followed the standard, oops."
>

There is some question as to what value there is in validating fields that
have no meaning by definition. What benefit does validating an undefined
value have other than create an opportunity to break things and slow the
process down just a little? The entire concept of an invalid entry in an
undefined field (e.g. byte 23) is oxymoronic.

I'd go so far as to question the value of validating redundant data that
theoretically has meaning but which are never supposed to vary. The 4 and
the 5 simply repeat what is already known about the structure of the MARC
record. Choking on stuff like this is like having a web browser ask you what
to do with a page because it lacks a document type declaration.

Garbage data is the reality, so having parsers stop when they encounter data
they don't actually need unnecessarily complicates things. That kind of
stuff should generate a warning at worst.
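
As a sketch of the "warning at worst" idea, something along these lines would report an unexpected leader/20-23 and carry on. The function name check_entry_map is hypothetical, and the choice of Python's standard warnings module is just one way to surface the message.

```python
import warnings

def check_entry_map(leader: bytes) -> bool:
    """Return True if leader/20-23 is '4500'; warn, but never raise, otherwise."""
    entry_map = leader[20:24]
    if entry_map == b"4500":
        return True
    warnings.warn(
        f"leader/20-23 is {entry_map!r}, expected b'4500'; continuing anyway"
    )
    return False
```

The caller can then decide whether a False result matters for what it is doing, instead of the parser deciding for everyone.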

kyle


Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread Jonathan Rochkind

Actually -- I'd disagree because that is a very narrow view of the
specification.  When validating MARC, I'd take the approach to validate
structure (which allows you to then read any MARC format) -- then use a
separate process for validating content of fields, which in my opinion,
is more open to interpretation based on system usage of the data.


Wait, so is there any formal specification of "validity" that you can 
look at to determine your definition of "validity", or it's just "well, 
if I can recover it into useful data, using my own algorithms"


I think we computer programmers are really better-served by reserving 
the notion of "validity" for things specified by formal specifications 
-- as we normally do, talking about any other data format.   And the 
only formal specifications I can find for Marc21 say that leader bytes 
20-23 should be 4500. (Not true of Marc in general just Marc21).


Now it may very well be (is!) true that the library community with Marc 
have been in the practice of tolerating "working" Marc that is NOT valid 
according to any specification.   So, sure, we may need to write 
software to take account of that sordid history. But I think it IS a 
sordid history -- not having a specification to ensure validity makes it 
VERY hard to write any new software that recognizes what you expect it 
to recognize, because what you expect it to recognize isn't formally
specified anywhere. It's a problem.  We shouldn't try to hide the 
problem in our discussions by using the word "valid" to mean something 
different than we use it for any modern data format. "valid" only has a 
meaning when you're talking about valid according to some specific 
specification.


Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread Reese, Terry
I'm honestly not familiar with magic.  I can tell you in MarcEdit, the way that
the process works is there is a very generic function that reads the structure 
of the data not trusting the information in the leader (since I find this data 
very unreliable).  Then, if users want to apply a set of rules to the
validation -- I apply those as a secondary process.  If you are looking to 
validate specific content within a record, then what you want to do in this 
function may be appropriate -- though you'll find some local systems will 
consistently fail the process.
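
One way to read structure without trusting the leader, sketched in Python. This is not MarcEdit's actual code; it only assumes that records end with the record terminator byte (hex 1D), and the function name split_records is made up.

```python
RECORD_TERMINATOR = b"\x1d"

def split_records(data: bytes):
    """Yield raw records by scanning for the terminator, not the leader length."""
    for chunk in data.split(RECORD_TERMINATOR):
        if chunk.strip():                 # skip empty tail / stray whitespace
            yield chunk + RECORD_TERMINATOR
```

Any rule-based checks on leader or field content would then run as a second pass over each yielded record, so a bad length value in one leader cannot throw off the reading of the records that follow it.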

--tr


From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] On Behalf Of William Denton 
[w...@pobox.com]
Sent: Wednesday, April 06, 2011 10:29 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] MARC magic for file

On 6 April 2011, Reese, Terry wrote:

> Actually -- I'd disagree because that is a very narrow view of the
> specification.  When validating MARC, I'd take the approach to validate
> structure (which allows you to then read any MARC format) -- then use a
> separate process for validating content of fields, which in my opinion,
> is more open to interpretation based on system usage of the data.

What do you think is the best way to recognize MARC files (up to some
level of validity, given all the MARC you've seen and parsed) that could
be made to work the way magic is defined?

Bill
--
William Denton, Toronto : miskatonic.org www.frbr.org openfrbr.org


Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread William Denton

On 6 April 2011, Reese, Terry wrote:

Actually -- I'd disagree because that is a very narrow view of the 
specification.  When validating MARC, I'd take the approach to validate 
structure (which allows you to then read any MARC format) -- then use a 
separate process for validating content of fields, which in my opinion, 
is more open to interpretation based on system usage of the data.


What do you think is the best way to recognize MARC files (up to some 
level of validity, given all the MARC you've seen and parsed) that could 
be made to work the way magic is defined?


Bill
--
William Denton, Toronto : miskatonic.org www.frbr.org openfrbr.org


Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread Reese, Terry
Actually -- I'd disagree because that is a very narrow view of the 
specification.  When validating MARC, I'd take the approach to validate 
structure (which allows you to then read any MARC format) -- then use a 
separate process for validating content of fields, which in my opinion, is more 
open to interpretation based on system usage of the data.  For example, 22 and 
23 are undefined values that local systems may very well have a practical need 
to define and use given that there are only so many values in the leader.  This 
is why I sometimes see additional values in the 09 field (which should be a or 
blank) to define different character set types, or additional elements added to 
other fields.  If I want to validate the content of those fields, I'd validate 
it through a different process -- but I separate the process from the 
validation of the structure -- because the two are not exclusive.

--TR

> -Original Message-
> From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
> Sent: Wednesday, April 06, 2011 9:59 AM
> To: Code for Libraries
> Cc: Reese, Terry
> Subject: Re: [CODE4LIB] MARC magic for file
> 
> I'm not sure what you mean Terry.  Maybe we have different understandings
> of "valid".
> 
> If leader bytes 20-23 are not "4500", I suggest that is _by definition_ not a
> "valid" Marc21 file. It violates the Marc21 specification.
> 
> Now, they may still be _usable_, by software that ignores these bytes
> anyway or works around them. We definitely have a lot of software that
> does that.
> 
> Which can end up causing problems that remind me of very analagous
> problems caused by the early days of web browsers that felt like being
> 'tolerant' of bad data. "My html works in every web brower BUT this one,
> why not? Oh, becuase that's the only one that actually followed the
> standard, oops."
> 
> I actually ran into an example of that problem with this exact issue.
> MOST software just ignores marc leader bytes 20-23, and assumes the
> semantics of "4500"---the only legal semantics for Marc21.  But Marc4j
> actually _respected_ them, apparently the author thought that some marc in
> the wild might intentionally set different bytes here (no idea if that's true
> or not). So if the leader bytes 20-23 were "invalid"
> (according to the spec), Marc4j would suddenly decide that the "length of
> field portion" was NOT 4, but actually BELIEVE whatever was in leader byte
> 20, causing the record to be parsed improperly.  And I had records like that
> coming out of my ILS (not even a vendor database). That was an unfun
> couple days of debugging to figure out what was going on.
> 
> On 4/6/2011 12:52 PM, Reese, Terry wrote:
> > Actually, you can have records that are MARC21 coming out of vendor
> databases (who sometime embed control characters into the leader) and still
> be valid.  Once you stop looking at just your ILS or OCLC, you probably
> wouldn't be surprised to know that records start looking very different.
> >
> > --TR
> >
> >
> > 
> > Terry Reese, Associate Professor
> > Gray Family Chair
> > for Innovative Library Services
> > 121 Valley Libraries
> > Corvallis, Or 97331
> > tel: 541.737.6384
> > 
> >
> >
> >
> >> -Original Message-
> >> From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf
> >> Of Jonathan Rochkind
> >> Sent: Wednesday, April 06, 2011 9:44 AM
> >> To: CODE4LIB@LISTSERV.ND.EDU
> >> Subject: Re: [CODE4LIB] MARC magic for file
> >>
> >> Can't you have a legal "MARC" file that does NOT have 4500 in those
> >> leader positions?  It's just not legal "Marc21", right?   Other marc
> >> formats may specify or even allow flexibility in the things these
> >> bytes
> >> specify:
> >>
> >> * Length of the length-of-field portion
> >> * Number of characters in the starting-character-position portion of
> >> a Directory entry
> >> * Number of characters in the implementation-defined portion of a
> >> Directory entry
> >>
> >> Or, um, 23, which I guess is left to the specific Marc
> >> implementation (ie,
> >> Marc21 is one such) to use for its own purposes.
> >>
> >> I have no idea how that should inform the 'marc magic'.
> >>
> >> Is mime-type application/marc defined as specifically Marc21, or as
> >> any Marc?
> >>
> >> Jonathan
> >>
> >>

Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread Prettyman, Timothy
Just as a historical note, this non-standard use of LDR/22 is likely due to 
OCLC's use of the character as a hexadecimal flag from back in the days when 
marc records were mostly schlepped around on tapes.  They referred to it as the 
"Transaction type code".  When records were sent to oclc for processing, 
various values of the flag indicated whether a catalog card was to be produced, 
whether the record was an update, whether the user location symbol was to be 
set, etc.  I'm sure others have used it for their own nefarious purposes as 
well.

Tim Prettyman
University of Michigan/LIT

On Apr 6, 2011, at 12:28 PM, Ford, Kevin wrote:

> Well, this brings us right up against the issue of files that adhere to their 
> specifications versus forgiving applications.  Think of browsers and HTML.  
> Suffice it to say, MARC applications are quite likely to be forgiving of 
> leader positions 20-23.  In my non-conforming MARC file and in Bill's, the 
> leader positions 20-21 ("45") seemed constant, but things could fall apart 
> for positions 22-23.  So...
> 
> I present the following (in-line and attached, to preserve tabs) in an 
> attempt to straddle the two sides of this issue: applications forgiving of 
> non-conforming files.  Should the two characters following 45 (at position 
> 20) *not* be 00, then the identification will be noted as "non-conforming."  
> We could classify this as reasonable identification but hardly ironclad 
> (indeed, simply checking to confirm that part of the first 24 positions match 
> the specification hardly constitutes a robust identification, but it's 
> something).
> 
> It will also give you a mimetype too, now.
> 
> Would any like testing it out more fully on their own files?
> 
> 
> #
> # MARC 21 Magic  (Third cut)
> 
> # Set at position 0
> 0	byte	x
> 
> # leader position 20-21 must be 45
> >20	string	45
> 
> # leader starts with 5 digits, followed by codes specific to MARC format
> >>0	regex/1	(^[0-9]{5})[acdnp][^bhlnqsu-z]	MARC Bibliographic
> !:mime	application/marc
> >>0	regex/1	(^[0-9]{5})[acdnosx][z]	MARC Authority
> !:mime	application/marc
> >>0	regex/1	(^[0-9]{5})[cdn][uvxy]	MARC Holdings
> !:mime	application/marc
> >>0	regex/1	(^[0-9]{5})[acdn][w]	MARC Classification
> !:mime	application/marc
> >>0	regex/1	(^[0-9]{5})[cdn][q]	MARC Community
> !:mime	application/marc
> 
> # leader position 22-23, should be "00" but is it?
> >>0	regex/1	(^.{21})([^0]{2})	(non-conforming)
> !:mime	application/marc
> 
> 
> If this works, I'll see about submitting this copy.  Thanks to all your 
> efforts already.
> 
> Warmly,
> 
> Kevin
> 
> --
> Library of Congress
> Network Development and MARC Standards Office
> 
> 
> 
> 
> 
> 
> From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Simon Spero 
> [s...@unc.edu]
> Sent: Sunday, April 03, 2011 14:01
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] MARC magic for file
> 
> I am pretty sure that the marc4j standard reader ignores them; the tolerant
> reader definitely does. Otherwise JHU might have about two parseable records
> based on the mangled leaders that J-Rock  gets stuck with :-)
> 
> An analysis of the ~7M LC bib records from the scriblio.net data files (~
> Dec 2006) indicated that leader  has less than 8 bits of information in it
> (shannon-weaver definition). This excludes the initial length value, which
> is redundant given the end of record marker.
> 
> 
> The LC V'GER adds a pseudo tag 000 to its HTML view of the MARC leader.
> The final characters of the leader are "450".
> 
> Also, I object to the phrase "decent MARC tool".  Any tool capable of
> dealing with MARC as it exists cannot afford the luxury of decency :-)
> 
> [ HA: "A clear conscience?"
> BW: "Yes, Sir Humphrey."
> HA: "When did you acquire this taste for luxuries?"]
> 
> Simon
> 
> On Fri, Apr 1, 2011 at 5:16 AM, Owen Stephens  wrote:
> 
>> "I'm sure any decent MARC tool can deal with them, since decent MARC tools
>> are certainly going to be forgiving enough to deal with four characters
>> that
>> apparently don't even really matter."
>> 
>> You say that, but I'm pretty sure Marc4J throws errors on MARC records where
>> these characters are incorrect
>> 
>> Owen
>> 
>> On Fri, Apr 1, 2011 at 3:51 AM, William Denton  wrote:
>> 

Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread Jonathan Rochkind
I'm not sure what you mean Terry.  Maybe we have different 
understandings of "valid".


If leader bytes 20-23 are not "4500", I suggest that is _by definition_ 
not a "valid" Marc21 file. It violates the Marc21 specification.


Now, they may still be _usable_, by software that ignores these bytes 
anyway or works around them. We definitely have a lot of software that 
does that.


Which can end up causing problems that remind me of very analogous
problems caused by the early days of web browsers that felt like being
'tolerant' of bad data. "My html works in every web browser BUT this one,
why not? Oh, because that's the only one that actually followed the
standard, oops."


I actually ran into an example of that problem with this exact issue. 
MOST software just ignores marc leader bytes 20-23, and assumes the 
semantics of "4500"---the only legal semantics for Marc21.  But Marc4j 
actually _respected_ them, apparently the author thought that some marc 
in the wild might intentionally set different bytes here (no idea if 
that's true or not). So if the leader bytes 20-23 were "invalid" 
(according to the spec), Marc4j would suddenly decide that the "length
of field portion" was NOT 4, but actually BELIEVE whatever was in leader 
byte 20, causing the record to be parsed improperly.  And I had records 
like that coming out of my ILS (not even a vendor database). That was an 
unfun couple days of debugging to figure out what was going on.


On 4/6/2011 12:52 PM, Reese, Terry wrote:

Actually, you can have records that are MARC21 coming out of vendor databases 
(who sometime embed control characters into the leader) and still be valid.  
Once you stop looking at just your ILS or OCLC, you probably wouldn't be 
surprised to know that records start looking very different.

--TR



Terry Reese, Associate Professor
Gray Family Chair
for Innovative Library Services
121 Valley Libraries
Corvallis, Or 97331
tel: 541.737.6384





-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
Jonathan Rochkind
Sent: Wednesday, April 06, 2011 9:44 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] MARC magic for file

Can't you have a legal "MARC" file that does NOT have 4500 in those
leader positions?  It's just not legal "Marc21", right?   Other marc
formats may specify or even allow flexibility in the things these bytes
specify:

* Length of the length-of-field portion
* Number of characters in the starting-character-position portion of a
Directory entry
* Number of characters in the implementation-defined portion of a Directory
entry

Or, um, 23, which I guess is left to the specific Marc implementation (ie,
Marc21 is one such) to use for its own purposes.

I have no idea how that should inform the 'marc magic'.

Is mime-type application/marc defined as specifically Marc21, or as any
Marc?

Jonathan

On 4/6/2011 12:28 PM, Ford, Kevin wrote:

Well, this brings us right up against the issue of files that adhere to their

specifications versus forgiving applications.  Think of browsers and HTML.
Suffice it to say, MARC applications are quite likely to be forgiving of leader
positions 20-23.  In my non-conforming MARC file and in Bill's, the leader
positions 20-21 ("45") seemed constant, but things could fall apart for
positions 22-23.  So...

I present the following (in-line and attached, to preserve tabs) in an

attempt to straddle the two sides of this issue: applications forgiving of non-
conforming files.  Should the two characters following 45 (at position 20)
*not* be 00, then the identification will be noted as "non-conforming."  We
could classify this as reasonable identification but hardly ironclad (indeed,
simply checking to confirm that part of the first 24 positions match the
specification hardly constitutes a robust identification, but it's something).

It will also give you a mimetype too, now.

Would any like testing it out more fully on their own files?


#
# MARC 21 Magic  (Third cut)

# Set at position 0
0	byte	x

# leader position 20-21 must be 45
>20	string	45

# leader starts with 5 digits, followed by codes specific to MARC format
>>0	regex/1	(^[0-9]{5})[acdnp][^bhlnqsu-z]	MARC Bibliographic
!:mime	application/marc
>>0	regex/1	(^[0-9]{5})[acdnosx][z]	MARC Authority
!:mime	application/marc
>>0	regex/1	(^[0-9]{5})[cdn][uvxy]	MARC Holdings
!:mime	application/marc
>>0	regex/1	(^[0-9]{5})[acdn][w]	MARC Classification
!:mime	application/marc
>>0	regex/1	(^[0-9]{5})[cdn][q]	MARC Community
!:mime	application/marc

# leader position 22-23, should be "00" but is it?
>>0	regex/1	(^.{21})([^0]{2})

Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread Reese, Terry
Actually, you can have records that are MARC21 coming out of vendor databases 
(who sometime embed control characters into the leader) and still be valid.  
Once you stop looking at just your ILS or OCLC, you probably wouldn't be 
surprised to know that records start looking very different.

--TR



Terry Reese, Associate Professor
Gray Family Chair 
for Innovative Library Services
121 Valley Libraries
Corvallis, Or 97331
tel: 541.737.6384




> -Original Message-
> From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
> Jonathan Rochkind
> Sent: Wednesday, April 06, 2011 9:44 AM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] MARC magic for file
> 
> Can't you have a legal "MARC" file that does NOT have 4500 in those
> leader positions?  It's just not legal "Marc21", right?   Other marc
> formats may specify or even allow flexibility in the things these bytes
> specify:
> 
> * Length of the length-of-field portion
> * Number of characters in the starting-character-position portion of a
> Directory entry
> * Number of characters in the implementation-defined portion of a Directory
> entry
> 
> Or, um, 23, which I guess is left to the specific Marc implementation (ie,
> Marc21 is one such) to use for its own purposes.
> 
> I have no idea how that should inform the 'marc magic'.
> 
> Is mime-type application/marc defined as specifically Marc21, or as any
> Marc?
> 
> Jonathan
> 
> On 4/6/2011 12:28 PM, Ford, Kevin wrote:
> > Well, this brings us right up against the issue of files that adhere to 
> > their
> specifications versus forgiving applications.  Think of browsers and HTML.
> Suffice it to say, MARC applications are quite likely to be forgiving of 
> leader
> positions 20-23.  In my non-conforming MARC file and in Bill's, the leader
> positions 20-21 ("45") seemed constant, but things could fall apart for
> positions 22-23.  So...
> >
> > I present the following (in-line and attached, to preserve tabs) in an
> attempt to straddle the two sides of this issue: applications forgiving of 
> non-
> conforming files.  Should the two characters following 45 (at position 20)
> *not* be 00, then the identification will be noted as "non-conforming."  We
> could classify this as reasonable identification but hardly ironclad (indeed,
> simply checking to confirm that part of the first 24 positions match the
> specification hardly constitutes a robust identification, but it's something).
> >
> > It will also give you a mimetype too, now.
> >
> > Would any like testing it out more fully on their own files?
> >
> >
> > #
> > # MARC 21 Magic  (Third cut)
> >
> > # Set at position 0
> > 0   bytex
> >
> > # leader position 20-21 must be 45
> >> 20 string  45
> > # leader starts with 5 digits, followed by codes specific to MARC
> > format
> >>> 0 regex/1 (^[0-9]{5})[acdnp][^bhlnqsu-z]  MARC Bibliographic
> > !:mime  application/marc
> >>> 0 regex/1 (^[0-9]{5})[acdnosx][z] MARC Authority
> > !:mime  application/marc
> >>> 0 regex/1 (^[0-9]{5})[cdn][uvxy]  MARC Holdings
> > !:mime  application/marc
> >>> 0 regex/1 (^[0-9]{5})[acdn][w]MARC Classification
> > !:mime  application/marc
> >>> 0 regex/1 (^[0-9]{5})[cdn][q] MARC Community
> > !:mime  application/marc
> >
> > # leader position 22-23, should be "00" but is it?
> >>> 0 regex/1 (^.{21})([^0]{2})   (non-conforming)
> > !:mime  application/marc
> >
> >
> > If this works, I'll see about submitting this copy.  Thanks to all your 
> > efforts
> already.
> >
> > Warmly,
> >
> > Kevin
> >
> > --
> > Library of Congress
> > Network Development and MARC Standards Office
> >
> >
> >
> >
> >
> > 
> > From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
> Simon
> > Spero [s...@unc.edu]
> > Sent: Sunday, April 03, 2011 14:01
> > To: CODE4LIB@LISTSERV.ND.EDU
> > Subject: Re: [CODE4LIB] MARC magic for file
> >
> > I am pretty sure that the marc4j standard reader ignores them; the
> > tolerant reader definitely does. Otherwise JHU might have about two
> > parseable records based on the mangled leaders that J-Rock  gets stuck
> > with :-)
> >
> > An analysis of the ~7M LC bib records from the scriblio.net 

Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread Jonathan Rochkind
Can't you have a legal "MARC" file that does NOT have 4500 in those 
leader positions?  It's just not legal "Marc21", right?   Other marc 
formats may specify or even allow flexibility in the things these bytes 
specify:


* Length of the length-of-field portion
* Number of characters in the starting-character-position portion of a 
Directory entry
* Number of characters in the implementation-defined portion of a 
Directory entry


Or, um, 23, which I guess is left to the specific Marc implementation
(ie, Marc21 is one such) to use for its own purposes.


I have no idea how that should inform the 'marc magic'.

Is mime-type application/marc defined as specifically Marc21, or as any 
Marc?


Jonathan

On 4/6/2011 12:28 PM, Ford, Kevin wrote:

Well, this brings us right up against the issue of files that adhere to their 
specifications versus forgiving applications.  Think of browsers and HTML.  Suffice it to 
say, MARC applications are quite likely to be forgiving of leader positions 20-23.  In my 
non-conforming MARC file and in Bill's, the leader positions 20-21 ("45") 
seemed constant, but things could fall apart for positions 22-23.  So...

I present the following (in-line and attached, to preserve tabs) in an attempt to 
straddle the two sides of this issue: applications forgiving of non-conforming files.  
Should the two characters following 45 (at position 20) *not* be 00, then the 
identification will be noted as "non-conforming."  We could classify this as 
reasonable identification but hardly ironclad (indeed, simply checking to confirm that 
part of the first 24 positions match the specification hardly constitutes a robust 
identification, but it's something).

It will also give you a mimetype too, now.

Would any like testing it out more fully on their own files?


#
# MARC 21 Magic  (Third cut)

# Set at position 0
0	byte	x

# leader position 20-21 must be 45
>20	string	45

# leader starts with 5 digits, followed by codes specific to MARC format
>>0	regex/1	(^[0-9]{5})[acdnp][^bhlnqsu-z]	MARC Bibliographic
!:mime	application/marc
>>0	regex/1	(^[0-9]{5})[acdnosx][z]	MARC Authority
!:mime	application/marc
>>0	regex/1	(^[0-9]{5})[cdn][uvxy]	MARC Holdings
!:mime	application/marc
>>0	regex/1	(^[0-9]{5})[acdn][w]	MARC Classification
!:mime	application/marc
>>0	regex/1	(^[0-9]{5})[cdn][q]	MARC Community
!:mime	application/marc

# leader position 22-23, should be "00" but is it?
>>0	regex/1	(^.{21})([^0]{2})	(non-conforming)
!:mime	application/marc
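
For anyone who wants to try the same checks outside of file(1), here is a rough Python rendering of the rules above. It is a sketch, not a port: it returns the first matching format rather than printing every matching line as file would, and it compares leader positions 22-23 directly instead of reusing the final regex. The names FORMATS and identify are invented for this example.

```python
import re

# Leader starts with 5 digits, then record status (05) and record type (06).
FORMATS = [
    (re.compile(rb"^[0-9]{5}[acdnp][^bhlnqsu-z]"), "MARC Bibliographic"),
    (re.compile(rb"^[0-9]{5}[acdnosx]z"), "MARC Authority"),
    (re.compile(rb"^[0-9]{5}[cdn][uvxy]"), "MARC Holdings"),
    (re.compile(rb"^[0-9]{5}[acdn]w"), "MARC Classification"),
    (re.compile(rb"^[0-9]{5}[cdn]q"), "MARC Community"),
]

def identify(leader: bytes):
    """Return a label for a 24-byte leader, or None if it does not look like MARC."""
    if leader[20:22] != b"45":          # leader position 20-21 must be 45
        return None
    for pattern, label in FORMATS:
        if pattern.match(leader):
            if leader[22:24] != b"00":  # should be "00" but is it?
                label += " (non-conforming)"
            return label
    return None
```

Calling identify(data[:24]) on the first bytes of a file gives a quick way to compare results against what file reports on the same test records.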


If this works, I'll see about submitting this copy.  Thanks to all your efforts 
already.

Warmly,

Kevin

--
Library of Congress
Network Development and MARC Standards Office






From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Simon Spero 
[s...@unc.edu]
Sent: Sunday, April 03, 2011 14:01
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] MARC magic for file

I am pretty sure that the marc4j standard reader ignores them; the tolerant
reader definitely does. Otherwise JHU might have about two parseable records
based on the mangled leaders that J-Rock  gets stuck with :-)

An analysis of the ~7M LC bib records from the scriblio.net data files (~
Dec 2006) indicated that leader  has less than 8 bits of information in it
(shannon-weaver definition). This excludes the initial length value, which
is redundant given the end of record marker.


The LC V'GER adds a pseudo tag 000 to its HTML view of the MARC leader.
  The final characters of the leader are "450".

Also, I object to the phrase "decent MARC tool".  Any tool capable of
dealing with MARC as it exists cannot afford the luxury of decency :-)

[ HA: "A clear conscience?"
  BW: "Yes, Sir Humphrey."
  HA: "When did you acquire this taste for luxuries?"]

Simon

On Fri, Apr 1, 2011 at 5:16 AM, Owen Stephens  wrote:

> "I'm sure any decent MARC tool can deal with them, since decent MARC tools
> are certainly going to be forgiving enough to deal with four characters
> that apparently don't even really matter."
>
> You say that, but I'm pretty sure Marc4J throws errors on MARC records where
> these characters are incorrect
>
> Owen
>
> On Fri, Apr 1, 2011 at 3:51 AM, William Denton  wrote:
>
> > On 28 March 2011, Ford, Kevin wrote:
> >
> >> I couldn't get Simon's MARC 21 Magic file to work.  Among other issues,
> >> I received "line too long" errors.  But, since I've been curious about
> >> this for some time, I figured I'd take a whack at it myself.  Try this:
> >
> > This is very nice!  Thanks.  I tried it on a bunch of MARC files I have,
> > and it recognized almost all of them.  A few it didn't, so I had a closer
> > look, and they're invalid.
> >
> > For

Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread Ford, Kevin
Well, this brings us right up against the issue of files that adhere to their 
specifications versus forgiving applications.  Think of browsers and HTML.  
Suffice it to say, MARC applications are quite likely to be forgiving of leader 
positions 20-23.  In my non-conforming MARC file and in Bill's, the leader 
positions 20-21 ("45") seemed constant, but things could fall apart for 
positions 22-23.  So...

I present the following (in-line and attached, to preserve tabs) in an attempt 
to straddle the two sides of this issue: applications forgiving of 
non-conforming files.  Should the two characters following 45 (at position 20) 
*not* be 00, then the identification will be noted as "non-conforming."  We 
could classify this as reasonable identification but hardly ironclad (indeed, 
simply checking to confirm that part of the first 24 positions match the 
specification hardly constitutes a robust identification, but it's something).

It will now give you a MIME type, too.

Would anyone like to test it more fully on their own files?


#
# MARC 21 Magic  (Third cut)

# Set at position 0
0   byte    x

# leader position 20-21 must be 45
>20 string  45  

# leader starts with 5 digits, followed by codes specific to MARC format
>>0 regex/1 (^[0-9]{5})[acdnp][^bhlnqsu-z]  MARC Bibliographic
!:mime  application/marc
>>0 regex/1 (^[0-9]{5})[acdnosx][z] MARC Authority
!:mime  application/marc
>>0 regex/1 (^[0-9]{5})[cdn][uvxy]  MARC Holdings
!:mime  application/marc
>>0 regex/1 (^[0-9]{5})[acdn][w]    MARC Classification
!:mime  application/marc
>>0 regex/1 (^[0-9]{5})[cdn][q] MARC Community
!:mime  application/marc

# leader position 22-23, should be "00" but is it?
>>0 regex/1 (^.{21})([^0]{2})   (non-conforming)
!:mime  application/marc
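
For anyone who wants to try the same rules outside of file, here is a rough Python sketch of the third cut. The names LEADER_RULES and classify_leader are made up for this example. It also tests leader positions 22-23 directly instead of translating the last regex, so it is an approximation of the magic above and may not agree with file in every case.

```python
import re

# Record-type patterns copied from the third-cut magic; the labels are the
# same strings the magic prints. LEADER_RULES is a name invented here.
LEADER_RULES = [
    (re.compile(r"^[0-9]{5}[acdnp][^bhlnqsu-z]"), "MARC Bibliographic"),
    (re.compile(r"^[0-9]{5}[acdnosx][z]"), "MARC Authority"),
    (re.compile(r"^[0-9]{5}[cdn][uvxy]"), "MARC Holdings"),
    (re.compile(r"^[0-9]{5}[acdn][w]"), "MARC Classification"),
    (re.compile(r"^[0-9]{5}[cdn][q]"), "MARC Community"),
]


def classify_leader(leader):
    """Describe a 24-character MARC leader, or return None if it does not match."""
    # Leader positions 20-21 must be "45".
    if len(leader) < 24 or leader[20:22] != "45":
        return None
    for pattern, label in LEADER_RULES:
        if pattern.match(leader):
            # Positions 22-23 should be "00"; flag the record if they are not.
            if leader[22:24] != "00":
                label += " (non-conforming)"
            return label
    return None


if __name__ == "__main__":
    for sample in ("01008nam  2200289ua 4500", "01812cas  2200457   45x0"):
        print(repr(sample), "->", classify_leader(sample))
```

The two sample leaders are taken from the Binghamton files Bill posted earlier in the thread.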


If this works, I'll see about submitting this copy.  Thanks to all your efforts 
already.

Warmly,

Kevin

--
Library of Congress
Network Development and MARC Standards Office






From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Simon Spero 
[s...@unc.edu]
Sent: Sunday, April 03, 2011 14:01
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] MARC magic for file

I am pretty sure that the marc4j standard reader ignores them; the tolerant
reader definitely does. Otherwise JHU might have about two parseable records
based on the mangled leaders that J-Rock  gets stuck with :-)

An analysis of the ~7M LC bib records from the scriblio.net data files (~
Dec 2006) indicated that leader  has less than 8 bits of information in it
(shannon-weaver definition). This excludes the initial length value, which
is redundant given the end of record marker.


The LC V'GER adds a pseudo tag 000 to its HTML view of the MARC leader.
 The final characters of the leader are "450".

Also, I object to the phrase "decent MARC tool".  Any tool capable of
dealing with MARC as it exists cannot afford the luxury of decency :-)

[ HA: "A clear conscience?"
 BW: "Yes, Sir Humphrey."
 HA: "When did you acquire this taste for luxuries?"]

Simon

On Fri, Apr 1, 2011 at 5:16 AM, Owen Stephens  wrote:

> "I'm sure any decent MARC tool can deal with them, since decent MARC tools
> are certainly going to be forgiving enough to deal with four characters
> that apparently don't even really matter."
>
> You say that, but I'm pretty sure Marc4J throws errors on MARC records where
> these characters are incorrect
>
> Owen
>
> On Fri, Apr 1, 2011 at 3:51 AM, William Denton  wrote:
>
> > On 28 March 2011, Ford, Kevin wrote:
> >
> >> I couldn't get Simon's MARC 21 Magic file to work.  Among other issues,
> >> I received "line too long" errors.  But, since I've been curious about
> >> this for some time, I figured I'd take a whack at it myself.  Try this:
> >>
> >
> > This is very nice!  Thanks.  I tried it on a bunch of MARC files I have,
> > and it recognized almost all of them.  A few it didn't, so I had a closer
> > look, and they're invalid.
> >
> > For example, the Internet Archive's Binghamton catalogue dump:
> >
> > http://ia600307.us.archive.org/6/items/marc_binghamton_univ/
> >
> > $ file -m marc.magic bgm*mrc
> > bgm_openlib_final_0-5.mrc:   data
> > bgm_openlib_final_10-15.mrc: MARC Bibliographic
> > bgm_openlib_final_15-18.mrc: data
> > bgm_openlib_final_5-10.mrc:  MARC Bibliographic
> >
> > But why?  Aha:
> >
> > $ head -c 25 bgm_openlib_final_*mrc
> > ==> bgm_openlib_final_0-5.mrc <==
> > 01812cas  2200457   

Re: [CODE4LIB] MARC magic for file

2011-04-03 Thread Simon Spero
I am pretty sure that the marc4j standard reader ignores them; the tolerant
reader definitely does. Otherwise JHU might have about two parseable records
based on the mangled leaders that J-Rock  gets stuck with :-)

An analysis of the ~7M LC bib records from the scriblio.net data files (~
Dec 2006) indicated that leader  has less than 8 bits of information in it
(shannon-weaver definition). This excludes the initial length value, which
is redundant given the end of record marker.
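
For the curious, a minimal Python sketch of that kind of measurement follows. It assumes the figure is the Shannon entropy of the distribution of whole leaders with the five-digit length dropped; the function names are invented for the sketch, and the exact procedure used for the LC analysis is not spelled out here.

```python
import math
from collections import Counter


def shannon_entropy_bits(values):
    """Entropy, in bits, of the empirical distribution of the given values."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def leader_entropy_bits(leaders):
    """Entropy of leaders after dropping positions 0-4 (the record length)."""
    return shannon_entropy_bits(leader[5:24] for leader in leaders)


if __name__ == "__main__":
    # Toy input only; a real run would feed in one leader per record.
    sample = [
        "01008nam  2200289ua 4500",
        "00887nam  2200265ua 4500",
        "01812cas  2200457   4500",
        "00950cas  2200301   4500",
    ]
    print(leader_entropy_bits(sample))
```

With millions of records the result is bounded by the log of the number of distinct leader values, which is why a 24-byte field can carry so few bits in practice.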


The LC V'GER adds a pseudo tag 000 to its HTML view of the MARC leader.
 The final characters of the leader are "450".

Also, I object to the phrase "decent MARC tool".  Any tool capable of
dealing with MARC as it exists cannot afford the luxury of decency :-)

[ HA: "A clear conscience?"
 BW: "Yes, Sir Humphrey."
 HA: "When did you acquire this taste for luxuries?"]

Simon

On Fri, Apr 1, 2011 at 5:16 AM, Owen Stephens  wrote:

> "I'm sure any decent MARC tool can deal with them, since decent MARC tools
> are certainly going to be forgiving enough to deal with four characters
> that apparently don't even really matter."
>
> You say that, but I'm pretty sure Marc4J throws errors on MARC records where
> these characters are incorrect
>
> Owen
>
> On Fri, Apr 1, 2011 at 3:51 AM, William Denton  wrote:
>
> > On 28 March 2011, Ford, Kevin wrote:
> >
> >> I couldn't get Simon's MARC 21 Magic file to work.  Among other issues,
> >> I received "line too long" errors.  But, since I've been curious about
> >> this for some time, I figured I'd take a whack at it myself.  Try this:
> >>
> >
> > This is very nice!  Thanks.  I tried it on a bunch of MARC files I have,
> > and it recognized almost all of them.  A few it didn't, so I had a closer
> > look, and they're invalid.
> >
> > For example, the Internet Archive's Binghamton catalogue dump:
> >
> > http://ia600307.us.archive.org/6/items/marc_binghamton_univ/
> >
> > $ file -m marc.magic bgm*mrc
> > bgm_openlib_final_0-5.mrc:   data
> > bgm_openlib_final_10-15.mrc: MARC Bibliographic
> > bgm_openlib_final_15-18.mrc: data
> > bgm_openlib_final_5-10.mrc:  MARC Bibliographic
> >
> > But why?  Aha:
> >
> > $ head -c 25 bgm_openlib_final_*mrc
> > ==> bgm_openlib_final_0-5.mrc <==
> > 01812cas  2200457   45x00
> > ==> bgm_openlib_final_10-15.mrc <==
> > 01008nam  2200289ua 45000
> > ==> bgm_openlib_final_15-18.mrc <==
> > 01614cam00385   45  0
> > ==> bgm_openlib_final_5-10.mrc <==
> > 00887nam  2200265v  45000
> >
> > As you say, the leader should end with 4500 (as defined at
> > http://www.loc.gov/marc/authority/adleader.html) but two of those files
> > don't.  So they're not valid MARC.  I'm sure any decent MARC tool can
> > deal with them, since decent MARC tools are certainly going to be
> > forgiving enough to deal with four characters that apparently don't even
> > really matter.
> >
> > So on the one hand they're usable MARC but file wouldn't say so, and on
> > the other that's a good indication that the files have failed a basic
> > validity test.  I wonder if there are similar situations for JPEGs or MP3s.
> >
> > I think you should definitely submit this for inclusion in the magic
> > file.  It would be very useful for us all!
> >
> > Bill
> >
> > P.S. I'd never used head -c (to show a fixed number of bytes) before.
> > Always nice to find a new useful option to an old command.
> >
> >
> >> #
> >> # MARC 21 Magic  (Second cut)
> >>
> >> # Set at position 0
> >> 0   short   >0x
> >>
> >> # leader ends with 4500
> >> >20 string  4500
> >>
> >> # leader starts with 5 digits, followed by codes specific to MARC format
> >> >>0 regex/1 (^[0-9]{5})[acdnp][^bhlnqsu-z]  MARC Bibliographic
> >> >>0 regex/1 (^[0-9]{5})[acdnosx][z] MARC Authority
> >> >>0 regex/1 (^[0-9]{5})[cdn][uvxy]  MARC Holdings
> >> >>0 regex/1 (^[0-9]{5})[acdn][w]    MARC Classification
> >> >>0 regex/1 (^[0-9]{5})[cdn][q] MARC Community
> >>
> >
> > --
> > William Denton, Toronto : miskatonic.org www.frbr.org openfrbr.org
> >
>
>
>
> --
> Owen Stephens
> Owen Stephens Consulting
> Web: http://www.ostephens.com
> Email: o...@ostephens.com
>


Re: [CODE4LIB] MARC magic for file

2011-04-01 Thread Owen Stephens
"I'm sure any decent MARC tool can deal with them, since decent MARC tools
are certainly going to be forgiving enough to deal with four characters that
apparently don't even really matter."

You say that, but I'm pretty sure Marc4J throws errors on MARC records where
these characters are incorrect

Owen

On Fri, Apr 1, 2011 at 3:51 AM, William Denton  wrote:

> On 28 March 2011, Ford, Kevin wrote:
>
>> I couldn't get Simon's MARC 21 Magic file to work.  Among other issues, I
>> received "line too long" errors.  But, since I've been curious about this
>> for some time, I figured I'd take a whack at it myself.  Try this:
>>
>
> This is very nice!  Thanks.  I tried it on a bunch of MARC files I have,
> and it recognized almost all of them.  A few it didn't, so I had a closer
> look, and they're invalid.
>
> For example, the Internet Archive's Binghamton catalogue dump:
>
> http://ia600307.us.archive.org/6/items/marc_binghamton_univ/
>
> $ file -m marc.magic bgm*mrc
> bgm_openlib_final_0-5.mrc:   data
> bgm_openlib_final_10-15.mrc: MARC Bibliographic
> bgm_openlib_final_15-18.mrc: data
> bgm_openlib_final_5-10.mrc:  MARC Bibliographic
>
> But why?  Aha:
>
> $ head -c 25 bgm_openlib_final_*mrc
> ==> bgm_openlib_final_0-5.mrc <==
> 01812cas  2200457   45x00
> ==> bgm_openlib_final_10-15.mrc <==
> 01008nam  2200289ua 45000
> ==> bgm_openlib_final_15-18.mrc <==
> 01614cam00385   45  0
> ==> bgm_openlib_final_5-10.mrc <==
> 00887nam  2200265v  45000
>
> As you say, the leader should end with 4500 (as defined at
> http://www.loc.gov/marc/authority/adleader.html) but two of those files
> don't.  So they're not valid MARC.  I'm sure any decent MARC tool can deal
> with them, since decent MARC tools are certainly going to be forgiving
> enough to deal with four characters that apparently don't even really
> matter.
>
> So on the one hand they're usable MARC but file wouldn't say so, and on the
> other that's a good indication that the files have failed a basic validity
> test.  I wonder if there are similar situations for JPEGs or MP3s.
>
> I think you should definitely submit this for inclusion in the magic file.
> It would be very useful for us all!
>
> Bill
>
> P.S. I'd never used head -c (to show a fixed number of bytes) before.
> Always nice to find a new useful option to an old command.
>
>
>> #
>> # MARC 21 Magic  (Second cut)
>>
>> # Set at position 0
>> 0   short   >0x
>>
>> # leader ends with 4500
>> >20 string  4500
>>
>> # leader starts with 5 digits, followed by codes specific to MARC format
>> >>0 regex/1 (^[0-9]{5})[acdnp][^bhlnqsu-z]  MARC Bibliographic
>> >>0 regex/1 (^[0-9]{5})[acdnosx][z] MARC Authority
>> >>0 regex/1 (^[0-9]{5})[cdn][uvxy]  MARC Holdings
>> >>0 regex/1 (^[0-9]{5})[acdn][w]    MARC Classification
>> >>0 regex/1 (^[0-9]{5})[cdn][q] MARC Community
>>
>
> --
> William Denton, Toronto : miskatonic.org www.frbr.org openfrbr.org
>



-- 
Owen Stephens
Owen Stephens Consulting
Web: http://www.ostephens.com
Email: o...@ostephens.com


Re: [CODE4LIB] MARC magic for file

2011-03-31 Thread William Denton

On 28 March 2011, Ford, Kevin wrote:

> I couldn't get Simon's MARC 21 Magic file to work.  Among other issues,
> I received "line too long" errors.  But, since I've been curious about
> this for some time, I figured I'd take a whack at it myself.  Try this:


This is very nice!  Thanks.  I tried it on a bunch of MARC files I have, 
and it recognized almost all of them.  A few it didn't, so I had a closer 
look, and they're invalid.


For example, the Internet Archive's Binghamton catalogue dump:

http://ia600307.us.archive.org/6/items/marc_binghamton_univ/

$ file -m marc.magic bgm*mrc
bgm_openlib_final_0-5.mrc:   data
bgm_openlib_final_10-15.mrc: MARC Bibliographic
bgm_openlib_final_15-18.mrc: data
bgm_openlib_final_5-10.mrc:  MARC Bibliographic

But why?  Aha:

$ head -c 25 bgm_openlib_final_*mrc
==> bgm_openlib_final_0-5.mrc <==
01812cas  2200457   45x00
==> bgm_openlib_final_10-15.mrc <==
01008nam  2200289ua 45000
==> bgm_openlib_final_15-18.mrc <==
01614cam00385   45  0
==> bgm_openlib_final_5-10.mrc <==
00887nam  2200265v  45000

As you say, the leader should end with 4500 (as defined at 
http://www.loc.gov/marc/authority/adleader.html) but two of those files 
don't.  So they're not valid MARC.  I'm sure any decent MARC tool can deal 
with them, since decent MARC tools are certainly going to be forgiving 
enough to deal with four characters that apparently don't even really 
matter.
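
The same check that head -c made visible can be scripted. This is only a sketch: check_leader_end and read_leader are names I made up, and all it tests is whether leader positions 20-23 are "4500", nothing more.

```python
def read_leader(path):
    """Read the first 24 bytes of a file, i.e. the leader of its first record."""
    with open(path, "rb") as handle:
        return handle.read(24)


def check_leader_end(data):
    """Report whether leader positions 20-23 hold the expected "4500"."""
    leader = data[:24]
    if len(leader) < 24:
        return "too short"
    tail = leader[20:24].decode("ascii", errors="replace")
    if tail == "4500":
        return "ok"
    return "unexpected entry map " + repr(tail)


if __name__ == "__main__":
    # The first 25 bytes of two of the Binghamton files, as shown above.
    for sample in (b"01008nam  2200289ua 45000", b"01812cas  2200457   45x00"):
        print(sample[:24], "->", check_leader_end(sample))
```

To run it over real files, pass each result of read_leader(path) to check_leader_end.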


So on the one hand they're usable MARC but file wouldn't say so, and on 
the other that's a good indication that the files have failed a basic 
validity test.  I wonder if there are similar situations for JPEGs or 
MP3s.


I think you should definitely submit this for inclusion in the magic file. 
It would be very useful for us all!


Bill

P.S. I'd never used head -c (to show a fixed number of bytes) before. 
Always nice to find a new useful option to an old command.



> #
> # MARC 21 Magic  (Second cut)
>
> # Set at position 0
> 0   short   >0x
>
> # leader ends with 4500
> >20 string  4500
>
> # leader starts with 5 digits, followed by codes specific to MARC format
> >>0 regex/1 (^[0-9]{5})[acdnp][^bhlnqsu-z]  MARC Bibliographic
> >>0 regex/1 (^[0-9]{5})[acdnosx][z] MARC Authority
> >>0 regex/1 (^[0-9]{5})[cdn][uvxy]  MARC Holdings
> >>0 regex/1 (^[0-9]{5})[acdn][w]    MARC Classification
> >>0 regex/1 (^[0-9]{5})[cdn][q] MARC Community



--
William Denton, Toronto : miskatonic.org www.frbr.org openfrbr.org


Re: [CODE4LIB] MARC magic for file

2011-03-28 Thread Ford, Kevin
I couldn't get Simon's MARC 21 Magic file to work.  Among other issues, I 
received "line too long" errors.  But, since I've been curious about this for 
some time, I figured I'd take a whack at it myself.  Try this:

#
# MARC 21 Magic  (Second cut)

# Set at position 0
0   short   >0x 

# leader ends with 4500
>20 string  4500

# leader starts with 5 digits, followed by codes specific to MARC format
>>0 regex/1 (^[0-9]{5})[acdnp][^bhlnqsu-z]  MARC Bibliographic
>>0 regex/1 (^[0-9]{5})[acdnosx][z] MARC Authority
>>0 regex/1 (^[0-9]{5})[cdn][uvxy]  MARC Holdings
>>0 regex/1 (^[0-9]{5})[acdn][w]    MARC Classification
>>0 regex/1 (^[0-9]{5})[cdn][q] MARC Community

I've also attached it to this email to preserve the tabs.  

In any event, I can confirm it works on MARC Bib, MARC Authority, and MARC 
Classification files I have bumping around my computer.  I've not tested it on 
MARC Holdings and MARC Community.

Do let us/me know if it works for you (and the community generally).  I can see 
about submitting it for formal inclusion in the magic file.

Warmly,

Kevin

--
Library of Congress
Network Development and MARC Standards Office




From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Simon Spero 
[s...@unc.edu]
Sent: Thursday, March 24, 2011 12:28
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] MARC magic for file

Some of the problems in your first cut are:

1. Offsets for regex are given in terms of lines.  MARC files don't have
newlines in them, unless you're Millennium, in which case they can be
inserted every 200,000 bytes to keep things interesting.
2.  Byte matches match byte values, so "20 byte 4"   is looking for the
binary value, not the ascii digit.
3.  Sometimes you need to prime the buffer before you can do a regexp match.

Is this good enough?


# MARC 21 Magic  (First cut)
#  indicator count must be "2"
10 string 2
#  leader must end in "4500"
>20 string 4500
#  leader must start with five digits, a record status, and a record type
>0 regex ^([0-9]{5})[acdnp][acdefgijkmoprt][abcims] MARC Bibliographic
>0 regex ^([0-9]{5})[acdnp][z] MARC Authority

Simon


On Wed, Mar 23, 2011 at 8:09 PM, William Denton  wrote:

> Has anyone figured out the magic necessary for file to recognize MARC
> files?
>
> If you don't know it, file is a Unix command that tells you what kind of
> file a file is.  For example:
>
> $ file 101015_001.mp3
> 101015_001.mp3: Audio file with ID3 version 2.3.0, contains: MPEG ADTS,
> layer III, v1, 192 kbps, 44.1 kHz, Stereo
>
> $ file P126.jpg
> P126.jpg: JPEG image data, EXIF standard, comment: "AppleMark"
>
> It's a really useful command.  I assume it's on OSX, but I don't know. You
> can get it for Windows with Cygwin.
>
> The problem is, file doesn't grok MARC:
>
> $ file catalog.01.mrc
> catalog.01.mrc: data
>
> I took a stab at getting the magic defined, but it didn't work.  I'll
> include what I used below.  You can put it into a magic.txt file, and then
> use
>
> file -m magic.txt some_file.mrc
>
> to test it.  It'll tell you the file is MARC Bibliographic ... but it also
> thinks that PDFs, JPEGs, and text files are MARC.  That's no good.
>
> It'd be great if the MARC magic got into the central magic database so
> everyone would be able to recognize various MARC file types.
>
> Bill
>
>
> # --- clip'n'test
> # MARC 21 for Bibliographic Data
> # http://www.loc.gov/marc/bibliographic/bdleader.html
> #
> # This doesn't work properly
>
> 0      string  x
>
> >5     regex   [acdnp]
> >6     regex   [acdefgijkmoprt]
> >7     regex   [abcims]
> >8     regex   [\ a]
> >9     regex   [\ a]
> >10    byte    x
> >11    byte    x
> >12    string  x
> >17    regex   [\ 12345678uz]
> >18    regex   [\ aciu]
> >19    regex   [\ abc] MARC Bibliographic
>
> #>20   byte 4
> #>21   byte 5
> #>22   byte 0
> #>23   byte 0   MARC Bibliographic
>
> # --- end clip'n'test
>
> --
> William Denton, Toronto : miskatonic.org www.frbr.org openfrbr.org
>




Re: [CODE4LIB] MARC magic for file

2011-03-24 Thread Simon Spero
Some of the problems in your first cut are:

1. Offsets for regex are given in terms of lines.  MARC files don't have
newlines in them, unless you're Millennium, in which case they can be
inserted every 200,000 bytes to keep things interesting.
2.  Byte matches match byte values, so "20 byte 4"   is looking for the
binary value, not the ascii digit.
3.  Sometimes you need to prime the buffer before you can do a regexp match.
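
To make point 2 concrete, here is a small Python sketch of the difference between a byte test and a string test. The helper names byte_test and string_test are invented for the illustration and only imitate what the magic tests compare; they are not part of file.

```python
def byte_test(data, offset, value):
    """Imitate a magic 'byte' test: compare one raw byte with a number."""
    return len(data) > offset and data[offset] == value


def string_test(data, offset, text):
    """Imitate a magic 'string' test: compare bytes with literal ASCII text."""
    expected = text.encode("ascii")
    return data[offset:offset + len(expected)] == expected


if __name__ == "__main__":
    leader = b"01008nam  2200289ua 4500"
    # The number 4 is not the character "4", so this test fails.
    print(byte_test(leader, 20, 4))
    # 0x34 is the ASCII code for "4", so this one succeeds.
    print(byte_test(leader, 20, ord("4")))
    # A string test compares against the text itself.
    print(string_test(leader, 20, "4500"))
```

That is why the string test on "4500" at offset 20 is used in the first cut below.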

Is this good enough?


# MARC 21 Magic  (First cut)
#  indicator count must be "2"
10 string 2
#  leader must end in "4500"
>20 string 4500
#  leader must start with five digits, a record status, and a record type
>0 regex ^([0-9]{5})[acdnp][acdefgijkmoprt][abcims] MARC Bibliographic
>0 regex ^([0-9]{5})[acdnp][z] MARC Authority

Simon


On Wed, Mar 23, 2011 at 8:09 PM, William Denton  wrote:

> Has anyone figured out the magic necessary for file to recognize MARC
> files?
>
> If you don't know it, file is a Unix command that tells you what kind of
> file a file is.  For example:
>
> $ file 101015_001.mp3
> 101015_001.mp3: Audio file with ID3 version 2.3.0, contains: MPEG ADTS,
> layer III, v1, 192 kbps, 44.1 kHz, Stereo
>
> $ file P126.jpg
> P126.jpg: JPEG image data, EXIF standard, comment: "AppleMark"
>
> It's a really useful command.  I assume it's on OSX, but I don't know. You
> can get it for Windows with Cygwin.
>
> The problem is, file doesn't grok MARC:
>
> $ file catalog.01.mrc
> catalog.01.mrc: data
>
> I took a stab at getting the magic defined, but it didn't work.  I'll
> include what I used below.  You can put it into a magic.txt file, and then
> use
>
> file -m magic.txt some_file.mrc
>
> to test it.  It'll tell you the file is MARC Bibliographic ... but it also
> thinks that PDFs, JPEGs, and text files are MARC.  That's no good.
>
> It'd be great if the MARC magic got into the central magic database so
> everyone would be able to recognize various MARC file types.
>
> Bill
>
>
> # --- clip'n'test
> # MARC 21 for Bibliographic Data
> # http://www.loc.gov/marc/bibliographic/bdleader.html
> #
> # This doesn't work properly
>
> 0      string  x
>
> >5     regex   [acdnp]
> >6     regex   [acdefgijkmoprt]
> >7     regex   [abcims]
> >8     regex   [\ a]
> >9     regex   [\ a]
> >10    byte    x
> >11    byte    x
> >12    string  x
> >17    regex   [\ 12345678uz]
> >18    regex   [\ aciu]
> >19    regex   [\ abc] MARC Bibliographic
>
> #>20   byte 4
> #>21   byte 5
> #>22   byte 0
> #>23   byte 0   MARC Bibliographic
>
> # --- end clip'n'test
>
> --
> William Denton, Toronto : miskatonic.org www.frbr.org openfrbr.org
>


[CODE4LIB] MARC magic for file

2011-03-23 Thread William Denton
Has anyone figured out the magic necessary for file to recognize MARC 
files?


If you don't know it, file is a Unix command that tells you what kind of 
file a file is.  For example:


$ file 101015_001.mp3
101015_001.mp3: Audio file with ID3 version 2.3.0, contains: MPEG ADTS, layer 
III, v1, 192 kbps, 44.1 kHz, Stereo

$ file P126.jpg
P126.jpg: JPEG image data, EXIF standard, comment: "AppleMark"

It's a really useful command.  I assume it's on OSX, but I don't know. 
You can get it for Windows with Cygwin.


The problem is, file doesn't grok MARC:

$ file catalog.01.mrc
catalog.01.mrc: data

I took a stab at getting the magic defined, but it didn't work.  I'll 
include what I used below.  You can put it into a magic.txt file, and then 
use


file -m magic.txt some_file.mrc

to test it.  It'll tell you the file is MARC Bibliographic ... but it also 
thinks that PDFs, JPEGs, and text files are MARC.  That's no good.


It'd be great if the MARC magic got into the central magic database so 
everyone would be able to recognize various MARC file types.


Bill


# --- clip'n'test
# MARC 21 for Bibliographic Data
# http://www.loc.gov/marc/bibliographic/bdleader.html
#
# This doesn't work properly

0      string  x

>5     regex   [acdnp]
>6     regex   [acdefgijkmoprt]
>7     regex   [abcims]
>8     regex   [\ a]
>9     regex   [\ a]
>10    byte    x
>11    byte    x
>12    string  x
>17    regex   [\ 12345678uz]
>18    regex   [\ aciu]
>19    regex   [\ abc] MARC Bibliographic

#>20   byte  4
#>21   byte  5
#>22   byte  0
#>23   byte  0   MARC Bibliographic

# --- end clip'n'test

--
William Denton, Toronto : miskatonic.org www.frbr.org openfrbr.org