Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread Reese, Terry
Actually, you can have MARC21 records coming out of vendor databases (vendors sometimes embed control characters in the leader) that are still valid. Once you stop looking at just your ILS or OCLC, you probably wouldn't be surprised to find that records start looking very different. --TR

Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread Jonathan Rochkind
I'm not sure what you mean Terry. Maybe we have different understandings of valid. If leader bytes 20-23 are not 4500, I suggest that is _by definition_ not a valid Marc21 file. It violates the Marc21 specification. Now, they may still be _usable_, by software that ignores these bytes
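
For readers following along, the check Jonathan describes can be sketched roughly as below. This is only an illustration: the function name `entry_map_is_standard` and the sample leader are made up for the example, and whether a failure should count as "invalid" or merely be ignored is exactly what the thread is debating.

```python
def entry_map_is_standard(leader: bytes) -> bool:
    """Return True if leader positions 20-23 hold the MARC 21 entry map '4500'.

    Hypothetical helper for illustration; it checks nothing else in the leader.
    """
    if len(leader) < 24:
        # A MARC leader is always 24 bytes; anything shorter cannot be checked.
        return False
    return leader[20:24] == b"4500"


# Made-up 24-byte leader used only to exercise the function.
sample_leader = b"01089cam a2200289 a 4500"
print(entry_map_is_standard(sample_leader))
```

A strict reader would reject a record when this returns False; a lenient one would log it and carry on.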

Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread Prettyman, Timothy
Just as a historical note, this non-standard use of LDR/22 is likely due to OCLC's use of the character as a hexadecimal flag from back in the days when MARC records were mostly schlepped around on tapes. They referred to it as the Transaction type code. When records were sent to OCLC for

Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread Reese, Terry
Actually -- I'd disagree because that is a very narrow view of the specification. When validating MARC, I'd take the approach to validate structure (which allows you to then read any MARC format) -- then use a separate process for validating content of fields, which in my opinion, is more

Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread William Denton
On 6 April 2011, Reese, Terry wrote: Actually -- I'd disagree because that is a very narrow view of the specification. When validating MARC, I'd take the approach to validate structure (which allows you to then read any MARC format) -- then use a separate process for validating content of

Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread Reese, Terry
I'm honestly not familiar with magic. I can tell you that in MarcEdit, the process works like this: there is a very generic function that reads the structure of the data without trusting the information in the leader (since I find this data very unreliable). Then, if users want to apply a set of
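
One way to read structure without trusting the leader is to split on the record terminator byte (0x1D) instead of using the declared record length. The sketch below is an assumption about how such a reader could look, not MarcEdit's actual code; `split_records` is a hypothetical name.

```python
from typing import List

RECORD_TERMINATOR = b"\x1d"  # ends every MARC record


def split_records(data: bytes) -> List[bytes]:
    """Split a MARC file on the record terminator, ignoring leader lengths.

    Hypothetical sketch: stray CR/LF bytes between records are dropped, and
    any bytes after the final terminator are discarded as incomplete.
    """
    chunks = data.split(RECORD_TERMINATOR)
    records = []
    for chunk in chunks[:-1]:  # the last chunk follows the final terminator
        chunk = chunk.lstrip(b"\r\n")
        if chunk:
            records.append(chunk + RECORD_TERMINATOR)
    return records


# Two fake "records" separated by a stray CR/LF.
data = b"AAA\x1e\x1d" + b"\r\n" + b"BBB\x1e\x1d"
print(len(split_records(data)))
```

Content checks on individual fields can then run as a separate pass over each returned record.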

[CODE4LIB] Fwd: OAC RFP Announcement

2011-04-06 Thread Robert Sanderson
Forwarded: The Open Annotation Collaboration (OAC) project is pleased to announce a Request For Proposal to collaborate with OAC researchers for building implementations of the OAC data model and ontology. The OAC is seeking to collaborate with scholars and/or librarians currently using and/or

Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread Kyle Banerjee
.. Maybe we have different understandings of valid. If leader bytes 20-23 are not 4500, I suggest that is _by definition_ not a valid Marc21 file. It violates the Marc21 specification. Now, they may still be _usable_, by software that ignores these bytes anyway or works around them. We

Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread Jonathan Rochkind
Actually -- I'd disagree because that is a very narrow view of the specification. When validating MARC, I'd take the approach to validate structure (which allows you to then read any MARC format) -- then use a separate process for validating content of fields, which in my opinion, is more open

Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread Jonathan Rochkind
On 4/6/2011 2:02 PM, Kyle Banerjee wrote: I'd go so far as to question the value of validating redundant data that theoretically has meaning but which are never supposed to vary. The 4 and the 5 simply repeat what is already known about the structure of the MARC record. Choking on stuff like

Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread William Denton
On 6 April 2011, Jonathan Rochkind wrote: I think we computer programmers are really better-served by reserving the notion of validity for things specified by formal specifications -- as we normally do, talking about any other data format. And the only formal specifications I can find for

Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread Jonathan Rochkind
On 4/6/2011 2:43 PM, William Denton wrote: Validity does mean something definite ... but Postel's Law is a good guideline, especially with the swamp of bad MARC, old MARC, alternate MARC, that's out there. Valid MARC is valid MARC, but if---for the sake of file and its magic---we can identify

Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread Kyle Banerjee
Well, the problem is when the original Marc4J author took the spec at its word, and actually _acted upon_ the '4' and the '5', changing file semantics if they were different, and throwing an exception if it was a non-digit. At least the author actually used the values rather than checking to

[CODE4LIB] utf8 \xC2 does not map to Unicode

2011-04-06 Thread Eric Lease Morgan
Ack! While using the venerable Perl MARC::Batch module I get the following error while trying to read a MARC record: utf8 \xC2 does not map to Unicode This is a real pain, and I'm hoping someone here can help me either: 1) trap this error allowing me to move on, or 2) figure out how to open

Re: [CODE4LIB] utf8 \xC2 does not map to Unicode

2011-04-06 Thread Jonathan Rochkind
I am not familiar with that Perl module. But I'm more familiar than I'd want with char encoding in Marc. I don't recognize the bytes 0xC2 (there are some bytes I became pathetically familiar with in past debugging, but I've forgotten em), but the first things to look at: 1. Is your Marc file

Re: [CODE4LIB] utf8 \xC2 does not map to Unicode

2011-04-06 Thread LeVan,Ralph
Can you share the record somewhere? I suspect many of us have tools we can turn loose on it. Ralph

[CODE4LIB] **SKOS-2-HIVE: CREATING SKOS VOCABULARIES TO HELP INTERDISCIPLINARY VOCABULARY ENGINEERING**

2011-04-06 Thread Kevin S. Clarke
Forwarding because I think this will be of interest to some folks on the list... -- Forwarded message -- ***SKOS-2-HIVE: CREATING SKOS VOCABULARIES TO HELP INTERDISCIPLINARY VOCABULARY ENGINEERING*** We are pleased to announce the addition of more HIVE workshops! *DATES

Re: [CODE4LIB] utf8 \xC2 does not map to Unicode

2011-04-06 Thread Eric Lease Morgan
On Apr 6, 2011, at 4:46 PM, LeVan,Ralph wrote: Ack! While using the venerable Perl MARC::Batch module I get the following error while trying to read a MARC record: utf8 \xC2 does not map to Unicode Can you share the record somewhere? I suspect many of us have tools we can turn loose

Re: [CODE4LIB] MARC magic for file

2011-04-06 Thread Mike Taylor
On 6 April 2011 19:53, Jonathan Rochkind rochk...@jhu.edu wrote: On 4/6/2011 2:43 PM, William Denton wrote: Validity does mean something definite ... but Postel's Law is a good guideline, especially with the swamp of bad MARC, old MARC, alternate MARC, that's out there. Valid MARC is valid

Re: [CODE4LIB] utf8 \xC2 does not map to Unicode

2011-04-06 Thread Reese, Terry
I'd echo Jonathan's question -- the 0xC2 code is the sound recording marker in MARC-8. I'd guess the file isn't in UTF8. --TR

Re: [CODE4LIB] utf8 \xC2 does not map to Unicode

2011-04-06 Thread LeVan,Ralph
Lol! So right off the bat I see that the leader says the record is 1091 bytes long, but it is actually 1089 bytes long and I end up missing the leader for the next record. Maybe a CR/LF problem? I see that frequently as a way to mangle MARC records when moving them around. Is your problem in
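
The mismatch Ralph describes (leader says 1091, actual length 1089) can be detected by comparing leader positions 0-4 with the distance to the record terminator. A minimal sketch, assuming the length field contains only ASCII digits (otherwise `int` raises ValueError); `declared_and_actual_length` is a hypothetical name.

```python
from typing import Tuple


def declared_and_actual_length(record: bytes) -> Tuple[int, int]:
    """Return (length declared in leader 0-4, length up to and including 0x1D).

    Hypothetical helper: if no terminator is found, the actual length is
    taken to be the length of the bytes supplied.
    """
    declared = int(record[0:5].decode("ascii"))
    end = record.find(b"\x1d")
    actual = end + 1 if end != -1 else len(record)
    return declared, actual


# Fake record: leader claims 30 bytes, but only 28 are present.
record = b"00030" + b"x" * 22 + b"\x1d"
declared, actual = declared_and_actual_length(record)
print(declared, actual, declared == actual)
```

A reader that trusts the declared length would overshoot here and lose the start of the next record, which matches the symptom reported above.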

Re: [CODE4LIB] utf8 \xC2 does not map to Unicode

2011-04-06 Thread Jonathan Rochkind
That's hilarious, that Terry has had to do enough ugliness with Marc encodings that he indeed can recognize 0xC2 off the bat as the Marc8 encoding it represents! I am in awe, as well as sympathy. If the record is in Marc8, then you need to know if Perl's MARC::Batch can handle Marc8. If it's

Re: [CODE4LIB] utf8 \xC2 does not map to Unicode

2011-04-06 Thread Jon Gorman
I'm not quite convinced that it's marc-8 just because there's \xC2 ;). If you look at a hex dump I'm seeing a lot of what might be combining characters. The leader appears to have 'a' in the field to indicate unicode. In the raw hex I'm seeing a lot of two character sequences like: 756c 69c3
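
The two checks mentioned here, leader position 9 and whether the bytes actually decode as UTF-8, can be sketched as follows. Both function names are hypothetical and the sample bytes are invented, not taken from the record under discussion.

```python
from typing import Optional, Tuple


def declares_unicode(leader: bytes) -> bool:
    """Return True if leader position 9 is 'a' (UCS/Unicode in MARC 21)."""
    return leader[9:10] == b"a"


def first_invalid_utf8(data: bytes) -> Optional[Tuple[int, int]]:
    """Return (offset, byte value) of the first undecodable byte, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start, data[err.start]
    return None


# 0xC2 followed by a space is not a valid UTF-8 sequence.
print(first_invalid_utf8(b"abc\xc2 def"))
# 0xC3 0xA9 is a valid two-byte sequence (e with acute accent).
print(first_invalid_utf8(b"caf\xc3\xa9"))
```

A record whose leader declares Unicode but which fails the decode check is likely mislabeled or damaged in transit; one that declares MARC-8 would need conversion before being read as UTF-8.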

Re: [CODE4LIB] utf8 \xC2 does not map to Unicode

2011-04-06 Thread William Denton
On 6 April 2011, Eric Lease Morgan wrote: http://zoia.library.nd.edu/tmp/tor.marc Happily, Kevin's magic formula recognizes this as MARC! Bill -- William Denton, Toronto : miskatonic.org www.frbr.org openfrbr.org