Re: [CODE4LIB] utf8 \xC2 does not map to Unicode

2011-04-11 Thread Eric Lease Morgan
On Apr 6, 2011, at 5:39 PM, Jon Gorman wrote: http://zoia.library.nd.edu/tmp/tor.marc When debugging any encoding issue it's always good to know: a) how the records were obtained b) how have they been manipulated before you touch them (basically, how many times may they

Re: [CODE4LIB] utf8 \xC2 does not map to Unicode

2011-04-11 Thread Jonathan Rochkind
XML well-formedness and validity checks can't find badly encoded characters either -- char data that claims to be one encoding but is really another, or that has been double-encoded and now means something different than intended. There's really no way to catch that but heuristics. All of
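One such heuristic can be sketched in Python. This is only an illustration, not something posted in the thread: the function name `looks_double_encoded` is hypothetical, and it assumes the extra layer came from UTF-8 bytes being misread as Latin-1 (ISO-8859-1) and then encoded as UTF-8 again.

```python
def looks_double_encoded(text: str) -> bool:
    """Heuristic: True if `text` looks like UTF-8 that was misread as
    Latin-1 and then re-encoded (an assumption about how it was damaged)."""
    try:
        # Undo the suspected extra layer: turn each character back into a
        # single byte, then try to read those bytes as UTF-8.
        repaired = text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The round trip failed, so the text does not fit this pattern.
        return False
    # If the round trip succeeded and changed the text, the shorter form
    # was probably the intended one.
    return repaired != text


if __name__ == "__main__":
    damaged = "é".encode("utf-8").decode("latin-1")  # simulate the damage
    print(looks_double_encoded(damaged))  # the damaged form is flagged
    print(looks_double_encoded("é"))      # the clean form is not
```

Like any heuristic it can be wrong: a string that legitimately contains such a character pair would be flagged too, so a human should review whatever it reports.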

Re: [CODE4LIB] utf8 \xC2 does not map to Unicode

2011-04-11 Thread Mike Taylor
On 11 April 2011 16:40, Jonathan Rochkind rochk...@jhu.edu wrote: XML well-formedness and validity checks can't find badly encoded characters either -- char data that claims to be one encoding but is really another, or that has been double-encoded and now means something different than

Re: [CODE4LIB] utf8 \xC2 does not map to Unicode

2011-04-11 Thread Jon Gorman
I'm making headway on my MARC records, but only through the use of brute force. I used wget to retrieve the MARC records (as well as associated PDF and text files) from the Internet Archive. I know IA has some bad marc records (and also records w/ bad encoding) from my experience with

Re: [CODE4LIB] utf8 \xC2 does not map to Unicode

2011-04-07 Thread Tod Olson
I am not familiar with that Perl module. But I'm more familiar than I'd want with char encoding in MARC. I don't recognize the byte 0xC2 (there are some bytes I became pathetically familiar with in past

[CODE4LIB] utf8 \xC2 does not map to Unicode

2011-04-06 Thread Eric Lease Morgan
Ack! While using the venerable Perl MARC::Batch module I get the following error while trying to read a MARC record: utf8 \xC2 does not map to Unicode This is a real pain, and I'm hoping someone here can help me either: 1) trap this error allowing me to move on, or 2) figure out how to open
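The first option, trapping the error and moving on, can be sketched outside of Perl. The Python below is a rough illustration and not the MARC::Batch API: the function names are hypothetical, it relies only on the MARC 21 convention that each record ends with the byte 0x1D, and it assumes the records are supposed to be UTF-8.

```python
RECORD_TERMINATOR = b"\x1d"  # MARC 21 end-of-record byte


def split_records(data: bytes):
    """Yield each raw record from a MARC file, terminator included."""
    for chunk in data.split(RECORD_TERMINATOR):
        if chunk:  # skip the empty piece after the last terminator
            yield chunk + RECORD_TERMINATOR


def partition_by_utf8(data: bytes):
    """Separate records that decode as UTF-8 from those that do not.

    Returns (good, bad): `good` is a list of raw records, `bad` is a list
    of (record index, byte offset, offending byte) tuples.
    """
    good, bad = [], []
    for index, raw in enumerate(split_records(data)):
        try:
            raw.decode("utf-8")  # strict decoding raises on a bad byte
        except UnicodeDecodeError as err:
            # Trap the error, note where it happened, and move on.
            bad.append((index, err.start, raw[err.start:err.start + 1]))
        else:
            good.append(raw)
    return good, bad
```

Calling `partition_by_utf8(open("tor.marc", "rb").read())` on a local copy of the file would give a list of the records worth a closer look. The check means little for records that are declared as MARC-8 rather than Unicode.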

Re: [CODE4LIB] utf8 \xC2 does not map to Unicode

2011-04-06 Thread Jonathan Rochkind
I am not familiar with that Perl module. But I'm more familiar than I'd want with char encoding in MARC. I don't recognize the byte 0xC2 (there are some bytes I became pathetically familiar with in past debugging, but I've forgotten 'em), but the first things to look at: 1. Is your MARC file

Re: [CODE4LIB] utf8 \xC2 does not map to Unicode

2011-04-06 Thread LeVan,Ralph
I am not familiar with that Perl module. But I'm more familiar than I'd want with char encoding in MARC. I don't recognize the byte 0xC2 (there are some bytes I became pathetically familiar with in past debugging, but I've forgotten 'em

Re: [CODE4LIB] utf8 \xC2 does not map to Unicode

2011-04-06 Thread Eric Lease Morgan
On Apr 6, 2011, at 4:46 PM, LeVan,Ralph wrote: Ack! While using the venerable Perl MARC::Batch module I get the following error while trying to read a MARC record: utf8 \xC2 does not map to Unicode Can you share the record somewhere? I suspect many of us have tools we can turn loose

Re: [CODE4LIB] utf8 \xC2 does not map to Unicode

2011-04-06 Thread Reese, Terry
I am not familiar with that Perl module. But I'm more familiar than I'd want with char encoding in MARC. I don't recognize the byte 0xC2 (there are some bytes I became pathetically familiar with in past

Re: [CODE4LIB] utf8 \xC2 does not map to Unicode

2011-04-06 Thread LeVan,Ralph
in the very first record? Ralph -Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Eric Lease Morgan Sent: Wednesday, April 06, 2011 4:55 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] utf8 \xC2 does not map to Unicode On Apr 6, 2011

Re: [CODE4LIB] utf8 \xC2 does not map to Unicode

2011-04-06 Thread Jonathan Rochkind
I am not familiar with that Perl module. But I'm more familiar than I'd want with char encoding in MARC. I don't recognize the byte 0xC2 (there are some bytes I became pathetically familiar with in past debugging, but I've forgotten 'em), but the first

Re: [CODE4LIB] utf8 \xC2 does not map to Unicode

2011-04-06 Thread Jon Gorman
I'm not quite convinced that it's MARC-8 just because there's \xC2 ;). If you look at a hex dump I'm seeing a lot of what might be combining characters. The leader appears to have 'a' in the field to indicate Unicode. In the raw hex I'm seeing a lot of two-character sequences like: 756c 69c3
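Both checks described here, reading the leader flag and eyeballing the raw hex, are easy to script. The Python below is a sketch with hypothetical function names. It relies on the MARC 21 convention that the leader is the first 24 bytes of a record and that leader position 09 holds 'a' for UCS/Unicode or a blank for MARC-8.

```python
def declared_encoding(record: bytes) -> str:
    """Report what the leader claims about the record's character coding."""
    if len(record) < 24:
        raise ValueError("a MARC leader is 24 bytes long")
    flag = record[9:10]  # leader position 09, character coding scheme
    if flag == b"a":
        return "UCS/Unicode"
    if flag == b" ":
        return "MARC-8"
    return "unknown"


def hex_pairs(data: bytes) -> str:
    """Render bytes as space-separated groups of two bytes each."""
    hexed = data.hex()
    return " ".join(hexed[i:i + 4] for i in range(0, len(hexed), 4))


if __name__ == "__main__":
    print(declared_encoding(b"00714cam a2200205 a 4500"))
    print(hex_pairs(b"uli\xc3"))  # the same bytes as the dump quoted above
```

The leader only states what the record claims to be. A record flagged 'a' can still contain bytes that are not valid UTF-8, which is why the hex dump is worth reading alongside it.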

Re: [CODE4LIB] utf8 \xC2 does not map to Unicode

2011-04-06 Thread William Denton
On 6 April 2011, Eric Lease Morgan wrote: http://zoia.library.nd.edu/tmp/tor.marc Happily, Kevin's magic formula recognizes this as MARC! Bill -- William Denton, Toronto : miskatonic.org www.frbr.org openfrbr.org