Re: [CODE4LIB] more on MARC char encoding

2012-04-26 Thread Joe Atzberger
mobile # do...@uta.edu # http://rocky.uta.edu/doran/ -Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Deng, Sai Sent: Friday, April 20, 2012 8:55 AM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] more on MARC char encoding

Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21

2012-04-20 Thread Andrew Cunningham
@LISTSERV.ND.EDU] On Behalf Of Robert Haschart Sent: Thursday, April 19, 2012 2:23 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21 On 4/18/2012 12:08 PM, Jonathan Rochkind wrote: On 4/18/2012 11:09 AM, Doran, Michael D wrote

Re: [CODE4LIB] more on MARC char encoding

2012-04-20 Thread Deng, Sai
, 2012 2:14 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] more on MARC char encoding Ah, thanks Terry. That canned cleaner in MarcEdit sounds potentially useful -- I'm in a continuing battle to keep the character encoding in our local marc corpus clean. (The real blame here is on cataloger

Re: [CODE4LIB] more on MARC char encoding

2012-04-20 Thread Reese, Terry
outside of the general smart quote issue. --TR -Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Deng, Sai Sent: Friday, April 20, 2012 6:55 AM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] more on MARC char encoding If a canned cleaner can

Re: [CODE4LIB] more on MARC char encoding

2012-04-20 Thread Doran, Michael D
# http://rocky.uta.edu/doran/ -Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Deng, Sai Sent: Friday, April 20, 2012 8:55 AM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] more on MARC char encoding If a canned cleaner can be added

Re: [CODE4LIB] more on MARC char encoding

2012-04-19 Thread Deng, Sai
[mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Tod Olson Sent: Tuesday, April 17, 2012 10:13 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21 In practice it seems to mean UTF-8. At least I've only seen UTF-8, and I can't

Re: [CODE4LIB] more on MARC char encoding

2012-04-19 Thread Reese, Terry
AM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] more on MARC char encoding If your records are really in MARC8 not UTF8, your best bet is to use a tool to convert them to UTF8 before hitting your XSLT. The open source 'yaz' command line tools can do it for Marc21. The Marc4J package can

Re: [CODE4LIB] more on MARC char encoding

2012-04-19 Thread Jonathan Rochkind
Sent: Thursday, April 19, 2012 11:13 AM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] more on MARC char encoding If your records are really in MARC8 not UTF8, your best bet is to use a tool to convert them to UTF8 before hitting your XSLT. The open source 'yaz' command line tools can do

Re: [CODE4LIB] more on MARC char encoding

2012-04-19 Thread LeVan,Ralph
quotes/values. --TR -Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Jonathan Rochkind Sent: Thursday, April 19, 2012 11:13 AM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] more on MARC char encoding If your records are really

Re: [CODE4LIB] more on MARC char encoding

2012-04-19 Thread Jonathan Rochkind
On 4/19/2012 3:23 PM, LeVan,Ralph wrote: We see Unicode data pasted into MARC8 records all the time. It happens enough that my MARC8-Unicode converter takes a second look at illegal MARC8 bytes and tries a UTF-8 encoding as well. Right. I see it too. I'm arguing that means cataloger entry

Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21

2012-04-19 Thread Robert Haschart
On 4/18/2012 12:08 PM, Jonathan Rochkind wrote: On 4/18/2012 11:09 AM, Doran, Michael D wrote: I don't believe that is the case. Take UTF-8 out of the picture, and consider the MARC-8 character set with its escape sequences and combining characters. A character such as an n with a tilde

Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21

2012-04-18 Thread Tod Olson
It has to mean UTF-8. ISO 2709 is very byte-oriented, from the directory structure to the byte-offsets in the fixed fields. The values in these places all assume 8-bit character data, it's completely baked in to the file format. -Tod On Apr 17, 2012, at 6:55 PM, Jonathan Rochkind wrote:
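A minimal sketch of the byte-orientation point above, not code from anyone in the thread. It assumes the MARC21 directory entry layout (3-character tag, 4-digit field length, 5-digit starting position) and a field terminator of 0x1E; the helper name directory_entry is hypothetical, and a real field would also carry indicators and subfield delimiters, which are left out here.

```python
FIELD_TERMINATOR = b"\x1e"  # MARC21 field terminator

def directory_entry(tag: str, field_text: str, start: int,
                    encoding: str = "utf-8") -> str:
    """Build a simplified 12-character directory entry (hypothetical helper).

    The length counts encoded bytes (octets), not characters, and
    includes the field terminator.
    """
    data = field_text.encode(encoding) + FIELD_TERMINATOR
    return f"{tag}{len(data):04d}{start:05d}"

text = "Año"
char_count = len(text)                  # number of characters
byte_count = len(text.encode("utf-8"))  # number of octets in UTF-8
entry = directory_entry("245", text, 0)
```

Because the recorded length is an octet count, any character that takes more than one byte in the chosen encoding makes the field length differ from the character count.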

Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21

2012-04-18 Thread Peter Noerr
for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Bill Dueber Sent: Tuesday, April 17, 2012 5:50 PM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21 On Tue, Apr 17, 2012 at 8:46 PM, Simon Spero sesunc

Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21

2012-04-18 Thread Jonathan Rochkind
On 4/18/2012 6:04 AM, Tod Olson wrote: It has to mean UTF-8. ISO 2709 is very byte-oriented, from the directory structure to the byte-offsets in the fixed fields. The values in these places all assume 8-bit character data, it's completely baked in to the file format. I'm not sure that

Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21

2012-04-18 Thread Doran, Michael D
-5326 office # 817-688-1926 mobile # do...@uta.edu # http://rocky.uta.edu/doran/ -Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Tod Olson Sent: Wednesday, April 18, 2012 5:04 AM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] more

Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21

2012-04-18 Thread LeVan,Ralph
In fact, I worry that the standard may pre-date UTF-8, with its reference to UCS --- if I understand things right, at one point there was only one Unicode encoding, called UCS, which is basically a backwards-compatible subset of what became UTF-16. So I worry the standard really means

Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21

2012-04-18 Thread Karen Coyle
UTF-8 was the marc standard from the beginning: http://www.loc.gov/marc/marbi/1998/98-18.html The first proposals were a character mapping between Unicode and MARC-8 and didn't mention the character encodings, thus the term UCS which was a common term for Unicode at that time. (see:

Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21

2012-04-18 Thread Huwig,Steve
. -Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Doran, Michael D Sent: Wednesday, April 18, 2012 10:05 AM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21 Hi Tod, I'm

Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21

2012-04-18 Thread Doran, Michael D
/ -Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Huwig,Steve Sent: Wednesday, April 18, 2012 9:21 AM To: CODE4LIB@LISTSERV.ND.EDU Subject: Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21 I could be mistaken (never having

Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21

2012-04-18 Thread Andy Kohler
I don't know about ISO 2709 itself, but the MARC21 implementation of it refers to octets, aka 8-bit bytes: http://www.loc.gov/marc/specifications/specrecstruc.html Characters may be encoded using one or more than one octet, depending on the character set. All ASCII characters are encoded using

Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21

2012-04-18 Thread Houghton,Andrew
-Original Message- From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Jonathan Rochkind Sent: Tuesday, April 17, 2012 19:55 To: CODE4LIB@LISTSERV.ND.EDU Subject: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21 Okay, forget XML for a

Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21

2012-04-18 Thread Doran, Michael D
.) ;-) -- Michael -Original Message- From: Jonathan Rochkind [mailto:rochk...@jhu.edu] Sent: Wednesday, April 18, 2012 11:09 AM To: Code for Libraries Cc: Doran, Michael D Subject: Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21 On 4/18/2012 11:09 AM

Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21

2012-04-18 Thread Tod Olson
In practice it seems to mean UTF-8. At least I've only seen UTF-8, and I can't imagine the code that processes this stuff being safe for UTF-16 or UTF-32. All of the offsets are byte-oriented, and there's too much legacy code that makes assumptions about null-terminated strings. -Tod On Apr

Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21

2012-04-17 Thread Simon Spero
On Tue, Apr 17, 2012 at 7:55 PM, Jonathan Rochkind rochk...@jhu.edu wrote: Okay, forget XML for a moment, let's just look at marc 'binary'. First, for Anglophone-centric MARC21. Actually Anglo and Francophone centric. And the USMARC style 245 was a poor replacement for the UKMARC approach

Re: [CODE4LIB] more on MARC char encoding: Now we're about ISO_2709 and MARC21

2012-04-17 Thread Bill Dueber
On Tue, Apr 17, 2012 at 8:46 PM, Simon Spero sesunc...@gmail.com wrote: Actually Anglo and Francophone centric. And the USMARC style 245 was a poor replacement for the UKMARC approach (someone at the British Library hosted Linked Data meeting wondered why there were punctuation characters