Re: [CODE4LIB] marc-8

2011-10-24 Thread Doran, Michael D
Hi Eric,

 In Perl, how do I specify MARC-8 when reading (decoding) and writing
 (encoding) data?

You can't.  MARC-8 is a character set that is unknown to the operating system.  
Your best bet is to convert MARC-8-encoded records into UTF-8. 

 ...it is converted to Perl's
 internal encoding (UTF-8)

As an FYI, UTF-8 is *not* Perl's internal encoding.
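
A minimal sketch of the distinction, using only the core Encode module (the sample string is made up for illustration): once decoded, a Perl string is a sequence of characters, and how Perl stores it internally is an implementation detail. UTF-8 only matters at the boundaries, when octets are decoded on the way in and encoded on the way out.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Five octets as they might arrive from a file: the UTF-8 encoding of "café".
my $octets = "caf\xC3\xA9";

# Decoding yields a string of four characters (code points), not UTF-8 bytes.
my $chars = decode( 'UTF-8', $octets );
printf "%d octets in, %d characters after decode\n", length $octets, length $chars;

# Encoding on the way out turns the characters back into octets.
my $again = encode( 'UTF-8', $chars );
print $again eq $octets ? "round trip ok\n" : "round trip failed\n";
```

This prints "5 octets in, 4 characters after decode" followed by "round trip ok".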

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# do...@uta.edu
# http://rocky.uta.edu/doran/



 -Original Message-
 From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Eric
 Lease Morgan
 Sent: Monday, October 24, 2011 1:18 PM
 To: CODE4LIB@LISTSERV.ND.EDU
 Subject: [CODE4LIB] marc-8
 
 In Perl, how do I specify MARC-8 when reading (decoding) and writing
 (encoding) data?
 
 Character encoding is the bane of my existence. I have learned that when
 reading from a file I ought to specify the type of encoding the file is in
 and decode accordingly, or else. Once read, it is converted to Perl's
 internal encoding (UTF-8) and can be manipulated. Similarly, when writing I
 am expected to specify the encoding. Both the reading (decoding) and the
 writing (encoding) can be done with the Encode module. Here is some code
 illustrating what I'm trying to do with MARC records which are apparently in
 MARC-8:
 
   # require
   use MARC::Batch;
   use Encode qw( encode decode );
 
   # initialize
   my $batch = MARC::Batch->new( 'USMARC', './records.mrc' );
   open OUT, '>', 'updated.mrc' or die "Can't open updated.mrc: $!";
 
   # process each record
   while ( my $marc = $batch->next ) {
 
     # get the title ('FOO' is the unknown encoding name)
     my $_245 = decode( 'FOO', $marc->title );
 
     # do cool stuff with the title here
 
     # output the cool stuff
     print OUT encode( 'FOO', $_245 );
 
   }
 
   # done
   close OUT;
   exit;
 
 
 My problem is, I don't know what to put in place of FOO. What is the official
 name of MARC-8's encoding scheme?
 
 --
 Eric The Ugly American Morgan
 University of Notre Dame
 
 (574) 631-8604


Re: [CODE4LIB] marc-8

2011-10-24 Thread Walker, David
 I know yaz-marcdump changes the encoding bit in MARC
 leaders. Does it also convert MARC-8 characters to UTF-8?

Yes.  We use it for that purpose all the time.

--Dave

-
David Walker
Library Web Services Manager
California State University


-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Eric 
Lease Morgan
Sent: Monday, October 24, 2011 11:39 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] marc-8

On Oct 24, 2011, at 2:34 PM, Doran, Michael D wrote:

 In Perl, how do I specify MARC-8 when reading (decoding) and writing
 (encoding) data?
 
 You can't.  MARC-8 is a character set that is unknown to the operating 
 system.  Your best bet is to convert MARC-8-encoded records into UTF-8. 

/me throws his hands up in the air and screams!

Okay. How do I go about converting MARC-8 encoded records into UTF-8? I know 
yaz-marcdump changes the encoding bit in MARC leaders. Does it also convert 
MARC-8 characters to UTF-8? (I guess I could simply try it and see what 
happens.)

-- 
Eric Morgan


Re: [CODE4LIB] marc-8

2011-10-24 Thread Ross Singer
On Mon, Oct 24, 2011 at 7:39 PM, Eric Lease Morgan emor...@nd.edu wrote:

 Okay. How do I go about converting MARC-8 encoded records into UTF-8? I know 
 yaz-marcdump changes the encoding bit in MARC leaders. Does it also convert 
 MARC-8 characters to UTF-8? (I guess I could simply try it and see what 
 happens.)


Yes, it does.  It uses yaz-iconv.  Theoretically, you could wrap some
Perl module around that.  I've contemplated it for ruby-marc, but then
it always seems a lot easier to ignore it and delete any emails that
request it.

A whole lot easier.
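
Short of a real module, one rough way to "wrap" the tool from Perl is to shell out to it. The sketch below assumes yaz-marcdump is installed and on the PATH, and reuses the flags from the yaz-marcdump documentation example quoted elsewhere in this thread; the sub names `marcdump_args` and `convert_file` are made up for illustration.

```perl
use strict;
use warnings;

# Build the argument list for the documented MARC-8 to UTF-8 conversion.
sub marcdump_args {
    my ($in) = @_;
    return ( 'yaz-marcdump', '-f', 'MARC-8', '-t', 'UTF-8',
             '-o', 'marc', '-l', '9=97', $in );
}

# Run the command and copy its standard output, byte for byte, into $out.
sub convert_file {
    my ( $in, $out ) = @_;
    open my $from, '-|', marcdump_args($in)
        or die "Cannot run yaz-marcdump: $!";
    open my $to, '>', $out or die "Cannot write $out: $!";
    binmode $from;
    binmode $to;
    local $/ = \65536;    # read in 64 KB chunks rather than by line
    print {$to} $_ while <$from>;
    close $to   or die "Cannot close $out: $!";
    close $from or die "yaz-marcdump failed (exit status $?)";
    return 1;
}

# Usage: perl convert.pl records.mrc records.utf8.mrc
convert_file(@ARGV) if @ARGV == 2;
```

The list form of the pipe open avoids passing file names through a shell, which matters if the names come from users.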

-Ross.


Re: [CODE4LIB] marc-8

2011-10-24 Thread Jonathan Rochkind
What _ought_ to be easiest of all is getting our ILS's to NEVER export 
Marc8 _ever_ again.  UTF8 only.


Sadly, that only ought to be easiest.

But IMO there's no reason any of us should be dealing with Marc8 ever 
again.  The only thing that should deal in Marc8 is an ILS, and should 
only input it, NEVER output it, UTF8 only, please!


But this is not the world we live in.

I tried to figure out how to add a custom encoding to Ruby 1.9, with 
the idea of adding Marc8 as an actual Ruby 1.9 character encoding, 
supported the same as any other built-in char encoding, but I couldn't 
figure out if that was possible or how to do it.  If it was possible to 
do at that low level in Ruby 1.9, it might justify the time to do it.


On 10/24/2011 2:55 PM, Doran, Michael D wrote:

Eric,

Sometimes for grandpa Perl stuff -- especially as concerns charsets and/or 
internationalization -- it's worth pinging these lists:

perl4...@perl.org (yes, still alive and kicking)

perl-i...@perl.org (very low traffic list, but some knowledgeable 
subscribers)

-- Michael


-Original Message-
From: Doran, Michael D
Sent: Monday, October 24, 2011 1:48 PM
To: 'Code for Libraries'
Subject: RE: [CODE4LIB] marc-8


Okay. How do I go about converting MARC-8 encoded records into UTF-8?

In Perl... using the handy MARC::Charset module (tip o' the hat to Ed
Summers, and now maintained by Galen Charlton).
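
A minimal sketch of what that might look like for the titles in the original question, assuming the CPAN modules MARC::Batch and MARC::Charset are installed. The sub name `titles_as_unicode` is hypothetical; `marc8_to_utf8` is the conversion function MARC::Charset provides.

```perl
use strict;
use warnings;

# Return the 245 title of every record in a MARC-8 file as Perl character strings.
sub titles_as_unicode {
    my ($path) = @_;

    # Loaded at run time so the sketch compiles even where the modules are missing.
    require MARC::Batch;
    require MARC::Charset;

    my $batch = MARC::Batch->new( 'USMARC', $path );
    my @titles;
    while ( my $record = $batch->next ) {
        my $title = $record->title;
        next unless defined $title && length $title;
        push @titles, MARC::Charset::marc8_to_utf8($title);
    }
    return @titles;
}

# Usage: perl titles.pl records.mrc
if (@ARGV) {
    binmode STDOUT, ':encoding(UTF-8)';    # encode characters on the way out
    print "$_\n" for titles_as_unicode( $ARGV[0] );
}
```

This only reads titles. Writing converted records back out also means updating the leader and the directory, which is better left to the MARC modules or to yaz-marcdump than done by hand.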

-- Michael



Re: [CODE4LIB] marc-8

2011-10-24 Thread Jon Gorman
 In Perl, how do I specify MARC-8 when reading (decoding) and writing
 (encoding) data?

 You can't.  MARC-8 is a character set that is unknown to the operating 
 system.  Your best bet is to convert MARC-8-encoded records into UTF-8.

 /me throws his hands up in the air and screams!

 Okay. How do I go about converting MARC-8 encoded records into UTF-8? I know 
 yaz-marcdump changes the encoding bit in MARC leaders. Does it also convert 
 MARC-8 characters to UTF-8? (I guess I could simply try it and see what 
 happens.)


I seem to remember there was an older version of yaz-marcdump that
seemed a bit buggy (would just change the header but not change
encoding despite command-line options, if there was a certain
combination chosen).  It's also possible I was just working with a
script that specified the encoding change but not the leader.

I'd say get the most recent version of yaz (don't use anything in an
OS repository) and then follow the docs:
http://www.indexdata.com/yaz/doc/yaz-marcdump.html.  The first example
is what you want:

 yaz-marcdump -f MARC-8 -t UTF-8 -o marc -l 9=97 marc21.raw > marc21.utf8.raw

The -f is the source encoding, the -t is the target encoding, and
-l 9=97 sets leader position 09 to 'a' (97 is the decimal code of the
character 'a'), which marks the record as Unicode.
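
To make the "9=97" part concrete, here is a small core-Perl illustration of the same change applied to a leader string. The leader value is a made-up example, not taken from a real record.

```perl
use strict;
use warnings;

# A hypothetical 24-character leader; position 09 is blank, which means MARC-8.
my $leader = '00714cam  2200205 a 4500';

# 97 is the decimal character code of 'a', hence "9=97".
my $code = ord 'a';

# Positions are counted from zero, so position 09 is the tenth character.
substr( $leader, 9, 1 ) = chr $code;

print "$code\n";
print "$leader\n";
```

This prints 97 and then "00714cam a2200205 a 4500". Changing the leader alone does not convert any characters; it only changes what the record claims about itself.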

I've typically found this is one of the easier ways to do the
character set encoding, although the various Perl modules (if they're
recent enough) should be able to handle the conversion as well through
the MARC::Charset library.  Check the cpan pages.

Jon Gorman

ps.  For the love of all that is good, don't try to do anything in
Perl with the raw MARC record to do the encoding change yourself.
I've seen someone really screw records up because they altered
individual characters, which in turn led to different byte lengths.
This caused all sorts of insanity which meant really weird things
happened with MARC parsers that tried to follow the MARC directory
(which uses byte addresses to deal with variable fields).
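
A simplified illustration of why that goes wrong, in core Perl. It assumes that MARC-8 encodes the Polish "l with stroke" as the single byte 0xB1 while UTF-8 needs two bytes for it, and it leaves out indicators and subfield delimiters to keep the field short. The sub name `directory_entry` is made up.

```perl
use strict;
use warnings;

# One directory entry is 12 characters: tag (3), field length in bytes (4),
# starting position in bytes (5).
sub directory_entry {
    my ( $tag, $field_octets, $start ) = @_;
    return sprintf '%3s%04d%05d', $tag, length $field_octets, $start;
}

my $end = "\x1E";    # field terminator

# The same word in both encodings (assumed byte values, see above).
my $marc8 = "z\xB1oty" . $end;        # 6 bytes
my $utf8  = "z\xC5\x82oty" . $end;    # 7 bytes

print directory_entry( '245', $marc8, 0 ), "\n";
print directory_entry( '245', $utf8,  0 ), "\n";
```

This prints 245000600000 and then 245000700000. Every later field's starting position shifts as well, so a record whose bytes were changed without rebuilding the directory and the record length no longer parses correctly.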


Re: [CODE4LIB] marc-8

2011-10-24 Thread Doran, Michael D
Hi Jonathan,

 I tried to figure out how to add a custom encoding to Ruby 1.9, with
 the idea of adding Marc8 as an actual Ruby 1.9 character encoding,
 supported the same as any other built-in char encoding

Not a trivial undertaking.  Remember that the MARC-8 environment allows 
alternate character sets to be invoked within a MARC record using two different 
escape methods [1].  Just one of the reasons why you're not finding a bunch 
of these MARC-8 conversion modules, and one for every language. ;-)

-- Michael

[1] Technique 1 is unique to MARC-8 and provides access to a small number of 
Greek symbols, subscripts, and superscripts. Technique 2 is based on the ANSI 
X3.41 (ISO 2022) Code Extension Techniques for Use with 7-bit and 8-bit 
Character Sets standard. See the MARC 21 Specification for details on 
accessing alternate graphic character sets 
(http://www.loc.gov/marc/specifications/speccharmarc8.html#alternative).
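
As a rough illustration of why these escapes complicate a converter, the core-Perl sketch below merely lists the escape sequences found in a string of MARC-8 octets. It assumes, from a reading of the specification linked above, that Technique 1 is ESC followed directly by one of g, b, p, or s, and that anything else after ESC is a Technique 2 designation. The sub name `find_escapes` and the sample strings are made up.

```perl
use strict;
use warnings;

# List the escape sequences in a string of MARC-8 octets.
sub find_escapes {
    my ($octets) = @_;
    my @found;
    while ( $octets =~ /\x1B(.)/gs ) {
        my $next = $1;
        push @found,
            $next =~ /\A[gbps]\z/
            ? "technique 1: ESC $next"
            : "technique 2: ESC $next";
    }
    return @found;
}

# "H2O" with a subscript 2: ESC b switches to subscripts, ESC s switches back.
print "$_\n" for find_escapes("H\x1Bb2\x1BsO");

# A hypothetical field that designates another character set with ESC ( N.
print "$_\n" for find_escapes("\x1B(Nabc");
```

A real converter has to do far more than find the escapes: it must track which set is currently active and map every following byte through the right table, which is the work MARC::Charset and yaz-iconv already do.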
 



Re: [CODE4LIB] marc-8

2011-10-24 Thread Jonathan Rochkind
Yeah, but if there's Perl code and Java code to do it, it can't be _that_ 
hard to port to Ruby, if I could figure out what you need to do to 
get first-class char encoding support in Ruby 1.9 anyway.


I mean, you could do it just as a library without that... but it's 
enough trouble that, yeah, I don't want to do it. If the benefit was 
first-class encoding support, same as any other encoding in Ruby 1.9, 
usable with the built-in tools for converting encodings and any library 
that uses 'em, that would be a bigger benefit.


But I had no idea Marc8 allowed escape sequences to temporarily switch 
to a different encoding. Really? Oh my god.




Re: [CODE4LIB] marc-8

2011-10-24 Thread Doran, Michael D
 But I had no idea Marc8 allowed escape sequences to temporarily switch
 to a different encoding. Really? Oh my god.

For you young'uns that were born Unicode and are a bit foggy on the MARC-8 
environment (and all its... intricacies), I did a short write-up a few years 
ago:

Coded Character Sets: A Technical Primer for Librarians
http://rocky.uta.edu/doran/charsets/

Feel free to skip the intro, but the MARC-8 and MARC Unicode sections are 
short and worth a read.  Plus there's a lot of bonus stuff, including 
Resources on the Web (http://rocky.uta.edu/doran/charsets/resources.html) 
with an emphasis on library automation and the internet environment.

Begging your pardon for the self-promotion,

-- Michael
