Re: [CODE4LIB] marc-8

2011-10-24 Thread Doran, Michael D
> But I had no idea Marc8 allowed escape sequences to temporarily switch
> to a different encoding. Really? Oh my god.

For you young'uns that were "born Unicode" and are a bit foggy on the MARC-8 
environment (and all its... intricacies), I did a short write-up a few years 
ago:

Coded Character Sets > A Technical Primer for Librarians
http://rocky.uta.edu/doran/charsets/

Feel free to skip the intro, but the "MARC-8" and "MARC Unicode" sections are 
short and worth a read.  Plus there's a lot of bonus stuff, including 
"Resources on the Web" (http://rocky.uta.edu/doran/charsets/resources.html) 
with an emphasis on library automation and the internet environment.

Begging your pardon for the self-promotion,

-- Michael

> -Original Message-
> From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
> Sent: Monday, October 24, 2011 2:14 PM
> To: Code for Libraries
> Cc: Doran, Michael D
> Subject: Re: [CODE4LIB] marc-8
> 
> Yeah, but if there's Perl code and Java code to do it, can't be _that_
> hard to port to ruby if I could figure out what you need to do to
> get first-class char encoding support in ruby 1.9 anyway.
> 
> I mean, you could do it just as a library without that... but it's
> enough trouble that, yeah, I don't want to do it, but if the benefit was
> first-class encoding support same as any other encoding in ruby 1.9,
> that you can use with the built in tools for converting encodings and
> any library that uses em bigger benefit.
> 
> But I had no idea Marc8 allowed escape sequences to temporarily switch
> to a different encoding. Really? Oh my god.
> 
> On 10/24/2011 3:10 PM, Doran, Michael D wrote:
> > Hi Jonathan,
> >
> >> I tried to figure out how to custom add a new encoding to ruby 1.9 with
> >> the idea of adding Marc8 as an actual ruby 1.9 character encoding
> >> supported same as any other built in char encoding
> > Not a trivial undertaking.  Remember that the MARC-8 environment allows
> alternate character sets to be invoked within a MARC record using two
> different "escape" methods [1].  Just one of the reasons why you're not
> finding a bunch of these MARC-8 conversion modules, and one for every
> language. ;-)
> >
> > -- Michael
> >
> > [1] Technique 1 is unique to MARC-8 and provides access to a small number
> of Greek symbols, subscripts, and superscripts. Technique 2 is based on the
> ANSI X3.41 (ISO 2022) "Code Extension Techniques for Use with 7-bit and 8-
> bit Character Sets" standard. See the MARC 21 Specification for details on
> accessing alternate graphic character sets
> (http://www.loc.gov/marc/specifications/speccharmarc8.html#alternative).
> >
> >
> >> -Original Message-
> >> From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
> >> Jonathan Rochkind
> >> Sent: Monday, October 24, 2011 2:01 PM
> >> To: CODE4LIB@LISTSERV.ND.EDU
> >> Subject: Re: [CODE4LIB] marc-8
> >>
> >> What _ought_ to be easiest of all is getting our ILS's to NEVER export
> >> Marc8 _ever_ again.  UTF8 only.
> >>
> >> Sadly, that only ought to be easiest.
> >>
> >> But IMO there's no reason any of us should be dealing with Marc8 ever
> >> again.  The only thing that should deal in Marc8 is an ILS, and should
> >> only input it, NEVER output it, UTF8 only, please!
> >>
> >> But this is not the world we live in.
> >>
> >> I tried to figure out how to custom add a new encoding to ruby 1.9 with
> >> the idea of adding Marc8 as an actual ruby 1.9 character encoding
> >> supported same as any other built in char encoding, but I couldn't
> >> figure out if that was possible or how to do it.  If it was possible to
> >> do at that low level in ruby 1.9, it might justify the time to do it.
> >>
> >> On 10/24/2011 2:55 PM, Doran, Michael D wrote:
> >>> Eric,
> >>>
> >>> Sometimes for grandpa Perl stuff -- especially as concerns charsets
> and/or
> >> internationalization -- it's worth pinging these lists:
> >>>   perl4...@perl.org (yes, still alive and kicking)
> >>>
> >>>   perl-i...@perl.org (very low traffic list, but some knowledgeable
> >> subscribers)
> >>> -- Michael
> >>>
> >>>> -Original Message-
> >>>> From: Doran, Michael D
> >>>> Sent: Monday, October 24, 2011 1:48 PM
> >>>> To: 'Code for Libraries'
> >>>> Subject: RE:

Re: [CODE4LIB] marc-8

2011-10-24 Thread Eric Lease Morgan
On Oct 24, 2011, at 3:03 PM, Jon Gorman wrote:

> yaz-marcdump -f MARC-8 -t UTF-8 -o marc -l 9=97 marc21.raw >marc21.utf8.raw

This worked great! My version of yaz-marcdump was older and was not doing the 
trick.  code4lib++

-- 
Eric


Re: [CODE4LIB] marc-8

2011-10-24 Thread Jonathan Rochkind
Yeah, but if there's Perl code and Java code to do it, it can't be _that_ 
hard to port to ruby, if I could figure out what you need to do to 
get first-class char encoding support in ruby 1.9 anyway.

I mean, you could do it just as a library without that... but it's 
enough trouble that, yeah, I don't want to do it. If the benefit was 
first-class encoding support, same as any other encoding in ruby 1.9, 
that you can use with the built-in tools for converting encodings and 
any library that uses 'em, that would be a bigger benefit.


But I had no idea Marc8 allowed escape sequences to temporarily switch 
to a different encoding. Really? Oh my god.



Re: [CODE4LIB] marc-8

2011-10-24 Thread Doran, Michael D
Hi Jonathan,

> I tried to figure out how to custom add a new encoding to ruby 1.9 with
> the idea of adding Marc8 as an actual ruby 1.9 character encoding
> supported same as any other built in char encoding

Not a trivial undertaking.  Remember that the MARC-8 environment allows 
alternate character sets to be invoked within a MARC record using two different 
"escape" methods [1].  Just one of the reasons why you're not finding a bunch 
of these MARC-8 conversion modules, and one for every language. ;-)

-- Michael

[1] Technique 1 is unique to MARC-8 and provides access to a small number of 
Greek symbols, subscripts, and superscripts. Technique 2 is based on the ANSI 
X3.41 (ISO 2022) "Code Extension Techniques for Use with 7-bit and 8-bit 
Character Sets" standard. See the MARC 21 Specification for details on 
accessing alternate graphic character sets 
(http://www.loc.gov/marc/specifications/speccharmarc8.html#alternative).
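
To make the two techniques a little more concrete, here is a rough Python sketch that only locates escape sequences in a MARC-8 byte string and guesses which technique each one uses. It converts nothing. It assumes, going by the specification linked above, that Technique 1 is ESC followed by one of g, b, p, or s, and it simply labels every other ESC as Technique 2. The function name and the sample bytes are made up for illustration.

```python
ESC = 0x1B
TECHNIQUE_1_FINALS = b"gbps"  # Greek symbols, subscript, superscript, back to ASCII


def find_escapes(data: bytes) -> list[tuple[int, int]]:
    """Return (offset, technique) for every ESC byte found in `data`.

    Assumption: anything that is not a Technique 1 sequence is treated as
    Technique 2 (the ISO 2022 style); no further validation is attempted.
    """
    found = []
    for offset, byte in enumerate(data):
        if byte != ESC:
            continue
        following = data[offset + 1:offset + 2]
        technique = 1 if following and following in TECHNIQUE_1_FINALS else 2
        found.append((offset, technique))
    return found


# Invented sample: a subscript switch and return (Technique 1), then an
# ISO 2022 style designation and a return to ASCII (Technique 2).
sample = b"H\x1bb2\x1bsO and \x1b(Nabc\x1b(B"
print(find_escapes(sample))
```

A real converter also has to track which set is currently active and map every following byte through the right table, which is where most of the work is.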
 

> -Original Message-
> From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
> Jonathan Rochkind
> Sent: Monday, October 24, 2011 2:01 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] marc-8
> 
> What _ought_ to be easiest of all is getting our ILS's to NEVER export
> Marc8 _ever_ again.  UTF8 only.
> 
> Sadly, that only ought to be easiest.
> 
> But IMO there's no reason any of us should be dealing with Marc8 ever
> again.  The only thing that should deal in Marc8 is an ILS, and should
> only input it, NEVER output it, UTF8 only, please!
> 
> But this is not the world we live in.
> 
> I tried to figure out how to custom add a new encoding to ruby 1.9 with
> the idea of adding Marc8 as an actual ruby 1.9 character encoding
> supported same as any other built in char encoding, but I couldn't
> figure out if that was possible or how to do it.  If it was possible to
> do at that low level in ruby 1.9, it might justify the time to do it.
> 
> On 10/24/2011 2:55 PM, Doran, Michael D wrote:
> > Eric,
> >
> > Sometimes for grandpa Perl stuff -- especially as concerns charsets and/or
> internationalization -- it's worth pinging these lists:
> >
> > perl4...@perl.org (yes, still alive and kicking)
> >
> > perl-i...@perl.org (very low traffic list, but some knowledgeable
> subscribers)
> >
> > -- Michael
> >
> >> -Original Message-
> >> From: Doran, Michael D
> >> Sent: Monday, October 24, 2011 1:48 PM
> >> To: 'Code for Libraries'
> >> Subject: RE: [CODE4LIB] marc-8
> >>
> >>> Okay. How do I go about converting MARC-8 encoded records into UTF-8?
> >> In Perl... using the handy MARC::Charset module (tip 'o the hat to Ed
> >> Summers, and now maintained by Galen Charlton).
> >>
> >> -- Michael
> >>
> >>> -Original Message-
> >>> From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
> >> Eric
> >>> Lease Morgan
> >>> Sent: Monday, October 24, 2011 1:39 PM
> >>> To: CODE4LIB@LISTSERV.ND.EDU
> >>> Subject: Re: [CODE4LIB] marc-8
> >>>
> >>> On Oct 24, 2011, at 2:34 PM, Doran, Michael D wrote:
> >>>
> >>>>> In Perl, how do I specify MARC-8 when reading (decoding) and writing
> >>>>> (encoding) data?
> >>>> You can't.  MARC-8 is a character set that is unknown to the operating
> >>> system.  Your best bet is to convert MARC-8-encoded records into UTF-8.
> >>>
> >>> /me throws his hands up in the air and screams!
> >>>
> >>> Okay. How do I go about converting MARC-8 encoded records into UTF-8? I
> >> know
> >>> yaz-marcdump changes the encoding bit in MARC leaders. Does it also
> >> convert
> >>> MARC-8 characters to UTF-8? (I guess I could simply try it and see what
> >>> happens.)
> >>>
> >>> --
> >>> Eric Morgan


Re: [CODE4LIB] marc-8

2011-10-24 Thread Jon Gorman
>>> In Perl, how do I specify MARC-8 when reading (decoding) and writing
>>> (encoding) data?
>>
>> You can't.  MARC-8 is a character set that is unknown to the operating 
>> system.  Your best bet is to convert MARC-8-encoded records into UTF-8.
>
> /me throws his hands up in the air and screams!
>
> Okay. How do I go about converting MARC-8 encoded records into UTF-8? I know 
> yaz-marcdump changes the encoding bit in MARC leaders. Does it also convert 
> MARC-8 characters to UTF-8? (I guess I could simply try it and see what 
> happens.)
>

I seem to remember there was an older version of yaz-marcdump that
seemed a bit buggy (would just change the header but not change
encoding despite command-line options, if there was a certain
combination chosen).  It's also possible I was just working with a
script that specified the encoding change but not the leader.

I'd say get the most recent version of yaz (don't use anything in an
OS repository) and then follow the docs:
http://www.indexdata.com/yaz/doc/yaz-marcdump.html.  The first example
is what you want:

 yaz-marcdump -f MARC-8 -t UTF-8 -o marc -l 9=97 marc21.raw >marc21.utf8.raw

The -f is the source encoding, the -t is the target encoding, and the
-l 9=97 sets leader position 9 to "a" (97 is the decimal ASCII code for
"a"), which marks the record as Unicode.
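
To spell out the leader part: position 9 of the 24-character MARC leader is the character coding scheme, blank for MARC-8 and "a" for UCS/Unicode. A minimal Python sketch of just that tweak, using a made-up leader string and a hypothetical helper name; it does not touch any field data:

```python
def set_leader_char(leader: str, position: int, code: int) -> str:
    """Return a copy of a 24-character MARC leader with one position replaced.

    Hypothetical helper mirroring the idea behind `-l 9=97`: `code` is the
    decimal value of the replacement character.
    """
    if len(leader) != 24:
        raise ValueError("a MARC leader is always 24 characters long")
    return leader[:position] + chr(code) + leader[position + 1:]


# Made-up leader for illustration; position 9 is blank, i.e. MARC-8.
marc8_leader = "00714cam  2200205 a 4500"
utf8_leader = set_leader_char(marc8_leader, 9, 97)
print(repr(utf8_leader[9]))
```

Flipping this character without converting the data is exactly the mismatch described above, so the two changes need to happen together.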

I've typically found this is one of the easier ways to do the
character set encoding, although the various Perl modules (if they're
recent enough) should be able to handle the conversion as well through
the MARC::Charset library.  Check the cpan pages.

Jon Gorman

ps.  For the love of all that is good, don't try to do anything in
Perl with the raw MARC record to do the encoding change yourself.
I've seen someone really screw records up because they altered
individual characters, which in turn led to different byte lengths.
This caused all sorts of insanity which meant really weird things
happened with MARC parsers that tried to follow the MARC directory
(which uses byte addresses to deal with variable fields).
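
A small Python sketch of why that goes wrong. Each 12-character directory entry stores a field's length and starting offset in bytes, so changing the bytes of a field without rebuilding the directory leaves the entries pointing at the wrong places. The field content and helper name are invented, the field is simplified (no indicators or subfield codes), and the MARC-8 bytes assume the ANSEL combining acute (0xE2) written before its base letter.

```python
FIELD_TERMINATOR = b"\x1e"


def directory_entry(tag: str, field: bytes, start: int) -> str:
    """Build one 12-character directory entry: tag (3), length (4), start (5).

    The length is counted in bytes and includes the field terminator.
    """
    return f"{tag}{len(field):04d}{start:05d}"


# "café" as MARC-8 bytes: the combining acute (0xE2) precedes the base "e".
marc8_field = b"caf\xe2e" + FIELD_TERMINATOR
# The same text as decomposed UTF-8: "e" followed by U+0301.
utf8_field = "cafe\u0301".encode("utf-8") + FIELD_TERMINATOR

print(directory_entry("500", marc8_field, 0))
print(directory_entry("500", utf8_field, 0))
```

The two entries differ in the length digits, and every field after this one would also need a new starting offset, which is the bookkeeping a MARC library does for you.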


Re: [CODE4LIB] marc-8

2011-10-24 Thread Jonathan Rochkind
What _ought_ to be easiest of all is getting our ILS's to NEVER export 
Marc8 _ever_ again.  UTF8 only.


Sadly, that only ought to be easiest.

But IMO there's no reason any of us should be dealing with Marc8 ever 
again.  The only thing that should deal in Marc8 is an ILS, and should 
only input it, NEVER output it, UTF8 only, please!


But this is not the world we live in.

I tried to figure out how to custom add a new encoding to ruby 1.9 with 
the idea of adding Marc8 as an actual ruby 1.9 character encoding 
supported same as any other built in char encoding, but I couldn't 
figure out if that was possible or how to do it.  If it was possible to 
do at that low level in ruby 1.9, it might justify the time to do it.




Re: [CODE4LIB] marc-8

2011-10-24 Thread Jonathan Rochkind

On 10/24/2011 2:52 PM, Ross Singer wrote:
> On Mon, Oct 24, 2011 at 7:39 PM, Eric Lease Morgan  wrote:
>
>> Okay. How do I go about converting MARC-8 encoded records into UTF-8? I know
>> yaz-marcdump changes the encoding bit in MARC leaders. Does it also convert
>> MARC-8 characters to UTF-8? (I guess I could simply try it and see what
>> happens.)
>
> Yes, it does.  It uses yaz-iconv.  Theoretically, you could wrap some
> Perl module around that.  I've contemplated it for ruby-marc, but then
> it always seems a lot easier to ignore it and delete any emails that
> request it.


Or use jruby, where you can use Marc4J.   Or actually port either the 
Java or (apparently?) Perl version into ruby; okay, that one is not 
"easier" than anything in the short term, but in the long term I'd 
rather have pure ruby than something that relies on an external bash 
call or a C extension. Those latter are invariably going to be annoying 
and confusing to maintain down the line, in my experience.


But I'm not doing any of these things anytime soon either. So far all my 
ruby that deals with Marc gets something else to convert it first.  (In 
my largest case, Java Marc4J converts it before it's stored in a stored 
field in a Solr index, and my ruby only gets it from the stored field in 
Solr, already converted).


Re: [CODE4LIB] marc-8

2011-10-24 Thread Michael J. Giarlo
If I understand correctly, there's some support for this in pymarc as well:

https://github.com/edsu/pymarc/blob/master/pymarc/marc8.py#L22

-Mike


On Mon, Oct 24, 2011 at 14:52, Jonathan Rochkind  wrote:
> Woah, there is a library in Perl to do that? Sweet!  Okay, now I know two
> languages with such a library, Perl and Java.
>
> Anyone want to write one for ruby? :)
>
> On 10/24/2011 2:47 PM, Doran, Michael D wrote:
>>>
>>> Okay. How do I go about converting MARC-8 encoded records into UTF-8?
>>
>> In Perl... using the handy MARC::Charset module (tip 'o the hat to Ed
>> Summers, and now maintained by Galen Charlton).
>>
>> -- Michael
>>
>>> -Original Message-
>>> From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
>>> Eric
>>> Lease Morgan
>>> Sent: Monday, October 24, 2011 1:39 PM
>>> To: CODE4LIB@LISTSERV.ND.EDU
>>> Subject: Re: [CODE4LIB] marc-8
>>>
>>> On Oct 24, 2011, at 2:34 PM, Doran, Michael D wrote:
>>>
>>>>> In Perl, how do I specify MARC-8 when reading (decoding) and writing
>>>>> (encoding) data?
>>>>
>>>> You can't.  MARC-8 is a character set that is unknown to the operating
>>>
>>> system.  Your best bet is to convert MARC-8-encoded records into UTF-8.
>>>
>>> /me throws his hands up in the air and screams!
>>>
>>> Okay. How do I go about converting MARC-8 encoded records into UTF-8? I
>>> know
>>> yaz-marcdump changes the encoding bit in MARC leaders. Does it also
>>> convert
>>> MARC-8 characters to UTF-8? (I guess I could simply try it and see what
>>> happens.)
>>>
>>> --
>>> Eric Morgan
>


Re: [CODE4LIB] marc-8

2011-10-24 Thread Doran, Michael D
Eric,

Sometimes for grandpa Perl stuff -- especially as concerns charsets and/or 
internationalization -- it's worth pinging these lists:

perl4...@perl.org (yes, still alive and kicking)

perl-i...@perl.org (very low traffic list, but some knowledgeable 
subscribers)

-- Michael

> -Original Message-
> From: Doran, Michael D
> Sent: Monday, October 24, 2011 1:48 PM
> To: 'Code for Libraries'
> Subject: RE: [CODE4LIB] marc-8
> 
> > Okay. How do I go about converting MARC-8 encoded records into UTF-8?
> 
> In Perl... using the handy MARC::Charset module (tip 'o the hat to Ed
> Summers, and now maintained by Galen Charlton).
> 
> -- Michael
> 
> > -Original Message-
> > From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
> Eric
> > Lease Morgan
> > Sent: Monday, October 24, 2011 1:39 PM
> > To: CODE4LIB@LISTSERV.ND.EDU
> > Subject: Re: [CODE4LIB] marc-8
> >
> > On Oct 24, 2011, at 2:34 PM, Doran, Michael D wrote:
> >
> > >> In Perl, how do I specify MARC-8 when reading (decoding) and writing
> > >> (encoding) data?
> > >
> > > You can't.  MARC-8 is a character set that is unknown to the operating
> > system.  Your best bet is to convert MARC-8-encoded records into UTF-8.
> >
> > /me throws his hands up in the air and screams!
> >
> > Okay. How do I go about converting MARC-8 encoded records into UTF-8? I
> know
> > yaz-marcdump changes the encoding bit in MARC leaders. Does it also
> convert
> > MARC-8 characters to UTF-8? (I guess I could simply try it and see what
> > happens.)
> >
> > --
> > Eric Morgan


Re: [CODE4LIB] marc-8

2011-10-24 Thread Ross Singer
On Mon, Oct 24, 2011 at 7:39 PM, Eric Lease Morgan  wrote:

> Okay. How do I go about converting MARC-8 encoded records into UTF-8? I know 
> yaz-marcdump changes the encoding bit in MARC leaders. Does it also convert 
> MARC-8 characters to UTF-8? (I guess I could simply try it and see what 
> happens.)
>

Yes, it does.  It uses yaz-iconv.  Theoretically, you could wrap some
Perl module around that.  I've contemplated it for ruby-marc, but then
it always seems a lot easier to ignore it and delete any emails that
request it.

A whole lot easier.

-Ross.
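
For what it's worth, the "wrap it" route can be as thin as shelling out to yaz-marcdump. A Python sketch, assuming yaz-marcdump is installed and on the PATH and using only the options from the command quoted elsewhere in this thread; the function names are invented:

```python
import subprocess


def marcdump_command(source: str) -> list[str]:
    """Build the yaz-marcdump argument list for a MARC-8 to UTF-8 conversion."""
    return [
        "yaz-marcdump",
        "-f", "MARC-8",   # source encoding
        "-t", "UTF-8",    # target encoding
        "-o", "marc",     # write MARC records rather than a text dump
        "-l", "9=97",     # set leader position 9 to "a"
        source,
    ]


def convert_file(source: str, target: str) -> None:
    """Run yaz-marcdump and write its standard output to `target`."""
    with open(target, "wb") as out:
        subprocess.run(marcdump_command(source), stdout=out, check=True)


print(" ".join(marcdump_command("marc21.raw")))
```

Passing an argument list instead of a shell string avoids quoting problems with odd file names, though it is still an external dependency to maintain.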


Re: [CODE4LIB] marc-8

2011-10-24 Thread Jonathan Rochkind
Woah, there is a library in Perl to do that? Sweet!  Okay, now I know 
two languages with such a library, Perl and Java.


Anyone want to write one for ruby? :)



Re: [CODE4LIB] marc-8

2011-10-24 Thread Jonathan Rochkind
The only language that I know of with a library for reading Marc8 and 
converting to another encoding (such as UTF-8) is Java. The Marc4J 
package will do it.


I suppose there may be C libraries too; is yaz written in C?

As Michael suggests the easiest thing to do (if you're not in Java) is 
probably to use the 'yaz' tools to convert to UTF-8 before anything else 
touches it.


If you do end up writing a Marc8 handling library in another language 
like Perl (presumably you could use the Java code in Marc4J as a guide), 
please do share! Heh.




Re: [CODE4LIB] marc-8

2011-10-24 Thread Doran, Michael D
> Okay. How do I go about converting MARC-8 encoded records into UTF-8?

In Perl... using the handy MARC::Charset module (tip 'o the hat to Ed Summers, 
and now maintained by Galen Charlton).

-- Michael

> -Original Message-
> From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Eric
> Lease Morgan
> Sent: Monday, October 24, 2011 1:39 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: Re: [CODE4LIB] marc-8
> 
> On Oct 24, 2011, at 2:34 PM, Doran, Michael D wrote:
> 
> >> In Perl, how do I specify MARC-8 when reading (decoding) and writing
> >> (encoding) data?
> >
> > You can't.  MARC-8 is a character set that is unknown to the operating
> system.  Your best bet is to convert MARC-8-encoded records into UTF-8.
> 
> /me throws his hands up in the air and screams!
> 
> Okay. How do I go about converting MARC-8 encoded records into UTF-8? I know
> yaz-marcdump changes the encoding bit in MARC leaders. Does it also convert
> MARC-8 characters to UTF-8? (I guess I could simply try it and see what
> happens.)
> 
> --
> Eric Morgan


Re: [CODE4LIB] marc-8

2011-10-24 Thread Walker, David
> I know yaz-marcdump changes the encoding bit in MARC
> leaders. Does it also convert MARC-8 characters to UTF-8?

Yes.  We use it for that purpose all the time.

--Dave

-
David Walker
Library Web Services Manager
California State University


-Original Message-
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Eric 
Lease Morgan
Sent: Monday, October 24, 2011 11:39 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] marc-8

On Oct 24, 2011, at 2:34 PM, Doran, Michael D wrote:

>> In Perl, how do I specify MARC-8 when reading (decoding) and writing
>> (encoding) data?
> 
> You can't.  MARC-8 is a character set that is unknown to the operating 
> system.  Your best bet is to convert MARC-8-encoded records into UTF-8. 

/me throws his hands up in the air and screams!

Okay. How do I go about converting MARC-8 encoded records into UTF-8? I know 
yaz-marcdump changes the encoding bit in MARC leaders. Does it also convert 
MARC-8 characters to UTF-8? (I guess I could simply try it and see what 
happens.)

-- 
Eric Morgan


Re: [CODE4LIB] marc-8

2011-10-24 Thread Eric Lease Morgan
On Oct 24, 2011, at 2:34 PM, Doran, Michael D wrote:

>> In Perl, how do I specify MARC-8 when reading (decoding) and writing
>> (encoding) data?
> 
> You can't.  MARC-8 is a character set that is unknown to the operating 
> system.  Your best bet is to convert MARC-8-encoded records into UTF-8. 

/me throws his hands up in the air and screams!

Okay. How do I go about converting MARC-8 encoded records into UTF-8? I know 
yaz-marcdump changes the encoding bit in MARC leaders. Does it also convert 
MARC-8 characters to UTF-8? (I guess I could simply try it and see what 
happens.)

-- 
Eric Morgan


Re: [CODE4LIB] marc-8

2011-10-24 Thread Doran, Michael D
> As an FTY,

Oops, in a hurry.  s/FTY/FYI/

> -Original Message-
> From: Doran, Michael D
> Sent: Monday, October 24, 2011 1:35 PM
> To: 'Code for Libraries'
> Subject: RE: marc-8
> 
> Hi Eric,
> 
> > In Perl, how do I specify MARC-8 when reading (decoding) and writing
> > (encoding) data?
> 
> You can't.  MARC-8 is a character set that is unknown to the operating
> system.  Your best bet is to convert MARC-8-encoded records into UTF-8.
> 
> > ...it is converted it Perl's
> > internal encoding (UTF-8)
> 
> As an FTY, UTF-8 is *not* Perl's internal encoding.
> 
> -- Michael
> 
> # Michael Doran, Systems Librarian
> # University of Texas at Arlington
> # 817-272-5326 office
> # 817-688-1926 mobile
> # do...@uta.edu
> # http://rocky.uta.edu/doran/
> 
> 
> 
> > -Original Message-
> > From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of
> Eric
> > Lease Morgan
> > Sent: Monday, October 24, 2011 1:18 PM
> > To: CODE4LIB@LISTSERV.ND.EDU
> > Subject: [CODE4LIB] marc-8
> >
> > In Perl, how do I specify MARC-8 when reading (decoding) and writing
> > (encoding) data?
> >
> > Character encoding is the bane of my existence. I have learned that when
> > reading from a file I ought to specify the type of encoding the file is in
> > and decode accordingly, or else. Once read, it is converted it Perl's
> > internal encoding (UTF-8) and can be manipulated. Similarly, when writing I
> > am expected to specify the encoding. Both the reading (decoding) and the
> > writing (encoding) can be done with the Encode module. Here is a some code
> > illustrating what I'm trying to do with MARC records which are apparently
> in
> > MARC-8:
> >
> >   # require
> >   use Encode qw( encode decode );
> >
> >   # initialize
> >   my $batch = MARC::Batch->new( 'USMARC', './records.mrc' );
> >   open OUT, ' > updated.mrc';
> >
> >   # process each record
> >   while ( my $marc = $batch->next ) {
> >
> > # get the title
> > my $_245 = decode( 'FOO', $marc->title );
> >
> > # do cool stuff with the title here
> >
> > # output the cool stuff
> > print OUT encode( 'FOO', $_245 );
> >
> >   }
> >
> >   # done
> >   close OUT;
> >   exit;
> >
> >
> > My problem is, I don't know what to put in place of FOO. What is the
> official
> > name of MARC-8's encoding scheme?
> >
> > --
> > Eric "The Ugly American" Morgan
> > University of Notre Dame
> >
> > (574) 631-8604


Re: [CODE4LIB] marc-8

2011-10-24 Thread Doran, Michael D
Hi Eric,

> In Perl, how do I specify MARC-8 when reading (decoding) and writing
> (encoding) data?

You can't.  MARC-8 is a character set that is unknown to the operating system.  
Your best bet is to convert MARC-8-encoded records into UTF-8. 
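
The same gap shows up outside Perl. As a quick illustration in Python (only an analogy to the Encode situation, not Perl itself), the standard codec registry has no entry for MARC-8, while UTF-8 resolves fine. The helper name is made up:

```python
import codecs


def codec_known(name: str) -> bool:
    """Report whether Python's codec registry recognises an encoding name."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


for name in ("utf-8", "marc-8", "marc8"):
    print(name, codec_known(name))
```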

> ...it is converted it Perl's
> internal encoding (UTF-8)

As an FTY, UTF-8 is *not* Perl's internal encoding.

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# do...@uta.edu
# http://rocky.uta.edu/doran/



> -Original Message-
> From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Eric
> Lease Morgan
> Sent: Monday, October 24, 2011 1:18 PM
> To: CODE4LIB@LISTSERV.ND.EDU
> Subject: [CODE4LIB] marc-8
> 
> In Perl, how do I specify MARC-8 when reading (decoding) and writing
> (encoding) data?
> 
> Character encoding is the bane of my existence. I have learned that when
> reading from a file I ought to specify the type of encoding the file is in
> and decode accordingly, or else. Once read, it is converted it Perl's
> internal encoding (UTF-8) and can be manipulated. Similarly, when writing I
> am expected to specify the encoding. Both the reading (decoding) and the
> writing (encoding) can be done with the Encode module. Here is a some code
> illustrating what I'm trying to do with MARC records which are apparently in
> MARC-8:
> 
>   # require
>   use Encode qw( encode decode );
> 
>   # initialize
>   my $batch = MARC::Batch->new( 'USMARC', './records.mrc' );
>   open OUT, ' > updated.mrc';
> 
>   # process each record
>   while ( my $marc = $batch->next ) {
> 
> # get the title
> my $_245 = decode( 'FOO', $marc->title );
> 
> # do cool stuff with the title here
> 
> # output the cool stuff
> print OUT encode( 'FOO', $_245 );
> 
>   }
> 
>   # done
>   close OUT;
>   exit;
> 
> 
> My problem is, I don't know what to put in place of FOO. What is the official
> name of MARC-8's encoding scheme?
> 
> --
> Eric "The Ugly American" Morgan
> University of Notre Dame
> 
> (574) 631-8604


[CODE4LIB] marc-8

2011-10-24 Thread Eric Lease Morgan
In Perl, how do I specify MARC-8 when reading (decoding) and writing (encoding) 
data?

Character encoding is the bane of my existence. I have learned that when 
reading from a file I ought to specify the type of encoding the file is in and 
decode accordingly, or else. Once read, the data is converted to Perl's internal 
encoding (UTF-8) and can be manipulated. Similarly, when writing I am expected 
to specify the encoding. Both the reading (decoding) and the writing (encoding) 
can be done with the Encode module. Here is some code illustrating what I'm 
trying to do with MARC records, which are apparently in MARC-8:

  # require
  use strict;
  use warnings;
  use MARC::Batch;
  use Encode qw( encode decode );

  # initialize
  my $batch = MARC::Batch->new( 'USMARC', './records.mrc' );
  open my $out, '>', 'updated.mrc' or die "Can't open updated.mrc: $!";

  # process each record
  while ( my $marc = $batch->next ) {

    # get the title
    my $_245 = decode( 'FOO', $marc->title );

    # do cool stuff with the title here

    # output the cool stuff
    print {$out} encode( 'FOO', $_245 );

  }

  # done
  close $out;
  exit;


My problem is, I don't know what to put in place of FOO. What is the official 
name of MARC-8's encoding scheme?

-- 
Eric "The Ugly American" Morgan
University of Notre Dame

(574) 631-8604