RE: printing UTF-8 encoded MARC records with as_usmarc

2012-08-15 Thread Doran, Michael D
Hi Devon,

> I just recently came across this presentation which lays out pretty much
> all the issues with Unicode in perl, and makes some recommendations for
> best practices.

While Nick Patch's presentation is excellent, I'm not sure that it lays out 
pretty much all the issues with Unicode in perl.  ;-)

To fit that bill, I highly recommend this series of talks given by Tom 
Christiansen at OSCON 2011:

 1. Perl Unicode Essentials
 2. Unicode in Perl Regexes
 3. Unicode Support Shootout: The Good, The Bad, & the (mostly) Ugly

http://training.perl.com/OSCON2011/index.html
(resolves to http://98.245.80.27/tcpc/OSCON2011/index.html)

If you read through those presentations and disagree, I promise to buy you a 
beer at the next conference (code4lib?) we both attend.

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# do...@uta.edu
# http://rocky.uta.edu/doran/



RE: printing UTF-8 encoded MARC records with as_usmarc

2012-08-01 Thread PHILLIPS M.E.
> -----Original Message-----
> From: Shelley Doljack [mailto:sdolj...@stanford.edu]
> Sent: 31 July 2012 20:18
>
> The problem was I wasn't telling perl to output UTF-8. Now that I added
> binmode(FILE, ':utf8') to my script, the problem is fixed. However, it sounds
> like once I set binmode to UTF-8 everything will be interpreted as such, even
> when the record is in MARC-8. Is that right? So this means that I can only use
> my script with a file of records where all of them are encoded in UTF-8. If I
> want to run the script against a file with all MARC-8 encoding, then I'd need
> to remove the binmode line.

It depends how much manipulation of the records you are doing in the script.  
One approach is to use

binmode(FILE, ':raw');

for both input and output.  Perl will then keep the bytes of the records 
exactly as they are.  You won't be able to test for exotic characters so 
easily, and amending field content would be inadvisable, but if all you are 
doing is something like reading in the records and filtering out any that have 
no 245 field, or something fairly basic like that, this could be the best 
approach.
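
A rough sketch of that byte-for-byte approach, using only core Perl rather than MARC::Record. It assumes the standard ISO 2709 layout (a 24-byte leader, 12-byte directory entries, and records terminated by 0x1D). The names has_tag and filter_raw are hypothetical helpers made up for this illustration; they are not part of any MARC module.

```perl
use strict;
use warnings;

# Return true if the raw record has a directory entry for $tag.
sub has_tag {
    my ($raw, $tag) = @_;
    my $base = substr($raw, 12, 5);           # leader/12-16: base address of data
    my $dir  = substr($raw, 24, $base - 25);  # directory without its terminator
    for (my $i = 0; $i + 12 <= length $dir; $i += 12) {
        return 1 if substr($dir, $i, 3) eq $tag;   # first 3 bytes are the tag
    }
    return 0;
}

# Copy every record that has a 245 field from $in_path to $out_path.
# Both handles are :raw, so the bytes are never decoded or re-encoded.
sub filter_raw {
    my ($in_path, $out_path) = @_;
    open my $in,  '<:raw', $in_path  or die "Cannot read $in_path: $!";
    open my $out, '>:raw', $out_path or die "Cannot write $out_path: $!";
    local $/ = "\x1D";                        # read one MARC record at a time
    my $kept = 0;
    while (my $raw = <$in>) {
        next unless length($raw) > 24 && has_tag($raw, '245');
        print {$out} $raw;
        $kept++;
    }
    close $in;
    close $out or die "Cannot close $out_path: $!";
    return $kept;
}
```

Because nothing is decoded, the same script should behave identically on MARC-8 and UTF-8 files. The trade-off is the one described above: it can select or drop whole records, but it should not be used to edit field content.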

The MARC::Record module does not seem to care how the records are encoded.  
It's only once you start altering field content, testing field content, or 
adding fields that the character set being used becomes an issue.  Removing 
fields would be fine too.

MARC-8 can be very complex, particularly if other code tables like CJK are 
invoked, or even just Greek or Cyrillic.  If you were manipulating field 
content in that kind of way then converting everything to UTF-8 would make 
things very much easier.
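
Before converting, it can help to know whether a file is uniformly one encoding or a mixture. In MARC 21, leader position 09 declares the character coding scheme: 'a' for UCS/Unicode and blank for MARC-8. The sketch below counts what each record declares, using core Perl only; raw_encoding and count_encodings are hypothetical names chosen for this example. It reports only what the leader claims, so a record with a wrong leader or bad bytes will still be miscounted.

```perl
use strict;
use warnings;

# Report the encoding a raw record declares in leader position 09.
sub raw_encoding {
    my ($raw) = @_;
    return substr($raw, 9, 1) eq 'a' ? 'UTF-8' : 'MARC-8';
}

# Count how many records in the file declare each encoding.
sub count_encodings {
    my ($path) = @_;
    open my $in, '<:raw', $path or die "Cannot read $path: $!";
    local $/ = "\x1D";                         # MARC record terminator
    my %count = ('UTF-8' => 0, 'MARC-8' => 0);
    while (my $raw = <$in>) {
        next if length($raw) < 24;             # skip anything shorter than a leader
        $count{ raw_encoding($raw) }++;
    }
    close $in;
    return \%count;
}
```

A call such as count_encodings('records.mrc') returns a hash reference; if both counts are non-zero the file is mixed, and a single binmode setting on the output handle cannot suit every record.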

Matthew

-- 
Matthew Phillips
Electronic Systems Librarian, Durham University
Durham University Library, Stockton Road, Durham, DH1 3LY
+44 (0)191 334 2941




Re: printing UTF-8 encoded MARC records with as_usmarc

2012-08-01 Thread Colin Campbell
On Tue, Jul 31, 2012 at 09:25:55AM -0400, Smith,Devon wrote:
> I just recently came across this presentation which lays out pretty much all 
> the issues with Unicode in perl, and makes some recommendations for best 
> practices. You may find some general insight into the whole situation by 
> going over it.
In the course of preparing the latest edition of the Camel book, Tom
Christiansen created a Perl Unicode Cookbook; see
http://www.perl.com/pub/2012/04/perlunicook-standard-preamble.html

It's available in a few different places on the web.

C.

-- 
Colin Campbell
Chief Software Engineer,
PTFS Europe Limited
Content Management and Library Solutions
+44 (0) 800 756 6803 (phone)
+44 (0) 7759 633626  (mobile)
colin.campb...@ptfs-europe.com
skype: colin_campbell2

http://www.ptfs-europe.com


RE: printing UTF-8 encoded MARC records with as_usmarc

2012-07-31 Thread Smith,Devon
I just recently came across this presentation which lays out pretty much all 
the issues with Unicode in perl, and makes some recommendations for best 
practices. You may find some general insight into the whole situation by going 
over it.

http://www.slideshare.net/nickpatch/fundamental-unicode-at-dcbaltimore-perl-workshop-2012

/dev
-- 
Devon Smith
Consulting Software Engineer
OCLC Research
http://www.oclc.org/research/people/smith.htm




Re: printing UTF-8 encoded MARC records with as_usmarc

2012-07-31 Thread Shelley Doljack
The problem was I wasn't telling perl to output UTF-8. Now that I added 
binmode(FILE, ':utf8') to my script, the problem is fixed. However, it sounds 
like once I set binmode to UTF-8 everything will be interpreted as such, even 
when the record is in MARC-8. Is that right? So this means that I can only use 
my script with a file of records where all of them are encoded in UTF-8. If I 
want to run the script against a file with all MARC-8 encoding, then I'd need 
to remove the binmode line. 

It doesn't seem possible to say: 

if ($record->encoding() eq 'UTF-8') {
    binmode(FILE, ':utf8');
    print FILE $record->as_usmarc();
}
else {
    print FILE $record->as_usmarc();
}

This will result in messing up the diacritics if a file has a mixture of 
records in MARC-8 and UTF-8. Is that correct? 

Thanks, 
Shelley 



Re: printing UTF-8 encoded MARC records with as_usmarc

2012-07-31 Thread Dr. Saiful Amin
On Wed, Aug 1, 2012 at 12:47 AM, Shelley Doljack sdolj...@stanford.edu wrote:

> The problem was I wasn't telling perl to output UTF-8. Now that I added
> binmode(FILE, ':utf8') to my script, the problem is fixed. However, it
> sounds like once I set binmode to UTF-8 everything will be interpreted as
> such, even when the record is in MARC-8. Is that right? So this means that
> I can only use my script with a file of records where all of them are
> encoded in UTF-8. If I want to run the script against a file with all
> MARC-8 encoding, then I'd need to remove the binmode line.


Sometimes it's easier to use the yaz-marcdump utility for MARC-8 to UTF-8
conversion (it's much faster):
yaz-marcdump -f MARC-8 -t UTF-8 -o marc marc21.in > marc21.out

http://www.indexdata.com/yaz/doc/yaz-marcdump.html

Best regards,
Saiful Amin
DRTC, Bangalore


Re: printing UTF-8 encoded MARC records with as_usmarc

2012-07-30 Thread William Dueber
First off, it's entirely possible that you have bad UTF-8 (perhaps rogue
MARC-8, perhaps just lousy characters) in your MARC. I know we have plenty
of that crap.

You need to tell perl that you'll be outputting UTF-8 using binmode:

  binmode(FILE, ':utf8');

In general, you'll want to do this to basically every file you open for
reading or writing.
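
A minimal sketch of what the layer changes, using only core Perl and a temporary directory. The helper name bytes_written is hypothetical. It writes the single character U+00F4 (o with circumflex) once through a handle with no encoding layer and once through a UTF-8 layer, then reads back the bytes that reached the disk. Without the layer the character goes out as the lone byte F4, which is not valid UTF-8 by itself; that may be related to the "\xF4 does not map to Unicode" complaint quoted below, though that is an assumption rather than a diagnosis.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Write $text to $path through the given PerlIO layer, then return the
# bytes that actually landed on disk.
sub bytes_written {
    my ($path, $layer, $text) = @_;
    open my $out, ">$layer", $path or die "Cannot write $path: $!";
    print {$out} $text;
    close $out or die "Cannot close $path: $!";

    open my $in, '<:raw', $path or die "Cannot read $path: $!";
    local $/;                          # slurp the whole file
    my $bytes = <$in>;
    close $in;
    return $bytes;
}

my $dir          = tempdir(CLEANUP => 1);
my $o_circumflex = "\x{F4}";           # one character, U+00F4

# No encoding layer: the character is written as the single byte F4.
my $raw  = bytes_written("$dir/raw.txt",  ':raw', $o_circumflex);

# UTF-8 layer: the same character is written as the two bytes C3 B4.
my $utf8 = bytes_written("$dir/utf8.txt", ':encoding(UTF-8)', $o_circumflex);

printf "raw:  %s\n", join ' ', map { sprintf '%02X', ord } split //, $raw;
printf "utf8: %s\n", join ' ', map { sprintf '%02X', ord } split //, $utf8;
```

The sketch uses lexical filehandles and the three-argument open with ':encoding(UTF-8)', which validates the data more strictly than the ':utf8' layer; the effect on the output is the same as calling binmode on an already open handle.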

A great overview of Perl and UTF-8 can be found at:

http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default





On Mon, Jul 30, 2012 at 6:51 PM, Shelley Doljack sdolj...@stanford.edu wrote:

> Hi,
>
> I wrote a script that extracts marc records from a file given certain
> conditions and puts them in a new file. When my input record is correctly
> encoded in UTF-8 and I run my script from windows command prompt, this
> warning message appears: Wide character in print at record_extraction.pl
> line 99 (the line in my script where I print to a new file using
> as_usmarc). I compared the extracted record before and after in MarcEdit
> and the diacritic was changed. I tried marcdump newfile.mrc to see what
> happens and I get this error: utf8 \xF4 does not map to Unicode at
> C:/Perl64/lib/Encode.pm line 176. When I run my extraction script again
> with MARC-8 encoded data then I don't have the same problem.
>
> The basic outline of my script is:
>
> my $batch = MARC::Batch->new('USMARC', $input_file);
>
> while (my $record = $batch->next()) {
>   # do some checks
>   # if checks ok then
>   print FILE $record->as_usmarc();
> }
>
> Do I need to add something that specifies to interpret the data as UTF-8?
> Does MARC::Record not handle UTF-8 at all?
>
> Thanks,
> Shelley
>
> Shelley Doljack
> E-Resources Metadata Librarian
> Metadata and Library Systems
> Stanford University Libraries
> sdolj...@stanford.edu
> 650-725-0167




-- 

Bill Dueber
Programmer -- Library Systems
University of Michigan