RE: printing UTF-8 encoded MARC records with as_usmarc
Hi Devon,

> I just recently came across this presentation which lays out pretty much all the issues with Unicode in perl, and makes some recommendations for best practices.

While Nick Patch's presentation is excellent, I'm not sure that it lays out pretty much all the issues with Unicode in perl. ;-)

To fit that bill, I highly recommend this series of talks given by Tom Christiansen at OSCON 2011:

1. Perl Unicode Essentials
2. Unicode in Perl Regexes
3. Unicode Support Shootout: The Good, The Bad, the (mostly) Ugly

http://training.perl.com/OSCON2011/index.html
(resolves to http://98.245.80.27/tcpc/OSCON2011/index.html)

If you read through those presentations and disagree, I promise to buy you a beer at the next conference (code4lib?) we both attend.

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 mobile
# do...@uta.edu
# http://rocky.uta.edu/doran/

-----Original Message-----
From: Smith,Devon [mailto:smit...@oclc.org]
Sent: Tuesday, July 31, 2012 8:26 AM
To: William Dueber; Shelley Doljack
Cc: perl4lib@perl.org
Subject: RE: printing UTF-8 encoded MARC records with as_usmarc

I just recently came across this presentation which lays out pretty much all the issues with Unicode in perl, and makes some recommendations for best practices. You may find some general insight into the whole situation by going over it.

http://www.slideshare.net/nickpatch/fundamental-unicode-at-dcbaltimore-perl-workshop-2012

/dev

--
Devon Smith
Consulting Software Engineer
OCLC Research
http://www.oclc.org/research/people/smith.htm

-----Original Message-----
From: William Dueber [mailto:dueb...@umich.edu]
Sent: Monday, July 30, 2012 8:14 PM
To: Shelley Doljack
Cc: perl4lib@perl.org
Subject: Re: printing UTF-8 encoded MARC records with as_usmarc

First off, it's entirely possible that you have bad UTF-8 (perhaps rogue MARC-8, perhaps just lousy characters) in your MARC. I know we have plenty of that crap.

You need to tell perl that you'll be outputting UTF-8 using 'binmode':

    binmode(FILE, ':utf8');

In general, you'll want to do this to basically every file you open for reading or writing.

A great overview of Perl and UTF-8 can be found at:

http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default

On Mon, Jul 30, 2012 at 6:51 PM, Shelley Doljack <sdolj...@stanford.edu> wrote:

Hi,

I wrote a script that extracts MARC records from a file given certain conditions and puts them in a new file. When my input record is correctly encoded in UTF-8 and I run my script from the Windows command prompt, this warning message appears: "Wide character in print at record_extraction.pl line 99" (the line in my script where I print to a new file using as_usmarc). I compared the extracted record before and after in MarcEdit and the diacritic was changed. I tried "marcdump newfile.mrc" to see what happens and I get this error: "utf8 \xF4 does not map to Unicode at C:/Perl64/lib/Encode.pm line 176".

When I run my extraction script again with MARC-8 encoded data then I don't have the same problem.

The basic outline of my script is:

    my $batch = MARC::Batch->new('USMARC', $input_file);
    while (my $record = $batch->next()) {
        # do some checks
        # if checks ok then
        print FILE $record->as_usmarc();
    }

Do I need to add something that specifies to interpret the data as UTF-8? Does MARC::Record not handle UTF-8 at all?

Thanks,
Shelley

Shelley Doljack
E-Resources Metadata Librarian
Metadata and Library Systems
Stanford University Libraries
sdolj...@stanford.edu
650-725-0167

--
Bill Dueber
Programmer -- Library Systems
University of Michigan
RE: printing UTF-8 encoded MARC records with as_usmarc
-----Original Message-----
From: Shelley Doljack [mailto:sdolj...@stanford.edu]
Sent: 31 July 2012 20:18

> The problem was I wasn't telling perl to output UTF-8. Now that I added binmode(FILE, ':utf8') to my script, the problem is fixed. However, it sounds like once I set binmode to UTF-8 everything will be interpreted as such, even when the record is in MARC-8. Is that right? So this means that I can only use my script with a file of records where all of them are encoded in UTF-8. If I want to run the script against a file with all MARC-8 encoding, then I'd need to remove the binmode line.

It depends how much manipulation of the records you are doing in the script. One approach is to use

    binmode(FILE, ':raw');

for both input and output. Perl will then keep the bytes of the records exactly as they are. You won't be able to test for exotic characters so easily, and amending field content would be inadvisable, but if all you are doing is something like reading in the records and filtering out any that have no 245 field, or something fairly basic like that, this could be the best approach.

The MARC::Record module does not seem to care how the records are encoded. It's only once you start altering field content, testing field content, or adding fields that the character set being used becomes an issue. Removing fields would be fine too.

MARC-8 can be very complex, particularly if other code tables like CJK are invoked, or even just Greek or Cyrillic. If you were manipulating field content in that kind of way then converting everything to UTF-8 would make things very much easier.

Matthew

--
Matthew Phillips
Electronic Systems Librarian, Durham University
Durham University Library, Stockton Road, Durham, DH1 3LY
+44 (0)191 334 2941
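The byte-preserving approach described above (':raw' on both input and output, keeping only records that have a 245 field) can be sketched roughly as follows. This is only an illustration under stated assumptions: it does not use MARC::Record at all, the names has_tag and filter_records are made up for the example, and it relies on the ISO 2709 layout (24-byte leader, base address of data in leader positions 12-16, 12-byte directory entries, 0x1D as record terminator). It assumes well-formed records.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the raw ISO 2709 record has a directory
# entry for $tag. Only the leader and directory are inspected, so field
# data (MARC-8 or UTF-8) is never decoded.
sub has_tag {
    my ($raw, $tag) = @_;
    return 0 if length($raw) < 25;
    my $base = substr($raw, 12, 5);    # base address of data, leader 12-16
    return 0 unless $base =~ /^\d{5}$/ && $base >= 25;
    # The directory runs from byte 24 up to the field terminator that
    # sits just before the base address.
    my $directory = substr($raw, 24, $base - 25);
    # Each entry is 12 bytes: 3-byte tag, 4-byte length, 5-byte offset.
    for (my $i = 0; $i + 12 <= length($directory); $i += 12) {
        return 1 if substr($directory, $i, 3) eq $tag;
    }
    return 0;
}

# Hypothetical helper: copy every record that has $tag from $in to $out,
# byte for byte, and return how many records were kept.
sub filter_records {
    my ($in, $out, $tag) = @_;
    binmode($in,  ':raw');
    binmode($out, ':raw');
    local $/ = "\x1D";                 # read one record per <$in>
    my $kept = 0;
    while (defined(my $raw = <$in>)) {
        next unless has_tag($raw, $tag);
        print {$out} $raw;
        $kept++;
    }
    return $kept;
}
```

With two already opened filehandles, filter_records($in, $out, '245') would copy only the records that have a 245 field. Because nothing is decoded, the same code handles MARC-8, UTF-8, or a mixture; the trade-off, as noted above, is that field content cannot safely be tested or changed this way.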
Re: printing UTF-8 encoded MARC records with as_usmarc
On Tue, Jul 31, 2012 at 09:25:55AM -0400, Smith,Devon wrote:
> I just recently came across this presentation which lays out pretty much all the issues with Unicode in perl, and makes some recommendations for best practices. You may find some general insight into the whole situation by going over it.

In the course of preparing the latest edition of the Camel book, Tom Christiansen created a Perl Unicode Cookbook; see

http://www.perl.com/pub/2012/04/perlunicook-standard-preamble.html

It's available in a few different places on the web.

C.

--
Colin Campbell
Chief Software Engineer, PTFS Europe Limited
Content Management and Library Solutions
+44 (0) 800 756 6803 (phone)
+44 (0) 7759 633626 (mobile)
colin.campb...@ptfs-europe.com
skype: colin_campbell2
http://www.ptfs-europe.com
RE: printing UTF-8 encoded MARC records with as_usmarc
I just recently came across this presentation which lays out pretty much all the issues with Unicode in perl, and makes some recommendations for best practices. You may find some general insight into the whole situation by going over it.

http://www.slideshare.net/nickpatch/fundamental-unicode-at-dcbaltimore-perl-workshop-2012

/dev

--
Devon Smith
Consulting Software Engineer
OCLC Research
http://www.oclc.org/research/people/smith.htm
Re: printing UTF-8 encoded MARC records with as_usmarc
The problem was I wasn't telling perl to output UTF-8. Now that I added binmode(FILE, ':utf8') to my script, the problem is fixed. However, it sounds like once I set binmode to UTF-8 everything will be interpreted as such, even when the record is in MARC-8. Is that right? So this means that I can only use my script with a file of records where all of them are encoded in UTF-8. If I want to run the script against a file with all MARC-8 encoding, then I'd need to remove the binmode line. It doesn't seem possible to say:

    if ($record->encoding() eq 'UTF-8') {
        binmode(FILE, ':utf8');
        print FILE $record->as_usmarc();
    }
    else {
        print FILE $record->as_usmarc();
    }

This will result in messing up the diacritics if a file has a mixture of records in MARC-8 and UTF-8. Is that correct?

Thanks,
Shelley
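One way to sidestep the "binmode stays set" problem raised here is to leave the output handle in ':raw' mode and do the encoding explicitly, record by record. The sketch below is only an illustration: output_bytes and write_mixed are hypothetical names, the encoding label stands in for what $record->encoding() would report, and the string stands in for $record->as_usmarc(). It assumes that UTF-8 records arrive as Perl character strings and that MARC-8 records arrive as plain bytes; check that against your version of MARC::Record before relying on it.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: return the exact bytes to write for one record.
# $encoding is a label such as 'UTF-8' or 'MARC-8'; $serialized stands
# in for the string returned by as_usmarc().
sub output_bytes {
    my ($encoding, $serialized) = @_;
    # Assumption: UTF-8 records are character strings and must be encoded;
    # anything else is taken to be bytes already and is passed through.
    return $encoding eq 'UTF-8' ? encode('UTF-8', $serialized) : $serialized;
}

# Hypothetical helper: write a mixed list of [encoding, serialized record]
# pairs to one handle without the handle re-encoding anything.
sub write_mixed {
    my ($out, @records) = @_;
    binmode($out, ':raw');    # no encoding layer on the handle itself
    print {$out} output_bytes(@$_) for @records;
    return;
}
```

Since the handle never has a ':utf8' layer, a MARC-8 record that follows a UTF-8 one is written untouched, which is the case the if/else above cannot handle.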
Re: printing UTF-8 encoded MARC records with as_usmarc
On Wed, Aug 1, 2012 at 12:47 AM, Shelley Doljack <sdolj...@stanford.edu> wrote:

> The problem was I wasn't telling perl to output UTF-8. Now that I added binmode(FILE, ':utf8') to my script, the problem is fixed. However, it sounds like once I set binmode to UTF-8 everything will be interpreted as such, even when the record is in MARC-8. Is that right? So this means that I can only use my script with a file of records where all of them are encoded in UTF-8. If I want to run the script against a file with all MARC-8 encoding, then I'd need to remove the binmode line.

Sometimes it's easier to use the yaz-marcdump utility for MARC-8 to UTF-8 conversion (it's much faster):

    yaz-marcdump -f MARC-8 -t UTF-8 -o marc marc21.in > marc21.out

http://www.indexdata.com/yaz/doc/yaz-marcdump.html

Best regards,
Saiful Amin
DRTC, Bangalore
Re: printing UTF-8 encoded MARC records with as_usmarc
First off, it's entirely possible that you have bad UTF-8 (perhaps rogue MARC-8, perhaps just lousy characters) in your MARC. I know we have plenty of that crap.

You need to tell perl that you'll be outputting UTF-8 using 'binmode':

    binmode(FILE, ':utf8');

In general, you'll want to do this to basically every file you open for reading or writing.

A great overview of Perl and UTF-8 can be found at:

http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default

--
Bill Dueber
Programmer -- Library Systems
University of Michigan
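To see why the output layer matters, here is a small core-Perl sketch that writes the same character string through two different layers and dumps the resulting bytes. It is an illustration only: bytes_written is a made-up helper, an in-memory filehandle stands in for FILE, and it uses ':encoding(UTF-8)' (the stricter form of the layer) rather than the ':utf8' shown above.

```perl
use strict;
use warnings;

# Hypothetical helper: write $text to an in-memory file through the
# given layer and return the bytes that ended up in the "file".
sub bytes_written {
    my ($text, $layer) = @_;
    my $buffer = '';
    open(my $fh, ">$layer", \$buffer) or die "open failed: $!";
    print {$fh} $text;
    close($fh) or die "close failed: $!";
    return $buffer;
}

my $text = "c\x{F4}t\x{E9}";    # "côté" as a Perl character string

my $no_layer = bytes_written($text, ':raw');
my $utf8     = bytes_written($text, ':encoding(UTF-8)');

# Show each result as hex bytes.
printf "no layer: %s\n", join ' ', map { sprintf '%02X', ord } split //, $no_layer;
printf "UTF-8:    %s\n", join ' ', map { sprintf '%02X', ord } split //, $utf8;
# no layer: 63 F4 74 E9
# UTF-8:    63 C3 B4 74 C3 A9
```

Without a layer, U+00F4 goes out as the single byte F4, which is not valid UTF-8 on its own; that is consistent with the "utf8 \xF4 does not map to Unicode" error reported when the file was read back. With the encoding layer, the same character is written as C3 B4.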