Re: reading and writing of utf-8 with marc::batch
Whenever I see characters like é, I consult this website to help me figure out what's going on:

http://www.i18nqa.com/debug/bug-utf-8-latin1.html

You might find it helpful too.

Shelley

----- Original Message -----
From: Eric Lease Morgan emor...@nd.edu
To: perl4lib@perl.org
Sent: Tuesday, March 26, 2013 1:22:03 PM
Subject: reading and writing of utf-8 with marc::batch

For the life of me I can't figure out how to read and write UTF-8 with MARC::Batch.

I have a UTF-8 encoded file of MARC records. Dumping the records and grepping for a particular string illustrates the validity:

  $ marcdump und.marc | grep Sainte-Face
  und.marc
  1000 records
  2000 records
  3000 records
  4000 records
  5000 records
  6000 records
  7000 records
  8000 records
  9000 records
  10000 records
  11000 records
  12000 records
  245 00 _aAnnales de l'Archiconfrérie de la Sainte-Face
  610 20 _aArchiconfrérie de la Sainte-Face
  13000 records
  $

I then run a Perl script that simply reads each record and dumps it to STDOUT. Notice how I define both my input and output as UTF-8:

  #!/shared/perl/current/bin/perl

  # configure
  use constant MARC => './und.marc';

  # require
  use strict;
  use MARC::Batch;

  # initialize
  binmode( MARC, ':utf8' );
  my $batch = MARC::Batch->new( 'USMARC', MARC );
  $batch->strict_off;
  $batch->warnings_off;
  binmode( STDOUT, ':utf8' );

  # read and write
  while ( my $marc = $batch->next ) { print $marc->as_usmarc }

  # done
  exit;

But my output is munged:

  $ ./marc.pl > und.mrc
  $ marcdump und.mrc | grep Sainte-Face
  und.mrc
  1000 records
  2000 records
  3000 records
  4000 records
  5000 records
  6000 records
  7000 records
  8000 records
  9000 records
  10000 records
  11000 records
  12000 records
  245 00 _aAnnales de l'Archiconfrérie de la Sainte-Face
  610_aArchiconfrérie de la Sainte-Face
  13000 records
  $

What am I doing wrong!?

--
Eric Lease Morgan
University of Notre Dame
574/631-8604

--
Shelley Doljack
E-Resources Metadata Librarian
Metadata Department
Stanford University Libraries
sdolj...@stanford.edu
650-725-0167
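
The "é" pattern mentioned above can be reproduced in a few lines. The following is a minimal sketch using only the core Encode module; the helper names double_encode and hex_bytes are made up for this illustration and come from neither message. It encodes a string as UTF-8, misreads those bytes as Latin-1, and encodes again, which is one common way a record ends up double encoded.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: show a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf( '%02X', ord($_) ) } split //, $bytes;
}

# Hypothetical helper: encode as UTF-8, misread the bytes as Latin-1,
# then encode as UTF-8 a second time.
sub double_encode {
    my ($chars) = @_;
    my $utf8_bytes = encode( 'UTF-8', $chars );
    my $misread    = decode( 'ISO-8859-1', $utf8_bytes );
    return encode( 'UTF-8', $misread );
}

my $e_acute = "\x{E9}";    # the single character U+00E9 (e with acute)

print hex_bytes( encode( 'UTF-8', $e_acute ) ), "\n";    # C3 A9
print hex_bytes( double_encode($e_acute) ),     "\n";    # C3 83 C2 A9
```

Bytes C3 A9 read as Latin-1 are the two characters "Ã" and "©", so a correctly encoded é displays as "é" when the reader assumes the wrong encoding, and is stored as four bytes if it is encoded a second time.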
Re: reading and writing of utf-8 with marc::batch [double encoding]
I use MarcEdit to view records and check whether the mnemonic form of a diacritic (e.g. {eacute}) appears and what the LDR/09 value is. That's the best way I've come up with so far. MarcEdit is pretty good at guessing the character encoding without relying on the LDR/09 value. I think there are some Perl modules that guess the encoding of a character, but I've never used them.

I'm interested in finding out other methods (preferably automated) for detecting wrong or mixed character encodings in a MARC record.

Shelley

----- Original Message -----
From: Eric Lease Morgan emor...@nd.edu
To: perl4lib@perl.org
Sent: Wednesday, March 27, 2013 2:11:26 PM
Subject: Re: reading and writing of utf-8 with marc::batch [double encoding]

On Mar 27, 2013, at 4:59 PM, Eric Lease Morgan emor...@nd.edu wrote:

> When it calls as_usmarc, I think MARC::Batch tries to honor the value
> set in position #9 of the leader. In other words, if the leader is
> empty, then it tries to output records as MARC-8, and when the leader
> is a value of 'a', it tries to encode the data as UTF-8.

How can I figure out whether or not a MARC record contains ONLY characters from the UTF-8 character set? Put another way, how can I determine whether or not position #9 of a given MARC leader is accurate? If position #9 is an 'a', then how can I read the balance of the record to determine whether or not all the characters really and truly are UTF-8 encoded?

--
Eric "This Is Almost Too Much For Me" Morgan
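
One automated approach to the question above is to attempt a strict UTF-8 decode of the raw record bytes and compare the result with leader position 9. The following is a rough sketch using only the core Encode module; the function names is_valid_utf8 and leader_claims_utf8 are hypothetical and are not part of MARC::Record. It assumes you already have the raw bytes of a record or field in hand.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: true if the byte string is well-formed UTF-8.
# FB_CROAK makes decode() die on the first malformed sequence;
# LEAVE_SRC stops decode() from modifying its argument.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $ok = eval { decode( 'UTF-8', $bytes, FB_CROAK | LEAVE_SRC ); 1 };
    return $ok ? 1 : 0;
}

# Hypothetical helper: true if leader position 9 is 'a' (UCS/Unicode).
sub leader_claims_utf8 {
    my ($leader) = @_;
    return length($leader) > 9 && substr( $leader, 9, 1 ) eq 'a' ? 1 : 0;
}

my $leader = '00714cam a2200205 a 4500';
my $field  = "Archiconfr\xC3\xA9rie";    # UTF-8 bytes for e with acute

if ( leader_claims_utf8($leader) && !is_valid_utf8($field) ) {
    print "leader says UTF-8 but the data is not valid UTF-8\n";
}
```

This test is one-sided. Data that fails it is definitely not UTF-8, but data that passes may still be wrong: a plain ASCII MARC-8 record is also valid UTF-8, and double-encoded text decodes cleanly even though it displays incorrectly.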
Re: printing UTF-8 encoded MARC records with as_usmarc
The problem was I wasn't telling perl to output UTF-8. Now that I added binmode(FILE, ':utf8') to my script, the problem is fixed.

However, it sounds like once I set binmode to UTF-8 everything will be interpreted as such, even when the record is in MARC-8. Is that right? So this means that I can only use my script with a file of records where all of them are encoded in UTF-8. If I want to run the script against a file with all MARC-8 encoding, then I'd need to remove the binmode line. It doesn't seem possible to say:

  if ($record->encoding() eq 'UTF-8') {
      binmode(FILE, ':utf8');
      print FILE $record->as_usmarc();
  } else {
      print FILE $record->as_usmarc();
  }

This will result in messing up the diacritics if a file has a mixture of records in MARC-8 and UTF-8. Is that correct?

Thanks,
Shelley

----- Original Message -----
From: William Dueber dueb...@umich.edu
To: Shelley Doljack sdolj...@stanford.edu
Cc: perl4lib@perl.org
Sent: Monday, July 30, 2012 5:13:41 PM
Subject: Re: printing UTF-8 encoded MARC records with as_usmarc

First off, it's entirely possible that you have bad UTF-8 (perhaps rogue MARC-8, perhaps just lousy characters) in your MARC. I know we have plenty of that crap.

You need to tell perl that you'll be outputting UTF-8 using 'binmode':

  binmode(FILE, ':utf8');

In general, you'll want to do this to basically every file you open for reading or writing.

A great overview of Perl and UTF-8 can be found at:

http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default

On Mon, Jul 30, 2012 at 6:51 PM, Shelley Doljack sdolj...@stanford.edu wrote:

> Hi,
>
> I wrote a script that extracts marc records from a file given certain
> conditions and puts them in a new file. When my input record is
> correctly encoded in UTF-8 and I run my script from windows command
> prompt, this warning message appears: "Wide character in print at
> record_extraction.pl line 99" (the line in my script where I print to
> a new file using as_usmarc).
>
> I compared the extracted record before and after in MarcEdit and the
> diacritic was changed. I tried marcdump newfile.mrc to see what
> happens and I get this error: utf8 "\xF4" does not map to Unicode at
> C:/Perl64/lib/Encode.pm line 176.
>
> When I run my extraction script again with MARC-8 encoded data then I
> don't have the same problem.
>
> The basic outline of my script is:
>
>   my $batch = MARC::Batch->new('USMARC', $input_file);
>   while (my $record = $batch->next()) {
>       # do some checks
>       # if checks ok then
>       print FILE $record->as_usmarc();
>   }
>
> Do I need to add something that specifies to interpret the data as
> UTF-8? Does MARC::Record not handle UTF-8 at all?
>
> Thanks,
> Shelley
>
> Shelley Doljack
> E-Resources Metadata Librarian
> Metadata and Library Systems
> Stanford University Libraries
> sdolj...@stanford.edu
> 650-725-0167

--
Bill Dueber
Programmer -- Library Systems
University of Michigan
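
For the mixed MARC-8 and UTF-8 case raised above, one possible workaround is to leave the output handle in raw mode and encode each record explicitly, instead of switching binmode per record. The sketch below is not taken from either message. It assumes that as_usmarc() returns a decoded character string for records whose leader says UTF-8 (which the "Wide character in print" warning suggests) and a plain byte string otherwise. The names bytes_for_output and copy_records are hypothetical; MARC::Batch is loaded only when the copy actually runs.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: turn the string returned by as_usmarc() into the
# bytes to write. ASSUMPTION: UTF-8 records come back as character
# strings and need encoding; MARC-8 records are already bytes.
sub bytes_for_output {
    my ( $encoding, $marc ) = @_;
    return $encoding eq 'UTF-8' ? encode( 'UTF-8', $marc ) : $marc;
}

# Hypothetical driver: copy every record to a raw (byte-oriented) handle.
sub copy_records {
    my ( $input_file, $output_file ) = @_;
    require MARC::Batch;
    my $batch = MARC::Batch->new( 'USMARC', $input_file );
    open my $out, '>:raw', $output_file
        or die "Cannot write $output_file: $!";
    while ( my $record = $batch->next() ) {
        print {$out} bytes_for_output( $record->encoding(), $record->as_usmarc() );
    }
    close $out or die "Cannot close $output_file: $!";
    return;
}

# Usage: perl copy.pl input.mrc output.mrc
copy_records(@ARGV) if @ARGV == 2;
```

Because the handle never gets an encoding layer, MARC-8 records pass through byte for byte, while UTF-8 records are encoded once on the way out. Without the explicit encode, a character such as U+00F4 can be written as the single byte F4, which is not valid UTF-8 and may be related to the marcdump error quoted above.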
printing UTF-8 encoded MARC records with as_usmarc
Hi,

I wrote a script that extracts marc records from a file given certain conditions and puts them in a new file. When my input record is correctly encoded in UTF-8 and I run my script from windows command prompt, this warning message appears: "Wide character in print at record_extraction.pl line 99" (the line in my script where I print to a new file using as_usmarc).

I compared the extracted record before and after in MarcEdit and the diacritic was changed. I tried marcdump newfile.mrc to see what happens and I get this error: utf8 "\xF4" does not map to Unicode at C:/Perl64/lib/Encode.pm line 176.

When I run my extraction script again with MARC-8 encoded data then I don't have the same problem.

The basic outline of my script is:

  my $batch = MARC::Batch->new('USMARC', $input_file);
  while (my $record = $batch->next()) {
      # do some checks
      # if checks ok then
      print FILE $record->as_usmarc();
  }

Do I need to add something that specifies to interpret the data as UTF-8? Does MARC::Record not handle UTF-8 at all?

Thanks,
Shelley

Shelley Doljack
E-Resources Metadata Librarian
Metadata and Library Systems
Stanford University Libraries
sdolj...@stanford.edu
650-725-0167
Re: Customizing MARC::Errorchecks
Hi Bryan,

Thanks for the information. I also downloaded Lintadditions. Right now, I'm trying to figure out what I want to do. I'm new to object-oriented programming in Perl, so it's a learning experience for me. What I think I need to do is create my own module for checks specific to ebook records, like checking that all records have an 856 field with 2nd indicator 0 or an electronic resource 007, among other things. I need to think about it more and play around with the different modules more.

Thanks,
Shelley

----- Original Message -----
From: Bryan Baldus bryan.bal...@quality-books.com
To: Shelley Doljack sdolj...@stanford.edu, perl4lib@perl.org
Sent: Tuesday, July 17, 2012 6:15:45 PM
Subject: RE: Customizing MARC::Errorchecks

On Tuesday, July 17, 2012 4:34 PM, Shelley Doljack [sdolj...@stanford.edu] wrote:

> I'm playing around with using MARC::Errorchecks for reviewing ebook
> records we get from vendors. I want to make some modifications to the
> module, but I find that if I do so in a similar manner described in
> the tutorial for customizing MARC::Lint, by making a subclass of the
> module, it doesn't work. Is this not possible with Errorchecks?

Indeed, MARC::Errorchecks was not written in the object-oriented style that MARC::Lint uses. Skimming through the code just now (I've not worked with it as regularly as I might like to be able to keep it fresh in my memory), I believe it is essentially a collection of subs with a wrapper sub to call each check--check_all_subs() calls each of the checking subroutines and returns the arrayref of any errors found. When I wrote it I was still early in learning Perl (and while I've gotten better since then, lack of recent practice working with it hasn't necessarily improved my knowledge of the language), so I'm sure it's not the most optimized code possible. check_all_subs() and the POD comments could serve as an index to each of the checks, with the SYNOPSIS showing examples of how to call the individual checks.

That said, if you have ideas for additions or changes, or other questions, I welcome hearing about them, either to add to the base module or to help with creating a related module of your own. I do know that I need to get working on the changes required for RDA records, but haven't really even started looking into the challenges those will pose (though that will likely result in a new module or more devoted just to RDA, and will also likely require changes/subclasses to MARC::Lint and MARC::Lintadditions).

Also of note, I have a newer version I've just uploaded to CPAN [1] with the following changes (in addition to those listed below, I plan on removing MARC::Lint::CodeData from the Errorchecks distribution and then requiring MARC::Lint, which includes CodeData, to hopefully resolve issues with installing both module packages at the same time due to this file):

Version 1.16: Updated May 16-Nov. 14, 2011. Released 7-17-2012.
-Removed MARC::Lint::CodeData and require MARC::Lint
-Turned off check_fieldlength($record) in check_all_subs()
-Turned off checking of floating hyphens in 520 fields in findfloatinghyphens($record)
-Updated validate008 subs (and 006) related to 008/24-27 (Books and Continuing Resources) for MARC Update no. 10, Oct. 2009 and Update no. 11, 2010; no. 12, Oct. 2010; and no. 13, Sept. 2011.
-Updated %ldrbytes with leader/18 'c' and redefinition of 'i' per MARC Update no. 12, Oct. 2010.

Version 1.15: Updated June 24-August 16, 2009. Released , 2009.
-Updated checks related to 300 to better account for electronic resources.
-Revised wording in validate008($field008, $mattype, $biblvl) language code (008/35-37) for ' '/zxx.
-Updated validate008 subs (and 006) related to 008/24-27 (Books and Continuing Resources) for MARC Update no. 9, Oct. 2008.
-Updated validate008 sub (and 006) for Books byte 33, Literary form, invalidating code 'c' and referring it to 008/24-27 value 'c'.
-Updated video007vs300vs538($record) to allow Blu-ray in 538 and 's' in 07/04.

[1] While the CPAN indexer works on that: http://www.cpan.org/authors/id/E/EI/EIJABB/MARC-Errorchecks-1.16.tar.gz , I've also posted the file to my website: http://home.comcast.net/~eijabb/bryanmodules/MARC-Errorchecks-1.16.tar.gz, with text versions of each file visible in: http://home.comcast.net/~eijabb/bryanmodules/MARC-Errorchecks-1.16

Finally, I meant to mention it on this list earlier, but I've posted a new version of MARC::Lint, 1.45, to CPAN [2], with the current development version (as of now, same as CPAN's version), in SourceForge's Git repository [3]. Updates to that module include:
- Updated Lint::DATA section with Update No. 10 (Oct. 2009) through Update No. 14 (Apr. 2012)
- Updated _check_article with the exceptions: 'A ', 'L is '

[2] http://search.cpan.org/~eijabb/MARC-Lint-1.45/
[3] http
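
Since MARC::Errorchecks is described above as a collection of plain subs that report errors in an arrayref, an ebook-specific check could follow the same shape without subclassing anything. The following is only a sketch: the sub name check_ebook and the error wording are invented here, and treating the 856 and 007 conditions as two separate requirements is an assumption about the intended rule. It relies on the field(), indicator() and data() methods that MARC::Record and MARC::Field provide.

```perl
use strict;
use warnings;

# Hypothetical ebook check in the procedural style described above:
# takes a MARC::Record-like object, returns an arrayref of error strings.
sub check_ebook {
    my ($record) = @_;
    my @errors;

    # At least one 856 whose 2nd indicator is 0 (the resource itself).
    my @f856 = $record->field('856');
    push @errors, '856: no field with 2nd indicator 0'
        unless grep { $_->indicator(2) eq '0' } @f856;

    # At least one 007 whose first byte is 'c' (electronic resource).
    my @f007 = $record->field('007');
    push @errors, '007: no electronic resource 007 (007/00 is not c)'
        unless grep { substr( $_->data(), 0, 1 ) eq 'c' } @f007;

    return \@errors;
}
```

In a batch loop, the returned arrayref could be appended to whatever check_all_subs() reports for the same record, for example with push @all_errors, @{ check_ebook($record) }.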
Customizing MARC::Errorchecks
Hi,

I'm playing around with using MARC::Errorchecks for reviewing ebook records we get from vendors. I want to make some modifications to the module, but I find that if I do so in a similar manner described in the tutorial for customizing MARC::Lint, by making a subclass of the module, it doesn't work. Is this not possible with Errorchecks?

Thanks,
Shelley

Shelley Doljack
E-Resources Metadata Librarian
Metadata and Library Systems
Stanford University Libraries
sdolj...@stanford.edu
650-725-0167