Re: reading and writing of utf-8 with marc::batch

2013-03-27 Thread Shelley Doljack
Whenever I see characters like é, I consult this website 
http://www.i18nqa.com/debug/bug-utf-8-latin1.html to help me figure out what's 
going on. You might find it helpful too.
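
The substitution that site documents is easy to reproduce. A minimal sketch,
using nothing but the core Encode module, of how an é turns into Ã© when its
UTF-8 bytes are re-read as Latin-1:

  use Encode qw(encode decode);

  binmode(STDOUT, ':utf8');
  my $char     = "\x{E9}";                       # 'é' as a Perl character
  my $bytes    = encode('UTF-8', $char);         # its UTF-8 bytes, 0xC3 0xA9
  my $mojibake = decode('ISO-8859-1', $bytes);   # misread those bytes as Latin-1
  print "$char becomes $mojibake\n";             # prints: é becomes Ã©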

Shelley

- Original Message -
 From: Eric Lease Morgan emor...@nd.edu
 To: perl4lib@perl.org
 Sent: Tuesday, March 26, 2013 1:22:03 PM
 Subject: reading and writing of utf-8 with marc::batch
 
 
 For the life of me I can't figure out how to do reading and writing
 of UTF-8 with MARC::Batch.
 
 I have a UTF-8 encoded file of MARC records. Dumping the records and
 grepping for a particular string illustrates the validity:
 
   $ marcdump und.marc | grep Sainte-Face
   und.marc
   1000 records
   2000 records
   3000 records
   4000 records
   5000 records
   6000 records
   7000 records
   8000 records
   9000 records
   10000 records
   11000 records
   12000 records
   245 00 _aAnnales de l'Archiconfrérie de la Sainte-Face
   610 20 _aArchiconfrérie de la Sainte-Face
   13000 records
   $
 
 I then run a Perl script that simply reads each record and dumps it
 to STDOUT. Notice how I define both my input and output as UTF-8:
 
    #!/shared/perl/current/bin/perl

    # configure
    use constant MARC => './und.marc';

    # require
    use strict;
    use MARC::Batch;

    # initialize
    binmode( MARC, ':utf8' );
    my $batch = MARC::Batch->new( 'USMARC', MARC );
    $batch->strict_off;
    $batch->warnings_off;
    binmode( STDOUT, ':utf8' );

    # read & write
    while ( my $marc = $batch->next ) { print $marc->as_usmarc }

    # done
    exit;
 
 But my output is munged:
 
    $ ./marc.pl > und.mrc
   $ marcdump und.mrc | grep Sainte-Face
   und.mrc
   1000 records
   2000 records
   3000 records
   4000 records
   5000 records
   6000 records
   7000 records
   8000 records
   9000 records
   10000 records
   11000 records
   12000 records
   245 00 _aAnnales de l'Archiconfrérie de la Sainte-Face
   610_aArchiconfrérie de la Sainte-Face
   13000 records
   $
 
 What am I doing wrong!?
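
 A minimal sketch of the double-encoding mechanism suspected in the follow-up
 message: if as_usmarc() hands back a string that already consists of UTF-8
 bytes, printing it through a handle carrying the :utf8 layer encodes those
 bytes a second time. Whether that is exactly what happens here depends on
 what MARC::File::USMARC returns for these records.

   use Encode qw(encode);

   my $char   = "\x{E9}";                  # 'é' as a Perl character
   my $bytes  = encode('UTF-8', $char);    # correct UTF-8: 0xC3 0xA9
   my $double = encode('UTF-8', $bytes);   # the same bytes encoded again
   printf "%s vs %s\n", unpack('H*', $bytes), unpack('H*', $double);
   # prints: c3a9 vs c383c2a9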
 
 --
 Eric Lease Morgan
 University of Notre Dame
 
 574/631-8604
 
 
 
 

-- 
Shelley Doljack  
E-Resources Metadata Librarian 
Metadata Department
Stanford University Libraries
sdolj...@stanford.edu
650-725-0167


Re: reading and writing of utf-8 with marc::batch [double encoding]

2013-03-27 Thread Shelley Doljack
I use MarcEdit to view records and check if the mnemonic form of a diacritic 
(e.g. {eacute}) appears or not and what the LDR/09 value is. That's the best 
way I've come up with so far. MarcEdit is pretty good at guessing what the 
character encoding is without relying on the LDR/09 value. I think there are 
some Perl modules you could use that guess the encoding of a string, but I've
never used them. I'm interested in finding out other methods
(preferably automated) for detecting wrong or mixed character encodings in a 
MARC record. 
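
One module along those lines is Encode::Guess, which ships with Encode. It has
no idea what MARC-8 is, so at best it separates records that are pure ASCII or
valid UTF-8 from everything else; a rough sketch, assuming $raw holds the raw
bytes of one record:

  use Encode::Guess;   # default suspects: ASCII, UTF-8, BOM-marked UTF-16/32

  # $raw: the raw bytes of one MARC record
  my $enc = guess_encoding($raw);
  if (ref $enc) {
      print 'decodes cleanly as ', $enc->name, "\n";   # 'ascii' or 'utf8'
  }
  else {
      print "not plain ASCII or UTF-8 (MARC-8? a mix?): $enc\n";
  }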

Shelley

- Original Message -
 From: Eric Lease Morgan emor...@nd.edu
 To: perl4lib@perl.org
 Sent: Wednesday, March 27, 2013 2:11:26 PM
 Subject: Re: reading and writing of utf-8 with marc::batch [double encoding]
 
 
 On Mar 27, 2013, at 4:59 PM, Eric Lease Morgan emor...@nd.edu
 wrote:
 
  When it calls as_usmarc, I think MARC::Batch tries to honor the
  value set in position #9 of the leader. In other words, if leader
  position #9 is blank, it tries to output records as MARC-8, and
  when it has a value of 'a', it tries to encode the data as UTF-8.
 
 How can I figure out whether or not a MARC record contains ONLY
 characters from the UTF-8 character set?
 
 Put another way, how can I determine whether or not position #9 of a
 given MARC leader is accurate? If position #9 is an 'a', then how
 can I read the balance of the record to determine whether or not all
 the characters really and truly are UTF-8 encoded?
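
 One way to answer both questions without trusting anything a MARC module may
 have already decoded is to read the raw records (splitting on the record
 terminator, 0x1D), look at leader position 9, and see whether the bytes
 survive a strict UTF-8 decode. A rough sketch; 'und.marc' stands in for
 whatever file is being checked:

   #!/usr/bin/perl
   use strict;
   use warnings;
   use Encode qw(decode);

   $/ = "\x1D";                                   # MARC record terminator
   open my $in, '<:raw', 'und.marc' or die $!;
   while ( my $raw = <$in> ) {
       my $ldr09   = substr( $raw, 9, 1 );        # declared character coding
       my $is_utf8 = eval { decode( 'UTF-8', my $copy = $raw, Encode::FB_CROAK ); 1 } || 0;
       print "leader/09 is 'a' but the record is not valid UTF-8\n"
           if $ldr09 eq 'a' and not $is_utf8;
       print "leader/09 is blank but the bytes parse as UTF-8 (mislabeled?)\n"
           if $ldr09 eq ' ' and $is_utf8 and $raw =~ /[\x80-\xFF]/;
   }
   close $in;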
 
 --
 Eric "This Is Almost Too Much For Me" Morgan
 
 


Re: printing UTF-8 encoded MARC records with as_usmarc

2012-07-31 Thread Shelley Doljack
The problem was I wasn't telling perl to output UTF-8. Now that I added 
binmode(FILE, ':utf8') to my script, the problem is fixed. However, it sounds 
like once I set binmode to UTF-8 everything will be interpreted as such, even 
when the record is in MARC-8. Is that right? So this means that I can only use 
my script with a file of records where all of them are encoded in UTF-8. If I 
want to run the script against a file with all MARC-8 encoding, then I'd need 
to remove the binmode line. 

It doesn't seem possible to say: 

if ($record->encoding() eq 'UTF-8') {
    binmode(FILE, ':utf8');
    print FILE $record->as_usmarc();
}
else {
    print FILE $record->as_usmarc();
}

This will still mess up the diacritics if a file has a mixture of MARC-8 and
UTF-8 records, since once the :utf8 layer is set on the handle it applies to
everything printed afterwards. Is that correct?
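
One workaround for a mixed file might be to leave the handle byte-oriented and
encode each record explicitly, rather than flipping binmode back and forth.
This is only a sketch built on the outline below (FILE, $batch, $record as in
that script), and it assumes as_usmarc() returns Perl character strings for
UTF-8 records (which the "Wide character in print" warning suggests) and plain
bytes for MARC-8 ones:

  use Encode qw(encode);

  binmode(FILE, ':raw');                       # keep the handle byte-oriented
  while (my $record = $batch->next()) {
      if ($record->encoding() eq 'UTF-8') {
          print FILE encode('UTF-8', $record->as_usmarc());
      }
      else {
          print FILE $record->as_usmarc();     # MARC-8 bytes pass through untouched
      }
  }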

Thanks, 
Shelley 

- Original Message -

 From: William Dueber dueb...@umich.edu
 To: Shelley Doljack sdolj...@stanford.edu
 Cc: perl4lib@perl.org
 Sent: Monday, July 30, 2012 5:13:41 PM
 Subject: Re: printing UTF-8 encoded MARC records with as_usmarc

 First off, it's entirely possible that you have bad UTF-8 (perhaps
 rogue MARC-8, perhaps just lousy characters) in your MARC. I know we
 have plenty of that crap.

 You need to tell Perl that you'll be outputting UTF-8 using 'binmode':

 binmode(FILE, ':utf8');

 In general, you'll want to do this to basically every file you open
 for reading or writing.
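
 The same layer can also be supplied when the file is opened;
 ':encoding(UTF-8)' is the stricter spelling of ':utf8'. The filename below is
 only a placeholder:

  open(my $out, '>:encoding(UTF-8)', 'extracted.mrc')
      or die "can't open extracted.mrc: $!";
  print {$out} $record->as_usmarc();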

 A great overview of Perl and UTF-8 can be found at:

 http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default

 On Mon, Jul 30, 2012 at 6:51 PM, Shelley Doljack 
 sdolj...@stanford.edu  wrote:

  Hi,
 

  I wrote a script that extracts marc records from a file given
  certain
  conditions and puts them in a new file. When my input record is
  correctly encoded in UTF-8 and I run my script from windows command
  prompt, this warning message appears: Wide character in print at
  record_extraction.pl line 99 (the line in my script where I print
  to a new file using as_usmarc). I compared the extracted record
  before and after in MarcEdit and the diacritic was changed. I tried
  marcdump newfile.mrc to see what happens and I get this error:
  utf8
  \xF4 does not map to Unicode at C:/Perl64/lib/Encode.pm line 176.
  When I run my extraction script again with MARC-8 encoded data then
  I don't have the same problem.
 

  The basic outline of my script is:
 

  my $batch = MARC::Batch->new('USMARC', $input_file);

  while (my $record = $batch->next()) {
      #do some checks
      #if checks ok then
      print FILE $record->as_usmarc();
  }
 

  Do I need to add something that specifies to interpret the data as
  UTF-8? Does MARC::Record not handle UTF-8 at all?
 

  Thanks,
 
  Shelley
 

  
 
  Shelley Doljack
 
  E-Resources Metadata Librarian
 
  Metadata and Library Systems
 
  Stanford University Libraries
 
  sdolj...@stanford.edu
 
  650-725-0167
 

 --

 Bill Dueber
 Programmer -- Library Systems
 University of Michigan


printing UTF-8 encoded MARC records with as_usmarc

2012-07-30 Thread Shelley Doljack
Hi,

I wrote a script that extracts MARC records from a file given certain
conditions and puts them in a new file. When my input record is correctly
encoded in UTF-8 and I run my script from the Windows command prompt, this
warning message appears: "Wide character in print at record_extraction.pl line
99" (the line in my script where I print to a new file using as_usmarc). I
compared the extracted record before and after in MarcEdit and the diacritic
was changed. I tried marcdump newfile.mrc to see what happens and I get this
error: "utf8 \xF4 does not map to Unicode at C:/Perl64/lib/Encode.pm line 176".
When I run my extraction script again with MARC-8 encoded data, I don't have
the same problem.

The basic outline of my script is:

my $batch = MARC::Batch->new('USMARC', $input_file);

while (my $record = $batch->next()) {
    #do some checks
    #if checks ok then
    print FILE $record->as_usmarc();
}

Do I need to add something that specifies to interpret the data as UTF-8? Does 
MARC::Record not handle UTF-8 at all? 
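
As the follow-up above explains, the missing piece is declaring the output
handle's encoding before printing. A minimal sketch built around the outline
above; the variable and file names are placeholders:

  use strict;
  use MARC::Batch;

  my ($input_file, $output_file) = @ARGV;

  open(FILE, '>', $output_file) or die "can't open $output_file: $!";
  binmode(FILE, ':utf8');                      # the line that fixes the warning

  my $batch = MARC::Batch->new('USMARC', $input_file);
  while (my $record = $batch->next()) {
      #do some checks
      #if checks ok then
      print FILE $record->as_usmarc();
  }
  close(FILE);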

Thanks,
Shelley


Shelley Doljack  
E-Resources Metadata Librarian 
Metadata and Library Systems
Stanford University Libraries
sdolj...@stanford.edu
650-725-0167


Re: Customizing MARC::Errorchecks

2012-07-19 Thread Shelley Doljack
Hi Bryan,

Thanks for the information. I also downloaded Lintadditions. Right now, I'm 
trying to figure out what I want to do. I'm new to object-oriented programming 
in Perl, so it's a learning experience for me. What I think I need to do is 
create my own module for checks specific to ebook records, like checking that 
all records have an 856 field with 2nd indicator 0 or an electronic resource 
007, among other things. I need to think about it and play around with the
different modules some more.
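
Such a check could start life as a plain subroutine that takes a MARC::Record
and returns an arrayref of problems, in the same spirit as check_all_subs(). A
rough sketch of the two conditions just mentioned (the sub name is invented):

  # not part of MARC::Errorchecks; a hypothetical ebook-specific check
  sub check_ebook {
      my $record = shift;
      my @errors;

      my $has_856 = grep { $_->indicator(2) eq '0' } $record->field('856');
      push @errors, '856: no field with second indicator 0' unless $has_856;

      my $has_er_007 = grep { substr($_->data(), 0, 1) eq 'c' } $record->field('007');
      push @errors, '007: no electronic resource 007' unless $has_er_007;

      return \@errors;
  }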

Thanks,
Shelley


- Original Message -
 From: Bryan Baldus bryan.bal...@quality-books.com
 To: Shelley Doljack sdolj...@stanford.edu, perl4lib@perl.org
 Sent: Tuesday, July 17, 2012 6:15:45 PM
 Subject: RE: Customizing MARC::Errorchecks
 
 On Tuesday, July 17, 2012 4:34 PM, Shelley Doljack
 [sdolj...@stanford.edu] wrote:
 I'm playing around with using MARC::Errorchecks for reviewing ebook
 records we get from vendors. I want to make some modifications to
 the module, but I find that if I do so in a similar manner
 described in the tutorial for customizing MARC::Lint, by making a
 subclass of the module, it doesn't work. Is this not possible with
 Errorchecks?
 
 Indeed, MARC::Errorchecks was not written in the object-oriented
 style that MARC::Lint uses. Skimming through the code just now (I've
 not worked with it as regularly as I might like to be able to keep
 it fresh in my memory), I believe it is essentially a collection of
 subs with a wrapper sub to call each check--check_all_subs() calls
 each of the checking subroutines and returns the arrayref of any
 errors found. When I wrote it I was still early in learning Perl
 (and while I've gotten better since then, lack of recent practice
 working with it hasn't necessarily improved my knowledge of the
 language), so I'm sure it's not the most optimized code possible.
 check_all_subs() and the POD comments could serve as an index to
 each of the checks, with the SYNOPSIS showing examples of how to
 call the individual checks.
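
 Based on that description and the SYNOPSIS, running every check over a file
 of records would look roughly like this ('ebooks.mrc' is a placeholder):

  use MARC::Batch;
  use MARC::Errorchecks;

  my $batch = MARC::Batch->new('USMARC', 'ebooks.mrc');
  while (my $record = $batch->next()) {
      my $errors = MARC::Errorchecks::check_all_subs($record);
      print $record->title(), ": $_\n" for @$errors;
  }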
 
 That said, if you have ideas for additions or changes, or other
 questions, I welcome hearing about them, either to add to the base
 module or to help with creating a related module of your own. I do
 know that I need to get working on the changes required for RDA
 records, but haven't really even started looking into the challenges
 those will pose (though that will likely result in a new module or
 more devoted just to RDA, and will also likely require
 changes/subclasses to MARC::Lint and MARC::Lintadditions).
 
 Also of note, I have a newer version I've just uploaded to CPAN [1]
 with the following changes. In addition to those listed below, I
 plan on removing MARC::Lint::CodeData from the Errorchecks
 distribution and then requiring MARC::Lint, which includes CodeData,
 to hopefully resolve the issues with installing both module packages
 at the same time caused by that file:
 
 Version 1.16: Updated May 16-Nov. 14, 2011. Released 7-17-2012.
  -Removed MARC::Lint::CodeData and require MARC::Lint
  -Turned off check_fieldlength($record) in check_all_subs()
  -Turned off checking of floating hyphens in 520 fields in
  findfloatinghyphens($record)
  -Updated validate008 subs (and 006) related to 008/24-27 (Books and
  Continuing Resources) for MARC Update no. 10, Oct. 2009 and Update
  no. 11, 2010; no. 12, Oct. 2010; and no. 13, Sept. 2011.
  -Updated %ldrbytes with leader/18 'c' and redefinition of 'i' per
  MARC Update no. 12, Oct. 2010.
 
 Version 1.15: Updated June 24-August 16, 2009. Released , 2009.
 
  -Updated checks related to 300 to better account for electronic
  resources.
  -Revised wording in validate008($field008, $mattype, $biblvl)
  language code (008/35-37) for '   '/zxx.
  -Updated validate008 subs (and 006) related to 008/24-27 (Books and
  Continuing Resources) for MARC Update no. 9, Oct. 2008.
  -Updated validate008 sub (and 006) for Books byte 33, Literary form,
  invalidating code 'c' and referring it to 008/24-27 value 'c'.
  -Updated video007vs300vs538($record) to allow Blu-ray in 538 and 's'
  in 007/04.
 
 [1] While the CPAN indexer works on that:
 http://www.cpan.org/authors/id/E/EI/EIJABB/MARC-Errorchecks-1.16.tar.gz
 , I've also posted the file to my website:
 http://home.comcast.net/~eijabb/bryanmodules/MARC-Errorchecks-1.16.tar.gz,
 with text versions of each file visible in:
 http://home.comcast.net/~eijabb/bryanmodules/MARC-Errorchecks-1.16
 
 #
 
 Finally, I meant to mention it on this list earlier, but I've posted
 a new version of MARC::Lint, 1.45, to CPAN [2], with the current
 development version (as of now, same as CPAN's version), in
 SourceForge's Git repository [3]. Updates to that module include:
  - Updated Lint::DATA section with Update No. 10 (Oct. 2009) through
  Update No. 14 (Apr. 2012)
  - Updated _check_article with the exceptions: 'A  ', 'L is '
 
 #
 
 [2] http://search.cpan.org/~eijabb/MARC-Lint-1.45/
 [3]
 http

Customizing MARC::Errorchecks

2012-07-17 Thread Shelley Doljack
Hi,

I'm playing around with using MARC::Errorchecks for reviewing ebook records we 
get from vendors. I want to make some modifications to the module, but I find 
that if I do so in a similar manner described in the tutorial for customizing 
MARC::Lint, by making a subclass of the module, it doesn't work. Is this not 
possible with Errorchecks? 
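
For comparison, the MARC::Lint customization the tutorial describes boils down
to subclassing and adding check_XXX methods that receive a field and call
warn(). A rough sketch; the package name and the 856 check are made up, and
the details are from memory, so treat them as approximate:

  package MARC::Lint::EbookChecks;
  use base qw(MARC::Lint);

  # check_record() calls check_<tag> for each field it can() handle,
  # so adding a method named after the tag is enough
  sub check_856 {
      my ($self, $field) = @_;
      $self->warn('856: second indicator should be 0')
          unless $field->indicator(2) eq '0';
  }

  1;

  # usage: my $lint = MARC::Lint::EbookChecks->new();
  #        $lint->check_record($record);
  #        print "$_\n" for $lint->warnings();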

Thanks,
Shelley  


Shelley Doljack  
E-Resources Metadata Librarian 
Metadata and Library Systems
Stanford University Libraries
sdolj...@stanford.edu
650-725-0167