Ignoring Diacritics accessing Fixed Field Data

2005-01-11 Thread Jacobs, Jane W
Hi folks,

I'm trying to write a routine to construct a text file of OCLC search keys from 
a group of existing records.  What I want is something like:

Brah,vasa/2003

That is 1st four letters of 100 + comma + 1st four letters of 245 + slash + 
date.

In principle I have this working with:


open( FOURS, '>', '4-4-date.txt' ) or die "Cannot open 4-4-date.txt: $!";

while ( my $r = $batch->next() ) {

    # First four characters of the main entry (100 $a)
    my @fields = $r->field( '100' );
    foreach my $field ( @fields ) {
        my $ME = $field->subfield('a');
        my $four100 = substr( $ME, 0, 4 );
        print FOURS $four100;
    }

    # Comma, then first four characters of the title (245 $a)
    @fields = $r->field( '245' );
    foreach my $field ( @fields ) {
        my $TITLE = $field->subfield('a');
        my $four245 = substr( $TITLE, 0, 4 );
        print FOURS ",$four245";
    }

    # Backslash, then first four characters of the date (260 $c)
    @fields = $r->field( '260' );
    foreach my $field ( @fields ) {
        my $PD = $field->subfield('c');
        my $four260 = substr( $PD, 0, 4 );
        print FOURS "\\$four260\n";
    }
}


My result was something like:

Dave,Ayod\2003
Paòt,Kaâs\2002
Baks,Dasa\2003
,Viâs\2002

Problem 1: As you can see, I don't really want the first four characters, I 
want the first four SEARCHABLE characters.  How can I tell MARC Record to give 
me the first four characters, excluding diacritics?

Problem 2:  In these examples 260 $c works OK, but I could get a cleaner result 
by accessing the date from the fixed field (008 07-10).  How would I do that?  
I was looking in the tutorial, but couldn't seem to find anything that seemed 
to help.  If I'm missing something there please point it out.

 Thanks in advance to anyone who can help.

 
JJ



**Views expressed by the author do not necessarily represent those of the 
Queens Library.**

Jane Jacobs
Asst. Coord., Catalog Division
Queens Borough Public Library
89-11 Merrick Blvd.
Jamaica, NY 11432

tel.: (718) 990-0804
e-mail: [EMAIL PROTECTED]
FAX. (718) 990-8566 



Re: Ignoring Diacritics accessing Fixed Field Data

2005-01-11 Thread Ed Summers
Hi Jane:

On Tue, Jan 11, 2005 at 01:29:55PM -0500, Jacobs, Jane W wrote:
> My result was something like:
> 
> Dave,Ayod\2003
> Paòt,Kaâs\2002
> Baks,Dasa\2003
> ,Viâs\2002
> 
> Problem 1: As you can see, I don't really want the first four characters, I 
> want the first four SEARCHABLE characters.  How can I tell MARC Record to 
> give me the first four characters, excluding diacritics?

What output would you have rather seen?

Dave,Ayod\2003
Paot, Kaas\2002
Baks,Dasa\2003
,Vias\2002

?

> Problem 2:  In these examples 260 $c works OK, but I could get a cleaner 
> result by accessing the date from the fixed field (008 07-10).  How would I 
> do that?  I was looking in the tutorial, but couldn't seem to find anything 
> that seemed to help.  If I'm missing something there please point it up.

You probably want to use the data() method on the MARC::Field object for
the '008' field, in combination with substr() to extract a substring
based on an offset and a length.

my $f008 = $record->field('008');
if ( $f008 ) { $year = substr( $f008->data(), 7, 4 ); }

I only added the if statement since it may not be true that all your
records have an 008 field...
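
The offset arithmetic can be checked without a MARC file. What follows is only a minimal sketch: the helper name year_from_008 and the sample 008 string are made up for illustration and do not come from Jane's records. With a real record the string would come from the field's data() method.

```perl
use strict;
use warnings;

# Hypothetical helper: return 008/07-10 (Date 1), or undef when the
# field data is missing or too short to contain those positions.
sub year_from_008 {
    my ($data) = @_;
    return unless defined $data && length($data) >= 11;
    return substr( $data, 7, 4 );    # offset 7, length 4
}

# Made-up sample: 00-05 date entered, 06 type of date, 07-10 Date 1.
my $sample = '050111s2003    ii            000 0 guj d';
my $year   = year_from_008($sample);
print defined $year ? "$year\n" : "no date\n";    # prints 2003
```

Returning undef for short data lets the caller decide whether to fall back to 260 $c.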

//Ed


RE: Ignoring Diacritics accessing Fixed Field Data

2005-01-11 Thread Jacobs, Jane W
>> Problem 1: As you can see, I don't really want the first four 
>> characters, I want the first four SEARCHABLE characters.  How can I 
>> tell MARC Record to give me the first four characters, excluding 
>> diacritics?
>
> What output would you have rather seen?
>
> Dave,Ayod\2003
> Paot, Kaas\2002
> Baks,Dasa\2003
> ,Vias\2002
>
> ?

I changed out the order to put the problem children at the bottom. Thus the 
correct output would be:

Baks,Dasa\2003
Dave,Ayod\2003
Pata,Kasm\2002   * actual text is: 100 Patani, Rajana. 245 Kasmakasa
                 ** Raw MARC reads: 100 Patani, Rajana. 245 Kasmakasa
,Vias\2002       * actual text is: 245 Visvaprasiddha vartao
                 ** Raw MARC reads: 245 Visvaprasiddha vartao

> You probably want to use the data() method on the MARC::Field object for the 
> '008' field, in combination with substr() to extract a substring based on an 
> offset and a length.

Worked brilliantly; Thanks!

JJ




RE: Ignoring Diacritics accessing Fixed Field Data

2005-01-11 Thread Doran, Michael D
Hi Jane,

These answers assume that the data you are processing:
1) is encoded in the MARC-8 character set, and
2) consists of the MARC-8 default basic and extended Latin characters.

> Dave,Ayod\2003
> Paòt,Kaâs\2002
> Baks,Dasa\2003
> ,Viâs\2002
>
> Problem 1: As you can see, I don't really want the first four 
> characters, I want the first four SEARCHABLE characters. How
> can I tell MARC Record to give me the first four characters, 
> excluding diacritics?

Assuming that you are asking how to strip out the MARC-8 combining diacritic 
characters, try inserting the substitution commands listed (as shown below) 
just prior to the substr commands:

 my $ME = $field->subfield('a');
  $ME =~ s/[\xE1-\xFE]//g;
 my $four100 = substr( $ME, 0, 4 );

 my $TITLE = $field->subfield('a');
  $TITLE =~ s/[\xE1-\xFE]//g;
 my $four245 = substr( $TITLE, 0, 4 );
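
The strip-then-truncate step can be tried on plain byte strings. This is only a sketch under the same MARC-8 assumptions: the helper name searchable_prefix is hypothetical, and the sample strings are made up, with \xF2 and \xE2 standing in for combining marks.

```perl
use strict;
use warnings;

# Hypothetical helper: drop bytes in the \xE1-\xFE range (the range used
# in the substitution above), then keep the first $n remaining characters.
sub searchable_prefix {
    my ( $text, $n ) = @_;
    return '' unless defined $text;
    ( my $plain = $text ) =~ s/[\xE1-\xFE]//g;
    return substr( $plain, 0, $n );
}

# Made-up byte strings, not taken from the actual records.
print searchable_prefix( "Pa\xF2tani, Rajana.", 4 ), "\n";    # Pata
print searchable_prefix( "Ka\xE2smakasa",       4 ), "\n";    # Kasm
```

Stripping before truncating matters: taking substr first would count the combining bytes toward the four characters.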

-- Michael

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 cell
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/ 



RE: Ignoring Diacritics accessing Fixed Field Data

2005-01-11 Thread Bryan Baldus
On Tuesday, January 11, 2005 2:13 PM, Michael Doran wrote:

> Assuming that you are asking how to strip out the MARC-8 combining diacritic
> characters, try inserting the substitution commands listed (as shown below)
> just prior to the substr commands:
>
>  my $ME = $field->subfield('a');
>   $ME =~ s/[\xE1-\xFE]//g;
>  my $four100 = substr( $ME, 0, 4 );
>
>  my $TITLE = $field->subfield('a');
>   $TITLE =~ s/[\xE1-\xFE]//g;
>  my $four245 = substr( $TITLE, 0, 4 );


You might want to change the procedure for getting the title to skip
articles (untested, may need corrections):

# Given $record being the MARC::Record object, and exactly one 245 field
# being present, as required by MARC21 rules.
my $field245  = $record->field('245');
my $titleind2 = $field245->indicator(2);
my $TITLE     = $field245->subfield('a');
$TITLE =~ s/[\xE1-\xFE]//g;
# The check should be unnecessary, since the 245 2nd indicator should
# always be some number, but just in case fall back to offset 0.
my $skip    = ( $titleind2 =~ /^[0-9]$/ ) ? $titleind2 : 0;
my $four245 = substr( $TITLE, $skip, 4 );
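
The offset logic can be tried on plain strings before wiring it into the record loop. Again only a sketch: title_key is a hypothetical helper name, and the titles and indicator values are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: skip $ind2 nonfiling characters, then take four.
# Falls back to offset 0 when the indicator is not a single digit or
# would run past the end of the title.
sub title_key {
    my ( $title, $ind2 ) = @_;
    return '' unless defined $title;
    my $skip = ( defined $ind2 && $ind2 =~ /^[0-9]$/ ) ? $ind2 : 0;
    $skip = 0 if $skip > length $title;
    return substr( $title, $skip, 4 );
}

print title_key( 'The example title', 4 ), "\n";    # exam
print title_key( 'Kasmakasa',         0 ), "\n";    # Kasm
print title_key( 'Kasmakasa',       ' ' ), "\n";    # Kasm (blank indicator)
```

Computing the offset in a separate variable avoids declaring a variable with a statement modifier, which Perl's documentation warns against.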

Hope this helps,

Bryan Baldus
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://home.inwave.com/eija
 


RE: Ignoring Diacritics accessing Fixed Field Data

2005-01-11 Thread Jacobs, Jane W
That worked well!
Thanks!
JJ



