Marc::XML with MARC21

2010-01-25 Thread Michele Pinassi

Hi all,
I'm working on a Perl plugin for EPrints that lets users import from
Aleph simply by using the system id. It uses the Aleph OAI-PMH service,
which exports metadata in MARC21 format:

<OAI-PMH xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/
http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
<responseDate>2010-01-22T15:17:32Z</responseDate>
<request verb="GetRecord" identifier="oai:siena:"
metadataPrefix="marc21">http://xxx:8991/OAI</request>
<GetRecord>
<record>
<header>
<identifier>oai:siena:-000762662</identifier>
<datestamp>2009-09-18T10:43:21Z</datestamp>
<setSpec>SBS</setSpec>
</header>
<metadata>
<marc:record xsi:schemaLocation="http://www.loc.gov/MARC21/slim
http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd">
<marc:leader>^cam^^22^^i^4500</marc:leader>
<marc:controlfield tag="001">000762662</marc:controlfield>
<marc:datafield tag="020" ind1=" " ind2=" ">
<marc:subfield code="a">8814075913</marc:subfield>
</marc:datafield>
<marc:datafield tag="040" ind1=" " ind2=" ">


Then I use the MARC::Record module:

my $file = MARC::Record->new_from_xml( $marc->serialize(), 'UTF-8', 'MARC21' );
$epdata = $plugin->EPrints::Plugin::Import::MARC::convert_input( $file );

and here the troubles start: only a few of the metadata fields are
interpreted correctly, and a lot of data is lost.

I can't figure out why: maybe the namespaces confuse the MARC::Record parser?

Here's an example of the MARC21 XML which I fed to MARC::Record:

<marc:record xmlns:marc="http://www.loc.gov/MARC21/slim"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.loc.gov/MARC21/slim
    http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd">
  <marc:leader>^cam^^22^^i^4500</marc:leader>
  <marc:controlfield tag="001">000762662</marc:controlfield>
  <marc:datafield tag="020" ind1=" " ind2=" ">
    <marc:subfield code="a">8814075913</marc:subfield>
  </marc:datafield>
  <marc:datafield tag="040" ind1=" " ind2=" ">
    <marc:subfield code="a">IT</marc:subfield>
    <marc:subfield code="-">Servizio Bibliotecario Senese</marc:subfield>
    <marc:subfield code="e">RICA</marc:subfield>
  </marc:datafield>
  <marc:datafield tag="300" ind1=" " ind2=" ">
    <marc:subfield code="a">VI, 262 p. ;</marc:subfield>
    <marc:subfield code="c">24 cm</marc:subfield>
  </marc:datafield>
  <marc:datafield tag="653" ind1="0" ind2=" ">
    <marc:subfield code="a">Navigazione da diporto</marc:subfield>
    <marc:subfield code="a">Legislazione</marc:subfield>
  </marc:datafield>
  <marc:datafield tag="700" ind1="1" ind2=" ">
    <marc:subfield code="a">Antonini,Alfredo</marc:subfield>
  </marc:datafield>
  <marc:datafield tag="700" ind1="1" ind2=" ">
    <marc:subfield code="a">Morandi,Francesco</marc:subfield>
  </marc:datafield>
  <marc:datafield tag="041" ind1="0" ind2=" ">
    <marc:subfield code="a">ita</marc:subfield>
  </marc:datafield>
  <marc:datafield tag="245" ind1="1" ind2="0">
    <marc:subfield code="a">La navigazione da diporto :</marc:subfield>
    <marc:subfield code="b">le infrastrutture, l'organizzazione, i contratti e le responsabilità :</marc:subfield>
    <marc:subfield code="b">atti del convegno, Trieste, 27 marzo 1998 /</marc:subfield>
    <marc:subfield code="c">a cura di Alfredo Antonini e Francesco Morandi</marc:subfield>
  </marc:datafield>
  <marc:datafield tag="260" ind1=" " ind2=" ">
    <marc:subfield code="a">Milano :</marc:subfield>
    <marc:subfield code="b">Giuffrè</marc:subfield>
    <marc:subfield code="c">1999</marc:subfield>
  </marc:datafield>
  <marc:datafield tag="490" ind1=" " ind2="0">
    <marc:subfield code="a">Collana del Dipartimento di scienze giuridiche e della Facoltà di giurisprudenza dell' Università di Modena e Reggio Emilia</marc:subfield>
    <marc:subfield code="p">Nuova serie ;</marc:subfield>
    <marc:subfield code="v">0048</marc:subfield>
  </marc:datafield>
  <marc:datafield tag="760" ind1="1" ind2=" ">
    <marc:subfield code="t">Collana del Dipartimento di scienze giuridiche e della Facoltà di giurisprudenza dell' Università di Modena e Reggio Emilia</marc:subfield>
    <marc:subfield code="g">0048</marc:subfield>
  </marc:datafield>
  <marc:datafield tag="082" ind1=" " ind2=" ">
    <marc:subfield code="a">343.45096</marc:subfield>
    <marc:subfield code="2">20</marc:subfield>
  </marc:datafield>
  <marc:controlfield tag="008">^^sxx^|r^|||</marc:controlfield>
</marc:record>

Any useful hints? Thanks!

Michele

--
|| Michele Pinassi
|| System Manager Area Sistema Biblioteche - UniSi
|| https://sites.google.com/a/unisi.it/o-zone/
|| Assistenza: +39.577.232299 (int. 2299)
|| Personale: +39.577.232477 (int. 2477)
|| FAX: +39.577.232430 (int. 2430)


Re: Marc::XML with MARC21

2010-01-25 Thread Jon Gorman

 my $file = MARC::Record->new_from_xml( $marc->serialize(), 'UTF-8', 'MARC21' );
 $epdata = $plugin->EPrints::Plugin::Import::MARC::convert_input( $file );

 and here the troubles start: only a few of the metadata fields are
 interpreted correctly, and a lot of data is lost.

Ummm, so what metadata makes it through? I see examples of what you
feed it, but not what comes out. Just from looking quickly at the
MARCXML, the only thing that seems really weird right away is the
008 control field trailing at the end of the record. I don't know what
the XSD states about the ordering, but typically all the control fields
are at the top of a MARC record.

Jon Gorman


Splitting a large file of MARC records into smaller files

2010-01-25 Thread Nolte, Jennifer
Hello-

I am working with files of MARC records that are over a million records each. 
I'd like to split them down into smaller chunks, preferably using a command 
line. MARCedit works, but is slow and made for the desktop. I've looked around 
and haven't found anything truly useful- Endeavor's MARCsplit comes close but 
doesn't separate files into even numbers, only by matching criteria, so there 
could be lots of record duplication between files.

Any idea where to begin? I am a (super) novice Perl person.

Thank you!

~Jenn Nolte


Jenn Nolte
Applications Manager / Database Analyst
Production Systems Team
Information Technology Office
Yale University Library
130 Wall St.
New Haven CT 06520
203 432 4878




Re: Splitting a large file of MARC records into smaller files

2010-01-25 Thread Emmanuel Di Pretoro
Hi,

A long time ago, I've written the following :

--- snippet ---
#!/usr/bin/env perl

use strict;
use warnings;

use MARC::File::USMARC;
use MARC::Record;

use Getopt::Long;

my $config = { output => 'input' };

GetOptions($config, 'input=s', 'chunk=s', 'output=s', 'max=s');

if (not exists $config->{input} or not exists $config->{chunk}) {
    die "Usage: $0 --input file --chunk size [--output file]\n";
} else {
    run($config->{input}, $config->{output}, $config->{chunk},
        $config->{max});
}

sub run {
    my ($input, $output, $chunk, $max) = @_;

    my $marcfile = MARC::File::USMARC->in($input);

    my $fh = $output eq 'input' ? create_file($input) :
        create_file($output);
    my $cpt = 1;
    my $total = 0;
    while (my $record = $marcfile->next) {
        $total++;

        if (defined $max) {
            last if $total > $max;
        }
        if ($cpt++ > $chunk) {
            close $fh;
            $fh = $output eq 'input' ? create_file($input) :
                create_file($output);
            $cpt = 1;
        }

        print $fh $record->as_usmarc;
    }
    close $fh;
}

sub create_file {
    my ($output) = @_;
    my $cpt = 0;

    my $filename = sprintf('%s.%03d', $output, $cpt++);
    while (-e $filename) {
        $filename = sprintf('%s.%03d', $output, $cpt++);
    }

    open my $fh, '>', $filename or die "Can't open $filename: $!\n";
    return $fh;
}
--- snippet ---
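Assuming the snippet above is saved as, say, marcsplit.pl (a hypothetical name; the original was unnamed), invoking it would look like:

```shell
# Split big.mrc into files of 1000 records each,
# written as big.mrc.000, big.mrc.001, ...
perl marcsplit.pl --input big.mrc --chunk 1000
```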

Hope this helps,

Emmanuel Di Pretoro

2010/1/25 Nolte, Jennifer jennifer.no...@yale.edu

 Hello-

 I am working with files of MARC records that are over a million records
 each. I'd like to split them down into smaller chunks, preferably using a
 command line. MARCedit works, but is slow and made for the desktop. I've
 looked around and haven't found anything truly useful- Endeavor's MARCsplit
 comes close but doesn't separate files into even numbers, only by matching
 criteria, so there could be lots of record duplication between files.

 Any idea where to begin? I am a (super) novice Perl person.

 Thank you!

 ~Jenn Nolte


 Jenn Nolte
 Applications Manager / Database Analyst
 Production Systems Team
 Information Technology Office
 Yale University Library
 130 Wall St.
 New Haven CT 06520
 203 432 4878





RE: Splitting a large file of MARC records into smaller files

2010-01-25 Thread Smith,Devon
This isn't a perl solution, but it may work for you.

You can use the unix split command to split a file into several other
files with the same number of lines each. For that to work, you'll first
have to use tr to convert the ^] record separators into newlines. Then
use tr to convert them all back in each split file.

# tr '^]' '\n' < filename > filename.nl
# split -l $lines_per_file filename.nl SPLIT
# for file in SPLIT*; do tr '\n' '^]' < $file > ${file%.nl}.rs; done

Or something like that.

/dev
-- 

Devon Smith
Consulting Software Engineer
OCLC Office of Research



-Original Message-
From: Nolte, Jennifer [mailto:jennifer.no...@yale.edu] 
Sent: Monday, January 25, 2010 9:48 AM
To: perl4lib@perl.org
Subject: Splitting a large file of MARC records into smaller files

Hello-

I am working with files of MARC records that are over a million records
each. I'd like to split them down into smaller chunks, preferably using
a command line. MARCedit works, but is slow and made for the desktop.
I've looked around and haven't found anything truly useful- Endeavor's
MARCsplit comes close but doesn't separate files into even numbers, only
by matching criteria, so there could be lots of record duplication
between files.

Any idea where to begin? I am a (super) novice Perl person.

Thank you!

~Jenn Nolte


Jenn Nolte
Applications Manager / Database Analyst
Production Systems Team
Information Technology Office
Yale University Library
130 Wall St.
New Haven CT 06520
203 432 4878






Re: Splitting a large file of MARC records into smaller files

2010-01-25 Thread Ashley Sanders

Jennifer,


I am working with files of MARC records that are over a million records each. 
I'd like to split them down into smaller chunks, preferably using a command 
line. MARCedit works, but is slow and made for the desktop. I've looked around 
and haven't found anything truly useful- Endeavor's MARCsplit comes close but 
doesn't separate files into even numbers, only by matching criteria, so there 
could be lots of record duplication between files.

Any idea where to begin? I am a (super) novice Perl person.


Well... if you have a *nix style command line and the usual
utilities and your file of MARC records is in exchange format
with the records just delimited by the end-of-record character
0x1d, then you could do something like this:

tr '\035' '\n' < my-marc-file.mrc > recs.txt
split -1000 recs.txt

The tr command will turn the MARC end-of-record characters
into newlines. Then use the split command to carve up
the output of tr into files of 1000 records.

You then may have to use tr to convert the newlines back
to MARC end-of-record characters.

Ashley.

--
Ashley Sanders   a.sand...@manchester.ac.uk
Copac http://copac.ac.uk A Mimas service funded by JISC


RE: Splitting a large file of MARC records into smaller files

2010-01-25 Thread Walker, David
 yaz-marcdump allows you to break a 
 marcfile into chunks of x-records

+1 

--Dave

==
David Walker
Library Web Services Manager
California State University
http://xerxes.calstate.edu

From: Colin Campbell [colin.campb...@ptfs-europe.com]
Sent: Monday, January 25, 2010 7:08 AM
To: perl4lib@perl.org
Subject: Re: Splitting a large file of MARC records into smaller files

On 25/01/10 14:48, Nolte, Jennifer wrote:
 Hello-

 I am working with files of MARC records that are over a million records each. 
 I'd like to split them down into smaller chunks, preferably using a command 
 line. MARCedit works, but is slow and made for the desktop. I've looked 
 around and haven't found anything truly useful- Endeavor's MARCsplit comes 
 close but doesn't separate files into even numbers, only by matching 
 criteria, so there could be lots of record duplication between files.

If you have Indexdata's yaz installed the program yaz-marcdump allows
you to break a marcfile into chunks of x-records. man yaz-marcdump for
details
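An invocation would look something like this (big.mrc is a hypothetical input file; check man yaz-marcdump for the exact option behaviour on your version):

```shell
# -C sets the number of records per chunk; -s sets the prefix
# for the numbered output files yaz-marcdump writes.
yaz-marcdump -s chunk -C 1000 big.mrc > /dev/null
```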

Cheers
Colin


--
Colin Campbell
Chief Software Engineer,
PTFS Europe Limited
Content Management and Library Solutions
+44 (0) 208 366 1295 (phone)
+44 (0) 7759 633626  (mobile)
colin.campb...@ptfs-europe.com
skype: colin_campbell2

http://www.ptfs-europe.com

Re: Marc::XML with MARC21

2010-01-25 Thread Ed Summers
Hi Michele:

I copied and pasted the XML from your email and ran it through a
simple test script (both attached) and the record seemed to be parsed
ok. What do you see if you run the attached test.pl?

//Ed


test.pl
Description: Binary data
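The attachment itself did not survive the list archive; a minimal test script along these lines (names and structure are my guess, not necessarily Ed's actual test.pl) would parse the record and dump every field it kept:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

use MARC::Record;
use MARC::File::XML ( BinaryEncoding => 'utf8' );

# Slurp the MARCXML from a file given on the command line (or STDIN).
my $xml = do { local $/; <> };

my $record = MARC::Record->new_from_xml( $xml, 'UTF-8', 'MARC21' );

# Print the leader and every field so you can see what survived parsing.
print $record->leader(), "\n";
for my $field ( $record->fields ) {
    if ( $field->is_control_field ) {
        print $field->tag(), ' ', $field->data(), "\n";
    }
    else {
        print $field->tag(), ' ', $field->as_string(), "\n";
    }
}
```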


Re: Splitting a large file of MARC records into smaller files

2010-01-25 Thread Sébastien Hinderer
Hi,

The yaz-marcdump utility may be what you are looking for.
See for instance options -s and -C.

hth,
Shérab.


Re: Splitting a large file of MARC records into smaller files

2010-01-25 Thread Robert Fox
Assuming that memory won't be an issue, you could use MARC::Batch to
read in the record set and print out separate files, splitting on
every X records. You would have an iterative loop loading each
record from the large batch, and a counter variable that gets
reset after X records. You might want to name the sets using
another counter that keeps track of how many sets you have, name
each file something like batch_$count.mrc, and write them out to a
specific directory. Just concatenate each record onto the previous one
when you're making your smaller batches.
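A sketch of that approach (the input file name, output names, and chunk size are arbitrary; assumes MARC::Batch from CPAN is installed):

```perl
#!/usr/bin/env perl
use strict;
use warnings;

use MARC::Batch;

my $chunk_size = 1000;    # records per output file
my $batch      = MARC::Batch->new( 'USMARC', 'big.mrc' );

my ( $count, $set ) = ( 0, 0 );
my $out;

while ( my $record = $batch->next ) {
    # Start a new output file every $chunk_size records.
    if ( $count % $chunk_size == 0 ) {
        close $out if $out;
        $set++;
        open $out, '>', "batch_$set.mrc"
            or die "batch_$set.mrc: $!";
    }
    # Concatenate raw records into the current batch file.
    print {$out} $record->as_usmarc;
    $count++;
}
close $out if $out;
```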

Rob Fox
Hesburgh Libraries
University of Notre Dame

On Jan 25, 2010, at 9:48 AM, Nolte, Jennifer  
jennifer.no...@yale.edu wrote:

 Hello-

 I am working with files of MARC records that are over a million  
 records each. I'd like to split them down into smaller chunks,  
 preferably using a command line. MARCedit works, but is slow and  
 made for the desktop. I've looked around and haven't found anything  
 truly useful- Endeavor's MARCsplit comes close but doesn't separate  
 files into even numbers, only by matching criteria, so there could  
 be lots of record duplication between files.

 Any idea where to begin? I am a (super) novice Perl person.

 Thank you!

 ~Jenn Nolte


 Jenn Nolte
 Applications Manager / Database Analyst
 Production Systems Team
 Information Technology Office
 Yale University Library
 130 Wall St.
 New Haven CT 06520
 203 432 4878




Re: Splitting a large file of MARC records into smaller files

2010-01-25 Thread Saiful Amin
I also recommend using MARC::Batch. Attached is a simple script I wrote for
myself.

Saiful Amin
+91-9343826438


On Mon, Jan 25, 2010 at 8:33 PM, Robert Fox rf...@nd.edu wrote:

 Assuming that memory won't be an issue, you could use MARC::Batch to
 read in the record set and print out seperate files where you split on
 X amount of records. You would have an iterative loop loading each
 record from the large batch, and a counter variable that would get
 reset after X amount of records. You might want to name the sets using
 another counter that keeps track of how many sets you have and name
 each file something like batch_$count.mrc and write them out to a
 specific directory. Just concatenate each record to the previous one
 when you're making your smaller batches.

 Rob Fox
 Hesburgh Libraries
 University of Notre Dame

 On Jan 25, 2010, at 9:48 AM, Nolte, Jennifer
 jennifer.no...@yale.edu wrote:

  Hello-
 
  I am working with files of MARC records that are over a million
  records each. I'd like to split them down into smaller chunks,
  preferably using a command line. MARCedit works, but is slow and
  made for the desktop. I've looked around and haven't found anything
  truly useful- Endeavor's MARCsplit comes close but doesn't separate
  files into even numbers, only by matching criteria, so there could
  be lots of record duplication between files.
 
  Any idea where to begin? I am a (super) novice Perl person.
 
  Thank you!
 
  ~Jenn Nolte
 
 
  Jenn Nolte
  Applications Manager / Database Analyst
  Production Systems Team
  Information Technology Office
  Yale University Library
  130 Wall St.
  New Haven CT 06520
  203 432 4878
 
 

#!c:/perl/bin/perl.exe
#
# Name: mbreaker.pl
# Version: 0.1
# Date: Jan 2009
# Author: Saiful Amin sai...@edutech.com
#
# Description: Extract MARC records based on command-line parameters

use strict;
use warnings;
use Getopt::Long;
use MARC::Batch;

my $start   = 0;
my $end = 1;

GetOptions( 'start=i' => \$start,
            'end=i'   => \$end,
);

my $batch = MARC::Batch->new( 'USMARC', $ARGV[0] );
$batch->strict_off();
$batch->warnings_off();

my $num = 0;
while ( my $record = $batch->next() ) {
    $num++;
    next if $num < $start;
    last if $num > $end;
    print $record->as_usmarc();
    warn "$num records\n" if ( $num % 1000 == 0 );
}


__END__

=head1 NAME

mbreaker.pl

Breaks the MARC record file as per start and end position specified

=head1 SYNOPSIS

mbreaker.pl [options] file

Options:
 -start start position for reading records
 -end   end position for reading records