Re: Splitting a large file of MARC records into smaller files

2010-01-25 Thread Emmanuel Di Pretoro
Hi,

A long time ago, I wrote the following:

--- snippet ---
#!/usr/bin/env perl

use strict;
use warnings;

use MARC::File::USMARC;
use MARC::Record;

use Getopt::Long;

my $config = { output => 'input' };

GetOptions($config, 'input=s', 'chunk=i', 'output=s', 'max=i');

if (not exists $config->{input} or not exists $config->{chunk}) {
    die "Usage: $0 --input file --chunk size [--output file]\n";
} else {
    run($config->{input}, $config->{output}, $config->{chunk},
        $config->{max});
}

sub run {
    my ($input, $output, $chunk, $max) = @_;

    my $marcfile = MARC::File::USMARC->in($input);

    my $fh = $output eq 'input' ? create_file($input) : create_file($output);
    my $cpt = 1;
    my $total = 0;
    while (my $record = $marcfile->next) {
        $total++;

        if (defined $max) {
            last if $total > $max;
        }
        if ($cpt > $chunk) {
            close $fh;
            $fh = $output eq 'input' ? create_file($input)
                                     : create_file($output);
            $cpt = 1;
        }
        $cpt++;

        print $fh $record->as_usmarc;
    }
    close $fh;
}

sub create_file {
    my ($output) = @_;
    my $cpt = 0;

    # Find the first unused numbered filename: output.000, output.001, ...
    my $filename = sprintf('%s.%03d', $output, $cpt++);
    while (-e $filename) {
        $filename = sprintf('%s.%03d', $output, $cpt++);
    }

    open my $fh, '>', $filename or die "Cannot open '$filename': $!\n";
    return $fh;
}
--- snippet ---

Hope this helps,

Emmanuel Di Pretoro

2010/1/25 Nolte, Jennifer jennifer.no...@yale.edu

 Hello-

 I am working with files of MARC records that are over a million records
 each. I'd like to split them down into smaller chunks, preferably using a
 command line. MARCedit works, but is slow and made for the desktop. I've
 looked around and haven't found anything truly useful- Endeavor's MARCsplit
 comes close but doesn't separate files into even numbers, only by matching
 criteria, so there could be lots of record duplication between files.

 Any idea where to begin? I am a (super) novice Perl person.

 Thank you!

 ~Jenn Nolte


 Jenn Nolte
 Applications Manager / Database Analyst
 Production Systems Team
 Information Technology Office
 Yale University Library
 130 Wall St.
 New Haven CT 06520
 203 432 4878





RE: Splitting a large file of MARC records into smaller files

2010-01-25 Thread Smith,Devon
This isn't a perl solution, but it may work for you.

You can use the unix split command to split a file into several other
files with the same number of lines each. For that to work, you'll first
have to use tr to convert the ^] record separators into newlines. Then
use tr to convert them all back in each split file.

# tr '^]' '\n' < filename > filename.nl
# split -l $lines_per_file filename.nl SPLIT
# for file in SPLIT*; do tr '\n' '^]' < $file > $file.rs; done

Or something like that.
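Written out in full, the round trip looks something like the sketch below. The filenames are placeholders, and the 0x1D record terminator is spelled as the octal escape \035 (rather than a literal ^] control character) so it survives copy and paste:

```shell
# Make a tiny stand-in "MARC" file: three records, each ending in 0x1D.
printf 'rec1\035rec2\035rec3\035' > sample.mrc

# 1. Turn each 0x1D record terminator into a newline.
tr '\035' '\n' < sample.mrc > sample.nl

# 2. Split into chunks of one record (line) per file: SPLITaa, SPLITab, ...
split -l 1 sample.nl SPLIT

# 3. Convert the newlines in each chunk back into 0x1D terminators.
for f in SPLITa?; do tr '\n' '\035' < "$f" > "$f.mrc"; done
```

For a real file you would raise the `-l` count to the number of records you want per chunk.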

/dev
-- 

Devon Smith
Consulting Software Engineer
OCLC Office of Research



-Original Message-
From: Nolte, Jennifer [mailto:jennifer.no...@yale.edu] 
Sent: Monday, January 25, 2010 9:48 AM
To: perl4lib@perl.org
Subject: Splitting a large file of MARC records into smaller files

Hello-

I am working with files of MARC records that are over a million records
each. I'd like to split them down into smaller chunks, preferably using
a command line. MARCedit works, but is slow and made for the desktop.
I've looked around and haven't found anything truly useful- Endeavor's
MARCsplit comes close but doesn't separate files into even numbers, only
by matching criteria, so there could be lots of record duplication
between files.

Any idea where to begin? I am a (super) novice Perl person.

Thank you!

~Jenn Nolte


Jenn Nolte
Applications Manager / Database Analyst
Production Systems Team
Information Technology Office
Yale University Library
130 Wall St.
New Haven CT 06520
203 432 4878






Re: Splitting a large file of MARC records into smaller files

2010-01-25 Thread Ashley Sanders

Jennifer,


I am working with files of MARC records that are over a million records each. 
I'd like to split them down into smaller chunks, preferably using a command 
line. MARCedit works, but is slow and made for the desktop. I've looked around 
and haven't found anything truly useful- Endeavor's MARCsplit comes close but 
doesn't separate files into even numbers, only by matching criteria, so there 
could be lots of record duplication between files.

Any idea where to begin? I am a (super) novice Perl person.


Well... if you have a *nix style command line and the usual
utilities and your file of MARC records is in exchange format
with the records just delimited by the end-of-record character
0x1d, then you could do something like this:

tr '\035' '\n' < my-marc-file.mrc > recs.txt
split -l 1000 recs.txt

The tr command will turn the MARC end-of-record characters
into newlines. Then use the split command to carve up
the output of tr into files of 1000 records.

You then may have to use tr to convert the newlines back
to MARC end-of-record characters.

Ashley.

--
Ashley Sanders   a.sand...@manchester.ac.uk
Copac http://copac.ac.uk A Mimas service funded by JISC


RE: Splitting a large file of MARC records into smaller files

2010-01-25 Thread Walker, David
 yaz-marcdump allows you to break a 
 marcfile into chunks of x-records

+1 

--Dave

==
David Walker
Library Web Services Manager
California State University
http://xerxes.calstate.edu

From: Colin Campbell [colin.campb...@ptfs-europe.com]
Sent: Monday, January 25, 2010 7:08 AM
To: perl4lib@perl.org
Subject: Re: Splitting a large file of MARC records into smaller files

On 25/01/10 14:48, Nolte, Jennifer wrote:
 Hello-

 I am working with files of MARC records that are over a million records each. 
 I'd like to split them down into smaller chunks, preferably using a command 
 line. MARCedit works, but is slow and made for the desktop. I've looked 
 around and haven't found anything truly useful- Endeavor's MARCsplit comes 
 close but doesn't separate files into even numbers, only by matching 
 criteria, so there could be lots of record duplication between files.

If you have Indexdata's yaz installed the program yaz-marcdump allows
you to break a marcfile into chunks of x-records. man yaz-marcdump for
details

Cheers
Colin


--
Colin Campbell
Chief Software Engineer,
PTFS Europe Limited
Content Management and Library Solutions
+44 (0) 208 366 1295 (phone)
+44 (0) 7759 633626  (mobile)
colin.campb...@ptfs-europe.com
skype: colin_campbell2

http://www.ptfs-europe.com

Re: Splitting a large file of MARC records into smaller files

2010-01-25 Thread Sébastien Hinderer
Hi,

The yaz-marcdump utility may be what you are looking for.
See for instance options -s and -C.

hth,
Shérab.


Re: Splitting a large file of MARC records into smaller files

2010-01-25 Thread Robert Fox
Assuming that memory won't be an issue, you could use MARC::Batch to  
read in the record set and print out separate files where you split on  
X amount of records. You would have an iterative loop loading each  
record from the large batch, and a counter variable that would get  
reset after X amount of records. You might want to name the sets using  
another counter that keeps track of how many sets you have and name  
each file something like batch_$count.mrc and write them out to a  
specific directory. Just concatenate each record to the previous one  
when you're making your smaller batches.
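The same count-and-rollover idea can also be sketched outside Perl with awk, treating the 0x1D byte as the record separator. The filenames (big.mrc, batch_NNN.mrc) and the chunk size n=2 below are invented for illustration; a real run would use a much larger n:

```shell
# Tiny stand-in input: five "records", each terminated by 0x1D.
printf 'r1\035r2\035r3\035r4\035r5\035' > big.mrc

# Every n records, roll over to the next batch_NNN.mrc output file.
awk 'BEGIN { RS = "\035"; ORS = "\035"; n = 2 }
     { f = sprintf("batch_%03d.mrc", int((NR - 1) / n)); print > f }' big.mrc
```

This writes batch_000.mrc, batch_001.mrc, and so on, each holding n complete records (the last file may hold fewer).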

Rob Fox
Hesburgh Libraries
University of Notre Dame

On Jan 25, 2010, at 9:48 AM, Nolte, Jennifer  
jennifer.no...@yale.edu wrote:

 Hello-

 I am working with files of MARC records that are over a million  
 records each. I'd like to split them down into smaller chunks,  
 preferably using a command line. MARCedit works, but is slow and  
 made for the desktop. I've looked around and haven't found anything  
 truly useful- Endeavor's MARCsplit comes close but doesn't separate  
 files into even numbers, only by matching criteria, so there could  
 be lots of record duplication between files.

 Any idea where to begin? I am a (super) novice Perl person.

 Thank you!

 ~Jenn Nolte


 Jenn Nolte
 Applications Manager / Database Analyst
 Production Systems Team
 Information Technology Office
 Yale University Library
 130 Wall St.
 New Haven CT 06520
 203 432 4878




Re: Splitting a large file of MARC records into smaller files

2010-01-25 Thread Saiful Amin
I also recommend using MARC::Batch. Attached is a simple script I wrote for
myself.

Saiful Amin
+91-9343826438


On Mon, Jan 25, 2010 at 8:33 PM, Robert Fox rf...@nd.edu wrote:

 Assuming that memory won't be an issue, you could use MARC::Batch to
 read in the record set and print out separate files where you split on
 X amount of records. You would have an iterative loop loading each
 record from the large batch, and a counter variable that would get
 reset after X amount of records. You might want to name the sets using
 another counter that keeps track of how many sets you have and name
 each file something like batch_$count.mrc and write them out to a
 specific directory. Just concatenate each record to the previous one
 when you're making your smaller batches.

 Rob Fox
 Hesburgh Libraries
 University of Notre Dame

 On Jan 25, 2010, at 9:48 AM, Nolte, Jennifer
 jennifer.no...@yale.edu wrote:

  Hello-
 
  I am working with files of MARC records that are over a million
  records each. I'd like to split them down into smaller chunks,
  preferably using a command line. MARCedit works, but is slow and
  made for the desktop. I've looked around and haven't found anything
  truly useful- Endeavor's MARCsplit comes close but doesn't separate
  files into even numbers, only by matching criteria, so there could
  be lots of record duplication between files.
 
  Any idea where to begin? I am a (super) novice Perl person.
 
  Thank you!
 
  ~Jenn Nolte
 
 
  Jenn Nolte
  Applications Manager / Database Analyst
  Production Systems Team
  Information Technology Office
  Yale University Library
  130 Wall St.
  New Haven CT 06520
  203 432 4878
 
 

#!c:/perl/bin/perl.exe
#
# Name: mbreaker.pl
# Version: 0.1
# Date: Jan 2009
# Author: Saiful Amin sai...@edutech.com
#
# Description: Extract MARC records based on command-line parameters

use strict;
use warnings;
use Getopt::Long;
use MARC::Batch;

my $start   = 0;
my $end = 1;

GetOptions('start=i' => \$start,
           'end=i'   => \$end
);

my $batch = MARC::Batch->new('USMARC', $ARGV[0]);
$batch->strict_off();
$batch->warnings_off();

my $num = 0;
while (my $record = $batch->next() ) {
    $num++;
    next if $num < $start;
    last if $num > $end;
    print $record->as_usmarc();
    warn "$num records\n" if ( $num % 1000 == 0 );
}
}


__END__

=head1 NAME

mbreaker.pl

Extracts records from a MARC file between the specified start and end positions

=head1 SYNOPSIS

mbreaker.pl [options] file

Options:
 -start start position for reading records
 -end   end position for reading records