Re: reading and writing of utf-8 with marc::batch [double encoding]

2013-03-28 Thread Ashley Sanders
Eric,

 How can I figure out whether or not a MARC record contains ONLY characters 
 from the UTF-8 character set?

You can use a regex to check if a string is UTF-8. There are various
examples floating around the internet; one is here:

   http://www.w3.org/International/questions/qa-forms-utf-8

You'll need to add the MARC control characters ^_, ^^, and ^] to the ASCII part
of the expression on the above page. (I think the W3C example is aimed at XML 1.0,
in which the MARC control characters are not allowed.)
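
As a concrete sketch, the W3C expression with the MARC control
characters (0x1D, 0x1E, 0x1F) added to the ASCII branch looks like
this in Perl (the sub name is mine):

```perl
# Validate a byte string as UTF-8, additionally allowing the MARC
# control characters 0x1d, 0x1e and 0x1f in the ASCII branch.
sub is_marc_utf8 {
    my ($bytes) = @_;
    return $bytes =~ m/\A(?:
        [\x09\x0A\x0D\x1D\x1E\x1F\x20-\x7E] # ASCII plus MARC controls
      | [\xC2-\xDF][\x80-\xBF]              # non-overlong 2-byte
      |  \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}   # straight 3-byte
      |  \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
      |  \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
      | [\xF1-\xF3][\x80-\xBF]{3}           # planes 4-15
      |  \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*\z/x ? 1 : 0;
}
```

Note the regexp is anchored at both ends, so a record passes only if
every byte belongs to one of the branches.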

Ashley.
--
Ashley Sanders a.sand...@manchester.ac.uk
http://copac.ac.uk -- A Mimas service funded by JISC at the University of 
Manchester



Re: Invalid UTF-8 characters causing MARC::Record crash.

2011-05-17 Thread Ashley Sanders
Hi,

 I'm using MARC::Batch and MARC::Field to iterate through a text file of
 bibliographic records from Voyager.
 
 The unrecoverable error is actually occurring in the Perl Unicode module
 which is, of course, called by MARC::Record.
 It's running into the invalid UTF-8 character 0xC2. When I looked up the
 Unicode character list, all of the 0xC2 entries are the first byte of a
 two-byte sequence, so it appears that the second half is missing.
 

I don't have my MARC-8 character set list with me, but I'd guess it's
some good old MARC-8 data mixed in with UTF-8. Or a MARC-8 record with
the wrong leader info.

Unfortunately bad UTF-8 is pretty common in my experience. You can
use a regexp to check if something is valid utf-8:

  http://keithdevens.com/weblog/archive/2004/Jun/29/UTF-8.regex

Then it's up to you to take appropriate action.
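
As an alternative sketch to a hand-rolled regexp, core Perl's Encode
module can do the validity check for you: ask for a strict decode and
trap the failure (the sub name is mine):

```perl
use Encode qw(decode FB_CROAK);

# True if $bytes is well-formed UTF-8, false otherwise. decode()
# with FB_CROAK dies on the first malformed sequence -- such as a
# lone 0xC2 lead byte -- which we trap with eval.
sub is_valid_utf8 {
    my ($bytes) = @_;
    return eval { decode('UTF-8', $bytes, FB_CROAK); 1 } ? 1 : 0;
}
```

You could call this on each record (or field) before handing it to
code that assumes clean UTF-8, and skip or repair the failures.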

Ashley.
--
Ashley Sanders a.sand...@manchester.ac.uk
Copac http://copac.ac.uk -- A Mimas service funded by JISC



Re: Splitting a large file of MARC records into smaller files

2010-01-25 Thread Ashley Sanders

Jennifer,


I am working with files of MARC records that are over a million records each. 
I'd like to split them down into smaller chunks, preferably using a command 
line. MARCedit works, but is slow and made for the desktop. I've looked around 
and haven't found anything truly useful- Endeavor's MARCsplit comes close but 
doesn't separate files into even numbers, only by matching criteria, so there 
could be lots of record duplication between files.

Any idea where to begin? I am a (super) novice Perl person.


Well... if you have a *nix style command line and the usual
utilities and your file of MARC records is in exchange format
with the records just delimited by the end-of-record character
0x1d, then you could do something like this:

tr '\035' '\n' < my-marc-file.mrc > recs.txt
split -l 1000 recs.txt

The tr command will turn the MARC end-of-record characters
into newlines. Then use the split command to carve up
the output of tr into files of 1000 records.

You then may have to use tr to convert the newlines back
to MARC end-of-record characters.
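
If you'd rather stay in Perl (and avoid the extra tr passes), the same
job can be sketched like this; the sub name, output file-name pattern
and chunk size are all illustrative:

```perl
# Split a MARC exchange file into chunks of $per_file records.
# Setting $/ to the MARC end-of-record character makes readline
# return one complete record at a time, terminator included, so
# nothing needs converting back afterwards.
sub split_marc {
    my ($infile, $prefix, $per_file) = @_;
    local $/ = "\x1d";                      # MARC end-of-record character
    open my $in, '<:raw', $infile or die "$infile: $!";
    my ($count, $chunk, $out) = (0, 0, undef);
    while (my $rec = <$in>) {
        if ($count++ % $per_file == 0) {
            close $out if $out;
            open $out, '>:raw', sprintf('%s%03d.mrc', $prefix, $chunk++)
                or die $!;
        }
        print {$out} $rec;                  # record still ends in \x1d
    }
    close $out if $out;
    close $in;
    return $chunk;                          # number of files written
}
```

For example, split_marc('my-marc-file.mrc', 'recs-', 1000) would write
recs-000.mrc, recs-001.mrc, and so on.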

Ashley.

--
Ashley Sanders   a.sand...@manchester.ac.uk
Copac http://copac.ac.uk A Mimas service funded by JISC


Re: Help for z39.50 search in proquest dissertations and theses

2008-02-19 Thread Ashley Sanders

Anne L. Highsmith wrote:

I am trying to create a perl program using Net::Z3950 to connect to and search 
the proquest dissertations and theses database. I can connect and search via 
YAZ, but am having no luck via perl/Net::Z3950.

Here's the perl program:
use Net::Z3950;
$mgr = new Net::Z3950::Manager(
    user => '[user]',
    pass => '[password]',
);



$target = 'proquest-z3950.umi.com';
$port = '210';
$database = 'PQD';
$recordSyntax = 'SUTRS';
$conn = new Net::Z3950::Connection($mgr, $target, $port, databaseName => $database)
    or die "$!\n";
$rs = $conn->search(-prefix => '@and @attr 1=4 steamboat bertrand')
    or die "$!\n";
$count = $rs->size();
print "$count\n" and die;

When I have the "$!\n"; in the '$rs = ' line, I always get the message "Operation in progress".
When I remove the "$!\n"; in the '$rs = ' line, and it falls through to $count = $rs->size();, I get:

Can't call method "size" on an undefined value at ./zd line 13.


I've never used Net::Z3950, but my understanding of the Manager
class is that the connection is being made
asynchronously -- which means you need to use the wait()
function, as in the example here:

   http://perl.z3950.org/docs/Z3950/Manager.html

You could try adding async => 0 to the Manager object
construction, i.e.:

$mgr = new Net::Z3950::Manager(async => 0, user => 'xxx', pass => 'yyy');

or perhaps adding it here instead:

  $conn = new Net::Z3950::Connection($mgr, $target, $port, async => 0);

Hope this helps,

Regards,

Ashley.
--
Ashley Sanders   [EMAIL PROTECTED]
Copac http://copac.ac.uk A Mimas service funded by JISC


Re: MARC::Charset

2007-03-14 Thread Ashley Sanders

Your MARC records appear to be encoded in MARC-8 as evidenced by ergáo in 
which the combining
accent character comes before the character to be modified.  I.e. the byte 
string that displays as
ergáo in your email would display as ergò (with a Latin small letter o with 
grave) in a MARC-8
aware client.


I'd just like to relate my recent experiences of retrieving MARC21
records through various library Z39.50 servers. Put simply, you
cannot trust MARC leader character 9 to correctly indicate the
character set used.

From libraries that have set the leader to indicate the records are
in the MARC-8 character set, I have retrieved records encoded as
Latin-1, UTF-8 and MARC-8. From libraries that set the leader to
indicate Unicode, I get records in MARC-8 and UTF-8.

You also get encodings in MARC-8 records like &#x1EF6; to indicate
a Unicode character. I think &#12345; is now legal in MARC-8 to
indicate a Unicode character that isn't in the MARC-8 repertoire.

So, basically, you either need prior knowledge about the actual
character encoding used, or you have to test. Testing for UTF-8 is
fairly straightforward, and a long string of text (which admittedly
you don't tend to get in MARC records) that tests as UTF-8 is very
unlikely to be anything else. Distinguishing Latin-1 from MARC-8 is
a bit more like guesswork. As a test for MARC-8 I look for the
common combining diacritics followed by a vowel.

Regards,

Ashley.
--
Ashley Sanders   [EMAIL PROTECTED]
Copac http://copac.ac.uk A MIMAS Service funded by JISC


Re: MARC::Charset

2007-03-14 Thread Ashley Sanders

Michael,

So, basically, you either need prior knowledge about the 
actual character encoding used, or you have to test. Testing 
for UTF-8 is fairly straightforward...


How are you testing for UTF-8?


There's a handy perl regexp on the W3C web site at:

   http://www.w3.org/International/questions/qa-forms-utf-8

You'll need to change the ASCII part of the regexp to something like:

   [\x01-\x7e]

This will more than accommodate the various control characters you
can find in MARC records (don't forget Esc as the lead-in to Greek,
Cyrillic, etc.)

The W3C regexp tests the whole string -- which may be inefficient
if you are testing lots of data. Depending on what sort of accuracy
you want and whether or not overlong UTF-8 sequences are a concern,
you could just test for the following:

   [\xc2-\xf4][\x80-\xbf]

The Wikipedia page on UTF-8 is worth a read.
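
That quicker heuristic can be sketched as a one-line sub (the sample
strings in the comments are illustrative): it just asks whether the
data contains at least one plausible UTF-8 lead byte followed by a
continuation byte.

```perl
# Quick-and-dirty heuristic: does the string contain at least one
# plausible UTF-8 multi-byte sequence? Cheaper than validating the
# whole string, at the cost of accepting some invalid input.
sub probably_utf8 {
    my ($bytes) = @_;
    return $bytes =~ /[\xc2-\xf4][\x80-\xbf]/ ? 1 : 0;
}
```

Note that MARC-8 combining diacritics tend not to trigger it, since
they are followed by an ASCII letter rather than a continuation byte.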


Distinguishing Latin-1 from MARC-8 is a bit more like guess work.
As a test for MARC-8 I look for the common combining diacritics
followed by a vowel.


Do you have a programmatic way to do that test, or are you eye-balling the
records?


I use a simple regexp:

  ([\xe1-\xe3][aeiouAEIOU]|\xf0[cC])

which may be rather too simple. For a critical application I'd come up
with something a bit better (after first eye-balling a load of records.)
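
Wrapped up as a sub, that heuristic looks like this (a sketch; 0xE1-0xE3
are the MARC-8/ANSEL combining grave, acute and circumflex, and 0xF0 the
combining cedilla):

```perl
# Guess whether a byte string is MARC-8: look for a common combining
# diacritic immediately followed by the vowel it modifies, or a
# combining cedilla before c/C. In MARC-8 the diacritic precedes the
# base character, so these pairs are a strong hint.
sub looks_like_marc8 {
    my ($bytes) = @_;
    return $bytes =~ /(?:[\xe1-\xe3][aeiouAEIOU]|\xf0[cC])/ ? 1 : 0;
}
```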

Just as an aside, I'm not using perl -- I'm using the Boost Regexp
library for C++ (which is a good implementation of perl regexps.)

Regards,

Ashley.
--
Ashley Sanders   [EMAIL PROTECTED]
Copac http://copac.ac.uk A MIMAS Service funded by JISC


Re: Slowdown when fetching multiple records

2006-02-21 Thread Ashley Sanders

Tony Bowden wrote:


However, I think the slowdown I was getting is a different problem. I
don't know enough about how Z39.50 servers actually work, but I think
that requesting each book in turn just had a high overhead of network
traffic,


Correct. If you request a total of ten records one at a time
then that equates to ten request/response conversations between
client and server. This really is very wasteful of network and
server resources. It is for reasons of efficiency that z39.50
allows you to ask for multiple records from the server in
one fell swoop. (A server may limit the number of records
it will return in any one request, but I wouldn't have thought
20 records would cause any server a problem (unless they are
very large records.))

Ashley.
--
Ashley Sanders [EMAIL PROTECTED]
Copac http://copac.ac.uk -- A MIMAS service funded by JISC


Re: Zeta Perl Opac Format

2004-04-27 Thread Ashley Sanders
David,

 I know the target is capable of returning records in OPAC format --
 which I take it is some type of MARC XML?

OPAC record syntax is not XML I'm afraid. An OPAC record is a conventional MARC
bib record along with separate holdings records. The holdings records
can be either more MARC records or in a special Z39.50-defined format.

Regards,

Ashley.
-- 
Ashley Sanders[EMAIL PROTECTED]
COPAC: A public bibliographic database from MIMAS, funded by JISC
 http://copac.ac.uk/ - [EMAIL PROTECTED]


Re: Adding non standard MARC subfields with MARC::Record

2004-04-05 Thread Ashley Sanders
  Sirsi uses some non standard subfields to create links between records.
  Typically these subfields are '?' and '='.  How can I add these non
  standard subfields to records that I am creating/editing with
  MARC::Record?
 
 MARC::Record is actually quite lenient about what you can use as a subfield. 
 
 $record->append_fields(
     MARC::Field->new( 245, 0, 0, '?' => 'foo', '=' => 'bar' )
 );
 
 Just make sure you quote '?' and '=' or else weirdness will ensue. :)

Just to point out that '?' and '=' are (amongst many other
non-alphanumeric characters) explicitly allowed by MARC21 for use in
local data elements. So they are standard-conforming really.

Ashley.

-- 
Ashley Sanders[EMAIL PROTECTED]
COPAC: A public bibliographic database from MIMAS, funded by JISC
 http://copac.ac.uk/ - [EMAIL PROTECTED]


Re: Manually created records

2003-10-15 Thread Ashley Sanders
Christoffer,

 i.e. Zebra treats the record as being empty.
 

 00154   00085   
 001001100030009000110050017000201160003724500150005301REALNODE20031015153536.01
  aTest Author00aTest Title

The record isn't strictly correct as the Indicator count and
Subfield code length are both blank (character positions 10 and
11.) MARC21 says these should always be set to 2. Also the
Length of length-of-field, length of starting-character-position
and length of implementation-defined (positions 20, 21, 22)
should also be set to 4, 5, and 0 respectively.

Also the character at position 25 should be the field terminator
character, not a blank.

So, I think your record should look like:

00154 2200085   
450001001100030009000110050017000201160003724500150005301REALNODE20031015153536.01
 aTest Author00aTest Title

If an application such as zebra is doing things correctly, then
it has every right to think the record is bad if it sees these
errors.

Of course, it may be something else completely.

Regards,

Ashley.

-- 
Ashley Sanders[EMAIL PROTECTED]
COPAC: A public bibliographic database from MIMAS, funded by JISC
 http://copac.ac.uk/ - [EMAIL PROTECTED]