Re: MARC-8 to UTF-8 conversion

2005-12-06 Thread Brad Baxter
I hope it's as simple as that.  I don't have first-hand
experience doing what you are, i.e., building the dbm file
at install time for access at run time.

--
Brad

On 12/5/05, Edward Summers [EMAIL PROTECTED] wrote:

 On Dec 5, 2005, at 8:33 PM, Brad Baxter wrote:
  I think you're correct to be conservative.  I've been spoiled
  by servers with lots of memory, so my judgement may be in
  question.  :-)

 Wow, AnyDBM_File looks perfect. It'll use ndbm, then Berkeley DB,
 GDBM, and then fall back on SDBM. Like you said SDBM comes standard
 with Perl (although it apparently runs slower than Berkeley DB).

 Thanks very much for the pointers Brad and Mike.

 //Ed



Re: MARC-8 to UTF-8 conversion

2005-12-05 Thread bargioni

Dear Doran, Ed, Bill and others:
thank you for your replies. I installed MARC::Charset using the CPAN 
module to ensure dependencies. I don't know why it is not working well. 
If you have some tricks, please let me know.
Although I'm interested in MARC-8 - UTF-8 conversion in memory, a good 
way to do the conversion seems to be yaz-marcdump. Unfortunately it 
seems unable to work with stdin as input, so I need to use a temp file 
for each record: a heavy conversion process.
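
The temp-file workaround described above can be sketched roughly as follows. This is only an illustration: convert_via_tempfile is a hypothetical helper name, and the yaz-marcdump options shown in the trailing comment (-f and -t for the source and target encodings) are an assumption that should be checked against the installed YAZ version, as should its output format.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write one record to a temporary file, run an external converter on it,
# and return whatever the converter prints to stdout.
sub convert_via_tempfile {
    my ($record, @command) = @_;

    my ($fh, $path) = tempfile(SUFFIX => '.mrc', UNLINK => 1);
    binmode $fh;
    print {$fh} $record;
    close $fh or die "Cannot close $path: $!";

    # List-form pipe open bypasses the shell, so the path needs no quoting.
    open my $out, '-|', @command, $path
        or die "Cannot run $command[0]: $!";
    binmode $out;
    my $converted = do { local $/; <$out> };
    close $out or die "$command[0] failed (exit status $?)";

    unlink $path;
    return $converted;
}

# Assumed invocation; verify the options for your yaz-marcdump version:
# my $utf8 = convert_via_tempfile($marc8, qw(yaz-marcdump -f MARC-8 -t UTF-8));
```

One process and one temp file per record is where the cost comes from; writing many records to a single temp file and converting them in one run would likely reduce it.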

BTW, I'm going to manage latin-1, latin-2 and arabic MARC records.
Bye. Stefano

On 02/dic/05, at 16:01, Doran, Michael D wrote:


Hi Stefano,

Installing the MARC::Charset module can be a bit problematic for the
casual Perl user, due to the prerequisites.  However if you need to do a
MARC-8 to UTF-8 conversion, that's probably the best tool available.

The issue with MARC-8 conversions is that MARC-8 is only really used for
encoding bibliographic records and with its use of combining diacritics
and escape sequences, it is more complex than the typical 8-bit
character set [1].  Most of the software development in the area of
library-centric character sets is done by ILS vendors, who typically
don't make their efforts available in the form of freely available Perl
modules.

You didn't mention why you were wanting to do a character set
conversion.  If you just need a quick and dirty conversion for
ephemeral display of bibliographic information on a web page, you might
look at alternatives such as converting from MARC-8 to Latin-1 (ISO
8859-1).  That's a potentially lossy conversion, however if most of your
records are Italian, the Latin-1 repertoire should suffice.  There are
some available Perl routines that should handle that conversion [2].

-- Michael

[1] Coded Character Sets: A Technical Primer for Librarians
http://rocky.uta.edu/doran/charsets/

[2] MARC to Latin: a charset conversion routine in Perl
http://rocky.uta.edu/doran/charset/

# Michael Doran, Systems Librarian
# University of Texas at Arlington
# 817-272-5326 office
# 817-688-1926 cell
# [EMAIL PROTECTED]
# http://rocky.uta.edu/doran/


-Original Message-
From: bargioni [mailto:[EMAIL PROTECTED]
Sent: Friday, December 02, 2005 4:43 AM
To: perl4lib@perl.org
Subject: MARC-8 to UTF-8 conversion

Hi, I'm trying to convert MARC-8 records to UTF-8 on the fly.
MARC::Charset doesn't work for me.
Any suggestions? A command-line approach would also suit my purposes.
TIA. Stefano
--
Dott. Stefano Bargioni
Pontificia Universita' della Santa Croce - Roma
Vicedirettore della Biblioteca




RE: MARC-8 to UTF-8 conversion

2005-12-05 Thread Doran, Michael D
 If anyone has any suggestions on how to handle a
 largish character mapping table [...]

For those who aren't familiar with the MARC 21 alternate character set
repertoires (specifically, the East Asian ideographs), by largish, Ed
is talking on the order of a table containing upwards of 16,000
mappings.  

 Perhaps at the very least I can include some
 information about DB_File difficulties prominently in the
 documentation in the new version.

That would definitely be helpful.  It would also be helpful if error
messages that get kicked out during an automated install (perl -MCPAN
-e 'install MARC::Charset') due to missing DB_File prereq components
were more informative as to the problem.  Below are the error messages
that users get now:

BEGIN failed--compilation aborted at lib/MARC/Charset.pm line 12.
Compilation failed in require at Makefile.PL line 7.
BEGIN failed--compilation aborted at Makefile.PL line 7.
Running make test
  Make had some problems, maybe interrupted? Won't test
Running make install
  Make had some problems, maybe interrupted? Won't install
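
One way an installer could fail more informatively than the output above is to test for the prerequisite before anything else is loaded. The following is a minimal sketch, not the module's actual Makefile.PL; missing_prereq_message is a hypothetical helper and the hint text is only a suggestion.

```perl
use strict;
use warnings;

# Return nothing if the module loads, otherwise a readable explanation.
sub missing_prereq_message {
    my ($module, $hint) = @_;
    return if eval "require $module; 1";
    return "MARC::Charset needs $module, which could not be loaded.\n$hint\n";
}

# Near the top of a Makefile.PL this might be used as:
# if (my $msg = missing_prereq_message('DB_File',
#         'DB_File needs the Berkeley DB library and headers installed.')) {
#     die $msg;
# }
```

The point is that the user sees the name of the missing piece and a hint about how to get it, instead of a bare "BEGIN failed" from inside the module.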

I'm probably starting to sound nit-picky, but please understand that
it's only because I think MARC::Charset is a great module and I'd like
for more people to be using it.  :-)

-- Michael

 -Original Message-
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On 
 Behalf Of Ed Summers
 Sent: Monday, December 05, 2005 12:14 PM
 To: perl4lib@perl.org
 Subject: Re: MARC-8 to UTF-8 conversion
 
 On 12/5/05, Doran, Michael D [EMAIL PROTECTED] wrote:
  So... this is all very interesting (and I've definitely learned
  something here), but like I suggested previously, this level of digging
  may be a bit beyond the casual Perl user.  ;-)
 
 Yep, point taken. I'm guessing you are right: when you built perl from
 source it couldn't find BerkeleyDB so didn't install DB_File. Good to
 know for the future. If anyone has any suggestions on how to handle a
 largish character mapping table when someone does a:
 
 use MARC::Charset;
 
 I'm open to suggestions. Perhaps at the very least I can include some
 information about DB_File difficulties prominently in the
 documentation in the new version.
 
 Thanks!
 //Ed
 


RE: MARC-8 to UTF-8 conversion

2005-12-05 Thread Thomale, J
 I'm probably starting to sound nit-picky, but please 
 understand that it's only because I think MARC::Charset is a 
 great module and I'd like for more people to be using it.  :-)

Let me second Michael's statement. A couple of months ago we tried
installing MARC::Charset and ran into exactly the same problem that
Michael has described (although this was in a Linux environment). It
required DB_File to be installed--which it wasn't, in our Perl 5.8.3
installation, and couldn't be installed apart from Berkeley DB, if I
remember correctly. I should note that I wasn't the one trying to do the
installation, so I'm not 100% certain about the details. The bottom line
is that we ended up not using the module because we couldn't easily get
DB_File to install.

Jason Thomale
Metadata Librarian
Texas Tech University Libraries


Re: MARC-8 to UTF-8 conversion

2005-12-05 Thread Ed Summers
Ok, this is great information to have moving forward with the next
MARC::Charset...many thanks Michael and Jason. Michael, you are totally
right, the installer really shouldn't fail like that...I'd never tested
it on a system that lacked DB_File so I didn't know. And CPAN testers
didn't pick it up either.

//Ed


Re: MARC-8 to UTF-8 conversion

2005-12-05 Thread Ed Summers
 Am I right that this amounts to less than 1Meg (EastAsian.db +
 UTF8.db)? Depending on your system and your needs (more
 speed?), that may not be considered large and might fit into
 memory fine.  Otherwise, I think any of the in-core (non-DB_File)
 DBM files ought to suffice for that amount of data.

Which in-core dbm modules are these? I thought DB_File was the de facto
standard for doing this...

As for the memory, it's not really the memory which I'm concerned
about as much as it is the time it would take to build the database
every time someone used the MARC::Charset module. Perhaps I'm falling
victim to the curse of premature optimization again though. If I had a
text file of 16,000 mappings that was read in every time someone did a:

use MARC::Charset;

would people be put out? I imagine folks in mod_perl environments
wouldn't care too much--although a MB of ram for each apache process
has a way of adding up. At least for high volume sites.

//Ed
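
The load-time cost in question can be measured rather than guessed. The sketch below assumes a hypothetical tab-separated file format (MARC-8 code in hex, then the Unicode code point in hex) and times the load of a synthetic 16,000-line table held in memory; the codes themselves are made up for the measurement and are not real MARC-8 mappings.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Parse lines of the (hypothetical) form "<marc8 hex>\t<unicode hex>" into a hash.
sub load_mappings {
    my ($fh) = @_;
    my %map;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^\s*(?:#|$)/;    # skip comments and blank lines
        my ($marc8, $ucs) = split /\t/, $line;
        $map{ lc $marc8 } = chr hex $ucs;
    }
    return \%map;
}

# Build a synthetic 16,000-line table in memory and time the load.
my $table = join '',
    map { sprintf "%06x\t%04x\n", 0x210000 + $_, 0x4E00 + $_ } 0 .. 15_999;
open my $fh, '<', \$table or die "Cannot open in-memory table: $!";
my $start = time;
my $map   = load_mappings($fh);
printf "Loaded %d mappings in %.3f seconds\n", scalar keys %$map, time - $start;
```

Running this on the target hardware gives a concrete figure to weigh against the install-time trouble of a DBM dependency.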


Re: MARC-8 to UTF-8 conversion

2005-12-05 Thread Brad Baxter
On 12/5/05, Ed Summers [EMAIL PROTECTED] wrote:

 On 12/5/05, Doran, Michael D [EMAIL PROTECTED] wrote:
  So... this is all very interesting (and I've definitely learned
  something here), but like I suggested previously, this level of digging
  may be a bit beyond the casual Perl user.  ;-)

 Yep, point taken. I'm guessing you are right: when you built perl from
 source it couldn't find BerkeleyDB so didn't install DB_File. Good to
 know for the future. If anyone has any suggestions on how to handle a
 largish character mapping table when someone does a:

 use MARC::Charset;

 I'm open to suggestions. Perhaps at the very least I can include some
 information about DB_File difficulties prominently in the
 documentation in the new version.


Does it have to be DB_File?  Could you just

use AnyDBM_File;

--
Brad
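
A minimal sketch of the AnyDBM_File suggestion: build the DBM file once (for example at install time) and open it read-only at run time. The names build_db and lookup, the file name, and the two sample entries are illustrative only. Which backend is actually used depends on what the local perl was built with, and some backends take different open flags or impose size limits, so treat this as a starting point.

```perl
use strict;
use warnings;
use Fcntl;                       # exports O_RDWR, O_CREAT, O_RDONLY
use AnyDBM_File;
use File::Temp qw(tempdir);

# Install time: write the mapping table into a DBM file.
sub build_db {
    my ($base, $mappings) = @_;
    my %db;
    tie(%db, 'AnyDBM_File', $base, O_RDWR | O_CREAT, 0644)
        or die "Cannot create $base: $!";
    $db{$_} = $mappings->{$_} for keys %$mappings;
    untie %db;
    return;
}

# Run time: open the same file read-only and look up a single key.
sub lookup {
    my ($base, $key) = @_;
    my %db;
    tie(%db, 'AnyDBM_File', $base, O_RDONLY, 0644)
        or die "Cannot open $base: $!";
    my $value = $db{$key};
    untie %db;
    return $value;
}

my $dir  = tempdir(CLEANUP => 1);
my $base = "$dir/marc8-map";     # hypothetical file name
build_db($base, { E1 => '0300', E2 => '0301' });   # illustrative entries only
print "Backend: $AnyDBM_File::ISA[0]\n";
print "E1 maps to U+", lookup($base, 'E1'), "\n";
```

After AnyDBM_File loads, its @ISA holds the single backend it found, which is why the script can report which DBM implementation was picked.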


Re: MARC-8 to UTF-8 conversion

2005-12-05 Thread Edward Summers

On Dec 5, 2005, at 8:33 PM, Brad Baxter wrote:

I think you're correct to be conservative.  I've been spoiled
by servers with lots of memory, so my judgement may be in
question.  :-)


Wow, AnyDBM_File looks perfect. It'll use ndbm, then Berkeley DB,  
GDBM, and then fall back on SDBM. Like you said SDBM comes standard  
with Perl (although it apparently runs slower than Berkeley DB).


Thanks very much for the pointers Brad and Mike.

//Ed