Re: MARC-8 to UTF-8 conversion
I hope it's as simple as that. I don't have first-hand experience doing what you are doing, i.e., building the dbm file at install time for access at run time.

-- Brad

On 12/5/05, Edward Summers [EMAIL PROTECTED] wrote:

> On Dec 5, 2005, at 8:33 PM, Brad Baxter wrote:
>
> > I think you're correct to be conservative. I've been spoiled by servers with lots of memory, so my judgement may be in question. :-)
>
> Wow, AnyDBM_File looks perfect. It'll use ndbm, then Berkeley DB, GDBM, and then fall back on SDBM. Like you said, SDBM comes standard with Perl (although it apparently runs slower than Berkeley DB).
>
> Thanks very much for the pointers, Brad and Mike.
>
> //Ed
Re: MARC-8 to UTF-8 conversion
Dear Doran, Ed, Bill and others: thank you for your replies.

I installed MARC::Charset using the CPAN module to ensure dependencies. I don't know why it is not working well. If you have some tricks, please let me know.

Although I'm interested in MARC-8 to UTF-8 conversion in memory, a good way to do the conversion seems to be yaz-marcdump. Unfortunately it seems unable to work with stdin as input, so I need to use a temp file for each record: a heavy conversion process.

BTW, I'm going to manage Latin-1, Latin-2 and Arabic MARC records.

Bye. Stefano

On 02/Dec/05, at 16:01, Doran, Michael D wrote:

> Hi Stefano,
>
> Installing the MARC::Charset module can be a bit problematic for the casual Perl user, due to the prerequisites. However, if you need to do a MARC-8 to UTF-8 conversion, that's probably the best tool available.
>
> The issue with MARC-8 conversions is that MARC-8 is only really used for encoding bibliographic records, and with its use of combining diacritics and escape sequences it is more complex than the typical 8-bit character set [1]. Most of the software development in the area of library-centric character sets is done by ILS vendors, who typically don't make their efforts available in the form of freely available Perl modules.
>
> You didn't mention why you wanted to do a character set conversion. If you just need a quick and dirty conversion for ephemeral display of bibliographic information on a web page, you might look at alternatives such as converting from MARC-8 to Latin-1 (ISO 8859-1). That's a potentially lossy conversion; however, if most of your records are Italian, the Latin-1 repertoire should suffice. There are some available Perl routines that should handle that conversion [2].
>
> -- Michael
>
> [1] Coded Character Sets: A Technical Primer for Librarians
>     http://rocky.uta.edu/doran/charsets/
> [2] MARC to Latin: a charset conversion routine in Perl
>     http://rocky.uta.edu/doran/charset/
>
> # Michael Doran, Systems Librarian
> # University of Texas at Arlington
> # 817-272-5326 office
> # 817-688-1926 cell
> # [EMAIL PROTECTED]
> # http://rocky.uta.edu/doran/
>
> -----Original Message-----
> From: bargioni [mailto:[EMAIL PROTECTED]
> Sent: Friday, December 02, 2005 4:43 AM
> To: perl4lib@perl.org
> Subject: MARC-8 to UTF-8 conversion
>
> Hi, I'm trying to convert MARC-8 records to UTF-8 on the fly. MARC::Charset doesn't work for me. Any suggestion? A command-line approach would also be good for my purposes.
>
> TIA. Stefano
>
> --
> Dott. Stefano Bargioni
> Pontificia Universita' della Santa Croce - Roma
> Vicedirettore della Biblioteca
RE: MARC-8 to UTF-8 conversion
> If anyone has any suggestions on how to handle a largish character mapping table [...]

For those who aren't familiar with the MARC 21 alternate character set repertoires (specifically, the East Asian ideographs): by "largish", Ed is talking about a table containing upwards of 16,000 mappings.

> Perhaps at the very least I can include some information about DB_File difficulties prominently in the documentation in the new version.

That would definitely be helpful. It would also be helpful if the error messages that get kicked out during an automated install (perl -MCPAN -e 'install MARC::Charset') due to missing DB_File prereq components were more informative as to the problem. Below are the error messages that users get now:

    BEGIN failed--compilation aborted at lib/MARC/Charset.pm line 12.
    Compilation failed in require at Makefile.PL line 7.
    BEGIN failed--compilation aborted at Makefile.PL line 7.
    Running make test
      Make had some problems, maybe interrupted? Won't test
    Running make install
      Make had some problems, maybe interrupted? Won't install

I'm probably starting to sound nit-picky, but please understand that it's only because I think MARC::Charset is a great module and I'd like for more people to be using it. :-)

-- Michael

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Ed Summers
Sent: Monday, December 05, 2005 12:14 PM
To: perl4lib@perl.org
Subject: Re: MARC-8 to UTF-8 conversion

On 12/5/05, Doran, Michael D [EMAIL PROTECTED] wrote:

> So... this is all very interesting (and I've definitely learned something here), but like I suggested previously, this level of digging may be a bit beyond the casual Perl user. ;-)

Yep, point taken. I'm guessing you are right: when you built perl from source it couldn't find BerkeleyDB, so it didn't install DB_File. Good to know for the future.

If anyone has any suggestions on how to handle a largish character mapping table when someone does a:

    use MARC::Charset;

I'm open to suggestions. Perhaps at the very least I can include some information about DB_File difficulties prominently in the documentation in the new version.

Thanks! //Ed
RE: MARC-8 to UTF-8 conversion
> I'm probably starting to sound nit-picky, but please understand that it's only because I think MARC::Charset is a great module and I'd like for more people to be using it. :-)

Let me second Michael's statement. A couple of months ago we tried installing MARC::Charset and ran into exactly the same problem that Michael has described (although this was in a Linux environment). It required DB_File to be installed, which it wasn't in our Perl 5.8.3 installation, and DB_File couldn't be installed apart from Berkeley DB, if I remember correctly. I should note that I wasn't the one trying to do the installation, so I'm not 100% certain about the details.

The bottom line is that we ended up not using the module because we couldn't easily get DB_File to install.

Jason Thomale
Metadata Librarian
Texas Tech University Libraries
Re: MARC-8 to UTF-8 conversion
Ok, this is great information to have moving forward with the next MARC::Charset...many thanks, Michael and Jason. Michael, you are totally right: the installer really shouldn't fail like that...I'd never tested it on a system that lacked DB_File, so I didn't know. And CPAN testers didn't pick it up either.

//Ed
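The kind of friendlier failure asked for above could be sketched roughly as follows. This is only an illustration, not MARC::Charset's actual installer code: check_prereq() is a hypothetical helper, and the hint text is an assumption based on what was reported in this thread.

```perl
use strict;
use warnings;

# check_prereq() is a hypothetical helper for illustration.
# It returns an empty string if the module loads, or a readable
# explanation (including the supplied hint) if it does not.
sub check_prereq {
    my ($module, $hint) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;   # Foo::Bar -> Foo/Bar.pm
    return '' if eval { require $file; 1 };
    return "This distribution needs $module, which could not be loaded.\n$hint\n";
}

# Hint wording is an assumption drawn from the discussion in this thread.
my $problem = check_prereq(
    'DB_File',
    'DB_File is only built when the Berkeley DB library and headers are available.',
);

if ($problem) {
    # A real Makefile.PL would decide here whether to stop or carry on.
    warn $problem;
}
else {
    print "DB_File is available\n";
}
```

Wrapping the require in eval means a missing prerequisite produces one plain sentence instead of a chain of "BEGIN failed" messages.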
Re: MARC-8 to UTF-8 conversion
> Am I right that this amounts to less than 1 Meg (EastAsian.db + UTF8.db)? Depending on your system and your needs (more speed?), that may not be considered large and might fit into memory fine. Otherwise, I think any of the in-core (non-DB_File) DBM files ought to suffice for that amount of data.

Which in-core dbm modules are these? I thought DB_File was the de facto standard for doing this...

As for the memory, it's not really the memory I'm concerned about as much as the time it would take to build the database every time someone used the MARC::Charset module. Perhaps I'm falling victim to the curse of premature optimization again, though. If I had a text file of 16,000 mappings that was read in every time someone did a:

    use MARC::Charset;

would people be put out? I imagine folks in mod_perl environments wouldn't care too much, although a MB of RAM for each Apache process has a way of adding up. At least for high-volume sites.

//Ed
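One way to put a number on the load-time worry is simply to measure it. The sketch below writes a synthetic table of 16,000 tab-separated mappings to a temp file and times reading it into a hash. The keys, values and file layout are made up for illustration; they are not the real MARC-8 table or MARC::Charset's format.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use Time::HiRes qw(gettimeofday tv_interval);

# Write a synthetic table of 16,000 mappings, one "key<TAB>value" per line.
# The data is invented; it only mimics the size of the real table.
my ($fh, $path) = tempfile(UNLINK => 1);
for my $i (1 .. 16_000) {
    printf {$fh} "%06X\t%04X\n", $i, 0x4E00 + $i;
}
close $fh or die "Cannot close $path: $!";

# Time a plain load into an in-memory hash, as a module might do at "use" time.
my $start = [gettimeofday];
my %map;
open my $in, '<', $path or die "Cannot open $path: $!";
while (my $line = <$in>) {
    chomp $line;
    my ($marc8, $ucs) = split /\t/, $line;
    $map{$marc8} = $ucs;
}
close $in;
my $elapsed = tv_interval($start);

printf "Loaded %d mappings in %.4f seconds\n", scalar(keys %map), $elapsed;
```

The timing will vary by machine, so no figure is quoted here; running it on the target system gives a concrete basis for deciding whether an on-disk DBM file is worth the install-time complexity.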
Re: MARC-8 to UTF-8 conversion
On 12/5/05, Ed Summers [EMAIL PROTECTED] wrote:

> On 12/5/05, Doran, Michael D [EMAIL PROTECTED] wrote:
>
> > So... this is all very interesting (and I've definitely learned something here), but like I suggested previously, this level of digging may be a bit beyond the casual Perl user. ;-)
>
> Yep, point taken. I'm guessing you are right: when you built perl from source it couldn't find BerkeleyDB, so it didn't install DB_File. Good to know for the future.
>
> If anyone has any suggestions on how to handle a largish character mapping table when someone does a:
>
>     use MARC::Charset;
>
> I'm open to suggestions. Perhaps at the very least I can include some information about DB_File difficulties prominently in the documentation in the new version.

Does it have to be DB_File? Could you just

    use AnyDBM_File;

-- Brad
Re: MARC-8 to UTF-8 conversion
On Dec 5, 2005, at 8:33 PM, Brad Baxter wrote:

> I think you're correct to be conservative. I've been spoiled by servers with lots of memory, so my judgement may be in question. :-)

Wow, AnyDBM_File looks perfect. It'll use ndbm, then Berkeley DB, GDBM, and then fall back on SDBM. Like you said, SDBM comes standard with Perl (although it apparently runs slower than Berkeley DB).

Thanks very much for the pointers, Brad and Mike.

//Ed
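For readers who haven't used it, a minimal sketch of the AnyDBM_File approach discussed above might look like this: write the table to a DBM file once, then tie it read-only when it is needed. The file location and the key/value pair are placeholders for illustration, not MARC::Charset's real data or layout.

```perl
use strict;
use warnings;
use Fcntl;                       # supplies O_RDWR, O_CREAT, O_RDONLY
use AnyDBM_File;                 # loads the first DBM backend it can find
use File::Temp qw(tempdir);

# Hypothetical location, for illustration only.
my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/charset_map";

# Build step (for example at install time): write the mappings to disk.
my %build;
tie(%build, 'AnyDBM_File', $file, O_RDWR | O_CREAT, 0644)
    or die "Cannot create DBM file $file: $!";
$build{'marc8:41'} = 'utf8:0041';   # placeholder entry, not real table data
untie %build;

# Run time: open the same file read-only and look up a mapping.
my %map;
tie(%map, 'AnyDBM_File', $file, O_RDONLY, 0644)
    or die "Cannot open DBM file $file: $!";
my $value = $map{'marc8:41'};
untie %map;

# After loading, AnyDBM_File's @ISA holds the one backend it selected.
print "Backend: $AnyDBM_File::ISA[0]\n";
print "Lookup:  $value\n";
```

One caveat worth checking before relying on this: the DBM backends differ in on-disk format and in limits on key and value size, so a file built under one backend is not generally readable by another.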