Re: [Koha-zebra] Re: Unimarc, marc21, Unicode, and MARC::File::XML
On Mon, 20 Mar 2006 10:54:08 -0500 Mike Rylander [EMAIL PROTECTED] wrote:

> Except that Perl doesn't know that the data is already UTF8 ... which is the problem. [...]

You're completely right, I understand the difference. We made UTF-8 work from MySQL, but we didn't try to process the data coming from MySQL. Just select ... and print. So it works, but we are limited when it comes to string processing.

> It's unfortunate that the DBD::mysql people won't fix their module, but there really is a right way to do this, even without their help. Is there a performance penalty with decode()? Yep. Would that go away with a fix to the DBD::mysql module? Mostly, so you really need to bug them.

The problem with decode() is the impact. Adding this step to each string retrieved from MySQL means touching hundreds of lines of code. It's not so hard to modify, but the solution is not /elegant/. Being able to flag data coming from MySQL as UTF-8 to Perl would be the /elegant/ solution, as you said. Maybe we should try harder to get this feature from the DBD::mysql developers.

Thanks for the clarification.

Bye
--
Pierrick LE GALL
INEO media system
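One way to limit the impact of decode() is to decode a whole row in a single place instead of at every string. The following is only a minimal sketch under that assumption: `decode_row` is a hypothetical helper, not part of DBI, DBD::mysql, or Koha, and the hash stands in for a row such as the one a DBI fetchrow_hashref call returns.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode every defined column of a row hash
# from UTF-8 bytes into Perl character strings.
sub decode_row {
    my ($row) = @_;
    my %decoded;
    for my $column (keys %$row) {
        my $value = $row->{$column};
        # NULL columns arrive as undef and are passed through untouched.
        $decoded{$column} = defined $value ? decode('UTF-8', $value) : undef;
    }
    return \%decoded;
}

# Stand-in for a fetched row: "é" stored as the two UTF-8 bytes C3 A9.
my $raw_row = { title => "caf\xC3\xA9", note => undef };
my $row     = decode_row($raw_row);

printf "bytes in: %d, characters out: %d\n",
    length $raw_row->{title}, length $row->{title};
```

Calling such a helper wherever rows are fetched would keep the decoding in one routine, at the cost of the performance penalty mentioned above.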
Re: Unimarc, marc21, Unicode, and MARC::File::XML
Mike Rylander wrote:

> I tested with the record you sent Ed and me, and everything seems to work for me ... As you can see, I tested several variants of the UNIMARC flag, and even tested not sending the encoding to new_from_xml() ... it all seems to work for me, and I'm not sure what problems you're seeing. Perhaps you just needed to set your binmode for the XML source?

Strange, strange... Here is what my script does:

* retrieves the MARC::Record from zebra
* reads some data from MySQL
* builds a page with HTML::Template
* sends the page to the browser

I added 3 lines to save the record to a file after reading it from zebra. Adding binmode(F,':utf8'); before saving my record to F gives me correct UTF-8. Without binmode, it is not OK.

But when I put the MARC::Record into a page built with HTML::Template, it's wrong. The HTML is UTF-8 (HTML page encoding). The page also contains some strings from MySQL, and all the strings from MySQL appear as correct UTF-8, while all the strings coming from the MARC::Record from zebra do not! I can add binmode() to the template output, but then everything goes wrong with the strings from MySQL.

Any suggestions welcome!
--
Paul POULAIN and Henri Damien LAURENT
Independent consultants in free software and library science (http://www.koha-fr.org)
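The symptom can be reproduced without zebra or MySQL. This is a minimal sketch, assuming the MARC strings are Perl character strings and the MySQL strings are undecoded UTF-8 bytes; the variable names are illustrative, and encode() stands in for what a :utf8 output layer writes.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The same letter "é" in the two assumed forms.
my $from_marc  = decode('UTF-8', "\xC3\xA9");  # one character, U+00E9
my $from_mysql = "\xC3\xA9";                   # two undecoded bytes

# Hex dump of the bytes a :utf8 layer would write for each string.
my $marc_out  = unpack 'H*', encode('UTF-8', $from_marc);
my $mysql_out = unpack 'H*', encode('UTF-8', $from_mysql);

print "MARC string written as:  $marc_out\n";   # c3a9, valid UTF-8 for "é"
print "MySQL string written as: $mysql_out\n";  # c383c2a9, "é" encoded twice
```

With the layer enabled, the character string comes out right and the byte string is encoded a second time, which matches the behaviour reported with binmode() on the template output.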
Re: Unimarc, marc21, Unicode, and MARC::File::XML
Mike Rylander wrote:

> On 3/20/06, Paul POULAIN [EMAIL PROTECTED] wrote:
>> Mike Rylander wrote:
>>> I tested with the record you sent Ed and me, and everything seems to work for me ... As you can see, I tested several variants of the UNIMARC flag, and even tested not sending the encoding to new_from_xml() ... it all seems to work for me, and I'm not sure what problems you're seeing. Perhaps you just needed to set your binmode for the XML source?
>>
>> Strange, strange... Here is what my script does:
>> * retrieves the MARC::Record from zebra
>> * reads some data from MySQL
>> * builds a page with HTML::Template
>> * sends the page to the browser
>
> Are you getting XML or binary MARC from zebra?

XML. The test.xml I sent you on Friday was the record returned by $raw = $rs->record(0)->raw();

> Are you using decode_utf8($mysql_string) to let Perl know that the database is UTF8 encoded? IIRC, MySQL doesn't know how to tell Perl about that, and the DBD::mysql maintainer hasn't added that functionality to the module yet.

I thought we had to use decode_utf8($mysql_string), and began to investigate a lot. But after many hours of digging and running into problems, I now have MySQL working in UTF-8 for all of Koha, without any binmode or decode_utf8 ...

And it seems Joshua and Tümer (Turkey) have reached the same conclusion: no more problems between MySQL and Perl. We all use a recent version of MySQL, even though the DBD::mysql maintainer (from mysql.com: Joshua dropped him a mail but got no answer) has done nothing to the CPAN package.
--
Paul POULAIN and Henri Damien LAURENT
Independent consultants in free software and library science (http://www.koha-fr.org)
Re: [Koha-zebra] Re: Unimarc, marc21, Unicode, and MARC::File::XML
Hello Mike,

I'll answer the second question, since I worked with Paul on Perl/MySQL and UTF-8...

On Mon, 20 Mar 2006 09:59:32 -0500 Mike Rylander [EMAIL PROTECTED] wrote:

> Are you using decode_utf8($mysql_string) to let Perl know that the database is UTF8 encoded? IIRC, MySQL doesn't know how to tell Perl about that, and the DBD::mysql maintainer hasn't added that functionality to the module yet.

We don't use decode_utf8. Just after creating the database handle, we force the connection to UTF-8 with a SET NAMES 'UTF8' SQL query. Since we know our data is stored as UTF-8 and we want UTF-8, everything works fine.

Bye
--
Pierrick LE GALL
INEO media system
Re: [Koha-zebra] Re: Unimarc, marc21, Unicode, and MARC::File::XML
On 3/20/06, Pierrick LE GALL [EMAIL PROTECTED] wrote:

> Hello Mike,
>
> I'll answer the second question, since I worked with Paul on Perl/MySQL and UTF-8...
>
> On Mon, 20 Mar 2006 09:59:32 -0500 Mike Rylander [EMAIL PROTECTED] wrote:
>> Are you using decode_utf8($mysql_string) to let Perl know that the database is UTF8 encoded? IIRC, MySQL doesn't know how to tell Perl about that, and the DBD::mysql maintainer hasn't added that functionality to the module yet.
>
> We don't use decode_utf8. Just after creating the database handle, we force the connection to UTF-8 with a SET NAMES 'UTF8' SQL query. Since we know our data is stored as UTF-8 and we want UTF-8, everything works fine.

Except that Perl doesn't know that the data is already UTF8 ... which is the problem. Perl /does/ know that the MARC data is UTF8, and it has to convert one string or the other on output. If you explicitly use binmode() to set the PerlIO state to utf8, then the MARC::Record strings, which are known good UTF8, are not transformed, but the MySQL data, of which Perl has no encoding notions, gets transformed, and thus broken.

The only consistent and correct way to deal with UTF8 data in Perl is to let PerlIO handle it by marking all sources as either providing UTF8 data or not. You can do that with binmode(), open() and several other ways, including this in modern Perls ( http://search.cpan.org/~nwclark/perl-5.8.8/lib/open.pm ). Because DBD::mysql doesn't give you a way to mark its socket as UTF8, you need to be a little underhanded and tell Perl as soon as possible using decode(), or by making utf8 the default mode for all PerlIO channels.

There really isn't any way around this if you want to claim real UTF8 support and be able to use components that really do support UTF8 natively, like MARC::File::XML and MARC::Record.

It's unfortunate that the DBD::mysql people won't fix their module, but there really is a right way to do this, even without their help. Is there a performance penalty with decode()? Yep. Would that go away with a fix to the DBD::mysql module? Mostly, so you really need to bug them.

> Bye
> --
> Pierrick LE GALL
> INEO media system

--
Mike Rylander [EMAIL PROTECTED]
GPLS -- PINES Development
Database Developer
http://open-ils.org
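The approach described above (decode at the input boundary, then let a single PerlIO layer encode on output) can be sketched with core modules only. This is an illustration under stated assumptions, not code from Koha: the two variables stand in for a MARC::Record string and a MySQL result, and an in-memory filehandle stands in for STDOUT so the written bytes can be inspected.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $from_marc  = "\x{E9}";                     # "é" as a Perl character string
my $from_mysql = decode('UTF-8', "\xC3\xA9");  # "é" bytes decoded as soon as they arrive

# Every string is now characters, so one encoding layer on the
# output handle is enough for both sources.
my $buffer = '';
open my $out, '>:encoding(UTF-8)', \$buffer
    or die "cannot open in-memory handle: $!";
print {$out} $from_marc, $from_mysql;
close $out;

my $written = unpack 'H*', $buffer;
print "$written\n";   # c3a9c3a9: "é" twice, each encoded exactly once
```

For a real script the equivalent of the in-memory handle would be binmode(STDOUT, ':encoding(UTF-8)') or the open pragma linked above.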
Re: Unimarc, marc21, Unicode, and MARC::File::XML
I tested with the record you sent Ed and me, and everything seems to work for me ... comparing the preprocessed XML with a copy that I round-tripped through MARC::Record and MARC::File::XML, they look the same. Here's my little test script (unimarc-test.pl):

--
    #!/usr/bin/perl
    # USAGE: ./unimarc-test.pl test.xml
    use MARC::Record;
    #use MARC::File::XML ( RecordFormat => 'UNIMARC' );
    use MARC::File::XML;
    MARC::File::XML->default_record_format('UNIMARC');
    binmode(STDIN,':utf8');
    binmode(STDOUT,':utf8');
    $/ = undef;
    my $xml = <>;
    my $r = MARC::Record->new_from_xml($xml); # ,'utf8'); #,'UNIMARC');
    print $r->as_xml(); #'UNIMARC');
    print $r->as_usmarc();
    __END__
--

As you can see, I tested several variants of the UNIMARC flag, and even tested not sending the encoding to new_from_xml() ... it all seems to work for me, and I'm not sure what problems you're seeing. Perhaps you just needed to set your binmode for the XML source?

-miker

On 3/17/06, Paul POULAIN [EMAIL PROTECTED] wrote:

> Mike Rylander wrote:
>> CVS checkout instructions
>>
>> cvs -d:pserver:[EMAIL PROTECTED]:/cvsroot/marcpm login
>> cvs -z3 -d:pserver:[EMAIL PROTECTED]:/cvsroot/marcpm co -P marc-xml
>>
>> Then,
>>
>> cd marc-xml
>> perl Makefile.PL
>> make
>> make test
>>
>> And assuming 'make test' succeeds ...
>>
>> make install
>
> I updated MARC::Record (2.000), MARC::Charset (0.920) and MARC::File::XML (0.810 ???) from sourceforge. I tested with a UNIMARC XML file (which I've sent to you by private mail), without any change. What did I do wrong?
> --
> Paul POULAIN and Henri Damien LAURENT
> Independent consultants in free software and library science (http://www.koha-fr.org)

--
Mike Rylander [EMAIL PROTECTED]
GPLS -- PINES Development
Database Developer
http://open-ils.org
Unimarc, marc21, Unicode, and MARC::File::XML
Hello all,

Still working on Unicode in Koha. We are stuck with a not-so-nice problem. (Many, many thanks to the librarians who wrote the MARC21 and UNIMARC standards...)

Let me explain:

yesterday:
  joshua: the new MARC::File::XML works fine with utf8 now.
  me: Great! I'll give it a try.
today:
  me: oh non, ca ne marche pas (in English: hey, it doesn't work...)

My XML (coming from zebra) is UTF-8, but the MARC::Record I get after my $record = MARC::Record->new_from_xml($raw, 'utf8'); is MARC-8...

One hour later, Joshua woke up, like most Americans, and we began digging on the #koha IRC channel. One hour after that, the problem was identified.

PROBLEM:
* in MARC21, the encoding is defined by position 9 of the leader. 'a' means UTF-8
* in UNIMARC, this is an empty position! The encoding is in positions 26-27 and 28-29 of 100$a (200 are all fixed coded fields in unimarc : http://bibliotheque.bgp-fr.com/Unimarc_abrege.pdf, page 8 for 100$a)

BIG PROBLEM: MARC::File::XML only checks position 9, assuming the XML is necessarily a MARC21 file.

I think (and Joshua agrees) we will have to hack MARC::File::XML to solve this problem. We have 2 solutions:
* add a test to determine whether we are in UNIMARC or MARC21. In UNIMARC, the title is in 200, while 200 is empty in MARC21.
* add a parameter to ->new_as_xml($xml,'UTF-8','UNIMARC') to specify that we are sending the parser a UNIMARC file.

Ed et al., let me know what you think, thanks.
--
Paul POULAIN and Henri Damien LAURENT
Independent consultants in free software and library science (http://www.koha-fr.org)
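The difference between the two formats comes down to where the encoding indicator is read from. Here is a minimal sketch of that lookup on plain strings; `encoding_indicator` is a hypothetical helper (not MARC::File::XML code), and the sample leader and 100$a values are made up so that only the positions named above carry meaning.

```perl
use strict;
use warnings;

# Hypothetical helper: return the raw encoding indicator for a record.
sub encoding_indicator {
    my ($format, $leader, $field_100a) = @_;
    if ($format eq 'MARC21') {
        return substr($leader, 9, 1);       # leader position 9: 'a' means UTF-8
    }
    if ($format eq 'UNIMARC') {
        return substr($field_100a, 26, 2);  # 100$a positions 26-27
    }
    die "unknown record format: $format\n";
}

# Made-up sample data: a 24-character leader with 'a' at position 9,
# and a 36-character 100$a with filler everywhere except positions 26-27.
my $marc21_leader = '00000nam a2200000 a 4500';
my $unimarc_100a  = ('#' x 26) . '50' . ('#' x 8);

my $marc21_code  = encoding_indicator('MARC21',  $marc21_leader, undef);
my $unimarc_code = encoding_indicator('UNIMARC', '',             $unimarc_100a);

print "MARC21 leader/09:     $marc21_code\n";
print "UNIMARC 100\$a/26-27: $unimarc_code\n";
```

Passing the format in explicitly corresponds to the second solution; the first solution would replace the `$format` argument with a guess based on the presence of a 200 field.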
Re: Unimarc, marc21, Unicode, and MARC::File::XML
On 3/16/06, Mike Rylander [EMAIL PROTECTED] wrote:

> Will some brave soul please test this with some UNIMARC records and let me know how it goes?

Yes please. Joshua and Paul, add the test to the test suite if possible.

miker_++

//Ed
Re: Unimarc, marc21, Unicode, and MARC::File::XML
Mea culpa ... read on. :)

On 3/16/06, Mike Rylander [EMAIL PROTECTED] wrote:

> I've updated the cvs for MARC::File::XML with what I described below, with one caveat. The one difference from what I was planning is that, because as_xml() is generated by MARC::Record, I can't give it new parameters. To test exporting to XML you'll need to set the record format for export either in the use line for the module or using the default_record_format() class method. Just call that with 'UNIMARC' as the parameter and then export your record as normal using as_xml() on the MARC::Record object.

It seems that I am either blind or insane ... I do have access to as_xml(), and I did in fact add the format option to it. Sorry for the confusion. :) I'm updating the POD now, and adding a new method to export XML without a collection wrapper.

> (new_from_xml() does not suffer from this as that method is defined in MARC/File/XML.pm, so it takes both an encoding parameter and a format parameter, as explained in the documentation.)
>
> Will some brave soul please test this with some UNIMARC records and let me know how it goes?
>
> ---
> CVS checkout instructions
>
> cvs -d:pserver:[EMAIL PROTECTED]:/cvsroot/marcpm login
> cvs -z3 -d:pserver:[EMAIL PROTECTED]:/cvsroot/marcpm co -P marc-xml
>
> Then,
>
> cd marc-xml
> perl Makefile.PL
> make
> make test
>
> And assuming 'make test' succeeds ...
>
> make install
> ---
>
> Thanks in advance,
> --miker
>
> On 3/16/06, Mike Rylander [EMAIL PROTECTED] wrote:
>> I've been attempting to beat the MARC::File::XML stuff into a usable shape as of late, so I'm going to take a stab at fixing this. There will be some limitations (at first) as to what encodings we'll accept for UNIMARC records, but I'll cover the cases that I know about (and understand).
>>
>> Here's the plan: I will add a use flag to set the script-wide default for record format
>>
>> use MARC::File::XML ( RecordFormat => 'UNIMARC' );
>>
>> that will default to MARC21. There will also be a class method to set this flag
>>
>> MARC::File::XML->default_record_format( 'UNIMARC' );
>>
>> and, finally, a flag to both as_xml and new_from_xml to tell MARC::File::XML about individual records.
>>
>> I don't think, at this point, we should autodetect based on the existence of a 200 tag, as I'd like to stay away from heuristics if it can be avoided. If others disagree, please make the case!
>>
>> When processing a UNIMARC record, I'll look in 100$a for the encoding, and proceed if it's either 01 (iso646 -- nominally compatible with iso8859, though it requires interpretation) or 50 (UNICODE, which will always mean UTF8 in XML produced by MARC::File::XML). If it's anything else an error will be thrown. We can add support for other encodings as the direct need arises.
>>
>> For UNIMARC/UNICODE, the XML is obviously going to be UTF-8 encoded. For UNIMARC/ISO646, the XML will be marked as ISO-8859-1. Yes, it's a bit of a fib, but most XML parsers don't support ISO646, and most do support LATIN1 (8859-1), and the bytes won't get mangled by the parser in that case.
>>
>> Comments?
>>
>> On 3/16/06, Zeno Tajoli [EMAIL PROTECTED] wrote:
>>> Hi,
>>>
>>>> PROBLEM:
>>>> * in MARC21, the encoding is defined by position 9 of the leader. 'a' means UTF-8
>>>> * in UNIMARC, this is an empty position! The encoding is in positions 26-27 and 28-29 of 100$a (200 are all fixed coded fields in unimarc : http://bibliotheque.bgp-fr.com/Unimarc_abrege.pdf, page 8 for 100$a)
>>>>
>>>> BIG PROBLEM: MARC::File::XML only checks position 9, assuming the XML is necessarily a MARC21 file.
>>>>
>>>> I think (and Joshua agrees) we will have to hack MARC::File::XML to solve this problem. We have 2 solutions:
>>>> * add a test to determine whether we are in UNIMARC or MARC21. In UNIMARC, the title is in 200, while 200 is empty in MARC21.
>>>> * add a parameter to ->new_as_xml($xml,'UTF-8','UNIMARC') to specify that we are sending the parser a UNIMARC file.
>>>
>>> As a person who has written a UNIMARC - MARC21 converter, I prefer the second solution.
>>>
>>> Thanks for everything
>>> Bye
>>>
>>> Zeno Tajoli
>>> CILEA - Segrate (MI)
>>> tajoliAT_SPAM_no_prendiATcilea.it
>>> (Address masked against spam; replace what is between the ATs with @)

--
Mike Rylander [EMAIL PROTECTED]
GPLS -- PINES Development
Database Developer
http://open-ils.org
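The encoding rule in the plan quoted above (accept 01 or 50, throw an error otherwise, and declare ISO 646 records as ISO-8859-1) can be sketched as a small lookup. This is an illustration of the stated rule only, not the code committed to MARC::File::XML; the function and hash names are hypothetical.

```perl
use strict;
use warnings;

# Character-set codes from 100$a mapped to the encoding to be named
# in the XML declaration, following the plan quoted above.
my %xml_encoding_for = (
    '01' => 'ISO-8859-1',  # ISO 646, declared as Latin-1 so XML parsers accept it
    '50' => 'UTF-8',       # Unicode
);

# Hypothetical helper: look up the XML encoding or die on an unsupported code.
sub xml_encoding_for_unimarc {
    my ($charset_code) = @_;
    die "unsupported UNIMARC character set code: $charset_code\n"
        unless exists $xml_encoding_for{$charset_code};
    return $xml_encoding_for{$charset_code};
}

for my $code ('01', '50', '03') {
    my $encoding = eval { xml_encoding_for_unimarc($code) };
    print defined $encoding ? "$code -> $encoding\n" : "$code -> error: $@";
}
```

Keeping the mapping in one hash means support for further codes can be added "as the direct need arises" by adding an entry.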