Re: [Koha-zebra] Re: Unimarc, marc21, Unicode, and MARC::File::XML

2006-03-21 Thread Pierrick LE GALL
On Mon, 20 Mar 2006 10:54:08 -0500
Mike Rylander [EMAIL PROTECTED] wrote:

> Except that Perl doesn't know that the data is already UTF8 ... which
> is the problem. [...]

You're completely right, I understand the difference. We made UTF8 work
from MySQL, but we didn't try to process the data coming from MySQL. We
just select ... and print. So it works, but we are limited when it comes
to string processing.

> It's unfortunate that the DBD::mysql people won't fix their module,
> but there really is a right way to do this, even without their help.
> Is there a performance penalty with decode()?  Yep.  Would that go
> away with a fix to the DBD::mysql module?  Mostly, so you really need
> to bug them.

The problem with decode() is the impact. Adding this step to each
string retrieved from MySQL means changing hundreds of lines of code.
It's not hard to do, but the solution is not /elegant/. Being able to
flag data coming from MySQL as UTF8 for Perl would be the /elegant/
solution, as you said. Maybe we should try harder to get this feature
from the DBD::mysql developers.
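
One way to limit that impact, sketched here under assumptions, is to decode
a whole fetched row in one place instead of wrapping every string. The
helper name decode_row and the column names are hypothetical, and the
hashref only stands in for what something like fetchrow_hashref would
return; no database is involved in the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode every defined value of a fetched row in place,
# so that callers do not have to wrap each column individually.
sub decode_row {
    my ($row) = @_;
    for my $value (values %$row) {
        # values() returns aliases, so assigning here updates the hash.
        $value = decode('UTF-8', $value) if defined $value;
    }
    return $row;
}

# Stand-in for a row coming back from the database as raw UTF-8 bytes;
# the column names are made up for the example.
my $row = decode_row({ title => "Biblioth\xC3\xA8que", notes => undef });

printf "title is %d characters long\n", length $row->{title};
```

Such a helper must only be applied once per row: decoding a string that has
already been decoded would corrupt it.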

Thanks for the clarifications.

Bye

-- 
Pierrick LE GALL
INEO media system


Re: Unimarc, marc21, Unicode, and MARC::File::XML

2006-03-20 Thread Paul POULAIN

Mike Rylander wrote:

> I tested with the record you sent Ed and me, and everything seems to
> work for me ...
> As you can see, I tested several variants of the UNIMARC flag, and
> even tested not sending the encoding to new_from_xml() ... it all
> seems to work for me, and I'm not sure what problems you're seeing.
> Perhaps you just needed to set your binmode for the XML source?


strange, strange...

What my script does:
* retrieves the MARC::Record from zebra
* reads some data from MySQL
* builds a page with HTML::Template
* sends the page to the browser

I added 3 lines to save the record in a file after reading it from zebra.
Adding binmode(F,':utf8');

before saving my record in F gives me correct UTF-8.
Without binmode, it's not OK.

But when I put the MARC::Record in a page built with HTML::Template,
it's wrong.

The HTML is UTF-8 (html page encoding).
It also contains some strings from MySQL, and all strings from MySQL
appear as correct UTF-8, while all strings coming from the MARC::Record
coming from zebra do not!


I can add binmode() to the template output, but then everything goes
wrong with the strings from MySQL.


Any suggestions welcome!
--
Paul POULAIN et Henri Damien LAURENT
Consultants indépendants
en logiciels libres et bibliothéconomie (http://www.koha-fr.org)


Re: Unimarc, marc21, Unicode, and MARC::File::XML

2006-03-20 Thread Paul POULAIN

Mike Rylander wrote:

> On 3/20/06, Paul POULAIN [EMAIL PROTECTED] wrote:
>
>> Mike Rylander wrote:
>>
>>> I tested with the record you sent Ed and me, and everything seems to
>>> work for me ...
>>> As you can see, I tested several variants of the UNIMARC flag, and
>>> even tested not sending the encoding to new_from_xml() ... it all
>>> seems to work for me, and I'm not sure what problems you're seeing.
>>> Perhaps you just needed to set your binmode for the XML source?
>>
>> strange, strange...
>>
>> What my script does:
>> * retrieves the MARC::Record from zebra
>> * reads some data from MySQL
>> * builds a page with HTML::Template
>> * sends the page to the browser
>
> Are you getting XML or binary MARC from zebra?

XML. The test.xml I sent you on Friday was the
$raw = $rs->record(0)->raw();
record.

> Are you using decode_utf8($mysql_string) to let Perl know that the
> database is UTF8 encoded?  IIRC, MySQL doesn't know how to tell Perl
> about that, and the DBD::mysql maintainer hasn't added that
> functionality to the module yet.


I thought we had to decode_utf8($mysql_string), and began to investigate
a lot. But after many hours of digging & running into problems, I now
have a working MySQL in UTF-8 for all of Koha,

without any binmode or decode_utf8 ...
And it seems joshua & Tümer (Turkey) have reached the same conclusion:
no more problems with MySQL & Perl.
We all use a recent version of MySQL, even though the DBD::mysql
maintainer (from mysql.com: joshua sent him a mail but got no answer)
has done nothing on the CPAN package.


--
Paul POULAIN et Henri Damien LAURENT
Consultants indépendants
en logiciels libres et bibliothéconomie (http://www.koha-fr.org)


Re: [Koha-zebra] Re: Unimarc, marc21, Unicode, and MARC::File::XML

2006-03-20 Thread Pierrick LE GALL
Hello Mike,

I'll answer the second question, since I worked with Paul on
Perl/MySQL and UTF-8...

On Mon, 20 Mar 2006 09:59:32 -0500
Mike Rylander [EMAIL PROTECTED] wrote:

> Are you using decode_utf8($mysql_string) to let Perl know that the
> database is UTF8 encoded?  IIRC, MySQL doesn't know how to tell Perl
> about that, and the DBD::mysql maintainer hasn't added that
> functionality to the module yet.

We don't use decode_utf8. Right after creating the database handle, we
force the connection to UTF-8 with a SET NAMES 'UTF8' SQL query. Since
we know our data is stored as UTF-8 and we want UTF-8 out, everything
works fine.

Bye

-- 
Pierrick LE GALL
INEO media system


Re: [Koha-zebra] Re: Unimarc, marc21, Unicode, and MARC::File::XML

2006-03-20 Thread Mike Rylander
On 3/20/06, Pierrick LE GALL [EMAIL PROTECTED] wrote:
> Hello Mike,
>
> I'll answer the second question, since I worked with Paul on
> Perl/MySQL and UTF-8...
>
> On Mon, 20 Mar 2006 09:59:32 -0500
> Mike Rylander [EMAIL PROTECTED] wrote:
>
>> Are you using decode_utf8($mysql_string) to let Perl know that the
>> database is UTF8 encoded?  IIRC, MySQL doesn't know how to tell Perl
>> about that, and the DBD::mysql maintainer hasn't added that
>> functionality to the module yet.
>
> We don't use decode_utf8. Right after creating the database handle, we
> force the connection to UTF-8 with a SET NAMES 'UTF8' SQL query. Since
> we know our data is stored as UTF-8 and we want UTF-8 out, everything
> works fine.


Except that Perl doesn't know that the data is already UTF8 ... which
is the problem.  Perl /does/ know that the MARC data is UTF8, and it
has to convert one string or the other on output.  If you explicitly
use binmode() to set the PerlIO state to utf8, then the MARC::Record
strings, which are known good UTF8, are not transformed, but the MySQL
data, of which Perl has no encoding notions, gets transformed, and
thus broken.

The only consistent and correct way to deal with UTF8 data in perl is
to let PerlIO handle it by marking all sources as either providing
UTF8 data or not.  You can do that with binmode(), open() and several
other ways, including this in modern Perls (
http://search.cpan.org/~nwclark/perl-5.8.8/lib/open.pm ).  Because
DBD::mysql doesn't give you a way to mark its socket as UTF8, you need
to be a little underhanded and tell Perl as soon as possible using
decode(), or by making utf8 the default mode for all PerlIO channels. 
There really isn't any way around this if you want to claim real UTF8
support and be able to use components that really do support UTF8
natively, like MARC::File::XML and MARC::Record.
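
A minimal sketch of the effect described above, using only the core Encode
module. The two sample strings are assumptions standing in for a value
fetched from MySQL (raw UTF-8 bytes) and a value from MARC::Record (a Perl
character string); encode() is used here to make explicit what a :utf8
output layer does to a string Perl believes to be plain bytes.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw bytes, as a driver without UTF-8 support would hand them back:
# the UTF-8 encoding of "e acute" (0xC3 0xA9), seen by Perl as 2 characters.
my $from_db = "\xC3\xA9";

# A character string, as a UTF-8 aware module would return it.
my $from_marc = "\x{E9}";

# Encoding the undecoded bytes again is what breaks the MySQL data:
# each byte is treated as a character and encoded a second time.
my $mangled = encode('UTF-8', $from_db);   # 4 bytes: C3 83 C2 A9

# Decoding first turns the two bytes into one character, after which the
# database string and the MARC string compare equal.
my $decoded = decode('UTF-8', $from_db);

printf "raw length:     %d\n", length $from_db;
printf "decoded length: %d\n", length $decoded;
printf "same as MARC:   %s\n", ($decoded eq $from_marc ? 'yes' : 'no');
printf "mangled bytes:  %s\n",
    join ' ', map { sprintf '%02X', ord } split //, $mangled;
```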

It's unfortunate that the DBD::mysql people won't fix their module,
but there really is a right way to do this, even without their help. 
Is there a performance penalty with decode()?  Yep.  Would that go
away with a fix to the DBD::mysql module?  Mostly, so you really need
to bug them.

> Bye
>
> --
> Pierrick LE GALL
> INEO media system



--
Mike Rylander
[EMAIL PROTECTED]
GPLS -- PINES Development
Database Developer
http://open-ils.org


Re: Unimarc, marc21, Unicode, and MARC::File::XML

2006-03-17 Thread Mike Rylander
I tested with the record you sent Ed and me, and everything seems to
work for me ... comparing the preprocessed XML with a copy that I
round-tripped through MARC::Record and MARC::File::XML, they look the
same.  Here's my little test script (unimarc-test.pl):

--
#!/usr/bin/perl

# USAGE: ./unimarc-test.pl < test.xml

use MARC::Record;
#use MARC::File::XML ( RecordFormat => 'UNIMARC' );
use MARC::File::XML;

MARC::File::XML->default_record_format('UNIMARC');

binmode(STDIN,':utf8');
binmode(STDOUT,':utf8');

$/ = undef;

my $xml = <STDIN>;

my $r = MARC::Record->new_from_xml($xml); # ,'utf8'); #,'UNIMARC');

print $r->as_xml(); #'UNIMARC');
print $r->as_usmarc();

__END__

As you can see, I tested several variants of the UNIMARC flag, and
even tested not sending the encoding to new_from_xml() ... it all
seems to work for me, and I'm not sure what problems you're seeing. 
Perhaps you just needed to set your binmode for the XML source?

-miker

On 3/17/06, Paul POULAIN [EMAIL PROTECTED] wrote:
> Mike Rylander wrote:
>> CVS checkout instructions
>>   cvs -d:pserver:[EMAIL PROTECTED]:/cvsroot/marcpm login
>>   cvs -z3 -d:pserver:[EMAIL PROTECTED]:/cvsroot/marcpm co -P marc-xml
>>
>> Then,
>>   cd marc-xml
>>   perl Makefile.PL
>>   make
>>   make test
>>
>> And assuming 'make test' succeeds ...
>>   make install
>
> I updated MARC::Record (2.000), MARC::Charset (0.920) and
> MARC::File::XML (0.810 ???) from sourceforge.
>
> I tested with a UNIMARC XML file (that I sent you by private message),
> with no change in the result.
> What did I do wrong?
> --
> Paul POULAIN et Henri Damien LAURENT
> Consultants indépendants
> en logiciels libres et bibliothéconomie (http://www.koha-fr.org)



--
Mike Rylander
[EMAIL PROTECTED]
GPLS -- PINES Development
Database Developer
http://open-ils.org


Unimarc, marc21, Unicode, and MARC::File::XML

2006-03-16 Thread Paul POULAIN

Hello all,

Still working on UNICODE in Koha.

We are stuck with a not-so-nice problem. (Many many thanks to the 
librarians that wrote marc21 and unimarc standards...)


Let me explain:

yesterday :
joshua : the new MARC::File::XML works fine with utf8 now.
me : Great ! I'll give it a try

today :
me : oh non, ca ne marche pas (in english : hey, it doesn't work...)

my XML (coming from zebra) is utf-8, but the MARC::Record after
my $record = MARC::Record->new_from_xml($raw, 'utf8');
is marc8...

1 hour later, joshua woke up, like most Americans, and we began digging on
#koha IRC.

1 hour later the problem was identified :
PROBLEM :
* in MARC21, the encoding is defined by position 9 of the leader. 'a' 
means UTF-8
* in UNIMARC, this is an empty position ! the encoding is in positions
26-27 and 28-29 of 100$a (1xx are all fixed-length coded fields in UNIMARC :
http://bibliotheque.bgp-fr.com/Unimarc_abrege.pdf, page 8 for 100$a)
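
The two locations can be sketched as follows. This is an illustration only,
not the code of MARC::File::XML: the function name encoding_indicator is
hypothetical, and the leader and 100$a values are made-up samples assembled
piece by piece so the offsets are easy to check.

```perl
use strict;
use warnings;

# Sketch only: return the raw encoding indicator of a record, given the
# format name, the 24-character leader and (for UNIMARC) the 100$a string.
sub encoding_indicator {
    my ($format, $leader, $field_100a) = @_;

    # MARC21: a single character at leader position 9 ('a' means UTF-8).
    return substr($leader, 9, 1) if $format eq 'MARC21';

    # UNIMARC: the first two-character character set code, positions 26-27.
    return substr($field_100a, 26, 2);
}

# Sample MARC21 leader: 9 characters, then 'a' at position 9, then the rest.
my $leader = join '', '00000nam ', 'a', '2200000 a 4500';

# Sample UNIMARC 100$a: 26 characters of dates and codes, then the
# character set codes (26-27 and 28-29), then the remaining positions.
my $field_100a = join '',
    '20060316', 'd', '2006', '    ', 'u  ', 'y', '0', 'fre', 'y',  # 0-25
    '50', '  ',                                                   # 26-29
    '    ', 'ba';                                                 # 30-35

printf "MARC21:  '%s'\n", encoding_indicator('MARC21',  $leader);
printf "UNIMARC: '%s'\n", encoding_indicator('UNIMARC', $leader, $field_100a);
```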


BIG PROBLEM :
MARC::File::XML only checks position 9, assuming the XML is
necessarily a MARC21 file.


I think (and joshua agrees) we will have to hack MARC::File::XML to solve
this problem.

We have 2 solutions :
* add a test to determine whether we are UNIMARC or MARC21. In UNIMARC,
the title is in 200, while 200 is empty in MARC21.
* add a parameter, as in ->new_from_xml($xml,'UTF-8','UNIMARC'), to specify
that we are sending the parser a UNIMARC file.


Ed et al., let me know what you think, thanks.
--
Paul POULAIN et Henri Damien LAURENT
Consultants indépendants
en logiciels libres et bibliothéconomie (http://www.koha-fr.org)


Re: Unimarc, marc21, Unicode, and MARC::File::XML

2006-03-16 Thread Ed Summers
On 3/16/06, Mike Rylander [EMAIL PROTECTED] wrote:
> Will some brave soul please test this with some UNIMARC records and
> let me know how it goes?

Yes please, add the test to the test suite if possible Joshua and Paul.

miker_++

//Ed


Re: Unimarc, marc21, Unicode, and MARC::File::XML

2006-03-16 Thread Mike Rylander
Mea culpa ... read on. :)

On 3/16/06, Mike Rylander [EMAIL PROTECTED] wrote:
> I've updated the cvs for MARC::File::XML with what I described below,
> with one caveat.  The one difference from what I was planning is that,
> because as_xml() is generated by MARC::Record, I can't give it new
> parameters.  To test exporting to XML you'll need to set the record
> format for export either in the use line for the module or using the
> default_record_format() class method.  Just call that with 'UNIMARC'
> as the parameter and then export your record as normal using as_xml()
> on the MARC::Record object.

It seems that I am either blind or insane ... I do have access to
as_xml(), and I did in fact add the format option to it.  Sorry for
the confusion. :)

I'm updating the POD now, and adding a new method to export XML
without a collection wrapper.


> (new_from_xml() does not suffer from this as that method is defined in
> MARC/File/XML.pm, so it takes both an encoding parameter and a format
> parameter, as explained in the documentation.)
>
> Will some brave soul please test this with some UNIMARC records and
> let me know how it goes?
>
> ---
>
> CVS checkout instructions
>   cvs -d:pserver:[EMAIL PROTECTED]:/cvsroot/marcpm login
>   cvs -z3 -d:pserver:[EMAIL PROTECTED]:/cvsroot/marcpm co -P marc-xml
>
> Then,
>   cd marc-xml
>   perl Makefile.PL
>   make
>   make test
>
> And assuming 'make test' succeeds ...
>   make install
>
> ---
>
> Thanks in advance,
>
> --miker

> On 3/16/06, Mike Rylander [EMAIL PROTECTED] wrote:
>> I've been attempting to beat the MARC::File::XML stuff into a usable
>> shape as of late, so I'm going to take a stab at fixing this.  There
>> will be some limitations (at first) as to what encodings we'll accept
>> for UNIMARC records, but I'll cover the cases that I know about (and
>> understand).
>>
>> Here's the plan:
>>
>> I will add a use flag to set the script-wide default for record format
>>
>>   use MARC::File::XML ( RecordFormat => 'UNIMARC' );
>>
>> that will default to MARC21.  There will also be a class method to set
>> this flag
>>
>>   MARC::File::XML->default_record_format( 'UNIMARC' );
>>
>> and, finally, a flag to both as_xml and new_from_xml to tell
>> MARC::File::XML about individual records.  I don't think, at this
>> point, we should autodetect based on the existence of a 200 tag, as
>> I'd like to stay away from heuristics if it can be avoided.  If others
>> disagree, please make the case!
>>
>> When processing a UNIMARC record, I'll look in 100$a for the encoding,
>> and proceed if it's either 01 (iso646 -- nominally compatible with
>> iso8859, though it requires interpretation) or 50 (UNICODE, which will
>> always mean UTF8 in XML produced by MARC::File::XML).  If it's
>> anything else an error will be thrown.  We can add support for other
>> encodings as the direct need arises.
>>
>> For UNIMARC/UNICODE, the XML is obviously going to be UTF-8 encoded.
>> For UNIMARC/ISO646, the XML will be marked as ISO-8859-1.  Yes, it's a
>> bit of a fib, but most XML parsers don't support ISO646, and most do
>> support LATIN1 (8859-1), and the bytes won't get mangled by the parser
>> in that case.
>>
>> Comments?
 
>> On 3/16/06, Zeno Tajoli [EMAIL PROTECTED] wrote:
>>> Hi,
>>>
>>>> PROBLEM :
>>>> * in MARC21, the encoding is defined by position 9 of the leader.
>>>> 'a' means UTF-8
>>>> * in UNIMARC, this is an empty position ! the encoding is in
>>>> positions 26-27 and 28-29 of 100$a (1xx are all fixed-length coded
>>>> fields in UNIMARC : http://bibliotheque.bgp-fr.com/Unimarc_abrege.pdf,
>>>> page 8 for 100$a)
>>>>
>>>> BIG PROBLEM :
>>>> MARC::File::XML only checks position 9, assuming the XML is
>>>> necessarily a MARC21 file.
>>>>
>>>> I think (and joshua agrees) we will have to hack MARC::File::XML to
>>>> solve this problem.
>>>> We have 2 solutions :
>>>> * add a test to determine whether we are UNIMARC or MARC21. In UNIMARC,
>>>> the title is in 200, while 200 is empty in MARC21.
>>>> * add a parameter, as in ->new_from_xml($xml,'UTF-8','UNIMARC'), to
>>>> specify that we are sending the parser a UNIMARC file.
>>>
>>> as a person who has written a UNIMARC - MARC21 converter, I prefer
>>> the second solution.
>>>
>>> Thanks for all
>>> Bye
>>>
>>> Zeno Tajoli
>>> CILEA - Segrate (MI)
>>> tajoliAT_SPAM_no_prendiATcilea.it
>>> (Address masked against spam; replace everything between the ATs with @)
>>
>> --
>> Mike Rylander
>> [EMAIL PROTECTED]
>> GPLS -- PINES Development
>> Database Developer
>> http://open-ils.org
>
> --
> Mike Rylander
> [EMAIL PROTECTED]
> GPLS -- PINES Development
> Database Developer
> http://open-ils.org



--
Mike Rylander
[EMAIL PROTECTED]
GPLS -- PINES Development
Database Developer
http://open-ils.org