I want to use your XML Perl module (1.5.1) to process XML documents
written in Czech (diploma theses of my students). These documents will be
written in the ISO Latin 2 (Linux) or CP 1250 (Windows) encoding, and I want to
convert them to UTF-8 before processing.
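For illustration, the conversion step described above could be sketched as follows. This is only a sketch under assumptions: it relies on the Encode module, which is a core module from Perl 5.8 onward and is not part of Perl 5.6.0, and the helper name to_utf8 is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: re-encode a byte string from a legacy Czech
# encoding ('iso-8859-2' or 'cp1250') into UTF-8 bytes.
sub to_utf8 {
    my ($bytes, $source_encoding) = @_;
    my $chars = decode($source_encoding, $bytes);  # bytes -> Perl characters
    return encode('UTF-8', $chars);                # characters -> UTF-8 bytes
}

# "lexem" with e-acute as ISO Latin 2 bytes: e-acute is the single byte 0xE9.
my $latin2 = "lex\xE9m";
my $utf8   = to_utf8($latin2, 'iso-8859-2');

# prints: 5 bytes in, 6 bytes out
printf "%d bytes in, %d bytes out\n", length($latin2), length($utf8);
```

If a document carries an explicit encoding declaration in its XML prolog, that declaration would have to be changed to UTF-8 as well.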
I have tried your XML module, and under standard conditions (Linux RH 7.2,
locale = cs_CZ) everything works: all strings were converted from UTF-8 to
ISO Latin 2 (8859-2) without error. Unfortunately, I need the output in UTF-8.
When I tried the locale cs_CZ.utf8 with the utf8 option in Perl 5.6.0, the
output was in UTF-8, but strings containing multibyte UTF-8 characters (2 bytes
each for Czech) are truncated at the end, probably by one character for each
multibyte character in the string, i.e. the size of the output string in bytes
equals its length in characters.
I do not know where the bug is, because I have not tested UTF-8 support in Perl
5.6.0 adequately (core functions such as "length" seem OK; locale-dependent
ones such as "uc" do not).
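The suspected behaviour can be mimicked with a minimal sketch. It does not call the XML module at all; it only imitates the assumed mistake of using the character count as the byte count when the string is written out. It assumes the Encode module (core from Perl 5.8 onward), and the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8 bytes of "lexem" with e-acute: e-acute is the two bytes 0xC3 0xA9.
my $utf8_bytes = "lex\xC3\xA9m";

my $byte_len = length($utf8_bytes);                   # 6 bytes
my $char_len = length(decode('UTF-8', $utf8_bytes));  # 5 characters

# Suspected bug: the character count is used as the byte count,
# so one trailing byte is lost per two-byte character.
my $truncated = substr($utf8_bytes, 0, $char_len);

printf "bytes=%d chars=%d truncated=%s\n", $byte_len, $char_len, $truncated;
```

Here the truncated string keeps the e-acute but loses the final "m", which is the kind of shortening seen in element and attribute names in the output.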
Can you help me, please?
Jiri Fiser
UJEP University, Pedagogical faculty in Usti nad Labem
Czech Republic
Platform info (perl -V):
Summary of my perl5 (revision 5.0 version 6 subversion 0) configuration:
Platform:
osname=linux, osvers=2.2.17-8smp, archname=i386-linux
uname='linux porky.devel.redhat.com 2.2.17-8smp #1 smp fri nov 17
16:12:17 est 2000 i686 unknow
config_args='-des -Doptimize=-O2 -march=i386 -mcpu=i686 -Dcc=gcc
-Dcccdlflags=-fPIC -Dinstallpr
hint=recommended, useposix=true, d_sigaction=define
usethreads=undef use5005threads=undef useithreads=undef
usemultiplicity=undef
useperlio=undef d_sfio=undef uselargefiles=undef
use64bitint=undef use64bitall=undef uselongdouble=undef usesocks=undef
Compiler:
cc='gcc', optimize='-O2 -march=i386 -mcpu=i686', gccversion=2.96
20000731 (Red Hat Linux 7.1 2.
cppflags='-fno-strict-aliasing'
ccflags ='-fno-strict-aliasing'
stdchar='char', d_stdstdio=define, usevfork=false
intsize=4, longsize=4, ptrsize=4, doublesize=8
d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t',
lseeksize=4
alignbytes=4, usemymalloc=n, prototype=define
Linker and Libraries:
ld='gcc', ldflags =' -L/usr/local/lib'
libpth=/usr/local/lib /lib /usr/lib
libs=-lnsl -ldl -lm -lc -lcrypt
libc=/lib/libc-2.2.2.so, so=so, useshrplib=false, libperl=libperl.a
Dynamic Linking:
dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic'
cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib'
Characteristics of this binary (from libperl):
Compile-time options:
Built under linux
Compiled at Mar 23 2001 12:49:50
@INC:
/usr/lib/perl5/5.6.0/i386-linux
/usr/lib/perl5/5.6.0
/usr/lib/perl5/site_perl/5.6.0/i386-linux
/usr/lib/perl5/site_perl/5.6.0
/usr/lib/perl5/site_perl
.
An example of the bug (produced by the standard sample DOMPrint.pl; the
first occurrence is in the word (attribute name) lexC)m = lex[e with acute]m,
where the final character "m" is omitted in the output):
INPUT:
<?xml version="1.0"?>
<!DOCTYPE slovo SYSTEM "slovnik.dtd">
<slovo lexC)m="den">
<gramatika>
<substantivum vzor="stroj">
<vC=jimka pC!d="6" tvar="dnu"/>
<vC=jimka pC!d="2" DMC-slo="mnoE>nC)" tvar="dnC-"/>
</substantivum>
</gramatika>
<sC)mantika>
<vC=znam semid="24hodin">
<vC=klad>DMasovC= C:sek 24 hodin</vC=klad>
<ukC!zka>
<text>mC-jC- den po dni</text>
</ukC!zka>
</vC=znam>
<vC=znam semid="bC-lC=">
<vC=klad>doba od vC=chodu do zC!padu slunce</vC=klad>
<ukC!zka>
<pramen>SPJ</pramen>
<text>dlouhC= letnC- den</text>
</ukC!zka>
<ukC!zka>
<text>nechval dne pEYed veDMerem</text>
</ukC!zka>
</vC=znam>
</sC)mantika>
</slovo>
OUTPUT:
<?xml version="1.0"?>
<!DOCTYPE slovo SYSTEM 'slovnik.dtd' >
<slovo lexC)="den">
<gramatika>
<substantivum vzor="stroj">
<vC=jimk pC!="6" tvar="dnu" DMC-s="jednotnC"/>
<vC=jimk pC!="2" tvar="dnC" DMC-s="mnoE>n"/>
</substantivum>
</gramatika>
<sC)mantik>
<vC=zna semid="24hodin">
<vC=kla>DMasovC= C:sek 24</vC=kla>
<ukC!zk>
<text>mC-jC- den po</text>
</ukC!zk>
</vC=zna>
<vC=zna semid="bC-l">
<vC=kla>doba od vC=chodu do zC!padu sl</vC=kla>
<ukC!zk>
<pramen>SPJ</pramen>
<text>dlouhC= letnC-</text>
</ukC!zk>
<ukC!zk>
<text>nechval dne pEYed veDM</text>
</ukC!zk>
</vC=zna>
</sC)mantik>
</slovo>