Transcode produces Unicode, which has bugs in perl 5.6.1

Fredrick Paul Eisele Tue, 26 Feb 2002 05:02:29 -0800

I would like some advice, and possibly a change to xerces-perl.
I found a bug in perl 5.6.1 which is related to Unicode (actually, I found
several).
An example follows:
==================


use Devel::Peek;

#=======================
# Something to try out the pattern
sub try_it {
  my $pattern = shift;

  Dump( $pattern );

  print STDERR "\ncompiled:\n";
  my $re = qr/$pattern/;
  Dump( $re );

  my $match = "Jan 11 14:50:01 10.1.0.1 CRON[15021]: (root) CMD (/usr/libexec/atrun)";

  if ($match =~ m/$re/) {
    return "Matched\n";
  }
  return "Not Matched\n";
}
#=======================

# This first example forces a unicode encoding by pushing a smiley onto the string.
# The smiley is then removed.
#
FAILURE: {
  my $failure =
    "(?sx)  ( \\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3} )    \\s   CRON .+? CMD \\s \\( (\\S+?) \\)  "
    . "\x{263A}";
  chop $failure;
  print "Pattern which should match but does not\n", try_it( $failure ), "\n";
}

# This sample "works" if the previous sample is commented out.
# If you didn't notice, the difference is the unicode character.
#
SUCCESS: {
  my $success =
    "(?sx)  ( \\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3} )    \\s   CRON .+? CMD \\s \\( (\\S+?) \\)  ";
  print "Pattern which should match and does\n", try_it( $success ), "\n";
}

==================
In this case, the problem causes the regular expression to be compiled
incorrectly, so a pattern that should match does not.
I ran into it because I have been putting patterns into CDATA sections of
XML files and subsequently compiling them into regular expressions.
I have also had some problems with pack when it is used to create
SNMP streams.
The response to my bug report follows:
==================
Bug ID:   20020225.013
Modified: 2002-02-25 21:35:43
Subject:  Re: [ID 20020225.013] Unicode vs. Regex
Source:   [EMAIL PROTECTED]

Thanks for your bugreport. The bug you describe has been fixed in the
current development branch and the fix will be in perl 5.8.0.

Numerous bugs in the Unicode sphere have been fixed since 5.6.1. If
you're interested to try out the current development branch, see
perldoc perlhack or just pick a recent snapshot from
    ftp://ftp.funet.fi/pub/languages/perl/snap
and test your code with it.

-- 
andreas

==================
Given that the fixes to these bugs will not be generally available for
a while, what can be done in the meantime? (I would rather not use
a perl snapshot.)
I am thinking that the perl strings returned by the xerces functions
should be stripped of their UTF-8 nature.
This could be done by supplying a function which does this, much
like transcode already does.
Or maybe a global option which controls the behavior of transcode?
What do you think?
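
To make the first option concrete, here is a rough sketch of what such a
function might look like. The name strip_utf8_flag is hypothetical and is not
part of xerces-perl. The utf8::downgrade branch only applies to perl 5.8.0 and
later, where that function is in the core; the pack/unpack fallback for 5.6.x
is an assumption and should be checked on the affected perl before anyone
relies on it. Strings holding characters above 0xFF are returned unchanged,
since they cannot be represented one byte per character.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of xerces-perl): return a copy of $str
# without perl's internal UTF-8 flag, provided every character fits in
# a single byte. Strings with characters above 0xFF come back unchanged.
sub strip_utf8_flag {
    my ($str) = @_;
    return $str unless defined $str;

    if (defined &utf8::downgrade) {
        # perl 5.8.0 and later: core function. The true second argument
        # makes it return false instead of dying on wide characters,
        # in which case $str is left as it was.
        utf8::downgrade($str, 1);
        return $str;
    }

    # Assumed fallback for perl 5.6.x: read the code points and repack
    # them as plain bytes.
    my @code_points = unpack("U*", $str);
    return $str if grep { $_ > 255 } @code_points;
    return pack("C*", @code_points);
}

# Usage, following the failing example above: the string picks up the
# UTF-8 flag from the smiley and keeps it after the chop.
my $pattern = "CRON .+? CMD" . "\x{263A}";
chop $pattern;
my $plain = strip_utf8_flag($pattern);
my $re    = qr/$plain/x;
```

Whether this belongs in a separate function or behind a global option on
transcode is a separate question; the sketch only shows the conversion step.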
