Re: interaction between tr and s (was Re: tr question -- probably wrong list to ask, but ...)

2007-12-03 Thread Joel Rees

For the record --

> Is UTF-8 input coming from the likes of Apache a possible source of
> failure? Pack may need to allow for endian-ness of a specific machine.


Well, it depends on how one looks at things, perhaps. I think one of  
the probable reasons for the failure in the DWIM machinery was that I  
am insisting on using shift-JIS characters in the source file instead  
of utf-8 in strings and comments. But, no, Apache wasn't filtering  
shift-JIS to utf-8 for me. Byte order also was not the problem.


After several hours of analysis (using more of the stuff that made  
the original posting of the source somewhat opaque), I determined  
that the problem derived from perl sometimes being stricter about  
shift-JIS than I wanted it to be.


I don't know why the '+' substitute for space would switch to strict  
character interpretation, but it seems to have been doing so.


Shift-JIS is a variable byte width encoding, one or two bytes. Lead  
bytes are inherently not valid as single-byte characters. Trailing  
bytes are sometimes valid as single-byte characters and sometimes  
not. If the regular expression engine is not checking for valid  
bytes, all you have to do is string the decoded bytes together. But  
if it is checking for valid bytes, you have to put the decoded bytes  
into something other than a char. (Blame C for folding the type of a  
byte onto the type of a character.)
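
To make the byte classes concrete, here is a minimal sketch that sorts a byte value into "lead" or "single" using the usual loose Shift-JIS ranges (0x81-0x9f and 0xe0-0xfc as lead bytes). The helper name sjis_byte_class is hypothetical, and the ranges are deliberately loose rather than a full validity table.

```perl
use strict;
use warnings;

# Hypothetical helper: classify one byte value (0..255) under a loose
# reading of Shift-JIS. Anything that is not a lead byte is treated as
# a single-byte character (ASCII, half-width katakana, stray bytes).
sub sjis_byte_class {
    my ($byte) = @_;
    return 'lead'
        if ( $byte >= 0x81 && $byte <= 0x9f )
        || ( $byte >= 0xe0 && $byte <= 0xfc );
    return 'single';
}

# 0x41 is 'A' on its own, but it is also a legal trailing byte
# (0x83 0x41 is a full-width katakana), which is the ambiguity
# described above.
for my $byte ( 0x41, 0x83, 0xb1, 0xe0, 0xfd ) {
    printf "0x%02x %s\n", $byte, sjis_byte_class($byte);
}
```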


But if you are collecting into 16-bit words, you have to actually  
check for the lead bytes yourself. I'm sure someone could put an RE  
together that would do it, but I just decided it was going to be  
simpler to check and build the string by hand.


So, for anybody who's curious, here's what I'm doing for now:

-
my $qString = $ENV{'QUERY_STRING'};
my @list = split( '&', $qString, 10 );
my %queries = ();
foreach my $pair ( @list )
{   my ( $key, $value ) = split( '=', $pair, 2 );
    # Really should just give in and use CGI.
    # $key =~ tr/+/ /;  # You don't expect space in identifiers, but, ...
    $key =~ s/%([\dA-Fa-f][\dA-Fa-f])/pack( 'C', hex( $1 ) )/eg;

    # $queries{ $key . '_' } = $value; # dbg

    $value =~ tr/+/ /;

    my ( $byteAccm, $hexAccm, $conv ) = ( 0, undef, '' );
    while ( $value =~ m/%([\dA-Fa-f][\dA-Fa-f])|(.)/g )
    {   if ( defined ( $1 ) )
        {   my $hexValue = $1;
            my $decValue = hex ( $hexValue );
            if ( ! defined ( $hexAccm ) )
            {   if ( $decValue <= 0x80 || ( $decValue >= 0xa0 && $decValue < 0xe0 ) || $decValue >= 0xfd )
                {   $conv .= pack( 'C', $decValue );
                }
                else    # Lead byte -- loose checks all around.
                {   $byteAccm = $decValue;
                    $hexAccm = $hexValue;
                }
            }
            else
            {   # if ( $decValue = 0x40 || ( $decValue  0xa0  $hexValue  0xe0 ) )
                # Note: 'S' packs in the machine's native byte order.
                $conv .= pack( 'S', ( $byteAccm << 8 ) + $decValue );
                $byteAccm = 0;
                $hexAccm = undef;
            }
        }
        else
        {   my $cValue = $2;
            my $decValue = ord ( $cValue );
            if ( ! defined ( $hexAccm ) )
            {   $conv .= $cValue;
            }
            else
            {   # if ( $decValue = 0x40 || ( $decValue  0xa0  $hexValue  0xe0 ) )
                $conv .= pack( 'S', ( $byteAccm << 8 ) + $decValue );
                $byteAccm = 0;
                $hexAccm = undef;
            }
        }
    }

    $queries{ $key } = $conv;
}
-

If this were production code, I should check some more gaps in the  
lead byte (and check where the newest JIS adds the extra several  
thousand characters) and uncomment the checks on the trailing bytes  
(and add some trailing byte checks specific to certain lead bytes,  
geagh). But then I have to figure out what to do with bad bytes.
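
As a rough sketch of what a general trailing-byte check could look like: Shift-JIS trailing bytes fall in 0x40-0x7e or 0x80-0xfc. The helper name sjis_trail_ok is hypothetical, this is not the commented-out check from the code above, and it ignores the lead-byte-specific gaps mentioned in the previous paragraph. What to do with a byte that fails is left open here too.

```perl
use strict;
use warnings;

# Hypothetical helper: true when $byte lies in a range that Shift-JIS
# permits for the second byte of a two-byte character.
sub sjis_trail_ok {
    my ($byte) = @_;
    my $ok = ( $byte >= 0x40 && $byte <= 0x7e )
          || ( $byte >= 0x80 && $byte <= 0xfc );
    return $ok ? 1 : 0;
}

for my $byte ( 0x3f, 0x41, 0x7f, 0x80, 0xfc, 0xfd ) {
    printf "0x%02x %s\n", $byte, sjis_trail_ok($byte) ? 'ok' : 'bad';
}
```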



Joel Rees
(waiting for a 3+GHz ARM processor to come out,
to test Steve's willingness to switch again.)




interaction between tr and s (was Re: tr question -- probably wrong list to ask, but ...)

2007-12-01 Thread Joel Rees
Okay, given the following (without all the debugging code I had in  
earlier):



# The code that grabs the parameters:

my $qString = $ENV{'QUERY_STRING'};
my @list = split( '&', $qString, 10 );
my %queries = ();
foreach my $pair ( @list )
{   my ( $key, $value ) = split( '=', $pair, 2 );
    $key =~ s/%([\dA-Fa-f][\dA-Fa-f])/pack( 'C', hex( $1 ) )/eg;

    $value =~ tr/+/ /;

    $value =~ s/%([\dA-Fa-f][\dA-Fa-f])/pack( 'C', hex( $1 ) )/eg;

    $queries{ $key } = $value;
}


Anyone know why commenting out the transliteration will recover the  
shift-JIS characters from the url-encoded stream (leaving spaces as  
'+', of course), but leaving the transliteration in will induce the  
code to drop shift-JIS lead bytes and every now and then whole  
characters?


I had a similar problem with

$value =~ s/\+/ /g;

but it was an intermittent problem. (Haven't tried it today to see  
whether it only kills the shift-JIS characters when there is 8-bit  
space in the stream, but that may have been what was happening.)
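
For comparison, here is a minimal sketch of the same two steps applied to a plain byte string. It assumes no `use utf8`, `use encoding`, or similar pragma is in effect, which is an assumption, since the top of the script is not shown here. The name url_decode_bytes is hypothetical. Under those assumptions the lead bytes would be expected to survive the transliteration.

```perl
use strict;
use warnings;

# Hypothetical helper: '+' to space, then %XX to the raw byte.
# Nothing here knows about Shift-JIS; bytes are passed through as-is.
sub url_decode_bytes {
    my ($s) = @_;
    $s =~ tr/+/ /;
    $s =~ s/%([\dA-Fa-f][\dA-Fa-f])/pack( 'C', hex( $1 ) )/eg;
    return $s;
}

# %83 is a lead byte; the 'A' and 'C' that follow are trailing bytes.
my $decoded = url_decode_bytes('%83A+%83C');
print unpack( 'H*', $decoded ), "\n";    # prints 8341208343
```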


Joel Rees
(waiting for a 3+GHz ARM processor to come out,
to test Steve's willingness to switch again.)




Re: interaction between tr and s (was Re: tr question -- probably wrong list to ask, but ...)

2007-12-01 Thread Doug McNutt
At 17:03 +0900 12/1/07, Joel Rees wrote:
> Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed
> Message-Id: [EMAIL PROTECTED]
> Content-Transfer-Encoding: 7bit
> From: Joel Rees [EMAIL PROTECTED]

At 11:43 +0900 12/1/07, Joel Rees wrote:
> Content-Type: text/plain; charset=ISO-2022-JP; delsp=yes; format=flowed
> Message-Id: [EMAIL PROTECTED]
> Content-Transfer-Encoding: 7bit
> From: Joel Rees [EMAIL PROTECTED]

I had some intermittent problems reading your postings using Eudora-5 on this 
Mac 8500 running OS9.1 which I prefer for email.

The $line =~ tr/+/ /; showed up as $line =? tr/+/ /; and I got a couple of yen
marks. I blamed it on lack of Unicode support. Looking back, I see a couple of
Content headers in your email that bother me. They both say simple 7-bit ASCII,
but then they also have divers encodings stated, which really are about how to
use the eighth bit.

There is also the big-endian / little-endian consideration which has reared its 
ugly head with the introduction of Intel machines running Mac OS.

Is it possible that some of the failure to decode %xx encoded stuff is 
associated with development on one machine followed by execution on another? Is 
UTF-8 input coming from the likes of Apache a possible source of failure? Pack 
may need to allow for endian-ness of a specific machine.
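
To illustrate the endian-ness point in general terms, here is a small sketch of how three pack templates lay out the same two byte values. The values 0x83 and 0x41 are only sample input. 'S' uses the machine's native byte order, 'n' is always big-endian, and 'C2' writes two separate bytes in the order given.

```perl
use strict;
use warnings;

my ( $lead, $trail ) = ( 0x83, 0x41 );    # sample byte values only
my $word = ( $lead << 8 ) + $trail;

my $native = pack( 'S',  $word );             # order depends on the machine
my $big    = pack( 'n',  $word );             # high byte first, everywhere
my $pair   = pack( 'C2', $lead, $trail );     # two bytes, order as given

# On a big-endian machine all three lines match; on a little-endian
# machine the first line comes out byte-swapped.
print unpack( 'H*', $_ ), "\n" for ( $native, $big, $pair );
```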

-- 

-- From the U S of A, the only socialist country that refuses to admit it. --