For the record --

Is UTF-8 input coming from the likes of Apache a possible source of failure? Pack may need to allow for endian-ness of a specific machine.

Well, it depends on how one looks at things, perhaps. I think one of the probable reasons for the failure in the DWIM machinery was that I insist on using shift-JIS rather than utf-8 for the strings and comments in the source file. But, no, Apache wasn't filtering shift-JIS to utf-8 for me, and byte order was not the problem either.

After several hours of analysis (using more of the stuff that made the original posting of the source somewhat opaque), I determined that the problem derived from perl sometimes being stricter about shift-JIS than I wanted it to be.

I don't know why the '+'-to-space substitution would switch perl to strict character interpretation, but it seems to have been doing so.

Shift-JIS is a variable-width encoding: each character is one or two bytes. Lead bytes are never valid as single-byte characters. Trailing bytes are sometimes valid as single-byte characters and sometimes not. If the regular expression engine is not checking for valid bytes, all you have to do is string the decoded bytes together. But if it is checking for valid bytes, you have to put the decoded bytes into something other than a char. (Blame C for folding the type of a byte onto the type of a character.)
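
A minimal sketch of those byte classes, using the same loose ranges as the code further down. The sub names sjis_byte_class and sjis_is_trail are made up for illustration, and the ranges are the generic Shift-JIS ones (lead 0x81-0x9f and 0xe0-0xfc, trail 0x40-0x7e and 0x80-0xfc), not a check against any particular JIS revision.

```perl
use strict;
use warnings;

# Hypothetical helpers, named for this sketch only.

# Loose test: is this byte value (0..255) a Shift-JIS lead byte?
sub sjis_byte_class
{       my ( $byte ) = @_;
        return 'lead' if ( $byte >= 0x81 && $byte <= 0x9f )
                      || ( $byte >= 0xe0 && $byte <= 0xfc );
        return 'single';        # ASCII, 0x80, 0xa0-0xdf, 0xfd-0xff
}

# Loose test: may this byte value follow a lead byte?
sub sjis_is_trail
{       my ( $byte ) = @_;
        return ( ( $byte >= 0x40 && $byte <= 0x7e )
              || ( $byte >= 0x80 && $byte <= 0xfc ) ) ? 1 : 0;
}

# 0x41 is 'A' on its own, but is also a legal trail byte (0x83 0x41).
printf "%02x: %s, trail=%d\n", $_, sjis_byte_class( $_ ), sjis_is_trail( $_ )
        for ( 0x41, 0x83, 0x20 );
```

The point of the last lines is that a byte like 0x41 cannot be classified without knowing whether a lead byte came just before it.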

But if you are collecting into 16-bit words, you have to actually check for the lead bytes yourself. I'm sure someone could put an RE together that would do it, but I just decided it was going to be simpler to check and build the string by hand.
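
For what it is worth, one RE that might do the splitting, as a sketch only: it assumes the value has already been percent-decoded into a plain byte string (not a utf-8-flagged one), and sjis_split_chars is a made-up name. It uses the generic lead and trail ranges and is not what the hand-built loop below does.

```perl
use strict;
use warnings;

# Hypothetical helper: split a Shift-JIS byte string into characters,
# each element being either one byte or a lead-plus-trail pair.
sub sjis_split_chars
{       my ( $bytes ) = @_;
        return $bytes =~ m/( [\x81-\x9f\xe0-\xfc] [\x40-\x7e\x80-\xfc]   # two-byte character
                           | [\x00-\xff]                                 # anything else, one byte
                           )/gx;
}

my @chars = sjis_split_chars( "A\x83\x41B" );
print scalar( @chars ), " characters\n";       # the middle one is the pair 0x83 0x41
```

Note that a lead byte with no legal trail byte simply falls through the second alternative as a one-byte element, so this splits but does not validate.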

So, for anybody who's curious, here's what I'm doing for now:

-----------------------------------------
my $qString = $ENV{'QUERY_STRING'};
my @list = split( '&', $qString, 10 );
my %queries = ();
foreach my $pair ( @list )
{       my ( $key, $value ) = split( '=', $pair, 2 );
        # Really should just give in and use CGI.
        # $key =~ tr/+/ /;      # You don't expect space in identifiers, but, ...
        $key =~ s/%([\dA-Fa-f][\dA-Fa-f])/pack ("C", hex ($1))/eg;

        # $queries{ $key . '_' } = $value; # dbg
        
        $value =~ tr/+/ /;
        
        my ( $byteAccm, $hexAccm, $conv ) = ( 0, undef, '' );
        while ( $value =~ m/%([\dA-Fa-f][\dA-Fa-f])|(.)/g )
        {       if ( defined ( $1 ) )
                {       my $hexValue = $1;
                        my $decValue = hex ( $hexValue );
                        if ( ! defined ( $hexAccm ) )
                        {       if ( $decValue <= 0x80 || ( $decValue >= 0xa0 && $decValue < 0xe0 ) || $decValue >= 0xfd )
                                {       $conv .= pack( 'C', $decValue );
                                }
                                else    # Lead byte -- loose checks all around.
                                {       $byteAccm = $decValue;
                                        $hexAccm = $hexValue;
                                }
                        }
                        else
                        {       # if ( $decValue >= 0x40 || ( $decValue > 0xa0 && $decValue < 0xe0 ) )
                                # 'n' packs a big-endian 16-bit word, so the lead byte comes first on any machine.
                                $conv .= pack( 'n', ( $byteAccm << 8 ) + $decValue );
                                $byteAccm = 0;
                                $hexAccm = undef;
                        }
                }
                else
                {       my $cValue = $2;
                        my $decValue = ord ( $cValue );
                        if ( ! defined ( $hexAccm ) )
                        {       $conv .= $cValue;
                        }
                        else
                        {       # if ( $decValue >= 0x40 || ( $decValue > 0xa0 && $decValue < 0xe0 ) )
                                # 'n' packs a big-endian 16-bit word, so the lead byte comes first on any machine.
                                $conv .= pack( 'n', ( $byteAccm << 8 ) + $decValue );
                                $byteAccm = 0;
                                $hexAccm = undef;
                        }
                }
        }

        $queries{ $key } = $conv;
}
-----------------------------------------

If this were production code, I would need to check for some more gaps in the lead byte range (and check where the newest JIS adds the extra several thousand characters), uncomment the checks on the trailing bytes, and add some trailing byte checks specific to certain lead bytes (geagh). But then I would have to figure out what to do with bad bytes.
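
One possible answer to the bad-byte question, sketched under assumptions: replace a lead byte that has no legal trail byte with a marker and carry on. The name sjis_scrub and the '?' marker are invented for this sketch, and the ranges are the generic ones (no per-lead-byte checks); rejecting the whole request is just as defensible.

```perl
use strict;
use warnings;

# Hypothetical helper: copy a Shift-JIS byte string, replacing any lead
# byte that is not followed by a legal trail byte with $marker.
sub sjis_scrub
{       my ( $bytes, $marker ) = @_;
        $marker = '?' unless defined $marker;
        my $out = '';
        while ( $bytes =~ m/([\x81-\x9f\xe0-\xfc])([\x40-\x7e\x80-\xfc])?|([\x00-\xff])/g )
        {       if ( ! defined $1 )        { $out .= $3; }         # ordinary single byte
                elsif ( defined $2 )       { $out .= $1 . $2; }    # good two-byte character
                else                       { $out .= $marker; }    # orphaned lead byte
        }
        return $out;
}

print sjis_scrub( "A\x83 B" ), "\n";    # 0x83 followed by a space is an orphaned lead byte
```

The byte after an orphaned lead byte is kept and examined on its own, so one bad byte does not swallow its neighbor.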


Joel Rees
(waiting for a 3+GHz ARM processor to come out,
to test Steve's willingness to switch again.)

