Re: interaction between tr and s (was Re: tr question -- probably wrong list to ask, but ...)

Joel Rees Mon, 03 Dec 2007 01:41:49 -0800

For the record --

Is UTF-8 input coming from the likes of Apache a possible source of failure? Pack may need to allow for endian-ness of a specific machine.

Well, it depends on how one looks at things, perhaps. I think one of the probable reasons for the failure in the DWIM machinery was that I am insisting on using shift-JIS characters in the source file instead of utf-8 in strings and comments. But, no, Apache wasn't filtering shift-JIS to utf-8 for me. Byte order also was not the problem.

After several hours of analysis (using more of the stuff that made the original posting of the source somewhat opaque), I determined that the problem derived from perl sometimes being stricter about shift-JIS than I wanted it to be.

I don't know why the '+' substitute for space would switch to strict character interpretation, but it seems to have been doing so.

Shift-JIS is a variable byte width encoding, one or two bytes. Lead bytes are inherently not valid as single-byte characters. Trailing bytes are sometimes valid as single-byte characters and sometimes not. If the regular expression engine is not checking for valid bytes, all you have to do is string the decoded bytes together. But if it is checking for valid bytes, you have to put the decoded bytes into something other than a char. (Blame C for folding the type of a byte onto the type of a character.)
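To make the lead-byte/trailing-byte distinction concrete, here is a minimal sketch of a byte classifier. The helper names are made up for illustration, and the ranges are the same loose ones used in the script further down (0x81-0x9f and 0xe0-0xfc as lead bytes), not a full validity check.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $byte (0..255) can start a
# two-byte Shift-JIS character, 0 otherwise. Loose ranges only.
sub is_sjis_lead_byte {
    my ($byte) = @_;
    return ( $byte >= 0x81 && $byte <= 0x9f )
        || ( $byte >= 0xe0 && $byte <= 0xfc ) ? 1 : 0;
}

# Hypothetical helper: counts characters in a raw Shift-JIS byte
# string by skipping two bytes after every lead byte.
sub sjis_char_count {
    my ($bytes) = @_;
    my @b = unpack( 'C*', $bytes );
    my ( $i, $count ) = ( 0, 0 );
    while ( $i < @b ) {
        $i += is_sjis_lead_byte( $b[$i] ) ? 2 : 1;
        $count++;
    }
    return $count;
}
```

The byte pair 0x95 0x5c is a handy test case: its trailing byte is the same value as an ASCII backslash, which is exactly the kind of trailing byte that is also valid on its own.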

But if you are collecting into 16-bit words, you have to actually check for the lead bytes yourself. I'm sure someone could put an RE together that would do it, but I just decided it was going to be simpler to check and build the string by hand.
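For comparison, an RE along those lines might look something like the sketch below. It is not what the script in this post does; it assumes the input is already a plain byte string, the names $sjis_char and split_sjis are invented here, and the ranges are the usual loose ones (trailing bytes 0x40-0x7e and 0x80-0xfc).

```perl
use strict;
use warnings;

# Either a lead byte followed by a plausible trailing byte,
# or a byte that stands alone as a single-byte character.
my $sjis_char = qr/[\x81-\x9f\xe0-\xfc][\x40-\x7e\x80-\xfc]|[\x00-\x80\xa0-\xdf\xfd-\xff]/;

# Hypothetical helper: splits a byte string into one- and two-byte
# characters. \G anchors each match to the end of the previous one,
# so matching stops at the first byte sequence that fits neither case.
sub split_sjis {
    my ($bytes) = @_;
    my @chars = $bytes =~ /\G($sjis_char)/g;
    return @chars;
}
```

Because matching simply stops at a bad sequence, a caller would have to compare the total length of the pieces against the length of the input to notice that something was left over.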


So, for anybody who's curious, here's what I'm doing for now:

-----------------------------------------
my $qString = $ENV{'QUERY_STRING'};
my @list = split( '&', $qString, 10 );
my %queries = ();
foreach my $pair ( @list )
{       my ( $key, $value ) = split( '=', $pair, 2 );
        # Really should just give in and use CGI.
        # $key =~ tr/+/ /;      # You don't expect space in identifiers, but, ...
        $key =~ s/%([\dA-Fa-f][\dA-Fa-f])/pack ("C", hex ($1))/eg;

        # $queries{ $key . '_' } = $value; # dbg
        
        $value =~ tr/+/ /;
        
        my ( $byteAccm, $hexAccm, $conv ) = ( 0, undef, '' );
        while ( $value =~ m/%([\dA-Fa-f][\dA-Fa-f])|(.)/g )
        {       if ( defined ( $1 ) )
                {       my $hexValue = $1;
                        my $decValue = hex ( $hexValue );
                        if ( ! defined ( $hexAccm ) )
                        {       if ( $decValue <= 0x80 || ( $decValue >= 0xa0 && $decValue < 0xe0 ) || $decValue >= 0xfd )
                                {       $conv .= pack( 'C', $decValue );
                                }
                                else    # Lead byte -- loose checks all around.
                                {       $byteAccm = $decValue;
                                        $hexAccm = $hexValue;
                                }
                        }
                        else
                        {       # if ( $decValue >= 0x40 || ( $decValue > 0xa0 && $hexValue <0xe0 ) )
                                $conv .= pack( 'S', ( $byteAccm << 8 ) + $decValue );
                                $byteAccm = 0;
                                $hexAccm = undef;
                        }
                }
                else
                {       my $cValue = $2;
                        my $decValue = ord ( $cValue );
                        if ( ! defined ( $hexAccm ) )
                        {       $conv .= $cValue;
                        }
                        else
                        {       # if ( $decValue >= 0x40 || ( $decValue > 0xa0 && $hexValue <0xe0 ) )
                                $conv .= pack( 'S', ( $byteAccm << 8 ) + $decValue );
                                $byteAccm = 0;
                                $hexAccm = undef;
                        }
                }
        }

        $queries{ $key } = $conv;
}
-----------------------------------------

If this were production code, I should check some more gaps in the lead byte (and check where the newest JIS adds the extra several thousand characters) and uncomment the checks on the trailing bytes (and add some trailing byte checks specific to certain lead bytes, geagh). But then I have to figure out what to do with bad bytes.
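One way to sidestep the hand-written range checks, sketched here as an alternative and not as what the script above does, is to percent-decode to raw bytes and let the core Encode module do the validation. The helper name decode_sjis_query_value is made up, and treating bad input as undef is just one possible policy for the bad-bytes question.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turns one raw query-string value into a
# Perl character string, or undef if the bytes are not valid
# Shift-JIS according to Encode.
sub decode_sjis_query_value {
    my ($raw) = @_;
    ( my $bytes = $raw ) =~ tr/+/ /;                    # '+' stands for space
    $bytes =~ s/%([0-9A-Fa-f]{2})/chr( hex($1) )/eg;    # %XX to one raw byte
    # FB_CROAK makes decode() die on malformed input; eval traps that.
    my $text = eval { decode( 'shiftjis', $bytes, Encode::FB_CROAK ) };
    return $text;
}
```

Note that the result is a string of Unicode characters, not packed 16-bit words, so the rest of a script would need to handle it accordingly (and encode it again on output).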



Joel Rees
(waiting for a 3+GHz ARM processor to come out,
to test Steve's willingness to switch again.)
