php-general Digest 4 Jan 2008 14:17:07 -0000 Issue 5216

Topics (messages 266598 through 266599):

Re: First stupid post of the year. [SOLVED]
        266598 by: Nisse Engström
        266599 by: tedd


----------------------------------------------------------------------
--- Begin Message ---
On Thu, 3 Jan 2008 12:39:36 -0500, tedd wrote:

> At 4:24 PM +0100 1/3/08, Nisse Engström wrote:
>>On Wed, 2 Jan 2008 19:36:56 -0500, tedd wrote:
>>
>>>  To find out, I did put the operation through FireFox and reversed the
>>>  POST/GET operations to get a look at the string -- it is:
>>>
>>>  %C2%A0%C2%A0%C2%A0Z%C2%A0%C2%A0%C2%A0  < where Z is the value passed.
>>>
>>>  Now, C2 (HEX) is a linefeed (194 DEC)

By the way, C2 is not a linefeed as far as I know.

>>>  And, A0 (HEX) is a non-breaking space (160 DEC;) which is a &nbsp;
>>
>>Not quite. <A0> is non-breaking space in *some* character
>>encodings, such as the ISO-8859-... encodings. It may
>>be different in other encodings. In UTF-8, it is <C2 A0>,
>>which is exactly what you're seeing.
> 
> Well considering that UTF-8 encompasses/includes all of the code 
> points found ISO-8859, then I think that both encodings would 
> reference the same character. After all, if they didn't then what's 
> the point of Unicode?
> 
> Now, one can argue how many bytes are needed to represent a character 
> in what encoding, but that doesn't change the character. In the end, 
> I believe that <A0> is the same regardless of what charset or 
> encoding you're using.

   You have a point here: the character is the same. In
Unicode it is called U+00A0. But Unicode alone does not
tell you how to represent the character in bytes. You
need an encoding for this.

   Unicode specifies a few different encodings, called
transformation formats (the T and F in UTF). The actual
bytes representing U+00A0 are as follows:

      UTF-32BE: <00 00 00 A0>
      UTF-16BE: <00 A0>
      UTF-16LE: <A0 00>
      UTF-8:    <C2 A0>

(where the <xx ...> syntax denotes *byte* sequences.
 A byte sequence and a character are different things.)

   The fact that the byte <A0> occurs in UTF-8 is just
an interesting, and easily confusing, coincidence.
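
   To see where the C2 comes from, here is a minimal PHP
sketch that builds the byte sequences above by hand. The
helper names utf8_small() and show_bytes() are made up for
this illustration, utf8_small() only handles code points
below U+0800, and the pack() calls are only a shortcut that
works for code points that fit in 16 bits:

```php
<?php
// Hypothetical helper for this illustration only:
// UTF-8 for code points below U+0800 (one or two bytes).
function utf8_small($cp)
{
    if ($cp < 0x80) {
        return chr($cp);                    // 0xxxxxxx
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6))       // lead byte:  110xxxxx
             . chr(0x80 | ($cp & 0x3F));    // trail byte: 10xxxxxx
    }
    throw new InvalidArgumentException('sketch handles U+0000..U+07FF only');
}

// Hypothetical helper: format a byte string as <XX XX ...>.
function show_bytes($bytes)
{
    return '<' . strtoupper(implode(' ', str_split(bin2hex($bytes), 2))) . '>';
}

$cp = 0x00A0; // NO-BREAK SPACE

$encoded = array(
    'UTF-32BE' => pack('N', $cp), // 32-bit big-endian
    'UTF-16BE' => pack('n', $cp), // 16-bit big-endian
    'UTF-16LE' => pack('v', $cp), // 16-bit little-endian
    'UTF-8'    => utf8_small($cp),
);

foreach ($encoded as $name => $bytes) {
    echo $name, ': ', show_bytes($bytes), "\n";
}
```

   For U+00A0, the bits above the low six (0xA0 >> 6 = 2)
go into the lead byte, 0xC0 | 2 = C2, and the low six bits
(0x20) go into the trail byte, 0x80 | 0x20 = A0.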

   In other encodings, the character U+00A0 may be
encoded differently. For example, in CP850 for DOS,
U+00A0 is encoded using the single byte <ff>.

  -   -   -  

   In HTML, there are a few ways to encode U+00A0. If
you have specified a character encoding for the document,
you can use the encoded character directly. You can
specify the encoding in HTTP (preferable) using PHP:

    header('Content-Type: text/html; charset=utf-8');

or .htaccess files (Apache 2):

    AddDefaultCharset utf-8

Richard Lynch would tell you to also use a <meta> element:

    <meta http-equiv="Content-Type"
          content="text/html; charset=utf-8">


   If you don't want to, or can't, use the encoded character
directly, you can also use HTML character references, such
as `&nbsp;´, `&#160;´ or `&#x00a0;´. Numerical character
references *always* refer to Unicode characters, *regardless*
of the encoding used in the document. For example, if your
document is encoded in CP850, you would use `&#xa0;´ and not
`&#xff;´ to represent U+00A0.
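
   You can watch this happen from PHP. The following is
only a sketch using html_entity_decode(); that function
does not know CP850, so UTF-8 and ISO-8859-1 are used as
the target encodings here. The variable names are just
for illustration:

```php
<?php
$flags = ENT_QUOTES | ENT_HTML401;

// All three references name the same character, U+00A0.
$refs = array('&nbsp;', '&#160;', '&#x00a0;');

foreach ($refs as $ref) {
    // Decode into two different target encodings and show the bytes.
    $utf8   = html_entity_decode($ref, $flags, 'UTF-8');
    $latin1 = html_entity_decode($ref, $flags, 'ISO-8859-1');
    printf("%-9s UTF-8: %s  ISO-8859-1: %s\n",
           $ref, bin2hex($utf8), bin2hex($latin1));
}

// &#xff; names U+00FF (y with diaeresis), whatever byte CP850 uses for nbsp.
$yuml = html_entity_decode('&#xff;', $flags, 'UTF-8');
echo '&#xff;    UTF-8: ', bin2hex($yuml), "\n";
```

   The reference picks the character; the target encoding
alone decides which bytes come out.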


  -   -   -  

   But let's go back to your problem again:
 
> I just don't understand where C2 comes from or why it's there. I 
> would think that <00 A0> would be more appropriate.

   When your document (web page) doesn't specify which
character encoding it is using, the browser will have to
guess. Many browsers will use cp1252 or similar. Others
might use UTF-8, or inspect the document and guess which
is more appropriate. Some browsers can be configured to
prefer a particular encoding.

   When the form is submitted, the form control values
are encoded using whichever character encoding the
browser has settled on. If your browser has settled on
UTF-8, the `&nbsp;´ in your form will be sent as <C2 A0>,
because character references can only be used in the HTML
document. In URLs and form submissions, the encoded bytes
are percent-escaped instead (e.g. %C2%A0).
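
   A small PHP sketch of that last step, using the value
from your earlier post. The variable names are mine, and
the two byte strings simply assume the browser picked
UTF-8 or ISO-8859-1 respectively:

```php
<?php
$nbsp_utf8   = "\xC2\xA0"; // U+00A0 encoded as UTF-8
$nbsp_latin1 = "\xA0";     // U+00A0 encoded as ISO-8859-1

// Percent-escaping works on bytes, so the result depends on the encoding.
echo rawurlencode($nbsp_utf8), "\n";    // %C2%A0
echo rawurlencode($nbsp_latin1), "\n";  // %A0

// The submitted value: three nbsp, "Z", three nbsp.
$raw = rawurldecode('%C2%A0%C2%A0%C2%A0Z%C2%A0%C2%A0%C2%A0');
echo strlen($raw), "\n";                // 13 bytes for 7 characters
```

   Note that strlen() counts bytes, not characters, which
is part of why the extra C2 bytes are easy to trip over.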

   And here's what is going wrong: Your server side
script is expecting the form submission to be encoded
in a single-byte encoding (such as cp1252 or iso-8859-1
or similar). The sequence %C2%A0 is interpreted as two
characters rather than one character.

   Which two characters would that be then? Well that,
again, depends on which character encoding your script
expects from the form submission:

      Encoding    Characters
      --------    ----------
      iso-8859-1: U+00C2, U+00A0 (A-circumflex, nbsp)
      cp850:      U+252C, U+00E1 (box drawing character, a-acute)
      cp1252:     U+00C2, U+00A0 (A-circumflex, nbsp)
      cp874:      U+0E22, U+00A0 (Thai YO YAK, nbsp)
      KSC5601:    U+D63B (Hangul HIEUH-O-KIYEOKSIOS)
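
   The first row can be checked by hand in PHP, without
any extensions. This is only a sketch: it covers the
iso-8859-1 reading (where the code point equals the byte
value) and the UTF-8 reading of a two-byte sequence, and
the variable names are made up:

```php
<?php
$bytes = rawurldecode('%C2%A0'); // what the script receives: two bytes

// Read as ISO-8859-1: every byte is one character, and the
// code point equals the byte value.
$as_latin1 = array();
foreach (str_split($bytes) as $byte) {
    $as_latin1[] = sprintf('U+%04X', ord($byte));
}

// Read as UTF-8: 110xxxxx 10xxxxxx is ONE character; join the x bits.
$lead    = ord($bytes[0]);
$trail   = ord($bytes[1]);
$as_utf8 = sprintf('U+%04X', (($lead & 0x1F) << 6) | ($trail & 0x3F));

echo 'iso-8859-1: ', implode(', ', $as_latin1), "\n"; // U+00C2, U+00A0
echo 'utf-8:      ', $as_utf8, "\n";                  // U+00A0
```

   Same two bytes, two characters in one reading and one
character in the other.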

>>  > Therefore, if I simply use:
>>>
>>>  $submit = str_replace( chr(194), '', $submit );
>>>  $submit = str_replace( chr(160), '', $submit );
>>>
>>>  This is the solution.
>>
>>Hardly.
> 
> If you mean my solution doesn't work, then you are mistaken -- it 
> works for me.

     ``This seems to work but I really have no idea what's
       going on, so I'll just make random guesses´´

is very far from *the* solution in my mind.  :-)
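
   To make the objection concrete, here is a sketch that
compares stripping the two bytes separately with removing
the UTF-8 sequence <C2 A0> as a unit. It assumes the
submission really is UTF-8; the function names are
hypothetical, and this is one possible approach rather
than a complete fix:

```php
<?php
// Hypothetical helper: remove UTF-8 encoded no-break spaces as whole units.
function strip_nbsp_utf8($s)
{
    return str_replace("\xC2\xA0", '', $s);
}

// The approach from the quoted post: drop bytes 194 and 160 wherever they occur.
function strip_bytes_194_160($s)
{
    return str_replace(array(chr(194), chr(160)), '', $s);
}

// True if $s is well-formed UTF-8 (the /u modifier makes PCRE validate it).
function is_utf8($s)
{
    return preg_match('//u', $s) === 1;
}

$submit = "\xC2\xA0Z\xC2\xA0";  // nbsp, Z, nbsp
$accent = "\xC3\xA0";           // U+00E0 (a-grave) in UTF-8

var_dump(strip_nbsp_utf8($submit) === 'Z');       // both approaches agree here
var_dump(strip_bytes_194_160($submit) === 'Z');
var_dump(is_utf8(strip_nbsp_utf8($accent)));      // a-grave is left alone
var_dump(is_utf8(strip_bytes_194_160($accent)));  // a lone C3 byte remains
```

   Both versions handle your test value, but the byte-wise
one damages any other character whose UTF-8 form happens
to contain C2 or A0, such as U+00E0 <C3 A0> or U+00A3
<C2 A3>.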

> This entire encoding process is more involved than it looks, or so it 
> appears to me.

More reading in no particular order:

The Unicode Standard:
     <http://unicode.org/>
Unicode character repertoire:
     <http://www.unicode.org/Public/UNIDATA/UnicodeData.txt>
Unicode encodings:
     <http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf>
Other encodings:
     <http://www.unicode.org/Public/MAPPINGS/>
RFC 3629 (UTF-8):
     <http://www.rfc-editor.org/rfc/rfc3629.txt>
HTML, Character sets and encodings:
     <http://www.w3.org/TR/html401/charset.html>
HTML, Form submission:
     <http://www.w3.org/TR/html401/interact/forms.html#h-17.13>
Jukka K. Korpela on Characters and Encodings:
     <http://www.cs.tut.fi/~jkorpela/chars/index.html>
The late Alan J. Flavell on internationalization:
     <http://web.archive.org/web/20060924054022/ppewww.ph.gla.ac.uk/~flavell/charset/internat.html>


/Nisse

--- End Message ---
--- Begin Message ---
At 10:33 AM +0100 1/4/08, Nisse Engström wrote:
On Thu, 3 Jan 2008 12:39:36 -0500, tedd wrote:

Nisse:

I thank you for your most enlightened and informative reply.

I cut/pasted your post into my list of things to remember.

Cheers,

tedd
--
-------
http://sperling.com  http://ancientstones.com  http://earthstones.com

--- End Message ---
