From: "Peter Kirk" <[EMAIL PROTECTED]>
> On 13/01/2004 13:35, Philippe Verdy wrote:
>
> > ...
> >
> >If your form page uses ISO-8859-1, then specify explicitly the ISO-8859-1
> >encoding as the one to use for submitting forms, as an explicit attribute of
> >your <form> element. But then visitors won't be able to send other
> >characters
> >than ISO-8859-1 in their form data, whether the form method is GET with
> >URL-encoding, or POST in standard form-data format.
> >
> >
> Is this actually true? Other characters can be entered into an
> ISO-8859-1 form in the format "&#nnn;"; or at least Mozilla 1.5 uses
> this format. I suspect this is what happened to me recently when I typed
> a schwa into a message in the webmail interface of a Yahoo group, and
> this appeared in my mail received from the group as "&#601;" - because
> the message source contained "&amp;#601;". The problem seems to be that
> the process reading the form data was not expecting this format and so
> took the & as a literal rather than as an escape.

It's true that you can pre-fill form data in an HTML page encoded as ISO-8859-1, using numeric character references to specify non-ISO-8859-1 characters. If that form is then submitted with ISO-8859-1 specified as the submission encoding, the browser may not notice that the pre-filled data (which still appeared correct in the rendered form) was bogus and cannot normally be encoded in ISO-8859-1.

What browsers do when they find form data that cannot be encoded in the specified charset is still unpredictable. Normally the browser should re-encode the form data in the specified encoding, and it should show the user immediately that some pre-filled data is bogus, with the offending characters appearing as "?". If the browser does not do that, because it prefers to render the form even with bogus data that cannot be submitted as is, then it should at least check that the edited form data can be safely encoded in the target encoding specified in the form, or in the encoding of the HTML page if none is specified.

Almost none of the HTML forms I have seen specify an encoding for submitting form data, so most browsers assume that form data uses the same encoding as the HTML page, even if the page contains numeric character references.

But your claim that a browser would send form data containing numeric character references is wrong here: it violates the format needed for forms submitted by the "GET" method (the data should be encoded as UTF-8, unless something else is specified or the page containing the form is not itself encoded in UTF-8, and then URL-encoded), or by the "POST" method. I don't know which formats other than these two are supported by browsers, but I think that browsers should now adopt some XML format for form data submitted by "POST". This way, browsers would be able to use numeric character references for characters not supported in the selected target encoding.
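
To make the disputed case concrete, here is a minimal Python sketch of what happens if a sender replaces unencodable characters with numeric character references before URL-encoding, and what a receiver that does not expect this then sees. It is only an illustration, not a description of what any particular browser or server does: the field name "msg" and its value are hypothetical, and Python's "xmlcharrefreplace" error handler merely stands in for the browser behaviour Peter describes.

```python
from urllib.parse import urlencode, parse_qs
import html

# Hypothetical form field containing a schwa (U+0259),
# which ISO-8859-1 cannot represent.
form_data = {"msg": "schwa: \u0259"}

# A strict encoder refuses the character outright.
try:
    urlencode(form_data, encoding="iso-8859-1", errors="strict")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Fallback: replace the character with a numeric character reference,
# then URL-encode the resulting ASCII text.
query = urlencode(form_data, encoding="iso-8859-1", errors="xmlcharrefreplace")

# The receiver decodes the bytes as ISO-8859-1 and gets plain ASCII text;
# nothing tells it that "&#601;" was meant as an escape.
received = parse_qs(query, encoding="iso-8859-1")["msg"][0]

# If that text is later escaped for HTML output, the "&" is escaped again.
escaped = html.escape(received)

print(strict_ok)
print(query)
print(received)
print(escaped)
```

The last step is consistent with the symptom reported above: a receiver that treats the "&" as a literal character will escape it once more, so the reference is displayed instead of the schwa.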
As UTF-8 is also the default encoding for XML files, browsers would in fact not need to specify it in the XML declaration of their POSTed form data document. Is there now a defined schema for sending POST data, with a registered media type supported by browsers, that could be specified as the form's submission format (its enctype attribute)? Will Apache or script processors like PHP support this new XML-formatted form data, instead of the legacy URL-encoded data and the poor, INI-like, POST variable assignments?

Browsers that don't support the new format would still use the default formats for GET and POST, but there it will be impossible to encode all characters if the target submission encoding is not UTF-8. That impossibility should be signaled to the user, instead of the data being sent unreliably and invisibly. I think it's a deficiency of browsers, and something that the W3C has not specified with enough precision for it to be corrected in Internet Explorer-based and Mozilla Gecko-based browsers and in Opera (which together now account for more than 98% of the browser market).
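
The check I have in mind before submission could look something like the following Python sketch. The function names and the sample field names are hypothetical; the point is only that finding the characters a target charset cannot encode is cheap, so there is little excuse for not reporting them to the user.

```python
def unencodable_chars(text, charset):
    """Return the distinct characters of text that charset cannot encode,
    in order of first appearance."""
    missing = []
    for ch in text:
        try:
            ch.encode(charset)
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing


def check_form(fields, charset):
    """Map each field name to its unencodable characters.
    Fields that encode cleanly are left out of the result."""
    problems = {}
    for name, value in fields.items():
        bad = unencodable_chars(value, charset)
        if bad:
            problems[name] = bad
    return problems


# Hypothetical form content: "é" fits in ISO-8859-1, the schwa and the euro sign do not.
fields = {"subject": "caf\u00e9", "body": "schwa \u0259 and euro \u20ac"}
print(check_form(fields, "iso-8859-1"))
print(check_form(fields, "utf-8"))
```

An empty result means the form can be submitted as is; anything else is exactly the information that should be shown to the user rather than silently replaced.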

