[ 
https://issues.apache.org/jira/browse/SLING-508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12602634#action_12602634
 ] 

Felix Meschberger commented on SLING-508:
-----------------------------------------

First off: Servlet Container parameters are not re-encoded by Sling (any more). 
They are taken as is.

Now, to what happens here:

On the one hand, the W3C [1] recommends that browser vendors encode non-ASCII
characters in URLs as UTF-8. This should IMO also cover parameters POSTed as
application/x-www-form-urlencoded, although I could not find a real
codification of this.

On the other hand, the Servlet Specification states that all data read from
POSTed content should be decoded as ISO-8859-1 unless the request specifies a
character encoding (Servlet API 2.4, Section 4.9). As servlet containers only
read application/x-www-form-urlencoded POST requests, this issue is about
these parameters.

Third, servlet containers are implemented inconsistently: some (e.g. Tomcat)
apply the Servlet API spec and read the data as ISO-8859-1, while others (e.g.
Jetty) apply the W3C recommendation and read the data as UTF-8.

Fourth, browsers do not apply the W3C recommendation but instead encode the
parameters in the character encoding of the page on which the form is placed.
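
To make the fourth point concrete, here is a minimal sketch of what ends up on
the wire for the same form value, depending on the page encoding. It uses
java.net.URLEncoder as a stand-in for the browser's form encoding; the class
and method names are made up for illustration and are not part of Sling.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class FormEncodingSketch {

    /** Percent-encodes a form value using the charset of the page (hypothetical helper). */
    static String encode(String value, String pageCharset) {
        try {
            return URLEncoder.encode(value, pageCharset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + pageCharset, e);
        }
    }

    public static void main(String[] args) {
        String value = "\u00fc"; // U+00FC, LATIN SMALL LETTER U WITH DIAERESIS
        System.out.println(encode(value, "UTF-8"));      // prints %C3%BC
        System.out.println(encode(value, "ISO-8859-1")); // prints %FC
    }
}
```

The request body itself carries no hint as to which of the two encodings was
used, which is why the container has to guess.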

Consider now the situation of a Servlet API conforming servlet container
accepting form data from a UTF-8 encoded page: The parameters are encoded in
UTF-8 and the servlet container decodes them as ISO-8859-1, giving unreadable
data. Conversely, a W3C conforming container accepting form data from an
ISO-8859-1 encoded page will also produce corrupt data, because it decodes
ISO-8859-1 data as UTF-8.
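
Both failure modes can be reproduced with plain strings. The following is only
a sketch: it skips the percent-encoding step and works on the raw bytes
directly, and the class and method names are invented for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetMismatchSketch {

    /** Encodes text with the page charset, then decodes the bytes with the container charset. */
    static String transmit(String text, Charset pageCharset, Charset containerCharset) {
        byte[] wire = text.getBytes(pageCharset);
        return new String(wire, containerCharset);
    }

    public static void main(String[] args) {
        String original = "\u00fc"; // U+00FC

        // UTF-8 page, ISO-8859-1 container: bytes 0xC3 0xBC turn into two characters.
        String tomcatStyle = transmit(original, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);

        // ISO-8859-1 page, UTF-8 container: the single byte 0xFC is malformed UTF-8
        // and Java substitutes the replacement character U+FFFD.
        String jettyStyle = transmit(original, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);

        System.out.println(tomcatStyle.length());            // prints 2
        System.out.println(jettyStyle.equals("\ufffd"));     // prints true
    }
}
```

Note the asymmetry: in the first case the original bytes are still recoverable
from the garbled string, in the second case they are lost.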

We have very little power to work around this. The best we can do is try to
force the servlet container to decode the parameter data as ISO-8859-1 and then
recode the raw data in whatever character encoding has been declared with the
"_charset_" request parameter.
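
The recoding step could look roughly like the sketch below. This is not the
actual Sling code; the class name, the method name and the fallback to
ISO-8859-1 for a missing or unknown "_charset_" value are assumptions made for
the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ParameterRecoder {

    /**
     * Recovers the raw bytes from a value the container decoded as ISO-8859-1
     * and decodes them again with the charset named by "_charset_".
     * Falls back to ISO-8859-1 (an assumption) if the name is missing or unknown.
     */
    static String recode(String containerValue, String declaredCharset) {
        Charset target = StandardCharsets.ISO_8859_1;
        if (declaredCharset != null) {
            try {
                target = Charset.forName(declaredCharset);
            } catch (IllegalArgumentException e) {
                // illegal or unsupported charset name: keep the fallback
            }
        }
        byte[] raw = containerValue.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, target);
    }

    public static void main(String[] args) {
        // Two characters U+00C3 U+00BC are what an ISO-8859-1 container makes of UTF-8 "u umlaut".
        String fixed = recode("\u00c3\u00bc", "UTF-8");
        System.out.println(fixed.equals("\u00fc")); // prints true
    }
}
```

This only works if the container really did decode the body as ISO-8859-1; a
value that was already decoded as UTF-8 cannot be repaired this way.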

Two remarks:
(1) We use ISO-8859-1 because this encoding defines a 1:1 mapping of raw bytes 
to characters. In fact, the lower 256 characters of Unicode are exactly the 
characters from the ISO-8859-1 encoding. Thus ISO-8859-1 is kind of an identity 
encoding.
(2) "Trying to force" the container means that we ensure the correct character
set is used for reading the input, but if the input has already been read (e.g.
by a filter outside Sling), we cannot do much any more. This is probably not
much of an issue, but we must be aware of it.
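
Remark (1) can be checked mechanically. The sketch below (names are made up
for the example) decodes all 256 byte values with a given charset and encodes
the result again; only a charset with a 1:1 byte-to-character mapping gets the
original bytes back.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class IdentityEncodingCheck {

    /** Returns true if every byte value 0x00..0xFF survives a decode/encode round trip. */
    static boolean roundTrips(Charset charset) {
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }
        String decoded = new String(all, charset);
        return Arrays.equals(all, decoded.getBytes(charset));
    }

    public static void main(String[] args) {
        System.out.println(roundTrips(StandardCharsets.ISO_8859_1)); // prints true
        System.out.println(roundTrips(StandardCharsets.UTF_8));      // prints false
    }
}
```

UTF-8 fails the check because lone bytes above 0x7F are malformed input and
are replaced during decoding, so the raw data is gone before we can recode it.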


[1] http://www.w3.org/TR/html4/appendix/notes.html#h-B.2.1

> Parameter decoding uses wrong default charset
> ---------------------------------------------
>
>                 Key: SLING-508
>                 URL: https://issues.apache.org/jira/browse/SLING-508
>             Project: Sling
>          Issue Type: Bug
>          Components: Engine
>    Affects Versions: 2.0.0
>            Reporter: Tobias Bocanegra
>            Assignee: Felix Meschberger
>            Priority: Blocker
>
> As of SLING-152 the request parameters are re-encoded if a _charset_
> parameter is present. It assumes that the default encoding is
> UTF-8, which is not the case for servlet spec compliant containers (e.g.
> Tomcat).
> Change the default encoding to ISO-8859-1 or make it configurable.
> see: 
> http://svn.apache.org/viewvc/incubator/sling/trunk/engine/src/main/java/org/apache/sling/engine/impl/parameters/Util.java?view=markup
