[
https://issues.apache.org/jira/browse/SLING-508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12602634#action_12602634
]
Felix Meschberger commented on SLING-508:
-----------------------------------------
First off: Servlet Container parameters are not re-encoded by Sling (any more).
They are taken as is.
Now, to what happens here:
On the one hand, the W3C [1] recommends that browser vendors encode non-ASCII
characters in URLs as UTF-8. This should IMO also cover the encoding of
parameters in application/x-www-form-urlencoded POST bodies, although I
could not find this formally specified anywhere.
On the other hand, the Servlet Specification states that all data read from
POSTed content should be decoded as ISO-8859-1 unless the request itself
declares a character encoding (Servlet API 2.4, Section 4.9). As servlet
containers only parse application/x-www-form-urlencoded POST bodies into
parameters, this issue is about these parameters.
Third, servlet containers are implemented inconsistently: Some (e.g. Tomcat)
follow the Servlet API spec and read the data as ISO-8859-1, while others (e.g.
Jetty) follow the W3C recommendation and read the data as UTF-8.
Fourth, browsers do not follow the W3C recommendation but instead encode the
parameters in the character encoding of the page on which the form is placed.
Consider now the situation of a Servlet API conforming servlet container
accepting form data from a UTF-8 encoded page: The parameters are encoded in
UTF-8 and the servlet container decodes them as ISO-8859-1, giving unreadable
data. Conversely, a W3C conforming container accepting form data from an
ISO-8859-1 encoded page will also produce corrupt data, because it decodes
ISO-8859-1 bytes as UTF-8.
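
To make the two failure modes concrete, here is a minimal sketch that simulates
them with the JDK's URLEncoder and URLDecoder. The class and method names are
made up for illustration and no servlet container is involved; the "page
charset" and "container charset" are just the two arguments.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

/** Hypothetical demo class; simulates a browser and a container disagreeing. */
class FormCharsetMismatch {

    /**
     * Encodes a value the way a browser would for a page in pageCharset,
     * then decodes it the way a container using containerCharset would.
     */
    static String submit(String value, String pageCharset, String containerCharset) {
        try {
            String wire = URLEncoder.encode(value, pageCharset);
            return URLDecoder.decode(wire, containerCharset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String city = "Z\u00fcrich"; // "Zuerich" written with u-umlaut

        // UTF-8 page, ISO-8859-1 container: each UTF-8 byte becomes one character.
        System.out.println(submit(city, "UTF-8", "ISO-8859-1"));

        // ISO-8859-1 page, UTF-8 container: the lone byte 0xFC is not valid UTF-8.
        System.out.println(submit(city, "ISO-8859-1", "UTF-8"));

        // Matching charsets: the value arrives intact.
        System.out.println(submit(city, "UTF-8", "UTF-8"));
    }
}
```

In the first case the u-umlaut arrives as two characters (U+00C3 U+00BC); in
the second it is replaced by the Unicode replacement character, so the
original byte cannot be recovered from the decoded string any more.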
To work around this, we have very little power. The best we can do is try to
force the servlet container to decode the parameter data as ISO-8859-1 and then
recode the raw data in whatever character encoding has been declared with the
"_charset_" request parameter.
Two remarks:
(1) We use ISO-8859-1 because this encoding defines a 1:1 mapping of raw bytes
to characters. In fact, the lower 256 characters of Unicode are exactly the
characters from the ISO-8859-1 encoding. Thus ISO-8859-1 is kind of an identity
encoding.
(2) "Trying to force" the container means that we set the character encoding
to use for reading the input, but if the input has already been read (e.g.
by a filter outside Sling), there is not much we can do any more. This is
probably not much of an issue, but we must be aware of it.
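
The recoding step and remark (1) can be sketched as follows. This is only an
illustration under assumptions: the class and method names are hypothetical and
this is not the code in Sling's Util.java. It assumes the container handed us a
value decoded as ISO-8859-1 and that the form declared its real encoding
(here passed in as declaredCharset, standing in for the "_charset_" value).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper for illustration; not Sling's actual Util class. */
class Latin1Recoder {

    /**
     * Recovers the raw bytes from a value the container decoded as ISO-8859-1
     * and decodes them again with the charset declared by the form.
     */
    static String recode(String latin1Decoded, String declaredCharset) {
        byte[] raw = latin1Decoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, Charset.forName(declaredCharset));
    }

    /** True if every possible byte value survives decode + encode unchanged. */
    static boolean preservesAllBytes(Charset charset) {
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }
        String decoded = new String(all, charset);
        return Arrays.equals(all, decoded.getBytes(charset));
    }

    public static void main(String[] args) {
        // A u-umlaut sent as UTF-8 (0xC3 0xBC) but decoded as ISO-8859-1.
        String fromContainer = "Z\u00c3\u00bcrich";
        System.out.println(recode(fromContainer, "UTF-8").equals("Z\u00fcrich")); // true

        // ISO-8859-1 is lossless for arbitrary bytes; UTF-8 is not.
        System.out.println(preservesAllBytes(StandardCharsets.ISO_8859_1)); // true
        System.out.println(preservesAllBytes(StandardCharsets.UTF_8));      // false
    }
}
```

The second check is the reason for picking ISO-8859-1 as the intermediate
encoding: had the container decoded as UTF-8 first, invalid byte sequences
would already have been replaced and recode() would have nothing to recover.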
[1] http://www.w3.org/TR/html4/appendix/notes.html#h-B.2.1
> Parameter decoding uses wrong default charset
> ---------------------------------------------
>
> Key: SLING-508
> URL: https://issues.apache.org/jira/browse/SLING-508
> Project: Sling
> Issue Type: Bug
> Components: Engine
> Affects Versions: 2.0.0
> Reporter: Tobias Bocanegra
> Assignee: Felix Meschberger
> Priority: Blocker
>
> As of SLING-152 the request parameters are re-encoded if a _charset_
> parameter is present. The code assumes that the default encoding is
> UTF-8, which is not the case for servlet spec compliant containers (e.g.
> Tomcat).
> Change the default encoding to ISO-8859-1 or make it configurable.
> See:
> http://svn.apache.org/viewvc/incubator/sling/trunk/engine/src/main/java/org/apache/sling/engine/impl/parameters/Util.java?view=markup