[ https://issues.apache.org/jira/browse/SHINDIG-46?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567202#action_12567202 ]
Kevin Brown commented on SHINDIG-46:
------------------------------------
The use of byte[] instead of String in RemoteContentFetcher's interface is
intentional -- this is used to fetch binary data as well as text (see
ProxyHandler.fetch).
java.io.StringReader does not appear to strip the BOM from UTF-8 files, and the
leftover BOM causes the XML parsers to choke (which is what Utf8InputStream
addresses). This is a common problem with gadgets authored on Windows. If someone knows a
cleaner way to do this I'm all for it.
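
For illustration, one way to drop the BOM at the byte level before the text ever
reaches a parser. This is only a sketch, not the existing Utf8InputStream code;
the class and method names (BomStripSketch, decodeUtf8WithoutBom) are
hypothetical. It relies on the fact that Java's UTF-8 decoder keeps a leading
EF BB BF as the character U+FEFF rather than discarding it.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Shindig.
final class BomStripSketch {

    /** Returns true if the data starts with the UTF-8 BOM bytes EF BB BF. */
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    /** Decodes the bytes as UTF-8, skipping a leading BOM if one is present. */
    static String decodeUtf8WithoutBom(byte[] data) {
        int offset = hasUtf8Bom(data) ? 3 : 0;
        return new String(data, offset, data.length - offset, StandardCharsets.UTF_8);
    }
}
```

The resulting String can then be handed to a StringReader without the parser
seeing U+FEFF as the first character.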
Really, converting from the source encoding to UTF-8 needs to happen at the point
where we convert from a byte stream to text. This means:
- In the XML processing routines we should pass a String instead of a byte[],
and we must require that the strings be UTF-8 with no BOM.
- RemoteContent should have support for detecting its own character encoding
from the http headers and returning the content body as a string in that
character set as well as the raw bytes. If we can't convert from the claimed
encoding to UTF-8, we fail the request. We'll use these strings to pass to the
XML parsing routines.
- GadgetRenderingServlet, JsonRpcServlet, and ProxyServlet should
explicitly set the UTF-8 output encoding.
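
A rough sketch of the second point, assuming the Content-Type header value is
available as a plain string. The names here (CharsetDecodeSketch,
charsetFromContentType, decodeStrict) are hypothetical and not the
RemoteContent API. Falling back to UTF-8 when no charset parameter is present
is also an assumption, in line with the short-term proposal in the issue. The
decoder is configured to report errors, so bytes that are invalid in the
claimed encoding fail instead of being silently replaced.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, not part of Shindig.
final class CharsetDecodeSketch {

    /**
     * Extracts the charset parameter from a Content-Type header value.
     * Falls back to UTF-8 (an assumption) when the header or parameter is missing.
     * Charset.forName throws an IllegalArgumentException subclass for bad or unknown names.
     */
    static Charset charsetFromContentType(String contentType) {
        if (contentType == null) {
            return StandardCharsets.UTF_8;
        }
        for (String param : contentType.split(";")) {
            String p = param.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim();
                // The parameter value may be quoted, e.g. charset="utf-8".
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                return Charset.forName(name);
            }
        }
        return StandardCharsets.UTF_8;
    }

    /** Decodes the body in the claimed charset; throws if the bytes are not valid in it. */
    static String decodeStrict(byte[] body, String contentType) throws CharacterCodingException {
        return charsetFromContentType(contentType)
                .newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(body))
                .toString();
    }
}
```

A caller would catch CharacterCodingException (and the IllegalArgumentException
raised for an unknown charset name) and fail the request, while keeping the raw
bytes available for binary content.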
> gadgets.io.makeRequest malfunctions on non-ASCII web sites.
> -----------------------------------------------------------
>
> Key: SHINDIG-46
> URL: https://issues.apache.org/jira/browse/SHINDIG-46
> Project: Shindig
> Issue Type: Bug
> Components: Gadgets Server - Java
> Reporter: Brian Eaton
> Assignee: John Hjelmstad
> Attachments: patch
>
>
> See this thread for background:
> http://mail-archives.apache.org/mod_mbox/incubator-shindig-dev/200802.mbox/browser
> Short term, we should change the HTTP proxy code to always use UTF-8 as the
> character set for converting remote content bytes to strings before returning
> them to clients. We should do this ASAP to prevent anyone from becoming
> dependent on the current undefined behavior.
> Long term we might want to add some kind of character set detection, probably
> via the HTTP content-type header. IE style charset content sniffing would
> probably not be a good idea.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.