On Feb 1, 2008 12:41 PM, Brian Eaton <[EMAIL PROTECTED]> wrote:
> The current fetchJson implementation uses "new
> String(results.getByteArray())" to convert the response bytes to a
> string for inclusion in the JSON reply to the gadget. The behavior of
> new String(byte[]) is undefined "when the given bytes are not valid in
> the default charset".
>
> The default charset could be anything, and the returned bytes from the
> remote server could also be anything. This is likely to cause
> problems (data corruption) for gadgets fetching data from non-english
> web sites.
The default charset is almost always utf-8 in practice (unless you've
done something particularly bizarre, like modifying system properties),
but you're right that the back end could be anything. Honestly, the
real answer here is that this should *NOT* be a string at all -- it
should be a sequence of bytes. RemoteContentFetcher should not care
about encoding. What if I'm using this to fetch non-text data, such as
an image file, for the open proxy?

For text data (such as what you would fetch from
gadgets.io.makeRequest), it should always be utf-8. This does mean that
we need to do encoding detection / conversion here. It has nothing to
do with "non-English" web sites, but rather with websites that use
regional character encodings (ISO-8859-1 is probably the most
problematic since it "looks like" ASCII or UTF-8 until you start using
diacritics; Big5 is another likely problem for Chinese-language sites).

> I'll open up a JIRA issue for this, but I wanted to see whether anyone
> had proposals for a solution. The fix will probably involve using
> CharsetDecoder, so we at least have well-defined behavior. How we
> pick the CharsetDecoder to use is an open question. What to do when
> the CharsetDecoding fails is another issue. I'm tempted to put in a
> quick fix that specifies UTF-8 for the character set. That will
> prevent anyone from depending on the current undefined behavior while
> we work out what should happen.

If it can't be converted to utf-8, or we can't detect the encoding, we
simply fail the request. This is consistent with the behavior on
iGoogle today.
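
To make the "fail the request" behavior concrete, something along the
lines of the sketch below is what I have in mind (names are just
illustrative, this isn't committed code). Configuring the CharsetDecoder
with CodingErrorAction.REPORT gives us well-defined behavior: invalid
bytes raise CharacterCodingException, which the caller can turn into a
failed request instead of silently corrupting the data.

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

/**
 * Illustrative sketch only: strict UTF-8 decoding of fetched bytes.
 * If the bytes are not valid UTF-8, decode() throws
 * CharacterCodingException rather than substituting replacement
 * characters, so the fetch can be failed cleanly.
 */
public final class StrictUtf8 {
  private StrictUtf8() {}

  public static String decode(byte[] responseBytes)
      throws CharacterCodingException {
    CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);
    CharBuffer chars = decoder.decode(ByteBuffer.wrap(responseBytes));
    return chars.toString();
  }
}

Whatever layer ends up doing the conversion (makeRequest handling, not
RemoteContentFetcher itself) would catch CharacterCodingException and
return an error response to the gadget.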

