Status: New
Owner: ----

New issue 2875 by [email protected]: Can generate (and parse) invalid UTF-8
http://code.google.com/p/v8/issues/detail?id=2875

If a String contains an unpaired Unicode surrogate (U+D800 through U+DFFF), encoding it as UTF-8 produces an invalid byte sequence, because UTF-8 as defined in RFC 3629 does not allow surrogate code points at all. (For context: this caused us problems because we were relying on a Node.js frontend to output only valid UTF-8 regardless of the validity of user input. Everything worked fine except when the input contained unpaired surrogates, at which point our backend crashed with an encoding error.)
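To illustrate the generating side, here is a minimal Python sketch (not the attached patch). It models a JavaScript-style string as a list of UTF-16 code units, combines correctly paired surrogates, and substitutes U+FFFD for unpaired ones before encoding. The function name `utf16_units_to_utf8` is hypothetical, and replacing with U+FFFD is an assumption about the desired behaviour; the patch may do something different.

```python
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER (assumed substitute)


def utf16_units_to_utf8(units):
    """Encode UTF-16 code units as valid UTF-8.

    Correctly paired surrogates become one supplementary code point;
    unpaired surrogates are replaced by U+FFFD.
    """
    chars = []
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Valid pair: combine into a code point above U+FFFF.
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        elif 0xD800 <= u <= 0xDFFF:
            # Unpaired surrogate: cannot be represented in UTF-8.
            cp = REPLACEMENT
            i += 1
        else:
            cp = u
            i += 1
        chars.append(chr(cp))
    return "".join(chars).encode("utf-8")


valid_pair = utf16_units_to_utf8([0xD83D, 0xDE00])      # U+1F600
lone_high = utf16_units_to_utf8([0x61, 0xD800, 0x62])   # "a", lone surrogate, "b"
```

With this approach the output always decodes under a strict UTF-8 decoder, at the cost of losing the original unpaired code unit.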

I've attached a naive fix as `generate-valid-utf8.patch`. (I say naive because it breaks the tests, and I've not figured out how best to alter them).

Relatedly, surrogates are accepted when parsing UTF-8. This should not be allowed (according to RFC 3629 or UNICODE-TR26); instead they should be replaced by U+FFFD in the same way as other invalid byte sequences.
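As a point of comparison, Python's built-in decoder already behaves this way, so it can serve as a rough sketch of the strict approach. The bytes `ED A0 80` are what U+D800 becomes if the generic three-byte pattern is applied to it; the helper name `decode_strict_utf8` is hypothetical.

```python
def decode_strict_utf8(data: bytes) -> str:
    """Decode UTF-8, replacing every invalid sequence with U+FFFD.

    Python's codec treats encoded surrogates as invalid, like any
    other malformed byte sequence.
    """
    return data.decode("utf-8", errors="replace")


# U+D800 pushed through the generic 3-byte UTF-8 pattern.
encoded_surrogate = bytes([0xED, 0xA0, 0x80])
decoded = decode_strict_utf8(b"a" + encoded_surrogate + b"b")

# No surrogate code point survives decoding.
has_surrogate = any(0xD800 <= ord(ch) <= 0xDFFF for ch in decoded)
```

How many U+FFFD characters a decoder emits for one bad sequence varies between implementations, so the sketch only checks that no surrogate reaches the decoded string.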

I've attached this approach as `parse-utf8-only.patch`.

That said, it may be the case that people are relying on this laxness so that they can use CESU-8 (though I don't have any evidence for this). It may be more pragmatic to ignore the security recommendations in UNICODE-TR26 and continue allowing correctly paired surrogates when decoding UTF-8 so that CESU-8 continues to work. Even in that case, we should still not parse incorrectly paired surrogates, as they are not allowed in either CESU-8 or UTF-8.
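A rough Python sketch of this more lenient behaviour (again, not the attached patch): encoded surrogates are let through the byte-level decoder, correctly paired ones are combined into a single code point as CESU-8 intends, and unpaired ones become U+FFFD. The function name `decode_utf8_or_cesu8` is hypothetical, and for brevity the sketch raises on other kinds of malformed input instead of replacing them.

```python
def decode_utf8_or_cesu8(data: bytes) -> str:
    """Decode UTF-8, additionally accepting CESU-8 surrogate pairs.

    Paired surrogates are combined; unpaired ones become U+FFFD.
    Other malformed input raises UnicodeDecodeError (simplification).
    """
    # 'surrogatepass' yields encoded surrogates as lone code points.
    text = data.decode("utf-8", errors="surrogatepass")
    out = []
    i = 0
    while i < len(text):
        c = ord(text[i])
        is_high = 0xD800 <= c <= 0xDBFF
        has_low = i + 1 < len(text) and 0xDC00 <= ord(text[i + 1]) <= 0xDFFF
        if is_high and has_low:
            low = ord(text[i + 1])
            out.append(chr(0x10000 + ((c - 0xD800) << 10) + (low - 0xDC00)))
            i += 2
        elif 0xD800 <= c <= 0xDFFF:
            out.append("\ufffd")  # incorrectly paired surrogate
            i += 1
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


# CESU-8 form of U+1F600: surrogates D83D and DE00, three bytes each.
cesu8_pair = bytes([0xED, 0xA0, 0xBD, 0xED, 0xB8, 0x80])
paired = decode_utf8_or_cesu8(cesu8_pair)
unpaired = decode_utf8_or_cesu8(bytes([0xED, 0xA0, 0x80]))
```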

I've attached this approach as `parse-utf8-or-cesu8.patch`.

More work will be needed to make any of these patches acceptable, but I'd like to get an idea of which approach you guys would prefer to take.

See also https://code.google.com/p/v8/issues/detail?id=761#c33

Attachments:
        generate-valid-utf8.patch  406 bytes
        parse-utf8-only.patch  555 bytes
        parse-utf8-or-cesu8.patch  1.8 KB
