Reviewers: drcarney,
Description:
String:WriteUtf8: Replace invalid code points
This patch is a work in progress, but I'd appreciate a review at this
point before taking care of the remaining work.
The purpose of this patch is to make String::WriteUtf8 replace invalid
code points (lone surrogate pairs) with the unicode replacement
character. This is done to avoid creating invalid UTF-8 output which
leads to compatibility issues with software requiring valid UTF-8 inputs
(e.g. the WebSocket protocol requires valid UTF-8 and terminates
connections when invalid UTF-8 is encountered).
There are several changes:
* Removed the burden of combining surrogate pairs from Utf8::Encode and
move this logic into Utf8WriterVisitor::WritePair. This is done
because the logic was duplicated between Utf8::Encode and
Utf8WriterVisitor. An additional benefit is the improved separation of
concerns.
* Added a new allow_invalid bool option to Utf8::Encode. If set to false,
surrogate code points are replaced with the unicode replacement
character.
* Added test cases for replacing orphan surrogates in between ascii
characters, as well as replacing strings composed entirely of orphan
lead or trail surrogates.
The patch should compile, and the tests should pass. However, the
following tasks will have to be completed before considering this patch
for inclusion:
* Test concatenated strings. This was suggested by Dan and will require
some additional test cases.
* Add new DISALLOW_INVALID_UTF8 option to the WriteOptions for
WriteUtf8.
* Check for performance regressions. I expect that this patch has no
negative impact on performance, but my intuition is certainly not to
be trusted when it comes to virtual machines.
* Run the entire test suite. I am only running the StringWrite tests at
this point to speed up the development process.
Questions from my end:
* Why are different types used for previous/current code units? It seems
like using uint16_t or int for both would provide greater consistency
and reduce the time required for reasoning about the code.
* Are the code comments I added acceptable? I've used the documentation
style used by Go as I'm quite fond of it, but I'm not sure what the v8
policy on commenting is.
Many thanks for reviewing this in advance!
--fg
BUG=
Please review this at https://codereview.chromium.org/121173009/
SVN Base: git://github.com/v8/v8.git@master
Affected files (+78, -22 lines):
M src/api.cc
M src/unicode-inl.h
M src/unicode.h
M test/cctest/test-api.cc
--
--
v8-dev mailing list
[email protected]
http://groups.google.com/group/v8-dev
---
You received this message because you are subscribed to the Google Groups "v8-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
For more options, visit https://groups.google.com/groups/opt_out.