Reviewers: drcarney,

Description:
String:WriteUtf8: Replace invalid code points

This patch is a work in progress, but I'd appreciate a review at this
point before taking care of the remaining work.

The purpose of this patch is to make String::WriteUtf8 replace invalid
code points (lone surrogate pairs) with the unicode replacement
character. This is done to avoid creating invalid UTF-8 output which
leads to compatibility issues with software requiring valid UTF-8 inputs
(e.g. the WebSocket protocol requires valid UTF-8 and terminates
connections when invalid UTF-8 is encountered).

There are several changes:

* Removed the burden of combining surrogate pairs from Utf8::Encode and
  move this logic into Utf8WriterVisitor::WritePair. This is done
  because the logic was duplicated between Utf8::Encode and
  Utf8WriterVisitor. An additional benefit is the improved separation of
  concerns.
* Added a new allow_invalid bool option to Utf8::Encode. If set to false,
  surrogate code points are replaced with the unicode replacement
  character.
* Added test cases for replacing orphan surrogates in between ascii
  characters, as well as replacing strings composed entirely of orphan
  lead or trail surrogates.

The patch should compile, and the tests should pass. However, the
following tasks will have to be completed before considering this patch
for inclusion:

* Test concatenated strings. This was suggested by Dan and will require
  some additional test cases.
* Add new DISALLOW_INVALID_UTF8 option to the WriteOptions for
  WriteUtf8.
* Check for performance regressions. I expect that this patch has no
  negative impact on performance, but my intuition is certainly not to
  be trusted when it comes to virtual machines.
* Run the entire test suite. I am only running the StringWrite tests at
  this point to speed up the development process.

Questions from my end:

* Why are different types used for previous/current code units? It seems
  like using uint16_t or int for both would provide greater consistency
  and reduce the time required for reasoning about the code.
* Are the code comments I added acceptable? I've used the documentation
  style used by Go as I'm quite fond of it, but I'm not sure what the v8
  policy on commenting is.

Many thanks for reviewing this in advance!
--fg

BUG=

Please review this at https://codereview.chromium.org/121173009/

SVN Base: git://github.com/v8/v8.git@master

Affected files (+78, -22 lines):
  M src/api.cc
  M src/unicode-inl.h
  M src/unicode.h
  M test/cctest/test-api.cc


--
--
v8-dev mailing list
[email protected]
http://groups.google.com/group/v8-dev
--- You received this message because you are subscribed to the Google Groups "v8-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/groups/opt_out.

Reply via email to