[v8-dev] Re: Fix input and output to handle UTF16 surrogate pairs. (issue 9600009)

erik . corry Sun, 11 Mar 2012 12:29:27 -0700

New version uploaded



http://codereview.chromium.org/9600009/diff/14003/src/debug-agent.cc
File src/debug-agent.cc (right):

http://codereview.chromium.org/9600009/diff/14003/src/debug-agent.cc#newcode405
src/debug-agent.cc:405: int kEncodedSurrogateLength = 3;
On 2012/03/07 13:32:47, rossberg wrote:

Maybe move this declaration up and use it instead of literal 3 in the

above

condition. Also, make it const for clarity.


The 3 above is actually a different 3.  I gave them both names in
unicode.h.

http://codereview.chromium.org/9600009/diff/14003/src/objects.cc
File src/objects.cc (right):

http://codereview.chromium.org/9600009/diff/14003/src/objects.cc#newcode6389
src/objects.cc:6389: // flatten the string and retry.  Failures are
caused by surrogate pairs in deep
On 2012/03/07 13:32:47, rossberg wrote:

Not sure I understand the failure mode. Why does the function need to

fail on

these, and not on other recursions?


The old version of this function was constructed as follows:

To get the UTF-8 length of a cons script, recurse on both the car and
the cdr and add the lengths.  But since we don't want arbitrary
recursion that will overflow the stack, we manually transform, the
algorithm:
1) Introduce an accumulator for the length
2) Recurse first on the short side (the one with <= half of the
characters).
3) The recursion on the other side is now a tail call, which we can
manually eliminate, turning it into an iteration.

This way, the recursion depth cannot exceed log2 of the length and all
is good.  WriteToFlat uses the same idea.

But now with UTF-16 we have a new case to consider.  The car may end in
a lead surrogate and the cdr may start with a trail surrogate.  Together
they make a surrogate pair, which is coded as 4 bytes of UTF-8, instead
of two 3-byte encodings (one for each surrogate).    This is easy to
code as recursion, but if we always recurse on the short one first, I
wasn't able to see how to make it into a tail call and transform the
second one into iteration.  If we always recurse on one (I chose the
cdr) then it is still possible to do the transformation.

So the new version always recurses on the cdr and iterates on the car.
This works for most strings, but we can recurse too deeply, in which
case we have to bail out, flatten the string and start over.  Since
these strings are often build-then-use-once-then-discard strings that is
a waste.  Currently WriteUTF flattens them, but there are ways to avoid
that in another CL (eg writing them backwards normally works because
cons strings are almost always unbalanced in the same direction).

http://codereview.chromium.org/9600009/diff/14003/src/objects.cc#newcode6446
src/objects.cc:6446: } else if (max_recursion > 0) {
On 2012/03/07 13:32:47, rossberg wrote:

Why is this the only recursive path actually checking max_recursion?


See above.

http://codereview.chromium.org/9600009/diff/14003/src/objects.cc#newcode6486
src/objects.cc:6486: total -= 2;
On 2012/03/07 13:32:47, rossberg wrote:

I wouldn't mind a comment here, why -2? Is that

kSizeOfUnmatchedSurrogate-1?

No, it's kBytesSavedByCombiningSurrogates.  Fixed.

http://codereview.chromium.org/9600009/diff/14003/src/objects.h
File src/objects.h (right):

http://codereview.chromium.org/9600009/diff/14003/src/objects.h#newcode6772
src/objects.h:6772: inline int Utf8Length() {
On 2012/03/07 13:32:47, rossberg wrote:

Even with the inline definition here, I think a comment is warranted

that a

negative result signals failure, and how to react to that (maybe

combine with

the comment below).


This issue disappears with the move to a handle based function.

http://codereview.chromium.org/9600009/diff/14003/src/objects.h#newcode6780
src/objects.h:6780: static int Utf8Length(String* input,
On 2012/03/07 13:32:47, rossberg wrote:

Please specify the result of these functions in the case where partial
surrogates appear at either end of the string.


Done.  (They are coded as 'illegal' UTF-8 sequences.  The same applies
to unpaired surrogates inside the string.)

http://codereview.chromium.org/9600009/diff/14003/src/objects.h#newcode6784
src/objects.h:6784: int max_recursion,
On 2012/03/07 13:32:47, rossberg wrote:

Is this a useful interface? How is a caller supposed to pick a

suitable

recursion limit? It seems to me that you could as well make it an

internal

constant.

In fact, why not abstract away the flattening and retry internally,

instead of

making it the responsibility of every call site? Is there a reason a

caller

would want to know about this happening, or do something special? (I

didn't see

anything like that in the CL.)


Yes, this is the correct solution.  This means it is a function that
takes a handle, rather than an operation on the raw string pointer.  As
such it moves to handles.h instead of being in objects.h

http://codereview.chromium.org/9600009/diff/14003/src/scanner-character-streams.cc
File src/scanner-character-streams.cc (right):

http://codereview.chromium.org/9600009/diff/14003/src/scanner-character-streams.cc#newcode191
src/scanner-character-streams.cc:191: while (i < length - 1) {
On 2012/03/07 13:32:47, rossberg wrote:

When does this ever consume the last character in the stream if the

previous one

does not happen to be a lead surrogate?


This loop condition does not prevent it from consuming the last UTF-8
byte (the 'break' below is the guardian of that), it merely prevents us
from writing into the last slot of the output buffer.

http://codereview.chromium.org/9600009/diff/14003/src/scanner-character-streams.h
File src/scanner-character-streams.h (right):

http://codereview.chromium.org/9600009/diff/14003/src/scanner-character-streams.h#newcode52
src/scanner-character-streams.h:52: virtual void SlowPushBack(uc16
character);
On 2012/03/08 09:02:40, Lasse Reichstein Nielsen wrote:

uc16->uc32 or utf16?


It should be utf16, but I don't really want to change the name of that
type in this changelist.  It occurs 308 times in the src directory and
it would drown the real changes in this CL.

http://codereview.chromium.org/9600009/diff/14003/src/unicode-inl.h
File src/unicode-inl.h (right):

http://codereview.chromium.org/9600009/diff/14003/src/unicode-inl.h#newcode92
src/unicode-inl.h:92: previous != kNoPreviousCharacter &&
On 2012/03/07 13:32:47, rossberg wrote:

Isn't that implied by Utf16::IsLeadSurrogate(previous)?


No, but it should be.  Fixed.

http://codereview.chromium.org/9600009/diff/14003/src/unicode-inl.h#newcode96
src/unicode-inl.h:96: Utf8::kNoPreviousCharacter) - 3;
On 2012/03/07 13:32:47, rossberg wrote:

kSizeOfUnmatchedSurrogate instead of 3 perhaps (here and above).


Done.

http://codereview.chromium.org/9600009/diff/14003/src/unicode-inl.h#newcode130
src/unicode-inl.h:130: previous != kNoPreviousCharacter &&
On 2012/03/07 13:32:47, rossberg wrote:

See above.


Done.

http://codereview.chromium.org/9600009/diff/14003/src/unicode-inl.h#newcode132
src/unicode-inl.h:132: return 1;
On 2012/03/07 13:32:47, rossberg wrote:

This is 4 - 3 already counted, I suppose.


Fixed

http://codereview.chromium.org/9600009/diff/14003/src/unicode.h
File src/unicode.h (right):

http://codereview.chromium.org/9600009/diff/14003/src/unicode.h#newcode118
src/unicode.h:118: class Utf16 {
On 2012/03/07 13:32:47, rossberg wrote:

Nit: this doesn't quite fit into the above Utf8 section comment.


Done.

http://codereview.chromium.org/9600009/diff/14003/src/unicode.h#newcode126
src/unicode.h:126: static inline int CombineSurrogatePair(uchar lead,
uchar trail) {
On 2012/03/07 13:32:47, rossberg wrote:

Isn't int32_t more accurate as result type?


Done.

http://codereview.chromium.org/9600009/diff/14003/src/unicode.h#newcode130
src/unicode.h:130: static inline uchar LeadSurrogate(int char_code) {
On 2012/03/07 13:32:47, rossberg wrote:

Similar here (and below), isn't char_code an int32?


Done.

http://codereview.chromium.org/9600009/

--
v8-dev mailing list
[email protected]
http://groups.google.com/group/v8-dev

[v8-dev] Re: Fix input and output to handle UTF16 surrogate pairs. (issue 9600009)

Reply via email to