New version uploaded
http://codereview.chromium.org/9600009/diff/14003/src/debug-agent.cc File src/debug-agent.cc (right): http://codereview.chromium.org/9600009/diff/14003/src/debug-agent.cc#newcode405 src/debug-agent.cc:405: int kEncodedSurrogateLength = 3; On 2012/03/07 13:32:47, rossberg wrote:
Maybe move this declaration up and use it instead of literal 3 in the
above
condition. Also, make it const for clarity.
The 3 above is actually a different 3. I gave them both names in unicode.h. http://codereview.chromium.org/9600009/diff/14003/src/objects.cc File src/objects.cc (right): http://codereview.chromium.org/9600009/diff/14003/src/objects.cc#newcode6389 src/objects.cc:6389: // flatten the string and retry. Failures are caused by surrogate pairs in deep On 2012/03/07 13:32:47, rossberg wrote:
Not sure I understand the failure mode. Why does the function need to
fail on
these, and not on other recursions?
The old version of this function was constructed as follows: To get the UTF-8 length of a cons script, recurse on both the car and the cdr and add the lengths. But since we don't want arbitrary recursion that will overflow the stack, we manually transform, the algorithm: 1) Introduce an accumulator for the length 2) Recurse first on the short side (the one with <= half of the characters). 3) The recursion on the other side is now a tail call, which we can manually eliminate, turning it into an iteration. This way, the recursion depth cannot exceed log2 of the length and all is good. WriteToFlat uses the same idea. But now with UTF-16 we have a new case to consider. The car may end in a lead surrogate and the cdr may start with a trail surrogate. Together they make a surrogate pair, which is coded as 4 bytes of UTF-8, instead of two 3-byte encodings (one for each surrogate). This is easy to code as recursion, but if we always recurse on the short one first, I wasn't able to see how to make it into a tail call and transform the second one into iteration. If we always recurse on one (I chose the cdr) then it is still possible to do the transformation. So the new version always recurses on the cdr and iterates on the car. This works for most strings, but we can recurse too deeply, in which case we have to bail out, flatten the string and start over. Since these strings are often build-then-use-once-then-discard strings that is a waste. Currently WriteUTF flattens them, but there are ways to avoid that in another CL (eg writing them backwards normally works because cons strings are almost always unbalanced in the same direction). http://codereview.chromium.org/9600009/diff/14003/src/objects.cc#newcode6446 src/objects.cc:6446: } else if (max_recursion > 0) { On 2012/03/07 13:32:47, rossberg wrote:
Why is this the only recursive path actually checking max_recursion?
See above. http://codereview.chromium.org/9600009/diff/14003/src/objects.cc#newcode6486 src/objects.cc:6486: total -= 2; On 2012/03/07 13:32:47, rossberg wrote:
I wouldn't mind a comment here, why -2? Is that
kSizeOfUnmatchedSurrogate-1? No, it's kBytesSavedByCombiningSurrogates. Fixed. http://codereview.chromium.org/9600009/diff/14003/src/objects.h File src/objects.h (right): http://codereview.chromium.org/9600009/diff/14003/src/objects.h#newcode6772 src/objects.h:6772: inline int Utf8Length() { On 2012/03/07 13:32:47, rossberg wrote:
Even with the inline definition here, I think a comment is warranted
that a
negative result signals failure, and how to react to that (maybe
combine with
the comment below).
This issue disappears with the move to a handle based function. http://codereview.chromium.org/9600009/diff/14003/src/objects.h#newcode6780 src/objects.h:6780: static int Utf8Length(String* input, On 2012/03/07 13:32:47, rossberg wrote:
Please specify the result of these functions in the case where partial surrogates appear at either end of the string.
Done. (They are coded as 'illegal' UTF-8 sequences. The same applies to unpaired surrogates inside the string.) http://codereview.chromium.org/9600009/diff/14003/src/objects.h#newcode6784 src/objects.h:6784: int max_recursion, On 2012/03/07 13:32:47, rossberg wrote:
Is this a useful interface? How is a caller supposed to pick a
suitable
recursion limit? It seems to me that you could as well make it an
internal
constant.
In fact, why not abstract away the flattening and retry internally,
instead of
making it the responsibility of every call site? Is there a reason a
caller
would want to know about this happening, or do something special? (I
didn't see
anything like that in the CL.)
Yes, this is the correct solution. This means it is a function that takes a handle, rather than an operation on the raw string pointer. As such it moves to handles.h instead of being in objects.h http://codereview.chromium.org/9600009/diff/14003/src/scanner-character-streams.cc File src/scanner-character-streams.cc (right): http://codereview.chromium.org/9600009/diff/14003/src/scanner-character-streams.cc#newcode191 src/scanner-character-streams.cc:191: while (i < length - 1) { On 2012/03/07 13:32:47, rossberg wrote:
When does this ever consume the last character in the stream if the
previous one
does not happen to be a lead surrogate?
This loop condition does not prevent it from consuming the last UTF-8 byte (the 'break' below is the guardian of that), it merely prevents us from writing into the last slot of the output buffer. http://codereview.chromium.org/9600009/diff/14003/src/scanner-character-streams.h File src/scanner-character-streams.h (right): http://codereview.chromium.org/9600009/diff/14003/src/scanner-character-streams.h#newcode52 src/scanner-character-streams.h:52: virtual void SlowPushBack(uc16 character); On 2012/03/08 09:02:40, Lasse Reichstein Nielsen wrote:
uc16->uc32 or utf16?
It should be utf16, but I don't really want to change the name of that type in this changelist. It occurs 308 times in the src directory and it would drown the real changes in this CL. http://codereview.chromium.org/9600009/diff/14003/src/unicode-inl.h File src/unicode-inl.h (right): http://codereview.chromium.org/9600009/diff/14003/src/unicode-inl.h#newcode92 src/unicode-inl.h:92: previous != kNoPreviousCharacter && On 2012/03/07 13:32:47, rossberg wrote:
Isn't that implied by Utf16::IsLeadSurrogate(previous)?
No, but it should be. Fixed. http://codereview.chromium.org/9600009/diff/14003/src/unicode-inl.h#newcode96 src/unicode-inl.h:96: Utf8::kNoPreviousCharacter) - 3; On 2012/03/07 13:32:47, rossberg wrote:
kSizeOfUnmatchedSurrogate instead of 3 perhaps (here and above).
Done. http://codereview.chromium.org/9600009/diff/14003/src/unicode-inl.h#newcode130 src/unicode-inl.h:130: previous != kNoPreviousCharacter && On 2012/03/07 13:32:47, rossberg wrote:
See above.
Done. http://codereview.chromium.org/9600009/diff/14003/src/unicode-inl.h#newcode132 src/unicode-inl.h:132: return 1; On 2012/03/07 13:32:47, rossberg wrote:
This is 4 - 3 already counted, I suppose.
Fixed http://codereview.chromium.org/9600009/diff/14003/src/unicode.h File src/unicode.h (right): http://codereview.chromium.org/9600009/diff/14003/src/unicode.h#newcode118 src/unicode.h:118: class Utf16 { On 2012/03/07 13:32:47, rossberg wrote:
Nit: this doesn't quite fit into the above Utf8 section comment.
Done. http://codereview.chromium.org/9600009/diff/14003/src/unicode.h#newcode126 src/unicode.h:126: static inline int CombineSurrogatePair(uchar lead, uchar trail) { On 2012/03/07 13:32:47, rossberg wrote:
Isn't int32_t more accurate as result type?
Done. http://codereview.chromium.org/9600009/diff/14003/src/unicode.h#newcode130 src/unicode.h:130: static inline uchar LeadSurrogate(int char_code) { On 2012/03/07 13:32:47, rossberg wrote:
Similar here (and below), isn't char_code an int32?
Done. http://codereview.chromium.org/9600009/ -- v8-dev mailing list [email protected] http://groups.google.com/group/v8-dev
