[v8-dev] Re: Update UTF-8 decoder to detect more special cases. (issue 1148653007 by [email protected])

vogelheim Thu, 21 May 2015 09:58:53 -0700

I'm afraid this CL has left me largely confused.


https://codereview.chromium.org/1148653007/diff/1/src/unicode.cc
File src/unicode.cc (right):

https://codereview.chromium.org/1148653007/diff/1/src/unicode.cc#newcode205
src/unicode.cc:205: 4, 4, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
The table is difficult to read. It also leaves me rather confused.

- I take it the purpose is to determine the length of an UTF-8 sequence
from it's first byte.
  I'm seeing: 128 x 0, 96 x 2, 16 x 3, 5 x 4, 11 x 0.
  Shouldn't this be: 192 x 0, 32 x 2, 16 x 3, 8 x 4, 8 x 0 ?

- In particular:
  - I don't understand why you count the continuation characters
(0b10xxxxxx) as '2'.
  - I don't understand why you count some of the 4-byte prefixes as
    0/error (0b11110101 + 0b1111011x)

- I'd find an 16x16 layout probably more readable.

https://codereview.chromium.org/1148653007/diff/1/src/unicode.cc#newcode210
src/unicode.cc:210: uchar Utf8::CalculateValue(const byte* str, size_t
max_length, size_t* cursor) {
I believe this deserves some commentary, and if only that it's meant to
be consistent with Blink's UTF decoder.

https://codereview.chromium.org/1148653007/diff/1/src/unicode.cc#newcode210
src/unicode.cc:210: uchar Utf8::CalculateValue(const byte* str, size_t
max_length, size_t* cursor) {
This might also benefit from a unit test that will try some of the edge
cases this is meant to fix.

https://codereview.chromium.org/1148653007/diff/1/src/unicode.cc#newcode217
src/unicode.cc:217: if (length == 2) {
I was trying to figure out *why* these characters are restricted. It
seems ECMAScript specs explicitly allow pretty much all of them.

"All Unicode code point values from U+0000 to U+10FFFF, including
surrogate code points, may occur in source text where permitted by the
ECMAScript grammars." [From an ES6 draft. ES5 final has similar text.]

https://codereview.chromium.org/1148653007/diff/1/src/unicode.cc#newcode217
src/unicode.cc:217: if (length == 2) {
I find the code below to be somewhat confusing. If I get this correctly,
the code below either does the default UTF-8 processing, or for some
'special' code point ranges it does additional checks.

Maybe an alternative would be to fold the dispatching of all those
'special' checks into the length matrix above.

I.e., for the code block starting with 0xE0, make lengths[0xE0] == 0xE0,
etc. Then these blocks below would all look like this:

switch (NonASCIISequenceLength[str[0]) {
  case 0: ... error:
  case 2..4: normal UTF-8 processing
  case 0xE0: extra check, then UTF-8 processing.
  ....
  default: UNREACHABLE();
}

https://codereview.chromium.org/1148653007/

--
--
v8-dev mailing list
[email protected]
http://groups.google.com/group/v8-dev

---You received this message because you are subscribed to the Google Groups "v8-dev" group.

To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/d/optout.

[v8-dev] Re: Update UTF-8 decoder to detect more special cases. (issue 1148653007 by [email protected])

Reply via email to