[v8-dev] Re: Update UTF-8 decoder to detect more special cases. (issue 1148653007 by [email protected])

jochen Fri, 22 May 2015 05:38:38 -0700


https://codereview.chromium.org/1148653007/diff/1/src/unicode.cc
File src/unicode.cc (right):


https://codereview.chromium.org/1148653007/diff/1/src/unicode.cc#newcode205
src/unicode.cc:205: 4, 4, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
On 2015/05/21 at 16:58:38, vogelheim wrote:

The table is difficult to read. It also leaves me rather confused.

- I take it the purpose is to determine the length of an UTF-8

sequence from it's first byte.

   I'm seeing: 128 x 0, 96 x 2, 16 x 3, 5 x 4, 11 x 0.
   Shouldn't this be: 192 x 0, 32 x 2, 16 x 3, 8 x 4, 8 x 0 ?


I updated the matrix to be 16 x 16, and added comments

- In particular:
   - I don't understand why you count the continuation characters

(0b10xxxxxx) as '2'.

fixed

   - I don't understand why you count some of the 4-byte prefixes as
     0/error (0b11110101 + 0b1111011x)


the code points those would encode are outside of the unicode range

- I'd find an 16x16 layout probably more readable.


done

https://codereview.chromium.org/1148653007/diff/1/src/unicode.cc#newcode210
src/unicode.cc:210: uchar Utf8::CalculateValue(const byte* str, size_t
max_length, size_t* cursor) {
On 2015/05/21 at 16:58:38, vogelheim wrote:

I believe this deserves some commentary, and if only that it's meant

to be consistent with Blink's UTF decoder.

yeah, actually, it's supposed to be consistent with the definition of
UTF-8 which the previous version was not :-/

there are tests, and they fail now. I'll update them

https://codereview.chromium.org/1148653007/diff/1/src/unicode.cc#newcode217
src/unicode.cc:217: if (length == 2) {
On 2015/05/21 at 16:58:38, vogelheim wrote:

I was trying to figure out *why* these characters are restricted. It

seems ECMAScript specs explicitly allow pretty much all of them.

"All Unicode code point values from U+0000 to U+10FFFF, including

surrogate code points, may occur in source text where permitted by the
ECMAScript grammars." [From an ES6 draft. ES5 final has similar text.]

right. It's just that UTF-8 cannot encode all of unicode, in particular
not U+D800 through U+DFFF.

https://codereview.chromium.org/1148653007/diff/40001/test/cctest/test-api.cc
File test/cctest/test-api.cc (left):

https://codereview.chromium.org/1148653007/diff/40001/test/cctest/test-api.cc#oldcode7334
test/cctest/test-api.cc:7334: WriteUtf8Helper(context, "b", "alens", 9);
these tests were testing what happens when converting strings with
unmatched surrogate pairs back and forth between utf8 and utf16. now
that we no longer support surrogate pairs in utf8, those tests don't
make sense anymore

https://codereview.chromium.org/1148653007/diff/40001/test/cctest/test-api.cc#oldcode7368
test/cctest/test-api.cc:7368: "\355\240\201\355\260\205");  // 2 3-byte
surrogates.
same here.

if you convert the test to utf-16, it's a bit pointless as there are no
two alternative encodings for this code point, it's always an surrogate
pair.

https://codereview.chromium.org/1148653007/

--
--
v8-dev mailing list
[email protected]
http://groups.google.com/group/v8-dev

---You received this message because you are subscribed to the Google Groups "v8-dev" group.

To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/d/optout.

[v8-dev] Re: Update UTF-8 decoder to detect more special cases. (issue 1148653007 by [email protected])

Reply via email to