Re: [Qemu-devel] [PATCH 24/56] json: Accept overlong \xC0\x80 as U+0000 ("modified UTF-8")

2018-08-17 Thread Markus Armbruster
Markus Armbruster  writes:

> Eric Blake  writes:
>
>> On 08/10/2018 10:48 AM, Eric Blake wrote:
>>> On 08/08/2018 07:03 AM, Markus Armbruster wrote:
 This is consistent with qobject_to_json().  See commit e2ec3f97680.
>>>
>>> Side note: that commit mentions that on output, ASCII DEL (0x7f) is
>>> always escaped. RFC 7159 does not require it to be escaped on input,
>
> Weird, isn't it?
>
>>> but I wonder if any of your earlier testsuite improvements should
>>> specifically cover \x7f vs. \u007f on input being canonicalized to
>>> \u007f on round trip output.
>
> From utf8_string():
>
> /* 2.2.1  1 byte U+007F */
> {
> "\x7F",
> "\x7F",
> "\\u007F",
> },
>
> We test parsing of JSON "\x7F" (expecting C string "\x7F"), unparsing of
> that C string (expecting JSON "\\u007F"), and after PATCH 29 parsing of
> that JSON (expecting the C string again).  Sufficient?
>

 Signed-off-by: Markus Armbruster 
 ---
   qobject/json-lexer.c  | 2 +-
   qobject/json-parser.c | 2 +-
   tests/check-qjson.c   | 8 +---
   3 files changed, 3 insertions(+), 9 deletions(-)

 diff --git a/qobject/json-lexer.c b/qobject/json-lexer.c
 index ca1e0e2c03..36fb665b12 100644
 --- a/qobject/json-lexer.c
 +++ b/qobject/json-lexer.c
 @@ -93,7 +93,7 @@
*   interpolation = %((l|ll|I64)[du]|[ipsf])
*
    * Note:
 - * - Input must be encoded in UTF-8.
 + * - Input must be encoded in modified UTF-8.
>>>
>>> Worth documenting this in the QMP doc as an explicit extension?
>
> qmp-spec.txt:
>
> The sever expects its input to be encoded in UTF-8, and sends its
> output encoded in ASCII.
>
> The obvious update would be to stick in "modified".

Not really necessary, because:

* Before this patch, the JSON parser rejects \0 as ASCII control
  character, and \xC0\x80 as overlong UTF-8.

  Note that PATCH 17 fixed rejection of \0 in JSON strings.  PATCH 21
  fixed rejection of invalid UTF-8, but \xC0\x80 wasn't broken.

* This patch makes \xC0\x80 pass the "invalid UTF-8" check, only to get
  rejected as ASCII control character.  The error message changes,
  that's all.

The patch's benefit is consistency with the other direction:
qobject_to_json() maps \xC0\x80 to \\u.  I guess my commit message
should explain this a bit better.

[...]



Re: [Qemu-devel] [PATCH 24/56] json: Accept overlong \xC0\x80 as U+0000 ("modified UTF-8")

2018-08-14 Thread Markus Armbruster
Eric Blake  writes:

> On 08/13/2018 02:00 AM, Markus Armbruster wrote:
>> Eric Blake  writes:
>>
>>> On 08/10/2018 10:48 AM, Eric Blake wrote:
 On 08/08/2018 07:03 AM, Markus Armbruster wrote:
> This is consistent with qobject_to_json().  See commit e2ec3f97680.

 Side note: that commit mentions that on output, ASCII DEL (0x7f) is
 always escaped. RFC 7159 does not require it to be escaped on input,
>>
>> Weird, isn't it?
>>
 but I wonder if any of your earlier testsuite improvements should
 specifically cover \x7f vs. \u007f on input being canonicalized to
 \u007f on round trip output.
>>
>>>From utf8_string():
>>
>>  /* 2.2.1  1 byte U+007F */
>>  {
>>  "\x7F",
>>  "\x7F",
>>  "\\u007F",
>>  },
>>
>> We test parsing of JSON "\x7F" (expecting C string "\x7F"), unparsing of
>> that C string (expecting JSON "\\u007F"), and after PATCH 29 parsing of
>> that JSON (expecting the C string again).  Sufficient?
>
> Indeed, looks like we have coverage of DEL thanks to the bounds
> testing of various interesting UTF-8 inflection points.
>
> +++ b/qobject/json-lexer.c
> @@ -93,7 +93,7 @@
> *   interpolation = %((l|ll|I64)[du]|[ipsf])
> *
>     * Note:
> - * - Input must be encoded in UTF-8.
> + * - Input must be encoded in modified UTF-8.

 Worth documenting this in the QMP doc as an explicit extension?
>>
>> qmp-spec.txt:
>>
>>  The sever expects its input to be encoded in UTF-8, and sends its
>>  output encoded in ASCII.
>>
>> The obvious update would be to stick in "modified".
>>
  In
 general, our QMP interfaces that take binary input do so via base64
 encoding, rather than via a modified UTF-8 string -
>>
>> Agreed.
>>
>> However, whether QMP has a use for funny characters or not, the JSON
>> parser has to handle them *somehow*.  "Handle" in the broadest possible
>> sense, including "reject".  Not including misbehavior like "crash" and
>> "silently ignore some input following ASCII NUL".
>>
>
>>
>>>  So it's really
>>> only a question of whether our input engine can pass "\x00"
>>> vs. "\\u" when we NEED an input NUL, and except for the testsuite,
>>> our QAPI schema never really needs an input NUL.
>>
>> The question is how the JSON parser is to handle "\u" escapes in
>> JSON strings and NUL bytes anywhere.
>>
>> The answer for NUL bytes is obvious: reject them just like any other
>> byte <= 0x1F, as required by the RFC.
>
> Yes, rejecting \x00 on input is fine, which leaves only the escape
> inside strings:
>
>>
>> My answer for \u is to treat it as much like any other codepoint <=
>> 0x1F as possible.  Treating it exactly like them isn't possible, because
>> NUL bytes in C strings aren't possible.  However, \xC0\x80 sequences
>> are.
>
> Yes, for internal processing to use modified UTF-8 so that we can
> accept "\\u" on input is reasonable. And even nicer if we turn a
> QString containing \xc0\x80 back into "\\u" on output, so that our
> use of modified UTF-8 never even escapes to the QMP client (and then
> we don't need a documentation change).

We do, in to_json():

for (; *ptr; ptr = end) {
--->cp = mod_utf8_codepoint(ptr, 6, );
switch (cp) {
[special cases like \n...]
default:
if (cp < 0) {
cp = 0xFFFD; /* replacement character */
}
if (cp > 0x) {
/* beyond BMP; need a surrogate pair */
snprintf(buf, sizeof(buf), "\\u%04X\\u%04X",
 0xD800 + ((cp - 0x1) >> 10),
 0xDC00 + ((cp - 0x1) & 0x3FF));
} else if (cp < 0x20 || cp >= 0x7F) {
--->snprintf(buf, sizeof(buf), "\\u%04X", cp);
} else {
buf[0] = cp;
buf[1] = 0;
}
qstring_append(str, buf);
}

mod_utf8_codepoint() recognizes \xC0\x80 as codepoint 0.  That's an
ASCII control character, and those gets emitted using format \\u%04X.

>>
>> Reasonably simple and consistent, don't you think?
>>



Re: [Qemu-devel] [PATCH 24/56] json: Accept overlong \xC0\x80 as U+0000 ("modified UTF-8")

2018-08-13 Thread Eric Blake

On 08/13/2018 02:00 AM, Markus Armbruster wrote:

Eric Blake  writes:


On 08/10/2018 10:48 AM, Eric Blake wrote:

On 08/08/2018 07:03 AM, Markus Armbruster wrote:

This is consistent with qobject_to_json().  See commit e2ec3f97680.


Side note: that commit mentions that on output, ASCII DEL (0x7f) is
always escaped. RFC 7159 does not require it to be escaped on input,


Weird, isn't it?


but I wonder if any of your earlier testsuite improvements should
specifically cover \x7f vs. \u007f on input being canonicalized to
\u007f on round trip output.



From utf8_string():


 /* 2.2.1  1 byte U+007F */
 {
 "\x7F",
 "\x7F",
 "\\u007F",
 },

We test parsing of JSON "\x7F" (expecting C string "\x7F"), unparsing of
that C string (expecting JSON "\\u007F"), and after PATCH 29 parsing of
that JSON (expecting the C string again).  Sufficient?


Indeed, looks like we have coverage of DEL thanks to the bounds testing 
of various interesting UTF-8 inflection points.



+++ b/qobject/json-lexer.c
@@ -93,7 +93,7 @@
*   interpolation = %((l|ll|I64)[du]|[ipsf])
*
    * Note:
- * - Input must be encoded in UTF-8.
+ * - Input must be encoded in modified UTF-8.


Worth documenting this in the QMP doc as an explicit extension?


qmp-spec.txt:

 The sever expects its input to be encoded in UTF-8, and sends its
 output encoded in ASCII.

The obvious update would be to stick in "modified".


 In
general, our QMP interfaces that take binary input do so via base64
encoding, rather than via a modified UTF-8 string -


Agreed.

However, whether QMP has a use for funny characters or not, the JSON
parser has to handle them *somehow*.  "Handle" in the broadest possible
sense, including "reject".  Not including misbehavior like "crash" and
"silently ignore some input following ASCII NUL".






 So it's really
only a question of whether our input engine can pass "\x00"
vs. "\\u" when we NEED an input NUL, and except for the testsuite,
our QAPI schema never really needs an input NUL.


The question is how the JSON parser is to handle "\u" escapes in
JSON strings and NUL bytes anywhere.

The answer for NUL bytes is obvious: reject them just like any other
byte <= 0x1F, as required by the RFC.


Yes, rejecting \x00 on input is fine, which leaves only the escape 
inside strings:




My answer for \u is to treat it as much like any other codepoint <=
0x1F as possible.  Treating it exactly like them isn't possible, because
NUL bytes in C strings aren't possible.  However, \xC0\x80 sequences
are.


Yes, for internal processing to use modified UTF-8 so that we can accept 
"\\u" on input is reasonable. And even nicer if we turn a QString 
containing \xc0\x80 back into "\\u" on output, so that our use of 
modified UTF-8 never even escapes to the QMP client (and then we don't 
need a documentation change).




Reasonably simple and consistent, don't you think?



--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [Qemu-devel] [PATCH 24/56] json: Accept overlong \xC0\x80 as U+0000 ("modified UTF-8")

2018-08-13 Thread Markus Armbruster
Eric Blake  writes:

> On 08/10/2018 10:48 AM, Eric Blake wrote:
>> On 08/08/2018 07:03 AM, Markus Armbruster wrote:
>>> This is consistent with qobject_to_json().  See commit e2ec3f97680.
>>
>> Side note: that commit mentions that on output, ASCII DEL (0x7f) is
>> always escaped. RFC 7159 does not require it to be escaped on input,

Weird, isn't it?

>> but I wonder if any of your earlier testsuite improvements should
>> specifically cover \x7f vs. \u007f on input being canonicalized to
>> \u007f on round trip output.

>From utf8_string():

/* 2.2.1  1 byte U+007F */
{
"\x7F",
"\x7F",
"\\u007F",
},

We test parsing of JSON "\x7F" (expecting C string "\x7F"), unparsing of
that C string (expecting JSON "\\u007F"), and after PATCH 29 parsing of
that JSON (expecting the C string again).  Sufficient?

>>>
>>> Signed-off-by: Markus Armbruster 
>>> ---
>>>   qobject/json-lexer.c  | 2 +-
>>>   qobject/json-parser.c | 2 +-
>>>   tests/check-qjson.c   | 8 +---
>>>   3 files changed, 3 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/qobject/json-lexer.c b/qobject/json-lexer.c
>>> index ca1e0e2c03..36fb665b12 100644
>>> --- a/qobject/json-lexer.c
>>> +++ b/qobject/json-lexer.c
>>> @@ -93,7 +93,7 @@
>>>*   interpolation = %((l|ll|I64)[du]|[ipsf])
>>>*
>>>    * Note:
>>> - * - Input must be encoded in UTF-8.
>>> + * - Input must be encoded in modified UTF-8.
>>
>> Worth documenting this in the QMP doc as an explicit extension?

qmp-spec.txt:

The sever expects its input to be encoded in UTF-8, and sends its
output encoded in ASCII.

The obvious update would be to stick in "modified".

>> In
>> general, our QMP interfaces that take binary input do so via base64
>> encoding, rather than via a modified UTF-8 string -

Agreed.

However, whether QMP has a use for funny characters or not, the JSON
parser has to handle them *somehow*.  "Handle" in the broadest possible
sense, including "reject".  Not including misbehavior like "crash" and
"silently ignore some input following ASCII NUL".

>> and I don't know
>> how yajl or jansson would feel about an extension for producing
>> modified UTF-8 for QMP to consume if we really did want to pass NUL
>> bytes without the overhead of UTF-8; what's more, even if you can
>> pass NUL, you still have to worry about all other byte sequences
>> being valid (so base64 is still better for true binary data - it's
>> hard to argue that we'd ever have an interface where we want UTF-8
>> including embedded NUL rather than true binary).  I guess it can
>> also be argued that outputting modified UTF-8 is a violation of
>> JSON, so the fact that we can round-trip NUL doesn't help if the
>> client can't read it.
>>
>> So having typed all that, I guess the answer is no, we don't want to
>> document it; for now, the fact that we accept \xc0\x80 on input and
>> produce it on output is only for the testsuite, and unlikely to
>> matter to any real client of QMP.
>
> Actually, I guess we never output \xc0\x80; but would output the C
> string "\\u" (since any byte above 0x1f is passed through our UTF
> decoder back into a codepoint then output with \u).

to_json() converts the C string sequence by sequence.  Valid sequences
in the BMP other than ASCII control characters (\x00..\x1F and \x7F) are
copied unchanged.  Everything else is escaped.

> So it's really
> only a question of whether our input engine can pass "\x00"
> vs. "\\u" when we NEED an input NUL, and except for the testsuite,
> our QAPI schema never really needs an input NUL.

The question is how the JSON parser is to handle "\u" escapes in
JSON strings and NUL bytes anywhere.

The answer for NUL bytes is obvious: reject them just like any other
byte <= 0x1F, as required by the RFC.

My answer for \u is to treat it as much like any other codepoint <=
0x1F as possible.  Treating it exactly like them isn't possible, because
NUL bytes in C strings aren't possible.  However, \xC0\x80 sequences
are.

Reasonably simple and consistent, don't you think?



Re: [Qemu-devel] [PATCH 24/56] json: Accept overlong \xC0\x80 as U+0000 ("modified UTF-8")

2018-08-10 Thread Eric Blake

On 08/10/2018 10:48 AM, Eric Blake wrote:


   * Note:
- * - Input must be encoded in UTF-8.
+ * - Input must be encoded in modified UTF-8.


Worth documenting this in the QMP doc as an explicit extension? In 
general, our QMP interfaces that take binary input do so via base64 
encoding, rather than via a modified UTF-8 string - and I don't know how 
yajl or jansson would feel about an extension for producing modified 
UTF-8 for QMP to consume if we really did want to pass NUL bytes without 
the overhead of UTF-8; what's more, even if you can pass NUL, you still 
have to worry about all other byte sequences being valid (so base64 is 
still better for true binary data - it's hard to argue that we'd ever 
have an interface where we want UTF-8 including embedded NUL rather than 
true binary).  I guess it can also be argued that outputting modified 
UTF-8 is a violation of JSON, so the fact that we can round-trip NUL 
doesn't help if the client can't read it.


So having typed all that, I guess the answer is no, we don't want to 
document it; for now, the fact that we accept \xc0\x80 on input and 
produce it on output is only for the testsuite, and unlikely to matter 
to any real client of QMP.


Actually, I guess we never output \xc0\x80; but would output the C 
string "\\u" (since any byte above 0x1f is passed through our UTF 
decoder back into a codepoint then output with \u). So it's really only 
a question of whether our input engine can pass "\x00" vs. "\\u" 
when we NEED an input NUL, and except for the testsuite, our QAPI schema 
never really needs an input NUL.


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



Re: [Qemu-devel] [PATCH 24/56] json: Accept overlong \xC0\x80 as U+0000 ("modified UTF-8")

2018-08-10 Thread Eric Blake

On 08/08/2018 07:03 AM, Markus Armbruster wrote:

This is consistent with qobject_to_json().  See commit e2ec3f97680.


Side note: that commit mentions that on output, ASCII DEL (0x7f) is 
always escaped. RFC 7159 does not require it to be escaped on input, but 
I wonder if any of your earlier testsuite improvements should 
specifically cover \x7f vs. \u007f on input being canonicalized to 
\u007f on round trip output.




Signed-off-by: Markus Armbruster 
---
  qobject/json-lexer.c  | 2 +-
  qobject/json-parser.c | 2 +-
  tests/check-qjson.c   | 8 +---
  3 files changed, 3 insertions(+), 9 deletions(-)

diff --git a/qobject/json-lexer.c b/qobject/json-lexer.c
index ca1e0e2c03..36fb665b12 100644
--- a/qobject/json-lexer.c
+++ b/qobject/json-lexer.c
@@ -93,7 +93,7 @@
   *   interpolation = %((l|ll|I64)[du]|[ipsf])
   *
   * Note:
- * - Input must be encoded in UTF-8.
+ * - Input must be encoded in modified UTF-8.


Worth documenting this in the QMP doc as an explicit extension? In 
general, our QMP interfaces that take binary input do so via base64 
encoding, rather than via a modified UTF-8 string - and I don't know how 
yajl or jansson would feel about an extension for producing modified 
UTF-8 for QMP to consume if we really did want to pass NUL bytes without 
the overhead of UTF-8; what's more, even if you can pass NUL, you still 
have to worry about all other byte sequences being valid (so base64 is 
still better for true binary data - it's hard to argue that we'd ever 
have an interface where we want UTF-8 including embedded NUL rather than 
true binary).  I guess it can also be argued that outputting modified 
UTF-8 is a violation of JSON, so the fact that we can round-trip NUL 
doesn't help if the client can't read it.


So having typed all that, I guess the answer is no, we don't want to 
document it; for now, the fact that we accept \xc0\x80 on input and 
produce it on output is only for the testsuite, and unlikely to matter 
to any real client of QMP.


Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org