Re: [xml] Patch to fix ICU flush and pivot buffer

2017-11-08 Thread Joel Hockey
Yes, I will update chromium with this as per
https://cs.chromium.org/chromium/src/third_party/libxml/chromium/roll.py

On Thu, Nov 9, 2017 at 10:35 AM, Jungshik Shin (신정식, 申政湜) <
js...@chromium.org> wrote:

> Thank you, Joel and Nick !
>
> Joel:  I guess you're gonna roll libxml in the Chromium tree to a version
> including these changes.
>
> Jungshik
>
> 2017-11-08 15:22 GMT-08:00 Joel Hockey :
>
>> Thanks Nick.  Nice work with the test.
>>
>>
>>
>> On Sun, Nov 5, 2017 at 2:04 AM, Nick Wellnhofer 
>> wrote:
>>
>>> On 26/10/2017 03:17, Joel Hockey wrote:
>>>
 I've updated the patch using git format-patch.

>>>
>>> Thanks for the updated patch. Applied here:
>>> https://git.gnome.org/browse/libxml2/commit/?id=0b19f236a263
>>> a7b0acacd4ea84dc7237303ee3d9
>>>
>>> The original bug found by fuzzer only relates to UTF8 decoding, so using
 Shift-JIS or anything else wont help.

>>>
>>> Why not? My reasoning was that ICU uses the same code path for all
>>> variable-width encodings. I simply converted your test file to EUC-JP and
>>> it turns out that this triggers the bug as well:
>>>
>>> https://git.gnome.org/browse/libxml2/commit/?id=72182550926d
>>> 31ad17357bd3ed69e49d7e69df02
>>>
>>> Nick
>>>
>>
>>
>
___
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
https://mail.gnome.org/mailman/listinfo/xml


Re: [xml] Patch to fix ICU flush and pivot buffer

2017-11-08 Thread Joel Hockey
Thanks Nick.  Nice work with the test.



On Sun, Nov 5, 2017 at 2:04 AM, Nick Wellnhofer  wrote:

> On 26/10/2017 03:17, Joel Hockey wrote:
>
>> I've updated the patch using git format-patch.
>>
>
> Thanks for the updated patch. Applied here: https://git.gnome.org/browse/l
> ibxml2/commit/?id=0b19f236a263a7b0acacd4ea84dc7237303ee3d9
>
> The original bug found by fuzzer only relates to UTF8 decoding, so using
>> Shift-JIS or anything else wont help.
>>
>
> Why not? My reasoning was that ICU uses the same code path for all
> variable-width encodings. I simply converted your test file to EUC-JP and
> it turns out that this triggers the bug as well:
>
> https://git.gnome.org/browse/libxml2/commit/?id=72182550926d
> 31ad17357bd3ed69e49d7e69df02
>
> Nick
>
___
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
https://mail.gnome.org/mailman/listinfo/xml


Re: [xml] Patch to fix ICU flush and pivot buffer

2017-11-04 Thread Nick Wellnhofer

On 26/10/2017 03:17, Joel Hockey wrote:

I've updated the patch using git format-patch.


Thanks for the updated patch. Applied here: 
https://git.gnome.org/browse/libxml2/commit/?id=0b19f236a263a7b0acacd4ea84dc7237303ee3d9


The original bug found by fuzzer only relates to UTF8 decoding, so using 
Shift-JIS or anything else wont help.


Why not? My reasoning was that ICU uses the same code path for all 
variable-width encodings. I simply converted your test file to EUC-JP and it 
turns out that this triggers the bug as well:


https://git.gnome.org/browse/libxml2/commit/?id=72182550926d31ad17357bd3ed69e49d7e69df02

Nick
___
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
https://mail.gnome.org/mailman/listinfo/xml


Re: [xml] Patch to fix ICU flush and pivot buffer

2017-10-29 Thread Joel Hockey
Nick, how does that updated patch look?  Are you happy to take it?



On Thu, Oct 26, 2017 at 10:03 PM, Joel Hockey 
wrote:

>
>> Does libxml treat 'UTF8' (without dash/hyphen) as UTF-8 ?  If not, 'UTF8'
>> can be used for both ICU and iconv.
>>
>
> Yes.  https://cs.chromium.org/chromium/src/third_party/
> libxml/src/parser.c?l=10329=b54509c3db126e5a3ed9b84fa70df1f821b1fd3e
>
>
>
___
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
https://mail.gnome.org/mailman/listinfo/xml


Re: [xml] Patch to fix ICU flush and pivot buffer

2017-10-26 Thread Joel Hockey
>
>
> Does libxml treat 'UTF8' (without dash/hyphen) as UTF-8 ?  If not, 'UTF8'
> can be used for both ICU and iconv.
>

Yes.
https://cs.chromium.org/chromium/src/third_party/libxml/src/parser.c?l=10329=b54509c3db126e5a3ed9b84fa70df1f821b1fd3e
___
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
https://mail.gnome.org/mailman/listinfo/xml


Re: [xml] Patch to fix ICU flush and pivot buffer

2017-10-25 Thread Joel Hockey
I've updated the patch using git format-patch.

* reverted public interface xmlCharEncInFunc.  It now calls
xmlEncInputChunk with flush=1 on all calls as suggested.
Always setting flush=TRUE here makes sense since this is a one-shot
conversion of the entire buffer.

* Moved the pivot buf reset to only happen on success (suggestion from
Markus).

* removed test/icu_parse_test.xml from patch.  This file is attached
separately to this email.
The original bug found by fuzzer only relates to UTF8 decoding, so using
Shift-JIS or anything else wont help.  I can't think of any way that this
test can simultaneously
 - be effective to test the bug when libxml is compiled with icu
 - still parse correctly and pass when libxml is compiled with iconv
The trick is that we need an encoding which will:
 - not be handled natively by libxml
 - be recognized by icu as UTF8
 - be recognized by iconv as UTF8


On Thu, Oct 26, 2017 at 3:21 AM, Nick Wellnhofer 
wrote:

> On 25/10/2017 17:40, Markus Scherer wrote:
>
>> On Wed, Oct 25, 2017 at 4:02 AM, Nick Wellnhofer >  The patch changes public function xmlCharEncInFunc but this function isn't
>> used internally anymore (since commit a78d8036 from 2012). It might
>> still
>> be used in client code that wants to use libxml2's character
>> conversion
>> facilities, though. Maybe it's better to remove the `flush` parameter
>> and
>> always call xmlEncInputChunk with `flush` set to 1. This should at
>> least
>> allow one-shot character conversions without breaking the public API.
>>
>> The ICU conversion functions must be called with flush=FALSE before the
>> end of the stream, and with flush=TRUE at the end of the stream.
>>
>
> Yes, but I'm only talking about xmlCharEncInFunc which isn't used
> internally.
>
> Nick
>
>
From f495b5546927032fb5b3988d66949d3d1b735aa9 Mon Sep 17 00:00:00 2001
From: Joel Hockey 
Date: Wed, 25 Oct 2017 18:11:12 -0700
Subject: [PATCH] Fixed ICU to set flush correctly and provide pivot buffer.

By always setting flush=TRUE when doing multiple reads, ICU
will not correctly handle truncated utf8 chars across read
boundaries.

The fix is to set flush=TRUE only on final read, and to
provide a pivot buffer which is maintained by libxml
between calls to ucnv_convertEx.
---
 encoding.c| 46 +-
 include/libxml/encoding.h |  5 +
 testapi.c |  2 +-
 3 files changed, 31 insertions(+), 22 deletions(-)

diff --git a/encoding.c b/encoding.c
index cd019c51..5dac8a14 100644
--- a/encoding.c
+++ b/encoding.c
@@ -110,6 +110,9 @@ openIcuConverter(const char* name, int toUnicode)
   if (conv == NULL)
 return NULL;
 
+  conv->pivot_source = conv->pivot_buf;
+  conv->pivot_target = conv->pivot_buf;
+
   conv->uconv = ucnv_open(name, );
   if (U_FAILURE(status))
 goto error;
@@ -1850,6 +1853,7 @@ xmlIconvWrapper(iconv_t cd, unsigned char *out, int *outlen,
  * @outlen:  the length of @out
  * @in:  a pointer to an array of ISO Latin 1 chars
  * @inlen:  the length of @in
+ * @flush: if true, indicates end of input
  *
  * Returns 0 if success, or
  * -1 by lack of space, or
@@ -1863,7 +1867,7 @@ xmlIconvWrapper(iconv_t cd, unsigned char *out, int *outlen,
  */
 static int
 xmlUconvWrapper(uconv_t *cd, int toUnicode, unsigned char *out, int *outlen,
-const unsigned char *in, int *inlen) {
+const unsigned char *in, int *inlen, int flush) {
 const char *ucv_in = (const char *) in;
 char *ucv_out = (char *) out;
 UErrorCode err = U_ZERO_ERROR;
@@ -1873,33 +1877,31 @@ xmlUconvWrapper(uconv_t *cd, int toUnicode, unsigned char *out, int *outlen,
 return(-1);
 }
 
-/*
- * TODO(jungshik)
- * 1. is ucnv_convert(To|From)Algorithmic better?
- * 2. had we better use an explicit pivot buffer?
- * 3. error returned comes from 'fromUnicode' only even
- *when toUnicode is true !
- */
 if (toUnicode) {
 /* encoding => UTF-16 => UTF-8 */
 ucnv_convertEx(cd->utf8, cd->uconv, _out, ucv_out + *outlen,
-   _in, ucv_in + *inlen, NULL, NULL, NULL, NULL,
-   0, TRUE, );
+   _in, ucv_in + *inlen, cd->pivot_buf,
+   >pivot_source, >pivot_target,
+   cd->pivot_buf + ICU_PIVOT_BUF_SIZE, 0, flush, );
 } else {
 /* UTF-8 => UTF-16 => encoding */
 ucnv_convertEx(cd->uconv, cd->utf8, _out, ucv_out + *outlen,
-   _in, ucv_in + *inlen, NULL, NULL, NULL, NULL,
-   0, TRUE, );
+   _in, ucv_in + *inlen, cd->pivot_buf,
+   >pivot_source, >pivot_target,
+   cd->pivot_buf + ICU_PIVOT_BUF_SIZE, 0, flush, );
 }
 *inlen = ucv_in - (const char*) in;
 *outlen = ucv_out - (char *) out;
-if (U_SUCCESS(err))
+if 

Re: [xml] Patch to fix ICU flush and pivot buffer

2017-10-25 Thread Nick Wellnhofer

On 25/10/2017 17:40, Markus Scherer wrote:
On Wed, Oct 25, 2017 at 4:02 AM, Nick Wellnhofer 

Re: [xml] Patch to fix ICU flush and pivot buffer

2017-10-25 Thread Nick Wellnhofer

On 25/10/2017 10:32, Joel Hockey wrote:

This patch fixes those issues.


Looks good.

The patch changes public function xmlCharEncInFunc but this function isn't 
used internally anymore (since commit a78d8036 from 2012). It might still be 
used in client code that wants to use libxml2's character conversion 
facilities, though. Maybe it's better to remove the `flush` parameter and 
always call xmlEncInputChunk with `flush` set to 1. This should at least allow 
one-shot character conversions without breaking the public API.


The test sets encoding to "UTF8-".  This encoding is chosen since it is not 
recognized by libxml and forces the decoding to be done by ICU which 
recognizes this encoding as UTF8.  Unfortunately, the test always fails when 
using iconv since iconv does not recognize this encoding.  Since iconv is the 
default in libxml, I understand that it may not make sense to always run this 
testcase, or to include it at all.


It should be possible to write a test that works with both iconv and ICU and 
triggers the bug with a different variable-width encoding like Shift-JIS.


If it is any help, I can send this patch as a pull request to the libxml 
github repo, or I can also create a libxml bug.  It looks like libxml 
preference is to take patches on the mailing list.


I prefer patches created with `git format-patch` that include author 
information and a commit message, either via the mailing list or Bugzilla.


Nick
___
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
https://mail.gnome.org/mailman/listinfo/xml


[xml] Patch to fix ICU flush and pivot buffer

2017-10-25 Thread Joel Hockey
Hi,

The chromium team have recently detected a fuzz-testing bug in libxml / ICU
where UTF8 chars can be decoded incorrectly.  See http://crbug.com/722420.

The root cause of this problem is that libxml is calling ICU ucnv_convertEx
with incorrect params.  It is always setting flush to TRUE.  This param
should only be set to true for the last call when reading an input.

Also, when calling ucnv_convertEx multiple times (with flush=FALSE), the
caller must provide a pivot buffer which is maintained between calls.

This patch fixes those issues.  The patch includes test/icu_parse_test.xml
to reproduce the error.

./configure --with-icu --with-iconv=no
make runtest
./runtest

The test sets encoding to "UTF8-".  This encoding is chosen since it is not
recognized by libxml and forces the decoding to be done by ICU which
recognizes this encoding as UTF8.  Unfortunately, the test always fails
when using iconv since iconv does not recognize this encoding.  Since iconv
is the default in libxml, I understand that it may not make sense to always
run this testcase, or to include it at all.

This patch has been reviewed by Markus Scherer (maintainer of ICU), and
Jungshik in the chromium team who worked on the original integration of
libxml and ICU in 2007.  http://crosreview.com/729616.  The patch is ready
to submit against the chromium local copy of libxml, but it is preferable
to have this accepted into libxml if you are happy with it, and we can then
take the patch from libxml and stay in sync.

If it is any help, I can send this patch as a pull request to the libxml
github repo, or I can also create a libxml bug.  It looks like libxml
preference is to take patches on the mailing list.

Thanks,
Joel
diff --git a/encoding.c b/encoding.c
index cd019c51..617e1ed7 100644
--- a/encoding.c
+++ b/encoding.c
@@ -110,6 +110,9 @@ openIcuConverter(const char* name, int toUnicode)
   if (conv == NULL)
 return NULL;
 
+  conv->pivot_source = conv->pivot_buf;
+  conv->pivot_target = conv->pivot_buf;
+
   conv->uconv = ucnv_open(name, );
   if (U_FAILURE(status))
 goto error;
@@ -1850,6 +1853,7 @@ xmlIconvWrapper(iconv_t cd, unsigned char *out, int *outlen,
  * @outlen:  the length of @out
  * @in:  a pointer to an array of ISO Latin 1 chars
  * @inlen:  the length of @in
+ * @flush: if true, indicates end of input
  *
  * Returns 0 if success, or
  * -1 by lack of space, or
@@ -1863,7 +1867,7 @@ xmlIconvWrapper(iconv_t cd, unsigned char *out, int *outlen,
  */
 static int
 xmlUconvWrapper(uconv_t *cd, int toUnicode, unsigned char *out, int *outlen,
-const unsigned char *in, int *inlen) {
+const unsigned char *in, int *inlen, int flush) {
 const char *ucv_in = (const char *) in;
 char *ucv_out = (char *) out;
 UErrorCode err = U_ZERO_ERROR;
@@ -1873,33 +1877,30 @@ xmlUconvWrapper(uconv_t *cd, int toUnicode, unsigned char *out, int *outlen,
 return(-1);
 }
 
-/*
- * TODO(jungshik)
- * 1. is ucnv_convert(To|From)Algorithmic better?
- * 2. had we better use an explicit pivot buffer?
- * 3. error returned comes from 'fromUnicode' only even
- *when toUnicode is true !
- */
 if (toUnicode) {
 /* encoding => UTF-16 => UTF-8 */
 ucnv_convertEx(cd->utf8, cd->uconv, _out, ucv_out + *outlen,
-   _in, ucv_in + *inlen, NULL, NULL, NULL, NULL,
-   0, TRUE, );
+   _in, ucv_in + *inlen, cd->pivot_buf,
+   >pivot_source, >pivot_target,
+   cd->pivot_buf + ICU_PIVOT_BUF_SIZE, 0, flush, );
 } else {
 /* UTF-8 => UTF-16 => encoding */
 ucnv_convertEx(cd->uconv, cd->utf8, _out, ucv_out + *outlen,
-   _in, ucv_in + *inlen, NULL, NULL, NULL, NULL,
-   0, TRUE, );
+   _in, ucv_in + *inlen, cd->pivot_buf,
+   >pivot_source, >pivot_target,
+   cd->pivot_buf + ICU_PIVOT_BUF_SIZE, 0, flush, );
 }
 *inlen = ucv_in - (const char*) in;
 *outlen = ucv_out - (char *) out;
+/* reset pivot buffer if this is the last call for input (flush==TRUE) */
+if (flush)
+cd->pivot_source = cd->pivot_target = cd->pivot_buf;
 if (U_SUCCESS(err))
 return 0;
 if (err == U_BUFFER_OVERFLOW_ERROR)
 return -1;
 if (err == U_INVALID_CHAR_FOUND || err == U_ILLEGAL_CHAR_FOUND)
 return -2;
-/* if (err == U_TRUNCATED_CHAR_FOUND) */
 return -3;
 }
 #endif /* LIBXML_ICU_ENABLED */
@@ -1912,7 +1913,7 @@ xmlUconvWrapper(uconv_t *cd, int toUnicode, unsigned char *out, int *outlen,
 
 static int
 xmlEncInputChunk(xmlCharEncodingHandler *handler, unsigned char *out,
- int *outlen, const unsigned char *in, int *inlen) {
+ int *outlen, const unsigned char *in, int *inlen, int flush) {
 int ret;
 
 if (handler->input != NULL) {
@@ -1925,7 +1926,8 @@