[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8

2018-04-15 Thread JIRA

[ 
https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438878#comment-16438878
 ] 

Jesper Steen Møller commented on XALANJ-2419:
-

[~thetaphi]: Now I see what you mean (perhaps): Yes, there is a very tricky 
similar bug in the attribute values of ToHTMLStream, but not in the general 
case (I think it's OK due to line 1440-1447 in writeAttrString, but have *not* 
tested this.)

I only see the issue for ToHTMLStream in the case of URL attributes such as 
A#HREF, where the output has explicitly been set to *not* encoded as am URL 
(line 1294 in writeAttrURI). The default is to escape HTML attributes 
containing URLs using URL-encoding, unless overridden with 
xalan:use-url-escaping=yes in the XSLT output options.

(As an aside: I'm no expert, but the UTF-8 encoder inside the URL-encoding 
(line 1208-1285 in writeAttrURI) seems legit, if a little verbose, instead of 
just doing String.getBytes(UTF_8) and hexing that)

My v2 fix above does *not* address the corner case in line 1294.

> Astral characters written as a pair of NCRs with the surrogate scalar values 
> when using UTF-8
> -
>
> Key: XALANJ-2419
> URL: https://issues.apache.org/jira/browse/XALANJ-2419
> Project: XalanJ2
>  Issue Type: Bug
>  Components: Serialization
>Affects Versions: 2.7.1
>Reporter: Henri Sivonen
>Priority: Major
> Attachments: XALANJ-2419-fix-v2.txt, XALANJ-2419-tests-v2.txt
>
>
> org.apache.xml.serializer.ToStream contains the following code:
> else if (m_encodingInfo.isInEncoding(ch)) {
> // If the character is in the encoding, and
> // not in the normal ASCII range, we also
> // just leave it get added on to the clean characters
> 
> }
> else {
> // This is a fallback plan, we should never get here
> // but if the character wasn't previously handled
> // (i.e. isn't in the encoding, etc.) then what
> // should we do?  We choose to write out an entity
> writeOutCleanChars(chars, i, lastDirtyCharProcessed);
> writer.write("

[jira] [Comment Edited] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8

2018-04-15 Thread JIRA

[ 
https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438864#comment-16438864
 ] 

Jesper Steen Møller edited comment on XALANJ-2419 at 4/15/18 11:22 PM:
---

[~thetaphi]: So - I fixed it, but will anybody care?

I mean, it's been almost 8 years...


was (Author: jespersm):
[~thetaphi]: I'm sure it could be fixed, but would anybody care?

I mean, it's been almost 8 years...

> Astral characters written as a pair of NCRs with the surrogate scalar values 
> when using UTF-8
> -
>
> Key: XALANJ-2419
> URL: https://issues.apache.org/jira/browse/XALANJ-2419
> Project: XalanJ2
>  Issue Type: Bug
>  Components: Serialization
>Affects Versions: 2.7.1
>Reporter: Henri Sivonen
>Priority: Major
> Attachments: XALANJ-2419-fix-v2.txt, XALANJ-2419-tests-v2.txt
>
>
> org.apache.xml.serializer.ToStream contains the following code:
> else if (m_encodingInfo.isInEncoding(ch)) {
> // If the character is in the encoding, and
> // not in the normal ASCII range, we also
> // just leave it get added on to the clean characters
> 
> }
> else {
> // This is a fallback plan, we should never get here
> // but if the character wasn't previously handled
> // (i.e. isn't in the encoding, etc.) then what
> // should we do?  We choose to write out an entity
> writeOutCleanChars(chars, i, lastDirtyCharProcessed);
> writer.write("

[jira] [Updated] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8

2018-04-15 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jesper Steen Møller updated XALANJ-2419:

Attachment: XALANJ-2419-tests-v2.txt

> Astral characters written as a pair of NCRs with the surrogate scalar values 
> when using UTF-8
> -
>
> Key: XALANJ-2419
> URL: https://issues.apache.org/jira/browse/XALANJ-2419
> Project: XalanJ2
>  Issue Type: Bug
>  Components: Serialization
>Affects Versions: 2.7.1
>Reporter: Henri Sivonen
>Priority: Major
> Attachments: XALANJ-2419-fix-v2.txt, XALANJ-2419-fix.txt, 
> XALANJ-2419-tests-v2.txt, XALANJ-2419-tests.txt
>
>
> org.apache.xml.serializer.ToStream contains the following code:
> else if (m_encodingInfo.isInEncoding(ch)) {
> // If the character is in the encoding, and
> // not in the normal ASCII range, we also
> // just leave it get added on to the clean characters
> 
> }
> else {
> // This is a fallback plan, we should never get here
> // but if the character wasn't previously handled
> // (i.e. isn't in the encoding, etc.) then what
> // should we do?  We choose to write out an entity
> writeOutCleanChars(chars, i, lastDirtyCharProcessed);
> writer.write("

[jira] [Updated] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8

2018-04-15 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jesper Steen Møller updated XALANJ-2419:

Attachment: (was: XALANJ-2419-fix.txt)

> Astral characters written as a pair of NCRs with the surrogate scalar values 
> when using UTF-8
> -
>
> Key: XALANJ-2419
> URL: https://issues.apache.org/jira/browse/XALANJ-2419
> Project: XalanJ2
>  Issue Type: Bug
>  Components: Serialization
>Affects Versions: 2.7.1
>Reporter: Henri Sivonen
>Priority: Major
> Attachments: XALANJ-2419-fix-v2.txt, XALANJ-2419-tests-v2.txt, 
> XALANJ-2419-tests.txt
>
>
> org.apache.xml.serializer.ToStream contains the following code:
> else if (m_encodingInfo.isInEncoding(ch)) {
> // If the character is in the encoding, and
> // not in the normal ASCII range, we also
> // just leave it get added on to the clean characters
> 
> }
> else {
> // This is a fallback plan, we should never get here
> // but if the character wasn't previously handled
> // (i.e. isn't in the encoding, etc.) then what
> // should we do?  We choose to write out an entity
> writeOutCleanChars(chars, i, lastDirtyCharProcessed);
> writer.write("

[jira] [Updated] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8

2018-04-15 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jesper Steen Møller updated XALANJ-2419:

Attachment: XALANJ-2419-fix-v2.txt

> Astral characters written as a pair of NCRs with the surrogate scalar values 
> when using UTF-8
> -
>
> Key: XALANJ-2419
> URL: https://issues.apache.org/jira/browse/XALANJ-2419
> Project: XalanJ2
>  Issue Type: Bug
>  Components: Serialization
>Affects Versions: 2.7.1
>Reporter: Henri Sivonen
>Priority: Major
> Attachments: XALANJ-2419-fix-v2.txt, XALANJ-2419-fix.txt, 
> XALANJ-2419-tests-v2.txt, XALANJ-2419-tests.txt
>
>
> org.apache.xml.serializer.ToStream contains the following code:
> else if (m_encodingInfo.isInEncoding(ch)) {
> // If the character is in the encoding, and
> // not in the normal ASCII range, we also
> // just leave it get added on to the clean characters
> 
> }
> else {
> // This is a fallback plan, we should never get here
> // but if the character wasn't previously handled
> // (i.e. isn't in the encoding, etc.) then what
> // should we do?  We choose to write out an entity
> writeOutCleanChars(chars, i, lastDirtyCharProcessed);
> writer.write("

[jira] [Comment Edited] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8

2018-04-15 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438848#comment-16438848
 ] 

Uwe Schindler edited comment on XALANJ-2419 at 4/15/18 10:48 PM:
-

Unfortunately the patch does not fix the problem for attributes. Those got 
better with it, but it outputs the correct char and then the second half char 
of the surrogate as decimal escape.

The Policeman Emoji is serialized with the patch correctly, if part of a text 
node. This is fixed by this patch.

But inside an attribute the policeman emoji comes out like:

{code:xml}
 Astral characters written as a pair of NCRs with the surrogate scalar values 
> when using UTF-8
> -
>
> Key: XALANJ-2419
> URL: https://issues.apache.org/jira/browse/XALANJ-2419
> Project: XalanJ2
>  Issue Type: Bug
>  Components: Serialization
>Affects Versions: 2.7.1
>Reporter: Henri Sivonen
>Priority: Major
> Attachments: XALANJ-2419-fix.txt, XALANJ-2419-tests.txt
>
>
> org.apache.xml.serializer.ToStream contains the following code:
> else if (m_encodingInfo.isInEncoding(ch)) {
> // If the character is in the encoding, and
> // not in the normal ASCII range, we also
> // just leave it get added on to the clean characters
> 
> }
> else {
> // This is a fallback plan, we should never get here
> // but if the character wasn't previously handled
> // (i.e. isn't in the encoding, etc.) then what
> // should we do?  We choose to write out an entity
> writeOutCleanChars(chars, i, lastDirtyCharProcessed);
> writer.write("

[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8

2018-04-15 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438848#comment-16438848
 ] 

Uwe Schindler commented on XALANJ-2419:
---

Unfortunately the patch does not fix the problem for attributes. Those got 
better with it, but it outputs the correct char and then the second half char 
of the surrogate as decimal escape.

The Policeman Emoji is serialized with the patch correctly, if part of a text 
node. This is fixed by this patch.

But inside an attribute the policeman emoji comes out like:

{code:xml}
 Astral characters written as a pair of NCRs with the surrogate scalar values 
> when using UTF-8
> -
>
> Key: XALANJ-2419
> URL: https://issues.apache.org/jira/browse/XALANJ-2419
> Project: XalanJ2
>  Issue Type: Bug
>  Components: Serialization
>Affects Versions: 2.7.1
>Reporter: Henri Sivonen
>Priority: Major
> Attachments: XALANJ-2419-fix.txt, XALANJ-2419-tests.txt
>
>
> org.apache.xml.serializer.ToStream contains the following code:
> else if (m_encodingInfo.isInEncoding(ch)) {
> // If the character is in the encoding, and
> // not in the normal ASCII range, we also
> // just leave it get added on to the clean characters
> 
> }
> else {
> // This is a fallback plan, we should never get here
> // but if the character wasn't previously handled
> // (i.e. isn't in the encoding, etc.) then what
> // should we do?  We choose to write out an entity
> writeOutCleanChars(chars, i, lastDirtyCharProcessed);
> writer.write("