[jira] [Comment Edited] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8

2024-01-24 Thread Jira


[ 
https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17810716#comment-17810716
 ] 

Cédric Damioli edited comment on XALANJ-2419 at 1/25/24 7:37 AM:
-

I think I may know this one!

It reminds me of an issue with Encodings.properties being loaded in a different 
order in Java 8 and Java 9+, which causes failures on modern JVMs because the 
file references nonexistent encodings.

Could it be related?

In my case I modified Encodings.properties by removing all 8859_* encodings, 
and everything worked again.

See XALANJ-2625 and XALANJ-2618.


was (Author: cedric):
I think I may know this one!

It reminds me of an issue with Encodings.properties being loaded in a different 
order in Java 8 and Java 9+, which causes failures on modern JVMs because the 
file references nonexistent encodings.

Could it be related?

In my case I modified Encodings.properties by removing all 8859_* encodings, 
and everything worked again.

I'm pretty sure that a Jira issue exists about this one, but I can't find it 
anymore ...

> Astral characters written as a pair of NCRs with the surrogate scalar values 
> when using UTF-8
> -
>
> Key: XALANJ-2419
> URL: https://issues.apache.org/jira/browse/XALANJ-2419
> Project: XalanJ2
>  Issue Type: Bug
>  Components: Serialization
>Affects Versions: 2.7.1
>Reporter: Henri Sivonen
>Assignee: Joe Kesselman
>Priority: Major
> Fix For: The Latest Development Code
>
> Attachments: XALANJ-2419-fix-v3.txt, XALANJ-2419-tests-v3.txt
>
>
> org.apache.xml.serializer.ToStream contains the following code:
>     else if (m_encodingInfo.isInEncoding(ch)) {
>         // If the character is in the encoding, and
>         // not in the normal ASCII range, we also
>         // just leave it get added on to the clean characters
>
>     }
>     else {
>         // This is a fallback plan, we should never get here
>         // but if the character wasn't previously handled
>         // (i.e. isn't in the encoding, etc.) then what
>         // should we do?  We choose to write out an entity
>         writeOutCleanChars(chars, i, lastDirtyCharProcessed);
>         writer.write("&#");
>         writer.write(Integer.toString(ch));
>         writer.write(';');
>         lastDirtyCharProcessed = i;
>     }
> This leads to the wrong (latter) if branch running for surrogates, because 
> isInEncoding() for UTF-8 returns false for surrogates. It is always wrong 
> (regardless of encoding) to escape a surrogate as an NCR.
> The practical effect of this bug is that any document with astral characters 
> in it ends up in an ill-formed serialization and does not parse back using an 
> XML parser.
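
For illustration, here is a minimal Java sketch of the behaviour the report 
asks for: a surrogate pair is combined into a single code point before any NCR 
is written, and an unpaired surrogate is rejected rather than escaped. The 
class and method names are hypothetical. This is not the ToStream code or the 
attached fix, and it only handles non-ASCII characters (XML-special characters 
such as & and < are left alone).

```java
public class SurrogateNcrSketch {

    /**
     * Hypothetical helper: writes ASCII as-is and every other character as a
     * decimal NCR, using one NCR per code point (never one per surrogate).
     */
    public static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            char ch = text.charAt(i);
            if (ch < 0x80) {
                // Plain ASCII passes through unchanged.
                out.append(ch);
                i++;
            } else if (Character.isHighSurrogate(ch)
                    && i + 1 < text.length()
                    && Character.isLowSurrogate(text.charAt(i + 1))) {
                // Combine both halves into the astral code point first.
                int codePoint = Character.toCodePoint(ch, text.charAt(i + 1));
                out.append("&#").append(codePoint).append(';');
                i += 2;
            } else if (Character.isSurrogate(ch)) {
                // A lone surrogate has no valid NCR form in XML.
                throw new IllegalArgumentException("Unpaired surrogate at index " + i);
            } else {
                // BMP character outside ASCII: its char value is its code point.
                out.append("&#").append((int) ch).append(';');
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+1F600 is stored as the surrogate pair D83D DE00 in a Java String.
        System.out.println(escapeNonAscii("a\uD83D\uDE00b")); // prints a&#128512;b
    }
}
```

The buggy branch quoted above would instead emit `&#55357;&#56832;` for the 
same input, which an XML parser rejects.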



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: dev-unsubscr...@xalan.apache.org
For additional commands, e-mail: dev-h...@xalan.apache.org



[jira] [Comment Edited] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8

2024-01-24 Thread Joe Kesselman (Jira)


[ 
https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17810609#comment-17810609
 ] 

Joe Kesselman edited comment on XALANJ-2419 at 1/25/24 12:59 AM:
-

Arggh. Found the difference in invocation, I think. It's an annoying one.

The failing version is using my commandline's current default binding of 
"java", which runs through the /etc/alternatives system to invoke  
`/usr/lib/jvm/java-17-openjdk-17.0.8.0.7-1.fc37.x86_64/bin/java`

The succeeding versions explicitly invoke /usr/lib/jvm/jre-1.8.0/bin/java, 
which /etc/alternatives eventually maps to 
`/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.382.b05-2.fc37.x86_64/jre/bin/java`

If I change my one-liner to use that 1.8 jre rather than my current default 17, 
the errors vanish.

In the vain hope that the problem was specifically OpenJDK 17, I tried it with 
`/jre-21-openjdk-21.0.1.0.12-1.rolling.fc37.x86_64`. Fails there too.

So *something* is Java-version sensitive and changed some time after Java 1.8: 
a bug fixed, a new bug, something redefined, something formatted differently, 
something ordered differently. May be ... _interesting_ ... to track down.



At least we now know how to provoke the divergent behavior in the debugger for 
study.

Deep breath. Let it out slowly. Recite the mantra: "If it was easy, they 
wouldn't need people like us."


was (Author: JIRAUSER285361):
Arggh. Found the difference in invocation, I think. It's an annoying one.

The failing version is using my commandline's current default binding of 
"java", which runs through the /etc/alternatives system to invoke  
`/usr/lib/jvm/java-17-openjdk-17.0.8.0.7-1.fc37.x86_64/bin/java`

The succeeding versions explicitly invoke /usr/lib/jvm/jre-1.8.0/bin/java, 
which /etc/alternatives eventually maps to 
`/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.382.b05-2.fc37.x86_64/jre/bin/java`

If I change my one-liner to use that 1.8 jre rather than my current default 17, 
the errors vanish.

In the vain hope that the problem was specifically OpenJDK 17, I tried it with 
`/jre-21-openjdk-21.0.1.0.12-1.rolling.fc37.x86_64`. Fails there too.


So *something* is being java-version sensitive and changed some time after Java 
1.8. A bug fixed, a new bug, something redefined, something formatted 
differently, something ordered differently. May be ... +_interesting_+ ... to 
track down. 

At least we now know how to provoke the divergent behavior in the debugger for 
study. 



Deep breath. Let it out slowly. Recite the mantra: "If it was easy, they 
wouldn't need people like us."



[jira] [Comment Edited] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8

2024-01-23 Thread Joe Kesselman (Jira)


[ 
https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17810125#comment-17810125
 ] 

Joe Kesselman edited comment on XALANJ-2419 at 1/23/24 10:41 PM:
-

Everything should have been carried forward. That may have been done manually 
and history may have been lost, but I don't believe any actual work has been 
lost.

In any case, Master *IS* where new development is going. So if you can find 
anything which has not been addressed there, please flag it for our attention. 
Just don't assume that the absence of a particular commit, or a particular 
merge, means something is missing. Real-world git histories sometimes get 
messy, especially when operated by real-world humans. And don't let past 
mistakes get in the way of doing what's right in the future.

("Begin by assuming a spherical cow...")


was (Author: JIRAUSER285361):
Everything should have been carried forward. That may have been done manually 
and history may have been lost, but I don't believe any actual work has been 
lost.

In any case, Master *IS* where new development is going. So if you can find 
anything which has not been addressed there, please flag it for our attention. 
Just don't assume that the absence of a particular commit, or a particular 
merge, means something is missing. Real-world git histories sometimes get 
messy, especially when operated by real-world humans.

("Begin by assuming a spherical cow...")




[jira] [Comment Edited] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8

2024-01-23 Thread Max (Jira)


[ 
https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17810043#comment-17810043
 ] 

Max edited comment on XALANJ-2419 at 1/23/24 5:08 PM:
--

[~kesh...@alum.mit.edu], thank you for working on this. As you suggested, why 
don't you merge what works, and afterwards I can try to help you work on the 
split-buffer issue, on top of known-good code?

 


was (Author: maxfortun):
[~kesh...@alum.mit.edu] , thank you for working on this. As you suggested, why 
don't you merge what works and I can try to help you work on the split buffer 
issue after on a good code?

 




[jira] [Comment Edited] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8

2024-01-19 Thread Joseph Kessselman (Jira)


[ 
https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808815#comment-17808815
 ] 

Joseph Kessselman edited comment on XALANJ-2419 at 1/20/24 7:13 AM:


(apitest rather than smoketest, but it's there.)

Seeing a few oddities in astrals. Thought I had that running. Investigating.

I still need to look at @max's ToStream buffer-bounds tweak and see if that 
still applies. And at whether it ought to be replicated in 
ToXMLStream/ToHTMLStream to replace their handling of the surrogate-pair case; 
arguably so...?
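
To make the buffer-boundary case concrete, here is a hedged Java sketch of one 
way a serializer could cope with a surrogate pair that is split across two 
`characters()` calls: it holds the high surrogate until the low one arrives in 
the next chunk. Everything here is hypothetical (class name, method names, and 
the simplifying assumption that the output encoding is ASCII-only, so every 
other character becomes an NCR). It is not the ToStream patch under discussion.

```java
public class SplitSurrogateSketch {
    private final StringBuilder out = new StringBuilder();

    // High surrogate seen at the end of a previous chunk; 0 means none pending.
    private char pendingHigh = 0;

    /** Processes one chunk, as a SAX-style characters() callback might deliver it. */
    public void characters(char[] chars, int start, int length) {
        for (int i = start; i < start + length; i++) {
            char ch = chars[i];
            if (pendingHigh != 0) {
                // The previous char (possibly from the previous chunk) was a high surrogate.
                if (!Character.isLowSurrogate(ch)) {
                    throw new IllegalStateException("High surrogate not followed by low surrogate");
                }
                appendNcr(Character.toCodePoint(pendingHigh, ch));
                pendingHigh = 0;
            } else if (Character.isHighSurrogate(ch)) {
                // Do not write anything yet; the low half may be in the next chunk.
                pendingHigh = ch;
            } else if (Character.isLowSurrogate(ch)) {
                throw new IllegalStateException("Low surrogate without preceding high surrogate");
            } else if (ch < 0x80) {
                out.append(ch);
            } else {
                appendNcr(ch);
            }
        }
    }

    /** Returns the serialized text; fails if input stopped in the middle of a pair. */
    public String finish() {
        if (pendingHigh != 0) {
            throw new IllegalStateException("Input ended inside a surrogate pair");
        }
        return out.toString();
    }

    private void appendNcr(int codePoint) {
        out.append("&#").append(codePoint).append(';');
    }
}
```

Feeding `{'a', '\uD83D'}` and then `{'\uDE00', 'b'}` yields the same output as 
passing the four chars in one call, which is the property a split-buffer test 
would want to check.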

 

 


was (Author: jkesselm):
(apitest rather than smoketest, but it's there.)

I still need to look at @max's ToStream buffer-bounds tweak and see if that 
still applies. And at whether it ought to be replicated in 
ToXMLStream/ToHTMLStream to replace their handling of the surrogate-pair case; 
arguably so...?

 

 




[jira] [Comment Edited] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8

2024-01-19 Thread Joseph Kessselman (Jira)


[ 
https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808815#comment-17808815
 ] 

Joseph Kessselman edited comment on XALANJ-2419 at 1/20/24 6:54 AM:


I still need to look at @max's ToStream buffer-bounds tweak and see if that 
still applies. And at whether it ought to be replicated in 
ToXMLStream/ToHTMLStream to replace their handling of the surrogate-pair case; 
arguably so...?

 

 


was (Author: jkesselm):
Well, it's in apitest rather than smoketest, but it's there.

Need to look at @max's ToStream buffer-bounds tweak and see if that still 
applies.

 

 




[jira] [Comment Edited] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8

2024-01-19 Thread Joseph Kessselman (Jira)


[ 
https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808815#comment-17808815
 ] 

Joseph Kessselman edited comment on XALANJ-2419 at 1/20/24 6:54 AM:


(apitest rather than smoketest, but it's there.)

I still need to look at @max's ToStream buffer-bounds tweak and see if that 
still applies. And at whether it ought to be replicated in 
ToXMLStream/ToHTMLStream to replace their handling of the surrogate-pair case; 
arguably so...?

 

 


was (Author: jkesselm):
I still need to look at @max's ToStream buffer-bounds tweak and see if that 
still applies. And at whether it ought to be replicated in 
ToXMLStream/ToHTMLStream to replace their handling of the surrogate-pair case; 
arguably so...?

 

 




[jira] [Comment Edited] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8

2019-02-22 Thread JIRA


[ 
https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775205#comment-16775205
 ] 

Jesper Steen Møller edited comment on XALANJ-2419 at 2/22/19 2:43 PM:
--

OK, now I get what's wrong with the encoding. It's bad.

The file `Encodings.properties` contains mappings between several "Java names" 
(which probably made sense in the last millennium) and MIME names (which are 
what should appear in the XML inputs). There's some logic to register only the 
first mapping, but since it iterates over a Properties object, the iteration 
order will NOT match the order in the file. In other words, which one wins is 
unpredictable. That's why it suddenly worked when you specified ISO8859_1. I 
don't know what changed for Java 11 (it could be the hashtable ordering or the 
set of accepted charset names), but the crux is that "8859-1" is NOT an 
acceptable encoding name in Java 11:
{code:java}
jshell> "\u00e8".getBytes("ISO-8859-1")
$1 ==> byte[1] { -24 } 
jshell> "\u00e8".getBytes("8859_1")
$2 ==> byte[1] { -24 }
 
jshell> "\u00e8".getBytes("8859-1")
|  Exception java.io.UnsupportedEncodingException: 8859-1
|        at StringCoding.encode (StringCoding.java:427)
|        at String.getBytes (String.java:941)
|        at (#3:1)
jshell> "\u00e8".getBytes("ISO8859-1")
$4 ==> byte[1] { -24 }
jshell> "\u00e8".getBytes("ISO8859_1")
$5 ==> byte[1] { -24 }
jshell>
{code}
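
The same check can be scripted instead of typed into jshell. The sketch below 
is only an illustration: the class and method names are made up, and it relies 
solely on `java.nio.charset.Charset.isSupported`. Which aliases a given JVM 
accepts can differ between Java versions, so no output is shown for the names 
in `main`.

```java
import java.nio.charset.Charset;

public class CharsetNameCheck {

    /** Returns true if this JVM can resolve the given charset name or alias. */
    public static boolean isUsable(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            // Covers IllegalCharsetNameException (a subclass), thrown for
            // names that are not even syntactically legal charset names.
            return false;
        }
    }

    public static void main(String[] args) {
        // The candidate names tried in the jshell session above.
        String[] candidates = {"ISO-8859-1", "8859_1", "8859-1", "ISO8859-1", "ISO8859_1"};
        for (String name : candidates) {
            System.out.println(name + " -> " + isUsable(name));
        }
    }
}
```

Running it under each JVM in question gives a quick list of which "Java names" 
from the encodings file that runtime would actually accept.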
 

Possible fix: Remove the line "8859-1     ISO-8859-1     0x00FF" and similar 
patterns from `Encodings.properties`?
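
Separately from removing the bad entries, the ordering problem itself can be 
sidestepped by reading the file line by line. The Java sketch below is a rough 
illustration under stated assumptions: it assumes each entry has the shape 
"JavaName MIME-name(s) highChar" as in the line quoted above, with MIME names 
possibly comma-separated, and all class and method names are hypothetical. It 
is not how the Xalan serializer loads its table.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OrderedEncodingTable {

    /**
     * Maps each MIME name to the first Java name that lists it, honouring the
     * order of lines in the file text (unlike iterating a java.util.Properties,
     * which is hashtable-backed and makes no ordering promise).
     */
    public static Map<String, String> firstJavaNamePerMime(String fileText) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String rawLine : fileText.split("\\r?\\n")) {
            String line = rawLine.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            String[] fields = line.split("\\s+");
            if (fields.length < 2) {
                continue; // not an entry of the assumed shape
            }
            String javaName = fields[0];
            for (String mimeName : fields[1].split(",")) {
                // putIfAbsent keeps the earliest entry in file order.
                result.putIfAbsent(mimeName, javaName);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String sample = "ISO8859_1  ISO-8859-1  0x00FF\n"
                      + "8859-1     ISO-8859-1  0x00FF\n";
        System.out.println(firstJavaNamePerMime(sample)); // prints {ISO-8859-1=ISO8859_1}
    }
}
```

With file order preserved, "first one wins" becomes deterministic, so the 
result no longer depends on how a particular Java version orders its hashtable.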


was (Author: jespersm):
OK, now I get what's wrong with the encoding. It's bad.

The file `Encodings.properties` contains mappings between several "Java names" 
(which probably made sense in the last millennium) and MIME names (which are 
what should appear in the XML inputs). There's some logic to register only the 
first mapping, but since it iterates over a Properties object, the iteration 
order will NOT match the order in the file. In other words, which one wins is 
unpredictable. That's why it suddenly worked when you specified ISO8859_1. I 
don't know what changed for Java 11 (it could be the hashtable ordering or the 
set of accepted charset names), but the crux is that "8859-1" is NOT an 
acceptable Java name:
{code:java}
jshell> "\u00e8".getBytes("ISO-8859-1")
$1 ==> byte[1] { -24 } 
jshell> "\u00e8".getBytes("8859_1")
$2 ==> byte[1] { -24 }
 
jshell> "\u00e8".getBytes("8859-1")
|  Exception java.io.UnsupportedEncodingException: 8859-1
|        at StringCoding.encode (StringCoding.java:427)
|        at String.getBytes (String.java:941)
|        at (#3:1)
jshell> "\u00e8".getBytes("ISO8859-1")
$4 ==> byte[1] { -24 }
jshell> "\u00e8".getBytes("ISO8859_1")
$5 ==> byte[1] { -24 }
jshell>
{code}
 

Possible fix: Remove the line "8859-1     ISO-8859-1     0x00FF" and similar 
patterns from `Encodings.properties`?


[jira] [Comment Edited] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8

2019-02-21 Thread Jason Harrop (JIRA)


[ 
https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16774590#comment-16774590
 ] 

Jason Harrop edited comment on XALANJ-2419 at 2/21/19 11:31 PM:


When I run the smoketest under Java 8, all tests pass.

When I compile using Java 11 and run the smoketest, for testcase2 I get:

{{









}}
 


was (Author: jharrop):
When I run the smoketest under Java 8, all tests pass.

When I compile using Java 11 and run the smoketest, for testcase2 I get:

{{}}
{{ }}{{}}
{{ }}{{}}
{{ }}
{{ }}{{}}
{{ }}

 




[jira] [Comment Edited] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8

2019-02-21 Thread Jason Harrop (JIRA)


[ 
https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16774590#comment-16774590
 ] 

Jason Harrop edited comment on XALANJ-2419 at 2/21/19 11:29 PM:


When I run the smoketest under Java 8, all tests pass.

When I compile using Java 11 and run the smoketest, for testcase2 I get:

{{}}
{{ }}{{}}
{{ }}{{}}
{{ }}
{{ }}{{}}
{{ }}

 


was (Author: jharrop):
When I run the smoketest under Java 8, it works.

When I compile using Java 11 and run the smoketest, for testcase2 I get:

```















```

 




[jira] [Comment Edited] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8

2019-02-21 Thread Jason Harrop (JIRA)


[ 
https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16774590#comment-16774590
 ] 

Jason Harrop edited comment on XALANJ-2419 at 2/21/19 11:32 PM:


When I run the smoketest under Java 8, all tests pass.

When I compile using Java 11 and run the smoketest, for testcase2 I get:


{code:xml}










{code}



was (Author: jharrop):
When I run the smoketest under Java 8, all tests pass.

When I compile using Java 11 and run the smoketest, for testcase2 I get:

{{









}}
 

> Astral characters written as a pair of NCRs with the surrogate scalar values 
> when using UTF-8
> -
>
> Key: XALANJ-2419
> URL: https://issues.apache.org/jira/browse/XALANJ-2419
> Project: XalanJ2
>  Issue Type: Bug
>  Components: Serialization
>Affects Versions: 2.7.1
>Reporter: Henri Sivonen
>Priority: Major
> Attachments: XALANJ-2419-fix-v3.txt, XALANJ-2419-tests-v3.txt
>
>
> org.apache.xml.serializer.ToStream contains the following code:
> else if (m_encodingInfo.isInEncoding(ch)) {
> // If the character is in the encoding, and
> // not in the normal ASCII range, we also
> // just leave it get added on to the clean characters
> 
> }
> else {
> // This is a fallback plan, we should never get here
> // but if the character wasn't previously handled
> // (i.e. isn't in the encoding, etc.) then what
> // should we do?  We choose to write out an entity
> writeOutCleanChars(chars, i, lastDirtyCharProcessed);
> writer.write("&#");
> writer.write(Integer.toString(ch));
> writer.write(';');
> lastDirtyCharProcessed = i;
> }
> This leads to the wrong (latter) if branch running for surrogates, because 
> isInEncoding() for UTF-8 returns false for surrogates. It is always wrong 
> (regardless of encoding) to escape a surrogate as an NCR.
> The practical effect of this bug is that any document with astral characters 
> in it ends up in an ill-formed serialization and does not parse back using an 
> XML parser.
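
The effect described in the quoted report can be reproduced outside Xalan with a minimal sketch. The class and method names below are hypothetical and this is not the ToStream code; it only assumes that the fallback branch writes Integer.toString(ch) for each UTF-16 code unit, as the quoted snippet does, and contrasts that with escaping by code point.

```java
final class SurrogateNcrDemo {

    // Mirrors the quoted fallback branch: each UTF-16 code unit outside ASCII
    // becomes its own numeric character reference (NCR).
    static String escapePerCodeUnit(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (ch < 0x80) {
                out.append(ch);
            } else {
                out.append("&#").append(Integer.toString(ch)).append(';');
            }
        }
        return out.toString();
    }

    // Alternative for illustration: combine a surrogate pair into one code point
    // first, so an astral character yields a single NCR. An unpaired surrogate
    // would still need separate handling, which this sketch leaves out.
    static String escapePerCodePoint(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp < 0x80) {
                out.append((char) cp);
            } else {
                out.append("&#").append(cp).append(';');
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+1F46E, stored in Java as the surrogate pair D83D DC6E.
        String astral = new String(Character.toChars(0x1F46E));
        System.out.println(escapePerCodeUnit(astral));  // &#55357;&#56430;
        System.out.println(escapePerCodePoint(astral)); // &#128110;
    }
}
```

The first form is the pair of surrogate NCRs named in the issue title; the second is a single reference to the actual character.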






[jira] [Comment Edited] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8

2018-04-16 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439454#comment-16439454
 ] 

Uwe Schindler edited comment on XALANJ-2419 at 4/16/18 1:38 PM:


Fix works for me. +1 to start a release of XALANJ (and maybe also XERCES).


was (Author: thetaphi):
Fix works for me.

> Astral characters written as a pair of NCRs with the surrogate scalar values 
> when using UTF-8
> -
>
> Key: XALANJ-2419
> URL: https://issues.apache.org/jira/browse/XALANJ-2419
> Project: XalanJ2
>  Issue Type: Bug
>  Components: Serialization
>Affects Versions: 2.7.1
>Reporter: Henri Sivonen
>Priority: Major
> Attachments: XALANJ-2419-fix-v3.txt, XALANJ-2419-tests-v3.txt
>
>
> org.apache.xml.serializer.ToStream contains the following code:
> else if (m_encodingInfo.isInEncoding(ch)) {
> // If the character is in the encoding, and
> // not in the normal ASCII range, we also
> // just leave it get added on to the clean characters
> 
> }
> else {
> // This is a fallback plan, we should never get here
> // but if the character wasn't previously handled
> // (i.e. isn't in the encoding, etc.) then what
> // should we do?  We choose to write out an entity
> writeOutCleanChars(chars, i, lastDirtyCharProcessed);
> writer.write("&#");
> writer.write(Integer.toString(ch));
> writer.write(';');
> lastDirtyCharProcessed = i;
> }
> This leads to the wrong (latter) if branch running for surrogates, because 
> isInEncoding() for UTF-8 returns false for surrogates. It is always wrong 
> (regardless of encoding) to escape a surrogate as an NCR.
> The practical effect of this bug is that any document with astral characters 
> in it ends up in an ill-formed serialization and does not parse back using an 
> XML parser.
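
The claim that such output does not parse back can be checked with the JDK's built-in JAXP parser. This is a small sketch under that assumption (the class name ParseBackCheck is hypothetical): a character reference to a surrogate is outside the XML Char production, so a conforming parser should report a fatal error for it.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class ParseBackCheck {

    // Returns true if the string parses as well-formed XML, false if the
    // parser rejects it.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Two NCRs carrying the surrogate values of U+1F46E.
        System.out.println(isWellFormed("<a>&#55357;&#56430;</a>"));
        // One NCR carrying the code point itself.
        System.out.println(isWellFormed("<a>&#128110;</a>"));
    }
}
```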






[jira] [Comment Edited] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8

2018-04-16 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439386#comment-16439386
 ] 

Uwe Schindler edited comment on XALANJ-2419 at 4/16/18 12:51 PM:
-

bq. (Or are you using a jarjar'ed build inside Solr or Lucene?)

Hah, you noticed it. Yes, I am part of the Lucene/Solr PMC. But this issue is more 
about XML processing in a local project of mine. I know that Solr is affected 
by this...

I also have a workaround for Apache Axis 1.4 (which was easy to fix without 
patching).


was (Author: thetaphi):
bq. (Or are you using a jarjar'ed build inside Solr or Lucene?)

Hah, you noticed it. Yes, I am Lucene/Solr. But this issue is more about XML 
processing in a local project of mine. I know that Solr is affected by this...

I also have a workaround for Apache Axis 1.4 (which was easy to fix without 
patching).

> Astral characters written as a pair of NCRs with the surrogate scalar values 
> when using UTF-8
> -
>
> Key: XALANJ-2419
> URL: https://issues.apache.org/jira/browse/XALANJ-2419
> Project: XalanJ2
>  Issue Type: Bug
>  Components: Serialization
>Affects Versions: 2.7.1
>Reporter: Henri Sivonen
>Priority: Major
> Attachments: XALANJ-2419-fix-v3.txt, XALANJ-2419-tests-v3.txt
>
>
> org.apache.xml.serializer.ToStream contains the following code:
> else if (m_encodingInfo.isInEncoding(ch)) {
> // If the character is in the encoding, and
> // not in the normal ASCII range, we also
> // just leave it get added on to the clean characters
> 
> }
> else {
> // This is a fallback plan, we should never get here
> // but if the character wasn't previously handled
> // (i.e. isn't in the encoding, etc.) then what
> // should we do?  We choose to write out an entity
> writeOutCleanChars(chars, i, lastDirtyCharProcessed);
> writer.write("&#");
> writer.write(Integer.toString(ch));
> writer.write(';');
> lastDirtyCharProcessed = i;
> }
> This leads to the wrong (latter) if branch running for surrogates, because 
> isInEncoding() for UTF-8 returns false for surrogates. It is always wrong 
> (regardless of encoding) to escape a surrogate as an NCR.
> The practical effect of this bug is that any document with astral characters 
> in it ends up in an ill-formed serialization and does not parse back using an 
> XML parser.






[jira] [Comment Edited] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8

2018-04-16 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439379#comment-16439379
 ] 

Uwe Schindler edited comment on XALANJ-2419 at 4/16/18 12:48 PM:
-

Thanks for the fix, I will test it in a moment.

About a release: I am an Apache member and committer, so I might start a thread to 
push a release. As these bugs are horrible and break almost any XML handling of 
things like emojis, we should maybe do a bugfix release of serializer.jar. Keep 
in mind that this would also require a Xerces release, as Xerces and Xalan share 
serializer.jar (I think they depend on each other on Maven Central).

I would try to help with a release. This fix is indeed simple. 
Somebody should just commit it (I could theoretically do it, but that should be 
done by non-project members only as a last resort), and somebody else would 
press the button for the release.

BTW, also my own projects like Apache Solr are affected by this bug (people 
that still use XML instead of JSON with Solr).


was (Author: thetaphi):
Thanks for the fix, I will test it in a moment.

About a release: I am Apache member and committer, so I might start a thread to 
push a release. As these bugs are horrible and make almost any XML handling of 
stuff like Emojis broken, we should maybe do a bugfix releaser for 
serializer.jar release. Keep in mind, this would also require to make a Xerces 
release, as Xerces and Xalan share serializer.jar (I think they depend on each 
other on Maven central).

I would try to help with a release. This fix is indeed simple. 
Somebody should just commit it (I could theoretically do it, but that should be 
done by non-project members only as a last resort), and somebody else would 
press the button for the release.

> Astral characters written as a pair of NCRs with the surrogate scalar values 
> when using UTF-8
> -
>
> Key: XALANJ-2419
> URL: https://issues.apache.org/jira/browse/XALANJ-2419
> Project: XalanJ2
>  Issue Type: Bug
>  Components: Serialization
>Affects Versions: 2.7.1
>Reporter: Henri Sivonen
>Priority: Major
> Attachments: XALANJ-2419-fix-v3.txt, XALANJ-2419-tests-v3.txt
>
>
> org.apache.xml.serializer.ToStream contains the following code:
> else if (m_encodingInfo.isInEncoding(ch)) {
> // If the character is in the encoding, and
> // not in the normal ASCII range, we also
> // just leave it get added on to the clean characters
> 
> }
> else {
> // This is a fallback plan, we should never get here
> // but if the character wasn't previously handled
> // (i.e. isn't in the encoding, etc.) then what
> // should we do?  We choose to write out an entity
> writeOutCleanChars(chars, i, lastDirtyCharProcessed);
> writer.write("&#");
> writer.write(Integer.toString(ch));
> writer.write(';');
> lastDirtyCharProcessed = i;
> }
> This leads to the wrong (latter) if branch running for surrogates, because 
> isInEncoding() for UTF-8 returns false for surrogates. It is always wrong 
> (regardless of encoding) to escape a surrogate as an NCR.
> The practical effect of this bug is that any document with astral characters 
> in it ends up in an ill-formed serialization and does not parse back using an 
> XML parser.






[jira] [Comment Edited] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8

2018-04-15 Thread JIRA

[ 
https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16438864#comment-16438864
 ] 

Jesper Steen Møller edited comment on XALANJ-2419 at 4/15/18 11:22 PM:
---

[~thetaphi]: So - I fixed it, but will anybody care?

I mean, it's been almost 8 years...


was (Author: jespersm):
[~thetaphi]: I'm sure it could be fixed, but would anybody care?

I mean, it's been almost 8 years...

> Astral characters written as a pair of NCRs with the surrogate scalar values 
> when using UTF-8
> -
>
> Key: XALANJ-2419
> URL: https://issues.apache.org/jira/browse/XALANJ-2419
> Project: XalanJ2
>  Issue Type: Bug
>  Components: Serialization
>Affects Versions: 2.7.1
>Reporter: Henri Sivonen
>Priority: Major
> Attachments: XALANJ-2419-fix-v2.txt, XALANJ-2419-tests-v2.txt
>
>
> org.apache.xml.serializer.ToStream contains the following code:
> else if (m_encodingInfo.isInEncoding(ch)) {
> // If the character is in the encoding, and
> // not in the normal ASCII range, we also
> // just leave it get added on to the clean characters
> 
> }
> else {
> // This is a fallback plan, we should never get here
> // but if the character wasn't previously handled
> // (i.e. isn't in the encoding, etc.) then what
> // should we do?  We choose to write out an entity
> writeOutCleanChars(chars, i, lastDirtyCharProcessed);
> writer.write("&#");
> writer.write(Integer.toString(ch));
> writer.write(';');
> lastDirtyCharProcessed = i;
> }
> This leads to the wrong (latter) if branch running for surrogates, because 
> isInEncoding() for UTF-8 returns false for surrogates. It is always wrong 
> (regardless of encoding) to escape a surrogate as an NCR.
> The practical effect of this bug is that any document with astral characters 
> in it ends up in an ill-formed serialization and does not parse back using an 
> XML parser.






[jira] [Comment Edited] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8

2018-04-15 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16438848#comment-16438848
 ] 

Uwe Schindler edited comment on XALANJ-2419 at 4/15/18 10:48 PM:
-

Unfortunately the patch does not fix the problem for attributes. Output for those 
improved, but the serializer still writes the correct character and then the 
second (low) surrogate of the pair as a decimal escape.

The Policeman Emoji is serialized with the patch correctly, if part of a text 
node. This is fixed by this patch.
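
What "serialized correctly" means can be sketched independently of the patch. The helper below is hypothetical and is not taken from the attached fix: it walks the value by code point, writes a character as-is when the target charset can encode it, and otherwise writes one NCR for the whole code point. It uses java.nio's CharsetEncoder.canEncode as a stand-in for Xalan's isInEncoding(), and it does not escape markup characters such as < or &.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

final class CodePointEscaper {

    // Escapes only what the charset cannot represent; never splits a pair.
    static String escape(String value, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < value.length(); ) {
            int cp = value.codePointAt(i);
            if (cp >= 0xD800 && cp <= 0xDFFF) {
                // An unpaired surrogate has no valid XML representation.
                throw new IllegalArgumentException("unpaired surrogate at index " + i);
            }
            String unit = new String(Character.toChars(cp));
            if (encoder.canEncode(unit)) {
                out.append(unit);
            } else {
                out.append("&#").append(cp).append(';');
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String policeman = new String(Character.toChars(0x1F46E));
        // UTF-8 can encode the character, so it is written unescaped.
        System.out.println(escape(policeman, StandardCharsets.UTF_8).equals(policeman));
        // US-ASCII cannot, so a single NCR for the code point is written.
        System.out.println(escape(policeman, StandardCharsets.US_ASCII));
    }
}
```

The same loop would apply to text nodes and attribute values alike, since the pairing logic does not depend on where the character appears.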

But inside an attribute the policeman emoji comes out like:

{code:xml}
{code}

> Astral characters written as a pair of NCRs with the surrogate scalar values 
> when using UTF-8
> -
>
> Key: XALANJ-2419
> URL: https://issues.apache.org/jira/browse/XALANJ-2419
> Project: XalanJ2
>  Issue Type: Bug
>  Components: Serialization
>Affects Versions: 2.7.1
>Reporter: Henri Sivonen
>Priority: Major
> Attachments: XALANJ-2419-fix.txt, XALANJ-2419-tests.txt
>
>
> org.apache.xml.serializer.ToStream contains the following code:
> else if (m_encodingInfo.isInEncoding(ch)) {
> // If the character is in the encoding, and
> // not in the normal ASCII range, we also
> // just leave it get added on to the clean characters
> 
> }
> else {
> // This is a fallback plan, we should never get here
> // but if the character wasn't previously handled
> // (i.e. isn't in the encoding, etc.) then what
> // should we do?  We choose to write out an entity
> writeOutCleanChars(chars, i, lastDirtyCharProcessed);
> writer.write("&#");
> writer.write(Integer.toString(ch));
> writer.write(';');
> lastDirtyCharProcessed = i;
> }
> This leads to the wrong (latter) if branch running for surrogates, because 
> isInEncoding() for UTF-8 returns false for surrogates. It is always wrong 
> (regardless of encoding) to escape a surrogate as an NCR.
> The practical effect of this bug is that any document with astral characters 
> in it ends up in an ill-formed serialization and does not parse back using an 
> XML parser.






[jira] [Comment Edited] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8

2017-09-13 Thread JIRA

[ 
https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16164487#comment-16164487
 ] 

Jesper Steen Møller edited comment on XALANJ-2419 at 9/13/17 11:04 AM:
---

The Xalan project appears quite dormant, which is sad, but understandable. I 
just came across this quite old posting on the subject: 
https://intellectualcramps.wordpress.com/2011/06/03/xalan-a-step-closer-to-the-attic/

I suggest one of two courses of action:

* Contact the Xalan PMC (use the mailing list, not just JIRA) and volunteer to 
help in putting out a new release (i.e. look for bugs with patches, or related 
Unicode issues, e.g. XALANJ-2610). You can find out about the current PMC members 
and committers here: https://projects.apache.org/committee.html?xalan - ASF 
house rules say that you need three positive PMC votes to allow a new release. 
(Perhaps economic incentives work, i.e. pay existing committers to work on the 
release)
* Fork Xalan-J on GitHub or a similar place. You'll likely have to rename the 
project so Apache's trademarks aren't infringed, but it should be 
possible to keep the package names, thus allowing for backwards compatibility 
(But I'm not a lawyer!)

I won't be able to participate - I don't even code much in Java anymore.


was (Author: jespersm):
The Xalan project appears quite dormant, which is sad, but understandable. I 
just came across this quite old posting on the subject: 
https://intellectualcramps.wordpress.com/2011/06/03/xalan-a-step-closer-to-the-attic/

I suggest one of two courses of action:

* Contact the Xalan PMC (use the mailing list, not just JIRA) and volunteer to 
help in putting out a new release (i.e. look for bugs with patches, or related 
Unicode issues, e.g. XALANJ-2610). You can find out about the current PMC members 
and committers here: https://projects.apache.org/committee.html?xalan - ASF 
house rules say that you need three positive PMC votes to allow a new release. 
(Perhaps economic incentives work, i.e. pay existing committers to work on the 
release)
* Fork Xalan-J on GitHub or a similar place. You'll likely have to rename the 
project so Apache's trademarks aren't infringed, but it should be 
possible to keep the package names, thus allowing for backwards compatibility 
(But I'm not a lawyer!)



> Astral characters written as a pair of NCRs with the surrogate scalar values 
> when using UTF-8
> -
>
> Key: XALANJ-2419
> URL: https://issues.apache.org/jira/browse/XALANJ-2419
> Project: XalanJ2
>  Issue Type: Bug
>  Components: Serialization
>Affects Versions: 2.7.1
>Reporter: Henri Sivonen
> Attachments: XALANJ-2419-fix.txt, XALANJ-2419-tests.txt
>
>
> org.apache.xml.serializer.ToStream contains the following code:
> else if (m_encodingInfo.isInEncoding(ch)) {
> // If the character is in the encoding, and
> // not in the normal ASCII range, we also
> // just leave it get added on to the clean characters
> 
> }
> else {
> // This is a fallback plan, we should never get here
> // but if the character wasn't previously handled
> // (i.e. isn't in the encoding, etc.) then what
> // should we do?  We choose to write out an entity
> writeOutCleanChars(chars, i, lastDirtyCharProcessed);
> writer.write("&#");
> writer.write(Integer.toString(ch));
> writer.write(';');
> lastDirtyCharProcessed = i;
> }
> This leads to the wrong (latter) if branch running for surrogates, because 
> isInEncoding() for UTF-8 returns false for surrogates. It is always wrong 
> (regardless of encoding) to escape a surrogate as an NCR.
> The practical effect of this bug is that any document with astral characters 
> in it ends up in an ill-formed serialization and does not parse back using an 
> XML parser.






[jira] [Comment Edited] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8

2017-09-13 Thread JIRA

[ 
https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16164487#comment-16164487
 ] 

Jesper Steen Møller edited comment on XALANJ-2419 at 9/13/17 11:03 AM:
---

The Xalan project appears quite dormant, which is sad, but understandable. I 
just came across this quite old posting on the subject: 
https://intellectualcramps.wordpress.com/2011/06/03/xalan-a-step-closer-to-the-attic/

I suggest one of two courses of action:

* Contact the Xalan PMC (use the mailing list, not just JIRA) and volunteer to 
help in putting out a new release (i.e. look for bugs with patches, or related 
Unicode issues, e.g. XALANJ-2610). You can find out about the current PMC members 
and committers here: https://projects.apache.org/committee.html?xalan - ASF 
house rules say that you need three positive PMC votes to allow a new release. 
(Perhaps economic incentives work, i.e. pay existing committers to work on the 
release)
* Fork Xalan-J on GitHub or a similar place. You'll likely have to rename the 
project so Apache's trademarks aren't infringed, but it should be 
possible to keep the package names, thus allowing for backwards compatibility 
(But I'm not a lawyer!)




was (Author: jespersm):
The Xalan project appears quite dormant, which is sad, but understandable. I 
just came across this quite old posting on the subject: 
https://intellectualcramps.wordpress.com/2011/06/03/xalan-a-step-closer-to-the-attic/

I suggest that one of two courses of action:

* Contact the Xalan PMC (use the mailing list, not just JIRA) and volunteer to 
help in putting out a new release (i.e. look for bugs with patches, or related 
Unicode issues, e.g. XALANJ-2610). You can find out about the current PMC members 
and committers here: https://projects.apache.org/committee.html?xalan - ASF 
house rules say that you need three positive PMC votes to allow a new release. 
(Perhaps economic incentives work, i.e. pay existing committers to work on the 
release)
* Fork Xalan-J on GitHub or a similar place. You'll likely have to rename the 
project so Apache's trademarks aren't infringed, but it should be 
possible to keep the package names, thus allowing for backwards compatibility 
(But I'm not a lawyer!)



> Astral characters written as a pair of NCRs with the surrogate scalar values 
> when using UTF-8
> -
>
> Key: XALANJ-2419
> URL: https://issues.apache.org/jira/browse/XALANJ-2419
> Project: XalanJ2
>  Issue Type: Bug
>  Components: Serialization
>Affects Versions: 2.7.1
>Reporter: Henri Sivonen
> Attachments: XALANJ-2419-fix.txt, XALANJ-2419-tests.txt
>
>
> org.apache.xml.serializer.ToStream contains the following code:
> else if (m_encodingInfo.isInEncoding(ch)) {
> // If the character is in the encoding, and
> // not in the normal ASCII range, we also
> // just leave it get added on to the clean characters
> 
> }
> else {
> // This is a fallback plan, we should never get here
> // but if the character wasn't previously handled
> // (i.e. isn't in the encoding, etc.) then what
> // should we do?  We choose to write out an entity
> writeOutCleanChars(chars, i, lastDirtyCharProcessed);
> writer.write("&#");
> writer.write(Integer.toString(ch));
> writer.write(';');
> lastDirtyCharProcessed = i;
> }
> This leads to the wrong (latter) if branch running for surrogates, because 
> isInEncoding() for UTF-8 returns false for surrogates. It is always wrong 
> (regardless of encoding) to escape a surrogate as an NCR.
> The practical effect of this bug is that any document with astral characters 
> in it ends up in an ill-formed serialization and does not parse back using an 
> XML parser.






[jira] [Comment Edited] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8

2015-08-21 Thread Gary Gregory (JIRA)

[ 
https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14707688#comment-14707688
 ] 

Gary Gregory edited comment on XALANJ-2419 at 8/22/15 12:33 AM:


The only thing blocking me from doing this ATM is priorities. Next week, next 
month, who knows. Got to pay the bills ;-)


was (Author: garydgregory):
The only thing blocking me from doing ATM is priorities. Next week, next month, 
who knows. Got to pay the bills ;-)

> Astral characters written as a pair of NCRs with the surrogate scalar values 
> when using UTF-8
> -
>
> Key: XALANJ-2419
> URL: https://issues.apache.org/jira/browse/XALANJ-2419
> Project: XalanJ2
>  Issue Type: Bug
>  Components: Serialization
>Affects Versions: 2.7.1
>Reporter: Henri Sivonen
> Attachments: XALANJ-2419-fix.txt, XALANJ-2419-tests.txt
>
>
> org.apache.xml.serializer.ToStream contains the following code:
> else if (m_encodingInfo.isInEncoding(ch)) {
> // If the character is in the encoding, and
> // not in the normal ASCII range, we also
> // just leave it get added on to the clean characters
> 
> }
> else {
> // This is a fallback plan, we should never get here
> // but if the character wasn't previously handled
> // (i.e. isn't in the encoding, etc.) then what
> // should we do?  We choose to write out an entity
> writeOutCleanChars(chars, i, lastDirtyCharProcessed);
> writer.write("&#");
> writer.write(Integer.toString(ch));
> writer.write(';');
> lastDirtyCharProcessed = i;
> }
> This leads to the wrong (latter) if branch running for surrogates, because 
> isInEncoding() for UTF-8 returns false for surrogates. It is always wrong 
> (regardless of encoding) to escape a surrogate as an NCR.
> The practical effect of this bug is that any document with astral characters 
> in it ends up in an ill-formed serialization and does not parse back using an 
> XML parser.






[jira] [Comment Edited] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8

2015-08-11 Thread JIRA

[ 
https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692327#comment-14692327
 ] 

Jesper Steen Møller edited comment on XALANJ-2419 at 8/11/15 9:58 PM:
--

Yes, they work on those branches, too. The patches were originally produced 
against the 2.7.1 tag.
(but, yes, I verified just now)


was (Author: jespersm):
Yes, they work on those branches, too. The patches were originally produced 
against the 2.7.1 tag.

> Astral characters written as a pair of NCRs with the surrogate scalar values 
> when using UTF-8
> -
>
> Key: XALANJ-2419
> URL: https://issues.apache.org/jira/browse/XALANJ-2419
> Project: XalanJ2
>  Issue Type: Bug
>  Components: Serialization
>Affects Versions: 2.7.1
>Reporter: Henri Sivonen
> Attachments: XALANJ-2419-fix.txt, XALANJ-2419-tests.txt
>
>
> org.apache.xml.serializer.ToStream contains the following code:
> else if (m_encodingInfo.isInEncoding(ch)) {
> // If the character is in the encoding, and
> // not in the normal ASCII range, we also
> // just leave it get added on to the clean characters
> 
> }
> else {
> // This is a fallback plan, we should never get here
> // but if the character wasn't previously handled
> // (i.e. isn't in the encoding, etc.) then what
> // should we do?  We choose to write out an entity
> writeOutCleanChars(chars, i, lastDirtyCharProcessed);
> writer.write("&#");
> writer.write(Integer.toString(ch));
> writer.write(';');
> lastDirtyCharProcessed = i;
> }
> This leads to the wrong (latter) if branch running for surrogates, because 
> isInEncoding() for UTF-8 returns false for surrogates. It is always wrong 
> (regardless of encoding) to escape a surrogate as an NCR.
> The practical effect of this bug is that any document with astral characters 
> in it ends up in an ill-formed serialization and does not parse back using an 
> XML parser.






[jira] [Comment Edited] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8

2015-08-11 Thread Gary Gregory (JIRA)

[ 
https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14682136#comment-14682136
 ] 

Gary Gregory edited comment on XALANJ-2419 at 8/11/15 5:29 PM:
---

The 2.7.x maintenance is coming out of:
- https://svn.apache.org/repos/asf/xalan/java/branches/xalan-j_2_7_1_maint
- https://svn.apache.org/repos/asf/xalan/test/branches/xalan-j_2_7_x

Do the tests pass on this branch?


was (Author: garydgregory):
The 2.7.x maintenance is coming out of:
- https://svn.apache.org/repos/asf/xalan/java/branches/xalan-j_2_7_1_maint
- https://svn.apache.org/repos/asf/xalan/test/branches/xalan-j_2_7_x

Are the patches OK on this branch?

> Astral characters written as a pair of NCRs with the surrogate scalar values 
> when using UTF-8
> -
>
> Key: XALANJ-2419
> URL: https://issues.apache.org/jira/browse/XALANJ-2419
> Project: XalanJ2
>  Issue Type: Bug
>  Components: Serialization
>Affects Versions: 2.7.1
>Reporter: Henri Sivonen
> Attachments: XALANJ-2419-fix.txt, XALANJ-2419-tests.txt
>
>
> org.apache.xml.serializer.ToStream contains the following code:
> else if (m_encodingInfo.isInEncoding(ch)) {
> // If the character is in the encoding, and
> // not in the normal ASCII range, we also
> // just leave it get added on to the clean characters
> 
> }
> else {
> // This is a fallback plan, we should never get here
> // but if the character wasn't previously handled
> // (i.e. isn't in the encoding, etc.) then what
> // should we do?  We choose to write out an entity
> writeOutCleanChars(chars, i, lastDirtyCharProcessed);
> writer.write("&#");
> writer.write(Integer.toString(ch));
> writer.write(';');
> lastDirtyCharProcessed = i;
> }
> This leads to the wrong (latter) if branch running for surrogates, because 
> isInEncoding() for UTF-8 returns false for surrogates. It is always wrong 
> (regardless of encoding) to escape a surrogate as an NCR.
> The practical effect of this bug is that any document with astral characters 
> in it ends up in an ill-formed serialization and does not parse back using an 
> XML parser.






[jira] [Comment Edited] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8

2015-08-11 Thread Gary Gregory (JIRA)

[ 
https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14682136#comment-14682136
 ] 

Gary Gregory edited comment on XALANJ-2419 at 8/11/15 5:29 PM:
---

The 2.7.x maintenance is coming out of:
- https://svn.apache.org/repos/asf/xalan/java/branches/xalan-j_2_7_1_maint
- https://svn.apache.org/repos/asf/xalan/test/branches/xalan-j_2_7_x

Are the patches OK on this branch?


was (Author: garydgregory):
The 2.7.x maintenance is coming out of:
- https://svn.apache.org/repos/asf/xalan/java/branches/xalan-j_2_7_1_maint
- https://svn.apache.org/repos/asf/xalan/test/branches/xalan-j_2_7_x

> Astral characters written as a pair of NCRs with the surrogate scalar values 
> when using UTF-8
> -
>
> Key: XALANJ-2419
> URL: https://issues.apache.org/jira/browse/XALANJ-2419
> Project: XalanJ2
>  Issue Type: Bug
>  Components: Serialization
>Affects Versions: 2.7.1
>Reporter: Henri Sivonen
> Attachments: XALANJ-2419-fix.txt, XALANJ-2419-tests.txt
>
>
> org.apache.xml.serializer.ToStream contains the following code:
> else if (m_encodingInfo.isInEncoding(ch)) {
> // If the character is in the encoding, and
> // not in the normal ASCII range, we also
> // just leave it get added on to the clean characters
> 
> }
> else {
> // This is a fallback plan, we should never get here
> // but if the character wasn't previously handled
> // (i.e. isn't in the encoding, etc.) then what
> // should we do?  We choose to write out an entity
> writeOutCleanChars(chars, i, lastDirtyCharProcessed);
> writer.write("&#");
> writer.write(Integer.toString(ch));
> writer.write(';');
> lastDirtyCharProcessed = i;
> }
> This leads to the wrong (latter) if branch running for surrogates, because 
> isInEncoding() for UTF-8 returns false for surrogates. It is always wrong 
> (regardless of encoding) to escape a surrogate as an NCR.
> The practical effect of this bug is that any document with astral characters 
> in it ends up in an ill-formed serialization and does not parse back using an 
> XML parser.






[jira] [Comment Edited] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8

2015-08-11 Thread JIRA

[ 
https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14681738#comment-14681738
 ] 

Jesper Steen Møller edited comment on XALANJ-2419 at 8/11/15 12:38 PM:
---

I followed the instructions on 
https://xalan.apache.org/xalan-j/downloads.html#buildmyself on my Mac OS X 
10.10.4 with Xcode developer tools installed.
I had to add execute permissions on test/build.sh, and temporarily change my 
locale to "All American" (or the test "Extension test of javaSample3.xsl" fails)

That worked, and I got 2 x CONGRATULATIONS when running "./build.sh smoketest"

I then applied the patch containing the tests (using svn patch), and 
*ToStreamTest.runTest()* and *StreamResultAPITest.runTest()* both failed, as 
expected.

I then applied the fix, and the tests were once again OK.

So, yes, the fix still applies.

This was on Java 1.7.

Hope this helps!


was (Author: jespersm):
I followed the instructions on 
https://xalan.apache.org/xalan-j/downloads.html#buildmyself on my Mac OS X 
10.10.4 with Xcode developer tools installed.
I had to add execute permissions on test/build.sh, and temporarily change my 
locale to "All American" (or the test "Extension test of javaSample3.xsl" fails)

That worked, and I got 2 x CONGRATULATIONS

I then applied the tests-patch (using svn patch), and then 
ToStreamTest.runTest() and StreamResultAPITest.runTest() both failed, as was 
expected.

I then applied the fix, and the tests were once again OK.

So, yes, the fix still applies.

This was on Java 1.7.

Hope this helps!

> Astral characters written as a pair of NCRs with the surrogate scalar values 
> when using UTF-8
> -
>
> Key: XALANJ-2419
> URL: https://issues.apache.org/jira/browse/XALANJ-2419
> Project: XalanJ2
>  Issue Type: Bug
>  Components: Serialization
>Affects Versions: 2.7.1
>Reporter: Henri Sivonen
> Attachments: XALANJ-2419-fix.txt, XALANJ-2419-tests.txt
>
>






[jira] [Comment Edited] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8

2015-07-21 Thread Scott Mitchell (JIRA)

[ 
https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635451#comment-14635451
 ] 

Scott Mitchell edited comment on XALANJ-2419 at 7/21/15 6:15 PM:
-

I even figured out how to install CVS on my Mac (who'd have thought I'd ever 
have to go there), and now the Apache CVS server is refusing connections. 
Sigh...


was (Author: smitchelus):
I even figured out how to install CVS on my Mac and the Apache CVS server is 
refusing connections. Sigh...

> Astral characters written as a pair of NCRs with the surrogate scalar values 
> when using UTF-8
> -
>
> Key: XALANJ-2419
> URL: https://issues.apache.org/jira/browse/XALANJ-2419
> Project: XalanJ2
>  Issue Type: Bug
>  Components: Serialization
>Affects Versions: 2.7.1
>Reporter: Henri Sivonen
> Attachments: XALANJ-2419-fix.txt, XALANJ-2419-tests.txt
>
>


