[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17811010#comment-17811010 ]

Cédric Damioli commented on XALANJ-2419:
----------------------------------------

Hi [~kesh...@alum.mit.edu],
I've added a PR for XALANJ-2618. I suppose it will also fix your tests on this ticket.

> Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
> ---------------------------------------------------------------------------------------------
>
>                 Key: XALANJ-2419
>                 URL: https://issues.apache.org/jira/browse/XALANJ-2419
>             Project: XalanJ2
>          Issue Type: Bug
>          Components: Serialization
>    Affects Versions: 2.7.1
>            Reporter: Henri Sivonen
>            Assignee: Joe Kesselman
>            Priority: Major
>             Fix For: The Latest Development Code
>
>         Attachments: XALANJ-2419-fix-v3.txt, XALANJ-2419-tests-v3.txt
>
>
> org.apache.xml.serializer.ToStream contains the following code:
>     else if (m_encodingInfo.isInEncoding(ch)) {
>         // If the character is in the encoding, and
>         // not in the normal ASCII range, we also
>         // just leave it get added on to the clean characters
>
>     }
>     else {
>         // This is a fallback plan, we should never get here
>         // but if the character wasn't previously handled
>         // (i.e. isn't in the encoding, etc.) then what
>         // should we do? We choose to write out an entity
>         writeOutCleanChars(chars, i, lastDirtyCharProcessed);
>         writer.write("&#");
>         writer.write(Integer.toString(ch));
>         writer.write(';');
>         lastDirtyCharProcessed = i;
>     }
> This leads to the wrong (latter) if branch running for surrogates, because
> isInEncoding() for UTF-8 returns false for surrogates. It is always wrong
> (regardless of encoding) to escape a surrogate as an NCR.
> The practical effect of this bug is that any document with astral characters
> in it ends up in an ill-formed serialization and does not parse back using an
> XML parser.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: dev-unsubscr...@xalan.apache.org
For additional commands, e-mail: dev-h...@xalan.apache.org
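The issue description above says that escaping each surrogate as its own NCR yields output an XML parser rejects. The following is a minimal sketch of that point, not the code from the attached patch: the class name `AstralNcrSketch` and both helper methods are hypothetical, and the escaping is deliberately simplified (it only handles non-ASCII characters and ignores markup characters such as `<` and `&`). It joins a surrogate pair into one code point before writing the NCR, then checks with the JDK's built-in parser that the joined form parses and the split form does not.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class; not part of Xalan.
final class AstralNcrSketch {

    /** Escapes every non-ASCII character as a decimal NCR, joining surrogate pairs first. */
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            // codePointAt combines a valid high+low surrogate pair into one code point.
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (cp >= 0xD800 && cp <= 0xDFFF) {
                // An unpaired surrogate has no legal XML representation.
                throw new IllegalArgumentException("unpaired surrogate in input");
            }
            if (cp < 0x80) {
                out.appendCodePoint(cp);
            } else {
                out.append("&#").append(cp).append(';');
            }
        }
        return out.toString();
    }

    /** Returns true if the given string is accepted by the JDK's default XML parser. */
    static boolean parses(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler()); // rethrow fatal errors quietly
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String astral = "\uD83D\uDE00"; // U+1F600 as a UTF-16 surrogate pair
        String joined = "<r>" + escapeNonAscii(astral) + "</r>";
        String split = "<r>&#55357;&#56832;</r>"; // one NCR per surrogate, as in the bug report
        System.out.println(joined + " parses: " + parses(joined));
        System.out.println(split + " parses: " + parses(split));
    }
}
```

For UTF-8 output the character could also simply be written as-is; the NCR form here only makes the difference between the two escapings visible.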
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17810716#comment-17810716 ]

Cédric Damioli commented on XALANJ-2419:
----------------------------------------

I think I may know this one!
It reminds me of an issue with Encodings.properties being loaded in a different order in Java 8 and Java 9+, which causes problems on modern JVMs because the file references nonexistent encodings.
Could it be related?
In my case I modified Encodings.properties by removing all 8859_* encodings, and everything worked again.
I'm pretty sure a Jira issue exists for this one, but I can't find it anymore...
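One way to test the hypothesis above is to ask the running JVM which encoding names it actually supports. The sketch below is only a diagnostic aid under stated assumptions: `EncodingNameCheck` and `unsupported` are hypothetical names, the list of names is supplied by the caller rather than read from Encodings.properties, and it does not reproduce how the Xalan serializer loads or resolves that file. It relies on the standard `java.nio.charset.Charset.isSupported` call.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical diagnostic helper; not part of Xalan.
final class EncodingNameCheck {

    /** Returns the names from the input that the running JVM does not support as charsets. */
    static List<String> unsupported(List<String> names) {
        List<String> missing = new ArrayList<>();
        for (String name : names) {
            boolean supported;
            try {
                supported = Charset.isSupported(name);
            } catch (IllegalCharsetNameException e) {
                // The name is not even syntactically legal for this JVM.
                supported = false;
            }
            if (!supported) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Names passed on the command line, e.g. entries copied by hand from a properties file.
        System.out.println("Unsupported on this JVM: " + unsupported(Arrays.asList(args)));
    }
}
```

Running the same list under the Java 8 and Java 17 binaries and comparing the output would show whether the set of supported names differs between them.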
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17810609#comment-17810609 ]

Joe Kesselman commented on XALANJ-2419:
---------------------------------------

Arggh. Found the difference in invocation, I think. It's an annoying one.

The failing version is using my commandline's current default binding of "java", which runs through the /etc/alternatives system to invoke
`/usr/lib/jvm/java-17-openjdk-17.0.8.0.7-1.fc37.x86_64/bin/java`

The succeeding versions explicitly invoke /usr/lib/jvm/jre-1.8.0/bin/java, which /etc/alternatives eventually maps to
`/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.382.b05-2.fc37.x86_64/jre/bin/java`

If I change my one-liner to use that 1.8 JRE rather than my current default 17, the errors vanish.

In the vain hope that the problem was specifically OpenJDK 17, I tried it with `/jre-21-openjdk-21.0.1.0.12-1.rolling.fc37.x86_64`. Fails there too.

So *something* is Java-version sensitive and changed some time after Java 1.8. A bug fixed, a new bug, something redefined, something formatted differently, something ordered differently. May be ... _interesting_ ... to track down.

At least we now know how to provoke the divergent behavior in the debugger for study.

Deep breath. Let it out slowly. Recite the mantra: "_If it was easy, they wouldn't need people like us_."
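When a default `java` binding is resolved indirectly, as described above, it can help to have the test harness report which runtime it actually landed on. This is a small sketch under that assumption; the class name `RuntimeInfo` and the method `describe` are hypothetical, and it reads only standard system properties that exist on Java 8 and later.

```java
// Hypothetical helper for logging the active runtime; not part of Xalan.
final class RuntimeInfo {

    /** Builds a one-line description of the JVM that is executing this code. */
    static String describe() {
        return "java.version=" + System.getProperty("java.version")
                + " java.vendor=" + System.getProperty("java.vendor")
                + " java.home=" + System.getProperty("java.home");
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Printing this line at the start of a test run makes it easier to tell a Java 8 result from a Java 17 or 21 result when comparing logs.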
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17810589#comment-17810589 ]

Joe Kesselman commented on XALANJ-2419:
---------------------------------------

Runs fine under Eclipse. Runs fine under apitests. Extracted the commandline from Eclipse, and using that it runs fine from CLI.

Something different about classpaths in my own lazy CLI invocation, apparently. Not highly surprising. Doing a quick divide-and-conquer to isolate the difference, more for my own edification than anything else.
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17810537#comment-17810537 ]

Joe Kesselman commented on XALANJ-2419:
---------------------------------------

ToXMLStream Test Case 2 is having trouble with both ISO-8859-1 and astrals. Probably a late editing glitch, possibly something that crept in while I was playing with Max's amendment.
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17810526#comment-17810526 ]

Jesper Steen Møller commented on XALANJ-2419:
---------------------------------------------

Great work getting this forward, Joe!

Which test(s) are you seeing the problem with? ISO-8859-1 output should not be affected by the patch.
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17810125#comment-17810125 ]

Joe Kesselman commented on XALANJ-2419:
---------------------------------------

Everything should have been carried forward. That may have been done manually and history may have been lost, but I don't believe any actual work has been lost.

In any case, Master *IS* where new development is going. So if you can find anything which has not been addressed there, please flag it for our attention. Just don't assume that the absence of a particular commit, or a particular merge, means something is missing. Real-world git histories sometimes get messy, especially when operated by real-world humans. ("Begin by assuming a spherical cow...")
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17810119#comment-17810119 ]

Cédric Damioli commented on XALANJ-2419:
----------------------------------------

I totally agree with you in theory, but the fact is that 2.7.2 and 2.7.3 were *not* released from master, or am I wrong here?

There is a 2_7_x_maint branch, but it has had no commits in 5 years.

I'm afraid we've lost some commits here with the loss of 2_7_1_maint. Or were all commits on 2_7_1_maint from the last few years actually backports from master?
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17810117#comment-17810117 ]

Joe Kesselman commented on XALANJ-2419:
---------------------------------------

[~cdamioli] : I believe that was due to some mistakes in how the release was handled, and the awkward juggling needed to correct those mistakes.

*Master* is always supposed to be our primary development branch. New code may be developed on other branches but isn't official until it is merged into *Master*.

When a release is made, a tag or fork is created for that release number. Thus, there should be branches/tags for *2.7.1*, *2.7.2*, and *2.7.3* (along with older checkpoints).

If hot fixes are needed which must be applied to code that has already been released (rather than just being included in the next release), we may create *maint* branches where the change is back-ported to the earlier versions. Essentially *2.7.1.maint* is the "development master" for *2.7.1.1*.

This does _not_ mean *Master* should be derived from *maint* branches. It does mean that if something is fixed in an old release, Master should also be fixed, but due to code evolution over time the fix may not be identical, and *maint* is not one of *Master's* dependencies, so that must be done manually.

I believe that what I've just described is standard SCCS "best practice". It's certainly how we managed Xalan (mumble) years ago before I dropped out of it.
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17810114#comment-17810114 ]

Cédric Damioli commented on XALANJ-2419:
----------------------------------------

I may be wrong here, but I think 2.7.2 and 2.7.3 were not released from master but from some maintenance branch. Something like 2_7_1_maint, IIRC.

By the way, I can't find that branch anymore in the GitHub repo. Do you know where it is?
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17810111#comment-17810111 ]

Joe Kesselman commented on XALANJ-2419:
---------------------------------------

Merged into Master. Obviously if there seems to be a regression vs. 2.7.3, please let me know.
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17810047#comment-17810047 ]

Joe Kesselman commented on XALANJ-2419:
---------------------------------------

Opened https://issues.apache.org/jira/browse/XALANJ-2725 for [~maxfortun]'s issue.
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17810043#comment-17810043 ]

Max commented on XALANJ-2419:
-----------------------------

[~kesh...@alum.mit.edu], thank you for working on this. As you suggested, why don't you merge what works, and afterwards I can try to help you work on the split buffer issue against known-good code?
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17810039#comment-17810039 ]

Joe Kesselman commented on XALANJ-2419:
---------------------------------------

Still having trouble with Max's suggestion. Current recommendation: Merge what we've got and open a new work item for that concern.
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17810029#comment-17810029 ]

Joe Kesselman commented on XALANJ-2419:
---
Max's alternative does cause a regression in some of the new tests, assuming I applied it correctly. Surprising. Can take a longer look, but may want to merge what we have first since it *is* an improvement over the previous code.
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17809726#comment-17809726 ]

Joseph Kessselman commented on XALANJ-2419:
---
Still checking for interactions, but currently looks good. It's invoked from apitest rather than smoketest right now; I need to better recall how test selection works.

I looked at @Max's buffer-boundary suggestion briefly. Swapping it in caused some regressions, probably because I tried to optimize it a bit on-the-fly to avoid object churn. I'll give it one more spin tomorrow.

My main concern is that while I agree look-back would be safer than look-forward, I'm not sure that the one case Max patched is the only place where the issue might arise. We have entirely too many paths through this code that are doing essentially the same thing.
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17809615#comment-17809615 ]

Cédric Damioli commented on XALANJ-2419:

Good to hear [~kesh...@alum.mit.edu]! I can confirm that the patch provided here actually works, but that at the same time the issue pointed out by [~maxfortun] still exists. Feel free to ask if you want some help reviewing a patch or testing something. I think many of us around here are really glad to see this issue finally resolved!
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808815#comment-17808815 ]

Joseph Kessselman commented on XALANJ-2419:
---
Well, it's in apitest rather than smoketest, but it's there. Need to look at @max's ToStream buffer-bounds tweak and see if that still applies.
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808293#comment-17808293 ]

Joe Kesselman commented on XALANJ-2419:
---
Work in progress on branch XALANJ-2419. Patches and tests confirmed; I just need to make sure this is included in smoketest.
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1206#comment-1206 ]

Nils Faupel commented on XALANJ-2419:
-
[~ggregory], [~mukul_gandhi]: I saw you were active on the tickets included in the latest release. Is there a chance to include this ticket in the next release?
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17497617#comment-17497617 ]

Max commented on XALANJ-2419:
-
[~jharrop], btw, after cloning your solution I bumped into an edge case where the high surrogate is the last character in the buffer of one call and the low surrogate is the first character in the buffer of the next call. I had to modify the code a bit to accommodate that: https://github.com/maxfortun/xalan-j/blob/a9bd5591d9f8a523548aeec091e886b64c691628/src/org/apache/xml/serializer/ToStream.java#L1606-L1621
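The edge case Max describes can be illustrated with a small look-back sketch. This is not Max's patch and not Xalan's ToStream; the class `SplitSurrogateEscaper` and its members are hypothetical, and the only assumption taken from the thread is that text arrives through repeated `characters(char[], int, int)` calls. A high surrogate seen at the end of one call is remembered and joined with the low surrogate that opens the next call.

```java
/** Hypothetical sketch (not Xalan's ToStream): joins surrogate pairs split across calls. */
public class SplitSurrogateEscaper {
    private final StringBuilder out = new StringBuilder();
    // High surrogate carried over from the previous position; 0 means none pending.
    private char pendingHigh = 0;

    /** Processes one buffer; may be called repeatedly with consecutive chunks. */
    public void characters(char[] chars, int start, int length) {
        for (int i = start; i < start + length; i++) {
            char ch = chars[i];
            if (pendingHigh != 0) {
                // Look back: the previous character was a high surrogate.
                char high = pendingHigh;
                pendingHigh = 0;
                if (!Character.isLowSurrogate(ch)) {
                    throw new IllegalStateException("High surrogate not followed by a low surrogate");
                }
                writeNcr(Character.toCodePoint(high, ch));
            } else if (Character.isHighSurrogate(ch)) {
                // Defer output until the low surrogate arrives, possibly in a later call.
                pendingHigh = ch;
            } else if (Character.isLowSurrogate(ch)) {
                throw new IllegalStateException("Low surrogate without a preceding high surrogate");
            } else if (ch < 0x80) {
                out.append(ch);
            } else {
                writeNcr(ch);
            }
        }
    }

    private void writeNcr(int codePoint) {
        out.append("&#").append(codePoint).append(';');
    }

    /** Returns the accumulated output; fails if a high surrogate is still pending. */
    public String result() {
        if (pendingHigh != 0) {
            throw new IllegalStateException("Input ended with an unpaired high surrogate");
        }
        return out.toString();
    }
}
```

Because the state lives in the object rather than in the current buffer, the pair is handled the same way whether or not a buffer boundary falls between its two halves.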
Re: [jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
Good catch!

--
Sent from palmtop; apologies for any auto-incorrections.
() | Text Mail Campaign
/\ | HTML mail is _evil_!

On Feb 17, 2022 9:51 AM, "Max (Jira)" wrote:
> Max commented on XALANJ-2419:
> [~jharrop] thanks for pointing to your work here. Saved me time. Would be nice to have it merged in.
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17493984#comment-17493984 ]

Max commented on XALANJ-2419:
-
[~jharrop] thanks for pointing to your work here. Saved me time. Would be nice to have it merged in.
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16776009#comment-16776009 ]

Jason Harrop commented on XALANJ-2419:
--
[~jespersm] thanks a lot for digging into the Encoding issue.

bq. But ... is anyone planning to make a new version? Or are we just making notes for posterity / when these ten+ year old bugs find a new victim?

bq. Of course it would be good if there were to be a new official release. But even in the absence of that, these musings are helpful :-) since as you pointed out (in Sept 2017), one can make their own branch on GitHub (or wherever). Since I want Java 11 support, that's what I've done at https://github.com/plutext/xalan-j
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775214#comment-16775214 ]

Jesper Steen Møller commented on XALANJ-2419:
-
But ... is anyone planning to make a new version? Or are we just making notes for posterity / when these ten+ year old bugs find a new victim?
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775205#comment-16775205 ]

Jesper Steen Møller commented on XALANJ-2419:
-
Ok, now I get what's wrong with the encoding. It's bad.

The file `Encodings.property` contains mappings between several "Java names" (which probably made sense in the last millennium) and MIME names (which are what should be in the XML inputs). There's some logic to register only the first one, but since that logic iterates over a Properties object, the iteration order will NOT match the order in the file. In other words, it is unpredictable. That's why it suddenly worked when you specified ISO8859_1.

I don't know what changed for Java 11; it could be the hashtable ordering or the accepted charset names, but the crux is that "8859-1" is NOT an acceptable Java name:

{code:java}
jshell> "\u00e8".getBytes("ISO-8859-1")
$1 ==> byte[1] { -24 }
jshell> "\u00e8".getBytes("8859_1")
$2 ==> byte[1] { -24 }
jshell> "\u00e8".getBytes("8859-1")
|  Exception java.io.UnsupportedEncodingException: 8859-1
|        at StringCoding.encode (StringCoding.java:427)
|        at String.getBytes (String.java:941)
|        at (#3:1)
jshell> "\u00e8".getBytes("ISO8859-1")
$4 ==> byte[1] { -24 }
jshell> "\u00e8".getBytes("ISO8859_1")
$5 ==> byte[1] { -24 }
jshell>
{code}

Possible fix: Remove the line "8859-1 ISO-8859-1 0x00FF" and similar patterns from `Encodings.property`?
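The ordering problem described above can be avoided by reading the mapping lines in file order instead of going through a hash-based Properties object. The sketch below is an assumption-laden illustration, not Xalan's Encodings class: `OrderedEncodingTable` and `firstJavaNamePerMime` are hypothetical names, and the line layout "javaName mimeName highChar" is inferred only from the single line quoted in the comment ("8859-1 ISO-8859-1 0x00FF").

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch (not Xalan's Encodings): keeps the first Java name per MIME name. */
public class OrderedEncodingTable {

    /**
     * Parses lines of the assumed form "javaName mimeName highChar" in file order
     * and maps each MIME name to the first Java name listed for it.
     */
    public static Map<String, String> firstJavaNamePerMime(String fileContent) {
        // LinkedHashMap preserves insertion order, unlike java.util.Properties.
        Map<String, String> mimeToJava = new LinkedHashMap<>();
        for (String line : fileContent.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            String[] fields = trimmed.split("\\s+");
            if (fields.length < 2) {
                continue; // not a mapping line
            }
            // First mapping wins; later aliases for the same MIME name are ignored.
            mimeToJava.putIfAbsent(fields[1], fields[0]);
        }
        return mimeToJava;
    }
}
```

With input lines listing ISO8859_1, 8859_1 and 8859-1 for ISO-8859-1 in that order, the result deterministically maps ISO-8859-1 to ISO8859_1, whatever the JVM's hash ordering is.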
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16774695#comment-16774695 ]

Jesper Steen Møller commented on XALANJ-2419:
-
Yeah, the problem is that org.apache.xml.serializer.EncodingInfo.inEncoding(char, String) checks encodability using the names found in `Encodings.property`, and they use legacy names which have been purged from Java 11. So it gets the exception and therefore concludes that, for ISO-8859-1, every character above 127 should be escaped.

Interestingly, I've named the test case "ISO-8859-1 characters should come out as entities" wrong; it's the other way around: they should come out as chars, and that's what's being tested.

As for the astral characters, they do come out as astral characters, but the test also had a \u00a4 character in its expected output, and due to the same problem, it came out as an entity.

Changing the charset name for the test is only a stop-gap measure, since the way the Java charset name is found from Encodings.property is actually wrong: Line 372 of org.apache.xml.serializer.Encodings.loadEncodingInfo() overwrites the MIME mapping for the proper encoding name (ISO-8859-1 in this case) with the last associated Java charset name seen in the file, which ends up being the worst one (= not supported by the JRE). When you use the alternative name, it finds a non-mangled version, since it has to look up by Java encoding name instead of MIME name.

This should really be fixed.
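One way to avoid ending up with a Java charset name the JRE does not support is to test each candidate before registering it. The following is a minimal sketch of that idea, not the fix that was applied to Xalan: `SupportedCharsetPicker` and `firstSupported` are hypothetical names, and the only real API used is `java.nio.charset.Charset.isSupported`.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.util.List;
import java.util.Optional;

/** Hypothetical sketch (not Xalan code): picks a Java charset name the running JRE accepts. */
public class SupportedCharsetPicker {

    /** Returns the first candidate name that this JRE supports, if any. */
    public static Optional<String> firstSupported(List<String> candidates) {
        for (String name : candidates) {
            try {
                if (Charset.isSupported(name)) {
                    return Optional.of(name);
                }
            } catch (IllegalCharsetNameException e) {
                // The name is not even syntactically legal; try the next candidate.
            }
        }
        return Optional.empty();
    }
}
```

For example, `firstSupported(List.of("8859-1", "ISO8859_1", "ISO-8859-1"))` would skip any alias the JRE rejects and return the first one it accepts; which aliases are accepted depends on the JRE, as the jshell session earlier in the thread shows for one Java version.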
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16774677#comment-16774677 ]
Jason Harrop commented on XALANJ-2419:
It works under Java 11 if I change makeStream("ISO-8859-1") to makeStream("ISO8859_1").
With makeStream("ISO-8859-1"), s.getBytes(encoding) throws UnsupportedEncodingException for encoding 8859-1 at
{code:java}
EncodingInfo.inEncoding(char, String) line: 438
EncodingInfo$EncodingImpl.isInEncoding(char) line: 226
EncodingInfo$EncodingImpl.isInEncoding(char) line: 215
EncodingInfo.isInEncoding(char) line: 113
ToXMLStream(ToStream).characters(char[], int, int) line: 1597
ToXMLStream(ToStream).characters(String) line: 1774
ToXMLStreamTest(ToStreamTest).outputCharacters(ToStream, String) line: 88
ToXMLStreamTest.testCase2() line: 114
NativeMethodAccessorImpl.invoke0(Method, Object, Object[]) line: not available [native method]
NativeMethodAccessorImpl.invoke(Object, Object[]) line: 62
DelegatingMethodAccessorImpl.invoke(Object, Object[]) line: 43
Method.invoke(Object, Object...) line: 566
Reporter.executeTests(Test, int, Object) line: 787
ToXMLStreamTest(FileBasedTest).runTestCases(Properties) line: 339
ToXMLStreamTest(TestImpl).runTest(Properties) line: 205
ToXMLStreamTest(FileBasedTest).doMain(String[]) line: 833
ToXMLStreamTest.main(String[]) line: 196
{code}
Not related to 2419, but FYI there is one other test which fails, due to date formatting and http://openjdk.java.net/jeps/252
I've put the test code on GitHub; for Java 11 I am using https://github.com/plutext/xalan-test/tree/Plutext_Java11_xalan-j_2_7_x
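The Java 11 report above turns on which encoding name reaches `String.getBytes(String)`. A minimal sketch of a more defensive probe follows: it asks `java.nio.charset.Charset` whether the name is known before using it, and uses `CharsetEncoder.canEncode(char)` instead of catching `UnsupportedEncodingException`. The class `EncodingProbe` and its method are hypothetical names for this example and do not reflect how Xalan's `EncodingInfo` is implemented.

```java
import java.nio.charset.Charset;

// Hypothetical helper for illustration only; not Xalan's EncodingInfo.
final class EncodingProbe {

    /**
     * Returns true if the named charset is known to this JVM and can encode the
     * single char. Unknown names yield false instead of an exception. Note that
     * Charset.isSupported still throws IllegalCharsetNameException for
     * syntactically illegal names.
     */
    static boolean isInEncoding(char ch, String encodingName) {
        if (!Charset.isSupported(encodingName)) {
            return false;
        }
        Charset charset = Charset.forName(encodingName);
        if (!charset.canEncode()) {
            return false; // decode-only charset
        }
        // Per the JDK documentation, canEncode(char) returns false for a lone surrogate.
        return charset.newEncoder().canEncode(ch);
    }

    public static void main(String[] args) {
        System.out.println(isInEncoding('\u00e9', "ISO-8859-1"));
        System.out.println(isInEncoding('\u00e9', "ISO8859_1"));
        System.out.println(isInEncoding('\uD83D', "UTF-8"));
    }
}
```

Both `ISO-8859-1` and `ISO8859_1` resolve to the same charset here. Whether a shortened name such as `8859-1` resolves depends on the JDK, so the sketch makes no claim about it. The last call also shows why a per-`char` check cannot, on its own, decide how to serialize half of a surrogate pair.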
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16774601#comment-16774601 ]
Jesper Steen Møller commented on XALANJ-2419:
[~jharrop]: Is this using the [http://svn.apache.org/repos/asf/xalan/test/branches/xalan-j_2_7_x] branch? Using 11.0.2-openjdk, the test harness itself complains due to "-Djava.endorsed.dirs".
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16774590#comment-16774590 ]
Jason Harrop commented on XALANJ-2419:
When I run the smoketest under Java 8, it works. When I compile using Java 11 and run the smoketest, for testcase2 I get:
```
```
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439454#comment-16439454 ]
Uwe Schindler commented on XALANJ-2419:
Fix works for me.
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439386#comment-16439386 ]
Uwe Schindler commented on XALANJ-2419:
bq. (Or are you using a jarjar'ed build inside Solr or Lucene?)
Hah, you noticed it. Yes, I am Lucene/Solr. But this issue is more about XML processing in a local project of mine. I know that Solr is affected by this... I also have a workaround for Apache Axis 1.4 (which was easy to fix without patching).
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439379#comment-16439379 ]
Uwe Schindler commented on XALANJ-2419:
Thanks for the fix, I will test it in a moment.
About a release: I am an Apache member and committer, so I might start a thread to push a release. As these bugs are horrible and break almost any XML handling of things like emojis, we should maybe do a bugfix release of serializer.jar. Keep in mind that this would also require a Xerces release, as Xerces and Xalan share serializer.jar (I think they depend on each other on Maven Central). I would try to help with a release.
This fix is indeed simple. Somebody should just commit it (I could theoretically do it, but that should be done by non-project members only as a last resort), and somebody else would press the button for the release.
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439335#comment-16439335 ]
Jesper Steen Møller commented on XALANJ-2419:
Hi [~thetaphi] - Version 3 adds the fix for normal HTML attribute content as well as URL attributes encoded without URL escapes. A ToHTMLStream test is added, which also tests these corner cases (the UTF-8+URL-escaped byte sequences were as expected, but now have a test).
But how do we get anybody to cut a new release? (Or are you using a jarjar'ed build inside Solr or Lucene?)
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439023#comment-16439023 ]
Uwe Schindler commented on XALANJ-2419:
Hi Jesper, thanks! I applied the same patch as yours to my local checkout yesterday and I can confirm it fixes the XML case. But it does not work for my HTML example above; the only workaround for the HTML encoding is the same as before (pass an encoding of UTF-16 and use a writer to write the output to a UTF-8 file, and make sure there is no header with a charset in the HTML serialization).
The issue in ToHTMLStream seems to be a counting problem (it looks like it prints the whole surrogate pair correctly but forgets to increment the counter, so it then prints a hex escape of the second part):
- I had no HREF attributes in my test, so I was not affected by a URL encoding corner case.
- Normal attributes seem to have the above input character counting problem: the astral character is written correctly, but the low surrogate is then printed as an escape.
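The counting problem described above can be reproduced in miniature. The following sketch is a reconstruction of the symptom only, under the assumption that the loop writes the pair and then falls through to a per-`char` fallback; it is not the code in `ToHTMLStream`. The class `AttrLoopSketch`, the method `write`, and the `advancePastLowSurrogate` flag are invented for this example.

```java
// Hypothetical reconstruction of the symptom; not the ToHTMLStream source.
final class AttrLoopSketch {

    /**
     * Writes attribute text. When advancePastLowSurrogate is false, the loop index
     * is not moved past the low surrogate after a pair has been written, so the
     * low surrogate is visited again and escaped on its own.
     */
    static String write(char[] chars, boolean advancePastLowSurrogate) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < chars.length; i++) {
            char ch = chars[i];
            if (Character.isHighSurrogate(ch)
                    && i + 1 < chars.length
                    && Character.isLowSurrogate(chars[i + 1])) {
                out.append(ch).append(chars[i + 1]); // the whole pair, written raw
                if (advancePastLowSurrogate) {
                    i++; // consume the low surrogate as well
                }
            } else if (Character.isSurrogate(ch)) {
                // Per-char fallback: escapes a lone surrogate value, which is ill-formed.
                out.append("&#").append((int) ch).append(';');
            } else if (ch == '"') {
                out.append("&quot;");
            } else if (ch == '&') {
                out.append("&amp;");
            } else if (ch == '<') {
                out.append("&lt;");
            } else {
                out.append(ch);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        char[] value = "x\uD83D\uDC6E".toCharArray(); // "x" followed by U+1F46E
        System.out.println(write(value, false)); // pair, then a stray escape of the low surrogate
        System.out.println(write(value, true));  // pair only
    }
}
```

In the `false` case the output is the correct character followed by `&#56430;`, the decimal value of the low surrogate U+DC6E. This sketch happens to use a decimal escape; the point is the missing index increment, not the escape format.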
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16438878#comment-16438878 ]
Jesper Steen Møller commented on XALANJ-2419:
[~thetaphi]: Now I see what you mean (perhaps): Yes, there is a very tricky similar bug in the attribute values of ToHTMLStream, but not in the general case (I think it's OK due to lines 1440-1447 in writeAttrString, but have *not* tested this.)
I only see the issue for ToHTMLStream in the case of URL attributes such as A#HREF, where the output has explicitly been set to *not* be encoded as a URL (line 1294 in writeAttrURI). The default is to escape HTML attributes containing URLs using URL-encoding, unless overridden with xalan:use-url-escaping=no in the XSLT output options.
(As an aside: I'm no expert, but the UTF-8 encoder inside the URL-encoding (lines 1208-1285 in writeAttrURI) seems legit, if a little verbose, instead of just doing String.getBytes(UTF_8) and hexing that)
My v2 fix above does *not* address the corner case in line 1294.
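The aside above mentions "just doing String.getBytes(UTF_8) and hexing that". A minimal sketch of that idea follows, assuming only non-ASCII code points need escaping; real URL escaping also has to deal with spaces and reserved ASCII characters, which this example ignores. The class `PercentEncodeSketch` and its method are hypothetical and are not taken from `writeAttrURI`.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not the writeAttrURI implementation.
final class PercentEncodeSketch {

    /** Percent-encodes every non-ASCII code point as its UTF-8 byte sequence. */
    static String percentEncodeNonAscii(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            int len = Character.charCount(cp); // a surrogate pair is handled as one unit
            if (cp < 0x80) {
                out.append((char) cp);
            } else {
                byte[] utf8 = s.substring(i, i + len).getBytes(StandardCharsets.UTF_8);
                for (byte b : utf8) {
                    out.append('%').append(String.format("%02X", b & 0xFF));
                }
            }
            i += len;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(percentEncodeNonAscii("a\uD83D\uDC6E")); // U+1F46E is F0 9F 91 AE in UTF-8
    }
}
```

Because the pair is passed to `getBytes` as a unit, the astral character yields one four-byte sequence (`%F0%9F%91%AE`) rather than two separately encoded surrogates.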
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16438864#comment-16438864 ]
Jesper Steen Møller commented on XALANJ-2419:
[~thetaphi]: I'm sure it could be fixed, but would anybody care? I mean, it's been almost 8 years...
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16438848#comment-16438848 ]
Uwe Schindler commented on XALANJ-2419:
Unfortunately the patch does not fix the problem for attributes. Those got better with it, but it outputs the correct character and then the second half of the surrogate pair as a decimal escape.
With the patch, the Policeman Emoji is serialized correctly if it is part of a text node; that case is fixed. But inside an attribute the policeman emoji comes out like:
{code:xml}
{code}
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16164487#comment-16164487 ]
Jesper Steen Møller commented on XALANJ-2419:
The Xalan project appears quite dormant, which is sad, but understandable. I just came across this quite old posting on the subject: https://intellectualcramps.wordpress.com/2011/06/03/xalan-a-step-closer-to-the-attic/
I suggest one of two courses of action:
* Contact the Xalan PMC (use the mailing list, not just JIRA) and volunteer to help in putting out a new release (i.e. look for bugs with patches, or related Unicode issues, e.g. XALANJ-2610). You can find out about the current PMC members and committers here: https://projects.apache.org/committee.html?xalan - ASF house rules say that you need three positive PMC votes to allow a new release. (Perhaps economic incentives work, i.e. pay existing committers to work on the release)
* Fork Xalan-J on GitHub or a similar place. You'll likely have to rename the project so Apache's trademarks aren't infringed, but it should be possible to keep the package names, thus allowing for backwards compatibility (But I'm not a lawyer!)
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16164393#comment-16164393 ]
Nils Faupel commented on XALANJ-2419:
Is there any update on fixing this bug? I don't understand why this bug is not fixed when a working patch is submitted that passes all existing tests and adds new tests for the changes as well.
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15009125#comment-15009125 ]
Scott Mitchell commented on XALANJ-2419:
Hate to be that nagging guy, but any chance of getting a release soon?
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14969778#comment-14969778 ]

Scott Mitchell commented on XALANJ-2419:

Hi Gary, any word on the timing for a bug fix release?
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14969779#comment-14969779 ]

Scott Mitchell commented on XALANJ-2419:

Hi Gary, any word on the timing for a bug fix release?
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14709360#comment-14709360 ]

Scott Mitchell commented on XALANJ-2419:

Understood Gary, glad to hear it will come eventually. I certainly would prefer an official build vs. us forking it and rolling our own version.
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14709218#comment-14709218 ]

Jason Harrop commented on XALANJ-2419:

If you just want to use TransformerIdentityImpl with astral characters, it is possible to repackage little more than org.apache.xml.serializer (and apply the fix in this issue). You also need to modify the *.properties files in that package, to point to your new repackaging. In addition to that package, I repackaged TransformerIdentityImpl, SerializerSwitcher and OutputProperties.

Of course, life would be simpler if the Xalan maintainers would just make a 2.7.3 fixing this 7 year old bug.
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14707688#comment-14707688 ]

Gary Gregory commented on XALANJ-2419:

The only thing blocking me from doing it ATM is priorities. Next week, next month, who knows. Got to pay the bills ;-)
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14707451#comment-14707451 ]

Scott Mitchell commented on XALANJ-2419:

So, given that Jesper has confirmed the compatibility of the patch with the 2.7.X branch, what is the possibility of getting this included in a release in the near term?
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692362#comment-14692362 ]

Jesper Steen Møller commented on XALANJ-2419:

I noticed a confusing comment in the ToStreamTest, where the test method *testCase2()* asserts that

{code}
reporter.check(actual2, AELIG_OSLASH_ARING, "ISO-8859-1 characters should come out as entities");
{code}

This should read

{code}
reporter.check(actual2, AELIG_OSLASH_ARING, "ISO-8859-1 characters should come out unscathed");
{code}

... as it's also what's asserted.
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692327#comment-14692327 ]

Jesper Steen Møller commented on XALANJ-2419:

Yes, they work on those branches, too. The patches were originally produced against the 2.7.1 tag.
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14682136#comment-14682136 ]

Gary Gregory commented on XALANJ-2419:

The 2.7.x maintenance is coming out of:
- https://svn.apache.org/repos/asf/xalan/java/branches/xalan-j_2_7_1_maint
- https://svn.apache.org/repos/asf/xalan/test/branches/xalan-j_2_7_x
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14682094#comment-14682094 ]

Scott Mitchell commented on XALANJ-2419:

Thanks Jesper! So, does this imply it would be possible to get this integrated into a release?
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14681738#comment-14681738 ]

Jesper Steen Møller commented on XALANJ-2419:

I followed the instructions on https://xalan.apache.org/xalan-j/downloads.html#buildmyself on my Mac OS X 10.10.4 with Xcode developer tools installed.

I had to add execute permissions on test/build.sh, and temporarily change my locale to "All American" (otherwise the test "Extension test of javaSample3.xsl" fails).

That worked, and I got 2 x CONGRATULATIONS.

I then applied the tests patch (using svn patch), and then ToStreamTest.runTest() and StreamResultAPITest.runTest() both failed, as was expected.

I then applied the fix, and the tests were once again OK.

So, yes, the fix still applies. This was on Java 1.7.

Hope this helps!
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635451#comment-14635451 ]

Scott Mitchell commented on XALANJ-2419:

I even figured out how to install CVS on my Mac and the Apache CVS server is refusing connections. Sigh...
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635277#comment-14635277 ]

Scott Mitchell commented on XALANJ-2419:

Man, I can't figure out how to get the test source code. Does anyone watching this thread have any insight here? As far as I can tell it might be that the test source code is still in CVS???
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632468#comment-14632468 ]

Scott Mitchell commented on XALANJ-2419:

Hi Gary, I'm trying to answer that question for you, but so far I've been unable to figure out how to run the test suite. I originally downloaded a source distribution to make my modifications, but that doesn't appear to have the tests included in it. I then tried cloning the Git repo and checking out the SVN repo and neither of those seem to include the tests either. Any clue how I can get my hands on the tests?

FWIW, here's the error I'm getting when I run "ant smoketest" or "minitest":

tests-not-available:
     [echo] [tests] The tests do not seem to be present in ../test
     [echo] [tests] You must have checked out from CVS to run the tests,
     [echo] [tests] it is not included in binary distributions.
     [echo] [tests] See http://xml.apache.org/xalan-j/test/ for more info.
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631860#comment-14631860 ]

Gary Gregory commented on XALANJ-2419:

Does the whole test suite pass with this patch on top of 2.7.2?
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631846#comment-14631846 ]

Uwe Schindler commented on XALANJ-2419:
---------------------------------------

Maybe it's a better idea to ask the XERCES people to fix this bug? XALAN seems dead to me, unfortunately.
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631845#comment-14631845 ]

Uwe Schindler commented on XALANJ-2419:
---------------------------------------

One workaround for this is:
- Specify the output encoding as UTF-16.
- Instead of an OutputStream, write the XML to a standard java.io.Writer with the correct charset.

Unfortunately, this prints the wrong charset in the XML header, but that may be fixable with a FilterWriter. I did not have that issue, as we emit no XML header, so I am not sure how hard this is.

But this bug should really be fixed. It makes serializer.jar unusable with any Far East language! This also affects xerces.jar, which also uses serializer.jar when serializing XML.
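The workaround described in the comment above might look roughly like the following when driven through the standard JAXP API. This is a sketch under assumptions, not tested against Xalan 2.7.1 itself: the class and method names are hypothetical, an identity transform stands in for a real stylesheet, and the XML declaration is omitted (as in the commenter's setup) so the mismatched encoding name never reaches the output.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.StringReader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Illustration only: hypothetical wrapper around the standard JAXP calls.
final class Utf16WriterWorkaround {

    // Serializes the given XML text to UTF-8 bytes while telling the
    // serializer that the output encoding is UTF-16.
    static byte[] serializeAsUtf8(String xml) {
        try {
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            // Step 1 of the workaround: declare UTF-16 so surrogates count
            // as "in the encoding" and are passed through unescaped.
            identity.setOutputProperty(OutputKeys.ENCODING, "UTF-16");
            // Avoids emitting a declaration that would name the wrong charset.
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            // Step 2: hand the serializer a Writer; the Writer performs the
            // real byte encoding (UTF-8 here), not the serializer.
            try (Writer writer = new OutputStreamWriter(bytes, StandardCharsets.UTF_8)) {
                identity.transform(new StreamSource(new StringReader(xml)),
                                   new StreamResult(writer));
            }
            return bytes.toByteArray();
        } catch (Exception e) {
            throw new IllegalStateException("serialization failed", e);
        }
    }

    public static void main(String[] args) {
        byte[] out = serializeAsUtf8("<a>\uD83D\uDE00</a>");
        System.out.println(new String(out, StandardCharsets.UTF_8));
    }
}
```

If a declaration is required, it would have to be rewritten after the fact (the comment suggests a FilterWriter), which this sketch does not attempt.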
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631838#comment-14631838 ]

Scott Mitchell commented on XALANJ-2419:
----------------------------------------

Hi, we just got bitten by this same issue. I applied the patch supplied in this bug report and confirmed it fixed our issue. Is there any possibility of getting an officially supported release that has this fix in it? I'd hate to have to run our own customized version of Xalan.
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14567251#comment-14567251 ]

Sergey Oplavin commented on XALANJ-2419:
----------------------------------------

Hello, are there any plans to fix this bug?