[jira] [Comment Edited] (XALANJ-2725) Possible buffer-boundry issue when serializing surrogate pairs
[ https://issues.apache.org/jira/browse/XALANJ-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17810175#comment-17810175 ]

Joe Kesselman edited comment on XALANJ-2725 at 1/24/24 2:44 AM:

Unfortunately, no, the patch doesn't seem to be working for me. For both ToXMLStreamTest and ToHTMLStreamTest, the UTF-8 pass reports:
{code:java}
{code}
I should admit that I don't, at first glance, see +why+ it's failing; might be time to fire up the debugger and watch it fail.

Note that we have some annoyingly parallel solutions in this class – isHighSurrogate() is tested in five different places under multiple serializing loops. One of them, in the *characters* method, actually does admit that there's a buffer bounds risk in its look-ahead; that's the one at line 1598 of my copy:
{code:java}
else if (Encodings.isHighUTF16Surrogate(ch) && i < end-1 && Encodings.isLowUTF16Surrogate(chars[i+1])) {
{code}
though it doesn't do more than dump the surrogates as Numeric Character Entities when the boundary is crossed.

The others (two in *writeNormalizedCharacters,* one in {*}accumDefaultEscape{*}, one in {*}writeAttrString{*}) appear to blithely assume that if buffer division is possible it has been done on Unicode Character, rather than UTF16 unit, boundaries. Which is what I would expect to happen in most of our code. But mistakes get made, and the serializer APIs can be invoked from code other than Xalan, so guarding against this isn't an unreasonable idea.

In fact, that would probably be the right way to test this – write unit-test code that exercises ToStream directly at the API level, rather than trying to do so from the functional-test level.

Some off-the-cuff code review on your patch while I was glancing at it:

I notice that you clear m_highUTF16Surrogate after a surrogate pair, and flush it out before the "fallback plan". The former makes sense. The latter ... Well, ill-formed UTF16 isn't supposed to occur, so that combination really shouldn't happen. If it does, I'm not sure writing the high surrogate out as a numeric character reference makes sense as anything but an error indication, in which case I'd be tempted to write it as _HIGH_SURROGATE_; to make clear that this is what's going on.

If we're going to assume that isolated surrogates are possible at all, your code risks combining them into a single character that never actually existed, since the high-surrogate cache isn't cleared until it's used. That could be hard to diagnose. ("Spooky action at a distance"?) I dislike spending the cycles, but we might want to make sure the high surrogate doesn't outlast the next UTF16 unit even if it isn't a low surrogate.

Or we can assert that providing correct UTF16 input is the responsibility of the users, and sweep the whole issue of isolated surrogates under the carpet.

One more thought: Do we really need to construct a Character to cache a surrogate? Couldn't we just stash the numeric value (unsigned short?), with 0 acting as the "none" case rather than null? Object churn is generally a Bad thing in inner loops.
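The last two review points (don't let a cached high surrogate outlive the next UTF-16 unit, and cache it as a primitive rather than a Character) can be sketched as below. This is only an illustration under assumptions, not the patch and not Xalan code: the class name SurrogateCarry and its methods accept() and finish() are hypothetical, and it collects code points instead of writing to a stream.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Xalan): reassembles surrogate pairs split across buffers. */
final class SurrogateCarry {
    // Primitive cache: 0 means "no pending high surrogate" (U+0000 is never a surrogate),
    // so no Character object is allocated in the inner loop.
    private char pendingHigh = 0;

    /** Returns the code points seen in chars[start, start+length), carrying state across calls. */
    List<Integer> accept(char[] chars, int start, int length) {
        List<Integer> out = new ArrayList<>();
        final int end = start + length;
        for (int i = start; i < end; i++) {
            final char ch = chars[i];
            if (pendingHigh != 0) {
                final char high = pendingHigh;
                pendingHigh = 0; // the cache never outlives the next UTF-16 unit
                if (Character.isLowSurrogate(ch)) {
                    out.add(Character.toCodePoint(high, ch));
                    continue;
                }
                out.add((int) high); // isolated high surrogate: reported as-is, never combined later
            }
            if (Character.isHighSurrogate(ch)) {
                pendingHigh = ch; // wait for the next unit, possibly in the next buffer
            } else {
                out.add((int) ch); // ordinary unit (or an isolated low surrogate)
            }
        }
        return out;
    }

    /** Call at end of input: returns a dangling high surrogate, or -1 if none is pending. */
    int finish() {
        final int leftover = (pendingHigh != 0) ? (int) pendingHigh : -1;
        pendingHigh = 0;
        return leftover;
    }
}
```

Feeding the two halves of a split pair through accept() in separate calls yields one code point on the second call; a high surrogate followed by a non-surrogate is reported on its own instead of lingering in the cache. What a serializer should emit for the isolated case is the open question raised above and is deliberately left out.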
[jira] [Comment Edited] (XALANJ-2725) Possible buffer-boundry issue when serializing surrogate pairs
[ https://issues.apache.org/jira/browse/XALANJ-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17810175#comment-17810175 ]

Joe Kesselman edited comment on XALANJ-2725 at 1/24/24 2:43 AM:

Unfortunately, no, the patch doesn't seem to be working for me. For both ToXMLStreamTest and ToHTMLStreamTest, the UTF-8 pass reports:
{code:java}
{code}
I should admit that I don't, at first glance, see +why+ it's failing; might be time to fire up the debugger and watch it fail.

Note that we have some annoyingly parallel solutions in this class – isHighSurrogate() is tested in five different places under multiple serializing loops. One of them, in the *characters* method, actually does admit that there's a buffer bounds risk in its look-ahead; that's the one at line 1598 of my copy:
{code:java}
else if (Encodings.isHighUTF16Surrogate(ch) && i < end-1 && Encodings.isLowUTF16Surrogate(chars[i+1])) {
{code}
though it doesn't do more than dump the surrogates as Numeric Character Entities when the boundary is crossed.

The others (two in *writeNormalizedCharacters,* one in {*}accumDefaultEscape{*}, one in {*}writeAttrString{*}) appear to blithely assume that if buffer division is possible, that's been done on Unicode Character, rather than UTF16 unit, boundaries. Which is what I would expect to happen in most of our code. But mistakes get made, and the serializer APIs can be invoked from code other than Xalan, so guarding against this isn't an unreasonable idea.

In fact, that would probably be the right way to test this – write unit-test code that exercises ToStream directly, rather than trying to do so from the functional-test level.

Some off-the-cuff code review on your patch while I was glancing at it:

I notice that you clear m_highUTF16Surrogate after a surrogate pair, and flush it out before the "fallback plan". The former makes sense. The latter ... Well, ill-formed UTF16 isn't supposed to occur, so that combination really shouldn't happen. If it does, I'm not sure writing the high surrogate out as a numeric character reference makes sense as anything but an error indication, in which case I'd be tempted to write it as _HIGH_SURROGATE_; to make clear that this is what's going on.

If we're going to assume that isolated surrogates are possible at all, your code risks combining them into a single character that never actually existed, since the high-surrogate cache isn't cleared until it's used. That could be hard to diagnose. ("Spooky action at a distance"?) I dislike spending the cycles, but we might want to make sure the high surrogate doesn't outlast the next UTF16 unit even if it isn't a low surrogate.

Or we can assert that providing correct UTF16 input is the responsibility of the users, and sweep the whole issue of isolated surrogates under the carpet.

One more thought: Do we really need to construct a Character to cache a surrogate? Couldn't we just stash the numeric value (unsigned short?), with 0 acting as the "none" case rather than null? Object churn is generally a Bad thing in inner loops.
[jira] [Commented] (XALANJ-2725) Possible buffer-boundry issue when serializing surrogate pairs
[ https://issues.apache.org/jira/browse/XALANJ-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17810175#comment-17810175 ]

Joe Kesselman commented on XALANJ-2725:
---

Unfortunately, no, the patch doesn't seem to be working for me. For both ToXMLStreamTest and ToHTMLStreamTest, the UTF-8 pass reports:
{code:java}
{code}
I should admit that I don't, at first glance, see +why+ it's failing; might be time to fire up the debugger and watch it fail.

Note that we have some annoyingly parallel solutions in this class – isHighSurrogate() is tested in five different places under multiple serializing loops. One of them, in the *characters* method, actually does admit that there's a buffer bounds risk in its look-ahead; that's the one at line 1598 of my copy:
{code:java}
else if (Encodings.isHighUTF16Surrogate(ch) && i < end-1 && Encodings.isLowUTF16Surrogate(chars[i+1])) {
{code}
though it doesn't do more than dump the surrogates as Numeric Character Entities when the boundary is crossed.

The others (two in *writeNormalizedCharacters,* one in {*}accumDefaultEscape{*}, one in {*}writeAttrString{*}) appear to blithely assume that if buffer division is possible, that's been done on Unicode Character, rather than UTF16 unit, boundaries. Which is what I would expect to happen in most of our code. But mistakes get made, and the serializer APIs can be invoked from code other than Xalan, so guarding against this isn't an unreasonable idea.

In fact, that would probably be the right way to test this – write unit-test code that exercises ToStream directly, rather than trying to do so from the functional-test level.

Some off-the-cuff code review on your patch while I was glancing at it:

I notice that you clear m_highUTF16Surrogate after a surrogate pair, and flush it out before the "fallback plan". The former makes sense. The latter ... Well, ill-formed UTF16 isn't supposed to occur, so that combination really shouldn't happen. If it does, I'm not sure writing the high surrogate out as a numeric character reference makes sense as anything but an error indication, in which case I'd be tempted to write it as _HIGH_SURROGATE_; to make clear that this is what's going on.

If we're going to assume that isolated surrogates are possible at all, your code risks combining them into a single character that never actually existed, since the high-surrogate cache isn't cleared until it's used. That could be hard to diagnose. ("Spooky action at a distance"?) I dislike spending the cycles, but we might want to make sure the high surrogate doesn't outlast the next UTF16 unit even if it isn't a low surrogate.

Or we can assert that providing correct UTF16 input is the responsibility of the users, and sweep the whole issue of isolated surrogates under the carpet.

One more thought: Do we really need to construct a Character to cache a surrogate? Couldn't we just stash the numeric value (unsigned short?), with 0 acting as the "none" case rather than null? Object churn is generally a Bad thing in inner loops.

> Possible buffer-boundry issue when serializing surrogate pairs
> --
>
> Key: XALANJ-2725
> URL: https://issues.apache.org/jira/browse/XALANJ-2725
> Project: XalanJ2
> Issue Type: Improvement
> Security Level: No security risk; visible to anyone (Ordinary problems in Xalan projects. Anybody can view the issue.)
> Components: Serialization
> Reporter: Joe Kesselman
> Assignee: Joe Kesselman
> Priority: Major
> Labels: Surrogates, escaping, unicode, utf
> Attachments: astral-chars-split-buffer.patch
>
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> XALANJ-2419 addressed a case where "astral" Unicode characters, requiring a surrogate pair (two UTF-16 units), were not being serialized correctly. We have a proposed fix for that.
> There is reported to still be an edge case where a surrogate pair which crosses buffer boundaries might not be handled correctly. [~maxfortun] offered what looks like a reasonable proposed fix (https://github.com/maxfortun/xalan-j/blob/a9bd5591d9f8a523548aeec091e886b64c691628/src/org/apache/xml/serializer/ToStream.java#L1607), but in my testing this was not serializing the surrogate pairs correctly, causing regression on the tests XALANJ-2419 introduced. I don't know whether that's because we're taking multiple paths through
> But the edge case does appear to be real, and if so we will need some such solution.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
-
To unsubscribe, e-mail: dev-unsubscr...@xalan.apache.org
For additional commands, e-mail:
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17810125#comment-17810125 ]

Joe Kesselman commented on XALANJ-2419:
---

Everything should have been carried forward. That may have been done manually and history may have been lost, but I don't believe any actual work has been lost.

In any case, Master *IS* where new development is going. So if you can find anything which has not been addressed there, please flag it for our attention. Just don't assume that the absence of a particular commit, or a particular merge, means something is missing. Real-world git histories sometimes get messy, especially when operated by real-world humans.

("Begin by assuming a spherical cow...")

> Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
> -
>
> Key: XALANJ-2419
> URL: https://issues.apache.org/jira/browse/XALANJ-2419
> Project: XalanJ2
> Issue Type: Bug
> Components: Serialization
> Affects Versions: 2.7.1
> Reporter: Henri Sivonen
> Assignee: Joe Kesselman
> Priority: Major
> Fix For: The Latest Development Code
>
> Attachments: XALANJ-2419-fix-v3.txt, XALANJ-2419-tests-v3.txt
>
>
> org.apache.xml.serializer.ToStream contains the following code:
>     else if (m_encodingInfo.isInEncoding(ch)) {
>         // If the character is in the encoding, and
>         // not in the normal ASCII range, we also
>         // just leave it get added on to the clean characters
>
>     }
>     else {
>         // This is a fallback plan, we should never get here
>         // but if the character wasn't previously handled
>         // (i.e. isn't in the encoding, etc.) then what
>         // should we do? We choose to write out an entity
>         writeOutCleanChars(chars, i, lastDirtyCharProcessed);
>
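For readers unfamiliar with the symptom named in the issue title, the difference between escaping per UTF-16 unit and escaping per code point can be shown with plain JDK calls. This is a standalone sketch, not the ToStream code or the attached fix; the class NcrDemo and both method names are hypothetical, and it escapes every non-ASCII character purely to make the contrast visible (a UTF-8 serializer would normally write such characters directly rather than as NCRs).

```java
/** Hypothetical demo (not Xalan code): per-unit vs per-code-point numeric character references. */
final class NcrDemo {

    /** Faulty variant: treats each UTF-16 unit as a character, so a pair becomes two NCRs. */
    static String escapePerUnit(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (ch < 0x80) {
                sb.append(ch);
            } else {
                sb.append("&#").append((int) ch).append(';');
            }
        }
        return sb.toString();
    }

    /** Intended variant: combines a surrogate pair into one code point before escaping. */
    static String escapePerCodePoint(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i); // reads both units of a well-formed pair
            if (cp < 0x80) {
                sb.append((char) cp);
            } else {
                sb.append("&#").append(cp).append(';');
            }
            i += Character.charCount(cp); // advance by 1 or 2 UTF-16 units
        }
        return sb.toString();
    }
}
```

For U+1F600 (the pair D83D DE00), escapePerUnit produces `&#55357;&#56832;`, which is not well-formed XML because surrogate values are not legal characters, while escapePerCodePoint produces `&#128512;`.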
[jira] [Comment Edited] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17810125#comment-17810125 ]

Joe Kesselman edited comment on XALANJ-2419 at 1/23/24 10:41 PM:
-

Everything should have been carried forward. That may have been done manually and history may have been lost, but I don't believe any actual work has been lost.

In any case, Master *IS* where new development is going. So if you can find anything which has not been addressed there, please flag it for our attention. Just don't assume that the absence of a particular commit, or a particular merge, means something is missing. Real-world git histories sometimes get messy, especially when operated by real-world humans. And don't let past mistakes get in the way of doing what's right in the future.

("Begin by assuming a spherical cow...")
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17810119#comment-17810119 ]

Cédric Damioli commented on XALANJ-2419:

I totally agree with you in theory, but the fact is that 2.7.2 and 2.7.3 were *not* released from master, or am I wrong here?

There is a 2_7_x_maint, but with no commits in 5 years. I'm afraid we've lost some commits here with the loss of 2_7_1_maint? Or were all commits on 2_7_1_maint from the last few years actually backports from master?
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17810117#comment-17810117 ]

Joe Kesselman commented on XALANJ-2419:
---

[~cdamioli]: I believe that was due to some mistakes in how the release was handled, and the awkward juggling needed to correct those mistakes.

*Master* is always supposed to be our primary development branch. New code may be developed on other branches but isn't official until it is merged into {*}Master{*}.

When a release is made, a tag or fork is created for that release number. Thus, there should be branches/tags for {*}2.7.1{*}, {*}2.7.2{*}, and *2.7.3* (along with older checkpoints).

If hot fixes are needed which must be applied to code that has already been released (rather than just being included in the next release), we may create *maint* branches where the change is back-ported to the earlier versions. Essentially *2.7.1.maint* is the "development master" for *2.7.1.1.*

This does _not_ mean *Master* should be derived from *maint* branches. It does mean that if something is fixed in an old release, Master should also be fixed – but due to code evolution over time, the fix may not be identical, and *maint* is not one of *Master's* dependencies, so that must be done manually.

I believe that what I've just described is standard SCCS "best practice". It's certainly how we managed Xalan (mumble) years ago before I dropped out of it.
[jira] [Resolved] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joe Kesselman resolved XALANJ-2419.
---
Fix Version/s: The Latest Development Code
Resolution: Fixed

Fixed on head. Haven't yet incremented semantic version to 2.7.3.1. Should?
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17810114#comment-17810114 ]

Cédric Damioli commented on XALANJ-2419:

I may be wrong here, but I think 2.7.2 and 2.7.3 were not released from master but from some maintenance branch. Something like 2_7_1_maint, IIRC.

By the way, I can't find that branch anymore in the GitHub repo. Do you know where it is?
Re: [PR] Make sure the new stream tests gate apitest success or failure [xalan-test]
jkesselm merged PR #9: URL: https://github.com/apache/xalan-test/pull/9 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@xalan.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: dev-unsubscr...@xalan.apache.org For additional commands, e-mail: dev-h...@xalan.apache.org
[PR] Make sure the new stream tests gate apitest success or failure [xalan-test]
jkesselm opened a new pull request, #9:
URL: https://github.com/apache/xalan-test/pull/9

Notes to myself:

Apparently we're relying on an explicit list of expected-good tests in build.xml, looking for the Pass-testname files.

There is currently one known fail in SmoketestOuttakes (which is why it's an outtake), which I believe is also reflected in Harness failing. Should sanity check that it's something we're aware of and have either accepted divergence on or opened a work item for.

Also note that the performance tests (TimeDTM*) are deliberately considered ambiguous, since we don't have anything calculating "reasonable" time limits for a platform, and indeed that's nontrivial to do. Performance testing is usually conducted manually/locally.
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17810111#comment-17810111 ]

Joe Kesselman commented on XALANJ-2419:
---

Merged into Master. Obviously if there seems to be a regression vs. 2.7.3, please let me know.
Re: [PR] Xalanj 2419 -- Issues serializing astral characters [xalan-test]
jkesselm merged PR #8:
URL: https://github.com/apache/xalan-test/pull/8
Re: [PR] XALANJ-2419: Erroneous serialization of astral characters [xalan-java]
jkesselm merged PR #163:
URL: https://github.com/apache/xalan-java/pull/163
[jira] [Commented] (XALANJ-2725) Possible buffer-boundry issue when serializing surrogate pairs
[ https://issues.apache.org/jira/browse/XALANJ-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17810065#comment-17810065 ]

Max commented on XALANJ-2725:
-

[~kesh...@alum.mit.edu], I added a patch I tried with your PR. See if your regression tests pass now. Need to figure out a good test case for the split buffer.
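One possible starting point for the split-buffer test case mentioned above is to enumerate every position at which cutting the input would separate a high surrogate from its low surrogate, then feed the two halves to the serializer in separate calls. The sketch below covers only the enumeration and uses plain JDK calls; the class SplitPoints and the method pairSplittingCuts are hypothetical names, and the serializer call itself is omitted so as not to assume anything about how ToStream would be set up in a unit test.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical test helper (not Xalan code): finds cut positions that split a surrogate pair. */
final class SplitPoints {

    /**
     * Returns every index c such that text[0, c) ends with a high surrogate
     * and text[c, length) starts with the matching low surrogate.
     */
    static List<Integer> pairSplittingCuts(String text) {
        List<Integer> cuts = new ArrayList<>();
        for (int cut = 1; cut < text.length(); cut++) {
            if (Character.isHighSurrogate(text.charAt(cut - 1))
                    && Character.isLowSurrogate(text.charAt(cut))) {
                cuts.add(cut);
            }
        }
        return cuts;
    }
}
```

A test could then, for each returned cut, pass `text.substring(0, cut)` and `text.substring(cut)` as two consecutive character buffers and compare the combined output with the output for the unsplit text.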
[jira] [Updated] (XALANJ-2725) Possible buffer-boundry issue when serializing surrogate pairs
[ https://issues.apache.org/jira/browse/XALANJ-2725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max updated XALANJ-2725: Attachment: astral-chars-split-buffer.patch
Re: [PR] improving xpath 3.1 function fn:deep-equal's implementation, by adding support for collation argument. adding a new working related test case as well. committing new xercesj implementation ja
mukulga merged PR #164: URL: https://github.com/apache/xalan-java/pull/164
[PR] improving xpath 3.1 function fn:deep-equal's implementation, by adding support for collation argument. adding a new working related test case as well. committing new xercesj implementation jar as
mukulga opened a new pull request, #164: URL: https://github.com/apache/xalan-java/pull/164 (no comment)
[PR] XALANJ-2419: Erroneous serialization of astral characters [xalan-java]
jkesselm opened a new pull request, #163: URL: https://github.com/apache/xalan-java/pull/163 See also https://github.com/apache/xalan-test/compare/master...XALANJ-2419 Note that there is a remaining edge case, https://issues.apache.org/jira/browse/XALANJ-2725, but this PR is a definite step forward.
[PR] Xalanj 2419 -- Issues serializing astral characters [xalan-test]
jkesselm opened a new pull request, #8: URL: https://github.com/apache/xalan-test/pull/8 See also https://github.com/apache/xalan-java/compare/master...XALANJ-2419
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17810047#comment-17810047 ]

Joe Kesselman commented on XALANJ-2419:
---------------------------------------

Opened https://issues.apache.org/jira/browse/XALANJ-2725 for [~maxfortun]'s issue.

> Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
> ---------------------------------------------------------------------------------------------
>
> Key: XALANJ-2419
> URL: https://issues.apache.org/jira/browse/XALANJ-2419
> Project: XalanJ2
> Issue Type: Bug
> Components: Serialization
> Affects Versions: 2.7.1
> Reporter: Henri Sivonen
> Assignee: Joe Kesselman
> Priority: Major
> Attachments: XALANJ-2419-fix-v3.txt, XALANJ-2419-tests-v3.txt
>
>
> org.apache.xml.serializer.ToStream contains the following code:
>     else if (m_encodingInfo.isInEncoding(ch)) {
>         // If the character is in the encoding, and
>         // not in the normal ASCII range, we also
>         // just leave it get added on to the clean characters
>
>     }
>     else {
>         // This is a fallback plan, we should never get here
>         // but if the character wasn't previously handled
>         // (i.e. isn't in the encoding, etc.) then what
>         // should we do? We choose to write out an entity
>         writeOutCleanChars(chars, i, lastDirtyCharProcessed);
>
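For readers unfamiliar with the symptom in the issue title, here is a minimal sketch of the intended behavior: when an astral character has to be escaped, the two UTF-16 units are first combined into one code point so that a single numeric character reference is written instead of one per surrogate. This is not the Xalan implementation; AstralNcrSketch and escape are hypothetical names, only JDK calls are used, and the decision to escape everything outside ASCII is an assumption made to keep the example short.

```java
// Hypothetical illustration; not taken from org.apache.xml.serializer.ToStream.
class AstralNcrSketch {

    /** Escapes every non-ASCII character in the range as a decimal numeric character reference. */
    static String escape(char[] chars, int start, int length) {
        StringBuilder out = new StringBuilder();
        int end = start + length;
        for (int i = start; i < end; i++) {
            char ch = chars[i];
            if (ch < 0x80) {
                out.append(ch);
            } else if (Character.isHighSurrogate(ch) && i < end - 1
                    && Character.isLowSurrogate(chars[i + 1])) {
                // Combine the pair into one code point and emit ONE reference.
                int codePoint = Character.toCodePoint(ch, chars[i + 1]);
                out.append("&#").append(codePoint).append(';');
                i++; // the low surrogate has been consumed
            } else {
                // BMP character, or an unpaired surrogate (ill-formed input;
                // a real serializer would probably report an error here).
                out.append("&#").append((int) ch).append(';');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        char[] text = "a\uD83D\uDE00b".toCharArray(); // U+1F600 as a surrogate pair
        System.out.println(escape(text, 0, text.length)); // prints a&#128512;b
    }
}
```

The look-ahead guard `i < end - 1` is the same kind of bounds check discussed for the split-buffer case: when the high surrogate is the last unit in the buffer, this sketch falls through to the unpaired branch.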
[jira] [Created] (XALANJ-2725) Possible buffer-boundry issue when serializing surrogate pairs
Joe Kesselman created XALANJ-2725:
----------------------------------

Summary: Possible buffer-boundry issue when serializing surrogate pairs
Key: XALANJ-2725
URL: https://issues.apache.org/jira/browse/XALANJ-2725
Project: XalanJ2
Issue Type: Improvement
Security Level: No security risk; visible to anyone (Ordinary problems in Xalan projects. Anybody can view the issue.)
Components: Serialization
Reporter: Joe Kesselman
Assignee: Joe Kesselman

XALANJ-2419 addressed a case where "astral" Unicode characters, requiring a surrogate pair (two UTF-16 units), were not being serialized correctly. We have a proposed fix for that.

There is reported to still be an edge case when a surrogate pair which crosses buffer boundaries might not be handled correctly. [~maxfortun] offered what looks like a reasonable proposed fix (https://github.com/maxfortun/xalan-j/blob/a9bd5591d9f8a523548aeec091e886b64c691628/src/org/apache/xml/serializer/ToStream.java#L1607), but in my testing this was not serializing the surrogate pairs correctly, causing regression on the tests XALANJ-2419 introduced. I don't know whether that's because we're taking multiple paths through

But the edge case does appear to be real, and if so we will need some such solution.
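To make the edge case concrete, the following sketch shows one general way a serializer could cope with a surrogate pair divided across two characters() calls: remember a trailing high surrogate and pair it with the first unit of the next buffer. This is an illustration of the idea only, not Max's patch and not the ToStream code; SplitSurrogateSketch, pendingHigh and appendCodePoint are hypothetical names, and escaping all non-ASCII output is an assumption made for brevity.

```java
// Hypothetical illustration of carrying a high surrogate across buffers; not Xalan code.
class SplitSurrogateSketch {
    private final StringBuilder out = new StringBuilder();
    private char pendingHigh = 0; // 0 means no high surrogate is waiting

    void characters(char[] chars, int start, int length) {
        int end = start + length;
        int i = start;
        // First, resolve a high surrogate left over from the previous call.
        if (pendingHigh != 0 && i < end) {
            if (Character.isLowSurrogate(chars[i])) {
                appendCodePoint(Character.toCodePoint(pendingHigh, chars[i]));
                i++;
            } else {
                appendCodePoint(pendingHigh); // unpaired: ill-formed input
            }
            pendingHigh = 0;
        }
        for (; i < end; i++) {
            char ch = chars[i];
            if (!Character.isHighSurrogate(ch)) {
                appendCodePoint(ch);
            } else if (i + 1 >= end) {
                pendingHigh = ch; // pair may continue in the next buffer
            } else if (Character.isLowSurrogate(chars[i + 1])) {
                appendCodePoint(Character.toCodePoint(ch, chars[i + 1]));
                i++;
            } else {
                appendCodePoint(ch); // unpaired: ill-formed input
            }
        }
    }

    // Writes ASCII directly and everything else as a decimal character reference.
    private void appendCodePoint(int codePoint) {
        if (codePoint < 0x80) {
            out.append((char) codePoint);
        } else {
            out.append("&#").append(codePoint).append(';');
        }
    }

    // Note: a surrogate still pending at end of document is not flushed in this sketch.
    String result() {
        return out.toString();
    }

    public static void main(String[] args) {
        SplitSurrogateSketch s = new SplitSurrogateSketch();
        s.characters(new char[] {'a', '\uD83D'}, 0, 2); // buffer ends on the high surrogate
        s.characters(new char[] {'\uDE00', 'b'}, 0, 2); // next buffer starts with the low one
        System.out.println(s.result()); // prints a&#128512;b
    }
}
```

Because the pending state lives in the object rather than in the loop, every serializing path that looks ahead for a low surrogate would need to consult it, which may be relevant to the "multiple paths" concern raised in the description.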
[jira] [Comment Edited] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17810043#comment-17810043 ] Max edited comment on XALANJ-2419 at 1/23/24 5:08 PM: -- [~kesh...@alum.mit.edu] , thank you for working on this. As you suggested, why don't you merge what works and I can try to help you work on the split buffer issue after? on a good code? was (Author: maxfortun): [~kesh...@alum.mit.edu] , thank you for working on this. As you suggested, why don't you merge what works and I can try to help you work on the split buffer issue after on a good code?
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17810043#comment-17810043 ] Max commented on XALANJ-2419: - [~kesh...@alum.mit.edu] , thank you for working on this. As you suggested, why don't you merge what works and I can try to help you work on the split buffer issue after on a good code?
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17810039#comment-17810039 ] Joe Kesselman commented on XALANJ-2419: --- Still having trouble with Max's suggestion. Current recommendation: Merge what we've got and open a new work item for that concern.
[jira] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419 ] Joe Kesselman deleted comment on XALANJ-2419: --- was (Author: JIRAUSER285361): Max's alternative does cause a regression in some of the new tests, assuming I applied it correctly. Surprising. Can take a longer look, but may want to merge what we have first since it *is* an improvement over the previous code.
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17810029#comment-17810029 ] Joe Kesselman commented on XALANJ-2419: --- Max's alternative does cause a regression in some of the new tests, assuming I applied it correctly. Surprising. Can take a longer look, but may want to merge what we have first since it *is* an improvement over the previous code.