Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?
On 3/21/2019 1:23 PM, Paul Moore wrote: On Thu, 21 Mar 2019 at 17:05, Steve Holden wrote: Especially as the standards specifically say that ordering has no semantic impact. Byte-by-byte comparison of XML is almost always inappropriate. Conversely, if ordering has no semantic impact, there's no real justification for asking for the current order to be changed. In practice, allowing the user to control the ordering (by preserving input order) gives users a way of handling (according to the standard) broken consumers who ascribe semantic meaning to the attribute order. Or, as Jonathan Goble said elsewhere, use an order that makes whatever sense to the author and other readers. The order of positional parameter names in a function definition has no semantic meaning to python, but it would be terrible to make them be sorted. -- Terry Jan Reedy ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?
Victor Stinner wrote on 21.03.19 at 01:22:
> Alternatives have been proposed like a recipe to sort node attributes
> before serialization, but honestly, it's way too complex.

Hm, really? Five lines of simple and obvious Python code, that provide a fast and completely Python-version agnostic solution to the problem that a few users have, are "way too complex"? That sounds a bit extreme to me.

> I don't want
> to have to copy such recipe to every project. Add a new function,
> import it, use it where XML is written into a file, etc. Taken alone,
> maybe it's acceptable. But please remember that some companies are
> still porting their large Python 2 code base to Python 3. This new
> backward incompatible change gets on top of the pile of other backward
> incompatible changes between 2.7 and 3.8.
>
> I would prefer to be able to "just add" sort=True. Don't forget that
> tests like "if sys.version_info >= (3, 8):" will be needed which makes the
> overall fix more complicated.

Yes, exactly! Users would have to add that option *conditionally* to their code somewhere. Personally, I really dislike having to say "if Python version is X do this, otherwise, do that". I prefer a solution that just works. There are at least four approaches that generally work across Python releases: ignoring the ordering, using C14N, creating attributes in order, sorting attributes before serialisation. I'd prefer if users picked one of those, preferably the right one for their use case, rather than starting to put version specific kludges into their code.

Stefan
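The recipe itself is not quoted in this message, so the following is only a sketch of what a sort-before-serialisation helper along those lines might look like. The function name `sort_attributes` and the sample document are assumptions made for illustration. It relies on element attributes being stored in an ordinary dict, so on interpreters that serialise attributes in insertion order the output ends up sorted by name, and on older interpreters that already sort, it changes nothing.

```python
import xml.etree.ElementTree as ET


def sort_attributes(root):
    """Reorder the attributes of every element under root by name, in place."""
    for el in root.iter():
        attrib = el.attrib
        if len(attrib) > 1:
            # Rebuild the dict so that its insertion order is the sorted order.
            items = sorted(attrib.items())
            attrib.clear()
            attrib.update(items)


# Hypothetical sample document with attributes deliberately out of order.
root = ET.fromstring('<doc z="1" a="2"><item m="3" b="4"/></doc>')
sort_attributes(root)
serialized = ET.tostring(root, encoding="unicode")
```

Because the helper only touches the in-memory tree, it needs no version check at the call site: the same code can run before every `ET.tostring()` or `ElementTree.write()` call.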
Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?
On Thu, Mar 21, 2019, 1:05 PM Steve Holden wrote:
> On Thu, Mar 21, 2019 at 11:33 AM Antoine Pitrou wrote:
>
>> [...]
>>
>> Most users and applications should /never/ care about the order of XML
>> attributes.
>>
>> Regards
>>
>> Antoine
>
> Especially as the standards specifically say that ordering has no semantic
> impact.

When you have a lot of attributes, though, sometimes having them in a particular defined order can make it easier to reason about and make sense of the code when manually reviewing it.
Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?
On Thu, 21 Mar 2019 at 17:05, Steve Holden wrote: > > On Thu, Mar 21, 2019 at 11:33 AM Antoine Pitrou wrote: >> >> [...] >> >> Most users and applications should /never/ care about the order of XML >> attributes. >> >> Regards >> >> Antoine > > > Especially as the standards specifically say that ordering has no semantic > impact. > > Byte-by-byte comparison of XML is almost always inappropriate. Conversely, if ordering has no semantic impact, there's no real justification for asking for the current order to be changed. In practice, allowing the user to control the ordering (by preserving input order) gives users a way of handling (according to the standard) broken consumers who ascribe semantic meaning to the attribute order. So there's a small benefit for real-world users having to deal with non-compliant software. But that benefit is by definition small, as standards-compliant software won't be affected. The cost of making the change to projects that rely on the current output is significant, and that should be considered. But there's also the question of setting a precedent. If we do reject this change because of the cost to 3rd parties, are we then committing Python to guaranteeing sorted attribute order (and worse, byte-for-byte reproducible output) for ever - a far stronger commitment than the standards require of us? That seems to me to be an extremely bad precedent to set. There's no good answer here - maybe a possible compromise would be for us to document explicitly in 3.8 that output is only guaranteed identical to the level the standards require (i.e., attribute order is not guaranteed to be preserved) and then make this change in 3.9. But in practice, that's not really any better for projects like coverage - it just delays the point when they have to bite the bullet (and it's not like 3.8 is imminent - there's plenty of time between now and 3.8 without adding an additional delay). 
Reluctantly, I think I'd have to say that I don't think we should reject this change simply because existing users rely on the exact output currently being produced. To mitigate the impact on 3rd parties, it would be very helpful if we could add to the stdlib some form of "compare two XML documents for semantic equality up to the level that the standards require". 3rd party code could then use that if it's present, and fall back to byte-equality if it's not. If we could get something like that for 3.9, but not for 3.8, then that would seem to me to be a good reason to defer this change until 3.9 (because we don't want to have 3.8 being an exception where there's no semantic comparison function, but the byte-equality fallback doesn't work - that's just needlessly annoying).

Paul
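No such stdlib function is named in this message, so the following is only a hand-rolled sketch of what a semantic comparison could look like. The names `elements_equal` and `xml_equal` are hypothetical, and treating leading and trailing whitespace in text as insignificant is an assumption that will not suit every document type.

```python
import xml.etree.ElementTree as ET


def _norm(text):
    # Assumption: surrounding whitespace carries no meaning for this comparison.
    return (text or "").strip()


def elements_equal(a, b):
    """Compare two elements by tag, attributes, text and children, recursively."""
    return (
        a.tag == b.tag
        and a.attrib == b.attrib  # dict equality ignores attribute order
        and _norm(a.text) == _norm(b.text)
        and _norm(a.tail) == _norm(b.tail)
        and len(a) == len(b)
        and all(elements_equal(x, y) for x, y in zip(a, b))
    )


def xml_equal(left, right):
    """Parse two XML strings and compare them element by element."""
    return elements_equal(ET.fromstring(left), ET.fromstring(right))
```

A test could then assert `xml_equal(actual, expected)` instead of `actual == expected`, which keeps passing whether or not the serialiser sorts attributes.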
Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?
On Thu, Mar 21, 2019 at 11:33 AM Antoine Pitrou wrote:
> [...]
>
> Most users and applications should /never/ care about the order of XML
> attributes.
>
> Regards
>
> Antoine

Especially as the standards specifically say that ordering has no semantic impact.

Byte-by-byte comparison of XML is almost always inappropriate.
Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?
On Thu, 21 Mar 2019 02:07:01 +0100 Victor Stinner wrote: > Le lun. 18 mars 2019 à 23:41, Raymond Hettinger > a écrit : > > The code in the current 3.8 alpha differs from 3.7 in that it removes > > attribute sorting and instead preserves the order the user specified when > > creating an element. As far as I can tell, there is no objection to this > > as a feature. > > By the way, what's the rationale of this backward incompatible change? > > I found this short message: > "FWIW, this issue arose from an end-user problem. She had a hard > requirement to show a security clearance level as the first attribute. > We did find a work around but it was hack." > https://bugs.python.org/issue34160#msg338098 > > It's the first time that I hear an user asking to preserve attribute > insertion order (or did I miss a previous request?). Technically, it > was possible to implement the feature earlier using OrderedDict. So > why doing it now? > > Is it really worth it to break Python backward compatibility (change > the default behavior) for everyone, if it's only needed for few users? The argument you're making is weird here. If only "a few users" need a deterministic ordering of XML attributes, then compatibility is broken only for "a few users", not for "everyone". Most users and applications should /never/ care about the order of XML attributes. Regards Antoine. ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?
> On Mar 20, 2019, at 6:07 PM, Victor Stinner wrote:
>
> what's the rationale of this backward incompatible change?

Please refrain from abusive mischaracterizations. It is only backwards incompatible if there was a guaranteed behavior. Whether there was or not is what this thread is about.

My reading of this thread was that the various experts did not want to lock in the 3.7 behavior nor did they think the purpose of the XML modules is to produce an exact binary output. The lxml maintainer is dropping sorting (it's expensive and it overrides the order specified by the user). Other XML modules don't sort. It only made sense as a way to produce a deterministic output within a feature release back when there was no other way to do it.

For my part, any agreed upon outcome is fine. I'm not willing to be debased further, so I am out of this discussion. It's up to you all to do the right thing.

Raymond
Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?
Le lun. 18 mars 2019 à 23:41, Raymond Hettinger a écrit : > The code in the current 3.8 alpha differs from 3.7 in that it removes > attribute sorting and instead preserves the order the user specified when > creating an element. As far as I can tell, there is no objection to this as > a feature. By the way, what's the rationale of this backward incompatible change? I found this short message: "FWIW, this issue arose from an end-user problem. She had a hard requirement to show a security clearance level as the first attribute. We did find a work around but it was hack." https://bugs.python.org/issue34160#msg338098 It's the first time that I hear an user asking to preserve attribute insertion order (or did I miss a previous request?). Technically, it was possible to implement the feature earlier using OrderedDict. So why doing it now? Is it really worth it to break Python backward compatibility (change the default behavior) for everyone, if it's only needed for few users? > 1) Revert back to the 3.7 behavior. This of course, makes all the test pass > :-) The downside is that it perpetuates the practice of bytewise equality > tests and locks in all implementation quirks forever. I don't know of anyone > advocating this option, but it is the simplest thing to do. Can't we revert Python 3.7 behavior and add a new opt-in option to preserve the attribution insertion order (current Python 3.8 default behavior)? Python 3.7, sorting attributes by name, doesn't sound so silly to me. It's one arbitrary choice, but at least the output is deterministic. And well, Python is doing that for 20 years :-) > 4) Fix the tests in the third-party modules (...) I also like the option "not break the backward compatibility" to not have to fix any project :-) Victor -- Night gathers, and now my watch begins. It shall not end until my death. 
Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?
> On Mar 20, 2019, at 5:22 PM, Victor Stinner wrote:
>
> I don't understand why such simple solution has been rejected.

It hasn't been rejected. That is above my pay grade. Stefan and I recommended against going down this path. However, since you're in disagreement and have marked this as a release blocker, it is now time for the steering committee to earn their pay (which is at least double what I'm making) or defer to the principal module maintainer, Stefan.

To recap reasons for not going down this path:

1) The only known use case for a "sort=True" parameter is to perpetuate the practice of byte-by-byte output comparisons guaranteed to work across feature releases. The various XML experts in this thread have opined that isn't something we should guarantee (and sorting isn't the only detail subject to change, Stefan listed others).

2) The intent of the XML modules is to implement the specification and be interoperable with other languages and other XML tools. It is not intended to be used to generate an exact binary output. Per section 3.1 of the XML spec, "Note that the order of attribute specifications in a start-tag or empty-element tag is not significant."

3) Mitigating a test failure is a one-time problem. API expansions are forever.

4) The existing API is not small and presents a challenge for teaching. Making the API bigger will make it worse.

5) As far as I can tell, XML tools in other languages (such as Java) don't sort (and likely for good reason). LXML is dropping its attribute sorting as well, so the standard library would become more of an outlier.

Raymond
Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?
On Thu, 21 Mar 2019 at 01:30, Raymond Hettinger wrote:
> There's no preaching and no judgment. We can't have a conversation though if
> we can't state the crux of the problem: some existing tests in third-party
> modules depend on the XML serialization being byte-for-byte identical
> forever. The various respondents to this thread have indicated that the
> standard library should only make that guarantee within a single feature
> release and that it may vary across feature releases.
>
> For docutils, it may end up being an easy fix (either with a semantic
> comparison or with regenerating the target files when point releases differ).
> For Coverage, I don't make any presumption that reengineering the tests will
> be easy or fun. Several mitigation strategies have been proposed:
>
> * alter the element creation code to create the attributes in the desired order
> * use a canonicalization tool to create output that is guaranteed not to change
> * generate new baseline files when a feature release changes
> * apply Stefan's recipe for reordering attributes
> * make a semantic level comparison
>
> Will any of these work for you?

Python 3.8 is still in a very early stage of testing. We only started to discover which projects are broken by the XML change.

IMHO the problem is wider than just unit tests written in Python. Python can be used to produce the XML, but other languages can be used to parse or compare the generated XML. For example, if the generated file is stored in Git, it will be seen as modified and "git diff" will show a lot of "irrelevant" changes.

Comparison of XML using string comparison can also be used to avoid expensive disk/database write or reduce network bandwidth. That's an issue if the program isn't written in Python, whereas the XML is generated by Python.

Getting the same output on Python 3.7 and Python 3.8 also matters for https://reproducible-builds.org/

Victor
--
Night gathers, and now my watch begins.
It shall not end until my death.
Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?
> On Mar 19, 2019, at 4:53 AM, Ned Batchelder wrote:
>
> None of this is impossible, but please try not to preach to us maintainers
> that we are doing it wrong, that it will be easy to fix, etc

There's no preaching and no judgment. We can't have a conversation though if we can't state the crux of the problem: some existing tests in third-party modules depend on the XML serialization being byte-for-byte identical forever. The various respondents to this thread have indicated that the standard library should only make that guarantee within a single feature release and that it may vary across feature releases.

For docutils, it may end up being an easy fix (either with a semantic comparison or with regenerating the target files when point releases differ). For Coverage, I don't make any presumption that reengineering the tests will be easy or fun. Several mitigation strategies have been proposed:

* alter the element creation code to create the attributes in the desired order
* use a canonicalization tool to create output that is guaranteed not to change
* generate new baseline files when a feature release changes
* apply Stefan's recipe for reordering attributes
* make a semantic level comparison

Will any of these work for you?

Raymond
Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?
Hi, Le lun. 18 mars 2019 à 23:41, Raymond Hettinger a écrit : > We're having a super interesting discussion on > https://bugs.python.org/issue34160 . It is now marked as a release blocker > and warrants a broader discussion. Thanks for starting a thread on python-dev. I'm the one who raised the priority to release blocker to trigger such discussion on python-dev. > Our problem is that at least two distinct and important users have written > tests that depend on exact byte-by-byte comparisons of the final > serialization. Sorry but I don't think that it's a good summary of the issue. IMHO the issue is more general about how we introduce backward incompatible in Python. The migration from Python 2 to Python 3 took around ten years. That's way too long and it caused a lot of troubles in the Python community. IMHO one explanation is our patronizing behavior regarding to users that I would like to summarize as "your code is wrong, you have to fix it" (whereas the code was working well for 10 years with Python 2!). I'm not opposed to backward incompatible changes, but I think that we must very carefully prepare the migration and do our best to help users to migrate their code. > 2). Go into every XML module and add attribute sorting options to each > function that generate xml. (...) Written like that, it sounds painful and a huge project... But in practice, the implementation looks simple and straightforward: https://github.com/python/cpython/pull/12354/files I don't understand why such simple solution has been rejected. IMHO adding an optional sort parameter is just the *bare minimum* that we can do for our users. Alternatives have been proposed like a recipe to sort node attributes before serialization, but honestly, it's way too complex. I don't want to have to copy such recipe to every project. Add a new function, import it, use it where XML is written into a file, etc. Taken alone, maybe it's acceptable. 
But please remember that some companies are still porting their large Python 2 code base to Python 3. This new backward incompatible change gets on top of the pile of other backward incompatible changes between 2.7 and 3.8.

I would prefer to be able to "just add" sort=True. Don't forget that tests like "if sys.version_info >= (3, 8):" will be needed which makes the overall fix more complicated.

Said differently, the stdlib should help the user to update Python. The pain should not only be on the user side.

Victor
--
Night gathers, and now my watch begins.
It shall not end until my death.
Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?
On Tue, Mar 19, 2019, 4:53 AM Ned Batchelder wrote:
> On 3/19/19 4:13 AM, Serhiy Storchaka wrote:
> > 19.03.19 00:41, Raymond Hettinger wrote:
> >> 4) Fix the tests in the third-party modules to be more focused on
> >> their actual test objectives, the semantics of the generated XML
> >> rather than the exact serialization. This option would seem like the
> >> right-thing-to-do but it isn't trivial because the entire premise of
> >> the existing test is invalid. For every case, we'll actually have to
> >> think through what the test objective really is.
>
> Option 4 is misleading. Is anyone here really offering to "fix the
> tests in third-party modules"? Option 4 is actually, "do nothing, and
> let a multitude of projects figure out how to fix their tests, slowing
> progress in those projects as they try to support Python 3.8."

We've done Option 4 for every past behavior change of any form on feature .Next releases. We do try to encourage projects to run their tests on the 3.Next betas so that they can be ready before 3.Next.0 lands, some of us even do it ourselves when we're interested. Many things won't get ready ahead of time, but the actual .0 release forces the issue as their users start demanding it, on occasion offering patches. We don't block a release on existing user code being ready for it.

> In my case, the test code has a generic function to compare an actual
> directory of files to an expected directory of files, so it isn't quite
> as simple as "just use the right XML comparison." And I support Python
> 2.7, 3.5, etc, so tests still need to work under those versions. None
> of this is impossible, but please try not to preach to us maintainers
> that we are doing it wrong, that it will be easy to fix, etc. Using
> language like "the entire premise of the test is invalid" seems
> needlessly condescending.

Agreed, that was poor wording. Let's not let that type of wording escape python-dev into docs about a behavior change.

Wording aside, a test relying on undefined behavior is testing for things the code under test doesn't actually need to care about being true, even if it has happened to work for years. Such a test is overspecified, potentially without the authors having consciously realized that. It'll need refactoring to loosen its requirements.

How to loosen it is always an implementation detail based on the constraints imposed upon the test. Difficulty lies within range(0, "Port Mercurial to Python 3"). But the end result is nice: the code is healthier as tests focus more on what was actually important rather than what was quicker to write that got the original job done many years ago.

> One of the suggested solutions, a DOM comparison is not enough. I
> don't just want to know that my actual XML is different than my expected
> XML. I want to know where and how it differs.
>
> Textual comparison may be the "wrong" way to check XML, but it gives me
> many tools for working with the test results. It was simple and it
> worked. Now in Python 3.8, because Python doesn't want to add an
> optional flag to continue doing what it has always done, I need to
> re-engineer my tests.
>
> --Ned.

I understand that from a code owner's perspective having to do any work, no matter what the reason, is counted as re-engineering. But that doesn't make it wrong. If "what it has always done" was unspecified and arbitrary despite happening to not change in the past, rather than something whose stability is easy to continue, such as sorted attributes, it is a burden to maintain that **unspecifiable** behavior in the language or library going forward.

(Note that I have no idea what order the xml code in question happened to impose upon attributes; if it went from sorted to not, a "fix" to provide users is clear: provide a way to keep it sorted for those who need that. If it relied on insertion order or hash table iteration order or the phase of the moon when the release was cut, we are right to refuse to maintain unspecifiable implementation side effect behavior.)

-gps
Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?
On Wed, 20 Mar 2019 at 00:29, Serhiy Storchaka wrote: > 19.03.19 15:10, Tim Delaney пише: > > Now Calibre is definitely in the wrong here - it should be able to > > import regardless of the order of attributes. But the fact is that there > > are a lot of tools out there that are semi-broken in a similar manner. > > Is not Calibre going to seat on Python 2 forever? This makes it > non-relevant to the discussion about Python 3.8. > I was simply using Calibre as an example of a tool I'd encountered recently that works correctly with input files with attributes in one order, but not the other. That it happens to be using Python (of any vintage) is irrelevant - could have been written in C, Go, Lua ... same problem that XML libraries that arbitrarily sort (or otherwise manipulate the order of) attributes can result in files that may not work with third-party tools. Tim Delaney ___ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?
Ned Batchelder wrote on 19.03.19 at 12:53:
> I need to re-engineer my tests.

… or sort the attributes before serialisation, or use C14N always, or change your code to create the attributes in sorted-by-name order. The new behaviour allows for a couple of ways to deal with the issue of backwards compatibility.

Stefan
Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?
Nathaniel Smith wrote on 19.03.19 at 00:15:
> That seems potentially simpler to implement than canonical XML
> serialization

C14N is already implemented for ElementTree, just needs to be ported to Py3.8 and merged.

https://bugs.python.org/issue13611

Stefan
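At the time of this message the port was still pending. As a sketch of how canonical output removes the ordering question, the example below assumes an interpreter that ships `xml.etree.ElementTree.canonicalize()` (Python 3.8 or later); the two sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Two hypothetical serialisations of the same document: attribute order and
# the empty-element form differ, the content does not.
first = '<root b="2" a="1"><child/></root>'
second = '<root a="1" b="2"><child></child></root>'

# canonicalize() returns the C14N 2.0 form as a string when no output file
# is passed, so both inputs normalise to the same text.
canonical_first = ET.canonicalize(first)
canonical_second = ET.canonicalize(second)
```

Comparing or storing the canonical strings instead of the raw serialisation keeps byte-level tools such as text diffs usable without depending on the serialiser's attribute order.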
Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?
On Tue, Mar 19, 2019 at 6:15 AM Serhiy Storchaka wrote: > 19.03.19 13:53, Ned Batchelder пише: > > Option 4 is misleading. Is anyone here really offering to "fix the > > tests in third-party modules"? Option 4 is actually, "do nothing, and > > let a multitude of projects figure out how to fix their tests, slowing > > progress in those projects as they try to support Python 3.8." > > Any option except option 1 (and option 2 with sorting by default) > requires changing third-party code. You should either pass additional > argument to serialization functions, or use special canonization functions. > > We should look at the problem from long perspective. Freezing the > current behavior forever does not look good. If we need to break the > user code, we should minimize the harm and provide convenient tools for > reproducing the current behavior. And this is an opportunity to rewrite > user tests in more appropriate form. In your case textual comparison may > be the most appropriate form, but this may be not so in other cases. > In situations like this I think it's best to bite the bullet sooner rather than later while acknowledging that folks like Ned are in a bind when they have support older versions and thus have long-term support costs, too, and try to make the transition as painless as possible (my guess is Ned's need to support older versions will drop off faster than us having to support the xml libraries in the stdlib going forward, hence my viewpoint). > > > Now in Python 3.8, because Python doesn't want to add an > > optional flag to continue doing what it has always done, I need to > > re-engineer my tests. > > Please wait yet some time. I hope to add canonicalization before the > first beta. > For me I think canonicalization/stable pretty-print is the best option, especially if we can put the canonicalization code up on PyPI for supporting older versions of Python. 
Otherwise a function that does something like an XOR to help diagnose what differs between 2 XML documents also seems like a good option to me.
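A rough sketch of such a diagnostic helper follows. The name `xml_diff`, the path notation, and the decision to ignore surrounding whitespace in text are all assumptions made for illustration; it reports where two trees differ rather than only whether they do.

```python
import xml.etree.ElementTree as ET


def xml_diff(a, b, path="", index=0):
    """Yield one human-readable line per difference between elements a and b."""
    if a.tag != b.tag:
        yield f"{path}/[{index}]: tag {a.tag!r} != {b.tag!r}"
        return
    here = f"{path}/{a.tag}[{index}]"
    # Attribute order is deliberately ignored; only names and values count.
    for name in sorted(set(a.attrib) | set(b.attrib)):
        left, right = a.attrib.get(name), b.attrib.get(name)
        if left != right:
            yield f"{here}/@{name}: {left!r} != {right!r}"
    # Assumption: leading and trailing whitespace in text is not significant.
    if (a.text or "").strip() != (b.text or "").strip():
        yield f"{here}: text {a.text!r} != {b.text!r}"
    if len(a) != len(b):
        yield f"{here}: {len(a)} children != {len(b)} children"
    for child_index, (x, y) in enumerate(zip(a, b)):
        yield from xml_diff(x, y, here, child_index)


expected = ET.fromstring('<r a="1" b="2"><c>x</c></r>')
actual = ET.fromstring('<r b="2" a="9"><c>y</c></r>')
differences = list(xml_diff(expected, actual))
```

An empty list means the documents match at this level of comparison, and a non-empty one can be printed directly in a test failure message.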
Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?
And why not with a very long PendingDeprecationWarning? That warning could be used in this case.
Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?
19.03.19 15:10, Tim Delaney wrote:
> Now Calibre is definitely in the wrong here - it should be able to
> import regardless of the order of attributes. But the fact is that there
> are a lot of tools out there that are semi-broken in a similar manner.

Isn't Calibre going to sit on Python 2 forever? That makes it irrelevant to the discussion about Python 3.8.
Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?
19.03.19 14:50, Antoine Pitrou wrote:
>> 2). Go into every XML module and add attribute sorting options to each
>> function that generate xml.
>
> What do you mean with "every XML module"? Are there many of them?

ElementTree and minidom. Maybe xmlrpc. And perhaps we need to add arguments in calls at a higher level where these modules are used.
Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?
19.03.19 13:53, Ned Batchelder wrote:
> Option 4 is misleading. Is anyone here really offering to "fix the
> tests in third-party modules"? Option 4 is actually, "do nothing, and
> let a multitude of projects figure out how to fix their tests, slowing
> progress in those projects as they try to support Python 3.8."

Any option except option 1 (and option 2 with sorting by default) requires changing third-party code. You should either pass an additional argument to serialization functions, or use special canonicalization functions.

We should look at the problem from a long-term perspective. Freezing the current behavior forever does not look good. If we need to break the user code, we should minimize the harm and provide convenient tools for reproducing the current behavior. And this is an opportunity to rewrite user tests in a more appropriate form. In your case textual comparison may be the most appropriate form, but this may not be so in other cases.

> Now in Python 3.8, because Python doesn't want to add an
> optional flag to continue doing what it has always done, I need to
> re-engineer my tests.

Please wait a little longer. I hope to add canonicalization before the first beta.
Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?
On Tue, 19 Mar 2019 at 23:13, David Mertz wrote:
> In a way, this case makes bugs worse because they are not only a Python
> internal matter. XML is used to communicate among many tools and
> programming languages, and relying on assumptions those other tools will
> not follow is a bad habit.

I have a recent example I encountered where the 3.7 behaviour (sorting attributes) results in a third-party tool behaving incorrectly, whereas maintaining attribute order works correctly.

The particular case was using HTML tags for importing into Calibre for converting to an ebook. The most common symptom was that series indexes were sometimes being correctly imported, and sometimes not. Occasionally other tags would also fail to be correctly imported.

Turns out that gave consistently correct results, whilst was erratic. And whilst I'd specified the tags with the name attribute first, I was then passing the HTML through BeautifulSoup, which sorted the attributes.

Now Calibre is definitely in the wrong here: it should be able to import regardless of the order of attributes. But the fact is that there are a lot of tools out there that are semi-broken in a similar manner.

This to me is an argument to default to maintaining order, but provide a way for the caller to control the order of attributes when formatting, e.g. by passing an ordering function. If you want sorted attributes, pass the built-in sorted function as your ordering function. But I think that's getting beyond the scope of this discussion.

Tim Delaney
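The "ordering function" idea above can be approximated today without any new serializer parameter. The following is only a sketch under stated assumptions: `apply_attribute_order` is a hypothetical helper (not an ElementTree API), it rewrites the tree before serialization instead of hooking into the serializer, and it relies on ElementTree in Python 3.8 and later writing attributes in insertion order. The element and attribute names are made up for illustration.

```python
import xml.etree.ElementTree as ET


def apply_attribute_order(root, order=sorted):
    """Rewrite each element's attributes in the sequence chosen by `order`.

    `order` receives an iterable of (name, value) pairs and returns them in
    the desired sequence; the built-in `sorted` orders them by name.
    Hypothetical helper, not part of the standard library.
    """
    for element in root.iter():
        items = list(order(element.attrib.items()))
        element.attrib.clear()
        element.attrib.update(items)


# Hypothetical element and attribute names, for illustration only.
root = ET.Element("entry")
root.set("name", "series_index")
root.set("content", "1")

# Ask for sorted attributes by passing the built-in sorted function.
apply_attribute_order(root, order=sorted)
serialized = ET.tostring(root, encoding="unicode")
```

Passing a different callable (for example one that puts a particular attribute first) would give the caller the same control without changing any stdlib signature.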
Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?
Hi Raymond,

As long as the new serialization order is deterministic (i.e. it's the same on every run and doesn't depend on e.g. hash randomization), then I think it's fine to change it.

Some more comments / questions:

> 2) Go into every XML module and add attribute sorting options to each
> function that generates XML.

What do you mean by "every XML module"? Are there many of them?

> Regardless of option chosen, we should make explicit whether or not the
> Python standard library modules guarantee cross-release bytewise identical
> output for XML.

IMO we certainly shouldn't. XML is a serialization format used for machine interoperability (even though "human-editable" was one of its selling points at the start, rather misguidedly).

However, the output should ideally be stable and deterministic across all releases of a given bugfix branch (i.e., if I run the same code multiple times on all 3.7.x versions, I should always get the same output).

Regards

Antoine.
Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?
In my opinion, any test that relied on a non-promised accident of serialization is broken today. Very often, such bad tests mask bad production code that makes the same unsafe assumptions.

This is similar to tests that assumed a certain dictionary order, before we got guaranteed insertion order. Or like tests that rely on object identity of short strings or small ints. Or like non-guaranteed identities in pickles across versions.

In a way, this case makes bugs worse because they are not only a Python internal matter. XML is used to communicate among many tools and programming languages, and relying on assumptions those other tools will not follow is a bad habit. Sure, most tests probably don't get to the point of touching those external tools themselves, but starting from bad assumptions about the domain isn't best practice.

That said, I think an XML canonicalization function is generally a good thing for Python to have. But it shouldn't be a stopper in releases.

On Mon, Mar 18, 2019, 6:47 PM Raymond Hettinger wrote: > We're having a super interesting discussion on > https://bugs.python.org/issue34160 . It is now marked as a release > blocker and warrants a broader discussion. > > Our problem is that at least two distinct and important users have written > tests that depend on exact byte-by-byte comparisons of the final > serialization. So any changes to the XML modules will break those tests > (not the applications themselves, just the test cases that assume the > output will be forever, byte-by-byte identical). > > In theory, the tests are incorrectly designed and should not treat the > module output as a canonical normal form. In practice, doing an equality > test on the output is the simplest, most obvious approach, and likely is > being done in other packages we don't know about yet. > > With pickle, json, and __repr__, the usual way to write a test is to > verify a roundtrip: assert pickle.loads(pickle.dumps(data)) == data. 
With > XML, the problem is that the DOM doesn't have an equality operator. The > user is left with either testing specific fragments with > element.find(xpath) or with using a standards compliant canonicalization > package (not available from us). Neither option is pleasant. > > The code in the current 3.8 alpha differs from 3.7 in that it removes > attribute sorting and instead preserves the order the user specified when > creating an element. As far as I can tell, there is no objection to this > as a feature. The problem is what to do about the existing tests in > third-party code, what guarantees we want to make going forward, and what > do we recommend as a best practice for testing XML generation. > > Things we can do: > > 1) Revert back to the 3.7 behavior. This of course, makes all the test > pass :-) The downside is that it perpetuates the practice of bytewise > equality tests and locks in all implementation quirks forever. I don't > know of anyone advocating this option, but it is the simplest thing to do. > > 2). Go into every XML module and add attribute sorting options to each > function that generate xml. This gives users a way to make their tests > pass for now. There are several downsides. a) It grows the API in a way > that is inconsistent with all the other XML packages I've seen. b) We'll > have to test, maintain, and document the API forever -- the API is already > large and time consuming to teach. c) It perpetuates the notion that > bytewise equality tests are the right thing to do, so we'll have this > problem again if substitute in another code generator or alter any of the > other implementation quirks (i.e. how CDATA sections are serialized). > > 3) Add a standards compliant canonicalization tool (see > https://en.wikipedia.org/wiki/Canonical_XML ). This is likely to be the > right-way-to-do-it but takes time and energy. 
> > 4) Fix the tests in the third-party modules to be more focused on their > actual test objectives, the semantics of the generated XML rather than the > exact serialization. This option would seem like the right-thing-to-do but > it isn't trivial because the entire premise of the existing test is > invalid. For every case, we'll actually have to think through what the > test objective really is. > > Of these, option 2 is my least preferred. Ideally, we don't guarantee > bytewise identical output across releases, and ideally we don't grow a new > API that perpetuates the issue. That said, I'm not wedded to any of these > options and just want us to do what is best for the users in the long run. > > Regardless of option chosen, we should make explicit whether on not the > Python standard library modules guarantee cross-release bytewise identical > output for XML. That is really the core issue here. Had we had an explicit > notice one way or the other, there wouldn't be an issue now. > > Any thoughts? > > > > Raymond Hettinger > > > P.S. Stefan Behnel is planning to remove attribute sorting from lxml. > On the bug tr
Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?
On 3/19/19 4:13 AM, Serhiy Storchaka wrote:
> 19.03.19 00:41, Raymond Hettinger wrote:
>> 3) Add a standards compliant canonicalization tool (see https://en.wikipedia.org/wiki/Canonical_XML ). This is likely to be the right-way-to-do-it but takes time and energy.
>>
>> 4) Fix the tests in the third-party modules to be more focused on their actual test objectives, the semantics of the generated XML rather than the exact serialization. This option would seem like the right-thing-to-do but it isn't trivial because the entire premise of the existing test is invalid. For every case, we'll actually have to think through what the test objective really is.
>
> I think the combination of options 3 and 4 is the right thing. Stable output is not always needed; in those cases option 4 should be considered. But there are valid use cases for stable output, and for those we need to provide an alternative in the stdlib. I am working on this.

Option 4 is misleading. Is anyone here really offering to "fix the tests in third-party modules"? Option 4 is actually, "do nothing, and let a multitude of projects figure out how to fix their tests, slowing progress in those projects as they try to support Python 3.8."

In my case, the test code has a generic function to compare an actual directory of files to an expected directory of files, so it isn't quite as simple as "just use the right XML comparison." And I support Python 2.7, 3.5, etc, so tests still need to work under those versions. None of this is impossible, but please try not to preach to us maintainers that we are doing it wrong, that it will be easy to fix, etc. Using language like "the entire premise of the test is invalid" seems needlessly condescending.

As one of the suggested solutions, a DOM comparison is not enough. I don't just want to know that my actual XML is different than my expected XML. I want to know where and how it differs.

Textual comparison may be the "wrong" way to check XML, but it gives me many tools for working with the test results. It was simple and it worked. Now in Python 3.8, because Python doesn't want to add an optional flag to continue doing what it has always done, I need to re-engineer my tests.

--Ned.
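One way to keep the "where and how it differs" property of a textual diff while ignoring attribute order is to diff a normalized, line-per-element description of each tree. This is only a rough sketch under assumptions: `describe` and `xml_diff` are hypothetical helper names, the line format is made up, tail text and namespaces are ignored, and the f-strings make it Python 3 only, so it would need adapting for a 2.7-compatible test suite.

```python
import difflib
import xml.etree.ElementTree as ET


def describe(element, path=""):
    """Yield one text line per element, with attributes in sorted order."""
    here = f"{path}/{element.tag}"
    attrs = " ".join(f'{k}="{v}"' for k, v in sorted(element.attrib.items()))
    text = (element.text or "").strip()
    yield f"{here} [{attrs}] {text!r}"
    for child in element:
        yield from describe(child, here)


def xml_diff(expected_xml, actual_xml):
    """Return unified-diff lines between two XML strings, ignoring attribute order."""
    expected = list(describe(ET.fromstring(expected_xml)))
    actual = list(describe(ET.fromstring(actual_xml)))
    return list(
        difflib.unified_diff(expected, actual, "expected", "actual", lineterm="")
    )


# Attribute order alone produces no diff; a changed value shows up as -/+ lines.
reordered = xml_diff('<r><i a="1" b="2">x</i></r>', '<r><i b="2" a="1">x</i></r>')
changed = xml_diff('<r><i a="1" b="2">x</i></r>', '<r><i a="1" b="3">x</i></r>')
```

Because the result is ordinary diff text, it can feed the same tooling that a plain textual comparison would.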
Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?
19.03.19 00:41, Raymond Hettinger wrote:
> 3) Add a standards compliant canonicalization tool (see https://en.wikipedia.org/wiki/Canonical_XML ). This is likely to be the right-way-to-do-it but takes time and energy.
>
> 4) Fix the tests in the third-party modules to be more focused on their actual test objectives, the semantics of the generated XML rather than the exact serialization. This option would seem like the right-thing-to-do but it isn't trivial because the entire premise of the existing test is invalid. For every case, we'll actually have to think through what the test objective really is.

I think the combination of options 3 and 4 is the right thing. Stable output is not always needed; in those cases option 4 should be considered. But there are valid use cases for stable output, and for those we need to provide an alternative in the stdlib. I am working on this.
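For readers arriving later: Python 3.8 did end up including a C14N 2.0 function, `xml.etree.ElementTree.canonicalize()`. A minimal sketch of using it for the "stable output" case follows; it assumes Python 3.8 or later (the function does not exist in earlier releases), and the sample documents are made up.

```python
import xml.etree.ElementTree as ET

# Two serializations that differ only in attribute order and empty-tag style.
first = '<item z="1" b="2"/>'
second = '<item b="2" z="1"></item>'

# With no `out` file given, canonicalize() returns the canonical form as a string.
canonical_first = ET.canonicalize(first)
canonical_second = ET.canonicalize(second)

# Both inputs collapse to the same canonical text, so a plain string
# comparison of the canonical forms is a meaningful test.
assert canonical_first == canonical_second
```

Comparing canonical forms keeps the convenience of a string equality test without depending on the serializer's incidental choices.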
Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?
On Mon, Mar 18, 2019 at 9:44 PM Terry Reedy wrote: > On 3/18/2019 6:41 PM, Raymond Hettinger wrote: > > We're having a super interesting discussion on > https://bugs.python.org/issue34160 . It is now marked as a release > blocker and warrants a broader discussion. > > > > Our problem is that at least two distinct and important users have > written tests that depend on exact byte-by-byte comparisons of the final > serialization. So any changes to the XML modules will break those tests > (not the applications themselves, just the test cases that assume the > output will be forever, byte-by-byte identical). > > > > In theory, the tests are incorrectly designed and should not treat the > module output as a canonical normal form. In practice, doing an equality > test on the output is the simplest, most obvious approach, and likely is > being done in other packages we don't know about yet. > > > > With pickle, json, and __repr__, the usual way to write a test is to > verify a roundtrip: assert pickle.loads(pickle.dumps(data)) == data. With > XML, the problem is that the DOM doesn't have an equality operator. The > user is left with either testing specific fragments with > element.find(xpath) or with using a standards compliant canonicalization > package (not available from us). Neither option is pleasant. > > > > The code in the current 3.8 alpha differs from 3.7 in that it removes > attribute sorting and instead preserves the order the user specified when > creating an element. As far as I can tell, there is no objection to this > as a feature. The problem is what to do about the existing tests in > third-party code, what guarantees we want to make going forward, and what > do we recommend as a best practice for testing XML generation. > > > > Things we can do: > > > > 1) Revert back to the 3.7 behavior. 
This of course, makes all the test > pass :-) The downside is that it perpetuates the practice of bytewise > equality tests and locks in all implementation quirks forever. I don't > know of anyone advocating this option, but it is the simplest thing to do. > > If it comes down to doing *something* to unblock the release ... > 1b) Revert to 3.7 *and* document that byte equality with current ouput > is *not* guaranteed. > > > 2). Go into every XML module and add attribute sorting options to each > function that generate xml. This gives users a way to make their tests > pass for now. There are several downsides. a) It grows the API in a way > that is inconsistent with all the other XML packages I've seen. b) We'll > have to test, maintain, and document the API forever -- the API is already > large and time consuming to teach. c) It perpetuates the notion that > bytewise equality tests are the right thing to do, so we'll have this > problem again if substitute in another code generator or alter any of the > other implementation quirks (i.e. how CDATA sections are serialized). > > > > 3) Add a standards compliant canonicalization tool (see > https://en.wikipedia.org/wiki/Canonical_XML ). This is likely to be the > right-way-to-do-it but takes time and energy. > > > 4) Fix the tests in the third-party modules to be more focused on their > actual test objectives, the semantics of the generated XML rather than the > exact serialization. This option would seem like the right-thing-to-do but > it isn't trivial because the entire premise of the existing test is > invalid. For every case, we'll actually have to think through what the > test objective really is. > > > Of these, option 2 is my least preferred. Ideally, we don't guarantee > bytewise identical output across releases, and ideally we don't grow a new > API that perpetuates the issue. That said, I'm not wedded to any of these > options and just want us to do what is best for the users in the long run. 
>

For (1): don't revert in 3.8. Do not worry about the order or formatting of serialized data changing between major Python releases.

A change in 3.8? That's 100% okay. This already happens all the time between Python releases. We've changed dict iteration order between releases twice this decade.

Within point releases of stable versions, i.e. 3.7.x? Up to the release manager; it is semi-rude to change something like this within a stable release unless there is a good reason, but I *believe* we have done it before. A general rule of thumb is to try not to without good reason, unless the code needed to avoid doing so would be overcomplicated.

It is always the user code depending on the non-declared ordering within output that is wrong; when we preserve it we're only doing them a temporary favor that ultimately allows more problems to grow in the future. Nobody should use a text comparison on serialized data not explicitly stated as canonical and call that test good by any standard, unless they are writing a test for canonical output from a library that explicitly guarantees its output will be canonical.

Agreed that your option (2) is not good for
Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?
On 3/18/2019 6:41 PM, Raymond Hettinger wrote:
> We're having a super interesting discussion on https://bugs.python.org/issue34160 . It is now marked as a release blocker and warrants a broader discussion.
>
> Our problem is that at least two distinct and important users have written tests that depend on exact byte-by-byte comparisons of the final serialization. So any changes to the XML modules will break those tests (not the applications themselves, just the test cases that assume the output will be forever, byte-by-byte identical).
>
> In theory, the tests are incorrectly designed and should not treat the module output as a canonical normal form. In practice, doing an equality test on the output is the simplest, most obvious approach, and likely is being done in other packages we don't know about yet.
>
> With pickle, json, and __repr__, the usual way to write a test is to verify a roundtrip: assert pickle.loads(pickle.dumps(data)) == data. With XML, the problem is that the DOM doesn't have an equality operator. The user is left with either testing specific fragments with element.find(xpath) or with using a standards compliant canonicalization package (not available from us). Neither option is pleasant.
>
> The code in the current 3.8 alpha differs from 3.7 in that it removes attribute sorting and instead preserves the order the user specified when creating an element. As far as I can tell, there is no objection to this as a feature. The problem is what to do about the existing tests in third-party code, what guarantees we want to make going forward, and what we recommend as a best practice for testing XML generation.
>
> Things we can do:
>
> 1) Revert back to the 3.7 behavior. This, of course, makes all the tests pass :-) The downside is that it perpetuates the practice of bytewise equality tests and locks in all implementation quirks forever. I don't know of anyone advocating this option, but it is the simplest thing to do.

If it comes down to doing *something* to unblock the release ...
1b) Revert to 3.7 *and* document that byte equality with current output is *not* guaranteed.

> 2) Go into every XML module and add attribute sorting options to each function that generates XML. This gives users a way to make their tests pass for now. There are several downsides. a) It grows the API in a way that is inconsistent with all the other XML packages I've seen. b) We'll have to test, maintain, and document the API forever -- the API is already large and time consuming to teach. c) It perpetuates the notion that bytewise equality tests are the right thing to do, so we'll have this problem again if we substitute in another code generator or alter any of the other implementation quirks (e.g. how CDATA sections are serialized).
>
> 3) Add a standards compliant canonicalization tool (see https://en.wikipedia.org/wiki/Canonical_XML ). This is likely to be the right-way-to-do-it but takes time and energy.
>
> 4) Fix the tests in the third-party modules to be more focused on their actual test objectives, the semantics of the generated XML rather than the exact serialization. This option would seem like the right-thing-to-do but it isn't trivial because the entire premise of the existing test is invalid. For every case, we'll actually have to think through what the test objective really is.
>
> Of these, option 2 is my least preferred. Ideally, we don't guarantee bytewise identical output across releases, and ideally we don't grow a new API that perpetuates the issue. That said, I'm not wedded to any of these options and just want us to do what is best for the users in the long run.

The point of 1b would be to give us time to do that if more is needed.

> Regardless of option chosen, we should make explicit whether or not the Python standard library modules guarantee cross-release bytewise identical output for XML. That is really the core issue here. Had we had an explicit notice one way or the other, there wouldn't be an issue now.

I have not read the XML docs but based on this and the issue discussion and what I think our general guarantee policy has been, I would consider that there is not one. (I am thinking about things like garbage collection, stable sorting, and set/dict iteration order.)

--
Terry Jan Reedy
Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?
> On Mar 18, 2019, at 4:15 PM, Nathaniel Smith wrote:
>
> I noticed that your list doesn't include "add a DOM equality operator". That
> seems potentially simpler to implement than canonical XML serialization, and
> like a useful thing to have in any case. Would it make sense as an option?

Time machine! Stéphane Wirtel just posted a basic semantic comparison between two streams.¹

Presumably, there would need to be a range of options for specifying what constitutes equivalence but this is a nice start.

Raymond

¹ https://bugs.python.org/file48217/test_xml_compare.py
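To make the idea of a semantic comparison concrete, here is a minimal sketch of what such an equality check could look like with ElementTree. It is not the script attached to the issue (its contents are not shown in this thread); `elements_equal` and its `ignore_whitespace` flag are hypothetical, and the flag is just one example of the "range of options" mentioned above.

```python
import xml.etree.ElementTree as ET


def elements_equal(a, b, ignore_whitespace=True):
    """Return True when two elements match in tag, attributes, text and children."""

    def norm(value):
        value = value or ""
        return value.strip() if ignore_whitespace else value

    # Attribute dicts compare equal regardless of insertion order.
    if a.tag != b.tag or a.attrib != b.attrib:
        return False
    if norm(a.text) != norm(b.text) or norm(a.tail) != norm(b.tail):
        return False
    if len(a) != len(b):
        return False
    # Children are compared pairwise and in document order.
    return all(elements_equal(x, y, ignore_whitespace) for x, y in zip(a, b))


compact = ET.fromstring('<r><i a="1" b="2">x</i></r>')
indented = ET.fromstring('<r>\n  <i b="2" a="1">x</i>\n</r>')
different = ET.fromstring('<r><i a="1" b="3">x</i></r>')
```

With the default setting, `compact` and `indented` are treated as equivalent while `different` is not; what should count as equivalent (whitespace, comments, namespace prefixes) is exactly the policy question raised in the message.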
Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?
I noticed that your list doesn't include "add a DOM equality operator". That seems potentially simpler to implement than canonical XML serialization, and like a useful thing to have in any case. Would it make sense as an option? On Mon, Mar 18, 2019, 15:46 Raymond Hettinger wrote: > We're having a super interesting discussion on > https://bugs.python.org/issue34160 . It is now marked as a release > blocker and warrants a broader discussion. > > Our problem is that at least two distinct and important users have written > tests that depend on exact byte-by-byte comparisons of the final > serialization. So any changes to the XML modules will break those tests > (not the applications themselves, just the test cases that assume the > output will be forever, byte-by-byte identical). > > In theory, the tests are incorrectly designed and should not treat the > module output as a canonical normal form. In practice, doing an equality > test on the output is the simplest, most obvious approach, and likely is > being done in other packages we don't know about yet. > > With pickle, json, and __repr__, the usual way to write a test is to > verify a roundtrip: assert pickle.loads(pickle.dumps(data)) == data. With > XML, the problem is that the DOM doesn't have an equality operator. The > user is left with either testing specific fragments with > element.find(xpath) or with using a standards compliant canonicalization > package (not available from us). Neither option is pleasant. > > The code in the current 3.8 alpha differs from 3.7 in that it removes > attribute sorting and instead preserves the order the user specified when > creating an element. As far as I can tell, there is no objection to this > as a feature. The problem is what to do about the existing tests in > third-party code, what guarantees we want to make going forward, and what > do we recommend as a best practice for testing XML generation. > > Things we can do: > > 1) Revert back to the 3.7 behavior. 
This of course, makes all the test > pass :-) The downside is that it perpetuates the practice of bytewise > equality tests and locks in all implementation quirks forever. I don't > know of anyone advocating this option, but it is the simplest thing to do. > > 2). Go into every XML module and add attribute sorting options to each > function that generate xml. This gives users a way to make their tests > pass for now. There are several downsides. a) It grows the API in a way > that is inconsistent with all the other XML packages I've seen. b) We'll > have to test, maintain, and document the API forever -- the API is already > large and time consuming to teach. c) It perpetuates the notion that > bytewise equality tests are the right thing to do, so we'll have this > problem again if substitute in another code generator or alter any of the > other implementation quirks (i.e. how CDATA sections are serialized). > > 3) Add a standards compliant canonicalization tool (see > https://en.wikipedia.org/wiki/Canonical_XML ). This is likely to be the > right-way-to-do-it but takes time and energy. > > 4) Fix the tests in the third-party modules to be more focused on their > actual test objectives, the semantics of the generated XML rather than the > exact serialization. This option would seem like the right-thing-to-do but > it isn't trivial because the entire premise of the existing test is > invalid. For every case, we'll actually have to think through what the > test objective really is. > > Of these, option 2 is my least preferred. Ideally, we don't guarantee > bytewise identical output across releases, and ideally we don't grow a new > API that perpetuates the issue. That said, I'm not wedded to any of these > options and just want us to do what is best for the users in the long run. > > Regardless of option chosen, we should make explicit whether on not the > Python standard library modules guarantee cross-release bytewise identical > output for XML. 
That is really the core issue here. Had we had an explicit > notice one way or the other, there wouldn't be an issue now. > > Any thoughts? > > > > Raymond Hettinger > > > P.S. Stefan Behnel is planning to remove attribute sorting from lxml. > On the bug tracker, he has clearly articulated his reasons.
[Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?
We're having a super interesting discussion on https://bugs.python.org/issue34160 . It is now marked as a release blocker and warrants a broader discussion.

Our problem is that at least two distinct and important users have written tests that depend on exact byte-by-byte comparisons of the final serialization. So any changes to the XML modules will break those tests (not the applications themselves, just the test cases that assume the output will be forever, byte-by-byte identical).

In theory, the tests are incorrectly designed and should not treat the module output as a canonical normal form. In practice, doing an equality test on the output is the simplest, most obvious approach, and likely is being done in other packages we don't know about yet.

With pickle, json, and __repr__, the usual way to write a test is to verify a roundtrip: assert pickle.loads(pickle.dumps(data)) == data. With XML, the problem is that the DOM doesn't have an equality operator. The user is left with either testing specific fragments with element.find(xpath) or with using a standards compliant canonicalization package (not available from us). Neither option is pleasant.

The code in the current 3.8 alpha differs from 3.7 in that it removes attribute sorting and instead preserves the order the user specified when creating an element. As far as I can tell, there is no objection to this as a feature. The problem is what to do about the existing tests in third-party code, what guarantees we want to make going forward, and what we recommend as a best practice for testing XML generation.

Things we can do:

1) Revert back to the 3.7 behavior. This, of course, makes all the tests pass :-) The downside is that it perpetuates the practice of bytewise equality tests and locks in all implementation quirks forever. I don't know of anyone advocating this option, but it is the simplest thing to do.

2) Go into every XML module and add attribute sorting options to each function that generates XML. This gives users a way to make their tests pass for now. There are several downsides. a) It grows the API in a way that is inconsistent with all the other XML packages I've seen. b) We'll have to test, maintain, and document the API forever -- the API is already large and time consuming to teach. c) It perpetuates the notion that bytewise equality tests are the right thing to do, so we'll have this problem again if we substitute in another code generator or alter any of the other implementation quirks (e.g. how CDATA sections are serialized).

3) Add a standards compliant canonicalization tool (see https://en.wikipedia.org/wiki/Canonical_XML ). This is likely to be the right-way-to-do-it but takes time and energy.

4) Fix the tests in the third-party modules to be more focused on their actual test objectives, the semantics of the generated XML rather than the exact serialization. This option would seem like the right-thing-to-do but it isn't trivial because the entire premise of the existing test is invalid. For every case, we'll actually have to think through what the test objective really is.

Of these, option 2 is my least preferred. Ideally, we don't guarantee bytewise identical output across releases, and ideally we don't grow a new API that perpetuates the issue. That said, I'm not wedded to any of these options and just want us to do what is best for the users in the long run.

Regardless of option chosen, we should make explicit whether or not the Python standard library modules guarantee cross-release bytewise identical output for XML. That is really the core issue here. Had we had an explicit notice one way or the other, there wouldn't be an issue now.

Any thoughts?

Raymond Hettinger

P.S. Stefan Behnel is planning to remove attribute sorting from lxml. On the bug tracker, he has clearly articulated his reasons.
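As an illustration of what option 4 might look like in practice, here is a small sketch of a test that parses the generated XML and asserts on specific fragments with element.find(), rather than comparing bytes. Everything in it is assumed for the example: `build_catalog` stands in for whatever code is under test, and the element and attribute names are made up.

```python
import xml.etree.ElementTree as ET


def build_catalog():
    """Stand-in for the code under test; the element names are made up."""
    root = ET.Element("catalog")
    ET.SubElement(root, "book", {"id": "b1", "lang": "en"}).text = "First"
    ET.SubElement(root, "book", {"lang": "de", "id": "b2"}).text = "Second"
    return ET.tostring(root)


def test_catalog_semantics():
    # Parse the output and assert on meaning, not on the serialized bytes.
    root = ET.fromstring(build_catalog())
    assert [book.get("id") for book in root.findall("book")] == ["b1", "b2"]
    second = root.find("book[@id='b2']")
    assert second is not None
    assert second.get("lang") == "de"
    assert second.text == "Second"


test_catalog_semantics()
```

A test written this way passes whether attributes are serialized in sorted order (3.7) or in creation order (3.8), at the cost of having to decide per test which properties actually matter.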