Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?

2019-03-21 Thread Terry Reedy

On 3/21/2019 1:23 PM, Paul Moore wrote:

> On Thu, 21 Mar 2019 at 17:05, Steve Holden  wrote:
>
>> Especially as the standards specifically say that ordering has no semantic
>> impact.
>>
>> Byte-by-byte comparison of XML is almost always inappropriate.
>
> Conversely, if ordering has no semantic impact, there's no real
> justification for asking for the current order to be changed.
>
> In practice, allowing the user to control the ordering (by preserving
> input order) gives users a way of handling (according to the standard)
> broken consumers who ascribe semantic meaning to the attribute order.


Or, as Jonathan Goble said elsewhere, use an order that makes whatever 
sense to the author and other readers.  The order of positional 
parameter names in a function definition has no semantic meaning to 
python, but it would be terrible to make them be sorted.


--
Terry Jan Reedy

___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?

2019-03-21 Thread Stefan Behnel
Victor Stinner wrote on 21.03.19 at 01:22:
> Alternatives have been proposed like a recipe to sort node attributes
> before serialization, but honestly, it's way too complex.

Hm, really? Five lines of simple and obvious Python code, that provide a
fast and completely Python-version agnostic solution to the problem that a
few users have, are "way too complex" ? That sounds a bit extreme to me.


> I don't want
> to have to copy such recipe to every project. Add a new function,
> import it, use it where XML is written into a file, etc. Taken alone,
> maybe it's acceptable. But please remember that some companies are
> still porting their large Python 2 code base to Python 3. This new
> backward incompatible change gets on top of the pile of other backward
> incompatible changes between 2.7 and 3.8.
> 
> I would prefer to be able to "just add" sort=True. Don't forget that
> tests like "if sys.version_info >= (3, 8):" will be needed, which makes the
> overall fix more complicated.

Yes, exactly! Users would have to add that option *conditionally* to their
code somewhere. Personally, I really dislike having to say "if Python
version is X do this, otherwise, do that". I prefer a solution that just
works. There are at least four approaches that generally work across Python
releases: ignoring the ordering, using C14N, creating attributes in order,
sorting attributes before serialisation. I'd prefer if users picked one of
those, preferably the right one for their use case, rather than starting to
put version specific kludges into their code.
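
For illustration, the "sorting attributes before serialisation" approach could look roughly like the sketch below. It is a minimal sketch in the spirit of the recipe mentioned in this thread, not necessarily its exact text; the function name `reorder_attributes` is an assumption made for this example.

```python
import xml.etree.ElementTree as ET


def reorder_attributes(root):
    """Re-insert each element's attributes in sorted-by-name order."""
    for el in root.iter():
        attrib = el.attrib
        if len(attrib) > 1:
            # After re-insertion, insertion order equals sorted order, so a
            # serialiser that sorts and one that preserves insertion order
            # should emit the attributes in the same sequence.
            items = sorted(attrib.items())
            attrib.clear()
            attrib.update(items)


root = ET.Element("root")
root.set("b", "2")
root.set("a", "1")
reorder_attributes(root)
print(list(root.attrib))
```

The point of doing it this way is that no version check is needed: the same call can be made before serialising on any Python release.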

Stefan



Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?

2019-03-21 Thread Jonathan Goble
On Thu, Mar 21, 2019, 1:05 PM Steve Holden  wrote:

> On Thu, Mar 21, 2019 at 11:33 AM Antoine Pitrou 
> wrote:
>
>> [...]
>>
>> Most users and applications should /never/ care about the order of XML
>> attributes.
>>
>> Regards
>>
>> Antoine
>>
>
> Especially as the standards specifically say that ordering has no semantic
> impact.
>

When you have a lot of attributes, though, sometimes having them in a
particular defined order can make it easier to reason about and make sense
of the code when manually reviewing it.



Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?

2019-03-21 Thread Paul Moore
On Thu, 21 Mar 2019 at 17:05, Steve Holden  wrote:
>
> On Thu, Mar 21, 2019 at 11:33 AM Antoine Pitrou  wrote:
>>
>> [...]
>>
>> Most users and applications should /never/ care about the order of XML
>> attributes.
>>
>> Regards
>>
>> Antoine
>
>
> Especially as the standards specifically say that ordering has no semantic 
> impact.
>
> Byte-by-byte comparison of XML is almost always inappropriate.

Conversely, if ordering has no semantic impact, there's no real
justification for asking for the current order to be changed.

In practice, allowing the user to control the ordering (by preserving
input order) gives users a way of handling (according to the standard)
broken consumers who ascribe semantic meaning to the attribute order.
So there's a small benefit for real-world users having to deal with
non-compliant software. But that benefit is by definition small, as
standards-compliant software won't be affected.

The cost of making the change to projects that rely on the current
output is significant, and that should be considered. But there's also
the question of setting a precedent. If we do reject this change
because of the cost to 3rd parties, are we then committing Python to
guaranteeing sorted attribute order (and worse, byte-for-byte
reproducible output) for ever - a far stronger commitment than the
standards require of us? That seems to me to be an extremely bad
precedent to set.

There's no good answer here - maybe a possible compromise would be for
us to document explicitly in 3.8 that output is only guaranteed
identical to the level the standards require (i.e., attribute order is
not guaranteed to be preserved) and then make this change in 3.9. But
in practice, that's not really any better for projects like coverage -
it just delays the point when they have to bite the bullet (and it's
not like 3.8 is imminent - there's plenty of time between now and 3.8
without adding an additional delay).

Reluctantly, I think I'd have to say that I don't think we should
reject this change simply because existing users rely on the exact
output currently being produced.

To mitigate the impact on 3rd parties, it would be very helpful if we
could add to the stdlib some form of "compare two XML documents for
semantic equality up to the level that the standards require". 3rd
party code could then use that if it's present, and fall back to
byte-equality if it's not. If we could get something like that for
3.9, but not for 3.8, then that would seem to me to be a good reason
to defer this change until 3.9 (because we don't want to have 3.8
being an exception where there's no semantic comparison function, but
the byte-equality fallback doesn't work - that's just needlessly
annoying).
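
To make the idea concrete, here is a minimal sketch of what such a semantic comparison might look like. No such function exists in the stdlib as far as this thread shows; the names `elements_equal` and `xml_equal` are hypothetical, and treating surrounding whitespace as insignificant is an assumption that will not suit every document type.

```python
import xml.etree.ElementTree as ET


def elements_equal(a, b):
    """Compare two elements recursively, ignoring attribute order."""
    if a.tag != b.tag:
        return False
    # attrib is a dict, and dict equality does not depend on order.
    if a.attrib != b.attrib:
        return False
    # Assumption: leading/trailing whitespace is not significant.
    if (a.text or "").strip() != (b.text or "").strip():
        return False
    if (a.tail or "").strip() != (b.tail or "").strip():
        return False
    if len(a) != len(b):
        return False
    return all(elements_equal(x, y) for x, y in zip(a, b))


def xml_equal(doc1, doc2):
    """Parse two XML strings and compare them semantically."""
    return elements_equal(ET.fromstring(doc1), ET.fromstring(doc2))


print(xml_equal('<r a="1" b="2"><c/></r>', '<r b="2" a="1"><c /></r>'))
```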

Paul


Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?

2019-03-21 Thread Steve Holden
On Thu, Mar 21, 2019 at 11:33 AM Antoine Pitrou  wrote:

> [...]
>
> Most users and applications should /never/ care about the order of XML
> attributes.
>
> Regards
>
> Antoine
>

Especially as the standards specifically say that ordering has no semantic
impact.

Byte-by-byte comparison of XML is almost always inappropriate.


Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?

2019-03-21 Thread Antoine Pitrou
On Thu, 21 Mar 2019 02:07:01 +0100
Victor Stinner  wrote:
> On Mon, 18 Mar 2019 at 23:41, Raymond Hettinger wrote:
> > The code in the current 3.8 alpha differs from 3.7 in that it removes 
> > attribute sorting and instead preserves the order the user specified when 
> > creating an element.  As far as I can tell, there is no objection to this 
> > as a feature.  
> 
> By the way, what's the rationale of this backward incompatible change?
> 
> I found this short message:
> "FWIW, this issue arose from an end-user problem. She had a hard
> requirement to show a security clearance level as the first attribute.
> We did find a work around but it was hack."
> https://bugs.python.org/issue34160#msg338098
> 
> It's the first time that I hear a user asking to preserve attribute
> insertion order (or did I miss a previous request?). Technically, it
> was possible to implement the feature earlier using OrderedDict. So
> why do it now?
>
> Is it really worth it to break Python backward compatibility (change
> the default behavior) for everyone, if it's only needed for a few users?

The argument you're making is weird here.  If only "a few users" need a
deterministic ordering of XML attributes, then compatibility is broken
only for "a few users", not for "everyone".

Most users and applications should /never/ care about the order of XML
attributes.

Regards

Antoine.




Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?

2019-03-20 Thread Raymond Hettinger


> On Mar 20, 2019, at 6:07 PM, Victor Stinner  wrote:
> 
> what's the rationale of this backward incompatible change?

Please refrain from abusive mischaracterizations.  It is only backwards 
incompatible if there was a guaranteed behavior.  Whether there was or not is 
what this thread is about.  

My reading of this thread was that the various experts did not want to lock in 
the 3.7 behavior nor did they think the purpose of the XML modules is to 
produce an exact binary output.  The lxml maintainer is dropping sorting (it's
expensive and it overrides the order specified by the user). Other XML modules
don't sort. It only made sense as a way to produce a deterministic output 
within a feature release back when there was no other way to do it.

For my part, any agreed upon outcome is fine. I'm not willing to be debased
further, so I am out of this discussion. It's up to you all to do the right
thing.


Raymond





Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?

2019-03-20 Thread Victor Stinner
On Mon, 18 Mar 2019 at 23:41, Raymond Hettinger wrote:
> The code in the current 3.8 alpha differs from 3.7 in that it removes 
> attribute sorting and instead preserves the order the user specified when 
> creating an element.  As far as I can tell, there is no objection to this as 
> a feature.

By the way, what's the rationale of this backward incompatible change?

I found this short message:
"FWIW, this issue arose from an end-user problem. She had a hard
requirement to show a security clearance level as the first attribute.
We did find a work around but it was hack."
https://bugs.python.org/issue34160#msg338098
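
As a minimal sketch of the behaviour under discussion, assuming Python 3.8 or later (where ElementTree keeps attributes in the order they were set); the element and attribute names are made up for illustration:

```python
import xml.etree.ElementTree as ET

doc = ET.Element("document")
doc.set("clearance", "secret")  # set first, intended to be written first
doc.set("author", "alice")

# Python 3.7 and earlier sort attributes by name on output, so "author"
# would be written first there; 3.8+ keeps the order used above.
print(ET.tostring(doc, encoding="unicode"))
```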

It's the first time that I hear a user asking to preserve attribute
insertion order (or did I miss a previous request?). Technically, it
was possible to implement the feature earlier using OrderedDict. So
why do it now?

Is it really worth it to break Python backward compatibility (change
the default behavior) for everyone, if it's only needed for a few users?


> 1) Revert back to the 3.7 behavior. This of course, makes all the tests pass 
> :-)  The downside is that it perpetuates the practice of bytewise equality 
> tests and locks in all implementation quirks forever.  I don't know of anyone 
> advocating this option, but it is the simplest thing to do.

Can't we revert to the Python 3.7 behavior and add a new opt-in option to
preserve the attribute insertion order (current Python 3.8 default
behavior)?

Python 3.7, sorting attributes by name, doesn't sound so silly to me.
It's one arbitrary choice, but at least the output is deterministic.
And well, Python has been doing that for 20 years :-)


> 4) Fix the tests in the third-party modules (...)

I also like the option "do not break backward compatibility", so as not to
have to fix any project :-)

Victor
-- 
Night gathers, and now my watch begins. It shall not end until my death.


Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?

2019-03-20 Thread Raymond Hettinger


> On Mar 20, 2019, at 5:22 PM, Victor Stinner  wrote:
> 
> I don't understand why such a simple solution has been rejected.

It hasn't been rejected. That is above my pay grade.  Stefan and I recommended 
against going down this path. However, since you're in disagreement and have 
marked this as a release blocker, it is now time for the steering committee to 
earn their pay (which is at least double what I'm making) or defer to the 
principal module maintainer, Stefan.

To recap reasons for not going down this path:

1) The only known use case for a "sort=True" parameter is to perpetuate the 
practice of byte-by-byte output comparisons guaranteed to work across feature 
releases.  The various XML experts in this thread have opined that isn't 
something we should guarantee (and sorting isn't the only detail subject
to change; Stefan listed others).

2) The intent of the XML modules is to implement the specification and be 
interoperable with other languages and other XML tools. It is not intended to 
be used to generate an exact binary output.  Per section 3.1 of the XML spec, 
"Note that the order of attribute specifications in a start-tag or 
empty-element tag is not significant."

3) Mitigating a test failure is a one-time problem. API expansions are forever.

4) The existing API is not small and presents a challenge for teaching. Making 
the API bigger will make it worse.

5) As far as I can tell, XML tools in other languages (such as Java) don't sort 
(and likely for good reason).  LXML is dropping its attribute sorting as well, 
so the standard library would become more of an outlier.


Raymond



Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?

2019-03-20 Thread Victor Stinner
On Thu, 21 Mar 2019 at 01:30, Raymond Hettinger wrote:
> There's no preaching and no judgment.  We can't have a conversation though if 
> we can't state the crux of the problem: some existing tests in third-party 
> modules depend on the XML serialization being byte-for-byte identical 
> forever. The various respondents to this thread have indicated that the 
> standard library should only make that guarantee within a single feature 
> release and that it may vary across feature releases.
>
> For docutils, it may end-up being an easy fix (either with a semantic 
> comparison or with regenerating the target files when point releases differ). 
>  For Coverage, I don't make any presumption that reengineering the tests will 
> be easy or fun.  Several mitigation strategies have been proposed:
>
> * alter the element creation code to create the attributes in the desired order
> * use a canonicalization tool to create output that is guaranteed not to change
> * generate new baseline files when a feature release changes
> * apply Stefan's recipe for reordering attributes
> * make a semantic level comparison
>
> Will any of these work for you?

Python 3.8 is still in a very early stage of testing. We only started
to discover which projects are broken by the XML change.

IMHO the problem is wider than just unit tests written in Python.
Python can be used to produce the XML, but other languages can be used
to parse or compare the generated XML. For example, if the generated
file is stored in Git, it will be seen as modified and "git diff" will
show a lot of "irrelevant" changes.

Comparison of XML using string comparison can also be used to avoid
expensive disk/database write or reduce network bandwidth. That's an
issue if the program isn't written in Python, whereas the XML is
generated by Python.

Getting the same output on Python 3.7 and Python 3.8 also matters
for https://reproducible-builds.org/

Victor
-- 
Night gathers, and now my watch begins. It shall not end until my death.


Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?

2019-03-20 Thread Raymond Hettinger


> On Mar 19, 2019, at 4:53 AM, Ned Batchelder  wrote:
> 
> None of this is impossible, but please try not to preach to us maintainers 
> that we are doing it wrong, that it will be easy to fix, etc

There's no preaching and no judgment.  We can't have a conversation though if 
we can't state the crux of the problem: some existing tests in third-party 
modules depend on the XML serialization being byte-for-byte identical forever. 
The various respondents to this thread have indicated that the standard library 
should only make that guarantee within a single feature release and that it may
vary across feature releases.

For docutils, it may end-up being an easy fix (either with a semantic 
comparison or with regenerating the target files when point releases differ).  
For Coverage, I don't make any presumption that reengineering the tests will be 
easy or fun.  Several mitigation strategies have been proposed:

* alter the element creation code to create the attributes in the desired order
* use a canonicalization tool to create output that is guaranteed not to change
* generate new baseline files when a feature release changes
* apply Stefan's recipe for reordering attributes
* make a semantic level comparison

Will any of these work for you?


Raymond









Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?

2019-03-20 Thread Victor Stinner
Hi,

On Mon, 18 Mar 2019 at 23:41, Raymond Hettinger wrote:
> We're having a super interesting discussion on 
> https://bugs.python.org/issue34160 .  It is now marked as a release blocker 
> and warrants a broader discussion.

Thanks for starting a thread on python-dev. I'm the one who raised the
priority to release blocker to trigger such discussion on python-dev.


> Our problem is that at least two distinct and important users have written 
> tests that depend on exact byte-by-byte comparisons of the final 
> serialization.

Sorry, but I don't think that's a good summary of the issue. IMHO
the issue is more generally about how we introduce backward incompatible
changes in Python.

The migration from Python 2 to Python 3 took around ten years. That's
way too long and it caused a lot of trouble in the Python community.
IMHO one explanation is our patronizing behavior toward users,
which I would summarize as "your code is wrong, you have to fix
it" (whereas the code was working well for 10 years with Python 2!).

I'm not opposed to backward incompatible changes, but I think that we
must very carefully prepare the migration and do our best to help
users to migrate their code.


> 2). Go into every XML module and add attribute sorting options to each 
> function that generate xml. (...)

Written like that, it sounds painful and a huge project... But in
practice, the implementation looks simple and straightforward:
https://github.com/python/cpython/pull/12354/files

I don't understand why such a simple solution has been rejected.

IMHO adding an optional sort parameter is just the *bare minimum* that
we can do for our users.

Alternatives have been proposed like a recipe to sort node attributes
before serialization, but honestly, it's way too complex. I don't want
to have to copy such recipe to every project. Add a new function,
import it, use it where XML is written into a file, etc. Taken alone,
maybe it's acceptable. But please remember that some companies are
still porting their large Python 2 code base to Python 3. This new
backward incompatible change gets on top of the pile of other backward
incompatible changes between 2.7 and 3.8.

I would prefer to be able to "just add" sort=True. Don't forget that
tests like "if sys.version_info >= (3, 8):" will be needed, which makes the
overall fix more complicated.

Said differently, the stdlib should help the user to update Python.
The pain should not only be on the user side.

Victor
-- 
Night gathers, and now my watch begins. It shall not end until my death.


Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?

2019-03-19 Thread Gregory P. Smith
On Tue, Mar 19, 2019, 4:53 AM Ned Batchelder  wrote:

> On 3/19/19 4:13 AM, Serhiy Storchaka wrote:
> > 19.03.19 00:41, Raymond Hettinger wrote:
> >> 4) Fix the tests in the third-party modules to be more focused on
> >> their actual test objectives, the semantics of the generated XML
> >> rather than the exact serialization.  This option would seem like the
> >> right-thing-to-do but it isn't trivial because the entire premise of
> >> the existing test is invalid.  For every case, we'll actually have to
> >> think through what the test objective really is.
> >
>
> Option 4 is misleading.  Is anyone here really offering to "fix the
> tests in third-party modules"?  Option 4 is actually, "do nothing, and
> let a multitude of projects figure out how to fix their tests, slowing
> progress in those projects as they try to support Python 3.8."
>

We've done Option 4 for every past behavior change of any form on feature
.Next releases.  We do try to encourage projects to run their tests on the
3.Next betas so that they can be ready before 3.Next.0 lands, some of us
even do it ourselves when we're interested.  Many things won't get ready
ahead of time, but the actual .0 release forces the issue as their users
start demanding it, on occasion offering patches.  We don't block a release
on existing user code being ready for it.

> In my case, the test code has a generic function to compare an actual
> directory of files to an expected directory of files, so it isn't quite
> as simple as "just use the right XML comparison."  And I support Python
> 2.7, 3.5, etc, so tests still need to work under those versions.  None
> of this is impossible, but please try not to preach to us maintainers
> that we are doing it wrong, that it will be easy to fix, etc.  Using
> language like "the entire premise of the test is invalid" seems
> needlessly condescending.
>

Agreed, that was poor wording.  Let's not let that type of wording escape
python-dev into docs about a behavior change.

Wording aside, a test relying on undefined behavior is testing for things
the code under test doesn't actually need to care about being true, even if
it has happened to work for years.  Such a test is overspecified.
Potentially without the authors previously consciously realizing that.
It'll need refactoring to loosen its requirements.  How to loosen it is
always an implementation detail based on the constraints imposed upon the
test.  Difficulty lies within range(0, "Port Mercurial to Python 3").  But
the end result is nice: The code is healthier as tests focus more on what
was actually important rather than what was quicker to write that got the
original job done many years ago.

> One of the suggested solutions, a DOM comparison, is not enough. I
> don't just want to know that my actual XML is different than my expected
> XML.  I want to know where and how it differs.
>
> Textual comparison may be the "wrong" way to check XML, but it gives me
> many tools for working with the test results.  It was simple and it
> worked.  Now in Python 3.8, because Python doesn't want to add an
> optional flag to continue doing what it has always done, I need to
> re-engineer my tests.
>
> --Ned.
>

I understand that from a code owner's perspective having to do any work, no
matter what the reason, is counted as re-engineering.  But that doesn't
make it wrong.  If "what it has always done" was unspecified and arbitrary
despite happening to not change in the past rather than something easy to
continue the stability of such as sorted attributes, it is a burden to
maintain that **unspecifiable** behavior in the language or library going
forward.

(Note that I have no idea what order the xml code in question happened to
impose upon attributes; if it went from sorted to not, a "fix" to provide
users is clear: provide a way to keep it sorted for those who need that.
If it relied on insertion order or hash table iteration order or the phase
of the moon when the release was cut, we are right to refuse to maintain
unspecifiable implementation side effect behavior)

-gps


Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?

2019-03-19 Thread Tim Delaney
On Wed, 20 Mar 2019 at 00:29, Serhiy Storchaka  wrote:

> 19.03.19 15:10, Tim Delaney wrote:
> > Now Calibre is definitely in the wrong here - it should be able to
> > import regardless of the order of attributes. But the fact is that there
> > are a lot of tools out there that are semi-broken in a similar manner.
>
> Isn't Calibre going to sit on Python 2 forever? This makes it
> irrelevant to the discussion about Python 3.8.
>

I was simply using Calibre as an example of a tool I'd encountered recently
that works correctly with input files with attributes in one order, but not
the other. That it happens to be using Python (of any vintage) is
irrelevant - could have been written in C, Go, Lua ... same problem that
XML libraries that arbitrarily sort (or otherwise manipulate the order of)
attributes can result in files that may not work with third-party tools.

Tim Delaney


Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?

2019-03-19 Thread Stefan Behnel
Ned Batchelder wrote on 19.03.19 at 12:53:
> I need to re-engineer my tests.

… or sort the attributes before serialisation, or use C14N always, or
change your code to create the attributes in sorted-by-name order. The new
behaviour allows for a couple of ways to deal with the issue of backwards
compatibility.
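
As a rough sketch of the last option (creating the attributes in sorted-by-name order to begin with), something like the following might do. The helper name `make_element` and the sample attributes are hypothetical, introduced only for this example.

```python
import xml.etree.ElementTree as ET


def make_element(tag, **attrs):
    """Create an element whose attributes are inserted sorted by name."""
    # Passing an already-sorted dict means insertion order and sorted
    # order coincide, whichever of the two the serialiser honours.
    return ET.Element(tag, dict(sorted(attrs.items())))


item = make_element("item", name="x", id="7", color="red")
print(list(item.attrib))
```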

Stefan



Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?

2019-03-19 Thread Stefan Behnel
Nathaniel Smith wrote on 19.03.19 at 00:15:
> That seems potentially simpler to implement than canonical XML
> serialization

C14N is already implemented for ElementTree, just needs to be ported to
Py3.8 and merged.

https://bugs.python.org/issue13611
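
A minimal sketch of how C14N output could be used for comparisons, assuming Python 3.8 or later, where `xml.etree.ElementTree.canonicalize()` is available. The sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET

a = ET.canonicalize('<root b="2" a="1"><child/></root>')
b = ET.canonicalize('<root a="1" b="2"><child></child></root>')

# Both inputs should normalise to the same text, so a plain string
# comparison (or a textual diff) becomes meaningful again.
print(a == b)
```

On older Python versions this call is not available, so code that must run there would need another of the approaches listed in this thread.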

Stefan



Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?

2019-03-19 Thread Brett Cannon
On Tue, Mar 19, 2019 at 6:15 AM Serhiy Storchaka 
wrote:

> 19.03.19 13:53, Ned Batchelder wrote:
> > Option 4 is misleading.  Is anyone here really offering to "fix the
> > tests in third-party modules"?  Option 4 is actually, "do nothing, and
> > let a multitude of projects figure out how to fix their tests, slowing
> > progress in those projects as they try to support Python 3.8."
>
> Any option except option 1 (and option 2 with sorting by default)
> requires changing third-party code. You should either pass additional
> argument to serialization functions, or use special canonicalization functions.
>
> We should look at the problem from a long-term perspective. Freezing the
> current behavior forever does not look good. If we need to break the
> user code, we should minimize the harm and provide convenient tools for
> reproducing the current behavior. And this is an opportunity to rewrite
> user tests in more appropriate form. In your case textual comparison may
> be the most appropriate form, but this may be not so in other cases.
>

In situations like this I think it's best to bite the bullet sooner rather
than later while acknowledging that folks like Ned are in a bind when they
have to support older versions and thus have long-term support costs, too, and
try to make the transition as painless as possible (my guess is Ned's need
to support older versions will drop off faster than us having to support
the xml libraries in the stdlib going forward, hence my viewpoint).


>
> > Now in Python 3.8, because Python doesn't want to add an
> > optional flag to continue doing what it has always done, I need to
> > re-engineer my tests.
>
> Please wait a little longer. I hope to add canonicalization before the
> first beta.
>

For me I think canonicalization/stable pretty-print is the best option,
especially if we can put the canonicalization code up on PyPI for
supporting older versions of Python. Otherwise a function that does
something like an XOR to help diagnose what differs between 2 XML documents
also seems like a good option to me.
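
To sketch what such a diagnostic helper might look like: the function below walks two trees in parallel and reports where they differ, ignoring attribute order. The name `xml_diff` and the path notation are hypothetical; nothing like this is confirmed to exist in the stdlib, and whitespace-only text differences are ignored by assumption.

```python
import xml.etree.ElementTree as ET


def xml_diff(a, b, path=""):
    """Yield human-readable differences between two elements."""
    path = f"{path}/{a.tag}"
    if a.tag != b.tag:
        yield f"{path}: tag {a.tag!r} != {b.tag!r}"
        return
    # Compare attributes as mappings, so their order is ignored.
    for name in sorted(set(a.attrib) | set(b.attrib)):
        left, right = a.attrib.get(name), b.attrib.get(name)
        if left != right:
            yield f"{path}/@{name}: {left!r} != {right!r}"
    if (a.text or "").strip() != (b.text or "").strip():
        yield f"{path}: text {a.text!r} != {b.text!r}"
    if len(a) != len(b):
        yield f"{path}: {len(a)} children != {len(b)} children"
    # Children are paired positionally; extra children are only counted.
    for child_a, child_b in zip(a, b):
        yield from xml_diff(child_a, child_b, path)


old = ET.fromstring('<r a="1" b="2"><c x="1"/></r>')
new = ET.fromstring('<r b="2" a="1"><c x="9"/></r>')
for line in xml_diff(old, new):
    print(line)
```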


Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?

2019-03-19 Thread Stéphane Wirtel
And why not use a very long PendingDeprecationWarning? This warning
could be used in this case.


Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?

2019-03-19 Thread Serhiy Storchaka

19.03.19 15:10, Tim Delaney wrote:
> Now Calibre is definitely in the wrong here - it should be able to
> import regardless of the order of attributes. But the fact is that there
> are a lot of tools out there that are semi-broken in a similar manner.

Isn't Calibre going to sit on Python 2 forever? This makes it
irrelevant to the discussion about Python 3.8.




Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?

2019-03-19 Thread Serhiy Storchaka

19.03.19 14:50, Antoine Pitrou wrote:
>> 2). Go into every XML module and add attribute sorting options to each
>> function that generate xml.
>
> What do you mean with "every XML module"? Are there many of them?

ElementTree and minidom. Maybe xmlrpc. And perhaps we need to add
arguments in calls at a higher level where these modules are used.




Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?

2019-03-19 Thread Serhiy Storchaka

19.03.19 13:53, Ned Batchelder wrote:
> Option 4 is misleading.  Is anyone here really offering to "fix the
> tests in third-party modules"?  Option 4 is actually, "do nothing, and
> let a multitude of projects figure out how to fix their tests, slowing
> progress in those projects as they try to support Python 3.8."

Any option except option 1 (and option 2 with sorting by default)
requires changing third-party code. You must either pass an additional
argument to the serialization functions, or use special canonicalization
functions.

We should look at the problem from a long-term perspective. Freezing the
current behavior forever does not look good. If we need to break
user code, we should minimize the harm and provide convenient tools for
reproducing the current behavior. And this is an opportunity to rewrite
user tests in a more appropriate form. In your case textual comparison may
be the most appropriate form, but that may not be so in other cases.

> Now in Python 3.8, because Python doesn't want to add an
> optional flag to continue doing what it has always done, I need to
> re-engineer my tests.

Please wait a little longer. I hope to add canonicalization before the
first beta.




Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?

2019-03-19 Thread Tim Delaney
On Tue, 19 Mar 2019 at 23:13, David Mertz  wrote:

> In a way, this case makes bugs worse because they are not only a Python
> internal matter. XML is used to communicate among many tools and
> programming languages, and relying on assumptions those other tools will
> not follow is a bad habit.
>

I have a recent example I encountered where the 3.7 behaviour (sorting
attributes) results in a third-party tool behaving incorrectly, whereas
maintaining attribute order works correctly. The particular case was using
HTML  tags for importing into Calibre for converting to an ebook. The
most common symptom was that series indexes were sometimes being correctly
imported, and sometimes not. Occasionally other  tags would also fail
to be correctly imported.

Turns out that  gave consistently
correct results, whilst  was
erratic. And whilst I'd specified the  tags with the name attribute
first, I was then passing the HTML through BeautifulSoup, which sorted the
attributes.

Now Calibre is definitely in the wrong here - it should be able to import
regardless of the order of attributes. But the fact is that there are a lot
of tools out there that are semi-broken in a similar manner.

This to me is an argument to default to maintaining order, but provide a
way for the caller to control the order of attributes when formatting e.g.
pass an ordering function. If you want sorted attributes, pass the built-in
sorted function as your ordering function. But I think that's getting
beyond the scope of this discussion.
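
A minimal sketch of what caller-controlled ordering could look like with
ElementTree, assuming a Python version whose serializer preserves attribute
insertion order (3.8 and later). The helper name `reorder_attributes` and its
`key` parameter are illustrative assumptions, not an existing stdlib API; the
sketch only rearranges each element's `attrib` dict in place before
serialization.

```python
import xml.etree.ElementTree as ET


def reorder_attributes(root, key=None):
    """Reorder the attributes of every element under root, in place.

    key is applied to (name, value) pairs, as with sorted(); the default
    gives plain alphabetical order by attribute name.
    """
    for el in root.iter():
        attrib = el.attrib
        if len(attrib) > 1:
            items = sorted(attrib.items(), key=key)
            attrib.clear()
            attrib.update(items)


root = ET.Element("meta")
root.set("name", "series")
root.set("content", "1")

# Alphabetical order, matching what 3.7 and earlier produced.
reorder_attributes(root)
print(ET.tostring(root))

# A custom order: put "name" first, leave the rest in their current order.
reorder_attributes(root, key=lambda item: item[0] != "name")
print(list(root.attrib))
```

Because the ordering lives in a helper rather than in the serializer, the
same code works regardless of what a given Python release does by default.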

Tim Delaney


Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?

2019-03-19 Thread Antoine Pitrou


Hi Raymond,

As long as the new serialization order is deterministic (i.e. it's the
same every run and doesn't depend on e.g. hash randomization), then I
think it's fine to change it.

Some more comments / questions:

> 2) Go into every XML module and add attribute sorting options to each
> function that generates XML.

What do you mean with "every XML module"? Are there many of them?

> Regardless of option chosen, we should make explicit whether or not the
> Python standard library modules guarantee cross-release bytewise identical 
> output for XML.

IMO we certainly shouldn't.  XML is a serialization format used for
machine interoperability (even though "human-editable" was one of its
selling points at the start, rather misleadingly).  However, the output
should ideally be stable and deterministic across all releases of a
given bugfix branch.

(i.e., if I run the same code multiple times on all 3.7.x versions, I
should always get the same output)

Regards

Antoine.




Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?

2019-03-19 Thread David Mertz
In my opinion, any test that relied on a non-promised accident of
serialization is broken today. Very often, such bad tests mask bad
production code that makes the same unsafe assumptions.

This is similar to tests that assumed a certain dictionary order, before we
got guaranteed insertion order. Or like tests that rely on object identity
of short strings or small ints. Or like non-guaranteed identities in
pickles across versions.

In a way, this case makes bugs worse because they are not only a Python
internal matter. XML is used to communicate among many tools and
programming languages, and relying on assumptions those other tools will
not follow is a bad habit. Sure, most tests probably don't get to the point
of touching those external tools themselves, but starting from bad
assumptions about the domain isn't best practice.

That said, I think an XML canonicalization function is generally a good
thing for Python to have. But it shouldn't be a stopper in releases.
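
For readers on Python 3.8 or later, a canonicalization-based comparison can be
sketched with `xml.etree.ElementTree.canonicalize()`, which was added to the
standard library in 3.8. The helper name `assert_same_xml` is a hypothetical
one chosen for this example, and the choice of `strip_text=True` is an
assumption about what a test would want to ignore.

```python
import xml.etree.ElementTree as ET


def assert_same_xml(expected, actual):
    """Compare two XML strings after C14N 2.0 canonicalization.

    Canonical form sorts attributes and writes empty elements the same
    way, so purely syntactic differences do not fail the comparison.
    """
    canon_expected = ET.canonicalize(expected, strip_text=True)
    canon_actual = ET.canonicalize(actual, strip_text=True)
    assert canon_expected == canon_actual, (canon_expected, canon_actual)


# Attribute order and the empty-element spelling differ; the content does not.
assert_same_xml('<a y="2" x="1"/>', '<a x="1" y="2"></a>')
```

The canonical strings are ordinary text, so they can also be written to the
expected-output files of an existing text-based test suite.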

On Mon, Mar 18, 2019, 6:47 PM Raymond Hettinger wrote:

> We're having a super interesting discussion on
> https://bugs.python.org/issue34160 .  It is now marked as a release
> blocker and warrants a broader discussion.
>
> Our problem is that at least two distinct and important users have written
> tests that depend on exact byte-by-byte comparisons of the final
> serialization.  So any changes to the XML modules will break those tests
> (not the applications themselves, just the test cases that assume the
> output will be forever, byte-by-byte identical).
>
> In theory, the tests are incorrectly designed and should not treat the
> module output as a canonical normal form.  In practice, doing an equality
> test on the output is the simplest, most obvious approach, and likely is
> being done in other packages we don't know about yet.
>
> With pickle, json, and __repr__, the usual way to write a test is to
> verify a roundtrip:  assert pickle.loads(pickle.dumps(data)) == data.  With
> XML, the problem is that the DOM doesn't have an equality operator.  The
> user is left with either testing specific fragments with
> element.find(xpath) or with using a standards compliant canonicalization
> package (not available from us). Neither option is pleasant.
>
> The code in the current 3.8 alpha differs from 3.7 in that it removes
> attribute sorting and instead preserves the order the user specified when
> creating an element.  As far as I can tell, there is no objection to this
> as a feature.  The problem is what to do about the existing tests in
> third-party code, what guarantees we want to make going forward, and what
> do we recommend as a best practice for testing XML generation.
>
> Things we can do:
>
> 1) Revert back to the 3.7 behavior. This, of course, makes all the tests
> pass :-)  The downside is that it perpetuates the practice of bytewise
> equality tests and locks in all implementation quirks forever.  I don't
> know of anyone advocating this option, but it is the simplest thing to do.
>
> 2) Go into every XML module and add attribute sorting options to each
> function that generates XML.  This gives users a way to make their tests
> pass for now. There are several downsides. a) It grows the API in a way
> that is inconsistent with all the other XML packages I've seen. b) We'll
> have to test, maintain, and document the API forever -- the API is already
> large and time consuming to teach. c) It perpetuates the notion that
> bytewise equality tests are the right thing to do, so we'll have this
> problem again if we substitute in another code generator or alter any of the
> other implementation quirks (i.e. how CDATA sections are serialized).
>
> 3) Add a standards compliant canonicalization tool (see
> https://en.wikipedia.org/wiki/Canonical_XML ).  This is likely to be the
> right-way-to-do-it but takes time and energy.
>
> 4) Fix the tests in the third-party modules to be more focused on their
> actual test objectives, the semantics of the generated XML rather than the
> exact serialization.  This option would seem like the right-thing-to-do but
> it isn't trivial because the entire premise of the existing test is
> invalid.  For every case, we'll actually have to think through what the
> test objective really is.
>
> Of these, option 2 is my least preferred.  Ideally, we don't guarantee
> bytewise identical output across releases, and ideally we don't grow a new
> API that perpetuates the issue. That said, I'm not wedded to any of these
> options and just want us to do what is best for the users in the long run.
>
> Regardless of option chosen, we should make explicit whether or not the
> Python standard library modules guarantee cross-release bytewise identical
> output for XML. That is really the core issue here.  Had we had an explicit
> notice one way or the other, there wouldn't be an issue now.
>
> Any thoughts?
>
>
>
> Raymond Hettinger
>
>
> P.S.   Stefan Behnel is planning to remove attribute sorting from lxml.
> On the bug tr

Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?

2019-03-19 Thread Ned Batchelder

On 3/19/19 4:13 AM, Serhiy Storchaka wrote:

> 19.03.19 00:41, Raymond Hettinger wrote:
>> 3) Add a standards compliant canonicalization tool (see
>> https://en.wikipedia.org/wiki/Canonical_XML ).  This is likely to be
>> the right-way-to-do-it but takes time and energy.
>>
>> 4) Fix the tests in the third-party modules to be more focused on
>> their actual test objectives, the semantics of the generated XML
>> rather than the exact serialization.  This option would seem like the
>> right-thing-to-do but it isn't trivial because the entire premise of
>> the existing test is invalid.  For every case, we'll actually have to
>> think through what the test objective really is.
>
> I think the combination of options 3 and 4 is the right thing. Stable
> output is not always needed; in those cases option 4 should be
> considered. But there are valid use cases for stable output, and for
> those we need to provide an alternative in the stdlib. I am
> working on this.


Option 4 is misleading.  Is anyone here really offering to "fix the 
tests in third-party modules"?  Option 4 is actually, "do nothing, and 
let a multitude of projects figure out how to fix their tests, slowing 
progress in those projects as they try to support Python 3.8."


In my case, the test code has a generic function to compare an actual 
directory of files to an expected directory of files, so it isn't quite 
as simple as "just use the right XML comparison."  And I support Python 
2.7, 3.5, etc, so tests still need to work under those versions.  None 
of this is impossible, but please try not to preach to us maintainers 
that we are doing it wrong, that it will be easy to fix, etc.  Using 
language like "the entire premise of the test is invalid" seems 
needlessly condescending.


As one of the suggested solutions, a DOM comparison is not enough. I 
don't just want to know that my actual XML is different than my expected 
XML.  I want to know where and how it differs.
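
One way to get a "where and how" report without depending on serialization
order is to flatten each tree into a stable list of lines and diff those. The
following is only a sketch under that assumption; `describe` and `xml_diff`
are hypothetical helper names, and the line format is arbitrary.

```python
import difflib
import xml.etree.ElementTree as ET


def describe(element, path=""):
    """Yield one line per tag, attribute, and text node, in a stable order."""
    here = f"{path}/{element.tag}"
    yield here
    # Attributes are sorted so their order in the source does not matter.
    for name, value in sorted(element.attrib.items()):
        yield f"{here} @{name}={value!r}"
    text = (element.text or "").strip()
    if text:
        yield f"{here} text={text!r}"
    for child in element:
        yield from describe(child, here)


def xml_diff(expected, actual):
    """Return unified-diff lines between two XML strings (empty if equivalent)."""
    a = list(describe(ET.fromstring(expected)))
    b = list(describe(ET.fromstring(actual)))
    return list(difflib.unified_diff(a, b, "expected", "actual", lineterm=""))


for line in xml_diff('<a x="1"><b>old</b></a>', '<a x="1"><b>new</b></a>'):
    print(line)
```

The output is ordinary diff text, so the usual tools for reading and storing
textual test results still apply.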


Textual comparison may be the "wrong" way to check XML, but it gives me 
many tools for working with the test results.  It was simple and it 
worked.  Now in Python 3.8, because Python doesn't want to add an 
optional flag to continue doing what it has always done, I need to 
re-engineer my tests.


--Ned.





Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?

2019-03-19 Thread Serhiy Storchaka

19.03.19 00:41, Raymond Hettinger wrote:
> 3) Add a standards compliant canonicalization tool (see
> https://en.wikipedia.org/wiki/Canonical_XML ).  This is likely to be the
> right-way-to-do-it but takes time and energy.
>
> 4) Fix the tests in the third-party modules to be more focused on their
> actual test objectives, the semantics of the generated XML rather than the
> exact serialization.  This option would seem like the right-thing-to-do but
> it isn't trivial because the entire premise of the existing test is
> invalid.  For every case, we'll actually have to think through what the
> test objective really is.

I think the combination of options 3 and 4 is the right thing. Stable
output is not always needed; in those cases option 4 should be
considered. But there are valid use cases for stable output, and for
those we need to provide an alternative in the stdlib. I am
working on this.




Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?

2019-03-18 Thread Gregory P. Smith
On Mon, Mar 18, 2019 at 9:44 PM Terry Reedy  wrote:

> On 3/18/2019 6:41 PM, Raymond Hettinger wrote:
> > We're having a super interesting discussion on
> https://bugs.python.org/issue34160 .  It is now marked as a release
> blocker and warrants a broader discussion.
> >
> > Our problem is that at least two distinct and important users have
> written tests that depend on exact byte-by-byte comparisons of the final
> serialization.  So any changes to the XML modules will break those tests
> (not the applications themselves, just the test cases that assume the
> output will be forever, byte-by-byte identical).
> >
> > In theory, the tests are incorrectly designed and should not treat the
> module output as a canonical normal form.  In practice, doing an equality
> test on the output is the simplest, most obvious approach, and likely is
> being done in other packages we don't know about yet.
> >
> > With pickle, json, and __repr__, the usual way to write a test is to
> verify a roundtrip:  assert pickle.loads(pickle.dumps(data)) == data.  With
> XML, the problem is that the DOM doesn't have an equality operator.  The
> user is left with either testing specific fragments with
> element.find(xpath) or with using a standards compliant canonicalization
> package (not available from us). Neither option is pleasant.
> >
> > The code in the current 3.8 alpha differs from 3.7 in that it removes
> attribute sorting and instead preserves the order the user specified when
> creating an element.  As far as I can tell, there is no objection to this
> as a feature.  The problem is what to do about the existing tests in
> third-party code, what guarantees we want to make going forward, and what
> do we recommend as a best practice for testing XML generation.
> >
> > Things we can do:
> >
> > 1) Revert back to the 3.7 behavior. This, of course, makes all the tests
> pass :-)  The downside is that it perpetuates the practice of bytewise
> equality tests and locks in all implementation quirks forever.  I don't
> know of anyone advocating this option, but it is the simplest thing to do.
>
> If it comes down to doing *something* to unblock the release ...
> 1b) Revert to 3.7 *and* document that byte equality with current output
> is *not* guaranteed.
>
> > 2) Go into every XML module and add attribute sorting options to each
> function that generates XML.  This gives users a way to make their tests
> pass for now. There are several downsides. a) It grows the API in a way
> that is inconsistent with all the other XML packages I've seen. b) We'll
> have to test, maintain, and document the API forever -- the API is already
> large and time consuming to teach. c) It perpetuates the notion that
> bytewise equality tests are the right thing to do, so we'll have this
> problem again if we substitute in another code generator or alter any of the
> other implementation quirks (i.e. how CDATA sections are serialized).
> >
> > 3) Add a standards compliant canonicalization tool (see
> https://en.wikipedia.org/wiki/Canonical_XML ).  This is likely to be the
> right-way-to-do-it but takes time and energy.

>
> > 4) Fix the tests in the third-party modules to be more focused on their
> actual test objectives, the semantics of the generated XML rather than the
> exact serialization.  This option would seem like the right-thing-to-do but
> it isn't trivial because the entire premise of the existing test is
> invalid.  For every case, we'll actually have to think through what the
> test objective really is.

>
> > Of these, option 2 is my least preferred.  Ideally, we don't guarantee
> bytewise identical output across releases, and ideally we don't grow a new
> API that perpetuates the issue. That said, I'm not wedded to any of these
> options and just want us to do what is best for the users in the long run.
>

For (1): don't revert in 3.8. Do not worry about the order or formatting of
serialized data changing between major Python releases.  A change in 3.8?
That's 100% okay.  This already happens all the time between Python
releases.  We've changed dict iteration order between releases twice this
decade.

Within point releases of stable versions, i.e. 3.7.x? Up to the release
manager; it is semi-rude to change something like this within a stable
release unless there is a good reason, but I *believe* we have done it
before. A general rule of thumb is to try not to, unless the code needed
to avoid the change would be overcomplicated.

It is always the user code depending on the non-declared ordering within
output that is wrong; when we preserve it we're only doing them a temporary
favor that ultimately allows more problems to grow in the future.  Nobody
should use a text comparison on serialized data not explicitly stated as
canonical and call that test good by any standard, unless they are writing a
test for canonical output by a library that explicitly guarantees its
output will be canonical.

Agreed that your option (2) is not good for 

Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?

2019-03-18 Thread Terry Reedy

On 3/18/2019 6:41 PM, Raymond Hettinger wrote:

> We're having a super interesting discussion on
> https://bugs.python.org/issue34160 .  It is now marked as a release
> blocker and warrants a broader discussion.
>
> Our problem is that at least two distinct and important users have written
> tests that depend on exact byte-by-byte comparisons of the final
> serialization.  So any changes to the XML modules will break those tests
> (not the applications themselves, just the test cases that assume the
> output will be forever, byte-by-byte identical).
>
> In theory, the tests are incorrectly designed and should not treat the
> module output as a canonical normal form.  In practice, doing an equality
> test on the output is the simplest, most obvious approach, and likely is
> being done in other packages we don't know about yet.
>
> With pickle, json, and __repr__, the usual way to write a test is to
> verify a roundtrip:  assert pickle.loads(pickle.dumps(data)) == data.  With
> XML, the problem is that the DOM doesn't have an equality operator.  The
> user is left with either testing specific fragments with
> element.find(xpath) or with using a standards compliant canonicalization
> package (not available from us). Neither option is pleasant.
>
> The code in the current 3.8 alpha differs from 3.7 in that it removes
> attribute sorting and instead preserves the order the user specified when
> creating an element.  As far as I can tell, there is no objection to this
> as a feature.  The problem is what to do about the existing tests in
> third-party code, what guarantees we want to make going forward, and what
> do we recommend as a best practice for testing XML generation.
>
> Things we can do:
>
> 1) Revert back to the 3.7 behavior. This, of course, makes all the tests
> pass :-)  The downside is that it perpetuates the practice of bytewise
> equality tests and locks in all implementation quirks forever.  I don't
> know of anyone advocating this option, but it is the simplest thing to do.


If it comes down to doing *something* to unblock the release ...
1b) Revert to 3.7 *and* document that byte equality with current output
is *not* guaranteed.



> 2) Go into every XML module and add attribute sorting options to each
> function that generates XML.  This gives users a way to make their tests
> pass for now. There are several downsides. a) It grows the API in a way
> that is inconsistent with all the other XML packages I've seen. b) We'll
> have to test, maintain, and document the API forever -- the API is already
> large and time consuming to teach. c) It perpetuates the notion that
> bytewise equality tests are the right thing to do, so we'll have this
> problem again if we substitute in another code generator or alter any of the
> other implementation quirks (i.e. how CDATA sections are serialized).
>
> 3) Add a standards compliant canonicalization tool (see
> https://en.wikipedia.org/wiki/Canonical_XML ).  This is likely to be the
> right-way-to-do-it but takes time and energy.
>
> 4) Fix the tests in the third-party modules to be more focused on their
> actual test objectives, the semantics of the generated XML rather than the
> exact serialization.  This option would seem like the right-thing-to-do but
> it isn't trivial because the entire premise of the existing test is
> invalid.  For every case, we'll actually have to think through what the
> test objective really is.
>
> Of these, option 2 is my least preferred.  Ideally, we don't guarantee
> bytewise identical output across releases, and ideally we don't grow a new
> API that perpetuates the issue. That said, I'm not wedded to any of these
> options and just want us to do what is best for the users in the long run.


The point of 1b would be to give us time to do that if more is needed.


> Regardless of option chosen, we should make explicit whether or not the
> Python standard library modules guarantee cross-release bytewise identical
> output for XML. That is really the core issue here.  Had we had an explicit
> notice one way or the other, there wouldn't be an issue now.


I have not read the XML docs but based on this and the issue discussion 
and what I think our general guarantee policy has been, I would consider 
that there is not one.  (I am thinking about things like garbage 
collection, stable sorting, and set/dict iteration order.)


--
Terry Jan Reedy



Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?

2019-03-18 Thread Raymond Hettinger


> On Mar 18, 2019, at 4:15 PM, Nathaniel Smith  wrote:
> 
> I noticed that your list doesn't include "add a DOM equality operator". That 
> seems potentially simpler to implement than canonical XML serialization, and 
> like a useful thing to have in any case. Would it make sense as an option?

Time machine!  Stéphane Wirtel just posted a basic semantic comparison between 
two streams.¹   Presumably, there would need to be a range of options for 
specifying what constitutes equivalence but this is a nice start.
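
To make the idea concrete, here is a minimal sketch of a semantic comparison
over ElementTree elements. It is not the script linked in the footnote; the
function name `elements_equal` and the `ignore_whitespace` option are
assumptions made for this example, and a real tool would likely need more
options (namespaces, comments, child order).

```python
import xml.etree.ElementTree as ET


def elements_equal(a, b, ignore_whitespace=True):
    """Return True if two elements have the same tag, attributes, text,
    and children, regardless of attribute order."""

    def norm(text):
        text = text or ""
        return text.strip() if ignore_whitespace else text

    return (
        a.tag == b.tag
        and a.attrib == b.attrib  # dict equality ignores insertion order
        and norm(a.text) == norm(b.text)
        and norm(a.tail) == norm(b.tail)
        and len(a) == len(b)
        and all(elements_equal(x, y, ignore_whitespace) for x, y in zip(a, b))
    )


left = ET.fromstring('<a x="1" y="2"><b/> </a>')
right = ET.fromstring('<a y="2" x="1"><b></b></a>')
print(elements_equal(left, right))
```

The `ignore_whitespace` flag is one example of the "range of options" a
general-purpose comparison would have to expose.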

Raymond


¹ https://bugs.python.org/file48217/test_xml_compare.py



Re: [Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?

2019-03-18 Thread Nathaniel Smith
I noticed that your list doesn't include "add a DOM equality operator".
That seems potentially simpler to implement than canonical XML
serialization, and like a useful thing to have in any case. Would it make
sense as an option?

On Mon, Mar 18, 2019, 15:46 Raymond Hettinger wrote:

> We're having a super interesting discussion on
> https://bugs.python.org/issue34160 .  It is now marked as a release
> blocker and warrants a broader discussion.
>
> Our problem is that at least two distinct and important users have written
> tests that depend on exact byte-by-byte comparisons of the final
> serialization.  So any changes to the XML modules will break those tests
> (not the applications themselves, just the test cases that assume the
> output will be forever, byte-by-byte identical).
>
> In theory, the tests are incorrectly designed and should not treat the
> module output as a canonical normal form.  In practice, doing an equality
> test on the output is the simplest, most obvious approach, and likely is
> being done in other packages we don't know about yet.
>
> With pickle, json, and __repr__, the usual way to write a test is to
> verify a roundtrip:  assert pickle.loads(pickle.dumps(data)) == data.  With
> XML, the problem is that the DOM doesn't have an equality operator.  The
> user is left with either testing specific fragments with
> element.find(xpath) or with using a standards compliant canonicalization
> package (not available from us). Neither option is pleasant.
>
> The code in the current 3.8 alpha differs from 3.7 in that it removes
> attribute sorting and instead preserves the order the user specified when
> creating an element.  As far as I can tell, there is no objection to this
> as a feature.  The problem is what to do about the existing tests in
> third-party code, what guarantees we want to make going forward, and what
> do we recommend as a best practice for testing XML generation.
>
> Things we can do:
>
> 1) Revert back to the 3.7 behavior. This, of course, makes all the tests
> pass :-)  The downside is that it perpetuates the practice of bytewise
> equality tests and locks in all implementation quirks forever.  I don't
> know of anyone advocating this option, but it is the simplest thing to do.
>
> 2) Go into every XML module and add attribute sorting options to each
> function that generates XML.  This gives users a way to make their tests
> pass for now. There are several downsides. a) It grows the API in a way
> that is inconsistent with all the other XML packages I've seen. b) We'll
> have to test, maintain, and document the API forever -- the API is already
> large and time consuming to teach. c) It perpetuates the notion that
> bytewise equality tests are the right thing to do, so we'll have this
> problem again if we substitute in another code generator or alter any of the
> other implementation quirks (i.e. how CDATA sections are serialized).
>
> 3) Add a standards compliant canonicalization tool (see
> https://en.wikipedia.org/wiki/Canonical_XML ).  This is likely to be the
> right-way-to-do-it but takes time and energy.
>
> 4) Fix the tests in the third-party modules to be more focused on their
> actual test objectives, the semantics of the generated XML rather than the
> exact serialization.  This option would seem like the right-thing-to-do but
> it isn't trivial because the entire premise of the existing test is
> invalid.  For every case, we'll actually have to think through what the
> test objective really is.
>
> Of these, option 2 is my least preferred.  Ideally, we don't guarantee
> bytewise identical output across releases, and ideally we don't grow a new
> API that perpetuates the issue. That said, I'm not wedded to any of these
> options and just want us to do what is best for the users in the long run.
>
> Regardless of option chosen, we should make explicit whether or not the
> Python standard library modules guarantee cross-release bytewise identical
> output for XML. That is really the core issue here.  Had we had an explicit
> notice one way or the other, there wouldn't be an issue now.
>
> Any thoughts?
>
>
>
> Raymond Hettinger
>
>
> P.S.   Stefan Behnel is planning to remove attribute sorting from lxml.
> On the bug tracker, he has clearly articulated his reasons.


[Python-Dev] Is XML serialization output guaranteed to be bytewise identical forever?

2019-03-18 Thread Raymond Hettinger
We're having a super interesting discussion on 
https://bugs.python.org/issue34160 .  It is now marked as a release blocker and 
warrants a broader discussion.

Our problem is that at least two distinct and important users have written 
tests that depend on exact byte-by-byte comparisons of the final serialization. 
 So any changes to the XML modules will break those tests (not the applications 
themselves, just the test cases that assume the output will be forever, 
byte-by-byte identical).  

In theory, the tests are incorrectly designed and should not treat the module 
output as a canonical normal form.  In practice, doing an equality test on the 
output is the simplest, most obvious approach, and likely is being done in 
other packages we don't know about yet.

With pickle, json, and __repr__, the usual way to write a test is to verify a 
roundtrip:  assert pickle.loads(pickle.dumps(data)) == data.  With XML, the 
problem is that the DOM doesn't have an equality operator.  The user is left 
with either testing specific fragments with element.find(xpath) or with using a 
standards compliant canonicalization package (not available from us). Neither 
option is pleasant.
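
For comparison, a fragment-based test of the kind described above might look
like the following sketch. `build_catalog` stands in for whatever code is
under test and is purely hypothetical; the assertions check parsed structure
rather than serialized bytes, so attribute order does not matter.

```python
import xml.etree.ElementTree as ET


def build_catalog():
    """Hypothetical code under test: returns serialized XML bytes."""
    root = ET.Element("catalog", {"version": "1", "lang": "en"})
    book = ET.SubElement(root, "book", {"id": "b1"})
    ET.SubElement(book, "title").text = "Example"
    return ET.tostring(root)


def test_catalog_fragments():
    root = ET.fromstring(build_catalog())
    assert root.tag == "catalog"
    # Dict comparison: independent of the order attributes were written in.
    assert root.attrib == {"version": "1", "lang": "en"}
    title = root.find("./book[@id='b1']/title")
    assert title is not None and title.text == "Example"


test_catalog_fragments()
```

The cost is visible even in this small case: every property worth checking
has to be spelled out by hand.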

The code in the current 3.8 alpha differs from 3.7 in that it removes attribute 
sorting and instead preserves the order the user specified when creating an 
element.  As far as I can tell, there is no objection to this as a feature.  
The problem is what to do about the existing tests in third-party code, what 
guarantees we want to make going forward, and what do we recommend as a best 
practice for testing XML generation.

Things we can do:

1) Revert back to the 3.7 behavior. This, of course, makes all the tests pass :-) 
 The downside is that it perpetuates the practice of bytewise equality tests 
and locks in all implementation quirks forever.  I don't know of anyone 
advocating this option, but it is the simplest thing to do.

2) Go into every XML module and add attribute sorting options to each function
that generates XML.  This gives users a way to make their tests pass for now.
There are several downsides. a) It grows the API in a way that is inconsistent 
with all the other XML packages I've seen. b) We'll have to test, maintain, and 
document the API forever -- the API is already large and time consuming to 
teach. c) It perpetuates the notion that bytewise equality tests are the right 
thing to do, so we'll have this problem again if we substitute in another code
generator or alter any of the other implementation quirks (i.e. how CDATA 
sections are serialized).

3) Add a standards compliant canonicalization tool (see 
https://en.wikipedia.org/wiki/Canonical_XML ).  This is likely to be the 
right-way-to-do-it but takes time and energy.

4) Fix the tests in the third-party modules to be more focused on their actual 
test objectives, the semantics of the generated XML rather than the exact 
serialization.  This option would seem like the right-thing-to-do but it isn't 
trivial because the entire premise of the existing test is invalid.  For every 
case, we'll actually have to think through what the test objective really is.

Of these, option 2 is my least preferred.  Ideally, we don't guarantee bytewise 
identical output across releases, and ideally we don't grow a new API that 
perpetuates the issue. That said, I'm not wedded to any of these options and 
just want us to do what is best for the users in the long run.

Regardless of option chosen, we should make explicit whether or not the Python
standard library modules guarantee cross-release bytewise identical output for 
XML. That is really the core issue here.  Had we had an explicit notice one way 
or the other, there wouldn't be an issue now.

Any thoughts?



Raymond Hettinger


P.S.   Stefan Behnel is planning to remove attribute sorting from lxml.  On the 
bug tracker, he has clearly articulated his reasons.

