Re: [Python-ideas] Windows Best Fit Encodings

2018-01-19 Thread Steve Dower

On 20Jan2018 0518, M.-A. Lemburg wrote:

> do you know of a definite resource for Windows code pages
> on MSDN or another official MS website ?


I don't know of anything, sorry, and my quick search didn't turn up 
anything public. But I can at least confirm that the internal table for 
cp1252 has the same undefined characters as on unicode.org, so 
presumably if MultiByteToWideChar is mapping those to "best fit" 
characters it's only because the flag has been passed. As far as I can 
tell, Microsoft has not been secretly redefining any encodings.
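
For what it's worth, the same gaps are easy to see from Python itself, since (as noted elsewhere in this thread) its Windows code page codecs are built from the unicode.org mapping files. A minimal sketch, assuming only the standard "cp1252" codec; the variable names are illustrative:

```python
# Collect the bytes in the C1 range (0x80-0x9F) that Python's cp1252
# codec refuses to decode because the mapping table leaves them undefined.
undefined = []
for value in range(0x80, 0xA0):
    try:
        bytes([value]).decode("cp1252")
    except UnicodeDecodeError:
        undefined.append(value)

print([hex(v) for v in undefined])
```

Any byte listed here is one that a strict decoder rejects but that a "best fit" or WHATWG-style decoder would map to the C1 control character with the same value.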


Cheers,
Steve
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/


Re: [Python-ideas] Chaining coders

2018-01-19 Thread Rob Speer
I see how this is another way to get what I was asking for: a way to decode
some unfortunately common text encodings, ones that Web browsers use, in
Python without having to import additional modules.

I appreciate other ideas about how to solve this problem, but the
generality here seems pretty unnecessary. The world isn't making any
_novel_ legacy encodings. There are 8 legacy encodings that Python has
missed, and there's no reason to expect there to be any more of them.

It's worrisome to support arbitrary compositions of encodings. Most of
these possible hybrid encodings haven't been used before, and using them
would be a bad idea because there would be no reason to expect any other
software in existence to be compatible with them.

Some of these legacy encodings (like the webbish version of windows-1255)
are not the composition of two encodings that already exist in Python. So
you'd have to define new encodings anyway.

On Fri, 19 Jan 2018 at 17:09 Soni L.  wrote:

> windows-1252 is based on iso-8859-1. Thus, I'd like to be able to chain
> coders as follows:
>
> bytes.decode("windows-1252-ext", else=lambda r: r.decode("iso-8859-1"))
>
> What this "else" does is that it's a lambda, and it gets passed an
> object with a decode method identical to the bytes decode method, except
> that it doesn't affect already-decoded characters. In this case,
> "windows-1252-ext" only includes things in the \x80-\x9F range, leaving
> it up to "iso-8859-1" to handle the rest.
>
> A similar process would happen for encoding: encode with
> "windows-1252-ext", else = "iso-8859-1".
>
> (Technically, "windows-1252-ext" isn't needed - you can use the existing
> "windows-1252" and combine it with the "iso-8859-1" to get
> "windows-1252-c1".)
>
> This would be a novel way to think of encodings as not just flat
> translation tables but highly composable translation tables. I have a
> thing for composition.


Re: [Python-ideas] Official site-packages/test directory

2018-01-19 Thread Chris Barker
On Fri, Jan 19, 2018 at 11:21 AM, Giampaolo Rodola' 
wrote:
>
> I personally include them in psutil distribution so that users can test
> the installation with "python -m psutil.test". I have even this documented
> as I think it's an added value.
>

or:

pytest --pyargs pkg_name

It is really handy, and sometimes required, to test the distribution /
installation itself.

So I do that most of the time these days -- but it gets ugly if the tests
get really huge.

-CHB





>
>
>
> --
> Giampaolo - http://grodola.blogspot.com
>
>


-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov


[Python-ideas] Chaining coders

2018-01-19 Thread Soni L.
windows-1252 is based on iso-8859-1. Thus, I'd like to be able to chain 
coders as follows:


bytes.decode("windows-1252-ext", else=lambda r: r.decode("iso-8859-1"))

What this "else" does is that it's a lambda, and it gets passed an 
object with a decode method identical to the bytes decode method, except 
that it doesn't affect already-decoded characters. In this case, 
"windows-1252-ext" only includes things in the \x80-\x9F range, leaving 
it up to "iso-8859-1" to handle the rest.


A similar process would happen for encoding: encode with 
"windows-1252-ext", else = "iso-8859-1".


(Technically, "windows-1252-ext" isn't needed - you can use the existing 
"windows-1252" and combine it with the "iso-8859-1" to get 
"windows-1252-c1".)


This would be a novel way to think of encodings as not just flat 
translation tables but highly composable translation tables. I have a 
thing for composition.
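
Since `else` is a reserved word, the call above cannot be written as-is in current Python. A rough sketch of the same idea for single-byte encodings, using a hypothetical helper named decode_with_fallback (not an existing or proposed API) together with the existing "cp1252" and "latin-1" codecs:

```python
def decode_with_fallback(data: bytes, primary: str, fallback: str) -> str:
    """Decode one byte at a time with `primary`; bytes it rejects go to `fallback`.

    Only meaningful for single-byte encodings, where every byte stands alone.
    """
    chars = []
    for i in range(len(data)):
        byte = data[i:i + 1]
        try:
            chars.append(byte.decode(primary))
        except UnicodeDecodeError:
            chars.append(byte.decode(fallback))
    return "".join(chars)


# 0x80 is the euro sign in cp1252, 0x81 is unassigned there, 0xE9 is e-acute.
result = decode_with_fallback(b"\x80\x81\xe9", "cp1252", "latin-1")
```

Here 0x81 falls through to latin-1 and comes out as U+0081, which is the behaviour the "windows-1252-c1" name above is meant to describe.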



Re: [Python-ideas] Repurpose `assert' into a general-purpose check

2018-01-19 Thread Sylvain MARIE
> I haven't yet seen any justification for syntax here. The nearest I've seen 
> is that this "ensure" action is more like:
>
> try:
>     cond = x >= 0
> except BaseException:
>     raise AssertionError("x must be positive")
> else:
>     if not cond:
>         raise AssertionError("x must be positive")
>
> Which, IMO, is a bad idea, and I'm not sure anyone was actually advocating it 
> anyway.
> 
> ChrisA

Indeed, I was the one advocating for it :) 

Based on all the feedback I received from this discussion, I realized that my 
implementation was fundamentally flawed: I had written the class and function 
decorators first and wanted to apply the same pattern to the inline validator, 
which led to this assert_valid with overkill delayed evaluation, and to my 
claim that the only way out would be a new Python language element.

I tried my best to update valid8 and reached a new stable point with version 
3.0.0, providing 2 main utilities for inline validation:
 - the simple but not so powerful `quick_valid` function 
 - the more verbose (2 lines) but much more generic `wrap_valid` context 
manager (that's the best I can do today !)

The more capable but delayed-evaluation based `assert_valid` is no longer 
recommended, except as a tool to replicate what is done in the function and 
class validation decorators. Like the decorators, it adds the ability to blend two 
styles of base functions (boolean testers and failure raisers) with boolean 
operators seamlessly. But the complexity is not worth it for inline validation 
(it seems to be worth it for decorators).

See https://smarie.github.io/python-valid8 for the new updated documentation. I 
also updated the problem description page at 
https://smarie.github.io/python-valid8/why_validation/ so as to keep a 
reference of the problem description and "wishlist" (whether it is implemented 
by this library or by new language elements in the future). Do not hesitate to 
contribute or send me your edits (off-list).

I would love to get feedback from anyone concerning this library, whether you 
consider it's useless or "interesting but...". We should probably take this 
offline though, so as not to pollute the initial thread.

Thanks again, and a great weekend to all (end of the day here in France ;) )
Kind regards

Sylvain 



Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-19 Thread M.-A. Lemburg
Rob:

I think I was very clear very early in the thread that I'm
opposed to adding a complete set of new encodings to the stdlib
which only slightly alter many existing ones.

Ever since I've been trying to give you suggestions on how
we can solve the issue you're trying to address with the
encodings in different ways which achieve much of the same
but with the existing code base.

I've also tried to understand the issue with WideCharToMultiByte()
et al. apparently using different encodings than the ones which
MS itself published to the Unicode Consortium, to see whether
there's an issue we may need to resolve. That's a different
topic, which is why I changed the subject line.

If you call that derailing, I cannot help it, but won't
engage any further in this discussion.

Thanks,
-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Experts (#1, Jan 19 2018)
>>> Python Projects, Coaching and Consulting ...  http://www.egenix.com/
>>> Python Database Interfaces ...   http://products.egenix.com/
>>> Plone/Zope Database Interfaces ...   http://zope.egenix.com/


::: We implement business ideas - efficiently in both time and costs :::

   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
   Registered at Amtsgericht Duesseldorf: HRB 46611
   http://www.egenix.com/company/contact/
  http://www.malemburg.com/




On 19.01.2018 19:35, Rob Speer wrote:
>> It depends on what you want to achieve. You may want to fail, assign a
> code point from a private area or use a surrogate escape approach.
> 
> And the way to express that is with errors='replace',
> errors='surrogateescape', or whatever, which Python already does. We do
> not need an explosion of error handlers. This problem can be very
> straightforwardly solved with encodings, and error handlers can keep
> doing their usual job on top of encodings.
> 
>> You could also add a "latin1replace" error handler which simply passes
> through everything that's undefined as-is.
> 
> Nobody asked for this.
> 
>> I just don't want to have people start using "web-1252" as encoding
> simply because they are writing out text for a web application -
> they should use "utf-8" instead.
> 
> I did ask for input on the name. If the problem is that you think my
> working name for the encoding is misleading, you could help with that
> instead of constantly trying to replace the proposal with something
> different.
> 
> Guido had some very sensible feedback just a moment ago. I am wondering
> now if we lost Guido because I broke python-ideas etiquette (is a pull
> request not the next step, for example? I never got a good answer on the
> process), or because this thread is just constantly being derailed.
> 
> 
> 
> On Fri, 19 Jan 2018 at 13:14 M.-A. Lemburg wrote:
> 
> On 19.01.2018 18:12, Rob Speer wrote:
> > Error handlers are quite orthogonal to this problem. If you try to
> solve
> > this problem with an error handler, you will have a different problem.
> >
> > Suppose you made "c1-control-passthrough" or whatever into an error
> > handler, similar to "replace" or "ignore", and then you encounter an
> > unassigned character that's *not* in the range 0x80 to 0x9f. (Many
> > encodings have these.) Do you replace it? Do you ignore it? You don't
> > know because you just replaced the error handler with something that's
> > not about error handling.
> 
> It depends on what you want to achieve. You may want to fail,
> assign a code point from a private area or use a surrogate
> escape approach. Based on the context it may also make sense
> to escape the input data using a different syntax, e.g.
> XML escapes, backslash notations, HTML numeric entities, etc.
> 
> You could also add a "latin1replace" error handler which
> simply passes through everything that's undefined as-is.
> 
> The Unicode error handlers are pretty flexible when it comes
> to providing a solution:
> 
> https://www.python.org/dev/peps/pep-0293/
> 
> You can even have the handler "patch" an encoding, since
> it also gets the encoding name as input.
> 
> You could probably create an error handler which implements
> most of their workarounds into a single "whatwg" handler.
> 
> > I will also repeat that having these encodings (in both
> directions) will
> > provide more ways for Python to *reduce* the amount of mojibake that
> > exists. If acknowledging that mojibake exists offends your sense of
> > purity, and you'd rather just destroy all mojibake at the source...
> > that's great, and please get back to me after you've fixed
> Microsoft Excel.
> 
> I acknowledge that we have different views on 

Re: [Python-ideas] Official site-packages/test directory

2018-01-19 Thread Giampaolo Rodola'
On Fri, Jan 19, 2018 at 5:23 PM, Paul Moore  wrote:

> Another common approach is to not ship tests as part of your (runtime)
> package at all - they are in the sdist but not the wheels nor are they
> deployed with "setup.py install". In my experience, this is the usual
> approach projects take if they don't have the tests in the package
> directory. (I don't think I've *ever* seen a project try to install
> tests except by including them in the package directory...)


I personally include them in psutil distribution so that users can test the
installation with "python -m psutil.test". I have even this documented as I
think it's an added value.
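
As a rough illustration of how a "python -m <package>.test" entry point can work in general, here is a minimal sketch of what a hypothetical mypkg/test/__main__.py might contain. The package name, the SmokeTest class and the run_tests helper are all assumptions made for the example; this is not psutil's actual code:

```python
import unittest


class SmokeTest(unittest.TestCase):
    """Stand-in for the real test cases shipped with the package."""

    def test_addition(self):
        self.assertEqual(1 + 1, 2)


def run_tests() -> bool:
    """Run the bundled tests and report whether they all passed."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(SmokeTest)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()


if __name__ == "__main__":
    # A real entry point would likely also set the process exit code.
    print("OK" if run_tests() else "FAILED")
```

Because the module lives inside the installed package, running it exercises the installed copy rather than a source checkout.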


-- 
Giampaolo - http://grodola.blogspot.com


Re: [Python-ideas] Official site-packages/test directory

2018-01-19 Thread Chris Barker
hmm,

I've struggled for ages with this problem -- I have some packages with
REALLY big test suites, so I don't put the tests in the package.

But there are also numerous issues with building and installing the package
(C code, lots of dependencies, etc), so it would be really nice to have a
way to test the actual installed package after the fact (and be able to
properly test conda packages as well -- easy if the tests are in the
package, hard if not...)

So I like the idea of having a standard way / place to install tests.

However, somehow I never thought to make a my_package_tests package --
d'uh! seems the obvious way to handle the "optionally install the tests"
problem.

I still like the idea of separate location, but Paul is right that it's a
change that would have to filter out through a lot of infrastructure, so
maybe not practical.

So maybe the way to go is to come up with recommendations for a standard
way to do it -- maybe published by PyPa?

-CHB




On Fri, Jan 19, 2018 at 10:19 AM, Stefan Krah  wrote:

> On Fri, Jan 19, 2018 at 05:30:43PM +, Paul Moore wrote:
> [cut]
> > I'd think that the idea of a site-packages/stest directory would need
> > a much more compelling use case to justify it.
>
> Thanks for the detailed explanation!  It sounds that there's much more work
> involved than I thought, so it's probably better to drop this proposal.
>
>
> > PS There's nothing stopping a (distribution) package FOO from
> > installing (Python) packages foo and foo-tests. It's not common, and
> > probably violates people's expectations, but it's not *illegal* (the
> > setuptools distribution installs pkg_resources as well as setuptools,
> > for a well-known example). So in theory, if people wanted this enough,
> > they could have implemented it right now, without needing any change
> > to Python or the packaging ecosystem.
>
> If people don't come with pitchforks, that's a good solution. I suspected
> that people would complain both if foo-tests were installed automatically
> like pkg_resources but also if foo-tests were a separate optional package
> (too much hassle).
>
>
>
> Stefan Krah
>
>
>
>





Re: [Python-ideas] Official site-packages/test directory

2018-01-19 Thread Paul Moore
On 19 January 2018 at 18:19, Stefan Krah  wrote:
> On Fri, Jan 19, 2018 at 05:30:43PM +, Paul Moore wrote:
> [cut]
>> I'd think that the idea of a site-packages/stest directory would need
>> a much more compelling use case to justify it.
>
> Thanks for the detailed explanation!  It sounds that there's much more work
> involved than I thought, so it's probably better to drop this proposal.
>
>
>> PS There's nothing stopping a (distribution) package FOO from
>> installing (Python) packages foo and foo-tests. It's not common, and
>> probably violates people's expectations, but it's not *illegal* (the
>> setuptools distribution installs pkg_resources as well as setuptools,
>> for a well-known example). So in theory, if people wanted this enough,
>> they could have implemented it right now, without needing any change
>> to Python or the packaging ecosystem.
>
> If people don't come with pitchforks, that's a good solution. I suspected
> that people would complain both if foo-tests were installed automatically
> like pkg_resources but also if foo-tests were a separate optional package
> (too much hassle).

Personally, I prefer packages that don't install their tests (I'm just
about willing to tolerate the tests-inside-the package-approach) so I
actually dislike this option myself - I was just saying it's possible.

Paul


Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-19 Thread Rob Speer
> It depends on what you want to achieve. You may want to fail, assign a
code point from a private area or use a surrogate escape approach.

And the way to express that is with errors='replace',
errors='surrogateescape', or whatever, which Python already does. We do not
need an explosion of error handlers. This problem can be very
straightforwardly solved with encodings, and error handlers can keep doing
their usual job on top of encodings.
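
To make the existing behaviour concrete, here is a small sketch of how the standard error handlers already treat an unassigned byte when decoding with the stock "cp1252" codec. The sample bytes and variable names are illustrative only:

```python
# Curly quotes (0x93, 0x94) are assigned in cp1252; 0x81 is not.
raw = b"\x93hi\x94 \x81"

# errors="strict" (the default) refuses the unassigned byte.
try:
    raw.decode("cp1252")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors="replace" substitutes U+FFFD REPLACEMENT CHARACTER.
replaced = raw.decode("cp1252", errors="replace")

# errors="surrogateescape" smuggles the byte through as a lone surrogate.
escaped = raw.decode("cp1252", errors="surrogateescape")
```

The choice of error handler stays independent of the choice of encoding, which is the separation being argued for here.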

> You could also add a "latin1replace" error handler which simply passes
through everything that's undefined as-is.

Nobody asked for this.

> I just don't want to have people start using "web-1252" as encoding
simply because they are writing out text for a web application - they
should use "utf-8" instead.

I did ask for input on the name. If the problem is that you think my
working name for the encoding is misleading, you could help with that
instead of constantly trying to replace the proposal with something
different.

Guido had some very sensible feedback just a moment ago. I am wondering now
if we lost Guido because I broke python-ideas etiquette (is a pull request
not the next step, for example? I never got a good answer on the process),
or because this thread is just constantly being derailed.



On Fri, 19 Jan 2018 at 13:14 M.-A. Lemburg  wrote:

> On 19.01.2018 18:12, Rob Speer wrote:
> > Error handlers are quite orthogonal to this problem. If you try to solve
> > this problem with an error handler, you will have a different problem.
> >
> > Suppose you made "c1-control-passthrough" or whatever into an error
> > handler, similar to "replace" or "ignore", and then you encounter an
> > unassigned character that's *not* in the range 0x80 to 0x9f. (Many
> > encodings have these.) Do you replace it? Do you ignore it? You don't
> > know because you just replaced the error handler with something that's
> > not about error handling.
>
> It depends on what you want to achieve. You may want to fail,
> assign a code point from a private area or use a surrogate
> escape approach. Based on the context it may also make sense
> to escape the input data using a different syntax, e.g.
> XML escapes, backslash notations, HTML numeric entities, etc.
>
> You could also add a "latin1replace" error handler which
> simply passes through everything that's undefined as-is.
>
> The Unicode error handlers are pretty flexible when it comes
> to providing a solution:
>
> https://www.python.org/dev/peps/pep-0293/
>
> You can even have the handler "patch" an encoding, since
> it also gets the encoding name as input.
>
> You could probably create an error handler which implements
> most of their workarounds into a single "whatwg" handler.
>
> > I will also repeat that having these encodings (in both directions) will
> > provide more ways for Python to *reduce* the amount of mojibake that
> > exists. If acknowledging that mojibake exists offends your sense of
> > purity, and you'd rather just destroy all mojibake at the source...
> > that's great, and please get back to me after you've fixed Microsoft
> Excel.
>
> I acknowledge that we have different views on this :-)
>
> Note that I'm not saying that the encodings are a bad idea,
> or should not be used.
>
> I just don't want to have people start using "web-1252" as
> encoding simply because they are writing out text for
> a web application - they should use "utf-8" instead.
>
> The extra hurdle to pip-install a package for this feels
> like the right way to turn this into a more conscious
> decision and who knows... perhaps it'll even help fix Excel
> once they have decided on including Python as scripting
> language:
>
>
> https://excel.uservoice.com/forums/304921-excel-for-windows-desktop-application/suggestions/10549005-python-as-an-excel-scripting-language
>
> > I hope to make a pull request shortly that implements these mappings as
> > new encodings that work just like the other ones.
> >
> > On Fri, 19 Jan 2018 at 11:54 M.-A. Lemburg wrote:
> >
> > On 19.01.2018 17:20, Guido van Rossum wrote:
> > > On Fri, Jan 19, 2018 at 5:30 AM, M.-A. Lemburg wrote:
> > >
> > > On 19.01.2018 05:38, Nathaniel Smith wrote:
> > > > On Thu, Jan 18, 2018 at 7:51 PM, Guido van Rossum wrote:
> > > >> Can someone explain to me why this is such a controversial
> > issue?
> > > >
> > > > I guess practicality versus purity is always controversial
> :-)
> > > >
> > > >> It seems reasonable to me to add new encodings to the
> > stdlib that do the
> > > >> roundtripping requested in the first message of the thread.
> > As long as they
> > > >> have new names that seems to 

Re: [Python-ideas] Official site-packages/test directory

2018-01-19 Thread Stefan Krah
On Fri, Jan 19, 2018 at 05:30:43PM +, Paul Moore wrote:
[cut]
> I'd think that the idea of a site-packages/stest directory would need
> a much more compelling use case to justify it.

Thanks for the detailed explanation!  It sounds that there's much more work
involved than I thought, so it's probably better to drop this proposal.


> PS There's nothing stopping a (distribution) package FOO from
> installing (Python) packages foo and foo-tests. It's not common, and
> probably violates people's expectations, but it's not *illegal* (the
> setuptools distribution installs pkg_resources as well as setuptools,
> for a well-known example). So in theory, if people wanted this enough,
> they could have implemented it right now, without needing any change
> to Python or the packaging ecosystem.

If people don't come with pitchforks, that's a good solution. I suspected
that people would complain both if foo-tests were installed automatically
like pkg_resources but also if foo-tests were a separate optional package
(too much hassle).



Stefan Krah






Re: [Python-ideas] Windows Best Fit Encodings

2018-01-19 Thread M.-A. Lemburg
Hi Steve,

do you know of a definite resource for Windows code pages
on MSDN or another official MS website ?

I tried to find some links, but only got these ancient
ones:

https://msdn.microsoft.com/en-us/library/cc195054.aspx

(this version of cp1252 doesn't even have the euro sign yet)

Thanks,
-- 
Marc-Andre Lemburg
eGenix.com




On 19.01.2018 18:17, M.-A. Lemburg wrote:
> On 19.01.2018 17:24, Random832 wrote:
>> On Fri, Jan 19, 2018, at 08:30, M.-A. Lemburg wrote:
>>>> Someone did discover that Microsoft's current implementations of the
>>>> windows-* encodings match the WHAT-WG spec, rather than the Unicode
>>>> spec that Microsoft originally wrote.
>>>
>>> No, MS implements something called "best fit encodings"
>>> and these are different than what WHATWG uses.
>>
>> NO. I made this absolutely clear in my previous message, best fit mappings 
>> can be clearly distinguished from regular mappings by the behavior of the 
>> native conversion functions with certain argument flags (the mapping of 0xA0 
>> to some private use character in cp932, for example, is a best-fit mapping 
>> in the decoding direction - but is treated as a regular mapping for encoding 
>> purposes), and the mapping of 0x81 to U+0081 in cp1252 etc is NOT a best fit 
>> mapping or in any way different from the rest of the mappings.
>>
>> We are not talking about implementing the best fit mappings. We are talking 
>> about real regular mappings that actually exist in these codepages that were 
>> for some unknown reason not included in the files published by Unicode.
> 
> I only know the best fit encoding maps that are available
> on the Unicode site.
> 
> If I read your comment correctly, you are saying that MS has
> moved away from the standard code pages towards something
> else - perhaps even something other than the best fit encodings
> listed on the Unicode site ?
> 
> Do you have some references for this ?
> 
> Note that the Windows code page codecs implemented in Python
> are all based on the Unicode mapping files and those were
> created by MS.
> 
>>> https://msdn.microsoft.com/en-us/library/windows/desktop/dd374130%28v=vs.85%29.aspx
>>>
>>> unfortunately uses the above mentioned best fit encodings,
>>> but this can and should be switched off by specifying the
>>> WC_NO_BEST_FIT_CHARS for anything that requires validation
>>> or needs to be interoperable:
>>
>> Specifying this flag (and MB_ERR_INVALID_CHARS in the other direction) in 
>> fact does not disable the mappings we are discussing.
> 
> Interesting. The CP1252 mapping clearly defines 0x81 to map
> to undefined, whereas the bestfit1252 maps it to 0x0081:
> 
> http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
> http://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt
> 
> Same for the example you gave for CP932:
> 
> http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT
> http://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit932.txt
> 
> So at least following the documentation you'd expect the function
> to implement the regular mappings.
> 



Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-19 Thread M.-A. Lemburg
On 19.01.2018 18:12, Rob Speer wrote:
> Error handlers are quite orthogonal to this problem. If you try to solve
> this problem with an error handler, you will have a different problem.
> 
> Suppose you made "c1-control-passthrough" or whatever into an error
> handler, similar to "replace" or "ignore", and then you encounter an
> unassigned character that's *not* in the range 0x80 to 0x9f. (Many
> encodings have these.) Do you replace it? Do you ignore it? You don't
> know because you just replaced the error handler with something that's
> not about error handling.

It depends on what you want to achieve. You may want to fail,
assign a code point from a private area or use a surrogate
escape approach. Based on the context it may also make sense
to escape the input data using a different syntax, e.g.
XML escapes, backslash notations, HTML numeric entities, etc.

You could also add a "latin1replace" error handler which
simply passes through everything that's undefined as-is.

The Unicode error handlers are pretty flexible when it comes
to providing a solution:

https://www.python.org/dev/peps/pep-0293/

You can even have the handler "patch" an encoding, since
it also gets the encoding name as input.

You could probably create an error handler which implements
most of their workarounds into a single "whatwg" handler.
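
As a minimal sketch of what such a handler could look like using the PEP 293 machinery (codecs.register_error): the handler name "latin1replace" is taken from the suggestion above, while the function name and its decode-only behaviour are assumptions made for this example. Nothing like it ships with Python:

```python
import codecs


def latin1_passthrough(exc):
    """Decode-only handler: map each undecodable byte to the code point
    with the same numeric value, as the latin-1 codec would."""
    if not isinstance(exc, UnicodeDecodeError):
        # This sketch does not handle encoding or translating errors.
        raise exc
    bad = exc.object[exc.start:exc.end]
    # Return the replacement text and the position at which to resume.
    return bad.decode("latin-1"), exc.end


# Hypothetical handler name; registration is global for the interpreter.
codecs.register_error("latin1replace", latin1_passthrough)

text = b"caf\xe9 \x81 \x80".decode("cp1252", errors="latin1replace")
```

Note that passing errors="latin1replace" occupies the single errors slot, so it cannot be combined with "replace" or "surrogateescape" in the same call, which is the trade-off raised in the quoted objection.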

> I will also repeat that having these encodings (in both directions) will
> provide more ways for Python to *reduce* the amount of mojibake that
> exists. If acknowledging that mojibake exists offends your sense of
> purity, and you'd rather just destroy all mojibake at the source...
> that's great, and please get back to me after you've fixed Microsoft Excel.

I acknowledge that we have different views on this :-)

Note that I'm not saying that the encodings are a bad idea,
or should not be used.

I just don't want to have people start using "web-1252" as
encoding simply because they are writing out text for
a web application - they should use "utf-8" instead.

The extra hurdle to pip-install a package for this feels
like the right way to turn this into a more conscious
decision and who knows... perhaps it'll even help fix Excel
once they have decided on including Python as scripting
language:

https://excel.uservoice.com/forums/304921-excel-for-windows-desktop-application/suggestions/10549005-python-as-an-excel-scripting-language

> I hope to make a pull request shortly that implements these mappings as
> new encodings that work just like the other ones.
> 
> On Fri, 19 Jan 2018 at 11:54 M.-A. Lemburg wrote:
> 
> On 19.01.2018 17:20, Guido van Rossum wrote:
> > On Fri, Jan 19, 2018 at 5:30 AM, M.-A. Lemburg wrote:
> >
> >     On 19.01.2018 05:38, Nathaniel Smith wrote:
> >     > On Thu, Jan 18, 2018 at 7:51 PM, Guido van Rossum wrote:
> >     >> Can someone explain to me why this is such a controversial
> issue?
> >     >
> >     > I guess practicality versus purity is always controversial :-)
> >     >
> >     >> It seems reasonable to me to add new encodings to the
> stdlib that do the
> >     >> roundtripping requested in the first message of the thread.
> As long as they
> >     >> have new names that seems to fall under "practicality beats
> purity".
> >
> >     There are a few issues here:
> >
> >     * WHATWG encodings are mostly for decoding content in order to
> >       show it in the browser, accepting broken encoding data.
> >
> >
> > And sometimes Python apps that pull data from the web.
> >  
> >
> >       Python already has support for this by using one of the
> available
> >       error handlers, or adding new ones to suit the needs.
> >
> >
> > This seems cumbersome though.
>   
> Why is that ?
> 
> Python 3 uses such error handlers for most of the I/O that's done
> with the OS already and for very similar reasons: dealing with
> broken data or broken configurations.
> 
> >       If we'd add the encodings, people will start creating more
> >       broken data, since this is what the WHATWG codecs output
> >       when encoding Unicode.
> >
> >
> > That's FUD. Only apps that specifically use the new WHATWG encodings
> > would be able to consume that data. And surely the practice of web
> > browsers will have a much bigger effect than Python's choice.
>   
> It's not FUD. I don't think we ought to encourage having
> Python create more broken data. The purpose of the WHATWG
> encodings is to help browsers deal with decoding broken
> data in a uniform way. It's not to generate more such data.
> 
> That may be 

Re: [Python-ideas] Official site-packages/test directory

2018-01-19 Thread Petr Viktorin
FWIW, I've had very good experience with putting tests for package `foo` 
in a directory/package called `test_foo`.


This combines the best of both worlds -- it can be easily separated for 
distribution (like `tests`), and it doesn't cause name conflicts (like 
`foo.tests`).



On 01/19/2018 05:23 PM, Paul Moore wrote:

Another common approach is to not ship tests as part of your (runtime)
package at all - they are in the sdist but not the wheels nor are they
deployed with "setup.py install". In my experience, this is the usual
approach projects take if they don't have the tests in the package
directory. (I don't think I've *ever* seen a project try to install
tests except by including them in the package directory...)

Paul

On 19 January 2018 at 16:10, Guido van Rossum  wrote:

IIUC another common layout is to have folders named test or tests inside
each package. This would avoid requiring any changes to the site-packages
layout.

On Fri, Jan 19, 2018 at 6:27 AM, Stefan Krah  wrote:



Hello,

I wonder if we could get an official site-packages/test directory.  Currently
it seems to be problematic to distribute tests if they are outside the package
directory.  Here is a nice overview of the two main layout possibilities:


http://pytest.readthedocs.io/en/reorganize-docs/new-docs/user/directory_structure.html


I like the outside-the-package approach, mostly for reasons described very
eloquently here:


http://python-notes.curiousefficiency.org/en/latest/python_concepts/import_traps.html


CPython itself of course also uses Lib/foo.py and Lib/test/test_foo.py, so it
would make sense to have site-packages/foo.py and
site-packages/test/test_foo.py.

For me, this is the natural layout.

___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/





Re: [Python-ideas] Official site-packages/test directory

2018-01-19 Thread Paul Moore
On 19 January 2018 at 17:08, Stefan Krah  wrote:
> On Fri, Jan 19, 2018 at 04:23:23PM +, Paul Moore wrote:
>> Another common approach is to not ship tests as part of your (runtime)
>> package at all - they are in the sdist but not the wheels nor are they
>> deployed with "setup.py install". In my experience, this is the usual
>> approach projects take if they don't have the tests in the package
>> directory. (I don't think I've *ever* seen a project try to install
>> tests except by including them in the package directory...)
>
> Yes, given the current situation not shipping is definitely the best
> approach in that case.
>
> I just thought that if we did have something like site-packages/stest
> (Guido correctly noted that "test" wouldn't work), people might use it.
>
>
> But it is all very speculative and I'm not really sure myself.

To be usable, tools like pip, wheel, setuptools, flit, etc, would all
need to be updated to take into account this option, as well as the
relevant standards (the wheel spec for one). Add to that the changes
needed to places like the sysconfig package to allow introspecting the
location of the new test directory. Would there be a test directory in
user-site as well? What about in virtual environments? (If only in
site-packages, then it'll likely be read-only in a lot of
environments). Also, would we need to reserve the directory name
chosen to prohibit 3rd party packages using it? As we've seen the
stdlib test package clashes with the original proposal, who's to say
there's nothing on PyPI that uses stest?
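
For reference, the install locations Python knows about today can be listed
with the stdlib sysconfig module. This is only a minimal sketch: the key
"tests" is a hypothetical name chosen to illustrate the missing location, not
something any tool defines.

```python
import sysconfig

# Names of the install locations defined by the active install scheme.
path_names = sysconfig.get_path_names()

# "tests" is a hypothetical key, used only to show that no such location exists.
has_test_location = "tests" in path_names

for name in path_names:
    print(f"{name:12} {sysconfig.get_path(name)}")
print("dedicated test location:", has_test_location)
```

A new location would need a new entry in every scheme listed by
sysconfig.get_scheme_names(), which is part of why the change is not small.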

The idea isn't a bad one in principle - there's a proposal from some
time back on distutils-sig that Python packaging support more "target
locations" matching the POSIX style locations - for docs, config, etc.
A test directory would fit in with this idea. But it's a pretty big
change in practice, and no-one has yet done much beyond talk about it.
And the proposal would likely have put the test directory *outside*
site-packages, which avoids the name clash problem.

I'd think that the idea of a site-packages/stest directory would need
a much more compelling use case to justify it.

Paul

PS There's nothing stopping a (distribution) package FOO from
installing (Python) packages foo and foo-tests. It's not common, and
probably violates people's expectations, but it's not *illegal* (the
setuptools distribution installs pkg_resources as well as setuptools,
for a well-known example). So in theory, if people wanted this enough,
they could have implemented it right now, without needing any change
to Python or the packaging ecosystem.


Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-19 Thread Guido van Rossum
OK, I will tune out this conversation. It is clearly not going anywhere.

On Fri, Jan 19, 2018 at 9:12 AM, Rob Speer  wrote:

> Error handlers are quite orthogonal to this problem. If you try to solve
> this problem with an error handler, you will have a different problem.
>
> Suppose you made "c1-control-passthrough" or whatever into an error
> handler, similar to "replace" or "ignore", and then you encounter an
> unassigned character that's *not* in the range 0x80 to 0x9f. (Many
> encodings have these.) Do you replace it? Do you ignore it? You don't know
> because you just replaced the error handler with something that's not about
> error handling.
>
> I will also repeat that having these encodings (in both directions) will
> provide more ways for Python to *reduce* the amount of mojibake that
> exists. If acknowledging that mojibake exists offends your sense of purity,
> and you'd rather just destroy all mojibake at the source... that's great,
> and please get back to me after you've fixed Microsoft Excel.
>
> I hope to make a pull request shortly that implements these mappings as
> new encodings that work just like the other ones.
>
> On Fri, 19 Jan 2018 at 11:54 M.-A. Lemburg  wrote:
>
>> On 19.01.2018 17:20, Guido van Rossum wrote:
>> > On Fri, Jan 19, 2018 at 5:30 AM, M.-A. Lemburg > > > wrote:
>> >
>> > On 19.01.2018 05:38, Nathaniel Smith wrote:
>> > > On Thu, Jan 18, 2018 at 7:51 PM, Guido van Rossum <
>> gu...@python.org > wrote:
>> > >> Can someone explain to me why this is such a controversial issue?
>> > >
>> > > I guess practicality versus purity is always controversial :-)
>> > >
>> > >> It seems reasonable to me to add new encodings to the stdlib that do the
>> > >> roundtripping requested in the first message of the thread. As long as they
>> > >> have new names that seems to fall under "practicality beats purity".
>> >
>> > There are a few issues here:
>> >
>> > * WHATWG encodings are mostly for decoding content in order to
>> >   show it in the browser, accepting broken encoding data.
>> >
>> >
>> > And sometimes Python apps that pull data from the web.
>> >
>> >
>> >   Python already has support for this by using one of the available
>> >   error handlers, or adding new ones to suit the needs.
>> >
>> >
>> > This seems cumbersome though.
>>
>> Why is that ?
>>
>> Python 3 uses such error handlers for most of the I/O that's done
>> with the OS already and for very similar reasons: dealing with
>> broken data or broken configurations.
>>
>> >   If we'd add the encodings, people will start creating more
>> >   broken data, since this is what the WHATWG codecs output
>> >   when encoding Unicode.
>> >
>> >
>> > That's FUD. Only apps that specifically use the new WHATWG encodings
>> > would be able to consume that data. And surely the practice of web
>> > browsers will have a much bigger effect than Python's choice.
>>
>> It's not FUD. I don't think we ought to encourage having
>> Python create more broken data. The purpose of the WHATWG
>> encodings is to help browsers deal with decoding broken
>> data in a uniform way. It's not to generate more such data.
>>
>> That may be regarded as a purist's view, but it also has a very
>> practical meaning. The output of the codecs will only be readable
>> by browsers implementing the WHATWG encodings. Other tools
>> receiving the data will run into the same decoding problems.
>>
>> Once you have Unicode, it's better to stay there and use
>> UTF-8 for encoding to avoid any such issues.
>>
>> >   As discussed, this could be addressed by making the WHATWG
>> >   codecs decode-only.
>> >
>> >
>> > But that would defeat the point of roundtripping, right?
>>
>> Yes, intentionally. Once you have Unicode, the data should
>> be encoded correctly back into UTF-8 or whatever legacy encoding
>> is needed, fixing any issues while in Unicode.
>>
>> As always, it's better to explicitly address such problems than
>> to simply punt on them and write back broken data.
>>
>> > * The use case seems limited to implementing browsers or headless
>> >   implementations working like browsers.
>> >
>> >   That's not really general enough to warrant adding lots of
>> >   new codecs to the stdlib. A PyPI package is better suited
>> >   for this.
>> >
>> >
>> > Perhaps, but such a package already exists and its author (who surely
>> > has read a lot of bug reports from its users) says that this is cumbersome.
>>
>> The only critique I read was that registering the codecs
>> is not explicit enough, but that's really only a nit, since
>> you can easily have the codec package expose a register
>> function which you then call explicitly in the code using
>> the codecs.
>>
>> > * The WHATWG codecs do not only cover simple mapping codecs,
>> >   but also many multi-byte ones 

Re: [Python-ideas] Windows Best Fit Encodings (was: Support WHATWG versions of legacy encodings)

2018-01-19 Thread M.-A. Lemburg
On 19.01.2018 17:24, Random832 wrote:
> On Fri, Jan 19, 2018, at 08:30, M.-A. Lemburg wrote:
>>> Someone did discover that Microsoft's current implementations of the
>>> windows-* encodings match the WHAT-WG spec, rather than the Unicode
>>> spec that Microsoft originally wrote.
>>
>> No, MS implements something called "best fit encodings"
>> and these are different than what WHATWG uses.
> 
> NO. I made this absolutely clear in my previous message, best fit mappings 
> can be clearly distinguished from regular mappings by the behavior of the 
> native conversion functions with certain argument flags (the mapping of 0xA0 
> to some private use character in cp932, for example, is a best-fit mapping in 
> the decoding direction - but is treated as a regular mapping for encoding 
> purposes), and the mapping of 0x81 to U+0081 in cp1252 etc is NOT a best fit 
> mapping or in any way different from the rest of the mappings.
> 
> We are not talking about implementing the best fit mappings. We are talking 
> about real regular mappings that actually exist in these codepages that were 
> for some unknown reason not included in the files published by Unicode.

I only know the best fit encoding maps that are available
on the Unicode site.

If I read your comment correctly, you are saying that MS has
moved away from the standard code pages towards something
else - perhaps even something other than the best fit encodings
listed on the Unicode site ?

Do you have some references for this ?

Note that the Windows code page codecs implemented in Python
are all based on the Unicode mapping files and those were
created by MS.

>> https://msdn.microsoft.com/en-us/library/windows/desktop/dd374130%28v=vs.85%29.aspx
>>
>> unfortunately uses the above mentioned best fit encodings,
>> but this can and should be switched off by specifying the
>> WC_NO_BEST_FIT_CHARS for anything that requires validation
>> or needs to be interoperable:
> 
> Specifying this flag (and MB_ERR_INVALID_CHARS in the other direction) in 
> fact does not disable the mappings we are discussing.

Interesting. The CP1252 mapping clearly defines 0x81 as
undefined, whereas bestfit1252 maps it to U+0081:

http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
http://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt

Same for the example you gave for CP932:

http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT
http://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit932.txt

So at least following the documentation you'd expect the function
to implement the regular mappings.

-- 
Marc-Andre Lemburg
eGenix.com




Re: [Python-ideas] Official site-packages/test directory

2018-01-19 Thread Wolfgang Maier

On 01/19/2018 05:48 PM, Guido van Rossum wrote:
On Fri, Jan 19, 2018 at 8:30 AM, Wolfgang Maier 
> wrote:



I think that's a really nice idea.
With an official site-packages/test directory there could be pip
support for optionally installing tests alongside a package if its
layout allows it. So end users could just install things without
tests, but developers could do: pip install  --with-tests
or something to get everything?


Oh, I just realized there's another problem here. The existing 'test' 
package (which is not a namespace package) would hide the 
site-packages/test directory.




Well, that shouldn't be a big obstacle since one could just as well 
choose another name ( __tests__ for example?).
Alternatively, package-specific test directories could exist *inside* 
site-packages. So much like today's .dist-info directories 
there could be .test dirs?



Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-19 Thread Rob Speer
Error handlers are quite orthogonal to this problem. If you try to solve
this problem with an error handler, you will have a different problem.

Suppose you made "c1-control-passthrough" or whatever into an error
handler, similar to "replace" or "ignore", and then you encounter an
unassigned character that's *not* in the range 0x80 to 0x9f. (Many
encodings have these.) Do you replace it? Do you ignore it? You don't know
because you just replaced the error handler with something that's not about
error handling.
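
To make the objection concrete, here is a minimal sketch of such a handler.
The handler name and its policy are hypothetical; the example only relies on
the stdlib codecs module, on byte 0x81 being undefined in Python's cp1252, and
on byte 0xD2 being undefined in Python's cp1253.

```python
import codecs

def c1_control_passthrough(exc):
    """Hypothetical handler: turn undecodable bytes 0x80-0x9F into U+0080-U+009F."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    if all(0x80 <= byte <= 0x9F for byte in bad):
        return bad.decode("latin-1"), exc.end
    # Outside the C1 range the handler has no policy of its own. Here it
    # re-raises, which behaves like "strict"; "replace" or "ignore" would be
    # equally arbitrary choices baked into the handler.
    raise exc

codecs.register_error("c1-control-passthrough", c1_control_passthrough)

# Byte 0x81 is undefined in cp1252 and lies in the C1 range.
c1_result = b"caf\x81".decode("cp1252", "c1-control-passthrough")

# Byte 0xD2 is undefined in cp1253 but is not a C1 byte.
try:
    b"\xd2".decode("cp1253", "c1-control-passthrough")
    non_c1_outcome = "decoded"
except UnicodeDecodeError:
    non_c1_outcome = "error"

print(repr(c1_result), non_c1_outcome)
```

Since decode() takes a single errors argument, the caller can no longer pick
"replace" or "ignore" for the second case once the handler slot is taken.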

I will also repeat that having these encodings (in both directions) will
provide more ways for Python to *reduce* the amount of mojibake that
exists. If acknowledging that mojibake exists offends your sense of purity,
and you'd rather just destroy all mojibake at the source... that's great,
and please get back to me after you've fixed Microsoft Excel.

I hope to make a pull request shortly that implements these mappings as new
encodings that work just like the other ones.

On Fri, 19 Jan 2018 at 11:54 M.-A. Lemburg  wrote:

> On 19.01.2018 17:20, Guido van Rossum wrote:
> > On Fri, Jan 19, 2018 at 5:30 AM, M.-A. Lemburg  > > wrote:
> >
> > On 19.01.2018 05:38, Nathaniel Smith wrote:
> > > On Thu, Jan 18, 2018 at 7:51 PM, Guido van Rossum <
> gu...@python.org > wrote:
> > >> Can someone explain to me why this is such a controversial issue?
> > >
> > > I guess practicality versus purity is always controversial :-)
> > >
> > >> It seems reasonable to me to add new encodings to the stdlib that do the
> > >> roundtripping requested in the first message of the thread. As long as they
> > >> have new names that seems to fall under "practicality beats purity".
> >
> > There are a few issues here:
> >
> > * WHATWG encodings are mostly for decoding content in order to
> >   show it in the browser, accepting broken encoding data.
> >
> >
> > And sometimes Python apps that pull data from the web.
> >
> >
> >   Python already has support for this by using one of the available
> >   error handlers, or adding new ones to suit the needs.
> >
> >
> > This seems cumbersome though.
>
> Why is that ?
>
> Python 3 uses such error handlers for most of the I/O that's done
> with the OS already and for very similar reasons: dealing with
> broken data or broken configurations.
>
> >   If we'd add the encodings, people will start creating more
> >   broken data, since this is what the WHATWG codecs output
> >   when encoding Unicode.
> >
> >
> > That's FUD. Only apps that specifically use the new WHATWG encodings
> > would be able to consume that data. And surely the practice of web
> > browsers will have a much bigger effect than Python's choice.
>
> It's not FUD. I don't think we ought to encourage having
> Python create more broken data. The purpose of the WHATWG
> encodings is to help browsers deal with decoding broken
> data in a uniform way. It's not to generate more such data.
>
> That may be regarded as a purist's view, but it also has a very
> practical meaning. The output of the codecs will only be readable
> by browsers implementing the WHATWG encodings. Other tools
> receiving the data will run into the same decoding problems.
>
> Once you have Unicode, it's better to stay there and use
> UTF-8 for encoding to avoid any such issues.
>
> >   As discussed, this could be addressed by making the WHATWG
> >   codecs decode-only.
> >
> >
> > But that would defeat the point of roundtripping, right?
>
> Yes, intentionally. Once you have Unicode, the data should
> be encoded correctly back into UTF-8 or whatever legacy encoding
> is needed, fixing any issues while in Unicode.
>
> As always, it's better to explicitly address such problems than
> to simply punt on them and write back broken data.
>
> > * The use case seems limited to implementing browsers or headless
> >   implementations working like browsers.
> >
> >   That's not really general enough to warrant adding lots of
> >   new codecs to the stdlib. A PyPI package is better suited
> >   for this.
> >
> >
> > Perhaps, but such a package already exists and its author (who surely
> > has read a lot of bug reports from its users) says that this is cumbersome.
>
> The only critique I read was that registering the codecs
> is not explicit enough, but that's really only a nit, since
> you can easily have the codec package expose a register
> function which you then call explicitly in the code using
> the codecs.
>
> > * The WHATWG codecs do not only cover simple mapping codecs,
> >   but also many multi-byte ones for e.g. Asian languages.
> >
> >   I doubt that we'd want to maintain such codecs in the stdlib,
> >   since this will increase the download sizes of the installers
> >   and also require people knowledgeable about these variants
> >   to work on them and fix 

Re: [Python-ideas] Official site-packages/test directory

2018-01-19 Thread Stefan Krah
On Fri, Jan 19, 2018 at 04:23:23PM +, Paul Moore wrote:
> Another common approach is to not ship tests as part of your (runtime)
> package at all - they are in the sdist but not the wheels nor are they
> deployed with "setup.py install". In my experience, this is the usual
> approach projects take if they don't have the tests in the package
> directory. (I don't think I've *ever* seen a project try to install
> tests except by including them in the package directory...)

Yes, given the current situation not shipping is definitely the best
approach in that case.

I just thought that if we did have something like site-packages/stest
(Guido correctly noted that "test" wouldn't work), people might use it.


But it is all very speculative and I'm not really sure myself.



Stefan Krah





Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-19 Thread M.-A. Lemburg
On 19.01.2018 17:20, Guido van Rossum wrote:
> On Fri, Jan 19, 2018 at 5:30 AM, M.-A. Lemburg  > wrote:
> 
> On 19.01.2018 05:38, Nathaniel Smith wrote:
> > On Thu, Jan 18, 2018 at 7:51 PM, Guido van Rossum  > wrote:
> >> Can someone explain to me why this is such a controversial issue?
> >
> > I guess practicality versus purity is always controversial :-)
> >
> >> It seems reasonable to me to add new encodings to the stdlib that do the
> >> roundtripping requested in the first message of the thread. As long as they
> >> have new names that seems to fall under "practicality beats purity".
> 
> There are a few issues here:
> 
> * WHATWG encodings are mostly for decoding content in order to
>   show it in the browser, accepting broken encoding data.
> 
> 
> And sometimes Python apps that pull data from the web.
>  
> 
>   Python already has support for this by using one of the available
>   error handlers, or adding new ones to suit the needs.
> 
> 
> This seems cumbersome though.
  
Why is that ?

Python 3 uses such error handlers for most of the I/O that's done
with the OS already and for very similar reasons: dealing with
broken data or broken configurations.

>   If we'd add the encodings, people will start creating more
>   broken data, since this is what the WHATWG codecs output
>   when encoding Unicode.
> 
> 
> That's FUD. Only apps that specifically use the new WHATWG encodings
> would be able to consume that data. And surely the practice of web
> browsers will have a much bigger effect than Python's choice.
  
It's not FUD. I don't think we ought to encourage having
Python create more broken data. The purpose of the WHATWG
encodings is to help browsers deal with decoding broken
data in a uniform way. It's not to generate more such data.

That may be regarded as a purist's view, but it also has a very
practical meaning. The output of the codecs will only be readable
by browsers implementing the WHATWG encodings. Other tools
receiving the data will run into the same decoding problems.

Once you have Unicode, it's better to stay there and use
UTF-8 for encoding to avoid any such issues.

>   As discussed, this could be addressed by making the WHATWG
>   codecs decode-only.
> 
> 
> But that would defeat the point of roundtripping, right?

Yes, intentionally. Once you have Unicode, the data should
be encoded correctly back into UTF-8 or whatever legacy encoding
is needed, fixing any issues while in Unicode.

As always, it's better to explicitly address such problems than
to simply punt on them and write back broken data.
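
A decode-only codec of this kind could look roughly like the sketch below.
The codec name "demo-decode-only-1252" and the table (Python's cp1252 with
its undefined bytes passed through as C1 controls) are assumptions made for
illustration, not an actual WHATWG implementation.

```python
import codecs

def _build_table():
    # Assumption: start from Python's cp1252 and pass undefined bytes through.
    chars = []
    for byte in range(256):
        try:
            chars.append(bytes([byte]).decode("cp1252"))
        except UnicodeDecodeError:
            chars.append(chr(byte))
    return "".join(chars)

_TABLE = _build_table()

def _decode(data, errors="strict"):
    return codecs.charmap_decode(data, errors, _TABLE)

def _refuse_encode(text, errors="strict"):
    # Encoding is refused on purpose so no new legacy data gets written.
    raise NotImplementedError("demo codec is decode-only; encode with UTF-8")

def _search(name):
    # Normalise hyphens ourselves; older Pythons do not do it during lookup.
    if name.replace("-", "_") == "demo_decode_only_1252":
        return codecs.CodecInfo(
            name="demo-decode-only-1252", encode=_refuse_encode, decode=_decode
        )
    return None

codecs.register(_search)

# Decode legacy bytes, then leave the legacy encoding behind.
text = b"\x93quoted\x94 \x81".decode("demo-decode-only-1252")
utf8_bytes = text.encode("utf-8")
print(repr(text), utf8_bytes)
```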

> * The use case seems limited to implementing browsers or headless
>   implementations working like browsers.
> 
>   That's not really general enough to warrant adding lots of
>   new codecs to the stdlib. A PyPI package is better suited
>   for this.
> 
> 
> Perhaps, but such a package already exists and its author (who surely
> has read a lot of bug reports from its users) says that this is cumbersome.
  
The only critique I read was that registering the codecs
is not explicit enough, but that's really only a nit, since
you can easily have the codec package expose a register
function which you then call explicitly in the code using
the codecs.
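
As a rough sketch of what such an explicit register function might look like:
the codec name "demo-web-1252", the helper names and the table construction
below are all hypothetical, and only the stdlib codecs calls are real.

```python
import codecs

def _build_table(base="cp1252"):
    # Assumption: take Python's cp1252 and map its undefined bytes to the
    # C1 control characters with the same numeric value.
    chars = []
    for byte in range(256):
        try:
            chars.append(bytes([byte]).decode(base))
        except UnicodeDecodeError:
            chars.append(chr(byte))
    return "".join(chars)

_DECODING_TABLE = _build_table()
_ENCODING_TABLE = codecs.charmap_build(_DECODING_TABLE)

def _decode(data, errors="strict"):
    return codecs.charmap_decode(data, errors, _DECODING_TABLE)

def _encode(text, errors="strict"):
    return codecs.charmap_encode(text, errors, _ENCODING_TABLE)

def _search(name):
    # Normalise hyphens ourselves; older Pythons do not do it during lookup.
    if name.replace("-", "_") == "demo_web_1252":
        return codecs.CodecInfo(name="demo-web-1252", encode=_encode, decode=_decode)
    return None

def register():
    """Make the demo codec available; nothing happens until this is called."""
    codecs.register(_search)

# The calling code opts in explicitly.
register()
all_bytes = bytes(range(256))
round_tripped = all_bytes.decode("demo-web-1252").encode("demo-web-1252")
print(round_tripped == all_bytes)
```

With this shape, importing the package has no side effects on codec lookup,
and the call to register() documents the dependency at the point of use.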

> * The WHATWG codecs do not only cover simple mapping codecs,
>   but also many multi-byte ones for e.g. Asian languages.
> 
>   I doubt that we'd want to maintain such codecs in the stdlib,
>   since this will increase the download sizes of the installers
>   and also require people knowledgeable about these variants
>   to work on them and fix any issues.
> 
> 
> Really? Why is adding a bunch of codecs so much effort? Surely the
> translation tables contain data that compresses well? And surely we
> don't need a separate dedicated piece of C code for each new codec?
  
For the simple charmap style codecs that's true. Not so for the
Asian ones and the latter also do require dedicated C code (see
Modules/cjkcodecs).

> Overall, I think either pointing people to error handlers
> or perhaps adding a new one specifically for the case of
> dealing with control character mappings would provide a better
> maintenance / usefulness ratio than adding lots of new
> legacy codecs to the stdlib.
> 
> 
> Wouldn't error handlers be much slower? And to me it seems a new error
> handler is a much *bigger* deal than some new encodings -- error
> handlers must work for *all* encodings.
  
Error handlers have a standard interface and so they will work
for all codecs. Some codecs limit the number of handlers that
can be used, but most accept all registered handlers.

If a handler is too slow in Python, it can be coded in C for
speed.
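
For illustration, this is how the handlers that already ship with Python treat
a byte that cp1252 leaves undefined. It is a small sketch using only built-in
handler names, with 0x81 as the sample byte.

```python
raw = b"a\x81b"

# Decode the same bytes with several built-in error handlers.
results = {
    handler: raw.decode("cp1252", errors=handler)
    for handler in ("replace", "ignore", "backslashreplace", "surrogateescape")
}

for handler, text in results.items():
    print(f"{handler:18} {text!r}")
```

None of the built-in handlers yields U+0081, which is the mapping the thread
is asking for.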

> BTW: WHATWG pushes for always using UTF-8 as far as I can tell
> from their website.
> 
> 
> As 

Re: [Python-ideas] Official site-packages/test directory

2018-01-19 Thread Guido van Rossum
On Fri, Jan 19, 2018 at 8:30 AM, Wolfgang Maier <
wolfgang.ma...@biologie.uni-freiburg.de> wrote:

>
> I think that's a really nice idea.
> With an official site-packages/test directory there could be pip support
> for optionally installing tests alongside a package if its layout allows
> it. So end users could just install things without tests, but developers
> could do: pip install  --with-tests or something to get everything?


Oh, I just realized there's another problem here. The existing 'test'
package (which is not a namespace package) would hide the
site-packages/test directory.
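
A quick way to see which 'test' an import would pick up, as a sketch that
assumes a default sys.path where the stdlib directory comes before
site-packages (the helper name locate is made up for this example):

```python
import importlib.util

def locate(name):
    """Return the origin of the top-level module `name`, or None if absent."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin

test_origin = locate("test")
print("import test resolves to:", test_origin)
```

On a typical CPython install this prints a path inside the stdlib, so a
regular site-packages/test directory later on sys.path would never be reached.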

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-ideas] Official site-packages/test directory

2018-01-19 Thread Wolfgang Maier

On 01/19/2018 03:27 PM, Stefan Krah wrote:


Hello,

I wonder if we could get an official site-packages/test directory.  Currently
it seems to be problematic to distribute tests if they are outside the package
directory.  Here is a nice overview of the two main layout possibilities:

http://pytest.readthedocs.io/en/reorganize-docs/new-docs/user/directory_structure.html


I like the outside-the-package approach, mostly for reasons described very
eloquently here:

http://python-notes.curiousefficiency.org/en/latest/python_concepts/import_traps.html


CPython itself of course also uses Lib/foo.py and Lib/test/test_foo.py, so it
would make sense to have site-packages/foo.py and 
site-packages/test/test_foo.py.

For me, this is the natural layout.



I think that's a really nice idea.
With an official site-packages/test directory there could be pip support 
for optionally installing tests alongside a package if its layout allows 
it. So end users could just install things without tests, but developers 
could do: pip install  --with-tests or something to get everything?


Wolfgang



Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-19 Thread Random832
On Fri, Jan 19, 2018, at 08:30, M.-A. Lemburg wrote:
> > Someone did discover that Microsoft's current implementations of the
> > windows-* encodings match the WHAT-WG spec, rather than the Unicode
> > spec that Microsoft originally wrote.
> 
> No, MS implements something called "best fit encodings"
> and these are different than what WHATWG uses.

NO. I made this absolutely clear in my previous message, best fit mappings can 
be clearly distinguished from regular mappings by the behavior of the native 
conversion functions with certain argument flags (the mapping of 0xA0 to some 
private use character in cp932, for example, is a best-fit mapping in the 
decoding direction - but is treated as a regular mapping for encoding 
purposes), and the mapping of 0x81 to U+0081 in cp1252 etc is NOT a best fit 
mapping or in any way different from the rest of the mappings.

We are not talking about implementing the best fit mappings. We are talking 
about real regular mappings that actually exist in these codepages that were 
for some unknown reason not included in the files published by Unicode.
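
The gaps in question are easy to list from Python, whose cp1252 codec follows
the Unicode mapping file. This is a small sketch that only makes sense for
single-byte encodings; the helper name is made up for this example.

```python
def undefined_bytes(encoding):
    """Return the byte values that `encoding` refuses to decode in strict mode."""
    missing = []
    for byte in range(256):
        try:
            bytes([byte]).decode(encoding)
        except UnicodeDecodeError:
            missing.append(byte)
    return missing

cp1252_gaps = undefined_bytes("cp1252")
print([hex(b) for b in cp1252_gaps])
```

For cp1252 these are 0x81, 0x8d, 0x8f, 0x90 and 0x9d, all inside the C1 range.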

> https://msdn.microsoft.com/en-us/library/windows/desktop/dd374130%28v=vs.85%29.aspx
> 
> unfortunately uses the above mentioned best fit encodings,
> but this can and should be switched off by specifying the
> WC_NO_BEST_FIT_CHARS for anything that requires validation
> or needs to be interoperable:

Specifying this flag (and MB_ERR_INVALID_CHARS in the other direction) in fact 
does not disable the mappings we are discussing.


Re: [Python-ideas] Official site-packages/test directory

2018-01-19 Thread Paul Moore
Another common approach is to not ship tests as part of your (runtime)
package at all - they are in the sdist but not the wheels nor are they
deployed with "setup.py install". In my experience, this is the usual
approach projects take if they don't have the tests in the package
directory. (I don't think I've *ever* seen a project try to install
tests except by including them in the package directory...)

Paul

On 19 January 2018 at 16:10, Guido van Rossum  wrote:
> IIUC another common layout is to have folders named test or tests inside
> each package. This would avoid requiring any changes to the site-packages
> layout.
>
> On Fri, Jan 19, 2018 at 6:27 AM, Stefan Krah  wrote:
>>
>>
>> Hello,
>>
>> I wonder if we could get an official site-packages/test directory.  Currently
>> it seems to be problematic to distribute tests if they are outside the package
>> directory.  Here is a nice overview of the two main layout possibilities:
>>
>>
>> http://pytest.readthedocs.io/en/reorganize-docs/new-docs/user/directory_structure.html
>>
>>
>> I like the outside-the-package approach, mostly for reasons described very
>> eloquently here:
>>
>>
>> http://python-notes.curiousefficiency.org/en/latest/python_concepts/import_traps.html
>>
>>
>> CPython itself of course also uses Lib/foo.py and Lib/test/test_foo.py, so it
>> would make sense to have site-packages/foo.py and
>> site-packages/test/test_foo.py.
>>
>> For me, this is the natural layout.


Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-19 Thread Guido van Rossum
On Fri, Jan 19, 2018 at 5:30 AM, M.-A. Lemburg  wrote:

> On 19.01.2018 05:38, Nathaniel Smith wrote:
> > On Thu, Jan 18, 2018 at 7:51 PM, Guido van Rossum  wrote:
> >> Can someone explain to me why this is such a controversial issue?
> >
> > I guess practicality versus purity is always controversial :-)
> >
> >> It seems reasonable to me to add new encodings to the stdlib that do the
> >> roundtripping requested in the first message of the thread. As long as they
> >> have new names that seems to fall under "practicality beats purity".
>
> There are a few issues here:
>
> * WHATWG encodings are mostly for decoding content in order to
>   show it in the browser, accepting broken encoding data.
>

And sometimes Python apps that pull data from the web.


>   Python already has support for this by using one of the available
>   error handlers, or adding new ones to suit the needs.
>

This seems cumbersome though.


>   If we add the encodings, people will start creating more
>   broken data, since this is what the WHATWG codecs output
>   when encoding Unicode.
>

That's FUD. Only apps that specifically use the new WHATWG encodings would
be able to consume that data. And surely the practice of web browsers will
have a much bigger effect than Python's choice.


>   As discussed, this could be addressed by making the WHATWG
>   codecs decode-only.
>

But that would defeat the point of roundtripping, right?


> * The use case seems limited to implementing browsers or headless
>   implementations working like browsers.
>
>   That's not really general enough to warrant adding lots of
>   new codecs to the stdlib. A PyPI package is better suited
>   for this.
>

Perhaps, but such a package already exists and its author (who surely has
read a lot of bug reports from its users) says that this is cumbersome.


> * The WHATWG codecs do not only cover simple mapping codecs,
>   but also many multi-byte ones for e.g. Asian languages.
>
>   I doubt that we'd want to maintain such codecs in the stdlib,
>   since this will increase the download sizes of the installers
>   and also require people knowledgeable about these variants
>   to work on them and fix any issues.
>

Really? Why is adding a bunch of codecs so much effort? Surely the
translation tables contain data that compresses well? And surely we don't
need a separate dedicated piece of C code for each new codec?
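
A rough sketch of how little code a table-driven codec needs, assuming the
variant wanted is cp1252 with its five undefined bytes filled in by the C1
control characters of the same value (my reading of the WHATWG windows-1252
index). The names build_table, decode and encode are made up for
illustration; codecs.charmap_build, codecs.charmap_decode and
codecs.charmap_encode are the helpers the existing stdlib charmap codecs
already rely on:

```python
import codecs


def build_table():
    """Return a 256-character decoding table derived from stdlib cp1252.

    Assumption: bytes that cp1252 leaves undefined map to the code point
    with the same numeric value (U+0081, U+008D, U+008F, U+0090, U+009D).
    """
    chars = []
    for byte in range(256):
        try:
            chars.append(bytes([byte]).decode("cp1252"))
        except UnicodeDecodeError:
            chars.append(chr(byte))
    return "".join(chars)


DECODING_TABLE = build_table()
ENCODING_TABLE = codecs.charmap_build(DECODING_TABLE)


def decode(data, errors="strict"):
    # Returns (text, number_of_bytes_consumed), like any codec decode function.
    return codecs.charmap_decode(data, errors, DECODING_TABLE)


def encode(text, errors="strict"):
    # Returns (data, number_of_characters_consumed).
    return codecs.charmap_encode(text, errors, ENCODING_TABLE)
```

With a table like this, every byte value decodes to something and encodes
back to itself, so the per-codec cost would be one 256-entry table rather
than new C code.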


> Overall, I think either pointing people to error handlers
> or perhaps adding a new one specifically for the case of
> dealing with control character mappings would provide a better
> maintenance / usefulness ratio than adding lots of new
> legacy codecs to the stdlib.
>

Wouldn't error handlers be much slower? And to me it seems a new error
handler is a much *bigger* deal than some new encodings -- error handlers
must work for *all* encodings.


> BTW: WHATWG pushes for always using UTF-8 as far as I can tell
> from their website.
>

As does Python. But apparently it will take decades more to get there.


> >> (Modifying existing encodings seems wrong -- did the feature request somehow
> >> transmogrify into that?)
> >
> > Someone did discover that Microsoft's current implementations of the
> > windows-* encodings matches the WHAT-WG spec, rather than the Unicode
> > spec that Microsoft originally wrote.
>
> No, MS implements something called "best fit encodings",
> and these are different from what WHATWG uses.
>
> Unlike the WHATWG encodings, these are documented as vendor encodings
> on the Unicode site, which is what we normally use as reference
> for our stdlib codecs.
>
> However, whether these are actually a good idea is open to discussion
> as well, since they sometimes go a bit far with "best fit", e.g.
> mapping the infinity symbol to 8.
>
> Again, using the error handlers we have for dealing with
> situations which require non-standard encoding behavior is
> the better approach:
>
> https://docs.python.org/3.7/library/codecs.html#error-handlers
>
> Adding new ones is possible as well.
>
> > So there is some argument that
> > the Python's existing encodings are simply out of date, and changing
> > them would be a bugfix. (And standards aside, it is surely going to be
> > somewhat error-prone if Python's windows-1252 doesn't match everyone
> > else's implementations of windows-1252.) But yeah, AFAICT the original
> > requesters would be happy either way; they just want it available
> > under some name.
>
> The encodings are not out of date. I don't know where you got
> that impression from.
>
> The Windows API WideCharToMultiByte, which was quoted in the discussion:
>
> https://msdn.microsoft.com/en-us/library/windows/desktop/dd374130%28v=vs.85%29.aspx
>
> unfortunately uses the above-mentioned best fit encodings,
> but this can and should be switched off by specifying the
> WC_NO_BEST_FIT_CHARS flag for anything that requires validation
> or needs to be interoperable:
>
> """
> For strings that 

Re: [Python-ideas] Official site-packages/test directory

2018-01-19 Thread Guido van Rossum
IIUC another common layout is to have folders named test or tests inside
each package. This would avoid requiring any changes to the site-packages
layout.

On Fri, Jan 19, 2018 at 6:27 AM, Stefan Krah  wrote:

>
> Hello,
>
> I wonder if we could get an official site-packages/test directory.  Currently
> it seems to be problematic to distribute tests if they are outside the package
> directory.  Here is a nice overview of the two main layout possibilities:
>
> http://pytest.readthedocs.io/en/reorganize-docs/new-docs/user/directory_structure.html
>
>
> I like the outside-the-package approach, mostly for reasons described very
> eloquently here:
>
> http://python-notes.curiousefficiency.org/en/latest/python_concepts/import_traps.html
>
>
> CPython itself of course also uses Lib/foo.py and Lib/test/test_foo.py, so it
> would make sense to have site-packages/foo.py and
> site-packages/test/test_foo.py.
>
> For me, this is the natural layout.
>
>
>
> Stefan Krah
>
>
>
>



-- 
--Guido van Rossum (python.org/~guido)


[Python-ideas] Official site-packages/test directory

2018-01-19 Thread Stefan Krah

Hello,

I wonder if we could get an official site-packages/test directory.  Currently
it seems to be problematic to distribute tests if they are outside the package
directory.  Here is a nice overview of the two main layout possibilities:

http://pytest.readthedocs.io/en/reorganize-docs/new-docs/user/directory_structure.html


I like the outside-the-package approach, mostly for reasons described very
eloquently here:

http://python-notes.curiousefficiency.org/en/latest/python_concepts/import_traps.html


CPython itself of course also uses Lib/foo.py and Lib/test/test_foo.py, so it
would make sense to have site-packages/foo.py and 
site-packages/test/test_foo.py.

For me, this is the natural layout.



Stefan Krah





Re: [Python-ideas] Support WHATWG versions of legacy encodings

2018-01-19 Thread M.-A. Lemburg
On 19.01.2018 05:38, Nathaniel Smith wrote:
> On Thu, Jan 18, 2018 at 7:51 PM, Guido van Rossum  wrote:
>> Can someone explain to me why this is such a controversial issue?
> 
> I guess practicality versus purity is always controversial :-)
> 
>> It seems reasonable to me to add new encodings to the stdlib that do the
>> roundtripping requested in the first message of the thread. As long as they
>> have new names that seems to fall under "practicality beats purity".

There are a few issues here:

* WHATWG encodings are mostly for decoding content in order to
  show it in the browser, accepting broken encoding data.

  Python already has support for this by using one of the available
  error handlers, or adding new ones to suit the needs.

  If we add the encodings, people will start creating more
  broken data, since this is what the WHATWG codecs output
  when encoding Unicode.

  As discussed, this could be addressed by making the WHATWG
  codecs decode-only.

* The use case seems limited to implementing browsers or headless
  implementations working like browsers.

  That's not really general enough to warrant adding lots of
  new codecs to the stdlib. A PyPI package is better suited
  for this.

* The WHATWG codecs do not only cover simple mapping codecs,
  but also many multi-byte ones for e.g. Asian languages.

  I doubt that we'd want to maintain such codecs in the stdlib,
  since this will increase the download sizes of the installers
  and also require people knowledgeable about these variants
  to work on them and fix any issues.

Overall, I think either pointing people to error handlers
or perhaps adding a new one specifically for the case of
dealing with control character mappings would provide a better
maintenance / usefulness ratio than adding lots of new
legacy codecs to the stdlib.

BTW: WHATWG pushes for always using UTF-8 as far as I can tell
from their website.

>> (Modifying existing encodings seems wrong -- did the feature request somehow
>> transmogrify into that?)
> 
> Someone did discover that Microsoft's current implementations of the
> windows-* encodings matches the WHAT-WG spec, rather than the Unicode
> spec that Microsoft originally wrote.

No, MS implements something called "best fit encodings",
and these are different from what WHATWG uses.

Unlike the WHATWG encodings, these are documented as vendor encodings
on the Unicode site, which is what we normally use as reference
for our stdlib codecs.

However, whether these are actually a good idea is open to discussion
as well, since they sometimes go a bit far with "best fit", e.g.
mapping the infinity symbol to 8.

Again, using the error handlers we have for dealing with
situations which require non-standard encoding behavior is
the better approach:

https://docs.python.org/3.7/library/codecs.html#error-handlers

Adding new ones is possible as well.
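
A minimal sketch of what such a handler might look like, assuming the goal
is only to pass the bytes that cp1252 leaves undefined through as the C1
control characters of the same value. The handler name "c1-passthrough" and
the function c1_passthrough are hypothetical; codecs.register_error is the
documented registration hook:

```python
import codecs


def c1_passthrough(exc):
    """Hypothetical handler: pass C1 bytes/characters straight through."""
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        if all(0x80 <= b <= 0x9F for b in bad):
            # Byte 0x81 becomes U+0081, and so on.
            return bad.decode("latin-1"), exc.end
    elif isinstance(exc, UnicodeEncodeError):
        bad = exc.object[exc.start:exc.end]
        if all("\x80" <= ch <= "\x9f" for ch in bad):
            # Encode handlers may return bytes, which are copied to the output.
            return bad.encode("latin-1"), exc.end
    # Anything else is a genuine error; re-raise it unchanged.
    raise exc


codecs.register_error("c1-passthrough", c1_passthrough)

data = b"caf\xe9 \x93quoted\x94 \x81"
text = data.decode("cp1252", errors="c1-passthrough")
roundtrip = text.encode("cp1252", errors="c1-passthrough")
```

Here the defined bytes still decode through the normal cp1252 table, and
only the undefined byte 0x81 reaches the handler, so the data round-trips
without a new codec. The handler is called once per failure, so input with
many such bytes would pay a Python-level call for each.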

> So there is some argument that
> the Python's existing encodings are simply out of date, and changing
> them would be a bugfix. (And standards aside, it is surely going to be
> somewhat error-prone if Python's windows-1252 doesn't match everyone
> else's implementations of windows-1252.) But yeah, AFAICT the original
> requesters would be happy either way; they just want it available
> under some name.

The encodings are not out of date. I don't know where you got
that impression from.

The Windows API WideCharToMultiByte, which was quoted in the discussion:

https://msdn.microsoft.com/en-us/library/windows/desktop/dd374130%28v=vs.85%29.aspx

unfortunately uses the above-mentioned best fit encodings,
but this can and should be switched off by specifying the
WC_NO_BEST_FIT_CHARS flag for anything that requires validation
or needs to be interoperable:

"""
For strings that require validation, such as file, resource, and user
names, the application should always use the WC_NO_BEST_FIT_CHARS flag.
This flag prevents the function from mapping characters to characters
that appear similar but have very different semantics. In some cases,
the semantic change can be extreme. For example, the symbol for "∞"
(infinity) maps to 8 (eight) in some code pages.
"""

-- 
Marc-Andre Lemburg
eGenix.com


___