Re: Everything you did not want to know about Unicode in Python 3

2014-05-17 Thread Mark Lawrence

On 17/05/2014 05:19, Marko Rauhamaa wrote:


The sole copyright holder can
simply state: "this work is in the Public Domain", or: "all rights
relinquished", or some such. Ultimately, everything is decided by the
courts, of course.



For examples see all the Python PEPs.

--
My fellow Pythonistas, ask not what our language can do for you, ask 
what you can do for our language.


Mark Lawrence



--
https://mail.python.org/mailman/listinfo/python-list


Re: Everything you did not want to know about Unicode in Python 3

2014-05-17 Thread Robert Kern

On 2014-05-17 02:07, Steven D'Aprano wrote:

On Fri, 16 May 2014 14:46:23 +, Grant Edwards wrote:


At least in the US, there doesn't seem to be such a thing as placing a
work into the public domain.  The copyright holder can transfer
ownership to somebody else, but there is no public domain to which
ownership can be transferred.


That's factually incorrect. In the US, sufficiently old works, or works
of a certain age that were not explicitly registered for copyright, are
in the public domain. Under a wide range of circumstances, works created
by the federal government go immediately into the public domain.


There is such a thing as the public domain in the US, and there are works in it, 
but there isn't really such a thing as placing a work there voluntarily, as 
Grant says. A work either is or isn't in the public domain. The author has no 
choice in the matter.


--
Robert Kern

I have come to believe that the whole world is an enigma, a harmless enigma
 that is made terrible by our own mad attempt to interpret it as though it had
 an underlying truth.
  -- Umberto Eco



Re: Everything you did not want to know about Unicode in Python 3

2014-05-17 Thread Chris Angelico
On Sat, May 17, 2014 at 6:57 PM, Robert Kern robert.k...@gmail.com wrote:
 There is such a thing as the public domain in the US, and there are works in
 it, but there isn't really such a thing as placing a work there
 voluntarily, as Grant says. A work either is or isn't in the public domain.
 The author has no choice in the matter.

Then what's the copyright status of PEPs?

The nearest thing to assigning to public domain that works across
legislatures is probably CC0:

http://creativecommons.org/about/cc0

ChrisA


Re: Everything you did not want to know about Unicode in Python 3

2014-05-17 Thread Robert Kern

On 2014-05-17 05:19, Marko Rauhamaa wrote:

Steven D'Aprano steve+comp.lang.pyt...@pearwood.info:


On Fri, 16 May 2014 14:46:23 +, Grant Edwards wrote:


At least in the US, there doesn't seem to be such a thing as placing
a work into the public domain. The copyright holder can transfer
ownership to somebody else, but there is no public domain to which
ownership can be transferred.


That's factually incorrect. In the US, sufficiently old works, or works
of a certain age that were not explicitly registered for copyright, are
in the public domain. Under a wide range of circumstances, works created
by the federal government go immediately into the public domain.


Steven, you're not disputing Grant. I am. The sole copyright holder can
simply state: "this work is in the Public Domain", or: "all rights
relinquished", or some such. Ultimately, everything is decided by the
courts, of course.


One can state many things, but that doesn't mean they have legal effect. The US 
Code has provisions for how works become copyrighted automatically, how they 
leave copyright automatically at the end of specific time periods, how some 
works automatically enter the public domain on their creation (i.e. works of the 
US federal government), but has nothing at all for how a private creator can 
voluntarily place their work into the public domain when it would otherwise not 
be. It used to, but does not any more.


For a private individual to say about a work they just created that "this work
is in the Public Domain" is, under US law, merely an erroneous statement of
fact, not a speech act that effects a change in the legal status of the work.
For another example of this distinction, saying "I am married" when I have not
applied for, received, and solemnized a valid marriage license is just an
erroneous statement of fact and does not make me legally married.


Relinquishing your rights can have some effect, but not all rights can be 
relinquished, and this is not the same as putting your work into the public 
domain. Among other things, your heirs can sometimes reclaim those rights in 
some circumstances if you are not careful (and if they are valuable enough to 
bother reclaiming).


If you wish to do something like this, I highly recommend (though IANAL and 
TINLA) using the CC0 Waiver from Creative Commons. It has thorough legalese for 
relinquishing all the rights that one can relinquish for the maximum terms that 
one can do so in as many jurisdictions as possible and acts as a license to 
use/distribute/etc. without restriction even if some rights cannot be 
relinquished. Even if US law were to change to provide for dedicating works to 
the public domain, I would probably still use the CC0 anyways to account for the 
high variability in how different jurisdictions around the world treat their own 
public domains.


  http://creativecommons.org/about/cc0
  http://wiki.creativecommons.org/CC0_FAQ

Note how they distinguish the CC0 Waiver from their Public Domain Mark: the 
Public Domain Mark is just a label for things that are known to be free of 
copyright worldwide but does not make a work so. The CC0 *does* have an 
operative effect that is substantially similar to the work being in the public 
domain.


--
Robert Kern

I have come to believe that the whole world is an enigma, a harmless enigma
 that is made terrible by our own mad attempt to interpret it as though it had
 an underlying truth.
  -- Umberto Eco



Re: Everything you did not want to know about Unicode in Python 3

2014-05-17 Thread Ben Finney
Chris Angelico ros...@gmail.com writes:

 On Sat, May 17, 2014 at 6:57 PM, Robert Kern robert.k...@gmail.com wrote:
  There is such a thing as the public domain in the US, and there are works in
  it, but there isn't really such a thing as placing a work there
  voluntarily, as Grant says. A work either is or isn't in the public domain.
  The author has no choice in the matter.

 Then what's the copyright status of PEPs?

My guess: They are in the default copyright status, with all rights
reserved (i.e. everything that copyright law restricts, is forbidden to
the recipient).

But, if any of those copyright holders were ever to assert their
copyright had been infringed by some recipient, the “this work is in the
public domain” or equivalent would be taken as a clear indication of the
*intent* of the copyright holder.

Ultimately, what matters is the determination of whatever judge you find
yourself facing. To that end, clarifying in the copyright statement and
license terms exactly what is permitted can be immensely helpful in
foreshortening and, ideally, avoiding a future copyright suit.

Copyright is a ridiculous burden on everyone — to the extent that even
those copyright holders who don't *want* those rights which the law
reserves to the copyright holder, and want to divest themselves of the
role of copyright holder, find it frustratingly difficult to do so
effectively across jurisdictions.

-- 
 \  “Computer perspective on Moore's Law: Human effort becomes |
  `\   twice as expensive roughly every two years.” —anonymous |
_o__)  |
Ben Finney



Re: Everything you did not want to know about Unicode in Python 3

2014-05-17 Thread Steven D'Aprano
On Sat, 17 May 2014 09:57:06 +0100, Robert Kern wrote:

 On 2014-05-17 02:07, Steven D'Aprano wrote:
 On Fri, 16 May 2014 14:46:23 +, Grant Edwards wrote:

 At least in the US, there doesn't seem to be such a thing as placing
 a work into the public domain.  The copyright holder can transfer
 ownership to somebody else, but there is no public domain to which
 ownership can be transferred.

 That's factually incorrect. In the US, sufficiently old works, or works
 of a certain age that were not explicitly registered for copyright, are
 in the public domain. Under a wide range of circumstances, works
 created by the federal government go immediately into the public
 domain.
 
 There is such a thing as the public domain in the US, and there are
 works in it, but there isn't really such a thing as placing a work
 there voluntarily, as Grant says. A work either is or isn't in the
 public domain. The author has no choice in the matter.

That's incorrect.

http://cr.yp.to/publicdomain.html

Here's the money quote, from the 9th Circuit Court:

It is well settled that rights gained under the Copyright Act 
may be abandoned. But abandonment of a right must be manifested
by some overt act indicating an intention to abandon that right.


There's also this:

http://creativecommons.org/publicdomain/zero/1.0/

which counts as an overt act.


By the way, there's more info on US copyright terms here:

http://copyright.cornell.edu/resources/publicdomain.cfm

although it doesn't specifically mention voluntary abandonment of
copyright.



-- 
Steven D'Aprano
http://import-that.dreamwidth.org/


Re: Everything you did not want to know about Unicode in Python 3

2014-05-17 Thread Steven D'Aprano
On Sat, 17 May 2014 10:29:00 +0100, Robert Kern wrote:

 One can state many things, but that doesn't mean they have legal effect.
 The US Code has provisions for how works become copyrighted
 automatically, how they leave copyright automatically at the end of
 specific time periods, how some works automatically enter the public
 domain on their creation (i.e. works of the US federal government), but
 has nothing at all for how a private creator can voluntarily place their
 work into the public domain when it would otherwise not be. It used to,
 but does not any more.

The case for abandonment was stated as "well settled" in 1998 (Micro-Star
v. Formgen Inc). Unless there has been a major legal change in the years 
since then, I don't think it is true that authors cannot abandon 
copyright.

 
 For a private individual to say about a work they just created that
 "this work is in the Public Domain" is, under US law, merely an
 erroneous statement of fact, not a speech act that effects a change in
 the legal status of the work. For another example of this distinction,
 saying "I am married" when I have not applied for, received, and
 solemnized a valid marriage license is just an erroneous statement of
 fact and does not make me legally married.

There may be something to what you say, although I think we're now 
arguing fine semantic details. See:

https://en.wikipedia.org/wiki/Wikipedia:Granting_work_into_the_public_domain

To play Devil's Advocate in favour of your assertion, it may be that 
abandoning copyright does not literally put the work in the public 
domain, but merely makes it quack like the public domain. That is to 
say, the author still, in some abstract but legally meaningless sense, 
has copyright in the work *but* has given unlimited usage rights. (I 
don't actually think that is the case, at least not in the US.)

It's this tiny bit of residual uncertainty that leads some authorities to 
say that it is hard to release a work into the public domain, 
particularly in a world-wide context, and that merely stating "this is in
the public domain" is not sufficient to remove all legal doubt over the
status, and that a more overt and explicit release *may* be required. 
Hence the CC0 licence which you refer to. The human readable summary says 
in part:

 The person who associated a work with this deed has dedicated
 the work to the public domain by waiving all of his or her
 rights to the work worldwide under copyright law, including
 all related and neighboring rights, to the extent allowed by
 law.

 You can copy, modify, distribute and perform the work, even
 for commercial purposes, all without asking permission.

http://creativecommons.org/publicdomain/zero/1.0/

while the actual legal licence comes in at almost 800 words. This is
basically the same as "I release this to the public domain", only longer.

(The CC0 licence is longer than you might expect, because it is assumed 
that it may have to apply in countries where you *really cannot* 
relinquish copyright. But we're specifically talking about the US, where 
the 9th Circuit says you can.)


 Relinquishing your rights can have some effect, but not all rights can
 be relinquished, 

Outside of the US, so-called "moral rights" or "reputation rights" cannot
generally be relinquished, except perhaps in work-for-hire and perhaps 
not even then. (E.g. if you're a ghost writer.) The situation in the US 
is a bit murky -- there are no official moral rights per se, and 
copyright only controls usage rights such as copying, distribution and so 
forth. But this doesn't mean that you can (for example) claim authorship 
of a public domain work unless you actually wrote it.

In any case, we're discussing copyright, not other rights.


 and this is not the same as putting your work into the
 public domain. 

One might not be the same while still being effectively the same. For 
example, the U.S. Copyright Office states that one may not grant their 
work into the public domain. However, a copyright owner may release all 
of their rights to their work by stating the work may be freely 
reproduced, distributed, etc. as if it were in the public domain.

But note that the Copyright Office does not make the final decision 
whether you can relinquish copyright or not. That's up to the courts.


 Among other things, your heirs can sometimes reclaim
 those rights in some circumstances if you are not careful (and if they
 are valuable enough to bother reclaiming).

That's a good point. A simplistic "I release this to the public domain"
statement *may* (I emphasise the uncertainty) leave some doubt that it is
*sufficiently overt* to prevent your heirs from disagreeing and coming 
after your users for infringement. Then the courts have to get involved, 
and it's all ugliness and only the lawyers win.

Hence the advice to be as explicit and overt as possible.


 If you wish to do something like this, I highly recommend (though IANAL
 and TINLA) 

Re: Everything you did not want to know about Unicode in Python 3

2014-05-17 Thread Robert Kern

On 2014-05-17 15:15, Steven D'Aprano wrote:

On Sat, 17 May 2014 10:29:00 +0100, Robert Kern wrote:


One can state many things, but that doesn't mean they have legal effect.
The US Code has provisions for how works become copyrighted
automatically, how they leave copyright automatically at the end of
specific time periods, how some works automatically enter the public
domain on their creation (i.e. works of the US federal government), but
has nothing at all for how a private creator can voluntarily place their
work into the public domain when it would otherwise not be. It used to,
but does not any more.


The case for abandonment was stated as well settled in 1998 (Micro-Star
v. Formgen Inc). Unless there has been a major legal change in the years
since then, I don't think it is true that authors cannot abandon
copyright.


Good old Micro-Star v. Formgen Inc. A perennial favorite. No, that case did not 
settle this question. There is a statement in the opinion that would suggest 
this, but (and this seems to be a recurring theme) its inclusion in the
opinion did not create precedent to that effect. The statement that you refer to
is, as far as my NAL eyes can tell, what the lawyers call dictum: a statement
made in a judicial opinion that is unnecessary to decide the case and therefore
not precedential. FormGen explicitly registered the copyright to the works in
question, and the case was decided on whether or not the 
Micro-Star-redistributed works counted as derivative works (yes). Now, if the 
case were about an author that affirmatively dedicated his work to the public 
domain and then sued someone who redistributed it, then such a statement would 
have a precedential effect (because then the judge would decide in favor of the 
defendant on the basis of that statement). The quote that you refer to 
references a previous case, which follows similar lines, and also predates the 
automatic copyright regime post-Berne Convention, so it's not even clear to me 
that it should have been precedential to Micro-Star.


Even if this case did so decide (which, I will grant it more or less did later 
by codifying such a rule in their jury instructions for such cases), it would 
only have effect in the 9th Circuit of the US and not even in the rest of the 
US, much less worldwide. Why bother when the CC0 gives you the desired effect 
with more assurance to your audience?



For a private individual to say about a work they just created that
"this work is in the Public Domain" is, under US law, merely an
erroneous statement of fact, not a speech act that effects a change in
the legal status of the work. For another example of this distinction,
saying "I am married" when I have not applied for, received, and
solemnized a valid marriage license is just an erroneous statement of
fact and does not make me legally married.


There may be something to what you say, although I think we're now
arguing fine semantic details.


Sure, it's the law. Fine semantic details are important. However, the difference 
between speech acts and statements of fact is a pretty gross semantic 
distinction and not just splitting semantic hairs. The act of making some 
statements (e.g. declaring that a work you own the copyright to is available 
under a given license) actually makes a change in the legal status of something. 
Most statements don't. Which ones do and don't are defined by statute and (in 
common law countries like the US) court decisions. Deciding which is which is 
often hairy, but that's an epistemological problem, not a semantic one. :-)



See:

https://en.wikipedia.org/wiki/Wikipedia:Granting_work_into_the_public_domain

To play Devil's Advocate in favour of your assertion, it may be that
abandoning copyright does not literally put the work in the public
domain, but merely makes it quack like the public domain. That is to
say, the author still, in some abstract but legally meaningless sense,
has copyright in the work *but* has given unlimited usage rights. (I
don't actually think that is the case, at least not in the US.)

It's this tiny bit of residual uncertainty that leads some authorities to
say that it is hard to release a work into the public domain,
particularly in a world-wide context, and that merely stating this is in
the public domain is not sufficient to remove all legal doubt over the
status, and that a more overt and explicit release *may* be required.
Hence the CC0 licence which you refer to. The human readable summary says
in part:

  The person who associated a work with this deed has dedicated
  the work to the public domain by waiving all of his or her
  rights to the work worldwide under copyright law, including
  all related and neighboring rights, to the extent allowed by
  law.

  You can copy, modify, distribute and perform the work, even
  for commercial purposes, all without asking permission.

http://creativecommons.org/publicdomain/zero/1.0/

while the actual 

Re: Everything you did not want to know about Unicode in Python 3

2014-05-17 Thread Robert Kern

On 2014-05-17 13:07, Steven D'Aprano wrote:

On Sat, 17 May 2014 09:57:06 +0100, Robert Kern wrote:


On 2014-05-17 02:07, Steven D'Aprano wrote:

On Fri, 16 May 2014 14:46:23 +, Grant Edwards wrote:


At least in the US, there doesn't seem to be such a thing as placing
a work into the public domain.  The copyright holder can transfer
ownership to somebody else, but there is no public domain to which
ownership can be transferred.


That's factually incorrect. In the US, sufficiently old works, or works
of a certain age that were not explicitly registered for copyright, are
in the public domain. Under a wide range of circumstances, works
created by the federal government go immediately into the public
domain.


There is such a thing as the public domain in the US, and there are
works in it, but there isn't really such a thing as placing a work
there voluntarily, as Grant says. A work either is or isn't in the
public domain. The author has no choice in the matter.


That's incorrect.

http://cr.yp.to/publicdomain.html


Thanks for the link. While it has not really changed my opinion (as discussed at 
length in my other reply), I did not know that the 9th Circuit had formalized 
the overt act test in their civil procedure rules, so there is at least one 
jurisdiction in the US that does currently work like this. None of the others 
do, to my knowledge, and this is the product of judicial common law, not 
statutory law, so it's still pretty shaky.


--
Robert Kern

I have come to believe that the whole world is an enigma, a harmless enigma
 that is made terrible by our own mad attempt to interpret it as though it had
 an underlying truth.
  -- Umberto Eco



Re: Everything you did not want to know about Unicode in Python 3

2014-05-16 Thread Antoine Pitrou
Terry Reedy tjreedy at udel.edu writes:
 
 On 5/13/2014 8:53 PM, Ethan Furman wrote:
  On 05/13/2014 05:10 PM, Steven D'Aprano wrote:
  On Tue, 13 May 2014 10:08:42 -0600, Ian Kelly wrote:
 
  Because Python 3 presents stdin and stdout as text streams however, it
  makes them more difficult to use with binary data, which is why Armin
  sets up all that extra code to make sure his file objects are binary.
 
  What surprises me is how hard that is. Surely there's a simpler way to
  open stdin and stdout in binary mode? If not, there ought to be.
 
  Somebody already posted this:
 
  https://docs.python.org/3/library/sys.html#sys.stdin
 
  which talks about .detach().
 
 I sent a message to Armin about this.

And the documentation has now been fixed:
http://bugs.python.org/issue21364

So something *can* come out of a python-list rantfest, it seems.
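
For anyone reading this thread later, here is a minimal sketch of the two approaches the sys.stdin documentation describes: writing through the .buffer attribute of the text stream, and calling .detach(). The helper name write_bytes and the in-memory stand-in for stdout are assumptions made for illustration; they come from neither Armin's article nor the tracker issue.

```python
import io
import sys


def write_bytes(data, text_stream=None):
    """Write raw bytes through the binary buffer under a text stream."""
    if text_stream is None:
        text_stream = sys.stdout
    # Flush pending text first so text and bytes come out in order.
    text_stream.flush()
    text_stream.buffer.write(data)
    text_stream.buffer.flush()


# In-memory stand-in for stdout (an assumption, so the example does not
# depend on a real terminal). newline="\n" disables newline translation.
raw = io.BytesIO()
fake_stdout = io.TextIOWrapper(raw, encoding="utf-8", newline="\n")

fake_stdout.write("text first\n")
write_bytes(b"\xff\xfe raw bytes\n", fake_stdout)

# detach() goes one step further: it returns the binary stream and leaves
# the text wrapper unusable afterwards.
binary = fake_stdout.detach()
binary.write(b"after detach\n")

result = raw.getvalue()
```

With the real streams, the same idea is simply sys.stdout.buffer.write(some_bytes); detach() is only needed when the text layer should be discarded for the rest of the program.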

Regards

Antoine.




Re: Everything you did not want to know about Unicode in Python 3

2014-05-16 Thread wxjmfauth
On Friday, 16 May 2014 13:50:47 UTC+2, Antoine Pitrou wrote:
 Terry Reedy tjreedy at udel.edu writes:

  On 5/13/2014 8:53 PM, Ethan Furman wrote:
   On 05/13/2014 05:10 PM, Steven D'Aprano wrote:
   On Tue, 13 May 2014 10:08:42 -0600, Ian Kelly wrote:

   Because Python 3 presents stdin and stdout as text streams however, it
   makes them more difficult to use with binary data, which is why Armin
   sets up all that extra code to make sure his file objects are binary.

   What surprises me is how hard that is. Surely there's a simpler way to
   open stdin and stdout in binary mode? If not, there ought to be.

   Somebody already posted this:

   https://docs.python.org/3/library/sys.html#sys.stdin

   which talks about .detach().

  I sent a message to Armin about this.

 And the documentation has now been fixed:
 http://bugs.python.org/issue21364

 So something *can* come out of a python-list rantfest, it seems.

 Regards

 Antoine.

==

http://www.unicode.org/

With my best regards.



Re: Everything you did not want to know about Unicode in Python 3

2014-05-16 Thread Grant Edwards
On 2014-05-14, alister alister.nospam.w...@ntlworld.com wrote:
 On Wed, 14 May 2014 10:08:57 +1000, Chris Angelico wrote:

 On Wed, May 14, 2014 at 9:53 AM, Steven D'Aprano
 steve+comp.lang.pyt...@pearwood.info wrote:
 With the current system, all of us here are technically violating
 copyright every time we reply to an email and quote more than a small
 percentage of it.
 
 Oh wow... so when someone quotes heaps of text without trimming, and
 adding blank lines, we can complain that it's a copyright violation -
 reproducing our work with unauthorized modifications and without
 permission...
 
 I never thought of it like that.

 I think I could make a very strong case that anything sent to a public 
 forum with the intention of being broadcast has been placed into the 
 public domain by this action.

At least in the US, there doesn't seem to be such a thing as placing
a work into the public domain.  The copyright holder can transfer
ownership to somebody else, but there is no public domain to which
ownership can be transferred.  IIRC, there is a way under German
copyright law to release certain rights.  The mere act of widely
distributing something does not in any way relinquish
copyrights.

-- 
Grant Edwards   grant.b.edwards at gmail.com
                Yow! Am I elected yet?


Re: Everything you did not want to know about Unicode in Python 3

2014-05-16 Thread Steven D'Aprano
On Fri, 16 May 2014 14:46:23 +, Grant Edwards wrote:

 At least in the US, there doesn't seem to be such a thing as placing a
 work into the public domain.  The copyright holder can transfer
 ownership to somebody else, but there is no public domain to which
 ownership can be transferred.

That's factually incorrect. In the US, sufficiently old works, or works 
of a certain age that were not explicitly registered for copyright, are 
in the public domain. Under a wide range of circumstances, works created 
by the federal government go immediately into the public domain.

It is true that under the Mickey Mouse Copyright Grab Act[1] of (insert
years here), every time Mickey Mouse is about to reach the end of
copyright, Congress retroactively extends copyright terms for another few
decades, but that's another story.




[1] Not the real name of the act.

-- 
Steven D'Aprano
http://import-that.dreamwidth.org/


Re: Everything you did not want to know about Unicode in Python 3

2014-05-16 Thread Marko Rauhamaa
Steven D'Aprano steve+comp.lang.pyt...@pearwood.info:

 On Fri, 16 May 2014 14:46:23 +, Grant Edwards wrote:

 At least in the US, there doesn't seem to be such a thing as placing
 a work into the public domain. The copyright holder can transfer
 ownership to somebody else, but there is no public domain to which
 ownership can be transferred.

 That's factually incorrect. In the US, sufficiently old works, or works 
 of a certain age that were not explicitly registered for copyright, are 
 in the public domain. Under a wide range of circumstances, works created 
 by the federal government go immediately into the public domain.

Steven, you're not disputing Grant. I am. The sole copyright holder can
simply state: "this work is in the Public Domain", or: "all rights
relinquished", or some such. Ultimately, everything is decided by the
courts, of course.


Marko


Re: Everything you did not want to know about Unicode in Python 3

2014-05-14 Thread wxjmfauth
On Tuesday, 13 May 2014 10:08:45 UTC+2, Johannes Bauer wrote:
 On 13.05.2014 03:18, Steven D'Aprano wrote:

  Armin Ronacher is an extremely experienced and knowledgeable Python
  developer, and a Python core developer. He might be wrong, but he's not
  *obviously* wrong.

 He's correct about file name encodings. Which can be fixed really easily
 without messing everything up (sys.argv binary variant, open accepting
 binary filenames). But that he suggests that Go would be superior:

  Which uses an even simpler model than Python 2: everything is a byte
  string. The assumed encoding is UTF-8. End of the story.

 Is just a horrible idea. An obviously horrible idea, too.

 Having dealt with the UTF-8 problems on Python2 I can safely say that I
 never, never ever want to go back to that freaky hell. If I deal with
 strings, I want to be able to sanely manipulate them and I want to be
 sure that after manipulation they're still valid strings. Manipulating
 the bytes representation of unicode data just doesn't work.

 And I'm very very glad that some people felt the same way and
 implemented a sane, consistent way of dealing with Unicode in Python3.
 It's one of the reasons why I switched to Py3 very early and I love it.

 Cheers,
 Johannes

 --
  Where exactly was it again that you predicted the quake?
  At least not publicly!
 Ah, the latest and to date most brilliant stroke of our great
 cosmologists: the secret prediction.
  - Karl Kaos on Rüdiger Thomas in dsa hidbv3$om2$1...@speranza.aioe.org
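
Johannes's remark about manipulating the bytes representation can be illustrated with a small sketch. The sample string is an arbitrary choice for this example and does not come from Armin's article; the point is only that slicing UTF-8 bytes can cut a character in half, while slicing a Python 3 str cannot.

```python
# "naive cafe" with two accented letters, written with escapes so the
# example does not depend on the source file's encoding.
text = "na\u00efve caf\u00e9"
encoded = text.encode("utf-8")

# Slicing the str counts code points, so the result is still valid text.
first_three_chars = text[:3]

# Slicing the bytes counts bytes and here splits the two-byte sequence
# for U+00EF after its first byte.
first_three_bytes = encoded[:3]

try:
    first_three_bytes.decode("utf-8")
    decodes_cleanly = True
except UnicodeDecodeError:
    decodes_cleanly = False

char_count = len(text)      # number of code points
byte_count = len(encoded)   # number of UTF-8 bytes
```

The two lengths differ because each accented letter occupies two bytes in UTF-8, which is why byte offsets are not a safe substitute for character offsets.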

===

A Rob 'Commander' Pike will never put utf16 and
ebcdic in the same basket, when discussing coding
of characters.

jmf



Re: Everything you did not want to know about Unicode in Python 3

2014-05-14 Thread alister
On Tue, 13 May 2014 10:08:42 -0600, Ian Kelly wrote:

 On Tue, May 13, 2014 at 5:19 AM, alister
 alister.nospam.w...@ntlworld.com wrote:
 I am only an amateur python coder which is why I asked if I am missing
 something

 I could not see any reason to be using the shutil module if all that
 the program is doing is opening a file, reading it, then printing it.

 is it python that causes the issue, the shutil module or just the OS
 not liking the data it is being sent?

 an explanation of why this approach is taken would be much appreciated.
 
 No, that part is perfectly fine.  This is exactly what the shutil module
 is meant for: providing shell-like operations.  Although in this case
 the copyfileobj function is quite simple (have yourself a look at the
 source -- it just reads from one file and writes to the other in a
 loop), in general the Pythonic thing is to avoid reinventing the wheel.
 
 And since it's so simple, it shouldn't be hard to see that the use of
 the shutil module has nothing to do with the Unicode woes here.  The
 crux of the issue is that a general-purpose command like cat typically
 can't know the encoding of its input and can't assume anything about it.
 In fact, there may not even be an encoding; cat can be used with binary
 data.  The only non-destructive approach then is to copy the binary data
 straight from the source to the destination with no decoding steps at
 all, and trust the user to ensure that the destination will be able to
 accommodate the source encoding.  Because Python 3 presents stdin and
 stdout as text streams however, it makes them more difficult to use with
 binary data, which is why Armin sets up all that extra code to make sure
 his file objects are binary.
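
The decoding-free copy Ian describes can be sketched roughly as follows. The function name cat and the sample payload are illustrative assumptions; the in-memory streams stand in for files opened with mode "rb" and for sys.stdout.buffer, so the example runs without touching the real standard streams.

```python
import io
import shutil


def cat(sources, out):
    """Copy each binary source to `out` unchanged, with no decoding step."""
    for src in sources:
        # copyfileobj reads chunks from src and writes them to out in a loop.
        shutil.copyfileobj(src, out)
    out.flush()


# Latin-1 encoded text followed by bytes that are not valid UTF-8:
# decoding this as UTF-8 would fail, but a byte-for-byte copy never looks
# at the encoding at all.
payload = b"caf\xe9\n\x00\xff"

sink = io.BytesIO()
cat([io.BytesIO(payload), io.BytesIO(b"second\n")], sink)
copied = sink.getvalue()
```

In a real script the sources would be files opened in binary mode (or sys.stdin.buffer) and the destination would be sys.stdout.buffer.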

I think I understand that, in which case I owe Armin an apology. This
certainly sounds like a failing in Python's handling of stdout.



-- 
Get it up, keep it up... LINUX: Viagra for the PC.
   
   -- Chris Abbey


Re: Everything you did not want to know about Unicode in Python 3

2014-05-14 Thread alister
On Wed, 14 May 2014 10:08:57 +1000, Chris Angelico wrote:

 On Wed, May 14, 2014 at 9:53 AM, Steven D'Aprano
 steve+comp.lang.pyt...@pearwood.info wrote:
 With the current system, all of us here are technically violating
 copyright every time we reply to an email and quote more than a small
 percentage of it.
 
 Oh wow... so when someone quotes heaps of text without trimming, and
 adding blank lines, we can complain that it's a copyright violation -
 reproducing our work with unauthorized modifications and without
 permission...
 
 I never thought of it like that.
 
 ChrisA

I think I could make a very strong case that anything sent to a public 
forum with the intention of being broadcast has been placed into the 
public domain by this action.
  



-- 
Work expands to fill the time available.
-- Cyril Northcote Parkinson, The Economist, 1955


Re: Everything you did not want to know about Unicode in Python 3

2014-05-14 Thread Chris Angelico
On Wed, May 14, 2014 at 10:42 PM, alister
alister.nospam.w...@ntlworld.com wrote:
 On Wed, 14 May 2014 10:08:57 +1000, Chris Angelico wrote:

 On Wed, May 14, 2014 at 9:53 AM, Steven D'Aprano
 steve+comp.lang.pyt...@pearwood.info wrote:
 With the current system, all of us here are technically violating
 copyright every time we reply to an email and quote more than a small
 percentage of it.

 Oh wow... so when someone quotes heaps of text without trimming, and
 adding blank lines, we can complain that it's a copyright violation -
 reproducing our work with unauthorized modifications and without
 permission...

 I never thought of it like that.

 ChrisA

 I think I could make a very strong case that anything sent to a public
 forum with the intention of being broadcast has been placed into the
 public domain by this action.

I don't think so. One can reasonably assume that anything sent to a
public forum is permissible to read, and to copy verbatim (although
there may be presumed limits on the copying, but probably not with
python-list). But if I quote your text and edit it, then you would
rightly complain, which is not the case with public domain text. The
question is whether or not it's fair to try to scare people with that
when they repeatedly use buggy software that inserts blank lines
everywhere :)

In case it's not obvious, I am NOT seriously contemplating pursuing
anything like this legally. It's just funny to contemplate.

ChrisA


Re: Everything you did not want to know about Unicode in Python 3

2014-05-14 Thread Dave Angel

On 05/13/2014 09:39 AM, Steven D'Aprano wrote:

On Tue, 13 May 2014 07:20:34 -0400, Roy Smith wrote:


ASCII *is* all I need.


You've never needed to copyright something? Copyright © Roy Smith 2014...
I know some people use (c) instead, but that actually has no legal
standing. (Not that any reasonable judge would invalidate a copyright
based on a technicality like that, not these days.)



(c) has no standing whatsoever, as it's properly spelled (copr)


--
DaveA


Re: Everything you did not want to know about Unicode in Python 3

2014-05-14 Thread Ian Kelly
On May 13, 2014 6:10 PM, Chris Angelico ros...@gmail.com wrote:

 On Wed, May 14, 2014 at 9:53 AM, Steven D'Aprano
 steve+comp.lang.pyt...@pearwood.info wrote:
  With the current system, all of us here are technically violating
  copyright every time we reply to an email and quote more than a small
  percentage of it.

 Oh wow... so when someone quotes heaps of text without trimming, and
 adding blank lines, we can complain that it's a copyright violation -
 reproducing our work with unauthorized modifications and without
 permission...

 I never thought of it like that.

I'd be surprised if this doesn't fall under fair use.


Re: Everything you did not want to know about Unicode in Python 3

2014-05-14 Thread Robin Becker

On 13/05/2014 17:08, Ian Kelly wrote:
.


And since it's so simple, it shouldn't be hard to see that the use of
the shutil module has nothing to do with the Unicode woes here.  The
crux of the issue is that a general-purpose command like cat typically
can't know the encoding of its input and can't assume anything about
it. In fact, there may not even be an encoding; cat can be used with
binary data.  The only non-destructive approach then is to copy the
binary data straight from the source to the destination with no
decoding steps at all, and trust the user to ensure that the
destination will be able to accommodate the source encoding.  Because
Python 3 presents stdin and stdout as text streams however, it makes
them more difficult to use with binary data, which is why Armin sets
up all that extra code to make sure his file objects are binary.

Doesn't this issue also come up wherever bytes are being read, i.e. in sockets,
pipe file handles etc.? Some sources may have well-defined encodings and so allow
use of unicode strings, but surely not all. I imagine all of the problems
associated with a broken encoding promise for stdin can also occur with sockets
& other sources, i.e. error messages failing to be printable etc. Since bytes
in Python 3 are not equivalent to the old str (Python 3 bytes != Python 2 str),
using bytes everywhere has its own problems.

--
Robin Becker



Re: Everything you did not want to know about Unicode in Python 3

2014-05-14 Thread Ian Kelly
On Wed, May 14, 2014 at 9:30 AM, Robin Becker ro...@reportlab.com wrote:
 Doesn't this issue also come up wherever bytes are being read ie in sockets,
 pipe file handles etc? Some sources may have well defined encodings and so
 allow use of unicode strings but surely not all. I imagine all of the
 problems associated with a broken encoding promise for stdin can also occur
 with sockets & other sources ie error messages failing to be printable etc
 etc. Since bytes in Python 3 are not equivalent to the old str (Python 3
 bytes != Python 2 str) using bytes everywhere has its own problems.

Sockets send and receive bytes, and pipes created by the subprocess
module are opened in binary mode.  Pipes inherited as stdin are still
assumed to be unicode, though.
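
A minimal sketch of the subprocess half of that, assuming Python 3.5 or later for subprocess.run. The names child_code and raw are only illustrative; the child deliberately emits bytes that are not valid UTF-8:

```python
import subprocess
import sys

# Child process writes raw bytes (not valid UTF-8) to its stdout.
child_code = "import sys; sys.stdout.buffer.write(bytes([0xff, 0xfe, 0x41]))"

# No encoding or text argument is given, so the pipe stays in binary mode.
result = subprocess.run([sys.executable, "-c", child_code],
                        stdout=subprocess.PIPE)

# The parent receives the bytes untouched; nothing tried to decode them.
raw = result.stdout
print(type(raw).__name__, len(raw))
```

Decoding only happens if the caller asks for it, for example by passing an encoding argument.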


Re: Everything you did not want to know about Unicode in Python 3

2014-05-14 Thread Terry Reedy

On 5/13/2014 8:53 PM, Ethan Furman wrote:

On 05/13/2014 05:10 PM, Steven D'Aprano wrote:

On Tue, 13 May 2014 10:08:42 -0600, Ian Kelly wrote:


Because Python 3 presents stdin and stdout as text streams however, it
makes them more difficult to use with binary data, which is why Armin
sets up all that extra code to make sure his file objects are binary.


What surprises me is how hard that is. Surely there's a simpler way to
open stdin and stdout in binary mode? If not, there ought to be.


Somebody already posted this:

https://docs.python.org/3/library/sys.html#sys.stdin

which talks about .detach().


I sent a message to Armin about this.

--
Terry Jan Reedy



Re: Everything you did not want to know about Unicode in Python 3

2014-05-13 Thread Ben Finney
Gene Heskett ghesk...@wdtv.com writes:

 On Tuesday 13 May 2014 01:39:06 Mark H Harris did opine

  QOTW(so far...)

 But it's early yet, only Tuesday & it's just barely started... :)

Says who? For some of us, Tuesday is approaching sunset.

(It's always a good day to remind people that the rest of the world
exists.)

-- 
 \ “Reality must take precedence over public relations, for nature |
  `\   cannot be fooled.” —Richard P. Feynman, _Rogers' Commission |
_o__)   Report into the Challenger Crash_, 1986-06 |
Ben Finney



Re: Everything you did not want to know about Unicode in Python 3

2014-05-13 Thread Rustom Mody
On Tuesday, May 13, 2014 11:09:06 AM UTC+5:30, Mark H. Harris wrote:
 On 5/13/14 12:10 AM, Rustom Mody wrote:
 
  I think the most helpful way forward is to accept two things:
  a. Unicode is a headache
  b. No-unicode is a non-option
 
 
 QOTW(so far...)

I said that getting unicode right straight off is unrealistic.

I should have added this:
Armin makes a (sarcastic?) dig about the fact that Python (3) goofs because
it's mismatched with the assumptions of Unix.

| UNIX is bytes, has been defined that way and will always be that way. To 

| Unicode on UNIX is only madness if you force it on everything. But that's not 
| how Unicode on UNIX works. UNIX does not have a distinction between unicode 
| and byte APIs. They are one and the same which makes them easy to deal with.

| Python 3 takes a very different stance on Unicode than UNIX does. Python 3 
| says: everything is Unicode ...

This may be right...
Or it may be the other way round as I claim at 
http://blog.languager.org/2014/04/unicode-and-unix-assumption.html

At this point I don't believe that anyone is very clear on what is the
right way and what is the wrong way.


Re: Everything you did not want to know about Unicode in Python 3

2014-05-13 Thread Chris Angelico
On Tue, May 13, 2014 at 4:03 PM, Ben Finney b...@benfinney.id.au wrote:
 (It's always a good day to remind people that the rest of the world
 exists.)

Ironic that this should come up in a discussion on Unicode, given that
Unicode's fundamental purpose is to welcome that whole rest of the
world instead of yelling "LALALALALA America is everything" and
pretending that ASCII, or Latin-1, or something, is all you need.

ChrisA
Currently enjoying Monday Night Flagging on Threshold RPG... at 4pm
on Tuesday.


Re: Everything you did not want to know about Unicode in Python 3

2014-05-13 Thread alex23

On 13/05/2014 11:39 AM, Chris Angelico wrote:

On Tue, May 13, 2014 at 11:18 AM, Steven D'Aprano
steve+comp.lang.pyt...@pearwood.info wrote:

- have a bytes version of sys.argv (bargv? argvb?) and read
   the file names from that;


argb? :)


I tried and failed to come up with an argy bargy joke here so decided 
to go for a meta-reference instead.




Re: Everything you did not want to know about Unicode in Python 3

2014-05-13 Thread Chris Angelico
On Tue, May 13, 2014 at 4:25 PM, alex23 wuwe...@gmail.com wrote:
 On 13/05/2014 11:39 AM, Chris Angelico wrote:

 On Tue, May 13, 2014 at 11:18 AM, Steven D'Aprano
 steve+comp.lang.pyt...@pearwood.info wrote:

 - have a bytes version of sys.argv (bargv? argvb?) and read
the file names from that;


 argb? :)


 I tried and failed to come up with an argy bargy joke here so decided to
 go for a meta-reference instead.

I'm just waiting for someone to have need for arguments in both
network byte order and host byte order. The latter, of course, would
be argh.

ChrisA


Re: Everything you did not want to know about Unicode in Python 3

2014-05-13 Thread Mark H Harris

On 5/13/14 1:18 AM, Chris Angelico wrote:

instead of yelling LALALALALA America is everything and
pretending that ASCII, or Latin-1, or something, is all you need.



... it isn't?



LALALALALALALALALA   :))



Re: Everything you did not want to know about Unicode in Python 3

2014-05-13 Thread gregor
Am 13 May 2014 01:18:35 GMT
schrieb Steven D'Aprano steve+comp.lang.pyt...@pearwood.info:

 
 - have a simple way to write bytes to stdout and stderr.

there is the underlying binary buffer:

https://docs.python.org/3/library/sys.html#sys.stdin
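
A rough sketch of using that buffer. The helper name write_bytes is made up for illustration, and the demo runs against an in-memory stream built the same way as sys.stdout so that it does not depend on how stdout is set up:

```python
import io
import sys

def write_bytes(data, stream=None):
    """Write raw bytes through a text stream's underlying binary buffer."""
    stream = sys.stdout if stream is None else stream
    # Flush pending text first so text and bytes are not reordered.
    stream.flush()
    stream.buffer.write(data)
    stream.buffer.flush()

# Stand-in for sys.stdout: a text wrapper layered over a binary stream.
raw = io.BytesIO()
text = io.TextIOWrapper(raw, encoding="utf-8")

text.write("text first, ")
write_bytes(b"then raw bytes: \xff\xfe\n", text)
captured = raw.getvalue()
```

Calling write_bytes(data) with no stream argument targets the real sys.stdout, provided it has not been replaced by an object that lacks a .buffer attribute.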

greg



Re: Everything you did not want to know about Unicode in Python 3

2014-05-13 Thread Johannes Bauer
On 13.05.2014 03:18, Steven D'Aprano wrote:

 Armin Ronacher is an extremely experienced and knowledgeable Python 
 developer, and a Python core developer. He might be wrong, but he's not 
 *obviously* wrong.

He's correct about file name encodings. Which can be fixed really easily
without messing everything up (sys.argv binary variant, open accepting
binary filenames). But that he suggests that Go would be superior:

 Which uses an even simpler model than Python 2: everything is a byte string. 
 The assumed encoding is UTF-8. End of the story.

Is just a horrible idea. An obviously horrible idea, too.

Having dealt with the UTF-8 problems on Python2 I can safely say that I
never, never ever want to go back to that freaky hell. If I deal with
strings, I want to be able to sanely manipulate them and I want to be
sure that after manipulation they're still valid strings. Manipulating
the bytes representation of unicode data just doesn't work.
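
To make that concrete, here is a small sketch (the variable names are arbitrary) of how slicing the UTF-8 bytes can cut a character in half, while slicing the str cannot:

```python
text = "na\u00efve caf\u00e9"     # "naive cafe" with two accented letters
encoded = text.encode("utf-8")

# The two views disagree about length: 10 code points, 12 bytes.
print(len(text), len(encoded))

# Slicing the str works on code points, so the result is still a valid string.
first_three_chars = text[:3]

# Slicing the bytes keeps 'n', 'a' and only the first byte of the 2-byte letter.
chopped = encoded[:3]
try:
    chopped.decode("utf-8")
    still_valid = True
except UnicodeDecodeError:
    still_valid = False
```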

And I'm very very glad that some people felt the same way and
implemented a sane, consistent way of dealing with Unicode in Python3.
It's one of the reasons why I switched to Py3 very early and I love it.

Cheers,
Johannes

-- 
 Where exactly did you predict the quake again?
 At least not publicly!
Ah, the latest and to this day most brilliant stunt of our great
cosmologists: the secret prediction.
 - Karl Kaos on Rüdiger Thomas in dsa hidbv3$om2$1...@speranza.aioe.org


Re: Everything you did not want to know about Unicode in Python 3

2014-05-13 Thread Marko Rauhamaa
Johannes Bauer dfnsonfsdu...@gmx.de:

 Having dealt with the UTF-8 problems on Python2 I can safely say that
 I never, never ever want to go back to that freaky hell. If I deal
 with strings, I want to be able to sanely manipulate them and I want
 to be sure that after manipulation they're still valid strings.
 Manipulating the bytes representation of unicode data just doesn't
 work.

Based on my background (network and system programming), I'm a bit
suspicious of strings, that is, text. For example, is the stuff that
goes to syslog bytes or text? Does an XML file contain bytes or
(encoded) text? The answers are not obvious to me. Modern computing is
full of ASCII-esque binary communication standards and formats.

Python 2's ambiguity allows me not to answer the tough philosophical
questions. I'm not saying it's necessarily a good thing, but it has its
benefits.


Marko


Re: Everything you did not want to know about Unicode in Python 3

2014-05-13 Thread Chris Angelico
On Tue, May 13, 2014 at 6:25 PM, Marko Rauhamaa ma...@pacujo.net wrote:
 Johannes Bauer dfnsonfsdu...@gmx.de:

 Having dealt with the UTF-8 problems on Python2 I can safely say that
 I never, never ever want to go back to that freaky hell. If I deal
 with strings, I want to be able to sanely manipulate them and I want
 to be sure that after manipulation they're still valid strings.
 Manipulating the bytes representation of unicode data just doesn't
 work.

 Based on my background (network and system programming), I'm a bit
 suspicious of strings, that is, text. For example, is the stuff that
 goes to syslog bytes or text? Does an XML file contain bytes or
 (encoded) text? The answers are not obvious to me. Modern computing is
 full of ASCII-esque binary communication standards and formats.

These are problems that Unicode can't solve. In theory, XML should
contain text in a known encoding (defaulting to UTF-8). With syslog,
it's problematic - I don't remember what it's meant to be, but I know
there are issues. Same with other log files.

 Python 2's ambiguity allows me not to answer the tough philosophical
 questions. I'm not saying it's necessarily a good thing, but it has its
 benefits.

It's not a good thing. It means that you have the convenience of
pretending there's no problem, which means you don't notice trouble
until something happens... and then, in all probability, your app is
in production and you have no idea why stuff went wrong.

ChrisA


Re: Everything you did not want to know about Unicode in Python 3

2014-05-13 Thread Marko Rauhamaa
Chris Angelico ros...@gmail.com:

 These are problems that Unicode can't solve.

I actually think the problem has little to do with Unicode. Text is an
abstract data type just like any class. If I have an object (say, a
subprocess or a dictionary) in memory, I don't expect the object to have
any existence independently of the Python virtual machine. I have the
same feeling about Py3 strings: they only exist inside the Python
virtual machine.

An abstract object like a subprocess or dictionary justifies its
existence through its behaviour (its quacking). Now, do strings quack or
are they silent? I guess if you are writing a word processor they might
quack to you. Otherwise, they are just an esoteric storage format.

What I'm saying is that strings definitely have an important application
in the human interface. However, I feel strings might be overused in the
Py3 API. Case in point: are pathnames bytes objects or strings? The
linux position is that they are bytes objects. Py3 supports both
interpretations seemingly throughout:

   open(b"/bin/ls")            vs    open("/bin/ls")
   os.path.join(b"a", b"b")    vs    os.path.join("a", "b")
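
A quick sketch of that dual support, using os.path.join only. The result names are illustrative; the point is that each call must stay within one interpretation:

```python
import os.path

# str components in, str path out.
str_path = os.path.join("a", "b")

# bytes components in, bytes path out.
bytes_path = os.path.join(b"a", b"b")

# Mixing the two interpretations in a single call is rejected.
try:
    os.path.join("a", b"b")
    mixed_ok = True
except TypeError:
    mixed_ok = False
```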


Marko


Re: Everything you did not want to know about Unicode in Python 3

2014-05-13 Thread Chris Angelico
On Tue, May 13, 2014 at 7:06 PM, Marko Rauhamaa ma...@pacujo.net wrote:
 Chris Angelico ros...@gmail.com:

 These are problems that Unicode can't solve.

 I actually think the problem has little to do with Unicode. Text is an
 abstract data type just like any class. If I have an object (say, a
 subprocess or a dictionary) in memory, I don't expect the object to have
 any existence independently of the Python virtual machine. I have the
 same feeling about Py3 strings: they only exist inside the Python
 virtual machine.

That's true; the only difference is that text is extremely prevalent.
You can share a dict with another program, or store it in a file, or
whatever, simply by agreeing on an encoding - for instance, JSON. As
long as you and the other program know that this file is JSON encoded,
you can write it and he can read it, and you'll get the right data at
the far end. It's no different; there are encodings that are easy to
handle and have limitations, and there are encodings that are
elaborate and have lots of features (XML comes to mind, although
technically you can't encode a dict in XML).
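
As a sketch of that agreement (the record contents are invented for the example), both sides only need to know "this is UTF-8 encoded JSON":

```python
import json

record = {"name": "Zo\u00eb", "languages": ["Python", "Pike"], "year": 2014}

# Sender: dict -> JSON text -> bytes.
wire_bytes = json.dumps(record).encode("utf-8")

# Receiver: bytes -> JSON text -> dict.
received = json.loads(wire_bytes.decode("utf-8"))

# One of the limitations: JSON object keys are always strings,
# so an int key comes back as the string "1".
lossy = json.loads(json.dumps({1: "one"}))
```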

 Case in point: are pathnames bytes objects or strings? The
 linux position is that they are bytes objects. Py3 supports both
 interpretations seemingly throughout:

 open(b"/bin/ls")            vs    open("/bin/ls")
 os.path.join(b"a", b"b")    vs    os.path.join("a", "b")

That's a problem that comes from the underlying file systems. If every
FS in the world worked with Unicode file names, it would be easy.
(Most would encode them onto the platters in UTF-8 or maybe UTF-16;
some might choose to use a PEP 393 or Pike string structure, with the
size_shift being a file mode just like the 'directory' bit; others
might use a limited encoding for legacy reasons, storing uppercased
CP437 on the disk, and returning an error if the desired name didn't
fit.) But since they don't, we have to cope with that. What happens if
you're running on Linux, and you have a mounted drive from an OS/2
share, and inside that, you access an aliased drive that represents a
Windows share, on which you've mounted a remote-backup share? A single
path name could have components parsed by each of those systems, so
what's its encoding? How do you handle that? There's no solution.
(Well, okay. There is a solution: don't do something so stupidly
convoluted. But there's no law against cackling admins making circular
mounts. In fact, I just mounted my own home directory as a
subdirectory under my home directory, via sshfs. I can now encrypt my
own file reads and writes exactly as many times as I choose to. I also
cackled.)

ChrisA


Re: Everything you did not want to know about Unicode in Python 3

2014-05-13 Thread Johannes Bauer
On 13.05.2014 10:38, Chris Angelico wrote:

 Python 2's ambiguity allows me not to answer the tough philosophical
 questions. I'm not saying it's necessarily a good thing, but it has its
 benefits.
 
 It's not a good thing. It means that you have the convenience of
 pretending there's no problem, which means you don't notice trouble
 until something happens... and then, in all probability, your app is
 in production and you have no idea why stuff went wrong.

Exactly. With Py2 strings you never know what encoding they are, if
they already have been converted or something like that. And it's very
well possible to mix already converted strings with other, not yet
encoded strings. What a mess!

All these issues are avoided by Py3. There is a very clear distinction
between strings and string representation (data bytes), which is
beautiful. Accidental mixing is not possible. And you have some things
*guaranteed* for the string type which aren't guaranteed for the bytes
type (for example when doing string manipulation).
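
A small sketch of what "accidental mixing is not possible" looks like in practice. The values are made up; the bytes stand for data read from some external source:

```python
text = "price: "
data = "99\u00a2".encode("utf-8")    # bytes, e.g. read from a socket

try:
    combined = text + data           # str + bytes
except TypeError as exc:
    combined = None
    error_message = str(exc)

# The fix is to decode at the boundary, naming the encoding explicitly.
fixed = text + data.decode("utf-8")
```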

Regards,
Johannes

-- 
 Where exactly did you predict the quake again?
 At least not publicly!
Ah, the latest and to this day most brilliant stunt of our great
cosmologists: the secret prediction.
 - Karl Kaos on Rüdiger Thomas in dsa hidbv3$om2$1...@speranza.aioe.org


Re: Everything you did not want to know about Unicode in Python 3

2014-05-13 Thread Steven D'Aprano
On Tue, 13 May 2014 12:06:50 +0300, Marko Rauhamaa wrote:

 Chris Angelico ros...@gmail.com:
 
 These are problems that Unicode can't solve.
 
 I actually think the problem has little to do with Unicode. Text is an
 abstract data type just like any class. If I have an object (say, a
 subprocess or a dictionary) in memory, I don't expect the object to have
 any existence independently of the Python virtual machine. I have the
 same feeling about Py3 strings: they only exist inside the Python
 virtual machine.

And you would be correct. When you write them to a device (say, push them 
over a network, or write them to a file) they need to be serialized. If 
you're lucky, you have an API that takes a string and serializes it for 
you, and then all you have to deal with is:

- am I happy with the default encoding?

- if not, what encoding do I want?

Otherwise you ought to have an API that requires bytes, not strings, and 
you have to perform your own serialization by encoding it.

But abstractions leak, and this abstraction leaks because *right now* 
there isn't a single serialization for text strings. There are HUNDREDS, 
and sometimes you don't know which one is being used.
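
For illustration, here is one short string under three of those serializations, and what happens when the reader guesses wrong (the string and names are arbitrary):

```python
s = "caf\u00e9"

as_utf8 = s.encode("utf-8")        # b'caf\xc3\xa9'
as_latin1 = s.encode("latin-1")    # b'caf\xe9'
as_utf16 = s.encode("utf-16-le")   # b'c\x00a\x00f\x00\xe9\x00'

# A wrong guess can silently produce mojibake...
mojibake = as_utf8.decode("latin-1")

# ...or fail outright, depending on which way round the mistake is made.
try:
    as_latin1.decode("utf-8")
    wrong_guess_failed = False
except UnicodeDecodeError:
    wrong_guess_failed = True
```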


[...]
 What I'm saying is that strings definitely have an important application
 in the human interface. However, I feel strings might be overused in the
 Py3 API. Case in point: are pathnames bytes objects or strings?

Yes. On POSIX systems, file names are sequences of bytes, with a very few 
restrictions. On recent Windows file systems (NTFS I believe?), file 
names are Unicode strings encoded to UTF-16, but with a whole lot of 
other restrictions imposed by the OS.


 The
 linux position is that they are bytes objects. Py3 supports both
 interpretations seemingly throughout:
 
 open(b"/bin/ls")            vs    open("/bin/ls")
 os.path.join(b"a", b"b")    vs    os.path.join("a", "b")

Because it has to, otherwise there will be files that are unreachable on 
one platform or another.
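
One way the str interpretation can still reach arbitrary POSIX names is the surrogateescape error handler. As far as I know this is what os.fsdecode and os.fsencode apply on POSIX together with the filesystem encoding; the sketch below names UTF-8 explicitly so the result does not depend on the locale:

```python
# A POSIX file name that is valid as bytes but not valid UTF-8.
raw_name = b"caf\xe9.txt"

# Undecodable bytes become lone surrogates instead of raising an error...
as_str = raw_name.decode("utf-8", errors="surrogateescape")

# ...and encoding with the same handler restores the original bytes exactly.
round_tripped = as_str.encode("utf-8", errors="surrogateescape")
```

The resulting str is fine for passing back to the OS, but printing it or encoding it without the handler will raise.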


-- 
Steven


Re: Everything you did not want to know about Unicode in Python 3

2014-05-13 Thread Johannes Bauer
On 13.05.2014 10:25, Marko Rauhamaa wrote:

 Based on my background (network and system programming), I'm a bit
 suspicious of strings, that is, text. For example, is the stuff that
 goes to syslog bytes or text? Does an XML file contain bytes or
 (encoded) text? The answers are not obvious to me. Modern computing is
 full of ASCII-esque binary communication standards and formats.

Traditional Unix programs (syslog for example) are notorious for being
unclear, ambiguous and/or ignorant of character encodings altogether. And
this works, unfortunately, most of the time because many encodings
share a common subset. If they didn't, the problems would be VERY
apparent and people would be forced to handle the issues less sloppily.
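
A sketch of that common subset at work. The log lines are invented; the second one contains a single Latin-1 encoded byte:

```python
ascii_only = b"disk full on /dev/sda1"
with_accent = b"Fehler: Datei nicht ge\xf6ffnet"   # one Latin-1 byte, 0xf6

encodings = ["ascii", "latin-1", "cp1252", "utf-8"]

# Pure-ASCII bytes decode to the same text under every one of these.
ascii_results = {ascii_only.decode(enc) for enc in encodings}

# A single non-ASCII byte is enough to make the guesses disagree or fail.
outcomes = {}
for enc in encodings:
    try:
        outcomes[enc] = with_accent.decode(enc)
    except UnicodeDecodeError:
        outcomes[enc] = None
```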

Which is the route that Py3 chose. Don't be sloppy, make a great
distinction between text (which handles naturally as strings) and its
respective encoding.

The only people who are angered by this now are people who always treated
encodings sloppily and it just worked. Well, there's a good chance it
has worked by pure chance so far. It's a good thing that Python does
this now more strictly as it gives developers *guarantees* about what
they can and cannot do with text datatypes without having to deal with
encoding issues in many places. Just one place: The interface where text
is read or written, just as it should be.

Regards,
Johannes

-- 
 Where exactly did you predict the quake again?
 At least not publicly!
Ah, the latest and to this day most brilliant stunt of our great
cosmologists: the secret prediction.
 - Karl Kaos on Rüdiger Thomas in dsa hidbv3$om2$1...@speranza.aioe.org


Re: Everything you did not want to know about Unicode in Python 3

2014-05-13 Thread Marko Rauhamaa
Johannes Bauer dfnsonfsdu...@gmx.de:

 The only people who are angered by this now is people who always
 treated encodings sloppily and it just worked. Well, there's a good
 chance it has worked by pure chance so far. It's a good thing that
 Python does this now more strictly as it gives developers *guarantees*
 about what they can and cannot do with text datatypes without having
 to deal with encoding issues in many places. Just one place: The
 interface where text is read or written, just as it should be.

I'm not angered by text. I'm just wondering if it has any practical use
that is not misuse...

For example, Py3 should not make any pretense that there is a default
encoding for strings. Locales are an abhorrent invention from the early
8-bit days. IOW, you should never input or output text without explicit
serialization.
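
A sketch of what explicit serialization looks like with the existing API, assuming a throwaway temp directory and an illustrative file name:

```python
import os
import tempfile

text = "na\u00efve"
path = os.path.join(tempfile.mkdtemp(), "note.txt")

# State the encoding explicitly instead of inheriting it from the locale.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# On disk there are only bytes; text exists again once they are decoded.
with open(path, "rb") as f:
    on_disk = f.read()

restored = on_disk.decode("utf-8")
```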

I get the feeling that Py3 would like to present a world where strings
are first-class I/O objects that can exist in files, in filenames,
inside pipes. You say, text is read or written. I'm saying text is
never read or written. It only exists as an abstraction (not even
unicode) inside the virtual machine.


Marko


Re: Everything you did not want to know about Unicode in Python 3

2014-05-13 Thread alister
On Tue, 13 May 2014 01:18:35 +, Steven D'Aprano wrote:

 On Mon, 12 May 2014 17:47:48 +, alister wrote:
 
 On Mon, 12 May 2014 16:19:17 +0100, Mark Lawrence wrote:
 
 This was *NOT* written by our resident unicode expert
 http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/
 
 Posted as I thought it would make a rather pleasant change from
 interminable threads about names vs values vs variables vs objects.
 
 Surely those example programs are not the Pythonic way to do things or
 am I missing something?
 
 Armin Ronacher is an extremely experienced and knowledgeable Python
 developer, and a Python core developer. He might be wrong, but he's not
 *obviously* wrong.
 
I am only an amateur Python coder, which is why I asked if I am missing
something.

I could not see any reason to be using the shutil module if all that the
program is doing is opening a file, reading it & then printing it.

Is it Python that causes the issue, the shutil module, or just the OS not
liking the data it is being sent?

An explanation of why this approach is taken would be much appreciated.



-- 
Revenge is a form of nostalgia.


Re: Everything you did not want to know about Unicode in Python 3

2014-05-13 Thread Roy Smith
In article mailman.9939.1399961928.18130.python-l...@python.org,
 Chris Angelico ros...@gmail.com wrote:

 On Tue, May 13, 2014 at 4:03 PM, Ben Finney b...@benfinney.id.au wrote:
  (It's always a good day to remind people that the rest of the world
  exists.)
 
 Ironic that this should come up in a discussion on Unicode, given that
 Unicode's fundamental purpose is to welcome that whole rest of the
 world instead of yelling LALALALALA America is everything and
 pretending that ASCII, or Latin-1, or something, is all you need.

ASCII *is* all I need.  The problem is, it's not all that other people 
need, and I need to interact with those other people.


Re: Everything you did not want to know about Unicode in Python 3

2014-05-13 Thread Mark Lawrence

On 13/05/2014 09:38, Chris Angelico wrote:


It's not a good thing. It means that you have the convenience of
pretending there's no problem, which means you don't notice trouble
until something happens... and then, in all probability, your app is
in production and you have no idea why stuff went wrong.



Unless you're (un)lucky enough to be working on IIRC the 1/3 of major IT 
projects that deliver nothing :)


--
My fellow Pythonistas, ask not what our language can do for you, ask 
what you can do for our language.


Mark Lawrence





Re: Everything you did not want to know about Unicode in Python 3

2014-05-13 Thread Chris Angelico
On Tue, May 13, 2014 at 11:30 PM, Mark Lawrence breamore...@yahoo.co.uk wrote:
 On 13/05/2014 09:38, Chris Angelico wrote:


 It's not a good thing. It means that you have the convenience of
 pretending there's no problem, which means you don't notice trouble
 until something happens... and then, in all probability, your app is
 in production and you have no idea why stuff went wrong.


 Unless you're (un)lucky enough to be working on IIRC the 1/3 of major IT
 projects that deliver nothing :)

Been there, done that. At least, most likely so... there is a chance,
albeit slim, that the boss/owner will either discover someone who'll
finish the project for him, or find the time to finish it himself. I
gather he's looking at ripping all my code out and replacing it with
PHP of his own design, which should be fun. On the plus side, that
does mean he can get any idiot straight out of a uni course to do the
work; much easier than finding someone who knows Python, Pike, bash,
and C++. The White King told Alice that cynicism is a disease that can
be cured... but it can also be inflicted, and a promising-looking
N-year project that collapses because the boss starts getting stupid
with code formatting rules and then ends up firing his last remaining
competent employee is a pretty effective means of instilling cynicism.

ChrisA


Re: Everything you did not want to know about Unicode in Python 3

2014-05-13 Thread Steven D'Aprano
On Tue, 13 May 2014 07:20:34 -0400, Roy Smith wrote:

 ASCII *is* all I need.

You've never needed to copyright something? Copyright © Roy Smith 2014... 
I know some people use (c) instead, but that actually has no legal 
standing. (Not that any reasonable judge would invalidate a copyright 
based on a technicality like that, not these days.)

Or price something in cents? I suppose the days of the 25¢ steak dinner 
are long gone, but you might need to sell something for 99¢ a pound... 


 The problem is, it's not all that other people
 need, and I need to interact with those other people.

True, true.



-- 
Steven D'Aprano
http://import-that.dreamwidth.org/


Re: Everything you did not want to know about Unicode in Python 3

2014-05-13 Thread Chris Angelico
On Tue, May 13, 2014 at 11:39 PM, Steven D'Aprano
steve+comp.lang.pyt...@pearwood.info wrote:
 You've never needed to copyright something? Copyright © Roy Smith 2014...
 I know some people use (c) instead, but that actually has no legal
 standing. (Not that any reasonable judge would invalidate a copyright
 based on a technicality like that, not these days.)

Copyright Chris Angelico 2014. The full word copyright has legal
standing. I tend to stick with that in my README files; staying ASCII
makes it that bit safer for random text editors
(*cough*Notepad*cough*) that might otherwise misinterpret it (only a
bit, though [1]).

 Or price something in cents? I suppose the days of the 25¢ steak dinner
 are long gone, but you might need to sell something for 99¢ a pound...

$0.99/lb? :)

ChrisA

[1] https://en.wikipedia.org/wiki/Bush_hid_the_facts
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Everything you did not want to know about Unicode in Python 3

2014-05-13 Thread Grant Edwards
On 2014-05-13, Chris Angelico ros...@gmail.com wrote:
 On Tue, May 13, 2014 at 4:03 PM, Ben Finney b...@benfinney.id.au wrote:
 (It's always a good day to remind people that the rest of the world
 exists.)

 Ironic that this should come up in a discussion on Unicode, given that
 Unicode's fundamental purpose is to welcome that whole rest of the
 world instead of yelling LALALALALA America is everything and
 pretending that ASCII, or Latin-1, or something, is all you need.

Well, strictly speaking, ASCII or Latin-1 _is_ all I need.

I will however admit to the existence of other people who might need
something else...

-- 
Grant Edwards               grant.b.edwards at gmail.com
Yow! How many retired bricklayers from FLORIDA are out purchasing
PENCIL SHARPENERS right NOW??
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Everything you did not want to know about Unicode in Python 3

2014-05-13 Thread Grant Edwards
On 2014-05-13, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote:
 On Tue, 13 May 2014 07:20:34 -0400, Roy Smith wrote:

 ASCII *is* all I need.

 You've never needed to copyright something? Copyright © Roy Smith 2014...

Bah.  You don't need the little copyright symbol at all.  The
statement without the symbol has the exact same legal weight.

-- 
Grant Edwards               grant.b.edwards at gmail.com
Yow! World War Three can be averted by adherence to a strictly
enforced dress code!
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Everything you did not want to know about Unicode in Python 3

2014-05-13 Thread Skip Montanaro
On Tue, May 13, 2014 at 3:38 AM, Chris Angelico ros...@gmail.com wrote:
 Python 2's ambiguity allows me not to answer the tough philosophical
 questions. I'm not saying it's necessarily a good thing, but it has its
 benefits.

 It's not a good thing. It means that you have the convenience of
 pretending there's no problem, which means you don't notice trouble
 until something happens... and then, in all probability, your app is
 in production and you have no idea why stuff went wrong.

BITD, when I still maintained and developed Musi-Cal (an early online
concert calendar, long since gone), I faced a challenge when I first
started encountering non-ASCII band names and cities. I resisted UTF-8.
After all, if I printed a string containing an é, it came out looking like



What kind of mess was that???

I tried to ignore it, or assume Latin-1 would cover all the bases (my first
non-ASCII inputs tended to come from Western Europe). If nothing else, at
least é was legible.

Needless to say, those approaches didn't work well. After perhaps six
months or a year, I broke down and started converting everything coming in
or going out to UTF-8 at the boundaries of my system (making educated
guesses at input encodings if necessary). My life got a whole lot easier
after that. The distinction between bytes and text didn't really matter
much, certainly not compared to the mess I had before where strings of
unknown data leaked into my system and its database.
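
A minimal sketch of that convert-at-the-boundaries approach, in Python 3. The function names and the "try UTF-8, then fall back to Latin-1" guess order are illustrative assumptions, not the code Musi-Cal actually ran:

```python
def decode_input(raw, guesses=("utf-8", "latin-1")):
    """Turn incoming bytes into text, trying each guessed encoding in order."""
    for encoding in guesses:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Latin-1 accepts any byte, so this is reached only with a custom guess list.
    return raw.decode("utf-8", errors="replace")


def encode_output(text):
    """Everything leaving the system is encoded as UTF-8."""
    return text.encode("utf-8")


# UTF-8 bytes misread as Latin-1 are the classic source of garbled output.
mojibake = "é".encode("utf-8").decode("latin-1")
```

Inside the system everything is then a plain str; only these two functions ever see bytes.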

Skip

P.S. My apologies for the mess this message probably is. Amazing as it may
seem, Gmail in Chrome does a crappy job editing anything other than plain
text. Also, I'm surprised in this day and age that common tools like Gnome
Terminal have little or no encoding support. I wound up having to pop up
urxvt to get an encodings-flexible terminal emulator...
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Everything you did not want to know about Unicode in Python 3

2014-05-13 Thread Rustom Mody
On Tuesday, May 13, 2014 7:13:47 PM UTC+5:30, Chris Angelico wrote:
 On Tue, May 13, 2014 at 11:39 PM, Steven D'Aprano
  Or price something in cents? I suppose the days of the 25¢ steak dinner
  are long gone, but you might need to sell something for 99¢ a pound...
 
 
 $0.99/lb? :)

Dollars Zeros Slashes Question marks Smileys...
Just alphabets is enough I think...

Come to think of it why have anything other than zeros and ones?
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Everything you did not want to know about Unicode in Python 3

2014-05-13 Thread Chris Angelico
On Wed, May 14, 2014 at 12:30 AM, Rustom Mody rustompm...@gmail.com wrote:
 Come to think of it why have anything other than zeros and ones?

Obligatory: http://xkcd.com/257/

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Everything you did not want to know about Unicode in Python 3

2014-05-13 Thread alister
On Tue, 13 May 2014 13:51:20 +, Grant Edwards wrote:

 On 2014-05-13, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info
 wrote:
 On Tue, 13 May 2014 07:20:34 -0400, Roy Smith wrote:

 ASCII *is* all I need.

 You've never needed to copyright something? Copyright © Roy Smith
 2014...
 
 Bah.  You don't need the little copyright symbol at all.  The statement
 without the symbol has the exact same legal weight.


You do not need any statements at all; copyright is automatically assigned
to anything you create (at least that is the case in UK law), although
proving the creation date may be difficult.



-- 
Depends on how you define always.  :-)
 -- Larry Wall in 199710211647.jaa17...@wall.org
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Everything you did not want to know about Unicode in Python 3

2014-05-13 Thread Grant Edwards
On 2014-05-13, alister alister.nospam.w...@ntlworld.com wrote:
 On Tue, 13 May 2014 13:51:20 +, Grant Edwards wrote:

 On 2014-05-13, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info
 wrote:
 On Tue, 13 May 2014 07:20:34 -0400, Roy Smith wrote:

 ASCII *is* all I need.

 You've never needed to copyright something? Copyright © Roy Smith
 2014...
 
 Bah.  You don't need the little copyright symbol at all.  The statement
 without the symbol has the exact same legal weight.

 You do not need any statements at all; copyright is automatically assigned
 to anything you create (at least that is the case in UK law),
 although proving the creation date may be difficult.

Yep, it's the same in the US.

-- 
Grant Edwards               grant.b.edwards at gmail.com
Yow! Hello.  Just walk along and try NOT to think about your
INTESTINES being almost FORTY YARDS LONG!!
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Everything you did not want to know about Unicode in Python 3

2014-05-13 Thread Ian Kelly
On Tue, May 13, 2014 at 5:19 AM, alister
alister.nospam.w...@ntlworld.com wrote:
 I am only an amateur python coder which is why I asked if I am missing
 something

 I could not see any reason to be using the shutil module if all that the
 program is doing is opening a file, reading it, then printing it.

 is it python that causes the issue, the shutil module or just the OS not
 liking the data it is being sent?

 an explanation of why this approach is taken would be much appreciated.

No, that part is perfectly fine.  This is exactly what the shutil
module is meant for: providing shell-like operations.  Although in
this case the copyfileobj function is quite simple (have yourself a
look at the source -- it just reads from one file and writes to the
other in a loop), in general the Pythonic thing is to avoid
reinventing the wheel.
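
For reference, the loop in question amounts to roughly the following. This is a sketch of the idea rather than the standard library's actual source; copy_stream and the chunk size are made-up names and values:

```python
import io


def copy_stream(src, dst, chunk_size=64 * 1024):
    """Copy src to dst in fixed-size chunks until src is exhausted."""
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # an empty read signals end of file
            break
        dst.write(chunk)


# In-memory stand-ins for two files opened in binary mode.
source = io.BytesIO(b"\xff\xfe not valid UTF-8, copied untouched")
target = io.BytesIO()
copy_stream(source, target)
```

Nothing in the loop cares what the bytes mean, which is why it cannot be the source of any encoding trouble.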

And since it's so simple, it shouldn't be hard to see that the use of
the shutil module has nothing to do with the Unicode woes here.  The
crux of the issue is that a general-purpose command like cat typically
can't know the encoding of its input and can't assume anything about
it. In fact, there may not even be an encoding; cat can be used with
binary data.  The only non-destructive approach then is to copy the
binary data straight from the source to the destination with no
decoding steps at all, and trust the user to ensure that the
destination will be able to accommodate the source encoding.  Because
Python 3 presents stdin and stdout as text streams however, it makes
them more difficult to use with binary data, which is why Armin sets
up all that extra code to make sure his file objects are binary.
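
A stripped-down sketch of such a byte-for-byte cat, assuming sys.stdin and sys.stdout are the usual text wrappers that expose a .buffer attribute. It is not Armin's code, and the cat function and its out parameter are invented here for illustration:

```python
import shutil
import sys


def cat(paths, out=None):
    """Copy each named file to out as raw bytes; '-' means standard input."""
    if out is None:
        out = sys.stdout.buffer  # binary layer beneath the text-mode stdout
    for path in paths:
        if path == "-":
            shutil.copyfileobj(sys.stdin.buffer, out)
        else:
            with open(path, "rb") as src:  # "rb": no decoding step at all
                shutil.copyfileobj(src, out)
    out.flush()
```

Calling cat(sys.argv[1:]) would give the command-line behaviour. The sketch skips the harder cases (stdin or stdout replaced by objects without a .buffer, error reporting), which may be where the extra code in the original comes from.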
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Everything you did not want to know about Unicode in Python 3

2014-05-13 Thread Steven D'Aprano
On Tue, 13 May 2014 14:42:51 +, alister wrote:

 On Tue, 13 May 2014 13:51:20 +, Grant Edwards wrote:
 
 On 2014-05-13, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info
 wrote:
 On Tue, 13 May 2014 07:20:34 -0400, Roy Smith wrote:

 ASCII *is* all I need.

 You've never needed to copyright something? Copyright © Roy Smith
 2014...
 
 Bah.  You don't need the little copyright symbol at all.  The statement
 without the symbol has the exact same legal weight.
 
 
 You do not need any statements at all; copyright is automatically assigned
 to anything you create (at least that is the case in UK law), although
 proving the creation date may be difficult.

(1) In my lifetime, that wasn't always the case. Up until the 1970s or 
thereabouts, you had to explicitly register anything you wanted 
copyrighted, a much more sensible system which weeded out the meaningless 
copyrights on economically worthless content. If we still had that 
system, orphan works would be a lesser problem.

With the current system, all of us here are technically violating 
copyright every time we reply to an email and quote more than a small 
percentage of it. Not to mention all the mirror sites that violate 
copyright by mirroring our posts in their entirety without permission.

(Author's moral rights not to be misquoted or plagiarised are a different 
kettle of fish separate from their ownership rights over the work. That 
should be automatic.)

(2) You don't have to just prove copyright. You also have to *identify* 
who the work is copyrighted by, and it needs to be an identifiable legal 
person (actual person or corporation), not necessarily the author. In the 
absence of a statement otherwise, copyright is assumed to be held by the 
author, but that's not always the case -- it might be a work for hire, or 
copyright might have been transferred to another person or entity. Or the 
author is unidentifiable. Hence the orphan work problem: it's presumed to 
be copyrighted, but since nobody knows who owns the copyright, there's no 
way to get permission to copy that work. It might as well be lost, even 
when the original is sitting right there in front of you mouldering away.



-- 
Steven D'Aprano
http://import-that.dreamwidth.org/
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Everything you did not want to know about Unicode in Python 3

2014-05-13 Thread Chris Angelico
On Wed, May 14, 2014 at 9:53 AM, Steven D'Aprano
steve+comp.lang.pyt...@pearwood.info wrote:
 With the current system, all of us here are technically violating
 copyright every time we reply to an email and quote more than a small
 percentage of it.

Oh wow... so when someone quotes heaps of text without trimming, and
adding blank lines, we can complain that it's a copyright violation -
reproducing our work with unauthorized modifications and without
permission...

I never thought of it like that.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Everything you did not want to know about Unicode in Python 3

2014-05-13 Thread Steven D'Aprano
On Tue, 13 May 2014 10:08:42 -0600, Ian Kelly wrote:

 Because Python 3 presents stdin and stdout as text streams however, it
 makes them more difficult to use with binary data, which is why Armin
 sets up all that extra code to make sure his file objects are binary.

What surprises me is how hard that is. Surely there's a simpler way to 
open stdin and stdout in binary mode? If not, there ought to be.




-- 
Steven D'Aprano
http://import-that.dreamwidth.org/
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Everything you did not want to know about Unicode in Python 3

2014-05-13 Thread Ethan Furman

On 05/13/2014 05:10 PM, Steven D'Aprano wrote:

On Tue, 13 May 2014 10:08:42 -0600, Ian Kelly wrote:


Because Python 3 presents stdin and stdout as text streams however, it
makes them more difficult to use with binary data, which is why Armin
sets up all that extra code to make sure his file objects are binary.


What surprises me is how hard that is. Surely there's a simpler way to
open stdin and stdout in binary mode? If not, there ought to be.


Somebody already posted this:

https://docs.python.org/3/library/sys.html#sys.stdin

which talks about .detach().
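
A small sketch of the two options that page describes, using an in-memory wrapper as a stand-in for sys.stdout so that nothing real gets detached. The helper names are made up for this example:

```python
import io


def binary_layer(text_stream):
    """Return the byte stream underneath a text stream, leaving it attached."""
    return text_stream.buffer


def detach_binary(text_stream):
    """Separate and return the byte stream; text_stream is unusable afterwards."""
    text_stream.flush()
    return text_stream.detach()


# Stand-in for sys.stdout so the sketch runs without touching the real one.
fake_stdout = io.TextIOWrapper(io.BytesIO(), encoding="utf-8")
raw = detach_binary(fake_stdout)
raw.write(b"\xff\xfe raw bytes\n")
```

With the real streams that would be sys.stdout.buffer or sys.stdout.detach(). After detach() the old text wrapper raises on further use, so .buffer is the gentler choice when other code still prints.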

--
~Ethan~
--
https://mail.python.org/mailman/listinfo/python-list


Re: Everything you did not want to know about Unicode in Python 3

2014-05-12 Thread alister
On Mon, 12 May 2014 16:19:17 +0100, Mark Lawrence wrote:

 This was *NOT* written by our resident unicode expert
 http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/
 
 Posted as I thought it would make a rather pleasant change from
 interminable threads about names vs values vs variables vs objects.

Surely those example programs are not the Pythonic way to do things, or
am I missing something?

if those code samples are anything to go by this guy makes JMF look 
sensible.



-- 
The Heineken Uncertainty Principle:
You can never be sure how many beers you had last night.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Everything you did not want to know about Unicode in Python 3

2014-05-12 Thread Ian Kelly
On Mon, May 12, 2014 at 11:47 AM, alister
alister.nospam.w...@ntlworld.com wrote:
 On Mon, 12 May 2014 16:19:17 +0100, Mark Lawrence wrote:

 This was *NOT* written by our resident unicode expert
 http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/

 Posted as I thought it would make a rather pleasant change from
 interminable threads about names vs values vs variables vs objects.

 Surely those example programs are not the Pythonic way to do things, or
 am I missing something?

The _is_binary_reader and _is_binary_writer functions look like they
could be simplified by calling isinstance on the io object itself
against io.TextIOBase, io.BufferedIOBase or io.RawIOBase, rather than
doing those odd 0-length reads and writes.  And then perhaps those
exception-swallowing try-excepts wouldn't be necessary.  But perhaps
there's a non-obvious reason why it's written the way it is.
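
Something along these lines is what I have in mind. The function name is mine, and it assumes the stream really derives from (or is registered with) the io base classes:

```python
import io


def is_binary_stream(stream):
    """True for byte-oriented streams, False for text streams."""
    if isinstance(stream, io.TextIOBase):
        return False
    return isinstance(stream, (io.BufferedIOBase, io.RawIOBase))
```

A duck-typed file-like object that doesn't inherit from the io classes would be reported as non-binary here, which might be the non-obvious reason for probing with reads and writes instead.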

And there appears to be a bug where everything *except* the filename
'-' is treated as stdin, so the script probably hasn't been tested at
all.

 if those code samples are anything to go by this guy makes JMF look
 sensible.

This is an ad hominem.  Just because his code sucks doesn't mean he's
wrong about the state of Unicode and UNIX in Python 3.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Everything you did not want to know about Unicode in Python 3

2014-05-12 Thread MRAB

On 2014-05-12 19:31, Ian Kelly wrote:

On Mon, May 12, 2014 at 11:47 AM, alister
alister.nospam.w...@ntlworld.com wrote:

On Mon, 12 May 2014 16:19:17 +0100, Mark Lawrence wrote:


This was *NOT* written by our resident unicode expert
http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/

Posted as I thought it would make a rather pleasant change from
interminable threads about names vs values vs variables vs objects.


Surely those example programs are not the Pythonic way to do things, or
am I missing something?


The _is_binary_reader and _is_binary_writer functions look like they
could be simplified by calling isinstance on the io object itself
against io.TextIOBase, io.BufferedIOBase or io.RawIOBase, rather than
doing those odd 0-length reads and writes.  And then perhaps those
exception-swallowing try-excepts wouldn't be necessary.  But perhaps
there's a non-obvious reason why it's written the way it is.


How about checking sys.stdin.mode and sys.stdout.mode?


And there appears to be a bug where everything *except* the filename
'-' is treated as stdin, so the script probably hasn't been tested at
all.


if those code samples are anything to go by this guy makes JMF look
sensible.


This is an ad hominem.  Just because his code sucks doesn't mean he's
wrong about the state of Unicode and UNIX in Python 3.



--
https://mail.python.org/mailman/listinfo/python-list


Re: Everything you did not want to know about Unicode in Python 3

2014-05-12 Thread Ian Kelly
On Mon, May 12, 2014 at 1:42 PM, MRAB pyt...@mrabarnett.plus.com wrote:
 How about checking sys.stdin.mode and sys.stdout.mode?

Seems to work, but I notice that the docs only define the mode
attribute for the FileIO class, which sys.stdin and sys.stdout are not
instances of.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Everything you did not want to know about Unicode in Python 3

2014-05-12 Thread Chris Angelico
On Tue, May 13, 2014 at 4:31 AM, Ian Kelly ian.g.ke...@gmail.com wrote:
 Just because his code sucks doesn't mean he's
 wrong about the state of Unicode and UNIX in Python 3.

Uhm... I think wrongness of code is generally fairly indicative of
wrongness of thinking :) If I write a rant about how Python's list
type sucks and it turns out my code is using it like a cons cell and
never putting more than two elements into a list, then you would
accurately conclude that I'm wrong about the state of data type
support in Python.

I don't have a problem with someone coming to the list here with
misconceptions. That's what discussions are for. But rants like that,
on blogs, I quickly get weary of reading. The tone is always "Look
what's so wrong", not inviting dialogue, and I can't be bothered
digging into the details to compose a full response. Chances are the
author's (a) not looking at 3.4 and what's happened to improve
things (and certainly not 3.5 and what's going to happen), and (b) not
listening to responses anyway.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Everything you did not want to know about Unicode in Python 3

2014-05-12 Thread Steven D'Aprano
On Mon, 12 May 2014 17:47:48 +, alister wrote:

 On Mon, 12 May 2014 16:19:17 +0100, Mark Lawrence wrote:
 
 This was *NOT* written by our resident unicode expert
 http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/
 
 Posted as I thought it would make a rather pleasant change from
 interminable threads about names vs values vs variables vs objects.
 
 Surely those example programs are not the Pythonic way to do things, or
 am I missing something?

Feel free to show us your version of cat for Python then. Feel free to 
target any version you like. Don't forget to test it against files with 
names and content that:

- aren't valid UTF-8;

- are valid UTF-8, but not valid in the local encoding.



 if those code samples are anything to go by this guy makes JMF look
 sensible.

Armin Ronacher is an extremely experienced and knowledgeable Python 
developer, and a Python core developer. He might be wrong, but he's not 
*obviously* wrong.

Unicode is hard, not because Unicode is hard, but because of legacy 
problems. I can create a file on a machine that uses ISO-8859-7 for the 
file name, put Shift-JIS encoded text inside it, transfer it to a 
machine that uses Windows-1251 as the file system encoding, then SSH into 
that machine from a system using Big5, and try to make sense of it. If 
everybody used UTF-8 any time data touched a disk or network, we'd be 
laughing. It would all be so simple.
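
A tiny illustration of the first hop in that chain. The three-letter Greek name is an arbitrary example:

```python
name = "αβγ"
raw = name.encode("iso-8859-7")        # bytes as stored by the first machine
misread = raw.decode("windows-1251")   # what the second machine displays

# With UTF-8 at both ends the bytes survive the trip unchanged.
round_trip = name.encode("utf-8").decode("utf-8")
```

The same three bytes are perfectly valid in both encodings, so nothing raises an error; the second machine simply shows Cyrillic letters where Greek ones were written.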

Reading Armin's post, I think that all that is needed to simplify his 
Python 3 version is:

- have a bytes version of sys.argv (bargv? argvb?) and read 
  the file names from that;

- have a simple way to write bytes to stdout and stderr.

Most programs won't need either of those, but file system utilities will.
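
Both can be approximated today with existing pieces. The sketch below assumes the arguments were decoded by Python at startup in the usual way and that the output stream has a .buffer attribute; argv_as_bytes and write_bytes are made-up helper names:

```python
import io
import os
import sys


def argv_as_bytes(argv=None):
    """Recover byte-string arguments from the text strings in sys.argv."""
    if argv is None:
        argv = sys.argv
    # os.fsencode reverses the decoding Python applied at startup.
    return [os.fsencode(arg) for arg in argv]


def write_bytes(data, stream=None):
    """Write raw bytes to a text stream's underlying binary buffer."""
    if stream is None:
        stream = sys.stdout
    stream.flush()  # keep text already written ahead of the raw bytes
    stream.buffer.write(data)
    stream.buffer.flush()
```

It is more of a workaround than the simple built-in spelling being asked for, and it breaks if sys.stdout has been replaced by something without a .buffer.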



-- 
Steven D'Aprano
http://import-that.dreamwidth.org/
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Everything you did not want to know about Unicode in Python 3

2014-05-12 Thread Chris Angelico
On Tue, May 13, 2014 at 11:18 AM, Steven D'Aprano
steve+comp.lang.pyt...@pearwood.info wrote:
 Reading Armin's post, I think that all that is needed to simplify his
 Python 3 version is:

 - have a bytes version of sys.argv (bargv? argvb?) and read
   the file names from that;

argb? :)

 - have a simple way to write bytes to stdout and stderr.

I'm not sure how that goes with I/O redirection, but sure.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Everything you did not want to know about Unicode in Python 3

2014-05-12 Thread Mark H Harris

On 5/12/14 8:18 PM, Steven D'Aprano wrote:

Unicode is hard, not because Unicode is hard, but because of legacy
problems.


Yes.  To put a finer point on that, Unicode (which is only a 
specification constantly being improved upon) is harder to implement 
when it hasn't been on the design board from the ground up; Python in 
this case.


Julia has Unicode support from the ground up, and it was easier for 
those guys to implement (in beta release) than for the Python crew when 
they undertook the Unicode work that had to be done for Python3.x (just 
an observation).


Anytime there are legacy code issues, regression testing problems, and a 
host of domain issues that weren't thought through from the get-go there 
are going to be more problematic hurdles; not to mention bugs.


Having said that, I still think Unicode is somewhat harder than you're 
admitting.


marcus

--
https://mail.python.org/mailman/listinfo/python-list


Re: Everything you did not want to know about Unicode in Python 3

2014-05-12 Thread Mark Lawrence

On 13/05/2014 02:18, Steven D'Aprano wrote:

On Mon, 12 May 2014 17:47:48 +, alister wrote:


On Mon, 12 May 2014 16:19:17 +0100, Mark Lawrence wrote:


This was *NOT* written by our resident unicode expert
http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/

Posted as I thought it would make a rather pleasant change from
interminable threads about names vs values vs variables vs objects.


Surely those example programs are not the Pythonic way to do things, or
am I missing something?


Feel free to show us your version of cat for Python then. Feel free to
target any version you like. Don't forget to test it against files with
names and content that:

- aren't valid UTF-8;

- are valid UTF-8, but not valid in the local encoding.




if those code samples are anything to go by this guy makes JMF look
sensible.


Armin Ronacher is an extremely experienced and knowledgeable Python
developer, and a Python core developer. He might be wrong, but he's not
*obviously* wrong.

Unicode is hard, not because Unicode is hard, but because of legacy
problems. I can create a file on a machine that uses ISO-8859-7 for the
file name, put Shift-JIS encoded text inside it, transfer it to a
machine that uses Windows-1251 as the file system encoding, then SSH into
that machine from a system using Big5, and try to make sense of it. If
everybody used UTF-8 any time data touched a disk or network, we'd be
laughing. It would all be so simple.

Reading Armin's post, I think that all that is needed to simplify his
Python 3 version is:

- have a bytes version of sys.argv (bargv? argvb?) and read
   the file names from that;

- have a simple way to write bytes to stdout and stderr.

Most programs won't need either of those, but file system utilities will.



I think http://bugs.python.org/issue8776 and 
http://bugs.python.org/issue8775 are relevant but both were placed in 
the small round filing cabinet.


--
My fellow Pythonistas, ask not what our language can do for you, ask 
what you can do for our language.


Mark Lawrence



--
https://mail.python.org/mailman/listinfo/python-list


Re: Everything you did not want to know about Unicode in Python 3

2014-05-12 Thread Rustom Mody
On Tuesday, May 13, 2014 6:48:35 AM UTC+5:30, Steven D'Aprano wrote:
 On Mon, 12 May 2014 17:47:48 +, alister wrote:
 
  Surely those example programs are not the Pythonic way to do things, or
  am I missing something?
 
 
 
 Feel free to show us your version of cat for Python then. Feel free to 
 target any version you like. Don't forget to test it against files with 
 names and content that:
 
 
 - aren't valid UTF-8;
 
 
 - are valid UTF-8, but not valid in the local encoding.

Thanks for a non-defensive appraisal!

 
 
  if those code samples are anything to go by this guy makes JMF look
  sensible.
 
 
 
 Armin Ronacher is an extremely experienced and knowledgeable Python 
 developer, and a Python core developer. He might be wrong, but he's not 
 *obviously* wrong.
 
 
 
 Unicode is hard, not because Unicode is hard, but because of legacy 
 problems. I can create a file on a machine that uses ISO-8859-7 for the 
 file name, put Shift-JIS encoded text inside it, transfer it to a 
 machine that uses Windows-1251 as the file system encoding, then SSH into 
 that machine from a system using Big5, and try to make sense of it. If 
 everybody used UTF-8 any time data touched a disk or network, we'd be 
 laughing. It would all be so simple.

I think the most helpful way forward is to accept two things:
a. Unicode is a headache
b. No-unicode is a non-option

 
 
 
 Reading Armin's post, I think that all that is needed to simplify his 
 Python 3 version is:
 
 
 
 - have a bytes version of sys.argv (bargv? argvb?) and read 
   the file names from that;
 
 - have a simple way to write bytes to stdout and stderr.
 
 
 Most programs won't need either of those, but file system utilities will.

About the technical merits of Armin's post and your suggestions, I've
nothing to say, since I am an ignoramus on (the mechanics of) Unicode.

[Consider me an eager, early, ignorant adopter :-) ]

It is, however, good to note that Unicode is rather unique in the history
not just of IT/CS but of humanity, in the sense that no one (to the best
of my knowledge) has ever tried to come up with an all-encompassing umbrella
for all humanity's scripts/writing systems etc.

So hiccups and mistakes are only to be expected.  The absence of these would
be much more surprising!
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Everything you did not want to know about Unicode in Python 3

2014-05-12 Thread Mark H Harris

On 5/13/14 12:10 AM, Rustom Mody wrote:

I think the most helpful way forward is to accept two things:
a. Unicode is a headache
b. No-unicode is a non-option


QOTW(so far...)

--
https://mail.python.org/mailman/listinfo/python-list


Re: Everything you did not want to know about Unicode in Python 3

2014-05-12 Thread Gene Heskett
On Tuesday 13 May 2014 01:39:06 Mark H Harris did opine
And Gene did reply:
 On 5/13/14 12:10 AM, Rustom Mody wrote:
  I think the most helpful way forward is to accept two things:
  a. Unicode is a headache
  b. No-unicode is a non-option
 
 QOTW(so far...)

But it's early yet, only Tuesday; it's just barely started... :)

Cheers, Gene
-- 
There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order.
-Ed Howdershelt (Author)
Genes Web page http://geneslinuxbox.net:6309/gene
US V Castleman, SCOTUS, Mar 2014 is grounds for Impeaching SCOTUS
-- 
https://mail.python.org/mailman/listinfo/python-list