[issue13997] Clearly explain the bare minimum Python 3 users should know about Unicode

2013-01-27 Thread Ezio Melotti

Ezio Melotti added the comment:

OK, I'm going to close this then.

I'll take a look at the links and see if what they say can be included in the 
HOWTO.  As I mentioned in an earlier post, I have given a few talks about 
Unicode and encodings, so I will take some material from there too.  Depending 
on the final result we can then decide whether additional links are necessary, 
and which ones.

--
resolution:  -> duplicate
stage: needs patch -> committed/rejected
status: open -> closed
superseder:  -> Unicode HOWTO up to date?

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue13997] Clearly explain the bare minimum Python 3 users should know about Unicode

2013-01-27 Thread Nick Coghlan

Nick Coghlan added the comment:

Include a couple of "See Also" links out to my essay and Ned's article and that 
sounds good to me.

(Assuming I've adjusted the DNS settings correctly, this alternate URL for my 
essay should start working soon.)
http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html

--




[issue13997] Clearly explain the bare minimum Python 3 users should know about Unicode

2013-01-27 Thread Ezio Melotti

Ezio Melotti added the comment:

If we agree on this, I can propose a patch in #4153 and this issue can be 
closed.

--




[issue13997] Clearly explain the bare minimum Python 3 users should know about Unicode

2013-01-27 Thread Terry J. Reedy

Terry J. Reedy added the comment:

I basically agree with Ezio. The doc currently starts with

Introduction to Unicode
History of Character Codes
...

It ends with

Tips for Writing Unicode-aware Programs.
  ...
  The most important tip is:
Software should only work with Unicode strings internally, decoding the 
input data as soon as possible and encoding the output only at the end.

I think the how-to should *start* with that general principle and continue with 
the specific task-based how-tos from the thread. This will tell people who at 
least vaguely know the following material how to get going in a practical 
manner.
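
As a minimal sketch of that principle (the sample bytes and the choice of UTF-8 are assumptions made for illustration, not something taken from the HOWTO), the decode-early, encode-late pattern might look like this:

```python
# Hypothetical input: bytes as they might arrive from a file or a socket.
raw_input_bytes = "na\u00efve caf\u00e9\n".encode("utf-8")

# 1. Decode as soon as the data enters the program.
text = raw_input_bytes.decode("utf-8")

# 2. Work only with str internally.
shouted = text.upper()

# 3. Encode only when the data leaves the program.
raw_output_bytes = shouted.encode("utf-8")
```

Everything between steps 1 and 3 then sees only str objects, so the encoding appears in exactly two places.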

--
versions: +Python 3.4




[issue13997] Clearly explain the bare minimum Python 3 users should know about Unicode

2013-01-27 Thread Ezio Melotti

Ezio Melotti added the comment:

Maybe the Unicode HOWTO could be reorganized so that it first introduces the 
bare minimum and then expands the concepts for whoever wants to know more?
Or should we have a "basic" and an "advanced" Unicode HOWTO?

--




[issue13997] Clearly explain the bare minimum Python 3 users should know about Unicode

2013-01-26 Thread Nick Coghlan

Nick Coghlan added the comment:

Current status:

#14015 is still valid (i.e. surrogateescape is not well documented)
#4153: the Unicode HOWTO still covers more than the bare minimum people need to 
know
Ned Batchelder's "Pragmatic Unicode" is one of the best intros to the topic I 
have seen: http://nedbatchelder.com/text/unipain.html

My full notes on the topic, which I'm still happy with as a "bare minimum 
Python 3 users should know about Unicode" are available at 
http://python-notes.boredomandlaziness.org/en/latest/python3/text_file_processing.html

--




[issue13997] Clearly explain the bare minimum Python 3 users should know about Unicode

2013-01-26 Thread Ezio Melotti

Ezio Melotti added the comment:

What's the status of this?

Issue #4153 might also be related.

--




[issue13997] Clearly explain the bare minimum Python 3 users should know about Unicode

2012-07-14 Thread Eli Bendersky

Changes by Eli Bendersky:


--
nosy:  -eli.bendersky




[issue13997] Clearly explain the bare minimum Python 3 users should know about Unicode

2012-03-31 Thread Chris Rebert

Chris Rebert added the comment:

Links to the "rambling Unicode thread"s for posterity and convenience:

Gets into several issues, among them, Unicode:
http://mail.python.org/pipermail/python-ideas/2012-February/013665.html

Unicode-specific offshoot of the above:
http://mail.python.org/pipermail/python-ideas/2012-February/013825.html

--




[issue13997] Clearly explain the bare minimum Python 3 users should know about Unicode

2012-02-18 Thread Terry J. Reedy

Terry J. Reedy added the comment:

Yes, the 'how to' alternatives, with + and -, should be included in the doc 
addition. I thought it the best thing to come out of the python-ideas thread.

--




[issue13997] Clearly explain the bare minimum Python 3 users should know about Unicode

2012-02-18 Thread Nick Coghlan

Nick Coghlan added the comment:

The other thing that came out of the rambling Unicode thread on python-ideas is 
that we should clearly articulate the options for processing files in a 
task-based fashion and describe the trade-offs for the different alternatives.

I started writing up my notes on that as a tracker comment, but they became a 
little... long: 
http://readthedocs.org/docs/ncoghlan_devs-python-notes/en/latest/py3k_text_file_processing.html

--




[issue13997] Clearly explain the bare minimum Python 3 users should know about Unicode

2012-02-17 Thread Ezio Melotti

Ezio Melotti added the comment:

FWIW I recently gave a talk at PyCon Finland called "Understanding Encodings" 
that goes through the things you mentioned in the last message.

I could turn that into a patch for the Unicode HOWTO.

--




[issue13997] Clearly explain the bare minimum Python 3 users should know about Unicode

2012-02-17 Thread Terry J. Reedy

Terry J. Reedy added the comment:

I agree with no new builtin and appreciate that being taken off the table.

I think the place is the Unicode How-to. I think that document should be 
renamed Encodings and Unicode How-to. The reasons are 1) one has to first 
understand the concept of encoding characters and text as numbers, and 2) this 
issue (and the python-ideas discussion) is not about Unicode, but about using 
pre- (and non-)Unicode encodings with Python3's bytes and string types, and how 
that differs in Python3 versus using Python2's unicode and string types. If 
only Unicode encodings were used, with utf-8 dominant on the Internet (and it 
is now most common for web pages), the problems of concern here would not exist.

Learning about Unicode would mean learning about code units versus codepoints, 
normal versus surrogate chars, BMP versus extended chars (all of which are 
non-issues in wide builds and Py 3.3), 256-char planes, BOMs, surrogates, 
normalization forms, and character properties. While sometimes useful, these 
subjects are not the issue here.

--
nosy: +terry.reedy




[issue13997] Clearly explain the bare minimum Python 3 users should know about Unicode

2012-02-14 Thread Jim Jewett

Jim Jewett added the comment:

See bugs.python.org/issue14015 for one reason that surrogateescape isn't better 
known.

--
nosy: +Jim.Jewett




[issue13997] Clearly explain the bare minimum Python 3 users should know about Unicode

2012-02-13 Thread Tshepang Lekhonkhobe

Changes by Tshepang Lekhonkhobe:


--
nosy: +tshepang




[issue13997] Clearly explain the bare minimum Python 3 users should know about Unicode

2012-02-13 Thread Giampaolo Rodola'

Changes by Giampaolo Rodola':


--
nosy: +giampaolo.rodola




[issue13997] Clearly explain the bare minimum Python 3 users should know about Unicode

2012-02-12 Thread Antoine Pitrou

Antoine Pitrou added the comment:

> My mental model here is text editors, which let you open any file, do
> their best to display as much as they can and allow you to manipulate
> it without damaging the bits you don't change. I don't see any reason
> why people shouldn't be able to write Python 3 code that way if they
> need to.

Some text editors try to guess the encoding, which is different from
"display invalid characters anyway".
Other text editors like gedit pop up an error when there are invalid
bytes according to the configured encoding.

That said, people *are* able to write Python 3 code the way you said.
They simply have to use the "surrogateescape" error handler.
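
For illustration, here is a minimal sketch of that round trip on an in-memory byte string. The sample data and the "name:"/"author:" markers are made up for the example.

```python
# Latin-1 encoded bytes that are not valid ASCII (hypothetical sample data).
original = b"name: Andr\xe9\n"

# Undecodable bytes become lone surrogates instead of raising an error.
text = original.decode("ascii", errors="surrogateescape")

# The ASCII parts can be edited as ordinary text.
edited = text.replace("name:", "author:")

# Encoding with the same codec and handler restores the untouched bytes.
result = edited.encode("ascii", errors="surrogateescape")
```

The byte 0xE9 is never interpreted. It is carried through as the lone surrogate U+DCE9 and turned back into 0xE9 on output.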

--




[issue13997] Clearly explain the bare minimum Python 3 users should know about Unicode

2012-02-12 Thread Florent Xicluna

Changes by Florent Xicluna:


--
nosy: +flox

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue13997] Clearly explain the bare minimum Python 3 users should know about Unicode

2012-02-12 Thread Paul Moore

Paul Moore added the comment:

A better example in terms of "intended to be text" might be ChangeLog files. 
These are clearly text files, but of sufficiently standard format that they can 
be manipulated programmatically.

Consider a program to get a list of all authors who changed a particular file. 
Scan the file for date lines, then scan the block of text below for the 
filename you care about. Extract the author from the date line, put into a set, 
sort and print.

All of this can be done assuming the file is ASCII-compatible, but requires 
non-trivial text processing that would be a pain to do on bytes. But author 
names are quite likely to be non-ASCII, especially if it's an international 
project. And the changelog file is manually edited by people on different 
machines, so the possibility of inconsistent encodings is definitely there. (I 
have seen this happen - it's not theoretical!)

For my code, all I care about is that the names round-trip, so that I'm not 
damaging people's names any more than has already happened.

encoding="ascii",errors="surrogateescape" sounds like precisely the right 
answer here.
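
A rough sketch of the ChangeLog scan described above follows. It assumes GNU-style entries (a "date  author  <email>" line followed by indented "* file: note" lines). The sample data, the DATE_LINE pattern and the authors_of() helper are all hypothetical.

```python
import io
import re

# Hypothetical ChangeLog whose entries were saved in two different encodings.
changelog_bytes = (
    "2012-02-10  Jos\u00e9 Garc\u00eda  <jose@example.org>\n".encode("utf-8")
    + b"\n\t* parser.c: Fix overflow.\n\n"
    + "2012-02-11  Bj\u00f6rn Berg  <bjorn@example.org>\n".encode("latin-1")
    + b"\n\t* lexer.c: Add tokens.\n\t* parser.c: Update callers.\n\n"
    + b"2012-02-12  Ann Lee  <ann@example.org>\n"
    + b"\n\t* lexer.c: Fix typo.\n"
)

# Assumed date-line layout: "YYYY-MM-DD  Author Name  <email>".
DATE_LINE = re.compile(r"^\d{4}-\d{2}-\d{2}\s+(.*?)\s+<")


def authors_of(stream, filename):
    """Return the set of authors whose entries mention filename."""
    authors = set()
    current = None
    for line in stream:
        match = DATE_LINE.match(line)
        if match:
            current = match.group(1)
        elif current and line.lstrip().startswith("* " + filename + ":"):
            authors.add(current)
    return authors


stream = io.TextIOWrapper(io.BytesIO(changelog_bytes),
                          encoding="ascii", errors="surrogateescape")
found = authors_of(stream, "parser.c")

# The names round-trip to their original bytes, whatever the encoding was.
found_bytes = sorted(name.encode("ascii", errors="surrogateescape")
                     for name in found)
```

The resulting byte strings could be written to a binary stream such as sys.stdout.buffer, which sidesteps the question of how to print the escaped characters.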

(If it's hard to find a good answer in Python 3, it's very easy to decide to 
use Python 2 which "just works", or even other tools like awk which also take 
Python 2's naive approach - and dismiss Python 3's Unicode model as "too hard").

My mental model here is text editors, which let you open any file, do their 
best to display as much as they can and allow you to manipulate it without 
damaging the bits you don't change. I don't see any reason why people shouldn't 
be able to write Python 3 code that way if they need to.

--
nosy: +pmoore




[issue13997] Clearly explain the bare minimum Python 3 users should know about Unicode

2012-02-12 Thread Nick Coghlan

Nick Coghlan added the comment:

If such use cases are indeed better handled as bytes, then that's what should 
be documented. However, there are some text processing assumptions that no 
longer hold when using bytes instead of strings (such as "x[0:1] == x[0]"). You 
also can't safely pass such byte sequences to various other APIs (e.g. 
urllib.parse will happily process surrogate escaped text without corrupting 
it, but will throw UnicodeDecodeError for bytes sequences that aren't pure 
7-bit ASCII).

Using surrogateescape instead means that you're only going to have problems if 
you go to encode the data to an encoding other than the source one. That's 
basically the way things work in Python 2 with 8-bit strings.
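
To make the two differences concrete, here is a small sketch. The URL is a made-up example, and urlsplit() stands in for the urllib.parse functions in general.

```python
from urllib.parse import urlsplit

text = "abc"
data = b"abc"

# Indexing a str yields a length-1 str, so the slice and the index agree.
str_agrees = text[0:1] == text[0]

# Indexing bytes yields an int, so the same comparison is False.
bytes_agrees = data[0:1] == data[0]

# Hypothetical URL bytes containing one non-ASCII byte.
raw_url = b"http://example.com/caf\xe9"

# Surrogate-escaped text passes through urllib.parse untouched.
escaped_url = raw_url.decode("ascii", errors="surrogateescape")
path = urlsplit(escaped_url).path

# The same data passed as bytes is rejected.
try:
    urlsplit(raw_url)
    bytes_rejected = False
except UnicodeDecodeError:
    bytes_rejected = True
```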

--




[issue13997] Clearly explain the bare minimum Python 3 users should know about Unicode

2012-02-12 Thread Nadeem Vawda

Changes by Nadeem Vawda:


--
nosy: +nadeem.vawda

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue13997] Clearly explain the bare minimum Python 3 users should know about Unicode

2012-02-12 Thread STINNER Victor

STINNER Victor added the comment:

Why do you use Unicode with the ugly surrogateescape error handler in
this case? Bytes are just fine for such a use case.

The surrogateescape error handler produces unusual characters in the range
U+DC80-U+DCFF which cannot be printed to a console because sys.stdout
uses the strict error handler, and sys.stderr uses the
backslashreplace error handler. If I remember correctly, only the UTF-7
encoder allows lone surrogate characters.
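
The range and the handler behaviour can be checked directly. This is only a sketch: it encodes to UTF-8 in memory rather than writing to a real console, so it illustrates the two error handlers rather than the actual configuration of sys.stdout or sys.stderr.

```python
# Every non-ASCII byte maps to one code point in U+DC80..U+DCFF.
high_bytes = bytes(range(0x80, 0x100))
escaped = high_bytes.decode("ascii", errors="surrogateescape")
code_points = [ord(ch) for ch in escaped]

# A strict encoder rejects a lone surrogate ...
try:
    escaped[:1].encode("utf-8", errors="strict")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# ... while backslashreplace renders it as a visible escape sequence.
shown = escaped[:1].encode("utf-8", errors="backslashreplace")
```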

--




[issue13997] Clearly explain the bare minimum Python 3 users should know about Unicode

2012-02-12 Thread Nick Coghlan

Nick Coghlan added the comment:

Usually because the file may contain certain ASCII markers (or you're inserting 
such markers), but beyond that, you only care that it's in a consistent ASCII 
compatible encoding.

Parsing log files from sources that aren't set up correctly often falls into 
this category - you know the markers are ASCII, but the actual message contents 
may not be properly encoded (e.g. they use a locale-dependent encoding, but 
not all the log files are from the same machine and not all machines have their 
locale set up properly). That said, errors="replace" can be a better option for 
such "read-only" use cases.
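
A minimal sketch of the read-only case, using a made-up log line:

```python
# Hypothetical log line with one byte that is not valid ASCII.
log_line = b"2012-02-12 ERROR caf\xe9 not found\n"

# Undecodable bytes become U+FFFD, the Unicode replacement character.
text = log_line.decode("ascii", errors="replace")

# The ASCII markers can still be parsed normally.
level = text.split()[1]
```

The original byte is lost in the process, which is why this only suits cases where the data is never written back out.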

A use case where you really do need "errors='surrogateescape'" is when you're 
reformatting a log file and you want to preserve the encoding for the messages 
while manipulating the pure ASCII timestamps and message headers. In that case, 
surrogateescape is the right answer, because you can manipulate the ASCII bits 
freely while preserving the log message contents when you write the reformatted 
files back out. The reformatting script offers an API that says "put any ASCII 
compatible encoding in, and you'll get that same encoding back out".
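
A sketch of such a reformatting script is below. The "LEVEL date message" layout, the sample data and the reformat() helper are assumptions made for the example. In-memory streams stand in for real files.

```python
import io


def reformat(src, dst):
    """Rewrite 'LEVEL date message' lines as '[date] LEVEL: message'."""
    for line in src:
        level, date, message = line.rstrip("\n").split(" ", 2)
        dst.write("[{}] {}: {}\n".format(date, level, message))


# Hypothetical log: the first message is Latin-1, the second is UTF-8.
raw_in = io.BytesIO(b"ERROR 2012-02-12 caf\xe9 closed\n"
                    b"INFO 2012-02-13 na\xc3\xafve retry\n")
raw_out = io.BytesIO()

# The same codec and error handler are used on both sides.
src = io.TextIOWrapper(raw_in, encoding="ascii",
                       errors="surrogateescape", newline="")
dst = io.TextIOWrapper(raw_out, encoding="ascii",
                       errors="surrogateescape", newline="")

reformat(src, dst)
dst.flush()
output = raw_out.getvalue()
```

With real files, the same encoding and errors arguments would be passed to open() for both the input and the output file.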

You'll get weird behaviour (i.e. as you do in Python 2) if the assumption of an 
ASCII compatible encoding is ever violated, but that would be equally true if 
the script tried to process things at the raw bytes level.

The assumption of an ASCII compatible text encoding is a useful one a lot of 
the time. The problem with Python 2 is it makes that assumption implicitly, and 
makes it almost impossible to disable it. Python 3, on the other hand, assumes 
very little by default (basically what it returns from 
sys.getfilesystemencoding() and locale.getpreferredencoding()), thus requiring 
that the programmer know how to state their assumptions explicitly.
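
As an illustration, the defaults can be inspected and then overridden by stating the assumption in the open() call. The temporary file and its contents are invented for the example.

```python
import locale
import os
import sys
import tempfile

# What Python 3 assumes when nothing is stated; the values vary by platform.
defaults = {
    "filesystem": sys.getfilesystemencoding(),
    "preferred": locale.getpreferredencoding(),
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "wb") as f:
        f.write(b"key=caf\xe9\n")

    # Stated assumption: ASCII-compatible, exact encoding unknown.
    with open(path, encoding="ascii", errors="surrogateescape") as f:
        line = f.readline()
```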

--




[issue13997] Clearly explain the bare minimum Python 3 users should know about Unicode

2012-02-12 Thread STINNER Victor

STINNER Victor added the comment:

> A common programming task is "I want to process this text file,
> I know it's in an ASCII compatible encoding, I don't know which
> one specifically, but I'm only manipulating the ASCII parts
> so it doesn't matter".

Can you give more detail about this use case? Why would you ignore non-ASCII 
characters?

--
nosy: +haypo




[issue13997] Clearly explain the bare minimum Python 3 users should know about Unicode

2012-02-12 Thread Ezio Melotti

Changes by Ezio Melotti:


--
components: +Unicode
nosy: +ezio.melotti
stage:  -> needs patch
type:  -> enhancement




[issue13997] Clearly explain the bare minimum Python 3 users should know about Unicode

2012-02-11 Thread Eli Bendersky

Eli Bendersky added the comment:

If the concept is accepted, I see no better place for this than the
Unicode HOWTO. If it's too long, then a TL;DR section should be added
at the beginning detailing "the bare minimum". No need to scatter such
information in bits and pieces around the documentation. That's what
the Unicode HOWTO is for.

--




[issue13997] Clearly explain the bare minimum Python 3 users should know about Unicode

2012-02-11 Thread Nick Coghlan

Changes by Nick Coghlan:


--
assignee:  -> docs@python
components: +Documentation
nosy: +docs@python
title: Add open_ascii() builtin -> Clearly explain the bare minimum Python 3 
users should know about Unicode
versions: +Python 3.2, Python 3.3
