subject:"Newbie question about text encoding"

On Mon, Mar 9, 2015 at 5:25 AM, Steven D'Aprano
steve+comp.lang.pyt...@pearwood.info wrote:
 Marko Rauhamaa wrote:

 Chris Angelico ros...@gmail.com:

 Once again, you appear to be surprised that invalid data is failing.
 Why is this so strange? U+DD00 is not a valid character.

 But it is a valid non-character code point.

 It is quite correct to throw this error.

 '\udd00' is a valid str object:

 Is it though? Perhaps the bug is not UTF-8's inability to encode lone
 surrogates, but that Python allows you to create lone surrogates in the
 first place. That's not a rhetorical question. It's a genuine question.

Ah, I see the confusion. Yes, it is plausible to permit the UTF-8-like
encoding of surrogates; but it's illegal according to the RFC:

https://tools.ietf.org/html/rfc3629

   The definition of UTF-8 prohibits encoding character numbers between
   U+D800 and U+DFFF, which are reserved for use with the UTF-16
   encoding form (as surrogate pairs) and do not directly represent
   characters.


They're not valid characters, and the UTF-8 spec explicitly says that
they must not be encoded. Python is fully spec-compliant in rejecting
these. Some encoders [1] will permit them, but the resulting stream is
invalid UTF-8, just as CESU-8 and Modified UTF-8 are (the latter being
UTF-8, only U+ is represented as C0 80).

ChrisA

[1] eg 
http://pike.lysator.liu.se/generated/manual/modref/ex/predef_3A_3A/string_to_utf8.html
optionally
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

On Mon, Mar 9, 2015 at 5:25 AM, Steven D'Aprano
steve+comp.lang.pyt...@pearwood.info wrote:
 Perhaps the bug is not UTF-8's inability to encode lone
 surrogates, but that Python allows you to create lone surrogates in the
 first place. That's not a rhetorical question. It's a genuine question.

As to the notion of rejecting the construction of strings containing
these invalid codepoints, I'm not sure. Are there any languages out
there that have a Unicode string type that requires that all
codepoints be valid (no surrogates, no U+FFFE, etc)? This is the kind
of thing that's usually done in an obscure language before it hits a
mainstream one.

Pike is similar to Python here. I can create a string with invalid
code points in it:

 \uFFFE\uDD00;
(1) Result: \ufffe\udd00

but I can't UTF-8 encode that:

 string_to_utf8(\uFFFE\uDD00);
Character 0xdd00 at index 1 is in the surrogate range and therefore invalid.
Unknown program: string_to_utf8(\ufffe\udd00)
HilfeInput:1: HilfeInput()-___HilfeWrapper()

Or, using the streaming UTF-8 encoder instead of the short-hand:

 Charset.encoder(UTF-8)-feed(\uFFFE\uDD00)-drain();
Error encoding \ufffe[0xdd00] using utf8: Unsupported character 56576.
/usr/local/pike/8.1.0/lib/modules/_Charset.so:1:
_Charset.UTF8enc()-feed(\ufffe\udd00)
HilfeInput:1: HilfeInput()-___HilfeWrapper()

Does anyone know of a language where you can't even construct the string?

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

Rustom Mody wrote:

 On Saturday, March 7, 2015 at 4:39:48 PM UTC+5:30, Steven D'Aprano wrote:
 Rustom Mody wrote:
  This includes not just bug-prone-system code such as Java and Windows
  but seemingly working code such as python 3.
 
 What Unicode bugs do you think Python 3.3 and above have?
 
 Literal/Legalistic answer:
 https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2012-2135

Nice one :-) but not exactly in the spirit of what we're discussing (as you
acknowledge below), so I won't discuss that.


 [And already quoted at
 http://blog.languager.org/2015/03/whimsical-unicode.html
 ]
 
 An answer more in the spirit of what I am trying to say:
 Idle3, Roy's example and in general all systems that are
 python-centric but use components outside of python that are
 unicode-broken
 
 IOW I would expect people (at least people with good faith) reading my
 
 bug-prone-system code...seemingly working code such as python 3...
 
 to interpret that NOT as
 
 python 3 is seemingly working but actually broken


Why not? That is the natural interpretation of the sentence, particularly in
the context of your previous sentence:

[quote]
Or you can skip the blame-game and simply note the fact that 
large segments of extant code-bases are currently in bug-prone
or plain buggy state.

This includes not just bug-prone-system code such as Java and
Windows but seemingly working code such as python 3.
[end quote]


The natural interpretation of this is that Python 3 is only *seemingly*
working, but is also an example of a code base in bug-prone or plain buggy
state.

If that's not your intended meaning, then rather than casting aspersions on
my honesty (good faith indeed) you might accept that perhaps you didn't
quite manage to get your message across.


 But as
 
 Apps made with working system code (eg python3) can end up being broken
 because of other non-working system code - eg mysql, java, javascript,
 windows-shell, and ultimately windows, linux

Don't forget viruses or other malware, cosmic rays, processor bugs, dry
solder joints on the motherboard, faulty memory, and user-error.

I'm not sure what point you think you are making. If you want to discuss the
fact that complex systems have more interactions than simple systems, and
therefore more ways for things to go wrong, I will agree. I'll agree that
this is an issue with Python code that interacts with other systems which
may or may not implement Unicode correctly. There are a few ways to
interpret this:

(1) You're making a general point about the complexity of modern computing.

(2) You're making the point that dealing with text encodings in general, and
Unicode in specific, is hard because of the interaction of programming
language, database, file system, locale, etc.

(3) You're implying that Python ought to fix this problem some how.

(4) You're implying that *Unicode* specifically is uniquely problematic in
this way. Or at least *unusual* to be problematic in this way.


I will agree with 1 and 2; I'll say that 3 would be nice but in the absence
of concrete proposals for how to fix it, it's just meaningless chatter. And
I'll disagree strongly with 4.

Unicode came into existence because legacy encodings suffer from similar
problems, only worse. (One major advantage of Unicode over previous
multi-byte encodings is that the UTF encodings are self-healing. A single
corrupted byte will, *at worst*, cause a single corrupted code point.)

In one sense, Unicode has solved these legacy encoding problems, in the
sense that if you always use a correct implementation of Unicode then you
won't *ever* suffer from problems like moji-bake, broken strings and so
forth.

In another sense, Unicode hasn't solved these legacy problems because we
still have to deal with files using legacy encodings, as well as standards
organisations, operating systems, developers, applications and users who
continue to produce new content using legacy encodings, buggy or incorrect
implementations of the standard, also viruses, cosmic rays, dry solder
joints and user-error. How are these things Unicode's fault or
responsibility?



-- 
Steven

-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

Marko Rauhamaa wrote:

 Chris Angelico ros...@gmail.com:
 
 Once again, you appear to be surprised that invalid data is failing.
 Why is this so strange? U+DD00 is not a valid character. 

But it is a valid non-character code point.


 It is quite correct to throw this error.
 
 '\udd00' is a valid str object:

Is it though? Perhaps the bug is not UTF-8's inability to encode lone
surrogates, but that Python allows you to create lone surrogates in the
first place. That's not a rhetorical question. It's a genuine question.


 '\udd00'
'\udd00'
 '\udd00'.encode('utf-32')
b'\xff\xfe\x00\x00\x00\xdd\x00\x00'
 '\udd00'.encode('utf-16')
b'\xff\xfe\x00\xdd'

If you explicitly specify the endianness (say, utf-16-be or -le) then you
don't get the BOMs.

 I was simply stating that UTF-8 is not a bijection between unicode
 strings and octet strings (even forgetting Python). Enriching Unicode
 with 128 surrogates (U+DC80..U+DCFF) establishes a bijection, but not
 without side effects.



-- 
Steven

-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

2015-03-08 Thread Marko Rauhamaa

Steven D'Aprano steve+comp.lang.pyt...@pearwood.info:

 Marko Rauhamaa wrote:
 '\udd00' is a valid str object:

 Is it though? Perhaps the bug is not UTF-8's inability to encode lone
 surrogates, but that Python allows you to create lone surrogates in
 the first place. That's not a rhetorical question. It's a genuine
 question.

The problem is that no matter how you shuffle surrogates, encoding
schemes, coding points and the like, a wrinkle always remains.

I'm reminded of number sets where you go from ℕ to ℤ to ℚ to ℝ to ℂ. But
that's where the buck stops; traditional arithmetic functions are closed
under ℂ.

Unicode apparently hasn't found a similar closure.

That's why I think that while UTF-8 is a fabulous way to bring Unicode
to Linux, Linux should have taken the tack that Unicode is always an
application-level interpretation with few operating system tie-ins.
Unfortunately, the GNU world is busy trying to build a Unicode frosting
everywhere. The illusion can never be complete but is convincing enough
for application developers to forget to handle corner cases.

To answer your question, I think every code point from 0 to 1114111
should be treated as valid and analogous. Thus Python is correct here:

len('\udd00')
   1
len('\ufeff')
   1

The alternatives are far too messy to consider.


Marko
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

Steven D'Aprano wrote:

 Marko Rauhamaa wrote:
 
 Steven D'Aprano steve+comp.lang.pyt...@pearwood.info:
 
 Marko Rauhamaa wrote:

 That said, UTF-8 does suffer badly from its not being
 a bijective mapping.

 Can you explain?
 
 In Python terms, there are bytes objects b that don't satisfy:
 
b.decode('utf-8').encode('utf-8') == b
 
 Are you talking about the fact that not all byte streams are valid UTF-8?
 That is, some byte objects b may raise an exception on b.decode('utf-8').

Eh, I should have read the rest of the thread before replying...


 I don't see why that means UTF-8 suffers badly from this. Can you give
 an example of where you would expect to take an arbitrary byte-stream,
 decode it as UTF-8, and expect the results to be meaningful?

File names on Unix-like systems.

Unfortunately file names are a bit of a mess, but we're slowly converging on
Unicode support for files. I reckon that by 2070, 2080 tops, we'll have
that licked...

The three major operating systems have different levels of support for
Unicode file names:

* Apple OS X: HFS+ stores file names in decomposed form, using UTF-16. I
think this is the strictest Unicode support of all common file systems.
Well done Apple. Decomposed in this sense means that single code points may
be expanded where possible, e.g. é U+00E9 LATIN SMALL LETTER E WITH ACUTE
will be stored as two code points, U+0065 LATIN SMALL LETTER E + U+0301
COMBINING ACUTE ACCENT.

* Windows: NTFS stores file names as sequences of 16-bit code units except
0x. (Additional restrictions also apply: e.g. in POSIX mode, / is also
forbidden; in Win32 mode, / ? + etc. are forbidden.) The code units are
interpreted as UTF-16 but the file system doesn't prevent you from creating
file names with invalid sequences.

* Linux: ext2/ext3 stores file names as arbitrary bytes except for / and
nul. However most Linux distributions treat file names as if they were
UTF-8 (displaying ? glyphs for undecodable bytes), and many Linux GUI file
managers enforce the rule that file names are valid UTF-8.

File systems on removable media (FAT32, UDF, ISO-9660 with or without
extensions such as Joliet and Rock Ridge) have their own issues, but
generally speaking don't support Unicode well or at all.

So although the current situation is still a bit of a mess, there is a slow
move towards file names which are valid Unicode.


-- 
Steven

-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

2015-03-08 Thread Marko Rauhamaa

Chris Angelico ros...@gmail.com:

 Once again, you appear to be surprised that invalid data is failing.
 Why is this so strange? U+DD00 is not a valid character. It is quite
 correct to throw this error.

'\udd00' is a valid str object:

'\udd00'
   '\udd00'
'\udd00'.encode('utf-32')
   b'\xff\xfe\x00\x00\x00\xdd\x00\x00'
'\udd00'.encode('utf-16')
   b'\xff\xfe\x00\xdd'

I was simply stating that UTF-8 is not a bijection between unicode
strings and octet strings (even forgetting Python). Enriching Unicode
with 128 surrogates (U+DC80..U+DCFF) establishes a bijection, but not
without side effects.


Marko
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

On Sun, Mar 8, 2015 at 7:09 PM, Marko Rauhamaa ma...@pacujo.net wrote:
 Chris Angelico ros...@gmail.com:

 Once again, you appear to be surprised that invalid data is failing.
 Why is this so strange? U+DD00 is not a valid character. It is quite
 correct to throw this error.

 '\udd00' is a valid str object:

 '\udd00'
'\udd00'
 '\udd00'.encode('utf-32')
b'\xff\xfe\x00\x00\x00\xdd\x00\x00'
 '\udd00'.encode('utf-16')
b'\xff\xfe\x00\xdd'

 I was simply stating that UTF-8 is not a bijection between unicode
 strings and octet strings (even forgetting Python). Enriching Unicode
 with 128 surrogates (U+DC80..U+DCFF) establishes a bijection, but not
 without side effects.

But it's not a valid Unicode string, so a Unicode encoding can't be
expected to cope with it. Mathematically, 0xC0 0x80 would represent
U+, and some UTF-8 codecs generate and accept this (in order to
allow U+ without ever yielding 0x00), but that doesn't mean that
UTF-8 should allow that byte sequence.

The only reason to craft some kind of Unicode string for any arbitrary
sequence of bytes is the smuggling effect used for file name
handling. There is no reason to support invalid Unicode codepoints.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

2015-03-08 Thread Rustom Mody

On Monday, March 9, 2015 at 7:39:42 AM UTC+5:30, Cameron Simpson wrote:
 On 07Mar2015 22:09, Steven D'Aprano  wrote:
 Rustom Mody wrote:
 [...big snip...]
  Some parts are here some earlier and from my memory.
  If details wrong please correct:
  - 200 million records
  - Containing 4 strings with SMP characters
  - System made with python and mysql. SMP works with python, breaks mysql.
So whole system broke due to those 4 in 200,000,000 records
 
 No, they broke because MySQL has buggy Unicode handling.
 [...]
  You could also choose do with astral crap (Roy's words) what we all do
  with crap -- throw it out as early as possible.
 
 And when Roy's customers demand that his product support emoji, or complain
 that they cannot spell their own name because of his parochial and ignorant
 idea of crap, perhaps he will consider doing what he should have done
 from the beginning:
 
 Stop using MySQL, which is a joke of a database[1], and use Postgres which
 does not have this problem.
 
 [1] So I have been told.
 
 I use MySQL a fair bit, and Postgres very slightly. I would agree with your 
 characterisation above; MySQL is littered with inconsistencies and arbitrary 
 breakage, both in tools and SQL implementation. And Postgres has been a pure 
 pleasure to work with, little though I have done that so far.
 
 Cheers,
 Cameron Simpson
 
 There is no human problem which could not be solved if people would simply
 do as I advise. - Gore Vidal

I think that last quote sums up the issue best.
Ive written to Intel asking them to make their next generation have 21-bit wide 
bytes.
Once they do that we will be back in the paradise we have been for the last 40 
years
which I call the 'Unix-assumption'
http://blog.languager.org/2014/04/unicode-and-unix-assumption.html

Until then...

We have to continue living in the real world.
Which includes 10 times more windows than linux users.
Is windows 10 times better an OS than linux?

In the 'real world' people make choices for all sorts of reasons. My guess is 
the
top reason is the pointiness of the hair of pointy-haired-boss.

Just like people choose  windows over linux, people choose mysql over postgres,
and that's the context of this discussion -- people stuck in sub-optimal choices
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

2015-03-08 Thread Ben Finney

Steven D'Aprano steve+comp.lang.pyt...@pearwood.info writes:

 '\udd00' should be a SyntaxError.

I find your argument convincing, that attempting to construct a Unicode
string of a lone surrogate should be an error.

Shouldn't the error type be a ValueError, though? The statement is not,
to my mind, erroneous syntax.

-- 
 \ “Please do not feed the animals. If you have any suitable food, |
  `\ give it to the guard on duty.” —zoo, Budapest |
_o__)  |
Ben Finney

-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

2015-03-08 Thread Cameron Simpson


On 07Mar2015 22:09, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info 
wrote:

Rustom Mody wrote:

[...big snip...]
Some parts are here some earlier and from my memory.
If details wrong please correct:
- 200 million records
- Containing 4 strings with SMP characters
- System made with python and mysql. SMP works with python, breaks mysql.
  So whole system broke due to those 4 in 200,000,000 records


No, they broke because MySQL has buggy Unicode handling.

[...]

You could also choose do with astral crap (Roy's words) what we all do
with crap -- throw it out as early as possible.


And when Roy's customers demand that his product support emoji, or complain
that they cannot spell their own name because of his parochial and ignorant
idea of crap, perhaps he will consider doing what he should have done
from the beginning:

Stop using MySQL, which is a joke of a database[1], and use Postgres which
does not have this problem.

[1] So I have been told.


I use MySQL a fair bit, and Postgres very slightly. I would agree with your 
characterisation above; MySQL is littered with inconsistencies and arbitrary 
breakage, both in tools and SQL implementation. And Postgres has been a pure 
pleasure to work with, little though I have done that so far.


Cheers,
Cameron Simpson c...@zip.com.au

There is no human problem which could not be solved if people would simply
do as I advise. - Gore Vidal
--
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

On Mon, Mar 9, 2015 at 1:09 PM, Ben Finney ben+pyt...@benfinney.id.au wrote:
 Steven D'Aprano steve+comp.lang.pyt...@pearwood.info writes:

 '\udd00' should be a SyntaxError.

 I find your argument convincing, that attempting to construct a Unicode
 string of a lone surrogate should be an error.

 Shouldn't the error type be a ValueError, though? The statement is not,
 to my mind, erroneous syntax.

For the string literal, I would say SyntaxError is more appropriate
than ValueError, as a string object has to be constructed at
compilation time.

I'd still like to see a report from someone who has used a language
that specifically disallows all surrogates in strings. Does it help?
Is it more hassle than it's worth? Are there weird edge cases that it
breaks?

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

2015-03-08 Thread random832

On Sun, Mar 8, 2015, at 22:09, Ben Finney wrote:
 Steven D'Aprano steve+comp.lang.pyt...@pearwood.info writes:
 
  '\udd00' should be a SyntaxError.
 
 I find your argument convincing, that attempting to construct a Unicode
 string of a lone surrogate should be an error.
 
 Shouldn't the error type be a ValueError, though? The statement is not,
 to my mind, erroneous syntax.

In this hypothetical, it's a problem with evaluating a literal - in the
same way that '\U12345', or '\U0011, is.
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

Marko Rauhamaa wrote:

 Steven D'Aprano steve+comp.lang.pyt...@pearwood.info:
 
 Marko Rauhamaa wrote:
 '\udd00' is a valid str object:

 Is it though? Perhaps the bug is not UTF-8's inability to encode lone
 surrogates, but that Python allows you to create lone surrogates in
 the first place. That's not a rhetorical question. It's a genuine
 question.
 
 The problem is that no matter how you shuffle surrogates, encoding
 schemes, coding points and the like, a wrinkle always remains.

Really? Define your terms. Can you define wrinkles, and prove that it is
impossible to remove them? What's so bad about wrinkles anyway?


 I'm reminded of number sets where you go from ℕ to ℤ to ℚ to ℝ to ℂ. But
 that's where the buck stops; traditional arithmetic functions are closed
 under ℂ.

That's simply incorrect. What's z/(0+0i)?

There are many more number sets used by mathematicians, some going back to
the 1800s. Here are just a few:

* ℝ-overbar or [−∞, +∞], which adds a pair of infinities to ℝ.

* ℝ-caret or ℝ+{∞}, which does the same but with a single 
  unsigned infinity.

* A similar extended version of ℂ with a single infinity.

* Split-complex or hyperbolic numbers, defined similarly to ℂ 
  except with i**2 = +1 (rather than the complex i**2 = -1).

* Dual numbers, which add a single infinitesimal number ε != 0 
  with the property that ε**2 = 0.

* Hyperreal numbers.

* John Conway's surreal numbers, which may be the largest 
  possible set, in the sense that it can construct all finite, 
  infinite and infinitesimal numbers. (The hyperreals and dual 
  numbers can be considered subsets of the surreals.)

The process of extending ℝ to ℂ is formally known as Cayley–Dickson
construction, and there is an infinite number of algebras (and hence number
sets) which can be constructed this way. The next few are:

* Hamilton's quaternions ℍ, very useful for dealing with rotations 
  in 3D space. They fell out of favour for some decades, but are now
  experiencing something of a renaissance.

* Octonions or Cayley numbers.

* Sedenions.


 Unicode apparently hasn't found a similar closure.

Similar in what way? And why do you think this is important?

It is not a requirement for every possible byte sequence to be a valid
Unicode string, any more than it is a requirement for every possible byte
sequence to be valid JPG, zip archive, or ELF executable. Some byte strings
simply are not JPG images, zip archives or ELF executables -- or Unicode
strings. So what?

Why do you think that is a problem that needs fixing by the Unicode
standard? It may be a problem that needs fixing by (for example)
programming languages, and Python invented the surrogatesescape encoding to
smuggle such invalid bytes into strings. Other solutions may exist as well.
But that's not part of Unicode and it isn't a problem for Unicode.


 That's why I think that while UTF-8 is a fabulous way to bring Unicode
 to Linux, Linux should have taken the tack that Unicode is always an
 application-level interpretation with few operating system tie-ins.

Should have? That is *exactly* the status quo, and while it was the only
practical solution given Linux's history, it's a horrible idea. That
Unicode is stuck on top of an OS which is unaware of Unicode is precisely
why we're left with problems like how do you represent arbitrary bytes as
Unicode strings?.


 Unfortunately, the GNU world is busy trying to build a Unicode frosting
 everywhere. The illusion can never be complete but is convincing enough
 for application developers to forget to handle corner cases.
 
 To answer your question, I think every code point from 0 to 1114111
 should be treated as valid and analogous. 

Your opinion isn't very relevant. What is relevant is what the Unicode
standard demands, and I think it requires that strings containing
surrogates are illegal (rather like x/0 is illegal in the real numbers).
Wikipedia states:


The Unicode standard permanently reserves these code point 
values [U+D800 to U+DFFF] for UTF-16 encoding of the high 
and low surrogates, and they will never be assigned a 
character, so there should be no reason to encode them. The 
official Unicode standard says that no UTF forms, including 
UTF-16, can encode these code points.

However UCS-2, UTF-8, and UTF-32 can encode these code points
in trivial and obvious ways, and large amounts of software 
does so even though the standard states that such arrangements
should be treated as encoding errors. It is possible to 
unambiguously encode them in UTF-16 by using a code unit equal
to the code point, as long as no sequence of two code units can
be interpreted as a legal surrogate pair (that is, as long as a
high surrogate is never followed by a low surrogate). The 
majority of UTF-16 encoder and decoder implementations translate
between encodings as though this were the case.


http://en.wikipedia.org/wiki/UTF-16

So yet again we are left with the

Re: Newbie question about text encoding

Steven D'Aprano steve+comp.lang.pyt...@pearwood.info:

 For those cases where you do wish to take an arbitrary byte stream and
 round-trip it, Python now provides an error handler for that.

 py import random
 py b = bytes([random.randint(0, 255) for _ in range(1)])
 py s = b.decode('utf-8')
 Traceback (most recent call last):
   File stdin, line 1, in module
 UnicodeDecodeError: 'utf-8' codec can't decode byte 0x94 in position 0:
 invalid start byte
 py s = b.decode('utf-8', errors='surrogateescape')
 py s.encode('utf-8', errors='surrogateescape') == b
 True

That is indeed a valid workaround. With it we achieve

   b.decode('utf-8', errors='surrogateescape'). \
   encode('utf-8', errors='surrogateescape') == b

for any bytes b. It goes to great lengths to address the Linux
programmer's situation.

However,

 * it's not UTF-8 but a variant of it,

 * it sacrifices the ordering correspondence of UTF-8:

'\udc80'  'ä'
   True
'\udc80'.encode('utf-8', errors='surrogateescape')  \
   ...'ä'.encode('utf-8', errors='surrogateescape')
   False

 * it still isn't bijective between str and bytes:

'\udd00'.encode('utf-8', errors='surrogateescape')
   Traceback (most recent call last):
 File stdin, line 1, in module
   UnicodeEncodeError: 'utf-8' codec can't encode character 
   '\udd00' in position 0: surrogates not allowed


Marko
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

2015-03-07 Thread Rustom Mody

On Saturday, March 7, 2015 at 4:39:48 PM UTC+5:30, Steven D'Aprano wrote:
 Rustom Mody wrote:
  This includes not just bug-prone-system code such as Java and Windows but
  seemingly working code such as python 3.
 
 What Unicode bugs do you think Python 3.3 and above have?

Literal/Legalistic answer:
https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2012-2135

[And already quoted at
http://blog.languager.org/2015/03/whimsical-unicode.html
]

An answer more in the spirit of what I am trying to say:
Idle3, Roy's example and in general all systems that are
python-centric but use components outside of python that are unicode-broken

IOW I would expect people (at least people with good faith) reading my

 bug-prone-system code...seemingly working code such as python 3...

to interpret that NOT as

python 3 is seemingly working but actually broken

But as

Apps made with working system code (eg python3) can end up being broken
because of other non-working system code - eg mysql, java, javascript, 
windows-shell, and ultimately windows, linux
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

On Sun, Mar 8, 2015 at 6:20 PM, Marko Rauhamaa ma...@pacujo.net wrote:
  * it still isn't bijective between str and bytes:

 '\udd00'.encode('utf-8', errors='surrogateescape')
Traceback (most recent call last):
  File stdin, line 1, in module
UnicodeEncodeError: 'utf-8' codec can't encode character
'\udd00' in position 0: surrogates not allowed

Once again, you appear to be surprised that invalid data is failing.
Why is this so strange? U+DD00 is not a valid character. It is quite
correct to throw this error.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

Steven D'Aprano steve+comp.lang.pyt...@pearwood.info:

 Marko Rauhamaa wrote:

 That said, UTF-8 does suffer badly from its not being
 a bijective mapping.

 Can you explain?

In Python terms, there are bytes objects b that don't satisfy:

   b.decode('utf-8').encode('utf-8') == b


Marko
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

On Sun, Mar 8, 2015 at 2:48 AM, Marko Rauhamaa ma...@pacujo.net wrote:
 Steven D'Aprano steve+comp.lang.pyt...@pearwood.info:

 Marko Rauhamaa wrote:

 That said, UTF-8 does suffer badly from its not being
 a bijective mapping.

 Can you explain?

 In Python terms, there are bytes objects b that don't satisfy:

b.decode('utf-8').encode('utf-8') == b

Please provide an example; that sounds like a bug. If there is any
invalid UTF-8 stream which decodes without an error, it is actually a
security bug, and should be fixed pronto in all affected and supported
versions.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

On Sun, Mar 8, 2015 at 3:25 AM, Marko Rauhamaa ma...@pacujo.net wrote:
 Chris Angelico ros...@gmail.com:

 On Sun, Mar 8, 2015 at 2:48 AM, Marko Rauhamaa ma...@pacujo.net wrote:
 Steven D'Aprano steve+comp.lang.pyt...@pearwood.info:

 Marko Rauhamaa wrote:

 That said, UTF-8 does suffer badly from its not being
 a bijective mapping.

 Can you explain?

 In Python terms, there are bytes objects b that don't satisfy:

b.decode('utf-8').encode('utf-8') == b

 Please provide an example; that sounds like a bug. If there is any
 invalid UTF-8 stream which decodes without an error, it is actually a
 security bug, and should be fixed pronto in all affected and supported
 versions.

 Here's an example:

b = b'\x80'

 Yes, it generates an exception. IOW, UTF-8 is not a bijective mapping
 from str objects to bytes objects.

That's not the same as what you said. All you've proven is that there
are bit patterns which are not UTF-8 streams... which is a very
deliberate feature. How does UTF-8 *suffer* from this? It benefits
hugely!

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding


On 07/03/2015 16:25, Marko Rauhamaa wrote:

Chris Angelico ros...@gmail.com:


On Sun, Mar 8, 2015 at 2:48 AM, Marko Rauhamaa ma...@pacujo.net wrote:

Steven D'Aprano steve+comp.lang.pyt...@pearwood.info:


Marko Rauhamaa wrote:


That said, UTF-8 does suffer badly from its not being
a bijective mapping.


Can you explain?


In Python terms, there are bytes objects b that don't satisfy:

b.decode('utf-8').encode('utf-8') == b


Please provide an example; that sounds like a bug. If there is any
invalid UTF-8 stream which decodes without an error, it is actually a
security bug, and should be fixed pronto in all affected and supported
versions.


Here's an example:

b = b'\x80'

Yes, it generates an exception. IOW, UTF-8 is not a bijective mapping
from str objects to bytes objects.



Python 2 might, Python 3 doesn't.

--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence

--
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

Chris Angelico ros...@gmail.com:

 On Sun, Mar 8, 2015 at 3:25 AM, Marko Rauhamaa ma...@pacujo.net wrote:
 Marko Rauhamaa wrote:
 That said, UTF-8 does suffer badly from its not being
 a bijective mapping.

 Here's an example:

b = b'\x80'

 Yes, it generates an exception. IOW, UTF-8 is not a bijective mapping
 from str objects to bytes objects.

 That's not the same as what you said.

Except that it's precisely what I said.

 All you've proven is that there are bit patterns which are not UTF-8
 streams...

And that causes problems.

 which is a very deliberate feature.

Well, nobody desired it. It was just something that had to give.

I believe you *could* have defined it as a bijective mapping but then
you would have lost the sorting order correspondence.

 How does UTF-8 *suffer* from this? It benefits hugely!

You can't operate on file names and text files using Python strings. Or
at least, you will need to add (nontrivial) exception catching logic.


Marko
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

On Sun, Mar 8, 2015 at 4:50 AM, Marko Rauhamaa ma...@pacujo.net wrote:
 There are two things happening here:

 1) The underlying file system is not UTF-8, and you can't depend on
 that,

 Correct. Linux pathnames are octet strings regardless of the locale.

 That's why Linux developers should refer to filenames using bytes.
 Unfortunately, Python itself violates that principle by having
 os.listdir() return str objects (to mention one example).

Only because you gave it a str with the path name. If you want to
refer to file names using bytes, then be consistent and refer to ALL
file names using bytes. As I demonstrated, that works just fine.

 2) You forgot to put the path on that, so it failed to find the file.
 Here's my version of your demo:

 open(/tmp/xyz/+os.listdir('/tmp/xyz')[0])
 _io.TextIOWrapper name='/tmp/xyz/\udc80' mode='r' encoding='UTF-8'

 Looks fine to me.

 I stand corrected.

 Then we have:

 os.listdir()[0].encode('utf-8')
Traceback (most recent call last):
  File stdin, line 1, in module
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in
position 0: surrogates not allowed

So?

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

On Sun, Mar 8, 2015 at 5:34 AM, Dan Sommers d...@tombstonezero.net wrote:
 I think we're all agreeing:  not all file systems are the same, and
 Python doesn't smooth out all of the bumps, even for something that
 seems as simple as displaying the names of files in a directory.  And
 that's *after* we've agreed that filesystems contain files in
 hierarchical directories.

I think you and I are in agreement. No idea about Marko, I'm still not
entirely sure what he's saying.

Python can't smooth out all of the bumps in file systems, any more
than Unicode can smooth out the bumps in natural language, or TCP can
smooth out the bumps in IP. The abstraction layers help, but every now
and then they leak, and you have to cope with the underlying mess.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding


On 07/03/2015 16:48, Marko Rauhamaa wrote:

Mark Lawrence breamore...@yahoo.co.uk:


On 07/03/2015 16:25, Marko Rauhamaa wrote:

Here's an example:

 b = b'\x80'

Yes, it generates an exception. IOW, UTF-8 is not a bijective mapping
from str objects to bytes objects.


Python 2 might, Python 3 doesn't.


Python 3.3.2 (default, Dec  4 2014, 12:49:00)
[GCC 4.8.3 20140911 (Red Hat 4.8.3-7)] on linux
Type help, copyright, credits or license for more information.
 b'\x80'.decode('utf-8')
Traceback (most recent call last):
  File stdin, line 1, in module
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0:
invalid start byte


Marko



It would clearly help if you were to type in the correct UK English accent.

--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence

--
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

Dan Sommers d...@tombstonezero.net:

 I think we're all agreeing: not all file systems are the same, and
 Python doesn't smooth out all of the bumps, even for something that
 seems as simple as displaying the names of files in a directory. And
 that's *after* we've agreed that filesystems contain files in
 hierarchical directories.

A whole new set of problems took root with Unicode. There were gains but
there were losses, too.

Python is not alone in the conceptual difficulties. Guile 2's (readdir)
simply converts bad UTF-8 in a filename into a question mark:

   scheme@(guile-user) [1] (readdir s)
   $3 = ?
   scheme@(guile-user) [4] (equal? $3 ?)
   $4 = #t

So does lxterminal:

   $ ls
   ?

even though it's all bytes on the inside:

   $ [ $(ls) = ? ]
   $ echo $?
   1

Scripts that make use of standard text utilities must now be very
careful:

   $ ls | egrep ^.$ | wc -l
   0

You are well advised to sprinkle LANG=C in your scripts:

   $ ls | LANG=C egrep ^.$ | wc -l
   1

Nasty locale-related bugs plague installation scripts, whose writers are
not accustomed to running their tests in myriads of locales. The topic
is of course larger than just Unicode.


Marko
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

On Sun, Mar 8, 2015 at 3:40 AM, Mark Lawrence breamore...@yahoo.co.uk wrote:
 Here's an example:

 b = b'\x80'

 Yes, it generates an exception. IOW, UTF-8 is not a bijective mapping
 from str objects to bytes objects.


 Python 2 might, Python 3 doesn't.

He was talking about this line of code:

b.decode('utf-8').encode('utf-8') == b

With the above assignment, that does indeed throw an error - which is
correct behaviour.

Challenge: Figure out a byte-string input that will make this function
return True.

def is_utf8_broken(b):
return b.decode('utf-8').encode('utf-8') != b

Correct responses for this function are either False or raising an exception.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

Mark Lawrence breamore...@yahoo.co.uk:

 It would clearly help if you were to type in the correct UK English
 accent.

Your ad-hominem-to-contribution ratio is alarmingly high.


Marko
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

On Sun, Mar 8, 2015 at 4:14 AM, Marko Rauhamaa ma...@pacujo.net wrote:
 See:

$ mkdir /tmp/xyz
$ touch /tmp/xyz/
 \x80'
$ python3
Python 3.3.2 (default, Dec  4 2014, 12:49:00)
[GCC 4.8.3 20140911 (Red Hat 4.8.3-7)] on linux
Type help, copyright, credits or license for more information.
 import os
 os.listdir('/tmp/xyz')
['\udc80']
 open(os.listdir('/tmp/xyz')[0])
Traceback (most recent call last):
  File stdin, line 1, in module
FileNotFoundError: [Errno 2] No such file or directory: '\udc80'

 File names encoded with Latin-X are quite commonplace even in UTF-8
 locales.

That is not a problem with UTF-8, though. I don't understand how
you're blaming UTF-8 for that. There are two things happening here:

1) The underlying file system is not UTF-8, and you can't depend on
that, ergo the decode to Unicode has to have some special handling of
failing bytes.
2) You forgot to put the path on that, so it failed to find the file.
Here's my version of your demo:

 open(/tmp/xyz/+os.listdir('/tmp/xyz')[0])
_io.TextIOWrapper name='/tmp/xyz/\udc80' mode='r' encoding='UTF-8'

Looks fine to me.

Alternatively, if you pass a byte string to os.listdir, you get back a
list of byte string file names:

 os.listdir(b/tmp/xyz)
[b'\x80']
 open(b/tmp/xyz/+os.listdir(b'/tmp/xyz')[0])
_io.TextIOWrapper name=b'/tmp/xyz/\x80' mode='r' encoding='UTF-8'

Either way works. You can use bytes or text, and if you use text,
there is a way to smuggle bytes through it. None of this has anything
to do with UTF-8 as an encoding. (Note that the encoding='UTF-8'
note in the response has to do with the presumed encoding of the file
contents, not of the file name. As an empty file, it can be considered
to be a stream of zero Unicode characters, encoded UTF-8, so that's
valid.)

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding


On 07/03/2015 17:16, Marko Rauhamaa wrote:

Mark Lawrence breamore...@yahoo.co.uk:


It would clearly help if you were to type in the correct UK English
accent.


Your ad-hominem-to-contribution ratio is alarmingly high.


Marko



You've been a PITA ever since you first joined this list, what about it?

--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence

--
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

2015-03-07 Thread Dan Sommers

On Sun, 08 Mar 2015 05:13:09 +1100, Chris Angelico wrote:

 On Sun, Mar 8, 2015 at 5:02 AM, Dan Sommers d...@tombstonezero.net wrote:
 On Sun, 08 Mar 2015 04:59:56 +1100, Chris Angelico wrote:

 On Sun, Mar 8, 2015 at 4:50 AM, Marko Rauhamaa ma...@pacujo.net wrote:

 Correct. Linux pathnames are octet strings regardless of the locale.

 That's why Linux developers should refer to filenames using bytes.
 Unfortunately, Python itself violates that principle by having
 os.listdir() return str objects (to mention one example).

 Only because you gave it a str with the path name. If you want to
 refer to file names using bytes, then be consistent and refer to ALL
 file names using bytes. As I demonstrated, that works just fine.

 Python 3.4.2 (default, Oct  8 2014, 10:45:20)
 [GCC 4.9.1] on linux
 Type help, copyright, credits or license for more information.
 import os
 type(os.listdir(os.curdir)[0])
 class 'str'
 
 Help on module os:
 
 DESCRIPTION
 This exports:
   - os.curdir is a string representing the current directory ('.' or ':')
   - os.pardir is a string representing the parent directory ('..' or '::')
 
 Explicitly documented as strings. If you want to work with strings,
 work with strings. If you want to work with bytes, don't use
 os.curdir, use bytes instead. Personally, I'm happy using strings, but
 if you want to go down the path of using bytes, you simply have to be
 consistent, and that probably means being platform-dependent anyway,
 so just use b. for the current directory.

I think we're all agreeing:  not all file systems are the same, and
Python doesn't smooth out all of the bumps, even for something that
seems as simple as displaying the names of files in a directory.  And
that's *after* we've agreed that filesystems contain files in
hierarchical directories.

Dan
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

On Sun, Mar 8, 2015 at 3:54 AM, Marko Rauhamaa ma...@pacujo.net wrote:
 You can't operate on file names and text files using Python strings. Or
 at least, you will need to add (nontrivial) exception catching logic.

You can't operate on a JPG file using a Unicode string, nor an array
of integers. What of it? You can't operate on an array of integers
using a dictionary, either. So? How is this a failing of UTF-8?

If you really REALLY can't use the bytes() type to work with something
that is, yaknow, bytes, then you could use an alternative encoding
that has a value for every byte. It's still not Unicode text, so it
doesn't much matter which encoding you use. But it's much better to
use the bytes type to work with bytes. It is not text, so don't treat
it as text.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

Chris Angelico ros...@gmail.com:

 If you really REALLY can't use the bytes() type to work with something
 that is, yaknow, bytes, then you could use an alternative encoding
 that has a value for every byte. It's still not Unicode text, so it
 doesn't much matter which encoding you use. But it's much better to
 use the bytes type to work with bytes. It is not text, so don't treat
 it as text.

See:

   $ mkdir /tmp/xyz
   $ touch /tmp/xyz/$'\x80'
   $ python3
   Python 3.3.2 (default, Dec  4 2014, 12:49:00) 
   [GCC 4.8.3 20140911 (Red Hat 4.8.3-7)] on linux
   Type help, copyright, credits or license for more information.
import os
os.listdir('/tmp/xyz')
   ['\udc80']
open(os.listdir('/tmp/xyz')[0])
   Traceback (most recent call last):
 File stdin, line 1, in module
   FileNotFoundError: [Errno 2] No such file or directory: '\udc80'

File names encoded with Latin-X are quite commonplace even in UTF-8
locales.


Marko
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

2015-03-07 Thread Dan Sommers

On Sun, 08 Mar 2015 04:59:56 +1100, Chris Angelico wrote:

 On Sun, Mar 8, 2015 at 4:50 AM, Marko Rauhamaa ma...@pacujo.net wrote:

 Correct. Linux pathnames are octet strings regardless of the locale.

 That's why Linux developers should refer to filenames using bytes.
 Unfortunately, Python itself violates that principle by having
 os.listdir() return str objects (to mention one example).
 
 Only because you gave it a str with the path name. If you want to
 refer to file names using bytes, then be consistent and refer to ALL
 file names using bytes. As I demonstrated, that works just fine.

Python 3.4.2 (default, Oct  8 2014, 10:45:20) 
[GCC 4.9.1] on linux
Type help, copyright, credits or license for more information.
 import os
 type(os.listdir(os.curdir)[0])
class 'str'
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding


On 07/03/2015 18:34, Dan Sommers wrote:

On Sun, 08 Mar 2015 05:13:09 +1100, Chris Angelico wrote:


On Sun, Mar 8, 2015 at 5:02 AM, Dan Sommers d...@tombstonezero.net wrote:

On Sun, 08 Mar 2015 04:59:56 +1100, Chris Angelico wrote:


On Sun, Mar 8, 2015 at 4:50 AM, Marko Rauhamaa ma...@pacujo.net wrote:



Correct. Linux pathnames are octet strings regardless of the locale.

That's why Linux developers should refer to filenames using bytes.
Unfortunately, Python itself violates that principle by having
os.listdir() return str objects (to mention one example).


Only because you gave it a str with the path name. If you want to
refer to file names using bytes, then be consistent and refer to ALL
file names using bytes. As I demonstrated, that works just fine.


Python 3.4.2 (default, Oct  8 2014, 10:45:20)
[GCC 4.9.1] on linux
Type help, copyright, credits or license for more information.

import os
type(os.listdir(os.curdir)[0])

class 'str'


Help on module os:

DESCRIPTION
 This exports:
   - os.curdir is a string representing the current directory ('.' or ':')
   - os.pardir is a string representing the parent directory ('..' or '::')

Explicitly documented as strings. If you want to work with strings,
work with strings. If you want to work with bytes, don't use
os.curdir, use bytes instead. Personally, I'm happy using strings, but
if you want to go down the path of using bytes, you simply have to be
consistent, and that probably means being platform-dependent anyway,
so just use b. for the current directory.


I think we're all agreeing:  not all file systems are the same, and
Python doesn't smooth out all of the bumps, even for something that
seems as simple as displaying the names of files in a directory.  And
that's *after* we've agreed that filesystems contain files in
hierarchical directories.

Dan



Isn't pathlib 
https://docs.python.org/3/library/pathlib.html#module-pathlib 
effectively a more recent attempt at smoothing or even removing (some 
of) the bumps?  Has anybody here got experience of it as I've never used it?


--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence

--
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

Mark Lawrence breamore...@yahoo.co.uk:

 On 07/03/2015 16:25, Marko Rauhamaa wrote:
 Here's an example:

 b = b'\x80'

 Yes, it generates an exception. IOW, UTF-8 is not a bijective mapping
 from str objects to bytes objects.

 Python 2 might, Python 3 doesn't.

   Python 3.3.2 (default, Dec  4 2014, 12:49:00) 
   [GCC 4.8.3 20140911 (Red Hat 4.8.3-7)] on linux
   Type help, copyright, credits or license for more information.
b'\x80'.decode('utf-8')
   Traceback (most recent call last):
 File stdin, line 1, in module
   UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0:
   invalid start byte


Marko
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

On Sun, Mar 8, 2015 at 3:54 AM, Marko Rauhamaa ma...@pacujo.net wrote:
 All you've proven is that there are bit patterns which are not UTF-8
 streams...

 And that causes problems.

Demonstrate.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

Chris Angelico ros...@gmail.com:

 On Sun, Mar 8, 2015 at 4:14 AM, Marko Rauhamaa ma...@pacujo.net wrote:
 File names encoded with Latin-X are quite commonplace even in UTF-8
 locales.

 That is not a problem with UTF-8, though. I don't understand how
 you're blaming UTF-8 for that.

I'm saying it creates practical problems. There's a snake in the
paradise.

 There are two things happening here:

 1) The underlying file system is not UTF-8, and you can't depend on
 that,

Correct. Linux pathnames are octet strings regardless of the locale.

That's why Linux developers should refer to filenames using bytes.
Unfortunately, Python itself violates that principle by having
os.listdir() return str objects (to mention one example).

 2) You forgot to put the path on that, so it failed to find the file.
 Here's my version of your demo:

 open(/tmp/xyz/+os.listdir('/tmp/xyz')[0])
 _io.TextIOWrapper name='/tmp/xyz/\udc80' mode='r' encoding='UTF-8'

 Looks fine to me.

I stand corrected.

Then we have:

os.listdir()[0].encode('utf-8')
   Traceback (most recent call last):
 File stdin, line 1, in module
   UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in
   position 0: surrogates not allowed


Marko
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

On Sun, Mar 8, 2015 at 5:02 AM, Dan Sommers d...@tombstonezero.net wrote:
 On Sun, 08 Mar 2015 04:59:56 +1100, Chris Angelico wrote:

 On Sun, Mar 8, 2015 at 4:50 AM, Marko Rauhamaa ma...@pacujo.net wrote:

 Correct. Linux pathnames are octet strings regardless of the locale.

 That's why Linux developers should refer to filenames using bytes.
 Unfortunately, Python itself violates that principle by having
 os.listdir() return str objects (to mention one example).

 Only because you gave it a str with the path name. If you want to
 refer to file names using bytes, then be consistent and refer to ALL
 file names using bytes. As I demonstrated, that works just fine.

 Python 3.4.2 (default, Oct  8 2014, 10:45:20)
 [GCC 4.9.1] on linux
 Type help, copyright, credits or license for more information.
 import os
 type(os.listdir(os.curdir)[0])
 class 'str'

Help on module os:

DESCRIPTION
This exports:
  - os.curdir is a string representing the current directory ('.' or ':')
  - os.pardir is a string representing the parent directory ('..' or '::')

Explicitly documented as strings. If you want to work with strings,
work with strings. If you want to work with bytes, don't use
os.curdir, use bytes instead. Personally, I'm happy using strings, but
if you want to go down the path of using bytes, you simply have to be
consistent, and that probably means being platform-dependent anyway,
so just use b. for the current directory.

Normally, using Unicode strings for file names will work just fine.
Any name that you craft yourself will be correctly encoded for the
target file system (or UTF-8 if you can't know), and any that you get
back from os.listdir or equivalent will be usable in file name
contexts. What else can you do with a file name that isn't encoded the
way you expect it to be? Unless you have some out-of-band encoding
information, you can't do anything meaningful with the stream of
bytes, other than keeping it exactly as it is.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

2015-03-07 Thread Albert-Jan Roskam

--- Original Message -

 From: Chris Angelico ros...@gmail.com
 To: 
 Cc: python-list@python.org python-list@python.org
 Sent: Saturday, March 7, 2015 6:26 PM
 Subject: Re: Newbie question about text encoding

 On Sun, Mar 8, 2015 at 4:14 AM, Marko Rauhamaa ma...@pacujo.net wrote:
  See:

 $ mkdir /tmp/xyz
 $ touch /tmp/xyz/
  \x80'
 $ python3
 Python 3.3.2 (default, Dec  4 2014, 12:49:00)
 [GCC 4.8.3 20140911 (Red Hat 4.8.3-7)] on linux
 Type help, copyright, credits or 
 license for more information.
  import os
  os.listdir('/tmp/xyz')
 ['\udc80']
  open(os.listdir('/tmp/xyz')[0])
 Traceback (most recent call last):
   File stdin, line 1, in module
 FileNotFoundError: [Errno 2] No such file or directory: 
 '\udc80'

  File names encoded with Latin-X are quite commonplace even in UTF-8
  locales.

 That is not a problem with UTF-8, though. I don't understand how
 you're blaming UTF-8 for that. There are two things happening here:

 1) The underlying file system is not UTF-8, and you can't depend on
 that, ergo the decode to Unicode has to have some special handling of
 failing bytes.
 2) You forgot to put the path on that, so it failed to find the file.
 Here's my version of your demo:

  open(/tmp/xyz/+os.listdir('/tmp/xyz')[0])
 _io.TextIOWrapper name='/tmp/xyz/\udc80' mode='r' 
 encoding='UTF-8'

 Looks fine to me.

 Alternatively, if you pass a byte string to os.listdir, you get back a
 list of byte string file names:

  os.listdir(b/tmp/xyz)

 [b'\x80']

Nice, I did not know that. And glob.glob works the same way: it returns a list 
of ustrings when given a ustring, and returns bstrings when given a bstring.
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

2015-03-07 Thread Dan Sommers

On Sat, 07 Mar 2015 19:00:47 +, Mark Lawrence wrote:

 Isn't pathlib
 https://docs.python.org/3/library/pathlib.html#module-pathlib
 effectively a more recent attempt at smoothing or even removing (some
 of) the bumps?  Has anybody here got experience of it as I've never
 used it?

I almost said something about Common Lisp's PATHNAME type, but I didn't.

An extremely quick reading of that page tells me that os.pathlib
addresses *some* of the issues that PATHNAME addresses, but os.pathlib
seems more limited in scope (e.g., os.pathlib doesn't account for
filesystems with versioned files).  I'll certainly have a closer look
later.
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

Chris Angelico ros...@gmail.com:

 On Sun, Mar 8, 2015 at 2:48 AM, Marko Rauhamaa ma...@pacujo.net wrote:
 Steven D'Aprano steve+comp.lang.pyt...@pearwood.info:

 Marko Rauhamaa wrote:

 That said, UTF-8 does suffer badly from its not being
 a bijective mapping.

 Can you explain?

 In Python terms, there are bytes objects b that don't satisfy:

b.decode('utf-8').encode('utf-8') == b

 Please provide an example; that sounds like a bug. If there is any
 invalid UTF-8 stream which decodes without an error, it is actually a
 security bug, and should be fixed pronto in all affected and supported
 versions.

Here's an example:

   b = b'\x80'

Yes, it generates an exception. IOW, UTF-8 is not a bijective mapping
from str objects to bytes objects.


Marko
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

On Sat, Mar 7, 2015 at 10:09 PM, Steven D'Aprano
steve+comp.lang.pyt...@pearwood.info wrote:
 Stop using MySQL, which is a joke of a database[1], and use Postgres which
 does not have this problem.

I agree with the recommendation, though to be fair to MySQL, it is now
possible to store full Unicode. Though personally, I think the whole
UTF8MB3 vs UTF8MB4 split is an embarrassment and should be abolished
*immediately* - not we may change the meaning of UTF8 to be an alias
for UTF8MB4 in the future, just completely abolish the distinction
right now. (And deprecate the longer words.) There should be no reason
to build any kind of UTF-8 but limited to three bytes encoding for
anything. Ever.

But at least you can, if you configure things correctly, store any
Unicode character in your TEXT field.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

Steven D'Aprano steve+comp.lang.pyt...@pearwood.info:

 Rustom Mody wrote:
 My conclusion: Early adopters of unicode -- Windows and Java -- were
 punished for their early adoption. You can blame the unicode
 consortium, you can blame the babel of human languages, particularly
 that some use characters and some only (the equivalent of) what we
 call words.

 I see you are blaming everyone except the people actually to blame.

I don't think you need to blame anybody. I think the UCS-2 mistake was
both deplorable and very understandable. At the time it looked like the
magic bullet to get out of the 8-bit mess. While 16-bit wide wchar_t's
looked like a hugely expensive price, it was deemed forward-looking to
pay it anyway to resolve the character set problem once and for all.

Linux was lucky to join the fray late enough to benefit from the bad
UCS-2 experience. That said, UTF-8 does suffer badly from its not being
a bijective mapping.

(Linux didn't quite dodge the bullet with pthreads, threads being
another sad fad of the 1990's. The hippies that cooked up the fork
system call should be awarded the next Millennium Prize. That foresight
or stroke of luck has withstood the challenge of half a century.)

 But there's nothing wrong with the design of the SMP. It allows the
 great majority of text, probably 99% or more, to use two bytes
 (UTF-16) or no more than three bytes (UTF-8), while only relatively
 specialised uses need four bytes for some code points.

The main dream was a fixed-width encoding scheme. People thought 16 bits
would be enough. The dream is so precious and true to us in the West
that people don't want to give it up.

It may yet be that UTF-32 replaces all previous schemes since it has all
the benefits of ASCII and only one drawback: redundancy. Maybe one day
we'll declare the byte 32 bits wide and be done with it. In some many
other aspects, 32-bit bytes are the de-facto reality already. Even C
coders routinely use 32 bits to express boolean values.

 And when Roy's customers demand that his product support emoji, or
 complain that they cannot spell their own name because of his
 parochial and ignorant idea of crap, perhaps he will consider doing
 what he should have done from the beginning:

That's a recurring theme: Why didn't we do IPv6 from the get-go? Why
didn't we do multi-user from the get-go? Why didn't we do localization
from the get-go?

There comes a point when you have to release to start making money. You
then suffer the consequences until your company goes bankrupt.


Marko
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding


On 07/03/2015 12:02, Chris Angelico wrote:

On Sat, Mar 7, 2015 at 10:53 PM, Marko Rauhamaa ma...@pacujo.net wrote:

The main dream was a fixed-width encoding scheme. People thought 16 bits
would be enough. The dream is so precious and true to us in the West
that people don't want to give it up.


So... use Pike, or Python 3.3+?

ChrisA



Cue obligatory cobblers from our RUE.

--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence

--
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding


On 07/03/2015 11:09, Steven D'Aprano wrote:

Rustom Mody wrote:



This includes not just bug-prone-system code such as Java and Windows but
seemingly working code such as python 3.


What Unicode bugs do you think Python 3.3 and above have?



Methinks somebody has been drinking too much loony juice.  Either that 
or taking too much notice of our RUE.  Not that I've done a proper 
analysis, but to my knowledge there's nothing like the number of issues 
on the bug tracker for Unicode bugs for Python 3 compared to Python 2.


--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence

--
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

2015-03-07 Thread Steven D'Aprano

Marko Rauhamaa wrote:

 That said, UTF-8 does suffer badly from its not being
 a bijective mapping.

Can you explain?

As far as I am aware, every code point has one and only one valid UTF-8
encoding, and every UTF-8 encoding has one and only one valid code point.

There are *invalid* UTF-8 encodings, such as CESU-8, which is sometimes
mislabelled as UTF-8 (Oracle, I'm looking at you.) It violates the rule
that valid UTF-8 encodings are the shortest possible.

E.g. SMP code points should be encoded to four bytes using UTF-8:

py u'\U0010FF01'.encode('utf-8')  # U+10FF01
'\xf4\x8f\xbc\x81'


But in CESU-8, the code point is first interpreted as a UTF-16 surrogate
pair:

py u'\U0010FF01'.encode('utf-16be')
'\xdb\xff\xdf\x01'


then each surrogate pair is treated as a 16-bit code unit and individually
encoded to three bytes using UTF-8:

py u'\udbff'.encode('utf-8')
'\xed\xaf\xbf'
py u'\udf01'.encode('utf-8')
'\xed\xbc\x81'


giving six bytes in total:

'\xed\xaf\xbf\xed\xbc\x81'


This is not UTF-8! But some software mislabels it as UTF-8.


-- 
Steven

-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

2015-03-07 Thread Steven D'Aprano

Rustom Mody wrote:

On Thursday, March 5, 2015 at 7:36:32 PM UTC+5:30, Steven D'Aprano wrote:
[...]
Chris is suggesting that going from BMP to all of Unicode is not the hard
part. Going from ASCII to the BMP part of Unicode is the hard part. If
you can do that, you can go the rest of the way easily.

Depends where the going is starting from.
I specifically names Java, Javascript, Windows... among others.
Here's some quotes from the supplementary chars doc of Java

http://www.oracle.com/technetwork/articles/javase/supplementary-142654.html

| Supplementary characters are characters in the Unicode standard whose
| code points are above U+, and which therefore cannot be described as
| single 16-bit entities such as the char data type in the Java
| programming language. Such characters are generally rare, but some are
| used, for example, as part of Chinese and Japanese personal names, and
| so support for them is commonly required for government applications in
| East Asian countries...

| The introduction of supplementary characters unfortunately makes the
| character model quite a bit more complicated.

| Unicode was originally designed as a fixed-width 16-bit character
| encoding. The primitive data type char in the Java programming language
| was intended to take advantage of this design by providing a simple data
| type that could hold
| any character Version 5.0 of the J2SE is required to support
| version 4.0 of the Unicode standard, so it has to support supplementary
| characters.

My conclusion: Early adopters of unicode -- Windows and Java -- were
punished
for their early adoption. You can blame the unicode consortium, you can
blame the babel of human languages, particularly that some use characters
and some only (the equivalent of) what we call words.

I see you are blaming everyone except the people actually to blame.

It is 2015. Unicode 2.0 introduced the SMPs in 1996, almost twenty years
ago, the same year as 1.0 release of Java. Java has had eight major new
releases since then. Oracle, and Sun before them, are/were serious, tier-1,
world-class major IT companies. Why haven't they done something about
introducing proper support for Unicode in Java? It's not hard -- if Python
can do it using nothing but volunteers, Oracle can do it. They could even
do it in a backwards-compatible way, by leaving the existing APIs in place
and adding new APIs.

As for Microsoft, as a member of the Unicode Consortium they have no excuse.
But I think you exaggerate the lack of support for SMPs in Windows. Some
parts of Windows have no SMP support, but they tend to be the oldest and
less important (to Microsoft) parts, like the command prompt.

Anyone have Powershell and like to see how well it supports SMP?

This Stackoverflow question suggests that post-Windows 2000, the Windows
file system has proper support for code points in the supplementary planes:

http://stackoverflow.com/questions/7870014/how-does-windows-wchar-t-handle-unicode-characters-outside-the-basic-multilingua

Or maybe not.

Or you can skip the blame-game and simply note the fact that large
segments of extant code-bases are currently in bug-prone or plain buggy
state.

This includes not just bug-prone-system code such as Java and Windows but
seemingly working code such as python 3.

What Unicode bugs do you think Python 3.3 and above have?

I mostly agree with Chris. Supporting *just* the BMP is non-trivial in
UTF-8 and UTF-32, since that goes against the grain of the system. You
would have to program in artificial restrictions that otherwise don't
exist.

Yes UTF-8 and UTF-32 make most of the objections to unicode 7.0
irrelevant.

Glad you agree about that much at least.

[...]
Conclusion: faulty implementations of UTF-16 which incorrectly handle
surrogate pairs should be replaced by non-faulty implementations, or
changed to UTF-8 or UTF-32; incomplete Unicode implementations which
assume that Unicode is 16-bit only (e.g. UCS-2) are obsolete and should
be upgraded.

Imagine for a moment a thought experiment -- we are not on a python but a
java forum and please rewrite the above para.

There is no need to re-write it. If Java's only implementation of Unicode
assumes that code points are 16 bits only, then Java needs a new Unicode
implementation. (I assume that the existing one cannot be changed for
backwards-compatibility reasons.)

Are you addressing the vanilla java programmer? Language implementer?
Designer? The Java-funders -- earlier Sun, now Oracle?

The last three should be considered the same people.

The vanilla Java programmer is not responsible for the short-comings of
Java's implementation.

[...]
In practice, standards change.
However if a standard changes so frequently that that users have to
play catching cook and keep asking: Which version? they are justified
in asking Are the standard-makers doing due diligence?

Since Unicode has stability

Re: Newbie question about text encoding

On Sat, Mar 7, 2015 at 10:53 PM, Marko Rauhamaa ma...@pacujo.net wrote:
 The main dream was a fixed-width encoding scheme. People thought 16 bits
 would be enough. The dream is so precious and true to us in the West
 that people don't want to give it up.

So... use Pike, or Python 3.3+?

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

2015-03-07 Thread Rustom Mody

On Saturday, March 7, 2015 at 11:41:53 AM UTC+5:30, Terry Reedy wrote:
 On 3/6/2015 11:20 AM, Rustom Mody wrote:
 
  =
  pp = 
  print (pp)
  =
  Try open it in idle3 and you get (at least I get):
 
  $ idle3 ff.py
  Traceback (most recent call last):
 File /usr/bin/idle3, line 5, in module
   main()
 File /usr/lib/python3.4/idlelib/PyShell.py, line 1562, in main
   if flist.open(filename) is None:
 File /usr/lib/python3.4/idlelib/FileList.py, line 36, in open
   edit = self.EditorWindow(self, filename, key)
 File /usr/lib/python3.4/idlelib/PyShell.py, line 126, in __init__
   EditorWindow.__init__(self, *args)
 File /usr/lib/python3.4/idlelib/EditorWindow.py, line 294, in __init__
   if io.loadfile(filename):
 File /usr/lib/python3.4/idlelib/IOBinding.py, line 236, in loadfile
   self.text.insert(1.0, chars)
 File /usr/lib/python3.4/idlelib/Percolator.py, line 25, in insert
   self.top.insert(index, chars, tags)
 File /usr/lib/python3.4/idlelib/UndoDelegator.py, line 81, in insert
   self.addcmd(InsertCommand(index, chars, tags))
 File /usr/lib/python3.4/idlelib/UndoDelegator.py, line 116, in addcmd
   cmd.do(self.delegate)
 File /usr/lib/python3.4/idlelib/UndoDelegator.py, line 219, in do
   text.insert(self.index1, self.chars, self.tags)
 File /usr/lib/python3.4/idlelib/ColorDelegator.py, line 82, in insert
   self.delegate.insert(index, chars, tags)
 File /usr/lib/python3.4/idlelib/WidgetRedirector.py, line 148, in 
  __call__
   return self.tk_call(self.orig_and_operation + args)
  _tkinter.TclError: character U+1f4a9 is above the range (U+-U+) 
  allowed by Tcl
 
  So who/what is broken?
 
 tcl
 The possible workaround is for Idle to translate  to \U0001f4a9 
 (10 chars) before sending it to tk.
 
 But some perspective.  In the console interpreter:
 
   print(\U0001f4a9)
 Traceback (most recent call last):
File stdin, line 1, in module
File C:\Programs\Python34\lib\encodings\cp437.py, line 19, in encode
  return codecs.charmap_encode(input,self.errors,encoding_map)[0]
 UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f4a9' 
 in posit
 ion 0: character maps to undefined
 
 So what is broken?  Windows Command Prompt.
 
 More perspective.  tk/Idle *will* print *something* for every BMP char. 
   Command Prompt will not.  It does not even do ucs-2 correctly. So 
 which is more broken?  Windows Command Prompt.  Who has perhaps 
 1,000,000 times more resources, Microsoft? or the tcl/tk group?  I think 
 we all know.

Thanks Terry for the perspective.

From my side:

No complaints about python or tcl (or idle -- its actually neater than emacs
if only emacs was not burnt into my nervous system)

Even unicode -- only marginal complaints.
I wrote http://blog.languager.org/2015/02/universal-unicode.html
precisely to say that unicode is a wonderful thing and one should be 
enthusiastic
about it.
[You got that better than anyone else who has spoken -- Thanks]

Xah's pages are way more comprehensive than mine.
But comprehensive can be a negative -- ultimately the unicode standard is
the most comprehensive and correspondingly impenetrable without a compass.

The only very minor complaint I would make is:
If idle is unable to deal with SMP-chars and this is known and unlikely to 
change
(until TK changes), why not put up a dialog of the sort:
SMP char on line nn
SMP support currently unimplemented -- Sorry

instead of a backtrace?

[As I said just a suggestion]
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

2015-03-07 Thread Rustom Mody

On Saturday, March 7, 2015 at 11:49:44 PM UTC+5:30, Mark Lawrence wrote:
 On 07/03/2015 17:16, Marko Rauhamaa wrote:
  Mark Lawrence:
 
  It would clearly help if you were to type in the correct UK English
  accent.
 
  Your ad-hominem-to-contribution ratio is alarmingly high.
 
 
  Marko
 
 
 You've been a PITA ever since you first joined this list, what about it?
 
 -- 
 My fellow Pythonistas, ask not what our language can do for you, ask
 what you can do for our language.

Hi Mark
Your UK accent above is funny [At least *I* find it so]
The above however is crossing a line. Please desist.
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

2015-03-07 Thread Steven D'Aprano

Marko Rauhamaa wrote:

 Steven D'Aprano steve+comp.lang.pyt...@pearwood.info:
 
 Marko Rauhamaa wrote:

 That said, UTF-8 does suffer badly from its not being
 a bijective mapping.

 Can you explain?
 
 In Python terms, there are bytes objects b that don't satisfy:
 
b.decode('utf-8').encode('utf-8') == b

Are you talking about the fact that not all byte streams are valid UTF-8?
That is, some byte objects b may raise an exception on b.decode('utf-8').

I don't see why that means UTF-8 suffers badly from this. Can you give an
example of where you would expect to take an arbitrary byte-stream, decode
it as UTF-8, and expect the results to be meaningful?

For those cases where you do wish to take an arbitrary byte stream and
round-trip it, Python now provides an error handler for that.

py import random
py b = bytes([random.randint(0, 255) for _ in range(1)])
py s = b.decode('utf-8')
Traceback (most recent call last):
  File stdin, line 1, in module
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x94 in position 0:
invalid start byte
py s = b.decode('utf-8', errors='surrogateescape')
py s.encode('utf-8', errors='surrogateescape') == b
True



-- 
Steven

-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

On Sat, Mar 7, 2015 at 1:03 AM,  random...@fastmail.us wrote:
 On Fri, Mar 6, 2015, at 08:39, Chris Angelico wrote:
 Number of code points is the most logical way to length-limit
 something. If you want to allow users to set their display names but
 not to make arbitrarily long ones, limiting them to X code points is
 the safest way (and preferably do an NFC or NFD normalization before
 counting, for consistency);

 Why are you length-limiting it? Storage space? Limit it in whatever
 encoding they're stored in. Why are combining marks pathological but
 surrogate characters not? Display space? Limit it by columns. If you're
 going to allow a Japanese user's name to be twice as wide, you've got a
 problem when you go to display it.

To prevent people from putting three paragraphs of lipsum in and
calling it a username.

 this means you disallow pathological cases
 where every base character has innumerable combining marks added.

 No it doesn't. If you limit it to, say, fifty, someone can still post
 two base characters with twenty combining marks each. If you actually
 want to disallow this, you've got to do more work. You've disallowed
 some of the pathological cases, some of the time, by coincidence. And
 limiting the number of UTF-8 bytes, or the number of UTF-16 code points,
 will accomplish this just as well.

They can, but then they're limited to two base characters. They can't
have fifty base characters with twenty combining marks each. That's
the point.

 Now, if you intend to _silently truncate_ it to the desired length, you
 certainly don't want to leave half a character in, of course. But who's
 to say the base character plus first few combining marks aren't also
 half a character? If you're _splitting_ a string, rather than merely
 truncating it, you probably don't want those combining marks at the
 beginning of part two.

So you truncate to the desired length, then if the first character of
the trimmed-off section is a combining mark (based on its Unicode
character types), you keep trimming until you've removed a character
which isn't. Then, if you no longer have any content whatsoever,
reject the name. Simple.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

2015-03-06 Thread random832

On Fri, Mar 6, 2015, at 09:11, Chris Angelico wrote:
 To prevent people from putting three paragraphs of lipsum in and
 calling it a username.

Limiting by UTF-8 bytes or UTF-16 units works just as well for that.

 So you truncate to the desired length, then if the first character of
 the trimmed-off section is a combining mark (based on its Unicode
 character types), you keep trimming until you've removed a character
 which isn't. Then, if you no longer have any content whatsoever,
 reject the name. Simple.

My entire point was that UTF-32 doesn't save you from that, so it cannot
be called a deficiency of UTF-16. My point is there are very few
problems to which count of Unicode code points is the only right
answer - that UTF-32 is good enough for but that are meaningfully
impacted by a naive usage of UTF-16, to the point where UTF-16 is
something you have to be safe from.
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

On Sat, Mar 7, 2015 at 1:50 AM, Steven D'Aprano
steve+comp.lang.pyt...@pearwood.info wrote:
 Rustom Mody wrote:

 On Friday, March 6, 2015 at 10:50:35 AM UTC+5:30, Chris Angelico wrote:

 [snip example of an analogous situation with NULs]

 Strawman.

 Sigh. If I had a dollar for every time somebody cried Strawman! when what
 they really should say is Yes, that's a good argument, I'm afraid I can't
 argue against it, at least not without considerable thought, I'd be a
 wealthy man...

If I had a dollar for every time anyone said If I had insert
currency unit here for every time..., I'd go meta all day long and
profit from it... :)

 - If you are writing your own file system layer, it's 2015 fer fecks sake,
 file names should be Unicode strings, not bytes! (That's one part of the
 Unix model that needs to die.) You can use UTF-8 or UTF-16 in the file
 system, whichever you please, but again remember that both are
 variable-width formats.

I agree that that part of the Unix model needs to change, but there
are two viable ways to move forward:

1) Keep file names as bytes, but mandate that they be valid UTF-8
streams, and recommend that they be decoded UTF-8 for display to a
human
2) Change the entire protocol stack from the file system upwards so
that file names become Unicode strings.

Trouble with #2 is that file names need to be passed around somehow,
which means bytes in memory. So ultimately, #2 really means keep file
names as bytes, and mandate an encoding all the way up the stack...
so it's a massive documentation change that really comes down to the
same thing as #1.

This is one area where, as I understand it, Mac OS got it right. It's
time for other Unix variants to adopt the same policy. The bulk of
file names will be ASCII-only anyway, so requiring UTF-8 won't affect
them; a lot of others are already UTF-8; so all we need is a
transition scheme for the remaining ones. If there's a known FS
encoding, it ought to be possible to have a file system conversion
tool that goes through everything, decodes, re-encodes UTF-8, and then
flags the file system as UTF-8 compliant. All that'd be left would be
the file names that are broken already - ones that don't decode in the
FS encoding - and there's nothing to be done with them but wrap them
up into something probably-meaningless-but reversible.

When can we start doing this? ext5?

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

2015-03-06 Thread random832

On Fri, Mar 6, 2015, at 08:39, Chris Angelico wrote:
 Number of code points is the most logical way to length-limit
 something. If you want to allow users to set their display names but
 not to make arbitrarily long ones, limiting them to X code points is
 the safest way (and preferably do an NFC or NFD normalization before
 counting, for consistency);

Why are you length-limiting it? Storage space? Limit it in whatever
encoding they're stored in. Why are combining marks pathological but
surrogate characters not? Display space? Limit it by columns. If you're
going to allow a Japanese user's name to be twice as wide, you've got a
problem when you go to display it.

 this means you disallow pathological cases
 where every base character has innumerable combining marks added.

No it doesn't. If you limit it to, say, fifty, someone can still post
two base characters with twenty combining marks each. If you actually
want to disallow this, you've got to do more work. You've disallowed
some of the pathological cases, some of the time, by coincidence. And
limiting the number of UTF-8 bytes, or the number of UTF-16 code points,
will accomplish this just as well.

Now, if you intend to _silently truncate_ it to the desired length, you
certainly don't want to leave half a character in, of course. But who's
to say the base character plus first few combining marks aren't also
half a character? If you're _splitting_ a string, rather than merely
truncating it, you probably don't want those combining marks at the
beginning of part two.
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

2015-03-06 Thread Steven D'Aprano

Rustom Mody wrote:

 On Friday, March 6, 2015 at 10:50:35 AM UTC+5:30, Chris Angelico wrote:

[snip example of an analogous situation with NULs]

 Strawman.

Sigh. If I had a dollar for every time somebody cried Strawman! when what
they really should say is Yes, that's a good argument, I'm afraid I can't
argue against it, at least not without considerable thought, I'd be a
wealthy man...


 Lets please stick to UTF-16 shall we?
 
 Now tell me:
 - Is it broken or not?

The UTF-16 standard is not broken. It is a perfectly adequate variable-width
encoding, and considerably better than most other variable-width encodings.

However, many implementations of UTF-16 are faulty, and assume a
fixed-width. *That* is broken, not UTF-16.

(The difference between specification and implementation is critical.)


 - Is it widely used or not?

It's quite widely used.


 - Should programmers be careful of it or not?

Programmers should be aware whether or not any specific language uses UTF-16
and whether the implementation is buggy. That will help them decide whether
or not to use that language.


 - Should programmers be warned about it or not?

I'm in favour of people having more knowledge rather than less. I don't
believe that ignorance is bliss, except perhaps in the case that a giant
asteroid the size of Texas is heading straight for us.

Programmers should be aware of the limitations or bugs in any UTF-16
implementation they are likely to run into. Hence my general
recommendation:

- For transmission over networks or storage on permanent media (e.g. the
content of text files), use UTF-8. It is well-implemented by nearly all
languages that support Unicode, as far as I know.

- If you are designing your own language, your implementation of Unicode
strings should use something like Python's FSR, or UTF-8 with tweaks to
make string indexing O(1) rather than O(N), or correctly-implemented
UTF-16, or even UTF-32 if you have the memory. (Choices, choices.) If, in
2015, you design your Unicode implementation as if UTF-16 is a fixed 2-byte
per code point format, you fail.

- If you are using an existing language, be aware of any bugs and
limitations in its Unicode implementation. You may or may not be able to
work around them, but at least you can decide whether or not you wish to
try.

- If you are writing your own file system layer, it's 2015 fer fecks sake,
file names should be Unicode strings, not bytes! (That's one part of the
Unix model that needs to die.) You can use UTF-8 or UTF-16 in the file
system, whichever you please, but again remember that both are
variable-width formats.



-- 
Steven

-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

On Fri, Mar 6, 2015 at 8:02 PM, Rustom Mody rustompm...@gmail.com wrote:
 Broken systems can be shown up by anything. Suppose you have a program
 that breaks when it gets a NUL character (not unknown in C code); is
 the fault with the Unicode consortium for allocating something at
 codepoint 0, or the code that can't cope with a perfectly normal
 character?

 Strawman.

Not really, no. I know of lots of programs that can't handle embedded
NULs, and which fail in various ways when given them (the most common
is simple truncation, but it's by far not the only way). And it's
exactly the same: a program that purports to handle arbitrary Unicode
text should be able to handle arbitrary Unicode text, not Unicode
text as long as it contains only codepoints within the range X-Y. It
doesn't matter whether the code chokes on U+, U+005C, U+FFFC, or
U+1F4A3 - if your code blows up, it's a failure in your code.

 Lets please stick to UTF-16 shall we?

 Now tell me:
 - Is it broken or not?
 - Is it widely used or not?
 - Should programmers be careful of it or not?
 - Should programmers be warned about it or not?

No, UTF-16 is not itself broken. (It would be if we expected
codepoints 0x10, and it's because of UTF-16 that that's the cap
on Unicode, but it's looking unlikely that we'll be needing any more
than that anyway.) What's broken is code that tries to treat UTF-16 as
if it's UCS-2, and then breaks on surrogate pairs.

Yes, it's widely used. Programmers should probably be warned about it,
but only because its tradeoffs are generally poorer than UTF-8's. If
you use it correctly, there's no problem.

 Also:
 Can a programmer who is away from UTF-16 in one part of the system (say by 
 using python3)
 assume he is safe all over?

I don't know what you mean here. Do you mean that your Python 3
program is at risk in some way because there might be some other
program that misuses UTF-16? Well, sure. And there might be some other
program that misuses buffer sizes, SQL queries, or shell invocations,
and makes your overall system vulnerable to buffer overruns or
injection attacks. These are significantly more likely AND more
serious than UTF-16 misuses. And you still have not proven anything
about SMP characters being a problem, but only that code can be
broken. Broken code is still broken code, no matter what your actual
brokenness.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

2015-03-06 Thread random832

On Fri, Mar 6, 2015, at 04:06, Rustom Mody wrote:
 Also:
 Can a programmer who is away from UTF-16 in one part of the system (say
 by using python3)
 assume he is safe all over?

The most common failure of UTF-16 support, supposedly, is in programs
misusing the number of code units (for length or random access) as a
proxy for the number of characters.

However, when do you _really_ want the number of characters? You may
want to use it for, for example, the number of columns in a 'monospace'
font, which you've already screwed up because you haven't accounted for
double-wide characters or combining marks. Or you may want the position
that pressing an arrow key or backspace or forward-delete a number of
times will reach, which has its own rules in e.g. Indic languages (and
also fails on Latin with combining marks).
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

On Sat, Mar 7, 2015 at 12:33 AM,  random...@fastmail.us wrote:
 However, when do you _really_ want the number of characters? You may
 want to use it for, for example, the number of columns in a 'monospace'
 font, which you've already screwed up because you haven't accounted for
 double-wide characters or combining marks. Or you may want the position
 that pressing an arrow key or backspace or forward-delete a number of
 times will reach, which has its own rules in e.g. Indic languages (and
 also fails on Latin with combining marks).

Number of code points is the most logical way to length-limit
something. If you want to allow users to set their display names but
not to make arbitrarily long ones, limiting them to X code points is
the safest way (and preferably do an NFC or NFD normalization before
counting, for consistency); this means you disallow pathological cases
where every base character has innumerable combining marks added.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

2015-03-06 Thread Steven D'Aprano

random...@fastmail.us wrote:

 My point is there are very few
 problems to which count of Unicode code points is the only right
 answer - that UTF-32 is good enough for but that are meaningfully
 impacted by a naive usage of UTF-16, to the point where UTF-16 is
 something you have to be safe from.

I'm not sure why you care about the count of Unicode code points, although
that *is* a problem. Not for end-user reasons like how long is my
password?, but because it makes your job as a programmer harder.


[steve@ando ~]$ python2.7 -c print (len(u'\U:\U00014445'))
4
[steve@ando ~]$ python3.3 -c print (len(u'\U:\U00014445'))
3

It's hard to reason about your code when something as fundamental as the
length of a string is implementation-dependent. (By the way, the right
answer should be 3, not 4.)


But an even more important problem is that broken-UTF-16 lets you create
invalid, impossible Unicode strings *by accident*. Naturally you can create
broken Unicode if you assemble strings of surrogates yourself, but
broken-UTF-16 means it can happen from otherwise innocuous operations like
reversing a string:

py s = u'\U:\U00014445'  # Python 2.7 narrow build
py s[::-1]
u'\udc45\ud811:\u'


It's hard for me to demonstrate that the reversed string is broken because
the shell I am using does an amazingly good job of handling broken Unicode.
Even if I print it, the shell just prints missing-character glyphs instead
of crashing (fortunately for me!). But the first two code points are in
illegal order:

\udc45 is a high surrogate, and must follow a low surrogate;
\ud811 is a low surrogate, and must precede a high surrogate;

I'm not convinced you should be allowed to create Unicode strings containing
mismatched surrogates like this deliberately, but you certainly shouldn't
be able to do so by accident.


-- 
Steven

-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

On Friday, March 6, 2015 at 8:20:22 PM UTC+5:30, Steven D'Aprano wrote:
 Rustom Mody wrote:
 
  On Friday, March 6, 2015 at 10:50:35 AM UTC+5:30, Chris Angelico wrote:
 
 [snip example of an analogous situation with NULs]
 
  Strawman.
 
 Sigh. If I had a dollar for every time somebody cried Strawman! when what
 they really should say is Yes, that's a good argument, I'm afraid I can't
 argue against it, at least not without considerable thought, I'd be a
 wealthy man...

Missed my addition? Here it is again –  grammar slightly corrected.

===
Ah well if you insist on pursuing the nul-char example...
- No, the unicode consortium (or ASCII equivalent) is not wrong in allocating 
codepoint 0

- No, the code that can't cope with a perfectly normal character is not wrong

- It is C that is wrong for designing a buggy string data structure that cannot
contain a valid char.
===

In fact Chris' nul-char example is so strongly supporting my argument – 
bugginess of UTF-16 –
it is perhaps too strong even for me.

To elaborate:
Take the buggy-plane analogy I gave in
http://blog.languager.org/2015/03/whimsical-unicode.html

If a plane model crashes once in 10,000 flights compared to others that crash 
once in
one million flights we can call it bug-prone though not strictly buggy – it 
does fly  
 times safely!
OTOH if a plane is guaranteed to crash we can all it a buggy plane.

C's string is not bug-prone its plain buggy as it cannot represent strings
with nulls.

I would not go that far for UTF-16.
It is bug-inviting but it can also be implemented correctly
 
 
  Lets please stick to UTF-16 shall we?
  
  Now tell me:
  - Is it broken or not?
 
 The UTF-16 standard is not broken. It is a perfectly adequate variable-width
 encoding, and considerably better than most other variable-width encodings.
 
 However, many implementations of UTF-16 are faulty, and assume a
 fixed-width. *That* is broken, not UTF-16.
 
 (The difference between specification and implementation is critical.)
 
 
  - Is it widely used or not?
 
 It's quite widely used.
 
 
  - Should programmers be careful of it or not?
 
 Programmers should be aware whether or not any specific language uses UTF-16
 and whether the implementation is buggy. That will help them decide whether
 or not to use that language.
 
 
  - Should programmers be warned about it or not?
 
 I'm in favour of people having more knowledge rather than less. I don't
 believe that ignorance is bliss, except perhaps in the case that a giant
 asteroid the size of Texas is heading straight for us.
 
 Programmers should be aware of the limitations or bugs in any UTF-16
 implementation they are likely to run into. Hence my general
 recommendation:
 
 - For transmission over networks or storage on permanent media (e.g. the
 content of text files), use UTF-8. It is well-implemented by nearly all
 languages that support Unicode, as far as I know.
 
 - If you are designing your own language, your implementation of Unicode
 strings should use something like Python's FSR, or UTF-8 with tweaks to
 make string indexing O(1) rather than O(N), or correctly-implemented
 UTF-16, or even UTF-32 if you have the memory. (Choices, choices.)

FSR is possible in python for very specific pythonic reasons
- dynamicness
- immutable strings

Drop either and FSR is impossible

 If, in 2015, you design your Unicode implementation as if UTF-16 is a fixed 
 2-byte per code point format, you fail.

Seems obvious enough.
So lets see...
Here's a 2-line python program -- runs well enough when run as a command.
Program:
=
pp = 
print (pp)
=
Try open it in idle3 and you get (at least I get):

$ idle3 ff.py 
Traceback (most recent call last):
  File /usr/bin/idle3, line 5, in module
main()
  File /usr/lib/python3.4/idlelib/PyShell.py, line 1562, in main
if flist.open(filename) is None:
  File /usr/lib/python3.4/idlelib/FileList.py, line 36, in open
edit = self.EditorWindow(self, filename, key)
  File /usr/lib/python3.4/idlelib/PyShell.py, line 126, in __init__
EditorWindow.__init__(self, *args)
  File /usr/lib/python3.4/idlelib/EditorWindow.py, line 294, in __init__
if io.loadfile(filename):
  File /usr/lib/python3.4/idlelib/IOBinding.py, line 236, in loadfile
self.text.insert(1.0, chars)
  File /usr/lib/python3.4/idlelib/Percolator.py, line 25, in insert
self.top.insert(index, chars, tags)
  File /usr/lib/python3.4/idlelib/UndoDelegator.py, line 81, in insert
self.addcmd(InsertCommand(index, chars, tags))
  File /usr/lib/python3.4/idlelib/UndoDelegator.py, line 116, in addcmd
cmd.do(self.delegate)
  File /usr/lib/python3.4/idlelib/UndoDelegator.py, line 219, in do
text.insert(self.index1, self.chars, self.tags)
  File /usr/lib/python3.4/idlelib/ColorDelegator.py, line 82, in insert
self.delegate.insert(index, chars, tags)
  File /usr/lib/python3.4/idlelib/WidgetRedirector.py, line 148, in __call__
return

Re: Newbie question about text encoding

On Sat, Mar 7, 2015 at 3:20 AM, Rustom Mody rustompm...@gmail.com wrote:
 C's string is not bug-prone its plain buggy as it cannot represent strings
 with nulls.

 I would not go that far for UTF-16.
 It is bug-inviting but it can also be implemented correctly

C's standard library string handling functions are restricted in that
they handle a 255-byte alphabet. They do not handle Unicode, they do
not handle NUL, that is simply how they are. But I never said I was
talking about the C standard library. If you type a text string into a
GUI entry field, or encode it quoted-printable and pass it to a web
server, or whatever, you shouldn't know or care about what language
the program is written in; and if that program barfs on a NUL, that's
a limitation. That limitation might be caused by its naive use of
strcpy() when it should have used memcpy(), but that's not your
problem.

It's exactly the same here: if your program chokes on an SMP
character, I don't care what your program was written in or what
library functions your program called on. All I care is that your
program - repeated for emphasis, *your* program - failed on that
input. It's up to you to choose your underlying functions
appropriately.

 - If you are designing your own language, your implementation of Unicode
 strings should use something like Python's FSR, or UTF-8 with tweaks to
 make string indexing O(1) rather than O(N), or correctly-implemented
 UTF-16, or even UTF-32 if you have the memory. (Choices, choices.)

 FSR is possible in python for very specific pythonic reasons
 - dynamicness
 - immutable strings

 Drop either and FSR is impossible

I don't know what you mean by dynamicness. What you do need is a
Unicode string type, such that the application program isn't aware of
the underlying bytes, but simply treats this string as a sequence of
code points. The immutability isn't technically a requirement, but it
does make the FSR much more manageable; in a language with mutable
strings, it's probably more efficient to use UTF-32 for simplicity,
but it's up to the language designer to figure that out. (It might be
best to use something like the FSR, but where strings are never
narrowed after being widened, so it'd be possible for an ASCII-only
string to be stored UTF-32. That has consequences for comparisons, but
might give a reasonable hybrid of storage and mutation performance.)

 _tkinter.TclError: character U+1f4a9 is above the range (U+-U+) 
 allowed by Tcl

 So who/what is broken?

The exception is pretty clear on that point. Tcl can't handle SMP
characters. So it's Tcl that's broken. Unless there's evidence to the
contrary, that's what I would expect to be the case.

 Correct.
 Windows is broken for using UTF-16
 Linux is broken for conflating UTF-8 and byte string.

 Lot of breakage out here dont you think?
 May be related to the equation

 UTF-16 = UCS-2 + Duct-tape

UTF-16 is an encoding that was designed to be backward-compatible with
UCS-2, just as UTF-8 was designed to be compatible with ASCII. Call it
what you will, but backward compatibility is pretty important. Look at
things like DES3 - if you use the same key three times, it's
compatible with DES.

Linux isn't broken for conflating UTF-8 and byte strings. Linux is
flawed in that it defines file names to be byte strings, which means
that every file system could be different in what it actually uses as
the encoding. Since file names exist for the benefit of humans, they
should be treated as text, so we should work with them as text. But
for reasons of backward compatibility, Linux hasn't yet changed.

Windows isn't broken for using UTF-16. I think it's a poor trade-off,
given that so many file names are ASCII-only; and, of course, if any
program treats a Windows file name as UCS-2, then that program is
broken. But UTF-16 is not itself broken, any more than UTF-7 is. And
UTF-7 is a lot harder to work with.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

On Friday, March 6, 2015 at 2:33:11 PM UTC+5:30, Rustom Mody wrote:
 Lets please stick to UTF-16 shall we?
 
 Now tell me:
 - Is it broken or not?
 - Is it widely used or not?
 - Should programmers be careful of it or not?
 - Should programmers be warned about it or not?

Also:
Can a programmer who is away from UTF-16 in one part of the system (say by 
using python3)
assume he is safe all over?
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

On Friday, March 6, 2015 at 10:50:35 AM UTC+5:30, Chris Angelico wrote:
 On Fri, Mar 6, 2015 at 3:53 PM, Rustom Mody wrote:
  My conclusion: Early adopters of unicode -- Windows and Java -- were 
  punished
  for their early adoption.  You can blame the unicode consortium, you can
  blame the babel of human languages, particularly that some use characters
  and some only (the equivalent of) what we call words.
 
  Or you can skip the blame-game and simply note the fact that large segments 
  of
  extant code-bases are currently in bug-prone or plain buggy state.
 
 For most of the 1990s, I was writing code in REXX, on OS/2. An even
 earlier adopter, REXX didn't have Unicode support _at all_, but
 instead had facilities for working with DBCS strings. You can't get
 everything right AND be the first to produce anything. Python didn't
 make Unicode strings the default until 3.0, but that's not Unicode's
 fault.
 
  This includes not just bug-prone-system code such as Java and Windows but
  seemingly working code such as python 3.
 
  Here is Roy's Smith post that first started me thinking that something may
  be wrong with SMP
  https://groups.google.com/d/msg/comp.lang.python/loYWMJnPtos/GHMC0cX_hfgJ
 
  Some parts are here some earlier and from my memory.
  If details wrong please correct:
  - 200 million records
  - Containing 4 strings with SMP characters
  - System made with python and mysql. SMP works with python, breaks mysql.
So whole system broke due to those 4 in 200,000,000 records
 
  I know enough (or not enough) of unicode to be chary of statistical 
  conclusions
  from the above.
  My conclusion is essentially an 'existence-proof':
 
 Hang on hang on. Why are you blaming Python or SMP characters for
 this? The problem here is MySQL, which doesn't adequately cope with
 the full Unicode range. (Or, didn't then, or doesn't with its default
 settings. I believe you can configure current versions of MySQL to
 work correctly, though I haven't actually checked. PostgreSQL gets it
 right, that's good enough for me.)
 
  SMP-chars can break systems.
  The breakage is costly-fied by the combination
  - layman statistical assumptions
  - BMP → SMP exercises different code-paths
 
 Broken systems can be shown up by anything. Suppose you have a program
 that breaks when it gets a NUL character (not unknown in C code); is
 the fault with the Unicode consortium for allocating something at
 codepoint 0, or the code that can't cope with a perfectly normal
 character?

Strawman.

Lets please stick to UTF-16 shall we?

Now tell me:
- Is it broken or not?
- Is it widely used or not?
- Should programmers be careful of it or not?
- Should programmers be warned about it or not?
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

On Friday, March 6, 2015 at 3:24:48 PM UTC+5:30, Chris Angelico wrote:
 On Fri, Mar 6, 2015 at 8:02 PM, Rustom Mody wrote:
  Broken systems can be shown up by anything. Suppose you have a program
  that breaks when it gets a NUL character (not unknown in C code); is
  the fault with the Unicode consortium for allocating something at
  codepoint 0, or the code that can't cope with a perfectly normal
  character?
 
  Strawman.
 
 Not really, no. I know of lots of programs that can't handle embedded
 NULs, and which fail in various ways when given them (the most common
 is simple truncation, but it's by far not the only way).

Ah well if you insist on pursuing the nul-char example...
No the unicode consortium (or ASCII equivalent) is not wrong in allocating 
codepoint 0
Nor the code that can't cope with a perfectly normal character?

But with C for having a data structure called string with a 'hole' in it.

And it's
 exactly the same: a program that purports to handle arbitrary Unicode
 text should be able to handle arbitrary Unicode text, not Unicode
 text as long as it contains only codepoints within the range X-Y. It
 doesn't matter whether the code chokes on U+, U+005C, U+FFFC, or
 U+1F4A3 - if your code blows up, it's a failure in your code.
 
  Lets please stick to UTF-16 shall we?
 
  Now tell me:
  - Is it broken or not?
  - Is it widely used or not?
  - Should programmers be careful of it or not?
  - Should programmers be warned about it or not?
 
 No, UTF-16 is not itself broken. (It would be if we expected
 codepoints 0x10, and it's because of UTF-16 that that's the cap
 on Unicode, but it's looking unlikely that we'll be needing any more
 than that anyway.) What's broken is code that tries to treat UTF-16 as
 if it's UCS-2, and then breaks on surrogate pairs.
 
 Yes, it's widely used. Programmers should probably be warned about it,
 but only because its tradeoffs are generally poorer than UTF-8's. If
 you use it correctly, there's no problem.
 
  Also:
  Can a programmer who is away from UTF-16 in one part of the system (say by 
  using python3)
  assume he is safe all over?
 
 I don't know what you mean here. Do you mean that your Python 3
 program is at risk in some way because there might be some other
 program that misuses UTF-16?

Yes some other program/library/API etc connected to the python one

 Well, sure. And there might be some other
 program that misuses buffer sizes, SQL queries, or shell invocations,
 and makes your overall system vulnerable to buffer overruns or
 injection attacks. These are significantly more likely AND more
 serious than UTF-16 misuses. And you still have not proven anything
 about SMP characters being a problem, but only that code can be
 broken. Broken code is still broken code, no matter what your actual
 brokenness.

Roy Smith (and many other links Ive cited) prove exactly that - an
SMP character broke the code.

Note: I have no objection to people supporting full unicode 7.
Im just saying it may be significantly harder than just Use python3 and you 
are done
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

2015-03-06 Thread Terry Reedy


On 3/6/2015 11:20 AM, Rustom Mody wrote:


=
pp = 
print (pp)
=
Try open it in idle3 and you get (at least I get):

$ idle3 ff.py
Traceback (most recent call last):
   File /usr/bin/idle3, line 5, in module
 main()
   File /usr/lib/python3.4/idlelib/PyShell.py, line 1562, in main
 if flist.open(filename) is None:
   File /usr/lib/python3.4/idlelib/FileList.py, line 36, in open
 edit = self.EditorWindow(self, filename, key)
   File /usr/lib/python3.4/idlelib/PyShell.py, line 126, in __init__
 EditorWindow.__init__(self, *args)
   File /usr/lib/python3.4/idlelib/EditorWindow.py, line 294, in __init__
 if io.loadfile(filename):
   File /usr/lib/python3.4/idlelib/IOBinding.py, line 236, in loadfile
 self.text.insert(1.0, chars)
   File /usr/lib/python3.4/idlelib/Percolator.py, line 25, in insert
 self.top.insert(index, chars, tags)
   File /usr/lib/python3.4/idlelib/UndoDelegator.py, line 81, in insert
 self.addcmd(InsertCommand(index, chars, tags))
   File /usr/lib/python3.4/idlelib/UndoDelegator.py, line 116, in addcmd
 cmd.do(self.delegate)
   File /usr/lib/python3.4/idlelib/UndoDelegator.py, line 219, in do
 text.insert(self.index1, self.chars, self.tags)
   File /usr/lib/python3.4/idlelib/ColorDelegator.py, line 82, in insert
 self.delegate.insert(index, chars, tags)
   File /usr/lib/python3.4/idlelib/WidgetRedirector.py, line 148, in __call__
 return self.tk_call(self.orig_and_operation + args)
_tkinter.TclError: character U+1f4a9 is above the range (U+-U+) allowed 
by Tcl

So who/what is broken?


tcl
The possible workaround is for Idle to translate  to \U0001f4a9 
(10 chars) before sending it to tk.


But some perspective.  In the console interpreter:

 print(\U0001f4a9)
Traceback (most recent call last):
  File stdin, line 1, in module
  File C:\Programs\Python34\lib\encodings\cp437.py, line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f4a9' 
in posit

ion 0: character maps to undefined

So what is broken?  Windows Command Prompt.

More perspective.  tk/Idle *will* print *something* for every BMP char. 
 Command Prompt will not.  It does not even do ucs-2 correctly. So 
which is more broken?  Windows Command Prompt.  Who has perhaps 
1,000,000 times more resources, Microsoft? or the tcl/tk group?  I think 
we all know.


--
Terry Jan Reedy


--
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

2015-03-05 Thread random832

On Thu, Mar 5, 2015, at 09:06, Steven D'Aprano wrote:
 I mostly agree with Chris. Supporting *just* the BMP is non-trivial in
 UTF-8
 and UTF-32, since that goes against the grain of the system. You would
 have
 to program in artificial restrictions that otherwise don't exist.

UTF-8 is already restricted from representing values above 0x10,
whereas UTF-8 can naturally represent values up to 0x1F in four
bytes, up to 0x3FF in five bytes, and 0x7FFF in six bytes. If
anything, the BMP represents a natural boundary, since it coincides with
values that can be represented in three bytes. Likewise, UTF-32 can
obviously represent values up to 0x. You're programming in
artificial restrictions either way, it's just a question of what those
restrictions are.
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

2015-03-05 Thread Steven D'Aprano

random...@fastmail.us wrote:

 On Thu, Mar 5, 2015, at 09:06, Steven D'Aprano wrote:
 I mostly agree with Chris. Supporting *just* the BMP is non-trivial in
 UTF-8
 and UTF-32, since that goes against the grain of the system. You would
 have
 to program in artificial restrictions that otherwise don't exist.
 
 UTF-8 is already restricted from representing values above 0x10,
 whereas UTF-8 can naturally represent values up to 0x1F in four
 bytes, up to 0x3FF in five bytes, and 0x7FFF in six bytes. If
 anything, the BMP represents a natural boundary, since it coincides with
 values that can be represented in three bytes. Likewise, UTF-32 can
 obviously represent values up to 0x. You're programming in
 artificial restrictions either way, it's just a question of what those
 restrictions are.

Good points, but they don't greatly change my conclusion. If you are
implementing UTF-8 or UTF-32, it is no harder to deal with code points in
the SMP than those in the BMP.


-- 
Steven

-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

2015-03-05 Thread Steven D'Aprano

Rustom Mody wrote:

 On Wednesday, March 4, 2015 at 10:25:24 AM UTC+5:30, Chris Angelico wrote:
 On Wed, Mar 4, 2015 at 3:45 PM, Rustom Mody  wrote:
 
  It lists some examples of software that somehow break/goof going from
  BMP-only unicode to 7.0 unicode.
 
  IOW the suggestion is that the the two-way classification
  - ASCII
  - Unicode
 
  is less useful and accurate than the 3-way
 
  - ASCII
  - BMP
  - Unicode
 
 How is that more useful? Aside from storage optimizations (in which
 the significant breaks would be Latin-1, UCS-2, and UCS-4), the BMP is
 not significantly different from the rest of Unicode.
 
 Sorry... Dont understand.

Chris is suggesting that going from BMP to all of Unicode is not the hard
part. Going from ASCII to the BMP part of Unicode is the hard part. If you
can do that, you can go the rest of the way easily.

I mostly agree with Chris. Supporting *just* the BMP is non-trivial in UTF-8
and UTF-32, since that goes against the grain of the system. You would have
to program in artificial restrictions that otherwise don't exist.

UTF-16 is different, and that's probably why you think supporting all of
Unicode is hard. With UTF-16, there really is an obvious distinction
between the BMP and the SMP: that's where you jump from a single 2-byte
unit to a pair of 2-byte units. But that distinction doesn't exist in UTF-8
or UTF-32: 

- In UTF-8, about 99.8% of the BMP requires multiple bytes. Whether you
  support the SMP or not doesn't change the fact that you have to deal
  with multi-byte characters.

- In UTF-32, everything is fixed-width whether it is in the BMP or not.

In both cases, supporting the SMPs is no harder than supporting the BMP.
It's only UTF-16 that makes the SMP seem hard.

Conclusion: faulty implementations of UTF-16 which incorrectly handle
surrogate pairs should be replaced by non-faulty implementations, or
changed to UTF-8 or UTF-32; incomplete Unicode implementations which assume
that Unicode is 16-bit only (e.g. UCS-2) are obsolete and should be
upgraded.

Wrong conclusion: SMPs are unnecessary and unneeded, and we need a new
standard that is just like obsolete Unicode version 1.

Unicode version 1 is obsolete for a reason. 16 bits is not enough for even
existing languages, let alone all the code points and characters that are
used in human communication.


 Also, the expansion from 16-bit was back in Unicode 2.0, not 7.0. Why
 do you keep talking about 7.0 as if it's a recent change?
 
 It is 2015 as of now. 7.0 is the current standard.
 
 The need for the adjective 'current' should be pondered upon.

What's your point?

The UTF encodings have not changed since they were first introduced. They
have been stable for at least twenty years: UTF-8 has existed since 1993,
and UTF-16 since 1996.

Since version 2.0 of Unicode in 1996, the standard has made stability
guarantees that no code points will be renamed or removed. Consequently,
there has only been one version which removed characters, version 1.1.
Since then, new versions of the standard have only added characters, never
moved, renamed or deleted them.

http://unicode.org/policies/stability_policy.html

Some highlights in Unicode history:

Unicode 1.0 (1991): initial version, defined 7161 code points.

In January 1993, Rob Pike and Ken Thompson announced the design and working
implementation of the UTF-8 encoding.

1.1 (1993): defined 34233 characters, finalised Han Unification. Removed
some characters from the 1.0 set. This is the first and only time any code
points have been removed.

2.0 (1996): First version to include code points in the Supplementary
Multilingual Planes. Defined 38950 code points. Introduced the UTF-16
encoding.

3.1 (2001): Defined 94205 code points, including 42711 additional Han
ideographs, bringing the total number of CJK code points alone to 71793,
too many to fit in 16 bits.

2006: The People's Republic Of China mandates support for the GB-18030
character set for all software products sold in the PRC. GB-18030 supports
the entire Unicode range, include the SMPs. Since this date, all software
sold in China must support the SMPs.

6.0 (2010): The first emoji or emoticons were added to Unicode.

7.0 (2014): 113021 code points defined in total.


 In practice, standards change.
 However if a standard changes so frequently that that users have to play
 catching cook and keep asking: Which version? they are justified in
 asking Are the standard-makers doing due diligence?

Since Unicode has stability guarantees, and the encodings have not changed
in twenty years and will not change in the future, this argument is bogus.
Updating to a new version of the standard means, to a first approximation,
merely allocating some new code points which had previously been undefined
but are now defined.

(Code points can be flagged deprecated, but they will never be removed.)



-- 
Steven

-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

2015-03-05 Thread Chris Angelico

On Fri, Mar 6, 2015 at 3:53 PM, Rustom Mody rustompm...@gmail.com wrote:
 My conclusion: Early adopters of unicode -- Windows and Java -- were punished
 for their early adoption.  You can blame the unicode consortium, you can
 blame the babel of human languages, particularly that some use characters
 and some only (the equivalent of) what we call words.

 Or you can skip the blame-game and simply note the fact that large segments of
 extant code-bases are currently in bug-prone or plain buggy state.

For most of the 1990s, I was writing code in REXX, on OS/2. An even
earlier adopter, REXX didn't have Unicode support _at all_, but
instead had facilities for working with DBCS strings. You can't get
everything right AND be the first to produce anything. Python didn't
make Unicode strings the default until 3.0, but that's not Unicode's
fault.

 This includes not just bug-prone-system code such as Java and Windows but
 seemingly working code such as python 3.

 Here is Roy's Smith post that first started me thinking that something may
 be wrong with SMP
 https://groups.google.com/d/msg/comp.lang.python/loYWMJnPtos/GHMC0cX_hfgJ

 Some parts are here some earlier and from my memory.
 If details wrong please correct:
 - 200 million records
 - Containing 4 strings with SMP characters
 - System made with python and mysql. SMP works with python, breaks mysql.
   So whole system broke due to those 4 in 200,000,000 records

 I know enough (or not enough) of unicode to be chary of statistical 
 conclusions
 from the above.
 My conclusion is essentially an 'existence-proof':

Hang on hang on. Why are you blaming Python or SMP characters for
this? The problem here is MySQL, which doesn't adequately cope with
the full Unicode range. (Or, didn't then, or doesn't with its default
settings. I believe you can configure current versions of MySQL to
work correctly, though I haven't actually checked. PostgreSQL gets it
right, that's good enough for me.)

 SMP-chars can break systems.
 The breakage is costly-fied by the combination
 - layman statistical assumptions
 - BMP → SMP exercises different code-paths

Broken systems can be shown up by anything. Suppose you have a program
that breaks when it gets a NUL character (not unknown in C code); is
the fault with the Unicode consortium for allocating something at
codepoint 0, or the code that can't cope with a perfectly normal
character?

 You could also choose do with astral crap (Roy's words) what we all do with
 crap -- throw it out as early as possible.

There's only one character that fits that description, and that's
1F4A9. Everything else is just astral characters, and you shouldn't
have any difficulties with them.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

2015-03-05 Thread Rustom Mody

On Thursday, March 5, 2015 at 7:36:32 PM UTC+5:30, Steven D'Aprano wrote:
 Rustom Mody wrote:
 
  On Wednesday, March 4, 2015 at 10:25:24 AM UTC+5:30, Chris Angelico wrote:
  On Wed, Mar 4, 2015 at 3:45 PM, Rustom Mody  wrote:
  
   It lists some examples of software that somehow break/goof going from
   BMP-only unicode to 7.0 unicode.
  
   IOW the suggestion is that the the two-way classification
   - ASCII
   - Unicode
  
   is less useful and accurate than the 3-way
  
   - ASCII
   - BMP
   - Unicode
  
  How is that more useful? Aside from storage optimizations (in which
  the significant breaks would be Latin-1, UCS-2, and UCS-4), the BMP is
  not significantly different from the rest of Unicode.
  
  Sorry... Dont understand.
 
 Chris is suggesting that going from BMP to all of Unicode is not the hard
 part. Going from ASCII to the BMP part of Unicode is the hard part. If you
 can do that, you can go the rest of the way easily.

Depends where the going is starting from.
I specifically names Java, Javascript, Windows... among others.
Here's some quotes from the supplementary chars doc of Java
http://www.oracle.com/technetwork/articles/javase/supplementary-142654.html

| Supplementary characters are characters in the Unicode standard whose code
| points are above U+, and which therefore cannot be described as single 
| 16-bit entities such as the char data type in the Java programming language. 
| Such characters are generally rare, but some are used, for example, as part 
| of Chinese and Japanese personal names, and so support for them is commonly 
| required for government applications in East Asian countries...

| The introduction of supplementary characters unfortunately makes the 
| character model quite a bit more complicated. 

| Unicode was originally designed as a fixed-width 16-bit character encoding. 
| The primitive data type char in the Java programming language was intended to 
| take advantage of this design by providing a simple data type that could hold 
| any character  Version 5.0 of the J2SE is required to support version 4.0 
| of the Unicode standard, so it has to support supplementary characters. 

My conclusion: Early adopters of unicode -- Windows and Java -- were punished
for their early adoption.  You can blame the unicode consortium, you can
blame the babel of human languages, particularly that some use characters
and some only (the equivalent of) what we call words.

Or you can skip the blame-game and simply note the fact that large segments of
extant code-bases are currently in bug-prone or plain buggy state.

This includes not just bug-prone-system code such as Java and Windows but
seemingly working code such as python 3.
 
 I mostly agree with Chris. Supporting *just* the BMP is non-trivial in UTF-8
 and UTF-32, since that goes against the grain of the system. You would have
 to program in artificial restrictions that otherwise don't exist.

Yes  UTF-8 and UTF-32 make most of the objections to unicode 7.0 irrelevant.
Large segments of the
 
 UTF-16 is different, and that's probably why you think supporting all of
 Unicode is hard. With UTF-16, there really is an obvious distinction
 between the BMP and the SMP: that's where you jump from a single 2-byte
 unit to a pair of 2-byte units. But that distinction doesn't exist in UTF-8
 or UTF-32: 
 
 - In UTF-8, about 99.8% of the BMP requires multiple bytes. Whether you
   support the SMP or not doesn't change the fact that you have to deal
   with multi-byte characters.
 
 - In UTF-32, everything is fixed-width whether it is in the BMP or not.
 
 In both cases, supporting the SMPs is no harder than supporting the BMP.
 It's only UTF-16 that makes the SMP seem hard.
 
 Conclusion: faulty implementations of UTF-16 which incorrectly handle
 surrogate pairs should be replaced by non-faulty implementations, or
 changed to UTF-8 or UTF-32; incomplete Unicode implementations which assume
 that Unicode is 16-bit only (e.g. UCS-2) are obsolete and should be
 upgraded.

Imagine for a moment a thought experiment -- we are not on a python but a java
forum and please rewrite the above para.
Are you addressing the vanilla java programmer? Language implementer? Designer?
The Java-funders -- earlier Sun, now Oracle?
 
 Wrong conclusion: SMPs are unnecessary and unneeded, and we need a new
 standard that is just like obsolete Unicode version 1.
 
 Unicode version 1 is obsolete for a reason. 16 bits is not enough for even
 existing languages, let alone all the code points and characters that are
 used in human communication.
 
 
  Also, the expansion from 16-bit was back in Unicode 2.0, not 7.0. Why
  do you keep talking about 7.0 as if it's a recent change?
  
  It is 2015 as of now. 7.0 is the current standard.
  
  The need for the adjective 'current' should be pondered upon.
 
 What's your point?
 
 The UTF encodings have not changed since they were first introduced. They
 have been stable for at least twenty years: UTF-8 has

Re: Newbie question about text encoding

2015-03-03 Thread Chris Angelico

On Wed, Mar 4, 2015 at 5:03 AM, Rustom Mody rustompm...@gmail.com wrote:
 What I was trying to say expanded here
 http://blog.languager.org/2015/03/whimsical-unicode.html
 [Hope  the word 'whimsical' is less jarring and more accurate than 
 'gibberish']

Re footnote #4: ½ is a single character for compatibility reasons.
⅟₁₀₀ doesn't need to be a single character, because there are
countably infinite vulgar fractions and only 0x11 Unicode
characters.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

On Thursday, February 26, 2015 at 10:33:44 PM UTC+5:30, Terry Reedy wrote:
 On 2/26/2015 8:24 AM, Chris Angelico wrote:
  On Thu, Feb 26, 2015 at 11:40 PM, Rustom Mody wrote:
  Wrote something up on why we should stop using ASCII:
  http://blog.languager.org/2015/02/universal-unicode.html
 
 I think that the main point of the post, that many Unicode chars are 
 truly planetary rather than just national/regional, is excellent.

snipped

 You should add emoticons, but not call them or the above 'gibberish'.
 I think that this part of your post is more 'unprofessional' than the 
 character blocks.  It is very jarring and seems contrary to your main point.

Ok Done

References to gibberish removed from
http://blog.languager.org/2015/02/universal-unicode.html 

What I was trying to say expanded here
http://blog.languager.org/2015/03/whimsical-unicode.html
[Hope  the word 'whimsical' is less jarring and more accurate than 'gibberish']
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

2015-03-03 Thread Terry Reedy


On 3/3/2015 1:03 PM, Rustom Mody wrote:

On Thursday, February 26, 2015 at 10:33:44 PM UTC+5:30, Terry Reedy wrote:



You should add emoticons, but not call them or the above 'gibberish'.
I think that this part of your post is more 'unprofessional' than the
character blocks.  It is very jarring and seems contrary to your main point.


Ok Done

References to gibberish removed from
http://blog.languager.org/2015/02/universal-unicode.html

What I was trying to say expanded here
http://blog.languager.org/2015/03/whimsical-unicode.html
[Hope  the word 'whimsical' is less jarring and more accurate than 'gibberish']


I agree with both.

--
Terry Jan Reedy

--
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

On Wednesday, March 4, 2015 at 10:25:24 AM UTC+5:30, Chris Angelico wrote:
 On Wed, Mar 4, 2015 at 3:45 PM, Rustom Mody  wrote:
 
  It lists some examples of software that somehow break/goof going from 
  BMP-only
  unicode to 7.0 unicode.
 
  IOW the suggestion is that the the two-way classification
  - ASCII
  - Unicode
 
  is less useful and accurate than the 3-way
 
  - ASCII
  - BMP
  - Unicode
 
 How is that more useful? Aside from storage optimizations (in which
 the significant breaks would be Latin-1, UCS-2, and UCS-4), the BMP is
 not significantly different from the rest of Unicode.

Sorry... Dont understand.
 
 Also, the expansion from 16-bit was back in Unicode 2.0, not 7.0. Why
 do you keep talking about 7.0 as if it's a recent change?

It is 2015 as of now. 7.0 is the current standard.

The need for the adjective 'current' should be pondered upon.

In practice, standards change.
However if a standard changes so frequently that that users have to play 
catching cook
and keep asking: Which version? they are justified in asking Are the 
standard-makers
doing due diligence?
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

On Wednesday, March 4, 2015 at 12:14:11 AM UTC+5:30, Chris Angelico wrote:
 On Wed, Mar 4, 2015 at 5:03 AM, Rustom Mody wrote:
  What I was trying to say expanded here
  http://blog.languager.org/2015/03/whimsical-unicode.html
  [Hope  the word 'whimsical' is less jarring and more accurate than 
  'gibberish']
 
 Re footnote #4: ½ is a single character for compatibility reasons.
 ⅟₁₀₀ ...
  ^^^

Neat 
Thanks
[And figured out some of quopri module along the way figuring that out]
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

2015-03-03 Thread Steven D'Aprano

Rustom Mody wrote:

 On Thursday, February 26, 2015 at 10:33:44 PM UTC+5:30, Terry Reedy wrote:
 On 2/26/2015 8:24 AM, Chris Angelico wrote:
  On Thu, Feb 26, 2015 at 11:40 PM, Rustom Mody wrote:
  Wrote something up on why we should stop using ASCII:
  http://blog.languager.org/2015/02/universal-unicode.html
 
 I think that the main point of the post, that many Unicode chars are
 truly planetary rather than just national/regional, is excellent.
 
 snipped
 
 You should add emoticons, but not call them or the above 'gibberish'.
 I think that this part of your post is more 'unprofessional' than the
 character blocks.  It is very jarring and seems contrary to your main
 point.
 
 Ok Done
 
 References to gibberish removed from
 http://blog.languager.org/2015/02/universal-unicode.html

I consider it unethical to make semantic changes to a published work in
place without acknowledgement. Fixing minor typos or spelling errors, or
dead links, is okay. But any edit that changes the meaning should be
commented on, either by an explicit note on the page itself, or by striking
out the previous content and inserting the new.

As for the content of the essay, it is currently rather unfocused. It
appears to be more of a list of here are some Unicode characters I think
are interesting, divided into subgroups, oh and here are some I personally
don't have any use for, which makes them silly than any sort of discussion
about the universality of Unicode. That makes it rather idiosyncratic and
parochial. Why should obscure maths symbols be given more importance than
obscure historical languages?

I think that the universality of Unicode could be explained in a single
sentence:

It is the aim of Unicode to be the one character set anyone needs to
represent every character, ideogram or symbol (but not necessarily distinct
glyph) from any existing or historical human language.

I can expand on that, but in a nutshell that is it.


You state:

APL and Z Notation are two notable languages APL is a programming language
and Z a specification language that did not tie themselves down to a
restricted charset ...


but I don't think that is correct. I'm pretty sure that neither APL nor Z
allowed you to define new characters. They might not have used ASCII alone,
but they still had a restricted character set. It was merely less
restricted than ASCII.

You make a comment about Cobol's relative unpopularity, but (1) Cobol
doesn't require you to write out numbers as English words, and (2) Cobol is
still used, there are uncounted billions of lines of Cobol code being used,
and if the number of Cobol programmers is less now than it was 16 years
ago, there are still a lot of them. Academics and FOSS programmers don't
think much of Cobol, but it has to count as one of the most amazing success
stories in the field of programming languages, despite its lousy design.

You list ideographs such as Cuneiform under Icons. They are not icons.
They are a mixture of symbols used for consonants, syllables, and
logophonetic, consonantal alphabetic and syllabic signs. That sits them
firmly in the same categories as modern languages with consonants, ideogram
languages like Chinese, and syllabary languages like Cheyenne.

Just because native readers of Cuneiform are all dead doesn't make Cuneiform
unimportant. There are probably more people who need to write Cuneiform
than people who need to write APL source code.

You make a comment:

To me – a unicode-layman – it looks unprofessional… Billions of computing
devices world over, each having billions of storage words having their
storage wasted on blocks such as these??

But that is nonsense, and it contradicts your earlier quoting of Dave Angel.
Why are you so worried about an (illusionary) minor optimization?

Whether code points are allocated or not doesn't affect how much space they
take up. There are millions of unused Unicode code points today. If they
are allocated tomorrow, the space your documents take up will not increase
one byte.

Allocating code points to Cuneiform has not increased the space needed by
Unicode at all. Two bytes alone is not enough for even existing human
languages (thanks China). For hardware related reasons, it is faster and
more efficient to use four bytes than three, so the obvious and dumb (in
the simplest thing which will work) way to store Unicode is UTF-32, which
takes a full four bytes per code point, regardless of whether there are
65537 code points or 1114112. That makes it less expensive than floating
point numbers, which take eight. Would you like to argue that floating
point doubles are unprofessional and wasteful?

As Dave pointed out, and you apparently agreed with him enough to quote him
TWICE (once in each of two blog posts), history of computing is full of
premature optimizations for space. (In fact, some of these may have been
justified by the technical limitations of the day.) Technically Unicode is
also limited, but it is limited to over one million code

Re: Newbie question about text encoding

On Wednesday, March 4, 2015 at 9:35:28 AM UTC+5:30, Rustom Mody wrote:
 On Wednesday, March 4, 2015 at 8:24:40 AM UTC+5:30, Steven D'Aprano wrote:
  Rustom Mody wrote:
  
   On Thursday, February 26, 2015 at 10:33:44 PM UTC+5:30, Terry Reedy wrote:
   On 2/26/2015 8:24 AM, Chris Angelico wrote:
On Thu, Feb 26, 2015 at 11:40 PM, Rustom Mody wrote:
Wrote something up on why we should stop using ASCII:
http://blog.languager.org/2015/02/universal-unicode.html
   
   I think that the main point of the post, that many Unicode chars are
   truly planetary rather than just national/regional, is excellent.
   
   snipped
   
   You should add emoticons, but not call them or the above 'gibberish'.
   I think that this part of your post is more 'unprofessional' than the
   character blocks.  It is very jarring and seems contrary to your main
   point.
   
   Ok Done
   
   References to gibberish removed from
   http://blog.languager.org/2015/02/universal-unicode.html
  
  I consider it unethical to make semantic changes to a published work in
  place without acknowledgement. Fixing minor typos or spelling errors, or
  dead links, is okay. But any edit that changes the meaning should be
  commented on, either by an explicit note on the page itself, or by striking
  out the previous content and inserting the new.
 
 Dunno What you are grumping about…
 
 Anyway the attribution is made more explicit – footnote 5 in
  http://blog.languager.org/2015/03/whimsical-unicode.html.
 
 Note Terry Reedy's post who mainly objected was already acked earlier.
 Ive just added one more ack¹
 And JFTR the 'publication' (O how archaic!) is the whole blog not a single 
 page just as it is for any other dead-tree publication.
 
  
  As for the content of the essay, it is currently rather unfocused.
 
 True.
 
  It
  appears to be more of a list of here are some Unicode characters I think
  are interesting, divided into subgroups, oh and here are some I personally
  don't have any use for, which makes them silly than any sort of discussion
  about the universality of Unicode. That makes it rather idiosyncratic and
  parochial. Why should obscure maths symbols be given more importance than
  obscure historical languages?
 
 Idiosyncratic ≠ parochial
 
 
  
  I think that the universality of Unicode could be explained in a single
  sentence:
  
  It is the aim of Unicode to be the one character set anyone needs to
  represent every character, ideogram or symbol (but not necessarily distinct
  glyph) from any existing or historical human language.
  
  I can expand on that, but in a nutshell that is it.
  
  
  You state:
  
  APL and Z Notation are two notable languages APL is a programming language
  and Z a specification language that did not tie themselves down to a
  restricted charset ...
 
 Tsk Tsk – dihonest snipping. I wrote
 
 | APL and Z Notation are two notable languages APL is a programming language 
 | and Z a specification language that did not tie themselves down to a 
 | restricted charset even in the day that ASCII ruled.
 
 so its clear that the restricted applies to ASCII
  
  You list ideographs such as Cuneiform under Icons. They are not icons.
  They are a mixture of symbols used for consonants, syllables, and
  logophonetic, consonantal alphabetic and syllabic signs. That sits them
  firmly in the same categories as modern languages with consonants, ideogram
  languages like Chinese, and syllabary languages like Cheyenne.
 
 Ok changed to iconic.
 Obviously 2-3 millenia ago, when people spoke hieroglyphs or cuneiform they 
 were languages.
 In 2015 when someone sees them and recognizes them, they are 'those things 
 that
 Sumerians/Egyptians wrote' No one except a rare expert knows those languages
 
  
  Just because native readers of Cuneiform are all dead doesn't make Cuneiform
  unimportant. There are probably more people who need to write Cuneiform
  than people who need to write APL source code.
  
  You make a comment:
  
  To me – a unicode-layman – it looks unprofessional… Billions of computing
  devices world over, each having billions of storage words having their
  storage wasted on blocks such as these??
  
  But that is nonsense, and it contradicts your earlier quoting of Dave Angel.
  Why are you so worried about an (illusionary) minor optimization?
 
 2  4 as far as I am concerned.
 [If you disagree one man's illusionary is another's waking]
 
  
  Whether code points are allocated or not doesn't affect how much space they
  take up. There are millions of unused Unicode code points today. If they
  are allocated tomorrow, the space your documents take up will not increase
  one byte.
  
  Allocating code points to Cuneiform has not increased the space needed by
  Unicode at all. Two bytes alone is not enough for even existing human
  languages (thanks China). For hardware related reasons, it is faster and
  more efficient to use four bytes than three, so the obvious and dumb (in
  the

Re: Newbie question about text encoding

On Wednesday, March 4, 2015 at 12:07:06 AM UTC+5:30, jmf wrote:
 Le mardi 3 mars 2015 19:04:06 UTC+1, Rustom Mody a écrit :
  On Thursday, February 26, 2015 at 10:33:44 PM UTC+5:30, Terry Reedy wrote:
   On 2/26/2015 8:24 AM, Chris Angelico wrote:
On Thu, Feb 26, 2015 at 11:40 PM, Rustom Mody wrote:
Wrote something up on why we should stop using ASCII:
http://blog.languager.org/2015/02/universal-unicode.html
   
   I think that the main point of the post, that many Unicode chars are 
   truly planetary rather than just national/regional, is excellent.
  
  snipped
  
   You should add emoticons, but not call them or the above 'gibberish'.
   I think that this part of your post is more 'unprofessional' than the 
   character blocks.  It is very jarring and seems contrary to your main 
   point.
  
  Ok Done
  
  References to gibberish removed from
  http://blog.languager.org/2015/02/universal-unicode.html 
  
  What I was trying to say expanded here
  http://blog.languager.org/2015/03/whimsical-unicode.html
  [Hope  the word 'whimsical' is less jarring and more accurate than 
  'gibberish']
 
 
 
 Emoji and Dingbats are now part of Unicode.
 They should be considered as well as a 1 or a a
 or a mathematical alpha.
 So, there is nothing special to say about them.
 
 jmf

Maybe you missed this section:
http://blog.languager.org/2015/03/whimsical-unicode.html#half-assed

It lists some examples of software that somehow break/goof going from BMP-only 
unicode to 7.0 unicode.

IOW the suggestion is that the the two-way classification
- ASCII
- Unicode

is less useful and accurate than the 3-way

- ASCII
- BMP
- Unicode

Personally I would be pleased if 훌 were used for the math-lambda and
λ left alone for Greek-speaking users' identifiers.
However one should draw a line between personal preferences and a 
univeral(izable) standard.
As of now, λ works in blogger whereas 훌 breaks blogger -- gets replaced by �.
Similar breakages are current in Java, Javascript, Emacs, Mysql, Idle and 
Windows, various fonts etc etc. [Only one of these is remotely connected with 
python]

So BMP is practical, 7.0 is idealistic. You are free too pick 
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

On Wednesday, March 4, 2015 at 8:24:40 AM UTC+5:30, Steven D'Aprano wrote:
 Rustom Mody wrote:
 
  On Thursday, February 26, 2015 at 10:33:44 PM UTC+5:30, Terry Reedy wrote:
  On 2/26/2015 8:24 AM, Chris Angelico wrote:
   On Thu, Feb 26, 2015 at 11:40 PM, Rustom Mody wrote:
   Wrote something up on why we should stop using ASCII:
   http://blog.languager.org/2015/02/universal-unicode.html
  
  I think that the main point of the post, that many Unicode chars are
  truly planetary rather than just national/regional, is excellent.
  
  snipped
  
  You should add emoticons, but not call them or the above 'gibberish'.
  I think that this part of your post is more 'unprofessional' than the
  character blocks.  It is very jarring and seems contrary to your main
  point.
  
  Ok Done
  
  References to gibberish removed from
  http://blog.languager.org/2015/02/universal-unicode.html
 
 I consider it unethical to make semantic changes to a published work in
 place without acknowledgement. Fixing minor typos or spelling errors, or
 dead links, is okay. But any edit that changes the meaning should be
 commented on, either by an explicit note on the page itself, or by striking
 out the previous content and inserting the new.

Dunno What you are grumping about…

Anyway the attribution is made more explicit – footnote 5 in
 http://blog.languager.org/2015/03/whimsical-unicode.html.

Note Terry Reedy's post who mainly objected was already acked earlier.
Ive just added one more ack¹
And JFTR the 'publication' (O how archaic!) is the whole blog not a single page 
just as it is for any other dead-tree publication.

 
 As for the content of the essay, it is currently rather unfocused.

True.

 It
 appears to be more of a list of here are some Unicode characters I think
 are interesting, divided into subgroups, oh and here are some I personally
 don't have any use for, which makes them silly than any sort of discussion
 about the universality of Unicode. That makes it rather idiosyncratic and
 parochial. Why should obscure maths symbols be given more importance than
 obscure historical languages?

Idiosyncratic ≠ parochial


 
 I think that the universality of Unicode could be explained in a single
 sentence:
 
 It is the aim of Unicode to be the one character set anyone needs to
 represent every character, ideogram or symbol (but not necessarily distinct
 glyph) from any existing or historical human language.
 
 I can expand on that, but in a nutshell that is it.
 
 
 You state:
 
 APL and Z Notation are two notable languages APL is a programming language
 and Z a specification language that did not tie themselves down to a
 restricted charset ...

Tsk Tsk – dihonest snipping. I wrote

| APL and Z Notation are two notable languages APL is a programming language 
| and Z a specification language that did not tie themselves down to a 
| restricted charset even in the day that ASCII ruled.

so its clear that the restricted applies to ASCII
 
 You list ideographs such as Cuneiform under Icons. They are not icons.
 They are a mixture of symbols used for consonants, syllables, and
 logophonetic, consonantal alphabetic and syllabic signs. That sits them
 firmly in the same categories as modern languages with consonants, ideogram
 languages like Chinese, and syllabary languages like Cheyenne.

Ok changed to iconic.
Obviously 2-3 millenia ago, when people spoke hieroglyphs or cuneiform they 
were languages.
In 2015 when someone sees them and recognizes them, they are 'those things that
Sumerians/Egyptians wrote' No one except a rare expert knows those languages

 
 Just because native readers of Cuneiform are all dead doesn't make Cuneiform
 unimportant. There are probably more people who need to write Cuneiform
 than people who need to write APL source code.
 
 You make a comment:
 
 To me – a unicode-layman – it looks unprofessional… Billions of computing
 devices world over, each having billions of storage words having their
 storage wasted on blocks such as these??
 
 But that is nonsense, and it contradicts your earlier quoting of Dave Angel.
 Why are you so worried about an (illusionary) minor optimization?

2  4 as far as I am concerned.
[If you disagree one man's illusionary is another's waking]

 
 Whether code points are allocated or not doesn't affect how much space they
 take up. There are millions of unused Unicode code points today. If they
 are allocated tomorrow, the space your documents take up will not increase
 one byte.
 
 Allocating code points to Cuneiform has not increased the space needed by
 Unicode at all. Two bytes alone is not enough for even existing human
 languages (thanks China). For hardware related reasons, it is faster and
 more efficient to use four bytes than three, so the obvious and dumb (in
 the simplest thing which will work) way to store Unicode is UTF-32, which
 takes a full four bytes per code point, regardless of whether there are
 65537 code points or 1114112. That makes it less

Re: Newbie question about text encoding

2015-03-03 Thread Chris Angelico

On Wed, Mar 4, 2015 at 1:54 PM, Steven D'Aprano
steve+comp.lang.pyt...@pearwood.info wrote:
 It is easy to mock what is not important to you. I daresay kids adding emoji
 to their 10 character tweets would mock all the useless maths symbols in
 Unicode too.

Definitely! Who ever sings do you wanna build an integral sign?

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

2015-03-03 Thread Chris Angelico

On Wed, Mar 4, 2015 at 3:45 PM, Rustom Mody rustompm...@gmail.com wrote:

 It lists some examples of software that somehow break/goof going from BMP-only
 unicode to 7.0 unicode.

 IOW the suggestion is that the the two-way classification
 - ASCII
 - Unicode

 is less useful and accurate than the 3-way

 - ASCII
 - BMP
 - Unicode

How is that more useful? Aside from storage optimizations (in which
the significant breaks would be Latin-1, UCS-2, and UCS-4), the BMP is
not significantly different from the rest of Unicode.

Also, the expansion from 16-bit was back in Unicode 2.0, not 7.0. Why
do you keep talking about 7.0 as if it's a recent change?

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

2015-02-27 Thread Dave Angel


On 02/27/2015 06:54 AM, Steven D'Aprano wrote:

Dave Angel wrote:


On 02/27/2015 12:58 AM, Steven D'Aprano wrote:

Dave Angel wrote:


(Although I believe Seymour Cray was quoted as saying that virtual
memory is a crock, because you can't fake what you ain't got.)


If I recall correctly, disk access is about 1 times slower than RAM,
so virtual memory is *at least* that much slower than real memory.



It's so much more complicated than that, that I hardly know where to
start.


[snip technical details]

As interesting as they were, none of those details will make swap faster,
hence my comment that virtual memory is *at least* 1 times slower than
RAM.



The term virtual memory is used for many aspects of the modern memory 
architecture.  But I presume you're using it in the sense of running in 
a swapfile as opposed to running in physical RAM.


Yes, a page fault takes on the order of 10,000 times as long as an 
access to a location in L1 cache.  I suspect it's a lot smaller though 
if the swapfile is on an SSD drive.  The first byte is that slow.


But once the fault is resolved, the nearby bytes are in physical memory, 
and some of them are in L3, L2, and L1.  So you're not running in the 
swapfile any more.  And even when you run off the end of the page, 
fetching the sequentially adjacent page from a hard disk is much faster. 
 And if the disk has well designed buffering, faster yet.  The OS tries 
pretty hard to keep the swapfile unfragmented.


The trick is to minimize the number of page faults, especially to random 
locations.  If you're getting lots of them, it's called thrashing.


There are tools to help with that.  To minimize page faults on code, 
linking with a good working-set-tuner can help, though I don't hear of 
people bothering these days.  To minimize page faults on data, choosing 
one's algorithm carefully can help.  For example, in scanning through a 
typical matrix, row order might be adjacent locations, while column 
order might be scattered.


Not really much different than reading a text file.  If you can arrange 
to process it a line at a time, rather than reading the whole file into 
memory, you generally minimize your round-trips to disk.  And if you 
need to randomly access it, it's quite likely more efficient to memory 
map it, in which case it temporarily becomes part of the swapfile system.


--
DaveA
--
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

2015-02-27 Thread alister

On Sat, 28 Feb 2015 03:12:16 +1100, Chris Angelico wrote:

 On Sat, Feb 28, 2015 at 3:00 AM, alister
 alister.nospam.w...@ntlworld.com wrote:
 I think there is a case for bringing back the overlay file, or at least
 loading larger programs in sections only loading the routines as they
 are required could speed up the start time of many large applications.
 examples libre office, I rarely need the mail merge function, the word
 count and may other features that could be added into the running
 application on demand rather than all at once.
 
 Downside of that is twofold: firstly the complexity that I already
 mentioned, and secondly you pay the startup cost on first usage. So you
 might get into the program a bit faster, but as soon as you go to any
 feature you didn't already hit this session, the program pauses for a
 bit and loads it. Sometimes startup cost is the best time to do this
 sort of thing.
 
If the modules are small enough this may not be noticeable but yes I do 
accept there may be delays on first usage.

As to the complexity it has been my observation that as the memory 
footprint available to programmers has increase they have become less  
less skilled at writing code.

of course my time as a professional programmer was over 20 years ago on 8 
bit micro controllers with 8k of ROM (eventually, original I only had 2k 
to play with)  128 Bytes (yes bytes!) of RAM so I am very out of date.

I now play with python because it is so much less demanding of me which 
probably makes me just a guilty :-)

 Of course, there is an easy way to implement exactly what you're asking
 for: use separate programs for everything, instead of expecting a
 megantic office suite[1] to do everything for you. Just get yourself a
 nice simple text editor, then invoke other programs - maybe from a
 terminal, or maybe from within the editor - to do the rest of the work.
 A simple disk cache will mean that previously-used programs start up
 quickly.
Libre office was sighted as just one example
Video editing suites are another that could be used as an example 
(perhaps more so, does the rendering engine need to be loaded until you 
start generating the output? a small delay here would be insignificant)
 
 ChrisA
 
 [1] It's slightly less bloated than the gigantic office suite sold by a
 top-end software company.





-- 
You don't sew with a fork, so I see no reason to eat with knitting 
needles.
-- Miss Piggy, on eating Chinese Food
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

On Sat, Feb 28, 2015 at 3:45 AM, alister
alister.nospam.w...@ntlworld.com wrote:
 On Sat, 28 Feb 2015 03:12:16 +1100, Chris Angelico wrote:

 On Sat, Feb 28, 2015 at 3:00 AM, alister
 alister.nospam.w...@ntlworld.com wrote:
 I think there is a case for bringing back the overlay file, or at least
 loading larger programs in sections only loading the routines as they
 are required could speed up the start time of many large applications.
 examples libre office, I rarely need the mail merge function, the word
 count and may other features that could be added into the running
 application on demand rather than all at once.

 Downside of that is twofold: firstly the complexity that I already
 mentioned, and secondly you pay the startup cost on first usage. So you
 might get into the program a bit faster, but as soon as you go to any
 feature you didn't already hit this session, the program pauses for a
 bit and loads it. Sometimes startup cost is the best time to do this
 sort of thing.

 If the modules are small enough this may not be noticeable but yes I do
 accept there may be delays on first usage.

 As to the complexity it has been my observation that as the memory
 footprint available to programmers has increase they have become less 
 less skilled at writing code.

Perhaps, but on the other hand, the skill of squeezing code into less
memory is being replaced by other skills. We can write code that takes
the simple/dumb approach, let it use an entire megabyte of memory, and
not care about the cost... and we can write that in an hour, instead
of spending a week fiddling with it. Reducing the development cycle
time means we can add all sorts of cool features to a program, all
while the original end user is still excited about it. (Of course, a
comparison between today's World Wide Web and that of the 1990s
suggests that these cool features aren't necessarily beneficial, but
still, we have the option of foregoing austerity.)

 Video editing suites are another that could be used as an example
 (perhaps more so, does the rendering engine need to be loaded until you
 start generating the output? a small delay here would be insignificant)

Hmm, I'm not sure that's actually a big deal, because your *data* will
dwarf the code. I can fire up sox and avconv, both fairly large
programs, and their code will all sit comfortably in memory; but then
they get to work on my data, and suddenly my hard disk is chewing
through 91GB of content. Breaking up avconv into a dozen pieces
wouldn't make a dent in 91GB.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

2015-02-27 Thread Grant Edwards

On 2015-02-27, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote:
 Dave Angel wrote:

 On 02/27/2015 12:58 AM, Steven D'Aprano wrote:
 Dave Angel wrote:

 (Although I believe Seymour Cray was quoted as saying that virtual
 memory is a crock, because you can't fake what you ain't got.)

 If I recall correctly, disk access is about 1 times slower than RAM,
 so virtual memory is *at least* that much slower than real memory.

 
 It's so much more complicated than that, that I hardly know where to
 start.

 [snip technical details]

 As interesting as they were, none of those details will make swap faster,
 hence my comment that virtual memory is *at least* 1 times slower than
 RAM.

Nonsense.  On all of my machines, virtual memory _is_ RAM almost all
of the time.  I don't do the type of things that force the usage of
swap.

-- 
Grant Edwards   grant.b.edwardsYow! ... I want FORTY-TWO
  at   TRYNEL FLOATATION SYSTEMS
  gmail.cominstalled within SIX AND A
   HALF HOURS!!!
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

On Sat, Feb 28, 2015 at 1:02 AM, Dave Angel da...@davea.name wrote:
 The term virtual memory is used for many aspects of the modern memory
 architecture.  But I presume you're using it in the sense of running in a
 swapfile as opposed to running in physical RAM.

Given that this started with a quote about you can't fake what you
ain't got, I would say that, yes, this refers to using hard disk to
provide more RAM.

If you're trying to use the pagefile/swapfile as if it's more memory
(I have 256MB of memory, but 10GB of swap space, so that's 10GB of
memory!), then yes, these performance considerations are huge. But
suppose you need to run a program that's larger than your available
RAM. On MS-DOS, sometimes you'd need to work with program overlays (a
concept borrowed from older systems, but ones that I never worked on,
so I'm going back no further than DOS here). You get a *massive*
complexity hit the instant you start using them, whether your program
would have been able to fit into memory on some systems or not. Just
making it possible to have only part of your code in memory places
demands on your code that you, the programmer, have to think about.
With virtual memory, though, you just write your code as if it's all
in memory, and some of it may, at some times, be on disk. Less code to
debug = less time spent debugging. The performance question is largely
immaterial (you'll be using the disk either way), but the savings on
complexity are tremendous. And then when you do find yourself running
on a system with enough RAM? No code changes needed, and full
performance. That's where virtual memory shines.

It's funny how the world changes, though. Back in the 90s, virtual
memory was the key. No home computer ever had enough RAM. Today? A
home-grade PC could easily have 16GB... and chances are you don't need
all of that. So we go for the opposite optimization: disk caching.
Apart from when I rebuild my Audio-Only Frozen project [1] and the
caches get completely blasted through, heaps and heaps of my work can
be done inside the disk cache. Hey, Sikorsky, got any files anywhere
on the hard disk matching *Pastel*.iso case insensitively? *chug chug
chug* Nope. Okay. Sikorsky, got any files matching *Pas5*.iso case
insensitively? *zip* Yeah, here it is. I didn't tell the first search
to hold all that file system data in memory; the hard drive controller
managed it all for me, and I got the performance benefit. Same as the
above: the main benefit is that this sort of thing requires zero
application code complexity. It's all done in a perfectly generic way
at a lower level.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

2015-02-27 Thread Grant Edwards

On 2015-02-27, Grant Edwards invalid@invalid.invalid wrote:
 On 2015-02-27, Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote: 
 Dave Angel wrote:
 On 02/27/2015 12:58 AM, Steven D'Aprano wrote: Dave Angel wrote:

 (Although I believe Seymour Cray was quoted as saying that virtual
 memory is a crock, because you can't fake what you ain't got.)

 If I recall correctly, disk access is about 1 times slower than RAM,
 so virtual memory is *at least* that much slower than real memory.

 It's so much more complicated than that, that I hardly know where to
 start.

 [snip technical details]

 As interesting as they were, none of those details will make swap faster,
 hence my comment that virtual memory is *at least* 1 times slower than
 RAM.

 Nonsense.  On all of my machines, virtual memory _is_ RAM almost all
 of the time.  I don't do the type of things that force the usage of
 swap.

And on some of the embedded systems I work on, _all_ virtual memory is
RAM 100.000% of the time.

-- 
Grant Edwards   grant.b.edwardsYow! Don't SANFORIZE me!!
  at   
  gmail.com
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

2015-02-27 Thread alister

On Sat, 28 Feb 2015 01:22:15 +1100, Chris Angelico wrote:

 
 If you're trying to use the pagefile/swapfile as if it's more memory (I
 have 256MB of memory, but 10GB of swap space, so that's 10GB of
 memory!), then yes, these performance considerations are huge. But
 suppose you need to run a program that's larger than your available RAM.
 On MS-DOS, sometimes you'd need to work with program overlays (a concept
 borrowed from older systems, but ones that I never worked on, so I'm
 going back no further than DOS here). You get a *massive* complexity hit
 the instant you start using them, whether your program would have been
 able to fit into memory on some systems or not. Just making it possible
 to have only part of your code in memory places demands on your code
 that you, the programmer, have to think about. With virtual memory,
 though, you just write your code as if it's all in memory, and some of
 it may, at some times, be on disk. Less code to debug = less time spent
 debugging. The performance question is largely immaterial (you'll be
 using the disk either way), but the savings on complexity are
 tremendous. And then when you do find yourself running on a system with
 enough RAM? No code changes needed, and full performance. That's where
 virtual memory shines.
 ChrisA

I think there is a case for bringing back the overlay file, or at least 
loading larger programs in sections
only loading the routines as they are required could speed up the start 
time of many large applications.
examples libre office, I rarely need the mail merge function, the word 
count and may other features that could be added into the running 
application on demand rather than all at once.

obviously with large memory  virtual mem there is no need to un-install 
them once loaded. 



-- 
Ralph's Observation:
It is a mistake to let any mechanical object realise that you
are in a hurry.
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

On Sat, Feb 28, 2015 at 3:00 AM, alister
alister.nospam.w...@ntlworld.com wrote:
 I think there is a case for bringing back the overlay file, or at least
 loading larger programs in sections
 only loading the routines as they are required could speed up the start
 time of many large applications.
 examples libre office, I rarely need the mail merge function, the word
 count and may other features that could be added into the running
 application on demand rather than all at once.

Downside of that is twofold: firstly the complexity that I already
mentioned, and secondly you pay the startup cost on first usage. So
you might get into the program a bit faster, but as soon as you go to
any feature you didn't already hit this session, the program pauses
for a bit and loads it. Sometimes startup cost is the best time to do
this sort of thing.

Of course, there is an easy way to implement exactly what you're
asking for: use separate programs for everything, instead of expecting
a megantic office suite[1] to do everything for you. Just get yourself
a nice simple text editor, then invoke other programs - maybe from a
terminal, or maybe from within the editor - to do the rest of the
work. A simple disk cache will mean that previously-used programs
start up quickly.

ChrisA

[1] It's slightly less bloated than the gigantic office suite sold by
a top-end software company.
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

2015-02-27 Thread Dave Angel


On 02/27/2015 09:22 AM, Chris Angelico wrote:

On Sat, Feb 28, 2015 at 1:02 AM, Dave Angel da...@davea.name wrote:

The term virtual memory is used for many aspects of the modern memory
architecture.  But I presume you're using it in the sense of running in a
swapfile as opposed to running in physical RAM.


Given that this started with a quote about you can't fake what you
ain't got, I would say that, yes, this refers to using hard disk to
provide more RAM.

If you're trying to use the pagefile/swapfile as if it's more memory
(I have 256MB of memory, but 10GB of swap space, so that's 10GB of
memory!), then yes, these performance considerations are huge. But
suppose you need to run a program that's larger than your available
RAM. On MS-DOS, sometimes you'd need to work with program overlays (a
concept borrowed from older systems, but ones that I never worked on,
so I'm going back no further than DOS here). You get a *massive*
complexity hit the instant you start using them, whether your program
would have been able to fit into memory on some systems or not. Just
making it possible to have only part of your code in memory places
demands on your code that you, the programmer, have to think about.
With virtual memory, though, you just write your code as if it's all
in memory, and some of it may, at some times, be on disk. Less code to
debug = less time spent debugging. The performance question is largely
immaterial (you'll be using the disk either way), but the savings on
complexity are tremendous. And then when you do find yourself running
on a system with enough RAM? No code changes needed, and full
performance. That's where virtual memory shines.

It's funny how the world changes, though. Back in the 90s, virtual
memory was the key. No home computer ever had enough RAM. Today? A
home-grade PC could easily have 16GB... and chances are you don't need
all of that. So we go for the opposite optimization: disk caching.
Apart from when I rebuild my Audio-Only Frozen project [1] and the
caches get completely blasted through, heaps and heaps of my work can
be done inside the disk cache. Hey, Sikorsky, got any files anywhere
on the hard disk matching *Pastel*.iso case insensitively? *chug chug
chug* Nope. Okay. Sikorsky, got any files matching *Pas5*.iso case
insensitively? *zip* Yeah, here it is. I didn't tell the first search
to hold all that file system data in memory; the hard drive controller
managed it all for me, and I got the performance benefit. Same as the
above: the main benefit is that this sort of thing requires zero
application code complexity. It's all done in a perfectly generic way
at a lower level.


In 1973, I did manual swapping to an external 8k ramdisk.  It was a box 
that sat on the floor and contained 8k of core memory (not 
semiconductor).  The memory was non-volatile, so it contained the 
working copy of my code.  Then I built a small swapper that would bring 
in the set of routines currently needed.  My onboard RAM (semiconductor) 
was 1.5k, which had to hold the swapper, the code, and the data.  I was 
writing a GPS system for shipboard use, and the final version of the 
code had to fit entirely in EPROM, 2k of it.  But debugging EPROM code 
is a pain, since every small change took half an hour to make new chips.


Later, I built my first PC with 512k of RAM, and usually used much of it 
as a ramdisk, since programs didn't use nearly that amount.



--
DaveA
--
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

2015-02-27 Thread MRAB


On 2015-02-27 16:45, alister wrote:

On Sat, 28 Feb 2015 03:12:16 +1100, Chris Angelico wrote:


On Sat, Feb 28, 2015 at 3:00 AM, alister
alister.nospam.w...@ntlworld.com wrote:

I think there is a case for bringing back the overlay file, or at least
loading larger programs in sections only loading the routines as they
are required could speed up the start time of many large applications.
examples libre office, I rarely need the mail merge function, the word
count and may other features that could be added into the running
application on demand rather than all at once.


Downside of that is twofold: firstly the complexity that I already
mentioned, and secondly you pay the startup cost on first usage. So you
might get into the program a bit faster, but as soon as you go to any
feature you didn't already hit this session, the program pauses for a
bit and loads it. Sometimes startup cost is the best time to do this
sort of thing.


If the modules are small enough this may not be noticeable but yes I do
accept there may be delays on first usage.


I suppose you could load the basic parts first so that the user can
start working, and then load the additional features in the background.


As to the complexity it has been my observation that as the memory
footprint available to programmers has increase they have become less 
less skilled at writing code.

of course my time as a professional programmer was over 20 years ago on 8
bit micro controllers with 8k of ROM (eventually, original I only had 2k
to play with)  128 Bytes (yes bytes!) of RAM so I am very out of date.

I now play with python because it is so much less demanding of me which
probably makes me just a guilty :-)


Of course, there is an easy way to implement exactly what you're asking
for: use separate programs for everything, instead of expecting a
megantic office suite[1] to do everything for you. Just get yourself a
nice simple text editor, then invoke other programs - maybe from a
terminal, or maybe from within the editor - to do the rest of the work.
A simple disk cache will mean that previously-used programs start up
quickly.

Libre office was sighted as just one example
Video editing suites are another that could be used as an example
(perhaps more so, does the rendering engine need to be loaded until you
start generating the output? a small delay here would be insignificant)


ChrisA

[1] It's slightly less bloated than the gigantic office suite sold by a
top-end software company.




--
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

2015-02-27 Thread Dave Angel


On 02/27/2015 11:00 AM, alister wrote:

On Sat, 28 Feb 2015 01:22:15 +1100, Chris Angelico wrote:



If you're trying to use the pagefile/swapfile as if it's more memory (I
have 256MB of memory, but 10GB of swap space, so that's 10GB of
memory!), then yes, these performance considerations are huge. But
suppose you need to run a program that's larger than your available RAM.
On MS-DOS, sometimes you'd need to work with program overlays (a concept
borrowed from older systems, but ones that I never worked on, so I'm
going back no further than DOS here). You get a *massive* complexity hit
the instant you start using them, whether your program would have been
able to fit into memory on some systems or not. Just making it possible
to have only part of your code in memory places demands on your code
that you, the programmer, have to think about. With virtual memory,
though, you just write your code as if it's all in memory, and some of
it may, at some times, be on disk. Less code to debug = less time spent
debugging. The performance question is largely immaterial (you'll be
using the disk either way), but the savings on complexity are
tremendous. And then when you do find yourself running on a system with
enough RAM? No code changes needed, and full performance. That's where
virtual memory shines.
ChrisA


I think there is a case for bringing back the overlay file, or at least
loading larger programs in sections
only loading the routines as they are required could speed up the start
time of many large applications.
examples libre office, I rarely need the mail merge function, the word
count and may other features that could be added into the running
application on demand rather than all at once.

obviously with large memory  virtual mem there is no need to un-install
them once loaded.



I can't say how Linux handles it (I'd like to know, but haven't needed 
to yet), but in Windows (NT, XP, etc), a DLL is not loaded, but rather 
mapped.  And it's not copied into the swapfile, it's mapped directly 
from the DLL.  The mapping mode is copy-on-write which means that 
read=only portions are swapped directly from the DLL, on first usage, 
while read-write portions (eg. static/global variables, relocation 
modifications) are copied on first use to the swap file.  I presume 
EXE's are done the same way, but never had a need to know.


If that's the case on the architectures you're talking about, then the 
problem of slow loading is not triggered by the memory usage, but by 
lots of initialization code.  THAT's what should be deferred for 
seldom-used portions of code.


The main point of a working-set-tuner is to group sections of code 
together that are likely to be used together.  To take an extreme case, 
all the fatal exception handlers should be positioned adjacent to each 
other in linear memory, as it's unlikely that any of them will be 
needed, and the code takes up no time or space in physical memory.


Also (in Windows), a DLL can be pre-relocated, so that it has a 
preferred address to be loaded into memory.  If that memory is available 
when it gets loaded (actually mapped), then no relocation needs to 
happen, which saves time and swap space.


In the X86 architecture, most code is self-relocating, everything is 
relative.  But references to other DLL's and jump tables were absolute, 
so they needed to be relocated at load time, when final locations were 
nailed down.


Perhaps the authors of bloated applications have forgotten how to do 
these, as the defaults in the linker puts all DLL's in the same 
location, meaning all but the first will need relocating.  But system 
DLL's  are (were) each given unique addresses.


On one large project, I added the build step of assigning these base 
addresses.  Each DLL had to start on a 64k boundary, and I reserved some 
fractional extra space between them in case one would grow.  Then every 
few months, we double-checked that they didn't overlap, and if necessary 
adjusted the start addresses.  We didn't just automatically assign 
closest addresses, because frequently some of the DLL's would be updated 
independently of the others.

--
DaveA
--
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding