Re: QWERTY was not designed to intentionally slow typists down (was: Unicode normalisation [was Re: [beginner] What's wrong?])
On Monday 18 April 2016 12:01, Random832 wrote:

> On Sun, Apr 17, 2016, at 21:39, Steven D'Aprano wrote:
>> Oh no, it's the thread that wouldn't die! *wink*
>>
>> Actually, yes it is. At least, according to this website:
>>
>> http://www.mit.edu/~jcb/Dvorak/history.html
>
> I'd really rather see an instance of the claim not associated with
> Dvorak marketing.

So would I, but this is hardly Dvorak *marketing*. The author even points out that the famous case-study done by the US Navy was "biased, and at worst, fabricated".

http://www.mit.edu/~jcb/Dvorak/

And he too repeats the canard that "Contrary to popular opinion" QWERTY wasn't designed to slow typists down. (Even though he later goes on to support the popular opinion.)

You can also read the article in Reason magazine:

http://reason.com/archives/1996/06/01/typing-errors

You can skip the entire first page -- it is almost entirely a screed against government regulation and a defence of the almighty free market. But the article goes through some fairly compelling evidence that Dvorak keyboards are barely more efficient than QWERTY, and that there was plenty of competition in typewriters in the late 1800s.

I don't agree with the Reason article that it has disproven the conventional wisdom that QWERTY won the typewriter wars due to luck and path-dependence. The authors are (in my opinion) overly keen to dismiss path-dependence, for instance taking it as self-evidently true that the use of QWERTY in the US would have no influence over other countries' choice of key layout. But it does support the contention that, at the time, QWERTY was faster than the alternatives. Unfortunately, what it doesn't talk about is whether or not the alternate layouts had fewer jams.
Wikipedia's article on QWERTY shows the various designs used by Sholes and Remington, leading to the modern layout:

https://en.wikipedia.org/wiki/QWERTY

One serious problem for discussion is that the QWERTY keyboard we use now is *not* the same as that designed by Sholes. For instance, one anomaly is that two very common digraphs, ER and RE, are right next to each other. But that's not how Sholes laid out the keys. On his keyboard, the top row was initially AEI.?Y then changed to QWE.TY. Failure to recognise this leads to errors like this blogger's claim that it is "wrong" that QWERTY was designed to break apart common digraphs:

http://yasuoka.blogspot.com.au/2006/08/sholes-discovered-that-many-english.html

Even on a modern keyboard, out of the ten most common digraphs:

th he in er an re nd at on nt

only er/re use consecutive keys, and five out of the ten use alternate hands. Move the R back to its original position, and there are none with consecutive keys and seven with alternate hands.

> It only holds up as an obvious inference from the
> nature of how typing works if we assume *one*-finger hunt-and-peck
> rather than two-finger.

I don't agree, but neither can I prove it conclusively.

> Your website describes two-finger as the method
> that was being replaced by the 1878 introduction of ten-finger typing.
>
>> The QWERTY layout was first sold in 1873 while the first known use of
>> ten-fingered typing was in 1878, and touch-typing wasn't invented for
>> another decade, in 1888.
>
> Two-finger hunt-and-peck is sufficient for placing keys on opposite
> hands to speed typing up rather than slow it down.

Correct, once you take into account jamming. That's the whole point of separating the keys. But consider common letter combinations that can be typed by the one hand: QWERTY has a significant number of quite long words that can be typed with one hand, the *left* hand. That's actually quite harmful for both typing speed and accuracy.
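For anyone who wants to check those digraph claims themselves, here is a quick sketch (not from the original posts; the left/right split follows the conventional touch-typing hand assignment):

```python
# Check the digraph claims against the modern QWERTY layout.
ROWS = ["qwertyuiop", "asdfghjkl;", "zxcvbnm,./"]
LEFT = set("qwertasdfgzxcvb")  # conventional touch-typing hand split

def consecutive(a, b):
    """True if the two keys sit side by side on the same row."""
    for row in ROWS:
        if a in row and b in row:
            return abs(row.index(a) - row.index(b)) == 1
    return False

def alternate_hands(a, b):
    return (a in LEFT) != (b in LEFT)

digraphs = "th he in er an re nd at on nt".split()
print([d for d in digraphs if consecutive(*d)])      # ['er', 're']
print([d for d in digraphs if alternate_hands(*d)])  # ['th', 'he', 'an', 'nd', 'nt']
```

which reproduces the figures in the post: er/re are the only pair on consecutive keys, and five of the ten alternate hands.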
Anyway, you seem to have ignored (or perhaps you just have nothing to say) my comments about the home keys. It seems clear to me that even with two-finger typing, a layout that puts ETAOIN on the home keys, such as the Blickensderfer typewriter, would minimize the distance travelled by the fingers and improve typing speed -- but only so long as the problem of jamming was solved.

Interestingly, Wikipedia makes it clear that in the 19th century, the problem of jamming arms was already solved by doing away with the arms and using a wheel or a ball.

-- Steve -- https://mail.python.org/mailman/listinfo/python-list
Re: QWERTY was not designed to intentionally slow typists down (was: Unicode normalisation [was Re: [beginner] What's wrong?])
On Mon, Apr 18, 2016 at 11:39 AM, Steven D'Aprano wrote:

> With QWERTY, the eight home keys only cover a fraction over a quarter of
> all key presses: ASDF JKL; have frequencies of
>
> 8.12% 6.28% 4.32% 2.30% 0.10% 0.69% 3.98% and effectively 0%
>
> making a total of 25.79%. If you also include G and H as "virtual
> home-keys", that rises to 33.74%.

Hey, that's a little unfair. Remember, lots of people still have to write C code, so the semicolon is an important character! :)

In fact, skimming the CPython source code (grouped by file extension) shows that C code has more semicolons than j's or k's:

a 3.19%
s 3.26%
d 1.90%
f 1.76%
g 0.95%
h 0.89%
j 0.36%
k 0.35%
l 2.62%
; 1.40%

for a total of 16.69% of characters coming from the home row.

ChrisA -- https://mail.python.org/mailman/listinfo/python-list
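A sketch of how one might reproduce that kind of skim. This is not ChrisA's actual code, and the "cpython" directory in the usage comment is a stand-in for wherever a local source checkout lives:

```python
# Tally home-row characters in source files under a directory tree.
# Percentages are relative to *all* characters in the files, matching
# the style of the figures quoted above.
from collections import Counter
from pathlib import Path

def home_row_share(root, ext=".c", keys="asdfghjkl;"):
    counts = Counter()
    total = 0
    for path in Path(root).rglob(f"*{ext}"):
        text = path.read_text(errors="ignore").lower()
        counts.update(ch for ch in text if ch in keys)
        total += len(text)
    return {k: counts[k] / total for k in keys} if total else {}

# e.g., with a CPython checkout in ./cpython (hypothetical path):
#   shares = home_row_share("cpython")
#   print(f"{100 * sum(shares.values()):.2f}% of characters from the home row")
```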
Re: QWERTY was not designed to intentionally slow typists down (was: Unicode normalisation [was Re: [beginner] What's wrong?])
On Sun, Apr 17, 2016, at 21:39, Steven D'Aprano wrote: > Oh no, it's the thread that wouldn't die! *wink* > > Actually, yes it is. At least, according to this website: > > http://www.mit.edu/~jcb/Dvorak/history.html I'd really rather see an instance of the claim not associated with Dvorak marketing. It only holds up as an obvious inference from the nature of how typing works if we assume *one*-finger hunt-and-peck rather than two-finger. Your website describes two-finger as the method that was being replaced by the 1878 introduction of ten-finger typing. > The QWERTY layout was first sold in 1873 while the first known use of > ten-fingered typing was in 1878, and touch-typing wasn't invented for > another decade, in 1888. Two-finger hunt-and-peck is sufficient for placing keys on opposite hands to speed typing up rather than slow it down. -- https://mail.python.org/mailman/listinfo/python-list
Re: QWERTY was not designed to intentionally slow typists down (was: Unicode normalisation [was Re: [beginner] What's wrong?])
Oh no, it's the thread that wouldn't die! *wink* On Sun, 10 Apr 2016 01:53 am, Random832 wrote: > On Fri, Apr 8, 2016, at 23:28, Steven D'Aprano wrote: >> This is the power of the "slowing typists down is a myth" meme: same >> Wikipedia contributor takes an article which *clearly and obviously* >> repeats the conventional narrative that QWERTY was designed to >> decrease the number of key presses per second, and uses that to defend >> the counter-myth that QWERTY wasn't designed to decrease the number of >> key presses per second! > > Er, the footnote is clearly and obviously being used to cite the claim > that that is popularly believed, not the claim that it's incorrect. That's not clear nor obvious to me. But I won't quibble, I'll accept that as a plausible interpretation. >> These are the historical facts: > >> - Sholes spend significant time developing a layout which reduced the >> number of jams by intentionally moving frequently typed characters >> far apart, which has the effect of slowing down the rate at which >> the typist can hit keys; > > "Moving characters far apart has the effect of slowing down the rate at > which the typist can hit keys" is neither a fact nor historical. Actually, yes it is. At least, according to this website: http://www.mit.edu/~jcb/Dvorak/history.html [quote] Because typists at that time used the "hunt-and-peck" method, Sholes's arrangement increased the time it took for the typists to hit the keys for common two-letter combinations enough to ensure that each type bar had time to fall back sufficiently far to be out of the way before the next one came up. [end quote] The QWERTY layout was first sold in 1873 while the first known use of ten-fingered typing was in 1878, and touch-typing wasn't invented for another decade, in 1888. So I think it is pretty clear that *at the time QWERTY was invented* it slowed down the rate at which keys were pressed, thus allowing an overall greater typing speed thanks to the reduced jamming. 
Short of a signed memo from Sholes himself, commenting one way or another, I don't think we're going to find anything more definitive.

Even though QWERTY wasn't designed with touch-typing in mind, it's interesting to look at some of the weaknesses of the system. It is almost as if it had been designed to make touch-typing as inefficient as possible :-)

Just consider the home keys. The home keys require the least amount of finger or hand movement, and are therefore the fastest to reach. With QWERTY, the eight home keys only cover a fraction over a quarter of all key presses: ASDF JKL; have frequencies of

8.12% 6.28% 4.32% 2.30% 0.10% 0.69% 3.98% and effectively 0%

making a total of 25.79%. If you also include G and H as "virtual home-keys", that rises to 33.74%. But that's far less than the obvious tactic of using the most common letters ETAOIN as the home keys, which would cover 51.18% just from those six letters alone. The 19th century Blickensderfer typewriter used a similar layout, with DHIATENSOR as its ten home keys. This would allow the typist to make just under 74% of all alphabetical key presses without moving the hands.

https://en.wikipedia.org/wiki/Blickensderfer_typewriter

Letter frequencies taken from here:

http://www.math.cornell.edu/~mec/2003-2004/cryptography/subs/frequencies.html

> Keys
> that are further apart *can be hit faster without jamming* due to the
> specifics of the type-basket mechanism, and there's no reason to think
> that they can't be hit with at least equal speed by the typist.

You may be correct about that specific issue when it comes to touch typing, but touch typing was 15 years in the future when Sholes invented QWERTY. And unlike Guido, he didn't have a time-machine :-)

-- Steven -- https://mail.python.org/mailman/listinfo/python-list
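Those coverage figures are easy to verify with a few lines of Python. This is my sketch, using the percentages from the Cornell frequency table cited in the post:

```python
# Letter frequencies (percent of English text), per the Cornell table.
FREQ = {
    'e': 12.02, 't': 9.10, 'a': 8.12, 'o': 7.68, 'i': 7.31, 'n': 6.95,
    's': 6.28, 'r': 6.02, 'h': 5.92, 'd': 4.32, 'l': 3.98, 'u': 2.88,
    'c': 2.71, 'm': 2.61, 'f': 2.30, 'y': 2.11, 'w': 2.09, 'g': 2.03,
    'p': 1.82, 'b': 1.49, 'v': 1.11, 'k': 0.69, 'x': 0.17, 'q': 0.11,
    'j': 0.10, 'z': 0.07,
}

def coverage(keys):
    """Total frequency covered by a set of keys; non-letters count as 0."""
    return round(sum(FREQ.get(k, 0.0) for k in keys), 2)

print(coverage("asdfjkl;"))    # 25.79  (QWERTY home keys)
print(coverage("asdfjkl;gh"))  # 33.74  (adding G and H)
print(coverage("dhiatensor"))  # 73.72  (Blickensderfer home row)
```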
Re: QWERTY was not designed to intentionally slow typists down (was: Unicode normalisation [was Re: [beginner] What's wrong?])
Ian Kelly on Sun, 10 Apr 2016 07:43:13 -0600 typed in comp.lang.python the following: >On Sat, Apr 9, 2016 at 9:09 PM, pyotr filipivich wrote: >> ASINTOER are the top eight English letters (not in any order, it >> is just that "A Sin To Err" is easy to remember. > >What's so hard to remember about ETA OIN SHRDLU? Plus that even gives >you the top twelve. :-) Depends on what you're looking for, I suppose. In this case, those eight get encoded differently than the other 20 characters. -- pyotr filipivich The fears of one class of men are not the measure of the rights of another. -- George Bancroft -- https://mail.python.org/mailman/listinfo/python-list
Re: QWERTY was not designed to intentionally slow typists down (was: Unicode normalisation [was Re: [beginner] What's wrong?])
On Sat, Apr 9, 2016 at 9:09 PM, pyotr filipivich wrote: > ASINTOER are the top eight English letters (not in any order, it > is just that "A Sin To Err" is easy to remember. What's so hard to remember about ETA OIN SHRDLU? Plus that even gives you the top twelve. :-) -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode normalisation [was Re: [beginner] What's wrong?]
Steven D'Aprano :
> But when you get down to fundamentals, character sets and alphabets have
> always blurred the line between presentation and meaning. W ("double-u")
> was, once upon a time, UU

And before that, it was VV, because the Romans used V the way we now use U, and didn't have a letter U. When U first appeared, it was just a cursive style of writing a V.

According to this, it wasn't until the 18th century that the English alphabet got both U and V as separate letters:

http://boards.straightdope.com/sdmb/showthread.php?t=147677

Apparently "uu"/"vv" came to be known as "double u" prior to that, and the name has persisted.

-- Greg -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode normalisation [was Re: [beginner] What's wrong?]
Ben Bacarisse wrote: The problem with that theory is that 'er/re' (this is e and r in either order) is the 3rd most common pair in English but have been placed together. No, they haven't. The order of the characters in the type basket goes down the slanted columns of keys, so E and R are separated by D and C. -- Greg -- https://mail.python.org/mailman/listinfo/python-list
Re: QWERTY was not designed to intentionally slow typists down (was: Unicode normalisation [was Re: [beginner] What's wrong?])
Dennis Lee Bieber on Sat, 09 Apr 2016 14:52:50 -0400 typed in comp.lang.python the following:

>On Sat, 09 Apr 2016 11:44:48 -0400, Random832 declaimed the following:
>
>>I don't understand where this idea that alternating hands makes you
>>slows you down came from in the first place... I suspect it's people who
>
> It's not (to my mind) the alternation that slows one down. It's the
>combination of putting common letters under weak fingers and some
>combinationS that require the same hand/finger to slow one down.
>
>aspect a is on the weakest left finger, with the s on a finger that
>many people have trouble moving independently from the middle finger (hmm,
>I seem to be okay moving the ring finger, but moving the middle finger
>tends to drag the ring with it). p is the weakest finger of the right hand.
>e&c use the same finger of the left hand, t is the strongest finger but one
>is coming off the lower-row reach of middle-finger c.
>
>deaf is all left hand, and the de is the same finger... earth except
>for the h is also all left hand, and rt are the same finger.
>
> I suspect for any argument for one side, a corresponding counter can be
>made for the other side. There are only 5.5 vowels (the .5 is Y) in
>English, so they are likely more prevalent than the 20-odd consonants when
>taking singly. Yet A is on the weakest finger on the weakest (for most of
>the populace) hand. IOU OTOH are in a fast three-finger roll -- and worse,
>IO is fairly common (all the ***ion endings).

ASINTOER are the top eight English letters (not in any order, it is just that "A Sin To Err" is easy to remember).

-- pyotr filipivich The fears of one class of men are not the measure of the rights of another. -- George Bancroft -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode normalisation [was Re: [beginner] What's wrong?]
On Sat, Apr 9, 2016, at 12:25 PM, Mark Lawrence via Python-list wrote:

> Again, where is the relevance to Python in this discussion, as we're on
> the main Python mailing list? Please can the moderators take this stuff
> out, it is getting beyond the pale.

You need to come to grips with the fact that python-list is only moderated in the vaguest sense of the word. Quote:

https://www.python.org/community/lists/

"Pretty much anything Python-related is fair game for discussion, and the group is even fairly tolerant of off-topic digressions; there have been entertaining discussions of topics such as floating point, good software design, and other programming languages such as Lisp and Forth."

If you don't like it, sorry. We all have our burdens to bear.

--S -- https://mail.python.org/mailman/listinfo/python-list
RE: [E] QWERTY was not designed to intentionally slow typists down (was: Unicode normalisation [was Re: [beginner] What's wrong?])
-----Original Message----- From: Ben Finney

>> This is an often-repeated myth, with citations back as far as the 1970s.
>> It is false.
>> The design is intended to reduce jamming the print heads together, but the
>> goal of this is not to reduce speed, but to enable *fast* typing.
>> It aims to maximise the frequency in which (English-language) text has
>> consecutive letters alternating either side of the middle of the keyboard.
>> This should thus reduce collisions of nearby heads — and hence
>> *increase* the effective typing speed that can be achieved on such a
>> mechanical typewriter.

When I was in high school, mid-70s, the instructor, an elderly woman, said the same thing: the placement of the keys was designed to minimize collisions of the heads. I don't remember what she called the various parts, but they all had technical names. I vaguely remember hearing the myth of slowing down typists when Dvorak's keyboard became available for PCs, '80s(?), and that this 'new' layout removed that encumbrance.

-- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode normalisation [was Re: [beginner] What's wrong?]
On 09/04/2016 17:08, Rustom Mody wrote: On Saturday, April 9, 2016 at 7:14:05 PM UTC+5:30, Ben Bacarisse wrote: The problem with that theory is that 'er/re' (this is e and r in either order) is the 3rd most common pair in English but have been placed together. ou and et (in either order) are the 15th and 22nd most common and they are separated by only one hammer position. On the other hand, the QWERTY layout puts jk together, but they almost never appear together in English text. Where do you get this (kind of) statistical data? Again, where is the relevance to Python in this discussion, as we're on the main Python mailing list? Please can the moderators take this stuff out, it is getting beyond the pale. -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode normalisation [was Re: [beginner] What's wrong?]
Rustom Mody writes:

> On Saturday, April 9, 2016 at 7:14:05 PM UTC+5:30, Ben Bacarisse wrote:
>> The problem with that theory is that 'er/re' (this is e and r in either
>> order) is the 3rd most common pair in English but have been placed
>> together. ou and et (in either order) are the 15th and 22nd most common
>> and they are separated by only one hammer position. On the other hand,
>> the QWERTY layout puts jk together, but they almost never appear
>> together in English text.
>
> Where do you get this (kind of) statistical data?

It was generated by counting the pairs found in a corpus of texts taken from Project Gutenberg. The numbers do vary depending on what you pick (for the complete works of Mark Twain er/re is second, for example), and none of the texts are very modern (because of the source), but I doubt that matters too much.

-- Ben. -- https://mail.python.org/mailman/listinfo/python-list
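For anyone wanting to reproduce that sort of count, here is one way to do it in Python. A sketch only, not Ben's actual program: it counts adjacent letter pairs within words, ignoring order within the pair, so er and re are tallied together:

```python
# Count adjacent letter pairs within words, er/re style: order inside
# the pair is ignored, so ('e', 'r') covers both "er" and "re".
from collections import Counter
import re

def pair_counts(text):
    counts = Counter()
    for word in re.findall(r"[a-z]+", text.lower()):
        for a, b in zip(word, word[1:]):
            counts[tuple(sorted((a, b)))] += 1
    return counts

sample = "where there is an error there is a pair"
print(pair_counts(sample).most_common(3))
```

Run over a pile of Project Gutenberg plain-text files instead of the toy sample, the top entries settle down to the familiar th/he/er ordering.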
Re: Unicode normalisation [was Re: [beginner] What's wrong?]
On Saturday, April 9, 2016 at 7:14:05 PM UTC+5:30, Ben Bacarisse wrote: > The problem with that theory is that 'er/re' (this is e and r in either > order) is the 3rd most common pair in English but have been placed > together. ou and et (in either order) are the 15th and 22nd most common > and they are separated by only one hammer position. On the other hand, > the QWERTY layout puts jk together, but they almost never appear > together in English text. Where do you get this (kind of) statistical data? -- https://mail.python.org/mailman/listinfo/python-list
Re: QWERTY was not designed to intentionally slow typists down (was: Unicode normalisation [was Re: [beginner] What's wrong?])
On Fri, Apr 8, 2016, at 23:28, Steven D'Aprano wrote: > This is the power of the "slowing typists down is a myth" meme: same > Wikipedia contributor takes an article which *clearly and obviously* > repeats the conventional narrative that QWERTY was designed to > decrease the number of key presses per second, and uses that to defend > the counter-myth that QWERTY wasn't designed to decrease the number of > key presses per second! Er, the footnote is clearly and obviously being used to cite the claim that that is popularly believed, not the claim that it's incorrect. > These are the historical facts: > - Sholes spend significant time developing a layout which reduced the > number of jams by intentionally moving frequently typed characters > far apart, which has the effect of slowing down the rate at which > the typist can hit keys; "Moving characters far apart has the effect of slowing down the rate at which the typist can hit keys" is neither a fact nor historical. Keys that are further apart *can be hit faster without jamming* due to the specifics of the type-basket mechanism, and there's no reason to think that they can't be hit with at least equal speed by the typist. Take a typewriter. Press Q and A (right next to each other) at the same time, and observe the distance from the type basket where the jam occurs. Now press Q and P (on the opposite side of the basket from each other) and observe where the jam occurs. -- https://mail.python.org/mailman/listinfo/python-list
Re: QWERTY was not designed to intentionally slow typists down (was: Unicode normalisation [was Re: [beginner] What's wrong?])
On Fri, Apr 8, 2016, at 23:28, Steven D'Aprano wrote:

> And how did it enable fast typing? By *slowing down the typist*, and thus
> having fewer jams.

Er, no? The point is that type bars that are closer together collide more easily *at the same actual typing speed* than ones that are further apart - For Q to collide with P, they would have to both be nearly all the way to the platen at the same time, whereas Q can collide with A even a mere millimeter from the basket (or anywhere in between).

I don't understand where this idea that alternating hands slows you down came from in the first place... I suspect it's people who haven't really thought for a minute about the physical process of typing (to type "ec" you have to physically move your left hand, to type "en" your right hand can already be moving into place while your left hand presses the first key. The former is clearly slower than the latter.) This goes double for hunt-and-peck typing, where you have to move your whole hand to press _any_ two keys on the same hand.

-- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode normalisation [was Re: [beginner] What's wrong?]
Ben Bacarisse writes: > alister writes: > >> >> the design of qwerty was not to "Slow" the typist bu to ensure that the >> hammers for letters commonly used together are spaced widely apart, >> reducing the portion of trier travel arc were the could jam. >> I and E are actually such a pair which is why they are at opposite ends >> of the hammer rack (I doubt that is the correct technical term). >> they are on opposite hands to make typing of them faster. >> unfortunately as you found it is still possible to jam them if they are >> hit almost simultaneously >> > > The problem with that theory is that 'er/re' (this is e and r in either > order) is the 3rd most common pair in English but have been placed > together. ou and et (in either order) are the 15th and 22nd most common > and they are separated by only one hammer position. On the other hand, > the QWERTY layout puts jk together, but they almost never appear > together in English text. This last part came out muddled. It's obviously wise to put infrequent combinations together (like jk), but j and k are both also rare letters so putting them together represents a wasted opportunity for meeting the supposed design objective. Swapping, say, k and r, or splitting jk but putting e in the middle would surely result in a net gain of "hammer separation". -- Ben. -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode normalisation [was Re: [beginner] What's wrong?]
alister writes: > > the design of qwerty was not to "Slow" the typist bu to ensure that the > hammers for letters commonly used together are spaced widely apart, > reducing the portion of trier travel arc were the could jam. > I and E are actually such a pair which is why they are at opposite ends > of the hammer rack (I doubt that is the correct technical term). > they are on opposite hands to make typing of them faster. > unfortunately as you found it is still possible to jam them if they are > hit almost simultaneously > The problem with that theory is that 'er/re' (this is e and r in either order) is the 3rd most common pair in English but have been placed together. ou and et (in either order) are the 15th and 22nd most common and they are separated by only one hammer position. On the other hand, the QWERTY layout puts jk together, but they almost never appear together in English text. -- Ben. -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode normalisation [was Re: [beginner] What's wrong?]
On Fri, 08 Apr 2016 20:20:02 -0400, Dennis Lee Bieber wrote:

> On Fri, 8 Apr 2016 11:04:53 -0700 (PDT), Rustom Mody declaimed the following:
>
>>Its reasonably likely that all our keyboards start QWERT...
>> Doesn't make it a sane design.
>>
> It was a sane design -- for early mechanical typewrites. It fulfills
> its goal of slowing down a typist to reduce jamming print-heads at the
> platen.* And since so many of us who had formal touch typing training
> probably learned on said mechanical typewriters, it hangs around.
> Fortunately, even though the typewriters at school had European
> dead-keys, we were plain English and I never had to pick them up.
>
> For a few years I did have problems with ()... They were on different
> keys (8 and 9, respectively) on old typewriters (the type that also had
> no 1) vs IBM Selectrics (never used by me) and computer terminals...
>
> * Except I kept jamming two letters of my last name... I and E are
> reached with the same finger on opposite hands, which made a fast
> stroke-pair (compare moving the same finger on both hands to moving
> different fingers).

the design of qwerty was not to "Slow" the typist but to ensure that the hammers for letters commonly used together are spaced widely apart, reducing the portion of their travel arc where they could jam. I and E are actually such a pair which is why they are at opposite ends of the hammer rack (I doubt that is the correct technical term). they are on opposite hands to make typing of them faster. unfortunately as you found it is still possible to jam them if they are hit almost simultaneously

-- There's a trick to the Graceful Exit. It begins with the vision to recognize when a job, a life stage, a relationship is over -- and to let go. It means leaving what's over without denying its validity or its past importance in our lives. It involves a sense of future, a belief that every exit line is an entry, that we are moving on, rather than out.
The trick of retiring well may be the trick of living well. It's hard to recognize that life isn't a holding action, but a process. It's hard to learn that we don't leave the best parts of ourselves behind, back in the dugout or the office. We own what we learned back there. The experiences and the growth are grafted onto our lives. And when we exit, we can take ourselves along -- quite gracefully. -- Ellen Goodman -- https://mail.python.org/mailman/listinfo/python-list
Re: QWERTY was not designed to intentionally slow typists down (was: Unicode normalisation [was Re: [beginner] What's wrong?])
On Sat, 9 Apr 2016 10:43 am, Ben Finney wrote:

> Dennis Lee Bieber writes:
>
>> [The QWERTY keyboard layout] was a sane design -- for early mechanical
>> typewrites. It fulfills its goal of slowing down a typist to reduce
>> jamming print-heads at the platen.
>
> This is an often-repeated myth, with citations back as far as the 1970s.
> It is false.
>
> The design is intended to reduce jamming the print heads together, but
> the goal of this is not to reduce speed, but to enable *fast* typing.

And how did it enable fast typing? By *slowing down the typist*, and thus having fewer jams.

Honestly, I have the greatest respect for the Straight Dope, but this is one of those times when they miss the forest for the trees. The conventional wisdom about typewriters isn't wrong -- or at least there's no evidence that it's wrong. As far as I can tell, *every single* argument against the conventional wisdom comes down to an argument that it is ridiculous or silly that anyone might have wanted to slow typing down. For example, Wikipedia links to this page:

http://www.smithsonianmag.com/arts-culture/fact-of-fiction-the-legend-of-the-qwerty-keyboard-49863249/?no-ist

which quotes researchers:

“The speed of Morse receiver should be equal to the Morse sender, of course. If Sholes really arranged the keyboard to slow down the operator, the operator became unable to catch up the Morse sender. We don’t believe that Sholes had such a nonsense intention during his development of Type-Writer.”

This is merely argument from personal incredulity:

http://rationalwiki.org/wiki/Argument_from_incredulity

and is trivially answerable: how well do you think the receiver can keep up with the sender if they have to stop every few dozen keystrokes to unjam the typewriter?
Wikipedia states:

"Contrary to popular belief, the QWERTY layout was not designed to slow the typist down,[3]"

with the footnote [3] linking to

http://www.maltron.com/media/lillian_kditee_001.pdf

which clearly and prominently states in the THIRD paragraph:

"It has been said of the Sholes letter layout [QWERTY] that it would probably have been chosen if the objective was to find the least efficient -- in terms of learning time and speed achievable -- and the most error producing character arrangement. This is not surprising when one considers that a team of people spent one year developing this layout so that it should provide THE GREATEST INHIBITION TO FAST KEYING. [Emphasis added.] This was no Machiavellian plot, but necessary because the mechanism of the early typewriters required slow operation."

This is the power of the "slowing typists down is a myth" meme: the same Wikipedia contributor takes an article which *clearly and obviously* repeats the conventional narrative that QWERTY was designed to decrease the number of key presses per second, and uses that to defend the counter-myth that QWERTY wasn't designed to decrease the number of key presses per second!

These are the historical facts:

- early typewriters had varying layouts, some of which allowed much more rapid keying than QWERTY;

- early typewriters were prone to frequent and difficult jamming;

- Sholes spent significant time developing a layout which reduced the number of jams by intentionally moving frequently typed characters far apart, which has the effect of slowing down the rate at which the typist can hit keys;

- which results in greater typing speed due to a reduced number of jams.

In other words, the conventional story. Jams have such a massively negative effect on typing speed that reducing the number of jams gives you a *huge* win on overall speed even if the rate of keying is significantly lower. At first glance, it may seem paradoxical, but it's not. Which is faster?
- typing at a steady speed of (let's say) 100 words per minute;

- typing in bursts of (say) 200 wpm for a minute, followed by three minutes of 0 wpm.

The second case averages half the speed of the first, even though the typist is hitting keys at a faster rate. This shouldn't be surprising to any car driver who has raced from one red light to the next, only to be caught up and even overtaken by somebody driving at a more sedate speed who caught nothing but green lights. Or to anyone who has heard the story of the Tortoise and the Hare. The moral of QWERTY is "less haste, more speed".

The myth of the "QWERTY myth" is based on the idea that people are unable to distinguish between peak speed and average speed. But ironically, in my experience, it's only those repeating the myth who seem confused by that difference (as in the quote from the Smithsonian above). Most people don't need the conventional narrative explained: "Speed up typing by slowing the typist down? Yeah, that makes sense. When I try to do things in a rush, I make more mistakes and end up taking longer than I otherwise would have. This is exactly the same sort of principle." while others, like our dear Cecil from the Straight Dope, wrongly imagine that o
QWERTY was not designed to intentionally slow typists down (was: Unicode normalisation [was Re: [beginner] What's wrong?])
Dennis Lee Bieber writes: > [The QWERTY keyboard layout] was a sane design -- for early mechanical > typewriters. It fulfills its goal of slowing down a typist to reduce > jamming print-heads at the platen. This is an often-repeated myth, with citations back as far as the 1970s. It is false. The design is intended to reduce jamming the print heads together, but the goal of this is not to reduce speed, but to enable *fast* typing. It aims to maximise the frequency with which (English-language) text has consecutive letters alternating either side of the middle of the keyboard. This should thus reduce collisions of nearby heads — and hence *increase* the effective typing speed that can be achieved on such a mechanical typewriter. The degree to which this maximum was achieved is arguable. Certainly the relevance to keyboards today, with no connection from the layout to whether print heads will jam, is negligible. What is not arguable is that there is no evidence the design had any intention of *slowing* typists in any way. Quite the opposite, in fact. http://www.straightdope.com/columns/read/221/was-the-qwerty-keyboard-purposely-designed-to-slow-typists, and other links from the Wikipedia article https://en.wikipedia.org/wiki/QWERTY#History_and_purposes, should allow interested people to get the facts right on this canard. -- \ “I used to think that the brain was the most wonderful organ in | `\ my body. Then I realized who was telling me this.” —Emo Philips | _o__) | Ben Finney -- https://mail.python.org/mailman/listinfo/python-list
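The hand-alternation claim above is easy to experiment with. A rough sketch (the left/right split of the keyboard is my assumption for illustration, not something from the thread):

```python
# Which fraction of adjacent letter pairs in a text alternates between the
# left and right halves of a QWERTY keyboard?  Higher means fewer adjacent
# type bars striking in quick succession on a mechanical typewriter.
LEFT = set("qwertasdfgzxcvb")
RIGHT = set("yuiophjklnm")

def alternation_rate(text):
    letters = LEFT | RIGHT
    pairs = [(a, b) for a, b in zip(text, text[1:])
             if a in letters and b in letters]
    if not pairs:
        return 0.0
    swaps = sum((a in LEFT) != (b in LEFT) for a, b in pairs)
    return swaps / len(pairs)

print(alternation_rate("the quick brown fox jumps over the lazy dog"))
```

For instance, "th" alternates hands (rate 1.0) while "er" does not (rate 0.0), matching the digraph discussion earlier in the thread.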
Re: Unicode normalisation [was Re: [beginner] What's wrong?]
Steven D'Aprano : > But when you get down to fundamentals, character sets and alphabets have > always blurred the line between presentation and meaning. W ("double-u") > was, once upon a time, UU But as every Finnish-speaker now knows, "w" is only an old-fashioned typographic variant of the glyph "v". We still have people who write "Wirtanen" or "Waltari" to make their last names look respectable and 19th-century-ish. Marko -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode normalisation [was Re: [beginner] What's wrong?]
On Sat, 9 Apr 2016 03:21 am, Peter Pearson wrote: > On Fri, 08 Apr 2016 16:00:10 +1000, Steven D'Aprano > wrote: >> On Fri, 8 Apr 2016 02:51 am, Peter Pearson wrote: >>> >>> The Unicode consortium was certifiably insane when it went into the >>> typesetting business. >> >> They are not, and never have been, in the typesetting business. Perhaps >> characters are not the only things easily confused *wink* > > Defining codepoints that deal with appearance but not with meaning is > going into the typesetting business. Examples: ligatures, and spaces of > varying widths with specific typesetting properties like being > non-breaking. Both of which are covered by the requirement that Unicode is capable of representing legacy encodings/code pages. Examples: MacRoman contains fl and fi ligatures, and NBSP. Non-breaking space is not so much a typesetting property as a semantic property, that is, it deals with *meaning* (exactly what you suggested it doesn't deal with). It is a space which doesn't break words. Ligatures are a good example -- the Unicode consortium have explicitly refused to add other ligatures beyond the handful needed for backwards compatibility because they maintain that it is a typesetting issue that is best handled by the font. There's even a FAQ about that very issue, and I quote: "The existing ligatures exist basically for compatibility and round-tripping with non-Unicode character sets. Their use is discouraged. No more will be encoded in any circumstances." http://www.unicode.org/faq/ligature_digraph.html#Lig2 Unicode currently contains something of the order of one hundred and ten thousand defined code points. I'm sure that if you went through the entire list, with a sufficiently loose definition of "typesetting", you could probably find some that exist only for presentation, and aren't covered by the legacy encoding clause. So what? One swallow does not mean the season is spring. 
Unicode makes an explicit rejection of being responsible for typesetting. See their discussion on presentation forms: http://www.unicode.org/faq/ligature_digraph.html#PForms But I will grant you that sometimes there's a grey area between presentation and semantics, and the Unicode consortium has to make a decision one way or another. Those decisions may not always be completely consistent, and may be driven by political and/or popular demand. E.g. the Consortium explicitly state that stylistic issues such as bold, italic, superscript etc are up to the layout engine or markup, and shouldn't be part of the Unicode character set. They insist that they only show representative glyphs for code points, and that font designers and vendors are free (within certain limits) to modify the presentation as desired. Nevertheless, there are specialist characters with distinct formatting, and variant selectors for specifying a specific glyph, and emoji modifiers for specifying skin tone. But when you get down to fundamentals, character sets and alphabets have always blurred the line between presentation and meaning. W ("double-u") was, once upon a time, UU, and & (ampersand) started off as a ligature of "et" (Latin for "and"). There are always going to be cases where well-meaning people can agree to disagree on whether or not adding the character to Unicode was justified. -- Steven -- https://mail.python.org/mailman/listinfo/python-list
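The legacy round-trip requirement mentioned above (MacRoman's fi/fl ligatures and NBSP) can be seen directly with Python's mac_roman codec; a minimal sketch:

```python
# MacRoman has dedicated bytes for the fi/fl ligatures and NBSP; decoding
# maps them to dedicated Unicode code points, and encoding reverses it.
data = bytes([0xDE, 0xDF, 0xCA])          # fi ligature, fl ligature, NBSP
text = data.decode("mac_roman")
assert text == "\ufb01\ufb02\u00a0"       # U+FB01, U+FB02, U+00A0
assert text.encode("mac_roman") == data   # lossless round trip
```

This bidirectional transcoding is exactly why those ligature code points exist at all, even though their use is now discouraged.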
Re: Unicode normalisation [was Re: [beginner] What's wrong?]
Adding link On Friday, April 8, 2016 at 11:48:07 PM UTC+5:30, Rustom Mody wrote: > 5.12 Deprecation > > In the Unicode Standard, the term deprecation is used somewhat differently > than it is in some other standards. Deprecation is used to mean that a > character or other feature is strongly discouraged from use. This should not, > however, be taken as indicating that anything has been removed from the > standard, nor that anything is planned for removal from the standard. Any > such change is constrained by the Unicode Consortium Stability Policies > [Stability]. > > For the Unicode Character Database, there are two important types of > deprecation to be noted. First, an encoded character may be deprecated. > Second, a character property may be deprecated. > > When an encoded character is strongly discouraged from use, it is given the > property value Deprecated=True. The Deprecated property is a binary property > defined specifically to carry this information about Unicode characters. Very > few characters are ever formally deprecated this way; it is not enough that a > character be uncommon, obsolete, disliked, or not preferred. Only those few > characters which have been determined by the UTC to have serious > architectural defects or which have been determined to cause significant > implementation problems are ever deprecated. Even in the most severe cases, > such as the deprecated format control characters (U+206A..U+206F), an encoded > character is never removed from the standard. Furthermore, although > deprecated characters are strongly discouraged from use, and should be > avoided in favor of other, more appropriate mechanisms, they may occur in > data. Conformant implementations of Unicode processes such as Unicode > normalization must handle even deprecated characters correctly. Link: http://unicode.org/reports/tr44/#Deprecation -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode normalisation [was Re: [beginner] What's wrong?]
On Friday, April 8, 2016 at 11:33:38 PM UTC+5:30, Peter Pearson wrote: > On Sat, 9 Apr 2016 03:50:16 +1000, Chris Angelico wrote: > > On Sat, Apr 9, 2016 at 3:44 AM, Marko Rauhamaa wrote: > [snip] > >> (As for ligatures, I understand that there might be quite a bit of > >> legacy software that dedicated code points and code pages for ligatures. > >> Translating that legacy software to Unicode was made more > >> straightforward by introducing analogous codepoints to Unicode. Unicode > >> has quite many such codepoints: µ, K, Ω etc.) > > > > More specifically, Unicode solved the problems that *codepages* had > > posed. And one of the principles of its design was that every > > character in every legacy encoding had a direct representation as a > > Unicode codepoint, allowing bidirectional transcoding for > > compatibility. Perhaps if Unicode had existed from the dawn of > > computing, we'd have less characters; but backward compatibility is > > way too important to let a narrow purity argument sway it. > > I guess with that historical perspective the current situation > seems almost inevitable. Thanks. And thanks to Steven D'Aprano > for other relevant insights. Strange view. In fact the unicode standard itself encourages not using the standard in its entirety: 5.12 Deprecation In the Unicode Standard, the term deprecation is used somewhat differently than it is in some other standards. Deprecation is used to mean that a character or other feature is strongly discouraged from use. This should not, however, be taken as indicating that anything has been removed from the standard, nor that anything is planned for removal from the standard. Any such change is constrained by the Unicode Consortium Stability Policies [Stability]. For the Unicode Character Database, there are two important types of deprecation to be noted. First, an encoded character may be deprecated. Second, a character property may be deprecated. 
When an encoded character is strongly discouraged from use, it is given the property value Deprecated=True. The Deprecated property is a binary property defined specifically to carry this information about Unicode characters. Very few characters are ever formally deprecated this way; it is not enough that a character be uncommon, obsolete, disliked, or not preferred. Only those few characters which have been determined by the UTC to have serious architectural defects or which have been determined to cause significant implementation problems are ever deprecated. Even in the most severe cases, such as the deprecated format control characters (U+206A..U+206F), an encoded character is never removed from the standard. Furthermore, although deprecated characters are strongly discouraged from use, and should be avoided in favor of other, more appropriate mechanisms, they may occur in data. Conformant implementations of Unicode processes such as Unicode normalization must handle even deprecated characters correctly. I read this as saying that -- in addition to officially deprecated chars -- there ARE "uncommon, obsolete, disliked, or not preferred" chars which sensible users should avoid using even though unicode as a standard is compelled to keep supporting them. Which translates into:

- python as a language *implementing* unicode (eg in strings) needs to do it completely if it is to be standard compliant
- python as a *user* of unicode (eg in identifiers) can (and IMHO should) use better judgement

-- https://mail.python.org/mailman/listinfo/python-list
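The quoted requirement — that conformant processes must still handle deprecated characters — is visible in Python itself. A small sketch (note: Python's unicodedata module does not expose the Deprecated property, so this only shows that the deprecated format controls remain ordinary, fully supported code points):

```python
import unicodedata

ch = "\u206a"  # one of the deprecated format controls U+206A..U+206F

# Still a perfectly ordinary code point: it has a name and a category...
assert unicodedata.category(ch) == "Cf"          # a format control

# ...and normalization must (and does) pass it through unchanged.
assert unicodedata.normalize("NFC", ch) == ch
```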
Re: Unicode normalisation [was Re: [beginner] What's wrong?]
On Friday, April 8, 2016 at 11:14:21 PM UTC+5:30, Marko Rauhamaa wrote: > Peter Pearson : > > > On Fri, 08 Apr 2016 16:00:10 +1000, Steven D'Aprano wrote: > >> They are not, and never have been, in the typesetting business. > >> Perhaps characters are not the only things easily confused *wink* > > > > Defining codepoints that deal with appearance but not with meaning is > > going into the typesetting business. Examples: ligatures, and spaces > > of varying widths with specific typesetting properties like being > > non-breaking. > > > > Typesetting done in MS Word using such Unicode codepoints will never > > be more than a goofy approximation to real typesetting (e.g., TeX), > > but it will cost a huge amount of everybody's time, with the current > > discussion of ligatures in variable names being just a straw in the > > wind. Getting all the world's writing systems into a single, coherent > > standard was an extraordinarily ambitious, monumental undertaking, and > > I'm baffled that the urge to broaden its scope in this irrelevant > > direction was entertained at all. > > I agree completely but at the same time have a lot of understanding for > the reasons why Unicode had to become such a mess. Part of it is > historical, part of it is political, yet part of it is in the > unavoidable messiness of trying to define what a character is. There are standards and standards. Just because they are standard does not make them useful, well-designed, reasonable etc. It's reasonably likely that all our keyboards start QWERT... Doesn't make it a sane design. Likewise using NFKC to define the equivalence relation on identifiers is analogous to saying: Since QWERTY has been in use for over a hundred years it's a perfectly good design. Just because NFKC has the stamp of the unicode consortium does not straightaway make it useful for all purposes -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode normalisation [was Re: [beginner] What's wrong?]
On Sat, 9 Apr 2016 03:50:16 +1000, Chris Angelico wrote: > On Sat, Apr 9, 2016 at 3:44 AM, Marko Rauhamaa wrote: [snip] >> (As for ligatures, I understand that there might be quite a bit of >> legacy software that dedicated code points and code pages for ligatures. >> Translating that legacy software to Unicode was made more >> straightforward by introducing analogous codepoints to Unicode. Unicode >> has quite many such codepoints: µ, K, Ω etc.) > > More specifically, Unicode solved the problems that *codepages* had > posed. And one of the principles of its design was that every > character in every legacy encoding had a direct representation as a > Unicode codepoint, allowing bidirectional transcoding for > compatibility. Perhaps if Unicode had existed from the dawn of > computing, we'd have less characters; but backward compatibility is > way too important to let a narrow purity argument sway it. I guess with that historical perspective the current situation seems almost inevitable. Thanks. And thanks to Steven D'Aprano for other relevant insights. -- To email me, substitute nowhere->runbox, invalid->com. -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode normalisation [was Re: [beginner] What's wrong?]
On Friday, April 8, 2016 at 10:24:17 AM UTC+5:30, Chris Angelico wrote: > On Fri, Apr 8, 2016 at 2:43 PM, Rustom Mody wrote: > > No I am not clever/criminal enough to know how to write a text that is > > visually > > close to > > print "Hello World" > > but is internally closer to > > rm -rf / > > > > For me this: > > >>> Α = 1 > A = 2 > Α + 1 == A > > True > > > > > > > is cure enough that I am not amused > > To me, the above is a contrived example. And you can contrive examples > that are just as confusing while still being ASCII-only, like > swimmer/swirnmer in many fonts, or I and l, or any number of other > visually-confusing glyphs. I propose that we ban the letters 'r' and > 'l' from identifiers, to ensure that people can't mess with > themselves. swirnmer and swimmer are distinguished by squinting a bit; А and A only by digging down into the hex. If you categorize them as similar/same... well I am not arguing... will come to you when I am short of straw... > > > Specifically as far as I am concerned if python were to throw back say > > a ligature in an identifier as a syntax error -- exactly what python2 does > > -- > > I think it would be perfectly fine and a more sane choice > > The ligature is handled straight-forwardly: it gets decomposed into > its component letters. I'm not seeing a problem here. Yes... there is no problem... HERE [I did say python gets this right that haskell for example gets wrong] What's wrong is the whole approach of swallowing gobs of characters that need not be legal at all and then getting indigestion: Note the "non-normative" in https://docs.python.org/3/reference/lexical_analysis.html#identifiers If a language reference is not normative what is? -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode normalisation [was Re: [beginner] What's wrong?]
On Sat, Apr 9, 2016 at 3:44 AM, Marko Rauhamaa wrote: > Unicode heroically and definitively solved the problems ASCII had posed > but introduced a bag of new, trickier problems. > > (As for ligatures, I understand that there might be quite a bit of > legacy software that dedicated code points and code pages for ligatures. > Translating that legacy software to Unicode was made more > straightforward by introducing analogous codepoints to Unicode. Unicode > has quite many such codepoints: µ, K, Ω etc.) More specifically, Unicode solved the problems that *codepages* had posed. And one of the principles of its design was that every character in every legacy encoding had a direct representation as a Unicode codepoint, allowing bidirectional transcoding for compatibility. Perhaps if Unicode had existed from the dawn of computing, we'd have less characters; but backward compatibility is way too important to let a narrow purity argument sway it. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode normalisation [was Re: [beginner] What's wrong?]
Peter Pearson : > On Fri, 08 Apr 2016 16:00:10 +1000, Steven D'Aprano > wrote: >> They are not, and never have been, in the typesetting business. >> Perhaps characters are not the only things easily confused *wink* > > Defining codepoints that deal with appearance but not with meaning is > going into the typesetting business. Examples: ligatures, and spaces > of varying widths with specific typesetting properties like being > non-breaking. > > Typesetting done in MS Word using such Unicode codepoints will never > be more than a goofy approximation to real typesetting (e.g., TeX), > but it will cost a huge amount of everybody's time, with the current > discussion of ligatures in variable names being just a straw in the > wind. Getting all the world's writing systems into a single, coherent > standard was an extraordinarily ambitious, monumental undertaking, and > I'm baffled that the urge to broaden its scope in this irrelevant > direction was entertained at all. I agree completely but at the same time have a lot of understanding for the reasons why Unicode had to become such a mess. Part of it is historical, part of it is political, yet part of it is in the unavoidable messiness of trying to define what a character is. For example, is "ä" one character or two: "a" plus "¨"? Is "i" one character or two: "ı" plus "˙"? Is writing linear or two-dimensional? Unicode heroically and definitively solved the problems ASCII had posed but introduced a bag of new, trickier problems. (As for ligatures, I understand that there might be quite a bit of legacy software that dedicated code points and code pages for ligatures. Translating that legacy software to Unicode was made more straightforward by introducing analogous codepoints to Unicode. Unicode has quite many such codepoints: µ, K, Ω etc.) Marko -- https://mail.python.org/mailman/listinfo/python-list
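The "ä"/"ı" examples above can be poked at with Python's unicodedata; a quick sketch:

```python
import unicodedata

# "ä" genuinely is both one character and two, depending on the form:
assert unicodedata.normalize("NFC", "a\u0308") == "\u00e4"   # a + combining ¨
assert len(unicodedata.normalize("NFD", "\u00e4")) == 2      # and back apart

# But dotless "ı" + combining dot above has no canonical composition to "i",
# so NFC leaves the two code points alone -- the messiness Marko describes.
assert unicodedata.normalize("NFC", "\u0131\u0307") == "\u0131\u0307"
```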
Re: Unicode normalisation [was Re: [beginner] What's wrong?]
On Fri, 08 Apr 2016 16:00:10 +1000, Steven D'Aprano wrote: > On Fri, 8 Apr 2016 02:51 am, Peter Pearson wrote: >> >> The Unicode consortium was certifiably insane when it went into the >> typesetting business. > > They are not, and never have been, in the typesetting business. Perhaps > characters are not the only things easily confused *wink* Defining codepoints that deal with appearance but not with meaning is going into the typesetting business. Examples: ligatures, and spaces of varying widths with specific typesetting properties like being non-breaking. Typesetting done in MS Word using such Unicode codepoints will never be more than a goofy approximation to real typesetting (e.g., TeX), but it will cost a huge amount of everybody's time, with the current discussion of ligatures in variable names being just a straw in the wind. Getting all the world's writing systems into a single, coherent standard was an extraordinarily ambitious, monumental undertaking, and I'm baffled that the urge to broaden its scope in this irrelevant direction was entertained at all. (Should this have been in cranky-geezer font?) -- To email me, substitute nowhere->runbox, invalid->com. -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode normalisation [was Re: [beginner] What's wrong?]
On Fri, Apr 8, 2016 at 4:00 PM, Steven D'Aprano wrote: > Or for that matter: > > a = akjhvciwfdwkejfc2qweoduycwldvqspjcwuhoqwe9fhlcjbqvcbhsiauy37wkg() + 100 > b = 100 + akjhvciwfdwkejfc2qweoduycwldvqspjcwuhoqew9fhlcjbqvcbhsiauy37wkg() > > How easily can you tell them apart at a glance? Ouch! Can't even align them top and bottom. This is evil. > I think that, beyond normalisation, the compiler need not be too concerned > by confusables. I wouldn't *object* to the compiler raising a warning if it > detected confusable identifiers, or mixed script identifiers, but I think > that's more the job for a linter or human code review. The compiler should treat as identical anything that an editor should reasonably treat as identical. I'm not sure whether multiple combining characters on a single base character are forced into some order prior to comparison or are kept in the order they were typed, but my gut feeling is that they should be considered identical. > They are not, and never have been, in the typesetting business. Perhaps > characters are not the only things easily confused *wink* Peter is definitely a character. So are you. QUITE a character. :) > But really, why should we object? Is "pile-of-poo" any more silly than any > of the other dingbats, graphics characters, and other non-alphabetical > characters? Unicode is not just for "letters of the alphabet". It's less silly than "ZERO-WIDTH NON-BREAKING SPACE", which isn't a space at all, it's a joiner. Go figure. (History's a wonderful thing, ain't it? So's backward compatibility and a guarantee that names will never be changed.) ChrisA -- https://mail.python.org/mailman/listinfo/python-list
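The ZWNBSP aside above checks out against the character database; a small sketch:

```python
import unicodedata

# U+FEFF: named a "space", but classified as a format control, not a space.
assert unicodedata.name("\ufeff") == "ZERO WIDTH NO-BREAK SPACE"
assert unicodedata.category("\ufeff") == "Cf"   # format control (joiner/BOM)
assert unicodedata.category("\u00a0") == "Zs"   # NBSP, by contrast, is a space
```

The name can never be corrected because of the consortium's name-stability guarantee; U+2060 WORD JOINER is the recommended replacement for the joiner use.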
Re: Unicode normalisation [was Re: [beginner] What's wrong?]
On Fri, 8 Apr 2016 02:51 am, Peter Pearson wrote: > Seriously, it's cute how neatly normalisation works when you're > watching closely and using it in the circumstances for which it was > intended, but that hardly proves that these practices won't cause much > trouble when they're used more casually and nobody's watching closely. > Considering how much energy good software engineers spend eschewing > unnecessary complexity, Maybe so, but it's not good software engineers we have to worry about, but the other 99.9% :-) > do we really want to embrace the prospect of > having different things look identical? You mean like ASCII identifiers? I'm afraid it's about fifty years too late to ban identifiers using O and 0, or l, I and 1, or rn and m. Or for that matter: a = akjhvciwfdwkejfc2qweoduycwldvqspjcwuhoqwe9fhlcjbqvcbhsiauy37wkg() + 100 b = 100 + akjhvciwfdwkejfc2qweoduycwldvqspjcwuhoqew9fhlcjbqvcbhsiauy37wkg() How easily can you tell them apart at a glance? The reality is that we trust our coders not to deliberately mess us about. As the Obfuscated C and the Underhanded C contest prove, you don't need Unicode to hide hostile code. In fact, the use of Unicode confusables in an otherwise all-ASCII file is a dead giveaway that something fishy is going on. I think that, beyond normalisation, the compiler need not be too concerned by confusables. I wouldn't *object* to the compiler raising a warning if it detected confusable identifiers, or mixed script identifiers, but I think that's more the job for a linter or human code review. > (A relevant reference point: > mixtures of spaces and tabs in Python indentation.) Most editors have an option to display whitespace, and tabs and spaces look different. Typically the tab is shown with an arrow, and the space by a dot. If people *still* confuse them, the issue is easily managed by a combination of "well don't do that" and TabError. > [snip] >> The Unicode consortium seems to disagree with you. 
> > > > The Unicode consortium was certifiably insane when it went into the > typesetting business. They are not, and never have been, in the typesetting business. Perhaps characters are not the only things easily confused *wink* (Although some members of the consortium may be. But the consortium itself isn't.) > The pile-of-poo character was just frosting on > the cake. Blame the Japanese mobile phone companies for that. When you pay your membership fee, you get to object to the addition of characters too. (Anyone, I think, can propose a new character, but only members get to choose which proposals are accepted.) But really, why should we object? Is "pile-of-poo" any more silly than any of the other dingbats, graphics characters, and other non-alphabetical characters? Unicode is not just for "letters of the alphabet". -- Steven -- https://mail.python.org/mailman/listinfo/python-list
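The sort of mixed-script check suggested above for a linter could start as crudely as this. A heuristic sketch only: it approximates a character's script by the first word of its Unicode name, since the stdlib does not expose the real Script property, and the identifiers are made up:

```python
import unicodedata

def scripts(identifier):
    """Approximate set of scripts used in an identifier, via character names."""
    return {unicodedata.name(ch).split()[0] for ch in identifier if ch.isalpha()}

assert scripts("apple") == {"LATIN"}
# "Аpple" spelled with CYRILLIC CAPITAL LETTER A is a mixed-script confusable:
assert scripts("\u0410pple") == {"CYRILLIC", "LATIN"}
```

A real tool would use the UTS #39 confusables data rather than name prefixes, but even this catches the Α/A and А/A examples elsewhere in the thread.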
Re: Unicode normalisation [was Re: [beginner] What's wrong?]
On Fri, Apr 8, 2016 at 2:43 PM, Rustom Mody wrote: > No I am not clever/criminal enough to know how to write a text that is > visually > close to > print "Hello World" > but is internally closer to > rm -rf / > > For me this: > >>> Α = 1 > >>> A = 2 > >>> Α + 1 == A > True > > > is cure enough that I am not amused To me, the above is a contrived example. And you can contrive examples that are just as confusing while still being ASCII-only, like swimmer/swirnmer in many fonts, or I and l, or any number of other visually-confusing glyphs. I propose that we ban the letters 'r' and 'l' from identifiers, to ensure that people can't mess with themselves. > Specifically as far as I am concerned if python were to throw back say > a ligature in an identifier as a syntax error -- exactly what python2 does -- > I think it would be perfectly fine and a more sane choice The ligature is handled straight-forwardly: it gets decomposed into its component letters. I'm not seeing a problem here. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
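The decomposition Chris describes is easy to confirm: CPython runs identifiers through NFKC at compile time, so a name spelled with the ligature and the plain spelling are the same variable. A small sketch:

```python
import unicodedata

src = "\ufb02ag = 1"   # an assignment to 'ﬂag', spelled with U+FB02
assert unicodedata.normalize("NFKC", "\ufb02ag") == "flag"

ns = {}
exec(src, ns)
assert ns["flag"] == 1   # the compiler stored it under the decomposed name
```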
Re: Unicode normalisation [was Re: [beginner] What's wrong?]
On Friday, April 8, 2016 at 10:13:16 AM UTC+5:30, Rustom Mody wrote: > No I am not clever/criminal enough to know how to write a text that is > visually > close to > print "Hello World" > but is internally closer to > rm -rf / > > For me this: > >>> Α = 1 > >>> A = 2 > >>> Α + 1 == A > True > >>> > > > is cure enough that I am not amused Um... "cute" was the intention [Or is it cuʇe ?] -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode normalisation [was Re: [beginner] What's wrong?]
On Thursday, April 7, 2016 at 10:22:18 PM UTC+5:30, Peter Pearson wrote: > On Thu, 07 Apr 2016 11:37:50 +1000, Steven D'Aprano wrote: > > On Thu, 7 Apr 2016 05:56 am, Thomas 'PointedEars' Lahn wrote: > >> Rustom Mody wrote: > > > >>> So here are some examples to illustrate what I am saying: > >>> > >>> Example 1 -- Ligatures: > >>> > >>> Python3 gets it right > >> flag = 1 > >> flag > >>> 1 > [snip] > >> > >> I do not think this is correct, though. Different Unicode code sequences, > >> after normalization, should result in different symbols. > > > > I think you are confused about normalisation. By definition, normalising > > different Unicode code sequences may result in the same symbols, since that > > is what normalisation means. > > > > Consider two distinct strings which nevertheless look identical: > > > > py> a = "\N{LATIN SMALL LETTER U}\N{COMBINING DIAERESIS}" > > py> b = "\N{LATIN SMALL LETTER U WITH DIAERESIS}" > > py> a == b > > False > > py> print(a, b) > > ü ü > > > > > > The purpose of normalisation is to turn one into the other: > > > > py> unicodedata.normalize('NFKC', a) == b # compose 2 code points --> 1 > > True > > py> unicodedata.normalize('NFKD', b) == a # decompose 1 code point --> 2 > > True > > It's all great fun until someone loses an eye. > > Seriously, it's cute how neatly normalisation works when you're > watching closely and using it in the circumstances for which it was > intended, but that hardly proves that these practices won't cause much > trouble when they're used more casually and nobody's watching closely. > Considering how much energy good software engineers spend eschewing > unnecessary complexity, do we really want to embrace the prospect of > having different things look identical? (A relevant reference point: > mixtures of spaces and tabs in Python indentation.) That kind of sums up my position. 
To be a casual user of unicode is one thing
To support it is another -- unicode strings in python3 -- ok so far
To mix up these two is a third without enough thought or consideration -- unicode identifiers is likely a security hole waiting to happen...

No I am not clever/criminal enough to know how to write a text that is visually close to
print "Hello World"
but is internally closer to
rm -rf /

For me this:
>>> Α = 1
>>> A = 2
>>> Α + 1 == A
True
>>>
is cure enough that I am not amused

[The only reason I brought up case distinction is that this is in the same direction and way worse than that] If python had been more serious about embracing the brave new world of unicode it should have looked in this direction: http://blog.languager.org/2014/04/unicoded-python.html Also here I suggest a classification of unicode, that, while not official or even formalizable is (I believe) helpful http://blog.languager.org/2015/03/whimsical-unicode.html Specifically as far as I am concerned if python were to throw back say a ligature in an identifier as a syntax error -- exactly what python2 does -- I think it would be perfectly fine and a more sane choice -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode normalisation [was Re: [beginner] What's wrong?]
On Fri, Apr 8, 2016 at 2:51 AM, Peter Pearson wrote: > The pile-of-poo character was just frosting on > the cake. > > (Sorry to leave you with that image.) No. You're not even a little bit sorry. You're an evil, evil man. And funny. ChrisA who knows that its codepoint is 1F4A9 without looking it up -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode normalisation [was Re: [beginner] What's wrong?]
On Thu, 07 Apr 2016 11:37:50 +1000, Steven D'Aprano wrote: > On Thu, 7 Apr 2016 05:56 am, Thomas 'PointedEars' Lahn wrote: >> Rustom Mody wrote: > >>> So here are some examples to illustrate what I am saying: >>> >>> Example 1 -- Ligatures: >>> >>> Python3 gets it right >> flag = 1 >> flag >>> 1 [snip] >> >> I do not think this is correct, though. Different Unicode code sequences, >> after normalization, should result in different symbols. > > I think you are confused about normalisation. By definition, normalising > different Unicode code sequences may result in the same symbols, since that > is what normalisation means. > > Consider two distinct strings which nevertheless look identical: > > py> a = "\N{LATIN SMALL LETTER U}\N{COMBINING DIAERESIS}" > py> b = "\N{LATIN SMALL LETTER U WITH DIAERESIS}" > py> a == b > False > py> print(a, b) > ü ü > > > The purpose of normalisation is to turn one into the other: > > py> unicodedata.normalize('NFKC', a) == b # compose 2 code points --> 1 > True > py> unicodedata.normalize('NFKD', b) == a # decompose 1 code point --> 2 > True It's all great fun until someone loses an eye. Seriously, it's cute how neatly normalisation works when you're watching closely and using it in the circumstances for which it was intended, but that hardly proves that these practices won't cause much trouble when they're used more casually and nobody's watching closely. Considering how much energy good software engineers spend eschewing unnecessary complexity, do we really want to embrace the prospect of having different things look identical? (A relevant reference point: mixtures of spaces and tabs in Python indentation.) [snip] > The Unicode consortium seems to disagree with you. The Unicode consortium was certifiably insane when it went into the typesetting business. The pile-of-poo character was just frosting on the cake. (Sorry to leave you with that image.) -- To email me, substitute nowhere->runbox, invalid->com. 
-- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode normalisation [was Re: [beginner] What's wrong?]
Steven D'Aprano:

> So even in English, capitalisation can make a semantic difference.

It can even make a pronunciation difference: polish vs Polish.

Marko
--
https://mail.python.org/mailman/listinfo/python-list
Unicode normalisation [was Re: [beginner] What's wrong?]
On Thu, 7 Apr 2016 05:56 am, Thomas 'PointedEars' Lahn wrote:

> Rustom Mody wrote:
>> So here are some examples to illustrate what I am saying:
>>
>> Example 1 -- Ligatures:
>>
>> Python3 gets it right
>>
>> >>> ﬂag = 1
>> >>> ﬂag
>> 1

Python identifiers are intentionally normalised to reduce security
issues, or at least confusion and annoyance, due to visually-identical
identifiers being treated as different.

Unicode has technical standards dealing with identifiers:

http://www.unicode.org/reports/tr31/

and visual spoofing and confusables:

http://www.unicode.org/reports/tr39/

I don't believe that CPython goes to the full extreme of checking for
mixed-script confusables, but it does partially mitigate the problem by
normalising identifiers.

Unfortunately PEP 3131 leaves a number of questions open. Presumably
they were answered in the implementation, but they aren't documented in
the PEP.

https://www.python.org/dev/peps/pep-3131/

> Fascinating; confirmed with
>
> | $ python3
> | Python 3.4.4 (default, Jan 5 2016, 15:35:18)
> | [GCC 5.3.1 20160101] on linux
> | […]
>
> I do not think this is correct, though. Different Unicode code
> sequences, after normalization, should result in different symbols.

I think you are confused about normalisation. By definition, normalising
different Unicode code sequences may result in the same symbols, since
that is what normalisation means.
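[Editor's note: the identifier normalisation described above is easy to
check directly. PEP 3131 specifies that identifiers are converted to NFKC
while parsing, so a name written with the 'ﬂ' ligature ends up bound to
the plain ASCII spelling; the exec/namespace trick below is just a way to
observe that from running code.]

```python
import unicodedata

ligature_name = "\N{LATIN SMALL LIGATURE FL}ag"   # 'ﬂag', 4 code points

# The tokenizer applies NFKC to identifiers, collapsing 'ﬂ' to 'f' + 'l'.
assert unicodedata.normalize("NFKC", ligature_name) == "flag"

# Assign through the ligature spelling; the stored name is the NFKC form.
namespace = {}
exec(ligature_name + " = 1", namespace)
print(namespace["flag"])  # prints 1
```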
Consider two distinct strings which nevertheless look identical:

py> a = "\N{LATIN SMALL LETTER U}\N{COMBINING DIAERESIS}"
py> b = "\N{LATIN SMALL LETTER U WITH DIAERESIS}"
py> a == b
False
py> print(a, b)
ü ü

The purpose of normalisation is to turn one into the other:

py> unicodedata.normalize('NFKC', a) == b  # compose 2 code points --> 1
True
py> unicodedata.normalize('NFKD', b) == a  # decompose 1 code point --> 2
True

In the case of the ﬂ ligature, normalisation splits the ligature into
individual 'f' and 'l' code points regardless of whether you compose or
decompose:

py> unicodedata.normalize('NFKC', "ﬂag") == "flag"
True
py> unicodedata.normalize('NFKD', "ﬂag") == "flag"
True

That's using the compatibility composition forms. Using the default
(canonical) forms leaves the ligature unchanged.

Note that UTS #39 (security mechanisms) suggests that identifiers should
be normalised using NFKC.

[...]

> I think Haskell gets it right here, while Py3k does not. The “ﬂ” is not
> to be decomposed to “fl”.

The Unicode consortium seems to disagree with you. Table 1 of UTS #39
(see link above) includes "Characters that cannot occur in strings
normalized to NFKC" in the Restricted category, that is, characters
which should not be used in identifiers. ﬂ cannot occur in such
normalised strings, and so it is classified as Restricted and should not
be used in identifiers.

I'm not entirely sure just how closely Python's identifiers follow the
standard, but I think that the intention is to follow something close to
"UAX31-R4. Equivalent Normalized Identifiers":

http://www.unicode.org/reports/tr31/#R4

[Rustom]
>> Python gets it wrong
>>
>> >>> a=1
>> >>> A
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in <module>
>> NameError: name 'A' is not defined
>
> This is not wrong; it is just different.

I agree with Thomas here. Case-insensitivity is a choice, and I don't
think it is a good choice for programming identifiers.
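[Editor's note: the canonical-vs-compatibility distinction drawn above is
worth seeing side by side. A minimal sketch, using only the standard
library: the canonical forms (NFC/NFD) leave the ﬂ ligature alone, while
the compatibility forms (NFKC/NFKD) split it into 'f' + 'l'.]

```python
import unicodedata

lig = "\N{LATIN SMALL LIGATURE FL}ag"   # 'ﬂag'

# Canonical normalisation: U+FB02 has no canonical decomposition,
# so the ligature survives both composition and decomposition.
assert unicodedata.normalize("NFC", lig) == lig
assert unicodedata.normalize("NFD", lig) == lig

# Compatibility normalisation: the K forms apply the compatibility
# decomposition 'ﬂ' --> 'f' + 'l', in either direction.
assert unicodedata.normalize("NFKC", lig) == "flag"
assert unicodedata.normalize("NFKD", lig) == "flag"
```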
Being able to make case distinctions between (let's say):

SPAM  # a constant, or at least constant-by-convention
Spam  # a class or type
spam  # an instance

is useful.

[Rustom]
>> With ASCII the problems are minor: Case-distinct identifiers are
>> distinct -- they dont IDENTIFY.
>
> I do not think this is a problem.
>
>> This contradicts standard English usage and practice
>
> No, it does not.

I agree with Thomas here too. Although it is rare for case to make a
distinction in English, it does happen. As the old joke goes:

    Capitalisation is the difference between helping my Uncle Jack off a
    horse, and helping my uncle jack off a horse.

So even in English, capitalisation can make a semantic difference.

--
Steven
--
https://mail.python.org/mailman/listinfo/python-list
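[Editor's note: a minimal sketch of the three-way convention described
above; the names SPAM/Spam/spam are illustrative, and only case-sensitive
name binding (which Python guarantees) makes them coexist.]

```python
SPAM = 255             # constant-by-convention


class Spam:            # a class
    pass


spam = Spam()          # an instance

# All three names are distinct bindings; none shadows another.
assert isinstance(spam, Spam)
assert SPAM == 255
```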