Re: QWERTY was not designed to intentionally slow typists down (was: Unicode normalisation [was Re: [beginner] What's wrong?])
On Monday 18 April 2016 12:01, Random832 wrote:

> On Sun, Apr 17, 2016, at 21:39, Steven D'Aprano wrote:
>> Oh no, it's the thread that wouldn't die! *wink*
>>
>> Actually, yes it is. At least, according to this website:
>>
>> http://www.mit.edu/~jcb/Dvorak/history.html
>
> I'd really rather see an instance of the claim not associated with
> Dvorak marketing.

So would I, but this is hardly Dvorak *marketing*. The author even points out that the famous case-study done by the US Navy was "biased, and at worst, fabricated".

http://www.mit.edu/~jcb/Dvorak/

And he too repeats the canard that "Contrary to popular opinion" QWERTY wasn't designed to slow typists down. (Even though he later goes on to support the popular opinion.)

You can also read the article in Reason magazine:

http://reason.com/archives/1996/06/01/typing-errors

You can skip the entire first page -- it is almost entirely a screed against government regulation and a defence of the almighty free market. But the article goes through some fairly compelling evidence that Dvorak keyboards are barely more efficient than QWERTY, and that there was plenty of competition in typewriters in the late 1800s.

I don't agree with the Reason article that it has disproven the conventional wisdom that QWERTY won the typewriter wars due to luck and path-dependence. The authors are (in my opinion) overly keen to dismiss path-dependence, for instance taking it as self-evidently true that the use of QWERTY in the US would have no influence over other countries' choice of key layout. But it does support the contention that, at the time, QWERTY was faster than the alternatives. Unfortunately, what it doesn't talk about is whether or not the alternate layouts had fewer jams.
Wikipedia's article on QWERTY shows the various designs used by Sholes and Remington, leading to the modern layout:

https://en.wikipedia.org/wiki/QWERTY

One serious problem for discussion is that the QWERTY keyboard we use now is *not* the same as that designed by Sholes. For instance, one anomaly is that two very common digraphs, ER and RE, are right next to each other. But that's not how Sholes laid out the keys. On his keyboard, the top row was initially AEI.?Y then changed to QWE.TY. Failure to recognise this leads to errors like this blogger's claim that it is "wrong" that QWERTY was designed to break apart common digraphs:

http://yasuoka.blogspot.com.au/2006/08/sholes-discovered-that-many-english.html

Even on a modern keyboard, out of the ten most common digraphs:

th he in er an re nd at on nt

only er/re use consecutive keys, and five out of the ten use alternate hands. Move the R back to its original position, and there are none with consecutive keys and seven with alternate hands.

> It only holds up as an obvious inference from the
> nature of how typing works if we assume *one*-finger hunt-and-peck
> rather than two-finger.

I don't agree, but neither can I prove it conclusively.

> Your website describes two-finger as the method
> that was being replaced by the 1878 introduction of ten-finger typing.
>
>> The QWERTY layout was first sold in 1873 while the first known use of
>> ten-fingered typing was in 1878, and touch-typing wasn't invented for
>> another decade, in 1888.
>
> Two-finger hunt-and-peck is sufficient for placing keys on opposite
> hands to speed typing up rather than slow it down.

Correct, once you take into account jamming. That's the whole point of separating the keys. But consider common letter combinations that can be typed by the one hand: QWERTY has a significant number of quite long words that can be typed with one hand, the *left* hand. That's actually quite harmful for both typing speed and accuracy.
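For anyone who wants to check those digraph claims themselves, here is a quick sketch (not from the original posts; the left/right split follows the conventional touch-typing hand assignment):

```python
# Check the digraph claims against the modern QWERTY layout.
ROWS = ["qwertyuiop", "asdfghjkl;", "zxcvbnm,./"]
LEFT = set("qwertasdfgzxcvb")  # conventional touch-typing hand split

def consecutive(a, b):
    """True if the two keys sit side by side on the same row."""
    for row in ROWS:
        if a in row and b in row:
            return abs(row.index(a) - row.index(b)) == 1
    return False

def alternate_hands(a, b):
    return (a in LEFT) != (b in LEFT)

digraphs = "th he in er an re nd at on nt".split()
print([d for d in digraphs if consecutive(*d)])      # ['er', 're']
print([d for d in digraphs if alternate_hands(*d)])  # ['th', 'he', 'an', 'nd', 'nt']
```

which reproduces the figures in the post: er/re are the only pair on consecutive keys, and five of the ten alternate hands.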
Anyway, you seem to have ignored (or perhaps you just have nothing to say) my comments about the home keys. It seems clear to me that even with two-finger typing, a layout that puts ETAOIN on the home keys, such as the Blickensderfer typewriter, would minimize the distance travelled by the fingers and improve typing speed -- but only so long as the problem of jamming was solved.

Interestingly, Wikipedia makes it clear that in the 19th century, the problem of jamming arms was already solved by doing away with the arms and using a wheel or a ball.

-- Steve -- https://mail.python.org/mailman/listinfo/python-list
Re: QWERTY was not designed to intentionally slow typists down (was: Unicode normalisation [was Re: [beginner] What's wrong?])
On Mon, Apr 18, 2016 at 11:39 AM, Steven D'Aprano wrote:

> With QWERTY, the eight home keys only cover a fraction over a quarter of
> all key presses: ASDF JKL; have frequencies of
>
> 8.12% 6.28% 4.32% 2.30% 0.10% 0.69% 3.98% and effectively 0%
>
> making a total of 25.79%. If you also include G and H as "virtual
> home-keys", that rises to 33.74%.

Hey, that's a little unfair. Remember, lots of people still have to write C code, so the semicolon is an important character! :)

In fact, skimming the CPython source code (grouped by file extension) shows that C code has more semicolons than j's or k's:

a 3.19%
s 3.26%
d 1.90%
f 1.76%
g 0.95%
h 0.89%
j 0.36%
k 0.35%
l 2.62%
; 1.40%

for a total of 16.69% of characters coming from the home row.

ChrisA -- https://mail.python.org/mailman/listinfo/python-list
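A sketch of how one might reproduce that kind of skim. This is not ChrisA's actual code, and the "cpython" directory in the usage comment is a stand-in for wherever a local source checkout lives:

```python
# Tally home-row characters in source files under a directory tree.
# Percentages are relative to *all* characters in the files, matching
# the style of the figures quoted above.
from collections import Counter
from pathlib import Path

def home_row_share(root, ext=".c", keys="asdfghjkl;"):
    counts = Counter()
    total = 0
    for path in Path(root).rglob(f"*{ext}"):
        text = path.read_text(errors="ignore").lower()
        counts.update(ch for ch in text if ch in keys)
        total += len(text)
    return {k: counts[k] / total for k in keys} if total else {}

# e.g., with a CPython checkout in ./cpython (hypothetical path):
#   shares = home_row_share("cpython")
#   print(f"{100 * sum(shares.values()):.2f}% of characters from the home row")
```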
Re: QWERTY was not designed to intentionally slow typists down (was: Unicode normalisation [was Re: [beginner] What's wrong?])
On Sun, Apr 17, 2016, at 21:39, Steven D'Aprano wrote: > Oh no, it's the thread that wouldn't die! *wink* > > Actually, yes it is. At least, according to this website: > > http://www.mit.edu/~jcb/Dvorak/history.html I'd really rather see an instance of the claim not associated with Dvorak marketing. It only holds up as an obvious inference from the nature of how typing works if we assume *one*-finger hunt-and-peck rather than two-finger. Your website describes two-finger as the method that was being replaced by the 1878 introduction of ten-finger typing. > The QWERTY layout was first sold in 1873 while the first known use of > ten-fingered typing was in 1878, and touch-typing wasn't invented for > another decade, in 1888. Two-finger hunt-and-peck is sufficient for placing keys on opposite hands to speed typing up rather than slow it down. -- https://mail.python.org/mailman/listinfo/python-list
Re: QWERTY was not designed to intentionally slow typists down (was: Unicode normalisation [was Re: [beginner] What's wrong?])
Oh no, it's the thread that wouldn't die! *wink* On Sun, 10 Apr 2016 01:53 am, Random832 wrote: > On Fri, Apr 8, 2016, at 23:28, Steven D'Aprano wrote: >> This is the power of the "slowing typists down is a myth" meme: same >> Wikipedia contributor takes an article which *clearly and obviously* >> repeats the conventional narrative that QWERTY was designed to >> decrease the number of key presses per second, and uses that to defend >> the counter-myth that QWERTY wasn't designed to decrease the number of >> key presses per second! > > Er, the footnote is clearly and obviously being used to cite the claim > that that is popularly believed, not the claim that it's incorrect. That's not clear nor obvious to me. But I won't quibble, I'll accept that as a plausible interpretation. >> These are the historical facts: > >> - Sholes spend significant time developing a layout which reduced the >> number of jams by intentionally moving frequently typed characters >> far apart, which has the effect of slowing down the rate at which >> the typist can hit keys; > > "Moving characters far apart has the effect of slowing down the rate at > which the typist can hit keys" is neither a fact nor historical. Actually, yes it is. At least, according to this website: http://www.mit.edu/~jcb/Dvorak/history.html [quote] Because typists at that time used the "hunt-and-peck" method, Sholes's arrangement increased the time it took for the typists to hit the keys for common two-letter combinations enough to ensure that each type bar had time to fall back sufficiently far to be out of the way before the next one came up. [end quote] The QWERTY layout was first sold in 1873 while the first known use of ten-fingered typing was in 1878, and touch-typing wasn't invented for another decade, in 1888. So I think it is pretty clear that *at the time QWERTY was invented* it slowed down the rate at which keys were pressed, thus allowing an overall greater typing speed thanks to the reduced jamming. 
Short of a signed memo from Sholes himself, commenting one way or another, I don't think we're going to find anything more definitive.

Even though QWERTY wasn't designed with touch-typing in mind, it's interesting to look at some of the weaknesses of the system. It is almost as if it had been designed to make touch-typing as inefficient as possible :-)

Just consider the home keys. The home keys require the least amount of finger or hand movement, and are therefore the fastest to reach. With QWERTY, the eight home keys only cover a fraction over a quarter of all key presses: ASDF JKL; have frequencies of

8.12% 6.28% 4.32% 2.30% 0.10% 0.69% 3.98% and effectively 0%

making a total of 25.79%. If you also include G and H as "virtual home-keys", that rises to 33.74%. But that's far less than the obvious tactic of using the most common letters ETAOIN as the home keys, which would cover 51.18% just from those six letters alone. The 19th century Blickensderfer typewriter used a similar layout, with DHIATENSOR as its ten home keys. This would allow the typist to make just under 74% of all alphabetical key presses without moving the hands.

https://en.wikipedia.org/wiki/Blickensderfer_typewriter

Letter frequencies taken from here:

http://www.math.cornell.edu/~mec/2003-2004/cryptography/subs/frequencies.html

> Keys
> that are further apart *can be hit faster without jamming* due to the
> specifics of the type-basket mechanism, and there's no reason to think
> that they can't be hit with at least equal speed by the typist.

You may be correct about that specific issue when it comes to touch typing, but touch typing was 15 years in the future when Sholes invented QWERTY. And unlike Guido, he didn't have a time-machine :-)

-- Steven -- https://mail.python.org/mailman/listinfo/python-list
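Those coverage figures are easy to verify with a few lines of Python. This is my sketch, using the percentages from the Cornell frequency table cited in the post:

```python
# Letter frequencies (percent of English text), per the Cornell table.
FREQ = {
    'e': 12.02, 't': 9.10, 'a': 8.12, 'o': 7.68, 'i': 7.31, 'n': 6.95,
    's': 6.28, 'r': 6.02, 'h': 5.92, 'd': 4.32, 'l': 3.98, 'u': 2.88,
    'c': 2.71, 'm': 2.61, 'f': 2.30, 'y': 2.11, 'w': 2.09, 'g': 2.03,
    'p': 1.82, 'b': 1.49, 'v': 1.11, 'k': 0.69, 'x': 0.17, 'q': 0.11,
    'j': 0.10, 'z': 0.07,
}

def coverage(keys):
    """Total frequency covered by a set of keys; non-letters count as 0."""
    return round(sum(FREQ.get(k, 0.0) for k in keys), 2)

print(coverage("asdfjkl;"))    # 25.79  (QWERTY home keys)
print(coverage("asdfjkl;gh"))  # 33.74  (adding G and H)
print(coverage("dhiatensor"))  # 73.72  (Blickensderfer home row)
```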
Re: QWERTY was not designed to intentionally slow typists down (was: Unicode normalisation [was Re: [beginner] What's wrong?])
Ian Kelly on Sun, 10 Apr 2016 07:43:13 -0600 typed in comp.lang.python the following: >On Sat, Apr 9, 2016 at 9:09 PM, pyotr filipivich wrote: >> ASINTOER are the top eight English letters (not in any order, it >> is just that "A Sin To Err" is easy to remember. > >What's so hard to remember about ETA OIN SHRDLU? Plus that even gives >you the top twelve. :-) Depends on what you're looking for, I suppose. In this case, those eight get encoded differently than the other 20 characters. -- pyotr filipivich The fears of one class of men are not the measure of the rights of another. -- George Bancroft -- https://mail.python.org/mailman/listinfo/python-list
Re: QWERTY was not designed to intentionally slow typists down (was: Unicode normalisation [was Re: [beginner] What's wrong?])
On Sat, Apr 9, 2016 at 9:09 PM, pyotr filipivich wrote: > ASINTOER are the top eight English letters (not in any order, it > is just that "A Sin To Err" is easy to remember. What's so hard to remember about ETA OIN SHRDLU? Plus that even gives you the top twelve. :-) -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode normalisation [was Re: [beginner] What's wrong?]
Steven D'Aprano :
> But when you get down to fundamentals, character sets and alphabets have
> always blurred the line between presentation and meaning. W ("double-u")
> was, once upon a time, UU

And before that, it was VV, because the Romans used V the way we now use U, and didn't have a letter U. When U first appeared, it was just a cursive style of writing a V.

According to this, it wasn't until the 18th century that the English alphabet got both U and V as separate letters:

http://boards.straightdope.com/sdmb/showthread.php?t=147677

Apparently "uu"/"vv" came to be known as "double u" prior to that, and the name has persisted.

-- Greg -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode normalisation [was Re: [beginner] What's wrong?]
Ben Bacarisse wrote: The problem with that theory is that 'er/re' (this is e and r in either order) is the 3rd most common pair in English but have been placed together. No, they haven't. The order of the characters in the type basket goes down the slanted columns of keys, so E and R are separated by D and C. -- Greg -- https://mail.python.org/mailman/listinfo/python-list
Re: QWERTY was not designed to intentionally slow typists down (was: Unicode normalisation [was Re: [beginner] What's wrong?])
Dennis Lee Bieber on Sat, 09 Apr 2016 14:52:50 -0400 typed in comp.lang.python the following:

>On Sat, 09 Apr 2016 11:44:48 -0400, Random832 declaimed the following:
>
>>I don't understand where this idea that alternating hands makes you
>>slows you down came from in the first place... I suspect it's people who
>
> It's not (to my mind) the alternation that slows one down. It's the
>combination of putting common letters under weak fingers and some
>combinationS that require the same hand/finger to slow one down.
>
>aspect a is on the weakest left finger, with the s on a finger that
>many people have trouble moving independently from the middle finger (hmm,
>I seem to be okay moving the ring finger, but moving the middle finger
>tends to drag the ring with it). p is the weakest finger of the right hand.
>e&c use the same finger of the left hand, t is the strongest finger but one
>is coming off the lower-row reach of middle-finger c.
>
>deaf is all left hand, and the de is the same finger... earth except
>for the h is also all left hand, and rt are the same finger.
>
> I suspect for any argument for one side, a corresponding counter can be
>made for the other side. There are only 5.5 vowels (the .5 is Y) in
>English, so they are likely more prevalent than the 20-odd consonants when
>taking singly. Yet A is on the weakest finger on the weakest (for most of
>the populace) hand. IOU OTOH are in a fast three-finger roll -- and worse,
>IO is fairly common (all the ***ion endings).

ASINTOER are the top eight English letters (not in any order, it is just that "A Sin To Err" is easy to remember).

-- pyotr filipivich The fears of one class of men are not the measure of the rights of another. -- George Bancroft -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode normalisation [was Re: [beginner] What's wrong?]
On Sat, Apr 9, 2016, at 12:25 PM, Mark Lawrence via Python-list wrote:

> Again, where is the relevance to Python in this discussion, as we're on
> the main Python mailing list? Please can the moderators take this stuff
> out, it is getting beyond the pale.

You need to come to grips with the fact that python-list is only moderated in the vaguest sense of the word. Quote:

https://www.python.org/community/lists/

"Pretty much anything Python-related is fair game for discussion, and the group is even fairly tolerant of off-topic digressions; there have been entertaining discussions of topics such as floating point, good software design, and other programming languages such as Lisp and Forth."

If you don't like it, sorry. We all have our burdens to bear.

--S -- https://mail.python.org/mailman/listinfo/python-list
RE: [E] QWERTY was not designed to intentionally slow typists down (was: Unicode normalisation [was Re: [beginner] What's wrong?])
-----Original Message----- From: Ben Finney

>> This is an often-repeated myth, with citations back as far as the 1970s.
>> It is false.
>> The design is intended to reduce jamming the print heads together, but the
>> goal of this is not to reduce speed, but to enable *fast* typing.
>> It aims to maximise the frequency in which (English-language) text has
>> consecutive letters alternating either side of the middle of the keyboard.
>> This should thus reduce collisions of nearby heads — and hence
>> *increase* the effective typing speed that can be achieved on such a
>> mechanical typewriter.

When I was in high school, mid-70s, the instructor, an elderly woman, said the same thing: the placement of the keys was designed to minimize collisions of the heads. I don't remember what she called the various parts, but they all had technical names. I vaguely remember hearing the myth of slowing down typists when Dvorak's keyboard became available for PCs, '80s(?), and that this 'new' layout removed that encumbrance.

-- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode normalisation [was Re: [beginner] What's wrong?]
On 09/04/2016 17:08, Rustom Mody wrote: On Saturday, April 9, 2016 at 7:14:05 PM UTC+5:30, Ben Bacarisse wrote: The problem with that theory is that 'er/re' (this is e and r in either order) is the 3rd most common pair in English but have been placed together. ou and et (in either order) are the 15th and 22nd most common and they are separated by only one hammer position. On the other hand, the QWERTY layout puts jk together, but they almost never appear together in English text. Where do you get this (kind of) statistical data? Again, where is the relevance to Python in this discussion, as we're on the main Python mailing list? Please can the moderators take this stuff out, it is getting beyond the pale. -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode normalisation [was Re: [beginner] What's wrong?]
Rustom Mody writes:

> On Saturday, April 9, 2016 at 7:14:05 PM UTC+5:30, Ben Bacarisse wrote:
>> The problem with that theory is that 'er/re' (this is e and r in either
>> order) is the 3rd most common pair in English but have been placed
>> together. ou and et (in either order) are the 15th and 22nd most common
>> and they are separated by only one hammer position. On the other hand,
>> the QWERTY layout puts jk together, but they almost never appear
>> together in English text.
>
> Where do you get this (kind of) statistical data?

It was generated by counting the pairs found in a corpus of texts taken from Project Gutenberg. The numbers do vary depending on what you pick (for the complete works of Mark Twain er/re is second, for example), and none of the texts are very modern (because of the source), but I doubt that matters too much.

-- Ben. -- https://mail.python.org/mailman/listinfo/python-list
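For anyone wanting to reproduce that sort of count, here is one way to do it in Python. A sketch only, not Ben's actual program: it counts adjacent letter pairs within words, ignoring order within the pair, so er and re are tallied together:

```python
# Count adjacent letter pairs within words, er/re style: order inside
# the pair is ignored, so ('e', 'r') covers both "er" and "re".
from collections import Counter
import re

def pair_counts(text):
    counts = Counter()
    for word in re.findall(r"[a-z]+", text.lower()):
        for a, b in zip(word, word[1:]):
            counts[tuple(sorted((a, b)))] += 1
    return counts

sample = "where there is an error there is a pair"
print(pair_counts(sample).most_common(3))
```

Run over a pile of Project Gutenberg plain-text files instead of the toy sample, the top entries settle down to the familiar th/he/er ordering.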
Re: Unicode normalisation [was Re: [beginner] What's wrong?]
On Saturday, April 9, 2016 at 7:14:05 PM UTC+5:30, Ben Bacarisse wrote: > The problem with that theory is that 'er/re' (this is e and r in either > order) is the 3rd most common pair in English but have been placed > together. ou and et (in either order) are the 15th and 22nd most common > and they are separated by only one hammer position. On the other hand, > the QWERTY layout puts jk together, but they almost never appear > together in English text. Where do you get this (kind of) statistical data? -- https://mail.python.org/mailman/listinfo/python-list
Re: QWERTY was not designed to intentionally slow typists down (was: Unicode normalisation [was Re: [beginner] What's wrong?])
On Fri, Apr 8, 2016, at 23:28, Steven D'Aprano wrote: > This is the power of the "slowing typists down is a myth" meme: same > Wikipedia contributor takes an article which *clearly and obviously* > repeats the conventional narrative that QWERTY was designed to > decrease the number of key presses per second, and uses that to defend > the counter-myth that QWERTY wasn't designed to decrease the number of > key presses per second! Er, the footnote is clearly and obviously being used to cite the claim that that is popularly believed, not the claim that it's incorrect. > These are the historical facts: > - Sholes spend significant time developing a layout which reduced the > number of jams by intentionally moving frequently typed characters > far apart, which has the effect of slowing down the rate at which > the typist can hit keys; "Moving characters far apart has the effect of slowing down the rate at which the typist can hit keys" is neither a fact nor historical. Keys that are further apart *can be hit faster without jamming* due to the specifics of the type-basket mechanism, and there's no reason to think that they can't be hit with at least equal speed by the typist. Take a typewriter. Press Q and A (right next to each other) at the same time, and observe the distance from the type basket where the jam occurs. Now press Q and P (on the opposite side of the basket from each other) and observe where the jam occurs. -- https://mail.python.org/mailman/listinfo/python-list
Re: QWERTY was not designed to intentionally slow typists down (was: Unicode normalisation [was Re: [beginner] What's wrong?])
On Fri, Apr 8, 2016, at 23:28, Steven D'Aprano wrote:

> And how did it enable fast typing? By *slowing down the typist*, and thus
> having fewer jams.

Er, no? The point is that type bars that are closer together collide more easily *at the same actual typing speed* than ones that are further apart - For Q to collide with P, they would have to both be nearly all the way to the platen at the same time, whereas Q can collide with A even a mere millimeter from the basket (or anywhere in between).

I don't understand where this idea that alternating hands slows you down came from in the first place... I suspect it's people who haven't really thought for a minute about the physical process of typing (to type "ec" you have to physically move your left hand, to type "en" your right hand can already be moving into place while your left hand presses the first key. The former is clearly slower than the latter.) This goes double for hunt-and-peck typing, where you have to move your whole hand to press _any_ two keys on the same hand.

-- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode normalisation [was Re: [beginner] What's wrong?]
Ben Bacarisse writes: > alister writes: > >> >> the design of qwerty was not to "Slow" the typist bu to ensure that the >> hammers for letters commonly used together are spaced widely apart, >> reducing the portion of trier travel arc were the could jam. >> I and E are actually such a pair which is why they are at opposite ends >> of the hammer rack (I doubt that is the correct technical term). >> they are on opposite hands to make typing of them faster. >> unfortunately as you found it is still possible to jam them if they are >> hit almost simultaneously >> > > The problem with that theory is that 'er/re' (this is e and r in either > order) is the 3rd most common pair in English but have been placed > together. ou and et (in either order) are the 15th and 22nd most common > and they are separated by only one hammer position. On the other hand, > the QWERTY layout puts jk together, but they almost never appear > together in English text. This last part came out muddled. It's obviously wise to put infrequent combinations together (like jk), but j and k are both also rare letters so putting them together represents a wasted opportunity for meeting the supposed design objective. Swapping, say, k and r, or splitting jk but putting e in the middle would surely result in a net gain of "hammer separation". -- Ben. -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode normalisation [was Re: [beginner] What's wrong?]
alister writes: > > the design of qwerty was not to "Slow" the typist bu to ensure that the > hammers for letters commonly used together are spaced widely apart, > reducing the portion of trier travel arc were the could jam. > I and E are actually such a pair which is why they are at opposite ends > of the hammer rack (I doubt that is the correct technical term). > they are on opposite hands to make typing of them faster. > unfortunately as you found it is still possible to jam them if they are > hit almost simultaneously > The problem with that theory is that 'er/re' (this is e and r in either order) is the 3rd most common pair in English but have been placed together. ou and et (in either order) are the 15th and 22nd most common and they are separated by only one hammer position. On the other hand, the QWERTY layout puts jk together, but they almost never appear together in English text. -- Ben. -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode normalisation [was Re: [beginner] What's wrong?]
On Fri, 08 Apr 2016 20:20:02 -0400, Dennis Lee Bieber wrote:

> On Fri, 8 Apr 2016 11:04:53 -0700 (PDT), Rustom Mody declaimed the following:
>
>>Its reasonably likely that all our keyboards start QWERT...
>> Doesn't make it a sane design.
>>
> It was a sane design -- for early mechanical typewrites. It fulfills
> its goal of slowing down a typist to reduce jamming print-heads at the
> platen.* And since so many of us who had formal touch typing training
> probably learned on said mechanical typewriters, it hangs around.
> Fortunately, even though the typewriters at school had European
> dead-keys, we were plain English and I never had to pick them up.
>
> For a few years I did have problems with ()... They were on different
> keys (8 and 9, respectively) on old typewriters (the type that also had
> no 1) vs IBM Selectrics (never used by me) and computer terminals...
>
> * Except I kept jamming two letters of my last name... I and E are
> reached with the same finger on opposite hands, which made a fast
> stroke-pair (compare moving the same finger on both hands to moving
> different fingers).

the design of qwerty was not to "Slow" the typist but to ensure that the hammers for letters commonly used together are spaced widely apart, reducing the portion of their travel arc where they could jam. I and E are actually such a pair which is why they are at opposite ends of the hammer rack (I doubt that is the correct technical term). they are on opposite hands to make typing of them faster. unfortunately as you found it is still possible to jam them if they are hit almost simultaneously

-- There's a trick to the Graceful Exit. It begins with the vision to recognize when a job, a life stage, a relationship is over -- and to let go. It means leaving what's over without denying its validity or its past importance in our lives. It involves a sense of future, a belief that every exit line is an entry, that we are moving on, rather than out.
The trick of retiring well may be the trick of living well. It's hard to recognize that life isn't a holding action, but a process. It's hard to learn that we don't leave the best parts of ourselves behind, back in the dugout or the office. We own what we learned back there. The experiences and the growth are grafted onto our lives. And when we exit, we can take ourselves along -- quite gracefully. -- Ellen Goodman -- https://mail.python.org/mailman/listinfo/python-list
Re: QWERTY was not designed to intentionally slow typists down (was: Unicode normalisation [was Re: [beginner] What's wrong?])
On Sat, 9 Apr 2016 10:43 am, Ben Finney wrote:

> Dennis Lee Bieber writes:
>
>> [The QWERTY keyboard layout] was a sane design -- for early mechanical
>> typewrites. It fulfills its goal of slowing down a typist to reduce
>> jamming print-heads at the platen.
>
> This is an often-repeated myth, with citations back as far as the 1970s.
> It is false.
>
> The design is intended to reduce jamming the print heads together, but
> the goal of this is not to reduce speed, but to enable *fast* typing.

And how did it enable fast typing? By *slowing down the typist*, and thus having fewer jams.

Honestly, I have the greatest respect for the Straight Dope, but this is one of those times when they miss the forest for the trees. The conventional wisdom about typewriters isn't wrong -- or at least there's no evidence that it's wrong. As far as I can tell, *every single* argument against the conventional wisdom comes down to an argument that it is ridiculous or silly that anyone might have wanted to slow typing down. For example, Wikipedia links to this page:

http://www.smithsonianmag.com/arts-culture/fact-of-fiction-the-legend-of-the-qwerty-keyboard-49863249/?no-ist

which quotes researchers:

“The speed of Morse receiver should be equal to the Morse sender, of course. If Sholes really arranged the keyboard to slow down the operator, the operator became unable to catch up the Morse sender. We don’t believe that Sholes had such a nonsense intention during his development of Type-Writer.”

This is merely argument from personal incredulity:

http://rationalwiki.org/wiki/Argument_from_incredulity

and is trivially answerable: how well do you think the receiver can keep up with the sender if they have to stop every few dozen keystrokes to unjam the typewriter?
Wikipedia states:

"Contrary to popular belief, the QWERTY layout was not designed to slow the typist down,[3]"

with the footnote [3] linking to

http://www.maltron.com/media/lillian_kditee_001.pdf

which clearly and prominently states in the THIRD paragraph:

"It has been said of the Sholes letter layout [QWERTY] that it would probably have been chosen if the objective was to find the least efficient -- in terms of learning time and speed achievable -- and the most error producing character arrangement. This is not surprising when one considers that a team of people spent one year developing this layout so that it should provide THE GREATEST INHIBITION TO FAST KEYING. [Emphasis added.] This was no Machiavellian plot, but necessary because the mechanism of the early typewriters required slow operation."

This is the power of the "slowing typists down is a myth" meme: the same Wikipedia contributor takes an article which *clearly and obviously* repeats the conventional narrative that QWERTY was designed to decrease the number of key presses per second, and uses that to defend the counter-myth that QWERTY wasn't designed to decrease the number of key presses per second!

These are the historical facts:

- early typewriters had varying layouts, some of which allowed much more rapid keying than QWERTY;

- early typewriters were prone to frequent and difficult jamming;

- Sholes spent significant time developing a layout which reduced the number of jams by intentionally moving frequently typed characters far apart, which has the effect of slowing down the rate at which the typist can hit keys;

- which results in greater typing speed due to a reduced number of jams.

In other words, the conventional story. Jams have such a massively negative effect on typing speed that reducing the number of jams gives you a *huge* win on overall speed even if the rate of keying is significantly lower. At first glance, it may seem paradoxical, but it's not. Which is faster?
- typing at a steady speed of (let's say) 100 words per minute;

- typing in bursts of (say) 200 wpm for a minute, followed by three minutes of 0 wpm.

The second case averages half the speed of the first, even though the typist is hitting keys at a faster rate. This shouldn't be surprising to any car driver who has raced from one red light to the next, only to be caught up and even overtaken by somebody driving at a more sedate speed who caught nothing but green lights. Or to anyone who has heard the story of the Tortoise and the Hare. The moral of QWERTY is "less haste, more speed".

The myth of the "QWERTY myth" is based on the idea that people are unable to distinguish between peak speed and average speed. But ironically, in my experience, it's only those repeating the myth who seem confused by that difference (as in the quote from the Smithsonian above). Most people don't need the conventional narrative explained: "Speed up typing by slowing the typist down? Yeah, that makes sense. When I try to do things in a rush, I make more mistakes and end up taking longer than I otherwise would have. This is exactly the same sort of principle." while others, like our dear Cecil from the Straight Dope, wrongly imagine that o
QWERTY was not designed to intentionally slow typists down (was: Unicode normalisation [was Re: [beginner] What's wrong?])
Dennis Lee Bieber writes: > [The QWERTY keyboard layout] was a sane design -- for early mechanical > typewriters. It fulfills its goal of slowing down a typist to reduce > jamming print-heads at the platen. This is an often-repeated myth, with citations back as far as the 1970s. It is false. The design is intended to reduce jamming the print heads together, but the goal of this is not to reduce speed, but to enable *fast* typing. It aims to maximise the frequency with which (English-language) text has consecutive letters alternating either side of the middle of the keyboard. This should thus reduce collisions of nearby heads — and hence *increase* the effective typing speed that can be achieved on such a mechanical typewriter. The degree to which this maximum was achieved is arguable. Certainly the relevance to keyboards today, with no connection from the layout to whether print heads will jam, is negligible. What is not arguable is that there is no evidence the design had any intention of *slowing* typists in any way. Quite the opposite, in fact. http://www.straightdope.com/columns/read/221/was-the-qwerty-keyboard-purposely-designed-to-slow-typists, and other links from the Wikipedia article https://en.wikipedia.org/wiki/QWERTY#History_and_purposes, should allow interested people to get the facts right on this canard. -- \ “I used to think that the brain was the most wonderful organ in | `\ my body. Then I realized who was telling me this.” —Emo Philips | _o__) | Ben Finney -- https://mail.python.org/mailman/listinfo/python-list
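The hand-alternation claim above is easy to experiment with. A rough sketch (the left/right split of the keyboard is my assumption for illustration, not something from the thread):

```python
# Which fraction of adjacent letter pairs in a text alternates between the
# left and right halves of a QWERTY keyboard?  Higher means fewer adjacent
# type bars striking in quick succession on a mechanical typewriter.
LEFT = set("qwertasdfgzxcvb")
RIGHT = set("yuiophjklnm")

def alternation_rate(text):
    letters = LEFT | RIGHT
    pairs = [(a, b) for a, b in zip(text, text[1:])
             if a in letters and b in letters]
    if not pairs:
        return 0.0
    swaps = sum((a in LEFT) != (b in LEFT) for a, b in pairs)
    return swaps / len(pairs)

print(alternation_rate("the quick brown fox jumps over the lazy dog"))
```

For instance, "th" alternates hands (rate 1.0) while "er" does not (rate 0.0), matching the digraph discussion earlier in the thread.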
Re: Unicode normalisation [was Re: [beginner] What's wrong?]
Steven D'Aprano : > But when you get down to fundamentals, character sets and alphabets have > always blurred the line between presentation and meaning. W ("double-u") > was, once upon a time, UU But as every Finnish-speaker now knows, "w" is only an old-fashioned typographic variant of the glyph "v". We still have people who write "Wirtanen" or "Waltari" to make their last names look respectable and 19th-century-ish. Marko -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode normalisation [was Re: [beginner] What's wrong?]
On Sat, 9 Apr 2016 03:21 am, Peter Pearson wrote: > On Fri, 08 Apr 2016 16:00:10 +1000, Steven D'Aprano > wrote: >> On Fri, 8 Apr 2016 02:51 am, Peter Pearson wrote: >>> >>> The Unicode consortium was certifiably insane when it went into the >>> typesetting business. >> >> They are not, and never have been, in the typesetting business. Perhaps >> characters are not the only things easily confused *wink* > > Defining codepoints that deal with appearance but not with meaning is > going into the typesetting business. Examples: ligatures, and spaces of > varying widths with specific typesetting properties like being > non-breaking. Both of which are covered by the requirement that Unicode is capable of representing legacy encodings/code pages. Examples: MacRoman contains fl and fi ligatures, and NBSP. Non-breaking space is not so much a typesetting property as a semantic property, that is, it deals with *meaning* (exactly what you suggested it doesn't deal with). It is a space which doesn't break words. Ligatures are a good example -- the Unicode consortium have explicitly refused to add other ligatures beyond the handful needed for backwards compatibility because they maintain that it is a typesetting issue that is best handled by the font. There's even a FAQ about that very issue, and I quote: "The existing ligatures exist basically for compatibility and round-tripping with non-Unicode character sets. Their use is discouraged. No more will be encoded in any circumstances." http://www.unicode.org/faq/ligature_digraph.html#Lig2 Unicode currently contains something of the order of one hundred and ten thousand defined code points. I'm sure that if you went through the entire list, with a sufficiently loose definition of "typesetting", you could probably find some that exist only for presentation, and aren't covered by the legacy encoding clause. So what? One swallow does not mean the season is spring. 
Unicode makes an explicit rejection of being responsible for typesetting. See their discussion on presentation forms: http://www.unicode.org/faq/ligature_digraph.html#PForms But I will grant you that sometimes there's a grey area between presentation and semantics, and the Unicode consortium has to make a decision one way or another. Those decisions may not always be completely consistent, and may be driven by political and/or popular demand. E.g. the Consortium explicitly state that stylistic issues such as bold, italic, superscript etc are up to the layout engine or markup, and shouldn't be part of the Unicode character set. They insist that they only show representative glyphs for code points, and that font designers and vendors are free (within certain limits) to modify the presentation as desired. Nevertheless, there are specialist characters with distinct formatting, and variant selectors for specifying a specific glyph, and emoji modifiers for specifying skin tone. But when you get down to fundamentals, character sets and alphabets have always blurred the line between presentation and meaning. W ("double-u") was, once upon a time, UU, and & (ampersand) started off as a ligature of "et" (Latin for "and"). There are always going to be cases where well-meaning people can agree to disagree on whether or not adding the character to Unicode was justified. -- Steven -- https://mail.python.org/mailman/listinfo/python-list
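The legacy round-trip requirement mentioned above (MacRoman's fi/fl ligatures and NBSP) can be seen directly with Python's mac_roman codec; a minimal sketch:

```python
# MacRoman has dedicated bytes for the fi/fl ligatures and NBSP; decoding
# maps them to dedicated Unicode code points, and encoding reverses it.
data = bytes([0xDE, 0xDF, 0xCA])          # fi ligature, fl ligature, NBSP
text = data.decode("mac_roman")
assert text == "\ufb01\ufb02\u00a0"       # U+FB01, U+FB02, U+00A0
assert text.encode("mac_roman") == data   # lossless round trip
```

This bidirectional transcoding is exactly why those ligature code points exist at all, even though their use is now discouraged.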
Re: Unicode normalisation [was Re: [beginner] What's wrong?]
Adding link On Friday, April 8, 2016 at 11:48:07 PM UTC+5:30, Rustom Mody wrote: > 5.12 Deprecation > > In the Unicode Standard, the term deprecation is used somewhat differently > than it is in some other standards. Deprecation is used to mean that a > character or other feature is strongly discouraged from use. This should not, > however, be taken as indicating that anything has been removed from the > standard, nor that anything is planned for removal from the standard. Any > such change is constrained by the Unicode Consortium Stability Policies > [Stability]. > > For the Unicode Character Database, there are two important types of > deprecation to be noted. First, an encoded character may be deprecated. > Second, a character property may be deprecated. > > When an encoded character is strongly discouraged from use, it is given the > property value Deprecated=True. The Deprecated property is a binary property > defined specifically to carry this information about Unicode characters. Very > few characters are ever formally deprecated this way; it is not enough that a > character be uncommon, obsolete, disliked, or not preferred. Only those few > characters which have been determined by the UTC to have serious > architectural defects or which have been determined to cause significant > implementation problems are ever deprecated. Even in the most severe cases, > such as the deprecated format control characters (U+206A..U+206F), an encoded > character is never removed from the standard. Furthermore, although > deprecated characters are strongly discouraged from use, and should be > avoided in favor of other, more appropriate mechanisms, they may occur in > data. Conformant implementations of Unicode processes such as Unicode > normalization must handle even deprecated characters correctly. Link: http://unicode.org/reports/tr44/#Deprecation -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode normalisation [was Re: [beginner] What's wrong?]
On Friday, April 8, 2016 at 11:33:38 PM UTC+5:30, Peter Pearson wrote: > On Sat, 9 Apr 2016 03:50:16 +1000, Chris Angelico wrote: > > On Sat, Apr 9, 2016 at 3:44 AM, Marko Rauhamaa wrote: > [snip] > >> (As for ligatures, I understand that there might be quite a bit of > >> legacy software that dedicated code points and code pages for ligatures. > >> Translating that legacy software to Unicode was made more > >> straightforward by introducing analogous codepoints to Unicode. Unicode > >> has quite many such codepoints: µ, K, Ω etc.) > > > > More specifically, Unicode solved the problems that *codepages* had > > posed. And one of the principles of its design was that every > > character in every legacy encoding had a direct representation as a > > Unicode codepoint, allowing bidirectional transcoding for > > compatibility. Perhaps if Unicode had existed from the dawn of > > computing, we'd have less characters; but backward compatibility is > > way too important to let a narrow purity argument sway it. > > I guess with that historical perspective the current situation > seems almost inevitable. Thanks. And thanks to Steven D'Aprano > for other relevant insights. Strange view. In fact the unicode standard itself encourages not using the standard in its entirety: 5.12 Deprecation In the Unicode Standard, the term deprecation is used somewhat differently than it is in some other standards. Deprecation is used to mean that a character or other feature is strongly discouraged from use. This should not, however, be taken as indicating that anything has been removed from the standard, nor that anything is planned for removal from the standard. Any such change is constrained by the Unicode Consortium Stability Policies [Stability]. For the Unicode Character Database, there are two important types of deprecation to be noted. First, an encoded character may be deprecated. Second, a character property may be deprecated. 
When an encoded character is strongly discouraged from use, it is given the property value Deprecated=True. The Deprecated property is a binary property defined specifically to carry this information about Unicode characters. Very few characters are ever formally deprecated this way; it is not enough that a character be uncommon, obsolete, disliked, or not preferred. Only those few characters which have been determined by the UTC to have serious architectural defects or which have been determined to cause significant implementation problems are ever deprecated. Even in the most severe cases, such as the deprecated format control characters (U+206A..U+206F), an encoded character is never removed from the standard. Furthermore, although deprecated characters are strongly discouraged from use, and should be avoided in favor of other, more appropriate mechanisms, they may occur in data. Conformant implementations of Unicode processes such as Unicode normalization must handle even deprecated characters correctly. I read this as saying that -- in addition to officially deprecated chars -- there ARE "uncommon, obsolete, disliked, or not preferred" chars which sensible users should avoid using even though unicode as a standard is compelled to keep supporting them. Which translates into:

- python as a language *implementing* unicode (eg in strings) needs to do it completely if it is to be standard compliant
- python as a *user* of unicode (eg in identifiers) can (and IMHO should) use better judgement

-- https://mail.python.org/mailman/listinfo/python-list
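The quoted requirement — that conformant processes must still handle deprecated characters — is visible in Python itself. A small sketch (note: Python's unicodedata module does not expose the Deprecated property, so this only shows that the deprecated format controls remain ordinary, fully supported code points):

```python
import unicodedata

ch = "\u206a"  # one of the deprecated format controls U+206A..U+206F

# Still a perfectly ordinary code point: it has a name and a category...
assert unicodedata.category(ch) == "Cf"          # a format control

# ...and normalization must (and does) pass it through unchanged.
assert unicodedata.normalize("NFC", ch) == ch
```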
Re: Unicode normalisation [was Re: [beginner] What's wrong?]
On Friday, April 8, 2016 at 11:14:21 PM UTC+5:30, Marko Rauhamaa wrote: > Peter Pearson : > > > On Fri, 08 Apr 2016 16:00:10 +1000, Steven D'Aprano wrote: > >> They are not, and never have been, in the typesetting business. > >> Perhaps characters are not the only things easily confused *wink* > > > > Defining codepoints that deal with appearance but not with meaning is > > going into the typesetting business. Examples: ligatures, and spaces > > of varying widths with specific typesetting properties like being > > non-breaking. > > > > Typesetting done in MS Word using such Unicode codepoints will never > > be more than a goofy approximation to real typesetting (e.g., TeX), > > but it will cost a huge amount of everybody's time, with the current > > discussion of ligatures in variable names being just a straw in the > > wind. Getting all the world's writing systems into a single, coherent > > standard was an extraordinarily ambitious, monumental undertaking, and > > I'm baffled that the urge to broaden its scope in this irrelevant > > direction was entertained at all. > > I agree completely but at the same time have a lot of understanding for > the reasons why Unicode had to become such a mess. Part of it is > historical, part of it is political, yet part of it is in the > unavoidable messiness of trying to define what a character is. There are standards and standards. Just because they are standard does not make them useful, well-designed, reasonable etc. It's reasonably likely that all our keyboards start QWERT... Doesn't make it a sane design. Likewise using NFKC to define the equivalence relation on identifiers is analogous to saying: Since QWERTY has been in use for over a hundred years it's a perfectly good design. Just because NFKC has the stamp of the unicode consortium does not straightaway make it useful for all purposes -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode normalisation [was Re: [beginner] What's wrong?]
On Sat, 9 Apr 2016 03:50:16 +1000, Chris Angelico wrote: > On Sat, Apr 9, 2016 at 3:44 AM, Marko Rauhamaa wrote: [snip] >> (As for ligatures, I understand that there might be quite a bit of >> legacy software that dedicated code points and code pages for ligatures. >> Translating that legacy software to Unicode was made more >> straightforward by introducing analogous codepoints to Unicode. Unicode >> has quite many such codepoints: µ, K, Ω etc.) > > More specifically, Unicode solved the problems that *codepages* had > posed. And one of the principles of its design was that every > character in every legacy encoding had a direct representation as a > Unicode codepoint, allowing bidirectional transcoding for > compatibility. Perhaps if Unicode had existed from the dawn of > computing, we'd have less characters; but backward compatibility is > way too important to let a narrow purity argument sway it. I guess with that historical perspective the current situation seems almost inevitable. Thanks. And thanks to Steven D'Aprano for other relevant insights. -- To email me, substitute nowhere->runbox, invalid->com. -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode normalisation [was Re: [beginner] What's wrong?]
On Friday, April 8, 2016 at 10:24:17 AM UTC+5:30, Chris Angelico wrote: > On Fri, Apr 8, 2016 at 2:43 PM, Rustom Mody wrote: > > No I am not clever/criminal enough to know how to write a text that is > > visually > > close to > > print "Hello World" > > but is internally closer to > > rm -rf / > > > > For me this: > > >>> Α = 1 > A = 2 > Α + 1 == A > > True > > > > > > > is cure enough that I am not amused > > To me, the above is a contrived example. And you can contrive examples > that are just as confusing while still being ASCII-only, like > swimmer/swirnmer in many fonts, or I and l, or any number of other > visually-confusing glyphs. I propose that we ban the letters 'r' and > 'l' from identifiers, to ensure that people can't mess with > themselves. swirnmer and swimmer are distinguished by squinting a bit; А and A only by digging down into the hex. If you categorize them as similar/same... well I am not arguing... will come to you when I am short of straw... > > > Specifically as far as I am concerned if python were to throw back say > > a ligature in an identifier as a syntax error -- exactly what python2 does > > -- > > I think it would be perfectly fine and a more sane choice > > The ligature is handled straight-forwardly: it gets decomposed into > its component letters. I'm not seeing a problem here. Yes... there is no problem... HERE [I did say python gets this right that haskell for example gets wrong] What's wrong is the whole approach of swallowing gobs of characters that need not be legal at all and then getting indigestion: Note the "non-normative" in https://docs.python.org/3/reference/lexical_analysis.html#identifiers If a language reference is not normative what is? -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode normalisation [was Re: [beginner] What's wrong?]
On Sat, Apr 9, 2016 at 3:44 AM, Marko Rauhamaa wrote: > Unicode heroically and definitively solved the problems ASCII had posed > but introduced a bag of new, trickier problems. > > (As for ligatures, I understand that there might be quite a bit of > legacy software that dedicated code points and code pages for ligatures. > Translating that legacy software to Unicode was made more > straightforward by introducing analogous codepoints to Unicode. Unicode > has quite many such codepoints: µ, K, Ω etc.) More specifically, Unicode solved the problems that *codepages* had posed. And one of the principles of its design was that every character in every legacy encoding had a direct representation as a Unicode codepoint, allowing bidirectional transcoding for compatibility. Perhaps if Unicode had existed from the dawn of computing, we'd have less characters; but backward compatibility is way too important to let a narrow purity argument sway it. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode normalisation [was Re: [beginner] What's wrong?]
Peter Pearson : > On Fri, 08 Apr 2016 16:00:10 +1000, Steven D'Aprano > wrote: >> They are not, and never have been, in the typesetting business. >> Perhaps characters are not the only things easily confused *wink* > > Defining codepoints that deal with appearance but not with meaning is > going into the typesetting business. Examples: ligatures, and spaces > of varying widths with specific typesetting properties like being > non-breaking. > > Typesetting done in MS Word using such Unicode codepoints will never > be more than a goofy approximation to real typesetting (e.g., TeX), > but it will cost a huge amount of everybody's time, with the current > discussion of ligatures in variable names being just a straw in the > wind. Getting all the world's writing systems into a single, coherent > standard was an extraordinarily ambitious, monumental undertaking, and > I'm baffled that the urge to broaden its scope in this irrelevant > direction was entertained at all. I agree completely but at the same time have a lot of understanding for the reasons why Unicode had to become such a mess. Part of it is historical, part of it is political, yet part of it is in the unavoidable messiness of trying to define what a character is. For example, is "ä" one character or two: "a" plus "¨"? Is "i" one character or two: "ı" plus "˙"? Is writing linear or two-dimensional? Unicode heroically and definitively solved the problems ASCII had posed but introduced a bag of new, trickier problems. (As for ligatures, I understand that there might be quite a bit of legacy software that dedicated code points and code pages for ligatures. Translating that legacy software to Unicode was made more straightforward by introducing analogous codepoints to Unicode. Unicode has quite many such codepoints: µ, K, Ω etc.) Marko -- https://mail.python.org/mailman/listinfo/python-list
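The "ä"/"ı" examples above can be poked at with Python's unicodedata; a quick sketch:

```python
import unicodedata

# "ä" genuinely is both one character and two, depending on the form:
assert unicodedata.normalize("NFC", "a\u0308") == "\u00e4"   # a + combining ¨
assert len(unicodedata.normalize("NFD", "\u00e4")) == 2      # and back apart

# But dotless "ı" + combining dot above has no canonical composition to "i",
# so NFC leaves the two code points alone -- the messiness Marko describes.
assert unicodedata.normalize("NFC", "\u0131\u0307") == "\u0131\u0307"
```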
Re: Unicode normalisation [was Re: [beginner] What's wrong?]
On Fri, 08 Apr 2016 16:00:10 +1000, Steven D'Aprano wrote: > On Fri, 8 Apr 2016 02:51 am, Peter Pearson wrote: >> >> The Unicode consortium was certifiably insane when it went into the >> typesetting business. > > They are not, and never have been, in the typesetting business. Perhaps > characters are not the only things easily confused *wink* Defining codepoints that deal with appearance but not with meaning is going into the typesetting business. Examples: ligatures, and spaces of varying widths with specific typesetting properties like being non-breaking. Typesetting done in MS Word using such Unicode codepoints will never be more than a goofy approximation to real typesetting (e.g., TeX), but it will cost a huge amount of everybody's time, with the current discussion of ligatures in variable names being just a straw in the wind. Getting all the world's writing systems into a single, coherent standard was an extraordinarily ambitious, monumental undertaking, and I'm baffled that the urge to broaden its scope in this irrelevant direction was entertained at all. (Should this have been in cranky-geezer font?) -- To email me, substitute nowhere->runbox, invalid->com. -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode normalisation [was Re: [beginner] What's wrong?]
On Fri, Apr 8, 2016 at 4:00 PM, Steven D'Aprano wrote: > Or for that matter: > > a = akjhvciwfdwkejfc2qweoduycwldvqspjcwuhoqwe9fhlcjbqvcbhsiauy37wkg() + 100 > b = 100 + akjhvciwfdwkejfc2qweoduycwldvqspjcwuhoqew9fhlcjbqvcbhsiauy37wkg() > > How easily can you tell them apart at a glance? Ouch! Can't even align them top and bottom. This is evil. > I think that, beyond normalisation, the compiler need not be too concerned > by confusables. I wouldn't *object* to the compiler raising a warning if it > detected confusable identifiers, or mixed script identifiers, but I think > that's more the job for a linter or human code review. The compiler should treat as identical anything that an editor should reasonably treat as identical. I'm not sure whether multiple combining characters on a single base character are forced into some order prior to comparison or are kept in the order they were typed, but my gut feeling is that they should be considered identical. > They are not, and never have been, in the typesetting business. Perhaps > characters are not the only things easily confused *wink* Peter is definitely a character. So are you. QUITE a character. :) > But really, why should we object? Is "pile-of-poo" any more silly than any > of the other dingbats, graphics characters, and other non-alphabetical > characters? Unicode is not just for "letters of the alphabet". It's less silly than "ZERO-WIDTH NON-BREAKING SPACE", which isn't a space at all, it's a joiner. Go figure. (History's a wonderful thing, ain't it? So's backward compatibility and a guarantee that names will never be changed.) ChrisA -- https://mail.python.org/mailman/listinfo/python-list
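The ZWNBSP aside above checks out against the character database; a small sketch:

```python
import unicodedata

# U+FEFF: named a "space", but classified as a format control, not a space.
assert unicodedata.name("\ufeff") == "ZERO WIDTH NO-BREAK SPACE"
assert unicodedata.category("\ufeff") == "Cf"   # format control (joiner/BOM)
assert unicodedata.category("\u00a0") == "Zs"   # NBSP, by contrast, is a space
```

The name can never be corrected because of the consortium's name-stability guarantee; U+2060 WORD JOINER is the recommended replacement for the joiner use.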
Re: Unicode normalisation [was Re: [beginner] What's wrong?]
On Fri, 8 Apr 2016 02:51 am, Peter Pearson wrote: > Seriously, it's cute how neatly normalisation works when you're > watching closely and using it in the circumstances for which it was > intended, but that hardly proves that these practices won't cause much > trouble when they're used more casually and nobody's watching closely. > Considering how much energy good software engineers spend eschewing > unnecessary complexity, Maybe so, but it's not good software engineers we have to worry about, but the other 99.9% :-) > do we really want to embrace the prospect of > having different things look identical? You mean like ASCII identifiers? I'm afraid it's about fifty years too late to ban identifiers using O and 0, or l, I and 1, or rn and m. Or for that matter: a = akjhvciwfdwkejfc2qweoduycwldvqspjcwuhoqwe9fhlcjbqvcbhsiauy37wkg() + 100 b = 100 + akjhvciwfdwkejfc2qweoduycwldvqspjcwuhoqew9fhlcjbqvcbhsiauy37wkg() How easily can you tell them apart at a glance? The reality is that we trust our coders not to deliberately mess us about. As the Obfuscated C and the Underhanded C contest prove, you don't need Unicode to hide hostile code. In fact, the use of Unicode confusables in an otherwise all-ASCII file is a dead giveaway that something fishy is going on. I think that, beyond normalisation, the compiler need not be too concerned by confusables. I wouldn't *object* to the compiler raising a warning if it detected confusable identifiers, or mixed script identifiers, but I think that's more the job for a linter or human code review. > (A relevant reference point: > mixtures of spaces and tabs in Python indentation.) Most editors have an option to display whitespace, and tabs and spaces look different. Typically the tab is shown with an arrow, and the space by a dot. If people *still* confuse them, the issue is easily managed by a combination of "well don't do that" and TabError. > [snip] >> The Unicode consortium seems to disagree with you. 
> > > > The Unicode consortium was certifiably insane when it went into the > typesetting business. They are not, and never have been, in the typesetting business. Perhaps characters are not the only things easily confused *wink* (Although some members of the consortium may be. But the consortium itself isn't.) > The pile-of-poo character was just frosting on > the cake. Blame the Japanese mobile phone companies for that. When you pay your membership fee, you get to object to the addition of characters too. (Anyone, I think, can propose a new character, but only members get to choose which proposals are accepted.) But really, why should we object? Is "pile-of-poo" any more silly than any of the other dingbats, graphics characters, and other non-alphabetical characters? Unicode is not just for "letters of the alphabet". -- Steven -- https://mail.python.org/mailman/listinfo/python-list
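The sort of mixed-script check suggested above for a linter could start as crudely as this. A heuristic sketch only: it approximates a character's script by the first word of its Unicode name, since the stdlib does not expose the real Script property, and the identifiers are made up:

```python
import unicodedata

def scripts(identifier):
    """Approximate set of scripts used in an identifier, via character names."""
    return {unicodedata.name(ch).split()[0] for ch in identifier if ch.isalpha()}

assert scripts("apple") == {"LATIN"}
# "Аpple" spelled with CYRILLIC CAPITAL LETTER A is a mixed-script confusable:
assert scripts("\u0410pple") == {"CYRILLIC", "LATIN"}
```

A real tool would use the UTS #39 confusables data rather than name prefixes, but even this catches the Α/A and А/A examples elsewhere in the thread.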
Re: Unicode normalisation [was Re: [beginner] What's wrong?]
On Fri, Apr 8, 2016 at 2:43 PM, Rustom Mody wrote: > No I am not clever/criminal enough to know how to write a text that is > visually > close to > print "Hello World" > but is internally closer to > rm -rf / > > For me this: > >>> Α = 1 > >>> A = 2 > >>> Α + 1 == A > True > > > is cure enough that I am not amused To me, the above is a contrived example. And you can contrive examples that are just as confusing while still being ASCII-only, like swimmer/swirnmer in many fonts, or I and l, or any number of other visually-confusing glyphs. I propose that we ban the letters 'r' and 'l' from identifiers, to ensure that people can't mess with themselves. > Specifically as far as I am concerned if python were to throw back say > a ligature in an identifier as a syntax error -- exactly what python2 does -- > I think it would be perfectly fine and a more sane choice The ligature is handled straight-forwardly: it gets decomposed into its component letters. I'm not seeing a problem here. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
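The decomposition Chris describes is easy to confirm: CPython runs identifiers through NFKC at compile time, so a name spelled with the ligature and the plain spelling are the same variable. A small sketch:

```python
import unicodedata

src = "\ufb02ag = 1"   # an assignment to 'ﬂag', spelled with U+FB02
assert unicodedata.normalize("NFKC", "\ufb02ag") == "flag"

ns = {}
exec(src, ns)
assert ns["flag"] == 1   # the compiler stored it under the decomposed name
```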
Re: Unicode normalisation [was Re: [beginner] What's wrong?]
On Friday, April 8, 2016 at 10:13:16 AM UTC+5:30, Rustom Mody wrote: > No I am not clever/criminal enough to know how to write a text that is > visually > close to > print "Hello World" > but is internally closer to > rm -rf / > > For me this: > >>> Α = 1 > >>> A = 2 > >>> Α + 1 == A > True > >>> > > > is cure enough that I am not amused Um... "cute" was the intention [Or is it cuʇe ?] -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode normalisation [was Re: [beginner] What's wrong?]
On Thursday, April 7, 2016 at 10:22:18 PM UTC+5:30, Peter Pearson wrote: > On Thu, 07 Apr 2016 11:37:50 +1000, Steven D'Aprano wrote: > > On Thu, 7 Apr 2016 05:56 am, Thomas 'PointedEars' Lahn wrote: > >> Rustom Mody wrote: > > > >>> So here are some examples to illustrate what I am saying: > >>> > >>> Example 1 -- Ligatures: > >>> > >>> Python3 gets it right > >> flag = 1 > >> flag > >>> 1 > [snip] > >> > >> I do not think this is correct, though. Different Unicode code sequences, > >> after normalization, should result in different symbols. > > > > I think you are confused about normalisation. By definition, normalising > > different Unicode code sequences may result in the same symbols, since that > > is what normalisation means. > > > > Consider two distinct strings which nevertheless look identical: > > > > py> a = "\N{LATIN SMALL LETTER U}\N{COMBINING DIAERESIS}" > > py> b = "\N{LATIN SMALL LETTER U WITH DIAERESIS}" > > py> a == b > > False > > py> print(a, b) > > ü ü > > > > > > The purpose of normalisation is to turn one into the other: > > > > py> unicodedata.normalize('NFKC', a) == b # compose 2 code points --> 1 > > True > > py> unicodedata.normalize('NFKD', b) == a # decompose 1 code point --> 2 > > True > > It's all great fun until someone loses an eye. > > Seriously, it's cute how neatly normalisation works when you're > watching closely and using it in the circumstances for which it was > intended, but that hardly proves that these practices won't cause much > trouble when they're used more casually and nobody's watching closely. > Considering how much energy good software engineers spend eschewing > unnecessary complexity, do we really want to embrace the prospect of > having different things look identical? (A relevant reference point: > mixtures of spaces and tabs in Python indentation.) That kind of sums up my position. 
To be a casual user of unicode is one thing
To support it is another -- unicode strings in python3 -- ok so far
To mix up these two is a third without enough thought or consideration -- unicode identifiers is likely a security hole waiting to happen...

No I am not clever/criminal enough to know how to write a text that is visually close to
print "Hello World"
but is internally closer to
rm -rf /

For me this:
>>> Α = 1
>>> A = 2
>>> Α + 1 == A
True
>>>
is cure enough that I am not amused

[The only reason I brought up case distinction is that this is in the same direction and way worse than that] If python had been more serious about embracing the brave new world of unicode it should have looked in this direction: http://blog.languager.org/2014/04/unicoded-python.html Also here I suggest a classification of unicode, that, while not official or even formalizable is (I believe) helpful http://blog.languager.org/2015/03/whimsical-unicode.html Specifically as far as I am concerned if python were to throw back say a ligature in an identifier as a syntax error -- exactly what python2 does -- I think it would be perfectly fine and a more sane choice -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode normalisation [was Re: [beginner] What's wrong?]
On Fri, Apr 8, 2016 at 2:51 AM, Peter Pearson wrote: > The pile-of-poo character was just frosting on > the cake. > > (Sorry to leave you with that image.) No. You're not even a little bit sorry. You're an evil, evil man. And funny. ChrisA who knows that its codepoint is 1F4A9 without looking it up -- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode normalisation [was Re: [beginner] What's wrong?]
On Thu, 07 Apr 2016 11:37:50 +1000, Steven D'Aprano wrote: > On Thu, 7 Apr 2016 05:56 am, Thomas 'PointedEars' Lahn wrote: >> Rustom Mody wrote: > >>> So here are some examples to illustrate what I am saying: >>> >>> Example 1 -- Ligatures: >>> >>> Python3 gets it right >> flag = 1 >> flag >>> 1 [snip] >> >> I do not think this is correct, though. Different Unicode code sequences, >> after normalization, should result in different symbols. > > I think you are confused about normalisation. By definition, normalising > different Unicode code sequences may result in the same symbols, since that > is what normalisation means. > > Consider two distinct strings which nevertheless look identical: > > py> a = "\N{LATIN SMALL LETTER U}\N{COMBINING DIAERESIS}" > py> b = "\N{LATIN SMALL LETTER U WITH DIAERESIS}" > py> a == b > False > py> print(a, b) > ü ü > > > The purpose of normalisation is to turn one into the other: > > py> unicodedata.normalize('NFKC', a) == b # compose 2 code points --> 1 > True > py> unicodedata.normalize('NFKD', b) == a # decompose 1 code point --> 2 > True It's all great fun until someone loses an eye. Seriously, it's cute how neatly normalisation works when you're watching closely and using it in the circumstances for which it was intended, but that hardly proves that these practices won't cause much trouble when they're used more casually and nobody's watching closely. Considering how much energy good software engineers spend eschewing unnecessary complexity, do we really want to embrace the prospect of having different things look identical? (A relevant reference point: mixtures of spaces and tabs in Python indentation.) [snip] > The Unicode consortium seems to disagree with you. The Unicode consortium was certifiably insane when it went into the typesetting business. The pile-of-poo character was just frosting on the cake. (Sorry to leave you with that image.) -- To email me, substitute nowhere->runbox, invalid->com. 
-- https://mail.python.org/mailman/listinfo/python-list
Re: Unicode normalisation [was Re: [beginner] What's wrong?]
Steven D'Aprano:

> So even in English, capitalisation can make a semantic difference.

It can even make a pronunciation difference: polish vs Polish.

Marko
--
https://mail.python.org/mailman/listinfo/python-list
Unicode normalisation [was Re: [beginner] What's wrong?]
On Thu, 7 Apr 2016 05:56 am, Thomas 'PointedEars' Lahn wrote:

> Rustom Mody wrote:
>> So here are some examples to illustrate what I am saying:
>>
>> Example 1 -- Ligatures:
>>
>> Python3 gets it right
>>
>> >>> ﬂag = 1
>> >>> ﬂag
>> 1

Python identifiers are intentionally normalised to reduce security
issues, or at least confusion and annoyance, due to visually-identical
identifiers being treated as different.

Unicode has technical standards dealing with identifiers:

http://www.unicode.org/reports/tr31/

and visual spoofing and confusables:

http://www.unicode.org/reports/tr39/

I don't believe that CPython goes to the full extreme of checking for
mixed-script confusables, but it does partially mitigate the problem by
normalising identifiers.

Unfortunately PEP 3131 leaves a number of questions open. Presumably
they were answered in the implementation, but they aren't documented in
the PEP.

https://www.python.org/dev/peps/pep-3131/

> Fascinating; confirmed with
>
> | $ python3
> | Python 3.4.4 (default, Jan 5 2016, 15:35:18)
> | [GCC 5.3.1 20160101] on linux
> | […]
>
> I do not think this is correct, though. Different Unicode code
> sequences, after normalization, should result in different symbols.

I think you are confused about normalisation. By definition, normalising
different Unicode code sequences may result in the same symbols, since
that is what normalisation means.
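[Editor's note: the identifier normalisation described above is easy to
check directly. PEP 3131 specifies that identifiers are converted to NFKC
while parsing, so a name written with the 'ﬂ' ligature ends up bound to
the plain ASCII spelling; the exec/namespace trick below is just a way to
observe that from running code.]

```python
import unicodedata

ligature_name = "\N{LATIN SMALL LIGATURE FL}ag"   # 'ﬂag', 4 code points

# The tokenizer applies NFKC to identifiers, collapsing 'ﬂ' to 'f' + 'l'.
assert unicodedata.normalize("NFKC", ligature_name) == "flag"

# Assign through the ligature spelling; the stored name is the NFKC form.
namespace = {}
exec(ligature_name + " = 1", namespace)
print(namespace["flag"])  # prints 1
```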
Consider two distinct strings which nevertheless look identical:

py> a = "\N{LATIN SMALL LETTER U}\N{COMBINING DIAERESIS}"
py> b = "\N{LATIN SMALL LETTER U WITH DIAERESIS}"
py> a == b
False
py> print(a, b)
ü ü

The purpose of normalisation is to turn one into the other:

py> unicodedata.normalize('NFKC', a) == b  # compose 2 code points --> 1
True
py> unicodedata.normalize('NFKD', b) == a  # decompose 1 code point --> 2
True

In the case of the ﬂ ligature, normalisation splits the ligature into
individual 'f' and 'l' code points regardless of whether you compose or
decompose:

py> unicodedata.normalize('NFKC', "ﬂag") == "flag"
True
py> unicodedata.normalize('NFKD', "ﬂag") == "flag"
True

That's using the compatibility composition forms. Using the default
(canonical) forms leaves the ligature unchanged.

Note that UTS #39 (security mechanisms) suggests that identifiers should
be normalised using NFKC.

[...]

> I think Haskell gets it right here, while Py3k does not. The “ﬂ” is not
> to be decomposed to “fl”.

The Unicode consortium seems to disagree with you. Table 1 of UTS #39
(see link above) includes "Characters that cannot occur in strings
normalized to NFKC" in the Restricted category, that is, characters
which should not be used in identifiers. ﬂ cannot occur in such
normalised strings, and so it is classified as Restricted and should not
be used in identifiers.

I'm not entirely sure just how closely Python's identifiers follow the
standard, but I think that the intention is to follow something close to
"UAX31-R4. Equivalent Normalized Identifiers":

http://www.unicode.org/reports/tr31/#R4

[Rustom]
>> Python gets it wrong
>>
>> >>> a=1
>> >>> A
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in <module>
>> NameError: name 'A' is not defined
>
> This is not wrong; it is just different.

I agree with Thomas here. Case-insensitivity is a choice, and I don't
think it is a good choice for programming identifiers.
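[Editor's note: the canonical-vs-compatibility distinction drawn above is
worth seeing side by side. A minimal sketch, using only the standard
library: the canonical forms (NFC/NFD) leave the ﬂ ligature alone, while
the compatibility forms (NFKC/NFKD) split it into 'f' + 'l'.]

```python
import unicodedata

lig = "\N{LATIN SMALL LIGATURE FL}ag"   # 'ﬂag'

# Canonical normalisation: U+FB02 has no canonical decomposition,
# so the ligature survives both composition and decomposition.
assert unicodedata.normalize("NFC", lig) == lig
assert unicodedata.normalize("NFD", lig) == lig

# Compatibility normalisation: the K forms apply the compatibility
# decomposition 'ﬂ' --> 'f' + 'l', in either direction.
assert unicodedata.normalize("NFKC", lig) == "flag"
assert unicodedata.normalize("NFKD", lig) == "flag"
```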
Being able to make case distinctions between (let's say):

SPAM  # a constant, or at least constant-by-convention
Spam  # a class or type
spam  # an instance

is useful.

[Rustom]
>> With ASCII the problems are minor: Case-distinct identifiers are
>> distinct -- they dont IDENTIFY.
>
> I do not think this is a problem.
>
>> This contradicts standard English usage and practice
>
> No, it does not.

I agree with Thomas here too. Although it is rare for case to make a
distinction in English, it does happen. As the old joke goes:

    Capitalisation is the difference between helping my Uncle Jack off a
    horse, and helping my uncle jack off a horse.

So even in English, capitalisation can make a semantic difference.

--
Steven
--
https://mail.python.org/mailman/listinfo/python-list
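[Editor's note: a minimal sketch of the three-way convention described
above; the names SPAM/Spam/spam are illustrative, and only case-sensitive
name binding (which Python guarantees) makes them coexist.]

```python
SPAM = 255             # constant-by-convention


class Spam:            # a class
    pass


spam = Spam()          # an instance

# All three names are distinct bindings; none shadows another.
assert isinstance(spam, Spam)
assert SPAM == 255
```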