Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-19 Thread Steven D'Aprano
On Thu, 19 Jul 2018 20:34:26 +0200, Christian Gollwitzer wrote:

> Am 19.07.2018 um 14:50 schrieb Gregory Ewing:
>> Chris Angelico wrote:
>>> On Thu, Jul 19, 2018 at 4:41 PM, Gregory Ewing
>>>  wrote:
>>>
>>>> (Google doesn't seem to think so -- it asks me whether I meant
>>>> "assist shop". Although it does offer to translate it into Czech...)
>>>
>>> Into or from?? I'm thoroughly confused now!
>>
>> Hard to tell. This is what the link said:
>>
>> assistshop - Czech translation - bab.la English-Czech dictionary
>> https://en.bab.la/dictionary/english-czech/assistshop Translation for
>> 'assistshop' in the free English-Czech dictionary and many other Czech
>> translations.
> 
> Well that link tries to translate "assistshop" into the czech word
> "prodavač" which is the usual word for a person in a shop who consults
> the customers and sells the goods to them; I don't know if "assist shop"
> in English comes close, as I don't understand it (I'm a native German
> speaker)

In English, that would be "shop assistant". "Assist shop" would be 
grammatically incorrect: it should be written as "assist the shop", 
meaning "help the shop".


Relevant:

https://www.theatlantic.com/technology/archive/2018/01/the-shallowness-of-google-translate/551570/




-- 
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing
it everywhere." -- Jon Ronson

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-19 Thread Christian Gollwitzer

Am 19.07.2018 um 14:50 schrieb Gregory Ewing:

Chris Angelico wrote:

On Thu, Jul 19, 2018 at 4:41 PM, Gregory Ewing
 wrote:


(Google doesn't seem to think so -- it asks me whether
I meant "assist shop". Although it does offer to translate
it into Czech...)


Into or from?? I'm thoroughly confused now!


Hard to tell. This is what the link said:

assistshop - Czech translation - bab.la English-Czech dictionary
https://en.bab.la/dictionary/english-czech/assistshop
Translation for 'assistshop' in the free English-Czech dictionary and
many other Czech translations.


Well that link tries to translate "assistshop" into the czech word 
"prodavač" which is the usual word for a person in a shop who consults 
the customers and sells the goods to them; I don't know if "assist shop" 
in English comes close, as I don't understand it (I'm a native German 
speaker)


Christian





Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-19 Thread Gregory Ewing

Chris Angelico wrote:

On Thu, Jul 19, 2018 at 4:41 PM, Gregory Ewing
 wrote:


(Google doesn't seem to think so -- it asks me whether
I meant "assist shop". Although it does offer to translate
it into Czech...)


Into or from?? I'm thoroughly confused now!


Hard to tell. This is what the link said:

assistshop - Czech translation - bab.la English-Czech dictionary
https://en.bab.la/dictionary/english-czech/assistshop
Translation for 'assistshop' in the free English-Czech dictionary and many other 
Czech translations.


--
Greg


Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-19 Thread Abdur-Rahmaan Janhangeer
it's also thoroughly time to give this thread a well deserved rest

RIP

Abdur-Rahmaan Janhangeer
https://github.com/Abdur-rahmaanJ

Into or from?? I'm thoroughly confused now!
>
> ChrisA
> --
> https://mail.python.org/mailman/listinfo/python-list
>


Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-19 Thread Chris Angelico
On Thu, Jul 19, 2018 at 4:41 PM, Gregory Ewing
 wrote:
> Stefan Ram wrote:
>>
>>   »assistshop«,
>
>
> Is that a word?
>
> (Google doesn't seem to think so -- it asks me whether
> I meant "assist shop". Although it does offer to translate
> it into Czech...)
>

Into or from?? I'm thoroughly confused now!

ChrisA


Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-19 Thread Gregory Ewing

Stefan Ram wrote:

  »assistshop«,


Is that a word?

(Google doesn't seem to think so -- it asks me whether
I meant "assist shop". Although it does offer to translate
it into Czech...)

--
Greg


Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-19 Thread Gregory Ewing

Stefan Ram wrote:

Gregory Ewing  writes:



That's debatable. I've never thought of it that way and I'm
fairly certain I don't pronounce it that way. My tongue does
not do the same thing when I say "ch" as it does when I
say "tsh".


archives   ˈɑɚ kɑɪvz (n)
bachelor   ˈbæʧ lɚ (n)
machine    məˈʃin
cash       kæʃ
dachshund  ˈdɑks ˌhʊnt


I'm talking specifically about the "ch" sound in
"bachelor", "change", etc. It sounds and feels like
a single sound to me.

--
Greg


Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-18 Thread Gregory Ewing

MRAB wrote:
"ch" usually represents 2 phonemes, basically the sounds of "t" followed 
by "sh";


That's debatable. I've never thought of it that way and I'm
fairly certain I don't pronounce it that way. My tongue does
not do the same thing when I say "ch" as it does when I
say "tsh".

--
Greg


Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-18 Thread Antoon Pardon
On 18-07-18 10:07, Marko Rauhamaa wrote:
>> Sure there were some surprises or gotcha's, but the result was still
>> better than doing it in python2 and they were easier to deal with than
>> in python2.
> BTW, in those needs, even Python2 has Unicode strings and unicodedata at
> your disposal.

Sure, just as there are byte strings at your disposal in python3.

I also don't think using u'...' in python2 is less ugly than using b'...'
in python3.

-- 
Antoon.



Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-18 Thread Marko Rauhamaa
Antoon Pardon :

> On 17-07-18 14:22, Marko Rauhamaa wrote:
>> If you assume that NFC normalizes every letter to a single codepoint
>> (and carefully use NFC everywhere), you are right. But equally likely
>> you may inadvertently be setting yourself up for a surprise.
>
> You are moving the goal post. I didn't claim there were no surprises.
> I only claim that in the end combining regular expressions and working
> with multiple languages ended up being far easier with python3 strings
> than with python2 strings.

Fair enough.

> Sure there were some surprises or gotcha's, but the result was still
> better than doing it in python2 and they were easier to deal with than
> in python2.

BTW, in those needs, even Python2 has Unicode strings and unicodedata at
your disposal.


Marko


Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-18 Thread Antoon Pardon
On 17-07-18 14:22, Marko Rauhamaa wrote:
> Antoon Pardon :
>
>> On 17-07-18 10:27, Marko Rauhamaa wrote:
>>> Also, Python2's strings do as good a job at delivering codepoints as
>>> Python3.
>> No they don't. The programs that I work on, need to be able to treat
>> at least german, french, dutch and english text. My experience is that
>> in python3 it is way easier to do things right. Especially if you are
>> working with regular expressions.
> If you assume that NFC normalizes every letter to a single codepoint
> (and carefully use NFC everywhere), you are right. But equally likely
> you may inadvertently be setting yourself up for a surprise.

You are moving the goal post. I didn't claim there were no surprises. I only
claim that in the end combining regular expressions and working with multiple
languages ended up being far easier with python3 strings than with python2
strings.

Sure there were some surprises or gotcha's, but the result was still better than
doing it in python2 and they were easier to deal with than in python2.

-- 
Antoon.



Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-17 Thread Mark Lawrence

On 17/07/18 19:16, Marko Rauhamaa wrote:

MRAB :

"ch" usually represents 2 phonemes, basically the sounds of "t"
followed by "sh";


Traditionally, that sound is considered a single phoneme:

<https://en.wikipedia.org/wiki/Affricate_consonant>

Can you hear the difference in these expressions:

high chairs

height shares

height chairs

Try them on an English-speaking person. In a restaurant, ask for a
"height share" and see if they bring you a high chair.

The English "tr" sound can also be considered a single affricate
phoneme:

<https://en.wikipedia.org/wiki/Voiceless_postalveolar_affricate>

Is there a difference between these expressions:

rye train

right rain

right train


Marko



I do not see what this has to do with the Python programming language, 
neither do I care.  Please take this offline, as you've already been 
asked to do by a moderator, Tim Golden.


--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence



Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-17 Thread Rhodri James

On 17/07/18 19:16, Marko Rauhamaa wrote:

MRAB :

"ch" usually represents 2 phonemes, basically the sounds of "t"
followed by "sh";


Traditionally, that sound is considered a single phoneme:

<https://en.wikipedia.org/wiki/Affricate_consonant>


To quote the introduction of that article, "It is often difficult to 
decide if a stop and fricative form a single phoneme or a consonant 
pair."  I'm afraid your bold assertion is more than a bit arguable.



Can you hear the difference in these expressions:

high chairs

height shares

height chairs


Yes, but then I'm a trained singer.


Try them on an English-speaking person. In a restaurant, ask for a
"height share" and see if they bring you a high chair.


That's a different effect.  Listeners will often subconsciously make 
small "corrections" to what they hear to bring it into context.  It is 
particularly noticeable in experiments where one person repeats what 
another says while they are still speaking -- effectively simultaneous 
translation without the translation part :-)  The person repeating will 
correct small mistakes in what was originally said without ever noticing 
the error.  (Google is being annoying and not supplying me with the 
information, but I know there have been papers on this.)




The English "tr" sound can also be considered a single affricate
phoneme:

<https://en.wikipedia.org/wiki/Voiceless_postalveolar_affricate>

Is there a difference between these expressions:

rye train

right rain

right train


Again, yes.  Very much so this time.

--
Rhodri James *-* Kynesim Ltd


Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-17 Thread Marko Rauhamaa
MRAB :
> "ch" usually represents 2 phonemes, basically the sounds of "t"
> followed by "sh";

Traditionally, that sound is considered a single phoneme:

   <https://en.wikipedia.org/wiki/Affricate_consonant>

Can you hear the difference in these expressions:

   high chairs

   height shares

   height chairs

Try them on an English-speaking person. In a restaurant, ask for a
"height share" and see if they bring you a high chair.

The English "tr" sound can also be considered a single affricate
phoneme:

   <https://en.wikipedia.org/wiki/Voiceless_postalveolar_affricate>

Is there a difference between these expressions:

   rye train

   right rain

   right train


Marko


Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-17 Thread MRAB

On 2018-07-17 03:25, Tim Chase wrote:

On 2018-07-17 01:08, Steven D'Aprano wrote:

In English, I think most people would prefer to use a different
term for whatever "sh" and "ch" represent than "character".


The term you may be reaching for is "consonant cluster"?

https://en.wikipedia.org/wiki/Consonant_cluster


They are digraphs, 2 characters that are treated as a single unit.

As it says in the first paragraph: "a consonant cluster, consonant 
sequence or consonant compound is a group of consonants which have no 
intervening vowel."


"sh" is a single phoneme (sound) that happens to be written in English 
with 2 letters.


"ch" usually represents 2 phonemes, basically the sounds of "t" followed 
by "sh"; other times it's "k" (e.g. in "echo"); occasionally it's "sh" 
(e.g. in "champagne").



Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-17 Thread Marko Rauhamaa
Antoon Pardon :

> On 17-07-18 10:27, Marko Rauhamaa wrote:
>> Also, Python2's strings do as good a job at delivering codepoints as
>> Python3.
>
> No they don't. The programs that I work on, need to be able to treat
> at least german, french, dutch and english text. My experience is that
> in python3 it is way easier to do things right. Especially if you are
> working with regular expressions.

If you assume that NFC normalizes every letter to a single codepoint
(and carefully use NFC everywhere), you are right. But equally likely
you may inadvertently be setting yourself up for a surprise.
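
To make the NFC caveat concrete, here is a minimal sketch, assuming Python 3 and only the standard library. The two sample strings are illustrative and do not come from the thread: one has a precomposed form in Unicode, the other does not.

```python
import unicodedata

# "e" + COMBINING ACUTE ACCENT has a precomposed form (U+00E9),
# so NFC collapses it to a single code point.
e_acute = unicodedata.normalize("NFC", "e\u0301")

# "q" + COMBINING TILDE has no precomposed form, so NFC leaves it as
# two code points even though it is displayed as one letter.
q_tilde = unicodedata.normalize("NFC", "q\u0303")

print(len(e_acute), len(q_tilde))
```

Code that assumes "one letter == one code point after NFC" would handle the first string and be surprised by the second.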


Marko


Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-17 Thread Antoon Pardon
On 17-07-18 10:27, Marko Rauhamaa wrote:
> Steven D'Aprano :
>> On Mon, 16 Jul 2018 21:48:42 -0400, Richard Damon wrote:
>>> Who says there needs to be one. A good engineer will use the
>>> definition that is most appropriate to the task at hand. Some things
>>> need very solid definitions, and some things don’t.
>> Then the problem is solved: we have a perfectly good de facto definition 
>> of character: it is a synonym for "code point", and every single one of 
>> Marko's objections disappears.
> I admit it. Python3 is the perfect medium for your codepoint delivery
> needs.
>
> What you don't seem to understand about my objections is that no
> programmer needs codepoints per se. Also, Python2's strings do as good a
> job at delivering codepoints as Python3.

No they don't. The programs that I work on need to be able to treat at least
German, French, Dutch and English text. My experience is that in python3 it is
way easier to do things right, especially if you are working with regular
expressions.
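
A small sketch of the kind of difference being described, assuming Python 3 and the standard `re` module. The sample text is illustrative, and the bytes pattern is only a rough stand-in for working with undecoded text, not a reproduction of any program mentioned in the thread.

```python
import re

# Illustrative sample with precomposed non-ASCII letters.
text = "Stra\u00dfe caf\u00e9 na\u00efve"

# With a str pattern, \w is Unicode-aware by default in Python 3.
words = re.findall(r"\w+", text)

# With a bytes pattern over the UTF-8 encoding, \w only matches ASCII
# alphanumerics and underscore, so non-ASCII letters split the words.
byte_words = re.findall(rb"\w+", text.encode("utf-8"))

print(words)
print(byte_words)
```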

-- 
Antoon.



Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-17 Thread Richard Damon
> On Jul 17, 2018, at 3:44 AM, Steven D'Aprano 
>  wrote:
> 
> On Mon, 16 Jul 2018 21:48:42 -0400, Richard Damon wrote:
> 
>>> On Jul 16, 2018, at 9:21 PM, Steven D'Aprano
>>>  wrote:
>>> 
>>>> On Mon, 16 Jul 2018 19:02:36 -0400, Richard Damon wrote:
>>>> 
>>>> You are defining a variable/fixed width codepoint set. Many others
>>>> want to deal with CHARACTER sets.
>>> 
>>> Good luck coming up with a universal, objective, language-neutral,
>>> consistent definition for a character.
>>> 
>> Who says there needs to be one. A good engineer will use the definition
>> that is most appropriate to the task at hand. Some things need very
>> solid definitions, and some things don’t.
> 
> Then the problem is solved: we have a perfectly good de facto definition 
> of character: it is a synonym for "code point", and every single one of 
> Marko's objections disappears.
> 
Which is a ‘changed’ definition! Do you agree that the concept of variable 
width encoding vastly predates the creation of Unicode? Can you also find any 
use of the word codepoint that predates the development of Unicode? 
Code points and code words are an invention of the Unicode consortium, and as 
such should really only be used in talking about it and not some other 
encodings. I believe that Unicode also created the idea of storing composed 
characters as a series of codepoints instead of it being done in the input 
routine and the character set needing to define a character code for every 
needed composed character.
> 
>> This goes back to my original point, where I said some people consider
>> UTF-32 as a variable width encoding. For very many things, practically,
>> the ‘codepoint’ isn’t the important thing, 
> 
> Ah, is this another one of those "let's pick a definition that nobody 
> else uses, and state it as a fact" like UTF-32 being variable width?
> 
> If by "very many things", you mean "not very many things", I agree with 
> you. In my experience, dealing with code points is "good enough", 
> especially if you use Western European alphabets, and even more so if 
> you're willing to do a normalization step before processing text.
> 
AH, that is the rub, you only deal with the parts of Unicode that are simple 
and regular. This is EXACTLY the issue that you blame people who want to use 
ASCII or Codepages to solve, just the next step in the evolution.

One problem with normalization is that for Western European characters it tends 
to be able to convert every ‘Character’ to a code point, but in some corner 
cases, especially for other languages it can’t. I am not just talking about 
digraphs like ch that have been mentioned, but the real composed characters 
with a base glyph with marks above/below/embedded on it. Unicode represents 
many of them with a code point, but nowhere near all of them.

If you actually read the Unicode documents, they do talk about Characters, and 
admit that they aren’t necessarily codepoints, so if you actually want to talk 
about a CHARACTER set, Unicode, even UTF-32 needs to sometimes be treated as 
variable width. 
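
As a rough illustration of that last claim (a sketch assuming Python 3; the sample strings are illustrative): UTF-32 spends exactly four bytes per code point, but two strings that display as the same character can still differ in size.

```python
# Both strings are displayed as a single n-with-tilde.
precomposed = "\u00f1"   # LATIN SMALL LETTER N WITH TILDE
decomposed = "n\u0303"   # "n" followed by COMBINING TILDE

# "utf-32-le" fixes the byte order, so no byte order mark is prepended.
size_precomposed = len(precomposed.encode("utf-32-le"))
size_decomposed = len(decomposed.encode("utf-32-le"))

print(size_precomposed, size_decomposed)
```

Counted per code point the encoding is fixed width; counted per displayed character it is not.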

> But of course other people's experience may vary. I'm interested in 
> learning about the library you use to process graphemes in your software.
> 
> 
>> so the fact that every UTF-32
>> code point takes the same number of bytes or code words isn’t that
>> important. They are dealing with something that needs to be rendered and
>> preserving larger units, like the grapheme is important.
> 
> If you're writing a text widget or a shell, you need to worry about 
> rendering glyphs. Everyone else just delegates to their text widget, GUI 
> framework, or shell.
> 
But someone needs to write that text widget, or it might not do exactly what 
you want, say wrapping the text around obstacles already placed on the 
screen/page.

And try using that text widget to find the ‘middle’ (as shown) of a text 
string, (other than iterating with multiple calls to it to try and find it).

Unicode made the processing of Codepoints simpler, but made the processing of 
actual rendered text much more complicated if you want to handle everything 
right. 
> 
>>>> This doesn’t mean that UTF-32 is an awful system, just that it isn’t
>>>> the magical cure that some were hoping for.
>>> 
>>> Nobody ever claimed it was, except for the people railing that since it
>>> isn't a magical system we ought to go back to the Good Old Days of
>>> code page hell, or even further back when everyone just used ASCII.
>>> 
>> Sometimes ASCII is good enough, especially on a small machine with
>> limited resources.
> 
> I doubt that there are many general purpose computers with resources 
> *that* limited. Even MicroPython supports Unicode, and that runs on 
> embedded devices with memory measured in kilobytes. 8K is considered the 
> smallest amount of memory usable with MicroPython, although 128K is more 
> realistic as the *practical* lower limit.
> 
> In the mid 1980s, I was using computers with 128K of RAM, and they were 
> still 

Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-17 Thread Marko Rauhamaa
Chris Angelico :

> On Tue, Jul 17, 2018 at 6:27 PM, Marko Rauhamaa  wrote:
>> Of course, UTF-8 doesn't relieve you from Unicode problems. But it has
>> one big advantage: it can usually deal with non-Unicode data without any
>> extra considerations while Python3's strings make you have to take
>> elaborate measures to handle those special cases. Why, even print() must
>> be guarded against UnicodeEncodeError when the printed string is not in
>> the programmer's control.
>
> What is this "non-Unicode data" that UTF-8 can handle? Do you mean
> arbitrary byte sequences? Because no, it cannot; properly-formed UTF-8
> sequences MUST comply with the precise requirements of the format.

I was being imprecise: byte strings carrying UTF-8 can handle bad UTF-8
with equal ease. And that's a real, practical advantage.
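
For comparison, a minimal sketch of both sides of this point, assuming Python 3. The byte string is illustrative. A `bytes` object carries malformed UTF-8 untouched; a `str` can carry it only through an explicit error handler such as `surrogateescape` (PEP 383).

```python
raw = b"caf\xe9 \xff"  # illustrative input: not valid UTF-8

# Strict decoding refuses the data outright.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape smuggles the undecodable bytes through as lone
# surrogates, and the same handler restores them when encoding.
text = raw.decode("utf-8", errors="surrogateescape")
roundtrip = text.encode("utf-8", errors="surrogateescape")

print(strict_ok, roundtrip == raw)
```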

> Can you give an example of how Python 3's print function can raise
> UnicodeEncodeError when given a Python 3 string?

   >>> print("\ud810")
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
   UnicodeEncodeError: 'utf-8' codec can't encode character '\ud810' \
   in position 0: surrogates not allowed
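
One possible way to guard against this, sketched under the assumption that escaping the offending code point is acceptable for the output in question. The variable names are illustrative.

```python
s = "\ud810"  # lone surrogate, as in the session above

# Strict UTF-8 encoding is what fails inside print().
try:
    s.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Escape anything that cannot be encoded instead of raising.
safe = s.encode("utf-8", errors="backslashreplace").decode("utf-8")
print(safe)
```

The printed text is then the six ASCII characters of the escape sequence rather than the surrogate itself.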


Marko


Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-17 Thread Marko Rauhamaa
Chris Angelico :

> On Tue, Jul 17, 2018 at 7:03 PM, Marko Rauhamaa  wrote:
>> What I'd need is for the tty to tell me what column the cursor is
>> visually. Or better yet, the tty would have to tell me where the column
>> would be *after* I emit the next grapheme cluster.
>
> Are you prepared for the possibility that emitting characters won't
> change what column you're in?

Absolutely.


Marko


Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-17 Thread Chris Angelico
On Tue, Jul 17, 2018 at 7:03 PM, Marko Rauhamaa  wrote:
> Chris Angelico :
>
>> On Tue, Jul 17, 2018 at 6:27 PM, Marko Rauhamaa  wrote:
>>> For me, the issue is where do I produce a line break in my text output?
>>> Currently, I'm just counting codepoints to estimate the width of the
>>> output.
>>
>> Well, that's just flat out wrong, then. Counting graphemes isn't going
>> to make it any better. Grab a well-known library like Pango and let it
>> do your measurements for you, *in pixels*. Or better still, just poke
>> your text to a dedicated text-display widget and let it display it
>> correctly.
>
> What I'd need is for the tty to tell me what column the cursor is
> visually. Or better yet, the tty would have to tell me where the column
> would be *after* I emit the next grapheme cluster.

Are you prepared for the possibility that emitting characters won't
change what column you're in? Start a new line, then emit one Arabic
character. What column are you in? Now emit three more Arabic
characters, completing the word. What column? Now emit a U+0020 SPACE.
What column? Now emit some Latin characters, followed by more Arabic.
Where are you?
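
The reason the column is hard to predict is that each character carries a bidirectional class that the terminal or renderer uses to reorder the line. A small sketch, assuming Python 3 and the standard `unicodedata` module, with illustrative sample characters:

```python
import unicodedata

samples = {
    "ARABIC LETTER ALEF": "\u0627",
    "SPACE": "\u0020",
    "LATIN SMALL LETTER A": "a",
}

# Bidirectional class of each sample: right-to-left Arabic letter,
# whitespace (takes its direction from context), left-to-right letter.
classes = {name: unicodedata.bidirectional(ch) for name, ch in samples.items()}
print(classes)
```

This only exposes the per-character classes; working out the actual visual order requires the full Unicode bidirectional algorithm, which the standard library does not implement.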

ChrisA


Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-17 Thread Marko Rauhamaa
Chris Angelico :

> On Tue, Jul 17, 2018 at 6:27 PM, Marko Rauhamaa  wrote:
>> For me, the issue is where do I produce a line break in my text output?
>> Currently, I'm just counting codepoints to estimate the width of the
>> output.
>
> Well, that's just flat out wrong, then. Counting graphemes isn't going
> to make it any better. Grab a well-known library like Pango and let it
> do your measurements for you, *in pixels*. Or better still, just poke
> your text to a dedicated text-display widget and let it display it
> correctly.

What I'd need is for the tty to tell me what column the cursor is
visually. Or better yet, the tty would have to tell me where the column
would be *after* I emit the next grapheme cluster.

The tty *does* know that but I don't know if there is an interface to
query it. This doesn't seem to be working properly:

sys.stdout.write("a\u0300\u001b[6n\n")

(and would be a tricky interface even if it did)


Marko


Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-17 Thread Chris Angelico
On Tue, Jul 17, 2018 at 6:27 PM, Marko Rauhamaa  wrote:
> It is essential for people to understand that the very same issues that
> plague UTF-8 plague UTF-32 as well. Using UTF in both highlights that
> fact.

What a wonderful nonsense. I suppose that the same issues plague Elon
Musk as plague the musk sticks in the sweets aisle in the supermarket
- they do use the same letters, after all.

>> If by "very many things", you mean "not very many things", I agree
>> with you. In my experience, dealing with code points is "good enough",
>> especially if you use Western European alphabets, and even more so if
>> you're willing to do a normalization step before processing text.
>
> Of course, UTF-8 doesn't relieve you from Unicode problems. But it has
> one big advantage: it can usually deal with non-Unicode data without any
> extra considerations while Python3's strings make you have to take
> elaborate measures to handle those special cases. Why, even print() must
> be guarded against UnicodeEncodeError when the printed string is not in
> the programmer's control.

What is this "non-Unicode data" that UTF-8 can handle? Do you mean
arbitrary byte sequences? Because no, it cannot; properly-formed UTF-8
sequences MUST comply with the precise requirements of the format.

Can you give an example of how Python 3's print function can raise
UnicodeEncodeError when given a Python 3 string?

ChrisA


Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-17 Thread Chris Angelico
On Tue, Jul 17, 2018 at 6:27 PM, Marko Rauhamaa  wrote:
>> But of course other people's experience may vary. I'm interested in
>> learning about the library you use to process graphemes in your software.
>
> For me, the issue is where do I produce a line break in my text output?
> Currently, I'm just counting codepoints to estimate the width of the
> output.

Well, that's just flat out wrong, then. Counting graphemes isn't going
to make it any better. Grab a well-known library like Pango and let it
do your measurements for you, *in pixels*. Or better still, just poke
your text to a dedicated text-display widget and let it display it
correctly.

Back in the early 2000s, I built a program that displayed text in a
monospaced font, and it was riddled with assumptions that "one byte ==
one character == N pixels of width" (for some value of N that changed
only when you change font). It was easier to throw it out completely
and start over than to try to "bolt on" true Unicode support. The
replacement program uses GTK and Pango to do all its display work, and
while it still has a lot of complexities (because it has to handle
colour codes, highlighting, point-to-word, and such, all of which get
very complicated when you mix LTR and RTL text), at least it can 100%
dependably say "wrap to this point". For the convenience of the human
using it, it specifies a wrap width in characters, but in the fine
print, the wrap width is defined as "the width of that many of the
letter 'n' in the chosen font". At no point do I ever count bytes,
code units, code points, grapheme clusters, or blue-faced baboons, to
try to pretend that I know the width of the string. All of them are
wrong for the wrapping of text.
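
To illustrate why code point counts mislead, here is a rough sketch assuming Python 3 and the standard `unicodedata` module. The helper `naive_columns` is hypothetical and is a heuristic, not a real measurement; as argued above, only the rendering engine can give a dependable answer.

```python
import unicodedata

def naive_columns(s):
    """Rough terminal-column estimate; a heuristic, not a measurement."""
    cols = 0
    for ch in s:
        if unicodedata.combining(ch):
            continue  # combining marks usually take no column of their own
        # Wide and fullwidth characters usually take two columns.
        cols += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return cols

accented = "a\u0300"      # one displayed letter, two code points
kanji = "\u65e5\u672c"    # two code points, usually four columns

print(len(accented), naive_columns(accented))
print(len(kanji), naive_columns(kanji))
```

Even this estimate ignores fonts, tabs, bidirectional text and terminal quirks, which is the point of handing the job to a layout library.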

ChrisA


Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-17 Thread Marko Rauhamaa
Steven D'Aprano :

> On Tue, 17 Jul 2018 09:52:13 +0300, Marko Rauhamaa wrote:
>
>> Both Python2 and Python3 provide two forms of string, one containing
>> 8-bit integers and another one containing 21-bit integers.
>
> Why do you insist on making counter-factual statements as facts? Don't 
> you have a Python REPL you can try these outrageous claims out before 
> making them?
>
> [...]
>
> Python strings are sequences of abstract characters.

which -- by your definition -- are codepoints -- which by any
definition -- are integers.
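
The mapping between characters and integers can be seen directly with `ord()` and `chr()`. A minimal sketch, assuming Python 3 (the sample string is illustrative):

```python
import sys

s = "h\u00e9\U0001F600"  # illustrative: ASCII, Latin-1 range, astral plane

# Each element of a str maps to an integer code point and back again.
code_points = [ord(c) for c in s]
rebuilt = "".join(chr(n) for n in code_points)

# The largest code point needs 21 bits.
bits_needed = sys.maxunicode.bit_length()

print(code_points, bits_needed)
```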


Marko


Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-17 Thread Marko Rauhamaa
Steven D'Aprano :
> On Mon, 16 Jul 2018 21:48:42 -0400, Richard Damon wrote:
>> Who says there needs to be one. A good engineer will use the
>> definition that is most appropriate to the task at hand. Some things
>> need very solid definitions, and some things don’t.
>
> Then the problem is solved: we have a perfectly good de facto definition 
> of character: it is a synonym for "code point", and every single one of 
> Marko's objections disappears.

I admit it. Python3 is the perfect medium for your codepoint delivery
needs.

What you don't seem to understand about my objections is that no
programmer needs codepoints per se. Also, Python2's strings do as good a
job at delivering codepoints as Python3. Simultaneously, Python2's
strings are a better fit for the Unix system and network programming
model.

>> This goes back to my original point, where I said some people
>> consider UTF-32 as a variable width encoding. For very many things,
>> practically, the ‘codepoint’ isn’t the important thing,
>
> Ah, is this another one of those "let's pick a definition that nobody
> else uses, and state it as a fact" like UTF-32 being variable width?

   Each 32-bit value in UTF-32 represents one Unicode code point and is
   exactly equal to that code point's numerical value.

   <https://en.wikipedia.org/wiki/UTF-32>

That is called a bijection. Even more, it's a homomorphism. Homomorphism
is a very high degree of sameness.
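
The one-to-one correspondence quoted above can be checked in a few lines. A minimal sketch, assuming Python 3 and the standard `struct` module, with an illustrative sample string:

```python
import struct

s = "a\u00e9\U0001F600"  # illustrative sample

# "utf-32-le" fixes the byte order, so no byte order mark is prepended.
data = s.encode("utf-32-le")

# Read the bytes back as little-endian unsigned 32-bit integers.
values = struct.unpack("<%dI" % len(s), data)

print(values == tuple(ord(c) for c in s))
```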

It is essential for people to understand that the very same issues that
plague UTF-8 plague UTF-32 as well. Using UTF in both highlights that
fact.

> If by "very many things", you mean "not very many things", I agree
> with you. In my experience, dealing with code points is "good enough",
> especially if you use Western European alphabets, and even more so if
> you're willing to do a normalization step before processing text.

Of course, UTF-8 doesn't relieve you from Unicode problems. But it has
one big advantage: it can usually deal with non-Unicode data without any
extra considerations while Python3's strings make you have to take
elaborate measures to handle those special cases. Why, even print() must
be guarded against UnicodeEncodeError when the printed string is not in
the programmer's control.

> But of course other people's experience may vary. I'm interested in 
> learning about the library you use to process graphemes in your software.

For me, the issue is where do I produce a line break in my text output?
Currently, I'm just counting codepoints to estimate the width of the
output.


Marko


Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-17 Thread Steven D'Aprano
On Tue, 17 Jul 2018 10:51:38 +0300, Marko Rauhamaa wrote:

> in which Python3's honor is defended in a good many of the discussions
> in this newsgroup: anger, condescension, ridicule, name-calling.

You call it defending Python 3's honour. I call it responding to people 
who insist on spreading misinformation and falsehoods even when given the 
correct details.

Some people have their self-image wrapped up in being able to portray 
themselves as a maverick who, almost alone, sees through the "lies" about 
 to see "the truth". Others prefer reality 
instead, and get upset when false facts are repeated, over and over 
again, as truth.

If instead you want to discuss actual concrete areas where Python's text/
bytes divide hurts, you'll find that there are plenty of people who 
agree. Especially if they have to write string-handling code that needs 
to run under both 2 and 3. Been there, done that, don't want to do it 
again.

The Python 3 redesign was done to fix certain common, hard-to-diagnose 
problems in string handling caused by Python2's violation of the Zen "in 
the face of ambiguity, refuse the temptation to guess". (Python 2 guesses 
what encoding you probably mean when it comes to strings and bytes, and 
when it gets it right it is convenient, but when it gets it wrong, it is 
badly wrong, and hard to diagnose and fix.)
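
The "refuse to guess" behaviour can be seen in a few lines of Python 3. This is an illustrative sketch only; the sample data and variable names are made up.

```python
data = b"caf\xc3\xa9"        # bytes as they might arrive from a file or socket
text = "caf\u00e9"

try:
    data + text              # Python 2 would implicitly decode data as ASCII here
    mixed_ok = True
except TypeError:
    mixed_ok = False         # Python 3 refuses to guess an encoding

decoded = data.decode("utf-8")   # the encoding is stated explicitly
```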

It is impossible to improve the text handling experience for every single 
programmer writing every single kind of program under every single set of 
circumstances. Like any semantic change, there are going to be winners 
and losers, and the core devs' position is that if the losers have 
concrete and backwards-compatible suggestions for improving their 
experience (e.g. re-adding % support for byte strings) they will consider 
them, but going back to the Python 2 misdesign is off the table.


-- 
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing
it everywhere." -- Jon Ronson



Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-17 Thread Steven D'Aprano
On Tue, 17 Jul 2018 15:20:16 +0900, INADA Naoki wrote (replying to Marko):

> I still don't understand what's your original point. I think UTF-8 vs
> UTF-32 is totally different from Python 2 vs 3.
> 
> For example, strings in Rust and Swift (2010s languages!) are *valid*
> UTF-8. There is a strong separation between byte array and string, even
> though they use UTF-8. They look similar to Python 3, not Python 2.
> 
> And Python can use UTF-8 for internal encoding in the future. AFAIK,
> PyPy tries it now.  After they succeeded,  I want to try port it to
> CPython after we removed legacy Unicode APIs. (ref PEP 393)

I'm not sure about PyPy, but I'm fairly certain that MicroPython uses 
UTF-8.

I would be very interested to see the results of using UTF-8 in CPython. 
At the least, it would remove the need to keep a separate UTF-8 
representation in the string object, as they do now. It might even be 
more compact, although a naive implementation would lose the ability to 
do constant time indexing into strings.

That might be a tradeoff worth keeping, if indexing remained sufficiently 
fast.
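
To illustrate why a naive UTF-8 representation loses constant-time indexing, here is a rough sketch of looking up the n-th code point by scanning. utf8_index is a hypothetical helper, it assumes the buffer holds valid UTF-8, and it says nothing about how PyPy or MicroPython actually implement indexing.

```python
def utf8_index(buf, n):
    """Return the n-th code point (0-based) of valid UTF-8 bytes by a linear scan."""
    if n < 0:
        raise IndexError("negative indexes are not supported in this sketch")
    count = -1
    start = None
    for i, byte in enumerate(buf):
        if byte & 0xC0 != 0x80:      # not a continuation byte: a new code point starts
            count += 1
            if count == n:
                start = i
            elif count == n + 1:
                return buf[start:i].decode("utf-8")
    if start is None:
        raise IndexError("code point index out of range")
    return buf[start:].decode("utf-8")

s = "a\u00e9\u20ac\U0001F600z"       # code points of 1, 2, 3, 4 and 1 bytes
buf = s.encode("utf-8")
found = [utf8_index(buf, i) for i in range(len(s))]
```

Every lookup walks the buffer from the start, so the cost grows with the index. A real implementation would presumably cache offsets or keep an index table to stay fast.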



-- 
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing
it everywhere." -- Jon Ronson



Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-17 Thread Steven D'Aprano
On Tue, 17 Jul 2018 09:52:13 +0300, Marko Rauhamaa wrote:

> Both Python2 and Python3 provide two forms of string, one containing
> 8-bit integers and another one containing 21-bit integers.

Why do you insist on making counter-factual statements as facts? Don't 
you have a Python REPL you can try these outrageous claims out before 
making them?

py> b'abcd'[2] + 1  # bytes are sequences of integers
100

py> 'abcd'[2] + 1  # strings are not sequences of integers
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Can't convert 'int' object to str implicitly


Python strings are sequences of abstract characters.
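
For contrast, a small sketch of the explicit route between a character and its integer code point, using the built-ins ord() and chr(). The variable names are illustrative only.

```python
s = "abcd"
code_point = ord(s[2])            # explicit conversion: one-character str -> int
next_char = chr(code_point + 1)   # and back again: int -> one-character str
byte_value = b"abcd"[2]           # indexing bytes already yields an int
```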



-- 
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing
it everywhere." -- Jon Ronson



Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-17 Thread Steven D'Aprano
On Tue, 17 Jul 2018 08:26:45 +0300, Marko Rauhamaa wrote:

> Steven D'Aprano :
>> On Mon, 16 Jul 2018 22:51:32 +0300, Marko Rauhamaa wrote:
>>> UTF-8 bytes can only represent the first 128 code points of Unicode.
>>
>> This is DailyWTF material. Perhaps you want to rethink your wording and
>> maybe even learn a bit more about Unicode and the UTF encodings before
>> making such statements.
>>
>> The idea that UTF-8 bytes cannot represent the whole of Unicode is not
>> even wrong. Of course a *single* byte cannot, but a single byte is not
>> "UTF-8 bytes".
> 
> So I hope that by now you have understood my point and been able to
> decide if you agree with it or not.

If your point was not what you wrote, then no, I'm sorry, my crystal ball 
unexpectedly broke down (why it didn't foresee its own failure I'll never 
know...). I can't tell what you are thinking, only what you write. 
Sometimes I can guess (like my earlier guess that you meant grapheme, 
rather than glyph) but in this case, if you mean something other than 

"UTF-8 bytes can only represent the first 128 code points of Unicode"

I'm flummoxed.


-- 
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing
it everywhere." -- Jon Ronson



Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-17 Thread Marko Rauhamaa
INADA Naoki :

>> I won't comment on Rust and Swift because I don't know them.
> ...
>> I won't comment on Go, either.
>
> Hmm, are you saying Python 3 is "cult-like" without surveying other popular
> programming languages?

You can talk about Python3 independently of other programming languages.

Python3 is not a cult. It's a programming language. What is cult-like is
the manner in which Python3's honor is defended in a good many of the
discussions in this newsgroup: anger, condescension, ridicule,
name-calling.

> I can't agree that it's cult-like behavior.  I think it's a practical
> design decision.

If Python3 works for you, I'm happy for you.


Marko


Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-17 Thread Steven D'Aprano
On Mon, 16 Jul 2018 21:25:20 -0500, Tim Chase wrote:

> On 2018-07-17 01:08, Steven D'Aprano wrote:
>> In English, I think most people would prefer to use a different term
>> for whatever "sh" and "ch" represent than "character".
> 
> The term you may be reaching for is "consonant cluster"?
> 
> https://en.wikipedia.org/wiki/Consonant_cluster

Thanks!


-- 
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing
it everywhere." -- Jon Ronson



Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-17 Thread Steven D'Aprano
On Mon, 16 Jul 2018 21:48:42 -0400, Richard Damon wrote:

>> On Jul 16, 2018, at 9:21 PM, Steven D'Aprano
>>  wrote:
>> 
>>> On Mon, 16 Jul 2018 19:02:36 -0400, Richard Damon wrote:
>>> 
>>> You are defining a variable/fixed width codepoint set. Many others
>>> want to deal with CHARACTER sets.
>> 
>> Good luck coming up with a universal, objective, language-neutral,
>> consistent definition for a character.
>> 
> Who says there needs to be one. A good engineer will use the definition
> that is most appropriate to the task at hand. Some things need very
> solid definitions, and some things don’t.

Then the problem is solved: we have a perfectly good de facto definition 
of character: it is a synonym for "code point", and every single one of 
Marko's objections disappears.


> This goes back to my original point, where I said some people consider
> UTF-32 as a variable width encoding. For very many things, practically,
> the ‘codepoint’ isn’t the important thing, 

Ah, is this another one of those "let's pick a definition that nobody 
else uses, and state it as a fact" like UTF-32 being variable width?

If by "very many things", you mean "not very many things", I agree with 
you. In my experience, dealing with code points is "good enough", 
especially if you use Western European alphabets, and even more so if 
you're willing to do a normalization step before processing text.
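
The normalization step mentioned here can be sketched with unicodedata.normalize from the standard library. The sample strings are illustrative; the second one is included to show that normalization does not always collapse a sequence to one code point.

```python
import unicodedata

decomposed = "a\u0302"                       # 'a' + COMBINING CIRCUMFLEX ACCENT
composed = unicodedata.normalize("NFC", decomposed)

# No precomposed "q with circumflex" exists, so NFC leaves two code points.
no_precomposed = unicodedata.normalize("NFC", "q\u0302")
```

After NFC, most text in Western European alphabets has one code point per visible letter, which is why code point handling is often good enough there.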

But of course other people's experience may vary. I'm interested in 
learning about the library you use to process graphemes in your software.


> so the fact that every UTF-32
> code point takes the same number of bytes or code words isn’t that
> important. They are dealing with something that needs to be rendered and
> preserving larger units, like the grapheme is important.

If you're writing a text widget or a shell, you need to worry about 
rendering glyphs. Everyone else just delegates to their text widget, GUI 
framework, or shell.


>>> This doesn’t mean that UTF-32 is an awful system, just that it isn’t
>>> the magical cure that some were hoping for.
>> 
>> Nobody ever claimed it was, except for the people railing that since it
>> isn't a magical system we ought to go back to the Good Old Days of
>> code page hell, or even further back when everyone just used ASCII.
>> 
> Sometimes ASCII is good enough, especially on a small machine with
> limited resources.

I doubt that there are many general purpose computers with resources 
*that* limited. Even MicroPython supports Unicode, and that runs on 
embedded devices with memory measured in kilobytes. 8K is considered the 
smallest amount of memory usable with MicroPython, although 128K is more 
realistic as the *practical* lower limit.

In the mid 1980s, I was using computers with 128K of RAM, and they were 
still able to deal with more than just ASCII. I think the "limited 
resources" argument is bogus.


-- 
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing
it everywhere." -- Jon Ronson



Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-17 Thread INADA Naoki
> I won't comment on Rust and Swift because I don't know them.
...
> I won't comment on Go, either.

Hmm, are you saying Python 3 is "cult-like" without surveying other popular
programming languages?

There are many popular languages which separate bytes and Unicode
strings explicitly and whose strings are not byte-transparent: C#, Java,
ECMAScript (including families like TypeScript), Rust, Swift, Julia, and more.

I can't agree that it's cult-like behavior.  I think it's a practical
design decision.

Regards,
-- 
INADA Naoki  


Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-17 Thread Terry Reedy

On 7/16/2018 10:25 PM, Tim Chase wrote:
> On 2018-07-17 01:08, Steven D'Aprano wrote:
>> In English, I think most people would prefer to use a different
>> term for whatever "sh" and "ch" represent than "character".
>
> The term you may be reaching for is "consonant cluster"?
>
> https://en.wikipedia.org/wiki/Consonant_cluster


Sibilant (soft) ch (as opposed to hard aspirated chi as in Greek letter 
khi (visually like X)) and sh are single consonants, single phonemes in 
spoken language.  In less parsimonious writing systems than Latin, they 
are often represented by single characters.  When transliterated into 
Latin characters, both decorated c and s and ch and sh are used.


'str', as in string or street is a consonant cluster. It might be 
represented by a single ligature, but I would not expect any 
phoneme-based writing system to consider the result to be a single 
character.  (Given that the sound of X (hard chi) mutated into 'ks', the 
latter is not impossible.)


--
Terry Jan Reedy



Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-17 Thread Marko Rauhamaa
INADA Naoki :

> On Tue, Jul 17, 2018 at 2:31 PM Marko Rauhamaa  wrote:
>> So I hope that by now you have understood my point and been able to
>> decide if you agree with it or not.
>
> I still don't understand what's your original point.
> I think UTF-8 vs UTF-32 is totally different from Python 2 vs 3.
>
> For example, strings in Rust and Swift (2010s languages!) are *valid*
> UTF-8. There is a strong separation between byte array and string, even
> though they use UTF-8. They look similar to Python 3, not Python 2.

I won't comment on Rust and Swift because I don't know them.

> And Python can use UTF-8 for internal encoding in the future. AFAIK,
> PyPy tries it now. After they succeeded, I want to try port it to
> CPython after we removed legacy Unicode APIs. (ref PEP 393)

How CPython3 implements str objects internally is not what I'm talking
about. It's the programmer's model in any compliant Python3
implementation.

Both Python2 and Python3 provide two forms of string, one containing
8-bit integers and another one containing 21-bit integers. Python3 made
the situation worse in a minor way and a major way. The minor way is the
uglification of the byte string notation. The major way is the wholesale
preference or mandating of Unicode strings in numerous standard-library
interfaces.

> So "UTF-8 is better than UTF-32" is totally different problem from
> "Python 2 is better than 3".

Unix programming is smoothest when the programmer can operate on bytes.
Bytes are the mother tongue of Unix, and programming languages should
not try to present a different model to the programmer.

> Is your point "accepting invalid UTF-8 implicitly by default is better
> than explicit 'surrogateescape' error handler" like Go?
> (It's 2010s languages with UTF-8 based string too, but accept invalid
> UTF-8).

I won't comment on Go, either.


Marko


Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-17 Thread Terry Reedy

On 7/16/2018 7:02 PM, Richard Damon wrote:
>> On Jul 16, 2018, at 3:28 PM, Terry Reedy  wrote:
>>
>> If one is using a broader definition than usual, it is clearer to say so.

This is the core of what I wrote.  Do you disagree?

> You are defining a variable/fixed width codepoint set.


No, I did not define anything.  I said, I believe accurately, that this 
is the, or at least one common understanding of 'variable/fixed width 
encoding'.  To repeat, if one is writing to be understood, rather than 
create an effect, and one uses a word or phrase in a non-standard 
fashion (which I myself do occasionally), then it is clearer to say what 
one is doing (which I try to also do).


--
Terry Jan Reedy



Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-17 Thread INADA Naoki
On Tue, Jul 17, 2018 at 2:31 PM Marko Rauhamaa  wrote:
>
> Steven D'Aprano :
> > On Mon, 16 Jul 2018 22:51:32 +0300, Marko Rauhamaa wrote:
> >> UTF-8 bytes can only represent the first 128 code points of Unicode.
> >
> > This is DailyWTF material. Perhaps you want to rethink your wording
> > and maybe even learn a bit more about Unicode and the UTF encodings
> > before making such statements.
> >
> > The idea that UTF-8 bytes cannot represent the whole of Unicode is not
> > even wrong. Of course a *single* byte cannot, but a single byte is not
> > "UTF-8 bytes".
>
> So I hope that by now you have understood my point and been able to
> decide if you agree with it or not.
>
>
> Marko

I still don't understand what your original point is.
I think UTF-8 vs UTF-32 is totally different from Python 2 vs 3.

For example, strings in Rust and Swift (2010s languages!) are *valid* UTF-8.
There is a strong separation between byte array and string, even though they
use UTF-8. They look similar to Python 3, not Python 2.

And Python can use UTF-8 for internal encoding in the future.
AFAIK, PyPy is trying it now.  Once they succeed, I want to try porting it
to CPython after we remove the legacy Unicode APIs. (ref PEP 393)

So "UTF-8 is better than UTF-32" is totally different problem from
"Python 2 is better than 3".

Is your point "accepting invalid UTF-8 implicitly by default is better
than explicit 'surrogateescape' error handler" like Go?
(It's a 2010s language with UTF-8 based strings too, but it accepts invalid
UTF-8).
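
For readers unfamiliar with it, the explicit 'surrogateescape' route in Python 3 looks roughly like this. It is a minimal sketch; the file name is made up for the example.

```python
raw = b"report-\xff.txt"       # not valid UTF-8: the byte 0xFF can never appear

try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False          # the default (strict) handler rejects the data

# surrogateescape smuggles each undecodable byte through as a lone surrogate...
name = raw.decode("utf-8", errors="surrogateescape")
# ...and encoding with the same handler restores the original bytes exactly.
restored = name.encode("utf-8", errors="surrogateescape")
```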

Regards,

--
INADA Naoki  


Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-16 Thread Marko Rauhamaa
Steven D'Aprano :
> On Mon, 16 Jul 2018 22:51:32 +0300, Marko Rauhamaa wrote:
>> UTF-8 bytes can only represent the first 128 code points of Unicode.
>
> This is DailyWTF material. Perhaps you want to rethink your wording
> and maybe even learn a bit more about Unicode and the UTF encodings
> before making such statements.
>
> The idea that UTF-8 bytes cannot represent the whole of Unicode is not
> even wrong. Of course a *single* byte cannot, but a single byte is not
> "UTF-8 bytes".

So I hope that by now you have understood my point and been able to
decide if you agree with it or not.


Marko


Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-16 Thread Tim Chase
On 2018-07-17 01:21, Steven D'Aprano wrote:
> > This doesn’t mean that UTF-32 is an awful system, just that it
> > isn’t the magical cure that some were hoping for.  
> 
> Nobody ever claimed it was, except for the people railing that
> since it isn't a magical system we ought to go back to the Good
> Old Days of code page hell, or even further back when everyone just
> used ASCII.

But even ed(1) on most systems is 8-bit clean so even there you're not
limited to ASCII.  I can't say I miss code-pages in the least.

-tkc





Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-16 Thread Tim Chase
On 2018-07-17 01:08, Steven D'Aprano wrote:
> In English, I think most people would prefer to use a different
> term for whatever "sh" and "ch" represent than "character".

The term you may be reaching for is "consonant cluster"?

https://en.wikipedia.org/wiki/Consonant_cluster

-tkc





Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-16 Thread Richard Damon
> On Jul 16, 2018, at 9:21 PM, Steven D'Aprano 
>  wrote:
> 
>> On Mon, 16 Jul 2018 19:02:36 -0400, Richard Damon wrote:
>> 
>> You are defining a variable/fixed width codepoint set. Many others want
>> to deal with CHARACTER sets.
> 
> Good luck coming up with a universal, objective, language-neutral, 
> consistent definition for a character.
> 
Who says there needs to be one? A good engineer will use the definition that is 
most appropriate to the task at hand. Some things need very solid definitions, 
and some things don’t. 

This goes back to my original point, where I said some people consider UTF-32 
as a variable width encoding. For very many things, practically, the 
‘codepoint’ isn’t the important thing, so the fact that every UTF-32 code point 
takes the same number of bytes or code words isn’t that important. They are 
dealing with something that needs to be rendered and preserving larger units, 
like the grapheme is important.

> 
>> This doesn’t mean that UTF-32 is an awful system, just that it isn’t the
>> magical cure that some were hoping for.
> 
> Nobody ever claimed it was, except for the people railing that since it 
> isn't a magical system we ought to go back to the Good Old Days of code 
> page hell, or even further back when everyone just used ASCII.
> 
Sometimes ASCII is good enough, especially on a small machine with limited 
resources. Sometimes you do need to use a ‘Code Page’ because of limited 
resources (and that unit will only be able to talk a single language because of 
that too). Sometimes you have the luxury of being able to use a somewhat 
complete Unicode implementation. Sometimes you are never going to be displaying 
anything, and you can mostly just treat everything as a bag of bytes. You use 
the tool that is right for the job.



Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-16 Thread Steven D'Aprano
On Mon, 16 Jul 2018 22:51:32 +0300, Marko Rauhamaa wrote:

> All UTF-8. No unicode strings.

That just means you are re-implementing the bits of Unicode you care 
about (which may be "nothing at all") as UTF-8. If your application is 
nothing but middleware squirting bytes from one layer to another layer, 
that might be all you need care about.

But then you're not processing text in your application, and why should 
your experience in not-processing-text be given any weight over the 
experiences of those who do process text?


And later, in another post:

> UTF-8 bytes can only represent the first 128 code points of Unicode.

This is DailyWTF material. Perhaps you want to rethink your wording and 
maybe even learn a bit more about Unicode and the UTF encodings before 
making such statements.

The idea that UTF-8 bytes cannot represent the whole of Unicode is not 
even wrong. Of course a *single* byte cannot, but a single byte is not 
"UTF-8 bytes".

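A quick check that sequences of UTF-8 bytes cover the whole code point range, from ASCII up to U+10FFFF. This is just a sketch; the sample characters are arbitrary picks from each length class.

```python
samples = ["A", "\u00e9", "\u20ac", "\U0001F600", "\U0010FFFF"]

# Number of bytes each code point needs in UTF-8.
lengths = [len(ch.encode("utf-8")) for ch in samples]

# Every sample survives an encode/decode round trip unchanged.
round_trip = all(ch.encode("utf-8").decode("utf-8") == ch for ch in samples)
```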

-- 
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing
it everywhere." -- Jon Ronson



Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-16 Thread Steven D'Aprano
On Mon, 16 Jul 2018 15:28:51 -0400, Terry Reedy wrote:

> On 7/16/2018 1:11 PM, Richard Damon wrote:
> 
>> Many consider that UTF-32 is a variable-width encoding because of the
>> combining characters. It can take multiple ‘codepoints’ to define what
>> should be a single ‘character’ for display.
> 
> I hope you realize that this is not the standard meaning of
> 'variable-width encoding', which is 'variable number of bytes for a
> codepoint'.

A minor correction Terry: it is the number of code units, not bytes.

UTF-8 uses 1-byte code units, and from 1 to 4 code units per code point;

UTF-16 uses 2-byte code units (a 16-bit word), and 1 or 2 words per code 
point;

UTF-32 uses 4-byte code units (a 32-bit word), and only ever a single 
code unit for every code point.
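
These counts are easy to verify from Python. The sketch below uses a hypothetical helper, code_units, and the little-endian codec variants so that no byte order mark is added to the output.

```python
def code_units(ch):
    """Count the code units one code point needs in each encoding form."""
    return {
        "utf-8": len(ch.encode("utf-8")),            # 1-byte code units
        "utf-16": len(ch.encode("utf-16-le")) // 2,  # 2-byte code units
        "utf-32": len(ch.encode("utf-32-le")) // 4,  # 4-byte code units
    }

ascii_units = code_units("A")
euro_units = code_units("\u20ac")
emoji_units = code_units("\U0001F600")   # outside the Basic Multilingual Plane
```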



-- 
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing
it everywhere." -- Jon Ronson



Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-16 Thread Steven D'Aprano
On Mon, 16 Jul 2018 19:02:36 -0400, Richard Damon wrote:

> You are defining a variable/fixed width codepoint set. Many others want
> to deal with CHARACTER sets.

Good luck coming up with a universal, objective, language-neutral, 
consistent definition for a character.


> This doesn’t mean that UTF-32 is an awful system, just that it isn’t the
> magical cure that some were hoping for.

Nobody ever claimed it was, except for the people railing that since it 
isn't a magical system we ought to go back to the Good Old Days of code 
page hell, or even further back when everyone just used ASCII.



-- 
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing
it everywhere." -- Jon Ronson



Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-16 Thread Steven D'Aprano
On Tue, 17 Jul 2018 06:15:25 +1000, Chris Angelico wrote:

> On Tue, Jul 17, 2018 at 4:55 AM, Steven D'Aprano
>  wrote:
>> There is nothing special about diacritics such that we ought to treat
>> some combinations like "Ch" (two code points = one character) as "fixed
>> width" while others like "â" (two code points = one character) as
>> "variable width".
> 
> When you reverse a word, do you treat "ch" and "sh" as one character or
> two? 

In English, "ch" is always two letters of the alphabet. In Welsh and 
Czech, they can be one or two letters. (I think they will be two letters 
only in loan words, but I'm not certain about that.) Whether that makes 
them one or two characters depends on how you define "character".

Good luck with finding a universal, objective, unambiguous definition.


> I'm of the opinion that they're single characters, and thus this
> should be "dalokosh":
> 
> https://wiki.teamfortress.com/wiki/Dalokohs_Bar
> 
> (It's the Russian for "chocolate" - "шоколад" - transliterated to
> English/Latin - "šokolad" or "shokolad" - and then reversed.)

In English, I think most people would prefer to use a different term for 
whatever "sh" and "ch" represent than "character". But you make a good 
point that even in English, we sometimes want to treat two letter 
combinations as a single unit.
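
Treating chosen letter pairs as single units when reversing can be sketched like this. reverse_with_digraphs is a hypothetical helper, and the default digraph list is an assumption that would have to be chosen per language.

```python
import re

def reverse_with_digraphs(word, digraphs=("sh", "ch")):
    """Reverse a word while keeping the listed digraphs intact as units."""
    # Try the digraphs first, then fall back to any single character.
    pattern = "|".join(map(re.escape, digraphs)) + "|."
    units = re.findall(pattern, word, flags=re.DOTALL)
    return "".join(reversed(units))

by_unit = reverse_with_digraphs("shokolad")
by_letter = "shokolad"[::-1]
```

Here by_unit gives "dalokosh" while the plain letter-by-letter reversal gives "dalokohs".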



-- 
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing
it everywhere." -- Jon Ronson



Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-16 Thread Richard Damon

> On Jul 16, 2018, at 3:28 PM, Terry Reedy  wrote:
> 
>> On 7/16/2018 1:11 PM, Richard Damon wrote:
>> 
>> Many consider that UTF-32 is a variable-width encoding because of the 
>> combining characters. It can take multiple ‘codepoints’ to define what 
>> should be a single ‘character’ for display.
> 
> I hope you realize that this is not the standard meaning of 'variable-width 
> encoding', which is 'variable number of bytes for a codepoint'.  UTF-16 and 
> UTF-8 are variable width.  If one expands the definition enough, Ascii is 
> 'variable width' because 'fi' is two bytes, or more realistically, because <= 
> and >= are two bytes instead of one (as they can be in Unicode!).
> 
> If one is using a broader definition than usual, it is clearer to say so.
> 
> -- 
> Terry Jan Reedy
> 

You are defining a variable/fixed width codepoint set. Many others want to deal 
with CHARACTER sets. The Unicode consortium agrees that a code point is not 
necessarily a character (which is one reason they came up with the term). When 
actually trying to do work with text strings, the fact that some codepoints are 
combining codes that need to ‘stick’ to their mate becomes important. One of 
the claimed advantages of fixed width character set encodings is that you 
aren’t supposed to need to worry about breaking strings in two, but that 
doesn’t work in Unicode, you need to make sure you aren’t breaking a combining 
sequence.
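
A rough sketch of breaking a string without separating a combining mark from its base. safe_split is a hypothetical helper, and it only looks at the canonical combining class, so it is a simplification of full grapheme segmentation (Unicode Standard Annex #29).

```python
import unicodedata

def safe_split(text, pos):
    """Split near pos, moving left so a combining mark stays with its base."""
    while 0 < pos < len(text) and unicodedata.combining(text[pos]):
        pos -= 1
    return text[:pos], text[pos:]

naive = ("xa\u0304e"[:2], "xa\u0304e"[2:])   # leaves the macron stranded at the start
safe = safe_split("xa\u0304e", 2)            # backs up to just before the 'a'
```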

Even worse, Unicode really needs arbitrary look back to render substrings 
because it uses shift codes for things like left-to-right/right-to-left 
rendering control.

This doesn’t mean that UTF-32 is an awful system, just that it isn’t the 
magical cure that some were hoping for.


Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-16 Thread Chris Angelico
On Tue, Jul 17, 2018 at 7:02 AM, Ethan Furman  wrote:
> On 07/16/2018 01:15 PM, Chris Angelico wrote:
>>
>> On Tue, Jul 17, 2018 at 4:55 AM, Steven D'Aprano wrote:
>
>
>>> There is nothing special about diacritics such that we ought to treat
>>> some combinations like "Ch" (two code points = one character) as "fixed
>>> width" while others like "â" (two code points = one character) as
>>> "variable width".
>>
>>
>> When you reverse a word, do you treat "ch" and "sh" as one character
>> or two? I'm of the opinion that they're single characters, and thus
>> this should be "dalokosh":
>
>
> Depends on the language:  in Spanish, "ch" is its own letter (at least it
> was when I grew up), so any word containing it should still contain it when
> reversed:  "chica" would be "acich".
>

Yeah. In Russian, "sh" is the single character "ш". I'm of the opinion
that, even after being transliterated into English phonetics, that
should be treated as a unit. ISO-9 uses "š" rather than "sh", which is
an improvement in character correspondence, but your average English
speaker is more likely to be able to pronounce "dalokosh" correctly
than to figure out "dalokoš". In the same way, I created a magic item
in a D&D campaign called "Yasham Burda", even though the more correct
spelling would be "Yaşam Burda" or even "Yasam Burda", for the benefit
of my monolingual players. But I'd still treat the "sh" as one
character.

Ain't transliteration fun?

ChrisA


Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-16 Thread Marko Rauhamaa
Ethan Furman :
> Depends on the language: in Spanish, "ch" is its own letter (at least
> it was when I grew up), so any word containing it should still contain
> it when reversed: "chica" would be "acich".

The Royal Academy broke "ch" and "ll" up into separate letters a decade
or so back. It had become accepted practice in dictionaries way before
that.

In Finnish, "v" and "w" are still orthographic variants of the same
letter. In practice, Finns don't have a problem with computers insisting
they are separate letters.

While the Royal Academy of the Spanish Language has now accepted that
"ñ" is an accented "n", no Finn would think that "ä" is an accented "a"
any more than an English-speaker would think that "G" is an accented "C"
(which it originally was).


Marko


Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-16 Thread Chris Angelico
On Tue, Jul 17, 2018 at 6:54 AM, Marko Rauhamaa  wrote:
> Chris Angelico :
>> Challenge: Reverse a string in UTF-8.
>
> Counter-challenge: Reverse a Unicode string:
>
>    >>> s = "a\u0304e"
>    >>> s
>    'āe'
>    >>> L = list(s)
>    >>> L.reverse()
>    >>> "".join(L)
>    'ēa'
>
>> Challenge: Center text in UTF-8.
>
> Counter-challenge: Center a Unicode string:
>
>    >>> t = s * 3
>    >>> t
>    'āeāeāe'
>    >>> t.center(9)
>    'āeāeāe'
>
>> Challenge: Given a (non-initial) character in a buffer of UTF-8 bytes,
>> find the immediately preceding character.
>
> The counter-challenge is left as an exercise for the reader.
>
>> All of these are fundamentally difficult by nature, but if you index
>> by code points, you eliminate one level of difficulty; indexing by
>> bytes retains all the existing difficulty and adds another layer.
>
> Oh, sorry. I thought you were suggesting Unicode strings would make the
> challenges somehow easy.

So now that you've actually read my entire post, you'll see that there
are fundamental difficulties, but that UTF-8 introduces more. Great.
Now go ahead and reply to my post, knowing my actual point.
Congratulations on posting something of no value.

ChrisA


Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-16 Thread Ethan Furman

On 07/16/2018 01:15 PM, Chris Angelico wrote:
> On Tue, Jul 17, 2018 at 4:55 AM, Steven D'Aprano wrote:
>> There is nothing special about diacritics such that we ought to treat
>> some combinations like "Ch" (two code points = one character) as "fixed
>> width" while others like "â" (two code points = one character) as
>> "variable width".
>
> When you reverse a word, do you treat "ch" and "sh" as one character
> or two? I'm of the opinion that they're single characters, and thus
> this should be "dalokosh":


Depends on the language:  in Spanish, "ch" is its own letter (at least it
was when I grew up), so any word containing it should still contain it
when reversed:  "chica" would be "acich".


--
~Ethan~


Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-16 Thread Marko Rauhamaa
Chris Angelico :
> Challenge: Reverse a string in UTF-8.

Counter-challenge: Reverse a Unicode string:

   >>> s = "a\u0304e"
   >>> s
   'āe'
   >>> L = list(s)
   >>> L.reverse()
   >>> "".join(L)
   'ēa'
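
For comparison, a sketch of a reversal that keeps each combining mark attached to its base. clusters and reverse_clusters are hypothetical helpers; grouping by canonical combining class is a simplification of real grapheme cluster rules.

```python
import unicodedata

def clusters(text):
    """Group each base code point with the combining marks that follow it."""
    out = []
    for ch in text:
        if out and unicodedata.combining(ch):
            out[-1] += ch        # attach the mark to the preceding cluster
        else:
            out.append(ch)
    return out

def reverse_clusters(text):
    return "".join(reversed(clusters(text)))

s = "a\u0304e"
naive = s[::-1]                  # the macron ends up on the 'e'
kept = reverse_clusters(s)       # the macron stays on the 'a'
```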

> Challenge: Center text in UTF-8.

Counter-challenge: Center a Unicode string:

   >>> t = s * 3
   >>> t
   'āeāeāe'
   >>> t.center(9)
   'āeāeāe'
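
A centering variant that pads by visible columns instead of code points might look like this. center_visible is a hypothetical helper, and it assumes every non-combining code point occupies exactly one column.

```python
import unicodedata

def center_visible(text, width, fill=" "):
    """Center text, counting combining marks as zero columns wide."""
    visible = sum(1 for ch in text if not unicodedata.combining(ch))
    pad = max(width - visible, 0)
    left = pad // 2
    return fill * left + text + fill * (pad - left)

t = "a\u0304e" * 3               # 9 code points, 6 visible columns
builtin = t.center(9)            # len(t) is already 9, so no padding is added
padded = center_visible(t, 9)    # 1 space on the left, 2 on the right
```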

> Challenge: Given a (non-initial) character in a buffer of UTF-8 bytes,
> find the immediately preceding character.

The counter-challenge is left as an exercise for the reader.

> All of these are fundamentally difficult by nature, but if you index
> by code points, you eliminate one level of difficulty; indexing by
> bytes retains all the existing difficulty and adds another layer.

Oh, sorry. I thought you were suggesting Unicode strings would make the
challenges somehow easy.


Marko


Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-16 Thread Chris Angelico
On Tue, Jul 17, 2018 at 4:55 AM, Steven D'Aprano
 wrote:
> There is nothing special about diacritics such that we ought to treat
> some combinations like "Ch" (two code points = one character) as "fixed
> width" while others like "â" (two code points = one character) as
> "variable width".

When you reverse a word, do you treat "ch" and "sh" as one character
or two? I'm of the opinion that they're single characters, and thus
this should be "dalokosh":

https://wiki.teamfortress.com/wiki/Dalokohs_Bar

(It's the Russian for "chocolate" - "шоколад" - transliterated to
English/Latin - "šokolad" or "shokolad" - and then reversed.)

But that's an extremely difficult thing to explain to your average gamer...

ChrisA


Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-16 Thread Chris Angelico
On Tue, Jul 17, 2018 at 5:51 AM, Marko Rauhamaa  wrote:
> Steven D'Aprano :
>> Under that standard definition, UTF-8 and UTF-16 are variable-width,
>> and UTF-32 is fixed-width.
>>
>> But I'll accept that UTF-32 is variable-width if Marko accepts that
>> ASCII is too.
>
> If that makes you happy, fine. The point is, UTF-32 has no advantages
> over UTF-8. And I'm referring to the text abstraction as seen by the
> programmer. It has nothing to do with the layout of bytes inside
> CPython.
>
> I use UTF-8 in my C programs and sense no disadvantage. I have never
> felt a need for wchar_t. Similarly, I had a small Python2 program that
> quizzed me about Hebrew vocabulary with Finnish translations and
> Esperanto pronunciation instructions. All UTF-8. No unicode strings. (I
> *have* converted that to Python3 just to be on the bleeding edge, but it
> didn't give me any advantages over Python2.)

Challenge: Reverse a string in UTF-8.

Challenge: Center text in UTF-8.

Challenge: Given a (non-initial) character in a buffer of UTF-8 bytes,
find the immediately preceding character.

All of these are fundamentally difficult by nature, but if you index
by code points, you eliminate one level of difficulty; indexing by
bytes retains all the existing difficulty and adds another layer.
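
To make the "extra layer" concrete, here is a sketch of the third challenge at the byte level. It assumes the buffer is valid UTF-8 and that pos already sits on a code point boundary; the function name is made up. Note that it only finds the previous code point, not the previous grapheme, so the harder part of the problem is still left over.

```python
def prev_char_start(buf, pos):
    """Return the index where the code point before `pos` starts.

    Assumes `buf` is valid UTF-8 and `pos` is the start of a code point
    (or len(buf)), with pos > 0.
    """
    if pos <= 0:
        raise ValueError("no preceding character")
    i = pos - 1
    # Continuation bytes look like 0b10xxxxxx; skip back over them.
    while i > 0 and (buf[i] & 0xC0) == 0x80:
        i -= 1
    return i

data = "a\u0304e\u20ac!".encode("utf-8")
end = len(data) - 1                       # byte offset of "!"
start = prev_char_start(data, end)
print(data[start:end].decode("utf-8"))    # €
```

With a sequence of code points the same lookup is just index - 1.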

ChrisA


Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-16 Thread Rhodri James

On 16/07/18 20:51, Marko Rauhamaa wrote:

I use UTF-8 in my C programs and sense no disadvantage. I have never
felt a need for wchar_t.


That's not a good comparison, though, because wchar_t in C really 
doesn't give you much (if any) advantage over rolling your own UTF-8 
support, even when that means making sure you don't split characters 
across buffers.
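
For what it's worth, the Python counterpart of "don't split characters across buffers" is the incremental decoder in the standard codecs module. A small sketch, with decode_chunks as a hypothetical helper name and a deliberately awkward chunk size of three bytes:

```python
import codecs

def decode_chunks(chunks, encoding="utf-8"):
    """Decode byte chunks, carrying incomplete sequences into the next chunk."""
    decoder = codecs.getincrementaldecoder(encoding)()
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            yield text
    # final=True raises UnicodeDecodeError if a sequence is left unfinished.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail

raw = "na\u00efve \u20ac".encode("utf-8")
chunks = [raw[i:i + 3] for i in range(0, len(raw), 3)]   # splits both multi-byte characters
print("".join(decode_chunks(chunks)))                    # naïve €
```

Calling bytes.decode() on each chunk separately would fail on the first chunk here, because it ends in the middle of the two-byte sequence.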


--
Rhodri James *-* Kynesim Ltd


Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-16 Thread Marko Rauhamaa
Steven D'Aprano :
> Under that standard definition, UTF-8 and UTF-16 are variable-width,
> and UTF-32 is fixed-width.
>
> But I'll accept that UTF-32 is variable-width if Marko accepts that
> ASCII is too.

If that makes you happy, fine. The point is, UTF-32 has no advantages
over UTF-8. And I'm referring to the text abstraction as seen by the
programmer. It has nothing to do with the layout of bytes inside
CPython.

I use UTF-8 in my C programs and sense no disadvantage. I have never
felt a need for wchar_t. Similarly, I had a small Python2 program that
quizzed me about Hebrew vocabulary with Finnish translations and
Esperanto pronunciation instructions. All UTF-8. No unicode strings. (I
*have* converted that to Python3 just to be on the bleeding edge, but it
didn't give me any advantages over Python2.)


Marko


Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-16 Thread Terry Reedy

On 7/16/2018 1:11 PM, Richard Damon wrote:


Many consider that UTF-32 is a variable-width encoding because of the combining 
characters. It can take multiple ‘codepoints’ to define what should be a single 
‘character’ for display.


I hope you realize that this is not the standard meaning of 
'variable-width encoding', which is 'variable number of bytes for a 
codepoint'.  UTF-16 and UTF-8 are variable width.  If one expands the 
definition enough, Ascii is 'variable width' because 'fi' is two bytes, 
or more realistically, because <= and >= are two bytes instead of one 
(as they can be in Unicode!).
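
A quick illustration of those single code points, as a sketch using only the standard unicodedata module: U+FB01 is the "fi" ligature, U+2264 and U+2265 are the one-character forms of <= and >=.

```python
import unicodedata

for ch in ("\ufb01", "\u2264", "\u2265"):
    # One code point each, though more than one byte in UTF-8.
    print(unicodedata.name(ch), len(ch), len(ch.encode("utf-8")))

# Compatibility normalization turns the ligature back into two letters.
print(unicodedata.normalize("NFKC", "\ufb01"))   # fi
```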


If one is using a broader definition than usual, it is clearer to say so.

--
Terry Jan Reedy




Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-16 Thread Steven D'Aprano
On Mon, 16 Jul 2018 14:22:27 -0400, Richard Damon wrote:

[...]
> But I am not talking about those sort of characters or ligatures, 

So what? I am.

You don't get to say "only non-standard definitions I approve of count".

There is the industry standard definition of what it means to be a fixed- 
or variable-width encoding, which we can all agree on, or we can have a 
free-for-all where I reject your non-standard meaning and you reject mine 
and nobody can understand anything that anyone else says.

You (generic "you", not necessarily you personally) don't get to demand 
that I must accept your redefinition, while simultaneously refusing to 
return the favour. If you try, I will simply dismiss what you say as 
nonsense on stilts: you (still generic you) clearly don't know what 
variable-width means and are trying to shift the terms of the debate by 
redefining terms so that black means white and white means purple.


> but
> ‘characters’ that are built up of a combining diacritical marks (like
> accents) and a base character. Unicode define many code points for the
> more common of these, but many others do not.

I am aware how Unicode works, and it doesn't change a thing.

Fixed/variable width is NOT defined in terms of "characters", but if it 
were, ASCII would be variable width too. Limiting the definition to only 
diacritics is just a feeble attempt to wiggle out of the logical 
consequences of your (generic your) position.

There is nothing special about diacritics such that we ought to treat 
some combinations like "Ch" (two code points = one character) as "fixed 
width" while others like "â" (two code points = one character) as 
"variable width".

To do so is just special pleading. And the thing about special pleading 
is that we're not obliged to accept it. Plead as much as you like, the 
answer is still no.



-- 
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing
it everywhere." -- Jon Ronson



Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-16 Thread Chris Angelico
On Tue, Jul 17, 2018 at 4:22 AM, Richard Damon  wrote:
>
> But I am not talking about those sort of characters or ligatures, but 
> ‘characters’ that are built up of a combining diacritical marks (like 
> accents) and a base character. Unicode define many code points for the more 
> common of these, but many others do not.
>

So, you're talking about "grapheme clusters". Those can be arbitrarily
large and complex. Trolls revel in the ability to adorn base
characters with ridiculous numbers of "dripping" marks, making it hard
to type their names. Since the amount of information in one grapheme
cluster is (as far as I know) potentially infinite, it's fundamentally
impossible to create a fixed-size encoding that can represent them. If
I'm wrong about the possibilities being infinite, then they are
certainly very extensive, as there are MANY combining characters
available (the only question is whether you can use the same
characters multiple times, in which case there are infinite
possibilities, or if not, in which case the possibilities are
base_characters*2^combining_characters aka "virtually infinite").
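
A small sketch of why no fixed size can hold one cluster: nothing stops you from repeating marks, so the number of code points behind a single visible "character" grows without bound. The helper name and the choice of marks are arbitrary.

```python
def pile_up(base, marks, repeat=1):
    """Build one base character followed by many combining marks."""
    return base + "".join(marks) * repeat

# acute, circumflex, diaeresis, dot below
marks = ["\u0301", "\u0302", "\u0308", "\u0323"]

for n in (1, 5, 50):
    cluster = pile_up("o", marks, repeat=n)
    # Code points in the cluster, and its size in bytes as UTF-32.
    print(n, len(cluster), len(cluster.encode("utf-32-le")))
```

Whether a renderer draws all of those marks is a separate question, but as far as I know the string itself is perfectly legal.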

http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries

This is a display feature, not an input feature, and certainly not a
string representation feature.

ChrisA


Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-16 Thread Richard Damon

> On Jul 16, 2018, at 1:36 PM, Steven D'Aprano 
>  wrote:
> 
> On Mon, 16 Jul 2018 13:11:23 -0400, Richard Damon wrote:
> 
>>> On Jul 16, 2018, at 12:51 PM, Steven D'Aprano
>>>  wrote:
>>> 
>>>> On Mon, 16 Jul 2018 00:28:39 +0300, Marko Rauhamaa wrote:
>>>> 
>>>> if your new system used Python3's UTF-32 strings as a foundation, that
>>>> would be an equally naïve misstep. You'd need to reach a notch higher
>>>> and use glyphs or other "semiotic atoms" as building blocks. UTF-32,
>>>> after all, is a variable-width encoding.
>>> 
>>> Python's strings aren't UTF-32. They are sequences of abstract code
>>> points.
>>> 
>>> UTF-32 is not a variable-width encoding.
>>> 
>>> --
>>> Steven D'Aprano
>>> 
>>> 
>> Many consider that UTF-32 is a variable-width encoding because of the
>> combining characters. It can take multiple ‘codepoints’ to define what
>> should be a single ‘character’ for display.
> 
> Ah, well if we're going to start making up our own definitions of terms, 
> then ASCII is a variable-width encoding too.
> 
> "Ch" (a single letter of the alphabet in a number of European languages, 
> including Welsh and Czech) requires two code points in ASCII. Even in 
> English, "qu" could be considered a two-byte "character" (grapheme), and 
> for ASCII users, (c) is a THREE code point character for what ought to be 
> a single character ©.
> 
> The standard definition of variable- and fixed-width encodings refers to 
> how many *code units* is required to make up a single *code point*.
> 
> Under that standard definition, UTF-8 and UTF-16 are variable-width, and 
> UTF-32 is fixed-width. 
> 
> But I'll accept that UTF-32 is variable-width if Marko accepts that ASCII 
> is too.
> 
> -- 
> Steven D'Aprano
> 

But I am not talking about those sorts of characters or ligatures, but 
‘characters’ that are built up of combining diacritical marks (like accents) 
and a base character. Unicode defines code points for many of the more common 
of these, but many others have none.
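
A sketch of the difference using NFC normalization from the standard unicodedata module. The second example assumes that "x" with a macron has no precomposed code point, which as far as I know is the case.

```python
import unicodedata

# a + combining circumflex composes to the single code point U+00E2.
composed = unicodedata.normalize("NFC", "a\u0302")
print(len(composed), hex(ord(composed)))   # 1 0xe2

# x + combining macron: assumed to have no precomposed form,
# so NFC leaves it as two code points.
stuck = unicodedata.normalize("NFC", "x\u0304")
print(len(stuck))
```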


Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-16 Thread Steven D'Aprano
On Mon, 16 Jul 2018 13:11:23 -0400, Richard Damon wrote:

>> On Jul 16, 2018, at 12:51 PM, Steven D'Aprano
>>  wrote:
>> 
>>> On Mon, 16 Jul 2018 00:28:39 +0300, Marko Rauhamaa wrote:
>>> 
>>> if your new system used Python3's UTF-32 strings as a foundation, that
>>> would be an equally naïve misstep. You'd need to reach a notch higher
>>> and use glyphs or other "semiotic atoms" as building blocks. UTF-32,
>>> after all, is a variable-width encoding.
>> 
>> Python's strings aren't UTF-32. They are sequences of abstract code
>> points.
>> 
>> UTF-32 is not a variable-width encoding.
>> 
>> --
>> Steven D'Aprano
>> 
>> 
> Many consider that UTF-32 is a variable-width encoding because of the
> combining characters. It can take multiple ‘codepoints’ to define what
> should be a single ‘character’ for display.

Ah, well if we're going to start making up our own definitions of terms, 
then ASCII is a variable-width encoding too.

"Ch" (a single letter of the alphabet in a number of European languages, 
including Welsh and Czech) requires two code points in ASCII. Even in 
English, "qu" could be considered a two-byte "character" (grapheme), and 
for ASCII users, (c) is a THREE code point character for what ought to be 
a single character ©.

The standard definition of variable- and fixed-width encodings refers to 
how many *code units* are required to make up a single *code point*.

Under that standard definition, UTF-8 and UTF-16 are variable-width, and 
UTF-32 is fixed-width. 
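
That definition is easy to check from Python. A sketch, where code_units is a hypothetical helper and the little-endian codec names are used only to avoid counting a byte order mark:

```python
def code_units(ch, encoding, unit_bytes):
    """Number of code units `encoding` needs for one code point."""
    return len(ch.encode(encoding)) // unit_bytes

# (codec name, size of one code unit in bytes)
ENCODINGS = (("utf-8", 1), ("utf-16-le", 2), ("utf-32-le", 4))

for ch in ("a", "\u00e2", "\u20ac", "\U0001F600"):
    counts = {enc: code_units(ch, enc, size) for enc, size in ENCODINGS}
    print(f"U+{ord(ch):04X}", counts)
```

The UTF-8 and UTF-16 counts vary from one code point to the next (U+1F600 needs 4 and 2 code units respectively), while the UTF-32 count is 1 every time.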

But I'll accept that UTF-32 is variable-width if Marko accepts that ASCII 
is too.


-- 
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing
it everywhere." -- Jon Ronson



Re: Glyphs and graphemes [was Re: Cult-like behaviour]

2018-07-16 Thread Richard Damon
> On Jul 16, 2018, at 12:51 PM, Steven D'Aprano 
>  wrote:
> 
>> On Mon, 16 Jul 2018 00:28:39 +0300, Marko Rauhamaa wrote:
>> 
>> if your new system used Python3's UTF-32 strings as a foundation, that
>> would be an equally naïve misstep. You'd need to reach a notch higher
>> and use glyphs or other "semiotic atoms" as building blocks. UTF-32,
>> after all, is a variable-width encoding.
> 
> Python's strings aren't UTF-32. They are sequences of abstract code 
> points.
> 
> UTF-32 is not a variable-width encoding.
> 
> -- 
> Steven D'Aprano
> 

Many consider that UTF-32 is a variable-width encoding because of the combining 
characters. It can take multiple ‘codepoints’ to define what should be a single 
‘character’ for display.