Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-29 Thread Martin v. Löwis

Antoine Pitrou wrote:
> FWIW, being French, I don't remember hearing any programmer wish (s)he
> could use non-ASCII identifiers, in any programming language. But
> arguably transliteration is very straight-forward (although a bit
> lossy at times ;-)).

My canonical example is François Pinard, who keeps requesting it,
saying that local people were surprised they couldn't use accented
characters in Python.

Perhaps that's because he actually is Quebecian :-)

Regards,
Martin
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-29 Thread Fabien Schwob

> FWIW, being French, I don't remember hearing any programmer wish (s)he
> could use non-ASCII identifiers, in any programming language. But
> arguably transliteration is very straight-forward (although a bit
> lossy at times ;-)).
> 
> I think typeability and reproducibility should be weighed carefully.
> It's nice to have the real letter delta instead of "delta", but how do I
> type it again on my non-Greek keyboard if I want to keep consistent
> naming in the program?
> 
> ASCII is ethnocentric, but it probably can be typed easily with every
> device in the world.
> 
> Also, as a matter of fact, if I type an identifier with an accented
> letter inside, I would like Python to warn me, because it would be a
> typing error on my part.
> 
> Maybe this should be an option at the beginning of any source file (like
> encoding currently). Or is this overkill?

I'm also French and I must say that I agree with you. In my case, the
most important thing is to be able to manage the _data_ in the right
encoding.

I'm currently trying to implement a little search engine in Python (to
improve my skills mainly) and the biggest problem I have to face is how
to manage encodings. Some web pages are in French, in German, in English,
etc. and it takes me a lot of time to handle this problem correctly.

I think it's more useful to be able to manipulate the _data_ simply than
to have accents in identifiers.
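
A minimal sketch of the decoding step this kind of crawler needs, assuming
Python 3 and pages whose declared charset may be missing or wrong. The name
`decode_page` and the fallback order are hypothetical choices for
illustration, not something described in the message above.

```python
def decode_page(raw, declared=None, fallbacks=("utf-8", "latin-1")):
    """Decode raw page bytes, trying the declared charset first.

    Returns a (text, encoding_used) pair.
    """
    candidates = ([declared] if declared else []) + list(fallbacks)
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset: try the next candidate.
            continue
    # Only reached if every candidate failed (latin-1 itself never fails).
    return raw.decode("utf-8", errors="replace"), "utf-8"


# A French phrase stored as Latin-1 bytes is not valid UTF-8,
# so the first fallback fails and the second one is used.
french = "Derrière chaque bogue".encode("latin-1")
text, used = decode_page(french)
```

Note that the Latin-1 fallback always succeeds, so it can silently produce
wrong characters for pages in other legacy encodings; a real crawler would
want a better guess than this.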

-- 
Behind every bug there is a developer, a man who made a mistake.
(Well, OK, sometimes several of them get together on it.)


Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-29 Thread Antoine Pitrou


> Thanks for these data. This mostly reflects my experience with German
> and French users: some people would like to use non-ASCII identifiers
> if they could, others argue they never would as a matter of principle.
> Of course, transliteration is more straight-forward.

FWIW, being French, I don't remember hearing any programmer wish (s)he
could use non-ASCII identifiers, in any programming language. But
arguably transliteration is very straight-forward (although a bit
lossy at times ;-)).

I think typeability and reproducibility should be weighed carefully.
It's nice to have the real letter delta instead of "delta", but how do I
type it again on my non-Greek keyboard if I want to keep consistent
naming in the program?

ASCII is ethnocentric, but it probably can be typed easily with every
device in the world.

Also, as a matter of fact, if I type an identifier with an accented
letter inside, I would like Python to warn me, because it would be a
typing error on my part.

Maybe this should be an option at the beginning of any source file (like
encoding currently). Or is this overkill?
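
Such a warning can be approximated outside the interpreter. The sketch below
assumes Python 3.7 or later (for `str.isascii`) and uses the standard
`tokenize` module; `non_ascii_names` is a hypothetical helper, not an
existing option or tool.

```python
import io
import tokenize


def non_ascii_names(source):
    """Return (line_number, name) pairs for names containing non-ASCII characters."""
    hits = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # NAME tokens cover identifiers and keywords; keywords are always ASCII.
        if tok.type == tokenize.NAME and not tok.string.isascii():
            hits.append((tok.start[0], tok.string))
    return hits


sample = "compteur = 0\nzähler = compteur + 1\n"
found = non_ascii_names(sample)
```

A per-file switch, as suggested above, would then amount to running such a
check only on files that ask for it.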



Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-29 Thread Gustavo J. A. M. Carneiro

On Sat, 2005-10-29 at 10:56 +0200, "Martin v. Löwis" wrote:
> Atsuo Ishimoto wrote:
> > I'm +0.1 for non-ASCII identifiers, although module names should remain
> > ASCII. ASCII identifiers might be encouraged, but as Martin said, it is
> > very useful for some groups of users.
> 
> Thanks for these data. This mostly reflects my experience with German
> and French users: some people would like to use non-ASCII identifiers
> if they could, others argue they never would as a matter of principle.
> Of course, transliteration is more straight-forward.

  Not sure if anyone has made this point already, but unicode
identifiers are also useful for math programs.  The ability to directly
type the math letters, like alpha, omega, etc., would actually make the
code more readable, while still understandable by programmers of all
nationalities.  For instance, you could write:

    Δv = x1 - x0
    if Δv < ε:
        return

Instead of:

    delta_v = x1 - x0
    if delta_v < epsilon:
        return

But anyone that is supposed to understand the code will be able to read
the delta and epsilon symbols.
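
For reference, later Python versions (3.0 onward, via PEP 3131) do accept
such identifiers. A minimal sketch, assuming a Python 3 interpreter and a
UTF-8 source file; the function `converged` and its default tolerance are
hypothetical additions made only so the fragment runs on its own:

```python
def converged(x0, x1, ε=1e-9):
    """Return True when the step from x0 to x1 is smaller than ε."""
    Δv = x1 - x0
    return Δv < ε


small_step = converged(1.0, 1.0)
large_step = converged(0.0, 1.0)
```

The Greek names work as parameter names and keyword arguments too, for
example `converged(0.0, 1.0, ε=2.0)`.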

  Regards.

-- 
Gustavo J. A. M. Carneiro
<[EMAIL PROTECTED]> <[EMAIL PROTECTED]>
The universe is always one step beyond logic


Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-29 Thread Martin v. Löwis

Atsuo Ishimoto wrote:
> I'm +0.1 for non-ASCII identifiers, although module names should remain
> ASCII. ASCII identifiers might be encouraged, but as Martin said, it is
> very useful for some groups of users.

Thanks for these data. This mostly reflects my experience with German
and French users: some people would like to use non-ASCII identifiers
if they could, others argue they never would as a matter of principle.
Of course, transliteration is more straight-forward.

Regards,
Martin


Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-28 Thread Atsuo Ishimoto

Hello from Japan,

I googled discussions about non-ASCII identifiers in Japanese, but I
found no consensus. Major languages such as Java or VB support non-ASCII
identifiers, so projects that use non-ASCII identifiers in their programs
do exist. Not all Japanese programmers think this is a good idea.
Some people enthusiastically prefer Japanese identifiers, but some feel
they reduce readability and are hard to type, some worry about tool
breakage or encoding problems, etc. It seems that smart people don't like
to express a preference for Japanese identifiers, maybe because they
think such a style is not cool, or they are afraid such a confession may
reveal their lack of English ability. ;)

I'm +0.1 for non-ASCII identifiers, although module names should remain
ASCII. ASCII identifiers might be encouraged, but as Martin said, it is
very useful for some groups of users.

On Sat, 29 Oct 2005 00:21:03 +0200
"Martin v. Löwis" <[EMAIL PROTECTED]> wrote:

> Neil Hodgson wrote:
> > This is anecdotal but it appears to me that transliterations are
> > not commonly used apart from learning languages and some minimal help
> > for foreigners such as including transliterated names on railway
> > station name boards.
> 
> That would be my guess also. Transliteration is clearly common for
> Latin-based languages (French, German, Spanish, say), but I doubt
> non-Latin scripts are that often transliterated (even if conventions
> exist).
> 

Yes, transliterations are rarely used in daily life in Japan. For
programming, I know a lot of projects use a transliterated Japanese style,
but I guess they are rather a minority.

--
Atsuo Ishimoto
[EMAIL PROTECTED]
Homepage:http://www.gembook.jp


Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-28 Thread Martin v. Löwis

Neil Hodgson wrote:
> This is anecdotal but it appears to me that transliterations are
> not commonly used apart from learning languages and some minimal help
> for foreigners such as including transliterated names on railway
> station name boards.

That would be my guess also. Transliteration is clearly common for
Latin-based languages (French, German, Spanish, say), but I doubt
non-Latin scripts are that often transliterated (even if conventions
exist).

Regards,
Martin

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-28 Thread Oren Tirosh

On 10/28/05, Neil Hodgson <[EMAIL PROTECTED]> wrote:
> I used to work on software written by Japanese and English speakers
> at Fujitsu with most developers being Japanese. The rules were that
> comments could be in Japanese but identifiers were only allowed to
> contain ASCII characters. Most variable names were poorly chosen with
> s, p, q, fla (boolean=flag) and flafla being popular. When I asked
> some Japanese coders why they didn't use Japanese words expressed in
> ASCII (Romaji), their response was that it was a really weird idea.
>
> This is anecdotal but it appears to me that transliterations are
> not commonly used apart from learning languages and some minimal help
> for foreigners such as including transliterated names on railway
> station name boards.

Israeli programmers generally use English identifiers but
transliterations are common for local business terminology: types of
financial instruments, tax and insurance terminology, employee benefit
plans etc. Yes, it looks weird, but it would be rather pointless to
try to translate them. Even native English speakers would find it
difficult to recognize the translations because they are used to using
them as loan words. Only transliteration (or possibly the use of
non-ASCII identifiers) would make sense in this situation and I do not
think it is unique to Israel.

BTW, I heard about a Cobol shop that had an explicit policy of using
only transliterated identifiers. This resulted in a much smaller
chance of hitting one of Cobol's numerous reserved words. Thankfully,
this is not an issue in Python...

  Oren

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-28 Thread Stephen J. Turnbull

> "Neil" == Neil Hodgson <[EMAIL PROTECTED]> writes:

Neil> Most variable names were poorly chosen with s, p, q, fla
Neil> (boolean=flag) and flafla being popular. When I asked some
Neil> Japanese coders why they didn't use Japanese words expressed
Neil> in ASCII (Romaji), their response was that it was a really
Neil> weird idea.

That may be due to the fact that two-ideograph words will often have a
dozen homonyms, and sometimes several dozen.  I sometimes use kanji in
not-for-general-distribution Emacs LISP code when 2 kanji will give as
expressive an identifier as 10 or 15 ASCII characters.

Neil> This is anecdotal but it appears to me that transliterations
Neil> are not commonly used apart from learning languages

In everyday usage, they're used a lot for identifier-like purposes
like corporate logos.

The only large corpuses of Japanese-oriented Japanese-authored code
I'm familiar with are the input methods Wnn, Canna, and SKK, and these
invariably use transliterated Japanese grammatical terms for parser
components[1], although there are perfectly good equivalents in English,
at least (I think they may actually be standardized by the Ministry of
Education).

There's also an Emacs library, edict.el, that uses _mixed_
ASCII-hiragana-kanji identifiers.  (ISTR that was done just to prove a
point---the person who wrote it was an American, I
believe---definitely not Japanese.)


Footnotes: 
[1]  Japanese does not require word delimiters, so input methods must
have grammatical knowledge to choose among large numbers of homonyms.

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
   Ask not how you can "do" free software business;
  ask what your business can "do for" free software.

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-27 Thread Neil Hodgson

Josiah Carlson:

> According to wikipedia (http://en.wikipedia.org/wiki/Latin_alphabet),
> various languages have adopted a transliteration of their language
> and/or former alphabets into latin.  They don't purport to know all of
> the reasons why, and I'm not going to speculate.

   I used to work on software written by Japanese and English speakers
at Fujitsu with most developers being Japanese. The rules were that
comments could be in Japanese but identifiers were only allowed to
contain ASCII characters. Most variable names were poorly chosen with
s, p, q, fla (boolean=flag) and flafla being popular. When I asked
some Japanese coders why they didn't use Japanese words expressed in
ASCII (Romaji), their response was that it was a really weird idea.

   This is anecdotal but it appears to me that transliterations are
not commonly used apart from learning languages and some minimal help
for foreigners such as including transliterated names on railway
station name boards.

   Neil

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-27 Thread Martin v. Löwis

Greg Ewing wrote:
> I still think this is a much worse potential problem
> than that of "l" vs "1", etc. It's reasonable to
> adopt the practice of never using "l" as a single
> letter identifier, for example. But it would be
> unreasonable to ban the use of "E" as an identifier
> on the grounds that someone somewhere might confuse
> it with a capital epsilon.

As a style guide, people should use single-letter
identifiers only for local variables. If they follow
the guideline, it should be easy to tell whether
such an identifier is Latin or Greek (if everything
else in the function is Latin, the E likely is as
well).

> An alternative would be to identify such confusable
> letters in the various alphabets and define them
> to be equivalent.

pylint could check for such things (although I very
much doubt it would have any hits in the next 10
years).
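
A check of this kind could be sketched as follows, assuming Python 3. It
approximates a character's script by the first word of its Unicode name,
which is a rough heuristic rather than the Unicode script property, and it
is not how pylint works; `scripts_used` and `is_mixed_script` are
hypothetical names.

```python
import unicodedata


def scripts_used(identifier):
    """Return the approximate scripts of the letters in an identifier.

    The script is guessed from the first word of each letter's Unicode
    name, e.g. 'LATIN', 'GREEK' or 'CYRILLIC'. Digits and underscores
    are ignored.
    """
    scripts = set()
    for ch in identifier:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def is_mixed_script(identifier):
    """Return True when the identifier's letters come from more than one script."""
    return len(scripts_used(identifier)) > 1


latin_only = scripts_used("delta")
lookalike = is_mixed_script("\u0395rror")  # GREEK CAPITAL EPSILON + "rror"
```

The heuristic is deliberately crude: a name such as Δv mixes Greek and Latin
letters and would be flagged as well, as would Japanese names combining kanji
and hiragana, so a usable checker would need a finer rule.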

> And beyond the issue of alphabets there's also the
> question of whether accented characters should be
> considered distinct. I can see quite a few holy
> flame wars erupting over that...

For that, there is the Unicode TR that precisely
defines how this should be done. People should then
have their wars with the Unicode consortium.
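
Which report is meant is not stated here. As one illustration of the kind of
rule involved, and assuming Unicode normalization forms are the relevant
mechanism, the sketch below uses Python 3's `unicodedata.normalize` to show
that a precomposed and a decomposed spelling of the same accented letter
differ as raw strings but compare equal once normalized:

```python
import unicodedata

composed = "z\u00e4hler"     # 'ä' as a single code point
decomposed = "za\u0308hler"  # 'a' followed by COMBINING DIAERESIS

# The raw strings differ in length and content.
same_raw = composed == decomposed

# After normalizing both to NFC they are identical.
same_nfc = (unicodedata.normalize("NFC", composed)
            == unicodedata.normalize("NFC", decomposed))
```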

Regards,
Martin

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-27 Thread M.-A. Lemburg

Greg Ewing wrote:
> M.-A. Lemburg wrote:
> 
> 
>>If you are told to debug a program
>>written by say a Japanese programmer using Japanese identifiers
>>you are going to have a really hard time.
> 
> 
> Or you could look upon it as an opportunity to
> broaden your mental horizons by learning some
> Japanese. :-)

I just took Japanese as an example of a language and script
that I don't know anything about. I would actually love
to learn some Japanese, but simply don't have the time
for learning it.

Anyway, I could just as well have chosen Tibetan, Thai or Limbu
scripts (which all look very nice, BTW):

http://www.unicode.org/charts/

Perhaps this is not as bad after all - I just don't think that
it will help code readability in the long run.

-- 
Marc-Andre Lemburg
eGenix.com


Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-27 Thread M.-A. Lemburg

Martin v. Löwis wrote:
> M.-A. Lemburg wrote:
> 
>>You even argued against having non-ASCII identifiers:
>>
>>http://mail.python.org/pipermail/python-list/2002-May/102936.html
> 
> 
> I see :-) It seems I have changed my mind since then (which
> apparently predates PEP 263).
> 
> One issue I apparently was worried about was the plan to use
> native-encoding byte strings for the identifiers; this I didn't
> like at all.
> 
> 
>>* Unicode identifiers are going to introduce massive
>>code breakage - just think of all the tools people use
>>to manipulate Python code today; I'm quite sure that
>>most of it will fail in one way or another if you present
>>it Unicode literals such as in "zähler += 1".
> 
> 
> True. Today, I think I would be willing to accept the
> code breakage: these tools had quite some time to update
> themselves to PEP 263 (even though not all of them have
> done so yet); also, usage of the feature would only spread
> gradually. A failure to support the feature in the Python
> proper would be treated as a bug by us; how tool providers
> deal with the feature would be their choice.

I was thinking of introspection and debugging tools.
These would then see Unicode objects in the namespace
dictionaries and this will likely break a lot of code -
much for the same reason you see code breakage now
if you let Unicode objects enter the Python standard lib
without warning :-)

>>* People don't seem very interested in using Unicode
>>identifiers, e.g.
>>
>>  http://mail.python.org/pipermail/i18n-sig/2001-February/000828.html
> 
> 
> True. However, I also suspect that lack of tool support
> contributes to that. For the specific case of Java,
> there is no notion of source encoding, which makes Unicode
> identifiers really tedious to use.
> 
> If it were really easy to use, I assume people would actually
> use it - at least in some of the contexts, like teaching,
> where Python is also widely used.

Well, that has two sides: Of course, you'll always find
some people that will like a certain feature. The question
is what effects does it have on the rest of us.

Python has always put some constraints on programmers
to raise code readability, e.g. white space awareness.
Giving them Unicode identifiers sounds like a step
backwards in this context.

Note that I'm not talking about comments, string literal
contents, etc. - only the programming logic, i.e. keywords
and identifiers.

>>Do you really think that it will help with code readability
>>if programmers are allowed to use native scripts for their
>>identifiers ?
> 
> 
> Yes, I do - for some groups of users. Of course, code sharing
> would be more difficult, and there certainly should be a policy
> to use only ASCII in the standard library. But within local
> groups, users would find understanding code easier if they
> knew what the identifiers actually meant.

Hmm, but why do you think they wouldn't understand the meaning of
ASCII versions of the identifiers ?

Note that using ASCII doesn't necessarily mean that you
have to use English as basis for the naming schemes of
identifiers.

>>If you are told to debug a program
>>written by say a Japanese programmer using Japanese identifiers
>>you are going to have a really hard time. Integrating such
>>code into other applications will be even harder, since you'd
>>be forced to use his Japanese class names in your application.
> 
> 
> Certainly, yes. There is a trade-off: you can make it easier
> for some people to read and write code if they can use their
> native script; at the same time, it would be harder for others
> to read and modify it.
> 
> It's a policy decision whether you use English identifiers or
> not - it shouldn't be a technical decision (as it currently
> is).

See above: ASCII != English. Most scripts have a transliteration
into ASCII - simply because that's the global standard for
scripts.

>>I think source code encodings provide an ideal way to
>>have comments written in native scripts - and people
>>use that a lot. However, keeping the program code itself
>>in plain ASCII makes it far more readable and reusable
>>across locales. Something that's important in this
>>globalized world.
> 
> 
> Certainly. However, some programs don't need to live in
> a globalized world - e.g. if they are homework in a school.
> Within a locale, using native scripts would make the program
> more readable.

True.

-- 
Marc-Andre Lemburg
eGenix.com


Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-26 Thread Josiah Carlson


"Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> Josiah Carlson wrote:
> > According to wikipedia (http://en.wikipedia.org/wiki/Latin_alphabet),
> > various languages have adopted a transliteration of their language
> > and/or former alphabets into latin.  They don't purport to know all of
> > the reasons why, and I'm not going to speculate.
> > 
> > Whether or not more languages start using the latin alphabet is a good
> > question.  Basing judgement on history and likely globalization, it is
> > only a matter of time before basically all languages have a
> > transcription into the latin alphabet that is taught to all (unless
> > China takes over the world).
> 
> That is a very U.S. centric view. I don't share it, but I think it is
> pointless to argue against it.

I should have included a ;).  Whether or not in the future all languages
use the latin alphabet should have little to do with whether Python
chooses to support non-latin identifiers in the forthcoming 2.5 or later
releases.

 - Josiah


Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-26 Thread Greg Ewing

Martin v. Löwis wrote:

> Not in the literal sense: you certainly want to allow
> "latin" digits in, say, a cyrillic identifier.

Yes, by "alphabet" I really only meant the letters,
although you might want to apply the same idea to
clusters of digits within an identifier, depending
on how potentially confusable the various sets of
digits are -- I'm not familiar enough with alternative
digit sets to know whether that would be a problem.

 > Just because
> you *can* come up with look-alike identifiers doesn't
> mean that people will use them, or that they will mistake
> the scripts (except for deliberately doing so, of
> course).

I still think this is a much worse potential problem
than that of "l" vs "1", etc. It's reasonable to
adopt the practice of never using "l" as a single
letter identifier, for example. But it would be
unreasonable to ban the use of "E" as an identifier
on the grounds that someone somewhere might confuse
it with a capital epsilon.

An alternative would be to identify such confusable
letters in the various alphabets and define them
to be equivalent.

And beyond the issue of alphabets there's also the
question of whether accented characters should be
considered distinct. I can see quite a few holy
flame wars erupting over that...

-- 
Greg Ewing, Computer Science Dept, +--+
University of Canterbury,  | A citizen of NewZealandCorp, a   |
Christchurch, New Zealand  | wholly-owned subsidiary of USA Inc.  |
[EMAIL PROTECTED]  +--+

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-26 Thread Greg Ewing

M.-A. Lemburg wrote:

> If you are told to debug a program
> written by say a Japanese programmer using Japanese identifiers
> you are going to have a really hard time.

Or you could look upon it as an opportunity to
broaden your mental horizons by learning some
Japanese. :-)

-- 
Greg Ewing, Computer Science Dept, +--+
University of Canterbury,  | A citizen of NewZealandCorp, a   |
Christchurch, New Zealand  | wholly-owned subsidiary of USA Inc.  |
[EMAIL PROTECTED]  +--+

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-26 Thread Martin v. Löwis

Josiah Carlson wrote:
> According to wikipedia (http://en.wikipedia.org/wiki/Latin_alphabet),
> various languages have adopted a transliteration of their language
> and/or former alphabets into latin.  They don't purport to know all of
> the reasons why, and I'm not going to speculate.
> 
> Whether or not more languages start using the latin alphabet is a good
> question.  Basing judgement on history and likely globalization, it is
> only a matter of time before basically all languages have a
> transcription into the latin alphabet that is taught to all (unless
> China takes over the world).

That is a very U.S. centric view. I don't share it, but I think it is
pointless to argue against it.

Regards,
Martin

Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-26 Thread Josiah Carlson

"Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> 
> Josiah Carlson wrote:
> > In this case it's not just a misreading, the characters look identical! 
> > When is an 'E' not an 'E'?  When it is an Epsilon or Ie.  Saying what
> > characters will or will not be used as identifiers, when those
> > characters are keys on a keyboard of a specific type, is pretty
> > presumptuous.
> 
> Why is that rude and disrespectful? I'm certainly respecting developers
> who want to use their scripts for identifiers, or else I would not have
> suggested that they could do so.

I never said rude, I said presumptuous.  "Going beyond what is right or
proper; excessively forward." (according to dictionary.com, the OED has
a similar definition).  I was trying to say that in stating that users
wouldn't be using keys on their keyboard in their natural language when
also using English characters, that you were assuming a bit about their
usage patterns that you perhaps shouldn't.  I certainly could also be
presumptuous in stating that users may very well mix certain languages,
but it seems to be more likely given keywords and the standard library
using the latin alphabet.

> > Indeed, they are similar, but _different_ in my font as well.  The trick
> > is that the glyphs are not different in the case of certain greek or
> > cyrillic letters.  They don't just /look/ similar they /are identical/.
> 
> This string: "EΕ" is the LATIN CAPITAL LETTER E, followed by the GREEK
> CAPITAL LETTER EPSILON. In the font my email composer uses, the E is
> slightly larger than the Epsilon - so there /is/ a visual difference.

My email client doesn't handle unicode, but a quick check by swapping
fonts in a word processor shows that, at least on my platform, all
three are the same glyph (same size, shape, ...) for all fixed-width
fonts. If a platform distinguishes all three, then one should consider
one's platform lucky.  Not all platforms and/or preferred fonts of users
are.

> But even if there isn't: if this was a frequent problem, the name
> error could include an alternative representation (say, with Unicode
> ordinals for non-ASCII characters) which would give an easy visual
> clue.

It would offer a great cue, but I'm not sure if it is possible.  I think
that it sounds like an ugly discussion of stdout/err encodings and
exception handling machinery that I don't want to be a part of.
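
The alternative representation suggested in the quoted text can be sketched
without touching the exception machinery, assuming Python 3. `ascii_repr` is
a hypothetical helper; it relies on the standard `backslashreplace` error
handler:

```python
def ascii_repr(name):
    """Render an identifier with every non-ASCII character as a backslash escape."""
    return name.encode("ascii", "backslashreplace").decode("ascii")


latin = "E"
greek = "\u0395"     # GREEK CAPITAL LETTER EPSILON
cyrillic = "\u0415"  # CYRILLIC CAPITAL LETTER IE

# The three look-alike letters become distinguishable in plain ASCII.
rendered = [ascii_repr(ch) for ch in (latin, greek, cyrillic)]
```

Because the result is pure ASCII, it sidesteps the stdout/stderr encoding
question for the message text itself.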

> I still doubt that this is a frequent problem, and I don't see any
> better grounds for claiming that it is than for claiming that it
> is not.

Whether or not it is frequent will depend on the prevalence of desire to
use those characters.  While I don't think that such uses will be as
common as using 'klass' when passing a class, I do think that it will
result in more than a few sf bug reports.  I also share Marc-Andre
Lemburg's concerns about the understandability of code written in Kanji,
Hebrew, Arabic, etc., at least for those who have not memorized the
entirety of those alphabets.

 - Josiah


Re: [Python-Dev] Divorcing str and unicode (no more implicit conversions).

2005-10-26 Thread Josiah Carlson

"Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> 
> M.-A. Lemburg wrote:
> > You even argued against having non-ASCII identifiers:
> > 
> > http://mail.python.org/pipermail/python-list/2002-May/102936.html
> > 
> > Do you really think that it will help with code readability
> > if programmers are allowed to use native scripts for their
> > identifiers ?
> 
> Yes, I do - for some groups of users. Of course, code sharing
> would be more difficult, and there certainly should be a policy
> to use only ASCII in the standard library. But within local
> groups, users would find understanding code easier if they
> knew what the identifiers actually meant.

According to wikipedia (http://en.wikipedia.org/wiki/Latin_alphabet),
various languages have adopted a transliteration of their language
and/or former alphabets into latin.  They don't purport to know all of
the reasons why, and I'm not going to speculate.

Whether or not more languages start using the latin alphabet is a good
question.  Basing judgement on history and likely globalization, it is
only a matter of time before basically all languages have a
transcription into the latin alphabet that is taught to all (unless
China takes over the world).

 - Josiah


Re: [Python-Dev] Divorcing str and unicode (no more implicitconversions).

2005-10-26 Thread Martin v. Löwis

M.-A. Lemburg wrote:
> You even argued against having non-ASCII identifiers:
> 
> http://mail.python.org/pipermail/python-list/2002-May/102936.html

I see :-) It seems I have changed my mind since then (which
apparently predates PEP 263).

One issue I apparently was worried about was the plan to use
native-encoding byte strings for the identifiers; this I didn't
like at all.

> * Unicode identifiers are going to introduce massive
> code breakage - just think of all the tools people use
> to manipulate Python code today; I'm quite sure that
> most of it will fail in one way or another if you present
> it Unicode literals such as in "zähler += 1".

True. Today, I think I would be willing to accept the
code breakage: these tools had quite some time to update
themselves to PEP 263 (even though not all of them have
done so yet); also, usage of the feature would only spread
gradually. A failure to support the feature in Python
proper would be treated as a bug by us; how tool providers
deal with the feature would be their choice.

> * People don't seem very interested in using Unicode
> identifiers, e.g.
> 
>   http://mail.python.org/pipermail/i18n-sig/2001-February/000828.html

True. However, I also suspect that lack of tool support
contributes to that. For the specific case of Java,
there is no notion of source encoding, which makes Unicode
identifiers really tedious to use.

If it were really easy to use, I assume people would actually
use it - at least in some of the contexts, like teaching,
where Python is also widely used.

> Do you really think that it will help with code readability
> if programmers are allowed to use native scripts for their
> identifiers ?

Yes, I do - for some groups of users. Of course, code sharing
would be more difficult, and there certainly should be a policy
to use only ASCII in the standard library. But within local
groups, users would find understanding code easier if they
knew what the identifiers actually meant.

> If you are told to debug a program
> written by say a Japanese programmer using Japanese identifiers
> you are going to have a really hard time. Integrating such
> code into other applications will be even harder, since you'd
> be forced to use his Japanese class names in your application.

Certainly, yes. There is a trade-off: you can make it easier
for some people to read and write code if they can use their
native script; at the same time, it would be harder for others
to read and modify it.

It's a policy decision whether you use English identifiers or
not - it shouldn't be a technical decision (as it currently
is).

> I think source code encodings provide an ideal way to
> have comments written in native scripts - and people
> use that a lot. However, keeping the program code itself
> in plain ASCII makes it far more readable and reusable
> across locales. Something that's important in this
> globalized world.

Certainly. However, some programs don't need to live in
a globalized world - e.g. if they are homework in a school.
Within a locale, using native scripts would make the program
more readable.

Regards,
Martin

Re: [Python-Dev] Divorcing str and unicode (no more implicitconversions).

2005-10-26 Thread M.-A. Lemburg

Martin v. Löwis wrote:
> M.-A. Lemburg wrote:
> 
>>A few years ago we had a discussion about this on python-dev
>>and agreed to stick with ASCII identifiers for Python. I still
>>think that's the right way to go.
> 
> I don't think there ever was such an agreement.

You even argued against having non-ASCII identifiers:

http://mail.python.org/pipermail/python-list/2002-May/102936.html

and I agree with you on most of the points you make in that
posting:

* Unicode identifiers are going to introduce massive
code breakage - just think of all the tools people use
to manipulate Python code today; I'm quite sure that
most of it will fail in one way or another if you present
it Unicode literals such as in "zähler += 1".

* People don't seem very interested in using Unicode
identifiers, e.g.

  http://mail.python.org/pipermail/i18n-sig/2001-February/000828.html

most of the few who did comment, said they'd rather have
ASCII identifiers, e.g.

  http://mail.python.org/pipermail/python-list/2002-May/104050.html

Do you really think that it will help with code readability
if programmers are allowed to use native scripts for their
identifiers ?

I think this goes beyond just visual aspects of being able
to distinguish graphemes:

If you are told to debug a program
written by say a Japanese programmer using Japanese identifiers
you are going to have a really hard time. Integrating such
code into other applications will be even harder, since you'd
be forced to use his Japanese class names in your application.
This doesn't only introduce problems with being able to enter
the Japanese identifiers, it will also cause your application
to suddenly contain identifiers in Japanese even though that's
not your native script.

I think source code encodings provide an ideal way to
have comments written in native scripts - and people
use that a lot. However, keeping the program code itself
in plain ASCII makes it far more readable and reusable
across locales. Something that's important in this
globalized world.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Oct 26 2005)
>>> Python/Zope Consulting and Support ...http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! 

Re: [Python-Dev] Divorcing str and unicode (no more implicitconversions).

2005-10-26 Thread Walter Dörwald

Am 25.10.2005 um 23:40 schrieb Josiah Carlson:

> [...]
> Identically drawn glyphs are a problem, and pretending that they aren't
> a problem, doesn't make it so.  Right now, all possible name glyphs are
> visually distinct, which would not be the case if any unicode character
> could be used as a name (except for numerals).  Speaking of which, would
> we then be offering support for arabic/indic numeric literals, and/or
> support it in int()/float()?

It's already supported in int() and float()

 >>> int(u"\u136c\u2082")
42
 >>> float(u"\u0664\u09e8")
42.0

But not as literals:

# -*- coding: unicode-escape -*-

print \u136c\u2082

This gives (on the Mac):

   File "encoding.py", line 3
 print ፬₂
   ^
SyntaxError: invalid syntax
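
For readers on a current interpreter, here is a minimal sketch of the same point, assuming Python 3 (where every str is Unicode) rather than the 2.x used in this thread. The helper name ascii_digits is made up for illustration; unicodedata.decimal(), int() and float() are standard.

```python
import unicodedata

def ascii_digits(text):
    """Map each decimal-digit character, from any script, to its ASCII digit.

    Hypothetical helper for illustration; unicodedata.decimal() raises
    ValueError for characters that have no decimal value.
    """
    return "".join(str(unicodedata.decimal(ch)) for ch in text)

# ARABIC-INDIC DIGIT FOUR followed by BENGALI DIGIT TWO
mixed = "\u0664\u09e8"

print(ascii_digits(mixed))        # the digits spelled in ASCII
print(int(mixed), float(mixed))   # int() and float() accept such digits too
```

Writing the same characters as a bare numeric literal in source code is still rejected by the parser, so the distinction between conversion functions and literals holds.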

> [...]

Bye,
Walter Dörwald


Re: [Python-Dev] Divorcing str and unicode (no more implicitconversions).

2005-10-25 Thread Stephen J. Turnbull

> "Josiah" == Josiah Carlson <[EMAIL PROTECTED]> writes:

Josiah> Indeed, they are similar, but _different_ in my font as
Josiah> well.  The trick is that the glyphs are not different in
Josiah> the case of certain greek or cyrillic letters.  They don't
Josiah> just /look/ similar they /are identical/.

But these problems are going to arise in _any_ multilingual context;
they're not at all specific to identifiers.  It's just that computers
lexing identifiers are kinda picky about those things compared to
humans.  I think you can reasonably classify it as a new breed of
typo, and develop UIs to deal with it in that way.

To handle cases where glyphs are (nearly) identical, UIs that visually
flag "foreign" characters, at least in contexts where cross-block
punning is unacceptable, will be developed, and users will learn to
pay attention to those flags.


-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
   Ask not how you can "do" free software business;
  ask what your business can "do for" free software.

Re: [Python-Dev] Divorcing str and unicode (no more implicitconversions).

2005-10-25 Thread Martin v. Löwis

Greg Ewing wrote:
> Would it help if an identifier were required to be
> made up of letters from the same alphabet, e.g. all
> Latin or all Greek or all Cyrillic, but not a mixture.
> Then you'd get an immediate error if you accidentally
> slipped in a letter from the wrong alphabet.

Not in the literal sense: you certainly want to allow
"latin" digits in, say, a cyrillic identifier. See

http://www.unicode.org/reports/tr31/

for what the Unicode consortium recommends to do.
In addition to the strict specification, they envision
usage guidelines. This seems Pythonic: just because
you could potentially shoot yourself in the foot doesn't
mean it should be banned from the language.

IOW, whether it would help largely depends on whether
the problem is real in the first place. Just because
you *can* come up with look-alike identifiers doesn't
mean that people will use them, or that they will mistake
the scripts (except for deliberately doing so, of
course).
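
A rough sketch of the "one alphabet per identifier" check, with digits and underscores exempted as discussed above. Assumptions: Python 3, and the first word of the Unicode character name is used as a stand-in for the script, since the standard library has no script property. The function names are invented for illustration, and this is not the TR31 algorithm.

```python
import unicodedata

def scripts_used(identifier):
    """Return the set of (approximate) scripts of the letters in identifier.

    Approximation: the first word of the character name, e.g. 'LATIN',
    'GREEK', 'CYRILLIC'. Digits and underscores are ignored.
    """
    scripts = set()
    for ch in identifier:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def is_single_script(identifier):
    """True if all letters in identifier come from at most one script."""
    return len(scripts_used(identifier)) <= 1

for name in ["PEN", "P\u0395N", "\u0441\u0447\u0451\u0442_1"]:
    print(ascii(name), sorted(scripts_used(name)), is_single_script(name))
```

Here "P\u0395N" mixes a Greek Epsilon into Latin letters and is flagged, while the Cyrillic name with a trailing "_1" passes.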

Regards,
Martin

Re: [Python-Dev] Divorcing str and unicode (no more implicitconversions).

2005-10-25 Thread Martin v. Löwis

Josiah Carlson wrote:
> In this case it's not just a misreading, the characters look identical! 
> When is an 'E' not an 'E'?  When it is an Epsilon or Ie.  Saying what
> characters will or will not be used as identifiers, when those
> characters are keys on a keyboard of a specific type, is pretty
> presumptuous.

Why is that rude and disrespectful? I'm certainly respecting developers
who want to use their scripts for identifiers, or else I would not have
suggested that they could do so.

However, from the experience with my own language, and the three or so
foreign languages I know, I can tell you that people normally
don't mix identifiers of different scripts.

> Sure, that example was made up, but there are words which have been
> stolen from various languages by english, and you are discounting the
> case of single-letter temporary variables.  Saying what will and won't
> happen over the course of using unicode identifiers is quite the
> prediction.

Sure, people can make mistakes. They get an error, and then will
need to find the cause of the problem. Sometimes, this will be easy,
and sometimes, it will not.

> Indeed, they are similar, but _different_ in my font as well.  The trick
> is that the glyphs are not different in the case of certain greek or
> cyrillic letters.  They don't just /look/ similar they /are identical/.

This string: "EΕ" is the LATIN CAPITAL LETTER E, followed by the GREEK
CAPITAL LETTER EPSILON. In the font my email composer uses, the E is
slightly larger than the Epsilon - so there /is/ a visual difference.

But even if there isn't: if this was a frequent problem, the name
error could include an alternative representation (say, with Unicode
ordinals for non-ASCII characters) which would give an easy visual
clue.
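
A minimal sketch of such an alternative representation, assuming Python 3. The helper name ascii_repr is hypothetical; the "backslashreplace" error handler is a standard codec feature.

```python
def ascii_repr(name):
    """Spell out every non-ASCII character in name as a backslash escape."""
    return name.encode("ascii", "backslashreplace").decode("ascii")

# LATIN P, GREEK CAPITAL LETTER EPSILON, LATIN N: looks like "PEN"
lookalike = "P\u0395N"

print(ascii_repr("PEN"))       # plain ASCII passes through unchanged
print(ascii_repr(lookalike))   # the Epsilon shows up as an escape
```

An error message that printed both forms would make the foreign letter visible even in a font where the glyphs are identical.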

I still doubt that this is a frequent problem, and I don't see any
better grounds for claiming that it is than for claiming that it
is not.

Regards,
Martin

Re: [Python-Dev] Divorcing str and unicode (no more implicitconversions).

2005-10-25 Thread Greg Ewing

Martin v. Löwis wrote:

> For window.draw, people will readily understand that
> they are supposed to use Latin letters. More generally, they will know
> what script to use just from looking at the identifier.

Would it help if an identifier were required to be
made up of letters from the same alphabet, e.g. all
Latin or all Greek or all Cyrillic, but not a mixture.
Then you'd get an immediate error if you accidentally
slipped in a letter from the wrong alphabet.

Greg


Re: [Python-Dev] Divorcing str and unicode (no more implicitconversions).

2005-10-25 Thread Neil Hodgson

Martin v. Löwis:

> This aspect of rendering is often not implemented, though. Web browsers
> do it correctly, see
> ...
> GUI frameworks sometimes do it correctly, sometimes don't; most
> notably, Tk has no good support for RTL text.

   Scintilla does a rough job with this. RTL text is displayed
correctly as the underlying platform libraries (Windows or GTK+/Pango)
handle this aspect when called to draw text. However, editing is not
handled correctly: the caret is misplaced within RTL text and there
are other visual glitches. There is interest in the area and
even a funding proposal this week.

   Neil

Re: [Python-Dev] Divorcing str and unicode (no more implicitconversions).

2005-10-25 Thread Josiah Carlson


Guido van Rossum <[EMAIL PROTECTED]> wrote:
> 
> On 10/25/05, Josiah Carlson <[EMAIL PROTECTED]> wrote:
> > Indeed, they are similar, but _different_ in my font as well.  The trick
> > is that the glyphs are not different in the case of certain greek or
> > cyrillic letters.  They don't just /look/ similar they /are identical/.
> 
> Well, in the font I'm using to read this email, I and l are /identical/.

In all fonts I've seen, E/Epsilon/Ie are /always identical/.

 - Josiah


Re: [Python-Dev] Divorcing str and unicode (no more implicitconversions).

2005-10-25 Thread Guido van Rossum

On 10/25/05, Josiah Carlson <[EMAIL PROTECTED]> wrote:
> Indeed, they are similar, but _different_ in my font as well.  The trick
> is that the glyphs are not different in the case of certain greek or
> cyrillic letters.  They don't just /look/ similar they /are identical/.

Well, in the font I'm using to read this email, I and l are /identical/.

--
--Guido van Rossum (home page: http://www.python.org/~guido/)

Re: [Python-Dev] Divorcing str and unicode (no more implicitconversions).

2005-10-25 Thread Josiah Carlson


"Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> 
> Josiah Carlson wrote:
> > And how users could say, "name error? But I typed in window.draw(PEN) as
> > I was told to, and it didn't work!"
> 
> Ah, so the "serious issues" you are talking about are not security 
> issues, but usability issues.

Indeed, it was a misunderstanding, as the email stated:
I did not mean to imply that I was concerned about the security
implications of inserting arbitrary identifiers in Python (I was
mentioning the web browser case for an example of how such
characters have been confusing previously), I am concerned about
confusion involved with using: [glyphs which are identical]


> I don't think extending the range of acceptable characters will
> cause any additional confusion. Users are already getting "surprising"
> NameErrors/AttributeErrors in the following cases:
> - they just misspell the identifier, and then, when the error message
>    is printed, fail to recognize the difference, as they read over the
>    typo just like they read over it when mistyping it in the first place.

In this case it's not just a misreading, the characters look identical! 
When is an 'E' not an 'E'?  When it is an Epsilon or Ie.  Saying what
characters will or will not be used as identifiers, when those
characters are keys on a keyboard of a specific type, is pretty
presumptuous.


> - they run into confusions with different things having the same names
>    in different contexts. For example, they wonder why they get TypeError
>    for passing the wrong number of arguments to a function, when the
>    call matches exactly what the source code in front of them tells
>    them - only that they were calling a different function which just
>    happened to have the same name.

Right, and users should be reading the documentation for the functions
and methods they are calling.


> In the light of these common mistakes, your example with an identifier
> named PEN, where the "P" might be a cyrillic letter or the E a greek one
> is just made up: For window.draw, people will readily understand that
> they are supposed to use Latin letters. More generally, they will know
> what script to use just from looking at the identifier.

Sure, that example was made up, but there are words which have been
stolen from various languages by english, and you are discounting the
case of single-letter temporary variables.  Saying what will and won't
happen over the course of using unicode identifiers is quite the
prediction.


> > Identically drawn glyphs are a problem, and pretending that they aren't
> > a problem, doesn't make it so.  Right now, all possible name glyphs are
> > visually distinct
> 
> Not at all: Just compare Fool and Foo1 (and perhaps FooI)
> 
> In the font in which I'm typing this, these are slightly different - but
> there are fonts in which the difference is really difficult to
> recognize.

Indeed, they are similar, but _different_ in my font as well.  The trick
is that the glyphs are not different in the case of certain greek or
cyrillic letters.  They don't just /look/ similar they /are identical/.

 - Josiah


Re: [Python-Dev] Divorcing str and unicode (no more implicitconversions).

2005-10-25 Thread Martin v. Löwis

Guido van Rossum wrote:
> This actually seems a killer even for allowing Unicode in comments,
> which I'd otherwise favor. What do Unicode-aware apps generally do
> with right-to-left characters?

The Unicode standard has an elaborate definition of what should happen.
There are many rules to it, but essentially, there is the notion of a
"primary" direction, which then is toggled based on the directionality
of each character (unicodedata.bidirectional). There are also formatting
characters which toggle the direction.
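
To look at the per-character directionality that these rules start from, here is a small sketch, assuming Python 3. The function name directions is invented for illustration; unicodedata.bidirectional() is the standard call mentioned above. It only reports the character classes and does not implement the reordering algorithm itself.

```python
import unicodedata

def directions(text):
    """Return the bidirectional class of each character in text."""
    return [unicodedata.bidirectional(ch) for ch in text]

# Two Hebrew letters (PE, YOD) followed by a Latin fragment in parentheses
sample = "\u05e4\u05d9 (Py)"

for ch, cls in zip(sample, directions(sample)):
    print(ascii(ch), cls)
```

Hebrew letters report class "R" and Latin letters "L"; a renderer has to resolve the neutral characters in between (spaces, parentheses) from their context.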

This aspect of rendering is often not implemented, though. Web browsers
do it correctly, see

http://he.wikipedia.org/wiki/Python

where all text should come out right-adjusted, yet the Latin fragments
are still left to right (such as "Guido van Rossum")

Integrating it into this text looks like this: פייתון (Python).

GUI frameworks sometimes do it correctly, sometimes don't; most
notably, Tk has no good support for RTL text.

Regards,
Martin


Re: [Python-Dev] Divorcing str and unicode (no more implicitconversions).

2005-10-25 Thread Martin v. Löwis

M.-A. Lemburg wrote:
> A few years ago we had a discussion about this on python-dev
> and agreed to stick with ASCII identifiers for Python. I still
> think that's the right way to go.

I don't think there ever was such an agreement.

Regards,
Martin


Re: [Python-Dev] Divorcing str and unicode (no more implicitconversions).

2005-10-25 Thread Martin v. Löwis

Josiah Carlson wrote:
> And how users could say, "name error? But I typed in window.draw(PEN) as
> I was told to, and it didn't work!"

Ah, so the "serious issues" you are talking about are not security 
issues, but usability issues.

I don't think extending the range of acceptable characters will
cause any additional confusion. Users are already getting "surprising"
NameErrors/AttributeErrors in the following cases:
- they just misspell the identifier, and then, when the error message
   is printed, fail to recognize the difference, as they read over the
   typo just like they read over it when mistyping it in the first place.

- they run into confusions with different things having the same names
   in different contexts. For example, they wonder why they get TypeError
   for passing the wrong number of arguments to a function, when the
   call matches exactly what the source code in front of them tells
   them - only that they were calling a different function which just
   happened to have the same name.

In the light of these common mistakes, your example with an identifier
named PEN, where the "P" might be a cyrillic letter or the E a greek one
is just made up: For window.draw, people will readily understand that
they are supposed to use Latin letters. More generally, they will know
what script to use just from looking at the identifier.

> Identically drawn glyphs are a problem, and pretending that they aren't
> a problem, doesn't make it so.  Right now, all possible name glyphs are
> visually distinct

Not at all: Just compare Fool and Foo1 (and perhaps FooI)

In the font in which I'm typing this, these are slightly different - but
there are fonts in which the difference is really difficult to
recognize.

> Speaking of which, would
> we then be offering support for arabic/indic numeric literals, and/or
> support it in int()/float()?

No. None of the Arabic users have ever requested such a feature, so
it would be stupid to provide it. We provide extended identifiers not
for the fun of it, but because users are requesting them.

Regards,
Martin

Re: [Python-Dev] Divorcing str and unicode (no more implicitconversions).

2005-10-25 Thread Guido van Rossum

On 10/25/05, Josiah Carlson <[EMAIL PROTECTED]> wrote:
> Identically drawn glyphs are a problem, and pretending that they aren't
> a problem, doesn't make it so.  Right now, all possible name glyphs are
> visually distinct, which would not be the case if any unicode character
> could be used as a name (except for numerals).  Speaking of which, would
> we then be offering support for arabic/indic numeric literals, and/or
> support it in int()/float()?  Ideally I would like to say yes, but I
> could see the confusion if such were allowed.

This problem isn't new. There are plenty of fonts where 1 and l are
hard to distinguish, or l and I for that matter, or O and 0.

Yes, we need better tools to diagnose this.

No, we shouldn't let this stop us from adding such a feature if it is
otherwise a good feature.

I'm not so sure about this for other reasons -- it hampers code
sharing, and as soon as you add right-to-left character sets to the
mix (or top-to-bottom, for that matter), displaying source code is
going to be near impossible for most tools (since the keywords and
standard library module names will still be in the Latin alphabet).
This actually seems a killer even for allowing Unicode in comments,
which I'd otherwise favor. What do Unicode-aware apps generally do
with right-to-left characters?

--
--Guido van Rossum (home page: http://www.python.org/~guido/)

Re: [Python-Dev] Divorcing str and unicode (no more implicitconversions).

2005-10-25 Thread M.-A. Lemburg

Josiah Carlson wrote:
> "Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> 
>>Fredrik Lundh wrote:
>>
>>>however, for Python 3000, it would be nice if the source-code encoding applied
>>>to the *entire* file (XML-style), rather than just unicode string literals
>>>and (hopefully) comments and docstrings.
>>
>>As MAL explains, the encoding currently does apply to the entire file.
>>However, because of the Python syntax, you are restricted to ASCII
>>in many places, such as keywords, number literals, and (unfortunately)
>>identifiers. Lifting the restriction on identifiers is on my agenda.
> 
> 
> It seems that removing this restriction may cause serious issues, at
> least in the case when using cyrillic characters in names.  See recent
> security issues in regards to web addresses in web browsers for the
> confusion (and/or name errors) that could result in their use.
> 
> While I agree in principle that people should be able to use the
> entirety of one's own natural language in writing software in
> programming languages, I think that it is an ugly can of worms that
> perhaps shouldn't be opened.

I agree with Josiah.

A few years ago we had a discussion about this on python-dev
and agreed to stick with ASCII identifiers for Python. I still
think that's the right way to go.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Oct 25 2005)
>>> Python/Zope Consulting and Support ...http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/


::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! 

Re: [Python-Dev] Divorcing str and unicode (no more implicitconversions).

2005-10-25 Thread Josiah Carlson

"Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> 
> Josiah Carlson wrote:
> > It seems that removing this restriction may cause serious issues, at
> > least in the case when using cyrillic characters in names.  See recent
> > security issues in regards to web addresses in web browsers for the
> > confusion (and/or name errors) that could result in their use.
> 
> That impression is deceiving. We are talking about source code here;
> people type in identifiers explicitly rather than receiving them
> through linking, and they scope identifiers (by module or object).
> 
> If somebody manages to get look-alike identifiers into your Python
> libraries, you have bigger problems than these look-alikes: anybody
> capable of doing so could just as well replace the real thing in
> the first place.
> 
> As always in computer security: define your threat model before
> reasoning about the risks.

I should have been more explicit.  I did not mean to imply that I was
concerned about the security implications of inserting arbitrary
identifiers in Python (I was mentioning the web browser case for
an example of how such characters have been confusing previously), I am
concerned about confusion involved with using:
Greek Capital: Alpha, Beta, Epsilon, Zeta, Eta, Iota, Kappa, Mu, Nu,
Omicron, Rho, and Tau.
Cyrillic Capital: Dze, Je, A, Ve, Ie, Em, En, O, Er, Es, Te, Ha, ...

And how users could say, "name error? But I typed in window.draw(PEN) as
I was told to, and it didn't work!"

Identically drawn glyphs are a problem, and pretending that they aren't
a problem, doesn't make it so.  Right now, all possible name glyphs are
visually distinct, which would not be the case if any unicode character
could be used as a name (except for numerals).  Speaking of which, would
we then be offering support for arabic/indic numeric literals, and/or
support it in int()/float()?  Ideally I would like to say yes, but I
could see the confusion if such were allowed.

 - Josiah


Re: [Python-Dev] Divorcing str and unicode (no more implicitconversions).

2005-10-25 Thread Martin v. Löwis

Josiah Carlson wrote:
> It seems that removing this restriction may cause serious issues, at
> least in the case when using cyrillic characters in names.  See recent
> security issues in regards to web addresses in web browsers for the
> confusion (and/or name errors) that could result in their use.

That impression is deceiving. We are talking about source code here;
people type in identifiers explicitly rather than receiving them
through linking, and they scope identifiers (by module or object).

If somebody manages to get look-alike identifiers into your Python
libraries, you have bigger problems than these look-alikes: anybody
capable of doing so could just as well replace the real thing in
the first place.

As always in computer security: define your threat model before
reasoning about the risks.

Regards,
Martin

Re: [Python-Dev] Divorcing str and unicode (no more implicitconversions).

2005-10-25 Thread Josiah Carlson

"Martin v. Löwis" <[EMAIL PROTECTED]> wrote:
> 
> Fredrik Lundh wrote:
> > however, for Python 3000, it would be nice if the source-code encoding applied
> > to the *entire* file (XML-style), rather than just unicode string literals
> > and (hopefully) comments and docstrings.
> 
> As MAL explains, the encoding currently does apply to the entire file.
> However, because of the Python syntax, you are restricted to ASCII
> in many places, such as keywords, number literals, and (unfortunately)
> identifiers. Lifting the restriction on identifiers is on my agenda.

It seems that removing this restriction may cause serious issues, at
least in the case when using cyrillic characters in names.  See recent
security issues in regards to web addresses in web browsers for the
confusion (and/or name errors) that could result in their use.

While I agree in principle that people should be able to use the
entirety of one's own natural language in writing software in
programming languages, I think that it is an ugly can of worms that
perhaps shouldn't be opened.

 - Josiah


Re: [Python-Dev] Divorcing str and unicode (no more implicitconversions).

2005-10-25 Thread Martin v. Löwis

Fredrik Lundh wrote:
> however, for Python 3000, it would be nice if the source-code encoding applied
> to the *entire* file (XML-style), rather than just unicode string literals
> and (hopefully) comments and docstrings.

As MAL explains, the encoding currently does apply to the entire file.
However, because of the Python syntax, you are restricted to ASCII
in many places, such as keywords, number literals, and (unfortunately)
identifiers. Lifting the restriction on identifiers is on my agenda.

Regards,
Martin

Re: [Python-Dev] Divorcing str and unicode (no more implicitconversions).

2005-10-25 Thread M.-A. Lemburg

Fredrik Lundh wrote:
> M.-A. Lemburg wrote:
> 
> 
>>I don't follow you here. The source code encoding
>>is only applied to Unicode literals (you are using string
>>literals in your example). String literals are passed
>>through as-is.
> 
> 
> however, for Python 3000, it would be nice if the source-code encoding applied
> to the *entire* file (XML-style), rather than just unicode string literals
> and (hopefully) comments and docstrings.

Actually, the encoding is applied to the complete source file:
the file is transcoded into UTF-8 and then parsed by the
Python parser.

Unicode literals are then decoded from the UTF-8 into Unicode.
String literals are transcoded back into the source code encoding,
thus making the (rather long due to technical constraints) round-trip
source code encoding -> Unicode -> UTF-8 -> Unicode -> source code encoding.
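
The round-trip described above can be imitated with ordinary codec calls. This is only a sketch of the chain of conversions in present-day Python 3, not the parser's actual code; the encoding `iso-8859-1`, the sample line, and the function name `round_trip` are assumptions chosen for illustration.

```python
# Assumed source encoding, as it might be declared in a coding cookie.
source_encoding = "iso-8859-1"

# The raw bytes of a one-line source file containing a non-ASCII
# character (o with diaeresis, U+00F6) inside a string literal.
raw_source = "s = 'L\u00f6wis'\n".encode(source_encoding)


def round_trip(raw, encoding):
    """Imitate: source encoding -> Unicode -> UTF-8 -> Unicode -> source encoding."""
    text = raw.decode(encoding)               # source encoding -> Unicode
    utf8 = text.encode("utf-8")               # Unicode -> UTF-8 (what gets parsed)
    text_again = utf8.decode("utf-8")         # UTF-8 -> Unicode
    restored = text_again.encode(encoding)    # Unicode -> source encoding
    return utf8, restored


if __name__ == "__main__":
    utf8_bytes, restored_bytes = round_trip(raw_source, source_encoding)
    print(restored_bytes == raw_source)  # the original bytes come back
    print(utf8_bytes == raw_source)      # the UTF-8 form differs in between
```

The intermediate UTF-8 form is one byte longer here because U+00F6 takes two bytes in UTF-8 but one in ISO-8859-1, while the final bytes match the original file.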

Python 3k should have a fully Unicode based parser to reduce this
additional transcoding overhead.

Since Py3k will only have Unicode literals, the problems with
string literals will go away all by themselves :-)

-- 
Marc-Andre Lemburg
eGenix.com


Re: [Python-Dev] Divorcing str and unicode (no more implicitconversions).

2005-10-25 Thread Fredrik Lundh

M.-A. Lemburg wrote:

> I don't follow you here. The source code encoding
> is only applied to Unicode literals (you are using string
> literals in your example). String literals are passed
> through as-is.

however, for Python 3000, it would be nice if the source-code encoding applied
to the *entire* file (XML-style), rather than just unicode string literals and
(hopefully) comments and docstrings.

