Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-01 Thread Stephen J. Turnbull
Where I cut your words, we are in 100% agreement.  (FWIW :-)

Guido van Rossum writes:
  On Tue, Aug 30, 2011 at 11:03 PM, Stephen J. Turnbull
  step...@xemacs.org wrote:

   Well, that's why I wrote "intended to be suggestive".  The Unicode
   Standard does not specify at all what the internal representation of
   characters may be, it only specifies what their external behavior must
   be when two processes communicate.  (For "process" as used in the
   standard, think Python modules here, since we are concerned with the
   problems of folks who develop in Python.)  When observing the behavior
   of a Unicode process, there are no UTF-16 arrays or UTF-8 arrays or
   even UTF-32 arrays; only arrays of characters.
  
  Hm, that's not how I would read "process". IMO that is an
  intentionally vague term,

I agree.  I'm sorry that I didn't make myself clear.  The reason I
read "process" as "module" is that some modules of Python, and
therefore Python as a whole, cannot conform to the Unicode standard.
Eg, anything that inputs or outputs bytes.  Therefore only modules
and types can be asked to conform.  (I don't think it makes sense to
ask anything lower level to conform.  See below where I comment on
your .lower() example.)

What I am advocating (for the long term) is provision of *one* module
(or type) such that if the text processing done by the application is
done entirely in terms of this module (type), it will conform (to some
specified degree, chosen to balance user wants with implementation and
support costs).  It may be desirable to provide others for
sufficiently important particular use cases, but at present I see a
clear need for *one*.  Unicode conformance is going to be a common
requirement for apps used by global enterprises.

I oppose trying to make str into that type.  We need str, just as it
is, for many reasons.

  and we are free to decide how to interpret it. I don't think it
  will work very well to define a process as a Python module; what
  about Python modules that agree about passing along array of code
  units (or streams of UTF-8, for that matter)?

Certainly a group of cooperating modules could form a conforming
process, just as you describe it for one example.  The one module
mentioned above need not implement everything internally, but it would
take responsibility for providing guarantees (eg, unit tests) of
whatever conformance claims it makes.

   Thus, according to the rules of handling a UTF-16 stream, it is an
   error to observe a lone surrogate or a surrogate pair that isn't a
   high-low pair (Unicode 6.0, Ch. 3 Conformance, requirements C1 and
   C8-C10).  That's what I mean by "can't tell it's UTF-16".
  
  But if you can observe (valid) surrogate pairs it is still UTF-16.

In the concrete implementation I have in mind, surrogate pairs are
represented by a str containing 2 code units.  But in that case
s[i][1] is an error, and s[i][0] == s[i].  print(s[i][0]) and
print(s[i]) will print the same character to the screen.  If you
decode it to bytes, well, it's not a str any more so what have you
proved?  Ie, what you will see is *code points* not in the BMP.
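
A toy sketch (purely hypothetical, not the actual proposal) of what such
code-point indexing over a UTF-16 buffer could look like:

# Hypothetical sketch: a string stored as UTF-16 code units that indexes
# by code point, so a surrogate pair never shows up as two separate items.
class U16Str:
    def __init__(self, units):
        self.units = tuple(units)               # 16-bit code units

    def _starts(self):
        # offsets of the code units that start a code point
        starts, i = [], 0
        while i < len(self.units):
            starts.append(i)
            i += 2 if 0xD800 <= self.units[i] <= 0xDBFF else 1
        return starts

    def __len__(self):
        return len(self._starts())              # length in code points

    def __getitem__(self, i):
        starts = self._starts() + [len(self.units)]
        return U16Str(self.units[starts[i]:starts[i + 1]])

    def __eq__(self, other):
        return self.units == other.units

s = U16Str([0x0041, 0xD83D, 0xDE00])            # 'A' plus one non-BMP character
assert len(s) == 2
assert s[1][0] == s[1]       # indexing never exposes a lone surrogate; s[1][1] is an error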

You don't have to agree that such surrogate containment behavior is
so valuable as I think it is, but that's what I have in mind as one
requirement for a conforming implementation of UTF-16.

  At the same time I think it would be useful if certain string
  operations like .lower() worked in such a way that *if* the input were
  valid UTF-16, *then* the output would also be, while *if* the input
  contained an invalid surrogate, the result would simply be something
  that is no worse (in particular, those are all mapped to
  themselves).

I don't think that it's a good idea to go for conformance at the
method level.  It would be a feature for apps that don't claim full
conformance because they nevertheless give good results in more cases.
The downside will be Python apps using str that will pass conformance
tests written for, say Western Europe, but end users in Kuwait and
Kuala Lumpur will report bugs.

  An analogy is actually found in .lower() on 8-bit strings in Python 2:
  it assumes the string contains ASCII, and non-ASCII characters are
  mapped to themselves. If your string contains Latin-1 or EBCDIC or
  UTF-8 it will not do the right thing. But that doesn't mean strings
  cannot contain those encodings, it just means that the .lower() method
  is not useful if they do. (Why ASCII? Because that is the system
  encoding in Python 2.)

Sure.  I think that approach is fine for str, too, except that I would
hope it looks up BMP base characters in the case-mapping database.
The fact is that with very few exceptions non-BMP characters are going
to be symbols (mathematical operators and emoticons, for example).
This is good enough, except when it's not---but when it's not, only
100% conformance is really a reasonable target.  IMO, of course.

  I think we should just document how it behaves and not get hung up on
  what it is 

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-01 Thread Neil Hodgson
Glenn Linderman:

 How many different iterators into the same text would be concurrently needed
 by an application?  And why? Seems like if it is dealing with text at the
 level of grapheme clusters, it needs that type of iterator.  Of course, if
 it does I/O it needs codec access, but that is by nature sequential from the
 starting point to the end point.

   I would expect that there would mostly be a single iterator into a
string but can imagine scenarios in which multiple iterators may be
concurrently active and that these could be of different types. For
example, say we wanted to search for each code point in a text that
fails some test (such as being a member of a set of unwanted vowel
diacritics) and then display that failure in context with its
surrounding text of up to 30 graphemes either side.
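
   A rough sketch of that kind of search (assuming the third-party regex
module, whose r'\X' pattern matches one grapheme cluster; the names here
are made up for illustration):

import regex

UNWANTED = {'\u0300', '\u0301'}          # example: some unwanted combining marks

def report_failures(text, context=30):
    # iterate by grapheme cluster, test each code point, report hits with context
    graphemes = regex.findall(r'\X', text)
    for i, g in enumerate(graphemes):
        if any(cp in UNWANTED for cp in g):
            before = ''.join(graphemes[max(0, i - context):i])
            after = ''.join(graphemes[i + 1:i + 1 + context])
            yield before, g, after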

   Neil


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-01 Thread Stephen J. Turnbull
Glenn Linderman writes:

  I found your discussion of streams versus arrays, as separate concepts 
  related to Unicode, along with Terry's bisect indexing implementation, 
  to be rather inspiring.  Just because Unicode defines streams of code units 
  of various sizes (UTF-8, UTF-16, UTF-32) to represent characters when 
  processes communicate and for storage (which is one way processes 
  communicate), that doesn't imply that the internal representation of 
  character strings in a programming language must use exactly that 
  representation.

That is true, and Unicode is *very* careful to define its requirements
so that is true.  That doesn't mean using an alternative
representation is an improvement, though.

  I'm unaware of any current Python implementation that has chosen to
  use UTF-8 as the internal representation of character strings (I'm
  also aware Perl has made that choice), yet UTF-8 is one of the
  commonly recommended character representations on the Linux platform,
  from what I read.

There are two reasons for that.  First, widechar representations are
right out for anything related to the file system or OS, unless you
are prepared to translate before passing to the OS.  If you use UTF-8,
then asking the user to use a UTF-8 locale to communicate with your
app is a plausible way to eliminate any translation in your app.  (The
original moniker for UTF-8 was UTF-FSS, where FSS stands for file
system safe.)

Second, much text processing is stream-oriented and one-pass.  In
those cases, the variable-width nature of UTF-8 doesn't cost you
anything.  Eg, this is why the common GUIs for Unix (X.org, GTK+, and
Qt) either provide or require UTF-8 coding for their text.  It costs
*them* nothing and is file-system-safe.

  So in that sense, Python has rejected the idea of using the
  native or OS configured representation as its internal
  representation.

I can't agree with that characterization.  POSIX defines the concept
of *locale* precisely because the native representation of text in
Unix is ASCII.  Obviously that won't fly, so they solved the problem
in the worst possible way <wink/>: they made the representation
variable!

It is the *variability* of text representation that Python rejects,
just as Emacs and Perl do.  They happen to have chosen six different
representations.[1]

  So why, then, must one choose from a repertoire of Unicode-defined
  stream representations if they don't meet the goal of efficient
  length, indexing, or slicing operations on actual characters?

One need not.  But why do anything else?  It's not like the authors of
that standard paid no attention to various concerns about efficiency
and backward compatibility!  That's the question that you have not
answered, and I am presently lacking in any data that suggests I'll
ever need the facilities you propose.

Footnotes: 
[1]  Emacs recently changed its mind.  Originally it used the
so-called MULE encoding; now it uses an extension of UTF-8 different
from Perl's.  Of course, Python beats that, with narrow, wide, and now
PEP-393 representations! <wink/>



Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-01 Thread Neil Hodgson
Stephen J. Turnbull:

 ...  Eg, this is why the common GUIs for Unix (X.org, GTK+, and
 Qt) either provide or require UTF-8 coding for their text.

   Qt uses UTF-16 for its basic QString type. While QString is mostly
treated as a black box which you can create from input buffers in any
encoding, the only encoding allowed for a contents-by-reference
QString (QString::fromRawData) is UTF-16.
http://doc.qt.nokia.com/latest/qstring.html#fromRawData

   Neil


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-01 Thread Stephen J. Turnbull
Glenn Linderman writes:

  How many different iterators into the same text would be concurrently 
  needed by an application?  And why?

A WYSIWYG editor for structured text (TeX, HTML) might want two (at
least), one for the source window and one for the rendered window.
One might want to save the state of the iterators (if that's possible)
and cache it as one moves the window forward to make short backward
motion fast, giving you two (or four, etc) more.

  Seems like if it is dealing with text at the level of grapheme
  clusters, it needs that type of iterator.  Of course, if it does
  I/O it needs codec access, but that is by nature sequential from
  the starting point to the end point.

`save-region' ?  `save-text-remove-markup' ?


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-01 Thread Glenn Linderman

On 9/1/2011 2:15 AM, Stephen J. Turnbull wrote:

Glenn Linderman writes:

How many different iterators into the same text would be concurrently
needed by an application?  And why?

A WYSIWYG editor for structured text (TeX, HTML) might want two (at
least), one for the source window and one for the rendered window.
One might want to save the state of the iterators (if that's possible)
and cache it as one moves the window forward to make short backward
motion fast, giving you two (or four, etc) more.


Sure.  But those are probably all the same type of iterators — probably 
(since they are WYSIWYG) dealing with multi-codepoint characters 
(Guido's recent definition of grapheme, which seems to subsume both 
grapheme clusters and composed characters).


Hence all of  them would be using/requiring the same sort of 
representation, index, analysis, or some combination of those.



Seems like if it is dealing with text at the level of grapheme
clusters, it needs that type of iterator.  Of course, if it does
I/O it needs codec access, but that is by nature sequential from
the starting point to the end point.

`save-region' ?  `save-text-remove-markup' ?


Yes, save-region sounds like exactly what I was speaking of.  
save-text-remove-markup I would infer needs to process the text to 
remove the markup characters... since you used TeX and HTML as examples, 
markup is text, not binary (which would be a different problem).  Since 
the TeX and HTML markup is mostly ASCII, markup removal (or more likely, 
text extraction) could be performed via either a grapheme iterator, or a 
codepoint iterator, or even a code unit iterator.


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-01 Thread Glenn Linderman

On 9/1/2011 12:59 AM, Stephen J. Turnbull wrote:

Glenn Linderman writes:

We can either artificially constrain ourselves to minor tweaks of
the legal conforming bytestreams,

It's not artificial.  Having the internal representation be the same
as a standard encoding is very useful for a large number of minor
usages (urgently saving buffers in a text editor that knows its
internal state is inconsistent, viewing strings in the debugger, PEP
393-style space optimization is simpler if text properties are
out-of-band, etc).


saving buffers urgently when the internal state is inconsistent sounds 
like carefully preserving a bug.  Windows 7 64-bit on one of my 
computers happily crashes several times a day when it detects 
inconsistent internal state... under the theory, I guess, that losing 
work is better than saving bad work.  You sound the opposite.


I'm actually very grateful that Firefox and emacs recover gracefully 
from Windows crashes, and I lose very little data from the crashes, but 
cannot recommend Windows 7 (this machine being my only experience with 
it) for stability.


In any case, the operations you mention still require the data to be 
processed, if ever so slightly, and I'll admit that a more complex 
representation would require a bit more processing.  Not clear that it 
would be huge or problematical for these cases.


Except, I'm not sure how PEP 393 space optimization fits with the other 
operations.  It may even be that an application-wide complex-grapheme 
cache would save significant space, although if it uses high-bits in a 
string representation to reference the cache, PEP 393 would jump 
immediately to something > 16 bits per grapheme... but likely would 
anyway, if complex-graphemes are in the data stream.



or we can invent a representation (whether called str or something
else) that is useful and efficient in practice.

Bring on the practice, then.  You say that a bit to identify lone
surrogates might be useful or efficient.  In what application?  How
much time or space does it save?


I didn't attribute any efficiency to flagging lone surrogates (BI-5).  
Since Windows uses a non-validated UCS-2 or UTF-16 character type, any 
Python program that obtains data from Windows APIs may be confronted 
with lone surrogates or inappropriate combining characters at any time.  
Round-tripping that data seems useful, even though the data itself may 
not be as useful as validated Unicode characters would be.  Accidentally 
combining the characters due to slicing and dicing the data, and doing 
normalizations, or what not, would not likely be appropriate.  However, 
returning modified forms of it to Windows as UCS-2 or UTF-16 data may 
still cause other applications to later accidentally combine the 
characters, if the modifications juxtaposed things to make them look 
reasonable, even if accidentally.  If intentional, of course, the bit 
could be turned off.  This exact sort of problem with non-validated 
UTF-8 bytes was addressed already in Python, mostly for Linux, allowing 
round-tripping of the byte stream, even though it is not valid.  BI-6 
suggests a different scheme for that, without introducing lone 
surrogates (which might accidentally get combined with other lone 
surrogates).



You say that a bit to cache a
property might be useful or efficient.  In what application?  Which
properties?  Are those properties a set fixed by the language, or
would some bits be available for application-specific property
caching?  How much time or space does that save?


The brainstorming ideas I presented were just that... ideas.  And they 
were independent.  And the use of many high-order bits for properties 
was one of the independent ones.  When I wrote that one, I was assuming 
a UTF-32 representation (which wastes 11 bits of each 32).  One thing I 
did have in mind, with the high-order bits, for that representation, was 
to flag the start or end or middle of the codes that are included in a 
grapheme.  That would be redundant with some of the Unicode codepoint 
property databases, if I understand them properly... whether it would 
make iterators enough more efficient to be worth the complexity would 
have to be benchmarked.  After writing all those ideas down, I actually 
preferred some of the others, that achieved O(1) real grapheme indexing, 
rather than caching character properties.



What are the costs to applications that don't want the cache?  How is
the bit-cache affected by PEP 393?


If it is a separate type from str, then it costs nothing except the 
extra code space to implement the cache for those applications that do 
want it... most of which wouldn't be loaded for applications that don't, 
if done as a module or C extension.



I know of no answers (none!) to those questions that favor
introduction of a bit-cache representation now.  And those bits aren't
going anywhere; it will always be possible to use a wide build and
change the representation later, if 

Re: [Python-Dev] Python 3 optimizations continued...

2011-09-01 Thread Ned Batchelder

On 8/30/2011 4:41 PM, stefan brunthaler wrote:

Ok, there's something else you haven't told us. Are you saying
that the original (old) bytecode is still used (and hence written to
and read from .pyc files)?


Short answer: yes.
Long answer: I added an invocation counter to the code object and keep
interpreting in the usual Python interpreter until this counter
reaches a configurable threshold. When it reaches this threshold, I
create the new instruction format and interpret with this optimized
representation. All the macros look exactly the same in the source
code, they are just redefined to use the different instruction format.
I am at no point serializing this representation or the runtime
information gathered by me, as any subsequent invocation might have
different characteristics.
When the switchover to the new instruction format happens, what happens 
to sys.settrace() tracing?  Will it report the same sequence of line 
numbers?  For a small but important class of program executions, this is 
more important than speed.


--Ned.


Best,
--stefan



Re: [Python-Dev] Python 3 optimizations continued...

2011-09-01 Thread Cesare Di Mauro
2011/9/1 Ned Batchelder n...@nedbatchelder.com

 When the switchover to the new instruction format happens, what happens to
 sys.settrace() tracing?  Will it report the same sequence of line numbers?
  For a small but important class of program executions, this is more
 important than speed.

  --Ned


A simple solution: when tracing is enabled, the new instruction format will
never be executed (and information tracking disabled as well).

Regards,
Cesare


Re: [Python-Dev] Python 3 optimizations continued...

2011-09-01 Thread Mark Shannon

Cesare Di Mauro wrote:
2011/9/1 Ned Batchelder n...@nedbatchelder.com 
mailto:n...@nedbatchelder.com


When the switchover to the new instruction format happens, what
happens to sys.settrace() tracing?  Will it report the same sequence
of line numbers?  For a small but important class of program
executions, this is more important than speed.

 --Ned


A simple solution: when tracing is enabled, the new instruction format 
will never be executed (and information tracking disabled as well).


What happens if tracing is enabled *during* the execution of the new 
instruction format?
Some sort of deoptimisation will be required in order to recover the 
correct VM state.


Cheers,
Mark.


Regards,
Cesare






Re: [Python-Dev] Python 3 optimizations continued...

2011-09-01 Thread Cesare Di Mauro
2011/9/1 Mark Shannon m...@hotpy.org

 Cesare Di Mauro wrote:

 2011/9/1 Ned Batchelder n...@nedbatchelder.com mailto:
 n...@nedbatchelder.com


When the switchover to the new instruction format happens, what
happens to sys.settrace() tracing?  Will it report the same sequence
of line numbers?  For a small but important class of program
executions, this is more important than speed.

 --Ned


 A simple solution: when tracing is enabled, the new instruction format
 will never be executed (and information tracking disabled as well).

  What happens if tracing is enabled *during* the execution of the new
 instruction format?
 Some sort of deoptimisation will be required in order to recover the
 correct VM state.

 Cheers,
 Mark.


Sure. I don't think that the regular ceval.c loop will be dropped when
executing the new instruction format, so we can intercept a change like
this using the "why" variable, for example, or something similar that is
normally used to break the regular loop execution.

Anyway, we need to take a look at the code.

Cheers,
Cesare


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-01 Thread Hagen Fürstenau
 Ok, I thought there was also a form normalized (denormalized?) to
 decomposed form. But I'll take your word.

If I understood the example correctly, he needs a mixed form, with some
characters decomposed and some composed (depending on which one looks
better in the given font). I agree that this sounds more like a font
problem, but it's a widespread font problem and it may be necessary to
address it in an application.

But this is only one example of why an application-specific concept of
graphemes different from the Unicode-defined normalized forms can be
useful. I think the very concept of a grapheme is context-, language-, and
culture-specific. For example, in Chinese Pinyin it would be very
natural to write tone marks with combining diacritics (i.e. in
decomposed form). But then you have the vowel ü and it would be
strange to decompose it into a u and a combining diaeresis. So
conceptually the most sensible representation of lǜ would be neither
the composed nor the decomposed normal form, and depending on its needs
an application might want to represent it in the mixed form (composing
the diaeresis with the u, but leaving the grave accent separate).
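
A quick illustration of the three forms with Python's unicodedata (a
sketch only):

import unicodedata

composed   = "l\u01dc"                   # 'lǜ', fully composed (U+01DC)
decomposed = unicodedata.normalize("NFD", composed)  # 'l', 'u', U+0308, U+0300
mixed      = "l\u00fc\u0300"             # 'ü' composed, grave accent left combining

# All three are canonically equivalent...
assert unicodedata.normalize("NFC", mixed) == composed
assert unicodedata.normalize("NFD", mixed) == decomposed
# ...but neither normal form gives you the mixed representation directly.
assert mixed not in (composed, decomposed)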

There must be many more examples where the conceptual context determines
the right composition, like for ñ, which in Spanish is certainly a
grapheme, but in mathematics might be better represented as n-tilde. The
bottom line is that, while an array of Unicode code points is certainly
a generally useful data type (and PEP 393 is a great improvement in this
regard), an array of graphemes carries many subtleties and may not be
nearly as universal. Support in the spirit of unicodedata's
normalization function etc. is certainly a good thing, but we shouldn't
assume that everyone will want Python to do their graphemes for them.

- Hagen



Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-01 Thread Guido van Rossum
On Thu, Sep 1, 2011 at 12:13 AM, Stephen J. Turnbull step...@xemacs.org wrote:
 Where I cut your words, we are in 100% agreement.  (FWIW :-)

Not quite the same here, but I don't feel the need to have the last
word. Most of what you say makes sense, in some cases we'll quibble
later, but there are a few points where I have something to add:

 No, and I can tell you why!  The difference between characters and
 words is much more important than that between code point and grapheme
 cluster for most users and the developers who serve them.  Even small
 children recognize typographical ligatures as being composite objects,

True -- in fact I didn't know that ff and ffl ligatures *existed*
until I learned about Unix troff.

 while at least this Spanish-as-a-second-language learner was taught
 that `ñ' is an atomic character represented by a discontiguous glyph,
 like `i', and it is no more related to `n' than `m' is.  Users really
 believe that characters are atomic.  Even in the cases of Han
 characters and Hangul, users think of the characters as being
 atomic, but in the sense of Bohr rather than that of Democritus.

Ah, I think this may very well be culture-dependent. In Holland there
are no Dutch words that use accented letters, but the accents are
known because there are a lot of words borrowed from French or German.
We (the Dutch) think of these as letters with accents and in fact we
think of the accents as modifiers that can be added to any letter (at
least I know that's how I thought about it -- perhaps I was also
influenced by the way one had to type those on a mechanical
typewriter). Dutch does have one native use of the umlaut (though it
has a different name, I forget which, maybe trema :-), when there are
two consecutive vowels that would normally be read as a special sound
(diphthong?). E.g. in "koe" (cow) the "oe" is two letters (not a single
letter formed of two distinct shapes!) that mean a special sound
(roughly KOO). But in a word like coëxistentie (coexistence) the o
and e do not form the oe-sound, and to emphasize this to Dutch readers
(who believe their spelling is very logical :-), the official spelling
puts the umlaut on the e. This is definitely thought of as a separate
mark added to the e; ë is not a new letter. I have a feeling it's the
same way for the French and Germans, but I really don't know.
(Antoine? Georg?)

Finally, my guess is that the Spanish emphasis on ñ as a separate
letter has to do with teaching how it has a separate position in the
localized collation sequence, doesn't it? I'm also curious if ñ occurs
as a separate character on Spanish keyboards.

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-01 Thread Antoine Pitrou
Le jeudi 01 septembre 2011 à 08:45 -0700, Guido van Rossum a écrit :
 This is definitely thought of as a separate
 mark added to the e; ë is not a new letter. I have a feeling it's the
 same way for the French and Germans, but I really don't know.
 (Antoine? Georg?)

Indeed, they are not separate letters (they are considered the same in
lexicographic order, and the French alphabet has 26 letters).

But I'm not sure how it's relevant, because you can't remove an accent
without most likely making a spelling error, or at least changing the
meaning. Accents are very much part of the language (while ligatures
like ff are not, they are a rendering detail). So I would consider
é, ê, ù, etc. atomic characters for the purpose of processing
French text. And I don't see how a decomposed form could help an
application.

Regards

Antoine.




Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-01 Thread Guido van Rossum
On Thu, Sep 1, 2011 at 9:03 AM, Antoine Pitrou solip...@pitrou.net wrote:
 Le jeudi 01 septembre 2011 à 08:45 -0700, Guido van Rossum a écrit :
 This is definitely thought of as a separate
 mark added to the e; ë is not a new letter. I have a feeling it's the
 same way for the French and Germans, but I really don't know.
 (Antoine? Georg?)

 Indeed, they are not separate letters (they are considered the same in
 lexicographic order, and the French alphabet has 26 letters).

 But I'm not sure how it's relevant, because you can't remove an accent
 without most likely making a spelling error, or at least changing the
 meaning. Accents are very much part of the language (while ligatures
 like ff are not, they are a rendering detail). So I would consider
 é, ê, ù, etc. atomic characters for the purpose of processing
 French text. And I don't see how a decomposed form could help an
 application.

The example given was someone who didn't agree with how a particular
font rendered those accented characters. I agree that's obscure
though.

I recall long ago that when the French wrote words in all caps they
would drop the accents, e.g. ECOLE. I even recall (through the mists
of time) observing this in Paris on public signs. Is this still the
convention? Maybe it only was a compromise in the time of Morse code?

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-01 Thread Antoine Pitrou

 The example given was someone who didn't agree with how a particular
 font rendered those accented characters. I agree that's obscure
 though.
 
 I recall long ago that when the french wrote words in all caps they
 would drop the accents, e.g. ECOLE. I even recall (through the mists
 of time) observing this in Paris on public signs. Is this still the
 convention? Maybe it only was a compromise in the time of Morse code?

I think it is tolerated, partly because typing support (on computers and
typewriters) has been weak. On a French keyboard, you have an é key,
but shifting it gives you 2, not É. The latter can be obtained using
the Caps Lock key under Linux, but not under Windows.

(so you could also write Éric's name Eric, for example)

That said, most typographies nowadays seem careful to keep the accents
on uppercase letters (e.g. on book covers; AFAIR, road signs also keep
the accents, but I'm no driver).

Regards

Antoine.




Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-01 Thread Stefan Behnel

Guido van Rossum, 01.09.2011 18:31:

On Thu, Sep 1, 2011 at 9:03 AM, Antoine Pitrou wrote:

Le jeudi 01 septembre 2011 à 08:45 -0700, Guido van Rossum a écrit :

This is definitely thought of as a separate
mark added to the e; ë is not a new letter. I have a feeling it's the
same way for the French and Germans, but I really don't know.
(Antoine? Georg?)


Indeed, they are not separate letters (they are considered the same in
lexicographic order, and the French alphabet has 26 letters).


So does the German alphabet, even though that does not include ß, which 
basically descended from a ligature of the old German way of writing sz, 
where s looked similar to an f and z had a low hanging tail.


IIRC, German Umlaut letters are lexicographically sorted according to their 
emergency replacement spelling (ä -> ae), which is also sometimes used 
in all upper case words (Glück -> GLUECK). I guess that's because 
Umlaut dots are harder to see on top of upper case letters. So, Latin-1 
byte value sorting always yields totally wrong results.
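
A small illustration of the difference (a sketch only; real collation
would go through locale.strxfrm or a library such as PyICU):

# In code point / Latin-1 byte order, 'ü' sorts after every ASCII letter:
words = ["Glück", "Glut"]
print(sorted(words))                       # ['Glut', 'Glück']

# Sorting via the replacement spelling (ä -> ae, ö -> oe, ü -> ue, ß -> ss),
# as in German phone books, reverses that order:
REPL = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
        "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"}

def replacement_key(word):
    return "".join(REPL.get(ch, ch) for ch in word)

print(sorted(words, key=replacement_key))  # ['Glück', 'Glut']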


That aside, Umlaut letters are commonly considered separate letters, 
different from the undotted letters and also different from the replacement 
spellings. I, for one, always found the replacements rather weird and never 
got used to using them in upper case words. In any case, it's wrong to 
always use them, and it makes text harder to read.




But I'm not sure how it's relevant, because you can't remove an accent
without most likely making a spelling error, or at least changing the
meaning. Accents are very much part of the language (while ligatures
like ff are not, they are a rendering detail). So I would consider
é, ê, ù, etc. atomic characters for the purpose of processing
French text. And I don't see how a decomposed form could help an
application.


I recall long ago that when the french wrote words in all caps they
would drop the accents, e.g. ECOLE. I even recall (through the mists
of time) observing this in Paris on public signs. Is this still the
convention?


Yes, and it's a huge problem when trying to pronounce last names. In 
French, you'd commonly write


LASTNAME, Firstname

and if LASTNAME happens to have accented letters, you'd miss them when 
reading that. I know a couple of French people who severely suffer from 
this, because the pronunciation of their name gets a totally different 
meaning without accents.


Stefan



Re: [Python-Dev] Python 3 optimizations continued...

2011-09-01 Thread Glyph Lefkowitz

On Sep 1, 2011, at 5:23 AM, Cesare Di Mauro wrote:

 A simple solution: when tracing is enabled, the new instruction format will 
 never be executed (and information tracking disabled as well).

Correct me if I'm wrong: doesn't this mean that no profiler will accurately be 
able to measure the performance impact of the new instruction format, and 
therefore one may get incorrect data when one is trying to make a CPU 
optimization for real-world performance?




Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-01 Thread Stefan Behnel

Antoine Pitrou, 01.09.2011 18:46:

AFAIR, road signs also keep the accents, but I'm no driver


Right, I noticed that, too. That's certainly not uncommon. I think it's 
mostly because of local pride (after all, the road signs are all that many 
drivers ever see of a city), but sometimes also because it can't be helped 
when the name gets a different meaning without accents. People just cause 
too many accidents when they burst out laughing while entering a city by car.


Stefan



Re: [Python-Dev] Python 3 optimizations continued...

2011-09-01 Thread Guido van Rossum
On Thu, Sep 1, 2011 at 10:15 AM, Glyph Lefkowitz
gl...@twistedmatrix.com wrote:

 On Sep 1, 2011, at 5:23 AM, Cesare Di Mauro wrote:

 A simple solution: when tracing is enabled, the new instruction format will
 never be executed (and information tracking disabled as well).

 Correct me if I'm wrong: doesn't this mean that no profiler will accurately
 be able to measure the performance impact of the new instruction format, and
 therefore one may get incorrect data when one is trying to make a CPU
 optimization for real-world performance?

Well, profilers already skew results by adding call overhead. But
tracing for debugging and profiling don't do exactly the same thing:
debug tracing stops at every line, but profiling only executes hooks
at the start and end of a function(*). So I think the function body
could still be executed using the new format (assuming this is turned
on/off per code object anyway).

(*) And whenever a generator yields or is resumed. I consider that an
annoying bug though, just as the debugger doesn't do the right thing
with yield -- there's no way to continue until the yielding generator
is resumed short of setting a manual breakpoint on the next line.
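
A tiny illustration of that difference, using the existing hooks:

import sys

def tracer(frame, event, arg):
    print("trace:", event, frame.f_lineno)
    return tracer

def profiler(frame, event, arg):
    print("profile:", event, frame.f_code.co_name)

def f():
    x = 1
    y = 2
    return x + y

sys.settrace(tracer)
f()                  # 'call', then a 'line' event for every line, then 'return'
sys.settrace(None)

sys.setprofile(profiler)
f()                  # only 'call' and 'return' for the function as a whole
sys.setprofile(None)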

-- 
--Guido van Rossum (python.org/~guido)


Re: [Python-Dev] Ctypes and the stdlib (was Re: LZMA compression support in 3.3)

2011-09-01 Thread Dan Stromberg
On Tue, Aug 30, 2011 at 10:05 AM, Guido van Rossum gu...@python.org wrote:

 On Tue, Aug 30, 2011 at 9:49 AM, Martin v. Löwis mar...@v.loewis.de
 wrote:
  The problem lies with the PyPy backend -- there it generates ctypes
 code, which means that the signature you declare to Cython/Pyrex must
 match the *linker* level API, not the C compiler level API. Thus, if
 in a system header a certain function is really a macro that invokes
 another function with a permuted or augmented argument list, you'd
 have to know what that macro does. I also don't see how this would
 work for #defined constants: where does Cython/Pyrex get their value?
 ctypes doesn't have their values.

 So, for PyPy, a solution based on Cython/Pyrex has many of the same
 downsides as one based on ctypes where it comes to complying with an
 API defined by a .h file.


It's certainly a harder problem.

For most simple constants, Cython/Pyrex might be able to generate a series
of tiny C programs with which to find CPP symbol values:

#include <stdio.h>
#include "file1.h"
...
#include "filen.h"

int main(void)
{
    printf("%d", POSSIBLE_CPP_SYMBOL1);
    return 0;
}

...and again with "%f", "%s", etc.  The typing is quite a mess, and code
fragments would probably be impractical.  But since the C Preprocessor is
supposedly Turing complete, maybe there's a pleasant surprise waiting there.

But hopefully clang has something that'd make this easier.

SIP's approach of using something close to, but not identical to, the .h's
sounds like it might be pretty productive - especially if the derivative of
the .h's could be automatically derived using a python script, with minor
tweaks to the inputs on .h upgrades.  But sip itself is apparently C++-only.
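
Something along those lines could be driven from Python, e.g. (a rough
sketch that assumes a C compiler is available as cc on PATH; the helper
name is made up):

import os
import subprocess
import tempfile

def probe_int_constant(headers, symbol, cc="cc"):
    """Compile and run a tiny C program that prints one integer CPP symbol."""
    lines = ['#include <stdio.h>']
    lines += ['#include "%s"' % h for h in headers]   # system headers may need <...>
    lines += ['int main(void) { printf("%d", (int)(' + symbol + ')); return 0; }']
    with tempfile.TemporaryDirectory() as tmp:
        c_file = os.path.join(tmp, "probe.c")
        exe = os.path.join(tmp, "probe")
        with open(c_file, "w") as f:
            f.write("\n".join(lines))
        subprocess.check_call([cc, c_file, "-o", exe])
        return int(subprocess.check_output([exe]))

# e.g. probe_int_constant(["limits.h"], "INT_MAX"); one compile per constant,
# so a real tool would batch many symbols into a single probe program.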


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-01 Thread Terry Reedy

On 9/1/2011 11:45 AM, Guido van Rossum wrote:


typewriter). Dutch does have one native use of the umlaut (though it
has a different name, I forget which, maybe trema :-),


You remember correctly. According to
https://secure.wikimedia.org/wikipedia/en/wiki/Trema_%28diacritic%29
'trema' (Greek 'hole') is the generic name of the double-dot vowel 
diacritic. It was originally used for 'diaeresis' (Greek, 'taking 
apart') when it shows that a vowel letter is not part of a digraph or 
diphthong. (Note that 'ae' in diaeresis *is* a digraph ;-). Germans 
later used it to indicate umlaut, 'changed sound'.



when there are
two consecutive vowels that would normally be read as a special sound
(diphthong?). E.g. in koe (cow) the oe is two letters (not a single
letter formed of two distict shapes!) that mean a special sound
(roughly KOO). But in a word like coëxistentie (coexistence) the o
and e do not form the oe-sound, and to emphasize this to Dutch readers
(who believe their spelling is very logical :-), the official spelling
puts the umlaut on the e. This is definitely thought of as a separate
mark added to the e; ë is not a new letter.


So the above is the trema used as diaeresis. Dutch, French, and Spanish make 
regular use of the diaeresis. English uses such as 'coöperate' have 
become rare or archaic, perhaps because we cannot type them. Too bad, 
since people sometimes use '-' to serve the same purpose.


--
Terry Jan Reedy




Re: [Python-Dev] Cython, Ctypes and the stdlib

2011-09-01 Thread Stefan Behnel

Dan Stromberg, 01.09.2011 19:56:

On Tue, Aug 30, 2011 at 10:05 AM, Guido van Rossum wrote:

  The problem lies with the PyPy backend -- there it generates ctypes
code, which means that the signature you declare to Cython/Pyrex must
match the *linker* level API, not the C compiler level API. Thus, if
in a system header a certain function is really a macro that invokes
another function with a permuted or augmented argument list, you'd
have to know what that macro does. I also don't see how this would
work for #defined constants: where does Cython/Pyrex get their value?
ctypes doesn't have their values.

So, for PyPy, a solution based on Cython/Pyrex has many of the same
downsides as one based on ctypes where it comes to complying with an
API defined by a .h file.


It's certainly a harder problem.

For most simple constants, Cython/Pyrex might be able to generate a series
of tiny C programs with which to find CPP symbol values:

#include <stdio.h>
#include "file1.h"
...
#include "filen.h"

int main(void)
{
    printf("%d", POSSIBLE_CPP_SYMBOL1);
    return 0;
}

...and again with "%f", "%s", etc.  The typing is quite a mess


The user will commonly declare #defined values as typed external variables 
and callable macros as functions in .pxd files. These manually typed 
macro functions allow users to tell Cython what it should know about how 
the macros will be used. And that would allow it to generate C/C++ glue 
code for them that uses the declared types as a real function signature and 
calls the macro underneath.




and code fragments would probably be impractical.


Not necessarily at the C level but certainly for a ctypes backend, yes.



But hopefully clang has something that'd make this easier.


For figuring these things out, maybe. Not so much for solving the problems 
they introduce.


Stefan



Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-01 Thread Stephen J. Turnbull
Glenn Linderman writes:

  Windows 7 64-bit on one of my computers happily crashes several
  times a day when it detects inconsistent internal state... under
  the theory, I guess, that losing work is better than saving bad
  work.  You sound the opposite.

Definitely.  Windows apps habitually overwrite existing work; saving
when inconsistent would be a bad idea.  The apps I work on dump their
unsaved buffers to new files, and give you a chance to look at them
before instating them as the current version when you restart.

  Except, I'm not sure how PEP 393 space optimization fits with the other 
  operations.  It may even be that an application-wide complex-grapheme 
  cache would save significant space, although if it uses high-bits in a 
  string representation to reference the cache, PEP 393 would jump 
  immediately to something  16 bits per grapheme... but likely would 
  anyway, if complex-graphemes are in the data stream.

The only language I know of that uses thousands of complex graphemes
is Korean ... and the precomposed forms are already in the BMP.  I
don't know how many accented forms you're likely to see in Vietnamese,
but I suspect it's less than 6400 (the number of characters in private
space in the BMP).  So for most applications, I believe that mapping
both non-BMP code points and grapheme clusters into that private space
should be feasible.  The only potential counterexample I can think of
is display of Arabic, which I have heard has thousands of glyphs in
good fonts because of the various ways ligatures form in that script.
However AFAIK no apps encode these as characters; I'm just admitting
that it *might* be useful.

This will require some care in registering such characters and
clusters because input text may already use private space according to
some convention, which would need to be respected.  Still, 6400
characters is a lot, even for the Japanese (IIRC the combined
repertoire of corporate characters that for some reason never made
it into the JIS sets is about 600, but almost all of them are already
in the BMP).  I believe the total number of Japanese emoticons is
about 200, but I doubt that any given text is likely to use more than
a few.  So I think there's plenty of space there.

This has a few advantages: (1) since these are real characters, all
Unicode algorithms will apply as long as the appropriate properties
are applied to the character in the database, and (2) it works with a
narrow code unit (specifically, UCS-2, but it could also be used with
UTF-8).  If you really need more than 6400 grapheme clusters, promote
to UTF-32, and get two more whole planes full (about 130,000 code
points).
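
A rough sketch of that registration idea (the names and starting code
point are made up for illustration; it assumes the caller already has
the text segmented into grapheme clusters):

# Map each grapheme cluster that is not a single BMP code point to a
# private-use code point (BMP PUA: U+E000..U+F8FF, 6400 slots), so the
# working string is a flat sequence of code points with O(1) indexing.
_PUA_START = 0xE000
_cluster_to_pua = {}
_pua_to_cluster = {}

def intern_cluster(cluster):
    if cluster not in _cluster_to_pua:
        cp = _PUA_START + len(_cluster_to_pua)
        if cp > 0xF8FF:
            raise RuntimeError("BMP private-use area exhausted")
        _cluster_to_pua[cluster] = chr(cp)
        _pua_to_cluster[chr(cp)] = cluster
    return _cluster_to_pua[cluster]

def fold(clusters):
    """clusters: an iterable of grapheme cluster strings."""
    return "".join(c if len(c) == 1 and ord(c) <= 0xFFFF else intern_cluster(c)
                   for c in clusters)

def unfold(folded):
    return "".join(_pua_to_cluster.get(ch, ch) for ch in folded)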

  I didn't attribute any efficiency to flagging lone surrogates (BI-5).  
  Since Windows uses a non-validated UCS-2 or UTF-16 character type, any 
  Python program that obtains data from Windows APIs may be confronted 
  with lone surrogates or inappropriate combining characters at any
  time.

I don't think so.  AFAIK all that data must pass through a codec,
which will validate it unless you specifically tell it not to.

  Round-tripping that data seems useful,

The standard doesn't forbid that.  (ISTR it did so in the past, but
what is required in 6.0 is a specific algorithm for identifying
well-formed portions of the text, basically if you're currently in an
invalid region, read individual code units and attempt to assemble a
valid sequence -- as soon as you do, that is a valid code point, and
you switch into valid state and return to the normal algorithm.)

Specifically, since surrogates are not characters, leaving them in the
data does not constitute interpreting them as characters.  I don't
recall if any of the error handlers allow this, though.

  However, returning modified forms of it to Windows as UCS-2 or
  UTF-16 data may still cause other applications to later
  accidentally combine the characters, if the modifications
  juxtaposed things to make them look reasonably, even if
  accidentally.

In CPython AFAIK (I don't do Windows) this can only happen if you use
a non-default error setting in the output codec.

  After writing all those ideas down, I actually preferred some of
  the others, that achieved O(1) real grapheme indexing, rather than
  caching character properties.

If you need O(1) grapheme indexing, use of private space seems a
winner to me.  It's just defining private precombined characters, and
they won't bother any Unicode application, even if they leak out.

   What are the costs to applications that don't want the cache?
   How is the bit-cache affected by PEP 393?
  
  If it is a separate type from str, then it costs nothing except the
  extra code space to implement the cache for those applications that
  do want it... most of which wouldn't be loaded for applications
  that don't, if done as a module or C extension.

I'm talking about the bit-cache (which all of your BI-N referred to,
at least indirectly).  Many applications will want to work with 

Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-01 Thread Stephen J. Turnbull
Guido van Rossum writes:
  On Thu, Sep 1, 2011 at 12:13 AM, Stephen J. Turnbull step...@xemacs.org 
  wrote:

   while at least this Spanish-as-a-second-language learner was taught
   that `ñ' is an atomic character represented by a discontiguous glyph,
   like `i', and it is no more related to `n' than `m' is.  Users really
   believe that characters are atomic.  Even in the cases of Han
   characters and Hangul, users think of the characters as being
   atomic, but in the sense of Bohr rather than that of Democritus.
  
  Ah, I think this may very well be culture-dependent.

I'm not an expert, but I'm fairly sure it is.  Specifically, I heard
from a TeX-ie friend that the same accented letter is typeset (and
collated) differently in different European languages because in some
of them the accent is considered part of the letter (making a
different character), while in others accents modify a single
underlying character.  The ones that consider the letter and accent to
constitute a single character also prefer to leave less space, he
said.

  But in a word like coëxistentie (coexistence) the o and e do not
  form the oe-sound, and to emphasize this to Dutch readers (who
  believe their spelling is very logical :-), the official spelling
  puts the umlaut on the e.

American English has the same usage, but it's optional (in particular,
you'll see naive, naif, and words like coordinate typeset that way
occasionally, for the same reason I suppose).

As Hagen Fürstenau points out, with multiple combining characters,
there are even more complex possibilities than "the accent is part of
the character" and "it's really not", and they may be application-
dependent.

  Finally, my guess is that the Spanish emphasis on ñ as a separate
  letter has to do with teaching how it has a separate position in the
  localized collation sequence, doesn't it?

You'd have to ask Mr. Gonzalez.  I suspect he may have taught that way
less because of his Castellano upbringing, and more because of the
infamous lack of sympathy of American high school students for the
fine points of usage in foreign languages.

  I'm also curious if ñ occurs as a separate character on Spanish
  keyboards.

If I'm reading /usr/share/X11/xkb/symbols/es correctly, it does in
X.org:  the key that for English users would map to ASCII tilde.



Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-01 Thread Antoine Pitrou

   Finally, my guess is that the Spanish emphasis on ñ as a separate
   letter has to do with teaching how it has a separate position in the
   localized collation sequence, doesn't it?
 
 You'd have to ask Mr. Gonzalez.  I suspect he may have taught that way
 less because of his Castellano upbringing, and more because of the
 infamous lack of sympathy of American high school students for the
 fine points of usage in foreign languages.

If you look at Wikipedia, it says:
“El alfabeto español consta de 27 letras” (“the Spanish alphabet has
27 letters”). The Ñ is separate from the N
(and so is it in my French-Spanish dictionnary). The accented letters,
however, are not considered separately.
http://es.wikipedia.org/wiki/Alfabeto_espa%C3%B1ol

(I can't tell you how annoying it is to type ñ when the tilde is accessed
using AltGr + 2 and you have to combine that with the Compose key and N
to obtain the full character. I'm sure Spanish keyboards have a better
way than that :-))

Regards

Antoine.




Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-01 Thread Tres Seaver

On 09/01/2011 02:54 PM, Antoine Pitrou wrote:
 
 If you look at Wikipedia, it says: “El alfabeto español consta de 27 
 letras”. The Ñ is separate from the N (and so is it in my 
 French-Spanish dictionnary). The accented letters, however, are not 
 considered separately. 
 http://es.wikipedia.org/wiki/Alfabeto_espa%C3%B1ol
 
 (I can't tell you how annoying to type ñ is when the tilde is 
 accessed using AltGr + 2 and you have to combine that with the 
 Compose key and N to obtain the full character. I'm sure Spanish 
 keyboards have a better way than that :-))

FWIW, I was taught that Spanish had 30 letters in the alfabeto:  the
'ñ', plus 'ch', 'll', and 'rr' were all considered distinct characters.

Kids-these-days'ly,


Tres.
- -- 
===
Tres Seaver  +1 540-429-0999  tsea...@palladion.com
Palladion Software   Excellence by Designhttp://palladion.com



Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-01 Thread Ethan Furman

Tres Seaver wrote:


On 09/01/2011 02:54 PM, Antoine Pitrou wrote:
If you look at Wikipedia, it says: “El alfabeto español consta de 27 
letras”. The Ñ is separate from the N (and so is it in my 
French-Spanish dictionnary). The accented letters, however, are not 
considered separately. 
http://es.wikipedia.org/wiki/Alfabeto_espa%C3%B1ol


(I can't tell you how annoying to type ñ is when the tilde is 
accessed using AltGr + 2 and you have to combine that with the 
Compose key and N to obtain the full character. I'm sure Spanish 
keyboards have a better way than that :-))


FWIW, I was taught that Spanish had 30 letters in the alfabeto:  the
'ñ', plus 'ch', 'll', and 'rr' were all considered distinct characters.

Kids-these-days'ly,


Not sure what's going on, but according to the article Antoine linked to 
those aren't letters anymore...  so much for the cultural awareness 
portion of UNESCO.


~Ethan~


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-01 Thread Antoine Pitrou
On Thu, 01 Sep 2011 12:38:07 -0700
Ethan Furman et...@stoneleaf.us wrote:
  
  FWIW, I was taught that Spanish had 30 letters in the alfabeto:  the
  'ñ', plus 'ch', 'll', and 'rr' were all considered distinct characters.
  
  Kids-these-days'ly,
 
 Not sure what's going on, but according to the article Antoine linked to 
 those aren't letters anymore...  so much for the cultural awareness 
 portion of UNESCO.

That Wikipedia article also says:

“Los dígrafos Ch y Ll tienen valores fonéticos específicos, y durante
los siglos XIX y XX se ordenaron separadamente de C y L, aunque la
práctica se abandonó en 1994 para homogeneizar el sistema con otras
lenguas.”

- roughly: “the Ch and Ll digraphs have specific phonetic values,
and during the 19th and 20th centuries they were ordered separately
from C and L, but this practice was abandoned in 1994 in order to
make the system consistent with other languages.”

And about rr:

“El dígrafo rr (llamado erre, /'ere/, y pronunciado /r/) nunca se
consideró por separado, probablemente por no aparecer nunca en posición
inicial.”

- “the rr digraph was never considered separate, probably because it
never appears at the very beginning of a word.”

Regards

Antoine.




Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-01 Thread Greg Ewing

Guido van Rossum wrote:


I recall long ago that when the french wrote words in all caps they
would drop the accents, e.g. ECOLE. I even recall (through the mists
of time) observing this in Paris on public signs. Is this still the
convention?


This page features a number of French street signs
in all-caps, and some of them have accents:

http://www.happymall.com/france/paris_street_signs.htm

--
Greg



Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-01 Thread Greg Ewing

Guido van Rossum wrote:

But in a word like coëxistentie (coexistence) the o
and e do not form the oe-sound, and to emphasize this to Dutch readers
(who believe their spelling is very logical :-), the official spelling
puts the umlaut on the e.


Sometimes this is done in English too -- occasionally
you see words like cooperation spelled with a diaresis
over the second o. But these days it's more common to
use a hyphen, or not bother at all. Everyone knows how
it's pronounced.

--
Greg


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-01 Thread Antoine Pitrou
On Fri, 02 Sep 2011 12:30:12 +1200
Greg Ewing greg.ew...@canterbury.ac.nz wrote:
 Guido van Rossum wrote:
 
  I recall long ago that when the french wrote words in all caps they
  would drop the accents, e.g. ECOLE. I even recall (through the mists
  of time) observing this in Paris on public signs. Is this still the
  convention?
 
 This page features a number of French street signs
 in all-caps, and some of them have accents:
 
 http://www.happymall.com/france/paris_street_signs.htm

I don't think some American souvenir shop is a good reference, though :)
(for example, there's no Paris street named château de Versailles)

Regards

Antoine.




Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-01 Thread Greg Ewing

Terry Reedy wrote:

Too bad, since people sometimes use '-' to serve the same purpose.


Which actually seems more logical to me -- a separating
symbol is better placed between the things being separated,
rather than over the top of one of them!

Maybe we could compromise by turning the diaeresis on
its side:

  co:operate

--
Greg



Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-01 Thread Steven D'Aprano

Antoine Pitrou wrote:

Le jeudi 01 septembre 2011 à 08:45 -0700, Guido van Rossum a écrit :

This is definitely thought of as a separate
mark added to the e; ë is not a new letter. I have a feeling it's the
same way for the French and Germans, but I really don't know.
(Antoine? Georg?)


Indeed, they are not separate letters (they are considered the same in
lexicographic order, and the French alphabet has 26 letters).



On the other hand, the same doesn't necessarily apply to other 
languages. (At least according to Wikipedia.)


http://en.wikipedia.org/wiki/Diacritic#Languages_with_letters_containing_diacritics


--
Steven



Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-01 Thread Stephen J. Turnbull
Tres Seaver writes:

  FWIW, I was taught that Spanish had 30 letters in the alfabeto:  the
  'ñ', plus 'ch', 'll', and 'rr' were all considered distinct characters.

That was always a Castellano vs. Americano issue, IIRC.  As I wrote,
Mr. Gonzalez was Castellano.

I believe that the deprecation of the digraphs as separate letters
occurred as the telephone became widely used in Spain, and the
telephone company demanded an official proclamation from whatever
Ministry is responsible for culture that it was OK to treat the
digraphs as two letters (specifically, to collate them that way), so
that they could use the programs that came with the OS.

So this stuff is not merely variant by culture, but also by economics
and politics. :-/


Re: [Python-Dev] Python 3 optimizations continued...

2011-09-01 Thread stefan brunthaler
Hi,

as promised, I created a publicly available preview of an
implementation with my optimizations, which is available under the
following location:
https://bitbucket.org/py3_pio/preview/wiki/Home

I followed Nick's advice and added some background and an
overview/introduction at the wiki page the link points to; I am
positive that spending 10 minutes reading it will give you
valuable information about what's happening.
In addition, as Guido already mentioned, this is more or less a direct
copy of my research branch, minus some of my private comments and with
*no* additional refactoring to address the software-engineering issues
(which I am very much aware of).

I hope this clarifies a *lot* and makes it easier to see what parts
are involved and how all the pieces fit together.

I hope you'll like it,
have fun,
--stefan


Re: [Python-Dev] PEP 393 Summer of Code Project

2011-09-01 Thread Greg Ewing

Antoine Pitrou wrote:


I don't think some American souvenir shop is a good reference, though :)
(for example, there's no Paris street named château de Versailles)


Hmmm, I'd assumed they were reproductions of actual
street signs found in Paris, but maybe not. :-(

--
Greg