Re: [Python-Dev] PEP 393 Summer of Code Project
Where I cut your words, we are in 100% agreement. (FWIW :-)

Guido van Rossum writes:

> On Tue, Aug 30, 2011 at 11:03 PM, Stephen J. Turnbull step...@xemacs.org wrote:
> > Well, that's why I wrote "intended to be suggestive". The Unicode
> > Standard does not specify at all what the internal representation of
> > characters may be; it only specifies what their external behavior
> > must be when two processes communicate. (For "process" as used in
> > the standard, think "Python modules" here, since we are concerned
> > with the problems of folks who develop in Python.) When observing
> > the behavior of a Unicode process, there are no UTF-16 arrays or
> > UTF-8 arrays or even UTF-32 arrays; only arrays of characters.
>
> Hm, that's not how I would read "process". IMO that is an
> intentionally vague term,

I agree. I'm sorry that I didn't make myself clear. The reason I read "process" as "module" is that some modules of Python, and therefore Python as a whole, cannot conform to the Unicode standard. E.g., anything that inputs or outputs bytes. Therefore only modules and types can be asked to conform. (I don't think it makes sense to ask anything lower-level to conform. See below, where I comment on your .lower() example.)

What I am advocating (for the long term) is provision of *one* module (or type) such that, if the text processing done by the application is done entirely in terms of this module (type), it will conform (to some specified degree, chosen to balance user wants against implementation and support costs). It may be desirable to provide others for sufficiently important particular use cases, but at present I see a clear need for *one*. Unicode conformance is going to be a common requirement for apps used by global enterprises.

I oppose trying to make str into that type. We need str, just as it is, for many reasons.

> and we are free to decide how to interpret it. I don't think it will
> work very well to define a process as a Python module; what about
> Python modules that agree about passing along arrays of code units
> (or streams of UTF-8, for that matter)?

Certainly a group of cooperating modules could form a conforming process, just as you describe it for one example. The "one module" mentioned above need not implement everything internally, but it would take responsibility for providing guarantees (e.g., unit tests) of whatever conformance claims it makes.

> > Thus, according to the rules of handling a UTF-16 stream, it is an
> > error to observe a lone surrogate or a surrogate pair that isn't a
> > high-low pair (Unicode 6.0, Ch. 3 "Conformance", requirements C1
> > and C8-C10). That's what I mean by "can't tell it's UTF-16".
>
> But if you can observe (valid) surrogate pairs it is still UTF-16.

In the concrete implementation I have in mind, surrogate pairs are represented by a str containing 2 code units. But in that case s[i][1] is an error, and s[i][0] == s[i]. print(s[i][0]) and print(s[i]) will print the same character to the screen. If you decode it to bytes, well, it's not a str any more, so what have you proved? I.e., what you will see is *code points* not in the BMP.

You don't have to agree that such "surrogate containment" behavior is as valuable as I think it is, but that's what I have in mind as one requirement for a conforming implementation of UTF-16.
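[The C1 and C8-C10 requirements referenced above are mechanical enough to sketch in a few lines. This is an illustrative checker over a sequence of UTF-16 code unit values, not code from the thread:

    def utf16_well_formed(units):
        # units: iterable of 16-bit code unit values (ints).
        it = iter(units)
        for u in it:
            if 0xD800 <= u <= 0xDBFF:        # high surrogate...
                low = next(it, None)         # ...must be followed by a low one
                if low is None or not (0xDC00 <= low <= 0xDFFF):
                    return False
            elif 0xDC00 <= u <= 0xDFFF:      # low surrogate with no high partner
                return False
        return True

    assert utf16_well_formed([0xD83D, 0xDE00])   # U+1F600 as a valid pair
    assert not utf16_well_formed([0xD83D])       # lone high surrogate
]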
> At the same time I think it would be useful if certain string
> operations like .lower() worked in such a way that *if* the input were
> valid UTF-16, *then* the output would also be, while *if* the input
> contained an invalid surrogate, the result would simply be something
> that is no worse (in particular, those are all mapped to themselves).

I don't think it's a good idea to go for conformance at the method level. It would be a feature for apps that don't claim full conformance, because they nevertheless give good results in more cases. The downside will be Python apps using str that pass conformance tests written for, say, Western Europe, while end users in Kuwait and Kuala Lumpur report bugs.

> An analogy is actually found in .lower() on 8-bit strings in Python 2:
> it assumes the string contains ASCII, and non-ASCII characters are
> mapped to themselves. If your string contains Latin-1 or EBCDIC or
> UTF-8 it will not do the right thing. But that doesn't mean strings
> cannot contain those encodings; it just means that the .lower() method
> is not useful if they do. (Why ASCII? Because that is the system
> encoding in Python 2.)

Sure. I think that approach is fine for str, too, except that I would hope it looks up BMP base characters in the case-mapping database. The fact is that, with very few exceptions, non-BMP characters are going to be symbols (mathematical operators and emoticons, for example). This is good enough, except when it's not; and when it's not, only 100% conformance is really a reasonable target. IMO, of course.

I think we should just document how it behaves and not get hung up on what it is
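[For readers on Python 3, bytes.lower() preserves exactly the behavior described above for Python 2's 8-bit str.lower(), which makes the analogy easy to demonstrate:

    # bytes.lower() lowercases only ASCII; non-ASCII bytes map to
    # themselves, so UTF-8 data passes through un-lowercased.
    data = 'École'.encode('utf-8')
    print(data.lower().decode('utf-8'))   # École  (the É survives unchanged)
    print('École'.lower())                # école  (str knows the case mapping)
]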
Re: [Python-Dev] PEP 393 Summer of Code Project
Glenn Linderman:

> How many different iterators into the same text would be concurrently
> needed by an application? And why? Seems like if it is dealing with
> text at the level of grapheme clusters, it needs that type of
> iterator. Of course, if it does I/O it needs codec access, but that is
> by nature sequential from the starting point to the end point.

I would expect that there would mostly be a single iterator into a string, but I can imagine scenarios in which multiple iterators are concurrently active, and these could be of different types. For example, say we wanted to search for each code point in a text that fails some test (such as being a member of a set of unwanted vowel diacritics) and then display each failure in context, with its surrounding text of up to 30 graphemes on either side.

Neil
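[Neil's example can be sketched with a rough grapheme approximation: a base character plus trailing combining marks. Real grapheme clustering per UAX #29 is more involved, and the function names here are made up:

    import unicodedata

    def graphemes(text):
        # Rough approximation: a grapheme is a base character plus any
        # trailing combining marks (no Hangul, ZWJ, or flag handling).
        cluster = ''
        for ch in text:
            if cluster and unicodedata.combining(ch):
                cluster += ch
            else:
                if cluster:
                    yield cluster
                cluster = ch
        if cluster:
            yield cluster

    def show_failures(text, is_unwanted, context=30):
        clusters = list(graphemes(text))
        for i, cluster in enumerate(clusters):
            if any(is_unwanted(ch) for ch in cluster):
                start = max(0, i - context)
                print(''.join(clusters[start:i + context + 1]))

    # Flag U+0301 COMBINING ACUTE ACCENT as "unwanted", for example.
    show_failures('ba\u0301d data', lambda ch: ch == '\u0301')
]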
Re: [Python-Dev] PEP 393 Summer of Code Project
Glenn Linderman writes:

> I found your discussion of streams versus arrays, as separate concepts
> related to Unicode, along with Terry's bisect indexing implementation,
> to be rather inspiring. Just because Unicode defines streams of code
> units of various sizes (UTF-8, UTF-16, UTF-32) to represent characters
> when processes communicate and for storage (which is one way processes
> communicate), that doesn't imply that the internal representation of
> character strings in a programming language must use exactly that
> representation.

That is true, and Unicode is *very* careful to define its requirements so that is true. That doesn't mean using an alternative representation is an improvement, though.

> I'm unaware of any current Python implementation that has chosen to
> use UTF-8 as the internal representation of character strings (I am
> aware that Perl has made that choice), yet UTF-8 is one of the
> commonly recommended character representations on the Linux platform,
> from what I read.

There are two reasons for that. First, widechar representations are right out for anything related to the file system or OS, unless you are prepared to translate before passing data to the OS. If you use UTF-8, then asking the user to use a UTF-8 locale to communicate with your app is a plausible way to eliminate any translation in your app. (The original moniker for UTF-8 was UTF-FSS, where FSS stands for "file system safe".) Second, much text processing is stream-oriented and one-pass. In those cases, the variable-width nature of UTF-8 doesn't cost you anything. E.g., this is why the common GUIs for Unix (X.org, GTK+, and Qt) either provide or require UTF-8 coding for their text. It costs *them* nothing and is file-system safe.

> So in that sense, Python has rejected the idea of using the native or
> OS-configured representation as its internal representation.

I can't agree with that characterization. POSIX defines the concept of *locale* precisely because the native representation of text in Unix is ASCII. Obviously that won't fly, so they solved the problem in the worst possible way <wink/>: they made the representation variable! It is the *variability* of text representation that Python rejects, just as Emacs and Perl do. They happen to have chosen six different representations.[1]

> So why, then, must one choose from a repertoire of Unicode-defined
> stream representations if they don't meet the goal of efficient
> length, indexing, or slicing operations on actual characters?

One need not. But why do anything else? It's not like the authors of that standard paid no attention to the various concerns about efficiency and backward compatibility! That's the question that you have not answered, and I am presently lacking any data that suggests I'll ever need the facilities you propose.

Footnotes:
[1] Emacs recently changed its mind. Originally it used the so-called MULE encoding; now it uses an extension of UTF-8 different from Perl's. Of course, Python beats that, with narrow, wide, and now PEP 393 representations! <wink/>
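[The stream-oriented, one-pass point is easy to picture with incremental decoding, e.g. a small sketch using the stdlib codecs module:

    import codecs
    import io

    # One-pass processing: decode UTF-8 incrementally, chunk by chunk,
    # without materializing the whole text or ever indexing into it.
    stream = io.BytesIO('coöperate, coëxist\n'.encode('utf-8'))
    for chunk in codecs.iterdecode(stream, 'utf-8'):
        print(chunk, end='')
]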
Re: [Python-Dev] PEP 393 Summer of Code Project
Stephen J. Turnbull:

> ... E.g., this is why the common GUIs for Unix (X.org, GTK+, and Qt)
> either provide or require UTF-8 coding for their text.

Qt uses UTF-16 for its basic QString type. While QString is mostly treated as a black box which you can create from input buffers in any encoding, the only encoding allowed for a contents-by-reference QString (QString::fromRawData) is UTF-16. http://doc.qt.nokia.com/latest/qstring.html#fromRawData

Neil
Re: [Python-Dev] PEP 393 Summer of Code Project
Glenn Linderman writes:

> How many different iterators into the same text would be concurrently
> needed by an application? And why?

A WYSIWYG editor for structured text (TeX, HTML) might want two (at least): one for the source window and one for the rendered window. One might want to save the state of the iterators (if that's possible) and cache it as one moves the window forward, to make short backward motion fast, giving you two (or four, etc.) more.

> Seems like if it is dealing with text at the level of grapheme
> clusters, it needs that type of iterator. Of course, if it does I/O it
> needs codec access, but that is by nature sequential from the starting
> point to the end point.

`save-region'? `save-text-remove-markup'?
Re: [Python-Dev] PEP 393 Summer of Code Project
On 9/1/2011 2:15 AM, Stephen J. Turnbull wrote:

> Glenn Linderman writes:
> > How many different iterators into the same text would be
> > concurrently needed by an application? And why?
>
> A WYSIWYG editor for structured text (TeX, HTML) might want two (at
> least), one for the source window and one for the rendered window. One
> might want to save the state of the iterators (if that's possible) and
> cache it as one moves the window forward to make short backward motion
> fast, giving you two (or four, etc.) more.

Sure. But those are probably all the same type of iterator: probably (since they are WYSIWYG) dealing with multi-codepoint characters (Guido's recent definition of "grapheme", which seems to subsume both grapheme clusters and composed characters). Hence all of them would be using/requiring the same sort of representation, index, analysis, or some combination of those.

> > Seems like if it is dealing with text at the level of grapheme
> > clusters, it needs that type of iterator. Of course, if it does I/O
> > it needs codec access, but that is by nature sequential from the
> > starting point to the end point.
>
> `save-region'? `save-text-remove-markup'?

Yes, save-region sounds like exactly what I was speaking of. save-text-remove-markup, I would infer, needs to process the text to remove the markup characters... since you used TeX and HTML as examples, markup is text, not binary (which would be a different problem). Since TeX and HTML markup is mostly ASCII, markup removal (or, more likely, text extraction) could be performed via a grapheme iterator, a codepoint iterator, or even a code unit iterator.
Re: [Python-Dev] PEP 393 Summer of Code Project
On 9/1/2011 12:59 AM, Stephen J. Turnbull wrote:

> Glenn Linderman writes:
> > We can either artificially constrain ourselves to minor tweaks of
> > the legal conforming bytestreams,
>
> It's not artificial. Having the internal representation be the same as
> a standard encoding is very useful for a large number of minor usages
> (urgently saving buffers in a text editor that knows its internal
> state is inconsistent, viewing strings in the debugger, PEP 393-style
> space optimization is simpler if text properties are out-of-band,
> etc.).

Saving buffers urgently when the internal state is inconsistent sounds like carefully preserving a bug. Windows 7 64-bit on one of my computers happily crashes several times a day when it detects inconsistent internal state... under the theory, I guess, that losing work is better than saving bad work. You sound the opposite. I'm actually very grateful that Firefox and Emacs recover gracefully from Windows crashes, and I lose very little data from the crashes, but I cannot recommend Windows 7 (this machine being my only experience with it) for stability.

In any case, the operations you mention still require the data to be processed, if ever so slightly, and I'll admit that a more complex representation would require a bit more processing. It's not clear that this would be huge or problematic for these cases. Except, I'm not sure how PEP 393 space optimization fits with the other operations. It may even be that an application-wide complex-grapheme cache would save significant space, although if it uses high bits in a string representation to reference the cache, PEP 393 would jump immediately to something 16 bits per grapheme... but it likely would anyway, if complex graphemes are in the data stream.

> > or we can invent a representation (whether called str or something
> > else) that is useful and efficient in practice.
>
> Bring on the practice, then. You say that a bit to identify lone
> surrogates might be useful or efficient. In what application? How much
> time or space does it save?

I didn't attribute any efficiency to flagging lone surrogates (BI-5). Since Windows uses a non-validated UCS-2 or UTF-16 character type, any Python program that obtains data from Windows APIs may be confronted with lone surrogates or inappropriate combining characters at any time. Round-tripping that data seems useful, even though the data itself may not be as useful as validated Unicode characters would be. Accidentally combining the characters due to slicing and dicing the data, and doing normalizations, or what not, would not likely be appropriate. However, returning modified forms of it to Windows as UCS-2 or UTF-16 data may still cause other applications to later accidentally combine the characters, if the modifications juxtaposed things to make them look reasonable, even if accidentally. If intentionally, of course, the bit could be turned off. This exact sort of problem with non-validated UTF-8 bytes was addressed already in Python, mostly for Linux, allowing round-tripping of the byte stream even though it is not valid. BI-6 suggests a different scheme for that, without introducing lone surrogates (which might accidentally get combined with other lone surrogates).

> You say that a bit to cache a property might be useful or efficient.
> In what application? Which properties? Are those properties a set
> fixed by the language, or would some bits be available for
> application-specific property caching? How much time or space does
> that save?

The brainstorming ideas I presented were just that... ideas. And they were independent. And the use of many high-order bits for properties was one of the independent ones. When I wrote that one, I was assuming a UTF-32 representation (which wastes 11 bits of each 32). One thing I did have in mind, with the high-order bits for that representation, was to flag the start, end, or middle of the codes that are included in a grapheme. That would be redundant with some of the Unicode codepoint property databases, if I understand them properly... whether it would make iterators enough more efficient to be worth the complexity would have to be benchmarked. After writing all those ideas down, I actually preferred some of the others, which achieved O(1) real grapheme indexing rather than caching character properties.

> What are the costs to applications that don't want the cache? How is
> the bit-cache affected by PEP 393?

If it is a separate type from str, then it costs nothing except the extra code space to implement the cache for those applications that do want it... most of which wouldn't be loaded for applications that don't, if done as a module or C extension.

> I know of no answers (none!) to those questions that favor
> introduction of a bit-cache representation now. And those bits aren't
> going anywhere; it will always be possible to use a wide build and
> change the representation later, if
Re: [Python-Dev] Python 3 optimizations continued...
On 8/30/2011 4:41 PM, stefan brunthaler wrote:

> > Ok, then there's something else you haven't told us. Are you saying
> > that the original (old) bytecode is still used (and hence written to
> > and read from .pyc files)?
>
> Short answer: yes.
> Long answer: I added an invocation counter to the code object and keep
> interpreting in the usual Python interpreter until this counter
> reaches a configurable threshold. When it reaches this threshold, I
> create the new instruction format and interpret with this optimized
> representation. All the macros look exactly the same in the source
> code; they are just redefined to use the different instruction format.
> I am at no point serializing this representation or the runtime
> information gathered by me, as any subsequent invocation might have
> different characteristics.
>
> Best,
> --stefan

When the switchover to the new instruction format happens, what happens to sys.settrace() tracing? Will it report the same sequence of line numbers? For a small but important class of program executions, this is more important than speed.

--Ned.
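[The counter-driven switchover Stefan describes can be pictured with a toy model. This is a sketch in Python rather than the C of the real interpreter; the threshold value, instruction names, and helper functions are all made up for illustration:

    THRESHOLD = 100  # configurable switchover point (illustrative value)

    class Code:
        """Stands in for a code object carrying both formats."""
        def __init__(self, instructions):
            self.instructions = instructions  # original bytecode (what .pyc holds)
            self.optimized = None             # derived format, never serialized
            self.invocations = 0

    def optimize(op):
        return op + '_Q'  # placeholder for building the new format

    def execute(op):
        pass              # placeholder for the dispatch macros

    def run(code):
        code.invocations += 1
        if code.optimized is None and code.invocations >= THRESHOLD:
            # Build the optimized instruction format on the fly; the
            # runtime information is regenerated, never written to disk.
            code.optimized = [optimize(op) for op in code.instructions]
        for op in (code.optimized or code.instructions):
            execute(op)

    code = Code(['LOAD_FAST', 'BINARY_ADD', 'RETURN_VALUE'])
    for _ in range(THRESHOLD + 1):
        run(code)
]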
Re: [Python-Dev] Python 3 optimizations continued...
2011/9/1 Ned Batchelder n...@nedbatchelder.com

> When the switchover to the new instruction format happens, what
> happens to sys.settrace() tracing? Will it report the same sequence of
> line numbers? For a small but important class of program executions,
> this is more important than speed. --Ned

A simple solution: when tracing is enabled, the new instruction format will never be executed (and information tracking will be disabled as well).

Regards,
Cesare
Re: [Python-Dev] Python 3 optimizations continued...
Cesare Di Mauro wrote:

> 2011/9/1 Ned Batchelder n...@nedbatchelder.com
> > When the switchover to the new instruction format happens, what
> > happens to sys.settrace() tracing? Will it report the same sequence
> > of line numbers? For a small but important class of program
> > executions, this is more important than speed. --Ned
>
> A simple solution: when tracing is enabled, the new instruction format
> will never be executed (and information tracking disabled as well).

What happens if tracing is enabled *during* the execution of the new instruction format? Some sort of deoptimisation will be required in order to recover the correct VM state.

Cheers,
Mark.
Re: [Python-Dev] Python 3 optimizations continued...
2011/9/1 Mark Shannon m...@hotpy.org

> What happens if tracing is enabled *during* the execution of the new
> instruction format? Some sort of deoptimisation will be required in
> order to recover the correct VM state.
>
> Cheers,
> Mark.

Sure. I don't think that the regular ceval.c loop will be dropped when executing the new instruction format, so we can intercept a change like this using the "why" variable, for example, or something similar that is normally used to break the regular loop execution. Anyway, we need to take a look at the code.

Cheers,
Cesare
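[A sketch of the kind of interception being discussed, in Python rather than the C of ceval.c's dispatch loop where the real check would live. The instruction names are made up, and the two formats are deliberately one-to-one so the program counter maps across directly; a real implementation would need an explicit mapping table:

    import sys

    def run(original, quickened):
        # Toy dispatch loop over instruction names.
        ops, pc = quickened, 0
        while pc < len(ops):
            if ops is quickened and sys.gettrace() is not None:
                # Tracing was enabled mid-run: deoptimize to the original
                # format so the tracer sees the exact line-level state.
                ops = original
                continue
            print('executing', ops[pc])  # placeholder for real dispatch
            pc += 1

    run(['LOAD_FAST', 'BINARY_ADD', 'RETURN_VALUE'],
        ['LOAD_FAST_Q', 'BINARY_ADD_Q', 'RETURN_VALUE_Q'])
]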
Re: [Python-Dev] PEP 393 Summer of Code Project
> Ok, I thought there was also a form normalized (denormalized?) to
> decomposed form. But I'll take your word.

If I understood the example correctly, he needs a mixed form, with some characters decomposed and some composed (depending on which one looks better in the given font). I agree that this sounds more like a font problem, but it's a widespread font problem, and it may be necessary to address it in an application. But this is only one example of why an application-specific concept of "graphemes", different from the Unicode-defined normalized forms, can be useful.

I think the very concept of a grapheme is context-, language-, and culture-specific. For example, in Chinese Pinyin it would be very natural to write tone marks with combining diacritics (i.e., in decomposed form). But then you have the vowel "ü", and it would be strange to decompose it into a "u" and a combining diaeresis. So conceptually the most sensible representation of "lǜ" would be neither the composed nor the decomposed normal form, and depending on its needs an application might want to represent it in a mixed form (composing the diaeresis with the "u", but leaving the grave accent separate).

There must be many more examples where the conceptual context determines the right composition, like "ñ", which in Spanish is certainly a grapheme, but in mathematics might be better represented as "n" plus tilde.

The bottom line is that, while an array of Unicode code points is certainly a generally useful data type (and PEP 393 is a great improvement in this regard), an array of graphemes carries many subtleties and may not be nearly as universal. Support in the spirit of unicodedata's normalization function etc. is certainly a good thing, but we shouldn't assume that everyone will want Python to do their graphemes for them.

- Hagen
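[Hagen's lǜ example is easy to make concrete with the stdlib; a small illustrative sketch:

    import unicodedata

    # 'lǜ' as in Pinyin: l + u + combining diaeresis + combining grave.
    nfc = unicodedata.normalize('NFC', 'lu\u0308\u0300')  # fully composed
    nfd = unicodedata.normalize('NFD', nfc)               # fully decomposed
    mixed = 'l\u00fc\u0300'                               # ü composed, grave separate

    for label, s in [('NFC', nfc), ('NFD', nfd), ('mixed', mixed)]:
        print(label, [unicodedata.name(c) for c in s])
    # All three spellings are canonically equivalent, yet neither normal
    # form yields the mixed representation an application might prefer.
]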
Re: [Python-Dev] PEP 393 Summer of Code Project
On Thu, Sep 1, 2011 at 12:13 AM, Stephen J. Turnbull step...@xemacs.org wrote:

> Where I cut your words, we are in 100% agreement. (FWIW :-)

Not quite the same here, but I don't feel the need to have the last word. Most of what you say makes sense; in some cases we'll quibble later. But there are a few points where I have something to add:

> No, and I can tell you why! The difference between characters and
> words is much more important than that between code point and grapheme
> cluster for most users and the developers who serve them. Even small
> children recognize typographical ligatures as being composite objects,

True -- in fact I didn't know that the ff and ffl ligatures *existed* until I learned about Unix troff.

> while at least this Spanish-as-a-second-language learner was taught
> that `ñ' is an atomic character represented by a discontiguous glyph,
> like `i', and it is no more related to `n' than `m' is. Users really
> believe that characters are atomic. Even in the cases of Han
> characters and Hangul, users think of the characters as being atomic,
> but in the sense of Bohr rather than that of Democritus.

Ah, I think this may very well be culture-dependent. In Holland there are no Dutch words that use accented letters, but the accents are known because there are a lot of words borrowed from French or German. We (the Dutch) think of these as letters with accents, and in fact we think of the accents as modifiers that can be added to any letter (at least I know that's how I thought about it -- perhaps I was also influenced by the way one had to type those on a mechanical typewriter).

Dutch does have one native use of the umlaut (though it has a different name, I forget which, maybe "trema" :-), when there are two consecutive vowels that would normally be read as a special sound (a diphthong?). E.g. in "koe" (cow) the oe is two letters (not a single letter formed of two distinct shapes!) that mean a special sound (roughly KOO). But in a word like "coëxistentie" (coexistence) the o and e do not form the oe-sound, and to emphasize this to Dutch readers (who believe their spelling is very logical :-), the official spelling puts the umlaut on the e. This is definitely thought of as a separate mark added to the e; ë is not a new letter. I have a feeling it's the same way for the French and Germans, but I really don't know. (Antoine? Georg?)

Finally, my guess is that the Spanish emphasis on ñ as a separate letter has to do with teaching that it has a separate position in the localized collation sequence, doesn't it? I'm also curious whether ñ occurs as a separate character on Spanish keyboards.

--
--Guido van Rossum (python.org/~guido)
Re: [Python-Dev] PEP 393 Summer of Code Project
On Thursday, 01 September 2011 at 08:45 -0700, Guido van Rossum wrote:

> This is definitely thought of as a separate mark added to the e; ë is
> not a new letter. I have a feeling it's the same way for the French
> and Germans, but I really don't know. (Antoine? Georg?)

Indeed, they are not separate letters (they are considered the same in lexicographic order, and the French alphabet has 26 letters). But I'm not sure how that's relevant, because you can't remove an accent without most likely making a spelling error, or at least changing the meaning. Accents are very much part of the language (while ligatures like ff are not; they are a rendering detail). So I would consider é, ê, ù, etc. atomic characters for the purpose of processing French text. And I don't see how a decomposed form could help an application.

Regards
Antoine.
Re: [Python-Dev] PEP 393 Summer of Code Project
On Thu, Sep 1, 2011 at 9:03 AM, Antoine Pitrou solip...@pitrou.net wrote:

> Indeed, they are not separate letters (they are considered the same in
> lexicographic order, and the French alphabet has 26 letters). But I'm
> not sure how it's relevant, because you can't remove an accent without
> most likely making a spelling error, or at least changing the meaning.
> Accents are very much part of the language (while ligatures like ff
> are not, they are a rendering detail). So I would consider é, ê, ù,
> etc. atomic characters for the purpose of processing French text. And
> I don't see how a decomposed form could help an application.

The example given was someone who didn't agree with how a particular font rendered those accented characters. I agree that's obscure, though.

I recall from long ago that when the French wrote words in all caps they would drop the accents, e.g. ECOLE. I even recall (through the mists of time) observing this in Paris on public signs. Is this still the convention? Maybe it was only a compromise in the time of Morse code?

--
--Guido van Rossum (python.org/~guido)
Re: [Python-Dev] PEP 393 Summer of Code Project
> The example given was someone who didn't agree with how a particular
> font rendered those accented characters. I agree that's obscure
> though. I recall long ago that when the french wrote words in all caps
> they would drop the accents, e.g. ECOLE. I even recall (through the
> mists of time) observing this in Paris on public signs. Is this still
> the convention? Maybe it only was a compromise in the time of Morse
> code?

I think it is tolerated, partly because typing support (on computers and typewriters) has been weak. On a French keyboard you have an "é" key, but shifting it gives you "2", not "É". The latter can be obtained using the Caps Lock key under Linux, but not under Windows. (So you could also write Éric's name "Eric", for example.) That said, most typography nowadays seems careful to keep the accents on uppercase letters (e.g. on book covers; AFAIR, road signs also keep the accents, but I'm no driver).

Regards
Antoine.
Re: [Python-Dev] PEP 393 Summer of Code Project
Guido van Rossum, 01.09.2011 18:31:

> On Thu, Sep 1, 2011 at 9:03 AM, Antoine Pitrou wrote:
> > Indeed, they are not separate letters (they are considered the same
> > in lexicographic order, and the French alphabet has 26 letters).

So does the German alphabet, even though that count does not include ß, which basically descended from a ligature of the old German way of writing "sz", where the "s" looked similar to an "f" and the "z" had a low-hanging tail. IIRC, German umlaut letters are lexicographically sorted according to their emergency replacement spelling (ä -> ae), which is also sometimes used in all-upper-case words (Glück -> GLUECK). I guess that's because umlaut dots are harder to see on top of upper-case letters. So Latin-1 byte-value sorting always yields totally wrong results.

That aside, umlaut letters are commonly considered separate letters, different from the un-dotted letters and also different from the replacement spellings. I, for one, always found the replacements rather weird and never got used to using them in upper-case words. In any case, it's wrong to always use them, and it makes text harder to read.

> > But I'm not sure how it's relevant, because you can't remove an
> > accent without most likely making a spelling error, or at least
> > changing the meaning. Accents are very much part of the language
> > (while ligatures like ff are not, they are a rendering detail). So I
> > would consider é, ê, ù, etc. atomic characters for the purpose of
> > processing French text. And I don't see how a decomposed form could
> > help an application.
>
> I recall long ago that when the french wrote words in all caps they
> would drop the accents, e.g. ECOLE. I even recall (through the mists
> of time) observing this in Paris on public signs. Is this still the
> convention?

Yes, and it's a huge problem when trying to pronounce last names. In French, you'd commonly write "LASTNAME, Firstname", and if LASTNAME happens to have accented letters, you'd miss them when reading that. I know a couple of French people who suffer severely from this, because the pronunciation of their name gets a totally different meaning without accents.

Stefan
Re: [Python-Dev] Python 3 optimizations continued...
On Sep 1, 2011, at 5:23 AM, Cesare Di Mauro wrote:

> A simple solution: when tracing is enabled, the new instruction format
> will never be executed (and information tracking disabled as well).

Correct me if I'm wrong: doesn't this mean that no profiler will be able to accurately measure the performance impact of the new instruction format, and therefore one may get incorrect data when one is trying to make a CPU optimization for real-world performance?
Re: [Python-Dev] PEP 393 Summer of Code Project
Antoine Pitrou, 01.09.2011 18:46:

> AFAIR, road signs also keep the accents, but I'm no driver

Right, I noticed that, too. That's certainly not uncommon. I think it's mostly because of local pride (after all, the road signs are all that many drivers ever see of a city), but sometimes also because it can't be helped when the name gets a different meaning without accents. People just cause too many accidents when they burst out laughing while entering a city by car.

Stefan
Re: [Python-Dev] Python 3 optimizations continued...
On Thu, Sep 1, 2011 at 10:15 AM, Glyph Lefkowitz gl...@twistedmatrix.com wrote:

> On Sep 1, 2011, at 5:23 AM, Cesare Di Mauro wrote:
> > A simple solution: when tracing is enabled, the new instruction
> > format will never be executed (and information tracking disabled as
> > well).
>
> Correct me if I'm wrong: doesn't this mean that no profiler will be
> able to accurately measure the performance impact of the new
> instruction format, and therefore one may get incorrect data when one
> is trying to make a CPU optimization for real-world performance?

Well, profilers already skew results by adding call overhead. But tracing for debugging and profiling don't do exactly the same thing: debug tracing stops at every line, whereas profiling only executes hooks at the start and end of a function.(*) So I think the function body could still be executed using the new format (assuming this is turned on/off per code object anyway).

(*) And whenever a generator yields or is resumed. I consider that an annoying bug, though, just as the debugger doesn't do the right thing with yield -- there's no way to continue until the yielding generator is resumed, short of setting a manual breakpoint on the next line.

--
--Guido van Rossum (python.org/~guido)
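[The per-line versus per-call distinction Guido describes is easy to observe directly; a small sketch:

    import sys

    def demo():
        x = 1
        y = 2
        return x + y

    def tracer(frame, event, arg):
        print('trace:', event, frame.f_lineno)
        return tracer  # keep receiving per-line events

    def profiler(frame, event, arg):
        print('profile:', event, frame.f_lineno)

    sys.settrace(tracer)      # reports call, every line, and return
    demo()
    sys.settrace(None)

    sys.setprofile(profiler)  # reports only call and return
    demo()
    sys.setprofile(None)
]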
Re: [Python-Dev] Ctypes and the stdlib (was Re: LZMA compression support in 3.3)
On Tue, Aug 30, 2011 at 10:05 AM, Guido van Rossum gu...@python.org wrote:

> On Tue, Aug 30, 2011 at 9:49 AM, Martin v. Löwis mar...@v.loewis.de wrote:
> The problem lies with the PyPy backend -- there it generates ctypes
> code, which means that the signature you declare to Cython/Pyrex must
> match the *linker*-level API, not the C-compiler-level API. Thus, if
> in a system header a certain function is really a macro that invokes
> another function with a permuted or augmented argument list, you'd
> have to know what that macro does. I also don't see how this would
> work for #defined constants: where does Cython/Pyrex get their value?
> ctypes doesn't have their values. So, for PyPy, a solution based on
> Cython/Pyrex has many of the same downsides as one based on ctypes
> where it comes to complying with an API defined by a .h file.

It's certainly a harder problem.

For most simple constants, Cython/Pyrex might be able to generate a series of tiny C programs with which to find CPP symbol values:

    #include <stdio.h>
    #include "file1.h"
    ...
    #include "filen.h"

    int main(void)
    {
        printf("%d", POSSIBLE_CPP_SYMBOL1);
        return 0;
    }

...and again with %f, %s, etc. The typing is quite a mess, and code fragments would probably be impractical. But since the C preprocessor is supposedly Turing-complete, maybe there's a pleasant surprise waiting there. But hopefully clang has something that'd make this easier.

SIP's approach of using something close to, but not identical to, the .h files sounds like it might be pretty productive -- especially if the derivative of the .h files could be automatically generated using a Python script, with minor tweaks to the inputs on .h upgrades. But SIP itself is apparently C++-only.
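[The "tiny C programs" idea might look something like the following sketch. It assumes a "cc" compiler on the PATH, uses <limits.h>'s INT_MAX as a stand-in for the project's own headers and symbols, and omits the per-symbol error handling a real tool would need:

    import os
    import subprocess
    import tempfile

    PROBE = '''#include <stdio.h>
    #include <limits.h>
    int main(void) { printf("%d", SYMBOL); return 0; }
    '''

    def probe_int_constant(symbol):
        # Compile and run a one-line C program that prints the macro's value.
        src = PROBE.replace('SYMBOL', symbol)
        with tempfile.TemporaryDirectory() as d:
            c_file = os.path.join(d, 'probe.c')
            exe = os.path.join(d, 'probe')
            with open(c_file, 'w') as f:
                f.write(src)
            subprocess.check_call(['cc', c_file, '-o', exe])
            return int(subprocess.check_output([exe]))

    print(probe_int_constant('INT_MAX'))
]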
Re: [Python-Dev] PEP 393 Summer of Code Project
On 9/1/2011 11:45 AM, Guido van Rossum wrote:

> typewriter). Dutch does have one native use of the umlaut (though it
> has a different name, I forget which, maybe trema :-),

You remember correctly. According to https://secure.wikimedia.org/wikipedia/en/wiki/Trema_%28diacritic%29 'trema' (Greek for 'hole') is the generic name of the double-dot vowel diacritic. It was originally used for 'diaeresis' (Greek for 'taking apart'), when it shows that a vowel letter is not part of a digraph or diphthong. (Note that the 'ae' in 'diaeresis' *is* a digraph ;-). Germans later used it to indicate umlaut, 'changed sound'.

> when there are two consecutive vowels that would normally be read as a
> special sound (diphthong?). E.g. in koe (cow) the oe is two letters
> (not a single letter formed of two distinct shapes!) that mean a
> special sound (roughly KOO). But in a word like coëxistentie
> (coexistence) the o and e do not form the oe-sound, and to emphasize
> this to Dutch readers (who believe their spelling is very logical :-),
> the official spelling puts the umlaut on the e. This is definitely
> thought of as a separate mark added to the e; ë is not a new letter.

So the above is trema-as-diaeresis. Dutch, French, and Spanish make regular use of the diaeresis. English uses such as 'coöperate' have become rare or archaic, perhaps because we cannot type them. Too bad, since people sometimes use '-' to serve the same purpose.

--
Terry Jan Reedy
Re: [Python-Dev] Cython, Ctypes and the stdlib
Dan Stromberg, 01.09.2011 19:56:

> On Tue, Aug 30, 2011 at 10:05 AM, Guido van Rossum wrote:
> > The problem lies with the PyPy backend -- there it generates ctypes
> > code, which means that the signature you declare to Cython/Pyrex
> > must match the *linker*-level API, not the C-compiler-level API.
> > Thus, if in a system header a certain function is really a macro
> > that invokes another function with a permuted or augmented argument
> > list, you'd have to know what that macro does. I also don't see how
> > this would work for #defined constants: where does Cython/Pyrex get
> > their value? ctypes doesn't have their values. So, for PyPy, a
> > solution based on Cython/Pyrex has many of the same downsides as one
> > based on ctypes where it comes to complying with an API defined by a
> > .h file.
>
> It's certainly a harder problem. For most simple constants,
> Cython/Pyrex might be able to generate a series of tiny C programs
> with which to find CPP symbol values [...] The typing is quite a mess

The user will commonly declare #defined values as typed external variables, and callable macros as functions, in .pxd files. These manually typed macro "functions" allow users to tell Cython what it should know about how the macros will be used. And that would allow it to generate C/C++ glue code for them that uses the declared types as a real function signature and calls the macro underneath.

> and code fragments would probably be impractical.

Not necessarily at the C level, but certainly for a ctypes backend, yes.

> But hopefully clang has something that'd make this easier.

For figuring these things out, maybe. Not so much for solving the problems they introduce.

Stefan
Re: [Python-Dev] PEP 393 Summer of Code Project
Glenn Linderman writes:

> Windows 7 64-bit on one of my computers happily crashes several times
> a day when it detects inconsistent internal state... under the theory,
> I guess, that losing work is better than saving bad work. You sound
> the opposite.

Definitely. Windows apps habitually overwrite existing work; saving when inconsistent would be a bad idea. The apps I work on dump their unsaved buffers to new files, and give you a chance to look at them before instating them as the current version when you restart.

> Except, I'm not sure how PEP 393 space optimization fits with the
> other operations. It may even be that an application-wide
> complex-grapheme cache would save significant space, although if it
> uses high bits in a string representation to reference the cache, PEP
> 393 would jump immediately to something 16 bits per grapheme... but
> likely would anyway, if complex graphemes are in the data stream.

The only language I know of that uses thousands of complex graphemes is Korean... and the precomposed forms are already in the BMP. I don't know how many accented forms you're likely to see in Vietnamese, but I suspect it's less than 6400 (the number of characters in the private space in the BMP). So for most applications, I believe that mapping both non-BMP code points and grapheme clusters into that private space should be feasible. The only potential counterexample I can think of is display of Arabic, which I have heard has thousands of glyphs in good fonts because of the various ways ligatures form in that script. However, AFAIK no apps encode these as characters; I'm just admitting that it *might* be useful.

This will require some care in registering such characters and clusters, because input text may already use private space according to some convention, which would need to be respected. Still, 6400 characters is a lot, even for the Japanese (IIRC the combined repertoire of corporate characters that for some reason never made it into the JIS sets is about 600, and almost all of them are already in the BMP). I believe the total number of Japanese emoticons is about 200, but I doubt that any given text is likely to use more than a few. So I think there's plenty of space there.

This has a few advantages: (1) since these are real characters, all Unicode algorithms will apply as long as the appropriate properties are applied to the character in the database, and (2) it works with a narrow code unit (specifically, UCS-2, but it could also be used with UTF-8). If you really need more than 6400 grapheme clusters, promote to UTF-32 and get two more whole planes full (about 130,000 code points).

> I didn't attribute any efficiency to flagging lone surrogates (BI-5).
> Since Windows uses a non-validated UCS-2 or UTF-16 character type, any
> Python program that obtains data from Windows APIs may be confronted
> with lone surrogates or inappropriate combining characters at any
> time.

I don't think so. AFAIK all that data must pass through a codec, which will validate it unless you specifically tell it not to.

> Round-tripping that data seems useful,

The standard doesn't forbid that. (ISTR it did so in the past, but what is required in 6.0 is a specific algorithm for identifying well-formed portions of the text: basically, if you're currently in an invalid region, read individual code units and attempt to assemble a valid sequence -- as soon as you do, that is a valid code point, and you switch into valid state and return to the normal algorithm.) Specifically, since surrogates are not characters, leaving them in the data does not constitute interpreting them as characters. I don't recall if any of the error handlers allow this, though.

> However, returning modified forms of it to Windows as UCS-2 or UTF-16
> data may still cause other applications to later accidentally combine
> the characters, if the modifications juxtaposed things to make them
> look reasonable, even if accidentally.

In CPython AFAIK (I don't do Windows) this can only happen if you use a non-default error setting in the output codec.

> After writing all those ideas down, I actually preferred some of the
> others, that achieved O(1) real grapheme indexing, rather than caching
> character properties.

If you need O(1) grapheme indexing, use of private space seems a winner to me. It's just defining private precombined characters, and they won't bother any Unicode application, even if they leak out.

> > What are the costs to applications that don't want the cache? How is
> > the bit-cache affected by PEP 393?
>
> If it is a separate type from str, then it costs nothing except the
> extra code space to implement the cache for those applications that do
> want it... most of which wouldn't be loaded for applications that
> don't, if done as a module or C extension.

I'm talking about the bit-cache (which all of your BI-N referred to, at least indirectly). Many applications will want to work with
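[Stephen's private-space idea can be sketched concretely. Illustrative only: a real implementation would live inside the string type and, as he notes, would have to respect any private-use conventions already present in the input:

    # Illustrative registry mapping grapheme clusters (or non-BMP code
    # points) into the BMP private use area, U+E000..U+F8FF -- the 6400
    # code points mentioned above -- to regain O(1), fixed-width indexing.
    PUA_START, PUA_END = 0xE000, 0xF8FF

    class ClusterRegistry:
        def __init__(self):
            self._to_pua = {}
            self._from_pua = {}

        def intern(self, cluster):
            code = self._to_pua.get(cluster)
            if code is None:
                code = PUA_START + len(self._to_pua)
                if code > PUA_END:
                    raise OverflowError('promote to UTF-32 planes instead')
                self._to_pua[cluster] = code
                self._from_pua[code] = cluster
            return chr(code)

        def expand(self, text):
            return ''.join(self._from_pua.get(ord(c), c) for c in text)

    reg = ClusterRegistry()
    compact = 'e' + reg.intern('e\u0301')   # e + combining acute accent
    assert len(compact) == 2                # fixed-width, indexable
    assert reg.expand(compact) == 'ee\u0301'
]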
Re: [Python-Dev] PEP 393 Summer of Code Project
Guido van Rossum writes:

> On Thu, Sep 1, 2011 at 12:13 AM, Stephen J. Turnbull
> step...@xemacs.org wrote:
> > while at least this Spanish-as-a-second-language learner was taught
> > that `ñ' is an atomic character represented by a discontiguous
> > glyph, like `i', and it is no more related to `n' than `m' is. Users
> > really believe that characters are atomic. Even in the cases of Han
> > characters and Hangul, users think of the characters as being
> > atomic, but in the sense of Bohr rather than that of Democritus.
>
> Ah, I think this may very well be culture-dependent.

I'm not an expert, but I'm fairly sure it is. Specifically, I heard from a TeX-ie friend that the same accented letter is typeset (and collated) differently in different European languages, because in some of them the accent is considered part of the letter (making a different character), while in others accents modify a single underlying character. The ones that consider the letter and accent to constitute a single character also prefer to leave less space, he said.

> But in a word like coëxistentie (coexistence) the o and e do not form
> the oe-sound, and to emphasize this to Dutch readers (who believe
> their spelling is very logical :-), the official spelling puts the
> umlaut on the e.

American English has the same usage, but it's optional (in particular, you'll see "naive", "naif", and words like "coordinate" typeset that way occasionally, for the same reason I suppose). As Hagen Fürstenau points out, with multiple combining characters there are even more complex possibilities than "the accent is part of the character" and "it's really not", and they may be application-dependent.

> Finally, my guess is that the Spanish emphasis on ñ as a separate
> letter has to do with teaching how it has a separate position in the
> localized collation sequence, doesn't it?

You'd have to ask Mr. Gonzalez. I suspect he may have taught that way less because of his Castellano upbringing, and more because of the infamous lack of sympathy of American high school students for the fine points of usage in foreign languages.

> I'm also curious if ñ occurs as a separate character on Spanish
> keyboards.

If I'm reading /usr/share/X11/xkb/symbols/es correctly, it does in X.org: the key that for English users would map to ASCII tilde.
Re: [Python-Dev] PEP 393 Summer of Code Project
> > Finally, my guess is that the Spanish emphasis on ñ as a separate
> > letter has to do with teaching how it has a separate position in the
> > localized collation sequence, doesn't it?
>
> You'd have to ask Mr. Gonzalez. I suspect he may have taught that way
> less because of his Castellano upbringing, and more because of the
> infamous lack of sympathy of American high school students for the
> fine points of usage in foreign languages.

If you look at Wikipedia, it says: “El alfabeto español consta de 27 letras” (“the Spanish alphabet consists of 27 letters”). The Ñ is separate from the N (and so it is in my French-Spanish dictionary). The accented letters, however, are not considered separately. http://es.wikipedia.org/wiki/Alfabeto_espa%C3%B1ol

(I can't tell you how annoying ñ is to type when the tilde is accessed using AltGr + 2 and you have to combine that with the Compose key and N to obtain the full character. I'm sure Spanish keyboards have a better way than that :-))

Regards
Antoine.
Re: [Python-Dev] PEP 393 Summer of Code Project
On 09/01/2011 02:54 PM, Antoine Pitrou wrote:

> If you look at Wikipedia, it says: “El alfabeto español consta de 27
> letras”. The Ñ is separate from the N (and so is it in my
> French-Spanish dictionary). The accented letters, however, are not
> considered separately.
> http://es.wikipedia.org/wiki/Alfabeto_espa%C3%B1ol

FWIW, I was taught that Spanish had 30 letters in the alfabeto: the 'ñ', plus 'ch', 'll', and 'rr' were all considered distinct characters.

Kids-these-days'ly,
Tres.
--
Tres Seaver +1 540-429-0999 tsea...@palladion.com
Palladion Software "Excellence by Design" http://palladion.com
Re: [Python-Dev] PEP 393 Summer of Code Project
Tres Seaver wrote:

> On 09/01/2011 02:54 PM, Antoine Pitrou wrote:
> > If you look at Wikipedia, it says: “El alfabeto español consta de 27
> > letras”. The Ñ is separate from the N (and so is it in my
> > French-Spanish dictionary). The accented letters, however, are not
> > considered separately.
> > http://es.wikipedia.org/wiki/Alfabeto_espa%C3%B1ol
>
> FWIW, I was taught that Spanish had 30 letters in the alfabeto: the
> 'ñ', plus 'ch', 'll', and 'rr' were all considered distinct
> characters.

Not sure what's going on, but according to the article Antoine linked to, those aren't letters anymore... so much for the cultural awareness portion of UNESCO.

~Ethan~
Re: [Python-Dev] PEP 393 Summer of Code Project
On Thu, 01 Sep 2011 12:38:07 -0700, Ethan Furman et...@stoneleaf.us wrote:

> > FWIW, I was taught that Spanish had 30 letters in the alfabeto: the
> > 'ñ', plus 'ch', 'll', and 'rr' were all considered distinct
> > characters.
>
> Not sure what's going on, but according to the article Antoine linked
> to those aren't letters anymore... so much for the cultural awareness
> portion of UNESCO.

That Wikipedia article also says: “Los dígrafos Ch y Ll tienen valores fonéticos específicos, y durante los siglos XIX y XX se ordenaron separadamente de C y L, aunque la práctica se abandonó en 1994 para homogeneizar el sistema con otras lenguas.” Roughly: “the Ch and Ll digraphs have specific phonetic values, and during the 19th and 20th centuries they were ordered separately from C and L, but this practice was abandoned in 1994 in order to make the system consistent with other languages.”

And about rr: “El dígrafo rr (llamado erre, /'ere/, y pronunciado /r/) nunca se consideró por separado, probablemente por no aparecer nunca en posición inicial.” That is: “the rr digraph was never considered separate, probably because it never appears at the very beginning of a word.”

Regards
Antoine.
Re: [Python-Dev] PEP 393 Summer of Code Project
Guido van Rossum wrote:

> I recall long ago that when the french wrote words in all caps they
> would drop the accents, e.g. ECOLE. I even recall (through the mists
> of time) observing this in Paris on public signs. Is this still the
> convention?

This page features a number of French street signs in all caps, and some of them have accents: http://www.happymall.com/france/paris_street_signs.htm

--
Greg
Re: [Python-Dev] PEP 393 Summer of Code Project
Guido van Rossum wrote:

> But in a word like coëxistentie (coexistence) the o and e do not form
> the oe-sound, and to emphasize this to Dutch readers (who believe
> their spelling is very logical :-), the official spelling puts the
> umlaut on the e.

Sometimes this is done in English too -- occasionally you see words like "cooperation" spelled with a diaeresis over the second o. But these days it's more common to use a hyphen, or to not bother at all. Everyone knows how it's pronounced.

--
Greg
Re: [Python-Dev] PEP 393 Summer of Code Project
On Fri, 02 Sep 2011 12:30:12 +1200, Greg Ewing greg.ew...@canterbury.ac.nz wrote:

> This page features a number of French street signs in all caps, and
> some of them have accents:
> http://www.happymall.com/france/paris_street_signs.htm

I don't think some American souvenir shop is a good reference, though :) (For example, there's no Paris street named "château de Versailles".)

Regards
Antoine.
Re: [Python-Dev] PEP 393 Summer of Code Project
Terry Reedy wrote:

> Too bad, since people sometimes use '-' to serve the same purpose.

Which actually seems more logical to me -- a separating symbol is better placed between the things being separated, rather than over the top of one of them! Maybe we could compromise by turning the diaeresis on its side: co:operate

--
Greg
Re: [Python-Dev] PEP 393 Summer of Code Project
Antoine Pitrou wrote:

> On Thursday, 01 September 2011 at 08:45 -0700, Guido van Rossum wrote:
> > This is definitely thought of as a separate mark added to the e; ë
> > is not a new letter. I have a feeling it's the same way for the
> > French and Germans, but I really don't know. (Antoine? Georg?)
>
> Indeed, they are not separate letters (they are considered the same in
> lexicographic order, and the French alphabet has 26 letters).

On the other hand, the same doesn't necessarily apply to other languages. (At least according to Wikipedia.)
http://en.wikipedia.org/wiki/Diacritic#Languages_with_letters_containing_diacritics

--
Steven
Re: [Python-Dev] PEP 393 Summer of Code Project
Tres Seaver writes:

> FWIW, I was taught that Spanish had 30 letters in the alfabeto: the
> 'ñ', plus 'ch', 'll', and 'rr' were all considered distinct
> characters.

That was always a Castellano vs. Americano issue, IIRC. As I wrote, Mr. Gonzalez was Castellano. I believe that the deprecation of the digraphs as separate letters occurred as the telephone became widely used in Spain, and the telephone company demanded an official proclamation from whatever Ministry is responsible for culture that it was OK to treat the digraphs as two letters (specifically, to collate them that way), so that they could use the programs that came with the OS.

So this stuff varies not merely by culture, but also with economics and politics. :-/
Re: [Python-Dev] Python 3 optimizations continued...
Hi,

as promised, I created a publicly available preview of an implementation with my optimizations, which is available at the following location: https://bitbucket.org/py3_pio/preview/wiki/Home

I followed Nick's advice and added an overview/introduction at the wiki page the link points to; I am positive that spending 10 minutes reading it will give you valuable background on what's happening. In addition, as Guido already mentioned, this is more or less a direct copy of my research branch, without some of my private comments and with *no* additional refactorings to address software-engineering issues (which I am very much aware of). I hope this clarifies a *lot* and makes it easier to see which parts are involved and how all the pieces fit together.

I hope you'll like it. Have fun,
--stefan
Re: [Python-Dev] PEP 393 Summer of Code Project
Antoine Pitrou wrote:

> I don't think some American souvenir shop is a good reference, though
> :) (For example, there's no Paris street named "château de
> Versailles".)

Hmmm, I'd assumed they were reproductions of actual street signs found in Paris, but maybe not. :-(

--
Greg