ANN: Advanced Python Training at PyCon PL

2012-08-19 Thread Mike Müller
Advanced Python Training at PyCon PL


You have intermediate Python skills and would like to learn more about:

* Comprehensions
* Decorators
* Context managers
* Descriptors
* Metaclasses
* Patterns

Then you should attend this two-day training, which provides systematic
coverage of these topics. Useful code samples and exercises provide
hands-on learning.

We offered this training at EuroPython 2012 and got very good feedback.
Some of the participants understood much more of these complex topics than
they had anticipated.

Date: September 17th and 18th, 2012
Location: PyCon PL venue, Mąchocice, Poland
More information: http://pl.pycon.org/2012/en/training

This is an open course, but PyCon PL attendees will get a considerable
discount.



Open courses 2012 and 2013 (till June)
--------------------------------------


17.09.-18.09.2012 (Mąchocice, Poland) Advanced Python at PyCon PL (English)
http://pl.pycon.org/2012/en/training

15.10.-17.10.2012 (Leipzig) Introduction to Django (English)
http://python-academy.com/courses/django_course_introduction.html

18.10.-20.10.2012 (Leipzig) Advanced Django (English)
http://python-academy.com/courses/django_course_advanced.html

27.10.2012 (Leipzig) SQLAlchemy (English)
http://python-academy.com/courses/specialtopics/python_course_sqlalchemy.html

28.10.2012 (Leipzig) Camelot (English)
http://python-academy.com/courses/specialtopics/python_course_camelot.html

12.-14.11.2012 (Antwerp, Belgium) Python for Programmers (English)
http://python-academy.com/courses/python_course_programmers.htm

15.11.2012 (Antwerp, Belgium) SQLAlchemy (English)
http://python-academy.com/courses/specialtopics/python_course_sqlalchemy.html

16.11.2012 (Antwerp, Belgium) Camelot (English)
http://python-academy.com/courses/specialtopics/python_course_camelot.html

10.12.-12.12.2012 (Leipzig) Python für Programmierer (German)
http://www.python-academy.de/Kurse/python_kurs_programmierer.html

13.12.-15.12.2012 (Leipzig) Python für Wissenschaftler und Ingenieure (German)
http://www.python-academy.de/Kurse/python_kurs_wissenschaftler.html

25.01.-27.01.2013 (Leipzig) Advanced Python (English)
http://python-academy.com/courses/specialtopics/python_course_advanced.html

28.01.-30.01.2013 (Leipzig) High-Performance Computation with Python (English)
http://python-academy.com/courses/python_course_high_performance.html

one day each (can be booked separately)
- Optimizing Python Programs
  http://python-academy.com/courses/specialtopics/python_optimizing.html

- Python Extensions with Other Languages
  http://python-academy.com/courses/specialtopics/python_extensions.html

- Fast Code with the Cython Compiler
  http://python-academy.com/courses/specialtopics/python_course_cython.html

31.01.-01.02.2013 (Leipzig) High Performance XML with Python (English)
http://python-academy.com/courses/specialtopics/python_course_xml.html

04.03.-08.03.2013 (Chicago, USA) Python for Scientists and Engineers (English)
http://www.dabeaz.com/chicago/science.html

15.04.-17.04.2013 (Leipzig) Python für Programmierer (German)
http://www.python-academy.de/Kurse/python_kurs_programmierer.html


18.04.-20.04.2013 (Leipzig) Python für Wissenschaftler und Ingenieure (German)
http://www.python-academy.de/Kurse/python_kurs_wissenschaftler.html

10.06.-12.06.2013 (Leipzig) Python for Scientists and Engineers (English)
http://python-academy.com/courses/python_course_scientists.html

13.06.2013 (Leipzig) Fast Code with the Cython Compiler (English)
http://python-academy.com/courses/specialtopics/python_course_cython.html

14.06.2013 (Leipzig) Fast NumPy Processing with Cython (English)
http://python-academy.com/courses/specialtopics/python_course_numpy_cython.html
-- 
http://mail.python.org/mailman/listinfo/python-announce-list

Support the Python Software Foundation:
http://www.python.org/psf/donations/


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Paul Rubin
Chris Angelico ros...@gmail.com writes:
 Generally, I'm working with pure ASCII, but port those same algorithms
 to Python and you'll easily be able to read in a file in some known
 encoding and manipulate it as Unicode.

If it's pure ASCII, you can use the bytes or bytearray type.  
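
For instance (a sketch of mine, not from the original exchange), ASCII-only
data kept as bytes gives one byte per character and O(1) indexing:

data = b"GET /index.html HTTP/1.1"    # ASCII-only payload
first = data[0]                # 71, the integer code for 'G' -- O(1)
method = data[:3]              # b'GET', slicing stays in bytes
text = data.decode("ascii")    # promote to str only at the boundary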

 It's not so much 'random access to the nth character' as an efficient
 way of jumping forward. For instance, if I know that the next thing is
 a literal string of n characters (that I don't care about), I want to
 skip over that and keep parsing.

I don't understand how this is supposed to work.  You're going to read a
large unicode text file (let's say it's UTF-8) into a single big string?
So the runtime library has to scan the encoded contents to find the
highest numbered codepoint (let's say it's mostly ascii but has a few
characters outside the BMP), expand it all (in this case) to UCS-4
giving 4x memory bloat and requiring decoding all the UTF-8 regardless,
and now we should worry about the efficiency of skipping n characters?

Since you have to decode the n characters regardless, I'd think this
skipping part should only be an issue if you have to do it a lot of
times.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Encapsulation, inheritance and polymorphism

2012-08-19 Thread John Ladasky
On Tuesday, July 17, 2012 12:39:53 PM UTC-7, Mark Lawrence wrote:

 I would like to spend more time on this thread, but unfortunately the 44 
 ton artic carrying Java in a Nutshell Volume 1 Part 1 Chapter 1 
 Paragraph 1 Sentence 1 has just arrived outside my abode and needs 
 unloading :-)

That reminds me of a remark I made nearly 10 years ago:

Well, I followed one friend's advice and investigated Java, perhaps a little 
too quickly.  I purchased Ivor Horton's _Beginning_Java_2_ book.  It is 
reasonably well-written.  But how many pages did I have to read before I got 
through everything I needed to know, in order to read and write files?  Four 
hundred!  I need to keep straight detailed information about objects, 
inheritance, exceptions, buffers, and streams, just to read data from a text 
file???

I haven't actually sat down to program in Java yet.  But at first glance, it 
would seem to be a step backwards even from the procedural C programming that I 
was doing a decade ago.  I was willing to accept the complexity of the Windows 
GUI, and program with manuals open on my lap.  It is a lot harder for me to 
accept that I will need to do this in order to process plain old text, perhaps 
without even any screen output.

https://groups.google.com/d/topic/bionet.software/kk-EGGTHN1M/discussion

Some things never change!  :^)
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Steven D'Aprano
This is a long post. If you don't feel like reading an essay, skip to the 
very bottom and read my last few paragraphs, starting with "To recap".


On Sat, 18 Aug 2012 11:26:21 -0700, Paul Rubin wrote:

 Steven D'Aprano steve+comp.lang.pyt...@pearwood.info writes:
 (There is an extension to UCS-2, UTF-16, which encodes non-BMP
 characters using two code points. This is fragile and doesn't work very
 well, because string-handling methods can break the surrogate pairs
 apart, leaving you with invalid unicode string. Not good.)
 ...
 With PEP 393, each Python string will be stored in the most efficient
 format possible:
 
 Can you explain the issue of breaking surrogate pairs apart a little
 more?  Switching between encodings based on the string contents seems
 silly at first glance.  

Forget encodings! We're not talking about encodings. Encodings are used 
for converting text to bytes for transmission over the wire or storage on 
disk. PEP 393 talks about the internal representation of text within 
Python, the C-level data structure.

In 3.2, that data structure depends on a compile-time switch. In a 
"narrow build", text is stored using two bytes per character, so the 
string "len" (as in the name of the built-in function) will be stored as 

006c 0065 006e

(or possibly 6c00 6500 6e00, depending on whether your system is 
big-endian or little-endian), plus object-overhead, which I shall ignore.

Since most identifiers are ASCII, that's already using twice as much 
memory as needed. This standard data structure is called UCS-2, and it 
only handles characters in the Basic Multilingual Plane, the BMP (roughly 
the first 64000 Unicode code points). I'll come back to that.

In a "wide build", text is stored as four bytes per character, so "len" 
is stored as either:

0000006c 00000065 0000006e
6c000000 65000000 6e000000

Now memory is cheap, but it's not *that* cheap, and no matter how much 
memory you have, you can always use more.

This system is called UCS-4, and it can handle the entire Unicode 
character set, for now and forever. (If we ever need more than four bytes' 
worth of characters, it won't be called Unicode.)
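
Both layouts can be checked from Python itself (my aside, using the
equivalent codecs: UTF-32 is the same thing as UCS-4, and UTF-16 matches
UCS-2 for BMP-only text):

py> import binascii
py> binascii.hexlify("len".encode("utf-16-be"))
b'006c0065006e'
py> binascii.hexlify("len".encode("utf-32-be"))
b'0000006c000000650000006e'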

Remember I said that UCS-2 can only handle the 64K characters 
[technically: code points] in the Basic Multilingual Plane? There's an 
extension to UCS-2 called UTF-16 which extends it to the entire Unicode 
range. Yes, that's the same name as the UTF-16 encoding, because it's 
more or less the same system.

UTF-16 says "let's represent characters in the BMP by two bytes, but 
characters outside the BMP by four bytes". There's a neat trick to this: 
the BMP doesn't use the entire two-byte range, so there are some byte 
pairs which are illegal in UCS-2 -- they don't correspond to *any* 
character. UTF-16 uses those byte pairs to signal "this is half a 
character; you need to look at the next pair for the rest of the 
character".

Nifty hey? These pairs-of-pseudocharacters are called surrogate pairs.
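
The arithmetic behind the trick, as the UTF-16 spec defines it (a sketch,
not from the original post):

cp = 0x10000                              # first code point beyond the BMP
hi = 0xD800 + ((cp - 0x10000) >> 10)      # high surrogate: 0xD800
lo = 0xDC00 + ((cp - 0x10000) & 0x3FF)    # low surrogate: 0xDC00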

Except this comes at a big cost: you can no longer tell how long a string 
is by counting the number of bytes, which is fast, because sometimes four 
bytes is two characters and sometimes it's one and you can't tell which 
it will be until you actually inspect all four bytes.

Copying sub-strings now becomes either slow, or buggy. Say you want to 
grab the 5th character in a string. The fast way using UCS-2 is to 
simply grab bytes 8 and 9 (remember characters are pairs of bytes and we 
start counting at zero) and you're done. Fast and safe if you're willing 
to give up the non-BMP characters.

It's also fast and safe if you use UCS-4, but then everything takes twice 
as much space, so you may end up spending so much time copying null 
bytes that you're probably slower anyway. Especially when your OS starts 
paging memory like mad.

But in UTF-16, indexing can be fast or safe but not both. Maybe bytes 8 
and 9 are half of a surrogate pair, and you've now split the pair and 
ended up with an invalid string. That's what Python 3.2 does, it fails to 
handle surrogate pairs properly:

py> s = chr(0xFFFF + 1)
py> a, b = s
py> a
'\ud800'
py> b
'\udc00'


I've just split a single valid Unicode character into two invalid 
characters. Python 3.2 will (probably) mindlessly process those two non-
characters, and the only sign I have that I did something wrong is that 
my data is now junk.

Since any character can be a surrogate pair, you have to scan every pair 
of bytes in order to index a string, or work out its length, or copy a 
substring. It's not enough to just check if the last pair is a surrogate. 

When you don't, you have bugs like this from Python 3.2:

py> s = "01234" + chr(0xFFFF + 1) + "6789"
py> s[9] == '9'
False
py> s[9], len(s)
('8', 11)

Which is now fixed in Python 3.3.
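
(For comparison, the same session on a 3.3 build should give, as I
understand the PEP -- I haven't shown that run here:

py> s = "01234" + chr(0xFFFF + 1) + "6789"
py> s[9], len(s)
('9', 10)
)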

So variable-width data structures like UTF-8 or UTF-16 are crap for the 
internal representation of strings -- they are either fast or correct but 
cannot be both.

But UCS-2 is sub-optimal, because it can only handle the BMP, and UCS-4 
is wasteful of memory.

Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Steven D'Aprano
On Sat, 18 Aug 2012 11:30:19 -0700, wxjmfauth wrote:

  I'm aware of this (and all the blah blah blah you are explaining).
 This is always the same song. Memory.
 
 
 
 Exactly. The reason it is always the same song is because it is an
 important song.
 
 
 No offense here. But this is an *american* answer.

I am not American.

I am not aware that computers outside of the USA, and Australia, have 
unlimited amounts of memory. You must be very lucky.


 The same story as the coding of text files, where utf-8 == ascii and
 the rest of the world doesn't count.

UTF-8 is not ASCII.



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Steven D'Aprano
On Sat, 18 Aug 2012 09:51:37 -0600, Ian Kelly wrote about PEP 393:

 The change does not just benefit ASCII users.  It primarily benefits
 anybody using a wide unicode build with strings mostly containing only
 BMP characters.

Just to be clear:

If you have many strings which are *mostly* BMP, but have one or two non-
BMP characters in *each* string, you will see no benefit.

But if you have many strings which are all BMP, and only a few strings 
containing non-BMP characters, then you will see a big benefit.


 Even for narrow build users, there is the benefit that
 with approximately the same amount of memory usage in most cases, they
 no longer have to worry about non-BMP characters sneaking in and
 breaking their code.

Yes! +1000 on that.


 There is some additional benefit for Latin-1 users, but this has nothing
 to do with Python.  If Python is going to have the option of a 1-byte
 representation (and as long as we have the flexible representation, I
 can see no reason not to), 

The PEP explicitly states that it only uses a 1-byte format for ASCII 
strings, not Latin-1:

"ASCII-only Unicode strings will again use only one byte per character"

and later:

"If the maximum character is less than 128, they use the PyASCIIObject 
structure"

and:

"The data and utf8 pointers point to the same memory if the string uses 
only ASCII characters (using only Latin-1 is not sufficient)."


 then it is going to be Latin-1 by definition,

Certainly not, either in fact or in principle. There are a large number 
of 1-byte encodings, Latin-1 is hardly the only one.


 because that's what 1-byte Unicode (UCS-1, if you will) is.  If you have
 an issue with that, take it up with the designers of Unicode.

The designers of Unicode have never created a standard 1-byte Unicode 
or UCS-1, as far as I can determine.

The Unicode standard refers to some multiple million code points, far too 
many to fit in a single byte. There is some historical justification for 
using Unicode to mean UCS-2, but with the standard being extended 
beyond the BMP, that is no longer valid.

See http://www.cl.cam.ac.uk/~mgk25/unicode.html for more details.


I think what you are trying to say is that the Unicode designers 
deliberately matched the Latin-1 standard for Unicode's first 256 code 
points. That's not the same thing though: there is no Unicode standard 
mapping to a single byte format.


-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Steven D'Aprano
On Sat, 18 Aug 2012 11:05:07 -0700, wxjmfauth wrote:

 As I understand (I think) the undelying mechanism, I can only say, it is
 not a surprise that it happens.
 
 Imagine an editor: I type an "a"; internally the text is saved as ascii.
 Then I type an "é"; the text can only be saved in at least latin-1. Then
 I enter a "€"; the text becomes an internal ucs-4 string. Then remove
 the "€", and so on.

Firstly, that is not what Python does. For starters, € is in the BMP, and 
so is nearly every character you're ever going to use unless you are 
Asian or a historian using some obscure ancient script. NONE of the 
examples you have shown in your emails have included 4-byte characters, 
they have all been ASCII or UCS-2.

You are suffering from a misunderstanding about what is going on and 
misinterpreting what you have seen.


In *both* Python 3.2 and 3.3, both é and € are represented by two bytes. 
That will not change. There is a tiny amount of fixed overhead for 
strings, and that overhead is slightly different between the versions, 
but you'll never notice the difference.

Secondly, how a text editor or word processor chooses to store the text 
that you type is not the same as how Python does it. A text editor is not 
going to be creating a new immutable string after every key press. That 
will be slow slow SLOW. The usual way is to keep a buffer for each 
paragraph, and add and subtract characters from the buffer.


 Intuitively I expect there is some kind slow down between all these
 strings conversion.

Your intuition is wrong. Strings are not converted from ASCII to UCS-2 to 
UCS-4 on the fly; they are converted once, when the string is created.

The tests we ran earlier, e.g.:

('ab…' * 1000).replace('…', 'œ…')

show the *worst possible case* for the new string handling, because all 
we do is create new strings. First we create a string 'ab…', then we 
create another string 'ab…'*1000, then we create two new strings '…' and 
'œ…', and finally we call replace and create yet another new string.

But in real applications, once you have created a string, you don't just 
immediately create a new one and throw the old one away. You likely do 
work with that string:

steve@runes:~$ python3.2 -m timeit "s = 'abcœ…'*1000; n = len(s); flag = s.startswith(('*', 'a'))"
100000 loops, best of 3: 2.41 usec per loop

steve@runes:~$ python3.3 -m timeit "s = 'abcœ…'*1000; n = len(s); flag = s.startswith(('*', 'a'))"
100000 loops, best of 3: 2.29 usec per loop

Once you start doing *real work* with the strings, the overhead of 
deciding whether they should be stored using 1, 2 or 4 bytes begins to 
fade into the noise.


 When I tested this flexible representation, a few months ago, at the
 first alpha release, this is precisely what I tested: string
 manipulations which force this internal change. And I concluded the
 result is not brilliant. Really, a factor 0.n up to 10.

Like I said, if you really think that there is a significant, repeatable 
slow-down on Windows, report it as a bug.


 Does any body know a way to get the size of the internal string in
 bytes? 

sys.getsizeof(some_string)

steve@runes:~$ python3.2 -c "from sys import getsizeof as size; print(size('abcœ…'*1000))"
10030
steve@runes:~$ python3.3 -c "from sys import getsizeof as size; print(size('abcœ…'*1000))"
10038


As I said, there is a *tiny* overhead difference. But identifiers will 
generally be smaller:

steve@runes:~$ python3.2 -c "from sys import getsizeof as size; print(size(size.__name__))"
48
steve@runes:~$ python3.3 -c "from sys import getsizeof as size; print(size(size.__name__))"
34

You can check the object overhead by looking at the size of the empty 
string.



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Steven D'Aprano
On Sat, 18 Aug 2012 19:34:50 +0100, MRAB wrote:

 "a" will be stored as 1 byte/codepoint.
 
 Adding "é", it will still be stored as 1 byte/codepoint.

Wrong. It will be 2 bytes, just like it already is in Python 3.2.

I don't know where people are getting this myth that PEP 393 uses Latin-1 
internally, it does not. Read the PEP, it explicitly states that 1-byte 
formats are only used for ASCII strings.


 Adding "€", it will still be stored as 2 bytes/codepoint.

That is correct.



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Steven D'Aprano
On Sat, 18 Aug 2012 19:59:32 +0100, MRAB wrote:

 The problem with strings containing surrogate pairs is that you could
 inadvertently slice the string in the middle of the surrogate pair.

That's the *least* of the problems with surrogate pairs. That would be 
easy to fix: check the point of the slice, and back up or forward if 
you're on a surrogate pair. But that's not good enough, because the 
surrogates could be anywhere in the string. You have to touch every 
single character in order to know how many there are.

The problem with surrogate pairs is that they make basic string 
operations O(N) instead of O(1).
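
A sketch of what that touch-every-character count looks like (my code,
assuming a well-formed sequence of 16-bit code units):

def count_chars_utf16(units):
    # every code unit must be inspected; low surrogates don't start a char
    return sum(1 for u in units if not 0xDC00 <= u <= 0xDFFF)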



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Top-posting c. (was Re: [ANNC] pybotwar-0.8)

2012-08-19 Thread Steven D'Aprano
On Sat, 18 Aug 2012 10:27:10 -0700, rusi wrote:

 For example, my sister recently saw some of my mails and was mystified
 that I had sent back 'blank mails' until I explained and pointed out
 that my answers were interleaved into what was originally sent!

No offence to your sister, who I'm sure is probably a really great person 
and kind to small animals and furry children, but didn't she, you know, 
*investigate further* upon seeing something weird, namely a blank email?

As in, "Gosh, dearest brother has sent me an email without saying 
anything. That's weird. I hope he's alright? Maybe there's something a 
bit further down? Or a funny picture of a cat at the end? Or something? I 
better scroll down a bit further and see."

I'm not talking about complicated tech stuff like View > Message Source 
and trying to determine whether perhaps the MIME type is broken and 
there's an invisible attachment. I'm talking about almost the simplest 
thing in the friggin' world, *scrolling down and looking at what's there*.
The software equivalent of somebody handing you a blank piece of paper 
and turning it around to see if maybe there's something on the back.

Because that's what I do, and I don't think I'm some sort of hyper-
evolved mega-genius with a brain the size of a planet, I'm just some guy. 
Nobody needed to tell me "Hey dummy, the text you are looking for is a 
bit further down, keep reading." I just looked on my own, and saw the 
text on my own, and actually read it without being told to, and a little 
light bulb went on over my head and I went "Wow! People can actually 
write stuff in between other stuff! How did they do that?"

Now sure, I make allowances for 70 year olds who have never touched a 
computer before and have to ask "What's a scroll bar?" and "How do I use 
this mousey-pointer thing?" I assume your sister has minimal skills like 
"can scroll" and "knows how to read".

I'm not sure which is worse -- that perhaps I *am* some sort of mega-
genius and keep overestimating the difficulty of scroll-down-and-read for 
normal people, or that others have such short attention spans that 
anything that they can't see immediately in front of them might as well 
not exist. Either thought is rather depressing.



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


New internal string format in 3.3, was Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Peter Otten
Steven D'Aprano wrote:

 On Sat, 18 Aug 2012 19:34:50 +0100, MRAB wrote:
 
 "a" will be stored as 1 byte/codepoint.
 
 Adding "é", it will still be stored as 1 byte/codepoint.
 
 Wrong. It will be 2 bytes, just like it already is in Python 3.2.
 
 I don't know where people are getting this myth that PEP 393 uses Latin-1
 internally, it does not. Read the PEP, it explicitly states that 1-byte
 formats are only used for ASCII strings.

From

Python 3.3.0a4+ (default:10a8ad665749, Jun  9 2012, 08:57:51) 
[GCC 4.6.1] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> [sys.getsizeof("é"*i) for i in range(10)]
[49, 74, 75, 76, 77, 78, 79, 80, 81, 82]
>>> [sys.getsizeof("e"*i) for i in range(10)]
[49, 50, 51, 52, 53, 54, 55, 56, 57, 58]
>>> sys.getsizeof("é"*101)-sys.getsizeof("é")
100
>>> sys.getsizeof("e"*101)-sys.getsizeof("e")
100
>>> sys.getsizeof("€"*101)-sys.getsizeof("€")
200

I infer that 

(1) both ASCII and Latin1 strings require one byte per character.
(2) Latin1 strings have a constant overhead of 24 bytes (on a 64bit system) 
over ASCII-only.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Top-posting c. (was Re: [ANNC] pybotwar-0.8)

2012-08-19 Thread Chris Angelico
On Sun, Aug 19, 2012 at 5:15 PM, Steven D'Aprano
steve+comp.lang.pyt...@pearwood.info wrote:
 The software equivalent of somebody handing you a blank piece of paper
 and turning it around to see if maybe there's something on the back.

Straight out of a Goon Show, that is. Heh.

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Steven D'Aprano
On Sat, 18 Aug 2012 19:35:44 -0700, Paul Rubin wrote:

 Scanning 4 characters (or a few dozen, say) to peel off a token in
 parsing a UTF-8 string is no big deal.  It gets more expensive if you
 want to index far more deeply into the string.  I'm asking how often
 that is done in real code.

It happens all the time.

Let's say you've got a bunch of text, and you use a regex to scan through 
it looking for a match. Let's ignore the regular expression engine, since 
it has to look at every character anyway. But you've done your search and 
found your matching text and now want everything *after* it. That's not 
exactly an unusual use-case.

mo = re.search(pattern, text)
if mo:
    start, end = mo.span()
    result = text[end:]


Easy-peasy, right? But behind the scenes, you have a problem: how does 
Python know where text[end:] starts? With fixed-size characters, that's 
O(1): Python just moves forward end*width bytes into the string. Nice and 
fast.

With variable-sized characters, Python has to start from the beginning 
again, and inspect each byte or pair of bytes. This turns the slice 
operation into O(N) and the combined op (search + slice) into O(N**2), 
and that starts getting *horrible*.
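
Here is roughly the scan a UTF-8 representation forces on every such
slice (a sketch of mine; the widths follow the standard UTF-8 lead-byte
ranges):

def byte_offset(data, n):
    # byte position of the n-th code point in UTF-8 data: O(n), not O(1)
    pos = 0
    for _ in range(n):
        lead = data[pos]
        pos += 1 if lead < 0x80 else 2 if lead < 0xE0 else 3 if lead < 0xF0 else 4
    return pos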

As always, everything is fast for small enough N, but you *really* 
don't want O(N**2) operations when dealing with large amounts of data.

Insisting that the regex functions only ever return offsets to valid 
character boundaries doesn't help you, because the string slice method 
cannot know where the indexes came from.

I suppose you could have a fast slice and a slow slice method, but 
really, that sucks, and besides all that does is pass responsibility for 
tracking character boundaries to the developer instead of the language, 
and you know damn well that they will get it wrong and their code will 
silently do the wrong thing and they'll say that Python sucks and we 
never used to have this problem back in the good old days with ASCII. Boo 
sucks to that.

UCS-4 is an option, since that's fixed-width. But it's also bulky. For 
typical users, you end up wasting memory. That is the complaint driving 
PEP 393 -- memory is cheap, but it's not so cheap that you can afford to 
multiply your string memory by four just in case somebody someday gives 
you a character in one of the supplementary planes.

If you have oodles of memory and small data sets, then UCS-4 is probably 
all you'll ever need. I hear that the club for people who have all the 
memory they'll ever need is holding their annual general meeting in a 
phone-booth this year.

You could say "Screw the full Unicode standard, who needs more than 64K 
different characters anyway?" Well, apart from Asians, and historians, and 
a bunch of other people. If you can control your data and make sure no 
non-BMP characters are used, UCS-2 is fine -- except Python doesn't 
actually use that.

You could do what Python 3.2 narrow builds do: use UTF-16 and leave it up 
to the individual programmer to track character boundaries, and we know 
how well that works. Luckily the supplementary planes are only rarely 
used, and people who need them tend to buy more memory and use wide 
builds. People who only need a few non-BMP characters in a narrow build 
generally just cross their fingers and hope for the best.

You could add a whole lot more heavyweight infrastructure to strings, 
turn them into souped-up ropes-on-steroids. All those extra indexes mean 
that you don't save any memory. Because the objects are so much bigger 
and more complex, your CPU cache goes to the dogs and your code still 
runs slow.

Which leaves us right back where we started, PEP 393.


 Obviously one can concoct hypothetical examples that would suffer.

If you think slicing at arbitrary indexes is a hypothetical example, I 
don't know what to say.



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Paul Rubin
Steven D'Aprano steve+comp.lang.pyt...@pearwood.info writes:
 This is a long post. If you don't feel like reading an essay, skip to the 
 very bottom and read my last few paragraphs, starting with "To recap".

I'm very flattered that you took the trouble to write that excellent
exposition of different Unicode encodings in response to my post.  I can
only hope some readers will benefit from it.  I regret that I wasn't
more clear about the perspective I posted from, i.e. that I'm already
familiar with how those encodings work.

After reading all of it, I still have the same skepticism on the main
point as before, but I think I see what the issue in contention is, and
some differences in perspective.  First of all, you wrote:

 This standard data structure is called UCS-2 ... There's an extension
 to UCS-2 called UTF-16

My own understanding is UCS-2 simply shouldn't be used any more.
Unicode was historically supposed to be a 16-bit character set, but that
turned out to not be enough, so the supplementary planes were added.
UCS-2 thus became obsolete and UTF-16 superseded it in 1996.  UTF-16 in
turn is rather clumsy and the later UTF-8 is better in a lot of ways,
but both of these are at least capable of encoding all the character
codes.

On to the main issue:

 * Variable-byte formats like UTF-8 and UTF-16 mean that basic string 
 operations are not O(1) but are O(N). That means they are slow, or buggy, 
 pick one.

This I don't see.  What are the basic string operations?

* Examine the first character, or first few characters (few = usually
  bounded by a small constant) such as to parse a token from an input
  stream.  This is O(1) with either encoding.

* Slice off the first N characters.  This is O(N) with either encoding
  if it involves copying the chars.  I guess you could share references
  into the same string, but if the slice reference persists while the
  big reference is released, you end up not freeing the memory until
  later than you really should.

* Concatenate two strings.  O(N) either way.

* Find length of string.  O(1) either way since you'd store it in
  the string header when you build the string in the first place.
  Building the string has to have been an O(N) operation in either
  representation.

And finally:

* Access the nth char in the string for some large random n, or maybe
  get a small slice from some random place in a big string.  This is
  where fixed-width representation is O(1) while variable-width is O(N).

What I'm not convinced of, is that the last thing happens all that
often.

Meanwhile, an example of the 393 approach failing: I was involved in a
project that dealt with terabytes of OCR data of mostly English text.
So the chars were mostly ascii, but there would be occasional non-ascii
chars including supplementary plane characters, either because of
special symbols that were really in the text, or the typical OCR
confusion emitting those symbols due to printing imprecision.  That's a
natural for UTF-8 but the PEP-393 approach would bloat up the memory
requirements by a factor of 4.

py> s = chr(0xFFFF + 1)
py> a, b = s

That looks like Python 3.2 is buggy and that sample should just throw an
error.  s is a one-character string and should not be unpackable.

I realize the folks who designed and implemented PEP 393 are very smart
cookies and considered stuff carefully, while I'm just an internet user
posting an immediate impression of something I hadn't seen before (I
still use Python 2.6), but I still have to ask: if the 393 approach
makes sense, why don't other languages do it?

Ropes of UTF-8 segments seems like the most obvious approach and I
wonder if it was considered.  By that I mean pick some implementation
constant k (say k=128) and represent the string as a UTF-8 encoded byte
array, accompanied by a vector of n//k pointers into the byte array, where
n is the number of codepoints in the string.  Then you can reach any
offset analogously to reading a random byte on a disk, by seeking to the
appropriate block, and then reading the block and getting the char you
want within it.  Random access is then O(1) though the constant is
higher than it would be with fixed width encoding.
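
Something like this sketch, say, with k=128 (my code, just to make the
idea concrete):

class Utf8Rope:
    K = 128

    def __init__(self, text):
        self.data = text.encode("utf-8")
        self.marks = []               # byte offset of code points 0, K, 2K, ...
        pos = 0
        for i, ch in enumerate(text):
            if i % self.K == 0:
                self.marks.append(pos)
            pos += len(ch.encode("utf-8"))

    def char_at(self, n):
        pos = self.marks[n // self.K]   # O(1) jump to the enclosing block
        for _ in range(n % self.K):     # then at most K-1 steps inside it
            pos += self._width(self.data[pos])
        return self.data[pos:pos + self._width(self.data[pos])].decode("utf-8")

    @staticmethod
    def _width(lead):
        return 1 if lead < 0x80 else 2 if lead < 0xE0 else 3 if lead < 0xF0 else 4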
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Paul Rubin
Steven D'Aprano steve+comp.lang.pyt...@pearwood.info writes:
 result = text[end:]

if "end" is not near the end of the original string, then this is O(N)
even with fixed-width representation, because of the char copying.

if it is near the end, by knowing where the string data area
ends, I think it should be possible to scan backwards from
the end, recognizing what bytes can be the beginning of code points and
counting off the appropriate number.  This is O(1) if "near the end"
means within a constant.
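
A sketch of that backward scan (mine; continuation bytes in UTF-8 are
exactly those matching 10xxxxxx):

def offset_of_nth_char_from_end(data, n):
    # byte offset of the n-th code point counted from the end of UTF-8 data
    pos = len(data)
    for _ in range(n):
        pos -= 1
        while data[pos] & 0xC0 == 0x80:   # skip continuation bytes
            pos -= 1
    return pos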

 You could say "Screw the full Unicode standard, who needs more than 64K 

No; if you're claiming the language supports Unicode, it should support
the whole standard.

 You could do what Python 3.2 narrow builds do: use UTF-16 and leave it
 up to the individual programmer to track character boundaries,

I'm surprised the Python 3 implementers even considered that approach
much less went ahead with it.  It's obviously wrong.

 You could add a whole lot more heavyweight infrastructure to strings,
 turn them into souped-up ropes-on-steroids.

I'm not persuaded that PEP 393 isn't even worse.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Chris Angelico
On Sun, Aug 19, 2012 at 6:11 PM, Paul Rubin no.email@nospam.invalid wrote:
 Steven D'Aprano steve+comp.lang.pyt...@pearwood.info writes:
 result = text[end:]

 if "end" is not near the end of the original string, then this is O(N)
 even with fixed-width representation, because of the char copying.

 if it is near the end, by knowing where the string data area
 ends, I think it should be possible to scan backwards from
 the end, recognizing what bytes can be the beginning of code points and
 counting off the appropriate number.  This is O(1) if "near the end"
 means within a constant.

Only if you know exactly where the end is (which requires storing and
maintaining a character length - this may already be happening, I
don't know). But that approach means you need to have code for both
ways (forward search or reverse), and of course it relies on your
encoding being reverse-scannable in this way (as UTF-8 is, but not
all).

And of course, taking the *entire* rest of the string isn't the only
thing you do. What if you want to take the next six characters after
that index? That would be constant time with a fixed-width storage
format.

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Paul Rubin
Chris Angelico ros...@gmail.com writes:
 And of course, taking the *entire* rest of the string isn't the only
 thing you do. What if you want to take the next six characters after
 that index? That would be constant time with a fixed-width storage
 format.

How often is this an issue in practice?

I wonder how other languages deal with this.  The examples I can think
of are poor role models:

1. C/C++ - unicode impaired, other than a wchar type

2. Java - bogus UCS-2-like(?) representation for historical reasons.
   Also has some modified UTF-8 for reasons that made no sense and
   that I don't remember.

3. Haskell - basic string type is a linked list of code points.
   "hello" is five list nodes.  New Data.Text library (much more
   efficient) uses something like ropes, I think, with UTF-16 underneath.

4. Erlang - I think like Haskell.  Efficiently handles byte blocks.

5. Perl 6 -- ???

6. Ruby - ??? (but probably quite slow like the rest of Ruby)

7. Objective C -- ???

8, 9 ...  (any other important ones?)
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Encapsulation, inheritance and polymorphism

2012-08-19 Thread Mark Lawrence

On 19/08/2012 06:21, Robert Miles wrote:

On 7/23/2012 11:18 AM, Albert van der Horst wrote:

In article 5006b48a$0$29978$c3e8da3$54964...@news.astraweb.com,
Steven D'Aprano  steve+comp.lang.pyt...@pearwood.info wrote:
SNIP.

Even with a break, why bother continuing through the body of the
function
when you already have the result? When your calculation is done, it's
done, just return for goodness sake. You wouldn't write a search that
keeps going after you've found the value that you want, out of some
misplaced sense that you have to look at every value. Why write code
with
unnecessary guard values and temporary variables out of a misplaced
sense
that functions must only have one exit?


Example from recipes:

Stir until the egg white is stiff.

Alternative:
Stir egg white for half an hour,
but if the egg white is stiff keep your spoon still.

(Cooking is not my field of expertise, so the wording may
not be quite appropriate. )


--
Steven


Groetjes Albert


Note that you forgot to apply enough heat to do the cooking.




Surely the first check is your filing system to make sure that you've 
paid the utilities bills so you've got gas and/or electricity to apply 
the heat.  Either that or you hire Ray Mears to produce the spark needed 
to light the fire :)


--
Cheers.

Mark Lawrence.

--
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread wxjmfauth
About the examples contested by Steven:

eg: timeit.timeit("('ab…' * 10).replace('…', 'œ…')")


And it is good enough to show the problem. Period. The
rest (you have to do this, you should not do this, why
are you using these characters - amazing and stupid
question -) does not count.

The real problem is elsewhere. *Americans* do not wish
a character occupies 4 bytes in *their* memory. The rest
of the world does not count.

The same thing happens with the utf-8 coding scheme.
Technically, it is fine. But after n years of usage,
one should recognize it just became an ascii2, especially
for those who understand nothing in that field and are
not even aware characters are coded. I'm the first
to think this is legitimate.

Memory or ability to treat all text in the same and equal
way?

End note. This kind of discussion is not specific to
Python; it always happens when there is some kind of
conflict between ascii and non-ascii users.

Have a nice day.

jmf

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: New internal string format in 3.3, was Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Steven D'Aprano
On Sun, 19 Aug 2012 09:43:13 +0200, Peter Otten wrote:

 Steven D'Aprano wrote:

 I don't know where people are getting this myth that PEP 393 uses
 Latin-1 internally, it does not. Read the PEP, it explicitly states
 that 1-byte formats are only used for ASCII strings.
 
 From
 
 Python 3.3.0a4+ (default:10a8ad665749, Jun  9 2012, 08:57:51) [GCC
 4.6.1] on linux
 Type "help", "copyright", "credits" or "license" for more information.
 >>> import sys
 >>> [sys.getsizeof("é"*i) for i in range(10)]
 [49, 74, 75, 76, 77, 78, 79, 80, 81, 82]

Interesting. Say, I don't suppose you're using a 64-bit build? Because 
that would explain why your sizes are so much larger than mine:

py> [sys.getsizeof("é"*i) for i in range(10)]
[25, 38, 39, 40, 41, 42, 43, 44, 45, 46]


py> [sys.getsizeof("€"*i) for i in range(10)]
[25, 40, 42, 44, 46, 48, 50, 52, 54, 56]

py> c = chr(0xFFFF + 1)
py> [sys.getsizeof(c*i) for i in range(10)]
[25, 44, 48, 52, 56, 60, 64, 68, 72, 76]


On re-reading the PEP more closely, it looks like I did misunderstand the 
internal implementation, and strings which fit exactly in Latin-1 will 
also use 1 byte per character. There are three structures used:

PyASCIIObject
PyCompactUnicodeObject
PyUnicodeObject

and the third one comes in three variant forms, for 1-byte, 2-byte and 4-
byte data. So I stand corrected.


-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Branch and Bound Algorithm / Module for Python?

2012-08-19 Thread Rebekka-Marie
Hello everybody,

I would like to solve a Mixed Integer Optimization Problem with the 
Branch-And-Bound Algorithm.

I designed my minimizing function and the constraints. I tested them in a small 
program in AIMMS. So I already know that they are solvable.

Now I want to solve them using Python.

Is there a module or method that I can download, or a ready-made program 
that you know about, where I can put my constraints and minimization function 
in? 

Rebekka



-- 
http://mail.python.org/mailman/listinfo/python-list


Re: New internal string format in 3.3, was Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread wxjmfauth
Le dimanche 19 août 2012 10:56:36 UTC+2, Steven D'Aprano a écrit :
 
 internal implementation, and strings which fit exactly in Latin-1 will 
 

And this is the crucial point: latin-1 is an obsolete and unusable
coding scheme (esp. for european languages).

We fall on the point I mentioned above. Microsoft knows this, ditto
for Apple, ditto for TeX, ditto for the foundries.
Even ISO has recognized its error and produced iso-8859-15.

The question? Why is it still used?

jmf



-- 
http://mail.python.org/mailman/listinfo/python-list


Re: New internal string format in 3.3

2012-08-19 Thread Peter Otten
Steven D'Aprano wrote:

 On Sun, 19 Aug 2012 09:43:13 +0200, Peter Otten wrote:
 
 Steven D'Aprano wrote:
 
 I don't know where people are getting this myth that PEP 393 uses
 Latin-1 internally, it does not. Read the PEP, it explicitly states
 that 1-byte formats are only used for ASCII strings.
 
 From
 
 Python 3.3.0a4+ (default:10a8ad665749, Jun  9 2012, 08:57:51) [GCC
 4.6.1] on linux
  Type "help", "copyright", "credits" or "license" for more information.
  >>> import sys
  >>> [sys.getsizeof("é"*i) for i in range(10)]
  [49, 74, 75, 76, 77, 78, 79, 80, 81, 82]
 
 Interesting. Say, I don't suppose you're using a 64-bit build? Because
 that would explain why your sizes are so larger than mine:
 
 py> [sys.getsizeof("é"*i) for i in range(10)]
 [25, 38, 39, 40, 41, 42, 43, 44, 45, 46]
 
 
 py> [sys.getsizeof("€"*i) for i in range(10)]
 [25, 40, 42, 44, 46, 48, 50, 52, 54, 56]

Yes, I am using a 64-bit build. I thought that

 (2) Latin1 strings have a constant overhead of 24 bytes (on a 64bit 
 system) over ASCII-only.

would convey that. The corresponding data structure 

typedef struct {
  PyASCIIObject _base;
  Py_ssize_t utf8_length;
  char *utf8;
  Py_ssize_t wstr_length;
} PyCompactUnicodeObject;

makes for 12 extra bytes on 32 bit, and both Py_ssize_t and pointers double 
in size (from 4 to 8 bytes) on 64 bit. I'm sure you can do the maths for the 
embedded PyASCIIObject yourself.
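
For the record, my arithmetic, assuming the usual 64-bit field sizes:
sizeof(PyASCIIObject) is 48 bytes, so the empty ASCII string costs 48 + 1
(trailing NUL) = 49; PyCompactUnicodeObject adds 8 + 8 + 8 = 24, giving 72,
so "é" costs 72 + 1 (data) + 1 (NUL) = 74 -- matching the figures above.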

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Branch and Bound Algorithm / Module for Python?

2012-08-19 Thread Steven D'Aprano
On Sun, 19 Aug 2012 02:04:20 -0700, Rebekka-Marie wrote:

 I would like to solve a Mixed Integer Optimization Problem with the
 Branch-And-Bound Algorithm.
[...]
 Is there a module / methods that I can download or a ready-made program
 text that you know about, where I can put my constraints and
 minimization function in?

Sounds like it might be something from Numpy or Scipy?

http://numpy.scipy.org/
http://www.scipy.org/


This might be useful too:

http://telliott99.blogspot.com.au/2010/03/branch-and-bound.html
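
If it helps to see the shape of the algorithm itself, here is a toy
branch-and-bound for a 0/1 knapsack (my sketch only -- a real mixed-integer
solver would bound with LP relaxations rather than a greedy fractional
fill):

def knapsack_bb(values, weights, capacity):
    # order items by value density; the bound fills leftover space fractionally
    order = sorted(range(len(values)),
                   key=lambda i: values[i] / weights[i], reverse=True)

    def bound(i, cap, val):
        for j in order[i:]:
            if weights[j] <= cap:
                cap -= weights[j]
                val += values[j]
            else:
                return val + values[j] * cap / weights[j]
        return val

    best = 0

    def branch(i, cap, val):
        nonlocal best
        best = max(best, val)
        if i == len(order) or bound(i, cap, val) <= best:
            return                       # prune: bound cannot beat incumbent
        j = order[i]
        if weights[j] <= cap:
            branch(i + 1, cap - weights[j], val + values[j])   # take item j
        branch(i + 1, cap, val)                                # skip item j

    branch(0, capacity, 0)
    return best

print(knapsack_bb([60, 100, 120], [10, 20, 30], 50))   # -> 220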


Good luck! If you do find something, come back and tell us please.


-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread lipska the kat

On 19/08/12 07:09, Steven D'Aprano wrote:

This is a long post. If you don't feel like reading an essay, skip to the
very bottom and read my last few paragraphs, starting with To recap.


Thank you for this excellent post,
it has certainly cleared up a few things for me

[snip]

incidentally

 But in UTF-16, ...

[snip]

 py> s = chr(0xFFFF + 1)
 py> a, b = s
 py> a
 '\ud800'
 py> b
 '\udc00'

in IDLE

Python 3.2.3 (default, May  3 2012, 15:51:42)
[GCC 4.6.3] on linux2
Type "copyright", "credits" or "license()" for more information.
==== No Subprocess ====
>>> s = chr(0xFFFF + 1)
>>> a, b = s
Traceback (most recent call last):
  File "<pyshell#1>", line 1, in <module>
    a, b = s
ValueError: need more than 1 value to unpack

At a terminal prompt

[lipska@ubuntu ~]$ python3.2
Python 3.2.3 (default, Jul 17 2012, 14:23:10)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> s = chr(0xFFFF + 1)
>>> a, b = s
>>> a
'\ud800'
>>> b
'\udc00'


The date stamp is different but the Python version is the same

No idea why this is happening, I just thought it was interesting

lipska

--
Lipska the Kat©: Troll hunter, sandbox destroyer
and farscape dreamer of Aeryn Sun
--
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Chris Angelico
On Sun, Aug 19, 2012 at 8:13 PM, lipska the kat
lipskathe...@yahoo.co.uk wrote:
 The date stamp is different but the Python version is the same

Check out what 'sys.maxunicode' is in each of those Pythons. It's
possible that one is a wide build and the other narrow.

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: New internal string format in 3.3

2012-08-19 Thread wxjmfauth
Le dimanche 19 août 2012 11:37:09 UTC+2, Peter Otten a écrit :


You know, the technical aspect is one thing. Understanding
the coding of the characters as a whole is something
else. The important point is not the coding per se; the
relevant point is the set of characters a coding may
represent.

You can build the most sophisticated mechanism you wish;
if it does not take that point into account, it will
always fail or be sub-optimal.

This is precisely the weak point of this flexible
representation. It uses latin-1, and latin-1 is for
most users simply unusable.

Fascinating, isn't it? Devs are developing sophisticated
tools based on a non-working basis.

jmf
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: New internal string format in 3.3

2012-08-19 Thread Chris Angelico
On Sun, Aug 19, 2012 at 8:19 PM,  wxjmfa...@gmail.com wrote:
 This is precicely the weak point of this flexible
 representation. It uses latin-1 and latin-1 is for
 most users simply unusable.

No, it uses Unicode, and as an optimization, attempts to store the
codepoints in less than four bytes for most strings. The fact that a
one-byte storage format happens to look like latin-1 is rather
coincidental.
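
The selection rule is just the maximum code point in the string (my sketch
of what the PEP describes):

def storage_width(s):
    # bytes per code point a PEP 393 string needs: 1, 2 or 4
    m = max(map(ord, s)) if s else 0
    return 1 if m < 0x100 else 2 if m < 0x10000 else 4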

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Branch and Bound Algorithm / Module for Python?

2012-08-19 Thread Mark Lawrence

On 19/08/2012 11:04, Steven D'Aprano wrote:

On Sun, 19 Aug 2012 02:04:20 -0700, Rebekka-Marie wrote:


I would like to solve a Mixed Integer Optimization Problem with the
Branch-And-Bound Algorithm.

[...]

Is there a module / methods that I can download or a ready-made program
text that you know about, where I can put my constraints and
minimization function in?


Sounds like it might be something from Numpy or Scipy?

http://numpy.scipy.org/
http://www.scipy.org/


This might be useful too:

http://telliott99.blogspot.com.au/2010/03/branch-and-bound.html


Good luck! If you do find something, come back and tell us please.




In addition to the above there's always the Python Package Index at 
http://pypi.python.org/pypi


--
Cheers.

Mark Lawrence.

--
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Mark Lawrence

On 19/08/2012 09:54, wxjmfa...@gmail.com wrote:

About the examples contested by Steven:

eg: timeit.timeit("('ab…' * 10).replace('…', 'œ…')")


And it is good enough to show the problem. Period. The
rest (you have to do this, you should not do this, why
are you using these characters - amazing and stupid
question -) does not count.

The real problem is elsewhere. *Americans* do not wish
a character occupies 4 bytes in *their* memory. The rest
of the world does not count.

The same thing happens with the utf-8 coding scheme.
Technically, it is fine. But after n years of usage,
one should recognize it just became an ascii2, especially
for those who understand nothing in that field and are
not even aware characters are coded. I'm the first
to think this is legitimate.

Memory or ability to treat all text in the same and equal
way?

End note. This kind of discussion is not specific to
Python; it always happens when there is some kind of
conflict between ascii and non-ascii users.

Have a nice day.

jmf



Roughly translated: I've been shot to pieces, and having seen Monty 
Python and the Holy Grail I know what to do.  "Run away, run away!"


--
Cheers.

Mark Lawrence.

--
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread lipska the kat

On 19/08/12 11:19, Chris Angelico wrote:

On Sun, Aug 19, 2012 at 8:13 PM, lipska the kat
lipskathe...@yahoo.co.uk  wrote:

The date stamp is different but the Python version is the same


Check out what 'sys.maxunicode' is in each of those Pythons. It's
possible that one is a wide build and the other narrow.


Ah ...

I built my local version from source
and no, I didn't read the makefile so I didn't configure for a wide 
build :-( not that I would have known the difference at that time.


[lipska@ubuntu ~]$ python3.2
Python 3.2.3 (default, Jul 17 2012, 14:23:10)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.maxunicode
65535


Later, I did an apt-get install idle3 which pulled
down a precompiled IDLE from the Ubuntu repos
This was obviously compiled 'wide'

Python 3.2.3 (default, May  3 2012, 15:51:42)
[GCC 4.6.3] on linux2
Type "copyright", "credits" or "license()" for more information.
==== No Subprocess ====
>>> import sys
>>> sys.maxunicode
1114111


All very interesting and enlightening

Thanks

lipska

--
Lipska the Kat©: Troll hunter, sandbox destroyer
and farscape dreamer of Aeryn Sun
--
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Steven D'Aprano
On Sun, 19 Aug 2012 01:11:56 -0700, Paul Rubin wrote:

 Steven D'Aprano steve+comp.lang.pyt...@pearwood.info writes:
 result = text[end:]
 
 if "end" is not near the end of the original string, then this is O(N) even
 with fixed-width representation, because of the char copying.

Technically, yes. But it's a straight copy of a chunk of memory, which 
means it's fast: your OS and hardware tries to make straight memory 
copies as fast as possible. Big-Oh analysis frequently glosses over 
implementation details like that.

Of course, that assumption gets shaky when you start talking about extra 
large blocks, and it falls apart completely when your OS starts paging 
memory to disk.

But if it helps to avoid irrelevant technical details, change it to 
text[end:end+10] or something.


 if it is near the end, by knowing where the string data area ends, I
 think it should be possible to scan backwards from the end, recognizing
 what bytes can be the beginning of code points and counting off the
 appropriate number.  This is O(1) if "near the end" means within a
 constant.

You know, I think you are misusing Big-Oh analysis here. It really 
wouldn't be helpful for me to say Bubble Sort is O(1) if you only sort 
lists with a single item. Well, yes, that is absolutely true, but that's 
a special case that doesn't give you any insight into why using Bubble 
Sort as your general purpose sort routine is a terrible idea.

Using variable-sized strings like UTF-8 and UTF-16 for in-memory 
representations is a terrible idea because you can't assume that people 
will only ever want to index the first or last character. On average, 
you need to scan half the string, one character at a time. In Big-Oh, we 
can ignore the factor of 1/2 and just say we scan the string, O(N).

That's why languages tend to use fixed character arrays for strings. 
Haskell is an exception, using linked lists which require traversing the 
string to jump to an index. The manual even warns:

[quote]
If you think of a Text value as an array of Char values (which it is 
not), you run the risk of writing inefficient code.

An idiom that is common in some languages is to find the numeric offset 
of a character or substring, then use that number to split or trim the 
searched string. With a Text value, this approach would require two O(n) 
operations: one to perform the search, and one to operate from wherever 
the search ended. 
[end quote]

http://hackage.haskell.org/packages/archive/text/0.11.2.2/doc/html/Data-Text.html



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Encapsulation, inheritance and polymorphism

2012-08-19 Thread lipska the kat

On 19/08/12 09:55, Mark Lawrence wrote:

On 19/08/2012 06:21, Robert Miles wrote:

On 7/23/2012 11:18 AM, Albert van der Horst wrote:

In article 5006b48a$0$29978$c3e8da3$54964...@news.astraweb.com,
Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote:


[snip]


that functions must only have one exit?


[snip]



Surely the first check is your filing system to make sure that you've
paid the utilities bills so you've got gas and/or electricity to apply
the heat. Either that or you hire Ray Mears to produce the spark needed
to light the fire :)


I was wondering how long it would be ...

lipska

--
Lipska the Kat©: Troll hunter, sandbox destroyer
and farscape dreamer of Aeryn Sun
--
http://mail.python.org/mailman/listinfo/python-list


Re: Encapsulation, inheritance and polymorphism

2012-08-19 Thread Mark Lawrence

On 19/08/2012 12:50, lipska the kat wrote:

On 19/08/12 09:55, Mark Lawrence wrote:

On 19/08/2012 06:21, Robert Miles wrote:

On 7/23/2012 11:18 AM, Albert van der Horst wrote:

In article 5006b48a$0$29978$c3e8da3$54964...@news.astraweb.com,
Steven D'Aprano steve+comp.lang.pyt...@pearwood.info wrote:


[snip]


that functions must only have one exit?


[snip]



Surely the first check is your filing system to make sure that you've
paid the utilities bills so you've got gas and/or electricity to apply
the heat. Either that or you hire Ray Mears to produce the spark needed
to light the fire :)


I was wondering how long it would be ...

lipska



Six days shalt thou labour... :)

--
Cheers.

Mark Lawrence.

--
http://mail.python.org/mailman/listinfo/python-list


Re: New internal string format in 3.3

2012-08-19 Thread wxjmfauth
Le dimanche 19 août 2012 12:26:44 UTC+2, Chris Angelico a écrit :
 On Sun, Aug 19, 2012 at 8:19 PM,  wxjmfa...@gmail.com wrote:
 
  This is precisely the weak point of this flexible
 
  representation. It uses latin-1 and latin-1 is for
 
  most users simply unusable.
 
 
 
 No, it uses Unicode, and as an optimization, attempts to store the
 
 codepoints in less than four bytes for most strings. The fact that a
 
 one-byte storage format happens to look like latin-1 is rather
 
 coincidental.
 

And this is the common basic mistake. You do not push your
argumentation far enough. A character may fall accidentally in latin-1.
The problem lies in those european characters which cannot fall in this
coding. This *is* the cause of the negative side effects.
If you are using a correct coding scheme, like cp1252, mac-roman or
iso-8859-15, you will never see such a negative side effect.
Again, the problem is not the result, the encoded character. The critical
part is the character which may cause this side effect.
You should think "character set" and not "encoded code point", considering
this kind of expression has a sense in an 8-bit coding scheme.

jmf
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: New internal string format in 3.3

2012-08-19 Thread Dave Angel
On 08/19/2012 08:14 AM, wxjmfa...@gmail.com wrote:
 Le dimanche 19 août 2012 12:26:44 UTC+2, Chris Angelico a écrit :
 On Sun, Aug 19, 2012 at 8:19 PM,  wxjmfa...@gmail.com wrote:

 This is precisely the weak point of this flexible
 representation. It uses latin-1 and latin-1 is for
 most users simply unusable.


 No, it uses Unicode, and as an optimization, attempts to store the

 codepoints in less than four bytes for most strings. The fact that a

 one-byte storage format happens to look like latin-1 is rather

 coincidental.

 And this is the common basic mistake. You do not push your
 argumentation far enough. A character may fall accidentally into latin-1.
 The problem lies in those European characters which cannot fall into this
 coding. This *is* the cause of the negative side effects.
 If you are using a correct coding scheme, like cp1252, mac-roman or
 iso-8859-15, you will never see such a negative side effect.
 Again, the problem is not the result, the encoded character. The critical
 part is the character which may cause this side effect.
 You should think "character set" and not "encoded code point", considering
 this kind of expression has a sense in an 8-bit coding scheme.

 jmf

But that choice was made decades ago when Unicode picked its second 128
characters.  The internal form used in this PEP is simply the low-order
byte of the Unicode code point.  Trying to scan the string deciding if
converting to cp1252 (for example) would be a much more expensive
operation than seeing how many bytes it'd take for the largest code point.





-- 

DaveA

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: New internal string format in 3.3

2012-08-19 Thread Dave Angel
(pardon the resend, but I accidentally omitted a couple of words)
On 08/19/2012 08:14 AM, wxjmfa...@gmail.com wrote:
 Le dimanche 19 août 2012 12:26:44 UTC+2, Chris Angelico a écrit :
 SNIP


 No, it uses Unicode, and as an optimization, attempts to store the
 codepoints in less than four bytes for most strings. The fact that a
 one-byte storage format happens to look like latin-1 is rather
 coincidental.

 And this is the common basic mistake. You do not push your
 argumentation far enough. A character may fall accidentally into latin-1.
 The problem lies in those European characters which cannot fall into this
 coding. This *is* the cause of the negative side effects.
 If you are using a correct coding scheme, like cp1252, mac-roman or
 iso-8859-15, you will never see such a negative side effect.
 Again, the problem is not the result, the encoded character. The critical
 part is the character which may cause this side effect.
 You should think "character set" and not "encoded code point", considering
 this kind of expression has a sense in an 8-bit coding scheme.

 jmf

But that choice was made decades ago when Unicode picked its second 128
characters.  The internal form used in this PEP is simply the low-order
byte of the Unicode code point.  Trying to scan the string deciding if
converting to cp1252 (for example) would work, would be a much more
expensive operation than seeing how many bytes it'd take for the largest
code point.

The 8 bit form is used if all the code points are less than 256.  That
is a simple description, and simple code.  As several people have said,
the fact that this byte matches one of the DECODED forms is coincidence.
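
To make that concrete, a minimal sketch (Python 3.3+ only; the exact
byte counts printed are implementation details and vary by build):

    import sys

    # PEP 393 picks the narrowest fixed width that fits the largest
    # code point: < 256 -> 1 byte/char, < 65536 -> 2, otherwise 4.
    for sample in ('abc', 'ab\xe7', 'ab\u20ac', 'ab\U000110f3'):
        s = sample * 100
        print(hex(max(map(ord, s))), sys.getsizeof(s))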

-- 

DaveA

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Top-posting c. (was Re: [ANNC] pybotwar-0.8)

2012-08-19 Thread python
Hi Steve,

 I don't think I'm some sort of hyper-evolved mega-genius with a brain the 
 size of a planet, I'm just some guy.

Based on reading thousands of your posts over the past 4 years, I'll
have to respectfully disagree with you on your assertion that you are
not some hyper-evolved genius with a brain the size of a planet. :)

I've learned a ton from reading your posts - so much so that I think my
brain is getting heavier[1].

Thank you and cheers!
Malcolm

[1] From a recent thread on this mailing list (hilarious):
http://onceuponatimeinindia.blogspot.in/2009/07/hard-drive-weight-increasing.html
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: New internal string format in 3.3

2012-08-19 Thread wxjmfauth
Le dimanche 19 août 2012 14:29:17 UTC+2, Dave Angel a écrit :
 On 08/19/2012 08:14 AM, wxjmfa...@gmail.com wrote:
 [snip]

 But that choice was made decades ago when Unicode picked its second 128
 characters.  The internal form used in this PEP is simply the low-order
 byte of the Unicode code point.  Trying to scan the string deciding if
 converting to cp1252 (for example) would be a much more expensive
 operation than seeing how many bytes it'd take for the largest code point.

You are absolutely right. (I'm quite comfortable with Unicode.)
If Python wishes to perpetuate this, let's call it, design mistake
or annoyance, it will continue to live with problems.

People (tools) who chose pure utf-16 or utf-32 are not suffering
from this issue.

*My* final comment on this thread.

In August 2012, after 20 years of development, Python is not
able to display a piece of text correctly on a Windows console
(eg cp65001).

I downloaded the go language, zero experience, I did not succeed
to display incorrectly a piece of text. (This is by the way *the*
reason why I tested it). Where the problems are coming from, I have
no idea.

I find this situation quite comic. Python is able to
produce this:

>>> (1.1).hex()
'0x1.199999999999ap+0'

but it is not able to display a piece of text!

Try to convince end users IEEE 754 is more important than the
ability to read/write a piece of text a 6-year-old kid has learned
at school :-)

(I'm not suffering from this kind of effect; as a Windows user,
I'm always working via gui. It still remains, the problem exists.)
Regards,
jmf
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Steven D'Aprano
On Sun, 19 Aug 2012 01:04:25 -0700, Paul Rubin wrote:

 Steven D'Aprano steve+comp.lang.pyt...@pearwood.info writes:

 This standard data structure is called UCS-2 ... There's an extension
 to UCS-2 called UTF-16
 
 My own understanding is UCS-2 simply shouldn't be used any more. 

Pretty much. But UTF-16 with lax support for surrogates (that is, 
surrogates are included but treated as two characters) is essentially 
UCS-2 with the restriction against surrogates lifted. That's what Python 
currently does, and Javascript.

http://mathiasbynens.be/notes/javascript-encoding
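
To illustrate the lax-surrogate behaviour (a sketch; the commented
results assume a 3.2 narrow build, where one astral character is stored
as two surrogate code units):

    s = '\U00010000'    # one supplementary-plane character
    print(len(s))       # 2 on a narrow build, 1 on a wide build or 3.3+
    print(repr(s[0]))   # a lone lead surrogate, '\ud800', on a narrow build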

The reality is that support for the Unicode supplementary planes is 
pretty poor. Even when applications support it, most fonts don't have 
glyphs for the characters. Anything which makes handling of Unicode 
supplementary characters better is a step forward.


 * Variable-byte formats like UTF-8 and UTF-16 mean that basic string
 operations are not O(1) but are O(N). That means they are slow, or
 buggy, pick one.
 
 This I don't see.  What are the basic string operations?

The ones I'm specifically referring to are indexing and copying 
substrings. There may be others.
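
For instance, indexing a UTF-8 buffer has to scan from the start, since
each character occupies 1 to 4 bytes (a sketch of the idea, not anything
CPython actually does):

    def utf8_index(buf, i):
        """Return code point i from UTF-8 bytes buf -- an O(N) scan."""
        width = lambda b: 1 if b < 0x80 else 2 if b < 0xE0 else 3 if b < 0xF0 else 4
        pos = 0
        for _ in range(i):               # skip i whole characters
            pos += width(buf[pos])
        return buf[pos:pos + width(buf[pos])].decode('utf-8')

    data = 'aé€\U000110f3'.encode('utf-8')
    assert utf8_index(data, 3) == '\U000110f3'   # found only after scanning a, é, €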


 * Examine the first character, or first few characters (few = usually
   bounded by a small constant) such as to parse a token from an input
   stream.  This is O(1) with either encoding.

That's actually O(K), for K = "a few", whatever "a few" means. But we 
know that anything is fast for small enough N (or K in this case).


 * Slice off the first N characters.  This is O(N) with either encoding
   if it involves copying the chars.  I guess you could share references
   into the same string, but if the slice reference persists while the
   big reference is released, you end up not freeing the memory until
   later than you really should.

As a first approximation, memory copying is assumed to be free, or at 
least constant time. That's not strictly true, but Big Oh analysis is 
looking at algorithmic complexity. It's not a substitute for actual 
benchmarks.


 Meanwhile, an example of the 393 approach failing: I was involved in a
 project that dealt with terabytes of OCR data of mostly English text.

I assume that this wasn't one giant multi-terabyte string.

 So
 the chars were mostly ascii, but there would be occasional non-ascii
 chars including supplementary plane characters, either because of
 special symbols that were really in the text, or the typical OCR
 confusion emitting those symbols due to printing imprecision.  That's a
 natural for UTF-8 but the PEP-393 approach would bloat up the memory
 requirements by a factor of 4.

Not necessarily. Presumably you're scanning each page into a single 
string. Then only the pages containing a supplementary plane char will be 
bloated, which is likely to be rare. Especially since I don't expect your 
OCR application would recognise many non-BMP characters -- what does 
U+110F3, SORA SOMPENG DIGIT THREE, look like? If the OCR software 
doesn't recognise it, you can't get it in your output. (If you do, the 
OCR software has a nasty bug.)

Anyway, in my ignorant opinion the proper fix here is to tell the OCR 
software not to bother trying to recognise Imperial Aramaic, Domino 
Tiles, Phaistos Disc symbols, or Egyptian Hieroglyphs if you aren't 
expecting them in your source material. Not only will the scanning go 
faster, but you'll get fewer wrong characters.


[...]
 I realize the folks who designed and implemented PEP 393 are very smart
 cookies and considered stuff carefully, while I'm just an internet user
 posting an immediate impression of something I hadn't seen before (I
 still use Python 2.6), but I still have to ask: if the 393 approach
 makes sense, why don't other languages do it?

There has to be a first time for everything.


 Ropes of UTF-8 segments seems like the most obvious approach and I
 wonder if it was considered.

Ropes have been considered and rejected because while they are 
asymptotically fast, in common cases the added complexity actually makes 
them slower. Especially for immutable strings where you aren't inserting 
into the middle of a string.

http://mail.python.org/pipermail/python-dev/2000-February/002321.html

PyPy has revisited ropes and uses, or at least used, ropes as their 
native string data structure. But that's ropes of *bytes*, not UTF-8.
 
http://morepypy.blogspot.com.au/2007/11/ropes-branch-merged.html


-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: New internal string format in 3.3

2012-08-19 Thread Steven D'Aprano
On Sun, 19 Aug 2012 03:19:23 -0700, wxjmfauth wrote:

 This is precisely the weak point of this flexible representation. It
 uses latin-1 and latin-1 is for most users simply unusable.

That's very funny.

Are you aware that your post is entirely Latin-1?


 Fascinating, isn't it? Devs are developing sophisticated tools based on a
 non-working basis.

At the end of the day, PEP 393 fixes some major design limitations of the 
Unicode implementation in the narrow build Python, while saving memory 
for people using the wide build. Everybody wins here. Your objection 
appears to be based more on some sort of philosophical objection to Latin-1 
than on any genuine problem.


-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: New internal string format in 3.3

2012-08-19 Thread Mark Lawrence

On 19/08/2012 13:59, wxjmfa...@gmail.com wrote:

Le dimanche 19 août 2012 14:29:17 UTC+2, Dave Angel a écrit :
[snip]

You are absolutely right. (I'm quite comfortable with Unicode.)
If Python wishes to perpetuate this, let's call it, design mistake
or annoyance, it will continue to live with problems.


Please give a precise description of the design mistake and what you 
would do to correct it.


People (tools) who chose pure utf-16 or utf-32 are not suffering
from this issue.

*My* final comment on this thread.

In August 2012, after 20 years of development, Python is not
able to display a piece of text correctly on a Windows console
(eg cp65001).


Examples please.


I downloaded the go language, zero experience, I did not succeed
to display incorrectly a piece of text. (This is by the way *the*
reason why I tested it). Where the problems are coming from, I have
no idea.

I find this situation quite comic. Python is able to
produce this:

>>> (1.1).hex()
'0x1.199999999999ap+0'

but it is not able to display a piece of text!


So you keep saying, but when asked for examples or evidence nothing gets 
produced.


Try to convince end users IEEE 754 is more important than the
ability to read/write a piece of text a 6-year-old kid has learned
at school :-)

(I'm not suffering from this kind of effect; as a Windows user,
I'm always working via gui. It still remains, the problem exists.)


Windows is a law unto itself.  Its problems are hardly specific to Python.


Regards,
jmf


Now two or three times you've said you're going but have come back.  If 
you come again could you please provide examples and or evidence of what 
you're on about, because you still have me baffled.


--
Cheers.

Mark Lawrence.

--
http://mail.python.org/mailman/listinfo/python-list


Re: New internal string format in 3.3

2012-08-19 Thread wxjmfauth
Le dimanche 19 août 2012 15:46:34 UTC+2, Mark Lawrence a écrit :
 On 19/08/2012 13:59, wxjmfa...@gmail.com wrote:
 [snip]
 Now two or three times you've said you're going but have come back.  If 
 you come again could you please provide examples and or evidence of what 
 you're on about, because you still have me baffled.

 --
 Cheers.

 Mark Lawrence.

Yesterday, I went to bed.
More seriously.

I can not give you more numbers than those I gave.
As an end user, I noticed in my experiments that my random tests
are always slower in Py3.3 than in Py3.2 on my Windows platform.

It is up to you, the core developers, to give an explanation
for this behaviour.

As I understand a little bit the coding of the characters,
I pointed out that this is most probably due to this flexible
string representation (with arguments appearing randomly
in the misc. messages, mainly latin-1).

I can not do more.

(I stupidly spoke about factors 0.1 to ..., you should
read of course, 1.1,  to ...)

jmf
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: New internal string format in 3.3

2012-08-19 Thread Oscar Benjamin
On 19 August 2012 15:09, wxjmfa...@gmail.com wrote:

 I can not give you more numbers than those I gave.
 As a end user, I noticed and experimented my random tests
 are always slower in Py3.3 than in Py3.2 on my Windows platform.


Do the problems have a significant impact on any real application (rather
than random tests)?

Any significant change in implementation such as this is likely to have
both positive and negative performance effects. The important thing is how it
affects a real application as a whole.



 It is up to you, the core developers to give an explanation
 about this behaviour.


Not unless others are able to reproduce your observations.

If there is a big performance hit for text heavy applications then it's
worth reporting but you should focus your energy on distilling a
*meaningful* test case (rather than ranting about Americans, unicode,
latin-1 and so on).

Oscar
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: New internal string format in 3.3

2012-08-19 Thread Mark Lawrence

On 19/08/2012 15:09, wxjmfa...@gmail.com wrote:



I can not give you more numbers than those I gave.
As an end user, I noticed in my experiments that my random tests
are always slower in Py3.3 than in Py3.2 on my Windows platform.


Once again you refuse to supply anything to back up what you say.



It is up to you, the core developers to give an explanation
about this behaviour.


Core developers cannot give an explanation for something that doesn't 
exist, except in your imagination.  Unless you can produce the evidence 
that supports your claims, including details of OS, benchmarks used and 
so on and so forth.




As I understand a little bit the coding of the characters,
I pointed out that this is most probably due to this flexible
string representation (with arguments appearing randomly
in the misc. messages, mainly latin-1).

I can not do more.

(I stupidly spoke about factors 0.1 to ..., you should
read of course, 1.1,  to ...)

jmf



I suspect that I'll be dead and buried long before you can produce 
anything concrete in the way of evidence.  I've thrown down the gauntlet 
several times, do you now have the courage to pick it up, or are you 
going to resort to the FUD approach that you've been using throughout 
this thread?


--
Cheers.

Mark Lawrence.

--
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread DJC

On 19/08/12 15:25, Steven D'Aprano wrote:


Not necessarily. Presumably you're scanning each page into a single
string. Then only the pages containing a supplementary plane char will be
bloated, which is likely to be rare. Especially since I don't expect your
OCR application would recognise many non-BMP characters -- what does
U+110F3, SORA SOMPENG DIGIT THREE, look like? If the OCR software
doesn't recognise it, you can't get it in your output. (If you do, the
OCR software has a nasty bug.)

Anyway, in my ignorant opinion the proper fix here is to tell the OCR
software not to bother trying to recognise Imperial Aramaic, Domino
Tiles, Phaistos Disc symbols, or Egyptian Hieroglyphs if you aren't
expecting them in your source material. Not only will the scanning go
faster, but you'll get fewer wrong characters.


Consider the automated recognition of a CAPTCHA. As the chars have to be 
entered by the user on a keyboard, only the most basic charset can be 
used, so the problem of which chars are possible is quite limited.

--
http://mail.python.org/mailman/listinfo/python-list


Re: New internal string format in 3.3

2012-08-19 Thread wxjmfauth
Le dimanche 19 août 2012 16:48:48 UTC+2, Mark Lawrence a écrit :
 On 19/08/2012 15:09, wxjmfa...@gmail.com wrote:
 [snip]
 I suspect that I'll be dead and buried long before you can produce 
 anything concrete in the way of evidence.  I've thrown down the gauntlet 
 several times, do you now have the courage to pick it up, or are you 
 going to resort to the FUD approach that you've been using throughout 
 this thread?

 --
 Cheers.

 Mark Lawrence.

I do not remember the tests I have done at the 1st alpha release
time. It was with an interactive interpreter. I paid particular
attention to testing the chars you can find in the range 128..256
in all 8-bit coding schemes. Chars I suspected to be problematic.

Here is a short test again, a random single test, the first
idea that came to my mind.

Py 3.2.3
>>> timeit.timeit("('aœ€'*100).replace('a', 'œ€é')")
4.99396356635981

Py 3.3b2
>>> timeit.timeit("('aœ€'*100).replace('a', 'œ€é')")
7.560455708007855

Maybe not so demonstrative. It shows at least that we
are far away from the 10-30% announced.

>>> 7.56 / 5
1.512
>>> 5 / (7.56 - 5) * 100
195.312503


jmf


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Terry Reedy

On 8/19/2012 4:54 AM, wxjmfa...@gmail.com wrote:

About the examples contested by Steven:
eg: timeit.timeit("('ab…' * 10).replace('…', 'œ…')")
And it is good enough to show the problem. Period.


Repeating a false claim over and over does not make it true. Two people 
on pydev claim that 3.3 is *faster* on their systems (one unspecified, 
one OSX10.8).


--
Terry Jan Reedy


--
http://mail.python.org/mailman/listinfo/python-list


How does .rjust() work and why it places characters relative to previous one, not to first character - placed most to left - or to left side of screen?

2012-08-19 Thread crispy
I have an example:

def pairwiseScore(seqA, seqB):

    prev = -1
    score = 0
    length = len(seqA)
    similarity = []
    relative_similarity = []

    for x in xrange(length):

        if seqA[x] == seqB[x]:
            if (x >= 1) and (seqA[x - 1] == seqB[x - 1]):
                score += 3
                similarity.append(x)
            else:
                score += 1
                similarity.append(x)
        else:
            score -= 1

    for x in similarity:

        relative_similarity.append(x - prev)
        prev = x

    return ''.join((seqA, '\n', ''.join(['|'.rjust(x) for x in
        relative_similarity]), '\n', seqB, '\n', 'Score: ', str(score)))


print pairwiseScore("ATTCGT", "ATCTAT"), '\n', '\n', \
    pairwiseScore("GATAAATCTGGTCT", "CATTCATCATGCAA"), '\n', '\n', \
    pairwiseScore('AGCG', 'ATCG'), '\n', '\n', pairwiseScore('ATCG', 'ATCG')

which returns:

ATTCGT
||   |
ATCTAT
Score: 2 

GATAAATCTGGTCT
 ||  |||  |
CATTCATCATGCAA
Score: 4 

AGCG
| ||
ATCG
Score: 4 

ATCG

ATCG
Score: 10


But I created this with some help from one person. Earlier, this code was 
devoid of these few lines:

    prev = -1
    relative_similarity = []


    for x in similarity:

        relative_similarity.append(x - prev)
        prev = x

The method looked like this:

def pairwiseScore(seqA, seqB):

    score = 0
    length = len(seqA)
    similarity = []

    for x in xrange(length):

        if seqA[x] == seqB[x]:
            if (x >= 1) and (seqA[x - 1] == seqB[x - 1]):
                score += 3
                similarity.append(x)
            else:
                score += 1
                similarity.append(x)
        else:
            score -= 1

    return ''.join((seqA, '\n', ''.join(['|'.rjust(x) for x in
        similarity]), '\n', seqB, '\n', 'Score: ', str(score)))

and produced this output:

ATTCGT
|||
ATCTAT
Score: 2 

GATAAATCTGGTCT
| || |  | |
CATTCATCATGCAA
Score: 4 

AGCG
| |  |
ATCG
Score: 4 

ATCG
|| |  |
ATCG
Score: 10

So I have guessed that characters processed by the .rjust() function are
placed in the output relative to the previous ones - NOT to the first,
leftmost character.
Why does it work like that? What built-in function can format output, to make
every character be placed as I need - relative to the first character, at the
left side of the screen?

Cheers
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How does .rjust() work and why it places characters relative to previous one, not to first character - placed most to left - or to left side of screen?

2012-08-19 Thread crispy
Here's first code - http://codepad.org/RcKTTiYa

And here's second - http://codepad.org/zwEQKKeV
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: New internal string format in 3.3

2012-08-19 Thread Oscar Benjamin
On Aug 19, 2012 5:22 PM, wxjmfa...@gmail.com wrote

 Py 3.2.3
 >>> timeit.timeit("('aœ€'*100).replace('a', 'œ€é')")
 4.99396356635981

 Py 3.3b2
 >>> timeit.timeit("('aœ€'*100).replace('a', 'œ€é')")
 7.560455708007855

 Maybe not so demonstrative. It shows at least that we
 are far away from the 10-30% announced.

 >>> 7.56 / 5
 1.512
 >>> 5 / (7.56 - 5) * 100
 195.312503

Maybe the problem is that your understanding of a percentage differs from
that of others.

I make that a 51% increase. I don't really understand what your 195 figure
is demonstrating.
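
For the record, the conventional calculation from the rounded figures above:

    py32, py33 = 5.0, 7.56
    print((py33 / py32 - 1) * 100)   # ~51.2: 3.3 is about 51% slower, a 1.51x ratio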

Oscar.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Blind Anagram
Steven D'Aprano  wrote in message 
news:502f8a2a$0$29978$c3e8da3$54964...@news.astraweb.com...


On Sat, 18 Aug 2012 01:09:26 -0700, wxjmfauth wrote:

[...]
If you can consistently replicate a 100% to 1000% slowdown in string
handling, please report it as a performance bug:

http://bugs.python.org/

Don't forget to report your operating system.


For interest, I ran your code snippets on my laptop (Intel core-i7 1.8GHz) 
running Windows 7 x64.


Running Python from a Windows command prompt,  I got the following on Python 
3.2.3 and 3.3 beta 2:


python33\python -m timeit "('abc' * 1000).replace('c', 'de')"
1 loops, best of 3: 39.3 usec per loop
python33\python -m timeit "('ab…' * 1000).replace('…', '……')"
1 loops, best of 3: 51.8 usec per loop
python33\python -m timeit "('ab…' * 1000).replace('…', 'x…')"
1 loops, best of 3: 52 usec per loop
python33\python -m timeit "('ab…' * 1000).replace('…', 'œ…')"
1 loops, best of 3: 50.3 usec per loop
python33\python -m timeit "('ab…' * 1000).replace('…', '€…')"
1 loops, best of 3: 51.6 usec per loop
python33\python -m timeit "('XYZ' * 1000).replace('X', 'éç')"
1 loops, best of 3: 38.3 usec per loop
python33\python -m timeit "('XYZ' * 1000).replace('Y', 'p?')"
1 loops, best of 3: 50.3 usec per loop

python32\python -m timeit "('abc' * 1000).replace('c', 'de')"
1 loops, best of 3: 24.5 usec per loop
python32\python -m timeit "('ab…' * 1000).replace('…', '……')"
1 loops, best of 3: 24.7 usec per loop
python32\python -m timeit "('ab…' * 1000).replace('…', 'x…')"
1 loops, best of 3: 24.8 usec per loop
python32\python -m timeit "('ab…' * 1000).replace('…', 'œ…')"
1 loops, best of 3: 24 usec per loop
python32\python -m timeit "('ab…' * 1000).replace('…', '€…')"
1 loops, best of 3: 24.1 usec per loop
python32\python -m timeit "('XYZ' * 1000).replace('X', 'éç')"
1 loops, best of 3: 24.4 usec per loop
python32\python -m timeit "('XYZ' * 1000).replace('Y', 'p?')"
1 loops, best of 3: 24.3 usec per loop

This is an average slowdown by a factor of close to 2.3 on 3.3 when compared 
with 3.2.


I am not posting this to perpetuate this thread but simply to ask whether, 
as you suggest, I should report this as a possible problem with the beta?


--
http://mail.python.org/mailman/listinfo/python-list


Re: How does .rjust() work and why it places characters relative to previous one, not to first character - placed most to left - or to left side of screen?

2012-08-19 Thread Dave Angel
On 08/19/2012 12:25 PM, crispy wrote:
 SNIP
 So I have guessed, that characters processed by .rjust() function, are placed 
 in output, relative to previous ones - NOT to first, most to left placed, 
 character.

rjust() does not print to the console, it just produces a string.  So if
you want to know how it works, you need to either read about it, or
experiment with it.

Try   help("".rjust) to see a simple description of it.  (If you're
not familiar with the interactive interpreter's help() function, you owe
it to yourself to learn it).

Playing with it:

print "abcd".rjust(8, "-")   produces    ----abcd

for i in range(5): print "a".rjust(i, "-")
produces:

a
a
-a
--a
---a

In each case, the number of characters produced is i, or the length of
the original string if that is larger.  No consideration is made of other
strings outside of the literal passed into the method.


 Why it works like that? 

In your code, you have the rjust() method inside a loop, inside a join,
inside a print.  It makes a nice, impressive single line, but clearly
you don't completely understand what the pieces are, nor how they work
together.  Since the join is combining (concatenating) strings that are
each being produced by rjust(), it's the join() that's making this look
relative to you.


 What builtn-in function can format output, to make every character be placed 
 as i need - relative to the first character, placed most to left side of 
 screen.

If you want to randomly place characters on the screen, you either want
a curses-like package, or a gui.  I suspect that's not at all what you want.

If you want to change arbitrary characters in a pre-existing string,
which will then be printed to the console, then I could suggest an
approach (untested):

res = [" "] * length
for column in similarity:
    res[column] = "|"
res = "".join(res)
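
For instance, with the first example's data, that sketch should give
something like:

    seqA, seqB = "ATTCGT", "ATCTAT"
    similarity = [0, 1, 5]            # the matching columns
    res = [" "] * len(seqA)
    for column in similarity:
        res[column] = "|"
    print("".join(res))               # -> '||   |', each bar under its own column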



-- 

DaveA

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Terry Reedy

On 8/19/2012 4:04 AM, Paul Rubin wrote:



Meanwhile, an example of the 393 approach failing:


I am completely baffled by this, as this example is one where the 393 
approach potentially wins.



I was involved in a
project that dealt with terabytes of OCR data of mostly English text.
So the chars were mostly ascii,


3.3 stores ascii pages 1 byte/char rather than 2 or 4.

 but there would be occasional non-ascii

chars including supplementary plane characters, either because of
special symbols that were really in the text, or the typical OCR
confusion emitting those symbols due to printing imprecision.


I doubt that there are really any non-bmp chars. As Steven said, reject 
such false identifications.


 That's a  natural for UTF-8

3.3 would convert to utf-8 for storage on disk.


but the PEP-393 approach would bloat up the memory
requirements by a factor of 4.


3.2- wide builds would *always* use 4 bytes/char. Is not "occasionally" 
better than "always"?



 py> s = chr(0xffff + 1)
 py> a, b = s

That looks like Python 3.2 is buggy and that sample should just throw an
error.  s is a one-character string and should not be unpackable.


That looks like a 3.2- narrow build. Such builds treat unicode strings as 
sequences of code units rather than sequences of codepoints. Not an 
implementation bug, but compromise design that goes back about a decade 
to when unicode was added to Python. At that time, there were only a few 
defined non-BMP chars and their usage was extremely rare. There are now 
more extended chars than BMP chars and usage will become more common 
even in English text.


Pre 3.3, there are really 2 sub-versions of every Python version: a 
narrow build and a wide build version, with not very well documented 
different behaviors for any string with extended chars. That is and 
would have become an increasing problem as extended chars are 
increasingly used. If you want to say that what was once a practical 
compromise has become a design bug, I would not argue. In any case, 3.3 
fixes that split and returns Python to being one cross-platform language.



I realize the folks who designed and implemented PEP 393 are very smart
cookies and considered stuff carefully, while I'm just an internet user
posting an immediate impression of something I hadn't seen before (I
still use Python 2.6), but I still have to ask: if the 393 approach
makes sense, why don't other languages do it?


Python has often copied or borrowed, with adjustments. This time it is 
the first. We will see how it goes, but it has been tested for nearly a 
year already.



Ropes of UTF-8 segments seems like the most obvious approach and I
wonder if it was considered.  By that I mean pick some implementation
constant k (say k=128) and represent the string as a UTF-8 encoded byte
array, accompanied by a vector n//k pointers into the byte array, where
n is the number of codepoints in the string.  Then you can reach any
offset analogously to reading a random byte on a disk, by seeking to the
appropriate block, and then reading the block and getting the char you
want within it.  Random access is then O(1) though the constant is
higher than it would be with fixed width encoding.


I would call it O(k), where k is a selectable constant. Slowing access 
by a factor of 100 is hardly acceptable to me. For strings less than k, 
access is O(len). I believe slicing would require re-indexing.


As 393 was near adoption, I proposed a scheme using utf-16 (narrow 
builds) with a supplementary index of extended chars when there are any. 
That makes access O(1) if there are none and O(log(k)), where k is the 
number of extended chars in the string, if there are some.
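
A rough sketch of that idea (my own illustration, not code from CPython
or the PEP):

    import bisect

    class U16String:
        """UTF-16 code units plus a sorted index of the character
        positions of astral (non-BMP) chars: indexing costs O(log k)."""

        def __init__(self, s):
            self.units = []       # UTF-16 code units
            self.astral = []      # character indexes of non-BMP chars
            for i, ch in enumerate(s):
                cp = ord(ch)
                if cp > 0xFFFF:   # encode as a surrogate pair
                    self.astral.append(i)
                    cp -= 0x10000
                    self.units += [0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)]
                else:
                    self.units.append(cp)

        def __getitem__(self, i):
            k = bisect.bisect_left(self.astral, i)   # astral chars before i
            u = i + k                                # each adds one extra unit
            cp = self.units[u]
            if 0xD800 <= cp < 0xDC00:                # lead surrogate: recombine
                cp = 0x10000 + ((cp - 0xD800) << 10) + (self.units[u + 1] - 0xDC00)
            return chr(cp)

    s = U16String('a\U00010000b')
    assert s[0] == 'a' and s[2] == 'b' and ord(s[1]) == 0x10000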


--
Terry Jan Reedy

--
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread wxjmfauth
Le dimanche 19 août 2012 19:03:34 UTC+2, Blind Anagram a écrit :
 Steven D'Aprano  wrote in message 
 news:502f8a2a$0$29978$c3e8da3$54964...@news.astraweb.com...
 [snip timings]

 This is an average slowdown by a factor of close to 2.3 on 3.3 when compared 
 with 3.2.

 I am not posting this to perpetuate this thread but simply to ask whether, 
 as you suggest, I should report this as a possible problem with the beta?

I use win7 pro 32 bits on intel.

Thanks for reporting these numbers.
To be clear: I'm not complaining, but the fact that
there is a slowdown is a clear indication (in my mind)
that there is a point somewhere.

jmf
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: New internal string format in 3.3

2012-08-19 Thread Terry Reedy

On 8/19/2012 10:09 AM, wxjmfa...@gmail.com wrote:


I can not give you more numbers than those I gave.
As a end user, I noticed and experimented my random tests
are always slower in Py3.3 than in Py3.2 on my Windows platform.


And I gave other examples where 3.3 is *faster* on my Windows, which you 
have thus far not even acknowledged, let alone tried.



It is up to you, the core developers to give an explanation
about this behaviour.


System variation, unimportance of sub-microsecond variations, and 
attention to more important issues.


Other developers say 3.3 is generally faster on their systems 
(OSX 10.8, and unspecified). To talk about speed sensibly, one 
must run the full stringbench.py benchmark and real applications on 
multiple Windows, *nix, and Mac systems. Python is not optimized for 
your particular current computer.


--
Terry Jan Reedy

--
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Paul Rubin
Terry Reedy tjre...@udel.edu writes:
 Meanwhile, an example of the 393 approach failing:
 I am completely baffled by this, as this example is one where the 393
 approach potentially wins.

What?  The 393 approach is supposed to avoid memory bloat and that
does the opposite.

 I was involved in a project that dealt with terabytes of OCR data of
 mostly English text.  So the chars were mostly ascii,
 3.3 stores ascii pages 1 byte/char rather than 2 or 4.

But they are not ascii pages, they are (as stated) MOSTLY ascii.
E.g. the characters are 99% ascii but 1% non-ascii, so 393 chooses
a much more memory-expensive encoding than UTF-8.
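
The effect is easy to check (approximate sizes; one rare astral char
drags the whole string to 4 bytes/char under the PEP):

    import sys

    s = ('a' * 99 + '\U00010348') * 1000   # mostly ascii, occasional astral char
    print(sys.getsizeof(s))                # PEP 393: 4 bytes/char, ~400 KB
    print(len(s.encode('utf-8')))          # UTF-8: ~103 KB for the same text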

 I doubt that there are really any non-bmp chars.

You may be right about this.  I thought about it some more after
posting and I'm not certain that there were supplemental characters.

 As Steven said, reject such false identifications.

Reject them how?

 That's a  natural for UTF-8
 3.3 would convert to utf-8 for storage on disk.

They are already in utf-8 on disk though that doesn't matter since
they are also compressed.  

 but the PEP-393 approach would bloat up the memory
 requirements by a factor of 4.
 3.2- wide builds would *always* use 4 bytes/char. Is not occasionally
 better than always?

The bloat is in comparison with utf-8, in that example.

 That looks like a 3.2- narrow build. Such which treat unicode strings
 as sequences of code units rather than sequences of codepoints. Not an
 implementation bug, but compromise design that goes back about a
 decade to when unicode was added to Python. 

I thought the whole point of Python 3's disruptive incompatibility with
Python 2 was to clean up past mistakes and compromises, of which unicode
headaches was near the top of the list.  So I'm surprised they seem to
have repeated a mistake there.

 I would call it O(k), where k is a selectable constant. Slowing access
 by a factor of 100 is hardly acceptable to me. 

If k is constant then O(k) is the same as O(1).  That is how O notation
works.  I wouldn't believe the 100x figure without seeing it measured in
real-world applications.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Ian Kelly
On Sun, Aug 19, 2012 at 12:33 AM, Steven D'Aprano
steve+comp.lang.pyt...@pearwood.info wrote:
 On Sat, 18 Aug 2012 09:51:37 -0600, Ian Kelly wrote about PEP 393:
 There is some additional benefit for Latin-1 users, but this has nothing
 to do with Python.  If Python is going to have the option of a 1-byte
 representation (and as long as we have the flexible representation, I
 can see no reason not to),

 The PEP explicitly states that it only uses a 1-byte format for ASCII
 strings, not Latin-1:

I think you misunderstand the PEP then, because that is empirically false.

Python 3.3.0b2 (v3.3.0b2:4972a8f1b2aa, Aug 12 2012, 15:23:35) [MSC
v.1600 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getsizeof(bytes(range(256)).decode('latin1'))
329

The constructed string contains all 256 Latin-1 characters, so if
Latin-1 strings must be stored in the 2-byte format, then the size
should be at least 512 bytes.  It is not, so I think it must be using
the 1-byte encoding.


 ASCII-only Unicode strings will again use only one byte per character

This says nothing one way or the other about non-ASCII Latin-1 strings.

 If the maximum character is less than 128, they use the PyASCIIObject
 structure

Note that this only describes the structure of compact string
objects, which I have to admit I do not fully understand from the PEP.
 The wording suggests that it only uses the PyASCIIObject structure,
not the derived structures.  It then says that for compact ASCII
strings the UTF-8 data, the UTF-8 length and the wstr length are the
same as the length of the ASCII data.  But these fields are part of
the PyCompactUnicodeObject structure, not the base PyASCIIObject
structure, so they would not exist if only PyASCIIObject were used.
It would also imply that compact non-ASCII strings are stored
internally as UTF-8, which would be surprising.

 and:

 The data and utf8 pointers point to the same memory if the string uses
 only ASCII characters (using only Latin-1 is not sufficient).

This says that if the data are ASCII, then the 1-byte representation
and the utf8 pointer will share the same memory.  It does not imply
that the 1-byte representation is not used for Latin-1, only that it
cannot also share memory with the utf8 pointer.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: New internal string format in 3.3

2012-08-19 Thread wxjmfauth
Just for the story.

Five minutes after I closed my interactive interpreter windows,
the day I tested this stuff, I thought:
Too bad I did not note the extremely bad cases I found; I'm pretty
sure this problem will arrive on the table.

jmf

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: New internal string format in 3.3

2012-08-19 Thread Terry Reedy

On 8/19/2012 8:59 AM, wxjmfa...@gmail.com wrote:


In August 2012, after 20 years of development, Python is not able to
display a piece of text correctly on a Windows console (eg cp65001).


cp65001 is known to not work right. It has been very frustrating. Bug 
Microsoft about it, and indeed their whole policy of still dividing the 
world into code page regions, even in their next version, instead of 
moving toward unicode and utf-8, at least as an option.



I downloaded the go language, zero experience, I did not succeed to
display incorrecly a piece of text. (This is by the way *the* reason
why I tested it). Where the problems are coming from, I have no
idea.


If go can display all unicode chars on a Windows console, perhaps you 
can do some research and find out how they do so. Then we could consider 
copying it.


--
Terry Jan Reedy

--
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Blind Anagram
wrote in message 
news:5dfd1779-9442-4858-9161-8f1a06d56...@googlegroups.com...


Le dimanche 19 août 2012 19:03:34 UTC+2, Blind Anagram a écrit :

Steven D'Aprano  wrote in message
news:502f8a2a$0$29978$c3e8da3$54964...@news.astraweb.com...

[snip timings]

This is an average slowdown by a factor of close to 2.3 on 3.3 when 
compared with 3.2.

I am not posting this to perpetuate this thread but simply to ask whether, 
as you suggest, I should report this as a possible problem with the beta?


I use win7 pro 32 bits on intel.

Thanks for reporting these numbers.
To be clear: I'm not complaining, but the fact that
there is a slowdown is a clear indication (in my mind)
that there is a point somewhere.


I may be reading your input wrongly, but it seems to me that you are not 
only reporting a slowdown but you are also suggesting that this slowdown is 
the result of bad design decisions by the Python development team.


I don't want to get involved in the latter part of your argument because I 
am convinced that the Python team are doing their very best to find a good 
compromise between the various design constraints that they face in meeting 
these needs.


Nevertheless, the post that I responded to contained the suggestion that 
slowdowns above 100% (which I took as a factor of 2) would be worth 
reporting as a possible bug.  So I thought that it was worth asking about 
this as I may have misunderstood the level of slowdown that is worth 
reporting.  There is also a potential problem in timings on laptops with 
turbo-boost (as I have), although the times look fairly consistent.


--
http://mail.python.org/mailman/listinfo/python-list


Re: New image and color management library for Python 2+3

2012-08-19 Thread Jan Riechers

On 14.08.2012 21:22, Christian Heimes wrote:

Hello fellow Pythonistas,


Performance
===

smc.freeimage with libjpeg-turbo read JPEGs about three to six times
faster than PIL and writes JPEGs more than five times faster.


[]


Python 2.7.3
read / write cycles: 300
test image: 1210x1778 24bpp JPEG (pon.jpg)
platform: Ubuntu 12.04 X86_64
hardware: Intel Xeon hexacore W3680@3.33GHz with 24 GB RAM

smc.freeimage, FreeImage 3.15.3 standard
  - read JPEG 12.857 sec
  - read JPEG 6.629 sec (resaved)
  - write JPEG 21.817 sec
smc.freeimage, FreeImage 3.15.3 with jpeg turbo
  - read JPEG 9.297 sec
  - read JPEG 3.909 sec (resaved)
  - write JPEG 5.857 sec
  - read LZW TIFF 17.947 sec
  - read biton G4 TIFF 2.068 sec
  - resize 3.850 sec (box)
  - resize 5.022 sec (bilinear)
  - resize 7.942 sec (bspline)
  - resize 7.222 sec (bicubic)
  - resize 7.941 sec (catmull rom spline)
  - resize 10.232 sec (lanczos3)
  - tiff numpy.asarray() with bytescale() 0.006 sec
  - tiff load + numpy.asarray() with bytescale() 18.043 sec
PIL 1.1.7
  - read JPEG 30.389 sec
  - read JPEG 23.118 sec (resaved)
  - write JPEG 34.405 sec
  - read LZW TIFF 21.596 sec
  - read biton G4 TIFF: decoder group4 not available
  - resize 0.032 sec (nearest)
  - resize 1.074 sec (bilinear)
  - resize 2.924 sec (bicubic)
  - resize 8.056 sec (antialias)
  - tiff scipy fromimage() with bytescale() 1.165 sec
  - tiff scipy imread() with bytescale() 22.939 sec



Christian



Hello Christian,

I'm sorry for straying from your initial question/request, but did you 
try out ImageMagick before making use of FreeImage - and could you perhaps 
provide a comparison between your project and ImageMagick (when 
regular Python is used)?


I ask because:
I'm in the process of creating a web-app which also requires image 
processing and am just switching from PIL (because it is unfortunately not 
as quick as it should be) to ImageMagick; the speeds are much 
better in comparison, but I didn't take measurements of that.


Can you perhaps test your solution against ImageMagick (as it is widely 
used)? It would be interesting. :)


But no offence meant, and respect for your work!

Jan
--
http://mail.python.org/mailman/listinfo/python-list


Re: Branch and Bound Algorithm / Module for Python?

2012-08-19 Thread Terry Reedy

On 8/19/2012 5:04 AM, Rebekka-Marie wrote:

Hello everybody,

I would like to solve a Mixed Integer Optimization Problem with the
Branch-And-Bound Algorithm.

I designed my Minimizing function and the constraints. I tested them
in a small program in AIMMS. So I already know that they are
solvable.

Now I want to solve them using Python.

Is there a module / methods that I can download or a ready-made
program text that you know about, where I can put my constraints and
minimization function in?


Search 'Python constraint solver' and you should find at least two programs.

--
Terry Jan Reedy

--
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Dave Angel
On 08/19/2012 01:03 PM, Blind Anagram wrote:
 [snip timings]

 This is an average slowdown by a factor of close to 2.3 on 3.3 when
 compared with 3.2.


Using your measurement numbers, I get an average of 1.95, not 2.3
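For instance, pairing each 3.3 timing with its 3.2 counterpart from 
your figures (a quick back-of-the-envelope check):

py33 = [39.3, 51.8, 52.0, 50.3, 51.6, 38.3, 50.3]
py32 = [24.5, 24.7, 24.8, 24.0, 24.1, 24.4, 24.3]
ratios = [new / old for new, old in zip(py33, py32)]
print(sum(ratios) / len(ratios))   # about 1.95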



-- 

DaveA

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: New internal string format in 3.3

2012-08-19 Thread Mark Lawrence

On 19/08/2012 18:51, wxjmfa...@gmail.com wrote:

Just for the story.

Five minutes after I closed my interactive interpreter windows,
the day I tested this stuff, I thought: too bad I did not note
the extremely bad cases I found; I'm pretty sure this problem
will arrive on the table.

jmf



How convenient.

--
Cheers.

Mark Lawrence.

--
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable usingord()

2012-08-19 Thread Blind Anagram
Dave Angel  wrote in message 
news:mailman.3519.1345399574.4697.python-l...@python.org...


[...]


This is an average slowdown by a factor of close to 2.3 on 3.3 when
compared with 3.2.



Using your measurement numbers, I get an average of 1.95, not 2.3


Yes - you are right - my apologies.

But it is close enough to 2 to still be worth asking.

--
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread wxjmfauth
On Sunday, 19 August 2012 19:48:06 UTC+2, Paul Rubin wrote:
 But they are not ascii pages, they are (as stated) MOSTLY ascii.
 E.g. the characters are 99% ascii but 1% non-ascii, so 393 chooses
 a much more memory-expensive encoding than UTF-8.


Imagine a US banking application, everything in ASCII,
except ... the € currency symbol, code point 0x20ac.

Well, it seems some software producers know what they
are doing.

>>> '€'.encode('cp1252')
b'\x80'
>>> '€'.encode('mac-roman')
b'\xdb'
>>> '€'.encode('iso-8859-1')
Traceback (most recent call last):
  File "<eta last command>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac' 
in position 0: ordinal not in range(256)

jmf
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Paul Rubin
Ian Kelly ian.g.ke...@gmail.com writes:
 >>> sys.getsizeof(bytes(range(256)).decode('latin1'))
 329

Please try:

   print (type(bytes(range(256)).decode('latin1')))

to make sure that what comes back is actually a unicode string rather
than a byte string.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Ian Kelly
On Sun, Aug 19, 2012 at 12:20 PM, Paul Rubin no.email@nospam.invalid wrote:
 Ian Kelly ian.g.ke...@gmail.com writes:
 >>> sys.getsizeof(bytes(range(256)).decode('latin1'))
 329

 Please try:

print (type(bytes(range(256)).decode('latin1')))

 to make sure that what comes back is actually a unicode string rather
 than a byte string.

As I understand it, the decode method never returns a byte string in
Python 3, but if you insist:

>>> print (type(bytes(range(256)).decode('latin1')))
<class 'str'>
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Ian Kelly
On Sun, Aug 19, 2012 at 11:50 AM, Ian Kelly ian.g.ke...@gmail.com wrote:
 Note that this only describes the structure of compact string
 objects, which I have to admit I do not fully understand from the PEP.
  The wording suggests that it only uses the PyASCIIObject structure,
 not the derived structures.  It then says that for compact ASCII
 strings the UTF-8 data, the UTF-8 length and the wstr length are the
 same as the length of the ASCII data.  But these fields are part of
 the PyCompactUnicodeObject structure, not the base PyASCIIObject
 structure, so they would not exist if only PyASCIIObject were used.
 It would also imply that compact non-ASCII strings are stored
 internally as UTF-8, which would be surprising.

Oh, now I get it.  I had missed the part where it says character data
immediately follow the base structure.  And the bit about the UTF-8
data, the UTF-8 length and the wstr length is not describing the
contents of those fields, but rather where the data can alternatively
be found, since the fields don't exist.
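A quick way to see the compact layouts at work (a throwaway sketch on
a 64-bit 3.3 build; the exact byte counts vary by platform):

import sys
print(sys.getsizeof('abc'))             # compact ASCII: 1 byte/char
print(sys.getsizeof('abc\xe9'))         # latin-1 range: still 1 byte/char
print(sys.getsizeof('abc\u20ac'))       # BMP, non-latin-1: 2 bytes/char
print(sys.getsizeof('abc\U00010000'))   # astral: 4 bytes/char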
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Mark Lawrence

On 19/08/2012 19:11, wxjmfa...@gmail.com wrote:

On Sunday, 19 August 2012 19:48:06 UTC+2, Paul Rubin wrote:

But they are not ascii pages, they are (as stated) MOSTLY ascii.
E.g. the characters are 99% ascii but 1% non-ascii, so 393 chooses
a much more memory-expensive encoding than UTF-8.




Imagine a US banking application, everything in ASCII,
except ... the € currency symbol, code point 0x20ac.

Well, it seems some software producers know what they
are doing.


>>> '€'.encode('cp1252')
b'\x80'
>>> '€'.encode('mac-roman')
b'\xdb'
>>> '€'.encode('iso-8859-1')
Traceback (most recent call last):
  File "<eta last command>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac'
in position 0: ordinal not in range(256)

jmf



Well that's it then, the world stock markets will all collapse tonight 
when the news leaks out that those stupid Americans haven't yet realised 
that much of Europe (with at least one very noticeable and sensible 
exception :) uses Euros.  I'd better sell all my stock holdings fast.


--
Cheers.

Mark Lawrence.

--
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Paul Rubin
Ian Kelly ian.g.ke...@gmail.com writes:
 >>> print (type(bytes(range(256)).decode('latin1')))
 <class 'str'>

Thanks.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How does .rjust() work and why it places characters relative to previous one, not to first character - placed most to left - or to left side of screen?

2012-08-19 Thread crispy
On Sunday, 19 August 2012 19:31:30 UTC+2, Dave Angel wrote:
 On 08/19/2012 12:25 PM, crispy wrote:
  SNIP
  So I have guessed, that characters processed by .rjust() function, are 
  placed in output, relative to previous ones - NOT to first, most to left 
  placed, character.
 
 rjust() does not print to the console, it just produces a string.  So if
 you want to know how it works, you need to either read about it, or
 experiment with it.
 
 Try   help("".rjust)   to see a simple description of it.  (If you're
 not familiar with the interactive interpreter's help() function, you owe
 it to yourself to learn it).
 
 Playing with it:
 
 print "abcd".rjust(8, "-")   produces   ----abcd
 
 for i in range(5): print "a".rjust(i, "-")
 produces:
 
 a
 a
 -a
 --a
 ---a
 
 In each case, the number of characters produced is no larger than i.  No
 consideration is made to other strings outside of the literal passed
 into the method.
 
  Why it works like that? 
 
 In your code, you have the rjust() method inside a loop, inside a join,
 inside a print.  It makes a nice, impressive single line, but clearly
 you don't completely understand what the pieces are, nor how they work
 together.  Since the join is combining (concatenating) strings that are
 each being produced by rjust(), it's the join() that's making this look
 relative to you.
 
  What built-in function can format output, to make every character be 
  placed as I need - relative to the first character, placed most to left 
  side of screen.
 
 If you want to randomly place characters on the screen, you either want
 a curses-like package, or a gui.  I suspect that's not at all what you want.
 
 If you want to randomly change characters in a pre-existing string,
 which will then be printed to the console, then I could suggest an
 approach (untested):
 
 res = [" "] * length
 for column in similarity:
     res[column] = "|"
 res = "".join(res)
 
 -- 
 DaveA

Thanks, I've finally come to a solution.

Here it is - http://codepad.org/Q70eGkO8

def pairwiseScore(seqA, seqB):
    score = 0
    # a list with one space per character of seqA (seqB is meant to
    # have the same length)
    bars = [' ' for x in seqA]
    length = len(seqA)
    similarity = []

    for x in xrange(length):
        # check whether the character at index 'x' is the same in both
        # seqA and seqB
        if seqA[x] == seqB[x]:
            # if 'x' is greater than or equal to 1 and the characters
            # at the previous index also matched, score higher
            if (x >= 1) and (seqA[x - 1] == seqB[x - 1]):
                score += 3
                similarity.append(x)
            else:
                score += 1
                similarity.append(x)
        else:
            score -= 1

    for x in similarity:
        # for every matching index 'x', replace the space in 'bars'
        # with a '|' (pipe/vertical bar) character
        bars[x] = '|'

    return ''.join((seqA, '\n', ''.join(bars), '\n', seqB, '\n',
                    'Score: ', str(score)))

print pairwiseScore('ATTCGT', 'ATCTAT'), '\n', '\n', \
    pairwiseScore('GATAAATCTGGTCT', 'CATTCATCATGCAA'), '\n', '\n', \
    pairwiseScore('AGCG', 'ATCG'), '\n', '\n', pairwiseScore('ATCG', 'ATCG')
-- 
http://mail.python.org/mailman/listinfo/python-list


Abuse of Big Oh notation [was Re: How do I display unicode value stored in a string variable using ord()]

2012-08-19 Thread Steven D'Aprano
On Sun, 19 Aug 2012 10:48:06 -0700, Paul Rubin wrote:

 Terry Reedy tjre...@udel.edu writes:

 I would call it O(k), where k is a selectable constant. Slowing access
 by a factor of 100 is hardly acceptable to me.
 
 If k is constant then O(k) is the same as O(1).  That is how O notation
 works.

You might as well say that if N is constant, O(N**2) is constant too and 
just like magic you have now made Bubble Sort a constant-time sort 
function!

That's not how it works.

Of course *if* k is constant, O(k) is constant too, but k is not 
constant. In context we are talking about string indexing and slicing. 
There is no value of k, say, k = 2, for which you can say "People will 
sometimes ask for string[2] but never ask for string[3]". That is absurd.

Since k can vary from 0 to N-1, we can say that the average string index 
lookup is k = (N-1)//2 which clearly depends on N.
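A quick throwaway check that this average really does grow with N 
(assuming each index 0..N-1 is equally likely):

for N in (10, 100, 1000):
    print(N, sum(range(N)) / float(N))   # mean index ~ (N-1)/2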


-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Steven D'Aprano
On Sun, 19 Aug 2012 11:50:12 -0600, Ian Kelly wrote:

 On Sun, Aug 19, 2012 at 12:33 AM, Steven D'Aprano
 steve+comp.lang.pyt...@pearwood.info wrote:
[...]
 The PEP explicitly states that it only uses a 1-byte format for ASCII
 strings, not Latin-1:
 
 I think you misunderstand the PEP then, because that is empirically
 false.

Yes I did misunderstand. Thank you for the clarification.



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Steven D'Aprano
On Sun, 19 Aug 2012 18:03:34 +0100, Blind Anagram wrote:

 Steven D'Aprano  wrote in message
 news:502f8a2a$0$29978$c3e8da3$54964...@news.astraweb.com...
 
  If you can consistently replicate a 100% to 1000% slowdown in string
  handling, please report it as a performance bug:
  
  http://bugs.python.org/
  
  Don't forget to report your operating system.

[...]

 This is an average slowdown by a factor of close to 2.3 on 3.3 when
 compared with 3.2.
 
 I am not posting this to perpetuate this thread but simply to ask
 whether, as you suggest, I should report this as a possible problem with
 the beta?

Possibly, if it is consistent and non-trivial. Serious performance 
regressions are bugs. Trivial ones, not so much.

Thanks to Terry Reedy, who has already asked the Python Devs about this 
issue, they have made it clear that they aren't hugely interested in 
micro-benchmarks in isolation. If you want the bug report to be taken 
seriously, you would need to run the full Python string benchmark. The 
results of that would be interesting to see.


-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Why doesn't Python remember the initial directory?

2012-08-19 Thread kj


As far as I've been able to determine, Python does not remember
(immutably, that is) the working directory at the program's start-up,
or, if it does, it does not officially expose this information.

Does anyone know why this is?  Is there a PEP stating the rationale
for it?

Thanks!
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Terry Reedy

On 8/19/2012 1:03 PM, Blind Anagram wrote:


Running Python from a Windows command prompt,  I got the following on
Python 3.2.3 and 3.3 beta 2:

python33\python -m timeit ('abc' * 1000).replace('c', 'de')
1 loops, best of 3: 39.3 usec per loop
python33\python -m timeit ('ab…' * 1000).replace('…', '……')
1 loops, best of 3: 51.8 usec per loop
python33\python -m timeit ('ab…' * 1000).replace('…', 'x…')
1 loops, best of 3: 52 usec per loop
python33\python -m timeit ('ab…' * 1000).replace('…', 'œ…')
1 loops, best of 3: 50.3 usec per loop
python33\python -m timeit ('ab…' * 1000).replace('…', '€…')
1 loops, best of 3: 51.6 usec per loop
python33\python -m timeit ('XYZ' * 1000).replace('X', 'éç')
1 loops, best of 3: 38.3 usec per loop
python33\python -m timeit ('XYZ' * 1000).replace('Y', 'p?')
1 loops, best of 3: 50.3 usec per loop

python32\python -m timeit ('abc' * 1000).replace('c', 'de')
1 loops, best of 3: 24.5 usec per loop
python32\python -m timeit ('ab…' * 1000).replace('…', '……')
1 loops, best of 3: 24.7 usec per loop
python32\python -m timeit ('ab…' * 1000).replace('…', 'x…')
1 loops, best of 3: 24.8 usec per loop
python32\python -m timeit ('ab…' * 1000).replace('…', 'œ…')
1 loops, best of 3: 24 usec per loop
python32\python -m timeit ('ab…' * 1000).replace('…', '€…')
1 loops, best of 3: 24.1 usec per loop
python32\python -m timeit ('XYZ' * 1000).replace('X', 'éç')
1 loops, best of 3: 24.4 usec per loop
python32\python -m timeit ('XYZ' * 1000).replace('Y', 'p?')
1 loops, best of 3: 24.3 usec per loop


This is one test repeated 7 times with essentially irrelevant 
variations. The difference is less on my system (50%). Others report 
seeing 3.3 as faster. When I asked on pydev, the answer was don't bother 
making a tracker issue unless I was personally interested in 
investigating why search is relatively slow in 3.3 on Windows. Any 
change would have to not slow other operations or severely impact search 
on other systems. I suggest the same answer to you.


If you seriously want to compare old and new unicode, go to
http://hg.python.org/cpython/file/tip/Tools/stringbench/stringbench.py
and click raw to download. Run on 3.2 and 3.3, ignoring the bytes times.

Here is a version of the first comparison from stringbench (with the 
import it needs):

from timeit import timeit
print(timeit("('NOW IS THE TIME FOR ALL GOOD PEOPLE TO COME TO THE AID OF PYTHON' * 10).lower()"))

Results are 5.6 for 3.2 and .8 for 3.3. WOW! 3.3 is 7 times faster!

OK, not fair. I cherry picked. The 7 times speedup in 3.3 likely is at 
least partly independent of the 393 unicode change. The same test in 
stringbench for bytes is twice as fast in 3.3 as 3.2, but only 2x, not 
7x. In fact, it may have been the bytes/unicode comparison in 3.2 that 
suggested that unicode case conversion of ascii chrs might be made faster.


The sum of the 3.3 unicode times is 109 versus 110 for 3.3 bytes and 125 
for 3.2 unicode. This unweighted sum is not really fair since the raw 
times vary by a factor of at least 100. But it does suggest that anyone 
claiming that 3.3 unicode is overall 'slower' than 3.2 unicode has some 
work to do.


There is also this. On my machine, the lowest bytes-time/unicode-time 
for 3.3 is .71. This suggests that there is not a lot of fluff left in 
the unicode code, and that not much is lost by the bytes to unicode 
switch for strings.


--
Terry Jan Reedy


--
http://mail.python.org/mailman/listinfo/python-list


Re: Why doesn't Python remember the initial directory?

2012-08-19 Thread Giacomo Alzetta
On Sunday, 19 August 2012 22:42:16 UTC+2, kj wrote:
 As far as I've been able to determine, Python does not remember
 (immutably, that is) the working directory at the program's start-up,
 or, if it does, it does not officially expose this information.
 
 Does anyone know why this is?  Is there a PEP stating the rationale
 for it?
 
 Thanks!

You can obtain the working directory with os.getcwd().

giacomo@jack-laptop:~$ echo 'import os; print os.getcwd()' > testing-dir.py
giacomo@jack-laptop:~$ python testing-dir.py 
/home/giacomo
giacomo@jack-laptop:~$ cd Documenti
giacomo@jack-laptop:~/Documenti$ python ../testing-dir.py 
/home/giacomo/Documenti
giacomo@jack-laptop:~/Documenti$ 

Obviously using os.chdir() will change the working directory, and the 
os.getcwd() will not be the start-up working directory, but if you need the 
start-up working directory you can get it at start-up and save it in some 
constant.
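E.g. a tiny sketch of that (the module and constant names here are made 
up):

# initialdir.py -- import this before anything calls os.chdir()
import os

INITIAL_DIR = os.getcwd()   # captured once, at first import

# later, from any module:
#     from initialdir import INITIAL_DIR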
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Why doesn't Python remember the initial directory?

2012-08-19 Thread Roy Smith
In article k0rj38$2gc$1...@reader1.panix.com, kj no.em...@please.post 
wrote:

 As far as I've been able to determine, Python does not remember
 (immutably, that is) the working directory at the program's start-up,
 or, if it does, it does not officially expose this information.

Why would you expect that it would?  What would it (or you) do with this 
information?

More to the point, doing a chdir() is not something any library code 
would do (at least not that I'm aware of), so if the directory changed, 
it's because some application code did it.  In which case, you could 
have just stored the working directory yourself.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Why doesn't Python remember the initial directory?

2012-08-19 Thread Mark Lawrence

On 19/08/2012 21:42, kj wrote:



As far as I've been able to determine, Python does not remember
(immutably, that is) the working directory at the program's start-up,
or, if it does, it does not officially expose this information.

Does anyone know why this is?  Is there a PEP stating the rationale
for it?

Thanks!



Why would you have a Python Enhancement Proposal to state the rationale 
for this?


--
Cheers.

Mark Lawrence.

--
http://mail.python.org/mailman/listinfo/python-list


Re: Why doesn't Python remember the initial directory?

2012-08-19 Thread Laszlo Nagy

On 2012-08-19 22:42, kj wrote:


As far as I've been able to determine, Python does not remember
(immutably, that is) the working directory at the program's start-up,
or, if it does, it does not officially expose this information.

Does anyone know why this is?  Is there a PEP stating the rationale
for it?

Thanks!
When you start the program, you have a current directory. When you 
change it, it is changed. How would you want Python to remember a 
directory? You can, for example, put it into a variable and use it 
later. Can you please show us some example code that demonstrates your 
actual problem?

--
http://mail.python.org/mailman/listinfo/python-list


Re: New internal string format in 3.3

2012-08-19 Thread Chris Angelico
On Mon, Aug 20, 2012 at 4:09 AM, Mark Lawrence breamore...@yahoo.co.uk wrote:
 On 19/08/2012 18:51, wxjmfa...@gmail.com wrote:

 Just for the story.

 Five minutes after I closed my interactive interpreter windows,
 the day I tested this stuff, I thought: too bad I did not note
 the extremely bad cases I found; I'm pretty sure this problem
 will arrive on the table.

 How convenient.

Not really. Even if he HAD copied-and-pasted those worst-cases, it'd
prove nothing. Maybe his system just chose to glitch right then. It's
always possible to find statistical outliers that take way way longer
than everything else.

Watch this. Python 3.2 on Windows is optimized for adding 1 to numbers.

C:\Documents and Settings\M>\python32\python -m timeit -r 1 a=1+1
1000 loops, best of 1: 0.0654 usec per loop

C:\Documents and Settings\M>\python32\python -m timeit -r 1 a=1+1
1000 loops, best of 1: 0.0654 usec per loop

C:\Documents and Settings\M>\python32\python -m timeit -r 1 a=1+1
1000 loops, best of 1: 0.0654 usec per loop

C:\Documents and Settings\M>\python32\python -m timeit -r 1 a=1+2
1000 loops, best of 1: 0.0711 usec per loop

Now, as long as I don't tell you that during the last test I had quite
a few other processes running, including VLC playing a movie and two
Python processes running while True: pass, this will look like a
significant performance difference. So now, I'm justified in
complaining about how suboptimal Python is when adding 2 to a number,
which I can assure you is a VERY common case.

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Terry Reedy

On 8/19/2012 2:11 PM, wxjmfa...@gmail.com wrote:


Well, it seems some software producers know what they
are doing.


>>> '€'.encode('cp1252')
b'\x80'
>>> '€'.encode('mac-roman')
b'\xdb'
>>> '€'.encode('iso-8859-1')
Traceback (most recent call last):
  File "<eta last command>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac'
in position 0: ordinal not in range(256)


Yes, Python lets you choose your byte encoding from those and a hundred 
others. I believe all the codecs are now tested in both directions. It 
was not an easy task.


As to the examples: Latin-1 dates to 1985 and before and the 1988 
version was published as a standard in 1992.

https://en.wikipedia.org/wiki/Latin-1
The name euro was officially adopted on 16 December 1995.
https://en.wikipedia.org/wiki/Euro
No wonder Latin-1 does not contain the Euro sign. International 
standards organizations standards are relatively fixed. (The unicode 
consortium will not even correct misspelled character names.) Instead, 
new standards with a new number are adopted.


For better or worse, private mappings are more flexible. In its Mac 
mapping Apple replaced the generic currency sign ¤ with the euro sign 
€. (See Latin-1 reference.) Great if you use Euros, not so great if you 
were using the previous sign for something else.


Microsoft changed an unneeded code to the Euro for Windows cp-1252.
https://en.wikipedia.org/wiki/Windows-1252
It is very common to mislabel Windows-1252 text with the charset label 
ISO-8859-1. A common result was that all the quotes and apostrophes 
(produced by smart quotes in Microsoft software) were replaced with 
question marks or boxes on non-Windows operating systems, making text 
difficult to read. Most modern web browsers and e-mail clients treat the 
MIME charset ISO-8859-1 as Windows-1252 in order to accommodate such 
mislabeling. This is now standard behavior in the draft HTML 5 
specification, which requires that documents advertised as ISO-8859-1 
actually be parsed with the Windows-1252 encoding.[1]
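For example (Python 3; a small sketch of that mislabeling):

data = '\u201chello\u201d'.encode('cp1252')   # Windows smart quotes
print(data.decode('cp1252'))    # round-trips correctly: “hello”
print(data.decode('latin-1'))   # mislabeled: the quotes come back as
                                # invisible C1 control characters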


Lots of fun. Too bad Microsoft won't push utf-8 so we can all 
communicate text with much less chance of ambiguity.


--
Terry Jan Reedy


--
http://mail.python.org/mailman/listinfo/python-list


Re: ONLINE SERVER TO STORE AND RUN PYTHON SCRIPTS

2012-08-19 Thread coldfire
On Saturday, 18 August 2012 00:42:00 UTC+5:30, Ian  wrote:
 On Fri, Aug 17, 2012 at 6:46 AM, coldfire amangill.coldf...@gmail.com wrote:
  I would like to know where a Python script can be stored on-line, from
  where it keeps running and can be called at any time when required,
  using the internet.
 
  I have used the mechanize module, which creates a web browser instance
  to open a website, extract data and email me.
 
  I have tried PythonAnywhere but they don't support opening of anonymous
  websites.
 
 According to their FAQ they don't support this for *free* accounts.
 You could just open a paid account (the cheapest option appears to be
 $5/month).
 
 Also, please don't type your email subject in all capital letters.  It
 comes across as shouting and is considered rude.

Got it, and sorry for typing it in caps; I will take care of that next 
time for sure. Also, could you help me out with the websites? I have no 
idea how to deploy a Python script online. I have done that on my local 
PC using an Apache server and CGI, and it works fine. What is this all 
called? As far as I have searched it's called a "web framework", but I 
don't want to develop a website, just a server which can run my scripts 
at a specific time and send me email if an error occurs. I use Python 
and I am not getting any lead.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: ONLINE SERVER TO STORE AND RUN PYTHON SCRIPTS

2012-08-19 Thread coldfire
On Friday, 17 August 2012 18:16:08 UTC+5:30, coldfire  wrote:
 I would like to know where a Python script can be stored on-line, from 
 where it keeps running and can be called at any time when required, 
 using the internet.
 
 I have used the mechanize module, which creates a web browser instance 
 to open a website, extract data and email me.
 
 I have tried PythonAnywhere but they don't support opening of anonymous 
 websites.
 
 What's the current way to DO this?
 
 Can someone point me in the right direction?
 
 My script has no interaction with the user; it just goes on-line, 
 searches for something, and emails me.
 
 Thanks

Sorry I never wanted to be rude.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Chris Angelico
On Mon, Aug 20, 2012 at 3:34 AM, Terry Reedy tjre...@udel.edu wrote:
 On 8/19/2012 4:04 AM, Paul Rubin wrote:
 I realize the folks who designed and implemented PEP 393 are very smart
 cookies and considered stuff carefully, while I'm just an internet user
 posting an immediate impression of something I hadn't seen before (I
 still use Python 2.6), but I still have to ask: if the 393 approach
 makes sense, why don't other languages do it?

 Python has often copied or borrowed, with adjustments. This time it is the
 first. We will see how it goes, but it has been tested for nearly a year
 already.

Maybe it wasn't consciously borrowed, but whatever innovation is done,
there's usually an obscure beardless language that did it earlier. :)

Pike has a single string type, which can use the full Unicode range.
If all codepoints are < 256, the string width is 8 (measured in bits);
if < 65536, width is 16; otherwise 32. Using the inbuilt count_memory
function (similar to the Python function used somewhere earlier in
this thread, but which I can't at present put my finger to), I find
that for strings of 16 bytes or more, there's a fixed 20-byte header
plus the string content, stored in the correct number of bytes. (Pike
strings, like Python ones, are immutable and do not need expansion
room.)

However, Python goes a bit further by making it VERY clear that this
is a mere optimization, and that Unicode strings and bytes strings are
completely different beasts. In Pike, it's possible to forget to
encode something before (say) writing it to a socket. Everything works
fine while you have only ASCII characters in the string, and then
breaks when you have a > 255 codepoint - or perhaps worse, when you
have one with 127 < x < 256, and the other end misinterprets it.

Really, the only viable alternative to PEP 393 is a fixed 32-bit
representation - it's the only way that's guaranteed to provide
equivalent semantics. The new storage format is guaranteed to take no
more memory than that, and provide equivalent functionality.

ChrisA
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Roy Smith
In article mailman.3531.1345416176.4697.python-l...@python.org,
 Chris Angelico ros...@gmail.com wrote:

 Really, the only viable alternative to PEP 393 is a fixed 32-bit
 representation - it's the only way that's guaranteed to provide
 equivalent semantics. The new storage format is guaranteed to take no
 more memory than that, and provide equivalent functionality.

In the primordial days of computing, using 8 bits to store a character 
was a profligate waste of memory.  What on earth did people need with 
TWO cases of the alphabet (not to mention all sorts of weird 
punctuation)?  Eventually, memory became cheap enough that the 
convenience of using one character per byte (not to mention 8-bit bytes) 
outweighed the costs.  And crazy things like sixbit and rad-50 got swept 
into the dustbin of history.

So it may be with utf-8 someday.

Clearly, the world has moved to a 32-bit character set.  Not all parts 
of the world know that yet, or are willing to admit it, but that doesn't 
negate the fact that it's true.  Equally clearly, the concept of one 
character per byte is a big win.  The obvious conclusion is that 
eventually, when memory gets cheap enough, we'll all be doing utf-32 and 
all this transcoding nonsense will look as antiquated as rad-50 does 
today.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Abuse of Big Oh notation

2012-08-19 Thread Paul Rubin
Steven D'Aprano steve+comp.lang.pyt...@pearwood.info writes:
 Of course *if* k is constant, O(k) is constant too, but k is not 
 constant. In context we are talking about string indexing and slicing. 
 There is no value of k, say, k = 2, for which you can say People will 
 sometimes ask for string[2] but never ask for string[3]. That is absurd.

The context was parsing, e.g. recognizing a token like "a" or "foo" in a
human-written chunk of text.  Occasionally it might be "sesquipedalian"
or some even worse outlier, but one can reasonably put a fixed and
relatively small upper bound on the expected value of k.  That makes the
amortized complexity O(1), I think.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How to get initial absolute working dir reliably?

2012-08-19 Thread alex23
On Sunday, 19 August 2012 01:19:59 UTC+10, kj  wrote:
 What's the most reliable way for module code to determine the
 absolute path of the working directory at the start of execution?

Here's some very simple code that relies on the singleton nature of modules 
and might be enough for your needs:

workingdir.py:
import os

_workingdir = None

def set():
    global _workingdir
    _workingdir = os.getcwd()

def get():
    return _workingdir

At the start of your application, import workingdir and do a workingdir.set(). 
Then when you need to retrieve it, import it again and use workingdir.get():

a.py:
import workingdir
workingdir.set()

b.py:
import workingdir
print workingdir.get()

test.py:
import a
import b

You could also remove the need to call the .set() by implicitly assigning on 
the first import:

if '_workingdir' not in locals():
    _workingdir = os.getcwd()

But I like the explicitness.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread 88888 Dihedral
On Monday, August 20, 2012 1:03:34 AM UTC+8, Blind Anagram wrote:
 Steven D'Aprano  wrote in message 
 news:502f8a2a$0$29978$c3e8da3$54964...@news.astraweb.com...
 
 On Sat, 18 Aug 2012 01:09:26 -0700, wxjmfauth wrote:
 
 [...]
 If you can consistently replicate a 100% to 1000% slowdown in string
 handling, please report it as a performance bug:
 
 http://bugs.python.org/
 
 Don't forget to report your operating system.
 
 For interest, I ran your code snippets on my laptop (Intel core-i7 1.8GHz) 
 running Windows 7 x64.
 
 Running Python from a Windows command prompt,  I got the following on Python 
 3.2.3 and 3.3 beta 2:
 
 python33\python -m timeit ('abc' * 1000).replace('c', 'de')
 1 loops, best of 3: 39.3 usec per loop
 python33\python -m timeit ('ab…' * 1000).replace('…', '……')
 1 loops, best of 3: 51.8 usec per loop
 python33\python -m timeit ('ab…' * 1000).replace('…', 'x…')
 1 loops, best of 3: 52 usec per loop
 python33\python -m timeit ('ab…' * 1000).replace('…', 'œ…')
 1 loops, best of 3: 50.3 usec per loop
 python33\python -m timeit ('ab…' * 1000).replace('…', '€…')
 1 loops, best of 3: 51.6 usec per loop
 python33\python -m timeit ('XYZ' * 1000).replace('X', 'éç')
 1 loops, best of 3: 38.3 usec per loop
 python33\python -m timeit ('XYZ' * 1000).replace('Y', 'p?')
 1 loops, best of 3: 50.3 usec per loop
 
 python32\python -m timeit ('abc' * 1000).replace('c', 'de')
 1 loops, best of 3: 24.5 usec per loop
 python32\python -m timeit ('ab…' * 1000).replace('…', '……')
 1 loops, best of 3: 24.7 usec per loop
 python32\python -m timeit ('ab…' * 1000).replace('…', 'x…')
 1 loops, best of 3: 24.8 usec per loop
 python32\python -m timeit ('ab…' * 1000).replace('…', 'œ…')
 1 loops, best of 3: 24 usec per loop
 python32\python -m timeit ('ab…' * 1000).replace('…', '€…')
 1 loops, best of 3: 24.1 usec per loop
 python32\python -m timeit ('XYZ' * 1000).replace('X', 'éç')
 1 loops, best of 3: 24.4 usec per loop
 python32\python -m timeit ('XYZ' * 1000).replace('Y', 'p?')
 1 loops, best of 3: 24.3 usec per loop
 
 This is an average slowdown by a factor of close to 2.3 on 3.3 when compared 
 with 3.2.
 
 I am not posting this to perpetuate this thread but simply to ask whether, 
 as you suggest, I should report this as a possible problem with the beta?

Um, another set of functions for speeding up ASCII string operations 
might be needed. But it is better that Python 3.3 first supports Unicode 
strings that are easy for people in different languages to use.

Anyway, I think Cython and Pyrex can be used to tackle this problem.


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: How do I display unicode value stored in a string variable using ord()

2012-08-19 Thread Terry Reedy

On 8/19/2012 6:42 PM, Chris Angelico wrote:

On Mon, Aug 20, 2012 at 3:34 AM, Terry Reedy tjre...@udel.edu wrote:



Python has often copied or borrowed, with adjustments. This time it is the
first.


I should have added 'that I know of' ;-)


Maybe it wasn't consciously borrowed, but whatever innovation is done,
there's usually an obscure beardless language that did it earlier. :)

Pike has a single string type, which can use the full Unicode range.
If all codepoints are 256, the string width is 8 (measured in bits);
if 65536, width is 16; otherwise 32. Using the inbuilt count_memory
function (similar to the Python function used somewhere earlier in
this thread, but which I can't at present put my finger to), I find
that for strings of 16 bytes or more, there's a fixed 20-byte header
plus the string content, stored in the correct number of bytes. (Pike
strings, like Python ones, are immutable and do not need expansion
room.)


It is even possible that someone involved was vaguely aware that there 
was an antecedent. The PEP makes no claim that I can see, but lays out 
the problem and goes right to details of a Python implementation.



However, Python goes a bit further by making it VERY clear that this
is a mere optimization, and that Unicode strings and bytes strings are
completely different beasts. In Pike, it's possible to forget to
encode something before (say) writing it to a socket. Everything works
fine while you have only ASCII characters in the string, and then
breaks when you have a 255 codepoint - or perhaps worse, when you
have a 127x256, and the other end misinterprets it.


Python writes strings to file objects, including open sockets, without 
creating a bytes object -- IF the file is opened in text mode, which 
always has an associated encoding, even if the default 'ascii'. From 
what you say, this is what Pike is missing.


I am pretty sure that the obvious optimization has already been done. 
The internal bytes of all-ascii text can safely be sent to a file with 
ascii (or ascii-compatible) encoding without intermediate 'decoding'. I 
remember several patches of that sort. If a string is internally ucs2 
and the file is declared usc2 or utf-16 encoding, then again, pairs of 
bytes can go directly (possibly with a byte swap).



--
Terry Jan Reedy

--
http://mail.python.org/mailman/listinfo/python-list


Re: Why doesn't Python remember the initial directory?

2012-08-19 Thread Nobody
On Sun, 19 Aug 2012 14:01:15 -0700, Giacomo Alzetta wrote:

 You can obtain the working directory with os.getcwd().

Maybe. On Unix, it's possible that the current directory no longer
has a pathname. As with files, directories can be deleted (i.e.
unlinked) even while they're still in use.

Similarly, a directory can be renamed while it's in use, so the current
directory's pathname may have changed even while the current directory
itself hasn't.
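A minimal demonstration (Unix only):

import os, tempfile

d = tempfile.mkdtemp()
os.chdir(d)
os.rmdir(d)      # the current directory is now unlinked
os.getcwd()      # raises OSError: the cwd no longer has a pathname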

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Why doesn't Python remember the initial directory?

2012-08-19 Thread 88888 Dihedral
On Monday, August 20, 2012 4:42:16 AM UTC+8, kj wrote:
 As far as I've been able to determine, Python does not remember
 (immutably, that is) the working directory at the program's start-up,
 or, if it does, it does not officially expose this information.
 
 Does anyone know why this is?  Is there a PEP stating the rationale
 for it?
 
 Thanks!

Immutable data can be frozen and saved somewhere off the main memory.

Declarative and imperative programming are different.

Please check Erlang.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: ONLINE SERVER TO STORE AND RUN PYTHON SCRIPTS

2012-08-19 Thread Jerry Hill
On Sun, Aug 19, 2012 at 6:27 PM, coldfire amangill.coldf...@gmail.com wrote:
 Also I have no idea how to deploy a python script online.
 I have done that on my local PC using Apache server and cgi but it Works fine.
 Whats this all called? as far as I have searched its Web Framework but I dont 
 wont to develop  a website Just a Server which can run my scripts at specific 
 time and send me email if an error occurs.
 I use Python And i am not getting any lead.

If you want to host web pages, like you're doing on your local PC
with Apache and cgi, then you need an account with a web server, and a
way to deploy your scripts and other content.  This is often known as
a 'web hosting service'[1].  The exact capabilities and restrictions
will vary from provider to provider.

If you just want an always-on, internet-accessible place to store and
run your python scripts, you may be interested in a 'shell
account'[2], or if you need more control over the environment, a
'virtual private server'[3].

That may give you a few terms to google, and see what kind of service you need.

1 http://en.wikipedia.org/wiki/Web_host
2 http://en.wikipedia.org/wiki/Shell_account
3 http://en.wikipedia.org/wiki/Virtual_private_server

-- 
Jerry
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Why doesn't Python remember the initial directory?

2012-08-19 Thread kj
In roy-ca6d77.17031119082...@news.panix.com Roy Smith r...@panix.com writes:

In article k0rj38$2gc$1...@reader1.panix.com, kj no.em...@please.post 
wrote:

 As far as I've been able to determine, Python does not remember
 (immutably, that is) the working directory at the program's start-up,
 or, if it does, it does not officially expose this information.

Why would you expect that it would?  What would it (or you) do with this 
information?

More to the point, doing a chdir() is not something any library code 
would do (at least not that I'm aware of), so if the directory changed, 
it's because some application code did it.  In which case, you could 
have just stored the working directory yourself.

This means that no library code can ever count on, for example,
being able to reliably find the path to the file that contains the
definition of __main__.  That's a weakness, IMO.  One manifestation
of this weakness is that os.chdir breaks inspect.getmodule, at
least on Unix.  If you have some Unix system handy, you can try
the following.  First change the argument to os.chdir below to some
valid directory other than your working directory.  Then, run the
script, making sure that you refer to it using a relative path.
When I do this on my system (OS X + Python 2.7.3), the script bombs
at the last print statement, because the second call to inspect.getmodule
(though not the first one) returns None.

import inspect
import os

frame = inspect.currentframe()

print inspect.getmodule(frame).__name__

os.chdir('/some/other/directory') # where '/some/other/directory' is
  # different from the initial directory

print inspect.getmodule(frame).__name__

...

% python demo.py
__main__
Traceback (most recent call last):
  File "demo.py", line 11, in <module>
    print inspect.getmodule(frame).__name__
AttributeError: 'NoneType' object has no attribute '__name__'



I don't know of any way to fix inspect.getmodule that does not
involve, directly or indirectly, keeping a stable record of the
starting directory.

But, who am I kidding?  What needs fixing, right?  That's not a
bug, that's a feature!  Etc.

By now I have learned to expect that 99.99% of Python programmers
will find that there's nothing wrong with behavior like the one
described above, that it is in fact exactly As It Should Be, because,
you see, since Python is the epitome of perfection, it follows
inexorably that any flaw or shortcoming one may *perceive* in Python
is only an *illusion*: the flaw or shortcoming is really in the
benighted programmer, for having stupid ideas about programming
(i.e. any idea that may entail that Python is not *gasp* perfect).
Pardon my cynicism, but the general vibe from the replies I've
gotten to my post (i.e. if Python ain't got it, it means you don't
need it) is entirely in line with these expectations.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Why doesn't Python remember the initial directory?

2012-08-19 Thread Jerry Hill
On Sun, Aug 19, 2012 at 9:57 PM, kj no.em...@please.post wrote:
 By now I have learned to expect that 99.99% of Python programmers
 will find that there's nothing wrong with behavior like the one
 described above, that it is in fact exactly As It Should Be, because,
 you see, since Python is the epitome of perfection, it follows
 inexorably that any flaw or shortcoming one may *perceive* in Python
 is only an *illusion*: the flaw or shortcoming is really in the
 benighted programmer, for having stupid ideas about programming
 (i.e. any idea that may entail that Python is not *gasp* perfect).
 Pardon my cynicism, but the general vibe from the replies I've
 gotten to my post (i.e. if Python ain't got it, it means you don't
 need it) is entirely in line with these expectations.

Since you have no respect for the people you're writing to, why
bother?  I know I certainly have no desire to spend any time at all on
your problem when you say things like that.  Perhaps you're looking
for the argument clinic instead?

http://www.youtube.com/watch?v=RDjCqjzbvJY

-- 
Jerry
-- 
http://mail.python.org/mailman/listinfo/python-list


Legal: Introduction to Programming App

2012-08-19 Thread Matthew Zipf
Good evening,

I am considering developing an iOS application that would teach average
people how to program in Python. The app will be sold on the Apple app
store.

May I develop this app? To what extent do I need to receive permission from
the Python Software Foundation? To what extent do I need to recognize the
Python Software Foundation in my app?

Thank you,
Matthew Zipf
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Why doesn't Python remember the initial directory?

2012-08-19 Thread alex23

On Monday, 20 August 2012 11:57:46 UTC+10, kj  wrote:
 This means that no library code can ever count on, for example,
 being able to reliably find the path to the file that contains the
 definition of __main__.  That's a weakness, IMO.

No, it's not. It's a _strength_. If you've written a library that requires 
absolute knowledge of its installed location in order for its internals to 
work, then I'm not installing your library.

 When I do this on my system (OS X + Python 2.7.3), the script bombs
 at the last print statement, because the second call to inspect.getmodule
 (though not the first one) returns None.

So, uh, do something sane like test for the result of inspect.getmodule 
_before_ trying to do something invalid to it?

 I don't know of any way to fix inspect.getmodule that does not
 involve, directly or indirectly, keeping a stable record of the
 starting directory.

Then _that is the answer_. YOU need to keep a stable record:

import inspect 
import os 

THIS_FILE = os.path.join(os.getcwd(), 'this_module_name.py')

frame = inspect.currentframe() 
print inspect.getmodule(frame).__name__ 

os.chdir('/some/other/directory')

print inspect.getmodule(frame, _filename=THIS_FILE).__name__ 

 But, who am I kidding?  What needs fixing, right?  That's not a
 bug, that's a feature!  Etc.

Right. Because that sort of introspection of objects is rare, why burden the 
_entire_ language with an obligation that is only required in a few places?

 By now I have learned to expect that 99.99% of Python programmers
 will find that [blah blah blah, whine whine whine]. 
 Pardon my cynicism, but the general vibe from the replies I've
 gotten to my post (i.e. if Python ain't got it, it means you don't
 need it) is entirely in line with these expectations.

Oh my god, how DARE people with EXPERIENCE in a language challenge the 
PRECONCEPTIONS of an AMATEUR!!! HOW DARE THEY?!?!
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Why doesn't Python remember the initial directory?

2012-08-19 Thread alex23
My apologies for any double-ups and bad formatting. The new Google Groups 
interface seems to have effectively shat away decades of UX for something that 
I can only guess was generated randomly.
-- 
http://mail.python.org/mailman/listinfo/python-list


  1   2   >