Re: Cpython optimization

2009-10-23 Thread Terry Reedy

Qrees wrote:

Hello

As my Master's dissertation I chose CPython optimization. That's why
I'd like to ask for your suggestions on what can be optimized. Well,
I know that there's quite a lot. I've downloaded the source code (I plan to
work on CPython 2.6 and I've downloaded the 2.6.3 release). By looking at
the code I've found comments like "this can be optimized by...", etc.,
but maybe you can guide me on what I should concentrate on in my work?


2.6.3 is about to be replaced with 2.6.4, but even that will not have 
improvements that have already been added to 3.x or 2.7. I strongly 
suggest that you work with the 3.2 code, since there is then a maximal 
chance that your improvements will actually be adopted.

 tjr

--
http://mail.python.org/mailman/listinfo/python-list


Re: Cpython optimization

2009-10-23 Thread MRAB

Olof Bjarnason wrote:
[snip]
A short question after having read through most of this thread, on the 
same subject (time-optimizing CPython):


http://mail.python.org/pipermail/python-list/2007-September/098964.html

We are experiencing multi-core processor kernels more and more these 
days. But they are all still connected to the main memory, right?


To me that means, even though some algorithm can be split up into 
several threads that run on different cores of the processor, that any 
algorithm will be memory-speed limited. And memory access is a quite 
common operation for most algorithms.


Then one could ask oneself: what is the point of multiple cores, if 
memory bandwidth is the bottleneck? Specifically, what makes one expect 
any speed gain from parallelizing a sequential algorithm into four 
threads, say, when the memory shuffling is the same speed in both 
scenarios? (Assuming memory access is much slower than ADDs, JMPs and 
such instructions - a quite safe assumption I presume)


[ If every core had its own primary memory, the situation would be 
different. It would be more like the situation in a distributed/internet 
based system, spread over several computers. One could view each core as 
a separate computer actually ]



Don't forget about the on-chip cache! :-)
--
http://mail.python.org/mailman/listinfo/python-list


Re: Cpython optimization

2009-10-23 Thread Olof Bjarnason
2009/10/22 MRAB pyt...@mrabarnett.plus.com

 Olof Bjarnason wrote:
 [snip]

  A short question after having read through most of this thread, on the
 same subject (time-optimizing CPython):

 http://mail.python.org/pipermail/python-list/2007-September/098964.html

 We are experiencing multi-core processor kernels more and more these days.
 But they are all still connected to the main memory, right?

 To me that means, even though some algorithm can be split up into several
 threads that run on different cores of the processor, that any algorithm
 will be memory-speed limited. And memory access is a quite common operation
 for most algorithms.

 Then one could ask oneself: what is the point of multiple cores, if memory
 bandwidth is the bottleneck? Specifically, what makes one expect any speed
 gain from parallelizing a sequential algorithm into four threads, say, when
 the memory shuffling is the same speed in both scenarios? (Assuming memory
 access is much slower than ADDs, JMPs and such instructions - a quite safe
 assumption I presume)

 If every core had its own primary memory, the situation would be
 different. It would be more like the situation in a distributed/internet
 based system, spread over several computers. One could view each core as a
 separate computer actually ]

  Don't forget about the on-chip cache! :-)


Sorry for continuing slightly OT:

Yes, that makes matters even more interesting.

Caches for single-CPU boards speed up memory access quite dramatically. Are
caches on multi-core boards shared among the cores? Or does each core have a
separate cache? I can only imagine how complicated the read/write logic of
these tiny electronic devices must be, in any case.

Of course caches make memory-access operations much faster, but I'm
guessing register instructions are still orders of magnitude faster than
(cached) memory access. (Or else registers would not really be needed - you
could just view the whole primary memory as an array of registers!)

So I think my first question is still interesting: What is the point of
multiple cores, if memory is the bottleneck?
(It helps to think of algorithms such as line-drawing or ray-tracing, which
are easy to parallelize, yet I believe are still faster using a single core
instead of multiple because of the read/write-to-memory bottleneck. It does
not help to bring more workers to the mine if only one is allowed access at a
time, or, more likely, several are allowed yet it gets so crowded that
queues/waiting are inevitable.)
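To make the compute-bound/memory-bound distinction concrete, here is a rough sketch in modern Python (the helper names and sizes are made up for illustration; `multiprocessing` is used rather than threads, since CPython's GIL keeps pure-Python threads from running simultaneously):

```python
# Hedged sketch: contrast a compute-heavy task with a memory-streaming one.
# Only timing on real hardware shows which actually gains from extra cores;
# this merely sets up the comparison.
from multiprocessing import Pool

def cpu_bound(n):
    # Tight arithmetic loop: operands live in registers/L1 cache,
    # so extra cores tend to give near-linear speedup.
    total = 0
    for i in range(n):
        total += i * i
    return total

def memory_bound(chunk):
    # Mostly streams through a large sequence: shared memory bandwidth
    # can become the limit as more cores join in.
    return sum(chunk)

if __name__ == "__main__":
    import time
    data = list(range(2_000_000))
    chunks = [data[i::4] for i in range(4)]
    with Pool(4) as pool:
        t0 = time.time()
        pool.map(cpu_bound, [500_000] * 4)
        t1 = time.time()
        pool.map(memory_bound, chunks)
        t2 = time.time()
    print("cpu-bound: %.3fs, memory-bound: %.3fs" % (t1 - t0, t2 - t1))
```

Comparing the two timings against single-process runs is one way to see which workload scales with cores on a given machine.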

-- 

 http://mail.python.org/mailman/listinfo/python-list




-- 
twitter.com/olofb
olofb.wordpress.com
olofb.wordpress.com/tag/english
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Cpython optimization

2009-10-23 Thread Olof Bjarnason
2009/10/23 Olof Bjarnason olof.bjarna...@gmail.com



 2009/10/22 MRAB pyt...@mrabarnett.plus.com

 Olof Bjarnason wrote:
 [snip]

  A short question after having read through most of this thread, on the
 same subject (time-optimizing CPython):

 http://mail.python.org/pipermail/python-list/2007-September/098964.html

 We are experiencing multi-core processor kernels more and more these
 days. But they are all still connected to the main memory, right?

 To me that means, even though some algorithm can be split up into several
 threads that run on different cores of the processor, that any algorithm
 will be memory-speed limited. And memory access is a quite common operation
 for most algorithms.

 Then one could ask oneself: what is the point of multiple cores, if
 memory bandwidth is the bottleneck? Specifically, what makes one expect any
 speed gain from parallelizing a sequential algorithm into four threads, say,
 when the memory shuffling is the same speed in both scenarios? (Assuming
 memory access is much slower than ADDs, JMPs and such instructions - a quite
 safe assumption I presume)

 If every core had its own primary memory, the situation would be
 different. It would be more like the situation in a distributed/internet
 based system, spread over several computers. One could view each core as a
 separate computer actually ]

  Don't forget about the on-chip cache! :-)


 Sorry for continuing slightly OT:

 Yes, that makes matters even more interesting.

 Caches for single-CPU boards speed up memory access quite dramatically. Are
 caches on multi-core boards shared among the cores? Or does each core have a
 separate cache? I can only imagine how complicated the read/write logic of
 these tiny electronic devices must be, in any case.

 Of course caches make memory-access operations much faster, but I'm
 guessing register instructions are still orders of magnitude faster than
 (cached) memory access. (Or else registers would not really be needed - you
 could just view the whole primary memory as an array of registers!)

 So I think my first question is still interesting: What is the point of
 multiple cores, if memory is the bottleneck?
 (it helps to think of algorithms such as line-drawing or ray-tracing, which
 are easy to parallelize, yet I believe are still faster using a single core
 instead of multiple because of the read/write-to-memory-bottleneck. It does
 help to bring more workers to the mine if


Um, typo: that should read "It does NOT help to ..."

only one is allowed access at a time, or more likely, several are allowed
 yet it gets so crowded that queues/waiting is inevitable)

 --

 http://mail.python.org/mailman/listinfo/python-list




 --
 twitter.com/olofb
 olofb.wordpress.com
 olofb.wordpress.com/tag/english




-- 
twitter.com/olofb
olofb.wordpress.com
olofb.wordpress.com/tag/english
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Cpython optimization

2009-10-23 Thread Stefan Behnel
 Olof Bjarnason wrote:
 [snip]
 A short question after having read through most of this thread, on the
 same subject (time-optimizing CPython):

 http://mail.python.org/pipermail/python-list/2007-September/098964.html

 We are experiencing multi-core processor kernels more and more these
 days. But they are all still connected to the main memory, right?

 To me that means, even though some algorithm can be split up into
 several threads that run on different cores of the processor, that any
 algorithm will be memory-speed limited. And memory access is a quite
 common operation for most algorithms.

 Then one could ask oneself: what is the point of multiple cores, if
 memory bandwidth is the bottleneck? Specifically, what makes one
 expect any speed gain from parallelizing a sequential algorithm into
 four threads, say, when the memory shuffling is the same speed in both
 scenarios? (Assuming memory access is much slower than ADDs, JMPs and
 such instructions - a quite safe assumption I presume)

Modern (multi-core) processors have several levels of caches that interact
with the other cores in different ways.

You should read up on NUMA.

http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access

Stefan
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Cpython optimization

2009-10-23 Thread Olof Bjarnason
2009/10/23 Stefan Behnel stefan...@behnel.de

  Olof Bjarnason wrote:
  [snip]
  A short question after having read through most of this thread, on the
  same subject (time-optimizing CPython):
 
  http://mail.python.org/pipermail/python-list/2007-September/098964.html
 
  We are experiencing multi-core processor kernels more and more these
  days. But they are all still connected to the main memory, right?
 
  To me that means, even though some algorithm can be split up into
  several threads that run on different cores of the processor, that any
  algorithm will be memory-speed limited. And memory access is a quite
  common operation for most algorithms.
 
  Then one could ask oneself: what is the point of multiple cores, if
  memory bandwidth is the bottleneck? Specifically, what makes one
  expect any speed gain from parallelizing a sequential algorithm into
  four threads, say, when the memory shuffling is the same speed in both
  scenarios? (Assuming memory access is much slower than ADDs, JMPs and
  such instructions - a quite safe assumption I presume)

 Modern (multi-core) processors have several levels of caches that interact
 with the other cores in different ways.

 You should read up on NUMA.

 http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access


Thanks, that was a good read. Basically NUMA addresses the problems I mentioned
by creating several primary memories (called banks), making the motherboard
contain several computers (= CPU + primary memory) instead of one primary
memory and several cores, if I read it correctly.


 Stefan
 --
 http://mail.python.org/mailman/listinfo/python-list




-- 
twitter.com/olofb
olofb.wordpress.com
olofb.wordpress.com/tag/english
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Cpython optimization

2009-10-23 Thread Antoine Pitrou
On Fri, 23 Oct 2009 09:45:06 +0200, Olof Bjarnason wrote:
 
 So I think my first question is still interesting: What is the point of
 multiple cores, if memory is the bottleneck?

Why do you think it is, actually? Some workloads are CPU-bound, some 
others are memory- or I/O-bound.

You will find plenty of benchmarks on the Web showing that some workloads 
scale almost linearly to the number of CPU cores, while some don't.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Cpython optimization

2009-10-23 Thread Olof Bjarnason
2009/10/23 Antoine Pitrou solip...@pitrou.net

 On Fri, 23 Oct 2009 09:45:06 +0200, Olof Bjarnason wrote:
 
  So I think my first question is still interesting: What is the point of
  multiple cores, if memory is the bottleneck?

 Why do you think it is, actually? Some workloads are CPU-bound, some
 others are memory- or I/O-bound.


Big I/O operations have even less chance of gaining anything from being
parallelized (quite the opposite, I would guess, due to synchronization
overhead).

Operations that use very little memory, say some mathematical computations
that can fit in 16 registers, have a lot to gain from being parallelized, of
course. Maybe computing small-formula fractals is a good example of this?

My question was regarding the hardest-to-tell and, sadly, also the most
common case: trying to speed up algorithms that work over big data structures
in primary memory. Things like drawing lines, doing image processing,
multiplication of large matrices, ray tracing... Basically anything with
large amounts of input information/buffers/structures, stored in primary
memory.

This would be a way to speed up things in an image processing algorithm:
1. divide the image into four subimages
2. let each core process each part independently
3. fix/merge (along split lines, for example) into a resulting, complete
image

Notice that no gain would be achieved if the memory was shared among the
cores, at least not if the operation-per-pixel is faster than the read/write
into memory.
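A minimal sketch of the three steps above in modern Python (the per-pixel operation, inversion, is a made-up placeholder; note also that in CPython pure-Python threads are serialized by the GIL, so real speedups would need multiprocessing or C-level code that releases the GIL):

```python
# Hedged sketch of the split/process/merge scheme from the post.
from concurrent.futures import ThreadPoolExecutor

def invert_strip(rows):
    # Step 2: process one strip independently (placeholder operation).
    return [[255 - px for px in row] for row in rows]

def parallel_invert(image, workers=4):
    # Step 1: divide the image (a list of rows) into `workers` strips.
    n = len(image)
    strips = [image[i * n // workers:(i + 1) * n // workers]
              for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        processed = list(pool.map(invert_strip, strips))
    # Step 3: merge; a purely per-pixel operation needs no fixing along
    # the split lines, so the strips simply concatenate.
    return [row for strip in processed for row in strip]
```

Whether this wins over a single core then comes down to exactly the question raised here: the per-pixel work versus the shared memory traffic.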


 You will find plenty of benchmarks on the Web showing that some workloads
 scale almost linearly to the number of CPU cores, while some don't.

 --
 http://mail.python.org/mailman/listinfo/python-list




-- 
twitter.com/olofb
olofb.wordpress.com
olofb.wordpress.com/tag/english
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Cpython optimization

2009-10-23 Thread Antoine Pitrou
On Fri, 23 Oct 2009 15:53:02 +0200, Olof Bjarnason wrote:
 
 This would be way to speed up things in an image processing algorithm:
 1. divide the image into four subimages 2. let each core process each
 part independently 3. fix/merge (along split lines for example) into a
 resulting, complete image

Well, don't assume you're the first to think about that.
I'm sure that performance-conscious image processing software already has 
this kind of tile-based optimizations.
(actually, if you look at benchmarks of 3D rendering which are regularly 
done by enthusiast websites, it shows exactly that)

Antoine.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Cpython optimization

2009-10-23 Thread Olof Bjarnason

 
  This would be way to speed up things in an image processing algorithm:
  1. divide the image into four subimages 2. let each core process each
  part independently 3. fix/merge (along split lines for example) into a
  resulting, complete image

 Well, don't assume you're the first to think about that.
 I'm sure that performance-conscious image processing software already has
 this kind of tile-based optimizations.
 (actually, if you look at benchmarks of 3D rendering which are regularly
 done by enthusiast websites, it shows exactly that)

No, I didn't assume I was the first to think about that - I wanted to learn
more about how optimization is at all possible/viable with multi-core
motherboards, when the memory speed is the bottleneck anyway, regardless of
smart caching technologies.

I still have not received a convincing answer :)

Antoine.

--

  http://mail.python.org/mailman/listinfo/python-list




-- 
twitter.com/olofb
olofb.wordpress.com
olofb.wordpress.com/tag/english
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Cpython optimization

2009-10-23 Thread Jason Sewall
On Fri, Oct 23, 2009 at 2:31 PM, Olof Bjarnason
olof.bjarna...@gmail.com wrote:
 
  This would be way to speed up things in an image processing algorithm:
  1. divide the image into four subimages 2. let each core process each
  part independently 3. fix/merge (along split lines for example) into a
  resulting, complete image

 Well, don't assume you're the first to think about that.
 I'm sure that performance-conscious image processing software already has
 this kind of tile-based optimizations.
 (actually, if you look at benchmarks of 3D rendering which are regularly
 done by enthusiast websites, it shows exactly that)

This is indeed a tried-and-true method for parallelizing certain image
and other grid-based algorithms, but it is in fact not appropriate for
a wide variety of techniques. Things like median filters, where f(A|B)
!= f(A)|f(B) (with | as some sort of concatenation), will not be able
to generate correct results given the scheme you outlined.
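A toy 1-D illustration of that point (not from the thread; the three-point median and pass-through edges are arbitrary choices):

```python
# Toy 1-D median filter showing why naive tiling breaks: the result near
# a split boundary depends on neighbors that live in the other tile.
def median3(a, b, c):
    return sorted((a, b, c))[1]

def median_filter(xs):
    # Edge samples are passed through unchanged in this toy version.
    return [xs[i] if i in (0, len(xs) - 1)
            else median3(xs[i - 1], xs[i], xs[i + 1])
            for i in range(len(xs))]

data = [5, 1, 9, 2, 8, 3]
whole = median_filter(data)
# Filtering the two halves separately and concatenating disagrees with the
# whole-image result near the split, because each half is missing the
# other half's boundary samples.
tiled = median_filter(data[:3]) + median_filter(data[3:])
assert whole != tiled  # [5, 5, 2, 8, 3, 3] vs [5, 5, 9, 2, 3, 3]
```

Correct tiled versions of such filters have to exchange halo/overlap regions between tiles, which is extra memory traffic on top of the split itself.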

 No I didn't assume I was the first to think about that - I wanted to learn
 more about how optimization at all is possible/viable with multi-core
 motherboards, when the memory speed is the bottleneck anyway, regardless of
 smart caching technologies.

 I still have not received a convincing answer :)

Give Ulrich Drepper's "What Every Programmer Should Know About Memory"
a read (http://people.redhat.com/drepper/cpumemory.pdf) and you'll
hear all you want to know (and more) about how the memory hierarchy
plays with multi-core.

I don't contribute to CPython, but I am virtually certain that they
are not interested in having the compiler/interpreter try to apply
some generic threading to arbitrary code. The vast majority of Python
code wouldn't benefit from it even if it worked well, and I'm _very_
skeptical that there is any silver bullet for parallelizing general
code. If you think of one, tell Intel, they'll hire you.

_Perhaps_ the numpy or scipy people (I am not associated with either
of them) would be interested in some threading for certain array
operations. Maybe you could write something on top of what they have
to speed up array ops.

Jason
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Cpython optimization

2009-10-22 Thread Olof Bjarnason
2009/10/22 John Yeung gallium.arsen...@gmail.com

 On Oct 22, 12:28 am, John Nagle na...@animats.com wrote:

 The Shed Skin people would welcome some help.
 
 http://shed-skin.blogspot.com/

 People?  It's one guy.  It apparently started out as a Master's thesis
 as well. ;)

 I am a great admirer of the Shed Skin project, and I would be as happy
 as anyone to see it progress.  However, it seems to me that Shed Skin
 is at a stage where what it really needs is plain old more work.  (I
 don't want to call it grunt work, but it's things like more testing,
 implementing more library support, maintaining the Windows build,
 etc.  Very worthy and worthwhile work, but tough to pass off as
 academic graduate work.)

 To the original poster:  I am not sure if this is too ambitious for
 your time frame, but one thing many real-world Python users would LOVE
 is for some way to get rid of the GIL (while still retaining thread
 safety, single-core performance, etc.).  If you could patch CPython to
 make it GIL-less, I think you would have academically publishable
 material as well as being a hero in the Python community. :D


A short question after having read through most of this thread, on the same
subject (time-optimizing CPython):

http://mail.python.org/pipermail/python-list/2007-September/098964.html

We are experiencing multi-core processor kernels more and more these days.
But they are all still connected to the main memory, right?

To me that means, even though some algorithm can be split up into several
threads that run on different cores of the processor, that any algorithm
will be memory-speed limited. And memory access is a quite common operation
for most algorithms.

Then one could ask oneself: what is the point of multiple cores, if memory
bandwidth is the bottleneck? Specifically, what makes one expect any speed
gain from parallelizing a sequential algorithm into four threads, say, when
the memory shuffling is the same speed in both scenarios? (Assuming memory
access is much slower than ADDs, JMPs and such instructions - a quite safe
assumption I presume)

[ If every core had its own primary memory, the situation would be
different. It would be more like the situation in a distributed/internet
based system, spread over several computers. One could view each core as a
separate computer actually ]


 John
 --
 http://mail.python.org/mailman/listinfo/python-list




-- 
twitter.com/olofb
olofb.wordpress.com
olofb.wordpress.com/tag/english
-- 
http://mail.python.org/mailman/listinfo/python-list


Cpython optimization

2009-10-22 Thread Qrees
Hello

As my Master's dissertation I chose CPython optimization. That's why
I'd like to ask for your suggestions on what can be optimized. Well,
I know that there's quite a lot. I've downloaded the source code (I plan to
work on CPython 2.6 and I've downloaded the 2.6.3 release). By looking at
the code I've found comments like "this can be optimized by...", etc.,
but maybe you can guide me on what I should concentrate on in my work?

I have 6-7 months for this, and if I create something decent I can
publish it.

Thank you in advance for any help
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Cpython optimization

2009-10-22 Thread Stefan Behnel
Francesco Bochicchio wrote:
 As a simple and plain Python user, I would value a version of Cython that 
 can be used to build faster executables out of almost-Python code (that 
 is, Python code with a few additional restrictions). Maybe using type 
 inference to avoid declaring the variable types explicitly.

Sorry, type inference has already landed and will be in Cython 0.12. Plus,
you do not need to declare variables in Cython, but you may, if you choose to.

Stefan
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Cpython optimization

2009-10-22 Thread geremy condra
On Wed, Oct 21, 2009 at 1:28 PM, Qrees qre...@gmail.com wrote:
 Hello

 As my Master's dissertation I chose CPython optimization. That's why
 I'd like to ask for your suggestions on what can be optimized. Well,
 I know that there's quite a lot. I've downloaded the source code (I plan to
 work on CPython 2.6 and I've downloaded the 2.6.3 release). By looking at
 the code I've found comments like "this can be optimized by...", etc.,
 but maybe you can guide me on what I should concentrate on in my work?

 I have 6-7 months for this, and if I create something decent I can
 publish it.

 Thank you in advance for any help

Could always port it to CUDA ;)

Geremy Condra
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Cpython optimization

2009-10-21 Thread Francesco Bochicchio
On Wed, 21 Oct 2009 10:28:55 -0700, Qrees wrote:

 Hello
 
 As my Master's dissertation I chose CPython optimization. That's why I'd
 like to ask for your suggestions on what can be optimized. Well, I know
 that there's quite a lot. I've downloaded the source code (I plan to work
 on CPython 2.6 and I've downloaded the 2.6.3 release). By looking at the
 code I've found comments like "this can be optimized by...", etc., but
 maybe you can guide me on what I should concentrate on in my work?
 
 I have 6-7 months for this, and if I create something decent I can publish
 it.
 
 Thank you in advance for any help

If you don't know yet, you could find interesting this project:

   http://code.google.com/p/unladen-swallow/

They too are trying to improve CPython speed.

If you are thinking of language variations that trade some flexibility 
for speed, you might be interested in Cython:

http://www.cython.org/

As a simple and plain Python user, I would value a version of Cython that 
can be used to build faster executables out of almost-Python code (that 
is, Python code with a few additional restrictions). Maybe using type 
inference to avoid declaring the variable types explicitly.

Another interesting place to go is PyPy: http://codespeak.net/pypy/dist/pypy/doc/ .
They too have developed a restricted version of Python (RPython, I think)
which should be faster than CPython. They don't work with the CPython code
base, but could give you ideas on what the bottlenecks of Python as a
language are.

Ciao
-
FB


 
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Cpython optimization

2009-10-21 Thread Qrees
 If you don't know yet, you could find interesting this project:

    http://code.google.com/p/unladen-swallow/

I know about this project. I'll have a look at it, but I'd like to
create something of my own.

 They too are trying to improve CPython speed.

 If you are thinking of language variations that trade some flexibility
 for speed, you might be interested in Cython:

 http://www.cython.org/

I was thinking about sacrificing some flexibility of Python, and thank
you for pointing me to this project. I didn't know about it.

 As a simple and plain Python user, I would value a version of Cython that
 can be used to build faster executables out of almost-Python code (that
 is, Python code with a few additional restrictions). Maybe using type
 inference to avoid declaring the variable types explicitly.

 Another interesting place to go is PyPy: http://codespeak.net/pypy/dist/pypy/doc/ .
 They too have developed a restricted version of Python (RPython, I think)
 which should be faster than CPython. They don't work with the CPython code
 base, but could give you ideas on what the bottlenecks of Python as a
 language are.

 Ciao
 -
 FB

I'd like to hear a word from the developers on what they think about
this :).

BTW: My seminar deals with object oriented programming.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Cpython optimization

2009-10-21 Thread John Nagle

Qrees wrote:

Hello

As my Master's dissertation I chose CPython optimization. That's why
I'd like to ask for your suggestions on what can be optimized. Well,
I know that there's quite a lot. I've downloaded the source code (I plan to
work on CPython 2.6 and I've downloaded the 2.6.3 release). By looking at
the code I've found comments like "this can be optimized by...", etc.,
but maybe you can guide me on what I should concentrate on in my work?

I have 6-7 months for this, and if I create something decent I can
publish it.

Thank you in advance for any help


  The Shed Skin people would welcome some help.

http://shed-skin.blogspot.com/

Shed Skin is a Python to C++ compiler with automatic type inference.
For the programs it can handle, it's the fastest Python implementation
by a large margin.

The basic restriction in Shed Skin is that, while you don't have to
declare types, each variable generally has to stay the same type
throughout its life.  (Different subclasses of the same class are OK.)
This is enough to make huge speedups possible.
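A sketch of the kind of code that restriction distinguishes (illustrative only; the function names are made up, and this is not Shed Skin's actual diagnostics):

```python
# Under the restriction described above, a variable that keeps one type
# for its whole life is easy to infer; one that switches types defeats
# static inference.

def monomorphic(n):
    # `total` is an int for its whole life: inferable as a C++ int.
    total = 0
    for i in range(n):
        total += i
    return total

def polymorphic(flag):
    # `x` is sometimes an int, sometimes a str: this retyping is the
    # kind of thing a restricted-Python compiler would have to reject.
    x = 1
    if flag:
        x = "one"
    return x
```

Both run fine in CPython, which is exactly why the restriction buys speed: giving up the second style lets the compiler emit statically typed C++.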

John Nagle
--
http://mail.python.org/mailman/listinfo/python-list


Re: Cpython optimization

2009-10-21 Thread John Yeung
On Oct 22, 12:28 am, John Nagle na...@animats.com wrote:

    The Shed Skin people would welcome some help.

        http://shed-skin.blogspot.com/

People?  It's one guy.  It apparently started out as a Master's thesis
as well. ;)

I am a great admirer of the Shed Skin project, and I would be as happy
as anyone to see it progress.  However, it seems to me that Shed Skin
is at a stage where what it really needs is plain old more work.  (I
don't want to call it grunt work, but it's things like more testing,
implementing more library support, maintaining the Windows build,
etc.  Very worthy and worthwhile work, but tough to pass off as
academic graduate work.)

To the original poster:  I am not sure if this is too ambitious for
your time frame, but one thing many real-world Python users would LOVE
is for some way to get rid of the GIL (while still retaining thread
safety, single-core performance, etc.).  If you could patch CPython to
make it GIL-less, I think you would have academically publishable
material as well as being a hero in the Python community. :D

John
-- 
http://mail.python.org/mailman/listinfo/python-list