Re: Queue cleanup

2010-09-12 Thread Gregory Ewing

Dennis Lee Bieber wrote:

greg.ew...@canterbury.ac.nz declaimed the following


So maybe we need to redesign the hardware.


Remember the iAPX-432?
http://en.wikipedia.org/wiki/Intel_iAPX_432#Garbage_collection


Not quite what I had in mind. That sounds like a conventional
GC algorithm that happens to be implemented in microcode. I'm
thinking about ways of designing the memory itself to help
with GC. Instead of putting all the smarts in the CPU, move
some of them into the RAM.

--
Greg
--
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-09-09 Thread John Nagle

On 9/8/2010 6:20 PM, Paul Rubin wrote:

What tricks on tricks?  Even the fanciest GC's are orders of magnitude
less complicated than any serious database, optimizing compiler, OS
kernel, file system, etc.  Real-world software is complicated.  Get used
to that fact, and look for ways to manage the complexity and keep
complex programs safe.  Choosing to program with unsafe methods because
you wish the complexity didn't exist is just deluded.


   Garbage collectors are difficult from a theoretical standpoint,
and it's very difficult to make a correct concurrent garbage collector
without using formal methods.  But garbage collectors are not large
pieces of code.

John Nagle
--
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-09-08 Thread Lawrence D'Oliveiro
In message 7xeid9gtuq@ruckus.brouhaha.com, Paul Rubin wrote:

 Lawrence D'Oliveiro l...@geek-central.gen.new_zealand writes:

 That reinforces my point, about how easy it was to check the correctness
 of the code. In this case one simple fix, like this ...
 would render the code watertight. See how easy it is?
 
 Well, no, it's irrelevant how easy it is to fix the issue after it's
 pointed out.

In that case, I can similarly discount your attempts to fix up issues with 
garbage collectors after they’re pointed out, can I not?

 Part of the problem is C itself.

And yet, what are these complicated garbage collectors, that you intend 
relying on to work correctly with all their layers of tricks upon tricks, 
written in? C.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-09-08 Thread Paul Rubin
Lawrence D'Oliveiro l...@geek-central.gen.new_zealand writes:
 In that case, I can similarly discount your attempts to fix up issues with 
 garbage collectors after they’re pointed out, can I not?

Adapting GC algorithms to improve their performance or better use the
capabilities of new hardware is much different than saying they didn't
work in the first place.  They've been around for 5 decades, they (like
Python programs) work fine if you don't mind the performance cost, and
for many applications that cost is acceptable even for quite simple and
naive GC's.  The continued work is to get that cost down even further.
(And before criticizing GC performance, Have you ever profiled CPython
to see how much time it's spending messing with reference counts?  I
didn't think so).

 Part of the problem is C itself.
 And yet, what are these complicated garbage collectors, that you intend 
 relying on to work correctly with all their layers of tricks upon tricks, 
 written in? C.

What tricks on tricks?  Even the fanciest GC's are orders of magnitude
less complicated than any serious database, optimizing compiler, OS
kernel, file system, etc.  Real-world software is complicated.  Get used
to that fact, and look for ways to manage the complexity and keep
complex programs safe.  Choosing to program with unsafe methods because
you wish the complexity didn't exist is just deluded.


It actually seems pretty crazy to me to write a garbage collector in C
today, even though a GC needs unsafe pointer operations in a few
places.  C made some sense in the 1980's when computers were smaller and
people didn't know better.  A lot of the C code around today is legacy
code from that era.  These days it makes more sense to use a safe
language with a few isolated jailbreaks (like an OS kernel that has a
few lines of assembly code) than to write the whole thing in a language
whose unsafety is everywhere.

Here's a good paper by R. Fateman (criticizing C) that I just came across:

   
http://www.franz.com/resources/educational_resources/white_papers/fault.prevention.pdf

He suggests using Lisp instead, but one can't have everything ;-).

FWIW, here are a couple pages about verifying GC's:

  http://flint.cs.yale.edu/flint/publications/hgc.html
  http://www.cs.technion.ac.il/~erez/Papers/verified-gc-popl09.pdf 
  etc.

I just don't understand that weird agenda you seem to have. But
whatever.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-09-08 Thread Steven D'Aprano
On Thu, 09 Sep 2010 12:41:20 +1200, Lawrence D'Oliveiro wrote:

 Part of the problem is C itself.
 
 And yet, what are these complicated garbage collectors, that you intend
 relying on to work correctly with all their layers of tricks upon
 tricks, written in? C.

Not necessarily.

Pascal, despite the contempt it is held in by university computer science 
departments, isn't quite dead, and some Pascal compilers use garbage 
collectors written in Pascal. FreePascal, I believe, is one of them.

Likewise for other not-dead-yet low-level languages like Ada and Forth. 
As surprising as it seems to many, C is not the only low-level language 
around suitable for writing high-quality, efficient code. Just ask the 
Lisp community, which is thriving. For some definition of thriving.

Admittedly C has had far more attention paid to it than the others, so [insert 
weasel words here] the best C compilers tend to produce more efficient 
code than the best of the others, but Pascal, Ada and similar give you 
more security than C.

I believe that when computer scientists of the future look back at the 
last few decades, they will judge that on balance C did more harm than 
good. Not that C is the only language in which people can write buggy or 
insecure code, but C does tend to give the bugs so much help... :)



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-09-07 Thread Paul Rubin
Lawrence D'Oliveiro l...@geek-central.gen.new_zealand writes:
 But you’ll notice that Free Software comes with no such
 restrictions. In fact, it is contrary to commonly-accepted Free
 Software guidelines to impose any sort of restrictions on areas of use.

Free software comes with an explicit disclaimer of liability (you get
what you pay for).  The Sun stuff ($) may have either an express or
implied warranty that could mean they get hit up for damages if the
software is wrong.  IANAL YMMV etc.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-09-07 Thread Gregory Ewing

Paul Rubin wrote:


Now extrapolate that error rate from 30 lines to a program the size of
Firefox (something like 5 MLOC), and you should see how fraught with
danger that style of programming is.


But you don't write 5 MLOC of code using that programming style.
You use it to write a small core something along the lines of,
oh, I don't know, a Python interpreter, and then write the rest
of the code on top of that platform.

--
Greg
--
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-09-07 Thread Gregory Ewing

Lawrence D'Oliveiro wrote:

But alone of all of these, garbage collection still remains just as costly 
to implement as ever. That should tell you something about how poorly it 
matches the characteristics of modern computing hardware.


So maybe we need to redesign the hardware.

Perhaps reference counts could be stored in their own
special area of memory, updated in parallel with main
memory accesses so that they don't slow anything down.
Make it multiported so that all your cores can access
it at once without locking. Maybe even build it out of
counters instead of registers, so there's no data bus,
only an address bus and inc/dec control lines.

--
Greg
--
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-09-06 Thread Paul Rubin
John Nagle na...@animats.com writes:
 I've argued for an approach in which only synchronized or immutable
 objects can be shared between threads.  Then, only synchronized objects
 have refcounts.  See
 http://www.animats.com/papers/languages/pythonconcurrency.html

I'm going to have to read this carefully, but my first reaction is that
doing it right would take substantial changes to the language, to the
point where it wouldn't really be Python any more.  

More generally, if you want to program in Erlang, why not use Erlang for
that?

 I can't think of a time when I've really had to use a finalizer for
 something with dynamic extent.  They've always seemed like a code
 smell to me.

   The problem appears when you have an object that owns something, like
a window or a database connection.  "With" is single-level.

But expecting such an object to be gc'd seems like a code smell in its
own right.  I once implemented something like that in a Lisp system, and
it was just disconcerting as hell to see windows on the screen blink out
of existence unpredictably.  The issue is maybe your function returns
and you expect the window to vanish, but something somewhere has saved
another temporary reference to it so it doesn't go away til later.  If
you want something with external semantics (like a window or file
handle) to be released at a particular time, the program should do that
explicitly.  Don't use a finalizer that you expect storage reclamation
to invoke.  There is just too little control in a Python program over
the creation of references. 
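
To make that concrete, here's a toy Python sketch (Window and its
methods are invented for the example):

class Window(object):
    """Hypothetical stand-in for something with external semantics."""
    def __init__(self, title):
        self.title = title
        self.closed = False
    def close(self):
        self.closed = True          # deterministic, explicit release
    def __del__(self):
        if not self.closed:         # finalizer runs whenever the last
            self.close()            # reference happens to disappear

saved = []

def show_dialog():
    w = Window("demo")
    saved.append(w)    # something somewhere keeps another reference...
    # ...so when this function returns, __del__ does NOT run and the
    # window lingers until 'saved' lets go of it.

def show_dialog_explicitly():
    w = Window("demo")
    try:
        pass           # ... work with the window ...
    finally:
        w.close()      # released right here, stray references or not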
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-09-05 Thread Paul Rubin
John Nagle na...@animats.com writes:
 Unoptimized reference counting, which is what CPython does, isn't
 all that great either.  The four big bottlenecks in Python are boxed
 numbers, attribute lookups, reference count updates, and the GIL.

The performance hit of having to lock the refcounts before update has
been the historical reason for keeping the GIL.  The LOCK prefix takes
something like 100 cycles on an x86.  Is optimizing the refcount updates
going to come anywhere near making up for that?

Python's with statement as an approach to RAII has seemed ok to me.  I
can't think of a time when I've really had to use a finalizer for
something with dynamic extent.  They've always seemed like a code smell
to me.
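
For instance (using sqlite3 purely as a stand-in for an object that
owns an external resource):

import sqlite3
from contextlib import contextmanager

@contextmanager
def connection(path):
    conn = sqlite3.connect(path)    # acquire
    try:
        yield conn
    finally:
        conn.close()                # release, even on an exception

with connection(":memory:") as conn:
    conn.execute("create table t (x integer)")
# the connection is closed here, deterministically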
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-09-05 Thread John Nagle

On 9/4/2010 11:51 PM, Paul Rubin wrote:

John Nagle na...@animats.com writes:

 Unoptimized reference counting, which is what CPython does, isn't
all that great either.  The four big bottlenecks in Python are boxed
numbers, attribute lookups, reference count updates, and the GIL.


The performance hit of having to lock the refcounts before update has
been the historical reason for keeping the GIL.  The LOCK prefix takes
something like 100 cycles on an x86.  Is optimizing the refcount updates
going to come anywhere near making up for that?


I've argued for an approach in which only synchronized or immutable
objects can be shared between threads.  Then, only synchronized objects
have refcounts.  See
http://www.animats.com/papers/languages/pythonconcurrency.html

Guido doesn't like it.  He doesn't like any restrictions.
So we're stuck dragging around the boat anchor.
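
For flavor, here's a toy sketch of the synchronized-objects idea in
present-day Python (the Synchronized wrapper is invented; real support
would need language changes):

import threading

class Synchronized(object):
    # Toy wrapper: every access to the wrapped object goes through one
    # lock.  Under the proposal only objects like this (or immutable
    # ones) could be shared between threads, and only they would need
    # thread-safe refcount updates.
    def __init__(self, obj):
        self._obj = obj
        self._lock = threading.Lock()
    def apply(self, func, *args):
        with self._lock:
            return func(self._obj, *args)

shared = Synchronized([])
shared.apply(lambda lst, item: lst.append(item), 42)  # safe from any thread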

I'd hoped that the Unladen Swallow people might come up with some
really clever solution, but they seem to be stuck.  It's been almost
a year since the last quarterly release.  Maybe Google is putting their
effort into Go.

What's so striking is that Shed Skin can deliver 20x to 60x
speedups over CPython, while PyPy and Unladen Swallow have
trouble getting 1.5x.  The question is how much one has to
restrict the language to get a serious performance improvement.


Python's with statement as an approach to RAII has seemed ok to me.  I
can't think of a time when I've really had to use a finalizer for
something with dynamic extent.  They've always seemed like a code smell
to me.


   The problem appears when you have an object that owns something, like
a window or a database connection.  "With" is single-level.

John Nagle

--
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-09-04 Thread Paul Rubin
Dennis Lee Bieber wlfr...@ix.netcom.com writes:
   Not to mention having to ensure that one finds ALL the references to
 the object so that they can be updated to the new address! Somehow I
 don't see a C compiler being smart enough to find intermediary pointer ...

We're not talking about C compilers, which can cast any value at all
into pointers.  Languages designed for garbage collection are normally
type-safe and gc is a well-understood problem, though (like compilers)
the higher-performing ones are complicated.  But, nothing in principle
stops anyone from implementing Python with such methods.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-09-04 Thread Aahz
[gc]

In article 7x7hj2kyd6@ruckus.brouhaha.com,
Paul Rubin  no.em...@nospam.invalid wrote:

A minimal naive implementation indeed doubles the memory requirements,
but from a Python perspective where every integer takes something like
24 bytes already, even that doesn't seem so terrible.  

Many people still use 32-bit Python -- an int is twelve bytes there.
-- 
Aahz (a...@pythoncraft.com)   * http://www.pythoncraft.com/

...if I were on life-support, I'd rather have it run by a Gameboy than a
Windows box.  --Cliff Wells
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-09-04 Thread Paul Rubin
Lawrence D'Oliveiro l...@geek-central.gen.new_zealand writes:
 That reinforces my point, about how easy it was to check the correctness of 
 the code. In this case one simple fix, like this ...
 would render the code watertight. See how easy it is?

Well, no, it's irrelevant how easy it is to fix the issue after it's
pointed out.  What matters is how easy it was to create it in the first
place.  You posted a 30-line code sample as obviously free of memory
leaks, but even a good programmer like you didn't notice that it had the
potential for a nasty memory overwrite error after an unbalanced decref.
Keep in mind that a memory leak usually just means the program can
eventually bog down and stops working, but an overwrite error is often a
security hole that can lead to total compromise of your entire computer.
Now extrapolate that error rate from 30 lines to a program the size of
Firefox (something like 5 MLOC), and you should see how fraught with
danger that style of programming is.  Even the most skilled and careful
programmers are going to slip up once in a while.

Part of the problem is C itself.  No language can eliminate many
complicated bugs without creating severe usability problems, but good
languages (unlike C) can eliminate most silly bugs.  I had a dark
night of the soul after reading some of the bug-finding papers at

  http://www.stanford.edu/~engler/

and have been terrified of C ever since.  I'm just always skeptical when
anyone says they're sure any piece of C code is obviously bug-free.
It's quite easy to get overconfident about it (as I've done myself more
than once).  I spent about 5 minutes reviewing your patched code (and
the underlying implementations of the C API functions it calls) and
didn't see any other issues, and the code is probably ok now, but I'd
have to spend a lot more time tracing through the API layer before I
could be really sure.

Anyway, you should check your patch into github if you haven't.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-09-04 Thread John Nagle

On 8/28/2010 5:42 AM, Aahz wrote:

In article 4c78572c$0$28655$c3e8...@news.astraweb.com,
Steven D'Aprano st...@remove-this-cybersource.com.au wrote:

On Fri, 27 Aug 2010 09:16:52 -0700, Aahz wrote:

In article mailman.1967.1281549328.1673.python-l...@python.org, MRAB
pyt...@mrabarnett.plus.com  wrote:


An object will be available for garbage collection when nothing refers
to it either directly or indirectly. If it's unreferenced then it will
go away.


This isn't actually garbage collection as most people think of it.
Refcounting semantics mean that objects get reaped as soon as nothing
points at them.  OTOH, CPython does also have garbage collection to back
up refcounting so that when you have unreferenced object cycles they
don't stay around.


I've repeatedly asked, both here and elsewhere, why reference counting
isn't real garbage collection. Nobody has been able to give me a
satisfactory answer. As far as I can tell, it's a bit of pretentiousness
with no basis in objective fact.


You'll notice that I was very careful to qualify my statement with "as
most people think of it".  Also, because CPython has two different memory
management mechanisms, refcounting and cycle detection, and the module
that controls cycle detection is called gc, I think it's simpler to
follow along with the Python docs -- and critically important to remind
people that there are in fact two different systems.


   Personally, I'd like to have reference counting only, an enforced
prohibition on loops (backpointers must be weak pointers), RAII,
and reliably ordered finalization.

   A big advantage of reference counting is that finalization happens
in the thread that releases the object, and in the correct order.
GC and finalization/destructors do not play well together at all.
Microsoft once tried to get the hard cases to work right.  See
managed C++.  Not a happy story.

John Nagle

--
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-09-04 Thread Lawrence D'Oliveiro
In message mailman.434.1283568372.29448.python-l...@python.org, MRAB 
wrote:

 Lawrence D'Oliveiro l...@geek-central.gen.new_zealand writes:

 Wonder why Sun’s licence explicitly forbade its use in danger-critical
 areas like nuclear power plants and the like, then?
 
 I thought it was just that if it wasn't explicitly forbidden then
 someone might try to use it and then sue if something went wrong, even
 though common sense would have said that it was a bad idea in the first
 place! :-)

But you’ll notice that Free Software comes with no such restrictions. In 
fact, it is contrary to commonly-accepted Free Software guidelines to impose 
any sort of restrictions on areas of use.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-09-04 Thread Lawrence D'Oliveiro
In message 4c82b097$0$1661$742ec...@news.sonic.net, John Nagle wrote:

 Personally, I'd like to have reference counting only, an enforced
 prohibition on loops (backpointers must be weak pointers), RAII,
 and reliably ordered finalization.

Is there a cheap way of checking at runtime for circular structures?

 A big advantage of reference counting is that finalization happens
 in the thread that releases the object, and in the correct order.
 GC and finalization/destructors do not play well together at all.
 Microsoft once tried to get the hard cases to work right.  See
 managed C++.  Not a happy story.

Thank you for that. Another arrow for my anti-GC quiver. :)
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-09-04 Thread Lawrence D'Oliveiro
In message 7x7hj2kyd6@ruckus.brouhaha.com, Paul Rubin wrote:

 Lawrence D'Oliveiro l...@geek-central.gen.new_zealand writes:

 In message 7xmxs2uez1@ruckus.brouhaha.com, Paul Rubin wrote:

 GC's for large systems generally don't free (or even examine) individual
 garbage objects.  They copy the live objects to a new contiguous heap
 without ever touching the garbage, and then they release the old heap.

 And suddenly you’ve doubled the memory requirements. And on top of that,
 since you’re moving the valid objects into different memory, you’re
 forcing cache misses on all of them as well.
 
 A minimal naive implementation indeed doubles the memory requirements,
 but from a Python perspective where every integer takes something like
 24 bytes already, even that doesn't seem so terrible.

Doubling 24 is less terrible than doubling 4 or 8?? You’re kidding, right?

 More sophisticated implementations use multiple small heaps or other
 tricks.

More and more complications to patch up the idea. At some point, you have to 
admit there is something fundamentally flawed about the whole concept.

 The new heap is filled sequentially so accesses to it will have good
 locality.

Unfortunately, that’s not how locality of reference works. It doesn’t matter 
whether the objects being accessed are close together in memory or far apart 
(not with modern fully-associative caches, anyway), what does matter is the 
frequency distribution of references, namely that the vast majority of 
references are to a tiny minority of objects.

Your generational garbage collector completely breaks this assumption, by 
regularly forcing an access to every single object in the heap. Cache-
thrashing, anyone?

 It's also the case that programs with very large memory consumption tend
 to use most of the memory for large arrays that don't contain pointers
 (think of a database server with a huge cache).  That means the gc
 doesn't really have to think about all that much of the memory.

But your generational garbage collector still has to copy those very large 
objects to the new heap, with all the cache-hostile consequences therefrom.

By the way, isn’t this the opposite of the array-of-pointers example you 
were using earlier to try to cast reference-counting in a bad light? It 
seems to me a reference count would work very well for such a large, simple 
object.

 This is the continuing problem with garbage collection: all the attempts
 to make it cheaper just end up moving the costs somewhere else.
 
 Same thing with manual allocation.  That moves the costs off the
 computer and onto the programmer.  Not good, most of the time.

Unfortunately, your argument falls down. It is a truism that hardware costs 
continue to come down, while programmers remain expensive. As I said before, 
computing performance has improved by something like five orders of 
magnitude over the last half-century. This has rendered all kinds of 
techniques, like high-level languages, dynamic memory allocation, 
stacks, hardware floating-point, memory protection and so on, which were 
once considered controversial because of their expense, cheap enough to 
become commonplace.

But not garbage collection. This is because of the asymmetric way in which 
hardware has become faster: the biggest improvements have been in the parts 
that were already the fastest to begin with (the CPU), while RAM speeds have 
improved much less, and backing-store speeds least of all. Hence the need 
for intermediate layers of cache to bridge the gap. But the effectiveness of 
that caching depends crucially on certain assumptions about the runtime 
behaviour of the programs: and garbage collection breaks those assumptions.

 Really, I'm no gc expert, but the stuff you're saying about gc is quite
 ill-informed.  You might want to check out some current literature.

You may want to enlighten yourself by meditating on this seeming paradox of 
modern computing hardware: memory is cheap, but accessing memory is 
expensive.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-09-04 Thread Paul Rubin
Lawrence D'Oliveiro l...@geek-central.gen.new_zealand writes:
 A minimal naive implementation indeed doubles the memory requirements,
 but from a Python perspective where every integer takes something like
 24 bytes already, even that doesn't seem so terrible.

 Doubling 24 is less terrible than doubling 4 or 8?? You’re kidding, right?

No, it would be doubling 4 or 8 bytes.  The extra overhead like the
reference count would not be there to bloat up the integer like in
Python.

 More sophisticated implementations use multiple small heaps or other tricks.
 More and more complications to patch up the idea. At some point, you have to 
 admit there is something fundamentally flawed about the whole concept.

Oh sheesh, that's silly.  Every area of programming where performance
matters is full of optimization tricks.  Look at any serious database
implementation for example.  Or any compiler.  Look at Python's
implementation of dictionaries.  Yeah, the optimizations add complexity
to improve performance, sometimes even in heuristic ways that can fail.
That doesn't mean the concepts are fundamentally flawed.  GC is no
different.

 The new heap is filled sequentially so accesses to it will have good
 locality.
 what does matter is the frequency distribution of references,

Sorry, just I meant during the gc operation itself.  The gc's access
pattern in the new heap is completely sequential as the gc just copies
stuff to it linearly from from the old heap, bumping a pointer upwards.
The access pattern when the user application is running is of course not
predictable.

 Your generational garbage collector completely breaks this assumption, by 
 regularly forcing an access to every single object in the heap. Cache-
 thrashing, anyone?

In the minor collections, the whole minor heap fits in cache, so there's
no thrashing.  The major collections can use a different strategy, or
you can simply rely on their relative infrequency.  Why do you speculate
like this?  If you run a GHC program with profiling active, it tells you
exactly how much time is spent in minor gc and how much time is in major
gc, and it's all generally pretty tolerable unless your program has
bugs.  (Unfortunately Haskell programs are notoriously susceptible to a
certain type of bug that causes them to still give the right answers,
but use much more memory than they should.  The usual sign of that
happening is high gc load).

 It's also the case that programs with very large memory consumption tend
 to use most of the memory for large arrays that don't contain pointers
 But your generational garbage collector still has to copy those very large 
 objects to the new heap, with all the cache-hostile consequences therefrom.

Not necessarily, depends on how you write the program and how the gc works.

 By the way, isn’t this the opposite of the array-of-pointers example you 
 were using earlier to try to cast reference-counting in a bad light?

I wasn't trying to cast reference counting in a bad light, I was
pointing out that reference counting can experience pauses just like
traditional gc approaches.  Most programs including soft real time
programs can tolerate an occasional pause.  If your program is not one
of those, and you need guaranteed upper bounds on pauses so you can't
use traditional gc, switching from gc to reference counting won't save
you.

 It seems to me a reference count would work very well for such a
 large, simple object.

Mark/sweep would do it too.  Some gc's use a hybrid approach, with
mark/sweep for older or larger objects.
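
In skeleton form, mark/sweep is about this simple (toy objects,
invented for the example):

class Obj(object):
    def __init__(self):
        self.refs = []          # outgoing pointers
        self.marked = False

def mark(roots):
    stack = list(roots)
    while stack:
        o = stack.pop()
        if not o.marked:
            o.marked = True
            stack.extend(o.refs)

def sweep(heap):
    live = [o for o in heap if o.marked]
    for o in live:
        o.marked = False        # reset for the next collection
    return live                 # everything not marked is garbage

a, b, c = Obj(), Obj(), Obj()
a.refs.append(b)                # c is unreachable from the root
heap = [a, b, c]
mark([a])
heap = sweep(heap)              # heap is now [a, b]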

 But not garbage collection. This is because of the asymmetric way in
 which hardware has become faster:...  the effectiveness of that
 caching depends crucially on certain assumptions about the runtime
 behaviour of the programs: and garbage collection breaks those
 assumptions. ...
 You may want to enlighten yourself by meditating on this seeming paradox of 
 modern computing hardware: memory is cheap, but accessing memory is 
 expensive.

I'm interested in receiving enlightment if you've got some pointers into
the research literature that back up your views.  Right now it sounds
like you're going by some intuitions you have that aren't backed up by
evidence.  Anyone who has performance-tuned a program knows that
intuition can give reasonable starting points for experiments and
measurements, but once the results start coming in, a lot of the
intuitions end up invalidated.  

By now, there is enough experimental and theoretical literature about gc
that opinions like yours, that don't seem to be grounded in any
knowledge of that literature, are not very persuasive no matter how
superficially attractive the raw intuition might be.  Especially in your
case, where you seem to have decided ahead of time what conclusion you
want to reach and are looking for ways to justify it.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-09-04 Thread John Nagle

On 9/4/2010 6:44 PM, Lawrence D'Oliveiro wrote:

In message 4c82b097$0$1661$742ec...@news.sonic.net, John Nagle wrote:


 Personally, I'd like to have reference counting only, an enforced
prohibition on loops (backpointers must be weak pointers), RAII,
and reliably ordered finalization.


Is there a cheap way of checking at runtime for circular structures?


   It's an interesting technical problem to design a system where
circular references are detected immediately, at the moment
of creation.  However, Python already detects loops during
garbage collection.  If you set

gc.set_debug(gc.DEBUG_SAVEALL)

all the loops show up in gc.garbage.
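
A quick demonstration (Python 2.x, as in the rest of this thread):

import gc

class Node(object):
    pass

a = Node(); b = Node()
a.partner = b; b.partner = a     # a reference cycle
del a, b                         # refcounts never reach zero

gc.set_debug(gc.DEBUG_SAVEALL)
print gc.collect()               # number of unreachable objects found
print gc.garbage                 # the cycle, saved instead of freed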


 A big advantage of reference counting is that finalization happens
in the thread that releases the object, and in the correct order.
GC and finalization/destructors do not play well together at all.
Microsoft once tried to get the hard cases to work right.  See
managed C++.  Not a happy story.


Thank you for that. Another arrow for my anti-GC quiver. :)


Unoptimized reference counting, which is what CPython does, isn't
all that great either.  The four big bottlenecks in Python are boxed
numbers, attribute lookups, reference count updates, and the GIL.

John Nagle

--
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-09-03 Thread Lawrence D'Oliveiro
In message 7xvd6sv0n4@ruckus.brouhaha.com, Paul Rubin wrote:

 Lawrence D'Oliveiro l...@geek-central.gen.new_zealand writes:
  AddrObj = PyTuple_GetItem(TheBufferInfo, 0);
  LenObj = PyTuple_GetItem(TheBufferInfo, 1);
 
 the first PyTuple_GetItem succeeds and the second one fails.

 Admittedly, I did take a shortcut here: array.buffer_info returns a tuple
 of two items, so I’m not expecting one GetItem to succeed and the other
 to fail.
 
 FromArray is a parameter to the function, with no type check to make
 sure it's really an array.  In fact your code allows for the possibility
 that it doesn't support the buffer_info operation (if I understand the
 purpose of the null return check after the PyObject_CallMethod) which
 means it's prepared for the argument to -not- be an array.

That reinforces my point, about how easy it was to check the correctness of 
the code. In this case one simple fix, like this

diff --git a/spuhelper.c b/spuhelper.c
index 83fd4eb..2ba8197 100644
--- a/spuhelper.c
+++ b/spuhelper.c
@@ -151,10 +151,12 @@ static void GetBufferInfo
if (TheBufferInfo == 0)
break;
AddrObj = PyTuple_GetItem(TheBufferInfo, 0);
-   LenObj = PyTuple_GetItem(TheBufferInfo, 1);
if (PyErr_Occurred())
break;
Py_INCREF(AddrObj);
+   LenObj = PyTuple_GetItem(TheBufferInfo, 1);
+   if (PyErr_Occurred())
+   break;
Py_INCREF(LenObj);
*addr = PyInt_AsUnsignedLongMask(AddrObj);
*len = PyInt_AsUnsignedLongMask(LenObj);

would render the code watertight. See how easy it is?
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-09-03 Thread Lawrence D'Oliveiro
In message 7xiq2que93@ruckus.brouhaha.com, Paul Rubin wrote:

 Lawrence D'Oliveiro l...@geek-central.gen.new_zealand writes:

 Refcounting is susceptible to the same pauses for reasons already
 discussed.

 Doesn’t seem to happen in the real world, though.
 
 def f(n):
 from time import time
 a = [1] * n
 t0 = time()
 del a
 t1 = time()
 return t1 - t0
 
 for i in range(9):
     print i, f(10**i)
 
 
 on my system prints:
 
 0 2.86102294922e-06
 1 2.14576721191e-06
 2 3.09944152832e-06
 3 1.00135803223e-05
 4 0.000104904174805
 5 0.00098991394043
 6 0.00413608551025
 7 0.037693977356
 8 0.362598896027
 
 Looks pretty linear as n gets large.  0.36 seconds (the last line) is a
 noticable pause.

Which just proves the point. You had to deliberately set up the situation to 
make that happen. And it remains just as easy to pinpoint where it is 
happening, so you can control it.

With a garbage collector, you don’t have that control. Even if you try to 
avoid freeing a single large structure at once, it’s still liable to batch 
up a lot of small objects to free at once, so the problem can still happen.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-09-03 Thread Lawrence D'Oliveiro
In message 7xmxs2uez1@ruckus.brouhaha.com, Paul Rubin wrote:

 Lawrence D'Oliveiro l...@geek-central.gen.new_zealand writes:

 Whereas garbage collection will happen at some indeterminate time long
 after the last access to the object, when it very likely will no longer
 be in the cache, and have to be brought back in just to be freed,
 
 GC's for large systems generally don't free (or even examine) individual
 garbage objects.  They copy the live objects to a new contiguous heap
 without ever touching the garbage, and then they release the old heap.

And suddenly you’ve doubled the memory requirements. And on top of that, 
since you’re moving the valid objects into different memory, you’re forcing 
cache misses on all of them as well.

This is the continuing problem with garbage collection: all the attempts to 
make it cheaper just end up moving the costs somewhere else.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-09-03 Thread Lawrence D'Oliveiro
In message 7xr5heufhb@ruckus.brouhaha.com, Paul Rubin wrote:

 Java has considerably greater reputation for reliability than C or C++.

Wonder why Sun’s licence explicitly forbade its use in danger-critical areas 
like nuclear power plants and the like, then?

 Ada is a different story, but Ada programs (because of the application
 area Ada is used in) tend not to use a lot of dynamic memory allocation
 in the first place.  A little googling shows there are GC extensions
 available for Ada, though I don't know if they are used much.

Let’s put it this way: the life-support system on the International Space 
Station is written in Ada. Would you trust your life to code written in 
Java?
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-09-03 Thread Paul Rubin
Lawrence D'Oliveiro l...@geek-central.gen.new_zealand writes:
 Java has considerably greater reputation for reliability than C or C++.

 Wonder why Sun’s licence explicitly forbade its use in danger-critical
 areas like nuclear power plants and the like, then?

Probably because Sun lawyers demanded it.  Is there a Sun C or C++
compiler with a license that doesn't have that restriction?  Even if
there is, it just means those languages are so unreliable that the
lawyers felt confident that any meltdown could be blamed on a bug in the
user's code rather than the compiler ;-).

 Let’s put it this way: the life-support system on the International Space 
 Station is written in Ada. Would you trust your life to code written in 
 Java?

The scary thing is I don't know whether I'm already doing that.  Life
support systems have hard real-time requirements (Ada's forte) but I'd
expect lots of military decision-support systems are written in Java.
Maybe one of them will raise a false alert and somebody will launch a
war.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-09-03 Thread MRAB

On 04/09/2010 03:21, Paul Rubin wrote:

Lawrence D'Oliveiro l...@geek-central.gen.new_zealand writes:

Java has considerably greater reputation for reliability than C or C++.


Wonder why Sun’s licence explicitly forbade its use in danger-critical
areas like nuclear power plants and the like, then?


Probably because Sun lawyers demanded it.  Is there a Sun C or C++
compiler with a license that doesn't have that restriction?  Even if
there is, it just means those languages are so unreliable that the
lawyers felt confident that any meltdown could be blamed on a bug in the
user's code rather than the compiler ;-).


Let’s put it this way: the life-support system on the International Space
Station is written in Ada. Would you trust your life to code written in
Java?


The scary thing is I don't know whether I'm already doing that.  Life
support systems have hard real-time requirements (Ada's forte) but I'd
expect lots of military decision-support systems are written in Java.
Maybe one of them will raise a false alert and somebody will launch a
war.


I thought it was just that if it wasn't explicitly forbidden then
someone might try to use it and then sue if something went wrong, even
though common sense would have said that it was a bad idea in the first
place! :-)
--
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-09-03 Thread Paul Rubin
Lawrence D'Oliveiro l...@geek-central.gen.new_zealand writes:
 GC's for large systems generally don't free (or even examine) individual
 garbage objects.  They copy the live objects to a new contiguous heap
 without ever touching the garbage, and then they release the old heap.

 And suddenly you’ve doubled the memory requirements. And on top of that, 
 since you’re moving the valid objects into different memory, you’re forcing 
 cache misses on all of them as well.

A minimal naive implementation indeed doubles the memory requirements,
but from a Python perspective where every integer takes something like
24 bytes already, even that doesn't seem so terrible.  More
sophisticated implementations use multiple small heaps or other tricks.
It still costs something in memory footprint, but less than the minimal
implementation's 2x cost.

The new heap is filled sequentially so accesses to it will have good
locality.  You do have to jump around within the old heap, but again,
with generational schemes, in the more frequent collections, the old
heap fits entirely in cache.  For example, GHC's minor heap size is
256kB.  For major collections, GHC switches (or used to) from copying to
a mark/compact scheme once the amount of live data in the heap goes over
some amount, giving the best of both worlds.

It's also the case that programs with very large memory consumption tend
to use most of the memory for large arrays that don't contain pointers
(think of a database server with a huge cache).  That means the gc
doesn't really have to think about all that much of the memory.

 This is the continuing problem with garbage collection: all the attempts to 
 make it cheaper just end up moving the costs somewhere else.

Same thing with manual allocation.  That moves the costs off the
computer and onto the programmer.  Not good, most of the time.

Really, I'm no gc expert, but the stuff you're saying about gc is quite
ill-informed.  You might want to check out some current literature.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-09-02 Thread John Nagle

On 8/30/2010 12:22 AM, Paul Rubin wrote:

I guess that is how the so-called smart pointers in the Boost C++
template library work.  I haven't used them so I don't have personal
experience with how convenient or reliable they are, or what kinds of
constraints they imposed on programming style.  I've always felt a bit
suspicious of them though, and I seem to remember Alex Martelli (I hope
he shows up here again someday) advising against using them.


   Smart pointers in C++ have never quite worked right.  They
almost work.  But there always seems to be something that needs
access to a raw C pointer, which breaks the abstraction.
The mold keeps creeping through the wallpaper.

   Also, since they are a bolt-on at the macro level in C++,
reference count updates aren't optimized and hoisted out of
loops.  (They aren't in CPython either, but there have been reference
counted systems that optimize out most reference count updates.)

John Nagle
--
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-09-02 Thread Paul Rubin
Dennis Lee Bieber wlfr...@ix.netcom.com writes:
 GC's for large systems ... copy the live objects to a new contiguous heap

   That sounds suspiciously like the original Macintosh OS, with its
 handles... IE, double-indirection. 

Nah, a double indirection on every access would be a terrible
performance hit.  The classic approach is when you move an object to the
new heap, you leave a tagged forwarding pointer at its former location
in the old heap, giving its location in the new heap.  Then as you move
other objects, you dereference the pointers in them to see whether they
point to moved or unmoved objects, and relocate any unmoved ones.  A
more complete explanation is here:

http://mitpress.mit.edu/sicp/full-text/book/book-Z-H-33.html#%_sec_5.3.2 
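
In skeleton form the scheme looks something like this (toy Python,
all names invented):

class Cell(object):
    def __init__(self, refs=()):
        self.refs = list(refs)   # outgoing pointers
        self.forward = None      # forwarding pointer, set when moved

def move(obj, new_heap):
    if obj.forward is None:      # first visit: copy, and leave a
        clone = Cell(obj.refs)   # forwarding pointer behind
        new_heap.append(clone)
        obj.forward = clone
    return obj.forward           # later visits just follow it

def collect(roots, old_heap):
    new_heap = []
    roots = [move(r, new_heap) for r in roots]
    scan = 0
    while scan < len(new_heap):  # Cheney scan: the new heap is its
        obj = new_heap[scan]     # own breadth-first work queue
        obj.refs = [move(r, new_heap) for r in obj.refs]
        scan += 1
    return roots, new_heap       # garbage in old_heap is never touched

live = Cell(); root = Cell([live, live])
roots, heap = collect([root], [root, live, Cell()])  # third cell is garbage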
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-09-01 Thread Paul Rubin
Lawrence D'Oliveiro l...@geek-central.gen.new_zealand writes:
 And yet Java code, for example, doesn’t have a reputation for greater 
 reliability compared to, say code written in Ada or C++, or even C. What is 
 the Java runtime written in? C. Why not use Java, if there is no speed 
 penalty in doing so?

The Java runtime (such as the garbage collector) needs untyped pointers
and can't be written in Java.  It might be possible to write a type-safe
GC in something like ATS, which has extremely precise types, but that is
almost alien technology by comparison to writing in C.

Java has considerably greater reputation for reliability than C or C++.
Ada is a different story, but Ada programs (because of the application
area Ada is used in) tend not to use a lot of dynamic memory allocation
in the first place.  A little googling shows there are GC extensions
available for Ada, though I don't know if they are used much.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-09-01 Thread Paul Rubin
Lawrence D'Oliveiro l...@geek-central.gen.new_zealand writes:
 Whereas garbage collection will happen at some indeterminate time long after 
 the last access to the object, when it very likely will no longer be in the 
 cache, and have to be brought back in just to be freed, 

GC's for large systems generally don't free (or even examine) individual
garbage objects.  They copy the live objects to a new contiguous heap
without ever touching the garbage, and then they release the old heap.
That has the effect of improving locality, since the new heap is
compacted and has no dead objects.  The algorithms are generational
(they do frequent gc's on recently-created objects and less frequent
ones on older objects), so minor gc operations are on regions that fit
in cache, while major ones might have cache misses but are infrequent.

Non-compacting reference counting (or simple mark/sweep gc) has much
worse fragmentation and consequently worse cache locality than
copying-style gc.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-09-01 Thread Paul Rubin
Lawrence D'Oliveiro l...@geek-central.gen.new_zealand writes:
 Refcounting is susceptible to the same pauses for reasons already
 discussed.

 Doesn’t seem to happen in the real world, though.

def f(n):
from time import time
a = [1] * n
t0 = time()
del a
t1 = time()
return t1 - t0

for i in range(9):
   print i, f(10**i)


on my system prints:

0 2.86102294922e-06
1 2.14576721191e-06
2 3.09944152832e-06
3 1.00135803223e-05
4 0.000104904174805
5 0.00098991394043
6 0.00413608551025
7 0.037693977356
8 0.362598896027

Looks pretty linear as n gets large.  0.36 seconds (the last line) is a
noticable pause.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-08-31 Thread Lawrence D'Oliveiro
In message 7xtymbzixt@ruckus.brouhaha.com, Paul Rubin wrote:

 It's pretty well established by now that GC doesn't have any significant
 speed penalty compared with manual allocation.  It does consume more
 memory, which is acceptable a lot of the time.  It certainly leads to
 more reliable programs.

And yet Java code, for example, doesn’t have a reputation for greater 
reliability compared to, say code written in Ada or C++, or even C. What is 
the Java runtime written in? C. Why not use Java, if there is no speed 
penalty in doing so?
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-08-31 Thread Lawrence D'Oliveiro
In message 7x39tz42fd@ruckus.brouhaha.com, Paul Rubin wrote:

 Dennis Lee Bieber wlfr...@ix.netcom.com writes:

 Heap marking, OTOH, tends to run at indeterminate times, which could
 have an impact if one needs predictable response timings
 
 Reference counting has the same problem.  If you drop the last reference
 to a complex structure, it could take quite a long time to free all the
 components.

One difference is the interaction with caching behaviour. When a reference-
counted object is freed, the odds are that happens fairly soon after the 
last access, so the object will still be in the CPU cache, and access will 
be fast.

Whereas garbage collection will happen at some indeterminate time long after 
the last access to the object, when it very likely will no longer be in the 
cache, and have to be brought back in just to be freed, quite likely bumping 
out something else that the running program needs to access.

This is one reason why garbage collection is still considered an expensive 
technique. Computing power has improved by something like five orders of 
magnitude over the last half-century, making possible all kinds of 
productivity-enhancing techniques that were once considered expensive to 
become commonplace: high-level languages, dynamic memory allocation, stacks, 
hardware floating-point, memory protection and so on.

But alone of all of these, garbage collection still remains just as costly 
to implement as ever. That should tell you something about how poorly it 
matches the characteristics of modern computing hardware.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-08-31 Thread Lawrence D'Oliveiro
In message 7xmxs4uzjl@ruckus.brouhaha.com, Paul Rubin wrote:

 Gregory Ewing greg.ew...@canterbury.ac.nz writes:

 I'd be disappointed if CPython ditched refcounting and
 then became unsuitable for real-time games as a result.
 
 Refcounting is susceptible to the same pauses for reasons already
 discussed.

Doesn’t seem to happen in the real world, though.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-08-30 Thread Paul Rubin
Lawrence D'Oliveiro l...@geek-central.gen.new_zealand writes:
 static void GetBufferInfo
   ( ...
 do /*once*/
   {
 TheBufferInfo = PyObject_CallMethod(FromArray, "buffer_info", "");
 if (TheBufferInfo == 0)
 break;
 AddrObj = PyTuple_GetItem(TheBufferInfo, 0);
 LenObj = PyTuple_GetItem(TheBufferInfo, 1);
 if (PyErr_Occurred())
 break;
 ...
 Py_INCREF(AddrObj);
 Py_INCREF(LenObj);
   }
 while (false);
 Py_XDECREF(AddrObj);
 Py_XDECREF(LenObj);
 Py_XDECREF(TheBufferInfo);
   } /*GetBufferInfo*/

 It’s quite easy to assure yourself that this is never going to leak memory. 

Actually that code looks suspicious.  Suppose in

 AddrObj = PyTuple_GetItem(TheBufferInfo, 0);
 LenObj = PyTuple_GetItem(TheBufferInfo, 1);

the first PyTuple_GetItem succeeds and the second one fails.  Then
AddrObj is a borrowed reference to the first tuple element and LenObj is
null, the error flag is set, so you break out of the do/while.  You then
decrement the refcount of AddrObj even though you didn't increment it.
Maybe there's an explanation that makes it ok somehow, but it still
looks messy.  This is the kind of problem I'm referring to in general.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-08-30 Thread Paul Rubin
Steven D'Aprano steve-remove-t...@cybersource.com.au writes:
 I'm not saying that ref counting systems can avoid incrementing and 
 decrementing the ref counts. That would be silly. But I'm saying that it 
 is an accident of implementation that writing C extensions requires you 
 to manually manage the counts yourself. That's a side-effect of 
 permitting coders to write C extensions in pure C, rather than in some 
 intermediate language which handles the ref counts for you but otherwise 
 compiles like C. If people cared enough, they could (probably) make the C 
 compiler handle it for them, just as it currently handles incrementing 
 and decrementing the return stack. 

I guess that is how the so-called smart pointers in the Boost C++
template library work.  I haven't used them so I don't have personal
experience with how convenient or reliable they are, or what kinds of
constraints they imposed on programming style.  I've always felt a bit
suspicious of them though, and I seem to remember Alex Martelli (I hope
he shows up here again someday) advising against using them.

I don't think a C compiler could really manage automatic decrementing
while still being C.  Think especially of the common style of exception
handling in C using longjmp.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-08-30 Thread Steven D'Aprano
On Mon, 30 Aug 2010 00:22:17 -0700, Paul Rubin wrote:

 I don't think a C compiler could really manage automatic decrementing
 while still being C.  Think especially of the common style of exception
 handling in C using longjmp.

You might very well be right. But that's the problem with C -- it's too 
low-level a language to expect the compiler to protect you much. Or at 
all. There will always be some use cases for managing memory yourself, or 
even managing the return stack (as you can do in Forth, for example), and 
so there will always need to be some sort of high-level assembler like C. 
But it astounds me that in 2010 people still routinely use C for normal, 
everyday application programming.



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-08-30 Thread Gregory Ewing

Paul Rubin wrote:


These days I think the GC pause issue is overrated except for real-time
control applications.


Also for games, which are a fairly common application
these days. Even a few milliseconds can be too long when
you're trying to achieve smooth animation.

I'd be disappointed if CPython ditched refcounting and
then became unsuitable for real-time games as a result.

--
Greg
--
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-08-30 Thread Lawrence D'Oliveiro
In message 7xr5hg3a7s@ruckus.brouhaha.com, Paul Rubin wrote:

 Actually that code looks suspicious.  Suppose in
 
  AddrObj = PyTuple_GetItem(TheBufferInfo, 0);
  LenObj = PyTuple_GetItem(TheBufferInfo, 1);
 
 the first PyTuple_GetItem succeeds and the second one fails.

Admittedly, I did take a shortcut here: array.buffer_info returns a tuple of 
two items, so I’m not expecting one GetItem to succeed and the other to 
fail.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-08-30 Thread Lawrence D'Oliveiro
In message 7x39twpuxi@ruckus.brouhaha.com, Paul Rubin wrote:

 Lawrence D'Oliveiro l...@geek-central.gen.new_zealand writes:

 the CPython API means endlessly finding and fixing refcount bugs that
 lead to either crashes/security failures, or memory leaks.

 I don’t see why that should be so. It seems a very simple discipline to
 follow: initialize, allocate, free. Here’s an example snippet from my DVD
 Menu Animator http://github.com/ldo/dvd_menu_animator:
 
 In practice it has been a problem.

Maybe people need to learn to write code in a more structured fashion.

 If it hasn't happened to you yet, you're either burning a bunch of effort
 that programmers of more automatic systems can put to more productive
 uses ...

What makes you say that? Avoiding bugs is not a “productive use”?

 And how do you run such an application? You have to limit it to a
 predetermined amount of memory to begin with, otherwise it would easily
 gobble up everything you have.
 
 No that's usually not a problem-- the runtime system (generational gc)
 can figure out enough from your allocation pattern to prevent the heap
 from getting overlarge.

And yet Java apps, for example, are (in)famous for excessive memory usage 
compared to those written in non-GC-dependent languages.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-08-30 Thread Paul Rubin
Lawrence D'Oliveiro l...@geek-central.gen.new_zealand writes:
  AddrObj = PyTuple_GetItem(TheBufferInfo, 0);
  LenObj = PyTuple_GetItem(TheBufferInfo, 1);
 
 the first PyTuple_GetItem succeeds and the second one fails.

 Admittedly, I did take a shortcut here: array.buffer_info returns a tuple of 
 two items, so I’m not expecting one GetItem to succeed and the other to 
 fail.

FromArray is a parameter to the function, with no type check to make
sure it's really an array.  In fact your code allows for the possibility
that it doesn't support the buffer_info operation (if I understand the
purpose of the null return check after the PyObject_CallMethod) which
means it's prepared for the argument to -not- be an array.  In that case
maybe it's some other object with a buffer_info operation that returns
a 1-element tuple.  If the function is callable from Python code, then
that arg type is completely out of the C code's control.  Even if it's
only callable from C, you're still depending on not one but two
different invariants (that the arg is an array, and that
array.buffer_info returns a 2-tuple) that are undocumented and unchecked
in the function.  I cannot agree with your claim that the approach
scales.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-08-30 Thread Paul Rubin
Lawrence D'Oliveiro l...@geek-central.gen.new_zealand writes:
 If it hasn't happened to you yet, you're either burning a bunch of effort
 that programmers of more automatic systems can put to more productive
 uses ...

 What makes you say that? Avoiding bugs is not a “productive use”?

Avoiding any particular bug through constant and pervasive vigilance is
far less productive than using a system where causing that particular
type of bug is impossible to begin with.  IMO the code you posted has
latent bugs as discussed in the other post.  It might work at the moment
you checked it in but it is brittle.  I wouldn't have signed off on it
in a code review.

 And yet Java apps, for example, are (in)famous for excessive memory
 usage compared to those written in non-GC-dependent languages.

I think that may mostly be an issue with the bloated nature of most Java
apps.  Certainly Lisp systems have run in production for decades on
machines with much less memory than we would consider acceptable these
days for any substantial Python app.

It's probably true that copying-style gc is more memory hungry than
Python's refcount system.  Mark-sweep gc should have comparable memory
consumption and better speed than refcounting, though less speed than
copying.

JHC (experimental Haskell compiler) recently started using a mixture of
gc and region inference.  It will be interesting to see how that works
out.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-08-30 Thread Paul Rubin
Gregory Ewing greg.ew...@canterbury.ac.nz writes:
 These days I think the GC pause issue is overrated except for real-time
 control applications.

 Also for games, which are a fairly common application
 these days. Even a few milliseconds can be too long when
 you're trying to achieve smooth animation.

The usual hack with games is you do a major gc when the user advances
between game levels.  You can do minor gc's during the screen refresh
interval.
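
In CPython terms the equivalent knob is the gc module (sketch only;
the level object and its methods are made up):

import gc

def play_level(level):
    gc.disable()             # no cycle collections mid-level; plain
    try:                     # refcounting still reclaims most garbage
        while not level.finished:
            level.render_frame()
    finally:
        gc.enable()
        gc.collect()         # do the major collection between levels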

 I'd be disappointed if CPython ditched refcounting and
 then became unsuitable for real-time games as a result.

 Refcounting is susceptible to the same pauses for reasons already
discussed.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-08-30 Thread Aahz
In article 4c7b279d$0$28650$c3e8...@news.astraweb.com,
Steven D'Aprano  steve-remove-t...@cybersource.com.au wrote:
On Sun, 29 Aug 2010 17:52:38 -0700, Paul Rubin wrote:
Attribution lost:

 That's a problem with the CPython API, not reference counting. The
 problem is that the CPython API is written at too low a level, beneath
 that at which the garbage collector exists, so naturally you have to
 manually manage memory.
 
 Care to give an example of a reference counted system that's written any
 other way?

The complexity of the ref counter is invisible when writing pure Python 
code, and I believe it is also invisible when writing code in Cython. The 
difficulty of dealing with ref counts is abstracted away.

That's not completely true.  You know perfectly well that it's almost
trivially easy to leak memory with refcounting, and there are certain
ways in which Python leaks memory invisibly if you don't know how it
works.  One recent example at work was when someone was arguing with me
about whether we were leaking file handles and I had to prove that you
could leak file handles if you didn't clean up exceptions -- but that
cleaning up the exception *did* close the file handles.

(This code was originally written in Python 2.4, and partly because of
this we are making more of a push to use the "with" statement.)
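
A minimal sketch of the kind of leak in question, assuming Python 2
semantics (the error is contrived, but the mechanism is real):

    import sys

    def parse():
        f = open(__file__)               # an open handle held in a local
        raise ValueError("bad record")   # f.close() never runs

    try:
        parse()
    except ValueError:
        pass
    # The caught exception is still reachable via sys.exc_info(), and its
    # traceback references parse()'s frame, whose local f keeps the file
    # handle open.  Cleaning up the exception releases it:
    sys.exc_clear()   # Python 2 only; the refcount drops and the file closes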

If you're restricting your claim just to the actual management of the
reference counter, you're correct, but it's especially not clear that
your second sentence is so restricted.
-- 
Aahz (a...@pythoncraft.com)   * http://www.pythoncraft.com/

...if I were on life-support, I'd rather have it run by a Gameboy than a
Windows box.  --Cliff Wells
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-08-30 Thread Lawrence D'Oliveiro
In message 7xr5hguzzi@ruckus.brouhaha.com, Paul Rubin wrote:

 JHC (experimental Haskell compiler) recently started using a mixture of
 gc and region inference.  It will be interesting to see how that works
 out.

That’s what people have been saying about garbage collection for about half 
a century: “this new experimental technique will solve all the problems, 
just you wait and see”.

Meanwhile, real-world programmers get on to writing real-world code that is 
productive and efficient on real-world systems.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-08-30 Thread Paul Rubin
Lawrence D'Oliveiro l...@geek-central.gen.new_zealand writes:
 Meanwhile, real-world programmers get on to writing real-world code that is 
 productive and efficient on real-world systems.

It's pretty well established by now that GC doesn't have any significant
speed penalty compared with manual allocation.  It does consume more
memory, which is acceptable a lot of the time.  It certainly leads to
more reliable programs.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-08-29 Thread Hans Mulder

Steven D'Aprano wrote:

On Sat, 28 Aug 2010 00:33:10 -0700, Paul Rubin wrote:



If you drop the last reference
to a complex structure, it could take quite a long time to free all the
components.  By contrast there are provably real-time tracing gc
schemes, including some parallelizable ones.


I could be wrong, but how can they not be subject to the same performance 
issue? If you have twenty thousand components that all have to be freed, 
they all have to be freed whether you do it when the last reference is 
cleared, or six seconds later when the gc does a sweep.


Parallelizable garbage collectors have performance issues, but they're
not the same issues as mark-sweep collectors have.  Parallelizable GCs
break up their work in a zillion little pieces and allow the VM to do
some real work after each piece.  They won't free your twenty thousand
components all in one go and you won't have that embarrassing pause.

Parallelizable garbage collectors require some careful coordination
between the GC and the VM.  This takes CPU time, so on the whole they're
slower than traditional garbage collectors.  So instead of unpredictable
embarrassing pauses, you have a VM that's consistently slow.
For some applications consistency is more important than raw speed and
for these applications parallelizable GCs are an improvement.

HTH,

-- HansM
--
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-08-29 Thread Paul Rubin
Hans Mulder han...@xs4all.nl writes:
 Parallelizable garbage collectors have performance issues, but they're
 not the same issues as mark-sweep collectors have.  Parallelizable GCs
 break up their work in a zillion little pieces and allow the VM to do
 some real work after each piece.  They won't free your twenty thousand
 components all in one go and you won't have that embarrassing pause.

Quibble: parallel GC just means any GC that runs in multiple threads
simultaneously.  A GC that guarantees no pauses (more technically,
specifies a constant such that any GC pause is guaranteed to be shorter
than the constant) is called a real time GC, and real-time GC's are
usually single threaded.  Parallel real-time GC's were once sort of a
holy grail, though workable ones have been implemented.  GHC for example
currently uses a GC that is parallel (runs on multiple cores for speed)
but is not real-time (there can be a pause), and I think the Sun JVM is
the same way.

These days I think the GC pause issue is overrated except for real-time
control applications.  GC's no longer frequently make the program freeze
up for seconds at a time or anything like that.  It was worse in the old
days when CPU's were slower and memory was scarcer.  Serious GC's are
usually generational, with minor GC's taking microseconds and less
frequent major GC's taking fractions of a second.  So in an
interactive program or animation on your desktop you might notice a
little hiccup once in a while.  For something like a web server an
occasional slight variation in response time could easily be random
network delays and so forth.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-08-29 Thread Paul Rubin
Steven D'Aprano st...@remove-this-cybersource.com.au writes:
 You can add cycle detection to a reference count gc, at the cost of more 
 complexity.

But then it's not purely a refcount gc. ;)

 If you read the Wikipedia article I linked to, tracing algorithms can 
 also be unsound:  [describes conservative gc]

Yeah, whether that counts as a real GC is subjective too.

 and 2) it requires constant attention from the mutator to incr and
 decr the reference counts.
 Yes. And?

I thought I made it clear, the purpose of gc is to improve programmer
productivity and program reliability by relieving the programmer from
the burden of memory allocation bookkeeping.  If the programmer has to
painstakingly manage refcounts by hand and the resulting programs are
still prone to refcount bugs (both of which are the case with CPython),
it's not really gc in a way that lives up to the name.

 That's a problem with the CPython API, not reference counting. The 
 problem is that the CPython API is written at too low a level, beneath 
 that at which the garbage collector exists, so naturally you have to 
 manually manage memory.

Care to give an example of a reference counted system that's written any
other way?

 On the other hand, tracing gcs have their own set of problems too, mostly 
 due to the use of finalizers and attempts to make garbage collection run 
 more predictably. See here:

I think it's fairly common wisdom that finalizers and gc don't interact
very well.  Finalizers in general aren't in the gc spirit, which says
the system should give the illusion that every object stays around
forever.  As for stuff like hash tables, a usual way to help out the gc
is by populating the hash table with weak references, rather than by
clearing it out manually when you're done with it.  It's also possible
for fancy compilers to use a mixture of gc and stack- or region-based
allocation.
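
For example, a minimal sketch with the stdlib weakref module:

    import weakref

    class Image(object):
        pass

    cache = weakref.WeakValueDictionary()   # entries don't keep values alive

    img = Image()
    cache['banner'] = img
    print 'banner' in cache   # True -- a strong reference still exists
    del img                   # drop the last strong reference
    print 'banner' in cache   # False: the entry fell out on its own
                              # (immediately under CPython's refcounting)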

 Tracing garbage collectors aren't a panacea. They're software themselves, 
 and complex software, which means they're subject to bugs like the one ...

You could say the same thing about compilers instead of gc's.  The idea
in both cases is yes, they're complex, but they put the complexity in
one place, so that the application program can rely on the complicated
gc or compiler features while itself staying simple.  
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-08-29 Thread Paul Rubin
Steven D'Aprano st...@remove-this-cybersource.com.au writes:
 I could be wrong, but how can they not be subject to the same performance 
 issue? If you have twenty thousand components that all have to be freed, 
 they all have to be freed whether you do it when the last reference is 
 cleared, or six seconds later when the gc does a sweep.

GC's on large machines these days do not sweep at all.  They copy the
live data to a new heap, then release the old heap.  Because there is
usually more garbage than live data, copying GC's that never touch the
garbage are usually faster than any scheme involving freeing unused
objects one by one.
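
A toy sketch of the idea -- a Cheney-style semispace collector, with a
list standing in for the heap and a Ref class standing in for pointers
(real collectors do this on raw memory, of course):

    class Ref(object):
        def __init__(self, index):
            self.index = index      # an "address" into the heap list

    def collect(from_heap, roots):
        to_heap = []                # the new semispace
        forward = {}                # old index -> new index (forwarding table)

        def evacuate(ref):
            if ref.index not in forward:
                forward[ref.index] = len(to_heap)
                to_heap.append(dict(from_heap[ref.index]))  # copy live object
            return Ref(forward[ref.index])

        new_roots = [evacuate(r) for r in roots]
        scan = 0
        while scan < len(to_heap):  # to_heap doubles as the work queue
            obj = to_heap[scan]
            for field, val in obj.items():
                if isinstance(val, Ref):
                    obj[field] = evacuate(val)  # chase and rewrite pointers
            scan += 1
        return to_heap, new_roots

Everything unreachable simply stays behind in from_heap and is dropped
wholesale, so the cost is proportional to the live data, not the garbage.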
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-08-29 Thread Steven D'Aprano
On Sun, 29 Aug 2010 17:52:38 -0700, Paul Rubin wrote:

 That's a problem with the CPython API, not reference counting. The
 problem is that the CPython API is written at too low a level, beneath
 that at which the garbage collector exists, so naturally you have to
 manually manage memory.
 
 Care to give an example of a reference counted system that's written any
 other way?

The complexity of the ref counter is invisible when writing pure Python 
code, and I believe it is also invisible when writing code in Cython. The 
difficulty of dealing with ref counts is abstracted away.

The argument that "it will work if we always do this" means that "it won't
work" is a nice quip, but it doesn't stand up. It's possible to defeat
even Pascal compilers' type checking and do unsafe things by dropping 
down into assembly. So don't do it! It's not hard to not do something, 
although sometimes you might choose to do it anyway, because otherwise 
there is no way to get the code you need/want. But such code should be a 
rare exception.

I'm not saying that ref counting systems can avoid incrementing and 
decrementing the ref counts. That would be silly. But I'm saying that it 
is an accident of implementation that writing C extensions requires you 
to manually manage the counts yourself. That's a side-effect of 
permitting coders to write C extensions in pure C, rather than in some 
intermediate language which handles the ref counts for you but otherwise 
compiles like C. If people cared enough, they could (probably) make the C 
compiler handle it for them, just as it currently handles incrementing 
and decrementing the return stack. 

There's nothing *fundamental* to the idea of ref counting that says that 
you must handle the counts manually -- it depends on how well insulated 
you are from the underlying machine.



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-08-29 Thread Lawrence D'Oliveiro
In message 7x4oeftuk4@ruckus.brouhaha.com, Paul Rubin wrote:

 I'd say [reference-counting is] not real gc because 1) it's unsound
 (misses reference cycles), and 2) it requires constant attention from the
 mutator to incr and decr the reference counts.  So developing modules for
 the CPython API means endlessly finding and fixing refcount bugs that lead
 to either crashes/security failures, or memory leaks.

I don’t see why that should be so. It seems a very simple discipline to 
follow: initialize, allocate, free. Here’s an example snippet from my DVD 
Menu Animator http://github.com/ldo/dvd_menu_animator:

static void GetBufferInfo
  (
    PyObject * FromArray,
    unsigned long * addr,
    unsigned long * len
  )
  /* returns the address and length of the data in a Python array object. */
  {
    PyObject * TheBufferInfo = 0;
    PyObject * AddrObj = 0;
    PyObject * LenObj = 0;
    do /*once*/
      {
        TheBufferInfo = PyObject_CallMethod(FromArray, "buffer_info", "");
        if (TheBufferInfo == 0)
            break;
        /* buffer_info returns (address, length) */
        AddrObj = PyTuple_GetItem(TheBufferInfo, 0);
        LenObj = PyTuple_GetItem(TheBufferInfo, 1);
        if (PyErr_Occurred())
            break;
        /* PyTuple_GetItem returns borrowed references -- take our own */
        Py_INCREF(AddrObj);
        Py_INCREF(LenObj);
        *addr = PyInt_AsUnsignedLongMask(AddrObj);
        *len = PyInt_AsUnsignedLongMask(LenObj);
        if (PyErr_Occurred())
            break;
      }
    while (false);
    /* Py_XDECREF tolerates nulls, so every exit path is covered */
    Py_XDECREF(AddrObj);
    Py_XDECREF(LenObj);
    Py_XDECREF(TheBufferInfo);
  } /*GetBufferInfo*/

It’s quite easy to assure yourself that this is never going to leak memory. 
More complicated examples can simply nest constructs like this one inside
another to arbitrary depth, while still giving the same assurance at every
level. In short, this technique scales well.

 If you program the Java JNI or a typical Lisp FFI, you'll find that real
 gc is a lot simpler to use since you avoid all the refcount maintenance
 hassles.  You allocate memory and shut your eyes, and the gc takes care of
 freeing it when it figures out that you are done.

And how do you run such an application? You have to limit it to a 
predetermined amount of memory to begin with, otherwise it would easily 
gobble up everything you have.

In the old days of “classic” MacOS, every application had to run in a fixed-
size application heap. I have no wish to return to those days.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-08-29 Thread Paul Rubin
Lawrence D'Oliveiro l...@geek-central.gen.new_zealand writes:
 the CPython API means endlessly finding and fixing refcount bugs that lead
 to either crashes/security failures, or memory leaks.

 I don’t see why that should be so. It seems a very simple discipline to 
 follow: initialize, allocate, free. Here’s an example snippet from my DVD 
 Menu Animator http://github.com/ldo/dvd_menu_animator:

In practice it has been a problem.  If it hasn't happened to you yet,
you're either burning a bunch of effort that programmers of more
automatic systems can put to more productive uses, or else you just
haven't written enough such code to have gotten bitten yet.

 You allocate memory and shut your eyes, and the gc takes care of
 freeing it when it figures out that you are done.

 And how do you run such an application? You have to limit it to a 
 predetermined amount of memory to begin with, otherwise it would easily 
 gobble up everything you have.

No, that's usually not a problem -- the runtime system (generational gc)
can figure out enough from your allocation pattern to prevent the heap
from getting overlarge.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-08-28 Thread Paul Rubin
Dennis Lee Bieber wlfr...@ix.netcom.com writes:
   The nice thing about it [reference counting] is that it is sort
 of deterministic -- one can examine code and determine that an object
 is collected at some point in the execution...
   Heap marking, OTOH, tends to run at indeterminate times, which could
 have an impact if one needs predictable response timings

Reference counting has the same problem.  If you drop the last reference
to a complex structure, it could take quite a long time to free all the
components.  By contrast there are provably real-time tracing gc
schemes, including some parallelizable ones.  One reason CPython still
can't run threads on parallel cores is it would have to lock the
reference counts every time they're updated, and the slowdown from that
is terrible.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-08-28 Thread Steven D'Aprano
On Sat, 28 Aug 2010 00:33:10 -0700, Paul Rubin wrote:

 Dennis Lee Bieber wlfr...@ix.netcom.com writes:
  The nice thing about it [reference counting] is that it is sort
 of deterministic -- one can examine code and determine that an object
 is collected at some point in the execution...
 Heap marking, OTOH, tends to run at indeterminate times, which could
 have an impact if one needs predictable response timings
 
 Reference counting has the same problem.  

In theory, yes, but in practice ref counting tends to spread out the 
performance impact more smoothly. There are exceptions, such as the one 
you mention below, but as a general rule ref counting isn't subject to 
the embarrassing pauses that tracing garbage collectors tend to be 
subject to.


 If you drop the last reference
 to a complex structure, it could take quite a long time to free all the
 components.  By contrast there are provably real-time tracing gc
 schemes, including some parallelizable ones.

I could be wrong, but how can they not be subject to the same performance 
issue? If you have twenty thousand components that all have to be freed, 
they all have to be freed whether you do it when the last reference is 
cleared, or six seconds later when the gc does a sweep.


 One reason CPython still
 can't run threads on parallel cores is it would have to lock the
 reference counts every time they're updated, and the slowdown from that
 is terrible.

On the other hand, the reason that CPython still has reference counting 
is that the alternatives tried so far are unacceptably slow for
non-threaded code.




-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-08-28 Thread Steven D'Aprano
On Fri, 27 Aug 2010 18:06:19 -0700, Paul Rubin wrote:

 Steven D'Aprano st...@remove-this-cybersource.com.au writes:
 I've repeatedly asked, both here and elsewhere, why reference counting
 isn't real garbage collection. Nobody has been able to give me a
 satisfactory answer. As far as I can tell, it's a bit of
 pretentiousness with no basis in objective fact.
 
 Well, it's a bit of a subjective matter.  I'd say it's not real gc
 because 1) it's unsound (misses reference cycles), 

You can add cycle detection to a reference count gc, at the cost of more 
complexity.

If you read the Wikipedia article I linked to, tracing algorithms can 
also be unsound:

Some collectors running in a particular environment can 
correctly identify all pointers (references) in an object; 
these are called precise (also exact or accurate) 
collectors, the opposite being a conservative or partly
conservative collector. Conservative collectors have to 
assume that any bit pattern in memory could be a pointer if 
(when interpreted as a pointer) it would point into any 
allocated object. Thus, conservative collectors may have 
some false negatives, where storage is not released because 
of accidental fake pointers...

http://en.wikipedia.org/wiki/Garbage_collection_(computer_science)



 and 2) it requires
 constant attention from the mutator to incr and decr the reference
 counts.

Yes. And?


 So developing modules for the CPython API means endlessly
 finding and fixing refcount bugs that lead to either crashes/security
 failures, or memory leaks.  If you program the Java JNI or a typical
 Lisp FFI, you'll find that real gc is a lot simpler to use since you
 avoid all the refcount maintenance hassles.  You allocate memory and
 shut your eyes, and the gc takes care of freeing it when it figures out
 that you are done.  Refcounting is basically a form of manual memory
 management, while gc is automatic.


That's a problem with the CPython API, not reference counting. The 
problem is that the CPython API is written at too low a level, beneath 
that at which the garbage collector exists, so naturally you have to 
manually manage memory.


 Someone said here recently that as a program gets larger, saying "this
 will work as long as we do X every time without fail" becomes equal to
 saying "this won't work".  Substitute "properly maintain all ref counts"
 for X and you can see the problem.  I've seen released production
 tested Python C modules with subtle refcount bugs on more than one
 occasion.  In gc'd systems there are fewer places for the code to go
 wrong.

On the other hand, tracing gcs have their own set of problems too, mostly 
due to the use of finalizers and attempts to make garbage collection run 
more predictably. See here:

http://publib.boulder.ibm.com/infocenter/javasdk/v1r4m2/topic/com.ibm.java.doc.diagnostics.142j9/html/coexistwithgc.html

Quote:

For tidying Java resources, think about the use of a clean 
up routine. When you have finished with an object, call the 
routine to null out all references, deregister listeners, 
clear out hash tables, and so on. This is far more efficient 
than using a finalizer and has the useful side-benefit of 
speeding up garbage collection. The Garbage Collector does 
not have so many object references to chase in the next 
garbage collection cycle.


Translated: Rather than relying on the garbage collector to clean up 
resources after you, do it yourself, manually, so the garbage collector 
has less work to do.

Tracing garbage collectors aren't a panacea. They're software themselves, 
and complex software, which means they're subject to bugs like the one 
which plagued Flash plugin 9:

http://gskinner.com/blog/archives/2008/04/failure_to_unlo.html

The more complicated the garbage collector, the more scope you have for 
some interaction between your high-level code and the gc leading to 
memory not be reclaimed or extreme slowdown. Like this:

http://tech.puredanger.com/2009/02/11/linkedblockingqueue-garbagecollection/





-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-08-28 Thread Aahz
In article 4c78572c$0$28655$c3e8...@news.astraweb.com,
Steven D'Aprano  st...@remove-this-cybersource.com.au wrote:
On Fri, 27 Aug 2010 09:16:52 -0700, Aahz wrote:
 In article mailman.1967.1281549328.1673.python-l...@python.org, MRAB 
 pyt...@mrabarnett.plus.com wrote:

An object will be available for garbage collection when nothing refers
to it either directly or indirectly. If it's unreferenced then it will
go away.
 
 This isn't actually garbage collection as most people think of it.
 Refcounting semantics mean that objects get reaped as soon as nothing
 points at them.  OTOH, CPython does also have garbage collection to back
 up refcounting so that when you have unreferenced object cycles they
 don't stay around.

I've repeatedly asked, both here and elsewhere, why reference counting 
isn't real garbage collection. Nobody has been able to give me a 
satisfactory answer. As far as I can tell, it's a bit of pretentiousness 
with no basis in objective fact.

You'll notice that I was very careful to qualify my statement with "as
most people think of it".  Also, because CPython has two different memory
management mechanisms, refcounting and cycle detection, and the module
that controls cycle detection is called "gc", I think it's simpler to
follow along with the Python docs -- and critically important to remind
people that there are in fact two different systems.
-- 
Aahz (a...@pythoncraft.com)   * http://www.pythoncraft.com/

...if I were on life-support, I'd rather have it run by a Gameboy than a
Windows box.  --Cliff Wells
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-08-28 Thread Aahz
In article 4c78e7a5$0$28655$c3e8...@news.astraweb.com,
Steven D'Aprano  st...@remove-this-cybersource.com.au wrote:

On the other hand, the reason that CPython still has reference counting
is that the alternatives tried so far are unacceptably slow for
non-threaded code.

No, it's *a* reason, the other main reason being that refcounting is much
easier for a random C library.
-- 
Aahz (a...@pythoncraft.com)   * http://www.pythoncraft.com/

...if I were on life-support, I'd rather have it run by a Gameboy than a
Windows box.  --Cliff Wells
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-08-27 Thread Aahz
In article mailman.1967.1281549328.1673.python-l...@python.org,
MRAB  pyt...@mrabarnett.plus.com wrote:

An object will be available for garbage collection when nothing refers
to it either directly or indirectly. If it's unreferenced then it will
go away.

This isn't actually garbage collection as most people think of it.
Refcounting semantics mean that objects get reaped as soon as nothing
points at them.  OTOH, CPython does also have garbage collection to back
up refcounting so that when you have unreferenced object cycles they
don't stay around.
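
A quick demonstration of the backup collector at work:

    import gc

    class Node(object):
        pass

    a = Node(); b = Node()
    a.other = b; b.other = a   # a reference cycle: refcounts never reach zero
    del a, b                   # unreferenced, but refcounting can't reap them
    print gc.collect()         # nonzero: the cycle detector found and freed
                               # the unreachable objects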
-- 
Aahz (a...@pythoncraft.com)   * http://www.pythoncraft.com/

...if I were on life-support, I'd rather have it run by a Gameboy than a
Windows box.  --Cliff Wells
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-08-27 Thread John Nagle

On 8/11/2010 1:26 PM, EW wrote:

On Aug 11, 2:52 pm, Paul Rubin no.em...@nospam.invalid  wrote:

EW ericwoodwo...@gmail.com  writes:

Well I cared because I thought garbage collection would only happen
when the script ended - the entire script.  Since I plan on running
this as a service it'll run for months at a time without ending.  So I
thought I was going to have heaps of Queues hanging out in memory,
unreferenced and unloved.  It seemed like bad practice so I wanted to
get out ahead of it.


Even if GC worked that way it wouldn't matter, if you use just one queue
per type of task.  That number should be a small constant so the memory
consumption is small.


Well I can't really explain it but 1 Queue per task for what I'm
designing just doesn't feel right to me.  It feels like it will lack
future flexibility.  I like having 1 Queue per producer thread object
and the person instantiating that object can do whatever he wants with
that Queue.  I can't prove I'll need that level of flexibility but I
don't see why it's bad to have.  It's still a small number of Queues,
it's just a small, variable, number of Queues.


That's backwards.  Usually, you want one queue per unique consumer.
That is, if you have a queue that contains one kind of request,
there's one thread reading the queue, blocked until some other
thread puts something on the queue.  No polling is needed.

One consumer reading multiple queues is difficult to implement
well.

Note, by the way, that CPython isn't really concurrent.  Only
one thread runs at a time, due to an archaic implementation.  So
if your threads are compute-bound, even on a multicore CPU threading
will not help.

There's a multiprocessing module which allows spreading work
over several processes instead of threads.  That can be helpful
as a workaround.
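
For instance, a minimal sketch (Python 2.6+; crunch is a stand-in for
real compute-bound work):

    import multiprocessing

    def crunch(n):
        return sum(i * i for i in xrange(n))    # CPU-bound work

    if __name__ == '__main__':
        pool = multiprocessing.Pool(processes=4)  # real OS processes
        print pool.map(crunch, [10 ** 6] * 8)     # spreads across the cores
        pool.close()
        pool.join()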

John Nagle
--
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-08-27 Thread Steven D'Aprano
On Fri, 27 Aug 2010 09:16:52 -0700, Aahz wrote:

 In article mailman.1967.1281549328.1673.python-l...@python.org, MRAB 
 pyt...@mrabarnett.plus.com wrote:

An object will be available for garbage collection when nothing refers
to it either directly or indirectly. If it's unreferenced then it will
go away.
 
 This isn't actually garbage collection as most people think of it.
 Refcounting semantics mean that objects get reaped as soon as nothing
 points at them.  OTOH, CPython does also have garbage collection to back
 up refcounting so that when you have unreferenced object cycles they
 don't stay around.


I've repeatedly asked, both here and elsewhere, why reference counting 
isn't real garbage collection. Nobody has been able to give me a 
satisfactory answer. As far as I can tell, it's a bit of pretentiousness 
with no basis in objective fact.

http://en.wikipedia.org/wiki/Garbage_collection_(computer_science)
http://en.wikipedia.org/wiki/Reference_counting

Reference counting is one specific kind of garbage collection, and like 
all gc strategies, it has strengths as well as weaknesses. It is *simple* 
to implement (which may be why a certain class of programmer likes to 
think it isn't "real" gc). When it runs is deterministic, and reclamation
is immediate once the last reference is dropped. The overhead is very light (a
plus) and continuous (which can be both a plus and a minus). It is better 
suited to managing scarce resources like open files than are tracing 
garbage collectors. It avoids the embarrassing pause of tracing 
collectors. It doesn't deal well with reference cycles, and (at least 
with Python's implementation of ref counting) it causes performance 
issues with threaded applications.
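
The determinism is easy to see under CPython -- a minimal sketch:

    class Noisy(object):
        def __del__(self):
            print "reclaimed"

    n = Noisy()
    del n   # prints "reclaimed" immediately under CPython's refcounting;
            # a tracing collector would get around to it at some later pass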

So CPython has two garbage collectors -- a reference counting 
implementation, and a tracing implementation. Jython and IronPython use 
the native garbage collectors from Java and .Net. Other Pythons may use 
something else.


-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-08-27 Thread Paul Rubin
Steven D'Aprano st...@remove-this-cybersource.com.au writes:
 I've repeatedly asked, both here and elsewhere, why reference counting 
 isn't real garbage collection. Nobody has been able to give me a 
 satisfactory answer. As far as I can tell, it's a bit of pretentiousness 
 with no basis in objective fact.

Well, it's a bit of a subjective matter.  I'd say it's not real gc
because 1) it's unsound (misses reference cycles), and 2) it requires
constant attention from the mutator to incr and decr the reference
counts.  So developing modules for the CPython API means endlessly
finding and fixing refcount bugs that lead to either crashes/security
failures, or memory leaks.  If you program the Java JNI or a typical
Lisp FFI, you'll find that real gc is a lot simpler to use since you
avoid all the refcount maintenance hassles.  You allocate memory and
shut your eyes, and the gc takes care of freeing it when it figures out
that you are done.  Refcounting is basically a form of manual memory
management, while gc is automatic.  

Someone said here recently that as a program gets larger, saying this
will work as long as we do X every time without fail becomes equal to
saying this won't work.  Substitute properly maintain all ref counts
for X and you can see the problem.  I've seen released production
tested Python C modules with subtle refcount bugs on more than one
occasion.  In gc'd systems there are fewer places for the code to go
wrong.
-- 
http://mail.python.org/mailman/listinfo/python-list


Queue cleanup

2010-08-11 Thread EW
Hi

I'm writing a multithreaded app that relies on Queues to move data
between the threads.  I'm trying to write my objects in a general way
so that I can reuse them in the future so I need to write them in such
a way that I don't know how many producer and how many consumer
threads I might need.  I also might have different consumer threads do
different tasks (for example one might write to a log and one might
write to SQL) so that again means I can't plan for a set ratio of
consumers to producers.  So it's unknown.

So this means that instead of having 1 Queue that all the producers
put to and that all the consumers get from I actually have 1 Queue per
producer thread  that the main body sends to the correct type of
consumer thread.  So I could get something like this where 3 producer
threads write to 3 different Queues all of which get read by 1
consumer thread:

P1    P2   P3
     \    |   /
       \  |  /
        C1

So producers 1, 2, and 3 all write to individual Queues and consumer 1
has a list of those Queues and reads them all.  The problem I'm having
is that those producer threads can come and go pretty quickly and when
they die I can cleanup the thread with join() but I'm still left with
the Queue.  So I could get something like this:

P1         P3
     \    |   /
       \  |  /
        C1

So here the P2 thread has ended and gone away but I still have his
Queue lingering.

So on a thread I can use is_alive() to check status and use join() to
clean up but I don't see any analogous functionality for Queues.  How
do I kill them?  I thought about putting a suicide message on the
Queue and then C1 would read it and set the variable to None but I'm
not sure setting the variable to None actually makes the Queue go
away.  It could just end up sitting in memory unreferenced - and
that's not good.  Additionally, I could have any number of consumer
threads reading that Queue so once the first one gets the suicide note
the other consumer threads never would.

I figure there has to be an elegant way for managing my Queues but so
far I can't find it.  Any suggestions would be appreciated and thanks
in advance for any help.


ps Python rocks.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-08-11 Thread EW
On Aug 11, 12:55 pm, EW ericwoodwo...@gmail.com wrote:
 Hi

 I'm writing a multithreaded app that relies on Queues to move data
 between the threads.  I'm trying to write my objects in a general way
 so that I can reuse them in the future so I need to write them in such
 a way that I don't know how many producer and how many consumer
 threads I might need.  I also might have different consumer threads do
 different tasks (for example one might write to a log and one might
 write to SQL) so that again means I can't plan for a set ratio of
 consumers to producers.  So it's unknown.

 So this means that instead of having 1 Queue that all the producers
 put to and that all the consumers get from I actually have 1 Queue per
 producer thread  that the main body sends to the correct type of
 consumer thread.  So I could get something like this where 3 producer
 threads write to 3 different Queues all of which get read by 1
 consumer thread:

 P1    P2   P3
      \    |   /
        \  |  /
         C1

 So producers 1, 2, and 3 all write to individual Queues and consumer 1
 has a list of those Queues and reads them all.  The problem I'm having
 is that those producer threads can come and go pretty quickly and when
 they die I can cleanup the thread with join() but I'm still left with
 the Queue.  So I could get something like this:

 P1         P3
      \    |   /
        \  |  /
         C1

 So here the P2 thread has ended and gone away but I still have his
 Queue lingering.

 So on a thread I can use is_alive() to check status and use join() to
 clean up but I don't see any analogous functionality for Queues.  How
 do I kill them?  I thought about putting a suicide message on the
 Queue and then C1 would read it and set the variable to None but I'm
 not sure setting the variable to None actually makes the Queue go
 away.  It could just end up sitting in memory unreferenced - and
 that's not good.  Additionally, I could have any number of consumer
 threads reading that Queue so once the first one gets the suicide note
 the other consumer threads never would.

 I figure there has to be an elegant way for managing my Queues but so
 far I can't find it.  Any suggestions would be appreciated and thanks
 in advance for any help.

 ps Python rocks.

Whoo... the formatting got torn up!  My terrible diagrams are even more
terrible!  Oh well, I think you'll catch my meaning   :)
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-08-11 Thread Paul Rubin
EW ericwoodwo...@gmail.com writes:
 I also might have different consumer threads do
 different tasks (for example one might write to a log and one might
 write to SQL) so that again means I can't plan for a set ratio of
 consumers to producers.  So it's unknown.

 So this means that instead of having 1 Queue that all the producers
 put to and that all the consumers get from I actually have 1 Queue per
 producer thread 

That doesn't sound appropriate.  Queues can have many readers and many
writers.  So use one queue per task (logging, SQL, etc), regardless of
the number of producer or consumer threads.  Any producer with an SQL
request sends it to the SQL queue, which can have many listeners.  The
different SQL consumer threads listen to the SQL queue and pick up
requests and handle them.
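
As a minimal sketch (run_sql is a stand-in for the real work, and the
None sentinel for shutdown is just one common convention):

    import threading, Queue      # Python 2 module names

    sql_queue = Queue.Queue()    # one queue for the whole SQL task type

    def sql_worker(q):
        while True:
            request = q.get()
            if request is None:  # shutdown sentinel
                q.put(None)      # pass it along so the other workers see it
                return
            run_sql(request)     # hypothetical: execute against the database

    workers = [threading.Thread(target=sql_worker, args=(sql_queue,))
               for _ in xrange(3)]          # three listeners, one queue
    for w in workers:
        w.start()

    sql_queue.put("INSERT INTO log VALUES (1)")  # any producer can do this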
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-08-11 Thread EW
On Aug 11, 1:18 pm, Paul Rubin no.em...@nospam.invalid wrote:
 EW ericwoodwo...@gmail.com writes:
  I also might have different consumer threads do
  different tasks (for example one might write to a log and one might
  write to SQL) so that again means I can't plan for a set ratio of
  consumers to producers.  So it's unknown.

  So this means that instead of having 1 Queue that all the producers
  put to and that all the consumers get from I actually have 1 Queue per
  producer thread

 That doesn't sound appropriate.  Queues can have many readers and many
 writers.  So use one queue per task (logging, SQL, etc), regardless of
 the number of producer or consumer threads.  Any producer with an SQL
 request sends it to the SQL queue, which can have many listeners.  The
 different SQL consumer threads listen to the SQL queue and pick up
 requests and handle them.

I thought about doing it that way and I could do it that way but it
still seems like there should be a way to clean up Queues on my own.
If I did it this way then I guess I'd be relying on garbage collection
when the script ended to clean up the Queues for me.

What if I want to clean up my own Queues?  Regardless of the specifics
of my current design, I'm just generally curious how people manage
cleanup of their Queues when they don't want them any more.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-08-11 Thread MRAB

EW wrote:
[snip]

So here the P2 thread has ended and gone away but I still have his
Queue lingering.

So on a thread I can use is_alive() to check status and use join() to
clean up but I don't see any analogous functionality for Queues.  How
do I kill them?  I thought about putting a suicide message on the
Queue and then C1 would read it and set the variable to None but I'm
not sure setting the variable to None actually makes the Queue go
away.  It could just end up sitting in memory unreferenced - and
that's not good.  Additionally, I could have any number of consumer
threads reading that Queue so once the first one gets the suicide note
the other consumer threads never would.

I figure there has to be an elegant way for managing my Queues but so
far I can't find it.  Any suggestions would be appreciated and thanks
in advance for any help.


An object will be available for garbage collection when nothing refers
to it either directly or indirectly. If it's unreferenced then it will
go away.

As for the suicide note, if a consumer sees it then it can put it back
into the queue so other consumers will see it and then forget about the
queue (set the variable which refers to the queue to None, or, if the
references are in a list, delete it from the list).
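
A minimal sketch of that scheme (handle is a stand-in for the per-item
work):

    import Queue

    SUICIDE = object()   # a unique, unmistakable note

    def consume(my_queues):
        while my_queues:
            for q in list(my_queues):
                try:
                    item = q.get(timeout=0.1)
                except Queue.Empty:
                    continue
                if item is SUICIDE:
                    q.put(SUICIDE)        # put it back for other consumers
                    my_queues.remove(q)   # forget it; gc reaps the queue
                else:
                    handle(item)          # hypothetical per-item work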
--
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-08-11 Thread EW
On Aug 11, 1:55 pm, MRAB pyt...@mrabarnett.plus.com wrote:
 EW wrote:

 [snip]



  So here the P2 thread has ended and gone away but I still have his
  Queue lingering.

  So on a thread I can use is_alive() to check status and use join() to
  clean up but I don't see any analogous functionality for Queues.  How
  do I kill them?  I thought about putting a suicide message on the
  Queue and then C1 would read it and set the variable to None but I'm
  not sure setting the variable to None actually makes the Queue go
  away.  It could just end up sitting in memory unreferenced - and
  that's not good.  Additionally, I could have any number of consumer
  threads reading that Queue so once the first one gets the suicide note
  the other consumer threads never would.

  I figure there has to be an elegant way for managing my Queues but so
  far I can't find it.  Any suggestions would be appreciated and thanks
  in advance for any help.

 An object will be available for garbage collection when nothing refers
 to it either directly or indirectly. If it's unreferenced then it will
 go away.

 As for the suicide note, if a consumer sees it then it can put it back
 into the queue so other consumers will see it and then forget about the
 queue (set the variable which refers to the queue to None, or, if the
 references are in a list, delete it from the list).

Ok great.  I wasn't sure about the garbage collection part of it.
That's actually pretty easy.

Thanks!
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-08-11 Thread Paul Rubin
EW ericwoodwo...@gmail.com writes:
 I thought about doing it that way and I could do it that way but it
 still seems like there should be a way to clean up Queues on my own.
 If I did it this way then I guess I'd be relying on garbage collection
 when the script ended to clean up the Queues for me.

Oh, I see.  As long as it's possible to start new producer or consumer
threads that touch a queue, obviously that queue has to still be around.
If the program starts all its threads at the beginning, then runs til
they exit, then does more stuff, then you could do something like:

    # make a dictionary of queues, one queue per task type
    queues = {'sql': Queue(), 'logging': Queue()}   # ...one entry per task

    for i in range(5):   # or however many threads you want
        threading.Thread(target=your_handler, args=[queues]).start()

    del queues

and then when all the threads exit, there are no remaining references to
the queues.  But why do you care?  Queues aren't gigantic structures,
they're just a list (collections.deque) with an rlock.  It's fine to let
the gc clean them up; that's the whole point of having a gc in the first
place.
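
You can see as much for yourself in CPython 2.x:

    import Queue

    q = Queue.Queue()
    print type(q.queue)   # <type 'collections.deque'> -- the underlying list
    print type(q.mutex)   # the lock that guards it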
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-08-11 Thread EW
On Aug 11, 2:16 pm, Paul Rubin no.em...@nospam.invalid wrote:
 EW ericwoodwo...@gmail.com writes:
  I thought about doing it that way and I could do it that way but it
  still seems like there should be a way to clean up Queues on my own.
  If I did it this way then I guess I'd be relying on garbage collection
  when the script ended to clean up the Queues for me.

 Oh, I see.  As long as it's possible to start new producer or consumer
 threads that touch a queue, obviously that queue has to still be around.
 If the program starts all its threads at the beginning, then runs til
 they exit, then does more stuff, then you could do something like:

     # make a dictionary of queues, one queue per task type
     queues = {'sql': Queue(), 'logging': Queue()}   # ...one entry per task

     for i in range(5):   # or however many threads you want
         threading.Thread(target=your_handler, args=[queues]).start()

     del queues

 and then when all the threads exit, there are no remaining references to
 the queues.  But why do you care?  Queues aren't gigantic structures,
 they're just a list (collections.deque) with an rlock.  It's fine to let
 the gc clean them up; that's the whole point of having a gc in the first
 place.

Well I cared because I thought garbage collection would only happen
when the script ended - the entire script.  Since I plan on running
this as a service it'll run for months at a time without ending.  So I
thought I was going to have heaps of Queues hanging out in memory,
unreferenced and unloved.  It seemed like bad practice so I wanted to
get out ahead of it.

But the GC doesn't work the way I thought it worked so there's really
no problem, I guess. I was just confused about garbage collection, it seems.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-08-11 Thread Paul Rubin
EW ericwoodwo...@gmail.com writes:
 Well I cared because I thought garbage collection would only happen
 when the script ended - the entire script.  Since I plan on running
 this as a service it'll run for months at a time without ending.  So I
 thought I was going to have heaps of Queues hanging out in memory,
 unreferenced and unloved.  It seemed like bad practice so I wanted to
 get out ahead of it.

Even if GC worked that way it wouldn't matter, if you use just one queue
per type of task.  That number should be a small constant so the memory
consumption is small.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-08-11 Thread MRAB

Paul Rubin wrote:

EW ericwoodwo...@gmail.com writes:

Well I cared because I thought garbage collection would only happen
when the script ended - the entire script.  Since I plan on running
this as a service it'll run for months at a time without ending.  So I
thought I was going to have heaps of Queues hanging out in memory,
unreferenced and unloved.  It seemed like bad practice so I wanted to
get out ahead of it.


Even if GC worked that way it wouldn't matter, if you use just one queue
per type of task.  That number should be a small constant so the memory
consumption is small.


That's basically how _non_-garbage-collected languages work! :-)
--
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-08-11 Thread EW
On Aug 11, 2:52 pm, Paul Rubin no.em...@nospam.invalid wrote:
 EW ericwoodwo...@gmail.com writes:
  Well I cared because I thought garbage collection would only happen
  when the script ended - the entire script.  Since I plan on running
  this as a service it'll run for months at a time without ending.  So I
  thought I was going to have heaps of Queues hanging out in memory,
  unreferenced and unloved.  It seemed like bad practice so I wanted to
  get out ahead of it.

 Even if GC worked that way it wouldn't matter, if you use just one queue
 per type of task.  That number should be a small constant so the memory
 consumption is small.

Well I can't really explain it but 1 Queue per task for what I'm
designing just doesn't feel right to me.  It feels like it will lack
future flexibility.  I like having 1 Queue per producer thread object
and the person instantiating that object can do whatever he wants with
that Queue.  I can't prove I'll need that level of flexibility but I
don't see why it's bad to have.  It's still a small number of Queues,
it's just a small, variable, number of Queues.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Queue cleanup

2010-08-11 Thread Paul Rubin
EW ericwoodwo...@gmail.com writes:
 Well I can't really explain it but 1 Queue per task for what I'm
 designing just doesn't feel right to me.  It feels like it will lack
 future flexibility.

That makes no sense at all.  Multiple readers and writers per queue are
the way Python queues are designed to work.  The normal way to spray a
bunch of concurrent tasks to worker threads is just have a bunch of
workers listening to one queue.  It's the same way at the producer end.
-- 
http://mail.python.org/mailman/listinfo/python-list