Re: [Guppy-pe-list] An iteration idiom (Was: Re: loading files containing multiple dumps)

2009-09-11 Thread Chris Withers

Sverker Nilsson wrote:

If you just use heap(), and only want total memory rather than memory
relative to a reference point, you can just use hpy() directly. So rather than:

CASE 1:

h=hpy()
h.heap().dump(...)
#other code, the data internal to h is still around
h.heap().dump(...)

you'd do:

CASE 2:

hpy().heap().dump(...)
#other code. No data from Heapy is hanging around
hpy().heap().dump(...)

The difference is that in case 1, the second call to heap() could reuse
the internal data in h, 


But that internal data would have to hang around, right? (which might, 
in itself, cause memory problems?)



whereas in case 2, it would have to be recreated,
which would take more time. (The data would be such things as the
dictionary owner map.)


How long is longer? Do you have any metrics that would help make good 
decisions about when to keep a hpy() instance around and when it's best 
to save memory?



Do you mean we should actually _remove_ features to create a new
standalone system?

Absolutely, why provide more than is used or needed?


How should we understand this? Should we have to support 2 or more
systems depending on what functionality you happen to need? Or do
you mean most functionality is actually _never_ used by
_anybody_ (and will not be in the future)? That would be quite gross,
wouldn't it?


I'm saying have one project and dump all the excess stuff that no-one 
but you uses ;-)


Or, maybe easier, have a core, separate package that just has the 
essentials in a simple, clean fashion, and then another package that 
builds on this to add all the other stuff...



It also gives as an alternative: "If this is not possible, a string of
the form <...some useful description...> should be returned"

The __repr__ I use doesn't have the enclosing <>, granted; maybe I missed
this, or it wasn't in the docs in 2005, or I didn't think it was important
(still don't), but was that really what the complaint was about?


No, it was about the fact that when I do repr(something_from_heapy) I 
get a shedload of text.



I thought it was more useful to actually get information about what was
contained in the object directly at the prompt than to try to show how to
recreate it, which wasn't possible anyway.


Agreed, but I think the stuff you currently have in __repr__ would be 
better placed in its own method:


>>> heap()
<IdentitySet object at 0x... containing 10 items>
>>> _.show()
... all the current __repr__ output
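
In other words, something like this standalone sketch (hypothetical,
not Heapy's actual class or attribute names):

class IdentitySet(object):
    def __init__(self, items):
        self.items = list(items)
    def __repr__(self):
        # short and unambiguous, per the datamodel docs
        return '<IdentitySet object at 0x%x containing %d items>' % (
            id(self), len(self.items))
    def show(self):
        # the big lump of text the current __repr__ produces
        for item in self.items:
            print item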

That should have another name... I don't know what a partition or 
equivalence order are in the contexts you're using them, but I do know 
that hijacking __getitem__ for this is wrong.


Opinions may differ, I'd say one can in principle never 'know' if such a
thing is 'right' or 'wrong', but that gets us into philosophical territory. 
Anyway...


I would bet that if you asked 100 experienced python programmers, most 
of them would tell you that what you're doing with __getitem__ is wrong, 
some might even say evil ;-)



For a tutorial provided by someone who did not seem to share your
conviction about indexing, but seemed to regard the way Heapy does it as
natural (although he has other valid complaints, and it is somewhat
outdated, e.g. wrt 64 bit), see:

http://www.pkgcore.org/trac/pkgcore/doc/dev-notes/heapy.rst


This link has become broken recently, but I don't remember reading the 
author's comments as liking the indexing stuff...


Chris

--
Simplistix - Content Management, Batch Processing & Python Consulting
   - http://www.simplistix.co.uk
--
http://mail.python.org/mailman/listinfo/python-list


Re: [Guppy-pe-list] An iteration idiom (Was: Re: loading files containing multiple dumps)

2009-09-11 Thread Ethan Furman

Chris Withers wrote:

Sverker Nilsson wrote:


The __repr__ I use doesn't have the enclosing <>, granted; maybe I missed
this, or it wasn't in the docs in 2005, or I didn't think it was important
(still don't), but was that really what the complaint was about?



No, it was about the fact that when I do repr(something_from_heapy) I 
get a shedload of text.



I thought it was more useful to actually get information about what was
contained in the object directly at the prompt than to try to show how to
recreate it, which wasn't possible anyway.



Agreed, but I think the stuff you currently have in __repr__ would be 
better placed in its own method:


>>> heap()
<IdentitySet object at 0x... containing 10 items>


For what it's worth, the container class I wrote recently to hold dbf 
rows is along the lines of Chris' suggestion; output is similar to this:


DbfList(97 records)

or, if a description was provided at list creation time:

DbfList(State of Oregon - 97 records)

basically, a short description of what's in the container, instead of 97 
screens of gibberish (even useful information is gibberish after 97 
screenfuls of it!-)
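
Roughly this pattern, as a simplified sketch (not the real DbfList):

class DbfList(object):
    def __init__(self, records, description=None):
        self.records = list(records)
        self.description = description
    def __repr__(self):
        # a short summary instead of screenfuls of records
        if self.description:
            return 'DbfList(%s - %d records)' % (self.description,
                                                 len(self.records))
        return 'DbfList(%d records)' % len(self.records)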


~Ethan~
--
http://mail.python.org/mailman/listinfo/python-list


Re: [Guppy-pe-list] An iteration idiom (Was: Re: loading files containing multiple dumps)

2009-09-10 Thread Sverker Nilsson
On Wed, 2009-09-09 at 13:47 +0100, Chris Withers wrote:
 Sverker Nilsson wrote:
  As the enclosing class or frame is deallocated, so is its attribute h
  itself. 
 
 Right, but as long as the h hangs around, it hangs on to all the memory 
 it's used to build its stats, right? This caused me problems in my most 
 recent use of guppy...

If you just use heap(), and only want total memory rather than memory
relative to a reference point, you can just use hpy() directly. So rather than:

CASE 1:

h=hpy()
h.heap().dump(...)
#other code, the data internal to h is still around
h.heap().dump(...)

you'd do:

CASE 2:

hpy().heap().dump(...)
#other code. No data from Heapy is hanging around
hpy().heap().dump(...)

The difference is that in case 1, the second call to heap() could reuse
the internal data in h, whereas in case 2, it would have to be recreated,
which would take more time. (The data would be such things as the
dictionary owner map.)

However, if you measure memory relative to a reference point, you would
have to keep h around, as in case 1.
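
For example, a minimal sketch (assuming guppy is installed; setref()
marks a reference point in h, as discussed elsewhere in this thread):

from guppy import hpy

h = hpy()
h.setref()                 # set the reference point inside h
data = [object() for i in range(1000)]   # allocate something to measure
print h.heap()             # reports only objects allocated after setref()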

[snip]

  Do you mean we should actually _remove_ features to create a new
  standalone system?
 
 Absolutely, why provide more than is used or needed?

How should we understand this? Should we have to support 2 or more
systems depending on what functionality you happen to need? Or do
you mean most functionality is actually _never_ used by
_anybody_ (and will not be in the future)? That would be quite gross,
wouldn't it?

I'd be hard pressed to support several versions just so that some of
them could have only the most common methods used in certain situations.

That would be like creating an additional Python dialect that contained,
say, only the 10% of the functionality that is used 90% of the time.
Quite naturally, this is not going to be done anytime soon. Even though
one could perhaps argue it would be easier for children etc. to use, the
extra work to support it has not been deemed meaningful.

 
  You are free to wrap functions as you find suitable; a minimal wrapper
  module could be just like this:
  
  # Module heapyheap
  from guppy import hpy
  h=hpy()
  heap=heap()
 
 I don't follow this.. did you mean heap = h.heap()? 

Actually I meant heap=h.heap

 If so, isn't that using all the gubbinz in Use, etc, anyway?

Depends on what you mean by 'using', but I would say no. 

  Less minor rant: this applies to most things to do with heapy... Having 
  __repr__ return the same as __str__ and having that be a long lump of 
  text is rather annoying. If you really must, make __str__ return the big 
  lump of text but have __repr__ return a simple, short, item containing 
  the class, the id, and maybe the number of contained objects...
  I thought it was cool to not have to use print but get the result
  directly at the prompt.
  That's fine, that's what __str__ is for. __repr__ should be short.
  
  No, it's the other way around: __repr__ is used when evaluating directly
  at the prompt.
 
 The docs give the idea:
 
 http://docs.python.org/reference/datamodel.html?highlight=__repr__#object.__repr__
 
I believe your big strings would be classed as "informal" and so would 
be computed by __str__.

Informal or not, they contain the information I thought was most useful,
and are created by __str__, but also by __repr__, because that is what is
used when evaluating at the prompt.

According to the doc you linked to above, __repr__ should preferably be
a Python expression that could be used to recreate it. I think this has
been discussed and criticized before and in general there is no way to
create such an expression. For example, for the result of h.heap(),
there is no expression that can recreate it later (since the heap
changes) and the object returned is just an IdentitySet, which doesn't
know how it was created.

It also gives as an alternative: "If this is not possible, a string of
the form <...some useful description...> should be returned"

The __repr__ I use doesn't have the enclosing <>, granted; maybe I missed
this, or it wasn't in the docs in 2005, or I didn't think it was important
(still don't), but was that really what the complaint was about?

The docs also say that it is important that the representation is
information-rich and unambiguous.

I thought it was more useful to actually get information about what was
contained in the object directly at the prompt than to try to show how to
recreate it, which wasn't possible anyway.

[snip]

The index (__getitem__) method was available so I
used it to take the subset of the i'th row in the partition defined by
its equivalence order.
 
 That should have another name... I don't know what a partition or 
 equivalence order are in the contexts you're using them, but I do know 
 that hijacking __getitem__ for this is wrong.

Opinions may differ, I'd say one can in principle never 'know' if such a
thing is 'right' or 'wrong', but that gets us into philosophical territory. 
Anyway...

To get a tutorial provided by someone who did not seem to share your

Re: [Guppy-pe-list] An iteration idiom (Was: Re: loading files containing multiple dumps)

2009-09-09 Thread Chris Withers

Sverker Nilsson wrote:

But I don't think I would want to risk breaking someone's code just for
this when we could just add a new method.


I don't think anyone will be relying on StopIteration being raised.
If you're worried, do the next release as a 0.10.0 release and explain 
the backwards incompatible change in the release announcement.



Or we could have an option to hpy() to redefine load() as loadall(), but
I think it is cleaner (and easier) to just define a new method...


-1 to options to hpy, +1 to loadall, but also -1 to leaving load() as 
broken as it is...



As the enclosing class or frame is deallocated, so is its attribute h
itself. 


Right, but as long as the h hangs around, it hangs on to all the memory 
it's used to build its stats, right? This caused me problems in my most 
recent use of guppy...



themselves, but I am talking about more severe data that can be hundreds
of megabytes or more).


Me too ;-) I've been profiling situations where the memory usage was 
over 1GB for processing a 30MB file when I started ;-)



For example, the setref() method sets a reference point somewhere in h.
Further calls to heap() would report only objects allocated after that
call. But you could use a new hpy() instance to see all objects again.

Multiple threads come to mind, where each thread would have its own
hpy() object. (Thread safety may still be a problem but at least it
should be improved by not sharing the hpy() structures.)

Even in the absence of multiple threads, you might have an outer
invocation of hpy() that is used for global analysis, with its specific
options, setref()'s etc, and inner invocations that make some local
analysis perhaps in a single method.


Fair points :-)


http://guppy-pe.sourceforge.net/heapy-thesis.pdf

I'm afraid, while I'd love to, I don't have the time to read a thesis...


But it is (an important) part of the documentation. 


That may be, but I'd wager a fair amount of beer that by far the most 
common uses for heapy are:


- finding out what's using the memory consumed by a python process

- logging what the memory consumption is made up of while running a 
large python process


- finding out how much memory is being used

...in that order. Usually on a very tight deadline and with unhappy 
users breathing down their necks. At times like that, reading a thesis 
doesn't really figure into it ;-)



I'm afraid, while I'd love to, I don't have the time to duplicate the
thesis here...;-)


I don't think that would help. Succinct help and easy to use functions 
to get those 3 cases above solved is all that's needed ;-)



Do you mean we should actually _remove_ features to create a new
standalone system?


Absolutely, why provide more than is used or needed?


You are free to wrap functions as you find suitable; a minimal wrapper
module could be just like this:

# Module heapyheap
from guppy import hpy
h=hpy()
heap=heap()


I don't follow this.. did you mean heap = h.heap()? If so, isn't that 
using all the gubbinz in Use, etc, anyway?


Less minor rant: this applies to most things to do with heapy... Having 
__repr__ return the same as __str__ and having that be a long lump of 
text is rather annoying. If you really must, make __str__ return the big 
lump of text but have __repr__ return a simple, short, item containing 
the class, the id, and maybe the number of contained objects...

I thought it was cool to not have to use print but get the result
directly at the prompt.

That's fine, that's what __str__ is for. __repr__ should be short.


No, it's the other way around: __repr__ is used when evaluating directly
at the prompt.


The docs give the idea:

http://docs.python.org/reference/datamodel.html?highlight=__repr__#object.__repr__

I believe your big strings would be classed as "informal" and so would 
be computed by __str__.



Yeah, but an item in a set is not a set. __getitem__ should return an 
item, not a subset...


Usually I think it is called an 'element' of a set rather than an
'item'. Python builtin sets can't even do indexing at all.


...'cos it doesn't make sense ;-)


Likewise, Heapy IdentitySet objects don't support indexing to get at the
elements directly. 


...then they shouldn't have a __getitem__ method!


The index (__getitem__) method was available so I
used it to take the subset of the i'th row in the partition defined by
its equivalence order.


That should have another name... I don't know what a partition or 
equivalence order are in the contexts you're using them, but I do know 
that hijacking __getitem__ for this is wrong.



The subset indexing, being the more well-defined operation, and also
IMHO more generally useful, thus got the honor to have the [] syntax.


Except it misleads anyone who's programmed in Python for a significant 
period of time and causes problems when combined with the bug in .load :-(



It would just be another syntax. I don't see the conceptual problem
since e.g. indexing works just fine like this with 

Re: [Guppy-pe-list] An iteration idiom (Was: Re: loading files containing multiple dumps)

2009-09-08 Thread Sverker Nilsson
On Mon, 2009-09-07 at 16:53 +0100, Chris Withers wrote:
 Sverker Nilsson wrote:
  I hope the new loadall method as I wrote about before will resolve this.
  
  def loadall(self,f):
      ''' Generates all objects from an open file f or a file named f'''
      if isinstance(f,basestring):
          f=open(f)
      while True:
          yield self.load(f)
 
 It would be great if load either returned just one result ever, or 
 properly implemented the iterator protocol, rather than half 
 implementing it...
 
Agreed, this is arguably a bug, or at least a misfeature; as Raymond
Hettinger also remarked, it is not normal for a regular function to raise
StopIteration.

But I don't think I would want to risk breaking someone's code just for
this when we could just add a new method.

  Should we call it loadall? It is a generator so it doesn't really load
  all immediately, just lazily. Maybe call it iload? Or redefine load,
  but that might break existing code so would not be good.
 
 loadall works for me, iload doesn't.
 

Or we could have an option to hpy() to redefine load() as loadall(), but
I think it is cleaner (and easier) to just define a new method...

Settled then? :-)
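
Usage would then be something like this sketch (the file name is
hypothetical, and I assume loadall ends up next to load on the hpy()
instance):

from guppy import hpy

h = hpy()
for stat in h.loadall('myprocess.hpy'):
    print stat    # one loaded dump at a time, lazily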

  Minor rant, why do I have to instantiate a
  class 'guppy.heapy.Use._GLUECLAMP_'
  to do anything with heapy?
  Why doesn't heapy just expose load, dump, etc?
  
  Basically, the need for the h=hpy() idiom is to avoid any global
  variables. 
 
 Eh? What's h then? (And h will reference whatever globals you were 
 worried about, surely?)

h is what you make it to be in the context you create it; you can make
it either a global variable, a local variable, or an object attribute.

Interactively, I guess one tends to have it as a global variable, yes.
But it is a global variable you created and are responsible for yourself,
and there are no other global variables behind the scenes but the ones you
create yourself (also possibly the results of heap() etc as you store
them in your environment).

If making test programs, I would not use global variables but instead
would tend to have h as a class attribute in a test class, eg as in
UnitTest. It could also be a local variable in a test function.

As the enclosing class or frame is deallocated, so is its attribute h
itself. There should be nothing that stays allocated in other modules
after one test (class) is done (other than some loaded modules
themselves, but I am talking about more severe data that can be hundreds
of megabytes or more).
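
For example, a hypothetical unittest sketch (assuming guppy is
installed; .size is the total size in bytes of the set):

import unittest
from guppy import hpy

class MemoryTest(unittest.TestCase):
    h = hpy()   # class attribute: deallocated along with the class

    def test_heap_sees_new_objects(self):
        before = self.h.heap().size
        self.data = [object() for i in range(1000)]  # keep them alive
        self.assertTrue(self.h.heap().size > before)

if __name__ == '__main__':
    unittest.main()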

  Heapy uses some rather big internal data structures, to cache
  such things as dict ownership. I didn't want to have all those things in
  global variables. 
 
 What about attributes of a class instance of some sort then?

They are already attributes of an instance: hpy() is a convenience
factory method that creates a top level instance for this purpose.

  the other objects you created. Also, it allows for several parallel
  invocations of Heapy.
 
 When is that helpful?

For example, the setref() method sets a reference point somewhere in h.
Further calls to heap() would report only objects allocated after that
call. But you could use a new hpy() instance to see all objects again.

Multiple threads come to mind, where each thread would have its own
hpy() object. (Thread safety may still be a problem but at least it
should be improved by not sharing the hpy() structures.)

Even in the absence of multiple threads, you might have an outer
invocation of hpy() that is used for global analysis, with its specific
options, setref()'s etc, and inner invocations that make some local
analysis perhaps in a single method.
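
A sketch of that structure (the function name is hypothetical):

from guppy import hpy

h_outer = hpy()        # for global analysis, with its own options/setref()
h_outer.setref()

def local_analysis():
    h = hpy()          # fresh instance, unaffected by h_outer.setref()
    return h.heap()    # sees all objects, not just post-reference ones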

  However, I am aware of the extra initial overhead to do h=hpy(). I
  discussed this in my thesis, Section 4.7.8, "Why not importing Use
  directly?", page 36, 
  
  http://guppy-pe.sourceforge.net/heapy-thesis.pdf
 
 I'm afraid, while I'd love to, I don't have the time to read a thesis...

But it is (an important) part of the documentation. For example, it
contains the rationale and an introduction to the main categories such
as Sets, Kinds and EquivalenceRelations, and some use cases, for example
how to seal a memory leak in a windowing program.

I'm afraid, while I'd love to, I don't have the time to duplicate the
thesis here...;-)

  Try sunglasses:) (Well, I am aware of this, it was a
  research/experimental system and could have some refactoring :-)
 
 I would suggest creating a minimal system that allows you to do heap() 
 and then let other people build what they need from there. Simple is 
 *always* better...

Do you mean we should actually _remove_ features to create a new
standalone system?

I don't think that'd be meaningful.
You don't need to use anything else than heap() if you don't want to.

You are free to wrap functions as you find suitable; a minimal wrapper
module could be just like this:

# Module heapyheap
from guppy import hpy
h=hpy()
heap=heap()

Should we add some such module? In the thesis I discussed this already
and argued it was not 

Re: [Guppy-pe-list] An iteration idiom (Was: Re: loading files containing multiple dumps)

2009-09-07 Thread Chris Withers

Sverker Nilsson wrote:

I hope the new loadall method as I wrote about before will resolve this.

def loadall(self,f):
    ''' Generates all objects from an open file f or a file named f'''
    if isinstance(f,basestring):
        f=open(f)
    while True:
        yield self.load(f)


It would be great if load either returned just one result ever, or 
properly implemented the iterator protocol, rather than half 
implementing it...



Should we call it loadall? It is a generator so it doesn't really load
all immediately, just lazily. Maybe call it iload? Or redefine load,
but that might break existing code so would not be good.


loadall works for me, iload doesn't.


Minor rant, why do I have to instantiate a
class 'guppy.heapy.Use._GLUECLAMP_'
to do anything with heapy?
Why doesn't heapy just expose load, dump, etc?


Basically, the need for the h=hpy() idiom is to avoid any global
variables. 


Eh? What's h then? (And h will reference whatever globals you were 
worried about, surely?)



Heapy uses some rather big internal data structures, to cache
such things as dict ownership. I didn't want to have all those things in
global variables. 


What about attributes of a class instance of some sort then?


the other objects you created. Also, it allows for several parallel
invocations of Heapy.


When is that helpful?


However, I am aware of the extra initial overhead to do h=hpy(). I
discussed this in my thesis, Section 4.7.8, "Why not importing Use
directly?", page 36, 


http://guppy-pe.sourceforge.net/heapy-thesis.pdf


I'm afraid, while I'd love to, I don't have the time to read a thesis...


Try sunglasses:) (Well, I am aware of this, it was a
research/experimental system and could have some refactoring :-)


I would suggest creating a minimal system that allows you to do heap() 
and then let other people build what they need from there. Simple is 
*always* better...


Less minor rant: this applies to most things to do with heapy... Having 
__repr__ return the same as __str__ and having that be a long lump of 
text is rather annoying. If you really must, make __str__ return the big 
lump of text but have __repr__ return a simple, short, item containing 
the class, the id, and maybe the number of contained objects...


I thought it was cool to not have to use print but get the result
directly at the prompt.


That's fine, that's what __str__ is for. __repr__ should be short.

Hmmm, I'm sure there's a good reason why an item in a set has the exact 
same class and interface as a whole set?


Um, perhaps no very good reason but... a subset of a set is still a set,
isn't it?


Yeah, but an item in a set is not a set. __getitem__ should return an 
item, not a subset...


I really think that, by the sounds of it, what is currently implemented 
as __getitem__ should be a `filter` or `subset` method on IdentitySets 
instead...
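
To be concrete about the current behaviour (a sketch, assuming guppy is
installed):

from guppy import hpy

hp = hpy().heap()
row = hp[0]                   # not an element: the subset that is row 0
print type(row) is type(hp)   # True: same IdentitySet class either way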



objects. Each row is still an IdentitySet, and has the same attributes.


Why? It's semantically different. .load() returns a set of measurements, 
each measurement contains a set of something else, but I don't know what...



This is also like Python strings work, there is no special character
type, a character is just a string of length 1.


Strings are *way* more simple in terms of what they are though...

cheers,

Chris

--
Simplistix - Content Management, Batch Processing & Python Consulting
   - http://www.simplistix.co.uk
--
http://mail.python.org/mailman/listinfo/python-list