Re: [Guppy-pe-list] An iteration idiom (Was: Re: loading files containing multiple dumps)
Sverker Nilsson wrote: If you just use heap(), and only want total memory not relative to a reference point, you can just use hpy() directly. So rather than:

CASE 1:

    h = hpy()
    h.heap().dump(...)
    # other code, the data internal to h is still around
    h.heap().dump(...)

you'd do:

CASE 2:

    hpy().heap().dump(...)
    # other code. No data from Heapy is hanging around
    hpy().heap().dump(...)

The difference is that in case 1, the second call to heap() could reuse the internal data in h, But that internal data would have to hang around, right? (which might, in itself, cause memory problems?) whereas in case 2, it would have to be recreated, which would take a longer time. (The data would be such things as the dictionary owner map.) How long is longer? Do you have any metrics that would help make good decisions about when to keep a hpy() instance around and when it's best to save memory? Do you mean we should actually _remove_ features to create a new standalone system? Absolutely, why provide more than is used or needed? How should we understand this? Should we have to support 2 or more systems depending on what functionality you happen to need? Or do you mean most functionality is actually _never_ used by _anybody_ (and will not be in the future)? That would be quite gross, wouldn't it. I'm saying have one project and dump all the excess stuff that no-one but you uses ;-) Or, maybe easier, have a core, separate package that just has the essentials in a simple, clean fashion, and then another package that builds on this to add all the other stuff... It also gives as an alternative: If this is not possible, a string of the form <...some useful description...> should be returned. The __repr__ I use doesn't have the enclosing <>; granted, maybe I missed this, or it wasn't in the docs in 2005, or I didn't think it was important (still don't), but was that really what the complaint was about? No, it was about the fact that when I do repr(something_from_heapy) I get a shedload of text.
I thought it was more useful to actually get information of what was contained in the object directly at the prompt, than try to show how to recreate it, which wasn't possible anyway. Agreed, but I think the stuff you currently have in __repr__ would be better placed in its own method:

    >>> heap()
    <IdentitySet object at 0x... containing 10 items>
    >>> _.show()
    ... all the current __repr__ output

That should have another name... I don't know what a partition or equivalence order are in the contexts you're using them, but I do know that hijacking __getitem__ for this is wrong. Opinions may differ; I'd say one can in principle never 'know' if such a thing is 'right' or 'wrong', but that gets us into philosophical territory. Anyway... I would bet that if you asked 100 experienced Python programmers, most of them would tell you that what you're doing with __getitem__ is wrong; some might even say evil ;-) To get a tutorial provided by someone who did not seem to share your conviction about indexing, but seemed to regard the way Heapy does it as natural (although he has other valid complaints, and it is somewhat outdated, i.e. wrt 64 bit), see: http://www.pkgcore.org/trac/pkgcore/doc/dev-notes/heapy.rst This link has become broken recently, but I don't remember reading the author's comments as liking the indexing stuff... Chris -- Simplistix - Content Management, Batch Processing Python Consulting - http://www.simplistix.co.uk -- http://mail.python.org/mailman/listinfo/python-list
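The CASE 1 / CASE 2 trade-off above can be sketched in miniature. FakeProfiler below is a purely illustrative stand-in for hpy(), not guppy's real implementation; it only shows why a kept instance answers faster the second time while holding its cache in memory:

```python
# Illustrative stand-in for hpy() (hypothetical class, not guppy's API):
# an analyzer that lazily builds and caches an expensive "owner map".
class FakeProfiler:
    def __init__(self):
        self._owner_map = None          # built on first use, then cached

    def heap(self):
        if self._owner_map is None:     # CASE 2 pays this cost on every call
            self._owner_map = {i: object() for i in range(1000)}
        return len(self._owner_map)     # stands in for a real measurement

# CASE 1: keep h around; the second heap() reuses the cached map,
# but the map itself stays in memory between calls.
h = FakeProfiler()
h.heap()
h.heap()

# CASE 2: a fresh instance each time; nothing lingers afterwards,
# but the expensive map is rebuilt on every call.
FakeProfiler().heap()
```

The same structure also shows why measurements relative to a reference point (setref()) force CASE 1: the reference data has to live somewhere between calls.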
Re: [Guppy-pe-list] An iteration idiom (Was: Re: loading files containing multiple dumps)
Chris Withers wrote: Sverker Nilsson wrote: The __repr__ I use doesn't have the enclosing <>; granted, maybe I missed this, or it wasn't in the docs in 2005, or I didn't think it was important (still don't), but was that really what the complaint was about? No, it was about the fact that when I do repr(something_from_heapy) I get a shedload of text. I thought it was more useful to actually get information of what was contained in the object directly at the prompt, than try to show how to recreate it, which wasn't possible anyway. Agreed, but I think the stuff you currently have in __repr__ would be better placed in its own method:

    >>> heap()
    <IdentitySet object at 0x... containing 10 items>

For what it's worth, the container class I wrote recently to hold dbf rows is along the lines of Chris' suggestion; output is similar to this: DbfList(97 records) or, if a description was provided at list creation time: DbfList(State of Oregon - 97 records) basically, a short description of what's in the container, instead of 97 screens of gibberish (even useful information is gibberish after 97 screenfuls of it!-) ~Ethan~
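Ethan's output can be reproduced with a few lines; this reimplementation is only a sketch of the convention he describes (short __repr__, verbose __str__), not his actual DbfList class:

```python
# Sketch of a container with a short __repr__ and a verbose __str__
# (illustrative reimplementation, not Ethan's real DbfList).
class DbfList:
    def __init__(self, records, description=None):
        self.records = list(records)
        self.description = description

    def __repr__(self):
        # short: this is what the interactive prompt displays
        n = len(self.records)
        if self.description:
            return "DbfList(%s - %s records)" % (self.description, n)
        return "DbfList(%s records)" % n

    def __str__(self):
        # the 97 screens of detail live here, behind an explicit print()
        return "\n".join(repr(r) for r in self.records)
```

With this split, evaluating the object at the prompt gives the one-line summary, while print(obj) still produces the full listing.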
Re: [Guppy-pe-list] An iteration idiom (Was: Re: loading files containing multiple dumps)
On Wed, 2009-09-09 at 13:47 +0100, Chris Withers wrote: Sverker Nilsson wrote: As the enclosing class or frame is deallocated, so is its attribute h itself. Right, but as long as the h hangs around, it hangs on to all the memory it's used to build its stats, right? This caused me problems in my most recent use of guppy... If you just use heap(), and only want total memory not relative to a reference point, you can just use hpy() directly. So rather than:

CASE 1:

    h = hpy()
    h.heap().dump(...)
    # other code, the data internal to h is still around
    h.heap().dump(...)

you'd do:

CASE 2:

    hpy().heap().dump(...)
    # other code. No data from Heapy is hanging around
    hpy().heap().dump(...)

The difference is that in case 1, the second call to heap() could reuse the internal data in h, whereas in case 2, it would have to be recreated, which would take a longer time. (The data would be such things as the dictionary owner map.) However, if you measure memory relative to a reference point, you would have to keep h around, as in case 1. [snip] Do you mean we should actually _remove_ features to create a new standalone system? Absolutely, why provide more than is used or needed? How should we understand this? Should we have to support 2 or more systems depending on what functionality you happen to need? Or do you mean most functionality is actually _never_ used by _anybody_ (and will not be in the future)? That would be quite gross, wouldn't it. I'd be hard pressed to support several versions just so that some of them would have only the most common methods used in certain situations. That would be like creating an additional Python dialect that contained, say, only the 10% of functionality that is used 90% of the time. Quite naturally, this is not going to be done anytime soon. Even though one could perhaps argue it would be easier to use for children etc., the extra work to support this has not been deemed meaningful.
You are free to wrap functions as you find suitable; a minimal wrapper module could be just like this:

    # Module heapyheap
    from guppy import hpy
    h=hpy()
    heap=heap()

I don't follow this.. did you mean heap = h.heap()? Actually I meant heap=h.heap If so, isn't that using all the gubbinz in Use, etc, anyway? Depends on what you mean by 'using', but I would say no. Less minor rant: this applies to most things to do with heapy... Having __repr__ return the same as __str__ and having that be a long lump of text is rather annoying. If you really must, make __str__ return the big lump of text but have __repr__ return a simple, short item containing the class, the id, and maybe the number of contained objects... I thought it was cool to not have to use print but get the result directly at the prompt. That's fine, that's what __str__ is for. __repr__ should be short. No, it's the other way around: __repr__ is used when evaluating directly at the prompt. The docs give the idea: http://docs.python.org/reference/datamodel.html?highlight=__repr__#object.__repr__ I believe your big strings would be classed as informal and so would be computed by __str__. Informal or not, they contain the information I thought was most useful, and they are created by __str__, but also by __repr__, because that is what is used when evaluating at the prompt. According to the doc you linked to above, __repr__ should preferably be a Python expression that could be used to recreate the object. I think this has been discussed and criticized before, and in general there is no way to create such an expression. For example, for the result of h.heap(), there is no expression that can recreate it later (since the heap changes), and the object returned is just an IdentitySet, which doesn't know how it was created. It also gives as an alternative: If this is not possible, a string of the form <...some useful description...>
should be returned. The __repr__ I use doesn't have the enclosing <>; granted, maybe I missed this, or it wasn't in the docs in 2005, or I didn't think it was important (still don't), but was that really what the complaint was about? The docs also say that it is important that the representation is information-rich and unambiguous. I thought it was more useful to actually get information of what was contained in the object directly at the prompt, than try to show how to recreate it, which wasn't possible anyway. [snip] The index (__getitem__) method was available so I used it to take the subset of the i'th row in the partition defined by its equivalence order. That should have another name... I don't know what a partition or equivalence order are in the contexts you're using them, but I do know that hijacking __getitem__ for this is wrong. Opinions may differ, I'd say one can in principle never 'know' if such a thing is 'right' or 'wrong', but that gets us into philosophical territory. Anyway... To get a tutorial provided by someone who did not seem to share your conviction about indexing, but seemed to regard the way Heapy does it as natural, see: http://www.pkgcore.org/trac/pkgcore/doc/dev-notes/heapy.rst
Re: [Guppy-pe-list] An iteration idiom (Was: Re: loading files containing multiple dumps)
Sverker Nilsson wrote: But I don't think I would want to risk breaking someone's code just for this when we could just add a new method. I don't think anyone will be relying on StopIteration being raised. If you're worried, do the next release as a 0.10.0 release and explain the backwards-incompatible change in the release announcement. Or we could have an option to hpy() to redefine load() as loadall(), but I think it is cleaner (and easier) to just define a new method... -1 to options to hpy, +1 to loadall, but also -1 to leaving load() as broken as it is... As the enclosing class or frame is deallocated, so is its attribute h itself. Right, but as long as the h hangs around, it hangs on to all the memory it's used to build its stats, right? This caused me problems in my most recent use of guppy... themselves, but I am talking about more severe data that can be hundreds of megabytes or more). Me too ;-) I've been profiling situations where the memory usage was over 1GB for processing a 30MB file when I started ;-) For example, the setref() method sets a reference point somewhere in h. Further calls to heap() would report only objects allocated after that call. But you could use a new hpy() instance to see all objects again. Multiple threads come to mind, where each thread would have its own hpy() object. (Thread safety may still be a problem, but at least it should be improved by not sharing the hpy() structures.) Even in the absence of multiple threads, you might have an outer invocation of hpy() that is used for global analysis, with its specific options, setref()'s etc., and inner invocations that make some local analysis, perhaps in a single method. Fair points :-) http://guppy-pe.sourceforge.net/heapy-thesis.pdf I'm afraid, while I'd love to, I don't have the time to read a thesis... But it is (an important) part of the documentation.
That may be, but I'd wager a fair amount of beer that by far the most common uses for heapy are:

- finding out what's using the memory consumed by a python process
- logging what the memory consumption is made up of while running a large python process
- finding out how much memory is being used

...in that order. Usually on a very tight deadline and with unhappy users breathing down their necks. At times like that, reading a thesis doesn't really figure into it ;-) I'm afraid, while I'd love to, I don't have the time to duplicate the thesis here...;-) I don't think that would help. Succinct help and easy-to-use functions to get those 3 cases above solved is all that's needed ;-) Do you mean we should actually _remove_ features to create a new standalone system? Absolutely, why provide more than is used or needed? You are free to wrap functions as you find suitable; a minimal wrapper module could be just like this:

    # Module heapyheap
    from guppy import hpy
    h=hpy()
    heap=heap()

I don't follow this.. did you mean heap = h.heap()? If so, isn't that using all the gubbinz in Use, etc, anyway? Less minor rant: this applies to most things to do with heapy... Having __repr__ return the same as __str__ and having that be a long lump of text is rather annoying. If you really must, make __str__ return the big lump of text but have __repr__ return a simple, short item containing the class, the id, and maybe the number of contained objects... I thought it was cool to not have to use print but get the result directly at the prompt. That's fine, that's what __str__ is for. __repr__ should be short. No, it's the other way around: __repr__ is used when evaluating directly at the prompt. The docs give the idea: http://docs.python.org/reference/datamodel.html?highlight=__repr__#object.__repr__ I believe your big strings would be classed as informal and so would be computed by __str__. Yeah, but an item in a set is not a set. __getitem__ should return an item, not a subset...
Usually I think it is called an 'element' of a set rather than an 'item'. Python builtin sets can't even do indexing at all. ...'cos it doesn't make sense ;-) Likewise, Heapy IdentitySet objects don't support indexing to get at the elements directly. ...then they shouldn't have a __getitem__ method! The index (__getitem__) method was available so I used it to take the subset of the i'th row in the partition defined by its equivalence order. That should have another name... I don't know what a partition or equivalence order are in the contexts you're using them, but I do know that hijacking __getitem__ for this is wrong. The subset indexing, being the more well-defined operation, and also IMHO the more generally useful, thus got the honor of having the [] syntax. Except it misleads anyone who's programmed in Python for a significant period of time and causes problems when combined with the bug in .load :-( It would just be another syntax. I don't see the conceptual problem, since e.g. indexing works just fine like this with
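Chris's suggestion of a named method instead of __getitem__ could look something like the sketch below. IdentitySetSketch, partition and subset are hypothetical names chosen for illustration, not guppy's real API, and the "equivalence order" is modelled as a plain key function:

```python
# Hypothetical sketch: expose partition rows through named methods
# rather than hijacking __getitem__ (all names here are illustrative).
class IdentitySetSketch:
    def __init__(self, elements):
        self.elements = list(elements)

    def partition(self, key):
        """Split the set into subsets by an equivalence relation (key)."""
        rows = {}
        for e in self.elements:
            rows.setdefault(key(e), []).append(e)
        return [IdentitySetSketch(v) for v in rows.values()]

    def subset(self, i, key):
        """Named replacement for the old s[i]: the i'th row of the partition."""
        return self.partition(key)[i]
```

With this shape, s.subset(0, type) reads as "the first equivalence class", and there is no risk of it being mistaken for element access the way s[0] is.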
Re: [Guppy-pe-list] An iteration idiom (Was: Re: loading files containing multiple dumps)
On Mon, 2009-09-07 at 16:53 +0100, Chris Withers wrote: Sverker Nilsson wrote: I hope the new loadall method as I wrote about before will resolve this.

    def loadall(self, f):
        '''Generates all objects from an open file f or a file named f'''
        if isinstance(f, basestring):
            f = open(f)
        while True:
            yield self.load(f)

It would be great if load either returned just one result ever, or properly implemented the iterator protocol, rather than half implementing it... Agreed, this is arguably a bug or at least a misfeature; as Raymond Hettinger also remarked, it is not normal for an ordinary function to raise StopIteration. But I don't think I would want to risk breaking someone's code just for this when we could just add a new method. Should we call it loadall? It is a generator, so it doesn't really load all immediately, just lazily. Maybe call it iload? Or redefine load, but that might break existing code, so would not be good. loadall works for me, iload doesn't. Or we could have an option to hpy() to redefine load() as loadall(), but I think it is cleaner (and easier) to just define a new method... Settled then? :-) Minor rant, why do I have to instantiate a class 'guppy.heapy.Use._GLUECLAMP_' to do anything with heapy? Why doesn't heapy just expose load, dump, etc? Basically, the need for the h=hpy() idiom is to avoid any global variables. Eh? What's h then? (And h will reference whatever globals you were worried about, surely?) h is what you make it to be in the context you create it; you can make it either a global variable, a local variable, or an object attribute. Interactively, I guess one tends to have it as a global variable, yes. But it is a global variable you created and are responsible for yourself, and there are no other global variables behind the scenes but the ones you create yourself (also possibly the results of heap() etc. as you store them in your environment).
If making test programs, I would not use global variables but instead would tend to have h as a class attribute in a test class, e.g. as in unittest. It could also be a local variable in a test function. As the enclosing class or frame is deallocated, so is its attribute h itself. There should be nothing that stays allocated in other modules after one test (class) is done (other than some loaded modules themselves, but I am talking about more severe data that can be hundreds of megabytes or more). Heapy uses some rather big internal data structures, to cache such things as dict ownership. I didn't want to have all those things in global variables. What about attributes of a class instance of some sort then? They are already attributes of an instance: hpy() is a convenience factory method that creates a top-level instance for this purpose, keeping Heapy's data separate from the other objects you created. Also, it allows for several parallel invocations of Heapy. When is that helpful? For example, the setref() method sets a reference point somewhere in h. Further calls to heap() would report only objects allocated after that call. But you could use a new hpy() instance to see all objects again. Multiple threads come to mind, where each thread would have its own hpy() object. (Thread safety may still be a problem, but at least it should be improved by not sharing the hpy() structures.) Even in the absence of multiple threads, you might have an outer invocation of hpy() that is used for global analysis, with its specific options, setref()'s etc., and inner invocations that make some local analysis, perhaps in a single method. However, I am aware of the extra initial overhead to do h=hpy(). I discussed this in my thesis. Section 4.7.8 Why not importing Use directly? page 36, http://guppy-pe.sourceforge.net/heapy-thesis.pdf I'm afraid, while I'd love to, I don't have the time to read a thesis... But it is (an important) part of the documentation.
For example it contains the rationale and an introduction to the main categories such as Sets, Kinds and EquivalenceRelations, and some use cases, for example how to seal a memory leak in a windowing program. I'm afraid, while I'd love to, I don't have the time to duplicate the thesis here...;-) Try sunglasses:) (Well, I am aware of this, it was a research/experimental system and could use some refactoring :-) I would suggest creating a minimal system that allows you to do heap() and then let other people build what they need from there. Simple is *always* better... Do you mean we should actually _remove_ features to create a new standalone system? I don't think that'd be meaningful. You don't need to use anything else than heap() if you don't want to. You are free to wrap functions as you find suitable; a minimal wrapper module could be just like this:

    # Module heapyheap
    from guppy import hpy
    h=hpy()
    heap=heap()

Should we add some such module? In the thesis I discussed this already and argued it was not
Re: [Guppy-pe-list] An iteration idiom (Was: Re: loading files containing multiple dumps)
Sverker Nilsson wrote: I hope the new loadall method as I wrote about before will resolve this.

    def loadall(self, f):
        '''Generates all objects from an open file f or a file named f'''
        if isinstance(f, basestring):
            f = open(f)
        while True:
            yield self.load(f)

It would be great if load either returned just one result ever, or properly implemented the iterator protocol, rather than half implementing it... Should we call it loadall? It is a generator, so it doesn't really load all immediately, just lazily. Maybe call it iload? Or redefine load, but that might break existing code, so would not be good. loadall works for me, iload doesn't. Minor rant, why do I have to instantiate a class 'guppy.heapy.Use._GLUECLAMP_' to do anything with heapy? Why doesn't heapy just expose load, dump, etc? Basically, the need for the h=hpy() idiom is to avoid any global variables. Eh? What's h then? (And h will reference whatever globals you were worried about, surely?) Heapy uses some rather big internal data structures, to cache such things as dict ownership. I didn't want to have all those things in global variables. What about attributes of a class instance of some sort then? They are already attributes of an instance, kept separate from the other objects you created. Also, it allows for several parallel invocations of Heapy. When is that helpful? However, I am aware of the extra initial overhead to do h=hpy(). I discussed this in my thesis. Section 4.7.8 Why not importing Use directly? page 36, http://guppy-pe.sourceforge.net/heapy-thesis.pdf I'm afraid, while I'd love to, I don't have the time to read a thesis... Try sunglasses:) (Well, I am aware of this, it was a research/experimental system and could use some refactoring :-) I would suggest creating a minimal system that allows you to do heap() and then let other people build what they need from there. Simple is *always* better... Less minor rant: this applies to most things to do with heapy... Having __repr__ return the same as __str__ and having that be a long lump of text is rather annoying.
If you really must, make __str__ return the big lump of text but have __repr__ return a simple, short item containing the class, the id, and maybe the number of contained objects... I thought it was cool to not have to use print but get the result directly at the prompt. That's fine, that's what __str__ is for. __repr__ should be short. Hmmm, I'm sure there's a good reason why an item in a set has the exact same class and interface as a whole set? Um, perhaps no very good reason but... a subset of a set is still a set, isn't it? Yeah, but an item in a set is not a set. __getitem__ should return an item, not a subset... I really think that, by the sounds of it, what is currently implemented as __getitem__ should be a `filter` or `subset` method on IdentitySets instead... objects. Each row is still an IdentitySet, and has the same attributes. Why? It's semantically different. .load() returns a set of measurements, each measurement contains a set of something else, but I don't know what... This is also how Python strings work: there is no special character type; a character is just a string of length 1. Strings are *way* more simple in terms of what they are, though... cheers, Chris -- Simplistix - Content Management, Batch Processing Python Consulting - http://www.simplistix.co.uk
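The disagreement about which method the prompt uses is easy to settle with a toy class (Demo below is purely illustrative): the interactive prompt displays __repr__, while print() goes through __str__, so a short __repr__ and a verbose __str__ give both behaviours at once.

```python
# Toy class showing which method the interactive prompt uses.
class Demo:
    def __repr__(self):
        # short: shown when the bare object is evaluated at the prompt
        return "<Demo: 3 items>"

    def __str__(self):
        # verbose: shown by print(obj) / str(obj)
        return "item 1\nitem 2\nitem 3"

# At the prompt:
#   >>> Demo()        -> <Demo: 3 items>
#   >>> print(Demo()) -> the three-line listing
```

This is the behaviour Chris is asking heapy's IdentitySet to adopt: keep the detailed report reachable (via __str__ or a named method like show()), but let bare evaluation stay terse.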